Hi,
I have a feeling that the pg repair didn't actually run yet. Sometimes
if the OSDs are busy scrubbing, the repair doesn't start when you ask
it to.
You can force it through with something like:
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph config set osd_max_scrubs 3
ceph pg repair <the pg>
ceph status # and check that the repair really started
ceph config set osd_max_scrubs 1
ceph osd unset nodeep-scrub
ceph osd unset noscrub
Once repair runs/completes, it will rewrite the inconsistent object
replica (to a new place on the disk). Check your ceph.log to see when
this happens.
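For example, on a mon host (assuming the default cluster log path):
grep repair /var/log/ceph/ceph.log   # look for "<the pg> repair starts" / "repair ok, N fixed"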
From my experience, the PendingSectors counter will not be decremented
until that sector is written again (which will happen at some random
point in the future when bluestore allocates some new data there).
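You can watch that counter with smartctl, e.g.:
smartctl -A /dev/sdX   # check Current_Pending_Sector and Reallocated_Sector_Ct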
Hope that helps,
Dan
On Mon, Mar 30, 2020 at 9:00 AM David Herselman <dhe(a)syrex.co> wrote:
>
> Hi,
>
> We have a single inconsistent placement group. I subsequently triggered a deep scrub and tried a 'pg repair', but the placement group remains in an inconsistent state.
>
> How do I discard the objects for this placement group on the one OSD only and get Ceph to essentially write the data out anew? Drives will only mark a sector as remapped when asked to overwrite the problematic sector, or when repeated reads of the failed sector eventually succeed (this is my limited understanding).
>
> I could not decipher anything useful in the 'ceph pg 1.35 query' output. I then ran 'ceph pg deep-scrub 1.35', after which 'rados list-inconsistent-obj 1.35' indicated a read error on one of the copies:
> {"epoch":25776,"inconsistents":[{"object":{"name":"rbd_data.746f3c94fb3a42.000000000001e48d","nspace":"","locator":"","snap":"head","version":34866184},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":{"oid":{"oid":"rbd_data.746f3c94fb3a42.000000000001e48d","key":"","snapid":-2,"hash":3814100149,"max":0,"pool":1,"namespace":""},"version":"22845'1781037","prior_version":"22641'1771494","last_reqid":"client.136837683.0:124047","user_version":34866184,"size":4194304,"mtime":"2020-03-08 17:59:00.159846","local_mtime":"2020-03-08 17:59:00.159670","lost":0,"flags":["dirty","data_digest","omap_digest"],"truncate_seq":0,"truncate_size":0,"data_digest":"0x031cb17c","omap_digest":"0xffffffff","expected_object_size":4194304,"expected_write_size":4194304,"alloc_hint_flags":0,"manifest":{"type":0},"watchers":{}},"shards":[{"osd":51,"primary":false,"errors":["read_error"],"size":4194304},{"osd":60,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_dig
> est":"0x031cb17c"},{"osd":82,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x031cb17c"}]}]}
>
> /var/log/syslog:
> Mar 30 08:40:40 kvm1e kernel: [74792.229021] ata2.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x0
> Mar 30 08:40:40 kvm1e kernel: [74792.230416] ata2.00: irq_stat 0x40000008
> Mar 30 08:40:40 kvm1e kernel: [74792.231715] ata2.00: failed command: READ FPDMA QUEUED
> Mar 30 08:40:40 kvm1e kernel: [74792.233071] ata2.00: cmd 60/00:08:00:7a:50/04:00:c9:00:00/40 tag 1 ncq dma 524288 in
> Mar 30 08:40:40 kvm1e kernel: [74792.233071] res 43/40:00:10:7b:50/00:04:c9:00:00/00 Emask 0x409 (media error) <F>
> Mar 30 08:40:40 kvm1e kernel: [74792.235736] ata2.00: status: { DRDY SENSE ERR }
> Mar 30 08:40:40 kvm1e kernel: [74792.237045] ata2.00: error: { UNC }
> Mar 30 08:40:40 kvm1e ceph-osd[450777]: 2020-03-30 08:40:40.240 7f48a41f3700 -1 bluestore(/var/lib/ceph/osd/ceph-51) _do_read bdev-read failed: (5) Input/output error
> Mar 30 08:40:40 kvm1e kernel: [74792.244914] ata2.00: configured for UDMA/133
> Mar 30 08:40:40 kvm1e kernel: [74792.244938] sd 1:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Mar 30 08:40:40 kvm1e kernel: [74792.244942] sd 1:0:0:0: [sdb] tag#1 Sense Key : Medium Error [current]
> Mar 30 08:40:40 kvm1e kernel: [74792.244945] sd 1:0:0:0: [sdb] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
> Mar 30 08:40:40 kvm1e kernel: [74792.244949] sd 1:0:0:0: [sdb] tag#1 CDB: Read(16) 88 00 00 00 00 00 c9 50 7a 00 00 00 04 00 00 00
> Mar 30 08:40:40 kvm1e kernel: [74792.244953] blk_update_request: I/O error, dev sdb, sector 3377494800 op 0x0:(READ) flags 0x0 phys_seg 94 prio class 0
> Mar 30 08:40:40 kvm1e kernel: [74792.246238] ata2: EH complete
>
>
> Regards
> David Herselman
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi folks,
I have a three-node cluster on a 10G network with very little traffic. I have a six-OSD flash-only pool with two devices — a 1TB NVMe drive and a 256GB SATA SSD — on each node, and here’s how it benchmarks:
Oof. How can I troubleshoot this? Anthony mentioned that I might be able to run more than one OSD on the NVMe — how is that done, and can I do it “on the fly” with the system already up and running like this? And, will more OSDs give me better IOPS?
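From a quick search it looks like ceph-volume can do this at deployment time, e.g. something like:
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1   # device name is just an example
but I don't know whether that can be applied to a drive that already hosts an OSD without redeploying it.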
Thanks,
Jarett
Some time ago I made a surprising observation. I reorganised a directory structure and needed to move a folder one level up with a command like
mv A/B/ B
B contained something like 9 TB in very large files. To my surprise, this command didn't return for a couple of minutes, so I started to look at what was going on. What I discovered was that the mv command actually performed a full copy followed by a remove. I had to wait several hours for the move to complete.
I tried to reproduce this today to collect further information. However, the behaviour seems not to be reproducible: no matter what I try, mv completes almost instantly.
I was running the original mv on mimic 13.2.2 and retried now with mimic 13.2.8. In addition, there was an OS upgrade from Centos 7.6 to 7.7. I'm using the kernel-ml versions (5.xxx). Only one cephfs mount was present at all times.
My questions are:
1) Was there a change from 13.2.2 to 13.2.8 explaining this?
2) Are there (rare) conditions under which an mv on cephfs becomes a cp+rm?
3) Am I seeing ghosts?
Thanks for clues and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
We have a single inconsistent placement group. I subsequently triggered a deep scrub and tried a 'pg repair', but the placement group remains in an inconsistent state.
How do I discard the objects for this placement group on the one OSD only and get Ceph to essentially write the data out anew? Drives will only mark a sector as remapped when asked to overwrite the problematic sector, or when repeated reads of the failed sector eventually succeed (this is my limited understanding).
I could not decipher anything useful in the 'ceph pg 1.35 query' output. I then ran 'ceph pg deep-scrub 1.35', after which 'rados list-inconsistent-obj 1.35' indicated a read error on one of the copies:
{"epoch":25776,"inconsistents":[{"object":{"name":"rbd_data.746f3c94fb3a42.000000000001e48d","nspace":"","locator":"","snap":"head","version":34866184},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":{"oid":{"oid":"rbd_data.746f3c94fb3a42.000000000001e48d","key":"","snapid":-2,"hash":3814100149,"max":0,"pool":1,"namespace":""},"version":"22845'1781037","prior_version":"22641'1771494","last_reqid":"client.136837683.0:124047","user_version":34866184,"size":4194304,"mtime":"2020-03-08 17:59:00.159846","local_mtime":"2020-03-08 17:59:00.159670","lost":0,"flags":["dirty","data_digest","omap_digest"],"truncate_seq":0,"truncate_size":0,"data_digest":"0x031cb17c","omap_digest":"0xffffffff","expected_object_size":4194304,"expected_write_size":4194304,"alloc_hint_flags":0,"manifest":{"type":0},"watchers":{}},"shards":[{"osd":51,"primary":false,"errors":["read_error"],"size":4194304},{"osd":60,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x031cb17c"},{"osd":82,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x031cb17c"}]}]}
/var/log/syslog:
Mar 30 08:40:40 kvm1e kernel: [74792.229021] ata2.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x0
Mar 30 08:40:40 kvm1e kernel: [74792.230416] ata2.00: irq_stat 0x40000008
Mar 30 08:40:40 kvm1e kernel: [74792.231715] ata2.00: failed command: READ FPDMA QUEUED
Mar 30 08:40:40 kvm1e kernel: [74792.233071] ata2.00: cmd 60/00:08:00:7a:50/04:00:c9:00:00/40 tag 1 ncq dma 524288 in
Mar 30 08:40:40 kvm1e kernel: [74792.233071] res 43/40:00:10:7b:50/00:04:c9:00:00/00 Emask 0x409 (media error) <F>
Mar 30 08:40:40 kvm1e kernel: [74792.235736] ata2.00: status: { DRDY SENSE ERR }
Mar 30 08:40:40 kvm1e kernel: [74792.237045] ata2.00: error: { UNC }
Mar 30 08:40:40 kvm1e ceph-osd[450777]: 2020-03-30 08:40:40.240 7f48a41f3700 -1 bluestore(/var/lib/ceph/osd/ceph-51) _do_read bdev-read failed: (5) Input/output error
Mar 30 08:40:40 kvm1e kernel: [74792.244914] ata2.00: configured for UDMA/133
Mar 30 08:40:40 kvm1e kernel: [74792.244938] sd 1:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 30 08:40:40 kvm1e kernel: [74792.244942] sd 1:0:0:0: [sdb] tag#1 Sense Key : Medium Error [current]
Mar 30 08:40:40 kvm1e kernel: [74792.244945] sd 1:0:0:0: [sdb] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
Mar 30 08:40:40 kvm1e kernel: [74792.244949] sd 1:0:0:0: [sdb] tag#1 CDB: Read(16) 88 00 00 00 00 00 c9 50 7a 00 00 00 04 00 00 00
Mar 30 08:40:40 kvm1e kernel: [74792.244953] blk_update_request: I/O error, dev sdb, sector 3377494800 op 0x0:(READ) flags 0x0 phys_seg 94 prio class 0
Mar 30 08:40:40 kvm1e kernel: [74792.246238] ata2: EH complete
Regards
David Herselman
Hi,
I am trying to find out whether Ceph has a method to detect silent
corruption such as bit rot. I came across this text in the book "Mastering
Ceph: Infrastructure Storage Solution with the latest Ceph release" by
Nick Fisk:
The Luminous release of Ceph employs a ZFS-like ability to checksum data on
every read. BlueStore calculates and stores the crc32 checksum of any data
that is written. On each read request, BlueStore reads the checksum and
compares it with the data read from the device. If there is a mismatch,
BlueStore will report a read error and repair the damage. Ceph will then
retry the read from another OSD holding the object.
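If I understand the documentation correctly, the checksum algorithm is a per-OSD BlueStore option; for example:
ceph config get osd bluestore_csum_type   # reportedly crc32c by default; correct me if I have misread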
I have two questions:
1. If there is a mismatch, will the Ceph user (who initiated the GET) receive
an error message and have to retry, or will the error be auto-corrected by
Ceph and the data returned from the other OSD (so the user never knows what
went wrong underneath)?
2. Does this feature also work for multipart-upload objects?
--
Regards,
Priya
Hi all,
Does anyone know how to trace a write/read request from the client
through rbd/rados to the OSDs?
Is there any useful documentation besides doc/dev/blkin?
I'm wondering how to trace a large distributed storage system such as
Ceph in order to observe/monitor it.
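For context, I know about the per-daemon op tracker on the admin socket, e.g.:
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_historic_ops   # recent ops with per-stage timestamps
but I'm looking for end-to-end tracing across the client, rbd/rados and OSD layers.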
B.R.
Changcheng
Hi Jeff,
I have dug a bit deeper and was able to get write access on the ceph-vfs share. But the file and directory permissions are a mess. It seems that ceph-vfs does not evaluate secondary group permissions.
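For reference, the share definition I'm testing looks roughly like this (share name and user id are placeholders):
[archive]
    path = /
    vfs objects = ceph
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba
    kernel share modes = no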
> On 27.03.2020 at 18:14, Dr. Marco Savoca <quaternionma(a)gmail.com> wrote:
>
>
> „No. I haven't tested it in some time, but it does allow clients to
> write. When you say you can't get write access, what are you doing to
> test this, and what error are you getting back?“
>
> For example: after a successful connection via smbclient it says „NT_STATUS_ACCESS_DENIED making remote directory“ or
>
> „NT_STATUS_ACCESS_DENIED deleting remote file“ when I try to make a directory or delete a file. If the same samba user connects to a share with a mounted path, everything works as expected, so there should not be any ACL errors.
>
> From: Jeff Layton
> Sent: Friday, 27 March 2020, 13:05
> To: Marco Savoca; ceph-users(a)ceph.io
> Subject: Re: [ceph-users] samba ceph-vfs and scrubbing interval
>
> On Fri, 2020-03-27 at 12:00 +0100, Marco Savoca wrote:
> > Hi all,
> >
> > I'm running a 3-node Ceph cluster setup with colocated mons and mds
> > for currently 3 filesystems at home, since Mimic. I'm planning to
> > downgrade to one FS and use RBD in the future, but this is another
> > story. I’m using the cluster as cold storage on spindles with EC-pools
> > for archive purposes. The cluster usually does not run 24/7. I
> > actually managed to upgrade to octopus without problems yesterday. So
> > first of all: great job with the release.
> >
> > Now I have a little problem and a general question to address.
> >
> > I have tried to share the CephFS via samba and the ceph-vfs module but
> > I could not manage to get write access (read access is not a problem)
> > to the share (even with the admin key). When I share the mounted path
> > (kernel module or fuse mount) instead, as usual there are no problems
> > at all. Is ceph-vfs generally read-only and have I missed this point?
>
> No. I haven't tested it in some time, but it does allow clients to
> write. When you say you can't get write access, what are you doing to
> test this, and what error are you getting back?
>
> > Furthermore I suppose, that there is no possibility to choose between
> > the different mds namespaces, right?
> >
>
> Yeah, doesn't look like anyone has added that. That would probably be
> pretty easy to add, though it would take a little while to trickle out
> to the distros.
>
> > Now the general question. Since the cluster does not run 24/7 as
> > stated and is turned on perhaps once a week for a couple of hours on
> > demand, what are reasonable settings for the scrubbing intervals? As I
> > said, the storage is cold and there is mostly read i/o. The archiving
> > process adds approximately 0.5 % of new data of the cluster’s total
> > storage capacity.
>
> --
> Jeff Layton <jlayton(a)redhat.com>
Hi everyone,
I am taking time off from the Ceph project and from Red Hat, starting in
April and extending through the US election in November. I will initially
be working with an organization focused on voter registration and turnout
and combating voter suppression and disinformation campaigns.
During this time I will maintain some involvement in the Ceph community,
primarily around strategic planning for Pacific and the Ceph Foundation,
but most of my time will be focused elsewhere.
Most decision making around Ceph will remain in the capable hands of the
Ceph Leadership Team and component leads--I have the utmost confidence in
their judgement and abilities. Yehuda Sadeh and Josh Durgin will be
filling in to provide high-level guidance where needed.
I’ll be participating in the Pacific planning meetings planned for next
week, which will be important in kicking off development for Pacific:
https://ceph.io/cds/ceph-developer-summit-pacific/
I am extremely proud of what we have accomplished with the Octopus
release, and I believe the Ceph community will continue to do great things
with Pacific! I look forward to returning at the end of the year to help
wrap up the release and (hopefully) get things ready for Cephalocon next
March.
Most of all, I am excited to become engaged in another effort that I feel
strongly about--one that will have a very real impact on my kids’
futures--and that will be easier to explain to lay people! :)
Thanks!
sage
Hi.
I'm experiencing some kind of a space leak in Bluestore. I use EC,
compression and snapshots. First I thought that the leak was caused by
"virtual clones" (issue #38184). However, then I got rid of most of the
snapshots, but continued to experience the problem.
I suspected something when I added a new disk to the cluster and free
space in the cluster didn't increase (!).
So to track down the issue I moved one PG (34.1a) using upmaps from
osd11,6,0 to osd6,0,7 and then back to osd11,6,0.
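The moves were done roughly like this (exact invocation from memory):
ceph osd pg-upmap-items 34.1a 11 7   # replace osd11 with osd7 for this PG
ceph osd rm-pg-upmap-items 34.1a     # remove the exception to move it back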
It ate +59 GB after the first move and +51 GB after the second. As I
understand it, this proves that it's not #38184: devirtualization of
virtual clones couldn't eat additional space after a SECOND rebalance of
the same PG.
The PG has ~39000 objects, it is EC 2+1 and compression is enabled.
The compression ratio is ~2.7 in my setup, so the PG should use ~90 GB
of raw space.
Before and after moving the PG I stopped osd0, mounted it with
ceph-objectstore-tool with debug bluestore = 20/20 and opened the
34.1a***/all directory. It seems to dump all object extents into the log
in that case. So now I have two logs with all allocated extents for osd0
(I hope all extents are there). I parsed both logs and added all
compressed blob sizes together ("get_ref Blob ... 0x20000 -> 0x...
compressed"). But they add up to ~39 GB before first rebalance
(34.1as2), ~22 GB after it (34.1as1) and ~41 GB again after the second
move (34.1as2) which doesn't indicate a leak.
But the raw space usage still exceeds the initial usage by a lot, so it's
clear that there's a leak somewhere.
What additional details can I provide for you to identify the bug?
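In case it matters, the objectstore-tool mount was done roughly like this (paths from memory):
systemctl stop ceph-osd@0
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op fuse --mountpoint /mnt/osd0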
I posted the same message in the issue tracker,
https://tracker.ceph.com/issues/44731
--
Vitaliy Filippov
Hi,
the current output of ceph -s reports a warning:
2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
This time is increasing.
root@ld3955:~# ceph -s
  cluster:
    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_WARN
            9 daemons have recently crashed
            2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
    mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
    mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2 up:standby-replay 3 up:standby
    osd: 442 osds: 442 up (since 8d), 442 in (since 9d)

  data:
    pools:   7 pools, 19628 pgs
    objects: 65.78M objects, 251 TiB
    usage:   753 TiB used, 779 TiB / 1.5 PiB avail
    pgs:     19628 active+clean

  io:
    client: 427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
The details are as follows:
root@ld3955:~# ceph health detail
HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has slow ops
RECENT_CRASH 9 daemons have recently crashed
    mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
    mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
    mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
    mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
    mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
    mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
    mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
    mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
    mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has slow ops
There are no errors on the services (mgr, mon, osd).
Can you please advise on how to identify the root cause of these slow ops?
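If it helps, I can dump the blocked ops from the mon's admin socket, e.g.:
ceph daemon mon.ld5505 ops   # assuming the admin socket is available on that host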
THX