Hi,
I have a feeling that the pg repair didn't actually run yet. Sometimes
if the OSDs are busy scrubbing, the repair doesn't start when you ask
it to.
You can force it through with something like:
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph config set osd_max_scrubs 3
ceph pg repair <the pg>
ceph status # and check that the repair really started
ceph config set osd_max_scrubs 1
ceph osd unset nodeep-scrub
ceph osd unset noscrub
Once repair runs/completes, it will rewrite the inconsistent object
replica (to a new place on the disk). Check your ceph.log to see when
this happens.
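For example, on a mon host (assuming the default cluster log path):
grep repair /var/log/ceph/ceph.log   # look for "<the pg> repair starts" / "repair ok, N fixed"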
From my experience, the PendingSectors counter will not be decremented
until that sector is written again (which will happen at some random
point in the future when bluestore allocates some new data there).
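You can watch that counter with smartctl, e.g.:
smartctl -A /dev/sdX   # check Current_Pending_Sector and Reallocated_Sector_Ct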
Hope that helps,
Dan
On Mon, Mar 30, 2020 at 9:00 AM David Herselman <dhe(a)syrex.co> wrote:
>
> Hi,
>
> We have a single inconsistent placement group. I subsequently triggered a deep scrub and tried a 'pg repair', but the placement group remains in an inconsistent state.
>
> How do I discard the objects for this placement group on the one OSD only and get Ceph to essentially write the data out anew? Drives will only mark a sector as remapped when asked to overwrite the problematic sector, or when repeated reads of the failed sector eventually succeed (this is my limited understanding).
>
> I could not decipher anything useful in the 'ceph pg 1.35 query' output. I then ran 'ceph pg deep-scrub 1.35', after which 'rados list-inconsistent-obj 1.35' indicated a read error on one of the copies:
> {"epoch":25776,"inconsistents":[{"object":{"name":"rbd_data.746f3c94fb3a42.000000000001e48d","nspace":"","locator":"","snap":"head","version":34866184},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":{"oid":{"oid":"rbd_data.746f3c94fb3a42.000000000001e48d","key":"","snapid":-2,"hash":3814100149,"max":0,"pool":1,"namespace":""},"version":"22845'1781037","prior_version":"22641'1771494","last_reqid":"client.136837683.0:124047","user_version":34866184,"size":4194304,"mtime":"2020-03-08 17:59:00.159846","local_mtime":"2020-03-08 17:59:00.159670","lost":0,"flags":["dirty","data_digest","omap_digest"],"truncate_seq":0,"truncate_size":0,"data_digest":"0x031cb17c","omap_digest":"0xffffffff","expected_object_size":4194304,"expected_write_size":4194304,"alloc_hint_flags":0,"manifest":{"type":0},"watchers":{}},"shards":[{"osd":51,"primary":false,"errors":["read_error"],"size":4194304},{"osd":60,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_dig
> est":"0x031cb17c"},{"osd":82,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x031cb17c"}]}]}
>
> /var/log/syslog:
> Mar 30 08:40:40 kvm1e kernel: [74792.229021] ata2.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x0
> Mar 30 08:40:40 kvm1e kernel: [74792.230416] ata2.00: irq_stat 0x40000008
> Mar 30 08:40:40 kvm1e kernel: [74792.231715] ata2.00: failed command: READ FPDMA QUEUED
> Mar 30 08:40:40 kvm1e kernel: [74792.233071] ata2.00: cmd 60/00:08:00:7a:50/04:00:c9:00:00/40 tag 1 ncq dma 524288 in
> Mar 30 08:40:40 kvm1e kernel: [74792.233071] res 43/40:00:10:7b:50/00:04:c9:00:00/00 Emask 0x409 (media error) <F>
> Mar 30 08:40:40 kvm1e kernel: [74792.235736] ata2.00: status: { DRDY SENSE ERR }
> Mar 30 08:40:40 kvm1e kernel: [74792.237045] ata2.00: error: { UNC }
> Mar 30 08:40:40 kvm1e ceph-osd[450777]: 2020-03-30 08:40:40.240 7f48a41f3700 -1 bluestore(/var/lib/ceph/osd/ceph-51) _do_read bdev-read failed: (5) Input/output error
> Mar 30 08:40:40 kvm1e kernel: [74792.244914] ata2.00: configured for UDMA/133
> Mar 30 08:40:40 kvm1e kernel: [74792.244938] sd 1:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Mar 30 08:40:40 kvm1e kernel: [74792.244942] sd 1:0:0:0: [sdb] tag#1 Sense Key : Medium Error [current]
> Mar 30 08:40:40 kvm1e kernel: [74792.244945] sd 1:0:0:0: [sdb] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
> Mar 30 08:40:40 kvm1e kernel: [74792.244949] sd 1:0:0:0: [sdb] tag#1 CDB: Read(16) 88 00 00 00 00 00 c9 50 7a 00 00 00 04 00 00 00
> Mar 30 08:40:40 kvm1e kernel: [74792.244953] blk_update_request: I/O error, dev sdb, sector 3377494800 op 0x0:(READ) flags 0x0 phys_seg 94 prio class 0
> Mar 30 08:40:40 kvm1e kernel: [74792.246238] ata2: EH complete
>
>
> Regards
> David Herselman
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi folks,
I have a three-node cluster on a 10G network with very little traffic. I have a six-OSD flash-only pool with two devices — a 1TB NVMe drive and a 256GB SATA SSD — on each node, and here’s how it benchmarks:
Oof. How can I troubleshoot this? Anthony mentioned that I might be able to run more than one OSD on the NVMe — how is that done, and can I do it “on the fly” with the system already up and running like this? And, will more OSDs give me better IOPS?
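From a quick search it looks like ceph-volume can do this at deployment time, e.g. something like:
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1   # device name is just an example
but I don't know whether that can be applied to a drive that already hosts an OSD without redeploying it.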
Thanks,
Jarett
Some time ago I made a surprising observation. I reorganised a directory structure and needed to move a folder one level up with a command like
mv A/B/ B
B contained something like 9 TB in very large files. To my surprise, this command didn't return for a couple of minutes, so I started to look at what was going on. What I discovered was that the mv command actually performed a full copy followed by a remove. I had to wait several hours for the move to complete.
I tried to reproduce this today to collect further information. However, the behaviour seems not to be reproducible: no matter what I try, mv completes almost instantly.
I was running the original mv on mimic 13.2.2 and retried now with mimic 13.2.8. In addition, there was an OS upgrade from Centos 7.6 to 7.7. I'm using the kernel-ml versions (5.xxx). Only one cephfs mount was present at all times.
My questions are:
1) Was there a change from 13.2.2 to 13.2.8 explaining this?
2) Are there (rare) conditions under which an mv on cephfs becomes a cp+rm?
3) Am I seeing ghosts?
Thanks for clues and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
We have a single inconsistent placement group. I subsequently triggered a deep scrub and tried a 'pg repair', but the placement group remains in an inconsistent state.
How do I discard the objects for this placement group on the one OSD only and get Ceph to essentially write the data out anew? Drives will only mark a sector as remapped when asked to overwrite the problematic sector, or when repeated reads of the failed sector eventually succeed (this is my limited understanding).
I could not decipher anything useful in the 'ceph pg 1.35 query' output. I then ran 'ceph pg deep-scrub 1.35', after which 'rados list-inconsistent-obj 1.35' indicated a read error on one of the copies:
{"epoch":25776,"inconsistents":[{"object":{"name":"rbd_data.746f3c94fb3a42.000000000001e48d","nspace":"","locator":"","snap":"head","version":34866184},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":{"oid":{"oid":"rbd_data.746f3c94fb3a42.000000000001e48d","key":"","snapid":-2,"hash":3814100149,"max":0,"pool":1,"namespace":""},"version":"22845'1781037","prior_version":"22641'1771494","last_reqid":"client.136837683.0:124047","user_version":34866184,"size":4194304,"mtime":"2020-03-08 17:59:00.159846","local_mtime":"2020-03-08 17:59:00.159670","lost":0,"flags":["dirty","data_digest","omap_digest"],"truncate_seq":0,"truncate_size":0,"data_digest":"0x031cb17c","omap_digest":"0xffffffff","expected_object_size":4194304,"expected_write_size":4194304,"alloc_hint_flags":0,"manifest":{"type":0},"watchers":{}},"shards":[{"osd":51,"primary":false,"errors":["read_error"],"size":4194304},{"osd":60,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x031cb17c"},{"osd":82,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x031cb17c"}]}]}
/var/log/syslog:
Mar 30 08:40:40 kvm1e kernel: [74792.229021] ata2.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x0
Mar 30 08:40:40 kvm1e kernel: [74792.230416] ata2.00: irq_stat 0x40000008
Mar 30 08:40:40 kvm1e kernel: [74792.231715] ata2.00: failed command: READ FPDMA QUEUED
Mar 30 08:40:40 kvm1e kernel: [74792.233071] ata2.00: cmd 60/00:08:00:7a:50/04:00:c9:00:00/40 tag 1 ncq dma 524288 in
Mar 30 08:40:40 kvm1e kernel: [74792.233071] res 43/40:00:10:7b:50/00:04:c9:00:00/00 Emask 0x409 (media error) <F>
Mar 30 08:40:40 kvm1e kernel: [74792.235736] ata2.00: status: { DRDY SENSE ERR }
Mar 30 08:40:40 kvm1e kernel: [74792.237045] ata2.00: error: { UNC }
Mar 30 08:40:40 kvm1e ceph-osd[450777]: 2020-03-30 08:40:40.240 7f48a41f3700 -1 bluestore(/var/lib/ceph/osd/ceph-51) _do_read bdev-read failed: (5) Input/output error
Mar 30 08:40:40 kvm1e kernel: [74792.244914] ata2.00: configured for UDMA/133
Mar 30 08:40:40 kvm1e kernel: [74792.244938] sd 1:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 30 08:40:40 kvm1e kernel: [74792.244942] sd 1:0:0:0: [sdb] tag#1 Sense Key : Medium Error [current]
Mar 30 08:40:40 kvm1e kernel: [74792.244945] sd 1:0:0:0: [sdb] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
Mar 30 08:40:40 kvm1e kernel: [74792.244949] sd 1:0:0:0: [sdb] tag#1 CDB: Read(16) 88 00 00 00 00 00 c9 50 7a 00 00 00 04 00 00 00
Mar 30 08:40:40 kvm1e kernel: [74792.244953] blk_update_request: I/O error, dev sdb, sector 3377494800 op 0x0:(READ) flags 0x0 phys_seg 94 prio class 0
Mar 30 08:40:40 kvm1e kernel: [74792.246238] ata2: EH complete
Regards
David Herselman
Hi,
I am trying to find out whether Ceph has a method to detect silent
corruption such as bit rot. I came across this text in the book "Mastering
Ceph: Infrastructure Storage Solution with the latest Ceph release" by
Nick Fisk:
The Luminous release of Ceph employs a ZFS-like ability to checksum data on
every read. BlueStore calculates and stores the crc32 checksum of any data
that is written. On each read request, BlueStore reads the checksum and
compares it with the data read from the device. If there is a mismatch,
BlueStore will report a read error and repair the damage. Ceph will then
retry the read from another OSD holding the object.
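If I understand the documentation correctly, the checksum algorithm is a per-OSD BlueStore option; for example:
ceph config get osd bluestore_csum_type   # reportedly crc32c by default; correct me if I have misread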
I have two questions:
1. If there is a mismatch, will the Ceph user (who initiated the GET) receive
an error message and have to retry, or will the error be auto-corrected by
Ceph and the data returned from the other OSD (so the user never knows what
went wrong underneath)?
2. Does this feature also work for multipart-upload objects?
--
Regards,
Priya
Hi all,
Does anyone know how to trace a write/read request from the client
through rbd/rados to the OSDs?
Is there any useful documentation besides doc/dev/blkin?
I'm wondering how to trace a large distributed storage system such as
Ceph in order to observe/monitor it.
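For context, I know about the per-daemon op tracker on the admin socket, e.g.:
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_historic_ops   # recent ops with per-stage timestamps
but I'm looking for end-to-end tracing across the client, rbd/rados and OSD layers.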
B.R.
Changcheng
Hi Jeff,
I have dug a bit deeper and was able to get write access on the ceph-vfs share. But the file and directory permissions are a mess. It seems that ceph-vfs does not evaluate secondary group permissions.
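For reference, the share definition I'm testing looks roughly like this (share name and user id are placeholders):
[archive]
    path = /
    vfs objects = ceph
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba
    kernel share modes = no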
> On 27.03.2020 at 18:14, Dr. Marco Savoca <quaternionma(a)gmail.com> wrote:
>
>
> „No. I haven't tested it in some time, but it does allow clients to
> write. When you say you can't get write access, what are you doing to
> test this, and what error are you getting back?“
>
> For example: after a successful connection via smbclient it says „NT_STATUS_ACCESS_DENIED making remote directory“ or
>
> „NT_STATUS_ACCESS_DENIED deleting remote file“ when I try to make a directory or delete a file. If the same samba user connects to a share with a mounted path, everything works as expected, so there should not be any ACL errors.
>
> From: Jeff Layton
> Sent: Friday, 27 March 2020, 13:05
> To: Marco Savoca; ceph-users(a)ceph.io
> Subject: Re: [ceph-users] samba ceph-vfs and scrubbing interval
>
> On Fri, 2020-03-27 at 12:00 +0100, Marco Savoca wrote:
> > Hi all,
> >
> > I'm running a 3-node Ceph cluster setup with colocated mons and mds
> > for currently 3 filesystems at home, since Mimic. I'm planning to
> > downgrade to one FS and use RBD in the future, but this is another
> > story. I’m using the cluster as cold storage on spindles with EC-pools
> > for archive purposes. The cluster usually does not run 24/7. I
> > actually managed to upgrade to octopus without problems yesterday. So
> > first of all: great job with the release.
> >
> > Now I have a little problem and a general question to address.
> >
> > I have tried to share the CephFS via samba and the ceph-vfs module but
> > I could not manage to get write access (read access is not a problem)
> > to the share (even with the admin key). When I share the mounted path
> > (kernel module or fuse mount) instead, as usual there are no problems
> > at all. Is ceph-vfs generally read-only and have I missed this point?
>
> No. I haven't tested it in some time, but it does allow clients to
> write. When you say you can't get write access, what are you doing to
> test this, and what error are you getting back?
>
> > Furthermore I suppose, that there is no possibility to choose between
> > the different mds namespaces, right?
> >
>
> Yeah, doesn't look like anyone has added that. That would probably be
> pretty easy to add, though it would take a little while to trickle out
> to the distros.
>
> > Now the general question. Since the cluster does not run 24/7 as
> > stated and is turned on perhaps once a week for a couple of hours on
> > demand, what are reasonable settings for the scrubbing intervals? As I
> > said, the storage is cold and there is mostly read i/o. The archiving
> > process adds approximately 0.5 % of new data of the cluster’s total
> > storage capacity.
>
> --
> Jeff Layton <jlayton(a)redhat.com>
Hi everyone,
I am taking time off from the Ceph project and from Red Hat, starting in
April and extending through the US election in November. I will initially
be working with an organization focused on voter registration and turnout
and combating voter suppression and disinformation campaigns.
During this time I will maintain some involvement in the Ceph community,
primarily around strategic planning for Pacific and the Ceph Foundation,
but most of my time will be focused elsewhere.
Most decision making around Ceph will remain in the capable hands of the
Ceph Leadership Team and component leads--I have the utmost confidence in
their judgement and abilities. Yehuda Sadeh and Josh Durgin will be
filling in to provide high-level guidance where needed.
I’ll be participating in the Pacific planning meetings planned for next
week, which will be important in kicking off development for Pacific:
https://ceph.io/cds/ceph-developer-summit-pacific/
I am extremely proud of what we have accomplished with the Octopus
release, and I believe the Ceph community will continue to do great things
with Pacific! I look forward to returning at the end of the year to help
wrap up the release and (hopefully) get things ready for Cephalocon next
March.
Most of all, I am excited to become engaged in another effort that I feel
strongly about--one that will have a very real impact on my kids’
futures--and that will be easier to explain to lay people! :)
Thanks!
sage
Hi.
I'm experiencing some kind of a space leak in Bluestore. I use EC,
compression and snapshots. First I thought that the leak was caused by
"virtual clones" (issue #38184). However, then I got rid of most of the
snapshots, but continued to experience the problem.
I suspected something when I added a new disk to the cluster and free
space in the cluster didn't increase (!).
So to track down the issue I moved one PG (34.1a) using upmaps from
osd11,6,0 to osd6,0,7 and then back to osd11,6,0.
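The moves were done roughly like this (exact invocation from memory):
ceph osd pg-upmap-items 34.1a 11 7   # replace osd11 with osd7 for this PG
ceph osd rm-pg-upmap-items 34.1a     # remove the exception to move it back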
It ate +59 GB after the first move and +51 GB after the second. As I
understand it, this proves that it's not #38184: devirtualization of
virtual clones couldn't eat additional space after a SECOND rebalance of
the same PG.
The PG has ~39000 objects, it is EC 2+1 and compression is enabled.
The compression ratio is ~2.7 in my setup, so the PG should use ~90 GB
of raw space.
Before and after moving the PG I stopped osd0, mounted it with
ceph-objectstore-tool with debug bluestore = 20/20 and opened the
34.1a***/all directory. It seems to dump all object extents into the log
in that case. So now I have two logs with all allocated extents for osd0
(I hope all extents are there). I parsed both logs and added all
compressed blob sizes together ("get_ref Blob ... 0x20000 -> 0x...
compressed"). But they add up to ~39 GB before first rebalance
(34.1as2), ~22 GB after it (34.1as1) and ~41 GB again after the second
move (34.1as2) which doesn't indicate a leak.
But the raw space usage still exceeds the initial usage by a lot, so it's
clear that there's a leak somewhere.
What additional details can I provide for you to identify the bug?
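In case it matters, the objectstore-tool mount was done roughly like this (paths from memory):
systemctl stop ceph-osd@0
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op fuse --mountpoint /mnt/osd0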
I posted the same message in the issue tracker,
https://tracker.ceph.com/issues/44731
--
Vitaliy Filippov
Hi,
the current output of ceph -s reports a warning:
2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
This time is increasing.
root@ld3955:~# ceph -s
  cluster:
    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_WARN
            9 daemons have recently crashed
            2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
    mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
    mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2 up:standby-replay 3 up:standby
    osd: 442 osds: 442 up (since 8d), 442 in (since 9d)

  data:
    pools:   7 pools, 19628 pgs
    objects: 65.78M objects, 251 TiB
    usage:   753 TiB used, 779 TiB / 1.5 PiB avail
    pgs:     19628 active+clean

  io:
    client: 427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
The details are as follows:
root@ld3955:~# ceph health detail
HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has slow ops
RECENT_CRASH 9 daemons have recently crashed
    mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
    mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
    mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
    mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
    mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
    mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
    mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
    mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
    mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has slow ops
There are no errors on the services (mgr, mon, osd).
Can you please advise on how to identify the root cause of these slow ops?
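If it helps, I can dump the blocked ops from the mon's admin socket, e.g.:
ceph daemon mon.ld5505 ops   # assuming the admin socket is available on that host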
THX