On Wed, Oct 23, 2019 at 11:27 PM Sage Weil <sage@newdream.net> wrote:
On Wed, 23 Oct 2019, Paul Emmerich wrote:
Hi,
I'm working on a curious case that looks like a bug in PG merging,
possibly related to FileStore.

The setup is 14.2.1, half BlueStore and half FileStore (mid-migration).
The number of PGs on an RGW index pool was reduced, and now one of the
PGs (3 FileStore OSDs) seems to be corrupted. 29 objects (~20% of the
PG) are affected; the issue looks like this for one of the affected
objects, which I'll call .dir.A here:
# object seems to exist according to rados
rados -p default.rgw.buckets.index ls | grep .dir.A
.dir.A
# or doesn't it?
rados -p default.rgw.buckets.index get .dir.A -
error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
Running a deep-scrub on the affected PG reports that everything is okay.
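(For reference, triggered with the usual command; the PG id 18.2 is
taken from the OSD log further down:)

ceph pg deep-scrub 18.2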
My guess is that the actual file is not in the right directory hash level.
Did you look at the underlying file system to see if it is clearly out of
place with the other objects?
The PG is tiny with only ~150 files, so they aren't split into
subdirectories; the file is right there next to all the working objects.
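(A sketch of how I checked on the FileStore data dir; the mount path is
the default location for osd.57 and an assumption on my part:)

find /var/lib/ceph/osd/ceph-57/current/18.2_head -name '*dir.A*'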
Also, I'm curious if all of the replicas are
similarly affected? What
happens if you move the primary to one of the other replicas (e.g., via
ceph osd primary-affinity) and try reading it then?
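(Something along these lines, with osd.57 as the current primary per the
pg log quoted below:)

ceph osd primary-affinity 57 0    # demote osd.57 from primary
ceph osd primary-affinity 57 1    # restore after testing the read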
Yes, I've tried all 3 replicas; same problem :(
Paul
> > This is what the OSD logs when trying to access it; nothing really
> > relevant, even at debug 20:
> >
> > 10 osd.57 pg_epoch: 1149030 pg[18.2( v 1148996'1422066
> > (1144429'1418988,1148996'1422066] local-lis/les=1149021/1149022 n=135
> > ec=49611/596 lis/c 1149021/1149021 les/c/f 1149022/1149022/0
> > 1149015/1149021/1149021) [57,0,31] r=0 lpr=1149021 crt=1148996'1422066
> > lcod 1148996'1422065 mlcod 0'0 active+clean] get_object_context: no
> > obc for soid 18:764060e4:::.dir.A:head and !can_create
> >
> > So going one level deeper with ceph-objectstore-tool:
> > # --op list
> > (29 messages like this)
> > error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
> > followed by the complete JSON output for all objects, including
> > the broken ones.
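> > (The invocation was along these lines, with the OSD stopped; the
> > data path is the default location for osd.57 and an assumption here:)
> >
> > ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 \
> >     --pgid 18.2 --op list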
> >
> > # .dir.A dump
> > Error stat on : 18.2_head,#18:73996afb:::.dir.A:head#, (2) No such
> > file or directory
> > Error getting snapset on : 18.2_head,#18:73996afb:::.dir.A:head#, (2)
> > No such file or directory
> > {
> >     "id": {
> >         "oid": ".dir.A",
> >         "key": "",
> >         "snapid": -2,
> >         "hash": 3746994638,
> >         "max": 0,
> >         "pool": 18,
> >         "namespace": "",
> >         "max": 0
> >     }
> > }
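> > (Full invocation roughly as follows, same assumed data path as
> > above; the tool accepts the object by name:)
> >
> > ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 \
> >     --pgid 18.2 '.dir.A' dump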
> >
> > # --op export
> > stops after encountering a bad object with 'export_files error -2'
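> > (Roughly the following; the target file name is made up here:)
> >
> > ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 \
> >     --pgid 18.2 --op export --file /tmp/pg18.2.export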
> >
> > This is the same for all 3 OSDs in that PG.
> >
> > Has anyone encountered something similar? I'll probably just nuke the
> > affected bucket indices tomorrow and re-create them.
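> > (The rough plan, per affected bucket; treat this as a sketch, since
> > the bucket names still need to be mapped from the index object IDs:
> >
> > radosgw-admin bucket check --bucket=<name> --check-objects --fix
> >
> > or, failing that, rebuilding the index shards via
> > radosgw-admin bucket reshard.)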
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-leave@ceph.io
> >