Hi Xiubo,
I forgot to include these. The inodes I tried to dump, and which caused a crash, are:
ceph tell "mds.ceph-10" dump inode 2199322355147 <-- original file/folder causing trouble
ceph tell "mds.ceph-10" dump inode 2199322727209 <-- copy, also causing trouble (after taking a snapshot??)
Dumping other folders all the way up the hierarchy did not lead to a crash; the dump worked fine for those.
The debug settings during the first tries were:
ceph config set mds.ceph-10 debug_mds 20/5
ceph config set mds.ceph-10 debug_ms 5/0
The inode numbers in hex are 0x20011d3e5cb and 0x20011d99329; the dump commands are also in the log.
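(Easy to double-check in a shell that these match the decimal inode numbers above:)

printf '0x%x\n' 2199322355147 2199322727209   # prints 0x20011d3e5cb and 0x20011d99329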
I experimented a bit to find out where the problem is localized. I didn't have the high debug settings enabled the whole time due to disk space constraints. I can pull specific logs over a short time window for anything that is reproducible (a short sequence of specific commands).
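Something like the following is what I have in mind for a short capture (same MDS as above; config rm drops back to the defaults to save disk space):

ceph config set mds.ceph-10 debug_mds 20/5   # raise verbosity just before the repro
ceph config set mds.ceph-10 debug_ms 5/0
# ... run the short sequence of commands here ...
ceph config rm mds.ceph-10 debug_mds         # revert to defaults
ceph config rm mds.ceph-10 debug_ms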
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@dtu.dk>
Sent: Monday, May 15, 2023 6:33 PM
To: Xiubo Li; ceph-users@ceph.io
Subject: [ceph-users] Re: mds dump inode crashes file system
Dear Xiubo,
I uploaded the cache dump, the MDS log and the dmesg log containing the snaptrace dump to
ceph-post-file: 763955a3-7d37-408a-bbe4-a95dc687cd3f
Sorry, I forgot to add a user and description this time.
A question about troubleshooting: I'm pretty sure I know the path where the error is located. Would a "ceph tell mds.1 scrub start / recursive repair" be able to discover and fix broken snaptraces? If not, I'm awaiting further instructions.
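(For completeness, the invocation I would try, going by the docs with our fs name con-fs2 and rank 0; the scrub options appear to be comma-separated there:)

ceph tell mds.con-fs2:0 scrub start / recursive,repair
ceph tell mds.con-fs2:0 scrub status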
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Xiubo Li <xiubli@redhat.com>
Sent: Friday, May 12, 2023 3:44 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system
On 5/12/23 20:27, Frank Schilder wrote:
Dear Xiubo and others.
I have never heard about that option until now. How do I check that, and how do I disable it if necessary?
I'm in meetings pretty much all day and will try to send some more info later.
$ mount|grep ceph
I get
MON-IPs:SRC on DST type ceph (rw,relatime,name=con-fs2-rit-pfile,secret=<hidden>,noshare,acl,mds_namespace=con-fs2,_netdev)
so async dirops seem to be disabled.
Yeah.
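(For reference: as far as I understand the kernel client, async dirops are only enabled by the nowsync mount option, and wsync forces the old synchronous behaviour, so you can check the mount flags directly; a quick sketch:)

mount | grep ceph | grep -o nowsync         # any output here means async dirops are enabled
# mount -t ceph MON-IPs:SRC DST -o wsync    # mounting with wsync forces synchronous dirops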
Yeah, the kclient just received a corrupted snaptrace from the MDS.
So the first thing you need to do is fix the corrupted snaptrace issue in cephfs, and then continue.
Ooookaaayyyy. I will take it as a compliment that you seem to assume I know how to do that. The documentation gives 0 hits. Could you please provide me with instructions on what to look for and/or what to do first?
There is no doc about this as far as I know.
If possible, you can parse the above corrupted snap message to check what exactly is corrupted. I haven't had a chance to do that.
Again, how would I do that? Is there some
documentation and what should I expect?
Currently there is no easy way to do this as far as I know; last time I parsed the corrupted binary data into the corresponding message manually. That way we could know what exactly happened with the snaptrace.
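If you want to try it yourself, ceph-dencoder may be a starting point; something like the sketch below, though the type name for the snaptrace payload is a guess on my side and may need adjusting:

# assuming the raw snaptrace blob has been saved to snaptrace.bin
ceph-dencoder type SnapRealmInfo import snaptrace.bin decode dump_json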
It seems you didn't enable the 'osd blocklist' cephx auth cap for mon:
I can't find anything about an osd blocklist client auth cap in the documentation. Is this something that came after Octopus? Our caps are as shown in the documentation for a ceph fs client (https://docs.ceph.com/en/octopus/cephfs/client-auth/), the one for mon is "allow r":
caps mds = "allow rw path=/shares"
caps mon = "allow r"
caps osd = "allow rw tag cephfs data=con-fs2"
Yeah, it seems the 'osd blocklist' cap was not enabled. As I remember, if enabled it should look something like:
caps mon = "allow r, allow command \"osd blocklist\""
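To add it, all existing caps have to be restated in one command, e.g. (client name taken from your mount output, please double-check; note that before Pacific the command was still called "osd blacklist"):

ceph auth caps client.con-fs2-rit-pfile \
    mds 'allow rw path=/shares' \
    mon 'allow r, allow command "osd blocklist"' \
    osd 'allow rw tag cephfs data=con-fs2'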
I checked that, but by reading the code I couldn't tell what had caused the MDS crash.
It seems something went wrong and corrupted the metadata in cephfs.
He wrote something about an invalid xattr (empty value). It would be really helpful to get a clue how to proceed. I managed to dump the MDS cache with the critical inode in cache. Would this help with debugging? I also managed to get debug logs with debug_mds=20 during a crash caused by an "mds dump inode" command. Would this contain anything interesting? I can also pull the rados objects out and upload all of these files.
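(For the objects, I would go by the metadata pool layout as I understand it, i.e. directory objects named <hex inode>.<fragment>; the pool name below is a placeholder for ours:)

rados -p <metadata-pool> get 20011d3e5cb.00000000 dirfrag-20011d3e5cb.bin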
Yeah, possibly. Where are the logs?
I managed to track the problem down to a specific folder with a few files (I'm not sure if this coincides with the snaptrace issue; we might have two issues here). I made a copy of the folder and checked that an "mds dump inode" for the copy does not crash the MDS. I then moved the folders for which this command causes a crash to a different location outside the mounts. Do you think this will help? I'm wondering if, after taking our daily snapshot tomorrow, we will end up in the degraded situation again.
I really need instructions for how to check what is broken without crashing the MDS, and then how to fix it.
Firstly, we need to know where the corrupted metadata is.
I think the MDS debug logs and the above corrupted snaptrace could help. We need to parse that corrupted binary data.
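If you can get the hex bytes of the snaptrace dump from dmesg into a text file, converting them back to a binary blob is straightforward, e.g.:

xxd -r -p snaptrace.hex snaptrace.bin   # plain hex (no offsets) back to raw bytes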
Thanks
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io