Dear Xiubo and others,
I have never heard about that option until now. How do I check that, and how do I
disable it if necessary?
I'm in meetings pretty much all day and will try to send some more info later.
For
$ mount | grep ceph
I get
MON-IPs:SRC on DST type ceph
(rw,relatime,name=con-fs2-rit-pfile,secret=<hidden>,noshare,acl,mds_namespace=con-fs2,_netdev)
so async dirops seem to be disabled.
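If the option in question is the kernel client's async dirops, it corresponds to the nowsync/wsync mount options, so (as a sketch, with the mount source/target kept as placeholders) it can be checked and explicitly disabled like this:

```shell
# Check whether async dirops ("nowsync") appear in the ceph mount options;
# if the grep finds nothing, async dirops are disabled (wsync is the default).
mount | grep ceph | grep -o 'nowsync' || echo "nowsync not set (async dirops disabled)"

# To request synchronous dirops explicitly on a future mount (SRC/DST as placeholders):
# mount -t ceph MON-IPs:SRC DST -o name=con-fs2-rit-pfile,wsync,...
```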
Yeah, the kclient just received a corrupted snaptrace from the MDS.
So the first thing is that you need to fix the corrupted snaptrace issue in CephFS, and
then continue.
Ooookaaayyyy. I will take it as a compliment that you seem to assume I know how to do
that. The documentation gives 0 hits. Could you please provide me with instructions on
what to look for and/or what to do first?
If possible, you can parse the above corrupted snaptrace message to check what exactly
is corrupted. I haven't had a chance to do that yet.
Again, how would I do that? Is there some documentation and what should I expect?
It seems you didn't enable the 'osd
blocklist' cephx auth cap for mon:
I can't find anything about an 'osd blocklist' client auth cap in the documentation. Is
this something that came after Octopus? Our caps are as shown in the documentation for a
CephFS client (https://docs.ceph.com/en/octopus/cephfs/client-auth/); the one for mon is
"allow r":
caps mds = "allow rw path=/shares"
caps mon = "allow r"
caps osd = "allow rw tag cephfs data=con-fs2"
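If the fix is to add the blocklist permission to the mon cap, it might look like the following sketch. The client name is taken from the name= option in the mount output above, but the exact cap string is an assumption, and releases before Pacific (including Octopus) spell the command "osd blacklist":

```shell
# Show the client's current caps (client name assumed from the mount's name= option)
ceph auth get client.con-fs2-rit-pfile

# Rewrite the caps, adding the blocklist command to mon while keeping mds/osd unchanged.
# On Octopus the command is spelled "osd blacklist" instead of "osd blocklist".
ceph auth caps client.con-fs2-rit-pfile \
  mds 'allow rw path=/shares' \
  mon 'allow r, allow command "osd blocklist"' \
  osd 'allow rw tag cephfs data=con-fs2'
```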
I checked that, but by reading the code I
couldn't figure out what had caused the MDS crash.
Something seems to have corrupted the metadata in CephFS.
He wrote something about an invalid xattr (empty value). It would be really helpful to
get a clue on how to proceed. I managed to dump the MDS cache with the critical inode in
cache. Would this help with debugging? I also managed to get debug logs with debug_mds=20
during a crash caused by an "mds dump inode" command. Would this contain
anything interesting? I can also pull the rados objects out and upload all of these
files.
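For reference, the steps described above can be sketched roughly as follows; the MDS daemon name, inode number, pool, and object names are all placeholders:

```shell
# Dump the MDS cache to a file (run on the MDS host; <name> is the MDS daemon name)
ceph daemon mds.<name> dump cache /tmp/mds-cache.dump

# Turn up MDS debug logging, reproduce the crash, then turn it back down
ceph daemon mds.<name> config set debug_mds 20
ceph tell mds.<name> dump inode <ino>        # the command that triggered the crash
ceph daemon mds.<name> config set debug_mds 1/5

# Pull a metadata object out of rados for upload (<pool> and <object> are placeholders)
rados -p <pool> get <object> /tmp/<object>.bin
```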
I was just guessing about the invalid xattr based on the very limited
crash info, so if it's clearly broken snapshot metadata from the
kclient logs I would focus on that.
I'm surprised/concerned your system managed to generate one of those,
of course... I'll let Xiubo work with you on that.
-Greg