Dear Xiubo and others.
I have
never heard about that option until now. How do I check that, and how do I disable it if
necessary?
I'm in meetings pretty much all day and will try to send some more info later.
$ mount|grep ceph
I get
MON-IPs:SRC on DST type ceph
(rw,relatime,name=con-fs2-rit-pfile,secret=<hidden>,noshare,acl,mds_namespace=con-fs2,_netdev)
so async dirops seem to be disabled.
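To spell out how I checked (this is my own reading of the kernel client's mount options, not something confirmed in this thread): async dirops should correspond to the nowsync mount option, with wsync being the synchronous behaviour, so the check and the "disable to be sure" step would be roughly

$ grep ceph /proc/mounts
  # "nowsync" in the option list would mean async dirops are enabled;
  # "wsync", or neither option, should mean they are off
$ mount -t ceph MON-IPs:SRC DST -o name=con-fs2-rit-pfile,mds_namespace=con-fs2,wsync
  # hypothetical (re)mount that forces synchronous dirops, just to be explicit

Since my mount line shows neither option, I read that as async dirops being off.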
Yeah, the kclient just received a corrupted
snaptrace from MDS.
So the first thing is you need to fix the corrupted snaptrace issue in cephfs and then
continue.
Ooookaaayyyy. I will take it as a compliment that you seem to assume I
know how to do that. Searching the documentation gives 0 hits. Could you please provide me with
instructions on what to look for and/or what to do first?
If possible, you can parse the above corrupted
snap message to check what exactly is corrupted.
I haven't had a chance to do that yet.
Again, how would I do that? Is there some
documentation and what should I expect?
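My best guess at what "parsing" would mean here (purely a guess on my side; where the hexdump ends up and which dencoder type to use are assumptions I have not verified):

# 1. copy the hex bytes the kclient dumped to the kernel log into a file (hex digits only, no offsets or ASCII columns)
$ xxd -r -p snaptrace.hex snaptrace.bin
# 2. see which snap-related structures ceph-dencoder knows about and try to decode the blob as one of them
$ ceph-dencoder list_types | grep -i snap
$ ceph-dencoder type <type-from-list> import snaptrace.bin decode dump_json

Is that roughly the idea, or is there a better tool for this?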
It seems you didn't enable the 'osd
blocklist' cephx auth cap for mon:
I can't find anything about an 'osd
blocklist' client auth cap in the documentation. Is this something that came after Octopus?
Our caps are as shown in the documentation for a ceph fs client
(https://docs.ceph.com/en/octopus/cephfs/client-auth/); the one for mon is "allow
r":
caps mds = "allow rw path=/shares"
caps mon = "allow r"
caps osd = "allow rw tag cephfs data=con-fs2"
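If that cap really is what's missing, my guess at the fix (taken from my reading of the docs, not verified on this cluster; the client name is the one from the mount line above, and on Octopus the command may still be called "osd blacklist" rather than "osd blocklist") would be

$ ceph auth get client.con-fs2-rit-pfile
$ ceph auth caps client.con-fs2-rit-pfile \
    mds 'allow rw path=/shares' \
    mon 'allow r, allow command "osd blocklist"' \
    osd 'allow rw tag cephfs data=con-fs2'
  # note: "ceph auth caps" replaces the whole cap set, so all existing caps have to be repeated

Please correct me if that's not the cap you mean.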
I checked that, but from reading the code I
couldn't tell what had caused the MDS crash.
Something seems to have corrupted the metadata in cephfs.
He wrote something
about an invalid xattr (empty value). It would be really helpful to get a clue about how to
proceed. I managed to dump the MDS cache with the critical inode in cache. Would this help
with debugging? I also managed to get debug logs with debug_mds=20 during a crash caused
by an "mds dump inode" command. Would this contain something interesting? I can
also pull the rados objects out and can upload all of these files.
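For completeness, this is roughly how I collected those artifacts (the MDS name, inode number, pool and object names below are placeholders):

$ ceph daemon mds.<name> dump cache /tmp/mds-cache.dump    # MDS cache dump containing the critical inode
$ ceph config set mds.<name> debug_mds 20                  # raise MDS debug logging before reproducing
$ ceph tell mds.<name> dump inode <inode-number>           # the command that triggers the crash
$ rados -p <metadata-pool> get <object-id> <local-file>    # pull the raw metadata objects for upload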
I was just
guessing about the invalid xattr based on the very limited
crash info, so if the kclient logs clearly point to broken snapshot metadata, I
would focus on that.