Hi Felix,
On Sat, May 13, 2023 at 9:18 AM Stolte, Felix <f.stolte@fz-juelich.de> wrote:
> Hi Patrick,
>
> we have been running one daily snapshot since December, and our CephFS
> crashed three times because of this:
> https://tracker.ceph.com/issues/38452
>
> We currently have 19 files with corrupt metadata, found by your
> first-damage.py script. We isolated these files from user access and are
> waiting for a fix before we remove them with your script (or maybe a new
> way?)

No other fix is anticipated at this time. Probably one will be
developed after the cause is understood.
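
For reference, a scan-only pass with the script looks roughly like this
(the pool name is an example; check the script's --help output for the
exact flags shipped with your copy):

    # read-only scan of the CephFS metadata pool; progress goes to a memo file
    python3 first-damage.py --memo run.1 cephfs_metadata

    # later, to actually remove the damaged dentries (stop the MDS first)
    python3 first-damage.py --memo run.2 --remove cephfs_metadata
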
> Today we upgraded our cluster from 16.2.11 to 16.2.13. After upgrading
> the MDS servers, cluster health went to ERROR with MDS_DAMAGE.
> 'ceph tell mds.0 damage ls' is showing me the same files as your script
> (initially only a part; after a cephfs scrub, all of them).

This is expected. Once the dentries are marked damaged, the MDS won't
allow operations on those files (like those triggering tracker
#38452).
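
For anyone following along, the commands in question are roughly these
(the file system name is a placeholder):

    # list the dentries the MDS has marked damaged
    ceph tell mds.<fsname>:0 damage ls

    # walk the tree so every corrupt dentry gets detected and flagged
    ceph tell mds.<fsname>:0 scrub start / recursive
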
I noticed "mds: catch damage to CDentry’s first
member before persisting (issue#58482, pr#50781, Patrick Donnelly)“ in the change logs for
16.2.13 and like to ask you the following questions:
a) can we repair the damaged files online now instead of bringing down the whole fs and
using the python script?
Not yet.
> b) Should we set one of the new MDS options in our specific case to
> avoid our fileserver crashing because of the wrong snap ids?

Has your MDS crashed, or has it only marked the dentries damaged? If you
can reproduce a crash with detailed logs (debug_mds=20), that would be
incredibly helpful.
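
A minimal way to capture that (assuming you use the centralized config
store; these levels are very verbose, so revert them afterwards):

    # turn up MDS logging before reproducing the crash
    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1

    # ... reproduce, then collect /var/log/ceph/ceph-mds.*.log ...

    # drop the overrides to restore the defaults
    ceph config rm mds debug_mds
    ceph config rm mds debug_ms
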
> c) Will your patch prevent wrong snap ids in the future?

It will prevent persisting the damage.
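
Concretely, pr#50781 added the guardrails as MDS config options. The
option names below are as merged there; please verify them on your
release with "ceph config help <option>" before relying on them:

    # abort (and thus refuse to persist) when a newly corrupt dentry is seen
    ceph config set mds mds_abort_on_newly_corrupt_dentry true

    # mark already-persisted corrupt dentries damaged instead of aborting
    ceph config set mds mds_go_bad_corrupt_dentry true
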
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D