Hi Felix,
On Sat, May 13, 2023 at 9:18 AM Stolte, Felix <f.stolte@fz-juelich.de> wrote:
> Hi Patrick,
>
> we have been running one daily snapshot since December, and our CephFS
> crashed three times because of this:
> https://tracker.ceph.com/issues/38452
>
> We currently have 19 files with corrupt metadata, found by your
> first-damage.py script. We isolated these files from user access and are
> waiting for a fix before we remove them with your script (or maybe a new
> way?)

No other fix is anticipated at this time. Probably one will be
developed after the cause is understood.
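
For reference, a scan-only pass with the script looks roughly like this
(the pool name is an example; check the script's --help output for the
exact flags shipped with your copy):

    # read-only scan of the CephFS metadata pool; progress goes to a memo file
    python3 first-damage.py --memo run.1 cephfs_metadata

    # later, to actually remove the damaged dentries (stop the MDS first)
    python3 first-damage.py --memo run.2 --remove cephfs_metadata
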
> Today we upgraded our cluster from 16.2.11 to 16.2.13. After upgrading
> the MDS servers, cluster health went to ERROR with MDS_DAMAGE.
> 'ceph tell mds.0 damage ls' is showing me the same files as your script
> (initially only a part; after a cephfs scrub, all of them).

This is expected. Once the dentries are marked damaged, the MDS won't
allow operations on those files (like those triggering tracker
#38452).
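
For anyone following along, the commands in question are roughly these
(the file system name is a placeholder):

    # list the dentries the MDS has marked damaged
    ceph tell mds.<fsname>:0 damage ls

    # walk the tree so every corrupt dentry gets detected and flagged
    ceph tell mds.<fsname>:0 scrub start / recursive
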
I noticed "mds: catch damage to CDentry’s first
member before persisting (issue#58482, pr#50781, Patrick Donnelly)“ in the change logs for
16.2.13 and like to ask you the following questions:
a) can we repair the damaged files online now instead of bringing down the whole fs and
using the python script?
Not yet.
> b) Should we set one of the new MDS options in our specific case to
> avoid our fileserver crashing because of the wrong snap ids?

Has your MDS crashed, or has it only marked the dentries damaged? If you
can reproduce a crash with detailed logs (debug_mds=20), that would be
incredibly helpful.
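
A minimal way to capture that (assuming you use the centralized config
store; these levels are very verbose, so revert them afterwards):

    # turn up MDS logging before reproducing the crash
    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1

    # ... reproduce, then collect /var/log/ceph/ceph-mds.*.log ...

    # drop the overrides to restore the defaults
    ceph config rm mds debug_mds
    ceph config rm mds debug_ms
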
> c) Will your patch prevent wrong snap ids in the future?

It will prevent persisting the damage.
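
Concretely, pr#50781 added the guardrails as MDS config options. The
option names below are as merged there; please verify them on your
release with "ceph config help <option>" before relying on them:

    # abort (and thus refuse to persist) when a newly corrupt dentry is seen
    ceph config set mds mds_abort_on_newly_corrupt_dentry true

    # mark already-persisted corrupt dentries damaged instead of aborting
    ceph config set mds mds_go_bad_corrupt_dentry true
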
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D