Hi Patrick,
Thanks for the instructions. We started the MDS recovery scan with the commands below,
following the link below. The first phase, scan_extents, has finished and we're now waiting
on scan_inodes. We probably shouldn't interrupt the process. If this procedure fails,
I'll follow your steps and let you know. I appreciate your help.
cephfs-data-scan scan_extents [<data pool> [<extra data pool> ...]]
cephfs-data-scan scan_inodes [<data pool>]
cephfs-data-scan scan_links
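The three phases above have to run strictly in order, each one finishing before the next starts. A minimal sketch of that sequencing follows. The DATA_SCAN variable and the run_scan_phases function are illustrative names, not part of Ceph, and the sketch assumes a single data pool and a single worker.

```shell
# Sketch only: run the three cephfs-data-scan phases quoted above in order.
# DATA_SCAN is a hypothetical override so the sequence can be dry-run
# (for example DATA_SCAN=echo); by default it is the real binary.
DATA_SCAN="${DATA_SCAN:-cephfs-data-scan}"

run_scan_phases() {
    local pool="${1:-}"
    if [ -z "$pool" ]; then
        echo "usage: run_scan_phases <data pool>" >&2
        return 2
    fi
    # && stops the chain at the first phase that fails, so a later phase
    # never starts on top of an incomplete earlier one.
    "$DATA_SCAN" scan_extents "$pool" &&
    "$DATA_SCAN" scan_inodes "$pool" &&
    "$DATA_SCAN" scan_links
}
```

With DATA_SCAN=echo, calling run_scan_phases with a pool name prints the three commands instead of running them, which is a cheap way to check the order before touching the cluster.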
Justin Li
Senior Technical Officer
School of Information Technology
Faculty of Science, Engineering and Built Environment
For ICT Support please see https://www.deakin.edu.au/sebeicthelp
Deakin University
Melbourne Burwood Campus, 221 Burwood Highway, Burwood, VIC 3125
+61 3 9246 8932
Justin.li(a)deakin.edu.au
Deakin University CRICOS Provider Code 00113B
Important Notice: The contents of this email are intended solely for the named addressee
and are confidential; any unauthorised use, reproduction or storage of the contents is
expressly prohibited. If you have received this email in error, please delete it and any
attachments immediately and advise the sender by return email or telephone.
Deakin University does not warrant that this email and any attachments are error or virus
free.
-----Original Message-----
From: Patrick Donnelly <pdonnell(a)redhat.com>
Sent: Wednesday, May 24, 2023 11:42 PM
To: Justin Li <justin.li(a)deakin.edu.au>
Cc: ceph-users(a)ceph.io
Subject: Re: [ceph-users] [Help appreciated] ceph mds damaged
Hello Justin,
Please do:
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1
Then wait for a crash. Please upload the log.
To restore your file system:
ceph config set mds mds_abort_on_newly_corrupt_dentry false
Let the MDS purge the strays and then try:
ceph config set mds mds_abort_on_newly_corrupt_dentry true
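The steps above can be sketched as two small shell helpers. The function names and the CEPH override variable are illustrative assumptions; the ceph config set commands themselves are exactly the ones quoted above.

```shell
# Sketch only: wrap the config changes quoted above.
# CEPH is a hypothetical override for dry runs (for example CEPH=echo).
CEPH="${CEPH:-ceph}"

# Raise MDS logging so the next crash leaves a detailed log.
enable_mds_debug() {
    "$CEPH" config set mds debug_mds 20 &&
    "$CEPH" config set mds debug_ms 1
}

# Toggle the abort-on-corrupt-dentry check; accepts only true or false.
set_abort_on_corrupt_dentry() {
    case "${1:-}" in
        true|false)
            "$CEPH" config set mds mds_abort_on_newly_corrupt_dentry "$1"
            ;;
        *)
            echo "expected true or false, got: ${1:-}" >&2
            return 2
            ;;
    esac
}
```

The intended use is to call set_abort_on_corrupt_dentry false, let the MDS purge the strays, and then call set_abort_on_corrupt_dentry true so the check is active again. How long the purge takes is not stated in the thread, so the wait in between is left manual here.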
On Tue, May 23, 2023 at 7:04 PM Justin Li <justin.li(a)deakin.edu.au> wrote:
Hi Patrick,
Sorry to keep bothering you, but I found that the MDS service keeps crashing even though
the cluster shows the MDS as up. I have linked another log from the MDS server eowyn below.
I look forward to any further insights. Thanks a lot.
https://drive.google.com/file/d/1nD_Ks7fNGQp0GE5Q_x8M57HldYurPhuN/view?usp=sharing
MDS crashed:
root@eowyn:~# systemctl status ceph-mds@eowyn
● ceph-mds@eowyn.service - Ceph metadata server daemon
Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; vendor preset: enabled)
Active: failed (Result: signal) since Wed 2023-05-24 08:55:12 AEST; 24s ago
Process: 44349 ExecStart=/usr/bin/ceph-mds -f --cluster ${CLUSTER} --id eowyn --setuser ceph --setgroup ceph (code=kill>
Main PID: 44349 (code=killed, signal=ABRT)
May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Scheduled restart job, restart counter is at 3.
May 24 08:55:12 eowyn systemd[1]: Stopped Ceph metadata server daemon.
May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Start request repeated too quickly.
May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Failed with result 'signal'.
May 24 08:55:12 eowyn systemd[1]: Failed to start Ceph metadata server daemon.
Part of MDS log on eowyn (MDS server):
-2> 2023-05-24T08:55:11.854+1000 7f1f8ee93700 -1 log_channel(cluster) log [ERR] :
MDS abort because newly corrupt dentry to be committed: [dentry #0x100/stray0/1005480d3ac
[19ce,head] auth (dversion lock) pv=2154265085 v=2154265074 ino=0x1005480d3ac
state=1342177316 | purging=1 0x55b04517ca00]
-1> 2023-05-24T08:55:11.858+1000 7f1f8ee93700 -1
/build/ceph-16.2.13/src/mds/CDentry.cc: In function 'bool
CDentry::check_corruption(bool)' thread 7f1f8ee93700 time
2023-05-24T08:55:11.858329+1000
/build/ceph-16.2.13/src/mds/CDentry.cc: 697: ceph_abort_msg("abort()
called")
ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e)
pacific (stable)
1: (ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0xe0) [0x7f1f99404495]
2: (CDentry::check_corruption(bool)+0x86b) [0x55b02652991b]
3: (StrayManager::_purge_stray_purged(CDentry*, bool)+0xc64)
[0x55b026480ed4]
4: (MDSContext::complete(int)+0x61) [0x55b026601471]
5: (MDSIOContextBase::complete(int)+0x4fc) [0x55b026601b9c]
6: (Finisher::finisher_thread_entry()+0x19d) [0x7f1f994b8c6d]
7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f1f99146609]
8: clone()
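To see which dentries the MDS reports as corrupt without reading the whole log, a filter along these lines may help. It is a sketch under the assumption that each log entry sits on a single line in the real log file (the excerpts in this thread are wrapped by the mail client); corrupt_dentries is an illustrative name.

```shell
# Sketch only: print the unique dentry paths an MDS log reports as corrupt.
# Reads log text on stdin and assumes one log entry per line.
corrupt_dentries() {
    grep -E 'corrupt dentry' |
    grep -oE '\[dentry #[^ ]+' |
    sed 's/^\[dentry //' |
    sort -u
}
```

It would typically be fed the MDS log file on stdin, for example corrupt_dentries < /path/to/mds.log, where the path is whatever the deployment uses.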
-----Original Message-----
From: Justin Li
Sent: Wednesday, May 24, 2023 8:25 AM
To: Patrick Donnelly <pdonnell(a)redhat.com>
Cc: ceph-users(a)ceph.io
Subject: RE: [ceph-users] [Help appreciated] ceph mds damaged
Sorry Patrick, my last email was blocked because of the attachment size. Here is a link
where you can download the log instead. Thanks.
https://drive.google.com/drive/folders/1bV_X7vyma_-gTfLrPnEV27QzsdmgyK4g?usp=sharing
-----Original Message-----
From: Justin Li
Sent: Wednesday, May 24, 2023 8:21 AM
To: Patrick Donnelly <pdonnell(a)redhat.com>
Cc: ceph-users(a)ceph.io
Subject: RE: [ceph-users] [Help appreciated] ceph mds damaged
Hi Patrick,
I have attached two logs. The two servers they come from are among the monitor and MDS
hosts. Let me know if you need more logs. Thanks.
-----Original Message-----
From: Patrick Donnelly <pdonnell(a)redhat.com>
Sent: Wednesday, May 24, 2023 7:35 AM
To: Justin Li <justin.li(a)deakin.edu.au>
Cc: ceph-users(a)ceph.io
Subject: Re: [ceph-users] [Help appreciated] ceph mds damaged
Hello Justin,
On Tue, May 23, 2023 at 4:55 PM Justin Li <justin.li(a)deakin.edu.au> wrote:
Dear All,
After an unsuccessful upgrade to Pacific, the MDS daemons went offline and could not be
brought back up. I checked the MDS log and found the entries below; cluster info follows as
well. I would appreciate it if anyone could point me in the right direction. Thanks.
MDS log:
2023-05-24T06:21:36.831+1000 7efe56e7d700 1 mds.0.cache.den(0x600
1005480d3b2) loaded already corrupt dentry: [dentry
#0x100/stray0/1005480d3b2 [19ce,head] rep@0,-2.0
NULL (dversion lock) pv=0 v=2154265030 ino=(nil) state=0
0x556433addb80]
-5> 2023-05-24T06:21:36.831+1000 7efe56e7d700 -1 mds.0.damage
notify_dentry Damage to dentries in fragment * of ino 0x600is fatal
because it is a system directory for this rank
-4> 2023-05-24T06:21:36.831+1000 7efe56e7d700 5
mds.beacon.posco
set_want_state: up:active -> down:damaged
-3> 2023-05-24T06:21:36.831+1000 7efe56e7d700 5
mds.beacon.posco Sending beacon down:damaged seq 5339
-2> 2023-05-24T06:21:36.831+1000 7efe56e7d700 10 monclient:
_send_mon_message to mon.ceph-3 at v2:10.120.0.146:3300/0
-1> 2023-05-24T06:21:37.659+1000 7efe60690700 5
mds.beacon.posco received beacon reply down:damaged seq 5339 rtt
0.827966
0> 2023-05-24T06:21:37.659+1000 7efe56e7d700 1 mds.posco respawn!
Cluster info:
root@ceph-1:~# ceph -s
  cluster:
    id:     e2b93a76-2f97-4b34-8670-727d6ac72a64
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
  services:
    mon: 3 daemons, quorum ceph-1,ceph-2,ceph-3 (age 26h)
    mgr: ceph-3(active, since 15h), standbys: ceph-1, ceph-2
    mds: 0/1 daemons up, 3 standby
    osd: 135 osds: 133 up (since 10h), 133 in (since 2w)
  data:
    volumes: 0/1 healthy, 1 recovering; 1 damaged
    pools:   4 pools, 4161 pgs
    objects: 230.30M objects, 276 TiB
    usage:   836 TiB used, 460 TiB / 1.3 PiB avail
    pgs:     4138 active+clean
             13   active+clean+scrubbing
             10   active+clean+scrubbing+deep
root@ceph-1:~# ceph health detail
HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs cephfs is offline because no MDS is active for it.
[ERR] MDS_DAMAGE: 1 mds daemon damaged
    fs cephfs mds.0 is damaged
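When this output gets long, the error-level checks can be pulled out with a one-line filter. This is a sketch that assumes the plain-text layout shown above, with each check code at the start of its own line; health_errors is an illustrative name.

```shell
# Sketch only: list the check codes reported at [ERR] level.
# Reads `ceph health detail` plain-text output on stdin.
health_errors() {
    sed -n 's/^\[ERR\] \([A-Z_]*\):.*/\1/p'
}
```

Piping the output of ceph health detail into health_errors would, for the report above, leave only the MDS_ALL_DOWN and MDS_DAMAGE codes.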
Do you have a complete log you can share? Try:
https://docs.ceph.com/en/quincy/man/8/ceph-post-file/
To get your upgrade to complete, you may set:
ceph config set mds mds_go_bad_corrupt_dentry false
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D