One of my colleagues attempted to set quotas for a large number (some
dozens) of users with the command below, but it caused the MDS to hang and
reject client requests.
Offending command was:
cat recent-users | xargs -P16 -I% setfattr -n ceph.quota.max_bytes -v 8796093022208 /scratch/%
Result was to hang /scratch and any other mounts managed by the same MDS on
all clients.
Status of ceph-mds while broken was:
root@cnx-14:~# systemctl status ceph-mds@cnx-14
● ceph-mds@cnx-14.service - Ceph metadata server daemon
     Loaded: loaded (/lib/systemd/system/ceph-mds@.service; indirect; vendor preset: enabled)
     Active: active (running) since Thu 2021-05-06 17:16:45 AEST; 1 weeks 3 days ago
   Main PID: 2385 (ceph-mds)
      Tasks: 23
     CGroup: /system.slice/system-ceph\x2dmds.slice/ceph-mds@cnx-14.service
             └─2385 /usr/bin/ceph-mds -f --cluster ceph --id cnx-14 --setuser ceph --setgroup ceph

May 13 06:25:01 cnx-14 ceph-mds[2385]: 2021-05-13T06:25:01.724+1000 7f5444832700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-m
May 13 06:25:01 cnx-14 ceph-mds[2385]: 2021-05-13T06:25:01.736+1000 7f5444832700 -1 received signal: Hangup from (PID: 229281) UID: 0
May 14 06:25:01 cnx-14 ceph-mds[2385]: 2021-05-14T06:25:01.992+1000 7f5444832700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-m
May 14 06:25:02 cnx-14 ceph-mds[2385]: 2021-05-14T06:25:02.004+1000 7f5444832700 -1 received signal: Hangup from (PID: 232464) UID: 0
May 15 06:25:01 cnx-14 ceph-mds[2385]: 2021-05-15T06:25:01.468+1000 7f5444832700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-m
May 15 06:25:01 cnx-14 ceph-mds[2385]: 2021-05-15T06:25:01.480+1000 7f5444832700 -1 received signal: Hangup from (PID: 236005) UID: 0
May 16 06:25:01 cnx-14 ceph-mds[2385]: 2021-05-16T06:25:01.989+1000 7f5444832700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-m
May 16 06:25:02 cnx-14 ceph-mds[2385]: 2021-05-16T06:25:02.001+1000 7f5444832700 -1 received signal: Hangup from (PID: 239260) UID: 0
May 17 06:25:01 cnx-14 ceph-mds[2385]: 2021-05-17T06:25:01.813+1000 7f5444832700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-m
May 17 06:25:01 cnx-14 ceph-mds[2385]: 2021-05-17T06:25:01.829+1000 7f5444832700 -1 received signal: Hangup from (PID: 242044) UID: 0
Fix was to run:
systemctl restart ceph-mds@cnx-14
A non-parallelised run of xargs, with a sleep 1 between iterations, worked.
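For anyone wanting to reproduce the serial run, something along these lines should be equivalent. This is only a sketch: the exact command we used is not recorded above, and the function name set_quotas_serially and the DELAY variable are made up for illustration. The attribute name, the value and the /scratch prefix are taken from the offending command.

```shell
#!/bin/sh
# Serial variant of the quota run: one setfattr call at a time,
# with a pause between calls, instead of 16 in parallel.
# QUOTA_BYTES and /scratch come from the original command;
# set_quotas_serially and DELAY are hypothetical names.
QUOTA_BYTES=8796093022208   # 8 TiB (8 * 2^40 bytes)
DELAY=${DELAY:-1}           # seconds to wait between users

set_quotas_serially() {
    # $1: file containing one user name per line
    while IFS= read -r user; do
        [ -n "$user" ] || continue      # skip blank lines
        setfattr -n ceph.quota.max_bytes -v "$QUOTA_BYTES" "/scratch/$user" \
            || echo "setfattr failed for $user" >&2
        sleep "$DELAY"
    done < "$1"
}

# Example invocation (commented out so that sourcing this file changes nothing):
# set_quotas_serially recent-users
```

Whether the 1 second pause is actually needed, or whether dropping -P16 alone would have been enough, was not tested.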