Hi list (and cephfs devs :-)),
On 2020-04-29 17:43, Jake Grimmett wrote:
...the "mdsmap_decode" errors stopped
suddenly on all our clients...
Not exactly sure what the problem was, but restarting our standby mds daemons seems to have been the fix.
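For reference, on a systemd-based deployment that restart boils down to something like the commands below; the unit name (here the standby host ceph-s2) depends on how the MDS daemons were deployed, so treat this as a sketch rather than exact commands:

# on the standby MDS host, bounce only the standby daemon
systemctl restart ceph-mds@ceph-s2
# then check that it comes back as standby / standby-replay
ceph fs status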
Here's the log from the standby mds at exactly the moment the errors stopped:
2020-04-29 15:41:22.944 7f3d04e06700 1 mds.ceph-s2 Map has assigned me to become a standby
2020-04-29 15:43:05.621 7f3d04e06700 1 mds.ceph-s2 Updating MDS map to version 394712 from mon.0
2020-04-29 15:43:05.623 7f3d04e06700 1 mds.0.0 handle_mds_map i am now mds.34541673.0 replaying mds.0.0
2020-04-29 15:43:05.623 7f3d04e06700 1 mds.0.0 handle_mds_map state change up:boot --> up:standby-replay
2020-04-29 15:43:05.623 7f3d04e06700 1 mds.0.0 replay_start
2020-04-29 15:43:05.623 7f3d04e06700 1 mds.0.0 recovery set is
2020-04-29 15:43:05.655 7f3cfe5f9700 0 mds.0.cache creating system inode with ino:0x100
2020-04-29 15:43:05.656 7f3cfe5f9700 0 mds.0.cache creating system inode with ino:0x1
So, we are getting HEALTH_WARN on our cluster because of this issue.
Cluster: 13.2.8
Client: cephfs kernel client 5.7.9-050709-generic with 13.2.10 (Ubuntu 18.04)
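For completeness, this is roughly how the versions of the cluster and the connected clients can be double-checked (output format varies a bit per release):

ceph versions        # release reported by each daemon
ceph features        # feature bits / releases of connected clients
ceph health detail   # which daemon is raising the warnings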
The standby mds, and only the standby, is logging the following:
2020-08-27 06:25:01.086 7efc10cad700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 21705) UID: 0
2020-08-27 08:42:25.340 7efc0d2be700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.497840 secs
2020-08-27 08:42:25.340 7efc0d2be700 0 log_channel(cluster) log [WRN] : slow request 30.497839 seconds old, received at 2020-08-27 08:41:54.847218: client_request(client.133487514:37390263 getattr AsLsXsFs #0x10050572c4e 2020-08-27 08:41:54.840824 caller_uid=3860, caller_gid=3860{}) currently failed to rdlock, waiting
2020-08-27 11:06:55.492 7efc0d2be700 0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 64.583827 seconds ago
2020-08-27 11:07:55.502 7efc0d2be700 0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 124.593098 seconds ago
2020-08-27 11:09:55.561 7efc0d2be700 0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 244.651434 seconds ago
2020-08-27 11:13:55.505 7efc0d2be700 0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 484.596083 seconds ago
2020-08-27 11:21:55.500 7efc0d2be700 0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 964.592686 seconds ago
On the clients we get the "mdsmap_decode got incorrect state(up:standby-replay)" messages at exactly the times mds2 is logging the warnings above. There is no such logging on the active mds. I would expect exactly the opposite. Why is the standby mds logging this?
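The client-side message comes from the kernel cephfs client, so it shows up in the kernel log; correlating the timestamps is just a matter of something like the following on a client (the grep pattern is only an assumption of what to match):

dmesg -T | grep -i mdsmap
# or, on systemd hosts:
journalctl -k | grep -i mdsmap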
Sometimes the "client.$id isn't responding to mclientcaps(revoke)" warnings resolve themselves, but that can also take a considerable amount of time.
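For reference, the session that is holding on to the caps can be inspected via the admin socket on the MDS that reports the warning; the daemon name below is just an example and needs to match your own setup:

# list client sessions and look for client.134430768
ceph daemon mds.ceph-s2 session ls
# the same warning is also visible cluster-wide
ceph health detail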
Of course, I could restart the standby mds ... but that's not my first choice. If this is a software defect, I would like to get it fixed.
Gr. Stefan