Hello Frank,
On Tue, Aug 22, 2023 at 11:42 AM Frank Schilder <frans(a)dtu.dk> wrote:
Hi all,
I have been seeing this warning all day (latest Octopus cluster):
HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time
[WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release
    mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301
    mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877
    mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887
    mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695
If I look at the session info from mds.1 for these clients I see this:
# ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}] | sort_by(.caps) | .[]' | grep -e 145698301 -e 189511877 -e 189511887 -e 231250695
{"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
{"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
{"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
{"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
We have mds_min_caps_per_client=4096, so these clients hold far fewer caps than that
minimum. Also, the file system is pretty idle at the moment.
Why and what exactly is the MDS complaining about here?
These days, you'll generally see this because the client is "quiet"
and the MDS is opportunistically recalling caps, so that it has less
work to do later if it needs to shrink its cache. Two things would
indicate this:
* The MDS is not complaining about an oversized cache.
* The session listing shows the session is quiet (the
"session_cache_liveness" is near 0).
However, the MDS should respect mds_min_caps_per_client by (a) not
recalling more caps than mds_min_caps_per_client and (b) not
complaining about a quiet client whose cap count is already below
mds_min_caps_per_client.
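
As a rough illustration of that expectation, here is a minimal Python sketch that scans saved `ceph tell mds.X session ls` JSON for warned clients that are quiet and already hold fewer caps than mds_min_caps_per_client. The `num_caps` field appears in the listing above; the shape of `session_cache_liveness` (a plain number or an object with a `value` key) and the `QUIET_LIVENESS` cutoff are assumptions made for this example, not MDS settings.

```python
import json

MIN_CAPS_PER_CLIENT = 4096   # mds_min_caps_per_client from the report above
QUIET_LIVENESS = 1.0         # hypothetical cutoff for "near 0"; not an MDS option


def liveness(session):
    """Return session_cache_liveness as a float, or None if the field is absent."""
    raw = session.get("session_cache_liveness")
    if isinstance(raw, dict):          # assumed shape: {"value": ..., ...}
        raw = raw.get("value")
    return None if raw is None else float(raw)


def unexpected_warnings(sessions, warned_ids, min_caps=MIN_CAPS_PER_CLIENT):
    """Pick warned sessions that are quiet and hold fewer caps than min_caps."""
    hits = []
    for s in sessions:
        if s.get("id") not in warned_ids:
            continue
        live = liveness(s)
        quiet = live is not None and live < QUIET_LIVENESS
        caps = s.get("num_caps", 0)
        if quiet and caps < min_caps:
            hits.append({"id": s["id"], "num_caps": caps, "liveness": live})
    # Lowest cap counts first, like the sort_by(.caps) in the jq filter above.
    return sorted(hits, key=lambda h: h["num_caps"])


if __name__ == "__main__":
    # Sample shaped like the listing above; the liveness values are made up.
    sample = json.loads("""[
      {"id": 189511887, "num_caps": 2, "session_cache_liveness": {"value": 0.0}},
      {"id": 145698301, "num_caps": 7, "session_cache_liveness": {"value": 0.0}},
      {"id": 1, "num_caps": 90000, "session_cache_liveness": {"value": 512.3}}
    ]""")
    for hit in unexpected_warnings(sample, {189511887, 145698301}):
        print(hit)
```

Any session this returns for a client named in MDS_CLIENT_LATE_RELEASE matches the pattern described above and is worth including in a bug report.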
So, you may have found a bug. The next time this happens, the output
of `ceph tell mds.X config diff` and `ceph tell mds.X perf dump`, plus
a selection of the relevant `ceph tell mds.X session ls` entries,
would help debug this, I think.
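
One way to gather those three outputs in a single pass is sketched below in Python. It only wraps the commands named above; the helper names (`diagnostic_commands`, `select_sessions`, `collect`) are hypothetical, and the sketch assumes `ceph` is on the PATH and that `session ls` prints a JSON array on stdout.

```python
import json
import subprocess


def diagnostic_commands(mds):
    """Build the three `ceph tell` invocations for one MDS (e.g. mds="1")."""
    target = f"mds.{mds}"
    return {
        "config_diff": ["ceph", "tell", target, "config", "diff"],
        "perf_dump": ["ceph", "tell", target, "perf", "dump"],
        "session_ls": ["ceph", "tell", target, "session", "ls"],
    }


def select_sessions(sessions, client_ids):
    """Keep only the session entries for the clients named in the warning."""
    wanted = set(client_ids)
    return [s for s in sessions if s.get("id") in wanted]


def collect(mds, client_ids, run=subprocess.run):
    """Run each command; keep raw text, but trim session ls to client_ids."""
    out = {}
    for name, argv in diagnostic_commands(mds).items():
        proc = run(argv, capture_output=True, text=True, check=True)
        out[name] = proc.stdout
    # Assumption: session ls output parses as a JSON list of session objects.
    sessions = json.loads(out["session_ls"])
    out["session_ls"] = json.dumps(select_sessions(sessions, client_ids), indent=2)
    return out


if __name__ == "__main__":
    # Only print the commands here; call collect() on a node with ceph access.
    for name, argv in diagnostic_commands("1").items():
        print(name, " ".join(argv))
```

For the report above, `collect("1", [145698301, 189511877, 189511887, 231250695])` would return the config diff and perf dump as text, plus only the four affected session entries.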
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D