On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster <dan(a)vanderster.com> wrote:
Hi all,
We are quite regularly (a couple of times per week) seeing:
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond to capability release client_id: 52919162
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs
This is caused by osdc ops stuck in a kernel client, e.g.:
10:57:18 root hpc-be028 /root
→ cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
REQUESTS 9 homeless 0
46559317 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559322 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559323 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559341 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559342 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559345 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559621 osd243 3.6313e8ef 3.8ef [243,330,521]/243 [243,330,521]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a 0x400014 1 read
46559629 osd243 3.b280c852 3.852 [243,113,539]/243 [243,113,539]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f 0x400014 1 read
46559928 osd243 3.1ee7bab4 3.ab4 [243,332,94]/243 [243,332,94]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f 0x400024 1 write
LINGER REQUESTS
BACKOFFS
We can unblock those requests by doing `ceph osd down osd.243` (or
restarting osd.243).
This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64.
Is there a better way to debug this?
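
For reading dumps like the osdc output above, a small helper can tally the in-flight requests per OSD, which makes it quick to see that everything is queued on a single OSD (osd.243 here). This is only a minimal sketch: the function name count_requests_per_osd is hypothetical, and it assumes each request entry begins with "<tid> osd<N> ..." between the REQUESTS and LINGER REQUESTS headers, as in the dump above; the layout may differ on other kernels.

```python
import re
from collections import Counter

# Assumption: every in-flight request starts with "<tid> osd<N> ...",
# as seen in the osdc dump in this thread. Wrapped continuation lines
# (PG mappings, object names, flags) do not match and are ignored.
REQUEST_START = re.compile(r"^\s*(\d+)\s+osd(\d+)\s")


def count_requests_per_osd(osdc_text):
    """Return a Counter mapping OSD id -> number of in-flight requests."""
    counts = Counter()
    in_requests = False
    for line in osdc_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("REQUESTS"):
            # Start of the in-flight request section.
            in_requests = True
            continue
        if stripped.startswith("LINGER REQUESTS"):
            # End of the section we care about.
            break
        if in_requests:
            match = REQUEST_START.match(line)
            if match:
                counts[int(match.group(2))] += 1
    return counts
```

Feeding it the contents of the client's osdc file (read however is convenient) and printing counts.most_common() lists the OSDs with the most outstanding requests first.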
Hi Dan,
I assume that these ops don't show up as slow requests on the OSD side?
How long did you see it stuck for before intervening?
Do you happen to have "debug ms = 1" logs from osd243?
Do you have PG autoscaler enabled? Any PG splits and/or merges at the
time?
Thanks,
Ilya