Hi Ilya,
On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov <idryomov(a)gmail.com> wrote:
On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster <dan(a)vanderster.com> wrote:
Hi all,
We are quite regularly (a couple times per week) seeing:
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond to capability release client_id: 52919162
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs
This is caused by osdc ops stuck in a kernel client, e.g.:
10:57:18 root hpc-be028 /root
→ cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
REQUESTS 9 homeless 0
46559317 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559322 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read
46559323 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559341 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559342 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559345 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read
46559621 osd243 3.6313e8ef 3.8ef [243,330,521]/243 [243,330,521]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a 0x400014 1 read
46559629 osd243 3.b280c852 3.852 [243,113,539]/243 [243,113,539]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f 0x400014 1 read
46559928 osd243 3.1ee7bab4 3.ab4 [243,332,94]/243 [243,332,94]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f 0x400024 1 write
LINGER REQUESTS
BACKOFFS
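For context, this is roughly how we spot them: a single read of osdc only shows ops currently in flight, so we take two samples and flag tids that survive both (untested sketch; assumes debugfs is mounted at /sys/kernel/debug, run as root on the client, and treats "still there after 60s" as stuck):

    # Untested sketch: report osdc tids present in two samples taken 60s apart.
    # The awk stops at the LINGER section so linger ids are not counted.
    for f in /sys/kernel/debug/ceph/*/osdc; do
        awk '/^LINGER/ {exit} $1 ~ /^[0-9]+$/ {print $1}' "$f" | sort > /tmp/osdc.before
        sleep 60
        awk '/^LINGER/ {exit} $1 ~ /^[0-9]+$/ {print $1}' "$f" | sort > /tmp/osdc.after
        comm -12 /tmp/osdc.before /tmp/osdc.after | sed "s|^|$f: stuck tid |"
    done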
We can unblock those requests by doing `ceph osd down osd.243` (or
restarting osd.243).
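Concretely (the fsid and client id are the ones from the dump above):

    ceph osd down osd.243    # mark it down so the client resends its ops
    # then confirm the stuck requests are gone on the client:
    cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc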
This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64.
Is there a better way to debug this?
Hi Dan,
I assume that these ops don't show up as slow requests on the OSD side?
How long did you see it stuck for before intervening?
That's correct -- the OSD had no active ops (ceph daemon.... ops).
The slow request behind the late-release warning had been stuck for 4129s before we intervened.
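For completeness, what we ran on the host holding osd.243 (the historic dumps are just in case something relevant had already drained by the time we looked):

    ceph daemon osd.243 ops                     # in-flight ops -- was empty
    ceph daemon osd.243 dump_historic_ops       # recently completed ops
    ceph daemon osd.243 dump_historic_slow_ops  # recent ops over the slow threshold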
Nope, but I can try to capture that next time. (Though you'd need it
from the moment the ops get stuck, not just from when we notice them,
right?)
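If it helps, I could leave something like this running on the OSD host so we have the OSD-side state from the moment it happens (untested sketch; the interval, output path, and retention are made up):

    # Untested sketch: snapshot in-flight ops every 30s, keep ~24h of samples.
    while true; do
        ts=$(date +%Y%m%d-%H%M%S)
        ceph daemon osd.243 ops > /var/log/ceph/osd.243-ops-$ts.json 2>&1
        find /var/log/ceph -name 'osd.243-ops-*.json' -mmin +1440 -delete
        sleep 30
    done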