Hi Ilya,
On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov <idryomov(a)gmail.com> wrote:
On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster <dan(a)vanderster.com> wrote:
Hi all,
We are quite regularly (a couple of times per week) seeing:
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond to capability release client_id: 52919162
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs
Which is being caused by osdc ops stuck in a kernel client, e.g.:
10:57:18 root hpc-be028 /root
→ cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
REQUESTS 9 homeless 0
46559317  osd243  3.ee6ffcdb  3.cdb  [243,501,92]/243  [243,501,92]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057  0x400014  1  read
46559322  osd243  3.ee6ffcdb  3.cdb  [243,501,92]/243  [243,501,92]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057  0x400014  1  read
46559323  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
46559341  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
46559342  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
46559345  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
46559621  osd243  3.6313e8ef  3.8ef  [243,330,521]/243  [243,330,521]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a  0x400014  1  read
46559629  osd243  3.b280c852  3.852  [243,113,539]/243  [243,113,539]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f  0x400014  1  read
46559928  osd243  3.1ee7bab4  3.ab4  [243,332,94]/243  [243,332,94]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f  0x400024  1  write
LINGER REQUESTS
BACKOFFS
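For what it's worth, a quick way to see which OSD the stuck requests are piling up on is to summarize the osdc dump per target OSD. A minimal sketch, assuming the field layout shown in the dump above (field 2 of each request line is the OSD); the `summarize_osdc` helper name and the placeholder path are mine:

```shell
# Count in-flight osdc requests per target OSD from a kernel-client
# osdc debugfs dump (sketch; assumes field 2 is "osdNNN").
summarize_osdc() {
    awk '/^REQUESTS/ { inreq = 1; next }
         /^LINGER/   { inreq = 0 }
         inreq && $2 ~ /^osd[0-9]+$/ { count[$2]++ }
         END { for (o in count) print o, count[o] }' "$1"
}

# On the affected client, e.g.:
#   summarize_osdc /sys/kernel/debug/ceph/<fsid>.client<id>/osdc
```

In the dump above this would show all nine requests targeting osd243.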
We can unblock those requests by doing `ceph osd down osd.243` (or
restarting osd.243).
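As an aside, the short PG ids in the osdc dump are the raw object hash masked down to the pool's pg_num; every mapping above is consistent with pg_num=4096 on pool 3 (an inference from the dump, not a verified cluster fact). For example:

```shell
# Raw placement value 3.ee6ffcdb maps to PG 3.cdb when pg_num is 4096
# (pg_num inferred from the dump above):
printf '3.%x\n' $(( 0xee6ffcdb & (4096 - 1) ))
# → 3.cdb
```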
This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64.
Is there a better way to debug this?
Hi Dan,
I assume that these ops don't show up as slow requests on the OSD side?
How long did you see it stuck for before intervening?
That's correct -- the osd had no active ops (ceph daemon.... ops).
The late-release slow request had been stuck for 4129s before we intervened.
Do you happen to have "debug ms = 1" logs
from osd243?
Nope, but I can try to get it next time. (Though you'd need it from
the moment the ops get stuck, not just from when we notice them,
right?)
Do you have PG autoscaler enabled? Any PG splits
and/or merges at the time?
Not on the cephfs_(meta)data pools (though on the 30th I increased
the size of those pools from 2 to 3). Also on the 30th, I did some PG
merging on an unrelated test pool.
And anyway we have seen this type of lockup in the past, without those
pool changes (also with mimic MDS until we upgraded to nautilus).
Looking back further in the client's kernel log we see a page alloc
failure on the 30th:
Jan 30 16:16:35 hpc-be028.cern.ch kernel: kworker/1:36: page allocation failure: order:5, mode:0x104050
Jan 30 16:16:35 hpc-be028.cern.ch kernel: CPU: 1 PID: 78445 Comm: kworker/1:36 Kdump: loaded Tainted: P
Jan 30 16:16:35 hpc-be028.cern.ch kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
The machine is running hpc jobs, there is memory pressure and likely
fragmentation due to hugepages being enabled.
(But other instances of stuck ops didn't have any page alloc failures).
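For context on the order:5 failure above: that is a request for 2^5 physically contiguous pages, which with 4 KiB pages is 128 KiB of contiguous memory, exactly the kind of allocation that fails first when memory is fragmented even if plenty of free pages remain:

```shell
# order:5 means 2^5 contiguous pages; with 4 KiB pages that is:
echo "$(( (1 << 5) * 4096 / 1024 )) KiB"
# → 128 KiB
```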
Here is the client dmesg when we restarted the osd with stuck ops:
Jan 31 10:57:59 hpc-be028.cern.ch kernel: libceph: osd243 down
Jan 31 10:58:17 hpc-be028.cern.ch kernel: libceph: osd243 up
(There was nothing else adjacent in time).
Cheers, Dan
Thanks,
Ilya