On Thu, Feb 23, 2023 at 3:31 PM Kuhring, Mathias
<mathias.kuhring(a)bih-charite.de> wrote:
> Hey Ilya,
>
> I'm not sure if the things I find in the logs are actually related or
> useful, but I'm also not sure I'm looking in the right places.
>
> I enabled "debug_ms 1" for the OSDs as suggested above. But this filled
> up our host disks pretty fast, leading to monitors crashing, among other
> things. So I disabled the debug messages again and trimmed the logs to
> free up space.
>
> But I made copies of two OSD log files which were involved in another
> capability release / slow requests issue. They are quite big now (~3 GB),
> and even after removing things like the ping messages I still have more
> than a million lines just for the morning until the disk space was full
> (around 7 hours). So now I'm wondering how to filter for the right
> things here.
>
> When I grep for "error", I get a few of these messages:
>
{"log":"debug 2023-02-22T06:18:08.113+0000 7f15c5fff700 1 --
[v2:192.168.1.13:6881/4149819408,v1:192.168.1.13:6884/4149819408] \u003c== osd.161
v2:192.168.1.31:6835/1012436344 182573 ==== pg_update_log_missing(3.1a6s2 epoch
646235/644895 rep_tid 1014320 entries 646235'7672108 (0'0) error
3:65836dde:::10016e9b7c8.00000000:head by mds.0.1221974:8515830 0.000000 -2
ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0
trim_to 646178'7662340 roll_forward_to 646192'7672106) v3 ==== 261+0+0 (crc 0 0 0)
0x562d55e52380 con
0x562d8a2de400\n","stream":"stderr","time":"2023-02-22T06:18:08.115002765Z"}
>
> And if I grep for "failed", I get a couple of those:
>
{"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700 1 --
[v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e
172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 msgr2=0x55b9ce07e580 crc :-1
s=STATE_CONNECTION_ESTABLISHED l=1).read_until read
failed\n","stream":"stderr","time":"2023-02-22T06:15:25.243808392Z"}
{"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700 1 --2-
[v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e
172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 0x55b9ce07e580 crc :-1 s=READY pgs=2096664
cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read
frame preamble failed r=-1 ((1) Operation not
permitted)\n","stream":"stderr","time":"2023-02-22T06:15:25.243813528Z"}
>
> Not sure if they are related to the issue. In the kernel logs of the
> client (dmesg, journalctl or /var/log/messages), there seem to be no
> errors or stack traces in the relevant time periods.

Hi Mathias,

Then it's very unlikely to be a kernel client issue, meaning that you
don't need to worry about your kernel versions.
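
If you need debug_ms output again at some point, one way to avoid
filling up the host disks is to raise it only on the OSDs involved and
only for the reproduction window, instead of cluster-wide. A rough
sketch, assuming a release where "ceph tell ... config set" is
available (older clusters would use injectargs instead):

# raise message debugging only on the two OSDs of interest
ceph tell osd.90 config set debug_ms 1
ceph tell osd.93 config set debug_ms 1

# ... reproduce the issue ...

# and turn it back off afterwards (adjust if your default differs)
ceph tell osd.90 config set debug_ms 0
ceph tell osd.93 config set debug_ms 0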
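
As for filtering the big logs, I would strip the ping/heartbeat chatter
first and then narrow down to a time window and keywords of interest.
Something along these lines ("osd.90.log" is just a stand-in for your
actual file name, and the patterns are only a starting point):

# drop ping noise, keep one hour of the morning, then look for errors
grep -Ev 'osd_ping|heartbeat' osd.90.log \
    | grep '2023-02-22T06:' \
    | grep -E 'error|failed|slow request'
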
> The only thing I can see is our restart of the relevant OSDs:
>
> [Mi Feb 22 07:29:59 2023] libceph: osd90 down
> [Mi Feb 22 07:30:34 2023] libceph: osd90 up
> [Mi Feb 22 07:31:55 2023] libceph: osd93 down
> [Mi Feb 22 08:37:50 2023] libceph: osd93 up
>
> I also noticed a socket closed message for another client, but I assume
> that's more related to the monitors failing due to the full disks:
>
> [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 socket closed (con state OPEN)
> [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 session lost, hunting for new mon
> [Mi Feb 22 05:59:52 2023] libceph: mon3 (1)172.16.62.13:6789 session established

Yeah, these are expected when a monitor or an OSD goes down.
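
For future incidents, it's usually enough to pull just the libceph/ceph
lines from the kernel log around the window in question, for example:

# kernel messages only, limited to the incident window
journalctl -k --since "2023-02-22 05:00" --until "2023-02-22 09:00" \
    | grep -E 'libceph|ceph:'
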
Thanks,
Ilya