On Thu, Feb 23, 2023 at 3:31 PM Kuhring, Mathias
<mathias.kuhring(a)bih-charite.de> wrote:
> Hey Ilya,
>
> I'm not sure if the things I find in the logs are actually related or
> useful, but I'm also not sure I'm looking in the right places.
>
> I enabled "debug_ms 1" for the OSDs as suggested above. But this filled
> up our host disks pretty fast, leading to monitors crashing, among other
> things. So I disabled the debug messages again and trimmed the logs to
> free up space.
>
> But I made copies of two OSD log files which were involved in another
> capability release / slow requests issue. They are quite big now (~3 GB),
> and even after removing things like the ping messages I still have more
> than a million lines just for the morning until the disk space was full
> (around 7 hours). So now I'm wondering how to filter for the right
> things here.
>
> When I grep for "error", I get a few of these messages:
>
{"log":"debug 2023-02-22T06:18:08.113+0000 7f15c5fff700 1 --
[v2:192.168.1.13:6881/4149819408,v1:192.168.1.13:6884/4149819408] \u003c== osd.161
v2:192.168.1.31:6835/1012436344 182573 ==== pg_update_log_missing(3.1a6s2 epoch
646235/644895 rep_tid 1014320 entries 646235'7672108 (0'0) error
3:65836dde:::10016e9b7c8.00000000:head by mds.0.1221974:8515830 0.000000 -2
ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0
trim_to 646178'7662340 roll_forward_to 646192'7672106) v3 ==== 261+0+0 (crc 0 0 0)
0x562d55e52380 con
0x562d8a2de400\n","stream":"stderr","time":"2023-02-22T06:18:08.115002765Z"}
>
> And if I grep for "failed", I get a couple of those:
>
{"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700 1 --
[v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e
172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 msgr2=0x55b9ce07e580 crc :-1
s=STATE_CONNECTION_ESTABLISHED l=1).read_until read
failed\n","stream":"stderr","time":"2023-02-22T06:15:25.243808392Z"}
{"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700 1 --2-
[v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e
172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 0x55b9ce07e580 crc :-1 s=READY pgs=2096664
cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read
frame preamble failed r=-1 ((1) Operation not
permitted)\n","stream":"stderr","time":"2023-02-22T06:15:25.243813528Z"}
>
> Not sure if they are related to the issue. In the kernel logs of the
> client (dmesg, journalctl or /var/log/messages), there seem to be no
> errors or stack traces in the relevant time periods.

Hi Mathias,

Then it's very unlikely to be a kernel client issue, meaning that you
don't need to worry about your kernel versions.
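
If you need debug_ms output again at some point, one way to avoid
filling up the host disks is to raise it only on the OSDs involved and
only for the reproduction window, instead of cluster-wide. A rough
sketch, assuming a release where "ceph tell ... config set" is
available (older clusters would use injectargs instead):

# raise message debugging only on the two OSDs of interest
ceph tell osd.90 config set debug_ms 1
ceph tell osd.93 config set debug_ms 1

# ... reproduce the issue ...

# and turn it back off afterwards (adjust if your default differs)
ceph tell osd.90 config set debug_ms 0
ceph tell osd.93 config set debug_ms 0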
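
As for filtering the big logs, I would strip the ping/heartbeat chatter
first and then narrow down to a time window and keywords of interest.
Something along these lines ("osd.90.log" is just a stand-in for your
actual file name, and the patterns are only a starting point):

# drop ping noise, keep one hour of the morning, then look for errors
grep -Ev 'osd_ping|heartbeat' osd.90.log \
    | grep '2023-02-22T06:' \
    | grep -E 'error|failed|slow request'
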
> The only thing I can see is our restart of the relevant OSDs:
>
> [Mi Feb 22 07:29:59 2023] libceph: osd90 down
> [Mi Feb 22 07:30:34 2023] libceph: osd90 up
> [Mi Feb 22 07:31:55 2023] libceph: osd93 down
> [Mi Feb 22 08:37:50 2023] libceph: osd93 up
>
> I also noticed a socket closed message for another client, but I assume
> that's more related to the monitors failing due to the full disks:
>
> [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 socket closed (con state OPEN)
> [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 session lost, hunting for new mon
> [Mi Feb 22 05:59:52 2023] libceph: mon3 (1)172.16.62.13:6789 session established

Yeah, these are expected when a monitor or an OSD goes down.
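
For future incidents, it's usually enough to pull just the libceph/ceph
lines from the kernel log around the window in question, for example:

# kernel messages only, limited to the incident window
journalctl -k --since "2023-02-22 05:00" --until "2023-02-22 09:00" \
    | grep -E 'libceph|ceph:'
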
Thanks,
Ilya