Got a trace of the osd process, shortly after ceph status -w announced boot for the osd:
strace: Process 784735 attached
futex(0x5587c3e22fc8, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ exited with 1 +++
It was stuck at that one call for several minutes before exiting.
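For what it's worth, the futex wait above is presumably just the main thread parked while the worker threads do the actual work; if I capture this again, following all of the OSD's threads with something like the following should give a better picture (PID taken from the trace above, output file name arbitrary):

# follow all threads (-f), add microsecond timestamps (-tt), attach to the ceph-osd PID, log to a file
strace -f -tt -p 784735 -o /tmp/osd1.strace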
From: Stefan Wild <swild@tiltworks.com>
Date: Saturday, December 12, 2020 at 9:44 PM
To: "ceph-users(a)ceph.io" <ceph-users(a)ceph.io>
Subject: Re: OSD reboot loop after running out of memory
Just had another look at the logs, and this is what I noticed after the affected OSD
starts up.
Loads of entries of this sort:
Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug 2020-12-13T02:38:40.851+0000 7fafd32c7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fafb721f700' had timed out after 15
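As an aside, the "timed out after 15" looks like the default osd_op_thread_timeout of 15 seconds. If the op threads are merely slow after the OOM rather than wedged, temporarily raising that might be worth a try; this is just a guess on my part, not a confirmed fix:

# give the OSD op worker threads more headroom before the heartbeat map flags them; revert once stable
ceph config set osd osd_op_thread_timeout 60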
Then a few pages of this:
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9249> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9248> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9247> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9246> 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13024 pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9245> 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13026 pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9244> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9243> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9242> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9241> 2020-12-13T02:35:44.022+0000 7fafb721f700 1 osd.1 pg_epoch: 13143 pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9240> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9239> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9238> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9237> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9236> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9235> 2020-12-13T02:35:44.022+0000 7fafb521b700 1 osd.1 pg_epoch: 13143 pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
And this is where it crashes:
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9232> 2020-12-13T02:35:44.022+0000 7fafd02c1700 0 log_channel(cluster) log [DBG] : purged_snaps scrub starts
Dec 12 21:38:57 ceph-tpa-server1 systemd[1]: ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Main process exited, code=exited, status=1/FAILURE
Dec 12 21:38:59 ceph-tpa-server1 systemd[1]: ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Failed with result 'exit-code'.
Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Service hold-off time over, scheduling restart.
Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Scheduled restart job, restart counter is at 1.
Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142...
Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
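If the daemon is actually crashing (rather than being killed), the mgr crash module should have recorded a backtrace; something along these lines should show it (the crash ID is a placeholder):

# list crashes recorded by the mgr crash module, then dump one in full
ceph crash ls
ceph crash info <crash-id>
# raw journal for this OSD's unit, in case the crash module has nothing
journalctl -u ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service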
Hope that helps…
Thanks,
Stefan
From: Stefan Wild <swild@tiltworks.com>
Date: Saturday, December 12, 2020 at 9:35 PM
To: "ceph-users(a)ceph.io" <ceph-users(a)ceph.io>
Subject: OSD reboot loop after running out of memory
Hi,
We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one of the
servers ran out of memory for unknown reasons (normally the machine uses about 60 of its
128 GB). Since then, some OSDs on that machine get caught in an endless restart loop; the
logs just show systemd seeing the daemon fail and then restarting it. Since the
out-of-memory incident, we've had 3 OSDs fail this way at separate times. We resorted to
wiping the affected OSD and re-adding it to the cluster, but it seems that as soon as all
PGs have moved to the OSD, the next one fails.
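For reference, pausing data movement and capping OSD memory seems like a reasonable stopgap while investigating; these are just the standard flags and options, not a fix for the underlying crash:

# keep crashed OSDs from being marked out and PGs from shuffling onto the next OSD
ceph osd set noout
ceph osd set norebalance
# cap per-OSD memory so a restart storm can't OOM the host again (example value: 4 GiB)
ceph config set osd osd_memory_target 4294967296
# undo once things are stable
ceph osd unset norebalance
ceph osd unset noout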
The restart loop is also keeping us from re-deploying RGW, which was affected by the same
out-of-memory incident, since cephadm runs a health check and won't deploy the service
unless the cluster is in HEALTH_OK status.
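If it turns out a single health warning is what trips that check, muting it temporarily might be an option (the health code below is only an example, not necessarily what our cluster is showing):

# see exactly which warnings keep the cluster out of HEALTH_OK
ceph health detail
# temporarily silence a specific warning for, say, one hour
ceph health mute PG_DEGRADED 1h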
Any help would be greatly appreciated.
Thanks,
Stefan