Hello, Kalle,
Your comments about the bugs with pg_log memory and buffer_anon memory growth worry me
a lot, as I am planning to build a cluster with the latest Nautilus version.
Could you please comment on how to safely deal with these bugs, or how to avoid them, if
indeed they occur?
Thanks a lot,
Samuel
huxiaoyu(a)horebdata.cn
From: Kalle Happonen
Date: 2020-12-14 08:28
To: Stefan Wild
CC: ceph-users
Subject: [ceph-users] Re: OSD reboot loop after running out of memory
Hi Stefan,
We have been seeing OSDs OOMing on 14.2.13 as well, but on a larger scale. In our case we hit
some bugs with pg_log memory growth and buffer_anon memory growth. Can you check
what's taking up the memory on the OSD with the following command?
ceph daemon osd.123 dump_mempools
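If jq is handy, something like this should surface the biggest consumers (a rough sketch,
assuming the Nautilus JSON layout where the per-pool stats sit under mempool.by_pool):

# list the five largest mempools by bytes (assumes jq is installed and that the
# per-pool stats are under .mempool.by_pool, as on Nautilus)
ceph daemon osd.123 dump_mempools | \
  jq '.mempool.by_pool | to_entries | sort_by(-.value.bytes) | .[0:5]'

If osd_pglog or buffer_anon account for most of the total there, it's likely the same issue we hit.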
Cheers,
Kalle
----- Original Message -----
From: "Stefan Wild"
<swild(a)tiltworks.com>
To: "Igor Fedotov" <ifedotov(a)suse.de>de>, "ceph-users"
<ceph-users(a)ceph.io>
Sent: Sunday, 13 December, 2020 14:46:44
Subject: [ceph-users] Re: OSD reboot loop after running out of memory
Hi Igor,
Full OSD logs from startup to failed exit:
https://tiltworks.com/osd.1.log
In other news, can I expect osd.10 to go down next?
Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
2020-12-13T12:40:14.823+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000
front 2020-12-13T12:39:43.311164+0000 (oldest deadline
2020-12-13T12:40:06.810981+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug
2020-12-13T12:40:15.055+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000
front 2020-12-13T12:39:42.972702+0000 (oldest deadline
2020-12-13T12:40:05.272435+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug
2020-12-13T12:40:15.155+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000
front 2020-12-13T12:39:42.181856+0000 (oldest deadline
2020-12-13T12:40:06.281648+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:15.171+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
e13375 prepare_failure osd.10
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.2
is reporting failure:0
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:15.171+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
osd.10 failure report canceled by osd.2
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:15.176057+0000 mon.ceph-tpa-server1 (mon.0) 1172513 : cluster
[DBG] osd.10 failure report canceled by osd.2
Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug
2020-12-13T12:40:15.295+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000
front 2020-12-13T12:39:43.326666+0000 (oldest deadline
2020-12-13T12:40:07.426786+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:15.423+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
e13375 prepare_failure osd.10
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.6
is reporting failure:0
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:15.423+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
osd.10 failure report canceled by osd.6
Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug
2020-12-13T12:40:15.447+0000 7f85048db700 -1 osd.3 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.770822+0000
front 2020-12-13T12:39:39.770700+0000 (oldest deadline
2020-12-13T12:40:05.070662+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug
2020-12-13T12:40:15.687+0000 7fa8e1800700 -1 osd.4 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.977106+0000
front 2020-12-13T12:39:39.977176+0000 (oldest deadline
2020-12-13T12:40:04.677320+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug
2020-12-13T12:40:15.799+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000
front 2020-12-13T12:39:43.311164+0000 (oldest deadline
2020-12-13T12:40:06.810981+0000)
Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
2020-12-13T12:40:16.019+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000
front 2020-12-13T12:39:42.972702+0000 (oldest deadline
2020-12-13T12:40:05.272435+0000)
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.179+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
e13375 prepare_failure osd.10
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.4
is reporting failure:0
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.179+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
osd.10 failure report canceled by osd.4
Dec 13 07:40:16 ceph-tpa-server1 bash[2060428]: debug
2020-12-13T12:40:16.191+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000
front 2020-12-13T12:39:42.181856+0000 (oldest deadline
2020-12-13T12:40:06.281648+0000)
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:15.429755+0000 mon.ceph-tpa-server1 (mon.0) 1172514 : cluster
[DBG] osd.10 failure report canceled by osd.6
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:16.183521+0000 mon.ceph-tpa-server1 (mon.0) 1172515 : cluster
[DBG] osd.10 failure report canceled by osd.4
Dec 13 07:40:16 ceph-tpa-server1 bash[1824779]: debug
2020-12-13T12:40:16.303+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000
front 2020-12-13T12:39:43.326666+0000 (oldest deadline
2020-12-13T12:40:07.426786+0000)
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.371+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
e13375 prepare_failure osd.10
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.3
is reporting failure:0
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.371+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
osd.10 failure report canceled by osd.3
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.611+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
e13375 prepare_failure osd.10
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.7
is reporting failure:0
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.611+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
osd.10 failure report canceled by osd.7
Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
2020-12-13T12:40:16.979+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000
front 2020-12-13T12:39:42.972702+0000 (oldest deadline
2020-12-13T12:40:05.272435+0000)
Dec 13 07:40:17 ceph-tpa-server1 bash[1824779]: debug
2020-12-13T12:40:17.271+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000
front 2020-12-13T12:39:43.326666+0000 (oldest deadline
2020-12-13T12:40:07.426786+0000)
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:16.378213+0000 mon.ceph-tpa-server1 (mon.0) 1172516 : cluster
[DBG] osd.10 failure report canceled by osd.3
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:16.616685+0000 mon.ceph-tpa-server1 (mon.0) 1172517 : cluster
[DBG] osd.10 failure report canceled by osd.7
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:17.727+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
e13375 prepare_failure osd.10
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.0
is reporting failure:0
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:17.727+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
osd.10 failure report canceled by osd.0
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:17.839+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
e13375 prepare_failure osd.10
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.5
is reporting failure:0
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:17.839+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
osd.10 failure report canceled by osd.5
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:17.733200+0000 mon.ceph-tpa-server1 (mon.0) 1172518 : cluster
[DBG] osd.10 failure report canceled by osd.0
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:17.843775+0000 mon.ceph-tpa-server1 (mon.0) 1172519 : cluster
[DBG] osd.10 failure report canceled by osd.5
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:18.575+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
e13375 prepare_failure osd.10
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.11
is reporting failure:0
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:18.575+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
osd.10 failure report canceled by osd.11
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:18.783+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
e13375 prepare_failure osd.10
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.8
is reporting failure:0
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:18.783+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
osd.10 failure report canceled by osd.8
Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:18.578914+0000 mon.ceph-tpa-server1 (mon.0) 1172520 : cluster
[DBG] osd.10 failure report canceled by osd.11
Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:18.789301+0000 mon.ceph-tpa-server1 (mon.0) 1172521 : cluster
[DBG] osd.10 failure report canceled by osd.8
Thanks,
Stefan
On 12/13/20, 2:18 AM, "Igor Fedotov" <ifedotov(a)suse.de> wrote:
Hi Stefan,
could you please share OSD startup log from /var/log/ceph?
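If the OSDs run under cephadm, the same startup output should also end up in journald;
something like this ought to capture it (a sketch only: the unit name is taken from the
systemd messages in your log excerpt, adjust the fsid and OSD id as needed):

# capture recent output of the containerized OSD unit into a file
journalctl -u ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service \
  --since "1 hour ago" > osd.1-startup.log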
Thanks,
Igor
On 12/13/2020 5:44 AM, Stefan Wild wrote:
Just had another look at the logs and this is what I noticed after the
affected OSD starts up.
Loads of entries of this sort:
Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug
2020-12-13T02:38:40.851+0000 7fafd32c7700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fafb721f700' had timed out after 15
Then a few pages of this:
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9249>
2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11(
empty local-lis/les=13015/13016 n=0 ec=1530
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9248>
2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11(
empty local-lis/les=13015/13016 n=0 ec=1530
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9247>
2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11(
empty local-lis/les=13015/13016 n=0 ec=1530
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9246>
2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13024 pg[28.11(
empty local-lis/les=13015/13016 n=0 ec=1530
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9245>
2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13026 pg[28.11(
empty local-lis/les=13015/13016 n=0 ec=1530
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9244>
2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2(
v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9243>
2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2(
v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9242>
2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2(
v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9241>
2020-12-13T02:35:44.022+0000 7fafb721f700 1 osd.1 pg_epoch: 13143 pg[19.69s2(
v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9240>
2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2(
v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9239>
2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2(
v 3437'1753192 (3437'1753192,3437'1753192
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9238>
2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10(
v 3437'1759161 (3437'1759161,3437'175916
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9237>
2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10(
v 3437'1759161 (3437'1759161,3437'175916
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9236>
2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10(
v 3437'1759161 (3437'1759161,3437'175916
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9235>
2020-12-13T02:35:44.022+0000 7fafb521b700 1 osd.1 pg_epoch: 13143 pg[19.3bs10(
v 3437'1759161 (3437'1759161,3437'175916
And this is where it crashes:
Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9232>
2020-12-13T02:35:44.022+0000 7fafd02c1700 0 log_channel(cluster) log [DBG] :
purged_snaps scrub starts
Dec 12 21:38:57 ceph-tpa-server1 systemd[1]:
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Main process exited,
code=exited, status=1/FAILURE
Dec 12 21:38:59 ceph-tpa-server1 systemd[1]:
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Failed with result
'exit-code'.
Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Service hold-off time
over, scheduling restart.
Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Scheduled restart job,
restart counter is at 1.
Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph osd.1 for
08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph osd.1 for
08fa929a-8e23-11ea-a1a2-ac1f6bf83142...
Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph osd.1 for
08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
Hope that helps…
Thanks,
Stefan
From: Stefan Wild <swild(a)tiltworks.com>
Date: Saturday, December 12, 2020 at 9:35 PM
To: "ceph-users(a)ceph.io" <ceph-users(a)ceph.io>
Subject: OSD reboot loop after running out of memory
Hi,
We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one
of the servers ran out of memory for unknown reasons (normally the machine uses
about 60 out of 128 GB). Since then, some OSDs on that machine get caught in an
endless restart loop. The logs just show systemd seeing the daemon fail and then
restarting it. Since the out-of-memory incident, we’ve had 3 OSDs fail
this way at separate times. We resorted to wiping the affected OSD and
re-adding it to the cluster, but it seems as soon as all PGs have moved to the
OSD, the next one fails.
This is also keeping us from re-deploying RGW, which was affected by the same
out of memory incident, since cephadm runs a check and won’t deploy the service
unless the cluster is in HEALTH_OK status.
Any help would be greatly appreciated.
Thanks,
Stefan
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io