Hi Stefan,
We had been seeing OSDs OOMing on 14.2.13 as well, but on a larger scale. In our
case we hit some bugs with pg_log memory growth and buffer_anon memory growth.
Can you check what's taking up the memory on the OSD with the following command?
ceph daemon osd.123 dump_mempools
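If it's the same issue, the osd_pglog and buffer_anon pools will dominate the
output. As a rough sketch (assuming jq is available on the host; the JSON layout
below is from memory, so adjust the filter if your output is shaped differently),
this pulls out just the two interesting pools:

# with cephadm you may need to reach the admin socket inside the daemon's
# container first, e.g. via "cephadm enter --name osd.123"
ceph daemon osd.123 dump_mempools | jq '.mempool.by_pool | {osd_pglog, buffer_anon}'

That should tell you whether it's pg_log entries or buffer_anon that's eating
the memory.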
Cheers,
Kalle
----- Original Message -----
From: "Stefan Wild"
<swild(a)tiltworks.com>
To: "Igor Fedotov" <ifedotov(a)suse.de>de>, "ceph-users"
<ceph-users(a)ceph.io>
Sent: Sunday, 13 December, 2020 14:46:44
Subject: [ceph-users] Re: OSD reboot loop after running out of memory
> Hi Igor,
>
> Full osd logs from startup to failed exit:
>
> https://tiltworks.com/osd.1.log
>
> In other news, can I expect osd.10 to go down next?
>
> Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
> 2020-12-13T12:40:14.823+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000
> front 2020-12-13T12:39:43.311164+0000 (oldest deadline
> 2020-12-13T12:40:06.810981+0000)
> Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug
> 2020-12-13T12:40:15.055+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000
> front 2020-12-13T12:39:42.972702+0000 (oldest deadline
> 2020-12-13T12:40:05.272435+0000)
> Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug
> 2020-12-13T12:40:15.155+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000
> front 2020-12-13T12:39:42.181856+0000 (oldest deadline
> 2020-12-13T12:40:06.281648+0000)
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:15.171+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.2
> is reporting failure:0
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:15.171+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.2
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster
> 2020-12-13T12:40:15.176057+0000 mon.ceph-tpa-server1 (mon.0) 1172513 : cluster
> [DBG] osd.10 failure report canceled by osd.2
> Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug
> 2020-12-13T12:40:15.295+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000
> front 2020-12-13T12:39:43.326666+0000 (oldest deadline
> 2020-12-13T12:40:07.426786+0000)
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:15.423+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.6
> is reporting failure:0
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:15.423+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.6
> Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug
> 2020-12-13T12:40:15.447+0000 7f85048db700 -1 osd.3 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.770822+0000
> front 2020-12-13T12:39:39.770700+0000 (oldest deadline
> 2020-12-13T12:40:05.070662+0000)
> Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug
> 2020-12-13T12:40:15.687+0000 7fa8e1800700 -1 osd.4 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.977106+0000
> front 2020-12-13T12:39:39.977176+0000 (oldest deadline
> 2020-12-13T12:40:04.677320+0000)
> Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug
> 2020-12-13T12:40:15.799+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000
> front 2020-12-13T12:39:43.311164+0000 (oldest deadline
> 2020-12-13T12:40:06.810981+0000)
> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
> 2020-12-13T12:40:16.019+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000
> front 2020-12-13T12:39:42.972702+0000 (oldest deadline
> 2020-12-13T12:40:05.272435+0000)
> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:16.179+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.4
> is reporting failure:0
> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:16.179+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.4
> Dec 13 07:40:16 ceph-tpa-server1 bash[2060428]: debug
> 2020-12-13T12:40:16.191+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000
> front 2020-12-13T12:39:42.181856+0000 (oldest deadline
> 2020-12-13T12:40:06.281648+0000)
> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
> 2020-12-13T12:40:15.429755+0000 mon.ceph-tpa-server1 (mon.0) 1172514 : cluster
> [DBG] osd.10 failure report canceled by osd.6
> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
> 2020-12-13T12:40:16.183521+0000 mon.ceph-tpa-server1 (mon.0) 1172515 : cluster
> [DBG] osd.10 failure report canceled by osd.4
> Dec 13 07:40:16 ceph-tpa-server1 bash[1824779]: debug
> 2020-12-13T12:40:16.303+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000
> front 2020-12-13T12:39:43.326666+0000 (oldest deadline
> 2020-12-13T12:40:07.426786+0000)
> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:16.371+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.3
> is reporting failure:0
> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:16.371+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.3
> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:16.611+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.7
> is reporting failure:0
> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:16.611+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.7
> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
> 2020-12-13T12:40:16.979+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000
> front 2020-12-13T12:39:42.972702+0000 (oldest deadline
> 2020-12-13T12:40:05.272435+0000)
> Dec 13 07:40:17 ceph-tpa-server1 bash[1824779]: debug
> 2020-12-13T12:40:17.271+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000
> front 2020-12-13T12:39:43.326666+0000 (oldest deadline
> 2020-12-13T12:40:07.426786+0000)
> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
> 2020-12-13T12:40:16.378213+0000 mon.ceph-tpa-server1 (mon.0) 1172516 : cluster
> [DBG] osd.10 failure report canceled by osd.3
> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
> 2020-12-13T12:40:16.616685+0000 mon.ceph-tpa-server1 (mon.0) 1172517 : cluster
> [DBG] osd.10 failure report canceled by osd.7
> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:17.727+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.0
> is reporting failure:0
> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:17.727+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.0
> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:17.839+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.5
> is reporting failure:0
> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:17.839+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.5
> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
> 2020-12-13T12:40:17.733200+0000 mon.ceph-tpa-server1 (mon.0) 1172518 : cluster
> [DBG] osd.10 failure report canceled by osd.0
> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
> 2020-12-13T12:40:17.843775+0000 mon.ceph-tpa-server1 (mon.0) 1172519 : cluster
> [DBG] osd.10 failure report canceled by osd.5
> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:18.575+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.11
> is reporting failure:0
> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:18.575+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.11
> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:18.783+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.8
> is reporting failure:0
> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:18.783+0000 7fe929be8700 0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.8
> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
> 2020-12-13T12:40:18.578914+0000 mon.ceph-tpa-server1 (mon.0) 1172520 : cluster
> [DBG] osd.10 failure report canceled by osd.11
> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
> 2020-12-13T12:40:18.789301+0000 mon.ceph-tpa-server1 (mon.0) 1172521 : cluster
> [DBG] osd.10 failure report canceled by osd.8
>
>
> Thanks,
> Stefan
>
>
> On 12/13/20, 2:18 AM, "Igor Fedotov" <ifedotov@suse.de> wrote:
>
> Hi Stefan,
>
> could you please share OSD startup log from /var/log/ceph?
>
>
> Thanks,
>
> Igor
>
> On 12/13/2020 5:44 AM, Stefan Wild wrote:
> > Just had another look at the logs, and this is what I noticed after the
> > affected OSD starts up.
> >
> > Loads of entries of this sort:
> >
> > Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug
> > 2020-12-13T02:38:40.851+0000 7fafd32c7700 1 heartbeat_map is_healthy
> > 'OSD::osd_op_tp thread 0x7fafb721f700' had timed out after 15
> >
> > Then a few pages of this:
> >
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9249>
> > 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11(
> > empty local-lis/les=13015/13016 n=0 ec=1530
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9248>
> > 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11(
> > empty local-lis/les=13015/13016 n=0 ec=1530
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9247>
> > 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11(
> > empty local-lis/les=13015/13016 n=0 ec=1530
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9246>
> > 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13024 pg[28.11(
> > empty local-lis/les=13015/13016 n=0 ec=1530
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9245>
> > 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13026 pg[28.11(
> > empty local-lis/les=13015/13016 n=0 ec=1530
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9244>
> > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2(
> > v 3437'1753192 (3437'1753192,3437'1753192
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9243>
> > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2(
> > v 3437'1753192 (3437'1753192,3437'1753192
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9242>
> > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2(
> > v 3437'1753192 (3437'1753192,3437'1753192
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9241>
> > 2020-12-13T02:35:44.022+0000 7fafb721f700 1 osd.1 pg_epoch: 13143 pg[19.69s2(
> > v 3437'1753192 (3437'1753192,3437'1753192
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9240>
> > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2(
> > v 3437'1753192 (3437'1753192,3437'1753192
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9239>
> > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2(
> > v 3437'1753192 (3437'1753192,3437'1753192
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9238>
> > 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10(
> > v 3437'1759161 (3437'1759161,3437'175916
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9237>
> > 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10(
> > v 3437'1759161 (3437'1759161,3437'175916
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9236>
> > 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10(
> > v 3437'1759161 (3437'1759161,3437'175916
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9235>
> > 2020-12-13T02:35:44.022+0000 7fafb521b700 1 osd.1 pg_epoch: 13143 pg[19.3bs10(
> > v 3437'1759161 (3437'1759161,3437'175916
> >
> > And this is where it crashes:
> >
> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9232>
> > 2020-12-13T02:35:44.022+0000 7fafd02c1700 0 log_channel(cluster) log [DBG] :
> > purged_snaps scrub starts
> > Dec 12 21:38:57 ceph-tpa-server1 systemd[1]:
> > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Main process exited,
> > code=exited, status=1/FAILURE
> > Dec 12 21:38:59 ceph-tpa-server1 systemd[1]:
> > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Failed with result
> > 'exit-code'.
> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
> > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Service hold-off time
> > over, scheduling restart.
> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
> > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Scheduled restart job,
> > restart counter is at 1.
> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph osd.1 for
> > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph osd.1 for
> > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142...
> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph osd.1 for
> > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
> >
> > Hope that helps…
> >
> >
> > Thanks,
> > Stefan
> >
> >
> > From: Stefan Wild <swild@tiltworks.com>
> > Date: Saturday, December 12, 2020 at 9:35 PM
> > To: "ceph-users@ceph.io" <ceph-users@ceph.io>
> > Subject: OSD reboot loop after running out of memory
> >
> > Hi,
> >
> > We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one
> > of the servers ran out of memory for unknown reasons (normally the machine uses
> > about 60 out of 128 GB). Since then, some OSDs on that machine get caught in an
> > endless restart loop. The logs just mention the system seeing the daemon fail
> > and then restarting it. Since the out-of-memory incident, we’ve had 3 OSDs fail
> > this way at separate times. We resorted to wiping the affected OSD and
> > re-adding it to the cluster, but it seems as soon as all PGs have moved to the
> > OSD, the next one fails.
> >
> > This is also keeping us from re-deploying RGW, which was affected by the same
> > out-of-memory incident, since cephadm runs a check and won’t deploy the service
> > unless the cluster is in HEALTH_OK status.
> >
> > Any help would be greatly appreciated.
> >
> > Thanks,
> > Stefan
> >
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-leave@ceph.io