Hi Stefan,
Initial data removal could also have resulted from a snapshot removal:
trimming can lead to OSDs OOMing, the resulting PG remappings trigger
more removals once the OOMed OSDs rejoin the cluster, and so on.
As Igor mentioned: "Additionally there are users' reports that
recent default value's modification for bluefs_buffered_io setting has
negative impact (or just worsen existing issue with massive removal) as
well. So you might want to switch it back to true."
We're among those users. Our cluster suffered a severe performance drop
during snapshot removal right after upgrading to Nautilus, due to
bluefs_buffered_io now defaulting to false, with slow requests
observed across the cluster.
Once set back to true (which can be done live with ceph tell osd.*
injectargs '--bluefs_buffered_io=true'), snap trimming was fast again,
as it was before the upgrade, with no more slow requests.
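In case it helps, here's the sequence we use. The injectargs call is the one above; the ceph config commands assume a release with the centralized config store (Mimic or later), so treat this as a sketch for your own setup:

```shell
# Flip the setting live on all OSDs (no restart needed)
ceph tell osd.* injectargs '--bluefs_buffered_io=true'

# Persist it in the cluster config database so restarts keep it
ceph config set osd bluefs_buffered_io true

# Verify what a given OSD is actually running with
ceph config show osd.0 bluefs_buffered_io
```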
But of course we've seen the excessive swap usage described here:
https://github.com/ceph/ceph/pull/34224
So we lowered osd_memory_target from 8 GB to 4 GB and haven't observed
any swap usage since then. You can also have a look here:
https://github.com/ceph/ceph/pull/38044
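For what it's worth, lowering the target looked roughly like this for us (a sketch; osd.0 is just an example, and the value is given in bytes since older releases don't take a unit suffix):

```shell
# Cap each OSD's memory target at 4 GiB (value in bytes)
ceph config set osd osd_memory_target 4294967296

# Check the effective value on one OSD
ceph config get osd.0 osd_memory_target
```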
What you need to look at, to understand whether your cluster would
benefit from changing bluefs_buffered_io back to true, is the %util of
your RocksDB devices in iostat. Run iostat -dmx 1 /dev/sdX (if you're
using SSD RocksDB devices) and compare the %util of the device with
bluefs_buffered_io=false and with bluefs_buffered_io=true. If, with
bluefs_buffered_io=false, the %util is over 75% most of the time, then
you'd better change it back to true. :-)
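If you'd rather not eyeball "most of the time", a tiny helper like this (hypothetical, not part of Ceph; feed it the %util column collected from iostat) makes the comparison concrete:

```python
# Judge whether the RocksDB device is saturated: given %util samples
# scraped from `iostat -dmx 1`, report the fraction of samples at or
# above a threshold. The 75% figure is the rule of thumb from this
# thread, not an official Ceph tunable.

def util_saturated_fraction(samples, threshold=75.0):
    """Return the fraction of %util samples at or above `threshold`."""
    if not samples:
        return 0.0
    return sum(1 for u in samples if u >= threshold) / len(samples)

# Hypothetical readings taken with bluefs_buffered_io=false
samples = [92.3, 88.1, 95.0, 40.2, 97.6, 91.0, 89.4, 85.5, 90.2, 93.7]
frac = util_saturated_fraction(samples)
if frac > 0.5:
    print(f"%util >= 75% for {frac:.0%} of samples; try bluefs_buffered_io=true")
```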
Regards,
Frédéric.
On 14/12/2020 at 12:47, Stefan Wild wrote:
> Hi Igor,
>
> Thank you for the detailed analysis. That makes me hopeful we can get
> the cluster back on track. No pools have been removed, but yes, due to
> the initial crash of multiple OSDs and the subsequent issues with
> individual OSDs we’ve had substantial PG remappings happening constantly.
>
> I will look up the referenced thread(s) and try the offline DB
> compaction. It would be amazing if that does the trick.
>
> Will keep you posted here.
>
> Thanks,
> Stefan
>
> ________________________________
> From: Igor Fedotov <ifedotov(a)suse.de>
> Sent: Monday, December 14, 2020 6:39:28 AM
> To: Stefan Wild <swild(a)tiltworks.com>; ceph-users(a)ceph.io <ceph-users(a)ceph.io>
> Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory
>
> Hi Stefan,
>
> given the crash backtrace in your log I presume some data removal is in
> progress:
>
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 3:
> (KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
> char*)+0xd8) [0x5587b9364a48]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 4:
> (KernelDevice::read_random(unsigned long, unsigned long, char*,
> bool)+0x1b3) [0x5587b93653e3]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 5:
> (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long,
> char*)+0x674) [0x5587b9328cb4]
> ...
>
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 19:
> (BlueStore::_do_omap_clear(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Onode>&)+0xa2) [0x5587b922f0e2]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 20:
> (BlueStore::_do_remove(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Collection>&,
> boost::intrusive_ptr<BlueStore::Onode>)+0xc65) [0x5587b923b555]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 21:
> (BlueStore::_remove(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Collection>&,
> boost::intrusive_ptr<BlueStore::Onode>&)+0x64) [0x5587b923c3b4]
> ...
>
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 24:
> (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
> ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>,
> ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 25:
> (PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 26:
> (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e)
> [0x5587b8fd6ede]
> ...
>
> Did you initiate some large pool removal recently? Or maybe data
> rebalancing triggered PG migration (and hence source PG removal) for you?
>
> Highly likely you're facing a well known issue with RocksDB/BlueFS
> performance issues caused by massive data removal.
>
> So your OSDs are just processing I/O very slowly which triggers suicide
> timeout.
>
> We've had multiple threads on the issue in this mailing list - the
> latest one is at
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72Z…
>
> For now the good enough workaround is manual offline DB compaction for
> all the OSDs (this might have temporary effect though as the removal
> proceeds).
>
> Additionally there are users' reports that recent default value's
> modification for bluefs_buffered_io setting has negative impact (or
> just worsen existing issue with massive removal) as well. So you might
> want to switch it back to true.
>
> As for OSD.10 - can't say for sure as I haven't seen its logs but I
> think it's experiencing the same issue which might eventually lead it
> into unresponsive state as well. Just grep its log for "heartbeat_map
> is_healthy 'OSD::osd_op_tp thread" strings.
>
>
> Thanks,
>
> Igor
>
> On 12/13/2020 3:46 PM, Stefan Wild wrote:
>> Hi Igor,
>>
>> Full osd logs from startup to failed exit:
>> https://tiltworks.com/osd.1.log
>>
>> In other news, can I expect osd.10 to go down next?
>>
>> Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
2020-12-13T12:40:14.823+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000 front
2020-12-13T12:39:43.311164+0000 (oldest deadline 2020-12-13T12:40:06.810981+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug
2020-12-13T12:40:15.055+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000 front
2020-12-13T12:39:42.972702+0000 (oldest deadline 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug
2020-12-13T12:40:15.155+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000 front
2020-12-13T12:39:42.181856+0000 (oldest deadline 2020-12-13T12:40:06.281648+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:15.171+0000 7fe929be8700 1 mon.ceph-tpa-server1(a)0(leader).osd e13375
prepare_failure osd.10 [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
from osd.2 is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:15.171+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : osd.10
failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:15.176057+0000 mon.ceph-tpa-server1 (mon.0) 1172513 : cluster [DBG]
osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug
2020-12-13T12:40:15.295+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000 front
2020-12-13T12:39:43.326666+0000 (oldest deadline 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:15.423+0000 7fe929be8700 1 mon.ceph-tpa-server1(a)0(leader).osd e13375
prepare_failure osd.10 [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
from osd.6 is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:15.423+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : osd.10
failure report canceled by osd.6
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug
2020-12-13T12:40:15.447+0000 7f85048db700 -1 osd.3 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.770822+0000 front
2020-12-13T12:39:39.770700+0000 (oldest deadline 2020-12-13T12:40:05.070662+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug 2020-12-13T12:40:15.687+0000
7fa8e1800700 -1 osd.4 13375 heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since
back 2020-12-13T12:39:39.977106+0000 front 2020-12-13T12:39:39.977176+0000 (oldest
deadline 2020-12-13T12:40:04.677320+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug
2020-12-13T12:40:15.799+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000 front
2020-12-13T12:39:43.311164+0000 (oldest deadline 2020-12-13T12:40:06.810981+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
2020-12-13T12:40:16.019+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000 front
2020-12-13T12:39:42.972702+0000 (oldest deadline 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.179+0000 7fe929be8700 1 mon.ceph-tpa-server1(a)0(leader).osd e13375
prepare_failure osd.10 [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
from osd.4 is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.179+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : osd.10
failure report canceled by osd.4
>> Dec 13 07:40:16 ceph-tpa-server1 bash[2060428]: debug
2020-12-13T12:40:16.191+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000 front
2020-12-13T12:39:42.181856+0000 (oldest deadline 2020-12-13T12:40:06.281648+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:15.429755+0000 mon.ceph-tpa-server1 (mon.0) 1172514 : cluster [DBG]
osd.10 failure report canceled by osd.6
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:16.183521+0000 mon.ceph-tpa-server1 (mon.0) 1172515 : cluster [DBG]
osd.10 failure report canceled by osd.4
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824779]: debug
2020-12-13T12:40:16.303+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000 front
2020-12-13T12:39:43.326666+0000 (oldest deadline 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.371+0000 7fe929be8700 1 mon.ceph-tpa-server1(a)0(leader).osd e13375
prepare_failure osd.10 [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
from osd.3 is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.371+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : osd.10
failure report canceled by osd.3
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.611+0000 7fe929be8700 1 mon.ceph-tpa-server1(a)0(leader).osd e13375
prepare_failure osd.10 [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
from osd.7 is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:16.611+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : osd.10
failure report canceled by osd.7
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
2020-12-13T12:40:16.979+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000 front
2020-12-13T12:39:42.972702+0000 (oldest deadline 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1824779]: debug
2020-12-13T12:40:17.271+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no reply from
172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000 front
2020-12-13T12:39:43.326666+0000 (oldest deadline 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:16.378213+0000 mon.ceph-tpa-server1 (mon.0) 1172516 : cluster [DBG]
osd.10 failure report canceled by osd.3
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:16.616685+0000 mon.ceph-tpa-server1 (mon.0) 1172517 : cluster [DBG]
osd.10 failure report canceled by osd.7
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:17.727+0000 7fe929be8700 1 mon.ceph-tpa-server1(a)0(leader).osd e13375
prepare_failure osd.10 [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
from osd.0 is reporting failure:0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:17.727+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : osd.10
failure report canceled by osd.0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:17.839+0000 7fe929be8700 1 mon.ceph-tpa-server1(a)0(leader).osd e13375
prepare_failure osd.10 [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
from osd.5 is reporting failure:0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:17.839+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : osd.10
failure report canceled by osd.5
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:17.733200+0000 mon.ceph-tpa-server1 (mon.0) 1172518 : cluster [DBG]
osd.10 failure report canceled by osd.0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:17.843775+0000 mon.ceph-tpa-server1 (mon.0) 1172519 : cluster [DBG]
osd.10 failure report canceled by osd.5
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:18.575+0000 7fe929be8700 1 mon.ceph-tpa-server1(a)0(leader).osd e13375
prepare_failure osd.10 [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
from osd.11 is reporting failure:0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:18.575+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : osd.10
failure report canceled by osd.11
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:18.783+0000 7fe929be8700 1 mon.ceph-tpa-server1(a)0(leader).osd e13375
prepare_failure osd.10 [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
from osd.8 is reporting failure:0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
2020-12-13T12:40:18.783+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : osd.10
failure report canceled by osd.8
>> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:18.578914+0000 mon.ceph-tpa-server1 (mon.0) 1172520 : cluster [DBG]
osd.10 failure report canceled by osd.11
>> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
2020-12-13T12:40:18.789301+0000 mon.ceph-tpa-server1 (mon.0) 1172521 : cluster [DBG]
osd.10 failure report canceled by osd.8
>>
>>
>> Thanks,
>> Stefan
>>
>>
>> On 12/13/20, 2:18 AM, "Igor Fedotov" <ifedotov(a)suse.de> wrote:
>>
>> Hi Stefan,
>>
>> could you please share OSD startup log from /var/log/ceph?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 12/13/2020 5:44 AM, Stefan Wild wrote:
>> > Just had another look at the logs and this is what I did notice
>> > after the affected OSD starts up.
>> >
>> > Loads of entries of this sort:
>> >
>> > Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug
2020-12-13T02:38:40.851+0000 7fafd32c7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp
thread 0x7fafb721f700' had timed out after 15
>> >
>> > Then a few pages of this:
>> >
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9249>
2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( empty
local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9248>
2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( empty
local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9247>
2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( empty
local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9246>
2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13024 pg[28.11( empty
local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9245>
2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13026 pg[28.11( empty
local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9244>
2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( v
3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9243>
2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( v
3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9242>
2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( v
3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9241>
2020-12-13T02:35:44.022+0000 7fafb721f700 1 osd.1 pg_epoch: 13143 pg[19.69s2( v
3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9240>
2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( v
3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9239>
2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( v
3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9238>
2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( v
3437'1759161 (3437'1759161,3437'175916
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9237>
2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( v
3437'1759161 (3437'1759161,3437'175916
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9236>
2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( v
3437'1759161 (3437'1759161,3437'175916
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9235>
2020-12-13T02:35:44.022+0000 7fafb521b700 1 osd.1 pg_epoch: 13143 pg[19.3bs10( v
3437'1759161 (3437'1759161,3437'175916
>> >
>> > And this is where it crashes:
>> >
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9232>
2020-12-13T02:35:44.022+0000 7fafd02c1700 0 log_channel(cluster) log [DBG] : purged_snaps
scrub starts
>> > Dec 12 21:38:57 ceph-tpa-server1 systemd[1]:
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142(a)osd.1.service: Main process exited, code=exited,
status=1/FAILURE
>> > Dec 12 21:38:59 ceph-tpa-server1 systemd[1]:
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142(a)osd.1.service: Failed with result
'exit-code'.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142(a)osd.1.service: Service hold-off time over,
scheduling restart.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142(a)osd.1.service: Scheduled restart job, restart
counter is at 1.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph osd.1 for
08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph osd.1 for
08fa929a-8e23-11ea-a1a2-ac1f6bf83142...
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph osd.1 for
08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
>> >
>> > Hope that helps…
>> >
>> >
>> > Thanks,
>> > Stefan
>> >
>> >
>> > From: Stefan Wild <swild(a)tiltworks.com>
>> > Date: Saturday, December 12, 2020 at 9:35 PM
>> > To: "ceph-users(a)ceph.io" <ceph-users(a)ceph.io>
>> > Subject: OSD reboot loop after running out of memory
>> >
>> > Hi,
>> >
>> > We recently upgraded a cluster from 15.2.1 to 15.2.5. About two
>> > days later, one of the servers ran out of memory for unknown reasons
>> > (normally the machine uses about 60 out of 128 GB). Since then, some
>> > OSDs on that machine get caught in an endless restart loop. The logs
>> > just show systemd seeing the daemon fail and then restarting it.
>> > Since the out-of-memory incident, we’ve had 3 OSDs fail this way at
>> > separate times. We resorted to wiping the affected OSD and re-adding
>> > it to the cluster, but it seems as soon as all PGs have moved to the
>> > OSD, the next one fails.
>> >
>> > This is also keeping us from re-deploying RGW, which was affected
>> > by the same out-of-memory incident, since cephadm runs a check and
>> > won’t deploy the service unless the cluster is in HEALTH_OK status.
>> >
>> > Any help would be greatly appreciated.
>> >
>> > Thanks,
>> > Stefan
>> >
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users(a)ceph.io
>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io