Just a note - everything below is almost completely unrelated to the high
RAM usage. The latter is a different issue, which presumably just
triggered the PG-removal one...
On 12/14/2020 2:39 PM, Igor Fedotov wrote:
> Hi Stefan,
>
> given the crash backtrace in your log I presume some data removal is
> in progress:
>
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 3:
> (KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
> char*)+0xd8) [0x5587b9364a48]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 4:
> (KernelDevice::read_random(unsigned long, unsigned long, char*,
> bool)+0x1b3) [0x5587b93653e3]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 5:
> (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned
> long, char*)+0x674) [0x5587b9328cb4]
> ...
>
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 19:
> (BlueStore::_do_omap_clear(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Onode>&)+0xa2) [0x5587b922f0e2]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 20:
> (BlueStore::_do_remove(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Collection>&,
> boost::intrusive_ptr<BlueStore::Onode>)+0xc65) [0x5587b923b555]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 21:
> (BlueStore::_remove(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Collection>&,
> boost::intrusive_ptr<BlueStore::Onode>&)+0x64) [0x5587b923c3b4]
> ...
>
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 24:
> (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
> ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>,
> ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 25:
> (PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 26:
> (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e)
> [0x5587b8fd6ede]
> ...
>
> Did you initiate some large pool removal recently? Or maybe data
> rebalancing triggered PG migration (and hence source PG removal) for you?
>
> Most likely you're facing a well-known RocksDB/BlueFS performance
> issue caused by massive data removal.
>
> So your OSDs are just processing I/O very slowly, which triggers the
> suicide timeout.
>
> We've had multiple threads on the issue in this mailing list - the
> latest one is at
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72Z…
>
> For now a good-enough workaround is manual offline DB compaction for
> all the OSDs (though this might have only a temporary effect while the
> removal proceeds).
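>
> For a non-containerized OSD that's roughly the following (just a sketch -
> the OSD id and data path are placeholders, and on a cephadm/container
> deployment you'd need to run this from a shell that can access the OSD's
> data directory):
>
>     # stop the OSD, compact its RocksDB offline, then start it again
>     systemctl stop ceph-osd@<id>
>     ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
>     systemctl start ceph-osd@<id>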
>
> Additionally, there are user reports that the recent change to the
> default value of the bluefs_buffered_io setting has a negative impact
> as well (or just worsens the existing issue with massive removal). So
> you might want to switch it back to true.
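>
> E.g. something like (this applies it to all OSDs; an OSD restart may be
> needed for the change to take effect):
>
>     ceph config set osd bluefs_buffered_io true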
>
> As for OSD.10 - I can't say for sure as I haven't seen its logs, but I
> think it's experiencing the same issue, which might eventually lead it
> into an unresponsive state as well. Just grep its log for "heartbeat_map
> is_healthy 'OSD::osd_op_tp thread" strings.
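>
> E.g. (the paths and unit name below are placeholders - on your cephadm
> setup the messages may end up in the journal rather than /var/log/ceph):
>
>     grep "heartbeat_map is_healthy 'OSD::osd_op_tp thread" /var/log/ceph/ceph-osd.10.log
>     # or, for a containerized OSD:
>     journalctl -u ceph-<fsid>@osd.10 | grep "heartbeat_map is_healthy 'OSD::osd_op_tp thread"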
>
>
> Thanks,
>
> Igor
>
> On 12/13/2020 3:46 PM, Stefan Wild wrote:
>> Hi Igor,
>>
>> Full osd logs from startup to failed exit:
>>
>> https://tiltworks.com/osd.1.log
>>
>> In other news, can I expect osd.10 to go down next?
>>
>> Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
>> 2020-12-13T12:40:14.823+0000 7ff37c2e1700 -1 osd.7 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:43.310905+0000 front 2020-12-13T12:39:43.311164+0000
>> (oldest deadline 2020-12-13T12:40:06.810981+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:15.055+0000 7f9220af3700 -1 osd.11 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:42.972558+0000 front 2020-12-13T12:39:42.972702+0000
>> (oldest deadline 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug
>> 2020-12-13T12:40:15.155+0000 7fb247eaf700 -1 osd.8 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:42.181904+0000 front 2020-12-13T12:39:42.181856+0000
>> (oldest deadline 2020-12-13T12:40:06.281648+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.171+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.2 is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.171+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:15.176057+0000 mon.ceph-tpa-server1 (mon.0) 1172513
>> : cluster [DBG] osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:15.295+0000 7fa60679a700 -1 osd.0 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:43.326792+0000 front 2020-12-13T12:39:43.326666+0000
>> (oldest deadline 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.423+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.6 is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.423+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.6
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug
>> 2020-12-13T12:40:15.447+0000 7f85048db700 -1 osd.3 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:39.770822+0000 front 2020-12-13T12:39:39.770700+0000
>> (oldest deadline 2020-12-13T12:40:05.070662+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug
>> 2020-12-13T12:40:15.687+0000 7fa8e1800700 -1 osd.4 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:39.977106+0000 front 2020-12-13T12:39:39.977176+0000
>> (oldest deadline 2020-12-13T12:40:04.677320+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug
>> 2020-12-13T12:40:15.799+0000 7ff37c2e1700 -1 osd.7 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:43.310905+0000 front 2020-12-13T12:39:43.311164+0000
>> (oldest deadline 2020-12-13T12:40:06.810981+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:16.019+0000 7f9220af3700 -1 osd.11 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:42.972558+0000 front 2020-12-13T12:39:42.972702+0000
>> (oldest deadline 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.179+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.4 is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.179+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.4
>> Dec 13 07:40:16 ceph-tpa-server1 bash[2060428]: debug
>> 2020-12-13T12:40:16.191+0000 7fb247eaf700 -1 osd.8 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:42.181904+0000 front 2020-12-13T12:39:42.181856+0000
>> (oldest deadline 2020-12-13T12:40:06.281648+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:15.429755+0000 mon.ceph-tpa-server1 (mon.0) 1172514
>> : cluster [DBG] osd.10 failure report canceled by osd.6
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:16.183521+0000 mon.ceph-tpa-server1 (mon.0) 1172515
>> : cluster [DBG] osd.10 failure report canceled by osd.4
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:16.303+0000 7fa60679a700 -1 osd.0 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:43.326792+0000 front 2020-12-13T12:39:43.326666+0000
>> (oldest deadline 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.371+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.3 is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.371+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.3
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.611+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.7 is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.611+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.7
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:16.979+0000 7f9220af3700 -1 osd.11 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:42.972558+0000 front 2020-12-13T12:39:42.972702+0000
>> (oldest deadline 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:17.271+0000 7fa60679a700 -1 osd.0 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:43.326792+0000 front 2020-12-13T12:39:43.326666+0000
>> (oldest deadline 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:16.378213+0000 mon.ceph-tpa-server1 (mon.0) 1172516
>> : cluster [DBG] osd.10 failure report canceled by osd.3
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:16.616685+0000 mon.ceph-tpa-server1 (mon.0) 1172517
>> : cluster [DBG] osd.10 failure report canceled by osd.7
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.727+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.0 is reporting failure:0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.727+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.839+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.5 is reporting failure:0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.839+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.5
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:17.733200+0000 mon.ceph-tpa-server1 (mon.0) 1172518
>> : cluster [DBG] osd.10 failure report canceled by osd.0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:17.843775+0000 mon.ceph-tpa-server1 (mon.0) 1172519
>> : cluster [DBG] osd.10 failure report canceled by osd.5
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.575+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.11 is reporting failure:0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.575+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.11
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.783+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.8 is reporting failure:0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.783+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.8
>> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:18.578914+0000 mon.ceph-tpa-server1 (mon.0) 1172520
>> : cluster [DBG] osd.10 failure report canceled by osd.11
>> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:18.789301+0000 mon.ceph-tpa-server1 (mon.0) 1172521
>> : cluster [DBG] osd.10 failure report canceled by osd.8
>>
>>
>> Thanks,
>> Stefan
>>
>>
>> On 12/13/20, 2:18 AM, "Igor Fedotov" <ifedotov@suse.de> wrote:
>>
>> Hi Stefan,
>>
>> could you please share OSD startup log from /var/log/ceph?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 12/13/2020 5:44 AM, Stefan Wild wrote:
>> > Just had another look at the logs and this is what I did
>> notice after the affected OSD starts up.
>> >
>> > Loads of entries of this sort:
>> >
>> > Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug
>> 2020-12-13T02:38:40.851+0000 7fafd32c7700 1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fafb721f700' had timed out after 15
>> >
>> > Then a few pages of this:
>> >
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9249>
>> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024
>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9248>
>> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024
>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9247>
>> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024
>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9246>
>> 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13024
>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9245>
>> 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13026
>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9244>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9243>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9242>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9241>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 1 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9240>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9239>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9238>
>> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143
>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9237>
>> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143
>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9236>
>> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143
>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9235>
>> 2020-12-13T02:35:44.022+0000 7fafb521b700 1 osd.1 pg_epoch: 13143
>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>> >
>> > And this is where it crashes:
>> >
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9232>
>> 2020-12-13T02:35:44.022+0000 7fafd02c1700 0 log_channel(cluster) log
>> [DBG] : purged_snaps scrub starts
>> > Dec 12 21:38:57 ceph-tpa-server1 systemd[1]:
>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Main process
>> exited, code=exited, status=1/FAILURE
>> > Dec 12 21:38:59 ceph-tpa-server1 systemd[1]:
>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Failed with
>> result 'exit-code'.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Service
>> hold-off time over, scheduling restart.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Scheduled
>> restart job, restart counter is at 1.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph
>> osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph
>> osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142...
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph
>> osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
>> >
>> > Hope that helps…
>> >
>> >
>> > Thanks,
>> > Stefan
>> >
>> >
>> > From: Stefan Wild <swild@tiltworks.com>
>> > Date: Saturday, December 12, 2020 at 9:35 PM
>> > To: "ceph-users@ceph.io" <ceph-users@ceph.io>
>> > Subject: OSD reboot loop after running out of memory
>> >
>> > Hi,
>> >
>> > We recently upgraded a cluster from 15.2.1 to 15.2.5. About two
>> days later, one of the servers ran out of memory for unknown reasons
>> (normally the machine uses about 60 out of 128 GB). Since then, some
>> OSDs on that machine get caught in an endless restart loop. The logs
>> just show systemd noticing the daemon fail and restarting it. Since
>> the out-of-memory incident, we've had 3 OSDs fail this way at separate
>> times. We resorted to wiping the affected OSD and re-adding it to the
>> cluster, but it seems as soon as all PGs have moved to the OSD, the
>> next one fails.
>> >
>> > This is also keeping us from re-deploying RGW, which was
>> affected by the same out of memory incident, since cephadm runs a
>> check and won’t deploy the service unless the cluster is in HEALTH_OK
>> status.
>> >
>> > Any help would be greatly appreciated.
>> >
>> > Thanks,
>> > Stefan
>> >