Just a note - everything below is almost completely unrelated to the high
RAM usage. The latter is a different issue, which presumably just
triggered the PG-removal one...
On 12/14/2020 2:39 PM, Igor Fedotov wrote:
> Hi Stefan,
>
> given the crash backtrace in your log I presume some data removal is
> in progress:
>
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 3:
> (KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
> char*)+0xd8) [0x5587b9364a48]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 4:
> (KernelDevice::read_random(unsigned long, unsigned long, char*,
> bool)+0x1b3) [0x5587b93653e3]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 5:
> (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned
> long, char*)+0x674) [0x5587b9328cb4]
> ...
>
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 19:
> (BlueStore::_do_omap_clear(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Onode>&)+0xa2) [0x5587b922f0e2]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 20:
> (BlueStore::_do_remove(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Collection>&,
> boost::intrusive_ptr<BlueStore::Onode>)+0xc65) [0x5587b923b555]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 21:
> (BlueStore::_remove(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Collection>&,
> boost::intrusive_ptr<BlueStore::Onode>&)+0x64) [0x5587b923c3b4]
> ...
>
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 24:
> (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
> ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>,
> ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 25:
> (PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 26:
> (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e)
> [0x5587b8fd6ede]
> ...
>
> Did you initiate some large pool removal recently? Or maybe data
> rebalancing triggered PG migration (and hence source PG removal) for you?
>
> Most likely you're facing a well-known RocksDB/BlueFS performance
> issue caused by massive data removal.
>
> So your OSDs are just processing I/O very slowly, which triggers the
> suicide timeout.
>
> We've had multiple threads on the issue in this mailing list - the
> latest one is at
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72Z…
>
> For now a good-enough workaround is manual offline DB compaction for
> all the OSDs (though this might have only a temporary effect while the
> removal proceeds).
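>
> For a non-containerized OSD that's roughly the following (just a sketch -
> the OSD id and data path are placeholders, and on a cephadm/container
> deployment you'd need to run this from a shell that can access the OSD's
> data directory):
>
>     # stop the OSD, compact its RocksDB offline, then start it again
>     systemctl stop ceph-osd@<id>
>     ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
>     systemctl start ceph-osd@<id>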
>
> Additionally, there are user reports that the recent change to the
> default value of the bluefs_buffered_io setting has a negative impact
> as well (or just worsens the existing issue with massive removal). So
> you might want to switch it back to true.
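>
> E.g. something like (this applies it to all OSDs; an OSD restart may be
> needed for the change to take effect):
>
>     ceph config set osd bluefs_buffered_io true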
>
> As for OSD.10 - I can't say for sure as I haven't seen its logs, but I
> think it's experiencing the same issue, which might eventually lead it
> into an unresponsive state as well. Just grep its log for "heartbeat_map
> is_healthy 'OSD::osd_op_tp thread" strings.
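>
> E.g. (the paths and unit name below are placeholders - on your cephadm
> setup the messages may end up in the journal rather than /var/log/ceph):
>
>     grep "heartbeat_map is_healthy 'OSD::osd_op_tp thread" /var/log/ceph/ceph-osd.10.log
>     # or, for a containerized OSD:
>     journalctl -u ceph-<fsid>@osd.10 | grep "heartbeat_map is_healthy 'OSD::osd_op_tp thread"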
>
>
> Thanks,
>
> Igor
>
> On 12/13/2020 3:46 PM, Stefan Wild wrote:
>> Hi Igor,
>>
>> Full osd logs from startup to failed exit:
>>
>> https://tiltworks.com/osd.1.log
>>
>> In other news, can I expect osd.10 to go down next?
>>
>> Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
>> 2020-12-13T12:40:14.823+0000 7ff37c2e1700 -1 osd.7 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:43.310905+0000 front 2020-12-13T12:39:43.311164+0000
>> (oldest deadline 2020-12-13T12:40:06.810981+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:15.055+0000 7f9220af3700 -1 osd.11 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:42.972558+0000 front 2020-12-13T12:39:42.972702+0000
>> (oldest deadline 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug
>> 2020-12-13T12:40:15.155+0000 7fb247eaf700 -1 osd.8 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:42.181904+0000 front 2020-12-13T12:39:42.181856+0000
>> (oldest deadline 2020-12-13T12:40:06.281648+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.171+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.2 is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.171+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:15.176057+0000 mon.ceph-tpa-server1 (mon.0) 1172513
>> : cluster [DBG] osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:15.295+0000 7fa60679a700 -1 osd.0 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:43.326792+0000 front 2020-12-13T12:39:43.326666+0000
>> (oldest deadline 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.423+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.6 is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.423+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.6
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug
>> 2020-12-13T12:40:15.447+0000 7f85048db700 -1 osd.3 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:39.770822+0000 front 2020-12-13T12:39:39.770700+0000
>> (oldest deadline 2020-12-13T12:40:05.070662+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug
>> 2020-12-13T12:40:15.687+0000 7fa8e1800700 -1 osd.4 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:39.977106+0000 front 2020-12-13T12:39:39.977176+0000
>> (oldest deadline 2020-12-13T12:40:04.677320+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug
>> 2020-12-13T12:40:15.799+0000 7ff37c2e1700 -1 osd.7 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:43.310905+0000 front 2020-12-13T12:39:43.311164+0000
>> (oldest deadline 2020-12-13T12:40:06.810981+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:16.019+0000 7f9220af3700 -1 osd.11 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:42.972558+0000 front 2020-12-13T12:39:42.972702+0000
>> (oldest deadline 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.179+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.4 is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.179+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.4
>> Dec 13 07:40:16 ceph-tpa-server1 bash[2060428]: debug
>> 2020-12-13T12:40:16.191+0000 7fb247eaf700 -1 osd.8 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:42.181904+0000 front 2020-12-13T12:39:42.181856+0000
>> (oldest deadline 2020-12-13T12:40:06.281648+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:15.429755+0000 mon.ceph-tpa-server1 (mon.0) 1172514
>> : cluster [DBG] osd.10 failure report canceled by osd.6
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:16.183521+0000 mon.ceph-tpa-server1 (mon.0) 1172515
>> : cluster [DBG] osd.10 failure report canceled by osd.4
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:16.303+0000 7fa60679a700 -1 osd.0 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:43.326792+0000 front 2020-12-13T12:39:43.326666+0000
>> (oldest deadline 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.371+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.3 is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.371+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.3
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.611+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.7 is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.611+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.7
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:16.979+0000 7f9220af3700 -1 osd.11 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:42.972558+0000 front 2020-12-13T12:39:42.972702+0000
>> (oldest deadline 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:17.271+0000 7fa60679a700 -1 osd.0 13375
>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>> 2020-12-13T12:39:43.326792+0000 front 2020-12-13T12:39:43.326666+0000
>> (oldest deadline 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:16.378213+0000 mon.ceph-tpa-server1 (mon.0) 1172516
>> : cluster [DBG] osd.10 failure report canceled by osd.3
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:16.616685+0000 mon.ceph-tpa-server1 (mon.0) 1172517
>> : cluster [DBG] osd.10 failure report canceled by osd.7
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.727+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.0 is reporting failure:0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.727+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.839+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.5 is reporting failure:0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.839+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.5
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:17.733200+0000 mon.ceph-tpa-server1 (mon.0) 1172518
>> : cluster [DBG] osd.10 failure report canceled by osd.0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:17.843775+0000 mon.ceph-tpa-server1 (mon.0) 1172519
>> : cluster [DBG] osd.10 failure report canceled by osd.5
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.575+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.11 is reporting failure:0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.575+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.11
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.783+0000 7fe929be8700 1
>> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>> from osd.8 is reporting failure:0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.783+0000 7fe929be8700 0 log_channel(cluster) log
>> [DBG] : osd.10 failure report canceled by osd.8
>> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:18.578914+0000 mon.ceph-tpa-server1 (mon.0) 1172520
>> : cluster [DBG] osd.10 failure report canceled by osd.11
>> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:18.789301+0000 mon.ceph-tpa-server1 (mon.0) 1172521
>> : cluster [DBG] osd.10 failure report canceled by osd.8
>>
>>
>> Thanks,
>> Stefan
>>
>>
>> On 12/13/20, 2:18 AM, "Igor Fedotov" <ifedotov@suse.de> wrote:
>>
>> Hi Stefan,
>>
>> could you please share OSD startup log from /var/log/ceph?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 12/13/2020 5:44 AM, Stefan Wild wrote:
>> > Just had another look at the logs and this is what I did
>> notice after the affected OSD starts up.
>> >
>> > Loads of entries of this sort:
>> >
>> > Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug
>> 2020-12-13T02:38:40.851+0000 7fafd32c7700 1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fafb721f700' had timed out after 15
>> >
>> > Then a few pages of this:
>> >
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9249>
>> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024
>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9248>
>> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024
>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9247>
>> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024
>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9246>
>> 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13024
>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9245>
>> 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13026
>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9244>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9243>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9242>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9241>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 1 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9240>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9239>
>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9238>
>> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143
>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9237>
>> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143
>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9236>
>> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143
>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9235>
>> 2020-12-13T02:35:44.022+0000 7fafb521b700 1 osd.1 pg_epoch: 13143
>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>> >
>> > And this is where it crashes:
>> >
>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9232>
>> 2020-12-13T02:35:44.022+0000 7fafd02c1700 0 log_channel(cluster) log
>> [DBG] : purged_snaps scrub starts
>> > Dec 12 21:38:57 ceph-tpa-server1 systemd[1]:
>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Main process
>> exited, code=exited, status=1/FAILURE
>> > Dec 12 21:38:59 ceph-tpa-server1 systemd[1]:
>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Failed with
>> result 'exit-code'.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Service
>> hold-off time over, scheduling restart.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Scheduled
>> restart job, restart counter is at 1.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph
>> osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph
>> osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142...
>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph
>> osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
>> >
>> > Hope that helps…
>> >
>> >
>> > Thanks,
>> > Stefan
>> >
>> >
>> > From: Stefan Wild <swild@tiltworks.com>
>> > Date: Saturday, December 12, 2020 at 9:35 PM
>> > To: "ceph-users@ceph.io" <ceph-users@ceph.io>
>> > Subject: OSD reboot loop after running out of memory
>> >
>> > Hi,
>> >
>> > We recently upgraded a cluster from 15.2.1 to 15.2.5. About two
>> days later, one of the servers ran out of memory for unknown reasons
>> (normally the machine uses about 60 out of 128 GB). Since then, some
>> OSDs on that machine get caught in an endless restart loop. The logs
>> just show systemd noticing the daemon fail and restarting it. Since
>> the out-of-memory incident, we've had 3 OSDs fail this way at separate
>> times. We resorted to wiping the affected OSD and re-adding it to the
>> cluster, but it seems as soon as all PGs have moved to the OSD, the
>> next one fails.
>> >
>> > This is also keeping us from re-deploying RGW, which was
>> affected by the same out of memory incident, since cephadm runs a
>> check and won’t deploy the service unless the cluster is in HEALTH_OK
>> status.
>> >
>> > Any help would be greatly appreciated.
>> >
>> > Thanks,
>> > Stefan
>> >