I forgot to mention: "If, with bluefs_buffered_io=false, the %util is over
75% most of the time **during data removal (like snapshot removal)**,
then you'd better change it to true."
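
A quick way to check (a rough sketch, with /dev/sdX standing in for your
RocksDB device):

  # watch utilization of the DB device while a snapshot removal is running
  iostat -dmx 1 /dev/sdX

  # if %util stays above ~75% with bluefs_buffered_io=false,
  # switch it back at runtime on all OSDs
  ceph tell osd.* injectargs '--bluefs_buffered_io=true'
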
Regards,
Frédéric.
On 14/12/2020 at 21:35, Frédéric Nass wrote:
> Hi Stefan,
>
> Initial data removal could also have resulted from a snapshot removal
> that led to OSDs OOMing, then PG remappings leading to more removals
> once the OOMed OSDs rejoined the cluster, and so on.
>
> As mentioned by Igor: "Additionally, there are user reports that the
> recent modification of the default value for the bluefs_buffered_io
> setting has a negative impact (or just worsens the existing issue with
> massive removal) as well. So you might want to switch it back to true."
>
> We're some of them. Our cluster suffered from a severe performance
> drop during snapshot removal right after upgrading to Nautilus, due to
> bluefs_buffered_io being set to false by default, with slow requests
> observed around the cluster.
> Once set back to true (which can be done with ceph tell osd.* injectargs
> '--bluefs_buffered_io=true'), snap trimming was fast again, just as
> before the upgrade, with no more slow requests.
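>
> For reference, a minimal sketch (the persisting step via "ceph config
> set" is only needed if you want the setting to survive OSD restarts,
> and assumes a centralized config store, i.e. Mimic or later):
>
>   # runtime change on all running OSDs
>   ceph tell osd.* injectargs '--bluefs_buffered_io=true'
>   # persist the setting across OSD restarts
>   ceph config set osd bluefs_buffered_io true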
>
> But of course we've seen the excessive memory swap usage described
> here: https://github.com/ceph/ceph/pull/34224
> So we lowered osd_memory_target from 8 GB to 4 GB and haven't observed
> any swap usage since then. You can also have a look here:
>
> https://github.com/ceph/ceph/pull/38044
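>
> A hedged example of lowering the memory target (the 4 GiB value, given
> in bytes, is just what we settled on; tune it to your hosts):
>
>   ceph config set osd osd_memory_target 4294967296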
>
> What you need to look at, to understand whether your cluster would
> benefit from changing bluefs_buffered_io back to true, is the %util of
> your RocksDB devices in an iostat. Run iostat -dmx 1 /dev/sdX (if
> you're using SSD RocksDB devices) and compare the %util of the device
> with bluefs_buffered_io=false and with bluefs_buffered_io=true. If,
> with bluefs_buffered_io=false, the %util is over 75% most of the time,
> then you'd better change it to true. :-)
>
> Regards,
>
> Frédéric.
>
> On 14/12/2020 at 12:47, Stefan Wild wrote:
>> Hi Igor,
>>
>> Thank you for the detailed analysis. That makes me hopeful we can get
>> the cluster back on track. No pools have been removed, but yes, due
>> to the initial crash of multiple OSDs and the subsequent issues with
>> individual OSDs we’ve had substantial PG remappings happening
>> constantly.
>>
>> I will look up the referenced thread(s) and try the offline DB
>> compaction. It would be amazing if that does the trick.
>>
>> Will keep you posted here.
>>
>> Thanks,
>> Stefan
>>
>> ________________________________
>> From: Igor Fedotov <ifedotov(a)suse.de>
>> Sent: Monday, December 14, 2020 6:39:28 AM
>> To: Stefan Wild <swild(a)tiltworks.com>; ceph-users(a)ceph.io
>> <ceph-users(a)ceph.io>
>> Subject: Re: [ceph-users] Re: OSD reboot loop after running out of
>> memory
>>
>> Hi Stefan,
>>
>> given the crash backtrace in your log I presume some data removal is in
>> progress:
>>
>> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 3:
>> (KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
>> char*)+0xd8) [0x5587b9364a48]
>> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 4:
>> (KernelDevice::read_random(unsigned long, unsigned long, char*,
>> bool)+0x1b3) [0x5587b93653e3]
>> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 5:
>> (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long,
>> char*)+0x674) [0x5587b9328cb4]
>> ...
>>
>> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 19:
>> (BlueStore::_do_omap_clear(BlueStore::TransContext*,
>> boost::intrusive_ptr<BlueStore::Onode>&)+0xa2) [0x5587b922f0e2]
>> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 20:
>> (BlueStore::_do_remove(BlueStore::TransContext*,
>> boost::intrusive_ptr<BlueStore::Collection>&,
>> boost::intrusive_ptr<BlueStore::Onode>)+0xc65) [0x5587b923b555]
>> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 21:
>> (BlueStore::_remove(BlueStore::TransContext*,
>> boost::intrusive_ptr<BlueStore::Collection>&,
>> boost::intrusive_ptr<BlueStore::Onode>&)+0x64) [0x5587b923c3b4]
>> ...
>>
>> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 24:
>>
>> (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
>>
>> ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>,
>> ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
>> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 25:
>> (PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
>> Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 26:
>> (PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e)
>> [0x5587b8fd6ede]
>> ...
>>
>> Did you initiate some large pool removal recently? Or maybe data
>> rebalancing triggered PG migration (and hence source PG removal) for
>> you?
>>
>> Highly likely you're facing a well-known RocksDB/BlueFS performance
>> issue caused by massive data removal.
>>
>> So your OSDs are just processing I/O very slowly, which triggers the
>> suicide timeout.
>>
>> We've had multiple threads on the issue in this mailing list - the
>> latest one is at
>>
>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72Z…
>>
>>
>> For now a good-enough workaround is a manual offline DB compaction for
>> all the OSDs (this might have only a temporary effect, though, as the
>> removal proceeds).
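>>
>> If it helps, a rough sketch for one OSD (unit and path names below are
>> placeholders/assumptions, adjust them to your deployment, e.g. run the
>> tool from within a cephadm shell for containerized OSDs):
>>
>>    # stop the OSD first
>>    systemctl stop <osd unit>
>>    # offline compaction of the OSD's RocksDB
>>    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
>>    # start it again
>>    systemctl start <osd unit>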
>>
>> Additionally, there are user reports that the recent modification of
>> the default value for the bluefs_buffered_io setting has a negative
>> impact (or just worsens the existing issue with massive removal) as
>> well. So you might want to switch it back to true.
>>
>> As for osd.10 - I can't say for sure as I haven't seen its logs, but I
>> think it's experiencing the same issue, which might eventually lead it
>> into an unresponsive state as well. Just grep its log for
>> "heartbeat_map is_healthy 'OSD::osd_op_tp thread" strings.
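>>
>> For instance (assuming the OSD log is available under /var/log/ceph;
>> with a containerized deployment the same grep works on journalctl
>> output):
>>
>>    grep "heartbeat_map is_healthy 'OSD::osd_op_tp thread" \
>>        /var/log/ceph/ceph-osd.10.log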
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 12/13/2020 3:46 PM, Stefan Wild wrote:
>>> Hi Igor,
>>>
>>> Full osd logs from startup to failed exit:
>>>
>>> https://tiltworks.com/osd.1.log
>>>
>>> In other news, can I expect osd.10 to go down next?
>>>
>>> Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
>>> 2020-12-13T12:40:14.823+0000 7ff37c2e1700 -1 osd.7 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:43.310905+0000 front
>>> 2020-12-13T12:39:43.311164+0000 (oldest deadline
>>> 2020-12-13T12:40:06.810981+0000)
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug
>>> 2020-12-13T12:40:15.055+0000 7f9220af3700 -1 osd.11 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:42.972558+0000 front
>>> 2020-12-13T12:39:42.972702+0000 (oldest deadline
>>> 2020-12-13T12:40:05.272435+0000)
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug
>>> 2020-12-13T12:40:15.155+0000 7fb247eaf700 -1 osd.8 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:42.181904+0000 front
>>> 2020-12-13T12:39:42.181856+0000 (oldest deadline
>>> 2020-12-13T12:40:06.281648+0000)
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:15.171+0000 7fe929be8700 1
>>> mon.ceph-tpa-server1(a)0(leader).osd e13375 prepare_failure osd.10
>>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>>> from osd.2 is reporting failure:0
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:15.171+0000 7fe929be8700 0 log_channel(cluster)
>>> log [DBG] : osd.10 failure report canceled by osd.2
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster
>>> 2020-12-13T12:40:15.176057+0000 mon.ceph-tpa-server1 (mon.0) 1172513
>>> : cluster [DBG] osd.10 failure report canceled by osd.2
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug
>>> 2020-12-13T12:40:15.295+0000 7fa60679a700 -1 osd.0 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:43.326792+0000 front
>>> 2020-12-13T12:39:43.326666+0000 (oldest deadline
>>> 2020-12-13T12:40:07.426786+0000)
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:15.423+0000 7fe929be8700 1
>>> mon.ceph-tpa-server1(a)0(leader).osd e13375 prepare_failure osd.10
>>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>>> from osd.6 is reporting failure:0
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:15.423+0000 7fe929be8700 0 log_channel(cluster)
>>> log [DBG] : osd.10 failure report canceled by osd.6
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug
>>> 2020-12-13T12:40:15.447+0000 7f85048db700 -1 osd.3 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:39.770822+0000 front
>>> 2020-12-13T12:39:39.770700+0000 (oldest deadline
>>> 2020-12-13T12:40:05.070662+0000)
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug
>>> 2020-12-13T12:40:15.687+0000 7fa8e1800700 -1 osd.4 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:39.977106+0000 front
>>> 2020-12-13T12:39:39.977176+0000 (oldest deadline
>>> 2020-12-13T12:40:04.677320+0000)
>>> Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug
>>> 2020-12-13T12:40:15.799+0000 7ff37c2e1700 -1 osd.7 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:43.310905+0000 front
>>> 2020-12-13T12:39:43.311164+0000 (oldest deadline
>>> 2020-12-13T12:40:06.810981+0000)
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
>>> 2020-12-13T12:40:16.019+0000 7f9220af3700 -1 osd.11 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:42.972558+0000 front
>>> 2020-12-13T12:39:42.972702+0000 (oldest deadline
>>> 2020-12-13T12:40:05.272435+0000)
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:16.179+0000 7fe929be8700 1
>>> mon.ceph-tpa-server1(a)0(leader).osd e13375 prepare_failure osd.10
>>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>>> from osd.4 is reporting failure:0
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:16.179+0000 7fe929be8700 0 log_channel(cluster)
>>> log [DBG] : osd.10 failure report canceled by osd.4
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[2060428]: debug
>>> 2020-12-13T12:40:16.191+0000 7fb247eaf700 -1 osd.8 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:42.181904+0000 front
>>> 2020-12-13T12:39:42.181856+0000 (oldest deadline
>>> 2020-12-13T12:40:06.281648+0000)
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
>>> 2020-12-13T12:40:15.429755+0000 mon.ceph-tpa-server1 (mon.0) 1172514
>>> : cluster [DBG] osd.10 failure report canceled by osd.6
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
>>> 2020-12-13T12:40:16.183521+0000 mon.ceph-tpa-server1 (mon.0) 1172515
>>> : cluster [DBG] osd.10 failure report canceled by osd.4
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824779]: debug
>>> 2020-12-13T12:40:16.303+0000 7fa60679a700 -1 osd.0 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:43.326792+0000 front
>>> 2020-12-13T12:39:43.326666+0000 (oldest deadline
>>> 2020-12-13T12:40:07.426786+0000)
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:16.371+0000 7fe929be8700 1
>>> mon.ceph-tpa-server1(a)0(leader).osd e13375 prepare_failure osd.10
>>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>>> from osd.3 is reporting failure:0
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:16.371+0000 7fe929be8700 0 log_channel(cluster)
>>> log [DBG] : osd.10 failure report canceled by osd.3
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:16.611+0000 7fe929be8700 1
>>> mon.ceph-tpa-server1(a)0(leader).osd e13375 prepare_failure osd.10
>>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>>> from osd.7 is reporting failure:0
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:16.611+0000 7fe929be8700 0 log_channel(cluster)
>>> log [DBG] : osd.10 failure report canceled by osd.7
>>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
>>> 2020-12-13T12:40:16.979+0000 7f9220af3700 -1 osd.11 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:42.972558+0000 front
>>> 2020-12-13T12:39:42.972702+0000 (oldest deadline
>>> 2020-12-13T12:40:05.272435+0000)
>>> Dec 13 07:40:17 ceph-tpa-server1 bash[1824779]: debug
>>> 2020-12-13T12:40:17.271+0000 7fa60679a700 -1 osd.0 13375
>>> heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back
>>> 2020-12-13T12:39:43.326792+0000 front
>>> 2020-12-13T12:39:43.326666+0000 (oldest deadline
>>> 2020-12-13T12:40:07.426786+0000)
>>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
>>> 2020-12-13T12:40:16.378213+0000 mon.ceph-tpa-server1 (mon.0) 1172516
>>> : cluster [DBG] osd.10 failure report canceled by osd.3
>>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
>>> 2020-12-13T12:40:16.616685+0000 mon.ceph-tpa-server1 (mon.0) 1172517
>>> : cluster [DBG] osd.10 failure report canceled by osd.7
>>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:17.727+0000 7fe929be8700 1
>>> mon.ceph-tpa-server1(a)0(leader).osd e13375 prepare_failure osd.10
>>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>>> from osd.0 is reporting failure:0
>>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:17.727+0000 7fe929be8700 0 log_channel(cluster)
>>> log [DBG] : osd.10 failure report canceled by osd.0
>>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:17.839+0000 7fe929be8700 1
>>> mon.ceph-tpa-server1(a)0(leader).osd e13375 prepare_failure osd.10
>>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>>> from osd.5 is reporting failure:0
>>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:17.839+0000 7fe929be8700 0 log_channel(cluster)
>>> log [DBG] : osd.10 failure report canceled by osd.5
>>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
>>> 2020-12-13T12:40:17.733200+0000 mon.ceph-tpa-server1 (mon.0) 1172518
>>> : cluster [DBG] osd.10 failure report canceled by osd.0
>>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
>>> 2020-12-13T12:40:17.843775+0000 mon.ceph-tpa-server1 (mon.0) 1172519
>>> : cluster [DBG] osd.10 failure report canceled by osd.5
>>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:18.575+0000 7fe929be8700 1
>>> mon.ceph-tpa-server1(a)0(leader).osd e13375 prepare_failure osd.10
>>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>>> from osd.11 is reporting failure:0
>>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:18.575+0000 7fe929be8700 0 log_channel(cluster)
>>> log [DBG] : osd.10 failure report canceled by osd.11
>>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:18.783+0000 7fe929be8700 1
>>> mon.ceph-tpa-server1(a)0(leader).osd e13375 prepare_failure osd.10
>>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710]
>>> from osd.8 is reporting failure:0
>>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>>> 2020-12-13T12:40:18.783+0000 7fe929be8700 0 log_channel(cluster)
>>> log [DBG] : osd.10 failure report canceled by osd.8
>>> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
>>> 2020-12-13T12:40:18.578914+0000 mon.ceph-tpa-server1 (mon.0) 1172520
>>> : cluster [DBG] osd.10 failure report canceled by osd.11
>>> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
>>> 2020-12-13T12:40:18.789301+0000 mon.ceph-tpa-server1 (mon.0) 1172521
>>> : cluster [DBG] osd.10 failure report canceled by osd.8
>>>
>>>
>>> Thanks,
>>> Stefan
>>>
>>>
>>>     On 12/13/20, 2:18 AM, "Igor Fedotov" <ifedotov(a)suse.de> wrote:
>>>
>>> Hi Stefan,
>>>
>>> could you please share OSD startup log from /var/log/ceph?
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>> On 12/13/2020 5:44 AM, Stefan Wild wrote:
>>> > Just had another look at the logs and this is what I did
>>> notice after the affected OSD starts up.
>>> >
>>> > Loads of entries of this sort:
>>> >
>>> > Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug
>>> 2020-12-13T02:38:40.851+0000 7fafd32c7700 1 heartbeat_map
>>> is_healthy 'OSD::osd_op_tp thread 0x7fafb721f700' had timed out
>>> after 15
>>> >
>>> > Then a few pages of this:
>>> >
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9249>
>>> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024
>>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9248>
>>> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024
>>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9247>
>>> 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024
>>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9246>
>>> 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13024
>>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9245>
>>> 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13026
>>> pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9244>
>>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9243>
>>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9242>
>>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9241>
>>> 2020-12-13T02:35:44.022+0000 7fafb721f700 1 osd.1 pg_epoch: 13143
>>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9240>
>>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9239>
>>> 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143
>>> pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9238>
>>> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143
>>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9237>
>>> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143
>>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9236>
>>> 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143
>>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9235>
>>> 2020-12-13T02:35:44.022+0000 7fafb521b700 1 osd.1 pg_epoch: 13143
>>> pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
>>> >
>>> > And this is where it crashes:
>>> >
>>> > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9232>
>>> 2020-12-13T02:35:44.022+0000 7fafd02c1700 0 log_channel(cluster)
>>> log [DBG] : purged_snaps scrub starts
>>> > Dec 12 21:38:57 ceph-tpa-server1 systemd[1]:
>>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142(a)osd.1.service: Main
>>> process exited, code=exited, status=1/FAILURE
>>> > Dec 12 21:38:59 ceph-tpa-server1 systemd[1]:
>>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142(a)osd.1.service: Failed with
>>> result 'exit-code'.
>>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
>>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142(a)osd.1.service: Service
>>> hold-off time over, scheduling restart.
>>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
>>> ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142(a)osd.1.service: Scheduled
>>> restart job, restart counter is at 1.
>>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph
>>> osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
>>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph
>>> osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142...
>>> > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph
>>> osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
>>> >
>>> > Hope that helps…
>>> >
>>> >
>>> > Thanks,
>>> > Stefan
>>> >
>>> >
>>> > From: Stefan Wild <swild(a)tiltworks.com>
>>> > Date: Saturday, December 12, 2020 at 9:35 PM
>>> > To: "ceph-users(a)ceph.io" <ceph-users(a)ceph.io>
>>> > Subject: OSD reboot loop after running out of memory
>>> >
>>> > Hi,
>>> >
>>> > We recently upgraded a cluster from 15.2.1 to 15.2.5. About
>>> two days later, one of the servers ran out of memory for unknown
>>> reasons (normally the machine uses about 60 out of 128 GB). Since
>>> then, some OSDs on that machine get caught in an endless restart
>>> loop. Logs just mention the system seeing the daemon fail and then
>>> restarting it. Since the out-of-memory incident, we've had 3 OSDs
>>> fail this way at separate times. We resorted to wiping the affected
>>> OSD and re-adding it to the cluster, but it seems as soon as all PGs
>>> have moved to the OSD, the next one fails.
>>> >
>>> > This is also keeping us from re-deploying RGW, which was
>>> affected by the same out of memory incident, since cephadm runs a
>>> check and won’t deploy the service unless the cluster is in
>>> HEALTH_OK status.
>>> >
>>> > Any help would be greatly appreciated.
>>> >
>>> > Thanks,
>>> > Stefan
>>> >
>>> > _______________________________________________
>>> > ceph-users mailing list -- ceph-users(a)ceph.io
>>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io