On 30/10/2020 11:28, Wido den Hollander wrote:
> On 29/10/2020 18:58, Dan van der Ster wrote:
>> Hi Wido,
>>
>> Could it be one of these?
>>
>>     mon osd min up ratio
>>     mon osd min in ratio
>>
>> 36/120 is 0.3 so it might be one of those magic ratios at play.
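For reference, those values can be read straight from the mons. A quick
sketch (the defaults in the comments are what I believe Nautilus ships
with, so verify on your own cluster):

$ ceph config get mon mon_osd_min_up_ratio   # default 0.3
$ ceph config get mon mon_osd_min_in_ratio   # default 0.75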
> I thought of those settings and looked at them. The weird thing is
> that all 3 racks are equal and it works as expected in the other two
> racks: there, all 40 OSDs are marked down properly.
>
> These settings should also yield log messages in the MON's log, but
> they don't. Searching for 'ratio' in the logfile doesn't show me
> anything.
>
> It's weird that osd.51 was only marked down after 15 minutes, and only
> because it had stopped sending beacons to the MON. Other OSDs kept
> reporting it as down, but the MONs simply didn't act on those reports.
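Those 15 minutes line up with the MON's beacon timeout. A quick way to
check it (mon_osd_report_timeout defaults to 900 seconds as far as I
know, which matches the "no beacon for 903.943390 seconds" message
further down in this thread):

$ ceph config get mon mon_osd_report_timeout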
See:
https://tracker.ceph.com/issues/48274
2020-11-18T02:33:21.498-0500 7f849ae8c700 0 log_channel(cluster) log
[DBG] : osd.58 reported failed by osd.105
2020-11-18T02:33:21.498-0500 7f849ae8c700 10
mon.CEPH2-MON1-206-U39@0(leader).osd e108196 osd.58 has 1 reporters,
82.847365 grace (20.000000 + 62.8474 + 5.67647e-167), max_failed_since
2020-11-18T02:32:59.498216-0500
2020-11-18T02:33:21.498-0500 7f849ae8c700 10
mon.CEPH2-MON1-206-U39@0(leader).log v9218943 logging
2020-11-18T02:33:21.499338-0500 mon.CEPH2-MON1-206-U39 (mon.0) 165 :
cluster [DBG] osd.58 reported failed by osd.105
The MONs kept adding time to the grace period.
In the end setting mon_osd_adjust_heartbeat_grace to 'false' solved it.
Why? I'm not sure yet.
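For anyone who wants to try the same: the workaround is a single MON
setting. As far as I can tell from OSDMonitor.cc, the three numbers in
the "grace" lines above are the base osd_heartbeat_grace (20 seconds),
an extension based on the failed OSD's own laggy history, and a small
contribution from the reporters' laggy history; disabling the option
below should make the MONs stick to the 20 second base. A sketch, so
verify the behaviour on a test cluster first:

$ ceph config set mon mon_osd_adjust_heartbeat_grace false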
The CRUSH tree, simplified:

  root default
      rack 206
          host A
      rack 207
          host B
      rack 208
          host C
The names of the racks are integers, so I tried renaming them, to
'r206' for example, but that didn't help either.
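For reference, the rename itself is a single command per bucket. A
sketch with my rack names (a rename keeps the bucket ID, so it should
not trigger any data movement, but test it outside production first):

$ ceph osd crush rename-bucket 206 r206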
We achieved our goal, but I'm still not sure why this setting was
preventing the OSDs from being marked down.
Wido
> Wido
>
>>
>> Cheers,
>>
>> Dan
>>
>>
>> On Thu, 29 Oct 2020, 18:05 Wido den Hollander, <wido@42on.com> wrote:
>>
>> Hi,
>>
>> I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked
>> as down when the network is cut to that rack.
>>
>> Situation:
>>
>> - Nautilus cluster
>> - 3 racks
>> - 120 OSDs, 40 per rack
>>
>> We performed a test where we turned off the Top-of-Rack switch for
>> each rack. This worked as expected with two racks, but with the third
>> something weird happened.
>>
>> Of the 40 OSDs which were supposed to be marked down, only 36 were.
>>
>> In the end it took 15 minutes for all 40 OSDs to be marked as down.
>>
>> $ ceph config set mon mon_osd_reporter_subtree_level rack
>>
>> That setting is there to make sure that we only accept failure
>> reports from OSDs in other racks.
>>
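A quick way to confirm what the mons are actually using for this, as a
sketch:

$ ceph config get mon mon_osd_reporter_subtree_level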
>> What we saw in the logs for example:
>>
>> 2020-10-29T03:49:44.409-0400 7fbda185e700 10
>> mon.CEPH2-MON1-206-U39@0(leader).osd e107102 osd.51 has 54
>> reporters,
>> 239.856038 grace (20.000000 + 219.856 + 7.43801e-23),
>> max_failed_since
>> 2020-10-29T03:47:22.374857-0400
>>
>> But osd.51 was still not marked down even after 54 reporters had
>> reported that it was actually down.
>>
>> I checked: no ping or other traffic was possible to osd.51. The host
>> was unreachable.
>>
>> Another OSD was marked down, but it took a couple of minutes as
>> well:
>>
>> 2020-10-29T03:50:54.455-0400 7fbda185e700 10
>> mon.CEPH2-MON1-206-U39@0(leader).osd e107102 osd.37 has 48
>> reporters,
>> 221.378970 grace (20.000000 + 201.379 + 6.34437e-23),
>> max_failed_since
>> 2020-10-29T03:47:12.761584-0400
>> 2020-10-29T03:50:54.455-0400 7fbda185e700 1
>> mon.CEPH2-MON1-206-U39@0(leader).osd e107102 we have enough
>> reporters
>> to mark osd.37 down
>>
>> In the end osd.51 was marked down, but only when the MON noticed it
>> had stopped sending beacons:
>>
>> 2020-10-29T03:53:44.631-0400 7fbda185e700 0 log_channel(cluster) log
>> [INF] : osd.51 marked down after no beacon for 903.943390 seconds
>> 2020-10-29T03:53:44.631-0400 7fbda185e700 -1
>> mon.CEPH2-MON1-206-U39@0(leader).osd e107104 no beacon from osd.51
>> since
>> 2020-10-29T03:38:40.689062-0400, 903.943390 seconds ago. marking
>> down
>>
>> I haven't seen this happen before in any cluster. It's also strange
>> that this only happens in this one rack; the other two racks work
>> fine.
>>
>> ID    CLASS  WEIGHT      TYPE NAME
>>   -1         1545.35999  root default
>> -206          515.12000      rack 206
>>   -7           27.94499          host CEPH2-206-U16
>> ...
>> -207          515.12000      rack 207
>>  -17           27.94499          host CEPH2-207-U16
>> ...
>> -208          515.12000      rack 208
>>  -31           27.94499          host CEPH2-208-U16
>> ...
>>
>> That's what the CRUSH map looks like: straightforward, with 3x
>> replication over the 3 racks.
>>
>> This issue only occurs in rack *207*.
>>
>> Has anybody seen this before or knows where to start?
>>
>> Wido