Hi Francois,
Given that slow operations are observed for collection listings, you
might want to manually compact RocksDB using ceph-kvstore-tool.
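A minimal sketch, assuming the OSD id from your log (osd.16) and the
default data path - adjust both to your deployment. The OSD has to be
stopped first, and offline compaction of a large DB can take a while:

   systemctl stop ceph-osd@16
   # compact the RocksDB instance living on BlueFS in the OSD data dir
   ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-16 compact
   systemctl start ceph-osd@16

If you'd rather not take the OSD down, an online compaction via the
admin socket (ceph daemon osd.16 compact) should work on Nautilus too.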
The observed slowdown tends to happen after massive data removals. I've
seen multiple complaints about this issue, including some posts in this
mailing list. BTW, I can see your post from Jun 24 about slow pool
removal - couldn't that have been the trigger?
I'm also wondering: do you have a standalone fast (SSD/NVMe) drive for
DB/WAL? And are there any BlueFS spillovers which might be relevant?
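For example (again using osd.16 from your log - any OSD works the same
way):

   # cluster-wide check: Nautilus raises a BLUEFS_SPILLOVER health
   # warning when DB data spills onto the slow device
   ceph health detail | grep -i spillover
   # per-OSD view: non-zero slow_used_bytes means RocksDB data has
   # spilled over from the fast DB device onto the HDD
   ceph daemon osd.16 perf dump bluefs | grep -E '"db_|"slow_'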
Thanks,
Igor
On 8/28/2020 11:33 AM, Francois Legrand wrote:
> Hi all,
>
> We have a ceph cluster in production with 6 OSD servers (16x8TB
> disks each), 3 mons/mgrs and 3 mdss. Both public and cluster networks
> are 10Gb and work well.
>
> After a major crash in April, we turned the option bluefs_buffered_io
> to false to work around the large-write bug that shows up when
> bluefs_buffered_io is true (we were on version 14.2.8, and the default
> value at that time was true).
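> (For reference - not necessarily the exact commands we ran - with the
> mon config database this kind of change can be applied as:
>
>    ceph config set osd bluefs_buffered_io false
>    ceph config get osd bluefs_buffered_io
>
> or set in the [osd] section of ceph.conf followed by an OSD restart.)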
> Since then, we regularly have some OSDs wrongly marked down by
> the cluster after a heartbeat timeout (heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15).
>
> Generally the OSD restarts and the cluster goes back to healthy, but
> several times, after many of these kick-offs, an OSD reached the
> osd_op_thread_suicide_timeout and went down for good.
>
> We increased osd_op_thread_timeout and
> osd_op_thread_suicide_timeout... The problem still occurs (but less
> frequently).
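> (For reference, a change along these lines - the values shown here
> are illustrative, not the ones we actually used:
>
>    ceph config set osd osd_op_thread_timeout 60
>    ceph config set osd osd_op_thread_suicide_timeout 300
> )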
>
> A few days ago, we upgraded to 14.2.11 and reverted the timeouts to
> their default values, hoping it would solve the problem (we thought it
> could be related to this bug:
> https://tracker.ceph.com/issues/45943), but it didn't. We still have
> some OSDs wrongly marked down.
>
> Can somebody help us fix this problem?
> Thanks.
>
> Here is an extract of an OSD log at failure time:
>
> ---------------------------------
> 2020-08-28 02:19:05.019 7f03f1384700 0 log_channel(cluster) log [DBG]
> : 44.7d scrub starts
> 2020-08-28 02:19:25.755 7f040e43d700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
> 2020-08-28 02:19:25.755 7f040dc3c700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
> (this last line is repeated more than 1000 times)
> ...
> 2020-08-28 02:20:17.484 7f040d43b700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
> 2020-08-28 02:20:17.551 7f03f1384700 0
> bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation
> observed for _collection_list, latency = 67.3532s, lat = 67s cid
> =44.7d_head start GHMAX end GHMAX max 25
> ...
> 2020-08-28 02:20:22.600 7f040dc3c700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
> 2020-08-28 02:21:20.774 7f03f1384700 0
> bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation
> observed for _collection_list, latency = 63.223s, lat = 63s cid
> =44.7d_head start
> #44:beffc78d:::rbd_data.1e48e8ab988992.00000000000011bd:0# end #MAX#
> max 2147483647
> 2020-08-28 02:21:20.774 7f03f1384700 1 heartbeat_map reset_timeout
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
> 2020-08-28 02:21:20.805 7f03f1384700 0 log_channel(cluster) log [DBG]
> : 44.7d scrub ok
> 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [WRN]
> : Monitor daemon marked osd.16 down, but it is still running
> 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [DBG]
> : map e609411 wrongly marked me down at e609410
> 2020-08-28 02:21:21.099 7f03fd997700 1 osd.16 609411
> start_waiting_for_healthy
> 2020-08-28 02:21:21.119 7f03fd997700 1 osd.16 609411 start_boot
> 2020-08-28 02:21:21.124 7f03f0b83700 1 osd.16 pg_epoch: 609410
> pg[36.3d0( v 609409'481293 (449368'478292,609409'481293]
> local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c
> 609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72]
> r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198
> crt=609409'481293 lcod 609409'481292 active mbc={}]
> start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16] ->
> [25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 -> -1,
> features acting 4611087854031667199 upacting 4611087854031667199
> ...
> 2020-08-28 02:21:21.166 7f03f0b83700 1 osd.16 pg_epoch: 609411
> pg[36.56( v 609409'480511 (449368'477424,609409'480511]
> local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c
> 609403/609403 les/c/f 609404/609404/0 609410/609410/609410) [103,102]
> r=-1 lpr=609410 pi=[609403,609410)/1 crt=609409'480511 lcod
> 609409'480510 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
> set_numa_affinity public network em1 numa node 0
> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
> set_numa_affinity cluster network em2 numa node 0
> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
> set_numa_affinity objectstore and network numa nodes do not match
> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
> set_numa_affinity not setting numa affinity
> 2020-08-28 02:21:21.566 7f040a435700 1 osd.16 609413 tick checking
> mon for new map
> 2020-08-28 02:21:22.515 7f03fd997700 1 osd.16 609414 state: booting
> -> active
> 2020-08-28 02:21:22.515 7f03f0382700 1 osd.16 pg_epoch: 609414
> pg[36.20( v 609409'483167 (449368'480117,609409'483167]
> local-lis/les=609403/609404 n=155171 ec=435353/435353 lis/c
> 609403/609403 les/c/f 609404/609404/0 609414/609414/609361) [97,16,72]
> r=1 lpr=609414 pi=[609403,609414)/1 crt=609409'483167 lcod
> 609409'483166 unknown NOTIFY mbc={}] start_peering_interval up [97,72]
> -> [97,16,72], acting [97,72] -> [97,16,72], acting_primary 97 -> 97,
> up_primary 97 -> 97, role -1 -> 1, features acting 4611087854031667199
> upacting 4611087854031667199
> ...
> 2020-08-28 02:21:22.522 7f03f1384700 1 osd.16 pg_epoch: 609414
> pg[36.2f3( v 609409'479796 (449368'476712,609409'479796]
> local-lis/les=609403/609404 n=154451 ec=435353/435353 lis/c
> 609403/609403 les/c/f 609404/609404/0 609414/609414/609414) [16,34,21]
> r=0 lpr=609414 pi=[609403,609414)/1 crt=609409'479796 lcod
> 609409'479795 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval
> up [34,21] -> [16,34,21], acting [34,21] -> [16,34,21], acting_primary
> 34 -> 16, up_primary 34 -> 16, role -1 -> 0, features acting
> 4611087854031667199 upacting 4611087854031667199
> 2020-08-28 02:21:22.522 7f03f1384700 1 osd.16 pg_epoch: 609414
> pg[36.2f3( v 609409'479796 (449368'476712,609409'479796]
> local-lis/les=609403/609404 n=154451 ec=435353/435353 lis/c
> 609403/609403 les/c/f 609404/609404/0 609414/609414/609414) [16,34,21]
> r=0 lpr=609414 pi=[609403,609414)/1 crt=609409'479796 lcod
> 609409'479795 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to
> Primary
> 2020-08-28 02:21:24.738 7f03f1384700 0 log_channel(cluster) log [DBG]
> : 36.2f3 scrub starts
> 2020-08-28 02:22:18.857 7f03f1384700 0 log_channel(cluster) log [DBG]
> : 36.2f3 scrub ok