Hi Francois,
Given that slow operations are observed for collection listings, you
might want to manually compact RocksDB using ceph-kvstore-tool.
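A minimal sketch, assuming the OSD id from your log (osd.16) and the
default data path - adjust both to your deployment. The OSD has to be
stopped first, and offline compaction of a large DB can take a while:

   systemctl stop ceph-osd@16
   # compact the RocksDB instance living on BlueFS in the OSD data dir
   ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-16 compact
   systemctl start ceph-osd@16

If you'd rather not take the OSD down, an online compaction via the
admin socket (ceph daemon osd.16 compact) should work on Nautilus too.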
The observed slowdown tends to happen after massive data removals. I've
seen multiple complaints about this issue, including some posts in this
mailing list. BTW, I can see your post from Jun 24 about slow pool
removal - couldn't that have been the trigger?
I'm also wondering: do you have a standalone fast (SSD/NVMe) drive for
DB/WAL? And are there any BlueFS spillovers which might be relevant?
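For example (again using osd.16 from your log - any OSD works the same
way):

   # cluster-wide check: Nautilus raises a BLUEFS_SPILLOVER health
   # warning when DB data spills onto the slow device
   ceph health detail | grep -i spillover
   # per-OSD view: non-zero slow_used_bytes means RocksDB data has
   # spilled over from the fast DB device onto the HDD
   ceph daemon osd.16 perf dump bluefs | grep -E '"db_|"slow_'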
Thanks,
Igor
On 8/28/2020 11:33 AM, Francois Legrand wrote:
> Hi all,
>
> We have a ceph cluster in production with 6 OSD servers (16x8TB
> disks each), 3 mons/mgrs and 3 mdss. Both public and cluster networks
> are 10Gb and work well.
>
> After a major crash in April, we turned the option bluefs_buffered_io
> to false to work around the large-write bug that shows up when
> bluefs_buffered_io is true (we were on version 14.2.8, and the default
> value at that time was true).
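> (For reference - not necessarily the exact commands we ran - with the
> mon config database this kind of change can be applied as:
>
>    ceph config set osd bluefs_buffered_io false
>    ceph config get osd bluefs_buffered_io
>
> or set in the [osd] section of ceph.conf followed by an OSD restart.)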
> Since then, we regularly have some OSDs wrongly marked down by
> the cluster after a heartbeat timeout (heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15).
>
> Generally the OSD restarts and the cluster goes back to healthy, but
> several times, after many of these kick-offs, an OSD reached the
> osd_op_thread_suicide_timeout and went down for good.
>
> We increased osd_op_thread_timeout and
> osd_op_thread_suicide_timeout... The problem still occurs (but less
> frequently).
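> (For reference, a change along these lines - the values shown here
> are illustrative, not the ones we actually used:
>
>    ceph config set osd osd_op_thread_timeout 60
>    ceph config set osd osd_op_thread_suicide_timeout 300
> )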
>
> A few days ago, we upgraded to 14.2.11 and reverted the timeouts to
> their default values, hoping it would solve the problem (we thought it
> could be related to this bug:
> https://tracker.ceph.com/issues/45943), but it didn't. We still have
> some OSDs wrongly marked down.
>
> Can somebody help us fix this problem?
> Thanks.
>
> Here is an extract of an OSD log at failure time:
>
> ---------------------------------
> 2020-08-28 02:19:05.019 7f03f1384700 0 log_channel(cluster) log [DBG]
> : 44.7d scrub starts
> 2020-08-28 02:19:25.755 7f040e43d700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
> 2020-08-28 02:19:25.755 7f040dc3c700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
> (this last line is repeated more than 1000 times)
> ...
> 2020-08-28 02:20:17.484 7f040d43b700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
> 2020-08-28 02:20:17.551 7f03f1384700 0
> bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation
> observed for _collection_list, latency = 67.3532s, lat = 67s cid
> =44.7d_head start GHMAX end GHMAX max 25
> ...
> 2020-08-28 02:20:22.600 7f040dc3c700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
> 2020-08-28 02:21:20.774 7f03f1384700 0
> bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation
> observed for _collection_list, latency = 63.223s, lat = 63s cid
> =44.7d_head start
> #44:beffc78d:::rbd_data.1e48e8ab988992.00000000000011bd:0# end #MAX#
> max 2147483647
> 2020-08-28 02:21:20.774 7f03f1384700 1 heartbeat_map reset_timeout
> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
> 2020-08-28 02:21:20.805 7f03f1384700 0 log_channel(cluster) log [DBG]
> : 44.7d scrub ok
> 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [WRN]
> : Monitor daemon marked osd.16 down, but it is still running
> 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [DBG]
> : map e609411 wrongly marked me down at e609410
> 2020-08-28 02:21:21.099 7f03fd997700 1 osd.16 609411
> start_waiting_for_healthy
> 2020-08-28 02:21:21.119 7f03fd997700 1 osd.16 609411 start_boot
> 2020-08-28 02:21:21.124 7f03f0b83700 1 osd.16 pg_epoch: 609410
> pg[36.3d0( v 609409'481293 (449368'478292,609409'481293]
> local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c
> 609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72]
> r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198
> crt=609409'481293 lcod 609409'481292 active mbc={}]
> start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16] ->
> [25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 -> -1,
> features acting 4611087854031667199 upacting 4611087854031667199
> ...
> 2020-08-28 02:21:21.166 7f03f0b83700 1 osd.16 pg_epoch: 609411
> pg[36.56( v 609409'480511 (449368'477424,609409'480511]
> local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c
> 609403/609403 les/c/f 609404/609404/0 609410/609410/609410) [103,102]
> r=-1 lpr=609410 pi=[609403,609410)/1 crt=609409'480511 lcod
> 609409'480510 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
> set_numa_affinity public network em1 numa node 0
> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
> set_numa_affinity cluster network em2 numa node 0
> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
> set_numa_affinity objectstore and network numa nodes do not match
> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
> set_numa_affinity not setting numa affinity
> 2020-08-28 02:21:21.566 7f040a435700 1 osd.16 609413 tick checking
> mon for new map
> 2020-08-28 02:21:22.515 7f03fd997700 1 osd.16 609414 state: booting
> -> active
> 2020-08-28 02:21:22.515 7f03f0382700 1 osd.16 pg_epoch: 609414
> pg[36.20( v 609409'483167 (449368'480117,609409'483167]
> local-lis/les=609403/609404 n=155171 ec=435353/435353 lis/c
> 609403/609403 les/c/f 609404/609404/0 609414/609414/609361) [97,16,72]
> r=1 lpr=609414 pi=[609403,609414)/1 crt=609409'483167 lcod
> 609409'483166 unknown NOTIFY mbc={}] start_peering_interval up [97,72]
> -> [97,16,72], acting [97,72] -> [97,16,72], acting_primary 97 -> 97,
> up_primary 97 -> 97, role -1 -> 1, features acting 4611087854031667199
> upacting 4611087854031667199
> ...
> 2020-08-28 02:21:22.522 7f03f1384700 1 osd.16 pg_epoch: 609414
> pg[36.2f3( v 609409'479796 (449368'476712,609409'479796]
> local-lis/les=609403/609404 n=154451 ec=435353/435353 lis/c
> 609403/609403 les/c/f 609404/609404/0 609414/609414/609414) [16,34,21]
> r=0 lpr=609414 pi=[609403,609414)/1 crt=609409'479796 lcod
> 609409'479795 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval
> up [34,21] -> [16,34,21], acting [34,21] -> [16,34,21], acting_primary
> 34 -> 16, up_primary 34 -> 16, role -1 -> 0, features acting
> 4611087854031667199 upacting 4611087854031667199
> 2020-08-28 02:21:22.522 7f03f1384700 1 osd.16 pg_epoch: 609414
> pg[36.2f3( v 609409'479796 (449368'476712,609409'479796]
> local-lis/les=609403/609404 n=154451 ec=435353/435353 lis/c
> 609403/609403 les/c/f 609404/609404/0 609414/609414/609414) [16,34,21]
> r=0 lpr=609414 pi=[609403,609414)/1 crt=609409'479796 lcod
> 609409'479795 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to
> Primary
> 2020-08-28 02:21:24.738 7f03f1384700 0 log_channel(cluster) log [DBG]
> : 36.2f3 scrub starts
> 2020-08-28 02:22:18.857 7f03f1384700 0 log_channel(cluster) log [DBG]
> : 36.2f3 scrub ok