On 31/08/2020 15:44, Francois Legrand wrote:
> Thanks Igor for your answer,
> We could try a manual compaction of RocksDB, but it's not clear to me
> whether we have to compact on the mon with something like
> ceph-kvstore-tool rocksdb /var/lib/ceph/mon/mon01/store.db/ compact
> or on the affected osd with
> ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-16/ compact
> (or for all osds with a script like
> https://gist.github.com/wido/b0f0200bd1a2cbbe3307265c5cfb2771 )
You would compact the OSDs, not the MONs. So the last command, or my
script that you linked.
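For reference, a minimal sketch (not Wido's actual script) of offline-compacting every OSD on one host, one at a time. It assumes systemd-managed BlueStore OSDs; note that for BlueStore OSDs the tool is invoked with the "bluestore-kv" backend rather than "rocksdb", since the DB lives inside BlueFS. Unit names and paths may differ on your deployment:

```shell
#!/bin/sh
# Sketch: offline-compact each OSD on this host, one at a time.
# Assumes systemd-managed BlueStore OSDs (hypothetical paths/units).
for osd in /var/lib/ceph/osd/ceph-*; do
    id="${osd##*-}"                        # /var/lib/ceph/osd/ceph-16 -> 16
    systemctl stop "ceph-osd@${id}"
    ceph-kvstore-tool bluestore-kv "${osd}" compact
    systemctl start "ceph-osd@${id}"
    # let the cluster settle before taking the next OSD down
    while ! ceph health | grep -q HEALTH_OK; do sleep 10; done
done
```

You would normally `ceph osd set noout` first so the stopped OSD isn't rebalanced away while it is down, and `ceph osd unset noout` afterwards.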
> Out of curiosity, how does compaction work? Is it done automatically
> in the background, regularly, at startup?
Usually it's done by the OSD in the background, but sometimes an offline
compact works best.
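If an offline run is not practical, recent Ceph releases can also trigger a compaction on a running OSD through the admin socket; availability of this command depends on your release, so treat it as a sketch:

```shell
# Ask a running OSD to compact its RocksDB via the admin socket
# (no downtime, but the compaction I/O competes with client traffic).
ceph daemon osd.16 compact
```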
> Because in the osd logs we see, every 10 minutes, reports about
> compaction (which suggests that compaction occurs regularly), like:
Yes, that is normal. But an offline compaction is sometimes more
effective than the online ones.
> 2020-08-31 15:06:55.448 7f03fb398700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
> 2020-08-31 15:06:55.448 7f03fb398700 4 rocksdb: [db/db_impl.cc:778]
> ** DB Stats **
> Uptime(secs): 449404.8 total, 600.0 interval
> Cumulative writes: 136K writes, 692K keys, 136K commit groups, 1.0 writes per commit group, ingest: 0.28 GB, 0.00 MB/s
> Cumulative WAL: 136K writes, 67K syncs, 2.04 writes per sync, written: 0.28 GB, 0.00 MB/s
> Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
> Interval writes: 128 writes, 336 keys, 128 commit groups, 1.0 writes per commit group, ingest: 0.22 MB, 0.00 MB/s
> Interval WAL: 128 writes, 64 syncs, 1.97 writes per sync, written: 0.00 MB, 0.00 MB/s
> Interval stall: 00:00:0.000 H:M:S, 0.0 percent
>
> ** Compaction Stats [default] **
> Level  Files  Size      Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
> --------------------------------------------------------------------------------------------------------------------------------------------------
> L0     1/0    60.48 MB  0.2    0.0       0.0     0.0       0.1        0.1       0.0        1.0    0.0       163.7     0.52       0.40               2          0.258     0      0
> L1     0/0    0.00 KB   0.0    0.1       0.1     0.0       0.1        0.1       0.0        0.5    48.2      26.1      2.32       0.64               1          2.319     920K   197K
> L2     17/0   1.00 GB   0.8    1.1       0.1     1.1       1.1        0.0       0.0        18.3   69.8      67.5      16.38      4.97               1          16.380    4747K  82K
> L3     81/0   4.50 GB   0.9    0.6       0.1     0.5       0.3        -0.2      0.0        4.3    66.9      36.6      9.23       4.95               2          4.617     9544K  802K
> L4     285/0  16.64 GB  0.1    2.4       0.3     2.0       0.2        -1.8      0.0        0.8    110.3     11.7      21.92      4.37               5          4.384     12M    12M
> Sum    384/0  22.20 GB  0.0    4.2       0.6     3.6       1.8        -1.8      0.0        21.8   85.2      36.6      50.37      15.32              11         4.579     28M    13M
> Int    0/0    0.00 KB   0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.00       0.00               0          0.000     0      0
>
> ** Compaction Stats [default] **
> Priority  Files  Size     Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
> --------------------------------------------------------------------------------------------------------------------------------------------------
> Low       0/0    0.00 KB  0.0    4.2       0.6     3.6       1.7        -1.9      0.0        0.0    86.0      35.3      49.86      14.92              9          5.540     28M    13M
> High      0/0    0.00 KB  0.0    0.0       0.0     0.0       0.1        0.1       0.0        0.0    0.0       150.2     0.40       0.40               1          0.403     0      0
> User      0/0    0.00 KB  0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       211.7     0.11       0.00               1          0.114     0      0
>
> Uptime(secs): 449404.8 total, 600.0 interval
> Flush(GB): cumulative 0.083, interval 0.000
> AddFile(Total Files): cumulative 0, interval 0
> AddFile(L0 Files): cumulative 0, interval 0
> AddFile(Keys): cumulative 0, interval 0
> Cumulative compaction: 1.80 GB write, 0.00 MB/s write, 4.19 GB read, 0.01 MB/s read, 50.4 seconds
> Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
> Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
> Concerning the data removal, I don't know if this could be the
> trigger. We had some osds marked down before starting the removal,
> but at that time the situation was so confused that I cannot be sure
> the origin of the problem was the same. Indeed, the large removal
> concerned data in an old pool which has since been destroyed (so all
> the PGs of that former pool no longer exist). And the cluster is now
> rather inactive because of the holiday period (only a few KB/s of I/O
> and no recovery).
> We have no standalone fast drive for DB/WAL, and nothing in the osd
> (nor mon) logs suggests any problem (apart from the heartbeat_map
> is_healthy timeout).
Can you also check whether the %util of the disks spikes to 100% during
compaction? I suspect it does. Your DB isn't offloaded to an SSD, and
that probably causes the HDD to become overloaded with IOPS.
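One way to check is to sample extended device statistics with iostat while a compaction is in flight and watch the %util column. A rough sketch, assuming the sysstat package is installed and the OSD's data device is sdb (both assumptions; the column layout also varies between iostat versions, though %util is typically last):

```shell
# Sample extended stats every second and flag samples where %util
# (the last column of "iostat -x" device lines) exceeds 90%.
iostat -x 1 sdb | awk '/^sd/ { if ($NF+0 > 90) print $1, "%util =", $NF }'
```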
Wido
> Thanks
> F.
>
>
> On 31/08/2020 12:15, Igor Fedotov wrote:
>> Hi Francois,
>>
>> given that slow operations are observed for collection listings you
>> might want to manually compact RocksDB using ceph-kvstore-tool.
>>
>> The observed slowdown tends to happen after massive data removals.
>> I've seen multiple complaints about this issue, including some posts
>> in this mailing list. BTW I can see your post from Jun 24 about slow
>> pool removal - couldn't this be a trigger?
>>
>> Also wondering whether you have a standalone fast (SSD/NVMe) drive
>> for DB/WAL? Aren't there any BlueFS spillovers which might be
>> relevant?
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 8/28/2020 11:33 AM, Francois Legrand wrote:
>>> Hi all,
>>>
>>> We have a ceph cluster in production with 6 osd servers (16x8TB
>>> disks each), 3 mons/mgrs and 3 mdss. Both public and cluster
>>> networks are 10Gb and work well.
>>>
>>> After a major crash in April, we turned the option bluefs_buffered_io
>>> to false to work around the large-write bug that occurs when
>>> bluefs_buffered_io is true (we were on version 14.2.8, and the
>>> default value at that time was true).
>>> Since then, we regularly have some osds wrongly marked down by the
>>> cluster after a heartbeat timeout (heartbeat_map is_healthy
>>> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15).
>>>
>>> Generally the osd restarts and the cluster goes back to healthy,
>>> but several times, after many of these kick-offs, the osd reached
>>> the osd_op_thread_suicide_timeout and went down for good.
>>>
>>> We increased osd_op_thread_timeout and
>>> osd_op_thread_suicide_timeout... The problem still occurs (but less
>>> frequently).
>>>
>>> A few days ago, we upgraded to 14.2.11 and reverted the timeouts to
>>> their default values, hoping that it would solve the problem (we
>>> thought it might be related to this bug:
>>> https://tracker.ceph.com/issues/45943), but it didn't. We still
>>> have some osds wrongly marked down.
>>>
>>> Can somebody help us fix this problem?
>>> Thanks.
>>>
>>> Here is an extract of an osd log at failure time:
>>>
>>> ---------------------------------
>>> 2020-08-28 02:19:05.019 7f03f1384700 0 log_channel(cluster) log
>>> [DBG] : 44.7d scrub starts
>>> 2020-08-28 02:19:25.755 7f040e43d700 1 heartbeat_map is_healthy
>>> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
>>> 2020-08-28 02:19:25.755 7f040dc3c700 1 heartbeat_map is_healthy
>>> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
>>> this last line is repeated more than 1000 times
>>> ...
>>> 2020-08-28 02:20:17.484 7f040d43b700 1 heartbeat_map is_healthy
>>> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
>>> 2020-08-28 02:20:17.551 7f03f1384700 0
>>> bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation
>>> observed for _collection_list, latency = 67.3532s, lat = 67s cid
>>> =44.7d_head start GHMAX end GHMAX max 25
>>> ...
>>> 2020-08-28 02:20:22.600 7f040dc3c700 1 heartbeat_map is_healthy
>>> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
>>> 2020-08-28 02:21:20.774 7f03f1384700 0
>>> bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation
>>> observed for _collection_list, latency = 63.223s, lat = 63s cid
>>> =44.7d_head start
>>> #44:beffc78d:::rbd_data.1e48e8ab988992.00000000000011bd:0# end #MAX#
>>> max 2147483647
>>> 2020-08-28 02:21:20.774 7f03f1384700 1 heartbeat_map reset_timeout
>>> 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15
>>> 2020-08-28 02:21:20.805 7f03f1384700 0 log_channel(cluster) log
>>> [DBG] : 44.7d scrub ok
>>> 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log
>>> [WRN] : Monitor daemon marked osd.16 down, but it is still running
>>> 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log
>>> [DBG] : map e609411 wrongly marked me down at e609410
>>> 2020-08-28 02:21:21.099 7f03fd997700 1 osd.16 609411
>>> start_waiting_for_healthy
>>> 2020-08-28 02:21:21.119 7f03fd997700 1 osd.16 609411 start_boot
>>> 2020-08-28 02:21:21.124 7f03f0b83700 1 osd.16 pg_epoch: 609410
>>> pg[36.3d0( v 609409'481293 (449368'478292,609409'481293]
>>> local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c
>>> 609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72]
>>> r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198
>>> crt=609409'481293 lcod 609409'481292 active mbc={}]
>>> start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16] ->
>>> [25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 -> -1,
>>> features acting 4611087854031667199 upacting 4611087854031667199
>>> ...
>>> 2020-08-28 02:21:21.166 7f03f0b83700 1 osd.16 pg_epoch: 609411
>>> pg[36.56( v 609409'480511 (449368'477424,609409'480511]
>>> local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c
>>> 609403/609403 les/c/f 609404/609404/0 609410/609410/609410) [103,102]
>>> r=-1 lpr=609410 pi=[609403,609410)/1 crt=609409'480511 lcod
>>> 609409'480510 unknown NOTIFY mbc={}] state<Start>: transitioning to
>>> Stray
>>> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
>>> set_numa_affinity public network em1 numa node 0
>>> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
>>> set_numa_affinity cluster network em2 numa node 0
>>> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
>>> set_numa_affinity objectstore and network numa nodes do not match
>>> 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413
>>> set_numa_affinity not setting numa affinity
>>> 2020-08-28 02:21:21.566 7f040a435700 1 osd.16 609413 tick checking
>>> mon for new map
>>> 2020-08-28 02:21:22.515 7f03fd997700 1 osd.16 609414 state: booting
>>> -> active
>>> 2020-08-28 02:21:22.515 7f03f0382700 1 osd.16 pg_epoch: 609414
>>> pg[36.20( v 609409'483167 (449368'480117,609409'483167]
>>> local-lis/les=609403/609404 n=155171 ec=435353/435353 lis/c
>>> 609403/609403 les/c/f 609404/609404/0 609414/609414/609361)
>>> [97,16,72] r=1 lpr=609414 pi=[609403,609414)/1 crt=609409'483167 lcod
>>> 609409'483166 unknown NOTIFY mbc={}] start_peering_interval up
>>> [97,72] -> [97,16,72], acting [97,72] -> [97,16,72], acting_primary
>>> 97 -> 97, up_primary 97 -> 97, role -1 -> 1, features acting
>>> 4611087854031667199 upacting 4611087854031667199
>>> ...
>>> 2020-08-28 02:21:22.522 7f03f1384700 1 osd.16 pg_epoch: 609414
>>> pg[36.2f3( v 609409'479796 (449368'476712,609409'479796]
>>> local-lis/les=609403/609404 n=154451 ec=435353/435353 lis/c
>>> 609403/609403 les/c/f 609404/609404/0 609414/609414/609414)
>>> [16,34,21] r=0 lpr=609414 pi=[609403,609414)/1 crt=609409'479796 lcod
>>> 609409'479795 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval
>>> up [34,21] -> [16,34,21], acting [34,21] -> [16,34,21],
>>> acting_primary 34 -> 16, up_primary 34 -> 16, role -1 -> 0, features
>>> acting 4611087854031667199 upacting 4611087854031667199
>>> 2020-08-28 02:21:22.522 7f03f1384700 1 osd.16 pg_epoch: 609414
>>> pg[36.2f3( v 609409'479796 (449368'476712,609409'479796]
>>> local-lis/les=609403/609404 n=154451 ec=435353/435353 lis/c
>>> 609403/609403 les/c/f 609404/609404/0 609414/609414/609414)
>>> [16,34,21] r=0 lpr=609414 pi=[609403,609414)/1 crt=609409'479796 lcod
>>> 609409'479795 mlcod 0'0 unknown mbc={}] state<Start>: transitioning
>>> to Primary
>>> 2020-08-28 02:21:24.738 7f03f1384700 0 log_channel(cluster) log
>>> [DBG] : 36.2f3 scrub starts
>>> 2020-08-28 02:22:18.857 7f03f1384700 0 log_channel(cluster) log
>>> [DBG] : 36.2f3 scrub ok
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io