Increasing pg_num (PG splitting) is an expensive operation. You want to slow it down, not push the OSDs harder.
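For reference, a more conservative approach is to leave the mgr throttle at its default and let it pace the splits itself (a sketch; 0.05 is the Nautilus default value, and <pool-name> is a placeholder):

```shell
# Throttle rebalancing: allow at most 5% of objects to be misplaced
# at any one time (0.05 is the Nautilus default).
ceph config set mgr target_max_misplaced_ratio 0.05

# Then raise pg_num; the mgr steps pgp_num up gradually, keeping the
# amount of misplaced data under the ratio above.
ceph osd pool set <pool-name> pg_num 256
```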
> On Mar 21, 2020, at 5:46 AM, Jan Pekař - Imatic <jan.pekar(a)imatic.cz> wrote:
>
> Each node has 64GB RAM so it should be enough (12 OSD's = 48GB used).
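As a sanity check, the arithmetic with the default 4 GiB osd_memory_target looks like this (a sketch; the actual target should be verified on the cluster with `ceph config get osd osd_memory_target`, and this ignores OS and other daemon overhead):

```python
# Rough per-node memory budget for BlueStore OSDs.
# Assumes the default osd_memory_target of 4 GiB per OSD.
osds_per_node = 12
osd_memory_target_gib = 4
node_ram_gib = 64

budget_gib = osds_per_node * osd_memory_target_gib
headroom_gib = node_ram_gib - budget_gib
print(budget_gib, headroom_gib)  # 48 16
```

Note that osd_memory_target is a target, not a hard limit, so OSDs can spike above it during recovery; 16 GiB of headroom is workable but not generous.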
>
>> On 21/03/2020 13.14, XuYun wrote:
>> Bluestore requires more than 4G memory per OSD, do you have enough memory?
>>
>>> 2020年3月21日 下午8:09,Jan Pekař - Imatic <jan.pekar(a)imatic.cz> 写道:
>>>
>>> Hello,
>>>
>>> I have a ceph cluster, version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable).
>>>
>>> 4 nodes - each node 11 HDD, 1 SSD, 10Gbit network
>>>
>>> The cluster was empty, a fresh install. We filled it with data (small blocks) using RGW.
>>>
>>> The cluster is now used for testing, so no client was using it during the admin operations mentioned below.
>>>
>>> After a while (7 TB of data / 40M objects uploaded) we decided to increase pg_num from 128 to 256 to spread the data better. To speed up this operation, I set
>>>
>>> ceph config set mgr target_max_misplaced_ratio 1
>>>
>>> so that the whole cluster rebalances as quickly as it can.
>>>
>>> I have 3 issues/questions below:
>>>
>>> 1)
>>>
>>> I noticed that the manual increase from 128 to 256 caused approx. 6 OSDs to restart, logging
>>>
>>> heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7f8c84b8b700' had suicide timed out after 150
>>>
>>> After a while the OSDs were back, so I continued with my tests.
>>>
>>> My question: was increasing the number of PGs with the maximal target_max_misplaced_ratio too much for those OSDs? Is it not recommended to do it this way? I had no problem with this increase before, but the cluster configuration was slightly different and it was a luminous version.
>>>
>>> 2)
>>>
>>> Rebuilding was still slow, so I increased the number of backfills
>>>
>>> ceph tell osd.* injectargs "--osd-max-backfills 10"
>>>
>>> and reduced recovery sleep time
>>>
>>> ceph tell osd.* injectargs "--osd-recovery-sleep-hdd 0.01"
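For comparison, the Nautilus defaults these commands override are osd_max_backfills=1 and osd_recovery_sleep_hdd=0.1; backing off to them again (a sketch) would look like:

```shell
# Revert to the defaults if OSDs start flapping under recovery load.
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-sleep-hdd 0.1'
```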
>>>
>>> and after a few hours I noticed that some of my OSDs were restarted during recovery; in the log I can see
>>>
>>> ...
>>>
>>> 2020-03-21 06:41:28.343 7fe1f8bee700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe1da154700' had timed out after 15
>>> 2020-03-21 06:41:28.343 7fe1f8bee700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe1da154700' had timed out after 15
>>> 2020-03-21 06:41:36.780 7fe1da154700  1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7fe1da154700' had timed out after 15
>>> 2020-03-21 06:41:36.888 7fe1e7769700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.7 down, but it is still running
>>> 2020-03-21 06:41:36.888 7fe1e7769700  0 log_channel(cluster) log [DBG] : map e3574 wrongly marked me down at e3573
>>> 2020-03-21 06:41:36.888 7fe1e7769700  1 osd.7 3574 start_waiting_for_healthy
>>>
>>> I observed the network usage graph and network utilization was low during recovery (the 10Gbit link was not saturated).
>>>
>>> So a lot of IOPS on an OSD causes the heartbeat operation to time out as well? I thought that the OSD uses separate threads and that HDD timeouts do not influence heartbeats to other OSDs and the MON. It looks like that is not true.
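If the disks are simply too busy to answer heartbeats in time, one workaround (a sketch that widens the timeouts rather than fixing the underlying load) is to increase the grace periods; osd_heartbeat_grace defaults to 20 s and the op-thread suicide timeout to 150 s:

```shell
# Give heavily loaded OSDs more time before peers report them down.
ceph config set osd osd_heartbeat_grace 60
# Give op worker threads longer before the internal watchdog
# aborts the OSD (default 150 s).
ceph config set osd osd_op_thread_suicide_timeout 300
```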
>>>
>>> 3)
>>>
>>> After the OSD was wrongly marked down, I can see that the cluster has degraded objects. There were no degraded objects before that.
>>>
>>> Degraded data redundancy: 251754/117225048 objects degraded (0.215%), 8 pgs degraded, 8 pgs undersized
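To see exactly which PGs (and which acting OSDs) are affected, something like the following should work on Nautilus:

```shell
# Show the health warning with the affected PG IDs.
ceph health detail
# List PGs by state, including their acting sets.
ceph pg ls degraded
ceph pg ls undersized
```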
>>>
>>> Does it mean that this OSD disconnection caused degraded data? How is that possible, when no OSD was lost? The data should still be on that OSD, and after peering everything should be OK. With luminous I had no such problem: after the OSD came back up, degraded objects were recovered/found within a few seconds and the cluster was healthy again within seconds.
>>>
>>> Thank you very much for any additional info. I can perform any additional tests you recommend, because the cluster is used for testing purposes now.
>>>
>>> With regards
>>> Jan Pekar
>>>
>>> --
>>> ============
>>> Ing. Jan Pekař
>>> jan.pekar(a)imatic.cz
>>> ----
>>> Imatic | Jagellonská 14 | Praha 3 | 130 00
>>>
>>> http://www.imatic.cz | +420326555326
>>> ============
>>> --
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>