Hi Seena,
One frequent cause of such a timeout is slow RocksDB operation, which
in turn might be caused by bluefs_buffered_io being set to false and/or
DB "fragmentation" after massive data removal.
Hence the potential workarounds are adjusting bluefs_buffered_io (i.e.
setting it to true) and running a manual RocksDB compaction.
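
For illustration, here is a rough sketch of what those two workarounds
could look like on the command line. It is a small Python helper that only
builds and prints the commands, it does not run anything. The OSD id 3 and
the /var/lib/ceph/osd/ceph-<id> data path are assumptions for the example;
please check the exact syntax against the documentation for your release.
Depending on the version, an OSD restart may be needed before a changed
bluefs_buffered_io takes effect, and the offline variant requires the OSD
to be stopped first.

```python
import shlex


def set_buffered_io_cmd(enabled=True):
    """Command to set bluefs_buffered_io for all OSDs in the config database."""
    value = "true" if enabled else "false"
    return ["ceph", "config", "set", "osd", "bluefs_buffered_io", value]


def online_compact_cmd(osd_id):
    """Command to ask a running OSD to compact its RocksDB."""
    return ["ceph", "tell", f"osd.{osd_id}", "compact"]


def offline_compact_cmd(osd_id, cluster="ceph"):
    """Command to compact the DB of a stopped OSD.

    The data path follows the common default layout, which is an
    assumption here; adjust it to your deployment.
    """
    data_path = f"/var/lib/ceph/osd/{cluster}-{osd_id}"
    return ["ceph-kvstore-tool", "bluestore-kv", data_path, "compact"]


def render(cmd):
    """Join an argv list into a shell-safe string for printing."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    example_osd = 3  # hypothetical OSD id, for illustration only
    print(render(set_buffered_io_cmd(True)))
    print(render(online_compact_cmd(example_osd)))
    print(render(offline_compact_cmd(example_osd)))
```

Compaction adds extra I/O load while it runs, so it is probably safer to
go through the OSDs one at a time rather than all at once.
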
This topic has been discussed on this mailing list and in the relevant
tickets multiple times.
Thanks,
Igor
On 12/23/2020 3:24 PM, Seena Fallah wrote:
> Hi,
>
> All my OSD nodes in the SSD tier are hitting heartbeat_map timeouts
> randomly, and I can't find out why!
>
> 7ff2ed3f2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread
> 0x7ff2c8943700' had timed out after 15
>
> It occurs many times a day and brings my cluster down.
>
> Is there any way to find out why the OSDs time out? I don't think the
> heartbeat itself is the problem; rather, some issue in the OSD causes the
> heartbeat to time out. The OSDs don't suicide, they just get too slow
> and cause downtime on RBD and the S3 gateway because the queue is full!
>
> Thanks.
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io