Ceph OSD reported Slow operations - ceph-users

28 Oct 2023

In a production setup of  36 OSDs( SAS disks) totalling 180 TB allocated to a single Ceph
Cluster with 3 monitors and 3 managers. There were 830 volumes and VMs created in
Openstack with Ceph as a backend. On Sep 21, users reported slowness in accessing the VMs.

Analysing the logs lead us to problem with SAS , Network congestion and Ceph
configuration( as all default values were used). We updated the Network from 1Gbps to
10Gbps for public and cluster networking. There was no change. 
The ceph benchmark performance showed that 28 OSDs out of 36 OSDs reported very low IOPS
of 30 to 50 while the remaining showed 300+ IOPS. 
We gradually started reducing the load on the ceph cluster  and now the volumes count is
650. Now the slow operations has gradually reduced but I am aware that this is not the
solution.
Ceph configuration is updated with increasing the
osd_journal_size to 10 GB,
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
bluestore_cache_trim_max_skip_pinned=10000

After one month, now we faced another issue with Mgr daemon stopped in all 3 quorums and
16 OSDs went down. From the ceph-mon,ceph-mgr.log could not get the reason. Please guide
me as its a production setup