I encountered persistent SLOW_OPS just a few days ago on a recently upgraded 13.2.8
cluster, which has an SSD pool and an HDD pool. All OSDs are BlueStore, with no
separate journal/DB volumes. The HDD pool is used more or less for cold storage,
so performance is not critical.
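For reference, I double-checked the layout with something like the following (osd.12 is
just an example ID):

    $ ceph osd tree                    # how the hdd/ssd OSDs are laid out
    $ ceph osd pool ls detail          # which crush rule each pool uses
    $ ceph osd metadata 12 | grep -e objectstore -e bluefs_single
    # "osd_objectstore": "bluestore" plus "bluefs_single_shared_device": "1"
    # means WAL, DB and data all share the one device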
One OSD in particular (HDD) was reporting the SLOW_OPS. I suspected that the drive was on
the way out, but SMART stats looked ok, and there were no IO errors reported in the kernel
log. Restarting that OSD helped initially, but eventually the SLOW_OPS started piling up
again.
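For what it's worth, this is roughly what I ran while narrowing it down (osd.12 again
stands in for the affected OSD, /dev/sdX for its drive):

    $ ceph health detail                         # names the OSD(s) reporting slow ops
    $ ceph daemon osd.12 dump_historic_slow_ops  # on the OSD host, via the admin socket
    $ smartctl -a /dev/sdX                       # SMART stats looked fine
    $ dmesg | grep -i error                      # no IO errors in the kernel log
    $ systemctl restart ceph-osd@12              # helped, but only temporarily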
We have a fair number of VMs running from RBDs, most of them on the SSD pool, but a few
on HDD. Most of the VMs run a weekly fstrim cron job, and we have QEMU configured to
pass DISCARD commands down to Ceph. One VM, however, which holds a bunch of 50 GB files
as part of a Bareos setup (a fork of Bacula), has its filesystem mounted with the
discard option, so it trims immediately when files are deleted.
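For completeness, the discard plumbing looks roughly like this; pool, image, device and
mount point names are made up, and I'm assuming a libvirt-managed guest on virtio-scsi:

    # libvirt domain XML: discard='unmap' passes guest TRIMs through to RBD
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' discard='unmap'/>
      <source protocol='rbd' name='hdd-pool/bareos-vm'/>
      <target dev='sda' bus='scsi'/>
    </disk>

    # most VMs: weekly batch trim
    $ cat /etc/cron.weekly/fstrim
    #!/bin/sh
    fstrim -a

    # the Bareos VM: continuous trim via the mount option
    $ grep discard /etc/fstab
    /dev/sdb1  /srv/bareos  ext4  defaults,discard  0  2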
I tracked the SLOW_OPS to a time period during which that VM was recycling (i.e.,
deleting, and thereby trimming) some of these large 50 GB files. In other words, it
seems that there might be a performance regression in deleting large numbers of RADOS
objects at once.
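To put a number on it: with the default 4 MiB RBD object size (order 22), a single
50 GB file spans on the order of 12,800 RADOS objects, and as I understand it a discard
deletes every object it fully covers, so recycling a few of those files fires off tens
of thousands of object deletions in a short burst. Pool and image names here are made
up:

    $ rbd info hdd-pool/bareos-vm | grep order   # e.g. "order 22 (4 MiB objects)"
    $ echo $((50 * 1024 / 4))                    # objects per file, taking 50 GB as 50 GiB
    12800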