hi folks,
i'd like to draw your attention to the tracker ticket
https://tracker.ceph.com/issues/48909 and discuss a better solution with
you.
some context first. back in https://github.com/ceph/ceph/pull/18614,
slow requests were changed to be reported to mgr instead of the monitor,
to move that burden off the monitor. with that change, all health-related
reports are sent to mgr, which composes an aggregated version and sends
it to the monitor. i think that helps to improve the scalability of a Ceph
cluster. moreover, IIUC, letting mgr take over part of the monitor's load
was one of the reasons mgr was introduced in the first place.
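to illustrate the aggregation idea, here is a minimal python sketch of
collapsing per-daemon slow-op reports into one summary before it goes to
the monitor. the function name, the report shape and the returned fields
are assumptions made for illustration, not Ceph's actual code or wire
format:

```python
def aggregate_slow_ops(reports):
    """Collapse per-daemon slow-op counts into one health summary.

    `reports` maps a daemon name (e.g. "osd.3") to its slow-op count.
    Hypothetical shape: this is an illustration, not Ceph's real format.
    """
    # Only daemons that actually have slow ops appear in the summary.
    affected = sorted(name for name, count in reports.items() if count > 0)
    total = sum(reports[name] for name in affected)
    if total == 0:
        return None  # nothing worth forwarding to the monitor
    return {
        "check": "SLOW_OPS",
        "summary": f"{total} slow ops on {len(affected)} daemons",
        "daemons": affected,
    }


if __name__ == "__main__":
    print(aggregate_slow_ops({"osd.1": 3, "osd.2": 0, "osd.3": 5}))
```

the point is that the monitor receives one compact record per health
check, however many daemons reported.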
in https://tracker.ceph.com/issues/43975, it was reported that slow ops
have no longer been recorded in the cluster log since mimic. as a fix,
https://github.com/ceph/ceph/pull/33328 was created to send slow ops and
their types to the cluster log.
in https://tracker.ceph.com/issues/43975, it was noticed that this fix
even worsens the performance of a cluster suffering from slow ops,
because it adds more load to the monitor. hence
https://github.com/ceph/ceph/pull/39199 was created to throttle this.
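as a rough picture of what throttling log traffic can look like, here is
a minimal python sketch of a time-based limiter that emits at most one
message per interval and counts what it drops. this is a generic
technique, not necessarily what the PR above does. the class name,
method name and message text are hypothetical:

```python
import time


class LogThrottle:
    """Allow at most one log line per `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable so the logic can be tested
        self.last_emit = None       # time of the last emitted line
        self.suppressed = 0         # lines dropped since the last emit

    def offer(self, message):
        """Return the line to log, or None if it should be dropped."""
        now = self.clock()
        if (self.last_emit is not None
                and now - self.last_emit < self.min_interval):
            self.suppressed += 1
            return None
        line = message
        if self.suppressed:
            # Tell the reader that some lines were dropped in between.
            line += f" ({self.suppressed} similar messages suppressed)"
        self.last_emit = now
        self.suppressed = 0
        return line
```

a limiter like this caps the load that lands on the monitor, but it
also discards detail, which is part of why a health-report based
approach is worth discussing.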
i am wondering if we can make better use of the health reporting
machinery, instead of pouring health warnings into the cluster log (clog)
whenever slow ops are observed.
what do you think?

Thanks for bringing this up Kefu, I agree there's a lot of room for
improvement here. It'd be a good topic for CDS.