log slow ops to cluster log

27 Mar 2021

hi folks,

i want to draw your attention to the tracker ticket
https://tracker.ceph.com/issues/48909 and discuss a better solution with
you.

some context first: back in https://github.com/ceph/ceph/pull/18614,
changes were made so that slow requests are reported to mgr, moving the
burden from monitor to mgr. with that change, all health related reports
are sent to mgr, which composes an aggregated version and sends it to
monitor. i think that helps improve the scalability of a Ceph cluster.
moreover, IIUC, letting mgr take over part of the monitor's load was one of
the reasons why mgr was introduced in the first place.

in https://tracker.ceph.com/issues/43975, it's reported that slow ops
have not been recorded in the cluster log since mimic. as a fix,
https://github.com/ceph/ceph/pull/33328 was created to send slow ops and
their types to the cluster log.

in https://tracker.ceph.com/issues/43975, it's noted that this fix even
worsens the performance of a cluster suffering from slow ops by adding more
load to monitor. hence https://github.com/ceph/ceph/pull/39199 was created
to throttle this.

i am wondering if we can make better use of the health reporting
machinery instead of pouring the health warnings into clog when slow ops
are observed?

what do you think?

cheers,
