Re: log slow ops to cluster log

29 Mar 2021

On Sat, Mar 27, 2021 at 11:15 AM Josh Durgin &lt;jdurgin(a)redhat.com&gt; wrote:
...

 On 3/27/21 1:11 AM, Kefu Chai wrote:
  hi folks,

 i want to draw your attention to the tracker ticket
 https://tracker.ceph.com/issues/48909 and discuss a better solution
 with you.

 some context first: back in https://github.com/ceph/ceph/pull/18614,
 changes were made so that slow requests were reported to mgr, to move
 the burden from monitor to mgr. with that change, all health related
 reports are sent to mgr, and the aggregated version is composed by mgr
 and sent to monitor. i think that helps improve the scalability of a
 Ceph cluster. moreover, IIUC, letting mgr take over part of the load of
 the monitor was one of the reasons why mgr was introduced in the first
 place.

 in https://tracker.ceph.com/issues/43975, it's reported that slow ops
 were no longer recorded in the cluster log since mimic. as a fix,
 https://github.com/ceph/ceph/pull/33328 was created to send slow ops
 and their types to the cluster log.

 in https://tracker.ceph.com/issues/43975, it's noticed that this fix
 even worsens the performance of a cluster suffering from slow ops by
 adding more load to monitor. hence
 https://github.com/ceph/ceph/pull/39199 was created to throttle this.

 i am wondering if we can make better use of the health reporting
 machinery instead of pouring health warnings into the clog when slow
 ops are observed?

 what do you think?

 Thanks for bringing this up Kefu, I agree there's a lot of room for
 improvement here. It'd be a good topic for CDS.

Agreed, added it to https://pad.ceph.com/p/cds-quincy.

Neha

...

 There's no reason the cluster log needs to go through paxos or be stored
 in the monitor DB, and some sort of throttling or data reduction would
 help on the producer side. We've seen not just slow ops but also other
 warnings being reported too frequently and overloading the monitors.

 https://github.com/ceph/ceph/pull/40168 is related on the consumer side,
 and also helps other cases of temporary mon overload (e.g. from a burst
 of osdmap creation from blocklisting).

 Josh
