Hello Ceph-Devs,
we have noticed a rise in overall load from the MGR daemon after upgrading from Luminous
12.2.13 to Nautilus 14.2.9. As a result, the Prometheus module becomes overloaded and
stops responding, for example while an OSD is out. We evaluated this on our test clusters
with recent hardware; the issue persisted and even got worse, with gaps in the Prometheus
metric collection while the cluster was being written to in a perfectly healthy state.
After some digging, and after hoping that the pull request from
https://tracker.ceph.com/issues/45439 (https://github.com/ceph/ceph/pull/34356) would
alleviate the issue, which it didn't, we have traced most of our troubles down to the
Progress MGR module:
The notify function in the progress module is highly inefficient in its current form: it
collects PG data even when nothing is being done with it (self._events being empty).
This results in the Prometheus module being blocked regularly and thus not responding in
time (response times of > 10 seconds, or even outright cherrypy timeouts).
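
To illustrate the idea, here is a simplified sketch. It is not the actual progress
module code nor the exact patch; the class ProgressSketch, the fetch_pg_data callback,
the fetch_count counter and the "pg_summary" notify type used here are assumptions made
for the example. The point is only that notify returns early when self._events is empty,
so the expensive PG data collection is skipped:

```python
class ProgressSketch:
    """Hypothetical stand-in for a MGR module; not the real Ceph code."""

    def __init__(self, fetch_pg_data):
        self._events = {}                     # active progress events, keyed by id
        self._fetch_pg_data = fetch_pg_data   # assumed to be the expensive call
        self.fetch_count = 0                  # counts how often PG data was collected

    def notify(self, notify_type, notify_id):
        if notify_type != "pg_summary":       # assumed notification type
            return
        # Early exit: with no events to update, skip the expensive collection.
        if not self._events:
            return
        self.fetch_count += 1
        data = self._fetch_pg_data()
        for ev in self._events.values():
            ev["last_pg_data"] = data


# Usage: no events means no collection; one event triggers exactly one collection.
module = ProgressSketch(lambda: {"pg_stats": []})
module.notify("pg_summary", "")
idle_fetches = module.fetch_count
module._events["ev1"] = {}
module.notify("pg_summary", "")
busy_fetches = module.fetch_count
```

With the guard in place, an idle cluster (no ongoing events) never pays for the PG data
collection on each notification.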
We have filed a tracker issue and prepared a pull request to fix this:
https://tracker.ceph.com/issues/46416
https://github.com/ceph/ceph/pull/35973
Since applying this simple fix, we haven't experienced any Prometheus timeouts.
Could someone please review, merge, and backport this pull request?
Thanks in advance