We just had the same problem again after a power outage that took out
62% of our cluster and three out of five MONs. Once everything was back
up, the MONs started lagging and piling up slow ops while the MON store
grew to double-digit gigabytes. It got so bad that I couldn't even list
the in-flight ops any more, because ceph daemon mon.XXX ops did not
return at all.
Like last time, after I restarted all five MONs, the store size
decreased and everything went back to normal. I also had to restart the
MGRs and MDSs afterwards. This is starting to look like a bug to me.
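For anyone hitting the same thing, a rough sketch of the commands
involved (mon ID "mon1" and the default cluster name "ceph" are
placeholders, adjust for your deployment):

```shell
# Check the on-disk size of a MON's RocksDB store
# (default path layout: /var/lib/ceph/mon/<cluster>-<id>/store.db):
du -sh /var/lib/ceph/mon/ceph-mon1/store.db

# Ask the MON to compact its store manually:
ceph tell mon.mon1 compact

# If the daemon no longer responds, restart it
# (systemd-based deployment assumed):
systemctl restart ceph-mon@mon1
```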
Janek
On 26/02/2021 15:24, Janek Bevendorff wrote:
Since the full cluster restart and disabling logging to syslog, it's
not a problem any more (for now).
Unfortunately, just disabling clog_to_monitors didn't have the desired
effect when I tried it yesterday. But I still believe it is somehow
related. I could not find any specific reason for yesterday's incident
in the logs besides a few more RocksDB status and compact messages than
usual, but that's more of a symptom.
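For reference, a sketch of how the option can be toggled (the exact
syntax depends on the Ceph release; injectargs works on running
daemons):

```shell
# Disable cluster-log forwarding to the MONs on all running OSDs,
# without restarting them:
ceph tell osd.* injectargs '--clog_to_monitors=false'

# On releases with the centralized config store, the persistent
# equivalent would be:
ceph config set osd clog_to_monitors false

# Note: restart the OSDs when re-enabling it, to avoid the crash
# tracked in https://tracker.ceph.com/issues/48946
```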
On 26/02/2021 13:05, Mykola Golub wrote:
> On Thu, Feb 25, 2021 at 08:58:01PM +0100, Janek Bevendorff wrote:
>
>> On the first MON, the command doesn’t even return, but I was able to
>> get a dump from the one I restarted most recently. The oldest ops
>> look like this:
>>
>> {
>>     "description": "log(1000 entries from seq 17876238 at
>>         2021-02-25T15:13:20.306487+0100)",
>>     "initiated_at": "2021-02-25T20:40:34.698932+0100",
>>     "age": 183.762551121,
>>     "duration": 183.762599201,
> The mon stores cluster log messages in the mon db. You mentioned
> problems with osds flooding with log messages. It looks related.
>
> If you still observe the db growth, you may try temporarily disabling
> clog_to_monitors, i.e. set for all osds:
>
> clog_to_monitors = false
>
> And see if it stops growing after this and if it helps with the slow
> ops (it might make sense to restart mons if some look like they got
> stuck). You can apply the config option on the fly (without restarting
> the osds, e.g. with injectargs), but when re-enabling it you will
> have to restart the osds to avoid crashes due to this bug [1].
>
> [1] https://tracker.ceph.com/issues/48946
>
--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany
Phone: +49 3643 58 3577
www.webis.de