On Fri, Jun 9, 2023 at 3:27 AM Janek Bevendorff wrote:
I'm afraid your ceph-post-file logs were lost
to the nether. AFAICT,
our ceph-post-file storage has been non-functional since the beginning
of the lab outage last year. We're looking into it.
I have it here still. Any other way I can send it to you?
Never mind, I found the machine it was stored on. It was a
misconfiguration caused by post-lab-outage rebuilds.
Okay, taking your word for it. But something seems to be stalling
journal trimming. We had a similar thing yesterday evening, but at a much
smaller scale and without a noticeable pool size increase. I only got an alert
that the ceph_mds_log_ev Prometheus metric started going up again for a
single MDS. It grew past 1M events, so I restarted it. I also restarted
the other MDS daemons, and they all immediately jumped to above 5M events and
stayed there. They are, in fact, still there and have decreased only
very slightly in the morning. The pool size is totally within a normal
range, though, at 290GiB.
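
For reference, the same counter can be read straight off the MDS itself; a
minimal sketch, assuming the affected daemon is addressable as mds.0
(substitute your own rank or name):

  # "ev" is the current journal event count, "evtrm" the events trimmed so far
  ceph tell mds.0 perf dump mds_log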
Please keep monitoring it. I think you're not the only cluster to hit this.
So clearly (a)
an incredible number of journal events are being logged
and (b) trimming is slow or unable to make progress. I'm looking into
why, but you can help by running the attached script when the problem
is occurring so I can investigate. I'll need a tarball of the outputs.
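
For context, a minimal sketch of the kind of state dumps that help diagnose
stuck trimming; the actual attached script may collect more, and mds.0 is a
placeholder for the affected daemon:

  ceph fs status                   > fs-status.txt    # ranks, clients, pools
  ceph tell mds.0 perf dump        > perf-dump.txt    # incl. mds_log counters
  ceph tell mds.0 ops              > ops.txt          # ops currently in flight
  ceph tell mds.0 dump_blocked_ops > blocked-ops.txt  # ops blocked on locks
  tar czf mds-debug.tar.gz fs-status.txt perf-dump.txt ops.txt blocked-ops.txt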
How do I send it to you if not via ceph-post-file?
It should work again next week; we're moving the drop.ceph.com
service to a standalone VM.
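
Once it's back, the upload itself is one command; the ID it prints at the end
is what you'd send back (the filename here is just an example):

  ceph-post-file -d "MDS journal trimming debug outputs" mds-debug.tar.gz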
Also, on the off chance this is related to the MDS balancer, please
disable it since you're using ephemeral pinning:
ceph config set mds mds_bal_interval 0
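
To confirm the setting took effect, and to restore the default once this is
resolved:

  ceph config get mds mds_bal_interval   # should now print 0
  ceph config rm mds mds_bal_interval    # revert to the default later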
Thanks for your help!