I'm afraid your ceph-post-file logs were lost to the nether. AFAICT,
our ceph-post-file storage has been non-functional since the beginning
of the lab outage last year. We're looking into it.
I have it here still. Any other way I can send it to you?
Okay, taking your word for it. But something seems to be stalling
journal trimming. We had a similar thing yesterday evening, but at much
smaller scale without noticeable pool size increase. I only got an alert
that the ceph_mds_log_ev Prometheus metric started going up again for a
single MDS. It grew past 1M events, so I restarted it. I also restarted
the other MDSs and they all immediately jumped to above 5M events and
stayed there. They are, in fact, still there and have decreased only
very slightly in the morning. The pool size is totally within a normal
range, though, at 290GiB.
So clearly (a) an incredible number of journal events are being logged
and (b) trimming is slow or unable to make progress. I'm looking into
why but you can help by running the attached script when the problem
is occurring so I can investigate. I'll need a tarball of the outputs.
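In case it helps to tell (a) and (b) apart while the problem is occurring, here is a minimal sketch that compares two perf counter snapshots from one MDS. It assumes the JSON came from something like `ceph tell mds.<name> perf dump` and that the `mds_log` section carries the `ev`, `evadd` and `evtrm` counters; the function name and the sample numbers are made up for illustration, and this is not the attached script.

```python
import json


def journal_trim_summary(before_json, after_json):
    """Compare two perf dump snapshots of a single MDS.

    Assumption: each snapshot has an "mds_log" section with
    "ev" (events currently in the journal), "evadd" (events
    submitted so far) and "evtrm" (events trimmed so far).
    """
    before = json.loads(before_json)["mds_log"]
    after = json.loads(after_json)["mds_log"]
    added = after["evadd"] - before["evadd"]
    trimmed = after["evtrm"] - before["evtrm"]
    return {
        "added": added,              # events logged between the snapshots
        "trimmed": trimmed,          # events trimmed between the snapshots
        "backlog": after["ev"],      # events still sitting in the journal
        "falling_behind": added > trimmed,
    }


# Made-up snapshots for illustration only, not real cluster output.
SAMPLE_BEFORE = json.dumps(
    {"mds_log": {"ev": 5000000, "evadd": 9000000, "evtrm": 4000000}}
)
SAMPLE_AFTER = json.dumps(
    {"mds_log": {"ev": 5040000, "evadd": 9050000, "evtrm": 4010000}}
)

if __name__ == "__main__":
    print(journal_trim_summary(SAMPLE_BEFORE, SAMPLE_AFTER))
```

If "added" is large while "trimmed" stays near zero, that points at trimming being stuck rather than only at a burst of new events.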
How do I send it to you if not via ceph-post-file?
Also, on the off chance this is related to the MDS balancer, please
disable it since you're using ephemeral pinning:
ceph config set mds mds_bal_interval 0
Thanks for your help!