We hit this issue over the weekend on our HDD-backed EC Nautilus cluster while removing a
single OSD, and we also had no luck with compaction. The mon logs filled up the entire
root disk on our mon servers, and we were running on a single monitor for hours while we
tried to finish recovery and reclaim space. Over the past couple of weeks we had also been
seeing "pg not scrubbed in time" warnings, but we are unsure whether they are related. I'm
still unsure of the exact cause of this (other than the general misplaced/degraded
objects) and what kind of growth is acceptable for these store.db files.
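
For reference, this is roughly what we tried for compaction before going down the sshfs
route. It's a sketch, not an exact transcript: the mon ID (mon1) and data path are
placeholders for our actual hosts, so adjust them to your own names:

    # Ask a running monitor to compact its store over the network
    ceph tell mon.mon1 compact

    # Or the same thing via the admin socket on the monitor host itself
    ceph daemon mon.mon1 compact

    # Watch how big the mon store actually is
    du -sh /var/lib/ceph/mon/ceph-mon1/store.db

    # Compaction can also be forced at daemon startup via ceph.conf:
    # [mon]
    # mon_compact_on_start = true

None of this reclaimed any meaningful space for us while recovery was still in progress.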
In order to get our downed mons restarted, we ended up backing up and copying the
/var/lib/ceph/mon/* contents to a remote host with large NVMe and SSD drives, setting up
an sshfs mount to that host, ensuring the mount paths were owned by ceph, and then
clearing enough space on the monitor host to start the service. This allowed our store.db
directory to grow freely until the misplaced/degraded objects could recover, and all of
the monitors eventually rejoined.
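
In case it helps anyone else, the workaround looked roughly like this. Again a sketch
rather than a transcript: the mon ID (mon1), remote host (bighost), and remote path are
placeholders, and it assumes you run it as root on the monitor host:

    # Stop the monitor before touching its data directory
    systemctl stop ceph-mon@mon1

    # Back up the mon store to a remote host with plenty of fast space
    rsync -a /var/lib/ceph/mon/ bighost:/srv/ceph-mon-backup/

    # Mount the remote copy over the local mon data directory
    sshfs -o allow_other,default_permissions \
        bighost:/srv/ceph-mon-backup /var/lib/ceph/mon

    # The mon daemon runs as the ceph user, so ownership on the
    # mounted paths has to match before it will start
    chown -R ceph:ceph /var/lib/ceph/mon

    # After clearing enough space on the local root disk:
    systemctl start ceph-mon@mon1

With store.db living on the sshfs mount, it could grow well past what the local root disk
could hold until recovery finished.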