Hi Wido,
thanks for the explanation. I think the root cause is that the disks are
too slow for compaction.
I added two new mons with SSDs to the cluster to speed it up, and the
issue is resolved.
That's good advice, and I plan to migrate my mons to bigger SSD disks.
Thanks again.
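For reference, compaction can also be triggered by hand; a minimal
sketch (the mon ID below is a placeholder):

    # Ask one monitor to compact its store.db online
    ceph tell mon.a compact

    # Or have the store compacted on the next daemon start
    ceph config set mon mon_compact_on_start true
    systemctl restart ceph-mon@a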
Wido den Hollander <wido(a)42on.com> wrote on Fri, Oct 30, 2020 at 4:39 PM:
On 29/10/2020 19:29, Zhenshi Zhou wrote:
Hi Alex,
We found that there were a huge number of keys in the "logm" and "osdmap"
tables while using ceph-monstore-tool. I think that could be the root cause.
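(For anyone who wants to reproduce that check, a rough sketch; the path
is a placeholder and the tool should be run against a stopped mon or a
copy of its store:)

    # Count keys per table prefix (logm, osdmap, ...) in a mon store
    ceph-monstore-tool /var/lib/ceph/mon/ceph-a dump-keys | \
        awk '{print $1}' | sort | uniq -c | sort -rn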
But that is exactly how Ceph works. It might need those very old OSDMaps
to get all the PGs clean again, for example for an OSD which has been gone
for a very long time and needs to catch up before a PG can become clean.
If not all PGs are active+clean, you can see the MON databases grow
rapidly.
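A quick way to see how many old osdmap epochs the monitors are still
holding on to (a sketch; field names as they appear in a Nautilus
'ceph report'):

    # The gap between first and last committed epoch only shrinks once
    # the PGs become clean again and the mons can trim old maps
    ceph report 2>/dev/null | grep -E '"osdmap_(first|last)_committed"'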
Therefore I always deploy 1TB SSDs in all Monitors. They are not expensive
anymore and they give breathing room.
I always deploy physical and dedicated machines for Monitors just to
prevent these cases.
Wido
Well, some pages also say that disabling the 'insights' module can resolve
this issue, but I checked our cluster and we didn't have that module
enabled. See this page:
<https://tracker.ceph.com/issues/39955>.
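(Checking that is quick; a sketch, assuming a Nautilus-era mgr:)

    # List mgr modules and see whether "insights" shows up as enabled
    ceph mgr module ls | grep -i insights

    # If it were enabled, it could be turned off with:
    ceph mgr module disable insights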
Anyway, our cluster is still unhealthy; it just needs time to keep
recovering the data :)
Thanks
Alex Gracie <alexandergracie17(a)gmail.com> wrote on Thu, Oct 29, 2020 at 10:57 PM:
> We hit this issue over the weekend on our HDD-backed EC Nautilus cluster
> while removing a single OSD. We also did not have any luck using
> compaction. The mon logs filled up our entire root disk on the mon servers
> and we were running on a single monitor for hours while we tried to finish
> recovery and reclaim space. The past couple of weeks we also noticed "pg
> not scrubbed in time" errors but are unsure if they are related. I'm still
> unsure of the exact cause of this (other than the general misplaced/degraded
> objects) and what kind of growth is acceptable for these store.db files.
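(As an aside, the store growth is easy to keep an eye on, and the mons
warn once it passes mon_data_size_warn, 15 GiB by default; a sketch with
default paths assumed:)

    # Size of each monitor's store.db on the mon host
    du -sh /var/lib/ceph/mon/ceph-*/store.db

    # Threshold behind the MON_DISK_BIG health warning
    ceph config get mon mon_data_size_warn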
>
> In order to get our downed mons restarted, we ended up backing up and
> copying the /var/lib/ceph/mon/* contents to a remote host, setting up an
> sshfs mount to that new host with large NVMe and SSD drives, ensuring the
> mount paths were owned by ceph, then clearing up enough space on the
> monitor host to start the service. This allowed our store.db directory to
> grow freely until the misplaced/degraded objects could recover and the
> monitors all rejoined eventually.
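For anyone needing the same workaround, it roughly amounts to the
following (a sketch only; the mon ID "a", the host "bighost" and the
remote path are placeholders, and the mon must be stopped while its
data is copied):

    # Copy the mon data to a host with plenty of fast disk
    systemctl stop ceph-mon@a
    rsync -a /var/lib/ceph/mon/ceph-a/ bighost:/srv/ceph-mon-a/

    # Mount the copy back in place via sshfs, owned by the ceph user
    mv /var/lib/ceph/mon/ceph-a /var/lib/ceph/mon/ceph-a.bak
    mkdir /var/lib/ceph/mon/ceph-a
    sshfs -o allow_other,uid=$(id -u ceph),gid=$(id -g ceph) \
        bighost:/srv/ceph-mon-a /var/lib/ceph/mon/ceph-a

    systemctl start ceph-mon@a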
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io