Best guess: the recovery process doesn't really stop; the mgr is dead
and simply no longer reports the progress.
And yeah, I can confirm that having a huge number of crash reports is a
problem (had a case where a monitoring script kept crashing due to a
radosgw-admin bug... lots of crash reports).
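
A minimal sketch for checking how many crash reports have piled up and
which daemons they come from, without letting a hanging command block
the shell. It assumes `ceph crash ls --format json` returns a JSON list
whose entries carry an `entity_name` field; the function names and the
30-second timeout are illustrative choices, not anything from the
original report.

```python
import json
import subprocess
from collections import Counter


def count_crashes_by_entity(crash_ls_json):
    """Count crash reports per daemon.

    Assumption: the input is the JSON list printed by
    `ceph crash ls --format json`, with an `entity_name` key per entry.
    Entries without that key are counted under "unknown".
    """
    reports = json.loads(crash_ls_json)
    return Counter(r.get("entity_name", "unknown") for r in reports)


def fetch_crash_list(timeout_s=30):
    """Run `ceph crash ls`, giving up after timeout_s seconds.

    Returns the raw JSON text, or None if the command did not return
    in time (the hang described in the quoted message).
    """
    try:
        result = subprocess.run(
            ["ceph", "crash", "ls", "--format", "json"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout
```

Calling `fetch_crash_list()` on a node with an admin keyring and passing
the result to `count_crashes_by_entity()` gives a per-daemon count; a
`None` result means the crash module did not answer within the timeout.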
Paul
--
Paul Emmerich
On Thu, Apr 30, 2020 at 4:09 PM Francois Legrand <fleg(a)lpnhe.in2p3.fr>
wrote:
Hi everybody (again),
We recently had a lot of OSD crashes (more than 30 OSDs crashed). This
is now fixed, but it triggered a huge rebalancing and recovery.
At more or less the same time, we noticed that ceph crash ls (or any
other ceph crash command) hangs forever and never returns.
Finally, the recovery process regularly stops (after ~1 hour), but it
can be resumed by restarting the mgr daemon (systemctl restart
ceph-mgr.target on the active manager).
There is nothing in the logs (the manager still works, the service is
up, and the dashboard is accessible; the recovery simply stops).
We also tried rebooting the managers, but that doesn't solve the problem.
I guess these two problems are linked, but I'm not sure.
Does anybody have a clue?
Thanks.
F.