MONs unresponsive for excessive amount of time - ceph-users

18 Nov 2020

Hi all,

one of our MONs was down for maintenance for ca. 45 minutes. After this time I started it
up again and it joined the cluster.

Unfortunately, things did not go as expected. The MON sub-cluster became unresponsive for
a bit more than 10 minutes. Admin commands would hang, even when issued directly to a
specific monitor via "ceph tell mon.xxx". In addition, our MDS lost its connection
to the MONs and reported a laggy connection. Consequently, all CephFS access was frozen
for a bit more than 10 minutes as well.

From the little I could get out of "ceph daemon mon.xxx mon_status", I could
see that the restarted MON was in state "synchronizing" (or similar, this is from
memory) while the other MONs were in quorum.
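
In case it helps with reproducing this, below is a minimal sketch of how the state could be polled via the admin socket while the stall is ongoing. It assumes that mon_status returns JSON with the fields "name", "state", "rank" and "quorum"; the helper names summarize_mon_status and poll_mon are made up for illustration, and the sample values in the usage note are invented, not captured from our cluster.

```python
import json
import subprocess


def summarize_mon_status(raw):
    """Parse mon_status JSON text and return (name, state, in_quorum).

    Assumption: the output carries "name", "state", "rank" and "quorum".
    Missing fields fall back to None / not in quorum.
    """
    status = json.loads(raw)
    name = status.get("name")
    state = status.get("state")
    rank = status.get("rank", -1)
    quorum = status.get("quorum", [])
    # A MON counts as in quorum if its own rank is listed in "quorum".
    return name, state, rank in quorum


def poll_mon(mon_id, timeout=10):
    """Query one MON through its admin socket (run on the MON host).

    The timeout is there because commands may hang during a stall;
    subprocess.TimeoutExpired is raised if the call does not return.
    """
    result = subprocess.run(
        ["ceph", "daemon", f"mon.{mon_id}", "mon_status"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return summarize_mon_status(result.stdout)
```

For example, feeding the parser a made-up status such as '{"name": "ceph-03", "state": "synchronizing", "rank": 2, "quorum": []}' yields a MON that is named, synchronizing and not in quorum. Calling poll_mon in a loop with a timestamp would at least show how long a MON stays in that state.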

Our cluster is mimic-12.2.8. This observation does not fit the intended HA of the
MON cluster; there should not be any stall at all.

My questions: Why do the MONs become unresponsive for such a long time? What are the MONs
doing during this time frame? Are there any config options I should look at? Are there any
log messages I should hunt for?

Any hint is appreciated.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14