OSDs cannot join, MON leader at 100% - ceph-users

29 Jan 2021

Dear cephers,

I was doing some maintenance yesterday involving shutdown and power-up cycles of ceph servers.
With the last server I ran into a problem. The server runs an MDS and a couple of OSDs.
After the reboot, the MDS joined the MDS cluster without problems, but the OSDs didn't
come up. This was 1 out of 12 servers and I had no such problems with the other 11. I also
observed that "ceph status" was responding very slowly.

Upon further inspection, I found that 2 of my 3 MONs (the leader and a peon) were
running at 100% CPU. Client I/O was continuing, probably because the last cluster map
remained valid. In our node performance monitoring I could see that the 2 busy MONs were
showing extraordinary network activity.

This state lasted for over one hour. After the MONs settled down, the OSDs finally joined
as well and everything went back to normal.

The other occasion on which I have seen similar behaviour was when I restarted a MON on
an empty disk and the re-sync was extremely slow because mon_sync_max_payload_size was
set too large. This time, I'm pretty sure it was MON-client communication; see below.

Are there any settings similar to mon_sync_max_payload_size that could influence
responsiveness of MONs in a similar way?
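
For reference, the current value of such an option can be read from a running MON
through its admin socket. Below is a minimal Python sketch, assuming the ceph CLI is
installed on the MON host, that the MON ID is ceph-01 (an assumption based on the host
names in this mail), and that "config get" prints a JSON object keyed by the option
name. The helper names are made up for illustration.

```python
import json
import subprocess


def config_get_cmd(mon_id, option):
    """Build the admin-socket command that reads one option from a MON."""
    return ["ceph", "daemon", "mon." + mon_id, "config", "get", option]


def parse_config_get(output, option):
    """Extract the option value from the JSON printed by 'config get'."""
    return json.loads(output)[option]


def read_mon_option(mon_id, option):
    """Run the command on the local MON host and return the reported value."""
    result = subprocess.run(
        config_get_cmd(mon_id, option),
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_config_get(result.stdout, option)


if __name__ == "__main__":
    # "ceph-01" is an assumed MON ID; this must run on the host of that MON,
    # because "ceph daemon" talks to the local admin socket.
    print(read_mon_option("ceph-01", "mon_sync_max_payload_size"))
```

Running this on each MON host would show whether all MONs actually use the same value.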

Why do I suspect it is MON-client communication? In our monitoring, I do not see the huge
number of packets sent by the MONs arriving at any other ceph daemon. They seem to be
distributed over client nodes, but since we have a large number of client nodes (>550),
this is masked by the background network traffic. A second clue is that I have had such
extended lock-ups before and, whenever I checked, I observed them only when the leader
had a large share of client sessions.

For example, yesterday the client session count per MON was:

ceph-01: 1339 (leader)
ceph-02:  189 (peon)
ceph-03:  839 (peon)

I usually restart the leader when such a critical distribution occurs. As long as the
leader has the fewest client sessions, I never observe this problem.
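
The check I do by hand can be sketched roughly as follows. This is only an illustration
in Python using the session counts listed above; the function names are hypothetical,
and how the counts are collected from the MONs is left out.

```python
def leader_share(sessions, leader):
    """Fraction of all client sessions that are held by the leader."""
    total = sum(sessions.values())
    return sessions[leader] / total if total else 0.0


def leader_has_fewest(sessions, leader):
    """True if no peon holds fewer client sessions than the leader."""
    return all(
        sessions[leader] <= count
        for name, count in sessions.items()
        if name != leader
    )


if __name__ == "__main__":
    # Session counts per MON as observed yesterday (from the list above).
    sessions = {"ceph-01": 1339, "ceph-02": 189, "ceph-03": 839}
    leader = "ceph-01"

    print("leader share of sessions: {:.1%}".format(leader_share(sessions, leader)))
    if not leader_has_fewest(sessions, leader):
        # This is the situation in which I would restart the leader.
        print("leader {} does not have the fewest sessions".format(leader))
```

With yesterday's numbers, the leader held more than half of all client sessions, so the
second message would be printed.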

Ceph version is 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable).

Thanks for any clues!

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14