Hi Paul,
we might have found the reason for MONs going silly on our cluster. There is a message
size parameter that seems way too large. We reduced it today from 10M (default) to 1M and
haven't observed silly MONs since then:
ceph config set global osd_map_message_max_bytes $((1*1024*1024))
I cannot guarantee that this is the fix. However, I observed one window of a MON with high
packet-out load after setting the above and it remained responsive and did not go to 100%
CPU. Maybe worth a try? I will keep observing.
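For reference, a small sketch of the arithmetic behind the value in the command above, assuming "10M" and "1M" mean MiB (the command itself uses 1*1024*1024). The helper names mib_to_bytes and print_set_cmd are made up for illustration and are not part of Ceph; the second one only prints the command, it does not run it.

```shell
# Hypothetical helper (not part of Ceph): convert a size in MiB to bytes,
# mirroring the $((1*1024*1024)) arithmetic used in the command above.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print (do not execute) the config command for a size given in MiB,
# so the value can be reviewed before it is applied.
print_set_cmd() {
    echo "ceph config set global osd_map_message_max_bytes $(mib_to_bytes "$1")"
}
```

Calling print_set_cmd 1 prints the same command as above with the value expanded to 1048576; under the MiB assumption the old default corresponds to mib_to_bytes 10.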
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 10 February 2021 17:32:07
To: Paul Mezzanini; ceph-users(a)ceph.io
Subject: [ceph-users] Re: OSDs cannot join, MON leader at 100%
It has become a lot more severe after adding a large number of disks. I opened a tracker issue:
https://tracker.ceph.com/issues/49231
In case you have additional information, feel free to add it there.
Best regards,
=================
Frank Schilder
________________________________________
From: Paul Mezzanini <pfmeec(a)rit.edu>
Sent: 29 January 2021 20:04:12
To: Frank Schilder; ceph-users(a)ceph.io
Subject: Re: OSDs cannot join, MON leader at 100%
We are currently running 3 MONs. When one goes into silly town the others get wedged and
won't respond well. I don't think more MONs would solve that... but I'm not
sure.
--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfmeec(a)rit.edu
CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
intended only for the person(s) or entity to which it is addressed and may
contain confidential and/or privileged material. Any review, retransmission,
dissemination or other use of, or taking of any action in reliance upon this
information by persons or entities other than the intended recipient is
prohibited. If you received this in error, please contact the sender and
destroy any copies of this information.
------------------------
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: Friday, January 29, 2021 12:58 PM
To: Paul Mezzanini; ceph-users(a)ceph.io
Subject: Re: OSDs cannot join, MON leader at 100%
Hi Paul,
thanks for sharing. I have the MONs on 2x10G bonded active-active. They don't manage
to saturate 10G, but the CPU core is overloaded.
How many MONs do you have? I believe I have never seen more than 2 in this state for
an extended period of time. My plan is to go from 3 to 5, which would leave a healthy subcluster
of 3, so I would be less hesitant to restart an affected MON right away.
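One way to read the 3-versus-5 reasoning is through the majority rule: MONs need more than half of their members to form a quorum. The sketch below only does that arithmetic; the function names quorum_size and has_quorum are hypothetical and nothing here talks to a cluster.

```shell
# Smallest number of MONs that forms a strict majority of $1 MONs.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit status 0) if a majority remains when $2 of $1 MONs
# are unavailable, fails otherwise.
has_quorum() {
    [ $(( $1 - $2 )) -ge "$(quorum_size "$1")" ]
}
```

With these definitions, has_quorum 5 2 succeeds (3 remaining MONs, 3 needed) while has_quorum 3 2 fails (1 remaining, 2 needed), which matches the idea that 5 MONs tolerate 2 misbehaving ones.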
Best regards,
=================
Frank Schilder
________________________________________
From: Paul Mezzanini <pfmeec(a)rit.edu>
Sent: 29 January 2021 17:44:42
To: Frank Schilder; ceph-users(a)ceph.io
Subject: Re: OSDs cannot join, MON leader at 100%
We've been watching our MONs go unresponsive with a saturated 10GbE NIC. The problem
seems to be aggravated by peering. We were shrinking the PG count on one of our large
pools and it was happening a bunch. Once that finished it seemed to calm down. Yesterday
I had an OSD go down and as it was rebalancing we had another MON go into silly mode. We
recover from this situation by just restarting the MON process on the hung node.
We are running 14.2.15.
I wish I could tell you what the problem actually is and how to fix it. At least we
aren't alone in this failure mode.
--
Paul Mezzanini
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: Friday, January 29, 2021 5:22 AM
To: ceph-users(a)ceph.io
Subject: [ceph-users] OSDs cannot join, MON leader at 100%
Dear cephers,
I was doing some maintenance yesterday involving shutdown and power-up cycles of ceph servers.
With the last server I ran into a problem. The server runs an MDS and a couple of OSDs.
After the reboot, the MDS joined the MDS cluster without problems, but the OSDs didn't
come up. This was 1 out of 12 servers and I had no such problems with the other 11. I also
observed that "ceph status" was responding very slowly.
Upon further inspection, I found out that 2 of my 3 MONs (the leader and a peon) were
running at 100% CPU. Client I/O was continuing, probably because the last cluster map
remained valid. On our node performance monitoring I could see that the 2 busy MONs were
showing extraordinary network activity.
This state lasted for over one hour. After the MONs settled down, the OSDs finally joined
as well and everything went back to normal.
The other time I have seen similar behaviour was when I restarted a MON on an empty
disk and the re-sync was extremely slow because the value of
mon_sync_max_payload_size was too large. This time, I'm pretty sure it was MON-client
communication; see below.
Are there any settings similar to mon_sync_max_payload_size that could influence
responsiveness of MONs in a similar way?
Why do I suspect it is MON-client communication? In our monitoring, I do not see the huge
number of packets sent by the MONs arriving at any other ceph daemon. They seem to be
distributed over client nodes, but since we have a large number of client nodes (>550),
the per-node share is hidden in the background network traffic. A second clue is that I have
had such extended lock-ups before and, whenever I checked, I only observed them when the
leader had a large share of client sessions.
For example, yesterday the client session count per MON was:
ceph-01: 1339 (leader)
ceph-02: 189 (peon)
ceph-03: 839 (peon)
I usually restart the leader when such a critical distribution occurs. As long as the
leader has the fewest client sessions, I never observe this problem.
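The rule of thumb above (no problem as long as the leader has the fewest client sessions) can be sketched as a small check. It assumes the counts are already available as lines in the "name: count (role)" layout shown above; how they are collected is left open, and the function name leader_session_check is hypothetical.

```shell
# Reads lines such as "ceph-01: 1339 (leader)" on stdin.
# Prints "risky" if any other MON has fewer client sessions than the
# leader, "ok" if the leader has the fewest, "no-leader" if none is marked.
leader_session_check() {
    awk '
        {
            name = $1
            sub(/:$/, "", name)        # strip the trailing colon
            count[name] = $2 + 0       # force numeric comparison later
        }
        $3 == "(leader)" { leader = name }
        END {
            if (leader == "") { print "no-leader"; exit }
            status = "ok"
            for (m in count)
                if (m != leader && count[m] < count[leader])
                    status = "risky"
            print status
        }
    '
}
```

Fed the three lines from the example above, the check reports "risky", since ceph-02 and ceph-03 both have fewer sessions than the leader ceph-01.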
Ceph version is 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable).
Thanks for any clues!
Best regards,
=================
Frank Schilder
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io