Last night we implemented the suggested change (dropping the mgr ulimit to
1024:4096). On our test cluster of 504 OSDs, which is polled by both the
internal Prometheus and our external instance, the mgrs stopped responding
and dropped out of the cluster entirely. This affects not just metrics but
the mgr itself. I think this is a high-priority issue: metrics are critical
for prod, and the mgr itself appears to be affected on a moderately sized
cluster.
Respectfully,
David Orman
On Mon, Dec 7, 2020 at 1:50 PM Ken Dreyer <kdreyer(a)redhat.com> wrote:
Thanks for bringing this up.
We need to update Cheroot in Fedora and EPEL 8. I've opened
https://src.fedoraproject.org/rpms/python-cheroot/pull-request/3 to
get this into Fedora first.
I've published an el8 RPM at
https://fedorapeople.org/~ktdreyer/bz1868629/ for early testing. I can
bring up a "hello world" cherrypy app with this, but I've not tested
it with Ceph.
- Ken
On Mon, Dec 7, 2020 at 9:57 AM David Orman <ormandj(a)corenode.com> wrote:
Hi,
We have a Ceph 15.2.7 deployment using cephadm under podman with systemd.
We've run into what we believe is:
https://github.com/ceph/ceph-container/issues/1748
https://tracker.ceph.com/issues/47875
In our case, eventually the mgr container stops emitting output/logging.
We are polling with external prometheus clusters, which is likely what
triggers the issue, as it appears some amount of time after the container
is spawned.
Unfortunately, setting limits in the systemd service file for the mgr
service on the host OS doesn't work, nor does modifying the unit.run file
which is used to start the container under podman to include the --ulimit
settings as suggested. Looking inside the container:
lib/systemd/system/ceph-mgr@.service:LimitNOFILE=1048576
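For anyone wanting to confirm which limit is actually in effect, here is a minimal Python sketch that prints the file-descriptor limits and the open-descriptor count of the process it runs in. It assumes a Linux host with /proc mounted and python3 available in the container; the function names are illustrative and not part of Ceph. Note that it reports the limits of the Python process itself, which are inherited from whatever started it, so they may differ from those of the running mgr daemon.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open file descriptors via /proc (Linux only).

    Returns None when /proc is unavailable or the process cannot be read.
    """
    fd_dir = f"/proc/{pid}/fd"
    try:
        return len(os.listdir(fd_dir))
    except OSError:
        return None


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"RLIMIT_NOFILE soft={soft} hard={hard}")
    print(f"open fds: {open_fd_count()}")
```

Passing the mgr's PID to open_fd_count() (as a sufficiently privileged user) gives the daemon's own descriptor count, which can be compared against the limit above.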
This prevents us from deploying medium to large ceph clusters, so I would
argue it's a high priority bug that should not be closed, unless there is a
workaround that works until EPEL 8 contains the fixed version of cheroot
and the ceph containers include it.
My understanding is this was fixed in cheroot 8.4.0:
https://github.com/cherrypy/cheroot/issues/249
https://github.com/cherrypy/cheroot/pull/301
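To check whether a given environment already carries a cheroot at or above the 8.4.0 release mentioned above, something like the following Python sketch may help. The helper names are hypothetical, the version parsing is deliberately simple (it reads only the leading numeric components), and it assumes cheroot was installed with package metadata that importlib.metadata or pkg_resources can see.

```python
FIXED_IN = (8, 4, 0)  # release cited in this thread as containing the fix


def parse_version(text):
    """Turn '8.4.0' into (8, 4, 0).

    Parsing stops at the first non-numeric component and the result is
    padded with zeros to three parts, so '8.4' becomes (8, 4, 0).
    """
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(version_text):
    """True if the given version string is at or above FIXED_IN."""
    return parse_version(version_text) >= FIXED_IN


def installed_version(package="cheroot"):
    """Return the installed version string, or None if it cannot be found."""
    try:
        from importlib import metadata  # Python 3.8+
    except ImportError:
        metadata = None
    if metadata is not None:
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            return None
    try:
        import pkg_resources  # fallback for older interpreters (setuptools)
        return pkg_resources.get_distribution(package).version
    except Exception:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("cheroot not found")
    else:
        print(f"cheroot {found}: fix present = {has_fix(found)}")
```

Running this with the same interpreter the mgr uses (for example, inside the mgr container) shows what that environment would load.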
Thank you in advance for any suggestions,
David
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io