Thanks David! A couple of things have happened since the last update.
The primary Fedora cheroot package maintainer updated cheroot from
8.5.0 to 8.5.1 in Rawhide. I've rebuilt this for el8 and put it into a
new repository here:
There are a few more small cleanups I need to land in order to
reconcile the epel8 and master branches.
For rawhide (master):
... and then merge master into epel8, and then build 8.5.1 in the main
build system (Koji) and ship to epel-testing.
- Ken
On Fri, Dec 11, 2020 at 5:38 AM David Orman <ormandj(a)corenode.com> wrote:
Hi Ken,
This seems to have fixed that issue. It exposed another:
https://tracker.ceph.com/issues/39264 which is causing ceph-mgr to become entirely
unresponsive across the cluster, but cheroot seems to be ok.
David
On Wed, Dec 9, 2020 at 12:25 PM David Orman <ormandj(a)corenode.com> wrote:
>
> Ken,
>
> We have rebuilt the container images of 15.2.7 with this RPM applied, and will
> be deploying it to a larger (504 OSD) cluster to test - this cluster had the
> issue previously until we disabled polling via Prometheus. We will update as
> soon as it's run for a day or two and we've been able to verify the mgr issues
> we saw no longer occur after extended polling via external and internal
> Prometheus instances.
>
> Thank you again for the quick update, we'll let you know as soon as we have
> more feedback,
> David
>
> On Tue, Dec 8, 2020 at 10:37 AM David Orman <ormandj(a)corenode.com> wrote:
>>
>> Hi Ken,
>>
>> Thank you for the update! As per:
>> https://github.com/ceph/ceph-container/issues/1748
>>
>> We implemented the suggested change (dropping the ulimit to 1024:4096 for the
>> mgr) last night, and on our test cluster of 504 OSDs, being polled by the
>> internal Prometheus and our external instance, the mgrs stopped responding and
>> dropped out of the cluster entirely. This is impacting not just metrics but
>> the mgr itself. I think this is a high-priority issue: metrics are critical
>> for prod, and the mgr itself is impacted on a moderately sized cluster.
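[Editor's note: the effect of that 1024:4096 change can be sketched at the process level with Python's resource module. This is a generic illustration of RLIMIT_NOFILE, not Ceph or cheroot code; the 1024/4096 values simply mirror the suggested ulimit.]

```python
import resource

# Read the current soft/hard file-descriptor limits for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# Lowering limits needs no privileges; raising the hard limit does.
# Mirror the suggested 1024:4096, but never exceed the existing hard cap.
if hard == resource.RLIM_INFINITY:
    new_soft, new_hard = 1024, 4096
else:
    new_soft, new_hard = min(1024, hard), min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, new_hard))

# The process (and any children it spawns) now sees the lowered limits.
print(resource.getrlimit(resource.RLIMIT_NOFILE))
```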
>>
>> Respectfully,
>> David Orman
>>
>> On Mon, Dec 7, 2020 at 1:50 PM Ken Dreyer <kdreyer(a)redhat.com> wrote:
>>>
>>> Thanks for bringing this up.
>>>
>>> We need to update Cheroot in Fedora and EPEL 8. I've opened
>>> https://src.fedoraproject.org/rpms/python-cheroot/pull-request/3 to
>>> get this into Fedora first.
>>>
>>> I've published an el8 RPM at
>>> https://fedorapeople.org/~ktdreyer/bz1868629/ for early testing. I can
>>> bring up a "hello world" cherrypy app with this, but I've not tested
>>> it with Ceph.
>>>
>>> - Ken
>>>
>>> On Mon, Dec 7, 2020 at 9:57 AM David Orman <ormandj(a)corenode.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > We have a ceph 15.2.7 deployment using cephadm under podman w/ systemd.
>>> > We've run into what we believe is:
>>> >
>>> > https://github.com/ceph/ceph-container/issues/1748
>>> > https://tracker.ceph.com/issues/47875
>>> >
>>> > In our case, the mgr container eventually stops emitting output/logging.
>>> > We are polling with external Prometheus clusters, which is likely what
>>> > triggers the issue, as it appears some amount of time after the container
>>> > is spawned.
>>> >
>>> > Unfortunately, setting limits in the systemd service file for the mgr
>>> > service on the host OS doesn't work, nor does modifying the unit.run file
>>> > which is used to start the container under podman to include the --ulimit
>>> > settings as suggested. Looking inside the container:
>>> >
>>> > lib/systemd/system/ceph-mgr@.service:LimitNOFILE=1048576
>>> >
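[Editor's note: the reason the host-side override loses is that rlimits are re-applied by whatever runs last before exec, and the systemd inside the container applies its own LimitNOFILE=. A hypothetical sketch of that mechanism, using subprocess's preexec_fn to play the role of systemd setting the limit just before exec:]

```python
import resource
import subprocess
import sys

def apply_limit():
    # Runs in the child between fork and exec, analogous to what systemd's
    # LimitNOFILE= does for a service: the value set here overrides whatever
    # the parent (the host-side configuration) had in place.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = 256 if soft == resource.RLIM_INFINITY else min(256, soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

out = subprocess.run(
    [sys.executable, "-c",
     "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"],
    preexec_fn=apply_limit, capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # the child sees the limit applied at exec time
```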
>>> > This prevents us from deploying medium to large Ceph clusters, so I would
>>> > argue it's a high-priority bug that should not be closed unless there is a
>>> > workaround that works until EPEL 8 contains the fixed version of cheroot
>>> > and the Ceph containers include it.
>>> >
>>> > My understanding is this was fixed in cheroot 8.4.0:
>>> >
>>> > https://github.com/cherrypy/cheroot/issues/249
>>> > https://github.com/cherrypy/cheroot/pull/301
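[Editor's note: the class of problem referenced here - select()-style polling degrading or breaking when the file-descriptor limit is very high - can be illustrated with Python's selectors module, which picks epoll on Linux and is not capped at select()'s FD_SETSIZE (1024). This is a generic sketch of the API, not cheroot's actual implementation.]

```python
import selectors
import socket

# A pair of connected sockets stands in for a client/server connection.
a, b = socket.socketpair()

# DefaultSelector chooses the best OS mechanism available (epoll on Linux),
# which scales with the number of registered descriptors rather than with
# the process-wide fd limit the way select() does.
sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ)

b.send(b"ping")
events = sel.select(timeout=1)
print(len(events))  # the one readable socket is reported

for key, _mask in events:
    print(key.fileobj.recv(4))  # b'ping'

sel.unregister(a)
a.close()
b.close()
```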
>>> >
>>> > Thank you in advance for any suggestions,
>>> > David
>>> > _______________________________________________
>>> > ceph-users mailing list -- ceph-users(a)ceph.io
>>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>> >
>>>