Thanks David! A couple of things have happened since the last update.
The primary Fedora cheroot package maintainer updated cheroot from
8.5.0 to 8.5.1 in Rawhide. I've rebuilt this for el8 and put it into a
new repository here:
There are a few more small cleanups I need to land in order to
reconcile the epel8 and master branches.
For rawhide (master):
... and then merge master into epel8, and then build 8.5.1 in the main
build system (Koji) and ship to epel-testing.
- Ken
On Fri, Dec 11, 2020 at 5:38 AM David Orman <ormandj(a)corenode.com> wrote:
Hi Ken,
This seems to have fixed that issue. It exposed another:
https://tracker.ceph.com/issues/39264 which is causing ceph-mgr to become entirely
unresponsive across the cluster, but cheroot seems to be ok.
David
On Wed, Dec 9, 2020 at 12:25 PM David Orman <ormandj(a)corenode.com> wrote:
>
> Ken,
>
> We have rebuilt the container images of 15.2.7 with this RPM applied, and will
> be deploying it to a larger (504 OSD) cluster to test - this cluster had the
> issue previously until we disabled polling via Prometheus. We will update as
> soon as it's run for a day or two and we've been able to verify the mgr issues
> we saw no longer occur after extended polling via external and internal
> Prometheus instances.
>
> Thank you again for the quick update, we'll let you know as soon as we have
> more feedback,
> David
>
> On Tue, Dec 8, 2020 at 10:37 AM David Orman <ormandj(a)corenode.com> wrote:
>>
>> Hi Ken,
>>
>> Thank you for the update! As per:
>> https://github.com/ceph/ceph-container/issues/1748
>>
>> We implemented the suggested change (dropping the ulimit to 1024:4096 for the
>> mgr) last night, and on our test cluster of 504 OSDs, being polled by the
>> internal Prometheus and our external instance, the mgrs stopped responding and
>> dropped out of the cluster entirely. This is impacting not just metrics but
>> the mgr itself. I think this is a high-priority issue: metrics are critical
>> for prod, and the mgr itself is impacted on a moderately sized cluster.
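[Editor's note: the effect of that 1024:4096 change can be sketched at the process level with Python's resource module. This is a generic illustration of RLIMIT_NOFILE, not Ceph or cheroot code; the 1024/4096 values simply mirror the suggested ulimit.]

```python
import resource

# Read the current soft/hard file-descriptor limits for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# Lowering limits needs no privileges; raising the hard limit does.
# Mirror the suggested 1024:4096, but never exceed the existing hard cap.
if hard == resource.RLIM_INFINITY:
    new_soft, new_hard = 1024, 4096
else:
    new_soft, new_hard = min(1024, hard), min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, new_hard))

# The process (and any children it spawns) now sees the lowered limits.
print(resource.getrlimit(resource.RLIMIT_NOFILE))
```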
>>
>> Respectfully,
>> David Orman
>>
>> On Mon, Dec 7, 2020 at 1:50 PM Ken Dreyer <kdreyer(a)redhat.com> wrote:
>>>
>>> Thanks for bringing this up.
>>>
>>> We need to update Cheroot in Fedora and EPEL 8. I've opened
>>> https://src.fedoraproject.org/rpms/python-cheroot/pull-request/3 to
>>> get this into Fedora first.
>>>
>>> I've published an el8 RPM at
>>> https://fedorapeople.org/~ktdreyer/bz1868629/ for early testing. I can
>>> bring up a "hello world" cherrypy app with this, but I've not tested
>>> it with Ceph.
>>>
>>> - Ken
>>>
>>> On Mon, Dec 7, 2020 at 9:57 AM David Orman <ormandj(a)corenode.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > We have a ceph 15.2.7 deployment using cephadm under podman w/ systemd.
>>> > We've run into what we believe is:
>>> >
>>> > https://github.com/ceph/ceph-container/issues/1748
>>> > https://tracker.ceph.com/issues/47875
>>> >
>>> > In our case, the mgr container eventually stops emitting output/logging.
>>> > We are polling with external Prometheus clusters, which is likely what
>>> > triggers the issue, as it appears some amount of time after the container
>>> > is spawned.
>>> >
>>> > Unfortunately, setting limits in the systemd service file for the mgr
>>> > service on the host OS doesn't work, nor does modifying the unit.run file
>>> > which is used to start the container under podman to include the --ulimit
>>> > settings as suggested. Looking inside the container:
>>> >
>>> > lib/systemd/system/ceph-mgr@.service:LimitNOFILE=1048576
>>> >
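[Editor's note: the reason the host-side override loses is that rlimits are re-applied by whatever runs last before exec, and the systemd inside the container applies its own LimitNOFILE=. A hypothetical sketch of that mechanism, using subprocess's preexec_fn to play the role of systemd setting the limit just before exec:]

```python
import resource
import subprocess
import sys

def apply_limit():
    # Runs in the child between fork and exec, analogous to what systemd's
    # LimitNOFILE= does for a service: the value set here overrides whatever
    # the parent (the host-side configuration) had in place.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = 256 if soft == resource.RLIM_INFINITY else min(256, soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

out = subprocess.run(
    [sys.executable, "-c",
     "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"],
    preexec_fn=apply_limit, capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # the child sees the limit applied at exec time
```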
>>> > This prevents us from deploying medium to large Ceph clusters, so I would
>>> > argue it's a high-priority bug that should not be closed unless there is a
>>> > workaround that works until EPEL 8 contains the fixed version of cheroot
>>> > and the Ceph containers include it.
>>> >
>>> > My understanding is this was fixed in cheroot 8.4.0:
>>> >
>>> > https://github.com/cherrypy/cheroot/issues/249
>>> > https://github.com/cherrypy/cheroot/pull/301
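[Editor's note: the class of problem referenced here - select()-style polling degrading or breaking when the file-descriptor limit is very high - can be illustrated with Python's selectors module, which picks epoll on Linux and is not capped at select()'s FD_SETSIZE (1024). This is a generic sketch of the API, not cheroot's actual implementation.]

```python
import selectors
import socket

# A pair of connected sockets stands in for a client/server connection.
a, b = socket.socketpair()

# DefaultSelector chooses the best OS mechanism available (epoll on Linux),
# which scales with the number of registered descriptors rather than with
# the process-wide fd limit the way select() does.
sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ)

b.send(b"ping")
events = sel.select(timeout=1)
print(len(events))  # the one readable socket is reported

for key, _mask in events:
    print(key.fileobj.recv(4))  # b'ping'

sel.unregister(a)
a.close()
b.close()
```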
>>> >
>>> > Thank you in advance for any suggestions,
>>> > David
>>> > _______________________________________________
>>> > ceph-users mailing list -- ceph-users(a)ceph.io
>>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>> >
>>>