Hello,
We are running Mimic on our cluster, and since upgrading to 13.2.8 the
Prometheus plugin hangs frequently. It used to respond in under 10 seconds,
but now it often fails to respond at all. Restarting the mgr processes helps
temporarily, but within minutes the plugin gets stuck again.
The active mgr doesn't exit on `systemctl stop ceph-mgr.target` and has to
be killed with `kill -9`.
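In case it helps with diagnosis, one thing worth capturing before the
`kill -9` is a full thread backtrace of the stuck daemon; a rough sketch,
assuming gdb and the ceph debug symbols are installed:

# Dump backtraces of all threads in the hung ceph-mgr before killing it
$ gdb -p "$(pidof ceph-mgr)" --batch -ex 'thread apply all bt' > mgr-hang-bt.txt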
Is there anything I can do to address this, or at least get better
visibility into what the mgr is doing?
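So far the only extra visibility I can think of is raising the mgr's log
level and timing the metrics endpoint directly; a sketch, assuming the
prometheus module's default port of 9283:

# Raise mgr log verbosity via the config database
# (on Mimic this can also go in ceph.conf as debug_mgr = 4/20)
$ ceph config set mgr debug_mgr 4/20

# Probe the metrics endpoint with a hard timeout; prints the HTTP
# status code and total time, so a hang shows up as a timeout
$ curl --max-time 10 -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
    http://localhost:9283/metrics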
We only have a few plugins enabled:
$ ceph mgr module ls
{
    "enabled_modules": [
        "balancer",
        "prometheus",
        "zabbix"
    ],
We run 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs)
and a busy one, with lots of rebalancing. (I don't know whether a busy
cluster would seriously affect the mgr's performance, but I'm throwing it
out there.)
services:
    mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
    mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
    mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
    osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
    rgw: 4 daemons active
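One stopgap I'm considering is bouncing just the prometheus module instead
of the whole mgr daemon; a sketch, assuming nothing else depends on the
module staying loaded:

# Restart only the prometheus module rather than the entire ceph-mgr
$ ceph mgr module disable prometheus
$ ceph mgr module enable prometheus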
Thanks in advance for your help,
-Paul Choi
Hi guys,
Running multiple filesystems is documented as an experimental feature, but the documentation doesn't explain how to ensure that a given MDS's affinity sticks to the second filesystem you create. Has anyone had success implementing a second CephFS? In my case it will be based on a completely different pool from my first one.
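For concreteness, this is roughly the sequence I have in mind; the pool and
daemon names are hypothetical, and I'm not certain `mds_standby_for_fscid`
is the right knob for pinning affinity:

# Allow more than one filesystem (experimental flag)
$ ceph fs flag set enable_multiple true --yes-i-really-mean-it

# Create the second filesystem on its own pools
# (cephfs2_metadata / cephfs2_data are hypothetical pool names)
$ ceph fs new cephfs2 cephfs2_metadata cephfs2_data

# Then, in ceph.conf, pin a standby MDS to the new filesystem by its
# fscid (shown by `ceph fs dump`); mds.woodenbox7 is hypothetical:
# [mds.woodenbox7]
#     mds_standby_for_fscid = 2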
Thanks.
J