I don't have a solution to offer, but I've seen this for years.
Any time an MGR bounces, be it for upgrades, a new daemon coming online, etc., I'll
see a scale spike like the one reported below.
Just out of curiosity, which MGR plugins are you using?
I have historically used the influx plugin for stats exports, and the spike shows up
in those exported values as well, throwing everything off.
I don't see it in my Zabbix stats, though those are scraped at a longer interval that
may simply not catch it.
Just looking for any common threads.
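If it helps for comparison, listing the enabled modules is quick (assuming a
Nautilus-era CLI; adjust to taste):
--------------------------------------------------------------------------------
ceph mgr module ls            # enabled and available mgr modules
ceph config dump | grep mgr   # any mgr-related config overrides
--------------------------------------------------------------------------------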
Reed
On May 4, 2021, at 3:46 AM, Nico Schottelius
<nico.schottelius(a)ungleich.ch> wrote:
Hello,
we have a recurring, funky problem with managers on Nautilus (and
probably also earlier versions): the manager displays incorrect
information.
It also breaks the Prometheus graphs, as the I/O rates are wildly
wrong: "recovery: 43 TiB/s, 3.62k keys/s, 11.40M objects/s" - which
blows up the scale of any related graph and makes it unusable.
The latest example from today shows slow ops for an OSD
that has been down for 17h:
--------------------------------------------------------------------------------
[09:50:31] black2.place6:~# ceph -s
cluster:
id: 1ccd84f6-e362-4c50-9ffe-59436745e445
health: HEALTH_WARN
18 slow ops, oldest one blocked for 975 sec, osd.53 has slow ops
services:
mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
mgr: server2(active, since 2w), standbys: server8, server4, server9, server6, ciara3
osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
data:
pools: 4 pools, 2624 pgs
objects: 42.52M objects, 162 TiB
usage: 486 TiB used, 298 TiB / 784 TiB avail
pgs: 2616 active+clean
8 active+clean+scrubbing+deep
io:
client: 522 MiB/s rd, 22 MiB/s wr, 8.18k op/s rd, 689 op/s wr
--------------------------------------------------------------------------------
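For reference, the health detail attributes the stale warning to the long-gone OSD
as well, roughly like this:
--------------------------------------------------------------------------------
ceph health detail
HEALTH_WARN 18 slow ops, oldest one blocked for 975 sec, osd.53 has slow ops
--------------------------------------------------------------------------------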
Killing the manager on server2 only switches to another temporarily
incorrect status: the rebalance it now reports finished hours ago, and it is
paired with the bogus recovery speed that we see from time to time:
--------------------------------------------------------------------------------
[09:51:59] black2.place6:~# ceph -s
cluster:
id: 1ccd84f6-e362-4c50-9ffe-59436745e445
health: HEALTH_OK
services:
mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
mgr: server8(active, since 11s), standbys: server4, server9, server6, ciara3
osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
data:
pools: 4 pools, 2624 pgs
objects: 42.52M objects, 162 TiB
usage: 486 TiB used, 298 TiB / 784 TiB avail
pgs: 2616 active+clean
8 active+clean+scrubbing+deep
io:
client: 214 TiB/s rd, 54 TiB/s wr, 4.86G op/s rd, 1.06G op/s wr
recovery: 43 TiB/s, 3.62k keys/s, 11.40M objects/s
progress:
Rebalancing after osd.53 marked out
[========================......]
--------------------------------------------------------------------------------
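Side note: instead of killing the daemon, the active mgr can also be failed over,
which should trigger the same switch to a standby:
--------------------------------------------------------------------------------
ceph mgr fail server2    # hand the active role to one of the standbys
--------------------------------------------------------------------------------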
Then a bit later, the status on the newly started manager is correct:
--------------------------------------------------------------------------------
[09:52:18] black2.place6:~# ceph -s
cluster:
id: 1ccd84f6-e362-4c50-9ffe-59436745e445
health: HEALTH_OK
services:
mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
mgr: server8(active, since 47s), standbys: server4, server9, server6, server2, ciara3
osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
data:
pools: 4 pools, 2624 pgs
objects: 42.52M objects, 162 TiB
usage: 486 TiB used, 298 TiB / 784 TiB avail
pgs: 2616 active+clean
8 active+clean+scrubbing+deep
io:
client: 422 MiB/s rd, 39 MiB/s wr, 7.91k op/s rd, 752 op/s wr
--------------------------------------------------------------------------------
Question: is this a known bug, is anyone else seeing it, or are we doing
something wrong?
Best regards,
Nico
--
Sustainable and modern Infrastructures by ungleich.ch