An OSD host with 13 OSDs failed today, and the cluster now reports a
blocked-ops warning that I can't make sense of. No OSDs are listed as
having blocked ops, only `mon` (multiple times) and some of the rgw instances.
  cluster:
    id:     570bcdbb-9fdf-406f-9079-b0181025f8d0
    health: HEALTH_WARN
            1 large omap objects
            Degraded data redundancy: 2083023/195702437 objects degraded (1.064%), 880 pgs degraded, 880 pgs undersized
            1609 pgs not deep-scrubbed in time
            4 slow ops, oldest one blocked for 506699 sec, daemons [mon,sun-gcs02-rgw01,mon,sun-gcs02-rgw02,mon,sun-gcs02-rgw03] have slow ops.

  services:
    mon: 3 daemons, quorum sun-gcs02-rgw01,sun-gcs02-rgw02,sun-gcs02-rgw03 (age 6m)
    mgr: sun-gcs02-rgw02(active, since 5d), standbys: sun-gcs02-rgw03, sun-gcs02-rgw04
    osd: 767 osds: 754 up (since 10m), 754 in (since 104m); 880 remapped pgs
    rgw: 16 daemons active (sun-gcs02-rgw01.rgw0, sun-gcs02-rgw01.rgw1, sun-gcs02-rgw01.rgw2, sun-gcs02-rgw01.rgw3, sun-gcs02-rgw02.rgw0, sun-gcs02-rgw02.rgw1, sun-gcs02-rgw02.rgw2, sun-gcs02-rgw02.rgw3, sun-gcs02-rgw03.rgw0, sun-gcs02-rgw03.rgw1, sun-gcs02-rgw03.rgw2, sun-gcs02-rgw03.rgw3, sun-gcs02-rgw04.rgw0, sun-gcs02-rgw04.rgw1, sun-gcs02-rgw04.rgw2, sun-gcs02-rgw04.rgw3)

  data:
    pools:   7 pools, 8240 pgs
    objects: 19.57M objects, 52 TiB
    usage:   88 TiB used, 6.1 PiB / 6.2 PiB avail
    pgs:     2083023/195702437 objects degraded (1.064%)
             43492/195702437 objects misplaced (0.022%)
             7360 active+clean
             868  active+undersized+degraded+remapped+backfill_wait
             12   active+undersized+degraded+remapped+backfilling

  io:
    client:   150 MiB/s rd, 642 op/s rd, 0 op/s wr
    recovery: 626 MiB/s, 223 objects/s
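
For scale, the "blocked for 506699 sec" figure in the health output works out to a little under six days, so the oldest slow op appears to be older than today's host failure. A minimal sketch of the conversion (plain arithmetic, nothing Ceph-specific; the variable names are just for illustration):

```python
from datetime import timedelta

# Value copied from the health line: "oldest one blocked for 506699 sec"
blocked_sec = 506699

# timedelta normalizes raw seconds into days plus hours:minutes:seconds
blocked = timedelta(seconds=blocked_sec)

print(blocked)  # 5 days, 20:44:59
```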
$ ceph versions
{
    "mon": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 754
    },
    "mds": {},
    "rgw": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 16
    },
    "overall": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 754,
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 22
    }
}
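
In case it matters, the `ceph versions` output above shows the OSDs on 14.2.2 while the mons, mgrs and rgws are on 14.2.4. A rough sketch of how that skew can be pulled out of the JSON, assuming the output has been pasted into a string (the function name `daemons_by_version` is made up for this example, not part of any Ceph tooling):

```python
import json
import re

# Copy of the `ceph versions` output above, with the wrapped keys rejoined.
VERSIONS_JSON = """
{
  "mon": {"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3},
  "mgr": {"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3},
  "osd": {"ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 754},
  "mds": {},
  "rgw": {"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 16},
  "overall": {
    "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 754,
    "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 22
  }
}
"""


def daemons_by_version(raw):
    """Group daemon counts by short version, e.g. {"14.2.2": {"osd": 754}}."""
    grouped = {}
    for daemon, builds in json.loads(raw).items():
        if daemon == "overall":
            continue  # "overall" only repeats the per-daemon totals
        for banner, count in builds.items():
            # Reduce the long banner to the bare version number.
            match = re.match(r"ceph version (\S+)", banner)
            version = match.group(1) if match else banner
            grouped.setdefault(version, {})[daemon] = count
    return grouped


summary = daemons_by_version(VERSIONS_JSON)
for version in sorted(summary):
    print(version, summary[version])
```

With the output above this lists `osd` alone under 14.2.2 and `mon`, `mgr` and `rgw` under 14.2.4.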
I restarted one of the monitors and it dropped out of the list, leaving
only 2 blocked ops, but it showed up again a little while later.
Any ideas on where to look?
Thanks,
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1