On Fri, Nov 1, 2019 at 6:10 PM Robert LeBlanc <robert(a)leblancnet.us> wrote:
We had an OSD host with 13 OSDs fail today and we have a weird blocked
OP message that I can't understand. There are no OSDs with blocked
ops, just `mon` (multiple times), and some of the rgw instances.
  cluster:
    id:     570bcdbb-9fdf-406f-9079-b0181025f8d0
    health: HEALTH_WARN
            1 large omap objects
            Degraded data redundancy: 2083023/195702437 objects degraded (1.064%), 880 pgs degraded, 880 pgs undersized
            1609 pgs not deep-scrubbed in time
            4 slow ops, oldest one blocked for 506699 sec, daemons [mon.sun-gcs02-rgw01,mon.sun-gcs02-rgw02,mon.sun-gcs02-rgw03] have slow ops.

  services:
    mon: 3 daemons, quorum sun-gcs02-rgw01,sun-gcs02-rgw02,sun-gcs02-rgw03 (age 6m)
    mgr: sun-gcs02-rgw02(active, since 5d), standbys: sun-gcs02-rgw03, sun-gcs02-rgw04
    osd: 767 osds: 754 up (since 10m), 754 in (since 104m); 880 remapped pgs
    rgw: 16 daemons active (sun-gcs02-rgw01.rgw0, sun-gcs02-rgw01.rgw1,
         sun-gcs02-rgw01.rgw2, sun-gcs02-rgw01.rgw3, sun-gcs02-rgw02.rgw0,
         sun-gcs02-rgw02.rgw1, sun-gcs02-rgw02.rgw2, sun-gcs02-rgw02.rgw3,
         sun-gcs02-rgw03.rgw0, sun-gcs02-rgw03.rgw1, sun-gcs02-rgw03.rgw2,
         sun-gcs02-rgw03.rgw3, sun-gcs02-rgw04.rgw0, sun-gcs02-rgw04.rgw1,
         sun-gcs02-rgw04.rgw2, sun-gcs02-rgw04.rgw3)

  data:
    pools:   7 pools, 8240 pgs
    objects: 19.57M objects, 52 TiB
    usage:   88 TiB used, 6.1 PiB / 6.2 PiB avail
    pgs:     2083023/195702437 objects degraded (1.064%)
             43492/195702437 objects misplaced (0.022%)
             7360 active+clean
             868  active+undersized+degraded+remapped+backfill_wait
             12   active+undersized+degraded+remapped+backfilling

  io:
    client:   150 MiB/s rd, 642 op/s rd, 0 op/s wr
    recovery: 626 MiB/s, 223 objects/s
$ ceph versions
{
    "mon": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 754
    },
    "mds": {},
    "rgw": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 16
    },
    "overall": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 754,
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 22
    }
}
I restarted one of the monitors and it dropped out of the list, leaving only 2 blocked ops, but it showed up again a little while later. Any ideas on where to look?
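For anyone chasing the same symptom, the mon-side slow ops can be inspected through the monitor's admin socket (a sketch, not a definitive recipe; mon.sun-gcs02-rgw01 is this cluster's first monitor, and dump_historic_ops may or may not be available depending on the mon build):

# Show which daemons the cluster blames for the slow ops
ceph health detail

# Dump the in-flight ops tracked by one monitor
# (run on the host where that mon's admin socket lives)
ceph daemon mon.sun-gcs02-rgw01 ops

# Recently completed ops, if the mon supports this command
ceph daemon mon.sun-gcs02-rgw01 dump_historic_ops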
For posterity's sake, it looks like I got things happy again.

The rgw data pool is 8+2 EC, but was set to min_size=10. I thought I had configured min_size=9, but PGs were recovering, so I didn't think about it at the time. (With min_size equal to k+m on an 8+2 pool, a single down shard is enough to block I/O on a PG.) Then one OSD started crashing with something about strays; it would be restarted and crash again, and then incomplete PGs showed up. I dropped min_size to 8 to get things recovered and marked osd.119 out to empty it off. Once the cluster recovered and all PGs were healthy, I set min_size=9.

I then noticed that what I thought were rgw instances being blocked were actually the names of the monitors (the hosts are named after the rgws, but mon, mgr and rgw are all containers on the boxes). I thought, well, let me try rolling the first monitor again and see if that unblocks the op; sure enough, it looks like it unblocked this time and has not shown up again in 10 minutes.

After letting osd.119 sit empty for about 10 minutes, I set it back in and it doesn't seem to be crashing anymore, so I wonder if it had some bad db entry. It's almost halfway back in and so far so good.
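Roughly, the sequence above as commands (the pool name default.rgw.buckets.data is an assumption, substitute your actual rgw data pool):

# Check the current min_size on the EC data pool
ceph osd pool get default.rgw.buckets.data min_size

# Temporarily allow recovery with only k=8 shards available
# (risky: no redundancy margin while it's in effect)
ceph osd pool set default.rgw.buckets.data min_size 8

# Drain the crashing OSD
ceph osd out 119

# ...wait for recovery, then restore the safer k+1 setting
ceph osd pool set default.rgw.buckets.data min_size 9
ceph osd in 119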
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1