Weird blocked OP issue. - ceph-users

1 Nov 2019

We had an OSD host with 13 OSDs fail today and we have a weird blocked
OP message that I can't understand. There are no OSDs with blocked
ops, just `mon` (multiple times), and some of the rgw instances.

  cluster:
   id:     570bcdbb-9fdf-406f-9079-b0181025f8d0
   health: HEALTH_WARN
           1 large omap objects
           Degraded data redundancy: 2083023/195702437 objects
degraded (1.064%), 880 pgs degraded, 880 pgs undersized
           1609 pgs not deep-scrubbed in time
           4 slow ops, oldest one blocked for 506699 sec, daemons
[mon,sun-gcs02-rgw01,mon,sun-gcs02-rgw02,mon,sun-gcs02-rgw03] have
slow ops.

 services:
   mon: 3 daemons, quorum
sun-gcs02-rgw01,sun-gcs02-rgw02,sun-gcs02-rgw03 (age 6m)
   mgr: sun-gcs02-rgw02(active, since 5d), standbys: sun-gcs02-rgw03,
sun-gcs02-rgw04
   osd: 767 osds: 754 up (since 10m), 754 in (since 104m); 880 remapped pgs
   rgw: 16 daemons active (sun-gcs02-rgw01.rgw0, sun-gcs02-rgw01.rgw1,
sun-gcs02-rgw01.rgw2, sun-gcs02-rgw01.rgw3, sun-gcs02-rgw02.rgw0,
sun-gcs02-rgw02.rgw1, sun-gcs02-rgw02.rgw2, sun-gcs02-rgw02.rgw3,
sun-gcs02-rgw03.rgw0, sun-gcs02-rgw03.rgw1, sun-gcs02-rgw03.rgw2, s
un-gcs02-rgw03.rgw3, sun-gcs02-rgw04.rgw0, sun-gcs02-rgw04.rgw1,
sun-gcs02-rgw04.rgw2, sun-gcs02-rgw04.rgw3)

 data:
   pools:   7 pools, 8240 pgs
   objects: 19.57M objects, 52 TiB
   usage:   88 TiB used, 6.1 PiB / 6.2 PiB avail
   pgs:     2083023/195702437 objects degraded (1.064%)
            43492/195702437 objects misplaced (0.022%)
            7360 active+clean
            868  active+undersized+degraded+remapped+backfill_wait
            12   active+undersized+degraded+remapped+backfilling

 io:
   client:   150 MiB/s rd, 642 op/s rd, 0 op/s wr
   recovery: 626 MiB/s, 223 objects/s

$ ceph versions
{
   "mon": {
       "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 3
   },
   "mgr": {
       "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 3
   },
   "osd": {
       "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
nautilus (stable)": 754
   },
   "mds": {},
   "rgw": {
       "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 16
   },
   "overall": {
       "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
nautilus (stable)": 754,
       "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 22
   }
}

I restarted one of the monitors and it dropped out of the list only
showing 2 blocked ops, but then showed up again a little while later.

Any ideas on where to look?

Thanks,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1