Hey Andreas,
thanks for the insights. Maybe a bit more background:
We are running a variety of pools; the majority of the data is stored on
the "hdd" and "ssd" pools, which make use of the "ssd" and "hdd-big"
(as in 3.5") device classes.
Andreas John <aj(a)net-lab.net> writes:
> On 22.09.20 22:09, Nico Schottelius wrote:
>> [...]
>> All nodes are connected with 2x 10 Gbit/s bonded/LACP, so I'd expect at [...]
The disks in question are 3.5"/10TB/6 Gbit/s SATA disks connected to an
H800 controller - so generally speaking I do not see a reasonable
bottleneck here.
> Yes, I should! I saw in your mail:
>
> 1.) 1532 slow requests are blocked > 32 sec
>     789 slow ops, oldest one blocked for 1949 sec, daemons
>     [osd.12,osd.14,osd.2,osd.20,osd.23,osd.25,osd.3,osd.33,osd.35,osd.50]...
>     have slow ops.
>
> A request that is blocked for > 32 sec is odd! Same goes for 1949 sec.
> In my experience, they will never finish. Sometimes they go away with osd
> restarts. Are those OSDs the ones you relocated?
We tried restarting some of the osds; however, the slow ops come back
soon after the restart. And this is the most puzzling part: the move of
the osds only affected PGs that are related to the "ssd" pool. While data
was rebalancing, one hdd osd crashed and was restarted, but what we see
at the moment is that there are slow ops on a lot of osds:
REQUEST_SLOW 4560 slow requests are blocked > 32 sec
    1262 ops are blocked > 2097.15 sec
    1121 ops are blocked > 1048.58 sec
    602 ops are blocked > 524.288 sec
    849 ops are blocked > 262.144 sec
    407 ops are blocked > 131.072 sec
    175 ops are blocked > 65.536 sec
    144 ops are blocked > 32.768 sec
    osd.82 has blocked requests > 131.072 sec
    osds 1,9,11,19,28,44,45,48,58,72,73,84 have blocked requests > 262.144 sec
    osds 2,4,21,22,27,29,31,34,61 have blocked requests > 524.288 sec
    osds 15,20,32,52,55,62,71,74,79,83 have blocked requests > 1048.58 sec
    osds 5,6,7,12,14,16,18,25,33,35,47,50,51,69 have blocked requests > 2097.15 sec
REQUEST_STUCK 1228 stuck requests are blocked > 4096 sec
    330 ops are blocked > 8388.61 sec
    898 ops are blocked > 4194.3 sec
    osds 3,23,56,59,60 have stuck requests > 4194.3 sec
    osds 30,46,49,63,64,65,66,68,70,75,85 have stuck requests > 8388.61 sec
SLOW_OPS 2360 slow ops, oldest one blocked for 6517 sec, daemons
    [osd.0,osd.1,osd.11,osd.12,osd.14,osd.15,osd.16,osd.18,osd.19,osd.2]... have slow ops.
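For completeness, a sketch of how a single flagged OSD can be inspected via
its admin socket (osd.12 is just one ID picked from the list above; the
daemon commands have to be run on the host carrying that OSD):

    # ops currently in flight / blocked on this OSD
    ceph daemon osd.12 dump_ops_in_flight
    # recently completed ops with per-stage timestamps
    ceph daemon osd.12 dump_historic_ops
    # which host / crush location the OSD belongs to
    ceph osd find 12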
We have checked DNS, MTU and network congestion via Prometheus, and on
the network side nothing seems to be wrong.
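For reference, the kind of checks meant here, sketched with placeholder
names (server2/bond0 are made up; the 8972-byte payload assumes a 9000
MTU, with 1500 it would be 1472):

    # path-MTU ping between OSD hosts, fragmentation not allowed
    ping -M do -s 8972 -c 3 server2
    # bond / link state on each node
    cat /proc/net/bonding/bond0
    ip -s link show bond0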
> 2.) client: 91 MiB/s rd, 28 MiB/s wr, 1.76k op/s rd, 686 op/s wr
>     recovery: 67 MiB/s, 17 objects/s
>
> 67 MB/sec is slower than a single rotational disk can deliver. Even 67
> + 91 MB/s is not much, especially not for an 85 OSD @ 10G cluster. The
> ~2500 IOPS client I/O will translate to 7500 "net" IOPS with pool size
> 3, maybe that is the limit.
>
> But I guess you already know that. But before tuning, you should
> probably listen to Frank's advice about the placements (see other post).
> As soon as the unknown OSDs come back, the speed will probably go up
> due to parallelism.
I am not sure whether that is a good idea at the moment, after the
rebalance has already been in progress for some hours.

What really looks wrong are the extremely long peering and activation
times:
  data:
    pools:   12 pools, 3000 pgs
    objects: 35.03M objects, 133 TiB
    usage:   394 TiB used, 163 TiB / 557 TiB avail
    pgs:     5.667% pgs unknown
             24.967% pgs not active
             1365063/105076392 objects degraded (1.299%)
             252605/105076392 objects misplaced (0.240%)
             1955 active+clean
             608  peering
             170  unknown
             59   activating
             57   active+remapped+backfill_wait
             35   activating+undersized
             32   active+undersized+degraded
             20   stale+peering
             17   activating+undersized+degraded
             9    active+remapped+backfilling
             6    stale+active+clean
             5    active+recovery_wait
             4    active+undersized
             4    activating+degraded
             4    active+clean+scrubbing+deep
             4    stale+activating
             3    active+recovery_wait+degraded
             3    active+undersized+degraded+remapped+backfill_wait
             2    remapped+peering
             1    active+recovery_wait+undersized+degraded
             1    active+undersized+degraded+remapped+backfilling
             1    active+remapped+backfill_toofull

  io:
    client:   34 MiB/s rd, 3.6 MiB/s wr, 1.08k op/s rd, 324 op/s wr
    recovery: 82 MiB/s, 20 objects/s
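As a side note, the inactive PGs can be narrowed down roughly like this
(2.1f is a placeholder PG id, to be taken from the dump_stuck output):

    # list PGs stuck in peering/activating/unknown
    ceph pg dump_stuck inactive
    # ask one of them why it does not go active (acting set, blocked_by, ...)
    ceph pg 2.1f query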
Still debugging. It's impressive how the very simple task of moving 4
SSDs caused (and keeps causing) such problems. I wonder (and suspect)
that something else must be wrong here.

We recently (some months ago) upgraded from luminous via mimic to
nautilus; I will triple-check whether there are any changes that could
cause these effects.
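A quick sketch of what could be looked at for that (nothing
cluster-specific assumed):

    # all daemons should report a nautilus (14.2.x) version
    ceph versions
    # the osdmap should require nautilus after the upgrade
    ceph osd dump | grep require_osd_release
    # runtime config overrides that may have survived the upgrades
    ceph config dump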
--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch