Nautilus: PGs stuck remapped+backfilling - ceph-users

10 Oct 2019

Hi all,

I have a strange issue with backfilling and I'm not sure what the cause is.
It's a Nautilus cluster (upgraded) that has an SSD cache tier for  
OpenStack and CephFS metadata residing on the same SSDs, there were  
three SSDs in total.
Today I added two new SSDs (NVMe) (osd.15, osd.16) to be able to  
shutoff one old server that has only one SSD-OSD left (osd.20).  
Setting the crush weight of osd.20 to 0 (and adjusting the weight of  
the remaining SSDs for an even distribution) leaves 3 PGs in  
active+remapped+backfilling state. I don't understand why the  
remaining PGs aren't backfilling, the crush rule is quite simple (all  
ssd pools are replicated with size 3). The backfilling PGs are all  
from the cephfs_metadata pool. Although there are 4 SSDs for 3  
replicas the backfilling still should finish, right?

Can anyone share their thoughts why 3 PGs can't be recovered? If more  
information about the cluster is required please let me know.

Regards,
Eugen

ceph01:~ # ceph osd pool ls detail | grep meta
pool 36 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 1  
object_hash rjenkins pg_num 16 pgp_num 16 last_change 283362 flags  
hashpspool,nodelete,nodeep-scrub stripe_width 0 application cephfs

ceph01:~ # ceph pg dump | grep remapp
dumped all
36.b      28306                  0        0     28910       0      
8388608   101408323     219497 3078     3078  
active+remapped+backfilling 2019-10-10 13:36:27.427527   
284595'98565869  284595:254216941             [15,16,9]         15      
         [20,9,10]             20  284427'98489406 2019-10-10  
00:16:02.682911  284089'98003598 2019-10-06 16:03:27.558267             
  0
36.d      28087                  0        0     25327       0     
26375382   106722204     231020 3041     3041  
active+remapped+backfilling 2019-10-10 13:36:27.404739   
284595'97933905  284595:252878816             [16,15,9]         16      
         [20,9,10]             20  284427'97887652 2019-10-10  
04:13:29.371905  284259'97502135 2019-10-07 20:06:43.304593             
  0
36.4      28060                  0        0     28406       0      
8389242   104059103     225188 3061     3061  
active+remapped+backfilling 2019-10-10 13:36:27.440390  
284595'105299618  284595:312976619             [16,9,15]         16     
          [20,9,10]             20 284427'105218591 2019-10-10  
00:18:07.924006 284089'104696098 2019-10-06 16:20:17.123149             
  0

rule ssd_ruleset {
         id 1
         type replicated
         min_size 1
         max_size 10
         step take default class ssd
         step chooseleaf firstn 0 type host
         step emit
}

This is the relevant part of the osd tree:

ceph01:~ #  ceph osd tree
ID  CLASS WEIGHT   TYPE NAME             STATUS REWEIGHT PRI-AFF
  -1       34.21628 root default
-31       11.25406     host ceph01
  25   hdd  3.59999         osd.25            up  1.00000 1.00000
  26   hdd  3.59999         osd.26            up  1.00000 1.00000
  27   hdd  3.59999         osd.27            up  1.00000 1.00000
  15   ssd  0.45409         osd.15            up  1.00000 1.00000
-34       11.25406     host ceph02
   0   hdd  3.59999         osd.0             up  1.00000 1.00000
  28   hdd  3.59999         osd.28            up  1.00000 1.00000
  29   hdd  3.59999         osd.29            up  1.00000 1.00000
  16   ssd  0.45409         osd.16            up  1.00000 1.00000
-37       10.79999     host ceph03
  31   hdd  3.59999         osd.31            up  1.00000 1.00000
  32   hdd  3.59999         osd.32            up  1.00000 1.00000
  33   hdd  3.59999         osd.33            up  1.00000 1.00000
-24        0.45409     host san01-ssd
  10   ssd  0.45409         osd.10            up  1.00000 1.00000
-23        0.45409     host san02-ssd
   9   ssd  0.45409         osd.9             up  1.00000 1.00000
-22              0     host san03-ssd
  20   ssd        0         osd.20            up  1.00000 1.00000

Don't be confused because of the '-ssd' suffix, we're using crush  
location hooks.
This is the current PG distribution on the SSDs:

ceph01:~ # ceph osd df | grep -E "^15 |^16 |^ 9|^10 |^20 "
15   ssd 0.45409  1.00000 465 GiB  34 GiB  32 GiB 1.2 GiB  857 MiB 431  
GiB  7.29 0.22  27     up
16   ssd 0.45409  1.00000 465 GiB  37 GiB  34 GiB 1.5 GiB  964 MiB 428  
GiB  7.87 0.23  31     up
10   ssd 0.45409  1.00000 745 GiB  27 GiB  25 GiB 1.7 GiB  950 MiB 718  
GiB  3.65 0.11  29     up
  9   ssd 0.45409  1.00000 745 GiB  34 GiB  32 GiB 1.3 GiB  902 MiB  
711 GiB  4.60 0.14  30     up
20   ssd       0  1.00000 894 GiB 8.2 GiB 4.3 GiB 1.5 GiB  2.4 GiB 886  
GiB  0.91 0.03   3     up

Current ceph status:

ceph01:~ #  ceph -s
   cluster:
     id:     655cb05a-435a-41ba-83d9-8549f7c36167
     health: HEALTH_OK

   services:
     mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 2d)
     mgr: ceph03(active, since 8d), standbys: ceph01, ceph02
     mds: cephfs:1 {0=mds01=up:active} 1 up:standby-replay 1 up:standby
     osd: 26 osds: 26 up (since 66m), 26 in (since 66m); 3 remapped pgs

   data:
     pools:   8 pools, 264 pgs
     objects: 4.96M objects, 5.0 TiB
     usage:   16 TiB used, 31 TiB / 47 TiB avail
     pgs:     115745/14865558 objects misplaced (0.779%)
              261 active+clean
              3   active+remapped+backfilling

   io:
     client:   903 KiB/s rd, 8.8 MiB/s wr, 85 op/s rd, 266 op/s wr
     recovery: 0 B/s, 61 keys/s, 12 objects/s
     cache:    4.2 MiB/s flush, 15 MiB/s evict, 0 op/s promote