Hi,
I have a small cluster of 3 nodes. Each node has 10 or 11 OSDs, mostly HDDs
with a couple of SSDs for faster pools. I am trying to set up an erasure
coded pool with k=6, m=6, with each node storing 4 chunks on separate OSDs.
Since this does not seem possible with the CLI tooling, I have written my own
CRUSH rule to achieve it, which looks like this:
```
rule 3host4osd {
    id 3
    type erasure
    min_size 12
    max_size 12
    step set_chooseleaf_tries 20
    step set_choose_tries 100
    step take default class hdd
    step choose indep 3 type host
    step choose indep 4 type osd
    step emit
}
```
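As a toy illustration of what the rule is intended to do (this is not real CRUSH, just a sketch with a hypothetical host/OSD layout): pick 3 hosts, then 4 distinct OSDs within each, giving 12 chunks with at most 4 per host.

```python
# Toy sketch of the intended placement (NOT actual CRUSH behaviour):
# "step choose indep 3 type host" then "step choose indep 4 type osd".
# The cluster layout below is hypothetical.
import random

cluster = {
    "hostA": list(range(0, 10)),
    "hostB": list(range(10, 21)),
    "hostC": list(range(21, 31)),
}

def place(pg_seed, cluster, hosts_wanted=3, osds_per_host=4):
    rng = random.Random(pg_seed)  # stand-in for CRUSH's deterministic hashing
    hosts = rng.sample(sorted(cluster), hosts_wanted)        # 3 distinct hosts
    return [osd for h in hosts
            for osd in rng.sample(cluster[h], osds_per_host)]  # 4 distinct OSDs each

placement = place(0x87, cluster)
print(placement)  # 12 distinct OSDs, exactly 4 per host
```

With a layout like this, every placement the sketch produces satisfies the 4-chunks-per-host limit; the question is why real CRUSH sometimes does not.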
I've set up my erasure code profile and pool:
```
root@virt02:~# ceph osd pool get rbd_erasure crush_rule
crush_rule: 3host4osd
root@virt02:~# ceph osd pool get rbd_erasure size
size: 12
root@virt02:~# ceph osd pool get rbd_erasure min_size
min_size: 7
root@virt02:~# ceph osd pool get rbd_erasure erasure_code_profile
erasure_code_profile: default
root@virt02:~# ceph osd erasure-code-profile get default
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=6
plugin=jerasure
technique=reed_sol_van
w=8
```
Based on my understanding of Ceph, this rule should pick 3 hosts, then pick 4
OSDs on each of those hosts. This is *almost* the case. However, when testing
by taking out a host after putting a bunch of data in the pool, 5 PGs (out of
512) turn out to have more than 4 chunks placed on the same host. In all cases
it is the same host that gets the extra chunks. While that host is out, I see
these errors:
```
[WRN] PG_AVAILABILITY: Reduced data availability: 5 pgs inactive, 5 pgs down
pg 2.87 is down, acting
[2147483647,2147483647,2147483647,2147483647,22,2147483647,2147483647,20,16,2147483647,17,18]
pg 2.f3 is down, acting
[2147483647,22,2147483647,2147483647,23,2147483647,18,17,2147483647,2147483647,2147483647,2147483647]
pg 2.100 is down, acting
[2147483647,18,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,9,20,22,4]
pg 2.141 is down, acting
[2147483647,18,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,20,7,4,22]
pg 2.1bb is down, acting
[20,2147483647,2147483647,2147483647,18,2147483647,23,17,2147483647,2147483647,2147483647,2147483647]
```
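If I read these acting sets right (assuming 2147483647 is CRUSH_ITEM_NONE, i.e. "no OSD mapped"), each of these PGs is left with fewer than k=6 chunks, which is why they are down. A quick sketch counting the surviving chunks per PG from the output above:

```python
# Count surviving chunks per down PG, assuming 2147483647 is
# CRUSH_ITEM_NONE (the "no OSD mapped" sentinel in acting sets).
NONE = 2147483647
K = 6  # data chunks required in a k=6, m=6 pool

acting = {
    "2.87":  [NONE, NONE, NONE, NONE, 22, NONE, NONE, 20, 16, NONE, 17, 18],
    "2.f3":  [NONE, 22, NONE, NONE, 23, NONE, 18, 17, NONE, NONE, NONE, NONE],
    "2.100": [NONE, 18, NONE, NONE, NONE, NONE, NONE, NONE, 9, 20, 22, 4],
    "2.141": [NONE, 18, NONE, NONE, NONE, NONE, NONE, NONE, 20, 7, 4, 22],
    "2.1bb": [20, NONE, NONE, NONE, 18, NONE, 23, 17, NONE, NONE, NONE, NONE],
}

for pg, osds in acting.items():
    alive = [o for o in osds if o != NONE]
    status = "down" if len(alive) < K else "active"
    print(pg, len(alive), status)
```

All five PGs come out with 4 or 5 surviving chunks, below the k=6 needed, consistent with the PG_AVAILABILITY warning.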
As an example, PG 2.87 (rbd_erasure has pool ID 2 according to
`ceph osd lspools`):
```
root@virt02:~# ceph pg 2.87 query
[...]
"up": [
2,
0,
6,
5,
22,
1,
8,
20,
16,
14,
17,
18
],
"acting": [
2,
0,
6,
5,
22,
1,
8,
20,
16,
14,
17,
18
],
[...]
```
OSDs 0, 1, 2, 5, 6, 8 and 14 are all running on the same host.
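Grouping that acting set by host makes the violation explicit. The sketch below assumes OSDs 0, 1, 2, 5, 6, 8 and 14 share one host (as stated above); the host names and the assignment of the remaining OSDs are hypothetical placeholders:

```python
# Group PG 2.87's pre-failure acting set by host.
# Host names, and the host assignment of OSDs 16/17/18/20/22, are
# hypothetical; only the grouping of 0,1,2,5,6,8,14 is from the mail.
from collections import Counter

acting_2_87 = [2, 0, 6, 5, 22, 1, 8, 20, 16, 14, 17, 18]

host_of = {o: "virt01" for o in (0, 1, 2, 5, 6, 8, 14)}   # stated above
host_of.update({o: "virt02" for o in (16, 17, 18, 20)})   # hypothetical
host_of.update({o: "virt03" for o in (22,)})              # hypothetical

per_host = Counter(host_of[o] for o in acting_2_87)
print(per_host)  # the rule should permit at most 4 chunks per host
```

That is 7 of the 12 chunks on a single host, so losing that host leaves only 5 chunks, fewer than k=6.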
All hosts are running Ceph Octopus 15.2.9.
I've put the output of various diagnostic commands into files accessible over
HTTPS here:
https://dsg.is/ceph_placement_problem_data/ceph_osd_crush_rule_dump_3host4o…
https://dsg.is/ceph_placement_problem_data/ceph_osd_lspools.txt
https://dsg.is/ceph_placement_problem_data/ceph_osd_pool_get_rbd_erasure_al…
https://dsg.is/ceph_placement_problem_data/ceph_pg_2.87_query.txt
https://dsg.is/ceph_placement_problem_data/ceph_pg_dump_all.txt
https://dsg.is/ceph_placement_problem_data/ceph_pg_ls.txt
Any thoughts or ideas what I'm doing wrong?
Kind regards,
Davíð