So I wanted to report some strange behaviour with crush rules / EC profiles and radosgw pools,
which I am not sure is a bug or whether it is supposed to work that way.
I am trying to implement the scenario below in my home lab:
By default there is a "default" erasure-code-profile with the below settings:
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8
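For reference, you can print these settings yourself with:
ceph osd erasure-code-profile ls
ceph osd erasure-code-profile get default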
From the above we see that it uses the default root bucket. Of course you would normally want
to create your own EC profile with a custom algorithm, crush buckets, etc.
Let's say, for example, we create two new EC profiles: one with crush-root=ssd-performance2
and one with crush-root=default (there are no disks under the default root according to the
ceph osd tree output at the end of this mail):
ceph osd erasure-code-profile set test-ec crush-device-class= crush-failure-domain=host
crush-root=ssd-performance2 jerasure-per-chunk-alignment=false k=2 m=1 plugin=jerasure
technique=reed_sol_van w=8
ceph osd erasure-code-profile set test-ec2 crush-device-class= crush-failure-domain=host
crush-root=default jerasure-per-chunk-alignment=false k=2 m=1 plugin=jerasure
technique=reed_sol_van w=8
Now let's create the associated crush rules to use these profiles:
ceph osd crush rule create-erasure erasure-test-rule test-ec
ceph osd crush rule create-erasure erasure-test-rule2 test-ec2
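To double-check what was created:
ceph osd erasure-code-profile get test-ec
ceph osd crush rule ls
ceph osd crush rule dump erasure-test-rule
(the full dump of erasure-test-rule is pasted at the end of this mail)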
Now let's say you have a radosgw server that has started; by default it creates the 5 default
radosgw pools (assuming you have also uploaded some data):
default.rgw.buckets.data
default.rgw.buckets.index
default.rgw.control
default.rgw.log
default.rgw.meta
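You can see which crush rule each of them uses with, for example:
ceph osd dump | grep rgw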
If you check these pools in the ceph osd dump output you will see that all of them are using
replicated rules, but we want erasure coding for the radosgw data pool, so let's migrate the
default.rgw.buckets.data pool to an erasure-coded one.
1) We shut down the radosgw server so that no new requests come in.
2) ceph osd pool rename default.rgw.buckets.data default.rgw.buckets.data-old
3) ceph osd pool create default.rgw.buckets.data 8 8 erasure test-ec erasure-test-rule
-> We use the newly created erasure crush rule with the profile we created, which places data
under the ssd-performance2 root bucket.
4) rados cppool default.rgw.buckets.data-old default.rgw.buckets.data
5) Start the radosgw server again.
At this point I can see the old objects and I can upload new objects through radosgw, and
everything works fine.
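You can confirm the pool is now on the new rule with:
ceph osd pool get default.rgw.buckets.data crush_rule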
Now I see strange behavior after I do the following: we set default.rgw.buckets.data to use
the other erasure crush rule (the one whose root bucket is default, which has no disks under it):
ceph osd pool set default.rgw.buckets.data crush_rule erasure-test-rule2
Bug 1? You can still browse the data, but any attempt to upload/download hangs with the log
messages below:
2019-12-18 17:07:07.037 7f05a1ece700 0 ERROR: client_io->complete_request() returned
Input/output error
2019-12-18 17:07:07.037 7f05a1ece700 2 req 712 0.004s s3:list_buckets op status=
The monitor nodes don't display anything, and it seems that new objects cannot be saved (which
is correct, as CRUSH doesn't know where to place them), but shouldn't the monitors at least
raise a warning, or shouldn't there be a CRUSH check beforehand to verify that the rule can
actually be applied?
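For what it's worth, a rule can be tested offline with crushtool before applying it. A rough
sketch, assuming erasure-test-rule2 got rule id 3 (check ceph osd crush rule dump) and using
num-rep = k+m = 3; with no OSDs under the default root it should report every mapping as bad:
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 3 --num-rep 3 --show-bad-mappings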
Reverting the rule back to erasure-test-rule makes everything work fine again.
=================================
Bug 2? If you modify the test-ec profile (the one behind erasure-test-rule) to use an empty
crush root bucket (like erasure-test-rule2 uses), the change is not parsed and picked up by the
crush rule. It seems the crush rule skips that part.
Example:
ceph osd erasure-code-profile set test-ec crush-root=default --force
At this point nothing happens and radosgw keeps working fine, which it shouldn't, because it
should now see that the data cannot be stored anywhere. Unless placement keeps the crush root
bucket from the crush rule and not from the erasure-code profile... even if you force-change it
in the erasure profile as above.
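You can see the mismatch by comparing the two after the change: the profile now reports
crush-root=default, while (presumably) the rule's take step still points at ssd-performance2,
as in the rule dump at the end of this mail:
ceph osd erasure-code-profile get test-ec
ceph osd crush rule dump erasure-test-rule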
=================================
Bug 3? From ceph osd dump you can't tell which erasure-code profile a rule is using. You only
see that this pool is using crush rule number 1, but if you dump that crush rule it doesn't say
which erasure-code profile it was created from; it only shows the item_name, e.g. the root
bucket. Even with telemetry on in the latest release, "ceph telemetry show basic" gives the
output below, and no crush-root is mentioned. So does the crush rule take precedence over the
erasure_code_profile when it comes to parsing the crush_root buckets?
{
    "min_size": 2,
    "erasure_code_profile": {
        "crush-failure-domain": "host",
        "k": "2",
        "technique": "reed_sol_van",
        "m": "1",
        "plugin": "jerasure"
    },
    "pg_autoscale_mode": "warn",
    "pool": 860,
    "size": 3,
    "cache_mode": "none",
    "target_max_objects": 0,
    "pg_num": 8,
    "pgp_num": 8,
    "target_max_bytes": 0,
    "type": "erasure"
}
root@ceph-mon01:~# ceph osd crush rule dump erasure-test-rule
{
    "rule_id": 2,
    "rule_name": "erasure-test-rule",
    "ruleset": 2,
    "type": 3,
    "min_size": 3,
    "max_size": 3,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "ssd-performance2"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
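As far as I can tell, the profile can only be recovered per pool, not per rule, e.g.:
ceph osd pool get default.rgw.buckets.data crush_rule
ceph osd pool get default.rgw.buckets.data erasure_code_profile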
root@ceph-mon01:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-37 0.18398 root really-low
-40 0.09799 host ceph-osd01-really-low
11 hdd 0.09799 osd.11 up 1.00000 1.00000
-41 0.04799 host ceph-osd02-really-low
1 hdd 0.01900 osd.1 up 1.00000 1.00000
9 hdd 0.02899 osd.9 up 1.00000 1.00000
-42 0.03799 host ceph-osd03-really-low
6 hdd 0.01900 osd.6 up 1.00000 1.00000
7 hdd 0.01900 osd.7 up 1.00000 1.00000
-23 10.67598 root spinning-rust
-20 2.04900 rack rack1
-3 2.04900 host ceph-osd01
3 hdd 0.04900 osd.3 up 0.95001 1.00000
22 hdd 1.00000 osd.22 up 0.90002 1.00000
17 ssd 1.00000 osd.17 up 1.00000 1.00000
-25 3.07799 rack rack2
-5 3.07799 host ceph-osd02
4 hdd 0.04900 osd.4 up 1.00000 1.00000
8 hdd 0.02899 osd.8 up 1.00000 1.00000
23 hdd 1.00000 osd.23 up 1.00000 1.00000
25 hdd 1.00000 osd.25 up 1.00000 1.00000
12 ssd 1.00000 osd.12 up 1.00000 1.00000
-28 3.54900 rack rack3
-7 3.54900 host ceph-osd03
0 hdd 1.00000 osd.0 up 0.90002 1.00000
5 hdd 0.04900 osd.5 up 1.00000 1.00000
30 hdd 0.50000 osd.30 up 1.00000 1.00000
21 ssd 1.00000 osd.21 up 0.95001 1.00000
24 ssd 1.00000 osd.24 up 1.00000 1.00000
-55 2.00000 rack rack4
-49 2.00000 host ceph-osd04
26 hdd 1.00000 osd.26 up 1.00000 1.00000
27 hdd 1.00000 osd.27 up 1.00000 1.00000
-2 9.10799 root ssd-performance2
-32 2.09799 host ceph-osd01-ssd
2 ssd 0.09799 osd.2 up 1.00000 1.00000
13 ssd 1.00000 osd.13 up 1.00000 1.00000
16 ssd 1.00000 osd.16 up 1.00000 1.00000
-31 3.00000 host ceph-osd02-ssd
14 ssd 1.00000 osd.14 up 1.00000 1.00000
18 ssd 1.00000 osd.18 up 1.00000 1.00000
19 ssd 1.00000 osd.19 up 1.00000 1.00000
-9 2.00999 host ceph-osd03-ssd
10 ssd 0.00999 osd.10 up 0.90002 1.00000
15 ssd 1.00000 osd.15 up 1.00000 1.00000
20 ssd 1.00000 osd.20 up 1.00000 1.00000
-52 2.00000 host ceph-osd04-ssd
28 ssd 1.00000 osd.28 up 1.00000 1.00000
29 ssd 1.00000 osd.29 up 1.00000 1.00000
-1 0 root default
root@ceph-mon01:~#
Thanks,
Anastasios