Hi Eugen,
Thank you very much for these detailed tests that match what I observed
and reported earlier. I'm happy to see that we have the same
understanding of how it should work (based on the documentation). Is
there any other way than this list to get in contact with the plugin
developers, as it seems they are not following this (very high volume)
list? Or could somebody forward the email thread to one of them?
Help would be really appreciated. Cheers,
Michel
On 19/06/2023 at 14:09, Eugen Block wrote:
> Hi, I have a real hardware cluster for testing available now. I'm not
> sure whether I'm completely misunderstanding how it's supposed to work
> or if it's a bug in the LRC plugin.
> This cluster has 18 HDD nodes available across 3 rooms (or DCs); I
> intend to use only 15 of them so that I can recover if one node fails.
> Given that I need one additional locality chunk per DC, I need a
> profile with k + m = 12. So I chose k=9, m=3, l=4, which creates 15
> chunks in total ((k+m) + (k+m)/l = 12 + 3) across those 3 DCs, one
> chunk per host. I checked the chunk placement and it is correct. This
> is the profile I created:
>
> ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4
> crush-failure-domain=host crush-locality=room crush-device-class=hdd
>
> I created a pool with only one PG to make the output more readable.
>
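> For reference, the commands were roughly the following (the pool name
> "lrctest" is just a placeholder here):
>
> ceph osd pool create lrctest 1 1 erasure lrc1
> ceph pg ls-by-pool lrctest
>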
> This profile should allow the cluster to sustain the loss of three
> chunks, but the results are interesting. This is what I tested:
>
> 1. I stopped all OSDs on one host and the PG was still active with one
> missing chunk, everything's good.
> 2. Stopping a second host in the same DC resulted in the PG being
> marked as "down". That was unexpected since with m=3 I expected the PG
> to still be active but degraded. Before test #3 I started all OSDs to
> have the PG active+clean again.
> 3. I stopped one host per DC, so in total 3 chunks were missing and
> the PG was still active.
>
> Apparently, this profile is able to sustain the loss of m chunks, but
> not of an entire DC. I get the impression (and I also discussed this
> with a colleague) that this LRC implementation is either designed only
> for the loss of single OSDs, which can then be recovered more quickly
> from fewer surviving OSDs while saving inter-DC bandwidth, or this is
> a bug, because according to the low-level description [1] the
> algorithm works its way up in reverse order through the configured
> layers, like in this example (not matching my k, m, l requirements,
> just for reference):
>
> chunk nr 01234567
> step 1 _cDD_cDD
> step 2 cDDD____
> step 3 ____cDDD
>
> So if a whole DC fails and the chunks from step 3 cannot be recovered,
> and maybe recovery at step 2 fails as well, step 1 eventually still
> contains the actual k and m chunks, which should sustain the loss of
> an entire DC. My impression is that the algorithm somehow doesn't
> arrive at step 1 and therefore the PG stays down although there are
> enough surviving chunks. I'm not sure if my observations and
> conclusion are correct; I'd love to have a comment from the developers
> on this topic. But in this state I would not recommend using the LRC
> plugin when the resiliency requirement is to sustain the loss of an
> entire DC.
>
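> Just to make the reference concrete: if I read [1] correctly, that
> example layout corresponds to a low-level profile roughly like the
> following (my reading of the documentation, not verified on this
> cluster; the profile name is just an example and it does not reflect
> my k=9/m=3/l=4 setup):
>
> ceph osd erasure-code-profile set LRCexample plugin=lrc \
>      mapping=__DD__DD \
>      layers='[
>                [ "_cDD_cDD", "" ],
>                [ "cDDD____", "" ],
>                [ "____cDDD", "" ],
>              ]'
>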
> Thanks,
> Eugen
>
> [1]
>
https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-leve…
>
> Quoting Michel Jouvin <michel.jouvin(a)ijclab.in2p3.fr>:
>
>> Hi,
>>
>> I realize that the crushmap I attached to one of my emails, probably
>> required to understand the discussion here, has been stripped by
>> mailman. To avoid polluting the thread with a long output, I put it
>> at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if
>> you are interested.
>>
>> Best regards,
>>
>> Michel
>>
>> On 21/05/2023 at 16:07, Michel Jouvin wrote:
>>> Hi Eugen,
>>>
>>> My LRC pool is also somewhat experimental, so nothing really urgent.
>>> If you manage to do some tests that help me understand the problem,
>>> I remain interested. I propose to keep this thread for that.
>>>
>>> As mentioned, I shared my crush map in the email you answered, if
>>> the attachment was not suppressed by mailman.
>>>
>>> Cheers,
>>>
>>> Michel
>>> Sent from my mobile
>>>
>>> On 18 May 2023 at 11:19:35, Eugen Block <eblock(a)nde.ag> wrote:
>>>
>>>> Hi, I don’t have a good explanation for this yet, but I’ll soon get
>>>> the opportunity to play around with a decommissioned cluster. I’ll try
>>>> to get a better understanding of the LRC plugin, but it might take
>>>> some time, especially since my vacation is coming up. :-)
>>>> I have some thoughts about the down PGs with failure domain OSD, but I
>>>> don’t have anything to confirm them yet.
>>>>
>>>> Quoting Curt <lightspd(a)gmail.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've been following this thread with interest as it seems like a
>>>>> unique use case to expand my knowledge. I don't use LRC or anything
>>>>> outside basic erasure coding.
>>>>>
>>>>> What are the current steps in your crush rule? I know you made
>>>>> changes since your first post and had some thoughts I wanted to
>>>>> share, but wanted to see your rule first so I could try to
>>>>> visualize the distribution better. The only way I can currently
>>>>> visualize it working is with more servers, I'm thinking 6 or 9 per
>>>>> data center minimum, but that could be my lack of knowledge of some
>>>>> of the step rules.
>>>>>
>>>>> Thanks
>>>>> Curt
>>>>>
>>>>> On Tue, May 16, 2023 at 11:09 AM Michel Jouvin
>>>>> <michel.jouvin(a)ijclab.in2p3.fr> wrote:
>>>>>
>>>>>> Hi Eugen,
>>>>>>
>>>>>> Yes, sure, no problem to share it. I attach it to this email (as
>>>>>> it may clutter the discussion if inline).
>>>>>>
>>>>>> If somebody on the list has some clue about the LRC plugin, I'm
>>>>>> still interested in understanding what I'm doing wrong!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Michel
>>>>>>
>>>>>> On 04/05/2023 at 15:07, Eugen Block wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I don't think you've shared your osd tree yet, could you do that?
>>>>>>> Apparently nobody else but us reads this thread, or nobody
>>>>>>> reading this uses the LRC plugin. ;-)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Eugen
>>>>>>>
>>>>>>> Quoting Michel Jouvin <michel.jouvin(a)ijclab.in2p3.fr>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I had to restart one of my OSD servers today and the problem
>>>>>>>> showed up again. This time I managed to capture the "ceph health
>>>>>>>> detail" output showing the problem with the 2 PGs:
>>>>>>>>
>>>>>>>> [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs
>>>>>>>> inactive, 2 pgs down
>>>>>>>>     pg 56.1 is down, acting
>>>>>>>> [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
>>>>>>>>     pg 56.12 is down, acting
>>>>>>>> [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]
>>>>>>>>
>>>>>>>> I still don't understand why, if I am supposed to survive a
>>>>>>>> datacenter failure, I cannot survive 3 OSDs down on the same
>>>>>>>> host hosting shards for the PG. In the second case it is only
>>>>>>>> 2 OSDs down, but I'm surprised they don't seem to be in the same
>>>>>>>> "group" of OSDs (I'd have expected all the OSDs of one
>>>>>>>> datacenter to be in the same group of 5, if the order given
>>>>>>>> really reflects the allocation done)...
>>>>>>>>
>>>>>>>> Still interested in some explanation of what I'm doing wrong!
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Michel
>>>>>>>>
>>>>>>>> On 03/05/2023 at 10:21, Eugen Block wrote:
>>>>>>>>> I think I got it wrong with the locality setting. I'm still
>>>>>>>>> limited by the number of hosts I have available in my test
>>>>>>>>> cluster, but as far as I got with failure-domain=osd, I believe
>>>>>>>>> k=6, m=3, l=3 with locality=datacenter could fit your
>>>>>>>>> requirement, at least with regard to the recovery bandwidth
>>>>>>>>> usage between DCs, but the resiliency would not match your
>>>>>>>>> requirement (one DC failure). That profile creates 3 groups of
>>>>>>>>> 4 chunks (3 data/coding chunks and one parity chunk) across
>>>>>>>>> three DCs, 12 chunks in total. The min_size=7 would not allow
>>>>>>>>> an entire DC to go down, I'm afraid; you'd have to reduce it to
>>>>>>>>> 6 to allow reads/writes in a disaster scenario. I'm still not
>>>>>>>>> sure if I got it right this time, but maybe you're better off
>>>>>>>>> without the LRC plugin given the limited number of hosts.
>>>>>>>>> Instead you could use the jerasure plugin with a profile like
>>>>>>>>> k=4 m=5, allowing an entire DC to fail without losing data
>>>>>>>>> access (we have one customer using that).
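>>>>>>>>>
>>>>>>>>> To illustrate (an untested sketch, pool and profile names are
>>>>>>>>> placeholders): lowering min_size on an existing pool would be
>>>>>>>>>
>>>>>>>>> ceph osd pool set <poolname> min_size 6
>>>>>>>>>
>>>>>>>>> and the kind of jerasure profile I mean could be created with
>>>>>>>>>
>>>>>>>>> ceph osd erasure-code-profile set jerasure-k4m5 plugin=jerasure k=4 m=5 crush-failure-domain=host
>>>>>>>>>
>>>>>>>>> combined with a crush rule that places 3 of the 9 chunks in
>>>>>>>>> each of the 3 DCs (the 2-step choose/chooseleaf approach
>>>>>>>>> discussed earlier in this thread), so that losing one DC
>>>>>>>>> removes only 3 chunks and at least k=4 remain available.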
>>>>>>>>>
>>>>>>>>> Quoting Eugen Block <eblock(a)nde.ag>:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> disclaimer: I haven't used LRC in a real setup yet, so there
>>>>>>>>>> might be some misunderstandings on my side. But I tried to
>>>>>>>>>> play around with one of my test clusters (Nautilus). Because
>>>>>>>>>> I'm limited in the number of hosts (6 across 3 virtual DCs) I
>>>>>>>>>> tried two different profiles with lower numbers to get a
>>>>>>>>>> feeling for how that works.
>>>>>>>>>>
>>>>>>>>>> # first attempt
>>>>>>>>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 m=2 l=3 crush-failure-domain=host
>>>>>>>>>>
>>>>>>>>>> For every third OSD one parity chunk is added, so 2 more
>>>>>>>>>> chunks to store ==> 8 chunks in total. Since my failure-domain
>>>>>>>>>> is host and I only have 6 hosts, I get incomplete PGs.
>>>>>>>>>>
>>>>>>>>>> # second attempt
>>>>>>>>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2 m=2 l=2 crush-failure-domain=host
>>>>>>>>>>
>>>>>>>>>> This gives me 6 chunks in total to store across 6 hosts, which
>>>>>>>>>> works:
>>>>>>>>>> ceph:~ # ceph pg ls-by-pool lrcpool
>>>>>>>>>> PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED  UP                     ACTING                 SCRUB_STAMP                 DEEP_SCRUB_STAMP
>>>>>>>>>> 50.0  1        0         0          0        619    0            0           1    active+clean  72s    18410'1  18415:54  [27,13,0,2,25,7]p27    [27,13,0,2,25,7]p27    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
>>>>>>>>>> 50.1  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18414:26  [27,33,22,6,13,34]p27  [27,33,22,6,13,34]p27  2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
>>>>>>>>>> 50.2  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:25  [1,28,14,4,31,21]p1    [1,28,14,4,31,21]p1    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
>>>>>>>>>> 50.3  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:24  [8,16,26,33,7,25]p8    [8,16,26,33,7,25]p8    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
>>>>>>>>>>
>>>>>>>>>> After stopping all OSDs on one host I was still able to read
>>>>>>>>>> from and write to the pool, but after stopping a second host
>>>>>>>>>> one PG from that pool went "down". That I don't fully
>>>>>>>>>> understand yet, but I just started to look into it.
>>>>>>>>>> With your setup (12 hosts) I would recommend not utilizing all
>>>>>>>>>> of them so you have capacity to recover, let's say one "spare"
>>>>>>>>>> host per DC, leaving 9 hosts in total. A profile with k=3 m=3
>>>>>>>>>> l=2 could make sense here, resulting in 9 chunks in total (one
>>>>>>>>>> more parity chunk for every other OSD), min_size 4. But as I
>>>>>>>>>> wrote, it probably doesn't have the resiliency for a DC
>>>>>>>>>> failure, so that needs some further investigation.
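>>>>>>>>>>
>>>>>>>>>> For completeness, such a profile could be created roughly like
>>>>>>>>>> this (untested, the profile name is just a placeholder; k+m=6
>>>>>>>>>> is a multiple of l=2, and the 3 extra locality chunks bring
>>>>>>>>>> the total to 9):
>>>>>>>>>>
>>>>>>>>>> ceph osd erasure-code-profile set lrc-k3m3l2 plugin=lrc k=3 m=3 l=2 crush-failure-domain=host crush-locality=datacenter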
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Eugen
>>>>>>>>>>
>>>>>>>>>> Quoting Michel Jouvin <michel.jouvin(a)ijclab.in2p3.fr>:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> No... our current setup is 3 datacenters with the same
>>>>>>>>>>> configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs
>>>>>>>>>>> each, thus a total of 12 OSD servers. As, with the LRC
>>>>>>>>>>> plugin, k+m must be a multiple of l, I found that k=9/m=6/l=5
>>>>>>>>>>> with crush-locality=datacenter was achieving my goal of being
>>>>>>>>>>> resilient to a datacenter failure. Because of this, I
>>>>>>>>>>> considered that lowering the crush failure domain to osd was
>>>>>>>>>>> not a major issue in my case (as it would not be worse than a
>>>>>>>>>>> datacenter failure if all the shards are on the same server
>>>>>>>>>>> in a datacenter) and was working around the lack of hosts for
>>>>>>>>>>> k=9/m=6 (15 OSDs).
>>>>>>>>>>>
>>>>>>>>>>> Maybe it helps if I give the erasure code profile used:
>>>>>>>>>>>
>>>>>>>>>>> crush-device-class=hdd
>>>>>>>>>>> crush-failure-domain=osd
>>>>>>>>>>> crush-locality=datacenter
>>>>>>>>>>> crush-root=default
>>>>>>>>>>> k=9
>>>>>>>>>>> l=5
>>>>>>>>>>> m=6
>>>>>>>>>>> plugin=lrc
>>>>>>>>>>>
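>>>>>>>>>>> If I compute it correctly, this profile means (k+m) = 15
>>>>>>>>>>> data/coding chunks plus (k+m)/l = 3 locality chunks, i.e. 18
>>>>>>>>>>> chunks per object in total, which is consistent with the
>>>>>>>>>>> max_size=18 reported for the pool.
>>>>>>>>>>>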
>>>>>>>>>>> The previously mentioned strange number for min_size for the
>>>>>>>>>>> pool created with this profile has vanished after the Quincy
>>>>>>>>>>> upgrade, as this parameter is no longer in the CRUSH map
>>>>>>>>>>> rule, and the `ceph osd pool get` command reports the
>>>>>>>>>>> expected number (10):
>>>>>>>>>>>
>>>>>>>>>>> ---------
>>>>>>>>>>> > ceph osd pool get fink-z1.rgw.buckets.data min_size
>>>>>>>>>>> min_size: 10
>>>>>>>>>>> ---------
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Michel
>>>>>>>>>>>
>>>>>>>>>>> On 29/04/2023 at 20:36, Curt wrote:
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> What is your current setup, 1 server per data center with
>>>>>>>>>>>> 12 OSDs each? What is your current crush rule and LRC crush
>>>>>>>>>>>> rule?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 28, 2023, 12:29 Michel Jouvin
>>>>>>>>>>>> <michel.jouvin(a)ijclab.in2p3.fr> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I think I found a possible cause of my PG down, but I still
>>>>>>>>>>>> don't understand why.
>>>>>>>>>>>> As explained in a previous mail, I set up a 15-chunk/OSD EC
>>>>>>>>>>>> pool (k=9, m=6) but I have only 12 OSD servers in the
>>>>>>>>>>>> cluster. To work around the problem I defined the failure
>>>>>>>>>>>> domain as 'osd', with the reasoning that, as I was using the
>>>>>>>>>>>> LRC plugin, I had the guarantee that I could lose a site
>>>>>>>>>>>> without impact, and thus the possibility to lose 1 OSD
>>>>>>>>>>>> server. Am I wrong?
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Michel
>>>>>>>>>>>>
>>>>>>>>>>>> On 24/04/2023 at 13:24, Michel Jouvin wrote:
>>>>>>>>>>>> > Hi,
>>>>>>>>>>>> >
>>>>>>>>>>>> > I'm still interested in getting feedback from those using
>>>>>>>>>>>> > the LRC plugin about the right way to configure it... Last
>>>>>>>>>>>> > week I upgraded from Pacific to Quincy (17.2.6) with
>>>>>>>>>>>> > cephadm, which does the upgrade host by host, checking if
>>>>>>>>>>>> > an OSD is ok to stop before actually upgrading it. I had
>>>>>>>>>>>> > the surprise to see 1 or 2 PGs down at some points in the
>>>>>>>>>>>> > upgrade (it happened not for all OSDs, but for every
>>>>>>>>>>>> > site/datacenter). Looking at the details with "ceph health
>>>>>>>>>>>> > detail", I saw that for these PGs there were 3 OSDs down,
>>>>>>>>>>>> > but I was expecting the pool to be resilient to 6 OSDs
>>>>>>>>>>>> > down (5 for R/W access), so I'm wondering if there is
>>>>>>>>>>>> > something wrong in our pool configuration (k=9, m=6, l=5).
>>>>>>>>>>>> >
>>>>>>>>>>>> > Cheers,
>>>>>>>>>>>> >
>>>>>>>>>>>> > Michel
>>>>>>>>>>>> >
>>>>>>>>>>>> > On 06/04/2023 at 08:51, Michel Jouvin wrote:
>>>>>>>>>>>> >> Hi,
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Is somebody using the LRC plugin?
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> I came to the conclusion that LRC k=9, m=3, l=4 is not
>>>>>>>>>>>> >> the same as jerasure k=9, m=6 in terms of protection
>>>>>>>>>>>> >> against failures, and that I should use k=9, m=6, l=5 to
>>>>>>>>>>>> >> get a level of resilience >= jerasure k=9, m=6. The
>>>>>>>>>>>> >> example in the documentation (k=4, m=2, l=3) suggests
>>>>>>>>>>>> >> that this LRC configuration gives something better than
>>>>>>>>>>>> >> jerasure k=4, m=2, as it is resilient to 3 drive failures
>>>>>>>>>>>> >> (but not 4, if I understood properly). So how many drives
>>>>>>>>>>>> >> can fail in the k=9, m=6, l=5 configuration, first
>>>>>>>>>>>> >> without losing RW access and second without losing data?
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Another thing that I don't quite understand is that a
>>>>>>>>>>>> >> pool created with this configuration (and failure
>>>>>>>>>>>> >> domain=osd, locality=datacenter) has a min_size=3
>>>>>>>>>>>> >> (max_size=18 as expected). It seems wrong to me; I'd have
>>>>>>>>>>>> >> expected something ~10 (depending on the answer to the
>>>>>>>>>>>> >> previous question)...
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Thanks in advance if somebody could provide some sort of
>>>>>>>>>>>> >> authoritative answer on these 2 questions. Best regards,
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Michel
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> On 04/04/2023 at 15:53, Michel Jouvin wrote:
>>>>>>>>>>>> >>> Answering to myself, I found the reason for 2147483647:
>>>>>>>>>>>> >>> it's documented as a failure to find enough OSDs
>>>>>>>>>>>> >>> (missing OSDs). And it is normal, as I selected
>>>>>>>>>>>> >>> different hosts for the 15 OSDs but I have only 12
>>>>>>>>>>>> >>> hosts!
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> I'm still interested in an "expert" confirming that the
>>>>>>>>>>>> >>> LRC k=9, m=3, l=4 configuration is equivalent, in terms
>>>>>>>>>>>> >>> of redundancy, to a jerasure configuration with k=9,
>>>>>>>>>>>> >>> m=6.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Michel
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
>>>>>>>>>>>> >>>> Hi,
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> As discussed in another thread (Crushmap rule for
>>>>>>>>>>>> >>>> multi-datacenter erasure coding), I'm trying to create
>>>>>>>>>>>> >>>> an EC pool spanning 3 datacenters (the datacenters are
>>>>>>>>>>>> >>>> present in the crushmap), with the objective of being
>>>>>>>>>>>> >>>> resilient to 1 DC down, at least keeping read-only
>>>>>>>>>>>> >>>> access to the pool and if possible read-write access,
>>>>>>>>>>>> >>>> and with a storage efficiency better than 3 replicas
>>>>>>>>>>>> >>>> (let's say a storage overhead <= 2).
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> In that discussion, somebody mentioned the LRC plugin
>>>>>>>>>>>> >>>> as a possible jerasure alternative to implement this
>>>>>>>>>>>> >>>> without tweaking the crushmap rule to implement the
>>>>>>>>>>>> >>>> 2-step OSD allocation. I looked at the documentation
>>>>>>>>>>>> >>>> (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/)
>>>>>>>>>>>> >>>> but I have some questions, if someone has
>>>>>>>>>>>> >>>> experience/expertise with this LRC plugin.
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> I tried to create a rule using 5 OSDs per datacenter
>>>>>>>>>>>> >>>> (15 in total), with 3 per datacenter (9 in total) being
>>>>>>>>>>>> >>>> data chunks and the others being coding chunks. For
>>>>>>>>>>>> >>>> this, based on my understanding of the examples, I used
>>>>>>>>>>>> >>>> k=9, m=3, l=4. Is that right? Is this configuration
>>>>>>>>>>>> >>>> equivalent, in terms of redundancy, to a jerasure
>>>>>>>>>>>> >>>> configuration with k=9, m=6?
>>>>>>>>>>>> >>>>
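>>>>>>>>>>>> >>>> For reference, such a profile would be created with
>>>>>>>>>>>> >>>> something along these lines (a sketch; the profile name
>>>>>>>>>>>> >>>> is an assumption on my side):
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> ceph osd erasure-code-profile set lrc-test plugin=lrc k=9 m=3 l=4 crush-locality=datacenter crush-failure-domain=host crush-device-class=hdd
>>>>>>>>>>>> >>>>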
>>>>>>>>>>>> >>>> The resulting rule, which looks correct to me, is:
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> --------
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> {
>>>>>>>>>>>> >>>>     "rule_id": 6,
>>>>>>>>>>>> >>>>     "rule_name": "test_lrc_2",
>>>>>>>>>>>> >>>>     "ruleset": 6,
>>>>>>>>>>>> >>>>     "type": 3,
>>>>>>>>>>>> >>>>     "min_size": 3,
>>>>>>>>>>>> >>>>     "max_size": 15,
>>>>>>>>>>>> >>>>     "steps": [
>>>>>>>>>>>> >>>>         {
>>>>>>>>>>>> >>>>             "op": "set_chooseleaf_tries",
>>>>>>>>>>>> >>>>             "num": 5
>>>>>>>>>>>> >>>>         },
>>>>>>>>>>>> >>>>         {
>>>>>>>>>>>> >>>>             "op": "set_choose_tries",
>>>>>>>>>>>> >>>>             "num": 100
>>>>>>>>>>>> >>>>         },
>>>>>>>>>>>> >>>>         {
>>>>>>>>>>>> >>>>             "op": "take",
>>>>>>>>>>>> >>>>             "item": -4,
>>>>>>>>>>>> >>>>             "item_name": "default~hdd"
>>>>>>>>>>>> >>>>         },
>>>>>>>>>>>> >>>>         {
>>>>>>>>>>>> >>>>             "op": "choose_indep",
>>>>>>>>>>>> >>>>             "num": 3,
>>>>>>>>>>>> >>>>             "type": "datacenter"
>>>>>>>>>>>> >>>>         },
>>>>>>>>>>>> >>>>         {
>>>>>>>>>>>> >>>>             "op": "chooseleaf_indep",
>>>>>>>>>>>> >>>>             "num": 5,
>>>>>>>>>>>> >>>>             "type": "host"
>>>>>>>>>>>> >>>>         },
>>>>>>>>>>>> >>>>         {
>>>>>>>>>>>> >>>>             "op": "emit"
>>>>>>>>>>>> >>>>         }
>>>>>>>>>>>> >>>>     ]
>>>>>>>>>>>> >>>> }
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> ------------
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> Unfortunately, it doesn't work as expected: a pool
>>>>>>>>>>>> >>>> created with this rule ends up with its PGs
>>>>>>>>>>>> >>>> active+undersized, which is unexpected to me. Looking
>>>>>>>>>>>> >>>> at the "ceph health detail" output, I see for each PG
>>>>>>>>>>>> >>>> something like:
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> pg 52.14 is stuck undersized for 27m, current state
>>>>>>>>>>>> >>>> active+undersized, last acting
>>>>>>>>>>>> >>>> [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> For each PG, there are 3 '2147483647' entries and I
>>>>>>>>>>>> >>>> guess they are the reason for the problem. What are
>>>>>>>>>>>> >>>> these entries about? Clearly they are not OSD
>>>>>>>>>>>> >>>> entries... It looks like a negative number, -1, which
>>>>>>>>>>>> >>>> in terms of crushmap IDs is the crushmap root (named
>>>>>>>>>>>> >>>> "default" in our configuration). Is there any trivial
>>>>>>>>>>>> >>>> mistake I might have made?
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> Thanks in advance for any help or for sharing any
>>>>>>>>>>>> >>>> successful configuration.
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> Best regards,
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> Michel
>>>>>>>>>>>> >>>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io