Large number of misplaced PGs but little backfill going on

List overview All Threads
Download

newer

older

Why you might want packages not...

Spam in log file

Torkil Svensgaard

23 Mar 2024 23 Mar '24

1:55 p.m.

Hi We have this after adding some hosts and changing crush failure domain to datacenter: pgs: 1338512379/3162732055 objects misplaced (42.321%) 5970 active+remapped+backfill_wait 4853 active+clean 11 active+remapped+backfilling We have 3 datacenters each with 6 hosts and ~400 HDD OSDs with DB/WAL on NVMe. Using mclock with high_recovery_ops profile. What is the bottleneck here? I would have expected a huge number of simultaneous backfills. Backfill reservation logjam? Mvh. Torkil -- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section 714 Copenhagen University Hospital Amager and Hvidovre Kettegaard Allé 30, 2650 Hvidovre, Denmark

Show replies by date

Alexander E. Patrakov

23 Mar 23 Mar

3:14 p.m.

Hello Torkil, It would help if you provided the whole "ceph osd df tree" and "ceph pg ls" outputs. On Sat, Mar 23, 2024 at 4:26 PM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

...

-- Alexander E. Patrakov

Torkil Svensgaard

4:39 p.m.

On 23-03-2024 10:44, Alexander E. Patrakov wrote:

...

Hello Torkil,

Hi Alexander

...

It would help if you provided the whole "ceph osd df tree" and "ceph pg ls" outputs.

Of course, here's ceph osd df tree to start with: https://pastebin.com/X50b2W0J The other output is too big for pastebin and I'm not familiar with paste services, any suggestion for a preferred way to share such output? Mvh. Torkil

...

On Sat, Mar 23, 2024 at 4:26 PM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

-- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section 714 Copenhagen University Hospital Amager and Hvidovre Kettegaard Allé 30, 2650 Hvidovre, Denmark

Kai Stian Olstad

10:24 p.m.

On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:

...

The other output is too big for pastebin and I'm not familiar with paste services, any suggestion for a preferred way to share such output?

You can attached files to the mail here on the list. -- Kai Stian Olstad

Torkil Svensgaard

10:38 p.m.

On 2024-03-23 17:54, Kai Stian Olstad wrote:

...

On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:

The other output is too big for pastebin and I'm not familiar with paste services, any suggestion for a preferred way to share such output?

You can attached files to the mail here on the list.

Doh, for some reason I was sure attachments would be stripped. Thanks, attached. Mvh. Torkil

Alexander E. Patrakov

11:13 p.m.

...

On 2024-03-23 17:54, Kai Stian Olstad wrote:

On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:

The other output is too big for pastebin and I'm not familiar with paste services, any suggestion for a preferred way to share such output?

You can attached files to the mail here on the list.

Doh, for some reason I was sure attachments would be stripped. Thanks, attached. Mvh. Torkil

-- Alexander E. Patrakov

Alexander E. Patrakov

11:26 p.m.

Hi Torkil, I take my previous response back. You have an erasure-coded pool with nine shards but only three datacenters. This, in general, cannot work. You need either nine datacenters or a very custom CRUSH rule. The second option may not be available if the current EC setup is already incompatible, as there is no way to change the EC parameters. It would help if you provided the output of "ceph osd pool ls detail". On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov <patrakov(a)gmail.com> wrote:

...

On 2024-03-23 17:54, Kai Stian Olstad wrote:

On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:

The other output is too big for pastebin and I'm not familiar with paste services, any suggestion for a preferred way to share such output?

You can attached files to the mail here on the list.

Doh, for some reason I was sure attachments would be stripped. Thanks, attached. Mvh. Torkil

-- Alexander E. Patrakov

Alexander E. Patrakov

11:35 p.m.

Sorry for replying to myself, but "ceph osd pool ls detail" by itself is insufficient. For every erasure code profile mentioned in the output, please also run something like this: ceph osd erasure-code-profile get prf-for-ec-data ...where "prf-for-ec-data" is the name that appears after the words "erasure profile" in the "ceph osd pool ls detail" output. On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov <patrakov(a)gmail.com> wrote:

...

On 2024-03-23 17:54, Kai Stian Olstad wrote:

On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote: > > The other output is too big for pastebin and I'm not familiar with > paste services, any suggestion for a preferred way to share such > output? You can attached files to the mail here on the list.

Doh, for some reason I was sure attachments would be stripped. Thanks, attached. Mvh. Torkil

-- Alexander E. Patrakov

Torkil Svensgaard

24 Mar 24 Mar

12:46 a.m.

On 23-03-2024 19:05, Alexander E. Patrakov wrote:

...

[root@lazy ~]# ceph osd pool ls detail | grep erasure pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 1 compression_algorithm snappy compression_mode aggressive application rbd pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 compression_algorithm zstd compression_mode aggressive application cephfs pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm zstd compression_mode aggressive application rbd [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2 crush-device-class=hdd crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=4 m=2 plugin=jerasure technique=reed_sol_van w=8 [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd crush-device-class=hdd crush-failure-domain=datacenter crush-root=default jerasure-per-chunk-alignment=false k=4 m=5 plugin=jerasure technique=reed_sol_van w=8 [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd crush-device-class=ssd crush-failure-domain=datacenter crush-root=default jerasure-per-chunk-alignment=false k=4 m=5 plugin=jerasure technique=reed_sol_van w=8 But as I understand it those profiles are only used to create the initial crush rule for the pool, and we have manually edited those along the way. Here are the 3 rules in use for the 3 EC pools: rule rbd_ec_data { id 0 type erasure step set_chooseleaf_tries 5 step set_choose_tries 100 step take default class hdd step choose indep 0 type datacenter step chooseleaf indep 2 type host step emit } rule cephfs.hdd.data { id 7 type erasure step set_chooseleaf_tries 5 step set_choose_tries 100 step take default class hdd step choose indep 0 type datacenter step chooseleaf indep 3 type host step emit } rule rbd.ssd.data { id 8 type erasure step set_chooseleaf_tries 5 step set_choose_tries 100 step take default class ssd step choose indep 0 type datacenter step chooseleaf indep 3 type host step emit } Which should first pick all 3 datacenters in the choose step and then either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5 respectively. Mvh. Torkil

...

On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov <patrakov(a)gmail.com> wrote:

On 2024-03-23 17:54, Kai Stian Olstad wrote: > On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote: >> >> The other output is too big for pastebin and I'm not familiar with >> paste services, any suggestion for a preferred way to share such >> output? > > You can attached files to the mail here on the list. Doh, for some reason I was sure attachments would be stripped. Thanks, attached. Mvh. Torkil

-- Alexander E. Patrakov

Alexander E. Patrakov

1:49 a.m.

Hi Torkil, I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage. What happens if you increase the osd_max_backfills setting temporarily? It may be a good idea to investigate a few of the stalled PGs. Please run commands similar to this one: ceph pg 37.0 query > query.37.0.txt ceph pg 37.1 query > query.37.1.txt ... and the same for the other affected pools. Still, I must say that some of your rules are actually unsafe. The 4+2 rule as used by rbd_ec_data will not survive a datacenter-offline incident. Namely, for each PG, it chooses OSDs from two hosts in each datacenter, so 6 OSDs total. When a datacenter is offline, you will, therefore, have only 4 OSDs up, which is exactly the number of data chunks. However, the pool requires min_size 5, so all PGs will be inactive (to prevent data corruption) and will stay inactive until the datacenter comes up again. However, please don't set min_size to 4 - then, any additional incident (like a defective disk) will lead to data loss, and the shards in the datacenter which went offline would be useless because they do not correspond to the updated shards written by the clients. The 4+5 rule as used by cephfs.hdd.data has min_size equal to the number of data chunks. See above why it is bad. Please set min_size to 5. The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are 100% active+clean. Regarding the mon_max_pg_per_osd setting, you have a few OSDs that have 300+ PGs, the observed maximum is 347. Please set it to 400. On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

...

On 23-03-2024 19:05, Alexander E. Patrakov wrote:

On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov <patrakov(a)gmail.com> wrote:

Hi Torkil, Unfortunately, your files contain nothing obviously bad or suspicious, except for two things: more PGs than usual and bad balance. What's your "mon max pg per osd" setting? On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote: > > On 2024-03-23 17:54, Kai Stian Olstad wrote: >> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote: >>> >>> The other output is too big for pastebin and I'm not familiar with >>> paste services, any suggestion for a preferred way to share such >>> output? >> >> You can attached files to the mail here on the list. > > Doh, for some reason I was sure attachments would be stripped. Thanks, > attached. > > Mvh. > > Torkil -- Alexander E. Patrakov

-- Alexander E. Patrakov

Torkil Svensgaard

2:26 a.m.

On 23-03-2024 21:19, Alexander E. Patrakov wrote:

...

Hi Torkil,

Hi Alexander

...

I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage.

Thank you for taking the time =)

...

What happens if you increase the osd_max_backfills setting temporarily?

We already had the mclock override option in place and I re-enabled our babysitter script which sets osd_max_backfills pr OSD to 1-3 depending on how full they are. Active backfills went from 16 to 53 which is probably because default osd_max_backfills for mclock is 1. I think 53 is still a low number of active backfills given the large percentage misplaced.

...

It may be a good idea to investigate a few of the stalled PGs. Please run commands similar to this one: ceph pg 37.0 query > query.37.0.txt ceph pg 37.1 query > query.37.1.txt ... and the same for the other affected pools.

A few samples attached.

...

Still, I must say that some of your rules are actually unsafe. The 4+2 rule as used by rbd_ec_data will not survive a datacenter-offline incident. Namely, for each PG, it chooses OSDs from two hosts in each datacenter, so 6 OSDs total. When a datacenter is offline, you will, therefore, have only 4 OSDs up, which is exactly the number of data chunks. However, the pool requires min_size 5, so all PGs will be inactive (to prevent data corruption) and will stay inactive until the datacenter comes up again. However, please don't set min_size to 4 - then, any additional incident (like a defective disk) will lead to data loss, and the shards in the datacenter which went offline would be useless because they do not correspond to the updated shards written by the clients.

Thanks for the explanation. This is an old pool predating the 3 DC setup and we'll migrate the data to a 4+5 pool when we can.

...

The 4+5 rule as used by cephfs.hdd.data has min_size equal to the number of data chunks. See above why it is bad. Please set min_size to 5.

Thanks, that was a leftover for getting the PGs to peer (stuck at creating+incomplete) when we created the pool. It's back to 5 now.

...

The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are 100% active+clean.

There is very little data in this pool, that is probably the main reason.

...

Regarding the mon_max_pg_per_osd setting, you have a few OSDs that have 300+ PGs, the observed maximum is 347. Please set it to 400.

Copy that. Didn't seem to make a difference though, and we have osd_max_pg_per_osd_hard_ratio set to 5.000000. Mvh. Torkil > On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote: >> >> >> >> On 23-03-2024 19:05, Alexander E. Patrakov wrote: >>> Sorry for replying to myself, but "ceph osd pool ls detail" by itself >>> is insufficient. For every erasure code profile mentioned in the >>> output, please also run something like this: >>> >>> ceph osd erasure-code-profile get prf-for-ec-data >>> >>> ...where "prf-for-ec-data" is the name that appears after the words >>> "erasure profile" in the "ceph osd pool ls detail" output. >> >> [root@lazy ~]# ceph osd pool ls detail | grep erasure >> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 >> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 >> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags >> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 >> fast_read 1 compression_algorithm snappy compression_mode aggressive >> application rbd >> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size >> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 >> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags >> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 >> compression_algorithm zstd compression_mode aggressive application cephfs >> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 >> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 >> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags >> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 >> compression_algorithm zstd compression_mode aggressive application rbd >> >> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2 >> crush-device-class=hdd >> crush-failure-domain=host >> crush-root=default >> jerasure-per-chunk-alignment=false >> k=4 >> m=2 >> plugin=jerasure >> technique=reed_sol_van >> w=8 >> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd >> crush-device-class=hdd >> crush-failure-domain=datacenter >> crush-root=default >> jerasure-per-chunk-alignment=false >> k=4 >> m=5 >> plugin=jerasure >> technique=reed_sol_van >> w=8 >> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd >> crush-device-class=ssd >> crush-failure-domain=datacenter >> crush-root=default >> jerasure-per-chunk-alignment=false >> k=4 >> m=5 >> plugin=jerasure >> technique=reed_sol_van >> w=8 >> >> But as I understand it those profiles are only used to create the >> initial crush rule for the pool, and we have manually edited those along >> the way. Here are the 3 rules in use for the 3 EC pools: >> >> rule rbd_ec_data { >> id 0 >> type erasure >> step set_chooseleaf_tries 5 >> step set_choose_tries 100 >> step take default class hdd >> step choose indep 0 type datacenter >> step chooseleaf indep 2 type host >> step emit >> } >> rule cephfs.hdd.data { >> id 7 >> type erasure >> step set_chooseleaf_tries 5 >> step set_choose_tries 100 >> step take default class hdd >> step choose indep 0 type datacenter >> step chooseleaf indep 3 type host >> step emit >> } >> rule rbd.ssd.data { >> id 8 >> type erasure >> step set_chooseleaf_tries 5 >> step set_choose_tries 100 >> step take default class ssd >> step choose indep 0 type datacenter >> step chooseleaf indep 3 type host >> step emit >> } >> >> Which should first pick all 3 datacenters in the choose step and then >> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5 >> respectively. >> >> Mvh. >> >> Torkil >> >>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov >>> <patrakov(a)gmail.com> wrote: >>>> >>>

...

Hi Torkil,

>>>> >>>> I take my previous response back. >>>> >>>> You have an erasure-coded pool with nine shards but only three >>>> datacenters. This, in general, cannot work. You need either nine >>>> datacenters or a very custom CRUSH rule. The second option may not be >>>> available if the current EC setup is already incompatible, as there is >>>> no way to change the EC parameters. >>>> >>>> It would help if you provided the output of "ceph osd pool ls detail". >>>> >>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov >>>> <patrakov(a)gmail.com> wrote: >>>>> >>>>

...

Hi Torkil,

>>>>> >>>>> Unfortunately, your files contain nothing obviously bad or suspicious, >>>>> except for two things: more PGs than usual and bad balance. >>>>> >>>>> What's your "mon max pg per osd" setting? >>>>> >>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote: >>>>>> >>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote: >>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote: >>>>>>>> >>>>>>>> The other output is too big for pastebin and I'm not familiar with >>>>>>>> paste services, any suggestion for a preferred way to share such >>>>>>>> output? >>>>>>> >>>>>>> You can attached files to the mail here on the list. >>>>>> >>>>>> Doh, for some reason I was sure attachments would be stripped. Thanks, >>>>>> attached. >>>>>> >>>>>> Mvh. >>>>>> >>>>>> Torkil >>>>> >>>>> >>>>> >>>>> -- >>>>> Alexander E. Patrakov >>>> >>>> >>>> >>>> -- >>>> Alexander E. Patrakov >>> >>> >>> >> >> -- >> Torkil Svensgaard >> Systems Administrator >> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >> Copenhagen University Hospital Amager and Hvidovre >> Kettegaard Allé 30, 2650 Hvidovre, Denmark >> > > -- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section 714 Copenhagen University Hospital Amager and Hvidovre Kettegaard Allé 30, 2650 Hvidovre, Denmark

Alexander E. Patrakov

2:56 a.m.

...

On 23-03-2024 21:19, Alexander E. Patrakov wrote:

Hi Torkil,

Hi Alexander

I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage.

Thank you for taking the time =)

What happens if you increase the osd_max_backfills setting temporarily?

A few samples attached.

Thanks for the explanation. This is an old pool predating the 3 DC setup and we'll migrate the data to a 4+5 pool when we can.

The 4+5 rule as used by cephfs.hdd.data has min_size equal to the number of data chunks. See above why it is bad. Please set min_size to 5.

Thanks, that was a leftover for getting the PGs to peer (stuck at creating+incomplete) when we created the pool. It's back to 5 now.

The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are 100% active+clean.

There is very little data in this pool, that is probably the main reason.

Regarding the mon_max_pg_per_osd setting, you have a few OSDs that have 300+ PGs, the observed maximum is 347. Please set it to 400.

Copy that. Didn't seem to make a difference though, and we have osd_max_pg_per_osd_hard_ratio set to 5.000000. Mvh. Torkil

On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

On 23-03-2024 19:05, Alexander E. Patrakov wrote:

On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov <patrakov(a)gmail.com> wrote: > > Hi Torkil, > > I take my previous response back. > > You have an erasure-coded pool with nine shards but only three > datacenters. This, in general, cannot work. You need either nine > datacenters or a very custom CRUSH rule. The second option may not be > available if the current EC setup is already incompatible, as there is > no way to change the EC parameters. > > It would help if you provided the output of "ceph osd pool ls detail". > > On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov > <patrakov(a)gmail.com> wrote: >> >> Hi Torkil, >> >> Unfortunately, your files contain nothing obviously bad or suspicious, >> except for two things: more PGs than usual and bad balance. >> >> What's your "mon max pg per osd" setting? >> >> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote: >>> >>> On 2024-03-23 17:54, Kai Stian Olstad wrote: >>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote: >>>>> >>>>> The other output is too big for pastebin and I'm not familiar with >>>>> paste services, any suggestion for a preferred way to share such >>>>> output? >>>> >>>> You can attached files to the mail here on the list. >>> >>> Doh, for some reason I was sure attachments would be stripped. Thanks, >>> attached. >>> >>> Mvh. >>> >>> Torkil >> >> >> >> -- >> Alexander E. Patrakov > > > > -- > Alexander E. Patrakov

-- Alexander E. Patrakov

Torkil Svensgaard

3:20 a.m.

...

On 23-03-2024 21:19, Alexander E. Patrakov wrote:

Hi Torkil,

Hi Alexander

I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage.

Thank you for taking the time =)

What happens if you increase the osd_max_backfills setting temporarily?

A few samples attached.

Thanks for the explanation. This is an old pool predating the 3 DC setup and we'll migrate the data to a 4+5 pool when we can.

The 4+5 rule as used by cephfs.hdd.data has min_size equal to the number of data chunks. See above why it is bad. Please set min_size to 5.

Thanks, that was a leftover for getting the PGs to peer (stuck at creating+incomplete) when we created the pool. It's back to 5 now.

The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are 100% active+clean.

There is very little data in this pool, that is probably the main reason.

Regarding the mon_max_pg_per_osd setting, you have a few OSDs that have 300+ PGs, the observed maximum is 347. Please set it to 400.

Hi Torkil,

Alexander E. Patrakov

5:01 a.m.

Hi Torkil, Thanks for the update. Even though the improvement is small, it is still an improvement, consistent with the osd_max_backfills value, and it proves that there are still unsolved peering issues. I have looked at both the old and the new state of the PG, but could not find anything else interesting. I also looked again at the state of PG 37.1. It is known what blocks the backfill of this PG; please search for "blocked_by." However, this is just one data point, which is insufficient for any conclusions. Try looking at other PGs. Is there anything too common in the non-empty "blocked_by" blocks? I think we have to look for patterns in other ways, too. One tool that produces good visualizations is TheJJ balancer. Although it is called a "balancer," it can also visualize the ongoing backfills. The tool is available at https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptim… Run it as follows: ./placementoptimizer.py showremapped --by-osd | tee remapped.txt On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

...

On 23-03-2024 21:19, Alexander E. Patrakov wrote:

Hi Torkil,

Hi Alexander

I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage.

Thank you for taking the time =)

What happens if you increase the osd_max_backfills setting temporarily?

A few samples attached.

Thanks for the explanation. This is an old pool predating the 3 DC setup and we'll migrate the data to a 4+5 pool when we can.

The 4+5 rule as used by cephfs.hdd.data has min_size equal to the number of data chunks. See above why it is bad. Please set min_size to 5.

Thanks, that was a leftover for getting the PGs to peer (stuck at creating+incomplete) when we created the pool. It's back to 5 now.

The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are 100% active+clean.

There is very little data in this pool, that is probably the main reason.

Regarding the mon_max_pg_per_osd setting, you have a few OSDs that have 300+ PGs, the observed maximum is 347. Please set it to 400.

Hi Torkil,

-- Alexander E. Patrakov

Torkil Svensgaard

5:44 a.m.

On 24-03-2024 00:31, Alexander E. Patrakov wrote:

...

Hi Torkil,

Hi Alexander

...

Thanks for the update. Even though the improvement is small, it is still an improvement, consistent with the osd_max_backfills value, and it proves that there are still unsolved peering issues. I have looked at both the old and the new state of the PG, but could not find anything else interesting. I also looked again at the state of PG 37.1. It is known what blocks the backfill of this PG; please search for "blocked_by." However, this is just one data point, which is insufficient for any conclusions. Try looking at other PGs. Is there anything too common in the non-empty "blocked_by" blocks?

I'll take a look at that tomorrow, perhaps we can script something meaningful.

...

I think we have to look for patterns in other ways, too. One tool that produces good visualizations is TheJJ balancer. Although it is called a "balancer," it can also visualize the ongoing backfills. The tool is available at https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptim… Run it as follows: ./placementoptimizer.py showremapped --by-osd | tee remapped.txt

Output attached. Thanks again. Mvh. Torkil > On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote: >> >> Hi Alex >> >> New query output attached after restarting both OSDs. OSD 237 is no >> longer mentioned but it unfortunately made no difference for the number >> of backfills which went 59->62->62. >> >> Mvh. >> >> Torkil >> >> On 23-03-2024 22:26, Alexander E. Patrakov wrote: >>

...

Hi Torkil,

>>> >>> I have looked at the files that you attached. They were helpful: pool >>> 11 is problematic, it complains about degraded objects for no obvious >>> reason. I think that is the blocker. >>> >>> I also noted that you mentioned peering problems, and I suspect that >>> they are not completely resolved. As a somewhat-irrational move, to >>> confirm this theory, you can restart osd.237 (it is mentioned at the >>> end of query.11.fff.txt, although I don't understand why it is there) >>> and then osd.298 (it is the primary for that pg) and see if any >>> additional backfills are unblocked after that. Also, please re-query >>> that PG again after the OSD restart. >>> >>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote: >>>> >>>> >>>> >>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote: >>>>

...

Hi Torkil,

>>>> >>>> Hi Alexander >>>> >>>>> I have looked at the CRUSH rules, and the equivalent rules work on my >>>>> test cluster. So this cannot be the cause of the blockage. >>>> >>>> Thank you for taking the time =) >>>> >>>>> What happens if you increase the osd_max_backfills setting temporarily? >>>> >>>> We already had the mclock override option in place and I re-enabled our >>>> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending >>>> on how full they are. Active backfills went from 16 to 53 which is >>>> probably because default osd_max_backfills for mclock is 1. >>>> >>>> I think 53 is still a low number of active backfills given the large >>>> percentage misplaced. >>>> >>>>> It may be a good idea to investigate a few of the stalled PGs. Please >>>>> run commands similar to this one: >>>>> >>>>> ceph pg 37.0 query > query.37.0.txt >>>>> ceph pg 37.1 query > query.37.1.txt >>>>> ... >>>>> and the same for the other affected pools. >>>> >>>> A few samples attached. >>>> >>>>> Still, I must say that some of your rules are actually unsafe. >>>>> >>>>> The 4+2 rule as used by rbd_ec_data will not survive a >>>>> datacenter-offline incident. Namely, for each PG, it chooses OSDs from >>>>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is >>>>> offline, you will, therefore, have only 4 OSDs up, which is exactly >>>>> the number of data chunks. However, the pool requires min_size 5, so >>>>> all PGs will be inactive (to prevent data corruption) and will stay >>>>> inactive until the datacenter comes up again. However, please don't >>>>> set min_size to 4 - then, any additional incident (like a defective >>>>> disk) will lead to data loss, and the shards in the datacenter which >>>>> went offline would be useless because they do not correspond to the >>>>> updated shards written by the clients. >>>> >>>> Thanks for the explanation. This is an old pool predating the 3 DC setup >>>> and we'll migrate the data to a 4+5 pool when we can. >>>> >>>>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the >>>>> number of data chunks. See above why it is bad. Please set min_size to >>>>> 5. >>>> >>>> Thanks, that was a leftover for getting the PGs to peer (stuck at >>>> creating+incomplete) when we created the pool. It's back to 5 now. >>>> >>>>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are >>>>> 100% active+clean. >>>> >>>> There is very little data in this pool, that is probably the main reason. >>>> >>>>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that >>>>> have 300+ PGs, the observed maximum is 347. Please set it to 400. >>>> >>>> Copy that. Didn't seem to make a difference though, and we have >>>> osd_max_pg_per_osd_hard_ratio set to 5.000000. >>>> >>>> Mvh. >>>> >>>> Torkil >>>> >>>>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote: >>>>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by itself >>>>>>> is insufficient. For every erasure code profile mentioned in the >>>>>>> output, please also run something like this: >>>>>>> >>>>>>> ceph osd erasure-code-profile get prf-for-ec-data >>>>>>> >>>>>>> ...where "prf-for-ec-data" is the name that appears after the words >>>>>>> "erasure profile" in the "ceph osd pool ls detail" output. >>>>>> >>>>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure >>>>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 >>>>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 >>>>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags >>>>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 >>>>>> fast_read 1 compression_algorithm snappy compression_mode aggressive >>>>>> application rbd >>>>>> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size >>>>>> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 >>>>>> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags >>>>>> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 >>>>>> compression_algorithm zstd compression_mode aggressive application cephfs >>>>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 >>>>>> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 >>>>>> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags >>>>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 >>>>>> compression_algorithm zstd compression_mode aggressive application rbd >>>>>> >>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2 >>>>>> crush-device-class=hdd >>>>>> crush-failure-domain=host >>>>>> crush-root=default >>>>>> jerasure-per-chunk-alignment=false >>>>>> k=4 >>>>>> m=2 >>>>>> plugin=jerasure >>>>>> technique=reed_sol_van >>>>>> w=8 >>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd >>>>>> crush-device-class=hdd >>>>>> crush-failure-domain=datacenter >>>>>> crush-root=default >>>>>> jerasure-per-chunk-alignment=false >>>>>> k=4 >>>>>> m=5 >>>>>> plugin=jerasure >>>>>> technique=reed_sol_van >>>>>> w=8 >>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd >>>>>> crush-device-class=ssd >>>>>> crush-failure-domain=datacenter >>>>>> crush-root=default >>>>>> jerasure-per-chunk-alignment=false >>>>>> k=4 >>>>>> m=5 >>>>>> plugin=jerasure >>>>>> technique=reed_sol_van >>>>>> w=8 >>>>>> >>>>>> But as I understand it those profiles are only used to create the >>>>>> initial crush rule for the pool, and we have manually edited those along >>>>>> the way. Here are the 3 rules in use for the 3 EC pools: >>>>>> >>>>>> rule rbd_ec_data { >>>>>> id 0 >>>>>> type erasure >>>>>> step set_chooseleaf_tries 5 >>>>>> step set_choose_tries 100 >>>>>> step take default class hdd >>>>>> step choose indep 0 type datacenter >>>>>> step chooseleaf indep 2 type host >>>>>> step emit >>>>>> } >>>>>> rule cephfs.hdd.data { >>>>>> id 7 >>>>>> type erasure >>>>>> step set_chooseleaf_tries 5 >>>>>> step set_choose_tries 100 >>>>>> step take default class hdd >>>>>> step choose indep 0 type datacenter >>>>>> step chooseleaf indep 3 type host >>>>>> step emit >>>>>> } >>>>>> rule rbd.ssd.data { >>>>>> id 8 >>>>>> type erasure >>>>>> step set_chooseleaf_tries 5 >>>>>> step set_choose_tries 100 >>>>>> step take default class ssd >>>>>> step choose indep 0 type datacenter >>>>>> step chooseleaf indep 3 type host >>>>>> step emit >>>>>> } >>>>>> >>>>>> Which should first pick all 3 datacenters in the choose step and then >>>>>> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5 >>>>>> respectively. >>>>>> >>>>>> Mvh. >>>>>> >>>>>> Torkil >>>>>> >>>>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov >>>>>>> <patrakov(a)gmail.com> wrote: >>>>>>>> >>>>>>>

...

Hi Torkil,

>>>>>>>> >>>>>>>> I take my previous response back. >>>>>>>> >>>>>>>> You have an erasure-coded pool with nine shards but only three >>>>>>>> datacenters. This, in general, cannot work. You need either nine >>>>>>>> datacenters or a very custom CRUSH rule. The second option may not be >>>>>>>> available if the current EC setup is already incompatible, as there is >>>>>>>> no way to change the EC parameters. >>>>>>>> >>>>>>>> It would help if you provided the output of "ceph osd pool ls detail". >>>>>>>> >>>>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov >>>>>>>> <patrakov(a)gmail.com> wrote: >>>>>>>>> >>>>>>>>

...

Hi Torkil,

>>>>>>>>> >>>>>>>>> Unfortunately, your files contain nothing obviously bad or suspicious, >>>>>>>>> except for two things: more PGs than usual and bad balance. >>>>>>>>> >>>>>>>>> What's your "mon max pg per osd" setting? >>>>>>>>> >>>>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote: >>>>>>>>>> >>>>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote: >>>>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote: >>>>>>>>>>>> >>>>>>>>>>>> The other output is too big for pastebin and I'm not familiar with >>>>>>>>>>>> paste services, any suggestion for a preferred way to share such >>>>>>>>>>>> output? >>>>>>>>>>> >>>>>>>>>>> You can attached files to the mail here on the list. >>>>>>>>>> >>>>>>>>>> Doh, for some reason I was sure attachments would be stripped. Thanks, >>>>>>>>>> attached. >>>>>>>>>> >>>>>>>>>> Mvh. >>>>>>>>>> >>>>>>>>>> Torkil >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Alexander E. Patrakov >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Alexander E. Patrakov >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Torkil Svensgaard >>>>>> Systems Administrator >>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >>>>>> Copenhagen University Hospital Amager and Hvidovre >>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark >>>>>> >>>>> >>>>> >>>> >>>> -- >>>> Torkil Svensgaard >>>> Systems Administrator >>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >>>> Copenhagen University Hospital Amager and Hvidovre >>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark >>> >>> >>> >> >> -- >> Torkil Svensgaard >> Systems Administrator >> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >> Copenhagen University Hospital Amager and Hvidovre >> Kettegaard Allé 30, 2650 Hvidovre, Denmark > > > -- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section 714 Copenhagen University Hospital Amager and Hvidovre Kettegaard Allé 30, 2650 Hvidovre, Denmark

Torkil Svensgaard

25 Mar 25 Mar

5:07 p.m.

On 24/03/2024 01:14, Torkil Svensgaard wrote:

...

On 24-03-2024 00:31, Alexander E. Patrakov wrote:

Hi Torkil,

Hi Alexander

I'll take a look at that tomorrow, perhaps we can script something meaningful.

Hi Alexander While working on a script querying all PGs and making a list of all OSDs found in a blocked_by list, and how many times for each, I discovered something odd about pool 38: " [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38 OSDs blocking other OSDs: OSD 425: 5 instance(s) OSD 426: 6 instance(s) OSD 34: 7 instance(s) OSD 36: 5 instance(s) OSD 146: 3 instance(s) OSD 6: 2 instance(s) OSD 5: 8 instance(s) OSD 131: 7 instance(s) OSD 4: 9 instance(s) OSD 3: 5 instance(s) OSD 2: 5 instance(s) OSD 1: 2 instance(s) OSD 0: 4 instance(s) OSD 167: 1 instance(s) OSD 168: 3 instance(s) OSD 450: 2 instance(s) OSD 46: 6 instance(s) OSD 154: 3 instance(s) OSD 156: 2 instance(s) OSD 90: 2 instance(s) OSD 227: 4 instance(s) OSD 10: 4 instance(s) OSD 15: 6 instance(s) OSD 449: 4 instance(s) OSD 192: 2 instance(s) OSD 67: 3 instance(s) " All PGs in the pool are active+clean so why are there any blocked_by at all? One example attached. Mvh. Torkil

...

Output attached. Thanks again. Mvh. Torkil

On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

Hi Torkil, I have looked at the files that you attached. They were helpful: pool 11 is problematic, it complains about degraded objects for no obvious reason. I think that is the blocker. I also noted that you mentioned peering problems, and I suspect that they are not completely resolved. As a somewhat-irrational move, to confirm this theory, you can restart osd.237 (it is mentioned at the end of query.11.fff.txt, although I don't understand why it is there) and then osd.298 (it is the primary for that pg) and see if any additional backfills are unblocked after that. Also, please re-query that PG again after the OSD restart. On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote: > > > > On 23-03-2024 21:19, Alexander E. Patrakov wrote: >> Hi Torkil, > > Hi Alexander > >> I have looked at the CRUSH rules, and the equivalent rules work on my >> test cluster. So this cannot be the cause of the blockage. > > Thank you for taking the time =) > >> What happens if you increase the osd_max_backfills setting >> temporarily? > > We already had the mclock override option in place and I re-enabled > our > babysitter script which sets osd_max_backfills pr OSD to 1-3 depending > on how full they are. Active backfills went from 16 to 53 which is > probably because default osd_max_backfills for mclock is 1. > > I think 53 is still a low number of active backfills given the large > percentage misplaced. > >> It may be a good idea to investigate a few of the stalled PGs. Please >> run commands similar to this one: >> >> ceph pg 37.0 query > query.37.0.txt >> ceph pg 37.1 query > query.37.1.txt >> ... >> and the same for the other affected pools. > > A few samples attached. > >> Still, I must say that some of your rules are actually unsafe. >> >> The 4+2 rule as used by rbd_ec_data will not survive a >> datacenter-offline incident. Namely, for each PG, it chooses OSDs >> from >> two hosts in each datacenter, so 6 OSDs total. When a datacenter is >> offline, you will, therefore, have only 4 OSDs up, which is exactly >> the number of data chunks. However, the pool requires min_size 5, so >> all PGs will be inactive (to prevent data corruption) and will stay >> inactive until the datacenter comes up again. However, please don't >> set min_size to 4 - then, any additional incident (like a defective >> disk) will lead to data loss, and the shards in the datacenter which >> went offline would be useless because they do not correspond to the >> updated shards written by the clients. > > Thanks for the explanation. This is an old pool predating the 3 DC > setup > and we'll migrate the data to a 4+5 pool when we can. > >> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the >> number of data chunks. See above why it is bad. Please set >> min_size to >> 5. > > Thanks, that was a leftover for getting the PGs to peer (stuck at > creating+incomplete) when we created the pool. It's back to 5 now. > >> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are >> 100% active+clean. > > There is very little data in this pool, that is probably the main > reason. > >> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that >> have 300+ PGs, the observed maximum is 347. Please set it to 400. > > Copy that. Didn't seem to make a difference though, and we have > osd_max_pg_per_osd_hard_ratio set to 5.000000. > > Mvh. > > Torkil > >> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard >> <torkil(a)drcmr.dk> wrote: >>> >>> >>> >>> On 23-03-2024 19:05, Alexander E. Patrakov wrote: >>>> Sorry for replying to myself, but "ceph osd pool ls detail" by >>>> itself >>>> is insufficient. For every erasure code profile mentioned in the >>>> output, please also run something like this: >>>> >>>> ceph osd erasure-code-profile get prf-for-ec-data >>>> >>>> ...where "prf-for-ec-data" is the name that appears after the words >>>> "erasure profile" in the "ceph osd pool ls detail" output. >>> >>> [root@lazy ~]# ceph osd pool ls detail | grep erasure >>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 >>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 >>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags >>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 >>> fast_read 1 compression_algorithm snappy compression_mode aggressive >>> application rbd >>> pool 37 'cephfs.hdd.data' erasure profile >>> DRCMR_k4m5_datacenter_hdd size >>> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 >>> pgp_num 2048 >>> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags >>> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 >>> compression_algorithm zstd compression_mode aggressive >>> application cephfs >>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd >>> size 9 >>> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 >>> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags >>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 >>> compression_algorithm zstd compression_mode aggressive >>> application rbd >>> >>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2 >>> crush-device-class=hdd >>> crush-failure-domain=host >>> crush-root=default >>> jerasure-per-chunk-alignment=false >>> k=4 >>> m=2 >>> plugin=jerasure >>> technique=reed_sol_van >>> w=8 >>> [root@lazy ~]# ceph osd erasure-code-profile get >>> DRCMR_k4m5_datacenter_hdd >>> crush-device-class=hdd >>> crush-failure-domain=datacenter >>> crush-root=default >>> jerasure-per-chunk-alignment=false >>> k=4 >>> m=5 >>> plugin=jerasure >>> technique=reed_sol_van >>> w=8 >>> [root@lazy ~]# ceph osd erasure-code-profile get >>> DRCMR_k4m5_datacenter_ssd >>> crush-device-class=ssd >>> crush-failure-domain=datacenter >>> crush-root=default >>> jerasure-per-chunk-alignment=false >>> k=4 >>> m=5 >>> plugin=jerasure >>> technique=reed_sol_van >>> w=8 >>> >>> But as I understand it those profiles are only used to create the >>> initial crush rule for the pool, and we have manually edited >>> those along >>> the way. Here are the 3 rules in use for the 3 EC pools: >>> >>> rule rbd_ec_data { >>> id 0 >>> type erasure >>> step set_chooseleaf_tries 5 >>> step set_choose_tries 100 >>> step take default class hdd >>> step choose indep 0 type datacenter >>> step chooseleaf indep 2 type host >>> step emit >>> } >>> rule cephfs.hdd.data { >>> id 7 >>> type erasure >>> step set_chooseleaf_tries 5 >>> step set_choose_tries 100 >>> step take default class hdd >>> step choose indep 0 type datacenter >>> step chooseleaf indep 3 type host >>> step emit >>> } >>> rule rbd.ssd.data { >>> id 8 >>> type erasure >>> step set_chooseleaf_tries 5 >>> step set_choose_tries 100 >>> step take default class ssd >>> step choose indep 0 type datacenter >>> step chooseleaf indep 3 type host >>> step emit >>> } >>> >>> Which should first pick all 3 datacenters in the choose step and >>> then >>> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5 >>> respectively. >>> >>> Mvh. >>> >>> Torkil >>> >>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov >>>> <patrakov(a)gmail.com> wrote: >>>>> >>>>> Hi Torkil, >>>>> >>>>> I take my previous response back. >>>>> >>>>> You have an erasure-coded pool with nine shards but only three >>>>> datacenters. This, in general, cannot work. You need either nine >>>>> datacenters or a very custom CRUSH rule. The second option may >>>>> not be >>>>> available if the current EC setup is already incompatible, as >>>>> there is >>>>> no way to change the EC parameters. >>>>> >>>>> It would help if you provided the output of "ceph osd pool ls >>>>> detail". >>>>> >>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov >>>>> <patrakov(a)gmail.com> wrote: >>>>>> >>>>>> Hi Torkil, >>>>>> >>>>>> Unfortunately, your files contain nothing obviously bad or >>>>>> suspicious, >>>>>> except for two things: more PGs than usual and bad balance. >>>>>> >>>>>> What's your "mon max pg per osd" setting? >>>>>> >>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard >>>>>> <torkil(a)drcmr.dk> wrote: >>>>>>> >>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote: >>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> The other output is too big for pastebin and I'm not >>>>>>>>> familiar with >>>>>>>>> paste services, any suggestion for a preferred way to share >>>>>>>>> such >>>>>>>>> output? >>>>>>>> >>>>>>>> You can attached files to the mail here on the list. >>>>>>> >>>>>>> Doh, for some reason I was sure attachments would be >>>>>>> stripped. Thanks, >>>>>>> attached. >>>>>>> >>>>>>> Mvh. >>>>>>> >>>>>>> Torkil >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Alexander E. Patrakov >>>>> >>>>> >>>>> >>>>> -- >>>>> Alexander E. Patrakov >>>> >>>> >>>> >>> >>> -- >>> Torkil Svensgaard >>> Systems Administrator >>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >>> Copenhagen University Hospital Amager and Hvidovre >>> Kettegaard Allé 30, 2650 Hvidovre, Denmark >>> >> >> > > -- > Torkil Svensgaard > Systems Administrator > Danish Research Centre for Magnetic Resonance DRCMR, Section 714 > Copenhagen University Hospital Amager and Hvidovre > Kettegaard Allé 30, 2650 Hvidovre, Denmark

-- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital Kettegård Allé 30 DK-2650 Hvidovre Denmark Tel: +45 386 22828 E-mail: torkil(a)drcmr.dk

Alexander E. Patrakov

26 Mar 26 Mar

1:07 a.m.

On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

...

On 24/03/2024 01:14, Torkil Svensgaard wrote:

On 24-03-2024 00:31, Alexander E. Patrakov wrote:

Hi Torkil,

Hi Alexander

I'll take a look at that tomorrow, perhaps we can script something meaningful.

<snip>

...

All PGs in the pool are active+clean so why are there any blocked_by at all? One example attached.

I don't know. In any case, it doesn't match the "one OSD blocks them all" scenario that I was looking for. I think this is something bogus that can probably be cleared in your example by restarting osd.89 (i.e, the one being blocked).

...

Mvh. Torkil

Output attached. Thanks again. Mvh. Torkil

On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

Hi Alex New query output attached after restarting both OSDs. OSD 237 is no longer mentioned but it unfortunately made no difference for the number of backfills which went 59->62->62. Mvh. Torkil On 23-03-2024 22:26, Alexander E. Patrakov wrote: > Hi Torkil, > > I have looked at the files that you attached. They were helpful: pool > 11 is problematic, it complains about degraded objects for no obvious > reason. I think that is the blocker. > > I also noted that you mentioned peering problems, and I suspect that > they are not completely resolved. As a somewhat-irrational move, to > confirm this theory, you can restart osd.237 (it is mentioned at the > end of query.11.fff.txt, although I don't understand why it is there) > and then osd.298 (it is the primary for that pg) and see if any > additional backfills are unblocked after that. Also, please re-query > that PG again after the OSD restart. > > On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil(a)drcmr.dk> > wrote: >> >> >> >> On 23-03-2024 21:19, Alexander E. Patrakov wrote: >>> Hi Torkil, >> >> Hi Alexander >> >>> I have looked at the CRUSH rules, and the equivalent rules work on my >>> test cluster. So this cannot be the cause of the blockage. >> >> Thank you for taking the time =) >> >>> What happens if you increase the osd_max_backfills setting >>> temporarily? >> >> We already had the mclock override option in place and I re-enabled >> our >> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending >> on how full they are. Active backfills went from 16 to 53 which is >> probably because default osd_max_backfills for mclock is 1. >> >> I think 53 is still a low number of active backfills given the large >> percentage misplaced. >> >>> It may be a good idea to investigate a few of the stalled PGs. Please >>> run commands similar to this one: >>> >>> ceph pg 37.0 query > query.37.0.txt >>> ceph pg 37.1 query > query.37.1.txt >>> ... >>> and the same for the other affected pools. >> >> A few samples attached. >> >>> Still, I must say that some of your rules are actually unsafe. >>> >>> The 4+2 rule as used by rbd_ec_data will not survive a >>> datacenter-offline incident. Namely, for each PG, it chooses OSDs >>> from >>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is >>> offline, you will, therefore, have only 4 OSDs up, which is exactly >>> the number of data chunks. However, the pool requires min_size 5, so >>> all PGs will be inactive (to prevent data corruption) and will stay >>> inactive until the datacenter comes up again. However, please don't >>> set min_size to 4 - then, any additional incident (like a defective >>> disk) will lead to data loss, and the shards in the datacenter which >>> went offline would be useless because they do not correspond to the >>> updated shards written by the clients. >> >> Thanks for the explanation. This is an old pool predating the 3 DC >> setup >> and we'll migrate the data to a 4+5 pool when we can. >> >>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the >>> number of data chunks. See above why it is bad. Please set >>> min_size to >>> 5. >> >> Thanks, that was a leftover for getting the PGs to peer (stuck at >> creating+incomplete) when we created the pool. It's back to 5 now. >> >>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are >>> 100% active+clean. >> >> There is very little data in this pool, that is probably the main >> reason. >> >>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that >>> have 300+ PGs, the observed maximum is 347. Please set it to 400. >> >> Copy that. Didn't seem to make a difference though, and we have >> osd_max_pg_per_osd_hard_ratio set to 5.000000. >> >> Mvh. >> >> Torkil >> >>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard >>> <torkil(a)drcmr.dk> wrote: >>>> >>>> >>>> >>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote: >>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by >>>>> itself >>>>> is insufficient. For every erasure code profile mentioned in the >>>>> output, please also run something like this: >>>>> >>>>> ceph osd erasure-code-profile get prf-for-ec-data >>>>> >>>>> ...where "prf-for-ec-data" is the name that appears after the words >>>>> "erasure profile" in the "ceph osd pool ls detail" output. >>>> >>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure >>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 >>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 >>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags >>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 >>>> fast_read 1 compression_algorithm snappy compression_mode aggressive >>>> application rbd >>>> pool 37 'cephfs.hdd.data' erasure profile >>>> DRCMR_k4m5_datacenter_hdd size >>>> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 >>>> pgp_num 2048 >>>> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags >>>> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 >>>> compression_algorithm zstd compression_mode aggressive >>>> application cephfs >>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd >>>> size 9 >>>> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 >>>> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags >>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 >>>> compression_algorithm zstd compression_mode aggressive >>>> application rbd >>>> >>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2 >>>> crush-device-class=hdd >>>> crush-failure-domain=host >>>> crush-root=default >>>> jerasure-per-chunk-alignment=false >>>> k=4 >>>> m=2 >>>> plugin=jerasure >>>> technique=reed_sol_van >>>> w=8 >>>> [root@lazy ~]# ceph osd erasure-code-profile get >>>> DRCMR_k4m5_datacenter_hdd >>>> crush-device-class=hdd >>>> crush-failure-domain=datacenter >>>> crush-root=default >>>> jerasure-per-chunk-alignment=false >>>> k=4 >>>> m=5 >>>> plugin=jerasure >>>> technique=reed_sol_van >>>> w=8 >>>> [root@lazy ~]# ceph osd erasure-code-profile get >>>> DRCMR_k4m5_datacenter_ssd >>>> crush-device-class=ssd >>>> crush-failure-domain=datacenter >>>> crush-root=default >>>> jerasure-per-chunk-alignment=false >>>> k=4 >>>> m=5 >>>> plugin=jerasure >>>> technique=reed_sol_van >>>> w=8 >>>> >>>> But as I understand it those profiles are only used to create the >>>> initial crush rule for the pool, and we have manually edited >>>> those along >>>> the way. Here are the 3 rules in use for the 3 EC pools: >>>> >>>> rule rbd_ec_data { >>>> id 0 >>>> type erasure >>>> step set_chooseleaf_tries 5 >>>> step set_choose_tries 100 >>>> step take default class hdd >>>> step choose indep 0 type datacenter >>>> step chooseleaf indep 2 type host >>>> step emit >>>> } >>>> rule cephfs.hdd.data { >>>> id 7 >>>> type erasure >>>> step set_chooseleaf_tries 5 >>>> step set_choose_tries 100 >>>> step take default class hdd >>>> step choose indep 0 type datacenter >>>> step chooseleaf indep 3 type host >>>> step emit >>>> } >>>> rule rbd.ssd.data { >>>> id 8 >>>> type erasure >>>> step set_chooseleaf_tries 5 >>>> step set_choose_tries 100 >>>> step take default class ssd >>>> step choose indep 0 type datacenter >>>> step chooseleaf indep 3 type host >>>> step emit >>>> } >>>> >>>> Which should first pick all 3 datacenters in the choose step and >>>> then >>>> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5 >>>> respectively. >>>> >>>> Mvh. >>>> >>>> Torkil >>>> >>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov >>>>> <patrakov(a)gmail.com> wrote: >>>>>> >>>>>> Hi Torkil, >>>>>> >>>>>> I take my previous response back. >>>>>> >>>>>> You have an erasure-coded pool with nine shards but only three >>>>>> datacenters. This, in general, cannot work. You need either nine >>>>>> datacenters or a very custom CRUSH rule. The second option may >>>>>> not be >>>>>> available if the current EC setup is already incompatible, as >>>>>> there is >>>>>> no way to change the EC parameters. >>>>>> >>>>>> It would help if you provided the output of "ceph osd pool ls >>>>>> detail". >>>>>> >>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov >>>>>> <patrakov(a)gmail.com> wrote: >>>>>>> >>>>>>> Hi Torkil, >>>>>>> >>>>>>> Unfortunately, your files contain nothing obviously bad or >>>>>>> suspicious, >>>>>>> except for two things: more PGs than usual and bad balance. >>>>>>> >>>>>>> What's your "mon max pg per osd" setting? >>>>>>> >>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard >>>>>>> <torkil(a)drcmr.dk> wrote: >>>>>>>> >>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote: >>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> The other output is too big for pastebin and I'm not >>>>>>>>>> familiar with >>>>>>>>>> paste services, any suggestion for a preferred way to share >>>>>>>>>> such >>>>>>>>>> output? >>>>>>>>> >>>>>>>>> You can attached files to the mail here on the list. >>>>>>>> >>>>>>>> Doh, for some reason I was sure attachments would be >>>>>>>> stripped. Thanks, >>>>>>>> attached. >>>>>>>> >>>>>>>> Mvh. >>>>>>>> >>>>>>>> Torkil >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Alexander E. Patrakov >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Alexander E. Patrakov >>>>> >>>>> >>>>> >>>> >>>> -- >>>> Torkil Svensgaard >>>> Systems Administrator >>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >>>> Copenhagen University Hospital Amager and Hvidovre >>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark >>>> >>> >>> >> >> -- >> Torkil Svensgaard >> Systems Administrator >> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >> Copenhagen University Hospital Amager and Hvidovre >> Kettegaard Allé 30, 2650 Hvidovre, Denmark > > > -- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section 714 Copenhagen University Hospital Amager and Hvidovre Kettegaard Allé 30, 2650 Hvidovre, Denmark

-- Alexander E. Patrakov

Anthony D'Atri

1:14 a.m.

First try "ceph osd down 89"

...

On Mar 25, 2024, at 15:37, Alexander E. Patrakov <patrakov(a)gmail.com> wrote: On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

On 24/03/2024 01:14, Torkil Svensgaard wrote:

On 24-03-2024 00:31, Alexander E. Patrakov wrote:

Hi Torkil,

Hi Alexander

I'll take a look at that tomorrow, perhaps we can script something meaningful.

<snip>

All PGs in the pool are active+clean so why are there any blocked_by at all? One example attached.

Mvh. Torkil

Output attached. Thanks again. Mvh. Torkil

On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote: > > Hi Alex > > New query output attached after restarting both OSDs. OSD 237 is no > longer mentioned but it unfortunately made no difference for the number > of backfills which went 59->62->62. > > Mvh. > > Torkil > > On 23-03-2024 22:26, Alexander E. Patrakov wrote: >> Hi Torkil, >> >> I have looked at the files that you attached. They were helpful: pool >> 11 is problematic, it complains about degraded objects for no obvious >> reason. I think that is the blocker. >> >> I also noted that you mentioned peering problems, and I suspect that >> they are not completely resolved. As a somewhat-irrational move, to >> confirm this theory, you can restart osd.237 (it is mentioned at the >> end of query.11.fff.txt, although I don't understand why it is there) >> and then osd.298 (it is the primary for that pg) and see if any >> additional backfills are unblocked after that. Also, please re-query >> that PG again after the OSD restart. >> >> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil(a)drcmr.dk> >> wrote: >>> >>> >>> >>> On 23-03-2024 21:19, Alexander E. Patrakov wrote: >>>> Hi Torkil, >>> >>> Hi Alexander >>> >>>> I have looked at the CRUSH rules, and the equivalent rules work on my >>>> test cluster. So this cannot be the cause of the blockage. >>> >>> Thank you for taking the time =) >>> >>>> What happens if you increase the osd_max_backfills setting >>>> temporarily? >>> >>> We already had the mclock override option in place and I re-enabled >>> our >>> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending >>> on how full they are. Active backfills went from 16 to 53 which is >>> probably because default osd_max_backfills for mclock is 1. >>> >>> I think 53 is still a low number of active backfills given the large >>> percentage misplaced. >>> >>>> It may be a good idea to investigate a few of the stalled PGs. Please >>>> run commands similar to this one: >>>> >>>> ceph pg 37.0 query > query.37.0.txt >>>> ceph pg 37.1 query > query.37.1.txt >>>> ... >>>> and the same for the other affected pools. >>> >>> A few samples attached. >>> >>>> Still, I must say that some of your rules are actually unsafe. >>>> >>>> The 4+2 rule as used by rbd_ec_data will not survive a >>>> datacenter-offline incident. Namely, for each PG, it chooses OSDs >>>> from >>>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is >>>> offline, you will, therefore, have only 4 OSDs up, which is exactly >>>> the number of data chunks. However, the pool requires min_size 5, so >>>> all PGs will be inactive (to prevent data corruption) and will stay >>>> inactive until the datacenter comes up again. However, please don't >>>> set min_size to 4 - then, any additional incident (like a defective >>>> disk) will lead to data loss, and the shards in the datacenter which >>>> went offline would be useless because they do not correspond to the >>>> updated shards written by the clients. >>> >>> Thanks for the explanation. This is an old pool predating the 3 DC >>> setup >>> and we'll migrate the data to a 4+5 pool when we can. >>> >>>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the >>>> number of data chunks. See above why it is bad. Please set >>>> min_size to >>>> 5. >>> >>> Thanks, that was a leftover for getting the PGs to peer (stuck at >>> creating+incomplete) when we created the pool. It's back to 5 now. >>> >>>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are >>>> 100% active+clean. >>> >>> There is very little data in this pool, that is probably the main >>> reason. >>> >>>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that >>>> have 300+ PGs, the observed maximum is 347. Please set it to 400. >>> >>> Copy that. Didn't seem to make a difference though, and we have >>> osd_max_pg_per_osd_hard_ratio set to 5.000000. >>> >>> Mvh. >>> >>> Torkil >>> >>>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard >>>> <torkil(a)drcmr.dk> wrote: >>>>> >>>>> >>>>> >>>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote: >>>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by >>>>>> itself >>>>>> is insufficient. For every erasure code profile mentioned in the >>>>>> output, please also run something like this: >>>>>> >>>>>> ceph osd erasure-code-profile get prf-for-ec-data >>>>>> >>>>>> ...where "prf-for-ec-data" is the name that appears after the words >>>>>> "erasure profile" in the "ceph osd pool ls detail" output. >>>>> >>>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure >>>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 >>>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 >>>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags >>>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 >>>>> fast_read 1 compression_algorithm snappy compression_mode aggressive >>>>> application rbd >>>>> pool 37 'cephfs.hdd.data' erasure profile >>>>> DRCMR_k4m5_datacenter_hdd size >>>>> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 >>>>> pgp_num 2048 >>>>> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags >>>>> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 >>>>> compression_algorithm zstd compression_mode aggressive >>>>> application cephfs >>>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd >>>>> size 9 >>>>> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 >>>>> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags >>>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 >>>>> compression_algorithm zstd compression_mode aggressive >>>>> application rbd >>>>> >>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2 >>>>> crush-device-class=hdd >>>>> crush-failure-domain=host >>>>> crush-root=default >>>>> jerasure-per-chunk-alignment=false >>>>> k=4 >>>>> m=2 >>>>> plugin=jerasure >>>>> technique=reed_sol_van >>>>> w=8 >>>>> [root@lazy ~]# ceph osd erasure-code-profile get >>>>> DRCMR_k4m5_datacenter_hdd >>>>> crush-device-class=hdd >>>>> crush-failure-domain=datacenter >>>>> crush-root=default >>>>> jerasure-per-chunk-alignment=false >>>>> k=4 >>>>> m=5 >>>>> plugin=jerasure >>>>> technique=reed_sol_van >>>>> w=8 >>>>> [root@lazy ~]# ceph osd erasure-code-profile get >>>>> DRCMR_k4m5_datacenter_ssd >>>>> crush-device-class=ssd >>>>> crush-failure-domain=datacenter >>>>> crush-root=default >>>>> jerasure-per-chunk-alignment=false >>>>> k=4 >>>>> m=5 >>>>> plugin=jerasure >>>>> technique=reed_sol_van >>>>> w=8 >>>>> >>>>> But as I understand it those profiles are only used to create the >>>>> initial crush rule for the pool, and we have manually edited >>>>> those along >>>>> the way. Here are the 3 rules in use for the 3 EC pools: >>>>> >>>>> rule rbd_ec_data { >>>>> id 0 >>>>> type erasure >>>>> step set_chooseleaf_tries 5 >>>>> step set_choose_tries 100 >>>>> step take default class hdd >>>>> step choose indep 0 type datacenter >>>>> step chooseleaf indep 2 type host >>>>> step emit >>>>> } >>>>> rule cephfs.hdd.data { >>>>> id 7 >>>>> type erasure >>>>> step set_chooseleaf_tries 5 >>>>> step set_choose_tries 100 >>>>> step take default class hdd >>>>> step choose indep 0 type datacenter >>>>> step chooseleaf indep 3 type host >>>>> step emit >>>>> } >>>>> rule rbd.ssd.data { >>>>> id 8 >>>>> type erasure >>>>> step set_chooseleaf_tries 5 >>>>> step set_choose_tries 100 >>>>> step take default class ssd >>>>> step choose indep 0 type datacenter >>>>> step chooseleaf indep 3 type host >>>>> step emit >>>>> } >>>>> >>>>> Which should first pick all 3 datacenters in the choose step and >>>>> then >>>>> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5 >>>>> respectively. >>>>> >>>>> Mvh. >>>>> >>>>> Torkil >>>>> >>>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov >>>>>> <patrakov(a)gmail.com> wrote: >>>>>>> >>>>>>> Hi Torkil, >>>>>>> >>>>>>> I take my previous response back. >>>>>>> >>>>>>> You have an erasure-coded pool with nine shards but only three >>>>>>> datacenters. This, in general, cannot work. You need either nine >>>>>>> datacenters or a very custom CRUSH rule. The second option may >>>>>>> not be >>>>>>> available if the current EC setup is already incompatible, as >>>>>>> there is >>>>>>> no way to change the EC parameters. >>>>>>> >>>>>>> It would help if you provided the output of "ceph osd pool ls >>>>>>> detail". >>>>>>> >>>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov >>>>>>> <patrakov(a)gmail.com> wrote: >>>>>>>> >>>>>>>> Hi Torkil, >>>>>>>> >>>>>>>> Unfortunately, your files contain nothing obviously bad or >>>>>>>> suspicious, >>>>>>>> except for two things: more PGs than usual and bad balance. >>>>>>>> >>>>>>>> What's your "mon max pg per osd" setting? >>>>>>>> >>>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard >>>>>>>> <torkil(a)drcmr.dk> wrote: >>>>>>>>> >>>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote: >>>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard >>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> The other output is too big for pastebin and I'm not >>>>>>>>>>> familiar with >>>>>>>>>>> paste services, any suggestion for a preferred way to share >>>>>>>>>>> such >>>>>>>>>>> output? >>>>>>>>>> >>>>>>>>>> You can attached files to the mail here on the list. >>>>>>>>> >>>>>>>>> Doh, for some reason I was sure attachments would be >>>>>>>>> stripped. Thanks, >>>>>>>>> attached. >>>>>>>>> >>>>>>>>> Mvh. >>>>>>>>> >>>>>>>>> Torkil >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Alexander E. Patrakov >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Alexander E. Patrakov >>>>>> >>>>>> >>>>>> >>>>> >>>>> -- >>>>> Torkil Svensgaard >>>>> Systems Administrator >>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >>>>> Copenhagen University Hospital Amager and Hvidovre >>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark >>>>> >>>> >>>> >>> >>> -- >>> Torkil Svensgaard >>> Systems Administrator >>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >>> Copenhagen University Hospital Amager and Hvidovre >>> Kettegaard Allé 30, 2650 Hvidovre, Denmark >> >> >> > > -- > Torkil Svensgaard > Systems Administrator > Danish Research Centre for Magnetic Resonance DRCMR, Section 714 > Copenhagen University Hospital Amager and Hvidovre > Kettegaard Allé 30, 2650 Hvidovre, Denmark

-- Alexander E. Patrakov _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Torkil Svensgaard

1:58 a.m.

Neither downing or restarting the OSD cleared the bogus blocked_by. I guess it makes no sense to look further at blocked_by as the cause when the data can't be trusted and there is no obvious smoking gun like a few OSDs blocking everything. My tally came to 412 out of 539 OSDs showing up in a blocked_by list and that is about every OSD with data prior to adding ~100 empty OSDs. How 400 read targets and 100 write targets can only equal ~60 backfills with osd_max_backill set at 3 just makes no sense to me but alas. It seems I can just increase osd_max_backfill even further to get the numbers I want so that will do. Thank you all for taking the time to look at this. Mvh. Torkil On 25-03-2024 20:44, Anthony D'Atri wrote:

...

First try "ceph osd down 89"

On Mar 25, 2024, at 15:37, Alexander E. Patrakov <patrakov(a)gmail.com> wrote: On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

On 24/03/2024 01:14, Torkil Svensgaard wrote:

On 24-03-2024 00:31, Alexander E. Patrakov wrote: > Hi Torkil, Hi Alexander > Thanks for the update. Even though the improvement is small, it is > still an improvement, consistent with the osd_max_backfills value, and > it proves that there are still unsolved peering issues. > > I have looked at both the old and the new state of the PG, but could > not find anything else interesting. > > I also looked again at the state of PG 37.1. It is known what blocks > the backfill of this PG; please search for "blocked_by." However, this > is just one data point, which is insufficient for any conclusions. Try > looking at other PGs. Is there anything too common in the non-empty > "blocked_by" blocks? I'll take a look at that tomorrow, perhaps we can script something meaningful.

<snip>

All PGs in the pool are active+clean so why are there any blocked_by at all? One example attached.

Mvh. Torkil

> I think we have to look for patterns in other ways, too. One tool that > produces good visualizations is TheJJ balancer. Although it is called > a "balancer," it can also visualize the ongoing backfills. > > The tool is available at > https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptim… > > Run it as follows: > > ./placementoptimizer.py showremapped --by-osd | tee remapped.txt Output attached. Thanks again. Mvh. Torkil > On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil(a)drcmr.dk> > wrote: >> >> Hi Alex >> >> New query output attached after restarting both OSDs. OSD 237 is no >> longer mentioned but it unfortunately made no difference for the number >> of backfills which went 59->62->62. >> >> Mvh. >> >> Torkil >> >> On 23-03-2024 22:26, Alexander E. Patrakov wrote: >>> Hi Torkil, >>> >>> I have looked at the files that you attached. They were helpful: pool >>> 11 is problematic, it complains about degraded objects for no obvious >>> reason. I think that is the blocker. >>> >>> I also noted that you mentioned peering problems, and I suspect that >>> they are not completely resolved. As a somewhat-irrational move, to >>> confirm this theory, you can restart osd.237 (it is mentioned at the >>> end of query.11.fff.txt, although I don't understand why it is there) >>> and then osd.298 (it is the primary for that pg) and see if any >>> additional backfills are unblocked after that. Also, please re-query >>> that PG again after the OSD restart. >>> >>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil(a)drcmr.dk> >>> wrote: >>>> >>>> >>>> >>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote: >>>>> Hi Torkil, >>>> >>>> Hi Alexander >>>> >>>>> I have looked at the CRUSH rules, and the equivalent rules work on my >>>>> test cluster. So this cannot be the cause of the blockage. >>>> >>>> Thank you for taking the time =) >>>> >>>>> What happens if you increase the osd_max_backfills setting >>>>> temporarily? >>>> >>>> We already had the mclock override option in place and I re-enabled >>>> our >>>> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending >>>> on how full they are. Active backfills went from 16 to 53 which is >>>> probably because default osd_max_backfills for mclock is 1. >>>> >>>> I think 53 is still a low number of active backfills given the large >>>> percentage misplaced. >>>> >>>>> It may be a good idea to investigate a few of the stalled PGs. Please >>>>> run commands similar to this one: >>>>> >>>>> ceph pg 37.0 query > query.37.0.txt >>>>> ceph pg 37.1 query > query.37.1.txt >>>>> ... >>>>> and the same for the other affected pools. >>>> >>>> A few samples attached. >>>> >>>>> Still, I must say that some of your rules are actually unsafe. >>>>> >>>>> The 4+2 rule as used by rbd_ec_data will not survive a >>>>> datacenter-offline incident. Namely, for each PG, it chooses OSDs >>>>> from >>>>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is >>>>> offline, you will, therefore, have only 4 OSDs up, which is exactly >>>>> the number of data chunks. However, the pool requires min_size 5, so >>>>> all PGs will be inactive (to prevent data corruption) and will stay >>>>> inactive until the datacenter comes up again. However, please don't >>>>> set min_size to 4 - then, any additional incident (like a defective >>>>> disk) will lead to data loss, and the shards in the datacenter which >>>>> went offline would be useless because they do not correspond to the >>>>> updated shards written by the clients. >>>> >>>> Thanks for the explanation. This is an old pool predating the 3 DC >>>> setup >>>> and we'll migrate the data to a 4+5 pool when we can. >>>> >>>>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the >>>>> number of data chunks. See above why it is bad. Please set >>>>> min_size to >>>>> 5. >>>> >>>> Thanks, that was a leftover for getting the PGs to peer (stuck at >>>> creating+incomplete) when we created the pool. It's back to 5 now. >>>> >>>>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are >>>>> 100% active+clean. >>>> >>>> There is very little data in this pool, that is probably the main >>>> reason. >>>> >>>>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that >>>>> have 300+ PGs, the observed maximum is 347. Please set it to 400. >>>> >>>> Copy that. Didn't seem to make a difference though, and we have >>>> osd_max_pg_per_osd_hard_ratio set to 5.000000. >>>> >>>> Mvh. >>>> >>>> Torkil >>>> >>>>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard >>>>> <torkil(a)drcmr.dk> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote: >>>>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by >>>>>>> itself >>>>>>> is insufficient. For every erasure code profile mentioned in the >>>>>>> output, please also run something like this: >>>>>>> >>>>>>> ceph osd erasure-code-profile get prf-for-ec-data >>>>>>> >>>>>>> ...where "prf-for-ec-data" is the name that appears after the words >>>>>>> "erasure profile" in the "ceph osd pool ls detail" output. >>>>>> >>>>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure >>>>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 >>>>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 >>>>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags >>>>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 >>>>>> fast_read 1 compression_algorithm snappy compression_mode aggressive >>>>>> application rbd >>>>>> pool 37 'cephfs.hdd.data' erasure profile >>>>>> DRCMR_k4m5_datacenter_hdd size >>>>>> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 >>>>>> pgp_num 2048 >>>>>> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags >>>>>> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 >>>>>> compression_algorithm zstd compression_mode aggressive >>>>>> application cephfs >>>>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd >>>>>> size 9 >>>>>> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 >>>>>> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags >>>>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 >>>>>> compression_algorithm zstd compression_mode aggressive >>>>>> application rbd >>>>>> >>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2 >>>>>> crush-device-class=hdd >>>>>> crush-failure-domain=host >>>>>> crush-root=default >>>>>> jerasure-per-chunk-alignment=false >>>>>> k=4 >>>>>> m=2 >>>>>> plugin=jerasure >>>>>> technique=reed_sol_van >>>>>> w=8 >>>>>> [root@lazy ~]# ceph osd erasure-code-profile get >>>>>> DRCMR_k4m5_datacenter_hdd >>>>>> crush-device-class=hdd >>>>>> crush-failure-domain=datacenter >>>>>> crush-root=default >>>>>> jerasure-per-chunk-alignment=false >>>>>> k=4 >>>>>> m=5 >>>>>> plugin=jerasure >>>>>> technique=reed_sol_van >>>>>> w=8 >>>>>> [root@lazy ~]# ceph osd erasure-code-profile get >>>>>> DRCMR_k4m5_datacenter_ssd >>>>>> crush-device-class=ssd >>>>>> crush-failure-domain=datacenter >>>>>> crush-root=default >>>>>> jerasure-per-chunk-alignment=false >>>>>> k=4 >>>>>> m=5 >>>>>> plugin=jerasure >>>>>> technique=reed_sol_van >>>>>> w=8 >>>>>> >>>>>> But as I understand it those profiles are only used to create the >>>>>> initial crush rule for the pool, and we have manually edited >>>>>> those along >>>>>> the way. Here are the 3 rules in use for the 3 EC pools: >>>>>> >>>>>> rule rbd_ec_data { >>>>>> id 0 >>>>>> type erasure >>>>>> step set_chooseleaf_tries 5 >>>>>> step set_choose_tries 100 >>>>>> step take default class hdd >>>>>> step choose indep 0 type datacenter >>>>>> step chooseleaf indep 2 type host >>>>>> step emit >>>>>> } >>>>>> rule cephfs.hdd.data { >>>>>> id 7 >>>>>> type erasure >>>>>> step set_chooseleaf_tries 5 >>>>>> step set_choose_tries 100 >>>>>> step take default class hdd >>>>>> step choose indep 0 type datacenter >>>>>> step chooseleaf indep 3 type host >>>>>> step emit >>>>>> } >>>>>> rule rbd.ssd.data { >>>>>> id 8 >>>>>> type erasure >>>>>> step set_chooseleaf_tries 5 >>>>>> step set_choose_tries 100 >>>>>> step take default class ssd >>>>>> step choose indep 0 type datacenter >>>>>> step chooseleaf indep 3 type host >>>>>> step emit >>>>>> } >>>>>> >>>>>> Which should first pick all 3 datacenters in the choose step and >>>>>> then >>>>>> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5 >>>>>> respectively. >>>>>> >>>>>> Mvh. >>>>>> >>>>>> Torkil >>>>>> >>>>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov >>>>>>> <patrakov(a)gmail.com> wrote: >>>>>>>> >>>>>>>> Hi Torkil, >>>>>>>> >>>>>>>> I take my previous response back. >>>>>>>> >>>>>>>> You have an erasure-coded pool with nine shards but only three >>>>>>>> datacenters. This, in general, cannot work. You need either nine >>>>>>>> datacenters or a very custom CRUSH rule. The second option may >>>>>>>> not be >>>>>>>> available if the current EC setup is already incompatible, as >>>>>>>> there is >>>>>>>> no way to change the EC parameters. >>>>>>>> >>>>>>>> It would help if you provided the output of "ceph osd pool ls >>>>>>>> detail". >>>>>>>> >>>>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov >>>>>>>> <patrakov(a)gmail.com> wrote: >>>>>>>>> >>>>>>>>> Hi Torkil, >>>>>>>>> >>>>>>>>> Unfortunately, your files contain nothing obviously bad or >>>>>>>>> suspicious, >>>>>>>>> except for two things: more PGs than usual and bad balance. >>>>>>>>> >>>>>>>>> What's your "mon max pg per osd" setting? >>>>>>>>> >>>>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard >>>>>>>>> <torkil(a)drcmr.dk> wrote: >>>>>>>>>> >>>>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote: >>>>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> The other output is too big for pastebin and I'm not >>>>>>>>>>>> familiar with >>>>>>>>>>>> paste services, any suggestion for a preferred way to share >>>>>>>>>>>> such >>>>>>>>>>>> output? >>>>>>>>>>> >>>>>>>>>>> You can attached files to the mail here on the list. >>>>>>>>>> >>>>>>>>>> Doh, for some reason I was sure attachments would be >>>>>>>>>> stripped. Thanks, >>>>>>>>>> attached. >>>>>>>>>> >>>>>>>>>> Mvh. >>>>>>>>>> >>>>>>>>>> Torkil >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Alexander E. Patrakov >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Alexander E. Patrakov >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Torkil Svensgaard >>>>>> Systems Administrator >>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >>>>>> Copenhagen University Hospital Amager and Hvidovre >>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark >>>>>> >>>>> >>>>> >>>> >>>> -- >>>> Torkil Svensgaard >>>> Systems Administrator >>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >>>> Copenhagen University Hospital Amager and Hvidovre >>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark >>> >>> >>> >> >> -- >> Torkil Svensgaard >> Systems Administrator >> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >> Copenhagen University Hospital Amager and Hvidovre >> Kettegaard Allé 30, 2650 Hvidovre, Denmark > > >

-- Alexander E. Patrakov _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Kai Stian Olstad

3:28 a.m.

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:

...

My tally came to 412 out of 539 OSDs showing up in a blocked_by list and that is about every OSD with data prior to adding ~100 empty OSDs. How 400 read targets and 100 write targets can only equal ~60 backfills with osd_max_backill set at 3 just makes no sense to me but alas. It seems I can just increase osd_max_backfill even further to get the numbers I want so that will do. Thank you all for taking the time to look at this.

It's a huge change and 42% of you data need to be moved. And this move is not only to the new OSD but also between the existing OSD, but they are busy with backfilling so they have no free backfill reservation. I do recommend this document by Joshua Baergen at Digital Ocean that explains backfilling and the problem with it and there solution, a tool called pgremapper. -- Kai Stian Olstad

Kai Stian Olstad

3:37 a.m.

On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote:

...

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:

Forgot the link https://ceph.io/assets/pdfs/user_dev_meeting_2023_10_19_joshua_baergen.pdf -- Kai Stian Olstad

Torkil Svensgaard

4:27 a.m.

On 25-03-2024 23:07, Kai Stian Olstad wrote:

...

On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote:

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:

Forgot the link https://ceph.io/assets/pdfs/user_dev_meeting_2023_10_19_joshua_baergen.pdf

Thanks again, seems the explanation for the low number of concurrent backfills is then simply that backfill_wait can hold partial reservations. Mvh. Torkil -- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section 714 Copenhagen University Hospital Amager and Hvidovre Kettegaard Allé 30, 2650 Hvidovre, Denmark

Torkil Svensgaard

3:48 a.m.

On 25-03-2024 22:58, Kai Stian Olstad wrote:

...

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:

If I have 60 backfills going on that would be 60 read reservations and 60 write reservations if I understand it correctly. The only way I can see that getting stuck at 60 backfills with osd_max_backfill = 3 is for those 60 reservations to be tied up on 20 OSDs being the only ones either read from or written to, and all other OSDs waiting on those.

...

I do recommend this document by Joshua Baergen at Digital Ocean that explains backfilling and the problem with it and there solution, a tool called pgremapper.

Thanks, I'll take a look at that =) Mvh. Torkil -- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section 714 Copenhagen University Hospital Amager and Hvidovre Kettegaard Allé 30, 2650 Hvidovre, Denmark

Torkil Svensgaard

24 Mar 24 Mar

12:50 a.m.

On 23-03-2024 18:43, Alexander E. Patrakov wrote:

...

Hi Torkil, Unfortunately, your files contain nothing obviously bad or suspicious, except for two things: more PGs than usual and bad balance. What's your "mon max pg per osd" setting?

[root@lazy ~]# ceph config get mon mon_max_pg_per_osd 250 Mvh. Torkil

...

On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

On 2024-03-23 17:54, Kai Stian Olstad wrote:

On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:

The other output is too big for pastebin and I'm not familiar with paste services, any suggestion for a preferred way to share such output?

You can attached files to the mail here on the list.

Doh, for some reason I was sure attachments would be stripped. Thanks, attached. Mvh. Torkil

Tyler Stachecki

6:11 p.m.

On Sat, Mar 23, 2024, 4:26 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

...

Hi ... Using mclock with high_recovery_ops profile. What is the bottleneck here? I would have expected a huge number of simultaneous backfills. Backfill reservation logjam?

mClock is very buggy in my experience and frequently leads to issues like this. Try using regular backfill and see if the problem goes away. Tyler

...

Torkil Svensgaard

11:06 p.m.

On 24-03-2024 13:41, Tyler Stachecki wrote:

...

On Sat, Mar 23, 2024, 4:26 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:

Hi ... Using mclock with high_recovery_ops profile. What is the bottleneck here? I would have expected a huge number of simultaneous backfills. Backfill reservation logjam?

mClock is very buggy in my experience and frequently leads to issues like this. Try using regular backfill and see if the problem goes away.

Hi Tyler Just tried switching to wpq, same thing. I'm inclined to think is must be a read reservation logjam of some sort, given that increasing osd_max_backfills had an immediate effect and we have 4 empty hosts as main write targets. Here's the output for one such OSD from the script Alexander linked: " osd.539: gimpy =>0 0.0B <=42 1.54T (Δ1.54T) drive=9.0% 358.72G/3.90T crush=9 .0% 358.72G/3.90T <-11.54f waiting 44.8G 539<-421 1070 of 230320, 0.5% <-37.538 waiting 22.3G 539<-121 2819 of 556087, 0.5% <-11.507 waiting 45.1G 539<-61 912 of 139227, 0.7% <-37.450 waiting 22.3G 539<-220 1235 of 632776, 0.2% <-11.458 waiting 45.3G 539<-121 178 of 279150, 0.1% <-37.47c waiting 22.2G 539<-83 2281 of 634472, 0.4% <-37.434 waiting 22.1G 539<-78 9496 of 316052, 3.0% <-11.3f3 waiting 44.9G 539<-109 2375 of 231055, 1.0% <-37.3d3 waiting 22.0G 539<-73 2144 of 316508, 0.7% <-37.3c5 waiting 22.2G 539<-83 313880 of 313880, 100.0% <-11.3c1 waiting 44.8G 539<-223 93878 of 230270, 40.8% <-37.3a4 waiting 21.9G 539<-85 4604 of 315504, 1.5% <-11.344 waiting 44.5G 539<-63 728 of 182876, 0.4% <-6.1ca waiting 100.9G 539<-443 36076 of 56270, 64.1% <-4.1a2 waiting 157.6G 539<-218 508 of 91456, 0.6% <-37.1ba waiting 22.2G 539<-64 316848 of 316848, 100.0% <-37.84 waiting 22.0G 539<-33 4380 of 237633, 1.8% <-37.ad waiting 22.2G 539<-77 6730 of 396635, 1.7% <-37.36 waiting 22.1G 539<-47 2170 of 395955, 0.5% <-11.b9 waiting 45.1G 539<-223 0 of 231940, 0.0% <-37.11c waiting 22.1G 539<-33 9952 of 316448, 3.1% <-11.144 waiting 45.1G 539<-207 528 of 278094, 0.2% <-37.2ae waiting 22.1G 539<-224 2565 of 712539, 0.4% <-37.285 waiting 22.0G 539<-65 441 of 315336, 0.1% <-37.2ef waiting 22.0G 539<-414 2124 of 475410, 0.4% <-37.674 waiting 22.0G 539<-56 60 of 236511, 0.0% <-37.655 waiting 22.3G 539<-143 237316 of 237381, 100.0% <-11.6b0 waiting 44.9G 539<-282 1131 of 277122, 0.4% <-37.71a waiting 22.2G 539<-49 82865 of 315684, 26.2% <-11.789 waiting 45.0G 539<-196 736 of 277584, 0.3% <-11.7cf waiting 44.8G 539<-127 143 of 276582, 0.1% <-11.7f2 waiting 45.2G 539<-272 145857 of 185680, 78.6% <-37.7dd waiting 22.0G 539<-72 0 of 393475, 0.0% <-37.7d9 waiting 22.2G 539<-37 930 of 237831, 0.4% <-11.7fb waiting 45.2G 539<-78 1062 of 279042, 0.4% <-37.7d2 waiting 22.0G 539<-71 2409 of 631368, 0.4% <-11.8db waiting 44.9G 539<-84 108 of 277182, 0.0% <-11.9b6 waiting 44.8G 539<-74 772 of 184432, 0.4% <-11.b0b waiting 45.0G 539<-166 2569 of 231430, 1.1% <-11.d42 waiting 45.2G 539<-118 15428 of 46429, 33.2% <-11.d5f waiting 44.8G 539<-64 4 of 184356, 0.0% <-11.d98 waiting 45.1G 539<-418 0 of 278568, 0.0% " All waiting for something. Mvh. Torkil

...

Tyler

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

days inactive

days old

ceph-users@ceph.io

Manage subscription

25 comments

5 participants

tags (0)

participants (5)

Alexander E. Patrakov
Anthony D'Atri
Kai Stian Olstad
Torkil Svensgaard
Tyler Stachecki