[ceph-users] Re: Large number of misplaced PGs but little backfill going on

25 Mar 2024

On 24/03/2024 01:14, Torkil Svensgaard wrote:
...
  On 24-03-2024 00:31, Alexander E. Patrakov wrote:
  Hi Torkil,  
 Hi Alexander

  Thanks for the update. Even though the
improvement is small, it is
 still an improvement, consistent with the osd_max_backfills value, and
 it proves that there are still unsolved peering issues.

 I have looked at both the old and the new state of the PG, but could
 not find anything else interesting.

 I also looked again at the state of PG 37.1. It is known what blocks
 the backfill of this PG; please search for "blocked_by." However, this
 is just one data point, which is insufficient for any conclusions. Try
 looking at other PGs. Is there anything too common in the non-empty
 "blocked_by" blocks?  
 I'll take a look at that tomorrow, perhaps we can script something 
 meaningful. 
Hi Alexander

While working on a script querying all PGs and making a list of all OSDs 
found in a blocked_by list, and how many times for each, I discovered 
something odd about pool 38:

"
[root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
OSDs blocking other OSDs:
OSD 425: 5 instance(s)
OSD 426: 6 instance(s)
OSD 34: 7 instance(s)
OSD 36: 5 instance(s)
OSD 146: 3 instance(s)
OSD 6: 2 instance(s)
OSD 5: 8 instance(s)
OSD 131: 7 instance(s)
OSD 4: 9 instance(s)
OSD 3: 5 instance(s)
OSD 2: 5 instance(s)
OSD 1: 2 instance(s)
OSD 0: 4 instance(s)
OSD 167: 1 instance(s)
OSD 168: 3 instance(s)
OSD 450: 2 instance(s)
OSD 46: 6 instance(s)
OSD 154: 3 instance(s)
OSD 156: 2 instance(s)
OSD 90: 2 instance(s)
OSD 227: 4 instance(s)
OSD 10: 4 instance(s)
OSD 15: 6 instance(s)
OSD 449: 4 instance(s)
OSD 192: 2 instance(s)
OSD 67: 3 instance(s)
"

All PGs in the pool are active+clean so why are there any blocked_by at 
all? One example attached.

Mvh.

Torkil

...
   I think we
have to look for patterns in other ways, too. One tool that
 produces good visualizations is TheJJ balancer. Although it is called
 a "balancer," it can also visualize the ongoing backfills.

 The tool is available at
 https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptim…

 Run it as follows:

 ./placementoptimizer.py showremapped --by-osd | tee remapped.txt  
 Output attached.

 Thanks again.

 Mvh.

 Torkil

  On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard
&lt;torkil(a)drcmr.dk&gt; 
 wrote:

 Hi Alex

 New query output attached after restarting both OSDs. OSD 237 is no
 longer mentioned but it unfortunately made no difference for the number
 of backfills which went 59->62->62.

 Mvh.

 Torkil

 On 23-03-2024 22:26, Alexander E. Patrakov wrote:
  Hi Torkil,

 I have looked at the files that you attached. They were helpful: pool
 11 is problematic, it complains about degraded objects for no obvious
 reason. I think that is the blocker.

 I also noted that you mentioned peering problems, and I suspect that
 they are not completely resolved. As a somewhat-irrational move, to
 confirm this theory, you can restart osd.237 (it is mentioned at the
 end of query.11.fff.txt, although I don't understand why it is there)
 and then osd.298 (it is the primary for that pg) and see if any
 additional backfills are unblocked after that. Also, please re-query
 that PG again after the OSD restart.

 On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard &lt;torkil(a)drcmr.dk&gt; 
 wrote:
>
>
>
> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
>> Hi Torkil,
>
> Hi Alexander
>
>> I have looked at the CRUSH rules, and the equivalent rules work on my
>> test cluster. So this cannot be the cause of the blockage.
>
> Thank you for taking the time =)
>
>> What happens if you increase the osd_max_backfills setting 
>> temporarily?
>
> We already had the mclock override option in place and I re-enabled 
> our
> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
> on how full they are. Active backfills went from 16 to 53 which is
> probably because default osd_max_backfills for mclock is 1.
>
> I think 53 is still a low number of active backfills given the large
> percentage misplaced.
>
>> It may be a good idea to investigate a few of the stalled PGs. Please
>> run commands similar to this one:
>>
>> ceph pg 37.0 query > query.37.0.txt
>> ceph pg 37.1 query > query.37.1.txt
>> ...
>> and the same for the other affected pools.
>
> A few samples attached.
>
>> Still, I must say that some of your rules are actually unsafe.
>>
>> The 4+2 rule as used by rbd_ec_data will not survive a
>> datacenter-offline incident. Namely, for each PG, it chooses OSDs 
>> from
>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is
>> offline, you will, therefore, have only 4 OSDs up, which is exactly
>> the number of data chunks. However, the pool requires min_size 5, so
>> all PGs will be inactive (to prevent data corruption) and will stay
>> inactive until the datacenter comes up again. However, please don't
>> set min_size to 4 - then, any additional incident (like a defective
>> disk) will lead to data loss, and the shards in the datacenter which
>> went offline would be useless because they do not correspond to the
>> updated shards written by the clients.
>
> Thanks for the explanation. This is an old pool predating the 3 DC 
> setup
> and we'll migrate the data to a 4+5 pool when we can.
>
>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
>> number of data chunks. See above why it is bad. Please set 
>> min_size to
>> 5.
>
> Thanks, that was a leftover for getting the PGs to peer (stuck at
> creating+incomplete) when we created the pool. It's back to 5 now.
>
>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
>> 100% active+clean.
>
> There is very little data in this pool, that is probably the main 
> reason.
>
>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
>> have 300+ PGs, the observed maximum is 347. Please set it to 400.
>
> Copy that. Didn't seem to make a difference though, and we have
> osd_max_pg_per_osd_hard_ratio set to 5.000000.
>
> Mvh.
>
> Torkil
>
>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard 
>> &lt;torkil(a)drcmr.dk&gt; wrote:
>>>
>>>
>>>
>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by

>>>> itself
>>>> is insufficient. For every erasure code profile mentioned in the
>>>> output, please also run something like this:
>>>>
>>>> ceph osd erasure-code-profile get prf-for-ec-data
>>>>
>>>> ...where "prf-for-ec-data" is the name that appears after the
words
>>>> "erasure profile" in the "ceph osd pool ls detail"
output.
>>>
>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure
>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
>>> fast_read 1 compression_algorithm snappy compression_mode aggressive
>>> application rbd
>>> pool 37 'cephfs.hdd.data' erasure profile 
>>> DRCMR_k4m5_datacenter_hdd size
>>> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 
>>> pgp_num 2048
>>> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags
>>> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
>>> compression_algorithm zstd compression_mode aggressive 
>>> application cephfs
>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd 
>>> size 9
>>> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32
>>> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags
>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
>>> compression_algorithm zstd compression_mode aggressive 
>>> application rbd
>>>
>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
>>> crush-device-class=hdd
>>> crush-failure-domain=host
>>> crush-root=default
>>> jerasure-per-chunk-alignment=false
>>> k=4
>>> m=2
>>> plugin=jerasure
>>> technique=reed_sol_van
>>> w=8
>>> [root@lazy ~]# ceph osd erasure-code-profile get 
>>> DRCMR_k4m5_datacenter_hdd
>>> crush-device-class=hdd
>>> crush-failure-domain=datacenter
>>> crush-root=default
>>> jerasure-per-chunk-alignment=false
>>> k=4
>>> m=5
>>> plugin=jerasure
>>> technique=reed_sol_van
>>> w=8
>>> [root@lazy ~]# ceph osd erasure-code-profile get 
>>> DRCMR_k4m5_datacenter_ssd
>>> crush-device-class=ssd
>>> crush-failure-domain=datacenter
>>> crush-root=default
>>> jerasure-per-chunk-alignment=false
>>> k=4
>>> m=5
>>> plugin=jerasure
>>> technique=reed_sol_van
>>> w=8
>>>
>>> But as I understand it those profiles are only used to create the
>>> initial crush rule for the pool, and we have manually edited 
>>> those along
>>> the way. Here are the 3 rules in use for the 3 EC pools:
>>>
>>> rule rbd_ec_data {
>>>             id 0
>>>             type erasure
>>>             step set_chooseleaf_tries 5
>>>             step set_choose_tries 100
>>>             step take default class hdd
>>>             step choose indep 0 type datacenter
>>>             step chooseleaf indep 2 type host
>>>             step emit
>>> }
>>> rule cephfs.hdd.data {
>>>             id 7
>>>             type erasure
>>>             step set_chooseleaf_tries 5
>>>             step set_choose_tries 100
>>>             step take default class hdd
>>>             step choose indep 0 type datacenter
>>>             step chooseleaf indep 3 type host
>>>             step emit
>>> }
>>> rule rbd.ssd.data {
>>>             id 8
>>>             type erasure
>>>             step set_chooseleaf_tries 5
>>>             step set_choose_tries 100
>>>             step take default class ssd
>>>             step choose indep 0 type datacenter
>>>             step chooseleaf indep 3 type host
>>>             step emit
>>> }
>>>
>>> Which should first pick all 3 datacenters in the choose step and 
>>> then
>>> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5
>>> respectively.
>>>
>>> Mvh.
>>>
>>> Torkil
>>>
>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
>>>> &lt;patrakov(a)gmail.com&gt; wrote:
>>>>>
>>>>> Hi Torkil,
>>>>>
>>>>> I take my previous response back.
>>>>>
>>>>> You have an erasure-coded pool with nine shards but only three
>>>>> datacenters. This, in general, cannot work. You need either nine
>>>>> datacenters or a very custom CRUSH rule. The second option may 
>>>>> not be
>>>>> available if the current EC setup is already incompatible, as 
>>>>> there is
>>>>> no way to change the EC parameters.
>>>>>
>>>>> It would help if you provided the output of "ceph osd pool ls 
>>>>> detail".
>>>>>
>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
>>>>> &lt;patrakov(a)gmail.com&gt; wrote:
>>>>>>
>>>>>> Hi Torkil,
>>>>>>
>>>>>> Unfortunately, your files contain nothing obviously bad or 
>>>>>> suspicious,
>>>>>> except for two things: more PGs than usual and bad balance.
>>>>>>
>>>>>> What's your "mon max pg per osd" setting?
>>>>>>
>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard 
>>>>>> &lt;torkil(a)drcmr.dk&gt; wrote:
>>>>>>>
>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote:
>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil
Svensgaard 
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> The other output is too big for pastebin and I'm
not 
>>>>>>>>> familiar with
>>>>>>>>> paste services, any suggestion for a preferred way to
share 
>>>>>>>>> such
>>>>>>>>> output?
>>>>>>>>
>>>>>>>> You can attached files to the mail here on the list.
>>>>>>>
>>>>>>> Doh, for some reason I was sure attachments would be 
>>>>>>> stripped. Thanks,
>>>>>>> attached.
>>>>>>>
>>>>>>> Mvh.
>>>>>>>
>>>>>>> Torkil
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Alexander E. Patrakov
>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> Alexander E. Patrakov
>>>>
>>>>
>>>>
>>>
>>> -- 
>>> Torkil Svensgaard
>>> Systems Administrator
>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>> Copenhagen University Hospital Amager and Hvidovre
>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>>>
>>
>>
>
> -- 
> Torkil Svensgaard
> Systems Administrator
> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> Copenhagen University Hospital Amager and Hvidovre
> Kettegaard Allé 30, 2650 Hvidovre, Denmark

 -- 
 Torkil Svensgaard
 Systems Administrator
 Danish Research Centre for Magnetic Resonance DRCMR, Section 714
 Copenhagen University Hospital Amager and Hvidovre
 Kettegaard Allé 30, 2650 Hvidovre, Denmark 

-- 
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil(a)drcmr.dk

2024

2023

2022

2021

2020

2019

[ceph-users] Re: Large number of misplaced PGs but little backfill going on