[ceph-users] Re: Large number of misplaced PGs but little backfill going on

26 Mar 2024

First try "ceph osd down 89"

...
  On Mar 25, 2024, at 15:37, Alexander E. Patrakov
&lt;patrakov(a)gmail.com&gt; wrote:

 On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard &lt;torkil(a)drcmr.dk&gt; wrote:

 On 24/03/2024 01:14, Torkil Svensgaard wrote:
  On 24-03-2024 00:31, Alexander E. Patrakov
wrote:
  Hi Torkil,  
 Hi Alexander

  Thanks for the update. Even though the
improvement is small, it is
 still an improvement, consistent with the osd_max_backfills value, and
 it proves that there are still unsolved peering issues.

 I have looked at both the old and the new state of the PG, but could
 not find anything else interesting.

 I also looked again at the state of PG 37.1. It is known what blocks
 the backfill of this PG; please search for "blocked_by." However, this
 is just one data point, which is insufficient for any conclusions. Try
 looking at other PGs. Is there anything too common in the non-empty
 "blocked_by" blocks?  
 I'll take a look at that tomorrow, perhaps we can script something
 meaningful.  
 Hi Alexander

 While working on a script querying all PGs and making a list of all OSDs
 found in a blocked_by list, and how many times for each, I discovered
 something odd about pool 38:

 "
 [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
 OSDs blocking other OSDs:  <snip>

  All PGs in the pool are active+clean so why are
there any blocked_by at
 all? One example attached.  
 I don't know. In any case, it doesn't match the "one OSD blocks them
 all" scenario that I was looking for. I think this is something bogus
 that can probably be cleared in your example by restarting osd.89
 (i.e, the one being blocked).

 Mvh.

 Torkil

   I think
we have to look for patterns in other ways, too. One tool that
 produces good visualizations is TheJJ balancer. Although it is called
 a "balancer," it can also visualize the ongoing backfills.

 The tool is available at
 https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptim…

 Run it as follows:

 ./placementoptimizer.py showremapped --by-osd | tee remapped.txt  
 Output attached.

 Thanks again.

 Mvh.

 Torkil

  On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard
&lt;torkil(a)drcmr.dk&gt;
 wrote:
> 
> Hi Alex
> 
> New query output attached after restarting both OSDs. OSD 237 is no
> longer mentioned but it unfortunately made no difference for the number
> of backfills which went 59->62->62.
> 
> Mvh.
> 
> Torkil
> 
> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
>> Hi Torkil,
>> 
>> I have looked at the files that you attached. They were helpful: pool
>> 11 is problematic, it complains about degraded objects for no obvious
>> reason. I think that is the blocker.
>> 
>> I also noted that you mentioned peering problems, and I suspect that
>> they are not completely resolved. As a somewhat-irrational move, to
>> confirm this theory, you can restart osd.237 (it is mentioned at the
>> end of query.11.fff.txt, although I don't understand why it is there)
>> and then osd.298 (it is the primary for that pg) and see if any
>> additional backfills are unblocked after that. Also, please re-query
>> that PG again after the OSD restart.
>> 
>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard &lt;torkil(a)drcmr.dk&gt;
>> wrote:
>>> 
>>> 
>>> 
>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
>>>> Hi Torkil,
>>> 
>>> Hi Alexander
>>> 
>>>> I have looked at the CRUSH rules, and the equivalent rules work on my
>>>> test cluster. So this cannot be the cause of the blockage.
>>> 
>>> Thank you for taking the time =)
>>> 
>>>> What happens if you increase the osd_max_backfills setting
>>>> temporarily?
>>> 
>>> We already had the mclock override option in place and I re-enabled
>>> our
>>> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
>>> on how full they are. Active backfills went from 16 to 53 which is
>>> probably because default osd_max_backfills for mclock is 1.
>>> 
>>> I think 53 is still a low number of active backfills given the large
>>> percentage misplaced.
>>> 
>>>> It may be a good idea to investigate a few of the stalled PGs. Please
>>>> run commands similar to this one:
>>>> 
>>>> ceph pg 37.0 query > query.37.0.txt
>>>> ceph pg 37.1 query > query.37.1.txt
>>>> ...
>>>> and the same for the other affected pools.
>>> 
>>> A few samples attached.
>>> 
>>>> Still, I must say that some of your rules are actually unsafe.
>>>> 
>>>> The 4+2 rule as used by rbd_ec_data will not survive a
>>>> datacenter-offline incident. Namely, for each PG, it chooses OSDs
>>>> from
>>>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is
>>>> offline, you will, therefore, have only 4 OSDs up, which is exactly
>>>> the number of data chunks. However, the pool requires min_size 5, so
>>>> all PGs will be inactive (to prevent data corruption) and will stay
>>>> inactive until the datacenter comes up again. However, please don't
>>>> set min_size to 4 - then, any additional incident (like a defective
>>>> disk) will lead to data loss, and the shards in the datacenter which
>>>> went offline would be useless because they do not correspond to the
>>>> updated shards written by the clients.
>>> 
>>> Thanks for the explanation. This is an old pool predating the 3 DC
>>> setup
>>> and we'll migrate the data to a 4+5 pool when we can.
>>> 
>>>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
>>>> number of data chunks. See above why it is bad. Please set
>>>> min_size to
>>>> 5.
>>> 
>>> Thanks, that was a leftover for getting the PGs to peer (stuck at
>>> creating+incomplete) when we created the pool. It's back to 5 now.
>>> 
>>>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
>>>> 100% active+clean.
>>> 
>>> There is very little data in this pool, that is probably the main
>>> reason.
>>> 
>>>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
>>>> have 300+ PGs, the observed maximum is 347. Please set it to 400.
>>> 
>>> Copy that. Didn't seem to make a difference though, and we have
>>> osd_max_pg_per_osd_hard_ratio set to 5.000000.
>>> 
>>> Mvh.
>>> 
>>> Torkil
>>> 
>>>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard
>>>> &lt;torkil(a)drcmr.dk&gt; wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
>>>>>> Sorry for replying to myself, but "ceph osd pool ls
detail" by
>>>>>> itself
>>>>>> is insufficient. For every erasure code profile mentioned in the
>>>>>> output, please also run something like this:
>>>>>> 
>>>>>> ceph osd erasure-code-profile get prf-for-ec-data
>>>>>> 
>>>>>> ...where "prf-for-ec-data" is the name that appears
after the words
>>>>>> "erasure profile" in the "ceph osd pool ls
detail" output.
>>>>> 
>>>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure
>>>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6
min_size 5
>>>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
>>>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
>>>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
>>>>> fast_read 1 compression_algorithm snappy compression_mode aggressive
>>>>> application rbd
>>>>> pool 37 'cephfs.hdd.data' erasure profile
>>>>> DRCMR_k4m5_datacenter_hdd size
>>>>> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048
>>>>> pgp_num 2048
>>>>> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags
>>>>> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
>>>>> compression_algorithm zstd compression_mode aggressive
>>>>> application cephfs
>>>>> pool 38 'rbd.ssd.data' erasure profile
DRCMR_k4m5_datacenter_ssd
>>>>> size 9
>>>>> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32
>>>>> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags
>>>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
>>>>> compression_algorithm zstd compression_mode aggressive
>>>>> application rbd
>>>>> 
>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
>>>>> crush-device-class=hdd
>>>>> crush-failure-domain=host
>>>>> crush-root=default
>>>>> jerasure-per-chunk-alignment=false
>>>>> k=4
>>>>> m=2
>>>>> plugin=jerasure
>>>>> technique=reed_sol_van
>>>>> w=8
>>>>> [root@lazy ~]# ceph osd erasure-code-profile get
>>>>> DRCMR_k4m5_datacenter_hdd
>>>>> crush-device-class=hdd
>>>>> crush-failure-domain=datacenter
>>>>> crush-root=default
>>>>> jerasure-per-chunk-alignment=false
>>>>> k=4
>>>>> m=5
>>>>> plugin=jerasure
>>>>> technique=reed_sol_van
>>>>> w=8
>>>>> [root@lazy ~]# ceph osd erasure-code-profile get
>>>>> DRCMR_k4m5_datacenter_ssd
>>>>> crush-device-class=ssd
>>>>> crush-failure-domain=datacenter
>>>>> crush-root=default
>>>>> jerasure-per-chunk-alignment=false
>>>>> k=4
>>>>> m=5
>>>>> plugin=jerasure
>>>>> technique=reed_sol_van
>>>>> w=8
>>>>> 
>>>>> But as I understand it those profiles are only used to create the
>>>>> initial crush rule for the pool, and we have manually edited
>>>>> those along
>>>>> the way. Here are the 3 rules in use for the 3 EC pools:
>>>>> 
>>>>> rule rbd_ec_data {
>>>>>            id 0
>>>>>            type erasure
>>>>>            step set_chooseleaf_tries 5
>>>>>            step set_choose_tries 100
>>>>>            step take default class hdd
>>>>>            step choose indep 0 type datacenter
>>>>>            step chooseleaf indep 2 type host
>>>>>            step emit
>>>>> }
>>>>> rule cephfs.hdd.data {
>>>>>            id 7
>>>>>            type erasure
>>>>>            step set_chooseleaf_tries 5
>>>>>            step set_choose_tries 100
>>>>>            step take default class hdd
>>>>>            step choose indep 0 type datacenter
>>>>>            step chooseleaf indep 3 type host
>>>>>            step emit
>>>>> }
>>>>> rule rbd.ssd.data {
>>>>>            id 8
>>>>>            type erasure
>>>>>            step set_chooseleaf_tries 5
>>>>>            step set_choose_tries 100
>>>>>            step take default class ssd
>>>>>            step choose indep 0 type datacenter
>>>>>            step chooseleaf indep 3 type host
>>>>>            step emit
>>>>> }
>>>>> 
>>>>> Which should first pick all 3 datacenters in the choose step and
>>>>> then
>>>>> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5
>>>>> respectively.
>>>>> 
>>>>> Mvh.
>>>>> 
>>>>> Torkil
>>>>> 
>>>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
>>>>>> &lt;patrakov(a)gmail.com&gt; wrote:
>>>>>>> 
>>>>>>> Hi Torkil,
>>>>>>> 
>>>>>>> I take my previous response back.
>>>>>>> 
>>>>>>> You have an erasure-coded pool with nine shards but only
three
>>>>>>> datacenters. This, in general, cannot work. You need either
nine
>>>>>>> datacenters or a very custom CRUSH rule. The second option
may
>>>>>>> not be
>>>>>>> available if the current EC setup is already incompatible,
as
>>>>>>> there is
>>>>>>> no way to change the EC parameters.
>>>>>>> 
>>>>>>> It would help if you provided the output of "ceph osd
pool ls
>>>>>>> detail".
>>>>>>> 
>>>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
>>>>>>> &lt;patrakov(a)gmail.com&gt; wrote:
>>>>>>>> 
>>>>>>>> Hi Torkil,
>>>>>>>> 
>>>>>>>> Unfortunately, your files contain nothing obviously bad
or
>>>>>>>> suspicious,
>>>>>>>> except for two things: more PGs than usual and bad
balance.
>>>>>>>> 
>>>>>>>> What's your "mon max pg per osd" setting?
>>>>>>>> 
>>>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard
>>>>>>>> &lt;torkil(a)drcmr.dk&gt; wrote:
>>>>>>>>> 
>>>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote:
>>>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil
Svensgaard
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> The other output is too big for pastebin and
I'm not
>>>>>>>>>>> familiar with
>>>>>>>>>>> paste services, any suggestion for a
preferred way to share
>>>>>>>>>>> such
>>>>>>>>>>> output?
>>>>>>>>>> 
>>>>>>>>>> You can attached files to the mail here on the
list.
>>>>>>>>> 
>>>>>>>>> Doh, for some reason I was sure attachments would be
>>>>>>>>> stripped. Thanks,
>>>>>>>>> attached.
>>>>>>>>> 
>>>>>>>>> Mvh.
>>>>>>>>> 
>>>>>>>>> Torkil
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Alexander E. Patrakov
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Alexander E. Patrakov
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> Torkil Svensgaard
>>>>> Systems Administrator
>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>>>> Copenhagen University Hospital Amager and Hvidovre
>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> Torkil Svensgaard
>>> Systems Administrator
>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>> Copenhagen University Hospital Amager and Hvidovre
>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>> 
>> 
>> 
> 
> --
> Torkil Svensgaard
> Systems Administrator
> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> Copenhagen University Hospital Amager and Hvidovre
> Kettegaard Allé 30, 2650 Hvidovre, Denmark

 --
 Torkil Svensgaard
 Sysadmin
 MR-Forskningssektionen, afs. 714
 DRCMR, Danish Research Centre for Magnetic Resonance
 Hvidovre Hospital
 Kettegård Allé 30
 DK-2650 Hvidovre
 Denmark
 Tel: +45 386 22828
 E-mail: torkil(a)drcmr.dk  

 -- 
 Alexander E. Patrakov
 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io 

2024

2023

2022

2021

2020

2019

[ceph-users] Re: Large number of misplaced PGs but little backfill going on