[ceph-users] Re: Impact of large PG splits

25 Apr 2024

For those interested, just a short update: the split process is
approaching its end. Two days ago there were around 230 PGs left
(the target is 4096 PGs). So far there have been no complaints and no
cluster impact was reported (the cluster load is quite moderate, but
still sensitive). Every now and then a single OSD (not always the same
one) reaches the 85% nearfull ratio, but that was expected since the
first nearfull OSD was the root cause of this operation. I expect the
balancer to kick in as soon as the backfill has completed or when
there are fewer than 5% misplaced objects.
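
The 5% mentioned above corresponds to the target_max_misplaced_ratio setting discussed further down the thread. As a rough sketch of that condition (not the balancer's actual code; the function name and the object counts are made up for illustration, and whether the real comparison is strict is an assumption):

```python
def balancer_may_run(misplaced_objects: int, total_objects: int,
                     target_max_misplaced_ratio: float = 0.05) -> bool:
    """Return True if the misplaced fraction is below the threshold."""
    if total_objects <= 0:
        return True  # nothing stored, so nothing can be misplaced
    return misplaced_objects / total_objects < target_max_misplaced_ratio


# Hypothetical numbers, not taken from the cluster discussed here.
print(balancer_may_run(7_000_000, 100_000_000))  # False: 7% misplaced
print(balancer_may_run(3_000_000, 100_000_000))  # True: 3% misplaced
```

In other words, as long as the split keeps the misplaced fraction at or above the threshold, the balancer would be expected to stay idle.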

Quoting Anthony D'Atri <anthony.datri(a)gmail.com>:

...
 One can up the ratios temporarily but it's all too easy to forget to
 reduce them later, or think that it's okay to run all the time with
 reduced headroom.

 Until a host blows up and you don't have enough space to recover into.

> On Apr 12, 2024, at 05:01, Frédéric Nass
> <frederic.nass(a)univ-lorraine.fr> wrote:
>
>
> Oh, and yeah, considering "The fullest OSD is already at 85% usage",
> the best move for now would be to add new hardware/OSDs (to avoid
> reaching the backfill too full limit) before starting to split PGs.
> Whether to split before or after enabling the upmap balancer depends
> on how well the PGs got rebalanced after adding the new OSDs.
>
> BTW, what ceph version is this? You should make sure you're running  
> v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:  
> https://tracker.ceph.com/issues/53729
>
> Cheers,
> Frédéric.
>
> ----- On 12 Apr 24, at 10:41, Frédéric Nass
> frederic.nass(a)univ-lorraine.fr wrote:
>
>> Hello Eugen,
>>
>> Is this cluster using the WPQ or the mClock scheduler?
>> (cephadm shell ceph daemon osd.0 config show | grep osd_op_queue)
>>
>> If WPQ, you might want to tune the osd_recovery_sleep* values, as they
>> do have a real impact on the recovery/backfilling speed. Just lower
>> osd_max_backfills to 1 before doing that.
>> If it is the mClock scheduler, then you might want to use a specific
>> mClock profile as suggested by Gregory (as osd_recovery_sleep* are not
>> considered when using mClock).
>>
>> Since each PG involves reads/writes from/to apparently 18 OSDs (!) and
>> this cluster only has 240, increasing osd_max_backfills to any value
>> higher than 2-3 will not help much with the recovery/backfilling speed.
>>
>> Either way, you'll have to be patient. :-)
>>
>> Cheers,
>> Frédéric.
>>
>> ----- On 10 Apr 24, at 12:54, Eugen Block eblock(a)nde.ag wrote:
>>
>>> Thank you for your input!
>>> We started the split with max_backfills = 1 and watched for a few
>>> minutes, then gradually increased it to 8. Now it's backfilling at
>>> around 180 MB/s, which is not much, but since client impact has to be
>>> avoided if possible, we decided to let that run for a couple of hours.
>>> Then we'll reevaluate the situation and maybe increase the backfills a
>>> bit more.
>>>
>>> Thanks!
>>>
>>> Quoting Gregory Orange <gregory.orange(a)pawsey.org.au>:
>>>
>>>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
>>>> with NVME RocksDB, used exclusively for RGWs, holding about 60b
>>>> objects. We are splitting for the same reason as you - improved
>>>> balance. We also thought long and hard before we began, concerned
>>>> about impact, stability etc.
>>>>
>>>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>>>> retain some control and stop it again fairly quickly if we weren't
>>>> happy with the behaviour. It also serves to limit the performance
>>>> impact on the cluster, but unfortunately it also makes the whole
>>>> process slower.
>>>>
>>>> We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
>>>> issues with the cluster. We could go higher, but are not in a rush
>>>> at this point. Sometimes nearfull osd warnings get high and MAX
>>>> AVAIL on the data pool in `ceph df` gets low enough that we want to
>>>> interrupt it. So, we set pg_num to whatever the current value is
>>>> (ceph osd pool ls detail), and let it stabilise. Then the balancer
>>>> gets to work once the misplaced objects drop below the ratio, and
>>>> things balance out. Nearfull osds drop usually to zero, and MAX
>>>> AVAIL goes up again.
>>>>
>>>> The above behaviour occurs because, while they share the same
>>>> threshold setting, the balancer only runs every minute, and it won't
>>>> run when misplaced objects are over the threshold. Meanwhile, checks
>>>> for the next PG to split happen much more frequently, so the balancer
>>>> never wins that race.
>>>>
>>>>
>>>> We didn't know how long to expect it all to take, but decided that
>>>> any improvement in PG size was worth starting. We now estimate it
>>>> will take another 2-3 weeks to complete, for a total of 4-5 weeks.
>>>>
>>>> We have lost a drive or two during the process, and of course
>>>> degraded objects went up, and more backfilling work got going. We
>>>> paused splits for at least one of those, to make sure the degraded
>>>> objects were sorted out as quickly as possible. We can't be sure it
>>>> went any faster though - there's always a long tail on that sort of
>>>> thing.
>>>>
>>>> Inconsistent objects are found at least a couple of times a week,
>>>> and to get them repairing we disable scrubs, wait until they're
>>>> stopped, then set the repair going and reenable scrubs. I don't know
>>>> if this is special to the current higher splitting load, but we
>>>> haven't noticed it before.
>>>>
>>>> HTH,
>>>> Greg.
>>>>
>>>>
>>>> On 10/4/24 14:42, Eugen Block wrote:
>>>>> Thank you, Janne.
>>>>> I believe the default 5% target_max_misplaced_ratio would work as
>>>>> well, we've had good experience with that in the past, without the
>>>>> autoscaler. I just haven't dealt with such large PGs. I've been
>>>>> warning them for two years (when the PGs were only about half this
>>>>> size) and now they finally started to listen. Well, they would
>>>>> still ignore it if it didn't impact all kinds of things now. ;-)
>>>>>
>>>>> Thanks,
>>>>> Eugen
>>>>>
>>>>> Quoting Janne Johansson <icepic.dz(a)gmail.com>:
>>>>>
>>>>>> On Tue, 9 Apr 2024 at 10:39, Eugen Block <eblock(a)nde.ag> wrote:
>>>>>>> I'm trying to estimate the possible impact when large PGs are
>>>>>>> split. Here's one example of such a PG:
>>>>>>>
>>>>>>> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
>>>>>>> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]
>>>>>>
>>>>>> If you ask for small increases of pg_num, it will only split that many
>>>>>> PGs at a time. There will still be a lot of data movement (50% of each
>>>>>> split PG, because half of its data needs to go to the newly made PG;
>>>>>> on top of that, the number of PGs per OSD will change, but the
>>>>>> balancing can now also work better), but it will not affect the whole
>>>>>> cluster if you increase by, say, 8 pg_nums at a time. As per the other
>>>>>> reply, if you bump the number by a small amount, wait for HEALTH_OK,
>>>>>> and bump some more, it will take a lot of calendar time but have a
>>>>>> rather small impact. My view is basically that this will be far less
>>>>>> impactful than losing a whole OSD, and hopefully your cluster can
>>>>>> survive that event, so it should be able to handle a slow trickle of
>>>>>> PG splits too.
>>>>>>
>>>>>> You can set a target number for the pool and let the autoscaler run a
>>>>>> few splits at a time. There are some settings to look at on how
>>>>>> aggressive the autoscaler will be, so it doesn't have to be
>>>>>> manual/scripted, but it's not very hard to script it if you are unsure
>>>>>> about the amount of work the autoscaler will start at any given time.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> May the most significant bit of your life be positive.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>
>>>> --
>>>> Gregory Orange
>>>>
>>>> System Administrator, Scientific Platforms Team
>>>> Pawsey Supercomputing Centre, CSIRO
>>>
>>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io 
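
The incremental approach Janne describes in the thread (raise pg_num by a small amount, wait for HEALTH_OK, repeat) could be scripted roughly along these lines. This is only a sketch: the pool name, the starting pg_num and the helper names are hypothetical, the step of 8 is Janne's example value, and the code only prints the commands rather than running them.

```python
from typing import Iterator, List


def pg_num_steps(current: int, target: int, step: int = 8) -> Iterator[int]:
    """Yield successive pg_num values from current up to target."""
    if step <= 0:
        raise ValueError("step must be positive")
    while current < target:
        current = min(current + step, target)
        yield current


def ceph_commands(pool: str, current: int, target: int, step: int = 8) -> List[str]:
    """Build one 'ceph osd pool set' command per step."""
    return [f"ceph osd pool set {pool} pg_num {n}"
            for n in pg_num_steps(current, target, step)]


# Hypothetical pool name and starting point; 4096 is the target from the thread.
for cmd in ceph_commands("mypool", 4072, 4096):
    # Run one command, wait until `ceph health` reports HEALTH_OK, then the next.
    print(cmd)
```

Gregory's way of pausing a split corresponds to issuing the same command with the pool's current pg_num as reported by `ceph osd pool ls detail`.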
