No, we didn't change much; we just increased the max PGs per OSD to avoid
warnings and inactive PGs in case a node failed during this
process. And the max backfills, of course.
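For reference, a sketch of how such a change can be applied cluster-wide, assuming the settings meant here are mon_max_pg_per_osd and osd_max_backfills (the values are only examples):

```shell
# Assumed settings: mon_max_pg_per_osd and osd_max_backfills; example values.
ceph config set global mon_max_pg_per_osd 500
ceph config set osd osd_max_backfills 2
# Revert to the defaults once the split has finished:
ceph config rm global mon_max_pg_per_osd
ceph config rm osd osd_max_backfills
```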
Quote from Frédéric Nass <frederic.nass(a)univ-lorraine.fr>:
Hello Eugen,
Thanks for sharing the good news. Did you have to raise
mon_osd_nearfull_ratio temporarily?
Frédéric.
----- On Apr 25, 2024, at 12:35, Eugen Block eblock(a)nde.ag wrote:
> For those interested, just a short update: the split process is
> approaching its end, two days ago there were around 230 PGs left
> (target is 4096 PGs). So far there have been no complaints, and no cluster
> impact has been reported (the cluster load is quite moderate, but still
> sensitive). Every now and then a single OSD (not the same one) reaches the 85%
> nearfull ratio, but that was expected since the first nearfull OSD was
> the root cause of this operation. I expect the balancer to kick in as
> soon as the backfill has completed or when there are fewer than 5%
> misplaced objects.
>
> Quote from Anthony D'Atri <anthony.datri(a)gmail.com>:
>
>> One can up the ratios temporarily but it's all too easy to forget to
>> reduce them later, or think that it's okay to run all the time with
>> reduced headroom.
>>
>> Until a host blows up and you don't have enough space to recover into.
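If one does raise the ratio anyway, a minimal sketch of doing it and reverting it (0.85 is the shipped default; 0.90 is only an example):

```shell
# Temporarily raise the nearfull threshold, then restore the default.
ceph osd set-nearfull-ratio 0.90
# ... once backfill has brought usage back down:
ceph osd set-nearfull-ratio 0.85
```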
>>
>>> On Apr 12, 2024, at 05:01, Frédéric Nass
>>> <frederic.nass(a)univ-lorraine.fr> wrote:
>>>
>>>
>>> Oh, and yeah, considering "The fullest OSD is already at 85%
>>> usage", the best move for now would be to add new hardware/OSDs (to
>>> avoid reaching the backfillfull limit) before starting to split the
>>> PGs, and to enable the upmap balancer before or after, depending on
>>> how well the PGs got rebalanced after adding the new OSDs.
>>>
>>> BTW, what ceph version is this? You should make sure you're running
>>> v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:
>>>
>>> https://tracker.ceph.com/issues/53729
>>>
>>> Cheers,
>>> Frédéric.
>>>
>>> ----- On Apr 12, 2024, at 10:41, Frédéric Nass
>>> frederic.nass(a)univ-lorraine.fr wrote:
>>>
>>>> Hello Eugen,
>>>>
>>>> Is this cluster using the WPQ or the mClock scheduler?
>>>> (cephadm shell ceph daemon osd.0 config show | grep osd_op_queue)
>>>>
>>>> If WPQ, you might want to tune the osd_recovery_sleep* values, as
>>>> they do have a real impact on the recovery/backfilling speed. Just
>>>> lower osd_max_backfills to 1 before doing that.
>>>> If mClock, you might want to use a specific mClock profile as
>>>> suggested by Gregory (osd_recovery_sleep* is not considered when
>>>> using mClock).
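The two cases above could look like this (option names are from upstream Ceph; the sleep values are only examples and should be tuned per cluster):

```shell
# Check which op queue scheduler the OSDs run:
ceph daemon osd.0 config show | grep osd_op_queue

# WPQ: lower osd_max_backfills first, then reduce the recovery sleeps.
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_sleep_hdd 0.05   # example value
ceph config set osd osd_recovery_sleep_ssd 0      # example value

# mClock: osd_recovery_sleep* is ignored; pick a profile instead.
ceph config set osd osd_mclock_profile high_recovery_ops
```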
>>>>
>>>> Since each PG involves reads/writes from/to apparently 18 OSDs (!)
>>>> and this cluster only has 240, increasing osd_max_backfills to any
>>>> value higher than 2-3 will not help much with the
>>>> recovery/backfilling speed.
>>>>
>>>> Either way, you'll have to be patient. :-)
>>>>
>>>> Cheers,
>>>> Frédéric.
>>>>
>>> ----- On Apr 10, 2024, at 12:54, Eugen Block eblock(a)nde.ag wrote:
>>>>
>>>>> Thank you for the input!
>>>>> We started the split with max_backfills = 1 and watched for a few
>>>>> minutes, then gradually increased it to 8. Now it's backfilling with
>>>>> around 180 MB/s, not really much, but since client impact has to be
>>>>> avoided if possible, we decided to let that run for a couple of
>>>>> hours, then reevaluate the situation and maybe increase the
>>>>> backfills a bit more.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Quote from Gregory Orange <gregory.orange(a)pawsey.org.au>:
>>>>>
>>>>>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB
>>>>>> OSDs with NVMe RocksDB, used exclusively for RGWs, holding about
>>>>>> 60b objects. We are splitting for the same reason as you -
>>>>>> improved balance. We also thought long and hard before we began,
>>>>>> concerned about impact, stability etc.
>>>>>>
>>>>>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>>>>>> retain some control and stop it again fairly quickly if we
>>>>>> weren't happy with the behaviour. It also serves to limit the
>>>>>> performance impact on the cluster, but unfortunately it also
>>>>>> makes the whole process slower.
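As a sketch, the throttling described above boils down to a single mgr option (0.001 = 0.1%; the Ceph default is 0.05, i.e. 5%):

```shell
ceph config set mgr target_max_misplaced_ratio 0.001   # start cautiously
# later, once confident in the cluster's behaviour:
ceph config set mgr target_max_misplaced_ratio 0.015   # 1.5%
```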
>>>>>>
>>>>>> We now have the setting up to 1.5%, seeing recovery up to
>>>>>> 10GB/s. No issues with the cluster. We could go higher, but are
>>>>>> not in a rush at this point. Sometimes nearfull osd warnings get
>>>>>> high and MAX AVAIL on the data pool in `ceph df` gets low enough
>>>>>> that we want to interrupt it. So, we set pg_num to whatever the
>>>>>> current value is (ceph osd pool ls detail), and let it
>>>>>> stabilise. Then the balancer gets to work once the misplaced
>>>>>> objects drop below the ratio, and things balance out. Nearfull
>>>>>> osds usually drop to zero, and MAX AVAIL goes up again.
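A sketch of that pause, with 'data' as a placeholder pool name and 2048 standing in for whatever pg_num the pool currently reports (use the observed value, not the target):

```shell
ceph osd pool ls detail | grep "'data'"   # read the current pg_num
ceph osd pool set data pg_num 2048        # pin it at the observed value
```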
>>>>>>
>>>>>> The above behaviour is because, while they share the same
>>>>>> threshold setting, the autoscaler only runs every minute, and it
>>>>>> won't run when misplaced objects are over the threshold.
>>>>>> Meanwhile, checks for the next PG to split happen much more
>>>>>> frequently, so the balancer never wins that race.
>>>>>>
>>>>>>
>>>>>> We didn't know how long to expect it all to take, but decided
>>>>>> that any improvement in PG size was worth starting. We now
>>>>>> estimate it will take another 2-3 weeks to complete, for a total
>>>>>> of 4-5 weeks.
>>>>>>
>>>>>> We have lost a drive or two during the process, and of course
>>>>>> degraded objects went up, and more backfilling work got going.
>>>>>> We paused splits for at least one of those, to make sure the
>>>>>> degraded objects were sorted out as quickly as possible. We
>>>>>> can't be sure it went any faster, though - there's always a long
>>>>>> tail on that sort of thing.
>>>>>>
>>>>>> Inconsistent objects are found at least a couple of times a
>>>>>> week, and to get them repairing we disable scrubs, wait until
>>>>>> they're stopped, then set the repair going and reenable scrubs.
>>>>>> I don't know if this is special to the current higher splitting
>>>>>> load, but we haven't noticed it before.
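The repair sequence described above, sketched with the PG id from earlier in the thread as a placeholder:

```shell
ceph osd set noscrub
ceph osd set nodeep-scrub
# wait until 'ceph status' shows no scrubs running, then:
ceph pg repair 86.3ff
# after the repair has been scheduled and run:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```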
>>>>>>
>>>>>> HTH,
>>>>>> Greg.
>>>>>>
>>>>>>
>>>>>> On 10/4/24 14:42, Eugen Block wrote:
>>>>>>> Thank you, Janne.
>>>>>>> I believe the default 5% target_max_misplaced_ratio would work
>>>>>>> as well; we've had good experience with that in the past,
>>>>>>> without the autoscaler. I just haven't dealt with such large
>>>>>>> PGs. I've been warning them for two years (when the PGs were
>>>>>>> only almost half this size) and now they finally started to
>>>>>>> listen. Well, they would still ignore it if it didn't impact
>>>>>>> all kinds of things now. ;-)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Eugen
>>>>>>>
>>>>>>> Quote from Janne Johansson <icepic.dz(a)gmail.com>:
>>>>>>>
>>>>>>>> On Tue, Apr 9, 2024 at 10:39, Eugen Block
>>>>>>>> <eblock(a)nde.ag> wrote:
>>>>>>>>> I'm trying to estimate the possible impact when large PGs
>>>>>>>>> are split. Here's one example of such a PG:
>>>>>>>>>
>>>>>>>>> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
>>>>>>>>> 86.3ff   277708   414403098409  0            0           3092  3092
>>>>>>>>> [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]
>>>>>>>>
>>>>>>>> If you ask for small increases of pg_num, it will only split
>>>>>>>> that many PGs at a time, so while there will be a lot of data
>>>>>>>> movement (50% because half of the data needs to go to a newly
>>>>>>>> made PG, and on top of that the PGs per OSD will change, but
>>>>>>>> the balancing can also now work better), it will not affect
>>>>>>>> the whole cluster if you increase by, say, 8 pg_num at a time.
>>>>>>>> As per the other reply, if you bump the number by a small
>>>>>>>> amount - wait for HEALTH_OK - bump some more, it will take a
>>>>>>>> lot of calendar time, but have rather small impact. My view of
>>>>>>>> it is basically that this will be far less impactful than if
>>>>>>>> you lose a whole OSD, and hopefully your cluster can survive
>>>>>>>> that event, so it should be able to handle a slow trickle of
>>>>>>>> PG splits too.
>>>>>>>>
>>>>>>>> You can set a target number for the pool and let the
>>>>>>>> autoscaler run a few splits at a time. There are some settings
>>>>>>>> to look at for how aggressive the autoscaler will be, so it
>>>>>>>> doesn't have to be manual/scripted, but it's not very hard to
>>>>>>>> script it if you are unsure about the amount of work the
>>>>>>>> autoscaler will start at any given time.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> May the most significant bit of your life be positive.
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>>
>>>>>> --
>>>>>> Gregory Orange
>>>>>>
>>>>>> System Administrator, Scientific Platforms Team
>>>>>> Pawsey Supercomputing Centre, CSIRO
>>>>>
>>>>>