The split process completed over the weekend, and the balancer did a
great job:

MIN PGs | MAX PGs | MIN USE % | MAX USE %
322     | 338     | 73.3      | 75.5

Although the number of PGs per OSD differs a bit, the usage per OSD is
quite even (which is what matters more). The new hardware has also
arrived, so there will soon be some more remapping. :-)
So I would consider this thread closed, all good.
Quoting Eugen Block <eblock(a)nde.ag>:
No, we didn't change much, just increased the max PGs per OSD to avoid
warnings and inactive PGs in case a node failed during this process.
And the max backfills, of course.
Quoting Frédéric Nass <frederic.nass(a)univ-lorraine.fr>:
> Hello Eugen,
>
> Thanks for sharing the good news. Did you have to raise
> mon_osd_nearfull_ratio temporarily?
>
> Frédéric.
>
> ----- On 25 Apr 24, at 12:35, Eugen Block eblock(a)nde.ag wrote:
>
>> For those interested, just a short update: the split process is
>> approaching its end; two days ago there were around 230 PGs left
>> (the target is 4096 PGs). So far there have been no complaints and no
>> cluster impact was reported (the cluster load is quite moderate, but
>> still sensitive). Every now and then a single OSD (not always the
>> same one) reaches the 85% nearfull ratio, but that was expected since
>> the first nearfull OSD was the root cause of this operation. I expect
>> the balancer to kick in as soon as the backfill has completed, or
>> when there are fewer than 5% misplaced objects.
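>>
>> For anyone following along, the numbers above come from nothing
>> fancier than the usual status commands, e.g.:
>>
>>   ceph pg stat          # misplaced/backfilling object counts
>>   ceph osd df           # per-OSD usage, shows the nearfull ones
>>   ceph balancer status  # whether the balancer is active or idle
>>
>> (The 5% threshold mentioned above is the mgr option
>> target_max_misplaced_ratio, which defaults to 0.05.)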
>>
>> Quoting Anthony D'Atri <anthony.datri(a)gmail.com>:
>>
>>> One can up the ratios temporarily, but it's all too easy to forget
>>> to reduce them later, or to think that it's okay to run all the time
>>> with reduced headroom.
>>>
>>> Until a host blows up and you don't have enough space to recover into.
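>>>
>>> If you do bump them temporarily, the revert is easy enough to script
>>> right away (ratios below are only examples; the defaults are 0.85
>>> nearfull and 0.90 backfillfull):
>>>
>>>   ceph osd set-nearfull-ratio 0.90
>>>   ceph osd set-backfillfull-ratio 0.92
>>>   # ... and once recovery has settled, back to the defaults:
>>>   ceph osd set-nearfull-ratio 0.85
>>>   ceph osd set-backfillfull-ratio 0.90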
>>>
>>>> On Apr 12, 2024, at 05:01, Frédéric Nass
>>>> <frederic.nass(a)univ-lorraine.fr> wrote:
>>>>
>>>>
>>>> Oh, and yeah, considering "The fullest OSD is already at 85%
>>>> usage", the best move for now would be to add the new hardware/OSDs
>>>> first (to avoid reaching the backfillfull limit), and only then
>>>> start splitting PGs, before or after enabling the upmap balancer
>>>> depending on how well (or not) the PGs get rebalanced after adding
>>>> the new OSDs.
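>>>>
>>>> (If the upmap balancer isn't enabled yet, that is roughly, assuming
>>>> all clients are Luminous or newer:
>>>>
>>>>   ceph osd set-require-min-compat-client luminous
>>>>   ceph balancer mode upmap
>>>>   ceph balancer on
>>>> )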
>>>>
>>>> BTW, what ceph version is this? You should make sure you're running
>>>> v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:
>>>>
>>>> https://tracker.ceph.com/issues/53729
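>>>>
>>>> A quick way to confirm what every daemon is actually running:
>>>>
>>>>   ceph versions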
>>>>
>>>> Cheers,
>>>> Frédéric.
>>>>
>>>> ----- On 12 Apr 24, at 10:41, Frédéric Nass
>>>> frederic.nass(a)univ-lorraine.fr wrote:
>>>>
>>>>> Hello Eugen,
>>>>>
>>>>> Is this cluster using the WPQ or the mClock scheduler? (cephadm
>>>>> shell ceph daemon osd.0 config show | grep osd_op_queue)
>>>>>
>>>>> If WPQ, you might want to tune the osd_recovery_sleep* values, as
>>>>> they do have a real impact on the recovery/backfilling speed. Just
>>>>> lower osd_max_backfills to 1 before doing that.
>>>>> If mClock, then you might want to use a specific mClock profile as
>>>>> suggested by Gregory (as osd_recovery_sleep* is not considered
>>>>> when using mClock).
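>>>>>
>>>>> A rough sketch of both paths (the sleep value is just a starting
>>>>> point to experiment with, not a recommendation):
>>>>>
>>>>>   # WPQ: lower backfills first, then tune the recovery sleeps
>>>>>   ceph config set osd osd_max_backfills 1
>>>>>   ceph config set osd osd_recovery_sleep_hdd 0.05
>>>>>
>>>>>   # mClock: pick a recovery-oriented profile instead
>>>>>   ceph config set osd osd_mclock_profile high_recovery_ops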
>>>>>
>>>>> Since each PG involves reads/writes from/to apparently 18 OSDs (!)
>>>>> and this cluster only has 240, increasing osd_max_backfills to any
>>>>> value higher than 2-3 will not help much with the
>>>>> recovery/backfilling speed.
>>>>>
>>>>> Either way, you'll have to be patient. :-)
>>>>>
>>>>> Cheers,
>>>>> Frédéric.
>>>>>
>>>>> ----- On 10 Apr 24, at 12:54, Eugen Block eblock(a)nde.ag wrote:
>>>>>
>>>>>> Thank you for the input!
>>>>>> We started the split with max_backfills = 1 and watched for a few
>>>>>> minutes, then gradually increased it to 8. Now it's backfilling
>>>>>> with around 180 MB/s, not really much, but since client impact
>>>>>> has to be avoided if possible, we decided to let that run for a
>>>>>> couple of hours, then reevaluate the situation and maybe increase
>>>>>> the backfills a bit more.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Quoting Gregory Orange <gregory.orange(a)pawsey.org.au>:
>>>>>>
>>>>>>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB
>>>>>>> OSDs with NVMe RocksDB, used exclusively for RGWs, holding about
>>>>>>> 60 billion objects. We are splitting for the same reason as you
>>>>>>> - improved balance. We also thought long and hard before we
>>>>>>> began, concerned about impact, stability etc.
>>>>>>>
>>>>>>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>>>>>>> retain some control and stop it again fairly quickly if we
>>>>>>> weren't happy with the behaviour. It also serves to limit the
>>>>>>> performance impact on the cluster, but unfortunately it also
>>>>>>> makes the whole process slower.
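>>>>>>>
>>>>>>> For reference, that throttle is the mgr option, set here to 0.1%:
>>>>>>>
>>>>>>>   ceph config set mgr target_max_misplaced_ratio 0.001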
>>>>>>>
>>>>>>> We now have the setting up to 1.5%, seeing recovery of up to
>>>>>>> 10 GB/s. No issues with the cluster. We could go higher, but are
>>>>>>> not in a rush at this point. Sometimes the nearfull osd warnings
>>>>>>> get high and MAX AVAIL on the data pool in `ceph df` gets low
>>>>>>> enough that we want to interrupt it. So, we set pg_num to
>>>>>>> whatever the current value is (ceph osd pool ls detail), and let
>>>>>>> it stabilise. Then the balancer gets to work once the misplaced
>>>>>>> objects drop below the ratio, and things balance out. Nearfull
>>>>>>> osds usually drop to zero, and MAX AVAIL goes up again.
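>>>>>>>
>>>>>>> In command form that pause looks something like this (the pool
>>>>>>> name and value are just placeholders):
>>>>>>>
>>>>>>>   ceph osd pool ls detail | grep data.pool   # note the current pg_num
>>>>>>>   ceph osd pool set data.pool pg_num 11234   # pin it to that value
>>>>>>>
>>>>>>> and later we set pg_num back to the final target to resume the
>>>>>>> splits.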
>>>>>>>
>>>>>>> The above behaviour is because, while they share the same
>>>>>>> threshold setting, the autoscaler only runs every minute, and it
>>>>>>> won't run when misplaced objects are over the threshold.
>>>>>>> Meanwhile, checks for the next PG to split happen much more
>>>>>>> frequently, so the balancer never wins that race.
>>>>>>>
>>>>>>>
>>>>>>> We didn't know how long to expect it all to take, but decided
>>>>>>> that any improvement in PG size was worth starting. We now
>>>>>>> estimate it will take another 2-3 weeks to complete, for a total
>>>>>>> of 4-5 weeks.
>>>>>>>
>>>>>>> We have lost a drive or two during the process, and of course
>>>>>>> degraded objects went up, and more backfilling work got going.
>>>>>>> We paused splits for at least one of those, to make sure the
>>>>>>> degraded objects were sorted out as quickly as possible. We
>>>>>>> can't be sure it went any faster, though - there's always a long
>>>>>>> tail on that sort of thing.
>>>>>>>
>>>>>>> Inconsistent objects are found at least a couple of times a
>>>>>>> week, and to get them repairing we disable scrubs, wait until
>>>>>>> they're stopped, then set the repair going and reenable scrubs.
>>>>>>> I don't know if this is special to the current higher splitting
>>>>>>> load, but we haven't noticed it before.
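>>>>>>>
>>>>>>> The repair sequence, roughly (the PG id is a placeholder):
>>>>>>>
>>>>>>>   ceph osd set noscrub
>>>>>>>   ceph osd set nodeep-scrub
>>>>>>>   # wait for running (deep-)scrubs to finish, then:
>>>>>>>   ceph pg repair <pgid>
>>>>>>>   ceph osd unset nodeep-scrub
>>>>>>>   ceph osd unset noscrub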
>>>>>>>
>>>>>>> HTH,
>>>>>>> Greg.
>>>>>>>
>>>>>>>
>>>>>>> On 10/4/24 14:42, Eugen Block wrote:
>>>>>>>> Thank you, Janne.
>>>>>>>> I believe the default 5% target_max_misplaced_ratio would work
>>>>>>>> as well; we've had good experience with that in the past,
>>>>>>>> without the autoscaler. I just haven't dealt with such large
>>>>>>>> PGs. I've been warning them for two years (when the PGs were
>>>>>>>> only almost half this size) and now they finally started to
>>>>>>>> listen. Well, they would still ignore it if it didn't impact
>>>>>>>> all kinds of things now. ;-)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Eugen
>>>>>>>>
>>>>>>>> Quoting Janne Johansson <icepic.dz(a)gmail.com>:
>>>>>>>>
>>>>>>>>> On Tue, 9 Apr 2024 at 10:39, Eugen Block <eblock(a)nde.ag> wrote:
>>>>>>>>>> I'm trying to estimate the possible impact when large PGs are
>>>>>>>>>> split. Here's one example of such a PG:
>>>>>>>>>>
>>>>>>>>>> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG
>>>>>>>>>> 86.3ff   277708   414403098409  0            0           3092  3092
>>>>>>>>>> UP: [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]
>>>>>>>>>
>>>>>>>>> If you ask for small increases of pg_num, it will only split
>>>>>>>>> that many PGs at a time, so while there will be a lot of data
>>>>>>>>> movement (50% because half of the data needs to go to a newly
>>>>>>>>> made PG, and on top of that the PGs per OSD will change, but
>>>>>>>>> the balancing can also now work better), it will not affect
>>>>>>>>> the whole cluster if you increase by, say, 8 pg_nums at a
>>>>>>>>> time. As per the other reply, if you bump the number by a
>>>>>>>>> small amount - wait for HEALTH_OK - bump some more, it will
>>>>>>>>> take a lot of calendar time, but have rather small impact. My
>>>>>>>>> view of it is basically that this will be far less impactful
>>>>>>>>> than if you lose a whole OSD, and hopefully your cluster can
>>>>>>>>> survive this event, so it should be able to handle a slow
>>>>>>>>> trickle of PG splits too.
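>>>>>>>>>
>>>>>>>>> A minimal version of that loop, with an invented pool name and
>>>>>>>>> numbers:
>>>>>>>>>
>>>>>>>>>   ceph osd pool get bigpool pg_num        # say it returns 2048
>>>>>>>>>   ceph osd pool set bigpool pg_num 2056   # +8
>>>>>>>>>   # wait for HEALTH_OK, then repeat with the next +8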
>>>>>>>>>
>>>>>>>>> You can set a target number for the pool and let the
>>>>>>>>> autoscaler run a few splits at a time. There are some settings
>>>>>>>>> to look at for how aggressive the autoscaler will be, so it
>>>>>>>>> doesn't have to be manual/scripted, but it's not very hard to
>>>>>>>>> script it if you are unsure about the amount of work the
>>>>>>>>> autoscaler will start at any given time.
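>>>>>>>>>
>>>>>>>>> On Nautilus or newer that can be as simple as (pool name and
>>>>>>>>> target invented):
>>>>>>>>>
>>>>>>>>>   ceph osd pool set bigpool pg_num 4096
>>>>>>>>>   # the mons record this as pg_num_target and the mgr walks
>>>>>>>>>   # the actual pg_num up gradually, throttled by
>>>>>>>>>   # target_max_misplaced_ratio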
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> May the most significant bit of your life be positive.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Gregory Orange
>>>>>>>
>>>>>>> System Administrator, Scientific Platforms Team
>>>>>>> Pawsey Supercomputing Centre, CSIRO
>>>>>>
>>>>>>
>>
>>