The fragmentation score calculation was indeed improved recently, see
https://github.com/ceph/ceph/pull/49885
And yes, one can see some fragmentation in the allocations for the first two
OSDs. It doesn't look as dramatic as the fragmentation scores suggest, though.
Additionally, you might want to collect a free extents dump using the 'ceph
tell osd.N bluestore allocator dump block' command and do more analysis on
that data.
E.g. I'd recommend building something like a histogram showing the number of
chunks in each size range:
[1-4K]: N1 chunks
(4K-16K]: N2 chunks
(16K-64K]: N3 chunks
...
(16M-inf): Nn chunks
This should be even more informative about the fragmentation state,
particularly if observed over time.
Looking for volunteers to write a script for building such a histogram... ;)
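To get the ball rolling, here is a rough sketch of what such a script could
look like in Python. It assumes the allocator dump is a JSON object with an
'extents' array whose entries carry a 'length' field (possibly as a hex
string); those field names are an assumption and may differ between releases,
so adjust them to whatever your dump actually contains:

#!/usr/bin/env python3
# Rough sketch: read the JSON from
#   ceph tell osd.N bluestore allocator dump block
# on stdin and print a histogram of free extent sizes.
# NOTE: the 'extents'/'length' field names are assumptions and may differ
# between Ceph releases; adjust them to match your dump.
import json
import sys

def to_int(v):
    # lengths may appear as plain integers or as hex strings like "0x10000"
    return v if isinstance(v, int) else int(str(v), 0)

def human(n):
    return "{}K".format(n // 1024) if n < (1 << 20) else "{}M".format(n >> 20)

def main():
    dump = json.load(sys.stdin)
    lengths = [to_int(e["length"]) for e in dump.get("extents", [])]

    # bucket upper bounds: 4K, 16K, 64K, 256K, 1M, 4M, 16M
    # (anything above 16M falls into the final open-ended bucket)
    bounds = [4096 << (2 * i) for i in range(7)]
    counts = [0] * (len(bounds) + 1)
    for length in lengths:
        for i, bound in enumerate(bounds):
            if length <= bound:
                counts[i] += 1
                break
        else:
            counts[-1] += 1

    labels = ["[1-4K]"]
    labels += ["({}-{}]".format(human(bounds[i - 1]), human(bounds[i]))
               for i in range(1, len(bounds))]
    labels += ["(16M-inf)"]
    for label, count in zip(labels, counts):
        print("{:>12}: {} chunks".format(label, count))

if __name__ == "__main__":
    main()

Fed with 'ceph tell osd.N bluestore allocator dump block | python3
free-extent-histogram.py' (the file name is just a placeholder), it prints one
line per size bucket, which makes it easy to compare the distribution before
and after a batch of data movement.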
I'm up for that, once I get through some other cluster maintenance I
need to deal with first :)
Backfill is almost done and I was finally able to destroy two OSDs; I will
be doing a bunch of restructuring in the coming weeks. I can probably
get the script done partway through doing this, so I can see how the
distributions evolve over a bunch of data movement.
Thanks,
Igor
On 28/05/2023 08:31, Hector Martin wrote:
> So chiming in, I think something is definitely wrong with at *least* the
> frag score.
>
> Here's what happened so far:
>
> 1. I had 8 OSDs (all 8T HDDs)
> 2. I added 2 more (osd.0,1) , with Quincy defaults
> 3. I marked 2 old ones out (the ones that seemed to be struggling the
> most with IOPS)
> 4. I added 2 more (osd.2,3), but this time I had previously set
> bluestore_min_alloc_size_hdd to 16K as an experiment
>
> This has all happened in the space of a ~week. That means there was data
> movement into the first 2 new OSDs, then before that completed I added 2
> new OSDs. So I would expect some data thrashing on the first 2, but
> nothing extreme.
>
> The fragmentation scores for the 4 new OSDs are, respectively:
>
> 0.746, 0.835, 0.160, 0.067
>
> That seems ridiculous for the first two; it's only been a week. The
> newest two seem in better shape, though those mostly would've seen only
> data moving in, not out. The rebalance isn't done yet, but it's almost
> done and all 4 OSDs have a similar fullness level at this time.
>
> Looking at alloc stats:
>
> ceph-0) allocation stats probe 6: cnt: 2219302 frags: 2328003 size:
> 1238454677504
> ceph-0) probe -1: 1848577, 1970325, 1022324588544
> ceph-0) probe -2: 848301, 862622, 505329963008
> ceph-0) probe -6: 2187448, 2187448, 1055241568256
> ceph-0) probe -14: 0, 0, 0
> ceph-0) probe -22: 0, 0, 0
>
> ceph-1) allocation stats probe 6: cnt: 1882396 frags: 1947321 size:
> 1054829641728
> ceph-1) probe -1: 2212293, 2345923, 1215418728448
> ceph-1) probe -2: 1471623, 1525498, 826984652800
> ceph-1) probe -6: 2095298, 2095298, 1000065933312
> ceph-1) probe -14: 0, 0, 0
> ceph-1) probe -22: 0, 0, 0
>
> ceph-2) allocation stats probe 3: cnt: 2760200 frags: 2760200 size:
> 1554513903616
> ceph-2) probe -1: 2584046, 2584046, 1498140393472
> ceph-2) probe -3: 1696921, 1696921, 869424496640
> ceph-2) probe -7: 0, 0, 0
> ceph-2) probe -11: 0, 0, 0
> ceph-2) probe -19: 0, 0, 0
>
> ceph-3) allocation stats probe 3: cnt: 2544818 frags: 2544818 size:
> 1432225021952
> ceph-3) probe -1: 2688015, 2688015, 1515260739584
> ceph-3) probe -3: 1086875, 1086875, 622025424896
> ceph-3) probe -7: 0, 0, 0
> ceph-3) probe -11: 0, 0, 0
> ceph-3) probe -19: 0, 0, 0
>
> So OSDs 2 and 3 (the latest ones to be added, note that these 4 new OSDs
> are 0-3 since those IDs were free) are in good shape, but 0 and 1 are
> already suffering from at least some fragmentation of objects, which is
> a bit worrying when they are only ~70% full right now and only a week old.
>
> I did delete a couple million small objects during the rebalance to try
> to reduce load (I had some nasty directories), but that was cumulatively
> only about 60GB of data. So while that could explain a high frag score
> if there are now a million little holes in the free space map of the
> OSDs (how is it calculated?), it should not actually cause new data
> moving in to end up fragmented since there should be plenty of
> unfragmented free space going around still.
>
> I am now restarting OSDs 0 and 1 to see whether that makes the frag
> score go down over time. I will do further analysis later with the raw
> bluestore free space map, since I still have a bunch of rebalancing and
> moving data around planned (I'm moving my cluster to new machines).
>
> On 26/05/2023 00.29, Igor Fedotov wrote:
>> Hi Hector,
>>
>> I can advise two tools for further fragmentation analysis:
>>
>> 1) One might want to use ceph-bluestore-tool's free-dump command to get
>> a list of free chunks for an OSD and try to analyze whether it's really
>> highly fragmented and lacks long enough extents. free-dump just returns
>> a list of extents in JSON format; I can take a look at the output if
>> it's shared...
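>> (On a stopped OSD this is typically invoked as something like
>> 'ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-N'; treat the
>> exact syntax as an assumption, it may vary between releases.)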
>>
>> 2) You might want to look for allocation probes in OSD logs and see how
>> fragmentation in allocated chunks has evolved.
>>
>> E.g.
>>
>> allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508>
>> probe -1: 35168547, 46401246, 1199516209152
>> probe -3: 27275094, 35681802, 200121712640
>> probe -5: 34847167, 52539758, 271272230912
>> probe -9: 44291522, 60025613, 523997483008
>> probe -17: 10646313, 10646313, 155178434560
>>
>> The first probe refers to the last day, while the others correspond to
>> days (or rather probes) -1, -3, -5, -9, -17.
>>
>> The 'cnt' column represents the number of allocations performed in the
>> previous 24 hours and the 'frags' column shows the number of fragments in
>> the resulting allocations. So a significant mismatch between frags and cnt
>> might indeed indicate issues with high fragmentation.
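>>
>> A rough sketch of how one might pull those numbers out of an OSD log and
>> compute the fragments-per-allocation ratio (the regex is inferred from the
>> sample lines above, so treat the exact format as an assumption):
>>
>> import re
>> import sys
>>
>> # matches e.g. "allocation stats probe 33: cnt: 8148921 frags: 10958186 size: ..."
>> pat = re.compile(r"allocation stats probe (\d+): cnt: (\d+) frags: (\d+) size: (\d+)")
>> for line in sys.stdin:
>>     m = pat.search(line)
>>     if not m:
>>         continue
>>     probe, cnt, frags, _size = (int(x) for x in m.groups())
>>     ratio = frags / cnt if cnt else 0.0
>>     print("probe {}: {} allocs, {} frags, {:.2f} frags/alloc".format(
>>         probe, cnt, frags, ratio))
>>
>> Run against a log with 'python3 frag_ratio.py < /var/log/ceph/ceph-osd.N.log'
>> (script name and log path are placeholders); a ratio well above 1.0 is the
>> mismatch described above.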
>>
>> Apart from retrospective analysis, you might also want to check how OSD
>> behavior changes after a reboot, e.g. whether a rebooted OSD produces less
>> fragmentation... which in turn might indicate some issues with the
>> BlueStore allocator...
>>
>> Just FYI: the allocation probe printing interval is controlled by the
>> bluestore_alloc_stats_dump_interval parameter.
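>> (For example, 'ceph config set osd bluestore_alloc_stats_dump_interval 3600'
>> should switch from daily to hourly probes, assuming the default interval is
>> one day; 'ceph config get osd.N bluestore_alloc_stats_dump_interval' shows
>> the current value on your release.)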
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>>
>> On 24/05/2023 17:18, Hector Martin wrote:
>>> On 24/05/2023 22.07, Mark Nelson wrote:
>>>> Yep, bluestore fragmentation is an issue. It's sort of a natural result
>>>> of using copy-on-write and never implementing any kind of
>>>> defragmentation scheme. Adam and I have been talking about doing it
>>>> now, probably piggybacking on scrub or other operations that are
>>>> already reading all of the extents for an object anyway.
>>>>
>>>>
>>>> I wrote a very simple prototype for clone to speed up the rbd mirror use
>>>> case here:
>>>>
>>>>
>>>> https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5…
>>>>
>>>>
>>>> Adam ended up going the extra mile and completely changed how shared
>>>> blobs work, which probably eliminates the need to do defrag on clone
>>>> anymore from an rbd-mirror perspective, but I think we still need to
>>>> identify any times we are doing full object reads of fragmented objects
>>>> and consider defragmenting at that time. It might be clone, or scrub,
>>>> or other things, but the point is that if we are already doing most of
>>>> the work (seeks on HDD especially!) the extra cost of a large write to
>>>> clean it up isn't that bad, especially if we are doing it over the
>>>> course of months or years and can help keep freespace less fragmented.
>>> Note that my particular issue seemed to specifically be free space
>>> fragmentation. I don't use RBD mirror and I would not *expect* most of
>>> my cephfs use cases to lead to any weird cow/fragmentation issues with
>>> objects other than those forced by the free space becoming fragmented
>>> (unless there is some weird pathological use case I'm hitting). Most of
>>> my write workloads are just copying files in bulk and incrementally
>>> writing out files.
>>>
>>> Would simply defragging objects during scrub/etc help with free space
>>> fragmentation itself? Those seem like two somewhat unrelated issues...
>>> note that if free space is already fragmented, you wouldn't even have a
>>> place to put down a defragmented object.
>>>
>>> Are there any stats I can look at to figure out how bad object and free
>>> space fragmentation is? It would be nice to have some clearer data
>>> beyond my hunch/deduction after seeing the I/O patterns and the sole
>>> fragmentation number :). Also would be interesting to get some kind of
>>> trace of the bluestore ops the OSD is doing, so I can find out whether
>>> it's doing something pathological that causes more fragmentation for
>>> some reason.
>>>
>>>> Mark
>>>>
>>>>
>>>> On 5/24/23 07:17, Hector Martin wrote:
>>>>> Hi,
>>>>>
>>>>> I've been seeing relatively large fragmentation numbers on all my OSDs:
>>>>>
>>>>> ceph daemon osd.13 bluestore allocator score block
>>>>> {
>>>>> "fragmentation_rating": 0.77251526920454427
>>>>> }
>>>>>
>>>>> These aren't that old, as I recreated them all around July last year.
>>>>> They mostly hold CephFS data with erasure coding, with a mix of large
>>>>> and small files. The OSDs are at around 80%-85% utilization right now.
>>>>> Most of the data was written sequentially when the OSDs were created
>>>>> (I rsynced everything from a remote backup). Since then more data has
>>>>> been added, but not particularly quickly.
>>>>>
>>>>> At some point I noticed pathologically slow writes, and I couldn't
>>>>> figure out what was wrong. Eventually I did some block tracing and
>>>>> noticed the I/Os were very small, even though CephFS-side I was just
>>>>> writing one large file sequentially, and that's when I stumbled upon
>>>>> the free space fragmentation problem. Indeed, deleting some large
>>>>> files opened up some larger free extents and resolved the problem,
>>>>> but only until those get filled up and I'm back to fragmented tiny
>>>>> extents. So effectively I'm stuck at the current utilization, as
>>>>> trying to fill them up any more just slows down to an absolute crawl.
>>>>>
>>>>> I'm adding a few more OSDs and plan on doing the dance of removing
>>>>> one OSD at a time and replacing it with another one to hopefully
>>>>> improve the situation, but obviously this is going to take forever.
>>>>>
>>>>> Is there any plan for offering a defrag tool of some sort for bluestore?
>>>>>
>>>>> - Hector
>>> - Hector
> - Hector
>