Yeah this looks fine. Please collect all of them for a given OSD.
Then restart the OSD, wait for more probes to come in (1-2 days) and collect those too.
A side note - in the attached probe I can't see any fragmentation at all
- the number of allocations is equal to the number of fragments, e.g.
cnt: 27637 frags: 27637
And the average requested chunk is 63777406976 / 27637 ≈ 2.3 MB.
I.e. on average every allocation was satisfied by a single contiguous
extent, which so far tells us nothing about the fragmentation...
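
(If useful, the average chunk size and the frags/cnt ratio can be pulled
from all probe lines at once - the log file name below is just an example:

  grep "allocation stats probe" ceph-osd.183.log |
    awk '{for(i=1;i<=NF;i++){if($i=="cnt:")c=$(i+1);if($i=="frags:")f=$(i+1);if($i=="size:")s=$(i+1)}
          printf "cnt=%d frags=%d avg_bytes=%.0f frags_per_alloc=%.2f\n",c,f,s/c,f/c}'
)
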
Thanks,
Igor
On 25/05/2023 19:36, Fox, Kevin M wrote:
> Ok, I'm gathering the "allocation stats probe" stuff. Not sure I follow what you mean by the historic probes. Just:
> | egrep "allocation stats probe|probe" ?
>
> That gets something like:
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) allocation stats probe 110: cnt: 27637 frags: 27637 size: 63777406976
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -1: 24503, 24503, 58141900800
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -2: 24594, 24594, 56951898112
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -6: 19737, 19737, 37299027968
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -14: 20373, 20373, 35302801408
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -30: 19072, 19072, 33645854720
>
> if that is the right query, then I'll gather the metrics, restart and gather some more after and let you know.
>
> Thanks,
> Kevin
>
> ________________________________________
> From: Igor Fedotov <igor.fedotov(a)croit.io>
> Sent: Thursday, May 25, 2023 9:29 AM
> To: Fox, Kevin M; Hector Martin; ceph-users(a)ceph.io
> Subject: Re: [ceph-users] Re: BlueStore fragmentation woes
>
> Just run through the available logs for a specific OSD (one you suspect
> suffers from high fragmentation) and collect all the allocation stats probes
> you can find (the "allocation stats probe" string is a perfect grep pattern;
> please also append the lines with historic probes that follow each day-0 line.
> Given this is printed once per day there won't be too many).
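>
> For example (log path is illustrative - adjust to your deployment, or pull
> the same lines from journalctl; -A 6 is just a guess at how many historic
> probe lines follow each match):
>
>    grep -A 6 "allocation stats probe" /var/log/ceph/ceph-osd.183.log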
>
> Then do an OSD restart and wait a couple more days. Do the allocation stats
> show a much smaller disparity between the cnt and frags columns after that?
>
> Is a similar pattern (gradual degradation in the stats prior to restart
> and a sharp improvement afterwards) observed for other OSDs?
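>
> (For the restart itself, something like 'ceph orch daemon restart osd.183'
> on a cephadm deployment, or restarting the OSD's systemd unit directly -
> the OSD id here is just an example.)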
>
>
> On 25/05/2023 19:20, Fox, Kevin M wrote:
>> If you can give me instructions on what you want me to gather before the restart
and after restart I can do it. I have some running away right now.
>>
>> Thanks,
>> Kevin
>>
>> ________________________________________
>> From: Igor Fedotov <igor.fedotov(a)croit.io>
>> Sent: Thursday, May 25, 2023 9:17 AM
>> To: Fox, Kevin M; Hector Martin; ceph-users(a)ceph.io
>> Subject: Re: [ceph-users] Re: BlueStore fragmentation woes
>>
>> Perhaps...
>>
>> I don't like the idea of using the fragmentation score as a real index. IMO
>> it's mostly a very imprecise first-pass marker to alert that
>> something might be wrong, but not a real quantitative, high-quality estimate.
>>
>> So in fact I'd like to see a series of allocation probes showing
>> gradual degradation without an OSD restart and an immediate, marked
>> improvement after the restart.
>>
>> Can you try to collect something like that? Would the same behavior
>> persist with an alternative allocator?
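>>
>> (Switching the allocator is a config change plus an OSD restart, e.g.:
>>
>>    ceph config set osd bluestore_allocator bitmap
>>
>> 'bitmap' is only an example here - check which allocators your release
>> supports before changing it.)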
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 25/05/2023 18:41, Fox, Kevin M wrote:
>>> Is this related to
https://tracker.ceph.com/issues/58022 ?
>>>
>>> We still see runaway OSDs at times, somewhat randomly, which cause runaway fragmentation issues.
>>>
>>> Thanks,
>>> Kevin
>>>
>>> ________________________________________
>>> From: Igor Fedotov <igor.fedotov(a)croit.io>
>>> Sent: Thursday, May 25, 2023 8:29 AM
>>> To: Hector Martin; ceph-users(a)ceph.io
>>> Subject: [ceph-users] Re: BlueStore fragmentation woes
>>>
>>> Hi Hector,
>>>
>>> I can suggest two tools for further fragmentation analysis:
>>>
>>> 1) One might want to use ceph-bluestore-tool's free-dump command to get
>>> a list of free chunks for an OSD and analyze whether it's really
>>> highly fragmented and lacks long-enough extents. free-dump just returns
>>> a list of extents in JSON format; I can take a look at the output if
>>> shared...
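>>>
>>> E.g. (the OSD has to be stopped first; the path is illustrative):
>>>
>>>    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-183 free-dump > osd.183-free.json
>>>
>>> The same tool also has a free-score command that prints the fragmentation
>>> score offline.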
>>>
>>> 2) You might want to look for allocation probes in OSD logs and see how
>>> fragmentation in allocated chunks has evolved.
>>>
>>> E.g.
>>>
>>> allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508
>>> probe -1: 35168547, 46401246, 1199516209152
>>> probe -3: 27275094, 35681802, 200121712640
>>> probe -5: 34847167, 52539758, 271272230912
>>> probe -9: 44291522, 60025613, 523997483008
>>> probe -17: 10646313, 10646313, 155178434560
>>>
>>> The first probe refers to the last day while others match days (or
>>> rather probes) -1, -3, -5, -9, -17
>>>
>>> The 'cnt' column represents the number of allocations performed in the
>>> previous 24 hours and the 'frags' column shows the number of fragments in
>>> the resulting allocations. So a significant mismatch between frags and cnt
>>> might indeed indicate issues with high fragmentation.
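>>>
>>> E.g. in the sample above, probe 33 (the last day) shows 10958186 / 8148921
>>> ≈ 1.34 fragments per allocation, while probe -17 shows exactly 1.0
>>> (10646313 / 10646313) - i.e. new allocations got more fragmented over time.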
>>>
>>> Apart from retrospective analysis you might also want to see how OSD
>>> behavior changes after a restart - e.g. whether a restarted OSD produces
>>> less fragmentation... which in turn might indicate some issues with the
>>> BlueStore allocator...
>>>
>>> Just FYI: allocation probe printing interval is controlled by
>>> bluestore_alloc_stats_dump_interval parameter.
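>>>
>>> E.g. to get probes more often than once a day (the value is in seconds):
>>>
>>>    ceph config set osd bluestore_alloc_stats_dump_interval 3600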
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>>
>>> On 24/05/2023 17:18, Hector Martin wrote:
>>>> On 24/05/2023 22.07, Mark Nelson wrote:
>>>>> Yep, bluestore fragmentation is an issue. It's sort of a natural result
>>>>> of using copy-on-write and never implementing any kind of
>>>>> defragmentation scheme. Adam and I have been talking about doing it
>>>>> now, probably piggybacking on scrub or other operations that are
>>>>> already reading all of the extents for an object anyway.
>>>>>
>>>>>
>>>>> I wrote a very simple prototype for clone to speed up the rbd mirror
>>>>> use case here:
>>>>>
>>>>> https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5…
>>>>>
>>>>> Adam ended up going the extra mile and completely changed how shared
>>>>> blobs work, which probably eliminates the need to do defrag on clone
>>>>> anymore from an rbd-mirror perspective, but I think we still need to
>>>>> identify any times we are doing full object reads of fragmented objects
>>>>> and consider defragmenting at that time. It might be clone, or scrub,
>>>>> or other things, but the point is that if we are already doing most of
>>>>> the work (seeks on HDD especially!) the extra cost of a large write to
>>>>> clean it up isn't that bad, especially if we are doing it over the
>>>>> course of months or years and can help keep free space less fragmented.
>>>> Note that my particular issue seemed to specifically be free space
>>>> fragmentation. I don't use RBD mirror and I would not *expect* most of
>>>> my CephFS use cases to lead to any weird CoW/fragmentation issues with
>>>> objects other than those forced by the free space becoming fragmented
>>>> (unless there is some weird pathological use case I'm hitting). Most of
>>>> my write workloads are just copying files in bulk and incrementally
>>>> writing out files.
>>>>
>>>> Would simply defragging objects during scrub/etc help with free space
>>>> fragmentation itself? Those seem like two somewhat unrelated issues...
>>>> note that if free space is already fragmented, you wouldn't even have a
>>>> place to put down a defragmented object.
>>>>
>>>> Are there any stats I can look at to figure out how bad object and free
>>>> space fragmentation is? It would be nice to have some clearer data
>>>> beyond my hunch/deduction after seeing the I/O patterns and the sole
>>>> fragmentation number :). Also would be interesting to get some kind of
>>>> trace of the bluestore ops the OSD is doing, so I can find out whether
>>>> it's doing something pathological that causes more fragmentation for
>>>> some reason.
>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On 5/24/23 07:17, Hector Martin wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been seeing relatively large fragmentation numbers on all my OSDs:
>>>>>>
>>>>>> ceph daemon osd.13 bluestore allocator score block
>>>>>> {
>>>>>> "fragmentation_rating": 0.77251526920454427
>>>>>> }
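>>>>>>
>>>>>> To grab that for every OSD on a host in one go - a rough sketch, assuming
>>>>>> the default admin socket location (it differs for containerized OSDs):
>>>>>>
>>>>>>    for s in /var/run/ceph/ceph-osd.*.asok; do
>>>>>>      ceph daemon $s bluestore allocator score block
>>>>>>    done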
>>>>>>
>>>>>> These aren't that old, as I recreated them all around July last year.
>>>>>> They mostly hold CephFS data with erasure coding, with a mix of large
>>>>>> and small files. The OSDs are at around 80%-85% utilization right now.
>>>>>> Most of the data was written sequentially when the OSDs were created (I
>>>>>> rsynced everything from a remote backup). Since then more data has been
>>>>>> added, but not particularly quickly.
>>>>>>
>>>>>> At some point I noticed pathologically slow writes, and I couldn't
>>>>>> figure out what was wrong. Eventually I did some block tracing and
>>>>>> noticed the I/Os were very small, even though CephFS-side I was just
>>>>>> writing one large file sequentially, and that's when I stumbled upon the
>>>>>> free space fragmentation problem. Indeed, deleting some large files
>>>>>> opened up some larger free extents and resolved the problem, but only
>>>>>> until those get filled up and I'm back to fragmented tiny extents. So
>>>>>> effectively I'm stuck at the current utilization, as trying to fill them
>>>>>> up any more just slows down to an absolute crawl.
>>>>>>
>>>>>> I'm adding a few more OSDs and plan on doing the dance of removing one
>>>>>> OSD at a time and replacing it with another one to hopefully improve the
>>>>>> situation, but obviously this is going to take forever.
>>>>>>
>>>>>> Is there any plan for offering a defrag tool of some sort for bluestore?
>>>>>>
>>>>>> - Hector
>>>> - Hector
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io