Hi Kevin,
According to the probes you shared there were no fragmented allocations -
cnt equals frags for every probe. And the average allocation request is
pretty large - more than 1.5 MB for the probes I checked.
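(A quick back-of-the-envelope check of that average, using the numbers from
your probe 107 line - just a throwaway Python snippet:)

```python
# Numbers taken from the probe line:
# "allocation stats probe 107: cnt: 17991 frags: 17991 size: 32016760832"
cnt, frags, size = 17991, 17991, 32016760832

# cnt == frags -> no allocation request was split into multiple fragments
assert cnt == frags

avg = size / cnt  # average allocation request, in bytes
print(f"avg alloc request: {avg / (1 << 20):.2f} MiB")  # prints ~1.70 MiB
```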
So to me it looks like your disk fragmentation (at least for new
allocations) is of little significance at the moment - it doesn't affect
write requests.
As I mentioned before, for further analysis you might want to run
through the output of the 'ceph tell osd.N bluestore allocator dump
block' command.
Here is my recent commit that builds a free space histogram from it:
https://github.com/ceph/ceph/pull/51820
You could use it as an example and write a script that does the same
(just to avoid all the tricks with building/upgrading Ceph binaries), or
backport it and build a custom Ceph image.
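If you go the script route, a minimal sketch could look like the one below.
Note that the field names ("extents", "length") and the hex-string encoding
are assumptions on my side - adjust them to whatever JSON your Ceph version
actually emits:

```python
from collections import Counter

def parse_len(v):
    """Extent lengths may appear as decimal ints or "0x..." hex strings."""
    return v if isinstance(v, int) else int(v, 0)

def free_space_histogram(dump):
    """Count free extents per power-of-two size bucket."""
    hist = Counter()
    for ext in dump.get("extents", []):
        length = parse_len(ext["length"])
        if length > 0:
            # bucket 4096 holds extents of 2049..4096 bytes, etc.
            hist[1 << (length - 1).bit_length()] += 1
    return hist

# Tiny demo with a fabricated dump; the real input would be
# json.load() on the output of the allocator dump command.
demo = {"extents": [{"offset": "0x0", "length": "0x1000"},
                    {"offset": "0x2000", "length": 6000}]}
print(sorted(free_space_histogram(demo).items()))  # [(4096, 1), (8192, 1)]
```

For real data, replace the demo dict with json.load(sys.stdin) and pipe
the dump into the script.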
Thanks,
Igor
On 31/05/2023 01:11, Fox, Kevin M wrote:
> Ok, I restarted it May 25th, ~11:30, let it run over the long weekend and just
checked on it. Data attached.
>
> May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-21T18:24:34.040+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 107: cnt: 17991 frags: 17991 size: 32016760832
> May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-21T18:24:34.040+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 20267, 20267, 39482425344
> May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-21T18:24:34.040+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-3: 19737, 19737, 37299027968
> May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-21T18:24:34.040+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-7: 18498, 18498, 32395558912
> May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-21T18:24:34.040+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-11: 20373, 20373, 35302801408
> May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-21T18:24:34.040+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-27: 19072, 19072, 33645854720
> May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-22T18:24:34.057+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 108: cnt: 24594 frags: 24594 size: 56951898112
> May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-22T18:24:34.057+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 17991, 17991, 32016760832
> May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-22T18:24:34.057+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-2: 20267, 20267, 39482425344
> May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-22T18:24:34.057+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-4: 19737, 19737, 37299027968
> May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-22T18:24:34.057+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-12: 20373, 20373, 35302801408
> May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-22T18:24:34.057+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-28: 19072, 19072, 33645854720
> May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-23T18:24:34.095+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 109: cnt: 24503 frags: 24503 size: 58141900800
> May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-23T18:24:34.095+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 24594, 24594, 56951898112
> May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-23T18:24:34.095+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-3: 20267, 20267, 39482425344
> May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-23T18:24:34.095+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-5: 19737, 19737, 37299027968
> May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-23T18:24:34.095+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-13: 20373, 20373, 35302801408
> May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-23T18:24:34.095+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-29: 19072, 19072, 33645854720
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 110: cnt: 27637 frags: 27637 size: 63777406976
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 24503, 24503, 58141900800
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-2: 24594, 24594, 56951898112
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-6: 19737, 19737, 37299027968
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-14: 20373, 20373, 35302801408
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-30: 19072, 19072, 33645854720
> May 25 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-25T18:24:34.151+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 111: cnt: 22136 frags: 22136 size: 48656023552
> May 25 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-25T18:24:34.151+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 27637, 27637, 63777406976
> May 25 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-25T18:24:34.151+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-3: 24594, 24594, 56951898112
> May 25 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-25T18:24:34.151+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-7: 19737, 19737, 37299027968
> May 25 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-25T18:24:34.151+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-15: 20373, 20373, 35302801408
> May 25 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-25T18:24:34.151+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-31: 19072, 19072, 33645854720
> May 26 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-26T18:35:22.701+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 0: cnt: 21986 frags: 21986 size: 47407562752
> May 26 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-26T18:35:22.701+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 0, 0, 0
> May 26 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-26T18:35:22.701+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-2: 0, 0, 0
> May 26 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-26T18:35:22.701+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-4: 0, 0, 0
> May 26 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-26T18:35:22.701+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-8: 0, 0, 0
> May 26 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-26T18:35:22.701+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-16: 0, 0, 0
> May 27 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-27T18:35:22.740+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 1: cnt: 21145 frags: 21145 size: 44858146816
> May 27 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-27T18:35:22.740+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 21986, 21986, 47407562752
> May 27 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-27T18:35:22.740+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-3: 0, 0, 0
> May 27 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-27T18:35:22.740+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-5: 0, 0, 0
> May 27 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-27T18:35:22.740+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-9: 0, 0, 0
> May 27 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-27T18:35:22.740+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-17: 0, 0, 0
> May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-28T18:35:22.790+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 2: cnt: 17987 frags: 17987 size: 32446676992
> May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-28T18:35:22.790+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 21145, 21145, 44858146816
> May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-28T18:35:22.790+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-2: 21986, 21986, 47407562752
> May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-28T18:35:22.790+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-6: 0, 0, 0
> May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-28T18:35:22.790+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-10: 0, 0, 0
> May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-28T18:35:22.790+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-18: 0, 0, 0
> May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-29T18:35:22.815+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 3: cnt: 17509 frags: 17509 size: 31015436288
> May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-29T18:35:22.815+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 17987, 17987, 32446676992
> May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-29T18:35:22.815+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-3: 21986, 21986, 47407562752
> May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-29T18:35:22.815+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-7: 0, 0, 0
> May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-29T18:35:22.815+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-11: 0, 0, 0
> May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-29T18:35:22.815+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-19: 0, 0, 0
> May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-30T18:35:22.826+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 4: cnt: 21016 frags: 21016 size: 45432438784
> May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-30T18:35:22.826+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 17509, 17509, 31015436288
> May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-30T18:35:22.826+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-2: 17987, 17987, 32446676992
> May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-30T18:35:22.826+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-4: 21986, 21986, 47407562752
> May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-30T18:35:22.826+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-12: 0, 0, 0
> May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: debug
2023-05-30T18:35:22.826+0000 7fe190013700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-20: 0, 0, 0
>
> Thanks,
> Kevin
>
> ________________________________________
> From: Fox, Kevin M <Kevin.Fox(a)pnnl.gov>
> Sent: Thursday, May 25, 2023 9:36 AM
> To: Igor Fedotov; Hector Martin; ceph-users(a)ceph.io
> Subject: Re: [ceph-users] Re: BlueStore fragmentation woes
>
> Ok, I'm gathering the "allocation stats probe" stuff. I'm not sure I follow
what you mean by the historic probes. Just:
> | egrep "allocation stats probe|probe" ?
>
> That gets something like:
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183)
allocation stats probe 110: cnt: 27637 frags: 27637 size: 63777406976
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-1: 24503, 24503, 58141900800
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-2: 24594, 24594, 56951898112
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-6: 19737, 19737, 37299027968
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-14: 20373, 20373, 35302801408
> May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug
2023-05-24T18:24:34.105+0000 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe
-30: 19072, 19072, 33645854720
>
> if that is the right query, then I'll gather the metrics, restart and gather some
more after and let you know.
>
> Thanks,
> Kevin
>
> ________________________________________
> From: Igor Fedotov <igor.fedotov(a)croit.io>
> Sent: Thursday, May 25, 2023 9:29 AM
> To: Fox, Kevin M; Hector Martin; ceph-users(a)ceph.io
> Subject: Re: [ceph-users] Re: BlueStore fragmentation woes
>
> Just run through the available logs for a specific OSD (one you suspect
> suffers from high fragmentation) and collect all the allocation stats probes
> you can find (the "allocation stats probe" string is a perfect grep pattern;
> please append the lines with historic probes following the day-0 line as well.
> Given this is printed once per day there wouldn't be too many).
>
> Then do an OSD restart and wait a couple more days. Would the allocation
> stats then show a much smaller disparity between the cnt and frags columns?
>
> Is a similar pattern (eventual degradation in stats prior to restart
> and severe improvement afterwards) observed for other OSDs?
>
>
> On 25/05/2023 19:20, Fox, Kevin M wrote:
>> If you can give me instructions on what you want me to gather before the restart
and after restart I can do it. I have some running away right now.
>>
>> Thanks,
>> Kevin
>>
>> ________________________________________
>> From: Igor Fedotov <igor.fedotov(a)croit.io>
>> Sent: Thursday, May 25, 2023 9:17 AM
>> To: Fox, Kevin M; Hector Martin; ceph-users(a)ceph.io
>> Subject: Re: [ceph-users] Re: BlueStore fragmentation woes
>>
>> Perhaps...
>>
>> I don't like the idea of using the fragmentation score as a real index. IMO
>> it's mostly a very imprecise first-pass marker to alert that
>> something might be wrong, not a real quantitative high-quality estimate.
>>
>> So in fact I'd like to see a series of allocation probes showing
>> eventual degradation without OSD restart and immediate severe
>> improvement after the restart.
>>
>> Can you try to collect something like that? Would the same behavior
>> persist with an alternative allocator?
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 25/05/2023 18:41, Fox, Kevin M wrote:
>>> Is this related to
https://tracker.ceph.com/issues/58022 ?
>>>
>>> We still see run away osds at times, somewhat randomly, that causes runaway
fragmentation issues.
>>>
>>> Thanks,
>>> Kevin
>>>
>>> ________________________________________
>>> From: Igor Fedotov <igor.fedotov(a)croit.io>
>>> Sent: Thursday, May 25, 2023 8:29 AM
>>> To: Hector Martin; ceph-users(a)ceph.io
>>> Subject: [ceph-users] Re: BlueStore fragmentation woes
>>>
>>>
>>> Hi Hector,
>>>
>>> I can advise two tools for further fragmentation analysis:
>>>
>>> 1) One might want to use ceph-bluestore-tool's free-dump command to get
>>> a list of free chunks for an OSD and try to analyze whether it's really
>>> highly fragmented and lacks long enough extents. free-dump just returns
>>> a list of extents in json format, I can take a look to the output if
>>> shared...
>>>
>>> 2) You might want to look for allocation probs in OSD logs and see how
>>> fragmentation in allocated chunks has evolved.
>>>
>>> E.g.
>>>
>>> allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508
>>> probe -1: 35168547, 46401246, 1199516209152
>>> probe -3: 27275094, 35681802, 200121712640
>>> probe -5: 34847167, 52539758, 271272230912
>>> probe -9: 44291522, 60025613, 523997483008
>>> probe -17: 10646313, 10646313, 155178434560
>>>
>>> The first probe refers to the last day, while the others correspond to
>>> days (or rather probes) -1, -3, -5, -9, -17.
>>>
>>> The 'cnt' column represents the number of allocations performed in the
>>> previous 24 hours and the 'frags' one shows the number of fragments in the
>>> resulting allocations. So a significant mismatch between frags and cnt
>>> might indeed indicate high fragmentation.
>>>
>>> Apart from retrospective analysis you might also want to see how OSD
>>> behavior changes after a reboot - e.g. whether a rebooted OSD produces less
>>> fragmentation... which in turn might indicate some issues with the
>>> BlueStore allocator.
>>>
>>> Just FYI: allocation probe printing interval is controlled by
>>> bluestore_alloc_stats_dump_interval parameter.
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>>
>>> On 24/05/2023 17:18, Hector Martin wrote:
>>>> On 24/05/2023 22.07, Mark Nelson wrote:
>>>>> Yep, bluestore fragmentation is an issue. It's sort of a natural
result
>>>>> of using copy-on-write and never implementing any kind of
>>>>> defragmentation scheme. Adam and I have been talking about doing it
>>>>> now, probably piggybacking on scrub or other operations that are
>>>>> already reading all of the extents for an object anyway.
>>>>>
>>>>>
>>>>> I wrote a very simple prototype for clone to speed up the rbd mirror
use
>>>>> case here:
>>>>>
>>>>>
https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5…
>>>>>
>>>>>
>>>>> Adam ended up going the extra mile and completely changed how shared
>>>>> blobs work, which probably eliminates the need to do defrag on clone
>>>>> anymore from an rbd-mirror perspective, but I think we still need to
>>>>> identify any times we are doing full object reads of fragmented
objects
>>>>> and consider defragmenting at that time. It might be clone, or
scrub,
>>>>> or other things, but the point is that if we are already doing most
of
>>>>> the work (seeks on HDD especially!) the extra cost of a large write
to
>>>>> clean it up isn't that bad, especially if we are doing it over
the
>>>>> course of months or years and can help keep freespace less
fragmented.
>>>> Note that my particular issue seemed to specifically be free space
>>>> fragmentation. I don't use RBD mirror and I would not *expect* most
of
>>>> my cephfs use cases to lead to any weird cow/fragmentation issues with
>>>> objects other than those forced by the free space becoming fragmented
>>>> (unless there is some weird pathological use case I'm hitting). Most
of
>>>> my write workloads are just copying files in bulk and incrementally
>>>> writing out files.
>>>>
>>>> Would simply defragging objects during scrub/etc help with free space
>>>> fragmentation itself? Those seem like two somewhat unrelated issues...
>>>> note that if free space is already fragmented, you wouldn't even have
a
>>>> place to put down a defragmented object.
>>>>
>>>> Are there any stats I can look at to figure out how bad object and free
>>>> space fragmentation is? It would be nice to have some clearer data
>>>> beyond my hunch/deduction after seeing the I/O patterns and the sole
>>>> fragmentation number :). Also would be interesting to get some kind of
>>>> trace of the bluestore ops the OSD is doing, so I can find out whether
>>>> it's doing something pathological that causes more fragmentation for
>>>> some reason.
>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On 5/24/23 07:17, Hector Martin wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been seeing relatively large fragmentation numbers on
all my OSDs:
>>>>>>
>>>>>> ceph daemon osd.13 bluestore allocator score block
>>>>>> {
>>>>>> "fragmentation_rating": 0.77251526920454427
>>>>>> }
>>>>>>
>>>>>> These aren't that old, as I recreated them all around July
last year.
>>>>>> They mostly hold CephFS data with erasure coding, with a mix of
large
>>>>>> and small files. The OSDs are at around 80%-85% utilization right
now.
>>>>>> Most of the data was written sequentially when the OSDs were
created (I
>>>>>> rsynced everything from a remote backup). Since then more data
has been
>>>>>> added, but not particularly quickly.
>>>>>>
>>>>>> At some point I noticed pathologically slow writes, and I
couldn't
>>>>>> figure out what was wrong. Eventually I did some block tracing
and
>>>>>> noticed the I/Os were very small, even though CephFS-side I was
just
>>>>>> writing one large file sequentially, and that's when I
stumbled upon the
>>>>>> free space fragmentation problem. Indeed, deleting some large
files
>>>>>> opened up some larger free extents and resolved the problem, but
only
>>>>>> until those get filled up and I'm back to fragmented tiny
extents. So
>>>>>> effectively I'm stuck at the current utilization, as trying
to fill them
>>>>>> up any more just slows down to an absolute crawl.
>>>>>>
>>>>>> I'm adding a few more OSDs and plan on doing the dance of
removing one
>>>>>> OSD at a time and replacing it with another one to hopefully
improve the
>>>>>> situation, but obviously this is going to take forever.
>>>>>>
>>>>>> Is there any plan for offering a defrag tool of some sort for
bluestore?
>>>>>>
>>>>>> - Hector
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>> - Hector