Hi,
The cluster is all-flash (NVMe), so the removal itself is fast, and the
extra space usage is in fact quite noticeable, even on Prometheus graphs.
I've also logged the raw space usage from `ceph -f json df`:
1) before the PG rebalance started, the space usage was
32724002664448 bytes
2) just before the rebalance finished it was 32883513622528 bytes (1920
of ~120k objects misplaced) = ~+149 GB
3) then it started to drop (not instantly, but fast) and stopped at
32785906380800 bytes = +58 GB over the original
I've repeated this several times and the behaviour is always the same:
first it copies the PG, then removes the old copy, but space usage
doesn't drop back to the original level. It's obviously not client I/O
either; it always happens exactly during the rebalance.
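
For reference, raw usage like the above can be polled with a minimal
sketch along these lines (the stats key name, total_used_raw_bytes vs
total_used_bytes, depends on the Ceph release, so treat that part as an
assumption):

    #!/usr/bin/env python3
    # Minimal sketch: log raw used bytes from `ceph -f json df` once a
    # minute. Assumes the key is total_used_raw_bytes (newer releases)
    # or total_used_bytes (older ones).
    import json, subprocess, time

    def raw_used_bytes():
        out = subprocess.check_output(['ceph', '-f', 'json', 'df'])
        stats = json.loads(out)['stats']
        return stats.get('total_used_raw_bytes',
                         stats.get('total_used_bytes'))

    baseline = raw_used_bytes()
    while True:
        used = raw_used_bytes()
        print('%d bytes raw used, %+.1f GiB vs baseline'
              % (used, (used - baseline) / 2.0**30))
        time.sleep(60)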
> Hi Vitaliy,
>
> just as a guess to verify:
>
> a while ago I observed a very long removal of a (pretty large) pool.
> It took several days to complete. The DB was on a spinner, which was
> one of the drivers of this slow behavior.
>
> Another one is the PG removal design, which enumerates up to 30
> entries max to fill a single removal batch, then executes it.
> Everything happens in a single "thread", so the process is pretty slow
> for millions of objects...
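
Just to check that I understand the pattern you describe, roughly like
this sketch (hypothetical helper names, not the actual OSD code; the
30-entry limit is taken from your description)?

    MAX_BATCH = 30  # per your description; the real constant may differ

    def remove_pg(pg):
        # List at most MAX_BATCH objects, delete them in one small
        # transaction, repeat in a single thread until the PG is empty.
        while True:
            batch = pg.list_objects(limit=MAX_BATCH)  # hypothetical helper
            if not batch:
                break
            pg.submit_transaction([('remove', obj) for obj in batch])
        # a million objects means roughly 33000 tiny serialized batches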
>
> During the removal the pool (read: PG) space remained in use and
> decreased slowly. Pretty high DB volume utilization was observed.
>
> I assume rebalance performs PG removal as well - maybe that's the
> case?
>
> Thanks,
>
> Igor
> On 3/26/2020 1:51 AM, Виталий Филиппов wrote:
>
>> Hi Igor,
>>
>> I think so because
>> 1) space usage increases after each rebalance, even when the same PG
>> is moved twice (!)
>> 2) I've used 4k min_alloc_size from the beginning
>>
>> One crazy hypothesis is that maybe Ceph allocates space for
>> uncompressed objects, then compresses them and leaks the
>> (uncompressed - compressed) difference. A really crazy idea, but who
>> knows o_O.
>>
>> I've already done a deep fsck and it didn't help... what else could
>> I check?...
>>
>> On March 26, 2020, 1:40:52 GMT+03:00, Igor Fedotov
>> <ifedotov(a)suse.de> wrote:
>>
>> Bluestore fsck/repair detects and fixes leaks at the Bluestore level,
>> but I doubt your issue is there. To be honest, I don't understand
>> from the overview why you think there are any leaks at all....
>>
>> Not sure whether this is relevant, but from my experience space
>> "leaks" are sometimes caused by the 64K allocation unit combined with
>> tons of small files or massive small EC overwrites. To verify whether
>> this is applicable you might want to inspect the bluestore
>> performance counters (bluestore_stored vs. bluestore_allocated) to
>> estimate your losses due to high allocation units. A significant
>> difference at multiple OSDs might indicate that the overhead is
>> caused by high allocation granularity. Compression might make this
>> analysis not that simple though...
>>
>> Thanks,
>> Igor
>>
>> On 3/26/2020 1:19 AM, vitalif(a)yourcmc.ru wrote:
>>
>> I have a question regarding this problem - is it possible to rebuild
>> the bluestore allocation metadata? I could try it to test whether
>> it's an allocator problem...
>>
>> Hi. I'm experiencing some kind of a space leak in Bluestore. I use
>> EC, compression and snapshots. First I thought that the leak was
>> caused by "virtual clones" (issue #38184). However, then I got rid of
>> most of the snapshots but continued to experience the problem.
>>
>> I suspected something when I added a new disk to the cluster and the
>> free space in the cluster didn't increase (!). So to track down the
>> issue I moved one PG (34.1a) using upmaps from osd11,6,0 to osd6,0,7
>> and then back to osd11,6,0. It ate +59 GB after the first move and
>> +51 GB after the second. As I understand it, this proves that it's
>> not #38184: devirtualization of virtual clones couldn't eat
>> additional space after the SECOND rebalance of the same PG.
>>
>> The PG has ~39000 objects, it is EC 2+1 and compression is enabled.
>> The compression ratio is about 2.7 in my setup, so the PG should use
>> ~90 GB of raw space.
>>
>> Before and after moving the PG I stopped osd0, mounted it with
>> ceph-objectstore-tool with debug bluestore = 20/20 and opened the
>> 34.1a***/all directory. It seems to dump all object extents into the
>> log in that case. So now I have two logs with all allocated extents
>> for osd0 (I hope all extents are there). I parsed both logs and added
>> all compressed blob sizes together ("get_ref Blob ... 0x20000 ->
>> 0x... compressed"). They add up to ~39 GB before the first rebalance
>> (34.1as2), ~22 GB after it (34.1as1) and ~41 GB again after the
>> second move (34.1as2), which doesn't indicate a leak. But the raw
>> space usage still exceeds the initial value by a lot. So it's clear
>> that there's a leak somewhere.
>>
>> What additional details can I provide for you to identify the bug? I
>> posted the same message in the issue tracker:
>> https://tracker.ceph.com/issues/44731
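
By the way, regarding the bluestore_stored vs. bluestore_allocated
suggestion quoted above: a quick sketch for eyeballing those counters
per OSD (it assumes the admin sockets are reachable locally and that
the counters live under the "bluestore" section of `perf dump`):

    #!/usr/bin/env python3
    # Sketch: compare allocated vs stored bytes per OSD to estimate the
    # allocation-granularity overhead. Counter and section names are
    # assumptions; compression skews the ratio, as noted above.
    import json, subprocess

    for osd in ('osd.0', 'osd.6', 'osd.7', 'osd.11'):  # adjust to local OSDs
        perf = json.loads(subprocess.check_output(
            ['ceph', 'daemon', osd, 'perf', 'dump']))
        bs = perf.get('bluestore', {})
        alloc = bs.get('bluestore_allocated', 0)
        stored = bs.get('bluestore_stored', 0)
        if stored:
            print('%s: allocated %d, stored %d, overhead %.1f%%'
                  % (osd, alloc, stored, 100.0 * (alloc - stored) / stored))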
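
And for completeness, the compressed-blob summation mentioned above can
be done with something like this (the regex is only guessed from the
quoted log fragment and may need adjusting against a real
"debug bluestore = 20/20" log):

    #!/usr/bin/env python3
    # Sketch: sum compressed blob sizes from a BlueStore debug log. The
    # line format is assumed from the fragment
    # "get_ref Blob ... 0x20000 -> 0x... compressed"; the second hex
    # value is taken as the compressed length.
    import re, sys

    pat = re.compile(r'get_ref Blob.*0x([0-9a-f]+) -> 0x([0-9a-f]+).*compressed')
    total = 0
    with open(sys.argv[1]) as log:
        for line in log:
            m = pat.search(line)
            if m:
                total += int(m.group(2), 16)
    print('%d bytes (~%.1f GiB) of compressed blobs' % (total, total / 2.0**30))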
>
> --
> With best regards,
> Vitaliy Filippov