Hi,
I appreciate your message; it really sounds tough (9 months,
really?!). But thanks for the reassurance :-)
They don’t have any other options, so we’ll have to start that process
anyway, probably tomorrow. We’ll see how it goes…
Quoting Konstantin Shalygin <k0ste(a)k0ste.ru>:
Hi Eugene!
I had a case where PGs held millions of objects each, like this:
```
root@host# ./show_osd_pool_pg_usage.sh <pool> | less | head
id      used_mbytes         used_objects  omap_used_mbytes  omap_used_keys
------  ------------------  ------------  ----------------  --------------
17.c91  1213.2482748031616  2539152       0                 0
17.9ae  1213.3145303726196  2539025       0                 0
17.1a4  1213.432228088379   2539752       0                 0
17.8f4  1213.4958791732788  2539831       0                 0
17.f9   1213.5339193344116  2539837       0                 0
17.c9d  1213.564414024353   2540014       0                 0
17.89   1213.6339054107666  2540183       0                 0
17.412  1213.6393299102783  2539797       0                 0
```
And the OSDs were very small, around 1TB, with RocksDB at ~150-200GB.
What you see above are PGs that have already been split. Before the
split, one OSD was serving 64 PGs * 4M = 256,000,000 objects...
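As a sanity check on that figure, the per-OSD object count follows directly from the two numbers quoted above (roughly 64 PGs on one OSD, ~4M objects per PG):

```shell
# Back-of-the-envelope check of the per-OSD object load described above.
# Both inputs are the approximate values from the message, not measured data.
pgs_per_osd=64
objects_per_pg=4000000
total=$(( pgs_per_osd * objects_per_pg ))
echo "objects on one OSD: $total"   # prints 256000000
```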
The main problem was: to remove something, you first need to move
something, and while the move is in progress, nothing is deleted.
Also, deleting is slower than writing, so doing everything as one
task was impossible. I did it manually over 9 months. After the
splitting of some PGs was completed, I took other PGs away from
the most crowded (from the operator’s point of view, problematic)
OSDs. The pgremapper [1] helped me with this. As far as I remember,
in this way I went from 2048 to 3000 PGs, then I was able to set 4096
PGs, after which it became possible to move to 4TB NVMe.
Your case doesn't look that scary. First, your 85% usage means you
still have hundreds of free gigabytes on each 8TB OSD. If no new data
arrives, the backfill reservation mechanism is sufficient and after
some time the process will finish. On the other hand, I had a
replicated pool, so compared to EC my case was simpler.
In any case, it’s worth trying, making maximum use of
the upmap mechanism.
Good luck,
k
[1]
https://github.com/digitalocean/pgremapper
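A minimal sketch of the manual upmap workflow described above, using only stock Ceph commands (pgremapper automates much of this). The OSD ids and the PG id are placeholders, and `pg-upmap-items` requires Luminous or newer clients:

```shell
# Hedged sketch, not the author's exact procedure.
# 1. Spot the most-full OSD (say osd.12) and a less-full target (say osd.34):
ceph osd df
# 2. List PGs on the overfull OSD and pick one to move:
ceph pg ls-by-osd 12
# 3. Remap that PG's shard from osd.12 to osd.34 via upmap:
ceph osd pg-upmap-items 17.c91 12 34
# 4. To drop the manual mapping again later:
ceph osd rm-pg-upmap-items 17.c91
```

These commands only change PG placement; the actual data movement happens as ordinary backfill, throttled by the usual recovery settings.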
> On 9 Apr 2024, at 11:39, Eugen Block <eblock(a)nde.ag> wrote:
>
> I'm trying to estimate the possible impact when large PGs are
> split. Here's one example of such a PG:
>
> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]
>
> Their main application is RGW on EC (currently 1024 PGs on 240
> OSDs), 8TB HDDs backed by SSDs. There are 6 RGWs running behind
> HAProxies. It took me a while to convince them to do a PG split, and
> now they're trying to assess how big the impact could be. The
> fullest OSD is already at 85% usage, the least filled one at 59%,
> so there is definitely room for better balancing, which will be
> necessary until the new hardware arrives. The current distribution
> is around 100 PGs per OSD, which usually would be fine, but since
> the PGs are that large, a difference of only a few PGs has a huge
> impact on OSD utilization.
>
> I'm targeting 2048 PGs for that pool for now, and will probably do
> another split once the new hardware has been integrated.
> Any comments are appreciated!
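For reference, on Nautilus and later the split itself can be requested in one step; the cluster ramps pg_num up gradually, throttled by the mgr's misplaced-ratio option. A rough sketch, with `<pool>` and the 5% ratio as placeholders rather than values from this thread:

```shell
# Cap how much of the cluster may be misplaced/backfilling at once
# (used by the gradual pg_num ramp; 0.05 = 5%):
ceph config set mgr target_max_misplaced_ratio 0.05
# Request the split; pgp_num follows automatically on recent releases:
ceph osd pool set <pool> pg_num 2048
# Watch progress:
ceph -s
ceph osd pool get <pool> pg_num
```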