Hi Andras.
Assuming that you've already tightened the
mgr/balancer/upmap_max_deviation to 1, I suspect that this cluster
already has too many upmaps.
Last time I checked, the balancer implementation is not able to
improve a pg-upmap-items entry if one already exists for a PG. (It can
add an OSD mapping pair to a PG, but not change an existing pair from
one OSD to another.)
So I think that what happens in this case is the balancer gets stuck
in a sort of local minimum in the overall optimization.
It can therefore help to simply remove some upmaps, and then wait for
the balancer to do a better job when it re-creates new entries for
those PGs.
And there's usually some low hanging fruit -- you can start by
removing pg-upmap-items which are mapping PGs away from the least full
OSDs. (Those upmap entries are making the least full OSDs even *less*
full.)
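To make the idea concrete, here's a minimal sketch (not the cernceph
script itself) of that heuristic: find pg-upmap-items entries whose
source OSD is among the least-full OSDs, since removing those entries
lets data flow back onto them. The sample dicts below are hypothetical
stand-ins for the JSON you'd get from `ceph osd df -f json` and
`ceph osd dump -f json`:

```python
# Hypothetical fragment of `ceph osd df -f json`: per-OSD utilization.
osd_df = {"nodes": [
    {"id": 0, "utilization": 91.0},   # nearly full
    {"id": 1, "utilization": 55.0},
    {"id": 2, "utilization": 48.0},   # least full
]}

# Hypothetical fragment of `ceph osd dump -f json`: existing upmap entries.
pg_upmap_items = [
    {"pgid": "1.a", "mappings": [{"from": 2, "to": 0}]},  # maps data AWAY from least-full OSD 2
    {"pgid": "1.b", "mappings": [{"from": 0, "to": 1}]},  # fine: moves data off the fullest OSD
]

def candidates_to_remove(osd_df, pg_upmap_items, n_least_full=1):
    """Return pgids of upmap entries that move data away from one of the
    n least-full OSDs -- the "low hanging fruit" to consider removing."""
    by_util = sorted(osd_df["nodes"], key=lambda n: n["utilization"])
    least_full = {n["id"] for n in by_util[:n_least_full]}
    return [e["pgid"] for e in pg_upmap_items
            if any(m["from"] in least_full for m in e["mappings"])]

for pgid in candidates_to_remove(osd_df, pg_upmap_items):
    # Review this list by hand before running anything -- each command
    # removes the whole pg-upmap-items entry for that PG.
    print(f"ceph osd rm-pg-upmap-items {pgid}")
```

With this toy data it would flag only PG 1.a. In practice you'd feed it
the real JSON and widen n_least_full a bit, removing entries gradually
and letting the balancer re-place those PGs between batches.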
We have a script for that:
https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/rm-upmaps-…
It's pretty hacky and I don't use it often, so please use it with
caution -- you can run it and review which upmaps it would remove.
Hope this helps,
Dan
On Fri, Apr 2, 2021 at 10:18 AM Andras Pataki
<apataki(a)flatironinstitute.org> wrote:
>
> Dear ceph users,
>
> On one of our clusters I have some difficulties with the upmap
> balancer. We started with a reasonably well balanced cluster (using the
> balancer in upmap mode). After a node failure, we crush reweighted all
> the OSDs of the node to take it out of the cluster - and waited for the
> cluster to rebalance. Obviously, this significantly changes the crush
> map - hence the nice balance created by the balancer was gone. The
> recovery mostly completed - but some of the OSDs became too full - so we
> ended up with a few PGs that were backfill_toofull. The cluster has
> plenty of space (overall perhaps 65% full), only a few OSDs are >90% (we
> have backfillfull_ratio at 92%). The balancer refuses to change
> anything since the cluster is not clean. Yet - the cluster can't become
> clean without a few upmaps to help the top 3 or 4 most full OSDs.
>
> I would think this is a fairly common situation - trying to recover
> after some failure. Are there any recommendations on how to proceed?
> Obviously I can manually find and insert upmaps - but for a large
> cluster with tens of thousands of PGs, that isn't too practical. Is
> there a way to tell the balancer to still do something even though some
> PGs are undersized (with a quick look at the python module - I didn't
> see any)?
>
> The cluster is on Nautilus 14.2.15.
>
> Thanks,
>
> Andras
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io