Right, it would freeze the PGs in place at the time upmap-remapped is run.
You need to keep running the upmap balancer afterwards to restore the
optimized state.
I don't quite understand your question about a failed / replaced osd,
but yes it is relevant here.
Suppose you have osds 0, 1, 2, and 3, and osd.1 fails:
A hypothetical pg_upmap_items entry which mapped 0 to 1 *and* 2 to 3
would be removed when osd.1 is marked out. This would result in that PG
being remapped and data moved from 3 to 2. [1]
So if you run upmap-remapped just afterwards, it would create a new
pg_upmap_items entry mapping 2 to 3, making that PG active+clean again
immediately.
And then later when you recreate osd.1, CRUSH would recalculate the
placement, and after some iterations of the upmap balancer the original
pg_upmap_items entry would be recreated.
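For what it's worth, the lifecycle above can be sketched with a toy
model (the names and data structures are mine, purely illustrative, not
the real OSDMap implementation):

```python
# Toy model of a pg_upmap_items entry: a list of (from_osd, to_osd)
# pairs applied on top of the CRUSH-computed up set for one PG.

def apply_upmap(crush_up_set, upmap_pairs):
    """Return the up set after applying the upmap pairs."""
    remap = dict(upmap_pairs)
    return [remap.get(osd, osd) for osd in crush_up_set]

crush_up = [0, 2]                       # what CRUSH computes for this PG
pairs = [(0, 1), (2, 3)]                # the hypothetical entry
print(apply_upmap(crush_up, pairs))     # [1, 3]

# osd.1 is marked out: the whole entry is removed, so the PG falls
# back to CRUSH and data would move from 3 back to 2.
print(apply_upmap(crush_up, []))        # [0, 2]

# upmap-remapped recreates the pair that doesn't involve osd.1,
# pinning the PG's data on osd.3 and avoiding that backfill.
print(apply_upmap(crush_up, [(2, 3)]))  # [0, 3]
```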
-- Dan
[1] this hints at an optimization for the "clean upmaps"
functionality in OSDMap.cc -- if an osd is marked out it could
modify the relevant pg_upmap_items entries accordingly, rather than
remove them completely.
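A rough sketch of what that pruning could look like (a hypothetical
helper in the same toy model as above, not actual OSDMap.cc code):

```python
# Hypothetical optimization: when an osd is marked out, drop only the
# (from, to) pairs that reference it instead of discarding the whole
# pg_upmap_items entry.

def prune_upmap_items(upmap_pairs, out_osd):
    """Keep the pairs not involving the out osd; None if nothing remains."""
    kept = [(f, t) for (f, t) in upmap_pairs if out_osd not in (f, t)]
    return kept if kept else None

print(prune_upmap_items([(0, 1), (2, 3)], 1))  # [(2, 3)] -- 2->3 survives
print(prune_upmap_items([(0, 1)], 1))          # None -- drop entry entirely
```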
On Sun, May 3, 2020 at 10:27 PM Anthony D'Atri <aad(a)dreamsnake.net> wrote:
>
> Do I misunderstand this script, or does it not _quite_ do what’s desired here?
>
> I fully get the scenario of applying a full-cluster map to allow
> incremental topology changes.
>
> To be clear, if this is run to effectively freeze backfill during /
> following a traumatic event, it will freeze that adapted state, not
> strictly return to the pre-event state? And thus the pg-upmap balancer
> would still need to be run to revert to the prior state? And this
> would also hold true for a failed/replaced OSD?
>
>
> > On May 1, 2020, at 7:37 AM, Dylan McCulloch <dmc(a)unimelb.edu.au> wrote:
> >
> > Thanks Dan, that looks like a really neat method & script for a few
> > use-cases. We've actually used several of the scripts in that repo
> > over the years, so many thanks for sharing.
> >
> > That method will definitely help in the scenario in which a set of
> > unnecessary pg remaps have been triggered and can be caught early
> > and reverted. I'm still a little concerned about the possibility of,
> > for example, a brief network glitch occurring at night and then
> > waking up to a full, unbalanced cluster. Especially with NVMe
> > clusters that can rapidly remap and rebalance (and for which we also
> > have a greater impetus to squeeze out as much available capacity as
> > possible with upmap due to cost per TB). It's just a risk I hadn't
> > previously considered and was wondering if others have either run
> > into it or felt any need to plan around it.
> >
> > Cheers,
> > Dylan
> >
> >
> >> From: Dan van der Ster <dan(a)vanderster.com>
> >> Sent: Friday, 1 May 2020 5:53 PM
> >> To: Dylan McCulloch <dmc(a)unimelb.edu.au>
> >> Cc: ceph-users <ceph-users(a)ceph.io>
> >>
> >> Subject: Re: [ceph-users] upmap balancer and consequences of osds briefly marked out
> >>
> >> Hi,
> >>
> >> You're correct that all the relevant upmap entries are removed when an
> >> OSD is marked out.
> >> You can try to use this script which will recreate them and get the
> >> cluster back to HEALTH_OK quickly:
> >>
> >> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema…
> >>
> >> Cheers, Dan
> >>
> >>
> >> On Fri, May 1, 2020 at 9:36 AM Dylan McCulloch <dmc(a)unimelb.edu.au> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> We're using upmap balancer which has made a huge improvement in
> >>> evenly distributing data on our osds and has provided a
> >>> substantial increase in usable capacity.
> >>>
> >>> Currently on ceph version: 12.2.13 luminous
> >>>
> >>> We ran into a firewall issue recently which led to a large number
> >>> of osds being briefly marked 'down' & 'out'. The osds came back
> >>> 'up' & 'in' after about 25 mins and the cluster was fine but had
> >>> to perform a significant amount of backfilling/recovery despite
> >>> there being no end-user client I/O during that period.
> >>>
> >>> Presumably the large number of remapped pgs and backfills were due
> >>> to pg_upmap_items being removed from the osdmap when osds were
> >>> marked out and subsequently those pgs were redistributed using the
> >>> default crush algorithm.
> >>> As a result of the brief outage our cluster became significantly
> >>> imbalanced again with several osds very close to full.
> >>> Is there any reasonable mitigation for that scenario?
> >>>
> >>> The auto-balancer will not perform optimizations while there are
> >>> degraded pgs, so it would only start reapplying pg upmap
> >>> exceptions after initial recovery is complete (at which point
> >>> capacity may be dangerously reduced).
> >>> Similarly, as admins, we normally only apply changes when the
> >>> cluster is in a healthy state, but if the same issue were to occur
> >>> again would it be advisable to manually apply balancer plans while
> >>> initial recovery is still taking place?
> >>>
> >>> I guess my concern from this experience is that making use of the
> >>> capacity gained by using upmap balancer appears to carry some
> >>> risk. i.e. it's possible for a brief outage to remove those space
> >>> efficiencies relatively quickly and potentially result in full
> >>> osds/cluster before the automatic balancer is able to resume and
> >>> redistribute pgs using upmap.
> >>>
> >>> Curious whether others have any thoughts or experience regarding this.
> >>>
> >>> Cheers,
> >>> Dylan
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users(a)ceph.io
> >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>