One reason for such observations is swap usage. If you have swap configured, you should
probably disable it. Swap can be useful with ceph, but you really need to know what you
are doing and how swap actually works (it is not a way to get more RAM, as most people
tend to believe).
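For reference, disabling swap on a running node is just this (assuming a Linux host; also
remove or comment the swap entries in /etc/fstab so it stays off after a reboot):

    swapoff -a                        # turn off all active swap immediately
    systemctl list-units --type=swap  # check for systemd-managed swap units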
In my case, I have substantial amounts of swap configured. One then needs to be aware of
its impact on certain ceph operations. Code and data that are rarely used, as well as
leaked memory, will end up in swap. During normal operations, that is not a problem. However,
during exceptional operations, you are likely in a situation where all OSDs try to swap
the same code/data in/out at the same time, which can temporarily lead to very large
response latencies.
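A quick way to see how much of each OSD process actually sits in swap (plain Linux, just a
sketch):

    # VmSwap in /proc/<pid>/status is the swapped-out portion of a process
    for pid in $(pgrep ceph-osd); do
        grep VmSwap /proc/$pid/status
    done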
One of these exceptional operations is a large peering operation. The code/data for
peering is rarely used, so it will be in swap. The increased latency can be bad enough for
MONs to mark OSDs as down for a short while; I have seen that happen. Usually, the cluster
recovers very quickly, and this is not a real issue if an actual OSD fails.
If you add/remove disks, however, this flapping can be irritating. The workaround is to
set nodown in addition to noout when doing admin. This will not only speed up peering
dramatically, it will also make the cluster ignore the increased heartbeat ping times
during the admin operation. I still see the warnings, but no detrimental effects.
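That is just the usual flag dance, e.g.:

    ceph osd set noout       # don't mark OSDs out and start rebalancing
    ceph osd set nodown      # ignore missed heartbeats during the admin operation
    # ... add/remove the disks ...
    ceph osd unset nodown
    ceph osd unset noout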
In general, deploying swap in a ceph cluster is more the exception than the rule. The most
common use is to allow a cluster to recover during a period of increased RAM requirements.
There are cases on this list, for both MDS and OSD recoveries, where having more address
space was the only way forward. If deployed during normal operation, swap really needs to
be fast and able to handle simultaneous requests from many processes in parallel.
Usually, only RAM is fast enough, so don't buy NVMe drives, just buy more RAM. Having
some fast drives in stock for emergency swap deployment is a good idea though.
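If you ever need to deploy such emergency swap, it is only a few commands (the device path
below is a placeholder for whatever fast drive you put in):

    mkswap /dev/nvme0n1          # device path is an example
    swapon -p 10 /dev/nvme0n1    # higher priority than any existing swap
    sysctl vm.swappiness=10      # prefer RAM, use swap only as overflow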
I deployed swap to cope with a memory leak that was present in mimic 13.2.8. It seems to
be fixed in 13.2.10. If swap is fast enough, the impact is there but harmless. Swap on a
crappy disk is dangerous.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Anthony D'Atri <anthony.datri(a)gmail.com>
Sent: 08 January 2021 23:58:43
To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: osd gradual reweight question
Hi,
> We are replacing HDD with SSD, and we first (gradually) drain (reweight) the HDDs with
> 0.5 steps until 0 = empty.
> Works perfectly.
> Then (just for kicks) I tried reducing HDD weight from 3.6 to 0 in one large step. That
> seemed to have had more impact on the cluster, and we even noticed some OSDs
> temporarily go down after a few minutes. It all worked out, but the impact seemed much
> larger.
Please clarify “impact”. Do you mean that client performance was decreased, or something
else?
> We never had OSDs go down when gradually reducing the
> weight step by step. This surprised us.
Please also clarify what you mean by going down — do you mean being marked “down” by the
mons, or the daemons actually crashing? I’m not being critical — I want to fully
understand your situation.
> Is it expected that the impact of a sudden reweight
> from 3.6 to 0 is bigger than a gradual step-by-step decrease?
There are a lot of variables there, so It Depends.
For sure going in one step means that more PGs will peer, which can be expensive. I’ll
speculate, with incomplete information, that this is the source of most of what you’re seeing.
> I would assume the impact to be similar, only the time
> it takes to reach HEALTH_OK to be longer.
The end result, yes — the concern is how we get there.
The strategy of incremental downweighting has some advantages:
* If something goes wrong, you can stop without having a huge delta of data to move before
health is restored
* Peering is spread out
* Impact on the network and drives *may* be less at a given time
A disadvantage is that you end up moving some data more than once. This was worse with
older releases and CRUSH tunables than with recent deployments.
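For reference, the incremental variant is just something like this (OSD id and step size
are placeholders; wait for recovery to settle between steps):

    for w in 3.0 2.5 2.0 1.5 1.0 0.5 0; do
        ceph osd crush reweight osd.NN $w
        # wait here until the cluster is back to HEALTH_OK (watch ceph -s)
        # before taking the next step
    done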
The impact due to data movement can be limited by lowering the usual recovery/backfill
settings to 1 from their defaults and, depending on release, by adjusting
osd_op_queue_cut_off.
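Concretely, something like this (assuming a release with the ceph config store, i.e.
Mimic or later; on older releases injectargs does the same):

    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    ceph config set osd osd_op_queue_cut_off high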
The impact due to peering can be limited by spreading out peering, either through an
incremental process like yours, or by letting the balancer module do the work.
There are other strategies as well, e.g. disabling rebalancing (the norebalance flag),
downweighting OSDs in sequence or a little at a time, then re-enabling rebalancing once
they reach weight 0.
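E.g. with the flag set during the downweighting (upmap balancer mode assumed, which
requires luminous+ clients):

    ceph osd set norebalance     # queue the CRUSH changes without moving data yet
    # ... downweight the OSDs ...
    ceph osd unset norebalance   # now let the data movement proceed
    ceph balancer mode upmap     # or let the balancer module handle placement
    ceph balancer on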
> Thanks,
> MJ
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io