Hi Stefan,
many thanks for your good advice.
We are using ceph version 14.2.11
There is an issue with full osds - I'm not sure it's causing this
misplaced jump problem; I've been reweighting the most full osds on several
consecutive days to reduce the number of nearfull osds, and it seems to
have no effect on the misplaced jump. I've not done a reweight for a few
days, so we have a lot of very full osds. The osd balance is way off, so
something is amiss.
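For anyone following the thread, the reweights were chosen roughly in proportion to how far each osd sits above the cluster mean. A minimal sketch of that arithmetic (the ~68.8% target is read off the VAR column below, and the 0.80 clamp floor is my own assumption, not a ceph default):

```python
# Sketch: choose a new reweight for an overfull OSD so its expected
# utilisation moves toward the cluster mean. The clamp floor (0.80)
# is an assumed safety margin, not a ceph default; ceph's own
# "reweight-by-utilization" applies a similar proportional nudge.

def suggest_reweight(current_reweight, current_use, target_use, floor=0.80):
    """Scale the reweight by target/current %USE, clamped to [floor, 1.0]."""
    new = current_reweight * (target_use / current_use)
    return round(max(floor, min(1.0, new)), 5)

# osd.549: reweight 1.00000 at 91.45% use; cluster mean ~68.8% (VAR column)
print(suggest_reweight(1.0, 91.45, 68.8))   # -> 0.8
```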
Out of 550 OSDs we see this spread in use
(sorted least full to most full):
 ID WEIGHT    REWEIGHT %USE  VAR  PGS
430  7.27730  1.00000  46.81 0.68 171
 73  7.27730  1.00000  46.91 0.68 170
189  7.27730  1.00000  47.15 0.68 173
199  7.27730  1.00000  47.24 0.69 172
 86  7.27730  1.00000  48.62 0.71 176
234  7.27730  1.00000  48.73 0.71 178
437  7.27730  1.00000  49.65 0.72 182
288  7.27730  1.00000  50.12 0.73 184
(SNIP)
 ID WEIGHT    REWEIGHT %USE  VAR  PGS
455 14.55299  1.00000  84.39 1.23 619
541 14.55299  1.00000  84.40 1.23 620
456 14.55299  1.00000  84.73 1.23 621
487 14.55299  0.90002  85.56 1.24 620
527 14.55299  1.00000  86.61 1.26 638
466 14.55299  0.90002  86.78 1.26 639
501 14.55299  1.00000  87.39 1.27 645
542 14.55299  1.00000  88.06 1.28 645
462 14.55299  0.95001  91.23 1.32 670
549 14.55299  1.00000  91.45 1.33 676
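To put a number on how far off the balance is, using just the two extreme rows above (the cluster mean is inferred from the VAR column, which ceph computes as %USE divided by the mean %USE):

```python
# Least and most full OSDs from the "ceph osd df" extract above.
least_use = 46.81   # osd.430
most_use  = 91.45   # osd.549
most_var  = 1.33    # VAR = %USE / cluster-mean %USE

spread = most_use - least_use     # percentage-point spread across the cluster
mean_use = most_use / most_var    # implied cluster mean utilisation

print(f"spread {spread:.2f} points, mean ~{mean_use:.1f}%")
```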
I like your idea of remapping the pgs to their original location and
then re-balancing the osds to a sensible arrangement. I'll see if this
works and report back....
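As I understand it, Dan's script works by emitting "ceph osd pg-upmap-items" commands that map each misplaced pg back from its backfill target to the osd it currently sits on, so the data stays put and the cluster returns to HEALTH_OK. A minimal sketch of that command construction (the pg and osd ids here are hypothetical, not from our cluster):

```python
# Sketch of the idea behind upmap-remapped.py: for each misplaced PG,
# build a "ceph osd pg-upmap-items" command mapping the PG back from
# its backfill target to the OSD it currently lives on. All pg/osd
# ids below are hypothetical examples.

def upmap_cmd(pgid, mappings):
    """mappings is a list of (from_osd, to_osd) pairs."""
    pairs = " ".join(f"{src} {dst}" for src, dst in mappings)
    return f"ceph osd pg-upmap-items {pgid} {pairs}"

# e.g. pg 1.2f being backfilled from osd.12 to osd.34 -- map it back:
print(upmap_cmd("1.2f", [(34, 12)]))   # -> ceph osd pg-upmap-items 1.2f 34 12
```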
best regards,
Jake
On 28/09/2020 11:08, Stefan Kooman wrote:
On 2020-09-28 11:45, Jake Grimmett wrote:
To show the cluster before and immediately after
an "episode"
***************************************************
[root@ceph7 ceph]# ceph -s
cluster:
id: 36ed7113-080c-49b8-80e2-4947cc456f2a
health: HEALTH_WARN
7 nearfull osd(s)
2 pool(s) nearfull
Low space hindering backfill (add storage if this doesn't
resolve itself): 11 pgs backfill_toofull
What version are you running? I'm worried the nearfull OSDs might be the
culprit here. There has been a bug with respect to nearfull OSDs [1] that
has since been fixed. You might or might not be hitting it. Check with
"ceph osd df" to see whether any OSDs really are too full.
You can use Dan's upmap-remapped.py [2] to remap the PGs back to their
original location and get the cluster in HEALTH_OK again. You might want
to select PGs to deep-scrub by hand, so that deep-scrubbing proceeds in
the most efficient order (instead of PGs being chosen at random).
Gr. Stefan
[1]:
https://tracker.ceph.com/issues/39555
[2]:
https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema…
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.