Hello XuYun,
In my experience, I would always disable swap; it won't do any good.
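For example, on a typical Linux OSD host that could look roughly like the following (a minimal sketch; the swappiness value is just an illustration, adapt paths and values to your distribution):

    # turn swap off immediately
    swapoff -a
    # comment out any swap entries in /etc/fstab so it stays off after a reboot
    # or, as a softer alternative, reduce how aggressively the kernel swaps:
    sysctl vm.swappiness=10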
--
Martin Verges
Managing director
Mobile: +49 174 9335695
E-Mail: martin.verges(a)croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
On Thu, 7 May 2020 at 12:07, XuYun <yunxu(a)me.com> wrote:
We had some back/front ping problems after upgrading from filestore to bluestore. It turned out to be related to insufficient memory/swap usage.
On 6 May 2020, at 22:08, Frank Schilder <frans(a)dtu.dk> wrote:
To answer some of my own questions:
1) Setting
ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
before the restart/re-deployment did no harm. I don't know whether it helped, because I did not retry the procedure that led to OSDs going down. See also point 3 below.
2) A peculiarity of this specific deployment of 2 OSDs was that it was a mix of OSD deployment and restart after a reboot. I'm working on getting this sorted out, and that is a different story. For anyone who finds themselves in a situation where some OSDs are temporarily down/out with PGs remapped and objects degraded for whatever reason while new OSDs come up, the way to have ceph rescan the down/out OSDs after they come back up is as follows (a command sketch follows the list):
- "ceph osd crush move" the new OSDs temporarily to a location outside the crush subtree covering any pools (I keep such a parking space in the crush hierarchy for easy draining and parking of disks)
- bring up the down/out OSDs
- at this point, the cluster will fall back to the original crush map that was in place when the OSDs went down/out
- the cluster will now find all shards that went orphan, and health will be restored very quickly
- once the cluster is healthy, "ceph osd crush move" the new OSDs back to their desired location
- now you will see remapped PGs/misplaced objects, but no degraded objects
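A minimal sketch of those commands, assuming the new OSDs are osd.100 and osd.101, the previously down/out OSD is osd.57, the parking location is a separate crush root named "parking", and the target host is "host-a" (all IDs and names here are made up for illustration):

    # park the new OSDs outside any pool's crush subtree
    ceph osd crush move osd.100 root=parking
    ceph osd crush move osd.101 root=parking
    # bring the down/out OSDs back up on their host, e.g.:
    systemctl start ceph-osd@57
    # wait until the cluster is healthy again, then move the new OSDs to their final location
    ceph osd crush move osd.100 host=host-a
    ceph osd crush move osd.101 host=host-a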
3) I still don't have an answer as to why long heartbeat ping times were observed. There seems to be a more serious issue, and this will continue in its own thread, "Cluster outage due to client IO", to be opened soon.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 25 April 2020 15:34:25
To: ceph-users
Subject: [ceph-users] Data loss by adding 2 OSDs causing long heartbeat ping times
Dear all,
Two days ago I added a few disks to a ceph cluster and ran into a problem I have never seen before when doing that. The entire cluster was deployed with mimic 13.2.2 and recently upgraded to 13.2.8. This is the first time I have added OSDs under 13.2.8.
I had a few hosts that I needed to add 1 or 2 OSDs to, and I started with one that needed 1. The procedure was as usual (a sketch of the corresponding commands follows below):
ceph osd set norebalance
deploy additional OSD
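For reference, a minimal sketch of what that looks like as commands, assuming the OSD is created with ceph-volume on the target host (/dev/sdX is a placeholder for the actual data device; the deployment tooling in your cluster may differ):

    ceph osd set norebalance
    # on the OSD host: create and start the new bluestore OSD
    ceph-volume lvm create --data /dev/sdX
    # after the new OSD is up and peering has finished:
    ceph osd unset norebalance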
The OSD came up and PGs started peering; so far so good. To my surprise, however, I started seeing health warnings about slow ping times:
Long heartbeat ping times on back interface seen, longest is 1171.910 msec
Long heartbeat ping times on front interface seen, longest is 1180.764 msec
After peering it looked like it got better, and I waited it out until the messages were gone. This took a really long time, at least 5-10 minutes.
I went on to the next host and deployed 2 new OSDs this time. Same as above, but with much worse consequences. Apparently, the ping times exceeded a timeout for a very short moment and an OSD was marked out for ca. 2 seconds. Now all hell broke loose. I got health errors with the dreaded "backfill_toofull", undersized PGs and a large number of degraded objects. I don't know what caused what, but I ended up with data loss by just adding 2 disks.
We have dedicated network hardware, and each of the OSD hosts has 20 GBit front and 40 GBit back network capacity (LACP trunking). There are currently no more than 16 disks per server. The disks were added to an SSD pool. There was no traffic nor any other exceptional load on the system. I have ganglia resource monitoring on all nodes and cannot see a single curve going up. Network, CPU utilisation, load: everything is below measurement accuracy. The hosts and network are quite overpowered and dimensioned to host many more OSDs (in future expansions).
I have three questions, ordered by how urgently I need an answer:
1) I need to add more disks next week and need a workaround. Will something like this help avoid the heartbeat time-out (a sketch of the full set/unset sequence follows below):
ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
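Spelled out, the sequence I have in mind looks roughly like this (the unset commands are simply the obvious counterparts once the new OSDs are up and peered):

    ceph osd set noout
    ceph osd set nodown
    ceph osd set norebalance
    # deploy/restart the OSDs and wait for peering to finish, then:
    ceph osd unset nodown
    ceph osd unset noout
    ceph osd unset norebalance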
2) The "lost" shards of the degraded objects were obviously still on the
cluster somewhere. Is there any way to force the cluster to rescan OSDs for
the shards that went orphan during the incident?
3) This smells a bit like a bug that requires attention. I was probably just lucky that I only lost 1 shard per PG. Has something similar been reported before? Is this fixed in 13.2.10? Is it something new? Any settings that need to be looked at? If logs need to be collected, I can do so during my next attempt. However, I cannot risk the data integrity of a production cluster and will therefore probably not run the original procedure again.
Many thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io