Dear all,
Two days ago I added a few disks to a Ceph cluster and ran into a problem I have never
seen before when doing that. The entire cluster was deployed with mimic 13.2.2 and
recently upgraded to 13.2.8. This is the first time I added OSDs under 13.2.8.
I had a few hosts to which I needed to add 1 or 2 OSDs, and I started with one that needed
just 1. The procedure was as usual:
ceph osd set norebalance
deploy additional OSD
The OSD came up and PGs started peering; so far so good. To my surprise, however, I
started seeing health warnings about slow ping times:
Long heartbeat ping times on back interface seen, longest is 1171.910 msec
Long heartbeat ping times on front interface seen, longest is 1180.764 msec
After peering it looked like things got better, and I waited until the messages were
gone. This took a really long time, at least 5-10 minutes.
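In case it helps with the diagnosis: since the cluster computes these ping time statistics, I assume the per-OSD numbers can also be pulled from an OSD's admin socket with something like
ceph daemon osd.0 dump_osd_network 0
where the trailing 0 should lower the reporting threshold to 0 ms so that all recorded heartbeat times are listed. I have not verified that this command is available in 13.2.8, so treat it as a guess on my part.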
I went on to the next host and deployed 2 new OSDs this time. Same as above, but with much
worse consequences. Apparently, the ping times exceeded a timeout for a very short moment
and an OSD was marked out for ca. 2 seconds. Now all hell broke loose. I got health errors
with the dreaded "backfill_toofull", undersized PGs and a large number of
degraded objects. I don't know what caused what, but I ended up with data loss by
just adding 2 disks.
We have dedicated network hardware and each of the OSD hosts has 20 GBit front and 40 GBit
back network capacity (LACP trunking). There are currently no more than 16 disks per
server. The disks were added to an SSD pool. There was no traffic nor any other
exceptional load on the system. I have ganglia resource monitoring on all nodes and cannot
see a single curve going up: network, CPU utilisation, load, everything is below measurement
accuracy. The hosts and network are quite overpowered and dimensioned to host many more
OSDs (in future expansions).
I have three questions, ordered by how urgently I need an answer:
1) I need to add more disks next week and need a workaround. Will something like this help
avoid the heartbeat time-out:
ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
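If that is a sensible approach, the full sequence I have in mind would be the following
(an untested sketch using only standard ceph CLI flags; the unset order is a guess on my part):
ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
# deploy the new OSD(s) and wait for peering to settle
ceph osd unset nodown
ceph osd unset noout
ceph osd unset norebalance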
2) The "lost" shards of the degraded objects were obviously still on the cluster
somewhere. Is there any way to force the cluster to rescan OSDs for the shards that became
orphaned during the incident?
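The only kind of inspection I would know how to do myself is something like the sketch below
(the PG id and OSD id are placeholders; as far as I know "ceph osd down" only marks the daemon
down momentarily so its PGs re-peer, it does not remove anything):
ceph health detail       # list the degraded/undersized PGs
ceph pg 2.1f query       # check which OSDs a PG probes for missing shards
ceph osd down 12         # mark an OSD down briefly to force its PGs to re-peer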
3) This smells a bit like a bug that requires attention. I was probably just lucky that I
only lost 1 shard per PG. Has something similar been reported before? Is this fixed in 13.2.10?
Is it something new? Are there any settings that need to be looked at? If logs need to be
collected, I can do so during my next attempt. However, I cannot risk the data integrity of a
production cluster and will therefore probably not run the original procedure again.
Many thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14