On Tue, Nov 26, 2019 at 7:45 PM majia xiao <xiaomajia.st@gmail.com> wrote:

Hi all,

We have a Ceph (version 12.2.4) cluster that uses EC pools and consists of 10 OSD hosts.

The corresponding commands to create the EC pool are listed as follows:

ceph osd erasure-code-profile set profile_jerasure_4_3_reed_sol_van \
plugin=jerasure \
k=4 \
m=3 \
technique=reed_sol_van \
packetsize=2048 \
crush-device-class=hdd \
crush-failure-domain=host

ceph osd pool create pool_jerasure_4_3_reed_sol_van 2048 2048 erasure profile_jerasure_4_3_reed_sol_van
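
For readers checking the numbers, here is a small sketch of what this profile implies. It uses only the k and m values from the profile above; the variable names are illustrative and nothing here talks to a cluster.

```shell
#!/bin/sh
# Values copied from the erasure-code profile above.
k=4   # data chunks
m=3   # coding chunks

# Each object is stored as k+m chunks; with crush-failure-domain=host,
# every chunk of a placement group lands on a different host.
chunks=$((k + m))

# Raw space consumed per unit of usable data: (k+m)/k.
overhead=$(awk -v k="$k" -v m="$m" 'BEGIN { printf "%.2f", (k + m) / k }')

echo "chunks per object: $chunks"
echo "raw-to-usable ratio: $overhead"
echo "chunks that can be lost without losing data: $m"
```

So the pool needs at least 7 hosts (the cluster has 10), and the data itself survives the loss of up to 3 hosts. Whether IO keeps flowing during a failure is a separate matter from data survival.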

Since the EC pool's crush-failure-domain is set to "host", we disabled the network interfaces of some hosts (using the "ifdown" command) to verify that the EC pool tolerates host failures.
Here is what we observed:

First of all, the IO rate (measured with "rados bench", which we used as the benchmark) drops to 0 immediately when one host goes offline.

Secondly, it takes a long time (around 100 seconds) for Ceph to detect that the OSDs on that host are down.

Finally, once Ceph has detected all of the offline OSDs, the EC pool seems to behave normally and is ready for IO operations again.

So, here are my questions:

1. Is it normal for the IO rate to drop to 0 immediately even though only one host has gone offline?
2. How can we reduce the time Ceph needs to detect failed OSDs?

This is intended: the host gives no notice that it is going down (basically a yanked cable), so the other clients still expect it to be alive. I would recommend not setting the detection timeout too low, as that would introduce false positives that could cause a death spiral in Ceph. If a host takes longer to respond to heartbeats than your new, shorter timeout, it will get kicked out of the cluster, causing peering on all the other nodes. Then, when it comes back a short time later, it will cause peering again. All this peering could cause other OSDs to miss their heartbeats, and the problem only gets worse as it compounds. Server failure events should be infrequent enough that 100 seconds is a good compromise. You are able to adjust the timeout, but I highly recommend you don't go shorter.
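
If you only want to see what the cluster is currently using, a read-only check along these lines may help. This is a sketch, not a tuning recommendation: the option names (osd_heartbeat_interval, osd_heartbeat_grace, mon_osd_min_down_reporters) are standard Ceph settings related to failure detection but were not named in this thread, and OSD_ID, MON_ID and the run helper are placeholders for illustration. "ceph daemon" talks to the local admin socket, so each command has to be run on the host where that daemon lives.

```shell
#!/bin/sh
# Placeholder daemon ids; replace them with ids that exist on this host.
OSD_ID="${OSD_ID:-0}"
MON_ID="${MON_ID:-a}"

# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 on a real Ceph node to actually query the daemons.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# How often an OSD pings its peers, and how long a peer may stay
# silent before it is reported as failed.
run ceph daemon "osd.$OSD_ID" config get osd_heartbeat_interval
run ceph daemon "osd.$OSD_ID" config get osd_heartbeat_grace

# How many reporters the monitors want before marking an OSD down.
run ceph daemon "mon.$MON_ID" config get mon_osd_min_down_reporters
```

These commands only read values. Changing them is possible, but the caution above about false positives and repeated peering applies.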

----------------
Robert LeBlanc

PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1