Hi all,

We have a Ceph cluster (version 12.2.4) that uses EC pools; it consists of 10 OSD hosts.

The EC profile and pool were created with the following commands:



ceph osd erasure-code-profile set profile_jerasure_4_3_reed_sol_van \
  plugin=jerasure \
  k=4 \
  m=3 \
  technique=reed_sol_van \
  packetsize=2048 \
  crush-device-class=hdd \
  crush-failure-domain=host

ceph osd pool create pool_jerasure_4_3_reed_sol_van 2048 2048 erasure profile_jerasure_4_3_reed_sol_van
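For reference, the resulting profile can be inspected with the command below (output omitted here):

ceph osd erasure-code-profile get profile_jerasure_4_3_reed_sol_van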



Since the EC pool's crush-failure-domain is set to "host", we verified its fault tolerance by disabling the network interfaces on some of the hosts (using the "ifdown" command).
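The test sequence was roughly the following (the interface name and the rados bench arguments below are only illustrative, not the exact values we used):

# on a client node: keep writing to the EC pool
rados bench -p pool_jerasure_4_3_reed_sol_van 300 write

# on one of the OSD hosts: take its network interface down
ifdown eth0

# on a monitor node: watch how long it takes for the OSDs to be marked down
ceph -w
ceph osd tree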
Here is what we observed:


First, the IO rate reported by "rados bench" (our benchmark tool) drops to 0 immediately when one host goes offline.

Second, it takes a long time (around 100 seconds) for Ceph to detect that the OSDs on that host are down.

Finally, once Ceph has marked all of the offline OSDs down, the EC pool behaves normally again and IO resumes.

So, here are my questions:

1. Is it normal that the IO rate drops to 0 immediately even though only one host goes offline? With m=3 and crush-failure-domain=host, I would expect the pool to keep serving IO with a single host down.
2. How can we reduce the time Ceph needs to detect failed OSDs?
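For question 2, I am only guessing that the relevant knobs are the OSD heartbeat and monitor reporting settings, e.g. something like the following in ceph.conf (values are just examples):

[osd]
osd heartbeat interval = 6     # how often an OSD pings its peers
osd heartbeat grace = 10       # how long a peer can be silent before it is reported down

[mon]
mon osd min down reporters = 2 # how many OSDs must report a peer as down

Is lowering these the right way to shorten the ~100 second detection window, or are we looking at the wrong options entirely?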


Thanks for any help.


Best regards,
Majia Xiao