Hello,


We have a Ceph cluster (version 12.2.4) with 10 hosts and 21 OSDs on each host.


An EC pool was created with the following commands:


ceph osd erasure-code-profile set profile_jerasure_4_3_reed_sol_van \
  plugin=jerasure \
  k=4 \
  m=3 \
  technique=reed_sol_van \
  packetsize=2048 \
  crush-device-class=hdd \
  crush-failure-domain=host


ceph osd pool create pool_jerasure_4_3_reed_sol_van 2048 2048 erasure profile_jerasure_4_3_reed_sol_van
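
The resulting profile and the pool's size and min_size can be inspected with:

ceph osd erasure-code-profile get profile_jerasure_4_3_reed_sol_van
ceph osd pool get pool_jerasure_4_3_reed_sol_van size
ceph osd pool get pool_jerasure_4_3_reed_sol_van min_size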



Here are my questions:

  1. The EC pool was created with k=4, m=3, crush-device-class=hdd, and crush-failure-domain=host, so we disabled the network interfaces on some hosts (using the "ifdown" command) to verify the pool's fault tolerance while running 'rados bench'.
    However, the IO rate drops to 0 immediately when a single host goes offline, and it takes a long time (~100 seconds) before the IO rate returns to normal.
    As far as I know, the default value of min_size is k+1, i.e. 5 in this case, which means the EC pool should keep serving IO even with two hosts offline.
    Is there something wrong with my understanding?
  2. From our observations, the IO rate only returns to normal once Ceph has detected (i.e. marked down) all of the OSDs on the failed host.
    Is there any way to reduce the time Ceph needs to detect all of the failed OSDs? (A sketch of the settings we are considering follows after this list.)
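
Regarding question 2, we are thinking about shortening the failure-detection window by tuning the OSD heartbeat options below. The values are only rough guesses on our part rather than tested recommendations, so please tell us if this is the wrong approach:

# In ceph.conf on all nodes (example values only):
[global]
    # how often an OSD pings its peers (default 6 seconds)
    osd heartbeat interval = 3
    # how long a peer may be silent before being reported down (default 20 seconds)
    osd heartbeat grace = 10

# Or injected at runtime for a quick test:
ceph tell osd.* injectargs '--osd_heartbeat_interval 3 --osd_heartbeat_grace 10'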



Thanks for any help.


Best regards,

Majia Xiao