We are running tests under Luminous with an erasure-coded data pool for RGW: we push large amounts of data to the gateway using COSBench, then remove devices and monitor performance during the subsequent recovery.  The goal is to see how RGW handles failures and recovery when using EC data pools.

Our EC data pool consists of 13 NVMe BlueStore OSDs using jerasure with K=7, M=3, and min_size=7.  Under this configuration, I would expect that we could lose 3 devices without risk of data loss.  But what we see is that once a single OSD is removed and recovery begins, RGW stops working (it's still running, just not doing anything) and will not process requests (read or write) until the cluster has no more degraded PGs.  Usually this means re-adding the device and letting the backfill repair the degraded PGs.
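For reference, the pool setup is along these lines (profile and pool names here are illustrative, not our exact ones):

```shell
# Create a jerasure profile with k=7 data chunks and m=3 coding chunks.
ceph osd erasure-code-profile set rgw-ec-7-3 \
    plugin=jerasure k=7 m=3 crush-failure-domain=osd

# Create the RGW data pool on that profile (PG count illustrative).
ceph osd pool create default.rgw.buckets.data 128 128 erasure rgw-ec-7-3

# min_size=7, i.e. equal to k.
ceph osd pool set default.rgw.buckets.data min_size 7
```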

We did not appear to lose any data, but the fact that RGW stops processing requests while recovery is happening is troubling.  We expected requests to continue to be served, since we still had enough OSDs in the pool to satisfy the EC parameters (7/3).  Is this expected behavior?
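While RGW is hung, this is roughly how we check the cluster state (pool name illustrative):

```shell
# Overall health, including counts of degraded/undersized PGs.
ceph health detail

# List PGs that are stuck in an unclean state during the recovery.
ceph pg dump_stuck unclean

# Confirm the pool's min_size hasn't changed out from under us.
ceph osd pool get default.rgw.buckets.data min_size
```

During the hang, health shows degraded PGs but the affected PGs still report active, which is why the stalled client IO surprised us.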

thanks,
  Wyllys Ingersoll