We are running some tests under Luminous using an erasure code data pool
for RGW where we push large amounts of data to the gateway using COSbench
and then monitor performance while we remove devices and watch the
subsequent recovery. The goal is to see how the RGW handles failures and
recovery when using EC data pools.
Our EC data pool consists of 13 NVMe bluestore devices using jerasure
with k=7, m=3, and min_size=7. Under this configuration, I would expect
that we could lose 3 devices without risk of data loss. But what we see is
that once a single OSD is removed and recovery begins, the RGW stops
working (it's still running, just not doing anything) and will not resume
processing requests (read or write) until the cluster has no more degraded
PGs. Usually this means re-adding the device and allowing the rebalance to
backfill the degraded PGs.
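For reference, the pool was set up roughly like this (the profile and pool names here are illustrative, not necessarily our exact ones):

```shell
# Define a jerasure profile with 7 data and 3 coding shards.
ceph osd erasure-code-profile set ec-7-3 k=7 m=3 plugin=jerasure

# Create the RGW data pool on that profile and pin min_size at 7.
ceph osd pool create default.rgw.buckets.data 128 128 erasure ec-7-3
ceph osd pool set default.rgw.buckets.data min_size 7
```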
We did not appear to lose any data, but the fact that RGW stops processing
requests while a recovery is happening is troubling. We expected the data
to continue to be processed since we still had enough OSDs in the pool to
satisfy the EC parameters (k=7, m=3). Is this expected behavior?
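To spell out the expectation: with k=7, m=3 there are 10 shards per object, and a PG should stay active as long as at least min_size shards remain. Losing one OSD leaves 9 >= 7, so I/O should continue. A quick sketch of that arithmetic (the pg_active helper is just illustrative):

```python
# Availability arithmetic for our EC profile (k=7, m=3, min_size=7).
k, m, min_size = 7, 3, 7
total_shards = k + m  # 10 shards per object

def pg_active(osds_lost):
    """A PG should stay active while available shards >= min_size."""
    return total_shards - osds_lost >= min_size

print(m)             # up to 3 shards can be lost without data loss
print(pg_active(1))  # True: 9 shards remaining >= min_size of 7
print(pg_active(3))  # True: exactly min_size shards remaining
print(pg_active(4))  # False: below min_size, PG goes inactive
```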
thanks,
Wyllys Ingersoll