In an exploration of trying to speed up the long tail of backfills that results from marking a failing OSD out, I began looking at my PGs to see if I could tune some settings and noticed the following:

Scenario: on a 12.2.12 cluster, I am alerted to an inconsistent PG and to SMART failures on one of its OSDs. Inspecting the PG shows the inconsistency is a read_error from the SMART-failing OSD.
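For context, the inspection looked something like the following (the PG id 2.1ab and osd.12 are just placeholders for my actual PG and OSD):

    ceph health detail                                       # reports the inconsistent PG, e.g. 2.1ab
    rados list-inconsistent-obj 2.1ab --format=json-pretty   # shows a read_error on the shard held by osd.12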

Steps I take: set the primary affinity of the failing OSD to 0 (the thought being that I don't want a failing drive to be responsible for backfilling data), wait for peering to complete, then mark the OSD out. At this point backfill begins.
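Concretely, with osd.12 again standing in for the failing OSD, the sequence was roughly:

    ceph osd primary-affinity osd.12 0   # failing drive should no longer act as primary
    # ...watch `ceph -s` until peering settles...
    ceph osd out 12                      # start moving data off the failing drive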

About 90% of the PGs complete backfill very quickly. Towards the tail end of the backfill I have 20 or so PGs in backfill_wait and 1 backfilling (presumably because osd_max_backfills = 1).
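(I realize I could probably loosen that throttle itself with something along the lines of the below, but my questions are more about which OSD the data is being copied from:)

    ceph tell osd.* injectargs '--osd_max_backfills 2'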

I run `ceph pg ls backfill_wait` and notice that for 100% of these tail-end PGs, every OSD in the up set differs from the acting set, and the acting_primary is the OSD whose primary affinity I set to 0 and which I marked out.
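Spot-checking one of the waiting PGs directly shows the same thing; roughly (2.1ab again a placeholder):

    ceph pg 2.1ab query | grep -E '"(up|acting)":' -A 4   # dump the up set and acting set for this PG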

My questions are the following:
- Upon learning a disk has SMART failures and hosts an inconsistent PG, I want to prevent its potentially corrupt data from being replicated out to other OSDs, even for PGs that may not yet have been discovered to be inconsistent, so I set its primary affinity to 0. At this step shouldn't the acting_primary be another OSD from the acting set, with backfill copied out of a different OSD?
- Should I additionally be marking the OSD down (roughly the sequence sketched after this list)? That would cause the PGs to go degraded until backfill finishes, but backfill would presumably finish faster, since more OSDs would become acting_primary and I would not be throttled by osd_max_backfills. My hesitation is that it seems best to avoid degraded PGs, as I do not want to drop below min_size.
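For clarity, the alternative procedure in the second question would look roughly like this (osd.12 again a placeholder; I would stop the daemon rather than just run `ceph osd down`, since a running OSD would immediately mark itself back up):

    systemctl stop ceph-osd@12   # on the OSD host, so it stays down
    ceph osd out 12              # other OSDs become acting_primary; PGs go degraded until backfill completes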

I recognize some of these behaviors may be different in Nautilus, but I am waiting for the 14.2.6 release, as I am aware of some bugs I do not want to contend with. Thanks.


Respectfully,

Wes Dillingham