Re: backfill_unfound state reset to clean after osd restart

19 May 2021

...
  Suppose we have a 2+1 EC pool, and an object is missing 2 shards on
 both non-primary osds. We initiate backfill by setting a non-primary
 osd out. During the backfill the primary osd detects the missing
 shards and the pg enters "backfill_unfound" state, the last_backfill
 position is properly set to the object before the "unfound" one (in
 post-nautilus; for nautilus I opened [1] to make it work). If
 re-peering occurs because a non-primary osd is restarted, the backfill
 is restarted from the last_backfill position and the "unfound" object
 is detected again. But if re-peering occurs because the primary osd is
 temporarily stopped (restarted), another non-primary osd becomes
 primary and "drives" the backfill from the last_backfill position. As
 the shard is missing there, the object is just skipped by the backfill:
 the missing object is not detected and the pg enters the clean state.
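
 The sequence above can be illustrated with a toy model. This is only a
 sketch under simplifying assumptions, not Ceph's actual backfill code:
 the function run_backfill, the osd names and the object names are all
 hypothetical, and the scan is assumed to enumerate only the objects the
 driving primary holds a shard of.

```python
K = 2  # shards needed to reconstruct an object in a 2+1 EC pool


def run_backfill(primary, acting, shards, last_backfill=""):
    """Toy backfill scan driven by `primary` (hypothetical model).

    `shards` maps an osd name to the set of objects it holds a shard of.
    `acting` lists the osds that are currently up.
    Returns (pg_state, last_backfill).
    """
    # Assumption: the scan only sees objects known to the driving primary.
    candidates = sorted(o for o in shards[primary] if o > last_backfill)
    for obj in candidates:
        available = sum(1 for osd in acting if obj in shards[osd])
        if available < K:
            # Too few shards to reconstruct: stop before this object.
            return "backfill_unfound", last_backfill
        last_backfill = obj
    return "clean", last_backfill


shards = {
    "osd.0": {"obj_a", "obj_b", "obj_c"},  # original primary
    "osd.1": {"obj_a", "obj_c"},           # missing its shard of obj_b
    "osd.2": {"obj_a", "obj_c"},           # missing its shard of obj_b
}

# The original primary drives the backfill and notices obj_b.
state, pos = run_backfill("osd.0", ["osd.0", "osd.1", "osd.2"], shards)
print(state, pos)  # backfill_unfound obj_a

# The primary is stopped; osd.1 takes over and resumes from last_backfill.
state2, pos2 = run_backfill("osd.1", ["osd.1", "osd.2"], shards, pos)
print(state2, pos2)  # clean obj_c
```

 In this model the second scan never visits obj_b because the new
 primary has no shard of it, so nothing marks it as unfound.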

 Is there something that can/should be improved here? It is rather
 unfortunate that the information about the missing object is lost on
 the restart (until scrub or the next backfill). On the other hand, the
 situation where many shards are missing for an object is rather
 unlikely. Also, if for example it happened that the shard was missing
 on the primary, it would not even be detected on backfill.

 [1] https://github.com/ceph/ceph/pull/41293

In the primary osd case, is there a situation where the user wants to reset the state
(out of the unfound state)?
If we fix this behavior, would that cause another problem because the state can no longer be reset?

--
Jin
