On Thu, Apr 22, 2021 at 04:16:34PM +0300, Mykola Golub wrote:
I would like to bring some attention to a problem we have been
observing with nautilus, and which I reported here [1].
If a pg is in backfill_unfound state ("unfound" objects were detected
during backfill), and one of the osds from the active set is restarted
the state changes to clean, losing the information about unfound
objects.
And when I tried to reproduce the issue on master with the same
scenario, the status did not change, but I observed the primary
osd crash after a non-primary restart.
Ok. Now I seem to have a better understanding of what is going on here.
As I wrote in [1], when `PrimaryLogPG::on_failed_pull` is called for an
object that is not found on the backfill source osd, the oid is removed
from `backfills_in_flight` only if the backfill source is primary [2].
In our case we are backfilling a non-primary EC shard, so the oid is
not removed from `backfills_in_flight`. And later it causes the
assertion failure in `PrimaryLogPG::_clear_recovery_state`.
The behavior seems to have been changed during the post-nautilus
refactoring, in [3]. Previously for the EC backend the oid was removed from
`backfills_in_flight` unconditionally, and now it is removed only if
the source is primary.
In [1] I questioned this change, but after investigating how it works,
now it looks quite reasonable to me.
So, the current behavior is: in `PrimaryLogPG::recover_backfill`,
because the "unfound" oid is not removed from `backfills_in_flight`,
`next_backfill_to_complete` is always set to the "unfound" oid [4],
and `new_last_backfill` is no longer updated, staying at the object
before the "unfound" oid. The backfill still continues and terminates
only after all objects are pulled/pushed, but the "complete" position
remains on the object before "unfound". After the backfill is finished
the pg enters "backfill_unfound" state. When the pg is re-peered
(e.g. after restarting an osd) it enters "backfilling" state starting
the backfill from "unfound" oid position, detects the "unfound" object
again, scans the remaining objects detecting they are already copied,
and enters "backfill_unfound" state again with the same "complete"
position on the "unfound" object.
This looks like reasonable behavior to me, and the only problem is
the reported assertion failure; the assertion probably just needs to be
removed?
In Nautilus, because the "unfound" oid is removed from
`backfills_in_flight`, the "complete" position is not stopped on this
oid, and when the backfill is finished it also enters
"backfill_unfound" state, but the "complete" backfill position is at the
end now. So when the pg is re-peered, the backfill is not restarted
from the "unfound" position, the "unfound" object is not detected, and
the pg enters "clean" state.
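To illustrate the difference between the two behaviors, here is a toy
C++ model. It is only a sketch and not the actual Ceph code:
`run_backfill`, the integer oids and the `erase_on_failed_pull` flag are
made up for illustration, and the way the "complete" position advances
is my simplified reading of the logic described above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy model, not Ceph code. Objects are ints in backfill
// order. Returns the object the "complete" position ends up on
// (-1 if it never advanced).
int run_backfill(const std::vector<int>& objects, int unfound_oid,
                 bool erase_on_failed_pull) {
  std::set<int> backfills_in_flight;
  std::set<int> pending_updates;  // copied, but not yet counted as complete
  int last_backfill = -1;

  for (std::size_t i = 0; i < objects.size(); ++i) {
    const int oid = objects[i];
    backfills_in_flight.insert(oid);

    if (oid != unfound_oid) {
      // Pull/push succeeded.
      backfills_in_flight.erase(oid);
      pending_updates.insert(oid);
    } else if (erase_on_failed_pull) {
      // Assumed nautilus EC behavior: drop the oid even though the pull failed.
      backfills_in_flight.erase(oid);
    }
    // Otherwise (assumed master behavior for a non-primary source) the
    // unfound oid stays in backfills_in_flight.

    // Scan position: the next object, or one past the last object.
    const int backfill_pos =
        (i + 1 < objects.size()) ? objects[i + 1] : objects.back() + 1;
    const int next_to_complete = backfills_in_flight.empty()
                                     ? backfill_pos
                                     : *backfills_in_flight.begin();

    // The "complete" position only advances over objects that sort
    // before the earliest backfill still in flight.
    while (!pending_updates.empty() &&
           *pending_updates.begin() < next_to_complete) {
      last_backfill = *pending_updates.begin();
      pending_updates.erase(pending_updates.begin());
    }
  }
  return last_backfill;
}
```

With objects {10, 20, 30, 40, 50} and 30 unfound, the model returns 20
when the oid is kept in flight (the position stays before "unfound", so
a re-peer would restart from there) and 50 when the oid is erased (the
position reaches the end, so a re-peer would have nothing to re-scan).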
If my understanding is correct, it looks like we have to:
1) in master, fix the assertion failure, probably by just removing the
assertion, and backport the fix.
2) in nautilus (direct commit), make the EC backend not remove the
"unfound" oid from `backfills_in_flight`, to match the post-nautilus
behavior.