Hi,
I would like to bring some attention to a problem we have been
observing with nautilus, and which I reported here [1].
If a pg is in the backfill_unfound state ("unfound" objects were
detected during backfill) and one of the osds from the active set is
restarted, the state changes to clean, losing the information about
the unfound objects.
When I tried to reproduce the issue on master with the same
scenario, the state did not change, but I observed the primary
osd crash after a non-primary osd was restarted.
I looked through the commit log and did not find a commit explicitly
saying (or hinting) that this problem was addressed in master, and
I see there has been a large refactoring of the related code since
nautilus. So perhaps the issue was "solved" as a side effect of the
refactoring?
We would love to see the problem fixed in nautilus, and I would
like to backport the "fix", but right now I don't clearly
understand whether there really was a fix in master, or what to do
about the crash that may be related to the "fix".
I could try to find the commit that changed the behaviour by
bisecting, but that looks like a long way, so I want to ask here
first if anybody has a hint.
[1]
https://tracker.ceph.com/issues/50351
Thanks,
--
Mykola Golub