Hi Song,
On Mon, 13 Jan 2020, song wrote:
Hi Sage,
Happy new year!
I am a software engineer from China. Recently I found an issue with fastinfo in Ceph and
would like to consult you about it.
In an EC deployment, suppose a peering process for a PG changes one shard's
last_update from lu1 (e1'3) to lu2 (e1'2). lu1 was written as fastinfo and lu2 was
written as info. After that, we restart this OSD and it loads its PGs again. When we
read the PG info from disk, lu1 (from fastinfo) is applied on top of lu2, which is
incorrect; the true value should be lu2. That may cause the next peering to execute
incorrectly and result in unfound objects.
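
The scenario above can be sketched with a toy model in Python. This is only an
illustration under stated assumptions, not Ceph's actual code: the dict-based
store, the "info" and "fastinfo" keys, and load_pg_info() are hypothetical
stand-ins, and versions are modelled as (epoch, version) tuples.

```python
# Toy model of the stale-fastinfo scenario; names are hypothetical, not Ceph APIs.

def load_pg_info(store):
    """Read the full info, then overlay fastinfo on top if the key is present."""
    info = dict(store["info"])
    fast = store.get("fastinfo")
    if fast is not None:
        info.update(fast)  # fastinfo fields override the full info fields
    return info

lu1 = (1, 3)  # e1'3
lu2 = (1, 2)  # e1'2

store = {}
store["info"] = {"last_update": (1, 1)}   # some earlier full info write
store["fastinfo"] = {"last_update": lu1}  # lu1 is written as fastinfo
store["info"] = {"last_update": lu2}      # peering rewinds to lu2, written as info;
                                          # the old fastinfo key is left behind

loaded = load_pg_info(store)              # simulates loading the PG after an OSD restart
print(loaded["last_update"] == lu1)       # the stale lu1 wins over the correct lu2
```

In this model the loaded last_update is lu1 even though the most recent write was
lu2, which is the incorrect state described above.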
I am currently considering the following two options:
1. delete fastinfo when we need to change info;
2. add an extra sequence number to the fastinfo and info structures to keep them in the
right order.
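
Both options can be sketched in the same toy style. Again, this is a rough
illustration rather than a proposed patch: the dict store, the SeqStore class and
all function names are hypothetical, and versions are (epoch, version) tuples.

```python
# Hypothetical sketch of the two options; none of these names are Ceph APIs.

def write_info_opt1(store, info):
    """Option 1: writing the full info also deletes any existing fastinfo."""
    store["info"] = dict(info)
    store.pop("fastinfo", None)

def write_fastinfo_opt1(store, fast):
    store["fastinfo"] = dict(fast)

def load_opt1(store):
    info = dict(store["info"])
    info.update(store.get("fastinfo", {}))
    return info

class SeqStore:
    """Option 2: every write carries an increasing sequence number."""

    def __init__(self):
        self.seq = 0
        self.keys = {}

    def _next(self):
        self.seq += 1
        return self.seq

    def write_info(self, info):
        self.keys["info"] = (self._next(), dict(info))

    def write_fastinfo(self, fast):
        self.keys["fastinfo"] = (self._next(), dict(fast))

    def load(self):
        info_seq, info = self.keys["info"]
        result = dict(info)
        if "fastinfo" in self.keys:
            fast_seq, fast = self.keys["fastinfo"]
            if fast_seq > info_seq:  # apply fastinfo only if it is newer than info
                result.update(fast)
        return result

lu1, lu2 = (1, 3), (1, 2)  # e1'3 and e1'2

# Option 1: the rewind to lu2 removes the stale fastinfo.
store1 = {}
write_info_opt1(store1, {"last_update": (1, 1)})
write_fastinfo_opt1(store1, {"last_update": lu1})
write_info_opt1(store1, {"last_update": lu2})
result1 = load_opt1(store1)

# Option 2: the stale fastinfo stays on disk but is ignored because it is older.
store2 = SeqStore()
store2.write_info({"last_update": (1, 1)})
store2.write_fastinfo({"last_update": lu1})
store2.write_info({"last_update": lu2})
result2 = store2.load()
```

In this model both variants load lu2 after the restart. Option 1 needs no format
change, while option 2 adds a field to both structures.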
I am looking forward to hearing your suggestions about this issue and preferred
solution.
If you need any more info, please let me know.
Ah, that does look like a bug. I've opened a tracker ticket for this,
https://tracker.ceph.com/issues/43580
Does that look right? I think the fix is pretty simple:
https://github.com/ceph/ceph/pull/32615
Thanks!
sage
thanks,
Song