On Sun, Jul 30, 2023 at 7:09 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> I'm missing how connectivity issues between the sites can lead to
> mirror snapshot creation being interrupted on the primary cluster.
> Isn't that operation local to the cluster?
>
> Also, do you know who/what is actually interrupting it? Even if mirror
> snapshot creation ends up taking a while for some reason, I don't think
> anything in RBD would interrupt it.
Indeed, thinking about it some more, it does not look like a connectivity
issue was what interrupted it.
The first thing we noticed was a large number of "purge" (former mirroring)
snapshots, so my logical assumption was that snapshot removal had been
interrupted. We also knew from the customer about "connectivity issues"
between the sites, so it was just my assumption that those interruptions
were due to network issues.
At first I thought that a "snap id leak" might have happened on snapshot
removal, and I used the test that created primary snapshots because it was
also removing snapshots. But reviewing the code, I have not found any
suspicious place where we could leak the snap id on snapshot removal, while
the testing showed that it is quite possible to leak it on snapshot creation.
So currently I think it happens on snapshot creation; I had just forgotten to
revisit my initial assumption about what could have caused the interruption.
OK, then another suspect could be the rbd_support mgr module.
They are still running Octopus (latest), there are more than 500 mirrored
images, and the snapshot schedule was configured at 3 minutes for each image.
I expect this could cause a considerable load and could somehow trigger this
interruption. Could it be due to blacklisting? (Recently we added to the
rbd_support module the ability to restart its rados client when it is
blacklisted.)
Now, after our recommendation, I believe they have changed the schedule to a
30-minute interval.
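For reference, the per-image schedule can be inspected and adjusted with the
rbd CLI; this is a sketch, and the pool/image names are placeholders:

```shell
# Show all configured mirror snapshot schedules, including per-image ones.
rbd mirror snapshot schedule ls --recursive

# Replace a too-aggressive 3m schedule with a 30m one for a single image.
# "mypool" and "myimage" are placeholder names.
rbd mirror snapshot schedule remove --pool mypool --image myimage 3m
rbd mirror snapshot schedule add --pool mypool --image myimage 30m
```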
Unfortunately, communication with the customer is troublesome; I get only
limited secondhand information, and there are a lot of assumptions here.
Currently we are more interested in how to actually fix the large number of
purged_snap keys in the monstore, but I still thought it would be useful to
report some details on how it could have happened.
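For what it's worth, the extent of the accumulation can be checked offline
with ceph-kvstore-tool. This is a sketch that assumes a rocksdb monstore at
the default path and that the purged snap records live under the osd_snap
prefix with a purged_snap_ key prefix; the mon must be stopped first:

```shell
# Stop the monitor before touching its store -- the store must not be in use.
# Adjust the store path for your mon id and deployment.
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-a/store.db \
    list osd_snap | grep -c purged_snap
```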
> This applies to regular user (i.e. non-mirror)
> snapshots too.
Sure, mirroring is just the case where you are likely to hit it, due to the
frequent snapshot creation.
Thanks,
--
Mykola Golub