On Sat, Jul 29, 2023 at 2:11 PM Mykola Golub <to.my.trociny(a)gmail.com> wrote:
> Hi,
>
> We have a customer with an abnormally large number of "osd_snap /
> purged_snap_{pool}_{snapid}" keys in the monstore db: almost 40
> million. Among other problems, it causes a very long mon
> synchronization on startup.
>
> Our understanding is that the cause is that a mirroring snapshot
> creation is very frequently interrupted in their environment, most
> likely due to connectivity issues between the sites. The assumption is
Hi Mykola,

I'm missing how connectivity issues between the sites can lead to
mirror snapshot creation being interrupted on the primary cluster.
Isn't that operation local to the cluster?

Also, do you know who/what is actually interrupting it? Even if mirror
snapshot creation ends up taking a while for some reason, I don't think
anything in RBD would interrupt it.
> based on the fact that they have a lot of rbd "trash" snapshots, which
> may happen when an rbd snapshot removal is interrupted. (A mirroring
> snapshot creation usually includes removal of some older snapshot to keep
> the total number of the image's mirroring snapshots under the limit.)
>
> We removed all "trash" snapshots manually, so currently they have a
> limited number of "expected" snapshots, but the number of purged_snap
> keys is still just as large.
> So, our understanding is that if an rbd snapshot creation is
> frequently interrupted, there is a chance it will be interrupted in or
> just after SnapshotCreateRequest::send_allocate_snap_id [1], where
> it requests a new snap id from the mon. As a result this id is never
> tracked by rbd and never removed, and snap id holes like this prevent
> "purged_snap_{pool}_{snapid}" ranges from ever merging.
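To illustrate why a hole matters: purged ranges can coalesce only when they touch, so a single snap ID that can never be deleted keeps its neighbors apart forever. A toy model of that merge rule (plain shell/awk over assumed half-open [begin,end) ranges; this is a simplification, not the actual mon code):

```shell
# Toy model of purged-snap range coalescing (NOT the real mon code):
# half-open [begin,end) ranges merge only when they touch or overlap.
merge_ranges() {
    sort -n | awk '{
        if (NR == 1) { b = $1; e = $2; next }
        if ($1 <= e) { if ($2 > e) e = $2 }  # touching/overlapping: extend
        else { print b, e; b = $1; e = $2 }  # gap (a snap ID hole): keep split
    } END { print b, e }'
}

# Snap IDs 1-9 purged except ID 5, which was allocated but never
# registered with the image, so it can never be deleted/purged:
printf '1 5\n6 10\n' | merge_ranges   # prints two ranges: "1 5" and "6 10"

# Had ID 5 been purged as well, the ranges would coalesce:
printf '1 5\n5 10\n' | merge_ranges   # prints a single range: "1 10"
```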
> To confirm that this scenario is likely, I ran the following simple test
> that interrupts rbd mirror snapshot creation at a random time:
>
>   for i in `seq 500`; do
>       rbd mirror image snapshot test &
>       PID=$!
>       sleep $((RANDOM % 5)).$((RANDOM % 10))
>       kill $PID && sleep 30
>   done
> Running this with debug_rbd=30, from the rbd client logs I see that it
> was interrupted in send_allocate_snap_id 74 times, which is (surprisingly)
> very high.
This applies to regular user (i.e. non-mirror) snapshots too.
> And after the experiment, and after removing the rbd image with all
> tracked snapshots (i.e. having the pool with no known rbd snapshots),
> I see "purged_snap_{pool}_{snapid}" keys for ranges that I believe will
> never be merged.
> So the questions are:
>
> 1) Is there a way we could improve this to avoid the monstore growing so large?
Nothing simple comes to mind. The issue is that getting a snap ID from
the monitor and registering a snapshot with the image on the OSDs are
fundamentally separate steps, with the latter requiring a snap ID from
the former. Unless the process of allocating a snap ID itself becomes
two-step, where a freshly allocated snap ID is initially marked
inactive and later, after it gets persisted, is switched to active
with a separate request to the monitor, one could always generate
"forgotten" snap IDs by trying hard enough. (I'm assuming that in
such a two-step process, monitors would clean up inactive snap IDs
after a timeout.)

In general, I don't think we are resilient to these scenarios.
I suspect there are many similar "some piece of metadata is left behind
if the command is killed at the wrong moment" issues lurking there.
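Just to make the idea concrete, the hypothetical two-step allocation could be modeled like this (a toy in-memory sketch in bash; the function names, the caller-supplied clock, and the timeout mechanics are all invented for illustration, none of this is monitor code):

```shell
# bash (uses associative arrays); toy model of the hypothetical
# two-step snap ID allocation -- all names and mechanics are invented.
declare -A STATE ALLOC_TIME   # snap_id -> state / allocation "time"
NEXT_ID=1

alloc_snap_id() {     # step 1: hand out an ID, initially inactive
    ALLOC_ID=$((NEXT_ID++))
    STATE[$ALLOC_ID]=inactive
    ALLOC_TIME[$ALLOC_ID]=$1   # caller passes "now" for determinism
}

activate_snap_id() {  # step 2: client confirms the ID was persisted
    [ "${STATE[$1]}" = inactive ] && STATE[$1]=active
}

gc_inactive() {       # mon reclaims IDs never activated in time
    local now=$1 timeout=$2 id
    for id in "${!STATE[@]}"; do
        if [ "${STATE[$id]}" = inactive ] &&
           [ $((now - ALLOC_TIME[$id])) -ge "$timeout" ]; then
            unset "STATE[$id]" "ALLOC_TIME[$id]"
        fi
    done
}

alloc_snap_id 0; a=$ALLOC_ID; activate_snap_id "$a"  # normal path
alloc_snap_id 1; b=$ALLOC_ID                         # client killed here
gc_inactive 100 60    # abandoned ID is reclaimed instead of leaking
echo "${STATE[$a]:-reclaimed} ${STATE[$b]:-reclaimed}"   # active reclaimed
```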
> 2) How can we fix the current situation in the cluster? Would it be safe
> enough to just run `ceph-kvstore-tool rocksdb store.db rm-prefix osd_snap`
> to remove all osd_snap keys (including the purged_epoch keys)? Due to the
> large db size I don't think it would be possible to selectively remove
> keys with the `ceph-kvstore-tool rocksdb store.db rm {prefix} {key}`
> command, so we may use only the `rm-prefix` command. Looking at the
> code and actually trying it in a test environment, it seems like it could
> work, but am I missing something dangerous here?
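For reference, if dropping the whole prefix does turn out to be acceptable, the mechanical part could look roughly like this (a sketch only: the unit name and data dir are illustrative, the store must be backed up first, each mon must be stopped while its store is modified, and whether removing all osd_snap keys is actually safe is exactly the open question above):

```shell
# Sketch only: paths and unit names are illustrative assumptions, and
# this presumes dropping the whole osd_snap prefix is safe, which is
# the open question. Run on each mon in turn, with that mon stopped.
systemctl stop ceph-mon@$(hostname -s)
cd /var/lib/ceph/mon/ceph-$(hostname -s)               # illustrative path
cp -a store.db store.db.bak                            # back up first
ceph-kvstore-tool rocksdb store.db rm-prefix osd_snap  # drop osd_snap keys
ceph-kvstore-tool rocksdb store.db compact             # reclaim the space
systemctl start ceph-mon@$(hostname -s)
```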
Adding Radek.
Thanks,
Ilya