On Sat, Jul 29, 2023 at 2:11 PM Mykola Golub <to.my.trociny(a)gmail.com> wrote:
> Hi,
>
> We have a customer with an abnormally large number of "osd_snap /
> purged_snap_{pool}_{snapid}" keys in the monstore db: almost 40
> million. Among other problems, it causes a very long mon
> synchronization on startup.
>
> Our understanding is that the cause is that a mirroring snapshot
> creation is very frequently interrupted in their environment, most
> likely due to connectivity issues between the sites. The assumption is
Hi Mykola,

I'm missing how connectivity issues between the sites can lead to
mirror snapshot creation being interrupted on the primary cluster.
Isn't that operation local to the cluster?

Also, do you know who/what is actually interrupting it? Even if mirror
snapshot creation ends up taking a while for some reason, I don't think
anything in RBD would interrupt it.
> based on the fact that they have a lot of rbd "trash" snapshots, which
> may happen when an rbd snapshot removal is interrupted. (A mirroring
> snapshot creation usually includes removal of some older snapshot to keep
> the total number of the image's mirroring snapshots under the limit.)
>
> We removed all "trash" snapshots manually, so currently they have a
> limited number of "expected" snapshots, but the number of purged_snap
> keys is still just as large.
> So, our understanding is that if an rbd snapshot creation is
> frequently interrupted, there is a chance it will be interrupted in or
> just after SnapshotCreateRequest::send_allocate_snap_id [1], where
> it requests a new snap id from the mon. As a result this id is never
> tracked by rbd and never removed, and snap id holes like this prevent
> "purged_snap_{pool}_{snapid}" ranges from ever merging.
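To illustrate why a hole matters: purged ranges can coalesce only when they touch, so a single snap ID that can never be deleted keeps its neighbors apart forever. A toy model of that merge rule (plain shell/awk over assumed half-open [begin,end) ranges; this is a simplification, not the actual mon code):

```shell
# Toy model of purged-snap range coalescing (NOT the real mon code):
# half-open [begin,end) ranges merge only when they touch or overlap.
merge_ranges() {
    sort -n | awk '{
        if (NR == 1) { b = $1; e = $2; next }
        if ($1 <= e) { if ($2 > e) e = $2 }  # touching/overlapping: extend
        else { print b, e; b = $1; e = $2 }  # gap (a snap ID hole): keep split
    } END { print b, e }'
}

# Snap IDs 1-9 purged except ID 5, which was allocated but never
# registered with the image, so it can never be deleted/purged:
printf '1 5\n6 10\n' | merge_ranges   # prints two ranges: "1 5" and "6 10"

# Had ID 5 been purged as well, the ranges would coalesce:
printf '1 5\n5 10\n' | merge_ranges   # prints a single range: "1 10"
```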
> To confirm that this scenario is likely, I ran the following simple test
> that interrupts rbd mirror snapshot creation at a random time:
>
>   for i in `seq 500`; do
>       rbd mirror image snapshot test &
>       PID=$!
>       sleep $((RANDOM % 5)).$((RANDOM % 10))
>       kill $PID && sleep 30
>   done
> Running this with debug_rbd=30, from the rbd client logs I see that it
> was interrupted in send_allocate_snap_id 74 times, which is (surprisingly)
> very high.
This applies to regular user (i.e. non-mirror) snapshots too.
> And after the experiment, and after removing the rbd image with all
> tracked snapshots (i.e. having the pool with no known rbd snapshots),
> I see "purged_snap_{pool}_{snapid}" keys for ranges that I believe will
> never be merged.
> So the questions are:
>
> 1) Is there a way we could improve this to avoid the monstore growing so large?
Nothing simple comes to mind. The issue is that getting a snap ID from
the monitor and registering a snapshot with the image on the OSDs are
fundamentally separate steps, with the latter requiring a snap ID from
the former. Unless the process of allocating a snap ID itself becomes
two-step, where a freshly allocated snap ID is initially marked
inactive and later, after it gets persisted, is switched to active
with a separate request to the monitor, one could always generate
"forgotten" snap IDs by trying hard enough. (I'm assuming that in
such a two-step process, monitors would clean up inactive snap IDs
after a timeout.)

In general, I don't think we are resilient to these scenarios.
I suspect there are many similar "some piece of metadata is left behind
if the command is killed at the wrong moment" issues lurking there.
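Just to make the idea concrete, the hypothetical two-step allocation could be modeled like this (a toy in-memory sketch in bash; the function names, the caller-supplied clock, and the timeout mechanics are all invented for illustration, none of this is monitor code):

```shell
# bash (uses associative arrays); toy model of the hypothetical
# two-step snap ID allocation -- all names and mechanics are invented.
declare -A STATE ALLOC_TIME   # snap_id -> state / allocation "time"
NEXT_ID=1

alloc_snap_id() {     # step 1: hand out an ID, initially inactive
    ALLOC_ID=$((NEXT_ID++))
    STATE[$ALLOC_ID]=inactive
    ALLOC_TIME[$ALLOC_ID]=$1   # caller passes "now" for determinism
}

activate_snap_id() {  # step 2: client confirms the ID was persisted
    [ "${STATE[$1]}" = inactive ] && STATE[$1]=active
}

gc_inactive() {       # mon reclaims IDs never activated in time
    local now=$1 timeout=$2 id
    for id in "${!STATE[@]}"; do
        if [ "${STATE[$id]}" = inactive ] &&
           [ $((now - ALLOC_TIME[$id])) -ge "$timeout" ]; then
            unset "STATE[$id]" "ALLOC_TIME[$id]"
        fi
    done
}

alloc_snap_id 0; a=$ALLOC_ID; activate_snap_id "$a"  # normal path
alloc_snap_id 1; b=$ALLOC_ID                         # client killed here
gc_inactive 100 60    # abandoned ID is reclaimed instead of leaking
echo "${STATE[$a]:-reclaimed} ${STATE[$b]:-reclaimed}"   # active reclaimed
```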
> 2) How can we fix the current situation in the cluster? Would it be safe
> enough to just run `ceph-kvstore-tool rocksdb store.db rm-prefix osd_snap`
> to remove all osd_snap keys (including the purged_epoch keys)? Due to the
> large db size I don't think it would be possible to selectively remove
> keys with the `ceph-kvstore-tool rocksdb store.db rm {prefix} {key}`
> command, so we may use only the `rm-prefix` command. Looking at the
> code and actually trying it in a test environment, it seems like it could
> work, but am I missing something dangerous here?
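For reference, if dropping the whole prefix does turn out to be acceptable, the mechanical part could look roughly like this (a sketch only: the unit name and data dir are illustrative, the store must be backed up first, each mon must be stopped while its store is modified, and whether removing all osd_snap keys is actually safe is exactly the open question above):

```shell
# Sketch only: paths and unit names are illustrative assumptions, and
# this presumes dropping the whole osd_snap prefix is safe, which is
# the open question. Run on each mon in turn, with that mon stopped.
systemctl stop ceph-mon@$(hostname -s)
cd /var/lib/ceph/mon/ceph-$(hostname -s)               # illustrative path
cp -a store.db store.db.bak                            # back up first
ceph-kvstore-tool rocksdb store.db rm-prefix osd_snap  # drop osd_snap keys
ceph-kvstore-tool rocksdb store.db compact             # reclaim the space
systemctl start ceph-mon@$(hostname -s)
```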
Adding Radek.
Thanks,
Ilya