Hi List,
    We are using RBD snapshots as scheduled backups for our databases: 24 hourly snapshots plus 30 daily snapshots are retained for each RBD image. It worked perfectly at the beginning, but as the number of volumes grew, more and more significant pitfalls showed up. We are now at ~700 volumes, which means creating 700 snapshots and rotating out another 700 snapshots every hour.
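
For context, this is roughly what the hourly rotation per volume looks like, sketched with the python-rbd bindings; the pool name, image name and snapshot naming scheme below are placeholders rather than our actual tooling:

# Rough sketch of the hourly rotation, assuming python-rados/python-rbd
# and a client keyring allowed to create/remove snapshots.
import datetime
import rados
import rbd

HOURLY_KEEP = 24   # hourly snapshots retained per image

def rotate_hourly(ioctx, image_name):
    with rbd.Image(ioctx, image_name) as img:
        # New hourly snapshot, named by timestamp so names sort by age.
        img.create_snap('hourly-' + datetime.datetime.utcnow().strftime('%Y%m%d-%H%M'))
        # Drop hourly snapshots that fall outside the retention window.
        hourly = sorted(s['name'] for s in img.list_snaps()
                        if s['name'].startswith('hourly-'))
        for old in hourly[:-HOURLY_KEEP]:
            img.remove_snap(old)   # each removal adds to the pool's removed_snaps

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')            # placeholder pool name
    try:
        rotate_hourly(ioctx, 'db-volume-0001')   # placeholder image name
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
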
  1.  Huge and frequent OSDMap updates
          The OSDMap is ~640 KB in size, with a long and scattered "removed_snaps" interval set. The holes in the removed_snaps set come from two parts.
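
To show why the set stays scattered instead of collapsing into a few ranges, here is a small standalone simulation (plain Python, not Ceph code, with our retention numbers plugged in): snap IDs are allocated from a single per-pool counter shared by all images, so the hourly snaps trimmed after 24 hours interleave with daily snaps that stay live for 30 days, and the removed IDs cannot coalesce.

# Standalone illustration (not Ceph code): how per-image rotation leaves a
# fragmented removed-snap-ID set when all images draw IDs from one
# pool-wide counter.
NUM_IMAGES = 700
HOURS = 24 * 40                   # simulate 40 days of rotation
HOURLY_KEEP, DAILY_KEEP = 24, 30

next_id = 1
hourly = {i: [] for i in range(NUM_IMAGES)}   # per-image live hourly snap IDs
daily = {i: [] for i in range(NUM_IMAGES)}    # per-image live daily snap IDs
removed = set()

for hour in range(HOURS):
    for img in range(NUM_IMAGES):
        hourly[img].append(next_id)
        next_id += 1
        if hour % 24 == 0:                    # once a day, an extra daily snap
            daily[img].append(next_id)
            next_id += 1
        if len(hourly[img]) > HOURLY_KEEP:
            removed.add(hourly[img].pop(0))
        if len(daily[img]) > DAILY_KEEP:
            removed.add(daily[img].pop(0))

# Each maximal run of consecutive removed IDs is one interval in the
# OSDMap's removed_snaps.
ids = sorted(removed)
intervals = 1 + sum(1 for a, b in zip(ids, ids[1:]) if b != a + 1)
print(f"{len(ids)} removed snap IDs form {intervals} disjoint intervals")

With these numbers the removed IDs end up split into on the order of twenty thousand disjoint intervals, which is the kind of fragmentation that keeps removed_snaps long in the OSDMap.
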
  2.  pg_stat updates burn out the MGR
          Starting from Mimic, each PG by default reports up to 500 (osd_max_snap_prune_intervals_per_epoch) purged-snapshot intervals to the MGR, which significantly inflates the size of pg_stat and left our MGR using 20 GB+ of memory and 260%+ CPU (mostly in the messenger threads and the MGR_FIN thread), and very unresponsive. Reducing osd_max_snap_prune_intervals_per_epoch to 10 fixed the issue in our environment.
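
For reference, here is a sketch of pushing that override into the central config db (Mimic+) through the python-rados mon_command interface; the exact "config set" JSON schema (who/name/value) is our assumption, and the same change can of course be made from the CLI:

# Sketch: set osd_max_snap_prune_intervals_per_epoch = 10 for all OSDs via
# the mon "config set" command (central config db, Mimic+).  Assumes
# python-rados, an admin keyring, and that the who/name/value fields below
# match the command schema.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    cmd = json.dumps({
        "prefix": "config set",
        "who": "osd",
        "name": "osd_max_snap_prune_intervals_per_epoch",
        "value": "10",
    })
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    if ret != 0:
        raise RuntimeError('config set failed: ' + outs)
finally:
    cluster.shutdown()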

  3.  SnapTrim IO overhead
          There are tuning knobs to control the speed of snaptrim, but it still has to keep up with the snapshot creation rate. What is more, snaptrim introduces huge write amplification in the RocksDB WAL, maybe due to the 4K alignment in the WAL: we observed 156 GB written to the WAL while trimming 100 snapshots, while the generated L0 was only 4.63 GB, a ~34x blow-up that seems consistent with WAL page-alignment amplification. The PG purges snapshots from snap_trimq one by one; we are thinking that if several purged snapshots for a given volume could be compacted and trimmed together, we might get better efficiency (the SnapSet of a given object would only need to be changed once).
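
To make the batching idea a bit more concrete, a toy sketch follows (plain Python, emphatically not the real PrimaryLogPG/SnapMapper code): subtract the whole batch of purged snaps from each clone in one pass, so the object's SnapSet is rewritten (and its WAL write paid) once per object instead of once per snap.

# Toy sketch of batched trimming -- NOT the actual OSD trimming code.
from dataclasses import dataclass, field

@dataclass
class Clone:
    snaps: set                                  # snap IDs still referencing this clone

@dataclass
class SnapSet:
    clones: dict = field(default_factory=dict)  # clone id -> Clone

def trim_batch(snapset: SnapSet, purged: set) -> int:
    """Drop every purged snap from every clone in one pass; return how many
    clones became unreferenced and can be deleted."""
    freed = 0
    for cid, clone in list(snapset.clones.items()):
        clone.snaps -= purged                   # whole batch subtracted at once
        if not clone.snaps:
            del snapset.clones[cid]             # clone no longer referenced
            freed += 1
    # A real implementation would persist the new SnapSet plus the clone
    # deletions as a single ObjectStore transaction at this point.
    return freed

# Example: purging snaps {5, 6, 7} frees the first clone and shrinks the
# second, with a single update to the object's SnapSet.
ss = SnapSet({1: Clone({5, 6, 7}), 2: Clone({7, 8})})
print(trim_batch(ss, {5, 6, 7}), ss.clones)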

  4.  Deep-scrub on objects with hundreds of snapshots is super slow and pushed osd_op_w_latency up ~10x in our environment; we have not yet deep-dived into it.
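
In case it is useful to others, here is a quick sketch of ranking images by snapshot count with python-rbd (the pool name is a placeholder), to spot the images most likely to hurt during deep-scrub:

# Sketch: rank images in a pool by snapshot count.  Assumes
# python-rados/python-rbd; 'rbd' is a placeholder pool name.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')
    try:
        counts = []
        for name in rbd.RBD().list(ioctx):
            with rbd.Image(ioctx, name, read_only=True) as img:
                counts.append((sum(1 for _ in img.list_snaps()), name))
        # Print the 20 images with the most snapshots.
        for n, name in sorted(counts, reverse=True)[:20]:
            print('%5d  %s' % (n, name))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()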

  5.  How does cache tiering work with snapshots? Does a cache tier help with write performance in this case?

      There are several outstanding PRs, such as https://github.com/ceph/ceph/pull/28330, to optimize snaptrim and, in particular, get rid of removed_snaps. We believe that will partly help with #1, but we are not sure how much it helps the other issues. As this is a production environment, upgrading to an Octopus RC is not an option at the moment; we will try it out once a stable release is available.


-Xiaoxi