On Wed, 19 Feb 2020, Xiaoxi Chen wrote:
Hi List,
We are using RBD snapshots as point-in-time backups for DBs: 24 hourly
snapshots plus 30 daily snapshots are kept for each RBD image. This worked
well at the beginning, but as the number of volumes grew, more and more
significant pitfalls appeared. We are now at ~700 volumes, which means
creating 700 snapshots and rotating out 700 snapshots every hour.
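For reference, one rotation pass looks roughly like the sketch below
(python-rbd; the pool, image, and snapshot names are made-up examples and
the retention logic is simplified, the real job loops over all ~700 images):

  import rados
  import rbd

  # Minimal sketch of one hourly rotation pass for a single image.
  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  ioctx = cluster.open_ioctx('rbd')                  # example pool name

  with rbd.Image(ioctx, 'db-volume-0001') as img:    # example image name
      img.create_snap('hourly-2020-02-19-0000')      # take the new hourly snap
      hourly = sorted(s['name'] for s in img.list_snaps()
                      if s['name'].startswith('hourly-'))
      for name in hourly[:-24]:                      # keep the newest 24
          img.remove_snap(name)                      # each removal bumps snap_seq

  ioctx.close()
  cluster.shutdown()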
1. Huge and frequent OSDMap updates

The OSDMap is ~640K in size, with a long and scattered "removed_snaps"
interval set. The holes in removed_snaps come from two sources:

- In our use case the daily snapshots are kept longer than the hourly
  ones, so each retained daily snapshot leaves a hole in the
  removed_snaps interval set.
- https://github.com/ceph/ceph/blob/v14.2.4/src/osd/osd_types.cc#L1583-L1586
  adds a new snapid for each snapshot removal; according to the comment,
  the new snapid is intended to keep the interval_set contiguous.
  However, I cannot understand how that works: to me it looks like this
  behavior creates even more holes when creates and deletes interleave
  with each other (see the sketch below).

After processing 4 or 5 versions of the map, the RocksDB write-ahead log
(WAL) is full and the corresponding memtable has to be flushed to disk.
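To make the second point concrete, here is a toy model in Python (not the
actual Ceph code; it only mimics the create/delete bookkeeping as I read
add_unmanaged_snap()/remove_unmanaged_snap(), so treat it as a sketch):

  # Toy model: create allocates snapid = ++snap_seq; delete inserts the
  # deleted snapid plus the freshly bumped snap_seq into removed_snaps
  # ("just to try to keep the interval_set contiguous").
  def intervals(ids):
      """Collapse a set of snapids into sorted contiguous (start, end) runs."""
      runs, run = [], []
      for i in sorted(ids):
          if run and i == run[-1] + 1:
              run.append(i)
          else:
              if run:
                  runs.append((run[0], run[-1]))
              run = [i]
      if run:
          runs.append((run[0], run[-1]))
      return runs

  snap_seq, removed, live = 0, set(), []

  def create():
      global snap_seq
      snap_seq += 1
      live.append(snap_seq)

  def delete(snapid):
      global snap_seq
      live.remove(snapid)
      removed.add(snapid)
      snap_seq += 1
      removed.add(snap_seq)      # the extra snapid added on every removal

  # Interleave creates with rotating out the oldest snapshot.
  for _ in range(10):
      create()
      if len(live) > 3:
          delete(live[0])

  print(intervals(removed))
  # -> [(1, 11), (13, 13), (15, 15), (17, 17)]  i.e. the holes keep growing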
What version are you running? The removed_snaps code was reworked in
octopus. You should only see recently deleted snaps in the OSDMap.
2. pg_stat updates burn out the MGR

Starting from Mimic, each PG by default reports up to 500
(osd_max_snap_prune_intervals_per_epoch) purged-snapshot intervals to the
MGR, which significantly inflates the size of pg_stat and leaves the MGR
using 20GB+ of memory and 260%+ CPU (mostly on the messenger threads and
the MGR_FIN thread) and very unresponsive. Reducing
osd_max_snap_prune_intervals_per_epoch to 10 fixed the issue in our env.
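For anyone wanting to apply the same tuning, the librados equivalent of
"ceph config set osd osd_max_snap_prune_intervals_per_epoch 10" would be
roughly the following (assuming a Nautilus-style centralized config):

  import json
  import rados

  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  # Same effect as: ceph config set osd osd_max_snap_prune_intervals_per_epoch 10
  ret, out, errs = cluster.mon_command(json.dumps({
      'prefix': 'config set',
      'who': 'osd',
      'name': 'osd_max_snap_prune_intervals_per_epoch',
      'value': '10',
  }), b'')
  print(ret, errs)
  cluster.shutdown()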
3. SnapTrim IO overhead

There are tuning knobs to control the speed of snaptrim, but it still has
to keep up with the snapshot creation rate. What is more, snaptrim
introduces huge amplification in the RocksDB WAL, maybe due to the 4K
alignment in the WAL: we observed 156GB written to the WAL while trimming
100 snapshots, while the generated L0 was only 4.63GB, which seems
related to WAL page-align amplification. The PG purges snapshots from
snaptrim_q one by one; we are thinking that if several purged snapshots
for a given volume could be compacted and trimmed together, perhaps we
could get better efficiency (we would only need to change the snapset for
a given object once).
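A quick back-of-envelope on the alignment theory (only the 156GB and
4.63GB figures are measured; the 4096-byte block size is the assumed
padding unit):

  # Rough estimate of the implied padding amplification.
  wal_written = 156 * 2**30            # bytes written to the RocksDB WAL
  l0_generated = 4.63 * 2**30          # bytes that ended up in L0
  amplification = wal_written / l0_generated
  print(f"observed amplification: {amplification:.1f}x")        # ~33.7x

  # If each small snaptrim transaction is padded out to a full 4096-byte
  # WAL block, the useful payload per block implied by that ratio is:
  print(f"implied payload per block: {4096 / amplification:.0f} bytes")  # ~122

which would be consistent with each trim transaction carrying only a tiny
metadata update, hence the suspicion about page-align amplification.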
4. Deep-scrub on objects with hundreds of snapshots is super slow, and as
a result osd_op_w_latency surged 10x in our env; we have not deep-dived
into it yet.
5. How does cache tiering work with snapshots? Does a cache tier help
with write performance in this case?
There are several outstanding PRs, like
https://github.com/ceph/ceph/pull/28330, to optimize snaptrim and
especially to get rid of removed_snaps. We believe this will partly help
with #1, but we are not sure how much it helps with the others. As this
is a production environment, upgrading to the Octopus RC is not feasible
at the moment; we will try it out once a stable release is out.
-Xiaoxi