On Wed, 19 Feb 2020, Xiaoxi Chen wrote:
Hi List,
We are using RBD snapshots as timely backups for DBs: 24 hourly
snapshots plus 30 daily snapshots are kept for each RBD image. It worked
perfectly at the beginning, but as the number of volumes increased, more
and more significant pitfalls appeared. We are at ~700 volumes, which
means we create 700 snapshots and rotate out 700 snapshots every hour.
1. Huge and frequent OSDMap updates
The OSDMap is ~640K in size, with a long and scattered
"removed_snaps" interval set. The holes in the removed_snaps interval set
come from two sources:
- In our use case we keep the daily snapshots for longer, which turns
each retained daily snapshot into a hole in the removed_snaps interval set.
-
https://github.com/ceph/ceph/blob/v14.2.4/src/osd/osd_types.cc#L1583-L1586
adds a new snapid for each snapshot removal; according to the comment, the
new snapid is intended to keep the interval_set contiguous. However, I
cannot understand how that works; it seems to me that this behavior creates
more holes when creates and deletes interleave with each other.
- After processing 4 or 5 versions of the map, the RocksDB write-ahead log
(WAL) is full and the corresponding memtable has to be flushed to disk.
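To illustrate the fragmentation, here is a minimal Python sketch. This is a toy model, not Ceph's actual C++ interval_set, and the snap ids are made up; it only shows how retaining some snapshots longer while deleting the ones interleaved with them punches holes in the removed_snaps set:

```python
# Toy model of an interval set of removed snap ids (half-open [lo, hi)).
# Not Ceph's C++ interval_set; it just illustrates how "keep every Nth
# snapshot longer" fragments removed_snaps into many small intervals.

def insert(intervals, snapid):
    """Insert one snap id, merging with overlapping/adjacent intervals."""
    merged = []
    lo, hi = snapid, snapid + 1
    for a, b in intervals:
        if b < lo or a > hi:          # disjoint, not even adjacent: keep
            merged.append((a, b))
        else:                         # overlapping or adjacent: merge
            lo, hi = min(lo, a), max(hi, b)
    merged.append((lo, hi))
    return sorted(merged)

# Hourly snaps 0..23 get rotated out, except the one retained as a
# "daily" snapshot (hypothetically id 12). The survivor leaves a hole.
removed = []
for snapid in range(24):
    if snapid != 12:                  # id 12 is the retained daily snap
        removed = insert(removed, snapid)

print(removed)   # [(0, 12), (13, 24)] -- one hole per retained snapshot
```

With 30 retained dailies per image and ~700 images, the same effect scales to thousands of small intervals in the map.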
What version are you running? The removed_snaps code was reworked in
Octopus. You should only see recently deleted snaps in the OSDMap.
2. pgstat updates burn out the MGR
Starting from Mimic, each PG by default reports up to 500
(osd_max_snap_prune_intervals_per_epoch) purged-snapshot intervals to the
MGR, which significantly inflates the size of pg_stat, causing the MGR to
use 20GB+ of memory and 260%+ CPU (mostly in the messenger threads and the
MGR_FIN thread) and to become very unresponsive. Reducing
osd_max_snap_prune_intervals_per_epoch to 10 fixed the issue in our env.
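A back-of-envelope estimate of why the default hurts. The PG count and per-interval size below are assumptions for illustration (two 64-bit snapids per interval, a hypothetical 16384-PG deployment), not measured values from our cluster:

```python
# Rough estimate (assumed numbers) of purged-snaps payload carried in
# pg_stat reports at different osd_max_snap_prune_intervals_per_epoch.
INTERVAL_BYTES = 16          # two 64-bit snapids per purged interval
PGS = 16384                  # hypothetical total PG count

def pgstat_payload(intervals_per_pg):
    """Purged-snaps bytes in one full round of pg_stat reports."""
    return PGS * intervals_per_pg * INTERVAL_BYTES

default = pgstat_payload(500)    # Mimic default
reduced = pgstat_payload(10)     # the value that fixed our env
print(default // 2**20, "MiB vs", reduced // 2**20, "MiB")
```

Even ignoring encoding overhead, dropping the option from 500 to 10 shrinks this part of the payload 50x per reporting round, consistent with the MGR relief we saw.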
3. SnapTrim IO overhead
Though there are tuning knobs to control the speed of snaptrim, it still
needs to keep up with the snapshot creation rate. What is more, snaptrim
introduces huge write amplification in the RocksDB WAL, maybe due to 4K
alignment in the WAL. We observed 156GB written to the WAL while trimming
100 snapshots, yet the generated L0 was only 4.63GB, which seems related to
WAL page-alignment amplification. The PG purges snapshots from snap_trimq
one by one; we are wondering whether several purged snapshots for a given
volume could be compacted and trimmed together. Perhaps that would be more
efficient (we would only need to change the SnapSet for a given object once).
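The arithmetic behind the observation above, plus a sketch of the suspected mechanism. The 128-byte average update size is a hypothetical figure to show the order of magnitude, not something we measured:

```python
# Trimming 100 snapshots wrote 156 GB of WAL but produced only 4.63 GB
# of L0, i.e. ~34x amplification between WAL bytes and flushed bytes.
wal_written_gb = 156.0
l0_generated_gb = 4.63
amplification = wal_written_gb / l0_generated_gb
print(round(amplification, 1))   # ~33.7x

# Suspected mechanism (assumption): each tiny snaptrim update is padded
# out to a 4 KiB WAL page, so a ~128-byte SnapSet/key update costs 4096
# bytes of WAL -- a 32x padding factor, the same order as observed.
record_bytes = 128               # hypothetical average update size
page_bytes = 4096
padding_factor = page_bytes / record_bytes
print(padding_factor)
```

If that mechanism is right, batching several purged snaps of one volume into a single SnapSet update would amortize the per-record page padding as well as the per-object overhead.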
4. Deep-scrub on objects with hundreds of snapshots is super slow and
caused osd_op_w_latency to surge 10x in our env; we have not yet deep-dived.
5. How does cache tiering work with snapshots? Does a cache tier help with
write performance in this case?
There are several outstanding PRs, like
https://github.com/ceph/ceph/pull/28330, to optimize snaptrim and
especially to get rid of removed_snaps. We believe that will partly help
with #1, but we are not sure how much it helps the others. As the env is a
production env, upgrading to an Octopus RC is not feasible at the moment;
we will try it out once a stable release is out.
-Xiaoxi