On Wed, 19 Feb 2020, Xiaoxi Chen wrote:
Hi List,
We are using RBD snapshots as point-in-time backups for DBs: 24 hourly
snapshots plus 30 daily snapshots are kept for each RBD image. This worked
well at the beginning, but as the number of volumes grew, more and more
significant pitfalls appeared. We are now at ~700 volumes, which means
creating 700 snapshots and rotating out 700 snapshots every hour.
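For reference, one rotation pass looks roughly like the sketch below
(python-rbd; the pool, image, and snapshot names are made-up examples and
the retention logic is simplified, the real job loops over all ~700 images):

  import rados
  import rbd

  # Minimal sketch of one hourly rotation pass for a single image.
  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  ioctx = cluster.open_ioctx('rbd')                  # example pool name

  with rbd.Image(ioctx, 'db-volume-0001') as img:    # example image name
      img.create_snap('hourly-2020-02-19-0000')      # take the new hourly snap
      hourly = sorted(s['name'] for s in img.list_snaps()
                      if s['name'].startswith('hourly-'))
      for name in hourly[:-24]:                      # keep the newest 24
          img.remove_snap(name)                      # each removal bumps snap_seq

  ioctx.close()
  cluster.shutdown()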
1. Huge and frequent OSDMap updates

The OSDMap is ~640K in size, with a long and scattered "removed_snaps"
interval set. The holes in removed_snaps come from two sources:

- In our use case the daily snapshots are kept longer than the hourly
  ones, so each retained daily snapshot leaves a hole in the
  removed_snaps interval set.
- https://github.com/ceph/ceph/blob/v14.2.4/src/osd/osd_types.cc#L1583-L1586
  adds a new snapid for each snapshot removal; according to the comment,
  the new snapid is intended to keep the interval_set contiguous.
  However, I cannot understand how that works: to me it looks like this
  behavior creates even more holes when creates and deletes interleave
  with each other (see the sketch below).

After processing 4 or 5 versions of the map, the RocksDB write-ahead log
(WAL) is full and the corresponding memtable has to be flushed to disk.
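To make the second point concrete, here is a toy model in Python (not the
actual Ceph code; it only mimics the create/delete bookkeeping as I read
add_unmanaged_snap()/remove_unmanaged_snap(), so treat it as a sketch):

  # Toy model: create allocates snapid = ++snap_seq; delete inserts the
  # deleted snapid plus the freshly bumped snap_seq into removed_snaps
  # ("just to try to keep the interval_set contiguous").
  def intervals(ids):
      """Collapse a set of snapids into sorted contiguous (start, end) runs."""
      runs, run = [], []
      for i in sorted(ids):
          if run and i == run[-1] + 1:
              run.append(i)
          else:
              if run:
                  runs.append((run[0], run[-1]))
              run = [i]
      if run:
          runs.append((run[0], run[-1]))
      return runs

  snap_seq, removed, live = 0, set(), []

  def create():
      global snap_seq
      snap_seq += 1
      live.append(snap_seq)

  def delete(snapid):
      global snap_seq
      live.remove(snapid)
      removed.add(snapid)
      snap_seq += 1
      removed.add(snap_seq)      # the extra snapid added on every removal

  # Interleave creates with rotating out the oldest snapshot.
  for _ in range(10):
      create()
      if len(live) > 3:
          delete(live[0])

  print(intervals(removed))
  # -> [(1, 11), (13, 13), (15, 15), (17, 17)]  i.e. the holes keep growing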
What version are you running? The removed_snaps code was reworked in
octopus. You should only see recently deleted snaps in the OSDMap.
2. pg_stat updates burn out the MGR

Starting from Mimic, each PG by default reports up to 500
(osd_max_snap_prune_intervals_per_epoch) purged-snapshot intervals to the
MGR, which significantly inflates the size of pg_stat and leaves the MGR
using 20GB+ of memory and 260%+ CPU (mostly on the messenger threads and
the MGR_FIN thread) and very unresponsive. Reducing
osd_max_snap_prune_intervals_per_epoch to 10 fixed the issue in our env.
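For anyone wanting to apply the same tuning, the librados equivalent of
"ceph config set osd osd_max_snap_prune_intervals_per_epoch 10" would be
roughly the following (assuming a Nautilus-style centralized config):

  import json
  import rados

  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  # Same effect as: ceph config set osd osd_max_snap_prune_intervals_per_epoch 10
  ret, out, errs = cluster.mon_command(json.dumps({
      'prefix': 'config set',
      'who': 'osd',
      'name': 'osd_max_snap_prune_intervals_per_epoch',
      'value': '10',
  }), b'')
  print(ret, errs)
  cluster.shutdown()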
3. SnapTrim IO overhead

There are tuning knobs to control the speed of snaptrim, but it still has
to keep up with the snapshot creation rate. What is more, snaptrim
introduces huge amplification in the RocksDB WAL, maybe due to the 4K
alignment in the WAL: we observed 156GB written to the WAL while trimming
100 snapshots, while the generated L0 was only 4.63GB, which seems
related to WAL page-align amplification. The PG purges snapshots from
snaptrim_q one by one; we are thinking that if several purged snapshots
for a given volume could be compacted and trimmed together, perhaps we
could get better efficiency (we would only need to change the snapset for
a given object once).
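A quick back-of-envelope on the alignment theory (only the 156GB and
4.63GB figures are measured; the 4096-byte block size is the assumed
padding unit):

  # Rough estimate of the implied padding amplification.
  wal_written = 156 * 2**30            # bytes written to the RocksDB WAL
  l0_generated = 4.63 * 2**30          # bytes that ended up in L0
  amplification = wal_written / l0_generated
  print(f"observed amplification: {amplification:.1f}x")        # ~33.7x

  # If each small snaptrim transaction is padded out to a full 4096-byte
  # WAL block, the useful payload per block implied by that ratio is:
  print(f"implied payload per block: {4096 / amplification:.0f} bytes")  # ~122

which would be consistent with each trim transaction carrying only a tiny
metadata update, hence the suspicion about page-align amplification.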
4. Deep-scrub on objects with hundreds of snapshots is super slow, and as
a result osd_op_w_latency surged 10x in our env; we have not deep-dived
into it yet.
5. How does cache tiering work with snapshots? Does a cache tier help
with write performance in this case?
There are several outstanding PRs, like
https://github.com/ceph/ceph/pull/28330, to optimize snaptrim and
especially to get rid of removed_snaps. We believe this will partly help
with #1, but we are not sure how much it helps with the others. As this
is a production environment, upgrading to the Octopus RC is not feasible
at the moment; we will try it out once a stable release is out.
-Xiaoxi