Thanks again.
I will try PetaSAN later. How big is the recommended cache size
(dm-writecache) for an OSD?
The actual number of partitions per SSD matters more: each partition
serves one HDD/OSD, and we allow 1-8.
For size, anything above 50GB is good; more will help, especially with
read caching. You need RAM equal to 2% of the cache size, so a 100GB
partition requires 2GB of RAM. The RAM is used internally for
bookkeeping; no data caching happens in RAM.
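As a rough illustration (example numbers, not a PetaSAN default), the
RAM overhead can be estimated like this:

  # Minimal sizing sketch for dm-writecache partitions, based on the
  # figures above: >= 50GB per partition, ~2% of cache size as RAM for
  # internal bookkeeping (no data is cached in RAM). Example values only.
  def writecache_ram_gb(partition_size_gb: float, partitions_per_ssd: int) -> float:
      """RAM (GB) used for dm-writecache bookkeeping on one SSD."""
      if not 1 <= partitions_per_ssd <= 8:
          raise ValueError("we allow 1-8 cache partitions per SSD")
      return 0.02 * partition_size_gb * partitions_per_ssd

  # e.g. 4 x 100GB cache partitions on one SSD -> ~8GB of RAM on that host
  print(writecache_ram_gb(100, 4))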
/Maged
------------------------------------------------------------------------
huxiaoyu(a)horebdata.cn
*From:* Maged Mokhtar <mailto:mmokhtar@petasan.org>
*Date:* 2020-04-12 21:34
*To:* huxiaoyu(a)horebdata.cn <mailto:huxiaoyu@horebdata.cn>; Reed
Dier <mailto:reed.dier@focusvq.com>; jesper <mailto:jesper@krogh.cc>
*CC:* ceph-users <mailto:ceph-users@ceph.io>
*Subject:* Re: [ceph-users] Re: Recommendation for decent write
latency performance from HDDs
On 12/04/2020 20:35, huxiaoyu(a)horebdata.cn wrote:
That said, with a recent kernel such as a 4.19 stable release, and
a decent enterprise SSD such as the Intel D4510/4610, I do not need
to worry about data safety with dm-writecache.
Thanks a lot.
samuel
The patch recently went into 5.4 and was backported to 4.18+, so you
need to check that your kernel has it.
This will guarantee that a fua (force unit access)/sync write
generated by the OSD will end up on media before a successful
return. Enterprise drives, and any drive that supports PLP, will
guarantee stored data is not corrupted by a power failure.
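As a rough first check (illustrative only - a version number alone
does not prove your distribution carries the backport, so verify
against its changelog), something like this flags kernels that are
clearly too old:

  # Coarse kernel version check for the dm-writecache fua/sync fix
  # (mainline 5.4, backported to some 4.18+ kernels). Not authoritative.
  import platform

  def kernel_at_least(major: int, minor: int) -> bool:
      parts = platform.release().split(".")
      return (int(parts[0]), int(parts[1])) >= (major, minor)

  if kernel_at_least(5, 4):
      print("mainline kernel new enough to include the fua patch")
  elif kernel_at_least(4, 18):
      print("may carry a backport - check your distro's changelog")
  else:
      print("too old for the dm-writecache fua fix")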
If you wish, you can install PetaSAN to test cache performance
yourself.
/Maged
> ------------------------------------------------------------------------
> huxiaoyu(a)horebdata.cn
>
> *From:* Maged Mokhtar <mailto:mmokhtar@petasan.org>
> *Date:* 2020-04-12 20:03
> *To:* huxiaoyu(a)horebdata.cn <mailto:huxiaoyu@horebdata.cn>;
> Reed Dier <mailto:reed.dier@focusvq.com>; jesper
> <mailto:jesper@krogh.cc>
> *CC:* ceph-users <mailto:ceph-users@ceph.io>
> *Subject:* Re: [ceph-users] Re: Recommendation for decent
> write latency performance from HDDs
>
>
> On 12/04/2020 18:10, huxiaoyu(a)horebdata.cn wrote:
>> Dear Maged Mokhtar,
>>
>> It is very interesting to learn that your experiments show
>> dm-writecache to be better than the alternatives. I
>> have two questions:
>
> Yes, much better.
>
>>
>> 1. Can one cache device serve multiple HDDs? I know bcache
>> can do this, which is convenient. I don't know whether
>> dm-writecache has such a feature.
>
> It works on a partition, so you can split your cache disk into
> several partitions to support multiple OSDs; in our UI we
> allow 1-8 partitions.
>
>> 2. Did you test whether write-back to disks from
>> dm-writecache is power-safe or not? As far as I know, bcache
>> does not guarantee power-safe writebacks, so I have to turn
>> off the HDD write cache (otherwise data loss may occur).
>>
> Get a recent kernel and ensure it has the fua patch
> mentioned; this will correctly handle sync writes, else you
> may lose data. You also need a recent lvm tool set that
> supports dm-writecache, and you need to use an SSD with PLP
> support (enterprise models and some consumer models): some
> cheaper SSDs without PLP can lose existing stored
> data on power loss, since their write cycle involves a
> read/erase/write of a block, so a power failure can erase
> already stored data on such consumer devices. We also have
> another patch (see our source) that adds mirroring of metadata
> to dm-writecache to handle this, but that is not needed for
> decent drives.
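> As a minimal sketch of what such a setup could look like outside our
> UI (illustrative only - device names, the VG name and sizes are
> assumed, and this is not the PetaSAN deployment code), using an lvm2
> that supports --type writecache:
>
>   # Assemble one HDD OSD volume with a dm-writecache partition in
>   # front of it. All names/sizes below are placeholders.
>   import subprocess
>
>   def run(cmd):
>       print("+", " ".join(cmd))
>       subprocess.run(cmd, check=True)
>
>   vg = "vg_osd0"                              # hypothetical VG name
>   run(["vgcreate", vg, "/dev/sdb"])           # slow HDD for the OSD data
>   run(["vgextend", vg, "/dev/nvme0n1p1"])     # one cache partition on the PLP SSD
>   run(["lvcreate", "-n", "osd0", "-l", "100%PVS", vg, "/dev/sdb"])
>   run(["lvcreate", "-n", "cache0", "-L", "100G", vg, "/dev/nvme0n1p1"])
>   # attach the fast LV as a writecache in front of the slow LV
>   run(["lvconvert", "-y", "--type", "writecache", "--cachevol", "cache0", vg + "/osd0"])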
>
>> best regards,
>>
>> samuel
>>
>>
>>
>>
>> ------------------------------------------------------------------------
>> huxiaoyu(a)horebdata.cn
>>
>> *From:* Maged Mokhtar <mailto:mmokhtar@petasan.org>
>> *Date:* 2020-04-12 16:45
>> *To:* Reed Dier <mailto:reed.dier@focusvq.com>; jesper
>> <mailto:jesper@krogh.cc>
>> *CC:* ceph-users <mailto:ceph-users@ceph.io>
>> *Subject:* [ceph-users] Re: Recommendation for decent
>> write latency performance from HDDs
>> On 10/04/2020 23:17, Reed Dier wrote:
>> > Going to resurrect this thread to provide another option:
>> >
>> > LVM-cache, i.e. putting a cache device in front of the
>> > bluestore-LVM LV.
>> >
>> > I only mention this because I noticed it in the SUSE
>> > documentation for SES6 (based on Nautilus) here:
>> >
>> > https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
>> In the PetaSAN project we support dm-writecache and it
>> works very well. We have done tests with other cache devices
>> such as bcache and dm-cache, and it is much better. It is
>> mainly a write cache, but reads are served from the cache
>> device if present; it does not promote reads from the slow
>> device. Typically with HDD clusters write latency is the
>> issue; reads are helped by the OSD cache and, in the case of
>> replicated pools, are much faster anyway.
>> You need a recent kernel; we have an upstreamed patch:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/dri…
>> Depending on your distribution, you may also need an
>> updated lvm tool set.
>> /Maged
>> >
>> >> * If you plan to use a fast drive as an LVM cache for multiple
>> >>   OSDs, be aware that all OSD operations (including replication)
>> >>   will go through the caching device. All reads will be queried
>> >>   from the caching device, and are only served from the slow
>> >>   device in case of a cache miss. Writes are always applied to
>> >>   the caching device first, and are flushed to the slow device
>> >>   at a later time ('writeback' is the default caching mode).
>> >> * When deciding whether to utilize an LVM cache, verify whether
>> >>   the fast drive can serve as a front for multiple OSDs while
>> >>   still providing an acceptable amount of IOPS. You can test it
>> >>   by measuring the maximum amount of IOPS that the fast device
>> >>   can serve, and then dividing the result by the number of OSDs
>> >>   behind the fast device. If the result is lower or close to the
>> >>   maximum amount of IOPS that the OSD can provide without the
>> >>   cache, LVM cache is probably not suited for this setup.
>> >>
>> >> * The interaction of the LVM cache device with OSDs is
>> >>   important. Writes are periodically flushed from the caching
>> >>   device to the slow device. If the incoming traffic is
>> >>   sustained and significant, the caching device will struggle to
>> >>   keep up with incoming requests as well as the flushing
>> >>   process, resulting in performance drop. Unless the fast device
>> >>   can provide much more IOPS with better latency than the slow
>> >>   device, do not use LVM cache with a sustained high volume
>> >>   workload. Traffic in a burst pattern is more suited for LVM
>> >>   cache as it gives the cache time to flush its dirty data
>> >>   without interfering with client traffic. For a sustained low
>> >>   traffic workload, it is difficult to guess in advance whether
>> >>   using LVM cache will improve performance. The best test is to
>> >>   benchmark and compare the LVM cache setup against the WAL/DB
>> >>   setup. Moreover, as small writes are heavy on the WAL
>> >>   partition, it is suggested to use the fast device for the DB
>> >>   and/or WAL instead of an LVM cache.
>> >>
>> >
>> > So it sounds like you could partition your NVMe for either
>> > LVM-cache, DB/WAL, or both?
>> >
>> > Just figured this sounded a bit more akin to what you were
>> > looking for in your original post and figured I would share.
>> >
>> > I don't use this, but figured I would share it.
>> >
>> > Reed
>> >
>> >> On Apr 4, 2020, at 9:12 AM, jesper(a)krogh.cc
>> >> <mailto:jesper@krogh.cc> wrote:
>> >>
>> >> Hi.
>> >>
>> >> We have a need for "bulk" storage - but with decent write
>> >> latencies. Normally we would do this with a DAS with a RAID5
>> >> with a 2GB battery-backed write cache in front - as cheap as
>> >> possible but still getting the scalability features of Ceph.
>> >>
>> >> In our "first" Ceph cluster we did the same - just stuffed
>> >> BBWC into the OSD nodes and we're fine - but now we're onto
>> >> the next one, and a system like:
>> >>
>> >> https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
>> >>
>> >> does not support a RAID controller like that - but is branded
>> >> as for "Ceph Storage Solutions".
>> >>
>> >> It does, however, support 4 NVMe slots in the front - so some
>> >> level of "tiering" using the NVMe drives should be what is
>> >> "suggested" - but what do people do? What is recommended? I see
>> >> multiple options:
>> >>
>> >> Ceph tiering at the "pool" layer:
>> >> https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
>> >> and rumors that it is "deprecated":
>> >> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html…
>> >>
>> >> Pro: Abstract layer
>> >> Con: Deprecated? - Lots of warnings?
>> >>
>> >> Offloading the block.db onto NVMe / SSD:
>> >> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>> >>
>> >> Pro: Easy to deal with - seems heavily supported.
>> >> Con: As far as I can tell, this will only benefit the metadata
>> >> of the OSD - not the actual data. Thus a data commit to the OSD
>> >> will still be dominated by the write latency of the underlying -
>> >> very slow - HDD.
>> >>
>> >> Bcache:
>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
>> >>
>> >> Pro: Closest to the BBWC mentioned above - but with way, way
>> >> larger cache sizes.
>> >> Con: It is hard to see if I would end up being the only one on
>> >> the planet using this solution.
>> >>
>> >> Eat it - writes will be as slow as hitting dead rust - anything
>> >> that cannot live with that needs to be entirely on SSD/NVMe.
>> >>
>> >> Other?
>> >>
>> >> Thanks for your input.
>> >>
>> >> Jesper
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>