Thanks again.
I will try PetaSAN later. How big is the recommended cache size
(dm-writecache) for an OSD?
The actual number of partitions per SSD matters more: each partition
serves one HDD/OSD, and we allow 1-8.
For size, anything above 50GB is good; more will help, especially with
read caching. You need RAM equal to 2% of the cache size, so a 100GB
partition requires 2GB of RAM. The RAM is used internally for
bookkeeping; no data caching happens in RAM.
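As a rough illustration (example numbers, not a PetaSAN default), the
RAM overhead can be estimated like this:

  # Minimal sizing sketch for dm-writecache partitions, based on the
  # figures above: >= 50GB per partition, ~2% of cache size as RAM for
  # internal bookkeeping (no data is cached in RAM). Example values only.
  def writecache_ram_gb(partition_size_gb: float, partitions_per_ssd: int) -> float:
      """RAM (GB) used for dm-writecache bookkeeping on one SSD."""
      if not 1 <= partitions_per_ssd <= 8:
          raise ValueError("we allow 1-8 cache partitions per SSD")
      return 0.02 * partition_size_gb * partitions_per_ssd

  # e.g. 4 x 100GB cache partitions on one SSD -> ~8GB of RAM on that host
  print(writecache_ram_gb(100, 4))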
/Maged
------------------------------------------------------------------------
huxiaoyu(a)horebdata.cn
*From:* Maged Mokhtar <mailto:mmokhtar@petasan.org>
*Date:* 2020-04-12 21:34
*To:* huxiaoyu(a)horebdata.cn <mailto:huxiaoyu@horebdata.cn>; Reed
Dier <mailto:reed.dier@focusvq.com>; jesper <mailto:jesper@krogh.cc>
*CC:* ceph-users <mailto:ceph-users@ceph.io>
*Subject:* Re: [ceph-users] Re: Recommendation for decent write
latency performance from HDDs
On 12/04/2020 20:35, huxiaoyu(a)horebdata.cn wrote:
That said, with a recent kernel such as a 4.19 stable release, and
a decent enterprise SSD such as the Intel D4510/4610, I do not need
to worry about data safety with dm-writecache.
Thanks a lot.
samuel
The patch recently went into 5.4 and was backported to 4.18+, so you
need to check that your kernel has it.
This will guarantee that a fua (force unit access)/sync write
generated by the OSD will end up on media before a successful
return. Enterprise drives, and any drive that supports PLP, will
guarantee stored data is not corrupted by a power failure.
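As a rough first check (illustrative only - a version number alone
does not prove your distribution carries the backport, so verify
against its changelog), something like this flags kernels that are
clearly too old:

  # Coarse kernel version check for the dm-writecache fua/sync fix
  # (mainline 5.4, backported to some 4.18+ kernels). Not authoritative.
  import platform

  def kernel_at_least(major: int, minor: int) -> bool:
      parts = platform.release().split(".")
      return (int(parts[0]), int(parts[1])) >= (major, minor)

  if kernel_at_least(5, 4):
      print("mainline kernel new enough to include the fua patch")
  elif kernel_at_least(4, 18):
      print("may carry a backport - check your distro's changelog")
  else:
      print("too old for the dm-writecache fua fix")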
If you wish, you can install PetaSAN to test cache performance
yourself.
/Maged
> ------------------------------------------------------------------------
> huxiaoyu(a)horebdata.cn
>
> *From:* Maged Mokhtar <mailto:mmokhtar@petasan.org>
> *Date:* 2020-04-12 20:03
> *To:* huxiaoyu(a)horebdata.cn <mailto:huxiaoyu@horebdata.cn>;
> Reed Dier <mailto:reed.dier@focusvq.com>; jesper
> <mailto:jesper@krogh.cc>
> *CC:* ceph-users <mailto:ceph-users@ceph.io>
> *Subject:* Re: [ceph-users] Re: Recommendation for decent
> write latency performance from HDDs
>
>
> On 12/04/2020 18:10, huxiaoyu(a)horebdata.cn wrote:
>> Dear Maged Mokhtar,
>>
>> It is very interesting to learn that your experiments show
>> dm-writecache to be better than the alternatives. I
>> have two questions:
>
> Yes, much better.
>
>>
>> 1. Can one cache device serve multiple HDDs? I know bcache
>> can do this, which is convenient. I don't know whether
>> dm-writecache has such a feature.
>
> It works on a partition, so you can split your cache disk into
> several partitions to support multiple OSDs; in our UI we
> allow 1-8 partitions.
>
>> 2. Did you test whether write-back to disks from
>> dm-writecache is power-safe or not? As far as I know, bcache
>> does not guarantee power-safe writebacks, so I have to turn
>> off the HDD write cache (otherwise data loss may occur).
>>
> Get a recent kernel and ensure it has the fua patch
> mentioned; this will correctly handle sync writes, else you
> may lose data. You also need a recent lvm tool set that
> supports dm-writecache, and you need to use an SSD with PLP
> support (enterprise models and some consumer models): some
> cheaper SSDs without PLP can lose existing stored
> data on power loss, since their write cycle involves a
> read/erase/write of a block, so a power failure can erase
> already stored data on such consumer devices. We also have
> another patch (see our source) that adds mirroring of metadata
> to dm-writecache to handle this, but that is not needed for
> decent drives.
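> As a minimal sketch of what such a setup could look like outside our
> UI (illustrative only - device names, the VG name and sizes are
> assumed, and this is not the PetaSAN deployment code), using an lvm2
> that supports --type writecache:
>
>   # Assemble one HDD OSD volume with a dm-writecache partition in
>   # front of it. All names/sizes below are placeholders.
>   import subprocess
>
>   def run(cmd):
>       print("+", " ".join(cmd))
>       subprocess.run(cmd, check=True)
>
>   vg = "vg_osd0"                              # hypothetical VG name
>   run(["vgcreate", vg, "/dev/sdb"])           # slow HDD for the OSD data
>   run(["vgextend", vg, "/dev/nvme0n1p1"])     # one cache partition on the PLP SSD
>   run(["lvcreate", "-n", "osd0", "-l", "100%PVS", vg, "/dev/sdb"])
>   run(["lvcreate", "-n", "cache0", "-L", "100G", vg, "/dev/nvme0n1p1"])
>   # attach the fast LV as a writecache in front of the slow LV
>   run(["lvconvert", "-y", "--type", "writecache", "--cachevol", "cache0", vg + "/osd0"])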
>
>> best regards,
>>
>> samuel
>>
>>
>>
>>
>> ------------------------------------------------------------------------
>> huxiaoyu(a)horebdata.cn
>>
>> *From:* Maged Mokhtar <mailto:mmokhtar@petasan.org>
>> *Date:* 2020-04-12 16:45
>> *To:* Reed Dier <mailto:reed.dier@focusvq.com>; jesper
>> <mailto:jesper@krogh.cc>
>> *CC:* ceph-users <mailto:ceph-users@ceph.io>
>> *Subject:* [ceph-users] Re: Recommendation for decent
>> write latency performance from HDDs
>> On 10/04/2020 23:17, Reed Dier wrote:
>> > Going to resurrect this thread to provide another option:
>> >
>> > LVM-cache, i.e. putting a cache device in front of the
>> > bluestore-LVM LV.
>> >
>> > I only mention this because I noticed it in the SUSE
>> > documentation for SES6 (based on Nautilus) here:
>> >
>> > https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
>> In the PetaSAN project we support dm-writecache and it
>> works very well. We have done tests with other cache devices
>> such as bcache and dm-cache, and it is much better. It is
>> mainly a write cache, but reads are served from the cache
>> device if present; it does not promote reads from the slow
>> device. Typically with HDD clusters write latency is the
>> issue; reads are helped by the OSD cache and, in the case of
>> replicated pools, are much faster anyway.
>> You need a recent kernel; we have an upstreamed patch:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/dri…
>> Depending on your distribution, you may also need an
>> updated lvm tool set.
>> /Maged
>> >
>> >> * If you plan to use a fast drive as an LVM cache for multiple
>> >>   OSDs, be aware that all OSD operations (including replication)
>> >>   will go through the caching device. All reads will be queried
>> >>   from the caching device, and are only served from the slow
>> >>   device in case of a cache miss. Writes are always applied to
>> >>   the caching device first, and are flushed to the slow device
>> >>   at a later time ('writeback' is the default caching mode).
>> >> * When deciding whether to utilize an LVM cache, verify whether
>> >>   the fast drive can serve as a front for multiple OSDs while
>> >>   still providing an acceptable amount of IOPS. You can test it
>> >>   by measuring the maximum amount of IOPS that the fast device
>> >>   can serve, and then dividing the result by the number of OSDs
>> >>   behind the fast device. If the result is lower or close to the
>> >>   maximum amount of IOPS that the OSD can provide without the
>> >>   cache, LVM cache is probably not suited for this setup.
>> >>
>> >> * The interaction of the LVM cache device with OSDs is
>> >>   important. Writes are periodically flushed from the caching
>> >>   device to the slow device. If the incoming traffic is
>> >>   sustained and significant, the caching device will struggle to
>> >>   keep up with incoming requests as well as the flushing
>> >>   process, resulting in performance drop. Unless the fast device
>> >>   can provide much more IOPS with better latency than the slow
>> >>   device, do not use LVM cache with a sustained high volume
>> >>   workload. Traffic in a burst pattern is more suited for LVM
>> >>   cache as it gives the cache time to flush its dirty data
>> >>   without interfering with client traffic. For a sustained low
>> >>   traffic workload, it is difficult to guess in advance whether
>> >>   using LVM cache will improve performance. The best test is to
>> >>   benchmark and compare the LVM cache setup against the WAL/DB
>> >>   setup. Moreover, as small writes are heavy on the WAL
>> >>   partition, it is suggested to use the fast device for the DB
>> >>   and/or WAL instead of an LVM cache.
>> >>
>> >
>> > So it sounds like you could partition your NVMe for either
>> > LVM-cache, DB/WAL, or both?
>> >
>> > Just figured this sounded a bit more akin to what you were
>> > looking for in your original post and figured I would share.
>> >
>> > I don't use this, but figured I would share it.
>> >
>> > Reed
>> >
>> >> On Apr 4, 2020, at 9:12 AM, jesper(a)krogh.cc
>> >> <mailto:jesper@krogh.cc> wrote:
>> >>
>> >> Hi.
>> >>
>> >> We have a need for "bulk" storage - but with decent write
>> >> latencies. Normally we would do this with a DAS with a RAID5
>> >> with a 2GB battery-backed write cache in front - as cheap as
>> >> possible but still getting the scalability features of Ceph.
>> >>
>> >> In our "first" Ceph cluster we did the same - just stuffed
>> >> BBWC into the OSD nodes and we're fine - but now we're onto
>> >> the next one, and a system like:
>> >>
>> >> https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
>> >>
>> >> does not support a RAID controller like that - but is branded
>> >> as for "Ceph Storage Solutions".
>> >>
>> >> It does, however, support 4 NVMe slots in the front - so some
>> >> level of "tiering" using the NVMe drives should be what is
>> >> "suggested" - but what do people do? What is recommended? I see
>> >> multiple options:
>> >>
>> >> Ceph tiering at the "pool" layer:
>> >> https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
>> >> and rumors that it is "deprecated":
>> >> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html…
>> >>
>> >> Pro: Abstract layer
>> >> Con: Deprecated? - Lots of warnings?
>> >>
>> >> Offloading the block.db onto NVMe / SSD:
>> >> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>> >>
>> >> Pro: Easy to deal with - seems heavily supported.
>> >> Con: As far as I can tell, this will only benefit the metadata
>> >> of the OSD - not the actual data. Thus a data commit to the OSD
>> >> will still be dominated by the write latency of the underlying -
>> >> very slow - HDD.
>> >>
>> >> Bcache:
>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
>> >>
>> >> Pro: Closest to the BBWC mentioned above - but with way, way
>> >> larger cache sizes.
>> >> Con: It is hard to see if I would end up being the only one on
>> >> the planet using this solution.
>> >>
>> >> Eat it - writes will be as slow as hitting dead rust - anything
>> >> that cannot live with that needs to be entirely on SSD/NVMe.
>> >>
>> >> Other?
>> >>
>> >> Thanks for your input.
>> >>
>> >> Jesper
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>