I'm happy to announce another release of the go-ceph API
bindings. This is a regular release following our every-two-months release
cadence.
https://github.com/ceph/go-ceph/releases/tag/v0.3.0
The bindings aim to play a similar role to the "pybind" Python bindings in
the ceph tree, but for the Go language. These API bindings require the use of cgo.
There are already a few consumers of this library in the wild, and the
ceph-csi project is starting to make use of it.
Specific questions, comments, bugs, etc. are best directed at our GitHub
issue tracker.
---
John Mulligan
phlogistonjohn(a)asynchrono.us
jmulligan(a)redhat.com
Hello,
Is there a way to find out, from a central point, all the clients where
volumes are mapped?
We have a large fleet of machines that use ceph rbd volumes. For some
maintenance purposes, we need to find out whether a volume is mapped
anywhere before acting on it. Right now we query each client machine with
`rbd showmapped`. Is there a variant of this CLI that lists all mappings
from a single node (a ceph mon, for instance)?
Thanks,
Shridhar
Hello All,
Is there a way to specify that a lock (shared or exclusive) on an rbd
volume be released if the client machine becomes unreachable or
unresponsive?
In one of our clusters, we use rbd locks on volumes to provide a kind of
shared or exclusive access - to make sure there are no writers when
someone is reading and no readers when someone is writing.
However, we often run into issues when one of the machines hits a kernel
panic or similar, and the whole pipeline stalls.
So is there a way to tell ceph to release the lock if the client becomes
unavailable?
Thanks,
Shridhar
(resending to the new mailing list)
Dear Casey, Dear All,
We tested the migration from Luminous to Nautilus and noticed two regressions breaking the RGW integration in OpenStack:
1) the following config parameter is not working on Nautilus but is valid on Luminous and on Master:
rgw_keystone_implicit_tenants = swift
In the log: parse error setting 'rgw_keystone_implicit_tenants' to 'swift' (Expected option value to be integer, got 'swift')
This parameter is important to make RGW work for both S3 and Swift.
Setting it to false breaks Swift/OpenStack, and setting it to true makes S3 incompatible with DNS-style bucket names (with shared or public access).
Please note that path-style bucket names (e.g. endpoint/bucket) are deprecated by AWS and most clients only support DNS-style (bucket.endpoint)...
Ref.:
https://tracker.ceph.com/issues/24348
https://github.com/ceph/ceph/commit/3ba7be8d1ac7ee43e69eebb58263cd080cca1d38
2) server-side encryption (SSE-KMS) is broken on Nautilus.
To reproduce the issue:
s3cmd --access_key $ACCESSKEY --secret_key $SECRETKEY --host-bucket "%(bucket)s.$ENDPOINT" --host "$ENDPOINT" --region="$REGION" --signature-v2 --no-preserve --no-ssl --server-side-encryption --server-side-encryption-kms-id ${SECRET##*/} put helloenc.txt s3://testenc/
output:
upload: 'helloenc.txt' -> 's3://testenc/helloenc.txt' [1 of 1]
9 of 9 100% in 0s 37.50 B/s done
ERROR: S3 error: 403 (AccessDenied): Failed to retrieve the actual key, kms-keyid: cd0903db-c613-49be-96d9-165c02544bc7
rgw log: see below
TL;DR: after investigating, I found that radosgw was actually retrieving the Barbican secret correctly, but the validation of the HTTP status code (expected 200) was failing because of a bug in Nautilus.
My understanding is the following (please correct me):
The bug is in src/rgw/rgw_http_client.cc.
Since Nautilus, HTTP status codes are converted into error codes (200 becomes 0) during request processing.
This happens in RGWHTTPManager::reqs_thread_entry(), which centralizes the multi-threaded processing of (curl) HTTP requests.
That in itself is fine, but the member variable http_status of the class RGWHTTPClient is not updated with the resulting HTTP status code, so the variable keeps its initial value of 0.
Then, in src/rgw/rgw_crypt.cc, the logic still verifies that http_status is in the range [200,299], and this check fails...
I wrote the following one-line bugfix for src/rgw/rgw_http_client.cc:
diff --git a/src/rgw/rgw_http_client.cc b/src/rgw/rgw_http_client.cc
index d0f0baead6..7c115293ad 100644
--- a/src/rgw/rgw_http_client.cc
+++ b/src/rgw/rgw_http_client.cc
@@ -1146,6 +1146,7 @@ void *RGWHTTPManager::reqs_thread_entry()
status = -EAGAIN;
}
int id = req_data->id;
+ req_data->client->http_status = http_status;
finish_request(req_data, status);
switch (result) {
case CURLE_OK:
With this patch applied, s3cmd works fine with KMS server-side encryption.
Questions:
* Could someone please write a fix for regression 1) and open a PR?
* Could somebody also open a PR for 2)?
Thank you for your help. :-)
Cheers
Francois Scheurer
rgw log:
export CLUSTER=ceph; /home/local/ceph/build/bin/radosgw -f --cluster ${CLUSTER} --name client.rgw.$(hostname) --setuser ceph --setgroup ceph &
tail -fn0 /var/log/ceph/ceph-client.rgw.ewos1-osd1-stage.log | less -IS
2020-02-26 16:32:59.208 7fc1f1c54700 20 Getting KMS encryption key for key=cd0903db-c613-49be-96d9-165c02544bc7
2020-02-26 16:32:59.208 7fc1f1c54700 20 Requesting secret from barbican url=http://keystone.service.stage.i.ewcs.ch:5000/v3/auth/tokens
2020-02-26 16:32:59.208 7fc1f1c54700 20 ewdebug: RGWHTTPClient::process: http_status: 0
2020-02-26 16:32:59.208 7fc1f1c54700 20 ewdebug: RGWHTTP::process
2020-02-26 16:32:59.208 7fc1f1c54700 20 ewdebug: RGWHTTP::send
2020-02-26 16:32:59.208 7fc1f1c54700 20 sending request to http://keystone.service.stage.i.ewcs.ch:5000/v3/auth/tokens
2020-02-26 16:32:59.208 7fc1f1c54700 20 ssl verification is set to off
2020-02-26 16:32:59.208 7fc1f1c54700 20 ewdebug: RGWHTTPManager::add_request: client->init_request(req_data): 0
2020-02-26 16:32:59.208 7fc1f1c54700 20 register_request mgr=0x56374b865540 req_data->id=4, curl_handle=0x56374c77c4a0
2020-02-26 16:32:59.208 7fc1f1c54700 20 ewdebug: RGWHTTPManager::signal_thread(): write(thread_pipe[1], (void *)&buf, sizeof(buf)): 4
2020-02-26 16:32:59.208 7fc1f1c54700 20 ewdebug: RGWHTTPManager::add_request: signal_thread(): 0
2020-02-26 16:32:59.208 7fc1f1c54700 20 ewdebug: RGWHTTP::send: rgw_http_manager->add_request(req): 0
2020-02-26 16:32:59.208 7fc1f1c54700 20 ewdebug: RGWHTTP::process: send(req): 0
2020-02-26 16:32:59.208 7fc1f1c54700 20 ewdebug: struct rgw_http_req_data : public RefCountedObject : int wait() : ret: 0
2020-02-26 16:32:59.208 7fc2184a1700 20 link_request req_data=0x56374c96a240 req_data->id=4, curl_handle=0x56374c77c4a0
2020-02-26 16:32:59.608 7fc2184a1700 20 ewdebug: RGWHTTPManager::reqs_thread_entry: http_status: 201
2020-02-26 16:32:59.608 7fc2184a1700 20 ewdebug: RGWHTTPManager::reqs_thread_entry: rgw_http_error_to_errno(http_status): 0
2020-02-26 16:32:59.608 7fc2184a1700 20 ewdebug: RGWHTTPManager::reqs_thread_entry: finish_request(req_data, status): status: 0
2020-02-26 16:32:59.608 7fc2184a1700 20 ewdebug: struct rgw_http_req_data : public RefCountedObject : void finish(int r) : ret: 0
2020-02-26 16:32:59.652 7fc1f1c54700 5 ewdebug: request_key_from_barbican: Accept application/octet-stream X-Auth-Token gAAAAABeVo-xxx
2020-02-26 16:32:59.652 7fc1f1c54700 20 ewdebug: RGWHTTPClient::process: http_status: 0
2020-02-26 16:32:59.652 7fc1f1c54700 20 ewdebug: RGWHTTP::process
2020-02-26 16:32:59.652 7fc1f1c54700 20 ewdebug: RGWHTTP::send
2020-02-26 16:32:59.652 7fc1f1c54700 20 sending request to http://barbican.service.stage.i.ewcs.ch:9311/v1/secrets/cd0903db-c613-49be-…
2020-02-26 16:32:59.652 7fc1f1c54700 20 ewdebug: RGWHTTPManager::add_request: client->init_request(req_data): 0
2020-02-26 16:32:59.652 7fc1f1c54700 20 register_request mgr=0x56374b865540 req_data->id=5, curl_handle=0x56374c77c4a0
2020-02-26 16:32:59.652 7fc1f1c54700 20 ewdebug: RGWHTTPManager::signal_thread(): write(thread_pipe[1], (void *)&buf, sizeof(buf)): 4
2020-02-26 16:32:59.652 7fc1f1c54700 20 ewdebug: RGWHTTPManager::add_request: signal_thread(): 0
2020-02-26 16:32:59.652 7fc1f1c54700 20 ewdebug: RGWHTTP::send: rgw_http_manager->add_request(req): 0
2020-02-26 16:32:59.652 7fc1f1c54700 20 ewdebug: RGWHTTP::process: send(req): 0
2020-02-26 16:32:59.652 7fc1f1c54700 20 ewdebug: struct rgw_http_req_data : public RefCountedObject : int wait() : ret: 0
2020-02-26 16:32:59.652 7fc2184a1700 20 link_request req_data=0x56374c96a240 req_data->id=5, curl_handle=0x56374c77c4a0
=> 2020-02-26 16:32:59.752 7fc2184a1700 20 ewdebug: RGWHTTPManager::reqs_thread_entry: http_status: 200
2020-02-26 16:32:59.752 7fc2184a1700 20 ewdebug: RGWHTTPManager::reqs_thread_entry: rgw_http_error_to_errno(http_status): 0
2020-02-26 16:32:59.752 7fc2184a1700 20 ewdebug: RGWHTTPManager::reqs_thread_entry: finish_request(req_data, status): status: 0
2020-02-26 16:32:59.752 7fc2184a1700 20 ewdebug: struct rgw_http_req_data : public RefCountedObject : void finish(int r) : ret: 0
2020-02-26 16:32:59.752 7fc1f1c54700 5 ewdebug: request_key_from_barbican: secret_req.process: 0
=> 2020-02-26 16:32:59.752 7fc1f1c54700 5 ewdebug: request_key_from_barbican: secret_req.get_http_status: 0
2020-02-26 16:32:59.752 7fc1f1c54700 5 ewdebug: request_key_from_barbican: secret_req.get_http_status not in [200,299] range!
2020-02-26 16:32:59.752 7fc1f1c54700 5 Failed to retrieve secret from barbican:cd0903db-c613-49be-96d9-165c02544bc7
2020-02-26 16:32:59.752 7fc1f1c54700 5 ERROR: failed to retrieve actual key from key_id: cd0903db-c613-49be-96d9-165c02544bc7
2020-02-26 16:32:59.752 7fc1f1c54700 2 req 1 1.092s s3:put_obj completing
2020-02-26 16:32:59.752 7fc1f1c54700 2 req 1 1.092s s3:put_obj op status=-13
2020-02-26 16:32:59.752 7fc1f1c54700 2 req 1 1.092s s3:put_obj http status=403
2020-02-26 16:32:59.752 7fc1f1c54700 1 ====== req done req=0x56374c9808d0 op status=-13 http_status=403 latency=1.092s ======
=> we see that http_status is correct (200) but secret_req.get_http_status (returning the http_status member of class RGWHTTPClient) gives an incorrect value (0 instead of 200)
On 12/04/2020 21:41, huxiaoyu(a)horebdata.cn wrote:
> thanks again.
>
> I will try PetaSAN later. How big is the recommended cache size
> (dm-writecache) for an OSD?
>
The actual number of partitions per SSD is more important; each partition
serves 1 HDD/OSD, and we allow 1-8.
For size, anything above 50GB is good; more will help, especially with read
caching. You need 2% of the cache size in RAM, so a 100GB partition
requires 2GB of RAM. The RAM is used internally to keep track of things; no
data caching happens in RAM.
/Maged
On 12/04/2020 20:35, huxiaoyu(a)horebdata.cn wrote:
>
>
> That said, with a recent kernel such as a 4.19 stable release, and a
> decent enterprise SSD such as Intel D4510/4610, I do not need to worry
> about the data safety related to dm-writecache.
>
> thanks a lot.
>
> samuel
>
The patch recently went into 5.4 and was backported to 4.18+, so you need
to check that your version has it.
This guarantees that a FUA (force unit access)/sync write generated by the
OSD will end up on media before a successful return. Enterprise drives,
plus any drive that supports PLP (power-loss protection), guarantee that
stored data will not be corrupted by a power failure.
If you wish, you can install PetaSAN to test cache performance yourself.
/Maged
On 12/04/2020 18:10, huxiaoyu(a)horebdata.cn wrote:
> Dear Maged Mokhtar,
>
> It is very interesting to know that your experiment shows
> dm-writecache would be better than other alternatives. I have two
> questions:
Yes, much better.
>
> 1 Can one cache device serve multiple HDDs? I know bcache can do
> this, which is convenient. I don't know whether dm-writecache has such a
> feature.
It works on a partition, so you can split your disk into several
partitions to support multiple OSDs; in our UI we allow from 1 to 8 partitions.
> 2 Did you test whether write-back to disks from dm-writecache is
> power-safe or not? As far as I know, bcache does not guarantee power-safe
> writebacks, thus I have to turn off the HDD write cache (otherwise data
> loss may occur).
>
Get a recent kernel and ensure it has the FUA patch mentioned; this will
correctly handle sync writes, else you may lose data. You also need a
recent LVM tool set that supports dm-writecache. You also need to use an
SSD with PLP support (enterprise models and some consumer models); some
cheaper SSDs without PLP can lose existing stored data on power loss,
since their write cycle involves a read/erase/write of a block, so a
power loss can erase already-stored data on such consumer devices. We
also have another patch (see our source) that adds mirroring of metadata
to dm-writecache to handle this, but that is not needed for decent drives.
> best regards,
>
> samuel
>
>
>
>
> ------------------------------------------------------------------------
> huxiaoyu(a)horebdata.cn
>
> *From:* Maged Mokhtar <mailto:mmokhtar@petasan.org>
> *Date:* 2020-04-12 16:45
> *To:* Reed Dier <mailto:reed.dier@focusvq.com>; jesper
> <mailto:jesper@krogh.cc>
> *CC:* ceph-users <mailto:ceph-users@ceph.io>
> *Subject:* [ceph-users] Re: Recommendation for decent write
> latency performance from HDDs
> On 10/04/2020 23:17, Reed Dier wrote:
> > Going to resurrect this thread to provide another option:
> >
> > LVM-cache, i.e. putting a cache device in front of the
> > bluestore-LVM LV.
> >
> > I only mention this because I noticed it in the SUSE documentation
> > for SES6 (based on Nautilus) here:
> > https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
> In the PetaSAN project, we support dm-writecache and it works very well.
> We have done tests with other cache devices such as bcache and dm-cache,
> and it is much better. It is mainly a write cache; reads are served from
> the cache device if present, but it does not promote reads from the slow
> device. Typically with HDD clusters, write latency is the issue; reads
> are helped by the OSD cache and, in the case of replicated pools, are
> much faster anyway.
> You need a recent kernel; we have an upstreamed patch:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/dri…
> Depending on your distribution, you may also need an updated LVM tool set.
> /Maged
> >
> >> * If you plan to use a fast drive as an LVM cache for multiple
> >>   OSDs, be aware that all OSD operations (including replication)
> >>   will go through the caching device. All reads will be queried
> >>   from the caching device, and are only served from the slow device
> >>   in case of a cache miss. Writes are always applied to the caching
> >>   device first, and are flushed to the slow device at a later time
> >>   ('writeback' is the default caching mode).
> >> * When deciding whether to utilize an LVM cache, verify whether the
> >>   fast drive can serve as a front for multiple OSDs while still
> >>   providing an acceptable amount of IOPS. You can test it by
> >>   measuring the maximum amount of IOPS that the fast device can
> >>   serve, and then dividing the result by the number of OSDs behind
> >>   the fast device. If the result is lower or close to the maximum
> >>   amount of IOPS that the OSD can provide without the cache, LVM
> >>   cache is probably not suited for this setup.
> >>
> >> * The interaction of the LVM cache device with OSDs is important.
> >>   Writes are periodically flushed from the caching device to the
> >>   slow device. If the incoming traffic is sustained and significant,
> >>   the caching device will struggle to keep up with incoming requests
> >>   as well as the flushing process, resulting in a performance drop.
> >>   Unless the fast device can provide much more IOPS with better
> >>   latency than the slow device, do not use LVM cache with a
> >>   sustained high volume workload. Traffic in a burst pattern is more
> >>   suited for LVM cache as it gives the cache time to flush its dirty
> >>   data without interfering with client traffic. For a sustained low
> >>   traffic workload, it is difficult to guess in advance whether
> >>   using LVM cache will improve performance. The best test is to
> >>   benchmark and compare the LVM cache setup against the WAL/DB
> >>   setup. Moreover, as small writes are heavy on the WAL partition,
> >>   it is suggested to use the fast device for the DB and/or WAL
> >>   instead of an LVM cache.
> >>
> >
> > So it sounds like you could partition your NVMe for either
> > LVM-cache, DB/WAL, or both?
> >
> > Just figured this sounded a bit more akin to what you were looking
> > for in your original post and figured I would share.
> >
> > I don't use this, but figured I would share it.
> >
> > Reed
> >
> >> On Apr 4, 2020, at 9:12 AM, jesper(a)krogh.cc <mailto:jesper@krogh.cc> wrote:
> >>
> >> Hi.
> >>
> >> We have a need for "bulk" storage - but with decent write latencies.
> >> Normally we would do this with a DAS with a RAID5 with a 2GB
> >> battery-backed write cache in front - as cheap as possible but still
> >> getting the scalability features of ceph.
> >>
> >> In our "first" ceph cluster we did the same - just stuffed in BBWC
> >> in the OSD nodes and we're fine - but now we're onto the next
> one and
> >> systems like:
> >>
> https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
> >> does not support a RAID controller like that - but is branded as
> >> for "Ceph Storage Solutions".
> >>
> >> It does however support 4 NVMe slots in the front - so some level
> >> of "tiering" using the NVMe drives should be what is "suggested" -
> >> but what do people do? What is recommended? I see multiple options:
> >>
> >> Ceph tiering at the "pool - layer":
> >> https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> >> And rumors that it is "deprecated":
> >>
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html…
> >>
> >> Pro: Abstract layer
> >> Con: Deprecated? - Lots of warnings?
> >>
> >> Offloading the block.db on NVMe / SSD:
> >>
> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
> >>
> >> Pro: Easy to deal with - seem heavily supported.
> >> Con: As far as I can tell - this will only benefit the metadata of
> >> the osd - not actual data. Thus a data-commit to the osd will still
> >> be dominated by the write latency of the underlying - very slow - HDD.
> >>
> >> Bcache:
> >>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
> >>
> >> Pro: Closest to the BBWC mentioned above - but with way, way larger
> >> cache sizes.
> >> Con: It is hard to tell whether I will end up being the only one on
> >> the planet using this solution.
> >>
> >> Eat it - writes will be as slow as hitting dead rust - anything
> >> that cannot live with that needs to be entirely on SSD/NVMe.
> >>
> >> Other?
> >>
> >> Thanks for your input.
> >>
> >> Jesper