Hi all.
I want to measure how much time it takes for degraded PGs to be
recovered when one OSD fails.
I know this has many dependencies (network, disk, kernel, recovery
config, and so on), but I would like a formula or some other way to
estimate, if one of my OSDs fails, how much time the cluster will take
to recover.
Does anyone have a doc or an idea of how to measure this given those
dependencies, or some pointers that would give me a better sense of it?
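As a very rough starting point (the numbers below are invented, just to
illustrate the kind of estimate I am after):

  recovery_time ~= data_to_re-replicate / aggregate_recovery_rate

where data_to_re-replicate is roughly the amount of data that was stored on
the failed OSD, and aggregate_recovery_rate is bounded by the recovery
throttles (osd_max_backfills, osd_recovery_max_active), the number of OSDs
taking part, and per-OSD disk and network headroom. For example, a 4 TB OSD
that was 50% full leaves ~2 TB to re-replicate; if 50 OSDs each sustain
~50 MB/s of recovery writes, that is ~2.5 GB/s aggregate, or about 13 minutes
in the ideal case, though the throttles and client load usually stretch this
considerably.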
Thanks all.
Adding the right dev list.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Wed, May 20, 2020 at 12:40 AM Robert LeBlanc <robert(a)leblancnet.us> wrote:
>
> We upgraded our Jewel cluster to Nautilus a few months ago and I've noticed that op behavior has changed. This is an HDD cluster (NVMe journals and an NVMe CephFS metadata pool) with about 800 OSDs. On Jewel, running WPQ with the high cut-off, it was rock solid. When we had recoveries going on they barely dented the client ops, and when the client ops on the cluster went down the backfills would run as fast as the cluster could go. I could have max_backfills set to 10 and the cluster performed admirably.
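> For reference, the tuning I am describing maps to the usual option names
> (shown as a sketch of what we ran, not a recommendation):
>   osd_op_queue = wpq
>   osd_op_queue_cut_off = high
>   osd_max_backfills = 10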
> After upgrading to Nautilus, the cluster struggles with any kind of recovery, and if there is any significant client write load the cluster can get into a death spiral. Even heavy client write bandwidth (3-4 GB/s) can trigger heartbeat-check warnings, blocked IO, and even unresponsive OSDs.
> As the person who wrote the WPQ code initially, I know that it was fair and proportional to the op priority, and in Jewel it worked. It's not working in Nautilus. I've tweaked a lot of things trying to troubleshoot the issue, and setting the recovery priority to 1 or zero barely makes any difference. My best estimate is that the op priority is getting lost before it reaches the WPQ scheduler, so ops are not being prioritized and dispatched correctly. It's almost as if all ops are being treated the same and there is no priority at all.
> Unfortunately, I do not have the time to set up the dev/testing environment to track this down, and we will be moving away from Ceph. But I really like Ceph and want to see it succeed. I strongly suggest that someone look into this, because I think it will resolve a lot of problems people have reported on the mailing list. I'm not sure whether a bug was introduced with the other queues that touches more of the op path, or whether something in the op path restructuring changed how things work (I know that was being discussed around the time that Jewel was released). But my guess is that the problem lies somewhere between the op being created and the op being received into the queue.
> I really hope that this helps in the search for this regression. I spent a lot of time studying the issue to come up with WPQ and saw it work great when I switched this cluster from PRIO to WPQ. I've also spent countless hours studying how it's changed in Nautilus.
>
> Thank you,
> Robert LeBlanc
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Hi Folks,
The weekly performance meeting will be starting in ~20 minutes. Topics
for today include figuring out how to review the backlog of bluestore
locking and threading PRs, ISC20 IO500 testing, and continued discussion
on performance CI. Please feel free to add your own topics!
Etherpad:
https://pad.ceph.com/p/performance_weekly
Bluejeans:
https://bluejeans.com/908675367
Thanks,
Mark
Hi Folks,
We are looking at implementing a new encryption feature in libRBD as explained below.
Feedback and thoughts are welcome.
Best,
Danny Harnik
The problem: There is a growing need in multiple industries (e.g., the finance industry) to encrypt data at the host with tenant/user-provided encryption keys at volume granularity. This is driven by regulations and a rising emphasis on security.
To date, Ceph RBD does not offer any such solution, and the existing alternatives are to add an encryption layer above libRBD.
Examples of such solutions are using QEMU LUKS encryption or relying on DM-Crypt. However, using an encryption layer above RBD has limitations when interfacing with storage functions implemented in the RBD layer. Most glaring is cloning that will only work if the parent and child are encrypted with the same encryption key. By moving encryption down into libRBD, we can achieve the flexibility to use Ceph RBD clones, for example, to create encrypted clones of unencrypted golden images. Such a feature exists today with QEMU when using LUKS encryption together with qCOW2 clones. We want to support a similar capability in libRBD, and potentially address some security limitations of existing QEMU based solutions.
Key design points and thoughts:
- Key management flow will closely mimic that of LUKS, as is implemented in QEMU for qCOW2. The point here is to avoid any security pitfalls that come from key management. QEMU with qCOW2 also has a format in place to handle a chain of clones each with different key/encryption mechanisms and we should follow suit in this design.
- We intend to support two main encryption methods:
1) Standard AES-XTS full disk encryption, which is the method recommended in LUKS. We plan to use an encryption block size of 4KB, which is supported by LUKS2 (but not LUKS v1). Note that the encryption in QEMU works at a 512-byte encryption block granularity, which is less efficient.
2) A new enhanced security format - while AES-XTS is the state of the art in full disk encryption it is designed to provide the best security under the limitations of a physical disk. An RBD volume is not a physical disk, but rather a virtual one which gives us room to store additional information such as IV and authentication data. Such a "fresh IV" approach will eliminate IV reuse which is a central limitation in physical disks, and would therefore enhance security over the one provided by LUKS encryption today. In the enhanced security format IVs need to be stored along with the data in order to allow future decryption, adding a space and performance overhead which we intend to evaluate.
- We intend to add a new Encryption Object Dispatch Layer in libRBD that will handle encryption and decryption. This layer will be placed right below the Cache layer and above the Journal layer. Thus the data in the cache will be in the clear, but the journal data will be encrypted.
- The biggest design challenge is handling clones correctly: specifically, "copyup" data from the parent needs to be decrypted with the parent key and re-encrypted with the child key (a rough sketch of this flow follows below the list).
- There is room for expanding the RBD export/import utilities to export data encrypted rather than in the clear. Upon import it can be read once the proper keys are given. Such a feature is straightforward if a single volume is involved but more complex if a chain of clones is involved.
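To make the copyup point concrete, the flow is roughly the following (a minimal
sketch with stand-in types and a toy cipher; these are not the actual libRBD
object dispatch interfaces):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Stand-in key type; the real code would hold AES-XTS key material.
  struct Key { std::vector<uint8_t> bytes; };

  // Toy "cipher" (XOR with the key), purely so the sketch is self-contained.
  // The real layer would use AES-XTS with a per-block tweak derived from the
  // block offset (or, in the enhanced format, a freshly generated stored IV).
  static std::vector<uint8_t> xor_cipher(const Key& k,
                                         const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
      out[i] = in[i] ^ k.bytes[i % k.bytes.size()];
    return out;
  }

  // Copyup with encryption: data read from the parent image is ciphertext
  // under the parent's key and must be re-encrypted under the child's key
  // before it is written into the child image.
  std::vector<uint8_t> copyup_reencrypt(const Key& parent_key,
                                        const Key& child_key,
                                        const std::vector<uint8_t>& parent_data) {
    std::vector<uint8_t> plaintext = xor_cipher(parent_key, parent_data); // decrypt
    return xor_cipher(child_key, plaintext);                              // re-encrypt
  }

The real layer would of course use AES-XTS (or the enhanced format with stored
IVs) and operate on the object dispatch interfaces, but the
decrypt-with-parent-key / re-encrypt-with-child-key shape stays the same.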
moved to dev(a)ceph.io
Hi RGW experts,
As cephadm doesn't support an empty realm right now, we need to find a
solution.
Right now, I can see two ways forward for users to migrate to cephadm:
1. add the possibility to create cephadm RGW services without a realm
2. provide a documented way to migrate existing zones that have no realm
What do you prefer?
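For option 2, the migration would presumably look roughly like the following
(a sketch adapted from the single-site-to-multisite flow; the realm name is a
placeholder and the exact flags should be verified against the docs):

  radosgw-admin realm create --rgw-realm=default --default
  radosgw-admin zonegroup modify --rgw-zonegroup=default --rgw-realm=default --master --default
  radosgw-admin zone modify --rgw-zone=default --rgw-zonegroup=default --rgw-realm=default --master --default
  radosgw-admin period update --commit

after which the "ceph orch apply rgw <realm> <zone>" step from the adoption
guide could be given "default" for both the realm and the zone.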
Best,
Sebastian
On 02.06.20 19:34, Andy Goldschmidt wrote:
> Hi
> I am trying to upgrade from Mimic (13.2.10) to Octopus (15.x). I'm also trying to use cephadm and am following this guide: https://docs.ceph.com/docs/master/cephadm/adoption/
>
> It was all going fine until step 11, deploying the new RGWs. I don't have any realms set for my cluster, so how do I do it? This is also a single-site cluster.
> # radosgw-admin realm list
> {
>     "default_info": "",
>     "realms": []
> }
> # radosgw-admin zone list
> {
>     "default_info": "a15e2aec-a0da-4cad-a1bd-f448f25bbe3d",
>     "zones": [
>         "default"
>     ]
> }
>
>
> The step below is the one I don't know how to do, as it says I need to specify the realm.
>
> 11. Redeploy RGW daemons. Cephadm manages RGW daemons by zone. For each zone, deploy new RGW daemons with cephadm:
> # ceph orch apply rgw <realm> <zone> <placement> [--port <port>] [--ssl]
> my ceph.conf has this in it about the rgw's:
>
> [client.rgw.ceph-mgmt0]
> host = ceph-mgmt0
> keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-mgmt0/keyring
> log file = /var/log/ceph/ceph-rgw-ceph-mgmt0.log
> rgw frontends = civetweb port=10.92.135.40:8080 num_threads=100
> rgw dns name = library.xxxxx.com
>
> [client.rgw.ceph-mgmt1]
> host = ceph-mgmt1
> keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-mgmt1/keyring
> log file = /var/log/ceph/ceph-rgw-ceph-mgmt1.log
> rgw frontends = civetweb port=10.92.135.41:8080 num_threads=100
> rgw dns name = library.xxxxx.com
>
> [client.rgw.ceph-mgmt2]
> host = ceph-mgmt2
> keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-mgmt2/keyring
> log file = /var/log/ceph/ceph-rgw-ceph-mgmt2.log
> rgw frontends = civetweb port=10.92.135.42:8080 num_threads=100
> rgw dns name = library.xxxxx.com
>
> Regards
> Andy
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
--
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg). Geschäftsführer: Felix Imendörffer
Hi All,
I have noticed that different RBD image sizes can shape BlueStore latency differently. Is there a baseline or any guidance for choosing the image size?
Left: RBD image size is 1GB
Middle: RBD image size is 40GB
Right: RBD image size is 1GB, RocksDB write buffer 10x the default
4K randwrite on SSD with FIO. The SSD is preconditioned and the image is prefilled (20 mins).
Red dot is L1 compaction and green dot is L0 compaction.
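For reference, the workload is roughly equivalent to the following FIO job (a
reconstructed sketch assuming the fio rbd engine; the pool/image names, queue
depth, and runtime are placeholders, not the exact values used):

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=test-image
  rw=randwrite
  bs=4k
  iodepth=32
  time_based=1
  runtime=300

  [rbd-4k-randwrite]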
Let’s focus on the left graph. The smaller spikes are caused by compactions. The higher spikes seem to be caused by BlueStore itself.
I suspect this could be related to the RBD image size in some way.
Does anyone know what the cause of the higher spikes could be, and how to debug it?
Also, what is the proper RBD image size for my test?
Please advise.
Thanks,
Yiming
---------- Forwarded message ---------
From: Abhinav Singh <singhabhinav0796(a)gmail.com>
Date: Tue, Jun 2, 2020 at 2:20 PM
Subject: Re: RGW JaegerTracing Doubt
To: Yuval Lifshitz <ylifshit(a)redhat.com>
Here are two commits for tracing object deletion with function
overloading:
https://github.com/ceph/ceph/commit/1ec9c76b4d3ff7f5fd5ac83150ad1d5e83655276
https://github.com/ceph/ceph/commit/3a331ffb60852f472c57697011b20879662130e0
I can change the code in rgw_op.cc and rgw_rest.cc because they have access
to req_state, but the functions of RGWRados would need to be rewritten as a
whole if we apply overloading, which I guess will be very messy.
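For illustration, the overloading approach amounts to something like this
(stand-in names and types, not the real RGWRados signatures):

  #include <string>

  // Stand-in for a Jaeger/opentracing span handle.
  struct Span {};

  // Stand-in for req_state: the per-request object carrying the root span.
  struct ReqStateSketch {
    Span* trace = nullptr;
  };

  // Sketch of an RGWRados-like class with an overload that threads the
  // request state (and therefore the span) down the call stack explicitly,
  // instead of looking it up in a global or thread_local map.
  class RadosStoreSketch {
   public:
    // Existing entry point, kept for callers that have no tracing context.
    int delete_obj(const std::string& oid) {
      return delete_obj(oid, nullptr);
    }

    // New overload: callers that do have a req_state pass it in; child spans
    // for inner operations would be started from s->trace here.
    int delete_obj(const std::string& oid, ReqStateSketch* s) {
      (void)oid;
      (void)s;
      return 0;
    }
  };

The point is that the per-request state (and its span) is passed explicitly
down the call stack, so no global or thread_local lookup table is needed.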
On Tue, 2 Jun 2020, 14:13 Abhinav Singh, <singhabhinav0796(a)gmail.com> wrote:
> I will share my commit once I build it successfully it will take some time
> though.
>
> On Tue, 2 Jun 2020, 14:08 Yuval Lifshitz, <ylifshit(a)redhat.com> wrote:
>
>> I think that adding anything "global" to hold info that belongs in a
>> specific call stack is not a good idea.
>> Even if your map is thread_local and would not require any locks (and
>> assuming all processing is done in one thread), it's not clear how you
>> would look up the right request from different inner function calls.
>>
>> It seems like function overloading is the correct solution.
>>
>> On Tue, Jun 2, 2020 at 11:22 AM Abhinav Singh <singhabhinav0796(a)gmail.com>
>> wrote:
>>
>>> Yes, you are right; I realized the same thing just moments before.
>>>
>>> Could you suggest any tips on how to manage this without function
>>> overloading?
>>>
>>> On Tue, 2 Jun 2020, 13:24 Yuval Lifshitz, <ylifshit(a)redhat.com> wrote:
>>>
>>>> The problem with this solution is not the cost of searching the hash
>>>> map; it is making this map thread safe.
>>>> Adding a lock would have a very bad impact on performance.
>>>>
>>>> On Tue, Jun 2, 2020 at 5:53 AM Abhinav Singh <
>>>> singhabhinav0796(a)gmail.com> wrote:
>>>>
>>>>> One way of doing this is to store a vector of req_state along with an
>>>>> unordered_map<id, req_state>.
>>>>> But searching through it might add some latency, so to counter this
>>>>> I will put a size limit of a thousand, so that when the vector gets big
>>>>> it erases all its elements along with the unordered_map.
>>>>> This will ensure that the cost of the search operation is
>>>>> greatly reduced.
>>>>>
>>>>> Will this do?
>>>>>
>>>>> On Mon, 1 Jun 2020, 21:34 Abhinav Singh, <singhabhinav0796(a)gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> My `req_state*` contains spans for a particular request in order to trace
>>>>>> that request, but as we know req_state is not available everywhere. I tried
>>>>>> to insert a req_state variable in the CephContext class, because every portion
>>>>>> of RGW has access to it and so would also have access to req_state,
>>>>>> but this won't work because it is initialized once, and when requests run
>>>>>> in parallel a race condition might occur and the traces will be inaccurate.
>>>>>> The second method I tried was to include req_state in RGWRadosStore
>>>>>> and RGWUserCtl because these are accessible to every function which I want
>>>>>> to trace, but again these carry the same race condition risk.
>>>>>>
>>>>>> Can anyone give me a tip on how to make req_state available in all
>>>>>> functions (if not all, then the majority), particularly in classes like
>>>>>> RGWRadosStore and RGWUserCtl?
>>>>>>
>>>>>> Thank You.
>>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing list -- dev(a)ceph.io
>>>>> To unsubscribe send an email to dev-leave(a)ceph.io
>>>>>
>>>>