Hi Ondrej,
When running multiple OSDs on a shared DB/WAL NVMe, it is important to take
into account, when designing your redundancy/failure domains, that the loss
of a single NVMe drive will take out a number of OSDs. You must design your
redundancy so that it is acceptable to lose that many OSDs
simultaneously and still be able to rebuild without data loss. In most
scenarios, this is easily addressed simply by using failure_domain=Host, as
you won't be sharing DB/WAL NVMes across multiple hosts. I don't think
there's any generally agreed perfect number of OSDs per DB/WAL NVMe, but
I've seen others argue for a best practice of a maximum of 3 OSDs per DB/WAL
NVMe, and have myself adopted that as a standard. I run hosts with 12 HDD
OSDs and 4 DB/WAL NVMes, and a failure_domain=host.
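For reference, a host-level failure domain is expressed in a CRUSH rule like the following (decompiled crushmap syntax; the rule name and id here are illustrative, not from my cluster):

```
rule replicated_host {
    id 1
    type replicated
    step take default
    # chooseleaf at type host: no two replicas land on the same host,
    # so losing one shared DB/WAL NVMe (and its OSDs) costs at most one copy
    step chooseleaf firstn 0 type host
    step emit
}
```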
Best Regards,
Simon Kepp,
Founder,
Kepp Technologies.
On Fri, Apr 19, 2024 at 2:07 PM Ondřej Kukla <ondrej(a)kuuk.la> wrote:
Hello,
I’m going to mainly answer the practical questions Niklaus had.
Our standard setup is 12 HDDs and 2 enterprise NVMes per node, which means
we have 6 OSDs per NVMe. For the partitioning we use LVM.
The fact that one failed NVMe takes down 6 OSDs isn't great, but our
osd-node count is more than double the K + M value for erasure coding,
which means losing 6 OSDs should be ok-ish. Failing multiple NVMes could be
an issue. If you use replicated pools then this isn't that problematic.
When it comes to recovery, Ceph can handle that easily. Just recreate the
LVs and OSDs and you are good to go.
One other benefit for us is that because we use large NVMes (7.7TiB) we
can use the spare space for a fast pool.
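The spare-space math works out roughly like this (a sketch; the 300 GiB per-OSD DB/WAL size is an assumed figure for illustration, not from our actual layout):

```python
# Sketch of the LVM split on one 7.7 TiB NVMe shared by 6 OSDs.
# The 300 GiB DB/WAL LV size per OSD is an assumption for illustration.
TIB = 1024  # GiB per TiB

nvme_gib = 7.7 * TIB      # ~7885 GiB usable
db_lv_gib = 300           # assumed DB/WAL LV per OSD
osds_per_nvme = 6

spare_gib = nvme_gib - osds_per_nvme * db_lv_gib
print(f"spare for fast pool: {spare_gib:.0f} GiB (~{spare_gib / TIB:.1f} TiB)")
# -> spare for fast pool: 6085 GiB (~5.9 TiB)
```

Even with generous DB/WAL allocations, most of a large NVMe is left over for a fast pool.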
Ondrej
On 19. 4. 2024, at 12:04, Torkil Svensgaard
<torkil(a)drcmr.dk> wrote:
Hi
Red Hat Ceph support told us back in the day that 16 DB/WAL partitions per
NVMe were the max supported by RHCS, because their testing showed
performance suffered beyond that. We are running with 11 per NVMe.
We are prepared to lose a bunch of OSDs if we have an NVMe die. We expect
Ceph will handle it and we can redeploy the OSDs with a new NVMe device.
We use a service spec for the chopping up bit:
service_type: osd
service_id: slow
service_name: osd.slow
placement:
host_pattern: '*'
spec:
block_db_size: 290966113186
data_devices:
rotational: 1
db_devices:
rotational: 0
size: '1000G:'
filter_logic: AND
objectstore: bluestore
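As a sanity check on that spec, the odd-looking block_db_size decodes to roughly 271 GiB per OSD; with 11 DB partitions per NVMe that adds up to about 3.2 TB, i.e. one drive's worth (a sketch; the drive-size interpretation is my own arithmetic):

```python
# Decode the block_db_size from the service spec above.
block_db_size = 290966113186      # bytes, straight from the spec
GIB = 2**30

per_osd_gib = block_db_size / GIB
print(f"per OSD: {per_osd_gib:.0f} GiB")    # ~271 GiB

# With 11 DB partitions per NVMe, the total allocation is:
total_tb = 11 * block_db_size / 1e12
print(f"11 partitions: {total_tb:.2f} TB")  # ~3.20 TB
```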
Mvh.
Torkil
On 19-04-2024 11:02, Niklaus Hofer wrote:
> Dear all
> We have an HDD Ceph cluster that could do with some more IOPS. One
solution we are considering is installing NVMe SSDs into the storage nodes
and using them as WAL and/or DB devices for the BlueStore OSDs.
> However, we have some questions about this and are looking for some
guidance and advice.
> The first one is about the expected benefits. Before we undergo the
efforts involved in the transition, we are wondering if it is even worth
it. How much of a performance boost can one expect when adding NVMe SSDs
for WAL devices to an HDD cluster? Plus, how much faster than that does it
get with the DB also being on SSD? Are there rule-of-thumb numbers for
that? Or maybe someone has done benchmarks in the past?
> The second question is of a more practical nature. Are there any best
practices on how to implement this? I was thinking we won't do one SSD per
HDD - surely an NVMe SSD is plenty fast to handle the traffic from multiple
OSDs. But what is a good ratio? Do I have one NVMe SSD per 4 HDDs? Per 6,
or even 8? Also, how should I chop up the SSD, using partitions or using
LVM? Last but not least, if I have one SSD handle WAL and DB for multiple
OSDs, losing that SSD means losing multiple OSDs. How do people deal with
this risk? Is it generally deemed acceptable, or is this something people
tend to mitigate, and if so, how? Do I run multiple SSDs in RAID?
> I do realize that for some of these, there might not be one perfect
answer that fits all use cases. I am looking for best practices and am in
general just trying to avoid any obvious mistakes. Any advice is much
appreciated.
Sincerely
Niklaus Hofer
--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io