Successfully using dm-cache


Michael Lipp

31 Jan 2024

8:23 p.m.

Just in case anybody is interested: using dm-cache works and boosts performance -- at least for my use case.

The "challenge" was to get 100 (identical) Linux VMs started on a three-node hyperconverged cluster. The hardware is nothing special: each node has a Supermicro server board with a single 24-core CPU and 4 x 4 TB hard disks. And there's that extra 1 TB NVMe...

I know that the general recommendation is to use the NVMe for WAL and metadata, but this didn't seem appropriate for my use case, and I'm still not quite sure about failure scenarios with that configuration. So instead I made each drive a logical volume (managed by an OSD) and added 85 GiB of NVMe to each LV as a read-only cache.

Each VM uses as its system disk an RBD image based on a snapshot of the master image. The idea was that with this configuration, all VMs should share most (actually almost all) of the data on their system disks, and this data should be available from the cache.

Well, it works. When booting the 100 VMs, almost all read operations are satisfied from the cache. So I get close to NVMe speed but have paid for conventional hard drives only (well, SSDs aren't that much more expensive nowadays, but the hardware is 4 years old).

So, nothing sophisticated, but as I couldn't find anything about this kind of setup, it might be of interest nevertheless.

 - Michael
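For readers who want to try a similar layout, the setup described above could be sketched as a small script that prints the LVM commands for one node. This is only a sketch under assumptions, not the procedure actually used in the post: the volume group name `ceph-vg`, the LV names `osd0`..`osd3`, the NVMe device path, and the choice of `writethrough` as the nearest LVM cache mode to a "read-only cache" are all hypothetical. The NVMe is assumed to already be a physical volume in the same volume group. The script only builds command strings; it executes nothing.

```python
# Sketch: print lvmcache commands that attach an NVMe cache to each OSD LV.
# Hypothetical names: VG "ceph-vg", LVs "osd0".."osd3", NVMe PV "/dev/nvme0n1".

CACHE_GIB = 85   # cache size per OSD LV, taken from the post
OSD_COUNT = 4    # four 4 TB hard disks per node, taken from the post


def cache_commands(vg: str, nvme_pv: str, osd_count: int, cache_gib: int) -> list[str]:
    """Return the LVM commands (as strings) for one node; nothing is run."""
    cmds = []
    for i in range(osd_count):
        cache_lv = f"osd{i}-cache"
        # Carve the cache LV out of the NVMe physical volume only.
        cmds.append(f"lvcreate -n {cache_lv} -L {cache_gib}G {vg} {nvme_pv}")
        # Attach it as a dm-cache cachevol; writethrough is an assumed mode.
        cmds.append(
            f"lvconvert --type cache --cachevol {cache_lv} "
            f"--cachemode writethrough {vg}/osd{i}"
        )
    return cmds


if __name__ == "__main__":
    for cmd in cache_commands("ceph-vg", "/dev/nvme0n1", OSD_COUNT, CACHE_GIB):
        print(cmd)
    print(f"# NVMe space used for cache: {OSD_COUNT * CACHE_GIB} GiB")
```

With four OSDs at 85 GiB each, the caches take 340 GiB of the 1 TB NVMe, which leaves room for other uses of the device.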


quaglio＠bol.com.br

31 Jan

8:46 p.m.

New subject: Performance improvement suggestion

quaglio＠bol.com.br

10:48 p.m.

New subject: Performance improvement suggestion

Can Özyurt

10:58 p.m.

New subject: Performance improvement suggestion

I never tried this myself but "min_size = 1" should do what you want to achieve.

On Wed, 31 Jan 2024 at 22:48, quaglio(a)bol.com.br <quaglio(a)bol.com.br> wrote:
>
> Hello everybody,
> I would like to make a suggestion for improving performance in Ceph architecture.
> I don't know if this group would be the best place or if my proposal is correct.
>
> My suggestion would be in the item https://docs.ceph.com/en/latest/architecture/, at the end of the topic "Smart Daemons Enable Hyperscale".
>
> The Client needs to "wait" for the configured amount of replicas to be written (so that the client receives an ok and continues). This way, if there is slowness on any of the disks on which the PG will be updated, the client is left waiting.
>
> It would be possible:
>
> 1-) Only record on the primary OSD
> 2-) Write other replicas in background (like the same way as when an OSD fails: "degraded" ).
>
> This way, client has a faster response when writing to storage: improving latency and performance (throughput and IOPS).
>
> I would find it plausible to accept a period of time (seconds) until all replicas are ok (written asynchronously) at the expense of improving performance.
>
> Could you evaluate this scenario?
>
>
> Rafael.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io

Anthony D'Atri

11:04 p.m.

New subject: Performance improvement suggestion

I’ve heard conflicting assertions on whether the write returns once min_size shards have been persisted, or only once all of them have.


Janne Johansson

1 Feb

10:07 a.m.

New subject: Performance improvement suggestion

> I’ve heard conflicting assertions on whether the write returns once min_size shards have been persisted, or only once all of them have.

I think it waits until all replicas have written the data. But in simplistic tests with a fast network and slow drives, the extra time taken to write many copies does not scale linearly with the time it takes to write the first, so unless you do go min_size=1 (not recommended at all), the extra copies are not slowing you down as much as you'd expect. At least not if the other drives are not 100% busy.

I get that this thread started on having one bad drive, and that is another scenario of course, but having repl=2 or repl=3 does not mean writes take 100% - 200% more time than a single write; it is less.

--
May the most significant bit of your life be positive.
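The observation above can be illustrated with a toy latency model. This is a sketch with made-up numbers, not a Ceph measurement; it assumes the primary persists locally while the replica writes proceed in parallel, so the client waits for the slowest path rather than for the sum of all writes. The function names and the `hop_ms` network parameter are illustrative only.

```python
# Toy model: latency of one replicated write when replica writes run in parallel.
# All numbers are invented for illustration; this is not a Ceph measurement.


def parallel_write_latency(primary_ms: float, replica_ms: list[float], hop_ms: float) -> float:
    # The primary persists locally while each replica receives and persists the
    # same write concurrently; the client ack waits for the slowest participant.
    replica_paths = [hop_ms + r + hop_ms for r in replica_ms]  # send, write, ack
    return max([primary_ms] + replica_paths)


def serial_write_latency(primary_ms: float, replica_ms: list[float], hop_ms: float) -> float:
    # What a naive "every copy costs a full extra write" estimate would predict.
    return primary_ms + sum(hop_ms + r + hop_ms for r in replica_ms)


if __name__ == "__main__":
    drive_ms, hop_ms = 10.0, 0.25
    for copies in (1, 2, 3):
        replicas = [drive_ms] * (copies - 1)
        par = parallel_write_latency(drive_ms, replicas, hop_ms)
        ser = serial_write_latency(drive_ms, replicas, hop_ms)
        print(f"size={copies}: parallel {par:.1f} ms, serial estimate {ser:.1f} ms")
```

In this model, going from one copy to three adds only the network round trip, while a single slow replica (the "one bad drive" case) dominates the whole write because the maximum is taken over all paths.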

Anthony D'Atri

31 Jan

11:03 p.m.

New subject: Performance improvement suggestion

Would you be willing to accept the risk of data loss?


quaglio＠bol.com.br

1 Feb

5:04 p.m.

New subject: Performance improvement suggestion

quaglio＠bol.com.br

5:19 p.m.

New subject: Performance improvement suggestion

Anthony D'Atri

6:03 p.m.

New subject: Performance improvement suggestion

> I didn't say I would accept the risk of losing data.

That's implicit in what you suggest, though.

> I just said that it would be interesting if the objects were first recorded only in the primary OSD.

What happens when that host / drive smokes before it can replicate? What happens if a secondary OSD gets a read op before the primary updates it? Swift object storage users have to code around this potential. It's a non-starter for block storage.

This is similar to why RoC HBAs (which are a badly outdated thing to begin with) will only enter writeback mode if they have a BBU / supercap -- and of course if their firmware and hardware aren't pervasively buggy. Guess how I know this?

> This way it would greatly increase performance (both for IOPS and throughput).

It might increase low-QD IOPS for a single client on slow media with certain networking. Depending on media, it wouldn't increase throughput.

Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x the network resources between the client and the servers.

> Later (in the background), record the replicas. This situation would avoid leaving users/software waiting for the recording response from all replicas when the storage is overloaded.

If one makes the mistake of using HDDs, they're going to be overloaded no matter how one slices and dices the ops. Ya just canna squeeze IOPS from a stone. Throughput is going to be limited by the SATA interface and seeking no matter what.

> Where I work, performance is very important and we don't have money to make an entire cluster only with NVMe.

If there isn't money, then it isn't very important. But as I've written before, NVMe clusters *do not cost appreciably more than spinners* unless your procurement processes are bad. In fact they can cost significantly less. This is especially true with object storage and archival where one can leverage QLC.

* Buy generic drives from a VAR, not channel drives through a chassis brand. Far less markup, and moreover you get the full 5 year warranty, not just 3 years. And you can painlessly RMA drives yourself - you don't have to spend hours going back and forth with $chassisvendor's TAC arguing about every single RMA. I've found that this is so bad that it is more economical to just throw away a failed component worth < USD 500 than to RMA it. Do you pay for extended warranty / support? That's expensive too.
* Certain chassis brands who shall remain nameless push RoC HBAs hard with extreme markups. List prices as high as USD 2000. Per server, eschewing those abominations makes up for a lot of the drive-only unit economics.
* But this is the part that lots of people don't get: You don't just stack up the drives on a desk and use them. They go into *servers* that cost money and *racks* that cost money. They take *power* that costs money.
* $ / IOPS are FAR better for ANY SSD than for HDDs.
* RUs cost money, so do chassis and switches.
* Drive failures cost money.
* So does having your people and applications twiddle their thumbs waiting for stuff to happen. I worked for a supercomputer company that put low-memory low-end diskless workstations on engineers' desks. They spent lots of time doing nothing waiting for their applications to respond. This company no longer exists.
* So does the risk of taking *weeks* to heal from a drive failure.

Punch honest numbers into https://www.snia.org/forums/cmsi/programs/TCOcalc

I walked through this with a certain global company. QLC SSDs were demonstrated to have like 30% lower TCO than spinners. Part of the equation is that they were accustomed to limiting HDD size to 8 TB because of the bottlenecks, and thus requiring more servers, more switch ports, more DC racks, more rack/stack time, more administrative overhead.

You can fit 1.9 PB of raw SSD capacity in a 1U server. That same RU will hold at most 88 TB of the largest spinners you can get today. 22 TIMES the density. And since many applications can barely tolerate the spinner bottlenecks, capping spinner size at even 10T makes that like 40 TIMES better density with SSDs.
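The density figures above can be checked with a few lines of arithmetic. This is a sketch, and one input is an assumption not stated in the message: that 88 TB per rack unit corresponds to four 22 TB hard drives, so a 10 TB cap would leave 40 TB per rack unit.

```python
# Re-derive the density ratios quoted in the message above.
# Assumption (not stated in the message): 88 TB per RU means 4 x 22 TB HDDs.

SSD_TB_PER_RU = 1900      # 1.9 PB of raw SSD capacity in a 1U server
HDD_TB_PER_RU = 88        # largest spinners available, per the message
HDD_SLOTS_PER_RU = 4      # inferred from 88 / 22, hypothetical
CAPPED_HDD_TB = 10        # HDD size cap mentioned in the message


def density_ratio(ssd_tb: float, hdd_tb: float) -> float:
    """How many times more raw capacity per rack unit the SSD option holds."""
    return ssd_tb / hdd_tb


if __name__ == "__main__":
    full = density_ratio(SSD_TB_PER_RU, HDD_TB_PER_RU)
    capped = density_ratio(SSD_TB_PER_RU, HDD_SLOTS_PER_RU * CAPPED_HDD_TB)
    print(f"largest HDDs:   {full:.1f}x")
    print(f"HDDs capped at {CAPPED_HDD_TB} TB: {capped:.1f}x")
```

The first ratio rounds to the quoted 22x. Under the four-drive assumption the capped ratio comes out somewhat above the quoted "like 40 TIMES", which is consistent with that figure being a rough estimate.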

> However, I don't think it's interesting to lose the functionality of the replicas. I'm just suggesting another way to increase performance without losing the functionality of replicas.
>
> Rafael.
>
> From: "Anthony D'Atri" <anthony.datri(a)gmail.com>
> Sent: 2024/01/31 17:04:08
> To: quaglio(a)bol.com.br
> Cc: ceph-users(a)ceph.io
> Subject: Re: [ceph-users] Performance improvement suggestion
>
> Would you be willing to accept the risk of data loss?

quaglio＠bol.com.br

8:31 p.m.

New subject: Performance improvement suggestion

quaglio＠bol.com.br

8:53 p.m.

New subject: Performance improvement suggestion

Anthony D'Atri

8:58 p.m.

New subject: Performance improvement suggestion

I'd totally defer to the RADOS folks. One issue might be adding a separate code path, which can have all sorts of problems.

On Feb 1, 2024, at 12:53, quaglio(a)bol.com.br wrote:

> Ok Anthony,
>
> I understood what you said. I also believe in all the professional history and experience you have.
>
> Anyway, could there be a configuration flag to make this happen? As well as those that already exist: "--yes-i-really-mean-it".
>
> This way, the storage pattern would remain as it is. However, it would allow situations like the one I mentioned to be possible.
>
> This situation will permit some rules to be relaxed (even if they are not ok at first). Likewise, there are already situations like lazyio that make some exceptions to standard procedures.
>
> Remembering: it's just a suggestion. If this type of functionality is not interesting, it is ok.
>
> Rafael.

quaglio＠bol.com.br

20 Feb

10:53 p.m.

New subject: Performance improvement suggestion

Anthony D'Atri

11:03 p.m.

New subject: Performance improvement suggestion

> Hi Anthony,
> Did you decide that it's not a feature to be implemented?

That isn't up to me.

> I'm asking about this so I can offer options here. I'd not be comfortable enabling "mon_allow_pool_size_one" on a specific pool. It would be better if this feature could create the replica at a later time on a selected pool.
>
> Thanks.
> Rafael.

Alex Gorbachev

21 Feb

4:23 a.m.

New subject: Performance improvement suggestion

I would be against such an option, because it introduces a significant risk of data loss. Ceph has made a name for itself as a very reliable system, where almost no one has lost data, no matter how bad a decision they made with architecture and design. This is what you pay for in commercial systems, to "not be allowed a bad choice", and this is what everyone gets with Ceph for free (if they so choose).

Allowing a change like this would likely be the beginning of the end of Ceph. It is a bad idea in the extreme. Ceph reliability should never be compromised.

There are other options for storage that are robust and do not require as much investment. Use ZFS, with NFS if needed. Use bcache/flashcache, or something similar on the client side. Use proper RAM caching in databases and applications.

--
Alex Gorbachev
Intelligent Systems Services Inc.
STORCIUM

Özkan Göksu

1:03 a.m.

New subject: Performance improvement suggestion

Hello.

I haven't tested it personally, but what about a replica-1 write cache pool on NVMe, backed by another replica-2 pool? In theory, it has the potential to be exactly what you are looking for.

Anthony D'Atri

1:59 a.m.

New subject: Performance improvement suggestion

Cache tiering is deprecated.

...

On Feb 20, 2024, at 17:03, Özkan Göksu <ozkangksu(a)gmail.com> wrote: Hello. I didn't test it personally but what about rep 1 write cache pool with nvme backed by another rep 2 pool? It has the potential exactly what you are looking for in theory. 1 Şub 2024 Per 20:54 tarihinde quaglio(a)bol.com.br <quaglio(a)bol.com.br> şunu yazdı:

I didn't say I would accept the risk of losing data.

That's implicit in what you suggest, though.

I just said that it would be interesting if the objects were first

This way it would greatly increase performance (both for iops and

Later (in the background), record the replicas. This situation would

Where I work, performance is very important and we don't have money to

However, I don't think it's interesting to lose the functionality of the replicas.

I'm just suggesting another way to increase performance without losing the functionality of replicas.


Dan van der Ster

10:01 a.m.

New subject: Performance improvement suggestion

Hi,

I just want to echo what the others are saying.

Keep in mind that RADOS needs to guarantee read-after-write consistency for the higher level apps to work (RBD, RGW, CephFS). If you corrupt VM block devices, S3 objects or bucket metadata/indexes, or CephFS metadata, you're going to suffer some long days and nights recovering.

Anyway, I think that what you proposed has at best a similar reliability to min_size=1. And note that min_size=1 is strongly discouraged because of the very high likelihood that a device/network/power failure turns into a visible outage. In short: your idea would turn every OSD into a SPoF.

How would you handle this very common scenario: a power outage followed by at least one device failing to start afterwards?

1. Write object A from client.
2. Fsync to primary device completes.
3. Ack to client.
4. Writes sent to replicas.
5. Cluster wide power outage (before replicas committed).
6. Power restored, but the primary osd does not start (e.g. permanent hdd failure).
7. Client tries to read object A.

Today, with min_size=1 such a scenario manifests as data loss: you get either a down PG (with many, many objects offline and IO blocked until you manually decide which data loss mode to accept) or unfound objects (with IO blocked until you accept data loss). With min_size=2 the likelihood of data loss is dramatically reduced.

Another thing about that power loss scenario is that all dirty PGs would need to be recovered when the cluster reboots. You'd lose all the writes in transit and have to replay them from the primary's pg_log, or backfill if the pg_log was too short. Again, any failure during that recovery would lead to data loss.

So I think that to maintain any semblance of reliability, you'd need to at least wait for a commit ack from the first replica (i.e. min_size=2). But since the replica writes are dispatched in parallel, your speedup would evaporate.
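The seven steps above can be replayed in a few lines of Python. This is only a toy model, not Ceph code: the `Osd` class, the `scenario` function and the `acks_required` parameter are names invented for this sketch, and the assumption is that exactly one OSD (the primary) has committed when the power goes out.

```python
from dataclasses import dataclass, field


@dataclass
class Osd:
    """Hypothetical stand-in for an OSD: a name, committed objects, a liveness flag."""
    name: str
    committed: set = field(default_factory=set)
    alive: bool = True


def scenario(acks_required, commits_before_outage=1):
    """Replay steps 1-7 and return (client_was_acked, object_readable_afterwards)."""
    osds = [Osd("primary"), Osd("replica1"), Osd("replica2")]

    # Steps 1-5: the write commits on the first N OSDs (primary first),
    # then the cluster-wide power outage hits.
    for osd in osds[:commits_before_outage]:
        osd.committed.add("A")
    # The client only gets an ack if enough commits happened before the outage.
    acked = commits_before_outage >= acks_required

    # Step 6: power is restored, but the primary does not start.
    osds[0].alive = False

    # Step 7: the client tries to read object A from any surviving OSD.
    readable = any(o.alive and "A" in o.committed for o in osds)
    return acked, readable


for acks in (1, 3):
    acked, readable = scenario(acks)
    print(f"acks_required={acks}: acked={acked}, readable={readable}, "
          f"acknowledged write lost={acked and not readable}")
```

With `acks_required=1` the client holds an ack for data that no surviving OSD has. With `acks_required=3` the same outage leaves the write unacknowledged, so the client knows it has to retry.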
Another thing: I suspect this idea would result in many inconsistencies from transient issues. You'd need to ramp up the number of parallel deep-scrubs to look for those inconsistencies quickly, which would also work against any potential speedup.

Cheers, Dan

--
Dan van der Ster
CTO
Clyso GmbH
w: https://clyso.com | e: dan.vanderster(a)clyso.com
Try our Ceph Analyzer!: https://analyzer.clyso.com/
We are hiring: https://www.clyso.com/jobs/

On Wed, Jan 31, 2024, 11:49 quaglio(a)bol.com.br <quaglio(a)bol.com.br> wrote:

...

Hello everybody,

I would like to make a suggestion for improving performance in Ceph architecture. I don't know if this group would be the best place or if my proposal is correct.

My suggestion would be in the item https://docs.ceph.com/en/latest/architecture/, at the end of the topic "Smart Daemons Enable Hyperscale".

The Client needs to "wait" for the configured amount of replicas to be written (so that the client receives an ok and continues). This way, if there is slowness on any of the disks on which the PG will be updated, the client is left waiting.

It would be possible:

1-) Only record on the primary OSD
2-) Write other replicas in background (like the same way as when an OSD fails: "degraded" ).

This way, client has a faster response when writing to storage: improving latency and performance (throughput and IOPS).

I would find it plausible to accept a period of time (seconds) until all replicas are ok (written asynchronously) at the expense of improving performance.

Could you evaluate this scenario?

Rafael.

pg＠ceph.list.sabi.co.UK

3:10 p.m.

New subject: Performance improvement suggestion

...

1. Write object A from client. 2. Fsync to primary device completes. 3. Ack to client. 4. Writes sent to replicas.

[...]

As mentioned in the discussion, this proposal is the opposite of the current policy, which is to wait for all replicas to be written before writes are acknowledged to the client:

https://github.com/ceph/ceph/blob/main/doc/architecture.rst

"After identifying the target placement group, the client writes the object to the identified placement group's primary OSD. The primary OSD then [...] confirms that the object was stored successfully in the secondary and tertiary OSDs, and reports to the client that the object was stored successfully."

A more revolutionary option would be for 'librados' to write in parallel to all the "active set" OSDs and report this to the primary, but that would greatly increase client-Ceph traffic, while the current logic increases traffic only among OSDs.

...

So I think that to maintain any semblance of reliability, you'd need to at least wait for a commit ack from the first replica (i.e. min_size=2).

Frank Schilder

4 Mar

11:41 a.m.

New subject: Performance improvement suggestion

Hi all,

coming late to the party, but I want to chip in as well with some experience.

The problem of tail latencies of individual OSDs is a real pain for any redundant storage system. However, there is an elegant way to deal with this when using large replication factors. The idea is to use the counterpart of the "fast read" option that exists for EC pools and:

1) make this option available to replicated pools as well (it is on the road map as far as I know), but also
2) implement an option "fast write" for all pool types.

Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests have started piling up, which will show as a slow requests warning).

I have fast read enabled on all EC pools. This does increase the cluster-internal network traffic, which is nowadays absolutely no problem (in the good old 1G times it potentially would have been). In return, the read latencies on the client side are lower and much more predictable. In effect, the user experience improved dramatically.

I really wish that such an option would get added, as we use wide replication profiles (rep-(4,2) and EC(8+3), each with 2 "spare" OSDs), and exploiting large replication factors (more precisely, large (size-min_size)) to mitigate the impact of slow OSDs would be awesome. It would also add some incentive to stop the ridiculous size=2 min_size=1 habit, because one gets an extra gain from replication on top of redundancy.
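To get a feeling for why acknowledging after min_size commits helps with tail latency, here is a small simulation. It is a sketch under invented assumptions, not a model of real OSD behaviour: commits normally take 0.5 to 1.5 ms, each OSD is independently "slow" (15 to 25 ms) on 5% of writes, and the names `ack_latency` and `simulate` exist only in this example.

```python
import random


def ack_latency(latencies, acks_needed):
    """Time until `acks_needed` of the parallel commits have completed."""
    return sorted(latencies)[acks_needed - 1]


def simulate(size=4, min_size=2, writes=10_000, slow_prob=0.05, seed=1):
    """Return (p99 when waiting for all replicas, p99 with hypothetical fast write)."""
    rng = random.Random(seed)
    wait_all, fast_write = [], []
    for _ in range(writes):
        # One commit latency (ms) per OSD in the active set, primary included.
        lat = [
            rng.uniform(15.0, 25.0) if rng.random() < slow_prob
            else rng.uniform(0.5, 1.5)
            for _ in range(size)
        ]
        wait_all.append(ack_latency(lat, size))        # ack after all `size` commits
        fast_write.append(ack_latency(lat, min_size))  # ack after `min_size` commits

    def p99(values):
        return sorted(values)[int(0.99 * len(values))]

    return p99(wait_all), p99(fast_write)


p99_all, p99_fast = simulate()
print(f"p99 waiting for all replicas: {p99_all:.1f} ms")
print(f"p99 with fast write:          {p99_fast:.1f} ms")
```

Under these assumptions a write that waits for all four commits is slow whenever any one OSD is slow, while the fast-write variant is slow only when at least three of the four are slow at the same time, which is rare. The simulation says nothing about the durability questions raised elsewhere in the thread.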
In the long run, the ceph write path should try to deal with a-priori known different-latency connections (fast local ACK with async remote completion, which was asked for a couple of times), for example, for stretched clusters where one has an internal connection for the local part and external connections for the remote parts. It would be great to have similar ways of mitigating some penalties of the slow write paths to remote sites.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Peter Grandi <pg(a)ceph.list.sabi.co.UK>
Sent: Wednesday, February 21, 2024 1:10 PM
To: list Linux fs Ceph
Subject: [ceph-users] Re: Performance improvement suggestion

...

1. Write object A from client. 2. Fsync to primary device completes. 3. Ack to client. 4. Writes sent to replicas.

...

So I think that to maintain any semblance of reliability, you'd need to at least wait for a commit ack from the first replica (i.e. min_size=2).

Perhaps it could be similar to 'k'+'m' for EC, that is 'k' synchronous (write completes to the client only when at least 'k' replicas, including primary, have been committed) and 'm' asynchronous, instead of 'k' being just 1 or 2.

Marc

2:35 p.m.

New subject: Performance improvement suggestion

...

Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests started piling up, which will show as a slow requests warning).

What happens if an error occurs on the slowest OSD after the min_size ACK has already been sent to the client?

Maged Mokhtar

3:56 p.m.

New subject: Performance improvement suggestion

On 04/03/2024 13:35, Marc wrote:

...

What happens if an error occurs on the slowest OSD after the min_size ACK has already been sent to the client?

This should not be different from what exists today, unless of course the error happens on the local/primary OSD.

Frank Schilder

4:37 p.m.

New subject: Performance improvement suggestion

...

What happens if an error occurs on the slowest OSD after the min_size ACK has already been sent to the client?

This should not be different from what exists today, unless of course the error happens on the local/primary OSD.

Can this be addressed with reasonable effort? I don't expect this to be a quick fix, and it should be tested. However, beating the tail-latency statistics with the extra redundancy should be worth it. I observe fluctuations of latencies: OSDs become randomly slow, for whatever reason, for short time intervals and then return to normal.

A reason for this could be DB compaction. I think latency tends to spike during compaction.

A fast-write option would effectively remove the impact of this.

Best regards and thanks for considering this!

Maged Mokhtar

5:40 p.m.

New subject: Performance improvement suggestion

On 04/03/2024 15:37, Frank Schilder wrote:

...

What happens if an error occurs on the slowest OSD after the min_size ACK has already been sent to the client?

This should not be different from what exists today, unless of course the error happens on the local/primary OSD.

I think this is something the RADOS devs need to weigh in on. It does sound worth investigating. It is relevant not just for cases with DB compaction but, more importantly, for the normal (happy) IO path, as that is where it will have the most impact.

Mark Nelson

8:06 p.m.

New subject: Performance improvement suggestion

On 3/4/24 08:40, Maged Mokhtar wrote:

...

On 04/03/2024 15:37, Frank Schilder wrote:

> Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests started piling up, which will show as a slow requests warning).

What happens if an error occurs on the slowest OSD after the min_size ACK has already been sent to the client?

This should not be different from what exists today, unless of course the error happens on the local/primary OSD.

Typically an L0->L1 compaction will have two primary effects:

1) It will cause large IO read/write traffic to the disk, potentially impacting other IO taking place if the disk is already saturated.

2) It will block memtable flushes until the compaction finishes. This means that more and more data will accumulate in the memtables/WAL, which can trigger throttling and eventually stalls if you run out of buffer space. By default, we allow up to 1GB of writes to WAL/memtables before writes are fully stalled, but RocksDB will typically throttle writes before you get to that point. It's possible a larger buffer may allow you to absorb traffic spikes for longer at the expense of more disk and memory usage. Ultimately though, if you are hitting throttling, it means that the DB can't keep up with the WAL ingestion rate.

Mark
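As a rough back-of-the-envelope illustration of the buffer argument above: if flushes are blocked, the time until the buffer is full is simply its size divided by the net fill rate. The 1GB figure comes from the message; the 100 MB/s ingest rate and the function name `seconds_until_stall` are assumptions made up for this sketch, and the model ignores the throttling that RocksDB applies before a full stall.

```python
def seconds_until_stall(buffer_bytes, ingest_bytes_per_s, drain_bytes_per_s=0.0):
    """Seconds until the WAL/memtable buffer fills, given constant rates.

    drain_bytes_per_s is 0 while a compaction blocks memtable flushes.
    Returns infinity if the buffer drains at least as fast as it fills.
    """
    net_fill_rate = ingest_bytes_per_s - drain_bytes_per_s
    if net_fill_rate <= 0:
        return float("inf")
    return buffer_bytes / net_fill_rate


GB = 10**9
MB = 10**6

# Hypothetical workload: 100 MB/s of writes arriving while flushes are blocked.
print(seconds_until_stall(1 * GB, 100 * MB))  # default-sized buffer
print(seconds_until_stall(2 * GB, 100 * MB))  # doubling the buffer doubles the headroom
```

A larger buffer only buys time in proportion to its size. If the sustained ingest rate stays above what the DB can flush, the stall is delayed, not avoided.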

...


-- Best Regards, Mark Nelson Head of Research and Development Clyso GmbH p: +49 89 21552391 12 | a: Minnesota, USA w: https://clyso.com | e: mark.nelson(a)clyso.com We are hiring: https://www.clyso.com/jobs/


participants (13)

Alex Gorbachev
Anthony D'Atri
Can Özyurt
Dan van der Ster
Frank Schilder
Janne Johansson
Maged Mokhtar
Marc
Mark Nelson
Michael Lipp
pg＠ceph.list.sabi.co.UK
quaglio＠bol.com.br
Özkan Göksu