I would be against such an option, because it introduces a significant risk
of data loss. Ceph has made a name for itself as a very reliable system,
where almost no one lost data, no matter how bad of a decision they made
with architecture and design. This is what you pay for in commercial
systems, to "not be allowed a bad choice", and this is what everyone gets
with Ceph for free (if they so choose).
Allowing a change like this would likely be the beginning of the end of
Ceph. It is a bad idea in the extreme. Ceph reliability should never be
compromised.
There are other options for storage that are robust and do not require as
much investment. Use ZFS, with NFS if needed. Use bcache/flashcache, or
something similar on the client side. Use proper RAM caching in databases
and applications.
--
Alex Gorbachev
Intelligent Systems Services Inc.
STORCIUM
On Tue, Feb 20, 2024 at 3:04 PM Anthony D'Atri <anthony.datri(a)gmail.com>
wrote:
Hi Anthony,
Did you decide that it's not a feature to be implemented?
That isn't up to me.
I'm asking about this so I can offer options here.
I would not be comfortable enabling "mon_allow_pool_size_one" on a
specific pool.
It would be better if this feature could write the replicas at a later
time on a selected pool.
Thanks.
Rafael.
From: "Anthony D'Atri" <anthony.datri(a)gmail.com>
Sent: 2024/02/01 15:00:59
To: quaglio(a)bol.com.br
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Performance improvement suggestion
I'd totally defer to the RADOS folks.
One issue might be adding a separate code path, which can have all sorts
of problems.
> On Feb 1, 2024, at 12:53, quaglio(a)bol.com.br wrote:
>
>
>
> Ok Anthony,
>
> I understood what you said. I also respect all the professional history
> and experience you have.
>
> Anyway, could there be a configuration flag to make this happen?
>
> Similar to the ones that already exist, such as "--yes-i-really-mean-it".
>
> This way, the default storage behavior would remain as it is, while still
> allowing situations like the one I mentioned.
>
> This would permit some rules to be relaxed (even if they do not seem OK
> at first).
> Likewise, there are already features like lazyio that make some
> exceptions to standard procedures.
> Remember: it's just a suggestion.
> If this type of functionality is not of interest, that is OK.
>
>
>
> Rafael.
>
>
> From: "Anthony D'Atri" <anthony.datri(a)gmail.com>
> Sent: 2024/02/01 12:10:30
> To: quaglio(a)bol.com.br
> Cc: ceph-users(a)ceph.io
> Subject: [ceph-users] Re: Performance improvement suggestion
>
>
>
> > I didn't say I would accept the risk of losing data.
>
> That's implicit in what you suggest, though.
>
> > I just said that it would be interesting if the objects were first
> > written only to the primary OSD.
>
> What happens when that host / drive smokes before it can replicate? What
> happens if a secondary OSD gets a read op before the primary updates it?
> Swift object storage users have to code around this potential. It's a
> non-starter for block storage.
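
The two failure cases raised above can be illustrated with a toy model. This
is only a sketch under stated assumptions: the class ToyPrimary and its
methods are hypothetical names invented for illustration, there is a single
secondary, and none of this is Ceph code or a Ceph API.

```python
class ToyPrimary:
    """Toy primary that acks a write before replicating it (the proposal)."""

    def __init__(self):
        self.local = {}      # objects stored on the primary
        self.pending = []    # names acked to the client, not yet replicated
        self.secondary = {}  # objects stored on a single toy secondary

    def write(self, name, data):
        self.local[name] = data
        self.pending.append(name)
        return "ack"  # the client proceeds as if the data were durable

    def replicate(self):
        # Background step: copy pending objects to the secondary.
        for name in self.pending:
            self.secondary[name] = self.local[name]
        self.pending.clear()

    def fail(self):
        # Primary dies: every acked-but-unreplicated write is gone.
        lost = list(self.pending)
        self.local.clear()
        self.pending.clear()
        return lost


osd = ToyPrimary()
osd.write("obj1", b"v1")
osd.replicate()
osd.write("obj1", b"v2")       # overwrite, acked but not yet replicated
osd.write("obj2", b"v1")       # new object, acked but not yet replicated

stale = osd.secondary["obj1"]  # a read served by the secondary sees old data
lost = osd.fail()              # primary dies before the background copy

print(stale)  # b'v1'
print(lost)   # ['obj1', 'obj2']
```

In this model the client was told "ack" for both writes, yet one read
returned stale data and both updates disappeared with the primary.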
>
> This is similar to why RoC HBAs (which are a badly outdated thing to begin
> with) will only enter writeback mode if they have a BBU / supercap -- and
> of course if their firmware and hardware aren't pervasively buggy. Guess
> how I know this?
>
> > This way it would greatly increase performance (both for IOPS and
> > throughput).
>
> It might increase low-QD IOPS for a single client on slow media with
> certain networking. Depending on media, it wouldn't increase throughput.
>
> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x
> the network resources between the client and the servers.
>
> > Later (in the background), write the replicas. This would avoid leaving
> > users/software waiting for the write acknowledgment from all replicas
> > when the storage is overloaded.
>
> If one makes the mistake of using HDDs, they're going to be overloaded no
> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a
> stone. Throughput is going to be limited by the SATA interface and seeking
> no matter what.
>
> > Where I work, performance is very important and we don't have money to
> > build an entire cluster only with NVMe.
>
> If there isn't money, then it isn't very important. But as I've written
> before, NVMe clusters *do not cost appreciably more than spinners* unless
> your procurement processes are bad. In fact they can cost significantly
> less. This is especially true with object storage and archival where one
> can leverage QLC.
>
> * Buy generic drives from a VAR, not channel drives through a chassis
> brand. Far less markup, and moreover you get the full 5 year warranty, not
> just 3 years. And you can painlessly RMA drives yourself - you don't have
> to spend hours going back and forth with $chassisvendor's TAC arguing
> about every single RMA. I've found that this is so bad that it is more
> economical to just throw away a failed component worth < USD 500 than to
> RMA it. Do you pay for extended warranty / support? That's expensive too.
>
> * Certain chassis brands who shall remain nameless push RoC HBAs hard with
> extreme markups. List prices as high as USD 2000. Per server, eschewing
> those abominations makes up for a lot of the drive-only unit economics.
>
> * But this is the part that lots of people don't get: You don't just stack
> up the drives on a desk and use them. They go into *servers* that cost
> money and *racks* that cost money. They take *power* that costs money.
>
> * $ / IOPS are FAR better for ANY SSD than for HDDs
>
> * RUs cost money, so do chassis and switches
>
> * Drive failures cost money
>
> * So does having your people and applications twiddle their thumbs waiting
> for stuff to happen. I worked for a supercomputer company who put
> low-memory low-end diskless workstations on engineers' desks. They spent
> lots of time doing nothing waiting for their applications to respond. This
> company no longer exists.
>
> * So does the risk of taking *weeks* to heal from a drive failure
>
> Punch honest numbers into https://www.snia.org/forums/cmsi/programs/TCOcalc
>
> I walked through this with a certain global company. QLC SSDs were
> demonstrated to have like 30% lower TCO than spinners. Part of the
> equation is that they were accustomed to limiting HDD size to 8 TB because
> of the bottlenecks, and thus requiring more servers, more switch ports,
> more DC racks, more rack/stack time, more administrative overhead. You can
> fit 1.9 PB of raw SSD capacity in a 1U server. That same RU will hold at
> most 88 TB of the largest spinners you can get today. 22 TIMES the
> density. And since many applications can barely tolerate the spinner
> bottlenecks, capping spinner size at even 10T makes that like 40 TIMES
> better density with SSDs.
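
The density figures above can be checked with a few lines of arithmetic. The
1.9 PB and 88 TB per-RU numbers come from the message; the split of 88 TB
into 4 drive bays of 22 TB each is an assumption made here only to work out
the capped case, and decimal units (1 PB = 1000 TB) are assumed as well.

```python
SSD_TB_PER_RU = 1900.0  # 1.9 PB of raw SSD in a 1U server (from the text)
HDD_TB_PER_RU = 88.0    # largest spinners in the same RU (from the text)

# Density ratio with the largest spinners available.
ratio_largest = SSD_TB_PER_RU / HDD_TB_PER_RU

# Assumption: 88 TB per RU means 4 drive bays of 22 TB each.
BAYS_PER_RU = 4
CAPPED_HDD_TB = 10.0    # spinner size capped at 10 TB (from the text)
ratio_capped = SSD_TB_PER_RU / (BAYS_PER_RU * CAPPED_HDD_TB)

print(round(ratio_largest, 1))  # 21.6
print(round(ratio_capped, 1))   # 47.5
```

The first ratio rounds to the 22x quoted above. Under the assumed bay count
the capped case lands in the high 40s, the same order of magnitude as the
"like 40 TIMES" figure.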
>
>
> > However, I don't think it's desirable to lose the functionality of the
> > replicas.
> > I'm just suggesting another way to increase performance without losing
> > the functionality of replicas.
> >
> >
> > Rafael.
> >
> >
> > From: "Anthony D'Atri" <anthony.datri(a)gmail.com>
> > Sent: 2024/01/31 17:04:08
> > To: quaglio(a)bol.com.br
> > Cc: ceph-users(a)ceph.io
> > Subject: Re: [ceph-users] Performance improvement suggestion
> >
> > Would you be willing to accept the risk of data loss?
> >
> >>
> >> On Jan 31, 2024, at 2:48 PM, quaglio(a)bol.com.br wrote:
> >>
> >> Hello everybody,
> >> I would like to make a suggestion for improving performance in the
> >> Ceph architecture.
> >> I don't know if this group would be the best place or if my proposal
> >> is correct.
> >>
> >> My suggestion concerns the item
> >> https://docs.ceph.com/en/latest/architecture/, at the end of the topic
> >> "Smart Daemons Enable Hyperscale".
> >>
> >> The client needs to "wait" for the configured number of replicas to be
> >> written (so that the client receives an ok and continues). This way, if
> >> there is slowness on any of the disks on which the PG will be updated,
> >> the client is left waiting.
> >>
> >> It would be possible to:
> >>
> >> 1-) Write only to the primary OSD
> >> 2-) Write the other replicas in the background (the same way as when
> >> an OSD fails: "degraded").
> >>
> >> This way, the client has a faster response when writing to storage:
> >> improving latency and performance (throughput and IOPS).
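
The latency argument in this proposal can be written down as a very
simplified model. It is a sketch only: the function names and the latency
values are hypothetical, replicas are assumed to be written in parallel, and
network hops are ignored. It says nothing about durability.

```python
def sync_ack_latency_ms(primary_ms, replica_ms):
    """Client-visible latency when the ack waits for every copy."""
    # Copies are written in parallel, so the slowest one sets the pace.
    return max([primary_ms, *replica_ms])


def async_ack_latency_ms(primary_ms, replica_ms):
    """Client-visible latency if the ack is sent after the primary only."""
    # Replica latencies no longer reach the client in this model.
    return primary_ms


# Hypothetical per-OSD write latencies in milliseconds; one replica is slow.
primary = 2.0
replicas = [2.5, 40.0]

print(sync_ack_latency_ms(primary, replicas))   # 40.0
print(async_ack_latency_ms(primary, replicas))  # 2.0
```

The gap between the two numbers is the gain being proposed. It is also the
window during which an acknowledged write exists on a single OSD.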
> >>
> >> I would find it acceptable to wait a period of time (seconds) until
> >> all replicas are ok (written asynchronously) in exchange for improved
> >> performance.
> >>
> >> Could you evaluate this scenario?
> >>
> >>
> >> Rafael.
> >>
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users(a)ceph.io
> >> To unsubscribe send an email to ceph-users-leave(a)ceph.io