Hello list,
 
I am new to Ceph and am trying to understand the Ceph philosophy. We ran a bunch of tests on a 3-node Ceph cluster.
 
After these tests I see that the network is always the bottleneck when writing to very fast storage.
 
So let me first explain my point of view before I get to the questions:
For this discussion I am assuming current PCIe-based NVMe drives, which can write about 8 GiB/s, i.e. roughly 64 Gbit/s.
 
So with 2 or 4 such drives in a local server, an ideal server can write about 128 Gbit/s (2 drives) or 256 Gbit/s (4 drives).
 
The latencies are also a dream value if we use PCIe 5.0 (or even 4.0).
 
Now consider the situation that you have 5 nodes, each with 4 of those drives:
buying the corresponding network switches alone will drive any small or mid-sized company into bankruptcy ;-)
 
Yet the server hardware is still simple commodity hardware, which can easily saturate any given commodity network hardware.
If I want to be able to use the full 64 Gbit/s of a single drive, I need at least 100 Gbit/s networking, or tons of trunked ports and cabling with lower-bandwidth switches.
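
Just to make my back-of-the-envelope numbers explicit, here is a quick Python calculation (the drive speed and NIC size are of course only example assumptions):

    # rough bandwidth arithmetic for one node (all numbers are example assumptions)
    DRIVE_GIB_S = 8              # sequential write speed of one NVMe drive in GiB/s
    DRIVES_PER_NODE = 4
    NIC_GBIT_S = 100             # one 100 Gbit/s NIC

    gbit_per_drive = DRIVE_GIB_S * 1024**3 * 8 / 1e9    # ~68.7 Gbit/s
    gbit_per_node = gbit_per_drive * DRIVES_PER_NODE    # ~275 Gbit/s

    print(f"one drive: {gbit_per_drive:.1f} Gbit/s")
    print(f"one node : {gbit_per_node:.1f} Gbit/s")
    print(f"100G NICs needed just to match the local drives: {gbit_per_node / NIC_GBIT_S:.1f}")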
 
If we also consider distributing the nodes over racks, buildings at the same location, or distributed datacenters, the costs become even more painful.

And IMHO this could be "easily" changed if some "minor" difference in behaviour were available.
My target scenario would be a Ceph cluster built from servers like the ones described above.
The Ceph commit requirement would be 2 copies on different OSDs (comparable to a mirrored drive) and 3 or 4 copies in total on the cluster (comparable to a RAID with multi-disk redundancy).
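
For reference, the "3 or 4 copies in total" part is what we set today via the pool's replication settings. Here is a small sketch using the python-rados bindings (the pool name "mypool" and the conffile path are just examples); as far as I understand, min_size only controls when a PG still accepts I/O, it does not make the primary acknowledge a write after only 2 of the 4 copies, and that is exactly the part I would like to change:

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # 4 copies in total, PG stays writeable as long as 2 copies are available
    for var, val in (('size', '4'), ('min_size', '2')):
        cmd = json.dumps({'prefix': 'osd pool set',
                          'pool': 'mypool',
                          'var': var,
                          'val': val})
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        print(f"osd pool set {var}={val}: ret={ret} {outs}")

    cluster.shutdown()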

In all our tests so far, we could not control how Ceph persists these 2 copies; it always tries to persist them somewhere across the network.
Q1: Is this behavior mandatory?
 
Our common workload, and AFAIK that of nearly all web-service-based applications, is:
- short bursts of high bandwidth (e.g. multiple MiB/s or even GiB/s)
- probably a write-to-read ratio of roughly 1:4 or even 1:6 when utilizing the cluster
I hope I have explained the situation well enough.
 
 
Now assume my ideal world with Ceph.
If Ceph would do the following (sketched as pseudo-code right after this list):
1. commit 2 copies to local drives on the node the Ceph client is connected to,
2. after that commit, sync the data (optimized/queued) over the network to fulfil the usual Ceph requirement of 4 copies in total,
3. maybe optionally move 1 copy away from the initial node, which would otherwise still hold the 2 local copies...
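
Expressed as pseudo-code, the write path I have in mind would look roughly like this (nothing of this exists in Ceph today, all names are made up purely to illustrate the idea):

    import queue

    replication_queue = queue.Queue()       # deferred network replication work

    def client_write(data, local_osds, remote_osds):
        # 1. commit 2 copies to local drives on the node the client talks to
        for osd in local_osds[:2]:
            osd.write(data)                 # durable write to a local NVMe-backed OSD
        # -> the client gets its ack here, at local NVMe speed

        # 2. defer the remaining copies; a background worker drains this queue
        #    over the network until the usual 4 copies in total exist
        replication_queue.put((data, remote_osds[:2]))

        # 3. optionally the cluster could later move 1 of the 2 local copies
        #    to yet another node (rebalancing, not shown here)
        return "ack"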
 
this behaviour would ensure that:
- the perceived performance for the Ceph clients would be the full bandwidth of the local NVMes, since the 2 copies are written to the local NVMes at 64 Gbit/s and the latency would be comparable to writing locally (see the small timing sketch after this list)
- 2 copies would be reported back to any Ceph client nearly "immediately"
- network bandwidth utilization would be optimized: since we would not duplicate the data transfers on the network immediately, but defer them from the client's initial write, a queuing mechanism could be utilized much better
- IMHO scaling on commodity networking would become far easier, since the network requirements would be factors lower
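
The timing sketch mentioned above: this is roughly the kind of comparison I have in mind between a local fsync'ed write and a replicated RADOS write (python-rados sketch; the pool name and the local path are just examples):

    import os, time
    import rados

    data = os.urandom(4 * 1024 * 1024)      # 4 MiB test payload

    # local NVMe: write + fsync on a locally mounted filesystem
    t0 = time.perf_counter()
    fd = os.open('/mnt/nvme/testfile', os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    os.write(fd, data)
    os.fsync(fd)
    os.close(fd)
    print('local write :', time.perf_counter() - t0)

    # RADOS: write_full() only returns after all replicas have committed
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')
    t0 = time.perf_counter()
    ioctx.write_full('testobj', data)
    print('rados write :', time.perf_counter() - t0)
    ioctx.close()
    cluster.shutdown()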
 
Maybe I have a totally wrong understanding of a Ceph cluster and of how the copies are distributed.
Q2: If so, please let me know where I can read more about this.

So, to sum it up quickly:
Q3: Is it possible to configure Ceph to behave as in my ideal world described above?
   That means: first write a minimal number of copies to local drives, and defer syncing the other copies over the network.
Q4: If not, are there any plans in this direction?
Q5: If it is possible, is there good documentation for it?
Q6: We would still like to be able to distribute copies over racks, enclosures and datacenters (something like the CRUSH sketch below?)
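
Regarding Q6, I assume the rack/datacenter placement would still be handled by CRUSH, e.g. by a replicated rule with the rack as failure domain that the pool is pointed to. A sketch of what I think that looks like via python-rados (rule and pool names are made up; the CLI equivalent would be "ceph osd crush rule create-replicated replicated_racks default rack"):

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # create a replicated rule that spreads copies across racks
    cmd = json.dumps({'prefix': 'osd crush rule create-replicated',
                      'name': 'replicated_racks',
                      'root': 'default',
                      'type': 'rack'})
    ret, outbuf, outs = cluster.mon_command(cmd, b'')

    # point the pool at that rule
    cmd = json.dumps({'prefix': 'osd pool set',
                      'pool': 'mypool',
                      'var': 'crush_rule',
                      'val': 'replicated_racks'})
    ret, outbuf, outs = cluster.mon_command(cmd, b'')

    cluster.shutdown()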
 
best wishes
Hans