I would like to add one comment.
I'm not entirely sure if a primary on SSD will actually make the read happen on the SSD. For
EC pools there is an option "fast_read"
(https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read…),
which states that a read will return as soon as the first k shards have arrived. The
default is to wait for all k+m shards (all replicas). This option is not available for
replicated pools.
Now, I'm not sure whether this option is unavailable for replicated pools because the read
will always be served by the acting primary, or because it currently waits for all
replicas. In the latter case, reads will wait for the slowest device.
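For reference, checking and enabling this on an EC pool would look roughly like the
following (a sketch; the pool name "ec_data" is just a placeholder):
  ceph osd pool get ec_data fast_read        # show the current setting
  ceph osd pool set ec_data fast_read true   # return reads once the first k shards arrive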
I'm not sure if I'm interpreting this correctly. I think you should test the setup with HDD
only and with SSD+HDD to see if read speed improves; a rough way to do this is sketched
below. Note that write speed will always depend on the slowest device.
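A rough comparison could look like this (a sketch; "test_hdd" and "test_mixed" are
hypothetical pools created with an HDD-only rule and with your mixed SSD+HDD rule):
  # populate each pool first, keeping the objects for the read test
  rados bench -p test_hdd 60 write --no-cleanup
  rados bench -p test_mixed 60 write --no-cleanup
  # sequential read test against each pool
  rados bench -p test_hdd 60 seq
  rados bench -p test_mixed 60 seq
If the mixed pool is not noticeably faster on the read test, that would suggest reads are
not being served from the SSD primaries as hoped.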
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 25 October 2020 15:03:16
To: 胡 玮文; Alexander E. Patrakov
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
A cache pool might be an alternative, heavily depending on how much of the data is hot.
However, you will then have much less SSD capacity available, because the cache pool also
requires replication.
Looking at your setup, you have only 10*1T = 10T SSD but 20*6T = 120T HDD, so you will
probably run short of SSD capacity. Or, looking at it the other way around, with copies on
1 SSD + 3 HDD, you will only be able to use about 30T out of the 120T HDD capacity.
With this replication, the usable storage will be 10T, and the raw usage will be 10T SSD
and 30T HDD (the arithmetic is spelled out below). If you can't do anything else with the
remaining HDD space, you will need more SSDs. If your servers have free disk slots, you can
add SSDs over time until you have at least 40T of SSD capacity to balance the SSD and HDD
capacity.
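Just to spell out the arithmetic behind those numbers:
  SSD raw:  10 * 1T = 10T   (1 copy of each object)
  HDD raw:  20 * 6T = 120T  (3 copies of each object)
  usable   = min(10T / 1, 120T / 3) = min(10T, 40T) = 10T
  raw used = 10T SSD + 3 * 10T = 30T HDD, leaving roughly 90T of HDD unused by this pool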
Personally, I think 1 SSD + 3 HDD is a good option compared with a cache pool. You have the
data security of 3-times replication and, if everything is up, need only 1 copy on the SSD
"cache", which means that you have 3 times the cache capacity.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: 胡 玮文 <huww98(a)outlook.com>
Sent: 25 October 2020 13:40:55
To: Alexander E. Patrakov
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
Yes. This is a limitation of the CRUSH algorithm, in my mind. In order to guard against 2
host failures, I'm going to use 4 replicas, 1 on SSD and 3 on HDD. This will work as
intended, right? Because at least I can ensure that the 3 HDDs are on different hosts.
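For reference, applying that to the pool would look roughly like this (a sketch;
"cephfs_data" is a placeholder pool name, and the rule is the mixed_replicated_rule quoted
further down):
  ceph osd pool set cephfs_data crush_rule mixed_replicated_rule
  ceph osd pool set cephfs_data size 4       # 1 SSD copy + 3 HDD copies
  ceph osd pool set cephfs_data min_size 2   # assumed value; choose what matches your availability needs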
On 25 Oct 2020, at 20:04, Alexander E. Patrakov
<patrakov(a)gmail.com> wrote:
On Sun, Oct 25, 2020 at 12:11 PM huww98(a)outlook.com <huww98(a)outlook.com> wrote:
Hi all,
We are planning a new pool to store our dataset using CephFS. The data is almost read-only
(but not guaranteed to be) and consists of a lot of small files. Each node in our cluster
has 1 * 1T SSD and 2 * 6T HDD, and we will deploy about 10 such nodes. We are aiming for
the highest read throughput.
If we just use a replicated pool of size 3 on SSD, we should get the best performance;
however, that leaves us only 1/3 of the SSD space usable. And EC pools are not friendly to
such a small-object read workload, I think.
Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data
replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD,
with every read request normally directed to the SSD. So, if every SSD OSD is up, I’d
expect the same read throughput as with the all-SSD deployment.
I’ve read the documentation and done some tests. Here is the CRUSH rule I’m testing with:
rule mixed_replicated_rule {
        id 3
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take default class hdd
        step chooseleaf firstn -1 type host
        step emit
}
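One way to sanity-check what this rule maps to, without touching live data, is to test it
offline with crushtool (a sketch; rule id 3 as above, 4 replicas):
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt                                # decompile to review the rule text
  crushtool --test -i crushmap.bin --rule 3 --num-rep 4 --show-mappings    # sample PG-to-OSD mappings
  crushtool --test -i crushmap.bin --rule 3 --num-rep 4 --show-bad-mappings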
Now I have the following conclusions, but I’m not very sure:
* The first OSD produced by CRUSH will be the primary OSD (at least if I don’t change the
“primary affinity”). So, the above rule is guaranteed to map an SSD OSD as the primary of
every PG, and every read request will be served from the SSD if it is up (see the check
sketched after this list).
* It is currently not possible to enforce that the SSD and HDD OSDs are chosen from
different hosts. So, if I want to ensure data availability even when 2 hosts fail, I need
to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of
the ideal value of 3, on the pool using the above CRUSH rule.
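To check the first point on a live pool, the acting set and primary of each PG can be
inspected directly (a sketch; the pool name is a placeholder):
  ceph pg ls-by-pool cephfs_data   # the first OSD in each acting set is the primary
  ceph pg map <pgid>               # shows the up/acting sets for a single PG
Cross-checking the listed primaries against the SSD OSD ids would confirm whether the rule
behaves as expected.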
Am I correct about the above statements? How would this work from your experience?
Thanks.
This works (i.e. guards against host failures) only if you have
strictly separate sets of hosts that have SSDs and hosts that have HDDs.
That is, there should be no host that has both; otherwise there is a
chance that one HDD and one SSD from the same host will be picked.
--
Alexander E. Patrakov
CV:
https://nam10.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpc.cd%2FPL…
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io