On November 8, 2020, at 11:30, Tony Liu
<tonyliu0592(a)hotmail.com> wrote:
Is it FileStore or BlueStore? With this SSD-HDD solution, is the journal
or WAL/DB on SSD or HDD? My understanding is that there is no
benefit to putting the journal or WAL/DB on SSD with such a solution.
Keeping them off a shared SSD would also eliminate the single point of
failure that comes from having all WAL/DB on one SSD. Just want to confirm.
We are building a new cluster, so BlueStore. I think putting WAL/DB on SSD is more about
performance. How is this related to eliminating a single point of failure? I’m going to
deploy WAL/DB on SSD for my HDD OSDs and, of course, just use a single device for SSD OSDs.
Another thought is to have separate pools, such as an all-SSD pool and
an all-HDD pool. Each pool would be used for a different purpose. For
example, images, backups, and objects can be in the all-HDD pool and VM
volumes can be in the all-SSD pool.
Thanks!
Tony
> -----Original Message-----
> From: 胡 玮文 <huww98(a)outlook.com>
> Sent: Monday, October 26, 2020 9:20 AM
> To: Frank Schilder <frans(a)dtu.dk>
> Cc: Anthony D'Atri <anthony.datri(a)gmail.com>; ceph-users(a)ceph.io
> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD
> replicated pool
>
>
>>> On October 26, 2020, at 15:43, Frank Schilder <frans(a)dtu.dk> wrote:
>>
>>
>>> I’ve never seen anything that implies that lead OSDs within an acting
> set are a function of CRUSH rule ordering.
>>
>> This is actually a good question. I believed that I had seen/heard
> that somewhere, but I might be wrong.
>>
>> Looking at the definition of a PG, it states that a PG is an ordered
> set of OSD (IDs) and the first up OSD will be the primary. In other
> words, it seems that the lowest OSD ID is decisive. If the SSDs were
> deployed before the HDDs, they have the smallest IDs and, hence, will be
> preferred as primary OSDs.
>
> I don’t think this is correct. In my experiments using the previously
> mentioned CRUSH rule, the primary OSDs are always SSDs, no matter what
> the IDs of the SSD OSDs are.
>
> I also had a look at the code. If I understand it correctly:
>
> * If the default primary affinity is not changed, the primary affinity
> logic is skipped, and the primary is the first OSD returned by the
> CRUSH algorithm [1].
>
> * The order of OSDs returned by CRUSH still matters if you change the
> primary affinity. The affinity represents the probability that a test
> succeeds. The first OSD is tested first, and so has a higher
> probability of becoming primary [2].
>   * If any OSD has primary affinity = 1.0, its test always succeeds,
>     and no OSD after it will ever be primary.
>   * Suppose CRUSH returns 3 OSDs, each with primary affinity set to
>     0.5. Then the 2nd OSD has a probability of 0.25 of being primary and
>     the 3rd has a probability of 0.125. Otherwise (probability 0.625),
>     the 1st will be primary.
>   * If no test succeeds (suppose all OSDs have an affinity of 0), the
>     1st OSD will be primary as a fallback.
>
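The selection order described in the list above can be sketched in Python. This is only a simplified model under stated assumptions, not the code in OSDMap.cc: it treats each affinity test as an independent random trial, whereas Ceph appears to derive the outcome from a hash so that the result is stable per PG. The function name `primary_probabilities` is made up for illustration.

```python
def primary_probabilities(affinities):
    """Return, per position in the CRUSH-ordered OSD list, the chance of being primary.

    Simplified model: OSDs are tested in order, each test passes with
    probability equal to that OSD's primary affinity, the first OSD that
    passes becomes primary, and the first OSD is the fallback if none pass.
    """
    probs = []
    nobody_yet = 1.0  # probability that no earlier OSD passed its test
    for affinity in affinities:
        probs.append(nobody_yet * affinity)
        nobody_yet *= 1.0 - affinity
    if probs:
        probs[0] += nobody_yet  # fallback: first OSD when every test failed
    return probs


if __name__ == "__main__":
    print(primary_probabilities([0.5, 0.5, 0.5]))  # three OSDs at 0.5
    print(primary_probabilities([1.0, 0.5, 0.5]))  # first OSD at affinity 1.0
    print(primary_probabilities([0.0, 0.0, 0.0]))  # all zero: fallback to first
```

With an SSD OSD first in the list and its affinity left at 1.0, the model gives that OSD the primary role every time, which is consistent with the experiments reported in this message.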
> [1]:
>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.co…
> 53/src/osd/OSDMap.cc#L2456
> [2]:
>
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.co…
> 53/src/osd/OSDMap.cc#L2561
>
> So, setting the primary affinity of all SSD OSDs to 1.0 should be
> sufficient for them to be primary in my case.
>
> Do you think I should contribute this to the documentation?
>
>> This, however, is not a sustainable situation. Any addition of OSDs
> will mess this up and the distribution scheme will fail in the future. A
> way out seems to be:
>>
>> - subdivide your HDD storage using device classes:
>> * define a device class for HDDs with primary affinity=0, for example,
>> pick 5 HDDs and change their device class to hdd_np (for no primary)
>> * set the primary affinity of these HDD OSDs to 0
>> * modify your crush rule to use "step take default class hdd_np"
>> * this will create a pool with primaries on SSD and balanced storage
>> distribution between SSD and HDD
>> * all-HDD pools deployed as usual on class hdd
>> * when increasing capacity, one needs to take care of adding disks to
>> hdd_np class and set their primary affinity to 0
>> * somewhat increased admin effort, but fully working solution
>>
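The per-OSD steps in the list above could be scripted roughly as follows. This is a sketch under assumptions: the helper name `hdd_np_commands` and the OSD ids are hypothetical, and the function only builds the command lines for review rather than running them. It relies on the `ceph osd crush rm-device-class`, `ceph osd crush set-device-class` and `ceph osd primary-affinity` subcommands.

```python
def hdd_np_commands(osd_ids, device_class="hdd_np"):
    """Build the ceph CLI calls that move OSDs into a no-primary HDD class."""
    commands = []
    for osd_id in osd_ids:
        osd = f"osd.{osd_id}"
        # An OSD's existing class must be removed before a new one can be set.
        commands.append(["ceph", "osd", "crush", "rm-device-class", osd])
        commands.append(["ceph", "osd", "crush", "set-device-class", device_class, osd])
        # Affinity 0 asks Ceph to avoid this OSD as primary where it can.
        commands.append(["ceph", "osd", "primary-affinity", osd, "0"])
    return commands


if __name__ == "__main__":
    for cmd in hdd_np_commands([12, 13]):  # OSD ids 12 and 13 are placeholders
        print(" ".join(cmd))
```

The same helper would need to be rerun for every disk added to the hdd_np class later, which is the extra admin effort mentioned in the list.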
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Anthony D'Atri <anthony.datri(a)gmail.com>
>> Sent: 25 October 2020 17:07:15
>> To: ceph-users(a)ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD
>> replicated pool
>>
>>> I'm not entirely sure if primary on SSD will actually make the read
> happen on SSD.
>>
>> My understanding is that by default reads always happen from the lead
> OSD in the acting set. Octopus seems to (finally) have an option to
> spread the reads around, which IIRC defaults to false.
>>
>> I’ve never seen anything that implies that lead OSDs within an acting
> set are a function of CRUSH rule ordering. I’m not asserting that they
> aren’t though, but I’m … skeptical.
>>
>> Setting primary affinity would do the job, and you’d want to have cron
> continually update it across the cluster to react to topology changes.
> I was told of this strategy back in 2014, but haven’t personally seen it
> implemented.
>>
>> That said, HDDs are more of a bottleneck for writes than reads and
> just might be fine for your application. Tiny reads are going to limit
> you to some degree regardless of drive type, and you do mention
> throughput, not IOPS.
>>
>> I must echo Frank’s notes about capacity too. Ceph can do a lot of
> things, but that doesn’t mean something exotic is necessarily the best
> choice. You’re concerned about 3R only yielding 1/3 of raw capacity if
> using an all-SSD cluster, but the architecture you propose limits you
> anyway because of drive size. Consider also chassis, CPU, RAM, RU, switch
> port costs as well, and the cost of you fussing over an exotic solution
> instead of the hundreds of other things in your backlog.
>>
>> And your cluster as described is *tiny*. Honestly I’d suggest
> considering one of these alternatives:
>>
>> * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are really
> promising for replacing HDDs for density in this kind of application.
> You might even consider ARM if IOPS aren’t a concern.
>> * An NVMeoF solution
>>
>>
>> Cache tiers are “deprecated”, but then so are custom cluster names.
>> Neither appears
>>
>>> For EC pools there is an option "fast_read"
> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
> which states that a read will return as soon as the first
> k shards have arrived. The default is to wait for all k+m shards (all
> replicas). This option is not available for replicated pools.
>>> Now, not sure if this option is not available for replicated pools
> because the read will always be served by the acting primary, or if it
> currently waits for all replicas. In the latter case, reads will wait
> for the slowest device.
>>> I'm not sure if I interpret this correctly. I think you should test
> the setup with HDD only and SSD+HDD to see if read speed improves. Note
> that write speed will always depend on the slowest device.
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>> ________________________________________
>>> From: Frank Schilder <frans(a)dtu.dk>
>>> Sent: 25 October 2020 15:03:16
>>> To: 胡 玮文; Alexander E. Patrakov
>>> Cc: ceph-users(a)ceph.io
>>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD
>>> replicated pool
>>>
>>> A cache pool might be an alternative, heavily
> depending on how much data is hot. However, then you will have much less
> SSD capacity available, because it also requires replication.
>>> Looking at the setup, you have only 10*1T = 10T SSD but 20*6T =
> 120T HDD, so you will probably run short of SSD capacity. Or, looking at it
> the other way around, with copies on 1 SSD + 3 HDD, you will only be able
> to use about 30T out of 120T HDD capacity.
>>> With this replication, the usable storage will be 10T and raw used
> will be 10T SSD and 30T HDD. If you can't do anything else on the HDD
> space, you will need more SSDs. If your servers have more free disk
> slots, you can add SSDs over time until you have at least 40T SSD
> capacity to balance SSD and HDD capacity.
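The capacity arithmetic in the paragraphs above can be reproduced with a small Python helper. It is a rough sketch: the function name `mixed_pool_capacity` is made up, and it ignores full ratios, metadata overhead and uneven data distribution.

```python
def mixed_pool_capacity(ssd_raw_tb, hdd_raw_tb, ssd_copies=1, hdd_copies=3):
    """Usable capacity and raw usage for a pool keeping copies on both device types."""
    # The device type that fills up first limits the whole pool.
    usable = min(ssd_raw_tb / ssd_copies, hdd_raw_tb / hdd_copies)
    return {
        "usable_tb": usable,
        "ssd_used_tb": usable * ssd_copies,
        "hdd_used_tb": usable * hdd_copies,
        # SSD capacity at which neither device type is left with spare space
        "ssd_needed_to_balance_tb": hdd_raw_tb / hdd_copies * ssd_copies,
    }


if __name__ == "__main__":
    # 10 x 1T SSD and 20 x 6T HDD, 1 copy on SSD plus 3 copies on HDD
    print(mixed_pool_capacity(10, 120))
```

For the cluster in this thread the helper gives 10T usable, 10T SSD and 30T HDD raw used, and 40T of SSD needed to balance the 120T of HDD.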
>>> Personally, I think the 1SSD + 3HDD is a good option compared with a
> cache pool. You have the data security of 3-times replication and, if
> everything is up, need only 1 copy in the SSD cache, which means that
> you have 3 times the cache capacity.
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>> ________________________________________
>>> From: 胡 玮文 <huww98(a)outlook.com>
>>> Sent: 25 October 2020 13:40:55
>>> To: Alexander E. Patrakov
>>> Cc: ceph-users(a)ceph.io
>>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD
>>> replicated pool
>>>
>>> Yes. This is a limitation of the CRUSH algorithm, in my
> mind. In order to guard against 2 host failures, I’m going to use 4
> replicas, 1 on SSD and 3 on HDD. This will work as intended, right?
> Because at least I can ensure the 3 HDDs are from different hosts.
>>>>> On October 25, 2020, at 20:04, Alexander E. Patrakov <patrakov(a)gmail.com> wrote:
>>>> On Sun, Oct 25, 2020 at 12:11 PM huww98(a)outlook.com
> <huww98(a)outlook.com> wrote:
>>>>> Hi all,
>>>>> We are planning for a new pool to store our dataset using CephFS.
> These data are almost read-only (but not guaranteed) and consist of a
> lot of small files. Each node in our cluster has 1 * 1T SSD and 2 * 6T
> HDD, and we will deploy about 10 such nodes. We aim at getting the
> highest read throughput.
>>>>> If we just use a replicated pool of size 3 on SSD, we should get
> the best performance; however, that leaves us only 1/3 of the SSD
> space as usable. And EC pools are not friendly to such a small-object
> read workload, I think.
>>>>> Now I’m evaluating a mixed SSD and HDD replication strategy.
> Ideally, I want 3 data replicas, each on a different host (failure
> domain): 1 of them on SSD, the other 2 on HDD. Normally every read
> request would be directed to the SSD. So, if every SSD OSD is up, I’d
> expect the same read throughput as the all-SSD deployment.
>>>>> I’ve read the documentation and done some tests. Here is the CRUSH rule
> I’m testing with:
>>>>> rule mixed_replicated_rule {
>>>>>     id 3
>>>>>     type replicated
>>>>>     min_size 1
>>>>>     max_size 10
>>>>>     step take default class ssd
>>>>>     step chooseleaf firstn 1 type host
>>>>>     step emit
>>>>>     step take default class hdd
>>>>>     step chooseleaf firstn -1 type host
>>>>>     step emit
>>>>> }
>>>>> Now I have the following conclusions, but I’m not very sure:
>>>>> * The first OSD produced by CRUSH will be the primary OSD (at least
> if I don’t change the “primary affinity”). So, the above rule is
> guaranteed to map an SSD OSD as the primary of each PG, and every read
> request will read from SSD if it is up.
>>>>> * It is currently not possible to force the SSD and HDD OSDs to be
> chosen from different hosts. So, if I want to ensure data availability
> even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means
> setting the replication size to 4, instead of the ideal value 3, on the
> pool using the above CRUSH rule.
>>>>> Am I correct about the above statements? How would this work in
> your experience? Thanks.
>>>> This works (i.e. guards against host failures) only if you have
>>>> strictly separate sets of hosts that have SSDs and that have HDDs.
>>>> I.e., there should be no host that has both, otherwise there is a
>>>> chance that one hdd and one ssd from that host will be picked.
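The host-overlap concern can be checked by brute force with a toy model in Python. The assumptions are loud ones: every host carries both an SSD and an HDD, the SSD step and the HDD step of the rule choose hosts independently, and every choice is equally likely (real CRUSH is weighted and pseudo-random). The function name `host_overlap` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def host_overlap(num_hosts, pool_size):
    """Enumerate placements of 1 SSD copy plus (pool_size - 1) HDD copies.

    Returns (minimum number of distinct hosts over all placements,
    fraction of placements where the SSD copy shares a host with an HDD copy).
    """
    hdd_copies = pool_size - 1
    if hdd_copies < 1 or num_hosts < hdd_copies:
        raise ValueError("need at least pool_size - 1 hosts and pool_size >= 2")
    total = shared = 0
    min_hosts = pool_size
    for ssd_host in range(num_hosts):
        # The HDD step picks distinct hosts but knows nothing about ssd_host.
        for hdd_hosts in combinations(range(num_hosts), hdd_copies):
            total += 1
            min_hosts = min(min_hosts, len(set(hdd_hosts) | {ssd_host}))
            if ssd_host in hdd_hosts:
                shared += 1
    return min_hosts, Fraction(shared, total)


if __name__ == "__main__":
    print(host_overlap(10, 3))  # size 3: 1 SSD + 2 HDD
    print(host_overlap(10, 4))  # size 4: 1 SSD + 3 HDD
```

In this model a size-3 pool can end up on only 2 hosts, while a size-4 pool always spans at least 3, which is the reasoning behind choosing 1 SSD plus 3 HDD copies to survive two host failures.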
>>>> --
>>>> Alexander E. Patrakov
>>>> CV: http://pc.cd/PLz7
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an
>>> email to ceph-users-leave(a)ceph.io