Thanks for digging this out. I seemed to remember exactly this method (I don't know
from where), but couldn't find it in the documentation and started doubting myself. Yes,
this would be very useful information to add to the documentation, and it also confirms
that your simpler setup with just a specialized CRUSH rule will work exactly as intended
and is stable long-term.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: 胡 玮文 <huww98(a)outlook.com>
Sent: 26 October 2020 17:19
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users(a)ceph.io
Subject: Re: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
On 26 Oct 2020, at 15:43, Frank Schilder
<frans(a)dtu.dk> wrote:
I’ve never seen anything that implies that lead
OSDs within an acting set are a function of CRUSH rule ordering.
This is actually a good question. I believed that I had seen/heard that somewhere, but I
might be wrong.
Looking at the definition of a PG, it states that a PG is an ordered set of OSDs (IDs) and
that the first up OSD will be the primary. In other words, it seems that the lowest OSD ID
is decisive. If the SSDs were deployed before the HDDs, they have the smallest IDs and
will hence be preferred as primary OSDs.
I don’t think this is correct. From my experiments using the previously mentioned CRUSH
rule, the primary OSDs are always SSDs, no matter what the IDs of the SSD OSDs are.
I also had a look at the code. If I understand it correctly:
* If the default primary affinity is unchanged, the primary-affinity logic is skipped
entirely, and the primary is the first OSD returned by the CRUSH algorithm [1].
* The order of OSDs returned by CRUSH still matters if you change the primary affinity.
The affinity represents the probability that a test succeeds. The first OSD is tested
first and therefore has the highest probability of becoming primary. [2]
* If any OSD has a primary affinity of 1.0, its test always succeeds, and no OSD after
it can ever be primary.
* Suppose CRUSH returns 3 OSDs, each with a primary affinity of 0.5. Then the 2nd OSD
has a probability of 0.25 of being primary and the 3rd a probability of 0.125;
otherwise, the 1st is primary.
* If no test succeeds (suppose all OSDs have an affinity of 0), the 1st OSD becomes
primary as a fallback.
[1]:
https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/…
[2]:
https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/…
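If I read the logic right, the resulting probabilities can be sketched with a small
simulation (my own illustration of the rule described above, not the actual Ceph code;
the function name is mine):

```python
def primary_probabilities(affinities):
    """Probability of each OSD (in CRUSH return order) becoming primary,
    given per-OSD primary affinities, under the logic described above."""
    probs = []
    remaining = 1.0  # probability that no earlier OSD has passed its test
    for a in affinities:
        p = remaining * a        # this OSD is reached and its test succeeds
        probs.append(p)
        remaining -= p
    probs[0] += remaining        # fallback: if no test succeeds, the 1st OSD wins
    return probs

# The example from above: three OSDs, each with affinity 0.5.
print(primary_probabilities([0.5, 0.5, 0.5]))  # [0.625, 0.25, 0.125]
# An OSD with affinity 1.0 shadows everything after it.
print(primary_probabilities([1.0, 0.5]))       # [1.0, 0.0]
```

Note that the first OSD collects both its own success probability (0.5) and the
fallback mass (0.125), giving 0.625 in the three-OSD example.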
So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient to make
them the primaries in my case.
Do you think I should contribute these to documentation?
This, however, is not a sustainable situation. Any
addition of OSDs will mess this up and the distribution scheme will fail in the future. A
way out seems to be:
- subdivide your HDD storage using device classes:
* define a device class for HDDs with primary affinity=0, for example, pick 5 HDDs and
change their device class to hdd_np (for no primary)
* set the primary affinity of these HDD OSDs to 0
* modify your crush rule to use "step take default class hdd_np"
* this will create a pool with primaries on SSD and balanced storage distribution between
SSD and HDD
* all-HDD pools deployed as usual on class hdd
* when increasing capacity, one needs to take care to add new disks to the hdd_np class
and set their primary affinity to 0
* somewhat increased admin effort, but fully working solution
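The steps above could look roughly like this on the CLI (untested sketch; the OSD ids
are hypothetical, and the CRUSH rule change requires a manual edit of the decompiled
map):

```shell
# Hypothetical example: move osd.10 - osd.14 into the no-primary class.
for osd in osd.10 osd.11 osd.12 osd.13 osd.14; do
    ceph osd crush rm-device-class "$osd"         # clear the old "hdd" class first
    ceph osd crush set-device-class hdd_np "$osd"
    ceph osd primary-affinity "$osd" 0
done
# Edit the CRUSH rule by hand so its HDD step reads:
#   step take default class hdd_np
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# (edit crushmap.txt, then recompile and inject)
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
```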