Hello everybody,
we are running a multisite (active/active) gateway on 2 ceph clusters.
One production and one backup cluster.
Now we back up the master with rclone and no longer need the second gateway.
What is the best way to shut down the second gateway and remove the multisite sync from the master, without losing any data on the master site?
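From the docs, the sequence seems to be roughly the following (all stock radosgw-admin subcommands; the zone/zonegroup names are placeholders), but I would like confirmation that it is safe:
# with the secondary gateway stopped, on the master:
radosgw-admin zonegroup remove --rgw-zonegroup=<zonegroup> --rgw-zone=<secondary-zone>
radosgw-admin zone delete --rgw-zone=<secondary-zone>
radosgw-admin period update --commit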
greetings
Markus
Hi All,
The CephFS kernel client is affected by the kernel page cache when we write
data to it: outgoing traffic spikes when the OS starts flushing the page cache.
Is there a way to make the CephFS kernel client write data to the Ceph OSDs
smoothly when buffered I/O is used?
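So far the only lever I have found is the generic kernel writeback sysctls (these are standard Linux knobs, not CephFS-specific, and the values below are only illustrative):
sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at ~64MB of dirty data
sysctl -w vm.dirty_bytes=268435456             # throttle writers at ~256MB of dirty data
sysctl -w vm.dirty_writeback_centisecs=100     # wake the flusher threads every 1s instead of 5s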
I'm setting up a radosgw for my Ceph Octopus cluster. As soon as I
started the radosgw service, I noticed that it created a handful of new
pools. These pools were assigned the 'replicated_data' crush rule
automatically.
I have a mixed hdd/ssd/nvme cluster, and this 'replicated_data' crush
rule spans all device types. I would like radosgw to use a replicated
SSD pool and avoid the HDDs. What is the recommended way to change the
crush device class for these pools without risking the loss of any data
in the pools? I will note that I have not yet written any user data to
the pools. Everything in them was added by the radosgw process
automatically.
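Based on the docs, my tentative plan is the following (the rule name is mine; the pool names are examples of what radosgw typically creates), but I want to make sure it is safe:
# create a replicated rule restricted to the ssd device class
ceph osd crush rule create-replicated rgw-ssd default host ssd
# point each rgw pool at the new rule; data then rebalances onto SSDs, e.g.:
ceph osd pool set default.rgw.meta crush_rule rgw-ssd
ceph osd pool set default.rgw.log crush_rule rgw-ssd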
--Mike
Hi All,
We have recently deployed a new Ceph cluster (Octopus 15.2.4), which consists of:
12 OSD nodes (16 cores + 200GB RAM, 30x14TB disks, CentOS 8)
3 MON nodes (8 cores + 15GB RAM, CentOS 8)
We use an erasure-coded pool and RBD block devices.
3 Ceph clients use the RBD devices; each has 25 RBDs, and each RBD is
10TB. Each RBD is formatted with the EXT4 file system.
Cluster health is OK, and the hardware is new and in good shape.
All the machines have a 10Gbps (active/passive) bonded interface configured.
Read performance of the cluster is OK; however, writes are very slow.
On one of the RBDs we ran a perf test.
fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128
-rw=randread -runtime=60 -filename=/dev/rbd40
Run status group 0 (all jobs):
READ: bw=401MiB/s (420MB/s), 401MiB/s-401MiB/s (420MB/s-420MB/s),
io=23.5GiB (25.2GB), run=60054-60054msec
fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128
-rw=randwrite -runtime=60 -filename=/dev/rbd40
Run status group 0 (all jobs):
WRITE: bw=217KiB/s (222kB/s), 217KiB/s-217KiB/s (222kB/s-222kB/s),
io=13.2MiB (13.9MB), run=62430-62430msec
I see high I/O wait on the client.
Any suggestions/pointers to address this issue are really appreciated.
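For anyone trying to reproduce this, the EC pool can also be benchmarked directly with the stock rados bench tool, bypassing RBD and EXT4 (the pool name is a placeholder):
rados bench -p <ec-pool> 60 write -b 4096 -t 128 --no-cleanup   # 4k writes, 128 in flight
rados -p <ec-pool> cleanup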
Thanks and Regards,
Athreya
Hi,
I have a problem with RGW in a multisite configuration on Nautilus 14.2.11. Both zones have SSDs and a 10Gbps network. The master zone consists of 5x Dell R740XD servers (each with 256GB RAM, 8x800GB SSD for Ceph, 24 CPU cores). The secondary zone (temporary, for testing) consists of 3x HPE DL360 Gen10 servers (each with 256GB RAM, 6x800GB SSD, 48 CPU cores).
We have 17 test buckets with manual sharding (101 shards). Every bucket holds 10M small objects (10kB - 15kB). The zonegroup configuration is attached below. Replication of 150M objects from master to the secondary zone took almost 28 hours, and that replication completed successfully.
After deleting objects from one bucket in the master zone, it is no longer possible to sync the zones properly. I tried restarting both secondary RGWs, but without success. Sync status on the secondary zone is behind the master, and the number of objects in the buckets on the master zone differs from the secondary zone.
Ceph HEALTH status is WARNING on both zones. On the master zone I have 146 large objects found in pool 'prg2a-1.rgw.buckets.index' and 16 large objects found in pool 'prg2a-1.rgw.log'. On the secondary zone there are 88 large objects found in pool 'prg2a-2.rgw.log' and 1584 large objects found in pool 'prg2a-2.rgw.buckets.index'.
Average OSD latencies on the secondary zone during sync were "read 0.158ms, write 1.897ms, overwrite 1.634ms". After the unsuccessful sync (after ~12h of syncing, RGW requests, IOPS, and throughput all fall off), average OSD latencies jump to "read 125ms, write 30ms, overwrite 272ms". After stopping both RGWs on the secondary zone, average OSD latencies drop to almost 0ms, but when I start the RGWs on the secondary zone again, OSD latencies rise back to "read 125ms, write 30ms, overwrite 272ms" with spikes up to 3 seconds.
We have seen the same behaviour of Ceph multisite with a large number of objects in one bucket (150M+ objects), so we tried a different strategy with smaller buckets, but the results are the same.
I would appreciate any help or advice on how to tune or diagnose multisite problems.
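For reference, the sync state can be inspected per zone and per bucket with the stock radosgw-admin subcommands (the bucket name is a placeholder):
radosgw-admin sync status
radosgw-admin data sync status --source-zone=prg2a-1
radosgw-admin bucket sync status --bucket=<bucket>
radosgw-admin sync error list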
Does anyone else have any ideas? Is there anyone else with a similar use-case? I do not know what is wrong.
Thank you and best regards,
Miroslav
radosgw-admin zonegroup get
{
"id": "ac0005da-2e9f-4f38-835f-72b289c240d0",
"name": "prg2a",
"api_name": "prg2a",
"is_master": "true",
"endpoints": [
"http://s3.prg1a.sys.cz:80",
"http://s3.prg2a.sys.cz:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "d9ebbd1f-3312-4083-b4c2-843e1fb899ad",
"zones": [
{
"id": "d9ebbd1f-3312-4083-b4c2-843e1fb899ad",
"name": "prg2a-1",
"endpoints": [
"http://10.104.200.101:7480",
"http://10.104.200.102:7480"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": [],
"redirect_zone": ""
},
{
"id": "fdd76c02-c679-4ec7-8e7d-c14d2ac74fb4",
"name": "prg2a-2",
"endpoints": [
"http://10.104.200.221:7480",
"http://10.104.200.222:7480"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": [],
"redirect_zone": ""
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": [],
"storage_classes": [
"STANDARD"
]
}
],
"default_placement": "default-placement",
"realm_id": "cb831094-e219-44b8-89f3-fe25fc288c00"
ii radosgw 14.2.11-pve1 amd64 REST gateway for RADOS distributed object store
ii ceph 14.2.11-pve1 amd64 distributed storage and file system
ii ceph-base 14.2.11-pve1 amd64 common ceph daemon libraries and management tools
ii ceph-common 14.2.11-pve1 amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fuse 14.2.11-pve1 amd64 FUSE-based client for the Ceph distributed file system
ii ceph-mds 14.2.11-pve1 amd64 metadata server for the ceph distributed file system
ii ceph-mgr 14.2.11-pve1 amd64 manager for the ceph distributed storage system
ii ceph-mon 14.2.11-pve1 amd64 monitor server for the ceph storage system
ii ceph-osd 14.2.11-pve1 amd64 OSD server for the ceph storage system
ii libcephfs2 14.2.11-pve1 amd64 Ceph distributed file system client library
Hello all,
We're trying to debug a "slow ops" situation on our cluster running Nautilus (latest version). Things were running smoothly for a while, but a few issues made things fall apart (possible clock skew, a faulty disk...).
- We've checked NTP and everything seems fine; the whole cluster shows no clock skew. The network config seems fine too (we're using jumbo frames throughout the cluster).
- We have multiple PGs that are in a "stuck peering" or "stuck inactive" state.
ceph health detail
HEALTH_WARN Reduced data availability: 1020 pgs inactive, 1008 pgs peering; Degraded data redundancy: 208352/95157861 objects degraded (0.219%), 9 pgs degraded, 9 pgs undersized; 2 pgs not deep-scrubbed in time; 2 pgs not scrubbed in time; 3 daemons have recently crashed; 1184 slow ops, oldest one blocked for 1792 sec, daemons [osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106,osd.107,osd.108,osd.109]... have slow ops.
PG_AVAILABILITY Reduced data availability: 1020 pgs inactive, 1008 pgs peering
pg 12.3cd is stuck inactive for 8939.938831, current state peering, last acting [111,75,53]
pg 12.3ce is stuck peering for 350761.931800, current state peering, last acting [48,103,76]
pg 12.3cf is stuck peering for 345518.349253, current state peering, last acting [80,46,116]
pg 12.3d0 is stuck peering for 396432.771388, current state peering, last acting [114,95,42]
pg 12.3d1 is stuck peering for 389771.820478, current state peering, last acting [33,99,122]
pg 12.3d2 is stuck peering for 16385.796714, current state peering, last acting [48,75,105]
pg 12.3d3 is stuck peering for 375090.876123, current state peering, last acting [53,118,90]
pg 12.3d4 is stuck peering for 350665.788611, current state peering, last acting [59,81,40]
pg 12.3d5 is stuck peering for 344195.934260, current state peering, last acting [104,73,87]
pg 12.3d6 is stuck peering for 388515.338772, current state peering, last acting [57,79,60]
pg 12.3d7 is stuck peering for 27320.368320, current state peering, last acting [35,56,109]
pg 12.3d8 is stuck peering for 345470.520103, current state peering, last acting [91,41,74]
pg 12.3d9 is stuck peering for 347582.613090, current state peering, last acting [85,66,103]
pg 12.3da is stuck peering for 346518.712024, current state peering, last acting [87,63,56]
pg 12.3db is stuck peering for 348804.986864, current state peering, last acting [100,122,46]
pg 12.3dc is stuck peering for 343796.439591, current state peering, last acting [55,90,125]
pg 12.3dd is stuck peering for 345621.663979, current state peering, last acting [83,38,125]
pg 12.3de is stuck peering for 348026.449482, current state peering, last acting [38,113,82]
pg 12.3df is stuck peering for 350263.925579, current state peering, last acting [41,104,87]
pg 12.3e0 is stuck peering for 8738.645205, current state peering, last acting [57,86,108]
pg 12.3e1 is stuck peering for 397082.568164, current state peering, last acting [124,46]
pg 12.3e2 is stuck peering for 345232.402459, current state peering, last acting [80,114,65]
pg 12.3e3 is stuck peering for 347014.276511, current state peering, last acting [63,102,83]
pg 12.3e4 is stuck peering for 345470.524144, current state peering, last acting [91,38,71]
pg 12.3e5 is stuck peering for 346636.837554, current state peering, last acting [64,85,118]
pg 12.3e6 is stuck peering for 398952.293609, current state peering, last acting [92,36,75]
pg 12.3e7 is stuck peering for 346973.264600, current state peering, last acting [31,94,53]
pg 12.3e8 is stuck peering for 370098.248268, current state peering, last acting [119,90,72]
pg 12.3e9 is stuck peering for 345134.069457, current state peering, last acting [96,105,36]
pg 12.3ea is stuck peering for 346305.043394, current state peering, last acting [94,103,51]
pg 12.3eb is stuck peering for 388515.112735, current state peering, last acting [57,116,59]
pg 12.3ec is stuck peering for 348097.249845, current state peering, last acting [56,111,84]
pg 12.3ed is stuck peering for 346636.835287, current state peering, last acting [64,106,101]
pg 12.3ee is stuck peering for 398197.856231, current state peering, last acting [53,105,80]
pg 12.3ef is stuck peering for 347061.858678, current state peering, last acting [47,64,80]
pg 12.3f0 is stuck peering for 371495.723196, current state peering, last acting [77,115,81]
pg 12.3f1 is stuck peering for 27539.717691, current state peering, last acting [123,69,48]
pg 12.3f2 is stuck peering for 346973.596729, current state peering, last acting [31,80,45]
pg 12.3f3 is stuck peering for 345419.834162, current state peering, last acting [108,89,40]
pg 12.3f4 is stuck peering for 347400.170304, current state peering, last acting [82,67,104]
pg 12.3f5 is stuck peering for 346793.349638, current state peering, last acting [116,51,68]
pg 12.3f6 is stuck peering for 372361.763947, current state peering, last acting [114,46,93]
pg 12.3f7 is stuck inactive for 346840.292765, current state activating, last acting [125,77,47]
pg 12.3f8 is stuck peering for 347004.967439, current state peering, last acting [42,31,116]
pg 12.3f9 is stuck peering for 346894.489185, current state peering, last acting [40,94,67]
pg 12.3fa is stuck peering for 395041.494033, current state peering, last acting [58,97,112]
pg 12.3fb is stuck peering for 346337.742759, current state peering, last acting [79,55,61]
pg 12.3fc is stuck peering for 347634.039502, current state peering, last acting [66,54,101]
pg 12.3fd is stuck peering for 345340.666831, current state peering, last acting [112,32,87]
pg 12.3fe is stuck peering for 345777.554974, current state peering, last acting [98,30,44]
pg 12.3ff is stuck peering for 18040.716533, current state peering, last acting [86,51,59]
PG_DEGRADED Degraded data redundancy: 208352/95157861 objects degraded (0.219%), 9 pgs degraded, 9 pgs undersized
pg 7.3b is stuck undersized for 1119305.639931, current state active+undersized+degraded, last acting [29,19]
pg 7.5a is stuck undersized for 351332.251298, current state active+undersized+degraded, last acting [29,11]
pg 8.b9 is stuck undersized for 351332.246585, current state active+undersized+degraded, last acting [22,17]
pg 8.c1 is stuck undersized for 351332.257178, current state active+undersized+degraded, last acting [24,14]
pg 8.db is stuck undersized for 350987.698147, current state active+undersized+degraded, last acting [24,7]
pg 8.1c2 is stuck undersized for 350999.603413, current state active+undersized+degraded, last acting [20,2]
pg 9.32 is stuck undersized for 351332.258240, current state active+undersized+degraded, last acting [21,11]
pg 9.a5 is stuck undersized for 351332.266130, current state active+undersized+degraded, last acting [25,14]
pg 9.df is stuck undersized for 351333.298597, current state active+undersized+degraded, last acting [25,19]
PG_NOT_DEEP_SCRUBBED 2 pgs not deep-scrubbed in time
pg 8.db not deep-scrubbed since 2020-10-24 07:09:12.599242
pg 7.3b not deep-scrubbed since 2020-10-24 14:10:59.877193
PG_NOT_SCRUBBED 2 pgs not scrubbed in time
pg 8.db not scrubbed since 2020-10-24 07:09:12.599242
pg 7.3b not scrubbed since 2020-10-24 14:10:59.877193
RECENT_CRASH 3 daemons have recently crashed
osd.70 crashed on host starfish-osd-05 at 2020-10-30 09:16:06.981832Z
osd.57 crashed on host starfish-osd-04 at 2020-11-06 10:07:47.868835Z
mds.starfish-mon-01 crashed on host starfish-mon-01 at 2020-11-02 18:36:25.266426Z
SLOW_OPS 1184 slow ops, oldest one blocked for 1792 sec, daemons [osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106,osd.107,osd.108,osd.109]... have slow ops.
- The 9 degraded/undersized PGs are on a different pool whose OSDs need to be reweighted; OSDs 1-29 are under another root in the crush map.
- When querying one of the PGs that are in a "stuck peering" state, there are a lot of ".handle_connect_reply_2 connect got BADAUTHORIZER" replies.
- The OSD logs show the following messages (they disappear for a while if the OSD is restarted):
2020-11-10 11:58:58.671 7f90430da700 0 auth: could not find secret_id=14160
2020-11-10 11:58:58.671 7f90430da700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=14160
2020-11-10 11:58:58.671 7f90430da700 0 --1- [v2:10.100.0.7:6851/483865,v1:10.100.0.7:6854/483865] >> v1:10.100.0.3:6815/8007973 conn(0x5632b1a64000 0x5632a64fb000 :6854 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: got bad authorizer, auth_reply_len=0
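Things we plan to double-check next, since the secret_id errors suggest daemons disagreeing about the current rotating cephx key (these keys are time-based, so this is usually clock-related), and a path-MTU mismatch can also present as stuck peering (the host/daemon names below are placeholders):
ceph time-sync-status                                    # the monitors' view of clock skew
ceph daemon osd.<id> config get auth_service_ticket_ttl  # rotating cephx key lifetime
ping -M do -s 8972 <other-osd-host>                      # verify 9000-byte MTU end-to-end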
Cheers!
Regards,
Shehzaad
Hi,
Yeah the negative pid is interesting. AFAICT we use a negative pid to
indicate that the lock was taken on another host:
https://github.com/torvalds/linux/blob/master/fs/ceph/locks.c#L119
https://github.com/torvalds/linux/commit/9d5b86ac13c573795525ecac6ed2db39ab…
"Finally, we convert remote filesystems to present remote pids using
negative numbers. Have lustre, 9p, ceph, cifs, and dlm negate the remote
pid returned for F_GETLK lock requests."
The good news is that my colleagues managed to clear this filelock by
restarting dovecot on a couple nodes.
But I'm still curious if others have a nice way to debug such things.
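One idea I have not tried yet: the MDS admin socket can dump the whole cache to a file on the MDS host, and the dump includes lock state, so something like this might expose the filelock holder (the path and inode number are examples):
ceph daemon mds.<name> dump cache /tmp/mdscache.txt
grep -A2 '<inode-number>' /tmp/mdscache.txt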
Cheers, Dan
On Mon, Nov 9, 2020 at 8:11 PM Anthony D'Atri <anthony.datri(a)gmail.com> wrote:
>
> Looks like a - in front of the 9605 — signed/unsigned int flern?
>
> > On Nov 9, 2020, at 4:59 AM, Dan van der Ster <dan(a)vanderster.com> wrote:
> >
> > Hi all,
> >
> > MDS version v14.2.11
> > Client kernel 3.10.0-1127.19.1.el7.x86_64
> >
> > We are seeing a strange issue with a dovecot use-case on cephfs.
> > Occasionally we have dovecot reporting a file locked, such as:
> >
> > Nov 09 13:55:00 dovecot-backend-00.cern.ch dovecot[27710]:
> > imap(reguero)<23945><fRA6B6yznq68uE28>: Error: Mailbox Deleted Items:
> > Timeout (180s) while waiting for lock for transaction log file
> > /mail/users/r/reguero//mdbox/mailboxes/Deleted
> > Items/dbox-Mails/dovecot.index.log (WRITE lock held by pid -9605)
> >
> > We checked all hosts that have mounted the cephfs -- there is no pid 9605.
> >
> > Is there any way to see who exactly created the lock? ceph_filelock
> > has a client id, but I didn't find a way to inspect the
> > cephfs_metadata to see the ceph_filelock directly.
> >
> > Otherwise, are other Dovecot/CephFS users seeing this? Did you switch
> > to flock or lockfile instead of fnctlk locks?
> >
> > Thanks!
> >
> > Dan
> >
> > P.S. here is the output from print locks tool from the kernel client:
> >
> > Read lock:
> > Type: 1 (0: Read, 1: Write, 2: Unlocked)
> > Whence: 0 (0: start, 1: current, 2: end)
> > Offset: 0
> > Len: 1
> > Pid: -9605
> > Write lock:
> > Type: 1 (0: Read, 1: Write, 2: Unlocked)
> > Whence: 0 (0: start, 1: current, 2: end)
> > Offset: 0
> > Len: 1
> > Pid: -9605
> >
> > and same file from a 15.2.5 fuse client :
> >
> > Read lock:
> > Type: 1 (0: Read, 1: Write, 2: Unlocked)
> > Whence: 0 (0: start, 1: current, 2: end)
> > Offset: 0
> > Len: 0
> > Pid: 0
> > Write lock:
> > Type: 1 (0: Read, 1: Write, 2: Unlocked)
> > Whence: 0 (0: start, 1: current, 2: end)
> > Offset: 0
> > Len: 0
> > Pid: 0
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
We had some network problems (high packet drops) affecting some CephFS
client nodes that run ceph-fuse (14.2.13) against a Nautilus cluster (on
version 14.2.8). As a result, a couple of clients got evicted (as one
would expect). What was really odd is that the clients were trying to
flush data they had in cache and kept getting rejected by OSD's for
almost an hour, and then magically the data flush worked. When asked
afterwards, the client reported that it was no longer blacklisted. How
would that happen? I certainly didn't run any commands to un-blacklist
a client and the docs say that otherwise the client will stay
blacklisted until the file system gets remounted.
Here is the status of the client when it was blacklisted:
[root@worker2033 ceph]# ceph daemon
/var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
{
"metadata": {
"ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
"ceph_version": "ceph version 14.2.13
(1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
"entity_id": "cephfs2",
"hostname": "worker2033",
"mount_point": "/mnt/ceph",
"pid": "7698",
"root": "/"
},
"dentry_count": 252,
"dentry_pinned_count": 9,
"id": 111995680,
"inst": {
"name": {
"type": "client",
"num": 111995680
},
"addr": {
"type": "v1",
"addr": "10.254.65.33:0",
"nonce": 410851087
}
},
"addr": {
"type": "v1",
"addr": "10.254.65.33:0",
"nonce": 410851087
},
"inst_str": "client.111995680 10.254.65.33:0/410851087",
"addr_str": "10.254.65.33:0/410851087",
"inode_count": 251,
"mds_epoch": 3376260,
"osd_epoch": 1717896,
"osd_epoch_barrier": 1717893,
"blacklisted": true
}
This corresponds to server side log messages:
2020-11-09 15:56:31.578 7fffe59a4700 1 mds.0.3376160 Evicting (and
blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.578 7fffe59a4700 0 log_channel(cluster) log [INF] :
Evicting (and blacklisting) client session 111995680
(10.254.65.33:0/410851087)
2020-11-09 15:56:31.706 7fffe59a4700 1 mds.0.3376160 Evicting (and
blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.706 7fffe59a4700 0 log_channel(cluster) log [INF] :
Evicting (and blacklisting) client session 111995680
(10.254.65.33:0/410851087)
and then some time later (perhaps half an hour or so) I got this from
the client:
[root@worker2033 ceph]# ceph daemon
/var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
{
"metadata": {
"ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
"ceph_version": "ceph version 14.2.13
(1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
"entity_id": "cephfs2",
"hostname": "worker2033",
"mount_point": "/mnt/ceph",
"pid": "7698",
"root": "/"
},
"dentry_count": 252,
"dentry_pinned_count": 9,
"id": 111995680,
"inst": {
"name": {
"type": "client",
"num": 111995680
},
"addr": {
"type": "v1",
"addr": "10.254.65.33:0",
"nonce": 410851087
}
},
"addr": {
"type": "v1",
"addr": "10.254.65.33:0",
"nonce": 410851087
},
"inst_str": "client.111995680 10.254.65.33:0/410851087",
"addr_str": "10.254.65.33:0/410851087",
"inode_count": 251,
"mds_epoch": 3376260,
"osd_epoch": 1717897,
"osd_epoch_barrier": 1717893,
"blacklisted": false
}
The cluster was otherwise healthy - nothing wrong with MDS's, or any
placement groups, etc. I also don't see any further log messages
regarding eviction/blacklisting in the MDS logs. I didn't run any ceph
commands that would change the state of the cluster - I was just looking
around, increasing log levels.
Any ideas how could that have happened?
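One thing I still need to check: OSD blacklist entries carry an expiry, and the stock option mon_osd_blacklist_default_expire defaults to one hour, which would roughly match the window we saw (both commands below are stock):
ceph osd blacklist ls                                 # current entries with expiry times
ceph config get mon mon_osd_blacklist_default_expire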
A separate problem (which perhaps needs a ticket filed) is that while the
ceph-fuse client was in a blacklisted state, it kept retrying in an
infinite loop to flush data to the OSDs and got rejected every time. I
have some logs for the details of this too.
Andras
Hi all,
We are planning a new pool to store our dataset using CephFS. The data are almost read-only (but not guaranteed to be) and consist of a lot of small files. Each node in our cluster has 1 x 1TB SSD and 2 x 6TB HDDs, and we will deploy about 10 such nodes. We aim for the highest read throughput.
If we just use a replicated pool of size 3 on SSD, we should get the best performance; however, that leaves us only 1/3 of the SSD space usable. And EC pools are not friendly to such a small-object read workload, I think.
Now I'm evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD. Normally, every read request would be directed to the SSD, so if every SSD OSD is up, I'd expect the same read throughput as an all-SSD deployment.
I've read the documentation and done some tests. Here is the crush rule I'm testing with:
rule mixed_replicated_rule {
id 3
type replicated
min_size 1
max_size 10
step take default class ssd
step chooseleaf firstn 1 type host
step emit
step take default class hdd
step chooseleaf firstn -1 type host
step emit
}
Now I have the following conclusions, but I’m not very sure:
* The first OSD produced by crush will be the primary OSD (at least if I don't change the "primary affinity"). So the above rule is guaranteed to map an SSD OSD as the primary in each PG, and every read request will be served from SSD if it is up.
* It is currently not possible to enforce that the SSD and HDD OSDs are chosen from different hosts. So, if I want to ensure data availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value 3, on the pool using the above crush rule.
Am I correct about the above statements? How does this work out in your experience? Thanks.
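For reference, the mapping can be sanity-checked offline with crushtool (rule id 3 matches the rule above; --num-rep 4 reflects the proposed pool size):
ceph osd getcrushmap -o crushmap.bin
# show which OSDs each PG would map to; host overlap between the SSD and HDD picks can then be checked by hand
crushtool -i crushmap.bin --test --rule 3 --num-rep 4 --show-mappings | head
# report any mappings where the rule failed to produce 4 OSDs
crushtool -i crushmap.bin --test --rule 3 --num-rep 4 --show-bad-mappings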