Hello everybody,
we are running a multisite (active/active) gateway on 2 ceph clusters.
One production and one backup cluster.
Now we back up the master with rclone and no longer need the second gateway.
What is the best way to shut down the second gateway and remove the multisite sync from the master, without losing any data on the master site?
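From the docs, the sequence seems to be roughly the following (all stock radosgw-admin subcommands; the zone/zonegroup names are placeholders), but I would like confirmation that it is safe:
# with the secondary gateway stopped, on the master:
radosgw-admin zonegroup remove --rgw-zonegroup=<zonegroup> --rgw-zone=<secondary-zone>
radosgw-admin zone delete --rgw-zone=<secondary-zone>
radosgw-admin period update --commit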
greetings
Markus
Hi All,
The CephFS kernel client is affected by the kernel page cache when we write
data to it: outgoing traffic spikes when the OS starts flushing the page cache.
Is there a way to make the CephFS kernel client write data to the Ceph OSDs
smoothly when buffered I/O is used?
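So far the only lever I have found is the generic kernel writeback sysctls (these are standard Linux knobs, not CephFS-specific, and the values below are only illustrative):
sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at ~64MB of dirty data
sysctl -w vm.dirty_bytes=268435456             # throttle writers at ~256MB of dirty data
sysctl -w vm.dirty_writeback_centisecs=100     # wake the flusher threads every 1s instead of 5s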
I'm setting up a radosgw for my Ceph Octopus cluster. As soon as I
started the radosgw service, I noticed that it created a handful of new
pools. These pools were assigned the 'replicated_data' crush rule
automatically.
I have a mixed hdd/ssd/nvme cluster, and this 'replicated_data' crush
rule spans all device types. I would like radosgw to use a replicated
SSD pool and avoid the HDDs. What is the recommended way to change the
crush device class for these pools without risking the loss of any data
in the pools? I will note that I have not yet written any user data to
the pools. Everything in them was added by the radosgw process
automatically.
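Based on the docs, my tentative plan is the following (the rule name is mine; the pool names are examples of what radosgw typically creates), but I want to make sure it is safe:
# create a replicated rule restricted to the ssd device class
ceph osd crush rule create-replicated rgw-ssd default host ssd
# point each rgw pool at the new rule; data then rebalances onto SSDs, e.g.:
ceph osd pool set default.rgw.meta crush_rule rgw-ssd
ceph osd pool set default.rgw.log crush_rule rgw-ssd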
--Mike
Hi All,
We have recently deployed a new Ceph cluster (Octopus 15.2.4), which consists of:
12 OSD nodes (16 cores + 200GB RAM, 30x14TB disks, CentOS 8)
3 MON nodes (8 cores + 15GB RAM, CentOS 8)
We use an erasure-coded pool and RBD block devices.
3 Ceph clients use the RBD devices; each has 25 RBDs, and each RBD is
10TB. Each RBD is formatted with the EXT4 file system.
Cluster health is OK, and the hardware is new and in good shape.
All the machines have a 10Gbps (active/passive) bonded interface configured.
Read performance of the cluster is OK; however, writes are very slow.
On one of the RBDs we ran a perf test.
fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128
-rw=randread -runtime=60 -filename=/dev/rbd40
Run status group 0 (all jobs):
READ: bw=401MiB/s (420MB/s), 401MiB/s-401MiB/s (420MB/s-420MB/s),
io=23.5GiB (25.2GB), run=60054-60054msec
fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128
-rw=randwrite -runtime=60 -filename=/dev/rbd40
Run status group 0 (all jobs):
WRITE: bw=217KiB/s (222kB/s), 217KiB/s-217KiB/s (222kB/s-222kB/s),
io=13.2MiB (13.9MB), run=62430-62430msec
I see high I/O wait on the client.
Any suggestions/pointers to address this issue are really appreciated.
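For anyone trying to reproduce this, the EC pool can also be benchmarked directly with the stock rados bench tool, bypassing RBD and EXT4 (the pool name is a placeholder):
rados bench -p <ec-pool> 60 write -b 4096 -t 128 --no-cleanup   # 4k writes, 128 in flight
rados -p <ec-pool> cleanup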
Thanks and Regards,
Athreya
Hi,
I have a problem with RGW in a multisite configuration on Nautilus 14.2.11. Both zones have SSDs and a 10Gbps network. The master zone consists of 5x Dell R740XD servers (each with 256GB RAM, 8x800GB SSD for Ceph, 24 CPU cores). The secondary zone (temporary, for testing) consists of 3x HPE DL360 Gen10 servers (each with 256GB RAM, 6x800GB SSD, 48 CPU cores).
We have 17 test buckets with manual sharding (101 shards). Every bucket holds 10M small objects (10kB - 15kB). The zonegroup configuration is attached below. Replication of 150M objects from master to the secondary zone took almost 28 hours, and that replication completed successfully.
After deleting objects from one bucket in the master zone, it is no longer possible to sync the zones properly. I tried restarting both secondary RGWs, but without success. Sync status on the secondary zone is behind the master, and the number of objects in the buckets on the master zone differs from the secondary zone.
Ceph HEALTH status is WARNING on both zones. On the master zone I have 146 large objects found in pool 'prg2a-1.rgw.buckets.index' and 16 large objects found in pool 'prg2a-1.rgw.log'. On the secondary zone there are 88 large objects found in pool 'prg2a-2.rgw.log' and 1584 large objects found in pool 'prg2a-2.rgw.buckets.index'.
Average OSD latencies on the secondary zone during sync were "read 0.158ms, write 1.897ms, overwrite 1.634ms". After the unsuccessful sync (after ~12h of syncing, RGW requests, IOPS, and throughput all fall off), average OSD latencies jump to "read 125ms, write 30ms, overwrite 272ms". After stopping both RGWs on the secondary zone, average OSD latencies drop to almost 0ms, but when I start the RGWs on the secondary zone again, OSD latencies rise back to "read 125ms, write 30ms, overwrite 272ms" with spikes up to 3 seconds.
We have seen the same behaviour of Ceph multisite with a large number of objects in one bucket (150M+ objects), so we tried a different strategy with smaller buckets, but the results are the same.
I would appreciate any help or advice on how to tune or diagnose multisite problems.
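For reference, the sync state can be inspected per zone and per bucket with the stock radosgw-admin subcommands (the bucket name is a placeholder):
radosgw-admin sync status
radosgw-admin data sync status --source-zone=prg2a-1
radosgw-admin bucket sync status --bucket=<bucket>
radosgw-admin sync error list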
Does anyone else have any ideas? Is there anyone else with a similar use-case? I do not know what is wrong.
Thank you and best regards,
Miroslav
radosgw-admin zonegroup get
{
"id": "ac0005da-2e9f-4f38-835f-72b289c240d0",
"name": "prg2a",
"api_name": "prg2a",
"is_master": "true",
"endpoints": [
"http://s3.prg1a.sys.cz:80",
"http://s3.prg2a.sys.cz:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "d9ebbd1f-3312-4083-b4c2-843e1fb899ad",
"zones": [
{
"id": "d9ebbd1f-3312-4083-b4c2-843e1fb899ad",
"name": "prg2a-1",
"endpoints": [
"http://10.104.200.101:7480",
"http://10.104.200.102:7480"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": [],
"redirect_zone": ""
},
{
"id": "fdd76c02-c679-4ec7-8e7d-c14d2ac74fb4",
"name": "prg2a-2",
"endpoints": [
"http://10.104.200.221:7480",
"http://10.104.200.222:7480"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": [],
"redirect_zone": ""
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": [],
"storage_classes": [
"STANDARD"
]
}
],
"default_placement": "default-placement",
"realm_id": "cb831094-e219-44b8-89f3-fe25fc288c00"
ii radosgw 14.2.11-pve1 amd64 REST gateway for RADOS distributed object store
ii ceph 14.2.11-pve1 amd64 distributed storage and file system
ii ceph-base 14.2.11-pve1 amd64 common ceph daemon libraries and management tools
ii ceph-common 14.2.11-pve1 amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fuse 14.2.11-pve1 amd64 FUSE-based client for the Ceph distributed file system
ii ceph-mds 14.2.11-pve1 amd64 metadata server for the ceph distributed file system
ii ceph-mgr 14.2.11-pve1 amd64 manager for the ceph distributed storage system
ii ceph-mon 14.2.11-pve1 amd64 monitor server for the ceph storage system
ii ceph-osd 14.2.11-pve1 amd64 OSD server for the ceph storage system
ii libcephfs2 14.2.11-pve1 amd64 Ceph distributed file system client library
Hello all,
We're trying to debug a "slow ops" situation on our cluster running Nautilus (latest version). Things were running smoothly for a while, but a few issues made things fall apart (possible clock skew, a faulty disk...).
- We've checked NTP and everything seems fine; the whole cluster shows no clock skew. The network config seems fine too (we're using jumbo frames throughout the cluster).
- We have multiple PGs that are in a "stuck peering" or "stuck inactive" state.
ceph health detail
HEALTH_WARN Reduced data availability: 1020 pgs inactive, 1008 pgs peering; Degraded data redundancy: 208352/95157861 objects degraded (0.219%), 9 pgs degraded, 9 pgs undersized; 2 pgs not deep-scrubbed in time; 2 pgs not scrubbed in time; 3 daemons have recently crashed; 1184 slow ops, oldest one blocked for 1792 sec, daemons [osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106,osd.107,osd.108,osd.109]... have slow ops.
PG_AVAILABILITY Reduced data availability: 1020 pgs inactive, 1008 pgs peering
pg 12.3cd is stuck inactive for 8939.938831, current state peering, last acting [111,75,53]
pg 12.3ce is stuck peering for 350761.931800, current state peering, last acting [48,103,76]
pg 12.3cf is stuck peering for 345518.349253, current state peering, last acting [80,46,116]
pg 12.3d0 is stuck peering for 396432.771388, current state peering, last acting [114,95,42]
pg 12.3d1 is stuck peering for 389771.820478, current state peering, last acting [33,99,122]
pg 12.3d2 is stuck peering for 16385.796714, current state peering, last acting [48,75,105]
pg 12.3d3 is stuck peering for 375090.876123, current state peering, last acting [53,118,90]
pg 12.3d4 is stuck peering for 350665.788611, current state peering, last acting [59,81,40]
pg 12.3d5 is stuck peering for 344195.934260, current state peering, last acting [104,73,87]
pg 12.3d6 is stuck peering for 388515.338772, current state peering, last acting [57,79,60]
pg 12.3d7 is stuck peering for 27320.368320, current state peering, last acting [35,56,109]
pg 12.3d8 is stuck peering for 345470.520103, current state peering, last acting [91,41,74]
pg 12.3d9 is stuck peering for 347582.613090, current state peering, last acting [85,66,103]
pg 12.3da is stuck peering for 346518.712024, current state peering, last acting [87,63,56]
pg 12.3db is stuck peering for 348804.986864, current state peering, last acting [100,122,46]
pg 12.3dc is stuck peering for 343796.439591, current state peering, last acting [55,90,125]
pg 12.3dd is stuck peering for 345621.663979, current state peering, last acting [83,38,125]
pg 12.3de is stuck peering for 348026.449482, current state peering, last acting [38,113,82]
pg 12.3df is stuck peering for 350263.925579, current state peering, last acting [41,104,87]
pg 12.3e0 is stuck peering for 8738.645205, current state peering, last acting [57,86,108]
pg 12.3e1 is stuck peering for 397082.568164, current state peering, last acting [124,46]
pg 12.3e2 is stuck peering for 345232.402459, current state peering, last acting [80,114,65]
pg 12.3e3 is stuck peering for 347014.276511, current state peering, last acting [63,102,83]
pg 12.3e4 is stuck peering for 345470.524144, current state peering, last acting [91,38,71]
pg 12.3e5 is stuck peering for 346636.837554, current state peering, last acting [64,85,118]
pg 12.3e6 is stuck peering for 398952.293609, current state peering, last acting [92,36,75]
pg 12.3e7 is stuck peering for 346973.264600, current state peering, last acting [31,94,53]
pg 12.3e8 is stuck peering for 370098.248268, current state peering, last acting [119,90,72]
pg 12.3e9 is stuck peering for 345134.069457, current state peering, last acting [96,105,36]
pg 12.3ea is stuck peering for 346305.043394, current state peering, last acting [94,103,51]
pg 12.3eb is stuck peering for 388515.112735, current state peering, last acting [57,116,59]
pg 12.3ec is stuck peering for 348097.249845, current state peering, last acting [56,111,84]
pg 12.3ed is stuck peering for 346636.835287, current state peering, last acting [64,106,101]
pg 12.3ee is stuck peering for 398197.856231, current state peering, last acting [53,105,80]
pg 12.3ef is stuck peering for 347061.858678, current state peering, last acting [47,64,80]
pg 12.3f0 is stuck peering for 371495.723196, current state peering, last acting [77,115,81]
pg 12.3f1 is stuck peering for 27539.717691, current state peering, last acting [123,69,48]
pg 12.3f2 is stuck peering for 346973.596729, current state peering, last acting [31,80,45]
pg 12.3f3 is stuck peering for 345419.834162, current state peering, last acting [108,89,40]
pg 12.3f4 is stuck peering for 347400.170304, current state peering, last acting [82,67,104]
pg 12.3f5 is stuck peering for 346793.349638, current state peering, last acting [116,51,68]
pg 12.3f6 is stuck peering for 372361.763947, current state peering, last acting [114,46,93]
pg 12.3f7 is stuck inactive for 346840.292765, current state activating, last acting [125,77,47]
pg 12.3f8 is stuck peering for 347004.967439, current state peering, last acting [42,31,116]
pg 12.3f9 is stuck peering for 346894.489185, current state peering, last acting [40,94,67]
pg 12.3fa is stuck peering for 395041.494033, current state peering, last acting [58,97,112]
pg 12.3fb is stuck peering for 346337.742759, current state peering, last acting [79,55,61]
pg 12.3fc is stuck peering for 347634.039502, current state peering, last acting [66,54,101]
pg 12.3fd is stuck peering for 345340.666831, current state peering, last acting [112,32,87]
pg 12.3fe is stuck peering for 345777.554974, current state peering, last acting [98,30,44]
pg 12.3ff is stuck peering for 18040.716533, current state peering, last acting [86,51,59]
PG_DEGRADED Degraded data redundancy: 208352/95157861 objects degraded (0.219%), 9 pgs degraded, 9 pgs undersized
pg 7.3b is stuck undersized for 1119305.639931, current state active+undersized+degraded, last acting [29,19]
pg 7.5a is stuck undersized for 351332.251298, current state active+undersized+degraded, last acting [29,11]
pg 8.b9 is stuck undersized for 351332.246585, current state active+undersized+degraded, last acting [22,17]
pg 8.c1 is stuck undersized for 351332.257178, current state active+undersized+degraded, last acting [24,14]
pg 8.db is stuck undersized for 350987.698147, current state active+undersized+degraded, last acting [24,7]
pg 8.1c2 is stuck undersized for 350999.603413, current state active+undersized+degraded, last acting [20,2]
pg 9.32 is stuck undersized for 351332.258240, current state active+undersized+degraded, last acting [21,11]
pg 9.a5 is stuck undersized for 351332.266130, current state active+undersized+degraded, last acting [25,14]
pg 9.df is stuck undersized for 351333.298597, current state active+undersized+degraded, last acting [25,19]
PG_NOT_DEEP_SCRUBBED 2 pgs not deep-scrubbed in time
pg 8.db not deep-scrubbed since 2020-10-24 07:09:12.599242
pg 7.3b not deep-scrubbed since 2020-10-24 14:10:59.877193
PG_NOT_SCRUBBED 2 pgs not scrubbed in time
pg 8.db not scrubbed since 2020-10-24 07:09:12.599242
pg 7.3b not scrubbed since 2020-10-24 14:10:59.877193
RECENT_CRASH 3 daemons have recently crashed
osd.70 crashed on host starfish-osd-05 at 2020-10-30 09:16:06.981832Z
osd.57 crashed on host starfish-osd-04 at 2020-11-06 10:07:47.868835Z
mds.starfish-mon-01 crashed on host starfish-mon-01 at 2020-11-02 18:36:25.266426Z
SLOW_OPS 1184 slow ops, oldest one blocked for 1792 sec, daemons [osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106,osd.107,osd.108,osd.109]... have slow ops.
- The 9 degraded/undersized PGs are on a different pool whose OSDs need to be reweighted; OSDs 1-29 are under another root in the crush map.
- When querying one of the PGs that are in a "stuck peering" state, there are a lot of ".handle_connect_reply_2 connect got BADAUTHORIZER" replies.
- The OSD logs show the following messages (they disappear for a while if the OSD is restarted):
2020-11-10 11:58:58.671 7f90430da700 0 auth: could not find secret_id=14160
2020-11-10 11:58:58.671 7f90430da700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=14160
2020-11-10 11:58:58.671 7f90430da700 0 --1- [v2:10.100.0.7:6851/483865,v1:10.100.0.7:6854/483865] >> v1:10.100.0.3:6815/8007973 conn(0x5632b1a64000 0x5632a64fb000 :6854 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: got bad authorizer, auth_reply_len=0
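Things we plan to double-check next, since the secret_id errors suggest daemons disagreeing about the current rotating cephx key (these keys are time-based, so this is usually clock-related), and a path-MTU mismatch can also present as stuck peering (the host/daemon names below are placeholders):
ceph time-sync-status                                    # the monitors' view of clock skew
ceph daemon osd.<id> config get auth_service_ticket_ttl  # rotating cephx key lifetime
ping -M do -s 8972 <other-osd-host>                      # verify 9000-byte MTU end-to-end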
Cheers!
Regards,
Shehzaad
Hi,
Yeah the negative pid is interesting. AFAICT we use a negative pid to
indicate that the lock was taken on another host:
https://github.com/torvalds/linux/blob/master/fs/ceph/locks.c#L119
https://github.com/torvalds/linux/commit/9d5b86ac13c573795525ecac6ed2db39ab…
"Finally, we convert remote filesystems to present remote pids using
negative numbers. Have lustre, 9p, ceph, cifs, and dlm negate the remote
pid returned for F_GETLK lock requests."
The good news is that my colleagues managed to clear this filelock by
restarting dovecot on a couple nodes.
But I'm still curious if others have a nice way to debug such things.
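One idea I have not tried yet: the MDS admin socket can dump the whole cache to a file on the MDS host, and the dump includes lock state, so something like this might expose the filelock holder (the path and inode number are examples):
ceph daemon mds.<name> dump cache /tmp/mdscache.txt
grep -A2 '<inode-number>' /tmp/mdscache.txt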
Cheers, Dan
On Mon, Nov 9, 2020 at 8:11 PM Anthony D'Atri <anthony.datri(a)gmail.com> wrote:
>
> Looks like a - in front of the 9605 — signed/unsigned int flern?
>
> > On Nov 9, 2020, at 4:59 AM, Dan van der Ster <dan(a)vanderster.com> wrote:
> >
> > Hi all,
> >
> > MDS version v14.2.11
> > Client kernel 3.10.0-1127.19.1.el7.x86_64
> >
> > We are seeing a strange issue with a dovecot use-case on cephfs.
> > Occasionally we have dovecot reporting a file locked, such as:
> >
> > Nov 09 13:55:00 dovecot-backend-00.cern.ch dovecot[27710]:
> > imap(reguero)<23945><fRA6B6yznq68uE28>: Error: Mailbox Deleted Items:
> > Timeout (180s) while waiting for lock for transaction log file
> > /mail/users/r/reguero//mdbox/mailboxes/Deleted
> > Items/dbox-Mails/dovecot.index.log (WRITE lock held by pid -9605)
> >
> > We checked all hosts that have mounted the cephfs -- there is no pid 9605.
> >
> > Is there any way to see who exactly created the lock? ceph_filelock
> > has a client id, but I didn't find a way to inspect the
> > cephfs_metadata to see the ceph_filelock directly.
> >
> > Otherwise, are other Dovecot/CephFS users seeing this? Did you switch
> > to flock or lockfile instead of fnctlk locks?
> >
> > Thanks!
> >
> > Dan
> >
> > P.S. here is the output from print locks tool from the kernel client:
> >
> > Read lock:
> > Type: 1 (0: Read, 1: Write, 2: Unlocked)
> > Whence: 0 (0: start, 1: current, 2: end)
> > Offset: 0
> > Len: 1
> > Pid: -9605
> > Write lock:
> > Type: 1 (0: Read, 1: Write, 2: Unlocked)
> > Whence: 0 (0: start, 1: current, 2: end)
> > Offset: 0
> > Len: 1
> > Pid: -9605
> >
> > and same file from a 15.2.5 fuse client :
> >
> > Read lock:
> > Type: 1 (0: Read, 1: Write, 2: Unlocked)
> > Whence: 0 (0: start, 1: current, 2: end)
> > Offset: 0
> > Len: 0
> > Pid: 0
> > Write lock:
> > Type: 1 (0: Read, 1: Write, 2: Unlocked)
> > Whence: 0 (0: start, 1: current, 2: end)
> > Offset: 0
> > Len: 0
> > Pid: 0
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
We had some network problems (high packet drops) affecting some CephFS
client nodes that run ceph-fuse (14.2.13) against a Nautilus cluster (on
version 14.2.8). As a result, a couple of clients got evicted (as one
would expect). What was really odd is that the clients were trying to
flush data they had in cache and kept getting rejected by OSD's for
almost an hour, and then magically the data flush worked. When asked
afterwards, the client reported that it was no longer blacklisted. How
would that happen? I certainly didn't run any commands to un-blacklist
a client and the docs say that otherwise the client will stay
blacklisted until the file system gets remounted.
Here is the status of the client when it was blacklisted:
[root@worker2033 ceph]# ceph daemon
/var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
{
"metadata": {
"ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
"ceph_version": "ceph version 14.2.13
(1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
"entity_id": "cephfs2",
"hostname": "worker2033",
"mount_point": "/mnt/ceph",
"pid": "7698",
"root": "/"
},
"dentry_count": 252,
"dentry_pinned_count": 9,
"id": 111995680,
"inst": {
"name": {
"type": "client",
"num": 111995680
},
"addr": {
"type": "v1",
"addr": "10.254.65.33:0",
"nonce": 410851087
}
},
"addr": {
"type": "v1",
"addr": "10.254.65.33:0",
"nonce": 410851087
},
"inst_str": "client.111995680 10.254.65.33:0/410851087",
"addr_str": "10.254.65.33:0/410851087",
"inode_count": 251,
"mds_epoch": 3376260,
"osd_epoch": 1717896,
"osd_epoch_barrier": 1717893,
"blacklisted": true
}
This corresponds to server side log messages:
2020-11-09 15:56:31.578 7fffe59a4700 1 mds.0.3376160 Evicting (and
blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.578 7fffe59a4700 0 log_channel(cluster) log [INF] :
Evicting (and blacklisting) client session 111995680
(10.254.65.33:0/410851087)
2020-11-09 15:56:31.706 7fffe59a4700 1 mds.0.3376160 Evicting (and
blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.706 7fffe59a4700 0 log_channel(cluster) log [INF] :
Evicting (and blacklisting) client session 111995680
(10.254.65.33:0/410851087)
and then some time later (perhaps half an hour or so) I got this from
the client:
[root@worker2033 ceph]# ceph daemon
/var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
{
"metadata": {
"ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
"ceph_version": "ceph version 14.2.13
(1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
"entity_id": "cephfs2",
"hostname": "worker2033",
"mount_point": "/mnt/ceph",
"pid": "7698",
"root": "/"
},
"dentry_count": 252,
"dentry_pinned_count": 9,
"id": 111995680,
"inst": {
"name": {
"type": "client",
"num": 111995680
},
"addr": {
"type": "v1",
"addr": "10.254.65.33:0",
"nonce": 410851087
}
},
"addr": {
"type": "v1",
"addr": "10.254.65.33:0",
"nonce": 410851087
},
"inst_str": "client.111995680 10.254.65.33:0/410851087",
"addr_str": "10.254.65.33:0/410851087",
"inode_count": 251,
"mds_epoch": 3376260,
"osd_epoch": 1717897,
"osd_epoch_barrier": 1717893,
"blacklisted": false
}
The cluster was otherwise healthy - nothing wrong with MDS's, or any
placement groups, etc. I also don't see any further log messages
regarding eviction/blacklisting in the MDS logs. I didn't run any ceph
commands that would change the state of the cluster - I was just looking
around, increasing log levels.
Any ideas how could that have happened?
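One thing I still need to check: OSD blacklist entries carry an expiry, and the stock option mon_osd_blacklist_default_expire defaults to one hour, which would roughly match the window we saw (both commands below are stock):
ceph osd blacklist ls                                 # current entries with expiry times
ceph config get mon mon_osd_blacklist_default_expire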
A separate problem (which perhaps needs a ticket filed) is that while the
ceph-fuse client was in a blacklisted state, it kept retrying in an
infinite loop to flush data to the OSDs and got rejected every time. I
have some logs for the details of this too.
Andras
Hi all,
We are planning a new pool to store our dataset using CephFS. The data are almost read-only (but not guaranteed to be) and consist of a lot of small files. Each node in our cluster has 1 x 1TB SSD and 2 x 6TB HDDs, and we will deploy about 10 such nodes. We aim for the highest read throughput.
If we just use a replicated pool of size 3 on SSD, we should get the best performance; however, that leaves us only 1/3 of the SSD space usable. And EC pools are not friendly to such a small-object read workload, I think.
Now I'm evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD. Normally, every read request would be directed to the SSD, so if every SSD OSD is up, I'd expect the same read throughput as an all-SSD deployment.
I've read the documentation and done some tests. Here is the crush rule I'm testing with:
rule mixed_replicated_rule {
id 3
type replicated
min_size 1
max_size 10
step take default class ssd
step chooseleaf firstn 1 type host
step emit
step take default class hdd
step chooseleaf firstn -1 type host
step emit
}
Now I have the following conclusions, but I’m not very sure:
* The first OSD produced by crush will be the primary OSD (at least if I don't change the "primary affinity"). So the above rule is guaranteed to map an SSD OSD as the primary in each PG, and every read request will be served from SSD if it is up.
* It is currently not possible to enforce that the SSD and HDD OSDs are chosen from different hosts. So, if I want to ensure data availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value 3, on the pool using the above crush rule.
Am I correct about the above statements? How does this work out in your experience? Thanks.
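For reference, the mapping can be sanity-checked offline with crushtool (rule id 3 matches the rule above; --num-rep 4 reflects the proposed pool size):
ceph osd getcrushmap -o crushmap.bin
# show which OSDs each PG would map to; host overlap between the SSD and HDD picks can then be checked by hand
crushtool -i crushmap.bin --test --rule 3 --num-rep 4 --show-mappings | head
# report any mappings where the rule failed to produce 4 OSDs
crushtool -i crushmap.bin --test --rule 3 --num-rep 4 --show-bad-mappings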