Hi all,
I just created a Ceph cluster to use CephFS, but after creating the CephFS pools and the filesystem, the cluster reports the filesystem errors below.
# ceph osd pool create cephfs_data 128
pool 'cephfs_data' created
# ceph osd pool create cephfs_metadata 128
pool 'cephfs_metadata' created
# ceph fs new cephfs cephfs_metadata cephfs_data
new fs with metadata pool 6 and data pool 5
# ceph -s
  cluster:
    id:     1c27def45-f0f9-494d-sfke-eb4323432fd
    health: HEALTH_ERR
            1 filesystem is offline
            1 filesystem is online with fewer MDS than max_mds

  services:
    mon: 2 daemons, quorum ceph-mon01,ceph-mon02
    mgr: ceph-adm01(active)
    mds: cephfs-0/0/1 up
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   2 pools, 256 pgs
    objects: 0 objects, 0 B
    usage:   12 GiB used, 588 GiB / 600 GiB avail
    pgs:     256 active+clean
But when I check max_mds for the filesystem, it is already set to 1:
# ceph fs get cephfs | grep max_mds
max_mds 1
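For completeness, the MDS daemon state can also be checked directly with standard commands (the "mds: cephfs-0/0/1 up" line above indicates no MDS daemon is up at all):
# ceph mds stat
# ceph fs status cephfs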
Does anyone know what I am missing here? Any input is much appreciated.
Regards,
Ram
Ceph-explorer..
Hi,
We are running a ceph cluster on Ubuntu 18.04 machines with ceph 14.2.4.
Our CephFS clients use the kernel module, and we have noticed that some of
them occasionally hang (it has happened at least once) after an MDS restart.
The only way to resolve this is to unmount and remount the mountpoint,
or to reboot the machine if unmounting is not possible.
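For illustration, the manual recovery on an affected client looks roughly like this (mountpoint, monitor address and credentials are placeholders):
# umount -f /mnt/cephfs     # sometimes only a lazy "umount -l" or a reboot works
# mount -t ceph <mon-host>:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret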
After some investigation, the problem seems to be that the MDS denies
reconnect attempts from some clients during restart even though the
reconnect interval is not yet reached. In particular, I see the following
log entries. Note that there are supposedly 9 sessions. 9 clients
reconnect (one client has two mountpoints) and then two more clients
reconnect after the MDS already logged "reconnect_done". These two
clients were hanging after the event. The kernel log of one of them is
shown below too.
Running `ceph tell mds.0 client ls` after the clients have been
rebooted/remounted also shows 11 clients instead of 9.
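The session count can be double-checked, and stale sessions evicted by hand if necessary; a sketch (the jq invocation and client id are illustrative):
# ceph tell mds.0 client ls | jq length
# ceph tell mds.0 client evict id=<client_id>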
Do you have any ideas what is wrong here and how it could be fixed? My
guess is that the MDS has an incorrect session count and therefore stops
the reconnect process too soon. Is this indeed a bug, and if so, do you
know what is broken?
Regardless, I also think that the kernel should be able to deal with a
denied reconnect and that it should try again later. Yet, even after
10 minutes, the kernel does not attempt to reconnect. Is this a known
issue or maybe fixed in newer kernels? If not, is there a chance to get
this fixed?
Thanks,
Florian
MDS log:
> 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.server reconnect_clients -- 9 sessions
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21019865 v1:10.1.7.22:0/3752632657 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21020717 v1:10.1.7.115:0/2841046616 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24171153 v1:10.1.7.243:0/1127767158 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.23978093 v1:10.1.4.71:0/824226283 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24209569 v1:10.1.4.157:0/1271865906 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190930 v1:10.1.4.240:0/3195698606 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190912 v1:10.1.4.146:0/852604154 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.59 reconnect_done
> 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1087700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.49:0/1483641729 conn(0x55af50053f80 0x55af50140800 :6801 s=OPENED pgs=21 cs=1 l=0).fault server, going to standby
> 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1888700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.140:0/3664284158 conn(0x55af50055600 0x55af50143000 :6801 s=OPENED pgs=8 cs=1 l=0).fault server, going to standby
Hanging client (10.1.67.49) kernel log:
> 2019-09-26T16:08:27.481676+02:00 hostnamefoo kernel: [708596.227148] ceph: mds0 reconnect start
> 2019-09-26T16:08:27.488943+02:00 hostnamefoo kernel: [708596.233145] ceph: mds0 reconnect denied
> 2019-09-26T16:16:17.541041+02:00 hostnamefoo kernel: [709066.287601] libceph: mds0 10.1.4.203:6801 socket closed (con state NEGOTIATING)
> 2019-09-26T16:16:18.068934+02:00 hostnamefoo kernel: [709066.813064] ceph: mds0 rejected session
> 2019-09-26T16:16:18.068955+02:00 hostnamefoo kernel: [709066.814843] ceph: get_quota_realm: ino (10000000008.fffffffffffffffe) null i_snap_realm
Hi,
I am running a nice Ceph cluster (Proxmox 4 / Debian 8 / Ceph 0.94.3) on
3 nodes (Supermicro X8DTT-HIBQF) with 2 OSDs each (2TB SATA hard disks),
interconnected via 40Gbit/s InfiniBand.
The problem is that Ceph performance is quite bad (approx. 30MiB/s
reading, 3-4MiB/s writing), so I thought about plugging a PCIe-to-NVMe/M.2
adapter into each node and installing SSDs. The idea is to get faster Ceph
storage and also some extra capacity.
The question is now which SSDs I should use. If I understand it correctly,
not every SSD is suitable for Ceph, as described at the links below:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-i…
or here:
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
fast SSD for Ceph. As the 950 is no longer available, I ordered a
Samsung 970 1TB for testing; unfortunately, the "EVO" instead of the PRO.
Before equipping all nodes with these SSDs, I did some tests with "fio"
as recommended, e.g. like this:
fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=journal-test
The results are as follows:
-----------------------
1) Samsung 970 EVO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec
Jobs: 4:
read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec
Jobs: 10:
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
-----------------------
So the read speed is impressive, but the write speed is really bad.
Therefore I ordered the Samsung 970 PRO (1TB), as it has faster NAND
chips (MLC instead of TLC). The results, however, are even worse for writing:
-----------------------
Samsung 970 PRO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec
Jobs: 4:
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec
Jobs: 10:
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
-----------------------
I did some research and found out that the "--sync" flag sets the O_DSYNC
flag, which apparently bypasses the SSD's volatile write cache and leads
to these horrid write speeds.
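As a sanity check, dropping the sync flag from the same fio invocation shows how much the cache matters; a sketch (job name made up, numbers will differ per device):
fio --filename=/dev/DEVICE --direct=1 --sync=0 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=cache-test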
It seems the write cache is only left enabled for SSDs that implement some
kind of power-loss protection (e.g. capacitors) that guarantees the cached
data reaches the flash in case of a power loss.
However, it seems hard to find out which SSDs actually have this
power-loss protection; moreover, such enterprise SSDs are crazy expensive
compared to the SSDs above, and it's unclear whether power-loss protection
is even available in the M.2 form factor. So building a 1 or 2 TB cluster
this way does not seem affordable/viable.
So, can anyone please give me hints on what to do? Is it possible to
ensure that the write cache is not disabled in some way (my servers are
in a data center, so there will probably never be a loss of power)?
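For what it's worth, the volatile write cache of an NVMe drive can at least be queried and toggled with nvme-cli, assuming the drive exposes the feature (device name is a placeholder; note that O_DSYNC writes still force a flush regardless):
# nvme get-feature /dev/nvme0 -f 0x06 -H    # feature 0x06 = volatile write cache
# nvme set-feature /dev/nvme0 -f 0x06 -v 1  # 1 = enable, 0 = disable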
Or is the link above already outdated, because newer Ceph releases deal
with this problem somehow? Or maybe a later Debian release (10) will
handle the O_DSYNC flag differently?
Perhaps I should simply invest in faster (and bigger) hard disks and
forget the SSD cluster idea?
Thank you in advance for any help,
Best Regards,
Hermann
--
hermann(a)qwer.tk
PGP/GPG: 299893C7 (on keyservers)
Hi everyone,
My Ceph version is 12.2.12. I wanted to set the required minimum
compatible client to luminous, so I used the command:
# ceph osd set-require-min-compat-client luminous
but Ceph reported:
Error EPERM: cannot set require_min_compat_client to luminous: 4 connected client(s) look like jewel (missing 0xa00000000200000); add --yes-i-really-mean-it to do it anyway
[root@node-1 ~]# ceph features
{
    "mon": {
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 3
        }
    },
    "osd": {
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 15
        }
    },
    "client": {
        "group": {
            "features": "0x40106b84a842a52",
            "release": "jewel",
            "num": 4
        },
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 168
        }
    }
}
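To identify the four jewel clients, the monitor sessions can be dumped on a mon host and grepped for the feature mask shown above (a sketch; "node-1" is just the local mon id here):
# ceph daemon mon.node-1 sessions | grep 0x40106b84a842a52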
So I ran the command:
[root@node-1 gyt]# ceph osd set-require-min-compat-client luminous
--yes-i-really-mean-it
set require_min_compat_client to luminous
But now I want to set the minimum compatible client back to jewel, so I used:
[root@node-1 gyt]# ceph osd set-require-min-compat-client jewel
Error EPERM: osdmap current utilizes features that require luminous;
cannot set require_min_compat_client below that to jewel
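My understanding is that the blocking luminous-only osdmap feature is usually pg-upmap (e.g. entries created by the balancer); if so, those entries would have to be removed before the requirement could be lowered again. A sketch, with a placeholder PG id:
# ceph osd dump | grep -E 'require_min_compat_client|upmap'
# ceph osd rm-pg-upmap-items <pgid>    # repeat per pg_upmap_items entry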
Is there a way to change it from luminous back to jewel?
Hi Folks
We are using Ceph as the storage backend of our 6-node Proxmox VM cluster. To monitor our systems we use Zabbix, and I would like to get some Ceph data into it so we get alarms when something goes wrong.
Ceph mgr has a module, "zabbix", that uses zabbix_sender to actively push data, but I cannot get the module working. It always responds with "failed to send data".
The network side seems to be fine:
root@vm-2:~# traceroute 192.168.15.253
traceroute to 192.168.15.253 (192.168.15.253), 30 hops max, 60 byte packets
1 192.168.15.253 (192.168.15.253) 0.411 ms 0.402 ms 0.393 ms
root@vm-2:~# nmap -p 10051 192.168.15.253
Starting Nmap 7.70 ( https://nmap.org ) at 2019-09-18 08:40 CEST
Nmap scan report for 192.168.15.253
Host is up (0.00026s latency).
PORT STATE SERVICE
10051/tcp open zabbix-trapper
MAC Address: BA:F5:30:EF:40:EF (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.61 seconds
root@vm-2:~# ceph zabbix config-show
{"zabbix_port": 10051, "zabbix_host": "192.168.15.253", "identifier": "VM-2", "zabbix_sender": "/usr/bin/zabbix_sender", "interval": 60}
root@vm-2:~#
But if I try "ceph zabbix send", I get "failed to send data to zabbix", and this shows up in the system journal:
Sep 18 08:41:13 vm-2 ceph-mgr[54445]: 2019-09-18 08:41:13.272 7fe360fe4700 -1 mgr.server reply reply (1) Operation not permitted
The log of ceph-mgr on that machine states:
2019-09-18 08:42:18.188 7fe359fd6700 0 mgr[zabbix] Exception when sending: /usr/bin/zabbix_sender exited non-zero: zabbix_sender [3253392]: DEBUG: answer [{"response":"success","info":"processed: 0; failed: 44; total: 44; seconds spent: 0.000179"}]
2019-09-18 08:43:18.217 7fe359fd6700 0 mgr[zabbix] Exception when sending: /usr/bin/zabbix_sender exited non-zero: zabbix_sender [3253629]: DEBUG: answer [{"response":"success","info":"processed: 0; failed: 44; total: 44; seconds spent: 0.000321"}]
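The same data path can be exercised outside the mgr module by calling zabbix_sender directly with the configured values (the item key "ceph.test" is made up; -vv prints the server's per-item response):
# /usr/bin/zabbix_sender -z 192.168.15.253 -p 10051 -s "VM-2" -k ceph.test -o 1 -vv
Note that "processed: 0; failed: 44" in the answers above suggests the data reaches the Zabbix server but every item is rejected there.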
I'm guessing this could have something to do with user rights, but I have no idea where to start tracking this down.
Maybe someone here has a hint?
If more information is needed, I will gladly provide it.
greetings
Ingo
We've run into a problem on our test cluster this afternoon, which is running Nautilus (14.2.2). It seems that any time PGs move on the cluster (from marking an OSD down, setting the primary-affinity to 0, or using the balancer), a large number of the OSDs peg the CPU cores they're running on for a while, which causes slow requests. From what I can tell, it appears to be related to slow peering caused by osd_pg_create() taking a long time.
This was seen on quite a few OSDs while waiting for peering to complete:
# ceph daemon osd.3 ops
{
    "ops": [
        {
            "description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
            "initiated_at": "2019-08-27 14:34:46.556413",
            "age": 318.25234538000001,
            "duration": 318.25241895300002,
            "type_data": {
                "flag_point": "started",
                "events": [
                    {
                        "time": "2019-08-27 14:34:46.556413",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-08-27 14:34:46.556413",
                        "event": "header_read"
                    },
                    {
                        "time": "2019-08-27 14:34:46.556299",
                        "event": "throttled"
                    },
                    {
                        "time": "2019-08-27 14:34:46.556456",
                        "event": "all_read"
                    },
                    {
                        "time": "2019-08-27 14:35:12.456901",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-08-27 14:35:12.456903",
                        "event": "wait for new map"
                    },
                    {
                        "time": "2019-08-27 14:40:01.292346",
                        "event": "started"
                    }
                ]
            }
        },
        ...snip...
        {
            "description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
            "initiated_at": "2019-08-27 14:35:09.908567",
            "age": 294.900191001,
            "duration": 294.90068416899999,
            "type_data": {
                "flag_point": "delayed",
                "events": [
                    {
                        "time": "2019-08-27 14:35:09.908567",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-08-27 14:35:09.908567",
                        "event": "header_read"
                    },
                    {
                        "time": "2019-08-27 14:35:09.908520",
                        "event": "throttled"
                    },
                    {
                        "time": "2019-08-27 14:35:09.908617",
                        "event": "all_read"
                    },
                    {
                        "time": "2019-08-27 14:35:12.456921",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-08-27 14:35:12.456923",
                        "event": "wait for new map"
                    }
                ]
            }
        }
    ],
    "num_ops": 6
}
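To see how widespread the stuck creates are, something like this can be run on each OSD host (a sketch; it iterates over the local admin sockets):
for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "$sock: $(ceph daemon $sock ops | grep -c osd_pg_create)"
done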
That "wait for new map" message made us think something was getting hung up on the monitors, so we restarted them all without any luck.
I'll keep investigating, but so far my Google searches aren't turning anything up, so I wanted to see if anyone else is running into this.
Thanks,
Bryan
Hi,
Currently running Mimic 13.2.5.
We had reports this morning of timeouts and failures with PUT and GET
requests to our Ceph RGW cluster. I found these messages in the RGW
log:
RGWReshardLock::lock failed to acquire lock on
bucket_name:bucket_instance ret=-16
NOTICE: resharding operation on bucket index detected, blocking
block_while_resharding ERROR: bucket is still resharding, please retry
These were preceded by many of the following, which I think are normal/expected:
check_bucket_shards: resharding needed: stats.num_objects=6415879
shard max_objects=6400000
Our RGW cluster sits behind haproxy, which notified me, approx. 90
seconds after the first 'resharding needed' message, that no backends
were available. It appears the dynamic reshard process caused the
RGWs to lock up for a period of time. Roughly two minutes later, the
reshard error messages stopped and operation returned to normal.
Looking back through previous RGW logs, I see a similar event from
about a week ago, on the same bucket. We have several buckets with
shard counts exceeding 1k (this one only has 128), and much larger
object counts, so clearly this isn't the first time dynamic sharding
has been invoked on this cluster.
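For reference, the reshard queue and per-bucket reshard status can be inspected with radosgw-admin, and dynamic resharding can be switched off as a stopgap (bucket name is a placeholder; the option goes in ceph.conf under the RGW client section):
# radosgw-admin reshard list
# radosgw-admin reshard status --bucket=<bucket_name>
# stopgap in ceph.conf: rgw_dynamic_resharding = false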
Has anyone seen this? I expect it will come up again, and I can turn up
debugging if that will help. Thanks for any assistance!
Josh
Hi,
I am trying to copy the contents of our storage server into a CephFS,
but am experiencing stability issues with my MDSs. The CephFS sits on
top of an erasure-coded pool with 5 MONs, 5 MDSs and a max_mds setting
of two. My Ceph cluster version is Nautilus, the client is Mimic and
uses the kernel module to mount the FS.
The index of filenames to copy is about 23GB and I am using 16 parallel
rsync processes over a 10G link to copy the files over to Ceph. This
works perfectly for a while, but then the MDSs start reporting oversized
caches (between 20 and 50GB, sometimes more) and an inode count between
1 and 4 million. The inode count in particular seems quite high to me:
each rsync job has 25k files to work with, so even if all 16 processes
opened all their files at the same time, I should not exceed 400k. Even
if I double this number to account for the client's page cache, I get
nowhere near that number of inodes (a sync flush takes about 1 second).
Then after a few hours, my MDSs start failing with messages like this:
-21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map
is_healthy 'MDSRank' had timed out after 15
-20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping
beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
heartbeat is not healthy!
The standby nodes try to take over, but take forever to become active
and eventually fail as well.
During my research, I found this related topic:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015959.html,
and I tried everything suggested there, from increasing to lowering my
cache size, the number of log segments, etc. I also played around with
the number of active MDSs; two appears to work best, whereas one cannot
keep up with the load and three seems to be the worst of all choices.
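For concreteness, the knobs I was experimenting with look roughly like this (values are examples, not recommendations; "cache status" runs on the MDS host):
# ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB
# ceph config set mds mds_log_max_segments 256
# ceph daemon mds.<name> cache status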
Do you have any ideas how I can improve the stability of my MDS daemons
so that they handle the load properly? A single 10G link is a toy and we
could hit the cluster with far more requests per second, but it is
already buckling under 16 rsync processes.
Thanks