Hi guys,
I have a Mimic cluster with only one RGW machine. My setup is simple -
one realm, one zonegroup, one zone. How can I safely add a second RGW
server to the same zone?
Is it safe to just run "ceph-deploy rgw create" for the second server
without impacting the existing metadata pools? What about the existing
S3/Swift users - they should be available to the second RGW from the
current pools, right?
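For reference, this is all I'm planning to run from the admin node (just a sketch; "rgw02" is a made-up hostname for the new gateway):
$ ceph-deploy rgw create rgw02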
My biggest concern is that the second RGW server will try to recreate
some internal pools when going online so I just want to double-check
that I will not mess up the current setup when adding the second instance :)
Thanks.
Hi,
There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.
One thing where Ceph isn't the fastest is 4k blocks written at queue
depth 1.
Some applications benefit very much from high performance/low latency
I/O at qd=1, for example Single Threaded applications which are writing
small files inside a VM running on RBD.
With some tuning you can get to a ~700us latency for a 4k write with
qd=1 (Replication, size=3)
I benchmark this using fio:
$ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
A 700us latency means the result will be about ~1400 IOps (1000 / 0.7).
Compared to, let's say, a BSD machine running ZFS, that is on the low
side. With ZFS+NVMe you'll be able to reach somewhere between
7,000 and 10,000 IOps; the latency is simply much lower.
My benchmarking / test setup for this:
- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3.84TB
- 10Gbit Base-T networking
Things to configure/tune (rough commands sketched below):
- C-State pinning to 1
- CPU governor to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
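Roughly how I apply these on the OSD nodes (a sketch, not a recipe: C-state
pinning can also be done in the BIOS or via the processor.max_cstate=1
kernel parameter, and tool names differ per distro):
$ cpupower frequency-set -g performance
$ ceph config set osd debug_osd 0/0
$ ceph config set osd debug_ms 0/0
$ ceph config set osd debug_bluestore 0/0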
Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
latency, and going towards 25Gbit/100Gbit might help as well.
These are, however, only very small increments and might reduce the
latency by another 15% or so.
It doesn't bring us anywhere near the 10k IOps other applications can do.
And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.
The Crimson project [0] is aiming to lower the latency with many things
like DPDK and SPDK, but this is far from finished and production ready.
In the meantime, am I overlooking something here? Can we reduce the
latency of the current OSDs further?
Reaching a ~500us latency would already be great!
Thanks,
Wido
[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
Hello all!
We have a cluster where there are HDDs for data and NVMEs for journals and
indexes. We recently added pure SSD hosts, and created a storage class SSD.
To do this, we created a default.rgw.hot.data pool, associated a CRUSH rule
using SSD, and created a HOT storage class in the placement target.
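For reference, this is roughly how we set it up (a sketch; the CRUSH rule
name "ssd-rule" is just our name, and we assume the default zonegroup/zone
with the default-placement target):
# ceph osd crush rule create-replicated ssd-rule default host ssd
# ceph osd pool create default.rgw.hot.data 64 64 replicated ssd-rule
# radosgw-admin zonegroup placement add --rgw-zonegroup default --placement-id default-placement --storage-class HOT
# radosgw-admin zone placement add --rgw-zone default --placement-id default-placement --storage-class HOT --data-pool default.rgw.hot.data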
The problem is that when we send an object using the HOT storage class, it
shows up in both the STANDARD storage class pool and the HOT pool.
STANDARD pool:
# rados -p default.rgw.buckets.data ls
d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
# rados -p default.rgw.buckets.data stat
d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
default.rgw.buckets.data/d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
mtime 2021-02-09 14:54:14.000000, size 0
HOT pool:
# rados -p default.rgw.hot.data ls
d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
# rados -p default.rgw.hot.data stat
d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
default.rgw.hot.data/d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
mtime 2021-02-09 14:54:14.000000, size 15220
The object data itself is in the HOT pool; however, another object, similar
to an index, is created in the STANDARD pool. Monitoring with iostat, we
noticed that this behavior generates unnecessary IO on disks that should
not need to be touched.
Why this behavior? Are there any ways around it?
Thanks, Marcelo
I would say production should have 5 MON servers
From: huxiaoyu(a)horebdata.cn <huxiaoyu(a)horebdata.cn>
Date: Friday, February 12, 2021 at 7:59 AM
To: Marc <Marc(a)f1-outsourcing.eu>, Michal Strnad <michal.strnad(a)cesnet.cz>, ceph-users <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: Backups of monitor
Normally any production Ceph cluster will have at least 3 MONs; does it really need a backup of the MONs?
samuel
huxiaoyu(a)horebdata.cn
From: Marc
Date: 2021-02-12 14:36
To: Michal Strnad; ceph-users(a)ceph.io
Subject: [ceph-users] Re: Backups of monitor
So why not create an extra monitor, start it only when you want to make a backup, wait until it is up to date, stop it, and then back it up?
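Roughly like this (a sketch; it assumes a non-containerized deployment, a spare monitor with id "backup" that is normally kept stopped, and the default data path):
# systemctl start ceph-mon@backup
# ceph quorum_status          (wait until the extra mon shows up in the quorum)
# systemctl stop ceph-mon@backup
# tar czf mon-backup.tgz /var/lib/ceph/mon/ceph-backup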
> -----Original Message-----
> From: Michal Strnad <michal.strnad(a)cesnet.cz>
> Sent: 11 February 2021 21:15
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Backups of monitor
>
> Hi all,
>
> We are looking for a proper solution for backing up the monitors (all the
> maps they hold). On the internet we found advice to stop one of the
> monitors, back it up (dump), and start the daemon again. But this is not
> the right approach due to the risk of losing quorum and the need for
> synchronization after the monitor is back online.
>
> Our goal is to have at least some (recent) metadata about the objects in
> the cluster as a last resort for when all monitors are in a very bad
> shape/state and we cannot start any of them. Maybe there is another
> approach, but we are not aware of it.
>
> We are running the latest nautilus and three monitors on every cluster.
>
> NB: We don't want to use more than three monitors.
>
>
> Thank you
> Cheers
> Michal
> --
> Michal Strnad
>
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi Troels,
1) It seems you need to set up the user id like this:
ceph dashboard set-rgw-api-user-id <user_id>
More info here:
https://docs.ceph.com/en/nautilus/mgr/dashboard/#enabling-the-object-gatewa…
2) Have you set up a multisite configuration (realms/zonegroups/zones)?
Please paste the output of:
radosgw-admin realm list
radosgw-admin zonegroup list
radosgw-admin zone list
Regards,
--
Alfonso Martínez
Senior Software Engineer, Ceph Storage
Red Hat <https://www.redhat.com>
Dear cephers,
I believe we are facing a bottleneck due to an inappropriate overall network design and would like to hear about experience and recommendations. I start with a description of the urgent problem/question and follow up with more details/questions.
These observations are on our HPC home file system served with ceph. It has 12 storage servers facing 550+ client servers.
Under high load, I start seeing "slow ping time" warnings with quite incredible latencies. I suspect we have a network bottleneck. On the storage servers we have 6x10G LACP trunks. Clients are on single 10G NICs. We have separate VLANs for front- and back network, but they both go through all NICs in the same way, so, technically, it's just one cluster network shared with clients. The aggregated bandwidth is sufficient for a single-node storage server load (it roughly matches the disk controller IO capacity). However, point-to-point connections are 10G only, and I believe we are starting to observe clients saturating a 10G link and starving all other ceph cluster traffic that needs to go through this link as well. This, in turn, leads to backlog effects with slow ops on unrelated OSDs, affecting overall user experience. The number of OSDs reporting slow ping times is about the percentage one would expect if one or two 10G links are congested. It's usually just one storage server that coughs up.
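For completeness, the per-OSD heartbeat ping times behind these warnings can be dumped via the admin socket (I believe the threshold argument is optional; 0 dumps everything):
# ceph daemon osd.0 dump_osd_network 0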
I guess the users with aggressive workloads getting the full bandwidth are happy, but everyone else is complaining. What I observe is that one or two clients can DOS everyone else. I typically see a very high read bandwidth from a few OSDs only, and my suspicion is that this is a large job of 50-100 nodes starting the same application at the same time. For example, 50-100 clients reading the same executable simultaneously. I see 5-6GB/s and up to 10k IOPS read, which is really good in principle. Except that it is not fair-shared with other users.
Question: I am starting to consider enabling QOS on the switches for traffic between storage servers and would like to know if anyone is doing this and what the experience is. Unfortunately, our network design is probably flawed and makes this difficult now; see below.
More Info.
Our FS data pool is EC 8+2. I have fast-read enabled. Hence, the network traffic amplification for both read and write is quite substantial.
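For reference, fast-read here refers to the pool flag, set roughly like this (the pool name below is just an example, not our actual pool name):
# ceph osd pool set cephfs_data fast_read true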
Our network is a spine-leaf architecture where ceph servers and ceph clients are distributed more or less equally over the leaf switches. I'm afraid that this is a first flaw in the design, because storage servers and clients compete for the same switches and the clients greatly outnumber the storage servers. It also makes implementing QOS a real pain while it could be just traffic shaping on an uplink trunk to clients if the storage servers were isolated.
This is the first design question: Isolated storage cluster providing service via uplinks/gateways versus "integrated/hyper-converged" where storage servers and clients are distributed equally over a spine-leaf architecture. Pros and cons?
We have a 100G spine VLT-pair with ports configured as 40G. Up-links from leafs are 2x40, in fact, we have these leafs configured as VLT-pairs for HA as well. A pair has 2x2x40G uplinks and 2x40G VLT interlinks. There are 2 ceph servers per VLT leaf-pair and ca. 85+ client servers on the same pair. There are also clients on leaf switches without ceph servers. I don't think the 40G uplinks are congested, but you never know.
We started with the ceph servers having 15 HDDs for fs data and 1 SSD for fs metadata each. With this configuration, the disk speed was the bottleneck and I observed slow ops under high load, but everything was more or less stable. I recently changed an MDS setting that greatly improved both client performance and the clients' ability to overload OSDs. In addition, one week ago I added 20 HDDs in a JBOD per host, which more than doubled the HDD throughput. Together, these two performance increases now have the counter-intuitive effect that aggregated performance has tripled compared to 2 months ago, but the user experience is very erratic. My suspicion is, as explained above, that each server can now handle a volume of traffic that easily saturates a 10G link, leading to observations that seem to indicate insufficient network capacity whenever too many client/cluster requests go through the same 10G link.
In essence, we increased aggregated performance greatly but users complain more than ever.
I suspect that this imbalance of server throughput ability and 10G point-to-point limitation is a problem. However, I cannot change the networking and would like some advice of how similar set-ups are configured and if QOS can help. My idea is to enable dot1p layer 2 QOS and give traffic coming from ports with storage servers connected a higher priority than traffic coming from everywhere else. I know it would be a lot simpler if the storage cluster was isolated, but I have to deal with the situation as is for now. Any advice and experience is highly appreciated.
If I do it, should I do QOS on both the front and back networks, or is QOS on the VLAN for the back network enough? Note that MONs are only on the front network.
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello fellow Ceph users,
currently we are updating our Ceph cluster (14.2.16) and making changes to
some config settings.
TL;DR: is there a way to do a graceful shutdown of an active MDS node
without losing the caps, open files and client connections? Something like
handing over the active state, promoting a standby to active, ...?
Sadly we ran into some difficulties when restarting MDS nodes. While we
had two active nodes and one standby, we initially thought that restarting
the active rank would result in a nice handover ... sadly we saw the node
going through the states replay-reconnect-rejoin-active, as nicely
visualized here:
https://docs.ceph.com/en/latest/cephfs/mds-states/
This left some nodes running into timeouts until the standby node had
reached the active state again, most probably because the CephFS already
has some 600k folders and 3M files; from the client side it took more
than 30s.
So before the next MDS restart, the FS config was changed to one active and
one standby-replay node; the idea was that since the standby-replay node
follows the active one, the handover would be smoother. The active state
was reached faster, but we still noticed some hiccups on the clients
while the new active MDS was waiting for clients to reconnect (state
up:reconnect) after the failover.
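For context, this is roughly the configuration change that was made (a
sketch; assuming our filesystem is named "cephfs"):
# ceph fs set cephfs max_mds 1
# ceph fs set cephfs allow_standby_replay true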
The next idea was to do a manual node promotion, graceful shutdown or
something similar - where the open caps and sessions would be handed
over ... but I did not find any hint in the docs regarding this
functionality.
But this should somehow be possible (IMHO), since when adding a second
active MDS node (max_mds 2) and then removing it again (max_mds 1), the
rank 1 node goes into the stopping state and hands over all clients/caps to
rank 0 without interruptions for the clients.
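To illustrate, that observation comes from roughly these commands (again
assuming the filesystem is named "cephfs"):
# ceph fs set cephfs max_mds 2      (rank 1 becomes active)
# ceph fs set cephfs max_mds 1      (rank 1 goes to stopping and hands its clients/caps to rank 0)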
Therefore my question: how can one gracefully shut down an active rank 0
MDS node, or promote a standby node to the active state, without losing
open files/caps or client sessions?
Thanks in advance,
M
Hi all,
Still getting an upgrade issue with cephadm: "Upgrade: failed to pull
target image". On each of the nodes in the cluster I can do:
docker pull docker.io/ceph/ceph:v15.2.8
And there is no error, but the upgrade command still fails. I can see an
entry in the logs for:
Feb 11 22:27:56 ceph-admin01 bash[2641]: audit
2021-02-11T22:27:55.339997+0000 mon.ceph-admin01 (mon.0) 61 : audit
[INF] from='mgr.12064184 ' entity='mgr.ceph-osd01.rpgexq'
cmd=[{"prefix":"config-key
set","key":"mgr/cephadm/upgrade_state","val":"{\"target_name\":
\"docker.io/ceph/ceph:v15.2.8\", \"progress_id\":
\"7cf2e315-6cfe-4e9a-88bc-ec8d611b6b4f\", \"error\":
\"UPGRADE_FAILED_PULL: Upgrade: failed to pull target image\",
\"paused\": true}"}]: dispatch
Any ideas how I can find extra info on what is going on there?
thanks
Darrin
Shouldn't the ceph osd df output show this result for every device class? I do not think that there are people mixing these classes in pools.
MIN/MAX VAR: 0.78/4.28 STDDEV: 6.15
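For what it's worth, a per-class view can already be had with something like this (syntax from memory, and it may differ between releases):
# ceph osd df tree class ssd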