Hi guys,
I have a Mimic cluster with only one RGW machine. My setup is simple -
one realm, one zonegroup, one zone. How can I safely add a second RGW
server to the same zone?
Is it safe to just run "ceph-deploy rgw create" for the second server
without impacting the existing metadata pools? What about the existing
S3/Swift users - they should be available to the second RGW from the
current pools, right?
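For reference, this is all I'm planning to run from the admin node (just a sketch; "rgw02" is a made-up hostname for the new gateway):
$ ceph-deploy rgw create rgw02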
My biggest concern is that the second RGW server will try to recreate
some internal pools when going online so I just want to double-check
that I will not mess up the current setup when adding the second instance :)
Thanks.
Hi,
There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.
One thing where Ceph isn't the fastest is 4k blocks written at queue
depth 1.
Some applications benefit very much from high performance/low latency
I/O at qd=1, for example Single Threaded applications which are writing
small files inside a VM running on RBD.
With some tuning you can get to a ~700us latency for a 4k write with
qd=1 (Replication, size=3)
I benchmark this using fio:
$ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
A 700us latency means the result will be about ~1400 IOps (1000 / 0.7).
Compared to, let's say, a BSD machine running ZFS, that is on the low
side. With ZFS+NVMe you'll be able to reach somewhere between
7,000 and 10,000 IOps; the latency is simply much lower.
My benchmarking / test setup for this:
- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3.84TB
- 10Gbit Base-T networking
Things to configure/tune (rough commands sketched below):
- C-State pinning to 1
- CPU governor to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
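Roughly how I apply these on the OSD nodes (a sketch, not a recipe: C-state
pinning can also be done in the BIOS or via the processor.max_cstate=1
kernel parameter, and tool names differ per distro):
$ cpupower frequency-set -g performance
$ ceph config set osd debug_osd 0/0
$ ceph config set osd debug_ms 0/0
$ ceph config set osd debug_bluestore 0/0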
Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
latency, and going towards 25Gbit/100Gbit might help as well.
These are, however, only very small increments and might reduce the
latency by another 15% or so.
It doesn't bring us anywhere near the 10k IOps other applications can do.
And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.
The Crimson project [0] is aiming to lower the latency with many things
like DPDK and SPDK, but this is far from finished and production ready.
In the meantime, am I overlooking something here? Can we reduce the
latency of the current OSDs further?
Reaching a ~500us latency would already be great!
Thanks,
Wido
[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
Hello all!
We have a cluster where there are HDDs for data and NVMEs for journals and
indexes. We recently added pure SSD hosts, and created a storage class SSD.
To do this, we created a default.rgw.hot.data pool, associated a CRUSH rule
using SSD, and created a HOT storage class in the placement target.
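For reference, this is roughly how we set it up (a sketch; the CRUSH rule
name "ssd-rule" is just our name, and we assume the default zonegroup/zone
with the default-placement target):
# ceph osd crush rule create-replicated ssd-rule default host ssd
# ceph osd pool create default.rgw.hot.data 64 64 replicated ssd-rule
# radosgw-admin zonegroup placement add --rgw-zonegroup default --placement-id default-placement --storage-class HOT
# radosgw-admin zone placement add --rgw-zone default --placement-id default-placement --storage-class HOT --data-pool default.rgw.hot.data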
The problem is that when we send an object using the HOT storage class, it
shows up in both the STANDARD storage class pool and the HOT pool.
STANDARD pool:
# rados -p default.rgw.buckets.data ls
d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
# rados -p default.rgw.buckets.data stat
d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
default.rgw.buckets.data/d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
mtime 2021-02-09 14:54:14.000000, size 0
HOT pool:
# rados -p default.rgw.hot.data ls
d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
# rados -p default.rgw.hot.data stat
d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
default.rgw.hot.data/d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
mtime 2021-02-09 14:54:14.000000, size 15220
The object data itself is in the HOT pool; however, another object, similar
to an index, is created in the STANDARD pool. Monitoring with iostat, we
noticed that this behavior generates unnecessary IO on disks that should
not need to be touched.
Why this behavior? Are there any ways around it?
Thanks, Marcelo
I would say production should have 5 MON servers
From: huxiaoyu(a)horebdata.cn <huxiaoyu(a)horebdata.cn>
Date: Friday, February 12, 2021 at 7:59 AM
To: Marc <Marc(a)f1-outsourcing.eu>, Michal Strnad <michal.strnad(a)cesnet.cz>, ceph-users <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: Backups of monitor
Normally any production Ceph cluster will have at least 3 MONs; does it really need a backup of the MONs?
samuel
huxiaoyu(a)horebdata.cn
From: Marc
Date: 2021-02-12 14:36
To: Michal Strnad; ceph-users(a)ceph.io
Subject: [ceph-users] Re: Backups of monitor
So why not create an extra monitor, start it only when you want to make a backup, wait until it is up to date, stop it, and then back it up?
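Roughly like this (a sketch; it assumes a non-containerized deployment, a spare monitor with id "backup" that is normally kept stopped, and the default data path):
# systemctl start ceph-mon@backup
# ceph quorum_status          (wait until the extra mon shows up in the quorum)
# systemctl stop ceph-mon@backup
# tar czf mon-backup.tgz /var/lib/ceph/mon/ceph-backup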
> -----Original Message-----
> From: Michal Strnad <michal.strnad(a)cesnet.cz>
> Sent: 11 February 2021 21:15
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Backups of monitor
>
> Hi all,
>
> We are looking for a proper solution for backing up the monitors (all the
> maps they hold). On the internet we found advice to stop one of the
> monitors, back it up (dump), and start the daemon again. But this is not
> the right approach due to the risk of losing quorum and the need for
> synchronization after the monitor is back online.
>
> Our goal is to have at least some (recent) metadata about the objects in
> the cluster as a last resort for when all monitors are in a very bad
> shape/state and we cannot start any of them. Maybe there is another
> approach, but we are not aware of it.
>
> We are running the latest nautilus and three monitors on every cluster.
>
> NB: We don't want to use more than three monitors.
>
>
> Thank you
> Cheers
> Michal
> --
> Michal Strnad
>
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi Troels,
1) It seems you need to set up the user id like this:
ceph dashboard set-rgw-api-user-id <user_id>
More info here:
https://docs.ceph.com/en/nautilus/mgr/dashboard/#enabling-the-object-gatewa…
2) Have you set up a multisite configuration (realms/zonegroups/zones)?
Please paste the output of:
radosgw-admin realm list
radosgw-admin zonegroup list
radosgw-admin zone list
Regards,
--
Alfonso Martínez
Senior Software Engineer, Ceph Storage
Red Hat <https://www.redhat.com>
Dear cephers,
I believe we are facing a bottleneck due to an inappropriate overall network design and would like to hear about experience and recommendations. I start with a description of the urgent problem/question and follow up with more details/questions.
These observations are on our HPC home file system served with ceph. It has 12 storage servers facing 550+ client servers.
Under high load, I start seeing "slow ping time" warnings with quite incredible latencies. I suspect we have a network bottleneck. On the storage servers we have 6x10G LACP trunks. Clients are on single 10G NICs. We have separate VLANs for front- and back network, but they both go through all NICs in the same way, so, technically, it's just one cluster network shared with clients. The aggregated bandwidth is sufficient for a single-node storage server load (it roughly matches the disk controller IO capacity). However, point-to-point connections are 10G only, and I believe we are starting to observe clients saturating a 10G link and starving all other ceph cluster traffic that needs to go through this link as well. This, in turn, leads to backlog effects with slow ops on unrelated OSDs, affecting overall user experience. The number of OSDs reporting slow ping times is about the percentage one would expect if one or two 10G links are congested. It's usually just one storage server that coughs up.
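For completeness, the per-OSD heartbeat ping times behind these warnings can be dumped via the admin socket (I believe the threshold argument is optional; 0 dumps everything):
# ceph daemon osd.0 dump_osd_network 0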
I guess the users with aggressive workloads getting the full bandwidth are happy, but everyone else is complaining. What I observe is that one or two clients can DOS everyone else. I typically see a very high read bandwidth from a few OSDs only, and my suspicion is that this is a large job of 50-100 nodes starting the same application at the same time. For example, 50-100 clients reading the same executable simultaneously. I see 5-6GB/s and up to 10k IOPS read, which is really good in principle. Except that it is not fair-shared with other users.
Question: I am starting to consider enabling QOS on the switches for traffic between storage servers and would like to know if anyone is doing this and what the experience is. Unfortunately, our network design is probably flawed and makes this difficult now; see below.
More Info.
Our FS data pool is EC 8+2. I have fast-read enabled. Hence, the network traffic amplification for both read and write is quite substantial.
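For reference, fast-read here refers to the pool flag, set roughly like this (the pool name below is just an example, not our actual pool name):
# ceph osd pool set cephfs_data fast_read true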
Our network is a spine-leaf architecture where ceph servers and ceph clients are distributed more or less equally over the leaf switches. I'm afraid that this is a first flaw in the design, because storage servers and clients compete for the same switches and the clients greatly outnumber the storage servers. It also makes implementing QOS a real pain while it could be just traffic shaping on an uplink trunk to clients if the storage servers were isolated.
This is the first design question: Isolated storage cluster providing service via uplinks/gateways versus "integrated/hyper-converged" where storage servers and clients are distributed equally over a spine-leaf architecture. Pros and cons?
We have a 100G spine VLT-pair with ports configured as 40G. Up-links from leafs are 2x40, in fact, we have these leafs configured as VLT-pairs for HA as well. A pair has 2x2x40G uplinks and 2x40G VLT interlinks. There are 2 ceph servers per VLT leaf-pair and ca. 85+ client servers on the same pair. There are also clients on leaf switches without ceph servers. I don't think the 40G uplinks are congested, but you never know.
We started with the ceph servers having 15 HDDs for fs data and 1 SSD for fs metadata each. With this configuration, the disk speed was the bottleneck and I observed slow ops under high load, but everything was more or less stable. I recently changed an MDS setting that greatly improved both client performance and the clients' ability to overload OSDs. In addition, one week ago I added 20 HDDs in a JBOD per host, which more than doubled the HDD throughput. Together, these two performance increases now have the counter-intuitive effect that aggregated performance has tripled compared to 2 months ago, but the user experience is very erratic. My suspicion is, as explained above, that each server can now handle a volume of traffic that easily saturates a 10G link, leading to observations that seem to indicate insufficient network capacity whenever too many client/cluster requests go through the same 10G link.
In essence, we increased aggregated performance greatly but users complain more than ever.
I suspect that this imbalance of server throughput ability and 10G point-to-point limitation is a problem. However, I cannot change the networking and would like some advice of how similar set-ups are configured and if QOS can help. My idea is to enable dot1p layer 2 QOS and give traffic coming from ports with storage servers connected a higher priority than traffic coming from everywhere else. I know it would be a lot simpler if the storage cluster was isolated, but I have to deal with the situation as is for now. Any advice and experience is highly appreciated.
If I do it, should I do QOS on both the front and back networks, or is QOS on the VLAN for the back network enough? Note that MONs are only on the front network.
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello fellow Ceph users,
currently we are updating our Ceph cluster (14.2.16) and making changes to
some config settings.
TL;DR: is there a way to do a graceful shutdown of an active MDS node
without losing the caps, open files and client connections? Something like
handing over the active state, promoting a standby to active, ...?
Sadly we ran into some difficulties when restarting MDS nodes. While we
had two active nodes and one standby, we initially thought that restarting
the active rank would result in a nice handover ... sadly we saw the node
going through the states replay-reconnect-rejoin-active, as nicely
visualized here:
https://docs.ceph.com/en/latest/cephfs/mds-states/
This left some nodes running into timeouts until the standby node had
reached the active state again, most probably because the CephFS already
has some 600k folders and 3M files; from the client side it took more
than 30s.
So before the next MDS restart, the FS config was changed to one active and
one standby-replay node; the idea was that since the standby-replay node
follows the active one, the handover would be smoother. The active state
was reached faster, but we still noticed some hiccups on the clients
while the new active MDS was waiting for clients to reconnect (state
up:reconnect) after the failover.
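For context, this is roughly the configuration change that was made (a
sketch; assuming our filesystem is named "cephfs"):
# ceph fs set cephfs max_mds 1
# ceph fs set cephfs allow_standby_replay true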
The next idea was to do a manual node promotion, graceful shutdown or
something similar - where the open caps and sessions would be handed
over ... but I did not find any hint in the docs regarding this
functionality.
But this should somehow be possible (IMHO), since when adding a second
active MDS node (max_mds 2) and then removing it again (max_mds 1), the
rank 1 node goes into the stopping state and hands over all clients/caps to
rank 0 without interruptions for the clients.
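To illustrate, that observation comes from roughly these commands (again
assuming the filesystem is named "cephfs"):
# ceph fs set cephfs max_mds 2      (rank 1 becomes active)
# ceph fs set cephfs max_mds 1      (rank 1 goes to stopping and hands its clients/caps to rank 0)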
Therefore my question: how can one gracefully shut down an active rank 0
MDS node, or promote a standby node to the active state, without losing
open files/caps or client sessions?
Thanks in advance,
M
Hi all,
Still getting an upgrade issue with cephadm: "Upgrade: failed to pull
target image". On each of the nodes in the cluster I can do:
docker pull docker.io/ceph/ceph:v15.2.8
And there is no error, but the upgrade command still fails. I can see an
entry in the logs for:
Feb 11 22:27:56 ceph-admin01 bash[2641]: audit
2021-02-11T22:27:55.339997+0000 mon.ceph-admin01 (mon.0) 61 : audit
[INF] from='mgr.12064184 ' entity='mgr.ceph-osd01.rpgexq'
cmd=[{"prefix":"config-key
set","key":"mgr/cephadm/upgrade_state","val":"{\"target_name\":
\"docker.io/ceph/ceph:v15.2.8\", \"progress_id\":
\"7cf2e315-6cfe-4e9a-88bc-ec8d611b6b4f\", \"error\":
\"UPGRADE_FAILED_PULL: Upgrade: failed to pull target image\",
\"paused\": true}"}]: dispatch
Any ideas how I can find extra info on what is going on there?
thanks
Darrin
Shouldn't the ceph osd df output show this result for every device class? I do not think that there are people mixing these classes in pools.
MIN/MAX VAR: 0.78/4.28 STDDEV: 6.15
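For what it's worth, a per-class view can already be had with something like this (syntax from memory, and it may differ between releases):
# ceph osd df tree class ssd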