Hi,
I am using the rados bench tool on a development cluster created with the
vstart.sh script. The cluster works fine and I am interested in benchmarking
it. However, I am struggling to achieve good bandwidth (MB/sec). My target
throughput is at least 50 MB/sec, but mostly I am achieving around 15-20
MB/sec, which is very poor.
I am quite sure I am missing something: either I have to change my cluster
setup through vstart.sh, or I am not using the rados bench tool correctly,
or possibly both.
Some of the shell commands I have been using to build the cluster are
below:
MDS=0 RGW=1 ../src/vstart.sh -d -l -n --bluestore
MDS=0 RGW=1 MON=1 OSD=4 ../src/vstart.sh -d -l -n --bluestore
While using the rados bench tool I have been trying different block sizes
(4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K), and I have also been changing the
-t parameter to increase the number of concurrent IOs.
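For reference, a typical run looks roughly like this (pool name, runtime,
block size and -t value are just placeholders for the values I have been
varying):
rados bench -p testpool 60 write -b 4194304 -t 32 --no-cleanup
rados bench -p testpool 60 seq -t 32
rados -p testpool cleanup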
Looking forward to your help.
Bobby
Dear All,
A question that has probably been asked by many other users before: I want to do a POC, and for it I can use old decommissioned hardware. Currently I have 3 x IBM X3550 M5 with:
1 Dualport 10G NIC
Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz
64GB RAM
the other two have a slower CPU but more RAM:
Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
512GB RAM
Of course I can re-arrange the RAM.
The switches are not LACP capable, so I'm planning to use bonding in active-active mode. For the disks I'm planning on buying 12 x Samsung PM883 1.9TB and using them in an EC pool.
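What I currently have in mind is roughly the following sketch (Debian-style
ifupdown with the ifenslave package; interface names and the address are
placeholders):
auto bond0
iface bond0 inet static
    address 192.0.2.10/24
    bond-slaves eno1 eno2
    bond-mode balance-alb
    bond-miimon 100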
My questions are:
1. Which bonding mode should I choose? balance-alb?
2. Are the disks OK for a POC? Or should I rather go with more but smaller disks (960GB), e.g. 24 in total?
3. Are there any drawbacks when using EC pools?
Workload will be mostly VMs (vSphere / OpenStack), but also CephFS with a Samba gateway.
Many thanks
Hi,
This is with v15.2 and v15.2.8.
Once an OSD service is applied, it can't be removed.
It always shows up in "ceph orch ls".
"ceph orch rm <osd service name>" only marks it "unmanaged",
but does not actually remove it.
Is this expected?
Thanks!
Tony
I would say everyone recommends at least 3 monitors, and since the count needs to be odd (1, 3, 5 or 7) I always read that as 5 being the best number (if you have 5 servers in your cluster). The other reason is high availability: the MONs use Paxos for the quorum, and since I like to have 3 in the quorum, you need 5 to be able to do maintenance (2 out of 3, 3 out of 5, ...). So if you are doing maintenance on a mon host in a 5-mon cluster you will still have 3 in the quorum.
From: huxiaoyu(a)horebdata.cn <huxiaoyu(a)horebdata.cn>
Date: Friday, February 12, 2021 at 8:42 AM
To: Freddy Andersen <freddy(a)cfandersen.com>, Marc <Marc(a)f1-outsourcing.eu>, Michal Strnad <michal.strnad(a)cesnet.cz>, ceph-users <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: Backups of monitor
Why 5 instead of 3 MONs are required?
huxiaoyu(a)horebdata.cn
From: Freddy Andersen
Date: 2021-02-12 16:05
To: huxiaoyu(a)horebdata.cn; Marc; Michal Strnad; ceph-users
Subject: Re: [ceph-users] Re: Backups of monitor
I would say production should have 5 MON servers
From: huxiaoyu(a)horebdata.cn <huxiaoyu(a)horebdata.cn>
Date: Friday, February 12, 2021 at 7:59 AM
To: Marc <Marc(a)f1-outsourcing.eu>, Michal Strnad <michal.strnad(a)cesnet.cz>, ceph-users <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: Backups of monitor
Normally any production Ceph cluster will have at least 3 MONs, so does it really need a backup of the MONs?
samuel
huxiaoyu(a)horebdata.cn
From: Marc
Date: 2021-02-12 14:36
To: Michal Strnad; ceph-users(a)ceph.io
Subject: [ceph-users] Re: Backups of monitor
So why not create an extra one, start it only when you want to make a backup, wait until it is up to date, and then stop it to back it up?
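The stop-and-copy part would be something like this minimal sketch (assuming
a systemd-managed mon named mon.a and the default store path; adjust names
and paths for your setup):
systemctl stop ceph-mon@a
tar czf /backup/ceph-mon-a-$(date +%F).tar.gz /var/lib/ceph/mon/ceph-a
systemctl start ceph-mon@a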
> -----Original Message-----
> From: Michal Strnad <michal.strnad(a)cesnet.cz>
> Sent: 11 February 2021 21:15
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Backups of monitor
>
> Hi all,
>
> We are looking for a proper solution for backing up the monitors (all the
> maps that they hold). On the internet we found advice that we have to stop
> one of the monitors, back it up (dump), and start the daemon again. But this
> is not the right approach due to the risk of losing quorum and the need for
> synchronization after the monitor is back online.
>
> Our goal is to have at least some (recent) metadata of the objects in the
> cluster as a last resort for when all monitors are in a very bad
> shape/state and we cannot start any of them. Maybe there is another
> approach, but we are not aware of it.
>
> We are running the latest nautilus and three monitors on every cluster.
>
> Note: we don't want to use more than three monitors.
>
>
> Thank you
> Cheers
> Michal
> --
> Michal Strnad
>
Hi,
I am running a Ceph Octopus (15.2.8) cluster primarily for CephFS.
Metadata is stored on SSD, data is stored in three different pools on
HDD. Currently, I use 22 subvolumes.
I am rotating snapshots on 16 subvolumes, all in the same pool, which is
the primary data pool for CephFS. Currently I have 41 snapshots per
subvolume. The goal is 50 snapshots (see bottom of mail for details).
Snapshots are only placed in the root subvolume directory, i.e.
/volumes/_nogroup/subvolname/hex-id/.snap
I create the snapshots from one of the nodes: the complete CephFS is mounted,
mkdir and rmdir are performed for each relevant subvolume, then CephFS is
unmounted again. All PGs are active+clean most of the time, only a few
in snaptrim for 1-2 minutes after snapshot deletion. I therefore assume
that snaptrim is not a limiting factor.
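For illustration, one rotation step is roughly the following (mount options
omitted; the subvolume path and snapshot names are placeholders):
mount -t ceph mon1:/ /mnt/cephfs
mkdir /mnt/cephfs/volumes/_nogroup/subvolname/<hex-id>/.snap/hourly-2021-02-12-16
rmdir /mnt/cephfs/volumes/_nogroup/subvolname/<hex-id>/.snap/hourly-2021-02-11-16
umount /mnt/cephfs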
Obviously, the total number of snapshots is more than the 400 and 100 I
see mentioned in some documentation. I am unsure if that is an issue
here, as the snapshots are all in disjoint subvolumes.
When mounting the subvolumes with kernel client (ranging from CentOS 7
supplied 3.10 up to 5.4.93), after some time and for some subvolumes the
kworker process begins to hog 100% CPU usage and stat operations become
very slow (even slower than with fuse client). I can mostly replicate
this by starting specific rsync operations (with many small files, e.g.
CTAN, CentOS, Debian mirrors) and by running a bareos backup. The
kworker process seems to be stuck even after terminating the causing
operation, i.e. rsync or bareos-fd.
Interestingly, I can even trigger these issues on a host that has only a
single CephFS subvolume without any snapshots mounted, as long as that
subvolume is in the same pool as other subvolumes with snapshots.
I don't see any abnormal behaviour on the cluster nodes or on other
clients during these kworker hanging phases.
With fuse client, in normal operation stat calls are about 10-20x slower
than with the kernel client. However, I don't encounter the extreme
slowdown behaviour. I am therefore currently mounting some
known-problematic subvolumes with fuse and non-problematic subvolumes
with the kernel client.
My questions are:
- Is this known or expected behaviour?
- I could move the subvolumes with snapshots into a subvolumegroup and
snapshot the whole group instead of each subvolume. Is this likely to solve
the issues?
- What is the current recommendation regarding CephFS and max number of
snapshots?
Cluster setup:
5 nodes with a total of 56 OSDs
Each node has a Xeon Silver 4208 and 128 GB RAM
Each node has two 480GB Samsung PM883 SSD used for CephFS metadata pool
HDDs are ranging from 8TB to 14TB, majority is 14TB
10 GbE internal network and 10 GbE client network, no Jumbo frames
$ ceph df
--- RAW STORAGE ---
CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd      520 TiB  141 TiB  378 TiB  379 TiB       72.88
ssd      3.9 TiB  3.8 TiB  1.7 GiB   97 GiB        2.46
TOTAL    524 TiB  145 TiB  378 TiB  379 TiB       72.36
--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1     1   66 MiB       57  198 MiB      0     23 TiB
cephfs.cephfs.meta      2  1024   26 GiB    2.29M   77 GiB   2.06    1.2 TiB
cephfs.cephfs.data      3  1024   70 TiB   54.95M  213 TiB  75.19     23 TiB
lofar                   4   512   77 TiB   21.41M  154 TiB  68.68     35 TiB
proxmox                 6    64  526 GiB  158.60k  1.6 TiB   2.16     23 TiB
archive                 7    32  7.3 TiB    5.42M   10 TiB  12.57     56 TiB
Snapshots are only on cephfs.cephfs.data pool.
Intended snapshot rotation:
4 quarter-hourly snapshots
24 hourly snapshots
14 daily snapshots
8 weekly snapshots
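(i.e. 4 + 24 + 14 + 8 = 50 snapshots per subvolume, matching the goal mentioned above)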
Cheers
Sebastian
Hi,
I've been trying with v15.2 and v15.2.8, no luck.
Wondering if this is actually supported or has ever worked for anyone?
Here is what I've done.
1) Create a cluster with 1 controller (mon and mgr) and 3 OSD nodes,
each of which has 1 SSD for DB and 8 HDDs for data.
2) OSD service spec.
service_type: osd
service_id: osd-spec
placement:
  hosts:
    - ceph-osd-1
    - ceph-osd-2
    - ceph-osd-3
spec:
  block_db_size: 92341796864
  data_devices:
    model: ST16000NM010G
  db_devices:
    model: KPM5XRUG960G
3) Add OSD hosts and apply OSD service spec. 8 OSDs (data on HDD and
DB on SSD) are created on each host properly.
4) Run "orch osd rm 1 --replace --force". OSD is marked "destroyed" and
reweight is set to 0 in "osd tree". "pg dump" shows no PG on that OSD.
"orch ps" shows no daemon running for that OSD.
5) Run "orch device zap <host> <device>". VG and LV for HDD are removed.
LV for DB stays. "orch device ls" shows HDD device is available.
6) Cephadm finds OSD claims and applies OSD spec on the host.
Here is the message.
============================
cephadm [INF] Found osd claims -> {'ceph-osd-1': ['1']}
cephadm [INF] Found osd claims for drivegroup osd-spec -> {'ceph-osd-1': ['1']}
cephadm [INF] Applying osd-spec on host ceph-osd-1...
cephadm [INF] Applying osd-spec on host ceph-osd-2...
cephadm [INF] Applying osd-spec on host ceph-osd-3...
cephadm [INF] ceph-osd-1: lvm batch --no-auto /dev/sdc /dev/sdd
/dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
--db-devices /dev/sdb --block-db-size 92341796864
--osd-ids 1 --yes --no-systemd
code: 0
out: ['']
err: ['/bin/docker:stderr --> passed data devices: 8 physical, 0 LVM',
'/bin/docker:stderr --> relative data size: 1.0',
'/bin/docker:stderr --> passed block_db devices: 1 physical, 0 LVM',
'/bin/docker:stderr --> 1 fast devices were passed, but none are available']
============================
Q1. Is DB LV on SSD supposed to be deleted or not, when replacing an OSD
whose data is on HDD and DB is on SSD?
Q2. If yes from Q1, is a new DB LV supposed to be created on SSD as long as
there is sufficient free space, when building the new OSD?
Q3. If no from Q1, since it's replacing, is the old DB LV going to be reused
for the new OSD?
Again, is this actually supposed to work? Am I missing anything, or am I just
trying an unsupported feature?
Thanks!
Tony
Hi all,
I'm running a 15.2.8 cluster using ceph orch with all daemons adopted to
cephadm.
I tried reinstalling an OSD node. Is there a way to make ceph orch/cephadm
activate the devices on this node again, ideally automatically?
I tried running `cephadm ceph-volume -- lvm activate --all` but this has
an error related to dmcrypt:
> [root@osd2803 ~]# cephadm ceph-volume -- lvm activate --all
> Using recent ceph image docker.io/ceph/ceph:v15
> /usr/bin/podman:stderr --> Activating OSD ID 0 FSID
> 697698fd-3fa0-480f-807b-68492bd292bf
> /usr/bin/podman:stderr Running command: /usr/bin/mount -t tmpfs tmpfs
> /var/lib/ceph/osd/ceph-0
> /usr/bin/podman:stderr Running command: /usr/bin/ceph-authtool
> /var/lib/ceph/osd/ceph-0/lockbox.keyring --create-keyring --name
> client.osd-lockbox.697698fd-3fa0-480f-807b-68492bd292bf --add-key
> AQAy7Bdg0jQsBhAAj0gcteTEbcpwNNvMGZqTTg==
> /usr/bin/podman:stderr stdout: creating
> /var/lib/ceph/osd/ceph-0/lockbox.keyring
> /usr/bin/podman:stderr added entity
> client.osd-lockbox.697698fd-3fa0-480f-807b-68492bd292bf
> auth(key=AQAy7Bdg0jQsBhAAj0gcteTEbcpwNNvMGZqTTg==)
> /usr/bin/podman:stderr Running command: /usr/bin/chown -R ceph:ceph
> /var/lib/ceph/osd/ceph-0/lockbox.keyring
> /usr/bin/podman:stderr Running command: /usr/bin/ceph --cluster ceph
> --name client.osd-lockbox.697698fd-3fa0-480f-807b-68492bd292bf
> --keyring /var/lib/ceph/osd/ceph-0/lockbox.keyring config-key get
> dm-crypt/osd/697698fd-3fa0-480f-807b-68492bd292bf/luks
> /usr/bin/podman:stderr stderr: Error initializing cluster client:
> ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
> /usr/bin/podman:stderr --> RuntimeError: Unable to retrieve dmcrypt
> secret
> Traceback (most recent call last):
> File "/usr/sbin/cephadm", line 6111, in <module>
> r = args.func()
> File "/usr/sbin/cephadm", line 1322, in _infer_fsid
> return func()
> File "/usr/sbin/cephadm", line 1381, in _infer_image
> return func()
> File "/usr/sbin/cephadm", line 3611, in command_ceph_volume
> out, err, code = call_throws(c.run_cmd(), verbose=True)
> File "/usr/sbin/cephadm", line 1060, in call_throws
> raise RuntimeError('Failed command: %s' % ' '.join(command))
> RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host
> --net=host --entrypoint /usr/sbin/ceph-volume --privileged
> --group-add=disk -e CONTAINER_IMAGE=docker.io/ceph/ceph:v15 -e
> NODE_NAME=osd2803.banette.os -v /dev:/dev -v /run/udev:/run/udev -v
> /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm
> docker.io/ceph/ceph:v15 lvm activate --all
The OSDs are indeed encrypted. `cephadm ceph-volume lvm list` and
`cephadm shell ceph -s` run just fine, and if I run ceph-volume
directly, the same command works, but then of course the daemons are
started in the legacy way again, not in containers.
Is there another way through 'ceph orch' to achieve this? Or, if
`cephadm ceph-volume -- lvm activate --all` is the way to go here,
am I probably hitting a bug?
Thanks!!
Kenneth
Hi Gents,
Can you tell me how much latency you have in your multisite cluster? Multisite is supposed to be latency sensitive and I am afraid this is causing my sync issue, but I don't really know what "low latency" means.
Here is mine; I wonder whether it is good or not.
In HKG:
"data-sync-from-ash": {
"fetch_bytes": {
"avgcount": 0,
"sum": 0
},
"fetch_not_modified": 425395,
"fetch_errors": 1,
"poll_latency": {
"avgcount": 890,
"sum": 47481.671537512,
"avgtime": 53.350192738
},
"poll_errors": 0
},
"data-sync-from-sin": {
"fetch_bytes": {
"avgcount": 0,
"sum": 0
},
"fetch_not_modified": 484757,
"fetch_errors": 0,
"poll_latency": {
"avgcount": 21686,
"sum": 135649.750753768,
"avgtime": 6.255176185
},
"poll_errors": 3
In ASH:
"data-sync-from-hkg": {
"fetch_bytes": {
"avgcount": 7904,
"sum": 497898243
},
"fetch_not_modified": 7383973,
"fetch_errors": 654,
"poll_latency": {
"avgcount": 6586,
"sum": 2568055.690045521,
"avgtime": 389.926463717
},
"poll_errors": 3
},
"data-sync-from-sin": {
"fetch_bytes": {
"avgcount": 13362,
"sum": 800114616
},
"fetch_not_modified": 7326406,
"fetch_errors": 558,
"poll_latency": {
"avgcount": 10137,
"sum": 3145053.032619919,
"avgtime": 310.254812333
},
"poll_errors": 5
},
In SGP:
"data-sync-from-ash": {
"fetch_bytes": {
"avgcount": 0,
"sum": 0
},
"fetch_not_modified": 2057839,
"fetch_errors": 1,
"poll_latency": {
"avgcount": 8874,
"sum": 682176.718044618,
"avgtime": 76.873644133
},
"poll_errors": 0
},
"data-sync-from-hkg": {
"fetch_bytes": {
"avgcount": 114,
"sum": 1097512
},
"fetch_not_modified": 1939198,
"fetch_errors": 823,
"poll_latency": {
"avgcount": 2123,
"sum": 60947.760976996,
"avgtime": 28.708318877
},
"poll_errors": 1
Hi,
after upgrading Ceph from 14.2.8 to 14.2.16 we experienced increased latencies. There were no changes in hardware, configuration, workload or networking, just a rolling update via ceph-ansible on the running production cluster. The cluster consists of 16 OSDs (all SSD) over 4 nodes. The VMs served via RBD from this cluster currently suffer from high I/O wait.
These are some latencies that are increased after the update:
- op_r_latency
- op_w_latency
- kv_final_lat
- state_kv_commiting_lat
- submit_lat
- subop_w_latency
Do these latencies point to KV/RocksDB?
These are some latencies which are NOT increased after the update:
- kv_sync_lat
- kv_flush_lat
- kv_commit_lat
I attached one graph showing the massive increase after the update.
I tried setting bluefs_buffered_io=true (as its default value was changed and it was mentioned as performance relevant) for all OSDs on one host, but this does not make a difference.
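For reference, this is roughly how I applied it (a sketch; the per-host config
mask is my understanding of the syntax, the host name is a placeholder, and
the option may only take effect after an OSD restart):
ceph config set osd/host:nodeA bluefs_buffered_io true
ceph config show osd.0 | grep bluefs_buffered_io   # verify on one of that host's OSDs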
The ceph.conf is fairly simple:
[global]
cluster network = xxx
fsid = xxx
mon host = xxx
public network = xxx
[osd]
osd memory target = 10141014425
Any ideas what to try? Help appreciated.
Björn
--
dbap GmbH
phone +49 251 609979-0 / fax +49 251 609979-99
Heinr.-von-Kleist-Str. 47, 48161 Muenster, Germany
http://www.dbap.de
dbap GmbH, Sitz: Muenster
HRB 5891, Amtsgericht Muenster
Geschaeftsfuehrer: Bjoern Dolkemeier