Hi,
I am using the rados bench tool on a development cluster created with the
vstart.sh script. The cluster works fine and I am interested in benchmarking
it. However, I am struggling to achieve good bandwidth (MB/sec). My target
throughput is at least 50 MB/sec, but mostly I am achieving around 15-20
MB/sec, which is very poor.
I am quite sure I am missing something: either I have to change my cluster
setup through vstart.sh, or I am not using the rados bench tool correctly,
or possibly both.
Some of the shell commands I have been using to build the cluster are
below:
MDS=0 RGW=1 ../src/vstart.sh -d -l -n --bluestore
MDS=0 RGW=1 MON=1 OSD=4 ../src/vstart.sh -d -l -n --bluestore
While using the rados bench tool I have been trying different block sizes
(4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K), and I have also been changing the
-t parameter to increase the number of concurrent IOs.
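For reference, a typical run looks roughly like this (pool name, runtime,
block size and -t value are just placeholders for the values I have been
varying):
rados bench -p testpool 60 write -b 4194304 -t 32 --no-cleanup
rados bench -p testpool 60 seq -t 32
rados -p testpool cleanup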
Looking forward to your help.
Bobby
Dear All,
A question that has probably been asked by many other users before: I want to do a POC, and for it I can use old decommissioned hardware. Currently I have 3 x IBM X3550 M5 with:
1 Dualport 10G NIC
Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz
64GB RAM
the other two have a slower CPU but more RAM:
Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
512GB RAM
Of course I can re-arrange the RAM.
The switches are not LACP capable, so I'm planning to use bonding in active-active mode. For the disks I'm planning on buying 12 x Samsung PM883 1.9TB and using them in an EC pool.
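What I currently have in mind is roughly the following sketch (Debian-style
ifupdown with the ifenslave package; interface names and the address are
placeholders):
auto bond0
iface bond0 inet static
    address 192.0.2.10/24
    bond-slaves eno1 eno2
    bond-mode balance-alb
    bond-miimon 100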
My questions are:
1. Which bonding mode should I choose? balance-alb?
2. Are the disks OK for a POC? Or should I rather go with more but smaller disks (960GB), e.g. 24 in total?
3. Are there any drawbacks when using EC pools?
Workload will be mostly VMs (vSphere / OpenStack), but also CephFS with a Samba gateway.
Many thanks
Hi,
This is with v15.2 and v15.2.8.
Once an OSD service is applied, it can't be removed.
It always shows up in "ceph orch ls".
"ceph orch rm <osd service name>" only marks it "unmanaged",
but does not actually remove it.
Is this expected?
Thanks!
Tony
I would say everyone recommends at least 3 monitors, and since the count needs to be odd (1, 3, 5 or 7) I always read that as 5 being the best number (if you have 5 servers in your cluster). The other reason is high availability: the MONs use Paxos for the quorum, and since I like to have 3 in the quorum, you need 5 to be able to do maintenance (2 out of 3, 3 out of 5, ...). So if you are doing maintenance on a mon host in a 5-mon cluster you will still have 3 in the quorum.
From: huxiaoyu(a)horebdata.cn <huxiaoyu(a)horebdata.cn>
Date: Friday, February 12, 2021 at 8:42 AM
To: Freddy Andersen <freddy(a)cfandersen.com>, Marc <Marc(a)f1-outsourcing.eu>, Michal Strnad <michal.strnad(a)cesnet.cz>, ceph-users <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: Backups of monitor
Why 5 instead of 3 MONs are required?
huxiaoyu(a)horebdata.cn
From: Freddy Andersen
Date: 2021-02-12 16:05
To: huxiaoyu(a)horebdata.cn; Marc; Michal Strnad; ceph-users
Subject: Re: [ceph-users] Re: Backups of monitor
I would say production should have 5 MON servers
From: huxiaoyu(a)horebdata.cn <huxiaoyu(a)horebdata.cn>
Date: Friday, February 12, 2021 at 7:59 AM
To: Marc <Marc(a)f1-outsourcing.eu>, Michal Strnad <michal.strnad(a)cesnet.cz>, ceph-users <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: Backups of monitor
Normally any production Ceph cluster will have at least 3 MONs, so does it really need a backup of the MONs?
samuel
huxiaoyu(a)horebdata.cn
From: Marc
Date: 2021-02-12 14:36
To: Michal Strnad; ceph-users(a)ceph.io
Subject: [ceph-users] Re: Backups of monitor
So why not create an extra one, start it only when you want to make a backup, wait until it is up to date, and then stop it to back it up?
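The stop-and-copy part would be something like this minimal sketch (assuming
a systemd-managed mon named mon.a and the default store path; adjust names
and paths for your setup):
systemctl stop ceph-mon@a
tar czf /backup/ceph-mon-a-$(date +%F).tar.gz /var/lib/ceph/mon/ceph-a
systemctl start ceph-mon@a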
> -----Original Message-----
> From: Michal Strnad <michal.strnad(a)cesnet.cz>
> Sent: 11 February 2021 21:15
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Backups of monitor
>
> Hi all,
>
> We are looking for a proper solution for backing up the monitors (all the
> maps that they hold). On the internet we found advice that we have to stop
> one of the monitors, back it up (dump), and start the daemon again. But this
> is not the right approach due to the risk of losing quorum and the need for
> synchronization after the monitor is back online.
>
> Our goal is to have at least some (recent) metadata of the objects in the
> cluster as a last resort for when all monitors are in a very bad
> shape/state and we cannot start any of them. Maybe there is another
> approach, but we are not aware of it.
>
> We are running the latest nautilus and three monitors on every cluster.
>
> Note: we don't want to use more than three monitors.
>
>
> Thank you
> Cheers
> Michal
> --
> Michal Strnad
>
Hi,
I am running a Ceph Octopus (15.2.8) cluster primarily for CephFS.
Metadata is stored on SSD, data is stored in three different pools on
HDD. Currently, I use 22 subvolumes.
I am rotating snapshots on 16 subvolumes, all in the same pool, which is
the primary data pool for CephFS. Currently I have 41 snapshots per
subvolume. The goal is 50 snapshots (see bottom of mail for details).
Snapshots are only placed in the root subvolume directory, i.e.
/volumes/_nogroup/subvolname/hex-id/.snap
I create the snapshots from one of the nodes: the complete CephFS is mounted,
mkdir and rmdir are performed for each relevant subvolume, then CephFS is
unmounted again. All PGs are active+clean most of the time, only a few
in snaptrim for 1-2 minutes after snapshot deletion. I therefore assume
that snaptrim is not a limiting factor.
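For illustration, one rotation step is roughly the following (mount options
omitted; the subvolume path and snapshot names are placeholders):
mount -t ceph mon1:/ /mnt/cephfs
mkdir /mnt/cephfs/volumes/_nogroup/subvolname/<hex-id>/.snap/hourly-2021-02-12-16
rmdir /mnt/cephfs/volumes/_nogroup/subvolname/<hex-id>/.snap/hourly-2021-02-11-16
umount /mnt/cephfs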
Obviously, the total number of snapshots is more than the 400 and 100 I
see mentioned in some documentation. I am unsure if that is an issue
here, as the snapshots are all in disjoint subvolumes.
When mounting the subvolumes with kernel client (ranging from CentOS 7
supplied 3.10 up to 5.4.93), after some time and for some subvolumes the
kworker process begins to hog 100% CPU usage and stat operations become
very slow (even slower than with fuse client). I can mostly replicate
this by starting specific rsync operations (with many small files, e.g.
CTAN, CentOS, Debian mirrors) and by running a bareos backup. The
kworker process seems to be stuck even after terminating the causing
operation, i.e. rsync or bareos-fd.
Interestingly, I can even trigger these issues on a host that has only a
single CephFS subvolume without any snapshots mounted, as long as that
subvolume is in the same pool as other subvolumes with snapshots.
I don't see any abnormal behaviour on the cluster nodes or on other
clients during these kworker hanging phases.
With fuse client, in normal operation stat calls are about 10-20x slower
than with the kernel client. However, I don't encounter the extreme
slowdown behaviour. I am therefore currently mounting some
known-problematic subvolumes with fuse and non-problematic subvolumes
with the kernel client.
My questions are:
- Is this known or expected behaviour?
- I could move the subvolumes with snapshots into a subvolumegroup and
snapshot the whole group instead of each subvolume. Is this likely to solve
the issues?
- What is the current recommendation regarding CephFS and max number of
snapshots?
Cluster setup:
5 nodes with a total of 56 OSDs
Each node has a Xeon Silver 4208 and 128 GB RAM
Each node has two 480GB Samsung PM883 SSD used for CephFS metadata pool
HDDs are ranging from 8TB to 14TB, majority is 14TB
10 GbE internal network and 10 GbE client network, no Jumbo frames
$ ceph df
--- RAW STORAGE ---
CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd      520 TiB  141 TiB  378 TiB  379 TiB       72.88
ssd      3.9 TiB  3.8 TiB  1.7 GiB   97 GiB        2.46
TOTAL    524 TiB  145 TiB  378 TiB  379 TiB       72.36
--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1     1   66 MiB       57  198 MiB      0     23 TiB
cephfs.cephfs.meta      2  1024   26 GiB    2.29M   77 GiB   2.06    1.2 TiB
cephfs.cephfs.data      3  1024   70 TiB   54.95M  213 TiB  75.19     23 TiB
lofar                   4   512   77 TiB   21.41M  154 TiB  68.68     35 TiB
proxmox                 6    64  526 GiB  158.60k  1.6 TiB   2.16     23 TiB
archive                 7    32  7.3 TiB    5.42M   10 TiB  12.57     56 TiB
Snapshots are only on cephfs.cephfs.data pool.
Intended snapshot rotation:
4 quarter-hourly snapshots
24 hourly snapshots
14 daily snapshots
8 weekly snapshots
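(i.e. 4 + 24 + 14 + 8 = 50 snapshots per subvolume, matching the goal mentioned above)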
Cheers
Sebastian
Hi,
I've been trying with v15.2 and v15.2.8, no luck.
Wondering if this is actually supported or has ever worked for anyone?
Here is what I've done.
1) Create a cluster with 1 controller (mon and mgr) and 3 OSD nodes,
each of which has 1 SSD for DB and 8 HDDs for data.
2) OSD service spec.
service_type: osd
service_id: osd-spec
placement:
  hosts:
    - ceph-osd-1
    - ceph-osd-2
    - ceph-osd-3
spec:
  block_db_size: 92341796864
  data_devices:
    model: ST16000NM010G
  db_devices:
    model: KPM5XRUG960G
3) Add OSD hosts and apply OSD service spec. 8 OSDs (data on HDD and
DB on SSD) are created on each host properly.
4) Run "orch osd rm 1 --replace --force". OSD is marked "destroyed" and
reweight is set to 0 in "osd tree". "pg dump" shows no PG on that OSD.
"orch ps" shows no daemon running for that OSD.
5) Run "orch device zap <host> <device>". VG and LV for HDD are removed.
LV for DB stays. "orch device ls" shows HDD device is available.
6) Cephadm finds OSD claims and applies OSD spec on the host.
Here is the message.
============================
cephadm [INF] Found osd claims -> {'ceph-osd-1': ['1']}
cephadm [INF] Found osd claims for drivegroup osd-spec -> {'ceph-osd-1': ['1']}
cephadm [INF] Applying osd-spec on host ceph-osd-1...
cephadm [INF] Applying osd-spec on host ceph-osd-2...
cephadm [INF] Applying osd-spec on host ceph-osd-3...
cephadm [INF] ceph-osd-1: lvm batch --no-auto /dev/sdc /dev/sdd
/dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
--db-devices /dev/sdb --block-db-size 92341796864
--osd-ids 1 --yes --no-systemd
code: 0
out: ['']
err: ['/bin/docker:stderr --> passed data devices: 8 physical, 0 LVM',
'/bin/docker:stderr --> relative data size: 1.0',
'/bin/docker:stderr --> passed block_db devices: 1 physical, 0 LVM',
'/bin/docker:stderr --> 1 fast devices were passed, but none are available']
============================
Q1. Is DB LV on SSD supposed to be deleted or not, when replacing an OSD
whose data is on HDD and DB is on SSD?
Q2. If yes from Q1, is a new DB LV supposed to be created on SSD as long as
there is sufficient free space, when building the new OSD?
Q3. If no from Q1, since it's replacing, is the old DB LV going to be reused
for the new OSD?
Again, is this actually supposed to work? Am I missing anything, or am I just
trying an unsupported feature?
Thanks!
Tony
Hi all,
I'm running a 15.2.8 cluster using ceph orch with all daemons adopted to
cephadm.
I tried reinstalling an OSD node. Is there a way to make ceph orch/cephadm
activate the devices on this node again, ideally automatically?
I tried running `cephadm ceph-volume -- lvm activate --all` but this has
an error related to dmcrypt:
> [root@osd2803 ~]# cephadm ceph-volume -- lvm activate --all
> Using recent ceph image docker.io/ceph/ceph:v15
> /usr/bin/podman:stderr --> Activating OSD ID 0 FSID
> 697698fd-3fa0-480f-807b-68492bd292bf
> /usr/bin/podman:stderr Running command: /usr/bin/mount -t tmpfs tmpfs
> /var/lib/ceph/osd/ceph-0
> /usr/bin/podman:stderr Running command: /usr/bin/ceph-authtool
> /var/lib/ceph/osd/ceph-0/lockbox.keyring --create-keyring --name
> client.osd-lockbox.697698fd-3fa0-480f-807b-68492bd292bf --add-key
> AQAy7Bdg0jQsBhAAj0gcteTEbcpwNNvMGZqTTg==
> /usr/bin/podman:stderr stdout: creating
> /var/lib/ceph/osd/ceph-0/lockbox.keyring
> /usr/bin/podman:stderr added entity
> client.osd-lockbox.697698fd-3fa0-480f-807b-68492bd292bf
> auth(key=AQAy7Bdg0jQsBhAAj0gcteTEbcpwNNvMGZqTTg==)
> /usr/bin/podman:stderr Running command: /usr/bin/chown -R ceph:ceph
> /var/lib/ceph/osd/ceph-0/lockbox.keyring
> /usr/bin/podman:stderr Running command: /usr/bin/ceph --cluster ceph
> --name client.osd-lockbox.697698fd-3fa0-480f-807b-68492bd292bf
> --keyring /var/lib/ceph/osd/ceph-0/lockbox.keyring config-key get
> dm-crypt/osd/697698fd-3fa0-480f-807b-68492bd292bf/luks
> /usr/bin/podman:stderr stderr: Error initializing cluster client:
> ObjectNotFound('RADOS object not found (error calling conf_read_file)',)
> /usr/bin/podman:stderr --> RuntimeError: Unable to retrieve dmcrypt
> secret
> Traceback (most recent call last):
> File "/usr/sbin/cephadm", line 6111, in <module>
> r = args.func()
> File "/usr/sbin/cephadm", line 1322, in _infer_fsid
> return func()
> File "/usr/sbin/cephadm", line 1381, in _infer_image
> return func()
> File "/usr/sbin/cephadm", line 3611, in command_ceph_volume
> out, err, code = call_throws(c.run_cmd(), verbose=True)
> File "/usr/sbin/cephadm", line 1060, in call_throws
> raise RuntimeError('Failed command: %s' % ' '.join(command))
> RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host
> --net=host --entrypoint /usr/sbin/ceph-volume --privileged
> --group-add=disk -e CONTAINER_IMAGE=docker.io/ceph/ceph:v15 -e
> NODE_NAME=osd2803.banette.os -v /dev:/dev -v /run/udev:/run/udev -v
> /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm
> docker.io/ceph/ceph:v15 lvm activate --all
The OSDs are indeed encrypted. `cephadm ceph-volume lvm list` and
`cephadm shell ceph -s` run just fine, and if I run ceph-volume
directly, the same command works, but then of course the daemons are
started in the legacy way again, not in containers.
Is there another way through 'ceph orch' to achieve this? Or, if
`cephadm ceph-volume -- lvm activate --all` is the way to go here,
am I probably hitting a bug?
Thanks!!
Kenneth
Hi Gents,
Can you tell me how much latency you have in your multisite cluster? Multisite is supposed to be latency sensitive and I am afraid this is causing my sync issue, but I don't really know what "low latency" means.
Here is mine; I wonder whether it is good or not.
In HKG:
"data-sync-from-ash": {
"fetch_bytes": {
"avgcount": 0,
"sum": 0
},
"fetch_not_modified": 425395,
"fetch_errors": 1,
"poll_latency": {
"avgcount": 890,
"sum": 47481.671537512,
"avgtime": 53.350192738
},
"poll_errors": 0
},
"data-sync-from-sin": {
"fetch_bytes": {
"avgcount": 0,
"sum": 0
},
"fetch_not_modified": 484757,
"fetch_errors": 0,
"poll_latency": {
"avgcount": 21686,
"sum": 135649.750753768,
"avgtime": 6.255176185
},
"poll_errors": 3
In ASH:
"data-sync-from-hkg": {
"fetch_bytes": {
"avgcount": 7904,
"sum": 497898243
},
"fetch_not_modified": 7383973,
"fetch_errors": 654,
"poll_latency": {
"avgcount": 6586,
"sum": 2568055.690045521,
"avgtime": 389.926463717
},
"poll_errors": 3
},
"data-sync-from-sin": {
"fetch_bytes": {
"avgcount": 13362,
"sum": 800114616
},
"fetch_not_modified": 7326406,
"fetch_errors": 558,
"poll_latency": {
"avgcount": 10137,
"sum": 3145053.032619919,
"avgtime": 310.254812333
},
"poll_errors": 5
},
In SGP:
"data-sync-from-ash": {
"fetch_bytes": {
"avgcount": 0,
"sum": 0
},
"fetch_not_modified": 2057839,
"fetch_errors": 1,
"poll_latency": {
"avgcount": 8874,
"sum": 682176.718044618,
"avgtime": 76.873644133
},
"poll_errors": 0
},
"data-sync-from-hkg": {
"fetch_bytes": {
"avgcount": 114,
"sum": 1097512
},
"fetch_not_modified": 1939198,
"fetch_errors": 823,
"poll_latency": {
"avgcount": 2123,
"sum": 60947.760976996,
"avgtime": 28.708318877
},
"poll_errors": 1
Hi,
after upgrading Ceph from 14.2.8 to 14.2.16 we experienced increased latencies. There were no changes in hardware, configuration, workload or networking, just a rolling update via ceph-ansible on the running production cluster. The cluster consists of 16 OSDs (all SSD) over 4 nodes. The VMs served via RBD from this cluster currently suffer from high I/O wait.
These are some latencies that are increased after the update:
- op_r_latency
- op_w_latency
- kv_final_lat
- state_kv_commiting_lat
- submit_lat
- subop_w_latency
Do these latencies point to KV/RocksDB?
These are some latencies which are NOT increased after the update:
- kv_sync_lat
- kv_flush_lat
- kv_commit_lat
I attached one graph showing the massive increase after the update.
I tried setting bluefs_buffered_io=true (as its default value was changed and it was mentioned as performance relevant) for all OSDs on one host, but this does not make a difference.
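For reference, this is roughly how I applied it (a sketch; the per-host config
mask is my understanding of the syntax, the host name is a placeholder, and
the option may only take effect after an OSD restart):
ceph config set osd/host:nodeA bluefs_buffered_io true
ceph config show osd.0 | grep bluefs_buffered_io   # verify on one of that host's OSDs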
The ceph.conf is fairly simple:
[global]
cluster network = xxx
fsid = xxx
mon host = xxx
public network = xxx
[osd]
osd memory target = 10141014425
Any ideas what to try? Help appreciated.
Björn
--
dbap GmbH
phone +49 251 609979-0 / fax +49 251 609979-99
Heinr.-von-Kleist-Str. 47, 48161 Muenster, Germany
http://www.dbap.de
dbap GmbH, Sitz: Muenster
HRB 5891, Amtsgericht Muenster
Geschaeftsfuehrer: Bjoern Dolkemeier