September 2020 - ceph-users

by Tony Liu

Hi, I have a 3-OSD-node Ceph cluster with 1 480GB SSD and 8 x 2TB 12Gpbs SAS HDD on each node, to provide storage to a OpenStack cluster. Both public and cluster networks are 2x10G. WAL and DB of each OSD is on SSD and they share the same 60GB partition. I run fio with different combinations of operation, block size and io-depth to collect IOPS, bandwidth and latency. I tried fio on compute node with ioengine=rbd, also fio within VM (backed by Ceph) with ioengine=libaio. The result doesn't seem good. Here are couple examples. ==================================== fio --name=test --ioengine=rbd --clientname=admin \ --pool=benchmark --rbdname=test --numjobs=1 \ --runtime=30 --direct=1 --size=2G \ --rw=read --bs=4k --iodepth=1 test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1 fio-3.7 Starting 1 process Jobs: 1 (f=0): [f(1)][100.0%][r=27.6MiB/s,w=0KiB/s][r=7075,w=0 IOPS][eta 00m:00s] test: (groupid=0, jobs=1): err= 0: pid=56310: Mon Sep 14 19:01:24 2020 read: IOPS=7610, BW=29.7MiB/s (31.2MB/s)(892MiB/30001msec) slat (nsec): min=1550, max=57662, avg=3312.74, stdev=2981.42 clat (usec): min=77, max=4799, avg=127.39, stdev=39.88 lat (usec): min=78, max=4812, avg=130.70, stdev=40.67 clat percentiles (usec): | 1.00th=[ 82], 5.00th=[ 86], 10.00th=[ 95], 20.00th=[ 98], | 30.00th=[ 100], 40.00th=[ 104], 50.00th=[ 116], 60.00th=[ 129], | 70.00th=[ 141], 80.00th=[ 157], 90.00th=[ 182], 95.00th=[ 198], | 99.00th=[ 233], 99.50th=[ 245], 99.90th=[ 359], 99.95th=[ 515], | 99.99th=[ 709] bw ( KiB/s): min=27160, max=40696, per=100.00%, avg=30474.29, stdev=2826.23, samples=59 iops : min= 6790, max=10174, avg=7618.56, stdev=706.56, samples=59 lat (usec) : 100=28.89%, 250=70.72%, 500=0.34%, 750=0.05%, 1000=0.01% lat (msec) : 2=0.01%, 10=0.01% cpu : usr=3.55%, sys=3.80%, ctx=228358, majf=0, minf=29 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=228333,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: bw=29.7MiB/s (31.2MB/s), 29.7MiB/s-29.7MiB/s (31.2MB/s-31.2MB/s), io=892MiB (935MB), run=30001-30001msec Disk stats (read/write): dm-0: ios=290/3, merge=0/0, ticks=2427/19, in_queue=2446, util=0.95%, aggrios=290/4, aggrmerge=0/0, aggrticks=2427/39, aggrin_queue=2332, aggrutil=0.95% sda: ios=290/4, merge=0/0, ticks=2427/39, in_queue=2332, util=0.95% ==================================== ==================================== fio --name=test --ioengine=rbd --clientname=admin \ --pool=benchmark --rbdname=test --numjobs=1 \ --runtime=30 --direct=1 --size=2G \ --rw=write --bs=4k --iodepth=1 test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1 fio-3.7 Starting 1 process Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=6352KiB/s][r=0,w=1588 IOPS][eta 00m:00s] test: (groupid=0, jobs=1): err= 0: pid=56544: Mon Sep 14 19:03:36 2020 write: IOPS=1604, BW=6417KiB/s (6571kB/s)(188MiB/30003msec) slat (nsec): min=2240, max=45925, avg=6526.95, stdev=3486.19 clat (usec): min=399, max=35411, avg=615.88, stdev=231.41 lat (usec): min=402, max=35421, avg=622.40, stdev=232.08 clat percentiles (usec): | 1.00th=[ 420], 5.00th=[ 449], 10.00th=[ 469], 20.00th=[ 498], | 30.00th=[ 529], 40.00th=[ 562], 50.00th=[ 611], 60.00th=[ 652], | 70.00th=[ 685], 80.00th=[ 709], 90.00th=[ 766], 95.00th=[ 799], | 99.00th=[ 881], 99.50th=[ 955], 99.90th=[ 2671], 99.95th=[ 3097], | 99.99th=[ 3785] bw ( KiB/s): min= 5944, max= 6792, per=100.00%, avg=6415.95, stdev=178.72, samples=60 iops : min= 1486, max= 1698, avg=1603.93, stdev=44.67, samples=60 lat (usec) : 500=20.82%, 750=67.23%, 1000=11.55% lat (msec) : 2=0.25%, 4=0.14%, 10=0.01%, 20=0.01%, 50=0.01% cpu : usr=1.22%, sys=1.25%, ctx=48143, majf=0, minf=18 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=0,48129,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: bw=6417KiB/s (6571kB/s), 6417KiB/s-6417KiB/s (6571kB/s-6571kB/s), io=188MiB (197MB), run=30003-30003msec Disk stats (read/write): dm-0: ios=31/2, merge=0/0, ticks=342/14, in_queue=356, util=0.12%, aggrios=33/3, aggrmerge=0/0, aggrticks=390/27, aggrin_queue=404, aggrutil=0.13% sda: ios=33/3, merge=0/0, ticks=390/27, in_queue=404, util=0.13% ==================================== Does that make sense? How do you benchmark your Ceph cluster? Appreciate if you could share your experiences here. Thanks! Tony

3 years, 8 months

1
0
0 0

Ceph RBD iSCSI compatibility

by Salsa

I just came across a Suse documentation stating that RBD features are not iSCSI compatible. Since I had 2 cases of image corruption in this scenario in 10 days I'm wondering if my setup is to blame. So question is if it is possible to provide disks to a Windows Server 2019 via iSCSI while using rbd-mirror to backup data to a second cluster? I created all images with all features enabled. Is that compatible? -- Salsa

3 years, 8 months

6
12
0 0

Choosing suitable SSD for Ceph cluster

by Hermann Himmelbauer

Hi, I am running a nice ceph (proxmox 4 / debian-8 / ceph 0.94.3) cluster on 3 nodes (supermicro X8DTT-HIBQF), 2 OSD each (2TB SATA harddisks), interconnected via Infiniband 40. Problem is that the ceph performance is quite bad (approx. 30MiB/s reading, 3-4 MiB/s writing ), so I thought about plugging into each node a PCIe to NVMe/M.2 adapter and install SSD harddisks. The idea is to have a faster ceph storage and also some storage extension. The question is now which SSDs I should use. If I understand it right, not every SSD is suitable for ceph, as is denoted at the links below: https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-i… or here: https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a fast SSD for ceph. As the 950 is not available anymore, I ordered a Samsung 970 1TB for testing, unfortunately, the "EVO" instead of PRO. Before equipping all nodes with these SSDs, I did some tests with "fio" as recommended, e.g. like this: fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test The results are as the following: ----------------------- 1) Samsung 970 EVO NVMe M.2 mit PCIe Adapter Jobs: 1: read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec Jobs: 4: read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec Jobs: 10: read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec ----------------------- So the read speed is impressive, but the write speed is really bad. Therefore I ordered the Samsung 970 PRO (1TB) as it has faster NAND chips (MLC instead of TLC). The results are, however even worse for writing: ----------------------- Samsung 970 PRO NVMe M.2 mit PCIe Adapter Jobs: 1: read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec Jobs: 4: read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec Jobs: 10: read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec ----------------------- I did some research and found out, that the "--sync" flag sets the flag "O_DSYNC" which seems to disable the SSD cache which leads to these horrid write speeds. It seems that this relates to the fact that the write cache is only not disabled for SSDs which implement some kind of battery buffer that guarantees a data flush to the flash in case of a powerloss. However, It seems impossible to find out which SSDs do have this powerloss protection, moreover, these enterprise SSDs are crazy expensive compared to the SSDs above - moreover it's unclear if powerloss protection is even available in the NVMe form factor. So building a 1 or 2 TB cluster seems not really affordable/viable. So, can please anyone give me hints what to do? Is it possible to ensure that the write cache is not disabled in some way (my server is situated in a data center, so there will probably never be loss of power). Or is the link above already outdated as newer ceph releases somehow deal with this problem? Or maybe a later Debian release (10) will handle the O_DSYNC flag differently? Perhaps I should simply invest in faster (and bigger) harddisks and forget the SSD-cluster idea? Thank you in advance for any help, Best Regards, Hermann -- hermann(a)qwer.tk PGP/GPG: 299893C7 (on keyservers)

3 years, 8 months

12
18
0 0

ceph-container: docker restart, mon's unable to join

by Stefan Kooman

Hi, In an attempt to get a (test) Mimic cluster running on Ubuntu 20.04 we are using docker with ceph-container images (ceph/daemon:latest-mimic). Deploying monitors and mgrs works fine. If however a monitor container gets stopped and started (i.e. docker restart) two out of three (with exception of mon initial member) mons won't join the cluster anymore and keep logging the following: /opt/ceph-container/bin/entrypoint.sh: Existing mon, trying to rejoin cluster... If docker is stopped, the mon directory "/var/lib/ceph/mon/$mon-name" removed and docker started again the mon is able to join the cluster. This directory is a persistent volume with correct permissions (167.167). No etcd cluster is in use here. We manually copied the /etc/ceph and /var/lib/ceph directories to the docker hosts. Any hints on how to make a mon container survive a reboot is welcome. Gr. Stefan P.s And yes, we know about Rook, kubernetes, etc. but that's not what want to use now.

3 years, 8 months

1
0
0 0

ceph-osd performance on ram disk

by George Shuklin

I'm creating a benchmark suite for Сeph. During benchmarking of benchmark, I've checked how fast ceph-osd works. I decided to skip all 'SSD mess' and use brd (block ram disk, modprobe brd) as underlying storage. Brd itself can yield up to 2.7Mpps in fio. In single thread mode (iodepth=1) it can yield up to 750k IOPS. LVM over brd gives about 600kIOPS in single-threaded mode with iodepth=1 (16us latency). But, as soon as I put ceph-osd (bluestore) on it, I see something very odd. No matter how much parallel load I push onto this OSD, it never gives more than 30 kIOPS, and I can't understand where bottleneck is. CPU utilization: ~300%. There are 8 cores on my setup, so, CPU is not a bottleneck. Network: I've moved benchmark on the same host as OSD, so it's a localhost. Even counting network, it's still far away from saturation. 30kIOPS (4k) is about 1Gb/s, but I have 10G links. Anyway, tests are run on localhost, so network is irrelevant (I've checked it, traffic is on localhost). Test itself consumes about 70% CPU of one core, so there are plenty left. Replication: I've killed it (size=1, single osd in the pool). single-threaded latency: 200us, 4.8kIOPS. iopdeth=32: 2ms (15kIOPS). iodepth=16,numjobs=8: 5ms (24k IOPS) I'm running fio with 'rados' ioengine, and it looks like putting more workers doesn't change much, so it's not rados ioengine. As there is plenty CPU and IO left, there is only one possible place for bottleneck: some time-consuming single-threaded code in ceph-osd. Are there any knobs to tweak to see higher performance for ceph-osd? I'm pretty sure it's not any kind of leveling, GC or other 'iops-related' issues (brd has performance of two order of magnitude higher).

3 years, 8 months

3
5
0 0

Orchestrator & ceph osd purge

by Robert Sander

Hi, is it correct that when using the orchestrator to deploy and manage a cluster you should not use "ceph osd purge" any more as the orchestrator then is not able to find the OSD for the "ceph orch osd rm" operation? Regards -- Robert Sander Heinlein Support GmbH Schwedter Str. 8/9b, 10119 Berlin http://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin

3 years, 8 months

1
0
0 0

Is it possible to assign osd id numbers?

by Shain Miley

Hello, I have been wondering for quite some time whether or not it is possible to influence the osd.id numbers that are assigned during an install. I have made an attempt to keep our osds in order over the last few years, but it is a losing battle without having some control over the osd assignment. I am currently using ceph-deploy to handle adding nodes to the cluster. Thanks in advance, Shain Shain Miley | Director of Platform and Infrastructure | Digital Media | smiley(a)npr.org

3 years, 8 months

3
4
0 0

Assignment Help

by jack39068＠gmail.com

We are the best in online assignment help and students stuck in their assignment so don’t worry we are available 24/7. Take best help with our high qualified experts.https://www.fullhomeworkhelp.com/pages/assignment/assignment-help-a…

3 years, 8 months

1
0
0 0

Incapable to utilize make include out of Facebook? Call Facebook Customer Service Toll Free Number.

by mary smith

The make include out of the mail is utilized to compose an email however on the off chance that you can't utilize it, at that point you can get the important assistance from the assistance community by utilizing their FAQs. Notwithstanding that, you can likewise call the Facebook Customer Service Toll Free Number to get the necessary help. https://www.customercare-email.com/facebook-customer-service.html

3 years, 8 months

1
0
0 0

Issue in interface causing Epson Error Code 0x97? Get to customer care team for help.

by mary smith

Interface issues can also lead to Epson Error Code 0x97 as the glitch can cause hindrance with smooth functioning. However, the error can be resolved if you use the help that is provided by customer care. All you have to do is to call the team by dialing the tech support number that is available on the internet.https://www.epsonprintersupportpro.net/epson-error-code-0x97/

3 years, 8 months

1
0
0 0

2024

2023

2022

2021

2020

2019

ceph-users September 2020