Hi all,
I'm in the process of testing out Ceph on a small cluster. The cluster is virtually empty with no clients, just a few OSDs. I noticed extensive disk write I/O on some nodes and tracked it down to the ceph-mon daemon. A log taken over one hour shows that each monitor generates about 22 GB of disk write I/O, which is more than half a terabyte a day. The log was generated using iotop. All three monitors in the cluster seem to be doing the same thing.
Is this normal or something I should be taking a closer look at?
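For reference, the log came from a batch-mode iotop run roughly like this (the exact flags and interval here are illustrative):

# -b batch mode, -o only active processes, -a accumulated totals,
# -t timestamps, -d 60 one sample per minute
iotop -boat -d 60 | grep ceph-mon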
Cheers,
--Tri Hoang
Hey all,
I'm trying to figure out the appropriate process for adding a separate SSD block.db to an existing OSD. From what I gather, the two steps are:
1. Add the new db device with ceph-bluestore-tool bluefs-bdev-new-db
2. Migrate the data with ceph-bluestore-tool bluefs-bdev-migrate
I followed this and both commands executed fine, without any error. Yet when the OSD started up, it kept using the integrated block.db instead of the new db device, and the block.db link to the new db device had been deleted. Again, no error, it just doesn't use the new db.
Any suggestion? Thanks.
--Tri Hoang
root@elmo:/# CEPH_ARGS="--bluestore_block_db_size=26843545600 --bluestore_block_db_create=true" ceph-bluestore-tool --path /mnt/ceph/c258000c-f3e4-11ea-9ebe-c3c75e8e9028/osd.2 bluefs-bdev-new-db --dev-target /dev/vg/sdc.db
inferring bluefs devices from bluestore path
DB device added /dev/dm-8
root@elmo:/# ceph-bluestore-tool --path /mnt/ceph/c258000c-f3e4-11ea-9ebe-c3c75e8e9028/osd.2 --devs-source /mnt/ceph/c258000c-f3e4-11ea-9ebe-c3c75e8e9028/osd.2/block --dev-target /mnt/ceph/c258000c-f3e4-11ea-9ebe-c3c75e8e9028/osd.2/block.db bluefs-bdev-migrate
inferring bluefs devices from bluestore path
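A sanity check at this point, using the paths from the transcript above (the LVM tag step at the end is an assumption; it only applies if ceph-volume manages the OSD, since activation rebuilds the tmpfs OSD dir and its symlinks from LV tags):

# confirm the new device carries a bluefs db label
ceph-bluestore-tool show-label --dev /dev/vg/sdc.db
# confirm bluefs now sees both block and db devices
ceph-bluestore-tool --path /mnt/ceph/c258000c-f3e4-11ea-9ebe-c3c75e8e9028/osd.2 bluefs-bdev-sizes
# hypothetical: if the OSD is activated via ceph-volume, retag the LVs so
# the block.db symlink is recreated on activation, e.g.
# lvchange --addtag ceph.db_device=/dev/vg/sdc.db <path of the block LV>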
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/dri…
On 19/09/2020 10:31, huxiaoyu(a)horebdata.cn wrote:
> Dear Maged,
>
> Thanks a lot for the detailed explanation on dm-writecache with Ceph.
>
> You mentioned a REQ_FUA support patch for dm-writecache; is such a
> patch not included in the recent dm-writecache source code? I am using
> 4.4 and 4.15/4.19 kernels; where do I get the mentioned patch?
>
> best regards,
>
> Samuel
>
>
>
> ------------------------------------------------------------------------
> huxiaoyu(a)horebdata.cn
>
> *From:* Maged Mokhtar <mailto:mmokhtar@petasan.org>
> *Date:* 2020-09-18 18:20
> *To:* vitalif <mailto:vitalif@yourcmc.ru>; huxiaoyu
> <mailto:huxiaoyu@horebdata.cn>; ceph-users <mailto:ceph-users@ceph.io>
> *Subject:* Re: [ceph-users] Re: Benchmark WAL/DB on SSD and HDD
> for RGW RBD CephFS
> dm-writecache works using high and low watermarks, set at 50% and
> 45% respectively. All writes land in cache; once the cache fills to
> the high watermark, backfilling to the slow device starts, and it
> stops when reaching the low watermark. Backfilling uses a b-tree with
> LRU blocks and tries to merge blocks to reduce hdd seeks; this is
> further helped by the io scheduler (cfq/deadline) ordering.
> Each sync write op to the device requires 2 sync write ops, one for
> data and one for metadata. Metadata is always in ram, so there is no
> additional metadata read op (at the expense of using 2.5% of your
> cache partition size in ram). So pure sync writes (those with REQ_FUA
> or REQ_FLUSH, which is what Ceph uses) get half the SSD iops
> performance at the device level.
> Now the question of what sustained performance you would get during
> backfilling: it totally depends on whether your workload is
> sequential or random. For pure sequential workloads, all blocks are
> merged, so there will be no drop in input iops and backfilling occurs
> in small step-like intervals; but for such workloads you could get
> good performance even without a cache. For purely random writes you
> should theoretically drop to the hdd random iops speed (ie 80-150
> iops), but in our testing with fio pure random writes we would get
> 400-450 sustained iops; this is probably related to the
> non-random-ness of fio rather than any magic. For real-life workloads
> that mix both, this is where the real benefit of the cache will be
> felt. However, it is not easy to simulate such workloads; fio does
> offer a zipf/theta random distribution control, but it was difficult
> for us to simulate real-life workloads with it. We did some manual
> workloads such as installing and copying multiple vms, and we found
> the cache improved the time to complete by 3-4 times.
> dm-writecache does serve reads if the data is in cache; however, the
> OSD cache helps for reads as well, as does any client read-ahead, and
> in general it is writes that are the performance issue with hdd in
> Ceph.
> For bcache, the only configuration we did was to enable writeback
> mode; we did not set the block size to 4k.
> If you want to try dm-writecache, use a recent 5.4+ kernel or a
> kernel with the REQ_FUA support patch we did. You would need a recent
> lvm tools package to support dm-writecache. We also limit the number
> of backfill blocks in flight to 100k blocks, ie 400 MB.
> /Maged
> On 18/09/2020 13:38, vitalif(a)yourcmc.ru wrote:
> >> we did test dm-cache, bcache and dm-writecache, we found the
> >> latter to be much better.
> > Did you set the bcache block size to 4096 during your tests?
> > Without this setting it's slow, because 99.9% of SSDs don't handle
> > 512-byte overwrites well. Otherwise I don't think bcache should be
> > worse than dm-writecache. Also, dm-writecache only caches writes,
> > while bcache also caches reads. And lvmcache is trash because it
> > only writes to the SSD when the block is already on the SSD.
> >
> > Please post some details about the comparison if you have them :)
>
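For anyone wanting to try the dm-writecache setup described above, a minimal sketch via LVM (VG/LV names, sizes and devices are illustrative; the watermarks and the 100k-block writeback limit mirror the numbers Maged quotes):

# carve a cache LV out of the SSD, in the same VG as the OSD's slow LV
lvcreate -n osd0-cache -L 100G vg-osd0 /dev/nvme0n1
# attach it as a writecache; writeback_jobs=100000 caps in-flight
# backfill at roughly 400 MB of 4k blocks
lvconvert --type writecache --cachevol osd0-cache \
  --cachesettings 'high_watermark=50 low_watermark=45 writeback_jobs=100000' \
  vg-osd0/osd0-data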
Good day everybody,
We recently upgraded from Nautilus to Octopus and tried to adopt our
cluster into cephadm, following the guide in the documentation[0].
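The per-daemon step from that guide, roughly (daemon names are illustrative):

cephadm adopt --style legacy --name mon.<hostname>
cephadm adopt --style legacy --name osd.0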
At first we only experienced some seemingly minor hiccups where the
cluster failed to refresh because it could not pull the container
image[1]. This was fixed by setting `container_image` to a location
where the image can be found in our network.
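i.e. something along these lines (the registry path is a placeholder):

ceph config set global container_image <our-registry>/ceph/ceph:v15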
But when we then ran `ceph orch ps`, it returned an empty list. The one
OSD we adopted after fixing the container_image variable did appear
fine in `ceph orch ps`.
However, all manager and monitor daemons are now in a limbo state where
`cephadm ls` reports them with style "cephadm:v1"[2] but the
orchestrator considers them stray[3].
Functionally, those daemons work fine and cooperate in the cluster.
Can you help us get those daemons under the orchestrator's control?
Kind regards,
Julian Fölsch
[0] https://docs.ceph.com/en/octopus/cephadm/adoption/
[1] https://agdsn.me/~paktosan/ceph/adoption/image_failure.txt
[2] https://agdsn.me/~paktosan/ceph/adoption/cephadm_ls.txt
[3] https://agdsn.me/~paktosan/ceph/adoption/stray_daemons.txt
--
Julian Fölsch
Arbeitsgemeinschaft Dresdner Studentennetz (AG DSN)
Telefon: +49 351 271816 69
Mobil: +49 152 22915871
Fax: +49 351 46469685
Email: julian.foelsch(a)agdsn.de
Studierendenrat der TU Dresden
Helmholtzstr. 10
01069 Dresden
[@]# ceph-volume lvm activate 36 82b94115-4dfb-4ed0-8801-def59a432b0a
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-36
Running command: /usr/bin/ceph-authtool
/var/lib/ceph/osd/ceph-36/lockbox.keyring --create-keyring --name
client.osd-lockbox.82b94115-4dfb-4ed0-8801-def59a432b0a --add-key
AQBxA2Zfj6avOBAAIIHqNNY2J22EnOZV+dNzFQ==
stdout: creating /var/lib/ceph/osd/ceph-36/lockbox.keyring
added entity client.osd-lockbox.82b94115-4dfb-4ed0-8801-def59a432b0a
auth(key=AQBxA2Zfj6avOBAAIIHqNNY2J22EnOZV+dNzFQ==)
Running command: /usr/bin/chown -R ceph:ceph
/var/lib/ceph/osd/ceph-36/lockbox.keyring
Running command: /usr/bin/ceph --cluster ceph --name
client.osd-lockbox.82b94115-4dfb-4ed0-8801-def59a432b0a --keyring
/var/lib/ceph/osd/ceph-36/lockbox.keyring config-key get
dm-crypt/osd/82b94115-4dfb-4ed0-8801-def59a432b0a/luks
Running command: /usr/sbin/cryptsetup --key-file - --allow-discards
luksOpen
/dev/ceph-9263e83b-7660-4f5b-843a-2111e882a17e/osd-block-82b94115-4dfb-4ed0-8801-def59a432b0a I8MyTZ-RQjx-gGmd-XSRw-kfa1-L60n-fgQpCb
stderr: Device I8MyTZ-RQjx-gGmd-XSRw-kfa1-L60n-fgQpCb already exists.
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-36
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph
prime-osd-dir --dev /dev/mapper/I8MyTZ-RQjx-gGmd-XSRw-kfa1-L60n-fgQpCb
--path /var/lib/ceph/osd/ceph-36 --no-mon-config
stderr: failed to read label for
/dev/mapper/I8MyTZ-RQjx-gGmd-XSRw-kfa1-L60n-fgQpCb: (2) No such file or
directory
--> RuntimeError: command returned non-zero exit status: 1
dmsetup ls does list this device, though?
Where is the option to set the weight? As far as I can see you can only
set it after peering has started?
How can I mount this tmpfs manually to inspect it? Maybe this could be
added to the manual[1]?
[1]
https://docs.ceph.com/en/latest/ceph-volume/lvm/activate/
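One thing that might be worth trying before the next activation attempt (assuming nothing else is holding the stale mapping; this is a guess based on the "Device ... already exists" error above):

# drop the leftover dm-crypt mapping, then re-run ceph-volume lvm activate
cryptsetup luksClose I8MyTZ-RQjx-gGmd-XSRw-kfa1-L60n-fgQpCb
# or: dmsetup remove I8MyTZ-RQjx-gGmd-XSRw-kfa1-L60n-fgQpCb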
Hi Samuel,
I resent my reply since CC is not accepted by this mailing list.
I'm using Octopus. If it is a release problem, why is it successful with
only ms_async_rdma_device_name and ms_async_rdma_gid_idx?
Best regards.
On Sat, Sep 19, 2020 at 3:33 PM huxiaoyu(a)horebdata.cn <huxiaoyu(a)horebdata.cn>
wrote:
> Which Ceph version are you using? Just wondering whether Ceph RDMA
> support is officially announced, or still in development?
>
> best regards,
>
> samuel
>
> ------------------------------
> huxiaoyu(a)horebdata.cn
>
>
> *From:* Lazuardi Nasution <mrxlazuardin(a)gmail.com>
> *Date:* 2020-09-18 19:21
> *To:* ceph-users <ceph-users(a)ceph.io>
> *Subject:* [ceph-users] Ceph RDMA GID Selection Problem
> Hi,
>
> I have something weird with GID selection for Ceph over RDMA. When I
> configure ms_async_rdma_device_name and ms_async_rdma_gid_idx, Ceph
> with RDMA runs successfully. But when I configure
> ms_async_rdma_device_name, ms_async_rdma_local_gid and
> ms_async_rdma_roce_ver, Ceph with RDMA does not work: OSDs go down
> seconds after they come up. The GID index used in the first attempt
> corresponds to the GID and RoCE version used in the second attempt.
> Is this a string-matching problem (maybe because the GID contains
> colon characters) or something else? Using the GID index sometimes
> gives me trouble because it does not persist; it changes every time I
> reconfigure the network (for example, adding/removing a VLAN), or
> even reboot.
>
> Best regards,
>
>
Hi,
I have something weird with GID selection for Ceph over RDMA. When I
configure ms_async_rdma_device_name and ms_async_rdma_gid_idx, Ceph
with RDMA runs successfully. But when I configure
ms_async_rdma_device_name, ms_async_rdma_local_gid and
ms_async_rdma_roce_ver, Ceph with RDMA does not work: OSDs go down
seconds after they come up. The GID index used in the first attempt
corresponds to the GID and RoCE version used in the second attempt. Is
this a string-matching problem (maybe because the GID contains colon
characters) or something else? Using the GID index sometimes gives me
trouble because it does not persist; it changes every time I
reconfigure the network (for example, adding/removing a VLAN), or even
reboot.
Best regards,
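For concreteness, the two variants as ceph.conf snippets (the device name and values are illustrative; the option names are the ones above):

[global]
# variant 1: works
ms_async_rdma_device_name = mlx5_0
ms_async_rdma_gid_idx = 3

# variant 2: OSDs go down seconds after coming up
#ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:0a00:0101
#ms_async_rdma_roce_ver = 2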
> we did test dm-cache, bcache and dm-writecache, we found the latter
> to be much better.
Did you set the bcache block size to 4096 during your tests? Without this setting it's slow, because 99.9% of SSDs don't handle 512-byte overwrites well. Otherwise I don't think bcache should be worse than dm-writecache. Also, dm-writecache only caches writes, while bcache also caches reads. And lvmcache is trash because it only writes to the SSD when the block is already on the SSD.
Please post some details about the comparison if you have them :)
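For bcache the block size is fixed when the devices are formatted, so it has to be set up front; a minimal sketch (device paths are illustrative):

# format cache and backing device with a 4k block so the SSD never sees
# 512-byte overwrites
make-bcache --block 4k -C /dev/nvme0n1p1 -B /dev/sdb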
I still have ceph-disk-created OSDs in Nautilus and thought about
moving to ceph-volume, but it looks like the manual for replacing
ceph-disk[1] is incomplete. I already get this error:
RuntimeError: Unable check if OSD id exists:
[1]
https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#rados-repl…
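For what it's worth, the replacement flow that page describes boils down to something like this (OSD id and device are illustrative):

# keep the OSD id: destroy the old OSD, wipe the disk, re-prepare with
# the same id
ceph osd destroy 36 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm prepare --osd-id 36 --data /dev/sdX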