Hi,
this morning I woke up to a degraded test ceph cluster (managed by rook,
but it does not really change anything for the question I'm about to ask).
After checking the logs I found that BlueStore on one of the OSDs had run
out of space.
Some cluster details:
ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus
(stable)
It runs on 3 small OSDs of 10 GB each.
`ceph osd df` reported a RAW USE of about 4.5 GB on every node, happily
reporting about 5.5 GB of AVAIL.
Yet:
debug -9> 2020-09-22T20:23:15.421+0000 7f29e9798f40 4 rocksdb:
EVENT_LOG_v1 {"time_micros": 1600806195424423, "job": 1, "event":
"recovery_started", "log_files": [347, 350]}
debug -8> 2020-09-22T20:23:15.421+0000 7f29e9798f40 4 rocksdb:
[db/db_impl_open.cc:583] Recovering log #347 mode 0
debug -7> 2020-09-22T20:23:16.465+0000 7f29e9798f40 4 rocksdb:
[db/db_impl_open.cc:583] Recovering log #350 mode 0
debug -6> 2020-09-22T20:23:18.689+0000 7f29e9798f40 1 bluefs _allocate
failed to allocate 0x17a2360 on bdev 1, free 0x390000; fallback to bdev 2
debug -5> 2020-09-22T20:23:18.689+0000 7f29e9798f40 1 bluefs _allocate
unable to allocate 0x17a2360 on bdev 2, free 0xffffffffffffffff; fallback
to slow device expander
debug -4> 2020-09-22T20:23:18.689+0000 7f29e9798f40 -1
bluestore(/var/lib/ceph/osd/ceph-0) allocate_bluefs_freespace failed to
allocate on 0x39a20000 min_size 0x17b0000 > allocated total 0x6250000
bluefs_shared_alloc_size 0x10000 allocated 0x0 available 0x 12ee32000
debug -3> 2020-09-22T20:23:18.689+0000 7f29e9798f40 -1 bluefs _allocate
failed to expand slow device to fit +0x17a2360
debug -2> 2020-09-22T20:23:18.689+0000 7f29e9798f40 -1 bluefs
_flush_range allocated: 0x0 offset: 0x0 length: 0x17a2360
debug -1> 2020-09-22T20:23:18.693+0000 7f29e9798f40 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.4/rpm/el8/BUILD/ceph-15.2.4/src/os/bluestore/BlueFS.cc:
In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t,
uint64_t)' thread 7f29e9798f40 time 2020-09-22T20:23:18.690014+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.4/rpm/el8/BUILD/ceph-15.2.4/src/os/bluestore/BlueFS.cc:
2696: ceph_abort_msg("bluefs enospc")
So, my question would be: how could I have prevented that? From the
monitoring I have (Prometheus), the OSDs looked healthy and appeared to
have plenty of space, yet clearly they did not.
What command (and which Prometheus metric) would help me understand the
actual BlueStore usage? Or am I missing something?
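For illustration, the kind of check I was hoping exists - a sketch,
assuming the BlueFS perf counters and the mgr prometheus module expose
what I think they do:
  # per-OSD BlueFS/DB usage from the admin socket (run on the OSD host):
  ceph daemon osd.0 perf dump bluefs
  # fields of interest: db_total_bytes, db_used_bytes, slow_used_bytes
  # overall per-OSD allocation:
  ceph osd df tree
  # and, if these counters are exported, a Prometheus alert along the lines of:
  #   ceph_bluefs_db_used_bytes / ceph_bluefs_db_total_bytes > 0.9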
Oh, and I "fixed" the cluster by expanding the broken osd.0 onto a larger
15 GB volume. The 2 other OSDs still run on 10 GB volumes.
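The manual equivalent of that expansion would, as far as I understand, be
roughly the following with the OSD stopped - rook handled most of it for
me, so treat this as a sketch:
  # after enlarging the underlying block device / LV:
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0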
Thanks in advance for any thoughts.
--
With best regards, Ivan Kurnosov
Dear all,
maybe someone has experienced this before. We are setting up a SAMBA gateway and would like to use the vfs_ceph module. In case of several file systems one needs to choose an mds namespace. There is an option in ceph.conf:
client mds namespace = CEPH-FS-NAME
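To be concrete, the variants I tried look roughly like this (a sketch;
the exact spelling of the option is part of what I am unsure about):
  # /etc/ceph/ceph.conf on the SAMBA gateway
  [client]
      client mds namespace = CEPH-FS-NAME
      # also tried: client_mds_namespace = ..., mds namespace = ...,
      # and the same lines under [global]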
Unfortunately, it seems not to work. I tried it in all possible versions, in [global] and [client], with and without "client" at the beginning, to no avail. I either get a time out or an error. I also found the libcephfs function
ceph_select_filesystem(cmount, CEPH-FS-NAME)
and added it to vfs_ceph.c just before the ceph_mount() call, with the same result: I get an error (operation not permitted). Does anyone know how to get this to work? And, yes, I tested an ordinary kernel fs mount with the credentials for the ceph client without problems.
I can't access any documentation on the libcephfs API; I always get a page-not-found error.
My last resort is now to run
ceph fs set-default CEPH-FS-NAME
on the fs to be used and live with the implied restrictions and ugliness.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Good morning,
you might have seen my previous mails; I wanted to share some findings
from the last day+night about what happened here and why it happened.
As the system behaved inexplicably for us, we are now looking for
someone to analyse the root cause on a consultancy basis - if you are
interested & available, please drop me a PM.
What happened
-------------
Yesterday at about 1520 we moved 4 SSDs from server15 to server8. These
SSDs back 3 SSD-only pools. Soon after that the cluster began to exhibit
slow I/O on clients and we saw many PGs in the peering state.
Additionally we began to see slow ops on unrelated OSDs, i.e. OSDs that
have the device class "hdd-big" set and are only selected by the "hdd"
pool. Excerpt from ceph -s:
167 slow ops, oldest one blocked for 958 sec, daemons [osd.0,osd.11,osd.12,osd.14,osd.15,osd.
16,osd.19,osd.2,osd.22,osd.25]... have slow ops.
All PGs that were hanging in the peering state seemed to belong to pools
backed by the moved SSDs. However, the question remains why other OSDs
began to be affected.
Then we noticed that the newly started OSDs on server8 were consuming
100% CPU, the associated disks had a queue depth of 1 and the disks were
80-99% busy. This is a change we first noticed some weeks ago in the
Nautilus OSD behaviour: while in Luminous OSDs would start right away,
in Nautilus they often take seconds to minutes to start, accompanied by
high CPU usage. Before yesterday, however, the maximum startup time was
in the range of minutes.
When we noticed blocked/slow IOPS, we wanted to ensure that the clients
were affected as little as possible. We tried to submit these two changes
to all OSDs:
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-max-backfills 1'
However at this time, ceph tell would hang. We then switched to address
each OSD individually using a for loop:
for osd in $(ceph osd tree | grep osd. | awk '{ print $4 }'); do echo $osd; ( ceph tell $osd injectargs '--osd-max-backfills 1' &) ; done
However this resulted in many hanging ceph tell processes and some of
them began to report the following error:
2020-09-22 15:59:25.407 7fbc0359e700 0 --1- [2a0a:e5c0:2:1:20d:b9ff:fe48:3bd4]:0/1280010984 >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6856/4078 conn(0x7fbbec02edc0 0x7fbbec02f5d0 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
This message would be repeated many times a second per ceph tell process
until we SIGINT killed it.
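In hindsight we wonder whether the mon-side config database would have
been a more robust path than per-OSD injectargs; a sketch of what we
would try next time (untested in this exact situation):
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1
  ceph config get osd osd_max_backfills   # verify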
After this failed we noticed that the remaining HDDs in server15 were at
100% I/O utilisation with a reading rate of only about 0.6-2 MB/s. These
low rates would explain the slow I/O on the clients very well - but it
made us wonder why the rates were suddenly so low. We restarted one of
the OSDs on this server, but the effect was that the OSD would now use
100% CPU in addition to fully utilising the disk.
We began to suspect a hardware problem and moved all HDDs to
server8. Unfortunately, all newly started HDD OSDs behaved exactly the
same: very high CPU usage (60-100%), very high disk utilisation (about
90%+) and very low transfer rates (~1 MB/s on average).
After a few hours (!) some of the OSDs normalised to 20-50 MB/s read
rates, and about 8h later all of them had normalised.
What is curious is that during the slow phase of many hours, the average
queue depth of the disks consistently stayed at 1, while afterwards it
usually averages around 10-15. Given the full utilisation, we suspect
that the Nautilus OSDs cause heavy disk seeking.
Our expectation (and so far our experience with Luminous) was that if we
have a single host failure, setting max backfills and max recovery to 1
and keeping osd_recovery_op_priority low (usually at 2 in our clusters)
has some, but only a minor, impact on client I/O.
This time, purely moving 4 OSDs with existing data made the cluster
practically unusable, something we never experienced in a Luminous setup
before. While I do not want to rule out potential mistakes or incorrect
designs on our side, the effects we saw are not what we expected.
While all of this was happening we measured the network bandwidth on all
nodes and found no congestion, and we monitored the monitors, which
showed some CPU usage (spikes up to 50% of one core) but nothing high.
I was wondering whether anyone on the list has an insight on these
questions:
- Why were OSDs slow that are used in a different pool?
- What changed in the OSDs from Luminous to Nautilus to make the startup
phase slow?
- What are nautilus OSDs doing at startup and why do they consume a lot
of CPU?
- What are the OSDs doing to severely impact the I/O? Is our seek theory correct?
- Why does / can ceph tell hang in Nautilus? We never experienced this
problem in Luminous.
- Where does the BADAUTHORIZER message come from and what is the fix for it?
- How can we debug / ensure that ceph tell does not hang?
- Did code in the osds change that de-prioritizes command traffic?
Best regards,
Nico
--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
Hi,
we're considering running KVM virtual machine images on Ceph RBD block
devices. How does Ceph RBD perform with the synchronous writes of
databases (MariaDB)?
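To make the question concrete, the workload we care about is roughly
what this fio run models inside a guest whose disk is an RBD-backed
virtio device (a sketch; file name and sizes are placeholders):
  fio --name=db-sync-write --filename=/var/tmp/fio.test --size=1G \
      --rw=write --bs=4k --iodepth=1 --numjobs=1 \
      --fdatasync=1 --runtime=60 --time_based
i.e. one fdatasync per 4 KiB write, similar to a database redo log.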
Best regards,
Renne
Hi!
After almost a year of development in my spare time I present my own software-defined block storage system: Vitastor - https://vitastor.io
I designed it similar to Ceph in many ways, it also has Pools, PGs, OSDs, different coding schemes, rebalancing and so on. However it's much simpler and much faster. In a test cluster with SATA SSDs it achieved Q1T1 latency of 0.14ms which is especially great compared to Ceph RBD's 1ms for writes and 0.57ms for reads. In an "iops saturation" parallel load benchmark it reached 895k read / 162k write iops, compared to Ceph's 480k / 100k on the same hardware, but the most interesting part was CPU usage: Ceph OSDs were using 40 CPU cores out of 64 on each node and Vitastor was only using 4.
Of course it's an early pre-release which means that, for example, it lacks snapshot support and other useful features. However the base is finished - it works and runs QEMU VMs. I like the design and I plan to develop it further.
There are more details in the README file which currently opens from the domain https://vitastor.io
Sorry if it was a bit off-topic, I just thought it could be interesting for you :)
--
With best regards,
Vitaliy Filippov
Hi,
We are running a Nautilus cluster and have some old and new erasure
code profiles. For example:
# ceph osd erasure-code-profile get m_erasure
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8
# ceph osd erasure-code-profile get c_erasure
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
Here we want to add crush-device-class information to the second profile.
Is the following correct? Safe? What will actually happen if we run this
command?
# ceph osd erasure-code-profile set c_erasure crush-device-class=hdd k=4 m=2 --force
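For context, the alternative we have considered is to leave c_erasure
alone, create a new profile and check which profile/rule the pool
actually uses - a sketch (pool, profile and rule names are placeholders):
  ceph osd erasure-code-profile set c_erasure_hdd \
      k=4 m=2 crush-device-class=hdd crush-failure-domain=host
  ceph osd pool get <poolname> erasure_code_profile
  ceph osd pool get <poolname> crush_rule
  ceph osd crush rule dump <rulename>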
Thanks for any input!
Regards,
Thomas Svedberg
Hi all,
during the migration of documentation, would it be possible to make the old documentation available somehow? A lot of pages are broken and I can't access the documentation for mimic at all any more.
Is there an archive or something similar?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
I was wondering if switching to ceph-volume requires me to change the
default CentOS lvm.conf? E.g. the default has issue_discards = 0.
Also I wonder if trimming is enabled by default on LVs on SSDs? I read
somewhere that the dmcrypt passthrough of trimming was still secure in
combination with a btrfs filesystem.
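For reference, the setting I mean lives in lvm.conf roughly like this
(from memory; please check your distribution's defaults):
  # /etc/lvm/lvm.conf
  devices {
      # send discards to the PVs when an LV is removed or shrunk;
      # as far as I understand this does not affect runtime fstrim
      issue_discards = 1
  }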
Hi Wout,
None of the OSDs are greater than 20% full. However, only 1 PG is
backfilling at a time, while the others are in backfill_wait. I had
recently added a large amount of data to the Ceph cluster, and this
may have caused the number of PGs to increase, creating the need to
rebalance or move objects.
It appears that I could increase the number of backfill operations that
happen simultaneously by increasing `osd_max_backfills` and/or
`osd_recovery_max_active`. It looks like I should consider increasing
the number of backfills happening at a time, because the overall I/O
during the backfill is pretty small.
Does this seem reasonable? If so, with Ceph Octopus/cephadm, how can I
adjust these parameters?
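A minimal sketch of what I imagine the adjustment would look like,
assuming `ceph config set` is the right mechanism under cephadm (run from
within `cephadm shell` or any host with the admin keyring; the values are
just examples):
  ceph config set osd osd_max_backfills 2
  ceph config set osd osd_recovery_max_active 2
  ceph config get osd osd_max_backfills   # verify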
Thanks,
Matt
On Mon, Sep 21, 2020 at 2:21 PM Wout van Heeswijk <wout(a)42on.com> wrote:
>
> Hi Matt,
>
> The mon data can grow while PGs are stuck unclean. Don't restart the mons.
>
> You need to find out why your placement groups are "backfill_wait". Likely some of your OSDs are (near)full.
>
> If you have space elsewhere you can use the ceph balancer module or reweighting of OSDs to rebalance data.
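>
> A sketch of what that could look like (assuming the upmap balancer is
> available in your release; commands from memory, please verify):
>
>   # automatic balancing in upmap mode
>   ceph balancer mode upmap
>   ceph balancer on
>   ceph balancer status
>   # or adjust utilisation manually instead
>   ceph osd reweight-by-utilization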
>
> Scrubbing will continue once the PGs are "active+clean"
>
> Kind regards,
>
> Wout
> 42on
>
> ________________________________________
> From: Matt Larson <larsonmattr(a)gmail.com>
> Sent: Monday, September 21, 2020 6:22 PM
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Troubleshooting stuck unclean PGs?
>
> Hi,
>
> Our Ceph cluster is reporting several PGs that have not been scrubbed
> or deep scrubbed in time. It has been over a week since these PGs were
> scrubbed. When I checked `ceph health detail`, there are 29 pgs
> not deep-scrubbed in time and 22 pgs not scrubbed in time. I tried to
> manually start a scrub on the PGs, but it appears that they are
> actually in an unclean state that needs to be resolved first.
>
> This is a cluster running:
> ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)
>
> Following the information at [Troubleshooting
> PGs](https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-…,
> I checked for PGs that are stuck stale | inactive | unclean. There
> were no PGs that are stale or inactive, but there are several that are
> stuck unclean:
>
> ```
> PG_STAT  STATE                          UP                                 UP_PRIMARY  ACTING                             ACTING_PRIMARY
> 8.3c     active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]     124         [139,57,16,125,154,65,109,86,45]   139
> 8.3e     active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]    108         [127,92,24,50,33,6,130,66,149]     127
> 8.3f     active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]      19          [90,45,147,4,105,61,30,66,125]     90
> 8.40     active+remapped+backfill_wait  [19,131,80,76,42,101,61,3,144]     19          [28,106,132,3,151,36,65,60,83]     28
> 8.3a     active+remapped+backfilling    [32,72,151,30,103,131,62,84,120]   32          [91,60,7,133,101,117,78,20,158]    91
> 8.7e     active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]    108         [127,92,24,50,33,6,130,66,149]     127
> 8.3b     active+remapped+backfill_wait  [34,113,148,63,18,95,70,129,13]    34          [66,17,132,90,14,52,101,47,115]    66
> 8.7f     active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]      19          [90,45,147,4,105,61,30,66,125]     90
> 8.78     active+remapped+backfill_wait  [96,113,159,63,29,133,73,8,89]     96          [138,121,15,103,55,41,146,69,18]   138
> 8.7d     active+remapped+backfilling    [0,90,60,124,159,19,71,101,135]    0           [150,72,124,129,63,10,94,29,41]    150
> 8.7c     active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]     124         [139,57,16,125,154,65,109,86,45]   139
> 8.79     active+remapped+backfill_wait  [59,15,41,82,131,20,73,156,113]    59          [13,51,120,102,29,149,42,79,132]   13
> ```
>
> If I query one of the PGs that is backfilling, 8.3a, it shows its state as:
> "recovery_state": [
> {
> "name": "Started/Primary/Active",
> "enter_time": "2020-09-19T20:45:44.027759+0000",
> "might_have_unfound": [],
> "recovery_progress": {
> "backfill_targets": [
> "30(3)",
> "32(0)",
> "62(6)",
> "72(1)",
> "84(7)",
> "103(4)",
> "120(8)",
> "131(5)",
> "151(2)"
> ],
>
> Q1: Is there anything that I should check/fix to enable the PGs to
> resolve from the `unclean` state?
> Q2: I have also seen that the podman containers on one of our OSD
> servers are taking large amounts of disk space. Is there a way to
> limit the growth of disk space for podman containers, when
> administering a Ceph cluster using `cephadm` tools? At last check, a
> server running 16 OSDs and 1 MON is using 39G of disk space for its
> running containers. Can restarting containers help to start with a
> fresh slate or reduce the disk use?
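>
> The kind of check I have in mind, assuming podman's own accounting is
> what matters here:
>   podman system df        # where the space goes: images/containers/volumes
>   podman image prune      # drop unused images, e.g. old ceph images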
>
> Thanks,
> Matt
>
> ------------------------
>
> Matt Larson
> Associate Scientist
> Computer Scientist/System Administrator
> UW-Madison Cryo-EM Research Center
> 433 Babcock Drive, Madison, WI 53706
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
Matt Larson, PhD
Madison, WI 53705 U.S.A.