December 2019 - Dev - lists.ceph.io

ceph-volume simple disk scenario without LVM for OSD on PVC

by Sebastien Han

Hi,

I've started working on a saner way to deploy OSDs with Rook so that they don't use the rook binary image.

Why were/are we using the rook binary to activate the OSD?

A bit of background on containers first: when executing a container, we need to provide a command entrypoint that will act as PID 1. So if you want to do pre/post actions around the process, you need to use a wrapper. In Rook, that's the rook binary, which has a CLI and can then "activate" an OSD. Currently, this "rook osd activate" call does the following:

* sed the lvm.conf
* run c-v lvm activate
* run the osd process

On shutdown, we intercept the signal, "kill -9" the osd and de-activate the LV.

I have a patch here: https://github.com/rook/rook/pull/4386, that solves the initial bullet points, but one thing we cannot do is the signal catching and the LV de-activation. Before you ask, Kubernetes has pre/post hooks but they are not reliable; it's known and documented that there is no guarantee they will actually run before or after the container starts/stops. We tried and we had issues.

Why do we want to stop using the rook binary for activation?

Because each time we get a new binary version (new operator version), this will restart all the OSDs, even if the deployment spec didn't change, or at least if nothing other than the rook image version changed.

Also, with containers, we have seen so many issues working with LVM. Just to name a few:

* adapt lvm filters
* interactions with udev - need to tune the lvm config, even c-v itself has an lvm flag to not sync with udev built in
* several bindmounts
* the lvm package must be present on the host even if running in containers
* SELinux: yes, lvm calls SELinux commands under the hood and pollutes the logs in some scenarios

Currently, one of the ways I can see this working is by not using LVM when bootstrapping OSDs. Unfortunately, some of the logic cannot go in the OSD code since the LV de-activation happens after the OSD stops. We need to de-activate the LV so that, when running in the Cloud, the block device can safely be re-attached to a new machine without LVM issues.

I know this will be a bit challenging and might ultimately look like ceph-disk, but it'd be nice to consider it. What about a small prototype for Bluestore with block/db/wal on the same disk?

If this gets rejected, I might try a prototype for not using c-v in Rook, or something else that might come up in this discussion.

Thanks!
–––––––––
Sébastien Han
Senior Principal Software Engineer, Storage Architect
"Always give 100%. Unless you're giving blood."


osdmaps not trimmed until ceph-mon's restarted (if cluster has a down osd)

by Dan van der Ster

Hi Joao,

I might have found the reason why several of our clusters (and maybe Bryan's too) are getting stuck not trimming osdmaps. It seems that when an osd fails, the min_last_epoch_clean gets stuck forever (even long after HEALTH_OK), until the ceph-mons are restarted.

I've updated the ticket: https://tracker.ceph.com/issues/41154

Cheers, Dan
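
To make the symptom concrete, here is a toy model of how one stale report could pin a minimum. It assumes, for illustration only, that the trim floor is the minimum of the last-clean epochs recorded per OSD and that a failed OSD's entry is never dropped. The function name and the dictionary are hypothetical and are not the ceph-mon implementation; the ticket above has the actual analysis.

```python
def trim_floor(last_epoch_clean_by_osd):
    """Return the oldest epoch that must be kept: the minimum over all recorded reports."""
    return min(last_epoch_clean_by_osd.values())


# Three OSDs have reported epoch 100 as clean; then osd.2 fails.
reports = {0: 100, 1: 100, 2: 100}

# The surviving OSDs keep reporting newer clean epochs; osd.2 reports nothing.
reports[0] = 5000
reports[1] = 5000
stuck_floor = trim_floor(reports)  # still pinned by the stale osd.2 entry

# Dropping the stale entry (what a restart would amount to in this toy model).
del reports[2]
freed_floor = trim_floor(reports)
```

In this model the floor stays at the failed OSD's last report no matter how far the healthy OSDs advance, which matches the "stuck even long after HEALTH_OK" behaviour described above.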


14.2.5 QE Nautilus validation status

by Yuri Weinstein

(This is an early update: some tests are still running, as we are trying to release this point release next week before the US holidays and to have more time to review results.)

Details of this release are summarized here: https://tracker.ceph.com/issues/42839#note-3

rados - approved by Neha
rgw - approved by Casey
rbd - need approval Jason
krbd - need approval Jason, Ilya
fs - need approval Patrick, Ramana
kcephfs - need approval Patrick, Ramana
multimds - need approval Patrick, Ramana
ceph-deploy - FAILED Sage, Alfredo ?
ceph-disk - N/A
upgrade/client-upgrade-hammer (nautilus) - N/A
upgrade/client-upgrade-jewel (nautilus) - PASSED
upgrade/client-upgrade-mimic (nautilus) - FAILED
upgrade/luminous-p2p - in progress
powercycle - in progress
ceph-ansible - Brad is fixing
upgrade/luminous-x (nautilus) - in progress
upgrade/mimic-x (nautilus) - in progress
ceph-volume - Jan fixing

(please speak up if something is missing)

Thx
YuriW


Qemu RBD image usage

by Liu, Changcheng

Hi all,

I want to attach another RBD image to the Qemu VM to be used as a disk. However, it always fails. The VM definition XML is attached. Could anyone tell me what I did wrong?

|| nstcc3@nstcloudcc3:~$ sudo virsh start ubuntu_18_04_mysql --console
|| error: Failed to start domain ubuntu_18_04_mysql
|| error: internal error: process exited while connecting to monitor:
|| 2019-12-09T16:24:30.284454Z qemu-system-x86_64: -drive
|| file=rbd:rwl_mysql/mysql_image:auth_supported=none:mon_host=nstcloudcc4\:6789,format=raw,if=none,id=drive-virtio-disk1:
|| error connecting: Operation not supported

The cluster info is below:

|| ceph@nstcloudcc3:~$ ceph --version
|| ceph version 14.0.0-16935-g9b6ef711f3 (9b6ef711f3a40898756457cb287bf291f45943f0) octopus (dev)
|| ceph@nstcloudcc3:~$ ceph -s
||   cluster:
||     id:     e31502ff-1fb4-40b7-89a8-2b85a77a3b09
||     health: HEALTH_OK
||
||   services:
||     mon: 1 daemons, quorum nstcloudcc4 (age 2h)
||     mgr: nstcloudcc4(active, since 2h)
||     osd: 4 osds: 4 up (since 2h), 4 in (since 2h)
||
||   data:
||     pools:   1 pools, 128 pgs
||     objects: 6 objects, 6.3 KiB
||     usage:   4.0 GiB used, 7.3 TiB / 7.3 TiB avail
||     pgs:     128 active+clean
||
|| ceph@nstcloudcc3:~$
|| ceph@nstcloudcc3:~$ rbd info rwl_mysql/mysql_image
|| rbd image 'mysql_image':
||     size 100 GiB in 25600 objects
||     order 22 (4 MiB objects)
||     snapshot_count: 0
||     id: 110feda39b1c
||     block_name_prefix: rbd_data.110feda39b1c
||     format: 2
||     features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
||     op_features:
||     flags:
||     create_timestamp: Mon Dec 9 23:48:17 2019
||     access_timestamp: Mon Dec 9 23:48:17 2019
||     modify_timestamp: Mon Dec 9 23:48:17 2019

B.R.
Changcheng
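
When debugging a failure like this, it can help to see exactly which options QEMU received in the `file=rbd:...` string. The sketch below splits such a string into pool, image, and options, assuming the colon-separated `rbd:pool/image:key=value` layout shown in the error above, with `\:` as an escaped colon. The helper name `parse_rbd_spec` is hypothetical; it is a diagnostic aid, not part of QEMU or libvirt, and it does not identify the cause of the error.

```python
import re


def parse_rbd_spec(spec):
    """Split an 'rbd:pool/image:opt=val:...' string into (pool, image, options)."""
    if not spec.startswith("rbd:"):
        raise ValueError("not an rbd spec: %r" % spec)
    # Split on ':' unless it is escaped with a preceding backslash.
    parts = re.split(r"(?<!\\):", spec[len("rbd:"):])
    pool, _, image = parts[0].partition("/")
    options = {}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        options[key] = value.replace("\\:", ":")  # undo the colon escaping
    return pool, image, options


# The file= value from the error message above (up to the first comma).
SPEC = r"rbd:rwl_mysql/mysql_image:auth_supported=none:mon_host=nstcloudcc4\:6789"
pool, image, options = parse_rbd_spec(SPEC)
```

Printing `options` makes it easy to compare what the VM definition actually passes (here `auth_supported=none` and a single monitor address) against what the cluster expects.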


Re: About the optimization of rbd object map

by Jason Dillaman

On Mon, Dec 9, 2019 at 8:19 AM Li Wang <laurence.liwang(a)gmail.com> wrote:
>
> Hi Jason,
> If before the first write to object, the object map is updated first
> to indicate
> the object EXIST, what happen if crash occurred before the data write, and after
> the object map write, will the map wrongly indicate one object EXIST but in fact
> NOTEXIST. In other words, the map subject to the following semantics,
> if an object

That's not an issue that would result in an object leak or data corruption. If the object-map flags the object as existing when it doesn't due to an untimely crash, it will either do an unnecessary read IO or delete request when removing the image.

> NOTEXIST in map, it REALLY not exist. If an object EXIST in map,
> it not necessarily exist. The read/write to such a object will return ENOENT,
> and the client will read parent/copy up from parent then write, so
> that it is not a problem.
> If the above understanding is correct, how about diff computation,
> will the wrong indication

Yes, it will be wrong for the affected object so your diff will potentially include an extra object on the delta (but no data corruption). The object-map can be re-built using the CLI, but there really shouldn't be a need for such a corner case (that is just slightly sub-optimal).

> in the map cause a problem. And, we are wondering what is the negative impacts
> if disabling object map.
>
> Cheers,
> Li Wang
>
> Jason Dillaman <jdillama(a)redhat.com> wrote on Fri, Dec 6, 2019 at 9:56 PM:
> >
> > On Thu, Dec 5, 2019 at 11:14 PM Li Wang <laurence.liwang(a)gmail.com> wrote:
> > >
> > > Hi Jason,
> > > We found the synchronous process of object map, which, as a result,
> > > write two objects
> > > every write greatly slow down the first write performance of a newly
> > > created rbd by up to 10x,
> > > which is not acceptable in our scenario, so could we do some
> > > optimizations on it,
> > > for example, batch the map writes or lazy update the map, do we need
> > > maintain accurate
> > > synchronization between the map and the data objects? but after a
> > > glimpse of the librbd codes,
> > > it seems no transactional design for the two objects (map object and
> > > data object) write?
> >
> > If you don't update the object-map before issuing the first write to
> > the associated object, you could crash and therefore the object-map's
> > state is worthless since you couldn't trust it to tell the truth. The
> > cost of object-map is supposed to be amortized over time so the first
> > writes on a new image will incur the performance hits, but future
> > writes do not.
> >
> > The good news is that you are more than welcome to disable
> > object-map/fast-diff if the performance penalty is too great for your
> > application -- it's not a required feature of RBD.
> >
> > >
> > > Cheers,
> > > Li Wang
> > >
> >
> >
> > --
> > Jason
> >

--
Jason
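
The ordering argument in this thread can be sketched with a toy model. It assumes only what is stated above: the map is updated before the first write to an object, so a crash in between can leave the map saying EXIST for an object that is absent, but never the reverse. The class `ToyImage` and its fields are hypothetical and are not librbd code.

```python
class ToyImage:
    """Toy model of an image with an object map updated before each first write."""

    def __init__(self, num_objects):
        self.object_map = [False] * num_objects  # False = NOTEXIST, True = EXIST
        self.objects = {}                        # index -> data actually stored
        self.map_updates = 0                     # extra map writes paid so far

    def write(self, index, data, crash_before_data=False):
        if not self.object_map[index]:
            # First write to this object: update the map first (the extra cost).
            self.object_map[index] = True
            self.map_updates += 1
        if crash_before_data:
            return  # simulated crash between the map update and the data write
        self.objects[index] = data

    def invariant_holds(self):
        # Every object that really exists is flagged EXIST, i.e. NOTEXIST is trustworthy.
        return all(self.object_map[i] for i in self.objects)


image = ToyImage(4)
image.write(0, b"first")
image.write(0, b"second")                        # no further map update for object 0
image.write(1, b"lost", crash_before_data=True)  # map says EXIST, object is absent
```

After these calls `image.map_updates` is 2 for three writes, which is the amortization Jason describes, and object 1 is the harmless "EXIST but absent" case that at worst costs an extra read or an extra entry in a diff.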


[RFE] ceph-volume prepare and activate enhancements for containers

by Sebastien Han

Hi,

Following up on my previous ceph-volume email as promised.

When running Ceph with Rook in Kubernetes in the Cloud (AWS, Azure, Google, whatever), the OSDs are backed by PVCs (Cloud block storage) attached to virtual machines. This makes the storage portable: if the VM dies, the device will be attached to a new virtual machine and the OSD will resume running.

In Rook, we have 2 main deployments for the OSD:

1. Prepare the disk to become an OSD

Prepare will run on the VM, attach the block device, and run "ceph-volume prepare"; then this gets complicated. After this, the device is supposed to be detached from the VM because the container terminated. However, the block device is still held by LVM, so the VG must be de-activated. Currently, we do this in Rook, but it would be nice to de-activate the VG once ceph-volume is done preparing the disk in a container.

2. Activate the OSD

Now, onto the new container: the device is attached again on the VM. At this point, more changes will be required in ceph-volume, particularly in the "activate" call.

a. ceph-volume should activate the VG
b. ceph-volume should activate the device normally
c. ceph-volume should run the ceph-osd process in the foreground, as well as accepting flags for that CLI. We could have something like:

"ceph-volume lvm activate --no-systemd $STORE_FLAG $OSD_ID $OSD_UUID <a bunch of flags>"

Perhaps we need a new flag to indicate we want to run the osd process in the foreground? Here is an example of how an OSD runs today:

ceph-osd --foreground --id 2 --fsid 9a531951-50f2-4d48-b012-0aef0febc301 --setuser ceph --setgroup ceph --crush-location=root=default host=minikube --default-log-to-file false --ms-learn-addr-from-peer=false

--> we can have a bunch of flags or an ENV var with all the flags, whatever you prefer.

This wrapper should watch for signals too. It should reply to SIGTERM in the following way:

- stop the OSD
- de-activate the VG
- exit 0

Just a side note: the VG must be de-activated when the container stops so that the block device can be detached from the VM; otherwise, it'll still be held by LVM.

Hopefully, I was clear :). This is just a proposal; if you feel like this could be done differently, feel free to suggest.

Thanks!
–––––––––
Sébastien Han
Senior Principal Software Engineer, Storage Architect
"Always give 100%. Unless you're giving blood."
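
The SIGTERM behaviour proposed above (stop the OSD, de-activate the VG, exit 0) can be sketched as a small foreground wrapper. This is only an illustration of the signal flow, not ceph-volume or Rook code: the function `run_wrapped` and the `NOOP` placeholder are hypothetical, and a real wrapper would pass the ceph-osd command line and a VG de-activation command (for example something like `vgchange -an <vg>`, an assumption here) in their place.

```python
import signal
import subprocess
import sys


def run_wrapped(main_cmd, cleanup_cmd):
    """Run main_cmd in the foreground, stop it on SIGTERM, then always run cleanup_cmd.

    Returns 0 after a signalled shutdown, otherwise the child's own exit code.
    """
    child = subprocess.Popen(main_cmd)
    stopped_by_signal = False

    def handle_sigterm(signum, frame):
        nonlocal stopped_by_signal
        stopped_by_signal = True
        child.terminate()  # ask the child to stop; wait() below returns once it exits

    previous = signal.signal(signal.SIGTERM, handle_sigterm)
    try:
        exit_code = child.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)
        # The cleanup step runs on every exit path, signalled or not.
        subprocess.run(cleanup_cmd, check=False)
    return 0 if stopped_by_signal else exit_code


# Placeholder command so the sketch runs anywhere.
NOOP = [sys.executable, "-c", "pass"]
```

Calling `sys.exit(run_wrapped(osd_cmd, deactivate_cmd))` as the container entrypoint would give the three-step shutdown listed above; the cleanup also runs if the OSD exits on its own, which keeps the device detachable in that case too.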


Simplifying Ceph Project Redmine Open Statuses

by Patrick Donnelly

Currently we have these open statuses:

New
Triaged
Verified
Need More Info
In Progress
Feedback
Need Review
Need Test
Testing
Pending Backport
Pending Upstream

It seems to me many of these are mostly unused, making their presence confusing to newcomers. I propose we prune these down to:

New: default for new trackers; ideally this list should be short and regularly looked at.
Triaged: it's been looked at by PTL/team member and could be assigned out.
Need More Info: can't be worked on without more information.
In Progress: assignee is working on the ticket.
Need Review: upstream PR ready for review.
Pending Backport: upstream PR merged; backports are pending.

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


Performance Tuning and Ceph

by Mark Nelson

Hi Folks,

Recently there was a thread called "Tuning Nautilus for flash only" that included a reference to a bluestore performance blog post from earlier this year on the Ceph community website. There was some concern in that thread regarding some of the tuning parameters presented in the article. We discussed it in the core standup earlier this week and felt like we should address it. I've included a reply that Paul made in that thread as I think it's particularly relevant.

Before I get into that though, I absolutely want to encourage folks to run performance tests and report their findings. To that end I want to thank Karan and Daniel for their hard work and being willing to present their results. This kind of work is difficult and presenting the results publicly can be a little rough! Thank you Karan and Daniel, and please continue running tests and reporting your findings!

I also want to thank Paul for making several extremely important and valid points below. I completely agree that some of the tuning parameters presented in the article shouldn't be used in production. Beyond disabling checksumming and authentication, I would highly encourage folks to think about the ramifications of setting very low numbers of pg log entries (especially when combined with low per-pool PG counts via the autotuner). The effect on recovery could be significant. Several other tunings in the article may have unintended consequences. Imagine for instance what could happen with 32 concurrent rocksdb compaction threads per OSD on a server that has a large number of OSDs, oversubscribed DB devices, and underpowered CPUs. Personally I would be concerned about the overhead under heavy load with large databases full of OMAP data.

There are cases where our defaults may not be optimal, but many were set after a fair amount of performance testing (and even more QE testing). We tend to be more conservative than not, but often there is at least some level of thought and testing behind the defaults. In some cases, the optimal tuning may also be hardware or workload specific.

In the community test lab we have two different classes of performance nodes. One is about 4 years old and uses older Xeon processors and P3700 NVMe drives. Several years ago when bluestore was young we saw that a 16K min_alloc size was significantly faster than 4K for small write workloads, primarily due to encode/decode overhead. As bluestore matured and improved, the gap between 16K and 4K min_alloc sizes on that hardware largely evaporated. On our new nodes however, we see a significant small write performance improvement when using a 4K min_alloc size (likely due to CPU overhead during WAL writes now being a bigger bottleneck than metadata IO in the DB). Of course, the min_alloc size has a huge effect on the space-amplification of small objects as well. This is just one example where an old set of tests on a single hardware configuration may not tell the whole story (or may even tell the wrong story).

What I'm getting at here is that you shouldn't necessarily trust any single set of tests (including mine!). This is especially true when multiple configuration parameters are changed at the same time and it's not clear how each parameter is affecting the results. I would encourage folks to look at multiple sets of results, look especially at tests that change a single parameter at a time, and also give higher credence to results that provide evidence for why performance changed. This might include profiling data, examples where specific code is shown to be sub-optimal, or corroborating data from tests run by other users. And Paul's advice below to run your own benchmarks that are relevant to your use case is spot on as well.

Thanks,
Mark

On 11/28/19 10:46 AM, Paul Emmerich wrote:
> Please don't run this config in production.
> Disabling checksumming is a bad idea, disabling authentication is also
> pretty bad.
>
> There are also a few options in there that no longer exist (osd op
> threads) or are no longer relevant (max open files), in general, you
> should not blindly copy config files you find on the Internet. Only
> set an option to its non-default value after carefully checking what
> it does and whether it applies to your use case.
>
> Also, run benchmarks yourself. Use benchmarks that are relevant to
> your use case.
>
> Paul
>
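
The advice to change a single parameter at a time can be turned into a small test-plan generator. The sketch below is only an illustration: the helper `one_at_a_time` is hypothetical, and the parameter labels and values are placeholders rather than recommended settings or exact Ceph option names.

```python
def one_at_a_time(baseline, candidates):
    """Yield (label, config) pairs that differ from baseline in exactly one parameter."""
    yield "baseline", dict(baseline)
    for name, values in candidates.items():
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline run, nothing new to learn
            config = dict(baseline)
            config[name] = value
            yield f"{name}={value}", config


# Placeholder labels and values, chosen only to show the shape of the plan.
baseline = {"min_alloc_size": 16384, "pg_log_entries": 3000}
candidates = {"min_alloc_size": [4096, 16384], "pg_log_entries": [500, 3000]}
runs = list(one_at_a_time(baseline, candidates))
```

Each run in `runs` can be benchmarked against the baseline, so a change in the result can be attributed to the one setting that differs, which is much harder when a whole config file is swapped in at once.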


Re: device class : nvme

by Sage Weil

Adding dev(a)ceph.io

On Thu, 21 Nov 2019, Muhammad Ahmad wrote:
> While trying to research how crush maps are used/modified I stumbled
> upon these device classes.
> https://ceph.io/community/new-luminous-crush-device-classes/
>
> I wanted to highlight that having nvme as a separate class will
> eventually break and should be removed.
>
> There is already a push within the industry to consolidate future
> command sets and NVMe will likely be it. In other words, NVMe HDDs are
> not too far off. In fact, the recent October OCP F2F discussed this
> topic in detail.
>
> If the classification is based on performance then command set
> (SATA/SAS/NVMe) is probably not the right classification.

I opened a PR that does this: https://github.com/ceph/ceph/pull/31796

I can't remember seeing 'nvme' as a device class on any real cluster; the exception is my basement one, and I think the only reason it ended up that way was because I deployed bluestore *very* early on (with ceph-disk) and the is_nvme() detection helper doesn't work with LVM. That's my theory at least.. can anybody with bluestore on NVMe devices confirm? Does anybody see class 'nvme' devices in their cluster?

Thanks!
sage
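
The point about classifying by behaviour rather than by command set can be illustrated with the Linux rotational flag. The sketch below assumes the value has been read from `/sys/block/<dev>/queue/rotational` (which reports "1" for spinning media and "0" otherwise); the function `device_class` is hypothetical and is not Ceph's detection code.

```python
def device_class(rotational_flag):
    """Map the contents of /sys/block/<dev>/queue/rotational to 'hdd' or 'ssd'."""
    return "hdd" if rotational_flag.strip() == "1" else "ssd"


# An NVMe-attached spinning disk would still report "1" and be classed by behaviour.
spinning = device_class("1\n")
flash = device_class("0\n")
```

Under this scheme the transport (SATA, SAS, or NVMe) never enters the decision, so a future NVMe HDD would not end up in a flash-oriented class.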


Documentation Refactor - Kubernetes Helm Procedure Wrong (Bug #2)

by John Zachary Dover

I am updating the Ceph documentation. Included in this email is a proposed change to the documentation and a request for information pertaining to that proposed change. If you know about the issue behind the proposed change and you have information pertinent to it that you would like to enshrine in the documentation, reply to this email and tell me.

Documentation Link: http://docs.ceph.com/docs/master/start/kube-helm/

Proposed Change: Removing the Helm-related material from the documentation entirely.

Zac's Request: Can anyone provide a compelling reason for keeping the Helm documentation?

Tracking Information (this can be ignored by everyone but Zac): Bug #2 here: https://pad.ceph.com/p/Report_Documentation_Bugs

