I've started working on a saner way to deploy OSDs with Rook so that
they don't use the rook binary image.
Why were/are we using the rook binary to activate the OSD?
A bit of background on containers first: when executing a container,
we need to provide a command entrypoint that will act as PID 1. So if
you want to run pre/post actions around the process, you need to
use a wrapper. In Rook, that's the rook binary, which has a CLI and
can then "activate" an OSD.
Currently, this "rook osd activate" call does the following:
* sed the lvm.conf
* run c-v lvm activate
* run the osd process
On shutdown, we intercept the signal, "kill -9" the osd and de-activate the LV.
I have a patch here: https://github.com/rook/rook/pull/4386, that
solves the initial bullet points but one thing we cannot do is the
signal catching and the lv de-activation.
Before you ask: Kubernetes has pre/post hooks, but they are not
reliable. It is known and documented that there is no guarantee they
will actually run before or after the container starts/stops. We
tried and we had issues.
Why do we want to stop using the rook binary for activation? Because
each time we get a new binary version (new operator version), all the
OSDs are restarted, even if nothing in the deployment spec changed
other than the rook image version.
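To illustrate why an image bump alone is enough to restart the OSDs, here is a toy model. It is only a sketch: spec_fingerprint, needs_restart, the dictionary layout and the image tags are all hypothetical, and this is not how Kubernetes actually hashes a pod template. The point is just that the image string is part of the spec being compared.

```python
import hashlib
import json


def spec_fingerprint(pod_spec: dict) -> str:
    """Return a stable digest of a pod spec (illustrative, not the Kubernetes algorithm)."""
    canonical = json.dumps(pod_spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_restart(old_spec: dict, new_spec: dict) -> bool:
    """A pod is considered changed as soon as any field of its spec differs."""
    return spec_fingerprint(old_spec) != spec_fingerprint(new_spec)


# Hypothetical specs: only the image tag differs between the two.
old_spec = {"image": "example/rook:old", "command": ["rook", "osd", "activate"]}
new_spec = {"image": "example/rook:new", "command": ["rook", "osd", "activate"]}

print(needs_restart(old_spec, new_spec))        # image changed
print(needs_restart(old_spec, dict(old_spec)))  # identical spec
```

If the OSD pods ran an image that does not change with every operator release, an operator upgrade would leave this fingerprint untouched.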
Also with containers, we have seen so many issues working with LVM,
just to name a few:
* adapt lvm filters
* interactions with udev - need to tune the lvm config, even c-v
itself has a built-in lvm flag to not sync with udev
* several bindmounts
* lvm package must be present on the host even if running in containers
* SELinux, yes lvm calls SELinux commands under the hood and pollutes
the logs in some scenarios
Currently, one of the ways I can see this working is by not using LVM
when bootstrapping OSDs. Unfortunately, some of the logic cannot go in
the OSD code since the lv de-activation happens after the OSD stops.
We need to de-activate the LV so when running in the Cloud the block
device can safely be re-attached to a new machine without LVM issues.
I know this will be a bit challenging and might ultimately look like
ceph-disk but it'd be nice to consider it.
What about a small prototype for Bluestore with block/db/wal on the same disk?
If this gets rejected, I might try a prototype for not using c-v in
Rook or something else that might come up with this discussion.
Senior Principal Software Engineer, Storage Architect
"Always give 100%. Unless you're giving blood."
I might have found the reason why several of our clusters (and maybe
Bryan's too) are getting stuck not trimming osdmaps.
It seems that when an osd fails, the min_last_epoch_clean gets stuck
forever (even long after HEALTH_OK), until the ceph-mons are
I've updated the ticket: https://tracker.ceph.com/issues/41154
(This is an early update, some tests are still running, as we are
trying to release this point next week before the US holidays, and
have more time to review results)
Details of this release summarized here:
rados - approved by Neha
rgw - approved by Casey
rbd - need approval Jason
krbd - need approval Jason, Ilya
fs - need approval Patrick, Ramana
kcephfs - need approval Patrick, Ramana
multimds - need approval Patrick, Ramana
ceph-deploy - FAILED Sage, Alfredo ?
ceph-disk - N/A
upgrade/client-upgrade-hammer (nautilus) - N/A
upgrade/client-upgrade-jewel (nautilus) - PASSED
upgrade/client-upgrade-mimic (nautilus) - FAILED
upgrade/luminous-p2p - in progress
powercycle - in progress
ceph-ansible - Brad is fixing
upgrade/luminous-x (nautilus) - in progress
upgrade/mimic-x (nautilus) - in progress
ceph-volume - Jan fixing
(please speak up if something is missing)
On Mon, Dec 9, 2019 at 8:19 AM Li Wang <laurence.liwang(a)gmail.com> wrote:
> Hi Jason,
> If before the first write to object, the object map is updated first
> to indicate
> the object EXIST, what happen if crash occured before the data write, and after
> the object map write, will the map wrongly indicate one object EXIST but in fact
> NOTEXIST. In other words, the map subject to the following semantics,
> if an object
That's not an issue that would result in an object leak or data
corruption. If the object-map flags the object as existing when it
doesn't due to an untimely crash, it will either do an unnecessary
read IO or delete request when removing the image.
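
The ordering discussed in this thread can be sketched with a toy model. This is not librbd code: ToyImage and its methods are hypothetical names used only to show why "update the map first" fails in the safe direction after a crash.

```python
class ToyImage:
    """Toy model of an image with an object map; not librbd code."""

    def __init__(self):
        self.object_map = set()  # object numbers flagged as existing
        self.objects = {}        # object number -> data actually stored

    def write(self, objno, data, crash_after_map_update=False):
        # Step 1: flag the object as existing *before* touching the data.
        self.object_map.add(objno)
        if crash_after_map_update:
            return  # simulated crash: map updated, data never written
        # Step 2: write the data object itself.
        self.objects[objno] = data

    def map_is_safe(self):
        # Safe means every object that really exists is flagged in the map.
        return set(self.objects) <= self.object_map

    def stale_entries(self):
        # Objects flagged as existing that are not actually there.
        return self.object_map - set(self.objects)


img = ToyImage()
img.write(0, b"data")
img.write(1, b"data", crash_after_map_update=True)
print(img.map_is_safe())    # the map never misses a real object
print(img.stale_entries())  # the extra entry a diff or delete would visit
```

With this ordering the only possible error is a stale "exists" flag, which costs an unnecessary read or delete and may add an extra object to a diff.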
> NOTEXIST in map, it REALLY not exist. If an object EXIST in map,
> it not necessarily exist. The read/write to such a object will return ENOENT,
> and the client will read parent/copy up from parent then write, so
> that it is not a problem.
> If the above understanding is correct, how about diff computation,
> will the wrong indication
Yes, it will be wrong for the affected object so your diff will
potentially include an extra object on the delta (but no data
corruption). The object-map can be re-built using the CLI, but there
really shouldn't be a need for such a corner case (that is just
> in the map cause a problem. And, we are wondering what is the negative impacts
> if disabling object map.
> Li Wang
> Jason Dillaman <jdillama(a)redhat.com> 于2019年12月6日周五 下午9:56写道：
> > On Thu, Dec 5, 2019 at 11:14 PM Li Wang <laurence.liwang(a)gmail.com> wrote:
> > >
> > > Hi Jason,
> > > We found the synchronous process of object map, which, as a result,
> > > write two objects
> > > every write greatly slow down the first write performance of a newly
> > > created rbd by up to 10x,
> > > which is not acceptable in our scenario, so could we do some
> > > optimizations on it,
> > > for example, batch the map writes or lazy update the map, do we need
> > > maintain accurate
> > > synchronization between the map and the data objects? but after a
> > > glimpse of the librbd codes,
> > > it seems no transactional design for the two objects (map object and
> > > data object) write?
> > If you don't update the object-map before issuing the first write to
> > the associated object, you could crash and therefore the object-map's
> > state is worthless since you couldn't trust it to tell the truth. The
> > cost of object-map is supposed to be amortized over time so the first
> > writes on a new image will incur the performance hits, but future
> > writes do not.
> > The good news is that you are more than welcome to disable
> > object-map/fast-diff if the performance penalty is too great for your
> > application -- it's not a required feature of RBD.
> > >
> > > Cheers,
> > > Li Wang
> > >
> > --
> > Jason
Following up on my previous ceph-volume email as promised.
When running Ceph with Rook in Kubernetes in the Cloud (AWS, Azure,
Google, whatever), the OSDs are backed by PVC (Cloud block storage)
attached to virtual machines.
This makes the storage portable: if the VM dies, the device will be
attached to a new virtual machine and the OSD will resume running.
In Rook, we have 2 main deployments for the OSD:
1. Prepare the disk to become an OSD
Prepare will run on the VM, attach the block device, run "ceph-volume
prepare", then this gets complicated. After this, the device is
supposed to be detached from the VM because the container terminated.
However, the block is still held by LVM so the VG must be
de-activated. Currently, we do this in Rook, but it would be nice to
de-activate the VG once ceph-volume is done preparing the disk in a
2. Activate the OSD.
Now, onto the new container, the device is attached again on the VM.
At this point, more changes will be required in ceph-volume,
particularly in the "activate" call.
a. ceph-volume should activate the VG
b. ceph-volume should activate the device normally
c. ceph-volume should run the ceph-osd process in the foreground, as well
as accepting flags for that CLI. We could have something like:
"ceph-volume lvm activate --no-systemd $STORE_FLAG $OSD_ID $OSD_UUID
<a bunch of flags>"
Perhaps we need a new flag to indicate we want to run the osd
process in foreground?
Here is an example of how an OSD runs today:
ceph-osd --foreground --id 2 --fsid
9a531951-50f2-4d48-b012-0aef0febc301 --setuser ceph --setgroup ceph
--crush-location=root=default host=minikube --default-log-to-file
--> we can have a bunch of flags or an ENV var with all the flags
whatever you prefer.
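
As a rough sketch of the ENV var option: the variable name CEPH_OSD_EXTRA_FLAGS and the helper build_osd_argv are made up for illustration and are not existing ceph-volume options. Shell-style splitting matters here because the crush location value contains a space.

```python
import os
import shlex


def build_osd_argv(osd_id, fsid, environ=None):
    """Build the ceph-osd command line, appending flags from a (hypothetical) ENV var."""
    environ = os.environ if environ is None else environ
    # shlex.split honours quoting, so a quoted flag with spaces stays one argument.
    extra = shlex.split(environ.get("CEPH_OSD_EXTRA_FLAGS", ""))
    return ["ceph-osd", "--foreground", "--id", str(osd_id), "--fsid", fsid] + extra


env = {
    "CEPH_OSD_EXTRA_FLAGS": (
        "--setuser ceph --setgroup ceph "
        "'--crush-location=root=default host=minikube' --default-log-to-file"
    )
}
argv = build_osd_argv(2, "9a531951-50f2-4d48-b012-0aef0febc301", env)
for arg in argv:
    print(arg)
```

A plain str.split() would break the crush location into two arguments, which is one reason to prefer explicit flags or careful quoting if an ENV var is used.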
This wrapper should watch for signals too; it should respond to
SIGTERM in the following way:
- stop the OSD
- de-activate the VG
- exit 0
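
The three steps above could look roughly like this. It is a minimal sketch, not ceph-volume code: run_wrapped is a hypothetical helper, the OSD is stopped by forwarding SIGTERM, and the de-activation command is passed in by the caller (I am assuming something like "vgchange -an" on the VG).

```python
import signal
import subprocess
import sys


def run_wrapped(osd_argv, deactivate_argv):
    """Run the OSD in the foreground; once it stops, de-activate the VG and return 0."""
    child = subprocess.Popen(osd_argv)

    def on_sigterm(signum, frame):
        # Step 1: stop the OSD by forwarding the termination request to it.
        child.terminate()

    signal.signal(signal.SIGTERM, on_sigterm)

    # Blocks until the OSD exits, whether by itself or after SIGTERM.
    child.wait()

    # Step 2: de-activate the VG so the block device can be detached.
    subprocess.run(deactivate_argv, check=False)

    # Step 3: report a clean exit to the container runtime.
    return 0
```

The container entrypoint would then end with something like sys.exit(run_wrapped(osd_argv, ["vgchange", "-an", vg_name])), where osd_argv and vg_name come from the activate call.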
Just a side note, the VG must be de-activated when the container stops
so that the block device can be detached from the VMs, otherwise,
it'll still be held by LVM.
Hopefully, I was clear :).
This is just a proposal; if you feel like this could be done
differently, feel free to suggest.
Senior Principal Software Engineer, Storage Architect
"Always give 100%. Unless you're giving blood."
Currently we have these open statuses:
Need More Info
It seems to me many of these are mostly unused, making their presence
confusing to newcomers. I propose we prune these down to:
New: default for new trackers; ideally this list should be short and
regularly looked at.
Triaged: it's been looked at by PTL/team member and could be assigned out.
Need More Info: can't be worked on without more information
In Progress: assignee is working on the ticket.
Need Review: upstream PR ready for review
Pending Backport: upstream PR merged; backports are pending.
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
Recently there was a thread called "Tuning Nautilus for flash only" that
included a reference to a bluestore performance blog post from earlier
this year on the Ceph community website. There was some concern in that
thread regarding some of the tuning parameters presented in the article.
We discussed it in the core standup earlier this week and felt like we
should address it. I've included a reply that Paul made in that thread
as I think it's particularly relevant. Before I get into that though, I
absolutely want to encourage folks to run performance tests and report
their findings. To that end I want to thank Karan and Daniel for their
hard work and being willing to present their results. This kind of work
is difficult and presenting the results publicly can be a little rough!
Thank you Karan and Daniel and please continue running tests and
reporting your findings!
I also want to thank Paul for making several extremely important and
valid points below. I completely agree that some of the tuning
parameters presented in the article shouldn't be used in production.
Beyond disabling checksumming and authentication, I would highly
encourage folks to think about the ramifications of setting very low
numbers of pg log entries (especially when combined with low per-pool PG
counts via the autotuner). The effect on recovery could be
significant. Several other tunings in the article may have unintended
consequences. Imagine for instance what could happen with 32 concurrent
rocksdb compaction threads per OSD on a server that has a large number
of OSDs, oversubscribed DB devices, and underpowered CPUs. Personally I
would be concerned about the overhead under heavy load with large
databases full of OMAP data. There are cases where our defaults may not
be optimal, but many were set after a fair amount of performance testing
(and even more QE testing). We tend to be more conservative than not,
but often there is at least some level of thought and testing behind the
In some cases, the optimal tuning may also be hardware or workload
specific. In the community test lab we have two different classes of
performance nodes. One is about 4 years old and uses older Xeon
processors and P3700 NVMe drives. Several years ago when bluestore was
young we saw that a 16K min alloc size was significantly faster than 4k
for small write workloads primarily due to encode/decode overhead. As
bluestore matured and improved, the gap between 16k and 4k min_alloc
sizes on that hardware largely evaporated. On our new nodes however, we
see a significant small write performance improvement when using a 4K
min alloc size (likely due to CPU overhead during WAL writes now being a
bigger bottleneck than metadata IO in the DB). Of course, the min_alloc
size has a huge effect on the space-amplification of small objects as
well. This is just one example where an old set of tests on a single
hardware configuration may not tell the whole story (or even tell the
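
To make the space-amplification point concrete, here is a simplified calculation. It assumes an object is rounded up to a whole number of allocation units and ignores compression, erasure coding and metadata; allocated_bytes is just an illustrative helper.

```python
def allocated_bytes(object_size, min_alloc_size):
    """Round an object's size up to a whole number of allocation units."""
    units = -(-object_size // min_alloc_size)  # ceiling division
    return units * min_alloc_size


object_size = 1024  # a 1 KiB object
for min_alloc in (4096, 16384):
    used = allocated_bytes(object_size, min_alloc)
    print(min_alloc, used, used / object_size)
```

Under these assumptions a 1 KiB object occupies 4 KiB with a 4K min alloc size and 16 KiB with a 16K min alloc size, that is 4x versus 16x amplification.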
What I'm getting at here is that you shouldn't necessarily trust any
single set of tests (including mine!). This is especially true when
multiple configuration parameters are changed at the same time and it's
not clear how each parameter is affecting the results. I would
encourage folks to look at multiple sets of results, look especially at
tests that change a single parameter at a time, and also give higher
credence to results that provide evidence for why performance changed.
This might include profiling data, examples where specific code is shown
to be sub-optimal, or corroborating data from tests run by other users.
And Paul's advice below to run your own benchmarks that are relevant to
your use case is spot on as well.
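
As a small sketch of the "change a single parameter at a time" approach: one_at_a_time is a hypothetical helper, and the keys and values below are placeholders rather than recommended settings or exact Ceph option names.

```python
def one_at_a_time(baseline, candidates):
    """Yield (changed_key, config) pairs that differ from baseline in exactly one key."""
    for key, values in candidates.items():
        for value in values:
            if value == baseline.get(key):
                continue  # identical to the baseline run, nothing new to measure
            config = dict(baseline)
            config[key] = value
            yield key, config


# Placeholder values for illustration only.
baseline = {"min_alloc_size": 16384, "pg_log_entries": 3000}
candidates = {"min_alloc_size": [4096, 16384], "pg_log_entries": [500, 3000]}

runs = list(one_at_a_time(baseline, candidates))
for changed, config in runs:
    print(changed, config)
```

Each run can then be compared against the baseline run, so any difference in the results is attributable to the one parameter that changed.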
On 11/28/19 10:46 AM, Paul Emmerich wrote:
> Please don't run this config in production.
> Disabling checksumming is a bad idea, disabling authentication is also
> pretty bad.
> There are also a few options in there that no longer exist (osd op
> threads) or are no longer relevant (max open files), in general, you
> should not blindly copy config files you find on the Internet. Only
> set an option to its non-default value after carefully checking what
> it does and whether it applies to your use case.
> Also, run benchmarks yourself. Use benchmarks that are relevant to
> your use case.
On Thu, 21 Nov 2019, Muhammad Ahmad wrote:
> While trying to research how crush maps are used/modified I stumbled
> upon these device classes.
> I wanted to highlight that having nvme as a separate class will
> eventually break and should be removed.
> There is already a push within the industry to consolidate future
> command sets and NVMe will likely be it. In other words, NVMe HDDs are
> not too far off. In fact, the recent October OCP F2F discussed this
> topic in detail.
> If the classification is based on performance then command set
> (SATA/SAS/NVMe) is probably not the right classification.
I opened a PR that does this:
I can't remember seeing 'nvme' as a device class on any real cluster; the
exception is my basement one, and I think the only reason it ended up that
way was because I deployed bluestore *very* early on (with ceph-disk) and
the is_nvme() detection helper doesn't work with LVM. That's my theory at
least.. can anybody with bluestore on NVMe devices confirm? Does anybody
see class 'nvme' devices in their cluster?
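
For anyone who wants to check, here is a rough way to count device classes. It assumes the JSON printed by "ceph osd tree -f json" has a "nodes" list whose OSD entries carry "type" and "device_class" fields; the sample below is made up, and count_device_classes is just an illustrative helper.

```python
import json
from collections import Counter


def count_device_classes(osd_tree_json):
    """Count OSDs per device class in (assumed) 'ceph osd tree -f json' output."""
    tree = json.loads(osd_tree_json)
    return Counter(
        node.get("device_class", "<none>")
        for node in tree.get("nodes", [])
        if node.get("type") == "osd"
    )


# Made-up sample, not output from a real cluster.
sample = json.dumps({
    "nodes": [
        {"id": -1, "name": "default", "type": "root"},
        {"id": 0, "name": "osd.0", "type": "osd", "device_class": "ssd"},
        {"id": 1, "name": "osd.1", "type": "osd", "device_class": "nvme"},
        {"id": 2, "name": "osd.2", "type": "osd", "device_class": "ssd"},
    ]
})
classes = count_device_classes(sample)
print(dict(classes))
```

A non-zero count for "nvme" would answer the question above for that cluster.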
I am updating the Ceph documentation. Included in this email is a proposed
change to the documentation and a request for information pertaining to that proposed
change. If you know about the issue behind the proposed change and you have
information pertinent to it that you would like to enshrine in the documentation, reply
to this email and tell me.
Documentation Link: http://docs.ceph.com/docs/master/start/kube-helm/
Proposed Change: Removing the Helm-related material from the
Zac's Request: Can anyone provide a compelling reason for keeping the
Tracking Information (this can be ignored by everyone but Zac)
Bug # 2 here: https://pad.ceph.com/p/Report_Documentation_Bugs