On Tue, Dec 3, 2019 at 11:56 AM Sebastien Han <shan(a)redhat.com> wrote:
Hi,
I've started working on a saner way to deploy OSDs with Rook so that
they don't use the rook binary image.
Why were/are we using the rook binary to activate the OSD?
A bit of background on containers first: when executing a container,
we need to provide an entrypoint command that will act as PID 1. So if
you want to perform pre/post actions around the main process, you need
a wrapper. In Rook, that wrapper is the rook binary, which has a CLI
and can then "activate" an OSD.
Currently, this "rook osd activate" call does the following:
* sed the lvm.conf
* run c-v lvm activate
* run the osd process
On shutdown, we intercept the signal, "kill -9" the osd and de-activate the
LV.
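For illustration, the activate-run-deactivate sequence above could be sketched as a plain shell entrypoint instead of the rook binary. Everything here is a sketch: the variable names, defaults, `run_osd` wrapper, and the DRY_RUN guard are hypothetical, not Rook's actual code; only `ceph-volume lvm activate` and `lvchange` are real commands.

```shell
#!/bin/sh
# Hypothetical sketch of a non-rook entrypoint. OSD_ID, OSD_FSID, LV_PATH,
# DRY_RUN and run_osd are illustrative names, not Rook's API.

run_osd() {
    : "${OSD_ID:=0}"
    : "${OSD_CMD:=ceph-osd --foreground --id $OSD_ID}"
    : "${LV_PATH:=/dev/ceph-vg/osd-block-0}"

    # Pre-steps, i.e. what "rook osd activate" does today: adjust lvm.conf,
    # then activate via ceph-volume. Guarded by DRY_RUN so the sketch can
    # run without LVM installed.
    if [ "${DRY_RUN:-0}" != 1 ]; then
        ceph-volume lvm activate --no-systemd "$OSD_ID" "$OSD_FSID"
    fi

    # Run the OSD as a child so this shell (PID 1) can catch signals;
    # on SIGTERM/SIGINT, kill the OSD, mirroring the current behavior.
    $OSD_CMD &
    osd_pid=$!
    trap 'kill -9 "$osd_pid" 2>/dev/null' TERM INT
    wait "$osd_pid"

    # The OSD has stopped (normally or via signal): release the LV so the
    # block device can safely be re-attached to another machine.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "would run: lvchange -an $LV_PATH"
    else
        lvchange -an "$LV_PATH"
    fi
}
```

The catch described above is visible here too: the `lvchange -an` has to run *after* `wait` returns, i.e. after the OSD process is gone, which is exactly the step that can't live inside the OSD code itself.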
I have a patch here: https://github.com/rook/rook/pull/4386. It
handles the bullet points above, but one thing it cannot do is the
signal catching and the LV de-activation.
Before you ask: Kubernetes has pre/post-stop hooks, but they are not
reliable. It is known and documented that there is no guarantee they
will actually run before or after the container starts/stops. We tried
them and we had issues.
Why do we want to stop using the rook binary for activation? Because
every new rook binary version (i.e. every new operator version)
restarts all the OSDs, even when nothing in the deployment spec
changed except the rook image version.
Also, with containers, we have seen so many issues working with LVM,
just to name a few:
* adapting lvm filters
* interactions with udev - the lvm config must be tuned, and even c-v
itself has an lvm flag to disable the built-in udev sync
* several bind mounts
* the lvm package must be present on the host even when running in containers
* SELinux - yes, lvm calls SELinux commands under the hood and pollutes
the logs in some scenarios
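To make the udev point concrete, here is a sketch of the kind of lvm.conf edit a containerized deployment ends up making. The helper name is hypothetical and the two settings shown are just the commonly adjusted udev knobs, not an exhaustive list:

```shell
# Illustrative sketch: disable the udev-related lvm.conf knobs, since there
# is no udev daemon inside the container to sync with. The exact settings
# and values here are assumptions, not a complete recipe.
tweak_lvm_conf() {
    conf="$1"
    sed -i \
        -e 's/udev_sync = 1/udev_sync = 0/' \
        -e 's/udev_rules = 1/udev_rules = 0/' \
        "$conf"
}
```

The point is that none of this tuning would be needed at all without LVM in the picture.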
Currently, one of the ways I can see this working is by not using LVM
when bootstrapping OSDs. Unfortunately, some of the logic cannot go in
the OSD code, since the LV de-activation happens after the OSD stops.
We need to de-activate the LV so that, when running in the cloud, the
block device can safely be re-attached to a new machine without LVM
issues.
I know this will be a bit challenging and might ultimately look like
ceph-disk but it'd be nice to consider it.
What about a small prototype for Bluestore with block/db/wal on the same disk?
You raise some good points here, and I agree that there are many
issues with containers and LVM. There were also quite a few issues
with ceph-disk in containers, but those issues matter less than making
OSD provisioning easier for everyone else.
One of the main ideas I brought up when trying to design ceph-volume
was to be completely agnostic about how the OSDs came to be:
partitions? full devices? LVM? something else?
It was interesting to imagine a scenario where the setup didn't matter
much, and ceph-volume would just be in charge of "activating"
(ensuring everything is ready for the ceph-osd daemon). That idea got
push-back in favor of being opinionated and choosing LVM. The amount
of LVM-specific internals ceph-volume now has to deal with is
enormous, because with LVM came requests for more flexibility and more
options to make it easier to use.
The `simple` sub-command was an attempt to introduce that hands-off
approach to OSD activation, requiring just a little bit of metadata in
/etc/ceph/osd/*.json, where each OSD is represented by a single JSON
file with some information. That approach not only works well for
ceph-disk OSDs, but should also work with whatever else you may come
up with... have you tried `simple` and not gotten results? If so, what
went wrong?
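For reference, a sketch of what that flow looks like. The JSON below is illustrative only: in a real deployment `ceph-volume simple scan` generates the file under /etc/ceph/osd, and the exact keys may differ from what is shown here; the /tmp path and fsid are stand-ins so the sketch runs anywhere.

```shell
# Hypothetical example of the `ceph-volume simple` metadata flow.
OSD_ID=0
OSD_FSID=abcd-1234   # placeholder fsid

# Stand-in for /etc/ceph/osd so this sketch is runnable without root.
mkdir -p /tmp/etc-ceph-osd
cat > "/tmp/etc-ceph-osd/${OSD_ID}-${OSD_FSID}.json" <<EOF
{
    "type": "bluestore",
    "fsid": "${OSD_FSID}",
    "data": {"path": "/dev/sdb1", "uuid": "placeholder"}
}
EOF

# With the real file under /etc/ceph/osd, activation would just be
# (commented out here, since it needs an actual OSD):
#   ceph-volume simple activate ${OSD_ID} ${OSD_FSID}
```

The appeal for something like Rook is that whatever tool creates the OSD only has to leave this small JSON breadcrumb behind; activation stays generic.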
Another option, if `simple` doesn't achieve what Rook needs, is to
implement a separate sub-command (ceph-volume container?) as a plugin,
so that it reuses all the well-tested utilities that ceph-volume
already has. The ZFS plugin did something like that already.
Creating OSDs on your (Rook's) own is a *very* hard task to get right,
not to mention the many different ways OSDs can be configured:
filestore (dedicated, collocated), bluestore (data, data+db, data+wal,
data+db+wal), dmcrypt or unencrypted. Plus other nuances like talking
to the monitor, and sending/retrieving information that has changed
between releases.
If this gets rejected, I might try a prototype for not using c-v in
Rook, or something else that might come out of this discussion.
Thanks!
–––––––––
Sébastien Han
Senior Principal Software Engineer, Storage Architect
"Always give 100%. Unless you're giving blood."
_______________________________________________
Dev mailing list -- dev(a)ceph.io
To unsubscribe send an email to dev-leave(a)ceph.io