On Fri, Dec 06, 2019 at 03:41:29PM +0000, Sage Weil wrote:
On Fri, 6 Dec 2019, Sebastien Han wrote:
Cool, that works for me!
Okay, so this won't work for a few reasons: (1) ceph-osd drops root privs
so we can't do anything fancy on shutdown, and (2) the signal handler
isn't set up right when the process starts, so it'll always be racy (the
teardown process might not happen). Having the caller do this is really
the right thing.
After chatting with Seb, though, I think we really have two different
problems:
1) Seb's AWS problem: you can't do an EBS detach if there is an active
VG(/LV) on the device. To fix this, you need to do vgchange -an, which
deactivates the LVs and VG. AFAICS, this doesn't make any sense on a
bare-metal host, and would step on the toes of the generic LVM and udev
infrastructure, which magically activates all the LV devices it finds
(and AFAICS doesn't ever try to disable that). (Also, IIRC c-v has a VG
per cluster, so if you deactivate the entire VG, wouldn't that kill all
OSDs on the host for that cluster, not just the one on the EBS volume
you're detaching?)
VGs are generally per device. In some cases c-v creates multi-device VGs, but
there is a bug open for that and I'm working on removing this scenario. However,
a VG might still contain volumes from multiple OSDs (multi-device OSDs), so
deactivating a VG might still kill multiple OSDs.
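Concretely, the deactivate-before-detach dance under discussion looks roughly
like this (a sketch only; the device path and volume id are hypothetical, and
on a VG that hosts LVs for more than one OSD this would stop all of them):

```shell
#!/bin/sh
# Find the VG backing the device we want to detach (device path is an example).
dev=/dev/xvdf
vg="$(pvs --noheadings -o vg_name "$dev" | tr -d ' ')"

# Deactivate every LV in that VG so the kernel releases the block device.
# WARNING: if the VG contains LVs for several OSDs, this stops all of them.
vgchange -an "$vg"

# Only now can the cloud volume be detached, e.g. (hypothetical volume id):
# aws ec2 detach-volume --volume-id vol-0123456789abcdef0
```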
In any case, the problem feels like an EBS vs LVM problem. And I think
I'm back to Seb's original proposal here: the simplest way to solve this
is to just not use LVM at all and to put bluestore on the raw device.
You won't get dmcrypt or other fancy LVM features, but for EBS you don't
need any of them (except, maybe, in the future, growing a volume/OSD, but
that's something we need to teach bluestore to do regardless).
2) My ceph-daemon problem: to make dmcrypt work (well), IMO the decrypted
device should be set up when the OSD container is started, and torn down
when the container stops. For this, the thing that makes sense in my mind
is something like a '-f' flag for ceph-volume activate. IIUC, right now
activate does something like
1- set up decrypted LV, if needed
2- populate /var/lib/ceph/osd/ceph-NN dir
3- start systemd unit (unless the --no-systemd flag is passed, as we
currently do with containers)
4- exit.
Instead, with the -f flag, it would
1,2- same
3- run ceph-osd -f -i ... in the foreground; watch for signals and
pass them along to shut down the OSD
4- clean up /var/lib/ceph/osd/ceph-NN
5- stop the decrypted LV
6- exit
This makes me realize that steps 4 and 5 don't currently exist anywhere:
there is no such thing as 'ceph-volume lvm deactivate'. If we had that
second part, a simple wrapper could accomplish the same thing as -f.
AFAICS this exists for simple mode, though not for the lvm case.
In any case I think it makes sense to enable c-v to wrap the osd and clean up
after it.
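Such a wrapper could be sketched today along these lines (the OSD id/fsid are
placeholders, and the final 'ceph-volume lvm deactivate' call is precisely the
subcommand that does not exist yet):

```shell
#!/bin/sh
# Hypothetical wrapper reproducing the proposed 'activate -f' flow.
OSD_ID=2
OSD_FSID=9a531951-50f2-4d48-b012-0aef0febc301

# 1,2- set up the (decrypted) LV and populate /var/lib/ceph/osd/ceph-$OSD_ID,
#      without starting a systemd unit
ceph-volume lvm activate --no-systemd "$OSD_ID" "$OSD_FSID"

# 3- run the OSD in the foreground and forward SIGTERM/SIGINT to it
ceph-osd -f -i "$OSD_ID" &
osd_pid=$!
trap 'kill -TERM "$osd_pid"' TERM INT
wait "$osd_pid"
wait "$osd_pid" 2>/dev/null   # re-wait in case the first wait was interrupted

# 4,5- clean up the osd dir and tear down the LVs; this subcommand does not
#      exist yet and is the missing piece
ceph-volume lvm deactivate "$OSD_ID" "$OSD_FSID"
```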
I think we should pursue those 2 paths (barebones bluestore c-v mode and
c-v lvm deactivate) separately...
sage
> –––––––––
> Sébastien Han
> Senior Principal Software Engineer, Storage Architect
>
> "Always give 100%. Unless you're giving blood."
>
> On Fri, Dec 6, 2019 at 3:03 PM Sage Weil <sweil(a)redhat.com> wrote:
> >
> > On Fri, 6 Dec 2019, Sebastien Han wrote:
> > > If not in ceph-osd, can we have ceph-osd execute a hook before exiting 0?
> > > Reading a hook script from /etc/ceph/hook.d something like that would
> > > be nice so that we don't need a wrapper.
> >
> > Hmm, maybe if it was just osd_exec_on_shutdown=string, and that could
> > be something like "vgchange ..." or "bash -c ..."? We'd need to make
> > sure we're setting FD_CLOEXEC on all the right file handles though. I can
> > give it a go..
> >
> > sage
> >
> > >
> > > Thoughts?
> > >
> > > Thanks!
> > >
> > > On Fri, Dec 6, 2019 at 2:50 PM Sage Weil <sweil(a)redhat.com> wrote:
> > > >
> > > > On Fri, 6 Dec 2019, Sebastien Han wrote:
> > > > > I understand this is asking a lot from the ceph-volume side.
> > > > > We can explore a new wrapper binary or perhaps from the ceph-osd itself.
> > > > >
> > > > > Maybe crazy/stupid idea, can we have a de-activate call from the osd
> > > > > process itself? ceph-osd gets SIGTERM, closes the connection to the
> > > > > device, then runs "vgchange -an <vg>", is this realistic?
> > > >
> > > > Not really... it's hard (or gross) to do a hard/immediate exit that tears
> > > > down all of the open handles to the device. I think this is not a nice
> > > > way to layer things. I'd prefer either a c-v command or a separate wrapper
> > > > script for this.
> > > >
> > > > sage
> > > >
> > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > On Fri, Dec 6, 2019 at 1:44 PM Alfredo Deza <adeza(a)redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Dec 6, 2019 at 5:59 AM Sebastien Han <shan(a)redhat.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Following up on my previous ceph-volume email as promised.
> > > > > > >
> > > > > > > When running Ceph with Rook in Kubernetes in the Cloud (AWS, Azure,
> > > > > > > Google, whatever), the OSDs are backed by PVCs (Cloud block storage)
> > > > > > > attached to virtual machines.
> > > > > > > This makes the storage portable: if the VM dies, the device will be
> > > > > > > attached to a new virtual machine and the OSD will resume running.
> > > > > > >
> > > > > > > In Rook, we have 2 main deployments for the OSD:
> > > > > > >
> > > > > > > 1. Prepare the disk to become an OSD
> > > > > > > Prepare will run on the VM, attach the block device, run "ceph-volume
> > > > > > > prepare", then this gets complicated. After this, the device is
> > > > > > > supposed to be detached from the VM because the container terminated.
> > > > > > > However, the block device is still held by LVM, so the VG must be
> > > > > > > de-activated. Currently, we do this in Rook, but it would be nice to
> > > > > > > de-activate the VG once ceph-volume is done preparing the disk in a
> > > > > > > container.
> > > > > > >
> > > > > > > 2. Activate the OSD.
> > > > > > > Now, onto the new container, the device is attached again on the VM.
> > > > > > > At this point, more changes will be required in ceph-volume,
> > > > > > > particularly in the "activate" call.
> > > > > > > a. ceph-volume should activate the VG
> > > > > >
> > > > > > By VG you mean LVM's Volume Group?
> > > > > >
> > > > > > > b. ceph-volume should activate the device normally
> > > > > >
> > > > > > Not "normally" though, right? That would imply starting the OSD which
> > > > > > you are indicating is not desired.
> > > > > >
> > > > > > > c. ceph-volume should run the ceph-osd process in foreground as well
> > > > > > > as accepting flags to that CLI; we could have something like:
> > > > > > > "ceph-volume lvm activate --no-systemd $STORE_FLAG $OSD_ID $OSD_UUID
> > > > > > > <a bunch of flags>"
> > > > > > > Perhaps we need a new flag to indicate we want to run the osd
> > > > > > > process in foreground?
> > > > > > > Here is an example of how an OSD runs today:
> > > > > > >
> > > > > > > ceph-osd --foreground --id 2 --fsid
> > > > > > > 9a531951-50f2-4d48-b012-0aef0febc301 --setuser ceph --setgroup ceph
> > > > > > > --crush-location=root=default host=minikube --default-log-to-file
> > > > > > > false --ms-learn-addr-from-peer=false
> > > > > > >
> > > > > > > --> we can have a bunch of flags or an ENV var with all the flags,
> > > > > > > whatever you prefer.
> > > > > > >
> > > > > > > This wrapper should watch for signals too; it should reply to
> > > > > > > SIGTERM in the following way:
> > > > > > > - stop the OSD
> > > > > > > - de-activate the VG
> > > > > > > - exit 0
> > > > > > >
> > > > > > > Just a side note: the VG must be de-activated when the container stops
> > > > > > > so that the block device can be detached from the VM; otherwise,
> > > > > > > it'll still be held by LVM.
> > > > > >
> > > > > > I am worried that this goes beyond what I consider the scope of
> > > > > > ceph-volume, which is: prepare device(s) to be part of an OSD.
> > > > > >
> > > > > > Catching signals, handling the OSD in the foreground, and accepting
> > > > > > (proxying) flags sounds problematic for a robust implementation in
> > > > > > ceph-volume, even if that means it will help Rook in this case.
> > > > > >
> > > > > > The other challenge I see is that it seems Ceph is in a transition
> > > > > > from being a baremetal project to a container one, except lots of
> > > > > > tooling (like ceph-volume) is deeply tied to the non-containerized
> > > > > > workflows. This makes it difficult (and non-obvious!) in ceph-volume
> > > > > > when adding more flags to do things that help the containerized
> > > > > > deployment.
> > > > > >
> > > > > > To solve the issues you describe, I think you need either a separate
> > > > > > command-line tool that can invoke ceph-volume with the added features
> > > > > > you listed, or, if there is significant push to get more things in
> > > > > > ceph-volume, a separate sub-command, so that `lvm` is isolated from
> > > > > > the conflicting logic.
> > > > > >
> > > > > > My preference would be a wrapper script, separate from the Ceph project.
> > > > > >
> > > > > > >
> > > > > > > Hopefully, I was clear :).
> > > > > > > This is just a proposal; if you feel like this could be done
> > > > > > > differently, feel free to suggest.
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > >
> > >
>
_______________________________________________
Dev mailing list -- dev(a)ceph.io
To unsubscribe send an email to dev-leave(a)ceph.io
--
Jan Fajerski
Senior Software Engineer Enterprise Storage
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer