On Fri, Dec 06, 2019 at 03:41:29PM +0000, Sage Weil wrote:
On Fri, 6 Dec 2019, Sebastien Han wrote:
Cool, that works for me!
Okay, so this won't work for a few reasons: (1) ceph-osd drops root privs
so we can't do anything fancy on shutdown, and (2) the signal handler
isn't set up right when the process starts, so it'll always be racy (the
teardown process might not happen). Having the caller do this is really
the right thing.
After chatting with Seb, though, I think we really have two different
problems:
1) Seb's AWS problem: you can't do an EBS detach if there is an active
VG(/LV) on the device. To fix this, you need to do vgchange -an, which
deactivates the LVs and VG. AFAICS, this doesn't make any sense on a
bare-metal host, and would step on the toes of the generic LVM and udev
infrastructure, which magically activates all the LV devices it finds
(and AFAICS doesn't ever try to disable that). (Also, IIRC c-v has a VG
per cluster, so if you deactivate the entire VG, wouldn't that kill all
OSDs on the host for that cluster, not just the one on the EBS volume
you're detaching?)
VGs are generally per device. In some cases c-v creates multi-device VGs, but
there is a bug open for that and I'm working on removing this scenario. However,
a VG might still contain volumes from multiple OSDs (multi-device OSDs), so
deactivating a VG might still kill multiple OSDs.
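Concretely, the deactivate-before-detach dance under discussion looks roughly
like this (a sketch only; the device path and volume id are hypothetical, and
on a VG that hosts LVs for more than one OSD this would stop all of them):

```shell
#!/bin/sh
# Find the VG backing the device we want to detach (device path is an example).
dev=/dev/xvdf
vg="$(pvs --noheadings -o vg_name "$dev" | tr -d ' ')"

# Deactivate every LV in that VG so the kernel releases the block device.
# WARNING: if the VG contains LVs for several OSDs, this stops all of them.
vgchange -an "$vg"

# Only now can the cloud volume be detached, e.g. (hypothetical volume id):
# aws ec2 detach-volume --volume-id vol-0123456789abcdef0
```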
In any case, the problem feels like an EBS vs LVM problem. And I think
I'm back to Seb's original proposal here: the simplest way to solve this
is to just not use LVM at all and to put bluestore on the raw device.
You won't get dmcrypt or other fancy LVM features, but for EBS you don't
need any of them (except, maybe, in the future, growing a volume/OSD, but
that's something we need to teach bluestore to do regardless).
2) My ceph-daemon problem: to make dmcrypt work (well), IMO the decrypted
device should be set up when the OSD container is started, and torn down
when the container stops. For this, the thing that makes sense in my mind
is something like a '-f' flag for ceph-volume activate. IIUC, right now
activate does something like
1- set up decrypted LV, if needed
2- populate /var/lib/ceph/osd/ceph-NN dir
3- start systemd unit (unless the --no-systemd flag is passed, as we
currently do with containers)
4- exit.
Instead, with the -f flag, it would
1,2- same
3- run ceph-osd -f -i ... in the foreground; watch for signals and
pass them along to shut down the OSD
4- clean up /var/lib/ceph/osd/ceph-NN
5- stop the decrypted LV
6- exit
This makes me realize that steps 4 and 5 don't currently exist anywhere:
there is no such thing as 'ceph-volume lvm deactivate'. If we had that
second part, a simple wrapper could accomplish the same thing as -f.
AFAICS this exists for simple mode, though not for the lvm case.
In any case I think it makes sense to enable c-v to wrap the osd and clean up
after it.
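Such a wrapper could be sketched today along these lines (the OSD id/fsid are
placeholders, and the final 'ceph-volume lvm deactivate' call is precisely the
subcommand that does not exist yet):

```shell
#!/bin/sh
# Hypothetical wrapper reproducing the proposed 'activate -f' flow.
OSD_ID=2
OSD_FSID=9a531951-50f2-4d48-b012-0aef0febc301

# 1,2- set up the (decrypted) LV and populate /var/lib/ceph/osd/ceph-$OSD_ID,
#      without starting a systemd unit
ceph-volume lvm activate --no-systemd "$OSD_ID" "$OSD_FSID"

# 3- run the OSD in the foreground and forward SIGTERM/SIGINT to it
ceph-osd -f -i "$OSD_ID" &
osd_pid=$!
trap 'kill -TERM "$osd_pid"' TERM INT
wait "$osd_pid"
wait "$osd_pid" 2>/dev/null   # re-wait in case the first wait was interrupted

# 4,5- clean up the osd dir and tear down the LVs; this subcommand does not
#      exist yet and is the missing piece
ceph-volume lvm deactivate "$OSD_ID" "$OSD_FSID"
```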
I think we should pursue those 2 paths (barebones bluestore c-v mode and
c-v lvm deactivate) separately...
sage
> –––––––––
> Sébastien Han
> Senior Principal Software Engineer, Storage Architect
>
> "Always give 100%. Unless you're giving blood."
>
> On Fri, Dec 6, 2019 at 3:03 PM Sage Weil <sweil(a)redhat.com> wrote:
> >
> > On Fri, 6 Dec 2019, Sebastien Han wrote:
> > > If not in ceph-osd, can we have ceph-osd execute a hook before exiting 0?
> > > Reading a hook script from /etc/ceph/hook.d something like that would
> > > be nice so that we don't need a wrapper.
> >
> > Hmm, maybe if it was just osd_exec_on_shutdown=string, and that could
> > be something like "vgchange ..." or "bash -c ..."? We'd need to make
> > sure we're setting FD_CLOEXEC on all the right file handles though. I can
> > give it a go..
> >
> > sage
> >
> > >
> > > Thoughts?
> > >
> > > Thanks!
> > >
> > > On Fri, Dec 6, 2019 at 2:50 PM Sage Weil <sweil(a)redhat.com> wrote:
> > > >
> > > > On Fri, 6 Dec 2019, Sebastien Han wrote:
> > > > > I understand this is asking a lot from the ceph-volume side.
> > > > > We can explore a new wrapper binary or perhaps from the ceph-osd itself.
> > > > >
> > > > > Maybe crazy/stupid idea, can we have a de-activate call from the osd
> > > > > process itself? ceph-osd gets SIGTERM, closes the connection to the
> > > > > device, then runs "vgchange -an <vg>", is this realistic?
> > > >
> > > > Not really... it's hard (or gross) to do a hard/immediate exit that tears
> > > > down all of the open handles to the device. I think this is not a nice
> > > > way to layer things. I'd prefer either a c-v command or a separate wrapper
> > > > script for this.
> > > >
> > > > sage
> > > >
> > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > On Fri, Dec 6, 2019 at 1:44 PM Alfredo Deza <adeza(a)redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Dec 6, 2019 at 5:59 AM Sebastien Han <shan(a)redhat.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Following up on my previous ceph-volume email as promised.
> > > > > > >
> > > > > > > When running Ceph with Rook in Kubernetes in the Cloud (AWS, Azure,
> > > > > > > Google, whatever), the OSDs are backed by PVCs (Cloud block storage)
> > > > > > > attached to virtual machines.
> > > > > > > This makes the storage portable: if the VM dies, the device will be
> > > > > > > attached to a new virtual machine and the OSD will resume running.
> > > > > > >
> > > > > > > In Rook, we have 2 main deployments for the OSD:
> > > > > > >
> > > > > > > 1. Prepare the disk to become an OSD
> > > > > > > Prepare will run on the VM, attach the block device, run "ceph-volume
> > > > > > > prepare", then this gets complicated. After this, the device is
> > > > > > > supposed to be detached from the VM because the container terminated.
> > > > > > > However, the block device is still held by LVM, so the VG must be
> > > > > > > de-activated. Currently, we do this in Rook, but it would be nice to
> > > > > > > de-activate the VG once ceph-volume is done preparing the disk in a
> > > > > > > container.
> > > > > > >
> > > > > > > 2. Activate the OSD.
> > > > > > > Now, onto the new container, the device is attached again on the VM.
> > > > > > > At this point, more changes will be required in ceph-volume,
> > > > > > > particularly in the "activate" call.
> > > > > > > a. ceph-volume should activate the VG
> > > > > >
> > > > > > By VG you mean LVM's Volume Group?
> > > > > >
> > > > > > > b. ceph-volume should activate the device normally
> > > > > >
> > > > > > Not "normally" though, right? That would imply starting the OSD which
> > > > > > you are indicating is not desired.
> > > > > >
> > > > > > > c. ceph-volume should run the ceph-osd process in foreground as well
> > > > > > > as accepting flags to that CLI; we could have something like:
> > > > > > > "ceph-volume lvm activate --no-systemd $STORE_FLAG $OSD_ID $OSD_UUID
> > > > > > > <a bunch of flags>"
> > > > > > > Perhaps we need a new flag to indicate we want to run the osd
> > > > > > > process in foreground?
> > > > > > > Here is an example of how an OSD runs today:
> > > > > > >
> > > > > > > ceph-osd --foreground --id 2 --fsid
> > > > > > > 9a531951-50f2-4d48-b012-0aef0febc301 --setuser ceph --setgroup ceph
> > > > > > > --crush-location=root=default host=minikube --default-log-to-file
> > > > > > > false --ms-learn-addr-from-peer=false
> > > > > > >
> > > > > > > --> we can have a bunch of flags or an ENV var with all the flags,
> > > > > > > whatever you prefer.
> > > > > > >
> > > > > > > This wrapper should watch for signals too; it should reply to
> > > > > > > SIGTERM in the following way:
> > > > > > > - stop the OSD
> > > > > > > - de-activate the VG
> > > > > > > - exit 0
> > > > > > >
> > > > > > > Just a side note: the VG must be de-activated when the container stops
> > > > > > > so that the block device can be detached from the VM; otherwise,
> > > > > > > it'll still be held by LVM.
> > > > > >
> > > > > > I am worried that this goes beyond what I consider the scope of
> > > > > > ceph-volume, which is: prepare device(s) to be part of an OSD.
> > > > > >
> > > > > > Catching signals, handling the OSD in the foreground, and accepting
> > > > > > (proxying) flags sounds problematic for a robust implementation in
> > > > > > ceph-volume, even if that means it will help Rook in this case.
> > > > > >
> > > > > > The other challenge I see is that it seems Ceph is in a transition
> > > > > > from being a baremetal project to a container one, except lots of
> > > > > > tooling (like ceph-volume) is deeply tied to the non-containerized
> > > > > > workflows. This makes it difficult (and non-obvious!) in ceph-volume
> > > > > > when adding more flags to do things that help the containerized
> > > > > > deployment.
> > > > > >
> > > > > > To solve the issues you describe, I think you need either a separate
> > > > > > command-line tool that can invoke ceph-volume with the added features
> > > > > > you listed, or, if there is significant push to get more things in
> > > > > > ceph-volume, a separate sub-command, so that `lvm` is isolated from
> > > > > > the conflicting logic.
> > > > > >
> > > > > > My preference would be a wrapper script, separate from the Ceph project.
> > > > > >
> > > > > > >
> > > > > > > Hopefully, I was clear :).
> > > > > > > This is just a proposal; if you feel like this could be done
> > > > > > > differently, feel free to suggest.
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > >
> > >
>
_______________________________________________
Dev mailing list -- dev(a)ceph.io
To unsubscribe send an email to dev-leave(a)ceph.io
--
Jan Fajerski
Senior Software Engineer Enterprise Storage
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer