Ceph rbd clients surrender exclusive lock in critical situation

List overview All Threads
Download

newer

older

journal fills ...

MDS crash in "inotablev ==...

Frank Schilder

18 Jan 2023 18 Jan '23

12:17 p.m.

Hi all, we are observing a problem on a libvirt virtualisation cluster that might come from ceph rbd clients. Something went wrong during execution of a live-migration operation and as a result we have two instances of the same VM running on 2 different hosts, the source- and the destination host. What we observe now is the the exclusive lock of the RBD disk image moves between these two clients periodically (every few minutes the owner flips). We are pretty sure that no virsh commands possibly having that effect are executed during this time. The client connections are not lost and the OSD blacklist is empty. I don't understand why a ceph rbd client would surrender an exclusive lock in such a split brain situation, its exactly when it needs to hold on to it. As a result, the affected virtual drives are corrupted. The questions we have in this context are: Under what conditions does a ceph rbd client surrender an exclusive lock? Could this be a bug in the client or a ceph config error? Is this a known problem with libceph and libvirtd? Anyone else making the same observation and having some guidance? The VM hosts are on alma8 and we use the advanced virtualisation repo providing very recent versions of qemu and libvirtd. We have seen this floating exclusive lock before on mimic. Now we are on octopus and I can't really blame it on the old ceph version any more. We use opennebula as a KVM front-end. Thanks for any pointers! ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14

Show replies by date

Ilya Dryomov

18 Jan 18 Jan

1:26 p.m.

On Wed, Jan 18, 2023 at 1:19 PM Frank Schilder <frans(a)dtu.dk> wrote:

...

Hi Frank, If you are talking about RBD exclusive lock feature ("exclusive-lock" under "features" in "rbd info" output) then this is expected. This feature provides automatic cooperative lock transitions between clients to ensure that only a single client is writing to the image at any given time. It's there to protect internal per-image data structures such as the object map, the journal or the client-side PWL (persistent write log) cache from concurrent modifications in case the image is opened by two or more clients. The name is confusing but it's NOT about preventing other clients from opening and writing to the image. Rather it's about serializing those writes.

...

We are pretty sure that no virsh commands possibly having that effect are executed during this time. The client connections are not lost and the OSD blacklist is empty. I don't understand why a ceph rbd client would surrender an exclusive lock in such a split brain situation, its exactly when it needs to hold on to it. As a result, the affected virtual drives are corrupted.

There is no split-brain from the Ceph POV here. RBD has always supported the multiple clients use case.

...

The questions we have in this context are: Under what conditions does a ceph rbd client surrender an exclusive lock?

Exclusive lock transitions are cooperative so any time another client asks for it (not immediately though -- the current lock owner finishes processing in-flight I/O and flushes its caches first).

...

Could this be a bug in the client or a ceph config error?

Very unlikely. There is a way to disable automatic lock transitions but I don't think it's wired up in QEMU.

...

Is this a known problem with libceph and libvirtd?

Not sure what you mean by libceph. Thanks, Ilya

Frank Schilder

2:25 p.m.

Hi Ilya, thanks a lot for the information. Yes, I was talking about the exclusive lock feature and was under the impression that only one rbd client can get write access on connect and will keep it until disconnect. The problem we are facing with multi-VM write access is, that this will inevitably corrupt the file system created on the rbd if two instances can get write access. Its not a shared file system, its just an xfs formatted virtual disk.

...

There is a way to disable automatic lock transitions but I don't think it's wired up in QEMU.

Can you point me to some documentation about that? It sounds like this is what would be needed to avoid the file system corruption in our use case. The lock transition should be initiated from the outside and the lock should then stay fixed on the client holding it until it is instructed to give up the lock or it disconnects.

...

Is this a known problem with libceph and libvirtd?

Not sure what you mean by libceph.

I simply meant that its not a krbd client. Libvirt uses libceph (or was it librbd?) to emulate virtual drives, not krbd. Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Ilya Dryomov <idryomov(a)gmail.com> Sent: 18 January 2023 14:26:54 To: Frank Schilder Cc: ceph-users(a)ceph.io Subject: Re: [ceph-users] Ceph rbd clients surrender exclusive lock in critical situation On Wed, Jan 18, 2023 at 1:19 PM Frank Schilder <frans(a)dtu.dk> wrote:

...

There is no split-brain from the Ceph POV here. RBD has always supported the multiple clients use case.

...

The questions we have in this context are: Under what conditions does a ceph rbd client surrender an exclusive lock?

Exclusive lock transitions are cooperative so any time another client asks for it (not immediately though -- the current lock owner finishes processing in-flight I/O and flushes its caches first).

...

Could this be a bug in the client or a ceph config error?

Very unlikely. There is a way to disable automatic lock transitions but I don't think it's wired up in QEMU.

...

Is this a known problem with libceph and libvirtd?

Not sure what you mean by libceph. Thanks, Ilya

Ilya Dryomov

5:12 p.m.

On Wed, Jan 18, 2023 at 3:25 PM Frank Schilder <frans(a)dtu.dk> wrote:

...

There is a way to disable automatic lock transitions but I don't think it's wired up in QEMU.

It looks like there is not much documentation on this specific aspect beyond a few scattered notes which I'm pasting below:

...

To disable transparent lock transitions between multiple clients, the client must acquire the lock by using the RBD_LOCK_MODE_EXCLUSIVE flag.

...

Per mapping (block device) rbd device map options: [...] - exclusive - Disable automatic exclusive lock transitions. Equivalent to --exclusive.

(Yes, both the flag and the option are also named "exclusive". Don't ask why...) However note that for krbd, --exclusive comes with some strings attached. For QEMU, there is no such option at all -- as already mentioned, RBD_LOCK_MODE_EXCLUSIVE flag is not wired up there. Ultimately, it's the responsibility of the orchestration layer to prevent situations like this from happening. Ceph just provides storage, it can't really be involved in managing one's VMs or deciding whether multi-VM access is OK. The orchestration layer may choose to use some of the RBD primitives for this (whether exclusive locks or advisory locks -- see "rbd lock add", "rbd lock ls" and "rbd lock rm" commands), use something else or do nothing at all...

...

Is this a known problem with libceph and libvirtd?

Not sure what you mean by libceph.

I simply meant that its not a krbd client. Libvirt uses libceph (or was it librbd?) to emulate virtual drives, not krbd.

libceph is actually one of the kernel modules. libvirt/QEMU usually use librbd but it's completely up to the user. Nothing prevents you from feeding some krbd devices to libvirt/QEMU, for example. Thanks, Ilya

Frank Schilder

19 Jan 19 Jan

11:50 a.m.

Hi Ilya, thanks for the info, it did help. I agree, its the orchestration layer's responsibility to handle things right. I have a case open already with support and it looks like there is indeed a bug on that side. I was mainly after a way that ceph librbd clients could offer a safety net in case such bugs occur. Its a bit like the four-eyes principle, having an orchestration layer do things right is good, but having a second instance confirming the same thing is much better. A bug in one layer will not cause a catastrophe, because the second layer catches it. I'm not sure if the rbd lock capabilities are sufficiently powerful to provide a command-line interface to that. The flag RBD_LOCK_MODE_EXCLUSIVE seems the only way and if qemu is not using it, there seems not a lot one can do in scripts. Thanks for your help and best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14

Andreas Teuchert

12:30 p.m.

Hi Frank, one thing that might be relevant here: If you disable transparent lock transitions, you cannot create snapshots of images that are in use in such a way. This may or may not be relevant in your case. I'm just mentioning it because I myself was surprised by that. Best regards, Andreas On 19.01.23 12:50, Frank Schilder wrote:

...

Frank Schilder

12:53 p.m.

Hi Andreas, thanks for that piece of information. I understand that transient lock migration is important under "normal" operational conditions. The use case I have in mind is the process of live-migration, when one might want to do a clean hand-over of a lock between two librbd clients. Specifically, in our use case we ended up with 2 VMs running on the same image and ultimately destroying the local file system. Here, I could imagine that the *eligibility* to aquire a transient lock could be managed by a lock list or something so that the orchestrator can, in an atomic operation, mark the target client as eligible and the source client as not. Then, the transient lock migration cannot happen to clients that should not get this lock under any circumstances. The orchestrator knows about that, but I don't see a way to communicate this to a state of an rbd image so that the ceph side could enforce that independently as well. This would help dramatically with troubleshooting orchestrator errors as these are much less likely to lead to data corruption. I could imagine a command that would allow to make a client read-only no matter what, for example, lock a client in read-only mode. Such a client should not be able to aquire an exclusive write lock, it should simply be ignored when making such a request. Maybe I misunderstand the rbd lock add documentation, but it doesn't seem to offer this kind of permission lock. It would be great if I'm wrong or if it could be added. Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Andreas Teuchert <a.teuchert(a)syseleven.de> Sent: 19 January 2023 13:30:52 To: Frank Schilder; Ilya Dryomov Cc: ceph-users(a)ceph.io Subject: Re: [ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation Hi Frank, one thing that might be relevant here: If you disable transparent lock transitions, you cannot create snapshots of images that are in use in such a way. This may or may not be relevant in your case. I'm just mentioning it because I myself was surprised by that. Best regards, Andreas On 19.01.23 12:50, Frank Schilder wrote:

...

485

days inactive

486

days old

ceph-users@ceph.io

Manage subscription

6 comments

3 participants

tags (0)

participants (3)

Andreas Teuchert
Frank Schilder
Ilya Dryomov