Hi Chris and Wissem,
finally found the time:
https://tracker.ceph.com/issues/50638
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Chris Dunlop <chris(a)onthe.net.au>
Sent: 16 March 2021 03:56:50
To: Frank Schilder
Cc: ceph-users(a)ceph.io; Wissem MIMOUNA
Subject: Re: [ceph-users] OSD id 241 != my id 248: conversion from "ceph-disk"
to "ceph-volume simple" destroys OSDs
Hi Frank,
I suggest you file the ticket, as you have the full story and the
use case to go with it.
I'm just an interested bystander; I happened to know a little about
this area because of a filestore to bluestore migration I did recently.
Cheers,
Chris
On Fri, Mar 12, 2021 at 12:48:56PM +0000, Frank Schilder wrote:
> Hi Chris,
>
> thanks for looking at this issue in more detail.
>
> I have two communications on this issue, and I'm afraid you didn't get all the information. There seem to be at least two occurrences of the same bug. Yes, I'm pretty sure data.path should also be a stable device path instead of /dev/sdq1. But that is the second occurrence of this bug; the other one is for block.path, which is not visible in the communication I sent to you but has more dramatic consequences.
>
> Please find the full story below. Unless you can do it, I will file a ticket. To me, this looks like a general pattern of accidentally using unstable device paths, which should be tracked down everywhere. If you can fix the code, you might want to add a comment to it to make sure the same mistake is not repeated.
>
> Problems:
>
> - ceph-volume simple scan|activate use unstable device paths like "/dev/sd??" instead of stable device paths like "/dev/disk/by-partuuid/UUID", which leads to OSD boot failures when the kernel renames devices at reboot
>
> - ceph-volume simple activate modifies (!!!) OSD metadata from a stable device path to an unstable device path, which not only leads to boot failures but also makes it impossible to move an OSD to a different host, because ceph-volume simple scan will now produce a corrupted json config file
>
> Setup and observation:
>
> I observed this in a situation where all disks were renamed after a reboot. I have a workflow that deploys containers per physical disk slot and performs a full OSD discovery at every container start to accommodate exchanging OSDs. The basic sequence executed every time is:
>
> ceph-volume simple scan
> ceph-volume simple activate
>
> Unfortunately, this sequence is not idempotent, because ceph-volume simple activate modifies (!!!) the symbolic link "block" on the OSD data partition to point to an unstable device path. For example (note the first occurrence of the unstable device path /dev/sdq1 in data.path):
>
> # mount /dev/sdq1 mnt
> # ls -l mnt
> [...]
> lrwxrwxrwx. 1 root root 58 Mar 11 16:17 block -> /dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4
> [...]
> # umount mnt
> # ceph-volume simple scan --stdout /dev/sdq1
> Running command: /usr/sbin/cryptsetup status /dev/sdq1
> Running command: /usr/bin/mount -v /dev/sdq1 /tmp/tmpmfitNx
> stdout: mount: /dev/sdq1 mounted on /tmp/tmpmfitNx.
> Running command: /usr/bin/umount -v /tmp/tmpmfitNx
> stderr: umount: /tmp/tmpmfitNx (/dev/sdq1) unmounted
> {
> "active": "ok",
> "block": {
> "path": "/dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
> "uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4"
> },
> "block_uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
> "bluefs": 1,
> "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
> "cluster_name": "ceph",
> "data": {
> "path": "/dev/sdq1",
> "uuid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15"
> },
> "fsid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15",
> "keyring": "AQBP4opcBeCYOxAA4sOpTthNE6T28WUf4Bgm3w==",
> "kv_backend": "rocksdb",
> "magic": "ceph osd volume v026",
> "mkfs_done": "yes",
> "none": "",
> "ready": "ready",
> "require_osd_release": "",
> "type": "bluestore",
> "whoami": 59
> }
> # ceph-volume simple activate --file "/etc/ceph/osd/59-9b88d6ec-87a4-4640-b80e-81d3d56fac15.json" --no-systemd
> Running command: /usr/bin/mount -v /dev/sdq1 /var/lib/ceph/osd/ceph-59
> stdout: mount: /dev/sdq1 mounted on /var/lib/ceph/osd/ceph-59.
> Running command: /usr/bin/ln -snf /dev/sdq2 /var/lib/ceph/osd/ceph-59/block   <<<--- Oh no !!!
> Running command: /usr/bin/chown -R ceph:ceph /dev/sdq2
> --> Skipping enabling of `simple` systemd unit
> --> Skipping masking of ceph-disk systemd units
> --> Skipping enabling and starting OSD simple systemd unit because --no-systemd was used
> --> Successfully activated OSD 59 with FSID 9b88d6ec-87a4-4640-b80e-81d3d56fac15
>
> # !!! Note the command "/usr/bin/ln -snf /dev/sdq2 /var/lib/ceph/osd/ceph-59/block" in the output,
> # which is corrupting the OSD's metadata!
>
> # ls -l /var/lib/ceph/osd/ceph-59
> [...]
> lrwxrwxrwx. 1 root root 9 Mar 12 13:06 block -> /dev/sdq2
> [...]
>
> # This OSD now holds corrupted metadata in the form of a symbolic link with an unstable device path
> # as its link target. Subsequent discoveries now produce corrupt .json config files, and moving this disk
> # to another host has turned into a real pain:
>
> # umount /var/lib/ceph/osd/ceph-59
> # ceph-volume simple scan --stdout /dev/sdq1
> Running command: /usr/sbin/cryptsetup status /dev/sdq1
> Running command: /usr/bin/mount -v /dev/sdq1 /tmp/tmpABkQsj
> stdout: mount: /dev/sdq1 mounted on /tmp/tmpABkQsj.
> Running command: /usr/bin/umount -v /tmp/tmpABkQsj
> stderr: umount: /tmp/tmpABkQsj (/dev/sdq1) unmounted
> {
> "active": "ok",
> "block": {
> "path": "/dev/sdq2",
> "uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4"
> },
> "block_uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
> "bluefs": 1,
> "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
> "cluster_name": "ceph",
> "data": {
> "path": "/dev/sdq1",
> "uuid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15"
> },
> "fsid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15",
> "keyring": "AQBP4opcBeCYOxAA4sOpTthNE6T28WUf4Bgm3w==",
> "kv_backend": "rocksdb",
> "magic": "ceph osd volume v026",
> "mkfs_done": "yes",
> "none": "",
> "ready": "ready",
> "require_osd_release": "",
> "type": "bluestore",
> "whoami": 59
> }
>
> In this example, the disk names didn't change, which implies that this OSD will still start as long as the disk is named /dev/sdq. However, if the disk names change, ceph-volume simple scan unfortunately follows the broken symlink instead of using block_uuid for discovery, which leads to a completely corrupted .json file similar to this one:
>
> # ceph-volume simple scan --stdout /dev/sdb1
> Running command: /usr/sbin/cryptsetup status /dev/sdb1
> {
> "active": "ok",
> "block": {
> "path": "/dev/sda2",
> "uuid": "b5ac1462-510a-4483-8f42-604e6adc5c9d"
> },
> "block_uuid": "1d9d89a2-18c7-4610-9dcd-167d44ce1879",
> "bluefs": 1,
> "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
> "cluster_name": "ceph",
> "data": {
> "path": "/dev/sdb1",
> "uuid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb"
> },
> "fsid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb",
> "keyring": "AQAZJ6ddedALDxAAJI7NLJ2CRFoQWK5STRpHuw==",
> "kv_backend": "rocksdb",
> "magic": "ceph osd volume v026",
> "mkfs_done": "yes",
> "none": "",
> "ready": "ready",
> "require_osd_release": "",
> "type": "bluestore",
> "whoami": 241
> }
>
> Notice that block_uuid and block.uuid no longer match. This corruption requires manual repair, and I had to do this for an entire cluster.
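A mismatch like the one above can be detected mechanically before activating anything. The following is only a minimal sketch, not part of ceph-volume: the function name check_scan_config and the constant STABLE_PREFIX are made up for illustration, and it assumes the JSON from "ceph-volume simple scan --stdout" has already been parsed into a dict.

```python
STABLE_PREFIX = "/dev/disk/by-partuuid/"  # assumption: udev provides by-partuuid links


def check_scan_config(config):
    """Return a list of problems found in a parsed scan result (hypothetical helper)."""
    problems = []

    # The UUID recorded for the block device must agree with block_uuid.
    block_uuid = config.get("block", {}).get("uuid")
    if block_uuid != config.get("block_uuid"):
        problems.append("block.uuid does not match block_uuid")

    # Every device path should be the by-partuuid path for its own UUID.
    for name in ("block", "data"):
        entry = config.get(name, {})
        expected = STABLE_PREFIX + str(entry.get("uuid"))
        if entry.get("path") != expected:
            problems.append(
                "%s.path is %s, expected %s" % (name, entry.get("path"), expected)
            )
    return problems


# Values copied from the corrupted scan output above (other keys omitted).
corrupted = {
    "block": {"path": "/dev/sda2", "uuid": "b5ac1462-510a-4483-8f42-604e6adc5c9d"},
    "block_uuid": "1d9d89a2-18c7-4610-9dcd-167d44ce1879",
    "data": {"path": "/dev/sdb1", "uuid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb"},
}

for problem in check_scan_config(corrupted):
    print(problem)
```

For the corrupted example this reports the UUID mismatch plus the two unstable paths. The first, still healthy scan further up would only be flagged for data.path.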
>
> Resolution:
>
> I ended up with all OSDs I converted from "ceph-disk" to "ceph-volume simple" failing to boot after a server reboot that shifted the device names and invalidated all symbolic links to the block device. Fortunately, the OSDs recognised that the block device partition belonged to another OSD ID and exited with an error; otherwise I would probably have lost data. To fix this, I needed to write a script that resets the link target of the symlink "block" to the correct by-partuuid path.
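The repair script itself is not included in this mail. As a rough illustration of the idea only, a sketch could look like the one below. The function name reset_block_symlink and its parameters are hypothetical, and it assumes the OSD data partition is already mounted at osd_dir and that the correct UUID is known (for example from the block_uuid field of the scan output).

```python
import os


def reset_block_symlink(osd_dir, block_uuid, by_partuuid_dir="/dev/disk/by-partuuid"):
    """Point <osd_dir>/block at the by-partuuid path for block_uuid.

    Hypothetical helper, not the script used on the cluster.
    Returns the new link target.
    """
    target = os.path.join(by_partuuid_dir, block_uuid)
    if not os.path.exists(target):
        # Refuse to create a dangling link if the partition is not present.
        raise FileNotFoundError("no such partition: %s" % target)

    link = os.path.join(osd_dir, "block")
    tmp = link + ".tmp"

    # Create the new link under a temporary name and rename it over the
    # old one, so there is never a moment without a "block" entry.
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link)
    return target


# Example call (needs root and a mounted OSD data partition):
# reset_block_symlink("/var/lib/ceph/osd/ceph-59",
#                     "a1e5ef7d-9bab-4911-abe5-9075b91d88a4")
```

The by_partuuid_dir parameter exists only so the function can be tried against a scratch directory before touching a real OSD.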
>
> Using unstable device paths is one thing that can happen by accident. However, what I really do not understand is why "ceph-volume simple activate" *modifies* metadata that should be considered read-only. I found this in the code at src/ceph-volume/ceph_volume/devices/simple/activate.py:200-203:
>
> # always re-do the symlink regardless if it exists, so that the journal
> # device path that may have changed can be mapped correctly every time
> destination = os.path.join(osd_dir, name)
> process.run(['ln', '-snf', device, destination])
>
> Maybe the intention is correct, I don't know. However, the execution is not. At this point, a dictionary of UUIDs should be used with explicit link targets as in "/dev/disk/by-partuuid/"+uuid instead of "device" to make absolutely sure nothing gets rigged here. I think a correct version of the code in src/ceph-volume/ceph_volume/devices/simple/activate.py:190-206 would look something like this:
>
> uuid_map = {
>     'journal': osd_metadata.get('journal', {}).get('uuid'),
>     'block': osd_metadata.get('block', {}).get('uuid'),
>     'block.db': osd_metadata.get('block.db', {}).get('uuid'),
>     'block.wal': osd_metadata.get('block.wal', {}).get('uuid')
> }
>
> for name, uuid in uuid_map.items():
>     if not uuid:
>         continue
>     # always re-do the symlink regardless if it exists, so that the journal
>     # device path that may have changed can be mapped correctly every time
>     destination = os.path.join(osd_dir, name)
>     process.run(['ln', '-snf', '/dev/disk/by-partuuid/'+uuid, destination])
>
>     # make sure that the journal has proper permissions
>     system.chown(self.get_device(uuid))
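The mapping step of this proposal can be tried in isolation, without any ceph-volume internals. The sketch below is only an illustration under that assumption: stable_link_targets is a made-up name, and it merely computes which link name should point at which by-partuuid path, without creating links or changing ownership.

```python
def stable_link_targets(osd_metadata):
    """Map link names to by-partuuid targets for a parsed scan result.

    Hypothetical helper mirroring the uuid_map idea above. Devices
    without a recorded UUID are skipped.
    """
    names = ("journal", "block", "block.db", "block.wal")
    targets = {}
    for name in names:
        uuid = osd_metadata.get(name, {}).get("uuid")
        if not uuid:
            continue
        targets[name] = "/dev/disk/by-partuuid/" + uuid
    return targets


# Values from the second scan of OSD 59, where block.path had
# already been rewritten to the unstable /dev/sdq2.
metadata = {
    "block": {"path": "/dev/sdq2", "uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4"},
    "data": {"path": "/dev/sdq1", "uuid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15"},
}
print(stable_link_targets(metadata))
```

Because only the uuid entries are consulted, a previously rewritten block.path has no influence on the resulting link target.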
>
> This is very explicit about using stable device paths. Needless to say, other occurrences, such as in src/ceph-volume/ceph_volume/devices/simple/scan.py:89-90, should be addressed as well. For example:
>
> device_metadata['uuid'] = device_uuid
> device_metadata['path'] = device
>
> could be corrected in a similar way:
>
> device_metadata['uuid'] = device_uuid
> device_metadata['path'] = '/dev/disk/by-partuuid/'+device_uuid
>
> There are probably more locations that deserve a close look.
>
> Hope that explains the calamities I found myself in.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14