My ceph cluster became unstable yesterday after zincati (CoreOS's
auto-updater) updated one of my nodes from 37.20221225.3.0 to
37.20230110.3.1(*). The symptom was slow ops on my CephFS MDS, which
started as soon as the OSDs on this node were both up and in. Keeping
the OSDs on this node out worked around the problem. Note that the node
is also running a mon and client workloads which use ceph. Also note
that the OSDs came up and (IIUC) were participating in recovering their
data to other OSDs. The problem only started when I allowed them to be in.
I rolled back the OS update and the problem was immediately resolved.
Unfortunately I didn't keep the OSD logs, but they led me to this
thread from ceph-users:
https://www.mail-archive.com/ceph-users@ceph.io/msg18474.html
I wonder if we have an issue with a very recent kernel update.
I should be able to reproduce this if it's likely to be of use to anybody,
but for now I've rolled back this OS update and disabled automatic
updating on my other nodes.
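A rough sketch of the workaround described above, not a transcript of the exact commands used. The OSD IDs passed in are placeholders, and `run` and `workaround` are helper names made up for illustration. It assumes zincati runs as `zincati.service` and that the previous deployment is still available to roll back to. With DRY_RUN=1 (the default) it only prints the commands.

```shell
#!/bin/sh
# Hypothetical helpers: with DRY_RUN=1 (the default) commands are printed,
# not executed. Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

workaround() {
    # Mark each given OSD out so that no data is mapped to it.
    # The OSD daemons themselves stay up.
    for id in "$@"; do
        run ceph osd out "$id"
    done
    # Stop and disable zincati so it does not re-apply the update.
    run sudo systemctl disable --now zincati.service
    # Switch back to the previous deployment and reboot into it.
    run sudo rpm-ostree rollback --reboot
}
```

For example, `workaround 3 4` prints the commands for two OSDs with the placeholder IDs 3 and 4. Once the node is back on the old deployment, `ceph osd in <id>` marks an OSD in again.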
Matt
(*) The complete list of changes:
$ rpm-ostree db diff \
    d477f98d52bf707d4282f6835b85bed3d60e305a0cf6eb8effd4db4b89607f05 \
    fc214c16d248686d4cf2bb3050b59c559f091692d7af3b07ef896f1b8ab2f161
ostree diff commit from:
d477f98d52bf707d4282f6835b85bed3d60e305a0cf6eb8effd4db4b89607f05
ostree diff commit to:
fc214c16d248686d4cf2bb3050b59c559f091692d7af3b07ef896f1b8ab2f161
Upgraded:
bash 5.2.9-3.fc37 -> 5.2.15-1.fc37
btrfs-progs 6.0.2-1.fc37 -> 6.1.2-1.fc37
clevis 18-12.fc37 -> 18-14.fc37
clevis-dracut 18-12.fc37 -> 18-14.fc37
clevis-luks 18-12.fc37 -> 18-14.fc37
clevis-systemd 18-12.fc37 -> 18-14.fc37
container-selinux 2:2.193.0-1.fc37 -> 2:2.198.0-1.fc37
containerd 1.6.12-1.fc37 -> 1.6.14-2.fc37
containers-common 4:1-73.fc37 -> 4:1-76.fc37
containers-common-extra 4:1-73.fc37 -> 4:1-76.fc37
coreutils 9.1-6.fc37 -> 9.1-7.fc37
coreutils-common 9.1-6.fc37 -> 9.1-7.fc37
crun 1.7.2-2.fc37 -> 1.7.2-3.fc37
curl 7.85.0-4.fc37 -> 7.85.0-5.fc37
dnsmasq 2.87-3.fc37 -> 2.88-1.fc37
ethtool 2:6.0-1.fc37 -> 2:6.1-1.fc37
fwupd 1.8.8-1.fc37 -> 1.8.9-1.fc37
git-core 2.38.1-1.fc37 -> 2.39.0-1.fc37
grub2-common 1:2.06-63.fc37 -> 1:2.06-72.fc37
grub2-efi-x64 1:2.06-63.fc37 -> 1:2.06-72.fc37
grub2-pc 1:2.06-63.fc37 -> 1:2.06-72.fc37
grub2-pc-modules 1:2.06-63.fc37 -> 1:2.06-72.fc37
grub2-tools 1:2.06-63.fc37 -> 1:2.06-72.fc37
grub2-tools-minimal 1:2.06-63.fc37 -> 1:2.06-72.fc37
kernel 6.0.15-300.fc37 -> 6.0.18-300.fc37
kernel-core 6.0.15-300.fc37 -> 6.0.18-300.fc37
kernel-modules 6.0.15-300.fc37 -> 6.0.18-300.fc37
libcurl-minimal 7.85.0-4.fc37 -> 7.85.0-5.fc37
libgpg-error 1.45-2.fc37 -> 1.46-1.fc37
libgusb 0.4.2-1.fc37 -> 0.4.3-1.fc37
libksba 1.6.2-1.fc37 -> 1.6.3-1.fc37
libpcap 14:1.10.1-4.fc37 -> 14:1.10.2-1.fc37
libpwquality 1.4.4-11.fc37 -> 1.4.5-1.fc37
libsmbclient 2:4.17.4-0.fc37 -> 2:4.17.4-2.fc37
libwbclient 2:4.17.4-0.fc37 -> 2:4.17.4-2.fc37
moby-engine 20.10.20-1.fc37 -> 20.10.21-1.fc37
ncurses 6.3-3.20220501.fc37 -> 6.3-4.20220501.fc37
ncurses-base 6.3-3.20220501.fc37 -> 6.3-4.20220501.fc37
ncurses-libs 6.3-3.20220501.fc37 -> 6.3-4.20220501.fc37
net-tools 2.0-0.63.20160912git.fc37 -> 2.0-0.64.20160912git.fc37
rpm-ostree 2022.16-2.fc37 -> 2022.19-2.fc37
rpm-ostree-libs 2022.16-2.fc37 -> 2022.19-2.fc37
samba-client-libs 2:4.17.4-0.fc37 -> 2:4.17.4-2.fc37
samba-common 2:4.17.4-0.fc37 -> 2:4.17.4-2.fc37
samba-common-libs 2:4.17.4-0.fc37 -> 2:4.17.4-2.fc37
selinux-policy 37.16-1.fc37 -> 37.17-1.fc37
selinux-policy-targeted 37.16-1.fc37 -> 37.17-1.fc37
tpm2-tss 3.2.0-3.fc37 -> 3.2.1-1.fc37
Removed:
cracklib-dicts-2.9.7-30.fc37.x86_64
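Since a kernel update is the suspect, the list above can be narrowed to the kernel packages with a small filter. This is only a sketch: `kernel_changes` is a made-up helper name, and it assumes the package name is the first field on each line, as in the output above.

```shell
# Keep only the kernel package lines (kernel, kernel-core, kernel-modules, ...)
# from `rpm-ostree db diff` output read on stdin.
kernel_changes() {
    grep -E '^[[:space:]]*kernel(-[a-z]+)* '
}
```

Usage would be along the lines of `rpm-ostree db diff <from-commit> <to-commit> | kernel_changes`, with the two commit hashes shown above.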
--
Matthew Booth