On Sun, May 16, 2021 at 12:54 PM Markus Kienast <mark(a)trickkiste.at> wrote:
> Hi Ilya,
>
> unfortunately I cannot find any "missing primary copy of ..." error in the
> logs of my 3 OSDs.
>
> The NVMe disks are also brand new and there is not much traffic on them.
> The only "error" hits I find are the two messages in the osd.0 and osd.1
> logs shown below.
>
> BTW, the error posted before actually concerns osd1. The one I posted was
> copied from somebody else's bug report, which had similar errors. Here are
> my original error messages on LTSP boot:
Hi Markus,
Please don't ever paste log messages from other bug reports again.
Your email said "I am seeing these messages ..." and I spent a fair
amount of time staring at the code trying to understand how an issue
that was fixed several releases ago could resurface.
The numbers in the log message mean specific things. For example, it is
immediately obvious that
get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
is not related to
get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping
even though they probably look the same to you.
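To make the distinction concrete, here is a minimal Python sketch of the shape of that check. The real logic lives in the kernel client (get_reply in net/ceph/osd_client.c); the function and variable names below are invented for illustration only. The client preallocates a reply buffer per request, and if the OSD's reply payload is larger than that buffer, the message is dropped and "skipping" is logged.

```python
# Hypothetical sketch of the kernel client's reply-size check; the real
# implementation is in net/ceph/osd_client.c (get_reply). Names are
# invented for illustration.
def check_reply(osd, tid, data_len, preallocated):
    """Return True if the reply fits the preallocated buffer."""
    if data_len > preallocated:
        # Mirrors the wording of the kernel log line
        print(f"get_reply osd{osd} tid {tid} data {data_len} > "
              f"preallocated {preallocated}, skipping")
        return False
    return True

# The two log lines above plug in very different numbers:
check_reply(1, 11, 4164, 4096)            # 68 bytes over a 4 KiB buffer
check_reply(2, 1459933, 3248128, 131072)  # ~3 MiB against a 128 KiB buffer
```

The first case is a small overrun of a page-sized buffer; the second is a multi-megabyte reply against a 128 KiB allocation, which points at a very different failure mode.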
> [ 10.331119] libceph: mon1 (1)10.101.0.27:6789 session established
> [ 10.331799] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
> [ 10.336866] libceph: mon0 (1)10.101.0.25:6789 session established
> [ 10.337598] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
> [ 10.349380] libceph: get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
Please paste the entire boot log and "rbd info" output for the affected
image.
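In case it helps, something along these lines should capture both. The pool/image name below is a placeholder; substitute your actual LTSP image:

```shell
# Kernel messages from the failed boot, including the libceph lines
dmesg | grep -i libceph

# Image details; replace rbd/ltsp-root with your actual pool/image
rbd info rbd/ltsp-root
```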
> elias@maas:~$ juju ssh ceph-osd/2 sudo zgrep -i error /var/log/ceph/ceph-osd.0.log
> 2021-05-16T08:52:56.872+0000 7f0b262c2d80  4 rocksdb: Options.error_if_exists: 0
> 2021-05-16T08:52:59.872+0000 7f0b262c2d80  4 rocksdb: Options.error_if_exists: 0
> 2021-05-16T08:53:00.884+0000 7f0b262c2d80  1 osd.0 8599 warning: got an error loading one or more classes: (1) Operation not permitted
>
> elias@maas:~$ juju ssh ceph-osd/0 sudo zgrep -i error /var/log/ceph/ceph-osd.1.log
> 2021-05-16T08:49:52.971+0000 7fb6aa68ed80  4 rocksdb: Options.error_if_exists: 0
> 2021-05-16T08:49:55.979+0000 7fb6aa68ed80  4 rocksdb: Options.error_if_exists: 0
> 2021-05-16T08:49:56.828+0000 7fb6aa68ed80  1 osd.1 8589 warning: got an error loading one or more classes: (1) Operation not permitted
>
> How can I find out more about this bug? It keeps coming back every two
> weeks and I need to restart all OSDs to make it go away for another two
> weeks. Can I check "tid 11 data 4164" somehow? I find no documentation on
> what a tid actually is or how I could perform a read test on it.
So *just* restarting the three OSDs you have makes it go away?
What is meant by restarting? Rebooting the node or simply restarting
the OSD process?
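For reference, on a systemd-based deployment like yours, restarting just the daemons (as opposed to a full reboot) would look roughly like this; the OSD IDs 0-2 are assumed from your logs:

```shell
# Restart only the OSD daemons, leaving the nodes up
sudo systemctl restart ceph-osd@0 ceph-osd@1 ceph-osd@2
```

Whether a daemon restart alone clears it, versus a node reboot, narrows down where the stale state lives.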
> Another interesting detail is that the problem only seems to affect booting
> from this RBD, not operation per se. The thin clients already booted from
> this RBD continue working.
I take it that the affected image is mapped on multiple nodes? If so,
on how many?
> All systems run:
>
> Ubuntu 20.04.2 LTS
> Kernel 5.8.0-53-generic
> ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)
>
> The cluster has been set up with Ubuntu MAAS/juju and consists of:
> * 1 MAAS server
>   * with 1 virtual LXD juju controller
> * 3 OSD servers, each with one 2 TB NVMe SSD for Ceph and a 256 GB SATA SSD
>   for the operating system
> * each OSD node hosts a virtualized LXD MON and an LXD FS server (set up
>   through juju, see juju yaml file attached)
Can you describe the client side a bit more? How many clients do you
have? How many of them are active at the same time?
What exactly is meant by "booting from RBD"? Does the affected image
serve as a golden image? Or some other image is snapshotted and cloned
from before booting?
What do clients do after they boot?
Thanks,
Ilya