[ceph-users] Re: libceph: get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping

16 May 2021

On Sun, 16 May 2021 at 21:36, Ilya Dryomov <idryomov(a)gmail.com> wrote:

...
 On Sun, May 16, 2021 at 8:06 PM Markus Kienast <mark(a)trickkiste.at> wrote:

 On Sun, 16 May 2021 at 19:38, Ilya Dryomov <idryomov(a)gmail.com> wrote:
>
> On Sun, May 16, 2021 at 4:18 PM Markus Kienast <mark(a)trickkiste.at> wrote:
> >
> > On Sun, 16 May 2021 at 15:36, Ilya Dryomov <idryomov(a)gmail.com> wrote:
> >>
> >> On Sun, May 16, 2021 at 12:54 PM Markus Kienast <mark(a)trickkiste.at> wrote:
 > >> >
> >> > Hi Ilya,
> >> >
> >> > Unfortunately, I cannot find any "missing primary copy of ..."
> >> > error in the logs of my 3 OSDs.
> >> > The NVMe disks are also brand new and there is not much traffic on
> >> > them.
> >> >
> >> > The only matches for the keyword "error" that I find are those two
> >> > messages in the osd.0 and osd.1 logs shown below.
> >> >
> >> > BTW, the error posted before actually concerns osd1. The one I
> >> > posted was copied from somebody else's bug report, which had similar
> >> > errors. Here are my original error messages on LTSP boot:
 > >>
> >> Hi Markus,
> >>
> >> Please don't ever paste log messages from other bug reports again.
> >> Your email said "I am seeing these messages ..." and I spent a fair
> >> amount of time staring at the code trying to understand how an issue
> >> that was fixed several releases ago could resurface.
> >>
> >> The numbers in the log message mean specific things.  For example it
> >> is immediately obvious that
> >>
> >>   get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
> >>
> >> is not related to
> >>
> >>   get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping
 > >>
> >> even though they probably look the same to you.
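
To compare two such messages field by field, the numbers can be pulled out of the log line. The following is only a minimal sketch; the function name `parse_get_reply` is made up for illustration and is not part of any Ceph tool. It prints the OSD number, the tid, the data length and the preallocated size, in that order.

```shell
# Hypothetical helper: split a libceph "get_reply ... skipping" log line
# into its numeric fields.
# $1: one kernel log line; prints "osd tid data preallocated".
parse_get_reply() {
    printf '%s\n' "$1" | sed -n \
        's/.*get_reply osd\([0-9]*\) tid \([0-9]*\) data \([0-9]*\) > *preallocated \([0-9]*\).*/\1 \2 \3 \4/p'
}

# The two messages quoted in this thread:
parse_get_reply "libceph: get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping"
parse_get_reply "libceph: get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping"
```

Lines that do not match the pattern produce no output, so the helper can be fed a whole boot log line by line.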
> >
> >
> > Sorry, I was not aware of that.
> >
> >>
> >> > [    10.331119] libceph: mon1 (1)10.101.0.27:6789 session established
> >> > [    10.331799] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
> >> > [    10.336866] libceph: mon0 (1)10.101.0.25:6789 session established
> >> > [    10.337598] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
> >> > [    10.349380] libceph: get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
> >>
> >> Please paste the entire boot log and "rbd info" output for the affected
> >> image.
> >
> >
> > elias@maas:~$ rbd info squashfs/ltsp-01
> > rbd image 'ltsp-01':
> > size 3.5 GiB in 896 objects
> > order 22 (4 MiB objects)
> > snapshot_count: 0
> > id: 23faade1714
> > block_name_prefix: rbd_data.23faade1714
> > format: 2
> > features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
> > op_features:
> > flags:
> > create_timestamp: Mon Jan 11 12:09:22 2021
> > access_timestamp: Wed Feb 24 10:55:17 2021
> > modify_timestamp: Mon Jan 11 12:09:22 2021
> >
> > I don't have the boot log available right now, but you can watch a
> > video of the boot process here:
> > https://photos.app.goo.gl/S8PssYu2VAr4CSeg7
> >
> > It seems to be "tid 11" consistently, while in this video it was
> > "data 4288", not "data 4164" as above. But as far as I can recall, the
> > image has been modified in the meantime, so that might be the reason.
 > >>
> >>
> >> >
> >> > elias@maas:~$ juju ssh ceph-osd/2 sudo zgrep -i error /var/log/ceph/ceph-osd.0.log
> >> > 2021-05-16T08:52:56.872+0000 7f0b262c2d80  4 rocksdb:             Options.error_if_exists: 0
> >> > 2021-05-16T08:52:59.872+0000 7f0b262c2d80  4 rocksdb:             Options.error_if_exists: 0
> >> > 2021-05-16T08:53:00.884+0000 7f0b262c2d80  1 osd.0 8599 warning: got an error loading one or more classes: (1) Operation not permitted
> >> >
> >> > elias@maas:~$ juju ssh ceph-osd/0 sudo zgrep -i error /var/log/ceph/ceph-osd.1.log
> >> > 2021-05-16T08:49:52.971+0000 7fb6aa68ed80  4 rocksdb:             Options.error_if_exists: 0
> >> > 2021-05-16T08:49:55.979+0000 7fb6aa68ed80  4 rocksdb:             Options.error_if_exists: 0
> >> > 2021-05-16T08:49:56.828+0000 7fb6aa68ed80  1 osd.1 8589 warning: got an error loading one or more classes: (1) Operation not permitted
 > >> >
> >> > How can I find out more about this bug? It keeps coming back every
> >> > two weeks and I need to restart all OSDs to make it go away for
> >> > another two weeks. Can I check "tid 11 data 4164" somehow? I find no
> >> > documentation on what a tid actually is or how I could perform a
> >> > read test on it.
 > >>
> >> So *just* restarting the three OSDs you have makes it go away?
> >>
> >> What is meant by restarting?  Rebooting the node or simply restarting
> >> the OSD process?
> >
> >
> > I did reboot all OSD nodes, and since the MON and FS nodes run as
> > LXD/juju instances on them, they were rebooted as well.
 > >
> >>
> >> >
> >> > Another interesting detail is that the problem only seems to
> >> > affect booting from this RBD, not operation per se. The thin clients
> >> > that already booted from this RBD continue working.
 > >>
> >> I take it that the affected image is mapped on multiple nodes?  If so,
> >> on how many?
> >
> >
> > Currently "squashfs/ltsp-01" is mapped on 4 nodes.
> > As the pool name indicates, the FS was converted to squashfs and is
> > therefore mounted read-only, while the underlying dev might actually
> > not be mounted read-only, as there does not seem to be an option
> > available to mount RO via /sys/bus/rbd/add_single_major or
> > /sys/bus/rbd/add.
> >
> > As far as I can tell, the only way to force RO is to map a snapshot
> > instead.

 Are you writing to /sys/bus/rbd/add_single_major directly instead of
 using the rbd tool? 

 Yes.
 Line 110 of
 https://github.com/trickkiste/ltsp/blob/feature-boot_method-rbd/debian/ltsp…

 echo "${mons} name=${user},secret=${key} ${pool} ${image} ${snap}" > ${rbd_bus}

>
> >
> >>
> >> >
> >> > All systems run:
> >> > Ubuntu 20.04.2 LTS
> >> > Kernel 5.8.0-53-generic
> >> > ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)
 > >> >
> >> > The cluster has been set up with Ubuntu MAAS/juju and consists of
> >> > * 1 MAAS server
> >> > * with 1 virtual LXD juju controller
> >> > * 3 OSD servers with one 2 TB NVMe SSD each for Ceph and a 256 SATA SSD for the operating system.
> >> > * each OSD server contains a virtualized LXD MON and an LXD FS server (set up through juju, see juju yaml file attached).
 > >>
> >> Can you describe the client side a bit more?  How many clients do you
> >> have?  How many of them are active at the same time?
> >
> >
> > Currently, there are only 4 active clients, but the system is intended
> > to be able to sustain hundreds of clients. We are using an RBD as the
> > boot device for PXE-booted thin clients; you might have heard of the
> > Linux Terminal Server Project (ltsp.org). We adapted the stack to
> > support booting from RBD.

 How many active clients were there at the time when the image couldn't
 be mapped?  I suspect between 60 and 70?

 No, just 4.
 Most of the time, 3 are still running and working correctly and one is
 stuck at reboot.

 Maybe the sum of all LTSP client reboots since I cleared the problem by
 rebooting the OSDs could amount to 60-70. I do not know, as we are not
 logging that currently.

 The next time it happens, check the output of "rbd status" for that
 image.  If you see around 65 watchers, that is it.  With exclusive-lock
 feature enabled on the image, the current kernel implementation can't
 handle more than that. 
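
A rough way to keep an eye on that number is to count the `watcher=` lines that "rbd status" prints. The sketch below is only an illustration: the helper names `count_watchers` and `watchers_per_ip` are hypothetical, and the per-IP grouping assumes IPv4 watcher addresses.

```shell
# Hypothetical helpers; both read "rbd status <pool>/<image>" output on stdin.

# Print the total number of watchers.
count_watchers() {
    grep -c 'watcher='
}

# Print "<ip> <count>" for each client IP holding at least one watch
# (assumes IPv4 addresses of the form watcher=a.b.c.d:port/nonce).
watchers_per_ip() {
    sed -n 's/.*watcher=\([0-9.]*\):.*/\1/p' | sort | uniq -c | awk '{print $2, $1}'
}

# Intended use on a node with the rbd tool installed:
#   rbd status squashfs/ltsp-01 | count_watchers
#   rbd status squashfs/ltsp-01 | watchers_per_ip
```

A count that only ever grows across client reboots would point at watches that are never released.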

 OK, currently I am seeing 5, which is one more than the number of
 clients we have. So it seems these watchers do not time out after a
 reboot or hard reset.

 Is there any way to make these watchers time out?

 They are supposed to time out after 30 seconds.  Does the IP address
 of the rogue watch offer a clue?

Not really, I see 3 watchers for one of my client IPs right now and one for
each of the others. Have to investigate further.
The clients are assigned the same IP on each bootup, so shouldn't the
watchers either time out or be "claimed/taken over" by the newly booted
client?

...

 Note that when the mapping gets stuck on that preallocated check, it
 still maintains the watch so it's not going to time out in that case.

I will investigate whether only one client is producing all the watchers
until the limit is reached, or whether all clients produce this problem
together, and then report back on the issue.

...

 Watches are established if the image is mapped read-write.  For your
 squashfs + overlayfs use case, it's not only better to map read-only
 just in case, you actually *need* to do that to avoid watches being
 established.

 If you are writing to /sys/bus/rbd/add_single_major directly, append
 "ro" somewhere in the options part of the string:

   ip:port,... name=myuser,secret=mysecret rbd ltsp-01 -  # read-write

   ip:port,... name=myuser,secret=mysecret,ro rbd ltsp-01 -  # read-only 
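
For an initrd script that writes to the sysfs file directly, the read-only variant could be assembled along these lines. This is a sketch under assumptions: the function name `rbd_add_line` and the sample monitor addresses, user and secret are placeholders, and a missing snapshot argument is written as "-" as in the examples above.

```shell
# Hypothetical helper: build the string for /sys/bus/rbd/add_single_major
# with "ro" appended to the options part.
# Arguments: mons user key pool image [snap]   (snap defaults to "-")
rbd_add_line() {
    printf '%s name=%s,secret=%s,ro %s %s %s\n' \
        "$1" "$2" "$3" "$4" "$5" "${6:--}"
}

# Print the line that would be written (placeholder values):
rbd_add_line "10.101.0.25:6789,10.101.0.27:6789" myuser mysecret squashfs ltsp-01

# In the initrd, the output would be redirected into the sysfs file, e.g.:
#   rbd_add_line "$mons" "$user" "$key" "$pool" "$image" "$snap" \
#       > /sys/bus/rbd/add_single_major
```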

 Thank you, we will add this missing piece to our rbd initrd code.

 Are you a ceph dev?
 Could you make sure to add this to the kernel documentation too?
 https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-rbd

 Map options are documented in the rbd man page:

 https://docs.ceph.com/en/latest/man/8/rbd/#kernel-rbd-krbd-options 

Thanks for the tip!
However, I would argue that these krbd options should either be available
in the kernel documentation as well, or at least this man page should be
mentioned in the kernel documentation.

...

 There is no mention of that option currently.
 I might even have tried this, but it might not have worked. Not sure,
 this was over a year ago.

 Also missing from the documentation is how one could mount a CephFS on
 boot!

 Do you mean booting *from* CephFS, i.e. using it as a root filesystem?
 Because mounting CephFS on boot after root filesystem is mounted is done
 through /etc/fstab, like you would mount any other filesystem whether
 local or network.

Yes, CephFS as a root filesystem. The reason why this would be convenient
is that we could then have one VM or client mount this CephFS in RW
mode, and therefore be able to update the installed packages etc.,
while the other clients would be restricted to read-only rights via cephx.
We would still be able to create snapshots of the last properly working OS
configuration but without the additional hassle.

Also, if we do OS package upgrades, these upgrades would be instantly
available on the clients as well without the need to reboot, since due to
the network-fs nature of CephFS all clients would be aware of the
underlying changes to the FS.

...
 We are thinking about switching to booting from CephFS in the future.
 But I have no idea, and did not find any documentation, on how we would
 approach that: which boot kernel option to use, which sysfs
 interface could be used, or which tools we must include in the initrd.

 Generally, it would be great if you could include proper initrd code
 for RBD and CephFS root filesystems in the Ceph project. You can happily
 use my code as a starting point.

https://github.com/trickkiste/ltsp/blob/feature-boot_method-rbd/debian/ltsp…

 I think booting from CephFS would require kernel patches.  It looks
 like NFS and CIFS are the only network filesystems supported by the
 init/root infrastructure in the kernel.

 As long as we can do it similarly to the way we do it with RBD now, that
would be sufficient. The actual mapping and mounting for both NFS and RBD
takes place in the initrd scripts. The RBD initrd script is actually derived
from the NFS initrd script, so I would presume that one could make a CephFS
mount work in a similar fashion. The question is which tools are needed to
make this work. I guess "mount" already provides everything needed, does it
not?

Thank you so much for your support,
Markus

Thanks,
...

                 Ilya
