Thanks David
We will investigate the bugs as per your suggestion, and will then look to
test with the custom image.
Appreciate it.
On Sat, May 29, 2021, 4:11 PM David Orman <ormandj(a)corenode.com> wrote:
You may be running into the same issue we ran into (make sure to read
the first issue; there are a few mingled in there), for which we
submitted a patch:
https://tracker.ceph.com/issues/50526
https://github.com/alfredodeza/remoto/issues/62
If you're brave (YMMV, test first in non-prod), we pushed an image with
the issue we encountered fixed as per the above here:
https://hub.docker.com/repository/docker/ormandj/ceph/tags?page=1 . We
'upgraded' to this when we encountered the mgr hanging on us after
updating Ceph to v16 and experiencing this issue, using: "ceph orch
upgrade start --image docker.io/ormandj/ceph:v16.2.3-mgrfix". I've not
tried to bootstrap a new cluster with a custom image, and I don't know
when 16.2.4 will be released with this change (hopefully) integrated, as
remoto accepted the patch upstream.

I'm not sure if this is your exact issue; see the bug reports and check
whether you see the lock and whether the behavior matches. If so, it may
help you out. The only change in that image is that patch to remoto
overlaid on the default 16.2.3 image.
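A minimal sketch of that upgrade flow, for reference. The `run` wrapper just echoes each command so the sketch can be dry-run outside a cluster; drop it to execute for real. The image tag is the one posted above.

```shell
# Sketch of the upgrade flow described above. 'run' echoes each command
# instead of executing it, so the sketch is safe to dry-run anywhere;
# drop the echo to run for real.
run() { echo "+ $*"; }
run ceph orch upgrade start --image docker.io/ormandj/ceph:v16.2.3-mgrfix
run ceph orch upgrade status   # poll until the upgrade reports completion
```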
On Fri, May 28, 2021 at 1:15 PM Marco Pizzolo <marcopizzolo(a)gmail.com>
wrote:
Peter,
We're seeing the same issues as you are. We have 2 new hosts (Intel(R)
Xeon(R) Gold 6248R CPU @ 3.00GHz with 48 cores, 384GB RAM, and 60x 10TB
SED drives) and we have tried both 15.2.13 and 16.2.4.

Cephadm does NOT properly deploy and activate OSDs on Ubuntu 20.04.2
with Docker. This seems to be a bug in cephadm and a product regression,
as we have 4 near-identical nodes on CentOS running Nautilus (240 x 10TB
SED drives) and had no problems.
FWIW we had no luck yet with one-by-one OSD daemon additions through
ceph orch either. We also reproduced the issue easily in a virtual lab
using small virtual disks on a single Ceph VM with 1 mon.

We are now looking into whether we can get past this with a manual
buildout.

If you, or anyone, has hit the same stumbling block and gotten past it,
I would really appreciate some guidance.
Thanks,
Marco
On Thu, May 27, 2021 at 2:23 PM Peter Childs <pchilds(a)bcs.org> wrote:
> In the end it looks like I might be able to get the node up to about 30
> osds before it stops creating any more.
>
> Or rather, it formats the disks but freezes up starting the daemons.
>
> I suspect I'm missing something I can tune to get it working better.
>
> If I could see any error messages that might help, but I'm yet to spot
> anything.
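When cephadm goes quiet like this, a few standard places are worth checking. A hedged sketch follows; the `run` wrapper echoes each command instead of executing it, so it can be dry-run anywhere, and the daemon name `osd.30` is just an example.

```shell
# Where to look when OSD creation stalls with no obvious errors. 'run'
# echoes each command instead of executing it, so this is safe to dry-run.
run() { echo "+ $*"; }
run ceph log last cephadm            # recent cephadm module log entries
run ceph orch ps --daemon-type osd   # daemon states as cephadm sees them
run cephadm logs --name osd.30       # host-side journal for one daemon
```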
>
> Peter.
>
> On Wed, 26 May 2021, 10:57 Eugen Block, <eblock(a)nde.ag> wrote:
>
> > > If I add the osd daemons one at a time with
> > >
> > > ceph orch daemon add osd drywood12:/dev/sda
> > >
> > > It does actually work,
> >
> > Great!
> >
> > > I suspect what's happening is that when my rule for creating osds
> > > runs and creates them all at once, it overloads cephadm and it can't
> > > cope.
> >
> > It's possible, I guess.
> >
> > > I suspect what I might need to do, at least to work around the
> > > issue, is set "limit:" and bring it up until it stops working.
> >
> > It's worth a try, yes, although the docs state you should try to
> > avoid it. It's possible that it doesn't work properly; in that case,
> > create a bug report. ;-)
> >
> > > I did work out how to get ceph-volume to nearly work manually.
> > >
> > > cephadm shell
> > > ceph auth get client.bootstrap-osd -o
> > > /var/lib/ceph/bootstrap-osd/ceph.keyring
> > > ceph-volume lvm create --data /dev/sda --dmcrypt
> > >
> > > but given I've now got "add osd" to work, I suspect I just need to
> > > fine-tune my osd creation rules, so it does not try to create too
> > > many osds on the same node at the same time.
> >
> > I agree, no need to do it manually if there is an automated way,
> > especially if you're trying to bring up dozens of OSDs.
> >
> >
> > Zitat von Peter Childs <pchilds(a)bcs.org>:
> >
> > > After a bit of messing around, I managed to get it somewhat working.
> > >
> > > If I add the osd daemons one at a time with
> > >
> > > ceph orch daemon add osd drywood12:/dev/sda
> > >
> > > It does actually work,
> > >
> > > I suspect what's happening is that when my rule for creating osds
> > > runs and creates them all at once, it overloads cephadm and it can't
> > > cope.
> > >
> > > service_type: osd
> > > service_name: osd.drywood-disks
> > > placement:
> > >   host_pattern: 'drywood*'
> > > spec:
> > >   data_devices:
> > >     size: "7TB:"
> > >   objectstore: bluestore
> > >
> > > I suspect what I might need to do, at least to work around the
> > > issue, is set "limit:" and bring it up until it stops working.
> > >
> > > I did work out how to get ceph-volume to nearly work manually.
> > >
> > > cephadm shell
> > > ceph auth get client.bootstrap-osd -o
> > > /var/lib/ceph/bootstrap-osd/ceph.keyring
> > > ceph-volume lvm create --data /dev/sda --dmcrypt
> > >
> > > but given I've now got "add osd" to work, I suspect I just need to
> > > fine-tune my osd creation rules, so it does not try to create too
> > > many osds on the same node at the same time.
> > >
> > >
> > >
> > > On Wed, 26 May 2021 at 08:25, Eugen Block <eblock(a)nde.ag> wrote:
> > >
> > >> Hi,
> > >>
> > >> I believe your current issue is due to a missing keyring for
> > >> client.bootstrap-osd on the OSD node. But even after fixing that,
> > >> you probably still won't be able to deploy an OSD manually with
> > >> ceph-volume, because 'ceph-volume activate' is not supported with
> > >> cephadm [1]. I just tried that in a virtual environment; it fails
> > >> when activating the systemd unit:
> > >>
> > >> ---snip---
> > >> [2021-05-26 06:47:16,677][ceph_volume.process][INFO ] Running
> > >> command: /usr/bin/systemctl enable
> > >> ceph-volume@lvm-8-1a8fc8ae-8f4c-4f91-b044-d5636bb52456
> > >> [2021-05-26 06:47:16,692][ceph_volume.process][INFO ] stderr
> > >> Failed to connect to bus: No such file or directory
> > >> [2021-05-26 06:47:16,693][ceph_volume.devices.lvm.create][ERROR ]
> > >> lvm activate was unable to complete, while creating the OSD
> > >> Traceback (most recent call last):
> > >>   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/create.py", line 32, in create
> > >>     Activate([]).activate(args)
> > >>   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
> > >>     return func(*a, **kw)
> > >>   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 294, in activate
> > >>     activate_bluestore(lvs, args.no_systemd)
> > >>   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 214, in activate_bluestore
> > >>     systemctl.enable_volume(osd_id, osd_fsid, 'lvm')
> > >>   File "/usr/lib/python3.6/site-packages/ceph_volume/systemd/systemctl.py", line 82, in enable_volume
> > >>     return enable(volume_unit % (device_type, id_, fsid))
> > >>   File "/usr/lib/python3.6/site-packages/ceph_volume/systemd/systemctl.py", line 22, in enable
> > >>     process.run(['systemctl', 'enable', unit])
> > >>   File "/usr/lib/python3.6/site-packages/ceph_volume/process.py", line 153, in run
> > >>     raise RuntimeError(msg)
> > >> RuntimeError: command returned non-zero exit status: 1
> > >> [2021-05-26 06:47:16,694][ceph_volume.devices.lvm.create][INFO ]
> > >> will rollback OSD ID creation
> > >> [2021-05-26 06:47:16,697][ceph_volume.process][INFO ] Running
> > >> command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd
> > >> --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new
> > >> osd.8 --yes-i-really-mean-it
> > >> [2021-05-26 06:47:17,597][ceph_volume.process][INFO ] stderr
> > >> purged osd.8
> > >> ---snip---
> > >>
> > >> There's a workaround described in [2], but that's not really an
> > >> option for dozens of OSDs. I think your best approach is to get
> > >> cephadm to activate the OSDs for you.
> > >> You wrote that you didn't find any helpful error messages, but did
> > >> cephadm even try to deploy OSDs? What does your osd spec file look
> > >> like? Did you explicitly run 'ceph orch apply osd -i specfile.yml'?
> > >> This should trigger cephadm, and you should see at least some output
> > >> like this:
> > >>
> > >> Mai 26 08:21:48 pacific1 conmon[31446]: 2021-05-26T06:21:48.466+0000
> > >> 7effc15ff700 0 log_channel(cephadm) log [INF] : Applying service
> > >> osd.ssd-hdd-mix on host pacific2...
> > >> Mai 26 08:21:49 pacific1 conmon[31009]: cephadm
> > >> 2021-05-26T06:21:48.469611+0000 mgr.pacific1.whndiw (mgr.14166) 1646 :
> > >> cephadm [INF] Applying service osd.ssd-hdd-mix on host pacific2...
> > >>
> > >> Regards,
> > >> Eugen
> > >>
> > >> [1] https://tracker.ceph.com/issues/49159
> > >> [2] https://tracker.ceph.com/issues/46691
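A sketch of that trigger, with a preview step first. The `run` wrapper echoes each command instead of executing it so this can be dry-run outside a cluster, and `specfile.yml` is a placeholder name.

```shell
# Sketch of triggering cephadm with an OSD spec file, previewing first.
# 'run' echoes instead of executing; drop it to run for real.
run() { echo "+ $*"; }
run ceph orch apply -i specfile.yml --dry-run   # preview planned changes
run ceph orch apply -i specfile.yml             # apply the spec for real
run ceph orch ps --daemon-type osd              # check osd daemons appear
```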
> > >>
> > >>
> > >> Zitat von Peter Childs <pchilds(a)bcs.org>:
> > >>
> > >> > Not sure what I'm doing wrong; I suspect it's the way I'm running
> > >> > ceph-volume.
> > >> >
> > >> > root@drywood12:~# cephadm ceph-volume lvm create --data /dev/sda
> > >> > --dmcrypt
> > >> > Inferring fsid 1518c8e0-bbe4-11eb-9772-001e67dc85ea
> > >> > Using recent ceph image
> > >> > ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
> > >> > /usr/bin/docker: Running command: /usr/bin/ceph-authtool
> > >> > --gen-print-key
> > >> > /usr/bin/docker: Running command: /usr/bin/ceph-authtool
> > >> > --gen-print-key
> > >> > /usr/bin/docker: --> RuntimeError: No valid ceph configuration
> > >> > file was loaded.
> > >> > Traceback (most recent call last):
> > >> >   File "/usr/sbin/cephadm", line 8029, in <module>
> > >> >     main()
> > >> >   File "/usr/sbin/cephadm", line 8017, in main
> > >> >     r = ctx.func(ctx)
> > >> >   File "/usr/sbin/cephadm", line 1678, in _infer_fsid
> > >> >     return func(ctx)
> > >> >   File "/usr/sbin/cephadm", line 1738, in _infer_image
> > >> >     return func(ctx)
> > >> >   File "/usr/sbin/cephadm", line 4514, in command_ceph_volume
> > >> >     out, err, code = call_throws(ctx, c.run_cmd(), verbosity=verbosity)
> > >> >   File "/usr/sbin/cephadm", line 1464, in call_throws
> > >> >     raise RuntimeError('Failed command: %s' % ' '.join(command))
> > >> > RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host
> > >> > --net=host --entrypoint /usr/sbin/ceph-volume --privileged
> > >> > --group-add=disk --init -e
> > >> > CONTAINER_IMAGE=ceph/ceph@sha256:54e95ae1e11404157d7b329d0t
> > >> >
> > >> > root@drywood12:~# cephadm shell
> > >> > Inferring fsid 1518c8e0-bbe4-11eb-9772-001e67dc85ea
> > >> > Inferring config
> > >> > /var/lib/ceph/1518c8e0-bbe4-11eb-9772-001e67dc85ea/mon.drywood12/config
> > >> > Using recent ceph image
> > >> > ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
> > >> > root@drywood12:/# ceph-volume lvm create --data /dev/sda --dmcrypt
> > >> > Running command: /usr/bin/ceph-authtool --gen-print-key
> > >> > Running command: /usr/bin/ceph-authtool --gen-print-key
> > >> > Running command: /usr/bin/ceph --cluster ceph --name
> > >> > client.bootstrap-osd --keyring
> > >> > /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
> > >> > 70054a5c-c176-463a-a0ac-b44c5db0987c
> > >> > stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1 auth: unable
> > >> > to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2)
> > >> > No such file or directory
> > >> > stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1
> > >> > AuthRegistry(0x7fdef405b378) no keyring found at
> > >> > /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
> > >> > stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1 auth: unable
> > >> > to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2)
> > >> > No such file or directory
> > >> > stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1
> > >> > AuthRegistry(0x7fdef405ef20) no keyring found at
> > >> > /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
> > >> > stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1 auth: unable
> > >> > to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2)
> > >> > No such file or directory
> > >> > stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1
> > >> > AuthRegistry(0x7fdef8f0bea0) no keyring found at
> > >> > /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
> > >> > stderr: 2021-05-25T07:46:18.188+0000 7fdef2d9d700 -1
> > >> > monclient(hunting): handle_auth_bad_method server allowed_methods
> > >> > [2] but i only support [1]
> > >> > stderr: 2021-05-25T07:46:18.188+0000 7fdef259c700 -1
> > >> > monclient(hunting): handle_auth_bad_method server allowed_methods
> > >> > [2] but i only support [1]
> > >> > stderr: 2021-05-25T07:46:18.188+0000 7fdef1d9b700 -1
> > >> > monclient(hunting): handle_auth_bad_method server allowed_methods
> > >> > [2] but i only support [1]
> > >> > stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1 monclient:
> > >> > authenticate NOTE: no keyring found; disabled cephx authentication
> > >> > stderr: [errno 13] RADOS permission denied (error connecting to
> > >> > the cluster)
> > >> > --> RuntimeError: Unable to create a new OSD id
> > >> > root@drywood12:/# lsblk /dev/sda
> > >> > NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> > >> > sda 8:0 0 7.3T 0 disk
> > >> >
> > >> > As far as I can see, cephadm gets a little further than this, as
> > >> > the disks have lvm volumes on them; just the osd daemons are not
> > >> > created or started. So maybe I'm invoking ceph-volume incorrectly.
> > >> >
> > >> >
> > >> > On Tue, 25 May 2021 at 06:57, Peter Childs <pchilds(a)bcs.org> wrote:
> > >> >
> > >> >>
> > >> >>
> > >> >> On Mon, 24 May 2021, 21:08 Marc, <Marc(a)f1-outsourcing.eu> wrote:
> > >> >>
> > >> >>> >
> > >> >>> > I'm attempting to use cephadm and Pacific, currently on debian
> > >> >>> > buster, mostly because centos7 ain't supported any more and
> > >> >>> > centos8 ain't supported by some of my hardware.
> > >> >>>
> > >> >>> Who says centos7 is not supported any more? Afaik centos7/el7 is
> > >> >>> being supported till its EOL in 2024. By then maybe a good
> > >> >>> alternative for el8/stream has surfaced.
> > >> >>>
> > >> >>
> > >> >> Not supported by Ceph Pacific; it's our os of choice otherwise.
> > >> >>
> > >> >> My testing says the versions of podman, docker and python3
> > >> >> available do not work with Pacific.
> > >> >>
> > >> >> Given I've needed to upgrade docker on buster, can we please have
> > >> >> a list of versions that work with cephadm, and maybe even have
> > >> >> cephadm say no, please upgrade, unless you're running the right
> > >> >> version or better.
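A rough preflight along those lines, as a sketch. The 3.6 minimum here is an assumption for illustration; check the cephadm documentation for the real requirement on your release.

```shell
# Rough preflight sketch: compare an installed python3 against an assumed
# minimum using version-sort. The 3.6 threshold is an assumption, not an
# authoritative cephadm requirement.
min_ok() {  # usage: min_ok HAVE MIN -> succeeds if HAVE >= MIN
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}
have=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
if min_ok "$have" 3.6; then
  echo "python3 $have looks new enough"
else
  echo "python3 $have below assumed minimum 3.6"
fi
```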
> > >> >>
> > >> >>
> > >> >>
> > >> >>> > Anyway I have a few nodes with 59x 7.2TB disks, but for some
> > >> >>> > reason the osd daemons don't start; the disks get formatted and
> > >> >>> > the osds are created, but the daemons never come up.
> > >> >>>
> > >> >>> what if you try with
> > >> >>> ceph-volume lvm create --data /dev/sdi --dmcrypt ?
> > >> >>>
> > >> >>
> > >> >> I'll have a go.
> > >> >>
> > >> >>
> > >> >>> > They are probably the wrong spec for ceph (48gb of memory and
> > >> >>> > only 4 cores)
> > >> >>>
> > >> >>> You can always start with just configuring a few disks per node.
> > >> >>> That should always work.
> > >> >>>
> > >> >>
> > >> >> That was my thought too.
> > >> >>
> > >> >> Thanks
> > >> >>
> > >> >> Peter
> > >> >>
> > >> >>
> > >> >>> > but I was expecting them to start and be either dirt slow or
> > >> >>> > crash later; anyway I've got up to 30 of them, so I was hoping
> > >> >>> > on getting at least 6PB of raw storage out of them.
> > >> >>> >
> > >> >>> > As yet I've not spotted any helpful error messages.
> > >> >>> >
> > >> >>> _______________________________________________
> > >> >>> ceph-users mailing list -- ceph-users(a)ceph.io
> > >> >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
> > >> >>