Hi,
How I got here
--------------
Yesterday evening I added an OSD to my hobby system, most likely using these
commands:
# ceph-volume raw prepare --bluestore --data /dev/bcache0
# cephadm adopt --style legacy --name osd.20
After not having much luck with that, I also ran the following command (I
don't have the exact invocation):
% ceph orch daemon add osd tutu:/tmp/bcache0
per
https://docs.ceph.com/en/latest/cephadm/osd/#creating-new-osds
…which I think created the new osd.18, putting bcache0 inside its own VG
and its own LV.
I don't have an actual log of the commands used, but I did end up with new
OSDs 18 and 20. This was also my first time using these commands; my
previous ways of achieving the same were a bit more long-winded.
According to my monitoring, the main issue appeared around the same time.
In this post I'm not worried about the state of the OSDs themselves, only
about cluster management.
Actual issue
------------
So when I now issue "ceph orch ls" I get the following output:
% ceph orch ls
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1204, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in _list_services
    raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
    raise e
AssertionError: not
("ceph orch ps" works fine.)
Similarly the output of "ceph -s" is:
% ceph -s
...
health: HEALTH_ERR
Module 'cephadm' has failed: 'not'
...
The relevant log from the manager, as per the mgr web interface, is:
_Promise failed Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 294, in _finalize
    next_result = self._on_complete(self._value)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 107, in <lambda>
    return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1333, in describe_service
    hosts=[dd.hostname]
  File "/lib/python3.6/site-packages/ceph/deployment/service_spec.py", line 429, in __init__
    assert service_type in ServiceSpec.KNOWN_SERVICE_TYPES, service_type
AssertionError: not
I also noticed this seemingly highly relevant bit in my ceph orch ps:
NAME        HOST  STATUS   REFRESHED  AGE  VERSION    IMAGE NAME               IMAGE ID   CONTAINER ID
not.osd.20  tutu  stopped  13h ago    14h  <unknown>  docker.io/ceph/ceph:v15  <unknown>  <unknown>
I'm not quite sure how I ended up with that, but I wouldn't exclude operator
error :) such as entering "cephadm adopt --style legacy --name not.osd.20"
(but WHY..).
Sure enough, there is no such docker container running on the host, and the
unit ceph-3046312a-e453-11ea-b1f5-b42e993e47fc@osd.20.service has failed with
"RuntimeError: could not find osd.20 with osd_fsid
212c336a-9516-4818-aeaf-2d0c24c4ca65". That error makes sense: both OSDs 18
and 20 try to use the same bcache0, but the actual BlueStore filesystem is
inside the VG/LV used by 18, whereas 20 tries to use bcache0 directly. As I
said, though, I won't worry about the OSD itself at the moment.
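When I do get around to the OSD side, I plan to double-check which device
each OSD actually claims via the ceph-volume listings. A sketch (assuming
cephadm passes the subcommands through as usual; exact output layout varies
by version):

```shell
# List LVM-based OSD metadata (osd.18 should appear here, on its own VG/LV)
cephadm ceph-volume lvm list
# List raw-mode OSD metadata (osd.20 should appear here, expecting /dev/bcache0)
cephadm ceph-volume raw list
```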
I tried the command "ceph orch daemon rm not.osd.20" (though I'm not sure
whether it should even work); it nevertheless fails in a similar way:
% ceph orch daemon rm not.osd.20
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1204, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 1061, in _daemon_rm
    raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
    raise e
KeyError: 'not'
with the following entries in the mgr log:
5/13/21 1:26:06 PM [ERR] _Promise failed Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 294, in _finalize
    next_result = self._on_complete(self._value)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 107, in <lambda>
    return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1515, in remove_daemons
    return self._remove_daemons(args)
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 65, in forall_hosts_wrapper
    return CephadmOrchestrator.instance._worker_pool.map(do_work, vals)
  File "/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 58, in do_work
    return f(self, *arg)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1804, in _remove_daemons
    return self._remove_daemon(name, host)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1818, in _remove_daemon
    self.cephadm_services[daemon_type].pre_remove(daemon)
KeyError: 'not'

5/13/21 1:26:06 PM [ERR] executing _remove_daemons((<cephadm.module.CephadmOrchestrator object at 0x7f1f4fec2bd0>, [('not.osd.20', 'tutu')])) failed. Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 58, in do_work
    return f(self, *arg)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1804, in _remove_daemons
    return self._remove_daemon(name, host)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1818, in _remove_daemon
    self.cephadm_services[daemon_type].pre_remove(daemon)
KeyError: 'not'
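If I read the tracebacks right, both errors come from the daemon-name
prefix: cephadm seems to take everything before the first "." in a daemon
name as the daemon type, so "not.osd.20" parses as type "not", which is
neither a known service type (the AssertionError from describe_service) nor
a key in cephadm_services (the KeyError from pre_remove). A shell sketch of
that split (cephadm itself does this in Python; this is just an
illustration):

```shell
# Everything before the first '.' is taken as the daemon type,
# the rest as the daemon id, so "not.osd.20" parses badly.
name="not.osd.20"
daemon_type="${name%%.*}"   # strip from the first '.' to the end -> "not"
daemon_id="${name#*.}"      # strip up to and including the first '.' -> "osd.20"
echo "$daemon_type $daemon_id"   # prints: not osd.20
```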
I also checked that "ceph orch daemon rm foo.bar.42" gives the error "Error
EINVAL: Unable to find daemon(s) ['foo.bar.42']", so the command itself is
parsed and dispatched fine for daemons that aren't in the inventory.
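One thing I'm considering as a last resort (untested, and the config-key
name and JSON layout below are my assumptions about how cephadm caches its
per-host daemon inventory, so I'd back everything up first) is editing the
cached daemon list directly and making the mgr reload it:

```shell
# Untested sketch: mgr/cephadm/host.<hostname> is an assumed config-key
# under which cephadm caches the per-host daemon inventory.
ceph config-key get mgr/cephadm/host.tutu > host.tutu.json   # back this up!
# ...remove the "not.osd.20" entry from host.tutu.json by hand...
ceph config-key set mgr/cephadm/host.tutu -i host.tutu.json
ceph mgr fail   # fail over the active mgr so the cephadm module restarts
```

I'd welcome corrections if the cache lives somewhere else.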
Thanks for any assistance!
--
_____________________________________________________________________
/ __// /__ ____ __ Erkki Seppälä\ \
/ /_ / // // /\ \/ / \ /
/_/ /_/ \___/ /_/\_\@inside.org
http://www.inside.org/~flux/