I have now started iterating over all OSDs in the tree, and some of them
are completely unresponsive:
[18:27:18] black1.place6:~# for osd in $(ceph osd tree | grep osd. | awk '{ print $4 }'); do echo $osd; ceph tell $osd injectargs '--osd-max-backfills 1'; done
osd.20
osd.56
osd.62
osd.63
^CTraceback (most recent call last):
  File "/usr/bin/ceph", line 1266, in <module>
    retval = main()
  File "/usr/bin/ceph", line 1182, in main
    prefix='get_command_descriptions')
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1459, in json_command
    inbuf, timeout, verbose)
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1329, in send_command_retry
    return send_command(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1361, in send_command
    cluster.osd_command, osdid, cmd, inbuf, timeout=timeout)
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1311, in run_in_thread
    t.join(timeout=timeout)
  File "/usr/lib/python3.7/threading.py", line 1036, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
osd.64
osd.65
What's the best way to figure out why osd.63 does not react to the tell
command?
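One way to keep a single unresponsive OSD from blocking the whole loop might be to wrap each call in a timeout. The following is only a minimal sketch: it assumes GNU coreutils `timeout` is installed, it uses `ceph osd ls` instead of parsing `ceph osd tree`, and the function name `tell_all_osds` and the per-OSD limit are made up for illustration.

```shell
#!/bin/sh
# Sketch: send a tell command to every OSD, giving up on each one after a
# time limit so that one hanging daemon cannot stall the loop.
# Assumptions (not from the original mail): GNU coreutils `timeout` exists;
# the function name and the time limit are illustrative only.

tell_all_osds() {
    limit="$1"; shift              # seconds to wait per OSD
    failed=""
    for id in $(ceph osd ls); do   # `ceph osd ls` prints one OSD id per line
        if timeout "$limit" ceph tell "osd.$id" "$@"; then
            echo "osd.$id ok"
        else
            echo "osd.$id FAILED or timed out"
            failed="$failed osd.$id"
        fi
    done
    # Summarise the OSDs that need a closer look.
    [ -z "$failed" ] || echo "unresponsive:$failed"
}

# Usage (commented out so that sourcing this file has no side effects):
# tell_all_osds 10 injectargs '--osd-max-backfills 1'
```

The final "unresponsive:" line gives a short list of OSDs to investigate individually, for example via their logs or the admin socket on the host where they run.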
Best regards,
Nico
Nico Schottelius <nico.schottelius(a)ungleich.ch> writes:
Hello Stefan,
Stefan Kooman <stefan(a)bit.nl> writes:
Hi,
However, as soon as we issue either of the above tell commands, it just
hangs. Furthermore, when ceph tell hangs, PGs also become stuck in the
"Activating" and "Peering" states.
The two seem to be related: as soon as we stop ceph tell (Ctrl-C), the
PGs are peered/active again a few minutes later.
We can also reproduce this problem with very busy OSDs, which have been
moved to another host: they do not react to the ceph tell commands either.
Does this also happen when you issue an OSD-specific "tell", i.e. ceph
tell osd.13 injectargs '--osd-max-backfills 4'?
Does this also happen when you loop over the OSDs one by one?
It does hang for some of them, but if I "ping" / select specific OSDs,
this does not happen.
Has anyone seen this before, and/or do you have a hint on how to debug
ceph tell, given that it is not a daemon of its own?
IIRC I have seen this, but not in combination with PGs peering /
activating. Has the config change become effective on all OSDs? Verify
with ceph daemon osd.13 config get osd_max_backfills (for all OSDs).
Just checked: most OSDs did not apply the new setting. Setting it
explicitly on them works, however.
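The per-OSD check suggested above could be run for all OSDs on one host roughly as follows. This is a sketch under assumptions: `ceph daemon` talks to the local admin socket, so it has to be run on each OSD host; the sockets are assumed to be in the default location /var/run/ceph/ with the ceph-osd.<id>.asok naming scheme; the function name `show_max_backfills` is made up.

```shell
#!/bin/sh
# Sketch: print osd_max_backfills for every OSD with an admin socket on
# this host.
# Assumptions (not from the original mail): default socket directory
# /var/run/ceph and default cluster name "ceph"; the function name and the
# optional directory argument are illustrative only.

show_max_backfills() {
    dir="${1:-/var/run/ceph}"
    for sock in "$dir"/ceph-osd.*.asok; do
        [ -e "$sock" ] || continue   # glob matched nothing
        name=${sock##*/ceph-}        # drop directory and "ceph-" prefix
        name=${name%.asok}           # drop suffix, leaving e.g. "osd.13"
        printf '%s: ' "$name"
        ceph daemon "$sock" config get osd_max_backfills
    done
}

# Usage (commented out so that sourcing this file has no side effects):
# show_max_backfills
```

Comparing this output across hosts shows which OSDs still carry the old value and need the setting applied explicitly.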
Best regards,
Nico