[Adding Sebastian, dev@ceph.io]
Some things to improve with the OSD create path!
On Mon, 20 Jan 2020, Yaarit Hatuka wrote:
> Here are a few insights from this debugging process - I hope I got it right:
>
> 1. Adding the device with "/dev/disk/by-id/...." did not work for me, it
> failed in pybind/mgr/cephadm/module.py at:
> https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1241
> "if len(list(set(devices) & set(osd['devices']))) == 0"
> because osd['devices'] has the devices listed as "/dev/sdX", but
> set(devices) has them by their by-id path (which is the syntax used in
> the example in the docs, which I followed).
> It took me a couple of days to debug this :-)
>
> 2. I think that cephadm should be more verbose by default. When creating
> an OSD it only prints "Created osd(s) on host 'mira027.front.sepia.ceph.com'"
> (even when creation failed...). It would help if it reported the different
> stages, so that the user can see where it stopped in case of error.
>
> 3. ceph status shows that the OSD was added even if the orchestrator failed
> to add it (but it's marked down and out).
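The empty-intersection failure in item 1 can be reproduced in isolation. This is a minimal standalone sketch, not the actual cephadm code; the device paths below are made-up placeholder values:

```python
# Hypothetical illustration of the check at module.py#L1241.
# The user passes a persistent /dev/disk/by-id path (placeholder value),
# but ceph-volume reports the OSD's devices by kernel name (/dev/sdX),
# so the two sets never intersect even for the same physical disk.
devices = ["/dev/disk/by-id/wwn-0x5000c500a1b2c3d4"]  # assumed user input
osd = {"devices": ["/dev/sdc"]}                       # as reported by ceph-volume

# The condition from module.py: an empty intersection means the OSD is
# treated as "not ours", so the loop hits 'continue' and the orchestrator
# never calls self._create_daemon('osd', ...).
if len(list(set(devices) & set(osd["devices"]))) == 0:
    print("skipped: device lists do not intersect")
```

Both strings name the same disk, but string comparison cannot know that, which is why the bug was so hard to spot.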
IIUC this is ceph-volume's failure path not cleaning up? Is this the
failure you saw when you passed the /dev/disk/by-id device path?
It seems like ceph-volume completed successfully all this time, but since
I always passed /dev/disk/by-id and not /dev/sdX to 'ceph orchestrator
osd create', this intersection was always empty:
set(devices) & set(osd['devices']) [1]
The other part of the condition was also true, so the 'continue' happened all the time.
Therefore the orchestrator does not even try to:
self._create_daemon('osd',...) [2]
Not sure why the OSD count is incremented, though.
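One way the comparison could be made path-agnostic - sketched here as an assumption, not the actual fix merged into cephadm - is to canonicalize both sides before intersecting, since /dev/disk/by-id entries are udev-managed symlinks to the kernel names:

```python
import os

def normalize(dev: str) -> str:
    """Resolve symlinks (e.g. a /dev/disk/by-id path) to the canonical
    target (e.g. /dev/sdX) so different spellings of the same device
    compare equal."""
    return os.path.realpath(dev)

def devices_match(requested, reported) -> bool:
    """True if any requested device refers to a reported device,
    regardless of which naming scheme each side uses."""
    return bool({normalize(d) for d in requested} &
                {normalize(d) for d in reported})
```

With this, a by-id path passed on the command line would intersect with the /dev/sdX name ceph-volume reports, and the orchestrator would proceed to _create_daemon('osd', ...).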