Here's what I'm currently thinking. First we should get a few things out
of the way, like
- removing deepsea and ansible
https://github.com/ceph/ceph/pull/33126
- ceph orchestrator ... -> ceph orch ...
https://github.com/ceph/ceph/pull/33131
- your rename PR, if the underlying bug is resolved
Next I think we need to fix the shape of the CLI to resolve the service
group vs service/daemon ambiguity. Here's my proposal:
https://pad.ceph.com/p/orchestrator_cli
Then I think we can proceed more quickly in parallel with adding the
additional services (monitoring, nfs, etc.) and improving the cephadm
internals.
On the internals side, the core problem I see is _get_services(), which
has basically two users:
- callers in cephadm that need a recent view in order to make
decisions about scheduling, placement
- serve() and 'service ls --refresh', which need to trigger an
actual scrape of the remote hosts.
The serve() one is the most important, IMO: we need it to (1) be parallel,
(2) gracefully handle errors for each host and raise appropriate health
alerts, and (3) update the cache as appropriate. For the CLI case,
whether it triggers the scrape synchronously or somehow kicks serve() and
waits is probably a not-so-important detail.
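As a rough illustration of those three requirements (not the actual
cephadm code), here is a minimal sketch of a parallel scrape that records
failures per host and only updates the cache for hosts that answered. The
names scrape_all(), refresh_cache(), and scrape_host are hypothetical, and
the cache is assumed to be a plain dict keyed by hostname.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(hosts, scrape_host, max_workers=8):
    """Scrape every host in parallel.

    scrape_host is a hypothetical callable: hostname -> list of services.
    Returns (results, failures), both keyed by hostname.
    """
    results = {}
    failures = {}
    if not hosts:
        return results, failures
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_host, h): h for h in hosts}
        for fut in as_completed(futures):
            host = futures[fut]
            try:
                results[host] = fut.result()
            except Exception as e:
                # One bad host must not abort the others; remember
                # exactly which host failed and why.
                failures[host] = str(e)
    return results, failures


def refresh_cache(cache, hosts, scrape_host):
    """Update cache in place; return failures for health reporting."""
    results, failures = scrape_all(hosts, scrape_host)
    # Hosts that failed keep their previous (stale) cache entry.
    cache.update(results)
    return failures
```

The returned failures dict is what a caller would turn into per-host
health alerts; how those alerts are actually raised is left out here.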
On the other hand, I think the remaining internal _get_services() callers
should all just use the latest cached state. Right now the way the code is
structured makes it very confusing which path is used for which, and the
use of the async_map_completion helper (currently, at least) makes it hard
to tell which host failed.
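One way to make the two paths obvious is to give the cached view its own
small interface that never does any remote I/O, so a scheduling caller
cannot accidentally trigger a scrape. The following is only a sketch
under that assumption; ServiceCache and its methods are hypothetical
names, not existing cephadm classes.

```python
import threading
import time


class ServiceCache:
    """Hypothetical per-host cache of scraped service lists."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_host = {}       # hostname -> list of services
        self._last_refresh = {}  # hostname -> monotonic timestamp

    def get_cached(self, host=None):
        """Read path for scheduling/placement: never touches the network."""
        with self._lock:
            if host is not None:
                return list(self._by_host.get(host, []))
            # Return copies so callers cannot mutate the cache.
            return {h: list(s) for h, s in self._by_host.items()}

    def store(self, host, services):
        """Write path: called only by whatever performs the scrape."""
        with self._lock:
            self._by_host[host] = list(services)
            self._last_refresh[host] = time.monotonic()

    def age(self, host):
        """Seconds since the last successful scrape, or None if never."""
        with self._lock:
            ts = self._last_refresh.get(host)
        return None if ts is None else time.monotonic() - ts
```

With this split, only the periodic refresh and an explicit --refresh
would call store(); everything else goes through get_cached().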
As for additional services (monitoring, nfs, etc.), I think that can
proceed more quickly once we have the CLI and add/remove/update issues
sorted out. I may start with an RFC PR on that, but I would really
like some feedback on whether the proposal makes sense.
Thanks!
sage