On 08.02.20 at 17:25, Sage Weil wrote:
The serve() one is the most important, IMO: we need it
to (1) be parallel,
(2) gracefully handle errors for each host and raise appropriate health
alerts, and (3) update the cache as appropriate. For the CLI case,
whether it triggers the scrape synchronously or somehow kicks serve() and
waits is a probably-not-so-important detail.
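Those three requirements could be sketched roughly like this (the helper names scrape_host and set_health_alert are mine for illustration, not the actual orchestrator API):

```python
import concurrent.futures

def scrape_all_hosts(hosts, scrape_host, cache, set_health_alert):
    """Scrape all hosts in parallel; record per-host failures instead
    of letting one unreachable host abort the whole pass."""
    failed = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        # (1) parallel: one scrape task per host
        futures = {pool.submit(scrape_host, h): h for h in hosts}
        for fut in concurrent.futures.as_completed(futures):
            host = futures[fut]
            try:
                # (3) update the cache with the fresh result
                cache[host] = fut.result()
            except Exception as e:
                # (2) per-host error handling: remember who failed
                failed[host] = str(e)
    if failed:
        set_health_alert('HOST_SCRAPE_FAILED', failed)
    return failed
```

The point of the per-future try/except is that the health alert can name exactly which hosts failed, while the cache still gets updated for the hosts that succeeded.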
The only concern I have is that we have to prevent scraping in parallel
from serve() and from the CLI: we simply don't have enough connections to
spare. I've seen this for other calls as well: if serve() is busy doing
some background task, the CLI basically hangs.
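One simple way to get that mutual exclusion without the CLI hanging is a non-blocking gate: whoever loses the race falls back to the cached state instead of opening a second set of connections. A rough sketch (ScrapeGate is a made-up name, not existing orchestrator code):

```python
import threading

class ScrapeGate:
    """Allow at most one scrape at a time. serve() and the CLI both go
    through run_exclusive(); a caller that finds a scrape already in
    flight gets None back immediately and should use the cache, rather
    than blocking behind the running scrape."""
    def __init__(self):
        self._lock = threading.Lock()

    def run_exclusive(self, scrape):
        # Non-blocking acquire: never make the CLI wait on serve().
        if not self._lock.acquire(blocking=False):
            return None  # scrape already running; use cached state
        try:
            return scrape()
        finally:
            self._lock.release()
```

The non-blocking acquire is the design choice here: a blocking lock would serialize the connections correctly but reintroduce exactly the CLI hang described above.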
On the other hand, the remaining internal _get_services() callers should,
I think, all just use the latest cached state.
+1
Right now the way the code is
structured makes it very confusing which path is used for which, and the
use of the async_map_completion helper (currently, at least) makes it hard
to tell which host failed.
The exception (admittedly with very little detail) should be forwarded to
the completion.
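For example, the completion could annotate the forwarded exception with the host it came from, so callers can at least tell which host failed. A minimal sketch (this Completion class is hypothetical, not the real orchestrator completion type):

```python
class Completion:
    """Hypothetical completion that carries either a result or an
    exception tagged with the host that produced it."""
    def __init__(self):
        self.result = None
        self.exception = None

    def fail(self, host, exc):
        # Wrap the original error so the failing host is visible
        # to whoever inspects the completion later.
        self.exception = RuntimeError('host %s: %s' % (host, exc))

    def finalize(self, result):
        self.result = result
```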
As for additional services (monitoring, nfs, etc.), I think that can
proceed more quickly once we have the CLI and add/remove/update issues
sorted out. I may start with an RFC PR on that, but I would really
like some feedback on whether the proposal makes sense.
https://github.com/ceph/ceph/pull/33205/files should also help with new
services.
Thanks!
sage
--
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg). Geschäftsführer: Felix Imendörffer