On Thu, 28 Nov 2019, Paul Cuzner wrote:
On Thu, Nov 28, 2019 at 2:37 AM Sage Weil <sweil(a)redhat.com> wrote:
On Wed, 27 Nov 2019, Paul Cuzner wrote:
> Hi,
>
> I've got a working gist for the add/remove of the monitoring solution.
>
> https://gist.github.com/pcuzner/ac542ce3fa9a4699bb9310b1fd5095d0
>
> I'm out for the next couple of days, but will get a PR raised next week to
> get this started properly.
For some reason it won't let me comment on that gist.
- I don't think we should install anything on the host outside of the unit
file and /var/lib/ceph/$fsid/$thing. I suggest $thing be 'prometheus',
'alertmanager', 'node-exporter', 'grafana'. We could combine all but
node-exporter into a single 'monitoring' thing, but I'm worried this
obscures things too much when, for example, the user might have an
external prometheus but still need alertmanager, and so on.
So all the configs should live in /var/lib/ceph/$fsid/$thing/prometheus.yml
and so on, and then be bound to the right /etc/whatever location by the
container config.
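To make that concrete, here is a rough sketch of what the generated unit file's start command could look like. The unit name, paths, and image are my assumptions for illustration, not a settled interface (and $fsid would need real templating in practice):

```ini
# Hypothetical excerpt of a generated ceph-$fsid-prometheus.service unit.
[Service]
# Config and data live under /var/lib/ceph/$fsid/prometheus on the host,
# bind-mounted to the paths the container expects (config under /etc
# *inside* the container, data under /prometheus).
ExecStart=/usr/bin/podman run --rm --name ceph-prometheus \
    -v /var/lib/ceph/$fsid/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
    -v /var/lib/ceph/$fsid/prometheus/data:/prometheus \
    prom/prometheus
```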
I struggle with this one. Channelling my inner sysadmin: "I expect config
settings to be in /etc and data to be in /var/lib - that's what FHS says
and that's how other systems look that I have to manage, so why does Ceph
have to do things differently?"
1. Because it's a containerized service. Things are in /etc inside the
container, not outside. Sprinkling these configs in /etc mixes
containerized service configs with the *host*'s configs, which seems very
untidy to me.
2. Putting it all in /var/lib/ceph/whatever means it's easy to find and
clean up.
I'm also not sure of the value of fsid in the dir names. I can see the
value if a host has to support multiple ceph clusters - but outside dev, is
that something that the community or our customers actually want?
Most deployments won't need it, but it will avoid a whole range of
problems when they do. Especially once it becomes trivial to bootstrap
clusters, it also becomes trivial to end up with multiple clusters
overlapping on the same host.
And, like above, it keeps things tidy.
The gist downloads the separate containers we need in parallel - which I
think is a good thing! It reduces deployment time.
Sure... that's something we could do regardless of whether it's a separate
script or part of ceph-daemon. Probably what we actually want is for the
ssh 'host add' command to kick off some prestaging of containers in the
background so that the first daemon deployment doesn't wait for a
container download at all.
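A sketch of what that prestaging could look like; the image names and the choice of podman here are assumptions, and the real orchestrator would supply the image list:

```shell
# Pull the given images in the background, in parallel, and wait for all
# of them to finish; intended to be kicked off by the 'host add' handler.
prestage() {
    engine="$1"; shift
    for img in "$@"; do
        "$engine" pull "$img" &   # each download runs concurrently
    done
    wait                          # returns once every pull has completed
}

# e.g.:
# prestage podman prom/prometheus prom/alertmanager prom/node-exporter grafana/grafana
```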
IMO, having monitoring-add deploy grafana/prom and alertmanager together
by default is the way to go. TBH, when I started this, I was putting them
all in the same pod under podman for management and treating them as a
single unit - but having to support 'legacy' docker put an end to that :)
If a user wishes to use a separate prometheus, that will normally have its
own alertmanager too. Which alertmanager a prometheus server uses is
defined in its prometheus.yml. With an external prometheus, rules, alerts
and receiver definitions are going to be an exercise for the reader. We'll
need to document the settings, but the admin will need to apply them - in
this scenario, we could possibly generate sample files that the admin can
pick up and apply? To my mind, deployment of monitoring has two pathways:
default - "monitoring add" yields prom/grafana/alertmanager containers
deployed to the machine
external-prom - "monitoring add" just deploys grafana, and points its
default data source at the external prom URL. We're also making an
assumption here that the prometheus server is open and doesn't require
auth (OCP's prometheus, for example, has auth enabled).
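For reference, a minimal example of the prometheus.yml stanza that wires a prometheus server to its alertmanager (the host and port are illustrative):

```yaml
# prometheus.yml (fragment): tells this prometheus which alertmanager(s)
# to send firing alerts to.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.example.com:9093']
```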
I think it makes sense to focus on the out-of-the-box opinionated easy
scenario vs the DIY case, in general at least. But I have a few
questions...
I think this focus will leave some users in the dust. Monitoring with
prometheus can get complex, especially if it is to be fault tolerant
(which imho is important for confidence in such a system). Also, typically
users don't want several monitoring systems in their environment. So let's
keep the case of existing prometheus systems in mind, please.