On 18.06.21 at 20:42, Sage Weil wrote:
Following up with some general comments on the main container downsides
and on the upsides that led us down this path in the first place.
[...]
Thanks, Sage, for the nice and concise summary on the Cephadm benefits, and the reasoning
on why the path was chosen!
Also thanks for your reply on my question about the modularity of the actual orchestrator.
I really appreciate this, and will try to reply in one place here.
After the huge activity in this thread, I took a step back to watch and make up my
mind, trying to condense my main issues with the "containers-only" approach, also
taking other replies into account.
I hope this is not seen as a rant, but rather a collection of arguments for an additional
orchestrator module,
or maybe even something different. Unfortunately, it has become a wall of text, but I hope
at least some will fight their way through.
First of all, I fully agree with the positive points you raised — it's surely a gain
for devs and many users to ship something tested and "complete"
without having a full and still necessarily incomplete OS test matrix to constantly check
and extend. It also eases testing especially when trying out experimental features,
and takes away usage complexity e.g. in the upgrade path.
Of course, there's also the point that having a large test matrix across OSs tends to
uncover actual bugs or issues which may not show up in a reduced test environment[0],
so reducing the matrix also comes at a price for reliability which has to be weighed
against the time which is saved.
The security issue (50 containers -> 50 versions of openssl to patch) also still stands:
the earlier question on this list (when to expect patched containers for a CVE affecting
a library) has still not been answered[1], so these are real-life concerns. In general,
I don't know of any project that has ever managed to keep up with the workload of
tracking all CVEs of all its dependencies, announcing them and patching them; this
workload is comparable to the one the security teams of Linux distributions have to handle.
In addition, you'll also need to address the question of when and how to pull new
images once patched containers become available, how and when to inform the administrator,
and how to orchestrate service restarts as needed (you'd basically need
"needs-restarting" and friends). That's still quite a way to go, and will require
constant developer effort from now on.
That being said, Ceph may be the first ever project managing to fulfil expectations here
due to the close coupling to those guys wearing red hats ;-).
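To illustrate what such restart bookkeeping involves, here is a rough sketch of the
bare-metal check that "needs-restarting" performs (my own illustrative shell, not any
project's actual tooling): a process still mapping a shared library that a package
update deleted on disk is a restart candidate. A container-based equivalent would
instead have to pull new images and compare the digest each running container was
started from, plus implement a policy for when to restart.

```shell
# Sketch of the bare-metal "needs-restarting" idea: after a library
# update, any process that still maps a shared object which was
# deleted on disk needs a restart to pick up the patched version.
restart_candidates() {
    for maps in /proc/[0-9]*/maps; do
        pid=${maps#/proc/}; pid=${pid%/maps}
        # "(deleted)" in a mapping means the on-disk file was replaced
        if grep -q '\.so.* (deleted)' "$maps" 2>/dev/null; then
            echo "$pid"
        fi
    done
    return 0
}

restart_candidates
```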
Another point raised on this list is that some users are anxious about pushing a
"magic" button which upgrades a whole cluster.
Sure, this button is super useful, and incorporates developer wisdom, and allows the
developers to test the full sequence and ship it to everybody.
So these buttons (e.g. "ceph orch upgrade") are something which is useful and, I
should say, important.
However, by design, it hides the "inner workings" of Ceph, which is a major
drawback for some users.
After this introduction, let me come to my main personal concern: loss of
integrability.
Our model of operation is to have all machines (anything, be it your off-the-shelf
desktop, a laptop,
a hypervisor, a compute node or a Ceph node) handled by the very same configuration
management. It means all configuration is self-documenting,
reinstallation is done with the push of a button, and anybody who understands the
configuration management and the services at a basic level can take over operations.
It's the only way we _can_ operate, given the huge number of services requested and
required in the IT business these days.
To give just one example: We mount kerberised NFS on all our desktop nodes, via CephFS
exposed via nfs-ganesha. The desktops run Ubuntu/Debian, the file servers CentOS.
If I need to change the Kerberos configuration (order of KDCs, roll out new principals
etc.), for us this is a change in a single place: We perform the change in Puppet,
wait 30 minutes, and all systems run the new configuration[2].
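The core mechanism the configuration management provides here can be sketched in a few
lines (a toy illustration only: the file names and KDC hostnames are made up, and
Puppet of course does far more, e.g. templating, dependency ordering and reporting).
Each run compares the desired file content against what is deployed and only acts on a
difference, which is why a single change in one place converges on all nodes.

```shell
# Toy sketch of the idempotent "file resource" behaviour a tool like
# Puppet applies on every node: act only if the deployed config
# differs from the desired one. Paths here are examples only.
sync_file() {
    desired=$1 deployed=$2
    if cmp -s "$desired" "$deployed" 2>/dev/null; then
        echo "up to date: $deployed"
    else
        echo "would update: $deployed"  # real tool: copy + optional service reload
    fi
}

# Example: a changed KDC order in a hypothetical krb5.conf template
printf 'kdc = kdc1.example.org\nkdc = kdc2.example.org\n' > /tmp/krb5.conf.desired
sync_file /tmp/krb5.conf.desired /etc/krb5.conf
```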
When operating a service with its own orchestrator, this means for me: I have to manually
adapt the configuration of this service.
I need someone who is able to do that (i.e. two configuration systems have to be learnt),
and who then does it for all instances of the service (e.g. all Ceph clusters).
Hence, a previously simple change is multiplied in complexity. I can no longer just
replace the OS disk of a Ceph-OSD node and push "reinstall",
letting Puppet install all services and Ceph packages (so I only have to adopt the
disks; this is not automated as a safety precaution):
I now also have to talk to the Ceph orchestrator.
So automation is a must, and we also heavily rely on containers for scientific workloads
to offer a large variety of software stacks to our users.
Automation works brilliantly on Linux (and likely also similar open platforms), since
almost all services can be combined as building blocks,
and controlled by our configuration management Puppet (or any other tool). It's
basically a consequence of the Unix idea that all programs do their own thing well,
and can be "glued" together as-needed for site-specific requirements.
While the "cephadm with containers" approach does this kind of glueing for me,
it makes it harder (impossible?) to integrate as-is into existing configuration management
systems.
I think this is also a strong point; reading through the list, one of the major
reasons why larger Ceph sites do not want to use cephadm with containers seems to be
exactly that:
it can't easily be "cut into pieces" and integrated into the existing
system they use for all their other infrastructure.
So in summary:
The orchestrator is a good thing, but my point is that the currently implemented solution
is not the right solution for a noticeable fraction of the existing community.
In addition to the users you had in mind when designing the current orchestrator,
there are also many active users of Ceph who want to have more direct access to the
"complexity" of the system for two reasons[3]:
- To integrate it into existing automations.
- To learn how things work and interact.
The latter point is also a strong one, especially for me as an experimental physicist:
I learned to love Ceph exactly because I played with the different components and their
interactions,
tried to break them, saw how they react, and gained a deeper understanding that helps me
tackle any future issue.
This love never develops for me when I use a more "polished" product which has
buttons doing things for me.
It's a major reason why I'd even choose Ceph(FS) over G*FS if the latter was
available for free (of course, there are many more reasons).
So my conclusion is: the chosen path is changing the audience of Ceph, affecting both
existing and new users.
The length of this thread has shown that the community has different opinions on this path
forward, for many different reasons.
My personal feeling is that the solution embraces new users who want to set something up
quickly,
and is mainly a problem for existing and future long-term production users with larger
clusters who want to understand the full stack and integrate it into their environment.
Does it really free developer resources in the long run?
I'm not sure about that: the community may shift more towards users reporting issues
like "I pushed the green upgrade button, and now it is stuck"
(similar mails have already arrived on the mailing list), with fewer reports from users
who provide stack traces or spot issues in the network communication between services
(will automated crash-dump reports be able to replace experienced bug reporters?). My
personal feeling is that the latter type of user is the one
who stays with Ceph for years or decades, and even though they may seem to complain
that Ceph is a complicated machine with many nuts and bolts,
it usually does not disturb them as much as it may seem.
So finally, what is my idea about a path forward?
In addition to continuing to deliver packages (will they stay?), I basically see two ways:
- Having an additional orchestrator module running on bare-metal.
Given the assumptions above and mails on this list, most users who'd use it would
in any case install their packages differently,
and only use the orchestrator e.g. to distribute cephx keys, set up initial
configuration, enable systemd units etc.,
finally persisting a varying fraction of these things into their configuration
management.
So the orchestrator may even be useful for them without the actual complex capabilities
of adding repositories and installing packages.
At least, that's what I'd be happy about: An orchestrator doing all the
Ceph-only things for me, ideally telling me what it does.
- Having really extensive manual installation instructions.
Currently, I find these super useful for the basic first steps, but they basically
break off after you have a mon, mgr and osd.
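To make concrete what "Ceph-only things" such a bare-metal orchestrator (or more
complete manual documentation) would cover, here is a dry-run sketch for bringing up
an OSD on a host whose OS and packages are already managed elsewhere. The commands are
the standard Ceph/systemd CLIs from the manual deployment docs; the device is a
placeholder, and RUN=echo only prints what would be executed.

```shell
# Dry-run sketch of the Ceph-only steps for adding an OSD on a host
# that is already provisioned (e.g. by Puppet). RUN=echo prints the
# commands instead of executing them; drop it to actually run.
# The device /dev/sdb is a placeholder.
RUN=echo

deploy_osd() {
    dev=$1
    # 1. Distribute the cephx bootstrap key to the node
    $RUN ceph auth get client.bootstrap-osd \
        -o /var/lib/ceph/bootstrap-osd/ceph.keyring
    # 2. Prepare and activate the OSD on the given device
    $RUN ceph-volume lvm create --data "$dev"
    # 3. Make sure the OSD daemons come back after a reboot
    $RUN systemctl enable --now ceph-osd.target
}

deploy_osd /dev/sdb
```

An orchestrator module running these same steps over SSH, while telling me what it
does, would give exactly the transparency asked for above.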
In essence, these two points are the same: the modular orchestrator already has all the
tasks coded inside. Essentially,
the orchestrator is the most up-to-date, complete, tested and well-maintained manual
documentation we will ever get. Correct?
This is why my personal feeling is that having a "bare metal orchestrator" (i.e.
an "SSH orchestrator" like ceph-deploy)
even without the features of adding repositories, installing packages, or upgrading at the
push of a button will be sufficient for those of us
having issues with the current solution.
It's essentially about making the manual installation instructions into an
orchestrator module (which is probably close to the current cephadm minus containers).
Is this scope small enough to warrant the effort?
It may even be less work than writing and maintaining more extensive manual
installation documentation.
Cheers (and congratulations to all who made it to the end of this mail),
Oliver
[0] As a basic example, g++ gets increasingly better at warning about potentially
unintended behaviour caused by common classes of bugs in code,
so a newer g++ may find more issues when testing.
The same may happen for different library versions, which may point out API usage
bugs early on, or reveal issues earlier in testing.
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/PPLJIHT6WK…
[2] Of course, you can (and should) use a staged rollout.
[3] Well, that's a presumption, but the fact that you mentioned user concerns about
this in the survey seems to strengthen that point.
--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax: +49 228 73 7869
--