Thanks, Sage. This is a terrific distillation of the challenges and benefits.
FWIW here are a few of my own perspectives, as someone experienced with Ceph but with
limited container experience. To be very clear, these are *perceptions* not *assertions*;
my goal is discussion not argument. For context, I have not used a release newer than
Nautilus in production, in large part due to containers and cephadm.
Containers are
more complicated than packages, making debugging harder.
I think that part of this comes down to a learning curve and some
semi-arbitrary changes to get used to (e.g., systemd unit name has
changed; logs now in /var/log/ceph/$fsid instead of /var/log/ceph).
Indeed, if there are logs at all. It seems that (by default?) one has to (know to) use
journalctl to extract daemon or cluster logs, which is rather awkward compared to having
straight files. And those go away when the daemon restarts or is redeployed, losing data
continuity? Is logrotate used as usual, such that it can be adjusted?
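For what it’s worth, here is a sketch of what I’ve pieced together about log extraction under cephadm; the fsid and daemon name below are hypothetical, and the commented commands obviously need a live cluster:

```shell
# Hypothetical fsid and daemon name, purely for illustration.
fsid="4b5c8c0a-ff60-11ee-b1f0-000000000001"
daemon="mon.ceph01"

# cephadm names systemd units ceph-<fsid>@<daemon>, so log extraction
# goes through journald rather than flat files:
unit="ceph-${fsid}@${daemon}.service"
echo "$unit"
# journalctl -u "$unit" --since "1 hour ago"   # requires a live cluster
# cephadm logs --name "$daemon"                # convenience wrapper for the above

# File-based logging under /var/log/ceph/$fsid can reportedly be
# re-enabled cluster-wide, restoring logrotate-style handling:
# ceph config set global log_to_file true
```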
If running multiple clusters on a set of hardware is deprecated, why include the fsid in
the pathname? This complicates scripting and monitoring / metrics collection. Or have we
retconned multiple clusters?
The admin sockets are under a similar path in /var/run. I have yet to discover an
incantation of, e.g., `ceph daemon mon.foo` that works; indeed, specifying the whole path
to the asok yields an error about the path being too long, so I’ve had to make a symlink
to it. This isn’t great usability, unless of course I’m missing something.
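For the record, the workaround I landed on looks roughly like this; the fsid is hypothetical (the real one comes from `ceph fsid`), and the path-length error appears to come from the Unix-socket sun_path limit of roughly 107 bytes:

```shell
# Hypothetical fsid, purely for illustration.
fsid="4b5c8c0a-ff60-11ee-b1f0-000000000001"
asok="/var/run/ceph/${fsid}/ceph-mon.foo.asok"
echo "$asok"

# The full path can trip the socket path-length limit, hence a short symlink:
# ln -s "$asok" /tmp/mon.foo.asok
# ceph daemon /tmp/mon.foo.asok mon_status

# Alternatively, I gather the paths are short inside the container, so
# running the command from within cephadm's shell also works:
# cephadm shell -- ceph daemon mon.foo mon_status
```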
Security (50
containers -> 50 versions of openssl to patch)
This feels like the most tangible critique. It's a tradeoff. We have
had so many bugs over the years due to varying versions of our
dependencies that containers feel like a huge win: we can finally test
and distribute something that we know won't break due to some random
library on some random distro. But it means the Ceph team is on the
hook for rebuilding our containers when the libraries inside the
container need to be patched.
This seems REALLY congruent with the tradeoffs that accompanied shared/dynamic linking
years ago. Shared linking saves on binary size and facilitates sharing of address space
among processes; dynamic shared linking lets one update dependencies (notably openssl for
sure since it has had lots of exploits over time, but others too). But that also means
that changes to those libraries can break applications. So we’ve long seen commercial /
pre-built binaries statically linked to avoid regression and breakage. Kind of a rock and
a hard place situation. Some assert that Ceph daemon systems should be mostly or entirely
inaccessible from the Internet, and usually don’t have a large set of users — or any
customers — logging into them. Thus it can be argued that they are less exposed to
attacks, which would somewhat favor containerization.
One might say that containerization and orchestration make updates for security fixes
trivial, but remember that in most cases such an upgrade is not against the immediately
prior Ceph dot release, which means exposure to regressions and other unanticipated
changes in behavior. Which is one reason why enterprises especially may stick with a
given specific dot release that works until compelled. Updating upstream containers for
security fixes is right back into the dependency hell situation too.
On the flip side, cephadm's use of containers offers some huge wins:
- Package installation hell is gone.
As a user I never experienced much of this, but then I was mostly installing packages
outside of ceph-deploy et al. With at least 3 different container technologies in play,
though, are we substituting one complexity for another?
- Upgrades/downgrades can be carefully orchestrated.
With packages,
the version change is by host, with a limbo period (and occasional
SIGBUS) before daemons were restarted. Now we can run new or patched
code on individual daemons and avoid an accidental upgrade when a
daemon restarts.
Fair enough - that limbo period was never a problem for me, but re careful orchestration,
we see people on this list all the time experiencing orchestration failures. Is the list
a nonrepresentative sample of people’s experience? The opacity of said orchestration also
complicates troubleshooting.
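To be fair to the orchestration story, the staged-upgrade workflow, as I understand it, looks something like the following; the target version is hypothetical:

```shell
target="17.2.7"   # hypothetical target release
echo "upgrade target: ${target}"

# The orchestrator rolls daemons one at a time rather than per-host:
# ceph orch upgrade start --ceph-version "$target"
# ceph orch upgrade status
# ceph orch upgrade pause    # halt mid-flight if a regression appears
# ceph orch upgrade resume
# ceph orch upgrade stop     # abandon the upgrade entirely
```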
- Ceph installations are carefully sandboxed.
Removing/scrubbing ceph
from a host is trivial as only a handful of directories or
configuration files are touched.
Plus of course any ancillary tools. This seems like it would be advantageous in labs. In
production it’s not uncommon to reimage the entire box anyway.
And we can safely run multiple
clusters on the same machine without worrying about bad interactions
Wasn’t it observed a few years ago that almost nobody actually did that, hence the
deprecation of custom cluster names?
- Cephadm deploys a bunch of non-ceph software as well
to provide a
complete storage system, including haproxy and keepalived for HA
ingress for RGW and NFS, ganesha for NFS service, grafana, prometheus,
node-exporter, and (soon) samba for SMB. All neatly containerized to
avoid bumping into other software on the host; testing and supporting
the huge matrix of package versions available via various distros
would be a huge time sink.
One size fits all? None? Many? Some? Does that get in the way of sites that, e.g., choose
nginx for LB/HA, or run their own Prometheus / Grafana infra for various reasons? Is
this more of the
We've been beat up for years about how complicated
and hard Ceph is.
True. I was told in an interview once that one needs a PhD in Ceph. Over the years
operators have had to rework tooling with every release, so the substantial retoolings
that come with containers and cephadm / ceph orch can be daunting. Midstream changes and
changes made for no apparent reason contribute to the perception. JSON output is supposed
to be invariant, or at least backward compatible, yet we saw mon clock skew move for no
apparent reason, and there have been other breaking changes. cf. the ceph_exporter source
for more examples.
Rook and cephadm represent two of the most successful
efforts to
address usability (and not just because they enable deployment
management via the dashboard!),
The goals here are totally worthy, to make things more turnkey. I get that, I really do.
There are some wrinkles though:
* Are they successful, though? I’m not saying they aren’t, I’m asking. The frequency of
cephadm / ceph orch SNAFUs posted to this list is daunting. It seemed at one point that
Rook would become the party line, but now it’s heterodox?
* Removing other complexity by introducing new complexity (containers). There seems to
have been an assumption here that operators already grok containers? In any of the three+
flavors in play? It’s easy to just dismiss this as a learning curve, but it’s a rather
significant one, and assuming that the operator will do that in their Copious Free Time
isn’t IMHO reasonable.
* Dashboard operation by pushing buttons can make it dead simple to deploy a single dead
simple configuration, but revision-controlled management of dozens of clusters is a
different story. Centralized config is one example (assuming the subtree limit bug has
been fixed). Absolutely, managing ceph.conf across system types and multiple clusters is
a pain — brittle ERB or J2 templates, inscrutable Ansible errors. But how does one link
CLI-based centralized config with revision control and peer review of changes? One thing
about turnkey solutions is that in general they are unreasonably simplistic or
rigid in ways that are a bad fit for manageable enterprise deployment, and if we’re going
to do everything for the user *and* make it difficult for them to dig deep or customize,
then the bar is *very* high for success.
* Some might add ceph-ansible to that list.
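On the revision-control point above, the closest thing I’ve found is snapshotting the mon config database; a hedged sketch, with hypothetical filenames:

```shell
snapshot="cluster-config.json"   # hypothetical filename
echo "snapshot file: ${snapshot}"

# Export the centralized config for diffing and peer review:
# ceph config dump --format json > "$snapshot"
# git add "$snapshot" && git commit -m "config snapshot"

# Apply a reviewed ini-style file back into the mon config database:
# ceph config assimilate-conf -i reviewed.conf
```

This still doesn’t give pre-apply review of CLI-made changes, which is the gap I was getting at.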