Darren Soothill (darren.soothill) writes:
Hi Fabien,
ZFS on top of RBD really makes me shudder. ZFS expects to have individual disk devices
that it can manage. It thinks it has them with this config, but Ceph is masking the real
data behind it.
As has been said before, why not just use Samba directly from CephFS and remove that layer
of complexity in the middle?
As a user of ZFS on Ceph, I can explain some of our motivation.
As was pointed out earlier in this thread, CephFS will give you snapshots
but not diffs between them. I don't know what the intent was with using
diffs, but in ZFS's case, snapshots provide a basis for checkpointing/
recovery and instant dataset cloning, but also for replication/offsite
mirroring (although not synchronous) - so you could easily back up/replicate
the ZFS datasets to another location that doesn't necessarily have a Ceph
installation (say, a big, cheap JBOD box with SMR drives running native ZFS).
And you can diff between snapshots to see instantly which files were
modified. That's in addition to the other benefits of running ZFS, such as
lz4 compression (per dataset), deduplication, etc.
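A minimal sketch of that snapshot/replication workflow (all dataset, pool and host names here are hypothetical, and the backup host only needs plain ZFS, no Ceph):

```shell
# Take two snapshots of a dataset.
zfs snapshot tank/customers@monday
zfs snapshot tank/customers@tuesday

# Full send of the first snapshot to a plain-ZFS box,
# then an incremental send of only what changed since.
zfs send tank/customers@monday | ssh backup-host zfs receive backup/customers
zfs send -i @monday tank/customers@tuesday | ssh backup-host zfs receive backup/customers

# Instantly list which files were modified between the two snapshots.
zfs diff tank/customers@monday tank/customers@tuesday

# Instant writable clone of a snapshot, e.g. for testing.
zfs clone tank/customers@tuesday tank/customers-test
```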
While it's true that ZFS on top of RBD is not optimal, it's not
particularly dangerous or unreliable. You provide it with multiple RBDs
and create a pool out of those (a ZFS pool, not a Ceph pool :). It sees each
RBD as an individual disk, and can issue I/O to those independently.
If anything, you lose some of the benefits of ZFS - automatic error
correction, since without ZFS-level redundancy there's no second copy to
repair from - but everything is still checksummed, so you still detect
corruption.
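In concrete terms, that setup looks something like this (pool and image names are made up, and device paths depend on map order):

```shell
# Create a few RBD images in an existing Ceph pool (here called "zvols").
rbd create zvols/vdev0 --size 1T
rbd create zvols/vdev1 --size 1T
rbd create zvols/vdev2 --size 1T

# Map them on the host; each shows up as a block device.
rbd map zvols/vdev0   # e.g. /dev/rbd0
rbd map zvols/vdev1   # e.g. /dev/rbd1
rbd map zvols/vdev2   # e.g. /dev/rbd2

# Hand them to ZFS as plain disks - no ZFS-level redundancy here,
# since Ceph already replicates underneath.
zpool create tank /dev/rbd0 /dev/rbd1 /dev/rbd2
zfs set compression=lz4 tank
```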
I already run ZFS within a VM (all our customers are hosted like this,
using LXD or FreeBSD jails); whether the backing store is NFS, local disk
or RBD doesn't really matter.
So why NOT run ZFS on top of RBD? Complexity, mostly, and some measure
of lost performance... But CephFS isn't exactly simple to run in a
reliable manner as of yet (MDS performance and possible deadlocks are
an issue).
If you're planning on serving files, you're still going to need an NFS
or SMB layer. If you're on CephFS, you can serve via Ganesha or Samba
without the extra ZFS layering, which would add latency; but either
way you're still going to drag the data out of CephFS to the client
mounting the FS and export that via Samba/NFS. If instead you attach, say,
10 x 1 TB RBD images from a host, assemble those into a ZFS pool, and
run NFS or Samba on top of that, you'll have more or less the same data
path, except you'll also be going through ZFS, which introduces latency.
Now, if you're daring, you create a Ceph pool with size=1, min_size=1
(will Ceph let you do that? :), you map RBDs out of that, and hand them over
to ZFS in a RAID+mirror config (or raidz2) - and let ZFS deal with
failing vdevs by giving it new RBDs to replace them. Sounds crazy?
Well, you lose the benefit of Ceph's self-healing, but you still get
a super scalable ZFS running on a near-limitless supply of JBOD :) And
you can quickly set up different (ZFS) pools with different levels of
redundancy, quotas, compression, metadata options, etc...
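A rough sketch of that "daring" variant (names are hypothetical; note that newer Ceph releases require an extra confirmation flag before they'll accept size=1, so yes, Ceph will let you, reluctantly):

```shell
# Non-replicated Ceph pool: redundancy comes from ZFS instead of Ceph.
ceph osd pool create zfsraw 128
ceph osd pool set zfsraw size 1      # recent releases ask for --yes-i-really-mean-it
ceph osd pool set zfsraw min_size 1

# Map a handful of RBDs and build a raidz2 vdev out of them.
for i in 0 1 2 3 4 5; do
    rbd create zfsraw/disk$i --size 1T
    rbd map zfsraw/disk$i
done
zpool create tank raidz2 /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3 /dev/rbd4 /dev/rbd5

# When a vdev fails, hand ZFS a fresh RBD and let it resilver.
rbd create zfsraw/disk6 --size 1T
rbd map zfsraw/disk6                 # e.g. /dev/rbd6
zpool replace tank /dev/rbd2 /dev/rbd6
```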
Who says you can't do both anyway (CephFS and ZFS)? Ceph is flexible
enough...