On Mar 26, 2021, at 6:31 AM, Stefan Kooman
<stefan(a)bit.nl> wrote:
On 3/9/21 4:03 PM, Jesper Lykkegaard Karlsen wrote:
Dear Ceph’ers
I am about to upgrade the MDS nodes for CephFS in the Ceph cluster (erasure code 8+3) I am
administering.
Since they will get plenty of memory and CPU cores, I was wondering if it would be a good
idea to move the metadata OSDs (NVMes, currently on the OSD nodes together with the
cephfs_data OSDs (HDD)) to the MDS nodes?
Configured as:
4 x MDS nodes, each with a local metadata OSD, and the metadata pool set to 4x replication,
so each metadata OSD would hold a complete copy of the metadata.
I know the MDS keeps a lot of metadata in RAM, but if the metadata OSDs were on the MDS
nodes, would that not bring down latency?
Anyway, I am just asking for your opinion on this. Pros and cons, or even better, input from
somebody who has actually tried this?
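For reference, a minimal sketch of the metadata pool setup I have in mind (pool and rule
names are placeholders, and it assumes the metadata OSDs carry the "nvme" device class):

    # Replicated rule restricted to the NVMe metadata OSDs, failure domain = host
    ceph osd crush rule create-replicated meta_nvme_rule default host nvme
    ceph osd pool set cephfs_metadata crush_rule meta_nvme_rule
    # 4x replication so each metadata OSD holds a full copy
    ceph osd pool set cephfs_metadata size 4
    ceph osd pool set cephfs_metadata min_size 2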
I doubt you'll gain a lot from this. Data still has to be replicated, so the network
latency remains. And reads would come from the primary OSDs of the CephFS metadata pool, so
you would only see gains if you can make all primary OSDs sit on the single active MDS node.
But you would have to do manual tuning with upmap to achieve that.
I think primary affinity would be the way to do this rather than upmap, FWIW, though the net
result might be mixed, since ops would be directed at only 25% of the OSDs: a trade-off of
OSD busy-ness vs. network latency. And as the cluster topology changes, one would need to
periodically refresh the affinity values.
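For the record, that approach would look roughly like this (the OSD IDs are placeholders for
the four metadata OSDs, with osd.40 assumed to sit on the active MDS node; very old releases
needed mon_osd_allow_primary_affinity enabled first):

    # Prefer the OSD co-located with the active MDS as primary
    ceph osd primary-affinity osd.40 1.0
    # De-prefer the other metadata OSDs so they only serve as replicas
    ceph osd primary-affinity osd.41 0
    ceph osd primary-affinity osd.42 0
    ceph osd primary-affinity osd.43 0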
I think your money is better spent on buying more NVMe
disks and spreading the load than on co-locating that on the MDS nodes.
Agreed. Complex solutions have a way of being more brittle, and of hitting corner cases.
If you are planning on multi-active MDS, I don't
think it would make sense at all.
Unless one provisions multiple filesystems, each pinned to an MDS with a unique set of OSDs
(CRUSH root?) and affinities managed independently? Not sure if that's entirely
possible; if it is, it'd be an awful lot of complexity. A rough sketch is below.
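Something like the following would be the rough shape of it, if one went down that road
(filesystem, pool, and daemon names are placeholders; it assumes the per-filesystem metadata
and data pools already exist, each with its own CRUSH rule pointing at its own set of OSDs,
and a recent release that supports mds_join_fs):

    # Allow more than one CephFS filesystem in the cluster
    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph fs new fs1 fs1_metadata fs1_data
    ceph fs new fs2 fs2_metadata fs2_data
    # Pin specific MDS daemons to specific filesystems
    ceph config set mds.mds1 mds_join_fs fs1
    ceph config set mds.mds2 mds_join_fs fs2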