ceph slow at 80% full, mds nodes lots of unused memory - ceph-users

24 Feb 2021

Hi

we've been running our Ceph cluster for nearly 2 years now (Nautilus)
and recently, due to a temporary situation the cluster is at 80% full.

We are only using CephFS on the cluster.

Normally, I realize we should be adding OSD nodes, but this is a
temporary situation, and I expect the cluster to go to <60% full quite soon.

Anyway, we are noticing some really problematic slowdowns. There are
some things that could be related but we are unsure...

- Our 2 MDS nodes (1 active, 1 standby) are configured with 128GB RAM,
but are not using more than 2GB, this looks either very inefficient, or
wrong ;-)

"ceph config dump |grep mds":
  mds            basic    mds_cache_memory_limit
107374182400
  mds            advanced mds_max_scrub_ops_in_progress       10

Perhaps we require more or different settings to properly use the MDS
memory?

- On all our OSD nodes, the memory line is red in "atop", though no swap
is in use, it seems the memory on the OSD nodes is taking quite a
beating, is this normal, or can we tweak settings to make it less stressed?

This is the first time we are having performance issues like this, I
think, I'd like to learn some commands to help me analyse this...

I hope this will ring a bell with someone...

Cheers

/Simon