> On 24/02/2021 22:28, Patrick Donnelly wrote:
>> Hello Simon,
>> On Wed, Feb 24, 2021 at 7:43 AM Simon Oosthoek
>> <s.oosthoek(a)science.ru.nl> wrote:
>>> On 24/02/2021 12:40, Simon Oosthoek wrote:
>>>> Hi
>>>> we've been running our Ceph cluster for nearly 2 years now (Nautilus)
>>>> and recently, due to a temporary situation, the cluster is at 80% full.
>>>> We are only using CephFS on the cluster.
>>>> I realize that normally we should be adding OSD nodes, but this is a
>>>> temporary situation, and I expect the cluster to drop below 60% full quite soon.
>>>> Anyway, we are noticing some really problematic slowdowns. There are
>>>> some things that could be related, but we are unsure...
>>>> - Our 2 MDS nodes (1 active, 1 standby) are configured with 128GB RAM,
>>>> but are not using more than 2GB; this looks either very inefficient or
>>>> wrong ;-)
>>> After looking at our monitoring history, it seems the mds cache is
>>> actually used more fully, but most of our servers get a weekly
>>> reboot by default. This obviously clears the mds cache. I wonder if
>>> that's a smart idea for an MDS node...? ;-)
>> No, it's not. Can you also check that you do not have mds_cache_size
>> configured, perhaps in the MDS's local ceph.conf?
> Hi Patrick,
> I've already changed the reboot period to 1 month.
> mds_cache_size is not configured locally in the /etc/ceph/ceph.conf
> file, so I guess it was just the weekly reboot that cleared the cached
> data...
> I'm starting to think that the full ceph cluster is probably the only
> explanation for the performance problems, though I don't know why that would be.
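(As an aside, the mds_cache_size check Patrick mentions can also be done against the running daemon rather than only in ceph.conf -- a sketch, assuming an MDS named mds.a; your daemon name will differ:)

```shell
# Show cache-related settings the running MDS actually has in effect;
# a leftover mds_cache_size would show up here even if it was set
# somewhere other than /etc/ceph/ceph.conf.
ceph config show mds.a | grep mds_cache

# Query the in-memory cache limit directly via the daemon's admin
# socket (run this on the MDS host itself).
ceph daemon mds.a config get mds_cache_memory_size
```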
Did the performance issue only arise when OSDs in the cluster reached 80% usage? What is
your osd nearfull_ratio?
$ ceph osd dump | grep ratio
Is the cluster in HEALTH_WARN with nearfull OSDs?
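To see which OSDs (if any) are driving the flag, something like the following should work (a sketch; exact output formatting varies by release):

```shell
# List any OSDs currently flagged nearfull, with details.
ceph health detail | grep -i full

# Per-OSD utilisation, to see how unevenly data is distributed;
# a few outlier OSDs can trip nearfull well before the cluster
# average reaches the ratio.
ceph osd df
```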
We recently noticed that when one of our clusters had nearfull OSDs, CephFS client
performance was heavily impacted.
Our cluster is Nautilus 14.2.15; clients are kernel 4.19.154.
We determined that it was most likely due to the ceph client forcing sync file writes
when the nearfull flag is present.
Raising and lowering the nearfull ratio confirmed that performance was only impacted
while the nearfull flag was present.
Not sure if that's relevant for your case.
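In case it helps, the raise-and-lower test described above can be reproduced with the standard CLI -- a sketch of the procedure, not something to leave in place long-term on a production cluster:

```shell
# Inspect the current ratios (defaults are typically 0.85 nearfull,
# 0.95 full).
ceph osd dump | grep ratio

# Temporarily raise the nearfull threshold so the flag clears; if the
# flag was the cause, client writes should return to the normal
# (non-sync) path.
ceph osd set-nearfull-ratio 0.90

# ... measure client write performance here ...

# Restore the default once done; running with a raised nearfull ratio
# hides a genuinely filling cluster.
ceph osd set-nearfull-ratio 0.85
```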