Hi all,
Have a question regarding CephFS and write performance. Possibly I am
overlooking a setting.
We recently started using Ceph, and we want to use CephFS as shared
storage for a Sync-and-Share solution.
We are still in a testing phase, mainly looking at the performance of the
system, and we are seeing some strange issues.
We are using Ceph Quincy release 17.2.6, with a replica 3 data policy
across 21 hosts spread across 3 locations.
When I write multiple 1 GiB files, the write performance drops from
400 MiB/s to 18 MiB/s, with multiple retries as well.
However, when I drop the page cache on the client every minute, the
performance remains good. But that's not really a solution, of course.
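The workaround is roughly this (run as root on the client; echo 1 only
drops the page cache, not dentries/inodes):

  # crude workaround: drop the client page cache once a minute
  while true; do sync; echo 1 > /proc/sys/vm/drop_caches; sleep 60; done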
I have already played a lot with the sysctl settings, like the vm.dirty_*
ones, but it makes no difference at all.
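For example, I tried variations along these lines (the values are just
examples, not recommendations):

  # make the kernel flush dirty pages earlier / more often
  sysctl -w vm.dirty_background_ratio=5
  sysctl -w vm.dirty_ratio=10
  sysctl -w vm.dirty_expire_centisecs=1000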
When I enable fuse_disable_pagecache, the write performance does stay
reasonable at 70 MiB/s,
but the read performance completely collapses from 600 MiB/s to 40 MiB/s.
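(For completeness, I set it like this; it can also go into the [client]
section of ceph.conf:)

  # disable the ceph-fuse page cache for all clients
  ceph config set client fuse_disable_pagecache true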
There is no difference in behavior between the kernel client and the FUSE
client.
I have already played around with client_oc_max_dirty, client_oc_max_objects,
client_oc_size, etc., but haven't found the right setting.
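As an illustration, these are the kinds of changes I experimented with
(the values are examples only):

  # grow the client object cacher and its dirty limit
  ceph config set client client_oc_size 419430400       # 400 MiB, default 200 MiB
  ceph config set client client_oc_max_dirty 209715200  # 200 MiB, default 100 MiB
  ceph config set client client_oc_max_objects 2000     # default 1000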
Anyone familiar with this who can give me some hints?
Thanks for your help! :-)
Kind regards, Tom
Dockerized Ceph 17.2.6 on Ubuntu 22.04
The CephFS filesystem has a size of 180 TB, of which only 66 TB are used.
When running `ls -lR`, the output stops and all accesses to the directory
stall. `ceph health` says:
# ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
    mds.vol.ppc721.mvxstq(mds.0): Client dessert failing to respond to capability release client_id: 6899709
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.vol.ppc721.mvxstq(mds.0): 1 slow requests are blocked > 30 secs
`ceph -w` shows:
[WRN] slow request 31.421408 seconds old, received at 2023-10-01T09:53:44.634849+0000: client_request(client.7360117:2224947 getattr AsLsFs #0x4000012503f 2023-10-01T09:53:44.631148+0000 caller_uid=0, caller_gid=0{0,}) currently failed to rdlock, waiting
[WRN] client.6899709 isn't responding to mclientcaps(revoke), ino 0x4000012503f pending pAsLsXsFsc issued pAsLsXsFscb, sent 61.422148 seconds ago
The full output of `ceph daemon [mds] dump inode 0x4000012503f`,
`config show`, `dump_ops_in_flight`, and `ceph -w` with timestamps can be
found at
https://gist.github.com/test-erik/5de4a7bd632f62ab58c3115cfb876ae0
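In case it helps with diagnosis, this is how we inspect the blocked
session; evicting the client would be a last resort, since it drops the
client's caps and may disrupt it:

  # list the sessions (including client.6899709) on the active MDS
  ceph tell mds.vol.ppc721.mvxstq session ls
  # last resort: evict the unresponsive client
  ceph tell mds.vol.ppc721.mvxstq client evict id=6899709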
Do you have an idea what we can do about this?