Quick question, Ceph gurus.
For a 1.1PB raw CephFS system currently storing 191TB of data and 390 million objects (mostly small Python files, ML training files, etc.), how many MDS servers should I be running?
System is Nautilus 14.2.8.
I ask because up to now I have run one MDS with one standby-replay, and occasionally it blows up with large memory consumption (60GB+) even though I have mds_cache_memory_limit = 32G, which was 16G until recently.
It then of course tries to restart on another MDS node, fails again, and after several attempts usually comes back up. Today I increased to two active MDSs, but the question is: what is the optimal number for a pretty active system?
The single MDS seemed to regularly run at around 1400 req/s, and I often get up to six clients failing to respond to cache pressure.
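To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1 TB = 10^12 bytes) and treats the 60GB peak as a lower bound; the variable names are mine and purely illustrative.

```python
# Rough arithmetic from the figures quoted above (decimal units assumed).
data_bytes = 191e12        # 191 TB of stored data
object_count = 390e6       # 390 million objects

# Average size per object, in kilobytes.
avg_object_kb = data_bytes / object_count / 1e3
print(f"average object size: about {avg_object_kb:.0f} KB")

cache_limit_gb = 32        # mds_cache_memory_limit
observed_peak_gb = 60      # memory seen before the MDS falls over ("60GB+")

# How far past the configured cache limit the MDS process grows.
overshoot = observed_peak_gb / cache_limit_gb
print(f"peak memory is at least {overshoot:.1f}x the configured cache limit")
```

So the average object is well under 1 MB, and the MDS is growing to nearly twice the cache limit before it dies.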
The current setup is:
ceph fs status
cephfs - 71 clients
======
+------+----------------+--------+---------------+-------+-------+
| Rank | State          | MDS    | Activity      | dns   | inos  |
+------+----------------+--------+---------------+-------+-------+
| 0    | active         | a      | Reqs: 447 /s  | 12.0M | 11.9M |
| 1    | active         | b      | Reqs: 154 /s  | 1749k | 1686k |
| 1-s  | standby-replay | c      | Evts: 136 /s  | 1440k | 1423k |
| 0-s  | standby-replay | d      | Evts: 402 /s  | 16.8k | 298   |
+------+----------------+--------+---------------+-------+-------+
+-----------------+----------+-------+-------+
| Pool            | type     | used  | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata | 160G  | 169G  |
| cephfs_data     | data     | 574T  | 140T  |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| w           |
| x           |
| y           |
| z           |
+-------------+
MDS version: ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
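As a quick illustration of how the load is currently split between the two active ranks, the sketch below redoes the arithmetic on the status output above. The request rates and dentry counts ("dns") are copied from that table; it is a single snapshot, so the result is only indicative.

```python
# Per-rank figures copied from the `ceph fs status` snapshot above.
ranks = {
    0: {"reqs_per_s": 447, "dns": 12.0e6},   # rank 0, MDS a
    1: {"reqs_per_s": 154, "dns": 1749e3},   # rank 1, MDS b
}

total_reqs = sum(r["reqs_per_s"] for r in ranks.values())
total_dns = sum(r["dns"] for r in ranks.values())

# Fraction of requests and of cached dentries held by each active rank.
shares = {}
for rank, r in ranks.items():
    shares[rank] = (r["reqs_per_s"] / total_reqs, r["dns"] / total_dns)
    print(f"rank {rank}: {shares[rank][0]:.0%} of requests, "
          f"{shares[rank][1]:.0%} of cached dentries")
```

In this snapshot rank 0 is still handling roughly three quarters of the requests and holding most of the cached dentries.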
Regards.
Robert Ruge
Systems & Network Manager
Faculty of Science, Engineering & Built Environment