On Wed, Jul 10, 2019 at 2:35 PM Jongyul Kim
<yulistic(a)gmail.com> wrote:
Hi, I'm Jongyul Kim, and I'm interested in the performance of Ceph.
I tried to figure out the advantage of using two MDS daemons instead of a
single MDS under heavy metadata operations (renames). However, the result
was that two MDS daemons performed worse than a single MDS daemon. I'd like
to ask your advice on why this happens.
Here is what I did.
I wrote a micro benchmark that 1) creates a file, 2) writes 4 KB to the
file, and 3) renames it to another directory. Each process performs these
three steps, and I measured the throughput (operations/sec) of Ceph while
increasing the number of processes in the benchmark. The experimental setup
is described below.
Are the auth MDSes of the rename source directory and the rename
destination directory different? Renaming a file across auth MDSes is very slow.
They are the same. The source directory and the target directory of each
process have the same auth MDS. That is, if a process P1 is running on Node
A, then its source directory (src_dir_p1) and target directory (tar_dir_p1)
are pinned to the MDS running on Node A (mds_a). In the same way, if a
process P2 is running on Node B, then its source directory (src_dir_p2) and
target directory (tar_dir_p2) are pinned to the MDS running on Node B
(mds_b). And so on.
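For reference, the pinning described above can be expressed with the
standard ceph.dir.pin extended attribute; the mount point, directory names,
and rank assignments below are illustrative, not the exact ones I used:

```shell
# Pin each process's directories to the MDS rank local to its node.
# Assume rank 0 runs on Node A and rank 1 runs on Node B.
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/nodeA/src_dir_p1
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/nodeA/tar_dir_p1
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/nodeB/src_dir_p2
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/nodeB/tar_dir_p2

# Setting -v -1 clears the pin and returns the subtree to the balancer.
```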
[Ceph and HW configuration]
Ceph version: 14.2.1
Configured with FileStore
NVM with ext4-dax was used as the storage device
IPoIB over a 40 Gbps InfiniBand NIC
Sufficient cores and memory (96 cores with hyperthreading, about 300 GB of DRAM)
[Base setup]
There are two nodes: Node A and Node B.
Node A runs 1 MON, 1 MGR, 1 OSD, 1 MDS, and the micro benchmark.
Node B runs 1 OSD and the micro benchmark.
Each process does 1,000 operations (OPs).
Each process has its own source directory and target directory for
renaming, so there is no contention between the rename requests of
different processes; each process renames 1,000 files from its own source
directory to its own target directory.
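The per-process loop above can be sketched as follows. This is a minimal
stand-in, not my actual benchmark code: the function name and parameters
are hypothetical, and plain temporary directories stand in for directories
under a CephFS mount so the sketch runs anywhere.

```python
import os
import tempfile
import time


def run_ops(src_dir, dst_dir, n_ops=1000, payload=b"\0" * 4096):
    """One benchmark process: create a file, write 4 KB, then rename it
    into the target directory. Returns throughput in ops/sec."""
    start = time.monotonic()
    for i in range(n_ops):
        path = os.path.join(src_dir, f"f{i}")
        with open(path, "wb") as f:          # 1) create the file
            f.write(payload)                 # 2) write 4 KB
            f.flush()
            os.fsync(f.fileno())             # make the write durable
        # 3) rename into the (separate) target directory
        os.rename(path, os.path.join(dst_dir, f"f{i}"))
    return n_ops / (time.monotonic() - start)


# On CephFS, src/dst would be per-process directories under the mount
# point, each pinned to one MDS; temp dirs are used here for portability.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    tput = run_ops(src, dst, n_ops=100)
    print(f"{tput:.0f} ops/sec")
```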
As I increased the number of benchmark processes, Ceph stopped scaling (in
terms of throughput) at around 8 processes per node. I suspect the
bottleneck is the MDS daemon, because rename requests took more time as the
number of processes increased (i.e., the rename portion of the total
execution time grew from 10% with 1 process per node to 24% with 8
processes per node).
To achieve higher throughput, I added one more active MDS daemon on Node B
(so there are now two MDS daemons, one on Node A and the other on Node B).
Additionally, directories were pinned to one of the two MDSes to shard the
metadata operations. That is, directories accessed by processes on Node A
were pinned to the MDS daemon running on Node A (and likewise for Node B).
The result was, as I mentioned at the beginning, that two MDSes achieved
lower throughput, about 50-60% of the single-MDS case. The rename portion
of the total execution time also increased (50% with 1 process per node and
88% with 8 processes per node).
I found that a rename request to the MDS on Node B takes much longer than a
request to the MDS on Node A. (The MDS on Node A is authoritative for the
'/' directory, so it seems to act as the master MDS.) I checked the logs
and confirmed that a rename MDS request on Node B was re-dispatched three
times in the MDS Server to acquire permissions for renaming (once for "pin
inode" and twice for "scatter locks"; I'm not sure why the scatter locks
should be requested twice), whereas a rename request in the MDS Server on
Node A was never re-dispatched.
Although the directories were pinned to the MDS on Node B, this MDS
continually requested permissions from the MDS on Node A on every rename
request. As a result, directory pinning for metadata-operation sharding was
useless, and two MDSes performed worse.
Why does this happen? Why does the second MDS need to re-acquire
permissions, the "pin inode" and the scatter locks, on every rename
request, even though this MDS is authoritative for the directories (via
directory pinning)? Performance would improve if an authoritative MDS (the
MDS on Node B in this case) kept the locks or the "pin inode" permission on
its authoritative directories until a revocation is actually required; this
would eliminate the ping-ponging of locks and permissions between MDSes.
But the current Ceph MDS implementation does not work this way. I'd like to
ask your advice on the rationale behind this design point.
Any comments and advice will be appreciated. Thanks.
Sincerely,
Jongyul Kim
_______________________________________________
Dev mailing list -- dev(a)ceph.io
To unsubscribe send an email to dev-leave(a)ceph.io