On Wed, Jul 10, 2019 at 2:35 PM Jongyul Kim
<yulistic(a)gmail.com> wrote:
Hi, I'm Jongyul Kim, and I'm interested in the performance of Ceph.
I tried to figure out the advantage of using two MDS daemons instead of a
single MDS under heavy metadata operations (renames). However, the result
was that two MDS daemons performed worse than a single MDS daemon. I'd like
to ask your advice on why this happens.
Here is what I did.
I wrote a micro benchmark that 1) creates a file, 2) writes 4 KB to the
file, and 3) renames it to another directory. Each process performs these
three steps, and I measured the throughput (operations/sec) of Ceph while
increasing the number of processes in the benchmark. The experimental setup
is described below.
Are the auth MDSes of the rename source directory and the rename
destination directory different? Renaming a file across auth MDSes is very slow.
They are the same. The source directory and the target directory of each
process have the same auth MDS. That is, if a process P1 is running on Node
A, then its source directory (src_dir_p1) and target directory (tar_dir_p1)
are pinned to the MDS running on Node A (mds_a). In the same way, if a
process P2 is running on Node B, then its source directory (src_dir_p2) and
target directory (tar_dir_p2) are pinned to the MDS running on Node B
(mds_b). And so on.
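For reference, the pinning described above can be expressed with the
standard ceph.dir.pin extended attribute; the mount point, directory names,
and rank assignments below are illustrative, not the exact ones I used:

```shell
# Pin each process's directories to the MDS rank local to its node.
# Assume rank 0 runs on Node A and rank 1 runs on Node B.
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/nodeA/src_dir_p1
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/nodeA/tar_dir_p1
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/nodeB/src_dir_p2
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/nodeB/tar_dir_p2

# Setting -v -1 clears the pin and returns the subtree to the balancer.
```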
[Ceph and HW configuration]
Ceph version: 14.2.1
Configured with FileStore
NVM with ext4-dax was used as the storage device
IPoIB over a 40 Gbps InfiniBand NIC
Sufficient cores and memory (96 cores with hyperthreading, about 300 GB of DRAM)
[Base setup]
There are two nodes: Node A and Node B.
Node A runs 1 MON, 1 MGR, 1 OSD, 1 MDS, and the micro benchmark.
Node B runs 1 OSD and the micro benchmark.
Each process does 1,000 operations (OPs).
Each process has its own source directory and target directory for
renaming, so there is no contention between the rename requests of
different processes; each process renames 1,000 files from its own source
directory to its own target directory.
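The per-process loop above can be sketched as follows. This is a minimal
stand-in, not my actual benchmark code: the function name and parameters
are hypothetical, and plain temporary directories stand in for directories
under a CephFS mount so the sketch runs anywhere.

```python
import os
import tempfile
import time


def run_ops(src_dir, dst_dir, n_ops=1000, payload=b"\0" * 4096):
    """One benchmark process: create a file, write 4 KB, then rename it
    into the target directory. Returns throughput in ops/sec."""
    start = time.monotonic()
    for i in range(n_ops):
        path = os.path.join(src_dir, f"f{i}")
        with open(path, "wb") as f:          # 1) create the file
            f.write(payload)                 # 2) write 4 KB
            f.flush()
            os.fsync(f.fileno())             # make the write durable
        # 3) rename into the (separate) target directory
        os.rename(path, os.path.join(dst_dir, f"f{i}"))
    return n_ops / (time.monotonic() - start)


# On CephFS, src/dst would be per-process directories under the mount
# point, each pinned to one MDS; temp dirs are used here for portability.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    tput = run_ops(src, dst, n_ops=100)
    print(f"{tput:.0f} ops/sec")
```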
As I increased the number of benchmark processes, Ceph stopped scaling (in
terms of throughput) at around 8 processes per node. I suspect the
bottleneck is the MDS daemon, because rename requests took more time as the
number of processes increased (i.e., the rename portion of the total
execution time grew from 10% with 1 process per node to 24% with 8
processes per node).
To achieve higher throughput, I added one more active MDS daemon on Node B
(so there are now two MDS daemons, one on Node A and the other on Node B).
Additionally, directories were pinned to one of the two MDSes to shard the
metadata operations. That is, directories accessed by processes on Node A
were pinned to the MDS daemon running on Node A (and likewise for Node B).
The result was, as I mentioned at the beginning, that two MDSes achieved
lower throughput, about 50-60% of the single-MDS case. The rename portion
of the total execution time also increased (50% with 1 process per node and
88% with 8 processes per node).
I found that a rename request to the MDS on Node B takes much longer than a
request to the MDS on Node A. (The MDS on Node A is authoritative for the
'/' directory, so it seems to act as the master MDS.) I checked the logs
and confirmed that a rename MDS request on Node B was re-dispatched three
times in the MDS Server to acquire permissions for renaming (once for "pin
inode" and twice for "scatter locks"; I'm not sure why the scatter locks
should be requested twice), whereas a rename request in the MDS Server on
Node A was never re-dispatched.
Although the directories were pinned to the MDS on Node B, this MDS
continually requested permissions from the MDS on Node A on every rename
request. As a result, directory pinning for metadata-operation sharding was
useless, and two MDSes performed worse.
Why does this happen? Why does the second MDS need to re-acquire
permissions, the "pin inode" and the scatter locks, on every rename
request, even though this MDS is authoritative for the directories (via
directory pinning)? Performance would improve if an authoritative MDS (the
MDS on Node B in this case) kept the locks or the "pin inode" permission on
its authoritative directories until a revocation is actually required; this
would eliminate the ping-ponging of locks and permissions between MDSes.
But the current Ceph MDS implementation does not work this way. I'd like to
ask your advice on the rationale behind this design point.
Any comments and advice will be appreciated. Thanks.
Sincerely,
Jongyul Kim
_______________________________________________
Dev mailing list -- dev(a)ceph.io
To unsubscribe send an email to dev-leave(a)ceph.io