On Mon, Apr 5, 2021 at 10:38 AM Loïc Dachary <loic@dachary.org> wrote:
[snip]
> The CPU doing that atomic instruction needs to first get ownership of the cacheline.  If no other threads of execution are also trying to get ownership of that cacheline, then ownership is granted rather quickly.
> If, however, there are many other threads of execution trying to get ownership of that cacheline, then all those CPUs must "get in line" and wait their turn.
>
And if there are N CPUs and N*2 threads, the only way to avoid cacheline contention is if all of these threads use a different (cacheline aligned) variable. Is that correct? If it is correct, I wonder if cacheline contention has a linear impact on performance. If there is cacheline contention on 1 variable with N threads on N CPUs, will the performance degradation be the same if there is cacheline contention on 10 variables and the same number of threads & CPUs? Or will it get worse because there is some sort of amplification?

In your test case, you avoided all cacheline contention by putting every lock into its own cacheline.  In practice, however, multiple threads need to be contending for the same locks to modify shared data.

The goal is to look at your application's hottest contended cachelines to see if that contention can be minimized.  Some of the ways to do that include:
1) Seeing if the number of accesses to that line, especially writes, can be reduced.
2) Making sure multiple hot data variables don't share the same cacheline.
3) Looking to see if the accesses to the hot cachelines are coming from the same NUMA node as where the hot data lives.  This isn't always possible, but it's good to examine it.
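
For item 2, it helps to know the cacheline size of the machine before padding or aligning hot variables. Here is a minimal sketch, assuming a Linux system with glibc's getconf or the usual sysfs cache entries. The function name cacheline_size and the fallback value of 64 bytes are my own choices for illustration.

```shell
#!/bin/sh
# Print the L1 data cacheline size in bytes.
# Assumption: Linux with glibc getconf or the sysfs cache directory.
# Falls back to 64 bytes when neither source gives a usable number.
cacheline_size() {
  size=$(getconf LEVEL1_DCACHE_LINESIZE 2>/dev/null)
  sysfs=/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
  # getconf can print nothing, 0, or "undefined"; try sysfs in that case.
  case "$size" in
    ''|0|*[!0-9]*) size=$(cat "$sysfs" 2>/dev/null) ;;
  esac
  # Still nothing usable: assume the common 64-byte line.
  case "$size" in
    ''|0|*[!0-9]*) size=64 ;;
  esac
  echo "$size"
}

cacheline_size
```

Two hot variables whose addresses fall within the same block of that many bytes share a line and will contend with each other even when the code never touches both from the same thread.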

I'd like for the test script to extract those numbers from the files produced by perf c2c. How do you suggest I go about this?

Doing it for this test case was somewhat trivial, albeit fragile.  That's because the "without-sharding" version only had one contended cacheline, and the "with-sharding" version had no contended cacheline.
Because of that, I was able to dump the raw data and add up the load latencies for the test program's load instructions.
The steps I used were:
  # perf script -f -i perf_c2c_a_all_user.data > t.script
  # grep ceph_test_c2c t.script | grep ldlat | sed -e 's/^.*LCK//' | awk '{s+=$2; c+=1} END {if (c) print s " " c " " s/c}'

However, the simple script above won't work when there's more than one hot cacheline involved.  You can still find the data you want; it just gets more complicated.
Plus, the real value in perf c2c is not just seeing the load latencies, but learning everything about the contention, which then helps guide how to minimize it.  It provides a lot of insight into what's happening.
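
To give an idea of what the multi-cacheline case might look like, here is a rough sketch that groups load latencies by cacheline instead of adding them all into one total. It is not what I ran. It assumes 64-byte cachelines and an input that has already been reduced to two columns per load sample: the hex data address and the latency. The function name per_cacheline_latency and that input layout are assumptions for illustration.

```shell
#!/bin/sh
# Sum load latencies per 64-byte cacheline.
# Input (stdin):  "<hex data address> <latency>" per load sample (assumed layout).
# Output: "<cacheline base> <total latency> <samples> <average latency>",
#         sorted by total latency, highest first.
per_cacheline_latency() {
  awk '
    NF >= 2 && $1 ~ /^(0x)?[0-9a-fA-F]+$/ && $2 ~ /^[0-9]+$/ {
      addr = tolower($1)
      sub(/^0x/, "", addr)
      sub(/^0+/, "", addr)
      while (length(addr) < 2) addr = "0" addr
      # Clear the low 6 bits: the last hex digit becomes 0 and the
      # second-to-last digit keeps only its upper two bits.
      n = length(addr)
      d = index("0123456789abcdef", substr(addr, n - 1, 1)) - 1
      d = d - (d % 4)
      line = substr(addr, 1, n - 2) substr("0123456789abcdef", d + 1, 1) "0"
      sum[line] += $2
      cnt[line] += 1
    }
    END {
      for (l in sum)
        printf "%s %.0f %d %.1f\n", l, sum[l], cnt[l], sum[l] / cnt[l]
    }
  ' | sort -k2,2nr
}
```

How to produce the two input columns is left open here. Something like perf script -F addr,weight restricted to the load event may do it, but check the column order and the event filtering against your perf version before trusting the totals. Store samples should be filtered out first, the same way the grep for ldlat does above.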

How does this approach sound?
 1) You set up Ceph to run on a bigger multi-node server with fast storage. 
 2) Run the attached script, which is just an updated version of the script you've been running.
 3) Then we can set up a shared video call where I can walk you through the perf c2c output pointing out all the key pieces of information.
 4) With the insight from "3" above, you can then decide what you might want to automate and how it might be done.

Does that sound reasonable?

See the attached "run_c2c_ceph2.sh" script.
Joe