[snip]
Hi Loïc:
Your new sharding version looks much better. I do not see any cacheline contention at
all.
Here's some insight into the difference.
The atomic update you're doing to the lock-variable has to both read and write it, and
it does so while the cacheline holding the lock-variable is locked.
The CPU executing that atomic instruction first needs to get ownership of the cacheline.
If no other threads of execution are also trying to get ownership of that cacheline,
ownership is granted rather quickly.
If, however, many other threads of execution are trying to get ownership of that
cacheline, then all those CPUs must "get in line" and wait their turn.
And if there are N CPUs and N*2 threads, the only way to avoid cacheline contention
is if each of these threads uses a different (cacheline aligned)
variable. Is that correct? If it is, I wonder whether cacheline contention has a linear
impact on performance. If there is cacheline contention on 1 variable with N threads on N
CPUs, will the performance degradation be the same if there is cacheline contention on 10
variables with the same number of threads & CPUs? Or will it get worse because of
some sort of amplification?
In your "without-sharding" case, the average
number of machine cycles the atomic instruction needed to gain ownership of the
cacheline was 751. In the "with-sharding" case, it dropped to 84.
Understand, however, that the above numbers are not perfectly accurate.
That's
because perf was instructed to ignore any load instruction that completed in fewer
than 70 machine cycles. The reasoning is that at such low latencies there
is no contention, so there is no point burdening
the perf tool's execution and data collection with the extra processing of fast loads that
aren't relevant to finding cacheline contention. I mention this for completeness.
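For reference, that 70-cycle cutoff is the load-latency threshold given to perf on the
command line. A sketch of the invocation, assuming a perf build with the --ldlat option
(the benchmark binary name is a placeholder):

```shell
# Sample memory loads, ignoring any that complete in under 70 cycles.
perf c2c record --ldlat 70 -- ./your_benchmark

# Summarize shared-cacheline activity from the recorded perf.data.
perf c2c report
```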
The drop from 751 to 84 machine cycles is significant.
Thanks for the crystal clear explanation. I'd like the test script to
extract those numbers from the files produced by perf c2c. How do you suggest I go about
this?
Do you have something in your code to guarantee that nothing else resides in the same
aligned "128 byte 2-cacheline block" as your locks?
These 128 bytes are not used; they are just padding to make sure nothing else is
stored there.
Cheers
[0]
https://lab.fedeproxy.eu/ceph/ceph/-/blob/wip-mempool-cacheline-49781/src/i…