On Mon, Apr 5, 2021 at 3:27 AM Loïc Dachary <loic@dachary.org> wrote:
<SNIP>
The version with sharding is faster and finished before the measurements started. The first observable evidence of the optimization, exciting :-) I changed the test program so that it keeps running forever and will be killed by the caller when it is no longer needed.

The output was uploaded in ceph-c2c-jmario-2021-04-05-09-26.tar.gz

Hi Loïc:
Your new sharding version looks much better.  I do not see any cacheline contention at all. 

Here's some insight into the difference.
The atomic update you're doing to the lock variable has to both read and write that variable, and it does so while it holds the cacheline containing the lock variable locked.

The CPU doing that atomic instruction needs to first get ownership of the cacheline.  If no other threads of execution are also trying to get ownership of that cacheline, then ownership is granted rather quickly.
If, however, many other threads of execution are trying to get ownership of that cacheline, then all those CPUs must "get in line" and wait their turn.
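
To make the read-modify-write concrete, here is a minimal sketch in C++ (an illustration only, not your actual lock code; the SpinLock name and exchange-based loop are my own) of the kind of atomic lock update being described:

#include <atomic>

// Minimal sketch of an atomic read-modify-write on a lock variable.
// The exchange must read and write 'locked' in one indivisible step,
// so the CPU has to hold the cacheline containing 'locked' in exclusive
// ownership for the duration of the instruction.
struct SpinLock {
    std::atomic<int> locked{0};

    void lock() {
        // Every attempt is an atomic RMW: under contention, each CPU must
        // wait its turn for exclusive ownership of the cacheline before
        // its exchange can complete.
        while (locked.exchange(1, std::memory_order_acquire) != 0) {
            // spin; a real lock would back off or pause here
        }
    }

    void unlock() {
        locked.store(0, std::memory_order_release);
    }
};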

In your "without-sharding" case, the average number of machine cycles needed for the atomic instruction to gain ownership of the cacheline was 751 machine cycles.  In the "with-sharding" case, it dropped to 84 machine cycles.

Understand, however, that the above numbers are not perfectly accurate.  That's because perf was instructed to ignore any load instructions that completed in fewer than 70 machine cycles.  The reasoning is that at such low cycle counts there is no contention, so there is no point burdening the perf tool's execution and data collection with the extra processing of fast loads that aren't relevant to finding cacheline contention.  I mention this for completeness.  The drop from 751 to 84 machine cycles is significant.
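
For reference, the sharded pattern generally looks something like the sketch below (the shard count and hash-based shard selection here are made up for illustration; your code may choose shards differently).  With updates spread across many independent lock words, far fewer CPUs compete for ownership of any single cacheline, which is why the average wait drops so sharply.

#include <array>
#include <atomic>
#include <cstddef>
#include <functional>

// Illustrative sharding of one hot lock into many independent lock words.
constexpr std::size_t kNumShards = 16;   // made-up shard count

std::array<std::atomic<int>, kNumShards> shard_locks{};

// Pick the shard for a given key; hashing is just one possible policy.
std::atomic<int>& lock_for(std::size_t key) {
    return shard_locks[std::hash<std::size_t>{}(key) % kNumShards];
}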

Do you have something in your code to guarantee that nothing else resides in the same aligned "128-byte, 2-cacheline" block as your locks?
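
If not, one way to get that guarantee (a sketch, assuming 128 bytes covers the adjacent-cacheline pair on your CPUs; the PaddedLock name is mine) is to over-align each lock so the alignment pads it out to the full block and nothing else can share it:

#include <atomic>

// A lock that owns its entire aligned 128-byte block.  alignas(128) forces
// the struct to start on a 128-byte boundary and rounds sizeof up to 128,
// so array elements and adjacent data cannot land in the same block.
struct alignas(128) PaddedLock {
    std::atomic<int> locked{0};
};

static_assert(alignof(PaddedLock) == 128, "lock must start on a 128-byte boundary");
static_assert(sizeof(PaddedLock) == 128, "lock must fill its 128-byte block");

Note that heap allocations only honour this over-alignment with C++17's aligned operator new, or with an explicitly aligned allocation such as posix_memalign.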

Joe