On Mon, Apr 5, 2021 at 3:27 AM Loïc Dachary <loic(a)dachary.org> wrote:
<SNIP>
The version with sharding is faster and it finished before the measurements
started. The first observable evidence of the optimization, exciting :-) I
changed the test program so that it keeps running forever; it will be
killed by the caller when it is no longer needed.
The output was uploaded in ceph-c2c-jmario-2021-04-05-09-26.tar.gz
Hi Loïc:
Your new sharding version looks much better. I do not see any cacheline
contention at all.
Here's some insight into the difference.
The atomic update you're doing to the lock-variable must both read and
write the lock-variable, and it does so while it holds the cacheline
containing the lock-variable locked.
The CPU executing that atomic instruction first needs to get ownership of
the cacheline. If no other threads of execution are also trying to get
ownership of that cacheline, then ownership is granted rather quickly.
If, however, many other threads of execution are trying to get ownership
of that cacheline, then all those CPUs must "get in line" and wait their
turn.
In your "without-sharding" case, the average number of machine cycles
needed for the atomic instruction to gain ownership of the cacheline was
751 machine cycles. In the "with-sharding" case, it dropped to 84 machine
cycles.
Understand, however, that the above numbers are not perfectly accurate.
That's because perf was instructed to ignore any load instructions that
completed in fewer than 70 machine cycles. The reasoning is that at such
low cycle counts there is no contention, so there is no point burdening
the perf tool's execution and data collection with the extra processing of
fast loads that aren't relevant to finding cacheline contention. I mention
this for completeness. The drop from 751 to 84 machine cycles is
significant.
Do you have something in your code to guarantee that nothing else resides
in the same aligned "128 byte 2-cacheline block" as your locks?
Joe