The version with sharding to avoid cacheline contention is indeed faster (about 5 times
faster). I modified the test program to verify that it is consistently at least 2x faster.
This is deliberately conservative: the goal is to guard against a regression that would
break the optimization entirely, rather than to fine-tune it. There was such a regression
in Ceph for a long time (fixed earlier this year) and it would be good to make sure it
does not happen again.
In addition, the test should verify that the optimization actually relates to cacheline
contention. If I understand correctly, the latest output I sent you shows that the
non-optimized version uses only one variable and perf c2c reports cacheline contention on
it. The optimized version, however, shows no cacheline contention at all, which is the
intended effect of the optimization.
Is my reasoning correct so far?
On 05/04/2021 09:27, Loïc Dachary wrote:
Morning Joe,
On 05/04/2021 02:15, Joe Mario wrote:
Hi Loïc:
On Sun, Apr 4, 2021 at 4:14 PM Loïc Dachary <loic(a)dachary.org> wrote:
<snip>
Is the above assumption correct?
Yes, absolutely right. I changed the variable to be 128-byte aligned[0]. Is that ok?
Maybe there is a constant somewhere that provides this number (the number of bytes needed
to be "cache aligned") so it is not hard-coded?
Here are a few ways you can get the cacheline size.
One is by reading /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
Another is with: gcc -DLEVEL1_DCACHE_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE` ...
Another is with: grep -m1 cache_alignment /proc/cpuinfo
Most often it's 64 bytes. I believe the Power CPUs are 128 bytes. Itanium was 128 bytes.
However, even on the x86 platforms where the cacheline size is 64 bytes, it's very often
a good idea to pad your hot locks or hot data items out to 128 bytes (i.e. 2 cachelines
instead of 1).
The reason is this: by default, when Intel processors fetch a cacheline of data, the cpu
will gratuitously fetch the next cacheline, just in case you need it. However, if that
next cacheline is a different hot cacheline, the last thing you need is the extra
invalidation traffic those gratuitous fetches cause.
We have seen performance problems due to this, and the resolution was to pad the hot
locks and variables out to 128 bytes. Some of the big database vendors pad out to 128
bytes because of this as well.
Thanks for explaining: it makes sense now.
> I looked at the 2nd tar.gz file that you uploaded
> (ceph-c2c-jmario-2021-04-04-22-13.tar.gz).
> As expected, the "without-sharding" case looked like it did earlier.
> However, in the "with-sharding" case, it didn't look like your ceph_test_c2c program
> was even running. I even dumped the raw samples from the perf.data file and didn't see
> any loads or stores from the program. Can you double check that it ran correctly?
It did not run, indeed. The version with sharding is faster and it finished before the
measurements started. The first observable evidence of the optimization, exciting :-)
I changed the test program so that it keeps running forever; it will be killed by the
caller when it is no longer needed.
The output was uploaded in ceph-c2c-jmario-2021-04-05-09-26.tar.gz

Cheers

> _______________________________________________
> Dev mailing list -- dev(a)ceph.io
> To unsubscribe send an email to dev-leave(a)ceph.io