Hi Loïc:
Looking further, there is something in those files.  It's just one small cacheline, but it's there.  I'm not used to seeing c2c run on a laptop, nor to seeing so few samples or so little activity in the kernel.
And I apologize for my hasty, mistaken initial analysis.

Here's what it looks like is happening.  Correct me if I'm wrong.

In your "without-sharding" version of Ceph, you had 8 threads in the ceph_test_c2c binary all contending for the same lock located at offset 0 in a cacheline.  And then, in the "with-sharding" version of Ceph, you changed it so that each thread would act on its own copy of the 4-byte lock.

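Unfortunately, the "with-sharding" version likely didn't help, because all 8 locks are packed into the same cacheline, each one 4 bytes after the previous one.  That's 8 x 4 = 32 bytes in total, which fits inside a single cacheline (assuming the usual 64-byte line on x86-64), so the threads still contend for that one line even though each touches a different lock.
If that is true, then you need to rewrite the code so that each lock sits by itself in its own cacheline.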
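
To illustrate what I mean, here is a rough sketch of the two layouts.  I haven't looked at how ceph_test_c2c actually declares its locks, so every name below (PackedLocks, PaddedLock, cacheline_index, and so on) is made up for illustration, and the 64-byte line size is an assumption about your machine:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cachelines, as on typical x86-64 parts.
constexpr std::size_t kCachelineSize = 64;
constexpr std::size_t kNumShards = 8;

// Packed layout: the 4-byte locks sit back to back, so all eight
// land in the same cacheline and the threads still share that line.
struct PackedLocks {
    std::atomic<std::uint32_t> lock[kNumShards];
};

// Padded layout: alignas makes each element start on a cacheline
// boundary and rounds its size up to a full line.
struct alignas(kCachelineSize) PaddedLock {
    std::atomic<std::uint32_t> lock{0};
};

struct PaddedLocks {
    PaddedLock shard[kNumShards];
};

static_assert(sizeof(PaddedLock) == kCachelineSize,
              "each padded lock should occupy exactly one cacheline");
static_assert(alignof(PaddedLock) == kCachelineSize,
              "each padded lock should start on a cacheline boundary");

// Hypothetical helper: which cacheline an address falls in.
inline std::uintptr_t cacheline_index(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCachelineSize;
}
```

With the padded layout, thread i would only ever touch shard[i].lock, and no two shards share a line.  The cost is 64 bytes per lock instead of 4, which is usually a fair trade for a handful of hot locks.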

Is the above assumption correct?

Also, I'm going to update the "run_c2c_ceph.sh" script I gave you.  The "-g" flag isn't working as it should (a known issue).

Joe


On Sun, Apr 4, 2021 at 12:21 PM Loïc Dachary <loic@dachary.org> wrote:
I uploaded the /boot/config-5.10.0-5-amd64 file to dropbox.redhat.com: it looks like perf is compiled in:

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y

and all the options below are =y (no =n).

Maybe a module should be loaded?