Hi Joe,
I can't tell you how happy I am that you're around to help us understand this :-) In
the spirit of taking baby steps towards a better understanding of what's going on, I'd
like to run a small part of Ceph[0] that is designed to be optimized and to avoid
cacheline ping-pong. The code exercising this part would live in a test (similar to an
existing one[1]). It
would be launched on a single Intel machine (as part of a teuthology run, the integration
test tool specific to Ceph) and use the commands you suggest. After the first run I'll
send the data to you for interpretation (sounds like consulting an oracle :-) ).
With your help I'm hoping the integration test will assert that running the mempool
with the optimization is at least X% faster than without the optimization and fail
otherwise. That would be very helpful to guard against accidental regressions.
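To make the idea concrete, here is the kind of guard I have in mind. This is only a rough
sketch: the binary names and the 10% threshold are placeholders I made up, not existing code:

# Rough sketch: time the mempool benchmark with and without the optimization
# and fail when the measured speedup drops below a threshold.
# Binary names and the threshold are hypothetical placeholders.
THRESHOLD_PERCENT=10
t_base=$( { /usr/bin/time -f "%e" ./unittest_mempool_baseline >/dev/null; } 2>&1 )
t_opt=$( { /usr/bin/time -f "%e" ./unittest_mempool_optimized >/dev/null; } 2>&1 )
# Speedup in percent, computed with bc because the times are floating point.
speedup=$(echo "scale=2; ($t_base - $t_opt) * 100 / $t_base" | bc)
echo "baseline=${t_base}s optimized=${t_opt}s speedup=${speedup}%"
if (( $(echo "$speedup < $THRESHOLD_PERCENT" | bc -l) )); then
    echo "FAIL: optimization is less than ${THRESHOLD_PERCENT}% faster" >&2
    exit 1
fi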
Once this first goal is achieved, collecting data from a Ceph cluster running under load
could follow the same methodology. I'm sure Mark Nelson will be most interested in
this more ambitious target.
If I'm not mistaken, the commands (let's say they are in a ceph-c2c.sh script) should be
run like this (a rough orchestration sketch follows the list):
* Run the software (be it mempool simulation or Ceph under load), let it warm up
* Run ceph-c2c.sh (it won't take more than a minute or so to complete)
* Collect the data and save them
* Kill the software
Is that correct?
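For concreteness, something along these lines is what I picture; the mempool_simulation
binary name, the warm-up time and the tarball name are all assumptions on my side:

./mempool_simulation &                     # start the workload (or Ceph under load)
workload_pid=$!
sleep 60                                   # let it warm up before sampling
sudo ./ceph-c2c.sh                         # run the perf c2c commands (about a minute)
tar czf c2c-results.tar.gz *.out *.data    # collect and save the data
kill "$workload_pid"                       # stop the workload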
Cheers
[0] https://github.com/ceph/ceph/blob/2b21735498c98299d5ce383011c3dbe25aaee70f/…
[1] https://github.com/ceph/ceph/blob/2b21735498c98299d5ce383011c3dbe25aaee70f/…
On 18/03/2021 16:40, jmario(a)redhat.com wrote:
Hi Loïc,
Per our email discussion, I'm happy to help. If you or anyone else can run perf c2c,
I will analyze the results and reply back with the findings.
The perf c2c output is a bit non-intuitive, but it conveys a lot. I'm happy to share
the findings.
Here's what I recommend:
1) Get on an Intel system where you're pushing Ceph really hard. (AMD uses different
low-level perf events that haven't been ported over yet.)
2) Make sure the Ceph code you're running has debug info in it and isn't
stripped.
3) This needs to be run on bare metal. The PEBS perf events used by c2c are not
supported in a virtualized guest (Intel says support is coming in newer CPUs).
A quick way to sanity-check items 2) and 3) is sketched below, after the upload
instructions.
4) As an FYI, the less CPU pinning you do, the more cacheline contention c2c will
expose.
5) Once you've run the commands that I've appended below (as root), tar up
everything, data files and all, and lftp them to the location below:
$ lftp dropbox.redhat.com
cd /incoming
put unique-filename
Please let me know the name of the files that you uploaded after you put them there.
I'll grab them.
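A quick sanity check for items 2) and 3), assuming ceph-osd is the binary of interest
(adjust the path to whatever you are actually profiling):

# Check that the binary still carries symbols/debug info (item 2).
file /usr/bin/ceph-osd      # should report "not stripped" (ideally "with debug_info")
# Check that we are on bare metal (item 3).
systemd-detect-virt         # prints "none" (and exits non-zero) when not virtualized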
I just joined this list and I don't know if I'll get notified of the replies, so
send me email when the files are there for me to grab.
Does that sound OK?
Holler if you have any questions.
Joe
# First get some background system info
uname -a > uname.out
lscpu > lscpu.out
cat /proc/cmdline > cmdline.out
timeout -s INT 10 vmstat -w 1 > vmstat.out
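# Per-NUMA-node memory info, one meminfo.N.out file per node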
nodecnt=`lscpu|grep "NUMA node(" |awk '{print $3}'`
for ((i=0; i<$nodecnt; i++))
do
cat /sys/devices/system/node/node${i}/meminfo > meminfo.$i.out
done
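# Status of every process and thread under /proc, plus their NUMA memory mappings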
more `find /proc -name status` > proc_parent_child_status.out
more /proc/*/numa_maps > numa_maps.out
#
# Get separate kernel and user perf-c2c stats
#
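# --ldlat=70 raises the PEBS load-latency sampling threshold to 70 cycles (the
# default is 30), so sampling concentrates on the slower loads that point at
# cacheline contention.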
perf c2c record -a --ldlat=70 --all-user -o perf_c2c_a_all_user.data sleep 5
perf c2c report --stdio -i perf_c2c_a_all_user.data > perf_c2c_a_all_user.out 2>&1
perf c2c report --full-symbols --stdio -i perf_c2c_a_all_user.data > perf_c2c_full-sym_a_all_user.out 2>&1
perf c2c record -g -a --ldlat=70 --all-user -o perf_c2c_g_a_all_user.data sleep 5
perf c2c report -g --stdio -i perf_c2c_g_a_all_user.data > perf_c2c_g_a_all_user.out 2>&1
perf c2c record -a --ldlat=70 --all-kernel -o perf_c2c_a_all_kernel.data sleep 4
perf c2c report --stdio -i perf_c2c_a_all_kernel.data > perf_c2c_a_all_kernel.out 2>&1
perf c2c record -g --ldlat=70 -a --all-kernel -o perf_c2c_g_a_all_kernel.data sleep 4
perf c2c report -g --stdio -i perf_c2c_g_a_all_kernel.data > perf_c2c_g_a_all_kernel.out 2>&1
#
# Get combined kernel and user perf-c2c stats
#
perf c2c record -a --ldlat=70 -o perf_c2c_a_both.data sleep 4
perf c2c report --stdio -i perf_c2c_a_both.data > perf_c2c_a_both.out 2>&1
perf c2c record -g --ldlat=70 -a -o perf_c2c_g_a_both.data sleep 4
perf c2c report -g --stdio -i perf_c2c_g_a_both.data > perf_c2c_g_a_both.out 2>&1
#
# Get all-user physical addr stats, in case multiple threads or processes are
# accessing shared memory with different vaddrs.
#
perf c2c record --phys-data -a --ldlat=70 --all-user -o perf_c2c_a_all_user_phys_data.data sleep 5
perf c2c report --stdio -i perf_c2c_a_all_user_phys_data.data > perf_c2c_a_all_user_phys_data.out 2>&1
--
Loïc Dachary, Artisan Logiciel Libre