
I ran a shell pipeline under perf stat, using taskset 0x1 to pin the whole pipeline to a single CPU. I know taskset 0x1 had an effect, because it more than doubled the throughput of the pipeline. However, perf stat shows 0 context switches between the different processes of the pipeline.

So what exactly does perf stat mean by context switches?

What I'm really interested in is the number of context switches to/from the individual tasks in the pipeline. Is there a better way to measure that?

This was in the context of comparing dd bs=1M </dev/zero to dd bs=1M </dev/zero | dd bs=1M >/dev/null. If I can measure the context switches as desired, I assume it would be useful in quantifying why the first version is several times more "efficient" than the second.
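(One thing I'm aware of, though I haven't relied on it here: the kernel keeps per-task counters in /proc/<pid>/status. A minimal sketch for a background dd:)

$ dd bs=1M </dev/zero >/dev/null &
$ grep ctxt /proc/$!/status    # prints voluntary_ctxt_switches and nonvoluntary_ctxt_switches
$ kill %1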

$ rpm -q perf
perf-4.15.0-300.fc27.x86_64
$ uname -r
4.15.17-300.fc27.x86_64

$ perf stat taskset 0x1 sh -c 'dd bs=1M </dev/zero | dd bs=1M >/dev/null'
^C18366+0 records in
18366+0 records out
19258146816 bytes (19 GB, 18 GiB) copied, 5.0566 s, 3.8 GB/s

 Performance counter stats for 'taskset 0x1 sh -c dd if=/dev/zero bs=1M | dd bs=1M of=/dev/null':

       5059.273255      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               414      page-faults:u             #    0.082 K/sec                  
        36,915,934      cycles:u                  #    0.007 GHz                    
         9,511,905      instructions:u            #    0.26  insn per cycle         
         2,480,746      branches:u                #    0.490 M/sec                  
           188,295      branch-misses:u           #    7.59% of all branches        

       5.061473119 seconds time elapsed

$ perf stat sh -c 'dd bs=1M </dev/zero | dd bs=1M >/dev/null'
^C6637+0 records in
6636+0 records out
6958350336 bytes (7.0 GB, 6.5 GiB) copied, 4.04907 s, 1.7 GB/s
6636+0 records in
6636+0 records out
6958350336 bytes (7.0 GB, 6.5 GiB) copied, 4.0492 s, 1.7 GB/s
sh: Interrupt

 Performance counter stats for 'sh -c dd if=/dev/zero bs=1M | dd bs=1M of=/dev/null':

       3560.269345      task-clock:u (msec)       #    0.878 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               355      page-faults:u             #    0.100 K/sec                  
        32,302,387      cycles:u                  #    0.009 GHz                    
         4,823,855      instructions:u            #    0.15  insn per cycle         
         1,167,126      branches:u                #    0.328 M/sec                  
            88,982      branch-misses:u           #    7.62% of all branches        

       4.052844128 seconds time elapsed
– sourcejedi

1 Answer


perf was silently failing to count context switches because it was not run as root. Note the ":u" suffix on every event in the output above: only user-space activity was being counted, and the switch itself happens inside the kernel, which presumably is why the count stays at zero.
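(As a comment below notes, lowering the kernel.perf_event_paranoid sysctl is an alternative to running perf as root. A minimal sketch, assuming a value of 0 is permissive enough on this kernel:)

$ sysctl kernel.perf_event_paranoid            # 2 or higher limits unprivileged perf to user-space events
$ sudo sysctl kernel.perf_event_paranoid=0     # temporary; does not survive a reboot
$ perf stat taskset 0x1 sh -c 'dd bs=1M </dev/zero | dd bs=1M >/dev/null'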

(Linux pipes have a 64 KiB buffer. In both runs below, you can see very close to 2 context switches per 64 KiB transferred. I'm not exactly sure how that works out, but I suspect it is only counting context switches away from a dd, either to the other dd or to the idle task for that CPU.)

$ sudo perf stat taskset 0x1 sh -c 'dd bs=1M </dev/zero|dd bs=1M >/dev/null'
^C14508+0 records in
14507+0 records out
15211692032 bytes (15 GB, 14 GiB) copied, 3.87098 s, 3.9 GB/s
14508+0 records in
14508+0 records out
15212740608 bytes (15 GB, 14 GiB) copied, 3.87044 s, 3.9 GB/s
taskset: Interrupt

 Performance counter stats for 'taskset 0x1 sh -c dd bs=1M </dev/zero|dd bs=1M >/dev/null':

       3872.597645      task-clock (msec)         #    1.000 CPUs utilized          
           464,325      context-switches          #    0.120 M/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               928      page-faults               #    0.240 K/sec                  
    11,099,016,844      cycles                    #    2.866 GHz                    
    13,765,220,898      instructions              #    1.24  insn per cycle         
     3,053,464,009      branches                  #  788.480 M/sec                  
        15,462,959      branch-misses             #    0.51% of all branches        

       3.874121023 seconds time elapsed

$ echo $((15212740608 / 464325))
32763

$ sudo perf stat sh -c 'dd bs=1M </dev/zero|dd bs=1M >/dev/null'
^C7031+0 records in
7031+0 records out
7032+0 records in
7031+0 records out
7372537856 bytes (7.4 GB, 6.9 GiB) copied, 4.27436 s, 1.7 GB/s
7372537856 bytes (7.4 GB, 6.9 GiB) copied, 4.27414 s, 1.7 GB/s

sh: Interrupt

 Performance counter stats for 'sh -c dd bs=1M </dev/zero|dd bs=1M >/dev/null':

       3736.056509      task-clock (msec)         #    0.873 CPUs utilized          
           218,047      context-switches          #    0.058 M/sec                  
               206      cpu-migrations            #    0.055 K/sec                  
               877      page-faults               #    0.235 K/sec                  
     8,328,413,541      cycles                    #    2.229 GHz                    
     7,617,859,285      instructions              #    0.91  insn per cycle         
     1,671,904,009      branches                  #  447.505 M/sec                  
        13,827,669      branch-misses             #    0.83% of all branches        

       4.277591869 seconds time elapsed

$ echo $((7372537856 / 218047))
33811
– sourcejedi

  • With sysctl kernel.perf_event_paranoid = 0, you don't need to be root to count context switches, or to profile cycles instead of cycles:u (user-space only), or any other event. Run perf without root-rights – Peter Cordes Apr 22 '18 at 10:29
  • Interesting idea to pin to a single core. Somewhat surprisingly, pinning both dd processes to the same logical CPU is better than pinning them to different hyperthreads of the same physical core (check /proc/cpuinfo to see whether "core id" goes 0 0 1 1 ..., or whether it goes all the way up and wraps around like my desktop, so CPU 1 and CPU 5 share the same physical core); see the sketch after these comments. – Peter Cordes Apr 22 '18 at 10:33
  • We also get high context switches when they're on different cores; presumably every time read blocks it counts a context switch or something. I'm getting non-zero counts for machine_clears.memory_ordering, at a similar rate to context switches. It's probably not significant or the cause of most of the perf cost. – Peter Cordes Apr 22 '18 at 10:37
  • Exactly. Though commits suggest that switching to/from the idle process can, and now does, actually avoid switching page tables; this is described as "lazy tlb mode". Pinning seemed an obvious way to simulate a single CPU, where I knew that context switches would be required - so that if perf still showed 0 I could know for sure it was lying. I don't know if you're aware, but the Linux scheduler is actually supposed to heuristically consider keeping pipelines together on the same CPU. I had not thought about the HyperThread case, that's so funny. – sourcejedi Apr 22 '18 at 10:49
  • From a quick look at that LWN article (thanks, BTW!), the heuristic they call "current" is to keep related tasks on the same socket (or group of cores that shares a last-level cache), not on the same core. So they're optimizing for LLC sharing, not per-core private L2/L1 sharing. Single-socket Intel systems only have one LLC domain. AMD Ryzen uses multiple LLC domains (called core clusters) within the same silicon die. (https://www.reddit.com/r/hardware/comments/6s1m80/why_did_amd_design_ryzen_as_2_quadcore_ccxs/) – Peter Cordes Apr 22 '18 at 11:00
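A sketch of the pinning variants discussed in the comments above. The CPU numbers are assumptions about one particular topology (read thread_siblings_list on the machine in question); these runs are not from the tests above:

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # e.g. "0,4" if CPUs 0 and 4 are hyperthreads of one core

# both dd processes on the same logical CPU (what taskset 0x1 achieves above)
$ taskset -c 0 dd bs=1M </dev/zero | taskset -c 0 dd bs=1M >/dev/null

# the two hyperthreads of one physical core (assuming CPUs 0 and 4 are siblings)
$ taskset -c 0 dd bs=1M </dev/zero | taskset -c 4 dd bs=1M >/dev/null

# two separate physical cores (assuming CPUs 0 and 1 are on different cores)
$ taskset -c 0 dd bs=1M </dev/zero | taskset -c 1 dd bs=1M >/dev/null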