I ran a shell pipeline under `perf stat`, using `taskset 0x1` to pin the whole pipeline to a single CPU. I know `taskset 0x1` had an effect, because it more than doubled the throughput of the pipeline. However, `perf stat` shows 0 context switches between the different processes of the pipeline.

So what exactly does `perf stat` mean by context switches?

I think I was interested in the number of context switches to/from the individual tasks in the pipeline. Is there a better way to measure that?
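One alternative sketch for the per-task numbers, assuming Linux's `/proc` interface: the kernel keeps per-process `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` counters in `/proc/<pid>/status`, which can be sampled for each process of the pipeline while it runs.

```shell
# Sketch (Linux-specific): sample a task's context-switch counters from
# /proc/<pid>/status while it is still running.
dd bs=1M if=/dev/zero of=/dev/null &    # stand-in workload
pid=$!
sleep 1                                 # let it accumulate some switches
grep ctxt_switches "/proc/$pid/status"  # voluntary_ctxt_switches / nonvoluntary_ctxt_switches
kill "$pid"
```

(With `kernel.perf_event_paranoid` set low enough, counting the kernel-level `context-switches` event, i.e. without the `:u` modifier, would be another option.)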
This was in the context of comparing `dd bs=1M </dev/zero` to `dd bs=1M </dev/zero | dd bs=1M >/dev/null`. If I can measure context switches as desired, I assume it would be useful in quantifying why the first version is several times more "efficient" than the second.
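To make that comparison reproducible without interrupting `dd` with `^C`, one sketch is to copy a fixed byte count through both variants and compare the rate each `dd` reports (the `count=2000` value here is just an illustrative choice):

```shell
# Fixed amount of data, so the two runs are directly comparable.
dd bs=1M count=2000 if=/dev/zero of=/dev/null                  # single-process version
sh -c 'dd bs=1M count=2000 </dev/zero | dd bs=1M >/dev/null'   # pipeline version
```

Each run prints its own "bytes copied ... GB/s" summary, so the ratio of the two rates quantifies the pipeline overhead directly.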
```
$ rpm -q perf
perf-4.15.0-300.fc27.x86_64
$ uname -r
4.15.17-300.fc27.x86_64

$ perf stat taskset 0x1 sh -c 'dd bs=1M </dev/zero | dd bs=1M >/dev/null'
^C18366+0 records in
18366+0 records out
19258146816 bytes (19 GB, 18 GiB) copied, 5.0566 s, 3.8 GB/s

 Performance counter stats for 'taskset 0x1 sh -c dd if=/dev/zero bs=1M | dd bs=1M of=/dev/null':

       5059.273255      task-clock:u (msec)       #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               414      page-faults:u             #    0.082 K/sec
        36,915,934      cycles:u                  #    0.007 GHz
         9,511,905      instructions:u            #    0.26  insn per cycle
         2,480,746      branches:u                #    0.490 M/sec
           188,295      branch-misses:u           #    7.59% of all branches

       5.061473119 seconds time elapsed

$ perf stat sh -c 'dd bs=1M </dev/zero | dd bs=1M >/dev/null'
^C6637+0 records in
6636+0 records out
6958350336 bytes (7.0 GB, 6.5 GiB) copied, 4.04907 s, 1.7 GB/s
6636+0 records in
6636+0 records out
6958350336 bytes (7.0 GB, 6.5 GiB) copied, 4.0492 s, 1.7 GB/s
sh: Interrupt

 Performance counter stats for 'sh -c dd if=/dev/zero bs=1M | dd bs=1M of=/dev/null':

       3560.269345      task-clock:u (msec)       #    0.878 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               355      page-faults:u             #    0.100 K/sec
        32,302,387      cycles:u                  #    0.009 GHz
         4,823,855      instructions:u            #    0.15  insn per cycle
         1,167,126      branches:u                #    0.328 M/sec
            88,982      branch-misses:u           #    7.62% of all branches

       4.052844128 seconds time elapsed
```
Comments:

- With `kernel.perf_event_paranoid = 0`, you don't need to be root to count context switches, or to profile `cycles` instead of `cycles:u` (user-space only), or any other event. See "Run perf without root-rights". – Peter Cordes Apr 22 '18 at 10:29
- It might also be worth testing whether pinning both `dd` processes to the same core is better than pinning them to different hyperthreads of the same physical core (check `/proc/cpuinfo` to see whether `core id` goes 0 0 1 1 ..., or whether it goes all the way up and wraps around like my desktop, so core 1 and core 5 share the same physical core). – Peter Cordes Apr 22 '18 at 10:33
- Maybe every time `read` blocks it counts a context switch or something. I'm getting non-zero counts for `machine_clears.memory_ordering`, at a similar rate to context switches. It's probably not significant or the cause of most of the perf cost. – Peter Cordes Apr 22 '18 at 10:37