7

I am running a cluster where each node has an Intel Xeon E5430; /proc/cpuinfo reports 8 cores. I am using C/C++ compiled with gcc 5.3.1 on Ubuntu 16.04 LTS.

Distributing my work to each node was the easy part. My question pertains to the process running on each node. How can I create 8 concurrent threads and guarantee that each one spawns on a separate core?

Fifteen years ago, when I was using a 32-processor SGI, the fork call took an integer argument that was the physical processor ID. Is there a similar call, in either fork or a threading API, that places a thread on a specific physical core?

Steve S
  • 71
  • Why not let the kernel do it? That's its job and it usually does it better than any programmer could. – Gilles 'SO- stop being evil' Jul 12 '16 at 23:24
  • See my analysis below (as comments to meuh). As you suggested, the difference between letting the kernel assign the threads and using affinity did not turn out to be significant enough to justify the overhead. Assigning threads to cores was slightly faster. In the 4-thread run, assigning the cores (0, 2, 4 & 6), the difference was 0.326s for an average job length of 26.63s. Running all 8 threads, the difference was 0.498s for an average job length of 30.22s. – Steve S Jul 13 '16 at 18:31
  • Setting CPU affinity can help when you want to reserve a CPU for a given realtime task, or to have a CPU handle only interrupts to reduce latency. In general the scheduler knows it is a good idea to keep a process on the CPU it was previously running on, since the L2 cache might still be valid, so there is already a natural CPU affinity. – meuh Jul 13 '16 at 18:47

1 Answer

7

It is safe to assume that this will happen by default; however, you can explicitly set the CPU affinity (a bitmask of the set of CPUs a task may run on) for a process with sched_setaffinity(), or for individual pthreads with pthread_setaffinity_np(). The command-line equivalent is taskset. These are Linux- and GNU-specific.
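For example, a minimal sketch that binds thread i to CPU i right after creating it (the NTHREADS count and the trivial worker function are just placeholders for illustration):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    #define NTHREADS 8   /* assumption: one thread per core reported by /proc/cpuinfo */

    static void *worker(void *arg)
    {
        long id = (long)arg;
        /* real work would go here; sched_getcpu() reports where the thread landed */
        printf("thread %ld running on cpu %d\n", id, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        for (long i = 0; i < NTHREADS; i++) {
            int rc = pthread_create(&tid[i], NULL, worker, (void *)i);
            if (rc != 0) {
                fprintf(stderr, "pthread_create: %s\n", strerror(rc));
                return 1;
            }

            /* build a mask containing only CPU i and bind thread i to it */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(i, &set);
            rc = pthread_setaffinity_np(tid[i], sizeof(set), &set);
            if (rc != 0)
                fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
        }

        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        return 0;
    }

Compile with gcc -pthread. From the shell, something like taskset -c 0-7 ./prog restricts the whole process (and its threads) to the listed CPUs without touching the code.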

meuh
  • 51,383
  • Meuh, I just read the man page you suggested, and ran the sample code. I think this is exactly what I needed. Tonight I will attempt to generate some performance metrics 1) letting the thread scheduler distribute 8 threads and 2) forcing each of the eight threads to a separate core. -Thanks – Steve S Jul 12 '16 at 17:40
  • I took a numerical integration problem I did in grad school and ran it sequentially 4 times. Then I ran it inside 4 threads. The overall time for the sequential runs was 83.626s, averaging 20.9061s per job. The threaded version took 27.35s overall, with the average thread taking 26.7949s. When I used CPU affinity, the overall time was 26.497s, with an average job taking 26.4663s. The threads added significant overhead to the sequential jobs; however, when you divide by 4 it makes up for it. – Steve S Jul 13 '16 at 18:14
  • Next I ran it sequentially 8 times. Then I ran it inside 8 threads. The overall time for the sequential runs was 165.563s, averaging 20.6952s per job. The threaded version took 30.675s overall, with the average thread taking 30.4653s. When I used CPU affinity, the overall time was 30.089s, with an average job taking 29.9674s. – Steve S Jul 13 '16 at 18:21
  • Are you using floating point? If you have 4 real cores and 4 hyperthreaded CPUs, you may be sharing each floating-point unit between 2 CPUs. I don't know the Intel E5430 CPU architecture. – meuh Jul 13 '16 at 18:34
  • The conclusion (as detailed in my comment to Gilles, above), was that the improvement when using CPU affinity was not, in my opinion, significant enough to justify the coding overhead of measuring the load of each core and assigning threads manually. Thank you both for your comments. – Steve S Jul 13 '16 at 18:36
  • All math was in double precision floating point. Last night's run was on my Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz – Steve S Jul 13 '16 at 18:37