I tried to run program X using 8 threads and it finished in n minutes.
I tried to run the same program using 50 threads and it finished in n*10 minutes.
Why does this happen, and how can I find the optimal number of threads to use?
"Why does this happen?" is kind of easy to answer. Imagine you have a corridor that you can fit four people down, side by side. You want to move all the rubbish at one end, to the other end. The most efficient number of people is 4.
If you have 1-3 people then you're missing out on using some corridor space. If you have 5 or more people, then at least one of those people is basically stuck queueing behind another person all the time. Adding more and more people just clogs up the corridor; it doesn't speed up the activity.
So you want to have as many people as you can fit in without causing any queueing. Why you have queueing (or bottlenecks) depends on the questions in slm's answer.
This is a complicated question you're asking. Without knowing more about the nature of your threads it's difficult to say. Some things to consider when diagnosing system performance:
Is the process/thread CPU bound (needs lots of CPU resources), memory bound (needs lots of RAM resources), or I/O bound (needs lots of network and/or hard-drive resources)?
All three of these resources are finite and any one of them can limit the performance of a system. You need to determine which of them (it might be 2 or 3 together) your particular situation is consuming.
You can use ntop, iostat, and vmstat to diagnose what's going on.
A common recommendation is n+1 threads, n being the number of CPU cores available. That way n threads can work the CPU while 1 thread is waiting for disk I/O. Having fewer threads would not fully utilize the CPU resource (at some point there will always be I/O to wait for), having more threads would cause threads fighting over the CPU resource.
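As a rough sketch of how you might apply that rule of thumb (this is a generic example, not anything from program X; the variable names are invented), you can ask the runtime how many hardware threads the machine offers and size the pool from that:

#include <iostream>
#include <thread>

int main() {
    // hardware_concurrency() may return 0 if the value is unknown, so guard it.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;

    // The "n + 1" rule of thumb: one thread per core, plus one to cover I/O waits.
    unsigned pool_size = cores + 1;

    std::cout << "cores reported: " << cores
              << ", suggested pool size: " << pool_size << "\n";
}

For heavily I/O-bound work the optimal count can be higher than this, as discussed further down.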
Threads are not free; they come with overhead such as context switches and, if data has to be exchanged between threads (which is usually the case), various locking mechanisms. This is only worth the cost when you actually have more dedicated CPU cores to run code on. On a single-core CPU, a single process (no separate threads) is usually faster than any threaded version. Threads do not magically make your CPU go any faster; they just mean extra work.
As others have pointed out (slm's answer, EightBitTony's answer), this is a complicated question, all the more so since you do not describe what your threads do and how they do it.
But throwing in more threads can definitely make things worse.
In the field of parallel computing there is Amdahl's law, which may or may not apply here (you do not describe the details of your problem), but which gives some general insight about this class of problems.
The point of Amdahl's law is that in any program (in any algorithm) there is always a fraction that cannot be run in parallel (the sequential portion) and another fraction that can be run in parallel (the parallel portion). [Obviously these two portions add up to 100%.]
These portions can be expressed as percentages of execution time. For example, 25% of the time may be spent in strictly sequential operations, with the remaining 75% spent in operations that can be executed in parallel.
(Image from Wikipedia)
Amdahl's law predicts that for any given parallel portion (e.g. 75%) of a program, you can speed up execution only so far (e.g. at most 4 times), even if you use more and more processors to do the work.
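Written out (this is just the standard form of the law, with p the parallel fraction and N the number of processors, not anything specific to your program):

$$ S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p} $$

With p = 0.75 the limit is 1/0.25 = 4, which is where the "at most 4 times" figure comes from.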
As a rule of thumb, the larger the portion of your program that you cannot parallelize, the less you gain by using more execution units (processors).
Given that you are using threads (and not physical processors), the situation can be even worse than this. Remember that, depending on the implementation and the hardware available (number of CPUs/cores), threads may end up sharing the same physical processor/core (a form of multitasking, as pointed out in another answer).
This theoretical prediction (which concerns CPU time only) does not consider other practical bottlenecks, such as limited memory bandwidth and I/O, that can easily be the limiting factor in practical applications.
The culprit here is likely context switching: the process of saving the state of the current thread in order to start executing another thread. If a number of threads are given the same priority, they have to be switched in and out until they finish execution.
In your case, with 50 threads far more context switching takes place than with just 8 threads.
This time overhead introduced by context switching is what makes your program run slower.
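A minimal sketch of how you might observe oversubscription yourself (the function names, iteration count, and thread counts are invented for illustration, and the workload is assumed to be purely CPU-bound): split the same fixed amount of work across 8 threads and then across 50, and compare wall-clock times on a machine with roughly 8 cores.

#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// Purely CPU-bound busy work: spin through `iters` additions.
static void busy_work(std::uint64_t iters) {
    volatile std::uint64_t x = 0;           // volatile so the loop is not optimized away
    for (std::uint64_t i = 0; i < iters; ++i) x = x + i;
}

// Split a fixed total amount of work across `nthreads` threads and time it.
static double run_with(unsigned nthreads, std::uint64_t total_iters) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back(busy_work, total_iters / nthreads);  // split is approximate
    for (auto& th : pool) th.join();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    const std::uint64_t total = 2'000'000'000ULL;   // same total work in both runs
    std::cout << " 8 threads: " << run_with(8, total)  << " s\n";
    std::cout << "50 threads: " << run_with(50, total) << " s\n";
}

On pure, independent CPU-bound work like this the penalty from oversubscription is usually modest; slowdowns as large as the 10x in the question typically also involve locks, shared data, or I/O, as the other answers point out.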
(ps ax | wc -l reports 225 processes, and the system is by no means heavily loaded.) I'm inclined to go with @EightBitTony's guess; cache invalidation is likely a bigger issue, because every time you flush the cache, the CPU has to wait eons for code and data from RAM. – user Jun 24 '13 at 07:21
To fix EightBitTony's metaphor:
"Why does this happen?" is kind of easy to answer. Imagine you have two swimming pools, one full and one empty. You want to move all the water from one to the other, and have 4 buckets. The most efficient number of people is 4.
If you have 1-3 people then you're missing out on using some buckets. If you have 5 or more people, then at least one of those people is stuck waiting for a bucket. Adding more and more people ... doesn't speed up the activity.
So you want to have as many people as can do some work (use a bucket) simultaneously.
A person here is a thread, and a bucket represents whichever execution resource is the bottleneck. Adding more threads doesn't help if they can't do anything. Additionally, we should stress that passing a bucket from one person to another is typically slower than a single person just carrying the bucket the same distance. That is, two threads taking turns on a core typically accomplish less work than a single thread running twice as long: this is because of the extra work done to switch between the two threads.
Whether the limiting execution resource (bucket) is, for your purposes, a CPU, a core, or a hyper-threaded instruction pipeline depends on which part of the architecture is your limiting factor. Note also that we're assuming the threads are entirely independent: that is only the case if they share no data (and avoid any cache collisions).
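To illustrate the "cache collisions" caveat, here is a small sketch of false sharing (the struct and variable names are invented for the example): two threads increment two different counters, but when the counters sit on the same cache line the threads keep invalidating each other's cache, so the "independent" threads run much slower than when the counters are padded onto separate cache lines.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

constexpr std::uint64_t kIters = 50'000'000;

struct Shared {                       // both counters (very likely) on the same cache line
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

struct Padded {                       // each counter on its own 64-byte cache line
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Run two threads, each hammering its own counter, and return elapsed seconds.
template <typename Counters>
double time_two_threads(Counters& c) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (std::uint64_t i = 0; i < kIters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (std::uint64_t i = 0; i < kIters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    Shared s;
    Padded p;
    std::cout << "same cache line:      " << time_two_threads(s) << " s\n";
    std::cout << "separate cache lines: " << time_two_threads(p) << " s\n";
}

The threads never touch each other's counter, yet on typical hardware the padded version is noticeably faster: threads that are logically independent are not necessarily independent at the cache level.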
As a couple of people have suggested, for I/O the limiting resource might be the number of usefully queueable I/O operations: this could depend on a whole host of hardware and kernel factors, but could easily be much larger than the number of cores. Here the context switch, which is so costly for execute-bound code, is pretty cheap compared with the I/O the thread is waiting on. Sadly I think the metaphor will get completely out of control if I try to justify this with buckets.
Note that the optimal behaviour with I/O bound code is typically still to have at most one thread per pipeline/core/CPU. However, you have to write asynchronous or synchronous/non-blocking I/O code, and the relatively small performance improvement won't always justify the extra complexity.
PS. My problem with the original corridor metaphor is that it strongly suggests you should be able to have 4 queues of people, with 2 queues carrying rubbish and 2 returning to collect more. Then you can make each queue almost as long as the corridor, and adding people does speed up the algorithm (you've basically turned the whole corridor into a conveyor belt).
In fact this scenario is very similar to the standard description of the relationship between latency and window size in TCP networking, which is why it jumped out at me.
It is pretty straightforward and simple to understand. Having more threads than your CPU supports means you are actually serializing, not parallelizing. The more threads you have, the slower your system will be. Your results are actually a proof of this phenomenon.
4 is the best number. – Bakuriu Jun 23 '13 at 18:34