I tried to run program X using 8 threads and it finished in n minutes.
I tried to run the same program using 50 threads and it finished in n*10 minutes.
Why does this happen, and how can I find the optimal number of threads to use?
"Why does this happen?" is kind of easy to answer. Imagine you have a corridor that you can fit four people down, side by side. You want to move all the rubbish at one end, to the other end. The most efficient number of people is 4.
If you have 1-3 people then you're missing out on using some corridor space. If you have 5 or more people, then at least one of those people is basically stuck queueing behind another person all the time. Adding more and more people just clogs up the corridor; it doesn't speed up the activity.
So you want to have as many people as you can fit in without causing any queueing. Why you have queueing (or bottlenecks) depends on the questions in slm's answer.
This is a complicated question you're asking. Without knowing more about the nature of your threads it's difficult to say. Some things to consider when diagnosing system performance:
Is the process/thread CPU bound (needs lots of CPU resources), memory bound (needs lots of RAM resources), or I/O bound (needs lots of network and/or hard-drive resources)?
All three of these resources are finite and any one of them can limit the performance of a system. You need to determine which of them (it might be 2 or 3 together) your particular situation is consuming.
You can use ntop, iostat, and vmstat to diagnose what's going on.
A common recommendation is n+1 threads, n being the number of CPU cores available. That way n threads can work the CPU while 1 thread is waiting for disk I/O. Having fewer threads would not fully utilize the CPU resource (at some point there will always be I/O to wait for), having more threads would cause threads fighting over the CPU resource.
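As a rough sketch of how you might apply that rule of thumb (this is a generic example, not anything from program X; the variable names are invented), you can ask the runtime how many hardware threads the machine offers and size the pool from that:

#include <iostream>
#include <thread>

int main() {
    // hardware_concurrency() may return 0 if the value is unknown, so guard it.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;

    // The "n + 1" rule of thumb: one thread per core, plus one to cover I/O waits.
    unsigned pool_size = cores + 1;

    std::cout << "cores reported: " << cores
              << ", suggested pool size: " << pool_size << "\n";
}

For heavily I/O-bound work the optimal count can be higher than this, as discussed further down.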
Threads are not free; they come with overhead such as context switches and, if data has to be exchanged between threads (which is usually the case), various locking mechanisms. This is only worth the cost when you actually have more dedicated CPU cores to run code on. On a single-core CPU, a single process (no separate threads) is usually faster than any threaded version. Threads do not magically make your CPU go any faster; they just mean extra work.
As others have pointed out (slm's answer, EightBitTony's answer), this is a complicated question, all the more so since you do not describe what your threads do and how they do it.
But throwing in more threads can definitely make things worse.
In the field of parallel computing there is Amdahl's law, which may or may not apply here (you do not describe the details of your problem), but which gives some general insight about this class of problems.
The point of Amdahl's law is that in any program (in any algorithm) there is always a fraction that cannot be run in parallel (the sequential portion) and another fraction that can be run in parallel (the parallel portion). [Obviously these two portions add up to 100%.]
These portions can be expressed as percentages of execution time. For example, 25% of the time may be spent in strictly sequential operations, with the remaining 75% spent in operations that can be executed in parallel.
(Image from Wikipedia)
Amdahl's law predicts that for any given parallel portion (e.g. 75%) of a program, you can speed up execution only so far (e.g. at most 4 times), even if you use more and more processors to do the work.
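Written out (this is just the standard form of the law, with p the parallel fraction and N the number of processors, not anything specific to your program):

$$ S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p} $$

With p = 0.75 the limit is 1/0.25 = 4, which is where the "at most 4 times" figure comes from.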
As a rule of thumb, the larger the portion of your program that you cannot parallelize, the less you gain by using more execution units (processors).
Given that you are using threads (and not physical processors), the situation can be even worse than this. Remember that, depending on the implementation and the hardware available (number of CPUs/cores), threads may end up sharing the same physical processor/core (a form of multitasking, as pointed out in another answer).
This theoretical prediction (which concerns CPU time only) does not consider other practical bottlenecks, such as limited memory bandwidth and I/O, that can easily be the limiting factor in practical applications.
The culprit here is likely context switching: the process of saving the state of the current thread in order to start executing another thread. If a number of threads are given the same priority, they have to be switched in and out until they finish execution.
In your case, with 50 threads far more context switching takes place than with just 8 threads.
This time overhead introduced by context switching is what makes your program run slower.
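A minimal sketch of how you might observe oversubscription yourself (the function names, iteration count, and thread counts are invented for illustration, and the workload is assumed to be purely CPU-bound): split the same fixed amount of work across 8 threads and then across 50, and compare wall-clock times on a machine with roughly 8 cores.

#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// Purely CPU-bound busy work: spin through `iters` additions.
static void busy_work(std::uint64_t iters) {
    volatile std::uint64_t x = 0;           // volatile so the loop is not optimized away
    for (std::uint64_t i = 0; i < iters; ++i) x = x + i;
}

// Split a fixed total amount of work across `nthreads` threads and time it.
static double run_with(unsigned nthreads, std::uint64_t total_iters) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back(busy_work, total_iters / nthreads);  // split is approximate
    for (auto& th : pool) th.join();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    const std::uint64_t total = 2'000'000'000ULL;   // same total work in both runs
    std::cout << " 8 threads: " << run_with(8, total)  << " s\n";
    std::cout << "50 threads: " << run_with(50, total) << " s\n";
}

On pure, independent CPU-bound work like this the penalty from oversubscription is usually modest; slowdowns as large as the 10x in the question typically also involve locks, shared data, or I/O, as the other answers point out.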
(ps ax | wc -l reports 225 processes, and the system is by no means heavily loaded.) I'm inclined to go with @EightBitTony's guess; cache invalidation is likely a bigger issue, because every time you flush the cache, the CPU has to wait eons for code and data from RAM. – user Jun 24 '13 at 07:21
To fix EightBitTony's metaphor:
"Why does this happen?" is kind of easy to answer. Imagine you have two swimming pools, one full and one empty. You want to move all the water from one to the other, and have 4 buckets. The most efficient number of people is 4.
If you have 1-3 people then you're missing out on using some buckets. If you have 5 or more people, then at least one of those people is stuck waiting for a bucket. Adding more and more people ... doesn't speed up the activity.
So you want to have as many people as can do some work (use a bucket) simultaneously.
A person here is a thread, and a bucket represents whichever execution resource is the bottleneck. Adding more threads doesn't help if they can't do anything. Additionally, we should stress that passing a bucket from one person to another is typically slower than a single person just carrying the bucket the same distance. That is, two threads taking turns on a core typically accomplish less work than a single thread running twice as long: this is because of the extra work done to switch between the two threads.
Whether the limiting execution resource (bucket) is, for your purposes, a CPU, a core, or a hyper-threaded instruction pipeline depends on which part of the architecture is your limiting factor. Note also that we're assuming the threads are entirely independent: that is only the case if they share no data (and avoid any cache collisions).
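To illustrate the "cache collisions" caveat, here is a small sketch of false sharing (the struct and variable names are invented for the example): two threads increment two different counters, but when the counters sit on the same cache line the threads keep invalidating each other's cache, so the "independent" threads run much slower than when the counters are padded onto separate cache lines.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

constexpr std::uint64_t kIters = 50'000'000;

struct Shared {                       // both counters (very likely) on the same cache line
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

struct Padded {                       // each counter on its own 64-byte cache line
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Run two threads, each hammering its own counter, and return elapsed seconds.
template <typename Counters>
double time_two_threads(Counters& c) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (std::uint64_t i = 0; i < kIters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (std::uint64_t i = 0; i < kIters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    Shared s;
    Padded p;
    std::cout << "same cache line:      " << time_two_threads(s) << " s\n";
    std::cout << "separate cache lines: " << time_two_threads(p) << " s\n";
}

The threads never touch each other's counter, yet on typical hardware the padded version is noticeably faster: threads that are logically independent are not necessarily independent at the cache level.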
As a couple of people have suggested, for I/O the limiting resource might be the number of usefully queueable I/O operations: this could depend on a whole host of hardware and kernel factors, but could easily be much larger than the number of cores. Here the context switch, which is so costly for execute-bound code, is pretty cheap compared with the I/O the thread is waiting on. Sadly I think the metaphor will get completely out of control if I try to justify this with buckets.
Note that the optimal behaviour with I/O bound code is typically still to have at most one thread per pipeline/core/CPU. However, you have to write asynchronous or synchronous/non-blocking I/O code, and the relatively small performance improvement won't always justify the extra complexity.
PS. My problem with the original corridor metaphor is that it strongly suggests you should be able to have 4 queues of people, with 2 queues carrying rubbish and 2 returning to collect more. Then you can make each queue almost as long as the corridor, and adding people does speed up the algorithm (you've basically turned the whole corridor into a conveyor belt).
In fact this scenario is very similar to the standard description of the relationship between latency and window size in TCP networking, which is why it jumped out at me.
It is pretty straightforward and simple to understand. Having more threads than your CPU supports means you are actually serializing, not parallelizing. The more threads you have, the slower your system will be. Your results are actually a proof of this phenomenon.
4 is the best number. – Bakuriu Jun 23 '13 at 18:34