
I have a Linux machine dedicated to running a parallel computation, and I'm trying to understand how to choose/tune the scheduler, and perhaps other parameters, to extract the most performance (this is deployed on AWS, so there's also some choice of which Linux distribution to use, if that matters).

I've implemented the computation in Java because there are some subtle dependencies between different parts of the computation (there are about 5K "tasks" in all, but a task typically requires information from other tasks at multiple points in its execution). I want to consider two implementations.

Current Implementation

In the current implementation, the number of threads is equal to the number of cores, and each thread picks up a task that is not waiting for any information, works on it until it is halted by some missing piece of information, at which point it drops that task and picks up another. This continues until the computation is finished.
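Schematically, it looks something like this (a simplified sketch only; class and method names are placeholders, not my real code):

    import java.util.concurrent.*;

    // Sketch of the current setup: one worker thread per core pulls runnable tasks
    // off a queue; a task that stalls on missing information is handed back.
    // (In the real code a stalled task would go into a "waiting" set until its
    // inputs arrive; that bookkeeping is omitted here.)
    class CurrentImplementationSketch {

        interface Task {
            /** Work until finished or until some required info is missing.
             *  @return true if finished, false if stalled. */
            boolean runUntilStalled();
        }

        private final BlockingQueue<Task> ready = new LinkedBlockingQueue<>();

        void submit(Task t) {
            ready.add(t);
        }

        void start(int nCores) {
            ExecutorService pool = Executors.newFixedThreadPool(nCores);
            for (int i = 0; i < nCores; i++) {
                pool.execute(() -> {
                    try {
                        while (true) {
                            Task t = ready.take();          // pick up a task that is not waiting
                            if (!t.runUntilStalled()) {
                                ready.put(t);               // drop it and pick up another
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // worker shut down
                    }
                });
            }
        }
    }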

Here, it seems desirable that each CPU should be bound to a single thread throughout. Should I "tell" the scheduler not to perform any timeslicing, or will this occur naturally? How do I make sure?

Another Possible Implementation

I may change this so that each "task" has its own thread, using Java's wait() and notify() paradigm instead of picking up and dropping compute tasks. Feel free to comment on the desirability of this change (having 5K tasks = threads on a 96-core machine, possibly fewer if I can speed things up). But more importantly, assuming I implement this: how can I tell the scheduler to work with the largest possible time slices, except as mandated by the wait() and notify() calls? Will using Java's yield() help?
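For concreteness, a minimal sketch of what I have in mind (names are placeholders): each task runs on its own thread and blocks in wait() until the information it depends on has been published via notifyAll().

    // Minimal sketch of the one-thread-per-task alternative.
    class PerTaskThreadSketch {

        static class InfoSlot {
            private Object value;

            synchronized void publish(Object v) {
                value = v;
                notifyAll();                    // wake every task waiting on this slot
            }

            synchronized Object await() throws InterruptedException {
                while (value == null) {
                    wait();                     // releases the monitor while blocked
                }
                return value;
            }
        }

        public static void main(String[] args) {
            InfoSlot slot = new InfoSlot();

            Thread consumer = new Thread(() -> {
                try {
                    Object info = slot.await(); // task blocks here until notified
                    System.out.println("got " + info);
                } catch (InterruptedException ignored) { }
            });
            consumer.start();

            slot.publish(42);                   // some other task produces the info
        }
    }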

Some Related References

This answer has some useful background on scheduling, and references this and this for some more tunable parameters. The latter specifically mentions a queue-contention issue, which I've also noticed when trying to scale up the number of processors in the "current implementation" above.

ADDENDUM: this seems to say that Unix (and Linux?) does not time-slice at all, and that the only way a thread is interrupted is if it is "preempted" by a higher-priority thread or initiates some blocking operation. Is this really the case?

Many thanks!

Just Me
  • Consider the batch scheduler. It runs at a lower priority than the normal scheduler (this won't matter unless you have a lot running in the normal scheduler), but uses slower pre-emption ticks. [I have no experience of benchmarking it, hence just a comment.] – ctrl-alt-delor Jan 03 '21 at 13:41
  • Thanks! Just to clarify - do you mean I should use sched_setscheduler() with SCHED_BATCH? – Just Me Jan 03 '21 at 14:05
  • and also: how do I actually tune these parameters? I can't seem to find if/how I can do this from the command line. Or is this something that I need to change through the JVM? – Just Me Jan 03 '21 at 16:02
  • That looks like it will set the scheduler to batch. However, I always use schedtool (to keep deployment separate). Some of the other tuning from the other posts, e.g. slower scheduling for servers, may also help. – ctrl-alt-delor Jan 03 '21 at 21:33
  • I have never been able to get more than 1% extra performance by fiddling with the scheduler (e.g. by using taskset to keep the process on a single CPU thus avoiding cache misses) as long as you run 1 thread per CPU core and all CPU cores show up as 100% busy. If you do get more than 1% speedup, I hope you will share the measurements. – Ole Tange Jan 06 '21 at 13:27
  • for what it's worth @Ole, neither have I. – Just Me Jan 07 '21 at 11:09

1 Answer


I'm not sure that a 1:1 threading model is guaranteed in Java, i.e. that a Java thread always maps to its own OS thread. It could very well be that several Java threads are managed by the Java runtime environment and presented to the OS as a smaller number of threads; it's really implementation-specific. I'd recommend a lower-level language (C or Rust) if you need to be sure of the 1:1 relationship.
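As a rough, Linux-only sanity check of what you actually get, you can compare the thread count the JVM reports with the number of kernel tasks under /proc/self/task (the kernel count also includes JVM-internal threads such as GC and JIT workers, so expect it to be somewhat larger):

    import java.io.IOException;
    import java.nio.file.*;

    // Linux-only sanity check: compare the threads the JVM reports with the
    // native tasks the kernel sees for this process.
    public class ThreadModelCheck {
        public static void main(String[] args) throws IOException {
            int javaThreads = Thread.getAllStackTraces().size();
            long nativeTasks;
            try (var tasks = Files.list(Path.of("/proc/self/task"))) {
                nativeTasks = tasks.count();
            }
            System.out.println("Java threads : " + javaThreads);
            System.out.println("Kernel tasks : " + nativeTasks);
        }
    }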

Since time slicing is important to you, I also recommend choosing a realtime scheduling policy with sched_setscheduler() (setpriority() only adjusts the nice value), which gives you access to some interesting policies. If you want to avoid time-slicing, use the SCHED_FIFO policy. It ensures a thread is not preempted until it finishes or blocks, unless a higher-priority thread becomes runnable. If all 5K threads have the same priority, it really is a first-in-first-out arrangement with minimal context switching. See sched(7) for details.
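As an illustration only (not a full recipe): the policy is normally applied to the whole process, either by launching the JVM under util-linux's chrt (e.g. chrt -f 10 java -jar yourapp.jar) or, as sketched below, by shelling out to chrt from inside it. The priority value 10 is arbitrary, and this needs root or CAP_SYS_NICE.

    // Illustrative sketch: apply SCHED_FIFO to the current JVM by shelling out
    // to util-linux's chrt. Requires root or CAP_SYS_NICE; priority 10 is arbitrary.
    public class FifoSelf {
        public static void main(String[] args) throws Exception {
            long pid = ProcessHandle.current().pid();
            Process p = new ProcessBuilder("chrt", "-f", "-p", "10", Long.toString(pid))
                    .inheritIO()
                    .start();
            int rc = p.waitFor();
            System.out.println("chrt exit code: " + rc);
        }
    }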

Your threads will run uninterrupted until they finish or block, for example on a mutex or on a read() of a file descriptor used for cross-thread communication. At that point the blocked thread gives up the CPU and another thread has the opportunity to run.

Therefore I think your idea of going with 5K threads and simply letting them queue up is a good one.

One trap is going with the -rt kernel. It provides fully pre-emptive scheduling, which improves the precision of thread wake-up times at the expense of throughput. With such a low-level question, I assume you are trying to maximize performance, so this won't work for you.

Stewart