I have a Linux machine dedicated to running some parallel computation, and I'm trying to understand how to choose/tune the scheduler, and perhaps other parameters, to extract the most performance. (This is deployed on AWS, so there is also some choice of which Linux distribution to use, if that matters.)
I've implemented the computation in Java because there are some subtle dependencies between different parts of the computation (there are about 5K "tasks" in all, but a task typically requires information from other tasks at multiple points in its execution). I want to consider two implementations.
Current Implementation
In the current implementation, the number of threads is equal to the number of cores, and each thread picks up a task which is not waiting for any info, works on it until it is halted by some missing info, at which point it drops the task and picks up another. This continues until the computation is finished.
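For concreteness, here is a stripped-down sketch of that scheme (the names Task, runUntilBlocked, etc. are made up for illustration; my real code also handles how a stalled task gets back onto the ready queue once its missing info has been produced, and how everything terminates):

```
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Rough sketch of the current scheme, with placeholder names.
class WorkerPoolSketch {

    interface Task {
        // Run until finished or until blocked on missing info;
        // return true only if the task completed.
        boolean runUntilBlocked();
    }

    private final BlockingQueue<Task> ready = new LinkedBlockingQueue<>();

    void start() {
        int nThreads = Runtime.getRuntime().availableProcessors(); // threads == cores
        for (int i = 0; i < nThreads; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        Task t = ready.take();            // pick up a runnable task
                        if (!t.runUntilBlocked()) {
                            // Task stalled on missing info: it is dropped here and
                            // re-enters 'ready' later, once that info is available.
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();   // shut the worker down
                }
            });
            worker.start();
        }
    }
}
```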
Here, it seems desirable that each CPU core should be bound to a single thread throughout. Should I "tell" the scheduler not to perform any timeslicing, or will this occur naturally? How do I make sure?
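(In case it matters: as far as I know, pure Java has no standard API for pinning a thread to a core, so if pinning turns out to help I would presumably either launch the JVM under taskset or use a library. The snippet below is only a hypothetical sketch using the third-party OpenHFT Java-Thread-Affinity library, which I have not benchmarked.)

```
import net.openhft.affinity.AffinityLock;

// Hypothetical sketch only: pin a worker thread to its own core using the
// third-party OpenHFT Java-Thread-Affinity library (not something I use today).
class PinnedWorkerSketch {
    static Runnable pinned(Runnable workerLoop) {
        return () -> {
            AffinityLock lock = AffinityLock.acquireLock(); // reserve a free CPU for this thread
            try {
                workerLoop.run();                           // e.g. the worker loop sketched above
            } finally {
                lock.release();                             // give the CPU back when the worker exits
            }
        };
    }
}
```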
Another Possible Implementation
I may change this so that each "task" has its own thread, using Java's wait() and notify() paradigm instead of picking up and dropping compute tasks. Feel free to comment on the desirability of this change (having 5K tasks = threads on a 96-core machine, possibly fewer if I can speed things up). But more importantly, assuming I implement this:
How can I tell the scheduler to work with the largest possible time slices, except as mandated by the wait() and notify() calls? Will using Java's yield() help?
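To make the second idea concrete, here is roughly what I have in mind (again a simplified sketch with placeholder names such as ResultHolder): each task runs in its own thread, and when it needs a value from another task it wait()s until the producer notify()s.

```
// Simplified sketch with placeholder names: a task that needs another task's
// intermediate result wait()s on a shared holder object, and the producing
// task notifyAll()s once the result has been published.
class ResultHolder {
    private Object value;                  // null until the producing task publishes it

    synchronized Object await() throws InterruptedException {
        while (value == null) {            // guard against spurious wakeups
            wait();                        // releases the monitor and parks this thread
        }
        return value;
    }

    synchronized void publish(Object v) {
        value = v;
        notifyAll();                       // wake every task waiting on this result
    }
}
```

A consuming task would call holder.await() at the point where it needs the value (blocking that task's thread), and the producing task would call holder.publish(...) when the intermediate value is ready. With 5K such threads, most would be parked in wait() at any given time, which is part of why I'm asking how the scheduler treats the ones that are runnable.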
Some Related References
This answer has some useful background on scheduling, and references this and this for some more tunable parameters. The latter specifically mentions queue contention, which I've also noticed when trying to scale up the number of processors in the "current implementation" above.
ADDENDUM: this seems to say that Unix (and Linux?) does not time slice at all, and that the only way a thread is interrupted is if it is "preempted" by a higher-priority thread or initiates some blocking operation. Is this really the case?
Many thanks!
Comments

Have a look at the batch scheduler. It runs at a lower priority than the normal scheduler (this won't matter, unless you have a lot running in the normal scheduler), but uses slower pre-emption ticks. [I have no experience of bench-marking it, hence just a comment.] – ctrl-alt-delor Jan 03 '21 at 13:41

You mean set_scheduler with SCHED_BATCH? – Just Me Jan 03 '21 at 14:05

Or use schedtool (keep deployment separate). Some of the other tuning from other posts, e.g. slower scheduling for server, may also help. – ctrl-alt-delor Jan 03 '21 at 21:33

I would not expect much gain from tuning (e.g. using taskset to keep the process on a single CPU, thus avoiding cache misses) as long as you run 1 thread per CPU core and all CPU cores show up as 100% busy. If you do get more than 1% speedup, I hope you will share the measurements. – Ole Tange Jan 06 '21 at 13:27