11

Syscalls (system calls) incur a performance penalty because of the isolation between kernel and user space, so it sounds like a good idea to reduce the number of syscalls a program makes.

So my idea is to pack several syscalls into a single one: place the syscalls and their arguments in a simple data structure in memory, then introduce a new syscall that takes this data structure. The kernel could then trigger all of the operations, potentially in parallel, and resume the thread once one (or all) of the syscalls has finished.
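Concretely, the in-memory structure might look something like this (a minimal sketch; the names and the "submit batch" syscall are invented for illustration and don't exist in any real kernel):

```c
/* Hypothetical layout for one entry in a batched-syscall request.
 * A program would fill an array of these and hand the whole array
 * to a single (imaginary) "submit batch" syscall. */
#include <stdint.h>

struct batched_call {
    uint32_t nr;       /* syscall number, e.g. __NR_read */
    uint64_t args[6];  /* up to six syscall arguments */
    int64_t  result;   /* filled in by the kernel on completion */
};
```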

I think this approach would be a good basis for concurrent programming (asynchronous I/O) and would improve on existing select/poll/epoll solutions by allowing concurrency on any syscall and reducing overall context switches.

Why is this not done?

Stephen Kitt
drahnoel
    research paper from a few years ago: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls measuring overhead of system calls (including reduced throughput after returning to user-space until caches and out-of-order exec recover on a Nehalem (2008ish) Intel CPU), and proposing a syscall batching mechanism. With Spectre + MDS mitigations adding overhead to system calls, and OoO exec being wider+deeper, they're even more expensive now. – Peter Cordes Feb 10 '23 at 20:05
  • VAX/VMS (and now OpenVMS) has had asynchronous IO (via Queued IO calls. and Asynchronous System Trap callbacks which inform you when your data is ready) for at least 35 years. – RonJohn Feb 11 '23 at 01:53
  • In addition to the examples given in the answers, the POSIX readv / writev family of syscalls comes to mind as something along those lines (and those are pretty old). – Sebastian Riese Feb 12 '23 at 16:46

3 Answers

25

This already exists. On Linux it’s implemented by io_uring, available since version 5.1 of the kernel (May 2019): operations are placed on a submission queue (or rather, ring buffer) and processed with few or no system calls, with their results going to a separate completion queue.

Stephen Kitt
    This phoronix article is dated February 2019... https://www.phoronix.com/news/Linux-io_uring-Fast-Efficient – aMike Feb 10 '23 at 02:39
21

The general concept is done and does exist. The closest example is io_uring on Linux as Stephen Kitt’s answer points out, but it is far from the only example of this type of interface. Windows, Solaris, AmigaOS, and a small handful of other operating systems, all have similar IO-oriented completion queue mechanisms that work similarly to io_uring (Linux is actually a bit late to the party here).

Additionally, there are quite a few system calls on UNIX-like systems that, while they do not work the way you are suggesting, avoid a lot of potential context switches by pushing into the kernel a task that would normally be done in userspace. The sendfile() system call is probably the best example: it takes a very common task (copying a large amount of data from one file descriptor to another) and pushes it entirely into kernel mode, avoiding the looping, the numerous context switches, and the extra buffers that doing it in userspace would require.

One key thing to understand here though is that for this to make sense, the cost of actually setting up everything associated with the relevant set of operations in bulk like this has to be less than the cost of just doing it the ‘normal’ way. Using io_uring only makes sense if you’re dealing with lots of IO, such as when emulating a block storage device for a VM (QEMU supports using it for this, and the performance difference even on fast host hardware is insane), or reading thousands of files once a second (the company I work for has recently started talking internally about possibly using io_uring for such workloads). Similarly, sendfile() only makes sense if you would need more than one read/write iteration to copy the data through userspace (though that’s usually a function of not being able to afford the buffer space in userspace, not that it’s faster to run a read/write iteration).

Additionally, the system call actually has to make sense in the context of batch processing. IO generally does make sense here, provided the processing preserves the ordering of the calls, but a lot of things just don’t. It would be silly to try to use this type of interface for exec(), for example (a combined fork and exec maybe, but not a plain exec). Similarly, some types of system call are only useful if processed in isolation. Manipulating the process’s signal mask is a good example: other than during initial setup, you are almost always doing it to guard a critical section in your code, and you need prompt, predictable handling for that purpose.

    I'm not sure if it was originally there, or if Netflix implemented it themselves, but they are using in-kernel TLS in BSD. So they can actually sendfile() or similar over an encrypted connection without context switches. – jaskij Feb 10 '23 at 11:06
  • @jaskij AIUI, Linux actually supports that as well (they actually do have TLS support in the kernel’s networking stack), though I have never personally seen software that uses it on Linux. – Austin Hemmelgarn Feb 10 '23 at 17:55
  • @StephenKitt thanks for the link, will have to keep it in mind for when I upgrade to 6.1, or whatever the next LTS is. – jaskij Feb 11 '23 at 15:50
  • "Manipulating the process’s signal mask is a good example of this" - a similar optimization could be applied though - the process could store the signal mask in its own memory, and the kernel could check it when it receives a signal. Then you just need one asynchronous syscall: "hey, I changed the signal mask, maybe deliver some of the signals that are waiting" – user253751 Feb 12 '23 at 11:54
  • @user253751 Such an approach would work as a replacement for sigaction(), but not sigprocmask(), because it doesn’t allow atomically updating the whole mask as a single unit (because you generally cannot atomically copy more than a word of data without changing where in memory the data is). Being able to atomically turn on or off all signals at the same time is actually really important for a whole slew of reasons. – Austin Hemmelgarn Feb 12 '23 at 13:22
  • Considering that Windows 11 ended up adding a new "IoRing" API that's pretty much a clone of Linux io_uring, was the previous "IO completion" API really similar to this at all? (I have not worked with IO completion style APIs at all, but my understanding is that they still required one syscall per operation.) – u1686_grawity Feb 12 '23 at 15:36
  • @user1686 Actually, that’s the API I was referring to. I had forgotten it was such a recent thing (for comparison, the kaio API on Solaris is more than 20 years old now, and most of the other completion queue APIs out there are of a similar vintage). – Austin Hemmelgarn Feb 12 '23 at 16:30
  • @AustinHemmelgarn: yeah, CreateIoRing() was only added around 2021, some 1-2 years later than Linux io_uring was merged. I thought that you were referring to the "CreateIoCompletionPort" API that was added 20 years ago in Windows XP, but if I understand correctly, that still uses individual syscalls to submit I/O and to retrieve results one by one so it doesn't really count as "compound". – u1686_grawity Feb 12 '23 at 16:58
8

These features have existed for quite a long time.

Solaris 2.6 in 1997 added a kernel asynchronous IO system call that does exactly this - kaio().

One way it can be accessed is via the lio_listio() function:

lio_listio

  • list directed I/O

Synopsis

cc [ flag... ] file... -lrt [ library... ]
#include <aio.h>

int lio_listio(int mode, struct aiocb *restrict const list[restrict], int nent, struct sigevent *restrict sig);

Description

The lio_listio() function allows the calling process, LWP, or thread, to initiate a list of I/O requests within a single function call.

The Illumos libc source code that's been open-sourced and descended from that original Solaris implementation of lio_listio() can be found at https://github.com/illumos/illumos-gate/blob/470204d3561e07978b63600336e8d47cc75387fa/usr/src/lib/libc/port/aio/posix_aio.c#L121
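A minimal usage sketch, submitting two reads with a single call (the `read_two()` helper name is invented; on older glibc you need to link with `-lrt`):

```c
/* Two reads, one lio_listio() call. LIO_WAIT blocks until both
 * requests have completed. */
#include <aio.h>
#include <fcntl.h>
#include <unistd.h>

/* Reads `len` bytes from each of two files; returns 0 on success. */
static int read_two(const char *p1, char *b1,
                    const char *p2, char *b2, size_t len)
{
    int fd1 = open(p1, O_RDONLY), fd2 = open(p2, O_RDONLY);
    if (fd1 < 0 || fd2 < 0)
        return -1;

    struct aiocb cb1 = { .aio_fildes = fd1, .aio_buf = b1,
                         .aio_nbytes = len, .aio_lio_opcode = LIO_READ };
    struct aiocb cb2 = { .aio_fildes = fd2, .aio_buf = b2,
                         .aio_nbytes = len, .aio_lio_opcode = LIO_READ };
    struct aiocb *list[] = { &cb1, &cb2 };

    /* One function call submits the whole list. */
    int rc = lio_listio(LIO_WAIT, list, 2, NULL);
    close(fd1);
    close(fd2);
    return rc;
}
```

With LIO_NOWAIT instead, the call returns immediately and completion is reported via the sigevent argument — much closer to the asynchronous model the question describes.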

One reason features like this aren't more common is they really don't improve performance much unless the entire software and hardware system is designed to take advantage of it.

Storage has to be configured to provide properly aligned blocks, file systems have to be built so they're properly aligned to the blocks the storage system provides, and the entire software stack needs to be written to not screw the IO up - it all has to do properly-aligned IO.

And with spinning disks, it's easy for a batch of IO operations to the same disk(s) to interfere with each other and actually slow everything down as the head(s) spend more time seeking.

And in my experience, all it takes is one layer doing things wrong for the performance advantage of batched system calls to disappear into the overhead, because IO is slow compared to even the worst system-call overhead.

The cost of creating and maintaining a combined hardware/software system to take advantage of the performance improvement batched IO system calls offers is immense.

And the best numbers I've ever seen show that batching many IO calls into one system call improves performance by about 25-30%.

If you're processing hundreds of GB of data continuously around the clock, that matters.

Building and maintaining an entire system like that just to lower the latency of viewing cat videos from 8 ms to 6 ms? Not so much.

Andrew Henle