53

I started thinking about this issue in the context of etiquette on the Linux Kernel Mailing list. As the world's best known and arguably most successful and important free software project, the Linux kernel gets plenty of press. And the project founder and leader, Linus Torvalds, clearly needs no introduction here.

Linus occasionally attracts controversy with his flames on the LKML. These flames are frequently, by his own admission, to do with breaking user space. Which brings me to my question.

Can I have some historical perspective on why breaking user space is such a bad thing? As I understand it, breaking user space would require fixes on the application level, but is this such a bad thing, if it improves the kernel code?

As I understand it, Linus' stated policy is that not breaking user space trumps everything else, including code quality. Why is this so important, and what are the pros and cons of such a policy?

(There are clearly some cons to such a policy, consistently applied, since Linus occasionally has "disagreements" with his top lieutenants on the LKML on exactly this topic. As far as I can tell, he always gets his way in the matter.)

Ciro Santilli OurBigBook.com
Faheem Mitha
  • 1
    Related: https://stackoverflow.com/q/25954270/350713 – Faheem Mitha Feb 29 '16 at 08:51
  • A Linux 2.0 could be infinitely better. The design of Linux is fucking horrible and that matters for everything downstream. I hate how everyone always contextualises breaking changes to OSes in language like "every time you make a breaking change" or "every few months". Devs who maintain vital APIs know that breaking back-compat is an enormous cost. That's why the goal is to make as many improvements as possible in one breaking change, and only pull the trigger on that one change if the improvements are worth it. Linux has thirty years of hindsight and problems we could fix. – iono Jun 19 '23 at 04:34

3 Answers

58

The reason is not a historical one but a practical one. There are many many many programs that run on top of the Linux kernel; if a kernel interface breaks those programs then everybody would need to upgrade those programs.

Now it's true that most programs do not in fact depend on kernel interfaces directly (the system calls), but only on interfaces of the C standard library (C wrappers around the system calls). Oh, but which standard library? Glibc? uClibC? Dietlibc? Bionic? Musl? etc.
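
To make the two layers concrete, here is a minimal sketch in C (getpid() is used only because it takes no arguments): the first call goes through the C library wrapper, the second through the raw system call that the wrapper sits on top of.

    /* Minimal sketch: the same operation via the C library wrapper and via
     * the raw system call that the wrapper is built on.  Most programs only
     * ever use the first form. */
    #include <stdio.h>
    #include <unistd.h>       /* getpid() - the libc wrapper */
    #include <sys/syscall.h>  /* SYS_getpid - the raw syscall number */

    int main(void)
    {
        pid_t via_libc   = getpid();             /* C library interface */
        long  via_kernel = syscall(SYS_getpid);  /* kernel interface directly */

        printf("libc: %d, raw syscall: %ld\n", (int)via_libc, via_kernel);
        return 0;
    }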

But there are also many programs that implement OS-specific services and depend on kernel interfaces that are not exposed by the standard library. (On Linux, many of these are offered through /proc and /sys.)
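
For instance (a made-up but typical case), a small monitoring tool might parse /proc/uptime directly; the layout of that file is a kernel interface with no libc wrapper in between, so changing it would break the tool:

    /* Sketch of a program that depends on a kernel-provided file directly.
     * /proc/uptime contains two numbers: seconds of uptime and seconds of
     * idle time; if the kernel changed that format, this code would break. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/uptime", "r");
        if (!f) {
            perror("/proc/uptime");
            return 1;
        }

        double up, idle;
        if (fscanf(f, "%lf %lf", &up, &idle) == 2)
            printf("up %.0f s, idle %.0f s\n", up, idle);

        fclose(f);
        return 0;
    }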

And then there are statically linked binaries. If a kernel upgrade breaks one of these, the only solution would be to recompile them, and that's only possible if you have the source: Linux does support proprietary software too.

Even when the source is available, gathering it all can be a pain, especially when you're upgrading your kernel to fix a bug with your hardware. People often upgrade their kernel independently from the rest of their system because they need the hardware support. In the words of Linus Torvalds:

Breaking user programs simply isn't acceptable. (…) We know that people use old binaries for years and years, and that making a new release doesn't mean that you can just throw that out. You can trust us.

He also explains that one reason to make this a strong rule is to avoid dependency hell where you'd not only have to upgrade another program to get some newer kernel to work, but also have to upgrade yet another program, and another, and another, because everything depends on a certain version of everything.

It's somewhat ok to have a well-defined one-way dependency. It's sad, but inevitable sometimes. (…) What is NOT ok is to have a two-way dependency. If user-space HAL code depends on a new kernel, that's ok, although I suspect users would hope that it wouldn't be "kernel of the week", but more a "kernel of the last few months" thing.

But if you have a TWO-WAY dependency, you're screwed. That means that you have to upgrade in lock-step, and that just IS NOT ACCEPTABLE. It's horrible for the user, but even more importantly, it's horrible for developers, because it means that you can't say "a bug happened" and do things like try to narrow it down with bisection or similar.

In userspace, those mutual dependencies are usually resolved by keeping different library versions around; but you only get to run one kernel, so it has to support everything people might want to do with it.

Officially,

backward compatibility for [system calls declared stable] will be guaranteed for at least 2 years.

In practice though,

Most interfaces (like syscalls) are expected to never change and always be available.

What does change more often is interfaces that are only meant to be used by hardware-related programs, in /sys. (/proc, on the other hand, which since the introduction of /sys has been reserved for non-hardware-related services, pretty much never breaks in incompatible ways.)

In summary,

breaking user space would require fixes on the application level

and that's bad because there's only one kernel, which people want to upgrade independently of the rest of their system, but there are many many applications out there with complex interdependencies. It's easier to keep the kernel stable than to keep thousands of applications up-to-date on millions of different setups.

  • 2
    Thank you for the answer. So, the interfaces that are declared stable are a superset of the POSIX system calls? My question about history is how this practice evolved. Presumably the original versions of the Linux kernel didn't worry about user space breakage, at least initially. – Faheem Mitha Oct 12 '15 at 09:01
  • 5
    @FaheemMitha Yes, they did, since 1991. I don't think Linus's approach evolved, it's always been “interfaces for normal applications don't change, interfaces for software that's very strongly tied to the kernel changes very very rarely”. – Gilles 'SO- stop being evil' Oct 12 '15 at 09:28

28

In any set of inter-dependent systems there are basically two choices: abstraction and integration. (I am purposely not using technical terms.) With abstraction, you're saying that when you make a call to an API, the result will always be the same even though the code behind the API may change. For example, when we call fs.open() we don't care whether it's a network drive, an SSD or a hard drive; we will always get an open file descriptor that we can do stuff with. With integration, the goal is to provide the "best" way to do a thing, even if the way changes. For example, opening a file may be different for a network share than for a file on disk. Both ways are used pretty extensively in the modern Linux desktop.
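
A rough sketch of the abstraction side in C rather than pseudocode (the second path is only a placeholder for a hypothetical network mount): the same open()/read() calls are used regardless of what actually backs the file.

    /* Abstraction in practice: open()/read() look identical whether the file
     * lives on a local disk, an SSD, tmpfs or a network mount.  The second
     * path below is only a placeholder. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static void peek(const char *path)
    {
        int fd = open(path, O_RDONLY);  /* same call, whatever the backing store */
        if (fd < 0) {
            perror(path);
            return;
        }

        char buf[16];
        ssize_t n = read(fd, buf, sizeof buf);
        printf("%s: read %zd bytes\n", path, n);
        close(fd);
    }

    int main(void)
    {
        peek("/etc/hostname");          /* typically a local filesystem */
        peek("/mnt/share/example.txt"); /* hypothetical network mount */
        return 0;
    }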

From a developer's point of view it's a question of "works with any version" or "works with a specific version". A great example of this is OpenGL. Most games are written to work with a specific version of OpenGL, and it doesn't matter if you're compiling from source: if the game was written to use OpenGL 1.1 and you're trying to get it to run on 3.x, you're not going to have a good time. On the other end of the spectrum, some calls are expected to work no matter what. For example, when I call fs.open() I don't want to care what kernel version I am on; I just want a file descriptor.

There are benefits to each way. Integration provides "newer" features at the cost of backwards compatibility, while abstraction provides stability at the expense of "newer" calls. Though it's important to note it's a matter of priority, not possibility.

From a communal standpoint, without a really, really good reason, abstraction is always better in a complex system. For example, imagine if fs.open() worked differently depending on kernel version. Then a simple file-system interaction library would need to maintain several hundred different "open file" methods (or blocks, probably). When a new kernel version came out, you wouldn't be able to just "upgrade"; you would have to test every single piece of software you used. Kernel 6.2.2 (fake) might just break your text editor.

For a real-world example, OS X tends not to care about breaking user space. They aim for "integration" over "abstraction" more frequently, and at every major OS update, things break. That's not to say one way is better than the other; it's a choice and design decision.

Most importantly, the Linux ecosystem is filled with awesome open-source projects, where people or groups work on the project in their free time, or because the tool is useful. With that in mind, the second it stops being fun and starts being a PIA, those developers will go somewhere else.

For example, I submitted a patch to BuildNotify.py. Not because I am altruistic, but because I use the tool, and I wanted a feature. It was easy, so here, have a patch. If it were complicated, or cumbersome, I would not use BuildNotify.py and I would find something else. If every time a kernel update came out my text editor broke, I would just use a different OS. My contributions to the community (however small) would not continue or exist, and so on.

So, the design decision was made to abstract system calls, so that when I do fs.open() it just works. That means maintaining fs.open long after fs.open2() gained popularity.
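
fs.open() and fs.open2() are made-up names, but the real kernel behaves the same way: the classic open() call keeps working even though openat() (and later openat2()) were added, so binaries built against the old call keep running. A small sketch:

    /* Old and new interfaces side by side: open() is still supported even
     * though openat() was added later, so existing binaries keep working. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd_old = open("/etc/hostname", O_RDONLY);              /* classic call  */
        int fd_new = openat(AT_FDCWD, "/etc/hostname", O_RDONLY);  /* newer variant */

        printf("open(): fd %d, openat(): fd %d\n", fd_old, fd_new);

        if (fd_old >= 0) close(fd_old);
        if (fd_new >= 0) close(fd_new);
        return 0;
    }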

Historically, this is the goal of POSIX systems in general: "here is a set of calls and expected return values; you figure out the middle." Again, for portability reasons. Why Linus chooses to use that methodology is internal to his brain, and you would have to ask him to know exactly why. If it were me, however, I would choose abstraction over integration in a complex system.

Faheem Mitha
coteyr
  • Thanks for your interesting answer. Please review my edits. Thanks. – Faheem Mitha Oct 11 '15 at 05:27
  • To be clear, in your terminology, Linux used Abstraction, right? But I also think Linux does not have a well-defined API. – Faheem Mitha Oct 11 '15 at 07:04
  • 1
    The API to userspace, the 'syscall' API, is well-defined (especially the POSIX subset) and stable, because removing any part of it will break software that people may have installed. What it doesn't have is a stable driver API. – pjc50 Oct 11 '15 at 07:21
  • @pjc50 Oh. So is breaking the driver API what Linus gets upset about, then? – Faheem Mitha Oct 11 '15 at 08:13
  • 4
    @FaheemMitha, it's the other way around. Kernel developers are free to break the driver API whenever they wish, so long as they fix all the in-kernel drivers before the next release. It's breaking the userspace API, or even doing non-API things that could break userspace, that produces epic reactions from Linus. – Mark Oct 11 '15 at 08:23
  • @Mark Ok, I see. Thanks for the correction. But if the userspace API is "well-defined and stable", how does the issue of breaking it arise? – Faheem Mitha Oct 11 '15 at 08:25
  • 5
    For example, if someone decides to change it by returning a different error code from ioctl() in some circumstances: https://lkml.org/lkml/2012/12/23/75 (contains swearing and personal attacks on the developer responsible). That patch was rejected because it would have broken PulseAudio, and hence all audio on GNOME systems. – pjc50 Oct 11 '15 at 08:32
  • 1
    @FaheemMitha, basically, def add(a, b); return a + b; end --- def add(a, b); c = a + b; return c; end --- def add(a, b); c = a + b + 10; return c - 10; end -- are all the "same" implementation of add. What gets him so upset is when people do def add(a, b); return (a + b) * -1; end. In essence, changing how "internal" things to the kernel work is ok. Changing what is returned to a defined and "public" API call is not. There are two kinds of API calls, "private" and "public". He feels that public API calls should never change without good reason. – coteyr Oct 11 '15 at 09:41
  • 4
    A non-code example: you go to the store, you buy 87 octane gas. You, as the consumer, don't "care" where the gas came from, or how it was processed. You just care that you're getting gas. If the gas went through a different refining process, you don't care. Sure, the refining process can change. There are even different sources of oil. But what you care about is getting 87 octane gas. So his position is: change sources, change refineries, change whatever, so long as what comes out at the pump is 87 octane gas. All the "behind the scenes" stuff doesn't matter, so long as there is 87 octane gas. – coteyr Oct 11 '15 at 09:44
  • @pjc50 Thanks for the example. It's actually one of the examples that prompted this question - I guess it's got a fair amount of press. But I wonder, why did the maintainer do it in the first place if it was clearly the wrong thing to do? Also, he seemed inclined to argue the point. I mean, presumably he is not a moron, so what am I missing? – Faheem Mitha Oct 11 '15 at 09:57
  • @FaheemMitha, in this case that is basically what Linus is saying: that the maintainer did something so fundamentally wrong as to be analogous to floating in the middle of a lake of water and blaming a passing blimp for the fact that he got wet. Everyone does silly things sometimes. This seems like one of those times (Linus certainly seems to feel so). – coteyr Oct 11 '15 at 11:38
  • "Why Linus chooses to use that methodology [of POSIX] is internal to his brain" POSIX is the basis of Unix is the basis of Linux, so I don't see why any external justification is required. – underscore_d Oct 11 '15 at 15:10
  • @underscore_d: Unix is the basis of Macintosh OSX as well, but Apple has chosen a different methodology. They happily break backward compatibility with every release, and expect developers to use an Apple-supplied "middle" layer which they make more effort to keep stable. This works for their userbase, who want to use the "latest and greatest" software and care less if old things can't keep up. So I think Linus is making a conscious choice. – librik Oct 11 '15 at 20:50
  • Fair point. I guess I felt it wasn't much of a choice, since Linux was specifically designed to be compatible with its predecessors, so following their methodology seems like a given; the choice then seemed like a passive consequence, rather than an active decision. Whereas Apple sort of just 'happens' to be a flavour of Unix, with less focus on userspace and backwards-compatibility. But I acknowledge this is mostly just semantics. – underscore_d Oct 11 '15 at 21:16
  • @coteyr Fair enough, I suppose. Though "don't break the API", if the API in question is clearly defined, seems like such an obvious and uncontroversial thing that I wonder why anyone would argue about it. But I suppose I would have to delve into the details to know what the dispute was about exactly. – Faheem Mitha Oct 11 '15 at 22:09
  • Linux is based on the POSIX standards, "where they make sense" (in Linus' view, parts of them are braindead), but it extends them quite a bit. E.g. the whole handling of WiFi, firewalls, and more recently cgroups are Linux extensions made of whole cloth. Linux has also adopted filesystem ACLs, which are a proposed (but never officially sanctioned) standard. Linux has its own handling of filesystems that aren't POSIXly correct. It goes on. Linus' decision is that changing userspace interfaces can be done only with several years' notice, and compatibility meanwhile. – vonbrand Oct 11 '15 at 22:45
  • @vonbrand "changing userspace interfaces can be done only with several year's notice, and compatibility meanwhile" is quite different from "do not break user space". Are we really talking about the same thing here? – Faheem Mitha Oct 11 '15 at 22:57
  • @FaheemMitha "obvious" is in the eye of the beholder. If you have worked on a system (any system) for a long time, many things are obvious to you that would not necessarily be obvious to someone else. And almost everyone has their own ideas on how to do things and what makes sense. It's more of a human nature (and, sometimes, ego) thing than something specific to any particular endeavour. – Joe Oct 17 '15 at 12:51

13

It's a design decision and choice. Linus wants to be able to guarantee to user-space developers that, except in extremely rare and exceptional (e.g. security-related) circumstances, changes in the kernel will not break their applications.

The pros are that userspace devs won't find their code suddenly breaking on new kernels for arbitrary and capricious reasons.

The cons are that the kernel has to keep old code and old syscalls etc around forever (or, at least, long past their use-by dates).

cas
  • Thank you for the reply. Are you aware of the history of how this decision evolved? I'm aware of projects that take a somewhat different perspective. For example, the Mercurial project does not have a fixed API, and can and does break code that relies on it. – Faheem Mitha Oct 11 '15 at 04:00
  • No, sorry, I can't recall how it came about. You could email Linus or the LKML and ask him. – cas Oct 11 '15 at 04:03
  • 4
    Mercurial is not an OS. The entire point of an OS is to enable the running of other software on top of it, and breaking that other software is very unpopular. By comparison, Windows has also maintained backwards compatibility for a very long time; 16-bit Windows code was only recently obsoleted. – pjc50 Oct 11 '15 at 07:23
  • @pjc50 It's true that Mercurial is not an OS, but regardless, there is other software, even if only scripts, that depends on it. And it can potentially be broken by changes. – Faheem Mitha Oct 11 '15 at 08:15