
What are the contents of this monolithic code base?

I understand processor architecture support, security, and virtualization, but I can't imagine that being more than 600,000 lines or so.

What are the historic and current reasons drivers are included in the kernel code base?

Do those 15+ million lines include every single driver for every piece of hardware ever? If so, that raises the question: why are drivers embedded in the kernel and not separate packages that are auto-detected and installed from hardware IDs?

Is the size of the code base an issue for storage-constrained or memory-constrained devices?

It seems it would bloat the kernel size for space-constrained ARM devices if all of that were embedded. Are a lot of lines culled by the preprocessor? Call me crazy, but I can't imagine a machine needing that much logic to perform what I understand to be the roles of a kernel.

Is there evidence that the size will be an issue in 50+ years, due to its seemingly ever-growing nature?

Including drivers means it will grow as hardware is made.

EDIT: For those thinking this is the nature of kernels, after some research I realized it isn't always. A kernel is not required to be this large: Carnegie Mellon's microkernel Mach was listed as an example at 'usually under 10,000 lines of code'.

Jonathan
  • Back in 2012 it had over 5 million lines just for drivers, and 1.9 million lines for supporting different processor architectures. More info: http://www.h-online.com/open/features/Kernel-Log-15-000-000-lines-of-code-3-0-promoted-to-long-term-kernel-1408062.html – steve Aug 17 '15 at 17:27
  • Yes, I have coded a compiler, lexical analyzer, and byte code generator for a language; it was Turing complete (plus recursion) and it didn't take 10,000 lines. – Jonathan Aug 17 '15 at 17:37
  • (Looked at it now; it was about 2,700 lines.) – Jonathan Aug 17 '15 at 17:43
  • As @drewbenn says in his answer, it includes a lot of stacks and drivers and supports many architectures; it would be nice to count how many effective lines a defined configuration takes. – Alex Aug 17 '15 at 17:47
  • Creating a configuration for one arch and working a bit with unifdef would tell how many lines are really needed (or close enough). – Alex Aug 17 '15 at 18:05
  • You should download the source and run make menuconfig to see how much of the code can be enabled/disabled prior to building. – casey Aug 18 '15 at 00:17
  • About the EDIT: obviously (and for good reasons: http://stackoverflow.com/questions/1806585/why-is-linux-called-a-monolithic-kernel/1806597) Linux is not a microkernel. – edc65 Aug 18 '15 at 08:33
  • Drivers. Linux is an all-purpose project that runs on and supports A LOT of hardware, file systems, etc. And that's great. – Konrad Gajewski Aug 18 '15 at 11:45
  • Windows has over 40 million LoC and you're complaining about 15+ million?! – Silviu Burcea Aug 18 '15 at 12:09
  • @JonathanLeaders: I've done Turing-complete compilers for LISP-like languages in less than 100 lines, with test programs rendering Mandelbrots. It always depends. – phresnel Aug 18 '15 at 12:32
  • It is mostly drivers, but the drivers are not all embedded in the kernel at runtime; they are loaded when needed. So think of the kernel as having its own distribution-independent packaging system to load drivers as needed. – MTilsted Aug 18 '15 at 16:00
  • @SilviuBurcea: Windows is a full GUI operating system and server. Linux is a kernel. – Martin Argerami Aug 18 '15 at 17:31
  • After some discussion of history, he mentions some perks of microkernels: https://www.youtube.com/watch?v=86_BkFsb4eI – Jonathan Aug 18 '15 at 18:42
  • I like the idea of a monolithic kernel; I just feel the driver code should be separated, even if at some point you compile it in. In 50 or 100 years, the Linux kernel source code will just continue to grow and grow, almost exclusively due to drivers for more and more hardware. Driver code is clearly growing at a higher rate than anything else, and logically so. In terms of code location it should be separate, but perhaps 'hooked' in like you would any framework or set of code. I'd like to see a day when the Linux kernel code shrinks because of simplifying. – Jonathan Aug 18 '15 at 21:12
  • I wouldn't call you crazy; you just have a lack of imagination. – Anthon Aug 19 '15 at 05:17
  • You could probably just delete most of those newlines with little effect. – mikeserv Aug 19 '15 at 05:33
  • @JonathanLeaders In case the answers below don't make this clear enough: I think your question "why are drivers embedded in the kernel and not separate packages that are auto-detected and installed from hardware IDs?" is based on a false assumption. The latter is in fact the way Linux drivers work. The modularization of drivers was a huge change from Linux 1.x to 2.x, but this is orthogonal to the structure of the code base. The code for modular drivers is kept in the same repo as the rest of the kernel, but this in no way implies that drivers are "embedded in the kernel" at build time. – dodgethesteamroller Aug 19 '15 at 19:02
  • @dodgethesteamroller: Exactly. I am learning this as I read. Thank you. – Jonathan Aug 21 '15 at 00:01
  • I've reworded it, hopefully well enough to reopen. – Jonathan Aug 21 '15 at 06:10

6 Answers


According to cloc run against 3.13, Linux is about 12 million lines of code.

  • 7 million LOC in drivers/
  • 2 million LOC in arch/
  • only 139 thousand LOC in kernel/
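Counts like these can be reproduced with plain find and wc against a kernel checkout; no cloc required. A minimal sketch, demoed on a synthetic tree so it is self-contained (point ROOT at a real kernel source tree to get figures comparable to the ones above):

```shell
# Count lines of C/header/assembly source per top-level directory.
# ROOT here is a tiny made-up stand-in for a kernel checkout.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/drivers" "$ROOT/kernel"
printf 'a\nb\nc\n' > "$ROOT/drivers/net.c"
printf 'a\n'       > "$ROOT/kernel/sched.c"
for d in drivers kernel; do
  printf '%s: %s lines\n' "$d" \
    "$(find "$ROOT/$d" -name '*.[chS]' -exec cat {} + | wc -l)"
done
```

cloc additionally separates comments and blanks from code, so its numbers run a bit lower than a raw wc count.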

lsmod | wc on my Debian laptop shows 158 modules loaded at runtime, so dynamically loading modules is a well-used way of supporting hardware.

The robust configuration system (e.g. make menuconfig) is used to select which code to compile (and, more to your point, which code not to compile). Embedded systems define their own .config file with just the hardware support they care about (including whether supported hardware is built into the kernel or built as loadable modules).
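A .config is just a list of CONFIG_*=y/m lines, so the built-in versus module split is easy to inspect with grep. A small sketch on an inline sample (the CONFIG names below are an illustrative excerpt, not a real config); on a live system you would point grep at /boot/config-$(uname -r) instead:

```shell
# Count how many options a config builds in (=y) vs. as modules (=m).
config='CONFIG_NET=y
CONFIG_PRINTK=y
CONFIG_E1000=m
CONFIG_SND_HDA_INTEL=m'
printf '%s\n' "$config" | grep -c '=m$'   # options built as loadable modules
printf '%s\n' "$config" | grep -c '=y$'   # options built into the kernel image
```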

  • I see, so there is a lot of 'culling' or dynamic loading taking place. That makes me feel a lot better. So in reality, maybe a very small portion of the driver code base would actually be in use on one single device at a time. To verify that, do you know a way to tell how many modules possibly exist? You said 158 are running, but out of how many? – Jonathan Aug 17 '15 at 17:45
  • Okay, so that's just over 5% of the driver modules actually in use. Now we're making more sense. So perhaps the rest of the code also has similarly small percentages 'in use' on a device. – Jonathan Aug 17 '15 at 17:58
  • My gentoo-4.0.5: lsmod | wc -l: 12 (one line is the header); find /lib/modules/$(uname -r)/ -name '*.ko' | wc -l: 48. – jimmij Aug 17 '15 at 18:06
  • Counting modules isn't enough; a lot may be built in by the config. – Alex Aug 17 '15 at 18:14
  • I think from this we can conclude the Linux kernel is massive because it supports all sorts of device configurations, not because it's outrageously complex. We see here that very few of the 15M lines are actually in use. Although, as nearly all things are, it may be overly complex, at least we can sleep at night knowing it's within reason. – Jonathan Aug 17 '15 at 18:22
  • @JonathanLeaders: Yes, and as well as modules for strange devices, there are modules for obscure filesystems, networking protocols, etc. – psmears Aug 17 '15 at 20:34
  • @JonathanLeaders I remember when Linux was starting: even getting the installer to work (if it even had an installer!) was a massive pain; there are still some distros where you have to pick your mouse driver manually. Making things like networking or, god forbid, X-window work was a rite of passage. On my first Red Hat installation, I had to write my own graphics driver, because there were only three (!) drivers available. Having the basics work by default is a sign of maturity, and obviously you can afford a lot more tweaking on an embedded system, where there are only a few HW combinations. – Luaan Aug 18 '15 at 07:25
  • @JonathanLeaders As I think you've realized, the LOC in the source is more or less irrelevant. If you want to know how much memory the kernel uses there are much more direct ways. – goldilocks Aug 18 '15 at 14:32
  • A non-modular kernel makes quite a bit of sense when deploying to servers, especially many identical ones... unless you really do have a printer, sound system and CueCat hooked up to each one. The total LOC in the kernel is just the ceiling on the insanity you can reach building it. Most systems using the Linux kernel will only load what a hardware probe reveals, which is far less than 'all of it'. – Tim Post Aug 18 '15 at 20:08
  • What is "arch/"? – gornvix Jul 10 '17 at 18:09

For anyone curious, here's the linecount breakdown for the GitHub mirror:

=============================================
    Item           Lines             %
=============================================
  ./usr                 845        0.0042
  ./init              5,739        0.0283
  ./samples           8,758        0.0432
  ./ipc               8,926        0.0440
  ./virt             10,701        0.0527
  ./block            37,845        0.1865
  ./security         74,844        0.3688
  ./crypto           90,327        0.4451
  ./scripts          91,474        0.4507
  ./lib             109,466        0.5394
  ./mm              110,035        0.5422
  ./firmware        129,084        0.6361
  ./tools           232,123        1.1438
  ./kernel          246,369        1.2140
  ./Documentation   569,944        2.8085
  ./include         715,349        3.5250
  ./sound           886,892        4.3703
  ./net             899,167        4.4307
  ./fs            1,179,220        5.8107
  ./arch          3,398,176       16.7449
  ./drivers      11,488,536       56.6110
=============================================

drivers alone accounts for well over half of the line count.
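As a quick sanity check, the drivers/ share can be recomputed from the raw counts in the table; the total below is simply the sum of every directory row:

```shell
# Recompute percentages from the table's raw line counts
total=20293820       # sum of all directory rows above
drivers=11488536
arch=3398176
awk -v d="$drivers" -v t="$total" 'BEGIN { printf "%.1f%%\n", 100 * d / t }'
awk -v d="$((drivers + arch))" -v t="$total" \
    'BEGIN { printf "%.1f%%\n", 100 * d / t }'
```

The first figure matches the 56.611% in the table; the second shows that drivers/ plus arch/ together make up roughly three quarters of the tree.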

Nat
user3276552

Drivers are maintained in-kernel so when a kernel change requires a global search-and-replace (or search-and-hand-modify) for all users of a function, it gets done by the person making the change. Having your driver updated by people making API changes is a very nice advantage, instead of having to do it yourself when it doesn't compile on a more recent kernel.

The alternative (which is what happens for drivers maintained out-of-tree), is that the patch has to get re-synced by its maintainers to keep up with any changes.

A quick search turned up a debate over in-tree vs. out-of-tree driver development.

The way Linux is maintained is mostly by keeping everything in the mainline repo. Building of small stripped-down kernels is supported by config options to control #ifdefs. So you can absolutely build tiny stripped-down kernels which compile only a tiny part of the code in the whole repo.

The extensive use of Linux in embedded systems has led to better support for leaving stuff out than Linux had years earlier when the kernel source tree was smaller. A super-minimal 4.0 kernel is probably smaller than a super-minimal 2.4.0 kernel.

Peter Cordes
  • Now THIS makes sense to me as to why it is logical to have all the code together: it saves man-hours at the cost of computer resources and excessive dependencies. – Jonathan Aug 19 '15 at 06:51
  • @JonathanLeaders: yeah, it avoids bit-rot for drivers with not-very-active maintenance. It's also probably useful to have all the driver code around when considering core changes. Searching for all callers of some internal API might turn up a driver using it in a way you didn't think of, potentially influencing a change you were thinking about. – Peter Cordes Aug 19 '15 at 12:09
  • @JonathanLeaders Come on, as if those extra lines take much extra space by modern measures when installing on a PC. – Junaga Oct 26 '16 at 07:33
  • @Junaga: you realize Linux is very portable and scalable, right? Wasting 1MB of permanently-used kernel memory on a 32MB embedded system is a big deal. Source code size is not important, but compiled binary size is still important. Kernel memory isn't paged, so even with swap space you can't get it back. – Peter Cordes Oct 26 '16 at 07:56
  • @PeterCordes And to cover that case we have a thing: "Building of small stripped-down kernels is supported by config options to control #ifdefs. So you can absolutely build tiny stripped-down kernels which compile only a tiny part of the code in the whole repo." Clearly, having the complete compiled kernel be a bit larger in order to achieve user-friendliness was the right thing to do, or else it wouldn't have been done. This decision was made by people who know their stuff way better than you and I do. – Junaga Oct 27 '16 at 08:49
  • @Junaga: You're missing the point. As I understand it, this doesn't make the compiled kernel larger at all, compared to not having some stuff in mainline in the first place. You're arguing that it takes extra space but it's worth it. I'm arguing that it doesn't take extra space at all, so it's just a matter of repo size / source-code disk space. – Peter Cordes Oct 27 '16 at 08:55
  • IMO this is a nice way to get started quickly and with limited resources. However, spaghetti code (because this is what it really is, what you described) means that development will hit a wall at some point. Yes, OK, there is a level of manageability thanks to #ifdefs. Yes, this is contained to monolithic drivers. There is hope. – Rolf Apr 29 '18 at 19:09
  • @Rolf: It's large, but it's not spaghetti. It's currently quite well architected, without two-way dependencies back and forth between core code and drivers. Drivers can be left out without breaking the core kernel. When an internal function or API is refactored so drivers need to use it differently, drivers may need to change, but that's normal for refactoring. – Peter Cordes Aug 28 '18 at 18:51
  • @PeterCordes Thank you. Spaghetti was my impression when I had a read through the code, with plenty of symbols shared across files. I'm no expert by any means, though. – Rolf Aug 29 '18 at 11:11

The answers so far seem to be "yes, there is lots of code", and nobody is tackling the question with the most logical answer: 15M+ lines? So what? What do 15 million lines of source code have to do with the price of fish? What makes this so unimaginable?

Linux clearly does lots. Lots more than anything else... But some of your points show you don't respect what's happening when it's built and used.

  • Not everything is compiled. The Kernel build system allows you to quickly define configurations which select sets of source code. Some is experimental, some is old, some just isn't needed for every system. Look at /boot/config-$(uname -r) (on Ubuntu), or browse the options in make menuconfig, and you'll see just how much is excluded.

    And that's a variable-target desktop distribution. The config for an embedded system would only pull in the things it needs.

  • Not everything is built-in. In my configuration, most of the Kernel features are built as modules:

    grep -c '=m' /boot/config-`uname -r`  # 4078
    grep -c '=y' /boot/config-`uname -r`  # 1944
    

    To be clear, these could all be built-in... Just as they could be printed out and made into a giant paper sandwich. It just wouldn't make sense unless you were doing a custom build for a discrete hardware job (in which case, you'd have limited the number of these items down already).

  • Modules are dynamically loaded. Even when a system has thousands of modules available to it, the system will allow you to load just the things you need. Compare the outputs of:

    find /lib/modules/$(uname -r)/ -iname '*.ko' | wc -l  # 4291
    lsmod | wc -l                                         # 99
    

    Almost nothing is loaded.

  • Microkernels aren't the same thing. Just 10 seconds looking at the leading image on the Wikipedia page you linked would highlight that they are designed in a completely different way.

    Linux drivers are internalised (mostly as dynamically loaded modules), not run in userspace, and the filesystems are similarly internal. Why is that worse than using external drivers? Why is micro better for general-purpose computing?


The comments again highlight that you're not getting it. If you want to deploy Linux on discrete hardware (e.g. aerospace, a TiVo, a tablet, etc.) you configure it to build only the drivers you need. You can do the same on your desktop with make localmodconfig. You end up with a tiny, for-purpose Kernel build with zero flexibility.

For distributions like Ubuntu, a single 40MB Kernel package is acceptable. No, scrub that: it's actually preferable to the massive archiving-and-download scenario that keeping 4000+ floating modules as separate packages would entail. It uses less disk space, is easier to package at compile time, easier to store, and better for their users (who get a system that just works).

The future doesn't seem to be an issue either. The rate of CPU speed, disk density/pricing, and bandwidth improvements seems much faster than the growth of the Kernel. A 200MB Kernel package in 10 years wouldn't be the end of the world.

It's also not a one way street. Code does get kicked out if it isn't maintained.

Oli
  • The concern is mainly for embedded systems. As you show, you have 4,000 modules not in use on your own system. In some small robotics or aerospace applications (READ: not general-purpose computing) this would be unacceptable waste. – Jonathan Aug 18 '15 at 17:54
  • @JonathanLeaders I think you can safely delete them. On a desktop install, they are there in case you suddenly plug something into a USB port, or change some hardware configuration, etc. – Didier A. Aug 18 '15 at 20:47
  • Yes, exactly. I still remain surprised that assumptions like "you could plug in a USB device at any time, therefore we need 15M lines of code" are written in at the kernel level and not at the distro level, seeing as Linux is used in phones and various embedded devices. Well, I guess the distro does cull the list on its own. I would just think support for pluggability should be additive and not subtractive, i.e. a distro would 'opt in' to it by adding package sources, as opposed to embedded ARM configurations telling the kernel to be one percent of its current size. – Jonathan Aug 18 '15 at 21:02
  • The other concern I see is: 50 years later, will the Linux kernel code be 100 million lines of code? Does it by necessity grow as hardware is made? I would think package libraries would grow, not the core kernel. – Jonathan Aug 18 '15 at 21:08
  • @JonathanLeaders you would never run a kernel configured for a desktop on an embedded system. Our embedded system has 13 modules and has removed all the hardware support we don't need (along with plenty of other customizations). Stop comparing desktops to embedded systems. Linux works well because it supports everything and can be customized to only include what you care about. And those 4k modules are really great on desktop systems: when my last laptop died I just put the hard drive in a much newer laptop and everything just worked. – Aug 18 '15 at 21:21
  • @JonathanLeaders without wanting to sound rude, it seems like you missed or otherwise didn't understand the very first bullet. If you want a tiny Linux kernel with just the things you need, you can have that. I've edited to re-explain this and the other things you mention. – Oli Aug 18 '15 at 22:30
  • You could even package modules as you say; I just don't think it would be desirable for a distribution or its users. Explanations also in the edit. – Oli Aug 18 '15 at 22:32
  • This otherwise good/valuable answer suffers from a distinctly angry and combative tone. -1. – TypeIA Apr 24 '18 at 20:01
  • $ find /lib/modules/$(uname -r)/ -iname '*.ko' | wc -l # 0 for Linux version 5.7.0-arch1-1 (linux@archlinux) (gcc version 10.1.0 (GCC), GNU ld (GNU Binutils) 2.34.0) #1 SMP PREEMPT Mon, 01 Jun 2020 22:54:03 +0000 – 15 Volts Jun 07 '20 at 21:08
  • @TypeIA totally agree, downvoting because of this. – Ayberk Özgür Jun 09 '20 at 15:31

Linux tinyconfig compiled sources line count: tinyconfig bubble graph (SVG, fiddle)

A shell script creates the JSON from the kernel build; use it with http://bl.ocks.org/mbostock/4063269


Edit: it turned out unifdef has some limitations (-I is ignored and -include is unsupported; the latter is used to include the generated configuration header). At this point, using cat doesn't change much:

274692 total # (was 274686)

script and procedure updated.


Besides drivers, arch, etc., there is a lot of conditional code that is compiled or not depending on the chosen configuration; code not necessarily in dynamically loaded modules but built into the core.

So, I downloaded the linux-4.1.6 sources and picked tinyconfig. It doesn't enable modules, and I honestly don't know what it enables or what a user can do with such a kernel at runtime. Anyway, configure the kernel:

# tinyconfig      - Configure the tiniest possible kernel
make tinyconfig

built the kernel

time make V=1 # (should be fast)
# 1049168 ./vmlinux (I'm using x86-32; on other arches the size may differ)

The kernel build process leaves hidden files called *.cmd containing the command lines used to build the .o files. To process those files and extract targets and dependencies, copy script.sh below and use it with find:

find -name "*.cmd" -exec sh script.sh "{}" \;

This creates, for each .c dependency of a target .o, a copy named .o.c.
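The script.sh referred to above did not survive in this copy. What follows is a hypothetical reconstruction of what it likely did, based on the kbuild .cmd file format (each .<name>.o.cmd records its source file on a "source_<target> :=" line); the demo runs on a synthetic .cmd file so it is self-contained:

```shell
# Hypothetical reconstruction of the missing script.sh: read a kbuild
# .cmd file, find the C source recorded on its "source_<target> :=" line,
# and copy that source next to the object as <target>.c (i.e. foo.o.c).
extract() {
  cmd_file=$1
  src=$(sed -n 's/^source_.*\.o := \(.*\.c\)$/\1/p' "$cmd_file")
  target=$(sed -n 's/^source_\(.*\.o\) := .*/\1/p' "$cmd_file")
  [ -n "$src" ] && [ -n "$target" ] && cp "$src" "$target.c"
}

# Demo on a synthetic tree (a real run would be:
#   find -name "*.cmd" -exec sh script.sh "{}" \; )
cd "$(mktemp -d)"
mkdir -p kernel
echo 'int x;' > kernel/core.c
printf 'source_kernel/core.o := kernel/core.c\n' > kernel/.core.o.cmd
extract kernel/.core.o.cmd
ls kernel/core.o.c
```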

.c code

find -name "*.o.c" | grep -v "/scripts/" | xargs wc -l | sort -n
...
   8285 ./kernel/sched/fair.o.c
   8381 ./kernel/sched/core.o.c
   9083 ./kernel/events/core.o.c
 274692 total

.h headers (sanitized)

make headers_install INSTALL_HDR_PATH=/tmp/test-hdr
find /tmp/test-hdr/ -name "*.h" | xargs wc -l
...
  1401 /tmp/test-hdr/include/linux/ethtool.h
  2195 /tmp/test-hdr/include/linux/videodev2.h
  4588 /tmp/test-hdr/include/linux/nl80211.h
112445 total
Alex

The tradeoffs of monolithic kernels were debated between Tanenbaum and Torvalds in public from the very beginning. If you don't need to cross into userspace for everything, then the interface to the kernel can be simpler. If the kernel is monolithic, then it can be more optimized (and messier!) internally.

We have had modules as a compromise for quite a while. And it is continuing with things like DPDK (moving more networking functionality out of the kernel). The more cores get added, the more important it is to avoid locking; so more things will move into userspace and the kernel will shrink.

Note that monolithic kernels are not the only solution. On some architectures, the kernel/userspace boundary isn't more expensive than any other function call, making microkernels attractive.

Rob
  • "On some architectures, the kernel/userspace boundary isn't more expensive than any other function call" - interesting! What architecture would that be? It looks incredibly hard to pull off unless you forsake any kind of memory protection, at least. – Voo Aug 18 '15 at 18:03
  • I went through all of Ivan Goddard's millcomputing.com videos (the Mill/belt CPU, very VLIW-like). This particular claim is a central theme, and its implications are not obvious until you get to the security video. It's a POC architecture in simulation, but it is probably not the only architecture with this property. – Rob Aug 18 '15 at 18:08
  • Ah, that explains it. In my experience (and I'll be the first to admit that I don't follow the industry that closely) there are many simulated architectures, and few live up to their claims as soon as the rubber hits the road, i.e. they're put on real hardware. Although the idea behind it might be interesting in any case; not the first time that particular CPU has been mentioned. If you ever find an existing architecture that has this property, I'd be really interested. – Voo Aug 18 '15 at 18:11
  • BTW, here are more resources on the debate you mentioned: https://en.wikipedia.org/wiki/Tanenbaum%E2%80%93Torvalds_debate – Jonathan Aug 18 '15 at 18:28