50

For years, the OOM killer of my operating system has not worked properly, leaving me with a frozen system.
When memory usage is very high, the whole system tends to "freeze" (in fact: become extremely slow) for hours or even days, instead of killing processes to free memory.
The longest I have recorded is 7 days before resigning myself to a hard reset.
When OOM is about to be reached, iowait is extremely high (~70%), before becoming unmeasurable.
The tool iotop has shown that every program is reading from my hard drive at a very high throughput (tens of MB/s).
What are those programs reading?
- The directory hierarchy?
- The executable code itself?
I don't know exactly.

[edited] At the time I wrote this message (in 2017) I was using an up-to-date Arch Linux (4.9.27-1-lts), but I had already experienced the issue for years before that.
I have experienced the same issue with various Linux distributions and different hardware configurations.
Currently (2019), I am using an up-to-date Debian 9.6 (4.9.0). I have 16 GB of physical RAM, an SSD on which my OS is installed, and no swap partition.

Because of the amount of RAM that I have, I don't want to enable a swap partition, since it would just delay the onset of the issue.
Also, with SSDs, swapping too often could reduce the lifespan of the disk.
By the way, I've already tried with and without a swap partition; it has proved to only delay the onset of the problem, not to be the solution.

To me, the problem is that Linux drops essential data from its caches, which leads to a frozen system because it has to read everything, every time, from the hard drive.

I even wonder whether Linux drops the executable code pages of running programs, which would explain why programs that normally don't read much data behave this way in this situation.

I have tried several things in the hope of fixing this issue.
One was to set /proc/sys/vm/min_free_kbytes to 1000000 (1 GB).
Because this 1 GB should remain free, I thought this memory would be reserved by Linux to cache important data.
But it hasn't worked.
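
For reference, here is a minimal sketch of how such a setting is typically applied (the sysctl.d file name is only an example, not my exact setup):

sysctl vm.min_free_kbytes=1000000                                     # apply immediately (as root)
echo 'vm.min_free_kbytes = 1000000' > /etc/sysctl.d/99-min-free.conf  # persist across reboots
sysctl --system                                                       # reload sysctl configuration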

Also, I think it is useful to add that, even if it sounds great in theory, restricting the size of virtual memory to the size of physical memory by setting /proc/sys/vm/overcommit_memory to 2 isn't technically feasible in my situation, because the kind of applications I use require more virtual memory than they effectively use, for various reasons.
According to /proc/meminfo, the Committed_AS value is often more than double the physical RAM on my system (16 GB; Committed_AS is often > 32 GB).

I have experienced this problem with /proc/sys/vm/overcommit_memory at its default value of 0, and for a while I set it to 1, because I preferred programs to be killed by the OOM killer rather than misbehave because they don't check the return value of malloc when an allocation is refused.
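
For reference, a minimal sketch of how these overcommit modes can be switched and how the commit charge can be checked (values as documented in the kernel's vm.txt; the exact commands are illustrative):

sysctl vm.overcommit_memory=1                        # always overcommit, rely on the OOM killer
sysctl vm.overcommit_memory=2                        # strict accounting (not workable here)
grep -E 'CommitLimit|Committed_AS' /proc/meminfo     # compare the commit charge against the limit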

When talking about this issue on IRC, I have met other Linux users who have experienced this very same problem, so I guess a lot of users are affected by it.
To me this is not acceptable, since even Windows handles high memory usage better.

If you need more information or have a suggestion, please tell me.

Documentation:
https://en.wikipedia.org/wiki/Thrashing_%28computer_science%29
https://en.wikipedia.org/wiki/Memory_overcommitment
https://www.kernel.org/doc/Documentation/sysctl/vm.txt
https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
https://lwn.net/Articles/317814/

Related discussions:
Why does linux out-of-memory (OOM) killer not run automatically, but works upon sysrq-key?
Why does OOM-killer sometimes fail to kill resource hogs?
Preloading the OOM Killer
Is it possible to trigger OOM-killer on forced swapping?
How to avoid high latency near OOM situation?
https://lwn.net/Articles/104179/
https://bbs.archlinux.org/viewtopic.php?id=233843

M89
  • 778
  • 1
    I think this is what you should expect, if you're thrashing, but you're not really approaching 100% "used" i.e. there is too much memory usage which is file-backed, counted as "buff/cache". (Ugh, this phrasing assumes your tmpfs allocations are trivial, as these show up as "buff/cache", but cannot be paged out to a physical filesystem). min_free_kbytes is not relevant, it's not a reserve for cached pages. AFAICT none of the vm sysctls allow reserving any memory specifically for cached pages, i.e. limiting MAP_ANONYMOUS allocations :(. – sourcejedi Jun 28 '17 at 19:41
  • Thank you for confirming that setting a different value for min_free_kbytes is not the right way to solve this issue.
    However, I disagree with your statement: "this is what you should expect".
    What I do expect is a system that frees memory by killing non-vital processes under high memory usage, and certainly not a frozen system that is unusable for days.
    – M89 Jun 29 '17 at 14:53
  • Ugh again. I was trying to make a technical statement, not a normative one. To be clear, I agree it's annoying there seems no way to reserve memory for cached pages, which you point out is the obvious way to try and mitigate this. I'm not smart enough to say if there's a sensible way to account the necessary working set of cached pages to a process in a similar way to how anonymous pages are accounted to a process – sourcejedi Jun 29 '17 at 15:55
  • I'm not smart enough to say if there's a sensible way to account the necessary working set of cached pages to a process in a similar way to how anonymous pages are accounted to a process. I imagine the latter is easier... there's less ambiguity about whether the app is expected to use the page again later. Running out of memory sucks and Linux hasn't managed to mitigate that very much. – sourcejedi Jun 29 '17 at 16:01
  • @Guildo, Can you tell us some information about your system (architecture, distribution, etc)? – M89 Jul 28 '17 at 19:12
  • 5
    I've been looking for a solution for this exact issue for years now without any success. I believe I first noticed the problem after replacing my HDD by an SSD, which also entailed me disabling swapping altogether, but I can't really guarantee that it never happened before these changes, so it might be unrelated. I'm on Archlinux btw. – brunocodutra Aug 08 '17 at 11:59
  • 2
  • zram / zswap might help as a workaround.... – Gert van den Berg Apr 19 '18 at 12:24
  • 1
    @dsstorefile1 Thank you, I will try that. But how could it trigger the OOM killer for sure, when the kernel in this situation cannot do it properly? – M89 Jan 11 '19 at 17:52
  • @Gert van den Berg How do you think it could help? Could you please detail a little bit more how this could be a viable solution? – M89 Jan 11 '19 at 17:52
  • It swaps out data to a compressed place in RAM... That is MUCH faster than swapping to disk, and data in RAM is often highly compressible. – Gert van den Berg Jan 11 '19 at 18:09
  • @Guildo I already gave the amount of RAM, the Linux distribution, and the kernel version. What else should I provide? – M89 Jan 11 '19 at 22:02
  • @GertvandenBerg Alright, but how would you set this up to solve this issue? Would you store all binary code in this memory? Also, you're talking about 'swap', but remember that I'm not using any. – M89 Jan 11 '19 at 22:06
  • @M89: Zram at least adds a virtual swap "partition", which might help prevent running out of memory in the first place (and avoids horribly slow disk-based swap partitions; if swappiness is high enough you might even keep a bit of disk cache) - a minimal setup sketch follows these comments. (There is a reason this is not posted as an answer - it does nothing to speed up the OOM killer; it just gets more out of the RAM you have, to try and avoid the OOM situation.) – Gert van den Berg Jan 13 '19 at 14:17
  • @GertvandenBerg I'm doubtful, since like I said having a swap partition didn't help. Anyway, I could try that at some point to see whether having the swap held in RAM changes something. Did it solve the issue for you or for anybody else? – M89 Jan 15 '19 at 19:51
  • 1
    It helped when chrome managed to leak through all my RAM... (Although I eventually added an actual disk swap partition as well and eventually upgraded to a decent amount of RAM) – Gert van den Berg Jan 16 '19 at 16:56
  • Perhaps earlyoom would help in this situation: https://github.com/rfjakob/earlyoom. It is added to Fedora 32, for example: https://fedoraproject.org/wiki/Changes/EnableEarlyoom – Jetski S-type Jun 19 '20 at 03:17
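
For anyone who wants to try the zram workaround mentioned by Gert van den Berg above, here is a minimal manual sketch (the 4G size and lz4 algorithm are illustrative; most distributions also ship helpers such as zram-generator that do this automatically):

modprobe zram
echo lz4 > /sys/block/zram0/comp_algorithm   # choose the compression algorithm before sizing
echo 4G > /sys/block/zram0/disksize          # create a 4 GB compressed block device
mkswap /dev/zram0
swapon -p 100 /dev/zram0                     # high priority so it is used before any disk swap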

3 Answers

11

I've found two explanations (of the same thing) as to why kswapd0 does constant disk reading well before the OOM killer kills the offending process:

  1. see the answer and comment of this askubuntu SE answer
  2. see the answer and David Schwartz's comments of this answer on unix SE

I'll quote here the comment from 1. which really opened my eyes as to why I was getting constant disk reading while everything was frozen:

For example, consider a case where you have zero swap and system is nearly running out of RAM. The kernel will take memory from e.g. Firefox (it can do this because Firefox is running executable code that has been loaded from disk - the code can be loaded from disk again if needed). If Firefox then needs to access that RAM again N seconds later, the CPU generates "hard fault" which forces Linux to free some RAM (e.g. take some RAM from another process), load the missing data from disk and then allow Firefox to continue as usual. This is pretty similar to normal swapping and kswapd0 does it. – Mikko Rantalainen Feb 15 at 13:08
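
One way to see this mechanism in action (my own suggestion, not part of the quoted comment) is to watch major page faults and page-in rates while the system is under memory pressure:

ps -eo pid,comm,maj_flt --sort=-maj_flt | head   # processes with the most major (disk-backed) page faults
vmstat 1                                         # the bi (blocks in) column climbs while code pages are re-read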

If anyone knows a way to disable this behavior (perhaps by recompiling the kernel with certain options?), please let me know as soon as possible! Much appreciated, thanks!

UPDATE: The only way I've found thus far is patching the kernel, and it works for me with swap disabled (i.e. CONFIG_SWAP is not set), but apparently it doesn't work for others with swap enabled; see the patch inside this question.

  • Please remove text that is not valid. Do not tag edits with "EDIT" in the text. They are obvious from the revision history. – Kusalananda Jan 12 '19 at 10:35
  • 2
    @Kusalananda This user should be encouraged since he is probably the one who contributed the most. – M89 Jan 12 '19 at 21:37
  • @Kusalananda I thought it was important to keep them so that others could see what (else) was tried and didn't work. Perhaps UPDATE instead of EDIT would've been better? –  Jan 15 '19 at 22:51
  • @MarcusLinsner No, sorry, you've misunderstood. Showing what you've tried is what you do when you ask a question. The answer should be correct for the question as it is currently posed. I mean, one edit even asks the reader to ignore the previous edits. If one is interested in seeing the edit history, you may see it here. – Kusalananda Jan 16 '19 at 06:25
10

The EarlyOOM utility is a practical solution to this problem: https://fedoraproject.org/wiki/Changes/EnableEarlyoom

This utility will monitor memory usage and kill large consumers before total memory usage crosses a configurable threshold (e.g. 95%).

It is packaged for Arch Linux in the earlyoom package, which ships with a systemd service, so you can set it up quickly:

pacman -S earlyoom
systemctl enable earlyoom
systemctl start earlyoom
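
If the defaults don't match your workload, the threshold can be tuned. A sketch, assuming the packaged service reads its arguments from /etc/default/earlyoom as the upstream unit does (the path and values may differ on your distribution; -m and -s set the minimum free memory and swap percentages per the earlyoom documentation):

echo 'EARLYOOM_ARGS="-m 5 -s 100"' > /etc/default/earlyoom
systemctl restart earlyoom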

I have been using it for several weeks, and it's a night-and-day difference: no more frozen system, despite pushing the machine to the limit with a few memory-guzzling applications (Java, Electron, and browsers). I've not witnessed it kill the wrong process, which I suppose can happen in theory. Perhaps this is an unsolvable problem in theory, judging by how much has been written about it on the web, but in practice the simple heuristic works extremely well.

alexei
  • 443
  • 1
    The best way to fix it would be for kernel developers to do something to force the kernel to stay in RAM whatever happens. That, or force everything that was in RAM right after booting to the desktop to stay in RAM at all times. We do have earlyoom and systemd-oomd, but you can still easily get a system freeze, since they are based on what percentage of RAM is left (you can still hit OOM with only 70% of your RAM used, or even less), instead of killing the biggest non-system process that starts to thrash I/O, because Linux still really can't calculate the real amount of unreclaimable cache/buffers. – X.LINK Jan 22 '21 at 00:33
  • Tested, it works nicely. https://github.com/rfjakob/earlyoom/issues/242#issuecomment-781903731 – ceremcem Feb 19 '21 at 08:38
  • Project repo https://github.com/rfjakob/earlyoom – qwr Mar 30 '22 at 22:51
0

The memory.min parameter in the cgroups-v2 memory controller should help.

Namely, let me quote:

Hard memory protection. If the memory usage of a cgroup is within its effective min boundary, the cgroup’s memory won’t be reclaimed under any conditions. If there is no unprotected reclaimable memory available, OOM killer is invoked.

Source: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
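
A hedged sketch of one way to apply this on a systemd-based system with the unified cgroup-v2 hierarchy and a reasonably recent systemd (the user.slice target and the 2G value are illustrative only):

systemctl set-property user.slice MemoryMin=2G                       # maps to memory.min for that cgroup
echo $((2*1024*1024*1024)) > /sys/fs/cgroup/user.slice/memory.min    # or write to the cgroup file directly

Note that, per the linked document, a cgroup's effective protection is limited by the memory.min values of its ancestors, so deeper cgroups need the setting along the whole path.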

RalfFriedl
  • 8,981