I strongly dislike automatic OOM killers of any kind and would rather resolve such situations manually. So for a long time I have had:
vm.overcommit_memory=1
vm.overcommit_ratio=200
But this way, when memory runs out, the system becomes unresponsive. On my old laptop with an HDD and 6 GB of RAM, I sometimes had to wait many minutes to switch to a text VT, issue some commands and wait for them to be executed. That's why I keep numerous performance indicators on screen to notice such situations beforehand, and I often get asked why I need them at all. They don't always help either: if memory runs out while I'm away from the laptop, it's already too late.
I expected the situation to be better on a newer laptop with an SSD and 12 GB of RAM, but in fact it's even worse. I have zRam with vm.swappiness=200, which allows up to 16.4 GB of compressed swap, and when it's nearly exhausted, the system becomes even more unresponsive than the old laptop did: VT switching barely works and I cannot SSH into the system from the local network. My only resort is to blindly invoke the kernel's manual OOM killer with Alt+SysRq+RF, which sometimes chooses to kill an important process such as dbus-daemon. I could write a daemon that sounds an alert when the swap is almost full, but that's only a partial stopgap again, as I may not get there in time anyway.
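To make the alert idea concrete, here is a minimal sketch of such a watcher. It assumes the standard SwapTotal and SwapFree fields of /proc/meminfo; the function names, the 90% threshold and the terminal bell standing in for a real sound are all placeholders, not an existing tool.

```python
import time


def swap_used_fraction(meminfo_text):
    """Return used swap as a fraction of total swap (0.0 if there is no swap)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # sizes are reported in kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return (total - free) / total


def watch_swap(threshold=0.9, interval=5.0):
    """Poll /proc/meminfo forever and ring the terminal bell above threshold."""
    while True:
        with open("/proc/meminfo") as f:
            used = swap_used_fraction(f.read())
        if used >= threshold:
            # "\a" is the terminal bell; a real daemon would play a sound here.
            print("\aswap is %.0f%% full" % (used * 100), flush=True)
        time.sleep(interval)
```

Calling watch_swap() starts the polling loop. The watcher itself is an ordinary process, so under heavy pressure it can stall like everything else, which is part of why it is only a stopgap.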
In the past, I tried to mitigate such situations with thrash-protect. It sends SIGSTOP to greedy processes and then automatically SIGCONTs them, which helped a lot to postpone the total lockup and let me resolve the situation manually. But under heavy overload it starts freezing virtually everything (although processes can be explicitly allowlisted), and it has a lot of irritating side effects. For example, if a shell is frozen, its child processes may remain frozen after the shell is thawed. If two processes share a message bus and one of them is frozen, messages rapidly accumulate in the bus, which leads to rapidly growing RAM usage again, or to lockups (graphical servers and multi-process browsers are especially prone to this).
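For readers unfamiliar with the mechanism, the stop-and-resume cycle boils down to two signals. The sketch below is not thrash-protect's actual code or selection logic; pause_briefly is a hypothetical helper that only shows the bare SIGSTOP/SIGCONT pair applied to a single PID.

```python
import os
import signal
import time


def pause_briefly(pid, seconds):
    """Suspend the process `pid` with SIGSTOP, wait, then resume it."""
    os.kill(pid, signal.SIGSTOP)  # cannot be caught or ignored by the target
    try:
        time.sleep(seconds)
    finally:
        # Always resume, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)
```

Because SIGSTOP cannot be caught, the target gets no chance to react or to tell its peers that it is about to stop, which is consistent with the message-bus problems described above.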
I tried running sshd with a -20 priority, as suggested in a similar question, but that doesn't really help: it's as unresponsive as with the default priority.
I would like to have an emergency console that is always locked in RAM and stays usable regardless of how overloaded the rest of the system is. Something akin to the Ctrl+Alt+Del screen in Windows NT ≥ 6, or even better. Given that it's possible to reserve some RAM with the crashkernel parameter, which I use for kdump, I suspect it's possible to exploit this or some other kernel mechanism for the task too?
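To clarify what "locked in RAM" means at the level of a single process: the mlockall(2) system call pins a process's pages so they cannot be swapped out. Below is a minimal sketch via ctypes. The constant values are an assumption (they match Linux on x86 and ARM but differ on some other architectures), and lock_process_memory is a hypothetical helper name.

```python
import ctypes

MCL_CURRENT = 1  # assumed Linux value (x86, ARM): lock pages mapped right now
MCL_FUTURE = 2   # assumed Linux value (x86, ARM): also lock future mappings

# CDLL(None) exposes the symbols of the already-loaded C library.
libc = ctypes.CDLL(None, use_errno=True)


def lock_process_memory(flags=MCL_CURRENT | MCL_FUTURE):
    """Try to pin the calling process in RAM; return (ok, errno)."""
    if libc.mlockall(flags) == 0:
        return True, 0
    # Typically fails without root/CAP_IPC_LOCK or with a low RLIMIT_MEMLOCK.
    return False, ctypes.get_errno()
```

This only protects the calling process's own pages. A console built this way would still depend on everything it talks to (terminal, display server, shell) being responsive, so at best it is a building block rather than a complete answer.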