I am using a CentOS 7.5 instance in AWS with 16 CPUs and 32 GB of memory. I found that when I run the following command, the whole system becomes unresponsive: I cannot run any commands on it anymore, and cannot even establish a new SSH session (although I can still ping it). I also do not see the OOM killer triggered at all; it seems the whole system just hangs forever.

stress --vm 1 --vm-bytes 29800M --vm-hang 0 

However, if I run stress --vm 1 --vm-bytes 29850M --vm-hang 0 to consume a bit more memory (50 MB more), the OOM killer is triggered successfully (I can see it in dmesg). And if I run the stress command to consume less than 29800 MB (e.g. stress --vm 1 --vm-bytes 29700M --vm-hang 0), the system stays responsive (and there is no OOM kill) and I can run commands as usual.

So 29800 MB seems to be a "magic number" for this instance: if I tell stress to use more memory than that, the command gets OOM killed; if I tell it to use less, everything is fine; and if I tell it to use exactly 29800 MB, the whole system becomes unresponsive. I have also observed the same behavior on Linux hosts with different specs, e.g. on a CentOS 7.5 instance with 72 CPUs and 144 GB of memory the "magic number" is 137600 MB.
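(In case it is relevant, the total and available memory as reported by the kernel can be checked with the commands below; I am only guessing that the "magic number" is related to MemAvailable.)

free -m
grep -E 'MemTotal|MemAvailable' /proc/meminfo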

My question is: why isn't the OOM killer triggered when exactly the "magic number" of memory is used?

Eric

2 Answers


Oh, this topic has been discussed ad nauseam already.

See:

The Linux kernel's inability to gracefully handle low memory pressure

Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

The solution? Various userspace OOM daemons. My favourite is earlyoom (installed and enabled by default in Fedora 32), but there are others as well.
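A minimal sketch of getting earlyoom running, assuming the package is available in your distribution's repositories (on CentOS 7 that typically means enabling EPEL; both the package and the systemd service are named earlyoom):

sudo yum install earlyoom             # CentOS/RHEL with EPEL enabled; use dnf on Fedora
sudo systemctl enable --now earlyoom  # start the daemon and keep it enabled across reboots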


The root cause of the problem is that the OOM killer is activated only if no forward progress can be made while trying to run all the processes. Here "forward progress" means that some memory can continuously be made available, where "continuously" means at least 4 KB of additional free RAM after scanning the whole virtual memory page table structure once.

The worst-case behavior requires constantly scanning that structure for so long that the system appears to hang, because there is no time limit on this progress: as long as the memory subsystem can make forward progress, it will keep going, albeit very slowly. Scanning the whole memory may take up to 0.5 seconds, and worst-case forward progress yields only 4 KB of free memory per scan. If the system actually needs e.g. 20 MB to re-render the display, that is about 5000 scans over memory, which at 0.5 seconds per scan is roughly 2500 seconds, well over half an hour. Obviously, a system where one screen refresh takes that long is equivalent to a totally hung system for the user, even if the kernel is happy with the progress.
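The arithmetic, spelled out (the 20 MB target is only an illustrative figure):

echo "scans needed: $(( 20 * 1024 / 4 ))"   # 20 MB needed / 4 KB freed per scan = 5120 scans
echo "seconds:      $(( 20 * 1024 / 4 / 2 ))"  # at 0.5 s per scan (scans / 2) = 2560 seconds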

I think the best option would be to allow specifying a maximum latency for the memory subsystem, with the OOM killer activating automatically when the latency gets too high (e.g. the memory management subsystem must be able to provide at least 100 MB of free space per second, or the system state should be considered out of RAM). The problem with earlyoom and similar solutions is that, to avoid the worst-case behavior under highly spiky loads, they require very broad safety margins, and those margins waste RAM.
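For completeness, those safety margins are expressed with earlyoom roughly as follows; the percentages are arbitrary examples rather than recommendations, and on packaged installs they usually go into the EARLYOOM_ARGS variable in /etc/default/earlyoom:

earlyoom -m 10 -s 10   # kill the largest process when available RAM drops below 10% and free swap below 10%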