
I run FreeBSD 12 through Vagrant on a VM with 350 MB RAM and 2 GB swap.

If I run:

perl -e '$a="x"x200_000_000;sleep 1000' & sleep 10
perl -e '$a="x"x200_000_000;sleep 1000' & sleep 10
perl -e '$a="x"x200_000_000;sleep 1000' & sleep 10
perl -e '$a="x"x200_000_000;sleep 1000' & sleep 10
perl -e '$a="x"x200_000_000;sleep 1000' & sleep 10

It does what I expect: it starts 5 processes that each eat 400 MB (2*200 MB each).

However, if I remove the sleep 10:

perl -e '$a="x"x200_000_000;sleep 1000' &
perl -e '$a="x"x200_000_000;sleep 1000' &
perl -e '$a="x"x200_000_000;sleep 1000' &
perl -e '$a="x"x200_000_000;sleep 1000' &
perl -e '$a="x"x200_000_000;sleep 1000' &

FreeBSD kills 2 of the jobs.

Why?

And how can I avoid it?

If I run more than 5 of:

perl -e '$a="x"x200_000_000;sleep 1000' & sleep 10

Then I would expect the newest one to be killed with Out Of Memory. But it seems FreeBSD chooses a perl job at random, which is then killed.

Here is a screenshot from top:

last pid:  3529;  load averages:  1.45,  0.62,  0.34                                          up 0+00:17:14  21:49:30
29 processes:  5 running, 24 sleeping
CPU:  0.6% user,  0.0% nice, 34.5% system,  6.6% interrupt, 58.3% idle
Mem: 150M Active, 1080K Inact, 41M Laundry, 108M Wired, 39M Buf, 2256K Free
Swap: 2423M Total, 873M Used, 1550M Free, 36% Inuse, 33M In, 29M Out

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 3521 vagrant       1  29    0   400M    53M RUN      1   0:03  25.98% perl
 3525 vagrant       1  24    0   400M    52M RUN      1   0:03   4.11% perl
 3527 vagrant       1  25    0   400M    53M RUN      0   0:02   4.09% perl
 3523 vagrant       1  24    0   400M    52M RUN      1   0:03   3.53% perl
  807 root          7  52    0    20M  1672K sigwai   1   0:00   0.07% VBoxService


1 Answer


TL;DR

Why?

There are no guarantees about who survives when you run out of resources. Large memory use paints a target on your back.

And how can I avoid it?

Be careful when you allocate memory. Do not allocate more than you need. If you are in a rush to allocate a lot of memory, then monitor "Free" memory to ensure that you do not trigger the OOM killer.

See FreeBSD Wiki on Memory and How does FreeBSD allocate memory?
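If you just want a rough idea of what "Free" looks like before you start another memory hog, a minimal shell sketch (assuming nothing beyond the stock FreeBSD sysctl counters) could be:

# Rough free-memory check before starting another large allocation (sketch).
# vm.stats.vm.v_free_count is the number of free pages; hw.pagesize is the page size in bytes.
free_pages=$(sysctl -n vm.stats.vm.v_free_count)
page_size=$(sysctl -n hw.pagesize)
echo "Free: $(( free_pages * page_size / 1024 / 1024 )) MB"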

Beware: here be dragons

This is only a guess. You do not touch upon it, so I will mention what you might find obvious:

Your system is resource constrained: only 350 MB of RAM, but just about enough swap for what you are doing. With even a very small sleep, the pageout daemon has time to run and swap out some or all of the dirty pages. Without any delay you have simply started a race.

The active and inactive queues contain both clean and dirty pages. When reclaiming memory to alleviate a shortage, the page daemon will free clean pages from the head of the inactive queue. Dirty pages must first be cleaned by paging them out to swap or a file system. This is a lot of work, so the page daemon moves them to the laundry queue for deferred processing.

-- From the excellent article Exploring Swap on FreeBSD

It is worth noting that on NUMA systems you might have several page daemons and laundry threads.

You expect them to be killed in order, but the system gives no such guarantee. This is an edge case and the system goes completely mafia with "a man's gotta do what a man's gotta do". Behind the curtains there will be some sort of "order" to it:

To find a process to kill, the kernel estimates the memory usage of each runnable process and selects the one with the largest usage.

But your processes are identical, so that is no help. Then we get into the minutiae of whether your processes run in exactly the same order, or whether some other system process makes a small blip and its pages end up between the pages of your processes in the laundry queue. So when the system is OOM it tries to survive, and it gives you no deterministic guarantees.

You could argue that you expected process priority (nice level) to be honoured or that you could set a specific memory priority. No such thing exists to my knowledge. You can protect your process from OOM but that should be reserved for really really really important and well tested system services.

Looking closer

When things go bad, they are logged to the console log. If you are debugging more complicated cases than this one, that can be helpful.

You would primarily be looking for something like this:

pid . . ., was killed: failed to reclaim memory
pid . . ., was killed: a thread waited too long to allocate a page
pid . . ., was killed: out of swap space

kernel: swap_pager

You might want to look at several log files in /var/log but you will get the most interesting with:

dmesg | grep -e killed -e swap_pager

Tuning

If you want to tune OOM behaviour the following sysctls would be of interest:

vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
vm.panic_on_oom: panic on out of memory instead of killing the largest process
vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler
vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
vm.v_pageout_free_min: Min pages reserved for kernel
vm.pageout_lock_miss: vget() lock misses during pageout
vm.disable_swapspace_pageouts: Disallow swapout of dirty pages
vm.pageout_update_period: Maximum active LRU update period

For instance, vm.pageout_oom_seq is 12 by default. The lower you set this value, the more sensitive OOM is to a lack of pagedaemon progress.
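As a sketch only (120 is an arbitrary example value, not a recommendation), raising it at runtime and keeping it across reboots would look roughly like:

# Let the pagedaemon make more back-to-back passes without progress before OOM kicks in (sketch).
sysctl vm.pageout_oom_seq=120
# To persist the setting across reboots, add the same line to /etc/sysctl.conf:
# vm.pageout_oom_seq=120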

What you should not be doing

When you dig into this you will find some interesting flags. Your case is obviously a constructed example, so we do not know exactly what you want to achieve.

If you look at ps(1) you will find the flag P_SWAPPINGOUT. Depending on what you are doing you might keep an eye on that to see if your own process is actually busy swapping. In the real world you would be better off checking the free memory before allocating (I think).

Looking at those flags you might be tempted beyond reason when you see P_PROTECTED. Elsewhere this is also known as madvise(MADV_PROTECT). These are processes which are protected when we are OOM.

Do NOT use that.

It is easy to set this for a process using protect(1).

Do NOT use that.

For rc services this can be done by setting the ${name}_oomprotect rc.conf variable to “YES”.

Do NOT use that.
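For completeness, and only so you recognise them if you stumble over them, the two mechanisms above look roughly like this (the pid and the service name are made-up examples):

# Protect an already running process from the OOM killer with protect(1) -- again: do NOT do this lightly.
protect -p 1234
# Or, for an rc(8) service, in /etc/rc.conf (hypothetical service name):
# postgresql_oomprotect="YES"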

You will be much better off having reasonable system settings and "enough" swap space to let the system grind slowly to a halt.

When you have written a process which allocates memory in a reasonable fashion, then you might consider it. And I would only do it if that process is really important to me and I know that I have other misbehaving processes which either overcommit on memory allocation or are leaky.

There is a nice article Preventing FreeBSD to kill PostgreSQL (aka OOM Killer prevention) but I would not do that myself. You are in dangerous territory when you try to outsmart the system. OOM is survival mode and you should not write anything which depends on the world blowing up in an orderly fashion.