Why did atop show that I was swapping out over 20,000 pages - over 80 megabytes - when I had gigabytes of free memory?

I have not noticed a performance problem with this. I simply wish to take the opportunity to increase my knowledge :-).

atop refreshes every ten seconds. Each refresh shows the activity since the last refresh.

MEM | tot     7.7G | free    3.7G | cache 608.3M | buff   19.1M | slab  264.6M |
SWP | tot     2.0G | free    1.4G |              | vmcom  13.4G | vmlim   5.8G |
PAG | scan  167264 | steal 109112 | stall      0 | swin       0 | swout  23834 |

                                "swout" is non-zero and coloured in red  ^

Kernel meminfo:

$ head -n 5 /proc/meminfo
MemTotal:        8042664 kB
MemFree:         3563088 kB
MemAvailable:    3922092 kB
Buffers:           20484 kB
Cached:           775308 kB

Kernel version:

$ uname -r
5.0.16-200.fc29.x86_64

  1. It is not clear that this would be affected by vm.swappiness. That setting balances cache reclaim versus swapping. However, there is plenty of free memory, so why would we need to reclaim any memory in the first place?

  2. As you can see, this is a small system. It does not use NUMA. I checked in /proc/zoneinfo and I only have one node, "Node 0". So this is not caused by NUMA.

  3. Related questions and answers mention the idea of "opportunistic swapping", "when the system has nothing better to do", "which may provide a benefit if there's a memory shortage later", etc. I do not find these ideas credible, because they contradict the kernel documentation. See Does Linux perform "opportunistic swapping", or is it a myth?

  4. There are no limits set on RAM usage using the systemd.resource-control features. I.e. I think all systemd units have their RAM usage limits set to "infinity".

    $ systemctl show '*' | \
        grep -E '(Memory|Swap).*(Max|Limit|High)' | \
        grep -v infinity
    $
    
  5. Edit: I suspect this is related to transparent huge pages. I note that virtual machines use transparent huge pages to allocate the guest memory efficiently. They are the only user programs that use huge pages on my system. (See the check sketched just after this list.)

    There is a similar-looking question: Can kswapd be active if free memory well exceeds pages_high watermark? It is asking about RHEL 6, which enables transparent huge pages for all applications by default.
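
One rough way to check the huge pages hypothesis is sketched below. This is only a sketch: the qemu process name is a guess for this system, and the THP counters vary between kernel versions.

$ cat /sys/kernel/mm/transparent_hugepage/enabled
$ grep AnonHugePages /proc/meminfo     # system-wide THP usage
$ # THP usage of the (assumed single) qemu process:
$ grep AnonHugePages /proc/"$(pgrep -f qemu-system | head -n 1)"/smaps | awk '{sum += $2} END {print sum, "kB"}'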

I am not sure exactly how to reproduce this result.

It happened when starting a VM. I use libvirt to run VMs. By default, VM disk reads are cached using the host page cache (cache mode "Hypervisor default", which means "writeback").
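
For reference, the cache mode can be checked in the libvirt domain XML. A sketch only: "debian9" is the domain named in the cgroup path further down, and the cache attribute only appears if it was set explicitly.

$ virsh dumpxml debian9 | grep -A 2 '<driver'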

I tried stopping the VM, using fadvise(POSIX_FADV_DONTNEED) on the image file, and starting it again. But the same swapping did not happen that time.
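
In case someone wants to reproduce this: one way to drop an image file from the page cache is GNU dd's nocache flag, which issues posix_fadvise(POSIX_FADV_DONTNEED). The image path below is an assumption.

$ dd if=/var/lib/libvirt/images/debian9.qcow2 iflag=nocache count=0 status=none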

Then I tried again with a different VM, and it happened briefly. I captured vmstat. I think atop showed a different, higher figure for "swout", but I did not capture it.

$ vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 770168 5034300  28024 936256    0    2    21    86   60  101 22  9 68  1  0
 0  0 770168 5033852  28048 935904    0    0     0     8  372  633  1  1 97  0  0
 1  0 770168 4974048  28252 948152    3    0  1137   194 1859 2559 11  7 79  3  0
 0  1 770168 4557968  28308 1037512    0    0  8974    45 3595 6218 16  6 57 21  0
 6  3 770168 4317800  28324 1111408    0    0  7200   609 6820 6793 12  5 38 44  0
 0  4 770168 4125100  28348 1182624    0    0  6900   269 5054 3674 74  3  8 15  0
 0  5 770168 3898200  27840 1259768    2    0  9421   630 4694 5465 11  6 11 71  0
 1  3 770168 3760316  27672 1300540    0    0  9294   897 3308 4135  5  4 28 63  0
 0  1 770168 3531332  27032 1356236    0    0 10532   155 3140 4949  8  5 63 25  0
 0  0 783772 3381556  27032 1320296    0 1390  7320  4210 4450 5112 17  5 43 35  0
 0  0 783772 3446284  27056 1335116    0    0   239   441  995 1782  4  2 92  2  0
 0  0 783772 3459688  27076 1335372    0    0     3   410  728 1037  2  2 95  1  0

I also checked for a cgroup memory limit on the VM, on the off-chance that libvirt had bypassed systemd, and inflicted swapping on itself by mistake:

$ cd /sys/fs/cgroup/memory/machine.slice/machine-qemu\x2d5\x2ddebian9.scope
$ find -type d  # there were no sub-directories here
$ grep -H . *limit_in_bytes
memory.kmem.limit_in_bytes:9223372036854771712
memory.kmem.tcp.limit_in_bytes:9223372036854771712
memory.limit_in_bytes:9223372036854771712
memory.memsw.limit_in_bytes:9223372036854771712
memory.soft_limit_in_bytes:9223372036854771712
$ cd ../..
$ find -name "*limit_in_bytes" -exec grep -H -v 9223372036854771712 \{\} \;
$
sourcejedi
    Is there any reason why you are concerned about what appears to be normal behavior? The linux kernel memory manager normally swaps pages to disk when they've not been used in a while, to make room for disk caches and buffers. You can see that shortly after there were lots of blocks in and blocks out, the kernel paged to disk in order to take advantage of real memory for disk cache. – jsbillings Aug 02 '19 at 14:12
  • It's certainly worth better understanding Linux and the virtual memory manager! A lot of people get confused about that. For a really simplified explanation, check out Linux Ate My Ram!. – jsbillings Aug 02 '19 at 14:27
  • For a more in-depth explanation, please check out the documentation on the linux-mm.org site – jsbillings Aug 02 '19 at 14:27
  • But in general, unused RAM is wasted RAM, so the kernel tries its best to put it to good use. That means sending unused memory pages to swap and filling up disk cache and I/O buffers to speed up general usage. – jsbillings Aug 02 '19 at 14:29
  • @jsbillings Quoting from your linux-mm.org link. "When the kernel decides not to get memory from any of the other sources we've described so far, it starts to swap." My question is when does it decide not to get memory from the gigabytes of memory shown as "free"? In fact the linux-mm.org link is only about "Now that you know all of these wonderful uses for your free RAM, you have to be wondering: what happens when there is no more free RAM? If I have no memory free, and I need a page for the page cache, inode cache, or dentry cache, where do I get it?" – sourcejedi Aug 02 '19 at 14:32
  • The linux VMM doesn't wait until the system has run out of any other free RAM to start sending unused memory pages to swap, but does that automatically based on the 'vm.swappiness' setting. If you set vm.swappiness to zero, it will never do what you're complaining about. – jsbillings Aug 02 '19 at 14:54
  • mem_cgroup_swappiness() is called to get the per-cgroup swappiness setting (you can define it per-cgroup instead of system-wide). get_scan_count() is what applies the swappiness setting to the LRU. – jsbillings Aug 02 '19 at 15:23
  • @jsbillings it's used to get the applicable swappiness setting. The same function is used even when cgroups are compiled out. So I've been looking at the case where cgroups are compiled out, to help narrow it down. – sourcejedi Aug 02 '19 at 15:28
  • @jsbillings I've edited the question to include information from a related question, my best current guess, and dig deeper into the myth. I don't understand why you think https://linuxatemyram.com would be related to this question. The entire point of that site is "Linux is borrowing unused memory for disk caching". The confusion is common because the free command can show very low "free" memory, when there is a lot of "cache", which is usually "available" to be reclaimed as soon as it is needed. I showed as clearly as I can that I had gigabytes of entirely "free" memory. – sourcejedi Aug 02 '19 at 16:17
  • It might have been nice if you had mentioned you were launching VMs during this. I agree with the answer you cited, about zone rebalancing. – jsbillings Aug 02 '19 at 20:20
  • @jsbillings this question has always explained that I observed this when launching a VM. The related questions & the specific added point 5. explain my hypothesis why launching a VM is a special case on my system, and also why on RHEL this type of swapping is not a special case and could happen with any application, so long as it allocates 2MiB+ blocks of virtual memory (that are not file cache). – sourcejedi Aug 04 '19 at 10:28

1 Answer


I was pondering over a similar question -- you saw my thread about kswapd and zone watermarks -- and the answer in my case (and probably in yours as well) is memory fragmentation.

When memory is fragmented enough, higher-order allocations will fail, and this (depending on a number of additional factors) will either lead to direct reclaim, or will wake kswapd, which will attempt zone reclaim/compaction. You can find some additional details in my thread.

Another thing that may escape attention when dealing with such problems is memory zoning. I.e. you may have enough memory overall (and it might even contain enough contiguous chunks), but the allocation may be restricted to the DMA32 zone (if you're on a 64-bit architecture). Some people tend to ignore DMA32 as "small" (probably because they are used to 32-bit thinking), but 4GB is not really "small".
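
For example, the distribution of free pages across zones can be seen in /proc/zoneinfo (a sketch; the exact field names differ slightly between kernel versions):

$ grep -E '^Node|nr_free_pages' /proc/zoneinfo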

You have two ways of finding out for sure what's going on in your case. One is analyzing stats -- you can set up jobs to take periodic snapshots of /proc/buddyinfo, /proc/zoneinfo, /proc/vmstat etc., and try to make sense out of what you're seeing.
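
For example, a minimal snapshot loop along those lines (a sketch; the interval and output directory are arbitrary):

$ mkdir -p /tmp/memstats
$ while sleep 10; do
      t=$(date +%s)
      cp /proc/buddyinfo "/tmp/memstats/buddyinfo.$t"
      cp /proc/zoneinfo  "/tmp/memstats/zoneinfo.$t"
      cp /proc/vmstat    "/tmp/memstats/vmstat.$t"
  done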

The other way is more direct and reliable if you get it to work: you need to capture the codepaths that lead to swapout events, and you can do it using tracepoints the kernel is instrumented with (in particular, there are numerous vmscan events).

But getting it to work may be challenging, as low-level instrumentation doesn't always work the way it's supposed to out of the box. In my case we had to spend some time setting up the ftrace infrastructure, only to find out in the end that the function_graph probe we needed wasn't working for some reason. The next tool we tried was perf, and it too wasn't successful on the first attempt. But once you eventually manage to capture the events of interest, they are likely to lead you to the answer much faster than any global counters.
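
For example, a sketch of capturing the vmscan tracepoints system-wide with perf while reproducing the problem (the exact set of events available depends on the kernel build):

$ sudo perf list 'vmscan:*'          # list the available vmscan tracepoints
$ sudo perf record -e 'vmscan:*' -a -- sleep 60
$ sudo perf script | less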


Nikolai
  • I think this is probably what triggered it. Although as you said in your answer, it seems counter-intuitive that we start dropping/swapping pages in order to make a large contiguous allocation. "Most user-space pages - which are accessed through user virtual addresses - can be moved; all that is needed is to tweak the relevant page table entries accordingly." Memory compaction, LWN.net, 2010. – sourcejedi Aug 06 '19 at 16:33
    Yes, it does sound counterintuitive. However, what I found in our case was that our server was attempting a huge number of compactions (a few each second over a period of several weeks), and only about 5% of them were successful. You can check your /proc/vmstat to see if anything similar might be happening in your case. Not sure why this is happening, but after all, all it takes to prevent a contiguous chunk from being compacted is one unmovable page somewhere in the middle... – Nikolai Aug 07 '19 at 02:17
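
For reference, the compaction counters mentioned above can be read directly from /proc/vmstat (a sketch; the exact counter names vary by kernel version):

$ grep -E '^compact_' /proc/vmstat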