
Following on from topics raised in other questions, I believe that when the Linux Kernel is trying to free up physical RAM it makes some kind of decision between discarding pages from its disk cache or flushing other pages into swap. I'm not certain of this, perhaps I've completely misunderstood the mechanism.

In any case there is synergy between:

  • the kernel keeping its own internal disk cache
  • programs memory mapping files
  • "regular" application memory being placed in a swap file

All three may be used to free up physical RAM, all three rely on either writing "dirty" pages to disk, or simply knowing the pages are already cleanly duplicated on-disk and discarding them from physical RAM allocation.


With swap space, individual swap partitions and files can be given a priority so that faster devices are used before slower devices. I'm not aware of any such configurations for other, non-swap devices.
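For example, swap priorities can be assigned in /etc/fstab (device paths and values here are made up); the area with the higher pri= is filled first:

```
# /etc/fstab — the kernel uses the highest-priority swap area first
/dev/nvme0n1p2  none  swap  sw,pri=10  0  0
/dev/sda2       none  swap  sw,pri=5   0  0
```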

My question is: How does the Kernel decide which RAM to free up?

  • Is it purely based on last access?
  • Does the Kernel use all three above equally?
    • is there any prioritisation between them (and individual disks)?
    • is it configurable?

This question is driven by experiences similar to this one, but I want it to focus on the underlying mechanism, NOT on troubleshooting a particular problem.

  • All your suppositions are correct. Handling these things is very much a balancing act. The exact algorithm varies by kernel version, and has tweaks available in /proc to change the weights. There is some prioritization, and it is somewhat based on last access. – user10489 Nov 24 '21 at 02:09
  • And to make things more complicated, even to the point where nobody understands them, there are file systems which circumvent the kernel mechanisms or significantly add to them. An example are the ZFS read cache (ARC) and write cache. A starting point: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-vdev-async-write-active-max-dirty-percent – Binarus Dec 10 '21 at 07:50

1 Answer


The name given to the overall task of freeing physical pages of memory is reclaim, and it covers a number of tasks. Reclaim is mostly driven by page allocations, with various levels of urgency. On an unloaded system, page allocations can be satisfied without effort and don’t trigger any reclaim. On moderately loaded systems, page allocations can still be satisfied immediately, but they will also cause kswapd to be woken up to perform background reclaim. On loaded systems where page allocations can no longer be satisfied immediately, reclaim is performed synchronously.
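On a running system you can see the per-zone watermarks that drive these decisions: kswapd is woken when free memory in a zone drops below the "low" watermark and goes back to sleep above "high", while allocations fall into synchronous (direct) reclaim when free memory can't be kept above "min":

```shell
# Print each zone's min/low/high watermarks (values are in pages).
awk '$1 ~ /^(min|low|high)$/ || /^Node/' /proc/zoneinfo
```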

Reclaimable pages are pages which store content that can be found, or be made available, elsewhere. This is where the typical balancing act comes into play: memory whose contents are also in files (or are supposed to end up in files), v. memory whose contents are not (and need to be swapped out). The former is stored in the page cache, the latter isn't, which is why balancing explanations usually talk about page cache v. swap.

The decision to favour one over the other is determined in a single place in the kernel, get_scan_count, controlled by settings in struct scan_control. This function’s purpose is described as follows:

Determine how aggressively the anon and file LRU lists should be scanned. The relative value of each set of LRU lists is determined by looking at the fraction of the pages scanned we did rotate back onto the active list instead of evict.

Perhaps surprisingly for a function named get_..., this doesn’t use a return value; instead, it fills in the array pointed at by the unsigned long *nr pointer, with four entries corresponding to anonymous inactive pages (un-backed, not-recently-used pages), anonymous active pages (un-backed, recently-used pages), file inactive pages (not-recently-used pages in the page cache), and file active pages (recently-used pages in the page cache).
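In outline, the calling convention looks like this (a minimal userspace sketch; the enum ordering mirrors the kernel's enum lru_list, minus the unevictable list, but the page counts are made up):

```c
/* Mirrors the ordering of the kernel's enum lru_list
 * (the unevictable list is omitted here). */
enum lru_list {
    LRU_INACTIVE_ANON,  /* un-backed, not recently used */
    LRU_ACTIVE_ANON,    /* un-backed, recently used */
    LRU_INACTIVE_FILE,  /* page cache, not recently used */
    LRU_ACTIVE_FILE,    /* page cache, recently used */
    NR_LRU_LISTS
};

/* Hypothetical stand-in for get_scan_count(): instead of returning a
 * value, it fills nr[] with how many pages of each LRU should be
 * scanned. The figures below are invented for illustration. */
static void get_scan_count_sketch(unsigned long nr[NR_LRU_LISTS])
{
    nr[LRU_INACTIVE_ANON] = 128;
    nr[LRU_ACTIVE_ANON]   = 0;
    nr[LRU_INACTIVE_FILE] = 4096;
    nr[LRU_ACTIVE_FILE]   = 1024;
}
```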

get_scan_count starts by retrieving the appropriate “swappiness” value, from mem_cgroup_swappiness. If the current memory cgroup is an enabled, non-root v1 cgroup, its swappiness setting is used; otherwise, it’s the infamous /proc/sys/vm/swappiness. Both settings share the same purpose; they tell the kernel

the relative IO cost of bringing back a swapped out anonymous page vs reloading a filesystem page

Before it actually uses this value though, get_scan_count determines the overall strategy it should apply:

  • if there’s no swap, or anonymous pages can’t be reclaimed in the current context, it will go after file-backed pages only;
  • if the memory cgroup disables swapping entirely, it will go after file-backed pages only;
  • if swappiness isn't set to 0 and the system is nearly out of memory, it will go after all pages equally;
  • if the system is nearly out of file pages, it will go after anonymous pages only;
  • if there is enough inactive page cache, it will go after file-backed pages only;
  • in all other cases, it adjusts the “weight” given to the various LRUs depending on the corresponding I/O cost.
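The strategy selection above can be condensed into a sketch like this (the enum names mirror mm/vmscan.c's enum scan_balance; the boolean parameters are simplifications of what the kernel actually computes from struct scan_control and the LRU sizes):

```c
#include <stdbool.h>

/* Mirrors the kernel's enum scan_balance in mm/vmscan.c. */
enum scan_balance {
    SCAN_EQUAL,  /* scan anon and file LRUs equally */
    SCAN_FRACT,  /* weight the LRUs by relative I/O cost */
    SCAN_ANON,   /* scan anonymous pages only */
    SCAN_FILE,   /* scan file-backed pages only */
};

/* Hypothetical condensed version of the decision tree; the checks are
 * evaluated in the same order as in the answer's bullet list. */
static enum scan_balance choose_strategy(bool have_swap, bool may_swap,
                                         int swappiness, bool near_oom,
                                         bool file_pages_low,
                                         bool enough_inactive_file)
{
    if (!have_swap || !may_swap)
        return SCAN_FILE;      /* nothing can be swapped out */
    if (swappiness && near_oom)
        return SCAN_EQUAL;     /* desperate: scan everything */
    if (file_pages_low)
        return SCAN_ANON;      /* protect what little page cache is left */
    if (enough_inactive_file)
        return SCAN_FILE;      /* plenty of easy file pages to drop */
    return SCAN_FRACT;         /* weigh the LRUs by swappiness/I/O cost */
}
```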

Once it has determined the strategy, it iterates over all the evictable LRUs (inactive anonymous, active anonymous, inactive file-backed, active file-backed, in that order) to determine how many pages from each should be scanned; I’ll ignore v1 cgroups:

  • if the strategy is “go after all pages equally”, all pages in all LRUs are liable to be scanned, up to the size determined by scan_control’s priority shift factor;
  • if the strategy is “go after file-backed pages only” or “go after anonymous pages only”, all pages in the corresponding LRUs are candidates (again, shifted by priority), none in the other LRUs;
  • otherwise, the values are adjusted according to swappiness.
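The priority shift and the swappiness adjustment can be illustrated with a hypothetical sketch; the swappiness : 200 − swappiness split shown here was the historical behaviour, and current kernels derive the weights from measured I/O cost instead:

```c
#define DEF_PRIORITY 12  /* the kernel's starting reclaim priority */

/* Hypothetical sketch: how many pages of one LRU to scan in one pass.
 * priority starts at DEF_PRIORITY and drops towards 0 as reclaim gets
 * more desperate, so ever larger slices of each list are scanned. */
static unsigned long scan_size(unsigned long lru_size, int priority,
                               int anon, int swappiness)
{
    unsigned long scan = lru_size >> priority;
    unsigned int frac = anon ? swappiness : 200 - swappiness;

    return scan * frac / 200;
}
```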

The actual page scanning is driven by shrink_lruvec, which uses the scan lengths determined above, and repeatedly shrinks the LRUs until the targets have been reached (adjusting the targets as it goes in various ways). Once this is done, the active/inactive LRUs are rebalanced.
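The driving loop can be sketched as follows (a toy model: the stub pretends every scanned page is reclaimable, which real reclaim does not guarantee, and the target adjustments are omitted; SWAP_CLUSTER_MAX really is 32 in the kernel):

```c
#define SWAP_CLUSTER_MAX 32  /* the kernel's per-iteration batch size */
#define NR_LRUS 4

/* Stand-in for shrink_inactive_list()/shrink_active_list(): pretend
 * every scanned page gets reclaimed. */
static unsigned long shrink_list_sketch(int lru, unsigned long nr_to_scan)
{
    (void)lru;
    return nr_to_scan;
}

/* Hypothetical sketch of shrink_lruvec's loop: scan each LRU in
 * SWAP_CLUSTER_MAX batches until every target from get_scan_count
 * has been met. */
static unsigned long shrink_lruvec_sketch(unsigned long nr[NR_LRUS])
{
    unsigned long reclaimed = 0;
    int work_left = 1;

    while (work_left) {
        work_left = 0;
        for (int lru = 0; lru < NR_LRUS; lru++) {
            if (!nr[lru])
                continue;
            unsigned long batch = nr[lru] < SWAP_CLUSTER_MAX
                                      ? nr[lru] : SWAP_CLUSTER_MAX;
            nr[lru] -= batch;
            reclaimed += shrink_list_sketch(lru, batch);
            work_left = 1;
        }
    }
    return reclaimed;
}
```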

Getting back to your questions:

  • the page cache and memory-mapped files are treated equally;
  • page reclaim isn’t based purely on last access (I haven’t explained how the LRUs are used, or how rebalancing works; read the corresponding chapter in Mel Gorman’s Understanding the Linux Virtual Memory Manager for details);
  • the kernel doesn’t use them equally; they are prioritised differently based on circumstances, and can be configured via a number of controls (swappiness, cgroups, low watermark thresholds...).

Swap priority only determines where a page goes once it’s been decided to swap it out. (Incidentally, the swappiness documentation and the explanations above should make it clear that there isn’t enough granularity in I/O cost to handle mixed ZRAM/disk swap setups nicely...)

There is a lot more to explain, including how scan_control is set up, but I suspect this is already too long! If you want to track reclaim cost, you can see it in task delay accounting (see also struct taskstats).

Stephen Kitt
  • This is a fantastic primer on the subject. I'll spend some time digesting your recommended further reading. I particularly appreciate the more detailed explanation of how swappiness fits into this picture. Thanks for spending your valuable time putting it together! – Philip Couling Dec 10 '21 at 15:38