The name given to the overall task of freeing physical pages of memory is reclaim, and it covers a number of tasks. Reclaim is mostly driven by page allocations, with various levels of urgency. On an unloaded system, page allocations can be satisfied without effort and don’t trigger any reclaim. On moderately loaded systems, page allocations can still be satisfied immediately, but they will also cause kswapd to be woken up to perform background reclaim. On loaded systems where page allocations can no longer be satisfied immediately, reclaim is performed synchronously.
Reclaimable pages are pages whose content can be found or made available elsewhere. This is where the typical balancing act comes into play: memory whose contents are also in files (or are supposed to end up in files), versus memory whose contents are not (and need to be swapped out). The former are stored in the page cache, the latter not, which is why balancing explanations usually talk about page cache versus swap.
The decision to favour one over the other is determined in a single place in the kernel, get_scan_count, controlled by settings in struct scan_control. This function’s purpose is described as follows:
Determine how aggressively the anon and file LRU lists should be
scanned. The relative value of each set of LRU lists is determined
by looking at the fraction of the pages scanned we did rotate back
onto the active list instead of evict.
Perhaps surprisingly for a function named get_..., this doesn’t use a return value; instead, it fills in the array pointed at by the unsigned long *nr pointer, with four entries corresponding to anonymous inactive pages (un-backed, not-recently-used pages), anonymous active pages (un-backed, recently-used pages), file inactive pages (not-recently-used pages in the page cache), and file active pages (recently-used pages in the page cache).
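To make the layout of that array concrete, here is a small Python model of it. It is only an illustration: the class and helper names are mine, and the index order simply follows the order in which the four lists are described above.

```python
from enum import IntEnum


class Lru(IntEnum):
    """Illustrative indices for the four evictable LRU lists."""
    INACTIVE_ANON = 0   # un-backed, not recently used
    ACTIVE_ANON = 1     # un-backed, recently used
    INACTIVE_FILE = 2   # page cache, not recently used
    ACTIVE_FILE = 3     # page cache, recently used


def is_file_lru(lru):
    """True for the two page-cache lists, False for the anonymous ones."""
    return lru in (Lru.INACTIVE_FILE, Lru.ACTIVE_FILE)


def describe(nr):
    """Turn an nr-style array into a name -> scan count mapping."""
    return {lru.name: nr[lru] for lru in Lru}


# The caller provides the array; the function under discussion fills it in.
nr = [0] * len(Lru)
nr[Lru.INACTIVE_FILE] = 128
print(describe(nr))
```

The point of the sketch is only that the "result" lives in a caller-owned array with one slot per list, rather than in a return value.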
get_scan_count starts by retrieving the appropriate “swappiness” value, from mem_cgroup_swappiness. If the current memory cgroup is an enabled, non-root v1 cgroup, its swappiness setting is used; otherwise, it’s the infamous /proc/sys/vm/swappiness. Both settings share the same purpose; they tell the kernel
the relative IO cost of bringing back a swapped out anonymous page vs reloading a filesystem page
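The selection rule described above can be modelled in a few lines of Python. This is a sketch of the rule as stated here, not the kernel code; the function name and its parameters are hypothetical.

```python
def effective_swappiness(global_swappiness, cgroup_swappiness=None, *,
                         cgroup_v1=False, memcg_enabled=True, is_root=False):
    """Pick the swappiness value that applies to a reclaim pass.

    Hypothetical model: the per-cgroup value is used only for an
    enabled, non-root v1 memory cgroup; in every other case the
    global /proc/sys/vm/swappiness value applies.
    """
    uses_cgroup_value = (
        cgroup_v1
        and memcg_enabled
        and not is_root
        and cgroup_swappiness is not None
    )
    return cgroup_swappiness if uses_cgroup_value else global_swappiness


# A v1 child cgroup with its own setting overrides the global value.
print(effective_swappiness(60, 10, cgroup_v1=True))
# The root cgroup falls back to the global value.
print(effective_swappiness(60, 10, cgroup_v1=True, is_root=True))
```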
Before it actually uses this value though, get_scan_count determines the overall strategy it should apply:
- if there’s no swap, or anonymous pages can’t be reclaimed in the current context, it will go after file-backed pages only;
- if the memory cgroup disables swapping entirely, it will go after file-backed pages only;
- if swappiness isn’t disabled (i.e. isn’t set to 0), and the system is nearly out of memory, it will go after all pages equally;
- if the system is nearly out of file pages, it will go after anonymous pages only;
- if there is enough inactive page cache, it will go after file-backed pages only;
- in all other cases, it adjusts the “weight” given to the various LRUs depending on the corresponding I/O cost.
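The order of these checks matters, since the first one that matches wins. Here is a minimal Python sketch of that decision chain; the boolean inputs are hypothetical stand-ins for the conditions the kernel actually computes, and the enum is my own naming for the four outcomes.

```python
from enum import Enum


class ScanBalance(Enum):
    EQUAL = "all pages equally"
    FRACT = "weighted by I/O cost"
    ANON = "anonymous pages only"
    FILE = "file-backed pages only"


def choose_strategy(*, can_reclaim_anon, cgroup_swap_disabled, swappiness,
                    nearly_oom, nearly_out_of_file, enough_inactive_file):
    """Return the first strategy whose condition matches, in list order."""
    if not can_reclaim_anon:            # no swap, or anon not reclaimable here
        return ScanBalance.FILE
    if cgroup_swap_disabled:            # memory cgroup disables swapping
        return ScanBalance.FILE
    if swappiness != 0 and nearly_oom:  # last resort: scan everything
        return ScanBalance.EQUAL
    if nearly_out_of_file:              # hardly any file pages left
        return ScanBalance.ANON
    if enough_inactive_file:            # plenty of cold page cache
        return ScanBalance.FILE
    return ScanBalance.FRACT            # otherwise weigh by I/O cost


quiet = dict(can_reclaim_anon=True, cgroup_swap_disabled=False, swappiness=60,
             nearly_oom=False, nearly_out_of_file=False,
             enough_inactive_file=False)
print(choose_strategy(**quiet))
print(choose_strategy(**{**quiet, "enough_inactive_file": True}))
```

Note that with swappiness set to 0 the "all pages equally" branch is skipped even when the system is nearly out of memory, so the later checks decide.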
Once it has determined the strategy, it iterates over all the evictable LRUs (inactive anonymous, active anonymous, inactive file-backed, active file-backed, in that order) to determine how many pages from each should be scanned; I’ll ignore v1 cgroups:
- if the strategy is “go after all pages equally”, all pages in all LRUs are liable to be scanned, up to the size determined by scan_control’s priority shift factor;
- if the strategy is “go after file-backed pages only” or “go after anonymous pages only”, all pages in the corresponding LRUs are candidates (again, shifted by priority), none in the other LRUs;
- otherwise, the values are adjusted according to swappiness.
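As a rough illustration of how the three cases turn into numbers, here is a simplified Python model. It makes two assumptions that go beyond the text above: swappiness ranges from 0 to 200 (as in recent kernels), and the weights are taken directly from swappiness, ignoring the measured per-list reclaim cost that the kernel also factors in. The function name and the string labels are mine.

```python
def scan_targets(lru_sizes, priority, strategy, swappiness=60):
    """Compute how many pages of each LRU to scan (simplified model).

    lru_sizes: mapping such as {"inactive_anon": 4096, ...}
    priority:  right-shift applied to each list size
    strategy:  "equal", "file", "anon" or "fract"
    """
    # Assumption: anon gets `swappiness` parts out of 200, file the rest.
    weights = {"anon": swappiness, "file": 200 - swappiness}
    denominator = weights["anon"] + weights["file"]

    targets = {}
    for name, size in lru_sizes.items():
        kind = "file" if name.endswith("file") else "anon"
        scan = size >> priority              # scale by reclaim priority
        if strategy in ("file", "anon"):
            if kind != strategy:
                scan = 0                     # leave the other lists alone
        elif strategy == "fract":
            scan = scan * weights[kind] // denominator
        # "equal": keep the shifted size as-is
        targets[name] = scan
    return targets


sizes = {"inactive_anon": 4096, "active_anon": 8192,
         "inactive_file": 16384, "active_file": 4096}
print(scan_targets(sizes, priority=2, strategy="fract", swappiness=100))
```

In this model a lower priority value means a smaller shift, so a larger share of each list is scanned as reclaim becomes more urgent.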
The actual page scanning is driven by shrink_lruvec, which uses the scan lengths determined above, and repeatedly shrinks the LRUs until the targets have been reached (adjusting the targets as it goes in various ways). Once this is done, the active/inactive LRUs are rebalanced.
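The "repeatedly shrinks until the targets are reached" part can be pictured with the following toy loop. It is a deliberately reduced sketch: the batch size, the function name and the shrink_batch callback are all hypothetical, and it leaves out both the target adjustments and the final active/inactive rebalancing.

```python
def shrink_lists(targets, shrink_batch, batch_size=32):
    """Scan each list in small batches until every target is used up.

    targets:      mapping of list name -> pages still to scan
    shrink_batch: hypothetical callback (name, nr_to_scan) -> pages reclaimed
    batch_size:   assumed per-round batch; chosen here for illustration
    """
    remaining = dict(targets)
    reclaimed = 0
    while any(remaining.values()):
        for name in remaining:
            nr_to_scan = min(remaining[name], batch_size)
            if nr_to_scan:
                remaining[name] -= nr_to_scan
                reclaimed += shrink_batch(name, nr_to_scan)
    return reclaimed


calls = []


def fake_shrink(name, nr_to_scan):
    """Pretend that half of the scanned pages could be freed."""
    calls.append((name, nr_to_scan))
    return nr_to_scan // 2


total = shrink_lists({"inactive_anon": 40, "inactive_file": 64}, fake_shrink)
print(total, calls)
```

Working in batches and going round-robin over the lists means no single list is drained completely before the others are touched.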
Getting back to your questions:
- the page cache and memory-mapped files are treated equally;
- page reclaim isn’t based purely on last access (I haven’t explained how the LRUs are used, or how rebalancing works; read the corresponding chapter in Mel Gorman’s Understanding the Linux Virtual Memory Manager for details);
- the kernel doesn’t use them equally; they are prioritised differently based on circumstances, and can be configured via a number of controls (swappiness, cgroups, low watermark thresholds...).
Swap priority only determines where a page goes once it’s been decided to swap it out. (Incidentally, the swappiness documentation and the explanations above should make it clear that there isn’t enough granularity in I/O cost to handle mixed ZRAM/disk swap setups nicely...)
There is a lot more to explain, including how scan_control is set up, but I suspect this is already too long! If you want to track reclaim cost, you can see it in task delay accounting (see also struct taskstats).