9

I am using Ubuntu 20.04 Server edition.

I have 512GB of RAM, with about 130GB typically used by a process running on the machine. I also have two ramdisks (tmpfs), both configured as 500GB.

Since tmpfs disks expand and contract the amount of RAM they use depending on how full they are, the ramdisks typically use fairly little, maybe 200GB in total.

I have about 1000GB of swap on a fast NVMe drive. Occasionally, RAM usage spikes briefly, just for a few seconds.

During these moments, how do I tell Linux that it should ALWAYS give the RAM to the process and use the swap for the ramdisks?

  • See here - You may temporarily increase swappiness, but anything that is swappable is treated the same way. A possible method is then to use a looping script/service that watches the output of free and adapts swappiness according to the result. – FelixJN Jun 14 '21 at 10:38
  • You can create a special cgroup for some processes and give that cgroup a higher swappiness (a rough sketch follows these comments). Managing that and making it automatic is messy -- it's doable manually, but there are no good tools to make that persist. – user10489 Jun 14 '21 at 11:51
  • As in turn it around and say never swap out this process? https://man7.org/linux/man-pages/man2/mlock.2.html – ibuprofen Jun 15 '21 at 03:12
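A rough, hand-run sketch of the cgroup idea above, assuming the cgroup-v1 memory controller is mounted under /sys/fs/cgroup/memory (cgroup v2 has no per-cgroup swappiness knob); the group name and the PID variable are made up for illustration:

    # Put the processes that fill the ramdisks into their own memory cgroup...
    mkdir /sys/fs/cgroup/memory/ramdisk-writers
    # ...and make reclaim from that group prefer swapping its (tmpfs-backed) pages.
    echo 100 > /sys/fs/cgroup/memory/ramdisk-writers/memory.swappiness
    echo "$WRITER_PID" > /sys/fs/cgroup/memory/ramdisk-writers/cgroup.procs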

1 Answer

7

To the best of my knowledge (and FelixJN's link seems to agree), all things in RAM are equally swappable, and the decision about what to swap is based on freshness, not on the type of data. My guess is that this is a good heuristic, even for your use case!

However, you can work around this a bit, at least.

Assuming your 512GB RAM machine is a NUMA machine: you can bind your tmpfs to a specific node's memory (see the mpol option in man tmpfs), which will at least concentrate all of that memory on one node.
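A minimal sketch of such a mount, assuming a two-node machine; the mount point, size and node number are placeholders, and you should check your topology with numactl --hardware first:

    # Keep this tmpfs's pages on NUMA node 1 (node number and paths are examples).
    mount -t tmpfs -o size=500G,mpol=bind:1 tmpfs /mnt/ramdisk1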

Honestly, though: instead of putting 200 GB (!) of data on tmpfs, wouldn't it work to put that directly on the SSD, but use very large file system buffers and very low pressure to flush them to storage? The moment your processes run out of memory, the kernel will start flushing the buffers to SSD and throttling access to the files you formerly put into tmpfs, which implements what you want: your temporary files get stored on SSD when memory pressure is high, and get handled in RAM when there's plenty of free memory.
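The knobs involved are the writeback sysctls; a rough sketch with purely illustrative values (tune them to your workload, they are not recommendations):

    # Let the dirty page cache grow large before background writeback starts (percent of RAM).
    sysctl -w vm.dirty_background_ratio=40
    # Only throttle writers once dirty data exceeds this share of RAM.
    sysctl -w vm.dirty_ratio=60
    # Keep dirty data in RAM longer before it counts as "old" (centiseconds).
    sysctl -w vm.dirty_expire_centisecs=60000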

  • Specifically mmapping the files in question. – chrylis -cautiouslyoptimistic- Jun 14 '21 at 19:04
  • Perhaps pick a disk filesystem where you can disable journalling, so it's not trying to maintain crash-recoverable metadata / consistency. Or if you pick one where the journal can't be disabled, journal onto an actual RAMdisk if it allows an external journal. Or at least barrier=0 or equivalent to not order writes. e.g. JFS is somewhat simple and has a nointegrity mount option. Or perhaps ext4 which is well-maintained, if it works to create an ext4 with no journal, else maybe ext2. – Peter Cordes Jun 15 '21 at 07:26
  • The mke2fs man page says mke2fs -t ext3 -O ^has_journal /dev/... creates an ext3 filesystem without a journal. For large files, you'll want extents (rather than block allocation bitmaps), and that might be less bad than having a journal, so you might go ext4 without barriers or ordering, and with the journal on a ramdisk (or a file on tmpfs!), if you can't enable enough modern features on an ext2 filesystem and other options like JFS nointegrity or ReiserFS nolog don't perform as well for the workload (e.g. CPU usage by the FS driver). A sketch of the journal-less ext4 variant follows these comments. – Peter Cordes Jun 15 '21 at 07:29
  • XFS can use an external log device, but can't work without a log. Its nobarrier mount option was removed in kernel 4.19, but maybe that won't matter much with the log on a 128 MiB ramdisk like /dev/ram0. XFS's internal algorithms are, I think, known for scaling well to many cores, so it's worth trying out for this use-case. Some of the work it does is optimizing for rotational media, though contiguous data isn't bad on NVMe either: it means fewer extents / less metadata to get to your data, and delayed allocation according to memory pressure might actually avoid work on early delete. – Peter Cordes Jun 15 '21 at 07:35
  • @PeterCordes thanks for the FS insights! To help XFS's case a tiny bit: it's known to work quite well as the file system for compile farms; these are SSD- and FS-cache-heavy, so this might not be the worst choice. – Marcus Müller Jun 15 '21 at 09:01
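If you want to try the journal-less ext4 route from the comments, a minimal sketch; the device path and mount point are placeholders:

    # Create ext4 without a journal and mount it without write barriers.
    mkfs.ext4 -O ^has_journal /dev/nvme0n1pX
    mount -o noatime,barrier=0 /dev/nvme0n1pX /mnt/scratch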