How to disable the Linux page cache

Question

The Linux kernel implements the page cache to accelerate I/O operations.

It would be helpful to be able to turn the page cache off and on for research and testing.

How can the Linux page cache be disabled?

Update 1: If it is not possible to turn off the page cache globally, it may be possible to mount Linux file systems in write-through mode. If I understand it correctly, the mount option dax of EXT2, EXT4, and XFS implements write-through mode. Is this a valid option to avoid the page cache (at least for a single file system)?

https://www.kernel.org/doc/Documentation/filesystems/dax.txt

Update 2: Apparently, DAX is more or less dead, and the page cache cannot really be switched off (not at all globally, and only in a limited manner for a single application). But is there really no way to set a Linux file system (e.g., via a mount parameter) to write-through mode?

Update 3: The tool dd is one example of an application that makes it easy to bypass the page cache of the Linux kernel, via oflag=direct.

https://man7.org/linux/man-pages/man1/dd.1.html

Update 4: Apparently, file systems that use the FUSE (Filesystem in Userspace) module can be used in direct-io mode, where the page cache is completely bypassed for reads and writes.

https://www.kernel.org/doc/Documentation/filesystems/fuse-io.txt

Update 5: The mount option -o sync implements write-through mode and bypasses the page cache for some file systems (ext2, ext3, FAT, VFAT, UFS)

https://linux.die.net/man/8/mount

Does this answer your question? How do you empty the buffers and cache on a Linux system? — Chris Davies, Nov 22 '23 at 11:59
Do you mean to disable globally or for some specific applications ? — MC68020, Nov 22 '23 at 12:34
For the status of the DAX experiment, you could perhaps read https://lwn.net/Articles/787233/ and discover that the project came to an end in 2019 and why it has always given "the whole subsystem the feel of a permanent experiment, and that makes people not want to use it." ;-) — MC68020, Nov 22 '23 at 16:28
you mentioned research and testing.... have you looked into the real-time kernel? https://ubuntu.com/blog/what-is-real-time-linux-i — ron, Nov 22 '23 at 16:38

MC68020 · Answer 1 · 2023-11-22T16:05:22.893

It is not possible, even temporarily, to disable the page cache.

A particular application can, however, try its best to bypass caches by opening files with the O_DIRECT flag set:

   O_DIRECT (since Linux 2.4.10)
          Try to minimize cache effects of the I/O to and from this
          file…
          File I/O is done directly to/from user-space buffers.
          The O_DIRECT flag on its own makes an
          effort to transfer data synchronously, but does not give
          the guarantees of the O_SYNC flag that data and necessary
          metadata are transferred.  To guarantee synchronous I/O,
          O_SYNC must be used in addition to O_DIRECT.

This implies that one must be able to modify the source code and rebuild the application. Alternatively, nocache ( https://github.com/Feh/nocache ), which works by intercepting the open and close system calls, can be used to launch an unmodified binary.
It has some limitations and will of course slow the process down, and I have never tested it.

Of course, one can always empty the cache with the following command (as root):

# echo 1 > /proc/sys/vm/drop_caches
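
The same step can be scripted. The following is only a minimal Python sketch, not a tested tool: the function name drop_caches and its path parameter are made up for illustration (the parameter exists only so the function can be tried against an ordinary file). It calls sync first because dirty pages cannot be dropped, and writing to the real file requires root.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_caches(mode=1, path=DROP_CACHES):
    """Flush dirty data, then ask the kernel to drop clean caches.

    mode 1 drops the page cache, 2 drops dentries and inodes, 3 drops both.
    The path argument is a hypothetical hook for trying this without root.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # dirty pages are not freeable, so write them back first
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

Calling drop_caches() from a root process right before a read benchmark gives the benchmark a cold page cache; it does not keep the cache from filling up again afterwards.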

Is it possible to specify that the O_DIRECT flag shall be used for a given binary application, or is it only possible if the source code is modified? — Neverland, Nov 22 '23 at 15:07
@Neverland : Best is of course if source code is modified. However, I have heard of an application named nocache ( https://github.com/Feh/nocache ) which works intercepting the open and close system calls. You could then launch a non modified binary. It has got some limitations and will of course slow the process down and… I never tested it. — MC68020, Nov 22 '23 at 15:59

Ownsky · Accepted Answer · 2023-11-28T01:25:51.120

I think we shouldn't try to bypass the page cache on block devices such as HDDs and SSDs. The granularity of such disks is the block (512 B, 4 KB, or something else), whereas the read/write system calls operate on bytes.

This mismatch means that the kernel has to combine the data you want to write with the rest of the block it belongs to, and then write the whole block to the disk at once (ignoring writes that cross block boundaries, which work the same way). For a read, the kernel reads the whole block that contains the data you want and then hands you the part you need.

Once we understand this mismatch, we can see that the page cache is not just a speed-up trick. It is a vital part of the I/O stack for disks. More precisely, the mismatch is handled by the page cache together with the buffer heads, which is a little bit complex.
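
To make the mismatch concrete, here is a toy model in Python. It is purely illustrative and has nothing to do with the actual kernel code: BlockDevice, write_bytes and read_bytes are invented names, and the 512-byte block size is an assumption. The device only transfers whole blocks, so a byte-granular write becomes a read-modify-write of every block it touches.

```python
BLOCK_SIZE = 512  # assumed block size for the toy device


class BlockDevice:
    """Toy device that can only transfer whole blocks."""

    def __init__(self, num_blocks, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.reads = 0
        self.writes = 0

    def read_block(self, index):
        self.reads += 1
        return self.blocks[index]

    def write_block(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("a block device only accepts whole blocks")
        self.writes += 1
        self.blocks[index] = bytes(data)


def write_bytes(dev, offset, data):
    """Byte-granular write built from whole-block transfers."""
    pos = 0
    while pos < len(data):
        index, start = divmod(offset + pos, dev.block_size)
        chunk = data[pos:pos + dev.block_size - start]
        if len(chunk) == dev.block_size:
            dev.write_block(index, chunk)  # full block: nothing to merge
        else:
            block = bytearray(dev.read_block(index))  # read ...
            block[start:start + len(chunk)] = chunk   # ... modify ...
            dev.write_block(index, block)             # ... write
        pos += len(chunk)


def read_bytes(dev, offset, length):
    """Byte-granular read: fetch whole blocks, return only the wanted part."""
    out = bytearray()
    while len(out) < length:
        index, start = divmod(offset + len(out), dev.block_size)
        block = dev.read_block(index)
        out += block[start:start + length - len(out)]
    return bytes(out)
```

Writing the five bytes b"hello" at offset 510 touches two blocks, so the toy device sees two block reads and two block writes for a five-byte request.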

Anyway, if you want to bypass the page cache, here are some tips:

DAX: If you are working on a byte-addressable device (such as persistent memory, PMEM), DAX is the best choice for you. Because the device itself can be accessed in bytes, the kernel doesn't need to bother with the mismatch discussed above. In DAX mode, the file system simply bypasses the page cache and reads from and writes to the device directly. Note that DAX CANNOT be used on a block device! See the kernel documentation "Direct Access for files" for more details.
O_DIRECT: You can also open a file with the O_DIRECT flag to bypass the page cache. This is designed for applications that have their own high-performance caches in userspace. However, as far as I know, you can only perform aligned operations on such a file. Also, it does not ensure that the data is already on the disk when the call returns. And the performance can be much lower if you don't have any other cache mechanism implemented. Refer to the man page open(2) or the discussion "Clarifying Direct IO's Semantics", or just give it a try. @MC68020 has also quoted the relevant part of the man page.
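
The alignment restriction of O_DIRECT can be sketched like this in Python. Treat it as a rough sketch under assumptions: the helper names round_up, direct_write and direct_read are invented, the 4096-byte alignment is a guess (the real requirement depends on the device and the file system), and the write pads the file with zero bytes up to a whole block. An anonymous mmap is used as the buffer because it is page-aligned.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment; check your device and file system


def round_up(n, block=BLOCK):
    """Round n up to the next multiple of block."""
    return (n + block - 1) // block * block


def direct_write(path, payload, block=BLOCK):
    """Write payload with O_DIRECT, zero-padded to whole blocks."""
    size = max(round_up(len(payload), block), block)
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT
    fd = os.open(path, flags, 0o644)
    try:
        with mmap.mmap(-1, size) as buf:  # page-aligned buffer
            buf[:len(payload)] = payload
            return os.write(fd, buf)
    finally:
        os.close(fd)


def direct_read(path, length, block=BLOCK):
    """Read the first length bytes of path with O_DIRECT."""
    size = max(round_up(length, block), block)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        with mmap.mmap(-1, size) as buf:
            n = os.readv(fd, [buf])  # read straight into the aligned buffer
            return buf[:min(n, length)]
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "direct.bin")
        try:
            print(direct_write(path, b"hello"), direct_read(path, 5))
        except OSError as err:
            # Some file systems reject O_DIRECT altogether.
            print("O_DIRECT not usable here:", err)
```

Passing an ordinary bytes object of arbitrary length to os.write on such a descriptor may fail with EINVAL, which is the practical meaning of "aligned operations only".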

Further, if you just want to evaluate the performance of a file system without the influence of the page cache, you can try these:

sync: This is for writes. Open the file with O_SYNC (see open(2)); each write will then return only when the data is on the disk. The write performance under O_SYNC will be close to what it would be if the page cache were 'disabled' (there is some extra overhead, such as copying the data to the page cache, but the main cost is the disk I/O).
drop cache: This is for reads. As @MC68020 says, you can drop the system cache before you do something. This gives you a blank page cache, so when you read something the file system will fetch it from the disk. The command is: echo 1 > /proc/sys/vm/drop_caches or sysctl -w vm.drop_caches=1. See the documentation for /proc/sys/vm and search for 'drop_caches' in it. The drawback is that only the first read of a given position goes to the disk. The next read of the same position is served from the page cache.
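
For the write side, a comparison between buffered and O_SYNC writes could look like the Python sketch below. The function name timed_writes and the chunk count and size are arbitrary choices for illustration, and the numbers it prints depend entirely on the device and file system, so take it as a starting point rather than a benchmark.

```python
import os
import tempfile
import time


def timed_writes(path, extra_flags=0, count=50, size=4096):
    """Write count chunks of size bytes and return the elapsed seconds."""
    chunk = b"x" * size
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    fd = os.open(path, flags, 0o644)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, chunk)  # with O_SYNC this returns after the data is on disk
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "sync_test.bin")
        print("buffered:", timed_writes(path))
        print("O_SYNC:  ", timed_writes(path, os.O_SYNC))
```

Run it on the file system you actually want to measure; a temporary directory on tmpfs never touches a disk, so the two timings would say little there.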

Hope these will help :)

Thank you for your reply! One further question: What is about the mount option -o sync? When I understand it right how it works, it also implements write-through mode and bypasses the page cache for some file systems (ext2, ext3, FAT, VFAT, UFS). — Neverland, Nov 28 '23 at 07:20
sync is probably a bit different in nature btw. Most likely it invalidate the advantages/disadvantages of the write cache in the drive as well. (Not sure if bypass is exactly the right word, since I believe sync write is still a bit different from FUA.) — Tom Yan, Nov 28 '23 at 07:44
@Neverland This is from the man page of mount: "All I/O to the filesystem should be done synchronously. In the case of media with a limited number of write cycles (e.g. some flash drives), sync may cause life-cycle shortening." I also checked the kernel code. The flag is called S_SYNC, and the comment says "Writes are synced at once". So it is almost the same as the O_SYNC flag you pass when opening a file. I believe the only difference is that mount -o sync is a superblock-wide flag, whereas O_SYNC is per file descriptor. — Ownsky, Nov 28 '23 at 12:45
@Neverland Also, the word "write-through" means that when a write is performed, it will be written to both the cache and the backend. That is exactly what sync does, no matter an O_SYNC or a mount option. You'll find that it only influences write, but not read. That means when you use a file (or a file system) in sync mode, the cache still exists, and will also be written to, and will also be read from (if the data is in it). So "bypass" is not a proper description of sync. Sync only specifies the necessary conditions for write function's return: the data is already on the disk. — Ownsky, Nov 28 '23 at 12:55
