
I'm using a system with multiple storage devices that have different write throughput. As explained in the question "Why were "USB-stick stall" problems reported in 2013? Why wasn't this problem solved by the existing "No-I/O dirty throttling" code?", Linux by default allows a single slow device to use nearly all of the disk cache for write buffering, which results in poor performance for all processes, even those writing to other devices. The situation gets even worse if any process calls sync(), which stalls until the whole cache has been written out. If a process is, for example, writing an ISO image to a slow USB memory stick, the whole image may sit in the kernel cache from early on, so sync() would end up waiting for the entire ISO image to be written to the slow stick.

Even without any process calling sync(), all programs suffer a slowdown: instead of using background writeback as usual, they are forced into synchronous writing once the system-wide dirty cache exceeds /proc/sys/vm/dirty_background_bytes bytes.
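For reference, the current system-wide limits can be checked with the standard procfs paths:

$ grep . /proc/sys/vm/dirty_background_bytes /proc/sys/vm/dirty_bytes

If these show 0, the system is using the percentage-based variants dirty_background_ratio and dirty_ratio instead.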

I want to avoid this by preventing a single slow storage device (or several running in parallel) from slowing down the whole system. As far as I know, this requires limiting the amount of write cache a single slow device is allowed to use. The cache should be big enough that each device itself is the bottleneck for any process writing to it, but no process is limited by the other devices, as happens with the default configuration.

I've figured out that if I limit /proc/sys/vm/dirty_background_bytes to 50 MB and /proc/sys/vm/dirty_bytes to 200 MB, the latency of the system never gets truly bad, but I still see some slowdown when writing to slow devices. I think this happens because all writing is forced to be synchronous once more than 50 MB of dirty cache is in use. Assume the process writing to the slow device has 52 MB in the cache and another process wants to write a 4 KB file to a fast SSD: that 4 KB write has to be synchronous too, which limits it to SSD speeds instead of RAM speeds. On the other hand, the 200 MB cap on the write cache may be too small when writing to truly fast SSD devices, because the process generating the data may not be fast enough to fill the cache. As a result, heavily limiting those two kernel settings is a tradeoff: it avoids worst-case latency but gives non-optimal average performance.
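For example, those limits can be set at runtime like this (the values are the 50 MB and 200 MB from above; they do not survive a reboot unless also placed in /etc/sysctl.d/):

$ sudo sysctl vm.dirty_background_bytes=$((50*1024*1024))
$ sudo sysctl vm.dirty_bytes=$((200*1024*1024))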

I know that if I set a device's BDI max_ratio to a value less than 100, it will be used as the percentage of the whole write cache that device is allowed to use.

However, the default for every device is 100, so any slow device is allowed to slow down the whole system. I've tested that setting max_ratio to a value well below 100 works nicely and prevents any slowdown caused by the slow device, because that device can no longer hog the entire cache.
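For a single already-connected device this can be tested by hand through sysfs (sdX is a placeholder for your device name):

$ echo 30 | sudo tee /sys/block/sdX/bdi/max_ratio
$ cat /sys/block/sdX/bdi/max_ratio
30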

How do I set max_ratio to a value less than 100 for all devices, including devices connected in the future? I could write a script that runs during system boot and configures all currently connected devices, but then any newly connected storage (be it USB, eSATA or any other connection method) would still be allowed to take all of the write cache.

1 Answer


You can configure max_ratio (and min_ratio) for all devices using udev rules. For example, simply create a new file called /etc/udev/rules.d/90-bdi-set-min_ratio-and-max_ratio.rules as root with the following contents:

# For every BDI device, set max cache usage to 30% and min reserved cache to 2% of the whole cache
ACTION=="add|change", SUBSYSTEM=="bdi", ATTR{min_ratio}="2", ATTR{max_ratio}="30"

(The syntax of *.rules files is to use == to match the current state and = to apply an action. In this case, we tell udev to monitor the bdi subsystem and, whenever a device is added or changed, apply ATTR{min_ratio}="2" and ATTR{max_ratio}="30".)

After creating that file, either reboot the system or just run sudo udevadm trigger to apply the new rules, as shown below. (This works with Ubuntu 18.04 or newer; some older versions may also need sudo udevadm control --reload-rules before the trigger command.)
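That is, after saving the rules file:

$ sudo udevadm control --reload-rules   # only needed on older releases
$ sudo udevadm trigger                  # apply the rules to already-connected devices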

You can inspect the current state of all the devices with

$ grep . /sys/devices/virtual/bdi/*/{min,max}_ratio
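With the defaults, the output looks something like this (BDI devices are named by their major:minor numbers, so the exact entries depend on your system; 0 is the default min_ratio and 100 the default max_ratio):

/sys/devices/virtual/bdi/8:0/min_ratio:0
/sys/devices/virtual/bdi/8:16/min_ratio:0
/sys/devices/virtual/bdi/8:0/max_ratio:100
/sys/devices/virtual/bdi/8:16/max_ratio:100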

The values in the above example (2 and 30) tell the system to reserve at least 2% of the cache for every device and to allow a single device to use at most 30% of the maximum cache allowed. I don't know what happens if you have 3 storage devices and set min_ratio to 50 for each, because logically that would reserve 150% of the total cache space. If you test this, please add a comment below.

I haven't tested how this interacts with huge dirty_bytes and dirty_background_bytes values, but if I've understood the kernel behavior correctly, these per-device limits are not enforced until the write cache contains at least (dirty_bytes + dirty_background_bytes) / 2 bytes, i.e. 50% of the sum of those values. As such, setting very low percentages for min_ratio and max_ratio cannot actually prevent the system from using a lot of RAM for caching if you allow a huge cache. Assume that the kernel will always freely use half the sum of dirty_bytes and dirty_background_bytes for write cache, and that you can only balance the other half. Future kernels may improve on this, but if I've understood correctly, kernel developers chose the current implementation to reduce the overhead of tracking which part of the write cache belongs to which device while everything is running smoothly.
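A quick worked example of that understanding: with dirty_background_bytes = 50 MB and dirty_bytes = 200 MB, the per-device ratios would only start to matter once the dirty cache exceeds (50 MB + 200 MB) / 2 = 125 MB; below that threshold, even a slow device with a very low max_ratio could dirty the whole amount.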

It might be better to just raise min_ratio for all the devices where performance actually matters, to keep slow devices from taking cache away from them, but this would need more benchmarking to understand the kernel behavior better. I selected a max_ratio value of 30 to allow up to 3 slow devices working in parallel while still leaving 10% of the cache for fast devices (which are not yet bandwidth limited). As the kernel freely uses half the cache by default and max_ratio only balances the other half, this in practice sets the limit for a single slow device to 0.5 + 0.5 * 0.3 = 0.65, or about 65% of the maximum disk cache.

Note that you can apply BDI limits to partitions or to full devices. I would assume that in the case of RAID you can apply BDI limits to a single partition within the RAID, to the whole RAID device, or to a single device within the RAID. The above udev rule applies the limits at all levels and I'm not sure that's optimal; it might be better to apply the limit only to the underlying devices.
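A hypothetical sketch of those alternatives via sysfs (md0 and sda are placeholder names, and I haven't benchmarked which level works best):

$ echo 30 | sudo tee /sys/block/md0/bdi/max_ratio   # whole RAID device
$ echo 30 | sudo tee /sys/block/sda/bdi/max_ratio   # single underlying disk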