I would like to understand this email from 2013, by Mel Gorman:
https://lore.kernel.org/lkml/20131030120152.GM2400@suse.de/
There are still problems though. If all dirty pages were backed by a slow device then dirty limiting is still eventually going to cause stalls in dirty page balancing. [...] Consciously or unconsciously my desktop applications generally do not fall foul of these problems. [...] I'm probably unconsciously avoiding doing any write-heavy work while a USB stick is plugged in.
It suggests the approach below. This has not been implemented yet, at least as of Linux 5.1:
I still suspect that we will have to bite the bullet and tune based on "do not dirty more data than it takes N seconds to writeback" using per-bdi writeback estimations. It's just not that trivial to implement as the writeback speeds can change for a variety of reasons (multiple IO sources, random vs sequential etc). Hence at one point we think we are within our target window and then get it completely wrong. Dirty ratio is a hard guarantee, dirty writeback estimation is best-effort that will go wrong in some cases.
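To make the suggested rule concrete, here is my own rough reading of it as a back-of-the-envelope calculation. This is purely my illustration, not anything from the email or the kernel source; I am assuming the kernel's per-device writeback bandwidth estimate, which on my system shows up as BdiWriteBandwidth (in kB/s) under /sys/kernel/debug/bdi/<device>/stats, and an arbitrary target of N=5 seconds:

# Illustration only: "do not dirty more data than it takes N seconds to write back".
# The 8:0 device number and N=5 are arbitrary examples.
N=5
bw_kbps=$(awk '/BdiWriteBandwidth/ {print $2}' /sys/kernel/debug/bdi/8:0/stats)
echo "the per-device dirty limit would be about $(( bw_kbps * N )) kB"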
What were the outstanding problems in the kernel code that the email talks about? Why did Gorman expect to see stalls with the existing code?
E.g. do we know if they could have caused a contemporary PC to appear completely hung for more than 5-10 seconds at a time? Or, alternatively, could they have caused open windows to appear hung, while still allowing the cursor to be moved and windows to be dragged around?
Is there any part of the email which is known to be wrong?
Do the same problems still exist in current Linux kernels?
I have seen a few reports of very similar symptoms, e.g. by Chris Siebenmann in 2017.
Mel Gorman's email is part of an archived discussion. LWN.net wrote up the discussion as "The pernicious USB-stick stall problem". Please avoid simply parroting that article. It is discussed below.
Testing ("what have you tried?")
I run kernel 5.1.6-200.fc29.x86_64. I have an 8G Imation USB drive. dd if=/dev/zero of=/var/run/media/alan/imation/test bs=1M status=progress converges to about 4MiB/s, i.e. roughly 25 times slower than my main HDD can achieve. While it is running, grep -E 'Dirty|Writeback' /proc/meminfo converges to around 1GiB of cached writes at any given time. I have 8GiB of physical RAM.
(Writeback shows only one or two MiB at a time. Note also that the Dirty level can vary: it used to be set as an approximate proportion of MemTotal, i.e. physical RAM, but currently it is set as an approximate proportion of MemAvailable.)
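For reference, this is how I watch those counters and check the knobs that set the limit. The sysctls are the standard ones; treat the exact commands as an illustration of my setup rather than anything authoritative:

# watch the global dirty/writeback counters, refreshed every second
watch -n1 "grep -E 'Dirty|Writeback' /proc/meminfo"
# the knobs behind the limit: the ratios are a percentage of available memory,
# and the *_bytes variants override the ratios when set to a non-zero value
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes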
During the USB write, my system remains responsive. I can move windows, I can continue using my web browser...
Both before and during the USB write, I ran the following latency test on my main HDD. It takes about the same time in both cases: 3-4 seconds:
mkdir "$HOME"/tmp
cd "$HOME"/tmp
# 100 iterations of: write a 16KiB file, sync that specific file, then rename it
time bash -c "for (( i=0; i < 100; i++ )); do dd if=/dev/zero bs=16k count=1 of=latencytest status=none; sync latencytest; mv latencytest latencytest2; done"
"The pernicious USB-stick stall problem" - ?
Unusually for LWN, they did not read carefully enough: the LWN article about this discussion conflated two different problems. I have cited this claim exhaustively in the following link. To try to reduce reader exhaustion, I have picked out the relevant details below.
The archived emails discuss stalls lasting many seconds, due to having built up a corresponding amount of "dirty" (unwritten) page cache. But there was a bit of confusion about the specifics.
The original complaint was quite clear.
Linus Torvalds reiterated this complaint using more specific examples. E.g. drag and drop a random 1GB linux.iso file into a mounted USB flash drive. Say the drive can write 10MB/s. You run time sync to flush the cached writes to the filesystem, and you find it takes 100 seconds. This feels like an over-large write cache!
This result depends on you having something like 8GB of RAM or more, and also a 64-bit kernel. The write cache is limited to 20% of RAM. The dramatic part was that on a 32-bit kernel, the limit was calculated based on the first 1GB of RAM only. Therefore you could see the sync part take 8 times longer (or 16 times, or ...) when you switched to a 64-bit kernel. I can understand being alarmed and dismayed by this :-).
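To spell out the arithmetic (my own illustrative numbers, assuming the default 20% dirty limit and a 10MB/s stick, not figures taken from the emails):

# 64-bit kernel, 8GiB RAM: the 20% limit applies to all of RAM
echo "$(( 8192 * 20 / 100 / 10 )) seconds to sync"   # ~163 seconds
# 32-bit kernel: the same 20% was calculated against the first ~1GiB only
echo "$(( 1024 * 20 / 100 / 10 )) seconds to sync"   # ~20 seconds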
A second complaint was also posted. This was a different, "near-stall" scenario which involved the main system disk only.
The mechanics in the second case are more complex. It filled the request queues both inside the main disk itself and in the kernel IO scheduler for the main disk. You can hit this problem even when you do not allow 20 seconds' worth or more of "dirty" cache, although there is some overlap between the two issues.
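For what it is worth, the two queue sizes involved in that second case can be inspected through sysfs; sda here is just an example device name, and the second file only exists for SCSI/SATA-style devices:

# number of requests the kernel block layer / IO scheduler will queue for the device
cat /sys/block/sda/queue/nr_requests
# command queue depth inside the disk itself
cat /sys/block/sda/device/queue_depth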
It seems significantly easier to find problems similar to the second case. Sustained writes to my /home or / filesystem seem to be much more dangerous in terms of stalling the system (this includes preventing or delaying switching to a text VT, e.g. with Ctrl+Alt+F6). I have seen this recently. Less recently: "my system becomes much, much less responsive, every time I clone a VM image."