[PATCH 0/8] Throttled background buffered writeback v7
Since the dawn of time, our background buffered writeback has sucked. When we do background buffered writeback, it should have little impact on foreground activity. That's the definition of background activity... But for as long as I can remember, heavy buffered writers have not behaved like that. For instance, if I do something like this:
$ dd if=/dev/zero of=foo bs=1M count=10k
on my laptop, and then try and start chrome, it basically won't start before the buffered writeback is done.
These patches have now been merged into Linux, and are available in Fedora Workstation and elsewhere. Yay.
But this "writeback throttling" (WBT) has no effect by default, at least on SATA HDDs and SSDs. The default IO scheduler CFQ is not compatible with WBT. And nor is BFQ (the successor to CFQ, for the multi-queue block layer). You need to switch to an I/O scheduler which does not try to throttle background writeback itself. So you have to make some trade-off :-(.
CFQ was probably advertised in the past with similarly attractive-sounding descriptions, and BFQ certainly is today... but from the documentation, BFQ appears to rely on a similar class of heuristics to CFQ. I don't see it measuring I/O latency and throttling background writeback when the latency gets too high.
I have a two-year-old laptop with a spinning hard drive. It has full support for SATA NCQ, i.e. a 32-deep I/O queue on the device. Based on the cover letter for v3 of the patches, I expect a system like mine can indeed suffer from the problem that background writeback can submit far too much I/O at a time.
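In case it matters, this is how I checked the queue depth (a sketch, assuming the drive is /dev/sda; hdparm output formatting may vary):

# Queue depth the kernel is actually using for this device:
$ cat /sys/block/sda/device/queue_depth

# Queue depth the drive itself advertises:
$ sudo hdparm -I /dev/sda | grep -i 'queue depth'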
I've certainly noticed some tendency for the system to become unusable when copying large files (20 GB VM images; I have 8 GB of RAM). Although it seems the biggest problem I had there was due to ext4, and partly a gnome-shell bug.
- What instructions can I follow to reproduce the problem that WBT addresses, and get a very rough number showing whether it is a problem on a specific system? (A sketch of the sort of test I have in mind is below this list.)
- It sounds like this problem is quite severe, and has been understood for at least a couple of years now. You would hope that some monitoring systems know how to show the underlying problem. If you have a performance problem and think this is one of the possible causes, what measurement(s) can you look at in order to diagnose it? (The measurements I already know about are listed after this list, in case they help.)
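For the first question, this is the rough shape of test I have in mind, building on the dd example above. It is only a sketch, and "some-large-uncached-file" is a placeholder for any file big enough not to be served from cache:

# Start the background buffered writer from the cover letter:
$ dd if=/dev/zero of=foo bs=1M count=10k &

# Drop the page cache so the read below actually hits the disk:
$ echo 1 | sudo tee /proc/sys/vm/drop_caches

# Time a read while writeback is in progress, and compare it with the
# same read on an idle system, to get a very rough latency number:
$ time cat some-large-uncached-file > /dev/null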
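For the second question, these are the only measurements I currently know to look at, in case they are a useful starting point:

# Dirty memory waiting for writeback, and writeback currently in flight:
$ grep -E 'Dirty|Writeback' /proc/meminfo

# Per-device request latencies and queue size (from the sysstat package);
# I would expect the await and queue-size columns to climb while dd runs:
$ iostat -x 1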