[PATCH 0/8] Throttled background buffered writeback v7

Since the dawn of time, our background buffered writeback has sucked. When we do background buffered writeback, it should have little impact on foreground activity. That's the definition of background activity... But for as long as I can remember, heavy buffered writers have not behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start before the buffered writeback is done.

Those patches have since been merged into Linux, and are available in Fedora Workstation and elsewhere. Yay.

But this "writeback throttling" (WBT) has no effect by default, at least on SATA HDDs and SSDs. The default I/O scheduler, CFQ, is not compatible with WBT, and neither is BFQ (CFQ's successor for the multi-queue block layer). You need to switch to an I/O scheduler which does not try to throttle background writeback itself, so you have to make a trade-off :-(.
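For reference, this is how I check the current state. It is only a sketch: I am assuming the device is sda, a multi-queue kernel built with CONFIG_BLK_WBT (the wbt_lat_usec file only exists then), and a spinning disk (75000 μs is the default WBT target for rotational devices):

# The scheduler in [brackets] is the active one.
$ cat /sys/block/sda/queue/scheduler
[bfq] mq-deadline kyber none

# WBT target latency in microseconds; 0 means WBT is currently disabled.
$ cat /sys/block/sda/queue/wbt_lat_usec
0

# Switch to a scheduler that does not throttle writeback itself;
# WBT then re-enables with its default target.
$ echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/wbt_lat_usec
75000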

CFQ was probably advertised in the past with descriptions that would have sounded similarly attractive. BFQ certainly is as well... but, from its documentation, BFQ seems to use the same class of heuristics as CFQ. I don't see it measuring the I/O latency and throttling background writeback if the latency gets too high.

I have a 2-year-old laptop with a spinning hard drive. It has full support for SATA NCQ, i.e. a 32-deep I/O queue on the device. Based on the cover letter for v3 of the patches, I expect a system like mine can indeed suffer from the problem: background writeback can submit far too much I/O at a time.
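The relevant queue depths can be read from sysfs (again assuming the device is sda):

# NCQ tags the kernel will actually use. On my systems this reads 31;
# I believe libata reserves one of the 32 NCQ tags for internal commands.
$ cat /sys/block/sda/device/queue_depth
31

# Requests the block layer will queue up in front of the device:
$ cat /sys/block/sda/queue/nr_requests
128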

I've certainly noticed some tendency for the system to become unusable when copying large files (20 GB VM files; I have 8 GB of RAM). Although it seems the biggest problem I had there was due to ext4, and partly to a gnome-shell bug.

  1. What instructions can I follow to reproduce the problem that WBT addresses, and to get a very rough number showing whether it is a problem on a specific system? (A rough sketch of the kind of check I have in mind follows this list.)
  2. This problem sounds quite severe, and it has been understood for at least a couple of years, so you would hope that some monitoring systems know how to show the underlying problem. If you have a performance problem and suspect this as one of the possible causes, what measurement(s) can you look at to diagnose it?
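For question 1, this is the rough kind of check I have in mind. It is only a sketch: the file name foo and device sda are placeholders, and ioping is a separate package you may need to install:

# Terminal 1: generate sustained background writeback.
$ dd if=/dev/zero of=foo bs=1M count=10k

# Terminal 2: watch per-device latency and queue size while it runs.
# (Column names vary by sysstat version: look at await/w_await and
# avgqu-sz/aqu-sz. Reads stalling behind a saturated queue is the symptom.)
$ iostat -x 1 /dev/sda

# Terminal 3: probe foreground read latency, bypassing the page cache.
$ ioping -D .

# Also worth watching: how much dirty data is queued in the page cache.
$ grep -E '^(Dirty|Writeback):' /proc/meminfo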
sourcejedi
  • Generally the way a lack of write throttling shows up is: 1. something does a huge amount of buffered write I/O to something slow (e.g. a USB2-attached disk), but at this stage it is buffered and not being flushed. 2. Something else is doing write I/O to a disk that is medium-speed or fast. 3. A sync happens for some reason (either on a timer or by request) that forces the system to flush I/O to the slow disk. Something tells me you've seen this link before, but see https://utcc.utoronto.ca/~cks/space/blog/linux/USBDrivesKillMyPerformance (and the links in the comments) for details. – Anon Nov 24 '18 at 06:48
  • @Anon read the cover letter for WBT, the link I gave. It is designed to help on a kernel developer's laptop - which has a SATA SSD, if I read the speeds correctly - and it was also originally advertised to help on an internal SATA HDD. CKS's scenario is different: that's about interference between different devices. There are different cover letters for each WBT patch version, but they're all about a problem on a single device. – sourcejedi Nov 24 '18 at 09:53
  • @Anon Also, I believe CKS, and there are very similar-looking reports of that USB problem on this site, but the LWN article on the "pernicious USB-stick stall problem" is broken. It completely misrepresents the original report and the series of responses. That LWN article needs to be discounted, at least as a citation for an analysis of that problem. https://unix.stackexchange.com/questions/480399/why-were-usb-stick-stall-problems-reported-in-2013-why-wasnt-this-problem-so – sourcejedi Nov 24 '18 at 10:04
  • @Anon Analysis can get very confusing because there are four writeback buffers: the Linux dirty page cache, the Linux I/O scheduler queue, the device queue (NCQ), and the device writeback cache. I want to understand what WBT can solve and what it does not solve. – sourcejedi Nov 24 '18 at 10:50
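For reference, each of those four buffering layers can be inspected separately. A rough sketch, again assuming the device is sda (hdparm may need to be installed):

# 1. Dirty page cache: how much data is waiting to be written back.
$ grep -E '^(Dirty|Writeback):' /proc/meminfo

# 2. Block layer / I/O scheduler queue: queue size, and requests in flight.
$ cat /sys/block/sda/queue/nr_requests
$ cat /sys/block/sda/inflight

# 3. Device queue (NCQ): how many commands the kernel sends at once.
$ cat /sys/block/sda/device/queue_depth

# 4. Device writeback cache: whether the drive itself caches writes.
$ cat /sys/block/sda/queue/write_cache
$ sudo hdparm -W /dev/sda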

0 Answers