In a previous question and answer, I showed an experiment about dirty_ratio.

Writeback cache (`dirty`) seems to be limited to even less than dirty_background_ratio. What is it being limited by? How is this limit calculated?

I thought I had solved the question by correcting my understanding of the dirty ratio calculation. But when I repeated the experiment just now, the write-back cache was limited to a lower value than I saw before. I can't work this out; what could be limiting it?

I have default values for the vm.dirty* sysctls: dirty_background_ratio is 10, and dirty_ratio is 20. The "ratios" refer to the size of the dirty page cache, aka write-back cache, as a percentage of MemFree + Cached. They are not a percentage of MemTotal; this was what confused me in the above question.

These ratios mean that reaching 10% causes background writeback to start, and 20% is the maximum size of the write-back cache. Additionally, I understand the write-back cache is limited by "I/O-less dirty throttling". When the write-back cache rises above 15%, processes which generate dirty pages, e.g. with write(), are "throttled". That is, the kernel causes the process to sleep inside the write() call, so the kernel can control the size of the write-back cache by controlling the length of the sleeps. For references, see the answer to my previous question.
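
For reference, the current values can be checked like this (the text above quotes the defaults):

$ sysctl vm.dirty_background_ratio vm.dirty_ratio
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20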

But my observed "ratio" seems to stay distinctly lower than the 15% throttling threshold. There must be some factor I am missing! Why is this happening?

In my previous test I saw values around 15-17.5% instead.

My kernel is Linux 4.18.16-200.fc28.x86_64.

The test is as follows: I ran `dd if=/dev/zero of=~/test bs=1M status=progress`, and at the same time I monitored the achieved dirty ratio. I interrupted the dd command after 15GB.

$ while true; do grep -E '^(Dirty:|Writeback:|MemFree:|Cached:)' /proc/meminfo | tr '\n' ' '; echo; sleep 1; done
...
MemFree:          139852 kB Cached:          3443460 kB Dirty:            300240 kB Writeback:        135280 kB
MemFree:          145932 kB Cached:          3437220 kB Dirty:            319588 kB Writeback:        112080 kB
MemFree:          134324 kB Cached:          3448776 kB Dirty:            237612 kB Writeback:        160528 kB
MemFree:          134012 kB Cached:          3449004 kB Dirty:            169064 kB Writeback:        143256 kB
MemFree:          133760 kB Cached:          3449024 kB Dirty:            105484 kB Writeback:        119968 kB
MemFree:          133584 kB Cached:          3449032 kB Dirty:             49068 kB Writeback:        104412 kB
MemFree:          134712 kB Cached:          3449116 kB Dirty:                80 kB Writeback:         78740 kB
MemFree:          135448 kB Cached:          3449116 kB Dirty:                 8 kB Writeback:             0 kB

For example, the first line in the quoted output:

avail = 139852 + 3443460 = 3583312
dirty = 300240 + 135280 = 435520
ratio = 435520 / 3583312 = 0.122...
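
The same calculation can be done on the fly with a one-liner; this is a sketch, assuming the simple MemFree + Cached definition used above:

$ while true; do awk '/^(MemFree|Cached|Dirty|Writeback):/ {a[$1]=$2}
      END {print (a["Dirty:"] + a["Writeback:"]) / (a["MemFree:"] + a["Cached:"])}' /proc/meminfo; sleep 1; done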

I found one thing that was limiting it, but not enough to explain these results. I had been experimenting with setting /sys/class/bdi/*/max_ratio. The test results in the question are from running with max_ratio = 1.
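
(For anyone repeating this: the value is applied by writing to the sysfs file for each backing device, which needs root.)

$ echo 100 | sudo tee /sys/class/bdi/*/max_ratio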

Repeating the above test with max_ratio = 100, I can achieve a higher dirty ratio e.g. 0.142:

MemFree:   122936 kB Cached:   3012244 kB Dirty:     333224 kB Writeback: 13532 kB

The write test needs to be quite long to observe this reliably, e.g. 8GB. This test takes about 100 seconds. I am using a spinning hard disk.

I tried testing with 4GB, and I only saw a dirty ratio of 0.129:

MemFree:   118388 kB Cached:   2982720 kB Dirty:     249020 kB Writeback: 151556 kB

As I say, this surprises me. I have an expert source from 2013 which says that dd should have "free run" to generate dirty pages until the system hits a dirty ratio of 0.15, and it is explicitly talking about max_ratio.

– sourcejedi

3 Answers

The "ratios" refer to the size of the dirty page cache aka write-back cache, as a percentage of MemFree + Cached. They are not a percentage of MemTotal - this was what confused me in the above question.

No. This description is still inaccurate.

Cached includes all files in tmpfs, and other Shmem allocations. They are counted because they are implemented using the page cache, but they are not a cache of any persistent storage, so they cannot simply be dropped. tmpfs pages can be swapped, but swappable pages are not included in the calculation.

I had 500-600MB of Shmem. This was about the right amount to explain why limit / 0.20 was lower than I had expected, when I tried looking at the tracepoint again (see the answer to the previous question).

Also, Cached excludes Buffers, which can be a surprisingly large amount on certain setups.
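
So a less-wrong estimate of the "dirtyable" total would start from something like the following. This is still only a sketch, not the exact kernel formula:

$ awk '/^(MemFree|Cached|Buffers|Shmem):/ {a[$1]=$2}
      END {print a["MemFree:"] + a["Cached:"] + a["Buffers:"] - a["Shmem:"], "kB"}' /proc/meminfo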

I think I should look carefully at the implementation of global_dirtyable_pages() for my kernel version, and use the more low-level counters exposed in /proc/vmstat. Or perhaps focus on using the tracepoint instead.
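
In the meantime, the kernel's own results of that calculation are already visible there, in units of pages (these counters are used again in the third answer below):

$ grep -E '^nr_dirty(_background)?_threshold' /proc/vmstat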

– sourcejedi

On ext3/ext4 there is a basic flush-every-5-seconds default when data=ordered, which is itself the default:

   commit=nrsec
          Start a journal commit every nrsec seconds. The default value is 5 seconds. Zero means default.

So if your pages just don't want to stay dirty (or stay dirty too long), that is the first place to adjust.
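
For example, to raise the commit interval on a mounted ext4 filesystem:

$ sudo mount -o remount,commit=60 /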


There is a dilemma here, and there was this big problem with ext4 losing data some years ago (see the noauto_da_alloc mount option):

The kernel is ready to write only when it has to; that is why you can limit by size and age, and even by throttling. The VM acts as a cache for the main storage.

The "desktop" user is mostly NOT ready for this additional layer. When he saves a file, he excepts it to be physically present on the storage immediately after. That is why vim has fs or fsync option:

When on, the library function fsync() will be called after writing a file. This will flush a file to disk, ensuring that it is safely written even on filesystems which do metadata-only journaling. This will force the harddrive to spin up on Linux systems running in laptop mode, so it may be undesirable in some situations.


I had 500,000 dirty pages from git repo manipulation; then they hit the background size limit and vanished completely. When git decides to repack/garbage-collect, it also seems to sync, a bit like vim with fsync set.

A mount (of any other partition) also does a sync.

But with a high ratio, a long expire_centisecs, no 5-second ext4 journal commit, and nobody (vim, git) syncing, half of RAM can stay dirty for hours.

  • Yes, data=ordered is the default, with a commit of 5 secs. You can keep "ordered" and raise commit, or mount "writeback" and leave commit. –  Oct 13 '21 at 16:14
  • Hmm. I think my answer was probably the main thing I was hitting, but it's a good point to look out for. I notice "flush-every-5-secs" wouldn't necessarily affect this test, if it was run for the first time i.e. ~/test did not already exist. In this case, auto_da_alloc will not kick in - delayed allocation will be allowed - and hence the data will not necessarily be flushed when there is a commit. – sourcejedi Oct 13 '21 at 19:12

I downsized the dd to write only a couple of hundred pages:

dd if=/dev/zero of=zerooo count=250 bs=4096

This script shows the important parameters:

grep dirty /proc/vmstat
grep ''    /proc/sys/vm/*centi*
grep ''    /proc/sys/vm/dirt*ratio

Expiry is practically turned off, and the thresholds are above the 250 pages dd writes:

nr_dirty 0 
nr_dirty_threshold 2398                          
nr_dirty_background_threshold 1199
/proc/sys/vm/dirty_expire_centisecs:360000
/proc/sys/vm/dirty_writeback_centisecs:360000
/proc/sys/vm/dirty_background_ratio:0             
/proc/sys/vm/dirty_ratio:0

The ratios are 0, showing that I echoed into the absolute _bytes parameters instead:

/proc/sys/vm/dirty_background_bytes:9819200
/proc/sys/vm/dirty_bytes:9819200
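
They were set along these lines; writing a _bytes file zeroes the corresponding _ratio setting, and vice versa:

echo 9819200 > /proc/sys/vm/dirty_background_bytes
echo 9819200 > /proc/sys/vm/dirty_bytes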

The leading 9 turns the 819200 bytes = 200 pages into 9819200 bytes = ~2400 pages; see nr_dirty_threshold above. The background threshold came out at half of that: when dirty_background_bytes is not below dirty_bytes, the kernel falls back to 50%.
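
The arithmetic, with 4096-byte pages:

$ echo $((9819200 / 4096))   # integer result; 9819200/4096 = 2397.46, the kernel shows 2398
2397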

Now the mount options are important. Unluckily I have / mounted with data=writeback as well, and I don't know if that indirectly prevents some flushing. But noauto_da_alloc, the long commit interval, and "writeback" all seem to be necessary to be left with only the dirty pages dd itself produces.

TARGET       SOURCE    FSTYPE OPTIONS
/            /dev/sda3 ext4   rw,relatime,noauto_da_alloc,data=writeback
`-/mnt/sda/4 /dev/sda4 ext4   rw,relatime,noauto_da_alloc,commit=3600,data=writeback

Now:

$ time dd if=/dev/zero of=zerooo count=150 bs=4096
150+0 records in
150+0 records out
614400 bytes (614 kB, 600 KiB) copied, 0.00106973 s, 574 MB/s

real 0m0.003s user 0m0.000s sys 0m0.003s

And the original monitoring script running in parallel:

Dirty:                 0 kB Writeback:             0 kB 
Dirty:                 0 kB Writeback:             0 kB 
Dirty:                 0 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:               600 kB Writeback:             0 kB 
Dirty:                 0 kB Writeback:             0 kB 

The drop to zero is the rm zerooo I did. With the long expiry, it otherwise stays and stays.

After:

echo 25600 > /proc/sys/vm/dirty_bytes

The thresholds are minuscule; the background value is automatically adapted to 50%!:

nr_dirty 0
nr_dirty_threshold 7 
nr_dirty_background_threshold 3
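
That is 25600 bytes rounded up to whole pages, with the background threshold again capped at about half:

$ echo $((25600 / 4096))   # 6.25 pages; the kernel rounds up to 7
6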

Now I get:

$ time dd if=/dev/zero of=zerooo count=150 bs=4096
150+0 records in
150+0 records out
614400 bytes (614 kB, 600 KiB) copied, 0.0258571 s, 23.8 MB/s

real 0m0.028s user 0m0.003s sys 0m0.000s

Much longer! Throttling! And there is only some residual 12 kB dirty:

Dirty:                 0 kB Writeback:             0 kB 
Dirty:                 0 kB Writeback:             0 kB 
Dirty:                 0 kB Writeback:             0 kB 
Dirty:                12 kB Writeback:             0 kB 
Dirty:                12 kB Writeback:             0 kB 
Dirty:                12 kB Writeback:             0 kB 
Dirty:                12 kB Writeback:             0 kB 
... snip
Dirty:                12 kB Writeback:             0 kB 
Dirty:                12 kB Writeback:             0 kB 
Dirty:                 0 kB Writeback:             0 kB 
Dirty:                 0 kB Writeback:             0 kB 

Again the rm zerooo puts nr_dirty to zero.

DD#3:

With the upper limit well above the 150 pages, but the background left at 3:

nr_dirty 0
nr_dirty_threshold 2448
nr_dirty_background_threshold 3

dd is back to full speed. dd does not care where these pages end up exactly; as far as it is concerned, they have been written.

$ time dd if=/dev/zero of=zerooo count=150 bs=4096
150+0 records in
150+0 records out
614400 bytes (614 kB, 600 KiB) copied, 0.0011005 s, 558 MB/s

real 0m0.003s user 0m0.003s sys 0m0.000s

The result is similar to the second run: a few kB are left unflushed by the background writeback, but this is maybe just some artifact of the low numbers. In my experience, once it starts, it flushes down to zero.

Dirty:                 0 kB Writeback:             0 kB 
Dirty:                 0 kB Writeback:             0 kB 
Dirty:                36 kB Writeback:             0 kB 
Dirty:                36 kB Writeback:             0 kB 
Dirty:                36 kB Writeback:             0 kB 
Dirty:                36 kB Writeback:             0 kB 
Dirty:                36 kB Writeback:             0 kB 
Dirty:                36 kB Writeback:             0 kB 
Dirty:                36 kB Writeback:             0 kB 
Dirty:                 0 kB Writeback:             0 kB 
Dirty:                 0 kB Writeback:             0 kB 

Now by just changing:

mount -o remount,commit=13 /mnt/sda/4

And with the background threshold above the 150 pages (as in the first run), I get extra dirtiness after exactly 13 seconds, even with data=writeback:

 Dirty:                 0 kB  
 Dirty:                 0 kB  
 Dirty:                 0 kB  
 Dirty:               456 kB   ??? seems to appear always
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               600 kB  
 Dirty:               620 kB    !!! after 13 seconds commit= mount option
 Dirty:               620 kB  
 Dirty:               620 kB  
 Dirty:               620 kB  
 Dirty:               620 kB  
 Dirty:               620 kB  
 Dirty:               620 kB  
 Dirty:               620 kB  
 Dirty:               620 kB  
 Dirty:               620 kB    rm zerooo
 Dirty:                20 kB  
 Dirty:                20 kB  
 Dirty:                20 kB  
 Dirty:                20 kB  
 Dirty:                20 kB  
 Dirty:                20 kB    sync
 Dirty:                 0 kB   

And this 5-second package of data and metadata is supposed to get flushed every 30 seconds by the "centisecs" parameters in /proc/sys/vm/. But only together; that is where data=ordered comes in.
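
For comparison, the usual defaults are an expiry of 30 seconds and a flusher wakeup every 5 seconds:

$ grep '' /proc/sys/vm/*centi*
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_writeback_centisecs:500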

As for the title: of course you can't see more nr_dirty in /proc/vmstat than the upper threshold if you Ctrl-C dd and only then start a watching script.

  • I suspect we're still misunderstanding each other. I think you've explained why my Dirty fell all the way to 8 (and presumably then to 0) at the end. That's cool & I hadn't really thought about it. The question I tried to ask was about the dynamic equilibrium point: the stable value of Dirty over a minute or multiple minutes while dd was running continuously. – sourcejedi Oct 14 '21 at 21:03