  1. sudo dd if=/dev/sda of=/dev/null bs=1M iflag=direct
  2. atopsar -d 5 # in a second terminal
  3. top # in a third terminal

Results from atopsar:

19:18:32  disk           busy read/s KB/read  writ/s KB/writ avque avserv _dsk_
...
19:16:50  sda             18%  156.5  1024.0     0.0     0.0   5.0   1.15 ms
19:16:55  sda             18%  156.3  1024.0     0.0     0.0   4.9   1.15 ms
...

Why is disk utilization ("busy") reported as much less than 100%?

According to top, the dd process only uses 3% of a CPU or less. top also provides an overall report of hardware and software interrupt (hi and si) usage across the system's CPUs, which shows as less than 1%. I have four CPUs (2 cores with 2 threads each).

/dev/sda is a SATA HDD. It is not an SSD, nor even a hybrid SSHD drive. It cannot read faster than about 150 megabytes per second :-). So that part of the results makes sense: 156 read/s * 1024 KB/read = 156 MB/s.

The kernel version is 5.0.9-200.fc29.x86_64 (Fedora Workstation 29). The IO scheduler is mq-deadline. Since kernel version 5.0, Fedora uses the multi-queue block layer, because the single-queue block layer has been removed :-).
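
To confirm which scheduler a disk is using, you can read the request queue sysfs attribute (a standard interface; substitute your own device name for sda):

cat /sys/block/sda/queue/scheduler
# on a blk-mq kernel this prints something like: [mq-deadline] none
# the scheduler shown in square brackets is the active one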

I believe the disk utilization figure in atopsar -d and atop is calculated from one of the kernel iostat fields. The kernel's iostats documentation mentions "field 10 -- # of milliseconds spent doing I/Os". There is a more detailed definition as well, although I am not sure the functions it mentions still exist in the multi-queue block layer. As far as I can tell, both atopsar -d and atop use common code to read this field 10. (I believe this field is also used by sar -d / iostat -x / mxiostat.py.)
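
If you want to sanity-check the "busy" figure without atop, a rough sketch is to sample this field yourself. This assumes the device is sda, and that io_ticks is the 13th whitespace-separated column of /proc/diskstats (i.e. the 10th field after the device name):

t1=$(awk '$3 == "sda" { print $13 }' /proc/diskstats)
sleep 5
t2=$(awk '$3 == "sda" { print $13 }' /proc/diskstats)
# io_ticks is in milliseconds; over a 5000 ms interval, divide by 50 to get a percentage
echo "busy: $(( (t2 - t1) / 50 ))%"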

Additional tests

Variant 2: Changing to bs=512k, but keeping iflag=direct.

dd if=/dev/sda of=/dev/null bs=512k iflag=direct

19:18:32  disk           busy read/s KB/read  writ/s KB/writ avque avserv _dsk_
...
19:18:00  sda             35%  314.0   512.0     0.0     0.0   2.1   1.12 ms
19:18:05  sda             35%  313.6   512.0     0.2     4.0   2.1   1.11 ms

Variant 3: Using bs=1M, but removing iflag=direct. dd uses about 10% CPU, and the disk shows about 35% busy.

dd if=/dev/sda of=/dev/null bs=1M

19:18:32  disk           busy read/s KB/read  writ/s KB/writ avque avserv _dsk_
...
19:21:47  sda             35%  242.3   660.2     0.0     0.0   5.4   1.44 ms
19:21:52  sda             31%  232.3   667.8     0.0     0.0   9.5   1.33 ms

How to reproduce these results - essential details

Beware of the last test, i.e. running dd without iflag=direct

It is a bit of a hog. I saw it freeze the system (mouse cursor) for ten seconds or longer, even when I had swap disabled. (The test fills your RAM with buff/cache. It is filling the inactive LRU list. I think the turnover evicts inactive cache pages relatively quickly. At the same time, the disk is busy with sequential reads, so it takes longer when you need to page something in. How bad this gets probably depends on whether the kernel ends up also turning over the active LRU list, or shrinking it too much; i.e. on how well the current "mash of a number of different algorithms with a number of modifications for catching corner cases and various optimisations" is working in your case.)
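
If you do run that variant, you may want to watch the page cache grow and release it again afterwards. A rough sketch, using the documented drop_caches interface (this only drops clean cache pages; it does not touch dirty data or swap):

# in another terminal, while the test runs:
free -h -s 5
# after the test, write out dirty pages and drop the clean page cache:
sync && echo 1 | sudo tee /proc/sys/vm/drop_caches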

The exact results of the first test are difficult to reproduce.

Sometimes, KB/read shows as 512 instead of 1024. In this case, the other results look more like the results from bs=512k, including a disk utilization of around 35% instead of around 20%. My question stands in either case.

If you would like to understand this behaviour, it is described here: Why is the size of my IO requests being limited, to about 512K?
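
If you just want to see the limit in question, it is visible in the request queue sysfs attributes (again a standard interface, assuming sda; 512 KB is a common value for max_sectors_kb):

cat /sys/block/sda/queue/max_sectors_kb     # current limit on the size of a single IO
cat /sys/block/sda/queue/max_hw_sectors_kb  # hardware / driver limit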


1 Answer


This was the result of a change in kernel version 5.0:

block: delete part_round_stats and switch to less precise counting

We want to convert to per-cpu in_flight counters.

The function part_round_stats needs the in_flight counter every jiffy, it would be too costly to sum all the percpu variables every jiffy, so it must be deleted. part_round_stats is used to calculate two counters - time_in_queue and io_ticks.

time_in_queue can be calculated without part_round_stats, by adding the duration of the I/O when the I/O ends (the value is almost as exact as the previously calculated value, except that time for in-progress I/Os is not counted).

io_ticks can be approximated by increasing the value when I/O is started or ended and the jiffies value has changed. If the I/Os take less than a jiffy, the value is as exact as the previously calculated value. If the I/Os take more than a jiffy, io_ticks can drift behind the previously calculated value.

(io_ticks is used in part_stat_show(), to provide the kernel IO stat for "field 10 -- # of milliseconds spent doing I/Os".)

This explains my results very nicely. In the Fedora kernel configuration, a "jiffy" is 1 millisecond. I expect a large read IO submitted by dd can be pending for more than one or two jiffies, particularly on my system, which uses an old-fashioned mechanical HDD.
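
As a rough sanity check, using the ~150 MB/s figure from the question and the 1 ms jiffy:

1 MiB per request / ~150 MB/s ≈ 7 ms per request ≈ 7 jiffies

which also matches the ~6 ms avserv reported by the 4.20 kernel below.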

When I go back to the previous kernel series 4.20.x, it shows the correct disk utilization:

$ uname -r
4.20.15-200.fc29.x86_64
$ atopsar -d 5
...
13:27:19  disk           busy read/s KB/read  writ/s KB/writ avque avserv _dsk_
13:28:49  sda             98%  149.4  1024.0    13.0     5.3   2.2   6.04 ms
13:28:54  sda             98%  146.0  1024.0     7.2     5.7   1.5   6.38 ms

This old kernel used the legacy single-queue block layer, and the cfq IO scheduler by default. The result is also the same when using the deadline IO scheduler.
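
If you want to repeat this comparison, the scheduler can be switched at runtime through sysfs (a standard interface, again assuming sda):

echo deadline | sudo tee /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler   # e.g. noop [deadline] cfq on the single-queue block layer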


Update: since kernel 5.7, this approximation has been adjusted, and the command in the question shows 100% disk utilization again. The new approximation is expected to break down for some more complex workloads (though I have not noticed one yet).

block/diskstats: more accurate approximation of io_ticks for slow disks

Currently io_ticks is approximated by adding one at each start and end of requests if jiffies counter has changed. This works perfectly for requests shorter than a jiffy or if one of requests starts/ends at each jiffy.

If disk executes just one request at a time and they are longer than two jiffies then only first and last jiffies will be accounted.

Fix is simple: at the end of request add up into io_ticks jiffies passed since last update rather than just one jiffy.

Example: common HDD executes random read 4k requests around 12ms.

fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 & iostat -x 10 sdb

Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:

Before:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43

After:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99

Now io_ticks does not loose time between start and end of requests, but for queue-depth > 1 some I/O time between adjacent starts might be lost.

For load estimation "%util" is not as useful as average queue length, but it clearly shows how often disk queue is completely empty.

Fixes: 5b18b5a ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
