I am trying to improve my understanding, following this (so far) unanswered question: Possible limiting factor during upgrade of Fedora VM - not disk or cpu or network?
I ran the following test load, which took 200 seconds to complete.
sudo perf trace -s time perf stat dnf -y --releasever=30 --installroot=$HOME/fedora-30 --disablerepo='*' --enablerepo=fedora --enablerepo=updates install systemd passwd dnf fedora-release vim-minimal
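Note that each run installs into a fresh root, so when repeating the test the previous tree presumably needs removing first to keep runs comparable. A minimal sketch, assuming nothing else is stored under that path:

sudo rm -rf "$HOME/fedora-30"   # discard the install root left by the previous run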
I am running this on a fairly default, straightforward install of Fedora Workstation 29. It is not a VM. The kernel version is 5.0.9-200.fc29.x86_64. The IO scheduler is mq-deadline.

I use LVM and the ext4 filesystem. I am not using any encryption on my disk or filesystem. I do not have any network filesystem mounted at all, so I am not reading or writing a network filesystem.
I have 4 "CPUs": 2 cores with 2 threads each.
I have only one disk, /dev/sda, which is a SATA HDD. The HDD supports NCQ: cat /sys/class/block/sda/device/queue_depth shows 32.
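For reference, all of these details can be read back without special tools; a quick sketch using sysfs and lsblk (device name sda as above):

lsblk -o NAME,TYPE,FSTYPE,MOUNTPOINT    # shows the single HDD, the LVM volumes and ext4
cat /sys/block/sda/queue/scheduler      # active scheduler appears in brackets, e.g. [mq-deadline]
cat /sys/block/sda/queue/rotational     # 1 = spinning disk
nproc                                   # number of online CPUs, 4 here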
vmstat 5 showed that non-idle CPU time sometimes rose to about one CPU's worth, i.e. idle was as low as about 75%.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
...
1 1 3720 1600980 392948 3669196 0 0 14 30634 8769 1876 4 3 78 15 0
1 1 3720 1600144 393128 3669884 0 0 0 3460 1406 1468 0 1 80 18 0
0 1 3720 1599424 393356 3670532 0 0 0 6830 1416 1480 0 1 73 25 0
0 1 3720 1598448 393700 3671108 0 0 0 7766 1420 1362 0 1 78 20 0
0 1 3720 1597296 393940 3671616 0 0 0 5401 1374 1367 1 2 87 11 0
0 1 3720 1596332 394312 3672416 0 0 0 1162 1647 1558 1 2 85 13 0
3 0 3720 1652064 394564 3673540 0 0 0 17897 15406 1784 1 3 74 23 0
0 0 3720 1972876 394600 3541064 0 0 0 522 8028 18327 3 3 84 10 0
1 0 3720 1974152 394600 3541108 0 0 0 9 422 879 2 1 97 0 0
0 0 3720 1974136 394600 3541120 0 0 0 0 204 455 0 0 99 0 0
(end of test)
And the "IO wait" time (the wa
field under cpu
in vmstat
) rose as high as 25%. I think this means 100% of one CPU. But atopsar -d 5
showed the utilization of my disk did not directly match this. It was much less than 100%:
22:46:44 disk busy read/s KB/read writ/s KB/writ avque avserv _dsk_
...
22:49:34 sda 5% 0.4 4.0 69.5 413.0 36.9 0.68 ms
22:49:39 sda 8% 0.2 60.0 120.6 30.6 18.7 0.66 ms
22:49:44 sda 8% 0.0 0.0 136.2 16.7 20.4 0.61 ms
22:49:49 sda 10% 0.0 0.0 157.1 44.2 21.4 0.65 ms
22:49:54 sda 9% 0.0 0.0 196.4 39.3 48.0 0.47 ms
22:49:59 sda 9% 0.0 0.0 148.9 36.6 32.6 0.62 ms
22:50:04 sda 10% 0.0 0.0 137.3 130.6 37.2 0.70 ms
22:50:09 sda 11% 0.0 0.0 199.6 5.4 13.5 0.55 ms
22:50:14 sda 2% 0.0 0.0 50.2 4.5 11.8 0.32 ms
22:50:19 sda 0% 0.0 0.0 0.8 11.0 13.3 0.75 ms
(end of test)
How can "IO wait" time be higher than disk utilization?
Following is the definition taken from the sar manpage:
%iowait:
Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
Therefore, %iowait means that from the CPU point of view, no tasks were runnable, but at least one I/O was in progress. iowait is simply a form of idle time when nothing could be scheduled. The value may or may not be useful in indicating a performance problem, but it does tell the user that the system is idle and could have taken more work.
https://support.hpe.com/hpsc/doc/public/display?docId=c02783994
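That definition is straightforward to demonstrate: a single task blocked on synchronous writes can account for at most one CPU's worth of iowait, however many CPUs the machine has. A minimal sketch, assuming a scratch file on the same HDD is acceptable:

# one task doing synchronous writes; on 4 CPUs, wa should peak near 25%
dd if=/dev/zero of=/var/tmp/iowait-demo bs=1M count=1024 oflag=sync &
vmstat 5 6    # six 5-second samples; watch the wa column
wait          # let dd finish
rm /var/tmp/iowait-demo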
"IO wait" is tricksy to define on multi-CPU systems. See How does a CPU know there is IO pending? . But even if you think I was wrong to multiply the above "IO wait" figure by 4, it would still be higher than the disk utilization figure!
I expect the disk utilization figure in atopsar -d (and equally in atop / sar -d / iostat -x / mxiostat.py) is calculated from one of the kernel iostat fields. The linked doc mentions "Field 10 -- # of milliseconds spent doing I/Os". There is a more detailed definition as well, although I am not sure that the functions it mentions still exist in the current multi-queue block layer.
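That figure can be cross-checked by hand: sample the field twice and divide the delta by the wall-clock interval. A rough sketch against /sys/block/sda/stat, where the same field numbering applies:

# field 10 = milliseconds the device spent with IO in flight
t1=$(awk '{print $10}' /sys/block/sda/stat)
sleep 5
t2=$(awk '{print $10}' /sys/block/sda/stat)
echo "busy: $(( (t2 - t1) / 50 ))%"   # busy ms out of 5000 ms, as a percentage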
Thanks to the perf in the test command, I can also report that dnf's fdatasync() calls were responsible for 81 out of the 200 seconds of elapsed time. This evidence suggests to me that the "IO wait" figure is giving a more accurate impression than the disk utilization figure.
199.440461084 seconds time elapsed
60.043226000 seconds user
11.858057000 seconds sys
60.07user 12.17system 3:19.84elapsed 36%CPU (0avgtext+0avgdata 437844maxresident)k
496inputs+2562448outputs (25major+390722minor)pagefaults 0swaps
Summary of events:
...
dnf (6177), 2438150 events, 76.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
fdatasync 3157 81436.062 0.160 25.795 465.329 1.92%
wait4 51 43911.444 0.017 861.009 32770.388 74.93%
poll 45188 6052.070 0.001 0.134 96.764 5.58%
read 341567 2718.531 0.001 0.008 1372.100 50.63%
write 120402 1616.062 0.002 0.013 155.142 9.61%
getpid 50012 755.423 0.001 0.015 207.506 32.86%
...
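To reproduce just the fdatasync accounting, perf trace can be restricted to a single syscall; something like the following should work, using the same dnf arguments as in the test command above:

# summarize only fdatasync calls instead of all syscalls
sudo perf trace -s -e fdatasync dnf -y --releasever=30 --installroot=$HOME/fedora-30 --disablerepo='*' --enablerepo=fedora --enablerepo=updates install systemd passwd dnf fedora-release vim-minimal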
The scheduler here is mq-deadline. It is "work-conserving": mq-deadline always tries to fill the device queue first, and my SATA HDD has a standard NCQ with room for 15 requests. This is in contrast to schedulers like bfq or the old CFQ, which may choose to deliberately idle the device for a bit, in an attempt to get the best performance from sequential IO. – sourcejedi May 03 '19 at 14:55
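For comparison, the scheduler can be switched at runtime without a reboot; a sketch, assuming bfq is available for the running kernel:

# switch sda from mq-deadline to bfq, then rerun the test to compare
echo bfq | sudo tee /sys/block/sda/queue/scheduler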