I am trying to improve my understanding, following this (so far) unanswered question: Possible limiting factor during upgrade of Fedora VM - not disk or cpu or network?
I ran the following test load, which took 200 seconds to complete.
sudo perf trace -s time perf stat dnf -y --releasever=30 --installroot=$HOME/fedora-30 --disablerepo='*' --enablerepo=fedora --enablerepo=updates install systemd passwd dnf fedora-release vim-minimal
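Note that each run installs into a fresh root, so when repeating the test the previous tree presumably needs removing first to keep runs comparable. A minimal sketch, assuming nothing else is stored under that path:

sudo rm -rf "$HOME/fedora-30"   # discard the install root left by the previous run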
I am running this on a fairly default, straightforward install of Fedora Workstation 29. It is not a VM. The kernel version is 5.0.9-200.fc29.x86_64. The IO scheduler is mq-deadline.

I use LVM and the ext4 filesystem. I am not using any encryption on my disk or filesystem. I do not have any network filesystem mounted at all, so I am not reading or writing a network filesystem.
I have 4 "CPUs": 2 cores with 2 threads each.
I have only one disk, /dev/sda, which is a SATA HDD. The HDD supports NCQ: cat /sys/class/block/sda/device/queue_depth shows 32.
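For reference, all of these details can be read back without special tools; a quick sketch using sysfs and lsblk (device name sda as above):

lsblk -o NAME,TYPE,FSTYPE,MOUNTPOINT    # shows the single HDD, the LVM volumes and ext4
cat /sys/block/sda/queue/scheduler      # active scheduler appears in brackets, e.g. [mq-deadline]
cat /sys/block/sda/queue/rotational     # 1 = spinning disk
nproc                                   # number of online CPUs, 4 here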
vmstat 5 showed that non-idle CPU time sometimes rose to about one CPU's worth, i.e. idle was as low as about 75%.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
...
1 1 3720 1600980 392948 3669196 0 0 14 30634 8769 1876 4 3 78 15 0
1 1 3720 1600144 393128 3669884 0 0 0 3460 1406 1468 0 1 80 18 0
0 1 3720 1599424 393356 3670532 0 0 0 6830 1416 1480 0 1 73 25 0
0 1 3720 1598448 393700 3671108 0 0 0 7766 1420 1362 0 1 78 20 0
0 1 3720 1597296 393940 3671616 0 0 0 5401 1374 1367 1 2 87 11 0
0 1 3720 1596332 394312 3672416 0 0 0 1162 1647 1558 1 2 85 13 0
3 0 3720 1652064 394564 3673540 0 0 0 17897 15406 1784 1 3 74 23 0
0 0 3720 1972876 394600 3541064 0 0 0 522 8028 18327 3 3 84 10 0
1 0 3720 1974152 394600 3541108 0 0 0 9 422 879 2 1 97 0 0
0 0 3720 1974136 394600 3541120 0 0 0 0 204 455 0 0 99 0 0
(end of test)
And the "IO wait" time (the wa
field under cpu
in vmstat
) rose as high as 25%. I think this means 100% of one CPU. But atopsar -d 5
showed the utilization of my disk did not directly match this. It was much less than 100%:
22:46:44 disk busy read/s KB/read writ/s KB/writ avque avserv _dsk_
...
22:49:34 sda 5% 0.4 4.0 69.5 413.0 36.9 0.68 ms
22:49:39 sda 8% 0.2 60.0 120.6 30.6 18.7 0.66 ms
22:49:44 sda 8% 0.0 0.0 136.2 16.7 20.4 0.61 ms
22:49:49 sda 10% 0.0 0.0 157.1 44.2 21.4 0.65 ms
22:49:54 sda 9% 0.0 0.0 196.4 39.3 48.0 0.47 ms
22:49:59 sda 9% 0.0 0.0 148.9 36.6 32.6 0.62 ms
22:50:04 sda 10% 0.0 0.0 137.3 130.6 37.2 0.70 ms
22:50:09 sda 11% 0.0 0.0 199.6 5.4 13.5 0.55 ms
22:50:14 sda 2% 0.0 0.0 50.2 4.5 11.8 0.32 ms
22:50:19 sda 0% 0.0 0.0 0.8 11.0 13.3 0.75 ms
(end of test)
How can "IO wait" time be higher than disk utilization?
Following is the definition taken from the sar manpage:
%iowait:
Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
Therefore, %iowait means that from the CPU point of view, no tasks were runnable, but at least one I/O was in progress. iowait is simply a form of idle time when nothing could be scheduled. The value may or may not be useful in indicating a performance problem, but it does tell the user that the system is idle and could have taken more work.
https://support.hpe.com/hpsc/doc/public/display?docId=c02783994
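That definition is straightforward to demonstrate: a single task blocked on synchronous writes can account for at most one CPU's worth of iowait, however many CPUs the machine has. A minimal sketch, assuming a scratch file on the same HDD is acceptable:

# one task doing synchronous writes; on 4 CPUs, wa should peak near 25%
dd if=/dev/zero of=/var/tmp/iowait-demo bs=1M count=1024 oflag=sync &
vmstat 5 6    # six 5-second samples; watch the wa column
wait          # let dd finish
rm /var/tmp/iowait-demo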
"IO wait" is tricksy to define on multi-CPU systems. See How does a CPU know there is IO pending? . But even if you think I was wrong to multiply the above "IO wait" figure by 4, it would still be higher than the disk utilization figure!
I expect the disk utilization figure in atopsar -d (and equally in atop / sar -d / iostat -x / mxiostat.py) is calculated from one of the kernel iostat fields. The linked doc mentions "Field 10 -- # of milliseconds spent doing I/Os". There is a more detailed definition as well, although I am not sure that the functions it mentions still exist in the current multi-queue block layer.
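That figure can be cross-checked by hand: sample the field twice and divide the delta by the wall-clock interval. A rough sketch against /sys/block/sda/stat, where the same field numbering applies:

# field 10 = milliseconds the device spent with IO in flight
t1=$(awk '{print $10}' /sys/block/sda/stat)
sleep 5
t2=$(awk '{print $10}' /sys/block/sda/stat)
echo "busy: $(( (t2 - t1) / 50 ))%"   # busy ms out of 5000 ms, as a percentage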
Thanks to the perf in the test command, I can also report that dnf's fdatasync() calls were responsible for 81 out of the 200 seconds of elapsed time. This evidence suggests to me that the "IO wait" figure is giving a more accurate impression than the disk utilization figure.
199.440461084 seconds time elapsed
60.043226000 seconds user
11.858057000 seconds sys
60.07user 12.17system 3:19.84elapsed 36%CPU (0avgtext+0avgdata 437844maxresident)k
496inputs+2562448outputs (25major+390722minor)pagefaults 0swaps
Summary of events:
...
dnf (6177), 2438150 events, 76.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
fdatasync 3157 81436.062 0.160 25.795 465.329 1.92%
wait4 51 43911.444 0.017 861.009 32770.388 74.93%
poll 45188 6052.070 0.001 0.134 96.764 5.58%
read 341567 2718.531 0.001 0.008 1372.100 50.63%
write 120402 1616.062 0.002 0.013 155.142 9.61%
getpid 50012 755.423 0.001 0.015 207.506 32.86%
...
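To reproduce just the fdatasync accounting, perf trace can be restricted to a single syscall; something like the following should work, using the same dnf arguments as in the test command above:

# summarize only fdatasync calls instead of all syscalls
sudo perf trace -s -e fdatasync dnf -y --releasever=30 --installroot=$HOME/fedora-30 --disablerepo='*' --enablerepo=fedora --enablerepo=updates install systemd passwd dnf fedora-release vim-minimal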
The scheduler here is mq-deadline. It is "work-conserving": mq-deadline always tries to fill the device queue first, and my SATA HDD has a standard NCQ with room for 15 requests. This is in contrast to schedulers like bfq or the old CFQ, which may choose to deliberately idle the device for a bit, in an attempt to get the best performance from sequential IO. – sourcejedi May 03 '19 at 14:55
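For comparison, the scheduler can be switched at runtime without a reboot; a sketch, assuming bfq is available for the running kernel:

# switch sda from mq-deadline to bfq, then rerun the test to compare
echo bfq | sudo tee /sys/block/sda/queue/scheduler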