My first answer to this question was that "IO pressure" is strongly related to the concept of "CPU IO wait time". The CPU IO wait times I was seeing also looked far too low to me, so I did not expect "IO pressure" to make any more sense.
However, I now at least have a workaround for the problem with CPU iowait time: booting with nohz=off gives the iowait times I expected. Alternatively, I can achieve the same effect by forcing polling instead of CPU idling, which also ensures the scheduler tick is never stopped:
cd /sys/devices/system/cpu
# Disable all cpuidle states, including C1 ("halt"), but not POLL
for i in cpu*/cpuidle/state[1-9]*/disable; do echo 1 | sudo tee $i; done
# Reset to default (C-states enabled)
for i in cpu*/cpuidle/state*/disable; do echo 0 | sudo tee $i; done
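If it helps, the effect can be double-checked from the same directory. This is just a quick sanity check, assuming the usual sysfs cpuidle layout (the same files the commands above rely on):

# 1 means the state is currently disabled
grep . cpu*/cpuidle/state*/disable
# Cumulative residency in usec; disabled states should stop increasing
grep . cpu*/cpuidle/state*/time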
Booting with nohz=off also seems to increase the "IO pressure" values. They seem closer to what I expected, but I am not yet convinced by the new values. Also, if I understand the code correctly, it does not make sense for nohz=off to have increased the IO pressure in the same way as iowait: the original pressure stall information patch appears to be entirely event driven, and I do not see it relying on the scheduler tick.
The following atop lines were taken while running dd if=test iflag=direct bs=1M of=/dev/null (for at least sixty seconds):
CPU | sys 2% | user 1% | irq 1% | idle 300% | wait 96%
cpu | sys 1% | user 0% | irq 0% | idle 89% | cpu003 w 10%
cpu | sys 1% | user 0% | irq 0% | idle 23% | cpu001 w 76%
cpu | sys 0% | user 0% | irq 0% | idle 98% | cpu000 w 1%
cpu | sys 0% | user 0% | irq 0% | idle 90% | cpu002 w 9%
CPL | avg1 0.76 | avg5 0.92 | avg15 0.84 | | numcpu 4
PSI | cs 0/0/0 | ms 0/0/0 | mf 0/0/0 | is 83/57/44 | if 83/57/44
- cpu3 has 11% non-idle time (100-89). 10% was counted as "iowait".
- cpu1 has 77% non-idle time (100-23). 76% was counted as "iowait".
- cpu2+cpu0 have 12% non-idle time. 10% was counted as "iowait".
ms is 0, so it is not possible that we are counting some (non-negligible) iowait time as "stalled contending for memory" instead of "stalled waiting on IO". (I am not sure how this works more generally, though.)
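For cross-checking, the same pressure numbers can be read straight from the kernel while dd is running. My understanding (an assumption on my part, from reading atop) is that the is/if columns are just the avg10/avg60/avg300 percentages of the "some" and "full" lines in /proc/pressure/io:

# Requires a kernel with PSI enabled (and not disabled at boot)
watch -n 2 cat /proc/pressure/io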
psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime.
-- [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
E.g. cpu1 has 76/77 = 98.7% IO pressure (full). And that should be the biggest term in the weighted average. So I still think the reported value is quite a bit lower than expected.
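As a rough sanity check, here is the weighted average I would naively expect from the atop snapshot above. It assumes PSI's per-CPU "full" time matches the iowait column and its non-idle weight matches atop's non-idle time, which is exactly the simplification the caveat below warns about:

# (iowait%, non-idle%) per CPU, copied from the atop output above
awk 'BEGIN {
    split("10 76 1 9", iowait); split("11 77 2 10", nonidle)
    for (i = 1; i <= 4; i++) { io += iowait[i]; ni += nonidle[i] }
    # weighted avg = sum(nonidle_i * iowait_i/nonidle_i) / sum(nonidle_i)
    printf "expected full IO pressure ~= %.0f%%\n", 100 * io / ni
}'

That naive estimate comes out around 96%, well above the 83 shown in the is/if avg10 column, which is roughly the size of the discrepancy I am describing.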
In the tests above, the "full" values were mostly the same as the "some" values. So I expect the mysterious discrepancies apply equally to "IO pressure (some)".
Note that the "non-idle" time used to weight the average only excludes pure idle time. So it is more than just CPU usage: it also includes "CPU IO wait time".
struct psi_group_cpu {
	/* States of the tasks belonging to this group */
	unsigned int tasks[NR_PSI_TASK_COUNTS];

	/* There are runnable or D-state tasks */
	int nonidle;
	...
};

/* and, whenever the per-CPU task counts change: */
	groupc->nonidle = tasks[NR_RUNNING] ||
			tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];
CAVEAT: PSI does not literally use the reported per-CPU "iowait". It does its own accounting. Again, my linked question shows that "iowait" can be massively under-accounted, but I think the PSI equivalent of "iowait" was intended to avoid that problem.
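If you want to reproduce the raw iowait numbers independently of atop, the per-CPU counters can be sampled from /proc/stat. A minimal sketch, assuming the usual layout where iowait is the fifth value after the cpuN label (in USER_HZ ticks):

# iowait ticks accumulated per CPU over a 10 second interval
before=$(grep '^cpu[0-9]' /proc/stat); sleep 10
after=$(grep '^cpu[0-9]' /proc/stat)
paste <(echo "$before") <(echo "$after") |
    awk '{ printf "%s: %d iowait ticks\n", $1, $(NF/2 + 6) - $6 }'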
Pinning the dd command to a single CPU using taskset -c 0, I reach 96% IO pressure if NO_IDLE_HZ is suppressed, and 90% otherwise. These figures seem much more plausible to me.
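For reference, the pinned test was along these lines; this is my reconstruction from the commands mentioned above, not a verbatim transcript:

# Pin the dd reader to cpu0, then check the pressure file as before
taskset -c 0 dd if=test iflag=direct bs=1M of=/dev/null &
grep full /proc/pressure/io    # repeat after a minute or so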
full avg10=92.84 and idle is still almost 100% (using /sys/devices/system/cpu/cpu*/cpuidle/state*/time). – Doug Smythies Jul 07 '19 at 18:32