My first answer to this question was that "IO pressure" is strongly related to the concept of "CPU IO wait time". The CPU IO wait times I was seeing also looked far too low to me, so I did not expect "IO pressure" to make any more sense.
However, I now at least have a workaround for the problem with CPU iowait time: booting with nohz=off gives the iowait times I expected. Alternatively, I can achieve the same effect by forcing polling instead of CPU idling, which also ensures the scheduler tick is never stopped:
cd /sys/devices/system/cpu
# Disable all cpuidle states, including C1 ("halt"), but not POLL
for i in cpu*/cpuidle/state[1-9]*/disable; do echo 1 | sudo tee $i; done
# Reset to default (C-states enabled)
for i in cpu*/cpuidle/state*/disable; do echo 0 | sudo tee $i; done
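If it helps, the effect can be double-checked from the same directory. This is just a quick sanity check, assuming the usual sysfs cpuidle layout (the same files the commands above rely on):

# 1 means the state is currently disabled
grep . cpu*/cpuidle/state*/disable
# Cumulative residency in usec; disabled states should stop increasing
grep . cpu*/cpuidle/state*/time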
Booting with nohz=off also seems to increase the "IO pressure" values. They seem closer to what I expected, but I am not yet convinced by the new values. Also, if I understand the code correctly, it does not make sense for nohz=off to have increased the IO pressure in the same way as iowait: the original pressure stall information patch appears to be entirely event driven, and I do not see it relying on the scheduler tick.
The following atop lines were taken while running dd if=test iflag=direct bs=1M of=/dev/null (for at least sixty seconds):
CPU | sys 2% | user 1% | irq 1% | idle 300% | wait 96%
cpu | sys 1% | user 0% | irq 0% | idle 89% | cpu003 w 10%
cpu | sys 1% | user 0% | irq 0% | idle 23% | cpu001 w 76%
cpu | sys 0% | user 0% | irq 0% | idle 98% | cpu000 w 1%
cpu | sys 0% | user 0% | irq 0% | idle 90% | cpu002 w 9%
CPL | avg1 0.76 | avg5 0.92 | avg15 0.84 | | numcpu 4
PSI | cs 0/0/0 | ms 0/0/0 | mf 0/0/0 | is 83/57/44 | if 83/57/44
- cpu3 has 11% non-idle time (100-89). 10% was counted as "iowait".
- cpu1 has 77% non-idle time (100-23). 76% was counted as "iowait".
- cpu2+cpu0 have 12% non-idle time. 10% was counted as "iowait".
ms is 0, so it is not possible that we are counting some (non-negligible) iowait time as "stalled contending for memory" instead of "stalled waiting on IO". (I am not sure how this works more generally, though.)
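For cross-checking, the same pressure numbers can be read straight from the kernel while dd is running. My understanding (an assumption on my part, from reading atop) is that the is/if columns are just the avg10/avg60/avg300 percentages of the "some" and "full" lines in /proc/pressure/io:

# Requires a kernel with PSI enabled (and not disabled at boot)
watch -n 2 cat /proc/pressure/io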
psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime.
-- [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
E.g. cpu1 has 76/77 = 98.7% IO pressure (full). And that should be the biggest term in the weighted average. So I still think the reported value is quite a bit lower than expected.
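As a rough sanity check, here is the weighted average I would naively expect from the atop snapshot above. It assumes PSI's per-CPU "full" time matches the iowait column and its non-idle weight matches atop's non-idle time, which is exactly the simplification the caveat below warns about:

# (iowait%, non-idle%) per CPU, copied from the atop output above
awk 'BEGIN {
    split("10 76 1 9", iowait); split("11 77 2 10", nonidle)
    for (i = 1; i <= 4; i++) { io += iowait[i]; ni += nonidle[i] }
    # weighted avg = sum(nonidle_i * iowait_i/nonidle_i) / sum(nonidle_i)
    printf "expected full IO pressure ~= %.0f%%\n", 100 * io / ni
}'

That naive estimate comes out around 96%, well above the 83 shown in the is/if avg10 column, which is roughly the size of the discrepancy I am describing.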
In the tests above, the "full" values were mostly the same as the "some" values. So I expect the mysterious discrepancies apply equally to "IO pressure (some)".
Note that the "non-idle" time used to weight the average only excludes pure idle time. So it is more than just CPU usage: it also includes "CPU IO wait time".
struct psi_group_cpu {
	/* States of the tasks belonging to this group */
	unsigned int tasks[NR_PSI_TASK_COUNTS];

	/* There are runnable or D-state tasks */
	int nonidle;
	...
};

/* and, whenever the per-CPU task counts change: */
	groupc->nonidle = tasks[NR_RUNNING] ||
			tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];
CAVEAT: PSI does not literally use the reported per-CPU "iowait". It does its own accounting. Again, my linked question shows that "iowait" can be massively under-accounted, but I think the PSI equivalent of "iowait" was intended to avoid that problem.
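If you want to reproduce the raw iowait numbers independently of atop, the per-CPU counters can be sampled from /proc/stat. A minimal sketch, assuming the usual layout where iowait is the fifth value after the cpuN label (in USER_HZ ticks):

# iowait ticks accumulated per CPU over a 10 second interval
before=$(grep '^cpu[0-9]' /proc/stat); sleep 10
after=$(grep '^cpu[0-9]' /proc/stat)
paste <(echo "$before") <(echo "$after") |
    awk '{ printf "%s: %d iowait ticks\n", $1, $(NF/2 + 6) - $6 }'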
Pinning the dd command to a single CPU using taskset -c 0, I reach 96% IO pressure if NO_IDLE_HZ is suppressed, and 90% otherwise. These figures seem much more plausible to me.
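For reference, the pinned test was along these lines; this is my reconstruction from the commands mentioned above, not a verbatim transcript:

# Pin the dd reader to cpu0, then check the pressure file as before
taskset -c 0 dd if=test iflag=direct bs=1M of=/dev/null &
grep full /proc/pressure/io    # repeat after a minute or so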
full avg10=92.84 and idle is still almost 100% (using /sys/devices/system/cpu/cpu*/cpuidle/state*/time). – Doug Smythies Jul 07 '19 at 18:32