
I have a CentOS 7 server (kernel 3.10.0-957.12.1.el7.x86_64) with two NVMe disks and the following setup:

# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1       259:0    0   477G  0 disk
├─nvme0n1p1   259:2    0   511M  0 part  /boot/efi
├─nvme0n1p2   259:4    0  19.5G  0 part
│ └─md2         9:2    0  19.5G  0 raid1 /
├─nvme0n1p3   259:7    0   511M  0 part  [SWAP]
└─nvme0n1p4   259:9    0 456.4G  0 part
  └─data-data 253:0    0 912.8G  0 lvm   /data
nvme1n1       259:1    0   477G  0 disk
├─nvme1n1p1   259:3    0   511M  0 part
├─nvme1n1p2   259:5    0  19.5G  0 part
│ └─md2         9:2    0  19.5G  0 raid1 /
├─nvme1n1p3   259:6    0   511M  0 part  [SWAP]
└─nvme1n1p4   259:8    0 456.4G  0 part
  └─data-data 253:0    0 912.8G  0 lvm   /data

Our monitoring and iostat continually show nvme0n1 and nvme1n1 at 80%+ I/O utilization, while the individual partitions show 0% utilization and the drives are nowhere near their limits (250k IOPS, 1 GB/s read/write).
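
For reference, the output below is iostat's extended device statistics; an invocation along these lines produces it (the one-second interval is only illustrative):

# extended per-device statistics; with an interval argument, the second and
# later reports show rates for that interval rather than since-boot averages
iostat -x 1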

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.14    0.00    3.51    0.00    0.00   89.36

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00    0.00   50.50     0.00   222.00     8.79     0.73    0.02    0.00    0.02  14.48  73.10
nvme1n1p1         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme1n1p2         0.00     0.00    0.00   49.50     0.00   218.00     8.81     0.00    0.02    0.00    0.02   0.01   0.05
nvme1n1p3         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme1n1p4         0.00     0.00    0.00    1.00     0.00     4.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme0n1           0.00     0.00    0.00   49.50     0.00   218.00     8.81     0.73    0.02    0.00    0.02  14.77  73.10
nvme0n1p1         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme0n1p2         0.00     0.00    0.00   49.50     0.00   218.00     8.81     0.00    0.02    0.00    0.02   0.01   0.05
nvme0n1p3         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme0n1p4         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00   48.50     0.00   214.00     8.82     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    1.00     0.00     4.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00

Any idea what the root cause of this behavior could be?
Everything seems to be working fine, except that monitoring keeps triggering high-I/O alerts.

mike
  • The nvme1n1p2 and nvme0n1p2 are showing activity, and they are the components of your md2 RAID1 device. Perhaps the RAID set is syncing or scrubbing in the background? Look into /proc/mdstat to see the RAID device status. The monitoring results might be a quirk resulting from how the background syncing/scrubbing is implemented in the kernel. If so, it should be basically "soft workload" that will be automatically restricted when there are actual user-space disk I/O operations to do. – telcoM May 08 '19 at 06:30
  • @telcoM - seems ok. md2 : active raid1 nvme0n1p2[0] nvme1n1p2[1] and 20478912 blocks [2/2] [UU] – mike May 08 '19 at 08:01
  • Just to add - this happens on multiple servers. Hosted with the same provider though. – mike May 23 '19 at 09:24
  • Do you have the scheduler of the NVMEs set to none? https://access.redhat.com/solutions/3901291 – Thomas May 25 '19 at 10:07
  • @Thomas yes it's set to none – mike May 25 '19 at 12:15
  • Sorry, just saw that you need a login for that link. Basically it says that iostat reports wrong utilization values for devices configured with the none scheduler, which is the default for NVMe drives. It seems to be a known bug that has already been identified. – Thomas May 25 '19 at 13:40
  • This one also looks like your issue. Maybe you can double check by setting a different scheduler on the NVMEs? – Thomas May 25 '19 at 13:44
  • @Thomas the CentOS bug seems to be this issue. Newer kernel will fix it. Feel free to submit an answer, I'll accept it. Thank you for helping out! – mike May 25 '19 at 17:31
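
For reference, the checks discussed in the comments above can be run as follows (a minimal sketch; the sysfs/procfs paths are standard and the device names are taken from the lsblk output):

# Is the RAID1 set resyncing or scrubbing? Look for "resync", "recovery" or
# "check" lines; "idle" in sync_action means no background activity.
cat /proc/mdstat
cat /sys/block/md2/md/sync_action

# Which I/O scheduler is active on the NVMe devices? The active one is shown
# in brackets (or the file may simply read "none" if no alternatives exist).
cat /sys/block/nvme0n1/queue/scheduler
cat /sys/block/nvme1n1/queue/scheduler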

2 Answers


The source of the bogus %util and svctm values in the iostat output appears to be a kernel bug that will be fixed in kernel-3.10.0-1036.el7, i.e. with the RHEL/CentOS 7.7 release. Devices whose I/O scheduler is set to none are affected, which is the default for NVMe drives.

For reference, there is a Red Hat solution (login required) that describes the bug.
In the CentOS bug report, someone wrote that the issue will be fixed with the kernel/release version mentioned above.

Until the new kernel is available, changing the scheduler should work around the issue. Since it appears to affect only the metrics and not actual performance, another option is to simply ignore them until the new kernel arrives.
I cannot verify this myself for lack of an NVMe drive; maybe @michal kralik can.
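
As a sketch, a temporary scheduler switch can be made through sysfs; which schedulers are offered depends on the kernel build, and mq-deadline below is only an example (use whatever the first command actually lists):

# list the schedulers the kernel offers; the active one is shown in brackets
# (the file may just read "none" if no alternatives are available)
cat /sys/block/nvme0n1/queue/scheduler
cat /sys/block/nvme1n1/queue/scheduler

# if an alternative such as mq-deadline is listed, switch to it at runtime
# (takes effect immediately, but does not survive a reboot)
echo mq-deadline > /sys/block/nvme0n1/queue/scheduler
echo mq-deadline > /sys/block/nvme1n1/queue/scheduler

To make such a change persist across reboots, one would typically use a udev rule that sets ATTR{queue/scheduler} for the NVMe devices, or a tuned profile.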

Thomas

I/O directed to the individual logical partitions is remapped by the Linux kernel to the underlying physical device, which actually performs the I/O.

John Doe
  • Nope, it is accounted to both: https://elixir.bootlin.com/linux/latest/source/block/bio.c#L1735 (the disk busy% metric is calculated from io_ticks; see this question and answer: https://unix.stackexchange.com/questions/517132/dd-is-running-at-full-speed-but-i-only-see-20-disk-utilization-why). The true cause is most likely the one mentioned in the comments underneath the question. – sourcejedi May 27 '19 at 17:10
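
For reference, the io_ticks counter mentioned above can be read directly; on this kernel it is the 13th field of /proc/diskstats (cumulative milliseconds spent doing I/O), so comparing the whole disks with their partitions shows where the busy time is being accounted (a minimal sketch):

# dump the cumulative io_ticks ("device busy" time, in ms) for the NVMe disks
# and their partitions; iostat's %util is derived from the per-interval delta
awk '$3 ~ /^nvme[01]n1(p[1-4])?$/ { printf "%-12s %s\n", $3, $13 }' /proc/diskstats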