Massive, unpredictable I/O performance drop in Linux

Question

I've been using Debian testing without any problems for ~6 years (I just update it regularly), but recently it started to show random behaviour that can be summarized as "Low I/O performance which persists until reboot".

The problem is that all disk reads and writes suddenly slow down to ~5 MB/sec, which results in continuous reads and writes. Since the rate is so low, the disks are not mechanically stressed, but everything stays slow until I reboot.

The I/O subsystem of the computer consists of one OCZ Vertex 3 SSD and two WD Caviar Black HDDs. The SSD holds the read-heavy part of the OS and a partition on one of the HDDs holds the rest.

To diagnose the problem I tried the following without success:

top doesn't show any runaway activity in either CPU or I/O usage.
hdparm reports normal performance for the disks (I only checked -t, though).
smartctl doesn't show any problems with the disks. Long tests showed that the disks are as good as new.

The system has a Z77 chipset, 16GB of RAM and an Intel i7 3770K CPU, and the stats show no signs of saturation in RAM, I/O or CPU, but I'm not experienced enough to debug problems like this (esp. in kernel space). Any help will be appreciated.

Update 1:

I ran (forced) fsck on every partition as a precaution. All FS are clean.
Incidentally I found a BIOS upgrade which came out a month ago & applied it.
No partition is filled more than 50%.

Update 2:

The problem has not resurfaced for two days. Either fsck or the BIOS update cleared something up in the system. I'm still monitoring the issue and will close the question with a post-mortem answer.

Update 3:

Problem just resurfaced and I did some more digging. Please see the answer.

It could be a fragmentation issue. atop would tell you how busy the disks are (like when seeking all the time). — Stéphane Chazelas, Oct 08 '13 at 12:23
@StephaneChazelas, I'll look at it when I get home, but the partitions are pretty much empty and generally an event triggers the slowness. If that "event" doesn't happen, the system works like it should. Also, the disks don't sound "that busy" (I worked as an HPC cluster administrator for ~5 years, but never had a problem like this). — bayindirh, Oct 08 '13 at 12:29
Just to rule out some quirks, disable NCQ and set the I/O scheduler to noop. — frostschutz, Oct 08 '13 at 12:35
"Low I/O performance which persists until reboot" can be a broken/buggy device that seizes the bus too often for too long, which is maddeningly hard to diagnose short of swapping out hardware. — msw, Oct 08 '13 at 12:52
@msw I can get the full speed from the disks with hdparm even in the low I/O situation. This makes the situation stranger. — bayindirh, Oct 08 '13 at 12:57
How are your filesystems and block devices configured? Are you using md? LVM? — symcbean, Oct 08 '13 at 15:18
@symcbean It's vanilla GPT partitioning. Nothing advanced like md or LVM. One of the Caviars is single-partition storage. The other has parts of the OS as partitions plus another storage partition. The SSD has the homes and read-heavy parts of the OS. — bayindirh, Oct 08 '13 at 18:00
Then the next thing on my list would be to check the logs for errors and check that there's plenty of memory allocated to buffers/cache (see the output of free). — symcbean, Oct 08 '13 at 21:36
Just to localize it a bit more, is iowait shooting up? Either on the whole or on a particular process? If you check the logs or schedule self-tests via smartctl, does that give you more information to work with? — Bratchley, Oct 10 '13 at 15:53
@JoelDavis smartctl tests are all clean. I ran them while diagnosing the problem. Currently I cannot reproduce the issue, but monitoring hasn't ended yet. — bayindirh, Oct 10 '13 at 18:40
You might be able to see iowait etc. if you're collecting sar data. I'd enable sysstat if it isn't already running. You can check with sar -A; most platforms have ten-minute sample intervals. — Bratchley, Oct 10 '13 at 18:55
I'm having the exact same issue on an old laptop upgraded, version by version, up to 14.04 (kernel 4.4.0-38-generic from xenial-lts). It's been some time now; both read and write speeds are very slow (see: https://www.dropbox.com/s/z1fab4fb563bhqn/hp-pavilion-dv6-2125el-iotop.txt?dl=0).
hdparm -t --direct /dev/sda says:

/dev/sda: Timing O_DIRECT disk reads: 78 MB in 3.07 seconds = 25.41 MB/sec

So it's not an HDD issue, but something about Linux. I suspect this old ext3 filesystem mounted by the ext4 subsystem is the cause, so I'll have to do the tedious job of rsyncing everything. — Avio, Oct 07 '16 at 14:51
@Avio, it might not be an ext3 vs. ext4 issue. My problem was the number of files in the cache. Could you try the solution in my answer and see whether it remedies the problem, at least temporarily? Also, please take a look at the output of free, the size of the hard drive cache, and the disk activity when things slow down. Is your disk constantly writing something small in that case? — bayindirh, Oct 10 '16 at 14:07

bayindirh · Accepted Answer · 2016-04-17T13:06:40.050

I managed to reproduce the problem again, and it was the result of a large disk cache. My disk cache can grow beyond 8GB, and it seems that some applications don't like that, so I/O suffers.

Dropping the disk caches with echo 3 > /proc/sys/vm/drop_caches as root remedies the problem. I currently don't know why large disk caches cause this I/O degradation.
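The same step can be sketched in Python for anyone who wants to see how much cache is freed. This is a minimal sketch, assuming a Linux system with /proc mounted and root privileges for the write; the helper names (`parse_meminfo`, `drop_caches`, `report_and_drop`) are illustrative and not part of the original fix. It calls sync first because drop_caches only frees clean pages.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for size fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def drop_caches(level=3, path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the kernel to drop clean caches (needs root).

    level 1 = page cache, 2 = dentries and inodes, 3 = both.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    os.sync()  # dirty pages cannot be dropped, so write them out first
    with open(path, "w") as f:
        f.write(f"{level}\n")


def report_and_drop():
    """Print the page cache size before and after dropping caches."""
    with open("/proc/meminfo") as f:
        before = parse_meminfo(f.read())
    drop_caches(3)
    with open("/proc/meminfo") as f:
        after = parse_meminfo(f.read())
    print(f"Cached before: {before.get('Cached', 0) / 1024:.0f} MiB")
    print(f"Cached after:  {after.get('Cached', 0) / 1024:.0f} MiB")
```

Calling `report_and_drop()` as root performs the equivalent of the echo command above and shows the change in the Cached figure. Like the echo command, it is a temporary remedy: the cache fills up again with normal use.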

Last Update: After more investigation I found out that the number of files in the cache was triggering the problem. The system was thrashing the disks while trying to commit many small files back to disk. Since I had been using the system for ten years, I took the plunge and reinstalled with 64-bit Debian. Now it's working smoothly. It was probably a side effect of ten years of upgrades combined with hitting the limits of a 32-bit operating system.
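For readers who want to check whether they are in a similar state, the kernel exposes counters for cached directory entries and inodes, along with the amount of dirty data waiting to be written. The following is a rough sketch under the assumption of a Linux /proc layout; the function names are hypothetical, and the numbers only hint at the condition described above rather than prove it.

```python
def parse_fs_counters(dentry_state_text, inode_nr_text):
    """Return cached dentry/inode counts from the two /proc/sys/fs files."""
    dentry = [int(x) for x in dentry_state_text.split()]
    inode = [int(x) for x in inode_nr_text.split()]
    return {
        "dentries": dentry[0],         # first field of dentry-state
        "unused_dentries": dentry[1],  # second field of dentry-state
        "inodes": inode[0],            # first field of inode-nr
        "free_inodes": inode[1],       # second field of inode-nr
    }


def parse_writeback(meminfo_text):
    """Return the Dirty and Writeback sizes (kB) from /proc/meminfo text."""
    wanted = ("Dirty", "Writeback")
    sizes = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            sizes[name] = int(rest.split()[0])
    return sizes


def snapshot():
    """Read the live counters (Linux only; no root needed)."""
    with open("/proc/sys/fs/dentry-state") as f:
        dentry_text = f.read()
    with open("/proc/sys/fs/inode-nr") as f:
        inode_text = f.read()
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    result = parse_fs_counters(dentry_text, inode_text)
    result.update(parse_writeback(meminfo_text))
    return result
```

Calling `snapshot()` once while the system is healthy and again during a slowdown gives two dictionaries to compare. A Dirty value that stays high while the disks write continuously at a low rate would be consistent with many small files being flushed.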

score 2 · Answer 2 · answered Oct 08 '13 at 12:26

Are there any suspicious messages in dmesg?

Some more tools you could try to gain some insights into your system's bottlenecks:

dstat
latencytop
sysprof
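If none of these tools is installed, the iowait share that they report can be approximated directly from /proc/stat. This is a minimal sketch, assuming a reasonably recent Linux kernel where the aggregate cpu line has at least eight counters; the function names are made up for illustration.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    # Sum user..steal only; guest time is already counted in user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    # Field order: user, nice, system, idle, iowait, ...
    return 100.0 * deltas[4] / total


def measure_iowait(interval=1.0, path="/proc/stat"):
    """Sample /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return iowait_percent(before, after)
```

Running `measure_iowait(5)` during a slowdown and again on a healthy system shows whether the CPUs are mostly waiting on I/O. It is a system-wide figure, so a per-process tool such as iotop is still needed to find out which process is responsible.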

Elias Probst

Nothing suspicious in any logs. TBH, there are no log entries related to this problem. I'll try the tools nevertheless. There shouldn't be a bottleneck in a high-end PC while it sits idle with nothing running on it. I think a cache or something related to the I/O subsystem goes awry. – bayindirh Oct 08 '13 at 12:32
....and iotop, fio – symcbean Oct 08 '13 at 15:19
