We have four identical Linux servers, each with a large (5 TB) hard-disk partition. They run Scientific Linux with this kernel:
Linux s3.law.di.unimi.it 2.6.32-358.18.1.el6.x86_64 #1 SMP Tue Aug 27 14:23:09 CDT 2013 x86_64 x86_64 x86_64 GNU/Linux
The servers are identically configured and installed. But one, and only one, of the servers is ridiculously slow when writing with ext4. If I run
dd if=/dev/zero of=/mnt/big/analysis/test
and list the output file twice, about thirty seconds apart (l is an alias for ls -l, and the prompt shows the time), I get
l -tr
total 11M
-rw-r--r-- 1 root root 11M Apr 20 10:01 test
10:01:42 [s3] /mnt/big/analysis
l -tr
total 16M
-rw-r--r-- 1 root root 16M Apr 20 10:02 test
10:02:13 [s3] /mnt/big/analysis
So, 5 MB in about 30 s. All the other servers write more than an order of magnitude faster.
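(As frostschutz points out in the comments below, a plain dd from /dev/zero mostly measures syscall overhead at the default 512-byte block size. A more meaningful version of the same test, using only standard GNU dd options, would be something along these lines:)

# write 1 GB in 1 MB blocks and flush file data to disk before reporting the rate
dd if=/dev/zero of=/mnt/big/analysis/test bs=1M count=1024 conv=fdatasync

With conv=fdatasync the reported throughput reflects the disk rather than the page cache.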
The machines have 64 GB of RAM and 32 cores, and there is no I/O or CPU activity, although 90% of memory is occupied by a large Java process that is doing nothing. Only one machine is writing slowly.
SMART says everything is OK:
# smartctl -H /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
SMART Health Status: OK
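(As noted in the comments below, -H alone says very little; the full attribute and error-log dump is worth checking too:)

smartctl -a /dev/sdb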
hdparm shows no problem with reads:
# hdparm -t /dev/sdb
/dev/sdb:
Timing buffered disk reads: 1356 MB in 3.00 seconds = 451.60 MB/sec
The partition is mounted as follows:
/dev/sdb1 on /mnt/big type ext4 (rw,noatime,data=writeback)
and we set
tune2fs -o journal_data_writeback /dev/sdb1
for performance.
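(A quick way to double-check that both settings actually took effect: the journal mode set by tune2fs is stored in the superblock, and the live mount options appear in /proc/mounts. tune2fs is part of e2fsprogs:)

# the default mount options recorded in the superblock should include journal_data_writeback
tune2fs -l /dev/sdb1 | grep -i 'mount options'
# the live mount should show data=writeback
grep /mnt/big /proc/mounts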
I tried to find anything that could explain why this specific server writes so slowly (that is, any difference in the output of some diagnostic tool), but with no results.
Just to complete the picture: we started a crawl on all servers, with the partitions essentially empty. The crawl created a number of files on the partition, in particular a 156 GB file (the store of the crawl). The crawl started OK, but after a few hours we saw slowdowns (apparently as the store was growing). When we checked, we noticed that writing to disk was getting slower, and slower, and slower.
We stopped everything (no CPU activity, no I/O), but dd still showed the behaviour above. The other three servers, in identical conditions and with the same files, work perfectly, both during the crawl and when using dd.
Frankly, I don't even know where to look. Does this ring a bell for anyone? Which diagnostic tools could I use to understand what's happening, or which tests should I try?
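(One cheap check, given the delayed-write suspicion that emerges in the comments below, is to watch the kernel's writeback accounting while the slowdown is happening. A Dirty figure that grows large and drains only slowly would point at writeback rather than at the disk itself:)

# dirty pages waiting to be written out, and pages currently under writeback
watch -n 5 "grep -E '^(Dirty|Writeback):' /proc/meminfo"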
Update
Besides the link posted, I thought it would be a good idea to run the same tests on two servers while they were running the crawler. It's interesting. For instance, vmstat 10 gives
Good
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 68692 9009864 70832 29853400 0 0 15 214 9 2 11 1 88 0 0
10 0 68692 8985620 70824 29883460 0 0 48 7402 79465 62898 12 1 86 0 0
11 0 68692 8936780 70824 29928696 0 0 54 6842 81500 66665 15 1 83 0 0
10 2 68692 8867280 70840 30000104 0 0 65 36578 80252 66272 14 1 85 0 0
15 0 68692 8842960 70840 30026724 0 0 61 3667 81245 65161 14 1 85 0 0
Bad
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
13 0 8840 14015100 92972 25683108 0 0 11 104 3 9 4 1 94 0 0
2 0 8840 14015800 92984 25696108 0 0 49 16835 38619 54204 2 2 96 0 0
1 0 8840 14026004 93004 25699940 0 0 33 4152 25914 43530 0 2 98 0 0
1 0 8840 14032272 93012 25703716 0 0 30 1164 25062 43230 0 2 98 0 0
2 0 8840 14029632 93020 25709448 0 0 24 5619 23475 40080 0 2 98 0 0
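(Note in particular the in and cs columns: the good server sustains roughly 80,000 interrupts per second with 12-15% user CPU, while the bad one sits at 23,000-39,000 interrupts per second with user CPU near zero, as if every thread were stuck waiting.)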
And iostat -x -k 5
Good
Linux 2.6.32-358.18.1.el6.x86_64 (s0.law.di.unimi.it) 04/21/2014 _x86_64_ (64 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
10.65 0.00 1.02 0.11 0.00 88.22
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.11 3338.25 17.98 56.52 903.55 13579.19 388.79 7.32 98.30 1.23 9.18
sda 0.39 0.72 0.49 0.76 11.68 5.90 28.25 0.01 11.25 3.41 0.43
avg-cpu: %user %nice %system %iowait %steal %idle
15.86 0.00 1.33 0.03 0.00 82.78
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 1106.20 9.20 31.00 36.80 4549.60 228.18 0.41 10.09 0.39 1.58
sda 0.00 2.20 0.80 3.00 4.80 20.80 13.47 0.04 10.53 3.21 1.22
avg-cpu: %user %nice %system %iowait %steal %idle
15.42 0.00 1.23 0.01 0.00 83.34
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 1205.40 8.00 33.60 40.80 4956.00 240.23 0.39 9.43 0.33 1.38
sda 0.00 0.60 0.00 1.00 0.00 6.40 12.80 0.01 5.20 4.20 0.42
Bad
Linux 2.6.32-358.18.1.el6.x86_64 (s2.law.di.unimi.it) 04/21/2014 _x86_64_ (64 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
4.37 0.00 1.41 0.06 0.00 94.16
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.06 1599.70 13.76 38.23 699.27 6551.73 278.96 3.12 59.96 0.99 5.16
sda 0.46 3.17 1.07 0.78 22.51 15.85 41.26 0.03 16.10 2.70 0.50
avg-cpu: %user %nice %system %iowait %steal %idle
11.93 0.00 2.99 0.60 0.00 84.48
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 14885.40 13.60 141.20 54.40 60106.40 777.27 34.90 225.45 1.95 30.14
sda 0.00 0.40 0.00 0.80 0.00 4.80 12.00 0.01 7.00 3.25 0.26
avg-cpu: %user %nice %system %iowait %steal %idle
11.61 0.00 2.51 0.16 0.00 85.71
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 2245.40 10.60 51.20 42.40 9187.20 298.69 3.51 56.80 2.04 12.58
sda 0.00 0.40 0.00 0.80 0.00 4.80 12.00 0.01 6.25 3.25 0.26
So (if I understand the output correctly) it appears that, yes, as was already apparent from the JVM stack traces, the slow server is taking forever to do I/O: in the steady-state samples, await for sdb is around 10 ms on the good server versus roughly 57-225 ms on the bad one. It remains to understand why :(.
I also ran strace -c ls -R /. I didn't realize it would have to run for a while, so the previous data was not very meaningful. The command was run while the crawler was running, hence with massive I/O going on.
Good
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.62 14.344825 114 126027 getdents
0.25 0.036219 1 61812 12 open
0.07 0.009891 0 61802 close
0.06 0.007975 0 61801 fstat
0.01 0.000775 0 8436 write
0.00 0.000043 22 2 rt_sigaction
0.00 0.000000 0 12 read
0.00 0.000000 0 1 stat
0.00 0.000000 0 3 1 lstat
0.00 0.000000 0 33 mmap
0.00 0.000000 0 16 mprotect
0.00 0.000000 0 4 munmap
0.00 0.000000 0 15 brk
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 3 3 ioctl
0.00 0.000000 0 1 1 access
0.00 0.000000 0 3 mremap
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 fcntl
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 statfs
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 3 1 futex
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 14.399728 319982 18 total
Bad
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.81 24.210936 181 133618 getdents
0.14 0.032755 1 63183 14 open
0.03 0.006637 0 63171 close
0.02 0.005410 0 63170 fstat
0.00 0.000434 0 15002 write
0.00 0.000000 0 12 read
0.00 0.000000 0 1 stat
0.00 0.000000 0 4 1 lstat
0.00 0.000000 0 33 mmap
0.00 0.000000 0 16 mprotect
0.00 0.000000 0 4 munmap
0.00 0.000000 0 25 brk
0.00 0.000000 0 2 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 3 3 ioctl
0.00 0.000000 0 1 1 access
0.00 0.000000 0 3 mremap
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 fcntl
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 statfs
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 3 1 futex
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 24.256172 338259 20 total
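(Even with that caveat, the dominant call, getdents, averages 114 µs per call on the good server against 181 µs per call on the bad one, so metadata reads are slower there as well.)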
Comments
Pass bs=1M or something to dd, otherwise you're measuring syscall overhead. smartctl -H is useless. Check kernel logs, full SMART data, partition alignment, tune2fs, benchmark the disks individually/directly without a filesystem, ... If nothing comes up, replace the cables, and the disk anyway. – frostschutz Apr 20 '14 at 15:30
UPDATE: after leaving the disk doing nothing for 10 hours, THE SPEED IS AGAIN HIGH. The dd test copies hundreds of megabytes per second, like on all the other servers.
So it appears that something "happens" incrementally to the file system. After a few hours, things get slower and slower. But if I stop all activity and wait for a few hours, things go back to normal.
I guess this has something to do with delayed writes, but frankly I don't know what I should change.
– seba Apr 20 '14 at 17:15
I ran the crawler again and slowdowns appeared even without a journal. I've gathered the data requested by @ThorstenStaerk here:
http://vigna.di.unimi.it/slow.txt
(sorry, but it wouldn't let me paste it here; it's too long).
Note that for some reason (maybe the journal disabling) the server is now only twice as slow. It is also possible that I didn't catch it at the worst possible moment.
I'll run the crawler again to see whether I can once more slow writes down by a factor of 100 rather than 2. If I really can't, I'll put the journal back in.
– seba Apr 21 '14 at 08:40
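(For anyone pursuing the delayed-writes theory from the comments above: the relevant knobs are the kernel's dirty-page writeback sysctls, which all exist on 2.6.32. The values below are purely illustrative, a sketch of where to look rather than a recommended configuration:)

# inspect the current writeback thresholds
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs
# example only: start background writeback earlier and throttle writers sooner
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

On a 64 GB machine the stock vm.dirty_ratio of 20 allows roughly 13 GB of dirty pages to accumulate before writers are throttled.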