We have four identical Linux servers, each with a large (5 TB) hard-disk partition. They run Scientific Linux with this kernel:
Linux s3.law.di.unimi.it 2.6.32-358.18.1.el6.x86_64 #1 SMP Tue Aug 27 14:23:09 CDT 2013 x86_64 x86_64 x86_64 GNU/Linux
The servers are identically configured and installed. But one, and only one, of the servers is ridiculously slow when writing with ext4. If I run
dd if=/dev/zero of=/mnt/big/analysis/test
and list the output file twice, about thirty seconds apart (l is an alias for ls -l, and the prompt shows the time), I get
l -tr
total 11M
-rw-r--r-- 1 root root 11M Apr 20 10:01 test
10:01:42 [s3] /mnt/big/analysis
l -tr
total 16M
-rw-r--r-- 1 root root 16M Apr 20 10:02 test
10:02:13 [s3] /mnt/big/analysis
So, 5 MB in about 30 s. All the other servers write more than an order of magnitude faster.
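(As frostschutz points out in the comments below, a plain dd from /dev/zero mostly measures syscall overhead at the default 512-byte block size. A more meaningful version of the same test, using only standard GNU dd options, would be something along these lines:)

# write 1 GB in 1 MB blocks and flush file data to disk before reporting the rate
dd if=/dev/zero of=/mnt/big/analysis/test bs=1M count=1024 conv=fdatasync

With conv=fdatasync the reported throughput reflects the disk rather than the page cache.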
The machines have 64 GB of RAM and 32 cores, and there is no I/O or CPU activity, although 90% of memory is occupied by a large Java process that is doing nothing. Only one machine is writing slowly.
SMART says everything is OK:
# smartctl -H /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
SMART Health Status: OK
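(As noted in the comments below, -H alone says very little; the full attribute and error-log dump is worth checking too:)

smartctl -a /dev/sdb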
hdparm shows no problem with reads:
# hdparm -t /dev/sdb
/dev/sdb:
Timing buffered disk reads: 1356 MB in 3.00 seconds = 451.60 MB/sec
The partition is mounted as follows:
/dev/sdb1 on /mnt/big type ext4 (rw,noatime,data=writeback)
and we set
tune2fs -o journal_data_writeback /dev/sdb1
for performance.
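(A quick way to double-check that both settings actually took effect: the journal mode set by tune2fs is stored in the superblock, and the live mount options appear in /proc/mounts. tune2fs is part of e2fsprogs:)

# the default mount options recorded in the superblock should include journal_data_writeback
tune2fs -l /dev/sdb1 | grep -i 'mount options'
# the live mount should show data=writeback
grep /mnt/big /proc/mounts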
I tried to find anything that could explain why this specific server writes so slowly (that is, any difference in the output of some diagnostic tool), but with no results.
Just to complete the picture: we started a crawl on all servers, with the partitions essentially empty. The crawl created a number of files on the partition, in particular a 156 GB file (the store of the crawl). The crawl started OK, but after a few hours we saw slowdowns (apparently as the store was growing). When we checked, we noticed that writing to disk was getting slower, and slower, and slower.
We stopped everything (no CPU activity, no I/O), but dd still showed the behaviour above. The other three servers, in identical conditions and with the same files, work perfectly, both during the crawl and when using dd.
Frankly, I don't even know where to look. Does this ring a bell for anyone? Which diagnostic tools could I use to understand what's happening, or which tests should I try?
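(One cheap check, given the delayed-write suspicion that emerges in the comments below, is to watch the kernel's writeback accounting while the slowdown is happening. A Dirty figure that grows large and drains only slowly would point at writeback rather than at the disk itself:)

# dirty pages waiting to be written out, and pages currently under writeback
watch -n 5 "grep -E '^(Dirty|Writeback):' /proc/meminfo"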
Update
Besides the link posted, I thought it would be a good idea to run the same tests on two servers while they were running the crawler. It's interesting. For instance, vmstat 10 gives
Good
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 68692 9009864 70832 29853400 0 0 15 214 9 2 11 1 88 0 0
10 0 68692 8985620 70824 29883460 0 0 48 7402 79465 62898 12 1 86 0 0
11 0 68692 8936780 70824 29928696 0 0 54 6842 81500 66665 15 1 83 0 0
10 2 68692 8867280 70840 30000104 0 0 65 36578 80252 66272 14 1 85 0 0
15 0 68692 8842960 70840 30026724 0 0 61 3667 81245 65161 14 1 85 0 0
Bad
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
13 0 8840 14015100 92972 25683108 0 0 11 104 3 9 4 1 94 0 0
2 0 8840 14015800 92984 25696108 0 0 49 16835 38619 54204 2 2 96 0 0
1 0 8840 14026004 93004 25699940 0 0 33 4152 25914 43530 0 2 98 0 0
1 0 8840 14032272 93012 25703716 0 0 30 1164 25062 43230 0 2 98 0 0
2 0 8840 14029632 93020 25709448 0 0 24 5619 23475 40080 0 2 98 0 0
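(Note in particular the in and cs columns: the good server sustains roughly 80,000 interrupts per second with 12-15% user CPU, while the bad one sits at 23,000-39,000 interrupts per second with user CPU near zero, as if every thread were stuck waiting.)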
And iostat -x -k 5
Good
Linux 2.6.32-358.18.1.el6.x86_64 (s0.law.di.unimi.it) 04/21/2014 _x86_64_ (64 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
10.65 0.00 1.02 0.11 0.00 88.22
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.11 3338.25 17.98 56.52 903.55 13579.19 388.79 7.32 98.30 1.23 9.18
sda 0.39 0.72 0.49 0.76 11.68 5.90 28.25 0.01 11.25 3.41 0.43
avg-cpu: %user %nice %system %iowait %steal %idle
15.86 0.00 1.33 0.03 0.00 82.78
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 1106.20 9.20 31.00 36.80 4549.60 228.18 0.41 10.09 0.39 1.58
sda 0.00 2.20 0.80 3.00 4.80 20.80 13.47 0.04 10.53 3.21 1.22
avg-cpu: %user %nice %system %iowait %steal %idle
15.42 0.00 1.23 0.01 0.00 83.34
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 1205.40 8.00 33.60 40.80 4956.00 240.23 0.39 9.43 0.33 1.38
sda 0.00 0.60 0.00 1.00 0.00 6.40 12.80 0.01 5.20 4.20 0.42
Bad
Linux 2.6.32-358.18.1.el6.x86_64 (s2.law.di.unimi.it) 04/21/2014 _x86_64_ (64 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
4.37 0.00 1.41 0.06 0.00 94.16
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.06 1599.70 13.76 38.23 699.27 6551.73 278.96 3.12 59.96 0.99 5.16
sda 0.46 3.17 1.07 0.78 22.51 15.85 41.26 0.03 16.10 2.70 0.50
avg-cpu: %user %nice %system %iowait %steal %idle
11.93 0.00 2.99 0.60 0.00 84.48
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 14885.40 13.60 141.20 54.40 60106.40 777.27 34.90 225.45 1.95 30.14
sda 0.00 0.40 0.00 0.80 0.00 4.80 12.00 0.01 7.00 3.25 0.26
avg-cpu: %user %nice %system %iowait %steal %idle
11.61 0.00 2.51 0.16 0.00 85.71
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 2245.40 10.60 51.20 42.40 9187.20 298.69 3.51 56.80 2.04 12.58
sda 0.00 0.40 0.00 0.80 0.00 4.80 12.00 0.01 6.25 3.25 0.26
So (if I understand the output correctly) it appears that, yes, as was already apparent from the JVM stack traces, the slow server is taking forever to do I/O: in the steady-state samples, await for sdb is around 10 ms on the good server versus roughly 57-225 ms on the bad one. It remains to understand why :(.
I also ran strace -c ls -R /. I didn't realize it would have to run for a while, so the previous data was not very meaningful. The command was run while the crawler was running, hence with massive I/O going on.
Good
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.62 14.344825 114 126027 getdents
0.25 0.036219 1 61812 12 open
0.07 0.009891 0 61802 close
0.06 0.007975 0 61801 fstat
0.01 0.000775 0 8436 write
0.00 0.000043 22 2 rt_sigaction
0.00 0.000000 0 12 read
0.00 0.000000 0 1 stat
0.00 0.000000 0 3 1 lstat
0.00 0.000000 0 33 mmap
0.00 0.000000 0 16 mprotect
0.00 0.000000 0 4 munmap
0.00 0.000000 0 15 brk
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 3 3 ioctl
0.00 0.000000 0 1 1 access
0.00 0.000000 0 3 mremap
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 fcntl
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 statfs
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 3 1 futex
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 14.399728 319982 18 total
Bad
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.81 24.210936 181 133618 getdents
0.14 0.032755 1 63183 14 open
0.03 0.006637 0 63171 close
0.02 0.005410 0 63170 fstat
0.00 0.000434 0 15002 write
0.00 0.000000 0 12 read
0.00 0.000000 0 1 stat
0.00 0.000000 0 4 1 lstat
0.00 0.000000 0 33 mmap
0.00 0.000000 0 16 mprotect
0.00 0.000000 0 4 munmap
0.00 0.000000 0 25 brk
0.00 0.000000 0 2 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 3 3 ioctl
0.00 0.000000 0 1 1 access
0.00 0.000000 0 3 mremap
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 fcntl
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 statfs
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 3 1 futex
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 24.256172 338259 20 total
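(Even with that caveat, the dominant call, getdents, averages 114 µs per call on the good server against 181 µs per call on the bad one, so metadata reads are slower there as well.)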
Comments
Pass bs=1M or something to dd, otherwise you're measuring syscall overhead. smartctl -H is useless. Check kernel logs, full SMART data, partition alignment, tune2fs, benchmark the disks individually/directly without a filesystem, ... If nothing comes up, replace the cables, and the disk anyway. – frostschutz Apr 20 '14 at 15:30
UPDATE: after leaving the disk doing nothing for 10 hours, THE SPEED IS AGAIN HIGH. The dd test copies hundreds of megabytes per second, like on all the other servers.
So it appears that something "happens" incrementally to the file system. After a few hours, things get slower and slower. But if I stop all activity and wait for a few hours, things go back to normal.
I guess this has something to do with delayed writes, but frankly I don't know what I should change.
– seba Apr 20 '14 at 17:15
I ran the crawler again and slowdowns appeared even without a journal. I've gathered the data requested by @ThorstenStaerk here:
http://vigna.di.unimi.it/slow.txt
(sorry, but it wouldn't let me paste it here; it's too long).
Note that for some reason (maybe the journal disabling) the server is now only twice as slow. It is also possible that I didn't catch it at the worst possible moment.
I'll run the crawler again to see whether I can once more slow writes down by a factor of 100 rather than 2. If I really can't, I'll put the journal back in.
– seba Apr 21 '14 at 08:40
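(For anyone pursuing the delayed-writes theory from the comments above: the relevant knobs are the kernel's dirty-page writeback sysctls, which all exist on 2.6.32. The values below are purely illustrative, a sketch of where to look rather than a recommended configuration:)

# inspect the current writeback thresholds
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs
# example only: start background writeback earlier and throttle writers sooner
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

On a 64 GB machine the stock vm.dirty_ratio of 20 allows roughly 13 GB of dirty pages to accumulate before writers are throttled.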