
I'm starting to accumulate a collection of computers at home, and to support them I have my "server" Linux box running a RAID array.

It's currently mdadm RAID-1, which I plan to grow to RAID-5 once I have more drives (and then, I hope, RAID-6). However, I've heard various stories about data getting corrupted on one drive without you ever noticing, because the other drive happens to serve the reads, right up until the first drive fails and you discover the second (and 3rd, 4th, 5th) drive is also screwed.

Obviously backups are important, and I'm taking care of that too. However, I know I've previously seen scripts that claim to help guard against this problem by letting you check your RAID while it's running. Looking for those scripts again now, I'm finding it hard to find anything similar to what I ran before, and I feel I'm out of date and not understanding whatever has changed.

How would you check a running RAID to make sure all disks are still performing normally?

I monitor SMART on all the drives and also have mdadm set to email me in case of failure, but I'd also like to know my drives occasionally "check" themselves.

  • Sounds like you're already on the right path; you just need to set up a cron job to send you the results of smartctl for your drives. – laebshade Jan 08 '12 at 23:15

5 Answers


The point of RAID with redundancy is that it will keep going as long as it can, but obviously it will detect errors that put it into a degraded mode, such as a failing disk. You can show the current status of an array with mdadm --detail (abbreviated as mdadm -D):

# mdadm -D /dev/md0
<snip>
       0       8        5        0      active sync   /dev/sda5
       1       8       23        1      active sync   /dev/sdb7

Furthermore, the exit status of mdadm -D is nonzero if there is any problem, such as a failed component (1 indicates an error that the RAID mode compensates for, and 2 indicates a complete failure).
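For scripted monitoring, that exit status can be turned into a simple health report. This is only a sketch (the messages and the example array path are illustrative), relying on the 0/1/2 convention just described:

```shell
# Sketch: classify an array's health from mdadm -D's exit status.
# Assumes the convention above: 0 = healthy, 1 = degraded but
# compensated by the RAID mode, 2 = complete failure.
check_array() {
  mdadm -D "$1" >/dev/null 2>&1
  case $? in
    0) echo "$1: OK" ;;
    1) echo "$1: DEGRADED" ;;
    *) echo "$1: FAILED" ;;
  esac
}
# Usage: check_array /dev/md0
```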

You can also get a quick summary of all RAID device status by looking at /proc/mdstat. You can get information about a RAID device in /sys/class/block/md*/md/* as well; see admin-guide/md.html in the kernel documentation. Some /sys entries are writable as well; for example you can trigger a full check of md0 with echo check >/sys/class/block/md0/md/sync_action.
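The /proc/mdstat summary can also be scanned mechanically. A minimal sketch, assuming the usual layout where a status line like `[2/2] [UU]` follows each `mdN :` line, with `_` marking a missing or failed member:

```shell
# Sketch: list arrays whose member-status field contains "_"
# (a missing/failed member). Pass an alternate file for testing;
# defaults to /proc/mdstat.
degraded_arrays() {
  awk '/^md/ {name = $1}
       /blocks/ && /_/ {print name}' "${1:-/proc/mdstat}"
}
# Usage: degraded_arrays   # prints nothing when all arrays are healthy
```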

In addition to these spot checks, mdadm can notify you as soon as something bad happens. Make sure that you have MAILADDR root in /etc/mdadm.conf (some distributions, e.g. Debian, set this up automatically). Then you will receive an email notification as soon as an error (such as a degraded array) occurs.
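To verify the notification path end to end, mdadm's monitor mode can send a one-off test alert for each array (--test generates a TestMessage event for every array found, and --oneshot makes a single pass instead of running as a daemon):

```
# Send a test alert for every array, then exit; the mail should
# arrive at the MAILADDR address configured in /etc/mdadm.conf.
mdadm --monitor --scan --oneshot --test
```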

Make sure that you do receive mail sent to root on the local machine (some modern distributions omit this because they assume all email goes through external providers, but receiving local mail is necessary for any serious system administrator). Test this by sending root a mail: echo hello | mail -s test root@localhost. Usually, a proper email setup requires two things:

  • Run an MTA on your local machine. The MTA must be set up at least to allow local mail delivery. All distributions come with suitable MTAs, pick anything (but not nullmailer if you want the email to be delivered locally).

  • Redirect mail going to system accounts (at least root) to an address that you read regularly. This can be your account on the local machine, or an external email address. With most MTAs, the address can be configured in /etc/aliases; you should have a line like

     root: djsmiley2k
    

    for local delivery, or

     root: djsmiley2k@mail-provider.example.com
    

    for remote delivery. If you choose remote delivery, make sure that your MTA is configured for that. Depending on your MTA, you may need to run the newaliases command after editing /etc/aliases.


You can force a check of the entire array while it's online. For example, to check the array on /dev/md0, run as root:

echo check > /sys/block/md0/md/sync_action

I also have a cron job that runs the following command once a month:

tar c /dir/of/raid/filesystem > /dev/null

It’s not a thorough check of the drive itself, but it does force the system to periodically verify that (almost) every file can be read successfully off the disk. Yes, some files are going to be read out of memory cache instead of disk. But I figure that if the file is in memory cache, then it’s successfully been read off disk recently, or is about to be written to disk, and either of those operations will also uncover drive errors. Anyway, running this job tests the most important criterion of a RAID array (“Can I successfully read my data?”) and in the three years I’ve been running my array, the one time I had a drive go bad, it was this command that discovered it.

One little warning is that if your filesystem is big, then this command is going to take a long time; my system takes about 6hr/TiB. I run it using ionice so that the rest of the system doesn’t grind to a halt during the drive check:

ionice -c3 tar c /dir/of/raid/filesystem > /dev/null
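In cron syntax, a monthly job along those lines might look like this (the path and schedule are illustrative; with MAILTO set, cron mails anything tar prints on stderr, such as read errors, to root):

```
# /etc/cron.d/raid-read-check (illustrative)
MAILTO=root
# 03:00 on the first day of each month
0 3 1 * * root ionice -c3 tar c /dir/of/raid/filesystem > /dev/null
```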
stharward
  • Note that ionice will only work if you use the (default) CFQ I/O scheduler. – Totor Nov 02 '13 at 17:48
  • So this may be obvious to most, but it's not to me - how does running a script whose output is redirected to devnull actually notify you of something? Is it the case that if "tar" encounters any errors these will get propagated up to the mdadm daemon which will (presumably) send you an email? – ljwobker Sep 04 '15 at 03:09
  • My question to you stharward, is how do you pick up on the tar errors if it's being run from a cron job? Where does that output? I would have thought you'd add a redirection for stderr to a file that could be monitored periodically or the tail of it being printed to console of opening a terminal window :) – Madivad Nov 13 '15 at 16:31
  • @ljwobker the stdout is only being redirected to null, stderr will still go (generally) to the screen. In this instance, I'm not too sure where it would go. – Madivad Nov 13 '15 at 16:32
  • @ljwobker Sorry for reviving an old thread. I think the intent of the tar command here is to attempt to read the entire contents of the volume. This would verify the entire volume is still readable and give md a chance to detect a bad disk. – mikepj May 23 '17 at 16:54
  • If it's coming from a cron job, cron will normally send all its output directly to the MAILTO= location, if one is set, otherwise to root. However... I'm wondering if dd would be better than tar, for lower overhead? – djsmiley2kStaysInside Jun 14 '18 at 12:43
  • Noting that my RAID device is md126 not md0, /sys/block/md126/md/sync_action does not exist for me... If I, as root (sudo su) try to execute this anyways (thus creating the "file"), I get Permission denied. – Drew Chapin Sep 03 '18 at 17:53
  • This should be the accepted answer... – pvgoran Apr 24 '22 at 05:28

The Debian and Ubuntu mdadm packages contain the file

/etc/cron.d/mdadm

which in turn, on the first Sunday of each month, will run the command

/usr/share/mdadm/checkarray --cron --all --idle --quiet

which checks all your arrays for consistency (unless you set AUTOCHECK to false in /etc/default/mdadm). A report will be sent to the root user (make sure you receive such emails).

am70

I use this simple function to check /proc/mdstat:

#Health of RAID array
raid() { awk '/^md/ {printf "%s: ", $1}; /blocks/ {print $NF}'  /proc/mdstat; }
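Run against a healthy two-disk mirror, the awk program condenses each array to one "name: [member-status]" line. A self-contained illustration with sample mdstat lines (device names made up, not from a real system):

```shell
# Feed sample /proc/mdstat content through the same awk program
# to show the output shape.
printf '%s\n' \
  'md0 : active raid1 sda5[0] sdb7[1]' \
  '      1465138496 blocks [2/2] [UU]' |
  awk '/^md/ {printf "%s: ", $1}; /blocks/ {print $NF}'
# → md0: [UU]
```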
oz123

I added this to my prompt by creating /etc/profile.d/raid_status.sh:

raid_prompt() { awk '/^md/ {printf "%s: ", $1}; /blocks/ {print $NF}'  /proc/mdstat | awk '/\[U+\]/ {print "\033[32m" $0 "\033[0m"}; /\[.*_.*\]/ {print "\033[31m" $0 "\033[0m"}'; }

Then setting my PS1 to:

PS1='$(raid_prompt)\n[\u@\h \W]\$ '

This prints the array name and member status in green if all members are up. If a member fails, then it's printed in red.