What should I expect to see if md/linux RAID is properly compensating for a failing drive?

Question

Does the md subsystem output any messages (to syslog/systemd-journal) to indicate that it's running in a degraded state (or anything else that might indicate that it has successfully reacted to a drive failure, as hinted at here)?

For example, I see lots of errors from sd indicating things like Unrecovered read error but I don't see anything like "retried successfully on alternate". Maybe no news is good news?

Back in the day, mirroring software/hardware would generate syslog entries that indicated when a device was degraded or otherwise required attention. Does md not do that?

Background: the systems in question are already deployed and are being remotely monitored (via syslog/journald info, so no mdadm or any other interactive commands/access of any sort are available at this point).

Previously posted (and closed as "off topic") at serverfault. — jhfrontz, Oct 06 '19 at 17:22
@roaima please see "no interactive commands/access of any sort are available". — jhfrontz, Oct 06 '19 at 17:23
I see that. I still say that it's an essential starting point, but not as an answer because of your restrictions. (I'm looking to see what other options you've got.) — Chris Davies, Oct 06 '19 at 17:24
Without interactive access, how do you expect to recover from a failure? If a disk fails and is failed out of the array, then when you replace it you need to tell the md driver to add the replacement disk. — Stephen Harris, Oct 06 '19 at 17:28
@StephenHarris Recovery is by dispatching a technician to the site. — jhfrontz, Oct 06 '19 at 20:02
@roaima that's the message that I was looking for (I was able to search for "Disk failure" and identify an issue). If you make that an answer, I'll accept. — jhfrontz, Oct 07 '19 at 23:28

Chris Davies · Accepted Answer · 2019-10-08T07:57:05.930

I set up a quick test on a RAID 1 array built from two loop devices.

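# Create two 100 MiB backing files and attach each to a loop device (run as root)
dd bs=1M count=100 if=/dev/zero >/tmp/0.img
cp /tmp/0.img /tmp/1.img
i0=$(losetup --show --find /tmp/0.img); echo "$i0"
i1=$(losetup --show --find /tmp/1.img); echo "$i1"
# Build a two-device RAID 1 array from the loop devices
mdadm --create /dev/md99 --metadata default --level 1 --raid-devices 2 "$i0" "$i1"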

Setting one half faulty

mdadm --manage /dev/md99 --set-faulty $i1    # For me, $i1=/dev/loop1

gives me this from the kernel (amongst other related RAID1 messages)

Oct 6 17:36:10 pi kernel: [4087450.030438] md/raid1:md99: Disk failure on loop1, disabling device
Oct 6 17:36:10 pi kernel: [4087450.030438] md/raid1:md99: Operation continuing on 1 devices.
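Since you only have the forwarded logs to work with, a filter along these lines could pick those two messages out on the collecting side. This is a rough sketch, not a complete monitor: the function name md_failure_lines is made up for illustration, and the pattern is derived only from the two RAID1 lines above, so other RAID levels or kernel versions may word things differently.

```shell
# Hypothetical helper: read log lines on stdin and print only the md
# messages that report a failed member or a degraded array.
# The pattern is an assumption based on the two kernel lines shown above.
md_failure_lines() {
    grep -E 'md/raid[0-9]*:[^:]+: (Disk failure on|Operation continuing on)'
}

# Possible use on a host that holds the journal (kernel messages only):
#   journalctl -k --no-pager | md_failure_lines
# With plain syslog, pipe the relevant log file into the function instead.
```

The sd "Unrecovered read error" lines do not match this pattern, so anything it prints means md has actually dropped a device rather than just seen an I/O error.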
