I have searched extensively for this answer and cannot find it... does anyone know how to set up a Linux software mdadm RAID 1 array such that the computer will still be bootable and repairable if something happens to a RAIDed partition or disk? My impression is that the array can sometimes work fine if you yank out one of the disks, but if you don't mirror the grub installs, boot partitions (if any), or UUIDs/names correctly, the array may fail to boot.
The problem I am trying to solve is that I would like both 1) RAID 1 functionality and 2) snapshot functionality, so a system update/upgrade can be reverted. Could someone please suggest a partition/format/grub/script/etc. setup that achieves this? This is a personal and not a production system. Examples further down.
What I've tried: I have experimented with btrfs RAID 1 (but I'm not sure whether the permanent-read-only bug is fixed yet, and I'm sadly stuck writing this from an unbootable, degraded btrfs RAID 1 that was not set up as above). I would prefer not to use zfs (since installation and partition-tool integration seem lacking due to the license issues). I would prefer not to use LVM snapshotting (because of the race conditions which must assuredly exist). Reverting aptitude updates (or this) seems like it may cause problems. btrfs snapshotting works wonders, but the computer being unbootable after a disk failure with btrfs RAID is my current issue. (I am trying to figure out a way to create a bootable USB from which I could actually run the btrfs tools, since one can't reformat the active disk and I don't have access to chroot/debootstrap/etc., so maybe I should just insert two USB keys with debian-installer; but throwing out that system seems almost simpler.)
Proposed solution: The solution I'm hoping to explore is using mdadm (over ext4, I guess), and whenever I use `aptitude` I will "manually snapshot" by excruciatingly `dd`ing everything on the root partition to a rollback partition of the same size (maybe on a spare disk), with `dd if=/dev/md# of=/dev/sd# bs=??M; sync`. On success, nothing further is necessary; on failure, hopefully the system can boot and be recovered with `dd if=/dev/sd# of=/dev/md#`.
(I'm trying to use GPT partition names here, but I'm not that familiar with them; please correct me if I'm wrong. I guess it would be `dd if=/dev/disk/by-???/??? of=/dev/disk/by-partlabel/rollback-root`, but I'm not sure how one could label `md` arrays, or whether they'd need to be ext4 filesystems to accept a label.)
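Something like the following is what I have in mind for that pre-upgrade step (untested sketch; `/dev/md0` and the `rollback-root` partition label are just placeholders):

```sh
#!/bin/sh
# Untested sketch of the "manual snapshot" before an upgrade; names are placeholders.
set -e
SRC=/dev/md0                              # the RAID 1 root array
DST=/dev/disk/by-partlabel/rollback-root  # spare partition of at least the same size

dd if="$SRC" of="$DST" bs=64M status=progress  # raw block copy of the whole array
sync                                           # flush caches before trusting the copy
# (Ideally the root filesystem would be quiesced or remounted read-only while
#  this runs, otherwise the copy may be internally inconsistent.)
```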
Illustration:
sda UUID=10 [?? xxxxxxxxxxxxx ]
sdb UUID=20 [?? xxxxxxxxxxxxx ]
^ root and/or boot partitions, in md0(,md1? etc)
sdc LABEL=rollback-root UUID=30 [?? _empty_space_ ]
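(For what it's worth, I imagine creating that layout with something like the following; the partition numbers and the metadata choice are guesses on my part:)

```sh
# Guesswork: build a two-disk RAID 1 from one partition on each disk.
# --metadata=1.0 keeps the md superblock at the end of the partition, which,
# as I understand it, also lets the contents be read as a plain filesystem.
mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 \
      /dev/sda2 /dev/sdb2

mkfs.ext4 /dev/md0                               # format the array
mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # record the array for boot
update-initramfs -u                              # rebuild the initramfs (Debian)
```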
Spec:
- scenario 1: drive failure - If either drive breaks or is removed, the bootloader is still safe (not lost, not corrupted) and allows booting into degraded mode (or maybe a barebones recovery partition); this may require user intervention. For example, if `/dev/sda` no longer exists, the mobo will pass over it since it can't find grub in either the MBR or wherever it lives in UEFI land, check its next disk in order as normal, such as `/dev/sdb`, find a 'backup' grub there, and use that one (see the grub-install sketch after this list).
  - ideally, booting into degraded mode would not require user intervention (e.g. if the machine is located remotely)
- scenario 2: upgrade failure - If an upgrade happens and the system becomes unusable (maybe the kernel panics on boot), one can boot from the snapshot partition, then `dd` the snapshot partition back over the original bootable root/boot partition(s), including overwriting the grub install.
  - in remote-world, I'm not aware of anything that will "try to boot, and failing that, revert" (see the grub-reboot sketch after this list). (I believe Android may do this somehow by setting some fallback flag, a bit like the "are you sure you want to keep this resolution? N seconds" prompt some platforms show after a display change.)
- if an upgrade happens and the system and bootloader become broken, I can insert a recovery usb/disk and... fix it somehow? (See the rescue-USB sketch after this list.) (Perhaps someone is aware of a nice way to chain bootloaders from a 'safe' bootloader while allowing aptitude free rein to update what it thinks is the primary grub.) (Or perhaps I also back up grub onto the rollback disk some...how, then boot from the rollback disk via the motherboard, then restore grub some...how?) I am aware that one cannot merely `dd` the MBR or ESP... how does managing those even work in these scenarios?
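To make the 'backup' grub in scenario 1 concrete, this is the kind of thing I imagine (untested; device names and the second ESP mount point are placeholders):

```sh
# BIOS/MBR machines: install the same GRUB into the boot sector of *both*
# member disks, so the firmware can fall through to whichever disk survives.
grub-install /dev/sda
grub-install /dev/sdb

# UEFI machines: each disk would need its own ESP; after grub/kernel updates,
# mirror the primary ESP onto the second one (/boot/efi2 is a made-up mount).
rsync -a --delete /boot/efi/ /boot/efi2/
# ...and register the second ESP with the firmware (e.g. via efibootmgr).
```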
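For scenario 2's "try to boot, and failing that, revert", the closest thing I know of in stock grub is its one-shot default (sketch only; it assumes GRUB_DEFAULT=saved is configured, the entry titles are placeholders, and it only helps if the box actually gets power-cycled, e.g. by a watchdog):

```sh
# Requires GRUB_DEFAULT=saved in /etc/default/grub (then run update-grub once).
grub-set-default 'Debian GNU/Linux (known-good)'     # permanent default = old entry
grub-reboot      'Debian GNU/Linux (just upgraded)'  # boot the new entry exactly once
reboot
# If the new entry fails and the machine is reset, GRUB falls back to the saved
# default; if it comes up fine, promote it with grub-set-default.
```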
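And for the rescue-USB case, I assume the usual live-media chroot dance applies (sketch; `/dev/md0` and `/dev/sda` are placeholders):

```sh
# From a live/rescue USB: assemble the array, chroot in, reinstall the bootloader.
mdadm --assemble --scan            # bring up the (possibly degraded) array
mount /dev/md0 /mnt                # mount the RAID 1 root
for d in dev proc sys; do mount --bind /$d /mnt/$d; done
chroot /mnt grub-install /dev/sda  # reinstall GRUB (BIOS case)
chroot /mnt update-grub            # regenerate grub.cfg
# (For UEFI, the ESP would also need mounting inside the chroot first.)
```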
Thoughts:
In such a hypothetical setup, how would one sync grub configs between the two RAID partition(s) while staying resilient to device-numbering issues? What happens to the grub config if an aptitude upgrade changes it? Is the `nofail` mount option necessary, or is it on by default? Would btrfs RAID with a dedicated 'rescue' partition or USB key be easier? (In the partition case, how would I set up GRUB to enable that?)
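For reference, the place I imagine `nofail` mattering is the rollback partition's fstab entry, something like the following (labels, UUIDs, and mount points are placeholders):

```
# /etc/fstab sketch. The spare rollback partition gets noauto,nofail so a
# missing or dead spare disk cannot hang or fail the boot; the md root array
# remains a hard requirement.
UUID=xxxxxxxx-root-array    /              ext4  errors=remount-ro  0  1
PARTLABEL=rollback-root     /mnt/rollback  ext4  noauto,nofail      0  2
```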
There are likely some major issues with the above, but I hope it illustrated the end goal.
Thank you!
Potentially useful references:
> [...] `grub` boot menu entries, having one for each bootable physical disk. Provided I have access to the console at boot time - directly or via IPMI - this can work satisfactorily, bearing in mind that "a disk not working" is expected to be a rare and short-term event. – Chris Davies, Jan 29 '20 at 17:53