14

My understanding is that hard drives and SSDs implement some basic error correction inside the drive, and most RAID configurations (e.g. mdadm) depend on this to decide when a drive has failed to correct an error and needs to be taken offline. However, this depends on the storage being 100% accurate in its error diagnosis, and it isn't: a common configuration like a two-drive RAID-1 mirror is vulnerable if some bits on one drive are silently corrupted and the drive does not report a read error. File systems like btrfs and ZFS therefore implement their own checksums, so as not to trust buggy drive firmware, glitchy SATA cables, and so on.

Similarly, RAM can also have reliability problems, and thus we have ECC RAM to solve that problem.

My question is this: what's the canonical way to protect the Linux swap file from silent corruption / bit rot not caught by drive firmware on a two-disk configuration (i.e. using mainline kernel drivers)? It seems to me that a configuration that lacks end-to-end protection here (such as that provided by btrfs) somewhat negates the peace of mind brought by ECC RAM. Yet I cannot think of a good way:

  • btrfs does not support swap files at all. You could set up a loop device backed by a file on btrfs and put swap on that, but that approach has problems of its own.
  • ZFS on Linux allows using a ZVOL as swap, which I guess could work: http://zfsonlinux.org/faq.html#CanIUseaZVOLforSwap - however, from my reading, ZFS is normally demanding on memory, and getting it working in a swap-only role sounds like some work to figure out. I think this is not my first choice. Why you would have to use an out-of-tree kernel module just to have reliable swap is beyond me - surely there is a way to accomplish this with most modern Linux distributions / kernels in this day and age? (A minimal ZVOL-swap sketch follows this list.)
  • There was actually a thread on a Linux kernel mailing list with patches to add checksums within the memory manager itself, for exactly the reasons I discuss in this question: http://thread.gmane.org/gmane.linux.kernel/989246 - unfortunately, as far as I can tell, the patch died and never made it upstream, for reasons unknown to me. Too bad, it sounded like a nice feature. On the other hand, if you put swap on a RAID-1 and the corruption is beyond the ability of the checksum to repair, you'd want the memory manager to try to read from the other drive before panicking or whatever, which is probably outside the scope of what a memory manager should do.
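
For concreteness, the mechanics of ZVOL-backed swap would presumably look something like the sketch below. The pool name (rpool), the size, and the property choices are illustrative guesses, not commands verified against the linked FAQ:

    # create a volume sized for swap; a block size matching the page size is commonly advised
    zfs create -V 4G -b $(getconf PAGESIZE) \
        -o logbias=throughput -o sync=always \
        -o primarycache=metadata -o secondarycache=none \
        rpool/swap
    mkswap /dev/zvol/rpool/swap
    swapon /dev/zvol/rpool/swap

This only shows the mechanics; it doesn't address the memory-pressure concern mentioned above.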

In summary:

  • RAM has ECC to correct errors
  • Files on permanent storage have btrfs to correct errors
  • Swap has ??? <--- this is my question
  • Wouldn't encrypted swap have error detection as a side-effect? If the encrypted stream is corrupted on the drive, the decryption will blow up... I have no idea how the system reacts then! – Stephen Kitt Mar 11 '16 at 08:01
  • I've no experience with btrfs, but I read the link you cited, and I think you are overlooking something. They are referring to files in which blocks are dynamically created, i.e. sparse files. You could create a dedicated btrfs partition, mounted without COW, create a file filled with zeros (dd if=/dev/zero) matching your desired swap size, and use that file as the swap file. I see no reason why this would incur a performance penalty. – Otheus Mar 11 '16 at 10:04
  • Also, I have always assumed md and lvm compare the results of the reads from each mirror. Are you sure they never do this and rely strictly on the device's error reporting? – Otheus Mar 11 '16 at 10:06
  • Finally, with non-SSD disks, every 512-byte or 1k block has a CRC associated with it. I don't know about SSDs. Further, with most modern SATA disks, including SSDs, encryption is supported within the disk, and the encryption itself may implicitly provide some kind of additional validation. – Otheus Mar 11 '16 at 10:13
  • @Otheus for performance reasons MD only reads from one device (more accurately, it reads from all devices, but a single read only involves one device); it can compare the contents of all the devices involved, but that's a separate operation, scrubbing. – Stephen Kitt Mar 11 '16 at 10:22
  • @Otheus: Setting nodatacow also disables checksums: https://btrfs.wiki.kernel.org/index.php/Mount_options ... "This also turns off checksumming! IOW, nodatacow implies nodatasum." ..... nodatasum says: "Means bit flips and bit rot might go undetected." – James Johnston Mar 11 '16 at 14:42
  • @Otheus: "Finally, with non-SSD disks, every 512-byte or 1k block has a CRC associated with it" .... true, but it won't protect against bit flips outside of the disk itself. (And you also place a lot of trust in closed-source, one-off, proprietary drive firmware.) These are reasons why btrfs and ZFS exist in the first place: they DON'T trust the underlying storage (or else they wouldn't bother with checksumming in the first place). For example, some users have reported bit flips due to bad SATA cabling and/or flaky power supplies. – James Johnston Mar 11 '16 at 14:45
  • @StephenKitt: "If the encrypted stream is corrupted on the drive, the decryption will blow up" -- wouldn't you just get a stream of garbage on the decryption which could cause silent data corruption throughout the system? For example, if "discard" is enabled for dmcrypt, the docs specifically say the top-level file-systems must be able to handle garbage instead of zeros when they send "TRIM" (or discard) to dmcrypt. – James Johnston Mar 11 '16 at 14:47
  • @JamesJohnston perhaps, I haven't tested it. Some encryption tools (e.g. GnuPG, even in plain pass-phrase-based AES mode) can detect when an encrypted stream is corrupted, I was hoping dm-crypt could too, but I guess that relies on more information than just the output of the block cipher... – Stephen Kitt Mar 11 '16 at 15:21
  • Is this more than a theoretical question? That is, do you actually rely on swap? I never did in the past (not since the 90s), but now that we have hypervisors with huge memory requirements, I understand the need for it again. But even then, I imagine you won't see a performance impact on mirrored SSD disks for swap just because btrfs is using COW. – Otheus Mar 11 '16 at 15:41

4 Answers

5

We trust the integrity of the data retrieved from swap because the storage hardware has checksums, CRCs, and such.

In one of the comments above, you say:

true, but it won't protect against bit flips outside of the disk itself

"It" meaning the disk's checksums here.

That is true, but SATA uses 32-bit CRCs for commands and data. Thus, you have a 1 in 4 billion chance of corrupting data undetectably between the disk and the SATA controller. That means that a continuous error source could introduce an error as often as every 125 MiB transferred, but a rare, random error source like cosmic rays would cause undetectable errors at a vanishingly small rate.

Realize also that if you've got a source that causes an undetected error at a rate anywhere near one per 125 MiB transferred, performance will be terrible because of the high number of detected errors requiring re-transfer. Monitoring and logging will probably alert you to the problem in time to avoid undetected corruption.

As for the storage medium's checksums, every SATA (and before it, PATA) disk uses per-sector checksums of some kind. One of the characteristic features of "enterprise" hard disks is larger sectors protected by additional data integrity features, greatly reducing the chance of an undetected error.

Without such measures, there would be no point to the spare sector pool in every hard drive: the drive itself could not detect a bad sector, so it could never swap fresh sectors in.

In another comment, you ask:

if SATA is so trustworthy, why are there checksummed file systems like ZFS, btrfs, ReFS?

Generally speaking, we aren't asking swap to store data long-term. The limit on swap storage is the system's uptime, and most data in swap doesn't last nearly that long, since most data that goes through your system's virtual memory system belongs to much shorter-lived processes.

On top of that, uptimes have generally gotten shorter over the years, what with the increased frequency of kernel and libc updates, virtualization, cloud architectures, etc.

Furthermore, most data in swap is inherently disused in a well-managed system, meaning one that doesn't run itself out of main RAM. In such a system, the only things that end up in swap are pages that the program doesn't use often, if ever. This is more common than you might guess. Most dynamic libraries that your programs link to have routines in them that your program doesn't use, but they had to be loaded into RAM by the dynamic linker. When the OS sees that you aren't using all of the program text in the library, it swaps it out, making room for code and data that your programs are using. If such swapped-out memory pages are corrupted, who would ever know?

Contrast this with the likes of ZFS where we expect the data to be durably and persistently stored, so that it lasts not only beyond the system's current uptime, but also beyond the life of the individual storage devices that comprise the storage system. ZFS and such are solving a problem with a time scale roughly two orders of magnitude longer than the problem solved by swap. We therefore have much higher corruption detection requirements for ZFS than for Linux swap.

ZFS and such differ from swap in another key way here: we don't RAID swap filesystems together. When multiple swap devices are in use on a single machine, it's a JBOD scheme, not like RAID-0 or higher. (e.g. macOS's chained swap files scheme, Linux's swapon, etc.) Since the swap devices are independent, rather than interdependent as with RAID, we don't need extensive checksumming because replacing a swap device doesn't involve looking at other interdependent swap devices for the data that should go on the replacement device. In ZFS terms, we don't resilver swap devices from redundant copies on other storage devices.

All of this does mean that you must use a reliable swap device. I once used a $20 external USB HDD enclosure to rescue an ailing ZFS pool, only to discover that the enclosure was itself unreliable, introducing errors of its own into the process. ZFS's strong checksumming saved me here. You can't get away with such cavalier treatment of storage media with a swap file. If the swap device is dying, and is thus approaching that worst case where it could inject an undetectable error every 125 MiB transferred, you simply have to replace it, ASAP.

The overall sense of paranoia in this question devolves to an instance of the Byzantine generals problem. Read up on that, ponder the 1982 date on the academic paper describing the problem to the computer science world, and then decide whether you, in 2019, have fresh thoughts to add to this problem. And if not, then perhaps you will just use the technology designed by three decades of CS graduates who all know about the Byzantine Generals Problem.

This is well-trod ground. You probably can't come up with an idea, objection, or solution that hasn't already been discussed to death in the computer science journals.

SATA is certainly not utterly reliable, but unless you are going to join academia or one of the kernel development teams, you are not going to be in a position to add materially to the state of the art here. These problems are already well in hand, as you've already noted: ZFS, btrfs, ReFS... As an OS user, you simply have to trust that the OS's creators are taking care of these problems for you, because they also know about the Byzantine Generals.

It is currently not practical to put your swap file on top of ZFS or Btrfs, but if the above doesn't reassure you, you could at least put it atop xfs or ext4. That would be better than using a dedicated swap partition.
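
Creating such a swap file is straightforward; a minimal sketch, assuming ext4 and the (arbitrary) path /swapfile:

    # allocate a 4 GiB file without holes (dd is the conservative choice; fallocate also works on ext4)
    dd if=/dev/zero of=/swapfile bs=1M count=4096 status=progress
    chmod 600 /swapfile          # swap files should not be world-readable
    mkswap /swapfile
    swapon /swapfile
    # make it permanent, if desired:
    echo '/swapfile none swap defaults 0 0' >> /etc/fstab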

Warren Young
  • If you have RAID, you ideally do run your swap on top of the RAID. Otherwise, you will crash swapped programs when your swap dies. One of the uses of RAID is to survive a disk failure, hot-swap a new disk and keep running without rebooting. – sourcejedi Feb 04 '19 at 21:49
3

dm-integrity

See: Documentation/device-mapper/dm-integrity.txt

dm-integrity would normally be used in journalling mode. In the case of swap, you could arrange to do without the journalling. This could significantly lower the performance overhead. I am not sure whether you would need to reformat the swap-over-integrity partition on each boot, to avoid catching errors after an unclean shutdown.
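
A minimal sketch of the idea using the integritysetup tool shipped with cryptsetup (the partition name is hypothetical; journalling is left at its default here, and whether your version offers a switch to disable it is something to check in its man page):

    integritysetup format /dev/sdb2        # initializes integrity metadata (default crc32c tags); destroys existing data
    integritysetup open /dev/sdb2 swapint  # exposes /dev/mapper/swapint
    mkswap /dev/mapper/swapint
    swapon /dev/mapper/swapint

A sector whose checksum no longer matches is then reported as an I/O error on read, rather than being returned as silently corrupted data.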

In the initial announcement of dm-integrity, the author states a preference for "data integrity protection on the higher level" instead. In the case of swap, that would open the possibility of storing the checksums in RAM. However, that option would both require non-trivial modifications to the current swap code, and increase memory usage. (The current code tracks swap efficiently using extents, not individual pages / sectors).


DIF/DIX?

DIX support was added by Oracle in Linux 2.6.27 (2008).

Does using DIX provide end-to-end integrity?

You could consult your vendor. I don't know how you could tell if they are lying about it.

DIX is required to protect data in flight between the OS and the HBA (host bus adapter).

DIF on its own increases protection for data in flight between the HBA and the storage device. (See also: presentation with some figures about the difference in error rates).

Precisely because the checksum in the guard field is standardized, it is technically possible to implement DIX commands without providing any protection for data at rest. Just have the HBA (or storage device) regenerate the checksum at read time. This outlook was made quite clear by the original DIX project.

  • DIF/DIX are orthogonal to logical block checksums
    • We still love you, btrfs!
    • Logical block checksum errors are used for detection of corrupted data
    • Detection happens at READ time
    • ... which could be months later, original buffer is lost
    • Any redundant copies may also be bad if original buffer was garbled
  • DIF/DIX are about proactively preventing corruption
    • Preventing bad data from being stored on disk in the first place
    • ... and finding out about problems before the original buffer is erased from memory

-- lpc08-data-integrity.pdf from oss.oracle.com

One of their early postings about DIX mentions the possibility of using DIX between OS and HBA even when the drive does not support DIF.

Complete mendacity is relatively unlikely in "enterprise" contexts where DIX is currently used; people would notice it. Also, DIF was based on existing hardware which could be formatted with 520-byte sectors. The protocol for using DIF allegedly requires that you first reformat the drive; see e.g. the sg_format command.
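
As an illustration only (this destroys all data, the device name is hypothetical, and I have not verified the option value against a real DIF-capable drive), such a reformat might look something like:

    sg_format --format --fmtpinfo=2 /dev/sdX   # request Type 1 protection information via the FMTPINFO field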

What is more likely is an implementation that does not follow the true end-to-end principle. To give one example, a vendor is mentioned which supports a weaker checksum option for DIX to save CPU cycles, which is then replaced by a stronger checksum further down in the stack. This is useful, but it is not complete end-to-end protection.

Alternatively, an OS could generate its own checksums and store them in the application tag space. However there is no support for this in current Linux (v4.20). The comment, written in 2014, suggests this might be because "very few storage devices actually permit using the application tag space". (I am not certain whether this refers to the storage device itself, the HBA, or both).

What sort of DIX devices are available that work with Linux?

The separation of the data and integrity metadata buffers as well as the choice in checksums is referred to as the Data Integrity Extensions [DIX]. As these extensions are outside the scope of the protocol bodies (T10, T13), Oracle and its partners are trying to standardize them within the Storage Networking Industry Association.

-- v4.20/Documentation/block/data-integrity.txt

Wikipedia tells me DIF is standardized in NVMe 1.2.1. For SCSI HBAs, it seems a bit difficult to pin this down if we don't have a standard to point to. At the moment it might be most precise to talk about "Linux DIX" support :-). There are devices available:

SCSI T10 DIF/DIX [sic] is fully supported in Red Hat Enterprise Linux 7.4, provided that the hardware vendor has qualified it and provides full support for the particular HBA and storage array configuration. DIF/DIX is not supported on other configurations, it is not supported for use on the boot device, and it is not supported on virtualized guests.

At the current time, the following vendors are known to provide this support...

-- RHEL 7.5 Release Notes, Chapter 16. Storage

All the hardware mentioned in the RHEL 7.5 release notes is Fibre Channel.

I don't know this market. It sounds like DIX might become more widely available in servers in future. I don't know any reason why it would become available for consumer SATA disks - as far as I know there isn't even a de-facto standard for the command format. I'll be interested to see if it becomes available more widely on NVMe.
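
One practical way to see what the kernel currently believes about a given disk is the sysfs interface described in the same data-integrity.txt document (block device name hypothetical):

    cat /sys/block/sda/integrity/format          # e.g. "T10-DIF-TYPE1-CRC", or "none"
    cat /sys/block/sda/integrity/read_verify     # 1 if protection information is verified on reads
    cat /sys/block/sda/integrity/write_generate  # 1 if protection information is generated on writes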

sourcejedi
  • thanks! I thought I heard something about some integrity "addon" for dev-mapper, but wasn't really sure. – poige Feb 07 '19 at 13:38
2

Swap has ??? <--- this is my question

Swap is still not protected in Linux (but see UPD).

Well, of course there is ZFS on Linux, which is capable of serving as swap storage, but it can still lock up under some circumstances, which effectively rules that option out.

Btrfs still can't handle swap files. They mention the possible use of loopback, although it's noted to have poor performance. There is an unclear indication that Linux 5 could finally have it(?)…

Patches to protect conventional swap itself with checksums didn't make it into mainline.

So, all-in-all: nope. Linux still has a gap there.

UPD.: As @sourcejedi points out, there is such a tool as dm-integrity. Since version 4.12, the Linux kernel has had a device-mapper target that can be used to provide checksums for general block devices, and swap devices are no exception. The tooling isn't broadly incorporated into major distros yet, and most of them don't have any support for it in the udev subsystem, but eventually this should change. When paired with a redundancy provider, say with MD (aka Linux Software RAID) placed on top of it, it should be possible not only to detect bit rot but also to re-route the I/O request to healthy data, because dm-integrity would indicate that there's an issue and MD would handle it. A sketch of that layering follows.
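
A hedged sketch, with hypothetical partition and array names: dm-integrity on each disk, MD RAID-1 on top, swap on top of that:

    integritysetup format /dev/sda2 && integritysetup open /dev/sda2 int-a
    integritysetup format /dev/sdb2 && integritysetup open /dev/sdb2 int-b
    mdadm --create /dev/md/swap --level=1 --raid-devices=2 /dev/mapper/int-a /dev/mapper/int-b
    mkswap /dev/md/swap
    swapon /dev/md/swap

A checksum mismatch on one leg then surfaces as a read error, which MD RAID-1 satisfies from the other leg and rewrites.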

poige
0

I do not think that there is a "canonical" way, so the following is my personal opinion.

Having monitored the advance of btrfs from a potential user's point of view, I have to say that it is still somewhat obscure to me. There are features which are mature and ready for production use, and there are features which are seemingly immature and dangerous to use.

Personally, I don't have the time to decide which features to use and which not, let alone the time I would need to figure out how to turn these features on or off.

In contrast, ZFS is rock-solid and mature (IMHO). So, to answer your question, I would use ZFS (by the way, it does not consume much memory - see below).

But for you, btrfs might be the right choice since you are using it already (if I got you right), and one of the comments above shows how to use it for swap.

By pure chance, I have put some Linux servers on ZFS during the last few days, each time including the root file system and swap. Before doing so, I did some very thorough research, which took me several days. A short summary of what I have learned:

ZFS's memory consumption

There is a common misunderstanding regarding ZFS's memory consumption. ZFS generally does not consume much memory; in fact, it runs with TBs of storage on machines with 2 GB of RAM. Only if you use deduplication (turned off by default) does it need lots and lots of RAM.
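
If memory consumption is still a worry, the ARC can also be capped explicitly; a hedged sketch (the 1 GiB value is only an example):

    # persistent: module option read when the zfs module loads
    echo "options zfs zfs_arc_max=1073741824" > /etc/modprobe.d/zfs.conf
    # at runtime, if the module is already loaded:
    echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
    # the current limit can be checked with:
    grep c_max /proc/spl/kstat/zfs/arcstats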

Hardware error detection / correction

Whether SATA's, PATA's, RAID's or other error detection / correction mechanisms are sufficient for data integrity is a subject which causes endless discussions and even flame wars all over the net. In theory, a hardware storage device should report (and possibly correct) any error it encounters, and the data transmission hardware at every level (chipset, memory, etc.) should do so as well.

Well, they don't in all cases, or they react surprisingly to errors. As an example, let's take a typical RAID-5 configuration. Normally, if one disk has a problem, it will report it to the RAID, which in turn reconstructs the data to be read from the other disks and passes it on, but also writes it back to the faulty disk (which in turn probably remaps the sector before writing the data); if the same disk reports too many errors, the RAID takes it offline and informs the administrator (if configured properly).

So far, so good, but there are cases where faulty data gets out of a disk without the disk reporting an error (see next section). Most RAIDs can detect this situation using the parity information, but their reaction is stupid: Instead of reporting the error and stopping the data from being passed on, they will just recalculate the parity based on the faulty data and write the new parity to the respective disk, thus marking the faulty data as correct forever.

Is that a reasonable behavior? As far as I know, most hardware RAID5 controllers and even Linux's md RAID operate this way.
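
For reference, this is the interface through which Linux md exposes that behavior (array name hypothetical): a "check" pass only counts inconsistencies, while "repair" rewrites parity or mirror copies from whatever data was read:

    echo check > /sys/block/md0/md/sync_action    # read-only scrub
    cat /sys/block/md0/md/mismatch_cnt            # non-zero: copies / parity disagree somewhere
    echo repair > /sys/block/md0/md/sync_action   # resynchronize, trusting the data that was read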

I don't know about btrfs's error correction, but you should eventually read the docs carefully once more, notably if you are using btrfs's RAID.

Silent bit rot

Despite all flame wars and (pseudo-)scientific discussions: reality is mostly different from theory, and silent bit rot definitely happens, although theory might state the opposite. (Silent bit rot usually means that data on hardware storage gets corrupted without the storage device reporting an error when this data is read, but I will add flipped bits anywhere in the transmission path to this definition.)

That this happens is not my personal opinion: at least Google, Amazon and CERN have published detailed white papers covering exactly that subject. The papers are publicly available for download free of charge. They have done systematic experiments with several million hard disks and hundreds of thousands of servers / storage devices, either because they had problems with undetected data corruption or because they wanted to know what to do to prevent it before it happened.

In summary, data in their server farms had been corrupted at a rate significantly higher than MTBF statistics or other theory would lead one to expect. By significantly higher, I mean orders of magnitude.

So silent bit rot, i.e. undetected data corruption at any point in the transmission path, is a real life problem.

Data lifetime

Warren Young is correct when he says that swap data has a short lifetime. But I would like to add the following consideration: not only data (in the sense of documents) gets into swap, but (perhaps even more likely) parts of the OS or other running software. If I have an MP3 in swap, I can live with a flipped bit. If (due to an extreme situation) parts of my production httpd server software are in swap, I can by no means live with a flipped bit that, if undetected, later leads to executing corrupted code.

Epilogue

For me, ZFS solves these problems or, more precisely, moves them away from the disks to the memory, and thereby reduces the probability of silent bit rot by some orders of magnitude. Furthermore, if configured properly (i.e. mirrors instead of RAID), it provides clean and reasonable error correction which works as expected and can be understood easily.
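
A minimal sketch of such a setup (pool and device names are placeholders):

    zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B
    zpool scrub tank       # read everything, verify checksums, repair from the intact mirror copy
    zpool status -v tank   # per-device read / write / checksum error counters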

Having said this, please note that you will never get absolute safety. Personally, I trust my ECC RAM more than my disks, and I am convinced that ZFS with its end-to-end checksums reduces problem probability by orders of magnitude. I would never recommend using ZFS without ECC RAM, though.

Disclaimer: I am in no way associated with any ZFS vendor or developer. This is true for all variants (forks) of ZFS. I just became a fan of it over the past few days ...

Binarus