26

I have been having odd and rare filesystem corruption lately that I suspect is the fault of my SSD. I am looking for a good drive torture test tool. Something that can write to the whole disk, then go back and read it looking for flying writes, corrupted blocks, blocks reverted to older revisions, and other errors. This would be much more than what badblocks does. Is there such a tool?

Note that I am not looking for a performance benchmark, and I have already checked the SMART status; it says healthy, with no bad blocks reported.

psusi
  • 17,303
  • 1
    Did you try? http://iozone.org/ – positron Apr 16 '13 at 04:11
  • 1
    Btw, you didn't indicate which OS/system hardware you are using. SSDs have been reported to experience the corruption you are talking about on some Mac OS X boxen when configured to power down the hard drive often to save power. This will cause corruption.

    I would imagine the same is likely in other OS/hardware combos, if the drive is forced to sleep via a hard-drive power-down issued to an SSD. I would check the configuration of your system before burning up your SSD with a drive test.

    – Wing Tang Wong Apr 16 '13 at 15:49
  • 1
    @WingTangWong, wow. I've read that many SSDs screw up when they lose power, but when asked to go to sleep? That's one buggy drive. I'll keep an eye out for this. I'm using Linux and don't have it sleeping, except maybe when I suspend the system... – psusi Apr 16 '13 at 16:04

5 Answers

13

Might be overkill, but there's the Phoronix Test Suite. There's also bonnie++, as well as hdparm.

I usually use hdparm, for example:

% hdparm -Tt /dev/hdb
/dev/hdb:
 Timing buffer-cache reads:   128 MB in  1.25 seconds =102.40 MB/sec
 Timing buffered disk reads:  64 MB in 16.70 seconds =  3.83 MB/sec

I wouldn't call hdparm a torture test, but it does give you a rough idea of a drive's overall performance.

Determining a drive's health

After you've tortured the drive, you can use this command to check out its general health:

% sudo udisks --dump | grep -A 24 Updates
 Attribute       Current|Worst|Threshold  Status   Value       Type     Updates
===============================================================================
 raw-read-error-rate         103| 99| 34   good    5854752     Pre-fail Online 
 spin-up-time                100| 99|  0    n/a    0           Pre-fail Online 
 start-stop-count             98| 98| 20   good    2785        Old-age  Online 
 reallocated-sector-count    100|100| 36   good    0 sectors   Pre-fail Online 
 seek-error-rate              72| 60| 30   good    25872884688 Pre-fail Online 
 power-on-hours               89| 89|  0    n/a    424.4 days  Old-age  Online 
 spin-retry-count            100|100| 97   good    0           Pre-fail Online 
 power-cycle-count            98| 98| 20   good    2753        Old-age  Online 
 attribute-184               100|100| 99   good    0           Old-age  Online 
 reported-uncorrect          100|100|  0    n/a    0 sectors   Old-age  Online 
 attribute-188               100| 96|  0    n/a    0           Old-age  Online 
 high-fly-writes             100|100|  0    n/a    0           Old-age  Online 
 airflow-temperature-celsius  58| 42| 45 FAIL_PAST 42C / 108F  Old-age  Online 
 g-sense-error-rate          100|100|  0    n/a    124         Old-age  Online 
 power-off-retract-count     100|100|  0    n/a    15          Old-age  Online 
 load-cycle-count              1|  1|  0    n/a    248327      Old-age  Online 
 temperature-celsius-2        42| 58|  0    n/a    42C / 108F  Old-age  Online 
 hardware-ecc-recovered       45| 38|  0    n/a    5854752     Old-age  Online 
 reallocated-event-count      89| 89| 30   good    14877766723263 Pre-fail Online 
 current-pending-sector      100|100|  0    n/a    0 sectors   Old-age  Online 
 offline-uncorrectable       100|100|  0    n/a    0 sectors   Old-age  Offline
 udma-crc-error-count        200|200|  0    n/a    0           Old-age  Online 
 attribute-254               100|100|  0    n/a    0           Old-age  Online 
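
If udisks isn't available, smartctl from the smartmontools package reports the same attributes (replace /dev/sda with your own device node):

% sudo smartctl -H /dev/sda   # overall health self-assessment
% sudo smartctl -A /dev/sda   # vendor attribute table, like the one above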

Disk health/maintenance tools

We've had good success using the following two tools where I work: HDAT2 and SpinRite. The latter is a commercial tool, while the former, HDAT2, is freeware.

Here are a couple of screenshots of HDAT2:

[screenshot #1: HDAT2]

[screenshot #2: HDAT2]

You have to reboot your system into both of these tools, so the disk is offline while you're performing these operations, but they've both recovered drives that had failed or were starting to exhibit failures. The UI in HDAT2 is a little rough to navigate; we generally used the default choices for the most part and tried not to wander too far from there.

sebasth
  • 14,872
slm
  • 369,824
  • 1
    The first three you mention are performance benchmarks. I already checked the SMART status and it is good, with no bad sectors, so it looks like HDAT2 isn't what I'm looking for either. – psusi Apr 16 '13 at 13:38
  • I wouldn't dismiss HDAT2; we were running chkdsks and they were clean as well, yet the disk was still not bootable. Running HDAT2 was able to find surface issues with the disk that it was able to repair enough to make the drive bootable. – slm Apr 16 '13 at 14:08
  • A late comment, but I'm trying to find the source code to HDAT2, with no success. Has there been a recent licensing change? – i336_ Aug 16 '19 at 16:26
  • @i336 - I don't recall seeing a link to a repo; the license is freeware. I'd contact the dev. – slm Jul 03 '20 at 15:29
6

bonnie++ comes to mind:

So, depending on your box's hardware configuration:

bonnie++ -d /path/to/mounted/ssd -r your-system-ram-size-in-MB

Example:

# For a 32GB system with the SSD formatted and mounted at /mnt/mounted-ssd-001
bonnie++ -d /mnt/mounted-ssd-001 -r 32000

It should give your device a good stress test. You can customize it as well.
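
For instance, to make the workout heavier, you can repeat the run and add the small-file tests (flag meanings per the bonnie++ man page; the path, RAM size, and user are placeholders):

# Ten consecutive full runs plus small-file create/stat/delete tests;
# -n is in multiples of 1024 files, and -u is required when running as root
bonnie++ -d /mnt/mounted-ssd-001 -r 32000 -n 64 -x 10 -u nobody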

Note that with an SSD, when a bad block happens, it may get remapped automatically by the drive hardware, depending on the drive you are working with. Also, a torture test eats away at your SSD's write lifespan, so use it at your own discretion.
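
To see how much lifespan a test run consumes, you can snapshot the wear-related SMART attributes before and after it (attribute names vary by vendor; Wear_Leveling_Count and Total_LBAs_Written are Samsung-style examples):

sudo smartctl -A /dev/sda | grep -Ei 'wear|lbas_written'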

EDIT:

Adding a note about SSD failures, since it's been pointed out that bonnie++ stress tests but doesn't track errors. The way SSDs "remap bad blocks" is different from the way hard drives do remapping, and how it goes about it depends entirely on which brand/make/model of SSD you have:

  • Cheap SSDs just fail, because they have no spare capacity to remap to, or no means of segregating the failed flash blocks. They will just hang or go offline and won't come back online.
  • Midrange SSDs with no spare capacity may generate smartd alerts, or perhaps even OS-level block device errors, when a failed block is detected. However, when the failure happens, the SSD's registered size will change. This can result in an error and the device being taken offline by the OS, or in the device itself hanging and needing to be pulled out and re-inserted before it is recognized again. On re-registering, the device's available block size will be diminished.
  • High-end SSDs with spare capacity will remap the bad blocks behind the scenes and may generate OS-level alerts/warnings. When the spare capacity runs out, the device will probably fail along the lines of the midrange SSDs.

When the SSD resizes itself due to bad blocks being isolated, you might need to do the following to revive the drive, if the drive's firmware doesn't make the proper updates on its own:

https://web.archive.org/web/20130728024542/http://communities.intel.com/message/145676

Unless the stress-test and error-logging tool is designed specifically with SSDs in mind, you are just using up the lifespan of the device.

EDIT:

Based on info from the discussion above, I suggest either replacing the cable with a better one or replacing the drive (RMA/warranty replacement), as that kind of OS filesystem-level error is not normal.

Also, if your drive supports it, you can increase the amount of space reserved for handling errors:

http://www.thomas-krenn.com/en/wiki/SSD_Over-provisioning_using_hdparm
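
The article linked above does this with hdparm's Host Protected Area support; in outline (the sector counts are illustrative, and -N can make data inaccessible, so read the man page first; some hdparm versions require an extra confirmation flag when setting):

# Show the current and native maximum sector counts
sudo hdparm -N /dev/sdX

# Permanently ("p") expose only part of the native sector count, leaving
# the remainder as extra spare area for the controller to remap into
sudo hdparm -Np437000000 /dev/sdX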

Pro Backup
  • 4,924
  • 1
    That's a performance benchmark. It may give the drive a workout, but I don't think it detects errors. – psusi Apr 16 '13 at 13:41
  • The way that SSDs work, error detection comes in one of several forms, depending on the make/model of the SSD: a smartd error if a block is remapped and spare capacity is used (no faults); the device's capacity is reduced as a section of the flash storage is faulted (can cause smartd errors, can cause filesystem errors, can cause the device to hang the bus by going offline; on pull/re-insert, the drive is available again, but may need to be reformatted); or the SSD can just outright appear to hang with no remapping (the device becomes unresponsive even after re-insert). The failure path is not equivalent to a HD. – Wing Tang Wong Apr 16 '13 at 15:35
  • 1
    the errors aren't detected by the drive, hence the need for a test tool. It manifests itself by the filesystem being remounted ro, and by e2fsck finding and fixing lots of errors in the metadata. I've also had some of my git repository pack files corrupted. It's a silent corruption that happens maybe once every month or two. At first I thought it might be a bug involving TRIM, since I don't recall this happening before I enabled it, so I turned it back off, and it's still happening. – psusi Apr 16 '13 at 16:10
  • A couple of potential issues: a bad drive cable or a bad drive.

    You can test for a bad cable by replacing it with another one. I've had this happen in the past, and replacing it with a better-spec'd cable worked.

    In the case of a bad drive, RMA it or send it in for warranty repair.

    – Wing Tang Wong Apr 16 '13 at 16:30
  • 1
    the problem is proving that it is a bad drive (or really, a bug in the firmware) and not, say, a bug in the kernel. If it were a bad cable, it would manifest as SATA ECC errors rather than random silent corruption. – psusi Apr 16 '13 at 17:18
  • The frequency of the errors would definitely make that hard to test. :( Another alternative would be to check the manufacturer's website and see if they have an analysis tool specifically for the SSD that they produced. Another possibility is to check whether there is a firmware update for the SSD; in some cases this will fix firmware bugs, which, if that is the cause of the corruption, will resolve the issue. I'd go with the manufacturer tools and/or a firmware update to address the issue. Failing that, RMA/warranty swap. – Wing Tang Wong Apr 16 '13 at 17:26
3

I understand this is over a year old, but for the benefit of anyone reading the thread in the future, I expect the software you require(d) doesn't yet exist outside of HP Labs:

"Understanding the Robustness of SSDs under Power Fault" https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf

Replace the power fault injection with an event of your choice (or with nothing, in the case of detecting intermittent firmware bugs), and it appears that this software would detect it. Unfortunately, I don't think there's an alternative, else presumably HP wouldn't have written something in-house.

It's a shame, since I've also needed something like this to prove issues in a virtual environment, where I suspect committed writes haven't actually made their way to the physical disk. It would be great to be able to stress test the whole storage stack like this, not just SSDs. I've yet to find something suitable.

  • 2
    (from anonymous comment)

    While the hardware side of things would need to be replicated, I see no reason why the software verification part of that paper could not be reproduced using fio in client/server mode with triggers. See https://github.com/axboe/fio/blob/master/HOWTO (section 10, Verification and triggers) for details. fio can be made to use unbuffered or periodically synchronised I/O in a variety of useful patterns which can later be verified (it is even possible to save a state file so verification can take place across different fio invocations); a sketch follows below.

    – Archemar Mar 07 '15 at 13:22
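
For example, a minimal fio write-then-verify pair of runs might look like this (the device path /dev/sdX is a placeholder, and the first command destroys everything on it):

# Pass 1: fill the target with verifiable, checksummed data
fio --name=soak --filename=/dev/sdX --direct=1 --rw=write --bs=1M --verify=crc32c

# Pass 2, later (even after a reboot or power cycle): --verify_only skips
# the writes and only checks that the data read back still matches
fio --name=soak --filename=/dev/sdX --direct=1 --rw=write --bs=1M --verify=crc32c --verify_only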
1

Use the manufacturer's test tool; it's the best way to test a drive, as it can access the low-level tests, remap bad sectors, and check all of the SMART health attributes (especially for an SSD, where many registers are unknown to most of us but can help the maker see the disk's status).

Hiren's BootCD has many testing tools, but I think it has not been updated with SSD-aware ones, so check the manufacturer's website directly. Some tools support Linux; others may require a Windows live CD (check Hiren's BootCD again) or booting from a pen drive (FreeDOS, a special OS, etc.).

Most older hard-drive tools are not good for testing an SSD, as a sector is never in the same place; it is dynamically remapped by the firmware to spread the writes over the whole drive. So if those tools do write tests, you are just burning write cycles instead of truly testing the disk.

Read tests don't wear out the SSD, but they also might not really test all of the SSD's sectors, again because the firmware hides the real layout.

higuita
  • 585
1

Stressdisk

It is designed to stress test your mounted disks and find failures in them.

The downside is that its test length is configured as running time (default 24 hours), not as terabytes written (TBW). So you can't, for example, exit the test every 100 TBW, save the drive's relevant SMART stats, and continue for the next 100 TBW.

https://github.com/ncw/stressdisk
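
Typical usage looks something like this (subcommand and flag names per the project README; double-check with stressdisk -help):

# Fill the mounted filesystem with random data, then re-read and
# compare it continuously for 48 hours instead of the default 24
stressdisk -duration 48h run /mnt/ssd

# Remove the leftover test files afterwards
stressdisk clean /mnt/ssd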

Anvil's Storage Utilities

TechReport's SSD endurance experiment used this Windows utility. The Benchmarks menu includes Endurance testing.

There’s an integrated MD5 hash check that verifies data integrity. Anvil’s endurance test is not an ideal real-world simulation because it writes files sequentially. The utility can write compressed data in 5 compression grades.

Note: for endurance testing, AnvilPro.exe needs to be copied to, and run from, the device under test.

Endurance testing - Anvil's Storage Utilities 1.0.25 beta 3 (July 26, 2011)

download: https://www.guru3d.com/download/anvils-storage-utilities-download/

SpinRite

SpinRite will run even on raw drives; it runs only offline, on a modified FreeDOS OS. However, SpinRite operates at the sector level.

With some DOS batch-file scripting, SpinRite torture testing can be automated to run a number of times, as in the sketch below. The most recent SpinRite, 6.1 RC5, by default writes its logs to its FAT32 USB thumb drive.
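
A rough sketch of such a batch file, assuming your SpinRite version accepts command-line options (the "level 5" and "exitondone" options below are assumptions, not documented fact; check the documentation for your release):

@echo off
rem Hypothetical loop: four back-to-back level-5 passes. The option names
rem are assumptions; verify what your SpinRite version actually supports.
for %%n in (1 2 3 4) do spinrite level 5 exitondone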

Level 5

At level 5, SpinRite will read and write all sectors of data twice. ALL data is "inverted" from 1's to 0's and 0's to 1's to verify that any data can be successfully written and read. There is no data loss from this SpinRite write process: your original data is inverted twice, and is thus back in its original state afterwards.

Level 5 will also mark any regions of the drive that were previously found to be defective, but now test reliable, as being once again available for full use.

Pro Backup
  • 4,924