Hard disk drives keep getting bigger, meaning capacity just keeps getting cheaper. But storage capacity is like money: The more you have, the more you use. And this growth in capacity means that data is at risk from a very old nemesis: Unrecoverable Read Errors (URE).
Let’s get one thing out of the way from the start: The only thing protecting your data from corruption is some simple error checking on the disk drive itself and anything built into the software stack on your array or server. RAID doesn’t do any error checking at all, and neither does NTFS in Windows or HFS+ in Mac OS X. And none of those things can correct a read error if they encounter one. When people talk about disk or integrity checks they’re usually talking about the integrity of the file system or RAID set, not of the actual data itself.
Let this sink in: Your data is not protected. It can be corrupted. And you will never know until you need it.
Yes, I’m trying to scare you.
What Protects Your Data?
For most regular people, your only line of defense against random read and write errors is something called error correction coding (ECC), which is built into your hard disk drive’s controller. ECC is essential because magnetic media often has “bad” bits that aren’t readable, especially as information density increases. So hard disk controllers take care of recoverable read errors all the time.
As implemented in most modern hard disk drives, ECC works pretty well, but it’s not perfect. Most manufacturers claim that 1 bad bit will slip through for every 10^14 to 10^16 bits read, which is actually really good. But what about those unrecoverable read errors (UREs)? They’re out of the disk drive’s hands. Hopefully something higher in the stack can recover the data: maybe your filesystem, or maybe the storage array software.
The good news is that every enterprise storage array worthy of the name has data integrity checking built in, including all the big names and most of the smaller companies, too. After all, if an array can’t store data reliably it’s not really worth buying! So if you’re using a storage array, you’re probably good. Drobo apparently does integrity checking, too. So there’s that.
The bad news is that NTFS, ext3, and HFS+ don’t do any kind of data integrity checking. That means that the vast majority of user data is reliant on the ECC in the hard disk drive itself to ensure it meets the prime directive of storage.
The worse news is that unrecoverable read errors do happen, so all this data is at risk. Heavy data users (Greenplum, Amazon, CERN) report that errors really do happen about as often as hard disk drive manufacturers suggest they might. Furthermore, errors often come after the disk controller is done with the data: Faulty firmware, poor connections, bad cables, and even cosmic radiation can induce UREs.
How Common is URE?
It’s hard to understand what one error in 10^14, 10^15, or 10^16 bits really means in the real world. One easier way to think about it is that 10^14 bits equals 12.5 TB, 10^15 bits equals 125 TB, and 10^16 bits is 1.25 PB. But this doesn’t really tell the correct story either. These are error rates, not error guarantees. You can read an exabyte of data and never encounter a URE, just like you can buy a lottery ticket and become a millionaire. The important thing is the probability.
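If you want to check those conversions yourself, the arithmetic is simple (a quick sketch, assuming 1 TB = 10^12 bytes):

```python
# Convert "1 error per 10^N bits read" into the amount of data you'd
# read, on average, before hitting one URE (1 TB = 10^12 bytes).
for exp in (14, 15, 16):
    bits = 10 ** exp
    terabytes = bits / 8 / 1e12   # bits -> bytes -> terabytes
    print(f"10^{exp} bits = {terabytes:,.2f} TB")
```

This prints 12.50 TB, 125.00 TB, and 1,250.00 TB (i.e., 1.25 PB) for the three rates.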
As Matt Simmons points out, we can easily calculate the probability of a URE based on a given amount of data. The formula is Statistics 101 material, and he does a fine job of laying it out in his blog post, Recalculating Odds of RAID5 URE Failure. But even that was a little hard to grasp.
So here’s my take: Given a number of hard disk drives of a certain size in a set, how likely is a URE? I graphed it out for 1-10 drives of modern sizes, 1-10 TB. And the results are pretty scary.
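Here’s a minimal sketch of that calculation in Python (my own illustration, not Matt’s code; assuming 1 TB = 10^12 bytes). The chance of at least one URE after reading n bits at an error rate r is 1 − (1 − r)^n:

```python
import math

def ure_probability(tb_read, rate=1e-14):
    """Chance of hitting at least one URE while reading `tb_read`
    terabytes, at a bit error rate of `rate` errors per bit read."""
    bits = tb_read * 8e12                      # 1 TB = 8 * 10^12 bits
    # 1 - (1 - rate)^bits, computed stably via log1p/expm1
    return -math.expm1(bits * math.log1p(-rate))

for tb in (1, 9, 10, 30):
    print(f"{tb:>3} TB @ 1e-14: {ure_probability(tb):.1%}")
print(f" 90 TB @ 1e-15: {ure_probability(90, rate=1e-15):.1%}")
```

This reproduces the figures discussed below: roughly 7.7% for 1 TB and 55% for 10 TB at a rate of 1 in 10^14, and about 50% at 90 TB for a rate of 1 in 10^15.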
Although a single 1 TB drive has less than an 8% chance of URE, those fancy new 10 TB drives start out over 55%, assuming a URE rate of 1 in 10^14. Throw a few into a RAID set and you’ve got real trouble. If your risk threshold is a 50/50 chance, you can’t have more than three 3 TB drives (9 TB) in a set before you’re there. Even if you’re a crazy risk-taker, five 6 TB drives (30 TB) gets you over 90%. This is not good.
How about a URE rate of 1 in 10^15? You’d reach 50% at around 90 TB, which is admittedly pretty high. But that’s still not a crazy huge amount of data, and it’ll be downright common in just a few years. And when you consider the reasonably likely issue of bad firmware and bad cables, URE doesn’t seem like such a remote possibility.
Play around with the numbers using my Google URE Spreadsheet.
What happens if you lose a bit? Maybe nothing. Video and audio files will probably keep playing. Photos might still look OK. But maybe not. And a faulty cable could wipe out the whole file, not just a bit of it (if you’ll pardon the pun).
Protect Your Data
What can you do about unrecoverable read errors? Simple answer: Use a better storage stack.
As mentioned above, most enterprise storage systems implement serious data integrity checking in their storage controllers. Many use erasure coding, like the Reed-Solomon codes already used for ECC in the hard disk drive itself. Others retain a SHA-1 hash for all data and recover it from an alternate location if it gets corrupted. Either way, the risk is reduced to such an extent that you don’t have to worry about it.
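To make the hash-and-recover approach concrete, here’s a hypothetical sketch (the names `store_block` and `read_block` are mine, not any vendor’s API): keep a SHA-1 digest per block, and fall back to a second copy when the primary fails verification.

```python
import hashlib

def store_block(primary, replica, index, data):
    """Write a block to two locations and return its SHA-1 digest."""
    primary[index] = data
    replica[index] = data
    return hashlib.sha1(data).hexdigest()

def read_block(primary, replica, index, digest):
    """Verify the block against its digest; recover from the replica
    (and repair the primary) if the primary copy is corrupted."""
    data = primary[index]
    if hashlib.sha1(data).hexdigest() == digest:
        return data
    data = replica[index]
    if hashlib.sha1(data).hexdigest() != digest:
        raise IOError("both copies corrupted")
    primary[index] = data          # silently heal the bad copy
    return data

# Simulate a URE: the primary copy comes back corrupted
primary, replica = {}, {}
digest = store_block(primary, replica, 0, b"important data")
primary[0] = b"important dat\x00"  # bit rot
assert read_block(primary, replica, 0, digest) == b"important data"
assert primary[0] == b"important data"  # repaired from the replica
```

Real arrays do this at a much lower level, of course, but the detect-then-recover logic is the same idea.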
But what about servers and home users of non-enterprise storage? You’ve got trouble here. You can’t use NTFS, ext, or HFS+. Btrfs and ReFS have integrity features but they’re not effective out of the box: Btrfs only checks integrity with CRC32 and it’s not clear how it recovers data, while Microsoft engineered “integrity streams” into ReFS but only uses them for metadata by default. All they need to do is turn on integrity streams for user data, and this seems like a no-brainer for them in enterprise storage scenarios.
The only real option is ZFS. It has wonderful, robust integrity checking and data recovery. In fact, this was one of the design goals for ZFS! Honestly, if you care about your data and have more than a dozen terabytes of it, you must use something like ZFS or a real storage array.
Maybe not today, and maybe not tomorrow, but soon you’re going to need data integrity checking and error recovery. It’s time for Microsoft, Apple, and Btrfs to step up and provide it.
Chris M Evans says
An interesting discussion indeed and one that has been covered a number of times since the advent of TB capacity HDDs. I think the first discussion I saw on this was from Dave Hitz; not surprisingly with 28-disk wide RAID stripes, they were likely to see the problem early.
For your readers I think you should explain where the 10^14 or 10^16 figures come from. An enterprise-class SAS HDD offers the higher reliability (10^16) whereas desktop SATA drives will see 10^14 reliability. Of course we should also remember the SATA drives will likely have much higher capacity models (e.g. 4TB+) than SAS which is topping out around 600-900GB.
Taking these facts, consider two scenarios:
(a) – vendors using desktop drives in Enterprise arrays. Yes, I’ve seen this when I’ve analysed new top end hardware. The three letter acronym vendors are using commodity desktop drives and selling them at enterprise prices. Should we be concerned about this (especially if they have software to manage the issue)?
(b) – the rise of hyperconverged solutions. How much of the last 20 years of knowledge on recovery from URE problems has been baked into hyperconverged? I asked this exact question about VSAN of Cormac Hogan at the UK annual VMUG in November 2013 and his response was “NONE”. As far as he was aware, VSAN had no specific features to manage URE or predictive failure of drives.
Point (b) concerns me more than anything else. We’ve spent 25+ years refining the ability to recover from bit errors and now we’re likely to throw that knowledge away and start again with all the data loss scenarios we thought we’d eliminated. Worse still, we’re likely to choose the cheapest HDD products to build those solutions because the benefits are all sold on having the cheapest hardware solution available.
So Caveat Emptor – and ask your hyperconverged vendor exactly what data management features are built into their solutions.
Didier Pironet says
Great post as usual!
I was wondering… What does it take to make a HDD more resilient to URE and go from 10^14 to 10^16? Better material? Additional error detection mechanisms? SAS controller? A bit of all three?
“The only real option is ZFS. It has wonderful, robust integrity checking and data recovery. In fact, this was one of the design goals for ZFS! Honestly, if you care about your data and have more than a dozen terabytes of it, you must use something like ZFS or a real storage array.”
Or use RAID6. Everyone is headed there for disk pools. Crazy not to. We’ll see more XDP-like wide striping. Infinidat does that today with RAID6 stripes.
ZFS triple-parity is a funny thing. No one in the industry is rushing to bring forth their own triple-parity protection. I wonder why.
Interesting. But how about people (such as photographers or video editors like me) working from home with, say, a RAID-1 array of 2 x 6TB disks? How does one realistically protect against these errors?
Btrfs always checksums both data and metadata, using CRC32C for each 4 KB block. This is sufficient for checksumming this amount of data, since it isn’t a cryptographic use case. If there is a checksum mismatch on read, the file system reports an error and includes the full path of the affected file; if there is a copy of the affected data (RAID 1, RAID 10, RAID 5/6), it’s used instead of the bad copy. Since the start of RAID 1 and 10 support, the bad copy has been automatically repaired; since kernel 3.19, repairs are done for RAID 5/6 as well.
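As an illustration of what per-block checksumming looks like (a sketch only: Btrfs uses CRC32C, while Python’s standard library ships plain CRC32, so I’m substituting that here), detection on read goes roughly like this:

```python
import zlib

BLOCK = 4096  # checksum granularity: one checksum per 4 KiB block

def checksum_blocks(data):
    """Compute one CRC per 4 KiB block (the final block may be short)."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify_read(data, checksums, path):
    """On read, recompute each block's checksum and report mismatches."""
    bad = [i for i, c in enumerate(checksum_blocks(data)) if c != checksums[i]]
    for i in bad:
        print(f"checksum error in {path}, block {i}")
    return not bad
```

A single flipped byte anywhere in a block changes that block’s CRC, so the filesystem knows exactly which block (and therefore which file) to report or repair from a mirror.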
William Warren says
ReFS with mirror or parity spaces checks both user data and meta data: https://technet.microsoft.com/en-us/library/hh831724(v=ws.11).aspx
Integrity. ReFS stores data in a way that protects it from many of the common errors that can normally cause data loss. When ReFS is used in conjunction with a mirror space or a parity space, detected corruption—both metadata and user data, when integrity streams are enabled—can be automatically repaired using the alternate copy provided by Storage Spaces. In addition, there are Windows PowerShell cmdlets (Get-FileIntegrity and Set-FileIntegrity) that you can use to manage the integrity and disk scrubbing policies.
William Warren says
Now, unlike ZFS, the file integrity check is not done in real time but as a scheduled task. This is probably a compromise made to reduce the RAM requirement of an otherwise full COW system like ZFS.
Richard H says
I don’t understand how “RAID doesn’t do any error checking at all, …”
squares with “Parity RAID protects against URE by calculating a checksum for all data and recalculating the data if an error is detected.”