
Stephen Foskett, Pack Rat

Understanding the accumulation of data


Why Big Disk Drives Require Data Integrity Checking

December 19, 2014 By Stephen 7 Comments

Hard disk drives keep getting bigger, meaning capacity just keeps getting cheaper. But storage capacity is like money: The more you have, the more you use. And this growth in capacity means that data is at risk from a very old nemesis: Unrecoverable Read Errors (URE).

You might think RAID is for data protection, but it does nothing of the sort!

Let’s get one thing out of the way from the start: The only thing protecting your data from corruption is some simple error checking on the disk drive itself and anything built into the software stack on your array or server. RAID doesn’t do any error checking at all, and neither does NTFS in Windows or HFS+ in Mac OS X. And none of those things can correct a read error if they encounter one. When people talk about disk or integrity checks they’re usually talking about the integrity of the file system or RAID set, not of the actual data itself.

Let this sink in: Your data is not protected. It can be corrupted. And you will never know until you need it.

Yes, I’m trying to scare you.

What Protects Your Data?

For most regular people, your only line of defense against random read and write errors is something called error correction coding (ECC), which is built into your hard disk drive’s controller. ECC is essential because magnetic media often has “bad” bits that aren’t readable, especially as information density increases. So hard disk controllers take care of recoverable read errors all the time.

As implemented in most modern hard disk drives, ECC works pretty well, but it’s not perfect. Most manufacturers claim that 1 bad bit will slip through every 10^14 to 10^16 bits, which is actually really good. But what about those unrecoverable read errors (UREs)? They’re out of the disk drive’s hands. Hopefully something higher in the stack can recover the data: maybe your filesystem, or maybe the storage array software.

The good news is that every enterprise storage array worthy of the name has data integrity checking built in, including all the big names and most of the smaller companies, too. After all, if an array can’t store data reliably it’s not really worth buying! So if you’re using a storage array, you’re probably good. Drobo apparently does integrity checking, too. So there’s that.

The bad news is that NTFS, ext3, and HFS+ don’t do any kind of data integrity checking. That means that the vast majority of user data is reliant on the ECC in the hard disk drive itself to ensure it meets the prime directive of storage.

The Prime Directive of storage: Do not return the wrong data!

The worse news is that unrecoverable read errors do happen, so all this data is at risk. Heavy data users (Greenplum, Amazon, CERN) report that errors really do occur about as often as hard disk drive manufacturers suggest they might. Furthermore, errors often come after the disk controller is done with the data: faulty firmware, poor connections, bad cables, and even cosmic radiation can induce UREs.

How Common is URE?

It’s hard to understand what one error in 10^14, 10^15, or 10^16 bits really means in the real world. One easier way to think about it is that 10^14 bits equals 12.5 TB, 10^15 bits equals 125 TB, and 10^16 bits is 1.25 PB. But this doesn’t really tell the correct story either. These are error rates, not error guarantees. You can read an exabyte of data and never encounter a URE, just like you can buy a lottery ticket and become a millionaire. The important thing is the probability.

As Matt Simmons points out, we can easily calculate the probability of a URE based on a given amount of data. The formula is Statistics 101 material, and he does a fine job of laying it out in his blog post, Recalculating Odds of RAID5 URE Failure. But even that was a little hard to grasp.

So here’s my take: Given a number of hard disk drives of a certain size in a set, how likely is a URE? I graphed it out for 1-10 drives of modern sizes, 1-10 TB. And the results are pretty scary.

At 10^14, the probability of a URE gets pretty high for big drives!

Although a single 1 TB drive has less than an 8% chance of a URE, those fancy new 10 TB drives start out over 55%, assuming a URE rate of 1 in 10^14. Throw a few into a RAID set and you’ve got real trouble. If your risk threshold is a 50/50 chance, you can’t have more than three 3 TB drives (9 TB) in a set before you’re there. Even if you’re a crazy risk-taker, five 6 TB drives (30 TB) gets you over 90%. This is not good.
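These figures are easy to reproduce. Here is a minimal sketch in Python (the helper name is my own, and I use `log1p`/`expm1` to keep such tiny per-bit rates numerically accurate):

```python
import math

def ure_probability(terabytes, bit_error_rate):
    """Probability of hitting at least one URE while reading
    `terabytes` of data, given a per-bit unrecoverable error rate."""
    bits = terabytes * 8e12  # 1 TB = 8 * 10^12 bits
    # P(no error) = (1 - BER)^bits, so P(at least one) = 1 - that
    return -math.expm1(bits * math.log1p(-bit_error_rate))

# Checking the figures above at a URE rate of 1 in 10^14:
print(f"1 TB:  {ure_probability(1, 1e-14):.1%}")   # under 8%
print(f"10 TB: {ure_probability(10, 1e-14):.1%}")  # over 55%
print(f"9 TB:  {ure_probability(9, 1e-14):.1%}")   # three 3 TB drives: past 50/50
print(f"30 TB: {ure_probability(30, 1e-14):.1%}")  # five 6 TB drives: over 90%
```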

How about a URE rate of 1 in 10^15? You’d reach 50% at around 90 TB, which is admittedly pretty high. But that’s still not a crazy huge amount of data, and it’ll be downright common in just a few years. And when you consider the reasonably likely issue of bad firmware and bad cables, it seems that UREs aren’t such an alien issue.
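That break-even point falls out of the same math by solving for capacity instead of probability; a quick sketch (the function name is mine):

```python
import math

def capacity_for_probability(target_p, bit_error_rate):
    """Terabytes you must read before the URE probability reaches target_p."""
    # Solve (1 - BER)^bits = 1 - target_p for bits, then convert to TB
    bits = math.log1p(-target_p) / math.log1p(-bit_error_rate)
    return bits / 8e12

# At 1 in 10^15, a 50/50 chance of a URE arrives at roughly 87 TB
print(f"{capacity_for_probability(0.5, 1e-15):.1f} TB")
```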

Play around with the numbers using my Google URE Spreadsheet.

What happens if you lose a bit? Maybe nothing. Video and audio files will probably keep playing. Photos might still look OK. But maybe not. And a faulty cable could wipe out the whole file, not just a bit of it (if you pardon the pun).

Protect Your Data

What can you do about unrecoverable read errors? Simple answer: Use a better storage stack.

As mentioned above, most enterprise storage systems implement serious data integrity checking as part of their storage controller. Many use erasure coding, like the Reed-Solomon codes already used for ECC in the hard disk drive. Others prefer retaining a SHA1 hash for all data and recovering it from an alternate location if it gets corrupted. Either way, the risk is reduced to such an extent that you don’t have to worry about it.
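In spirit, the hash-based approach looks something like this toy Python sketch: every name here is hypothetical, and a real array does this in the controller with far more sophistication, but the verify-then-recover loop is the core idea.

```python
import hashlib

class MirroredStore:
    """Toy store: keeps two copies of each block plus a SHA-1 digest,
    verifies the digest on every read, and repairs from the surviving
    copy when it detects silent corruption."""
    def __init__(self):
        self.primary, self.mirror, self.digests = {}, {}, {}

    def write(self, key, data):
        self.primary[key] = data
        self.mirror[key] = data
        self.digests[key] = hashlib.sha1(data).digest()

    def read(self, key):
        data = self.primary[key]
        if hashlib.sha1(data).digest() != self.digests[key]:
            # Silent corruption detected: recover from the alternate copy
            data = self.mirror[key]
            self.primary[key] = data  # repair the bad copy in place
        return data

store = MirroredStore()
store.write("block0", b"important data")
store.primary["block0"] = b"important dat\x41"  # simulate a flipped byte on disk
print(store.read("block0"))  # the intact b"important data" comes back
```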

But what about servers and home users of non-enterprise storage? You’ve got trouble here. You can’t rely on NTFS, ext, or HFS+. Btrfs and ReFS have integrity features, but they’re not effective out of the box: Btrfs only checks integrity with CRC32 and it’s not clear how it recovers data, while Microsoft engineered “integrity streams” into ReFS but only uses them for metadata by default. All Microsoft needs to do is turn on integrity streams for user data, which seems like a no-brainer in enterprise storage scenarios.
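The limitation is easy to see: a CRC tells you a block went bad, but without a redundant copy there is nothing to restore it from. A minimal illustration using Python’s `zlib` (a plain CRC32 here; Btrfs actually uses the CRC32C variant):

```python
import zlib

block = b"user data that matters"
stored_crc = zlib.crc32(block)  # checksum recorded at write time

corrupted = b"user data that mutters"  # one byte flipped on disk
assert zlib.crc32(corrupted) != stored_crc  # corruption is detected...
# ...but the CRC alone cannot reconstruct the original bytes;
# recovery requires a second copy or parity stored elsewhere.
print("corruption detected")
```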

The only real option is ZFS. It has wonderful, robust integrity checking and data recovery. In fact, this was one of the design goals for ZFS! Honestly, if you care about your data and have more than a dozen terabytes of it, you must use something like ZFS or a real storage array.

Maybe not today, and maybe not tomorrow, but soon you’re going to need data integrity checking and error recovery. It’s time for Microsoft, Apple, and Btrfs to step up and provide it.


Filed Under: Apple, Enterprise storage, Features, Terabyte home, Virtual Storage Tagged With: Btrfs, CRC, CRC32C, data protection, ECC, erasure codes, error, HFS, NTFS, parity, RAID, ReFS, unrecoverable read errors, URE, ZFS
