Hard disk drives encounter errors from time to time, so it’s a good thing that most have the ability to recover data anyway. But RAID systems usually have their own error recovery capabilities and can be thrown off when a hard disk pauses I/O. So it’s a good idea to use hard disk drives that allow you to disable or limit error recovery in RAID systems.
Error Recovery Basics
Hard disk drives have more points of failure than most other modern computer components: They are physical devices that rely on magnetism and mechanical precision, not just solid state electronics. And ever-increasing drive density magnifies the challenge of always returning valid data. In fact, magnetic disk media is surprisingly unreliable, with hard drives often relying on error recovery technologies to cover for read and write errors.
The most basic form of error recovery on hard disk drives is error-correcting code (ECC) stored with each sector, which reliably uncovers read errors and can repair small ones on its own. When that is not enough, disk drives can in most cases re-try the read, adjusting the heads slightly to detect the correct value. Once an error is detected and the correct data is uncovered, the disk drive will either re-write the data in place or mark that spot as bad and re-map it to another physical location.
All this should happen very quickly, but the application must wait for it to complete. Under light load, this process is barely noticeable. But systems with heavy I/O can escalate this wait time to unacceptable levels. In busy systems, an error recovery can take many seconds or even minutes to complete.
RAID and Error Recovery
Multi-drive systems, including RAID and similar solutions, can’t tolerate long waits for error recovery. Most RAID controllers assume that a drive that hasn’t completed an I/O request within a few seconds has failed. The controller will then mark the entire disk drive as “offline” and attempt to rebuild using an available spare disk or simply take the entire RAID set offline to avoid data loss. This can prove problematic, since a RAID rebuild can take hours or days to complete!
It’s not the fault of the RAID system, either. There has to be some threshold where a disk is declared to have failed. It wouldn’t be practical (or even desirable) to escalate the I/O wait “up the stack” and pause all operations until a disk recovers (if ever). So most RAID solutions or controllers set a threshold of a few seconds.
The rule of thumb for RAID controllers is 8 seconds, though this varies: some controllers wait 10, 20, or 30 seconds, and many let you configure the timeout. ZFS will generally wait as long as needed for error recovery, and this can dramatically impact performance.
Time-Limited Error Recovery
Disk drives intended for RAID use typically implement some form of time limiting for error recovery. Western Digital calls this Time Limited Error Recovery (TLER), while Seagate calls it Error Recovery Control (ERC) and Samsung and Hitachi call it Command Completion Time Limit (CCTL).
Regardless of what it’s called, the drive will limit the wait time on any error recovery command to a settable value, typically 7 seconds by default. The drive will usually report a failed I/O up the stack and attempt to re-try the error recovery at a later time. Meanwhile, the RAID controller will likely recover the data from parity or erasure code and continue operation.
ZFS, and other software RAID systems, will typically “react” the same way when TLER is enabled, recovering data and remapping that block.
Note that most desktop hard disk drives do not have this capability. Error recovery is always turned on and recovery will take as long as necessary. This is one reason that conventional desktop disk drives are not appropriate for use in RAID solutions.
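The interaction between a drive's recovery limit and the controller's timeout can be sketched as a small shell function. This is an illustrative model only, not how any particular controller works: raid_outcome and its three arguments are hypothetical names, all times are in tenths of a second, and a limit of 0 stands for a drive with no time limit.

```shell
#!/bin/sh
# Hypothetical model of one bad read. All times are in tenths of a second.
# usage: raid_outcome RECOVERY_NEEDED ERC_LIMIT CONTROLLER_TIMEOUT
#   ERC_LIMIT of 0 means the drive never gives up (desktop-style behaviour).
raid_outcome() {
    needed=$1
    limit=$2
    timeout=$3

    # How long does the drive stay busy before it answers the controller?
    if [ "$limit" -gt 0 ] && [ "$needed" -gt "$limit" ]; then
        busy=$limit      # drive stops at its limit and reports a failed I/O
        result=error
    else
        busy=$needed     # drive completes its own recovery
        result=ok
    fi

    # Simplifying assumption: the controller drops any drive that is
    # silent for at least its timeout.
    if [ "$busy" -ge "$timeout" ]; then
        echo "drive dropped by controller"
    elif [ "$result" = error ]; then
        echo "read error reported; controller rebuilds block from redundancy"
    else
        echo "drive recovered data itself"
    fi
}
```

For example, raid_outcome 300 70 80 models a 30-second recovery on a drive limited to 7 seconds behind an 8-second controller, and the drive stays in the array. The same error with a limit of 0, as on a desktop drive, ends with the drive being dropped.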
Checking and Setting TLER
If a hard drive is to be used in a RAID or similar setup, it is desirable to have TLER or ERC enabled and set to a value under 8 seconds.
Most UNIX-like systems have the “smartmontools” package, which includes the smartctl command. This can be used to query TLER and similar settings. For example, here is the result of that command in FreeNAS (FreeBSD) for a Western Digital Red NAS drive:
# smartctl -l scterc /dev/da2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
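The Read and Write values in that output can also be checked from a script. The following is a minimal sketch: scterc_status is a hypothetical helper name, the default ceiling of 79 (just under 8.0 seconds) is an assumption taken from the rule of thumb above, and the parsing assumes the output layout shown here, with anything non-numeric in the value column treated as "no limit set".

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -l scterc` output on stdin and
# reports whether both limits are set and no longer than MAX_DS.
# usage: smartctl -l scterc /dev/da2 | scterc_status [MAX_DS]
scterc_status() {
    max=${1:-79}   # assumed ceiling in tenths of a second
    awk -v max="$max" '
        $1 == "Read:"  { r = $2 }   # second field is the raw value, e.g. 70
        $1 == "Write:" { w = $2 }
        END {
            if (r !~ /^[0-9]+$/ || w !~ /^[0-9]+$/) {
                print "no limit set"
            } else if (r + 0 > max + 0 || w + 0 > max + 0) {
                printf "too long: read=%.1fs write=%.1fs\n", r / 10, w / 10
            } else {
                printf "ok: read=%.1fs write=%.1fs\n", r / 10, w / 10
            }
        }'
}
```

Fed the output above, this prints "ok: read=7.0s write=7.0s".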
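This tool can also set TLER on a drive. The values are given in tenths of a second, so the following sets both the read and write limits to 7.0 seconds:
smartctl -l scterc,70,70 /dev/da2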
Western Digital provides a DOS utility, WDTLER.EXE, with similar functionality.
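When several drives need the same limit, the smartctl command above can be generated for each of them. This is only a sketch: scterc_commands is a hypothetical helper, it accepts whole seconds only, and it prints the commands rather than running them so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) one smartctl command per device.
# usage: scterc_commands SECONDS DEVICE...
scterc_commands() {
    secs=$1
    shift
    ds=$((secs * 10))   # smartctl expects tenths of a second
    for dev in "$@"; do
        echo "smartctl -l scterc,$ds,$ds $dev"
    done
}
```

Running scterc_commands 7 /dev/da0 /dev/da1 prints two commands; piping that output to sh as root would apply them.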
Stephen’s Stance
One reason to use enterprise or NAS hard disk drives is the capability to limit error recovery for smoother performance. I strongly recommend only using such drives with RAID systems, especially ZFS (as in FreeNAS)!
Howard Marks says
I’d recommend 2-3 seconds. If the drive hasn’t recovered the data in 2 seconds I want the RAID or related data protection to take over and not cause a 5-8sec stutter.
Dan Bilzerian says
I agree with you.
Truman HW says
What about making it a function of the [real] MTBF times the number of drives?
Isn’t 3 seconds quick for something that’s a rare event?
TimC says
A short TLER/ERC limit may cause the RAID controller to drop a drive and label it faulty even though data recovery may be possible. A major problem occurs if drive 1 is dropped, degrading the array, and then a 2nd drive is dropped (for RAID5, or 3 drives for RAID6) before the data has been copied to the hot-spare(s). The act of a full array rebuild can cause other drives from the same batch to fail close to each other. RAID is for convenience and some protection but you still need a 2nd/3rd device/tape/location for BACKUP.