This is part of an ongoing series of longer articles I will be posting every Sunday as part of an experiment in offering more in-depth content.
It has been the core technology behind the storage industry since day one, but the sun is setting on traditional RAID technology. After two decades of refinement and fragmentation, we are abandoning the core concepts of disk-centric data protection as storage and servers go virtual. Next-generation storage products will feature refined and integrated capabilities based on pools of storage rather than combinations of disk drives, and we will all benefit from improved reliability and performance.
RAID Classic
Early storage systems were revolutionary, in physically removing storage from the CPU, in enabling sharing of storage between multiple CPUs, and especially in virtualizing disk drives using RAID. When Patterson, Gibson, and Katz proposed the creation of a redundant array of inexpensive disks (RAID) in 1987, they specified five numbered “levels”. Each level had its own features and benefits, but all centered on the idea that a static set of disk drives would be grouped together and presented to higher-level systems as a single drive. Storage devices, as a rule, mapped host data back to these integral disk sets, sometimes sharing a single RAID group among multiple “LUNs”, but never spreading data more broadly. Storage has remained stuck with small sets of drives ever since.
The core insight of the 1980s remains true: More spindles means better performance. Although additional overhead dulls the impact somewhat, the benefit of spreading data across multiple drives can be tremendous. A typical RAID set offers much better performance than any single drive working alone, and can survive a mechanical failure as a bonus.
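The "more spindles" effect follows directly from how classic striping maps addresses. A minimal sketch (illustrative names and parameters, not any product's actual layout) shows how a logical block address maps to a (disk, offset) pair by simple arithmetic, so a large sequential read fans out across every drive in the set:

```python
# Sketch of classic striping: a logical block address (LBA) maps to a
# (disk, offset) pair by simple arithmetic, so sequential reads keep
# every spindle in the set busy at once. Illustrative only.

NUM_DISKS = 4  # a static set of drives, as classic RAID assumes

def stripe_map(lba: int, num_disks: int = NUM_DISKS) -> tuple[int, int]:
    """Return (disk index, block offset on that disk) for a logical block."""
    return lba % num_disks, lba // num_disks

# Eight sequential logical blocks land on all four disks in turn:
layout = [stripe_map(lba) for lba in range(8)]
disks_touched = {disk for disk, _ in layout}
```

The same arithmetic is also why classic RAID is rigid: the mapping bakes the drive count into every address, so the set cannot easily grow, shrink, or move data after the fact.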
Cracks are appearing in the RAID veneer, however. Double drive failures are much more common than one would expect, leading to the development of hot spare drives and dual-parity RAID 6. If four drives perform well, then forty drives perform much better, leading to the common practice of “stacking” one RAID set on others. Caches and specialized processors were introduced to overcome the performance issues related to parity calculation.
But traditional RAID cannot overcome today’s most critical storage issues. As drives have become larger, the tiny chance of an unrecoverable media error compounds, becoming a certainty. Even dual-parity will not be able to guarantee data protection on the massive disks predicted for the near future — statistics cannot be denied. The latest disks contain so much data, without commensurate improvements in throughput, that rebuild times have skyrocketed, resulting in hours or days of reduced data protection.
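The statistics here are easy to check with a back-of-the-envelope calculation. Assuming the commonly quoted spec of one unrecoverable read error (URE) per 10^14 bits read (real drives vary, and enterprise drives are rated better), the odds of hitting at least one URE while reading every surviving drive during a RAID 5 rebuild look like this:

```python
# Back-of-the-envelope odds of hitting an unrecoverable read error (URE)
# during a RAID 5 rebuild, which must read every surviving drive in full.
# Assumes the commonly quoted spec of one URE per 1e14 bits read;
# real drive ratings vary.

URE_RATE = 1e-14  # probability of an unrecoverable error per bit read

def p_rebuild_failure(drive_tb: float, surviving_drives: int) -> float:
    """Chance of at least one URE while reading all surviving drives."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    return 1 - (1 - URE_RATE) ** bits_read

# A 7-drive RAID 5 set of 2 TB disks reads six full drives to rebuild:
p_small = p_rebuild_failure(2.0, 6)   # already worse than a coin flip
p_large = p_rebuild_failure(12.0, 6)  # near-certain with bigger disks
```

At 2 TB per drive the rebuild already fails more often than not, and at the drive sizes on the horizon the failure probability climbs past 99 percent, which is exactly why single parity stopped being enough.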
RAID is also ill-suited to the demands of virtualized systems, where predictable I/O patterns become fragmented. It cannot provide tiered storage or account for changing requirements over time. It cannot take advantage of the latest high-performance solid state storage technology. It cannot be used in cloud architectures, with massive numbers of small devices clustered together. It interferes with power-saving spin-down ideas. Most RAID implementations cannot even grow or shrink with the addition or removal of a disk. In short, traditional RAID cannot do what we now need storage to do.
RAID is Dead
Although most vendors still use the name, nearly every one has abandoned much of the classic RAID technology. EMC’s Symmetrix pioneered the idea of sub-disk RAID, pairing just a portion of each disk with others to reduce the impact of “hot spots”. HP’s AutoRAID added the ability to dynamically move data from one RAID type to another to balance performance. And NetApp paired disk management so closely with their filesystem that they were able to use RAID 4 and the flexibility it brings.
Today, a new generation of devices has even evolved beyond RAID’s concept of coherent disk sets. Compellent, Dell EqualLogic, 3PAR and others focus on blocks of data, moving portions of a LUN between RAID sets, disk drive types, and even inner or outer tracks based on access patterns. With these devices, a single LUN could encompass data on every drive in the storage array. And the latest clustered arrays can spread data across multiple storage nodes to scale performance and protection.
These innovative devices point the way to a future in which virtual storage is serviced and protected very differently than in the past. Perhaps software like Sun’s ZFS serves to illustrate this future best: It unifies storage as a single pool, intelligently protecting it and presenting flexible storage volumes to the operating system. Although Sun calls its data protection scheme “RAID-Z”, it has little in common with its namesake. Like NetApp’s WAFL, the copy-on-write ZFS filesystem is totally integrated with the layout of data on disk, allowing mobility and efficient use of storage. A single pool can include striping, single- or dual-parity, and mirroring, and disks can be added as needed. Importantly, ZFS also checksums all reads, detecting disk errors.
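The checksum-on-read idea can be sketched in a few lines. This is a toy model of the concept only, not ZFS's actual on-disk format: the checksum is stored apart from the data block, so silent corruption on disk is caught the moment the block is read rather than passed along to the application.

```python
# Toy model of checksum-on-read, the idea behind ZFS-style end-to-end
# verification: checksums live apart from the data blocks they protect,
# so silent corruption is detected on every read.
# An illustration of the concept, not ZFS's actual format.
import zlib

class ChecksummedStore:
    def __init__(self) -> None:
        self.blocks = {}     # block id -> raw bytes (the "disk")
        self.checksums = {}  # block id -> checksum, kept separately

    def write(self, blkid: int, data: bytes) -> None:
        self.blocks[blkid] = data
        self.checksums[blkid] = zlib.crc32(data)

    def read(self, blkid: int) -> bytes:
        data = self.blocks[blkid]
        if zlib.crc32(data) != self.checksums[blkid]:
            raise IOError(f"checksum mismatch on block {blkid}")
        return data

store = ChecksummedStore()
store.write(7, b"important data")
clean = store.read(7)                  # clean read passes verification

store.blocks[7] = b"important dat\x00"  # simulate silent bit rot on disk
# store.read(7) would now raise IOError instead of returning bad data
```

Because the check happens on the read path, corruption is caught while the redundant copy or parity still exists to repair it, rather than discovered later when no good copy remains.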
Long Live RAID
The post-RAID future will see these concepts spread across all enterprise storage devices. Disks will be pooled rather than segregated into RAID sets. Tight integration between layout and data protection will allow for much greater flexibility, integrating tiering and differing data protection strategies in a unified whole. Storage virtualization will allow mobility of data within these future storage arrays, and clustering will enable massive scalability.
Two things will likely remain to remind us of Patterson, Gibson, and Katz, however. First, the core principle that multiple drives working as one yields dividends in terms of performance and data protection. And second, that whatever we use should be called RAID, even though the definition of that term has changed beyond recognition in the last two decades.
stevetodd says
Hi Stephen,
Good article on RAID. My comment: not so fast. RAID5/6 implementations based on Patterson et al. are still being heavily purchased and deployed in the industry. One reason: the mathematical lookup of data, as described by Patterson, is not only fast, but more importantly, it’s trusted. Customers are cognizant of the value of this direct mapping. Virtualizing the location of customer data has its place (e.g. enabling snaps), but mathematical lookup will continue to play a valuable role at the very bottom of the stack.
Keep up the interesting posts,
Steve
Stephen says
Steve,
Great point! I’m very much a pragmatist, as are most successful storage managers. They’ll continue to use what works long into the future, even as new ideas come and go. But let me be clear – this is the end for RAID tied to specific whole disks. We’ll continue to see RAID 5 and RAID 6 (and RAID 1) based on subdisks (as in DMX et al) but devices like the CLARiiON will give up their strict disk-centric model. In my opinion, of course!
Stephen