Why do some data storage solutions perform better than others? What tradeoffs are made for economy and how do they affect the system as a whole? These questions can be puzzling, but there are core truths that are difficult to avoid. Mechanical disk drives can only move a certain amount of data. RAM caching can improve performance, but only until it runs out. I/O channels can be overwhelmed with data. And above all, a system must be smart to maximize the potential of these components. These are the four horsemen of storage system performance, and they cannot be denied.
The Chain of Command
It is tempting to think of storage as a game of hard disk drives, and consider only The Rule of Spindles. But RAM caching can compensate for the mechanical limitations of hard disk drives, and Moore’s Law continues to allow for ever-greater solid-state capacity, from DRAM cache to flash. Storage does not exist in a vacuum, however. All that data must go somewhere, and that is the job of the I/O channel.
To be useful, storage capacity must connect to some sort of endpoint. This could be the CPU in a personal computer or an embedded processor in an industrial device. Indeed, there are endpoints and I/O channels throughout modern systems, with potential bottlenecks, caches, and smarts at each point. “Storage people” like me tend to think too small – imagining that the I/O channel ends at the disk drive, the “front end” of the array, or the storage network. But data must travel further, all the way to its final useful point in the core of the CPU.
Once we consider I/O as a long chain of interconnected endpoints, we begin to see that I/O constraints at any point can strangle overall system performance. This is not merely an academic exercise: Optimizing the I/O channel is a consuming passion for most practitioners of enterprise IT, including architects, engineers, and system developers. And, like a good game of Whack-a-Mole, increasing the speed of one link simply causes another chokepoint to rear its head.
Parallel and Serial I/O
Imagine you had a warehouse full of boxes to move across the country as fast as possible. There are a few options available to you:
- A fast truck can zip back and forth with just a few boxes
- A train is slower, but its many cars can haul a huge quantity
But there are realistic limits to both capacity and speed: The train has to fit on the tracks, and the truck can’t move at the speed of light. Plus, one must consider the time taken to load and unload the chosen vehicle.
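To put rough numbers on the analogy, here is a minimal sketch (with entirely made-up figures) showing that effective throughput is payload divided by total trip time, load and unload included:

```python
def effective_throughput(payload_gb, load_s, transit_s, unload_s):
    """Effective throughput in GB/s: payload over total trip time."""
    return payload_gb / (load_s + transit_s + unload_s)

# Illustrative numbers only: a small, fast "truck" vs. a big, slow "train".
truck = effective_throughput(payload_gb=10, load_s=5, transit_s=10, unload_s=5)
train = effective_throughput(payload_gb=1000, load_s=600, transit_s=1200, unload_s=600)
print(f"truck: {truck:.2f} GB/s, train: {train:.2f} GB/s")
```

Shrink the load and unload overhead or lengthen the trip and the winner flips, which is exactly the trade-off bus designers face.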
The same trade-offs apply to computer buses: Serial channels can be optimized to zip individual bits back and forth, or parallel buses can be designed to carry whole bytes (or more) at a time. The simplicity of serial communications is tempting, but designers continue to resort to parallelization for added throughput.
Note: Most serial protocols actually feature two links, one for transmit and another for receive, making them “full duplex.”
Serial storage interconnects now rule the roost, with fraternal twins SAS and SATA coming to dominate the disk interface landscape. The two share the same 1.5, 3, and now 6 gigabit per second serial physical interconnect, offering more than enough throughput for conventional hard disk drives and edging out older serial (Fibre Channel, SSA) and parallel (ATA and SCSI) alternatives.
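One caveat worth making concrete: these are line rates, not payload rates. SATA and SAS use 8b/10b encoding, so ten bits on the wire carry one byte of data. A quick sketch of the conversion:

```python
def payload_mb_per_s(line_rate_gbps):
    """Peak payload bandwidth of an 8b/10b-encoded serial link.
    Ten bits on the wire carry eight bits (one byte) of data."""
    return line_rate_gbps * 1e9 / 10 / 1e6  # line bits/s -> payload bytes/s -> MB/s

for rate in (1.5, 3.0, 6.0):
    print(f"{rate} Gb/s SATA/SAS -> ~{payload_mb_per_s(rate):.0f} MB/s")
# ~150, ~300, and ~600 MB/s respectively
```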
Networks (Ethernet, Fibre Channel, and InfiniBand) are predominantly serial as well, as are lower-end interconnects like USB and FireWire. Serial communication also dominates in the system bus world, with serial PCI Express toppling parallel PCI.
But parallel variants are often offered for increased throughput: Multi-lane PCI Express and bonded multi-link InfiniBand make up a fair portion of the installed base, while load balancing MPIO drivers are common in Fibre Channel storage. And let’s not forget that the “X4” variants of Ethernet use multiple bonded links as well.
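The appeal of bonding is plain multiplication. A sketch, using commonly quoted per-lane payload rates (PCI Express 2.0 at roughly 500 MB/s per lane per direction, InfiniBand SDR at roughly 250 MB/s per lane):

```python
def bonded_bandwidth_mb_s(per_lane_mb_s, lanes):
    """Aggregate payload bandwidth (one direction) of a multi-lane link."""
    return per_lane_mb_s * lanes

print(bonded_bandwidth_mb_s(500, 4))  # PCIe 2.0 x4: ~2000 MB/s each way
print(bonded_bandwidth_mb_s(250, 4))  # InfiniBand SDR 4x: ~1000 MB/s each way
```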
The Definition of Bottle Neck
Most English speakers have encountered the French term “cul de sac,” meaning “bottom of the bag,” or dead end. But hard disk drives have plenty of “bottom end,” or storage capacity. When it comes to disks, the issue is usually at the neck of the bag: Data just can’t be pulled out of a hard disk drive fast enough.
The density of modern hard disk drives (the capacity of our barrel) has been growing much more rapidly than the I/O channels serving them (the spigot). Where once a hard disk drive could be filled or emptied in an hour or two, modern drives take days or weeks!
I once called this “flush time,” but I think the wine metaphor is much more appetizing!
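The arithmetic behind flush time is simple division, as in the sketch below; the drive figures are rough assumptions for an older drive, a modern drive at its sequential best, and the same modern drive under random I/O, where effective throughput collapses:

```python
def flush_time_hours(capacity_gb, throughput_mb_s):
    """Hours to read or write an entire drive at a given sustained rate."""
    return capacity_gb * 1000 / throughput_mb_s / 3600

print(f"{flush_time_hours(9, 10):.2f} h")      # ~0.25 h: old 9 GB drive at 10 MB/s
print(f"{flush_time_hours(2000, 100):.1f} h")  # ~5.6 h: 2 TB drive, sequential best case
print(f"{flush_time_hours(2000, 5):.0f} h")    # ~111 h (days!): same drive, random I/O
```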
This “bottle neck” has serious implications beyond basic storage performance. Data protection suffers, since ever-larger storage systems can no longer be backed up by simply dumping their contents; system reliability declines, since week-long RAID rebuilds increase the risk of multiple drive failures; and cost containment takes a hit, since adding spindles for performance drives up prices.
Nowhere is this bottleneck more evident than in portable devices. Modern drives (like the 1 TB Seagate USB drive I recently reviewed) have massive capacity and pathetic performance. The USB 2.0 interface just can’t keep up, effectively capping useful capacity expansion: Even under perfect conditions at 25 MB/s, it would take half a day to fill that drive, reducing its value as a massive data movement peripheral. The emerging USB 3.0 standard promises to alleviate this performance issue for now, as illustrated by Iomega’s new external SSD.
Cache and solid state storage can help, but they have their own bottlenecks. Storage arrays typically use Fibre Channel or SAS SSDs, so their front-end interfaces remain the same. The best-performing SSDs use the PCI Express bus directly rather than emulating hard disk drives over SCSI interfaces. And even PCI Express might not be enough to handle the massive I/O of NAND flash or DRAM. In each case, the bottleneck moves down the chain.
A Chain of Bottlenecks
Let’s follow a typical I/O operation from the disk to the CPU core and count the I/O channels:
- A read head senses the state of a bit of magnetic material on the surface of a disk
- The head transmits this signal to a buffer on the disk controller board
- The data is picked up by the disk controller CPU and transmitted over a SATA or SAS connection
- The storage array or RAID controller receives the data and moves it over an internal bus to another buffer or cache
- The data is picked up by another CPU in the array controller and sent out another interface using Fibre Channel or Ethernet
- The data is buffered and retransmitted by one or more switches in the storage network
- The host bus adapter (HBA) on the server side receives the data and buffers it again before sending it over a local PCI Express bus to system memory
- The server memory controller pulls the data out of system memory and sends it via a local bus to the CPU core
There are actually many more steps than this, but the picture should be clear by now. There are many, many I/O channels to consider when it comes to storage, and the drive interface is just one potential bottleneck.
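One way to see why any single link matters: end-to-end throughput can never exceed the slowest hop. A minimal sketch with hypothetical per-hop figures, which also shows the Whack-a-Mole effect of speeding up one link:

```python
# Hypothetical sustained throughput of each hop, in MB/s (illustrative only).
chain = {
    "disk media":          120,
    "drive buffer / SATA": 300,
    "array cache / bus":   2000,
    "FC front end":        800,
    "SAN switch":          800,
    "HBA / PCIe":          1600,
    "memory bus":          6400,
}

slowest = min(chain, key=chain.get)
print(f"ceiling: {chain[slowest]} MB/s, set by {slowest}")

chain["disk media"] = 1000  # swap in an SSD...
slowest = min(chain, key=chain.get)
print(f"ceiling: {chain[slowest]} MB/s, set by {slowest}")  # ...the bottleneck just moves
```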
Optimizing Storage I/O
Tactical steps to improve storage performance typically focus on one link in the chain: Drive vendors move from 1.5 Gb to 3 Gb SATA, or SAN buyers upgrade from 4 Gb to 8 Gb Fibre Channel. But the basic architecture of enterprise storage has remained constant for over a decade, and the reliance on block SCSI commands endures. This is all about to change.
One critical bit of I/O optimization exists at the point of connection between the various chipsets inside the server. AMD pulled the memory controller off the “northbridge” and onto the CPU with their Athlon 64 line. Intel did the same with Nehalem and is eliminating the northbridge entirely with the Lynnfield/Jasper Forest CPU lines. This gives serious bandwidth to the crucial PCI Express-to-CPU-core link, moving the bottleneck downstream.
We are in the midst of a massive upgrade of the storage network as well. Between 8 Gb Fibre Channel, iSCSI and Fibre Channel over 10 Gb Ethernet, and persistent interest in InfiniBand, storage network throughput is rapidly expanding. As with the internal PC connections, the expansion of network bandwidth has pushed the bottleneck to the storage array interface for the time being.
Microsoft and Intel recently pushed over a gigabyte per second of iSCSI traffic across 10 GbE, but they needed multiple storage targets to feed that connection. It isn’t that modern storage systems couldn’t push that kind of I/O (indeed, arrays are tens to hundreds of times faster internally, thanks to their spindles and cache), but that conventional storage protocols are tightly linked to a single “front-end” interface. The current state of the art in storage array design is moving to distributed models, exemplified by pNFS and scale-out NAS concepts like MaxiScale (now acquired by Overland).
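The arithmetic shows why multiple targets were needed: a 10 GbE link can absorb roughly 1.25 GB/s of payload, more than a single conventional front end can supply. A sketch, with the per-target rate purely an assumption:

```python
import math

link_mb_s = 1250        # approximate payload ceiling of 10 Gb Ethernet
per_target_mb_s = 400   # assumed throughput of one conventional array front end

targets = math.ceil(link_mb_s / per_target_mb_s)
print(f"{targets} targets needed to saturate the link")  # 4, with these figures
```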
Once the array interfaces can pump out massive I/O, attention will turn once again to the disk interfaces themselves. Although 6 Gb/s SAS and SATA are now a reality, these interfaces are inadequate for future high-performance SSDs. Arrays designed around flash or DRAM are likely to switch to PCI Express as their internal connection of choice, both for performance and to optimize data placement on these new devices. Companies like Nimbus and NetApp are already moving in this direction.
Time To Get Smart
Hard disk drive spindles make up the bulk of storage capacity, and small amounts of cache make them far more effective. But both of these horsemen must operate within the constraints of the I/O channels they pass through. This brings us to the final horseman of performance: Smarts. Clever designers have devised control mechanisms to overcome the limits of spindles, cache, and I/O channels.