I usually welcome discussion (and even argument) about the things I know best: There is always more to learn, and the best insights come through engaging those who disagree with us. But some ideas have been argued so well for so long that they deserve enshrinement. For example, although non-scientists like to argue about evolution and climate change, the scientific community no longer feels that their theories in these areas require much discussion. Like gravity and relativity, they have been accepted as a foundation upon which to build more interesting hypotheses.
My field of enterprise storage has its share of generally accepted theories:
- Availability, backup, and archive form a Data Protection Trinity: They are unique requirements calling for focused solutions.
- The Rule of RAID: Combining multiple disk drives in creative ways allows us to change the inherent reliability and performance of the system.
- When it comes to storage management, Homogeneity is Paramount: A single storage administrator can manage thousands of identical systems but would be hard-pressed to support a half-dozen unique ones.
- The entire history of computing demonstrates that Connectivity Trumps Capacity when sizing systems: Performance bottlenecks always limit the scalability of storage systems.
Each of these theories underpins our industry’s daily routine of storing and retrieving the data that drives modern society. These storage theories are also targets for innovation, with the best minds constantly trying to bend or break them.
This album of storage theories also has a B-side, however. These are the no-longer-true theories that have been transcended, as well as the dubious beliefs that were never really true.
- Commutability of Management and Cost is highly suspect: Unless one is considering only identical and homogeneous systems, the total cost of ownership (TCO) or capacity managed per administrator (TB/admin) of a given system cannot meaningfully be compared between environments.
- The Price of Parity: The impact of parity calculations and multi-disk commits used to kill write performance, giving RAID-5 a bad name. But write-back caches and array intelligence have all but eliminated this “write penalty” for modern enterprise systems. (See the sketch after this list for where the penalty comes from.)
- Whenever the high cost of enterprise storage is to be refuted, someone is bound to trot out The Dumb Disk Fallacy, claiming that per-GB array costs ought to be comparable to the price of a bare disk drive. But the value of enterprise storage has always been greater than the sum of its parts.
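As a concrete illustration of where that write penalty comes from, here is a minimal Python sketch of the classic RAID-5 small-write path: updating a single block means reading the old data and old parity, recomputing parity with XOR, and writing both blocks back, four I/Os where a mirror would need only two. The function name and values are purely illustrative and not taken from any particular array implementation.

```python
# Minimal, illustrative sketch of the RAID-5 small-write parity update behind
# the "write penalty": read old data and old parity, XOR in the new data,
# then write new data and new parity back (four I/Os versus two for a mirror).

def raid5_new_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Compute the replacement parity block for a single-block update."""
    assert len(old_data) == len(old_parity) == len(new_data)
    # new_parity = old_parity XOR old_data XOR new_data
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

# One-byte "blocks", just to show the arithmetic
old_data   = bytes([0b10101010])
old_parity = bytes([0b11110000])
new_data   = bytes([0b00001111])
print(bin(raid5_new_parity(old_data, old_parity, new_data)[0]))  # 0b1010101
```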
Over the next few weeks, I will be sharing focused articles about these “sacred cows” of the enterprise storage world. I encourage everyone in the industry to join me in taking a step back and shining some light on these and other truisms. Which do you agree with or dispute? Are there other theories that I have overlooked?
johnmartinoz says
But write-back caches and array intelligence have all but eliminated this “write penalty” for modern enterprise systems. …
I’d say this is mostly true for light enough workloads and large enough write caches. I’ve seen perfectly acceptable response times from RAID-5 and in some cases RAID-6; the trouble is that once you’ve gone past a certain point, or your workload changes, response times can increase rapidly. As a result, almost all spindle-bound benchmarks use RAID-10 (with some notable exceptions).
The old “rules of thumb” of 1:2 overhead for mirrored writes, 1:4 for single-parity writes, and 1:6 for dual-parity writes are all worst case, but they are indicative of the relative penalties for each kind of RAID in spindle-bound random-write workloads. Once you add in any level of sequentiality (present in most workloads), things start looking a lot better for parity-based RAID. In the real world, thanks to lots of diligent engineering, most vendors will do better than these worst cases. For example, approaches like WAFL and ZFS address this neatly with small write caches; on traditional, algorithmically mapped array technologies your best bet is to use the biggest write cache you can get your hands on, or, for those with bigger wallets, to implement solid state and forget all about your per-spindle IOPS worries.
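To put rough numbers behind those ratios, here is a back-of-the-envelope Python sketch. The spindle count and per-spindle IOPS figure are assumptions chosen purely for illustration; as noted above, these are worst-case numbers that caching and sequentiality will improve on.

```python
# Worst-case random-write throughput from the rule-of-thumb write penalties:
# effective IOPS = (spindles * per-spindle IOPS) / penalty, with penalties of
# 2 (mirroring), 4 (single parity), 6 (dual parity). All inputs are assumed
# values for illustration, not measurements of any real array.

SPINDLES = 24          # assumed disks in the RAID group
SPINDLE_IOPS = 180     # assumed random IOPS per spindle

WRITE_PENALTY = {"RAID-10": 2, "RAID-5": 4, "RAID-6": 6}

for level, penalty in WRITE_PENALTY.items():
    effective = SPINDLES * SPINDLE_IOPS / penalty
    print(f"{level}: ~{effective:.0f} random write IOPS (worst case)")
```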
Unfortunately, as memory gets cheaper, disks also get bigger, which means more data on fewer spindles. That tends to kill random-read performance, so most storage admins are forced to tune their cache and make the hard choice between good random-write performance and good random-read performance.
The nice thing about working for NetApp is that we can get outstanding random-write performance with relatively small write caches, and can dedicate up to 2TB of intelligently managed, dedup-aware cache to accelerate random reads. This means our customers can get the best of both worlds.
There are other concerns with RAID-5 that come with exponentially increasing spindle sizes (increasing scrub times, longer reconstructs, etc.), but that’s an entirely different kettle of fish.
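As a rough illustration of the reconstruct side of that kettle of fish, here is a trivial sketch showing how rebuild time scales with drive capacity at a fixed rebuild rate. The sustained rebuild rate is an assumed figure, not a measured or vendor-quoted one, and real rebuild times also depend on array load and RAID geometry.

```python
# At a fixed rebuild rate, reconstruct time grows linearly with drive capacity.
# The 50 MB/s sustained rebuild rate is an assumption for illustration only.

REBUILD_MB_PER_S = 50  # assumed sustained rebuild throughput per drive

for capacity_tb in (0.5, 1, 2, 4):
    seconds = capacity_tb * 1_000_000 / REBUILD_MB_PER_S
    print(f"{capacity_tb} TB drive: ~{seconds / 3600:.1f} hours to rebuild")
```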
Justin Warren says
I look forward to reading the focused articles.
The Dumb Disk Fallacy is always a challenge to explain to business users. Analogies with things they’re more familiar with than enterprise storage arrays seem to work sometimes. Lots of outboard powered tinnies tied together != the QE2.
DGentry says
When you write about the Dumb Disk Fallacy, could you talk about why some large consumers of storage space (Google, Facebook, others) don’t use RAID arrays? They use a bunch of disks, with replication and redundancy handled in higher layers of the software. Is there some capacity threshold above which an enterprise could look seriously at eschewing RAID and doing it themselves, or do you believe the engineers at those internet companies just don’t know when to stop writing code and buy something off the shelf?
Telling an enterprise buyer that the “Dumb Disk Fallacy” is just not true is somewhat unconvincing. Telling the enterprise buyer that you _can_ build a storage system whose price approaches the raw drives, but the development and engineering costs only make sense at a scale several orders of magnitude above their needs, would seem to be more convincing.