System News
The Clock Is Running for RAID-6
Is Triple-parity RAID the Answer?
December 31, 2009
Volume 142, Issue 5

It is ... time to create a new RAID level to accommodate the realities of disk reliability, capacity, and throughput merely to maintain [an adequate] level of data protection.

-- Adam Leventhal
 

"How much longer will current RAID techniques persevere?" asks Adam Leventhal of Sun Fishworks in his paper for the Association for Computing Machinery entitled Triple Parity RAID and Beyond. In the course of answering his question, the author examines RAID and its history, its rate of capacity growth in the hard-drive industry, and the need for triple-parity RAID as a response to diminishing reliability.

Given the rate of capacity increase in hard drives, Leventhal argues, the reliability of RAID-6 systems is being seriously taxed, making it inevitable that triple-parity RAID will soon become pervasive. He observes that rebuilding a high-capacity drive in a RAID group now takes long enough that the group effectively runs with one less level of redundancy while the repair is under way.

Bit error rates have improved by about two orders of magnitude while disk capacity has increased by slightly more than two orders of magnitude, Leventhal continues, doubling about every two years and nearly following Kryder's law. Today, a RAID group with 10 TB (nearly 20 billion sectors) is commonplace, and the typical bit error rate stands at one in 10^16 bits. At that rate, the chance of reading every sector in the group without hitting an uncorrectable error is about 99.2 percent, so 0.8 percent of disk failures can be expected to result in data loss due to an uncorrectable bit error. And while bit error rates have nearly kept pace with the growth in disk capacity, throughput has not been given its due consideration when determining RAID reliability, he asserts.
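The 99.2 percent figure can be reproduced with a short calculation. The sketch below is illustrative only: it assumes decimal terabytes, 512-byte sectors, and independent per-sector errors, none of which the article states outright (the sector size is simply what makes 10 TB come out at nearly 20 billion sectors).

```python
import math

BITS_PER_ERROR = 10**16      # one uncorrectable error per 10^16 bits read
GROUP_BYTES = 10 * 10**12    # 10 TB in decimal units (assumption)
SECTOR_BYTES = 512           # sector size (assumption)


def clean_read_probability(total_bytes: int, sector_bytes: int,
                           bits_per_error: int) -> float:
    """Probability that every sector is read without an uncorrectable error.

    Assumes each sector fails independently with probability
    (bits per sector) / bits_per_error.
    """
    sectors = total_bytes / sector_bytes
    per_sector_error = (sector_bytes * 8) / bits_per_error
    # log1p keeps precision when the per-sector error probability is tiny.
    return math.exp(sectors * math.log1p(-per_sector_error))


p_clean = clean_read_probability(GROUP_BYTES, SECTOR_BYTES, BITS_PER_ERROR)
print(f"clean read: {p_clean:.1%}, data loss: {1 - p_clean:.1%}")
```

Under these assumptions the script reports a clean read in 99.2% of cases and data loss in 0.8%, matching the numbers quoted from the paper.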

Capacity has increased steadily and significantly, the author contends, and the bit error rate has improved at nearly the same pace. Hard-drive throughput, however, has lagged behind significantly. When RAID systems were developed in the 1980s and 1990s, reconstruction times were measured in minutes. The trend for the past 10 years is quite clear regardless of the drive speed or its market segment: the time to perform a RAID reconstruction is increasing exponentially as capacity far outstrips throughput, Leventhal writes.

"At the extreme, rebuilding a fully populated 2-TB 7200-RPM SATA disk—today's capacity champ—after a failure would take four hours operating at the theoretical optimal throughput. It is rare to achieve those data rates in practice; in the context of a heavily used system the full bandwidth can't be dedicated exclusively to RAID repair without adversely affecting performance. If one assumes that only 10-50 percent of the total system throughput is available for reconstruction, the minutes-long RAID rebuild times of the 1990s balloon to multiple hours or days in practice. RAID systems operate in this degraded state for far longer than they once did and as a consequence are at higher risk for data loss," the author points out.

Furthermore, latent data on hard drives can acquire defects over time—a process blithely referred to as bit rot, according to Leventhal. To mitigate this, RAID systems typically perform background scrubbing in which data is read, verified, and corrected as needed to eradicate correctable failures before they become uncorrectable. The process of scrubbing data necessarily impacts system performance, but the time required for a full scrub is a significant component of the reliability of the total system. A natural tension results between how priorities are assigned to scrubbing versus other system activity. As throughput is dwarfed by capacity, either the percentage of resources dedicated to scrubbing must increase, or the time for a complete scrub must increase, he deduces. A full scrub could conceivably take weeks or even months.

The same shortfall that led RAID-6 to succeed RAID-5 is now overtaking the successor, Leventhal writes, adding that in about 10 years, RAID-6 will provide only the level of protection that we get from RAID-5 today. "It is again time to create a new RAID level to accommodate the realities of disk reliability, capacity, and throughput merely to maintain that same level of data protection."

Looking to the future, then, the author suggests that there is an impending but not yet urgent need for triple-parity RAID. The addition of another level of parity mitigates increasing RAID rebuild times and occurrences of latent data errors. Triple-parity RAID will address the shortcomings of RAID-6 for years, he assures readers. The reliability is largely independent of the specific implementation of triple-parity RAID; a general Reed-Solomon method suffices for the analysis.

He adds that there is a need not only for triple-parity RAID but also for efficient algorithms that truly address the general case of RAID with an arbitrary number of parity devices.
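As a rough illustration of what a Reed-Solomon style scheme with several parity devices looks like, the sketch below computes parity blocks over GF(2^8) and rebuilds one lost data block from any single parity block. The field polynomial (0x11D), the coefficients (powers of 1, 2, and 4), and all function names are assumptions chosen for the example, not details from the article or the paper. It deliberately does not cover reconstruction after two or three simultaneous failures, and it makes no claim about which coefficient choices stay recoverable for an arbitrary number of parity devices.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by polynomial 0x11D (assumption)."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return product


def gf_pow(a: int, n: int) -> int:
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def parity_blocks(data_blocks: list[bytes], num_parity: int) -> list[bytes]:
    """Parity j is the GF(2^8) sum of (2**j)**i * data_blocks[i] over all i."""
    length = len(data_blocks[0])
    parities = []
    for j in range(num_parity):
        generator = gf_pow(2, j)          # 1, 2, 4, ...
        block = bytearray(length)
        for i, data in enumerate(data_blocks):
            coeff = gf_pow(generator, i)
            for k, byte in enumerate(data):
                block[k] ^= gf_mul(coeff, byte)
        parities.append(bytes(block))
    return parities


def recover_one(data_blocks: list[bytes], lost: int, parity: bytes, j: int) -> bytes:
    """Rebuild data block `lost` from parity j and the surviving data blocks."""
    generator = gf_pow(2, j)
    block = bytearray(parity)
    for i, data in enumerate(data_blocks):
        if i == lost:
            continue                      # contents of the lost block are never read
        coeff = gf_pow(generator, i)
        for k, byte in enumerate(data):
            block[k] ^= gf_mul(coeff, byte)
    # For nonzero x in GF(2^8), x**254 is the multiplicative inverse of x.
    inverse = gf_pow(gf_pow(generator, lost), 254)
    return bytes(gf_mul(inverse, byte) for byte in block)


data = [b"RAID", b"six!", b"next"]
p, q, r = parity_blocks(data, 3)

damaged = list(data)
damaged[1] = bytes(4)                     # simulate losing the second device
restored = [recover_one(damaged, 1, parity, j) for j, parity in enumerate((p, q, r))]
print(all(block == data[1] for block in restored))
```

The first parity block uses a coefficient of 1 throughout, so it reduces to the plain XOR parity familiar from RAID-5; the second and third add independent equations, which is what lets a triple-parity group tolerate more failures than a single XOR could.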

"For the same reasons that make triple-parity RAID necessary where RAID-6 had sufficed, three-way mirroring will displace two-way mirroring for applications that require both high performance and strong data reliability. Indeed, four-way mirroring may not be far off, since even three-way mirroring is effectively a degenerate, but more reliable, form of RAID-6, and will be susceptible to the same failings." Leventhal predicts.

The author concedes that too little is known at the moment about the future of flash drives and their implications for disk technology. Even flash devices can suffer catastrophic failures, though, so RAID will likely have a role to play. How large a role is, for the moment, an unanswerable question.

If the pace of solid-state caching and buffering decouples system performance from component hard drive performance, Leventhal conjectures, hard-drive manufacturers would be able to increase capacity even more quickly, unhindered by performance requirements, while likely slowing the rate of throughput increases. Further, divorced from performance, RAID stripes could grow very wide to optimize for absolute capacity; this would reduce the reliability further with the same amount of parity protecting more data. In this scenario, the need for triple-parity RAID would be made all the more urgent by accelerating current trends, Leventhal concludes.

"If Kryder's law continues to hold, the burden of correctness will increasingly shift from the hard-drive manufacturers to the RAID systems that integrate them. Today, RAID reconstruction times factor into reliability calculations more than ever before, and their contribution will increasingly dominate. Triple-parity RAID will soon be critical to provide sufficient reliability even in the face of exponential growth," he predicts.



