...the musings of Matthew Hall
(That is, when RAID 5 isn't the right answer.)
If I've heard one of the following statements once, I've heard them all a hundred times:
Through reading, research, testing and discussion, I present to you the RAID rant. Well, it's less of a rant and more of a collection of useful information which should correct anyone who thinks any of the above statements is true.
I feel it is my responsibility to future generations of sysadmins to ensure that a serious attempt at debunking these common and dangerous myths about RAID be documented somewhere on the Internet for free.
According to independent studies by Google and Carnegie Mellon University, annual failure and replacement rates are alarmingly high for drives, even within their first 3-5 years of operation:
The observed range of AFRs varies from 1.7%, for drives that were in their first year of operation, to over 8.6%, observed in the 3-year old population. - E. Pinheiro, W. Weber and L. A. Barroso, p. 4
Most commonly, the observed ARR values are in the 3% range. - B. Schroeder, G. A. Gibson, p. 7
The astute will observe that Google's tests were performed using "consumer grade" SATA drives. Somewhat surprisingly (to me), the research performed by CMU showed no real difference in replacement rates between SATA, FC and SCSI drives:
It is interesting to observe that for these data sets there is no significant discrepancy between replacement rates for SCSI and FC drives, commonly represented as the most reliable types of disk drives, and SATA drives, frequently described as lower quality. - B. Schroeder, G. A. Gibson, p. 7
We have to understand then, from the start, that drives die. More than that, drives die frequently in the first 5 years of their use. (For the sake of this discussion, we'll assume an optimistic annual failure rate of 3%. We'll also assume that drives in production use are "assumed" to work for 5 years before a diligent sysadmin gets a budget from their CIO to replace them.)
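To put those two assumptions into numbers, here is a rough sketch in Python. It treats the 3% figure as a constant annual failure rate and treats drives as failing independently of one another. Both are simplifications of mine, not claims from the studies (the quoted figures show that real failure rates vary with drive age). The function names and the array sizes are purely illustrative.

```python
def survival_probability(afr: float, years: int) -> float:
    """Chance that a single drive survives `years` at a constant annual failure rate."""
    return (1.0 - afr) ** years


def any_failure_probability(afr: float, years: int, drives: int) -> float:
    """Chance that at least one of `drives` independent drives fails within `years`."""
    return 1.0 - survival_probability(afr, years) ** drives


AFR = 0.03   # the assumed annual failure rate from the text
YEARS = 5    # the assumed service life from the text

# Array sizes below are hypothetical, chosen only for illustration.
for n in (1, 4, 8):
    print(f"{n} drive(s): {any_failure_probability(AFR, YEARS, n):.1%}")
```

Under these assumptions a single drive has roughly a 14% chance of dying within its 5 years of service, and the chance of seeing at least one failure climbs quickly as drives are added to an array.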
In order to begin understanding RAID, we need to define and understand the following 5 technical terms:
Given the four inputs A=1010, B=1100, C=0000, D=0111, we can exclusive OR (XOR) them together to create a parity:
(A:1010 XOR B:1100) = 0110, (0110 XOR C:0000) = 0110, (0110 XOR D:0111) = P:0001. Our parity is 0001.
Should we lose input B, it can be recalculated from the other 3 inputs and the parity:
(A:1010 XOR C:0000) = 1010, (1010 XOR D:0111) = 1101, (1101 XOR P:0001) = B:1100.
Parity takes time to calculate, but it is never larger than any one input and can be used to rebuild a lost or corrupt input (provided only one input is damaged).
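The same arithmetic can be checked in a few lines of Python. This is only a sketch of the XOR idea using the four 4-bit inputs from the example above. The helper names `parity` and `rebuild` are my own, and a real RAID implementation works on whole disk blocks rather than small integers.

```python
from functools import reduce
from operator import xor


def parity(blocks):
    """XOR all blocks together to produce the parity block."""
    return reduce(xor, blocks, 0)


def rebuild(surviving_blocks, parity_block):
    """Recover the single missing block from the survivors plus the parity."""
    return reduce(xor, surviving_blocks, parity_block)


# The four inputs from the worked example.
A, B, C, D = 0b1010, 0b1100, 0b0000, 0b0111

P = parity([A, B, C, D])
print(f"P = {P:04b}")            # prints "P = 0001"

# Pretend B is lost and rebuild it from A, C, D and the parity.
recovered_b = rebuild([A, C, D], P)
print(f"B = {recovered_b:04b}")  # prints "B = 1100"
```

Because XOR is its own inverse, the same operation both creates the parity and rebuilds a missing input. That is also why it only works when exactly one input is lost: with two inputs missing, the XOR of the survivors and the parity yields only the XOR of the two lost values, not either one individually.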