Saturday, August 14, 2010

RAID array failure probabilities

I wrote up something on a forum to give an idea of failure rates for different RAID array types, and it ended up being a pretty good summary.

Question: The storage guys seem to say to avoid RAID5 due to the huge storage sizes now and the high chance of errors during rebuilds.

The reason there would be an error when rebuilding a RAID5 array (leading to data loss) is if a 2nd disk failed (or a sector failed, etc.), and that would be catastrophic to the data. The most harrowing time (when you don't have backups) is during a rebuild of any single-redundancy array (RAID 1, 2, 3, 4 or 5), because a 2nd disk failure will lead to data loss. On bigger disks the rebuild time is longer, so the window for complete data loss is larger. On a hardware RAID controller, this window is (best case) roughly the capacity of a single disk in the array divided by its sequential read speed (or write speed, whichever is lower). Assuming 2 TB disks at 50 MB/sec, a RAID5 rebuild will take 40,000 seconds, or about 11.1 hours. The NV+ takes roughly 20 hours (I know from experience).
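
If you want to sanity-check that window, here is a quick back-of-the-envelope sketch in Python (the 2 TB capacity and 50 MB/sec speed are just the assumptions from above):

    # Best-case rebuild window: one disk's capacity divided by rebuild speed.
    disk_capacity_mb = 2 * 1000 * 1000   # one 2 TB disk, in MB (decimal)
    rebuild_speed = 50                   # MB/sec (slower of sequential read/write)

    rebuild_seconds = disk_capacity_mb / rebuild_speed
    print(rebuild_seconds)               # 40000.0 seconds
    print(rebuild_seconds / 3600)        # ~11.1 hours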

There are ways to mitigate this risk (besides backups, which are always required) by using different RAID levels. For example, RAID6 is becoming the 'new' RAID5. RAID6 uses 2 different parity calculations and stores both, which allows 2 disks to fail without data loss. The trade-off is that you need another disk to make up for the capacity lost to the second parity block. The NV+ doesn't support RAID6 natively, but other products in the line do (the Pro series from Netgear, which have 6 disks instead of 4, cost about $1200, and are x86-based instead of SPARC). It should be technically feasible to write an add-in for the NV+ that supports RAID6 (since the minimum disk count for RAID6 is 4, and the NV+ has 4), but RAID6 is usually for larger disk-count arrays. Also, with a 4-disk count, RAID 10 will give the same amount of usable space but be much faster in a software RAID environment, with only slightly less fault tolerance. (RAID 10 can survive 2 disk failures if they are the 'right' disks, but if the wrong 2 fail, you can still lose data.)

RAID5 is going to be slightly faster than RAID6 for the same-size disk array, because RAID6 performs 2 parity calculations per write while RAID5 performs only 1.

RAID5 is way better than RAID0 from a safety standpoint, and gives more space than RAID1. Let's use a hypothetical situation (since I don't want to look up the actuals, but I will be pretty close; I will also simplify some of the math, since we don't have to be perfect, just decently close). The situation: you have four 2 TB disks with an MTBF of 1M hours and a read/write speed of 50 MB/sec. (Enterprise disks are usually rated at 1.2M hours and home-user disks around 800k hours, so 1M is a nice average, and allows for quicker math I can do in my head.) Using those 4 disks, let's calculate some failure rates, storage capacities and speeds. We must also assume a 'life of disks' to determine the possibility of data loss over the life of the array; let's assume 3 years, since that is most manufacturers' warranty period. Please also note that we are only calculating disk failures, not things like hardware or software failure, which can also destroy an array even without a disk failure (for example, bad RAM can cause garbage to be written to an array). We will also only calculate this for 4 of the major RAID levels: 0, 5, 6, and 10.

RAID0 - Total space is the sum of the disk sizes (2 TB * 4, or 8 TB). Speed will also be the sum of the disks (50 MB/sec * 4, or 200 MB/sec). The possibility of a failure during any given hour is 4 in 1M, since there are 4 disks each with a 1M-hour MTBF, and in RAID0 if any disk fails all data is lost. 3 years works out to 365 * 24 * 4 * 3 hours of disk up-time, which is 105,120 disk-hours total. With an MTBF of 1M hours, you end up with a possibility of data loss during the life of the array of 105,120/1M, or about 10.5%. That's very high, at roughly 1 in 10.
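
The same shortcut math in Python (note this is the simplified 'expected failures over the lifetime' approach used above, not a strict probability):

    disks = 4
    mtbf_hours = 1_000_000       # assumed 1M-hour MTBF per disk
    life_hours = 365 * 24 * 3    # 3 years of up-time per disk

    disk_hours = disks * life_hours      # 105120 total disk-hours
    print(disk_hours / mtbf_hours)       # 0.10512 -> ~10.5% chance of data loss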

RAID10 - Total space is the sum of the disk sizes divided by 2 (2 TB * 4 / 2, or 4 TB). Speed will be the speed of 2 disks (50 MB/sec * 2, or 100 MB/sec). Data will only be lost if the 'wrong' disk fails during a rebuild, so we have to look at the rebuild time (which we calculated above at about 11 hours, but let's call it 10 for 'Evad is lazy' sake). For those 10 hours, there is a 1 in 1M chance per hour of a disk failure, and a further 50% probability that the wrong disk fails. That works out to a 5 in 1M chance during the rebuild window. But you also need to know how often any disk will fail and trigger a rebuild; that was calculated above at 10.5% over the life of 4 disks. So you end up with a 10.5% chance of a rebuild and a 5 in 1M (0.0005%) chance of failure during that rebuild, or 0.105 * 0.000005 = 0.000000525, or 0.0000525%. Many orders of magnitude better than RAID0. The trade-off is that you lose 1/2 of the space.

RAID5 - Total space is the sum of the disk sizes minus 1 disk for parity, or 6 TB. Speed will be the speed of 3 disks (50 MB/sec * 3, or 150 MB/sec). Data will only be lost if a 2nd disk fails during the rebuild, so again we use the 10-hour rebuild window. For those 10 hours, there is a 1 in 1M chance per hour of a disk failure, and this time any 2nd failure is fatal. That works out to a 10 in 1M chance during the rebuild window. Combined with the 10.5% chance of a rebuild over the life of 4 disks, you end up with a 10 in 1M (0.001%) chance of failure during that rebuild, or 0.105 * 0.000010 = 0.00000105, or 0.000105%. Many orders of magnitude better than RAID0, but worse than RAID10. The trade-off is that you have 25% less space than RAID0 and 50% more space than RAID10.

RAID6 - Total space is the sum of the disk sizes minus 2 disks for parity, or 4 TB (same as RAID10). Speed will be the speed of 2 disks (50 MB/sec * 2, or 100 MB/sec). Data will only be lost if a 3rd disk fails during the rebuild, which is harder to model than RAID5. For the 10-hour rebuild window, there is a 1 in 1M chance per hour of a single disk failure, but 2 more disks need to fail to lose data. That works out to roughly 1/1M * 1/1M during the rebuild window (0.000001 * 0.000001 = 0.000000000001). Combined with the 10.5% chance of a rebuild over the life of 4 disks, you end up with a 0.0000000001% chance of another 2 disks failing during that rebuild, or 0.105 * 0.000000000001 = 0.0000000000000105, or 0.00000000000105%. Many orders of magnitude better than RAID0, RAID10 and RAID5. The trade-off is that you spend 2 disks on parity information, so only 50% of the raw space is usable. RAID6 is also not supported on all devices.
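
Here is the same simplified model for all three redundant levels in one Python sketch (10-hour rebuild, 1-in-1M per-hour failure chance, and the 10.5% lifetime chance that a rebuild happens at all):

    p_rebuild = 0.105         # lifetime chance some disk fails and triggers a rebuild
    p_hour = 1 / 1_000_000    # per-hour failure chance from the 1M-hour MTBF
    rebuild_hours = 10        # rounded down from ~11, as above

    p_raid10 = p_rebuild * (rebuild_hours * p_hour * 0.5)  # 50% it's the 'wrong' disk
    p_raid5  = p_rebuild * (rebuild_hours * p_hour)        # any 2nd failure is fatal
    p_raid6  = p_rebuild * (p_hour * p_hour)               # 2 more disks must both fail

    print(p_raid10)   # ~5.25e-07 -> 0.0000525%
    print(p_raid5)    # ~1.05e-06 -> 0.000105%
    print(p_raid6)    # ~1.05e-13 -> 0.00000000000105%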

In terms of space, here is a summary:
RAID0 - 8 TB
RAID5 - 6 TB
RAID6 - 4 TB
RAID10 - 4 TB

In terms of probability of failure over the life of the array (3 years):
RAID0 - 10.5%
RAID5 - 0.000105%
RAID10 - 0.0000525%
RAID6 - 0.00000000000105%

In terms of speed:
RAID0 - 200 MB/sec
RAID5 - 150 MB/sec
RAID10 - 100 MB/sec
RAID6 - 100 MB/sec

In terms of dollars per usable TB (assume $100 per 2 TB hard drive):
RAID0 - $400 / 8 TB = $50 per TB
RAID5 - $400 / 6 TB = $67 per TB ($17 more per TB than RAID0 to drop the failure rate from 10.5% to 0.000105%)
RAID6 - $400 / 4 TB = $100 per TB ($33 more per TB than RAID5 to drop the failure rate from 0.000105% to 0.00000000000105%)
RAID10 - $400 / 4 TB = $100 per TB (usually only chosen if small random writes are an issue, such as for a database, or if the controller doesn't support RAID6 and RAID5 is too high-risk)

Very long story short: there are trade-offs with each RAID level. You need to choose the one that is right for you based on your specific needs, budget and risk tolerance. For the majority of home users with a 4-disk NAS, RAID5 provides a good balance of fault tolerance and space per dollar spent. If we were to redo this with 6 or more bays, then RAID6 becomes a more attractive choice, because the probability of a disk failure grows with the number of disks in the array, and the cost starts approaching RAID5 levels. For example, the cost per TB of a 20-disk RAID5 array vs. a 20-disk RAID6 array using 2 TB disks is $52.63 vs. $55.56.
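
A tiny Python helper reproduces both the 4-disk dollars-per-TB table and the 20-disk comparison (same $100-per-2-TB-disk assumption):

    def cost_per_tb(disks, redundant_disks, tb_per_disk=2, price_per_disk=100):
        # Usable space = total disks minus those spent on parity/mirroring.
        usable_tb = (disks - redundant_disks) * tb_per_disk
        return disks * price_per_disk / usable_tb

    print(cost_per_tb(4, 0))    # RAID0:  50.0   ($50 per TB)
    print(cost_per_tb(4, 1))    # RAID5:  ~66.67 ($67 per TB)
    print(cost_per_tb(4, 2))    # RAID6/RAID10: 100.0 ($100 per TB)
    print(cost_per_tb(20, 1))   # 20-disk RAID5: ~52.63
    print(cost_per_tb(20, 2))   # 20-disk RAID6: ~55.56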

There are many more things to consider in different environments, such as stripe size (which can waste space and affect speed), the load on the array (high disk-count parity arrays are bad for random-write workloads like databases because of the write-hole, while for media streaming parity arrays are fine), and options on the controller (such as OCE or ORM, and having a dedicated XOR processor). We also can't forget the fabled 'sympathy failure' of disks, which may or may not increase the odds of concurrent disk failures, if you 'believe' in it. The write-hole in RAID5 can also be solved by using something like RAID-Z.

Cliffs:
RAID5 is fine for home users and way better than a bunch of single disks.

3 comments:

  1. You are using MTBF as a yardstick here, and that is incorrect. The base probability you should be plugging into your analysis is that of a URE - an unrecoverable read error. A 2nd drive does not have to fail permanently to cause RAID5 to fail; it only has to suffer a URE during the array rebuild. RAID5 is broken for large disks, because the probability of a URE that would prevent a rebuild is around 1 bit in 12 TB.


    Read this for the details; it's from back in 2007.
    http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162

  2. Thanks for pointing that out, John. If you want to be really specific, all RAID is 'broken' with respect to UREs: RAID does not protect against a URE at any level. Even RAID 1 with two 2 TB drives will have about a 1 in 6 chance of a URE. A URE is almost solely dependent upon array size, so the failure rate is within an order of magnitude no matter which RAID level is chosen for a given count of same-size disks. Four 2 TB disks will have a URE rate of between ~25% and ~50% in RAID 1, 5, 6 or 10. RAID just isn't designed to address UREs; it is designed to protect against major disk failures (whole disk, sector, etc.), not single-bit errors.

    Depending on the controller, a URE will not necessarily result in a complete rebuild failure either. Most controllers will prompt you on how to address the URE: you can wipe the array, rebuild all the data except that sector, cluster or stripe, or ignore the bad bit (the file containing that bit will generally be read as corrupt by the OS at that point), among other options.

    There are 3 ways I can think of off the top of my head to deal with UREs: store less data, use disks with lower URE rates (AKA enterprise-level disks), or use a different redundancy process than RAID. For example, ZFS will help greatly against cosmic radiation, random bit-flipping, and UREs. Google solves for UREs by keeping 3 or more copies of the same data on different physical servers with no RAID (GFS), and Google gets away with consumer-level drives that have higher URE rates.
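
    If you want to put rough numbers on the URE risk yourself, here is a quick Python sketch, assuming the typical consumer-drive spec of 1 unrecoverable read error per 10^14 bits read:

        URE_PER_BIT = 1e-14   # typical consumer-drive URE spec (assumption)

        def p_ure(tb_read):
            # Chance of hitting at least one URE while reading tb_read terabytes.
            bits = tb_read * 1e12 * 8
            return 1 - (1 - URE_PER_BIT) ** bits

        print(p_ure(2))   # ~0.15: reading one 2 TB disk (the 'about 1 in 6' above)
        print(p_ure(6))   # ~0.38: a 4-disk RAID5 rebuild reads the 3 surviving disks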

  3. You've got your probabilities wrong. The probability of failure for RAID 0, 5 and 6 is easily calculated using the binomial probability distribution. Note that because a URE during a rebuild may be perceived as a disk failure by some hardware/software RAIDs, the probability of an array failing is even higher. So the following probabilities assume no UREs, meaning they are the lowest possible probabilities of failure for a given array.

    If the probability of any single drive failing = p, array size (i.e. no. of disks) = n, and number of drives that fail simultaneously = X, then:

    Pr(X) = C(n, X) * p^X * (1 - p)^(n - X), where C(n, X) is the binomial coefficient ('n choose X')

    For p=0.03, n=4 we have

    Pr(X) = C(4, X) * 0.03^X * 0.97^(4 - X)

    So:
    X      0           1           2           3           4
    Pr(X)  0.88529281  0.10952076  0.00508086  0.00010476  0.00000081

    For RAID 0, the array fails when X>=1, so Pr(RAID0 failure) = 0.10952076 + 0.00508086 + 0.00010476 + 0.00000081 = 0.11470719 ~ 1 in 9.

    For RAID 5, the array fails when X>=2, so Pr(RAID5 failure) = 0.00508086 + 0.00010476 + 0.00000081 = 0.00518643 ~ 1 in 193.

    For RAID 6, the array fails when X>=3, so Pr(RAID6 failure) = 0.00010476 + 0.00000081 = 0.00010557 ~ 1 in 9472.

    So for an array 4 disks in size, 1/9 RAID 0 arrays fail, 1/193 RAID 5 arrays fail, and 1/9472 RAID 6 arrays fail.

    Similarly, for an array 6 disks in size, 1/6 RAID 0 arrays fail, 1/80 RAID 5 arrays fail, and 1/1982 RAID 6 arrays fail.

    Also, for an array 24 disks in size, 1/2 RAID 0 arrays fail, 1/6 RAID 5 arrays fail, and 1/29 RAID 6 arrays fail.
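
    For anyone who wants to reproduce these figures, the same binomial tail takes only a few lines of Python:

        from math import comb

        def p_fail(n, p, tolerated):
            # The array dies when more than `tolerated` disks fail,
            # so sum the tail of the binomial distribution.
            return sum(comb(n, x) * p**x * (1 - p)**(n - x)
                       for x in range(tolerated + 1, n + 1))

        p = 0.03
        for n in (4, 6, 24):
            print(n, [round(1 / p_fail(n, p, t)) for t in (0, 1, 2)])
        # 4  -> [9, 193, 9472]   (RAID 0, RAID 5, RAID 6)
        # 6  -> [6, 80, 1982]
        # 24 -> [2, 6, 29]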

    CONCLUSION

    This shows just how important it is to back up critical data. Note that these probabilities are based on the mean probability of a single disk failing being 0.03, which in turn is based on samples taken from data centres. Disks in data centres are under a heavier workload AND a more stressful environment than disks in consumer PCs: their entire purpose is to store and serve large amounts of data to many people, 24 hours a day, and large arrays are concentrated into a small space and subject to intense cooling and much more vibration. These all increase the probability of failure. Thus, the mean probability of a single disk failing in a consumer PC will be lower than 0.03 (possibly by orders of magnitude).

    Various hard disk failure surveys (source: http://lwn.net/Articles/237924/) show that low temperatures, increased vibration, using the same batch of disks, using disks older than 3 years, and higher workloads all increase the probability of failure. So to minimize failure: do not cool your drives excessively, use anti-vibration mounts, mix your disks, use disks 3 years old or less, and keep the workload low.
