Traditional RAID is designed to protect whole disks with block-level redundancy. An array of disks is treated as a RAID group, or protection domain, that can tolerate one or more failures and still recover a failed disk from the redundancy encoded on the other drives. RAID recovery requires reading all the surviving blocks on the other disks in the RAID group to recompute the blocks lost on the failed disk. While disk capacity has increased by 40% to 100% per year, disk bandwidth has grown far more slowly.
Comparing models of drives from various vintages of Panasas blades, the bandwidth achieved when writing zeros over the whole drive, beginning to end, ranges from 28 MB/sec to 96 MB/sec, which is a 3.4 times increase. However, the capacity increased from 80 GB to 2000 GB, or 25 times. The end-to-end completion time increased from under an hour (48 min) to over 5 and a half hours (347 min). This trend started with the first generations of disks. When RAID was invented, it took just a few minutes to read the entire drive. The following charts illustrate these trends.
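The figures quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 GB = 1000 MB) and a constant sequential rate across the whole drive; the function name is illustrative only.

```python
def full_drive_minutes(capacity_gb: float, mb_per_sec: float) -> float:
    """Minutes needed to stream over an entire drive at a constant rate."""
    capacity_mb = capacity_gb * 1000  # decimal units assumed
    return capacity_mb / mb_per_sec / 60


old_minutes = full_drive_minutes(80, 28)
new_minutes = full_drive_minutes(2000, 96)

print(f"80 GB at 28 MB/sec:   {old_minutes:.0f} min")
print(f"2000 GB at 96 MB/sec: {new_minutes:.0f} min")
print(f"bandwidth grew {96 / 28:.1f}x, capacity grew {2000 / 80:.0f}x")
```

Because capacity grew roughly seven times faster than bandwidth, the time to touch every sector grew by about the same factor.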
The other dramatic change in disk drives is the absolute number of unreadable sectors on a "good" drive. Drives have long had spare sectors and means to map around media defects that cause unreliable recording on particular sectors. With today's high capacity drives, there are over 2000 spare sectors reserved for this kind of remapping, and it is relatively common that they will encounter one or more unreadable sectors.
"Back in the day," most good drives never experienced a media defect. And, if they did, the RAID controller could afford to sniff out media errors on a periodic basis, reconstruct the contents of the lost sector from other drives, and then re-write the sector to trigger the remapping by the drive firmware. However, it is now much more expensive to proactively detect these errors. Reading the complete drive even once a week consumes a substantial share of the system bandwidth: 3.4% of the week for our 2 TB, 96 MB/sec drive, plus another 0.2% of the weekly disk time for the added seeks caused by scanning. The net result is that drives have their media scanned on a weekly basis today, compared with daily or even hourly in early RAID systems. This increases the chances that media defects are not proactively discovered.
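The scanning cost can be worked through as follows. This sketch assumes decimal units and a constant sequential scan rate, and it ignores the extra seek overhead; the function names are hypothetical. It also shows how often an older 80 GB, 28 MB/sec drive could be scanned for the same share of drive time.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def scan_seconds(capacity_gb: float, mb_per_sec: float) -> float:
    """Seconds to read the whole drive once (decimal units assumed)."""
    return capacity_gb * 1000 / mb_per_sec


def weekly_scan_fraction(capacity_gb: float, mb_per_sec: float) -> float:
    """Share of a week spent on one full media scan."""
    return scan_seconds(capacity_gb, mb_per_sec) / SECONDS_PER_WEEK


def scan_interval_hours(capacity_gb: float, mb_per_sec: float, budget: float) -> float:
    """Hours between full scans when scanning may use `budget` of drive time."""
    return scan_seconds(capacity_gb, mb_per_sec) / budget / 3600


budget = weekly_scan_fraction(2000, 96)
print(f"weekly scan of 2 TB drive: {budget:.1%} of drive time")
print(f"same budget, 80 GB drive: one scan every "
      f"{scan_interval_hours(80, 28, budget):.0f} hours")
```

Under these assumptions, the budget that buys a weekly scan of the large drive would have bought roughly a daily scan of the small one.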
The worst time to get an unreadable sector is during a RAID rebuild. At that time, the system is already running with reduced redundancy, and the unreadable sector represents a second (or third) failure that may not be handled by the RAID encoding. So, we have a confluence of technology trends that make data protection more vulnerable. Capacity increases mean longer rebuild times, so a second failure is more likely to occur during the recovery. More sector errors mean there are more kinds of failures that can crop up during that window. In fact, because many drives need to be read during a RAID rebuild, the risk of encountering an unreadable sector is very high.
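A rough estimate shows why reading many drives makes an unreadable sector likely. The sketch below is illustrative only: it assumes an unrecoverable read error rate of one per 10^14 bits read (a value often quoted on drive data sheets, not a figure from this paper) and treats errors as independent, which real drives do not strictly obey.

```python
import math


def p_unreadable_sector(drives_read: int, capacity_tb: float,
                        bit_error_rate: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error while reading
    `drives_read` full drives, assuming independent errors per bit."""
    bits = drives_read * capacity_tb * 1e12 * 8
    # 1 - (1 - rate)**bits, computed in a numerically stable way
    return -math.expm1(bits * math.log1p(-bit_error_rate))


for n in (1, 4, 10):
    print(f"{n:2d} drives of 2 TB read in full: "
          f"{p_unreadable_sector(n, 2):.0%} chance of an unreadable sector")
```

The exact numbers depend entirely on the assumed error rate, but the shape of the result holds: the probability climbs quickly with the amount of data a rebuild must read.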
Enter object RAID. With object RAID, data protection occurs at the file level. The Panasas system integrates the file system and data protection to provide novel, robust data protection for the file system. Each file is divided into chunks that are stored in different objects on different storage devices (OSDs). File data is written into those container objects using a RAID algorithm to produce redundant data specific to that file. If any object is damaged for whatever reason, the system can recompute the lost object(s) using the redundant information in the other objects that store the rest of the file.
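The per-file idea can be illustrated with a small sketch. It assumes the simplest case, a single XOR parity object per file (RAID-5 style), and keeps the objects as in-memory byte strings; it is not the Panasas on-disk format, and all names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_file(data: bytes, n_data: int):
    """Split one file into n_data equal chunks plus one XOR parity chunk.
    Each returned element stands for an object on a different device."""
    chunk_len = -(-len(data) // n_data)  # ceiling division
    padded = data.ljust(chunk_len * n_data, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n_data)]
    return chunks + [xor_blocks(chunks)]


def recover(objects, lost_index: int) -> bytes:
    """Recompute one lost object from the surviving objects of the same file."""
    survivors = [obj for i, obj in enumerate(objects) if i != lost_index]
    return xor_blocks(survivors)


data = b"object RAID protects each file"
objects = stripe_file(data, n_data=4)
assert recover(objects, lost_index=2) == objects[2]
```

Only the objects belonging to this one file are read to repair it; no other file's data is involved.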
So far the math and the problems are about the same between the two approaches. Let's add declustered RAID, which spreads out the RAID groups. It is easier to envision with a simple RAID scheme that just mirrors a file into two different objects, but the technique applies to any RAID level. Suppose we have a collection of 100 storage devices. We could have 50 mirror pairs using the traditional block-RAID approach. Or, we could vary the mirror location on a per-file basis, with the result that when we lose a storage device, the surviving mirror copies are distributed across the 99 other storage devices, not just one. We now have 99 times more disk bandwidth to read the surviving mirrors. Furthermore, we only have to read about 1/99 of each other device. The system is still going to read one disk's worth of mirrored information, but it is going to do it 99 times faster.
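A small simulation makes the contrast concrete. It is a sketch under simple assumptions: equal-sized files, uniformly random placement for the declustered case, and fixed pairs (0,1), (2,3), ... for the traditional case. The function names are hypothetical.

```python
import random
from collections import Counter


def place_declustered(n_files, n_devices, rng):
    """Choose two distinct devices independently for each file."""
    return [tuple(rng.sample(range(n_devices), 2)) for _ in range(n_files)]


def place_fixed_pairs(n_files, n_devices, rng):
    """Traditional mirroring: every file lands on one of the fixed pairs."""
    pairs = [rng.randrange(n_devices // 2) for _ in range(n_files)]
    return [(2 * p, 2 * p + 1) for p in pairs]


def rebuild_sources(placements, failed):
    """Per surviving device, count the files whose mirror must be read from it."""
    sources = Counter()
    for a, b in placements:
        if a == failed:
            sources[b] += 1
        elif b == failed:
            sources[a] += 1
    return sources


rng = random.Random(0)
for name, place in (("fixed pairs", place_fixed_pairs),
                    ("declustered", place_declustered)):
    sources = rebuild_sources(place(100_000, 100, rng), failed=0)
    print(f"{name}: {sum(sources.values())} files read from "
          f"{len(sources)} devices, at most {max(sources.values())} per device")
```

Both layouts must read about the same number of files, but the fixed-pair layout reads all of them from a single device, while the declustered layout spreads the reads thinly over nearly every survivor.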
The next trick is to rebuild different files concurrently. Because all the storage devices, such as the Panasas storage blades, are on the network, many different computers can cooperate to rebuild different subsets of the files that require reconstruction. In a Panasas® ActiveStor™ system, the director blades perform this function, and there is typically one director blade for every 10 storage blades. The director blades have a faster network (10GbE) and a faster processor, and can rebuild data at about 30 MB/sec, which requires reading at about 300 MB/sec. This is similar to the rebuild performance of a fast RAID controller. However, our system can employ many of these concurrently on the same RAID rebuild. The overall effect of declustered, parallel rebuild is a very high rebuild rate for Panasas systems.
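To see what concurrency buys, the following sketch estimates the rebuild time for one 2 TB device. It uses the 30 MB/sec per-director figure from the text and assumes, as an idealization, that the aggregate rate scales linearly with the number of directors and that the whole device was full.

```python
def rebuild_hours(lost_gb: float, directors: int,
                  mb_per_sec_each: float = 30) -> float:
    """Hours to rebuild `lost_gb` of data when `directors` rebuild engines
    work in parallel (linear scaling is an idealized assumption)."""
    aggregate_mb_per_sec = directors * mb_per_sec_each
    return lost_gb * 1000 / aggregate_mb_per_sec / 3600


for directors in (1, 5, 10):
    print(f"{directors:2d} director(s): "
          f"{rebuild_hours(2000, directors):.1f} hours for 2 TB")
```

With one director per 10 storage blades, a 100-blade system would have about 10 directors available, which under these assumptions cuts the rebuild from most of a day to under two hours.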
The original intent of declustering was to reduce the impact of rebuild on on-line activity. This benefit still holds. The idea is that the RAID workload is now diffused across many storage devices instead of concentrated on a few. In a large parallel environment, the system runs at the rate of the slowest element. In this case it is worse to have a subset of devices running in degraded mode for a long period than to have a large number (i.e., all) of the devices running in degraded mode for less time. Indeed, with the long rebuilds found in today's traditional RAID systems, a system with a large number of arrays will have one of them under reconstruction a high percentage of the time. Furthermore, the traditional way to reduce the impact of a rebuild is to run it slower, which obviously increases the rebuild time. That also increases the chance of an additional failure.
Finally, the fault domain for object RAID is one file. If events conspire to create too many failures for the redundancy to protect, the loss is on a file-by-file basis. The system can identify precisely which files are affected by a failure. In contrast, traditional RAID systems leave behind unreadable locations if a RAID rebuild fails due to an unrecoverable (e.g., media) failure. It is up to the higher-level file system or database to isolate those failures, which may or may not be feasible because of the lack of communication between the RAID layer and the file system.
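Because each file carries its own redundancy, working out what was lost is a per-file check. The sketch below assumes a hypothetical layout map from file name to the devices holding that file's objects, and a fixed number of tolerated losses per file; neither reflects an actual Panasas data structure.

```python
def affected_files(layout, failed_devices, tolerated=1):
    """Return the files that lost more objects than their redundancy covers.

    layout: dict mapping file name -> devices holding that file's objects
    tolerated: object losses each file can survive (1 for single parity)
    """
    failed = set(failed_devices)
    return sorted(name for name, devices in layout.items()
                  if len(failed & set(devices)) > tolerated)


layout = {
    "results.dat": [0, 1, 2],
    "input.cfg": [3, 4, 5],
    "checkpoint.bin": [0, 4, 6],
}
print(affected_files(layout, failed_devices=[0, 1]))
```

In this example only the file with two objects on the failed devices is reported; the file that lost a single object remains recoverable, and the third is untouched.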
The problems with traditional block-based RAID will continue to get worse as disk technology scales as we expect. Drive vendors are still delivering additional capacity each year, but with a smaller increase in disk bandwidth. The time to read a disk continues to grow. Rebuild times continue to grow. More media defects will be encountered on larger drives. Data protection schemes must evolve to declustered, parallel approaches to provide the reliability and performance that are expected of today's systems. The object-RAID approach, which declusters RAID protection groups and allows concurrent rebuilds, provides the fundamental properties needed to keep data protection scaling with disk capacity.