Free is Not Always Free – Or, When Compute Clusters Are Held Hostage by Storage
August 31, 2011 - 6:29pm
Storage accounts for about 20% of the available budget in a typical HPC system, with another 20% spent on software, 20% on services, and 40% on the compute cluster, according to market intelligence firm Intersect360 Research (WW HPC Total Market Model). The majority of system design time and effort is spent validating the performance of the compute cluster, with less attention paid to storage. But not all storage is the same. All too frequently, the investment and effort that go into building the best compute cluster yield disappointing results because storage was not given sufficient due diligence. If not fully vetted for scalability, manageability and uptime, storage can drag down even the best high performance computing (HPC) system and hold the entire compute cluster hostage.
HPC storage issues typically fall into three buckets: performance scaling, RAID bottlenecks, and system downtime. First, traditional NAS systems built around a NAS filer head do not scale performance well as the storage system grows, because I/O passes through the filer heads in a highly serial fashion. Panasas® ActiveStor™ storage appliances, on the other hand, are designed to deliver very high parallel performance that scales from 1.5GB/s on a 60 terabyte appliance all the way up to 150GB/s on 6 petabytes of capacity, a hundredfold increase in both capacity and throughput.
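To make the scaling claim concrete, here is a minimal Python sketch that projects throughput from capacity. It uses only the two endpoints quoted above and assumes that throughput grows linearly with capacity between them and that 1 petabyte equals 1,000 terabytes; the function name `projected_throughput_gb_s` is hypothetical and does not come from any Panasas tool.

```python
# Endpoint figures quoted in the text for the smallest configuration.
BASE_CAPACITY_TB = 60
BASE_THROUGHPUT_GB_S = 1.5


def projected_throughput_gb_s(capacity_tb: float) -> float:
    """Project throughput for a given capacity.

    Assumption: throughput scales linearly with capacity, which is
    consistent with the two quoted endpoints but is not stated as a
    guarantee anywhere in the text.
    """
    return BASE_THROUGHPUT_GB_S * (capacity_tb / BASE_CAPACITY_TB)


if __name__ == "__main__":
    for capacity_tb in (60, 600, 6000):  # 6000 TB = 6 PB (decimal units)
        print(capacity_tb, "TB ->", projected_throughput_gb_s(capacity_tb), "GB/s")
```

Under that linear assumption, the 6 petabyte configuration lands on the 150GB/s figure quoted above; intermediate values are projections, not measured results.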
Second, many HPC storage vendors have tried to get around the performance scaling problem by using a free, open source parallel file system like Lustre and then running the software on top of a traditional RAID storage array. The storage array provides data protection by applying RAID to the data as it is being written to disk, again in a highly serial fashion. System performance is then limited by how fast the RAID controller can process RAID calculations and write the results out to the disk drives. The best per-disk performance that has been achieved from an external RAID controller, according to publicly available benchmarks, is 35MB/s per disk. By contrast, the Panasas PanFS™ parallel file system does all its RAID calculations in software and achieves over 80MB/s per disk drive, more than twice the performance. The parallelism of the data path means that compute nodes can write directly to disk without having to go through any filer head or RAID controller.
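The per-disk figures above translate directly into drive counts. The following Python sketch estimates how many drives each approach would need to reach a target aggregate throughput. It assumes aggregate throughput is simply the per-disk rate multiplied by the number of disks and that 1GB/s equals 1,000MB/s; real systems have other limits, and the helper name `disks_needed` is hypothetical.

```python
import math

# Per-disk rates quoted in the text.
RAID_CONTROLLER_MB_S = 35  # best published external RAID controller result
SOFTWARE_RAID_MB_S = 80    # PanFS software RAID (quoted as "over 80MB/s")


def disks_needed(target_gb_s: float, per_disk_mb_s: float) -> int:
    """Return the minimum number of disks needed to reach the target.

    Assumption: aggregate throughput = per-disk rate * number of disks,
    ignoring network, controller, and protocol overheads.
    """
    return math.ceil(target_gb_s * 1000 / per_disk_mb_s)


if __name__ == "__main__":
    target = 1.5  # GB/s, the smallest configuration mentioned in the text
    print("external RAID controller:", disks_needed(target, RAID_CONTROLLER_MB_S))
    print("software RAID:", disks_needed(target, SOFTWARE_RAID_MB_S))
    print("per-disk ratio:", round(SOFTWARE_RAID_MB_S / RAID_CONTROLLER_MB_S, 2))
```

With these inputs, the hardware RAID path needs more than twice as many drives to hit the same target, which is the same ratio as the per-disk rates.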
Finally, the worst performance hit is of course a system that does not perform at all because of downtime, when no data analysis can be done. When a storage array that is file system unaware is combined with a file system that is RAID unaware, it is a much harder task to keep the complete system up and running on an ongoing basis. Even if performance and scaling are acceptable, there is a significant cost in lost compute time when the storage system is down. Put slightly differently, with compute accounting for roughly twice the storage share of the budget in the typical split cited above, each dollar of downtime cost on the storage system translates to two dollars of wasted compute time that was paid for but cannot be used.
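The two-to-one relationship can be worked through with the budget split quoted at the start of this article. The Python sketch below assumes that the cost of downtime is proportional to the share of the budget spent on each component and that the compute cluster is fully idle whenever storage is down; the total budget and downtime fraction are made-up illustration values, and the function name `idle_cost` is hypothetical.

```python
# Budget shares quoted from the Intersect360 Research figures in the text.
BUDGET_SHARE = {
    "storage": 0.20,
    "software": 0.20,
    "services": 0.20,
    "compute": 0.40,
}


def idle_cost(total_budget: float, component: str, downtime_fraction: float) -> float:
    """Cost of a component sitting unusable for a fraction of its life.

    Assumption: the value lost is proportional to the component's share
    of the purchase budget and to the fraction of time it is unavailable.
    """
    return total_budget * BUDGET_SHARE[component] * downtime_fraction


if __name__ == "__main__":
    budget = 1_000_000  # illustrative total system budget in dollars
    downtime = 0.05     # illustrative: storage is down 5% of the time
    storage_loss = idle_cost(budget, "storage", downtime)
    compute_loss = idle_cost(budget, "compute", downtime)
    print("storage value lost:", storage_loss)
    print("compute value lost:", compute_loss)
    print("compute dollars wasted per storage dollar:", compute_loss / storage_loss)
```

Whatever budget and downtime values are plugged in, the ratio stays at about two, because it depends only on the 40% to 20% split between compute and storage.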