Hadoop - It's Simple with Panasas
November 12, 2012 - 11:02pm
A major theme here at Supercomputing 2012 is the use of Hadoop for big data analytics. In partnership with Hortonworks, Panasas is demonstrating Hadoop running on our recently announced Panasas ActiveStor 14 parallel storage system in booth 3421.
The technology requirements of big data applications differ dramatically from those of legacy storage and compute systems. Big data workloads strain traditional storage architectures in three key ways: scale, bandwidth, and the sheer volume and variety of unstructured data. Under the mantle of HPC, a new ecosystem had to be developed to extract value from all the unstructured data being generated by sensors, satellite images, simulations, video, and email, to name a few. Panasas originally developed its PanFS parallel file system to address the needs of big data workloads in design and discovery environments, long before the advent of big data analytics.
On a parallel path (forgive the pun), the Google File System (GFS) was designed as a highly scalable architecture, utilizing low-cost storage and parallel processing to economically extract value from all the data out on the Internet. In building its storage file system, Google leveraged core research done by Panasas founder Dr. Garth Gibson at the Carnegie Mellon Parallel Data Lab in Pittsburgh. Since then, the open-source Hadoop platform, modeled on GFS, has allowed many other companies to enter the big data analytics market, including LinkedIn and Facebook.
The Panasas file system shares a common heritage with modern file systems such as the Hadoop Distributed File System (HDFS), and its back-end storage platform is designed to support a highly distributed, scalable parallel file system. ActiveStor appliances can support a Hadoop workload through HDFS, or directly through PanFS. Running a Hadoop workload on an existing Linux cluster with Panasas ActiveStor appliances requires only some configuration changes at startup. The Panasas implementation does not require any proprietary software plug-ins and will work with any version of HDFS. There is no need to purchase additional compute or storage hardware to run Hadoop workloads, and the data is preserved in a trusted enterprise-class storage platform. Finally, there is no performance penalty for using the Panasas scale-out NAS solution: on internal benchmarks using the TeraSort suite of tests, ActiveStor 14T was actually 30% faster than a local disk solution.
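To give a sense of what "some configuration changes at startup" can look like, here is a hypothetical sketch of a Hadoop 1.x `core-site.xml` that points MapReduce at a shared POSIX mount rather than an HDFS NameNode. The `/panfs` mount point and the exact property values are assumptions for illustration; the actual settings for PanFS are in the Panasas configuration guide linked below.

```xml
<!-- core-site.xml: hypothetical sketch only. The /panfs mount point and
     values shown here are assumptions; consult the Panasas configuration
     guide for the settings used in practice. -->
<configuration>
  <!-- Use the local (POSIX) file system interface instead of an HDFS
       NameNode URI such as hdfs://namenode:8020. On each compute node,
       the shared parallel file system is visible as a regular mount. -->
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
  <!-- Keep Hadoop's working data on the shared mount so every node in
       the cluster sees the same paths. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/panfs/hadoop/tmp</value>
  </property>
</configuration>
```

Because the file system is already shared and fault-tolerant, no DataNode daemons or block replication need to be configured; MapReduce tasks read and write through ordinary file paths on the mount.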
If you can’t make it to Salt Lake City this year but want to learn more about Hadoop with Panasas, just download our white paper and configuration guide – http://performance.panasas.com/hadoop-configuration-guide.html