The Last Bottleneck: How Parallel I/O Can Attenuate Amdahl's Law
March 2, 2012 - 7:22pm
Director of worldwide systems engineering, Rex Tanakit, gives his thoughts on parallel computing's migration to the mainstream in this blog post:
Parallel computing is becoming mainstream in technical computing. Let me provide you with a few examples. The Economist magazine ran an article on parallel programming last June, http://www.economist.com/node/18750706. Its comfortable discussion of parallel programming is a sure indication of the topic’s movement into the mainstream. A more technical example is the average number of CPUs per system on the TOP500 (www.top500.org) in the past 18 years. In 2001 the average core count was between 129 and 256 per system according to the article, compared with 4000 to 8000 in 2011, an increase of more than 30X in shared processing power.
Perhaps a more subtle proof of parallel computing’s migration to the masses is found in everyday language: “multi-core” is now a common term in laptop and cell phone specifications, which highlight the number of processors.
It seems we are getting more performance from our applications, doing things faster, and reducing time to market. Or are we?
A more appropriate reckoning begins with whether we get performance improvement commensurate with the investment we put in. If we used to do a simulation in 60 minutes with 100 compute nodes, does the time reduce to six minutes with 1000 compute nodes? The problem is that linear scaling is difficult to accomplish. Several obstacles prevent linear performance improvement, including latency in communication between nodes, the algorithms used, data size, and I/O. The industry has done what it can to get as much scaling as possible, for example by creating standards like OpenMP and MPI to increase communication efficiency. The real issue is that to get performance and a better return on your multi-node, multi-socket, multi-core investment, you have to look at parallelism in application processing, not just at the number of processors.
In order for parallelism to get us better ROI, the whole system has to be parallelized, otherwise Amdahl’s law will kill us every time. (Amdahl’s law: “The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program;” Source: G.M. Amdahl, “Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capabilities,” Proc. Am. Federation of Information Processing Societies Conf., AFIPS Press, 1967). Basically, your program will run only as fast as its slowest component. We have seen 128-node compute systems using a single storage head node to do I/O. Total workload performance was poor because I/O was the last bottleneck to overcome. In summary, we need to adopt parallel I/O in storage systems as well as compute systems to get more performance. It is not easy, but it is essential.
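To make the limit concrete, here is a minimal Python sketch of Amdahl’s law applied to the 60-minute, 100-node question raised earlier. The function names and the serial fractions (0%, 1%, 5%) are illustrative assumptions, not measurements from any of the systems discussed in this post.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Speedup over one worker predicted by Amdahl's law.

    serial_fraction is the share of single-worker runtime that cannot be
    parallelized (for example, I/O funneled through one storage head).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def scaled_runtime(known_minutes: float, known_workers: int,
                   new_workers: int, serial_fraction: float) -> float:
    """Estimate runtime at new_workers from a runtime measured at known_workers."""
    relative_speedup = (amdahl_speedup(serial_fraction, new_workers)
                        / amdahl_speedup(serial_fraction, known_workers))
    return known_minutes / relative_speedup


if __name__ == "__main__":
    # Hypothetical serial fractions, chosen only to show the trend.
    for serial_fraction in (0.0, 0.01, 0.05):
        minutes = scaled_runtime(60.0, 100, 1000, serial_fraction)
        print(f"serial fraction {serial_fraction:.0%}: "
              f"60 min on 100 nodes -> {minutes:.1f} min on 1000 nodes")
```

Under these assumptions, only a perfectly parallel job gets from 60 minutes to six. With just 1% of the work left serial, ten times the nodes brings the run down to roughly 33 minutes, and no node count can push the speedup past 1 divided by the serial fraction.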
The more compute nodes a system has, the more file access it can do. Without parallel I/O most of the CPUs will sit idle, waiting for data. Parallel I/O increases the efficiency of the compute system resources. Independent software vendors (ISVs) developing big data applications for design and discovery have begun to adopt parallelism to dramatically improve time-to-results. Let me share with you some examples of Panasas ISV partners who are moving, or have moved, to parallel I/O.
Ansys Fluent, a computational fluid dynamics application for manufacturing, incorporated parallel I/O, which helped improve overall performance. Figure 1 shows the results from Ansys Fluent 13 running on 16 compute nodes, 8 processes per node, on a single Panasas ActiveStor 11 shelf. Fluent uses N-1 parallel I/O, which means all 16 compute nodes write to the same file. This test shows a greater than 2.5X saving in time to process the application when moving from serial to parallel I/O. Consider the impact this has on an engineering team: decisions get made faster, products get to market faster, or the team can simply run many more simulations in the same time period.
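The N-1 access pattern can be sketched in a few lines of Python. This is not how Fluent or any Panasas client implements it; it is a hypothetical, single-machine illustration in which threads stand in for compute nodes, every writer holds an equal-sized slab, and each writes at its own offset of one shared file with the POSIX call os.pwrite (available on Unix-like systems). The names write_slab and n_to_1_write are made up for this example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Sequence


def write_slab(fd: int, rank: int, slab: bytes) -> int:
    """Write one writer's slab into its own byte range of the shared file.

    os.pwrite takes an explicit offset, so writers never contend for a
    shared file position. Assumes all slabs have the same length.
    """
    offset = rank * len(slab)
    written = 0
    while written < len(slab):  # pwrite may write fewer bytes than asked
        written += os.pwrite(fd, slab[written:], offset + written)
    return written


def n_to_1_write(path: str, slabs: Sequence[bytes]) -> None:
    """N writers, one file: each writer covers a disjoint byte range."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=len(slabs)) as pool:
            futures = [pool.submit(write_slab, fd, rank, slab)
                       for rank, slab in enumerate(slabs)]
            for future in futures:
                future.result()  # re-raise any write error
    finally:
        os.close(fd)


if __name__ == "__main__":
    # 16 hypothetical writers, mirroring the 16 compute nodes in the test above.
    slabs = [bytes([rank]) * 1024 for rank in range(16)]
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "results.dat")
        n_to_1_write(target, slabs)
        print(os.path.getsize(target), "bytes written by", len(slabs), "writers")
```

On a single local disk this gains little. The pattern pays off when the file system underneath can service those disjoint ranges from many storage devices at once, which is the case the benchmark above describes.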
Yet another example is from CD-adapco, a provider of engineering simulation software; figures 2 and 3 show the results running on three Panasas ActiveStor 12 shelves. These benchmarks clearly show that serial I/O doesn’t scale. You don’t get any more read and write performance when you increase the number of processes. With parallel I/O, at low numbers of processes you are limited by the number of clients, i.e. the load that’s being put on the storage system, whereas at high numbers of processes the number of disks is the bottleneck. That’s an easy problem to solve: add more spindles, i.e. buy more storage. The point here is you can scale with parallel I/O.
My last example comes from a different market segment. Figure 4 shows results with Landmark Promax/SeisSpace software, a leading seismic processing application. SeisSpace uses an open source parallel I/O library called JavaSeis. JavaSeis developers understand the importance of parallel I/O and willingly share their work with anyone who wants to take advantage of it. The results show huge improvements in performance as the number of processes increases.
Many ISVs focusing on big data applications with significant data processing requirements have invested in parallel technologies to improve application performance. A parallel system is the key to getting more performance. Thanks to advances in silicon design, multi-CPU and multi-core systems are now the norm, and more nodes can be added to continue the performance improvement. To keep Amdahl’s law from capping that improvement and to get the best return on hardware investment, applications, file systems, and storage need to be parallelized as well. Porting code to parallel I/O is not an easy task, but as a Chinese proverb says, “A journey of a thousand miles begins with a single step.” The investment will be worth the effort and yield a big return in productivity.