Big Data According to Garth Gibson and IDC
Panasas founder and chief scientist, Dr. Garth Gibson, and Steve Conway, research vice president in IDC's high performance computing group, took some time recently to discuss HPC trends and forces affecting storage in the HPC market. The role that big data plays in HPC storage, now and in future scenarios, is discussed in this third and final video in the series. Garth and Steve explored Parallel NFS (pNFS) in the first installment; they covered solid state drive (SSD) technology in their second video.
Garth Gibson: One of the big trends today in HPC is big data. What is big data in HPC?
Steve Conway: A lot of times when these trends happen, people invent words and get excited about the words and things don’t materialize, but in this case we think the markets are forming almost ahead of the slogans about the markets. So this is getting very real. What’s happened is that HPC has always been a market that’s been data intensive in part, around simulation and modeling. What’s happening now is that analytics methods are being added to the tool kit.
GG: Are we just relabeling visualization to be analytics?
SC: No, it’s much more than that because this is happening in traditional HPC sectors, but in addition to that, commercial analytics is now starting to require HPC and there are use cases that are repeatable and not just one-offs. So we’re actually seeing some of these use cases starting to resolve into pursuable markets.
GG: So if we bring the enterprise in through the fact that it has huge amounts of data, they bring in computing stacks and computing paradigms that are different necessarily than HPC. Is that how we got Hadoop and MapReduce?
SC: They’ve been adopting Hadoop and MapReduce, but they’ve also been pushing up. And those don’t necessarily, as you know, require HPC resources, but they’ve been pushing up into HPC, and the interesting thing is, it’s not like the old 3-tier server architectures where they push the sample data off to the side and work on it on a dedicated server. They’re using HPC resources, clusters and storage, integrated right into the workflow.
GG: So the Hadoop infrastructure in place is a different paradigm for processing large amounts of data, and the main thing it does is sacrifice the closely coupled communication that distributed shared memory provides and that we use for physical simulation. So it isn’t a complete replacement. They are really complementary. What about the other side of it, the business intelligence and data warehousing?
SC: That’s one of the examples I was talking about that’s being closely integrated. An example is PayPal. That’s an eBay company and they have responsibility for fraud detection across eBay and across Skype. They have been using Hadoop successfully for quite a while. But for fraud detection, for that part of their assignment at PayPal, they really needed to go to real time. In order to do that, they had to move to HPC clusters. They are essentially doing graph analysis, finding hidden patterns, and doing that in near real time. Near real time means they have to find them before the money hits the credit cards.
GG: We call this machine learning in academic circles. Machine learning algorithms are changing rapidly to exploit parallelism because most of them were developed for single processors, for serial execution. So there are sophisticated statistics that aren’t parallel and need to become parallel.
SC: So all of a sudden, storage requirements start to multiply very, very rapidly, both in terms of capacity, which is not so much of an issue, and capability. Capability in storage is always the challenge: moving the data, where the data goes, and how much of it.
GG: IDC has been projecting that the spending on HPC storage, and I’m including all of this right now, is going to climb at a regular rate. But it’s both faster than the increase in total HPC spending and slower than the areal density increase in disks, isn’t it? Is that the true trend?
SC: We see that prediction coming true. From now through 2016, storage will grow at 8.9 percent, the fastest-growing part of the HPC ecosystem. We’re seeing the start of something. It’s just a start, but it’s enough of a start that we’re confident it’s going to continue: there’s growing recognition that we’ve sort of been building HPC systems a little bit upside down for what the requirements are now. We’ve been building them so compute centric that now, all of a sudden, there’s a very quick ramp-up of need for more data-centric kinds of architectures. That’s what we are really talking about here. That’s why the storage part of it is just growing faster than anything else.
GG: How do you think that’s going to get split among slow disks, fast disks, SSDs, and other technologies?
SC: I think we are going to be having not shallower, but deeper storage hierarchies, and the hierarchies are going to be based very much on capability. The more capable storage, the faster storage, is going to move closer to the processors, and that’s where we see SSDs. They are already being implemented for a number of things in HPC, but I see them proliferating more widely. Particularly with SSDs, the capability is admired, desired, all the rest of that, so it’s a function of how quickly the prices come down.
GG: Yeah, we’ve put SSD into ActiveStor 14. The key idea that we pursued there was ‘all the right data in the right place.’ So it’s an automatic tiering notion: if you understand your data, you know what’s going to be accessed in small random chunks, which fully exploits the SSD, and what is large and sequential, which fully exploits the bandwidth of the disk, so you can stream and then migrate as needed. I think this type of technology is essential, not just as a cache, which would copy everything onto the SSD and then on to the disk, so that large data flushes through the SSD too fast and pushes out all the little things, but more of an ‘I know where the data should be’ type of thing. So I agree. I think that is the trend. It’s not all necessarily deep, but sometimes horizontal, where we’re understanding our data and streaming it to the right place.
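Editor's note: the placement idea Garth describes (small random accesses go to SSD, large sequential streams go to disk) can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the ActiveStor implementation: the names AccessProfile, choose_tier and place, and the 64 KiB cutoff, are hypothetical and do not come from the interview.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical cutoff between "small" and "large" requests (an assumption).
SMALL_IO_BYTES = 64 * 1024


@dataclass
class AccessProfile:
    typical_request_bytes: int  # typical size of one read or write
    sequential: bool            # True if requests mostly stream in order


def choose_tier(profile: AccessProfile) -> str:
    """Return "ssd" for small random access and "disk" for everything else."""
    if not profile.sequential and profile.typical_request_bytes <= SMALL_IO_BYTES:
        return "ssd"
    # Large or sequential data is streamed at disk bandwidth instead of
    # passing through (and evicting small items from) the SSD.
    return "disk"


def place(files: Dict[str, AccessProfile]) -> Dict[str, str]:
    """Map each file name to the tier its access profile suggests."""
    return {name: choose_tier(profile) for name, profile in files.items()}
```

The point of the sketch is that placement is decided from what is known about the data, rather than by caching everything that passes through.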
SC: Yes, as you well know, areal density has been moving along nicely so that’s not so much of an issue (it’s an issue, but not so much of an issue). Access density is the real issue, and data movement, for a whole lot of reasons. So, you folks are attacking the right side of that problem. It’s one of the two or three biggest problems that users are facing in HPC today. You mentioned, maybe not with the word, but metadata. Metadata management is a huge fear. That’s what’s keeping people up at night: how do we know what’s where?
GG: Metadata is ill-defined. It’s frequently all the stuff that doesn’t fit into my data paradigm. There are many ‘metadatas.’ In the case of the storage system itself, it’s things like names, locations of the blocks, and permissions. That kind of metadata is our baby. We play with that all the time. We have streamed off and managed quite well. But there’s a new class of metadata, which is structure inside the data that you want to expose so that you can do search and look-up against it. More traditionally, we call those indices in the database sense. I think there’s going to be an explosion of automatic and specialized indexing. That’s going to run the gamut from machine learning and data warehousing all the way down to being embedded in storage. I ran a workshop yesterday and there were a bunch of papers on accelerating metadata and automatic indexing. So there are technologies really coming. I think you’re right that metadata, which is the high rate of access at random to small things, is the challenge after just being able to manage the big data.
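Editor's note: the "indices in the database sense" that Garth mentions can be illustrated with a tiny inverted index, which maps each word found inside the records to the records that contain it. This is a minimal Python sketch for illustration only; the function names build_index and lookup and the sample record format are assumptions, not anything described in the interview.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_index(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an inverted index from (record_id, text) pairs."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record_id, text in records:
        for word in text.lower().split():
            index[word].add(record_id)
    return dict(index)


def lookup(index: Dict[str, Set[str]], word: str) -> List[str]:
    """Return the sorted ids of records containing the word (case-insensitive)."""
    return sorted(index.get(word.lower(), set()))
```

A lookup touches only the small index entry for one word instead of scanning every record, which is the kind of high-rate, random access to small things that the discussion identifies as the hard part.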
SC: Right. And as you said, search and discovery are not things that always happen separately. In a lot of implementations, you’re really doing both. So you’re doing a kind of Hadoop-based search, but people are also wanting to do graph analysis or other types of discovery algorithms, whether it’s in traditional HPC sectors such as climate discovery (they’ve been having climate knowledge discovery workshops at SC12 and so forth), or whether that’s in what we see as a sort of rising tide of commercial companies that are suddenly, sometimes desperately, wanting to learn about HPC because some of their key customers have requirements now that are pushing them in that direction.
GG: So would you suggest that the Hadoop side of the HPC penetrates the enterprise broadly?
SC: Yes. A lot of people talk about Hadoop as if it’s only an enterprise phenomenon. You and I know, because we’re familiar with HPC, that HPC folks love to be first to bang on things, so the percentage of HPC sites that are doing something with Hadoop is far higher than in the enterprise segment. What we see is that they’re contributing experiences and knowledge that are really helping out people in the enterprise. So I think Hadoop is going to be spreading nicely through the enterprise.
GG: So that means enterprise standards applied to Hadoop data. This is where we need to talk about ‘how does Hadoop play with existing infrastructures and best practices.’ Our experience is that Hadoop is mostly a parallel programming paradigm. It has a storage abstraction which is normally implemented locally in HDFS in the same nodes in the same cluster.
GG: But it’s easily re-targetable. When you retarget against traditional NAS, the traditional NAS is too slow and so that is perceived to be fundamental, but it isn’t. When you have fast storage like ActiveStor, you can move your data at the speed that Hadoop wants it. We’ve had good success at serving Hadoop services and then you get the reliability and the customer support that the commercial and enterprise NAS provides.
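Editor's note: the claim that Hadoop is mostly a programming paradigm sitting on a re-targetable storage abstraction can be sketched as follows. This is a minimal, serial Python illustration and not the Hadoop or HDFS API: SplitStore, InMemoryStore, DirectoryStore, map_split and word_count are all hypothetical names, and a real framework would run the map step in parallel across nodes.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, List, Protocol


class SplitStore(Protocol):
    """Hypothetical storage abstraction: anything that can yield input splits."""

    def read_splits(self) -> Iterable[str]: ...


class InMemoryStore:
    """Stand-in for storage local to the compute nodes."""

    def __init__(self, splits: List[str]) -> None:
        self._splits = splits

    def read_splits(self) -> Iterable[str]:
        return iter(self._splits)


class DirectoryStore:
    """Stand-in for a mounted shared file system: one split per file."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def read_splits(self) -> Iterable[str]:
        for path in sorted(self._root.iterdir()):
            if path.is_file():
                yield path.read_text()


def map_split(split: str) -> Counter:
    """Map step: count the words in one split."""
    return Counter(split.split())


def word_count(store: SplitStore) -> Dict[str, int]:
    """Reduce step: merge per-split counts; the job never names the backend."""
    total: Counter = Counter()
    for split in store.read_splits():
        total.update(map_split(split))
    return dict(total)
```

Because word_count depends only on read_splits, switching the backing store changes nothing in the job itself; what changes is how fast the store can deliver the splits, which is the point being made about slow versus fast NAS.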
SC: Right. What you folks are targeting, what you’re doing, is definitely needed to make Hadoop feasible to run across distributed computing architectures and data centers and so forth. So that’s really an important next step.
GG: That’s feasible. So what about demand? Do the customers want to run their Hadoop systems? Do they want to use their existing NAS and SAN storage architectures with the new style? Or do they want to buy a whole new thing and set it off in the corner and run it separately?
SC: No. What we’re seeing from live examples (PayPal is certainly one of them, Geico is another one, Mayo Clinic; we can just go through them one by one) is that they do not want to run it off to the side. It’s not the old 3-tier architecture anymore. They want to be able to incorporate it right into their workflow. In fact, they can’t afford not to do that because there are real-time requirements that make it very difficult to just put it off to the side and kind of run it as sample data.