A few years ago, before I worked with Panasas, I was a Research IT Consultant for a variety of biotech organizations. When companies invited someone like me in, it was usually because they were having serious issues.
One of the major recurring problems that I observed was a clash between the IT department and the scientific researchers they were meant to be supporting.
Researchers would often complain that their storage was too slow or couldn’t scale alongside their research needs. IT would counter that the researchers couldn’t properly plan for, or even communicate, their performance needs.
Either way, IT departments would regularly make decisions around storage that would impair research, and researchers would report that they couldn’t perform their jobs optimally because of some seemingly arbitrary storage decision made by the IT department.
This all-too-familiar situation often boiled down to a conflict between genomics’ expanding storage requirements and institutional cost restrictions.
The Question of Cost
When genomics institutions buy storage, they usually base their decisions on the price of the solution. What they don’t account for is all the — often unquantifiable — costs they’ll run up down the road.
This is largely why IT staff and decision makers often choose cheaper or open source solutions — they focus primarily on the upfront cost. From that point of view, a solution that saves $50,000 appears the most logical.
But in doing this, they’re letting the wrong questions guide them. In an organization that is supposed to be advancing the cause of science, they’re asking how they can save money. The question they should be asking is this: How can we better serve the research?
Genomics’ Expanding Storage Needs
What IT departments may not understand is that storage is a bedrock for genomic research.
Every month, labs produce terabytes of genomic sequencing and analysis data, which raises real storage risks: massive data loss, performance slowdowns, spiraling maintenance costs, and, most importantly, research delays. These issues often stem from legacy file systems that no longer serve genomics’ fast-growing storage needs.
As genomics has boomed, the field’s computing and storage requirements have boomed in kind. Researchers are analyzing genomic data at large scale, keeping data continually active and outpacing legacy archival storage; as a result, both I/O performance and capacity scaling have become critical factors. Additionally, high-throughput instruments like DNA sequencers have massively increased the volume of new n-dimensional data, further transforming storage requirements. Those instruments also generate large numbers of small metadata files, which often make up the majority of the files in a genomics storage system.
For instance, in a typical genomics file system, 83 percent of the files are less than 12 kilobytes, and 60 percent are less than 2 kilobytes.1 Since most scale-out Network Attached Storage (NAS) systems struggle to handle both small file I/O operations and large sequential file throughput, this presents another challenge.
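If you want to see whether your own file system fits this profile, a quick scan of file sizes will tell you. Here is a minimal Python sketch; the directory path and the 12 KB / 2 KB thresholds are illustrative placeholders, not part of any Panasas tooling:

```python
import os

def small_file_share(root, thresholds=(12 * 1024, 2 * 1024)):
    """Walk a directory tree and report what fraction of files
    fall under each size threshold (in bytes)."""
    counts = {t: 0 for t in thresholds}
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip unreadable or vanished files
            total += 1
            for t in thresholds:
                if size < t:
                    counts[t] += 1
    # Return the fraction of files under each threshold
    return {t: (counts[t] / total if total else 0.0) for t in counts}

# Example (hypothetical path):
# shares = small_file_share("/data/genomics")
```

A distribution dominated by tiny files suggests that metadata operations, not raw streaming throughput, will be the bottleneck on a conventional scale-out NAS.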
How Storage Pushes Research Forward
Panasas commonly sees these kinds of problems in genomics research as organizations attempt to balance the demands of IT budgets with the needs of researchers.
The Garvan Institute of Medical Research experienced this a few years ago after it acquired the HiSeq X Ten sequencing platform to advance its research into cancer risk and child intellectual disability diagnoses.
The X Ten system produces as much as 5 terabytes of data per day. To keep its genomic sequencing production line going, Garvan had to accommodate that output while maintaining performance. The institute was also scaling its staff from a team of 10 to an army of 80 researchers.
Their storage consisted of several siloed systems, and it couldn’t scale to match those needs. At the time, they had to undergo a laborious process of copying massive datasets from NAS storage to their compute nodes, causing frustration and delays. These two scale-ups, in data volume and in staff, overloaded their storage systems, significantly slowing their performance and, ultimately, their research.
That changed when they introduced a Panasas PanFS parallel file system and moved their team of 80 researchers to several Panasas ActiveStor NAS appliances.
This move allowed Garvan to centralize the bulk of their compute onto a single node: they no longer had to copy files from their previous NAS to local storage, and they could access data directly from the centralized Panasas file system. Using Panasas’ DirectFlow protocol, researchers’ Linux clusters could retrieve data directly from storage instead of routing requests through a series of intermediate nodes.
With performance restored, Garvan could finally scale their storage with their discovery ambitions. When they acquired new sequencers a short time later, their Panasas storage allowed them to increase their sequencing capacity 50 times over.
Storage, previously a barrier to their research, had become a catalyst.
Matching Genomics’ Ambitions
Problems will persist in other genomics institutions if similar changes aren’t made. Genomics research is accelerating, and its computing requirements are, too.
Ultimately, IT departments will have to reexamine their priorities if they seek to be catalysts rather than obstacles in this field.