Research technology in the life sciences is advancing rapidly and generating record amounts of data in the process. A robust, high-performance IT infrastructure is essential for scientific organizations that thrive on collecting, analyzing, and distributing data to researchers around the world. Storage plays a central role in supporting research staff productivity and must address the unique workflow challenges of scientific research.
Accordingly, it is essential to have a storage system in place that reliably supports streamlined, 24/7 workflows under extremely heavy computational demand. To facilitate effective collaboration, data access has to be instant and intuitive. Researchers need the ability to tag, catalog, and search the metadata content of files via natural language, and to work with files based on metadata fields rather than file names.
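To make the idea concrete, here is a minimal sketch of metadata-based file retrieval. It is not any specific product's API; the `MetadataCatalog` class and its methods are hypothetical, illustrating only the general pattern of finding files by what they contain rather than what they are named.

```python
# Hypothetical sketch of a metadata catalog: files are tagged with
# arbitrary fields and retrieved by those fields, not by file name.
from dataclasses import dataclass, field


@dataclass
class Record:
    path: str
    tags: dict = field(default_factory=dict)


class MetadataCatalog:
    def __init__(self):
        self._records = []

    def tag(self, path, **tags):
        """Catalog a file under arbitrary metadata fields."""
        self._records.append(Record(path, dict(tags)))

    def search(self, **criteria):
        """Return paths of files whose metadata matches all criteria."""
        return [
            r.path
            for r in self._records
            if all(r.tags.get(k) == v for k, v in criteria.items())
        ]


catalog = MetadataCatalog()
catalog.tag("/data/run_0042.fastq", assay="RNA-seq", organism="mouse")
catalog.tag("/data/run_0043.fastq", assay="ChIP-seq", organism="mouse")

# Find files by their metadata, not their names:
hits = catalog.search(assay="RNA-seq", organism="mouse")
```

In practice, this kind of catalog is backed by an indexed database and a natural-language search layer, but the access pattern is the same: queries go against metadata fields, and file paths fall out as results.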
From a more technical perspective, data workflows can vary extensively between, and even within, different data science techniques. This frequently tempts organizations to build discrete, highly purpose-tuned storage solutions for each workflow. Unfortunately, these discrete “storage puddles” tend to deliver low capacity and performance utilization in the aggregate, and they increase overall storage maintenance costs. Moreover, moving data between puddles at different stages of the data science workflow is especially inefficient in both time and capacity. A centralized parallel file system, by contrast, avoids the pitfalls of over-tuned, segmented storage while delivering high capacity and performance utilization.
When determining which storage solutions best fit a given problem, emphasis should be placed on understanding workflow behavior rather than on the isolated, and often unrepresentative, performance of generic benchmarks. Most data science workflows have varying I/O requirements across their many phases, few of which mimic any artificial benchmark. Careful staging of these workflows to avoid destructive interference with others of differing storage needs (e.g., the impact of small-file workflows on streaming workflows) is crucial to maintaining storage performance. Finally, a balanced I/O subsystem architecture, from client to network to storage, is equally important to avoid workflow bottlenecks.
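The contrast between small-file and streaming phases can be sketched with a toy experiment. This is an illustration only, using a local temporary directory as a stand-in for shared storage; the phase names and sizes are assumptions, not measurements from any real workflow.

```python
# Sketch: two I/O phases that write the same total bytes but stress
# storage very differently. A local tempdir stands in for shared storage.
import os
import tempfile
import time


def small_file_phase(root, count=200, size=1024):
    """Metadata-heavy phase: many tiny files (stresses file-creation rate)."""
    for i in range(count):
        with open(os.path.join(root, f"part_{i}.dat"), "wb") as f:
            f.write(os.urandom(size))


def streaming_phase(root, size=200 * 1024):
    """Bandwidth-heavy phase: one large sequential write."""
    with open(os.path.join(root, "bulk.dat"), "wb") as f:
        f.write(os.urandom(size))


with tempfile.TemporaryDirectory() as root:
    t0 = time.perf_counter()
    small_file_phase(root)
    t1 = time.perf_counter()
    streaming_phase(root)
    t2 = time.perf_counter()
    # Same total bytes, very different profiles: the small-file phase
    # pays per-file open/close and metadata overhead that the single
    # streaming write avoids.
    print(f"small files: {t1 - t0:.3f}s, streaming: {t2 - t1:.3f}s")
```

Run against a shared file system rather than a local disk, the gap widens further, which is exactly why mixing small-file and streaming workloads on the same storage without careful staging degrades both.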
To read more about this topic, check out the recent article in Genetic Engineering and Biotechnology News (GEN) titled: The Scoop: Biodata Comes “Ome” to Roost.