More and more research areas now require HPC, and with scalable computing comes the need for high-performance storage. Despite the rapid growth in HPC, scientific storage remains a bottleneck to discovery. Price is commonly the primary factor when purchasing HPC storage, but several other key metrics must also be met.
Reliability – The importance of file system reliability cannot be overstated. Uptime is key to meeting project timetables, reducing staff frustration, and maintaining research continuity.
Scalability – Storage must scale in capacity, performance, and data protection.
Performance – The performance level must meet researcher needs. While some file systems may offer scalability, the performance level may not meet the demands of computational research.
When these needs go unmet, the money saved upfront ends up costing research in the long term.
How Storage is Growing
Data storage needs in the life sciences continue to grow in capacity, performance, and reliability. An inability to provide consistent uptime in the face of these needs threatens the success of important research experiments.
In most cases, high maintenance requirements and regular service interruptions continue to get in the way of research progress. System downtime and poor performance don’t just interfere with project timetables; they can actually delay time to a cure.
Figure: NIH BioWulf cluster active users per month. In the past five years, monthly users have nearly tripled.
Pain Points in HPC Storage
Over the years, users came to accept that HPC storage deployments are notoriously hard to manage. Organizations had to devote considerable resources to employing people who could master the intricacies of operating these complicated storage systems. But that’s not a scalable model.
We can no longer assume that HPC data center managers will be ready – or able – to expend time, money, and staff to buy and maintain clunky, complex HPC storage systems.
Change is needed in the HPC storage industry. Researchers are using HPC in a growing number of disciplines, yet storage downtime remains a key issue. Consider these findings from a Hyperion survey of data managers commissioned by Panasas:
Nearly 50% of the respondents experienced storage system failures once a month, with users coming to expect downtime as the norm in HPC storage.
After a system failure, 40% of HPC sites typically require more than two days to restore their storage system to full functionality.
More than 75% of respondents experienced reduced productivity in the past year due to storage-related issues; 12% of sites experienced this more than 10 times in the past 12 months.
Some outages lead to downtimes that last as long as a week, and a single day of downtime can cost from $100,000 to more than $1 million.
The most common challenges for HPC storage operations are recruiting and hiring qualified staff, followed by the time and cost needed to tune and optimize the storage systems.
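The survey figures above can be combined into a rough back-of-the-envelope estimate of what unreliable storage costs per year. The inputs below (a monthly failure, two days to restore, the survey’s $100,000–$1 million daily cost range) are illustrative assumptions drawn from the survey highlights, not measurements from any specific site:

```python
# Rough annual downtime cost estimate based on the Hyperion survey figures.
# All inputs are illustrative assumptions, not data from any one site.

def annual_downtime_cost(failures_per_year, days_to_restore, cost_per_day):
    """Expected yearly cost of storage outages: failures x duration x daily cost."""
    return failures_per_year * days_to_restore * cost_per_day

# A site failing monthly (12/yr), taking 2 days to restore, at the
# survey's low-end estimate of $100,000 per day of downtime:
low = annual_downtime_cost(12, 2, 100_000)     # $2.4M per year
# ...and at the $1M/day high end:
high = annual_downtime_cost(12, 2, 1_000_000)  # $24M per year

print(f"Estimated annual downtime cost: ${low:,} to ${high:,}")
```

Even at the conservative end of the range, recurring outages dwarf most per-terabyte savings on the initial purchase.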
TCO (Total Cost of Ownership) Needs to be Thought of Differently in the Life Sciences
It’s tempting to make decisions about storage based entirely on the desire to seek out the lowest initial purchase price. This makes sense when buyers consider something like equivalent compute nodes sold by different vendors.
In the storage space, however, there is considerable differentiation between products. An open-source system may seem inexpensive at first blush, but as the Hyperion survey makes clear, storage issues after installation are common and often costly. The implications of downtime should also be measured in more than dollars: to a researcher, the financial losses from storage issues may not always feel tangible, but the cost to research progress is very real when those issues derail organizational research goals.
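To make the point concrete, here is a minimal sketch comparing two hypothetical systems over a five-year life. Every figure in it (purchase prices, outage rates, restore times, daily downtime cost) is invented purely for illustration:

```python
# Hypothetical comparison: sticker price vs. total cost including downtime.
# All figures are made-up assumptions for illustration only.

def five_year_cost(purchase_price, outages_per_year, days_per_outage,
                   downtime_cost_per_day, years=5):
    """Purchase price plus the cumulative cost of expected outages."""
    return purchase_price + years * outages_per_year * days_per_outage * downtime_cost_per_day

# A low sticker price with monthly two-day outages...
cheap = five_year_cost(500_000, 12, 2, 100_000)
# ...versus a pricier system with one half-day outage per year.
reliable = five_year_cost(1_500_000, 1, 0.5, 100_000)

print(f"Cheap system over 5 years:    ${cheap:,.0f}")
print(f"Reliable system over 5 years: ${reliable:,.0f}")
```

Under these assumed numbers, the "cheap" system costs several times more once downtime is counted, which is exactly why purchase price alone is a misleading metric.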
Research Cost of Ownership (RCO)
Reduced productivity from infrastructure failures leads to discovery delays. That can have financial implications for an organization pursuing a new pharmaceutical product. But there is no metric to accurately relate TCO to the overall cost of lost research time at an organization. That’s why I prefer to consider something I call Research Cost of Ownership (RCO).
RCO is the effect that downtime, and reduced productivity, has on the overall scientific mission. While RCO is not immediately quantifiable, it needs to be a major consideration in the storage purchasing process.
Why RCO Matters
The goal of research is to contribute to the collective body of knowledge for humankind. It is how humanity makes discoveries and moves forward with innovation.
When this goal is compromised by short-sighted financial decision making, it has a ripple effect across the global body of scientific knowledge. It is therefore time for IT staff to consider the RCO implications of storage issues. Maybe a few dollars per TB were saved on the initial purchase, but the cost to the scientific community can be much higher. We need to bring down the curtain on the recurring drama where researchers suffer repeated downtime.
Unlike TCO, RCO is somewhat qualitative, but it is essential when evaluating your infrastructure choices. Remember the immense cost to discovery if storage issues persist at your organization. When making your next storage decision, RCO must be a key evaluation metric.
Predictability, resilience, and reliability should become our new standard. Science is too important to be delayed by avoidable technical issues. In this era, everyone is examining their infrastructure to find the technology approach that gets their work done quickly and reliably. And that means things need to change in the HPC storage world. If we do it right, we’ll all be heroes.