Why Data Gravity Plays a Role in HPC

Data volumes have a certain weight — data gravity — which can lead to unique storage challenges, especially in the HPC arena.

  Curtis Anderson | December 14, 2021

 

Data storage in the high-performance computing (HPC) sector has always been one of the poorly lit corners of the industry. The enormous need for speed and the sheer scale of the challenge demand highly specialized technical solutions.

One of the difficulties arising from the complexity of HPC data storage is explaining the problems and the solutions they require in a way that even non-experts in this field can understand.

What analogies or mental models can we use to help the managers and executives responsible for an organization’s HPC installation understand what they need for their HPC equipment to run at peak efficiency?

One useful mental image: Data has “mass.” Just as a cubic meter of gold is much harder to transport than a cubic centimeter of gold, a petabyte of data is much harder to move from storage to the compute nodes (and back again) than a megabyte of data.
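To make that difference in “mass” concrete, the short sketch below is a hypothetical back-of-the-envelope calculation in Python, not a description of any particular installation: it compares the idealized time to move a megabyte and a petabyte over a single assumed 10 Gbit/s link, ignoring protocol overhead and latency.

```python
# Back-of-the-envelope sketch: idealized time to move a megabyte versus a
# petabyte over a single link. The 10 Gbit/s link speed is an assumption
# chosen for illustration only; protocol overhead and latency are ignored.

def transfer_time_seconds(num_bytes: float, link_gbit_per_s: float) -> float:
    """Idealized transfer time in seconds for num_bytes over one link."""
    bits = num_bytes * 8
    return bits / (link_gbit_per_s * 1e9)

MEGABYTE = 1e6   # bytes
PETABYTE = 1e15  # bytes

for label, size in [("1 MB", MEGABYTE), ("1 PB", PETABYTE)]:
    seconds = transfer_time_seconds(size, link_gbit_per_s=10)
    print(f"{label}: {seconds:,.4f} s  (~{seconds / 86400:,.2f} days)")

# On these assumptions, 1 MB moves in well under a millisecond, while 1 PB
# takes about 800,000 seconds (roughly nine days) on the same link.
```

On those assumptions, a megabyte moves almost instantly while a petabyte occupies the link for roughly nine days, which is exactly why HPC storage aggregates many drives and many network links working in parallel.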

The notion that data has mass is even more fitting than it may seem at first glance. Lifting one cubic meter of gold into the racks of a data center requires considerable infrastructure: not only reinforced flooring, but also a great deal of energy. Similarly, an HPC storage system needs substantial energy to store petabytes of data on its drives and retrieve them again, and all of the network equipment needed to move those petabytes at high bandwidth requires even more energy.

Consider another related analogy: Depending on its quantity, data goes through “phase changes,” much like the physical state changes between steam, water and ice. Steam, water and ice are the same substance, but temperature determines how each must be handled.

“The magnitude of data processing in HPC requires a physical infrastructure and energy level that exceed what most companies’ storage solutions can provide.”

—Curtis Anderson, Panasas

A megabyte of data is like water vapor: In addition to some insulation, it only needs a few empty pipes to get from one place to another. A gigabyte of data is more like water: You still need pipes, but you also need electric pumps to move it from one place to another. Finally, a petabyte of data is more comparable to ice, which does not flow through pipes no matter how hard you push it; you would have to cut it into blocks or crush it and place it on a conveyor belt. That process is not only far more energy-intensive, but it also requires a completely different physical infrastructure than transporting water or steam. In other words, the quantity of data (like the temperature of the water) makes the difference and determines how it must be handled.

“Parallel file systems” were invented as offshoots of conventional network file systems such as NFS or SMB/CIFS precisely because they can serve as the “conveyor belt” that HPC systems need, rather than the simple “pipe.”
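As a rough illustration of the “conveyor belt” idea, the sketch below shows, in simplified Python, how a parallel file system lays a file out in stripes across many storage nodes so that a client can fetch them concurrently, instead of funneling everything through a single server the way a plain NFS mount would. The node names, stripe size and functions here are hypothetical and only illustrate the concept; they do not describe any particular product’s implementation.

```python
# Conceptual sketch of striping in a parallel file system: a file is cut
# into fixed-size stripes distributed round-robin across many storage
# nodes, so a client can fetch stripes from all nodes at once.
# Stripe size and node count are illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 1 * 1024 * 1024                              # 1 MiB stripes (assumed)
STORAGE_NODES = ["node-0", "node-1", "node-2", "node-3"]   # hypothetical nodes

def stripe_layout(file_size: int) -> list[tuple[str, int, int]]:
    """Map each stripe of a file to (storage node, offset, length), round-robin."""
    layout = []
    offset, index = 0, 0
    while offset < file_size:
        length = min(STRIPE_SIZE, file_size - offset)
        node = STORAGE_NODES[index % len(STORAGE_NODES)]
        layout.append((node, offset, length))
        offset += length
        index += 1
    return layout

def read_stripe(node: str, offset: int, length: int) -> bytes:
    """Placeholder for a network read from one storage node."""
    return b"\0" * length  # stand-in data; a real client would issue an RPC here

def parallel_read(file_size: int) -> bytes:
    """Fetch all stripes concurrently and reassemble them in order."""
    layout = stripe_layout(file_size)
    with ThreadPoolExecutor(max_workers=len(STORAGE_NODES)) as pool:
        chunks = list(pool.map(lambda stripe: read_stripe(*stripe), layout))
    return b"".join(chunks)

if __name__ == "__main__":
    data = parallel_read(10 * 1024 * 1024)  # read a hypothetical 10 MiB file
    print(f"Reassembled {len(data)} bytes from {len(STORAGE_NODES)} nodes")
```

The design point is that aggregate bandwidth grows with the number of storage nodes; adding nodes adds lanes to the conveyor belt, whereas a single file server remains a single pipe no matter how fast its disks are.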

These two analogies provide simple mental models for why HPC storage solutions are so different from enterprise storage solutions. Data on the scale that HPC typically processes requires a physical infrastructure and an energy level that most companies’ storage solutions simply cannot provide.


About the author:

Curtis Anderson is a data storage expert with more than 34 years of experience, focused on the implementation of file systems. Anderson was one of the five original authors of the XFS file system, which is now widely used in Linux, and worked on the Veritas VxFS file system before Veritas was launched. He was also a member of the IEEE for 14 years, including as sponsor chair of the IEEE 1244 Working Group, which coordinated and published a formal standard for sharing tape drives and tape robots in a SAN among multiple hosts. In his role as software architect at Panasas, Anderson is responsible for coordinating the technology teams working on the various elements that make up Panasas’ parallel storage file system. Before joining Panasas, Anderson worked as Technical Director at NetApp and as an architect at EMC/Data Domain. Anderson holds 10 patents, including in the areas of continuous data backup and replication of deduplicated file data over a network.

The authors are responsible for the content and accuracy of their contributions. The opinions presented reflect the views of the authors.

 

Read the original article in German here.