Michael Matzer / Dr. Jürgen Ehneß | September 27, 2021
The Panasas file system PanFS is unusual in many respects. The parallel file system for object storage offers neither an AWS S3 API, which is now widely available elsewhere, nor data compression, which has also become standard. But according to Curtis Anderson, Panasas software architect, the system is optimized for fast data access, even with mixed workloads (HPC/AI), and is designed to reduce operating costs while reliably maintaining high performance.
No performance or capacity cap: PanFS is ideal for HPC and AI.
PanFS consists of three main components: ‘Director nodes’, ‘Storage nodes’, and the client driver ‘DirectFlow’. Together, the two types of nodes make up the Panasas ActiveStor appliance. The driver is a loadable software module that runs on the Linux compute servers (the clients) and interacts with both node types. Administrators work through a GUI or a command line running on a director node. As the name suggests, the director node directs the interaction with the storage nodes.
In this way, PanFS separates the control plane from the data plane, not only for security reasons, but also for the sake of scalability: any number of storage nodes can be added, and performance scales linearly as they are.
Director nodes keep the metadata for folders, file attributes, and so on, on the storage nodes, and they coordinate all actions of the storage nodes and client drivers. In the event of a failure, they manage all recovery points as well as the operations needed to keep data secure and available. At first glance, director nodes are simple servers, but they offer a high-speed network connection, significant DRAM capacity, and NVDIMM memory for transaction logs.
Storage nodes form the data layer. Within the architecture, user data and metadata are stored only there, so both data types can scale in relation to each other. The storage nodes are commercially available systems, but their hardware is well balanced in terms of hard disks, SSDs, NVMe and DRAM capacities, CPU performance, network bandwidth, and so on.
Finally, the client driver DirectFlow is a loadable file system that is installed on the compute servers and can be used by any application like any other file system. In cooperation with the director and storage nodes, it behaves like a fully POSIX-compliant file system with a single namespace across all servers in the compute cluster. The Panasas driver supports all major Linux distributions and versions.
PanFS is designed to scale linearly. If 50 percent more storage nodes are added, storage capacity also increases by 50 percent. Adding more director nodes results in increased metadata processing speed. There is no upper limit on performance or capacity, which in turn makes the file system extremely well suited for high-performance computing (HPC) as well as for AI requirements.
As a parallel file system, PanFS is able to provide much more bandwidth than the NFS and CIFS/SMB protocols. Each file stored by PanFS is distributed across many storage nodes so that its components can be read and written in parallel. This significantly increases file access performance.
Because PanFS is also a direct file system, the compute server can talk to all storage nodes directly over the network. Comparable products route file access through so-called “head nodes” running the NFS or CIFS/SMB protocols, and through an additional backend network. The bottleneck occurs at these head nodes, and the backend network adds cost. In PanFS, the client driver on the compute server talks directly to the storage nodes, and the director nodes are not involved in the data path at all (“out-of-band”). As a result, there are hardly any bottlenecks, load points (hotspots) or fluctuations in performance, as is the case in scale-out NAS systems.
Because the components of a file are distributed, each file requires a file map that shows where its components are located on the respective storage nodes. The client driver uses this file map to identify which storage nodes it needs to access, and it can then access them both directly and in parallel.
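The interplay of striping, file map, and parallel access can be illustrated with a minimal Python sketch. It is not PanFS code: the round-robin placement, the tiny `STRIPE_SIZE`, the in-memory dictionaries standing in for storage nodes, and the function names `write_striped` and `read_striped` are all assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 4  # hypothetical component size in bytes, tiny for readability


def write_striped(data, nodes):
    """Split data into components, place them round-robin, return the file map."""
    file_map = []  # list of (component index, node id)
    for index, offset in enumerate(range(0, len(data), STRIPE_SIZE)):
        node_id = index % len(nodes)
        nodes[node_id][index] = data[offset:offset + STRIPE_SIZE]
        file_map.append((index, node_id))
    return file_map


def read_striped(file_map, nodes):
    """Fetch every component straight from its node, in parallel, and reassemble."""
    def fetch(entry):
        index, node_id = entry
        return nodes[node_id][index]

    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        parts = list(pool.map(fetch, file_map))  # map() keeps file_map order
    return b"".join(parts)


# Three in-memory dictionaries stand in for three storage nodes.
nodes = [{} for _ in range(3)]
payload = b"PanFS stripes files across nodes"
file_map = write_striped(payload, nodes)
restored = read_striped(file_map, nodes)
```

The reader never goes through a central node: it consults only the file map and then contacts each storage node that holds a component, which is the idea behind the direct, parallel access described above.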
PanFS also uses network-erasure coding to ensure the highest level of data integrity and reliability within the distribution process (striping). Because PanFS is fully POSIX-compliant, all processes on the client driver’s compute servers see the same file system namespace, metadata, and user file contents. The client driver DirectFlow also ensures cache coherency.
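The article does not say which erasure code PanFS uses. As a deliberately simplified stand-in, the following sketch uses single XOR parity, the most basic erasure code, to show how a lost component of a stripe can be rebuilt from the surviving ones. The function names and the sample stripe are hypothetical.

```python
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_stripe(components: List[bytes]) -> bytes:
    """Compute one parity component over equally sized data components."""
    parity = bytes(len(components[0]))  # all zeros
    for component in components:
        parity = xor_bytes(parity, component)
    return parity


def rebuild(components: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Reconstruct at most one missing component (marked as None)."""
    missing = [i for i, c in enumerate(components) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost component")
    repaired = list(components)
    if missing:
        recovered = parity
        for component in components:
            if component is not None:
                recovered = xor_bytes(recovered, component)
        repaired[missing[0]] = recovered
    return repaired


# Assumed example: three data components, each stored on a different node.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode_stripe(stripe)
damaged = [b"AAAA", None, b"CCCC"]  # the node holding the second component failed
repaired = rebuild(damaged, parity)
```

Production erasure codes typically tolerate more than one simultaneous failure. The sketch only conveys the principle that redundancy is spread across nodes along with the data.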
To secure the system, PanFS provides Access Control Lists (ACLs), not only for files, but also for directories. These supplement the common Linux permission style such as “-rwxr-xr-x” and are much more fine-grained. Snapshots per drive (at least one logical drive must be set up) allow users to recover older file versions themselves without requiring an admin. To ensure that data remains confidential, it can be encrypted with DARE (Data At Rest Encryption).
In a storage system, file sizes, access patterns and workloads can change significantly over time. Panasas supports all of these and thereby considerably expands the range of use cases. In high-performance computing (HPC), large files are not unusual. PanFS supports genetic research as well as hosting central directories with a cloud provider.
PanFS for HPC and AI workloads has offered a feature called “Dynamic Data Acceleration on PanFS” (DDA) since 2020. This control feature is designed to accelerate data storage operations on Panasas’ ActiveStor Ultra appliances by helping storage media such as SSDs and hard drives work more efficiently. The key factor is not access frequency, as with tiering, but file size. To enable DDA to do this work automatically, an algorithm in the orchestrator monitors how and where metadata and user data are stored.
By dynamically controlling the movement of files between SSDs and HDDs and realizing the full potential of NVMe, PanFS delivers not only the highest possible performance for HPC and AI workloads at a reasonable cost of ownership, but also, just as importantly, in a consistent, predictable manner. The DDA algorithm controls the sweeper software that does the actual distribution of the small files.
To keep SSD utilization at about 80 percent of capacity, the sweeper moves small files onto those media. “If an SSD is 80 percent full, the sweeper moves the largest files to disk. If a hard drive is ‘only’ 70 percent full, the sweeper moves the ‘smallest’ files to the faster SSDs,” Curtis Anderson explains. “DDA manages the relocation of small files between SSDs and hard drives to increase access performance as well as the performance of workloads that use small files by keeping them isolated from streaming workflows.” This is just one example of how PanFS keeps system performance at optimal levels.
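The two rules in Anderson’s quote can be turned into a small Python sketch. It takes the quoted thresholds literally. Everything else is an assumption for illustration and not the actual sweeper: the `Tier` class, the `sweep` function, the stop condition for promotions, and the sample file sizes.

```python
from dataclasses import dataclass, field
from typing import Dict, List

SSD_HIGH_WATER = 0.80  # from the article: SSDs are kept at about 80 percent
HDD_LOW_WATER = 0.70   # from the quote: a hard drive "only" 70 percent full


@dataclass
class Tier:
    capacity: int                                         # arbitrary units
    files: Dict[str, int] = field(default_factory=dict)  # file name -> size

    @property
    def used(self) -> int:
        return sum(self.files.values())

    @property
    def utilization(self) -> float:
        return self.used / self.capacity


def sweep(ssd: Tier, hdd: Tier) -> List[str]:
    """Apply the two quoted rules once and return a log of the moves."""
    moves = []
    # Rule 1: the SSD is at or above 80 percent, so demote its largest files.
    while ssd.files and ssd.utilization >= SSD_HIGH_WATER:
        name = max(ssd.files, key=ssd.files.get)
        hdd.files[name] = ssd.files.pop(name)
        moves.append(f"{name}: ssd -> hdd")
    # Rule 2: the hard drive is at most 70 percent full, so promote its
    # smallest files as long as the SSD stays below its mark (assumed stop rule).
    if hdd.utilization <= HDD_LOW_WATER:
        for name in sorted(hdd.files, key=hdd.files.get):
            size = hdd.files[name]
            if (ssd.used + size) / ssd.capacity >= SSD_HIGH_WATER:
                break
            ssd.files[name] = hdd.files.pop(name)
            moves.append(f"{name}: hdd -> ssd")
    return moves


# Assumed sample state: the SSD is 85 percent full, the hard drive 60.5 percent.
ssd = Tier(capacity=100, files={"a": 50, "b": 25, "c": 10})
hdd = Tier(capacity=1000, files={"big": 600, "tiny": 5})
moves = sweep(ssd, hdd)
```

In this sample run the largest file leaves the SSD and the smallest file on the hard drive takes its place, so the small files end up on the fast media and away from the large, streaming-style files.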
New vendors such as start-up WekaIO (Storage-Insider reported) appear to offer customers an advantage by storing “hot” files on fast NVMe SSDs while storing big, “cold” files in a large, S3-based object store data lake on-premises. At WekaIO’s U.K.-based customer Genomics England, 1.3 PB reside on NVMe SSDs, whereas the available 40 PB are stored on hard drives in the customer’s in-house IT.
Automatic tiering takes place between the NVMe SSDs and the S3 data lake. Moving data from the slow but low-cost hard disks to the NVMe SSDs currently has to be initiated manually. Although the need for this does not arise that often, what weighs more heavily is that access from the fast SSDs to the “slow” hard disks slows down the whole system: throughput is only 150 GB/s. In comparison, Panasas’ DDA system delivers 410 GB/s, with 41 PB available.
In terms of operational costs, WekaIO’s solution for its customer Genomics England costs about $400 per terabyte, according to Panasas. “Slow, underperforming” media are unknown in the Panasas system: DDA compensates for the differences and raises the system to a consistent level of high performance. The price per terabyte is $200. DDA thus delivers a significant cost advantage when PanFS is implemented.
This article was originally published in German.