Why Parallel NFS is the right choice for AI/ML Workloads

Srikanth Kaligotla

February 28, 2024

The requirements for artificial intelligence (AI) workloads usually center on the scale of operations and the speed of completion. In the rush to satisfy those requirements, time-tested environments are often overlooked, and an organization ends up building customized, complex solutions that are siloed and hard to maintain. To make matters worse, there’s negative press about the use of NFS for AI and machine learning (AI/ML), even though NFS is the most ubiquitous file system environment around us.

The misconceptions about and the arguments against the use of NFS in AI/ML environments have to do with inefficient setup and a refusal to use the latest features of NFS. In this blog post, my aim is to highlight the improvements in NFS and to showcase its potential in bootstrapping an AI environment seamlessly without sacrificing any of the requirements. As proof, my blog post highlights the ability of NetApp® ONTAP® software to serve your AI/ML workloads with zero configuration changes to the host environment, all while meeting or exceeding the requirements.

Challenges with AI/ML workloads

As more and more use cases embrace AI, there’s increased pressure on data access, data compute, and data storage. For GPUs to be high performing, they need the data to be readily accessible. Technologies like GPU Direct Storage are facilitating access to vast amounts of storage through high-bandwidth links. For this access, multiple parallel lanes are established between GPUs and the storage array. The choice of transport also plays a critical role in ensuring saturation of the network lanes. Truth be told, an NFSv3 server wouldn’t be able to handle the load because its design is based on a single-server, multiple-client model.

To fully saturate the links and to keep the GPUs busy, a framework of multiple I/O paths that can drive concurrent I/O between storage and compute is required. Parallel file systems have been designed to break up the data into multiple parts and to store them on several servers, offering concurrent access. Such file systems provide a unique way of indexing the various regions of the data and thus enable simultaneous access to the data. This coordinated use of multiple I/O paths significantly boosts I/O performance.
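To make the idea of indexing regions of the data concrete, here is a minimal sketch of round-robin striping. It is an illustration only, not how any particular parallel file system lays out data; the function name `split_range`, the `Segment` type, and the stripe size and server count are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    server: int       # index of the server that holds this piece
    file_offset: int  # where the piece starts within the file
    length: int       # number of bytes in the piece


def split_range(offset: int, length: int, stripe_size: int, num_servers: int) -> List[Segment]:
    """Split a byte range into per-server pieces under round-robin striping.

    Stripe k (bytes k*stripe_size up to (k+1)*stripe_size) is assumed to
    live on server k % num_servers.
    """
    if stripe_size <= 0 or num_servers <= 0:
        raise ValueError("stripe_size and num_servers must be positive")
    segments = []
    pos, end = offset, offset + length
    while pos < end:
        stripe_index = pos // stripe_size
        stripe_end = (stripe_index + 1) * stripe_size
        piece = min(end, stripe_end) - pos  # stop at the stripe boundary
        segments.append(Segment(stripe_index % num_servers, pos, piece))
        pos += piece
    return segments
```

Because each segment names a different server, a client can issue the reads for one large range concurrently instead of queuing them behind a single server.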

Metadata performance is another key requirement of AI/ML workloads, and because NFSv3 has no way of segregating metadata traffic from data traffic, this requirement is also pushing users toward parallel file systems. However, parallel file systems cost more to set up and require careful consideration, including ongoing maintenance. Not to mention the vendor lock-in, proprietary clients, and siloed heaps of data.

Game changer—NFSv4.1

Version 4.1 of NFS removes several of the limitations, and the workarounds they required, that stood in the way of using NFS for high-speed access. It is a major revision of the protocol, with a focus on performance.

Parallel NFS

Introduced in version 4.1, parallel NFS (pNFS) supports true scale-out: data can be distributed across multiple servers and accessed directly by the client over multiple paths. Another distinguishing feature of pNFS is that clients use the file layout driver to learn where the data lives and to communicate with those servers directly. Traffic can be segregated so that metadata (file system calls that describe the file attributes) and data (read and write traffic) are served independently by different servers within a cluster.

The preceding figure highlights multiple connections to the Metadata Server (MDS). These connections are established at mount time and serve operations such as open, close, stat, lookup, and readdir. The user data for a file, which can live on any high-speed node, is accessed directly through the Data Server (DS). The data connections are established dynamically, based on the file layout that the MDS returns.

A range of jobs within the AI/ML workload suite benefit greatly when the transport can separate the data from the metadata. And with independent connections for each type of operation and direct access to byte ranges of a file, the architecture is well suited to run AI/ML workloads.
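The split between metadata and data paths can be sketched with a toy in-memory model. This is not the pNFS wire protocol; the classes `MetadataServer` and `DataServer`, the helpers `store_file` and `read_file`, and the fixed-size chunking are all hypothetical, chosen only to show the control flow: one layout request to the metadata server, then reads that go straight to the data servers.

```python
class DataServer:
    """Holds chunk bytes and counts how many reads it served."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}
        self.reads = 0

    def write(self, chunk_id, data):
        self.chunks[chunk_id] = data

    def read(self, chunk_id):
        self.reads += 1
        return self.chunks[chunk_id]


class MetadataServer:
    """Knows where each chunk lives; never returns file data itself."""

    def __init__(self):
        self.layouts = {}
        self.layout_requests = 0

    def record_layout(self, path, layout):
        self.layouts[path] = layout

    def get_layout(self, path):
        self.layout_requests += 1
        return list(self.layouts[path])


def store_file(mds, servers, path, data, chunk_size):
    """Setup helper: spread chunks round-robin and record the layout."""
    layout = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        server = servers[index % len(servers)]
        chunk_id = (path, start)
        server.write(chunk_id, data[start:start + chunk_size])
        layout.append((server, chunk_id))
    mds.record_layout(path, layout)


def read_file(mds, path):
    """Client read: one metadata request, then data from the data servers."""
    layout = mds.get_layout(path)
    return b"".join(server.read(chunk_id) for server, chunk_id in layout)
```

In this model the metadata server handles a single request per file read, while the data servers share the byte traffic between them, which is the property the paragraph above describes.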

Session trunking

Segregating the NFS traffic and providing a direct path for I/O access still fall short of saturating all the available bandwidth. A surefire way to saturate a network link between the client and the server is to create multiple connections between them. The session trunking feature of NFSv4.1 serves this purpose. By aggregating the throughput across all available interfaces to the data server, session trunking helps maximize network bandwidth. Aggregation is accomplished by associating multiple connections, each with potentially different source and destination network addresses, within the same session.

As the previous figure shows, the NFS operations can now proceed at a combined throughput of all the available interfaces between the client and the server. This multiplexing of network bandwidth creates high-speed lanes to transfer NFS remote procedure calls (RPCs).
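A rough way to picture session trunking is one session object that owns several connections and spreads RPCs across them. The sketch below assumes simple round-robin scheduling, which is an illustrative choice and not a statement about how any real NFS client picks a connection; the class `TrunkedSession`, the helper `aggregate_gbps`, and the addresses in the usage example are hypothetical.

```python
from itertools import cycle


class TrunkedSession:
    """One session, many connections; RPCs are spread round-robin."""

    def __init__(self, session_id, connections):
        if not connections:
            raise ValueError("a session needs at least one connection")
        self.session_id = session_id
        # Each connection is a (client_address, server_address) pair.
        self.connections = list(connections)
        self._next = cycle(self.connections)
        self.sent = {conn: 0 for conn in self.connections}

    def send(self, rpc_bytes):
        """Account for one RPC on the next connection and return that connection."""
        conn = next(self._next)
        self.sent[conn] += rpc_bytes
        return conn


def aggregate_gbps(link_speeds_gbps):
    """Upper bound on session throughput: the sum of the link speeds."""
    return sum(link_speeds_gbps)
```

For example, a session built from `("192.0.2.1", "198.51.100.1")` and `("192.0.2.2", "198.51.100.2")` alternates RPCs between the two pairs, so with two 100Gbps links its ceiling is `aggregate_gbps([100, 100])`, that is, 200Gbps.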

pNFS trunking

By combining the multipathing ability with pNFS, we now end up with concentrated lanes that carry either metadata for the file system or read/write traffic, or both. The resulting architecture is very similar to other parallel file systems, but without the setup and maintenance costs. The following figure is a good demonstration of the architectural similarity between pNFS and a well-known parallel file system, Lustre.

By employing standardized NFS semantics, we now have high-speed disks, a high-speed network, and powerful compute working in tandem in a time-tested environment.

ONTAP readiness

As established in the preceding technology discussion, AI/ML is a throughput game. The workload largely consists of data processing, model training, and inference. From a storage perspective, it means ingesting and exporting large datasets, with periodic checkpoints before final model generation. In NetApp ONTAP, technologies such as scale-out NAS, FlexGroup, NFS over RDMA, pNFS, and session trunking—among several others—can become the backbone in supporting AI/ML workloads.

Data layout

AI/ML dataset sizes are always growing, and they have to be treated as a single unit. FlexGroup is a natural choice for data of this scale because it gives you the freedom and the flexibility to spread the data evenly within the cluster. The ability to access the data uniformly from any node in the cluster makes the solution truly scale-out: there are no hotspots, and a job can use the aggregated bandwidth of the entire cluster. Imagine an eight-node ONTAP cluster, each node with dual 100Gbps RDMA over Converged Ethernet (RoCE) ports. Your system now has a combined bandwidth of 1,600Gbps (8 nodes × 2 ports × 100Gbps) working toward accomplishing a single job for AI/ML, and that’s plenty!

Data transport

NFS over RDMA transport eliminates the inefficiency of intermediate memory copies. Your data can be copied directly between the system memory of the host and the server, which keeps the CPU out of the data path and greatly reduces CPU overhead. This capability is a major improvement in achieving high speeds with NFS.

Moreover, the technology works with GPU Direct Storage. Newer programming models like CUDA, developed by NVIDIA for parallel computing, can use NFS over RDMA natively. If your data scientists are working on CUDA-enabled programs, they likely don’t care where the data lives; it just gets siphoned into GPU memory from storage by using NFS over RDMA semantics. The mature NFS stack in ONTAP, a gold standard, was an early adopter of NFS over RDMA and checks all the boxes for GPU Direct Storage.
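On a Linux client, one way to confirm that a mount is really using NFSv4.1 over RDMA is to look at its options in `/proc/mounts`. The sketch below parses text in that format and checks the `vers` and `proto` options. The sample lines, addresses, and mount points are hypothetical, and the helper names `nfs_mount_options` and `uses_nfs41_over_rdma` are made up for this example.

```python
# Hypothetical sample in /proc/mounts format: device, mount point, type, options.
SAMPLE_MOUNTS = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
192.0.2.10:/dataset /mnt/dataset nfs4 rw,relatime,vers=4.1,proto=rdma,port=20049 0 0
192.0.2.20:/legacy /mnt/legacy nfs rw,relatime,vers=3,proto=tcp 0 0
"""


def nfs_mount_options(mounts_text, mount_point):
    """Return the options of the NFS mount at mount_point as a dict, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, target, fstype, options = fields[:4]
        if target == mount_point and fstype in ("nfs", "nfs4"):
            parsed = {}
            for opt in options.split(","):
                key, _, value = opt.partition("=")
                parsed[key] = value  # flags such as "rw" map to ""
            return parsed
    return None


def uses_nfs41_over_rdma(opts):
    """True if the parsed options show NFS version 4.1 with RDMA transport."""
    return opts is not None and opts.get("vers") == "4.1" and opts.get("proto") == "rdma"
```

On a real host, the text would come from reading `/proc/mounts` rather than from the sample string, for example `nfs_mount_options(open("/proc/mounts").read(), "/mnt/dataset")`.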

Data acceleration

Finally, the critical step in saturating the available bandwidth is the ability to run multiple NFS streams. The ONTAP implementation of pNFS and session trunking for RDMA mounts avoids nonuniform memory access (NUMA) penalties and aggregates bandwidth across all available interface cards on each node. The ability to dynamically create and expand I/O connections as the cluster scales out helps your AI/ML workloads keep up as more use cases are added.
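From the application side, multiple streams simply mean several reads in flight at once. The following sketch, which assumes a Unix-like host because it relies on `os.pread`, reads a file in fixed-size chunks from a thread pool. The function names, the chunk size, and the worker count are illustrative assumptions, not tuning advice, and nothing here is specific to ONTAP.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def _pread_full(fd, length, offset):
    """Read up to length bytes at offset, retrying on short reads."""
    buf = bytearray()
    while len(buf) < length:
        piece = os.pread(fd, length - len(buf), offset + len(buf))
        if not piece:  # end of file
            break
        buf += piece
    return bytes(buf)


def parallel_read(path, chunk_size=1 << 20, workers=8):
    """Read a whole file by issuing chunk reads concurrently."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offsets = range(0, size, chunk_size)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() yields results in offset order, so join() rebuilds the file.
            chunks = pool.map(lambda off: _pread_full(fd, chunk_size, off), offsets)
            return b"".join(chunks)
    finally:
        os.close(fd)
```

Each worker issues its own positional read, so the requests are independent of one another and the layers below are free to spread them over whatever connections and data paths the mount provides.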

Results

With FlexGroup volumes providing uniform distribution of data throughout the cluster and with transport removing inefficiencies in the transfer of data, ONTAP facilitates data throughput at near line rate. To demonstrate, we did some testing. An NVIDIA DGX A100 system enabled for GPU Direct Storage was attached to a NetApp AFF A800 cluster of four high-availability (HA) pairs. The configuration required zero updates to the host environment. The supporting blog post has more details. But the key takeaway is the ease of setup in achieving unparalleled speeds to run AI/ML workloads with NFS.

The preceding figure is a schematic of the test setup: an NVIDIA DGX A100 system connected with just two RoCE interfaces per ONTAP node. The achieved throughput is around 86% of the line rate, and the post-engineering analysis shows that each node still has room to scale even higher.
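To get a feel for what a percentage of line rate means in absolute terms, the arithmetic can be sketched as follows. The node count follows from four HA pairs of two nodes each, and two ports per node comes from the description above, but the per-port speed of 100Gbps is an assumption made for illustration, as is reading the line rate as the sum of the storage-side ports. The function names are hypothetical.

```python
def aggregate_line_rate_gbps(nodes, ports_per_node, port_gbps):
    """Theoretical ceiling: every port on every node running at full speed."""
    return nodes * ports_per_node * port_gbps


def achieved_gbps(line_rate_gbps, fraction_of_line_rate):
    """Throughput implied by a given fraction of the line rate."""
    return line_rate_gbps * fraction_of_line_rate


def gbps_to_gigabytes_per_second(gbps):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


# Assumed inputs: 4 HA pairs x 2 nodes, 2 ports per node, 100Gbps per port.
line_rate = aggregate_line_rate_gbps(nodes=8, ports_per_node=2, port_gbps=100)
achieved = achieved_gbps(line_rate, 0.86)
print(line_rate, achieved, gbps_to_gigabytes_per_second(achieved))
```

Swapping in the actual port speed of a given deployment gives its own ceiling; the helper only multiplies the inputs and makes no claim about measured results.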

Optimize your AI/ML workloads with pNFS and ONTAP!

A file system that can keep up with the scale of data and not compromise on data access is the most basic requirement of AI/ML workloads. But after the basic requirement has been met, the simplicity of the setup and the richness of the features define the AI/ML environment. Use of vendor-specific proprietary file systems can achieve high speeds, but the data in those storage systems will forever live in isolation. You need to build a general-purpose file system that meets or exceeds the requirements of AI/ML without sacrificing its ability to be accessed by nonproprietary systems.

ONTAP FlexGroup volumes are beyond feature rich; ONTAP is the only data management software with multiple cloud services. The newly introduced capabilities to support AI/ML workloads by using traditional access methods keep your data fluid and available across all platforms.

The new features of ONTAP also help disprove the negative claims about NFS for AI/ML workloads. NFS has many varieties, and its ability to handle large-scale data storage and access through pNFS trunking can keep AI/ML engines running and integrated into existing NFS environments.

Find out how you can enhance, speed up, and simplify your AI/ML workloads with ONTAP and pNFS. To get started, simply upgrade your ONTAP cluster to the latest release. For access to the latest release candidate, see support.netapp.com.

Srikanth Kaligotla

Srikanth Kaligotla is a Principal Engineer at NetApp with over 25 years of experience building technology products. At NetApp, he has led several initiatives in designing and building products, features, and solutions in file, block, and caching technologies. He is passionate about enabling modern workloads like Kafka and AI/ML on ONTAP and loves to read scientific journals in his spare time.
