Parallel File Systems – Versatile Storage for AI and Big Data

Joe McCormick

February 11, 2021

While writing this article I found myself reflecting on the state of IT when I started at NetApp in 2012. The world was a different place. GCE and Azure were facing an uphill battle against incumbent AWS, we were talking about OpenStack for private cloud, and Siri was still in beta. However, even though 2012 is an anagram of 2021, I think we can all agree that the entire world has been fundamentally reshaped in ways we could never have envisioned back then.

Amid this never-ending churn, I found myself drawn to identifying and solving the biggest, baddest storage problems I could find. There’s no shortage of those, and the general availability of artificial intelligence presents the latest opportunity for companies to reinvent themselves and more efficiently use the goldmine of data they find themselves sitting on.

Having been well entrenched in helping solve traditional high-performance computing challenges around storage, I already knew that BeeGFS solves big data problems. But what about large-scale analytics and AI specifically?

We can roughly divide organizations running "HPC workloads" into two broad categories:

Traditional high-performance computing (HPC). In this category, think national labs, research institutes, pretty much anyone who runs what would traditionally be thought of as a "supercomputer."
Enterprise HPC. This category has seen much more growth in recent years as large-scale analytics and AI become more prevalent and organizations find value in wielding the power of their data, aided by these techniques.

Traditional HPC has been helping mature these techniques for decades, so it’s valuable to examine the tried-and-true infrastructure they've used to achieve success. On the other hand, IT has been fundamentally reshaped over the last decade by the shift to cloud, and modern large-scale analytics and AI have been chiefly born in a cloud-native world. Success when developing infrastructure to support these modern techniques lies in finding the right combination of traditional and cloud-native technologies.

Enter parallel file systems: A versatile storage option for large-scale analytics and AI

For decades, parallel file systems have been crucial in building storage environments that are capable of keeping up with the most demanding supercomputers. From storing massive, mission-critical datasets, to providing high-speed scratch space, to storing the results of long-running and expensive-to-reproduce computations, parallel file systems are vital. Furthermore, the pricing model is typically designed for scale, helping to ensure that they fit within constrained research budgets. Sound anything like large-scale analytics and AI requirements?

In fact, parallel file systems are a good fit for use throughout data pipelines, whether you're coming from an HPC or an enterprise background. Here are some examples:

Cost-effective storage behind analytics platforms like Hadoop and Spark.
Ability to separate compute from storage to allow efficient scalability.
Scratch space for data preparation and preprocessing tasks with I/O patterns that are difficult to manage, such as supporting extract, transform, load (ETL) workflows.
Storing and retrieving training datasets, especially when you need to keep a large number of GPUs busy.
High-speed ingest, inferencing, and real-time analytics where low latency and high bandwidth are critical.

Clearly, parallel file systems are a powerful tool that can bring relief to storage pain points. What may be less well known is that BeeGFS can be easily tested in small proof-of-concept (POC) environments to prove out the performance and value. Many organizations start small with pilots and POCs; understandably, they’re hesitant to invest heavily until these approaches have proven their worth. A POC deployment can scale its storage as necessary for larger testing or for the move to production. In production, BeeGFS lets you grow capacity and performance as your data and consumption grow.

We'll get deeper into this subject in the next section. To sum up the discussion so far, parallel file systems are a good choice for large-scale analytics and AI data challenges because they combine flexibility with high performance.

BeeGFS explained: Why BeeGFS

Some parallel file systems have traditionally been seen as complex to deploy and manage, but BeeGFS has been designed from the beginning for flexibility and simplicity. That, along with its cost-effective high performance, is why we choose to offer support alongside the NetApp® E-Series storage systems. NetApp provides the full package: storage that delivers reliability, easy scalability, and industry-leading price/performance based on a proven 25+ year architecture.

Shameless plugs aside, why does BeeGFS specifically make sense for large-scale analytics and AI? Let me paint you a performance picture.

You want to provide space for your end users (possibly data scientists in this context) to store a bunch of arbitrary data, somewhere. We'll call that somewhere a storage namespace. Ideally when users are ready to train, they don't have to copy this data to each of the GPU nodes. Copying data adds extra time to the workflow, and the dataset may exceed the node’s internal storage capacity anyway. So the storage namespace must be able to serve the same files to many GPU nodes, regardless of whether the files are very large or very small.

The design of some storage solutions requires serving the same file from a single storage node. That quickly becomes painful when dealing with tens, hundreds, or even thousands of compute or GPU nodes reading the same file, especially if it’s large. To avoid a single overworked piece of hardware, your ideal storage namespace should stripe each file across multiple storage nodes. But what about the other end of the spectrum, where users need to work with a large number of small files? Ideally the storage namespace stores information about files and directories (metadata) separate from file contents, so looking up a bunch of little files doesn't place extra strain on the storage nodes designed to stream the distributed file contents. To avoid bottlenecks, this metadata should also be distributed across multiple nodes. This distribution also needs to be intelligent enough that one node doesn't end up owning the entirety of a massive directory tree, defeating the purpose of a "distributed file system."
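To make the striping and metadata-distribution ideas above concrete, here is a minimal sketch in Python. It is not BeeGFS's actual placement logic. The round-robin chunk layout, the hash-based directory placement, and all of the names (stripe_layout, bytes_per_target, metadata_node_for_dir) are assumptions chosen purely for illustration.

```python
import hashlib


def stripe_layout(file_size: int, chunk_size: int, num_targets: int) -> list[tuple[int, int, int]]:
    """Split a file into chunks and assign them round-robin to storage targets.

    Returns a list of (offset, length, target_index) tuples.
    Illustrative only: a real parallel file system has its own layout rules.
    """
    if chunk_size <= 0 or num_targets <= 0:
        raise ValueError("chunk_size and num_targets must be positive")
    layout = []
    offset = 0
    chunk_index = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        layout.append((offset, length, chunk_index % num_targets))
        offset += length
        chunk_index += 1
    return layout


def bytes_per_target(layout: list[tuple[int, int, int]], num_targets: int) -> list[int]:
    """Sum how many bytes of the file land on each storage target."""
    totals = [0] * num_targets
    for _offset, length, target in layout:
        totals[target] += length
    return totals


def metadata_node_for_dir(dir_path: str, num_metadata_nodes: int) -> int:
    """Pick a metadata node for a directory by hashing its path.

    Hypothetical placement rule: hashing spreads directories across nodes so
    that no single node owns an entire tree.
    """
    if num_metadata_nodes <= 0:
        raise ValueError("num_metadata_nodes must be positive")
    digest = hashlib.sha256(dir_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_metadata_nodes


if __name__ == "__main__":
    # A 10-byte file, 4-byte chunks, 2 storage targets.
    layout = stripe_layout(file_size=10, chunk_size=4, num_targets=2)
    print(layout)                        # [(0, 4, 0), (4, 4, 1), (8, 2, 0)]
    print(bytes_per_target(layout, 2))   # [6, 4]
    print(metadata_node_for_dir("/datasets/images/train", 4))
```

Because consecutive chunks land on different targets, many clients reading the same large file spread their requests across all targets instead of hitting one node. Metadata lookups go to a separate set of nodes.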

These sorts of challenges are essentially what parallel file systems were designed to overcome, and BeeGFS solves them all elegantly. The design of BeeGFS delivers a storage namespace that can flexibly adapt to meet evolving storage requirements. As your compute and GPU footprint grows, you can scale the performance and overall capacity of your storage namespace to match. If concurrent file access becomes a bottleneck, or the number of files and directories you need in the storage namespace increases, those can be expanded independently.

In particular, BeeGFS excels at data processing use cases, for example where you need to take one large image file and break it into hundreds or thousands of smaller files. Because BeeGFS is POSIX compliant, the resulting files are natively accessible to machine learning and deep learning frameworks, without requiring additional data movement or expensive code changes. In short, BeeGFS provides flexibility that is key to ensuring that your storage namespace is right-sized to fit your uniquely evolving large-scale analytics and AI data requirements, while ensuring that data is accessible without requiring more work.
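As a rough illustration of the "one large file into many small files" pattern, the sketch below uses nothing but standard file I/O, which is the point of a POSIX interface: the same code would run against a local disk or a mounted parallel file system. The function name split_file, the ".part" naming scheme, and the example paths are hypothetical. Real image tiling would cut along pixel boundaries rather than raw byte ranges.

```python
from pathlib import Path


def split_file(src: Path, out_dir: Path, piece_size: int) -> list[Path]:
    """Split src into consecutive pieces of at most piece_size bytes.

    Pieces are written to out_dir as <stem>_<index>.part and returned in order.
    Uses only ordinary open/read/write calls, so no storage-specific API is needed.
    """
    if piece_size <= 0:
        raise ValueError("piece_size must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    pieces = []
    with src.open("rb") as f:
        index = 0
        while True:
            data = f.read(piece_size)
            if not data:  # end of file reached
                break
            piece = out_dir / f"{src.stem}_{index:06d}.part"
            piece.write_bytes(data)
            pieces.append(piece)
            index += 1
    return pieces


if __name__ == "__main__":
    # Hypothetical paths: replace with a file and directory in your own namespace.
    source = Path("large_image.bin")
    if source.exists():
        parts = split_file(source, Path("pieces"), piece_size=1024 * 1024)
        print(f"wrote {len(parts)} pieces")
```

A training framework can then open the resulting pieces with its usual file readers, with no copy step in between.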

Conclusions

I was thinking about calling this article "Why parallel file systems like BeeGFS go beyond HPC and should be regularly considered for large-scale analytics and AI in enterprise." But my editors say that concise titles are better for SEO. However, I do hope that is your main takeaway from this blog post. I see the characteristics of parallel file systems like BeeGFS continuing to make them valuable tools as organizations tackle these types of initiatives. And I'd hate to see that value overlooked in a sea of flashy alternatives with exaggerated claims to greatness. Of course, to make BeeGFS fit in a cloud-native world, it needs to support cloud-native platforms like Kubernetes. Stay tuned for future developments.

Let me close by trying to sum up this blog post in a sentence: "BeeGFS is a viable option for large-scale analytics and AI when you need to meet or exceed the speed of NAS, be able to scale like object storage, and want to spend more on GPUs than on storage."

I'd love to hear about your challenges around data. If you want to continue the conversation, drop me a line at joe.mccormick@netapp.com.

Joe McCormick

Joe McCormick is a software engineer at NetApp with over ten years of experience in the IT industry. With nearly seven years at NetApp, Joe's current focus is developing high-performance computing solutions around E-Series. Joe is also a big proponent of automation, believing if you've done it once, why are you doing it again.
