
Apache Spark Workloads & NetApp AI

Rick Huang

With the recent boom in enterprise-level artificial intelligence (AI) adoption across a wide range of industry verticals, big data analytics has also advanced tremendously. Those developments are thanks to the amount of data that’s available, innovative and hybrid multicloud–based solutions that automate the data processing workflow, and techniques that process that data with modern computing power.

With those analytics advancements, businesses can extract value and insights from their data faster, more efficiently, and at a lower cost. This blog post explores modern analytics workloads in Apache Spark clusters with the NetApp® storage portfolio.

In Hybrid cloud solutions with Apache Spark and NetApp AI, I wrote about what Apache Spark is designed for, and what challenges it mitigates for customers who use Hadoop. In addition to being a fast analytics engine with machine learning (ML) libraries and enabling deep learning (DL) frameworks that function seamlessly with NetApp AI, Spark plays well with our modern data analytics portfolio. It works directly with the Hadoop Distributed File System (HDFS), NFS direct access, and object storage.
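To make that concrete, here is a minimal PySpark-style sketch of how one dataset can be read over any of those three access paths. The paths, host names, and mount points are hypothetical placeholders; only the URI scheme changes, and Spark resolves the storage layer from the scheme alone.

```python
# Hypothetical dataset locations -- only the URI scheme differs per backend.
SOURCES = {
    "hdfs": "hdfs://namenode:8020/warehouse/events",    # Hadoop Distributed File System
    "nfs": "file:///mnt/netapp_nfs/warehouse/events",   # NFS direct access via a local mount
    "object": "s3a://datalake/warehouse/events",        # S3-compatible object storage
}

def read_events(spark, backend):
    """Read the same Parquet dataset from whichever backend is selected.

    `spark` is an active SparkSession; Spark picks the storage connector
    (HDFS, local/NFS, or S3A) from the URI scheme in the path.
    """
    return spark.read.parquet(SOURCES[backend])
```

Because the application code is identical across backends, moving a workload from HDFS to an NFS mount or an object bucket is a configuration change rather than a rewrite.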


Before you decide to use Apache Spark with NetApp storage to overcome your large-scale distributed data processing and analytics challenges, you might need to answer questions such as:

  • Why would I use NetApp storage for Apache Spark analytics workloads?
  • What are the benefits of using NetApp storage with Apache Spark?
  • How can the NetApp DataOps Toolkit facilitate traditional or containerized workspace management, data manipulation, versioning of ML/DL data, code, and models, and inference server orchestration?
  • How does NetApp AI fit into my overall hybrid cloud architecture? How do I strike a balance among data pipeline and application workflow scalability, reliability, maintainability, simplicity, operability, and evolvability?
  • What NetApp storage controllers should I use to support my in-memory engine?

This three-part blog series can help you answer those questions.

The challenges for Apache Spark analytics workloads

We understand the challenges that you face in modern analytics. Our comprehension is based on our findings from many proof-of-concept (POC) studies with large-scale customers in various industries, such as financial services, retail, healthcare, life sciences, manufacturing, and automotive. Some of the challenges include:

  • Data pipeline design and storage protocol performance optimization. It can be difficult to choose one or a set of optimal protocols (object, iSCSI, NFS, and so on) for your workflows in heterogeneous environments. You have to account for things like unforeseen bottlenecks in demand, throughput, latency, and IOPS.
  • Direct-attached storage (DAS) versus shared storage. You want to modernize your analytics infrastructure from a server-based DAS setup to a shared data lake, with flexibility, connectivity, and synchronicity with your hybrid multicloud. 
  • Scaling of compute and storage independently. Because your compute servers are busy running analytics workloads and serving data, you can’t scale your servers and storage independently. It’s not feasible to continue adding servers and storage to keep up with your increasing data quantity; analytics workload complexity; and large-scale model training across your on-premises and cloud Spark clusters, application serving layer, and shared data lake. That approach can’t meet the criteria for low latency (less than hundreds of milliseconds) and high storage throughput (around 4GBps for sequential read queries; around 3.5GBps for Parquet file writes). 
  • Sharing of data with GPUs. Your data is locked up in local HDFS clusters, and you want to share it between multiple clusters and applications in a hybrid cloud. You want to be future-ready to use GPUs to speed up your modern analytics workflows for AI model training and inferencing. To meet those needs, NetApp technology supports NVIDIA GPUDirect Storage (GDS) and NFS over RDMA on NVIDIA DGX or equivalent systems.
  • Hadoop vendor lock-in. Hadoop distributors have their own distributions with proprietary versioning, which locks you in. But like many customers, you need support for analytics operations that’s not tied to specific Hadoop distributions. You need the freedom to change distributions without disrupting your modern analytics workloads.
  • Lack of support for more than one language. Being limited to just one language also limits your flexibility, hinders data organization and retrieval, and slows down time to production.
  • Complicated frameworks and tools. In addition to what’s mentioned in the linked blog post, NetApp AI has been validated by using TensorFlow, MXNet, PyTorch, and other frameworks. Validation testing was run on NVIDIA AI Enterprise (NVAIE), DGX systems powered by NetApp ONTAP® AI, Lenovo ThinkSystem, white box servers, managed private data centers, and all major public clouds.
  • Data lake migration to a modern hybrid multicloud architecture. Like many of our POC customers, perhaps you want to migrate your existing Hadoop-based data lake to a hybrid cloud solution, focusing on object storage as the native protocol for AI and analytics applications. You can start by taking a deeper dive into data lake migration.
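As a sketch of what that object-storage-native access looks like in practice, the snippet below collects the standard Hadoop S3A settings that point a Spark session at an on-premises, S3-compatible endpoint. The endpoint URL is a placeholder, and the helper function is a hypothetical convenience, not part of any NetApp or Spark API.

```python
# Hypothetical S3A settings for an on-premises, S3-compatible object store.
# The endpoint is a placeholder; credentials would normally come from a
# credential provider rather than being hard-coded here.
s3a_conf = {
    "spark.hadoop.fs.s3a.endpoint": "https://s3.datalake.example:10443",
    "spark.hadoop.fs.s3a.path.style.access": "true",
    "spark.hadoop.fs.s3a.connection.ssl.enabled": "true",
}

def apply_s3a_conf(builder, conf):
    """Fold the S3A settings into a SparkSession builder
    (e.g., SparkSession.builder) before calling getOrCreate()."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With the session configured this way, `s3a://` paths in read and write calls go straight to the object store, so the same Spark jobs can run against HDFS today and an object-based data lake after migration.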

We have uncovered and tackled the analytics hurdles to provide solutions that use Apache Spark with NetApp storage. Stay tuned for my next blog post, where I discuss analytics workloads with a NetApp storage solution in detail.

Rick Huang

Rick Huang is a Technical Marketing Engineer at NetApp AI Solutions. With prior experience in the ad-tech industry, first as a data engineer and later as a technical solutions consultant, his expertise includes healthcare IT, conversational AI, Apache Spark workflows, and NetApp AI Partner Network joint program development. Rick has published several technical reports since joining NetApp in 2019 and has presented at multiple GTCs, as well as at the NetApp Data Visionary Center, on AI and deep learning for various customers.
