
Apache Spark data pipeline with NetApp modern data lake solutions


Rick Huang

A modern enterprise data lake involves distributed storage systems for storing, managing, and analyzing large amounts of structured and unstructured data. To help manage this dispersed data, organizations demand tiering capabilities and a cost-effective storage infrastructure for their modern data lakes and analytics workflows. That’s why many companies are moving from Hadoop-based data lakes to object storage–based solutions. But with that move, the associated Hadoop databases and tables (like Hive and HBase) require data migration, and some even require format conversion.

Efficiently migrate and securely access and share distributed data

Many organizations have now realized that centralized Hive catalogs are a bottleneck, and decentralized, nonproprietary table formats such as Apache Iceberg are becoming popular in large-scale data platforms. According to a Predictions of Storage Vendors article published by StorageNewsletter, more enterprise data will be stored in open table formats in 2023.
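As a rough illustration of what moving a Hive table toward an open table format can involve, here is a minimal sketch that builds the Spark settings and the SQL statement for Apache Iceberg's `snapshot` procedure, which creates an Iceberg table over an existing table's files without altering the source. The helper names and the table names in the usage note are hypothetical, and the sketch assumes the Iceberg Spark runtime is already on the cluster's classpath; it is not a NetApp-specific procedure.

```python
from typing import Dict


def iceberg_session_conf(catalog_type: str = "hive") -> Dict[str, str]:
    """Spark settings that let the built-in session catalog handle Iceberg tables.

    Assumption: the Iceberg Spark runtime JAR is available to the cluster.
    """
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.spark_catalog":
            "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.type": catalog_type,
    }


def snapshot_statement(source_table: str, target_table: str) -> str:
    """Build a CALL statement for Iceberg's snapshot procedure.

    The table names are supplied by the caller and are purely illustrative.
    """
    return (
        "CALL spark_catalog.system.snapshot("
        f"'{source_table}', '{target_table}')"
    )
```

In a PySpark session, each key and value pair could be passed to `SparkSession.builder.config(key, value)`, and the statement returned by `snapshot_statement("db.sales", "db.sales_iceberg")` could be run with `spark.sql(...)` to try Iceberg reads before committing to a full migration.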

The same StorageNewsletter article also predicted a rising demand for simplified data access and sharing. In 2023, organizations will need to manage their scattered data wherever it exists. Data sharing across platforms will push organizations to develop and implement a strategy for managing and sharing distributed data across regions, organizations, clouds, and platforms.

To facilitate data lake migration and data sharing, NetApp provides several data migration tools, like the NetApp® BlueXP™ copy & sync service, XCP software, and Cloud Volumes ONTAP® software. These tools can help organizations like yours seamlessly move data between your on-premises and multicloud environments, without the need for any data conversions or application refactoring.

Easily move, manage, and protect huge amounts of unstructured data

Lay a solid foundation with consistency

Object storage is a highly scalable and durable technology that can support massive amounts of unstructured data. It’s a natural fit for modern data pipelines that require agility and flexibility to integrate with a wide variety of data sources and compute platforms, such as Apache Spark. To deliver the best of hybrid cloud, NetApp solutions let you move data seamlessly between your on-premises and multicloud environments without requiring any data conversions or application refactoring. With a continuous data management plane and a consistent operating model, NetApp technology provides the foundation for a modern data lake that supports the needs of today’s businesses.
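To make the Spark-to-object-storage integration a little more concrete, here is a minimal sketch of the Hadoop S3A settings that a Spark job typically needs to read from an S3-compatible object store. It assumes the store exposes an S3-compatible endpoint and that the `hadoop-aws` module is available to Spark; the helper names, endpoint URL, bucket, and credentials are hypothetical placeholders, not a documented NetApp configuration.

```python
from typing import Dict


def s3a_conf(endpoint: str, access_key: str, secret_key: str,
             use_tls: bool = True) -> Dict[str, str]:
    """Build Spark settings for the Hadoop S3A connector.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style access is commonly needed for non-AWS S3 endpoints.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_tls).lower(),
    }


def s3a_uri(bucket: str, key: str) -> str:
    """Return the s3a:// URI that Spark readers and writers accept."""
    return f"s3a://{bucket}/{key.lstrip('/')}"
```

With these settings applied through `SparkSession.builder.config(key, value)`, a call such as `spark.read.parquet(s3a_uri("lake", "raw/events"))` would read directly from the object store. In practice, credentials are better supplied through a credentials provider or environment configuration than hard-coded in a job.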

Manage multiple data pipelines with one control panel

With a NetApp analytics solution, you benefit from integrated security, data governance, and compliance tools with a single control panel for data and workflow management across distributed environments. And you optimize your TCO based on your consumption. You can design multiple data pipelines by incorporating popular real-time event streaming or batch-processing platforms or frameworks like Apache Kafka, Spark, and others. Your system can then handle high-throughput, low-latency data streams and batches for analytics, log aggregation, stream processing, data integration, messaging, and event sourcing. With a data fabric that’s powered by NetApp, you get features such as fault tolerance; scalability; and the ability to store, to analyze, and to synchronize large amounts of data.

Gain flexibility and protection for mixed environments

Freedom and flexibility are, and will remain, basic requirements of virtually every data management pipeline. In particular, you need data mobility solutions that are cloud-enabled and that support data migration, data replication, disaster recovery, ransomware protection, and data synchronization across mixed environments, including SSDs, HDDs, and cloud. You also need to eliminate data silos so that you can maximize your ROI for data pipelines and for production workload operations across your on-premises, public cloud, and private cloud Spark clusters.

Optimize performance and costs

Spot by NetApp provides several CloudOps tools to optimize the consumption costs of your cloud resources, and NetApp offers multiple TCO calculators for different use cases. With NetApp’s complete storage portfolio, both on-premises and cloud data lake storage run ONTAP software. You can use NetApp SnapMirror® technology to mirror whole volumes and FlexCache® technology to cache hot data from the data lake, on premises or in the cloud. As a result, you get hybrid cloud analytics, artificial intelligence for ITOps (AIOps), machine learning operations (MLOps), cloud tiering, and nondisruptive data lake migration capabilities.

For the best performance recommendations and other data-mover solutions, visit our Modern data analytics solutions page. For your serverless Spark applications that run on Kubernetes, check out Ocean for Apache Spark to automate cloud infrastructure and application management. Ocean also helps you optimize containerized application deployments for high performance, reliability, and cost-efficiency.

Use a proven solution for Spark analytics workloads

In my three-part blog mini-series about how to optimize your Apache Spark workloads with NetApp solutions, I explained the benefits and the depth and breadth of the NetApp modern analytics portfolio. And I backed it all up with results from Spark analytics workflow testing on NetApp storage. In those performance validation tests, which were based on industry-standard benchmarking tools and customer demand, the NetApp Spark solutions demonstrated superior performance relative to native Hadoop systems.

So, to sum up: By running Apache Spark modern analytics workloads with NetApp storage, you can easily meet your large-scale, distributed data processing, analytics, model training, fine-tuning, serving, and retraining requirements.

Continue learning

For more technical information about Apache Spark analytics and deep learning (DL) workloads with NetApp storage, check out NetApp storage solutions for Apache Spark: Architecture, use cases, and performance results.

Another in-depth resource is NetApp hybrid cloud data solutions—Spark and Hadoop based on customer use cases. This technical report covers backup of Hadoop data, backup and disaster recovery from the cloud to on-premises environments, development and testing on existing Hadoop data, data protection and multicloud connectivity, and analytics workflow acceleration.

This mini-series is part of my larger Apache Spark with NetApp AI blog series, which currently consists of the following:

Corporate site

Technical Community (might require login)

And there’s more interesting high-level and technical content to come!

Rick Huang

Rick Huang is a Technical Marketing Engineer at NetApp AI Solutions. He previously worked in the ad-tech industry as a data engineer and then as a technical solutions consultant. His expertise includes healthcare IT, conversational AI, Apache Spark workflows, and NetApp AI Partner Network joint program developments. Since joining NetApp in 2019, Rick has published several technical reports and has presented on AI and deep learning at multiple GTCs and, for various customers, at the NetApp Data Visionary Center.
