
Hybrid cloud solutions with Apache Spark and NetApp AI

Rick Huang

With the recent boom in enterprise AI and modern analytics adoption, hybrid multicloud solutions have advanced rapidly, driven by the volume of data available and by techniques for processing that data with modern computing power in distributed locations. In this blog post, we’ll explore hybrid cloud applications in distributed Spark clusters with the NetApp® storage portfolio.

Apache Spark is a programming framework for writing Hadoop applications that work directly with the Hadoop Distributed File System (HDFS) and other file systems, such as NFS and object storage. It’s a fast analytics engine with machine learning (ML) libraries that are designed for large-scale distributed data processing, and it functions seamlessly with our NetApp AI and modern data analytics portfolio. Its in-memory operations are more efficient than MapReduce for data pipelines, streaming, interactive analysis, and ML/DL algorithms. Apache Spark also mitigates the I/O operational challenges that you might experience with Hadoop.

The modern enterprise data center is a hybrid cloud that connects multiple distributed infrastructure environments

A modern enterprise data center is a hybrid cloud that connects multiple distributed infrastructure environments through a continuous data management plane with a consistent operating model, on premises and/or in multiple public clouds. To get the most out of a hybrid cloud, you must be able to seamlessly move data between your on-premises and multicloud environments without having to convert data or refactor applications.

Customers report that they start their hybrid cloud journey in one of these ways:

  • By moving secondary storage to the cloud for use cases such as data protection
  • By moving less business-critical workloads such as application development and DevOps to the cloud

They then move on to more critical workloads. Web and content hosting, DevOps and application development, databases, analytics, and containerized apps are among the most popular hybrid cloud workloads.

A well-designed hybrid cloud architecture also provides AIOps/MLOps capabilities for large, geographically distributed data science teams to prototype their various complex models and deploy them for production. Data/model catalogs are accessible almost instantaneously, and the architecture offers security, compliance, and governance. However, the complexity, cost, and risks of enterprise AI projects have historically hindered AI adoption from the experimental stage through production.

With a NetApp hybrid cloud solution, you benefit from integrated protection against distributed denial-of-service (DDoS) and ransomware attacks. Security, data governance, and compliance tools are available through a single control panel for data and workflow management across distributed environments. At the same time, you can optimize total cost of ownership (TCO) based on your consumption. The following figure shows an example solution in which a cloud service partner provides multicloud connectivity for an Internet of Things (IoT) big data analytics environment.

Figure: A cloud service partner providing multicloud connectivity for an IoT big data analytics environment.

The challenges for Apache Spark hybrid cloud workloads

In this example scenario, IoT data received in AWS from different sources is stored in a central location in a NetApp storage controller array (NAS, SAN, NetApp AFF) hosted by Equinix. The central storage cluster is connected to Spark or Hadoop/Direct NFS clusters in AWS and Azure, enabling big data analytics and AI applications to run in multiple clouds and access the same data. The main requirements and challenges for this use case include the following: 

  • Customers want to run analytics and AI jobs on the same data by using multiple clouds.
  • Data must be received from different sources such as on-premises and cloud environments through different sensors and hubs.
  • Data preprocessing and anonymization within the pipeline must be fast and automatic due to the sheer amount of data every hour.
  • The solution must be efficient and cost effective.
  • The main challenge is to build a cost-effective and efficient solution that delivers hybrid analytics and AI/ML/DL services among different on-premises and cloud environments.
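As a rough illustration of the preprocessing and anonymization requirement, here is a minimal sketch of one common approach: replacing identifying fields with keyed hashes before records move further down the pipeline. The field names (`device_id`, `owner_email`), the record layout, and the key handling are assumptions made for this example; they are not taken from the NetApp solution, and a production pipeline would run equivalent logic inside Spark rather than in plain Python.

```python
import hashlib
import hmac

# Hypothetical field names: the source does not specify a record schema.
IDENTIFYING_FIELDS = ("device_id", "owner_email")


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of the value as a 64-character hex string."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict, secret_key: bytes) -> dict:
    """Copy a record and replace its identifying fields with stable pseudonyms."""
    cleaned = dict(record)  # shallow copy, so the input record is left untouched
    for field in IDENTIFYING_FIELDS:
        if cleaned.get(field) is not None:
            cleaned[field] = pseudonymize(str(cleaned[field]), secret_key)
    return cleaned


def anonymize_batch(records, secret_key: bytes):
    """Lazily anonymize a stream of records, one record at a time."""
    for record in records:
        yield anonymize_record(record, secret_key)


if __name__ == "__main__":
    key = b"example-key-for-illustration-only"
    batch = [{"device_id": "sensor-17", "owner_email": "a@example.com", "temp_c": 21.5}]
    for row in anonymize_batch(batch, key):
        print(row)
```

Because the digest is keyed and deterministic, the same sensor maps to the same pseudonym in every cloud, so analytics jobs can still group by device without seeing the raw identifier.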

We used these requirements and challenges to define the deliverables for solutions that run Apache Spark workloads on NetApp storage in hybrid cloud environments.

The benefits of Apache Spark workloads with NetApp storage solutions in a hybrid cloud

Our data protection and multicloud connectivity solution resolves the challenge of having cloud AI applications across multiple hyperscalers. As shown in the previous figure, data from sensors is streamed and ingested into the AWS Spark cluster through Kafka. The data is stored in an NFS share that’s located outside the cloud provider within an Equinix data center.
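To make the ingest step more concrete, the following sketch shows one way a consumer could land streamed sensor messages on a shared NFS mount as small batch files. It is a simplified stand-in, not the solution's implementation: the `messages` iterable takes the place of a Kafka consumer, the directory path stands in for the NFS share, and the file naming scheme and `write_micro_batches` helper are hypothetical.

```python
import json
import os
import tempfile
from itertools import islice
from pathlib import Path


def write_micro_batches(messages, share_dir, batch_size=1000):
    """Write messages as JSON Lines files under share_dir, one file per batch.

    Each file is written under a temporary name and then renamed, so readers
    that list the directory do not pick up a half-written batch.
    """
    share = Path(share_dir)
    share.mkdir(parents=True, exist_ok=True)
    iterator = iter(messages)
    written = []
    index = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        final_path = share / f"batch-{index:06d}.jsonl"
        tmp_path = share / f".batch-{index:06d}.jsonl.tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            for message in batch:
                handle.write(json.dumps(message) + "\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the data to storage before the rename
        os.replace(tmp_path, final_path)
        written.append(final_path)
        index += 1
    return written


if __name__ == "__main__":
    # A temporary directory stands in for the NFS mount point in this sketch.
    demo_dir = tempfile.mkdtemp()
    readings = ({"sensor": i, "temp_c": 20 + i} for i in range(5))
    for path in write_micro_batches(readings, demo_dir, batch_size=2):
        print(path.name)
```

Spark clusters in either cloud could then read the completed `batch-*.jsonl` files from the same share, which is the access pattern the figure describes.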

NetApp BlueXP classification, also known as NetApp Cloud Data Sense, is an AI-driven toolkit to automatically scan, analyze, and categorize your data. It then performs obfuscation or anonymization as necessary across your entire data estate, including file storage, object storage, and databases on premises and in the cloud. It offers file information such as type, size, time attributes, ownership, and user/group permissions, and it analyzes your data to identify personally identifiable information (PII) such as email, credit card number, international bank account number (IBAN), national IDs, IP address, passwords, ethnicity reference, religious beliefs, and civil law reference.

The result is a classification of PII into sensitivity levels (standard, personal, sensitive-personal) and categories such as HR, legal, marketing, sales, and finance. Furthermore, you obtain out-of-the-box governance and privacy insights. You can then customize those insights and take actions. To learn more about BlueXP classification, visit this page and get started.
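To give a feel for what this kind of classification involves, here is a deliberately simplified sketch that scans text for a few of the PII types listed above and assigns one of the three sensitivity levels. It does not reflect how BlueXP classification works internally: the regular expressions, the `SENSITIVE_TERMS` keyword list, and the mapping from findings to levels are all assumptions made for illustration.

```python
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

# Hypothetical keyword list standing in for sensitive-personal detection.
SENSITIVE_TERMS = ("religion", "ethnicity")


def luhn_valid(number: str) -> bool:
    """Check a candidate card number with the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def classify_text(text: str) -> dict:
    """Return the PII findings in text and an assumed sensitivity level."""
    findings = {
        "email": EMAIL_RE.findall(text),
        "ip_address": IPV4_RE.findall(text),
        "credit_card": [
            match.strip() for match in CARD_RE.findall(text) if luhn_valid(match)
        ],
    }
    lowered = text.lower()
    if any(term in lowered for term in SENSITIVE_TERMS):
        level = "sensitive-personal"
    elif any(findings.values()):
        level = "personal"
    else:
        level = "standard"
    return {"level": level, "findings": {k: v for k, v in findings.items() if v}}


if __name__ == "__main__":
    sample = "Contact a@example.com from 10.0.0.5 card 4111 1111 1111 1111"
    print(classify_text(sample))
```

The Luhn check is there to cut false positives: many long digit strings look like card numbers, but only those that pass the checksum are reported.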

Because the central NetApp storage cluster is connected to both AWS and Microsoft Azure, you can use NetApp NFS Direct Access with BlueXP copy and sync to read and write the data from both the AWS and Azure Spark clusters. In addition, because both on-premises and cloud storage (Azure NetApp Files and Amazon FSx) run NetApp ONTAP® software, the NetApp SnapMirror® feature can mirror the data in your cloud storage to the on-premises cluster, providing hybrid cloud AIOps/MLOps capabilities across on-premises environments and multiple clouds.

For performance recommendations and other data mover solutions, refer to our NetApp Storage Solutions for Apache Spark and Modern Data Analytics Solutions documentation. For more technical information about Apache Spark storage tiering and DL workloads with NetApp storage solutions, check out TR-4570: NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results. See also TR-4657: NetApp hybrid cloud data solutions—Spark and Hadoop based on customer use cases.

TR-4657 provides information about backing up Hadoop data, backup and disaster recovery from the cloud to on premises, enabling DevTest on existing Hadoop data, data protection and multicloud connectivity, and accelerating analytics workloads.

In a future blog post, we’ll address the important questions that customers face when deciding to use Apache Spark workloads with NetApp storage for large-scale distributed data processing and for deep learning in natural language processing (NLP) applications. Stay tuned!

Rick Huang

Rick Huang is a Technical Marketing Engineer at NetApp AI Solutions. He previously worked in the ad-tech industry as a data engineer and then as a technical solutions consultant. His expertise includes healthcare IT, conversational AI, Apache Spark workflows, and NetApp AI Partner Network joint program developments. Since joining NetApp in 2019, Rick has published several technical reports and has presented on AI and deep learning at multiple GTCs and, for various customers, at the NetApp Data Visionary Center.
