Apache Spark Workloads: Analytics

Rick Huang

NetApp has four storage portfolios: the NetApp® AFF and FAS, E-Series, StorageGRID®, and Cloud Volumes ONTAP® offerings. NetApp has validated the AFF and E-Series systems with ONTAP-based storage for Hadoop solutions with Apache Spark.

In my earlier post, Hybrid cloud solutions with Apache Spark and NetApp AI, I examined a data fabric powered by NetApp that integrates data management services and applications (building blocks) for data access, control, protection, and security. In this blog post, let’s focus on two different on-premises Apache Spark cluster configurations, as shown in the following figure.

Apache Spark Cluster

Key solution components and benefits

The building blocks in the preceding figure include NetApp NFS direct access:

  • This capability gives the latest Hadoop and Spark clusters direct access to NetApp NFS volumes without added software or driver requirements. You can run big data analytics jobs on your existing or new NFS data without moving or copying it. NetApp NFS direct access avoids keeping multiple copies of data and eliminates the need to synchronize with a source.
  • In the first deployment option, NetApp NFS direct access replaces the Hadoop Distributed File System (HDFS) with NFS storage as the default file system, enabling direct analytics operations on NFS data.
  • In the second deployment option, NetApp NFS direct access supports the configuration of NFS as added storage along with HDFS in a single Hadoop/Spark cluster. In this case, you can share data through NFS exports and access it from the same cluster along with HDFS data.
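The two deployment options can be pictured as two sets of file system settings handed to Spark. The following is a minimal sketch, assuming the NFS export is already mounted at the same path on every node so that it can be addressed with the `file://` scheme. The mount point `/mnt/netapp_nfs`, the NameNode address, and the helper names `spark_fs_conf` and `nfs_path` are hypothetical and do not come from NetApp documentation. The only real settings used are the standard Hadoop property `fs.defaultFS` and Spark's `spark.hadoop.` prefix for forwarding Hadoop properties.

```python
# Sketch only: the mount point and NameNode address are hypothetical.
NFS_MOUNT = "/mnt/netapp_nfs"                   # assumed identical on every node
HDFS_URI = "hdfs://namenode.example.com:8020"   # placeholder NameNode address


def spark_fs_conf(option: int) -> dict:
    """Return Spark settings for deployment option 1 or 2."""
    if option == 1:
        # Option 1: NFS (seen as a local mount) replaces HDFS as the default.
        return {"spark.hadoop.fs.defaultFS": "file:///"}
    if option == 2:
        # Option 2: HDFS stays the default; NFS is added storage that jobs
        # reach through fully qualified file:// URIs.
        return {"spark.hadoop.fs.defaultFS": HDFS_URI}
    raise ValueError("option must be 1 or 2")


def nfs_path(relative: str) -> str:
    """Fully qualified URI for data on the NFS mount (valid in both options)."""
    return "file://" + NFS_MOUNT + "/" + relative.lstrip("/")


if __name__ == "__main__":
    for option in (1, 2):
        print(option, spark_fs_conf(option), nfs_path("datasets/input"))
```

Each key and value could then be passed to `SparkSession.builder.config(key, value)`. In option 2, a job can read HDFS data through unqualified paths and NFS data through the `file://` URIs in the same session.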

Both configuration 1 and configuration 2 offer:

  • NetApp Snapshot, SnapMirror®, FlexCache®, and other innovative technologies. You get data protection, disaster recovery, and ransomware-attack mitigation capabilities in your on-premises clusters with both configurations.
  • More efficient storage and less server replication. For example, the NetApp E-Series solution for Hadoop requires two rather than three replicas of the data, and the FAS solution for Hadoop requires a data source but no replication or copies of data. NetApp storage solutions also produce less server-to-server traffic.
  • Better Hadoop job and cluster behavior during drive and node failure.
  • Better data-ingest performance.
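To make the replication point concrete, here is a rough sketch of the raw capacity consumed by one data set under the three replica counts mentioned above. The 100TB data set size is a hypothetical example, and the calculation deliberately ignores RAID or parity overhead, compression, and deduplication, so treat it as an illustration of the ratio rather than a sizing tool.

```python
def raw_capacity_tb(dataset_tb: float, copies: int) -> float:
    """Raw capacity consumed when every block is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return dataset_tb * copies


DATASET_TB = 100.0  # hypothetical data set size

SCENARIOS = [
    ("HDFS with three replicas", 3),
    ("E-Series solution, two replicas", 2),
    ("FAS solution, single copy", 1),
]

if __name__ == "__main__":
    for label, copies in SCENARIOS:
        print(f"{label}: {raw_capacity_tb(DATASET_TB, copies):.0f}TB raw")
```

Fewer replicas also mean fewer block copies shipped between servers on every write, which is where the reduction in server-to-server traffic comes from.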

As key benefits, NetApp NFS direct access:

  • Is certified with the Hortonworks Data Platform (Hortonworks has since merged with Cloudera).
  • Enables hybrid cloud data analytics deployments, surge bursting, and seamless integration with your existing data lakes and workflows.
  • Provides enterprise data protection, simple containerized workspace and inference server provisioning, and code-to-data-to-analytics versioning by using the rich data management capabilities of NetApp ONTAP and the NetApp DataOps Toolkit.
  • Offers many other advantages, as mentioned in my blog post Deep learning with Apache Spark and NetApp AI—Horovod distributed training.

Spark analytics workflow testing on NetApp storage

We ran Spark analytics workflows on a NetApp AFF A800 all-flash storage system running NetApp ONTAP software with NFS direct access. As an example, we tested the Apache Spark workflows by using TeraGen and TeraSort on ONTAP, AFF, E-Series, and NFS direct access configurations, compared with local storage and HDFS.

TeraGen and TeraSort are two commonly used benchmarking tools in the Hadoop ecosystem. TeraGen is used to generate large amounts of test data, and TeraSort is used to sort that data. These tools are often used to measure the performance of Hadoop clusters and storage systems. By generating large amounts of data and sorting it, we were able to measure the speed and efficiency of the NetApp solution compared with HDFS and other storage protocols or locations.
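As a toy illustration of what these tools do, the sketch below generates fixed-size records, sorts them by key, and validates the order. It assumes the standard TeraSort record format of 100 bytes with a 10-byte key, and it uses random bytes in place of TeraGen's real record layout. The function names are hypothetical, and the code runs in a single process rather than on a cluster.

```python
import os

RECORD_BYTES = 100  # size of one TeraGen record
KEY_BYTES = 10      # leading bytes of each record form the sort key


def rows_for(total_bytes: int) -> int:
    """Number of 100-byte records needed for a data set of total_bytes."""
    return total_bytes // RECORD_BYTES


def generate(rows: int) -> list:
    """Toy stand-in for TeraGen: produce `rows` random 100-byte records."""
    return [os.urandom(RECORD_BYTES) for _ in range(rows)]


def sort_records(records: list) -> list:
    """Toy stand-in for TeraSort: order records by their 10-byte key."""
    return sorted(records, key=lambda record: record[:KEY_BYTES])


def is_sorted(records: list) -> bool:
    """Toy stand-in for TeraValidate: keys must be in non-decreasing order."""
    return all(a[:KEY_BYTES] <= b[:KEY_BYTES]
               for a, b in zip(records, records[1:]))


if __name__ == "__main__":
    print("records in 1TB (10**12 bytes):", rows_for(10**12))
    print("sorted correctly:", is_sorted(sort_records(generate(10_000))))
```

At benchmark scale the same three steps are spread across the cluster, so every record is written once, read and shuffled over the network, and written again, which is why storage and network throughput dominate the run time.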

Common bottlenecks when running TeraGen and TeraSort include CPU and memory, network speed, and disk I/O limitations. These bottlenecks can affect the overall run time and performance of the system. It is therefore important to have a storage solution that can handle high I/O loads and can scale to meet the demands of large-scale data analytics workloads.

Our storage solution uses a shared data lake that serves large data compute server farms simultaneously. We based this solution on ONTAP for easy management and better performance; NetApp NFS direct access for fast, reliable, and secure access to NFS data; and HDFS for access to distributed, low-cost storage. In recent years, customers have been modernizing their data lakes with tiering capabilities and containerizing their applications to run in Kubernetes, OpenShift, or other container orchestration platforms. These enhancements further increase the effectiveness of NetApp storage solutions for analytics workloads.

In the NetApp Tech OnTap® blog Deep Learning with Apache Spark and NetApp AI (1) – Financial Sentiment Analysis results, we generated tables comparing sentiment analysis results by using PySpark Python scripts. We also recorded the run times with the data and models stored in different underlying file systems. To compare run-time performance across file systems, we executed variations of financial sentiment analysis workloads on quarterly earnings call transcripts from the Nasdaq Top 10 companies. We also made the same run-time comparisons for other use cases, such as Horovod distributed training and click-through rate (CTR) prediction by using Keras.
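A run-time comparison of this kind can be sketched as a small harness that runs the same workload once per storage location and records the wall-clock time. This is not the script used for the published results. The storage paths, the NameNode address, and the `time_workload` helper are hypothetical, and the workload is passed in as a plain callable so that the harness itself does not depend on Spark.

```python
import time
from typing import Callable, Dict


def time_workload(workload: Callable[[str], object],
                  locations: Dict[str, str]) -> Dict[str, float]:
    """Run workload(path) once per location; return seconds keyed by label."""
    results = {}
    for label, path in locations.items():
        start = time.perf_counter()
        workload(path)
        results[label] = time.perf_counter() - start
    return results


# Hypothetical locations; replace them with paths that exist in your cluster.
LOCATIONS = {
    "local": "file:///tmp/transcripts",
    "nfs": "file:///mnt/netapp_nfs/transcripts",
    "hdfs": "hdfs://namenode.example.com:8020/transcripts",
}

if __name__ == "__main__":
    # Placeholder workload so that the sketch runs without a cluster.
    timings = time_workload(lambda path: len(path), LOCATIONS)
    for label, seconds in timings.items():
        print(f"{label}: {seconds:.6f} s")
```

In a real comparison, the callable might wrap a PySpark action such as `spark.read.text(path).count()`. Running each location several times and comparing medians would reduce the effect of caching and warm-up.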

You can review the complete portfolio of NetApp Apache Spark/Hadoop storage positioning and analytics performance numbers (network/cluster throughput, cluster/storage CPU percentage, and so on). Check out Apache Spark workload with NetApp storage solution and NetApp storage solutions for Apache Spark, which includes architecture, use cases, and performance results.

Spark analytics workflow run-time comparison

The following figure compares the run time of two typical analytics operations with different storage controller clusters. The top bar is for sorting, and the bottom one is for data generation. The Spark cluster with a NetApp AFF A800 system has four compute worker nodes, and the cluster with E-Series storage has eight. We performed this test primarily to compare the performance of SSDs and HDDs.

Sort data

To summarize the run-time results:

  • The baseline E-Series configuration used 8 compute nodes and 96 NL-SAS (HDD) drives. This configuration generated 1TB of data in 4 minutes and 38 seconds. For details on the cluster and storage configuration, see NetApp E-Series Solution for Hadoop.
  • By using TeraGen, the all-flash AFF SSD configuration generated 1TB of data 15.66 times faster than the NL-SAS configuration did. Moreover, the SSD configuration used half the number of compute nodes and a quarter the number of drives (a total of 24 SSDs).
  • By using TeraSort, the AFF SSD configuration sorted 1TB of data 1,138.36 times more quickly than the NL-SAS configuration did. Again, the SSD configuration used half the number of compute nodes and a quarter the number of drives. Therefore, per drive, it was approximately 3 times faster than the NL-SAS (HDD) configuration.
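The node and drive ratios quoted above, and the TeraGen run time implied by the reported speedup, can be checked with simple arithmetic. The sketch below uses only figures stated in this post. The AFF run time it prints is derived from the reported ratio and is not a separately measured value. The post gives no baseline TeraSort time, so the same conversion is not possible for sorting.

```python
# Figures as reported in the summary above.
BASELINE_TERAGEN_S = 4 * 60 + 38   # E-Series: 4 minutes 38 seconds for 1TB
TERAGEN_SPEEDUP = 15.66            # AFF compared with E-Series

ESERIES = {"nodes": 8, "drives": 96}   # NL-SAS (HDD) configuration
AFF = {"nodes": 4, "drives": 24}       # all-flash (SSD) configuration

# Derived values (not measurements).
implied_aff_teragen_s = BASELINE_TERAGEN_S / TERAGEN_SPEEDUP
node_ratio = AFF["nodes"] / ESERIES["nodes"]
drive_ratio = AFF["drives"] / ESERIES["drives"]

if __name__ == "__main__":
    print(f"E-Series TeraGen time: {BASELINE_TERAGEN_S} s")
    print(f"Implied AFF TeraGen time: {implied_aff_teragen_s:.2f} s")
    print(f"Compute nodes, AFF vs. E-Series: {node_ratio:.2f}")
    print(f"Drives, AFF vs. E-Series: {drive_ratio:.2f}")
```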

Enhanced performance, scale, and efficiency

The takeaway is that transitioning from spinning disks to all-flash storage improves performance. In our testing, the number of compute nodes was not the bottleneck. With NetApp all-flash storage, run-time performance scales well. With NFS, the data behaves as if it were pooled in one place, which can reduce the number of required compute nodes, depending on your workload. In addition, if you use Apache Spark clusters, you do not have to manually rebalance data when you change the number of compute nodes.

In the final part of this blog series, I discuss how NetApp storage helps you meet requirements for large-scale, distributed data processing, analytics, model training, fine-tuning, serving, and retraining for Apache Spark workloads. Stay tuned! Check out our Modern Data Analytics Solutions for more use cases, best practices, and technical details.

Rick Huang

Rick Huang is a Technical Marketing Engineer at NetApp AI Solutions. With prior experience in the ad-tech industry as a data engineer and then as a technical solutions consultant, he has expertise in healthcare IT, conversational AI, Apache Spark workflows, and NetApp AI Partner Network joint program developments. Rick has published several technical reports since joining NetApp in 2019 and has presented on AI and deep learning at multiple GTCs and, for various customers, at the NetApp Data Visionary Center.
