NetApp has four storage portfolios: the NetApp® AFF and FAS, E-Series, StorageGRID®, and Cloud Volumes ONTAP® offerings. NetApp has validated the AFF and E-Series systems with ONTAP-based storage for Hadoop solutions with Apache Spark.
In Hybrid cloud solutions with Apache Spark and NetApp AI, I examined a data fabric powered by NetApp that integrates data management services and applications (building blocks) for data access, control, protection, and security. In this blog post, let’s focus on two different on-premises Apache Spark cluster configurations, as shown in the following figure.
The building blocks in the preceding figure include NetApp NFS direct access.
Both configuration 1 and configuration 2 offer the same set of capabilities.
NetApp NFS direct access delivers several key benefits.
We ran Spark analytics workflows on a NetApp AFF A800 all-flash storage system running NetApp ONTAP software with NFS direct access. For example, we tested Apache Spark workflows by using TeraGen and TeraSort on ONTAP AFF and E-Series systems with NFS direct access versus local storage with HDFS.
TeraGen and TeraSort are two commonly used benchmarking tools in the Hadoop ecosystem. TeraGen is used to generate large amounts of test data, and TeraSort is used to sort that data. These tools are often used to measure the performance of Hadoop clusters and storage systems. By generating large amounts of data and sorting it, we were able to measure the speed and efficiency of the NetApp solution compared with HDFS and other storage protocols or locations.
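On a real cluster, these benchmarks are typically launched as Hadoop MapReduce jobs from the bundled examples JAR. As a rough, single-machine sketch of the pattern they measure — generate fixed-width random records, then sort them by key — the following plain Python is illustrative only; the record and key sizes mirror TeraGen's standard 100-byte records with 10-byte keys, and the record count here is a tiny stand-in for the billions of rows used in real runs:

```python
import os
import time

RECORD_BYTES = 100     # TeraGen writes fixed 100-byte records
KEY_BYTES = 10         # the first 10 bytes form the sort key
NUM_RECORDS = 100_000  # tiny stand-in for a real benchmark's row count

def teragen_like(n):
    """Generate n pseudo-random, fixed-width records (TeraGen's role)."""
    return [os.urandom(KEY_BYTES) + b"x" * (RECORD_BYTES - KEY_BYTES)
            for _ in range(n)]

def terasort_like(records):
    """Sort records by their leading key bytes (TeraSort's role)."""
    return sorted(records, key=lambda r: r[:KEY_BYTES])

t0 = time.perf_counter()
data = teragen_like(NUM_RECORDS)
t1 = time.perf_counter()
out = terasort_like(data)
t2 = time.perf_counter()

print(f"generate: {t1 - t0:.2f}s  sort: {t2 - t1:.2f}s")
```

In a distributed run, the interesting part is that the generate phase stresses write throughput while the sort phase stresses read throughput, shuffle traffic, and write-back, which is why the two phases expose different storage bottlenecks.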
Common bottlenecks when running TeraGen and TeraSort include CPU and memory, network speed, and disk I/O limitations. These bottlenecks can affect overall run time and system performance, so it is important to have a storage solution that can handle high I/O loads and scale to meet the demands of large-scale data analytics workloads.
Our storage solution uses a shared data lake that serves large data compute server farms simultaneously. We based this solution on ONTAP for easy management and better performance; NetApp NFS direct access for fast, reliable, and secure access to NFS data; and HDFS for access to distributed, low-cost storage. In recent years, customers have been modernizing their data lakes with tiering capabilities and containerizing their applications to run in Kubernetes, OpenShift, or other container orchestration platforms. These enhancements further increase the effectiveness of NetApp storage solutions for analytics workloads.
In the NetApp Tech OnTap® blog Deep Learning with Apache Spark and NetApp AI (1) – Financial Sentiment Analysis results, we generated sentiment analysis results comparison tables by using PySpark Python scripts. We also recorded the run times by storing the data and models in different underlying storage file systems. To compare run-time performance on different file systems, we executed variations of financial sentiment analysis workloads on Nasdaq Top 10 company quarterly earnings call transcripts. We also ran the same run-time comparisons for other use cases, such as Horovod distributed training and click-through rate (CTR) prediction by using Keras.
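The run-time comparison pattern behind those tables can be sketched as a small harness: time one workload function against each storage location and tabulate the results. The mount paths and the workload body below are hypothetical stand-ins (the real jobs were PySpark sentiment-analysis scripts reading earnings-call transcripts), so this sketch runs anywhere:

```python
import time

def run_workload(data_path):
    # Hypothetical stand-in for a PySpark sentiment-analysis job that
    # would read its input from data_path; here we do a fixed amount of
    # local work so the sketch is runnable without a Spark cluster.
    return sum(i * i for i in range(200_000))

# Hypothetical mount points for the file systems under comparison.
storage_backends = {
    "nfs_direct_access": "/mnt/nfs/transcripts",
    "hdfs": "hdfs://namenode:8020/transcripts",
    "local_ssd": "/data/transcripts",
}

results = {}
for name, path in storage_backends.items():
    start = time.perf_counter()
    run_workload(path)
    results[name] = time.perf_counter() - start

# Print backends from fastest to slowest.
for name, secs in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:20s} {secs:.3f}s")
```

Keeping the workload fixed and varying only the storage location is what makes the resulting run times directly comparable across file systems.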
You can review the complete portfolio of NetApp Apache Spark/Hadoop storage positioning and analytics performance numbers (network/cluster throughput, cluster/storage CPU percentage, and so on). Check out Apache Spark workload with NetApp storage solution and NetApp storage solutions for Apache Spark, which cover architecture, use cases, and performance results.
The following figure compares the run time of two typical analytics operations with different storage controller clusters. The top bar is for sorting, and the bottom one is for data generation. The Spark cluster with a NetApp AFF A800 system has four compute worker nodes, and the cluster with E-Series storage has eight. We performed this test primarily to compare the performance of SSDs and HDDs.
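Reading a chart like this usually comes down to simple speedup ratios between the two storage configurations for each operation. A minimal sketch, with placeholder run times (these are not our measured results):

```python
# Placeholder run times in seconds, one entry per storage configuration;
# the real figures are in the chart above.
runtimes = {
    "AFF_A800": {"teragen": 100.0, "terasort": 250.0},
    "E-Series": {"teragen": 180.0, "terasort": 400.0},
}

speedups = {}
for op in ("teragen", "terasort"):
    # Ratio > 1 means the AFF A800 configuration finished faster.
    speedups[op] = runtimes["E-Series"][op] / runtimes["AFF_A800"][op]
    print(f"{op}: {speedups[op]:.2f}x speedup on AFF A800")
```

Because the two clusters had different compute node counts (four versus eight), raw ratios like these should be read alongside that difference, which is why the test was framed as an SSD-versus-HDD comparison.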
To summarize the run-time results: transitioning from spinning disks to all-flash storage improves performance, and in our testing the number of compute nodes was not the bottleneck. With NetApp all-flash storage, run-time performance scales well. With NFS, the data is effectively pooled in one place, which can reduce the number of required compute nodes, depending on your workload. And with Apache Spark clusters, you do not have to manually rebalance data when you change the number of compute nodes.
In the final part of this blog series, I discuss how NetApp storage helps you meet requirements for large-scale, distributed data processing, analytics, model training, fine-tuning, serving, and retraining for Apache Spark workloads. Stay tuned! Check out our Modern Data Analytics Solutions for more use cases, best practices, and technical details.
Rick Huang is a Technical Marketing Engineer at NetApp AI Solutions. With prior experience in the ad-tech industry as a data engineer and later a technical solutions consultant, his expertise includes healthcare IT, conversational AI, Apache Spark workflows, and NetApp AI Partner Network joint program development. Rick has published several technical reports since joining NetApp in 2019 and has presented at multiple GTCs and at the NetApp Data Visionary Center to various customers on AI and deep learning.