Deep Learning with Apache Spark and NetApp AI – Financial Sentiment Analysis

Rick Huang

With the recent boom in enterprise-level AI adoption, deep learning (DL) has advanced thanks to the sheer amount of data available and to techniques for processing that data with modern computing power. This blog explores DL applications for natural language processing (NLP) in Apache Spark clusters with the NetApp storage portfolio, and points you to information about how that portfolio meets the challenges these workloads present.

Apache Spark is a programming framework for writing Hadoop applications that work directly with the Hadoop Distributed File System (HDFS) and other file systems, such as NFS and object storage. It's a fast analytics engine with machine learning (ML) libraries designed for large-scale distributed data processing. It functions seamlessly with our NetApp® AI and modern data analytics portfolio. Its in-memory operations are more efficient than MapReduce for data pipelines, streaming, interactive analysis, and ML/DL algorithms. Apache Spark also mitigates the I/O operational challenges you might experience with Hadoop.

Should you use Apache Spark workloads with NetApp storage to implement large-scale distributed data processing and DL for NLP applications? To decide, you'll likely need to answer the following questions:

  • Why would I use NetApp for Apache Spark DL workloads?
  • What are the benefits of using NetApp technology with Apache Spark?
  • How can the NetApp DataOps Toolkit facilitate traditional or containerized workspace management, data manipulation, ML/DL data, code and model versioning, and inference server orchestration?
  • What NetApp storage controller should I use for my in-memory engine?
  • How does NetApp AI fit into my overall hybrid cloud architecture?

The challenges for Apache Spark deep learning workloads

We understand your DL challenges—we've conducted many proof-of-concept studies with large-scale financial services and automotive customers.

  • Unpredictable performance. Existing Hadoop deployments typically use commodity hardware. To improve performance, you must tune the network, operating system, Hadoop cluster, and ecosystem components such as Spark, TensorFlow, and Horovod. Even if you adjust each layer, achieving the overall desired performance levels can be difficult because Hadoop runs on commodity hardware not designed for high performance.
  • Media and node failures. Commodity hardware is prone to failure. If one disk on a data node fails, the Hadoop master, by default, considers that node to be unhealthy. It then copies replicated data from that node over the network to a healthy node, and this copy traffic competes for bandwidth with your running Hadoop jobs. When the node returns to a healthy state, the cluster must rebalance and remove the now over-replicated data, causing further delay in your production AI workflows.
  • Inability to scale computing and storage. Because your compute servers are busy running DL workloads and serving data, you can't scale your servers and storage independently. It isn't feasible to continue adding servers and storage to keep up with your increasing data quantity, analytics, and large-scale model training demands.
  • Shareable data and GPUs. Your data is locked up in local HDFS clusters. But you would like to share it among clusters and applications in a hybrid cloud and be future-ready for using GPUs to accelerate DL model training.
  • Hadoop vendor lock-in. Hadoop vendors ship their own distributions with proprietary versioning, which locks you into those distributions. However, many customers require DL capabilities that don't tie them to specific Hadoop distributions. They need the freedom to change distributions and still bring their DL workloads with them.
  • Lack of support for more than one language. To run their jobs, customers often require support for multiple languages in addition to MapReduce Java programs. Options such as SQL, Python, Scala, and scripts provide more flexibility for getting answers and developing data science workflows. These approaches also offer more options for organizing and retrieving data and delivering faster ways of deploying DL models into production.
  • Complicated frameworks and tools. Enterprise AI teams face multiple challenges. Even with expert data science knowledge, tools and frameworks might not translate simply from one deployment ecosystem or application to another. A data science platform should integrate seamlessly with corresponding big data platforms built on Spark. It should provide ease of data movement; reusable models; code out of the box; and tools that support best practices for prototyping, validating, versioning, sharing, reusing, retraining, and quickly deploying models to production.

Financial sentiment analysis results

At NetApp, we've figured out how to address your DL challenges with Apache Spark workloads. For details, see this NetApp Community post, which discusses using Apache Spark DL workloads with NetApp storage solutions (AFF, E-Series, DataOps Toolkit), and Spark NLP sentiment analysis results and run-time comparisons.

Learn how NetApp solutions address Apache Spark DL challenges

For more technical information about Apache Spark DL NLP workloads with NetApp storage solutions, including commands and scripts used in testing and benefits, check out TR-4570: NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results. See Hybrid Cloud Solution with Apache Spark and NetApp AI for hybrid cloud applications in distributed Spark clusters with the NetApp® storage portfolio. 

Rick Huang

Rick Huang is a Technical Marketing Engineer with NetApp AI Solutions. Before joining NetApp in 2019, he worked in the ad-tech industry as a data engineer and then as a technical solutions consultant. His expertise includes healthcare IT, conversational AI, Apache Spark workflows, and joint program development for the NetApp AI Partner Network. Rick has published several technical reports, presented at multiple GTCs, and spoken at the NetApp Data Visionary Center to various customers on AI and deep learning.
