What Is Big Data?

Big Data Challenges

IT leaders and analytics teams are under tremendous pressure to harness today's wealth of data and apply it to create new value across the entire organization, all with limited time, skills, and budget. Data is becoming distributed, dynamic, and diverse across data centers and the cloud. This situation creates challenges not only for the infrastructure teams responsible for storing and protecting this data, but also for the data scientists, engineers, and architects who need to collect and analyze data in real time from various sources. Because of this data sprawl, analytics teams are often asked to limit the scope of the data being analyzed or to wait days before the right data can be made available for analysis.

Big Data Technologies

Unstructured and semi-structured data types typically don't fit well in traditional data warehouses, which are based on relational databases oriented to structured datasets. Data warehouses also might not be able to handle the processing demands posed by sets of big data that need to be updated frequently or continually.

As a result, many organizations that collect, process, and analyze big data turn to NoSQL databases as well as Hadoop and its companion tools such as:

  • YARN. A cluster management technology and one of the key features in second-generation Hadoop
  • MapReduce. A programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster
  • Apache Spark. A fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing
  • HBase. An open-source, nonrelational, distributed database modeled after Google's Bigtable
  • Apache Hive. A data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis
  • Kafka. An open-source stream processing platform originally developed at LinkedIn and now maintained by the Apache Software Foundation
  • Pig. An open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters
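
The MapReduce programming model listed above can be illustrated with a minimal, single-process sketch. This is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes. The sketch only shows the shape of the model: map emits key/value pairs, the framework groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (in one process, for illustration only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data needs big storage", "data lakes hold raw data"]
    print(word_count(docs))
```

Because each call to map_phase depends only on its own record and each call to reduce_phase depends only on its own key, both steps can be spread across a cluster without coordination, which is what makes the model suitable for large datasets.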

More and more frequently, big data analytics users are adopting the concept of a Hadoop data lake that serves as the primary repository for incoming streams of raw data. In such architectures, data can be analyzed directly in a Hadoop cluster or run through a processing engine such as Spark.

Big Data Segmentation

Real-Time Analytics is the use of data and related resources as soon as the data enters the system.

Streaming Analytics is the analysis of data inside streaming pipelines, such as those built on Apache Kafka and Apache Flink, where analytics are applied while the data is in flight.
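
As a rough sketch of what "in flight" analysis means, the example below counts events per fixed time window as they pass through an iterator, without first storing the whole stream. It uses no Kafka or Flink API; the Event shape, the function name tumbling_window_counts, and the assumption that events arrive in timestamp order are all illustrative choices.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical event shape: (timestamp in seconds, event type).
Event = Tuple[float, str]


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (window_start, counts) each time a window closes.

    Assumes events arrive in timestamp order; windows with no events are skipped.
    """
    window_start: Optional[float] = None
    counts: Dict[str, int] = {}
    for timestamp, event_type in events:
        current = (timestamp // window_seconds) * window_seconds
        if window_start is None:
            window_start = current
        elif current != window_start:
            # The event belongs to a later window, so emit the finished one.
            yield window_start, counts
            window_start, counts = current, {}
        counts[event_type] = counts.get(event_type, 0) + 1
    if window_start is not None:
        # Flush the last, still-open window when the stream ends.
        yield window_start, counts


if __name__ == "__main__":
    stream = [(0.5, "click"), (1.2, "view"), (9.9, "click"), (10.1, "click"), (25.0, "view")]
    for start, window_counts in tumbling_window_counts(stream, 10.0):
        print(start, window_counts)
```

Only the counts for the current window are held in memory, so results are available as soon as each window closes rather than after a batch load.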

Edge Analytics is an approach to data collection and analysis in which an automated analytical computation is performed on data at a sensor, network switch, or other device instead of waiting for the data to be sent back to a centralized data store.
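
A minimal sketch of the idea, assuming a device that collects numeric sensor readings: instead of forwarding every raw reading, the device computes a small summary locally and forwards only that. The EdgeSummary type, the summarize_on_device function, and the alert_threshold parameter are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass
class EdgeSummary:
    """Compact result sent upstream in place of the raw readings."""
    count: int
    mean: float
    maximum: float
    alerts: List[float]


def summarize_on_device(readings: Sequence[float], alert_threshold: float) -> EdgeSummary:
    """Reduce a batch of raw readings to a summary at the point of collection."""
    if not readings:
        raise ValueError("readings must not be empty")
    return EdgeSummary(
        count=len(readings),
        mean=mean(readings),
        maximum=max(readings),
        # Keep only the readings that exceed the threshold; the rest stay on the device.
        alerts=[value for value in readings if value > alert_threshold],
    )


if __name__ == "__main__":
    print(summarize_on_device([20.0, 22.0, 21.0, 95.0], alert_threshold=80.0))
```

The trade-off is that the central store receives less detail, so the summary fields have to be chosen with the downstream analysis in mind.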

NoSQL is a class of databases that store and retrieve data modeled by means other than the tabular relations used in relational databases. The class spans a diverse range of new-generation databases, including column, document, graph, key-value, and multi-model stores.
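
To make the difference between these data models concrete, the sketch below represents the same hypothetical customer record in four shapes using plain Python structures. No real database API is involved; the store names, keys, and the products_ordered_by helper are invented for illustration, and actual products differ in how they index and query each shape.

```python
import json
from typing import List

# Key-value: an opaque value addressed by a single key.
key_value_store = {"customer:42": json.dumps({"name": "Ada", "city": "Oslo"})}

# Document: a nested, self-describing record; fields can vary from one document to the next.
document_store = {
    "customers": [
        {"_id": 42, "name": "Ada", "city": "Oslo", "orders": [{"sku": "disk-1", "qty": 2}]}
    ]
}

# Column (wide-column): row key -> column family -> column -> value.
column_store = {
    "42": {"profile": {"name": "Ada", "city": "Oslo"}, "orders": {"disk-1": 2}}
}

# Graph: nodes plus labeled edges that describe relationships between them.
graph_nodes = {"customer:42": {"name": "Ada"}, "product:disk-1": {"label": "Disk"}}
graph_edges = [("customer:42", "ORDERED", "product:disk-1")]


def products_ordered_by(customer_node: str) -> List[str]:
    """Follow ORDERED edges from one customer node to the products it points at."""
    return [dst for src, label, dst in graph_edges if src == customer_node and label == "ORDERED"]


if __name__ == "__main__":
    print(json.loads(key_value_store["customer:42"])["city"])
    print(document_store["customers"][0]["orders"][0]["qty"])
    print(column_store["42"]["orders"]["disk-1"])
    print(products_ordered_by("customer:42"))
```

The choice of model mostly follows the access pattern: key-value for lookups by a known key, document for nested records read as a unit, column for wide rows read by column family, and graph for queries that follow relationships.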

Cognitive Systems is a high-growth area that uses technologies such as deep learning to build autonomous, self-managing, and self-learning systems, which in turn enable large-scale automation.

Content Analytics provides visibility into the amount of content that is being created, the nature of that content, and how it is used.

Big Data Ecosystem

Segment | Key Vendors
Big Data Analytics Hadoop/Apache Software Distributions | Cloudera, Hortonworks, MapR
Application Management, Security, Compliance | Splunk
Spark | Databricks
NoSQL Databases | Aerospike, Cassandra, Couchbase Server, HBase, MarkLogic, MongoDB, Redis Labs
Cloud Analytics | Amazon EMR, Azure HDInsight, Google Cloud Platform
Open Source Components | Druid, Elasticsearch, Apache Flink, Apache Hive, Apache Kafka, Apache Mesos, Apache Spark, Apache Solr, Apache Hadoop YARN, Apache ZooKeeper

Benefits of Big Data

Driven by specialized analytics systems and software, big data analytics can point the way to various business benefits, including new revenue opportunities, more effective marketing, better customer service, improved operational efficiency, and competitive advantages over rivals.

According to a survey by Datameer in 2016, 78% of enterprises agree that big data has the potential to fundamentally change the way they do business over the next 1 to 3 years.

Who Uses Big Data?

Big data analytics applications enable data scientists, predictive modelers, statisticians, and other analytics professionals to analyze growing volumes of structured transaction data, plus a mix of semi-structured and unstructured data such as Internet clickstream data, web server logs, social media content, text from customer e-mails and survey responses, mobile phone call detail records, and machine data captured by sensors connected to the Internet of Things (IoT).

Big Data Management and Storage

Rapidly gaining insights from data is crucial to capitalizing on opportunities, improving profits, and better managing risk. This ability requires enterprise-grade data management capabilities to cope with the vast datasets.

Accelerating real-time machine data analytics helps organizations detect cyberattacks before they cause damage, and prevent fraud without affecting the customer experience.

Quickly deriving business intelligence from customer data is essential to improving satisfaction levels and guiding future service offerings.

However, the first-generation approach to big data analytics storage, commodity direct-attached storage (DAS), simply doesn’t scale efficiently. It also doesn’t provide the reliability and flexibility needed as these applications become essential to competitiveness.

Big data analytics platforms built on shared or external storage deliver more scalability and performance, nondisruptively moving data where it’s needed and making sure that it is always protected and secure.

NetApp and Big Data

NetApp’s innovative big data analytics platform delivers up to twice the performance, seamlessly and securely moving data and workloads to the cloud or wherever needed and making sure that data is always backed up, secure, and available. With NetApp, you can lower license fees, hardware costs, and overall TCO by as much as 50% by increasing resource utilization and eliminating unnecessary data copies.

Discover how NetApp® big-data solutions can help you meet extreme enterprise requirements for your Splunk, Hadoop, and NoSQL database workloads.

Big Data Analytics

Store new, unstructured data and keep it available—so your Splunk, Hadoop, and NoSQL workloads are always running.

Access Data

Accelerate search performance by 111%. Access your data more frequently. Scale with compute and storage decoupled.

Analytics

Run data analytics on existing data stored on NFS-based systems and data in the hybrid cloud.