What Is Big Data?
Big data analytics is the process of examining large and varied datasets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information that can help organizations make better-informed business decisions.
Over the next several years, digital transformation will reshape organizations as the majority of global business revenue centers around digital or digitally enhanced products and services.
Structured or unstructured, generated by humans or machines, and stored in the data center or the cloud, data is the new basis of competitive advantage.
Big Data Challenges
IT leaders and analytics teams are under tremendous pressure to harness today's wealth of data and apply it to create new value across the entire organization, all with limited time, skills, and budget. Data is becoming distributed, dynamic, and diverse across data centers and the cloud. This situation is imposing challenges not only for infrastructure teams responsible for storing and protecting this data, but also for data scientists, engineers, and architects, who need to collect and analyze the data in real time from various data sources. Due to this vast data sprawl problem, analytics teams are asked to limit the scope of the data being analyzed or to wait days before the right data can be made available for analysis.
Big Data Technologies
Unstructured and semi-structured data types typically don't fit well in traditional data warehouses, which are based on relational databases oriented to structured datasets. Data warehouses also might not be able to handle the processing demands posed by sets of big data that need to be updated frequently or continually.
As a result, many organizations that collect, process, and analyze big data turn to NoSQL databases as well as Hadoop and its companion tools such as:
- YARN. A cluster management technology and one of the key features in second-generation Hadoop
- MapReduce. A programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster
- Apache Spark. A fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing
- HBase. An open-source, nonrelational, distributed database modeled after Google's Bigtable
- Apache Hive. A data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis
- Kafka. An open-source distributed event streaming platform, originally developed at LinkedIn and now maintained by the Apache Software Foundation
- Pig. An open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters
More and more frequently, big data analytics users are adopting the concept of a Hadoop data lake that serves as the primary repository for incoming streams of raw data. In such architectures, data can be analyzed directly in a Hadoop cluster or run through a processing engine such as Spark.
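The MapReduce programming model listed above can be illustrated with a minimal, single-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration, and the phases run sequentially here, whereas a real cluster would distribute them across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    # On a real cluster the map and reduce calls would run in parallel on
    # many nodes; in this sketch they run one after another in one process.
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big insights", "data lake"]
    print(word_count(sample))
```

Because each map call sees only one record and each reduce call sees only one key, the work can be split across machines without the functions needing to know about each other, which is the property that lets the model scale on a cluster.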
Big Data Segmentation
Real-Time Analytics is the use of data and related resources as soon as the data enters the system.
Streaming Analytics applies analytics while data is in flight, analyzing it within streaming pipelines built on platforms such as Apache Kafka and Apache Flink.
Edge Analytics is an approach to data collection and analysis in which an automated analytical computation is performed on data at a sensor, network switch, or other device instead of waiting for the data to be sent back to a centralized data store.
NoSQL is a class of database that stores and retrieves data modeled in ways other than the tabular relations used in relational databases. It spans a diverse range of newer-generation databases, including column, document, graph, key-value, and multi-model stores.
Cognitive Systems is a high-growth area that uses technologies such as deep learning to enable autonomous, self-managing, self-learning systems, which in turn support large-scale automation.
Content Analytics provides visibility into the amount of content that is being created, the nature of that content, and how it is used.
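As a rough illustration of the streaming analytics idea described above, analyzing data while it is in flight rather than after it lands in a store, the sketch below counts events per fixed time window as they arrive. It is plain Python and does not use Kafka or Flink; the `Event` shape and the `tumbling_window_counts` function are hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical event shape: (timestamp in seconds, event type).
Event = Tuple[float, str]


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (window_start, counts) as soon as each window closes.

    Assumes events arrive in timestamp order. Windows that contain
    no events are skipped rather than emitted as empty.
    """
    current_start: Optional[float] = None
    counts: Counter = Counter()
    for timestamp, event_type in events:
        # Align the event to the start of its fixed-size window.
        window_start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # The previous window is complete, so emit its result now.
            yield current_start, dict(counts)
            current_start, counts = window_start, Counter()
        counts[event_type] += 1
    if current_start is not None:
        # Flush the last, still-open window when the stream ends.
        yield current_start, dict(counts)


if __name__ == "__main__":
    stream = [(0.5, "click"), (1.2, "click"), (4.9, "view"), (5.1, "click"), (12.0, "view")]
    for start, result in tumbling_window_counts(stream, window_seconds=5.0):
        print(start, result)
```

The function is a generator, so each window's result becomes available as soon as the first event of the next window arrives, without the full dataset ever being held in memory. The same logic could in principle run on an edge device, sending only the per-window summaries to a central store.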
Big Data Ecosystem
|Category|Vendors and projects|
|---|---|
|Big Data Analytics Hadoop/Apache Software Distributions|Cloudera, Hortonworks, MapR|
|Application Management, Security, Compliance|Splunk|
|NoSQL Databases|Aerospike, Cassandra, Couchbase Server, HBase, MarkLogic, MongoDB, Redis Labs|
|Cloud Analytics|Amazon EMR, Azure HDInsight, Google Cloud Platform|
|Open Source Components|Druid, Elasticsearch, Apache Flink, Apache Hive, Apache Kafka, Apache Mesos, Apache Spark, Apache Solr, Apache Hadoop YARN, Apache ZooKeeper|
Benefits of Big Data
Driven by specialized analytics systems and software, big data analytics can point the way to various business benefits, including new revenue opportunities, more effective marketing, better customer service, improved operational efficiency, and advantages over competitors.
According to a survey by Datameer in 2016, 78% of enterprises agree that big data has the potential to fundamentally change the way they do business over the next 1 to 3 years.
Who Uses Big Data?
Big data analytics applications enable data scientists, predictive modelers, statisticians, and other analytics professionals to analyze growing volumes of structured transaction data, plus a mix of semi-structured and unstructured data such as Internet clickstream data, web server logs, social media content, text from customer e-mails and survey responses, mobile phone call detail records, and machine data captured by sensors connected to the Internet of Things (IoT).
Big Data Management and Storage
Rapidly gaining insights from data is crucial to capitalizing on opportunities, improving profits, and better managing risk. This ability requires enterprise-grade data management capabilities to cope with the vast datasets.
Accelerating real-time machine data analytics helps organizations detect cyberattacks before they cause damage, and prevent fraud without affecting the customer experience.
Quickly deriving business intelligence from customer data is essential to improving satisfaction levels and guiding future service offerings.
However, the first-generation approach to big data analytics, which relies on commodity direct-attached storage (DAS), simply doesn't scale efficiently. It also doesn't provide the reliability and flexibility needed as these applications become essential to competitiveness.
Big data analytics platforms built on shared or external storage deliver more scalability and performance, nondisruptively moving data where it's needed and keeping it protected and secure.
NetApp and Big Data
NetApp’s innovative big data analytics platform delivers up to twice the performance, seamlessly and securely moving data and workloads to the cloud or wherever needed and making sure that data is always backed up, secure, and available. With NetApp, you can lower license fees, hardware costs, and overall TCO by as much as 50% by increasing resource utilization and eliminating unnecessary data copies.