
What is big data?


Big data analytics is the process of examining large and varied datasets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information that can help organizations make better-informed business decisions.

Over the next several years, digital transformation will reshape organizations as the majority of global business revenue centers around digital or digitally enhanced products and services.

Structured or unstructured, generated by humans or machines, and stored in the data center or the cloud, data is the new basis of competitive advantage.

Data management and data storage are integral to an organization's data strategy.

Big data challenges

IT leaders and analytics teams are under tremendous pressure to harness today's wealth of data and apply it to create new value across the entire organization, all with limited time, skills, and budget. Data is becoming distributed, dynamic, and diverse across data centers and the cloud. This situation poses challenges not only for the infrastructure teams responsible for storing and protecting this data, but also for the data scientists, engineers, and architects who need to collect and analyze the data in real time from various data sources. Because of this data sprawl, analytics teams are asked to limit the scope of the data being analyzed or to wait days before the right data can be made available for analysis.

Big data technologies

Unstructured and semi-structured data types typically don't fit well in traditional data warehouses, which are based on relational databases oriented to structured datasets. Data warehouses also might not be able to handle the processing demands posed by sets of big data that need to be updated frequently or continually.
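To make the mismatch concrete, the following minimal sketch tries to fit a few semi-structured records into a fixed set of columns, as a relational table would require. The column names, the sample events, and the helper functions are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical fixed schema, as a relational table would declare it.
FIXED_COLUMNS = ("user_id", "event", "timestamp")

# Hypothetical semi-structured events: the fields vary from record to record.
RAW_EVENTS = [
    '{"user_id": 1, "event": "click", "timestamp": "2024-01-01T00:00:00Z"}',
    '{"user_id": 2, "event": "purchase", "timestamp": "2024-01-01T00:05:00Z", "cart": {"items": 3}}',
    '{"device": "sensor-7", "reading": 21.5}',
]


def fit_to_schema(record, columns=FIXED_COLUMNS):
    """Split a record into a fixed-width row and the fields that do not fit."""
    row = tuple(record.get(col) for col in columns)
    leftover = {k: v for k, v in record.items() if k not in columns}
    return row, leftover


def summarize(raw_events):
    """Count records that fit the schema cleanly versus those that do not."""
    clean, lossy = 0, 0
    for line in raw_events:
        row, leftover = fit_to_schema(json.loads(line))
        # A record is "lossy" if it has extra fields or leaves columns empty.
        if leftover or None in row:
            lossy += 1
        else:
            clean += 1
    return clean, lossy


if __name__ == "__main__":
    print(summarize(RAW_EVENTS))
```

Only the first sample record maps cleanly onto the fixed columns; the other two either carry extra nested fields or leave columns empty, which is the kind of friction that pushes such data toward other stores.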

As a result, many organizations that collect, process, and analyze big data turn to NoSQL databases as well as Hadoop and its companion tools such as:

  • YARN. A cluster management technology and one of the key features in second-generation Hadoop
  • MapReduce. A programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster
  • Apache Spark. A fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing
  • HBase. An open-source, nonrelational, distributed database modeled after Google's Bigtable
  • Apache Hive. A data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis
  • Kafka. An open-source stream processing platform developed by the Apache Software Foundation
  • Pig. An open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters
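As an illustration of the MapReduce programming model listed above, here is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop or any of its APIs; the function names are hypothetical, and on a real cluster the map and reduce steps would run in parallel across nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = []
    for doc in documents:
        # On a cluster, each document could be mapped on a different node.
        mapped.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each map call looks at only one document and each reduce call looks at only one key, the work can be split across machines without the steps needing to coordinate with one another.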

More and more frequently, big data analytics users are adopting the concept of a Hadoop data lake that serves as the primary repository for incoming streams of raw data. In such architectures, data can be analyzed directly in a Hadoop cluster or run through a processing engine such as Spark.

Big data ecosystem

Segment | Key Vendors
Big Data Analytics Hadoop/Apache Software Distributions | Cloudera, Hortonworks, MapR
Application Management, Security, Compliance | Splunk
Spark | Databricks
NoSQL Databases | Aerospike, Cassandra, Couchbase Server, HBase, MarkLogic, MongoDB, Redis Labs
Cloud Analytics | Amazon EMR, Azure HDInsight, Google Cloud Platform
Open Source Components | Druid, Elasticsearch, Apache Flink, Apache Hive, Apache Kafka, Apache Mesos, Apache Spark, Apache Solr, Apache Hadoop YARN, Apache ZooKeeper

Benefits of big data

Driven by specialized analytics systems and software, big data analytics can point the way to various business benefits, including new revenue opportunities, more effective marketing, better customer service, improved operational efficiency, and competitive advantages over rivals.

According to a survey by Datameer in 2016, 78% of enterprises agree that big data has the potential to fundamentally change the way they do business over the next 1 to 3 years.

Who uses big data?

Big data analytics applications enable data scientists, predictive modelers, statisticians, and other analytics professionals to analyze growing volumes of structured transaction data, plus a mix of semi-structured and unstructured data such as Internet clickstream data, web server logs, social media content, text from customer e-mails and survey responses, mobile phone call detail records, and machine data captured by sensors connected to the Internet of Things (IoT).
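As a small example of working with one of these sources, the sketch below parses web server log lines and counts requests per path. It assumes the logs follow the Common Log Format; the sample lines, the regular expression, and the function names are illustrative assumptions rather than part of any particular product.

```python
import re
from collections import Counter

# Assumption: lines follow the Common Log Format,
# e.g. host ident user [time] "METHOD path PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical sample data, including one malformed line.
SAMPLE_LINES = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2024:13:55:40 +0000] "GET /pricing HTTP/1.1" 200 512',
    '203.0.113.5 - - [10/Oct/2024:13:56:01 +0000] "GET /index.html HTTP/1.1" 304 -',
    'not a log line',
]


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def requests_per_path(lines):
    """Count requests per URL path, skipping lines that fail to parse."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[entry["path"]] += 1
    return counts


if __name__ == "__main__":
    for path, count in requests_per_path(SAMPLE_LINES).most_common():
        print(path, count)
```

Skipping unparseable lines rather than failing is a deliberate choice here, since machine-generated data at scale usually contains some malformed records.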

Big data management and storage

Rapidly gaining insights from data is crucial to capitalizing on opportunities, improving profits, and better managing risk. This ability requires enterprise-grade data management capabilities to cope with the vast datasets.

Accelerating real-time machine data analytics helps organizations detect cyberattacks before they cause damage, and prevent fraud without affecting the customer experience.
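One simple way to picture real-time detection on machine data is a sliding-window counter. The sketch below flags a source once it produces too many failed logins within a time window. The class name, the window length, and the threshold are hypothetical, and the code assumes events for each source arrive in timestamp order; production systems use far richer signals.

```python
from collections import defaultdict, deque


class FailedLoginMonitor:
    """Flag sources with too many failed logins inside a sliding time window."""

    def __init__(self, window_seconds=60, threshold=5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        # One queue of recent event timestamps per source.
        self._events = defaultdict(deque)

    def observe(self, source, timestamp):
        """Record one failed login; return True if the source reaches the threshold."""
        window = self._events[source]
        window.append(timestamp)
        # Drop events that have aged out of the window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) >= self.threshold


if __name__ == "__main__":
    monitor = FailedLoginMonitor(window_seconds=10, threshold=3)
    for second in (0, 1, 2):
        print(second, monitor.observe("198.51.100.7", second))
```

Each event is handled as it arrives and only recent timestamps are kept in memory, so the check can run continuously against a stream instead of waiting for a batch job.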

Quickly deriving business intelligence from customer data is essential to improving satisfaction levels and guiding future service offerings.

However, the commodity storage approach used by first-generation big data analytics (direct-attached storage, or DAS) simply doesn’t scale efficiently, and it doesn’t provide the reliability and flexibility needed as these applications become essential to competitiveness.

Big data analytics platforms built on shared or external storage deliver more scalability and performance. They move data nondisruptively to where it’s needed and keep it protected and secure.

NetApp and big data

NetApp’s innovative big data analytics platform delivers up to twice the performance, seamlessly and securely moving data and workloads to the cloud or wherever needed and making sure that data is always backed up, secure, and available. With NetApp, you can lower license fees, hardware costs, and overall TCO by as much as 50% by increasing resource utilization and eliminating unnecessary data copies.
