Big Data Overview Tutorial

Big Data Overview

Welcome to the first chapter of the Apache Storm tutorial (part of the Apache Storm course). This lesson will provide you with an introduction to big data and then introduce the concept of real-time big data.

Let us explore the objectives of this lesson in the next section.

Objectives

By the end of this lesson, you will be able to:

  • Describe the concept of big data and its 3 Vs

  • List the different sizes of data

  • Describe some use cases for big data

  • Explain the concept of Apache Hadoop and real-time big data processing

  • Describe some tools for real-time processing of big data

In the next section, we will discuss the concept of big data.

Big Data

Digital data has exploded over the last 2-3 years. Facebook, Twitter, YouTube, and sensor networks have been a few of the major contributors to this huge growth in data.

Data is growing at a very rapid pace, with volumes now measured in millions of gigabytes. Technology has evolved alongside it: the technology to store and process these large volumes, and the technology to analyze them and make decisions based on them. Together, this data and the technology to utilize it represent big data.

Next, we will explore the 3 Vs of big data.


3 Vs of Big Data

Big data is normally characterized by 3 Vs:

  • Volume

  • Velocity

  • Variety

These are the main Vs of big data. Other, less commonly cited Vs include Veracity, Visualization, and Value; Veracity refers to the truthfulness of data.

Next, we will explore each of these 3 Vs.

Data Volume

Volume refers to the size of digital data. The internet and the social media effect have resulted in an explosion of digital data, which has grown from Gigabytes to Terabytes to Petabytes to Exabytes.

Total data on the internet was about 8 Exabytes as of 2008; by 2011, it had exploded to 150 Exabytes, and it reached 670 Exabytes in 2013. That is more than doubling every year! In another ten years, it is expected to exceed seven Zettabytes. How can one store and handle this much data? New terms have been introduced just to describe the size of data.

Now, let us understand the various terms used for different data sizes.

Data Sizes

This table shows the various sizes used for big data. We all know Kilobyte, Megabyte, and Gigabyte. A Terabyte consists of 1,024 Gigabytes, and one Petabyte is 1,024 Terabytes. New terms such as Exabyte, Zettabyte, and Yottabyte have been added to address big data sizes. When we say big data, we normally mean sizes of Terabytes or more.

  • Kilobyte (KB) = 2^10 bytes = 1,024 bytes

  • Megabyte (MB) = 2^20 bytes = 1,024 KB

  • Gigabyte (GB) = 2^30 bytes = 1,024 MB

  • Terabyte (TB) = 2^40 bytes = 1,024 GB

  • Petabyte (PB) = 2^50 bytes = 1,024 TB

  • Exabyte (EB) = 2^60 bytes = 1,024 PB

  • Zettabyte (ZB) = 2^70 bytes = 1,024 EB

  • Yottabyte (YB) = 2^80 bytes = 1,024 ZB
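To make the powers of two concrete, here is a small, self-contained Java sketch (not tied to any big data tool) that prints each unit size in bytes. The Zettabyte and Yottabyte values exceed a 64-bit long, so BigInteger is used for those.

```java
import java.math.BigInteger;

public class DataSizes {
    public static void main(String[] args) {
        // Unit names and their powers of two, matching the list above.
        String[] units = {"Kilobyte", "Megabyte", "Gigabyte", "Terabyte", "Petabyte", "Exabyte"};
        for (int i = 0; i < units.length; i++) {
            long bytes = 1L << (10 * (i + 1)); // 2^10, 2^20, ..., 2^60
            System.out.printf("1 %s = %,d bytes%n", units[i], bytes);
        }
        // 2^70 and 2^80 overflow a 64-bit long, so use BigInteger for them.
        System.out.println("1 Zettabyte = " + BigInteger.valueOf(2).pow(70) + " bytes");
        System.out.println("1 Yottabyte = " + BigInteger.valueOf(2).pow(80) + " bytes");
    }
}
```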

Now that you know the different big data sizes, let us look at the second V of big data.

Velocity of Data

The velocity of data refers to the speed of data ingestion or data growth. There are millions of web pages being added every day. Data gets created from different sources such as desktops, laptops, mobiles, tablets, and sensors. Manufacturing facilities have thousands of sensors that generate sensor data every few seconds. People use one device or another to create data on a 24/7 basis.

We might think that only the data external to an organization, such as data on the internet, is growing fast, but the data created internally by organizations is also growing rapidly. This is due to the growth in organizations' global customer bases and the increase in transactions and interactions with those customers.

There are many contributors to this data growth: web and social media, online billing systems, ERP implementations, and network and machine data. Growth in an organization's revenue also means that its data is growing at a rapid pace.

Moving on, we will look at the third V of big data.

Variety of Data

Data variety refers to the different types of data that are being created. One of the major reasons for this is the multimedia and social media effect. These days, data is not just plain text; it includes images, audio, video, XML, and HTML. There is structured data such as databases, semi-structured data such as XML, and unstructured data such as program logs, blog posts on platforms like WordPress, and user comments on Twitter and Facebook.

There are many industries such as transport, science, and finance which are adding a variety of data every day.

Let us reiterate the 3 Vs of big data: Volume, Velocity, and Variety.

Now that we have explored the 3 Vs of big data, let us look at the evolution of data over the years.

Data Evolution

Digital data has gone through a cycle over the last 20-30 years, starting with unstructured data.

We started with document editors creating plain text documents. Then files, data-handling applications, and spreadsheets increased the usage of digital computers.

The introduction of relational databases revolutionized structured data, and many organizations have created large amounts of structured data using them. This expanded to data warehouses and storage area networks to handle large volumes of structured data.

Then came the concept of metadata, which describes data, and semi-structured data such as HTML. With the advent of social media like Facebook and Twitter, unstructured data has exploded in the last few years. Thus, data has come full circle.

The image below depicts the evolution of data, starting with unstructured document data, moving through structured database data, and ending with unstructured social media comments.

[Image: Evolution of data]

Next, let us look at some of the features of big data.

Features of Big data

Big data has many notable features. It is extremely fragmented due to the variety of data, and it does not provide decisions by itself; you need to figure out how to use it to make decisions.

Big data is not only unstructured data. The structured data component of big data extends and complements your unstructured data. Big data is not a substitute for your structured data. Since most of the information on the internet is available to everyone, it can be used for good causes and at the same time can be misused by antisocial elements to disturb the peace.

Generally, big data is very wide, so you may have hundreds of fields in each line of your data. It is also very dynamic, as gigabytes and terabytes of data are created every day and can change daily, like the weather. Big data can be both internal (generated within the organization) and external (such as social media or YouTube data).

Moving on, let us explore some industry examples of big data.

Industry Examples

Here are some industry examples of big data.  

Retail Affinity Detection or Market Basket Analysis

In the retail industry, big data is used extensively for affinity detection and market basket analysis. Retailers use it to answer questions like: when a customer visits a store and buys a product X, which other product Y are they most likely to buy? They can then place product Y next to product X so that the customer has a more pleasant shopping experience.

Credit Card Fraud Detection 

Credit card companies want to detect fraudulent purchases as early as possible so that they can alert the customer immediately.

Bank Loan Risk Minimization 

Banks want to scrutinize not only the private data but also the public data available about a customer so that they can minimize risk when giving loans.

Medical Diagnostics

In medical diagnostics, doctors can diagnose a patient's illness based on the symptoms instead of depending on intuition alone.

Digital Marketing

Digital marketers need to process a lot of customer data to find effective marketing channels.

Algorithmic Trading

Based on the last 20-30 years of stock market data, algorithmic trading makes it possible to maximize profits on one's portfolio.

Now, we will look at some more industry examples of big data.  

Insurance Risk Management 

Almost every industry has some use of big data. Insurance companies can use it to minimize insurance risks. For example, a person’s driving data can be captured automatically by cars and forwarded to insurance companies so that the premium can be increased for risky drivers.

Sensory Data Management

Manufacturing units and oil rigs have thousands of sensors that generate gigabytes of data every day. This data is analyzed to reduce risks and costly equipment failures.

Advertising

Advertisers use demographic data to target their audience more effectively.

Genetics 

Researchers in the field of genetics analyze Terabytes and Petabytes of data to come up with new models.

Power Grid Load Forecast

Power grids analyze a large amount of historical data and weather forecasting data to forecast power consumption.

Crime Detection and Prevention

As the data is available to the public, law enforcement officials have to be one step ahead of the antisocial elements to detect misuse of data and prevent crimes.

So, as we mentioned earlier, almost every industry uses big data in one way or another. Now that we are familiar with the concept of big data and its uses, let us look at how big data analysis is different from traditional analytics.

Big data Analysis

With big data, you use all your data instead of sample data for analysis. In traditional analysis, analysts take a representative sample of data from the available data and do their analysis to provide their conclusions.

With big data technology, all the available data is used for analysis. You may find associations in the data, predict future outcomes, and provide prescriptive analysis, which goes beyond predicting what may happen and recommends what should be done about it.

Using big data for analysis also means you make data-driven decisions instead of decisions based on intuition. With data, you can support decisions that would otherwise be left to chance. Analysis using big data can help organizations increase their safety standards, reduce maintenance costs, and prevent failures.

The image below depicts how, with traditional analytics, you copy a sample of the data to a small database and run the analysis on that, whereas with big data analytics, you use all the available data without sampling.

[Image: Big data analysis]

Now, let us do a quick comparison between big data technology and traditional technology.

Technology Comparison

Here is how traditional technology is different from big data technology.  

Traditional Technology:

  • Has a limit on scalability

  • Uses highly parallel processors on a single machine

  • Processors may be distributed, but data is stored in one place

  • Depends on expensive high-end hardware, on the order of more than 40,000 dollars per Terabyte

  • Uses storage technologies like SAN (Storage Area Network) to store data

Big Data Technology:

  • Highly and massively scalable

  • Uses distributed processing across multiple machines

  • Data is distributed across multiple machines

  • Leverages commodity hardware, which may cost less than 5,000 dollars per Terabyte

  • Uses distributed data storage with data redundancy

The cost factor is a major reason for CTOs and CEOs to lean towards big data technology. Moving on, we will discuss the concept of Apache Hadoop, which is a popular big data technology.

Apache Hadoop

Apache Hadoop is one of the most popular frameworks for big data processing.

Hadoop has two core components: HDFS and MapReduce. Hadoop uses HDFS to distribute the data to multiple machines and MapReduce to distribute the processing to multiple machines. Hadoop runs the processing where the data is, following the principle of moving the processing to the data instead of moving the data to the processing.

First, the data is divided into multiple parts (data1, data2, data3, and so on), which are distributed to multiple machines. Then, the processing is done by the CPUs of each machine on the data stored on that machine.

The diagram below indicates how HDFS distributes the data to multiple machines and how MapReduce distributes the processing to multiple CPUs on those machines.

[Image: Apache Hadoop architecture]

Next, let us look at the HDFS component of Hadoop.

HDFS

HDFS, which stands for Hadoop Distributed File System, is the storage component of Hadoop. It stores each file as a set of blocks, with a default block size of 64 Megabytes. This is quite large compared to the 1 KB or 4 KB block sizes typical on Windows file systems.

HDFS is a 'write once, read many times' file system, also called WORM. Blocks are replicated across nodes in the cluster, with a default replication factor of three.

Let us illustrate this with an example.  

Suppose you store a 320 Megabyte file in HDFS. It gets divided into 5 blocks of 64 Megabytes each, since 64 * 5 = 320. Each block is replicated to make three copies, giving a total of 15 block replicas. If there are five nodes in the cluster, these replicas are distributed across the five nodes so that no two replicas of the same block are on the same node.
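As a minimal sketch of the arithmetic above (plain Java, no Hadoop dependency; the 64 MB block size and replication factor of three are the defaults quoted in this lesson):

```java
public class HdfsBlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 320;      // size of the file being stored
        long blockSizeMb = 64;      // HDFS default block size from the example
        int replicationFactor = 3;  // HDFS default replication factor

        // Number of blocks, rounding up for any partial last block.
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
        long totalReplicas = blocks * replicationFactor;

        System.out.println("Blocks: " + blocks);                      // 5
        System.out.println("Total block replicas: " + totalReplicas); // 15
    }
}
```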

The diagram below shows how a 320MB file is divided into multiple blocks and stored on five data nodes.

[Image: HDFS block distribution]

Let us move on to understand how MapReduce functions.

MapReduce

MapReduce is the processing framework of Hadoop. It provides highly fault-tolerant, distributed processing of the data distributed by HDFS. MapReduce consists of two types of tasks. Mappers are tasks that run in parallel on different nodes of the cluster and process the data blocks; their output is a set of key-value pairs.

After the map tasks complete, their results are gathered and aggregated by the reduce tasks; to reduce is to summarize and consolidate. Reducers produce the final output of MapReduce. Each mapper preferably runs on the data block stored on its own node (data locality), which again follows the paradigm of taking the processing to the data.
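To make the mapper/reducer split concrete, here is the classic word-count example written against the Hadoop MapReduce Java API. It is a sketch rather than a complete job: a real run would also need a main() method that configures a Job with input and output paths.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: runs in parallel on each node against its local data block and
    // emits a (word, 1) key-value pair for every word it sees.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: receives all values for one key (word) and sums them,
    // consolidating the mappers' partial results into the final count.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

The framework groups the mappers' output by key before calling the reducer, which is what makes the aggregation step possible.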

Having learned about how HDFS and MapReduce function, let us explore the concept of real-time big data.

Real-time Big data

Real-time big data refers to handling a massive amount of business data as soon as the data is created to get valuable insights and prescribe immediate actions. Here real-time means as soon as an event happens.

Using real-time big data tools, you will be able to:

  • Read and write data in real-time.

  • Filter and aggregate in real-time.

  • Visualize data in real-time.

  • Process millions of records per second.

Next, we will cite some examples of real-time big data processing.

Real-time Big data Examples

Following are some use cases for real-time big data:

  1. A telecom provider wants to provide data plans to the customers based on their location. Here, the location data is received continuously and has to be processed in real-time.

  2. A bank wants to indicate the nearest ATM location based on the customer's location. Here again, the customer's location data is received in real-time, and the recommendation has to be made immediately.

  3. A car manufacturer can alert the car owner on any urgent maintenance required on the car based on the data provided by the car during driving. Measurement data of various sensors in the car has to be streamed to the car manufacturer in real-time.

  4. A news channel may monitor breaking news items across the globe. Real-time news data from hundreds of sources has to be prioritized and selected for breaking news.

  5. A security system may monitor movements in a stadium during a game. Any suspicious movements need to be reported immediately.

  6. A telecom network provider wants to use the least congested network for each call. Again, decisions have to be made in real-time.

  7. A credit card company wants to prevent fraudulent transactions. Here, both real-time and offline processing are probably involved.

  8. A stock market application recommends purchasing stocks every second based on the market conditions. Volatile market conditions have to be analyzed in real-time.

Next, let us look at some real-time big data tools.

Real-time Big data Tools

Following are some of the tools to handle big data in real-time:

  • Apache Kafka

  • Apache Storm

  • Apache Cassandra

  • Apache Spark

  • Apache HBase

We will briefly look at each of these tools. The first real-time tool we will discuss is Apache Kafka.

Kafka is a high-performance, real-time messaging system with the following characteristics; a minimal producer sketch follows the list.

  • Open source, part of Apache projects

  • Distributed and partitioned messaging system

  • Highly fault-tolerant

  • Can process millions of messages per second and send to many receivers
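As a minimal sketch of what producing messages to Kafka looks like from Java (the broker address localhost:9092 and the topic name sensor-readings are illustrative placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; point this at your Kafka cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send one message to the (hypothetical) sensor-readings topic;
            // closing the producer flushes any buffered messages.
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-42", "temperature=71.3"));
        }
    }
}
```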

Storm is a real-time data processing system with the following characteristics; a small topology sketch follows the list.

  • Open source, part of Apache projects

  • Fast and reliable processing

  • Processes unbounded sequences of data

  • Interfaces with queues like Kafka to get data at one end and can store data into Cassandra
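Since Storm is the subject of this course, here is a minimal sketch of how a topology is wired together with the Storm 2.x Java API. The spout, bolt, and component names are illustrative toys; later chapters cover spouts and bolts in detail.

```java
import java.util.Map;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class DemoTopology {

    // A toy spout that emits one word per second, standing in for a real
    // unbounded source such as a Kafka topic.
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("storm"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // A toy bolt that simply prints every word it receives.
    public static class PrinterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("Received: " + tuple.getStringByField("word"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt emits nothing downstream.
        }
    }

    public static StormTopology build() {
        // Wire the spout to the bolt; shuffleGrouping spreads tuples randomly
        // across the bolt's tasks. Submitting the topology (LocalCluster for
        // testing, StormSubmitter for a cluster) is covered in later chapters.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout(), 1);
        builder.setBolt("printer", new PrinterBolt(), 1).shuffleGrouping("words");
        return builder.createTopology();
    }
}
```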

Cassandra is an Apache open source database with the following characteristics; a brief usage sketch follows the list.

  • Highly fault-tolerant – no single point of failure (SPOF)

  • Highly available – Ring architecture

  • Real-time reads and writes

  • Super fast writes with tunable consistency

  • Simple SQL-like query interface (CQL)

  • Key-value database

  • Highly scalable
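As a hedged sketch of what Cassandra's SQL-like query language (CQL) looks like from Java, here is an example using the DataStax Java driver 4.x. The keyspace demo and table readings are hypothetical; with no contact points configured, the driver connects to a local node at 127.0.0.1:9042.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraSketch {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical keyspace and table for sensor readings.
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.readings "
                    + "(sensor_id text, ts timestamp, value double, PRIMARY KEY (sensor_id, ts))");

            // Fast write keyed by sensor_id (the partition key).
            session.execute("INSERT INTO demo.readings (sensor_id, ts, value) "
                    + "VALUES ('sensor-42', toTimestamp(now()), 71.3)");

            // Read the readings back for the same partition key.
            ResultSet rs = session.execute(
                    "SELECT ts, value FROM demo.readings WHERE sensor_id = 'sensor-42'");
            for (Row row : rs) {
                System.out.println(row.getInstant("ts") + " -> " + row.getDouble("value"));
            }
        }
    }
}
```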

Apache Spark is considered the next-generation MapReduce. It has the following characteristics; a short word-count example follows the list.

  • Apache open source project

  • Transforms distributed data

  • Provides data transforms beyond map and reduce

  • Faster than Hadoop MapReduce

  • Suitable for batch as well as real-time processing

  • Provides Spark-SQL for SQL interface to big data

  • Provides built-in libraries for Machine Learning and Graph Processing
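To contrast with the Hadoop MapReduce example earlier, here is the same word count expressed with Spark's Java API. The input path is a placeholder, and local[*] runs Spark in-process purely for illustration.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM for illustration; a real job
        // would point at a cluster manager (YARN, Kubernetes, standalone).
        SparkSession spark = SparkSession.builder()
                .appName("SparkWordCount")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical input path on HDFS.
        JavaRDD<String> lines =
                spark.read().textFile("hdfs:///data/input.txt").javaRDD();

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // "map" step
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);                                  // "reduce" step

        // Print a small sample of the results.
        counts.take(10).forEach(t -> System.out.println(t._1() + ": " + t._2()));
        spark.stop();
    }
}
```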

Apache HBase is another open source NoSQL database with the following characteristics; a brief read/write sketch follows the list.

  • Built on top of HDFS

  • Distributed database

  • Columnar storage

  • Real-time read/write random access

  • Supports very large databases

  • Not relational

  • Does not support SQL
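As a minimal sketch of HBase's random read/write access from Java (the table readings, column family d, and row key are hypothetical; connection settings come from hbase-site.xml on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Hypothetical table "readings" with a column family "d".
             Table table = connection.getTable(TableName.valueOf("readings"))) {

            // Random-access write: one cell addressed by row key, column family, qualifier.
            Put put = new Put(Bytes.toBytes("sensor-42#2024-01-01T00:00"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("71.3"));
            table.put(put);

            // Random-access read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("sensor-42#2024-01-01T00:00")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("value"));
            System.out.println("value = " + Bytes.toString(value));
        }
    }
}
```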

In the next section, we will discuss ZooKeeper.


Zookeeper

ZooKeeper is a distributed coordination service from Apache. It is highly scalable and fault-tolerant, and it facilitates coordination between distributed processes and applications. It provides an open source library of recipes that handle common issues in distributed process coordination.
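As a small sketch of what coordination through ZooKeeper looks like from Java, here is an example using the ZooKeeper client API to store and read back a shared configuration value. The connection string and znode path are illustrative.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) local ZooKeeper server with a 3 s session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event ->
                System.out.println("ZooKeeper event: " + event.getState()));

        // Create a znode holding a small piece of shared configuration.
        // PERSISTENT means it outlives this client's session. (Running this
        // twice would fail with NodeExistsException; ignored here for brevity.)
        String path = zk.create("/demo-config", "batch.size=100".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any process in the cluster can now read (and watch) the same znode.
        byte[] data = zk.getData(path, false, null);
        System.out.println("Read back: " + new String(data));

        zk.close();
    }
}
```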

Summary

We have now come to the end of this lesson. Let us summarize the topics covered.

  • Big data is characterized by 3 Vs: Volume, Velocity, and Variety.

  • Almost every industry can use big data technology.

  • As compared to traditional technology, big data technology uses commodity hardware instead of expensive hardware.

  • Apache Hadoop is a popular product to process big data and has two core components, HDFS and MapReduce.

  • Real-time processing of big data is required for some industries.

  • Kafka, Storm, Cassandra, Spark, and HBase are some of the tools used for real-time processing of big data.

  • ZooKeeper is used for distributed process coordination.

Conclusion

This concludes the chapter: Introduction to Big Data. In the next chapter, we will introduce Apache Storm.
