Big Data Overview Tutorial

1.1 Lesson 1—Big Data Overview

Hello and welcome to lesson one of the Apache Kafka Developer course offered by Simplilearn. This lesson provides an overview of big data.

1.2 Objectives

After completing this lesson, you will be able to:
• Describe Big Data.
• List the three V's of Big Data.
• List the various data sizes used for Big Data.
• Describe Apache Hadoop.
• Explain the concepts of real-time Big Data processing.
• List some tools that handle real-time Big Data.

1.3 Big Data—Introduction

Digital data has exploded over the last two to three years. Facebook, Twitter, YouTube, and sensor networks have contributed to the rapid growth of data. Data volumes are now measured in millions of Gigabytes. Technology has evolved to store, process, and analyze these large volumes of data, and to make decisions based on them. The term big data refers to this technology for utilizing large volumes of data.

1.4 The Three Vs of Big Data

Big data is typically characterized by three V's: volume, velocity, and variety. Other characteristics include veracity, which refers to the truthfulness of data, as well as visualization, value, and so on.

1.5 Data Volume

Volume refers to the size of digital data. The impact of the Internet and social media has resulted in an explosion of digital data. Data has grown from Gigabytes to Terabytes, Petabytes, Exabytes, and Zettabytes. In 2008, the total data on the Internet was eight Exabytes. It exploded to 150 Exabytes by 2011 and reached 670 Exabytes in 2013. It is expected to exceed seven Zettabytes in the next 10 years.

1.6 Data Sizes

The table below shows the various sizes used for big data, giving the power and description of each data term. The terms Kilobyte, Megabyte, and Gigabyte are familiar. A Terabyte is about 1000 Gigabytes, and a Petabyte is about 1000 Terabytes. The new terms added to address big data sizes are Exabyte, Zettabyte, and Yottabyte. Typically, the term big data refers to data sizes of Terabytes or more.
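
For reference, the standard decimal definitions of these units are:

    Unit        Power    Description
    Kilobyte    10^3     1,000 bytes
    Megabyte    10^6     1,000 Kilobytes
    Gigabyte    10^9     1,000 Megabytes
    Terabyte    10^12    1,000 Gigabytes
    Petabyte    10^15    1,000 Terabytes
    Exabyte     10^18    1,000 Petabytes
    Zettabyte   10^21    1,000 Exabytes
    Yottabyte   10^24    1,000 Zettabytes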

1.7 Data Velocity

Velocity of data refers to the speed of data ingestion or data growth. Millions of web pages get added every day. Data also gets created from different sources such as desktops, laptops, mobiles, tablets, and sensors. Manufacturing facilities have thousands of sensors that generate data every few seconds. People use different devices to create data throughout the day. As global customer bases, transactions, and interactions with customers increase, data created within an organization grows along with external data. There are many contributors to data growth, such as web and social media, online billing systems, ERP implementations, networks, and machines. Growth in the revenues of organizations indicates growth in data.

1.8 Data Variety

Data variety refers to the types of data, such as text, images, audio, video, XML, and HTML. There are three types of data: structured data, where the data is represented in a tabular format, for example, MySQL databases; semi-structured data, where the data does not have a formal data model, for example, XML files; and unstructured data, where there is no predefined data model and the data is defined at run time, for example, text files. Industries such as transport, science, and finance add a variety of data every day.

1.9 Data Evolution

Digital data has evolved over 30 years, starting with unstructured data. Initially, data was created as plain text documents. Next, files, data handling applications, and spreadsheets increased the usage of digital computers. The introduction of relational databases revolutionized structured data, as many organizations used them to create large amounts of structured data. Data then expanded to data warehouses and Storage Area Networks, or SANs, to handle large volumes of structured data. Then, the concept of metadata was introduced to describe structured and semi-structured data. Finally, with the advent of social media such as Facebook and Twitter, unstructured data has exploded in the last few years. The images depict this evolution, starting with unstructured data, moving through structured databases, and ending with unstructured social media comments.

1.10 Features of Big Data

Some of the features of big data are as follows: Big data is extremely fragmented due to the variety of data. It does not provide decisions; however, it can be used to make them. It includes both unstructured and structured data, where structured data extends and complements unstructured data; however, big data is not a substitute for structured data. Furthermore, since most of the information on the Internet is available to anyone, it can be misused by anti-social elements. Big data is also wide: there could be hundreds of fields in each line of data. Next, big data is dynamic, as gigabytes and terabytes of data are created every day. Finally, big data can be internal, generated within the organization, or external, generated on social media.

1.11 Industry Examples

Here are some industry examples of big data. In the retail sector, big data is used extensively for affinity detection and for performing market analysis. A retail company wants to find the product Y that a customer is most likely to buy after buying product X, so that it can place product Y next to product X and ensure a pleasant shopping experience. Credit card companies can detect fraudulent purchases quickly and alert customers. While giving loans, banks examine the private and public data of a customer to minimize risk. In medical diagnostics, doctors diagnose a patient's illness based on symptom data instead of intuition. Digital marketers process huge amounts of customer data to find effective marketing channels. Based on the last 20 to 30 years of stock market data, algorithmic trading can maximize profits on a portfolio. Insurance companies use big data to minimize insurance risks; an individual's driving data can be captured automatically and sent to the insurance company to calculate the premium for risky drivers. Manufacturing units and oil rigs have sensors that generate Gigabits of data every day, which are analyzed to reduce the risk of equipment failures. Advertisers use demographic data to identify target audiences. Terabytes and Petabytes of data are analyzed in the field of genetics to design new models. Power grids analyze large amounts of historical and weather forecasting data to forecast power consumption. As data is available to the public, law enforcement officials must take the necessary measures to detect misuse of data and prevent crimes.

1.12 Big Data Analysis

In traditional analytics methods, analysts take a representative sample of the data to perform analysis and draw conclusions. Using big data technology, the entire dataset can be used instead of a sample. Big data analysis helps to find associations in data, predict future outcomes, and perform prescriptive analysis, where the outcome is a definitive answer rather than a probable one. Big data analysis helps organizations make data-driven decisions instead of intuition-based decisions. It also helps them increase their safety standards, reduce maintenance costs, and prevent failures.

1.13 Technology Comparison

Traditional technology can be compared with big data technology in the following ways: Traditional technology has a limit on scalability, whereas big data technology is highly scalable. Traditional technology uses highly parallel processors on a single machine, whereas big data technology uses distributed processing across multiple machines. In traditional technology, processors may be distributed with the data on a single machine; in big data technology, the data itself is distributed to multiple machines. Traditional technology depends on high-end, expensive hardware that costs more than $40,000 per terabyte, whereas big data technology leverages commodity hardware that costs less than $5,000 per terabyte. Traditional technology uses storage technologies such as SAN, whereas big data technology uses distributed data with data redundancy. The cost factor is an important reason for Chief Technology Officers, or CTOs, and Chief Executive Officers, or CEOs, to lean towards big data technology.

1.14 Stream

In computing, a stream represents a continuous sequence of bytes of data. It is produced by one program and consumed by another, in first-in-first-out sequence. For example, if 12345 is produced by one program, another program consumes it in the order 12345 only. A stream can be bounded or unbounded. Bounded means that the data is limited. Unbounded means that there is no limit: the producer keeps producing data as long as it runs, and the consumer keeps consuming it. A Linux pipe is an example of a stream. In the command cat logfile | wc -l, cat logfile produces a stream that is consumed by wc -l to display the number of lines in the file.
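
The same idea can be written as a minimal Java sketch: a hypothetical LineCount program (not part of this course) that consumes the stream produced by cat logfile and counts the lines, just as wc -l does.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Consumes lines from standard input in first-in-first-out order and counts them.
    public class LineCount {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            long lines = 0;
            while (in.readLine() != null) {  // readLine returns null when a bounded stream ends
                lines++;
            }
            System.out.println(lines);
        }
    }

Running cat logfile | java LineCount behaves like the wc -l example above. If the producer never stops, the stream is unbounded and the loop simply keeps consuming.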

1.15 Apache Hadoop

Apache Hadoop is the most popular framework for big data processing. It has two core components: Hadoop Distributed File System, or HDFS, and MapReduce. It uses HDFS to distribute the data across multiple machines, and MapReduce to distribute the processing across multiple machines. It follows the principle of moving the processing to the data instead of the data to the processing. First, HDFS divides the data into multiple sets, such as Data 1, Data 2, and Data 3. Next, MapReduce distributes the datasets to multiple machines, such as CPU 1, CPU 2, and CPU 3. Finally, processing is completed by the CPU of each machine where the data is stored: CPU 1 processes Data 1, CPU 2 processes Data 2, and CPU 3 processes Data 3.

1.16 Hadoop Distributed File System

HDFS is the storage component of Hadoop. It stores each file as blocks, with a default block size of 64 megabytes, which is much larger than the block size on Windows, typically 1 KB or 4 KB. HDFS is a write-once, read-many-times (WORM) file system. Blocks are replicated across nodes in a cluster, and HDFS keeps three replicas of each block by default. The image illustrates the concept. For example, a 320 MB file stored in HDFS is divided into five blocks of 64 megabytes each, as 64 multiplied by five is 320. With three replicas of each block, there are 15 blocks in total. If there are five nodes in the cluster, these blocks are distributed across the five nodes so that no two replicas of the same block are on the same node.
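
As a quick check of the arithmetic above, here is a minimal Java sketch that computes the block and replica counts for the 320 MB example. The sizes are hard-coded from the example, not read from a real cluster.

    // Block arithmetic for the example: 320 MB file, 64 MB blocks, replication factor 3.
    public class BlockMath {
        public static void main(String[] args) {
            long fileSizeMb = 320;   // example file size
            long blockSizeMb = 64;   // default block size used in this lesson
            int replication = 3;     // default replication factor

            long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;  // ceiling division
            long totalReplicas = blocks * replication;

            System.out.println(blocks + " blocks, " + totalReplicas + " block replicas");
            // Prints: 5 blocks, 15 block replicas
        }
    }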

1.17 MapReduce

MapReduce is the processing framework of Hadoop. It provides highly fault-tolerant, distributed processing of the data stored in HDFS. It consists of two types of tasks. Map tasks, or mappers, run in parallel on different nodes of the cluster and process the data blocks; their outputs are key-value pairs. After the map tasks complete, the results are gathered and aggregated by the reduce tasks, which summarize and consolidate the data and produce the final output of MapReduce. Each mapper runs on the data block stored on its own node, so data locality is preferred. This follows the paradigm of taking the process to the data.
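
To make the mapper and reducer roles concrete, here is a minimal sketch of the classic word-count example using Hadoop's Java MapReduce API. The class names are illustrative, and the driver class that configures and submits the job is omitted.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: runs in parallel against local data blocks and emits (word, 1) key-value pairs.
    public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit one key-value pair per word
            }
        }
    }

    // Reducer: gathers all values for one key and consolidates them into a total count.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // final output of MapReduce
        }
    }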

1.18 Real-Time Big Data Tools

Some of the tools to handle big data in real time are as follows:
• Apache Kafka
• Apache Storm
• Apache Spark
• Apache Cassandra
• Apache HBase

1.19 Apache Kafka

Kafka is a high-performance real-time messaging system. It is an open source tool and is a part of Apache projects. It provides a distributed and partitioned messaging system that is highly fault-tolerant. It can process millions of messages per second, and send the messages to many receivers.
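
As an illustration, a minimal Java producer using the Kafka client API might look like the sketch below. The broker address localhost:9092 and the topic name "events" are assumptions for this example only.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // "events" is a hypothetical topic name used only for illustration.
            producer.send(new ProducerRecord<>("events", "key1", "hello kafka"));
            producer.close();
        }
    }

Running this sends one message to the "events" topic; Kafka delivers it to any consumers subscribed to that topic.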

1.20 Apache Storm

Storm is a real-time stream processing system. It is an open source tool and is part of the Apache projects. It provides fast and reliable processing of big data. It can process unbounded streams that send data to Storm continuously. It can read input messages from message queues such as Kafka and store the processed data in a real-time big data database such as Cassandra.
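
As a minimal sketch of the Storm programming model, here is a single bolt that upper-cases each incoming message. The "message" field name and the upstream spout it would be wired to are assumptions for illustration.

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // A bolt that transforms each incoming tuple and emits the result downstream.
    public class UpperCaseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String message = tuple.getStringByField("message");   // assumed field name
            collector.emit(new Values(message.toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("message"));
        }
    }

In a full topology, a spout (for example, one reading from Kafka) would feed this bolt, and a further bolt could write the results to a database such as Cassandra.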

1.21 Apache Spark

Apache Spark is considered to be the next-generation MapReduce. It is also an Apache open source project. It is used to transform distributed data and provides data transforms beyond map and reduce. It processes data faster than Hadoop MapReduce: when the entire dataset fits in memory, Spark has been found to be up to 100 times faster than Hadoop MapReduce, and in other cases at least ten times faster. Spark is suitable for both batch and real-time processing. It provides Spark SQL as a SQL interface to big data, and built-in libraries for machine learning and graph processing. Machine learning consists of programs that can learn from data without being explicitly programmed. A graph is a set of nodes and edges connecting those nodes; graph processing consists of algorithms to process the nodes and edges of a graph.
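
As a minimal sketch of how Spark distributes a transform, here is a small Java job run in local mode: it parallelizes a list, squares each element, and reduces the results to a sum. The application name and local master setting are illustrative.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SumOfSquares {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SumOfSquares").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            int sumOfSquares = numbers.map(x -> x * x)        // transform each element
                                      .reduce(Integer::sum);  // consolidate to one value
            System.out.println(sumOfSquares);                 // prints 55
            sc.close();
        }
    }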

1.22 Apache Cassandra

Cassandra is an Apache open source NoSQL database with the following characteristics:
1. It is highly fault-tolerant, with no single point of failure (SPOF).
2. It is highly available. Machines, also called nodes, are logically organized in a ring architecture.
3. It supports real-time reads and writes.
4. It provides fast writes with tunable consistency; the level of consistency can be controlled among the multiple nodes that contain the data.
5. It provides a simple, SQL-like interface to insert, update, and select data.
6. It is a key-value database; each row of data has a primary key that identifies the data.
7. It is highly and horizontally scalable, with thousands of nodes in a cluster.
The image shows nodes organized in Cassandra's ring architecture.
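
As a minimal sketch of the SQL-like interface mentioned in point 5 above, here is an example using the DataStax Java driver (3.x-style API). The contact point, keyspace, and table names are assumptions for illustration.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    public class CassandraExample {
        public static void main(String[] args) {
            // Connect to a single local node; in a real cluster the driver discovers the ring.
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {

                session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                        + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
                session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");

                // Insert and select using CQL, Cassandra's SQL-like query language.
                session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Ada')");
                ResultSet rs = session.execute("SELECT name FROM demo.users WHERE id = 1");
                System.out.println(rs.one().getString("name"));  // prints: Ada
            }
        }
    }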

1.23 Apache HBase

Apache HBase is another open source NoSQL database. It is a distributed database with columnar storage that is built on top of HDFS. It provides real-time, random read and write access to data. It supports large databases on the order of Terabytes and Petabytes. It is not relational and does not support SQL.
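
A minimal sketch of real-time random write and read access with the HBase Java client follows. The table name "users", column family "info", and row key are assumptions, and the table is assumed to already exist; the configuration is read from an hbase-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // loads cluster settings
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Random write: one row keyed "row1", one column in family "info".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
                table.put(put);

                // Random read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }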

1.24 Real-Time Big Data Tools—Uses

Real-time big data refers to handling a massive amount of business data as soon as the data is created, to get valuable insights and prescribe immediate actions. Here, real-time means processing the data as the event occurs. Using real-time big data tools, you can:
• Read and write data in real time
• Filter and aggregate data in real time
• Visualize data in real time
• Process millions of records per second

1.25 Real-Time Big Data—Use Cases

Some use cases for real-time big data are as follows:
1. A telecom provider wants to offer data plans to customers based on location. Here, the location data is received continuously and has to be processed in real time.
2. A bank wants to indicate the nearest ATM based on the customer's location. Here, the customer location data is received in real time and a recommendation has to be made immediately.
3. A car manufacturer can alert a car owner about any urgent maintenance the car needs, based on the data the car provides while being driven. Measurement data from the car's various sensors has to be streamed to the manufacturer in real time.
4. A news channel may monitor breaking news items across the globe. Real-time news data from hundreds of sources has to be prioritized and selected for breaking news.
5. A security system may monitor movements in a stadium during a game. Any suspicious movements need to be reported immediately.
6. A telecom network provider wants to use the least congested network for each call. The decisions have to be made in real time.
7. A credit card company wants to prevent fraudulent transactions. Here, both real-time and offline processing may be involved.
8. A stock market application recommends stocks to buy every second, based on market conditions. Volatile market conditions have to be analyzed in real time.

1.26 Quiz

A few questions will be presented in the following screens. Select the correct option and click submit to see the feedback.

1.27 Summary

Here is a quick recap of what we have learned in this lesson:
• Big data is typically characterized by three V's: volume, variety, and velocity.
• The various data sizes used for big data include Kilobyte, Megabyte, Gigabyte, Terabyte, Petabyte, Exabyte, Zettabyte, and Yottabyte.
• Apache Hadoop is the most popular framework for big data processing. It has two core components: the Hadoop Distributed File System and MapReduce.
• Real-time big data refers to handling a massive amount of business data as soon as the data is created, to get valuable insights and prescribe immediate actions. Kafka, Storm, Cassandra, Spark, and HBase are some of the tools that handle real-time processing of big data.

1.28 Conclusion

This concludes ‘Big Data Overview.’ The next lesson is ‘Install and Set up VMware.’
