
Introduction to Big Data and Hadoop Tutorial

Welcome to the first lesson of the ‘Introduction to Big Data and Hadoop’ tutorial (part of the Introduction to Big Data and Hadoop course). This lesson provides an introduction to Big Data and to Hadoop as a Big Data technology.

Let us explore the objectives of this lesson in the next section.

Objectives

By the end of this lesson, you will be able to:

  • Explain the characteristics of Big Data

  • Describe the basics of Hadoop and HDFS architecture

  • List the features and processes of MapReduce

  • Describe the basics of Pig

In the next section of this tutorial, we will focus on the need for Big Data.

Need for Big Data

Following are the reasons why Big Data is needed.

By one estimate, around 90% of the world’s data has been created in the last two years alone. Moreover, 80% of that data is unstructured or available in widely varying structures, which are difficult to analyze.

As IT systems are being developed, it has been observed that structured formats like databases have some limitations with respect to handling large quantities of data.

It has also been observed that it is difficult to integrate information distributed across multiple systems.

Further, most business users do not know what should be analyzed and discover requirements only during the development of IT systems. As data has grown, so have ‘data lakes’ within enterprises.

Potentially valuable data in systems such as Enterprise Resource Planning (ERP) and Supply Chain Management (SCM) is either dormant or discarded. It is often too expensive to integrate large volumes of unstructured data.

Like natural resources, information has a short useful lifespan and is best exploited within a limited time span.

Further, information is best exploited for business value if a context is added to it.

In the next section of this tutorial, we will focus on the characteristics of Big Data.

Three Characteristics of Big Data

Big Data has three characteristics, namely, variety, velocity, and volume.

Variety

Variety encompasses managing the complexity of data in many different structures, ranging from relational data to logs and raw text.

Velocity

Velocity accounts for the streaming of data and the movement of large volumes of data at high speed.

Volume

Volume denotes the huge scaling of data ranging from terabytes to zettabytes and more.

Characteristics of Big Data Technology

In this section, we will discuss the characteristics of Big Data technology.

Big Data technology helps to respond to the characteristics discussed in the previous section. It helps to process the growing volumes of data in a cost-efficient way.

For example, as per IBM, Big Data technology has helped to turn the 12 terabytes of Tweets created daily into improved product sentiment analysis. It has converted 350 billion annual meter readings to better predict power consumption.

Big Data technology also helps to respond to the increasing velocity of data.

For example, it has scrutinized 5 million trade events created daily to identify potential frauds. It has helped to analyze 500 million daily call detail records in real time to predict customer churn faster.

Big Data technology can collectively analyze the wide variety of data.

For example, it has helped to monitor hundreds of live video feeds from surveillance cameras to target points of interest for security agencies. It has also been able to exploit the 80% data growth in images, videos, and documents to improve customer satisfaction.

According to Gartner.com, Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

In the next section of this tutorial, we will focus on the appeal of Big Data technology.

The Appeal of Big Data Technology

Following are the reasons for the popularity of Big Data technology:

  • Big Data technology helps to manage and process a large amount of data in a cost-efficient manner.

  • It analyzes all available data in their native forms, which can be unstructured, structured, or streaming.

  • It captures data from fast-happening events in real time.

  • Big Data technology is able to handle the failure of isolated nodes and tasks assigned to such nodes.

  • It can turn data into actionable insights.

In the next section of this tutorial, we will focus on handling the limitations of Big Data.

Handling Limitations of Big Data

There are two key challenges that need to be addressed by Big Data technology.

These are handling the system uptime and downtime, and combining data accumulated from all systems.

To overcome the first challenge, Big Data technology uses commodity hardware for data storage and analysis.

Further, it helps to maintain a copy of the same data across clusters.

To overcome the second challenge, Big Data technology analyzes data across different machines and subsequently, merges the data.

In the next section of this tutorial, we will introduce Hadoop, which helps to overcome these challenges.


Introduction to Hadoop

Hadoop helps to leverage the opportunities provided by Big Data and overcome the challenges it encounters.

What is Hadoop?

Hadoop is an open-source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is based on the Google File System (GFS).

Why Hadoop?

Hadoop runs a number of applications on distributed systems with thousands of nodes involving petabytes of data. It has a distributed file system, called Hadoop Distributed File System or HDFS, which enables fast data transfer among the nodes.

Further, it leverages a distributed computation framework called MapReduce.

In the next section of this tutorial, we will focus on Hadoop configuration.

Hadoop Configuration

Hadoop supports three configuration modes when it is implemented on commodity hardware:

  • Standalone mode

  • Pseudo-distributed mode

  • Fully distributed mode

In standalone mode, all Hadoop services run in a single JVM (Java Virtual Machine) on a single machine.

In pseudo-distributed mode, each Hadoop service runs in its own JVM, but still on a single machine.

In fully distributed mode, the Hadoop services run in individual JVMs, but these JVMs reside on different commodity hardware in a single cluster.
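In practice, the mode is selected through Hadoop's XML configuration files (core-site.xml and the MapReduce site file) rather than in code, and the exact property names differ between Hadoop releases. The minimal Java sketch below is only meant to show which settings typically distinguish standalone from pseudo-distributed operation; the host name and port are assumptions.

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative only: these values normally live in core-site.xml and the
// MapReduce configuration file, not in client code.
public class HadoopModeSettings {

    // Standalone mode: everything runs in one JVM against the local file system.
    public static Configuration standalone() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");          // local file system, no HDFS daemons
        conf.set("mapreduce.framework.name", "local"); // run MapReduce in-process
        return conf;
    }

    // Pseudo-distributed mode: HDFS and MapReduce daemons run in separate JVMs,
    // but all on a single machine (assumed host and port below).
    public static Configuration pseudoDistributed() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        conf.set("dfs.replication", "1"); // only one DataNode, so keep a single replica
        return conf;
    }
}
```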

In the next section of this tutorial, we will discuss the core components of Apache Hadoop.

Apache Hadoop Core Components

There are two major components of Apache Hadoop.

They are Hadoop Distributed File System, abbreviated as HDFS, and Hadoop MapReduce.

HDFS is used to manage the storage aspects of Big Data, whereas MapReduce is responsible for processing jobs in a distributed environment.

In the next two sections, we will discuss the core components in detail. We will start with HDFS in the next section.

Hadoop Core Components – HDFS

HDFS is used for storing and retrieving unstructured data.

Some of the key features of Hadoop HDFS are as follows.

HDFS provides high-throughput access to data blocks. When unstructured data is uploaded on HDFS, it is converted into data blocks of fixed size.

The data is chunked into blocks so that it is compatible with the storage available on commodity hardware.

HDFS provides a limited interface for managing the file system. It allows you to scale the resources in the Hadoop cluster up or down.

HDFS creates multiple replicas of each data block and stores them in multiple systems throughout the cluster to enable reliable and rapid data access.
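As a concrete illustration, the short Java sketch below writes a file to HDFS and reads it back through the org.apache.hadoop.fs.FileSystem API; HDFS splits the file into blocks and replicates them behind the scenes. The NameNode address and file path are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");     // assumed path

        // On write, HDFS chunks the file into fixed-size blocks (64 MB in older
        // releases, 128 MB in newer ones) and replicates each block in the cluster.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read the same data back through the same limited file-system interface.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```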

In the next section of this tutorial, we will focus on MapReduce as a core component of Hadoop.

Hadoop Core Components – MapReduce

The MapReduce component of Hadoop is responsible for processing jobs in distributed mode.

Some of the key features of the Hadoop MapReduce component are as follows:

  • It performs distributed data processing using the MapReduce programming paradigm.

  • It allows you to specify a user-defined map phase, which performs parallel, shared-nothing processing of the input.

  • The output of the map phase is aggregated by a user-defined reduce phase after the mapping process.

In the next section of this tutorial, we will focus on the HDFS architecture.

HDFS Architecture

A typical HDFS setup comprises the three essential services of Hadoop:

  • NameNode

  • DataNode

  • Secondary NameNode services

The NameNode and the Secondary NameNode services constitute the master service, whereas the DataNode service falls under the slave service.

The master server is responsible for accepting a job from clients and ensuring that the data required for the operation will be loaded and segregated into chunks of data blocks. HDFS exposes a file system namespace and allows user data to be stored in files.

A file is split into one or more blocks that are stored and replicated in DataNodes. The data blocks are then distributed to the DataNode systems within the cluster. This ensures that replicas of the data are maintained.
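The block-and-replica layout is visible to clients through a metadata call to the NameNode. The small sketch below assumes the cluster and file path from the earlier example and that fs.defaultFS is picked up from core-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the cluster (e.g., via core-site.xml).
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt"); // assumed path

        // The NameNode answers this metadata query; no block data is read.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block reports the DataNodes that hold one of its replicas.
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```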

In the next section of this tutorial, we will introduce the Ubuntu Server.

Ubuntu Server – Introduction

Ubuntu is a leading open-source platform for scale-out. Ubuntu helps in the optimum utilization of infrastructure, irrespective of whether you want to deploy a cloud, a web farm, or a Hadoop cluster.

Following are the benefits of Ubuntu Server:

  • It has the required versatility and performance to help you get the most out of your infrastructure.

  • Ubuntu services ensure efficient system administration with Landscape.

  • These services provide access to Ubuntu experts as and when required, and enable fast resolution of a problem.

In the next section, we will discuss Hadoop installation.

Hadoop Installation – Prerequisites

To install Hadoop, you need to have a VM installed with the Ubuntu Server 12.04 LTS operating system. You also need high-speed internet access to update the Operating System and download the Hadoop files to the machine.

In the next section, we will discuss Hadoop multi-node installation.

Hadoop Multi-Node Installation – Prerequisites

For Hadoop multi-node installation, you require an Ubuntu Server 12.04 VM preconfigured in Hadoop pseudo-distributed mode. You will also need to ensure that the VM has Internet access so that it can download updates if required.

In the next section, we will differentiate between single-node and multi-node clusters.

Single-Node Cluster vs. Multi-Node Cluster

The differences between a single-node cluster and a multi-node cluster are as follows:

Single-node cluster: Hadoop is installed on a single system or node. Single-node clusters are used to run trivial processes and simple MapReduce and HDFS operations; they are also used as a testbed.

Multi-node cluster: Hadoop is installed on multiple nodes, ranging from a few to thousands. Multi-node clusters are used for complex computational requirements, including analytics.

In the next section, we will focus on MapReduce in detail.

MapReduce

MapReduce is a programming model. It is also an associated implementation for processing and generating large data sets with parallel and distributed algorithms on a cluster.

MapReduce operation includes specifying the computation in terms of a map and a reduce function. It makes parallel computation across large-scale clusters of machines possible.

MapReduce handles machine failures and performance issues. It also ensures efficient communication between nodes performing the jobs.

Computational processing can occur on data stored either in a file system (unstructured data) or in a database (structured data).

MapReduce can be applied to significantly larger datasets than a single "commodity" server could handle.

In the next section, we will discuss the characteristics of MapReduce.

Characteristics of MapReduce

Some characteristics of MapReduce are listed below.

  • MapReduce is designed to handle very large-scale data in the range of petabytes, exabytes, and so on.

  • It works well on write-once, read-many (WORM) data. MapReduce allows parallelism without mutexes.

  • The Map and Reduce operations are typically performed by the same physical processor.

  • Operations are provisioned near the data; that is, data locality is preferred.

  • Commodity hardware and storage are leveraged in MapReduce.

  • The runtime takes care of splitting and moving data for operations.

In the next section, we will list some of the real-time uses of MapReduce.

Real-Time Uses of MapReduce

Some of the real-time uses of MapReduce are as follows:

  • Simple algorithms such as grep, text indexing, and reverse indexing.

  • Data-intensive computing, such as sorting, also uses MapReduce.

  • Data mining operations like Bayesian classification use this technique.

  • Search engine operations such as keyword indexing, ad rendering, and PageRank commonly use MapReduce, as does enterprise analytics.

  • Gaussian analysis for locating extraterrestrial objects in astronomy has also found MapReduce to be a good technique.

  • There is also good potential for MapReduce in the semantic web and Web 3.0.

In the next section, we will discuss the prerequisites for Hadoop installation in Ubuntu Desktop 12.04.

Prerequisites for Hadoop Installation in Ubuntu Desktop 12.04

To install Hadoop in Ubuntu Desktop 12.04, you need a VM with Ubuntu Desktop 12.04 and Eclipse installed, as well as a high-speed internet connection.

In the next section, we will list the key features of Hadoop MapReduce.

Hadoop MapReduce – Features

MapReduce functions use key/value pairs.

Some of the key features of the Hadoop MapReduce framework are as follows (a short word-count sketch follows the list):

  • The framework converts each record of input into a key/value pair, which is a one-time input to the map function.

  • The map output is also a set of key/value pairs which are grouped and sorted by keys.

  • The reduce function is called once for each key, in sort sequence, with the key and set of values that share that key.

  • The reduce method may output an arbitrary number of key/value pairs, which are written to the output files in the job output directory.
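The classic word-count example illustrates these features. The sketch below uses the standard org.apache.hadoop.mapreduce API; the class names and tokenizing logic are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The framework turns each input record (byte offset, line of text) into a
    // key/value pair and passes it to map() once.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // map output: key/value pairs
                }
            }
        }
    }

    // Map outputs are grouped and sorted by key; reduce() is called once per key
    // with all the values that share that key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum)); // written to the job output directory
        }
    }
}
```

Each input record arrives as a (byte offset, line of text) pair, the mapper emits (word, 1) pairs, and the reducer is invoked once per word with all the counts that share that key.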

In the next section, we will explore the processes related to Hadoop MapReduce.

Hadoop MapReduce – Processes

The framework provides two processes that handle the management of MapReduce jobs.

They are the TaskTracker and JobTracker services.

TaskTracker Services

The TaskTracker service resides on the DataNode. The TaskTracker manages the execution of individual map and reduce tasks on a compute node in the cluster.

JobTracker Services

The JobTracker service resides on the system where the NameNode service resides. The JobTracker accepts job submissions, provides job monitoring and control, and manages the distribution of tasks to the TaskTracker nodes.
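The driver below submits a job to these services. It is a sketch that assumes the WordCount mapper and reducer shown earlier and illustrative input and output paths; in Hadoop 2 and later, YARN replaces the JobTracker/TaskTracker pair, but the client-side submission code is essentially the same.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class); // optional local aggregation
        job.setReducerClass(WordCount.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));   // assumed paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // The job-tracking service accepts the submission, schedules map and
        // reduce tasks on worker nodes, and reports progress back to the client.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```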

In the next section, we will focus on advanced HDFS.

Advanced HDFS – Introduction

The Hadoop Distributed File System is a block-structured, distributed file system. It is designed to run on clusters of small commodity machines in such a way that running jobs perform better than they would on a single standalone dedicated server.

HDFS provides the storage solution to store Big Data and make the data accessible to Hadoop services.

Some of the settings in advanced HDFS are HDFS benchmarking, setting up HDFS block size, and decommissioning or removing a DataNode.

In the next section, we will focus on advanced MapReduce.

Advanced MapReduce

Hadoop MapReduce uses data types when it works with user-given mappers and reducers. The data is read from files into mappers and emitted by mappers to reducers.

Processed data is sent back by the reducers, and the data emitted by the reducers goes into output files. At every step, data is stored in Java objects.

In the Hadoop environment, objects that are written to or read from files, or sent across the network, must implement a particular interface called Writable. This interface allows Hadoop to read and write data in a serialized form for transmission.
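For illustration, a custom value type might be written as follows; the class and field names are hypothetical. Hadoop calls write() when it serializes the object and readFields() when it reconstructs it on another node.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative custom value type; Hadoop serializes it with write() and
// rebuilds it with readFields() when shuttling data between nodes.
public class PageViewWritable implements Writable {
    private String url;
    private long views;

    public PageViewWritable() { }                  // required no-arg constructor

    public PageViewWritable(String url, long views) {
        this.url = url;
        this.views = views;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(views);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        views = in.readLong();
    }

    @Override
    public String toString() {
        return url + "\t" + views;
    }
}
```

Types used as keys must additionally implement WritableComparable so that the framework can sort them during the shuffle.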

In the next section, we will focus on the data types in Hadoop and their functions.

Data Types in Hadoop

The data types in Hadoop and their functions are listed below; a short usage sketch follows the list.

  • Text: stores String data

  • IntWritable: stores Integer data

  • LongWritable: stores Long data

  • FloatWritable: stores Float data

  • DoubleWritable: stores Double data

  • BooleanWritable: stores Boolean data

  • ByteWritable: stores Byte data

  • NullWritable: placeholder used when a value is not needed
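A few lines of Java show how these wrapper types are created and unwrapped in mapper or reducer code; the values are arbitrary examples.

```java
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class WritableExamples {
    public static void main(String[] args) {
        Text name = new Text("hadoop");           // wraps a String
        IntWritable count = new IntWritable(42);  // wraps an int
        BooleanWritable flag = new BooleanWritable(true);

        // Unwrap back to plain Java types.
        String s = name.toString();
        int n = count.get();
        boolean b = flag.get();

        // NullWritable is a singleton placeholder for "no value".
        NullWritable nothing = NullWritable.get();

        System.out.println(s + " " + n + " " + b + " " + nothing);
    }
}
```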

In the next section, we will introduce the concept of distributed cache.

Distributed Cache

Distributed Cache is a Hadoop feature to cache files needed by applications.

Following are the functions of distributed cache (a usage sketch follows the list):

  • It helps to boost efficiency when a map or a reduce task needs access to common data.

  • It lets a cluster node read the imported files from its local file system, instead of retrieving the files from other cluster nodes.

  • It allows both single files and archives such as zip and tar.gz.

  • It copies files only to slave nodes. If there are no slave nodes in the cluster, distributed cache copies the files to the master node.

  • It allows access to the cached files from mapper or reducer applications; to use them, make sure that the current working directory is added to the application path.

  • It allows referencing the cached files as though they are present in the current working directory.
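A sketch of typical usage is shown below. It assumes the Hadoop 2 style of the API, where files are registered on the Job and retrieved from the task Context (older releases expose the same feature through the DistributedCache class); the file path and tab-separated lookup format are assumptions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    // Driver side: register the file with the job (illustrative path).
    public static void configure(Job job) throws Exception {
        job.addCacheFile(new URI("/user/demo/lookup.txt"));
    }

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> lookup = new HashMap<>();

        // Task side: the cached file has been copied to the node's local disk and
        // can be read as if it were in the current working directory.
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            URI[] cached = context.getCacheFiles();
            String localName = new Path(cached[0].getPath()).getName(); // "lookup.txt"
            try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Treat each input line as a key to enrich from the cached lookup table.
            String enriched = lookup.getOrDefault(value.toString(), "unknown");
            context.write(value, new Text(enriched));
        }
    }
}
```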

In the next section, we will understand joins in MapReduce.

Joins in MapReduce

Joins are relational constructs you can use to combine relations. In MapReduce, joins are applicable in situations where two or more datasets need to be combined.

A join is performed either in the Map phase or the Reduce phase by taking advantage of the MapReduce Sort-Merge architecture.

The various join patterns available in MapReduce are listed below (a reduce-side join sketch follows the list):

  • Reduce side join

  • Replicated join

  • Composite join

  • Cartesian product
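Of these patterns, the reduce-side join is the most general. The compact sketch below joins two tab-separated datasets on a shared ID by tagging each record with its source; the dataset names and formats are illustrative.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

    // Each mapper emits (join key, tagged record) so the shuffle brings matching
    // records from both datasets to the same reduce() call.
    public static class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2); // id \t customer name
            if (fields.length == 2) {
                context.write(new Text(fields[0]), new Text("C\t" + fields[1]));
            }
        }
    }

    public static class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2); // id \t order details
            if (fields.length == 2) {
                context.write(new Text(fields[0]), new Text("O\t" + fields[1]));
            }
        }
    }

    // The reducer sees all tagged records for one key and pairs them up.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text id, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            List<String> customers = new ArrayList<>();
            List<String> orders = new ArrayList<>();
            for (Text record : records) {
                String[] tagged = record.toString().split("\t", 2);
                if ("C".equals(tagged[0])) {
                    customers.add(tagged[1]);
                } else {
                    orders.add(tagged[1]);
                }
            }
            for (String customer : customers) {
                for (String order : orders) {
                    context.write(id, new Text(customer + "\t" + order));
                }
            }
        }
    }
}
```

In the driver, each mapper would be attached to its own input path with MultipleInputs.addInputPath. The replicated and composite joins avoid this shuffle by doing the work on the map side, when one input is small enough to cache or both inputs are pre-sorted and pre-partitioned.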

In the next section, we will focus on an introduction to Pig.

Introduction to Pig

Pig is one of the components of the Hadoop ecosystem. It is a high-level data flow scripting language.

Pig runs on Hadoop clusters. It was initially developed at Yahoo!, as its developers did not want to write Java for every Hadoop operation. Later, Pig became an Apache open-source project.

Pig uses HDFS for storing and retrieving data and Hadoop MapReduce for processing Big Data.

In the next section, we will discuss the major components of Pig.

Components of Pig

The two major components of Pig are the Pig Latin scripting language and a runtime engine.

The Pig Latin scripting language is a procedural data flow language. It contains syntax and commands that can be applied to implement business logic.

Examples of Pig Latin commands are LOAD, STORE, etc.

The runtime engine is a compiler that produces sequences of MapReduce programs. It uses HDFS for storing and retrieving data and interacts with the Hadoop system, that is, HDFS and MapReduce. It parses, validates, and compiles the script operations into a sequence of MapReduce jobs.
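To keep the examples in Java, the sketch below drives the runtime engine through Pig's PigServer API; the dataset, schema, and output path are assumptions. The registered Pig Latin statements are what the engine parses, validates, and compiles into MapReduce jobs.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigRuntimeExample {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode sends compiled jobs to the Hadoop cluster;
        // ExecType.LOCAL would run them in-process for testing.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Illustrative Pig Latin: the runtime engine parses and validates these
        // statements, then compiles them into a sequence of MapReduce jobs.
        pig.registerQuery("users = LOAD '/user/demo/users.txt' USING PigStorage(',') "
                + "AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER users BY age >= 18;");

        // Nothing runs until output is requested (lazy evaluation).
        pig.store("adults", "/user/demo/adults_out");

        pig.shutdown();
    }
}
```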

In the next section, we will understand the data model associated with Pig.

Pig Data Model

As part of its data model, Pig supports four basic types.

Atom

Atom is a simple atomic value like int, long, double, or string.  

Tuple

A tuple is a sequence of fields that can be of any data type.

Bag

A bag is a collection of tuples of potentially varying structures and can contain duplicates.

Map

A map is an associative array. The key must be a chararray but the value can be of any type.

In the next section, we will differentiate between Pig and SQL.

Pig vs. SQL

The differences between Pig and SQL are summarized below.

  • Definition: Pig is a scripting language used to interact with HDFS; SQL is a query language used to interact with databases.

  • Query style: Pig follows a step-by-step execution style; SQL follows a single-block execution style.

  • Evaluation: Pig uses lazy evaluation; SQL uses immediate evaluation.

  • Pipeline splits: Pig supports pipeline splits; SQL requires the join to be run twice or materialized.

The first difference between Pig and SQL is that Pig is a scripting language used to interact with HDFS while SQL is a query language used to interact with databases residing in the database engine.

In terms of query style, Pig offers a step-by-step execution style compared to the single-block execution style of SQL.

Pig does a lazy evaluation, which means that data is processed only when the STORE or DUMP command is encountered. However, SQL offers an immediate evaluation of a query.

Pipeline splits are supported in Pig, but in SQL, you may need to run the join twice or materialize it as an intermediate result.

In the next section, we will discuss the prerequisites for setting the environment for Pig Latin.


Prerequisites to Set the Environment for Pig Latin

Ensure the following prerequisites are met while setting up the environment for Pig Latin:

  • Ensure all Hadoop services are running

  • Ensure Pig is installed and configured

  • Ensure all datasets are uploaded to the NameNode, that is, HDFS.

Summary

Let us summarize the topics covered in this lesson:

  • Big Data has three characteristics, namely, variety, velocity, and volume.

  • Hadoop HDFS and Hadoop MapReduce are the core components of Hadoop.

  • One of the key features of MapReduce is that the map output is a set of key/value pairs which are grouped and sorted by key.

  • TaskTracker manages the execution of individual map and reduce tasks on a compute node in the cluster.

  • Pig is a high-level data flow scripting language. It uses HDFS for storing and retrieving data.

Conclusion

In the next lesson, we will focus on Hive, HBase, and components of the Hadoop ecosystem.
