Hadoop Tutorial: A Beginner’s Guide

Data generation has increased by leaps and bounds in the last decade. This includes large volumes of various formats of data being generated at a very high speed. In the earlier days, it was not a hard task to manage data, but with the increase in data, it has become more difficult to store, process, and analyze it. This is also known as Big Data. How do we manage big data? Enter Hadoop — a framework used to store, process, and analyze big data. In this blog, we will discuss the following:

  1. Why Hadoop?
  2. What is Hadoop?
  3. Hadoop HDFS
  4. Hadoop MapReduce 
  5. Hadoop YARN
  6. Use case of Hadoop
  7. Demo on HDFS, MapReduce, and YARN
Looking forward to becoming a Hadoop Developer? Check out the Big Data Hadoop Certification Training course and get certified today.

Analogy

Let us now try to understand big data and why Hadoop is required with a simple analogy. Imagine this scenario wherein we have a shopkeeper, Tim, who sells grains. The customers were happy, as Tim was very quick with the orders.

Analogy

After some time, Tim sensed an excellent demand for other products, so he thought of expanding his business. Along with grains, he started selling fruits, vegetables, meat, and dairy products. As the number of customers increased, Tim found it difficult to keep up with orders. 

Analogy2

To tackle this situation, Tim decided to hire three more people to assist him with his work. There was Matt, who took care of the fruits and vegetable section, Luke, who handled the dairy and meat, and Ann, who was appointed as the cashier.

Analogy2

But this didn't solve all of Tim's problems. Although Tim had the workforce he needed, he started running out of space in his shop to store the goods he needed to fulfill the increasing demand. 

Tim solved this by distributing the space amongst the different floors of the building. The grains were sold on the ground floor, fruits and vegetables on the first floor, dairy, meat products on the second floor, and so on.

Analogy3

This was how Tim tackled his problems; let's now have a look at how this story can be compared to big data and Hadoop. 

Data generation was once limited to a single format. It could be managed with one storage unit and one processor. Data generation gradually started increasing, and new varieties of data emerged. This started happening at high speed, making it more difficult for a single processor to handle.

This is similar to how Tim found it difficult to manage alone when he expanded his business.

Analog

So just like how Tim solved this issue by expanding the workforce, multiple processors were used to processing each type of data.

Analogy4

However, it became difficult for multiple processors to access the same storage unit.

Analogy5

Finally, just like how Tim adopted the distributed storage approach, the storage system was also distributed, and by doing so, the data was stored in individual databases.

Analogy6

This story helps you relate to the two components of Hadoop: HDFS and MapReduce. HDFS refers to the distributed storage space.

Analogy7

MapReduce, on the other hand, is analogous to how each person took care of a separate section, and at the end, the customers went to the cashier for the final billing, which is similar to the reduce phase.

Analogy8

Hadoop 

Hadoop is a framework that stores and processes big data in a distributed and parallel fashion.

Hadoop

As we briefly mentioned before, Hadoop has individual components to store and process data. Let's first learn more about the storage layer of the Hadoop: Hadoop Distributed File System (HDFS). 

Hadoop HDFS

HDFS is similar to the Google File System, as it stores data across multiple machines. The data is auto replicated to various machines to prevent the loss of data. In HDFS, data is split into multiple blocks; each of these blocks has a default size of 128 MB.

Hadoop HDFS

So, how is this different from traditional storage practices? The difference is that, in the traditional systems, all of the data is stored in one database. This can be problematic — in case the database crashes; all your data is lost. It also overloads the database, and such a system is highly fault intolerant. This issue is taken care of by HDFS as the data is distributed amongst multiple machines. It is specially designed for storing massive datasets in commodity hardware, which means you can have many machines to scale across. Further, HDFS has two components that run on various machines. They are:

  1. NameNode - NameNode is the master of the storage layer HDFS. It stores all the metadata. If the machine on which the NameNode processes crashes, the cluster will be unavailable.
  2. DataNode - DataNodes are known as slave nodes. They store the actual data, and they perform the read/write operations.

Basically, the NameNode manages all the DataNodes. Signals known as heartbeats are sent by the DataNodes to the NameNode to provide status updates.

namenode

Now, let's look at how data gets split in HDFS.

HDFS Split

As you can see from the above example, we have a file of size 530 MB. This file will not be stored as it is; it will be broken down into five different blocks. The last block will only use the remaining space for storage. The data blocks are stored in various DataNodes, which are essentially just computers.

HDFS Split

So, what happens if the computer that contains Block A crashes? Will we lose our data? No: that is the beauty of HDFS; one of the prime features of HDFS is data replication. 

It creates copies of the data in various machines, and this way, even if the computer holding block A crashes, our data will be safe on another. The default replication factor in HDFS is three. This means that, in total, we will have three copies of each data block. 

Let's have a close look at replication. The concept of Rack Awareness helps to decide where a replica of the data block should be stored. Here, rack refers to a collection of 30-40 DataNodes. According to the rule of replication, the data block and its copy cannot be stored on the same DataNode.

HDFS Split

From the above image, you can see that we have Block A in rack 1 and rack 2.

As a rule, we cannot have the block and its replicas all residing on the same rack. Taking the example from Block D, it is also not ideal to have the blocks spread across all the racks, as it will increase the bandwidth requirement. Therefore, if your cluster is rack aware, it will look like Block A and Block B. Here, the blocks are not all on the same rack, hence even if one rack crashes, we don't lose our data as we have copies of it in another rack. This is how HDFS provides fault tolerance. We need to keep in mind that both the block size and replication factor can be customized.

Let's now move on to the architecture of HDFS. The figure below shows how HDFS operates. We have our NameNode, DataNodes, and client requests.

HDFS Arch

Basically, we will have two operations — they are read and write operations on HDFS.

Firstly, NameNode stores all the metadata in its RAM and also in its disk. When a particular cluster starts, we have our DataNodes and NameNode, which are active. As seen previously, DataNodes will start sending heartbeats to the NameNode every three seconds once they are active. This will be registered in the RAM of the NameNode. Moving to the NameNode's disk, we will have the formatting information that you initially set up while starting the cluster. If your cluster is rack aware, then all the data blocks will not reside on the same rack. If the client wants to read data, the request goes to the NameNode. The NameNode will then look at the associated metadata and see where the information lies. Once that is completed, the client can read the data from the underlying DataNodes. This happens in a parallel fashion from multiple DataNodes. 

When it comes to writing data, the process is similar to the read operation. The client requests the NameNode, and then the NameNode looks for DataNodes, which are available. Once the list is ready, the client writes the data into those allotted DataNodes. This is how HDFS handles both read and write operations. Let's now list out some features of HDFS:

  1. HDFS is fault-tolerant as multiple copies of data are made
  2. It provides end-to-end encryption to protect data
  3. In HDFS, multiple nodes can be added to the cluster, depending on the requirement
  4. Hadoop HDFS is flexible in storing any type of data, like structured, semi-structured, or unstructured data

Now that all the data is stored in HDFS, the next step is to process it to get meaningful information. To complete the processing, we use Hadoop MapReduce.

Big Data Hadoop Certification Training Course

Master Big Data and Hadoop EcosystemExplore Course
Big Data Hadoop Certification Training Course

Hadoop MapReduce

Why is MapReduce required in the first place? In the traditional approach, big data was processed at the master node. This was a disadvantage, as it took more time to process various types of data. To overcome this issue, data is processed at each slave node, and the final result is sent to the master node.

This is what MapReduce is for. Here, data is processed wherever it is stored. MapReduce is defined as a programming model, where huge amounts of data are processed in a parallel and distributed fashion. However, the MapReduce framework does not depend on one particular language. It can be written in Java, Python, or any other programming language. 

As the name suggests, MapReduce consists of two tasks:

  1. Map tasks
  2. Reduce tasks

Mapper is the function that takes care of the mapping phase, and similarly, Reducer functions take care of the reducing phase. Both of these functions run the Map and Reduce tasks internally. From the following diagram, let's have a look at how each step is executed in MapReduce.

MapReduce

First, your input data will be split into the number of data blocks it has. In the mapping phase, the mapper function, which consists of some code, will run on one or multiple splits. After this, comes to the shuffling and sorting phase, wherein the output of the mapping phase will be grouped for further processing. Later in the reduce phase, the results are aggregated, and a single output value is delivered. Here, the developer provides the mapper and reducer tasks. The framework itself takes care of the shuffling, sorting, and partitioning. In the following example, input data will be split, shuffled, and aggregated to get the final output. Let's have a look, step by step:

i) The input data is divided, line by line.

MapReduce1

ii) The mapper function then works on each input split, which works similarly to a word count model. The data is mapped to a (key, value) pair. Here, we have the word as the key and the value as one.

MapReduce

iii) In the next step, the data is shuffled and sorted to obtain similar keys together.

Map Phase

iv) The reducer phase aggregates the values for similar keys.

Map Phase

v) Finally, the output will consist of the list of words and their values, which display the number of its occurrence.

Map Phase

This is an example of how MapReduce tasks are performed. Some of MapReduce's features include:

  1. Load balancing is improved as the stages are split into Map and Reduce 
  2. There is an automatic re-execution if a specific task fails
  3. MapReduce has one of the simplest programming models, which is based on Java

HDFS and MapReduce were the two units of Hadoop 1.0. This version had its issues as JobTracker did both the processing of data and resource allocation. This resulted in the JobTracker being overburdened. To overcome this issue, Hadoop 2.0 introduced YARN as the processing layer. 

Hadoop YARN - Yet Another Resource Negotiator

In the previous sections, we discussed how data is stored and processed. But how is the resource allocation unit taken care of? How are the resources negotiated across the cluster? YARN takes care of this and acts as the resource management unit of Hadoop. Apache YARN consists of:

  1. Resource Manager - This acts as the master daemon. It looks into the assignment of CPU, memory, etc.
  2. Node Manager - This is the slave daemon. It reports the usage to the Resource Manager.
  3. Application Master - This works with both the Resource Manager and the Node Manager in negotiating the resources.

Let's have a look at how YARN works. To start with, as seen earlier, the client first interacts with the NameNode to understand which DataNodes are available for data processing. Once that step is complete, the client interacts with the Resource Manager, which keeps track of the available resources Node Managers has. Node Managers also send heartbeats to the Resource Managers.

Yarn

Yarn

When the client contacts the resource manager for processing, the resource manager, in return, request the required resources from multiple node managers. Here, a container is a collection of physical resources, such as CPU and RAM. Depending on the availability of these containers, the node manager responds to the resource manager. Once availability is confirmed, the resource manager will start the application master. 

The application master is a type of code that executes the application. It runs on one of the containers and uses others to execute tasks. In case the Application Master needs additional resources, it cannot contact the Node Manager directly — it has to contact the Resource Manager.

Yarn

Let's review some of YARN's features:

  1. YARN is responsible for processing job requests and allocating resources
  2. Different versions of MapReduce can run on YARN making a MapReduce upgrade manageable
  3. Depending on your requirements, you can add nodes at will

Hadoop Use Case - Pinterest 

Many companies leverage Hadoop to manage their big data sets. Let's review how the popular image sharing website, Pinterest, uses Hadoop.

Hadoop Use Case

Pinterest is a social media platform that enables you to "pin" any interesting information you find online on the site. It has more than 250 million users and nearly 30 billion pins. The platform generates large amounts of data such as login details, user behavior, most viewed pins, etc. 

In the past, Pinterest has had serious issues managing all this data. The company also had significant difficulty analyzing which data needed to be displayed in a user's personalized discovery engine. They found a solution, which was Hadoop. Continuous analysis of data enables Pinterest to provide its users with various features, such as related pins, guided search, and so on.

Big Data Career Guide

An In-depth Guide To Becoming A Big Data ExpertDOWNLOAD GUIDE
Big Data Career Guide

Hadoop Demos

Here, through individual demos, we will look into how HDFS, MapReduce, and YARN can be used. 

1. HDFS Demo

In this demo, you will look into commands that will help you write data to a two-node cluster, which has two DataNodes, two NodeManagers, and one Master machine. There are three ways to write data:

  1. Through commands
  2. By writing code
  3. Using the Graphical User interface (GUI)

Before you begin, you have to download some sample datasets, which you will use to write data into HDFS. Now, you will see how to run a few commands on HDFS. These are the commands that you can start with:

hdfs dfs -mkdir /mydata // To create a directory on HDFS

ls // This lists down the files 

hdfs dfs -copyFromLocal aba* /mydata // Copies file from local file system to HDFS

hdfs dfs -ls /mydata // Lists the directory

After this command, using the web interface, you can check if your file is replicated. If it is replicated, the screen will look like the following image:

Now, let's have a look at additional commands:

cp hadoop-hdc-datanode-m1.log cp hadoop-hdc-datanode-m2.log

cp hadoop-hdc-datanode-m1.log cp hadoop-hdc-datanode-m3.log

cp hadoop-hdc-datanode-m1.log cp hadoop-hdc-datanode-m3.log 

// Above commands creates multiple files

hdfs dfs -mkdir /mydata2 // Creates a new directory on HDFS

hdfs dfs -put hadoop-hdc-datanode-m* /mydata2 // Copies multiple files

hdfs dfs -setrep -R -w 2 /mydata2 // Sets replication factor to 2

hdfs dfs -rm -R /mydata2 // Removes data from HDFS

2. MapReduce Demo

In this MapReduce demo, you will see how to get the total count of URLs that were most frequently visited. 

First, you have to use a sample file, which has a list of URLs and some counts. To implement this program using the MapReduce approach, you will have to follow these steps:

1. Use the Mapper program below is written to perform the map task:

code

2. Use the Reducer program below to perform aggregation:

code

3. Use the driver program below to understand the mapper class, reducer class, output key format, and value format

code

code

4. After writing the code, we can export this into a jar file. In addition to the jar file, there needs to be a file in HDFS to perform MapReduce. For that, you would have to log in to a cluster first. 

5. Put the sample file (the file that has the URLs) into HDFS using the -put command.

6. Using the below command, we will run the MapReduce program and get a cumulative count of how many times a URL was visited. 

hadoop jar URLCount.jar org.example.HCodes.URLCount /user(//mention directory where the input is present) /user(//mention directory where the output should be seen - destination path)

7. After running the above code, you will see that the MapReduce job is submitted to a YARN cluster. Whenever a MapReduce program runs, we will have one or more part files created as output. In this case, we have one map task and one reduce task, and hence the number of part files will also be one. The following codes are used to display the number of part files, and the final result:

hdfs dfs -ls /user(give your destination path) //Displays the number of part files

hdfs dfs -cat /user(give your destination path)/part-r-00000(mention the part details) //Displays the final output

3. YARN Demo

Below are some of the most used YARN commands:

yarn version //Displays the Hadoop and vendor-specific distribution version

yarn application -list //Lists all the applications running

yarn application -list -appSTATES -FINISHED //Lists the services that are finished running

yarn application -status give application ID //Prints the status of the applications

yarn application -kill give application ID //Kills a running application

yarn node -list //Gives the list of node managers

yarn rmadmin -getGroups hdfs //Gives the group HDFS belongs to

Conclusion

We hope this blog helped you understand Apache Hadoop. You learned about big data, the necessity of Hadoop, and what Hadoop is all about. You got an idea of HDFS, MapReduce, and YARN. Finally, you learned how these Hadoop components work through various demos. If you want to learn more about big data and Hadoop, enroll in our Big Data Hadoop Certification Training Course today. Today!

Check out the Simplilearn's video on "Hadoop Tutorial for Beginners" curated by our Big Data experts.

About the Author

Shruti MShruti M

Shruti is an engineer and a technophile. She works on several trending technologies. Her hobbies include reading, dancing and learning new languages. Currently, she is learning the Japanese language.

View More
  • Disclaimer
  • PMP, PMI, PMBOK, CAPM, PgMP, PfMP, ACP, PBA, RMP, SP, and OPM3 are registered marks of the Project Management Institute, Inc.