What is Hadoop: Understanding Hadoop and Its Components

Hadoop has become a prevalent term in recent years. With the rise of big data, Hadoop rose to prominence. In a world where data is generated with every click, the Hadoop framework is vital. Have you ever wondered what exactly Hadoop is? This article answers that question and explains the correlation between Hadoop and big data. The topics covered in this article are the following:

  • Hadoop Through an Analogy
  • The Rise of Big Data
  • Big Data and its Challenges
  • What Is Hadoop?
  • Components of Hadoop
  • Hadoop Use Case

Hadoop Through an Analogy  

Before jumping into the technicalities of Hadoop, let us understand Hadoop through an interesting story. By the end of this story, you will comprehend Hadoop, big data, and the necessity for Hadoop. 

Imagine a scenario wherein Jack, a farmer, harvests grapes, stores the produce in a storage room, and sells it in the nearby town. This works well until demand for other fruits grows, leading him to harvest apples and oranges in addition to grapes.

As fascinating as it sounds, the entire process turns out to be time-consuming and difficult for Jack to manage single-handedly.


Hence, he hires two more people to work alongside him. This speeds up the harvesting process, as the three of them can work simultaneously on different fruits.

However, this takes a toll on the storage room, as the single storage area becomes a bottleneck for storing and accessing all the fruits.


Jack thought through the problem and came up with a solution: give each worker a separate storage space. Now, when Jack receives an order for a fruit basket, all three can work in parallel, each with their own storage area, and the order is completed on time.


With Jack’s solution, the orders are finished within the stipulated time and hassle-free. This way, even with sky-high demand, Jack can fulfill his orders.


The Rise of Big Data

So, now you might be wondering how Jack’s story is related to big data and Hadoop. Let’s draw a comparison between Jack’s story and big data. 

Back in the day, only limited data was generated, so a single storage unit and a single processor were enough to store and process it. Then, almost in the blink of an eye, data generation increased by leaps and bounds - not only in volume but also in variety. A single processor could no longer handle such high volumes of so many varieties of data. Speaking of variety, data can be structured, semi-structured, or unstructured.


This can be related to how Jack found it hard to harvest different types of fruits single-handedly. Just like Jack’s approach, multiple processors were put to work on the different data types.

This allowed data to be processed in parallel; however, the single shared storage unit became a bottleneck and generated network overhead.


To combat this, the storage unit was distributed among the processors so that each processor had its own storage. Data could then be stored and accessed efficiently, with no network overhead. This method is called parallel processing with distributed storage.
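As a toy illustration of this idea (not Hadoop itself, and with all names invented for the example), the Java sketch below gives each simulated "node" its own slice of the data, computes partial results in parallel, and combines them into one output:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelDistributedSketch {
      public static void main(String[] args) throws Exception {
        // Each simulated "node" owns its own slice of the data (distributed storage).
        Map<String, List<Integer>> nodeStorage = Map.of(
            "node-1", List.of(1, 2, 3),
            "node-2", List.of(4, 5, 6),
            "node-3", List.of(7, 8, 9));

        // One worker per node processes its local slice (parallel processing).
        ExecutorService pool = Executors.newFixedThreadPool(nodeStorage.size());
        List<Future<Integer>> partials = new ArrayList<>();
        for (List<Integer> slice : nodeStorage.values()) {
          partials.add(pool.submit(() -> slice.stream().mapToInt(Integer::intValue).sum()));
        }

        // Combine the partial results into a single output.
        int total = 0;
        for (Future<Integer> partial : partials) {
          total += partial.get();
        }
        pool.shutdown();
        System.out.println("Total = " + total); // prints Total = 45
      }
    }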


This is how big data is managed effectively. Now, do you see the connection between Jack’s story and big data management?

Big Data and Its Challenges

Big data refers to massive amounts of data that cannot be stored, processed, or analyzed using traditional methods.

The main elements of big data are:

  • Volume - the massive amount of data generated every second.
  • Velocity - the speed at which data is generated, collected, and analyzed.
  • Variety - the different types of data: structured, semi-structured, and unstructured.
  • Value - the ability to turn data into useful insights for your business.
  • Veracity - trustworthiness in terms of quality and accuracy.

The main challenges that big data posed, and the solution to each, are listed below:

  • Single central storage → Distributed storage
  • Serial processing (one input, one processor, one output) → Parallel processing (multiple inputs, multiple processors, one output)
  • Lack of ability to process unstructured data → Ability to process every type of data

What Is Hadoop?

As the years went by and data generation increased, higher volumes and more formats emerged, so multiple processors were needed to process the data in time. However, a single storage unit became a bottleneck due to the network overhead it generated. The solution was a distributed storage unit for each processor, which made data access easier and faster. As described above, this method is known as parallel processing with distributed storage: multiple computers run the processes on separate storage units. Hadoop is an open-source framework built on exactly this model.

Let us next discuss the components of Hadoop.


Components of Hadoop

Hadoop is a framework that uses distributed storage and parallel processing to store and manage big data. It is the most commonly used software to handle big data. There are three components of Hadoop.

  1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
  2. Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.
  3. Hadoop YARN - Hadoop YARN is the resource management unit of Hadoop.

Let us take a detailed look at Hadoop HDFS in this part of the What is Hadoop article.

Hadoop HDFS

Data is stored in a distributed manner in HDFS. There are two components of HDFS - name node and data node. While there is only one name node, there can be multiple data nodes.

HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise-grade server costs roughly $10,000 per terabyte of storage. If you needed to buy 100 such servers, the cost would go up to around a million dollars.

Hadoop enables you to use commodity machines as your data nodes. This way, you don’t have to spend millions of dollars just on your data nodes. The name node, however, always runs on an enterprise-grade server.

Features of HDFS

  • Provides distributed storage
  • Can be implemented on commodity hardware
  • Provides data security
  • Highly fault-tolerant - if one machine goes down, its data remains available from copies stored on other machines

Master and Slave Nodes

Master and slave nodes form the HDFS cluster. The name node is called the master, and the data nodes are called the slaves.


The name node manages the data nodes. It also stores the metadata - information about which blocks of data are stored where.

The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node. These heartbeats show the status of the data node.


Consider that 30TB of data is loaded into the cluster. The name node distributes it across the data nodes, and each block of data is replicated among the data nodes, so the same block exists on more than one machine.

By default, the data is replicated three times. This way, if a commodity machine fails, its data is still available elsewhere, and the failed machine can simply be replaced with a new one.
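To make the storage side concrete, here is a minimal Java sketch of writing a file to HDFS using Hadoop’s FileSystem API. The name node address, file path, and contents are hypothetical placeholders; dfs.replication is the standard property behind the three-copy default described above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical name node address; on a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        conf.set("dfs.replication", "3"); // three copies of each block (the default)

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/fruits.txt"); // hypothetical path

        // The client asks the name node for metadata; the bytes themselves are
        // streamed to the data nodes, which replicate each block.
        try (FSDataOutputStream out = fs.create(file)) {
          out.writeBytes("bus car train\nship ship train\nbus ship car\n");
        }

        // Confirm the replication factor recorded for the file.
        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
      }
    }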

Let us focus on Hadoop MapReduce in the following section of the What is Hadoop article.

Hadoop MapReduce

Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.

Instead of moving the data to the code, MapReduce sends the code to the data. This code is usually tiny compared to the data itself; you only need to send a few kilobytes worth of code to perform a heavy-duty process across the cluster.


The input dataset is first split into chunks. In this example, the input has three lines of text - “bus car train,” “ship ship train,” and “bus ship car.” The dataset is split into three chunks, one per line, and the chunks are processed in parallel.

In the map phase, each word is emitted as a key with a value of 1 - for example, (bus, 1), (car, 1), and (train, 1) from the first chunk.

These key-value pairs are then shuffled and sorted together based on their keys. In the reduce phase, the aggregation takes place and the final output is obtained - for example, “ship” appears three times in the input, so the final output contains (ship, 3).
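This word-count flow maps directly onto Hadoop’s Mapper and Reducer classes. The sketch below follows the standard Apache Hadoop WordCount example; class names such as TokenizerMapper are conventional choices, not requirements:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // e.g., ("ship", 1)
          }
        }
      }

      // Reduce phase: after the shuffle groups values by key, sum them per word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
            sum += value.get();
          }
          result.set(sum);
          context.write(key, result); // e.g., ("ship", 3)
        }
      }
    }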

Hadoop YARN is the next concept we shall focus on in the What is Hadoop article.

Hadoop YARN

Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2.

  • Hadoop YARN acts like an OS for Hadoop. It sits on top of HDFS and allocates the cluster’s computing resources to applications.
  • It is responsible for managing cluster resources to make sure you don’t overload any one machine.
  • It performs job scheduling to make sure that jobs run in the right place at the right time.


Suppose a client machine wants to run a query or submit code for data analysis. This job request goes to the resource manager (Hadoop YARN), which is responsible for resource allocation and management.

In the node section, each node has its own node manager, which manages the node and monitors its resource usage. Containers hold a collection of physical resources, such as RAM, CPU, and hard drives. Whenever a job request comes in, the application master requests containers from the node managers; once the node manager has allocated the resources, it reports back to the resource manager.
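Tying the pieces together, here is a minimal driver sketch that submits the word-count job from the previous section. On a cluster configured with mapreduce.framework.name=yarn (the usual setup), waitForCompletion hands the application to the ResourceManager, which negotiates containers on the node managers as described above. The input and output paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // pre-aggregates on the map side
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/fruits.txt"));    // hypothetical input
        FileOutputFormat.setOutputPath(job, new Path("/data/word-counts")); // hypothetical output
        // On a YARN cluster, this submits the job to the ResourceManager and waits.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }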

The next section describes a use case that will help you better understand what is Hadoop and how to use it.


Hadoop Use Case

In this case study, we will discuss how Hadoop can combat fraudulent activities. Let us look at the case of Zions Bancorporation. Their main challenge was how to apply the Zions security team’s approaches to the fraudulent activities taking place. The problem was that they used an RDBMS, which was unable to store and analyze such huge amounts of data.

In other words, they were only able to analyze small amounts of data. But with a flood of customers coming in, there were so many things they couldn’t keep track of, which left them vulnerable to fraudulent activities.

They began to use parallel processing. However, the data was unstructured, and analyzing it was not possible. Not only did they have a huge amount of data that could not get into their databases, but they also had unstructured data.

Hadoop enabled the Zions team to pull all of that massive data together and store it in one place. It also became possible to process and analyze the huge amounts of unstructured data they had. It was more time-efficient, and in-depth analysis of various data formats became easier through Hadoop. The Zions team could now detect everything from malware and spear phishing attempts to account takeovers.

Got a clear understanding of what Hadoop is? Check out what you should do next.

Looking forward to becoming a Hadoop Developer? Check out the Big Data Hadoop Certification Training course and get certified today.

Conclusion

Hadoop is a widely used big data technology for storing, processing, and analyzing large datasets. After reading this article on what Hadoop is, you should understand how big data evolved and the challenges it brought with it. You also learned the basics of Hadoop, its components, and how they work. Do you have any questions related to this What is Hadoop article? If so, please leave them in the comments section, and our team will help you resolve your queries.

If you want to grow your career in Big Data and Hadoop, check out this course on becoming a Big Data Engineer.

About the Author

Medono Zhasa

Medo specializes in writing for the digital space to garner social media attention and increase search visibility. A writer by day and reader by night, Medo has a second life writing Lord of the Rings fan theories and making cat videos for people of the Internet to relish on.
