A Hadoop cluster stores its data in the Hadoop Distributed File System (HDFS). Designed to run on commodity hardware, the storage system is scalable, fault-tolerant, and rack-aware. HDFS distinguishes itself from other distributed file systems in several ways.
Hadoop is a framework for storing and processing large volumes of data across a cluster of nodes. The Hadoop architecture enables parallel processing of data through several components:
- Hadoop HDFS to store data across slave machines
- Hadoop YARN for resource management in the Hadoop cluster
- Hadoop MapReduce to process data in a distributed fashion
- Zookeeper to ensure synchronization across a cluster
This article explains the components that make up the Hadoop architecture.
The Hadoop Distributed File System (HDFS) is Hadoop’s storage layer. Data is housed on multiple servers: each file is divided into fixed-size blocks, so the number of blocks depends on the file size. These blocks are then distributed and stored across slave machines.
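The block division can be illustrated with a small sketch. It assumes the 128 MB default block size, and the function name `split_into_blocks` is made up for this example. The last block of a file holds only the remaining bytes rather than a full block.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in bytes (128 MB)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would occupy.

    Illustrative helper only; it is not part of any Hadoop API.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block holds only the leftover bytes
    return sizes


if __name__ == "__main__":
    mb = 1024 * 1024
    # A 300 MB file needs two full blocks plus one 44 MB block.
    print([size // mb for size in split_into_blocks(300 * mb)])
```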
HDFS divides large files into blocks. By default, each block holds up to 128 MB of data and is replicated three times. Replica placement follows two rules:
- Two identical blocks cannot be placed on the same DataNode
- When a cluster is rack aware, all the replicas of a block cannot be placed on the same rack
In this example, blocks A, B, C, and D are replicated three times and placed on different racks. If DataNode 7 crashes, we still have two copies of block C data on DataNode 4 of Rack 1 and DataNode 9 of Rack 3.
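The two placement rules and the DataNode failure described above can be checked with a short sketch. The helper names are hypothetical, and placing DataNode 7 on Rack 2 is an assumption, since the text names only the racks of DataNode 4 and DataNode 9.

```python
def placement_is_valid(replicas, rack_aware=True):
    """Check one block's replica placement against the two rules.

    replicas: list of (datanode, rack) tuples. Illustrative helper only.
    """
    nodes = [node for node, _ in replicas]
    if len(set(nodes)) != len(nodes):
        return False  # rule 1: two identical blocks on the same DataNode
    racks = {rack for _, rack in replicas}
    if rack_aware and len(replicas) > 1 and len(racks) == 1:
        return False  # rule 2: every replica sits on the same rack
    return True


def surviving_replicas(replicas, failed_node):
    """Return the replicas that remain readable after one DataNode fails."""
    return [(node, rack) for node, rack in replicas if node != failed_node]


# Block C from the example; "Rack 2" for DataNode 7 is an assumption.
BLOCK_C = [
    ("DataNode 4", "Rack 1"),
    ("DataNode 7", "Rack 2"),
    ("DataNode 9", "Rack 3"),
]

if __name__ == "__main__":
    print(placement_is_valid(BLOCK_C))
    print(surviving_replicas(BLOCK_C, "DataNode 7"))
```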
There are three components of the Hadoop Distributed File System:
- NameNode (a.k.a. master node): Holds metadata in RAM and on disk
- Secondary NameNode: Holds a copy of the NameNode’s metadata on disk
- DataNode (slave node): Stores the actual data in the form of blocks
The NameNode is the master server. In a non-high availability cluster, there can be only one NameNode. In a high availability cluster, there can be two NameNodes, in which case a secondary NameNode is not needed.
The NameNode holds metadata about the DataNodes, the locations of the blocks, the size of each block, and so on. It also executes file system namespace operations, such as opening, closing, and renaming files and directories.
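To make the idea of NameNode metadata concrete, here is a minimal in-memory sketch. The class and method names are invented for illustration and do not mirror Hadoop's internal data structures. Note that a rename touches only metadata; no block moves.

```python
class NameNodeMetadata:
    """Toy model of the metadata a NameNode keeps (illustrative only)."""

    def __init__(self):
        self.files = {}            # path -> ordered list of block IDs
        self.block_sizes = {}      # block ID -> size in bytes
        self.block_locations = {}  # block ID -> list of DataNode names

    def add_file(self, path, blocks):
        """blocks: list of (block_id, size_bytes, [datanode names])."""
        self.files[path] = [block_id for block_id, _, _ in blocks]
        for block_id, size, datanodes in blocks:
            self.block_sizes[block_id] = size
            self.block_locations[block_id] = list(datanodes)

    def rename(self, old_path, new_path):
        """A namespace operation: only the path mapping changes."""
        self.files[new_path] = self.files.pop(old_path)

    def locate(self, path):
        """Return (block_id, size, datanodes) for every block of a file."""
        return [
            (block_id, self.block_sizes[block_id], self.block_locations[block_id])
            for block_id in self.files[path]
        ]
```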
The secondary NameNode server maintains a copy of the metadata on disk. It periodically merges the NameNode’s edit log into a checkpoint of the file system image. It is not a hot standby, but its checkpoint can be used to help bring up a new NameNode in case of failure.
In a high availability cluster, there are two NameNodes: active and standby. The secondary NameNode performs a similar function to the standby NameNode.
Hadoop Cluster - Rack Based Architecture
We know that in a rack-aware cluster, nodes are placed in racks and each rack has its own rack switch. Rack switches are connected to a core switch, which ensures a switch failure will not render a rack unavailable.
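Hadoop can learn which rack a node belongs to from an external topology script, configured through the `net.topology.script.file.name` property. The script receives one or more host names or IP addresses as arguments and prints one rack path per argument. The sketch below shows what such a script could look like; the address-to-rack table is entirely hypothetical.

```python
#!/usr/bin/env python3
import sys

# Hypothetical mapping from DataNode address to rack path.
RACK_BY_HOST = {
    "10.0.1.4": "/rack1",
    "10.0.2.7": "/rack2",
    "10.0.3.9": "/rack3",
}

# Rack reported for hosts missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return one rack path per host, in the order the hosts were given."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes the hosts as command-line arguments.
    print(" ".join(resolve_racks(sys.argv[1:])))
```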
HDFS Read and Write Mechanism
HDFS read and write operations run in parallel. To read or write a file in HDFS, a client must first contact the NameNode. The NameNode checks the client’s privileges and grants permission to read or write the data blocks.
DataNodes store and maintain the blocks. While there is only one active NameNode, there can be many DataNodes, which serve blocks to clients at the locations supplied by the NameNode. DataNodes send heartbeats to the NameNode every 3 seconds by default, along with periodic block reports (every 6 hours by default); in this way, the NameNode keeps its information about the DataNodes and their blocks current.
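The read path can be sketched as a toy simulation: the client asks the NameNode for block locations, the NameNode checks permissions, and the client then fetches each block from a DataNode that holds a replica. All class and function names here are illustrative assumptions, not the HDFS client API.

```python
class NameNode:
    """Toy NameNode: knows who may read a file and where its blocks live."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [datanode names])
        self.readers = {}    # path -> set of users allowed to read

    def get_block_locations(self, user, path):
        if user not in self.readers.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        return self.block_map[path]


class DataNode:
    """Toy DataNode: stores block contents keyed by block ID."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(user, path, namenode, live_datanodes):
    """Read a file block by block, falling back to another replica if needed.

    live_datanodes: dict of name -> DataNode for nodes that are reachable.
    """
    data = b""
    for block_id, holders in namenode.get_block_locations(user, path):
        for name in holders:
            node = live_datanodes.get(name)
            if node is not None and block_id in node.blocks:
                data += node.read_block(block_id)
                break
        else:
            raise IOError(f"no live replica for {block_id}")
    return data
```

Because every block has several replicas, the read still succeeds when one of the DataNodes holding a block is missing from `live_datanodes`.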
Let us now discuss the next component of the Hadoop architecture - Hadoop YARN.
Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management layer of Hadoop and is responsible for resource allocation and job scheduling. Introduced in the Hadoop 2.0 version, YARN is the middle layer between HDFS and MapReduce in the Hadoop architecture.
The elements of YARN include:
- ResourceManager (one per cluster)
- ApplicationMaster (one per application)
- NodeManagers (one per node)
The ResourceManager manages resource allocation in the cluster and tracks how many resources are available in the cluster and how much each NodeManager contributes. It has two main components:
- Scheduler: Allocates resources to running applications based on their requirements; it does not monitor or track the status of the applications
- Application Manager: Accepts job submissions from the client, and monitors and restarts ApplicationMasters in case of failure
The ApplicationMaster manages the resource needs of an individual application and interacts with the scheduler to acquire the required resources. It works with the NodeManager to execute and monitor tasks.
The NodeManager tracks running jobs and sends signals (or heartbeats) to the ResourceManager to relay the status of its node. It also monitors each container’s resource utilization.
A container is a collection of resources such as RAM, CPU, and network bandwidth. Allocations are based on what YARN has calculated for the resources. A container grants an application the right to use specific amounts of those resources.
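The idea of a container as a grant of specific resource amounts can be sketched as follows. This is a first-fit toy allocator written for illustration; it is not how YARN's real schedulers decide placements, and the names `NodeCapacity` and `allocate_container` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class NodeCapacity:
    """Free resources a NodeManager reports for its node (illustrative)."""
    name: str
    free_memory_mb: int
    free_vcores: int


def allocate_container(nodes, memory_mb, vcores):
    """Grant a container on the first node with enough free resources.

    Returns a dict describing the container, or None if no node fits.
    """
    for node in nodes:
        if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
            # Reserve the resources so later requests see the reduced capacity.
            node.free_memory_mb -= memory_mb
            node.free_vcores -= vcores
            return {"node": node.name, "memory_mb": memory_mb, "vcores": vcores}
    return None
```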
Steps to Run an Application in YARN
- Client submits an application to the ResourceManager
- ResourceManager allocates a container for the ApplicationMaster
- NodeManager launches the container
- Container executes the ApplicationMaster
- ApplicationMaster contacts the related NodeManagers because it needs containers to run the application’s tasks
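The submission flow can be traced with a small sketch that records each step as an event. It assumes the ApplicationMaster's container is launched before the ApplicationMaster asks NodeManagers to run tasks, and that tasks are spread round-robin over the nodes; both the function name and the placement policy are illustrative assumptions.

```python
def run_application(app_name, num_tasks, node_names):
    """Return the ordered list of events for one application (toy model)."""
    if not node_names:
        raise ValueError("at least one NodeManager is required")
    am_node = node_names[0]  # assumption: the first node hosts the ApplicationMaster
    events = [
        f"Client submits {app_name} to ResourceManager",
        f"ResourceManager allocates a container on {am_node}",
        f"NodeManager on {am_node} launches the container",
        f"Container on {am_node} executes the ApplicationMaster",
    ]
    for task in range(num_tasks):
        node = node_names[task % len(node_names)]  # round-robin placement
        events.append(
            f"ApplicationMaster asks NodeManager on {node} to run task {task}"
        )
    return events


if __name__ == "__main__":
    for event in run_application("wordcount", 2, ["node1", "node2"]):
        print(event)
```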
Now that you know about YARN, let us continue with the next important component of Hadoop architecture called MapReduce.
MapReduce is a framework for distributed, parallel processing of large volumes of data. MapReduce programs can be written in a number of programming languages, and processing has two main phases: the map phase and the reduce phase.
The map phase operates on data stored in the form of blocks. In this phase, data is read, processed, and emitted as key-value pairs. It is responsible for running a particular task on one or more input splits.
The reduce phase receives the key-value pairs from the map phase. The pairs are then aggregated into smaller sets and an output is produced. Processes such as shuffling and sorting occur in the reduce phase.
The mapper function handles the input data and runs a function on every input split (known as map tasks). There can be one or multiple map tasks based on the size of the file and the configuration setup. Data is then sorted, shuffled, and moved to the reduce phase, where a reduce function aggregates the data and provides the output.
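A word count is the usual way to illustrate this flow. The sketch below runs the map, shuffle-and-sort, and reduce steps in a single Python process, so it shows only the data flow, not the distribution across machines; all function names are chosen for this example.

```python
from collections import defaultdict


def map_words(split):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_and_sort(pairs):
    """Group values by key and order the keys, as happens before reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_counts(key, values):
    """Reduce function: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(splits):
    """Run one map task per split, then shuffle, sort, and reduce."""
    mapped = [pair for split in splits for pair in map_words(split)]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster data data"]))
```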
MapReduce Job Execution
- The input data is stored in HDFS and read using an input format.
- The file is split into multiple chunks based on the size of the file and the input format.
- The default chunk size is 128 MB but can be customized.
- The record reader reads the data from the input splits and forwards this information to the mapper.
- The mapper breaks the records in every chunk into a list of data elements (or key-value pairs).
- The combiner works on the intermediate data created by the map tasks and acts as a mini reducer to reduce the data.
- The partitioner decides which reduce task receives each key; the number of reduce tasks is set in the job configuration.
- The data is then shuffled and sorted by key and sent to the reduce function.
- The output of the reduce function is then written to HDFS using the job’s output format.
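The combiner and partitioner steps can be sketched in a few lines. This is a simplified stand-in rather than Hadoop code: the combiner pre-aggregates one map task's output, and the partitioner picks a reduce task from a hash of the key modulo the number of reduce tasks, which is the same idea as Hadoop's default hash partitioning. CRC32 is used here only to get a hash that is stable between Python runs.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Mini reducer: sum the values per key within one map task's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Pick the reduce task for a key: stable hash modulo reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Distribute (key, value) pairs into one bucket per reduce task."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    map_output = [("data", 1), ("big", 1), ("data", 1)]
    combined = combine(map_output)  # fewer pairs cross the network
    print(route(combined, num_reducers=2))
```

Because the partition depends only on the key, every pair with the same key reaches the same reduce task, which is what lets each reducer produce a complete total for its keys.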
Master the Concepts of the Hadoop Framework
Businesses can now make better decisions by gaining actionable insights through big data analytics. The Hadoop architecture is a major part, but only one aspect, of the wider Hadoop ecosystem, which also includes tools such as Hive, Impala, Pig, and HBase.