HDFS and YARN Tutorial

Welcome to the second lesson, 'HDFS and YARN', of the Big Data Hadoop tutorial, which is a part of the 'Big Data Hadoop and Spark Developer Certification course' offered by Simplilearn.

In this lesson, we will discuss the Hadoop Distributed File System, also known as HDFS, Hadoop architecture, Yet Another Resource Negotiator, also known as YARN, and YARN architecture.

Let's look at the objectives of this HDFS and YARN tutorial.

Objectives

After completing this lesson, you will be able to:

  • Explain how Hadoop Distributed File System or HDFS stores data across a cluster.

  • Demonstrate how to use HDFS with the Hadoop User Experience or Hue File Browser, and the HDFS commands.

  • Illustrate the YARN architecture.

  • List the different components of YARN.

  • Demonstrate how to use Hue, YARN Web UI, and the YARN command to monitor a cluster.


Let us now discuss one of the components of the Hadoop ecosystem - HDFS.

Hadoop Distributed File System (HDFS)

HDFS is a distributed file system that provides access to data across Hadoop clusters. A cluster is a group of computers that work together.

Like other Hadoop-related technologies, HDFS is a key tool for managing and supporting analysis of very large volumes of data, on the order of petabytes and zettabytes.

Why HDFS?

Before 2011, storing and retrieving petabytes or zettabytes of data posed three major challenges: cost, speed, and reliability.

A traditional file system costs approximately $10,000 to $14,000 per terabyte. Searching and analyzing data was time-consuming and expensive. Also, if the data to be searched was spread across different servers, fetching it was difficult.

Let us discuss how HDFS resolves all the three major issues of traditional file systems.

Cost

HDFS is open-source software, so it can be used with zero licensing and support costs. It is designed to run on regular commodity computers.

Speed 

Large Hadoop clusters can read or write more than a terabyte of data per second. A cluster comprises multiple systems logically interconnected in the same network.

HDFS can easily deliver more than two gigabytes of data per second, per computer to MapReduce, which is a data processing framework of Hadoop.

Reliability

HDFS copies the data multiple times and distributes the copies to individual nodes. A node is a commodity server which is interconnected through a network device.

HDFS then places at least one copy of the data on a different server. If any data is deleted from any of the nodes, it can still be found elsewhere within the cluster.

A regular file system, like a Linux file system, differs from HDFS in the size of its data blocks. In a regular file system, each block of data is small, usually about 4 kilobytes. However, in HDFS, each block is 128 megabytes by default.

A regular file system provides access to large data but may suffer from disk input/output problems mainly due to multiple seek operations.

On the other hand, HDFS can read large quantities of data sequentially after a single seek operation. This makes HDFS unique since all of these operations are performed in a distributed mode.
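As a quick, hedged illustration (assuming a standard Hadoop installation), the block size the cluster is actually configured with can be read back with the getconf utility:

$ hdfs getconf -confKey dfs.blocksize

On a default configuration this prints 134217728, which is 128 megabytes expressed in bytes.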

Let us list the characteristics of HDFS.

Characteristics of HDFS

Below are some characteristics of HDFS.

HDFS has high fault-tolerance

HDFS may consist of thousands of server machines. Each machine stores a part of the file system data. HDFS detects faults that can occur on any of the machines and recovers from them quickly and automatically.

HDFS has high throughput

HDFS is designed to store and scan millions of rows of data and to count or add some subsets of the data. The time required in this process is dependent on the complexities involved.

It has been designed to support large datasets in batch-style jobs. However, the emphasis is on high throughput of data access rather than low latency.

HDFS is economical

HDFS is designed so that it can be built on commodity hardware and heterogeneous platforms, which are low-priced and easily available.

Similar to the example explained in the previous section, HDFS stores files in a number of blocks. Each block is replicated to a few separate computers.

The replication count can be modified by the administrator. Data is divided into blocks of 128 megabytes each and replicated across the local disks of cluster nodes.

Metadata keeps track of the physical location of each block and its replicas within the cluster. It is stored in the NameNode. HDFS is the storage system for both the input and output of MapReduce jobs.
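As a hedged illustration (the path below is hypothetical), the block layout and replication of a file already stored in HDFS can be inspected with the fsck utility:

$ hdfs fsck /user/data/sample.txt -files -blocks -locations

The report lists each block of the file, its replication factor, and the DataNodes that hold the replicas.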

Let’s understand how HDFS stores file with an example.

How Does HDFS Work?

Let us look at the below example to understand how HDFS stores file.

Example - A patron gifted a collection of popular books to a college library. The librarian decided to arrange the books on a small rack and then distribute multiple copies of each book on other racks. This way the students could easily pick up a book from any of the racks.

Similarly, HDFS creates multiple copies of a data block and keeps them in separate systems for easy access.

Let’s discuss the HDFS Architecture and Components in the next section.

HDFS Architecture and Components

Broadly, the HDFS architecture is known as a master and slave architecture, which is shown below.

[Diagram: HDFS master-slave architecture]

A master node, that is the NameNode, is responsible for accepting jobs from the clients. Its task is to ensure that the data required for the operation is loaded and segregated into chunks of data blocks.

[Diagram: NameNode and data blocks in the HDFS architecture]

HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, stored, and replicated in the slave nodes known as the DataNodes as shown in the section below.

[Diagram: Blocks replicated across DataNodes]

The data blocks are then distributed to the DataNode systems within the cluster. This ensures that replicas of the data are maintained. DataNode serves read or write requests. It also creates, deletes, and replicates blocks on the instructions from the NameNode.

We discussed in the previous topic that it is the metadata which stores the block locations and their replication. This is explained in the diagram below.

[Diagram: Metadata stored in the NameNode]

There is also a Secondary NameNode, which performs tasks for the NameNode and is also considered a master node. Prior to Hadoop 2.0.0, the NameNode was a Single Point of Failure, or SPOF, in an HDFS cluster.

Each cluster had a single NameNode. In case of an unplanned event, such as a system failure, the cluster would be unavailable until an operator restarted the NameNode.

Also, planned maintenance events, such as software or hardware upgrades on the NameNode system, would result in cluster downtime.

The HDFS High Availability, or HA, feature addresses these problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.

This allows a fast failover to a new NameNode in case a system crashes or an administrator initiates a failover for the purpose of a planned maintenance.

In an HA cluster, two separate systems are configured as NameNodes. At any instance, one of the NameNodes is in an Active state, and the other is in a Standby state.

The Active NameNode is responsible for all client operations in the cluster, while the Standby simply acts as a slave, maintaining enough state to provide a fast failover if necessary.

An HDFS cluster can be managed using the following features:

  • Quorum-based storage
  • Shared storage using Network File System.

Let us look at these features in detail.

Quorum-based storage

Quorum-based Storage refers to the HA implementation that uses Quorum Journal Manager, or QJM. During this implementation, the Standby node keeps its state synchronized with the Active node through a group of separate daemons called JournalNodes.

Daemons are long-running processes that typically start up with the system and listen for requests from the client processes. Each daemon runs in its own Java Virtual Machine (JVM).

When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of the JournalNodes.

The Standby node reads the edits from the JournalNodes and constantly watches for changes to the edit log. As the Standby node reads the edits, it applies them to its own namespace.

In the event of a failover, the Standby ensures that it has read all the edits from the JournalNodes before it promotes itself to the active state. This ensures the namespace state is fully synchronized before a failover occurs.

Shared storage using Network File System

In shared storage using NFS implementation, the Standby node keeps its state synchronized with the Active node through an access to a directory on a shared storage device.

Now, in the next sections, we will discuss in detail the components of HDFS.

HDFS Components

The main components of HDFS are:

  • Namenode

  • Secondary Namenode

  • File system

  • Metadata

  • Datanode

Let's look at each of them in detail.

Namenode

The NameNode server is the core component of an HDFS cluster. There can be only one NameNode server in an entire cluster.

The NameNode maintains and executes file system namespace operations, such as opening, closing, and renaming files and directories present in HDFS.

[Diagram: NameNode operation]

The namespace image and the edit log store information about the data and the metadata. The NameNode also determines the mapping of blocks to DataNodes. Furthermore, the NameNode is a single point of failure.

The DataNode is a multiple-instance server. There can be any number of DataNode servers, depending on the type of network and the storage system.

The DataNode servers store and maintain the data blocks. The NameNode server provisions the data blocks on the basis of the type of job submitted by the client.

A DataNode also stores and retrieves blocks when asked by clients or the NameNode. Furthermore, it serves read/write requests and performs block creation, deletion, and replication on instruction from the NameNode.

There can be only one Secondary NameNode server in a cluster. Note that you cannot treat the Secondary NameNode server as a disaster recovery server. However, it partially restores the NameNode server in case of a failure.

Secondary Namenode

The Secondary NameNode server maintains the edit log and namespace image information in sync with the NameNode server.

At times, the namespace images from the NameNode server are not updated; therefore, you cannot totally rely on the Secondary NameNode server for the recovery process.

[Diagram: Secondary NameNode in HDFS]

File system

HDFS exposes a file system namespace and allows user data to be stored in files. HDFS has a hierarchical file system with directories and files. The NameNode manages the file system namespace, allowing clients to work with files and directories.

[Diagram: File system component of HDFS]

A file system supports operations like create, remove, move, and rename. The NameNode, apart from maintaining the file system namespace, records any change to metadata information.

Now that we have learned about HDFS components, let us see how NameNode works along with other components.

Namenode: Operation

The NameNode maintains two persistent files: a transaction log called the Edit Log and a namespace image called the FsImage. The Edit Log records every change that occurs in the file system metadata, such as creating a new file.

[Diagram: NameNode, an HDFS component]

The NameNode's local file system stores the Edit Log. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in the FsImage, which is also kept in the NameNode's local file system.

Metadata

When new DataNodes join a cluster, the NameNode loads the metadata for the blocks that reside on each DataNode into its memory at startup. The DataNodes then periodically report their blocks at user-defined or default intervals.

When the NameNode starts up, it retrieves the Edit Log and FsImage from its local file system. It then updates the FsImage with Edit Log information and stores a copy of the FsImage on the file system as a checkpoint.

The metadata size is limited to the RAM available on the NameNode. A large number of small files would require more metadata than a small number of large files. Hence, the in-memory metadata management issue explains why HDFS favors a small number of large files.
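As a rough, hedged calculation: 1 gigabyte stored as a single file occupies 8 blocks of 128 megabytes, so the NameNode tracks one file object and eight block objects. The same 1 gigabyte stored as 1,024 files of 1 megabyte each occupies 1,024 blocks, because every file uses at least one block, and each file and block object is commonly estimated to consume on the order of 150 bytes of NameNode memory. Many small files therefore inflate the metadata far faster than a few large ones.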

If a NameNode runs out of RAM, it will crash, and the applications will not be able to use HDFS until the NameNode is operational again.

Data block split is an important process of HDFS architecture. As discussed earlier, each file is split into one or more blocks stored and replicated in DataNodes.

DataNode

DataNodes store the actual file blocks, while the NameNode keeps track of their names and locations. By default, each file block is 128 megabytes. A large block size potentially reduces the amount of parallelism that can be achieved, because the number of blocks per file decreases.

[Diagram: DataNode, an HDFS component]

Each map task operates on one block, so if a job has fewer tasks than there are nodes in the cluster, part of the cluster sits idle and the job runs more slowly than it could. However, this issue is less significant when the average MapReduce job involves more files or larger individual files.
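For example, as a hedged back-of-the-envelope figure: a 1-gigabyte file split into 128-megabyte blocks yields only 8 blocks and therefore at most 8 map tasks, so on a cluster with dozens of nodes most of them would sit idle for that job. A job that reads many such files, or much larger files, generates enough map tasks to keep the whole cluster busy.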

Let us look at some of the benefits of the data block approach.

The data block approach provides:

  • Simplified replication

  • Fault-tolerance

  • Reliability.

It also helps by shielding users from storage sub-system details.

In the next section of this HDFS and YARN tutorial, we will discuss Block Replication Architecture.

Block Replication Architecture

Block replication refers to creating copies of a block on multiple data nodes. Usually, the data is split into parts, such as part 0 and part 1.

[Diagram: Block replication architecture]

HDFS performs block replication on multiple data nodes so that, if an error occurs on one of the data node servers, the job tracker service can resubmit the job to another data node server. The job tracker service is present in the name node server.

Replication Method

In the replication method, each file is split into a sequence of blocks. All blocks except the last one in the file are of the same size. Blocks are replicated for fault tolerance.

[Diagram: Replication method in HDFS]

The block replication factor is usually configured at the cluster level but it can also be configured at the file level.
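As a hedged example (the path below is hypothetical), the replication factor of a file that already exists in HDFS can be changed with the setrep command:

$ hdfs dfs -setrep -w 2 /user/data/sample.txt

The -w flag makes the command wait until the re-replication completes.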

The name node receives a heartbeat and a block report from each data node in the cluster. The heartbeat denotes that the data node is functioning properly. A block report lists the blocks on a data node.
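A hedged way to see this information summarized for the whole cluster is the dfsadmin report, which prints per-DataNode capacity, usage, and last-contact (heartbeat) times:

$ hdfs dfsadmin -report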

Let us now understand what data replication topology is.

Data Replication Topology

The topology of the replicas is critical to ensuring the reliability of HDFS. Usually, each block of data is replicated three times, and the suggested replica placement is as follows.

[Diagram: Data replication topology]

Place the first replica on the same node as that of the client. Place the second replica on a different rack from that of the first replica. Place the third replica on the same rack as that of the second one but on a different node.

Let's understand data replication through a simple example.

Data Replication Topology - Example

The diagram below illustrates replication and rack awareness in a Hadoop cluster with three racks.

[Diagram: Data replication topology example with three racks]

Each rack consists of multiple nodes. R1N1 represents node 1 on rack 1. Suppose each rack has eight nodes. The name node decides which data node belongs to which rack. Block 1 which is B1 is first written to node 4 on rack 1.

A copy is then written to a different node on a different rack, which is node 5 on rack 2. The third and final copy of the block is written to the same rack as the second copy but to a different node, which is node 1 on rack 2.

Now that you have learned about data replication topology or placement, let's discuss how a file is stored in HDFS.

How are Files stored?

Let's say we have a large data file which is divided into four blocks. Each block is replicated three times as shown in the diagram.

[Diagram: Storing a file in HDFS]

You might recall that the default size of each block is 128 megabytes.
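As a rough, hedged calculation for this example: four blocks of 128 megabytes correspond to a file of about 512 megabytes, and with a replication factor of three the cluster keeps 12 block replicas, which is roughly 1.5 gigabytes of raw storage for 512 megabytes of user data.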

The name node then carries metadata information of all blocks and its distribution. Let's work on an example of HDFS which gives a deep understanding of all the points discussed so far.

HDFS in Action - Example

Suppose you have 2 log files that you want to save from a local file system to the HDFS cluster.

The cluster has 5 data nodes: node A, node B, node C, node D, and node E.

Now, the first log is divided into three blocks: b1, b2, and b3, and the other log is divided into two blocks: b4 and b5.

The blocks b1, b2, b3, b4, and b5 are distributed to node A, node B, node C, node D, and node E respectively, as shown in the diagram.

[Diagram: HDFS in action, blocks distributed across DataNodes]

Each block will also be replicated three times across the five data nodes. All of the information related to the list of blocks and their replication, known as the metadata information, will be stored in the NameNode.

Now suppose the client asks for one of the logs that you have stored. The request goes to the NameNode, and the client gets the information about this log file, as shown in the diagram.

[Diagram: HDFS in action, client reading a file]

Based on the information from the NameNode, the client retrieves the file data from the respective data nodes. HDFS provides various access mechanisms. A Java API can be used for applications.

There are also Python and C language wrappers for non-Java applications. A web GUI can also be used through an HTTP browser, and an FS shell is available for executing commands on HDFS.

Let's look at the commands for HDFS in the command-line interface.

HDFS Command Line

Following are a few basic command lines of HDFS.

To copy the file simplilearn.txt from the local disk to the user's directory, type the command line:

$ hdfs dfs -put simplilearn.txt simplilearn.txt

This will copy the file to /user/username/simplilearn.txt

To get a directory listing of the user's home directory, type the command line:

$ hdfs dfs -ls

To create a directory called testing under the user's home directory, type the command line:

$ hdfs dfs -mkdir testing

To delete the directory testing and all of its components, type the command line:

$ hdfs dfs -rm -r testing
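As an additional hedged example, the contents of the file copied earlier can be printed back from HDFS with the cat command:

$ hdfs dfs -cat simplilearn.txt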

In the next section, we will try to understand the hue file browser.

Hue File Browser

The file browser in Hue lets you view and manage your HDFS directories and files. Additionally, you can create, move, rename, modify, upload, download, and delete directories and files. You can also view the file contents.

[Screenshot: The Hue file browser]

In the next few sections, we will discuss YARN, its architecture, and its components. We will also discuss how to use the YARN commands.

What is YARN?

YARN is the acronym for Yet Another Resource Negotiator. YARN is a resource manager created by separating the processing engine and the management function of MapReduce.

It monitors and manages workloads, maintains a multi-tenant environment, manages the high availability features of Hadoop, and implements security controls.

Before 2012, users could write MapReduce programs in languages such as Java, Python, and Ruby. They could also use Pig, a language used to transform data. No matter what language was used, the implementation depended on the MapReduce processing model.

In May 2012, during the release of Hadoop version 2.0, YARN was introduced. You are no longer limited to working with the MapReduce framework, as YARN supports multiple processing models in addition to MapReduce, such as Spark.

Other features of YARN include significant performance improvement and a flexible execution engine.

Let us now discuss YARN with the help of an example.

Yarn - Use Case

Yahoo was the first company to embrace Hadoop, and this made it a trendsetter within the Hadoop ecosystem. In late 2012, Yahoo struggled to handle iterative and stream processing of data on the Hadoop infrastructure due to MapReduce limitations.

Both iterative and stream processing was important for Yahoo in facilitating its move from batch computing to continuous computing.

After implementing YARN in the first quarter of 2013, Yahoo installed more than 30,000 production nodes on

  • Spark for iterative processing

  • Storm for stream processing

  • Hadoop for batch processing, allowing it to handle more than 100 billion events such as clicks, impressions, email content, metadata, and so on per day.

This was possible only after YARN was introduced and multiple processing frameworks were implemented. The single-cluster approach provides a number of advantages, including:

  • Higher cluster utilization, where resources unutilized by a framework can be consumed by another

  • Lower operational costs because only one "do-it-all" cluster needs to be managed

  • Reduced data motion as there's no need to move data between Hadoop YARN and systems running on different clusters of computers

Let us now look at the YARN infrastructure in the next section.

YARN Infrastructure

The YARN Infrastructure is responsible for providing computational resources such as CPUs or memory needed for application executions.

[Diagram: The YARN infrastructure]

YARN infrastructure and HDFS are completely independent. The former provides resources for running an application while the latter provides storage.

The MapReduce framework is only one of the many possible frameworks that run on YARN. The fundamental idea of MapReduce version-2 is to split the two major functionalities of resource management and job scheduling and monitoring into separate daemons.

In the next section, we will discuss YARN and its architecture.

YARN and its Architecture

Let us first understand the important three Elements of YARN Architecture.

The three important elements of the YARN architecture are:

  • Resource Manager

  • Application Master

  • Node Managers

These three elements of the YARN architecture are shown in the diagram below.

[Diagram: The three elements of the YARN architecture]

Let us look into these elements in detail.

Resource Manager

The ResourceManager, or RM, of which there is usually one per cluster, is the master server. The Resource Manager knows the location of the DataNodes and how many resources they have. This information is referred to as Rack Awareness.

The RM runs several services, the most important of which is the Resource Scheduler that decides how to assign the resources.

Application Master

The Application Master is a framework-specific process that negotiates resources for a single application, that is, a single job or a directed acyclic graph of jobs, which runs in the first container allocated for the purpose.

Each Application Master requests resources from the Resource Manager and then works with containers provided by Node Managers.

Node Managers

There can be many Node Managers in one cluster. They are the slaves of the infrastructure. When a Node Manager starts, it announces itself to the RM and periodically sends a heartbeat to it.

Each Node Manager offers resources to the cluster. The resource capacity is the amount of memory and the number of v-cores, short for virtual cores. At run time, the Resource Scheduler decides how to use this capacity.

A container is a fraction of the NodeManager capacity, and it is used by the client to run a program. Each Node Manager takes instructions from the ResourceManager and reports and handles containers on a single node.
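As a hedged illustration, the Node Managers registered with the Resource Manager, along with the number of containers running on each, can be listed from the command line:

$ yarn node -list

Passing a node ID to yarn node -status shows that node's memory and v-core capacity.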


In the next few sections, you will see a detailed explanation of the three elements.

YARN Architecture Element - Resource Manager

The first element of YARN architecture is ResourceManager. The RM mediates the available resources in the cluster among competing applications with the goal of maximum cluster utilization.

It includes a pluggable scheduler called the YarnScheduler, which allows different policies for managing constraints such as capacity, fairness, and Service Level Agreements.

The Resource Manager has two main components‚ Scheduler and Applications Manager. Let us understand each of them in detail.

Resource Manager Component - Scheduler

The Scheduler is responsible for allocating resources to various running applications depending on the common constraints of capacities, queues, and so on.

The Scheduler does not monitor or track the status of the application. Also, it does not restart the tasks in case of any application or hardware failures.

The Scheduler performs its function based on the resource requirements of the applications. It does so based on the abstract notion of a resource container that incorporates elements such as memory, CPU, disk, and network.

The Scheduler has a policy plugin which is responsible for partitioning the cluster resources among various queues and applications. The current MapReduce schedulers such as the Capacity Scheduler and the Fair Scheduler are some examples of the plug-in.

The Capacity Scheduler supports hierarchical queues to enable a more predictable sharing of cluster resources.

Resource Manager Component - Application Manager

The Application Manager is an interface which maintains a list of applications that have been submitted, currently running, or completed.

The Application Manager is responsible for accepting job-submissions, negotiating the first container for executing the application specific Application Master and restarting the Application Master container on failure.

Let’s discuss how each component of YARN Architecture works together. First, we will understand how Resource Manager operates.

How does Resource Manager operate?

The Resource Manager communicates with the clients through an interface called the Client Service. A client can submit or terminate an application and gain information about the scheduling queue or cluster statistics through the Client Service.  

Administrative requests are served by a separate interface called the Admin Service through which operators can get updated information about the cluster operation.

In parallel, the Resource Tracker Service receives node heartbeats from the Node Manager to track new or decommissioned nodes.

The NM Liveliness Monitor and Nodes List Manager keep an updated status of which nodes are healthy so that the Scheduler and the Resource Tracker Service can allocate work appropriately.

The Application Master Service manages Application Masters on all nodes, keeping the Scheduler informed.

The AM Liveliness Monitor keeps a list of Application Masters and their last heartbeat times to let the Resource Manager know what applications are healthy on the cluster.

Any Application Master that does not send a heartbeat within a certain interval is marked as dead and re-scheduled to run on a new container.

Resource Manager in High Availability Mode

Before Hadoop 2.4, the Resource Manager was the single point of failure in a YARN cluster. The High Availability, or HA, feature adds redundancy in the form of an Active/Standby Resource Manager pair to remove this single point of failure.

[Diagram: Resource Manager in High Availability mode]

Resource Manager HA is realized through the Active/Standby architecture. At any point in time, one of the RMs is active and one or more RMs are in Standby mode waiting to take over, should anything happen to the Active.

The trigger to transition-to-active comes from either the admin through the Command-Line Interface or through the integrated failover-controller.

The RMs have the option to embed the ZooKeeper-based ActiveStandbyElector to decide which RM should be the Active. When the Active goes down or becomes unresponsive, another RM is automatically elected to be the Active.

Note that there is no need to run a separate ZKFC daemon as in HDFS, because the ActiveStandbyElector embedded in the RM acts as both a failure detector and a leader elector.

In the next section, let us look at the second most important YARN Architecture element, Application Master.

YARN Architecture Element - Application Master

The second element of YARN architecture is the Application Master. The Application Master in YARN is a framework-specific library, which negotiates resources from the RM and works with the NodeManager or Managers to execute and monitor containers and their resource consumption.

While an application is running, the Application Master manages the application lifecycle, dynamic adjustments to resource consumption, execution flow, faults, and it provides status and metrics.

The Application Master is architected to support a specific framework and can be written in any language. It uses extensible communication protocols with the Resource Manager and the Node Manager.

The Application Master can be customized to extend the framework or run any other code. Because of this, the Application Master is not considered trustworthy and is not run as a trusted service.

In reality, every application has its own instance of an Application Master. However, it is feasible to implement an Application Master to manage a set of applications, for example, an Application Master for Pig or Hive to manage a set of MapReduce jobs.

In the next section, let us look at the third most important YARN Architecture element, Node Manager.

YARN Architecture Element - Node Manager

The third element of YARN architecture is the Node Manager. When a container is leased to an application, the NodeManager sets up the container environment. The environment includes the resource constraints specified in the lease and any kind of dependencies, such as data or executable files.

The Node Manager monitors the health of the node, reporting to the ResourceManager when a hardware or software issue occurs so that the Scheduler can divert resource allocations to healthy nodes until the issue is resolved.

The Node Manager also offers a number of services to containers running on the node such as a log aggregation service.

The Node Manager runs on each node and manages activities such as container lifecycle management, container dependencies, container leases, node and container resource usage, node health, and log management, and it reports node and container status to the Resource Manager.

Let us now look at the node manager component YARN container.

Node Manager Component: YARN Container

A YARN container is a collection of a specific set of resources to use in certain amounts on a specific node. It is allocated by the ResourceManager on the basis of the application.

The Application Master presents the container to the Node Manager on the node where the container has been allocated, thereby gaining access to the resources.

Now, let us discuss how to launch the container.

The Application Master must provide a Container Launch Context or CLC. This includes information such as Environment variables, dependencies on the requirement of data files or shared objects prior to the launch, security tokens, and the command to create the process to launch the application.

The CLC enables the Application Master to use containers. This allows containers to run a variety of different kinds of work, from simple shell scripts to applications to a virtual operating system.

In the next section, we will look at the applications that run on YARN.

Applications on YARN

Owing to YARN's generic approach, a Hadoop YARN cluster can run various workloads. This means a single Hadoop cluster in your data center can run MapReduce, Storm, Spark, Impala, and more.

Let us first understand how to run an application through YARN.

Running an Application through YARN

Broadly, there are five steps involved in YARN to run an application:

  1. The client submits an application to the Resource Manager

  2. The ResourceManager allocates a container

  3. The Application Master contacts the related Node Manager

  4. The Node Manager launches the container

  5. The container executes the Application Master

In the next few sections, you will learn about each step in detail.

Step 1 - Application submitted to the Resource Manager

Users submit applications to the Resource Manager by typing the Hadoop jar command.

[Diagram: Step 1, application submitted to the Resource Manager]

The Resource Manager maintains the list of applications on the cluster and available resources on the Node Manager. The Resource Manager determines the next application that receives a portion of the cluster resource.

The decision is subject to many constraints such as queue capacity, Access Control Lists, and fairness.
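As a hedged illustration of this submission step (the jar name, class name, and paths below are hypothetical):

$ hadoop jar wordcount.jar com.example.WordCount /user/data/input /user/data/output

The command packages the application and submits it to the Resource Manager, which then queues it as described above.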

Step 2 - Resource Manager allocates Container

When the Resource Manager accepts a new application submission, one of the first decisions the Scheduler makes is selecting a container. Then, the Application Master is started and is responsible for the entire life-cycle of that particular application.

[Diagram: Step 2, Resource Manager allocates a container]

First, it sends resource requests to the ResourceManager to ask for containers to run the application's tasks.

A resource request is simply a request for a number of containers that satisfy resource requirements such as the following:

  • Amount of resources, expressed as megabytes of memory and CPU shares
  • Preferred location, specified by hostname or rack name
  • Priority within this application, and not across multiple applications

The Resource Manager allocates a container by providing a container ID and a hostname, which satisfies the requirements of the Application Master.

Step 3 - Application Master contacts Node Manager

[Diagram: Step 3, the Application Master contacts the Node Manager]

After a container is allocated, the Application Master asks the Node Manager managing the host on which the container was allocated to use these resources to launch an application-specific task. This task can be any process written in any framework, such as a MapReduce task.

Step 4 - Node Manager Launches Container

[Diagram: Step 4, the Node Manager launches the container]

The NodeManager does not monitor tasks; it only monitors the resource usage in the containers.

For example, it kills a container if it consumes more memory than initially allocated.

Throughout its life, the Application Master negotiates containers to launch all of the tasks needed to complete its application.

It also monitors the progress of an application and its tasks, restarts failed tasks in newly requested containers, and reports progress back to the client that submitted the application.

Step 5 - Container Executes the Application Master

[Diagram: Step 5, the container executes the Application Master]

After the application is complete, the Application Master shuts itself down and releases its own container. Though the ResourceManager does not monitor the tasks within an application, it checks the health of the ApplicationMaster.

If the ApplicationMaster fails, it can be restarted by the ResourceManager in a new container. Thus, the resource manager looks after the ApplicationMaster, while the ApplicationMaster looks after the tasks.

Let us now look at the tools used for YARN development.

Tools for YARN Development

Hadoop includes three tools for YARN developers:

  • YARN Web UI

  • Hue Job Browser

  • YARN Command Line

These tools enable developers to submit, monitor, and manage jobs on the YARN cluster.

YARN Web UI

The YARN Web UI runs on port 8088 by default. It also provides a better view than Hue; however, you cannot control or configure the cluster from the YARN Web UI.
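For example, assuming a hypothetical ResourceManager host named rm-host, the UI would be reachable at http://rm-host:8088/cluster, which lists the applications submitted to the cluster.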

Hue Job Browser

The Hue Job Browser allows you to monitor the status of a job, kill a running job, and view logs.

[Screenshot: The Hue Job Browser]

YARN Command Line

Most of the YARN commands are for the administrator rather than the developer.

A few useful commands for the developer are as follows:

  • To list all the commands of YARN:

yarn -help

  • To print the version:

yarn -version

  • To view the logs of a specified application ID:

yarn logs -applicationId <app-id>
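In addition, as a hedged example beyond the list above, the applications known to the Resource Manager can be listed and inspected with:

yarn application -list

yarn application -status <app-id>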


Summary

Let us now summarize what we have learned in this lesson.

  • HDFS is the storage layer for Hadoop.

  • HDFS chunks data into blocks and distributes them across the cluster.

  • Slave nodes run DataNode daemons, which are managed by a single NameNode.

  • HDFS can be accessed using Hue, HDFS commands, or the HDFS API.

  • YARN manages resources in a Hadoop cluster and schedules jobs.

  • YARN works with HDFS to run tasks where the data is stored.

  • YARN executes jobs that can be monitored using Hue, the YARN Web UI, or YARN commands.

Conclusion

This concludes the lesson on HDFS and YARN. In the next lesson of this tutorial, we will focus on MapReduce and Sqoop.
