YARN is the acronym for Yet Another Resource Negotiator. YARN is a resource manager created by separating the processing engine and the management function of MapReduce. It monitors and manages workloads, maintains a multi-tenant environment, manages the high availability features of Hadoop, and implements security controls.
Get trained in Yarn, MapReduce, Pig, Hive, HBase, and Apache Spark with the Big Data Hadoop Certification Training Course. Enroll now!
Before beginning the details of the YARN tutorial, let us understand what is YARN.
What is Yarn?
Before 2012, users could write MapReduce programs using scripting languages such as Java, Python, and Ruby. They could also use Pig, a language used to transform data. No matter what language was used, its implementation depended on the MapReduce processing model.
In May 2012, during the release of Hadoop version 2.0, YARN was introduced. You are no longer limited to working with the MapReduce framework anymore as YARN supports multiple processing models in addition to MapReduce, such as Spark. Other features of YARN include significant performance improvement and a flexible execution engine.
Now that we have learned about YARN, let us next take a look at the Yarn use case as a part of this Yarn tutorial.
YARN - Use Case
Yahoo was the first company to embrace Hadoop and this became a trendsetter within the Hadoop ecosystem. In late 2012, Yahoo struggled to handle iterative and stream processing of data on the Hadoop infrastructure due to MapReduce limitations.
Both iterative and stream processing was important for Yahoo in facilitating its move from batch computing to continuous computing.
After implementing YARN in the first quarter of 2013, Yahoo installed more than 30,000 production nodes on
- Spark for iterative processing
- Storm for stream processing
- Hadoop for batch processing, allowing it to handle more than 100 billion events such as clicks, impressions, email content, metadata, and so on per day.
This was possible only after YARN was introduced and multiple processing frameworks were implemented. The single-cluster approach provides a number of advantages, including:
- Higher cluster utilization, where resources unutilized by a framework can be consumed by another
- Lower operational costs because only one "do-it-all" cluster needs to be managed
- Reduced data motion as there's no need to move data between Hadoop YARN and systems running on different clusters of computers
Let us next look at the yarn architecture as a part of this Yarn tutorial.
The YARN Infrastructure is responsible for providing computational resources such as CPUs or memory needed for application executions.
YARN infrastructure and HDFS are completely independent. The former provides resources for running an application while the latter provides storage.
The MapReduce framework is only one of the many possible frameworks that run on YARN. The fundamental idea of MapReduce version-2 is to split the two major functionalities of resource management and job scheduling and monitoring into separate daemons.
In the next section, we will discuss YARN and its architecture as a part of this Yarn tutorial.
YARN and its Architecture
Let us first understand the important three Elements of YARN Architecture.
The three important elements of the YARN architecture are:
- Resource Manager
- Application Master
- Node Managers
These three Elements of YARN Architecture are shown in the given below diagram.
The ResourceManager, or RM, which is usually one per cluster, is the master server. Resource Manager knows the location of the DataNode and how many resources they have. This information is referred to as Rack Awareness. The RM runs several services, the most important of which is the Resource Scheduler that decides how to assign the resources.
The Application Master is a framework-specific process that negotiates resources for a single application, that is, a single job or a directed acyclic graph of jobs, which runs in the first container allocated for the purpose. Each Application Master requests resources from the Resource Manager and then works with containers provided by Node Managers.
The Node Managers can be many in one cluster. They are the slaves of the infrastructure. When it starts, it announces itself to the RM and periodically sends a heartbeat to the RM.
Each Node Manager offers resources to the cluster. The resource capacity is the amount of memory and the number of v-cores, short for the virtual core. At run-time, the Resource Scheduler decides how to use this capacity. A container is a fraction of the NodeManager capacity, and it is used by the client to run a program. Each Node Manager takes instructions from the ResourceManager and reports and handles containers on a single node.
YARN Architecture Element - Resource Manager
The first element of YARN architecture is ResourceManager. The RM mediates the available resources in the cluster among competing applications with the goal of maximum cluster utilization.
It includes a pluggable scheduler called the YarnScheduler, which allows different policies for managing constraints such as capacity, fairness, and Service Level Agreements. The Resource Manager has two main components - Scheduler and Applications Manager. Let us understand each of them in detail.
Resource Manager Component - Scheduler
The Scheduler is responsible for allocating resources to various running applications depending on the common constraints of capacities, queues, and so on. The Scheduler does not monitor or track the status of the application. Also, it does not restart the tasks in case of any application or hardware failures. The Scheduler performs its function based on the resource requirements of the applications. It does so based on the abstract notion of a resource container that incorporates elements such as memory, CPU, disk, and network. The Scheduler has a policy plugin which is responsible for partitioning the cluster resources among various queues and applications. The current MapReduce schedulers such as the Capacity Scheduler and the Fair Scheduler are some examples of the plug-in.
The Capacity Scheduler supports hierarchical queues to enable a more predictable sharing of cluster resources.
Resource Manager Component - Application Manager
The Application Manager is an interface which maintains a list of applications that have been submitted, currently running, or completed. The Application Manager is responsible for accepting job-submissions, negotiating the first container for executing the application-specific Application Master and restarting the Application Master container on failure.
Let’s discuss how each component of YARN Architecture works together. First, we will understand how the Resource Manager operates.
How Does the Resource Manager Operate?
The Resource Manager communicates with the clients through an interface called the Client Service. A client can submit or terminate an application and gain information about the scheduling queue or cluster statistics through the Client Service.
Administrative requests are served by a separate interface called the Admin Service through which operators can get updated information about the cluster operation.
In parallel, the Resource Tracker Service receives node heartbeats from the Node Manager to track new or decommissioned nodes.
The NM Liveliness Monitor and Nodes List Manager keep an updated status of which nodes are healthy so that the Scheduler and the Resource Tracker Service can allocate work appropriately.
The Application Master Service manages Application Masters on all nodes, keeping the Scheduler informed. The AM Liveliness Monitor keeps a list of Application Masters and their last heartbeat times to let the Resource Manager know what applications are healthy on the cluster.
Any Application Master that does not send a heartbeat within a certain interval is marked as dead and re-scheduled to run on a new container.
Resource Manager in High Availability Mode
Before Hadoop 2.4, the Resource Manager was the single point of failure in a YARN cluster. The High Availability, or HA, the feature adds redundancy in the form of an Active/Standby Resource Manager pair to remove this single point of failure.
Resource Manager HA is realized through the Active/Standby architecture. At any point in time, one of the RMs is active and one or more RMs are in Standby mode waiting to take over, should anything happen to the Active. The trigger to transition-to-active comes from either the admin through the Command-Line Interface or through the integrated failover-controller.
The RMs have an option to invade the zookeeper base active standby Elector to decide which RMs should be active. Only active go down or become unresponsive, another RMs is automatically Elector to be active. Note there is no need to run a separate ZKFC Demon like in HDFS. Because the active standby Elector embedded in RMs acts as a failure to a detector and leads to an Elector.
In the next section, let us look at the second most important YARN Architecture element, Application Master.
YARN Architecture Element - Application Master
The second element of YARN architecture is the Application Master. The Application Master in YARN is a framework-specific library, which negotiates resources from the RM and works with the NodeManager or Managers to execute and monitor containers and their resource consumption.
While an application is running, the Application Master manages the application lifecycle, dynamic adjustments to resource consumption, execution flow, faults, and it provides status and metrics.
The Application Master is architected to support a specific framework and can be written in any language. It uses extensible communication protocols with the Resource Manager and the Node Manager.
The Application Master can be customized to extend the framework or run any other code. Because of this, the Application Master is not considered trustworthy and is not run as a trusted service.
In reality, every application has its own instance of an Application Master. However, it is feasible to implement an Application Master to manage a set of applications, for example, an Application Master for Pig or Hive to manage a set of MapReduce jobs.
YARN Architecture Element - Node Manager
The third element of YARN architecture is the Node Manager. When a container is leased to an application, the NodeManager sets up the container environment. The environment includes the resource constraints specified in the lease and any kind of dependencies, such as data or executable files.
The Node Manager monitors the health of the node, reporting to the ResourceManager when a hardware or software issue occurs so that the Scheduler can divert resource allocations to healthy nodes until the issue is resolved. The Node Manager also offers a number of services to containers running on the node such as a log aggregation service.
The Node Manager runs on each node and manages the activities such as container lifecycle management, container dependencies, container leases, node and container resource usage, node health, and log management and reports node and container status to the Resource Manager.
Let us now look at the node manager component YARN container.
Node Manager Component: YARN Container
A YARN container is a collection of a specific set of resources to use in certain amounts on a specific node. It is allocated by the ResourceManager on the basis of the application. The Application Master presents the container to the Node Manager on the node where the container has been allocated, thereby gaining access to the resources.
Now, let us discuss how to launch the container.
The Application Master must provide a Container Launch Context or CLC. This includes information such as Environment variables, dependencies on the requirement of data files or shared objects prior to the launch, security tokens, and the command to create the process to launch the application.
The CLC supports the Application Master to use containers. This helps to run a variety of different kinds of work, from simple shell scripts to applications to a virtual operating system.
Applications on YARN
Owing to YARN is the generic approach, a Hadoop YARN cluster runs various work-loads. This means a single Hadoop cluster in your data center can run MapReduce, Storm, Spark, Impala, and more.
Let us first understand how to run an application through YARN.
Running an Application through YARN
Broadly, there are five steps involved in YARN to run an application:
- The client submits an application to the Resource Manager
- The ResourceManager allocates a container
- The Application Master contacts the related Node Manager
- The Node Manager launches the container
- The container executes the Application Master
Step 1 - Application submitted to the Resource Manager
Users submit applications to the Resource Manager by typing the Hadoop jar command.
The Resource Manager maintains the list of applications on the cluster and available resources on the Node Manager. The Resource Manager determines the next application that receives a portion of the cluster resource. The decision is subject to many constraints such as queue capacity, Access Control Lists, and fairness.
Step 2 - Resource Manager allocates Container
When the Resource Manager accepts a new application submission, one of the first decisions the Scheduler makes is selecting a container. Then, the Application Master is started and is responsible for the entire life-cycle of that particular application.
First, it sends resource requests to the ResourceManager to ask for containers to run the application's tasks.
A resource request is simply a request for a number of containers that satisfy resource requirements such as the following:
- Amount of resources expressed as megabytes of memory and CPU shares Preferred location, specified by hostname or rackname, Priority within this application and not across multiple applications.
- The Resource Manager allocates a container by providing a container ID and a hostname, which satisfies the requirements of the Application Master.
Step 3 - Application Master contacts Node Manager
After a container is allocated, the Application Master asks the Node Manager managing the host on which the container was allocated to use these resources to launch an application-specific task. This task can be any process written in any framework, such as a MapReduce task.
Step 4 -Resource Manager Launches Container
The NodeManager does not monitor tasks; it only monitors the resource usage in the containers.
For example, it kills a container if it consumes more memory than initially allocated.
Throughout its life, the Application Master negotiates containers to launch all of the tasks needed to complete its application. It also monitors the progress of an application and its tasks, restarts failed tasks in newly requested containers, and reports progress back to the client that submitted the application.
Step 5 - Container Executes the Application Master
After the application is complete, the Application Master shuts itself and releases its own container. Though the ResourceManager does not monitor the tasks within an application, it checks the health of the ApplicationMaster. If the ApplicationMaster fails, it can be restarted by the ResourceManager in a new container. Thus, the resource manager looks after the ApplicationMaster, while the ApplicationMaster looks after the tasks.
Tools for YARN Development
Hadoop includes three tools for YARN developers:
- YARN Web UI
- Hue Job Browser
- YARN Command Line
These tools enable developers to submit, monitor, and manage jobs on the YARN cluster.
YARN Web UI
YARN web UI runs on 8088 port by default. It also provides a better view than Hue; however, you cannot control or configure from YARN web UI.
Hue Job Browser
The Hue Job Browser allows you to monitor the status of a job, kill a running job, and view logs.
YARN Command Line
Most of the YARN commands are for the administrator rather than the developer.
A few useful commands for the developer are as follows:
- To list all commands of YARN:
It lists all the commands of yarn.
- To print the version:
|- yarn -version|
It prints the version.
- To view logs of a specified application ID:
|- yarn logs -applicationId <app-id>|
It views logs of specified application ID.
Preparing for the CCA175 exam? Take up this Big Data and Hadoop Developer Practice Test and assess your preparedness.
Next Step to Success
To learn more and get an in-depth understanding of Hadoop and you can enroll in the Big Data Engineer Master’s Program. This program in collaboration with IBM provides online training on the popular skills required for a successful career in data engineering. Master the Hadoop Big Data framework, leverage the functionality of AWS services, and use the database management tool MongoDB to store data.