Introduction to MapReduce Video Tutorial


4.1 Introduction to YARN and MapReduce

Hello and welcome to the Big Data and Hadoop Developer course offered by Simplilearn. This lesson will introduce you to YARN, which is part of Hadoop 2.7, and its components. It also covers MapReduce concepts and its operations.

4.2 Objectives

After completing this lesson, you will be able to: •Describe the YARN architecture •List the different components of YARN •Explain the concepts of MapReduce •List the steps to install Hadoop in Ubuntu machine •Explain the roles of user and system

4.3 Why YARN

Before 2012, users could write MapReduce programs using scripting languages such as Java, Python, and Ruby. They could also use Pig, a language used to transform data. No matter what language was used, its implementation was dependent on the MapReduce processing model. Hadoop version 2.0 was released in May 2012 with the introduction of ‘Yet Another Resource Navigator,’ popularly known as YARN. YARN is called the Operating System of Hadoop. Significantly, we are not limited to work with the often latent MapReduce framework anymore, as it supports multiple processing models in addition to MapReduce, such as Spark. Other USPs of YARN are significant performance improvement and a flexible execution engine.

4.4 What is YARN

YARN is a resource manager. It was created by separating the processing engine and the management function of MapReduce. It monitors and manages workloads, maintains a multi-tenant environment, manages the high availability features of Hadoop, and implements security controls. In a nutshell, Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing.

4.5 YARN—Real Life Connect

Yahoo was the first company to embrace Hadoop in a big way, and it is a trendsetter within the Hadoop ecosystem. In late 2012, it struggled to handle iterative and stream processing of data on Hadoop infrastructure due to MapReduce limitations. Both were important for Yahoo in facilitating its move from batch computing to continuous computing. After implementing YARN in the first quarter of 2013, Yahoo has installed more than 30,000 production nodes on Spark for iterative processing, Storm for stream processing, and Hadoop for batch processing, which allow it to handle 100 billion+ events (clicks, impressions, email content, and meta-data, and so on) per day. Such a solution was possible only after YARN was introduced and multiple processing frameworks were implemented.

4.6 YARN Infrastructure

The YARN Infrastructure is responsible for providing the computational resources, such as, CPUs or memory needed for application executions. The YARN infrastructure and the HDFS federation are completely decoupled and independent: the former provides resources for running an application while the latter provides storage. The MapReduce framework is only one of the many possible frameworks which runs on top of YARN, although currently, it is the only one implemented. The fundamental idea of MRv2 is to split up the two major functionalities; resource management and job scheduling and monitoring, into separate daemons. There is a global ResourceManager and per-application ApplicationMaster.

4.7 YARN Infrastructure (contd.)

The three important elements of the YARN architecture are: •The ResourceManager, or RM, usually numbered one per cluster, is the master and knows where the slaves are located, referred to as Rack Awareness, and how many resources they have. The RM runs several services, the most important of which is the Resource Scheduler that decides how to assign the resources. •The NodeManager, of which there can be many in one cluster, is the slave of the infrastructure. When it starts, it announces itself to the RM and periodically sends a heartbeat to the RM. Each NodeManager offers resources to the cluster, the resource capacity being the amount of memory and the number of vcores. At run-time, the Resource Scheduler decides how to use this capacity. A container is a fraction of the NodeManager capacity and it is used by the client for running a program. Each NodeManager takes instructions from the ResourceManager, and reports and handles containers on a single node. •The ApplicationMaster is a framework-specific process that negotiates resources for a single application, that is, a single job or a directed acyclic graph of jobs, which runs in the first container allocated for the purpose. Each ApplicationMaster requests resources from the ResourceManager, then works with containers provided by NodeManagers.

4.8 ResourceManager

The RM arbitrates available resources in the cluster among competing applications, with the goal of maximum cluster utilization. It includes a pluggable scheduler called the YarnScheduler, which allows different policies for managing constraints such as capacity, fairness, and Service Level Agreements. The ResourceManager has two main components: Scheduler and ApplicationsManager. The Scheduler is responsible for allocating resources to various running applications depending on common constraints of capacities, queues, and so on. The Scheduler does not monitor or track the status of the application. Also, it does not ensure the restarting of tasks that failed either due to application failure or hardware failures. The Scheduler performs its function based on the resource requirements of the applications; it does so based on the abstract notion of a resource container which incorporates elements such as memory, CPU, disk, and network. The Scheduler has a policy plug-in, which is responsible for partitioning the cluster resources among various queues and applications. The current MapReduce schedulers such as the CapacityScheduler and the FairScheduler are some examples of the plug-in. The CapacityScheduler supports hierarchical queues to enable more predictable sharing of cluster resources. At the core of the ResourceManager is an interface called the ApplicationsManager, which maintains a list of applications that have been submitted, are currently running, or completed. The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster, and restarting the ApplicationMaster container on failure.

4.9 Other ResourceManager Components

The figure shown here displays all the internal components of the ResourceManager. The ResourceManager communicates with application clients through an interface called the ClientService. A client can submit or terminate an application and gain information about the scheduling queue or cluster statistics through the ClientService. Administrative requests are served by a separate interface called the AdminService, through which operators can get updated information about cluster operation. In parallel, the ResourceTrackerService receives node heartbeats from the NodeManager to track new or decommissioned nodes. The NMLivelinessMonitor and NodesListManager keep an updated status of which nodes are healthy so that the scheduler and the ResourceTrackerService can allocate work appropriately. The ApplicationMasterService manages ApplicationMasters on all nodes, keeping the scheduler informed. The AMLivelinessMonitor keeps a list of ApplicationMasters and their last heartbeat times, to let the ResourceManager know what applications are healthy on the cluster. Any ApplicationMaster that does not heartbeat within a certain interval is marked as dead and re-scheduled to run on a new container.

4.10 ResourceManager in HA Mode

Before Hadoop 2.4, the ResourceManager was the single point of failure in a YARN cluster. The High Availability, or HA, feature adds redundancy in the form of an Active/Standby ResourceManager pair to remove this single point of failure. ResourceManager HA is realized through the Active/Standby architecture: at any point of time, one of the RMs is active, and one or more RMs are in Standby mode waiting to take over should anything happen to the Active. The trigger to transition-to-active comes from either the admin, through CLI, or where automatic failover is enabled, through the integrated failover-controller. Let’s take a closer look at Automatic Failover. The RMs have an option to embed the Zookeeper-based ActiveStandbyElector to decide which RM should be the Active. When the Active goes down or becomes unresponsive, another RM is automatically elected to be the Active. Note that, there is no need to run a separate ZKFC daemon, like in HDFS, because the ActiveStandbyElector embedded in RMs acts as a failure detector and a leader elector.

4.11 ApplicationMaster

The ApplicationMaster in YARN is a framework-specific library, which negotiates resources from the RM and works with the NodeManager or Managers to execute and monitor containers and their resource consumption. While an application is running, the ApplicationMaster manages the application lifecycle, dynamic adjustments to resource consumption, execution flow, faults, and provides status and metrics. The ApplicationMaster is architected to support a specific framework, and can be written in any language since its communication with the NodeManagers and the ResourceManager is done using extensible communication protocols. The ApplicationMaster can be customized to extend the framework or run any other code. Because of this the ApplicationMaster is not considered trustworthy, and is not run as a trusted service. In reality, every application has its own instance of an ApplicationMaster. However, it’s feasible to implement an ApplicationMaster to manage a set of applications, for example, an ApplicationMaster for Pig or Hive to manage a set of MapReduce jobs.

4.12 NodeManager

When a container is leased to an application, the NodeManager sets up the container’s environment, including the resource constraints specified in the lease and any dependencies, such as data or executable files. The NodeManager monitors the health of the node, reporting to the ResourceManager when a hardware or software issue occurs so that the scheduler can divert resource allocations to healthy nodes until the issue is resolved. The NodeManager also offers a number of services to containers running on the node such as a log aggregation service. The NodeManager runs on each node and manages the following: •Container lifecycle management •Container dependencies •Container leases •Node and container resource usage •Node health •Log management •Reporting node and container status to the ResourceManager.

4.13 Container

A YARN container is a result of a successful resource allocation, meaning that the ResourceManager has granted an application a lease to use a specific set of resources in certain amounts on a specific node. The ApplicationMaster presents the lease to the NodeManager on the node where the container has been allocated, thereby gaining access to the resources. To launch the container, the ApplicationMaster must provide a container launch context or CLC that includes the following information: •Environment variables •Dependencies, that is, local resources such as data files or shared objects needed prior to launch •Security tokens •The command necessary to create the process the application wants to launch The CLC makes it possible for the ApplicationMaster to use containers to run a variety of different kinds of work, from simple shell scripts to applications to virtual machines.

4.14 Applications Running on YARN

Owing to YARN’s generic approach, a Hadoop YARN cluster running many different workloads is now a possibility. This means a single Hadoop cluster in your data center can run MapReduce, Giraph, Storm, Spark, Tez/Impala, MPI, and more. The single-cluster approach obviously provides a number of advantages, including: •Higher cluster utilization, whereby resources not used by one framework could be consumed by another •Lower operational costs, because only one "do-it-all" cluster needs to be managed and tuned •Reduced data motion, as there's no need to move data between Hadoop YARN and systems running on different clusters of machines

4.15 Application Startup in YARN

Suppose users submit applications to the ResourceManager by typing the hadoop jar command. The ResourceManager maintains the list of applications running on the cluster and available resources on each live NodeManager. The ResourceManager determines which application should get a portion of cluster resources next. The decision is subject to many constraints, such as, queue capacity, ACLs, and fairness. When the ResourceManager accepts a new application submission, one of the first decisions the Scheduler makes is selecting a container in which the ApplicationMaster will run. After the ApplicationMaster is started, it is responsible for the entire life cycle of this application. First, it sends resource requests to the ResourceManager to ask for containers to run the application's tasks. A resource request is simply a request for a number of containers that satisfy resource requirements, such as: •Amount of resources, expressed as megabytes of memory and CPU shares •Preferred location, specified by hostname, rackname, or star to indicate no preference •Priority within this application, and not across multiple applications The ResourceManager grants a container, expressed as container ID and hostname, which satisfies the requirements of the ApplicationMaster. A container allows an application to use a given amount of resources on a specific host. After a container is granted, the ApplicationMaster asks the NodeManager, managing the host on which the container was allocated to use these resources, to launch an application-specific task. This task can be any process written in any framework, such as a MapReduce task or a Giraph task. The NodeManager does not monitor tasks; it only monitors the resource usage in the containers, for example, it kills a container if it consumes more memory than initially allocated. Throughout its life, the ApplicationMaster negotiates containers to launch all of the tasks needed to complete its application. It also monitors the progress of an application and its tasks, restarts failed tasks in newly requested containers, and reports progress back to the client that submitted the application. After the application is complete, the ApplicationMaster shuts itself down and releases its own container. Though the ResourceManager does not perform any monitoring of the tasks within an application, it checks the health of the ApplicationMasters. If the ApplicationMaster fails, it can be restarted by the ResourceManager in a new container. In short, the ResourceManager takes care of the ApplicationMasters, while the ApplicationMasters takes care of tasks.

4.16 Application Startup in YARN (contd.)

The application startup can be summarized as follows: •a client submits an application to the Resource Manager •the ResourceManager allocates a container •the ResourceManager contacts the related NodeManager •the NodeManager launches the container •the Container executes the Application Master

4.17 Role of AppMaster in Application Startup

The ApplicationMaster is responsible for the execution of a single application. It asks the Resource Scheduler for containers and executes specific programs, example the main of a Java class, on the allocated containers. The ApplicationMaster knows the application logic and thus, is framework-specific. The MapReduce framework provides its own implementation of an ApplicationMaster. The ResourceManager is a single point of failure in YARN. Using ApplicationMasters, YARN spreads the metadata related to running applications over the cluster. This reduces the load of the ResourceManager and makes it fast-recoverable.

4.18 Why MapReduce

Prior to 2004, huge amounts data was stored in single servers. If any program would run a query for data stored in multiple servers, logical integration of search results & analyses of data was a nightmare. Not to mention the massive efforts and expenses were involved. The threat of data loss, challenge of data backup, and reduced scalability resulted in snowballing the issue. To counter this, Google introduced MapReduce in December 2004. With this, analysis of data sets that would have taken 8-10 days was done probably in less than 10 minutes. Queries could run simultaneously on multiple servers, and now logically integrate search results and analyze data in real-time. The USP of MapReduce is its fault-tolerance and scalability.

4.19 What is MapReduce

MapReduce is a programming model that is simultaneously processes and analyzes huge data sets logically into separate clusters. While Map sorts the data, Reduce segregates it into logical clusters, thus removing ’bad’ data and retaining the necessary information.

4.20 MapReduce—Real-life Connect

Nutri is a well-known courier facility. It transports documents across the globe. When the staff receive a courier, they color code it based on the country to be transported. The dispatch staff then segregates the courier by the tagged color code. Hence, the reception functions as “Map,” and the dispatch team as “Reduce.”

4.21 MapReduce—Analogy

The MapReduce steps are listed here that represent manual vote counting after an election as an analogy. Step 1: Each poll booth’s ballot papers are counted by a teller. This is a pre-MapReduce step called input splitting. Step 2: Tellers of all booths count the ballot papers in parallel. As multiple tellers are working on a single job, the job execution time will be faster. This is called the Map method. Step 3: The ballot count of each booth under the assembly and parliament seat positions is found, and the total count for the candidates is generated. This is known as the Reduce method. Thus, map and reduce help to execute the job more quickly than can be done using individual counting.

4.22 MapReduce—Analogy (contd.)

As we have seen in the previous screen, the analogy of vote counting helps in understanding the usefulness of MapReduce. The key reason to perform mapping and reducing is to speed up the job execution of a specific process. This can be done by splitting a process into a number of tasks, thus enabling parallelism. If one person counts all of the ballot papers or waits for another to finish the ballot count, it could take a month to receive the election results. When many people count the ballots simultaneously, the results are obtained in one or two days. This is how MapReduce works.

4.23 MapReduce—Example

In this screen, the MapReduce operation is explained using a real-time problem. The job is to perform a word count of the given paragraph. The paragraph is, "This quick brown fox jumps over a lazy dog. A dog is a man's best friend." The MapReduce process consists of input, splitting, mapping, shuffling, and reducing phases. The Input phase refers to providing data for which the MapReduce process is to be performed. The paragraph is used as the input here. The Splitting phase refers to converting a job submitted by the client into a number of tasks. In this example, the job is split into two tasks. The Mapping phase refers to generating a key-value pair for the input. Since this example is about counting words, the sentence is split into words by using the substring method to generate words from lines. The Mapping phase will ensure that the words generated are converted into keys, and a default value of one is allotted to each key. The Shuffling phase refers to sorting the data based on the keys. As shown on the screen, the words are sorted in ascending order. The last phase is the Reducing phase. In this phase, the data is reduced based on the repeated keys by incrementing the value. The word “dog” and the letter “a” are repeated. Therefore, the reducer will delete the key and increment the value depending on the number of occurrences of the key. This is how the MapReduce operation is performed.

4.24 Map Execution

Map execution consists of five phases: map phase, partition phase, shuffle phase, sort phase, and reduce phase. •Map Phase: In the map phase, the assigned input split is read from HDFS, where a split could be a file block by default. Furthermore, input is parsed into records as key-value pairs. The map function is applied to each record to return zero or more new records. These intermediate outputs are stored in the local file system as a file. They are sorted first by bucket number and then by key. At the end of the map phase, information is sent to the master node of its completion. •Partition phase: In the partition phase, each mapper must determine which reducer will receive each of the outputs. For any key, regardless of which mapper instance generated it, the destination partition is the same. Note that the number of partitions will be equal to the number of reducers. •Shuffle Phase: In the shuffle phase, input data is fetched from all map tasks for the portion corresponding to the reduce task’s bucket. •Sort Phase: In the sort phase, a merge-sort of all map outputs occurs in a single run. •Reduce Phase: In the reduce phase, a user-defined reduce function is applied to the merged run. The arguments are a key and corresponding list of values. The output is written to a file in HDFS.

4.25 Map Execution—Distributed Two Node Environment

The mappers on each of the nodes are assigned an input split of blocks. Based on the input format, the Record Reader reads the split as a key-value pair. The map function is applied to each record to return zero or more new records. These intermediate outputs are stored in the local file system as a file. Thereafter, a partitioner assigns the records to a reducer. In the shuffling phase, the intermediate key-value pairs are exchanged by all nodes. The key-value pairs are then sorted by applying the key and reduce function. The output is stored in HDFS based on the specified output format.

4.26 MapReduce Essentials

The essentials of each MapReduce phase are shown on the screen. The job input is specified in key-value pairs. Each job consists of two stages. First, a user defined map function is applied to each input record to produce a list of intermediate key-value pairs. Second, a user-defined reduce function is called once for each distinct key in the map output. Then the list of intermediate values associated with that key is passed. The number of reduce tasks can be defined by the users. Each reduce task is assigned a set of record groups which are intermediate records corresponding to a group of keys. For each group, a user-defined reduce function is applied to the record values in that group. The reduce tasks read from every map task, and each read returns the record groups for that reduce task. Note that the reduce phase cannot start until all mappers have finished processing.

4.27 MapReduce Jobs

A job is a full MapReduce program which typically causes multiple map and reduce functions to be run in parallel over the life of the program. Many copies of map and reduce functions are forked for parallel processing across the input data set. A task is a map or reduce function executed on a subset of data. With this understanding of “job” and “task,” the ApplicationMaster and NodeManager functions become easy to comprehend. The ApplicationMaster is responsible for the execution of single application or MapReduce job. It divides the job requests into tasks and assigns those tasks to Node Managers running on the slave node. The NodeManager has a number of dynamically created resource containers. The size of a container depends on the amount of resources it contains, such as memory, CPU, disk, and network IO. It executes map and reduce task by launching these containers when instructed by the MapReduce Application Master.

4.28 MapReduce and Associated Tasks

MapReduce and associated tasks are listed here. The Map process is an initial step to process individual input records in parallel. The reduce process is all about summating the output with a defined goal as coded in business logic. NodeManager keeps track of individual map tasks and can run in parallel. A map job runs as part of container execution by NodeManager on a particular DataNode. The ApplicationMaster keeps track of a MapReduce job.

4.29 Hadoop Job Work Interaction

The flow diagram represents the Hadoop Job work interaction. •Initially, a Hadoop MapReduce job is submitted by a client in the form of an input file or a number of input split of files containing data. •MapReduce ApplicationMaster distributes the input split to separate NodeManagers. •MapReduce ApplicationMaster coordinates with the NodeManagers. •MapReduce ApplicationMaster resubmits the task(s) to an alternate Node Manager if the DataNode fails •ResourceManager gathers the final output and informs the client of the success or failure status.

4.30 Characteristics of MapReduce

Some MapReduce characteristics are: •MapReduce is designed to handle very large scale data in the range of petabytes and exabytes. •It works well on write once and read many data, also known as WORM data. •MapReduce allows parallelism without mutexes. •The Map and Reduce operations are performed by the same processor. •Operations are provisioned near the data as data locality is preferred. •Commodity hardware and storage is leveraged in MapReduce. •The runtime takes care of splitting and moving data for operations.

4.31 Real-time Uses of MapReduce

Some of the real-time uses of MapReduce are as follows: •Simple algorithms such as grep, text-indexing, and reverse indexing •Data-intensive computing such as sorting •Data mining operations like Bayesian classification •Search engine operations like keyword indexing, ad rendering, and page ranking •Enterprise analytics •Gaussian analysis for locating extra-terrestrial objects in astronomy •Semantic web and web 3.0

4.32 Prerequisites for Hadoop Installation in Ubuntu Desktop 14.04

Ubuntu Desktop 14.04 VM installed with Eclipse and a high-speed Internet connection is required to install Hadoop in Ubuntu Desktop 14.04.

4.33 Steps to Install Hadoop

The steps required to install Hadoop in Ubuntu Desktop 14.04 LTS are listed below: •Create a new VM and install Ubuntu Desktop 14.04 LTS Operating System. •Install all Hadoop features in pseudo-distributed mode. •Start all Hadoop services. •Install Eclipse from Ubuntu Software Centre. •Add all essential jar files to run MapReduce code. •Finally, run a sample code to ensure that the environment settings are correct.

4.34 Business Scenario

Olivia is the EVP of IT operations with Nutri Worldwide, Inc. Tim Burnet, the AVP of IT-infra ops, is assigned to one of her projects. Olivia has asked Tim to analyze two large datasets and set up mock activities for anyone new to MapReduce. One of the datasets includes a novel, and the other one includes a weather report for a city. Tim wants to install Eclipse on his system to create a new project to run default MapReduce programs. Some of the MapReduce programs that he wishes to run include counting the words from a paragraph and finding average temperatures.

4.35 Set up Environment for MapReduce Development

Ensure that all Hadoop services are live and running. This can be verified by applying two steps: •Use the command ’jps’ •Then look for all five services: NameNode, DataNode, NodeManager, ResourceManager, and SecondaryNameNode.

4.36 Small Data and Big Data

Small Data consist of block sizes lesser than 256 MB. To upload Small and Big Data we will use the following sources: •War and Peace which is from an e-book website, and •Weather data that is updated on an hourly basis. The sample dataset from these sources will be provided and can be used to perform MapReduce operations. To download the data click the given URL.

4.37 Uploading Small Data and Big Data

The command to upload any Data, Big or Small, from the local system to HDFS is hadoop (space) fs (space) -copyFromLocal (space) source file address (space) destination file address.

4.39 Build MapReduce Program

The steps to build a MapReduce program are: •Determine if the data can be made parallel and solved using MapReduce. For example, we need to analyze whether the data is write once read many that is WORM, in nature. •Design and implement a solution as mapper and reducer classes. •Compile the source code with Hadoop core, and package the code as jar executable. •Configure the application job as to the number of mapper and reducer tasks and to the number of input and output streams. •Load the data, or use it on previously available data; then launch and monitor the job. •Study the results.

4.41 Hadoop MapReduce Requirements

The user or developer is required to set up the framework with the following parameters: •The locations of the job input in the distributed file system •The locations of the job output in the distributed file system •The input format •The output format •The class containing the map function •The class containing the reduce function, which is optional. If a job does not need a reduce function, there is no need to specify a reducer class. The framework will partition the input, schedule, and execute map tasks across the cluster. If requested, it will sort the results of the map task, and it will execute the reduce tasks with the map output. The final output will be moved to the output directory, and the job status will be reported to the user.

4.42 Steps of Hadoop MapReduce

The image shows the set of classes under the user supply and the framework supply. User supply refers to the set of Java classes and methods provided to a Java developer for developing Hadoop MapReduce applications. Framework supply refers to defining the workflow of a job which is followed by all Hadoop services. As shown in the image, the user provides the input location and the input format as required by the program logic. Once the ResourceManager accepts the input, the specific job is divided into tasks by ApplicationMaster. Each task is then assigned to an individual NodeManager. Once the assignment is complete, the NodeManager will start the map task. It performs shuffling, partitioning, and sorting for individual map outputs. Once the sorting is complete, the Reducer starts the merging process. This is also called the reduce task. The final step is collecting the output, which is performed once all the individual tasks are reduced. This reduction is based on programming logic.

4.43 MapReduce—Responsibilities

The basic user or developer responsibilities of MapReduce are: •Setting up the job, •Specifying the input location, and •Ensuring the input is in the expected format and location. The framework responsibilities of MapReduce are as follows: •Distributing jobs among the ApplicationMaster and NodeManager nodes of the cluster, •Running the map operation, •Performing the shuffling and sorting operations, •Reducing phases, •Placing the output in the output directory, and •Informing the user of the job completion status.

4.44 MapReduce Java Programming in Eclipse

To install Eclipse in the Ubuntu Desktop 14.04, open Ubuntu Software Centre. Type “eclipse” in the search bar as shown on the screen. Select Eclipse from the list. Next, click the Install button to continue.

4.45 Create a New Project

Create a new project and add the essential jar files to run MapReduce programs. To create a new project, click the File menu. Select New Project. Alternatively, press Ctrl+N to start the wizard of the new project. Select Java Project from the list. Then click the Next button to continue. Type the project name as WordCount, and click the Next button to continue. Include jar files from the Hadoop framework to ensure the programs locate the dependencies to one location. Under the Libraries tab, click the Add External JARs... button to add the essential jar files. After adding the jar files, click the Finish button to create the project successfully.

4.46 Checking Hadoop Environment for MapReduce

It is important to check whether the machine setup can perform MapReduce operations. To verify this, use the example jar files deployed by Hadoop. This can be done by running the command shown on the screen. Before executing this command, ensure that the words.txt file resides in the /data/first location.

4.48 MapReduce v 2.7

This image shows the MapReduce v2.7 architecture comprising YARN.

4.50 Quiz

Following are a few questions to test your understanding of the concepts discussed in the lesson.

4.53 Summary

Let us summarize the topics covered in this lesson: •Apache Hadoop 2.7 includes YARN, which separates the resource management and processing components. •The three important elements of the YARN architecture are the ResourceManager, NodeManager, and ApplicationMaster. •MapReduce involves processing jobs using the batch processing technique. •MapReduce can be done using Java programming. •Hadoop provides with hadoop-examples.jar file which is normally used by administrators and programmers to perform testing of the MapReduce applications. •MapReduce contains steps such as splitting, mapping, combining, reducing, and output.

4.54 Conclusion

This concludes ‘Introduction to YARN and MapReduce.’ In the next lesson, we will focus on ‘Advanced HDFS and MapReduce.’

4.38 Installing Ubuntu Desktop OS Demo 1

This demo will show you how to install Ubuntu Desktop OS. Select the language and start the installation by clicking the Install Ubuntu button in the Welcome wizard. Click the Continue button and proceed with the installation. In the Installation Type screen, ensure that the Erase disk and install Ubuntu option is selected. Then, click the Continue button. By default, a drive is selected and the disk size that will be consumed is shown. Click the Install Now button. You will see a progress bar of files being copied. Select your location from the dropdown and click Continue. In the keyboard layout, choose English (US) as the language. Click the Continue button. Ensure that all the textboxes are filled with your name and computer name. Enter the password. Retype the password in the password textbox. Click the Continue button. The installation process starts. Click the Restart Now button once the installation is complete. You will now be able to view the desktop screen of Ubuntu Desktop.

4.40 Build a MapReduce Program Demo 2

In this demo, you will see how to build a MapReduce program. Let us run a MapReduce example shipped along with your Hadoop distribution. In a newer version of Hadoop, the examples are placed in a jar file in the share/hadoop directory. Further, under MapReduce directory, use the specified jar file. We will be running an example to compute the value of ‘pi’, which is a computation intensive program. The first argument indicates how many maps to create. Here, we use 10 mappers. The second argument indicates how many samples are generated per map; here, we take 100 random samples. So this program uses 10 multiplied by 100, that is, 1000 random points to estimate pi. We could enhance 100 to 10 million and improve accuracy. For 1000 samples, the value of pi is returned as nearly equal to 3.148.

4.47 Build a MapReduce Application using Eclipse and Run in Hadoop Cl Demo 3

Here is a demo to run the first MapReduce application using Eclipse. Let’s build a MapReduce Java program in Eclipse and then run in our Hadoop cluster. In this demo, we will run Eclipse in the Windows development machine and our Hadoop cluster will be in Ubuntu. First, let’s launch Eclipse. Enter the workspace location. Click OK. The Eclipse window will open. Close the welcome screen of Eclipse. Select the New menu item. Select Java Project. The New Java Project window opens. We will be build a WordCount program here to count the number of times each word occurs in a particular file. Enter the name of the project as ‘WordCount’ and click Finish. Right click the WordCount project in the panel on the left. Select New and then Class. The New Java Class window opens. Enter the name of the class as ‘WordCount’. Click Finish. Now, let’s copy the WordCount program from the MapReduce tutorial on Hadoop’s website. You may go to Hadoop’s documentation or directly go to the link being shown in the screen. Copy the source code for the Word Count program. You would notice a lot of compilation errors. Let’s fix the build patch now. Select the project WordCount. Select the Project menu item. Click Properties. In libraries, add external JARs. Browse to the unpacked Hadoop directory and go to share- Hadoop-MapReduce directory. Select the Hadoop MapReduce client core and Hadoop MapReduce client common JAR files. Now, go to share-Hadoop-common directory. Select the Hadoop common JAR file. The compilation errors would have gone by now. Let’s now see various portions of this program. The usual Java imports are at the top of the program. Further, there are Hadoop and MapReduce related import statements. Select the Description column header. In the main method, we begin by setting configuration of the MapReduce job. We set the name of the Mapper class. We set the name of Combiner class. Similarly, there is a Reducer class. We can set the output key class. We can also set the output value class. Also, set the input data path for the source dataset. Set the output path to a location where the results are desired. Our Mapper class extends Mapper. It has a map method which takes key and value as arguments and uses context. In the WordCount logic, we just tokenize each line by space character and extract individual words. Our Reducer class similarly extends Reducer. The Reduce method takes a key and an iterable list of values as arguments. The final output is again written as key value pairs. Select the New menu item. Let’s now build and export a JAR file to run this program on a Hadoop cluster. Click File menu and then Export. The Export window opens. Expand Java. Select JAR file. Click the Next button. Enter the path and name of JAR. In this case, let’s name it ‘WordCount.jar’. Make sure that you to select the project. Now, let’s transfer this JAR to the Hadoop cluster. If you are using Windows, you can use any SCP or FTP client such as WinSCP. Login to WinSCP using the IP address of the Hadoop Ubuntu cluster. Enter the username of the Hadoop machine. Enter the password. Select the WordCount.jar file from the local Windows machine. Using WinSCP, you can drag and drop to the Ubuntu machine in the panel on the right. The Copy window opens. Click the Copy button. Now, run the WordCount program in the Hadoop cluster using the hadoop jar command. Specify the input file name on which WordCount is to be applied and also the output result path. View the results in the output directory. You will notice a file named similar to the part r1000. View the contents of this output file using the hadoop fs -cat command. The output will have a count of each word’s occurrence in the input dataset.

4.51 Case Study

Scenario: XY Invest provides investment advice to high net worth individuals and maintains stock market data of various exchanges. It handles stock market data critical for analysis and monitoring. This entails time-intensive processing of huge amounts of data, and vertical scaling is proving to be expensive. In its effort to find an alternate solution for processing, XY identifies that using Hadoop would reduce the company’s cost and effort significantly. It wants to use MapReduce, the processing component of Hadoop. Click Analysis to know the company’s next move. Analysis: The company does research on Hadoop and finds that it is a popular solution for Big Data storage and processing. Hadoop has two components, HDFS for storage and MapReduce for processing. Advantages of using MapReduce: 1.It distributes processing so jobs can be completed really quickly. 2.It is highly fault tolerant, so even if some tasks fail, they will be automatically retried. 3.It provides a Java interface for writing MapReduce jobs. 4.It provides a powerful map and reduce paradigm where map does the filtering of data and reduce does the aggregation of data. Click Solution for the steps to explore MapReduce features.

4.52 Case Study - Demo

Solution: Perform the following steps to explore the features of MapReduce paradigm. 1.Store log data from Hadoop in HDFS. 2.Write a Java program per MapReduce specifications to process the log files. Map does the horizontal and vertical filtering of data; Reduce does the aggregation of data. 3.Create the mapreduce.jar file by compiling the Java program. 4.Run the program using YARN by providing input and output parameters. 5.Verify the program and output.