Hadoop is a widely used big data tool for storing and processing large volumes of data in multiple clusters. Apache MapReduce is one of the key components of Hadoop that allows for the faster processing of data.
What is Apache MapReduce?
Apache MapReduce is the processing engine of Hadoop that processes and computes vast volumes of data. MapReduce programming paradigm allows you to scale unstructured data across hundreds or thousands of commodity servers in an Apache Hadoop cluster.
It has two main components or phases, the map phase and the reduce phase.
The input data is fed to the mapper phase to map the data. The shuffle, sort, and reduce operations are then performed to give the final output.
Fig: Steps in MapReduce
MapReduce programming paradigm offers several features and benefits to help gain insights from vast volumes of data.
Features of MapReduce
- MapReduce algorithms help organizations to process vast amounts of data, parallelly stored in the Hadoop Distributed File System (HDFS).
- It reduces the processing time and supports faster processing of data. This is because all the nodes are working with their part of the data, in parallel.
- Developers can write MapReduce codes in a range of languages such as Java, C++, and Python.
- It is fault-tolerant as it considers replicated copies of the blocks in other machines for further processing, in case of failure.
How Does the Hadoop MapReduce Algorithm Work?
Let’s understand how the MapReduce algorithm works by understanding the job execution flow in detail.
- The input data to process using the MapReduce task is stored in input files that reside on HDFS.
- The input format defines the input specification and how the input files are split and read.
- The input split logically represents the data to be processed by an individual Mapper.
- The record reader communicates with the input split and converts the data into key-value pairs suitable for reading by the mapper (k, v).
- The mapper class processes input records from RecordReader and generates intermediate key-value pairs (k’, v’). Conditional logic is applied to ‘n’ number of data blocks present across various data nodes.
- The combiner is a mini reducer. For every combiner, there is one mapper. It is used to optimize the performance of MapReduce jobs.
- The partitioner decides how outputs from the combiner are sent to the reducers.
- The output of the partitioner is shuffled and sorted. All the duplicate values are removed, and different values are grouped based on similar keys. This output is fed as input to the reducer. All the intermediate values for the intermediate keys are combined into a list by the reducer called tuples.
- The record writer writes these output key-value pairs from the reducer to the output files. The output data is stored on the HDFS.
Fig: MapReduce workflow
Shown below is a MapReduce example to count the frequency of each word in a given input text. Our input text is, “Big data comes in various formats. This data can be stored in multiple data servers.”
Fig: MapReduce Example to count the occurrences of words
MapReduce Example to Analyze Call Data Records
Shown below is a sample data of call records. It has the information regarding phone numbers from which the call was made, and to which phone number it was made. The data also gives information about the total duration of each call. It also tells you if the call made was a local (0) or an STD call (1).
We’ll use this data to perform certain operations with the help of a MapReduce algorithm. One of the operations you can perform is to find all the phone numbers that made more than 60 minutes of STD calls.
We’ll use Java programming language to do this task.
1. Let’s first declare our constants for the fields.
2. Import all the necessary packages to make sure we use the classes in the right way.
3. The order of the driver, mapper, and reducer class does not matter. So, let’s create a mapper that will do the map task.
- We will create a TokenizerMapper that will extend our Mapper class. It accepts the desired data types (line 69-70).
- We’ll assign phone numbers and the duration of the calls in minutes (line 72-73).
- The map task works on a string, and it breaks it into individual elements based on a delimiter (line 75-78).
- Then, we’ll check if the string that we are looking for has an STD flag (line 79).
- We will then set the phone numbers using the constant class and find the duration (line 81-83).
- Finally, we’ll extract the phone numbers and the duration of the call made by a particular phone number (line 84-86).
This mapper class will return an intermediate output, which would then be sorted and shuffled and passed on to the reducer.
4. Next, we define our reducer class.
- So, we define our reducer class called SumReducer. The reducer uses the right data types specific to Hadoop MapReduce (line 50-52).
- The reduce (Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is written to a RecordWriter via TaskInputOutputContext.write(Object, Object) (line 54-56).
- It looks into all the keys and values. Wherever it finds that the keys that are repeating and the duration is more than 60 minutes, it would return an aggregated result (line 57-63).
5. The driver class has all the job configurations, mapper, reducer, and also a combiner class. It is responsible for setting up a MapReduce job to run in the Hadoop cluster. You can specify the names of Mapper and Reducer Classes long with data types and their respective job names.
6. Now, package the files as .jar and transfer it to the Hadoop cluster and run it on top of YARN.
You can locate your call records file using hdfs dfs -ls “Location of the file”
7. Now, we’ll input the call records file for processing. Use the command below to locate the file and give the class name, along with another file location to save the output.
hadoop jar STDSubscribers.jar org.example.hadoopcodes.STDSubscribers sampleMRIn/calldatarecords.txt sampleMROutput-2
8. Once you run the above command successfully, you can see the output by checking the directory.
hdfs dfs -cat sampleMROutput-2/part-r-00000
MapReduce is a Hadoop framework that helps you process vast volumes of data in multiple nodes. After reading this article, you would have learned about what MapReduce is, and the essential features of MapReduce. This is the perfect time for you to start your career in big data with Simplilearn's Caltech Post Graduate Program in Data Science.
You also got an idea as to how the MapReduce algorithm works with the help of a MapReduce example, to count the phone numbers based on a condition. Do you have any questions for us? If so, then please put it in the comments section of this article. Our team of experts will help you solve your queries at the earliest!