Hadoop is a widely used big data framework for storing and processing large volumes of data across clusters of machines. Apache MapReduce is one of Hadoop's key components, enabling fast, parallel processing of that data. In this article, you will walk through a MapReduce example and implement a MapReduce algorithm to solve a task.
Apache MapReduce is the processing engine of Hadoop, used to process and compute vast volumes of data. The MapReduce programming paradigm lets you scale the processing of unstructured data across hundreds or thousands of commodity servers in an Apache Hadoop cluster.
It has two main phases: the map phase and the reduce phase.
The input data is fed to the map phase, which transforms it into intermediate key-value pairs. Shuffle, sort, and reduce operations are then performed on those pairs to produce the final output.
Fig: Steps in MapReduce
The MapReduce programming paradigm offers several features and benefits that help you gain insights from vast volumes of data.
Let's understand how the MapReduce algorithm works by examining the job execution flow in detail.
Fig: MapReduce workflow
Shown below is a MapReduce example to count the frequency of each word in a given input text. Our input text is, “Big data comes in various formats. This data can be stored in multiple data servers.”
Fig: MapReduce Example to count the occurrences of words
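To make the three phases in the figure concrete, the word-count flow can be sketched in plain Java, outside the Hadoop framework. This is an illustrative simulation only: the class and method names are our own, and the real Hadoop mapper and reducer interfaces come later in the article.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    public static Map<String, Integer> wordCount(String text) {
        // Map phase: emit a (word, 1) pair for every word in the input.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        // Shuffle and sort phase: group all the 1s emitted for the same word.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // Reduce phase: sum each word's grouped counts to get its frequency.
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return counts;
    }
}
```

Running this on the sample sentence groups the three occurrences of "data" under one key before the reduce step sums them, which is exactly what Hadoop's shuffle does across nodes.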
Shown below is a sample set of call records. Each record contains the phone number that made the call, the phone number that received it, and the total duration of the call. A flag also tells you whether the call was local (0) or STD (1).
We'll use this data to perform certain operations with a MapReduce algorithm. One such operation is finding all the phone numbers that made more than 60 minutes of STD calls.
We'll use the Java programming language for this task.
1. Let’s first declare our constants for the fields.
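A constants class along these lines might look as follows. The field order and the pipe delimiter are assumptions about the record layout, since the exact file format is not shown here; adjust the indices to match your data.

```java
public class CallConstants {
    // Assumed positions of the fields in each delimiter-separated call record.
    public static final int CALLER_NUMBER = 0;    // number that made the call
    public static final int RECEIVER_NUMBER = 1;  // number that received the call
    public static final int DURATION_MINUTES = 2; // total call duration, in minutes
    public static final int STD_FLAG = 3;         // 0 = local call, 1 = STD call

    // Threshold from the task: keep numbers with more than 60 STD minutes.
    public static final int MIN_STD_MINUTES = 60;
}
```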
2. Import all the necessary packages to make sure we use the classes in the right way.
3. The order in which the driver, mapper, and reducer classes are defined does not matter. So, let's start by creating a mapper that performs the map task.
This mapper class emits an intermediate output, which is then sorted and shuffled and passed on to the reducer.
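The core of that map logic can be sketched as a plain-Java method, stripped of the Hadoop `Mapper` boilerplate: given one record, it emits a (caller number, duration) pair only for STD calls, so local calls never reach the reducer. The pipe delimiter and field order are assumptions about the record layout.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

public class StdCallMapSketch {
    // Maps one call record (assumed format: caller|receiver|minutes|stdFlag)
    // to an intermediate (callerNumber, durationMinutes) pair.
    // Local calls (flag 0) are filtered out here, in the map phase.
    public static Optional<Map.Entry<String, Long>> map(String record) {
        String[] fields = record.split("\\|");
        boolean isStd = fields[3].trim().equals("1");
        if (!isStd) {
            return Optional.empty();
        }
        String caller = fields[0].trim();
        long minutes = Long.parseLong(fields[2].trim());
        return Optional.of(new SimpleEntry<>(caller, minutes));
    }
}
```

In a real Hadoop job this logic lives inside a subclass of `org.apache.hadoop.mapreduce.Mapper` and the pair is written out via `context.write(...)` rather than returned.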
4. Next, we define our reducer class.
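The reduce logic, again sketched as a plain-Java method: by the time it runs, the shuffle has already grouped every STD-call duration under its phone number, so the reducer only has to sum the durations and apply the 60-minute threshold. The class and method names are our own.

```java
import java.util.List;
import java.util.OptionalLong;

public class StdCallReduceSketch {
    // Receives one phone number together with all of its STD-call durations
    // (already grouped by the shuffle), and emits the total duration only
    // if it exceeds 60 minutes; otherwise the number is dropped.
    public static OptionalLong reduce(String phoneNumber, List<Long> durations) {
        long total = durations.stream().mapToLong(Long::longValue).sum();
        return total > 60 ? OptionalLong.of(total) : OptionalLong.empty();
    }
}
```

In the Hadoop version this would be a subclass of `org.apache.hadoop.mapreduce.Reducer`, iterating over an `Iterable<LongWritable>` of values for each key.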
5. The driver class holds all the job configuration: the mapper, the reducer, and optionally a combiner class. It is responsible for setting up the MapReduce job to run on the Hadoop cluster. Here you specify the names of the mapper and reducer classes, along with their input/output data types and the job name.
6. Now, package the compiled classes as a .jar file, transfer it to the Hadoop cluster, and run it on top of YARN.
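A driver along these lines might look as follows. This is a hedged sketch against the standard `org.apache.hadoop.mapreduce` API: the class names `CallRecordMapper`, `StdSubscriberReducer`, and `SumCombiner` are placeholders for your own classes, and it assumes the mapper emits `Text` keys and `LongWritable` values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class STDSubscribers {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "STDSubscribers"); // job name

        job.setJarByClass(STDSubscribers.class);
        job.setMapperClass(CallRecordMapper.class);     // placeholder name
        job.setReducerClass(StdSubscriberReducer.class); // placeholder name

        // A combiner pre-aggregates map output on each node. Note: the
        // reducer here cannot be reused as the combiner, because its
        // 60-minute filter would wrongly drop partial sums; a pure-sum
        // combiner (hypothetical SumCombiner) is assumed instead.
        job.setCombinerClass(SumCombiner.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```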
You can locate your call records file using hdfs dfs -ls “Location of the file”
7. Now, we'll pass the call records file in for processing. Use the command below, giving the main class name, the input file location, and another location where the output will be saved.
hadoop jar STDSubscribers.jar org.example.hadoopcodes.STDSubscribers sampleMRIn/calldatarecords.txt sampleMROutput-2
8. Once the command runs successfully, you can view the output by reading the files in the output directory.
hdfs dfs -cat sampleMROutput-2/part-r-00000
MapReduce is a Hadoop framework that helps you process vast volumes of data across multiple nodes. After reading this article, you will have learned what MapReduce is and its essential features.
You also saw how the MapReduce algorithm works through a worked example: finding the phone numbers that satisfy a condition on their STD call durations. Do you have any questions for us? If so, please put them in the comments section of this article, and our team of experts will help you at the earliest!
Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.