MapReduce: What It Is and Why It Is Important

Here we briefly explain what MapReduce is and why it has grown so much in popularity. To further clarify that, we give some examples of how it has been used to solve problems in business and science.

MapReduce is the process of taking a list of objects and running an operation over each object in the list (i.e., map) to produce a new list, and then combining the items in that list to calculate a single value or a smaller summary (i.e., reduce). This concept is best explained with a small example.

Suppose we have this data, which shows car sales over some time at one car dealer.

(Ford, Ford, Ford, Mazda, Chevrolet, Chevrolet)

As you can see, this car dealer sold 3 Fords, 1 Mazda, and 2 Chevrolets. We figured that out in our head, but the computer would do this in two steps: map and then reduce.

The map step takes each element in the list and runs some operation over it to produce a new item to put into a new list. Since we want to count car sales, we pair each item with the number 1 to produce this list of pairs:

((Ford, 1), (Ford, 1), (Ford, 1), (Mazda, 1), (Chevrolet, 1), (Chevrolet, 1))

Now we do a reduce operation that groups the pairs by manufacturer and sums the counts in each group to produce this:

((Ford, 3), (Mazda, 1), (Chevrolet, 2))

Now we can read car sales by manufacturer.
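
The two steps above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not Hadoop code; the names sales, to_pair, and merge_counts are made up for this example.

```python
from functools import reduce

# The car sales list from the example above.
sales = ["Ford", "Ford", "Ford", "Mazda", "Chevrolet", "Chevrolet"]

def to_pair(make):
    """Map step: turn one sale into a (make, 1) pair."""
    return (make, 1)

def merge_counts(totals, pair):
    """Reduce step: add one pair's count to the running total for its make."""
    make, count = pair
    updated = dict(totals)
    updated[make] = updated.get(make, 0) + count
    return updated

pairs = list(map(to_pair, sales))
totals = reduce(merge_counts, pairs, {})
print(totals)  # {'Ford': 3, 'Mazda': 1, 'Chevrolet': 2}
```

In a real cluster the map calls would run in parallel on different machines, and the framework would group the pairs by manufacturer before the reduce step runs.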


Why MapReduce Is Rapidly Gaining Ground in the Industry

This relatively simple idea has widespread applications for business. The software behind it has made solving complex problems over massive data sets much easier, which has helped foster its growth.

MapReduce is gaining ground rapidly because the Apache Hadoop and Spark parallel computing systems let programmers use MapReduce to run models over large distributed sets of data and use advanced statistical and machine learning techniques to make predictions, find patterns, uncover correlations, etc.

This lets businesses and other organizations run calculations to:

  • Determine the price for their products that yields the highest profits.
  • Know precisely how effective their advertising is and where they should spend their ad dollars.
  • Make long-range weather predictions.
  • Analyze web clicks, sales records purchased from retailers, and Twitter trending topics to determine what new products the company should produce in the upcoming season.

Before MapReduce, doing this kind of calculation would have been difficult. Now programmers can tackle problems like these with relative ease. Data scientists have coded their complex algorithms into frameworks so that regular programmers can use them. Companies no longer need an entire department of Ph.D. scientists to model data, and they do not need a massive supercomputer to process large sets of data, as MapReduce runs across a network of low-cost commodity machines.

MapReduce Use Case: Global Warming

So how are companies, governments, and organizations using this?

First, we give an example where the goal is to calculate a single value from a set of data through reduction.

Suppose we want to know how much global warming has raised the ocean’s temperature. We have input temperature readings from thousands of buoys all over the globe. We have data in this format:

(buoy, DateTime, longitude, latitude, low temperature, high temperature)

We would attack this problem in several map and reduce steps. The first would be to run a map over every buoy-DateTime reading and add the average temperature as a field:

(buoy, DateTime, longitude, latitude, low, high, average)

We would then drop the DateTime column and average the readings that belong to the same buoy to produce one average temperature for each buoy:

(buoy n, average)

Then the reduce operation runs. A mathematician would say this is a pairwise operation on associative data. In other words, because addition is associative, the (buoy, average) pairs can be summed two at a time in any order; we then divide that sum by the number of buoys to produce the average of averages:

ocean average temperature = (average (buoy n) + average (buoy n-1) + … + average (buoy 2) + average (buoy 1)) / number of buoys
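
The chain of map and reduce steps for the buoy data can be sketched on a single machine as follows. The readings, the buoy names, and the helper functions add_average, accumulate, and add_pairs are hypothetical and only illustrate the shape of the computation.

```python
from functools import reduce

# Hypothetical readings in the format shown above:
# (buoy, date_time, longitude, latitude, low, high)
readings = [
    ("buoy-1", "2016-03-01 00:00", -30.0, 10.0, 18.0, 22.0),
    ("buoy-1", "2016-03-02 00:00", -30.0, 10.0, 19.0, 23.0),
    ("buoy-2", "2016-03-01 00:00", 140.0, -5.0, 24.0, 28.0),
    ("buoy-2", "2016-03-02 00:00", 140.0, -5.0, 25.0, 27.0),
]

def add_average(reading):
    """Map step: append the average of low and high to one reading."""
    low, high = reading[4], reading[5]
    return reading + ((low + high) / 2,)

def accumulate(sums, reading):
    """Reduce step: keep a running (total, count) of averages per buoy."""
    buoy, average = reading[0], reading[-1]
    total, count = sums.get(buoy, (0.0, 0))
    return {**sums, buoy: (total + average, count + 1)}

def add_pairs(left, right):
    """Associative pairwise sum of two (total, count) pairs."""
    return (left[0] + right[0], left[1] + right[1])

with_average = list(map(add_average, readings))
per_buoy = reduce(accumulate, with_average, {})
buoy_averages = {buoy: total / count for buoy, (total, count) in per_buoy.items()}

# Final reduce: sum the per-buoy averages pairwise, then divide by the count.
total, count = reduce(add_pairs, ((avg, 1) for avg in buoy_averages.values()))
ocean_average = total / count
print(buoy_averages, ocean_average)
```

Carrying a (total, count) pair through the reduce, rather than a running average, is what keeps the operation associative, so partial results from different machines can be combined in any order.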

MapReduce Use Case: Drug Trials

Here we provide an example from the drug industry because pharmaceuticals are one industry where mathematicians and data scientists have traditionally worked.

As mentioned above, the invention of MapReduce and the dissemination of data science algorithms in big data systems mean that ordinary IT departments can now tackle problems that would have required Ph.D. scientists and supercomputers in the past.

A company conducts drug trials to show whether its new drug works against some illness. This is a problem that fits perfectly into the MapReduce model. In this case, we want to run a regression model against a set of patients who have been given the new drug and calculate how effective the drug is in combatting the disease.   

Suppose the drug is a cancer drug. We have data points like this:

{ (patient name: John, DateTime: 3/01/2016 14:00, dosage: 10 mg, size of cancer tumor: 1 mm) }

The first step here would be to calculate the change in the size of the tumor from one DateTime to the next. Different patients would be taking different amounts of the drug, so we would want to know what amount of the drug works best. Using MapReduce, we would try to reduce this problem to some linear relationship like this:

percent reduction in tumor = x (quantity of drug) + y (period of time) + constant value

If some correlation exists between the drug and the reduction in the tumor, then the drug can be said to work. The model would also show to what degree it works by calculating the error statistic.
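
One way to see why a regression fits the MapReduce model is that a least-squares fit only needs a handful of sums, and sums can be computed with map and reduce. The sketch below is a simplification under stated assumptions: it uses a single predictor (dosage) instead of the two in the relationship above, the observations are invented for illustration, and the helper names to_sums and add_sums are hypothetical.

```python
from functools import reduce

# Hypothetical observations: (dosage in mg, percent reduction in tumor size)
observations = [(10, 5), (20, 9), (30, 13), (40, 17)]

def to_sums(observation):
    """Map step: turn one observation into (n, x, y, x*y, x*x)."""
    x, y = observation
    return (1, x, y, x * y, x * x)

def add_sums(left, right):
    """Reduce step: element-wise addition, which is associative."""
    return tuple(a + b for a, b in zip(left, right))

n, sum_x, sum_y, sum_xy, sum_xx = reduce(add_sums, map(to_sums, observations))

# Ordinary least squares for: percent reduction = slope * dosage + constant
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
constant = (sum_y - slope * sum_x) / n
print(slope, constant)
```

Because add_sums is associative, each machine could reduce its own share of the patient records and the partial sums could then be combined in any order. A model with both dosage and time as predictors would carry more sums through the same pattern.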

Solving Problems on a Large Scale

Two things make this a technological breakthrough. First, we can process unstructured data on a large scale, meaning data that does not easily fit into a relational database. Second, it takes the tools of data science and lets them run over distributed datasets. In the past, those could only run on a single computer.

The relative simplicity of the MapReduce tools, their power, and their application to business, military, science, and other problems explain why MapReduce is proliferating. This growth will only increase as more people come to understand how to apply these tools to their situation.

About the Author

Simon Tavasoli

Simon Tavasoli is a Business Analytics Lead with more than 12 years of hands-on and leadership experience in various industries. He has led the development of many analytic projects that drive product and marketing initiatives. He has more than 10 years of experience teaching Data Science, Data Visualization, Predictive Analytics, and Statistics.
