MapReduce: What it is and Why it is Important


Simon Tavasoli

Published on December 27, 2016



What is MapReduce?

Here we briefly explain what MapReduce is and why it has grown so much in popularity. To illustrate, we give some examples of how it has been used to solve problems in business and science.

MapReduce is the process of taking a list of objects and running some operation over each object in the list (i.e., map) to produce a new list, and then combining the items in that list to calculate a smaller set of results or a single value (i.e., reduce). This concept is best explained with a small example.

Suppose we have this data which shows car sales over some period of time at one car dealer.

(Ford, Ford, Ford, Mazda, Chevrolet, Chevrolet)

As you can see, this car dealer sold 3 Fords, 1 Mazda, and 2 Chevrolets. We figured that out in our heads, but a computer would do this in two steps: map and then reduce.

The map step takes each element in the list and runs some operation over it to produce a new item to put into a new list. Since we want to count car sales, we pair each item with the number 1 to produce this list of pairs:

((Ford, 1), (Ford, 1), (Ford, 1), (Mazda, 1), (Chevrolet, 1), (Chevrolet, 1))

Now we do a reduce operation to place similar items together and then sum them to produce this:

((Ford, 3), (Mazda, 1), (Chevrolet, 2))

Now we can simply read car sales by manufacturer.
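The two steps above can be sketched in a few lines of Python. This is only a minimal single-machine illustration using the built-in map and functools.reduce, not how Hadoop or Spark implement it; the helper name add_pair is made up for this example.

```python
from functools import reduce

# The dealer's sales from the example above.
sales = ["Ford", "Ford", "Ford", "Mazda", "Chevrolet", "Chevrolet"]

# Map step: pair every sale with the number 1.
pairs = list(map(lambda make: (make, 1), sales))


def add_pair(totals, pair):
    """Reduce step: fold one (make, count) pair into the running totals."""
    make, count = pair
    totals[make] = totals.get(make, 0) + count
    return totals


# Reduce step: group identical makes and sum their counts.
totals = reduce(add_pair, pairs, {})
print(totals)  # {'Ford': 3, 'Mazda': 1, 'Chevrolet': 2}
```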

Why MapReduce Is Rapidly Gaining Ground in the Industry

This relatively simple idea has widespread applications in business. The software behind it has also made solving complex problems over massive data sets much easier, which has helped foster its growth.

MapReduce is gaining ground rapidly because the Apache Hadoop and Spark parallel computing systems let programmers use MapReduce to run models over large distributed sets of data and apply advanced statistical and machine learning techniques to make predictions, find patterns, uncover correlations, and so on.

This lets businesses and other organizations run calculations to:

  • Determine the price for their products that yields the highest profits.
  • Know precisely how effective their advertising is and where they should spend their ad dollars.
  • Do long-range weather predictions.
  • Mine web clicks, sales records purchased from retailers, and Twitter trending topics to determine what new products the company should produce in the upcoming season.

Before MapReduce, doing this kind of calculation would have been difficult. Now programmers can tackle problems like these with relative ease. Data scientists have coded their complex algorithms into frameworks so that regular programmers can use them. Companies no longer need an entire department of PhD scientists to model data, and they do not need a massive supercomputer to process large sets of data, because MapReduce runs across a network of low-cost commodity machines.

MapReduce Use Case:  Global Warming

So how are companies, governments, and other organizations using this?

First, we give an example where the goal is to calculate a single value from a set of data through reduction.  

Suppose we want to know how much global warming has raised the ocean’s temperature.  We have as input temperature readings from thousands of buoys all over the globe.   We have data in this format:

(buoy, dateTime, longitude, latitude, low temperature, high temperature)

We would attack this problem in several map and reduce steps.  The first would be to run map over every buoy-dateTime reading and add the average temperature as a field:

(buoy, dateTime, longitude, latitude, low, high, average)

Then we would drop the dateTime column and average these items for each buoy to produce one average temperature per buoy:

(buoy n, average)

Then the reduce operation runs. A mathematician would say this is a pairwise operation that is associative: the (buoy, average) values can be summed two at a time, in any grouping, and the result is the same. We sum the averages this way and then divide that sum by the number of buoys to produce the average of averages:

ocean average temperature = (average (buoy n) + average (buoy n-1) + … + average (buoy 2) + average (buoy 1)) / number of buoys
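The buoy calculation can be sketched the same way. The readings below are invented purely for illustration, and the helper name combine is an assumption of this sketch; only the record layout (buoy, dateTime, longitude, latitude, low, high) comes from the example above.

```python
from functools import reduce

# Hypothetical readings: (buoy, dateTime, longitude, latitude, low, high)
readings = [
    ("buoy 1", "2016-03-01 14:00", -40.0, 30.0, 10.0, 14.0),
    ("buoy 1", "2016-03-02 14:00", -40.0, 30.0, 12.0, 16.0),
    ("buoy 2", "2016-03-01 14:00", 150.0, -20.0, 20.0, 24.0),
]

# Map: add the average of low and high as a new field on every reading.
with_avg = list(map(lambda r: r + ((r[4] + r[5]) / 2,), readings))


def combine(acc, item):
    """Reduce: keep a running (sum, count) of averages for each buoy."""
    buoy, avg = item
    total, count = acc.get(buoy, (0.0, 0))
    acc[buoy] = (total + avg, count + 1)
    return acc


# Drop every column except (buoy, average), then reduce per buoy.
sums = reduce(combine, ((r[0], r[6]) for r in with_avg), {})
buoy_avgs = {buoy: total / count for buoy, (total, count) in sums.items()}

# Final reduce: sum the per-buoy averages pairwise, then divide by the count.
ocean_avg = reduce(lambda a, b: a + b, buoy_avgs.values()) / len(buoy_avgs)
print(ocean_avg)  # 17.5
```

Note that only the summing is done pairwise; the division by the number of buoys happens once, at the end.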

MapReduce Use Case:  Drug Trials

Here we provide an example from the drug industry because pharmaceuticals is one of the industries where mathematicians and data scientists have traditionally worked.

As mentioned above, the invention of MapReduce and the dissemination of data science algorithms in big data systems mean that ordinary IT departments can now tackle problems that would have required PhD scientists and supercomputers in the past.

A company conducts drug trials to show whether its new drug works against some illness. This is a problem that fits well into the MapReduce model. In this case, we want to run a regression model against a set of patients who have been given the new drug and calculate how effective the drug is in combating the disease.

Suppose the drug is a cancer drug.  We have data points like this:

{ (patient name: John, dateTime: 3/01/2016 14:00, dosage: 10 mg, size of cancer tumor: 1 mm) }

The first step here would be to calculate the change in the size of the tumor from one dateTime to the next. Different patients would be taking different amounts of the drug, so we would want to know what amount of the drug works best. Using MapReduce, we would try to reduce this problem to some linear relationship like this:

percent reduction in tumor = x (quantity of drug) + y (period of time) + constant value

If some correlation exists between the drug and the reduction in the tumor then the drug can be said to work.  The model would also show to what degree it works by calculating the error statistic.
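One way such a fit could be expressed as map and reduce steps is ordinary least squares: each observation is mapped to its contribution to the normal equations, the contributions are summed pairwise, and a small linear system is solved at the end. This is a sketch under stated assumptions, not necessarily the method a real trial would use. The observations are invented (generated from 2 × dose + 0.5 × days + 1), and the function names to_stats, add_stats, and solve are hypothetical.

```python
from functools import reduce

# Hypothetical observations: (dosage in mg, days on drug, percent reduction in tumor)
observations = [
    (10.0, 7.0, 24.5),
    (20.0, 7.0, 44.5),
    (10.0, 14.0, 28.0),
    (20.0, 14.0, 48.0),
    (15.0, 21.0, 41.5),
]


def to_stats(obs):
    """Map: turn one observation into its share of the normal equations."""
    dose, days, pct = obs
    row = [dose, days, 1.0]  # the trailing 1.0 carries the constant term
    xtx = [[a * b for b in row] for a in row]
    xty = [a * pct for a in row]
    return xtx, xty


def add_stats(left, right):
    """Reduce: element-wise sum of two partial results (associative)."""
    xtx = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(left[0], right[0])]
    xty = [a + b for a, b in zip(left[1], right[1])]
    return xtx, xty


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    result = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * result[j] for j in range(i + 1, n))
        result[i] = (m[i][n] - known) / m[i][i]
    return result


xtx, xty = reduce(add_stats, map(to_stats, observations))
x, y, constant = solve(xtx, xty)
print(x, y, constant)
```

Because the sample data was generated from an exact linear rule, the fitted x, y, and constant should come out very close to 2.0, 0.5, and 1.0. Real trial data would be noisy, and the error statistic mentioned above would measure how well the line fits.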

Solving Problems on a Large Scale

Two things make this a technological breakthrough. First, we can process unstructured data on a large scale, meaning data that does not easily fit into a relational database. Second, it takes the tools of data science and lets them run over distributed datasets. In the past, those tools could only run on a single computer.

The relative simplicity of the MapReduce tools, their power, and their application to business, military, science, and other problems explain why MapReduce is growing so rapidly. This growth will only increase as more people come to understand how to apply these tools to their situation.
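To see why this style of computation distributes well, consider the car sales again, this time split into chunks. The sketch below only simulates separate machines with ordinary Python lists; the three-way split is an arbitrary assumption, and a real cluster would handle partitioning and data movement itself.

```python
from collections import Counter
from functools import reduce

sales = ["Ford", "Ford", "Ford", "Mazda", "Chevrolet", "Chevrolet"]

# Pretend each chunk is stored on a different machine.
chunks = [sales[0:2], sales[2:4], sales[4:6]]

# Each "machine" counts only its own chunk.
partials = [Counter(chunk) for chunk in chunks]

# The partial counts are merged pairwise. Because addition is associative,
# the merge order does not change the result.
totals = reduce(lambda a, b: a + b, partials)
print(dict(totals))  # {'Ford': 3, 'Mazda': 1, 'Chevrolet': 2}
```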

About the Author

Simon Tavasoli is a Business Analytics Lead with more than 12 years of hands-on and leadership experience in various industries. He has led the development of many analytic projects that drive product and marketing initiatives. He has more than 10 years of experience teaching Data Science, Data Visualization, Predictive Analytics, and Statistics.

