2. Embarrassingly Parallel Problems
A.k.a. Delightfully Parallel Problems
Can be easily parallelized
Usually use simple communication patterns
Tasks usually work without much communication
among each other
The Map-Reduce programming model provides a
powerful abstraction to handle embarrassingly
parallel problems
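A minimal sketch of an embarrassingly parallel job, assuming a hypothetical task `square` and the standard-library `concurrent.futures` thread pool. Each input is processed independently, so the workers never need to communicate with each other.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # Hypothetical independent task: needs no data from any other task.
    return n * n

def parallel_squares(numbers, workers=4):
    # Executor.map hands each input to a worker and yields results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, numbers))

print(parallel_squares([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

Threads are used here only to keep the sketch short; CPU-bound work in CPython would typically use processes instead.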
3. Map-Reduce
Common pattern to solve parallel problems
Based on 2 constructs from functional programming,
map & reduce
Introduced by Google
Dean & Ghemawat, “MapReduce: Simplified Data Processing
on Large Clusters,” OSDI, 2004
Extensible for different applications
Scales to a very large number of nodes
Hides details like failures from users
4. Higher-Order Functions
Many languages traditionally (e.g., Java before Java 8) pass
only data as parameters & results of functions
Higher-order functions accept functions as well as data
as parameters, or return functions as results
Supported by, e.g., Python, Ruby, JavaScript
For example
def f(x):
    return x + 3

def g(function, x):
    # g receives the function f as a parameter and calls it twice
    return function(x) * function(x)

print(g(f, 7))  # (7 + 3) * (7 + 3) = 100
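Python's built-in `map` and `functools.reduce` are themselves higher-order functions, and they are the two constructs that Map-Reduce is named after. A small sketch, reusing `f` from the slide; the list `numbers` is made up for illustration.

```python
from functools import reduce

def f(x):
    return x + 3

numbers = [1, 2, 3, 4]

# map applies f to every element; f itself is passed as an argument.
mapped = list(map(f, numbers))

# reduce folds the mapped list into one value using a two-argument function.
total = reduce(lambda a, b: a + b, mapped)

print(mapped)  # [4, 5, 6, 7]
print(total)   # 22
```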
5. Map-Reduce
Accepts 2 functions as inputs
1. Map function
Y fn1(X)
Accepts an input X & returns an output Y
2. Reduce function
Z fn2(List<Y>)
Accepts a list of Y’s & returns an output Z
6. Map-Reduce (Contd.)
Map-reduce support is provided by a function
like the following
Y map-reduce(mapfn, reducefn, List<X>)
The map-reduce implementation takes a list of inputs
& does the following
Applies the map function to each entry in the list, which
emits (key, value) pairs
Collects the results, groups them by key, & then passes each
group to the reduce function as an array
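The two steps above can be sketched as a single-process, in-memory function. This is only an illustration, not any framework's API: `map_reduce`, `word_map`, and `count_reduce` are hypothetical names, and the result is assumed to be a dictionary with one reduced value per key.

```python
from collections import defaultdict

def map_reduce(mapfn, reducefn, inputs):
    groups = defaultdict(list)
    for entry in inputs:
        # Map phase: mapfn yields (key, value) pairs for one input entry.
        for key, value in mapfn(entry):
            # Group phase: collect the values under their key.
            groups[key].append(value)
    # Reduce phase: reducefn receives the list of values for one key.
    return {key: reducefn(values) for key, values in groups.items()}

def word_map(line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield (word, 1)

def count_reduce(values):
    return sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(word_map, count_reduce, lines)
print(counts["the"], counts["fox"], counts["dog"])  # 3 2 1
```

Swapping in a different pair of map & reduce functions reuses the same `map_reduce` driver unchanged.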
8. Map-Reduce for Word Counting
Source: http://xiaochongzhang.me/blog/?p=338
How to do this for a large dataset using a distributed system?
9. In Class Activity
1. Card sorting
2. Card sorting with 2 rounds
3. Identify missing cards
Inspired by Marcio Silva's “The MapReduce Card Game” at
http://blog.marciosilva.com/2012/10/the-mapreduce-card-game.html
10. Why Map-Reduce?
Implementing the same pattern in a distributed
system isn’t easy
Need to worry about communication, failures,
initialization, etc.
MapReduce frameworks take care of all of those
You write the map & reduce functions & call the
framework
It forces you to think in parallel at design time
It gives you a higher level of abstraction to think in
It’s very generic & covers a lot of use cases
See http://wiki.apache.org/hadoop/PoweredBy
11. Map-Reduce Implementations
Can be implemented in many ways
In-memory implementation
Distributed implementation
Communication by messages
Communication by file system
Communication by databases
Communication Requirements
Needs only broadcast & reduce operations
12. Map-Reduce with Hadoop
Apache Hadoop is an implementation of Map-Reduce
Handles all details of distributed execution
You only have to supply the Map & Reduce functions
14. Map-Reduce Data Model (Cont.)
Hadoop breaks the input data into multiple data items at
newlines & runs the map function once for each data item
When executed, the map function outputs (key, value) pairs
Hadoop collects all (key, value) pairs generated by the map
function, sorts them by key, & groups values with the
same key together
For each distinct key, Hadoop runs the reduce function once,
passing the key & the list of values for that key as input
The reduce function outputs (key, value) pairs, & Hadoop
writes them to a file as the final result
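The data flow above can be simulated in a single process. This sketch does not use any Hadoop API: `run_job`, `max_temp_map`, and `max_temp_reduce` are hypothetical names, the "city,temperature" input is made up for illustration, and the result is returned as a list instead of being written to a file.

```python
from itertools import groupby
from operator import itemgetter

def max_temp_map(line):
    # Called once per input line; emits one (key, value) pair.
    city, temp = line.split(",")
    yield (city, int(temp))

def max_temp_reduce(key, values):
    # Called once per distinct key with all values for that key.
    yield (key, max(values))

def run_job(text, mapfn, reducefn):
    # 1. Break the input into data items at newlines.
    items = text.splitlines()
    # 2. Run the map function once per item and collect the pairs.
    pairs = [pair for item in items for pair in mapfn(item)]
    # 3. Sort by key so equal keys are adjacent, then group them.
    pairs.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        # 4. Run the reduce function once per distinct key.
        output.extend(reducefn(key, values))
    return output

data = "colombo,31\nkandy,24\ncolombo,33\nkandy,27"
result = run_job(data, max_temp_map, max_temp_reduce)
print(result)  # [('colombo', 33), ('kandy', 27)]
```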
15. Execution on a Cluster/Cloud
Source: www.cbsolution.net/techniques/ontarget/mapreduce_vs_data_warehouse
17. Designing Map-Reduce Applications
You control task granularity by changing the number of
map & reduce tasks
How many map tasks?
How many reduce tasks?
Fine grain: more parallelism but more
communication overhead, & vice versa
Usually frameworks handle load balancing &
failures
If there are a large number of map tasks, you also need a
Combine function to merge map output before it is sent to reducers
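A minimal sketch of what a Combine function buys, using word counting. The names `map_words`, `combine`, and `reduce_counts` and the two input chunks are assumptions for illustration; each chunk stands in for the input of one map task.

```python
from collections import Counter

def map_words(line):
    # Plain map output: one (word, 1) pair per word occurrence.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combine function: pre-sums the counts produced by ONE map task,
    # so fewer pairs have to be sent to the reduce tasks.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

def reduce_counts(all_pairs):
    # Reduce: sums the (already partially summed) counts per word.
    total = Counter()
    for word, count in all_pairs:
        total[word] += count
    return dict(total)

chunks = ["a b a a", "b b a"]  # one chunk per map task
raw = [map_words(chunk) for chunk in chunks]
combined = [combine(pairs) for pairs in raw]

pairs_without = sum(len(pairs) for pairs in raw)
pairs_with = sum(len(pairs) for pairs in combined)
result = reduce_counts(pair for part in combined for pair in part)

print(pairs_without, pairs_with)  # 7 4
print(result)                     # {'a': 4, 'b': 3}
```

The final counts are the same with or without the combine step; only the number of pairs crossing the network changes.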
18. Examples
Sorting
How to sort an array of 1 million integers using
MapReduce?
Inverted Index
A normal index is a mapping from documents to terms
An inverted index is a mapping from terms to documents
If we have a million documents, how do we build an
inverted index using MapReduce?
Frequency Distribution of Word Occurrences
Count the number of occurrences of each word & build a histogram
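One possible answer to the inverted-index question above, sketched in a single process: the map step emits (term, document) pairs and the reduce step collects the documents for each term. The function names and the three tiny documents are hypothetical.

```python
from collections import defaultdict

def index_map(doc_id, text):
    # Emit (term, doc_id) once for each distinct term in the document.
    for term in set(text.lower().split()):
        yield (term, doc_id)

def index_reduce(term, doc_ids):
    # The sorted list of documents that contain the term.
    return sorted(set(doc_ids))

def build_inverted_index(documents):
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in index_map(doc_id, text):
            groups[term].append(emitted_id)
    return {term: index_reduce(term, ids) for term, ids in groups.items()}

docs = {
    "d1": "map reduce on clusters",
    "d2": "reduce communication overhead",
    "d3": "map tasks",
}
index = build_inverted_index(docs)
print(index["map"])     # ['d1', 'd3']
print(index["reduce"])  # ['d1', 'd2']
```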
19. Examples (Cont.)
Stitch Imagery
For Google Maps, Google needs to combine many
pieces of map data into a single data set
Business Intelligence
A business wants to create a graph of the income
generated by each region & the marketing money spent
on each region
20. Examples (Cont.)
K-Means
Assume you are given a list of coordinates of earthquakes
that happened around the world in the last 50 years.
You are asked to use the K-Means clustering algorithm
to find 10 locations around which those earthquakes
were clustered.
K-Means starts with 10 random cluster locations.
It proceeds iteratively, & at each iteration it assigns each
data point (earthquake) to the closest cluster location
At the end of each iteration, it recalculates each cluster location
as the mean of all data point coordinates assigned to that
location
It stops when the cluster locations don’t change after
recalculation
21. K-Means Algorithm
List kmeans(datapointsList, initialClustersList) {
    oldLocations = null;
    newLocations = initialClustersList;
    // repeat until the cluster locations stop changing
    while (oldLocations != newLocations) {
        for (d in datapointsList) {
            // assign d to closest location in newLocations
        }
        oldLocations = newLocations;
        newLocations = // recalculate locations as mean of assigned points
    }
    return newLocations;
}
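A runnable sketch of the same loop in Python, with each iteration written as a map step (assign each point to its closest location) and a reduce step (average the points per location). It assumes plain Euclidean distance on coordinate tuples, which ignores the curvature of the earth, and it adds a `max_iterations` cap and an empty-cluster rule that are not in the pseudocode.

```python
import math
from collections import defaultdict

def closest(point, locations):
    # Index of the cluster location nearest to the point.
    return min(range(len(locations)),
               key=lambda i: math.dist(point, locations[i]))

def kmeans(points, initial_locations, max_iterations=100):
    locations = list(initial_locations)
    for _ in range(max_iterations):
        # Map: emit (cluster index, point) for every data point.
        groups = defaultdict(list)
        for p in points:
            groups[closest(p, locations)].append(p)
        # Reduce: new location = mean of the points assigned to the cluster.
        new_locations = []
        for i, old in enumerate(locations):
            members = groups.get(i)
            if members:
                new_locations.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:
                new_locations.append(old)  # assumption: empty clusters stay put
        if new_locations == locations:     # converged: nothing moved
            break
        locations = new_locations
    return locations

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
result = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(result)  # [(0.0, 1.0), (10.0, 11.0)]
```

In a distributed setting, each iteration would be one map-reduce job, with the current locations broadcast to all map tasks.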