MapReduce Design Patterns
DESCRIPTION
This was a presentation on my book, MapReduce Design Patterns, given to the Twin Cities Hadoop Users Group. Check it out if you are interested in seeing what my book is about.
TRANSCRIPT
© Copyright 2012 EMC Corporation. All rights reserved.
MapReduce Design Patterns
Donald Miner
Greenplum Hadoop Solutions Architect
@donaldpminer
Book was made available December 2012
Inspiration for my book
What are design patterns? (in general)
Reusable solutions to problems
Domain independent
Not a cookbook, but not a guide
Not a finished solution
Why design patterns? (in general)
Makes the intent of code easier to understand
Provides a common language for solutions
Be able to reuse code
Known performance profiles and limitations of solutions
Why MapReduce design patterns?
Recurring patterns in data-related problem solving
Groups are building patterns independently
Lots of new users every day
MapReduce is a new way of thinking
Foundation for higher-level tools (Pig, Hive, …)
Community is reaching the right level of maturity
Pattern Template
Intent
Motivation
Applicability
Structure
Consequences
Resemblances
Performance analysis
Examples
Pattern Categories
Summarization
Filtering
Data Organization
Joins
Metapatterns
Input and output
Filtering patterns: Extract interesting subsets
“I only want some of my data!”
Filtering
Bloom filtering
Top ten
Distinct
Summarization patterns: Top-down summaries
“I only want a top-level view of my data!”
Numerical summarizations
Inverted index
Counting with counters
Data organization patterns: Reorganize, restructure
“I want to change the way my data is organized!”
Structured to hierarchical
Partitioning
Binning
Total order sorting
Shuffling
Join patterns: Bringing data sets together
“I want to mash my different data sources together!”
Reduce-side join
Replicated join
Composite join
Cartesian product
Metapatterns: Patterns of patterns
“I want to solve a complex problem with multiple patterns!”
Job chaining
Chain folding
Job merging
Input and output patterns: Custom input and output
“I want to get data or put data in an unusual place!”
Generating data
External source output
External source input
Partition pruning
Pattern: “Top Ten” (filtering)
Intent
Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
Motivation
Finding outliers
Top ten lists are fun
Building dashboards
Sorting/Limit isn’t going to work here
Pattern: “Top Ten”
Applicability
Rank-able records
Limited number of output records
Consequences
The top K records are returned.
Pattern: “Top Ten”
Structure
class mapper:
    setup():
        initialize top ten sorted list
    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10:
            truncate list to a length of 10
    cleanup():
        for record in top ten sorted list:
            emit (null, record)

class reducer:
    setup():
        initialize top ten sorted list
    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
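The structure above can be sketched in plain Java without any Hadoop classes. This is only an illustration under stated assumptions: the class name TopKSketch, its method names, and the use of a min-heap of integers in place of a sorted list of records are all hypothetical choices, not the book's code. It shows each "mapper" keeping a local top K for its split and a single "reducer" merging the candidates, so the reducer sees at most [number of splits] x K values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java simulation of the Top Ten structure; no Hadoop classes are used.
class TopKSketch {

    // "Mapper" side: keep only the k largest values seen in one input split.
    static List<Integer> localTopK(List<Integer> split, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int value : split) {
            heap.add(value);
            if (heap.size() > k) {
                heap.poll(); // evict the current smallest candidate
            }
        }
        return new ArrayList<>(heap);
    }

    // "Reducer" side: merge every split's candidates, keep the global top k, largest first.
    static List<Integer> globalTopK(List<List<Integer>> splits, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(localTopK(split, k));
        }
        // candidates.size() is at most splits.size() * k
        List<Integer> top = localTopK(candidates, k);
        top.sort(Collections.reverseOrder());
        return top;
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(5, 1, 9, 3),
                Arrays.asList(7, 8, 2),
                Arrays.asList(6, 4, 10));
        System.out.println(globalTopK(splits, 3)); // prints [10, 9, 8]
    }
}
```

With three splits and K = 3, the merge step handles at most nine candidates regardless of how many values each split held.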
Pattern: “Top Ten”
Resemblances
SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;
Pattern: “Top Ten”
Performance analysis
Pretty quick: map-heavy, low network usage
Pay attention to how many records the reducer is getting: [number of input splits] x K
Example
Top ten StackOverflow users by reputation
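import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {
    // Sorted by reputation; records with equal reputation overwrite each other.
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // MRDPUtils is the helper class from the book's example code.
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        String reputation = parsed.get("Reputation");
        if (reputation == null) {
            return; // skip records without a Reputation attribute
        }
        repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
        if (repToRecordMap.size() > 10) {
            repToRecordMap.remove(repToRecordMap.firstKey()); // drop the lowest reputation
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit this mapper's local top ten once all of its input has been seen.
        for (Text t : repToRecordMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}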
Top Ten Mapper
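import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // All mappers emit under the same null key, so this runs once over every candidate.
        for (Text value : values) {
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
            repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")), new Text(value));
            if (repToRecordMap.size() > 10) {
                repToRecordMap.remove(repToRecordMap.firstKey()); // drop the lowest reputation
            }
        }
        // Write the final top ten, highest reputation first.
        for (Text t : repToRecordMap.descendingMap().values()) {
            context.write(NullWritable.get(), t);
        }
    }
}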
Top Ten Reducer
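One caveat worth illustrating: both listings key a TreeMap by reputation, so two users with the same reputation cannot both be kept. The following plain-Java sketch shows that effect and one possible alternative, a min-heap ordered by reputation. The class TopKWithTiesSketch, its methods, and the sample names are hypothetical and are not taken from the book.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Illustrates tie handling only; records are (reputation, name) pairs.
class TopKWithTiesSketch {

    static Map.Entry<Integer, String> record(int reputation, String name) {
        return new SimpleEntry<Integer, String>(reputation, name);
    }

    // Mirrors the TreeMap approach: at most one record per distinct reputation.
    static int distinctKeyCount(List<Map.Entry<Integer, String>> records) {
        TreeMap<Integer, String> byReputation = new TreeMap<>();
        for (Map.Entry<Integer, String> r : records) {
            byReputation.put(r.getKey(), r.getValue()); // equal keys overwrite
        }
        return byReputation.size();
    }

    // Alternative: a min-heap ordered by reputation keeps tied records side by side.
    static List<String> topK(List<Map.Entry<Integer, String>> records, int k) {
        PriorityQueue<Map.Entry<Integer, String>> heap =
                new PriorityQueue<>(Map.Entry.<Integer, String>comparingByKey());
        for (Map.Entry<Integer, String> r : records) {
            heap.add(r);
            if (heap.size() > k) {
                heap.poll(); // evict the lowest reputation
            }
        }
        List<String> names = new ArrayList<>();
        for (Map.Entry<Integer, String> r : heap) {
            names.add(r.getValue()); // heap iteration order is not sorted
        }
        return names;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> records = Arrays.asList(
                record(100, "a"), record(100, "b"), record(50, "c"));
        System.out.println(distinctKeyCount(records)); // prints 2
        System.out.println(topK(records, 3).size());   // prints 3
    }
}
```

Whether dropped ties matter depends on the ranking; for a dashboard of top reputations it may be acceptable.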
Pattern: “Bloom Filtering” (filtering)
Intent
Keep records that are a member of some predefined set of values. It is not a problem if the output is a bit inaccurate.
Motivation
Similar to normal Boolean filtering, but we are filtering on set membership
Set membership is evaluated with a Bloom filter
Pattern: “Bloom Filtering”
Applicability
A feature can be extracted and tested for set membership
Predetermined set is available
Some false positives are acceptable
Consequences
Records that pass the Bloom filter membership test are returned
Known Uses
Keep all records in a watch list (and a few records that aren’t)
Pre-filtering records before an expensive membership test
Pattern: “Bloom Filtering”
Structure
class mapper:
    setup():
        load bloom filter into memory
    map(key, record):
        if record in bloom filter:
            emit (record, null)
Resemblances
UDFs?
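To make the membership test in the structure concrete, here is a minimal Bloom filter sketch in plain Java. It is not Hadoop's Bloom filter class; BloomFilterSketch, its hashing scheme (positions derived from String.hashCode), and the sample watch list are assumptions made for illustration. Items that were added always pass; other items pass only as occasional false positives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Minimal Bloom filter sketch; not Hadoop's implementation.
class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(item, i));
        }
    }

    // False means definitely absent; true means present or a false positive.
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    // Map-only filtering step: keep records that pass the membership test.
    static List<String> filter(List<String> records, BloomFilterSketch watchList) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (watchList.mightContain(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch watchList = new BloomFilterSketch(1024, 3);
        watchList.add("alice");
        watchList.add("bob");
        List<String> records = Arrays.asList("alice", "carol", "bob", "dave");
        // Always includes "alice" and "bob"; others appear only as false positives.
        System.out.println(filter(records, watchList));
    }
}
```

Each test touches a fixed number of bits (numHashes), which is why membership checks cost the same no matter how many items were added.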
Pattern: “Bloom Filtering”
Performance analysis
Map-only
Slight overhead in moving Bloom filter into memory
Bloom filter membership tests are constant time
Example
Filter StackOverflow comments that do not contain a keyword
Distributed HBase query using a Bloom filter
Candidate new patterns
Link Graph processing patterns (new category)
– Shortest path, diameter, graph stats, connected components, etc.
– Too domain specific?
– Has its own distinct patterns
Projection (filtering)
– Remove “columns” of data
Transformation (data organization?)
– Take a data set but transform it into something else
Future and call to action
Contributing your own patterns
Trends in the nature of data
– Images, audio, video, biomedical, social …
Libraries, abstractions, and tools
Ecosystem patterns: YARN, HBase, ZooKeeper, …