MapReduce Design Patterns

1 © Copyright 2012 EMC Corporation. All rights reserved. MapReduce Design Patterns Donald Miner Greenplum Hadoop Solutions Architect @donaldpminer

Upload: donald-miner

Post on 15-Jan-2015

DESCRIPTION

This was a presentation on my book MapReduce Design Patterns, given to the Twin Cities Hadoop Users Group. Check it out if you are interested in seeing what my book is about.

TRANSCRIPT

Page 1: MapReduce Design Patterns

MapReduce Design Patterns

Donald Miner, Greenplum Hadoop Solutions Architect

@donaldpminer

Page 2: MapReduce Design Patterns

Book was made available December 2012

Page 3: MapReduce Design Patterns

Inspiration for my book

Page 4: MapReduce Design Patterns

What are design patterns? (in general)

Reusable solutions to problems

Domain independent

Not a cookbook, but not a guide

Not a finished solution

Page 5: MapReduce Design Patterns

Why design patterns? (in general)

Makes the intent of code easier to understand

Provides a common language for solutions

Be able to reuse code

Known performance profiles and limitations of solutions

Page 6: MapReduce Design Patterns

Why MapReduce design patterns?

Recurring patterns in data-related problem solving

Groups are building patterns independently

Lots of new users every day

MapReduce is a new way of thinking

Foundation for higher-level tools (Pig, Hive, …)

Community is reaching the right level of maturity

Page 7: MapReduce Design Patterns

Pattern Template

Intent

Motivation

Applicability

Structure

Consequences

Resemblances

Performance analysis

Examples

Page 8: MapReduce Design Patterns

Pattern Categories

Summarization

Filtering

Data Organization

Joins

Metapatterns

Input and output

Page 9: MapReduce Design Patterns

Filtering patterns: extract interesting subsets

“I only want some of my data!”

Filtering

Bloom filtering

Top ten

Distinct

Summarization patterns: top-down summaries

“I only want a top-level view of my data!”

Numerical summarizations

Inverted index

Counting with counters

Page 10: MapReduce Design Patterns

Data organization patterns: reorganize, restructure

“I want to change the way my data is organized!”

Structured to hierarchical

Partitioning

Binning

Total order sorting

Shuffling

Join patterns: bringing data sets together

“I want to mash my different data sources together!”

Reduce-side join

Replicated join

Composite join

Cartesian product

Page 11: MapReduce Design Patterns

Metapatterns: patterns of patterns

“I want to solve a complex problem with multiple patterns!”

Job chaining

Chain folding

Job merging

Input and output patterns: custom input and output

“I want to get data or put data in an unusual place!”

Generating data

External source output

External source input

Partition pruning

Page 12: MapReduce Design Patterns

Pattern: “Top Ten” (filtering)

Intent

Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.

Motivation

Finding outliers

Top ten lists are fun

Building dashboards

Sorting/Limit isn’t going to work here

Page 13: MapReduce Design Patterns

Pattern: “Top Ten”

Applicability

Rank-able records

Limited number of output records

Consequences

The top K records are returned.

Page 14: MapReduce Design Patterns

Pattern: “Top Ten”

Structure

class mapper:
    setup():
        initialize top ten sorted list
    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10:
            truncate list to a length of 10
    cleanup():
        for record in top ten sorted list:
            emit null, record

class reducer:
    setup():
        initialize top ten sorted list
    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record

Page 15: MapReduce Design Patterns

Pattern: “Top Ten”

Resemblances

SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10;

Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;

Page 16: MapReduce Design Patterns

Pattern: “Top Ten”

Performance analysis

Pretty quick: map-heavy, low network usage

Pay attention to how many records the reducer is getting: [number of input splits] x K

Example

Top ten StackOverflow users by reputation
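
The reducer-load estimate above can be checked with a small simulation. What follows is a minimal plain-Java sketch, not Hadoop code: each inner list stands in for one input split, and the names TopKSketch, topK, and mapPhase are made up for illustration. It assumes a bounded min-heap for the local top K, which is one common choice and not necessarily what the slides use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java simulation of the two-phase Top K pattern (no Hadoop required). */
final class TopKSketch {

    /** Returns the k largest values, highest first, using a min-heap capped at k entries. */
    static List<Integer> topK(Iterable<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // smallest value on top
        for (int v : values) {
            heap.add(v);
            if (heap.size() > k) {
                heap.poll(); // evict the current minimum
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    /** Simulated map phase: every split emits only its local top k. */
    static List<Integer> mapPhase(List<List<Integer>> splits, int k) {
        List<Integer> emitted = new ArrayList<>();
        for (List<Integer> split : splits) {
            emitted.addAll(topK(split, k));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Hypothetical reputations, three splits of four records each.
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(5, 42, 17, 8),
                Arrays.asList(99, 3, 61, 12),
                Arrays.asList(7, 70, 23, 1));
        int k = 2;
        List<Integer> reducerInput = mapPhase(splits, k);
        System.out.println("records sent to reducer: " + reducerInput.size());
        // Simulated reduce phase: one reducer picks the global top k.
        System.out.println("global top " + k + ": " + topK(reducerInput, k));
    }
}
```

With three splits and K = 2, six records reach the single reducer instead of twelve. A split holding fewer than K records emits fewer, so [number of input splits] x K is an upper bound.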

Page 17: MapReduce Design Patterns

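Top Ten Mapper

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Nested inside the job's driver class. MRDPUtils.transformXmlToMap is a helper defined elsewhere.
public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {

    // Reputation -> record, sorted ascending. Records with equal reputation share a key,
    // so only the most recent one is kept.
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void map(Object key, Text value, Context context) {
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        String reputation = parsed.get("Reputation");
        if (reputation == null) {
            return; // skip rows without a Reputation attribute
        }

        // Copy the value: Hadoop reuses the same Text instance between map() calls.
        repToRecordMap.put(Integer.parseInt(reputation), new Text(value));

        // Keep only the ten highest reputations by evicting the lowest key.
        if (repToRecordMap.size() > 10) {
            repToRecordMap.remove(repToRecordMap.firstKey());
        }
    }

    // Emit this mapper's local top ten once all of its input has been seen.
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Text t : repToRecordMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}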

Page 18: MapReduce Design Patterns

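Top Ten Reducer

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Nested inside the job's driver class. All mappers emit under the same null key,
// so a single reduce() call sees every local top ten.
public static class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {

    // Reputation -> record, sorted ascending (same structure as the mapper).
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());

            // Copy the value: Hadoop reuses the Text instance while iterating.
            repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")), new Text(value));

            // Evict the lowest reputation once more than ten records are held.
            if (repToRecordMap.size() > 10) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }

        // Write the global top ten, highest reputation first.
        for (Text t : repToRecordMap.descendingMap().values()) {
            context.write(NullWritable.get(), t);
        }
    }
}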
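
One property of the TreeMap used in the mapper and reducer is worth seeing in isolation: because the map is keyed by reputation, two users with the same reputation collide and only one survives. The sketch below is plain Java under stated assumptions: User is a hypothetical stand-in for a parsed record, and the bounded min-heap is offered as one possible alternative, not as the book's method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Compares a reputation-keyed TreeMap with a bounded heap when reputations tie. */
final class TopTenTies {

    /** Hypothetical record type standing in for a parsed user row. */
    static final class User {
        final String id;
        final int reputation;

        User(String id, int reputation) {
            this.id = id;
            this.reputation = reputation;
        }
    }

    /** Same idea as the slides: a TreeMap keyed by reputation, trimmed to k entries. */
    static int keptWithTreeMap(List<User> users, int k) {
        TreeMap<Integer, User> byRep = new TreeMap<>();
        for (User u : users) {
            byRep.put(u.reputation, u); // an equal reputation replaces the earlier user
            if (byRep.size() > k) {
                byRep.remove(byRep.firstKey());
            }
        }
        return byRep.size();
    }

    /** Alternative: a min-heap capped at k entries, which keeps ties as separate records. */
    static List<User> topKWithHeap(List<User> users, int k) {
        Comparator<User> byReputation = Comparator.comparingInt((User u) -> u.reputation);
        PriorityQueue<User> heap = new PriorityQueue<>(byReputation);
        for (User u : users) {
            heap.add(u);
            if (heap.size() > k) {
                heap.poll(); // evict the lowest reputation
            }
        }
        List<User> result = new ArrayList<>(heap);
        result.sort(byReputation.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(
                new User("a", 100), new User("b", 100), new User("c", 50));
        System.out.println("TreeMap keeps: " + keptWithTreeMap(users, 3));
        System.out.println("Heap keeps:    " + topKWithHeap(users, 3).size());
    }
}
```

For the three sample users, the TreeMap version keeps two records and the heap version keeps all three. Whether dropped ties matter depends on the ranking scheme, so treat this as a design consideration and not a required change.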

Page 19: MapReduce Design Patterns

Pattern: “Bloom Filtering” (filtering)

Intent

Keep records that are a member of some predefined set of values. It is not a problem if the output is a bit inaccurate.

Motivation

Similar to normal Boolean filtering, but we are filtering on set membership

Set membership is evaluated with a Bloom filter

Page 20: MapReduce Design Patterns

Pattern: “Bloom Filtering”

Applicability

A feature can be extracted and tested for set membership

Predetermined set is available

Some false positives are acceptable

Consequences

Records that pass the Bloom filter membership test are returned

Known Uses

Keep all records in a watch list (and a few records that aren’t)

Pre-filtering records before an expensive membership test

Page 21: MapReduce Design Patterns

Pattern: “Bloom Filtering”

Structure

class mapper:
    setup():
        load bloom filter into memory
    map(key, record):
        if record in bloom filter:
            emit (record, null)

Resemblances

UDFs?
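
The mapper structure above relies on a membership test that can return false positives but never false negatives. The following is a minimal plain-Java sketch of such a filter, assuming a BitSet with double hashing. SimpleBloomFilter and its methods are illustrative names, not Hadoop's Bloom filter class, and the hash choice (String.hashCode combined with FNV-1a) is an assumption made only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Toy Bloom filter: false positives are possible, false negatives are not. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** 32-bit FNV-1a over the string's chars, used as the second hash. */
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h;
    }

    /** Double hashing: the i-th bit index is (h1 + i * h2) mod numBits. */
    private int index(String value, int i) {
        return Math.floorMod(value.hashCode() + i * fnv1a(value), numBits);
    }

    void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false; // definitely not a member
            }
        }
        return true; // a member, or a false positive
    }

    /** Stand-in for the map-only job: keep records that pass the membership test. */
    static List<String> filter(List<String> records, SimpleBloomFilter watchList) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (watchList.mightContain(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        SimpleBloomFilter watchList = new SimpleBloomFilter(1024, 3);
        watchList.add("hadoop");
        watchList.add("hive");
        System.out.println(filter(Arrays.asList("hadoop", "pig", "hive"), watchList));
    }
}
```

Every record that was added is always kept. A non-member such as "pig" is usually dropped but may occasionally slip through, which is the inaccuracy the pattern accepts.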

Page 22: MapReduce Design Patterns

Pattern: “Bloom Filtering”

Performance analysis

Map-only

Slight overhead in moving Bloom filter into memory

Bloom filter membership tests are constant time

Example

Filter out StackOverflow comments that do not contain a keyword

Distributed HBase query using a Bloom filter
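
How much memory the filter needs, and therefore how large the loading overhead is, can be estimated before the job runs. The sketch below applies the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, which are general results and not taken from the slides. BloomSizing and its method names are hypothetical.

```java
/** Standard Bloom filter sizing estimates for n expected members and false positive rate p. */
final class BloomSizing {

    /** Number of bits m = ceil(-n * ln(p) / (ln 2)^2). */
    static long optimalBits(long expectedMembers, double falsePositiveRate) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedMembers * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    /** Number of hash functions k = round((m / n) * ln 2), at least 1. */
    static int optimalHashes(long expectedMembers, long numBits) {
        double k = (double) numBits / expectedMembers * Math.log(2);
        return Math.max(1, (int) Math.round(k));
    }

    public static void main(String[] args) {
        long n = 1000;   // hypothetical watch list size
        double p = 0.01; // hypothetical acceptable false positive rate
        long m = optimalBits(n, p);
        System.out.println("bits: " + m + ", hashes: " + optimalHashes(n, m));
    }
}
```

For a 1,000-entry watch list at a 1% false positive rate this gives 9,586 bits (about 1.2 KB) and 7 hash functions, which is consistent with the overhead being slight.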

Page 23: MapReduce Design Patterns

Candidate new patterns

Link Graph processing patterns (new category)

– Shortest path, diameter, graph stats, connected components, etc.

– Too domain specific?

– Has its own distinct patterns

Projection (filtering)

– Remove “columns” of data

Transformation (data organization?)

– Take a data set but transform it into something else

Page 24: MapReduce Design Patterns

Future and call to action

Contributing your own patterns

Trends in the nature of data

– Images, audio, video, biomedical, social …

Libraries, abstractions, and tools

Ecosystem patterns: YARN, HBase, ZooKeeper, …

Page 25: MapReduce Design Patterns