Page 1:

MapReduce: Simplified Data Processing on Large Clusters
By Dinesh Dharme

Page 2:

MapReduce

Page 3:

Motivation: Large-Scale Data Processing

Want to process lots of data (> 1 TB); the size of the web alone is > 400 TB.
Want to parallelize across hundreds/thousands of CPUs; commodity CPUs have become cheap.
Want to make this easy: favour programmer productivity over CPU efficiency.

Page 4:

What is MapReduce?

MapReduce is a parallel programming model and its associated implementation.
It borrows from functional programming.
Many problems can be modeled in the MapReduce paradigm.

Page 5:

MapReduce Features

Automatic parallelization and distribution
Fault tolerance
Load balancing
Network and disk transfer optimization
Status and monitoring tools
A clean abstraction for programmers
Improvements to the core library benefit all users of the library!

Page 6:

Steps in a typical problem solved by MapReduce

Read a lot of data.
Map: extract something you care about from each record.
Shuffle and sort.
Reduce: aggregate, summarize, filter, or transform.
Write the results.

The outline stays the same; only Map and Reduce change to fit the problem.

Page 7:

MapReduce Paradigm

Basic data type: the key-value pair (k, v). E.g. key = URL, value = HTML of the web page.

Users implement an interface of two functions:
Map: (k, v) ↦ <(k1, v1), (k2, v2), (k3, v3), ..., (kn, vn)>
Reduce: (k', <v'1, v'2, ..., v'n'>) ↦ <(k', v''1), (k', v''2), ..., (k', v''n'')> (typically n'' = 1)

All v' with the same k' are reduced together.

Page 8:

Example: Count Word Occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
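To make the dataflow concrete, here is a minimal single-process Python sketch of the same word count. The shuffle step (grouping intermediate values by key) is what the MapReduce runtime performs between the two user functions; the function and variable names here are illustrative, not from the paper.

from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    # Sum all counts emitted for this word.
    return (word, sum(counts))

def mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle and sort: group intermediate values by key, as the runtime would.
    groups = defaultdict(list)
    for input_key, input_value in inputs:
        for k, v in map_fn(input_key, input_value):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]
print(mapreduce(docs, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]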

Page 9: (figure)

Page 10:

Example: Query Frequency

map(String input_key, String input_value):
  // input_key: query log name
  // input_value: query log contents
  for each query q in input_value:
    if q contains "full moon":
      EmitIntermediate(q.issue_time, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: an issue time
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(output_key, AsString(result));

Pages 11–15: (figures)

Page 16:

More Examples:

Distributed grep (sketched below)
Count of URL access frequency
Suggesting terms for query expansion
Distributed sort
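Distributed grep fits the same pattern; as a sketch (reusing the toy driver from the word-count example), the mapper emits matching lines and the reducer is the identity. The pattern argument is hard-coded here purely for illustration.

def grep_map(file_name, contents, pattern="full moon"):
    # Emit ((file, line number), line) for every line containing the pattern.
    return [((file_name, i), line)
            for i, line in enumerate(contents.splitlines())
            if pattern in line]

def grep_reduce(key, lines):
    # Identity reduce: intermediate pairs are copied straight to the output.
    return (key, lines[0])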

Page 17: (figure)

Page 18:

Execution

Create M splits of the input data.
The user provides R, i.e. the number of partitions / number of output files.
Master data structure: keeps track of the state of each map and reduce task (sketched below).
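A minimal sketch of that bookkeeping, with hypothetical field names (the paper records, for each task, a state of idle, in-progress, or completed, plus the identity of its worker machine):

from dataclasses import dataclass
from enum import Enum

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                  # "map" or "reduce"
    state: State = State.IDLE
    worker: str = ""           # machine currently running the task, if any

class Master:
    def __init__(self, M, R):
        # M map tasks (one per input split) and R reduce tasks (partitions).
        self.tasks = ([Task("map") for _ in range(M)] +
                      [Task("reduce") for _ in range(R)])

    def assign(self, task, worker):
        task.state, task.worker = State.IN_PROGRESS, worker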

Page 19: (figure)

Page 20:

Locality

The master divides up tasks based on the location of the data: it tries to schedule a map() task on the same machine as the physical file data, or at least on the same rack.

map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks.
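A sketch of that scheduling preference, assuming the master knows the hosts and racks holding each input split's GFS replicas (the worker objects and their fields are hypothetical):

def pick_worker(replica_hosts, replica_racks, idle_workers):
    # Prefer a worker that already holds a replica of the split,
    # then a worker in the same rack, then any idle worker.
    for w in idle_workers:
        if w.host in replica_hosts:
            return w
    for w in idle_workers:
        if w.rack in replica_racks:
            return w
    return idle_workers[0] if idle_workers else None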

Page 21: (figure)

Page 22:

Parallelism

map() functions run in parallel, creating different intermediate values from different input data sets.

reduce() functions also run in parallel, each working on a different output key.

All values are processed independently.

Bottleneck: the reduce phase can't start until the map phase is completely finished.

Page 23:

Fault Tolerance

The master detects worker failures:
Re-executes completed and in-progress map() tasks (completed map output lives on the failed machine's local disk, so it must be redone).
Re-executes in-progress reduce() tasks.

The master notices that particular input key/value pairs cause crashes in map(), and skips those values on re-execution. Effect: MapReduce can work around bugs in third-party libraries!
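A single-process sketch of the skipping mechanism (in the real system the crashing worker signals the master, which tells the next execution of the task which record to skip):

def map_with_record_skipping(map_fn, records, max_attempts=2):
    # Retry each record up to max_attempts times; records that keep
    # crashing map_fn are skipped, working around buggy library code.
    output, skipped = [], []
    for record in records:
        for _ in range(max_attempts):
            try:
                output.extend(map_fn(*record))
                break
            except Exception:
                continue
        else:
            skipped.append(record)
    return output, skipped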

Page 24:

Fault Tolerance contd.

What if the master fails?
Periodically checkpoint the state of the master data structure.
Write the checkpoint to the GFS filesystem.
A new master recovers from the last checkpoint and continues.

Page 25:

Semantics in the Presence of Failures

Deterministic map and reduce operators are assumed.
Atomic commits of map and reduce task outputs.
Relies on GFS.

Page 26:

Semantics in the Presence of Failures contd.

What if the map/reduce operators are non-deterministic?
In this case, MapReduce provides weaker but still reasonable semantics.

Page 27:

Optimizations

No reduce can start until the map phase is complete, so a single slow disk controller can rate-limit the whole process.

The master redundantly executes "slow-moving" map tasks and uses the results of whichever copy finishes first (a sketch follows). This is done only when the job is close to completion.

Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation? It is safe because the operators are deterministic and task output is committed atomically, so duplicate executions produce identical results and only one is used.
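A sketch of backup execution using Python threads: the same task is submitted twice and the first copy to finish wins. (Abandoning the straggler is best-effort here; its thread may still run to completion in the background.)

from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_with_backup(task_fn, arg):
    # Redundantly execute a slow task and use whichever copy finishes first.
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(task_fn, arg) for _ in range(2)]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False, cancel_futures=True)  # abandon the straggler
    return next(iter(done)).result()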

Page 28:

Fine Tuning

Partitioning function
Ordering guarantees
Combiner function
Bad record skipping
Status information
Counters

Page 29:

Partitioning Function

Default: "hash(key) mod R".
Can be customized, e.g. "hash(Hostname(urlkey)) mod R".
The distribution of keys can be used to determine good partitions.
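Both partitioners sketched in Python (using CRC32 as a stable hash, since Python's built-in string hash is randomized per process):

import zlib
from urllib.parse import urlparse

def default_partition(key, R):
    # Default: spread keys roughly uniformly across the R reduce tasks.
    return zlib.crc32(key.encode()) % R

def host_partition(urlkey, R):
    # Custom: all URLs from the same host land in the same output file.
    return zlib.crc32(urlparse(urlkey).netloc.encode()) % R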

Page 30:

Combiner Function

Runs on the same machine as a map task.
Causes a mini-reduce phase to occur before the real reduce phase.
Saves bandwidth.
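For word count, the combiner does the same partial aggregation as the reducer: thousands of ("the", 1) pairs collapse into a single pre-summed pair on the map machine before crossing the network. A sketch, usable as a drop-in replacement for map_fn in the earlier toy driver:

from collections import Counter

def map_with_combiner(doc_name, contents):
    # Pre-aggregate word counts locally before emitting intermediate pairs.
    return list(Counter(contents.split()).items())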

Page 31:

Performance

Grep:
1,800 machines
10^10 100-byte records (~1 TB)
3-character pattern to be matched (~1 lakh, i.e. ~100,000, records contain the pattern)
M = 15,000, R = 1
Input data chunk size = 64 MB

Page 32:

Performance: Grep (figure)

Page 33:

Performance

Sort:
1,800 machines
10^10 100-byte records (~1 TB)
M = 15,000, R = 4,000
Input data chunk size = 64 MB
2 TB of final output (GFS maintains 2 copies)

Page 34: (figure)

Page 35:

MapReduce Conclusions

MapReduce has proven to be a useful abstraction.
It greatly simplifies large-scale computations at Google.
The indexing code was rewritten using MapReduce: the code is simpler, smaller, and more readable.