Review and Q&A for MapReduce: Simplified Data Processing on Large Clusters [1]
James Worsham, Spring 2019, CSE 6350


Page 1: Review and Q&A for MapReduce: Simplified Data Processing on Large Clusters (ranger.uta.edu/~sjiang/CSE6350-spring-19/Presentations-materials/...)

Review and Q&A for MapReduce: Simplified Data Processing on Large Clusters [1]

James Worsham
Spring 2019
CSE 6350

Page 2

MapReduce: Technical Overview

• “MapReduce is a programming model and an associated implementation for processing and generating large data sets”[1]

• Map functions take input key/value pairs and generate intermediate key/value pairs

• Reduce functions merge all intermediate values associated with the same intermediate key

• Example use cases:

• Word count in a large collection of documents

• Distributed grep across a large collection of inputs

• Distributed sorting
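The word-count use case can be sketched in a few lines of Python. This is a single-process illustration of the programming model, not the paper's C++ interface; the names `map_fn`, `reduce_fn`, and `mapreduce` are hypothetical stand-ins for the user-supplied Map and Reduce functions and the runtime that drives them.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word in the document
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values for one key into a single count
    return sum(counts)

def mapreduce(inputs):
    # Group intermediate pairs by key (the "shuffle"), then reduce each group
    groups = defaultdict(list)
    for doc_name, contents in inputs.items():
        for key, value in map_fn(doc_name, contents):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

print(mapreduce({"d1": "the quick fox", "d2": "the lazy dog"}))
# {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Distributed grep is the same shape: Map emits a line when it matches a pattern, and Reduce is the identity function.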

Page 3

Question 1

Compared with traditional parallel programming models, such as multithreading and MPI, what are the major advantages of MapReduce?

Answer

• A major advantage is automatic parallelization and execution

• Abstracts away the difficulties of distributed execution:

• Input partitioning
• Distributed scheduling
• Fault tolerance
• Load balancing

• Allows developers to build distributed computations without experience with the system's nuances

• Business logic is the only concern

Page 4

Question 2

Use Figure 1 to explain a MR program’s execution.

[1]

Pages 5–11: Answer 2

1. Input files are split into M pieces (typically 16–64 MB); many instances of the program are forked on the cluster

2. The master assigns map and reduce tasks to idle workers

3. Each map worker reads the contents of its assigned input split, runs each key/value (KV) pair through the Map function, and buffers the intermediate KV outputs in memory

4. The buffered pairs are partitioned into R regions; periodically, the partitions are written to local disk; the master is told of the intermediate file locations and forwards them to the reduce workers

5. Reduce workers remotely read the intermediate files and sort (group) the intermediate keys

6. Reduce workers pass each unique key and its corresponding values to the Reduce function; the Reduce function's output is appended to the output files

7. When all map and reduce tasks are complete, the master wakes up the user program and control returns to user code

[1]
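The execution steps above can be sketched as a single-process simulation. The split sizes, the `hash(key) % R` partition function, and the sequential worker loops are illustrative assumptions standing in for the distributed machinery; only the data flow (split → map → partition → sort → reduce) follows the paper.

```python
from collections import defaultdict

M, R = 3, 2  # number of map tasks and reduce tasks

def map_fn(line):
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return sum(counts)

# Step 1: split the input into M pieces
lines = ["a b a", "b c", "a c c"]
splits = [lines[i::M] for i in range(M)]

# Steps 3-4: each "map worker" runs Map over its split and partitions
# the intermediate pairs into R regions (standing in for local-disk files)
regions = [defaultdict(list) for _ in range(R)]
for split in splits:
    for line in split:
        for key, value in map_fn(line):
            r = hash(key) % R  # partition function (an assumption here)
            regions[r][key].append(value)

# Steps 5-6: each "reduce worker" sorts its region's keys and runs Reduce,
# appending one result per unique key to the output
output = {}
for region in regions:
    for key in sorted(region):
        output[key] = reduce_fn(key, region[key])

print(output)
```

Whatever the partitioning, every occurrence of a given key lands in the same region, so each key is reduced exactly once; here the final output is always `a=3, b=2, c=3`.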

Page 12

Question 3

Describe how MR handles worker and master failures.

Master failure:

• Checkpoints are periodically persisted, recording the state and ID of each worker and the locations and sizes of the intermediate files
• If the master task dies, a new copy is started from the last checkpoint

Worker failure:

• The master pings all workers and expects responses; if no response is received, the worker is marked as failed
• If a worker fails, all of its in-progress tasks are rescheduled, as are all of its completed map tasks (their intermediate output lived on the failed machine's local disk)
• When a map task is rescheduled, all reduce workers are notified and read from the new node (if they have not already loaded the data from the failed worker)
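The ping-and-reschedule logic for worker failure can be sketched as follows. The `Master` class, the timeout value, and the task bookkeeping are illustrative assumptions, not the paper's implementation; the point is only the mechanism of marking a silent worker failed and returning its tasks to the pending pool.

```python
import time

TIMEOUT = 10.0  # seconds without a ping response before a worker is marked failed

class Master:
    def __init__(self, workers):
        # Track when each worker was last heard from, and its assigned tasks
        self.last_seen = {w: time.monotonic() for w in workers}
        self.tasks = {w: [] for w in workers}
        self.pending = []  # tasks awaiting reassignment

    def heartbeat(self, worker):
        # Called when a worker responds to a ping
        self.last_seen[worker] = time.monotonic()

    def check_workers(self):
        now = time.monotonic()
        for worker, seen in list(self.last_seen.items()):
            if now - seen > TIMEOUT:
                # Worker marked failed: reschedule its in-progress tasks AND
                # its completed map tasks (their output was on its local disk)
                self.pending.extend(self.tasks.pop(worker))
                del self.last_seen[worker]

master = Master(["w1", "w2"])
master.tasks["w1"] = ["map-0", "map-1"]
master.last_seen["w1"] -= 60  # simulate w1 having gone silent for a minute
master.check_workers()
print(master.pending)  # ['map-0', 'map-1']
```

A fuller sketch would also notify the reduce workers of the rescheduled map tasks' new output locations, as the slide describes.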

Page 13

Question 4

The implementation of MapReduce enforces a barrier between the Map and Reduce phases, i.e., no reducers can proceed until all mappers have completed their assigned workload. For higher efficiency, is it possible for a reducer to start its execution earlier, and why? (clue: think of availability of inputs to reducers)

Page 14

Answer 4

• In the presented Google implementation, it is not possible for a reducer to start before the barrier is reached

• The issue is the shuffle phase: reducers are remotely collecting and sorting intermediate data

• Only once all the KV pairs are collected and sorted is reduce() run

• If reduce() were run before all of its inputs were ready, data would be excluded from the results

Page 15

Answer 4

(Figure: current MapReduce implementation [2])

Pages 16–19: Answer 4

What if we did this without a barrier? (Figure: mappers M1, M2, and M3 feeding reducer R1 [2])

1. Intermediate data from M1 is ready

2. R1 collects and sorts k1, runs reduce(), and appends the result to the final output file, producing R1(k1)

3. Now M2 and M3 are ready

4. If we run reduce() for k1 again, we get a second R1(k1), i.e., two conflicting output results for the same key
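The conflict in step 4 can be shown concretely. This sketch uses made-up intermediate data from three mappers; running reduce() for k1 as soon as M1 finishes yields a result that contradicts the one computed after all mappers are done.

```python
# Intermediate (key, value) pairs as each mapper finishes (illustrative data)
m1 = [("k1", 2), ("k2", 1)]
m2 = [("k1", 3)]
m3 = [("k1", 1), ("k2", 4)]

def reduce_k1(pairs):
    # Sum all intermediate values for key k1 in the data available so far
    return sum(v for k, v in pairs if k == "k1")

early = reduce_k1(m1)            # R1 runs as soon as M1 is done
final = reduce_k1(m1 + m2 + m3)  # R1 runs after the barrier

print(early, final)  # 2 6 -- two conflicting outputs for the same key
```

Because reduce() is not incremental in general, the early output cannot simply be patched; the barrier guarantees each key is reduced exactly once, over its complete set of values.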

Page 20

Questions?

Page 21

References

[1] Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” OSDI, 2004, ai.google/research/pubs/pub62.

[2] Jiang, Song. “Part II: Software Infrastructure in Data Centers: Distributed Execution Engines.” CSE 6350, April 24, 2019, UTA. Microsoft PowerPoint presentation.