MapReduce: Simplified Data Processing on Large Clusters Jeff Dean and Sanjay Ghemawat, Google, Inc. Presenter: Xiaoying Wang

Post on 27-Jun-2020

Page 1: MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters

Jeff Dean and Sanjay Ghemawat, Google, Inc.

Presenter: Xiaoying Wang

Page 2

MapReduce: A programming model and an associated implementation for processing and generating large datasets

Page 3

Background

Programming Model Definition

Implementation & Refinement

Page 4

Background

Programming Model Definition

Implementation & Refinement

Page 5

Single Machine

CPU

Memory

Disk

Fails to keep up with rapidly growing data in both storage and computation

Page 6

Commodity Cluster

[Figure: machines in a commodity cluster connected through network switches]

Page 7

Storage: Google File System (GFS)

Splits each file into chunks

Chunk server

Stores chunks

Each chunk is replicated on several chunk servers

Master server

Stores metadata

[Figure: a file split into Chunk 1, Chunk 2, and Chunk 3]

Page 8

Computation: General Implementation

Defining computation

Partitioning input

Scheduling

Handling failures

Managing inter-machine communication

…

Page 9

Computation

Interface: Defining computation

Hidden from the user: Partitioning input, Scheduling, Handling failures, Managing inter-machine communication, …

Page 10

Background

Programming Model Definition

Implementation & Refinement

Page 11

The MapReduce Programming Model

Input: list(<k1, v1>) -> Map -> list(<k2, v2>) -> list(<k2, list(v2)>) -> Reduce -> Output: list(v2)

Map: <k1, v1> -> list(<k2, v2>)

Reduce: <k2, list(v2)> -> list(v2)
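
The two signatures above can be read as type hints. The following is a minimal single-process sketch of the model, not the distributed implementation; the names `map_reduce`, `map_fn`, and `reduce_fn` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2", bound=Hashable)
V2 = TypeVar("V2")


def map_reduce(
    inputs: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], List[Tuple[K2, V2]]],
    reduce_fn: Callable[[K2, List[V2]], List[V2]],
) -> Dict[K2, List[V2]]:
    """Sequential stand-in for the model (hypothetical helper)."""
    # Map: <k1, v1> -> list(<k2, v2>)
    intermediate: List[Tuple[K2, V2]] = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))

    # Group: list(<k2, v2>) -> list(<k2, list(v2)>)
    grouped: Dict[K2, List[V2]] = defaultdict(list)
    for k2, v2 in intermediate:
        grouped[k2].append(v2)

    # Reduce: <k2, list(v2)> -> list(v2), kept per key for readability
    return {k2: reduce_fn(k2, values) for k2, values in grouped.items()}


# Example: pass each pair through unchanged, then total the values per key.
pairs = [("a", 1), ("b", 2), ("a", 3)]
totals = map_reduce(pairs, lambda k, v: [(k, v)], lambda k, vs: [sum(vs)])
```

Only `map_fn` and `reduce_fn` come from the user; the grouping step in the middle belongs to the framework.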

Page 12

Example: Word Count

Input file: Cat Dog Fish Dog Cat Cat

Splitting input: file -> list(<index, line>)

1 Cat Dog

2 Fish Dog

3 Cat Cat

Page 13

Example: Word Count

Map: <index, line> -> list(<w, 1>)

1 Cat Dog -> Cat 1, Dog 1

2 Fish Dog -> Fish 1, Dog 1

3 Cat Cat -> Cat 1, Cat 1

Page 14

Example: Word Count

Aggregate: list(<w, 1>) -> list(<w, list(1)>)

Cat 1, Dog 1, Fish 1, Dog 1, Cat 1, Cat 1 becomes:

Cat 1, Cat 1, Cat 1

Dog 1, Dog 1

Fish 1

Page 15

Example: Word Count

Reduce: <w, list(1)> -> list(“w, n”)

Cat 1, Cat 1, Cat 1 -> Cat 3

Dog 1, Dog 1 -> Dog 2

Fish 1 -> Fish 1
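
The four stages of the word count example (split, map, aggregate, reduce) can be sketched as four small functions. This is an illustrative single-process version; the function names are hypothetical, and the output is sorted by word only to make the result easy to read.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_input(text: str) -> List[Tuple[int, str]]:
    """Splitting input: file -> list(<index, line>), with 1-based indices."""
    return list(enumerate(text.splitlines(), start=1))


def map_fn(index: int, line: str) -> List[Tuple[str, int]]:
    """Map: <index, line> -> list(<w, 1>)."""
    return [(word, 1) for word in line.split()]


def aggregate(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Aggregate: list(<w, 1>) -> list(<w, list(1)>)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return dict(grouped)


def reduce_fn(word: str, ones: List[int]) -> List[str]:
    """Reduce: <w, list(1)> -> one "w n" string per word."""
    return [f"{word} {sum(ones)}"]


def word_count(text: str) -> List[str]:
    mapped = [
        pair
        for index, line in split_input(text)
        for pair in map_fn(index, line)
    ]
    output: List[str] = []
    for word, ones in sorted(aggregate(mapped).items()):
        output.extend(reduce_fn(word, ones))
    return output


result = word_count("Cat Dog\nFish Dog\nCat Cat")
# result == ["Cat 3", "Dog 2", "Fish 1"]
```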

Page 16

Parallel Computing

[Figure: the input list(<k1, v1>) is divided among several Map instances that run in parallel; their list(<k2, v2>) outputs are grouped into list(<k2, list(v2)>) and handed to several Reduce instances, each producing list(v2)]

Map: <k1, v1> -> list(<k2, v2>)

Reduce: <k2, list(v2)> -> list(v2)
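
Because each Map call depends only on its own record and each Reduce call only on its own key, both phases can run in parallel. The sketch below uses a thread pool as a stand-in for worker machines; that substitution, and the name `parallel_word_count`, are assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def map_fn(record: Tuple[int, str]) -> List[Tuple[str, int]]:
    index, line = record
    return [(word, 1) for word in line.split()]


def reduce_fn(item: Tuple[str, List[int]]) -> Tuple[str, int]:
    word, ones = item
    return word, sum(ones)


def parallel_word_count(
    records: List[Tuple[int, str]], workers: int = 3
) -> Dict[str, int]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Every input record can be mapped independently of the others.
        mapped = list(pool.map(map_fn, records))

        # Grouping needs all map output, so it acts as a barrier.
        grouped: Dict[str, List[int]] = defaultdict(list)
        for pairs in mapped:
            for word, one in pairs:
                grouped[word].append(one)

        # Every key can be reduced independently of the others.
        return dict(pool.map(reduce_fn, grouped.items()))


counts = parallel_word_count([(1, "Cat Dog"), (2, "Fish Dog"), (3, "Cat Cat")])
# counts == {"Cat": 3, "Dog": 2, "Fish": 1}
```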

Page 17

Generalization Capability

Build Inverted Index for Search Engine

Map: <doc ID, content> -> list(<word, doc ID>)

Reduce: <word, list(doc ID)> -> list(“word+list(doc ID)”)

Count URL Access Frequency

Grep from distributed file system

Sort data on distributed file system

…
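
The inverted index signatures above can be tried out with a small single-process sketch. The helper `build_inverted_index` is hypothetical, and removing duplicate doc IDs in the reducer is an assumption of this sketch rather than something the slide specifies.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_fn(doc_id: int, content: str) -> List[Tuple[str, int]]:
    """Map: <doc ID, content> -> list(<word, doc ID>)."""
    return [(word, doc_id) for word in content.split()]


def reduce_fn(word: str, doc_ids: List[int]) -> Tuple[str, List[int]]:
    """Reduce: <word, list(doc ID)> -> the word with its sorted doc IDs."""
    # De-duplication is an assumption made for this sketch.
    return word, sorted(set(doc_ids))


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    grouped: Dict[str, List[int]] = defaultdict(list)
    for doc_id, content in docs.items():
        for word, emitted_id in map_fn(doc_id, content):
            grouped[word].append(emitted_id)
    return dict(reduce_fn(word, ids) for word, ids in grouped.items())


index = build_inverted_index({1: "cat dog", 2: "fish dog", 3: "cat cat"})
# index == {"cat": [1, 3], "dog": [1, 2], "fish": [2]}
```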

Page 18

Background

Programming Model Definition

Implementation & Refinement

Page 19

Execution: Overview

Page 20

Execution: Overview

Split input into M pieces

Page 21

Execution: Overview

Start copies of the program on the cluster

Master

Worker: Map & Reduce

Page 22

Execution: Overview

Map Worker

Read & Parse

Page 23

Execution: Overview

Map Worker

Invoke Map Function

Page 24

Execution: Overview

Map Worker

Partition & Write
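
The partition step decides which of the R reduce tasks receives each intermediate key; a common default is hash(key) mod R. The sketch below assumes string keys and uses `zlib.crc32` as the hash so that results are stable across runs; the function names are hypothetical.

```python
import zlib
from typing import Dict, List, Tuple


def partition(key: str, num_reduce_tasks: int) -> int:
    """Pick the reduce task for a key as hash(key) mod R."""
    # zlib.crc32 is used because Python's built-in hash() for strings
    # can change between interpreter runs.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def partition_map_output(
    pairs: List[Tuple[str, int]], num_reduce_tasks: int
) -> Dict[int, List[Tuple[str, int]]]:
    """Split one map task's output into R buckets, one per reduce task."""
    buckets: Dict[int, List[Tuple[str, int]]] = {
        r: [] for r in range(num_reduce_tasks)
    }
    for key, value in pairs:
        buckets[partition(key, num_reduce_tasks)].append((key, value))
    return buckets


buckets = partition_map_output(
    [("Cat", 1), ("Dog", 1), ("Cat", 1)], num_reduce_tasks=3
)
```

Every pair with the same key lands in the same bucket, which is what lets a single reduce task see all values for that key.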

Page 25

Execution: Overview

Reduce Worker

Read & Sort
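
After a reduce worker has read its share of the intermediate pairs, sorting by key brings all occurrences of a key together so they can be handed to Reduce as one list. A minimal in-memory sketch follows; the name `sort_and_group` is hypothetical, and it assumes the data fits in memory.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterator, List, Tuple


def sort_and_group(
    pairs: List[Tuple[str, int]]
) -> Iterator[Tuple[str, List[int]]]:
    """Sort intermediate pairs by key, then yield <key, list(values)>."""
    ordered = sorted(pairs, key=itemgetter(0))
    # groupby only merges adjacent items, which is why the sort comes first.
    for key, run in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in run]


# Pairs as a reduce worker might read them from several map workers.
fetched = [("Dog", 1), ("Cat", 1), ("Dog", 1), ("Cat", 1), ("Cat", 1)]
grouped = list(sort_and_group(fetched))
# grouped == [("Cat", [1, 1, 1]), ("Dog", [1, 1])]
```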

Page 26

Execution: Overview

Reduce Worker

Invoke Reduce Function

Page 27

Execution: Overview

Reduce Worker

Write & Rename
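
Writing to a temporary file and renaming it when finished means a reader never sees a half-written output file. The sketch below shows the pattern on a local file system, which here stands in for the distributed file system; `write_output_atomically` is a hypothetical helper.

```python
import os
import tempfile
from typing import List


def write_output_atomically(lines: List[str], final_path: str) -> None:
    """Write reduce output to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # The temporary file lives in the same directory so the rename stays
    # on one file system.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write("\n".join(lines) + "\n")
        # os.replace swaps the file in as a single operation, so readers
        # see either no output or the complete output.
        os.replace(temp_path, final_path)
    except BaseException:
        # Clean up the partial file if anything went wrong before the rename.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

If the same reduce task is executed twice, the second rename simply replaces the first result with identical content.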

Page 28

Master

Schedule

Task status: idle, in-progress, or completed, plus the worker machine (for non-idle tasks)

Worker status: the master pings every worker periodically

Communicate

Locations and sizes of intermediate results, passed from map tasks to reduce tasks

Page 29

Fault Tolerance

Worker failure

Completed and in-progress map tasks on the failed worker are reset to idle, because map output is stored on that worker's local disk

Only in-progress reduce tasks are reset to idle, because completed reduce output is already in the global file system

Master failure

Start a new copy from the last checkpoint, or abort the computation
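
The worker-failure rules can be expressed as a small state update on the master's task table. The `Task` record and `handle_worker_failure` function below are hypothetical names for an illustrative sketch; the real bookkeeping is not shown in the slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # set only for non-idle tasks


def handle_worker_failure(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Reset the tasks that must be re-executed after a worker fails."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Map output lives on the failed machine, so completed maps are lost.
        lost_map = task.kind == "map" and task.state in (
            State.IN_PROGRESS,
            State.COMPLETED,
        )
        # Completed reduce output is already stored elsewhere.
        lost_reduce = task.kind == "reduce" and task.state is State.IN_PROGRESS
        if lost_map or lost_reduce:
            task.state = State.IDLE
            task.worker = None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", State.COMPLETED, "w1"),
    Task("map", State.IN_PROGRESS, "w1"),
    Task("reduce", State.COMPLETED, "w1"),
    Task("reduce", State.IN_PROGRESS, "w1"),
    Task("map", State.COMPLETED, "w2"),
]
reset = handle_worker_failure(tasks, "w1")
```

Here three tasks return to idle: both map tasks on `w1` and its in-progress reduce task. The completed reduce task and the task on `w2` are left alone.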

Page 30

Task Granularity

Split input into M pieces -> M map tasks

Divide intermediate results into R pieces -> R reduce tasks

M and R should be much larger than the number of worker machines

Dynamic load balancing

Faster recovery from worker failures

Example: 2,000 worker machines with M = 200,000 and R = 5,000
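
A quick calculation shows what the example numbers mean per machine. This is only arithmetic on the figures from the slide; the variable names are illustrative, and the even spread of tasks across machines is an assumption.

```python
machines = 2_000
M = 200_000   # map tasks
R = 5_000     # reduce tasks

# Average number of tasks each machine handles over the whole job.
map_tasks_per_machine = M / machines        # 100.0
reduce_tasks_per_machine = R / machines     # 2.5

# If one worker fails after finishing all of its map tasks, the lost work
# can be spread over the remaining machines instead of landing on one.
lost_map_tasks = map_tasks_per_machine
extra_per_survivor = lost_map_tasks / (machines - 1)
```

With about 100 small map tasks per machine, a fast machine can simply take more tasks than a slow one, and a failed machine's work amounts to a fraction of one extra task per surviving machine.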

Page 31

Speed Up: Network Bandwidth

Locality

The master assigns a map task to a worker machine that

Contains a replica of the input data, or failing that

Is near a replica of the input data

Combiner Function

Merges intermediate results before they are sent over the network
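
The locality preference can be sketched as a three-step choice. In this illustration "near" is taken to mean "on the same network switch", which is an assumption, as are the function name `pick_worker` and the `switch_of` lookup table.

```python
from typing import Dict, List, Optional, Set


def pick_worker(
    replica_hosts: Set[str],
    idle_workers: List[str],
    switch_of: Dict[str, str],
) -> Optional[str]:
    """Choose a worker for a map task, preferring data locality."""
    if not idle_workers:
        return None
    # 1. A worker that already holds a replica of the input split.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # 2. A worker on the same switch as some replica.
    replica_switches = {
        switch_of[host] for host in replica_hosts if host in switch_of
    }
    for worker in idle_workers:
        if switch_of.get(worker) in replica_switches:
            return worker
    # 3. Otherwise any idle worker; the input is then read over the network.
    return idle_workers[0]


switch_of = {"a": "s1", "b": "s1", "c": "s2", "d": "s2"}
local = pick_worker({"a"}, ["c", "a"], switch_of)     # "a" holds the replica
nearby = pick_worker({"a"}, ["c", "b"], switch_of)    # "b" shares switch s1
```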

Page 32

Combiner Function in Word Count

[Figure: map tasks M1, M2, and M3 send their intermediate pairs to reduce tasks R1, R2, and R3. Without a combiner, M3 sends Cat 1, Cat 1 over the network; with a combiner, M3 merges them locally and sends Cat 2.]
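
For word count, a combiner can pre-sum the counts on the map worker, so fewer pairs cross the network. The sketch below is illustrative; `combine` is a hypothetical name, and it relies on addition being associative and commutative so that partial sums do not change the final result.

```python
from collections import Counter
from typing import List, Tuple


def map_fn(line: str) -> List[Tuple[str, int]]:
    return [(word, 1) for word in line.split()]


def combine(pairs: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Merge pairs with the same key on the map worker before sending."""
    totals: Counter = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


m3_raw = map_fn("Cat Cat")          # [("Cat", 1), ("Cat", 1)]
m3_combined = combine(m3_raw)       # [("Cat", 2)]
```

The reduce function needs no change: it sums whatever counts arrive, whether they are ones or partial totals.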

Page 33

Speed Up: Backup Tasks

Issue: straggler

A machine that takes an unusually long time to complete a task

Solution: backup tasks

Schedule backup executions of the remaining in-progress tasks when the MapReduce operation is close to completion

A task is marked as completed when either the primary or the backup execution finishes
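
The effect of backup tasks can be illustrated with a small deterministic simulation. Everything here is an assumption made for the sketch: the function name, the 0.9 threshold for "close to completion", and the idea that a backup execution takes a fixed `normal_duration`.

```python
from typing import List


def finish_time_with_backups(
    primary_finish: List[float],
    normal_duration: float,
    backup_threshold: float = 0.9,
) -> float:
    """Finish time of a job when late tasks get a backup execution.

    primary_finish[i] is when task i's original execution would finish.
    Once `backup_threshold` of the tasks are done, every task still in
    progress gets a backup that takes `normal_duration`; a task counts
    as completed when either execution finishes.
    """
    ordered = sorted(primary_finish)
    done_before_backup = max(1, int(len(ordered) * backup_threshold))
    backup_start = ordered[done_before_backup - 1]
    backup_finish = backup_start + normal_duration
    return max(min(finish, backup_finish) for finish in ordered)


# Nine tasks finish at t=10; one straggler would finish at t=100.
times = [10.0] * 9 + [100.0]
without_backups = max(times)
with_backups = finish_time_with_backups(times, normal_duration=10.0)
```

In this example the job ends at t=20 instead of t=100, because the backup started at t=10 finishes long before the straggler does.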

Page 34

Other Details

Customization

Partitioning Function, Input & Output Types

Debug & Monitor

Local Execution, Status Information, Counters…

Skipping Bad Records

Page 35

Takeaways

MapReduce is a general programming model that can be used in a wide range of application scenarios

MapReduce hides the implementation details from users and provides simple, flexible interfaces

The implementation of MapReduce is efficient and highly scalable

Page 36

Q & A