MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

To appear in OSDI 2004 (Operating Systems Design and Implementation)

An important programming model for large-scale data-parallel applications

Introduction

Motivation

- Parallel applications: widely used, special-purpose applications

- Common functionality: parallelize computation, distribute data, handle failures

- Large-scale (big data) data processing

MapReduce?

- Programming model: parallel, generic, scalable

- Data: map of (key, value) pairs

- Implementation: commodity clusters of commodity PCs

# map(key, val) is run on each item in the input set and emits new-key / new-val pairs

# reduce(key, vals) is run once for each unique key emitted by map() and emits the final output
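The two functions above can be illustrated with a word count. This is a minimal single-process sketch, not the actual C++ library: the names `map_fn`, `reduce_fn`, and `run` are hypothetical, and the grouping step stands in for what the MapReduce runtime does between the two phases.

```python
from collections import defaultdict

def map_fn(key, val):
    """key: document name, val: document text. Emits (word, 1) pairs."""
    for word in val.split():
        yield word, 1

def reduce_fn(key, vals):
    """key: a word, vals: every count emitted for that word."""
    yield key, sum(vals)

def run(inputs):
    # Group the intermediate values by key, as the runtime would.
    groups = defaultdict(list)
    for key, val in inputs:
        for new_key, new_val in map_fn(key, val):
            groups[new_key].append(new_val)
    # Call reduce exactly once per unique intermediate key.
    output = {}
    for key in sorted(groups):
        for out_key, out_val in reduce_fn(key, groups[key]):
            output[out_key] = out_val
    return output

counts = run([("a.txt", "the quick fox"), ("b.txt", "the lazy dog")])
```

Here `counts` maps each word to the number of times it occurs across both documents.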

MapReduce?

# User-defined functions

Example

# Distributed Grep (Global / Regular Expression / Print)

# Count of URL Access Frequency (logs of web page requests): map emits <URL, 1>; reduce emits <URL, total count>

# Reverse Web-Link Graph: map emits <target, source> for each link to a target URL found in a source page; reduce emits <target, list(source)>

# Inverted Index: map emits <word, document ID>; reduce emits <word, list(document ID)>

# Distributed Sort: map emits <key, record>; reduce emits all pairs unchanged

# Term-Vector per Host (a term vector is a list of <word, frequency> pairs): map emits <hostname, term vector>; reduce throws away infrequent terms and emits a final <hostname, term vector>
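Two of the examples above, URL access frequency and the inverted index, can be sketched with the same tiny driver. All function names are hypothetical, and the in-memory `map_reduce` helper is only an assumption used to make the example runnable on one machine.

```python
from collections import defaultdict

def map_url_count(line_no, url):
    yield url, 1                      # <URL, 1>

def reduce_url_count(url, ones):
    yield url, sum(ones)              # <URL, total count>

def map_inverted_index(doc_id, text):
    for word in set(text.split()):
        yield word, doc_id            # <word, document ID>

def reduce_inverted_index(word, doc_ids):
    yield word, sorted(doc_ids)       # <word, list(document ID)>

def map_reduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the runtime: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, val in records:
        for k, v in map_fn(key, val):
            groups[k].append(v)
    return dict(pair for k in groups for pair in reduce_fn(k, groups[k]))

url_counts = map_reduce([(0, "/a"), (1, "/b"), (2, "/a")],
                        map_url_count, reduce_url_count)
index = map_reduce([(1, "map reduce"), (2, "reduce sort"), (3, "map sort grep")],
                   map_inverted_index, reduce_inverted_index)
```

Only the two user-defined functions change between the jobs; the driver stays the same, which is the point of the programming model.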

Example

Execution overview

Typical cluster

# Machines: typically 100s or 1000s of dual-processor x86 machines running Linux, with 2-4 GB of memory

# Network: 100 megabits/second or 1 gigabit/second

# Storage: local IDE disks

# GFS: a distributed file system that manages the data

# Job scheduling system - jobs are made up of tasks - the scheduler assigns tasks to machines

# Language: C++ library linked into user programs

Distributed-1?

#1 - Split the input files into M pieces, typically 16 MB to 64 MB each (controllable by the user via an optional parameter) - Start up many copies of the program on a cluster of machines

#2 - Master (1): one of the copies of the program is special - Workers (n): the rest are assigned work by the master - There are M map tasks and R reduce tasks to assign

#3 - A map worker reads the contents of its input split - It parses key/value pairs out of the input and passes each pair to the user-defined Map function - The intermediate pairs are buffered in memory

#4 Map workers - Periodically, the buffered pairs are written to local disk - The locations of these pairs on the local disk are passed back to the master - The master is responsible for forwarding these locations to the reduce workers

#5 Reduce workers - Each reduce worker uses remote procedure calls to read the buffered data from the local disks of the map workers - It then sorts the data by intermediate key so that all occurrences of the same key are grouped together

#6 - The reduce worker iterates over the intermediate data and, for each unique intermediate key encountered, passes the key and its values to the user-defined Reduce function - The output of the Reduce function is appended to a final output file for this reduce partition

Distributed-2?

#7 - When all map tasks and reduce tasks have been completed - the master wakes up the user program - the MapReduce call in the user program returns to the user code

#8 - After successful completion - the output is available in R output files (one per reduce task, with file names as specified by the user)
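The numbered steps can be traced in a small single-process simulation. This is a sketch under stated assumptions, not the real implementation: lists stand in for local disks and output files, the round-robin split and the CRC32-based partitioning function are choices made here for determinism, and every function name is hypothetical.

```python
import zlib
from itertools import groupby

def split_input(lines, m):
    """Split the input into M pieces (here: round-robin over lines)."""
    return [lines[i::m] for i in range(m)]

def partition(key, r):
    """Assumed partitioning function: a stable hash of the key, mod R."""
    return zlib.crc32(key.encode("utf-8")) % r

def run_map_task(split, map_fn, r):
    """Apply the user's map function and bucket the pairs into R regions."""
    regions = [[] for _ in range(r)]
    for line in split:
        for key, val in map_fn(line):
            regions[partition(key, r)].append((key, val))
    return regions

def run_reduce_task(region_idx, all_map_outputs, reduce_fn):
    """Read one region from every map task, sort by key, reduce each key."""
    pairs = [p for regions in all_map_outputs for p in regions[region_idx]]
    pairs.sort(key=lambda kv: kv[0])
    out = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        out.append((key, reduce_fn(key, [v for _, v in group])))
    return out

def map_reduce(lines, map_fn, reduce_fn, m=3, r=2):
    splits = split_input(lines, m)
    map_outputs = [run_map_task(s, map_fn, r) for s in splits]
    # One output list per reduce partition, standing in for R output files.
    return [run_reduce_task(i, map_outputs, reduce_fn) for i in range(r)]

def word_map(line):
    return ((w, 1) for w in line.split())

def word_reduce(key, vals):
    return sum(vals)

outputs = map_reduce(["a b a", "b c", "a c c"], word_map, word_reduce)
merged = dict(kv for part in outputs for kv in part)
```

Because the partitioning function sends every occurrence of a key to the same region, each key appears in exactly one of the R outputs, so `merged` is a plain union of the partitions.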

Master Data Structures

# Status (per map task and reduce task)

idle, in-progress, completed
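A minimal sketch of how the master might track these states. The class and method names are hypothetical, and storing the worker identity for non-idle tasks is an assumption taken from the paper's description rather than from this slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                         # "map" or "reduce"
    task_id: int
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None      # set only for non-idle tasks

class Master:
    def __init__(self, m, r):
        self.tasks = ([Task("map", i) for i in range(m)]
                      + [Task("reduce", i) for i in range(r)])

    def assign(self, worker):
        """Hand the first idle task to a worker; None if nothing is idle."""
        for task in self.tasks:
            if task.state is TaskState.IDLE:
                task.state, task.worker = TaskState.IN_PROGRESS, worker
                return task
        return None

    def complete(self, task):
        task.state = TaskState.COMPLETED

master = Master(m=2, r=1)
first = master.assign("worker-1")
master.complete(first)
```

The sketch ignores ordering constraints such as reduce tasks needing map output; it only shows the state transitions idle, in-progress, completed.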

Fault Tolerance

# Worker Failure - The master pings every worker periodically - If no response arrives from a worker within a certain amount of time, the master marks that worker as failed and reschedules its tasks - MapReduce is resilient to large-scale worker failures

# Master Failure - The current implementation aborts the MapReduce computation if the master fails - It is easy to make the master write periodic checkpoints of the master data structures described above - If the master task dies, a new copy can be started from the last checkpointed state - Clients can check for this condition and retry the MapReduce operation if they desire

# Semantics in the Presence of Failures
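The worker-failure handling can be sketched as a timeout on ping replies. The `FailureDetector` class, the dictionary layout of a task, and the 10-second timeout are all assumptions for illustration. The rescheduling rule follows the paper's description: in-progress tasks and completed map tasks on a failed worker go back to idle (map output lives on the failed machine's local disk), while completed reduce tasks are left alone because their output is already in the global file system.

```python
import time

class FailureDetector:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}           # worker -> time of its last ping reply

    def record_reply(self, worker, now=None):
        self.last_seen[worker] = time.monotonic() if now is None else now

    def failed_workers(self, now=None):
        now = time.monotonic() if now is None else now
        return [w for w, t in self.last_seen.items()
                if now - t > self.timeout_s]

def reschedule(tasks, failed):
    """Reset to idle every task that must be re-executed after a failure."""
    for task in tasks:
        lost = task["state"] == "in-progress" or task["kind"] == "map"
        if task["worker"] in failed and lost:
            task["state"], task["worker"] = "idle", None

detector = FailureDetector(timeout_s=10)
detector.record_reply("w1", now=0)
detector.record_reply("w2", now=0)
detector.record_reply("w2", now=12)       # w2 keeps answering, w1 goes silent
failed = detector.failed_workers(now=15)

tasks = [
    {"kind": "map", "state": "completed", "worker": "w1"},
    {"kind": "reduce", "state": "completed", "worker": "w1"},
    {"kind": "reduce", "state": "in-progress", "worker": "w2"},
]
reschedule(tasks, failed)
```

Passing `now` explicitly keeps the example deterministic; a real master would rely on the clock.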

Locality

# GFS storage conserves network bandwidth: GFS divides each file into 64 MB blocks and stores several copies of each block (typically 3 copies) on different machines.

# When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
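One way to picture locality-aware scheduling is a function that prefers an idle worker already holding a replica of the input block. This is a simplified sketch: `pick_worker` and the machine names are hypothetical, and the fallback to any idle worker is an assumption made here.

```python
def pick_worker(block_replicas, idle_workers):
    """Return (worker, is_local), preferring a worker that stores the block."""
    for worker in idle_workers:
        if worker in block_replicas:
            return worker, True       # input is read from the local disk
    if idle_workers:
        return idle_workers[0], False # input must cross the network
    return None, False

replicas = {"m1", "m4", "m7"}         # the (typically 3) machines with a copy
local_choice = pick_worker(replicas, ["m2", "m4", "m9"])
remote_choice = pick_worker(replicas, ["m2", "m3"])
```

With three replicas per block and many idle workers, a local choice is usually available, which is why most input is read without using the network.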

Task Granularity

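# Ideally, the number of map tasks (M) and reduce tasks (R) is much larger than the number of worker machines - improves dynamic load balancing - speeds up recovery when a worker fails

# Master - makes O(M + R) scheduling decisions - keeps O(M * R) state in memory - so there are practical bounds on how large M and R can be - the O(M * R) state consists of approximately one byte per map task/reduce task pair

# R is often constrained by users, because the output of each reduce task ends up in a separate output file

# MapReduce computations are often run with M = 200,000 and R = 5,000, using 2,000 worker machines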
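A quick back-of-the-envelope calculation for that configuration. It assumes roughly one byte of master state per map task/reduce task pair, which is the figure given in the MapReduce paper; the variable names are illustrative only.

```python
M, R, WORKERS = 200_000, 5_000, 2_000

scheduling_decisions = M + R               # one decision per task
state_bytes = M * R * 1                    # ~1 byte per (map, reduce) pair
map_tasks_per_worker = M / WORKERS
reduce_tasks_per_worker = R / WORKERS

print(f"scheduling decisions: {scheduling_decisions:,}")
print(f"master state: about {state_bytes / 1e9:.1f} GB")
print(f"map tasks per worker: {map_tasks_per_worker:.0f}")
print(f"reduce tasks per worker: {reduce_tasks_per_worker:.1f}")
```

Each worker handles about 100 map tasks on average, so a failed worker's tasks can be spread across many other machines.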

Backup Tasks

# "Straggler": a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation

# When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.

# The task is marked as completed whenever either the primary or the backup execution completes.
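The "whichever finishes first" rule can be sketched with two threads racing on the same task. This uses Python's standard `concurrent.futures` module as a stand-in for separate machines; `run_task`, `run_with_backup`, and the sleep durations are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_task(name, delay_s):
    """Simulated task execution; the delay models a slow or fast machine."""
    time.sleep(delay_s)
    return name

def run_with_backup(primary_delay_s, backup_delay_s):
    """Run a primary and a backup execution; return whichever finishes first."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(run_task, "primary", primary_delay_s)
        backup = pool.submit(run_task, "backup", backup_delay_s)
        done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
        # The task counts as completed as soon as either execution is done.
        return next(iter(done)).result()

# The primary is the straggler here, so the backup execution wins.
winner = run_with_backup(primary_delay_s=0.3, backup_delay_s=0.01)
```

In this sketch the executor still waits for the slower thread when the `with` block exits; a real scheduler would simply ignore or cancel the losing execution.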

Combiner Function

[Figure: Master, MapTask and ReduceTask workers on nodes N1, N2, and N3; labels: Network Traffic, CPU Performance]

Status Information

# The master runs an internal HTTP server and exports a set of status pages for human consumption

# The status pages show how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, and processing rates

# The user can use this data to predict how long the computation will take
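As a rough illustration of such a prediction, the remaining time can be extrapolated from the task counts on the status page. The function name and the constant-rate assumption are ours; real jobs slow down or speed up, so this is only a first estimate.

```python
def estimate_remaining_seconds(completed, total, elapsed_s):
    """Extrapolate remaining time, assuming tasks keep finishing at the same rate."""
    if completed == 0:
        return None                   # no rate information yet
    return elapsed_s * (total - completed) / completed

# Hypothetical reading: 50,000 of 200,000 map tasks done after 5 minutes.
remaining = estimate_remaining_seconds(completed=50_000, total=200_000, elapsed_s=300)
```

With those numbers the estimate is another 900 seconds, three times the elapsed time, because three quarters of the tasks are still outstanding.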

Conclusions

Reasons for success:

# First, the model is easy to use, even for programmers without experience with parallel and distributed systems.

# Second, a large variety of problems are easily expressible as MapReduce computations.

# Third, we have developed an implementation of MapReduce that scales to large clusters comprising thousands of machines.

Lessons learned:

# First, restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant.

# Second, network bandwidth is a scarce resource.

# Third, redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.
