tarazu optimizing mapreduce on heterogeneous clusters

Tarazu: Optimizing MapReduce on Heterogeneous Clusters 72130310 임규찬

Upload: rigg

Post on 24-Feb-2016


DESCRIPTION

Tarazu: Optimizing MapReduce on Heterogeneous Clusters. 72130310 임규찬. Contents: Abstract of Paper (Abstract of paper; Reference of paper – LATE), Introduction, Issue with Heterogeneity, Tarazu, Experimental Result. Abstract of Paper: This work studies the optimization of MapReduce in heterogeneous cluster environments.

TRANSCRIPT

Page 1: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Tarazu: Optimizing MapReduce on Heterogeneous Clusters

72130310 임규찬

Page 2: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

1. Abstract of Paper
◦ 1. Abstract of paper
◦ 2. Reference of paper – LATE
2. Introduction
3. Issue with Heterogeneity
4. Tarazu
5. Experimental Result

Contents

Page 3: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

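The paper studies how to optimize MapReduce in heterogeneous cluster environments.
◦ Data-center-scale clusters are adopting heterogeneous hardware for economic reasons.
◦ MapReduce has made big data processing possible on such clusters.

With existing techniques, performance actually got worse.
◦ Prior work based on straggler task management was not effective.

As an example, the paper is compared with "Improving MapReduce Performance in Heterogeneous Environments".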

Abstract of Paper

Page 4: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Optimizing for heterogeneity by controlling straggler tasks
◦ A straggler is a node that is available but performing poorly.
◦ Stragglers can arise for many reasons, such as faulty hardware and misconfiguration.

LATE scheduler proposed
◦ Longest Approximate Time to End
◦ Uses the progress rate of each task
Progress rate = ProgressScore / amount of time the task has been running

Unfortunately, LATE alone is not sufficient to address hardware heterogeneity.

Reference of Paper: Improving MapReduce Performance in Heterogeneous Environments
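The progress-rate idea on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the sample task values are made up for the example, and the time-left estimate `(1 - ProgressScore) / ProgressRate` comes from the LATE paper rather than from the slide itself.

```python
def progress_rate(progress_score: float, elapsed_seconds: float) -> float:
    """Progress rate = ProgressScore / time the task has been running."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return progress_score / elapsed_seconds


def estimated_time_left(progress_score: float, elapsed_seconds: float) -> float:
    """Approximate time to end, assuming the task keeps its current rate."""
    rate = progress_rate(progress_score, elapsed_seconds)
    if rate == 0:
        return float("inf")  # no progress yet, so no finite estimate
    return (1.0 - progress_score) / rate


# Hypothetical running tasks: name -> (ProgressScore in [0, 1], seconds running)
tasks = {"task_a": (0.9, 90.0), "task_b": (0.2, 80.0)}

# The task with the longest approximate time to end is the candidate
# for speculative re-execution.
slowest = max(tasks, key=lambda name: estimated_time_left(*tasks[name]))
```

Here `task_b` is picked, because at its current rate it has far more work left than `task_a`.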

Page 5: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

A Hindi word meaning 'balance'
◦ Designed to pursue balance in MapReduce computation

Introduction - Tarazu(तराजू)

Page 6: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

A framework built to process large volumes of data in parallel in a distributed computing environment
◦ Optimized for homogeneous clusters.

Introduction - MapReduce

Page 7: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

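Computing on systems built from different kinds of cores
◦ GPGPU using CPU/GPU
Aims to improve performance by making the most of the respective strengths of the CPU and the GPU.
Examples include OpenCL, CUDA, and DirectCompute.
Not covered in this paper.
◦ Clustering using high-end/low-end nodes
Optimizes monetary factors such as power and price.
The paper uses 10 Xeon nodes and 80 Atom nodes.

Introduction - Heterogeneous Computing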

Page 8: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Four-phase execution model
◦ Map computation
Produces <Key, Value> tuples
◦ Shuffle
All-Map-to-all-Reduce personalized communication
◦ Sort
Groups all the tuples for the same key
◦ Reduce computation
Processes all the tuples for a key and produces the final output

Issue with Heterogeneity - Background: MapReduce
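The four phases can be illustrated with a single-process word count. This is a toy sketch, not Hadoop code: the function names, the CRC-based partitioner, and the sample input are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def map_phase(documents):
    """Map: emit one <word, 1> tuple per word."""
    return [(word, 1) for doc in documents for word in doc.split()]


def shuffle_phase(tuples, num_reducers):
    """Shuffle: route every tuple to the reducer that owns its key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in tuples:
        owner = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[owner].append((key, value))
    return partitions


def sort_phase(partition):
    """Sort: group all the tuples that share a key."""
    grouped = defaultdict(list)
    for key, value in sorted(partition):
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: process all the tuples for a key and produce the output."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents, num_reducers=2):
    result = {}
    for partition in shuffle_phase(map_phase(documents), num_reducers):
        result.update(reduce_phase(sort_phase(partition)))
    return result


counts = word_count(["a b a", "b c"])
```

In a real cluster the Shuffle step is network communication between Map and Reduce nodes, which is why it matters so much in the slides that follow.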

Page 9: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Dynamic load balancing in MapReduce
◦ Slower nodes get fewer tasks; faster nodes get more tasks

Heterogeneous clusters run slower than homogeneous ones
◦ 20-75% slower for six out of eleven benchmarks.
◦ Heterogeneity can degrade performance

Poor performance is due to two key factors
◦ One is non-intuitive
◦ The other is intuitive

Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters

Page 10: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Factor 1: Non-intuitive
◦ Interaction between load balancing and network traffic

Heterogeneity causes remote tasks
◦ Xeon nodes are fast and Atom nodes are slow, so Xeon nodes steal tasks from Atom nodes
◦ Remote tasks generate network traffic
◦ The network traffic problem is exacerbated when the Shuffle is heavy

Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters

Page 11: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Factor 2: Intuitive
◦ Reduce phase imbalance amplified by heterogeneity

Reduce phase load imbalance
◦ Different node processing speeds lead to a long Reduce finish time

Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters

Page 12: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Issue with Heterogeneity - A Simple(?) analytical model

Map Finish Time (MFT): the time at which the slower of the High/Low systems finishes its Map computation

Number of input data in bisection: data moved because of remote tasks + Shuffle data

Shuffle Finish Time (SFT): the time due to remote tasks, or MFT

Reduce Finish Time (RFT): the time due to remote tasks, or MFT
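The slide's equations did not survive in this transcript, so the following is only a toy reading of the terms listed above, not the paper's model. It assumes that MFT is the later of the two subcluster Map times and that the Shuffle finishes at whichever is later: the time to move the bisection data, or MFT. The parameter `bisection_bandwidth` and all the numbers are hypothetical.

```python
def map_finish_time(high_map_time: float, low_map_time: float) -> float:
    """MFT: the Map phase ends when the slower side finishes."""
    return max(high_map_time, low_map_time)


def bisection_data(remote_task_bytes: float, shuffle_bytes: float) -> float:
    """Data crossing the bisection: remote-task input plus Shuffle data."""
    return remote_task_bytes + shuffle_bytes


def shuffle_finish_time(remote_task_bytes: float, shuffle_bytes: float,
                        bisection_bandwidth: float, mft: float) -> float:
    """SFT under the assumption above: max(transfer time, MFT)."""
    transfer_time = (bisection_data(remote_task_bytes, shuffle_bytes)
                     / bisection_bandwidth)
    return max(transfer_time, mft)


# Hypothetical numbers: seconds for Map times, bytes and bytes/second otherwise.
mft = map_finish_time(high_map_time=120.0, low_map_time=150.0)
sft = shuffle_finish_time(remote_task_bytes=40e9, shuffle_bytes=160e9,
                          bisection_bandwidth=1e9, mft=mft)
```

With these numbers the transfer takes 200 seconds while Map finishes at 150 seconds, so the Shuffle is the critical part and any extra remote-task traffic pushes the finish time out further.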

Page 13: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Two problems in MapReduce
◦ Map-side built-in load balancing results in remote Map tasks
◦ Reduce-side load imbalance across the nodes

Tarazu consists of three components
◦ Communication-Aware Load Balancing of Map computation (CALB)
◦ Communication-Aware Scheduling of Map computation (CAS)
◦ Predictive Load Balancing of Reduce computation (PLB)

Tarazu

Page 14: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Based on a key observation
◦ Because Map computation and Shuffle overlap, either one can be on the critical path

When the Shuffle is critical: 'no-steal mode'
◦ Remote tasks are picked up only after the Shuffle ends
There are no remote Map tasks to compete with the Shuffle
Reduces the I/O processing overhead
Slower nodes perform more work

Tarazu - Communication-Aware Load Balancing of Map computation

Page 15: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

When Map computation is critical: 'task-steal mode'
◦ This case is the concern of CAS.

CALB changes mode using shuffleLag
◦ Uses the monitoring that MapReduce already does for fault tolerance
shuffleLag is the difference, over all nodes, between the number of Map tasks that have completed their computation and the number that have completed their communication
◦ Deciding the source of criticality once is enough, without repeated, dynamic checks.

Tarazu - Communication-Aware Load Balancing of Map computation
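The shuffleLag definition on this slide translates directly into code. The sketch below is a minimal illustration, not Tarazu's implementation: the `MapProgress` record, the `lag_threshold` parameter, and the rule "large lag means the Shuffle is critical" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MapProgress:
    """Per-node counters, as a monitor might report them (hypothetical)."""
    computed: int      # Map tasks that have completed their computation
    communicated: int  # Map tasks that have completed their communication


def shuffle_lag(nodes):
    """Computed minus communicated Map tasks, summed over all nodes."""
    return (sum(node.computed for node in nodes)
            - sum(node.communicated for node in nodes))


def choose_mode(nodes, lag_threshold):
    """Pick the CALB mode once, from a single shuffleLag measurement.

    A large lag means the Shuffle trails the Map computation, so the
    Shuffle is treated as critical and task stealing is turned off.
    """
    return "no-steal" if shuffle_lag(nodes) > lag_threshold else "task-steal"


nodes = [MapProgress(computed=40, communicated=10),
         MapProgress(computed=12, communicated=8)]
mode = choose_mode(nodes, lag_threshold=20)
```

The function is meant to be called once per job, matching the slide's point that the source of criticality does not need repeated, dynamic checks.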

Page 16: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Determines how many remote tasks are needed
◦ Used in CALB's 'task-steal' mode
◦ Used to avoid increasing SFT

To avoid bursts of traffic, CAS spreads out the remote tasks by interleaving them with local tasks

Tarazu - Communication-Aware Scheduling of Map computation
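Interleaving can be sketched as follows. This is one simple way to spread remote tasks among local ones, assumed for illustration; the slide does not say how CAS picks the spacing, and the task names are placeholders.

```python
def interleave_tasks(local_tasks, remote_tasks):
    """Spread remote tasks evenly through the local-task schedule."""
    if not remote_tasks:
        return list(local_tasks)

    # Number of local tasks to run between two consecutive remote tasks.
    gap = max(1, len(local_tasks) // len(remote_tasks))

    schedule = []
    remaining_remote = list(remote_tasks)
    for index, task in enumerate(local_tasks, start=1):
        schedule.append(task)
        if index % gap == 0 and remaining_remote:
            schedule.append(remaining_remote.pop(0))

    # If local tasks ran out first, the leftover remote tasks go last.
    schedule.extend(remaining_remote)
    return schedule


schedule = interleave_tasks(["L1", "L2", "L3", "L4", "L5", "L6"], ["R1", "R2"])
# schedule == ["L1", "L2", "L3", "R1", "L4", "L5", "L6", "R2"]
```

Running the remote tasks back to back at the start or end would fetch all their input over the network at once; spacing them out lets each transfer overlap with local computation.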

Page 17: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

CAS has other benefits
◦ By interleaving remote tasks with local tasks, CAS achieves better overlap between remote task communication and local task computation on both the sender and receiver sides
◦ Remote tasks read input data faster by avoiding bursts

Tarazu - Communication-Aware Scheduling of Map computation

Page 18: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Better load balance in the Reduce phase
◦ Skews the intermediate key distribution
◦ Reduces the max term in RFT

The share given to each Reduce task takes the numbers of fast and slow nodes into account.

Tarazu - Predictive Load Balancing of Reduce computation
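Skewing the intermediate key distribution can be illustrated with a weighted partitioner. This is a sketch under stated assumptions, not Tarazu's PLB: the per-reducer weights (here 3 for a fast node and 1 for a slow node) are hypothetical, and the hash-to-interval mapping is just one way to realise a skewed split.

```python
import bisect
import hashlib


def build_boundaries(weights):
    """Turn per-reducer weights into cumulative shares of the key space."""
    total = sum(weights)
    boundaries = []
    running = 0.0
    for weight in weights:
        running += weight / total
        boundaries.append(running)
    boundaries[-1] = 1.0  # guard against floating-point drift
    return boundaries


def pick_reducer(key, boundaries):
    """Map a key to a point in [0, 1) and find the reducer owning it."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64
    return bisect.bisect_right(boundaries, point)


# Reducer 0 runs on a fast node, reducer 1 on a slow node (assumed weights).
boundaries = build_boundaries([3, 1])
loads = [0, 0]
for i in range(1000):
    loads[pick_reducer(f"key{i}", boundaries)] += 1
```

With these weights the fast node's Reduce task receives roughly three quarters of the keys, so both tasks can finish at about the same time instead of the slow node setting the Reduce finish time.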

Page 19: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Uses a heterogeneous cluster environment
◦ 10 Xeon-based / 80 Atom-based server nodes

Uses Hadoop 0.20.2
Compared against another solution, LATE

Experimental Methodology

Page 20: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

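The heterogeneous approach makes the most of the system's strengths
◦ In the Shuffle-critical case, the results reflect the large number of Atom nodes
◦ In the Map-critical case, the results reflect the performance of the Xeon nodes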

Experimental Result-Performance

Page 21: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Experimental Result-Effect of CALB, CAS and PLB

Page 22: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Experimental Result-Sensitivity to extent of heterogeneity

Page 23: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Experimental Result-Effect of skewed input data dist.

Page 24: Tarazu Optimizing  MapReduce  On Heterogeneous Clusters

Improving MapReduce Performance in Heterogeneous Environments – University of California, Berkeley

https://developers.google.com/appengine/docs/python/dataprocessing/

http://www.cpubenchmark.net/

Reference