Tarazu: Optimizing MapReduce on Heterogeneous Clusters
72130310 임규찬
1. Abstract of Paper
   1. Abstract of paper
   2. Reference of paper: LATE
2. Introduction
3. Issue with Heterogeneity
4. Tarazu
5. Experimental Result
Contents
The paper studies how to optimize MapReduce in heterogeneous cluster environments.
◦ Data-center-scale clusters are adopting heterogeneity for economic reasons.
◦ MapReduce makes big-data processing possible.
With existing techniques, performance actually got worse.
◦ Prior work based on managing straggler tasks is not effective.
As an example, the paper is compared with "Improving MapReduce Performance in Heterogeneous Environments".
Abstract of Paper
Optimizing for heterogeneity by controlling straggler tasks
◦ Straggler: a node that is available but performing poorly
◦ Can arise for many reasons, such as faulty hardware and misconfiguration
Proposes the LATE scheduler
◦ Longest Approximate Time to End
◦ Uses a per-task progress rate: ProgressRate = ProgressScore / T, where T is the amount of time the task has been running
Unfortunately, LATE alone is not sufficient to address hardware heterogeneity.
Reference of Paper: Improving MapReduce Performance in Heterogeneous Environments
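The progress-rate heuristic can be sketched in a few lines of Python. The slide gives only the ProgressScore / T fraction; the time-left estimate, (1 - ProgressScore) / ProgressRate, follows the LATE paper and is not on the slide. The function names and the sample task table are illustrative assumptions, and the real scheduler applies further thresholds that are omitted here.

```python
def progress_rate(progress_score, elapsed_seconds):
    """ProgressScore divided by the time the task has been running."""
    return progress_score / elapsed_seconds


def estimated_time_left(progress_score, elapsed_seconds):
    """Remaining work divided by the observed rate (assumes progress_score > 0)."""
    return (1.0 - progress_score) / progress_rate(progress_score, elapsed_seconds)


def pick_speculation_candidate(tasks):
    """Return the task name with the longest approximate time to end.

    tasks maps a (hypothetical) task name to (progress_score, elapsed_seconds).
    """
    return max(tasks, key=lambda name: estimated_time_left(*tasks[name]))


# Hypothetical running tasks: both started 10 s ago, t2 has made less progress.
running = {"t1": (0.5, 10.0), "t2": (0.25, 10.0)}
candidate = pick_speculation_candidate(running)
```

Here `t2` is the candidate because, at its current rate, it needs about three times as long to finish as `t1`.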
A Hindi word meaning "balance"
◦ Designed to pursue balance in MapReduce computation
Introduction - Tarazu(तराजू)
A framework built to process large volumes of data in parallel in a distributed computing environment
◦ Optimized for homogeneous clusters
Introduction - MapReduce
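Computing with systems built from different kinds of cores
◦ GPGPU using CPUs and GPUs: aims to improve performance by maximizing the respective strengths of the CPU and the GPU. Examples include OpenCL, CUDA, and DirectCompute. Not covered in this paper.
◦ Clustering with high-end and low-end nodes: optimizes for cost factors such as power and price. This paper uses 10 Xeon nodes and 80 Atom nodes.
Introduction - Heterogeneous Computing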
Four-phase execution model
◦ Map computation: produces <Key, Value> tuples
◦ Shuffle: all-Map-to-all-Reduce personalized communication
◦ Sort: groups all the tuples for the same key
◦ Reduce computation: processes all the tuples for a key and produces the final output
Issue with Heterogeneity - Background: MapReduce
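The four phases can be illustrated with a minimal single-process word count in Python. This is only a sketch of the execution model, not Hadoop code: the function names, the byte-sum partitioner, and the two-reducer default are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: each input line (one Map task) emits <word, 1> tuples."""
    return [[(word, 1) for word in line.split()] for line in lines]


def shuffle(map_outputs, n_reducers):
    """Shuffle: every Map output is partitioned across every Reduce task."""
    buckets = [[] for _ in range(n_reducers)]
    for output in map_outputs:
        for key, value in output:
            # Deterministic toy partitioner (assumption): byte sum modulo reducers.
            buckets[sum(key.encode()) % n_reducers].append((key, value))
    return buckets


def sort_phase(bucket):
    """Sort: group all the tuples that share a key."""
    groups = defaultdict(list)
    for key, value in sorted(bucket):
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: process all the tuples for a key and produce the final output."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, n_reducers=2):
    result = {}
    for bucket in shuffle(map_phase(lines), n_reducers):
        result.update(reduce_phase(sort_phase(bucket)))
    return result


counts = word_count(["a b a", "b c"])
```

Each key lands in exactly one bucket, so merging the per-reducer results never overwrites a count.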
Dynamic load balancing in MapReduce
◦ Slower nodes get fewer tasks; faster nodes get more tasks
The heterogeneous cluster is slower than a homogeneous one
◦ 20-75% slower for six out of eleven benchmarks
◦ Heterogeneity can degrade performance
Poor performance is due to two key factors
◦ One non-intuitive
◦ The other intuitive
Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters
Factor 1: non-intuitive
◦ Interaction between load balancing and network traffic
Heterogeneity causes remote tasks
◦ Xeon nodes are fast and Atom nodes are slow, so the Xeons steal Atom tasks
◦ Remote tasks generate network traffic
◦ The network traffic problem is exacerbated when the Shuffle is heavy
Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters
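A toy simulation makes the stealing effect concrete. The sketch below is an assumption-laden model, not Hadoop's scheduler: each node runs its own local tasks first, an idle node steals from whichever node has the most unstarted tasks, and the per-task times (1 s on the Xeon, 4 s on the Atom) are made-up numbers.

```python
import heapq


def simulate(task_seconds, local_tasks):
    """Return (remote tasks run per node, Map finish time).

    task_seconds: node -> seconds per Map task (hypothetical values).
    local_tasks:  node -> number of Map tasks whose input is stored on that node.
    """
    remaining = dict(local_tasks)
    remote = {node: 0 for node in task_seconds}
    idle = [(0.0, node) for node in sorted(task_seconds)]
    heapq.heapify(idle)
    finish = 0.0
    while idle:
        now, node = heapq.heappop(idle)
        if remaining[node] > 0:
            source = node  # run a local task
        else:
            # Steal from the node with the largest backlog (assumed policy).
            source = max(remaining, key=remaining.get)
            if remaining[source] == 0:
                continue  # nothing left anywhere; this node is done
            remote[node] += 1  # the input must cross the network
        remaining[source] -= 1
        done = now + task_seconds[node]
        finish = max(finish, done)
        heapq.heappush(idle, (done, node))
    return remote, finish


remote_tasks, map_finish = simulate({"xeon": 1.0, "atom": 4.0},
                                    {"xeon": 4, "atom": 4})
```

In this run the Xeon finishes its four local tasks and then steals two of the Atom's, so two task inputs travel over the network while Shuffle traffic may be in flight.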
Factor 2: intuitive
◦ Reduce-phase imbalance amplified by heterogeneity
Reduce-phase load imbalance
◦ Differences in processing speed make Reduce tasks on slower nodes take a long time
Issue with Heterogeneity - Reasons for poor performance on heterogeneous clusters
Issue with Heterogeneity - A Simple(?) analytical model
Map Finish Time (MFT): the later of the Map completion times on the high-end and low-end systems
Amount of input data crossing the bisection: data moved for remote tasks plus Shuffle data
Shuffle Finish Time (SFT): either the time due to remote tasks or MFT
Reduce Finish Time (RFT): either the time due to remote tasks or MFT
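The slide's equations did not survive in the transcript, so the following Python sketch is only one reading of the definitions above. It assumes that "either ... or" means the larger of the two values and that the bisection-traffic time is the data volume divided by a bisection bandwidth; `bisection_bandwidth` and all sample numbers are hypothetical. RFT is left out because the transcript does not give enough detail to reconstruct it.

```python
def map_finish_time(map_time_high, map_time_low):
    """MFT: the later of the two Map completion times."""
    return max(map_time_high, map_time_low)


def bisection_data(remote_task_data, shuffle_data):
    """Data crossing the bisection: remote-task input plus Shuffle data."""
    return remote_task_data + shuffle_data


def shuffle_finish_time(mft, remote_task_data, shuffle_data, bisection_bandwidth):
    """SFT, assuming it is the larger of MFT and the bisection-traffic time."""
    traffic_time = bisection_data(remote_task_data, shuffle_data) / bisection_bandwidth
    return max(mft, traffic_time)


mft = map_finish_time(100.0, 120.0)
sft_network_bound = shuffle_finish_time(mft, 2000.0, 8000.0, 50.0)
sft_map_bound = shuffle_finish_time(mft, 0.0, 2000.0, 50.0)
```

With heavy traffic the Shuffle finishes at 200 time units, well after the Map phase; with no remote tasks and a light Shuffle it finishes together with the Map phase at 120.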
Two problems in MapReduce
◦ Map-side built-in load balancing results in remote Map tasks
◦ Reduce-side load imbalance across the nodes
Tarazu consists of three components
◦ Communication-Aware Load Balancing of Map computation (CALB)
◦ Communication-Aware Scheduling of Map computation (CAS)
◦ Predictive Load Balancing of Reduce computation (PLB)
Tarazu
Based on a key observation
◦ Map computation and Shuffle overlap
When the Shuffle is critical: "no-steal mode"
◦ Pick up remote tasks only once the Shuffle ends
No remote Map tasks compete with the Shuffle
Reduces the I/O processing overhead
Slower nodes perform more work
Tarazu - Communication-Aware Load Balancing of Map computation
When Map computation is critical: "task-steal mode"
◦ Handled by CAS
CALB switches modes using shuffleLag
◦ Uses the monitoring that MapReduce already performs for fault tolerance
shuffleLag: the difference between the number of Map tasks that have completed their computation and the number that have completed their communication, across all nodes
◦ Deciding the source of criticality once is enough; no repeated, dynamic checks are needed
Tarazu - Communication-Aware Load Balancing of Map computation
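The mode decision can be sketched as follows. The slides define shuffleLag but do not say how it is turned into a decision, so the fixed `lag_threshold` below is purely a hypothetical stand-in, as are the function names and the sample counts.

```python
def shuffle_lag(maps_computed, maps_communicated):
    """Map tasks whose computation is done minus those whose communication is done."""
    return maps_computed - maps_communicated


def calb_mode(maps_computed, maps_communicated, lag_threshold):
    """Pick the CALB mode once, from cluster-wide counters.

    lag_threshold is a hypothetical tuning knob, not a value from the paper.
    """
    lag = shuffle_lag(maps_computed, maps_communicated)
    # A large lag suggests the Shuffle is the bottleneck, so avoid stealing.
    return "no-steal" if lag > lag_threshold else "task-steal"


shuffle_critical = calb_mode(maps_computed=40, maps_communicated=10, lag_threshold=8)
map_critical = calb_mode(maps_computed=40, maps_communicated=38, lag_threshold=8)
```

Because the source of criticality is decided once, a function like this would be called a single time per job rather than on every scheduling step.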
Determines how many remote tasks are needed
◦ Used in CALB's "task-steal" mode
◦ Used to avoid increasing SFT
To avoid traffic bursts, CAS spreads out the remote tasks by interleaving them with local tasks
Tarazu - Communication-Aware Scheduling of Map computation
CAS has other benefits
◦ By interleaving remote tasks with local tasks, CAS achieves better overlap between remote-task communication and local-task computation on both the sender and receiver sides
◦ Remote tasks read input data faster by avoiding bursts
Tarazu - Communication-Aware Scheduling of Map computation
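Interleaving can be illustrated with a simple even-spacing policy. This is a sketch of the idea only, not necessarily how CAS orders tasks: the `interleave` function and the task labels are assumptions, and the number of remote tasks is taken as given.

```python
def interleave(local_tasks, remote_tasks):
    """Spread remote tasks evenly among local tasks instead of running them in a burst."""
    schedule = []
    n_local, n_remote = len(local_tasks), len(remote_tasks)
    next_local = 0
    for i, remote in enumerate(remote_tasks):
        # Run an even share of the local tasks before each remote task.
        upto = (i + 1) * n_local // (n_remote + 1)
        schedule.extend(local_tasks[next_local:upto])
        next_local = upto
        schedule.append(remote)
    schedule.extend(local_tasks[next_local:])
    return schedule


order = interleave(["L0", "L1", "L2", "L3", "L4", "L5"], ["R0", "R1"])
```

With six local and two remote tasks, each remote input fetch is separated by local computation, which is what lets communication overlap with computation.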
Better load balance in the Reduce phase
◦ Skews the intermediate key distribution
◦ Reduces the max term in RFT
Takes the number of fast and slow nodes into account for each Reduce task.
Tarazu - Predictive Load Balancing of Reduce computation
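One way to picture the skew is to give each node a share of the intermediate keys proportional to its speed. The sketch below assumes a single hypothetical speed ratio (`speedup`, a fast node being 3x a slow node) and one Reduce task per node; the paper's actual prediction method is not described on the slides.

```python
from fractions import Fraction


def plb_shares(n_fast, n_slow, speedup):
    """Per-node key share for fast and slow nodes, proportional to speed."""
    total_weight = n_fast * speedup + n_slow
    return Fraction(speedup, total_weight), Fraction(1, total_weight)


def reduce_finish(fast_share, slow_share, speedup):
    """Finish time is the max over node types of share divided by speed."""
    return max(fast_share / speedup, slow_share / 1)


# Cluster shape from the slides: 10 Xeon (fast) and 80 Atom (slow) nodes.
even = Fraction(1, 90)
even_finish = reduce_finish(even, even, 3)
fast_share, slow_share = plb_shares(10, 80, 3)
skewed_finish = reduce_finish(fast_share, slow_share, 3)
```

With an even split the slow nodes set the finish time at 1/90 of the total work; with the skewed split both node types finish together at 1/110, which is the smaller max term.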
Heterogeneous cluster environment
◦ 10 Xeon-based and 80 Atom-based server nodes
Uses Hadoop 0.20.2
Compared against another solution, LATE
Experimental Methodology
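Maximizes the system's strengths through heterogeneity-aware techniques
◦ In Shuffle-critical cases, the results reflect the sheer number of Atom nodes
◦ In Map-critical cases, the results reflect the performance of the Xeon nodes
Experimental Result - Performance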
Experimental Result - Effect of CALB, CAS and PLB
Experimental Result - Sensitivity to extent of heterogeneity
Experimental Result - Effect of skewed input data distribution
Improving MapReduce Performance in Heterogeneous Environments, University of California, Berkeley
https://developers.google.com/appengine/docs/python/dataprocessing/
http://www.cpubenchmark.net/
Reference