Evaluating Task Scheduling in Hadoop-based Cloud Systems

Shengyuan Liu, Jungang Xu, Zongzheng Liu, Xu Liu
University of Chinese Academy of Sciences & Rice University
2013-9-30


TRANSCRIPT

Page 1: Evaluating Task Scheduling in Hadoop-based Cloud Systems (prof.ict.ac.cn/bpoe2013/downloads/ppt/Evaluating Task... · 2013-10-14)

EVALUATING TASK SCHEDULING IN HADOOP-BASED CLOUD SYSTEMS

SHENGYUAN LIU, JUNGANG XU, ZONGZHENG LIU, XU LIU
UNIVERSITY OF CHINESE ACADEMY OF SCIENCES & RICE UNIVERSITY

2013-9-30

Page 2:

OUTLINE

• Background & Motivation
• Hadoop Task Scheduler
• Benchmark & Methodology
• Evaluation
• Conclusions & Future Work

Page 3:

PRIVATE CLOUD

"The NIST Definition of Cloud Computing", National Institute of Standards and Technology. Retrieved 24 July 2011.

• The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.

Page 4:

MOTIVATION

• A private cloud serves multiple users.
• Different task priorities
• Different task types
• Different task data sizes
• Optimizing the performance of private cloud is necessary and urgent
• A challenge for task scheduling!

Page 5:

OUTLINE

• Background & Motivation
• Hadoop Task Scheduler
• Benchmark & Methodology
• Evaluation
• Conclusions & Future Work

Page 6:

HADOOP OVERVIEW

Hadoop

• An open-source software framework for processing a large volume of data on a cluster

Page 7:

HADOOP TASK SCHEDULER

• FIFO
• Naïve Fair Sharing
• Fair Sharing with Delay Scheduling
• Capacity Scheduling
• HOD
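As a rough illustration of how fair sharing with delay scheduling differs from FIFO, the sketch below follows the commonly described skip-count formulation of delay scheduling: jobs are visited in order of fewest running tasks, a job with no node-local input on the free node is skipped, and it may launch a non-local task only after it has been skipped `max_skips` times. The slides do not describe scheduler internals, so `Job`, `assign_task`, and `max_skips` are hypothetical names for this example and not Hadoop APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Job:
    """Hypothetical job record: pending maps task id -> nodes holding its input."""
    name: str
    pending: Dict[str, List[str]] = field(default_factory=dict)
    running: int = 0
    skip_count: int = 0


def assign_task(jobs: List[Job], free_node: str,
                max_skips: int) -> Optional[Tuple[str, str, bool]]:
    """Pick one task for a free slot on free_node.

    Returns (job name, task id, is_local), or None if nothing may launch yet.
    """
    # Fair sharing: serve the job with the fewest running tasks first.
    for job in sorted(jobs, key=lambda j: j.running):
        if not job.pending:
            continue
        local = [t for t, nodes in job.pending.items() if free_node in nodes]
        if local:
            task, is_local = local[0], True
            job.skip_count = 0
        elif job.skip_count >= max_skips:
            # The job has waited long enough: give up locality for this launch.
            task, is_local = next(iter(job.pending)), False
        else:
            # Delay: skip this job and let a later job use the slot.
            job.skip_count += 1
            continue
        del job.pending[task]
        job.running += 1
        return job.name, task, is_local
    return None
```

A FIFO scheduler would instead always launch a task from the oldest job, local or not; the skip counter is what trades a short wait for better data locality.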

Page 8:

OUTLINE

• Background & Motivation
• Hadoop Task Scheduler
• Benchmark & Methodology
• Evaluation
• Conclusions & Future Work

Page 9:

CLOUDRANK-D

• A benchmark presented by ICT of CAS
• A benchmark suite for private cloud
• Helps researchers simulate various multi-user applications in industrial scenarios
• The benchmark provides a set of 13 representative data analysis tools
  • Basic operations
  • Data mining operations
  • Data warehouse operations

Page 10:

DATA SOURCES OF EACH PROGRAM IN CLOUDRANK-D

Application                          Data sources
Sort                                 Automatically generated
Word count                           Automatically generated
Grep                                 Automatically generated
Naive Bayes                          News and Wikipedia
Support vector machine               Scientist search
K-means                              Sougou corpus
Item based collaborative filtering   Ratings on movies
Frequent pattern growth              Retail market basket data
                                     Click-stream data of an on-line news portal
                                     Traffic accident data
                                     Collection of web html document
Hidden Markov model                  Scientist search
Grep select                          Automatically generated table
Ranking select                       Automatically generated table
User visits aggregation              Automatically generated table
User visits-rankings join            Automatically generated table

Page 11:

CONTENT

• Background & Motivation
• Hadoop Task Scheduler
• Benchmark & Methodology
• Evaluation
• Conclusions & Future Work

Page 12:

WORKLOAD DESIGN

[Pie chart: application types in private clouds (Image processing, Text indexing, Log processing, Web crawling, Data mining, Machine learning, Reporting, Data storage). Slice values shown: 2%, 7%, 11%, 15%, 15%, 16%, 17%, 17%.]

Applications in private clouds                                  Applications in CloudRank-D        Percentage
Web crawling, Data mining, Machine learning, Image processing   Naive Bayes, SVM, HMM, IBCF, FPG   35%
Text indexing, Log processing                                   Basic Operations                   31%
Reporting, Data storage                                         Hive                               34%

Page 13:

WORKLOAD DESIGN

Category                    Application                          Jobs
Basic Operations            Sort                                 9
                            Word count                           11
                            Grep                                 11
Data Mining Operations      Naïve Bayes                          6
                            Support vector machine               6
                            K-means                              7
                            Item based collaborative filtering   3
                            Frequent pattern growth              7
                            Hidden Markov model                  6
Data Warehouse Operations   Grep select                          34 (all four combined)
                            Ranking select
                            User visits aggregation
                            User visits-rankings join

Total: 100 jobs

Page 14:

JOB SUBMITTING

• Input data sizes follow the distribution of input data size in Taobao
• Job submission intervals follow an exponential distribution with a mean of 14 seconds (Facebook)
• Jobs are submitted in a random order

Input data size   Percentage
<25MB             40.57%
25MB-625MB        39.33%
1.2GB-5GB         12.03%
>5GB              8.07%
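The submission scheme on this slide can be sketched as follows. This is a minimal illustration and not the authors' tooling: the job counts are taken from the workload design table, inter-arrival times are drawn from an exponential distribution with a 14-second mean, and input size classes are sampled with the listed percentages. Sampling the size class independently of the application is an assumption, as are the function name `build_schedule`, the seed, and the tuple layout.

```python
import random

# Jobs per application, as listed in the workload table (100 jobs in total).
# The 34 data warehouse jobs stay in one group because the slides
# give no per-query breakdown.
JOB_COUNTS = {
    "Sort": 9, "Word count": 11, "Grep": 11,
    "Naive Bayes": 6, "Support vector machine": 6, "K-means": 7,
    "Item based collaborative filtering": 3,
    "Frequent pattern growth": 7, "Hidden Markov model": 6,
    "Data warehouse operations": 34,
}

# Input-size classes and their share of jobs (percent).
SIZE_CLASSES = ["<25MB", "25MB-625MB", "1.2GB-5GB", ">5GB"]
SIZE_WEIGHTS = [40.57, 39.33, 12.03, 8.07]

MEAN_INTERVAL_S = 14.0  # mean of the exponential inter-arrival time


def build_schedule(seed=0):
    """Return a list of (submit_time_s, application, size_class) tuples."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for repeatability
    jobs = [app for app, n in JOB_COUNTS.items() for _ in range(n)]
    rng.shuffle(jobs)  # jobs are submitted in a random order
    schedule, clock = [], 0.0
    for app in jobs:
        # expovariate takes the rate (1 / mean), not the mean itself.
        clock += rng.expovariate(1.0 / MEAN_INTERVAL_S)
        size = rng.choices(SIZE_CLASSES, weights=SIZE_WEIGHTS, k=1)[0]
        schedule.append((clock, app, size))
    return schedule


if __name__ == "__main__":
    for submit_time, app, size in build_schedule()[:5]:
        print(f"{submit_time:8.1f}s  {app:<36} {size}")
```

With a 14-second mean, the 100 submissions span roughly 23 minutes on average, so many jobs overlap and the scheduler's choices determine how they share the cluster.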

Page 15:

TESTBED

• Hadoop cluster with 5 nodes (1 NameNode, 4 DataNodes)

CPU Type            CPU Core
Intel® Xeon E5645   6 cores @ 2.40G

L1 D/I Cache   L2 Cache     L3 Cache   Memory   Disk
6 × 32 KB      6 × 256 KB   12MB       16GB     8TB

OS           Hadoop   Mahout   Hive
CentOS 5.5   1.0.2    0.6      0.11

Page 16:

HADOOP CONFIGURATION

Hadoop Parameter                            Value   Description
mapred.tasktracker.map.tasks.maximum        12      The maximum number of map tasks that will be executed simultaneously by a tasktracker.
mapred.tasktracker.reduce.tasks.maximum     12      The maximum number of reduce tasks that will be executed simultaneously by a tasktracker.
mapred.map.tasks                            48      Maximum number of concurrently running map tasks.
mapred.reduce.tasks                         45      Maximum number of concurrently running reduce tasks.
dfs.replication                             2       The actual number of replications specified when the file is created.
mapreduce.tasktracker.outofband.heartbeat   TRUE    Enables the out-of-band heartbeat.
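The values in this table can be cross-checked with simple arithmetic. With 4 DataNodes and 12 map slots per tasktracker, the cluster offers 48 map slots, which matches the value of `mapred.map.tasks`. The reduce value of 45 is consistent with the common Hadoop 1.x rule of thumb of 0.95 × (nodes × reduce slots per node); the slides do not say how it was chosen, so that reading is an assumption.

```python
DATA_NODES = 4               # from the testbed: 1 NameNode, 4 DataNodes
MAP_SLOTS_PER_NODE = 12      # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS_PER_NODE = 12   # mapred.tasktracker.reduce.tasks.maximum

map_slots = DATA_NODES * MAP_SLOTS_PER_NODE
reduce_slots = DATA_NODES * REDUCE_SLOTS_PER_NODE

# Assumption: the 0.95 factor is the usual Hadoop 1.x guideline,
# not something stated on the slide.
suggested_reduces = int(0.95 * reduce_slots)

print(map_slots)          # 48
print(suggested_reduces)  # 45
```

Keeping the reduce count slightly below the slot total lets all reduces start in a single wave while leaving a little headroom for failed or speculative tasks.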

Page 17:

HADOOP SCHEDULER EVALUATION

• Data Processed per Second
• Turnaround time
• Running time
• Waiting time
• Throughput
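The slides name these metrics without giving formulas. The sketch below uses conventional definitions, which are assumptions here: waiting time is start minus submission, running time is finish minus start, turnaround time is finish minus submission, throughput is jobs completed per minute of total workload time, and Data Processed per Second (DPS) is total input megabytes divided by total workload time in seconds. `JobRecord` and `summarize` are hypothetical names, and the demo numbers are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobRecord:
    submit_s: float   # time the job was submitted
    start_s: float    # time its first task started
    finish_s: float   # time the job completed
    input_mb: float   # size of the job's input data


def summarize(jobs: List[JobRecord]) -> Dict[str, float]:
    """Compute the five evaluation metrics for one scheduler run."""
    n = len(jobs)
    # Total running time of the workload: first submission to last completion.
    makespan_s = max(j.finish_s for j in jobs) - min(j.submit_s for j in jobs)
    return {
        "avg_waiting_s": sum(j.start_s - j.submit_s for j in jobs) / n,
        "avg_running_s": sum(j.finish_s - j.start_s for j in jobs) / n,
        "avg_turnaround_s": sum(j.finish_s - j.submit_s for j in jobs) / n,
        "throughput_jobs_per_min": n / (makespan_s / 60.0),
        "dps_mb_per_s": sum(j.input_mb for j in jobs) / makespan_s,
    }


if __name__ == "__main__":
    # Two made-up jobs, only to exercise the function.
    demo = [JobRecord(0, 10, 70, 600), JobRecord(20, 30, 120, 1200)]
    for name, value in summarize(demo).items():
        print(f"{name}: {value}")
```

Under these definitions, turnaround time is always the sum of waiting time and running time for a given job, which is a useful sanity check on collected logs.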

Page 18:

DATA PROCESSED PER SECOND

[Bar chart: total running time (10^3 s) by task scheduler: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]

The total running time (10^3 sec) of running the full workload with each of the five schedulers.

[Bar chart: DPS (MB/s) by task scheduler: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]

The Data Processed per Second (megabytes processed per second) of running the full workload with each of the five schedulers.

Page 19:

TURNAROUND TIME

[Bar chart: turnaround time (10^3 s) by task scheduler: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]

The average job turnaround time (10^3 sec) of running the full workload with each of the five schedulers.

Page 20:

AVERAGE JOB RUNNING TIME & WAITING TIME

[Bar chart: running time (10^3 s) by task scheduler: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]

The average job running time (10^3 sec) of running the full workload with each of the five schedulers.

[Bar chart: waiting time (sec.) by task scheduler: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]

The average job waiting time (seconds) of running the full workload with each of the five schedulers.

Page 21:

THROUGHPUT

[Bar chart: throughput (jobs/min) by task scheduler: Fair with DS, Naïve Fair, Capacity, FIFO, HOD]

The throughput (number of jobs processed in one minute) of running the full workload with each of the five schedulers.

Page 22:

EVALUATION RESULT ANALYSIS

• The fair scheduler with delay scheduling is the most efficient scheduler
• Some jobs with large input sizes take longer to finish than usual jobs
• The fair with delay scheduling, naïve fair, and capacity schedulers all perform better than the default FIFO scheduler
• The HOD scheduler did not perform very well, affected by the extra cost of virtualization

Page 23:

CONCLUSIONS & FUTURE WORK

• Optimizing the performance of Hadoop clusters is necessary and significant
• The choice of task scheduler is critical for improving the system performance of a Hadoop cluster
• With fair sharing with delay scheduling, DPS improves by 20% over the FIFO scheduler
• Optimization and design of the scheduler need to take the characteristics of the workload into account
• In the future, we will use more complex workloads to study and evaluate more efficient task schedulers for Hadoop-based cloud systems

Page 24:

Q & A

THANKS!

E-MAIL: SOUNDER LIU@163 COM, ...@UCAS.AC.CN