benefits of hadoop as platform as a service

Dublin, 14 April 2016

Benefits of Hadoop as Platform as a Service

Aaron Call

Barcelona Supercomputing Center

www.bsc.es


BSC – Barcelona Supercomputing Center


23 years of research on computer architecture

• European Center for Parallelism of Barcelona (CEPBA)

• Based at the Polytechnical University of Catalonia (UPC)

Led by Mateo Valero

• Seymour Cray Award 2015, first European to win it

• ACM Fellow, Eckert-Mauchly Award in 2007, Google award 2009

Large research staff

• 1000+ publications

BSC – Barcelona Supercomputing Center


Many life sciences computational projects

• Computational Genomics

• Molecular modeling and bioinformatics

• Protein interactions and docking

• In place computational capabilities

• MareNostrum supercomputer

Research activity around Hadoop since 2008

• Data-centric research group:

http://www.bsc.es/computer-sciences/data-centric-computing

• SLA-driven scheduling (adaptive scheduler)

• Project ALOJA

Automated characterization of the cost-effectiveness of Big Data deployments

Seeks to provide knowledge and tools that help users reduce the TCO of their infrastructures

About the project


What is the most effective configuration for my needs?

About the project


In ALOJA we have gathered extensive knowledge of the behavior of on-premise and IaaS Hadoop deployments

60k+ runs

Public repository


What is best for one workload is not best for all

Lessons learnt from IaaS


Disks and network impact

[Chart legend: HDD-IB, SSD-ETH, HDD-ETH, SSD-IB]

Local vs remote disks

[Chart legend: Local only; 1 Remote; 2 Remotes; 3 Remotes; 1 Remote /tmp local; 2 Remote /tmp local; 3 Remote /tmp local]

PaaS Advantages

Provides an automated setup of Big Data services (Hadoop, Spark, Hive, ...)

• Optimized for the underlying hardware

• Removes cost of installation

The service provider is in charge of maintenance

• Reduces TCO

• As with any cloud service, you pay as you go

Platform as a Service


An O'Reilly survey on data science salaries estimated an average salary of 140,000 $US for a data engineer

A cluster of 16 datanodes on HDInsight, using A3 machines, costs for a year:

• (16 datanodes + 2 headnodes) * 0.2384 $US/hr = 4.2912 $US/hr => 4.2912 * 24 * 365 = 37,590.912 $US/year

Hence, under ideal conditions (the service fully replaces the work of one data engineer), we can save up to 140,000 - 37,590.912 = 102,409.088 $US per year
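The arithmetic on this slide can be reproduced with a short script. This is only a sketch: the function and variable names are made up for illustration, and it assumes, as the slide does, a cluster that runs all year and a salary that is saved in full.

```python
HOURS_PER_YEAR = 24 * 365  # cluster assumed to run all year, as on the slide


def yearly_cluster_cost(datanodes, headnodes, price_per_node_hour):
    """Yearly cost in $US of a cluster billed per node and per hour."""
    return (datanodes + headnodes) * price_per_node_hour * HOURS_PER_YEAR


# Figures taken from the slide: 16 datanodes + 2 headnodes at 0.2384 $US/hr.
cluster_cost = yearly_cluster_cost(16, 2, 0.2384)

# Average data engineer salary quoted on the slide.
engineer_salary = 140_000

# Ideal case: the managed service replaces the whole salary.
savings = engineer_salary - cluster_cost

print(f"Cluster cost: {cluster_cost:,.3f} $US/year")
print(f"Savings:      {savings:,.3f} $US/year")
```

Changing the node count or the hourly price shows how quickly the saving shrinks for larger clusters.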

How much spent on maintenance?


Some current solutions

• Azure HDInsight

• Rackspace CBD

• Amazon EMR

• Google Cloud Platform

Platform as a Service


Linux-based clusters of 4, 8 and 16 datanodes

• Azure HDInsight and Rackspace CBD

• Azure IaaS and Rackspace IaaS clusters as well

Clusters of up to 8 cores per node and 64 GB RAM

HDInsight: Azure Storage as HDFS (remote disks)

Rackspace CBD: nodes’ local disks as HDFS

Evaluation environment


Wordcount

• CPU intensive: useful to analyze the scalability of the nodes across VM sizes

Tested workloads


[Chart legend: %user, %system, %steal, %iowait, %nice]

Terasort

• Combined I/O and CPU loads, a de facto benchmark in the community

Tested workloads


Data sizes of 1, 10, 100 and 1000 GB

This is enough to stress the system and observe its overall behavior

[Chart legend: %user, %system, %steal, %iowait, %nice]

Runs repeated several times

Cloud variability (100GB runs)


Benchmark   Provider        Standard deviation (%)
Terasort    HDInsight       60%
Terasort    Rackspace CBD   28%
Wordcount   HDInsight       55%
Wordcount   Rackspace CBD   47%
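The slides do not say how the percentages were computed. One plausible reading is the standard deviation of the repeated execution times expressed as a percentage of their mean. The sketch below shows that calculation under this assumption; the function name and the sample times are hypothetical and are not measurements from the talk.

```python
from statistics import mean, stdev


def relative_std_percent(times):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(times) / mean(times)


# Hypothetical execution times in seconds for one repeated benchmark.
# These are NOT measurements from the talk.
runs = [100.0, 120.0, 80.0, 100.0]

print(f"Relative standard deviation: {relative_std_percent(runs):.1f}%")
```

A value such as 60% means that a single run says little about expected performance, which is why the runs were repeated several times.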

Relevant factors tree


ALOJA-ML is a set of machine learning techniques and tools to estimate the behavior of executions in the unexplored search space

Relevant factors tree: a tool that identifies the parameters that most change an execution's behavior
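To give an intuition for how parameters can be ranked by their effect on execution behavior, here is a minimal sketch. It is not the ALOJA-ML implementation: it swaps in a simple variance-reduction criterion (the one used to pick the first split of a regression tree), and the function names, field names and data are all hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


def variance_reduction(runs, factor, target="exec_time"):
    """Drop in variance of `target` obtained by grouping runs by `factor`."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[factor]].append(run[target])
    total = pvariance([run[target] for run in runs])
    # Weighted average of the variance left inside each group.
    within = sum(len(g) * pvariance(g) for g in groups.values()) / len(runs)
    return total - within


def rank_factors(runs, factors, target="exec_time"):
    """Factors sorted from most to least relevant."""
    return sorted(
        factors,
        key=lambda f: variance_reduction(runs, f, target),
        reverse=True,
    )


# Hypothetical executions, NOT data from the ALOJA repository.
runs = [
    {"io_file_buffer": 65536, "provider": "A", "exec_time": 100.0},
    {"io_file_buffer": 65536, "provider": "B", "exec_time": 110.0},
    {"io_file_buffer": 131072, "provider": "A", "exec_time": 200.0},
    {"io_file_buffer": 131072, "provider": "B", "exec_time": 210.0},
]

print(rank_factors(runs, ["provider", "io_file_buffer"]))
```

Applying the same ranking again inside each group would produce the lower levels of a tree.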



Resulting tree for PaaS executions

[Tree diagram. Branch labels, each followed by the next factor on that branch: IOFileBuffer=131072, Datasize; Benchmark=Terasort, Replication; Benchmark=wordcount, Datanodes; IOFileBuffer=262144, Datasize]



Provider is not a relevant factor



But the data size changes which factor is the next most important

IO File Buffer 10GB


Analysing IO File Buffer (most relevant parameter on the tree)

IO File Buffer 100GB


IO File Buffer 1TB


Whether to use one or the other depends on your application

Replication factor 100GB


Replication factor 1TB


Important, but it does not make a significant difference

Datasize scalability, Terasort

[Chart panel: 4 cores, 15 GB]

Datasize scalability, Terasort

[Chart panels: 4 cores, 7 GB; 8 cores, 14 GB; 4 cores, 15 GB; 8 cores, 30 GB]

Datanodes impact, Wordcount

[Chart panels: 4 cores, 15 GB; 8 cores, 30 GB]

Datanodes impact, Terasort

[Chart panels: 4 cores, 7 GB; 8 cores, 14 GB; 4 cores, 15 GB; 8 cores, 30 GB]

Datanodes impact, Terasort

Diminishing returns

[Chart panels: 8 cores, 14 GB; 4 cores, 15 GB. Cost annotations: $2.87, $2.88]

Cost difference IaaS and PaaS


Provider        VM size              IaaS (US$/h)   PaaS (US$/h)
Azure/HDI       4 CPU, 7 GB RAM      $0.176         $0.32
Azure/HDI       8 CPU, 15 GB RAM     $0.352         $0.64
Rackspace/CBD   4 vCPU, 15 GB RAM    $0.555         $0.7925
Rackspace/CBD   8 vCPU, 30 GB RAM    $1.11          $2.776
Amazon/EMR      4 vCPU, 16 GB RAM    $0.239         $0.299
Amazon/EMR      8 vCPU, 32 GB RAM    $0.479         $0.599

IaaS is cheaper per hour, but it might increase TCO (maintenance is on your own!)
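To make the price gap easier to compare across providers, the following sketch computes how much more the PaaS offering costs per hour than the equivalent IaaS machine. The prices are the ones in the table above; the helper name is made up for illustration, and the result covers hourly price only, not maintenance effort.

```python
# Hourly prices in US$ from the table: (provider, VM size, IaaS, PaaS).
PRICES = [
    ("Azure/HDI", "4 CPU, 7 GB RAM", 0.176, 0.32),
    ("Azure/HDI", "8 CPU, 15 GB RAM", 0.352, 0.64),
    ("Rackspace/CBD", "4 vCPU, 15 GB RAM", 0.555, 0.7925),
    ("Rackspace/CBD", "8 vCPU, 30 GB RAM", 1.11, 2.776),
    ("Amazon/EMR", "4 vCPU, 16 GB RAM", 0.239, 0.299),
    ("Amazon/EMR", "8 vCPU, 32 GB RAM", 0.479, 0.599),
]


def paas_premium_percent(iaas, paas):
    """Extra hourly cost of PaaS relative to IaaS, as a percentage."""
    return 100.0 * (paas - iaas) / iaas


for provider, vm_size, iaas, paas in PRICES:
    premium = paas_premium_percent(iaas, paas)
    print(f"{provider:14} {vm_size:18} +{premium:.0f}%")
```

Whether that premium is worth paying depends on how much maintenance work it removes, which the hourly price alone does not capture.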

Conclusions


The provider is not a significant factor

In the public cloud, large data sizes or large clusters introduce problems

• A larger cluster may improve performance but be more expensive in the end

PaaS allows you to save on maintenance

• But you still have to take care of some tuning

• Not as much as on IaaS

• Whether it is cheaper than IaaS depends on your business

Thank you!

For further information please contact

[email protected]

www.bsc.es
