Dublin, 14 April 2016
Benefits of Hadoop as Platform as a Service
Aaron Call
Barcelona Supercomputing Center
www.bsc.es
BSC – Barcelona Supercomputing Center
23 years of research on computer architecture
• European Center for Parallelism of Barcelona (CEPBA)
• Based at the Polytechnic University of Catalonia (UPC)
Led by Mateo Valero
• Seymour Cray Award 2015, the first European to win it
• ACM Fellow, Eckert-Mauchly Award in 2007, Google award 2009
Large research staff
• 1000+ publications
Many life sciences computational projects
• Computational Genomics
• Molecular modeling and bioinformatics
• Protein interactions and docking
• In place computational capabilities
• Mare Nostrum supercomputer
Research activity around Hadoop since 2008
• Data-centric research group:
http://www.bsc.es/computer-sciences/data-centric-computing
• SLA-driven scheduling (adaptive scheduler)
• Project ALOJA
ALOJA
Automated characterization of the cost-effectiveness of Big Data deployments
Seeks to provide knowledge and tools that help users reduce the TCO of their infrastructures
About the project
What is the most effective configuration for my needs?
In ALOJA we have acquired extensive knowledge of the behavior of on-premise and IaaS Hadoop deployments
60k+ runs
Public repository
What is best for one workload is not best for all
Lessons learnt from IaaS
Disks and network impact
[Chart; configurations: HDD-IB, SSD-ETH, HDD-ETH, SSD-IB]
Local vs remote disks
[Chart; configurations: local only; 1, 2 and 3 remotes; 1, 2 and 3 remotes with /tmp local]
PaaS Advantages
Provides an automated setup of Big Data services (Hadoop, Spark, Hive, ...)
• Optimized for the underlying hardware
• Removes cost of installation
The service provider is in charge of maintenance
• Reduces TCO
• As with any cloud service, you pay as you go
Platform as a Service
O'Reilly ran a survey on data science salaries and estimated an average salary of 140,000 US$ for a data engineer
An HDInsight cluster of 16 datanodes on A3 machines costs, for one year:
• (16 datanodes + 2 headnodes) * 0.2384 US$/hr = 4.2912 US$/hr =>
4.2912 * 24 * 365 = 37,590.912 US$/year
Hence, under ideal conditions we can save up to 102,409.088 US$ per year
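The arithmetic on this slide can be reproduced with a minimal sketch like the one below. The function name and its parameters are illustrative only; the node counts, the hourly price and the salary are the figures quoted on the slide, and the salary is treated as a yearly figure, as the comparison implies.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_cluster_cost(datanodes, headnodes, price_per_node_hour):
    """US$ cost of keeping every node of the cluster running for one year."""
    hourly = (datanodes + headnodes) * price_per_node_hour
    return hourly * HOURS_PER_YEAR


engineer_salary = 140_000  # US$, survey figure quoted on the slide
paas_cost = yearly_cluster_cost(datanodes=16, headnodes=2, price_per_node_hour=0.2384)
saving = engineer_salary - paas_cost  # upper bound, "ideal conditions" only

print(f"PaaS cluster: {paas_cost:,.3f} US$/year")
print(f"Upper bound on saving: {saving:,.3f} US$/year")
```

The saving is an upper bound: it assumes the PaaS offering removes the need for a dedicated data engineer entirely and that the cluster runs all year.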
How much spent on maintenance?
Some current solutions
• Azure HDInsight
• Rackspace CBD
• Amazon EMR
• Google Cloud Platform
Linux-based clusters of 4, 8 and 16 datanodes
• Azure HDInsight and Rackspace CBD
• Azure IaaS and Rackspace IaaS clusters as well
Clusters of up to 8 cores per node and 64 GB RAM
HDInsight: Azure Storage as HDFS (remote disks)
Rackspace CBD: nodes’ local disks as HDFS
Evaluation environment
Wordcount
• CPU intensive: useful to analyze scalability of the nodes between VM sizes
Tested workloads
[Chart: CPU utilization breakdown (%user, %system, %steal, %iowait, %nice)]
Terasort
• Combined I/O and CPU loads, a de facto benchmark in the community
Data sizes of 1, 10, 100 and 1000 GB
This is enough to stress the system and characterize its overall behavior
[Chart: CPU utilization breakdown (%user, %system, %steal, %iowait, %nice)]
Runs repeated several times
Cloud variability (100GB runs)
Benchmark   Provider        Standard deviation (%)
Terasort    HDInsight       60
Terasort    Rackspace CBD   28
Wordcount   HDInsight       55
Wordcount   Rackspace CBD   47
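The slide does not say how the percentages were computed. One plausible reading, assumed here, is the sample standard deviation of the repeated runs expressed as a percentage of their mean. The sketch below shows that calculation; the execution times are made-up placeholders, not measurements from the talk.

```python
from statistics import mean, stdev


def relative_stdev_percent(times):
    """Sample standard deviation as a percentage of the mean execution time."""
    return 100.0 * stdev(times) / mean(times)


# Hypothetical execution times in seconds for one benchmark on one provider.
example_runs = [100.0, 150.0, 200.0]
print(f"{relative_stdev_percent(example_runs):.1f}%")
```

Expressing variability relative to the mean makes benchmarks with very different runtimes comparable.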
Relevant factors tree
ALOJA-ML is a set of machine learning techniques and tools to estimate the behavior of executions in the unexplored search space
Relevant factors tree: a tool that identifies the parameters that most change an execution’s behavior
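The talk does not describe how ALOJA-ML builds the tree. As an illustration of the general idea only, the sketch below ranks factors by how much execution-time variance remains after grouping runs by each factor, then recurses on each branch. All function names, the run records and their values are hypothetical, and this is not claimed to be ALOJA-ML's algorithm.

```python
from collections import defaultdict
from statistics import pvariance


def split_variance(runs, factor):
    """Weighted variance of execution time left after grouping runs by `factor`."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[factor]].append(run["time"])
    return sum(len(g) * pvariance(g) for g in groups.values()) / len(runs)


def most_relevant_factor(runs, factors):
    """Factor whose grouping leaves the least unexplained variance."""
    return min(factors, key=lambda f: split_variance(runs, f))


def build_tree(runs, factors, depth=0, max_depth=2):
    """Print, branch by branch, the most relevant factor at each level."""
    total = pvariance([r["time"] for r in runs])
    if not factors or depth == max_depth or total == 0:
        return
    best = most_relevant_factor(runs, factors)
    rest = [f for f in factors if f != best]
    for value in sorted({r[best] for r in runs}):
        print("  " * depth + f"{best}={value}")
        subset = [r for r in runs if r[best] == value]
        build_tree(subset, rest, depth + 1, max_depth)


# Hypothetical runs: execution time depends mostly on datasize, barely on provider.
runs = [
    {"datasize": 10, "provider": "A", "time": 100.0},
    {"datasize": 10, "provider": "B", "time": 110.0},
    {"datasize": 100, "provider": "A", "time": 900.0},
    {"datasize": 100, "provider": "B", "time": 880.0},
]
build_tree(runs, ["provider", "datasize"])
```

A factor that appears near the root explains most of the variation; a factor that never appears, or appears only deep in the tree, has little influence on the outcome.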
Resulting tree for PaaS executions
[Tree diagram; branch condition -> next factor]
IOFileBuffer=131072 -> Datasize
Benchmark=Terasort -> Replication
Benchmark=wordcount -> Datanodes
IOFileBuffer=262144 -> Datasize
Provider is not a relevant factor
But the data size changes which factor is the next most important
IO File Buffer 10GB
Analysing IO File Buffer (the most relevant parameter in the tree)
IO File Buffer 100GB
IO File Buffer 1TB
Which value to use depends on your application
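The slides do not show how the buffer size was varied between runs. A minimal sketch of one way to do it follows, assuming "IO File Buffer" corresponds to the Hadoop property `io.file.buffer.size` and that the override is passed as a `-D` generic option to the Terasort example job. The helper name, the jar location and the HDFS paths are hypothetical; the two buffer values are the ones that appear in the tree. The code only builds and prints the commands, it does not run them.

```python
def terasort_command(examples_jar, input_dir, output_dir, properties):
    """Build a `hadoop jar ... terasort` command with -D property overrides."""
    overrides = [f"-D{name}={value}" for name, value in properties.items()]
    return ["hadoop", "jar", examples_jar, "terasort", *overrides, input_dir, output_dir]


# Hypothetical jar location and HDFS paths; adjust to your cluster.
EXAMPLES_JAR = "hadoop-mapreduce-examples.jar"

for buffer_size in (131072, 262144):
    cmd = terasort_command(
        EXAMPLES_JAR,
        "/bench/terasort-input",
        f"/bench/terasort-output-{buffer_size}",
        {"io.file.buffer.size": buffer_size},
    )
    print(" ".join(cmd))
```

The same helper can carry other overrides, for example `dfs.replication` when exploring the replication factor.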
Replication factor 100GB
Replication factor 1TB
Important, but it does not make a significant difference
Datasize scalability, Terasort
[Chart; VM size: 4 cores, 15 GB]
Datasize scalability, Terasort
[Chart; VM sizes: 4 cores, 7 GB; 8 cores, 14 GB; 4 cores, 15 GB; 8 cores, 30 GB]
Datanodes impact, Wordcount
[Chart; VM sizes: 4 cores, 15 GB; 8 cores, 30 GB]
Datanodes impact, Terasort
[Chart; VM sizes: 4 cores, 7 GB; 8 cores, 14 GB; 4 cores, 15 GB; 8 cores, 30 GB]
Datanodes impact, Terasort
Diminishing returns
[Chart annotations: $2.87, $2.88; VM sizes: 8 cores, 14 GB; 4 cores, 15 GB]
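Diminishing returns can be made concrete by pricing a single run: every node is billed for the whole runtime, so doubling the nodes only pays off if the runtime at least halves. The sketch below assumes a simple per-node hourly price and ignores head nodes. The runtimes are made-up placeholders, and 0.64 US$/h is used purely as an example node price.

```python
def run_cost(nodes, price_per_node_hour, runtime_seconds):
    """US$ cost of one job when every node is billed for the whole runtime."""
    return nodes * price_per_node_hour * runtime_seconds / 3600


# Hypothetical runtimes (seconds) for the same job on growing clusters.
hypothetical_runtimes = {4: 3600, 8: 2000, 16: 1400}

for nodes, seconds in hypothetical_runtimes.items():
    cost = run_cost(nodes, 0.64, seconds)
    print(f"{nodes:>2} datanodes: {seconds:>5} s, {cost:.2f} US$ per run")
```

With these placeholder numbers the job gets faster on every larger cluster, yet each run costs more, which is the trade-off the chart illustrates.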
Cost difference IaaS and PaaS
Provider        VM size             IaaS US$/h   PaaS US$/h
Azure/HDI       4 CPU, 7 GB RAM     0.176        0.32
Azure/HDI       8 CPU, 15 GB RAM    0.352        0.64
Rackspace/CBD   4 vCPU, 15 GB RAM   0.555        0.7925
Rackspace/CBD   8 vCPU, 30 GB RAM   1.11         2.776
Amazon/EMR      4 vCPU, 16 GB RAM   0.239        0.299
Amazon/EMR      8 vCPU, 32 GB RAM   0.479        0.599
IaaS is cheaper, but might increase TCO (maintenance on your own!)
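To compare providers on equal terms, the hourly prices in the table can be turned into a relative PaaS premium over IaaS. The sketch below does that; the function and variable names are illustrative, and the prices are copied from the table.

```python
# (provider, VM size): (IaaS US$/h, PaaS US$/h), copied from the table
PRICES = {
    ("Azure/HDI", "4 CPU, 7 GB"): (0.176, 0.32),
    ("Azure/HDI", "8 CPU, 15 GB"): (0.352, 0.64),
    ("Rackspace/CBD", "4 vCPU, 15 GB"): (0.555, 0.7925),
    ("Rackspace/CBD", "8 vCPU, 30 GB"): (1.11, 2.776),
    ("Amazon/EMR", "4 vCPU, 16 GB"): (0.239, 0.299),
    ("Amazon/EMR", "8 vCPU, 32 GB"): (0.479, 0.599),
}


def paas_premium_percent(iaas, paas):
    """How much more the PaaS node costs per hour, as a percentage of the IaaS price."""
    return 100.0 * (paas - iaas) / iaas


for (provider, size), (iaas, paas) in PRICES.items():
    print(f"{provider:<15}{size:<16}{paas_premium_percent(iaas, paas):6.1f}%")
```

The premium is only the hourly price difference; it says nothing about the maintenance effort that IaaS leaves to the user.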
Conclusions
The provider is not a significant factor
In public cloud, large data sizes or large clusters introduce problems
• A larger cluster may improve performance but be more expensive in the end
PaaS allows you to save on maintenance
• But you still have to take care of some tuning
• Not as much as on IaaS
• Whether it is cheaper than IaaS depends on your business