sudoers: benchmarking hadoop with aloja

Benchmarking Hadoop with ALOJA

Oct 6, 2015

by Nicolas Poggi @ni_po

sudoers Barcelona:

About Nicolas Poggi @ni_po

Work: Education:

Community:

Agenda

Intro on Hadoop

Current scenario and problems

ALOJA project

Open source tools

Benchmarking DEMO

Results

DEMO results online

Open questions and comments

Intro: Hadoop design and ecosystem

Hadoop design

Hadoop is designed to handle complex data

Structured and unstructured

With [close to] linear scalability

Simplifies the programming model

From MPI, OpenMP, CUDA, …

Operates as a black box for data analysts

Image source: Hadoop, the definitive guide

Hadoop parameters

100+ tunable parameters, obscure and interrelated

mapred.map/reduce.tasks.speculative.execution

io.sort.mb 100 (300)

io.sort.record.percent 5% (15%)

io.sort.spill.percent 80% (95 – 100%)

Number of Mappers and Reducers

Rule of thumb: 0.5 to 2 per CPU core
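The rule of thumb above can be turned into a quick calculation. This is a minimal sketch, not part of ALOJA: the function name `task_slot_range` is hypothetical, and rounding the lower bound up to a whole slot is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lowest and highest number of mapper or
# reducer slots suggested by the "0.5 to 2 per CPU core" rule of thumb.
task_slot_range() {
  local cores="$1"
  # Bash only has integer arithmetic, so 0.5 * cores is computed as
  # (cores + 1) / 2, which rounds up (assumption: never go below 1 slot).
  local min=$(( (cores + 1) / 2 ))
  local max=$(( cores * 2 ))
  echo "$min $max"
}

# A 4-core VM, as in an Azure "Large" instance
task_slot_range 4   # prints "2 8"
```

The range is only a starting point; the actual value still has to be found by benchmarking.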

Hadoop stack for tuning

Image source: Intel® Distribution for Apache Hadoop

Hadoop: highly scalable but…

Not a high-performance solution!

Requires:

Design: clusters, cluster topology

Setup: OS, Hadoop config

Tuning: iterative approach, time consuming

And extensive benchmarking!

Hadoop ecosystem

Large and spread

Dominated by big players

Custom patches

Default values not ideal

Product claims

Cloud vs. On-premise

IaaS

PaaS

EMR, HDInsight

Needs standardization and auditing!


Too many choices?

[Figure: configuration options (remote volumes, rotational HDDs, JBODs, RAID, large VMs, small VMs, Gb Ethernet, InfiniBand, replication, high availability) placed along the axes cost, performance, and on-premise vs. cloud.]

And where is my system configuration positioned on each of these axes?

Project ALOJA

Open initiative to produce mechanisms for an automated characterization of the cost-effectiveness of Big Data deployments

Results from a growing need in the community to understand job execution details and create transparency

Explores different configuration and deployment options and their tradeoffs

Both software and hardware

Cloud services and on-premise

Seeks to provide knowledge, tools, and an online service with which users can make better-informed decisions and reduce the TCO of their Big Data infrastructures

Guides the future development and deployment of Big Data clusters and applications

Challenges, options, and implementation

Challenges (circa end 2013)

Test different cluster architectures

On-premise: commodity, high-end, appliance, low-power

Cloud IaaS: 32 different VMs in Azure, similar in other providers

Cloud PaaS: HDInsight, EMR, CloudBigData

Different access levels: full admin, user-only, request-to-install, everything ready, queuing systems (SGE)

Different versions: Hadoop, JVM, Spark, Hive, etc.

Dev environments and testing: Big Data usually requires a cluster to develop and test

Benchmarking vs. production environments

Need to compare different executions, not how the systems are doing now

This is the main difference with production products

Data does not change (non-OLTP)

Temporary data for benchmarks vs. important data

Fast iteration vs. reliability

Iterates over configurations vs. a fixed config

Many fast, experimental changes

Security can be relaxed

Management for Hadoop

Vendor lock-in

Lack of systems support (Azure, on-premise, low-power)

Hadoop is our use case, not the only one

Leave no traces on the benchmarked system

Available options (circa end 2013)

Deployment: jclouds, Foreman, Puppet, Ambari

Config and deploy: Ambari (Hadoop only), or use Configuration Management (CM): Puppet, Chef, Ansible…

Monitoring: Ganglia, Zabbix, Ambari, Cloudera Manager, Kibana, GraphD…

Problems

All systems are thought for PROD, not for comparison

No Azure support

Many different packages

No one-size-fits-all solution

Solution

Custom implementation

Based on simple components

Wrapping commands

ALOJA Platform main components

1 Big Data Benchmarking

• Deploy & Provision

• Conf Management

• Parameter selection & Queuing

• Perf counters

• Low-level instrumentation

• App logs

2 Online Repository

• Explore results

• Execution details

• Cluster details

• Costs

• Data sharing

3 Web Analytics

• Data views and evaluations

• Aggregates

• Abstracted Metrics

• Job characterization

• Machine Learning

• Predictions and clustering

Technologies used:

NGINX, PHP, MySQL

BASH, Unix tools, CLIs

R, SQL, JS

Workflow in ALOJA

Cluster(s) definition

• VM sizes

• # nodes

• OS, disks

• Capabilities

Execution plan

• Start cluster

• Exec Benchmarks

• Gather results

• Cleanup

Import data

• Convert perf metric

• Parse logs

• Import into DB

Evaluate data

• Data views in Vagrant VM

• Or http://hadoop.bsc.es

PA and KD

• Predictive Analytics

• Knowledge Discovery

Historic Repo (in progress)
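The execution plan step of the workflow above can be sketched as a small wrapper script. This is an illustration only: every function name below is a hypothetical placeholder that just echoes what it would do, and none of them is taken from the ALOJA code base.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the four execution plan steps. In a real
# setup each one would wrap provider CLIs and SSH commands.
start_cluster()   { echo "start $1"; }
exec_benchmark()  { echo "run $2 on $1"; }
gather_results()  { echo "gather $1"; }
cleanup_cluster() { echo "cleanup $1"; }

# Run the plan for one cluster: start, run every benchmark given as an
# extra argument, collect the results, then clean up.
run_plan() {
  local cluster="$1"; shift
  local bench
  start_cluster "$cluster"
  for bench in "$@"; do
    exec_benchmark "$cluster" "$bench"
  done
  gather_results "$cluster"
  cleanup_cluster "$cluster"
}

run_plan "al-08" terasort dfsioe_read
```

Keeping each step as a separate function makes it easy to swap the provider-specific part without touching the plan itself.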

Cluster and node definitions

Clusters (Azure example)

#load AZURE defaults
source "$CONF_DIR/azure_defaults.conf"

clusterName="al-08"
numberOfNodes="8"
vmSize="Large"

#details
vmCores="4"
vmRAM="7" #in GB

#costs
clusterCostHour="1.584" #0.176 * 9
clusterType="IaaS"
clusterDescription="A3 type VMs"

Node (Web in Rackspace)

#load node defaults
source "$CONF_DIR/node_defaults.conf"

defaultProvider="rackspace"

vm_name="aloja-web"
vmSize='io1-30'

attachedVolumes="2"
diskSize="1023"

# Node roles (install functions)
extraLocalCommands="
vm_install_webserver;
vm_install_repo 'provider/rackspace';
install_ganglia_gmond;
config_ganglia_gmond 'aloja-web-rackspace' 'aloja-web';
install_percona /scratch/attached/2/mysql;"
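The `clusterCostHour` value in the cluster definition is what makes cost comparisons possible. As a rough sketch of how it could be used, the snippet below converts an execution time into a cost. The function name `execution_cost` is hypothetical, and linear per-second billing is an assumption; real providers may bill per started hour or minute.

```shell
#!/usr/bin/env bash
# Value copied from the Azure cluster definition (0.176 * 9)
clusterCostHour="1.584"

# Hypothetical helper: cost of one benchmark execution, given its run
# time in seconds. Assumes billing is linear in time.
execution_cost() {
  local exec_seconds="$1"
  # awk is used because bash has no floating-point arithmetic
  LC_ALL=C awk -v cost="$clusterCostHour" -v secs="$exec_seconds" \
    'BEGIN { printf "%.4f\n", cost * secs / 3600 }'
}

execution_cost 1800   # a 30-minute run, prints "0.7920"
```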

Commands and providers

Provisioning commands

Connect

Node and Cluster

Uses SSH proxies automatically

Deploy

Start, Stop

Delete

Nodes and clusters

Providers

On-premise: custom settings for clusters, multiple disk types, different architectures

Cloud IaaS: Azure, OpenStack, Rackspace, AWS (testing)

Cloud PaaS: HDInsight, CloudBigData, EMR soon

Code at: https://github.com/Aloja/aloja/tree/master/aloja-deploy

Running benchmarks in ALOJA

Example of submitting a job to run:

https://github.com/Aloja/aloja/blob/master/aloja-bench/run_benchs.sh

To queue jobs and control results: https://github.com/Aloja/aloja/blob/master/shell/exeq.sh

Benchmarking results

ALOJA Online Benchmark Repository

Entry point to explore the results collected from the executions

Index of executions: quick glance of executions; searchable, sortable

Execution details: performance charts and histograms, Hadoop counters, job and task details

Data management of benchmark executions: data importing from different clusters, execution validation, data management and backup

Cluster definitions: cluster capabilities (resources), cluster costs

Sharing results: download executions, add external executions

Documentation and references: papers, links, and feature documentation

Available at: http://aloja.bsc.es


Impact of SW configurations in Speedup (4 node clusters)

[Figure: two panels showing speedup (higher is better). Number of mappers: 4m, 6m, 8m, 10m. Compression algorithm: no compression, ZLIB, BZIP2, snappy.]

Results using: http://hadoop.bsc.es/configimprovement

Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
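The charts in this part of the talk are expressed as speedup. A minimal sketch of that calculation follows, assuming speedup is the baseline execution time divided by the execution time of the configuration under test. That definition, the function name `speedup`, and the sample times are assumptions for illustration, not values from the ALOJA repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: speedup of a configuration relative to a baseline.
# Both arguments are execution times in seconds; a result above 1 means
# the configuration is faster than the baseline.
speedup() {
  local baseline_secs="$1" config_secs="$2"
  LC_ALL=C awk -v b="$baseline_secs" -v c="$config_secs" \
    'BEGIN { printf "%.2f\n", b / c }'
}

# Made-up times: baseline 600 s, tuned configuration 400 s
speedup 600 400   # prints "1.50"
```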



Impact of HW configurations in Speedup

[Figure: two panels showing speedup (higher is better). Disks and network: HDD-ETH, HDD-IB, SSD-ETH, SSD-IB. Cloud remote volumes: local only; 1, 2, and 3 remotes; 1, 2, and 3 remotes with /tmp local.]

Results using: http://hadoop.bsc.es/configimprovement

Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf




Speedup: all disk configurations, SSD vs JBOD

For DFSIOE read, DFSIOE write, and Terasort

URL: http://hadoop.bsc.es/configimprovement?datefrom=&dateto=&benchs%5B%5D=dfsioe_read&benchs%5B%5D=dfsioe_write&benchs%5B%5D=terasort&id_clusters%5B%5D=21&nets%5B%5D=None&disks%5B%5D=HD2&disks%5B%5D=HD3&disks%5B%5D=HD4&disks%5B%5D=HD5&disks%5B%5D=HDD&disks%5B%5D=HS5&disks%5B%5D=RL1&disks%5B%5D=RL2&disks%5B%5D=RL3&disks%5B%5D=RL4&disks%5B%5D=RL5&disks%5B%5D=RL6&disks%5B%5D=RR1&disks%5B%5D=SS2&disks%5B%5D=SSD&mapss%5B%5D=None&comps%5B%5D=None&replications%5B%5D=None&blk_sizes%5B%5D=None&iosfs%5B%5D=None&iofilebufs%5B%5D=None&datanodess%5B%5D=None&bench_types%5B%5D=HDI&bench_types%5B%5D=HiBench&vm_sizes%5B%5D=None&vm_coress%5B%5D=None&vm_RAMs%5B%5D=None&hadoop_versions%5B%5D=None&types%5B%5D=None&filters%5B%5D=valid&filters%5B%5D=filters&allunchecked=

[Figure: speedup (higher is better) for 2 SSDs; 5 SATA with 1 SSD for /tmp; 1 SSD; and 1 to 5 SATA disks. Chart annotations: fastest config; high capacity and fast; high capacity but slow.]

Speedup by disk configuration in the Cloud (higher is better)

URL: http://104.130.159.92/configimprovement?benchs%5B%5D=terasort&disks%5B%5D=HDD&disks%5B%5D=RL1&disks%5B%5D=RL2&disks%5B%5D=RL3&disks%5B%5D=RR1&disks%5B%5D=RR2&disks%5B%5D=RR3&disks%5B%5D=RR4&disks%5B%5D=RR5&disks%5B%5D=RR6&disks%5B%5D=RS1&disks%5B%5D=RS6&disks%5B%5D=SSD&bench_types%5B%5D=HiBench&filters%5B%5D=valid&filters%5B%5D=filters&allunchecked=&selected-groups=disk&datefrom=&dateto=&minexetime=150&maxexetime=1500

[Figure: speedup (higher is better) for 1-6 remotes, 1 and 6 remotes with /tmp on SSD, and SSD only.]

VM Size comparison (Azure)

Lower is better

Preview: Cost/Performance Scalability

This shows a sample of a new screen (with sample data) to find the most cost-effective cluster size

X axis: number of datanodes (cluster size)

Left Y axis: execution time (lower is better)

Right Y axis: execution cost

[Figure: execution time and execution cost per cluster size, with the recommended size highlighted.]
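One way to read this screen is as a search for the cluster size with the lowest execution cost. The sketch below illustrates that idea under loud assumptions: the sample rows are invented, the function name `cheapest_size` is hypothetical, the 0.176 hourly figure is taken from the cost comment in the Azure cluster example and treated here as a per-VM price, and any master node is ignored. It is not how the ALOJA screen computes its recommendation.

```shell
#!/usr/bin/env bash
# Invented sample data, one row per cluster size:
#   <number of datanodes> <execution time in seconds>
samples="4 3600
8 2000
16 1400"

# Assumed per-VM hourly price
nodeCostHour="0.176"

# Hypothetical helper: print the cluster size with the lowest execution
# cost, followed by that cost. Cost = nodes * hourly price * hours used.
cheapest_size() {
  printf '%s\n' "$samples" | LC_ALL=C awk -v price="$nodeCostHour" '
    {
      cost = $1 * price * $2 / 3600
      if (NR == 1 || cost < best) { best = cost; size = $1 }
    }
    END { printf "%d %.4f\n", size, best }'
}

cheapest_size   # with the sample rows above, prints "4 0.7040"
```

With these made-up numbers the smallest cluster is the cheapest even though it is the slowest, which is exactly the kind of tradeoff the time and cost axes are meant to expose.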

Cost-effectiveness (On-premise vs. Cloud)

[Figure: price vs. performance for the configurations listed below.]

InfiniBand + SSD (LOCAL)

GbE + SSD (LOCAL)

CLOUD (local disk /tmp and HDFS)

CLOUD (/tmp in local disk, HDFS in Blob storage, 1-3 devices)

CLOUD (/tmp and HDFS in Blob storage, 1-3 devices)

InfiniBand + SATA disks (LOCAL)

GbE + SATA disks (LOCAL)

Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf

Open questions: is BASH good enough?

PROs

Simple and fast

Well known (basics at least)

Easy to hack

Most of the work requires running system commands

CONs and alternatives

Custom implementation problems

Missing some systems

Too simple, missing: objects, inheritance, types, data structures, testing

Python? Perl?

Puppet? Ansible?

We’ll stick to bash for now.

What’s missing for incubating in Apache?

More info

ALOJA Benchmarking platform and online repository: http://aloja.bsc.es

Benchmarking Big Data by Nicolas Poggi: http://www.slideshare.net/ni_po/benchmarking-hadoop

Big Data Benchmarking Community (BDBC) mailing list (~200 members from ~80 organizations): http://clds.sdsc.edu/bdbc/community

Workshop on Big Data Benchmarking (WBDB), next: http://clds.sdsc.edu/wbdb2015.ca

SPEC Research Big Data working group: http://research.spec.org/working-groups/big-data-working-group.html

Slides and video

Michael Frank on Big Data benchmarking: http://www.tele-task.de/archive/podcast/20430/

Tilmann Rabl, Big Data Benchmarking Tutorial: http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl

@BDOOP_BCN

More info: http://aloja.bsc.es

or join the BDOOP group: http://www.meetup.com/Barcelona-BigData-Perfomance-and-Operations
