top 10 performance gotchas for scaling in-memory algorithms

H2O – The Open Source Math Engine

Better Predictions!

4/23/13

H2O – Open Source in-memory Machine Learning for Big Data

Universe is sparse. Life is messy. Data is sparse & messy.

- Lao Tzu

Hadoop = opportunity
Not enough Data Scientists
Analysts won't code Java

H2O the Prediction Engine

Adhoc Exploration: Big Data, Messy NAs, Group By, Grep
Math Modeling: Clustering, Classification, Regression
Real-time Scoring: Ensembles, 100's of models, nanos


Big Data: Exploration, Modeling, Real-time Scoring

No New API!

Approximate results at each step


Intellectual Legacy

Math needs to be free

Open Source

Support and Innovation

https://github.com/0xdata/h2o

All Top 10's are binary! - Anonymous

Data chunks > code chunks
TCP for data. UDP for control.

>> Generated Java Assist

10 Move Code not Data

A Frame: Vec[]
Columns: age, sex, zip, ID, car
[Diagram: the Vecs of one Frame laid out across four JVM heaps, JVM 1 to JVM 4]

• Vecs aligned in heaps
• Optimized for concurrent access
• Random access to any row, from any JVM

A Chunk, Unit of Parallel Access

A season for variable-sized chunks and a season for uniform chunks.
Tightly packed! (A chunk is also the unit of batch.)

9 Chunk-ing Express!

No expensive intermediate states.
Fine-grain parallelism wins!
>> Fork / Join

8 Reduce early. Reduce Often!

All CPUs grab Chunks in parallel.
Map/Reduce & Fork/Join handle all synchronization.
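A minimal sketch of "reduce early" on a single JVM, assuming a column is held as an array of chunks (`double[][]`). The class name `ChunkSum` is hypothetical and not part of H2O. Each leaf task reduces its own chunk to one partial sum, so only a single `double` per chunk travels back up the Fork/Join tree.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: sums a column that is stored as an array of chunks.
final class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks;
    private final int lo, hi; // half-open range of chunk indices [lo, hi)

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 0) return 0.0;
        if (hi - lo == 1) {
            // Map + early reduce: one chunk collapses to one number.
            double sum = 0.0;
            for (double d : chunks[lo]) sum += d;
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        ChunkSum right = new ChunkSum(chunks, mid, hi);
        left.fork();                 // left half runs on another worker
        double r = right.compute();  // right half runs on this worker
        return left.join() + r;      // reduce the two partial results
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }
}
```

Calling `ChunkSum.sum(chunks)` needs no explicit locks; the only synchronization is the `join()` inside Fork/Join.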


Debugging slowness >> Heartbeats, Messages
Two Generals' Paradox

7 Slow is not different from Dead
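One way to read this slide: a node that has not been heard from within a timeout is treated exactly like a dead one. The sketch below is an assumption about how such a check could look, not H2O's heartbeat code. `HeartbeatMonitor` and its method names are hypothetical, and the caller passes the clock in so the logic stays testable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a timeout-based failure detector.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record that a heartbeat from the node arrived at the given time.
    void onHeartbeat(String node, long nowMillis) {
        lastSeen.put(node, nowMillis);
    }

    // A node is suspect if it was never seen or has been silent too long.
    // The monitor cannot tell "slow" from "dead"; both look the same here.
    boolean isSuspect(String node, long nowMillis) {
        Long seen = lastSeen.get(node);
        return seen == null || nowMillis - seen > timeoutMillis;
    }
}
```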

An in-memory system is only as good as your memory manager!
Lazy eviction. Compress. Align.
Corollary: Track down leaks!

6 Memory Manager

Use primitives

5 Memory Overheads
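To make "use primitives" concrete: every element of an `ArrayList<Double>` is a separate heap object with its own header plus a reference, while a `double[]` stores 8 bytes per element and nothing else. The growable array below is a hypothetical sketch (`PrimitiveDoubles` is not an H2O class) of keeping a variable-sized column in primitive storage.

```java
import java.util.Arrays;

// Hypothetical sketch: a growable column backed by a primitive double[].
final class PrimitiveDoubles {
    private double[] data = new double[16];
    private int size;

    // Appends one value, doubling the backing array when it is full.
    void append(double d) {
        if (size == data.length) data = Arrays.copyOf(data, size * 2);
        data[size++] = d;
    }

    // Returns the idx'th element without any boxing.
    double at(int idx) {
        if (idx < 0 || idx >= size) throw new IndexOutOfBoundsException("index " + idx);
        return data[idx];
    }

    int length() {
        return size;
    }
}
```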

// A Distributed Vector
// much more than 2billion elements
abstract class Vec {
  abstract long length();                 // more than an int's worth
  // fast random access
  abstract double at(long idx);           // Get the idx'th elem
  abstract boolean isNA(long idx);
  abstract void set(long idx, double d);  // writable
  abstract void append(double d);         // variable sized
}

Tree size, bin size
Recursively divide till data → cache

4 Cache-Oblivious
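The slide talks about tree and bin sizes; as a smaller stand-in for the same idea, here is a sketch of a recursive matrix transpose. It keeps halving the larger dimension, so at some depth each sub-block fits in cache no matter how big that cache is. `CacheObliviousTranspose` and the `LEAF` cutoff are assumptions for illustration; the cutoff only limits recursion overhead and is not tuned to any cache size.

```java
// Hypothetical sketch: divide-and-conquer transpose of a row-major matrix.
final class CacheObliviousTranspose {
    private static final int LEAF = 16; // base-case size, an arbitrary choice

    // Transposes the rows x cols matrix src into dst (cols x rows).
    static void transpose(double[] src, double[] dst, int rows, int cols) {
        recurse(src, dst, rows, cols, 0, rows, 0, cols);
    }

    private static void recurse(double[] src, double[] dst, int rows, int cols,
                                int r0, int r1, int c0, int c1) {
        int dr = r1 - r0;
        int dc = c1 - c0;
        if (dr <= LEAF && dc <= LEAF) {
            // Small block: copy element by element.
            for (int r = r0; r < r1; r++)
                for (int c = c0; c < c1; c++)
                    dst[c * rows + r] = src[r * cols + c];
        } else if (dr >= dc) {
            int mid = r0 + dr / 2; // split the taller dimension
            recurse(src, dst, rows, cols, r0, mid, c0, c1);
            recurse(src, dst, rows, cols, mid, r1, c0, c1);
        } else {
            int mid = c0 + dc / 2; // split the wider dimension
            recurse(src, dst, rows, cols, r0, r1, c0, mid);
            recurse(src, dst, rows, cols, r0, r1, mid, c1);
        }
    }
}
```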

User-mode reliability
S3 readers will TCP reset
Mux your connections
Not all toolkits are equal. >> JetS3

3 EC2 – Nothing is bounded

Non-Blocking Data Structures.

2 No Locks, No Cry

// VOLATILE READ before key compare.
// CAS
private final boolean CAS_kvs( final Object[] oldkvs, final Object[] newkvs ) {
  return _unsafe.compareAndSwapObject(this, _kvs_offset, oldkvs, newkvs );
}
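The snippet above uses `Unsafe` directly. The same read, compute, compare-and-swap, retry pattern can be sketched with the standard `AtomicReference`. The lock-free stack below is a generic illustration of that pattern under this assumption, not the data structure H2O uses; `LockFreeStack` is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a non-blocking stack built on a CAS retry loop.
final class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;

        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();                       // volatile read
            newHead = new Node<>(value, oldHead);
        } while (!head.compareAndSet(oldHead, newHead)); // retry if another thread won
    }

    // Returns the most recently pushed value, or null if the stack is empty.
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) return null;
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }
}
```

No thread ever blocks: a failed `compareAndSet` just means another thread made progress, and the loser retries against the new head.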

byte[ ]. roll-your-own. fast.

1 endian wars ended! Keep-It-Simple-Serialization.

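public AutoBuffer putA1 ( byte[] ary, int sofar, int length ) {
  while( sofar < length ) {
    // Copy as many bytes as still fit in the buffer
    int len = Math.min(length - sofar, _bb.remaining());
    _bb.put(ary, sofar, len);
    sofar += len;
    // Buffer is full but bytes remain: send what we have and keep going
    if( sofar < length ) sendPartial();
  }
  return this;
}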
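As a small illustration of "keep it simple" serialization over `byte[]`, the sketch below writes a `double[]` in a fixed layout and pins the byte order explicitly, so both ends agree whatever the host CPU does. The layout (an `int` count followed by the values), the choice of little-endian, and the name `SimpleCodec` are assumptions for this example, not H2O's wire format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: hand-rolled encoding of a double[] into a byte[].
final class SimpleCodec {
    // Layout: [int count][count x 8-byte double], all little-endian.
    static byte[] encode(double[] values) {
        ByteBuffer bb = ByteBuffer.allocate(4 + 8 * values.length)
                                  .order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(values.length);
        for (double d : values) bb.putDouble(d);
        return bb.array();
    }

    static double[] decode(byte[] bytes) {
        ByteBuffer bb = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        double[] out = new double[bb.getInt()];
        for (int i = 0; i < out.length; i++) out[i] = bb.getDouble();
        return out;
    }
}
```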

Data Movement is a Defect. Slowing down helps communication.

Got Speed?

Accuracy rules over speed. Predictive Performance

0 Math always produces a number

Data presentation bias. Sorted data => interesting results

1 Shuffle
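A minimal sketch of removing presentation order before training: build a random permutation of row indices with a Fisher-Yates shuffle and visit the rows in that order. The seed keeps runs reproducible. `RowShuffle` is a hypothetical helper, not an H2O API.

```java
import java.util.Random;

// Hypothetical sketch: a seeded Fisher-Yates shuffle of row indices.
final class RowShuffle {
    // Returns a random permutation of 0..n-1 determined by the seed.
    static int[] shuffledRowOrder(int n, long seed) {
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1); // pick from the not-yet-fixed prefix
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        return order;
    }
}
```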

2 Random acts of Kindness?

3 Convex Problems: ADMM

Matrix operations: JAMA, jblas... all single node.
A distributed version needs data transfer!

4 Amdahl strikes: Cholesky / QR Decomposition

embarrassingly parallel
binning
tree-building
splits

5 Random Forests

iterate & stage weak learners => strong learners
each tree can be parallel
minimize communication

6 Boosting

embarrassingly parallel
pre-calculate base stats
distance calculation
weight matrices – small footprint

7 Neural Nets & Clustering

Daisy chain a bunch of models
Interleave.
JIT – Minimize loops over data.

8 Ensembles

Deterministic versions first! Got Pen & Paper? Optimize often. Test Big Data soon.

9 Tools

Replacing NAs improves predictive performance by about 10%.

- Newton

Munging Missing Features
• impute NAs with mean
• impute NAs with knn
• impute with recursive pca

- Boyd
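The simplest of the three options, mean imputation, can be sketched as follows. The example assumes NAs are encoded as `NaN` in a primitive column; `MeanImpute` is a hypothetical helper and the knn and PCA variants are not shown.

```java
// Hypothetical sketch: replace NA (encoded as NaN) with the column mean.
final class MeanImpute {
    // Returns a copy of the column with every NaN replaced by the mean
    // of the non-NaN values. An all-NaN column is returned unchanged.
    static double[] imputeWithMean(double[] column) {
        double sum = 0.0;
        int n = 0;
        for (double d : column) {
            if (!Double.isNaN(d)) {
                sum += d;
                n++;
            }
        }
        double[] out = column.clone();
        if (n == 0) return out; // nothing to compute a mean from
        double mean = sum / n;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) out[i] = mean;
        }
        return out;
    }
}
```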

Unbalanced data: a single rare class
Fraud / No-Fraud

Stratify

Unbalanced data: multiple rare classes
Browse, Click, Purchase

Stratify
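One reading of "Stratify" is stratified sampling: draw the same fraction from every class so that rare classes are not lost by chance. The sketch below makes that assumption and works for one or many rare classes; `StratifiedSample` is a hypothetical helper, and class labels are assumed to be small integers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch: sample the same fraction of rows from every class.
final class StratifiedSample {
    // Returns row indices holding ceil(fraction * count) rows of each class.
    // fraction is expected to lie in [0, 1].
    static List<Integer> sample(int[] labels, double fraction, long seed) {
        // Group row indices by class label.
        Map<Integer, List<Integer>> byClass = new TreeMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
        }
        Random rnd = new Random(seed);
        List<Integer> picked = new ArrayList<>();
        for (List<Integer> rows : byClass.values()) {
            Collections.shuffle(rows, rnd); // random choice within the class
            int take = (int) Math.ceil(fraction * rows.size());
            picked.addAll(rows.subList(0, take));
        }
        return picked;
    }
}
```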

Use customer data
Algorithms for sparse vs. dense
Unbalanced data
Robustness under noise

10 Data is the System

Before H2O

[Diagram labels. Volume: HDFS, HIVE/SQL. Velocity: Events, Online Scoring. Roles: Data Scientist, Engineer, Business Analyst. Steps: Munging (slice n dice), Features, Exploration, Modeling (Classification, Regression, Clustering), Optimal Model, Offline Scoring, Ensemble models, Low latency, Rule Engine, Applications, Predictions.]


Big Data beats Better Algorithms!


Big Data and Better Algorithms! Scale & Parallelism!


0xdata.com


Distributed Coding Taxonomy

• No Distribution Coding:
  • Whole Algorithms, Whole Vector-Math
  • REST + JSON: e.g. load data, GLM, get results

• Simple Data-Parallel Coding:
  • Per-Row (or neighbor row) Math
  • Map/Reduce-style: e.g. any dense linear algebra

• Complex Data-Parallel Coding:
  • K/V Store, Graph Algo's, e.g. PageRank


Read the docs!

This talk!

Join our GIT!


Distributed Data Taxonomy

Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a java double
Row i – i'th elements of all the Vecs in a Frame
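Given this layout, random access to a row means finding which Chunk holds it and where inside that Chunk. A minimal sketch, assuming the chunk lengths are known up front and every chunk is non-empty; `ChunkIndex` and its methods are hypothetical names, not H2O's API.

```java
import java.util.Arrays;

// Hypothetical sketch: map a global row number to (chunk, offset in chunk).
final class ChunkIndex {
    // starts[i] is the global row of the first element of chunk i;
    // the final entry holds the total number of rows.
    private final long[] starts;

    ChunkIndex(int[] chunkLengths) {
        starts = new long[chunkLengths.length + 1];
        for (int i = 0; i < chunkLengths.length; i++) {
            if (chunkLengths[i] <= 0) throw new IllegalArgumentException("empty chunk " + i);
            starts[i + 1] = starts[i] + chunkLengths[i];
        }
    }

    long length() {
        return starts[starts.length - 1];
    }

    // Binary search over chunk start rows.
    int chunkOf(long row) {
        if (row < 0 || row >= length()) throw new IndexOutOfBoundsException("row " + row);
        int pos = Arrays.binarySearch(starts, row);
        // Exact hit: row is the first element of chunk pos.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the row
        // belongs to the chunk just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    int offsetInChunk(long row) {
        return (int) (row - starts[chunkOf(row)]);
    }
}
```

Because all Vecs in a Frame are aligned, the same (chunk, offset) pair would locate row i in every column.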

Use cases

Conversion, Retention & Churn
• Lead Conversion
• Engagement
• Product Placement
• Recommendations

Pricing Engine
Fraud Detection
