Top 10 Performance Gotchas for Scaling In-Memory Algorithms
DESCRIPTION
Top 10 Data Parallelism and Model Parallelism lessons from scaling H2O. "Math algorithms have primarily been the domain of desktop data science. With the success of scalable algorithms at Google, Amazon, and Netflix, there is an ever-growing demand for sophisticated algorithms over big data. In this talk, we get a ringside view of the making of the world's most scalable and fastest machine learning framework, H2O, and the performance lessons learnt scaling it over EC2 for Netflix and over commodity hardware for other power users. Top 10 Performance Gotchas is about the white-hot stories of I/O wars, S3 resets, and muxers, as well as the power of primitive byte arrays, non-blocking structures, and fork/join queues. It is also about good data distribution and the decomposition of algorithms into fine-grained blocks of parallel computation. It's a 10-point story of the rage of a network of machines against the tyranny of Amdahl while keeping the statistical properties of the data and the accuracy of the algorithm."
TRANSCRIPT
H2O – The Open Source Math Engine!
Better Predictions!
4/23/13
H2O – Open Source in-memory Machine Learning for Big Data
Universe is sparse. Life is messy. Data is sparse & messy.
- Lao Tzu
Hadoop = opportunity
Not enough Data Scientists
Analysts won’t code Java
H2O the
Prediction
Engine
Ad hoc Exploration
Math Modeling
Real-time Scoring
Big Data
Messy NAs
Clustering
Classification
Ensembles 100’s nanos
models
Regression
Group By Grep
H2O the
Prediction
Engine
Big Data, Exploration, Modeling, Scoring, Real-time
No New API!
Approximate results each step!
H2O the
Prediction
Engine
Intellectual Legacy
Math needs to be free
Open Source
Support and Innovation
https://github.com/0xdata/h2o
All Top 10's are binary!
- Anonymous
Data chunks > code chunks
TCP for Data. UDP for Control.
>> Generated Java Assist
10 Move Code not Data
[Figure: heaps of JVM 1 to JVM 4]
A Frame: Vec[] (age, sex, zip, ID, car)
• Vecs aligned in heaps
• Optimized for concurrent access
• Random access any row, any JVM
A Chunk, Unit of Parallel Access
A season for variable-sized chunks and a season for uniform chunks.
Tightly packed! (A chunk is also the unit of batch.)
9 Chunk-ing Express!
No expensive intermediate states.
Fine-grain parallelism wins!
>> Fork / Join
8 Reduce early. Reduce Often!
All CPUs grab Chunks in parallel
Map/Reduce & F/J handle all sync
8 Reduce early. Reduce Often!
[Figure: Vecs spread across the heaps of JVM 1 to JVM 4]
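The "reduce early, reduce often" idea with Fork/Join can be sketched roughly as follows. This is not H2O's code: the class name `ChunkSum` and the `double[][]` chunk layout are assumptions made for illustration. Each leaf task reduces one chunk to a single number, and partial results are combined on the way back up, so no large intermediate state is ever built.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: sums a column stored as an array of primitive chunks.
final class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks; // each inner array is one chunk (assumed layout)
    private final int lo, hi;        // half-open range of chunk indices

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 0) return 0.0;
        if (hi - lo == 1) {
            // "Map" step: reduce a single chunk to one number right away.
            double s = 0;
            for (double d : chunks[lo]) s += d;
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        ChunkSum right = new ChunkSum(chunks, mid, hi);
        left.fork();                    // schedule the left half on the pool
        double r = right.compute();     // work on the right half in this thread
        return left.join() + r;         // "Reduce" step: combine two small results
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }

    public static void main(String[] args) {
        double[][] chunks = { {1, 2, 3}, {4, 5}, {6} };
        System.out.println(sum(chunks));
    }
}
```

The only synchronization is the `fork`/`join` pair, which the framework handles; the task code itself takes no locks.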
Debugging slow
>> Heartbeats, Messages
Two Generals' Paradox
7 Slow is not different from Dead
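A minimal sketch of the heartbeat idea, assuming a simple timeout rule. The class `HeartbeatMonitor` and its methods are hypothetical, not part of H2O. The point it illustrates is that a timeout-based detector cannot tell a slow node from a dead one: both simply stop being heard from in time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical timeout-based failure detector.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record that a heartbeat from the node arrived at the given time.
    synchronized void heartbeat(String node, long nowMillis) {
        lastSeen.put(node, nowMillis);
    }

    // A node is suspect if it was never heard from or its last heartbeat
    // is older than the timeout. Slow and dead look exactly the same here.
    synchronized boolean isSuspect(String node, long nowMillis) {
        Long seen = lastSeen.get(node);
        return seen == null || nowMillis - seen > timeoutMillis;
    }
}
```

Time is passed in explicitly so the rule can be exercised without sleeping; a real caller would likely pass `System.currentTimeMillis()`.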
An in-memory system is only as good as your memory manager!
Lazy eviction. Compress. Align.
Corollary: Track down leaks!
6 Memory Manager
Use primitives
5 Memory Overheads
// A distributed vector:
// can hold much more than 2 billion elements.
// API summary only; method bodies are omitted on the slide.
abstract class Vec {
  abstract long length();                 // more than an int's worth
  // fast random access
  abstract double at(long idx);           // get the idx'th elem
  abstract boolean isNA(long idx);
  abstract void set(long idx, double d);  // writable
  abstract void append(double d);         // variable sized
}
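To make "use primitives" concrete, here is a single-JVM sketch of a vector with the same method names, backed by `double[]` chunks rather than boxed `Double` objects. It is an illustration under stated assumptions, not H2O's implementation: the class name `ChunkedDoubleVec`, the chunk size, and the use of NaN to mark NA are all choices made for this example.

```java
import java.util.Arrays;

// Hypothetical sketch: an append-only vector of doubles stored in primitive chunks.
final class ChunkedDoubleVec {
    private static final int CHUNK_BITS = 16;              // 65,536 elements per chunk (assumed)
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private double[][] chunks = new double[4][];
    private long length = 0;

    long length() { return length; }

    double at(long idx) {
        checkIndex(idx);
        // High bits select the chunk, low bits select the slot inside it.
        return chunks[(int) (idx >>> CHUNK_BITS)][(int) (idx & CHUNK_MASK)];
    }

    boolean isNA(long idx) { return Double.isNaN(at(idx)); } // NaN stands in for NA here

    void set(long idx, double d) {
        checkIndex(idx);
        chunks[(int) (idx >>> CHUNK_BITS)][(int) (idx & CHUNK_MASK)] = d;
    }

    void append(double d) {
        int c = (int) (length >>> CHUNK_BITS);
        if (c == chunks.length) chunks = Arrays.copyOf(chunks, chunks.length * 2);
        if (chunks[c] == null) chunks[c] = new double[CHUNK_SIZE];
        chunks[c][(int) (length & CHUNK_MASK)] = d;
        length++;
    }

    private void checkIndex(long idx) {
        if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("index " + idx);
    }
}
```

Each element costs 8 bytes inside a flat array, with no per-element object header or pointer, and the `long` index means the length is not capped by what a single Java array can hold.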
Tree size
Bin size
Recursively divide till Data → Cache
4 Cache-Oblivious
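"Recursively divide till the data fits in cache" can be illustrated with a classic cache-oblivious matrix transpose. This example is not from the talk, which applies the idea to tree and bin sizes; the class name and the `BASE` cutoff are assumptions. The recursion keeps halving the larger dimension, so at some depth the sub-block fits in each cache level without the code ever knowing the cache sizes.

```java
// Hypothetical sketch: cache-oblivious out-of-place transpose of a row-major matrix.
final class CacheObliviousTranspose {
    // Small cutoff that only limits recursion overhead; it is not tuned to any cache size.
    private static final int BASE = 16;

    // Writes the transpose of src (rows x cols) into dst (cols x rows), both row-major.
    static void transpose(double[] src, double[] dst, int rows, int cols) {
        recurse(src, dst, rows, cols, 0, rows, 0, cols);
    }

    private static void recurse(double[] src, double[] dst, int rows, int cols,
                                int r0, int r1, int c0, int c1) {
        int dr = r1 - r0, dc = c1 - c0;
        if (dr <= BASE && dc <= BASE) {
            // Base case: the block is small, copy it directly.
            for (int r = r0; r < r1; r++)
                for (int c = c0; c < c1; c++)
                    dst[c * rows + r] = src[r * cols + c];
        } else if (dr >= dc) {
            // Split the longer dimension (rows) in half.
            int rm = r0 + dr / 2;
            recurse(src, dst, rows, cols, r0, rm, c0, c1);
            recurse(src, dst, rows, cols, rm, r1, c0, c1);
        } else {
            // Split the longer dimension (columns) in half.
            int cm = c0 + dc / 2;
            recurse(src, dst, rows, cols, r0, r1, c0, cm);
            recurse(src, dst, rows, cols, r0, r1, cm, c1);
        }
    }
}
```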
User-mode reliability
S3 readers will TCP reset
Mux your connections
Not all toolkits are equal. >> JetS3
3 EC2 – Nothing is bounded
Non-Blocking Data Structures.
2 No Locks, No Cry
// VOLATILE READ before key compare.
// CAS
private final boolean CAS_kvs( final Object[] oldkvs, final Object[] newkvs ) {
  return _unsafe.compareAndSwapObject(this, _kvs_offset, oldkvs, newkvs );
}
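The same volatile-read-then-CAS pattern is available through the standard `java.util.concurrent.atomic` classes, without touching `Unsafe`. The sketch below is a textbook lock-free stack, not code from H2O; the class name `LockFreeStack` is made up for illustration. Each operation reads the current head, builds the new state, and retries if another thread changed the head in the meantime.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a lock-free stack built on compare-and-set.
final class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead, newHead;
        do {
            oldHead = head.get();                        // volatile read
            newHead = new Node<>(value, oldHead);
        } while (!head.compareAndSet(oldHead, newHead)); // retry if another thread won
    }

    // Returns the most recently pushed value, or null if the stack is empty.
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();                        // volatile read
            if (oldHead == null) return null;
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }
}
```

No thread ever blocks: a failed CAS means some other thread made progress, and the loser simply retries against the new head.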
byte[ ]. roll-your-own. fast.
1 endian wars ended! Keep-It-Simple-Serialization.
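public AutoBuffer putA1( byte[] ary, int sofar, int length ) {
  while( sofar < length ) {
    // Copy as much as fits in the remaining buffer space
    int len = Math.min(length - sofar, _bb.remaining());
    _bb.put(ary, sofar, len);
    sofar += len;
    // Buffer is full but bytes remain: ship what we have and keep going
    if( sofar < length ) sendPartial();
  }
  return this;
}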
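A rough sketch of "keep it simple" serialization over a plain `byte[]`, using only `java.nio.ByteBuffer`. The class `SimpleSerde` and its wire format (a 4-byte count followed by raw 8-byte doubles) are assumptions for illustration and are not H2O's `AutoBuffer` format. The byte order is fixed once, on both the writing and reading side, which is one way to read "endian wars ended".

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical roll-your-own serializer for a double[].
final class SimpleSerde {
    // One byte order used everywhere; little-endian is an arbitrary choice for this sketch.
    private static final ByteOrder ORDER = ByteOrder.LITTLE_ENDIAN;

    static byte[] write(double[] values) {
        ByteBuffer bb = ByteBuffer.allocate(4 + 8 * values.length).order(ORDER);
        bb.putInt(values.length);                 // length prefix
        for (double v : values) bb.putDouble(v);  // raw 8-byte payloads
        return bb.array();
    }

    static double[] read(byte[] bytes) {
        ByteBuffer bb = ByteBuffer.wrap(bytes).order(ORDER);
        int n = bb.getInt();
        double[] out = new double[n];
        for (int i = 0; i < n; i++) out[i] = bb.getDouble();
        return out;
    }
}
```

There is no reflection and no per-object metadata on the wire, so the cost of moving an array is close to the cost of copying its bytes.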
Data Movement is a Defect. Slowing down helps communication.
Got Speed?
Accuracy rules over speed. Predictive Performance
0 Math always produces a number
Data presentation bias. Sorted data => interesting results
1 Shuffle
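One simple guard against sorted input is to process rows in a shuffled order. The sketch below builds a Fisher-Yates permutation of row indices with a fixed seed so runs stay reproducible; the class name `RowShuffle` is hypothetical.

```java
import java.util.Random;

// Hypothetical sketch: a seeded Fisher-Yates shuffle of row indices 0..n-1.
final class RowShuffle {
    static int[] shuffledIndices(int n, long seed) {
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Random rnd = new Random(seed);
        // Walk backwards, swapping each slot with a random slot at or before it.
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int t = idx[i];
            idx[i] = idx[j];
            idx[j] = t;
        }
        return idx;
    }
}
```

Shuffling indices rather than the rows themselves keeps the data in place, in line with the earlier point that data movement is a defect.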
2 Random acts of Kindness?
3 Convex Problems: ADMM
Matrix operations: JAMA, jblas... all single node.
Distributed version needs data transfer!
4 Amdahl strikes: Cholesky / QR Decomposition
Embarrassingly parallel
Binning
Tree-building
Splits
5 Random Forests
Iterate & stage weak learners => strong learners
Each tree can be parallel
Minimize communication
6 Boosting
Embarrassingly parallel
Pre-calculate base stats
Distance calculation
Weight matrices – small footprint
7 Neural Nets & Clustering
Daisy chain a bunch of models
Interleave.
JIT – Minimize loops over data.
8 Ensembles
Deterministic versions first! Got Pen & Paper? Optimize often. Test Big Data soon.
9 Tools
Replacing NAs improves predictive performance by about 10%.
- Newton
Munging Missing Features
• impute NAs with mean
• impute NAs with kNN
• impute with recursive PCA
- Boyd
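The simplest of the imputation options above, replacing NAs with the column mean, might look like the following. This is a sketch under assumptions: NaN is used to mark NA, and the class name `MeanImpute` is made up for the example.

```java
// Hypothetical sketch: mean imputation on one column, with NaN marking NA.
final class MeanImpute {
    // Replaces every NaN in col with the mean of the non-NaN values.
    // Returns the mean that was used, or NaN if the whole column is missing.
    static double imputeInPlace(double[] col) {
        double sum = 0;
        long n = 0;
        for (double d : col) {
            if (!Double.isNaN(d)) {
                sum += d;
                n++;
            }
        }
        if (n == 0) return Double.NaN; // nothing to impute from
        double mean = sum / n;
        for (int i = 0; i < col.length; i++) {
            if (Double.isNaN(col[i])) col[i] = mean;
        }
        return mean;
    }
}
```

The first pass is a per-chunk sum and count, so it fits the map/reduce pattern; only two numbers per chunk need to travel.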
Unbalanced data, single rare class: Fraud / No-Fraud
Stratify
Unbalanced data, multiple rare classes: Browse, Click, Purchase
Stratify
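A minimal sketch of stratified sampling, assuming integer class labels and one sampling fraction in (0, 1] shared by all classes. The class `StratifiedSample` is hypothetical. Rows are sampled within each class separately, and rounding up guarantees that even a very rare class contributes at least one row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch: per-class (stratified) row sampling.
final class StratifiedSample {
    // Returns row indices, taking ceil(fraction * classSize) rows from each class.
    static List<Integer> sample(int[] labels, double fraction, long seed) {
        // Group row indices by class label.
        Map<Integer, List<Integer>> byClass = new TreeMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
        }
        Random rnd = new Random(seed);
        List<Integer> picked = new ArrayList<>();
        for (List<Integer> rows : byClass.values()) {
            Collections.shuffle(rows, rnd);                    // random order within the class
            int take = (int) Math.ceil(fraction * rows.size());
            picked.addAll(rows.subList(0, Math.min(take, rows.size())));
        }
        return picked;
    }
}
```

With a plain random sample, a class like fraud could vanish from a small sample entirely; sampling per class avoids that. Using a different fraction per class, for example to down-sample only the majority, would be a natural variation.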
Use Customer Data
Algorithms for Sparse vs. Dense
Unbalanced Data.
Robustness under noise
10 Data is the System
Volume: HDFS
HIVE/SQL
Data Scientist
Munging slice n dice Features
Classification Regression Clustering Optimal Model
Engineer
Velocity: Events Online Scoring
Exploration
Modeling
Offline Scoring
Business Analyst
Ensemble models Low latency
Applications
Predictions
Rule Engine
Before H2O
Big Data, Exploration, Modeling, Scoring, Real-time
Big Data beats Better Algorithms!
Big Data, Exploration, Modeling, Scoring, Real-time
Big Data and Better Algorithms! Scale & Parallelism!
H2O – The Open Source Math Engine!
Better Predictions!
0xdata.com
Distributed Coding Taxonomy
• No Distribution Coding:
  • Whole Algorithms, Whole Vector-Math
  • REST + JSON: e.g. load data, GLM, get results
• Simple Data-Parallel Coding:
  • Per-Row (or neighbor row) Math
  • Map/Reduce-style: e.g. Any dense linear algebra
• Complex Data-Parallel Coding:
  • K/V Store, Graph Algo's, e.g. PageRank
Read the docs!
This talk!
Join our GIT!
Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a Frame
Use cases
Conversion, Retention & Churn
• Lead Conversion
• Engagement
• Product Placement
• Recommendations
Pricing Engine
Fraud Detection