Scaling Up Practical Learning Algorithms (lecture by Mikhail Bilenko)
TRANSCRIPT
Scaling Up Practical Learning Algorithms
Misha Bilenko
ALMADA Summer School, Moscow 2013
Preliminaries: ML in Four Slides
• ML: mapping observations to predictions that minimize error
• Predictor: f : X → Y, f ∈ F
  – X: observations x, each consisting of d features
    • Numbers, strings, attributes – typically mapped to vectors
  – Y: predictions (labels); assume a true y for a given x
    • Binary, numeric, ordinal, structured…
  – ℓ(y, f(x)): loss function that quantifies error
    • 0-1 loss: 1[y ≠ f(x)];  L1: |y − f(x)|;  L2: (y − f(x))²
  – F: the function class from which a predictor is learned
    • E.g., “linear model (feature weights)” or “1000 decision trees with 32 leaves each”
• Supervised learning: training set {(xᵢ, yᵢ)}, i = 1…n
• Regularization: prevents overfitting to the training set: minimize Σᵢ ℓ(yᵢ, f(xᵢ)) + λR(f)
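The three loss functions above can be written out directly; this is a minimal illustration (function names and example values are mine, not from the lecture):

```python
def zero_one_loss(y, y_hat):
    """0-1 loss: 1 if the prediction is wrong, else 0."""
    return 0.0 if y == y_hat else 1.0

def l1_loss(y, y_hat):
    """L1 (absolute) loss: |y - y_hat|."""
    return abs(y - y_hat)

def l2_loss(y, y_hat):
    """L2 (squared) loss: (y - y_hat)^2."""
    return (y - y_hat) ** 2

print(zero_one_loss(1, -1))  # 1.0
print(l1_loss(3.0, 2.5))     # 0.5
print(l2_loss(3.0, 2.5))     # 0.25
```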
ML Examples
• Email spam
  – x: header/body words, client- and server-side statistics (sender, recipient, etc.)
  – y: binary label (spam vs. not spam)
  – ℓ: cost-sensitive: false positives (good mail in Junk) vs. false negatives (Inbox spam)
• Click prediction (ads, search results, recommendations, …)
  – x: attributes of context (e.g., query), item (e.g., ad text), user (e.g., location), …
  – y: probability of a click
  – ℓ: …
Key Prediction Models
• Important function classes
  – Linear predictors (logistic regression, linear SVMs, …)
    • f is a hyperplane (feature weights): f(x) = w · x
  – Tree ensembles (boosting, random forests)
    • f is the set of each tree’s splits and leaf outputs
  – Non-linear parametric predictors (neural nets, Bayes nets)
    • f is sets of parameters (weights for hidden units, distribution parameters)
  – Non-parametric predictors (k-NN, kernel SVMs, Gaussian Processes)
    • f is a “remembered” subset of training examples with corresponding parameters
Learning: Training Predictors
• Two key algorithm patterns
  – Iteratively updating f to reduce training loss
    • Gradient descent (stochastic, coordinate/sub-gradient, quasi-Newton, …)
    • Boosting: each subsequent ensemble member reduces error (functional GD)
    • Active-subset SVM training: iterative improvement over support vectors
  – Averaging multiple models to reduce variance
    • Bagging/random forests: models learned on subsets of data/features
    • Parameter mixtures: explicitly average weights from different data subsets
Big Learning: Large Datasets.. and Beyond
• Large training sets: many examples iff accuracy is improved• Large models: many features, ensembles, “deep” nets• Model selection: hyper-parameter tuning, statistical significance• Fast inference: structured prediction (e.g., speech)
• Fundamental differences across settings– Learning vs. inference, input complexity vs. model complexity– Dataflow/computation and bottlenecks are highly algorithm- and task-specific– Rest of this talk: practical algorithm nuggets for (1), (2)
Dealing with Large Training Sets (I): SGD
• Online learning: Stochastic Gradient Descent
  – “averaged perceptron”, “Pegasos”, etc.
• For each example (xₜ, yₜ): w ← w − η∇ℓ(yₜ, w · xₜ)
  – E.g., with hinge loss: w ← w + η yₜxₜ if yₜ(w · xₜ) < 1
  – i.e., update if there is a (margin) error, no update otherwise.
• Algorithm is fundamentally iterative: no “clean” parallelization
  – Literature: mini-batches, averaging, async updates, …
• …but why parallelize the updates? The algorithm runs at disk I/O speed!
  – Parallelize I/O instead. For truly enormous datasets, average parameters/models.
• RCV1: 1M documents (~1 GB): <12 s on this laptop!
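The hinge-loss SGD update above fits in a few lines; here is a minimal pure-Python sketch (function name, step size, and toy data are illustrative, not from the lecture):

```python
import random

random.seed(0)

def sgd_hinge(data, epochs=5, eta=0.1):
    """SGD for a linear classifier with hinge loss.
    data: list of (x, y) with x a feature list and y in {-1, +1}.
    Updates only on margin errors (y * <w, x> < 1), as on the slide."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1:                      # error: take a gradient step
                for i in range(dim):
                    w[i] += eta * y * x[i]
            # otherwise: no update
    return w

# Toy linearly separable data: positive iff first feature exceeds second.
data = [([2.0, 0.0], 1), ([0.0, 2.0], -1), ([3.0, 1.0], 1), ([1.0, 3.0], -1)]
w = sgd_hinge(list(data))
print(all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in data))  # True
```

The single sequential pass is the whole algorithm, which is why the bottleneck in practice is reading the data, not the arithmetic.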
Dealing with Large Training Sets (II): L-BFGS
• Regularized logistic regression: minimize Σᵢ log(1 + exp(−yᵢ w · xᵢ)) + λ‖w‖²
• L-BFGS: batch quasi-Newton method using a quadratic approximation
  – Update: w ← w − η H⁻¹∇L(w), where H approximates the Hessian
  – Limited-memory trick: keep a buffer of recent (Δw, Δ∇L) pairs to approximate H⁻¹
• Parallelizes well on multi-core
  – Each core takes a batch of examples and computes a partial gradient
• Multi-node
  – Poor fit for MapReduce: global weight update, (Δw, Δ∇L) history, many iterations
  – Alternative: ADMM (first-order, but a better convergence rate than SGD’s)
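The limited-memory trick can be sketched compactly: keep the last m (Δw, Δ∇L) pairs and use the standard two-loop recursion to apply an approximate H⁻¹ to the gradient. This is an illustrative skeleton (fixed step length instead of a proper line search; all names and the toy data are mine, not from the lecture):

```python
import numpy as np

def logistic_loss_grad(w, X, y, lam=0.1):
    """L2-regularized logistic loss and gradient; y in {-1, +1}."""
    z = y * (X @ w)
    loss = np.mean(np.log1p(np.exp(-z))) + 0.5 * lam * (w @ w)
    grad = -(X.T @ (y / (1.0 + np.exp(z)))) / len(y) + lam * w
    return loss, grad

def lbfgs_fit(X, y, m=5, iters=50):
    """Minimal L-BFGS: two-loop recursion over a buffer of m (s, t) pairs."""
    w = np.zeros(X.shape[1])
    history = []                              # recent (s_k, t_k, rho_k) triples
    loss, g = logistic_loss_grad(w, X, y)
    for _ in range(iters):
        q = g.copy()
        alphas = []
        for s, t, rho in reversed(history):   # first loop: newest to oldest
            a = rho * (s @ q)
            alphas.append(a)
            q -= a * t
        if history:
            s, t, _ = history[-1]
            q *= (s @ t) / (t @ t)            # initial inverse-Hessian scaling
        for (s, t, rho), a in zip(history, reversed(alphas)):  # second loop
            b = rho * (t @ q)
            q += (a - b) * s
        w_new = w - q                         # unit step for brevity
        loss_new, g_new = logistic_loss_grad(w_new, X, y)
        s_k, t_k = w_new - w, g_new - g
        if s_k @ t_k > 1e-10:                 # curvature condition
            history.append((s_k, t_k, 1.0 / (s_k @ t_k)))
            history = history[-m:]            # limited memory: keep only m pairs
        w, g, loss = w_new, g_new, loss_new
    return w, loss

X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, loss = lbfgs_fit(X, y)
print(loss < 0.5)   # well below the initial loss of log(2) ~ 0.693
```

The buffer is what makes the method multi-node-unfriendly: every iteration needs the full (s, t) history and a global weight update.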
Dealing with Large Datasets (II): Trees
• Rule-based prediction is natural and powerful (non-linear)
  – Play outside: if no rain and not too hot, or if snowing but not windy.
• Trees hierarchically encode rule-based prediction
  – Nodes test features and split
  – Leaves produce predictions
  – Regression trees: numeric outputs
• Ensembles combine tree predictions
[Figure: example decision trees for the “play outside” rule – internal nodes test thresholds on Precipitation, Temp, and Wind; a classification tree has +/− leaves, while regression trees have numeric leaf outputs (e.g., 0.05, 0.2, 0.6); the ensemble sums the individual trees’ predictions.]
Tree Ensemble Zoo
• Different models can define different types of:
  – Combiner function: voting vs. weighting
  – Leaf prediction models: constant vs. regression
  – Split conditions: single vs. multiple features
• Examples (small biased sample; some are not tree-specific)
  – Boosting: AdaBoost, LogitBoost, GBM/MART, BrownBoost, Transform Regression
  – Random Forests: Random Subspaces, Bagging, Additive Groves, BagBoo
  – Beyond regression and binary classification: RankBoost, abc-mart, GBRank, LambdaMART, MatrixNet
Tree Ensembles Are Rightfully Popular
• State-of-the-art accuracy: web, vision, CRM, bio, …
• Efficient at prediction time
  – Multithreaded evaluation of individual trees; optimize/short-circuit
• Principled: extensively studied in statistics and learning theory
• Practical
  – Naturally handle mixed, missing, (un)transformed data
  – Feature selection embedded in the algorithm
  – Well-understood parameter sweeps
  – Scalable to extremely large datasets: the rest of this section
Naturally Parallel Tree Ensembles
• No interaction when learning individual trees
  – Bagging: each tree trained on a bootstrap sample of the data
  – Random forests: bootstrap, plus subsample the features at each split
  – For large datasets, local data replaces the bootstrap → embarrassingly parallel
[Figure: bagging tree construction – each worker calls Train() independently on its own sample; random forest tree construction – each Split() considers a random subset of features.]
Boosting: Iterative Tree Construction
“Best off-the-shelf classifier in the world” – Breiman
• Numerically: gradient descent in function space
  – Each subsequent tree approximates a step in the negative-gradient direction
  – Recompute target labels (residuals): rᵢ = −∂ℓ(yᵢ, F(xᵢ))/∂F(xᵢ)
  – Logistic loss: rᵢ = yᵢ − p(xᵢ);  squared loss: rᵢ = yᵢ − F(xᵢ)
• Reweight examples for each subsequent tree to focus on errors
[Figure: trees built in sequence – Train(), Train(), … each fit to the updated targets.]
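The functional-gradient view above is easy to demonstrate with squared loss, where each tree (here a single-split stump) is simply fit to the current residuals. A minimal pure-Python sketch (all names, the learning rate, and the toy data are illustrative):

```python
def fit_stump(xs, rs):
    """Best single-feature threshold stump minimizing squared error on residuals."""
    best = None
    for j in range(len(xs[0])):                  # every feature
        for t in sorted(set(x[j] for x in xs)):  # every candidate split point
            left = [r for x, r in zip(xs, rs) if x[j] <= t]
            right = [r for x, r in zip(xs, rs) if x[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    _, j, t, lm, rm = best
    return lambda x: lm if x[j] <= t else rm

def boost(xs, ys, rounds=10, lr=0.5):
    """Gradient boosting with squared loss: each stump fits r_i = y_i - F(x_i)."""
    ensemble, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # recompute targets
        stump = fit_stump(xs, residuals)
        ensemble.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in ensemble)

xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0.0, 0.0, 1.0, 1.0]
F = boost(xs, ys)
print(round(F([0.0]), 2), round(F([3.0]), 2))   # 0.0 1.0
```

The shrunken step (lr < 1) is the discrete analogue of a small gradient-descent step in function space.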
Efficient Tree Construction
• Boosting is iterative: scaling up = parallelizing tree construction
• For every node: pick the best feature to split
  – For every feature: pick the best split-point
    • For every potential split-point: compute the gain
      – For every example in the current node, add its gain contribution for the given split
• Key efficiency: limiting and ordering the set of considered split points
  – Continuous features: discretize into bins; splits = bin boundaries
  – Allows computing split values in a single pass over the data
Binned Split Evaluation
• Each feature’s range is split into bins; per-bin statistics are aggregated in a single pass
• For each tree node, a two-stage procedure:
  (1) Pass through the dataset, aggregating node-feature-bin statistics
  (2) Select the split among all (feature, bin) options
[Figure: per-feature bin table – statistics accumulated per bin, totals computed, and split gain evaluated at each bin boundary.]
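The two stages can be sketched for a regression node: one pass accumulates per-(feature, bin) counts and label sums; gains are then computed from those totals alone. An illustrative sketch (the equal-width binning, gain formula for squared error with constant leaves, and all names are assumptions, not from the lecture):

```python
def best_binned_split(xs, ys, n_bins=4):
    n_features = len(xs[0])
    # Stage 1: single pass over the data, aggregating per-bin count and label sum.
    counts = [[0] * n_bins for _ in range(n_features)]
    sums = [[0.0] * n_bins for _ in range(n_features)]
    lo = [min(x[j] for x in xs) for j in range(n_features)]
    hi = [max(x[j] for x in xs) for j in range(n_features)]
    for x, y in zip(xs, ys):
        for j in range(n_features):
            span = hi[j] - lo[j] or 1.0
            b = min(int((x[j] - lo[j]) / span * n_bins), n_bins - 1)
            counts[j][b] += 1
            sums[j][b] += y
    # Stage 2: scan bin boundaries; gain = squared-error reduction, computable
    # from (count, sum) prefix totals only -- no second pass over the data.
    total_n, total_s = len(ys), sum(ys)
    best = (0.0, None, None)
    for j in range(n_features):
        ln = ls = 0.0
        for b in range(n_bins - 1):              # candidate split after bin b
            ln += counts[j][b]
            ls += sums[j][b]
            rn, rs = total_n - ln, total_s - ls
            if ln == 0 or rn == 0:
                continue
            gain = ls ** 2 / ln + rs ** 2 / rn - total_s ** 2 / total_n
            if gain > best[0]:
                best = (gain, j, b)
    return best   # (gain, feature index, boundary-bin index)

print(best_binned_split([[0.0], [1.0], [2.0], [3.0]], [0.0, 0.0, 1.0, 1.0]))
# → (1.0, 0, 1): split feature 0 after bin 1, separating the 0s from the 1s
```

Because stage 2 touches only the bin tables, its cost is independent of the number of examples – the property that the distributed variants below exploit.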
Tree Construction Visualized
• Observation 1: a single pass is sufficient per tree level
• Observation 2: the data pass can iterate by-instance or by-feature
  – Supports horizontally or vertically partitioned data
[Figure: the instances × features matrix traversed either row-wise (by instance) or column-wise (by feature), feeding the same per-feature bin histograms.]
Data-Distributed Tree Construction
• Master
  1. Send workers the current model and the set of nodes to expand
  2. Wait to receive local split histograms from workers
  3. Aggregate local split histograms, select the best split for every node
• Worker
  2a. Pass through local data, aggregating split histograms
  2b. Send completed local histograms to the master
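The master’s aggregation step is just an elementwise sum of the workers’ local histograms. A toy simulation of the exchange (single feature, fixed bin edges, and all helper names are hypothetical):

```python
BIN_EDGES = [1.0, 2.0, 3.0]          # 4 bins: (-inf,1], (1,2], (2,3], (3,inf)

def local_histogram(shard):
    """Worker step 2a: aggregate (count, label-sum) per bin over a local shard."""
    counts, sums = [0] * 4, [0.0] * 4
    for x, y in shard:
        b = sum(x > e for e in BIN_EDGES)   # bin index from the fixed edges
        counts[b] += 1
        sums[b] += y
    return counts, sums

def merge(histograms):
    """Master step 3 (first half): elementwise sum of the local histograms."""
    counts = [sum(h[0][b] for h in histograms) for b in range(4)]
    sums = [sum(h[1][b] for h in histograms) for b in range(4)]
    return counts, sums

shard_a = [(0.5, 0.0), (1.5, 0.0)]   # worker A's horizontal partition
shard_b = [(2.5, 1.0), (3.5, 1.0)]   # worker B's horizontal partition
counts, sums = merge([local_histogram(shard_a), local_histogram(shard_b)])
print(counts, sums)   # [1, 1, 1, 1] [0.0, 0.0, 1.0, 1.0]
```

Only fixed-size histograms cross the network, never the examples – that is what makes the scheme scale with dataset size.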
Feature-Distributed Tree Construction
• Workers maintain a per-instance index of current residuals and previous splits
• Master
  1. Request workers to expand a set of nodes
  2. Wait to receive the best per-feature splits from workers
  3. Select the best feature-split for every node
  4. Request the best splits’ workers to broadcast per-instance assignments and residuals
• Worker
  2a. Pass through all instances for local features, aggregating split histograms for each node
  2b. Select local features’ best splits for each node, send to the master
Learning with Many Features
• How many is “many”? At least billions.
• Exhibit A: English n-grams
  – Unigrams: 13 million
  – Bigrams: 315 million
  – Trigrams: 977 million
  – Fourgrams: 1.3 billion
  – Fivegrams: 1.2 billion
• Exhibit B: search ads, 3 months
  – User IDs: hundreds of millions
  – Listing IDs: hundreds of millions
  – Queries: tens to hundreds of millions
  – User × Listing × Query: billions
• Can we scale up linear learners? Yes, but there are limits:
  – Retraining: ideally real-time, definitely not more than a couple of hours
  – Modularity: ideally fit in memory, definitely decompose elastically

Towards Infinite Features: Feature Hashing
• Basic idea: an elastic, medium-dimensional projection
• Classic low-dimensional projections: costly in storage and computation, updates hard
• Solution: the mapping is defined by a hashing function
  + Effortless elasticity, sparsity preserved
  − Compression is random (not driven by error reduction)
[Weinberger et al. 2009]; trick first described in [Moody 1989].
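A minimal sketch of the hashing trick: each feature string is mapped to an index in a fixed-size weight vector by a hash function, with a second hash choosing a sign so that collisions cancel in expectation (as in Weinberger et al. 2009). The function name, dimension, and tokens are illustrative; `hashlib` is used only to get a stable hash:

```python
import hashlib

def hash_features(tokens, m=2**20):
    """Map a token list to a sparse {index: value} vector of dimension m."""
    vec = {}
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % m                                   # index hash
        sign = 1.0 if (h >> 64) % 2 == 0 else -1.0    # sign hash
        vec[idx] = vec.get(idx, 0.0) + sign
    return vec

v = hash_features(["buy", "cheap", "meds", "buy"])
print(len(v) <= 3)   # repeated "buy" tokens land on the same index by design
```

New features need no dictionary update – any string hashes to a valid index – which is the “effortless elasticity” noted above.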
Scaling up ML: Concluding Thoughts
• Learner parallelization is highly algorithm-dependent
• High-level parallelization (MapReduce)
  – Less work, but there is a convenience penalty
  – Limits on communication and control can be algorithm-killing
• Low-level parallelization (multicore, GPUs, …)
  – Harder to implement/debug
  – Successes are architecture-vs-algorithm specific: e.g., GPUs are great if matrix multiplication is the core operation (NNs)
  – Typical trade-off: memory/I/O latency/contention vs. update complexity