Technical Tricks of Vowpal Wabbit
DESCRIPTION
Guest lecture by John Langford on scalable machine learning for Data-Driven Modeling 2012.
TRANSCRIPT
Technical Tricks of Vowpal Wabbit
http://hunch.net/~vw/
John Langford, Columbia, Data-Driven Modeling, April 16
git clone git://github.com/JohnLangford/vowpal_wabbit.git
Goals of the VW project
1 State of the art in scalable, fast, efficient Machine Learning. VW is (by far) the most scalable public linear learner, and plausibly the most scalable anywhere.
2 Support research into new ML algorithms. ML researchers can deploy new algorithms on an efficient platform efficiently. BSD open source.
3 Simplicity. No strange dependencies, currently only 9437 lines of code.
4 It just works. A package in debian & R. Otherwise, users just type "make", and get a working system. At least a half-dozen companies use VW.
Demonstration
vw -c rcv1.train.vw.gz --exact_adaptive_norm --power_t 1 -l 0.5
The basic learning algorithm
Learn w such that fw(x) = w · x predicts well.
1 Online learning with strong defaults.
2 Every input source but library.
3 Every output sink but library.
4 In-core feature manipulation for ngrams, outer products, etc. Custom is easy.
5 Debugging with readable models & audit mode.
6 Different loss functions: squared, logistic, ...
7 ℓ1 and ℓ2 regularization.
8 Compatible LBFGS-based batch-mode optimization.
9 Cluster parallel.
10 Daemon deployable.
The tricks
Basic VW           Newer Algorithmics   Parallel Stuff
Feature Caching    Adaptive Learning    Parameter Averaging
Feature Hashing    Importance Updates   Nonuniform Average
Online Learning    Dim. Correction      Gradient Summing
Implicit Features  L-BFGS               Hadoop AllReduce
                   Hybrid Learning
We'll discuss Basic VW and algorithmics, then Parallel.
Feature Caching
Compare: time vw rcv1.train.vw.gz --exact_adaptive_norm --power_t 1
Feature Hashing
[Figure: Conventional approach keeps a string -> index dictionary in RAM, then looks up weights; VW hashes the string straight to a weight index.]
Most algorithms use a hashmap to change a word into an index for a weight. VW uses a hash function which takes almost no RAM, is 10x faster, and is easily parallelized.
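The hashing trick above can be sketched in a few lines. This is a minimal illustration, not VW's actual implementation (VW uses murmurhash and a `-b`-bit weight table); `md5` here is just a stand-in hash.

```python
# Minimal sketch of the hashing trick: map a feature string straight to a weight
# index, with no string -> index dictionary held in RAM.
import hashlib

BITS = 18                    # 2^18 weights, mirroring vw's default table size
NUM_WEIGHTS = 1 << BITS

def feature_index(feature: str) -> int:
    """Hash a feature name directly to a weight index."""
    h = int(hashlib.md5(feature.encode()).hexdigest(), 16)
    return h % NUM_WEIGHTS

weights = [0.0] * NUM_WEIGHTS

def dot(features):
    """Sparse dot product over (name, value) pairs."""
    return sum(weights[feature_index(name)] * value for name, value in features)
```

Because the index is a pure function of the string, every parallel worker computes the same index with no shared dictionary to synchronize.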
The spam example [WALS09]
1 3.2 × 10^6 labeled emails.
2 433167 users.
3 ~40 × 10^6 unique features.
How do we construct a spam filter which is personalized, yet uses global information?
Answer: Use hashing to predict according to: ⟨w, φ(x)⟩ + ⟨w, φu(x)⟩
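The personalization trick can be sketched as follows, assuming a single shared weight table: hash each token twice, once globally and once prefixed by the user id, so one dot product computes ⟨w, φ(x)⟩ + ⟨w, φu(x)⟩. (Illustrative only; `md5` stands in for VW's hash.)

```python
# Sketch of hashed global + per-user features for the personalized spam filter.
import hashlib

NUM_WEIGHTS = 1 << 22

def feature_index(name: str) -> int:
    """Hash a feature name to a weight index (stand-in for VW's hash)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_WEIGHTS

def hashed_features(user_id: str, tokens):
    """Each token appears twice: once globally and once prefixed by the user id,
    so a single dot product gives <w, phi(x)> + <w, phi_u(x)>."""
    global_part = [feature_index(t) for t in tokens]
    personal_part = [feature_index(user_id + "_" + t) for t in tokens]
    return global_part + personal_part
```

Rare users fall back on the global features; heavy users accumulate signal in their personalized copies.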
Results
[Figure: spam-filtering results; baseline = global-only predictor.]
Basic Online Learning
Start with ∀i: wi = 0. Repeatedly:
1 Get example x ∈ (−∞, ∞)*.
2 Make prediction ŷ = Σi wi xi, clipped to the interval [0, 1].
3 Learn truth y ∈ [0, 1] with importance I, or go to (1).
4 Update wi ← wi + 2η(y − ŷ)I xi and go to (1).
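The four steps above can be sketched as a runnable loop (squared loss, fixed learning rate η; the clipping and importance weight I follow the slide):

```python
# A minimal sketch of the basic online learning loop from the slide.
def train(examples, num_features, eta=0.1):
    """examples: iterable of (x, y, I) with x a list of (index, value) pairs."""
    w = [0.0] * num_features
    for x, y, imp in examples:
        y_hat = sum(w[i] * v for i, v in x)        # step 2: predict
        y_hat = min(1.0, max(0.0, y_hat))          # clipped to [0, 1]
        for i, v in x:                             # step 4: update
            w[i] += 2 * eta * (y - y_hat) * imp * v
    return w
```

Only the current example is ever in memory, which is exactly the RAM-efficiency argument made below.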
Reasons for Online Learning
1 Fast convergence to a good predictor.
2 It's RAM efficient. You need store only one example in RAM rather than all of them. ⇒ Entirely new scales of data are possible.
3 Online Learning algorithm = Online Optimization Algorithm. Online Learning Algorithms ⇒ the ability to solve entirely new categories of applications.
4 Online Learning = ability to deal with drifting distributions.
Implicit Outer Product
Sometimes you care about the interaction of two sets of features (ad features x query features, news features x user features, etc.). Choices:
1 Expand the set of features explicitly, consuming n^2 disk space.
2 Expand the features dynamically in the core of your learning algorithm.
Option (2) is 10x faster. You need to be comfortable with hashes first.
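Option (2) can be sketched with the hashing trick from earlier: generate each cross-feature index on the fly instead of materializing the n^2 expansion on disk. (Illustrative only; `md5` and the `"^"` pair separator are stand-ins, not VW's internals.)

```python
# Sketch of dynamic outer-product feature expansion via hashing.
import hashlib

NUM_WEIGHTS = 1 << 20

def feature_index(name: str) -> int:
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_WEIGHTS

def quadratic(ad_features, query_features):
    """Yield (index, value) for every ad x query feature pair, hashed on the fly."""
    for a_name, a_val in ad_features:
        for q_name, q_val in query_features:
            yield feature_index(a_name + "^" + q_name), a_val * q_val
```

The cross features exist only transiently inside the learning loop; nothing is written out.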
The tricks
Basic VW           Newer Algorithmics   Parallel Stuff
Feature Caching    Adaptive Learning    Parameter Averaging
Feature Hashing    Importance Updates   Nonuniform Average
Online Learning    Dim. Correction      Gradient Summing
Implicit Features  L-BFGS               Hadoop AllReduce
                   Hybrid Learning
Next: algorithmics.
Adaptive Learning [DHS10,MS10]
For example t, let git = 2(ŷ − y)xit.
New update rule: wi ← wi − η gi,t+1 / √(Σt′=1..t git′²)
Common features stabilize quickly. Rare features can have large updates.
Learning with importance weights [KL11]
[Figure: a sequence of number-line diagrams. A standard update moves the prediction w_t·x toward the label y by −η(∇ℓ)ᵀx. With importance weight 6, the naive 6x-scaled step −6η(∇ℓ)ᵀx can overshoot y, leaving w_{t+1}·x on the wrong side ("??"). The importance-aware update instead uses a closed-form scaling s(h)·||x||², equivalent to h infinitesimal updates, so w_{t+1}·x approaches but never crosses y.]
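The closed form can be sketched for squared loss, following the [KL11] idea of treating an importance weight h as the limit of h infinitesimally small gradient steps. Under that model the prediction decays exponentially toward the label; the exact constants below are derived from this document's squared-loss update and may differ from the paper's table.

```python
# Sketch of an importance-aware update for squared loss (w.x - y)^2.
import math

def importance_aware_update(w, x, y, h, eta=0.1):
    """x: list of (index, value). Updates w in place; returns the new prediction.
    Limit of h/k ordinary gradient steps as k -> infinity."""
    p = sum(w[i] * v for i, v in x)
    x2 = sum(v * v for _, v in x)
    # choose s so the new prediction is y + (p - y) * exp(-2*eta*h*||x||^2):
    s = (y - p) * (1.0 - math.exp(-2.0 * eta * h * x2)) / x2
    for i, v in x:
        w[i] += s * v
    return p + s * x2
```

However large h gets, the prediction approaches y without ever crossing it, which is exactly the safety property the figure illustrates.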
Robust results for unweighted problems
[Figure: four scatter plots comparing test performance of standard (x-axis) vs importance-aware (y-axis) updates: astro - logistic loss, spam - quantile loss, rcv1 - squared loss, webspam - hinge loss.]
Dimensional Correction
Gradient of squared loss = ∂(fw(x) − y)²/∂wi = 2(fw(x) − y)xi, and we change weights in the negative gradient direction:
wi ← wi − η ∂(fw(x) − y)²/∂wi
But the gradient has intrinsic problems: wi naturally has units of 1/xi, since doubling xi implies halving wi to get the same prediction. ⇒ The update rule has mixed units!
A crude fix: divide the update by Σi xi². It helps a lot!
This is scary! The problem optimized is minw Σx,y (fw(x) − y)² / Σi xi² rather than minw Σx,y (fw(x) − y)². But it works.
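The crude fix can be sketched directly: dividing the gradient step by Σi xi² makes the update invariant to rescaling the features, which is the point of the units argument above.

```python
# Sketch of the dimensionally corrected squared-loss update.
def normalized_update(w, x, y, eta=0.5):
    """x: list of (index, value); gradient step divided by sum_i x_i^2."""
    y_hat = sum(w[i] * v for i, v in x)
    x2 = sum(v * v for _, v in x)
    for i, v in x:
        w[i] -= eta * 2 * (y_hat - y) * v / x2
    return y_hat
```

With features scaled by 10 the raw gradient would be 10x larger, but the ||x||² divisor cancels the scale, so the resulting prediction is identical.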
LBFGS [Nocedal80]
Batch(!) second order algorithm. Core idea = efficient approximate Newton step.
H = ∂²(fw(x) − y)²/∂wi∂wj = Hessian.
Newton step: w → w − H⁻¹g.
Newton fails: you can't even represent H. Instead, build up an approximate inverse Hessian from terms of the form Δw Δwᵀ / (Δwᵀ Δg), where Δw is a change in the weights w and Δg is a change in the loss gradient g.
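The way L-BFGS turns those (Δw, Δg) pairs into an approximate H⁻¹g is the classic two-loop recursion, sketched below on plain lists. This is the textbook algorithm, not VW's code; it stores only the last m pairs rather than any matrix.

```python
# Sketch of L-BFGS's two-loop recursion: approximate H^{-1} g from (dw, dg) pairs.
def two_loop(g, pairs):
    """pairs: list of (dw, dg) from oldest to newest; returns approx H^{-1} g."""
    q = list(g)
    alphas = []
    for dw, dg in reversed(pairs):                      # backward pass
        rho = 1.0 / sum(a * b for a, b in zip(dw, dg))
        alpha = rho * sum(a * b for a, b in zip(dw, q))
        alphas.append((alpha, rho, dw, dg))
        q = [qi - alpha * dgi for qi, dgi in zip(q, dg)]
    if pairs:                                           # initial Hessian scaling
        dw, dg = pairs[-1]
        gamma = sum(a * b for a, b in zip(dw, dg)) / sum(b * b for b in dg)
        q = [gamma * qi for qi in q]
    for alpha, rho, dw, dg in reversed(alphas):         # forward pass
        beta = rho * sum(a * b for a, b in zip(dg, q))
        q = [qi + (alpha - beta) * dwi for qi, dwi in zip(q, dw)]
    return q
```

On a quadratic with Hessian diag(2, 20), the exact secant pairs ([1,0],[2,0]) and ([0,1],[0,20]) recover H⁻¹g exactly.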
Hybrid Learning
Online learning is GREAT for getting to a good solution fast. LBFGS is GREAT for getting a perfect solution.
Use Online Learning, then LBFGS.
[Figure: two plots of auPRC vs iteration comparing Online, L-BFGS, L-BFGS w/ 1 online pass, and L-BFGS w/ 5 online passes.]
The tricks
Basic VW           Newer Algorithmics   Parallel Stuff
Feature Caching    Adaptive Learning    Parameter Averaging
Feature Hashing    Importance Updates   Nonuniform Average
Online Learning    Dim. Correction      Gradient Summing
Implicit Features  L-BFGS               Hadoop AllReduce
                   Hybrid Learning
Next: Parallel.
Applying for a fellowship in 1997
Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!
The worst part: he had a point.
Terascale Linear Learning ACDL11
Given 2.1 Terafeatures of data, how can you learn a good linear predictor fw(x) = Σi wi xi?
2.1T sparse features
17B Examples
16M parameters
1K nodes
70 minutes = 500M features/second: faster than the IO bandwidth of a single machine ⇒ we beat all possible single-machine linear learning algorithms.
Compare: Other Supervised Algorithms in Parallel Learning book
[Figure: features/s per method on a log scale (100 to 1e9), parallel vs single machine: RBF-SVM MPI?-500 RCV1, Ensemble Tree MPI-128 Synthetic, RBF-SVM TCP-48 MNIST 220K, Decision Tree MapRed-200 Ad-Bounce, Boosted DT MPI-32 Ranking, Linear Threads-2 RCV1, Linear Hadoop+TCP-1000 Ads.]
MPI-style AllReduce
AllReduce = Reduce + Broadcast.
[Figure: seven nodes holding values 1-7 are arranged into a binary tree. Reducing: values are summed up the tree. Broadcast: the total, 28, is sent back down, so every node ends in the state 28.]
Properties:
1 Easily pipelined so no latency concerns.
2 Bandwidth ≤ 6n.
3 No need to rewrite code!
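The reduce-then-broadcast pattern in the figure can be sketched as a toy single-process simulation: sum values up a binary tree, then hand the root's total back to every node. (Real allreduce runs across machines over sockets; this only shows the data flow.)

```python
# Toy sketch of tree AllReduce: reduce up a binary tree, broadcast the total down.
def allreduce_sum(values):
    """values: one number per node. Returns the per-node results (all equal)."""
    n = len(values)
    totals = list(values)
    # reduce: each node adds its subtree total into its parent (node 0 is root)
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        totals[parent] += totals[i]
    # broadcast: the root's total flows back down to every node
    return [totals[0]] * n
```

Each value crosses each tree edge at most twice (once up, once down), which is where the low-bandwidth property comes from.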
An Example Algorithm: Weight averaging
n = AllReduce(1)
While (pass number < max):
1 While (examples left):
1 Do online update.
2 AllReduce(weights)
3 For each weight, w ← w/n
Other algorithms implemented:
1 Nonuniform averaging for online learning
2 Conjugate Gradient
3 LBFGS
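One pass of the weight-averaging algorithm can be sketched by simulating the nodes in a single process: each "node" runs the online update on its shard, then an allreduce-style sum over the weight vectors, divided by n, gives every node the averaged model.

```python
# Sketch of one weight-averaging pass, with nodes simulated as a loop.
def average_pass(shards, num_features, eta=0.1):
    """shards: per-node lists of (x, y) with x a list of (index, value) pairs."""
    n = len(shards)                                   # n = AllReduce(1)
    models = []
    for shard in shards:                              # each node's online pass
        w = [0.0] * num_features
        for x, y in shard:
            y_hat = sum(w[i] * v for i, v in x)
            for i, v in x:
                w[i] += 2 * eta * (y - y_hat) * v
        models.append(w)
    summed = [sum(m[j] for m in models) for j in range(num_features)]  # AllReduce(weights)
    return [wj / n for wj in summed]                  # w <- w / n
```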
What is Hadoop AllReduce?
1 "Map" job moves program to data.
2 Delayed initialization: Most failures are disk failures. First read (and cache) all data before initializing allreduce. Failures autorestart on a different node with identical data.
3 Speculative execution: In a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers. We use the first to finish reading all the data once.
Approach Used
1 Optimize hard so few data passes are required.
1 Normalized, adaptive, safe, online gradient descent.
2 L-BFGS
3 Use (1) to warmstart (2).
2 Use map-only Hadoop for process control and error recovery.
3 Use AllReduce code to sync state.
4 Always save input examples in a cachefile to speed later passes.
5 Use hashing trick to reduce input complexity.
Open source in Vowpal Wabbit 6.1. Search for it.
Robustness & Speedup
[Figure: speedup (0-10) vs number of nodes (10-100) for Average_10, Min_10, and Max_10, compared against linear speedup.]
Splice Site Recognition
[Figure: two plots of auPRC vs iteration comparing Online, L-BFGS, L-BFGS w/ 1 online pass, and L-BFGS w/ 5 online passes.]
Splice Site Recognition
[Figure: auPRC vs effective number of passes over the data (0-20) for L-BFGS with one online pass, Zinkevich et al., and Dekel et al.]
To learn more
The wiki has tutorials, examples, and help:https://github.com/JohnLangford/vowpal_wabbit/wiki
Mailing List: [email protected]
Various discussion: the Machine Learning (Theory) blog at http://hunch.net
Bibliography: Original VW
Caching L. Bottou. Stochastic Gradient Descent Examples on Toy Problems, http://leon.bottou.org/projects/sgd, 2007.
Release Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki,2007.
Hashing Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and SVN Vishwanathan, Hash Kernels for Structured Data, AISTATS 2009.
Hashing K. Weinberger, A. Dasgupta, J. Langford, A. Smola, andJ. Attenberg, Feature Hashing for Large Scale MultitaskLearning, ICML 2009.
Bibliography: Algorithmics
L-BFGS J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773-782, 1980.
Adaptive H. B. McMahan and M. Streeter, Adaptive BoundOptimization for Online Convex Optimization, COLT2010.
Adaptive J. Duchi, E. Hazan, and Y. Singer, Adaptive SubgradientMethods for Online Learning and StochasticOptimization, COLT 2010.
Safe N. Karampatziakis, and J. Langford, Online ImportanceWeight Aware Updates, UAI 2011.
Bibliography: Parallel
grad sum C. Teo, Q. Le, A. Smola, V. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007.
avg. 1 G. Mann et al., Efficient large-scale distributed training of conditional maximum entropy models, NIPS 2009.
avg. 2 K. Hall, S. Gilpin, and G. Mann, MapReduce/Bigtable for Distributed Optimization, LCCC 2010.
ov. avg M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized Stochastic Gradient Descent, NIPS 2010.
P. online D. Hsu, N. Karampatziakis, J. Langford, and A. Smola, Parallel Online Learning, in SUML 2010.
D. Mini 1 O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, Optimal Distributed Online Predictions Using Minibatch, http://arxiv.org/abs/1012.1367
D. Mini 2 A. Agarwal and J. Duchi, Distributed delayed stochastic optimization, http://arxiv.org/abs/1009.0571
Vowpal Wabbit Goals for Future Development
1 Native learning reductions. Just like more complicated losses. In development now.
2 Librarification, so people can use VW in their favorite language.
3 Other learning algorithms, as interest dictates.
4 Various further optimizations. (Allreduce can be improved by a factor of 3...)
Reductions
Goal: minimize ℓ on D.
Given an algorithm for optimizing ℓ0/1:
Transform D into D′.
Transform h with small ℓ0/1(h, D′) into Rh with small ℓ(Rh, D), such that if h does well on (D′, ℓ0/1), Rh is guaranteed to do well on (D, ℓ).
The transformation
R = transformer from complex examples to simple examples.
R⁻¹ = transformer from simple predictions to complex predictions.
example: One Against All
Create k binary regression problems, one per class. For class i predict "Is the label i or not?"
(x, y) ↦ (x, 1(y = 1)), (x, 1(y = 2)), ..., (x, 1(y = k))
Multiclass prediction: evaluate all the classifiers and choose the largest scoring label.
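The one-against-all transformation and its inverse can be sketched on top of any binary scorer; this mirrors the (x, y) ↦ {(x, 1(y = i))} mapping above and is not VW's C++ implementation.

```python
# Sketch of the one-against-all reduction: R and R^{-1}.
def oaa_transform(example, k):
    """R: (x, y) -> k binary examples (x, 1 if y == i else 0), for i = 1..k."""
    x, y = example
    return [(x, 1.0 if y == i else 0.0) for i in range(1, k + 1)]

def oaa_predict(scorers, x):
    """R^{-1}: run one scorer per class, return the argmax label (1-based)."""
    scores = [s(x) for s in scorers]
    return 1 + scores.index(max(scores))
```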
The code: oaa.cc
// parses reduction-specific flags.
void parse_flags(size_t s, void (*base_l)(example*), void (*base_f)())
// Implements R and R⁻¹ using base_l.
void learn(example* ec)
// Cleans any temporary state and calls base_f.
void finish()
The important point: anything fitting this interface is easy to code in VW now, including all forms of feature diddling and creation. And reductions inherit all the input/output/optimization/parallelization of VW!
Reductions implemented
1 One-Against-All (--oaa <k>). The baseline multiclass reduction.
2 Cost Sensitive One-Against-All (--csoaa <k>). Predicts the cost of each label and minimizes the cost.
3 Weighted All-Pairs (--wap <k>). An alternative to --csoaa with better theory.
4 Cost Sensitive One-Against-All with Label Dependent Features (--csoaa_ldf). As csoaa, but features are not shared between labels.
5 WAP with Label Dependent Features (--wap_ldf).
6 Sequence Prediction (--sequence <k>). A simple implementation of Searn and Dagger for sequence prediction. Uses a cost sensitive predictor.
Reductions to Implement
[Figure: "Regret Transform Reductions" diagram. Problems (Mean Regression, Quantile Regression, Classification, IW Classification, AUC Ranking, k-way Regression, k-Classification, k-Partial Label, k-Cost Classification, Dynamic Models, T-step RL with State Visitation, T-step RL with Demonstration Policy, Unsupervised by Self Prediction) are connected by algorithms (Quicksort, Filter Tree, PECOC, Probing, Costing, Quanting, ECT, Offset Tree, Searn, PSDP), each edge labeled with its regret multiplier (1, 4, k−1, k/2, Tk, Tk ln T, ...; some marked ??).]