Second Order Learning
Koby Crammer, Department of Electrical Engineering
ECML PKDD 2013, Prague

TRANSCRIPT

Page 1:

Second Order Learning

Koby Crammer, Department of Electrical Engineering

ECML PKDD 2013 Prague

Page 2:

Thanks

• Mark Dredze
• Alex Kulesza
• Avihai Mejer
• Edward Moroshko
• Francesco Orabona
• Fernando Pereira
• Yoram Singer
• Nina Vaitz

Page 3:

Tutorial Context

[Diagram: this tutorial in context, at the intersection of online learning, optimization theory, real-world data, and SVMs]

Page 4:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 5:

Online Learning

Tyrannosaurus rex

Page 6:

Online Learning

Triceratops

Page 7:

Online Learning

Tyrannosaurus rex

Velociraptor

Page 8:

Formal Setting – Binary Classification

• Instances: images, sentences
• Labels: parse trees, names
• Prediction rule: linear prediction rules
• Loss: number of mistakes

Page 9:

Predictions

• Discrete predictions: hard to optimize

• Continuous predictions:
  – Label
  – Confidence

Page 10:

Loss Functions

• Natural loss:
  – Zero-one loss
• Real-valued-prediction losses:
  – Hinge loss
  – Exponential loss (boosting)
  – Log loss (max entropy, boosting)

Page 11:

Loss Functions

[Plot: the zero-one loss and the hinge loss as functions of the margin]

Page 12:

Online Learning

Maintain model M
Get instance x
Predict label ŷ = M(x)
Get true label y
Suffer loss l(ŷ, y)
Update model M
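As a minimal sketch, this protocol can be written as a loop; the model object with predict and update methods is a hypothetical interface, not code from the tutorial:

def run_online(model, stream):
    """Generic online protocol: predict, observe the true label, suffer loss, update."""
    mistakes = 0
    for x, y in stream:              # instances arrive one at a time
        y_hat = model.predict(x)     # predict with the current model M
        mistakes += int(y_hat != y)  # suffer the zero-one loss l(y_hat, y)
        model.update(x, y)           # update the model M
    return mistakes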

Page 13:

Linear Classifiers

• Any features
• W.l.o.g.
• Binary classifiers of the form sign(w·x)

(Notation abuse)

Page 14:

Linear Classifiers (cont.)

• Prediction: ŷ = sign(w·x)
• Confidence in prediction: |w·x|

Page 15:

Linear Classifiers

[Figure: the input instance x to be classified and the weight vector w of the classifier]

Page 16:

Margin

• Margin of an example (x, y) with respect to the classifier w: y(w·x)
• Note: the margin is positive iff the classifier predicts correctly
• The set is separable iff there exists a w such that the margin is positive for all examples

Page 17:

Geometrical Interpretation

Page 18:

Geometrical Interpretation

Page 19:

Geometrical Interpretation

Page 20:

Geometrical Interpretation

[Figure: points labeled by margin: Margin > 0, Margin >> 0, Margin < 0, Margin << 0]

Page 21:

Hinge Loss

Page 22:

Why Online Learning?

• Fast
• Memory efficient: processes one example at a time
• Simple to implement
• Formal guarantees: mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive

• But: not as good as a well-designed batch algorithm

Page 23:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 24:

The Perceptron Algorithm

• If no mistake:
  – Do nothing
• If mistake:
  – Update: w ← w + y·x
• Margin after update: increases by ||x||², since y((w + y x)·x) = y(w·x) + ||x||²

Rosenblatt, 1958
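A minimal NumPy sketch of the rule above, under the usual convention that labels are -1 or +1 (the helper name is ours, not the tutorial's):

import numpy as np

def perceptron(X, y, epochs=1):
    """Mistake-driven Perceptron: on a mistake, add y_t * x_t to the weights."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:  # mistake (non-positive margin)
                w += y_t * x_t             # the margin on (x_t, y_t) grows by ||x_t||^2
    return w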

Page 25:

Geometrical Interpretation

Page 26:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 27:

Gradient Descent

• Consider the batch problem
• Simple algorithm:
  – Initialize
  – Iterate, for t = 1, 2, …
  – Compute the gradient
  – Take a gradient step

Page 28:

Page 29:

Stochastic Gradient Descent

• Consider the batch problem
• Simple algorithm:
  – Initialize
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient of the i-th loss term
  – Take a gradient step

Page 30:

Page 31:

Stochastic Gradient Descent

• “Hinge” loss: max(0, −y(w·x))
• The gradient: −y·x when the loss is positive, zero otherwise
• Simple algorithm:
  – Initialize
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – If y_i (w·x_i) ≤ 0, then step in the direction y_i x_i; else leave w unchanged

The Perceptron is a stochastic gradient descent algorithm on a sum of “hinge” losses with a specific order of examples; a sketch follows below.
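A sketch of that equivalence, assuming the zero-threshold “hinge” loss max(0, −y(w·x)) as on the slide; the subgradient step coincides exactly with the Perceptron update:

import numpy as np

def sgd_hinge(X, y, eta=1.0, steps=1000, seed=0):
    """SGD on the "hinge" loss max(0, -y w.x): a step is taken only when
    y w.x <= 0, which is the Perceptron update with learning rate eta."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = int(rng.integers(len(X)))     # pick a random index
        if y[i] * np.dot(w, X[i]) <= 0:   # loss is active: subgradient is -y_i x_i
            w += eta * y[i] * X[i]        # w <- w - eta * subgradient
        # otherwise the gradient is zero and w is unchanged
    return w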

Page 32:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 33:

Motivation

• Perceptron: no guarantee of margin after the update
• PA: enforce a minimal non-zero margin after the update
• In particular:
  – If the margin is large enough (at least 1), do nothing
  – If the margin is less than one, update so that the margin after the update is one

Page 34:

Input Space

Page 35:

Input Space vs. Version Space

• Input space:
  – Points are input data
  – One constraint is induced by a weight vector
  – Primal space
  – Half-space = all input examples that are classified correctly by a given predictor (weight vector)

• Version space:
  – Points are weight vectors
  – One constraint is induced by an input example
  – Dual space
  – Half-space = all predictors (weight vectors) that classify a given input example correctly

Page 36:

Weight Vector (Version) Space

The algorithm forces the weight vector to reside in this region.

Page 37:

Passive Step

Nothing to do: the weight vector already resides on the desired side.

Page 38:

Aggressive Step

The algorithm projects the weight vector onto the desired half-space.

Page 39:

Aggressive Update Step

• Set the new weight vector to be the solution of the following optimization problem: the closest vector to the current weights that attains a unit margin on the current example
• Solution: w ← w + τ y x, with τ = max(0, 1 − y(w·x)) / ||x||² (a sketch follows below)
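A sketch of this closed-form step with the hinge loss written explicitly (assuming the basic PA variant, with no slack parameter):

import numpy as np

def pa_update(w, x, y):
    """Passive-Aggressive step: the minimal change to w that restores a unit
    margin on (x, y); if the margin is already at least 1, do nothing."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss at the current w
    if loss == 0.0:
        return w                             # passive step
    tau = loss / np.dot(x, x)                # projection step size
    return w + tau * y * x                   # aggressive step: the new margin is exactly 1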

Page 40:

Perceptron vs. PA

• Common update: w ← w + α y x
• Perceptron: α = 1 (on a mistake)
• Passive-Aggressive: α = max(0, 1 − y(w·x)) / ||x||²

Page 41:

Perceptron vs. PA

[Plot: update size as a function of the margin, with regions labeled Error, No-Error Small Margin, and No-Error Large Margin]

Page 42:

Perceptron vs. PA

Page 43:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 44:

Geometrical Assumption

• All examples are bounded in a ball of radius R

Page 45:

Separability

• There exists a unit vector that classifies the data correctly

Page 46:

Perceptron’s Mistake Bound

• The number of mistakes the algorithm makes is bounded by R²/γ²
• Simple case: positive points and negative points with a separating hyperplane; the bound follows from the radius R and the margin γ

Page 47:

Geometrical Motivation

Page 48:

SGD on such data


Page 49:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 50:

Second Order Perceptron

• Assume all inputs are given
• Compute a “whitening” matrix
• Run the Perceptron on the “whitened” data
• New “whitening” matrix

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
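A sketch of the whitening step, assuming the uncentered second-moment matrix S = sum_t x_t x_tᵀ and a small ridge term for invertibility:

import numpy as np

def whiten(X, eps=1e-8):
    """Map each row x_t to S^{-1/2} x_t, where S = X^T X, so the whitened
    data has an (approximately) identity second-moment matrix."""
    S = X.T @ X                                   # sum of outer products x x^T
    vals, vecs = np.linalg.eigh(S)                # S is symmetric PSD
    S_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return X @ S_inv_sqrt                         # S^{-1/2} is symmetric, so each row becomes S^{-1/2} x_t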

Page 51:

Second Order Perceptron

• Bound:
• Same simple case:
• Thus:
• Bound is:

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005

Page 52:

Second Order Perceptron

• If no mistake:
  – Do nothing
• If mistake:
  – Update: add y x to the running weights and x xᵀ to the correlation matrix

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
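A sketch of the online algorithm following Cesa-Bianchi, Conconi, and Gentile (2005): keep the correlation matrix of past mistake examples and predict with a whitened weight vector (a is their regularization parameter):

import numpy as np

def second_order_perceptron(X, y, a=1.0):
    """Second-order Perceptron: mistake-driven like the Perceptron, but the
    weights are whitened by the regularized correlation matrix of mistakes."""
    d = X.shape[1]
    S = a * np.eye(d)                 # a*I + sum of x x^T over past mistakes
    v = np.zeros(d)                   # sum of y*x over past mistakes
    mistakes = 0
    for x_t, y_t in zip(X, y):
        S_t = S + np.outer(x_t, x_t)  # tentatively include the current example
        w = np.linalg.solve(S_t, v)   # whitened weight vector (S_t)^{-1} v
        if y_t * np.dot(w, x_t) <= 0: # mistake
            v += y_t * x_t            # Perceptron-style update
            S = S_t                   # keep x_t's correlation
            mistakes += 1
    return v, S, mistakes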

Page 53:

SGD on whitened data

Page 54:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 55:

Span-based Update Rules

• The weight vector is a linear combination of the examples
• Two rate schedules (among many others):
  – Perceptron algorithm: conservative
  – Passive-Aggressive

[Callouts: feature value of the input instance; target label, either −1 or +1; learning rate; weight of feature f]
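The two schedules side by side, as a sketch (alpha is the per-example learning rate; the function name is ours):

import numpy as np

def span_update(w, x, y, rule="perceptron"):
    """Span-based update w <- w + alpha * y * x under the two rate schedules."""
    margin = y * np.dot(w, x)
    if rule == "perceptron":                           # conservative: update only on mistakes
        alpha = 1.0 if margin <= 0 else 0.0
    else:                                              # "pa": the Passive-Aggressive rate
        alpha = max(0.0, 1.0 - margin) / np.dot(x, x)
    return w + alpha * y * x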

Page 56:

Sentiment Classification

• Who needs this Simpsons book? You DOOOOOOOOThis is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended!

Pang, Lee, Vaithyanathan, EMNLP 2002

Page 57:

Sentiment Classification

• Many positive reviews with the word best increase w_best
• Later, a negative review: “boring book – best if you want to sleep in seconds”
• A linear update will reduce both w_best and w_boring
• But best appeared more often than boring
• The model knows more about best than about boring
• Better to reduce the weights at different rates: reduce w_boring more than w_best

Page 58:

Natural Language Processing

• Big datasets, large number of features

• Many features are only weakly correlated with target label

• Linear classifiers: features are associated with word-counts

• Heavy-tailed feature distribution

[Plot: feature counts vs. feature rank, a heavy-tailed distribution]

Page 59:

Natural Language Processing

Page 60:

New Prediction Models

• Gaussian distributions over weight vectors: w ~ N(μ, Σ)
• The covariance is either full or diagonal
• In NLP we have many features, so we use a diagonal covariance

Page 61:

Classification

• Given a new example x
• Stochastic:
  – Draw a weight vector w ~ N(μ, Σ)
  – Make a prediction with the drawn w
• Collective:
  – Average weight vector
  – Average margin
  – Average prediction

Page 62:

The Margin is Random Variable

• The signed margin y(w·x) is a random one-dimensional Gaussian, with mean y(μ·x) and variance xᵀΣx

• Thus the probability of a correct prediction is the Gaussian CDF of the normalized mean margin (see the sketch below)
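Concretely, a sketch of that probability, using the standard normal CDF written via the error function:

import math
import numpy as np

def prob_correct(mu, Sigma, x, y):
    """P(y*(w.x) > 0) for w ~ N(mu, Sigma): the signed margin is a 1-d Gaussian
    with mean y*(mu.x) and variance x^T Sigma x."""
    mean = y * float(np.dot(mu, x))
    std = math.sqrt(float(x @ Sigma @ x))
    z = mean / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z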

Page 63:

[Figure: a single linear model vs. a distribution over linear models, with an example and the mean weight vector marked]

Page 64:

Weight Vector (Version) Space

The algorithm forces most of the probability mass of the weight vector to reside in this region.

Page 65:

Passive Step

Nothing to do: most of the weight vectors already classify the example correctly.

Page 66:

Aggressive Step

The algorithm projects the current Gaussian distribution onto the desired half-space: the mean is moved beyond the mistake line (large margin), and the covariance is shrunk in the direction of the input example.

Page 67:

Projection Update

• Vectors (aka PA): project the weight vector onto the desired half-space
• Distributions (new update): project the Gaussian N(μ, Σ), governed by a confidence parameter

Page 68:

Divergence

• Sum of two divergences of the parameters: a matrix Itakura-Saito divergence between the covariances and a Mahalanobis distance between the means
• Convex in both arguments simultaneously

Page 69:

Constraint

• Probabilistic constraint: Pr[y(w·x) ≥ 0] ≥ η
• Equivalent margin constraint: y(μ·x) ≥ φ √(xᵀΣx), with φ = Φ⁻¹(η)
• Convex in μ, concave in Σ
• Solutions:
  – Linear approximation
  – Change variables to get a convex formulation
  – Relax (AROW)

Dredze, Crammer, Pereira. ICML 2008

Crammer, Dredze, Pereira. NIPS 2008

Crammer, Dredze, Kulesza. NIPS 2009

Page 70:

Convexity

• Change variables
• Equivalent convex formulation

Crammer, Dredze, Pereira. NIPS 2008

Page 71:

AROW

• PA:
• CW:
• AROW: similar update form to CW

Crammer, Dredze, Kulesza. NIPS 2009

Page 72:

The Update

• The optimization update can be solved analytically: μ ← μ + α y Σx and Σ ← Σ − β (Σx)(Σx)ᵀ
• The coefficients α and β depend on the specific algorithm
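For AROW the coefficients are available in closed form (Crammer, Dredze, Kulesza, 2009); a sketch with regularization parameter r:

import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    """AROW step: when the hinge loss is positive, move the mean along Sigma @ x
    and shrink the variance in the direction of x."""
    margin = y * np.dot(mu, x)
    if margin >= 1.0:
        return mu, Sigma                       # zero loss: no update
    Sx = Sigma @ x
    beta = 1.0 / (np.dot(x, Sx) + r)           # covariance (confidence) coefficient
    alpha = (1.0 - margin) * beta              # mean coefficient: hinge loss times beta
    return mu + alpha * y * Sx, Sigma - beta * np.outer(Sx, Sx)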

Page 73:

Definitions


Page 74:

Updates

• CW (linearization)
• CW (change of variables)
• AROW

Page 75:

Per-feature Learning Rate

• Per-feature learning rate
• Reducing the learning rate and the eigenvalues of the covariance matrix

Page 76:

Diagonal Matrix

• Given a matrix A, define diag(A) to be only the diagonal part of the matrix
• Make the matrix diagonal: Σ ← diag(Σ)
• Make the inverse diagonal: Σ ← (diag(Σ⁻¹))⁻¹
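In NumPy the two diagonal variants read as follows (a sketch; the function names are ours):

import numpy as np

def diag_part(A):
    """diag(A): keep only the diagonal of a square matrix."""
    return np.diag(np.diag(A))

def diagonalize_cov(Sigma):
    return diag_part(Sigma)                                # make the matrix diagonal

def diagonalize_inverse(Sigma):
    return np.linalg.inv(diag_part(np.linalg.inv(Sigma)))  # make the inverse diagonal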

Page 77:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 78:

(Back to) Stochastic Gradient Descent

• Consider the batch problem
• Simple algorithm:
  – Initialize
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient of the i-th loss term
  – Take a gradient step

Page 79:

Adaptive Stochastic Gradient Descent

• Consider the batch problem
• Simple algorithm:
  – Initialize
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient of the i-th loss term
  – Take a gradient step
  – Update the adaptation matrix A from the accumulated gradients

Duchi, Hazan, Singer, 2010; McMahan and Streeter, 2010
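A sketch of the diagonal variant: per-coordinate steps scaled by the inverse root of the accumulated squared gradients (the gradient oracle grad_i is a hypothetical argument returning the gradient of the i-th loss term):

import numpy as np

def adagrad(grad_i, w0, n, eta=0.1, steps=1000, eps=1e-8, seed=0):
    """Diagonal AdaGrad: w <- w - eta * g / sqrt(sum of past g^2), per coordinate."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                   # running sum of squared gradients
    for _ in range(steps):
        i = int(rng.integers(n))           # pick a random index
        g = grad_i(w, i)                   # stochastic gradient of the i-th term (hypothetical oracle)
        G += g * g                         # accumulate per-coordinate squares
        w -= eta * g / (np.sqrt(G) + eps)  # adaptive per-feature step
    return w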

Page 80:

Adaptive Stochastic Gradient Descent

• Very general! Can be used to solve problems with various regularizations

• The matrix A can be either full or diagonal

• Comes with convergence and regret bounds

• Similar performance to AROW

Duchi, Hazan, Singer, 2010; McMahan and Streeter, 2010

Page 81:

Adaptive Stochastic Gradient Descent

[Figure: the SGD update vs. the AdaGrad update]

Duchi, Hazan, Singer, 2010; McMahan and Streeter, 2010

Page 82:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 83:

Kernels

Page 84:

Proof

• Show that we can write the weight vector as a linear combination of the examples
• By induction
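For concreteness, here is the first-order version of that statement as a sketch: since every update adds a multiple of an example, w stays in the span of the data and predictions need only inner products, which a kernel can replace (the tutorial's induction extends this to the second-order updates):

import numpy as np

def kernel_perceptron(X, y, kernel, epochs=1):
    """Perceptron in dual form: w = sum_i alpha_i * y_i * x_i is kept
    implicitly, so sign(w.x) is computed from kernel evaluations alone."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for t in range(n):
            score = sum(alpha[i] * y[i] * kernel(X[i], X[t]) for i in range(n))
            if y[t] * score <= 0:      # mistake in the (implicit) feature space
                alpha[t] += 1.0        # add (x_t, y_t) to the combination
    return alpha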

Page 85:

Proof (cntd)

• By the update rule:
• Thus:

Page 86:

Proof (cntd)

• By the update rule:

Page 87:

Proof (cntd)

• Thus:

Page 88:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 89:

Statistical Interpretation

• Margin constraint: y(μ·x) ≥ φ √(xᵀΣx)
• Distribution over weight vectors: w ~ N(μ, Σ)
• Equivalent view: assume the input is corrupted with Gaussian noise

Page 90:

Statistical Interpretation

[Figure: version space (example, mean weight vector) and input space (input instance, linear separator), with good and bad realizations of the noisy separator]

Page 91:

Mistake Bound

• For any reference weight vector, the number of mistakes made by AROW is upper bounded by a quantity that depends on:
  – the set of example indices with a mistake
  – the set of example indices with an update but not a mistake

Orabona and Crammer, NIPS 2010

Page 92:

Comment I

• Separable case and no updates:

Page 93:

Comment II

• For large r, the bound becomes:

• When no updates are performed: the Perceptron bound

Page 94:

Bound for Diagonal Algorithm

• The number of mistakes is bounded by:
• The bound is low when a feature is either rare or non-informative
• Exactly as in NLP …

Orabona and Crammer, NIPS 2010

Page 95:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 96:

Synthetic Data

• 20 features
• 2 informative (rotated, skewed Gaussian)
• 18 noisy
• Using a single feature is as good as random prediction

Page 97:

Synthetic Data (cntd.)

Distribution after 50 examples (x1)

Page 98:

Synthetic Data (no noise)

[Plot: learning curves for Perceptron, PA, SOP, CW-full, and CW-diag]

Page 99:

Synthetic Data (10% noise)

Page 100:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 101:

Data

• Sentiment
  – Sentiment reviews from 6 Amazon domains (Blitzer et al.)
  – Classify a product review as either positive or negative
• Reuters, pairs of labels
  – Three divisions: Insurance (Life vs. Non-Life), Business Services (Banking vs. Financial), Retail Distribution (Specialist Stores vs. Mixed Retail)
  – Bag-of-words representation with binary features
• 20 News Groups, pairs of labels
  – Three divisions: comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, sci.electronics vs. sci.med, and talk.politics.guns vs. talk.politics.mideast
  – Bag-of-words representation with binary features

Page 102:

Experimental Design

• Online-to-batch:
  – Multiple passes over the training data
  – Evaluate on a separate test set after each pass
  – Compute error/accuracy
• Set parameters using held-out data
• 10-fold cross-validation
• ~2000 instances per problem
• Balanced class labels

Page 103:

Results vs Online- Sentiment

• StdDev and Variance: always better than the baseline
• Variance: 5/6 significantly better

Page 104:

Results vs Online – 20NG + Reuters

• StdDev and Variance: always better than the baseline
• Variance: 4/6 significantly better

Page 105:

Results vs Batch - Sentiment

• Always better than the batch methods
• 3/6 significantly better

Page 106:

Results vs Batch - 20NG + Reuters

• 5/6 better than the batch methods
• 3/5 significantly better, 1/1 significantly worse

Page 107:

Page 108:

Page 109:

Page 110:

Results - Sentiment

• CW is better (5/6 cases), statistically significant in 4/6

• CW benefits less from many passes

[Plots: accuracy vs. passes of training data for each dataset, comparing online PA (O PA) and online CW (O CW)]

Page 111:

Results – Reuters + 20NG

• CW is better (5/6 cases), statistically significant in 4/6

• CW benefits less from many passes

[Plots: accuracy vs. passes of training data for each dataset, comparing online PA (O PA) and online CW (O CW)]

Page 112:

Error Reduction by Multiple Passes

• PA benefits more from multiple passes (8/12)

• The amount of benefit is data dependent

Page 113:

Bayesian Logistic Regression

• BLR (T. Jaakkola and M. Jordan, 1997), based on the variational approximation:
  – Covariance
  – Mean
• CW/AROW, a conceptually decoupled update, a function of the margin/hinge loss:
  – Covariance
  – Mean

Page 114:

Algorithms Summary

1st Order                  2nd Order
Perceptron                 SOP
PA                         CW+AROW
SGD                        AdaGrad
Logistic Regression (LR)   BLR

• Different motivations, similar algorithms

• All algorithms can be kernelized

• Work well for data that is NOT isotropic/symmetric

• State-of-the-art results in various domains

• Accompanied by theory