Second Order Learning
Koby Crammer, Department of Electrical Engineering
ECML PKDD 2013, Prague

TRANSCRIPT

Page 1:

Second Order Learning

Koby Crammer, Department of Electrical Engineering

ECML PKDD 2013 Prague

Page 2:

Thanks

• Mark Dredze
• Alex Kulesza
• Avihai Mejer
• Edward Moroshko
• Francesco Orabona
• Fernando Pereira
• Yoram Singer
• Nina Vaitz

Page 3:

Tutorial Context

[Diagram: this tutorial in context, at the intersection of online learning, optimization theory, real-world data, and SVMs]

Page 4:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 5:

Online Learning

Tyrannosaurus rex

Page 6:

Online Learning

Triceratops

Page 7:

Online Learning

Tyrannosaurus rex

Velociraptor

Page 8:

Formal Setting – Binary Classification

• Instances: images, sentences
• Labels: parse trees, names
• Prediction rule: linear prediction rules
• Loss: number of mistakes

Page 9:

Predictions

• Discrete predictions: hard to optimize

• Continuous predictions:
  – Label
  – Confidence

Page 10:

Loss Functions

• Natural loss:
  – Zero-one loss
• Real-valued-prediction losses:
  – Hinge loss
  – Exponential loss (boosting)
  – Log loss (max entropy, boosting)

Page 11:

Loss Functions

[Plot: the zero-one loss and the hinge loss as functions of the margin]

Page 12:

Online Learning

Maintain model M
Get instance x
Predict label ŷ = M(x)
Get true label y
Suffer loss l(ŷ, y)
Update model M
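As a minimal sketch, this protocol can be written as a loop; the model object with predict and update methods is a hypothetical interface, not code from the tutorial:

def run_online(model, stream):
    """Generic online protocol: predict, observe the true label, suffer loss, update."""
    mistakes = 0
    for x, y in stream:              # instances arrive one at a time
        y_hat = model.predict(x)     # predict with the current model M
        mistakes += int(y_hat != y)  # suffer the zero-one loss l(y_hat, y)
        model.update(x, y)           # update the model M
    return mistakes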

Page 13:

Linear Classifiers

• Any features
• W.l.o.g.
• Binary classifiers of the form sign(w·x)

(Notation abuse)

Page 14:

Linear Classifiers (cont.)

• Prediction: ŷ = sign(w·x)
• Confidence in prediction: |w·x|

Page 15:

Linear Classifiers

[Figure: the input instance x to be classified and the weight vector w of the classifier]

Page 16:

Margin

• Margin of an example (x, y) with respect to the classifier w: y(w·x)
• Note: the margin is positive iff the classifier predicts correctly
• The set is separable iff there exists a w such that the margin is positive for all examples

Page 17:

Geometrical Interpretation

Page 18:

Geometrical Interpretation

Page 19:

Geometrical Interpretation

Page 20:

Geometrical Interpretation

[Figure: points labeled by margin: Margin > 0, Margin >> 0, Margin < 0, Margin << 0]

Page 21:

Hinge Loss

Page 22:

Why Online Learning?

• Fast
• Memory efficient: processes one example at a time
• Simple to implement
• Formal guarantees: mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive

• But: not as good as a well-designed batch algorithm

Page 23:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 24:

The Perceptron Algorithm

• If no mistake:
  – Do nothing
• If mistake:
  – Update: w ← w + y·x
• Margin after update: increases by ||x||², since y((w + y x)·x) = y(w·x) + ||x||²

Rosenblatt, 1958
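A minimal NumPy sketch of the rule above, under the usual convention that labels are -1 or +1 (the helper name is ours, not the tutorial's):

import numpy as np

def perceptron(X, y, epochs=1):
    """Mistake-driven Perceptron: on a mistake, add y_t * x_t to the weights."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:  # mistake (non-positive margin)
                w += y_t * x_t             # the margin on (x_t, y_t) grows by ||x_t||^2
    return w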

Page 25:

Geometrical Interpretation

Page 26:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 27:

Gradient Descent

• Consider the batch problem
• Simple algorithm:
  – Initialize
  – Iterate, for t = 1, 2, …
  – Compute the gradient
  – Take a gradient step

Page 28:

Page 29:

Stochastic Gradient Descent

• Consider the batch problem
• Simple algorithm:
  – Initialize
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient of the i-th loss term
  – Take a gradient step

Page 30:

Page 31:

Stochastic Gradient Descent

• “Hinge” loss: max(0, −y(w·x))
• The gradient: −y·x when the loss is positive, zero otherwise
• Simple algorithm:
  – Initialize
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – If y_i (w·x_i) ≤ 0, then step in the direction y_i x_i; else leave w unchanged

The Perceptron is a stochastic gradient descent algorithm on a sum of “hinge” losses with a specific order of examples; a sketch follows below.
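A sketch of that equivalence, assuming the zero-threshold “hinge” loss max(0, −y(w·x)) as on the slide; the subgradient step coincides exactly with the Perceptron update:

import numpy as np

def sgd_hinge(X, y, eta=1.0, steps=1000, seed=0):
    """SGD on the "hinge" loss max(0, -y w.x): a step is taken only when
    y w.x <= 0, which is the Perceptron update with learning rate eta."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = int(rng.integers(len(X)))     # pick a random index
        if y[i] * np.dot(w, X[i]) <= 0:   # loss is active: subgradient is -y_i x_i
            w += eta * y[i] * X[i]        # w <- w - eta * subgradient
        # otherwise the gradient is zero and w is unchanged
    return w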

Page 32:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 33:

Motivation

• Perceptron: no guarantee of margin after the update
• PA: enforce a minimal non-zero margin after the update
• In particular:
  – If the margin is large enough (at least 1), do nothing
  – If the margin is less than one, update so that the margin after the update is one

Page 34:

Input Space

Page 35:

Input Space vs. Version Space

• Input space:
  – Points are input data
  – One constraint is induced by a weight vector
  – Primal space
  – Half-space = all input examples that are classified correctly by a given predictor (weight vector)

• Version space:
  – Points are weight vectors
  – One constraint is induced by an input example
  – Dual space
  – Half-space = all predictors (weight vectors) that classify a given input example correctly

Page 36:

Weight Vector (Version) Space

The algorithm forces the weight vector to reside in this region.

Page 37:

Passive Step

Nothing to do: the weight vector already resides on the desired side.

Page 38:

Aggressive Step

The algorithm projects the weight vector onto the desired half-space.

Page 39:

Aggressive Update Step

• Set the new weight vector to be the solution of the following optimization problem: the closest vector to the current weights that attains a unit margin on the current example
• Solution: w ← w + τ y x, with τ = max(0, 1 − y(w·x)) / ||x||² (a sketch follows below)
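A sketch of this closed-form step with the hinge loss written explicitly (assuming the basic PA variant, with no slack parameter):

import numpy as np

def pa_update(w, x, y):
    """Passive-Aggressive step: the minimal change to w that restores a unit
    margin on (x, y); if the margin is already at least 1, do nothing."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss at the current w
    if loss == 0.0:
        return w                             # passive step
    tau = loss / np.dot(x, x)                # projection step size
    return w + tau * y * x                   # aggressive step: the new margin is exactly 1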

Page 40:

Perceptron vs. PA

• Common update: w ← w + α y x
• Perceptron: α = 1 (on a mistake)
• Passive-Aggressive: α = max(0, 1 − y(w·x)) / ||x||²

Page 41:

Perceptron vs. PA

[Plot: update size as a function of the margin, with regions labeled Error, No-Error Small Margin, and No-Error Large Margin]

Page 42:

Perceptron vs. PA

Page 43:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 44:

Geometrical Assumption

• All examples are bounded in a ball of radius R

Page 45:

Separability

• There exists a unit vector that classifies the data correctly

Page 46:

Perceptron’s Mistake Bound

• The number of mistakes the algorithm makes is bounded by R²/γ²
• Simple case: positive points and negative points with a separating hyperplane; the bound follows from the radius R and the margin γ

Page 47:

Geometrical Motivation

Page 48:

SGD on such data


Page 49:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 50:

Second Order Perceptron

• Assume all inputs are given
• Compute a “whitening” matrix
• Run the Perceptron on the “whitened” data
• New “whitening” matrix

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
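A sketch of the whitening step, assuming the uncentered second-moment matrix S = sum_t x_t x_tᵀ and a small ridge term for invertibility:

import numpy as np

def whiten(X, eps=1e-8):
    """Map each row x_t to S^{-1/2} x_t, where S = X^T X, so the whitened
    data has an (approximately) identity second-moment matrix."""
    S = X.T @ X                                   # sum of outer products x x^T
    vals, vecs = np.linalg.eigh(S)                # S is symmetric PSD
    S_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return X @ S_inv_sqrt                         # S^{-1/2} is symmetric, so each row becomes S^{-1/2} x_t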

Page 51:

Second Order Perceptron

• Bound:
• Same simple case:
• Thus:
• Bound is:

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005

Page 52:

Second Order Perceptron

• If no mistake:
  – Do nothing
• If mistake:
  – Update: add y x to the running weights and x xᵀ to the correlation matrix

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
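A sketch of the online algorithm following Cesa-Bianchi, Conconi, and Gentile (2005): keep the correlation matrix of past mistake examples and predict with a whitened weight vector (a is their regularization parameter):

import numpy as np

def second_order_perceptron(X, y, a=1.0):
    """Second-order Perceptron: mistake-driven like the Perceptron, but the
    weights are whitened by the regularized correlation matrix of mistakes."""
    d = X.shape[1]
    S = a * np.eye(d)                 # a*I + sum of x x^T over past mistakes
    v = np.zeros(d)                   # sum of y*x over past mistakes
    mistakes = 0
    for x_t, y_t in zip(X, y):
        S_t = S + np.outer(x_t, x_t)  # tentatively include the current example
        w = np.linalg.solve(S_t, v)   # whitened weight vector (S_t)^{-1} v
        if y_t * np.dot(w, x_t) <= 0: # mistake
            v += y_t * x_t            # Perceptron-style update
            S = S_t                   # keep x_t's correlation
            mistakes += 1
    return v, S, mistakes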

Page 53:

SGD on whitened data

Page 54:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 55:

Span-based Update Rules

• The weight vector is a linear combination of the examples
• Two rate schedules (among many others):
  – Perceptron algorithm: conservative
  – Passive-Aggressive

[Callouts: feature value of the input instance; target label, either −1 or +1; learning rate; weight of feature f]
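The two schedules side by side, as a sketch (alpha is the per-example learning rate; the function name is ours):

import numpy as np

def span_update(w, x, y, rule="perceptron"):
    """Span-based update w <- w + alpha * y * x under the two rate schedules."""
    margin = y * np.dot(w, x)
    if rule == "perceptron":                           # conservative: update only on mistakes
        alpha = 1.0 if margin <= 0 else 0.0
    else:                                              # "pa": the Passive-Aggressive rate
        alpha = max(0.0, 1.0 - margin) / np.dot(x, x)
    return w + alpha * y * x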

Page 56:

Sentiment Classification

• Who needs this Simpsons book? You DOOOOOOOOThis is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended!

Pang, Lee, Vaithyanathan, EMNLP 2002

Page 57:

Sentiment Classification

• Many positive reviews with the word best increase w_best
• Later, a negative review: “boring book – best if you want to sleep in seconds”
• A linear update will reduce both w_best and w_boring
• But best appeared more often than boring
• The model knows more about best than about boring
• Better to reduce the weights at different rates: reduce w_boring more than w_best

Page 58:

Natural Language Processing

• Big datasets, large number of features

• Many features are only weakly correlated with target label

• Linear classifiers: features are associated with word-counts

• Heavy-tailed feature distribution

[Plot: feature counts vs. feature rank, a heavy-tailed distribution]

Page 59:

Natural Language Processing

Page 60:

New Prediction Models

• Gaussian distributions over weight vectors: w ~ N(μ, Σ)
• The covariance is either full or diagonal
• In NLP we have many features, so we use a diagonal covariance

Page 61:

Classification

• Given a new example x
• Stochastic:
  – Draw a weight vector w ~ N(μ, Σ)
  – Make a prediction with the drawn w
• Collective:
  – Average weight vector
  – Average margin
  – Average prediction

Page 62:

The Margin is Random Variable

• The signed margin y(w·x) is a random one-dimensional Gaussian, with mean y(μ·x) and variance xᵀΣx

• Thus the probability of a correct prediction is the Gaussian CDF of the normalized mean margin (see the sketch below)
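Concretely, a sketch of that probability, using the standard normal CDF written via the error function:

import math
import numpy as np

def prob_correct(mu, Sigma, x, y):
    """P(y*(w.x) > 0) for w ~ N(mu, Sigma): the signed margin is a 1-d Gaussian
    with mean y*(mu.x) and variance x^T Sigma x."""
    mean = y * float(np.dot(mu, x))
    std = math.sqrt(float(x @ Sigma @ x))
    z = mean / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z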

Page 63:

[Figure: a single linear model vs. a distribution over linear models, with an example and the mean weight vector marked]

Page 64:

Weight Vector (Version) Space

The algorithm forces most of the probability mass of the weight vector to reside in this region.

Page 65:

Passive Step

Nothing to do: most of the weight vectors already classify the example correctly.

Page 66:

Aggressive Step

The algorithm projects the current Gaussian distribution onto the desired half-space: the mean is moved beyond the mistake line (large margin), and the covariance is shrunk in the direction of the input example.

Page 67:

Projection Update

• Vectors (aka PA): project the weight vector onto the desired half-space
• Distributions (new update): project the Gaussian N(μ, Σ), governed by a confidence parameter

Page 68:

Divergence

• Sum of two divergences of the parameters: a matrix Itakura-Saito divergence between the covariances and a Mahalanobis distance between the means
• Convex in both arguments simultaneously

Page 69:

Constraint

• Probabilistic constraint: Pr[y(w·x) ≥ 0] ≥ η
• Equivalent margin constraint: y(μ·x) ≥ φ √(xᵀΣx), with φ = Φ⁻¹(η)
• Convex in μ, concave in Σ
• Solutions:
  – Linear approximation
  – Change variables to get a convex formulation
  – Relax (AROW)

Dredze, Crammer, Pereira. ICML 2008

Crammer, Dredze, Pereira. NIPS 2008

Crammer, Dredze, Kulesza. NIPS 2009

Page 70:

Convexity

• Change variables
• Equivalent convex formulation

Crammer, Dredze, Pereira. NIPS 2008

Page 71:

AROW

• PA:
• CW:
• AROW: similar update form to CW

Crammer, Dredze, Kulesza. NIPS 2009

Page 72:

The Update

• The optimization update can be solved analytically: μ ← μ + α y Σx and Σ ← Σ − β (Σx)(Σx)ᵀ
• The coefficients α and β depend on the specific algorithm
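For AROW the coefficients are available in closed form (Crammer, Dredze, Kulesza, 2009); a sketch with regularization parameter r:

import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    """AROW step: when the hinge loss is positive, move the mean along Sigma @ x
    and shrink the variance in the direction of x."""
    margin = y * np.dot(mu, x)
    if margin >= 1.0:
        return mu, Sigma                       # zero loss: no update
    Sx = Sigma @ x
    beta = 1.0 / (np.dot(x, Sx) + r)           # covariance (confidence) coefficient
    alpha = (1.0 - margin) * beta              # mean coefficient: hinge loss times beta
    return mu + alpha * y * Sx, Sigma - beta * np.outer(Sx, Sx)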

Page 73:

Definitions


Page 74:

Updates

• CW (linearization)
• CW (change of variables)
• AROW

Page 75:

Per-feature Learning Rate

• Per-feature learning rate
• Reducing the learning rate and the eigenvalues of the covariance matrix

Page 76:

Diagonal Matrix

• Given a matrix A, define diag(A) to be only the diagonal part of the matrix
• Make the matrix diagonal: Σ ← diag(Σ)
• Make the inverse diagonal: Σ ← (diag(Σ⁻¹))⁻¹
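In NumPy the two diagonal variants read as follows (a sketch; the function names are ours):

import numpy as np

def diag_part(A):
    """diag(A): keep only the diagonal of a square matrix."""
    return np.diag(np.diag(A))

def diagonalize_cov(Sigma):
    return diag_part(Sigma)                                # make the matrix diagonal

def diagonalize_inverse(Sigma):
    return np.linalg.inv(diag_part(np.linalg.inv(Sigma)))  # make the inverse diagonal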

Page 77:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 78:

(Back to) Stochastic Gradient Descent

• Consider the batch problem
• Simple algorithm:
  – Initialize
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient of the i-th loss term
  – Take a gradient step

Page 79:

Adaptive Stochastic Gradient Descent

• Consider the batch problem
• Simple algorithm:
  – Initialize
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient of the i-th loss term
  – Take a gradient step
  – Update the adaptation matrix A from the accumulated gradients

Duchi, Hazan, Singer, 2010; McMahan and Streeter, 2010
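A sketch of the diagonal variant: per-coordinate steps scaled by the inverse root of the accumulated squared gradients (the gradient oracle grad_i is a hypothetical argument returning the gradient of the i-th loss term):

import numpy as np

def adagrad(grad_i, w0, n, eta=0.1, steps=1000, eps=1e-8, seed=0):
    """Diagonal AdaGrad: w <- w - eta * g / sqrt(sum of past g^2), per coordinate."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                   # running sum of squared gradients
    for _ in range(steps):
        i = int(rng.integers(n))           # pick a random index
        g = grad_i(w, i)                   # stochastic gradient of the i-th term (hypothetical oracle)
        G += g * g                         # accumulate per-coordinate squares
        w -= eta * g / (np.sqrt(G) + eps)  # adaptive per-feature step
    return w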

Page 80:

Adaptive Stochastic Gradient Descent

• Very general! Can be used to solve problems with various regularizations

• The matrix A can be either full or diagonal

• Comes with convergence and regret bounds

• Similar performance to AROW

Duchi, Hazan, Singer, 2010; McMahan and Streeter, 2010

Page 81:

Adaptive Stochastic Gradient Descent

[Figure: the SGD update vs. the AdaGrad update]

Duchi, Hazan, Singer, 2010; McMahan and Streeter, 2010

Page 82:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 83:

Kernels

Page 84:

Proof

• Show that we can write the weight vector as a linear combination of the examples
• By induction
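For concreteness, here is the first-order version of that statement as a sketch: since every update adds a multiple of an example, w stays in the span of the data and predictions need only inner products, which a kernel can replace (the tutorial's induction extends this to the second-order updates):

import numpy as np

def kernel_perceptron(X, y, kernel, epochs=1):
    """Perceptron in dual form: w = sum_i alpha_i * y_i * x_i is kept
    implicitly, so sign(w.x) is computed from kernel evaluations alone."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for t in range(n):
            score = sum(alpha[i] * y[i] * kernel(X[i], X[t]) for i in range(n))
            if y[t] * score <= 0:      # mistake in the (implicit) feature space
                alpha[t] += 1.0        # add (x_t, y_t) to the combination
    return alpha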

Page 85:

Proof (cntd)

• By the update rule:
• Thus:

Page 86:

Proof (cntd)

• By the update rule:

Page 87:

Proof (cntd)

• Thus:

Page 88:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 89:

Statistical Interpretation

• Margin constraint: y(μ·x) ≥ φ √(xᵀΣx)
• Distribution over weight vectors: w ~ N(μ, Σ)
• Equivalent view: assume the input is corrupted with Gaussian noise

Page 90:

Statistical Interpretation

[Figure: version space (example, mean weight vector) and input space (input instance, linear separator), with good and bad realizations of the noisy separator]

Page 91:

Mistake Bound

• For any reference weight vector, the number of mistakes made by AROW is upper bounded by a quantity that depends on:
  – the set of example indices with a mistake
  – the set of example indices with an update but not a mistake

Orabona and Crammer, NIPS 2010

Page 92:

Comment I

• Separable case and no updates:

Page 93:

Comment II

• For large r, the bound becomes:

• When no updates are performed: the Perceptron bound

Page 94:

Bound for Diagonal Algorithm

• The number of mistakes is bounded by:
• The bound is low when a feature is either rare or non-informative
• Exactly as in NLP …

Orabona and Crammer, NIPS 2010

Page 95:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 96:

Synthetic Data

• 20 features
• 2 informative (rotated, skewed Gaussian)
• 18 noisy
• Using a single feature is as good as random prediction

Page 97:

Synthetic Data (cntd.)

Distribution after 50 examples (x1)

Page 98:

Synthetic Data (no noise)

[Plot: learning curves for Perceptron, PA, SOP, CW-full, and CW-diag]

Page 99:

Synthetic Data (10% noise)

Page 100:

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic
  – Real data

Page 101:

Data

• Sentiment
  – Sentiment reviews from 6 Amazon domains (Blitzer et al.)
  – Classify a product review as either positive or negative
• Reuters, pairs of labels
  – Three divisions: Insurance (Life vs. Non-Life), Business Services (Banking vs. Financial), Retail Distribution (Specialist Stores vs. Mixed Retail)
  – Bag-of-words representation with binary features
• 20 News Groups, pairs of labels
  – Three divisions: comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, sci.electronics vs. sci.med, and talk.politics.guns vs. talk.politics.mideast
  – Bag-of-words representation with binary features

Page 102:

Experimental Design

• Online-to-batch:
  – Multiple passes over the training data
  – Evaluate on a separate test set after each pass
  – Compute error/accuracy
• Set parameters using held-out data
• 10-fold cross-validation
• ~2000 instances per problem
• Balanced class labels

Page 103:

Results vs Online- Sentiment

• StdDev and Variance: always better than the baseline
• Variance: 5/6 significantly better

Page 104:

Results vs Online – 20NG + Reuters

• StdDev and Variance: always better than the baseline
• Variance: 4/6 significantly better

Page 105:

Results vs Batch - Sentiment

• Always better than the batch methods
• 3/6 significantly better

Page 106:

Results vs Batch - 20NG + Reuters

• 5/6 better than the batch methods
• 3/5 significantly better, 1/1 significantly worse

Page 107:

Page 108:

Page 109:

Page 110:

Results - Sentiment

• CW is better (5/6 cases), statistically significant in 4/6

• CW benefits less from many passes

[Plots: accuracy vs. passes of training data for each dataset, comparing online PA (O PA) and online CW (O CW)]

Page 111:

Results – Reuters + 20NG

• CW is better (5/6 cases), statistically significant in 4/6

• CW benefits less from many passes

[Plots: accuracy vs. passes of training data for each dataset, comparing online PA (O PA) and online CW (O CW)]

Page 112:

Error Reduction by Multiple Passes

• PA benefits more from multiple passes (8/12)

• The amount of benefit is data dependent

Page 113:

Bayesian Logistic Regression

• BLR (T. Jaakkola and M. Jordan, 1997), based on the variational approximation:
  – Covariance
  – Mean
• CW/AROW, a conceptually decoupled update, a function of the margin/hinge loss:
  – Covariance
  – Mean

Page 114:

Algorithms Summary

1st Order                  2nd Order
Perceptron                 SOP
PA                         CW+AROW
SGD                        AdaGrad
Logistic Regression (LR)   BLR

• Different motivations, similar algorithms

• All algorithms can be kernelized

• Work well for data that is NOT isotropic/symmetric

• State-of-the-art results in various domains

• Accompanied by theory