Introduction to Machine Learning, Fall 2013
Perceptron (6)
Prof. Koby Crammer
Department of Electrical Engineering
Technion
Online Learning
Tyrannosaurus rex
Online Learning
Triceratops
Online Learning
Tyrannosaurus rex
Velociraptor
Formal Setting – Binary Classification
• Instances – Images, Sentences
• Labels – Parse trees, Names
• Prediction rule – Linear prediction rules
• Loss – No. of mistakes
Online Framework
• Initialize the classifier
• The algorithm works in rounds
• On each round the online algorithm:
– Receives an input instance
– Outputs a prediction
– Receives a feedback label
– Computes the loss
– Updates the prediction rule
• Goal: suffer small cumulative loss (see the sketch below)
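A minimal sketch of this protocol in Python (the function names, the data stream, and the zero-one loss used here are placeholders of mine, not taken from the slides):

    # Generic online learning protocol: predict, get feedback, suffer loss, update.
    def online_learning(stream, predict, update, w_init):
        w = w_init                                # initialize the classifier
        cumulative_loss = 0
        for x_t, y_t in stream:                   # rounds t = 1, 2, ...
            y_hat = predict(w, x_t)               # output a prediction
            cumulative_loss += int(y_hat != y_t)  # here: zero-one loss (no. of mistakes)
            w = update(w, x_t, y_t, y_hat)        # update the prediction rule
        return w, cumulative_loss                 # goal: small cumulative loss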
Why Online Learning?
• Fast
• Memory efficient – processes one example at a time
• Simple to implement
• Formal guarantees – mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive
• Not as good as well-designed batch algorithms
Update Rules
• Online algorithms are based on an update rule which defines the next classifier from the current one (and possibly other information)
• Linear classifiers: find the next weight vector from the current one based on the input example
• Some update rules:
– Perceptron (Rosenblatt)
– ALMA (Gentile)
– ROMMA (Li & Long)
– NORMA (Kivinen et al.)
– MIRA (Crammer & Singer)
– EG (Littlestone & Warmuth)
– Bregman-based (Warmuth)
– Numerous Online Convex Programming algorithms
Today
• The Perceptron Algorithm:
– Agmon 1954
– Rosenblatt 1952-1962
– Block 1962, Novikoff 1962
– Minsky & Papert 1969
– Freund & Schapire 1999
– Blum & Dunagan 2002
The Perceptron Algorithm
• Prediction: ŷ_t = sign(w_t · x_t)
• If no mistake (ŷ_t = y_t):
– Do nothing: w_{t+1} = w_t
• If mistake (ŷ_t ≠ y_t):
– Update: w_{t+1} = w_t + y_t x_t
• Margin after update: y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + ‖x_t‖², so the margin on x_t increases (see the code sketch below)
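A minimal NumPy sketch of this algorithm (variable names are mine, not the slides'):

    import numpy as np

    def perceptron(examples, dim):
        """Run the Perceptron over a sequence of (x_t, y_t) pairs, y_t in {-1, +1}."""
        w = np.zeros(dim)                              # start from the zero vector
        mistakes = 0
        for x_t, y_t in examples:
            y_hat = 1 if np.dot(w, x_t) >= 0 else -1   # prediction: sign(w . x_t)
            if y_hat != y_t:                           # mistake
                w = w + y_t * x_t                      # update: w <- w + y_t * x_t
                mistakes += 1                          # margin on x_t grows by ||x_t||^2
            # no mistake: do nothing
        return w, mistakes

For example, perceptron([(np.array([1.0, 2.0]), +1), (np.array([-1.0, 0.5]), -1)], dim=2) returns the learned weight vector together with the number of mistakes made along the sequence.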
Geometrical Interpretation
Relative Loss Bound
• For any competitor prediction function
• We bound the loss suffered by the algorithm with the loss suffered by the competitor
• The quantities in the bound:
– Cumulative loss suffered by the algorithm; the sequence of prediction functions; cumulative loss of the competitor
– Inequality (possibly with a large gap); regret / extra loss; competitiveness ratio
– Two terms grow with T, one is a constant
– The competitor: the best prediction function in hindsight for the data sequence
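As a sketch of the general shape these labels describe (generic constants, an assumption rather than the slide's exact statement):

\[
\underbrace{\sum_{t=1}^{T} \ell\big(w_t;(x_t,y_t)\big)}_{\substack{\text{cumulative loss of the algorithm}\\ \text{(grows with } T\text{)}}}
\;\le\;
\underbrace{C}_{\substack{\text{competitiveness}\\ \text{ratio}}} \cdot
\underbrace{\sum_{t=1}^{T} \ell\big(u;(x_t,y_t)\big)}_{\substack{\text{cumulative loss of the competitor } u\\ \text{(grows with } T\text{)}}}
\;+\;
\underbrace{B}_{\text{constant}}
\]

where $u$ may be the best prediction function in hindsight, and the gap between the two sides (the extra loss) is the regret.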
Remarks
• If the input is inseparable, then the problem of finding a separating hyperplane which attains fewer than M errors is NP-hard (Open Hemisphere)
• Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant-factor approximation for the Open Hemisphere problem
• Instead, we bound the number of mistakes the Perceptron makes by the hinge loss of any competitor
Definitions
• Any competitor
• The competitor's parameter vector can be chosen using the input data
• The parameterized hinge loss of the competitor on an example
• True hinge loss
• 1-norm of the hinge loss (see the definitions sketched below)
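A sketch of the standard definitions behind these labels (the notation is assumed, not taken verbatim from the slide): for a competitor $u$, a margin parameter $\gamma > 0$ and an example $(x_t, y_t)$,

\[
\ell_\gamma\big(u;(x_t,y_t)\big) = \max\big\{0,\; \gamma - y_t\,(u\cdot x_t)\big\}
\qquad\text{(parameterized hinge loss)},
\]
\[
\ell\big(u;(x_t,y_t)\big) = \ell_{1}\big(u;(x_t,y_t)\big)
\qquad\text{(true hinge loss, } \gamma = 1\text{)},
\]
\[
L_\gamma(u) = \sum_{t=1}^{T} \ell_\gamma\big(u;(x_t,y_t)\big)
\qquad\text{(1-norm, i.e. sum, of the hinge losses over the sequence)}.
\]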
Geometrical Assumption
• All examples are bounded in a ball of radius R, i.e. ‖x_t‖ ≤ R for all t
Perceptron’s Mistake Bound
• Bounds: see the sketch below
• If the sample is separable, the hinge-loss term vanishes and the bound depends only on R and the norm of the competitor
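A standard form of these bounds, consistent with [FS99, SS05] (the statement on the slide may use slightly different constants): if $\|x_t\| \le R$ for all $t$ and $M$ is the number of mistakes, then for any competitor $u$

\[
M \;\le\; \Big( R\,\|u\| + \sqrt{L_1(u)} \Big)^{2},
\]

and if the sample is separable by $u$ with $y_t\,(u\cdot x_t) \ge 1$ for all $t$ (so $L_1(u) = 0$), this reduces to $M \le R^{2}\,\|u\|^{2}$.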
Proof - Intuition
• Two views:
– The angle between the competitor and w_t decreases with the number of mistakes
– The following sum is fixed
• As we make more mistakes, our solution is better
[FS99, SS05]
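A short sketch of the first view, in the spirit of [FS99], for the separable case with margin $\gamma$ (assumed notation): every mistake moves $w_t$ toward the competitor faster than it grows in length,

\[
u\cdot w_{t+1} = u\cdot w_t + y_t\,(u\cdot x_t) \;\ge\; u\cdot w_t + \gamma,
\qquad
\|w_{t+1}\|^{2} \;\le\; \|w_t\|^{2} + R^{2},
\]

so after $M$ mistakes $\cos\angle(u, w) \ge \sqrt{M}\,\gamma / (\|u\|\,R)$, which cannot exceed $1$.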
Proof
• Define the potential (one candidate is sketched below)
• Bound its cumulative sum from above and below
[C04]
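One natural potential consistent with the annotations on the next slides (an assumption used for the sketches below) is the squared distance to the competitor:

\[
\Delta_t = \|u - w_t\|^{2},
\qquad\text{and the proof bounds } \sum_{t=1}^{T}\big(\Delta_t - \Delta_{t+1}\big) \text{ from above and below.}
\]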
Proof
• Bound from above:
– Telescopic sum; the potential is non-negative; the initial classifier is the zero vector
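With the assumed potential, the upper bound follows exactly from those three observations:

\[
\sum_{t=1}^{T}\big(\Delta_t - \Delta_{t+1}\big)
= \Delta_1 - \Delta_{T+1}
\;\le\; \Delta_1
= \|u - w_1\|^{2}
= \|u\|^{2},
\]

a telescopic sum, $\Delta_{T+1} \ge 0$, and $w_1$ the zero vector.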
Proof
• Bound from below:
– No error on the t-th round
– Error on the t-th round
• We bound each term
• Cumulative bound: (see the sketch below)
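Continuing the sketch with the assumed potential: on a round with no error $w_{t+1} = w_t$, so the term is zero; on an error round the update $w_{t+1} = w_t + y_t x_t$ gives

\[
\Delta_t - \Delta_{t+1}
= 2\,y_t\,x_t\cdot(u - w_t) - \|x_t\|^{2}
\;\ge\; 2\big(\gamma - \ell_\gamma(u;(x_t,y_t))\big) - R^{2},
\]

using $y_t\,(w_t\cdot x_t) \le 0$ on a mistake and $\|x_t\| \le R$. Summing over all rounds, with $M$ mistakes in total:

\[
\sum_{t=1}^{T}\big(\Delta_t - \Delta_{t+1}\big)
\;\ge\; (2\gamma - R^{2})\,M - 2\,L_\gamma(u).
\]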
Proof
• Putting both bounds together:
• We use the first degree of freedom (and scale):
• Bound: (see the sketch below)
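Putting the two bounds together in the assumed-potential sketch, and then spending the first (scaling) degree of freedom:

\[
(2\gamma - R^{2})\,M - 2\,L_\gamma(u) \;\le\; \|u\|^{2}.
\]

Since $u$ and $\gamma$ can be rescaled together, replacing $u$ by $R^{2}u$ and taking $\gamma = R^{2}$ gives $L_{R^2}(R^{2}u) = R^{2} L_1(u)$ and

\[
R^{2} M \;\le\; R^{4}\|u\|^{2} + 2R^{2} L_1(u).
\]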
Proof
• General bound:
• Choose the scale:
• Simple bound (see the sketch below): the right-hand side is the objective of an SVM
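Dividing by $R^{2}$ yields the simple bound, whose right-hand side is (twice) an SVM-style objective in $u$:

\[
M \;\le\; R^{2}\|u\|^{2} + 2\,L_1(u)
\;=\; 2\left(\tfrac{1}{2}R^{2}\|u\|^{2} + \sum_{t=1}^{T}\ell_1\big(u;(x_t,y_t)\big)\right).
\]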
Proof
• Better bound: optimize the value of the free scale (see the sketch below)
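Optimizing over the free scale instead of fixing it (still within the assumed sketch): writing the general bound with $u$ replaced by $c\,u$ and $\gamma = c$, it must hold for every $c > 0$ that

\[
c^{2}\|u\|^{2} - 2c\big(M - L_1(u)\big) + M R^{2} \;\ge\; 0,
\]

so $M - L_1(u) \le R\|u\|\sqrt{M}$, which gives the tighter bound

\[
M \;\le\; \Big(R\|u\| + \sqrt{L_1(u)}\Big)^{2}.
\]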
Remarks
• The bound does not depend on the dimension of the feature vector
• The bound holds for all sequences. It is not tight for most real-world data
• But there exists a setting for which it is tight – a worst-case sequence
Three Bounds
[Figure: plot comparing the three bounds]
Separable Case
• Assume there exists a competitor that classifies all examples correctly with a positive margin. Then all bounds are equivalent
• The Perceptron makes a finite number of mistakes until convergence (not necessarily to that competitor)
Separable Case – Other Quantities
• Use the 1st (parameterization) degree of freedom
• Scale the competitor
• Define the margin
• The bound becomes the classic separable-case bound (see the sketch below)
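One standard instantiation (the exact normalization used on the slide is assumed here): scale the competitor to unit norm and define the margin as the smallest signed score,

\[
\|u\| = 1, \qquad \gamma = \min_t \; y_t\,(u\cdot x_t) > 0
\qquad\Longrightarrow\qquad
M \;\le\; \left(\frac{R}{\gamma}\right)^{2}.
\]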
Separable Case – Illustration
• For a smaller margin, the Perceptron will make more mistakes
• Finding a separating hyperplane is more difficult
Inseparable Case
• A difficult problem implies a large value of the cumulative hinge loss
• In this case the Perceptron will make a large number of mistakes
Perceptron Algorithm
• Extremely easy to implement
• Relative loss bounds for the separable and inseparable cases; minimal assumptions (not i.i.d.)
• Easy to convert to a well-performing batch algorithm (under i.i.d. assumptions)
• Quantities in the bound are not compatible: number of mistakes vs. hinge loss
• Margin of examples is ignored by the update
• Same update for the separable case and the inseparable case
Concluding Remarks
• Batch vs. Online
– Batch: two phases, training and then test
– Online: a single continuous process
• Statistical assumptions
– Batch: a distribution over examples
– Online: all sequences
• Conversions
– Online → Batch
– Batch → Online