Introduction to Machine Learning, Fall 2013
Perceptron (6)
Prof. Koby Crammer
Department of Electrical Engineering
Technion
Online Learning
Tyrannosaurus rex
Online Learning
Triceratops
Online Learning
Tyrannosaurus rex
Velociraptor
Formal Setting – Binary Classification
• Instances – Images, Sentences
• Labels – Parse trees, Names
• Prediction rule – Linear prediction rules
• Loss – No. of mistakes
Online Framework
• Initialize the classifier
• The algorithm works in rounds
• On each round the online algorithm:
– Receives an input instance
– Outputs a prediction
– Receives a feedback label
– Computes the loss
– Updates the prediction rule
• Goal: suffer small cumulative loss (see the sketch below)
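A minimal sketch of this protocol in Python (the function names, the data stream, and the zero-one loss used here are placeholders of mine, not taken from the slides):

    # Generic online learning protocol: predict, get feedback, suffer loss, update.
    def online_learning(stream, predict, update, w_init):
        w = w_init                                # initialize the classifier
        cumulative_loss = 0
        for x_t, y_t in stream:                   # rounds t = 1, 2, ...
            y_hat = predict(w, x_t)               # output a prediction
            cumulative_loss += int(y_hat != y_t)  # here: zero-one loss (no. of mistakes)
            w = update(w, x_t, y_t, y_hat)        # update the prediction rule
        return w, cumulative_loss                 # goal: small cumulative loss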
Why Online Learning?
• Fast
• Memory efficient – processes one example at a time
• Simple to implement
• Formal guarantees – mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive
• Not as good as well-designed batch algorithms
Update Rules
• Online algorithms are based on an update rule which defines the next classifier from the current one (and possibly other information)
• Linear classifiers: find the next weight vector from the current one based on the input example
• Some update rules:
– Perceptron (Rosenblatt)
– ALMA (Gentile)
– ROMMA (Li & Long)
– NORMA (Kivinen et al.)
– MIRA (Crammer & Singer)
– EG (Littlestone & Warmuth)
– Bregman-based (Warmuth)
– Numerous Online Convex Programming algorithms
Today
• The Perceptron Algorithm:
– Agmon 1954
– Rosenblatt 1952-1962
– Block 1962, Novikoff 1962
– Minsky & Papert 1969
– Freund & Schapire 1999
– Blum & Dunagan 2002
The Perceptron Algorithm
• Prediction: ŷ_t = sign(w_t · x_t)
• If no mistake (ŷ_t = y_t):
– Do nothing: w_{t+1} = w_t
• If mistake (ŷ_t ≠ y_t):
– Update: w_{t+1} = w_t + y_t x_t
• Margin after update: y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + ‖x_t‖², so the margin on x_t increases (see the code sketch below)
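A minimal NumPy sketch of this algorithm (variable names are mine, not the slides'):

    import numpy as np

    def perceptron(examples, dim):
        """Run the Perceptron over a sequence of (x_t, y_t) pairs, y_t in {-1, +1}."""
        w = np.zeros(dim)                              # start from the zero vector
        mistakes = 0
        for x_t, y_t in examples:
            y_hat = 1 if np.dot(w, x_t) >= 0 else -1   # prediction: sign(w . x_t)
            if y_hat != y_t:                           # mistake
                w = w + y_t * x_t                      # update: w <- w + y_t * x_t
                mistakes += 1                          # margin on x_t grows by ||x_t||^2
            # no mistake: do nothing
        return w, mistakes

For example, perceptron([(np.array([1.0, 2.0]), +1), (np.array([-1.0, 0.5]), -1)], dim=2) returns the learned weight vector together with the number of mistakes made along the sequence.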
Geometrical Interpretation
Relative Loss Bound
• For any competitor prediction function
• We bound the loss suffered by the algorithm with the loss suffered by the competitor
• The quantities in the bound:
– Cumulative loss suffered by the algorithm; the sequence of prediction functions; cumulative loss of the competitor
– Inequality (possibly with a large gap); regret / extra loss; competitiveness ratio
– Two terms grow with T, one is a constant
– The competitor: the best prediction function in hindsight for the data sequence
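As a sketch of the general shape these labels describe (generic constants, an assumption rather than the slide's exact statement):

\[
\underbrace{\sum_{t=1}^{T} \ell\big(w_t;(x_t,y_t)\big)}_{\substack{\text{cumulative loss of the algorithm}\\ \text{(grows with } T\text{)}}}
\;\le\;
\underbrace{C}_{\substack{\text{competitiveness}\\ \text{ratio}}} \cdot
\underbrace{\sum_{t=1}^{T} \ell\big(u;(x_t,y_t)\big)}_{\substack{\text{cumulative loss of the competitor } u\\ \text{(grows with } T\text{)}}}
\;+\;
\underbrace{B}_{\text{constant}}
\]

where $u$ may be the best prediction function in hindsight, and the gap between the two sides (the extra loss) is the regret.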
Remarks
• If the input is inseparable, then the problem of finding a separating hyperplane which attains fewer than M errors is NP-hard (Open Hemisphere)
• Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant-factor approximation for the Open Hemisphere problem
• Instead, we bound the number of mistakes the Perceptron makes by the hinge loss of any competitor
Definitions
• Any competitor
• The competitor's parameter vector can be chosen using the input data
• The parameterized hinge loss of the competitor on an example
• True hinge loss
• 1-norm of the hinge loss (see the definitions sketched below)
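A sketch of the standard definitions behind these labels (the notation is assumed, not taken verbatim from the slide): for a competitor $u$, a margin parameter $\gamma > 0$ and an example $(x_t, y_t)$,

\[
\ell_\gamma\big(u;(x_t,y_t)\big) = \max\big\{0,\; \gamma - y_t\,(u\cdot x_t)\big\}
\qquad\text{(parameterized hinge loss)},
\]
\[
\ell\big(u;(x_t,y_t)\big) = \ell_{1}\big(u;(x_t,y_t)\big)
\qquad\text{(true hinge loss, } \gamma = 1\text{)},
\]
\[
L_\gamma(u) = \sum_{t=1}^{T} \ell_\gamma\big(u;(x_t,y_t)\big)
\qquad\text{(1-norm, i.e. sum, of the hinge losses over the sequence)}.
\]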
Geometrical Assumption
• All examples are bounded in a ball of radius R, i.e. ‖x_t‖ ≤ R for all t
Perceptron’s Mistake Bound
• Bounds: see the sketch below
• If the sample is separable, the hinge-loss term vanishes and the bound depends only on R and the norm of the competitor
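A standard form of these bounds, consistent with [FS99, SS05] (the statement on the slide may use slightly different constants): if $\|x_t\| \le R$ for all $t$ and $M$ is the number of mistakes, then for any competitor $u$

\[
M \;\le\; \Big( R\,\|u\| + \sqrt{L_1(u)} \Big)^{2},
\]

and if the sample is separable by $u$ with $y_t\,(u\cdot x_t) \ge 1$ for all $t$ (so $L_1(u) = 0$), this reduces to $M \le R^{2}\,\|u\|^{2}$.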
Proof - Intuition
• Two views:
– The angle between the competitor and w_t decreases with the number of mistakes
– The following sum is fixed
• As we make more mistakes, our solution is better
[FS99, SS05]
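A short sketch of the first view, in the spirit of [FS99], for the separable case with margin $\gamma$ (assumed notation): every mistake moves $w_t$ toward the competitor faster than it grows in length,

\[
u\cdot w_{t+1} = u\cdot w_t + y_t\,(u\cdot x_t) \;\ge\; u\cdot w_t + \gamma,
\qquad
\|w_{t+1}\|^{2} \;\le\; \|w_t\|^{2} + R^{2},
\]

so after $M$ mistakes $\cos\angle(u, w) \ge \sqrt{M}\,\gamma / (\|u\|\,R)$, which cannot exceed $1$.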
Proof
• Define the potential (one candidate is sketched below)
• Bound its cumulative sum from above and below
[C04]
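One natural potential consistent with the annotations on the next slides (an assumption used for the sketches below) is the squared distance to the competitor:

\[
\Delta_t = \|u - w_t\|^{2},
\qquad\text{and the proof bounds } \sum_{t=1}^{T}\big(\Delta_t - \Delta_{t+1}\big) \text{ from above and below.}
\]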
Proof
• Bound from above:
– Telescopic sum; the potential is non-negative; the initial classifier is the zero vector
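With the assumed potential, the upper bound follows exactly from those three observations:

\[
\sum_{t=1}^{T}\big(\Delta_t - \Delta_{t+1}\big)
= \Delta_1 - \Delta_{T+1}
\;\le\; \Delta_1
= \|u - w_1\|^{2}
= \|u\|^{2},
\]

a telescopic sum, $\Delta_{T+1} \ge 0$, and $w_1$ the zero vector.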
Proof
• Bound from below:
– No error on the t-th round
– Error on the t-th round
• We bound each term
• Cumulative bound: (see the sketch below)
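Continuing the sketch with the assumed potential: on a round with no error $w_{t+1} = w_t$, so the term is zero; on an error round the update $w_{t+1} = w_t + y_t x_t$ gives

\[
\Delta_t - \Delta_{t+1}
= 2\,y_t\,x_t\cdot(u - w_t) - \|x_t\|^{2}
\;\ge\; 2\big(\gamma - \ell_\gamma(u;(x_t,y_t))\big) - R^{2},
\]

using $y_t\,(w_t\cdot x_t) \le 0$ on a mistake and $\|x_t\| \le R$. Summing over all rounds, with $M$ mistakes in total:

\[
\sum_{t=1}^{T}\big(\Delta_t - \Delta_{t+1}\big)
\;\ge\; (2\gamma - R^{2})\,M - 2\,L_\gamma(u).
\]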
Proof
• Putting both bounds together:
• We use the first degree of freedom (and scale):
• Bound: (see the sketch below)
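Putting the two bounds together in the assumed-potential sketch, and then spending the first (scaling) degree of freedom:

\[
(2\gamma - R^{2})\,M - 2\,L_\gamma(u) \;\le\; \|u\|^{2}.
\]

Since $u$ and $\gamma$ can be rescaled together, replacing $u$ by $R^{2}u$ and taking $\gamma = R^{2}$ gives $L_{R^2}(R^{2}u) = R^{2} L_1(u)$ and

\[
R^{2} M \;\le\; R^{4}\|u\|^{2} + 2R^{2} L_1(u).
\]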
Proof
• General bound:
• Choose the scale:
• Simple bound (see the sketch below): the right-hand side is the objective of an SVM
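Dividing by $R^{2}$ yields the simple bound, whose right-hand side is (twice) an SVM-style objective in $u$:

\[
M \;\le\; R^{2}\|u\|^{2} + 2\,L_1(u)
\;=\; 2\left(\tfrac{1}{2}R^{2}\|u\|^{2} + \sum_{t=1}^{T}\ell_1\big(u;(x_t,y_t)\big)\right).
\]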
Proof
• Better bound: optimize the value of the free scale (see the sketch below)
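Optimizing over the free scale instead of fixing it (still within the assumed sketch): writing the general bound with $u$ replaced by $c\,u$ and $\gamma = c$, it must hold for every $c > 0$ that

\[
c^{2}\|u\|^{2} - 2c\big(M - L_1(u)\big) + M R^{2} \;\ge\; 0,
\]

so $M - L_1(u) \le R\|u\|\sqrt{M}$, which gives the tighter bound

\[
M \;\le\; \Big(R\|u\| + \sqrt{L_1(u)}\Big)^{2}.
\]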
Remarks
• The bound does not depend on the dimension of the feature vector
• The bound holds for all sequences. It is not tight for most real-world data
• But there exists a setting for which it is tight – a worst-case sequence
Three Bounds
[Figure: plot comparing the three bounds]
Separable Case
• Assume there exists a competitor that classifies all examples correctly with a positive margin. Then all bounds are equivalent
• The Perceptron makes a finite number of mistakes until convergence (not necessarily to that competitor)
Separable Case – Other Quantities
• Use the 1st (parameterization) degree of freedom
• Scale the competitor
• Define the margin
• The bound becomes the classic separable-case bound (see the sketch below)
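One standard instantiation (the exact normalization used on the slide is assumed here): scale the competitor to unit norm and define the margin as the smallest signed score,

\[
\|u\| = 1, \qquad \gamma = \min_t \; y_t\,(u\cdot x_t) > 0
\qquad\Longrightarrow\qquad
M \;\le\; \left(\frac{R}{\gamma}\right)^{2}.
\]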
Separable Case – Illustration
• For a smaller margin, the Perceptron will make more mistakes
• Finding a separating hyperplane is more difficult
Inseparable Case
• A difficult problem implies a large value of the cumulative hinge loss
• In this case the Perceptron will make a large number of mistakes
Perceptron Algorithm
• Extremely easy to implement
• Relative loss bounds for the separable and inseparable cases; minimal assumptions (not i.i.d.)
• Easy to convert to a well-performing batch algorithm (under i.i.d. assumptions)
• Quantities in the bound are not compatible: number of mistakes vs. hinge loss
• Margin of examples is ignored by the update
• Same update for the separable case and the inseparable case
Concluding Remarks
• Batch vs. Online
– Batch: two phases, training and then test
– Online: a single continuous process
• Statistical assumptions
– Batch: a distribution over examples
– Online: all sequences
• Conversions
– Online → Batch
– Batch → Online