STATISTICAL LEARNING THEORY AND CLASSIFICATION BASED ON SUPPORT VECTOR MACHINES
Based on:
The Nature of Statistical Learning Theory by V. Vapnik
2009 Presentation by John DiMona
and some slides based on lectures given by Professor Andrew Moore of Carnegie Mellon University
Presentation by Michael Sullivan
EMPIRICAL DATA MODELING
Observations of a system are collected.
Based on these observations, a process of induction is used to build up a model of the system.
This model is then used to deduce responses of the system not yet observed.
EMPIRICAL DATA MODELING
Data obtained through observation is finite and sampled by nature.
Typically this sampling is non-uniform.
Due to the high-dimensional nature of some problems, the data will form only a sparse distribution in the input space.
Creating a model from this type of data is an ill-posed problem.
EMPIRICAL DATA MODELING
Figure: the selected model, the best reachable model, and the globally optimal model.
The goal in modeling is to choose a model from the hypothesis space, which is closest (with respect to some error measure) to the underlying function in the target space.
MODELING ERROR
Approximation Error is a consequence of the hypothesis space not exactly fitting the target space; the underlying function may lie outside the hypothesis space.
A poor choice of the model space will result in a large approximation error (model mismatch).
Estimation Error is the error due to the learning procedure converging to a non-optimal model in the hypothesis space.
Together these form the Generalization Error.
EMPIRICAL DATA MODELING
Figure: the same diagram as before, now annotated with the error terms: the Generalization Error, composed of the Approximation Error (between the globally optimal model and the best reachable model) and the Estimation Error (between the best reachable model and the selected model).
WHAT IS STATISTICAL LEARNING?
Definition: “Consider the learning problem as a problem of finding a desired dependence using a limited number of observations.” (Vapnik 17)
MODEL OF SUPERVISED LEARNING
Training: The supervisor takes each generated x value and returns an output value y.
Each (x, y) pair is drawn from the joint distribution F(x, y) = F(x)F(y|x) and is part of the training set: (x1, y1), …, (xl, yl).
MODEL OF SUPERVISED LEARNING
Goal: choose the learning machine’s estimation function f(x, α) that best approximates the supervisor’s response y.
Once we have the estimation function, we can classify new and unseen data.
RISK MINIMIZATION
To find the best function, we need to measure loss. The loss L(y, f(x, α)) is the discrepancy between the response y generated by the supervisor and the response f(x, α) generated by the estimation function.
RISK MINIMIZATION
To do this, we calculate the risk functional:
R(α) = ∫ L(y, f(x, α)) dF(x, y)
We choose the function f(x, α) that minimizes the risk functional R(α) over the class of functions f(x, α), α ∈ Λ.
Remember, F(x,y) is unknown except for the information contained in the training set.
RISK MINIMIZATION WITH PATTERN RECOGNITION
With pattern recognition, the supervisor’s output y can only take on two values, y ∈ {0, 1}, and the loss takes the following values:
L(y, f(x, α)) = 0 if y = f(x, α), and 1 if y ≠ f(x, α).
So the risk functional determines the probability of different answers being given by the supervisor and the estimation function.
RISK MINIMIZATION
The expected value of loss with respect to some estimation function f(x, α):
R(α) = ∫ L(y, f(x, α)) dP(x, y)
where
P(x, y) = P(x) P(y|x)
Problem: we still don’t know P(x, y).
TO SIMPLIFY THESE TERMS…
From this point on, we’ll refer to the training set
{(x1, y1), (x2, y2), …, (xl, yl)} as {z1, z2, …, zl},
and we’ll refer to the loss functional L(y, f(x, α)) as Q(z, α).
EMPIRICAL RISK MINIMIZATION (ERM)
Instead of measuring risk over all of P(x, y), just measure it over the training set, giving the empirical risk functional
R_emp(α) = (1/l) Σ_{i=1}^{l} Q(zi, α)
The empirical risk must converge uniformly to the actual risk over the set of loss functions Q(z, α), α ∈ Λ, in both directions:
sup_{α ∈ Λ} |R(α) − R_emp(α)| → 0 (in probability) as l → ∞.
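A minimal sketch of the ERM principle, assuming a toy family of one-dimensional threshold indicator functions f(x, α) = [x > α]; the family, the data, and the candidate grid are illustrative, not from the slides:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0-1 loss Q(z, alpha): 1 where the prediction disagrees with the supervisor."""
    return (y_true != y_pred).astype(float)

def empirical_risk(f, alpha, X, y):
    """R_emp(alpha) = (1/l) * sum of Q(z_i, alpha) over the training set."""
    return zero_one_loss(y, f(X, alpha)).mean()

# Hypothetical family of 1-D threshold indicators f(x, alpha) = [x > alpha]
f = lambda X, alpha: (X > alpha).astype(int)

X = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([0, 0, 0, 1, 1, 1])

# ERM: pick the alpha in a candidate set that minimizes empirical risk
alphas = np.linspace(0, 1, 101)
risks = [empirical_risk(f, a, X, y) for a in alphas]
best = alphas[int(np.argmin(risks))]
print(f"ERM choice: alpha = {best:.2f}, R_emp = {min(risks):.2f}")
```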
SO WHAT DOES LEARNING THEORY NEED TO ADDRESS?
i. What are the (necessary and sufficient) conditions for consistency of a learning process based on the ERM principle?
ii. How fast is the rate of convergence of the learning process?
iii. How can one control the rate of convergence (the generalization ability) of the learning process?
iv. How can one construct algorithms that can control the generalization ability?
VC DIMENSION (VAPNIK–CHERVONENKIS)
The VC dimension is a scalar value that measures the capacity of a set of functions.
The VC dimension of a set of functions is responsible for the generalization ability of learning machines.
The VC dimension of a set of indicator functions Q(z, α), α ∈ Λ, is the maximum number h of vectors z1, …, zh that can be separated into two classes in all possible ways using functions of the set.
VC DIMENSION
In the plane, 3 vectors can be shattered by linear indicator functions, but not 4, since vectors z2, z4 cannot be separated by a line from vectors z1, z3.
Rule: the set of linear indicator functions in n-dimensional space has VC dimension h = n + 1.
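A brute-force sketch (illustrative, not from the slides) of checking whether a point set is shattered by linear indicator functions: each ±1 labeling is tested for linear separability by solving a linear-program feasibility problem with scipy.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """LP feasibility: is there (w, b) with y_i (w . x_i + b) >= 1 for all i?"""
    n, d = X.shape
    A = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))  # rows: -y_i [x_i, 1]
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    """X is shattered by lines iff every +/-1 labeling is linearly separable."""
    return all(separable(X, np.array(lab))
               for lab in product([-1, 1], repeat=len(X)))

three = np.array([[0, 0], [1, 0], [0, 1]])
four = np.array([[0, 0], [1, 0], [1, 1], [0, 1]])  # z2, z4 vs z1, z3 fails
print(shattered(three))  # True  -> h >= 3 in the plane
print(shattered(four))   # False -> consistent with h = n + 1 = 3
```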
UPPER BOUND FOR RISK
It can be shown that, with probability 1 − η,
R(α) ≤ R_emp(α) + Φ(l/h)
where Φ is the confidence interval and h is the VC dimension.
ERM only minimizes R_emp(α), while Φ, the confidence interval, is fixed, based on the VC dimension of the set of functions determined a priori.
When implementing ERM, one must tune the confidence interval based on the problem to avoid underfitting/overfitting the data.
STRUCTURAL RISK MINIMIZATION (SRM)
SRM attempts to minimize the right-hand side of the inequality over both terms simultaneously.
The first term depends on a specific function’s error, and the second depends on the VC dimension of the space that function is in.
Therefore the VC dimension must be a controlling variable.
STRUCTURAL RISK MINIMIZATION (SRM)
We define our hypothesis space S to be the set of functions Q(z, α), α ∈ Λ.
We impose a structure of nested subsets S1 ⊂ S2 ⊂ … ⊂ Sn, where Sk is the hypothesis space of VC dimension hk, such that h1 ≤ h2 ≤ … ≤ hn.
For a set of observations z1, …, zl, SRM chooses the function minimizing the empirical risk in the subset Sk for which the guaranteed risk is minimal.
STRUCTURAL RISK MINIMIZATION (SRM)
SRM defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function.
As VC dimension increases, the minima of the empirical risks decrease, but the confidence interval increases.
SRM is more general than ERM because it uses the subset Sk for which minimizing the empirical risk R_emp(α) yields the best bound on the actual risk R(α); a model-selection sketch follows.
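The following sketch illustrates SRM-style model selection under stated assumptions: the nested structure is taken to be polynomial threshold functions of increasing degree, h = d + 1 is used for their VC dimension in one dimension, and the confidence-interval formula is a standard Vapnik-style bound with illustrative constants.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 60
X = rng.uniform(-1, 1, l)
y = (np.sin(3 * X) > 0).astype(int)          # hypothetical target dependence

def confidence_interval(h, l, eta=0.05):
    """Illustrative Vapnik-style bound term Phi, not a tight problem-specific one."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

def fit_and_emp_risk(degree, X, y):
    """Fit a degree-d polynomial score by least squares, threshold it,
    and return the empirical 0-1 risk. Degree d indexes the subset S_d."""
    coef = np.polyfit(X, 2 * y - 1, degree)
    y_pred = (np.polyval(coef, X) > 0).astype(int)
    return np.mean(y_pred != y)

best = None
for degree in range(1, 8):                   # S_1 subset of S_2 subset of ...
    h = degree + 1                           # VC dim of degree-d thresholds in 1-D
    bound = fit_and_emp_risk(degree, X, y) + confidence_interval(h, l)
    print(f"degree {degree}: guaranteed risk <= {bound:.3f}")
    if best is None or bound < best[1]:
        best = (degree, bound)
print("SRM picks degree", best[0])
```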
SUPPORT VECTOR CLASSIFICATION
Uses the SRM principle to separate two classes by a linear indicator function induced from the available examples in the training set.
The goal is to produce a classifier that will work well on unseen test examples. We want the classifier with the maximum generalizing capacity, i.e., the lowest risk.
SIMPLEST CASE: LINEAR CLASSIFIERS
How would you classify this data?
SIMPLEST CASE: LINEAR CLASSIFIERS
All of these lines work as linear classifiers
Which one is the best?
SIMPLEST CASE: LINEAR CLASSIFIERS
Define the margin of a linear classifier as the width the boundary can be increased by before hitting a datapoint.
SIMPLEST CASE: LINEAR CLASSIFIERS
We want the maximum margin linear classifier.
This is the simplest SVM, called a linear SVM.
Support vectors are the datapoints the margin pushes up against
SIMPLEST CASE: LINEAR CLASSIFIERS
Figure: the +1 and −1 zones, bounded by the Plus Plane and the Minus Plane.
We can define these two planes by a weight vector w perpendicular to them and an offset b, so that the plus plane is {x : w · x + b = +1} and the minus plane is {x : w · x + b = −1}.
THE OPTIMAL SEPARATING HYPERPLANE
But how can we find the margin M in terms of w and b when the planes are defined as:
Positive plane: w · x + b = +1
Negative plane: w · x + b = −1
Note: the linear classifier itself is the plane w · x + b = 0.
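A tiny sketch of the linear indicator itself; the values of w and b are illustrative placeholders:

```python
import numpy as np

# Illustrative values; any separating hyperplane works the same way.
w = np.array([2.0, 1.0])
b = -3.0

def classify(x):
    """Linear indicator: which side of the hyperplane w . x + b = 0 is x on?"""
    return 1 if np.dot(w, x) + b >= 0 else -1

for x in [np.array([2.0, 1.0]), np.array([0.5, 0.5])]:
    print(x, "->", classify(x), " (w.x + b =", np.dot(w, x) + b, ")")
```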
THE OPTIMAL SEPARATING HYPERPLANE
For points in the +1 zone: w · x + b ≥ +1. For points in the −1 zone: w · x + b ≤ −1.
The margin M is defined as the distance from any point on the minus plane to the closest point on the plus plane.
THE OPTIMAL SEPARATING HYPERPLANE
Take any point x⁻ on the minus plane and let x⁺ be the closest point to it on the plus plane. Then x⁺ = x⁻ + λw for some scalar λ. Why? Because the shortest path between the two planes is perpendicular to them, and w is perpendicular to both.
Since x⁺ lies on the plus plane and x⁻ on the minus plane:
w · x⁺ + b = +1
w · (x⁻ + λw) + b = +1
(w · x⁻ + b) + λ (w · w) = +1
−1 + λ (w · w) = +1
So λ = 2 / (w · w).
The margin is then
M = ‖x⁺ − x⁻‖ = λ‖w‖ = 2 / ‖w‖.
So we want to maximize 2 / ‖w‖,
or minimize (1/2) ‖w‖² (equivalently, (1/2) w · w).
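A minimal sketch (using scikit-learn, which is not part of the slides) that fits a near-hard-margin linear SVM on illustrative data and recovers the margin M = 2/‖w‖ from the learned w:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable 2-D data; the dataset is illustrative.
X = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 3.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C approximates the hard-margin optimal separating hyperplane.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # the weight vector w
b = clf.intercept_[0]     # the offset b
print("w =", w, " b =", b)
print("margin M = 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```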
GENERALIZED OPTIMAL HYPERPLANE
It is possible to extend to non-separable training sets by adding an error (slack) variable ξi ≥ 0 for each training point and minimizing
(1/2) w · w + C Σ_i ξi, subject to yi (w · xi + b) ≥ 1 − ξi.
Data can be split into more than two classifications by using successive runs on the resulting classes.
QUADRATIC PROGRAMMING
Optimization algorithms used to maximize a quadratic function of some real-valued variables subject to linear constraints.
If we were working in the primal, linearly separable world, we’d want to minimize (1/2) w · w.
Now, in the dual, we want to maximize:
W(α) = Σ_i αi − (1/2) Σ_i Σ_j αi αj yi yj (xi · xj)
in the nonnegative quadrant
αi ≥ 0, i = 1, …, l,
under the constraint
Σ_i αi yi = 0.
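A sketch that simply evaluates the dual objective W(α) and checks its constraints with numpy; actually solving the QP is left to a dedicated solver, and the data and α values here are illustrative:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    K = X @ X.T                      # Gram matrix of dot products
    return alpha.sum() - 0.5 * alpha @ ((y[:, None] * y[None, :] * K) @ alpha)

def feasible(alpha, y, tol=1e-9):
    """QP constraints: alpha_i >= 0 (nonnegative quadrant) and sum_i alpha_i y_i = 0."""
    return np.all(alpha >= -tol) and abs(alpha @ y) < tol

# Tiny illustrative problem.
X = np.array([[1.0, 1.0], [4.0, 4.0]])
y = np.array([-1.0, 1.0])
alpha = np.array([0.1, 0.1])         # satisfies sum alpha_i y_i = 0
print(feasible(alpha, y), dual_objective(alpha, X, y))
```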
SUPPORT VECTOR MACHINES (SVM)
Maps the input vectors x into a high-dimensional feature space using a kernel function
In this feature space the optimal separating hyperplane is constructed
HOW DO SV MACHINES HANDLE DATA IN DIFFERENT CIRCUMSTANCES?
Basic one dimensional example?
HOW DO SV MACHINES HANDLE DATA IN DIFFERENT CIRCUMSTANCES?
Easy!
HOW DO SV MACHINES HANDLE DATA IN DIFFERENT CIRCUMSTANCES?
Harder one dimensional example?
HOW DO SV MACHINES HANDLE DATA IN DIFFERENT CIRCUMSTANCES?
Project the lower dimensional training points into higher dimensional space
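A sketch of this projection, assuming the explicit feature map φ(x) = (x, x²) on illustrative one-dimensional data that no single threshold can separate:

```python
import numpy as np
from sklearn.svm import SVC

# A hard 1-D case: the positive class sits between two negative clusters,
# so no single threshold separates it. (Data are illustrative.)
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Project into 2-D with the explicit feature map phi(x) = (x, x^2):
# the classes become linearly separable by a horizontal line.
Phi = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=1e6).fit(Phi, y)
print("training accuracy:", clf.score(Phi, y))   # 1.0: separable after the map
```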
SV MACHINES
How are SV machines implemented?
Polynomial learning machines
Radial basis function machines
Two-layer neural networks
Each of these methods, and every SV machine implementation technique, uses a different kernel function, as the sketch below illustrates.
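A sketch of the three kernel families named above; the functional forms follow the standard definitions, while the parameter values (d, γ, v, c) are illustrative:

```python
import numpy as np

def poly_kernel(x, z, d=3):
    """Polynomial learning machine: K(x, z) = (x . z + 1)^d."""
    return (np.dot(x, z) + 1.0) ** d

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function machine: K(x, z) = exp(-gamma ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, v=0.1, c=-1.0):
    """Two-layer neural network: K(x, z) = tanh(v (x . z) + c)."""
    return np.tanh(v * np.dot(x, z) + c)

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
for k in (poly_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, "=", k(x, z))
```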
TWO-LAYER NEURAL NETWORK APPROACH
The kernel is a sigmoid function:
K(x, xi) = S(v (x · xi) + c)
Implementing the decision rule:
f(x) = sign( Σ_i αi S(v (x · xi) + c) + b )
Using this technique the following are found automatically:
i. Architecture of the two layer machine, determining the number N of units in the first layer (the number of support vectors)
ii. The vectors of the weights in the first layer (the support vectors xi)
iii. The vector of weights for the second layer (the values of αi)
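A sketch of the resulting two-layer machine, reading the decision rule as a network: support vectors as first-layer weights, α values as second-layer weights. All numeric values below are made-up placeholders, not learned values:

```python
import numpy as np

sv = np.array([[1.0, 0.5], [2.0, 2.0], [0.0, 1.0]])   # "first layer" weights
alpha_y = np.array([0.7, -1.2, 0.5])                   # alpha_i * y_i, 2nd layer
v, c, b = 0.5, -1.0, 0.1                               # illustrative parameters

def two_layer_svm(x):
    """First layer: N sigmoid units S(v (x . x_i) + c), one per support vector.
    Second layer: weighted sum with alpha_i y_i, thresholded by sign."""
    hidden = np.tanh(v * (sv @ x) + c)
    return np.sign(alpha_y @ hidden + b)

print(two_layer_svm(np.array([1.0, 1.0])))
```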
HANDWRITTEN DIGIT RECOGNITION
Data used from the U.S. Postal Service database (1990).
The purpose was to experiment on learning the recognition of handwritten digits using different SV machines.
7,300 training patterns and 2,000 test patterns collected from real-life zip codes.
16×16 pixel resolution, giving a 256-dimensional input space.
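A runnable stand-in for this experiment: the USPS database isn’t bundled with scikit-learn, so this sketch uses its 8×8 digits set (a 64-dimensional input space) and compares SV machines with different kernels:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in dataset: scikit-learn's 8x8 digits, not the USPS 16x16 set.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# One SV machine per kernel family, in the spirit of the experiments.
for kernel in ("poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))
```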
CONCLUDING REMARKS ON SV MACHINES
When implementing, the quality of a learning machine is characterized by three main components:
1. How rich and universal is the set of functions that the LM can approximate?
2. How well can the machine generalize?
3. How fast does the learning process for this machine converge?
EXAM QUESTION 1
What are the two components of Generalization Error?
EXAM QUESTION 1
What are the two components of Generalization Error?
Approximation Error and Estimation Error
EXAM QUESTION 2
What is the main difference between Empirical Risk Minimization and Structural Risk Minimization?
EXAM QUESTION 2
What is the main difference between Empirical Risk Minimization and Structural Risk Minimization?
ERM: Keep the confidence interval fixed (chosen a priori) while minimizing empirical risk
SRM: Minimize both the confidence interval and the empirical risk simultaneously
EXAM QUESTION 3
What differs between SVM implementations?
e.g., polynomial learning machines, radial basis function machines, and two-layer neural network LMs?
EXAM QUESTION 3
What differs between SVM implementations?
e.g., polynomial learning machines, radial basis function machines, and two-layer neural network LMs?
The kernel function
ANY QUESTIONS?