Statistical Learning Theory & Classifications Based on Support Vector
Machines
2014: Anders Melen
2015: Rachel Temple
The Nature of Statistical Learning Theory by V. Vapnik
Table of Contents
• Empirical Data Modeling
• What is Statistical Learning Theory
• Model of Supervised Learning
• Risk Minimization
• Vapnik-Chervonenkis Dimensions
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Exam Questions
• Q & A Session
Empirical Data Modeling
• Observations of a system are collected
• Induction on the observations is used to build up a model of the system
• The model is then used to deduce responses of the unobserved system
• Sampling is typically non-uniform
• High-dimensional problems form a sparse distribution in the input space
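The observe → induce → deduce loop above can be sketched in a few lines. A toy sketch; the linear system, the sampling range, and the polynomial fit are all assumptions for illustration:

```python
import numpy as np

# Hypothetical system: responds with y = 2x + 1 (an assumption for this sketch)
rng = np.random.default_rng(0)
x_obs = rng.uniform(0.0, 10.0, size=20)   # non-uniformly sampled observations
y_obs = 2.0 * x_obs + 1.0                 # the system's observed responses

# Induction: build up a model (a degree-1 polynomial fit) from the observations
coeffs = np.polyfit(x_obs, y_obs, deg=1)

# Deduction: use the model to predict the response at an unobserved input
y_new = np.polyval(coeffs, 42.0)          # close to 2 * 42 + 1 = 85
```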
Modeling Error
• Approximation error is the consequence of the hypothesis space not fitting the target space
[Figure: globally optimal model vs. best reachable model vs. selected model]

● Goal
○ Choose a model from the hypothesis space which is closest (w/ respect to some error measure) to the function in the target space
• Estimation error is the error between the best model in our hypothesis space and the model within our hypothesis space that we selected
● Together with the approximation error, this forms the generalization error

[Figure: approximation, estimation, and generalization error between the globally optimal, best reachable, and selected models]
• The globally optimal model & the selected model form the generalization error, which measures how well our data model adapts to new and unobserved data
Statistical Learning Theory
Definition: “Consider the learning problem as a problem of finding a desired dependence using a limited number of observations.” (Vapnik 17)
Model of Supervised Learning
• Training
o The supervisor takes each generated x value and returns an output value y, so each pair is drawn from the joint distribution F(x, y) = F(x)F(y|x)
o The (x, y) pairs form the training set:
(x1, y1), (x2, y2), …, (xl, yl)
Table of Contents
• Empirical Data Modeling• What is Statistical Learning Theory• Model of Supervised Learning• Risk Minimization• Vapnik-Chervonenkis Dimensions• Structural Risk Management (SRM)• Support Vector Machines (SVM)• Exam Questions• Q & A Session
13
Risk Minimization
• To find the best function, we need to measure loss
• L is the discrepancy function, based on the y's generated by the supervisor and the ŷ's generated by the estimate functions
• F is a predictor chosen such that the expected loss is minimized

L(y, F(x, 𝛂))
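As a toy illustration of measuring loss: the squared loss and the family F(x, 𝛂) = 𝛂x below are assumptions, not from the slides:

```python
import numpy as np

# A minimal sketch: squared loss as the discrepancy function L
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def mean_loss(F, alpha, xs, ys):
    # Average discrepancy between the supervisor's y and the estimate F(x, alpha)
    return float(np.mean([squared_loss(y, F(x, alpha)) for x, y in zip(xs, ys)]))

F = lambda x, alpha: alpha * x            # hypothetical family of estimate functions
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

# alpha = 2 reproduces the supervisor exactly, so it minimizes the loss
best_alpha = min([0.5, 1.0, 2.0, 3.0], key=lambda a: mean_loss(F, a, xs, ys))
```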
Risk Minimization
• Pattern Recognition
o With pattern recognition, the supervisor's output y can only take on two values, y ∈ {0, 1}, and the loss is 0 when the estimate agrees with the supervisor and 1 otherwise
○ So the risk function determines the probability of different answers being given by the supervisor and the estimation function
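A minimal sketch of the pattern-recognition risk; the threshold classifier and the four sample points are assumptions:

```python
# Hypothetical estimation function: predicts 1 when x >= 0.5
predict = lambda x: 1 if x >= 0.5 else 0

xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]                          # supervisor's answers

def zero_one_loss(y, y_hat):
    # Loss is 0 when the answers agree, 1 when they differ
    return 0 if y == y_hat else 1

# The risk over this sample is the fraction of disagreements
risk = sum(zero_one_loss(y, predict(x)) for x, y in zip(xs, ys)) / len(xs)
# the estimate disagrees with the supervisor only on x = 0.4, so risk = 0.25
```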
Some Simplifications From Here On
● Training Set
{(x1, y1), …, (xl, yl)} → {z1, …, zl}
● Loss Function
L(y, F(x, 𝛂)) → Q(z, 𝛂)
Empirical Risk Minimization (ERM)
● We want to measure the risk over the training set rather than over the set of all possible observations:

Remp(𝛂) = (1/l) Σi Q(zi, 𝛂)
Empirical Risk Minimization (ERM)
● The empirical risk must converge to the actual risk over the set of loss functions:

Remp(𝛂) → R(𝛂) as l → ∞
Empirical Risk Minimization (ERM)
● In both directions!
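The convergence of the empirical risk to the actual risk can be illustrated with a simulated classifier; the 0.3 error probability is an assumption for this sketch:

```python
import numpy as np

rng = np.random.default_rng(7)

# Assume a fixed classifier whose actual risk (probability of error) is 0.3
actual_risk = 0.3

def empirical_risk(n):
    # Each observation contributes a 0/1 loss; errors occur with probability 0.3
    losses = rng.random(n) < actual_risk
    return float(losses.mean())

# As the training set grows, the empirical risk approaches the actual risk
r_small, r_large = empirical_risk(100), empirical_risk(1_000_000)
```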
Vapnik-Chervonenkis Dimensions
• Let's just call them VC dimensions
• Developed by Alexey Jakovlevich Chervonenkis & Vladimir Vapnik
• The VC dimension is a scalar value that measures the capacity of a set of functions
Vapnik-Chervonenkis Dimensions
• The VC dimension of a set of functions is responsible for the generalization ability of learning machines
• The VC dimension of a set of indicator functions Q(z, 𝛂), 𝛂 ∈ 𝞚, is the maximum number h of vectors z1, …, zh that can be separated into two classes in all 2^h possible ways using functions of the set
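Shattering can be checked by brute force. A sketch with an assumed toy class of signed threshold functions on the line, whose VC dimension works out to 2:

```python
def can_shatter(points, classifiers):
    # The set shatters the points iff all 2^h labelings are realized
    labelings = {tuple(c(x) for x in points) for c in classifiers}
    return len(labelings) == 2 ** len(points)

# Signed thresholds on the line (a toy function class): 1 iff x >= t, or 1 iff x <= t
ts = [-0.5, 0.5, 1.5, 2.5, 3.5]
classifiers = [lambda x, t=t: int(x >= t) for t in ts] + \
              [lambda x, t=t: int(x <= t) for t in ts]

# Two points can be separated in all 2^2 = 4 ways; three cannot
# (the labeling 1, 0, 1 is unreachable), so the VC dimension is 2
two_ok = can_shatter([0, 1], classifiers)
three_ok = can_shatter([0, 1, 2], classifiers)
```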
Upper Bound For Risk
• It can be shown that

R(𝛂) ≤ Remp(𝛂) + Φ

where Φ is the confidence interval and h is the VC dimension on which Φ depends
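The confidence interval can be sketched numerically. The exact form below is one common statement of Vapnik's bound and should be treated as an assumption; constants vary between statements:

```python
import math

# One common form of the confidence interval (an assumption about constants):
# Phi = sqrt((h * (ln(2l/h) + 1) - ln(eta/4)) / l), holding with probability 1 - eta
def confidence_interval(h, l, eta=0.05):
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

# The bound loosens as the capacity h grows and tightens as the sample l grows
```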
Upper Bound For Risk
• ERM only minimizes Remp(𝛂); Φ, the confidence interval, is fixed based on the VC dimension of the set of functions determined a priori
• The confidence interval must be tuned to the problem to avoid overfitting and underfitting
Structural Risk Minimization (SRM)
• SRM attempts to minimize the right-hand side of the inequality over both terms simultaneously
Structural Risk Minimization (SRM)
• The Remp(𝛂) term depends on a specific function's error, while the Φ term depends on the dimension of the space that the functions live in
• The VC dimension is the controlling variable
Structural Risk Minimization (SRM)
• We define the hypothesis space S to be the set of functions Q(z, 𝛂), 𝛂 ∈ 𝞚
• We say that Sk = {Q(z, 𝛂)}, 𝛂 ∈ 𝞚k, is the hypothesis space of VC dimension k, such that:

S1 ⊂ S2 ⊂ … ⊂ Sn ⊂ …
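An SRM-flavored sketch using an assumed nested structure of polynomial fits of increasing degree, with the element chosen by held-out error. The target function, noise level, and degree range are all assumptions:

```python
import numpy as np

# Nested hypothesis spaces S1 ⊂ S2 ⊂ ...: polynomials of increasing degree
rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(30)

x_val = np.linspace(0.01, 0.99, 29)       # held-out inputs
y_val = np.sin(2 * np.pi * x_val)         # true responses at those inputs

def val_error(degree):
    # Fit within S_degree and measure error on the held-out points
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))

# Balancing fit against capacity: the lowest-degree space underfits badly
errors = {d: val_error(d) for d in range(1, 10)}
best_degree = min(errors, key=errors.get)
```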
Support Vector Machines (SVM)
• Map input vectors x into a high-dimensional feature space of vectors z using a kernel function:

(zi · z) = K(xi, x)
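A sketch of the kernel idea for scalar inputs, using an assumed degree-2 polynomial kernel and its explicit feature map:

```python
import numpy as np

# Assumed kernel: K(x, x') = (x * x' + 1)^2 for scalar inputs
def K(x, x_prime):
    return (x * x_prime + 1.0) ** 2

# Explicit feature map realizing the same inner product: phi(x) = (x^2, sqrt(2) x, 1)
def phi(x):
    return np.array([x ** 2, np.sqrt(2) * x, 1.0])

# The kernel computes the feature-space inner product (z_i . z)
# without ever forming the feature vectors explicitly
lhs = K(3.0, -2.0)                # (3 * -2 + 1)^2 = 25
rhs = float(phi(3.0) @ phi(-2.0))
```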
Support Vector Machines (SVM)
• Feature space… Optimal hyperplane… What are you talking about...
Support Vector Machines (SVM)
● Let's try a basic one-dimensional example!
Support Vector Machines (SVM)
● Aw snap, that was easy!
Support Vector Machines (SVM)
● Ok, what about a harder one-dimensional example?
Support Vector Machines (SVM)
● Project the lower-dimensional data into a higher-dimensional space, just like in the animation!
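The animation's trick can be reproduced with an assumed feature map x → (x, x²); the four points and labels are a toy example:

```python
# Toy 1-D labels that no single threshold separates (outer vs. inner points)
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, -1, -1, 1]

# Project into 2-D with the feature map x -> (x, x^2)
features = [(x, x * x) for x in xs]

# In the higher-dimensional space the horizontal line x2 = 2.5 separates the classes
preds = [1 if x2 > 2.5 else -1 for (_, x2) in features]
```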
Support Vector Machines (SVM)
● There are several ways to implement an SVM
○ Polynomial Learning Machine (like the animation)
○ Radial Basis Function Machines
○ Two-Layer Neural Networks
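The three variants differ only in their kernel, i.e. the feature-space inner product they compute. Sketches of the usual kernel forms for scalar inputs; the parameter values are illustrative assumptions:

```python
import math

def polynomial_kernel(x, z, d=2):
    # Polynomial learning machine
    return (x * z + 1.0) ** d

def rbf_kernel(x, z, gamma=1.0):
    # Radial basis function machine
    return math.exp(-gamma * (x - z) ** 2)

def sigmoid_kernel(x, z, v=1.0, c=-1.0):
    # Two-layer neural network machine
    return math.tanh(v * x * z + c)
```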
Simple Neural Network
● Neural Networks are computer science models inspired by nature!
● The brain is a massive natural neural network consisting of neurons and synapses
● Neural networks can be modeled using a graphical model
Simple Neural Network
● Neurons → Nodes● Synapses → Edges
[Figure: molecular form vs. neural network model]
Two-Layer Neural Network
● The kernel is a sigmoid function
[Figure: implementing the decision rules as a two-layer network]
Two-Layer Neural Network
● Using this technique, the following are found automatically:
i. The architecture of the two-layer machine
ii. The number N of units in the first layer (the number of support vectors)
iii. The vectors of the weights wi = xi in the first layer
iv. The vector of weights for the second layer (the values of 𝛂)
Conclusion
● The quality of a learning machine is characterized by three main components:
a. How rich and universal is the set of functions that the LM can approximate?
b. How well can the machine generalize?
c. How fast does the learning process for this machine converge?
Exam Question #1
• What is the main difference between polynomial learning machines, radial basis function machines, and neural network learning machines?
o The kernel function
Exam Question #2
• What is empirical data modeling? Give a summary of the main concept and its components.
o Empirical data modeling is the induction of observations to build up a model. The model is then used to deduce responses of an unobserved system.
Exam Question #3
• What must Remp(𝛂) do over the set of loss functions?
o It must converge to R(𝛂)
End
Any questions?