
Statistical Learning Theory & Classifications Based on Support Vector Machines

2014: Anders Melen
2015: Rachel Temple

The Nature of Statistical Learning Theory by V. Vapnik


Table of Contents

• Empirical Data Modeling
• What is Statistical Learning Theory
• Model of Supervised Learning
• Risk Minimization
• Vapnik-Chervonenkis Dimensions
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
  o Optimal Separating Hyperplane & Quadratic Programming
• Exam Questions
• Q & A Session

Empirical Data Modeling

• Observations of a system are collected.
• Induction on the observations is used to build up a model of the system.
• The model is then used to deduce responses of the unobserved system.
• Sampling is typically non-uniform.
• High-dimensional problems form a sparse distribution in the input space.
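As a concrete illustration of this loop, here is a minimal Python sketch (the exponential sampling, sine-shaped system, and cubic model are illustrative assumptions, not from the slides): observe the system, induce a model, deduce responses.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Collect observations of the system (sampling is non-uniform here).
x_obs = np.sort(rng.exponential(scale=1.0, size=30))
y_obs = np.sin(x_obs) + rng.normal(scale=0.1, size=30)

# 2. Induce a model of the system from the observations
#    (a cubic least-squares fit stands in for the hypothesis space).
coeffs = np.polyfit(x_obs, y_obs, deg=3)

# 3. Deduce responses of the unobserved system at new inputs.
x_new = np.linspace(0.0, 4.0, 5)
print(np.polyval(coeffs, x_new))
```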


Modeling Error

• Approximation error is the consequence of the hypothesis space not fitting the target space.

[Diagram: Globally Optimal Model, Best Reachable Model, Selected Model]


● Goal
  o Choose a model from the hypothesis space which is closest (with respect to some error measure) to the target function.


• Estimation error is the error between the best model in our hypothesis space and the model within our hypothesis space that we selected.
• Together, the approximation error and the estimation error form the generalization error.

[Diagram: approximation error separates the globally optimal model from the best reachable model; estimation error separates the best reachable model from the selected model; generalization error spans from the globally optimal model to the selected model.]


• The gap between the globally optimal model and the selected model is the generalization error, which measures how well our data model adapts to new and unobserved data.



Statistical Learning Theory

Definition: “Consider the learning problem as a problem of finding a desired dependence using a limited number of observations.” (Vapnik 17)



Model of Supervised Learning

• Training
  o A generator draws input values x independently from a fixed distribution F(x).
  o The supervisor takes each generated x value and returns an output value y according to the conditional distribution F(y|x), so each pair follows F(x, y) = F(x)F(y|x).
  o The (x, y) pairs form the training set:

(x1, y1), (x2, y2), … , (xl, yl)
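A minimal sketch of this setup (the uniform generator and the deterministic threshold supervisor are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def supervisor(x):
    # Returns y according to the (unknown) conditional F(y|x);
    # this deterministic threshold rule is only an illustration.
    return int(x > 0.5)

# The generator draws x1, ..., xl i.i.d. from F(x) (uniform here);
# the supervisor labels each one, yielding the training set.
xs = rng.uniform(0.0, 1.0, size=8)
training_set = [(x, supervisor(x)) for x in xs]
print(training_set)
```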


Risk Minimization

• To find the best function, we need to measure loss.
• L is the discrepancy (loss) function, comparing the y’s generated by the supervisor with the ŷ’s generated by the estimate function:

L(y, F(x, 𝛂))

• The goal is a predictor F(x, 𝛂) that minimizes the expected loss, the risk functional:

R(𝛂) = ∫ L(y, F(x, 𝛂)) dF(x, y)


Risk Minimization

• Pattern Recognition
  o With pattern recognition, the supervisor’s output y can only take on two values, y ∈ {0, 1}, and the loss takes the following values:

L(y, F(x, 𝛂)) = 0 if y = F(x, 𝛂), 1 if y ≠ F(x, 𝛂)

  o So the risk functional determines the probability of different answers being given by the supervisor and the estimation function.
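A sketch of the 0-1 loss and the risk it induces: under this loss, the risk of an estimate function is just its probability of disagreeing with the supervisor (the data and the threshold-style estimate function are illustrative assumptions).

```python
import numpy as np

def zero_one_loss(y, y_hat):
    # 0 if supervisor and estimate agree, 1 otherwise.
    return 0 if y == y_hat else 1

rng = np.random.default_rng(2)
xs = rng.uniform(-1.0, 1.0, size=1000)
ys = (xs > 0).astype(int)          # the supervisor's answers

def F(x, alpha=0.1):               # an estimate function with parameter alpha
    return int(x > alpha)

# Monte Carlo estimate of the risk: the disagreement probability.
risk = np.mean([zero_one_loss(y, F(x)) for x, y in zip(xs, ys)])
print(f"estimated risk: {risk:.3f}")
```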


Some Simplifications From Here On

● Training Set: {(x1, y1), … , (xl, yl)} → {z1, … , zl}
● Loss Function: L(y, F(x, 𝛂)) → Q(z, 𝛂)


Empirical Risk Minimization (ERM)

● We want to measure the risk over the training set rather than over the set of all possible observations, giving the empirical risk:

Remp(𝛂) = (1/l) Σ Q(zi, 𝛂), summed over i = 1, … , l
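A sketch of ERM under the simplified notation: compute Remp(𝛂) as the average loss over the l training observations and pick the parameter that minimizes it (the threshold-style loss and the tiny data set are illustrative assumptions).

```python
import numpy as np

def empirical_risk(Q, zs, alpha):
    # Remp(alpha) = (1/l) * sum over i of Q(z_i, alpha).
    return np.mean([Q(z, alpha) for z in zs])

# Illustrative loss on z = (x, y): 0-1 error of a threshold at alpha.
Q = lambda z, alpha: 0 if (z[0] > alpha) == bool(z[1]) else 1
zs = [(-0.5, 0), (0.2, 1), (0.7, 1), (-0.1, 0)]

# ERM: choose the alpha in a candidate set minimizing the empirical risk.
alphas = np.linspace(-1.0, 1.0, 41)
best = min(alphas, key=lambda a: empirical_risk(Q, zs, a))
print(best, empirical_risk(Q, zs, best))
```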


Empirical Risk Minimization (ERM)

● The empirical risk Remp(𝛂) must converge to the actual risk R(𝛂) over the set of loss functions.


Empirical Risk Minimization (ERM)

● In both directions! That is, the convergence must be uniform and two-sided over the set of functions.



Vapnik-Chervonenkis Dimensions

• Let’s just call them VC dimensions.
• Developed by Alexey Yakovlevich Chervonenkis & Vladimir Vapnik.
• The VC dimension is a scalar value that measures the capacity of a set of functions.


Vapnik-Chervonenkis Dimensions

• The VC dimension of a set of functions characterizes the generalization ability of learning machines.
• The VC dimension of a set of indicator functions Q(z, 𝛂), 𝛂 ∈ 𝞚, is the maximum number h of vectors z1, …, zh that can be separated into two classes in all 2^h possible ways using functions of the set.
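The shattering definition can be checked directly by brute force. The sketch below (an illustration, not from the slides) tests whether a set of points on the line is shattered by one-dimensional threshold functions x → [x > a] and x → [x < a]; two points can be shattered, three cannot, so this class has VC dimension 2.

```python
from itertools import product

def realizable(points, labels):
    # Candidate thresholds: midpoints plus values outside the data range.
    c = sorted(points)
    thresholds = [c[0] - 1] + [(a + b) / 2 for a, b in zip(c, c[1:])] + [c[-1] + 1]
    for a in thresholds:
        for rule in (lambda x: int(x > a), lambda x: int(x < a)):
            if all(rule(x) == y for x, y in zip(points, labels)):
                return True
    return False

def shattered(points):
    # Shattered iff every one of the 2^h labelings is realizable.
    return all(realizable(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered([0.0, 1.0]))        # True  -> 2 points can be shattered
print(shattered([0.0, 1.0, 2.0]))   # False -> VC dimension of thresholds is 2
```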


Upper Bound For Risk

• It can be shown that, with probability at least 1 − η,

R(𝛂) ≤ Remp(𝛂) + Φ, with Φ = sqrt( ( h(ln(2l/h) + 1) − ln(η/4) ) / l )

where Φ is the confidence interval and h is the VC dimension.
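A small sketch of the confidence interval in this commonly cited form of the bound (the η value and sample sizes are illustrative): Φ grows with the VC dimension h and shrinks with the number of observations l.

```python
import math

def vc_confidence(h, l, eta=0.05):
    # Phi = sqrt((h * (ln(2l/h) + 1) - ln(eta/4)) / l),
    # holding with probability at least 1 - eta.
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

# The bound R <= Remp + Phi loosens as h grows at fixed l.
for h in (5, 50, 500):
    print(h, round(vc_confidence(h, l=10_000), 3))
```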


Upper Bound For Risk

• ERM only minimizes Remp(𝛂), and Φ, the confidence interval, is fixed based on the VC dimension of the set of functions chosen a priori.

• To avoid overfitting and underfitting, the confidence interval must be tuned to the problem at hand, which ERM alone cannot do.



Structural Risk Minimization (SRM)

• SRM attempts to minimize the right-hand side of the risk bound over both terms simultaneously.


Structural Risk Minimization (SRM)

• The Remp(𝛂) term depends on a specific function’s error, while the confidence interval Φ depends on the VC dimension of the space the functions live in.

• The VC dimension is the controlling variable


Structural Risk Minimization (SRM)

• We define the hypothesis space S to be the set of functions Q(z, 𝛂), 𝛂 ∈ 𝞚.
• We say that Sk = {Q(z, 𝛂)}, 𝛂 ∈ 𝞚k, is the hypothesis space of VC dimension hk, such that the spaces are nested:

S1 ⊂ S2 ⊂ … ⊂ Sn ⊂ … , with h1 ≤ h2 ≤ … ≤ hn ≤ …
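A sketch of SRM model selection over such a nested structure, using thresholded polynomials of increasing degree as an illustrative stand-in for S1 ⊂ S2 ⊂ … (the data, the degree-to-VC-dimension proxy h = d + 1, and η are assumptions of this example): minimize the sum of empirical risk and confidence interval, not the empirical risk alone.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
l = 200
x = rng.uniform(-1, 1, size=l)
y = (np.sin(3 * x) > 0).astype(int)

def vc_confidence(h, l, eta=0.05):
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

best_deg, best_bound = None, float("inf")
for deg in range(1, 10):
    # Element S_deg of the structure: sign of a degree-`deg` polynomial score.
    coeffs = np.polyfit(x, 2 * y - 1, deg)
    y_hat = (np.polyval(coeffs, x) > 0).astype(int)
    r_emp = np.mean(y_hat != y)
    bound = r_emp + vc_confidence(h=deg + 1, l=l)
    if bound < best_bound:
        best_deg, best_bound = deg, bound
print(best_deg, round(best_bound, 3))
```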



Support Vector Machines (SVM)

• Map input vectors x into a high-dimensional feature space of vectors z; inner products in that space are computed by a kernel function:

(zi · z) = K(xi, x)
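A sketch of why this works, using the homogeneous quadratic kernel on R² as an illustrative example: the kernel K(x, x′) = (x · x′)² returns exactly the inner product of the explicit feature-space images z(x) = (x1², x2², √2·x1·x2), without ever forming them.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous quadratic kernel on R^2.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def K(x, x_prime):
    # The kernel computes the same inner product in input space.
    return np.dot(x, x_prime) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(a), phi(b)))   # inner product of feature-space images
print(K(a, b))                  # same value, no explicit mapping needed
```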


Support Vector Machines (SVM)

• Feature space… optimal hyperplane… what are you talking about?



Support Vector Machines (SVM)

● Let’s try a basic one-dimensional example!


Support Vector Machines (SVM)

● Aw snap, that was easy!


Support Vector Machines (SVM)

● OK, what about a harder one-dimensional example?


Support Vector Machines (SVM)

● Project the lower-dimensional data into a higher-dimensional space, just like in the animation!
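A minimal sketch of the trick (the data and the map x → (x, x²) are illustrative assumptions): class 1 sits in the middle of the line, so no single threshold separates it, but after lifting, a horizontal line in the plane does.

```python
import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])    # no threshold on x separates this

z = np.column_stack([x, x ** 2])        # lift each point to (x, x^2)

# In the lifted space the line x^2 = 2.5 separates the classes perfectly.
print((z[:, 1] < 2.5).astype(int))      # reproduces y exactly
```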


Support Vector Machines (SVM)

● There are several ways to implement an SVM; see the sketch after this list:

○ Polynomial Learning Machine (Like the animation)

○ Radial Basis Function Machines

○ Two-Layer Neural Networks
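A sketch of the three variants using scikit-learn’s SVC (the library choice and the synthetic circular data are assumptions of this example; the slides do not prescribe an implementation): the only thing that changes between the machines is the kernel.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circular boundary

machines = {
    "polynomial learning machine": SVC(kernel="poly", degree=2),
    "radial basis function machine": SVC(kernel="rbf"),
    "two-layer neural network": SVC(kernel="sigmoid"),
}
for name, clf in machines.items():
    clf.fit(X, y)
    print(name, clf.score(X, y))        # training accuracy
```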


Simple Neural Network

● Neural Networks are computer science models inspired by nature!

● The brain is a massive natural neural network consisting of neurons and synapses

● A neural network can be modeled as a graph of nodes and edges.


Simple Neural Network

● Neurons → Nodes
● Synapses → Edges

[Figure: a biological network (“Molecular Form”) beside its graph representation (“Neural Network Model”).]

Two-Layer Neural Network

• The kernel is a sigmoid function: K(x, xi) = S( v(x · xi) + c )
• The resulting decision rule, f(x) = sign( Σ 𝛂i K(x, xi) + b ), implements a two-layer network.


Two-Layer Neural Network

● Using this technique, the following are found automatically (see the sketch after this list):

i. The architecture of a two-layer machine

ii. N, the number of units in the first layer (the number of support vectors)

iii. The vectors of the weights wi = xi in the first layer

iv. The vector of weights for the second layer (the values of 𝛂)
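A sketch of reading these quantities off a trained machine, using scikit-learn’s sigmoid-kernel SVC as a stand-in (the library and data are assumptions of this example): the support vectors give the first-layer weights and the dual coefficients give the second layer.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SVC(kernel="sigmoid").fit(X, y)

N = len(clf.support_vectors_)     # (ii) number of first-layer units
W1 = clf.support_vectors_         # (iii) first-layer weights w_i = x_i
W2 = clf.dual_coef_               # (iv) second-layer weights (the alphas)
print(N, W1.shape, W2.shape)
```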


Conclusion

● The quality of a learning machine is characterized by three main components

a. How rich and universal is the set of functions that the LM can approximate?
b. How well can the machine generalize?
c. How fast does the learning process for this machine converge?



Exam Question #1

• What is the main difference between polynomial learning machines, radial basis function machines, and neural network learning machines? What is that difference for the neural network learning machine?
  o The kernel function; for the neural network learning machine it is a sigmoid.


Exam Question #2

• What is empirical data modeling? Give a summary of the main concept and its components.
  o Empirical data modeling uses induction on collected observations to build up a model; the model is then used to deduce responses of an unobserved system.


Exam Question #3

• What must Remp(𝛂) do over the set of loss functions?
  o It must converge to R(𝛂) (in both directions).



End

Any questions?

