Statistical Learning Theory & Classifications Based on Support Vector
Machines
2014: Anders Melen
2015: Rachel Temple
The Nature of Statistical Learning Theory by V. Vapnik
Table of Contents
• Empirical Data Modeling
• What is Statistical Learning Theory
• Model of Supervised Learning
• Risk Minimization
• Vapnik-Chervonenkis Dimensions
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Exam Questions
• Q & A Session
Empirical Data Modeling
• Observations of a system are collected
• Induction on the observations is used to build up a model of the system
• The model is then used to deduce responses of the unobserved system
• Sampling is typically non-uniform
• High-dimensional problems form a sparse distribution in the input space
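The observe → induce → deduce loop above can be sketched in a few lines. A toy sketch; the linear system, the sampling range, and the polynomial fit are all assumptions for illustration:

```python
import numpy as np

# Hypothetical system: responds with y = 2x + 1 (an assumption for this sketch)
rng = np.random.default_rng(0)
x_obs = rng.uniform(0.0, 10.0, size=20)   # non-uniformly sampled observations
y_obs = 2.0 * x_obs + 1.0                 # the system's observed responses

# Induction: build up a model (a degree-1 polynomial fit) from the observations
coeffs = np.polyfit(x_obs, y_obs, deg=1)

# Deduction: use the model to predict the response at an unobserved input
y_new = np.polyval(coeffs, 42.0)          # close to 2 * 42 + 1 = 85
```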
Modeling Error
• Approximation error is the consequence of the hypothesis space not fitting the target space
[Figure: globally optimal model vs. best reachable model vs. selected model]

● Goal
○ Choose a model from the hypothesis space which is closest (w/ respect to some error measure) to the function in the target space
• Estimation error is the error between the best model in our hypothesis space and the model within our hypothesis space that we selected
● Together with the approximation error, this forms the generalization error

[Figure: approximation, estimation, and generalization error between the globally optimal, best reachable, and selected models]
• The globally optimal model & the selected model form the generalization error, which measures how well our data model adapts to new and unobserved data
Statistical Learning Theory
Definition: “Consider the learning problem as a problem of finding a desired dependence using a limited number of observations.” (Vapnik 17)
Model of Supervised Learning
• Training
o The supervisor takes each generated x value and returns an output value y, so each pair is drawn from the joint distribution F(x, y) = F(x)F(y|x)
o The (x, y) pairs form the training set:
(x1, y1), (x2, y2), …, (xl, yl)
Table of Contents
• Empirical Data Modeling• What is Statistical Learning Theory• Model of Supervised Learning• Risk Minimization• Vapnik-Chervonenkis Dimensions• Structural Risk Management (SRM)• Support Vector Machines (SVM)• Exam Questions• Q & A Session
13
Risk Minimization
• To find the best function, we need to measure loss
• L is the discrepancy function, based on the y's generated by the supervisor and the ŷ's generated by the estimate functions
• F is a predictor chosen such that the expected loss is minimized

L(y, F(x, 𝛂))
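As a toy illustration of measuring loss: the squared loss and the family F(x, 𝛂) = 𝛂x below are assumptions, not from the slides:

```python
import numpy as np

# A minimal sketch: squared loss as the discrepancy function L
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def mean_loss(F, alpha, xs, ys):
    # Average discrepancy between the supervisor's y and the estimate F(x, alpha)
    return float(np.mean([squared_loss(y, F(x, alpha)) for x, y in zip(xs, ys)]))

F = lambda x, alpha: alpha * x            # hypothetical family of estimate functions
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

# alpha = 2 reproduces the supervisor exactly, so it minimizes the loss
best_alpha = min([0.5, 1.0, 2.0, 3.0], key=lambda a: mean_loss(F, a, xs, ys))
```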
Risk Minimization
• Pattern Recognition
o With pattern recognition, the supervisor's output y can only take on two values, y ∈ {0, 1}, and the loss is 0 when the estimate agrees with the supervisor and 1 otherwise
○ So the risk function determines the probability of different answers being given by the supervisor and the estimation function
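A minimal sketch of the pattern-recognition risk; the threshold classifier and the four sample points are assumptions:

```python
# Hypothetical estimation function: predicts 1 when x >= 0.5
predict = lambda x: 1 if x >= 0.5 else 0

xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]                          # supervisor's answers

def zero_one_loss(y, y_hat):
    # Loss is 0 when the answers agree, 1 when they differ
    return 0 if y == y_hat else 1

# The risk over this sample is the fraction of disagreements
risk = sum(zero_one_loss(y, predict(x)) for x, y in zip(xs, ys)) / len(xs)
# the estimate disagrees with the supervisor only on x = 0.4, so risk = 0.25
```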
Some Simplifications From Here On
● Training Set
{(x1, y1), …, (xl, yl)} → {z1, …, zl}
● Loss Function
L(y, F(x, 𝛂)) → Q(z, 𝛂)
Empirical Risk Minimization (ERM)
● We want to measure the risk over the training set rather than over the set of all possible observations:

Remp(𝛂) = (1/l) Σi Q(zi, 𝛂)
Empirical Risk Minimization (ERM)
● The empirical risk must converge to the actual risk over the set of loss functions:

Remp(𝛂) → R(𝛂) as l → ∞
Empirical Risk Minimization (ERM)
● In both directions!
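The convergence of the empirical risk to the actual risk can be illustrated with a simulated classifier; the 0.3 error probability is an assumption for this sketch:

```python
import numpy as np

rng = np.random.default_rng(7)

# Assume a fixed classifier whose actual risk (probability of error) is 0.3
actual_risk = 0.3

def empirical_risk(n):
    # Each observation contributes a 0/1 loss; errors occur with probability 0.3
    losses = rng.random(n) < actual_risk
    return float(losses.mean())

# As the training set grows, the empirical risk approaches the actual risk
r_small, r_large = empirical_risk(100), empirical_risk(1_000_000)
```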
Vapnik-Chervonenkis Dimensions
• Let's just call them VC dimensions
• Developed by Alexey Jakovlevich Chervonenkis & Vladimir Vapnik
• The VC dimension is a scalar value that measures the capacity of a set of functions
Vapnik-Chervonenkis Dimensions
• The VC dimension of a set of functions is responsible for the generalization ability of learning machines
• The VC dimension of a set of indicator functions Q(z, 𝛂), 𝛂 ∈ 𝞚, is the maximum number h of vectors z1, …, zh that can be separated into two classes in all 2^h possible ways using functions of the set
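Shattering can be checked by brute force. A sketch with an assumed toy class of signed threshold functions on the line, whose VC dimension works out to 2:

```python
def can_shatter(points, classifiers):
    # The set shatters the points iff all 2^h labelings are realized
    labelings = {tuple(c(x) for x in points) for c in classifiers}
    return len(labelings) == 2 ** len(points)

# Signed thresholds on the line (a toy function class): 1 iff x >= t, or 1 iff x <= t
ts = [-0.5, 0.5, 1.5, 2.5, 3.5]
classifiers = [lambda x, t=t: int(x >= t) for t in ts] + \
              [lambda x, t=t: int(x <= t) for t in ts]

# Two points can be separated in all 2^2 = 4 ways; three cannot
# (the labeling 1, 0, 1 is unreachable), so the VC dimension is 2
two_ok = can_shatter([0, 1], classifiers)
three_ok = can_shatter([0, 1, 2], classifiers)
```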
Upper Bound For Risk
• It can be shown that

R(𝛂) ≤ Remp(𝛂) + Φ

where Φ is the confidence interval and h is the VC dimension on which Φ depends
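The confidence interval can be sketched numerically. The exact form below is one common statement of Vapnik's bound and should be treated as an assumption; constants vary between statements:

```python
import math

# One common form of the confidence interval (an assumption about constants):
# Phi = sqrt((h * (ln(2l/h) + 1) - ln(eta/4)) / l), holding with probability 1 - eta
def confidence_interval(h, l, eta=0.05):
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

# The bound loosens as the capacity h grows and tightens as the sample l grows
```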
Upper Bound For Risk
• ERM only minimizes Remp(𝛂); Φ, the confidence interval, is fixed based on the VC dimension of the set of functions determined a priori
• The confidence interval must be tuned to the problem to avoid overfitting and underfitting
Structural Risk Minimization (SRM)
• SRM attempts to minimize the right-hand side of the inequality over both terms simultaneously
Structural Risk Minimization (SRM)
• The Remp(𝛂) term depends on a specific function's error, while the Φ term depends on the dimension of the space that the functions live in
• The VC dimension is the controlling variable
Structural Risk Minimization (SRM)
• We define the hypothesis space S to be the set of functions Q(z, 𝛂), 𝛂 ∈ 𝞚
• We say that Sk = {Q(z, 𝛂)}, 𝛂 ∈ 𝞚k, is the hypothesis space of VC dimension k, such that:

S1 ⊂ S2 ⊂ … ⊂ Sn ⊂ …
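An SRM-flavored sketch using an assumed nested structure of polynomial fits of increasing degree, with the element chosen by held-out error. The target function, noise level, and degree range are all assumptions:

```python
import numpy as np

# Nested hypothesis spaces S1 ⊂ S2 ⊂ ...: polynomials of increasing degree
rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(30)

x_val = np.linspace(0.01, 0.99, 29)       # held-out inputs
y_val = np.sin(2 * np.pi * x_val)         # true responses at those inputs

def val_error(degree):
    # Fit within S_degree and measure error on the held-out points
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))

# Balancing fit against capacity: the lowest-degree space underfits badly
errors = {d: val_error(d) for d in range(1, 10)}
best_degree = min(errors, key=errors.get)
```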
Support Vector Machines (SVM)
• Map input vectors x into a high-dimensional feature space of vectors z using a kernel function:

(zi · z) = K(xi, x)
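A sketch of the kernel idea for scalar inputs, using an assumed degree-2 polynomial kernel and its explicit feature map:

```python
import numpy as np

# Assumed kernel: K(x, x') = (x * x' + 1)^2 for scalar inputs
def K(x, x_prime):
    return (x * x_prime + 1.0) ** 2

# Explicit feature map realizing the same inner product: phi(x) = (x^2, sqrt(2) x, 1)
def phi(x):
    return np.array([x ** 2, np.sqrt(2) * x, 1.0])

# The kernel computes the feature-space inner product (z_i . z)
# without ever forming the feature vectors explicitly
lhs = K(3.0, -2.0)                # (3 * -2 + 1)^2 = 25
rhs = float(phi(3.0) @ phi(-2.0))
```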
Support Vector Machines (SVM)
• Feature space… Optimal hyperplane… What are you talking about...
Support Vector Machines (SVM)
● Let's try a basic one-dimensional example!
Support Vector Machines (SVM)
● Aw snap, that was easy!
Support Vector Machines (SVM)
● Ok, what about a harder one-dimensional example?
Support Vector Machines (SVM)
● Project the lower-dimensional data into a higher-dimensional space, just like in the animation!
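The animation's trick can be reproduced with an assumed feature map x → (x, x²); the four points and labels are a toy example:

```python
# Toy 1-D labels that no single threshold separates (outer vs. inner points)
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, -1, -1, 1]

# Project into 2-D with the feature map x -> (x, x^2)
features = [(x, x * x) for x in xs]

# In the higher-dimensional space the horizontal line x2 = 2.5 separates the classes
preds = [1 if x2 > 2.5 else -1 for (_, x2) in features]
```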
Support Vector Machines (SVM)
● There are several ways to implement an SVM
○ Polynomial Learning Machine (like the animation)
○ Radial Basis Function Machines
○ Two-Layer Neural Networks
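The three variants differ only in their kernel, i.e. the feature-space inner product they compute. Sketches of the usual kernel forms for scalar inputs; the parameter values are illustrative assumptions:

```python
import math

def polynomial_kernel(x, z, d=2):
    # Polynomial learning machine
    return (x * z + 1.0) ** d

def rbf_kernel(x, z, gamma=1.0):
    # Radial basis function machine
    return math.exp(-gamma * (x - z) ** 2)

def sigmoid_kernel(x, z, v=1.0, c=-1.0):
    # Two-layer neural network machine
    return math.tanh(v * x * z + c)
```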
Simple Neural Network
● Neural Networks are computer science models inspired by nature!
● The brain is a massive natural neural network consisting of neurons and synapses
● Neural networks can be modeled using a graphical model
Simple Neural Network
● Neurons → Nodes● Synapses → Edges
[Figure: molecular form vs. neural network model]
Two-Layer Neural Network
● The kernel is a sigmoid function
[Figure: implementing the decision rules as a two-layer network]
Two-Layer Neural Network
● Using this technique, the following are found automatically:
i. The architecture of the two-layer machine
ii. The number N of units in the first layer (the number of support vectors)
iii. The vectors of the weights wi = xi in the first layer
iv. The vector of weights for the second layer (the values of 𝛂)
Conclusion
● The quality of a learning machine is characterized by three main components:
a. How rich and universal is the set of functions that the LM can approximate?
b. How well can the machine generalize?
c. How fast does the learning process for this machine converge?
Exam Question #1
• What is the main difference between polynomial learning machines, radial basis function machines, and neural network learning machines?
o The kernel function
Exam Question #2
• What is empirical data modeling? Give a summary of the main concept and its components.
o Empirical data modeling is the induction of observations to build up a model. The model is then used to deduce responses of an unobserved system.
Exam Question #3
• What must Remp(𝛂) do over the set of loss functions?
o It must converge to R(𝛂)
End
Any questions?