Kernel Classifiers from a Machine Learning Perspective (sec. 2.1-2.2)
Jin-San Yang
Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
(C) 2005, SNU Biointelligence Lab, http://bi.snu.ac.kr/
2.1 The Basic Setting
• Definition 2.1 (Learning problem) Finding the unknown (functional) relationship h between objects x and targets y, based on a sample z of size m.
• For a given object x, evaluate the class distribution and decide on the class y by the decision rule given below.
• Problems with estimating P_Z directly from the given sample:
  • It cannot predict classes for a new object.
  • We need to constrain the set of possible mappings from objects to classes.
• Definition 2.2 (Features and feature space) Objects are mapped into a feature space K via a feature map (see below).
• Definition 2.4 (Linear function and linear classifier) Similar objects are mapped to similar classes via linearity. A linear classifier is unaffected by the scale of the weight vector, and hence the weight is assumed to be of unit length.
$$h(x) = \operatorname*{argmax}_{y \in Y} \mathbf{P}_{Y|X=x}(y), \qquad \mathbf{P}_{Y|X=x}(y) = \frac{\mathbf{P}_Z((x,y))}{\mathbf{P}_X(x)} = \frac{\mathbf{P}_Z((x,y))}{\sum_{\tilde{y} \in Y} \mathbf{P}_Z((x,\tilde{y}))}, \qquad Z = X \times Y$$

Estimating P_Z by the empirical measure,
$$v_z((x,y)) = \left|\left\{ i \in \{1, \ldots, m\} \mid z_i = (x,y) \right\}\right|, \qquad h(x) = \operatorname*{argmax}_{y \in Y} \sum_{i=1}^{m} \mathbf{I}_{z_i = (x,y)},$$
assigns zero probability to any object not in the sample.
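As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the decision rule above for a toy finite joint distribution; the dictionary p_z, the helper bayes_classify, and all probabilities are hypothetical.

```python
# Minimal sketch: the decision rule h(x) = argmax_y P_{Y|X=x}(y)
# for a toy finite joint distribution P_Z.  All names are illustrative.

# Joint probabilities P_Z((x, y)) over X = {"a", "b"} and Y = {-1, +1}.
p_z = {
    ("a", +1): 0.35, ("a", -1): 0.15,
    ("b", +1): 0.10, ("b", -1): 0.40,
}

def bayes_classify(x, classes=(-1, +1)):
    """Return argmax_y P_Z((x, y)); the normalizer P_X(x) cancels."""
    return max(classes, key=lambda y: p_z.get((x, y), 0.0))

print(bayes_classify("a"))  # +1, since 0.35 > 0.15
print(bayes_classify("b"))  # -1, since 0.40 > 0.10
```

Note that bayes_classify returns an arbitrary class for any x outside the sample support, which is exactly the problem noted in the bullet above.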
$$\phi : X \to K \subseteq \ell_2^n, \qquad \mathbf{x} = \phi(x)$$
$$f_{\mathbf{w}}(x) = \langle \phi(x), \mathbf{w} \rangle = \langle \mathbf{x}, \mathbf{w} \rangle, \qquad h_{\mathbf{w}}(x) = \operatorname{sign}(f_{\mathbf{w}}(x)) = \operatorname{sign}(\langle \mathbf{x}, \mathbf{w} \rangle), \qquad \|\mathbf{w}\| = 1$$
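The following minimal sketch (an illustration added here, not from the slides) implements a linear classifier through an explicit feature map; phi, f_w, h_w, and the concrete weight vector are hypothetical.

```python
import math

# Sketch of Definition 2.4: a feature map phi: X -> K and the induced
# linear classifier h_w(x) = sign(<phi(x), w>).

def phi(x):
    """Hypothetical feature map from a scalar object into K, a subset of R^3."""
    return (1.0, x, x * x)

def f_w(x, w):
    """Linear function f_w(x) = <phi(x), w>."""
    return sum(a * b for a, b in zip(phi(x), w))

def h_w(x, w):
    """Linear classifier h_w(x) = sign(f_w(x))."""
    return 1 if f_w(x, w) >= 0 else -1

# Scaling w does not change h_w, so we may normalize w to unit length.
w = (0.5, -1.0, 0.25)
norm = math.sqrt(sum(v * v for v in w))
w = tuple(v / norm for v in w)
print(h_w(2.0, w))
```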
• F is isomorphic to W, so the task of learning reduces to finding the best classifier in the hypothesis space F (definitions below).
• Properties for a measure of the goodness of a classifier:
  • It depends on the unknown P_Z.
  • It should make the optimization task computationally easier.
  • It is pointwise w.r.t. the object-class pairs (due to the independence of the samplings).
• Expected risk: the expectation of the loss, defined below together with the zero-one loss.
• Example 2.7 (Cost matrices) In classifying handwritten digits, the 0-1 loss function is inappropriate: there are approximately 10 times more "no pictures of 1" than "pictures of 1", so the two error types should carry different costs (see the cost-matrix loss below).
$$F = \left\{ x \mapsto \langle \mathbf{x}, \mathbf{w} \rangle \;\middle|\; \mathbf{w} \in K \right\} \subseteq \mathbb{R}^X, \qquad W = \left\{ \mathbf{w} \in K \;\middle|\; \|\mathbf{w}\| = 1 \right\}, \qquad H = \left\{ \operatorname{sign}(f) \;\middle|\; f \in F \right\} \subseteq Y^X$$
$$f^* = \operatorname*{argmin}_{f \in F} \mathbf{E}_{XY}\left[ l(f(X), Y) \right], \qquad R[f] = \mathbf{E}_{XY}\left[ l(f(X), Y) \right], \qquad l_{0\text{-}1}(f(x), y) = \mathbf{I}_{y f(x) \le 0}$$
$$l_{\mathbf{C}}(f(x), y) = \mathbf{C}_{y, \operatorname{sign}(f(x))} = \begin{cases} c_{12} & \text{if } y = +1 \text{ and } f(x) < 0 \\ c_{21} & \text{if } y = -1 \text{ and } f(x) \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
where $\mathbf{C}$ is a cost matrix over the classes $\{+1, -1\}$.
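As an illustration of the cost-matrix loss (the concrete costs below are hypothetical, not from the text), a minimal sketch:

```python
# Sketch of Example 2.7: a cost-sensitive loss l_C(f(x), y) driven by a
# cost matrix C over the classes {-1, +1}.

# C[(true_y, predicted_y)]: misclassifying a true +1 is 10x more costly.
C = {(+1, -1): 10.0, (-1, +1): 1.0, (+1, +1): 0.0, (-1, -1): 0.0}

def sign(v):
    return 1 if v >= 0 else -1

def loss_C(fx, y):
    """l_C(f(x), y) = C_{y, sign(f(x))}."""
    return C[(y, sign(fx))]

print(loss_C(-0.3, +1))  # 10.0: a costly false negative
print(loss_C(+0.3, -1))  # 1.0:  a cheaper false positive
```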
• Remark 2.8 (Geometrical picture) Linear classifiers, parameterized by weight w, are hyperplanes passing through the origin in feature space K.
[Figure: two panels, (Hypothesis space) and (Feature space), illustrating the correspondence between weight vectors w in W and hyperplanes through the origin in the feature space K.]
2.2 Learning by Risk Minimization

• Definition 2.9 (Learning algorithm) A mapping A as defined below, where X is the object space, Y the output space, and F the space of candidate functions. We have no knowledge of the function (or of P_Z) to be optimized.
• Definition 2.10 (Generalization error) Defined below.
• Definition 2.11 (Empirical risk) The empirical risk functional over F, or training error of f, is defined below.
$$A : \bigcup_{m=1}^{\infty} (X \times Y)^m \to F$$
Generalization error:
$$R[A, z] = R[A(z)] - \inf_{f \in F} R[f]$$
Empirical risk and the empirical risk minimization (ERM) algorithm:
$$R_{\mathrm{emp}}[f, z] = \frac{1}{m} \sum_{i=1}^{m} l(f(x_i), y_i) \quad \text{for } z \in (X \times Y)^m, \qquad A_{\mathrm{ERM}}(z) = \operatorname*{argmin}_{f \in F} R_{\mathrm{emp}}[f, z]$$
ERM is consistent if, for every $\varepsilon > 0$,
$$\lim_{m \to \infty} \mathbf{P}_{Z^m}\left( R[A_{\mathrm{ERM}}(Z)] - R[f^*] > \varepsilon \right) = 0$$
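A minimal sketch of the empirical risk and ERM over a tiny, finite hypothesis set (the sample and the threshold classifiers are illustrative, not from the slides):

```python
# Sketch of Definition 2.11: the empirical (zero-one) risk of a
# classifier f on a training sample z = ((x_1, y_1), ..., (x_m, y_m)).

def zero_one_loss(fx, y):
    """l_{0-1}(f(x), y) = I_{y f(x) <= 0}."""
    return 1.0 if y * fx <= 0 else 0.0

def empirical_risk(f, z):
    """R_emp[f, z] = (1/m) * sum_i l(f(x_i), y_i)."""
    return sum(zero_one_loss(f(x), y) for x, y in z) / len(z)

# ERM over a finite hypothesis set: pick the f minimizing R_emp.
z = [(-2.0, -1), (-0.5, -1), (1.0, +1), (2.5, +1)]
hypotheses = [lambda x, t=t: x - t for t in (-1.0, 0.0, 1.5)]
f_erm = min(hypotheses, key=lambda f: empirical_risk(f, z))
print(empirical_risk(f_erm, z))  # 0.0 for this separable toy sample
```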
2.2.1 The (Primal) Perceptron Algorithm
• When an example $(\mathbf{x}_i, y_i)$ is misclassified by the linear classifier $h_{\mathbf{w}_t}$, the update step
$$\mathbf{w}_{t+1} = \mathbf{w}_t + y_i \mathbf{x}_i$$
changes $\mathbf{w}_t$ into $\mathbf{w}_{t+1}$, and thus $\mathbf{x}_i$ attracts the decision hyperplane. A runnable sketch follows.
[Figure: the hyperplanes $\{\mathbf{x} \mid \langle \mathbf{x}, \mathbf{w}_t \rangle = 0\}$ and $\{\mathbf{x} \mid \langle \mathbf{x}, \mathbf{w}_{t+1} \rangle = 0\}$ before and after an update on a misclassified example $(\mathbf{x}_i, +1)$.]
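A minimal sketch of the primal perceptron update described above; the toy data and the epoch limit are illustrative assumptions.

```python
# Primal perceptron: on a misclassified example (x_i, y_i),
# update w <- w + y_i * x_i.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(samples, n_features, max_epochs=100):
    """Return a weight vector consistent with the sample, if one is found."""
    w = [0.0] * n_features
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            if y * dot(w, x) <= 0:          # misclassified (or on the plane)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                    # converged: sample is separated
            break
    return w

# Linearly separable toy sample in feature space K = R^2.
samples = [((1.0, 2.0), +1), ((2.0, 1.0), +1), ((-1.0, -1.5), -1)]
print(perceptron(samples, n_features=2))
```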
• Definition 2.12 (Version space) The set of all classifiers consistent with the training sample. Given the training sample $z = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$ and a hypothesis space $H \subseteq Y^X$,
$$V_H(z) = \left\{ h \in H \;\middle|\; \forall i \in \{1, \ldots, m\} : h(x_i) = y_i \right\}$$
• For linear classifiers, the set of consistent weights is called the version space (a membership check is sketched below):
$$V(z) = \left\{ \mathbf{w} \in W \;\middle|\; \forall i \in \{1, \ldots, m\} : y_i \langle \mathbf{x}_i, \mathbf{w} \rangle > 0 \right\}$$
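A minimal sketch of testing version-space membership for a weight vector (helper names and data are illustrative):

```python
# Sketch of Definition 2.12: w lies in the version space V(z) iff
# y_i * <x_i, w> > 0 for every training example.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_version_space(w, z):
    """True iff w is consistent with every (x_i, y_i) in z."""
    return all(y * dot(x, w) > 0 for x, y in z)

z = [((1.0, 2.0), +1), ((2.0, 1.0), +1), ((-1.0, -1.5), -1)]
print(in_version_space((0.5, 0.5), z))   # True: separates the sample
print(in_version_space((-1.0, 0.0), z))  # False: inconsistent with z
```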
2.2.2 Regularized Risk Functional

• Drawbacks of minimizing the empirical risk: ERM makes the learning task ill-posed, since a slight variation of the training sample can cause a large deviation of the expected risk (overfitting).
• Regularization is one way to overcome this problem (a code sketch follows this list):
• Introduce a regularizer $\Omega : F \to \mathbb{R}^+$ a priori and minimize the regularized risk:
$$A(z) = \operatorname*{argmin}_{f \in F} R_{\mathrm{reg}}[f, z], \qquad R_{\mathrm{reg}}[f, z] = R_{\mathrm{emp}}[f, z] + \lambda \Omega[f]$$
• This restricts the space of solutions to compact subsets of the (originally overly large) space F, which can be achieved by requiring the set $F(\varepsilon) = \{ f \in F \mid \Omega[f] \le \varepsilon \}$ to be compact for each positive number $\varepsilon > 0$.
• If we decrease $\lambda$ in the right way as the sample size m increases, it can be shown that the regularization method leads to $f^*$ as $m \to \infty$.
• $\lambda$ controls the trade-off: as $\lambda \to 0$ we minimize only the empirical risk; as $\lambda \to \infty$ we minimize only the regularizer.
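A minimal sketch of regularized risk minimization for a linear classifier, using gradient descent. The squared surrogate loss, step size, and data are assumptions made here for a differentiable illustration; the slides leave the loss abstract.

```python
# Minimize R_emp[f_w, z] + lambda * ||w||^2 by gradient descent,
# with a squared loss as a differentiable surrogate.

def reg_risk_minimize(z, n_features, lam=0.1, lr=0.05, steps=500):
    w = [0.0] * n_features
    m = len(z)
    for _ in range(steps):
        grad = [2.0 * lam * wi for wi in w]           # gradient of lam*||w||^2
        for x, y in z:
            fx = sum(wi * xi for wi, xi in zip(w, x))
            err = fx - y                               # squared-loss residual
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / m
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

z = [((1.0, 2.0), +1), ((2.0, 1.0), +1), ((-1.0, -1.5), -1)]
print(reg_risk_minimize(z, n_features=2))
```

Increasing lam shrinks the returned weights toward zero, which is exactly the trade-off the last bullet describes.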
• Structural risk minimization (SRM) by Vapnik: Define a structuring of the hypothesis space F into nested subsets of increasing complexity,
$$F_0 \subseteq F_1 \subseteq \cdots \subseteq F$$
In each hypothesis space, empirical risk minimization is performed. SRM returns the classifier with the smallest combination of training error and complexity value, and can be used with arbitrary complexity values (a sketch follows below).
• Maximum-a-posteriori (MAP) estimation: Interpret the empirical risk as the negative log-probability of the training sample z given a classifier f. The MAP estimate maximizes the mode of the posterior density (the derivation follows the sketch below). The choice of regularizer is comparable to the choice of prior probability in the Bayesian framework and reflects prior knowledge.
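A minimal sketch of the SRM idea over a nested family of polynomial-threshold classifiers. The linear-in-degree complexity penalty, the candidate thresholds, and the data are illustrative assumptions, not Vapnik's exact bound.

```python
# SRM sketch: nested hypothesis spaces F_0 (constants) through F_3
# (cubic thresholds), scored by empirical risk plus a complexity penalty.

def emp_risk(f, z):
    return sum(1.0 for x, y in z if y * f(x) <= 0) / len(z)

def srm_select(z, max_degree=3, c=0.05):
    best = None
    for d in range(max_degree + 1):
        # F_d: classifiers x -> x**d - t for a few candidate thresholds t.
        for t in (-1.0, 0.0, 1.0):
            f = lambda x, d=d, t=t: x ** d - t
            score = emp_risk(f, z) + c * d   # risk + complexity penalty
            if best is None or score < best[0]:
                best = (score, d, t)
    return best

z = [(-2.0, -1), (-0.5, -1), (1.0, +1), (2.0, +1)]
print(srm_select(z))  # (score, chosen degree, threshold)
```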
Interpreting the loss as a negative log-likelihood,
$$\mathbf{P}_{Y|X=x, F=f}(y) = \frac{\exp\left(-l(f(x), y)\right)}{\sum_{\tilde{y} \in Y} \exp\left(-l(f(x), \tilde{y})\right)},$$
so that, for the training sample $z = (\mathbf{x}, \mathbf{y})$,
$$\mathbf{P}_{Z^m|F=f}(z) = \prod_{i=1}^{m} \mathbf{P}_{Y|X=x_i, F=f}(y_i)\, \mathbf{P}_X(x_i).$$
Assuming a prior density $\mathbf{f}_F(f) \propto \exp\left(-\lambda m \Omega[f]\right)$, by Bayes' theorem the posterior density becomes
$$\mathbf{f}_{F|Z^m=z}(f) \propto \prod_{i=1}^{m} \exp\left(-l(f(x_i), y_i)\right) \cdot \exp\left(-\lambda m \Omega[f]\right) = \exp\left(-m\left(R_{\mathrm{emp}}[f, z] + \lambda \Omega[f]\right)\right),$$
up to factors not depending on f. Hence the MAP estimate, the mode of this posterior, is exactly the minimizer of the regularized risk.
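A small numeric sanity check of this equivalence (added here as an illustration): over a grid of one-parameter linear classifiers, the posterior density $\exp(-m(R_{\mathrm{emp}} + \lambda \Omega))$ peaks exactly where the regularized risk is smallest. The grid, zero-one loss, and squared-norm regularizer are assumptions.

```python
import math

# MAP vs. regularized risk: the argmax of exp(-m * R_reg) equals the
# argmin of R_reg, since exp is strictly monotone.

z = [(-1.0, -1), (0.5, +1), (2.0, +1)]
lam, m = 0.1, len(z)

def reg_risk(w):
    r_emp = sum(1.0 for x, y in z if y * (w * x) <= 0) / m
    return r_emp + lam * w * w                 # R_emp + lambda * Omega[f_w]

grid = [i / 10.0 for i in range(-20, 21)]
w_map = max(grid, key=lambda w: math.exp(-m * reg_risk(w)))
w_reg = min(grid, key=reg_risk)
print(w_map == w_reg)  # True: same classifier wins both criteria
```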