Kernel Classifiers from a Machine Learning Perspective (sec. 2.1-2.2)
Jin-San Yang, Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University


Page 1

Kernel Classifiers from a Machine Learning Perspective (sec. 2.1-2.2)

Jin-San Yang

Biointelligence Laboratory

School of Computer Science and Engineering

Seoul National University

Page 2

(C) 2005, SNU Biointelligence Lab, http://bi.snu.ac.kr/

2.1 The Basic Setting

• Definition 2.1 (Learning problem): finding the unknown (functional) relationship h between objects x and targets y, based on a sample z of size m.

• For a given object x, evaluate the conditional distribution P_{Y|X=x} and decide on the class y by h(x) = argmax_{ỹ∈Y} P_{Y|X=x}(ỹ).

• Problems of estimating P_Z based on the given sample: the sample alone cannot predict classes for a new object, so we need to constrain the set of possible mappings from objects to classes.

• Definition 2.2 (Features and feature space): each object x ∈ X is mapped by a feature mapping φ: X → K ⊆ ℓ₂ⁿ to a feature vector x = φ(x) = (φ₁(x), ..., φ_n(x)).

• Definition 2.4 (Linear function and linear classifier): similar objects are mapped to similar classes via linearity; a linear classifier is unaffected by the scale of the weight vector w, and hence w is assumed to be of unit length.

Sample and probability model:
z = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m = Z^m, drawn iid from the unknown P_Z = P_XY, with P_XY((x, y)) = P_{Y|X=x}(y) · P_X(x).

Estimating P_Z from the sample uses only the index sets V((x, y)) = {i ∈ {1, ..., m} | z_i = (x, y)}, i.e. the indices of sample points equal to (x, y).

Linear function and linear classifier (Definition 2.4):
f_w(x) = ⟨φ(x), w⟩ = ⟨x, w⟩,  h_w(x) = sign(f_w(x)) = sign(⟨x, w⟩),  with w ∈ K, ‖w‖ = 1.
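As a rough sketch of Definitions 2.2 and 2.4 (not from the slides), the following Python snippet uses a made-up feature map φ and a unit-length weight vector to form the linear function f_w and classifier h_w:

```python
import numpy as np

def phi(x):
    """Hypothetical feature map: sends a raw 2-D object x into the
    feature space K, here by appending a product feature (illustrative)."""
    return np.array([x[0], x[1], x[0] * x[1]])

def f_w(x, w):
    """Linear function f_w(x) = <phi(x), w>."""
    return np.dot(phi(x), w)

def h_w(x, w):
    """Linear classifier h_w(x) = sign(f_w(x)), with sign(0) := +1."""
    return 1 if f_w(x, w) >= 0 else -1

# Weight vector of unit length, as assumed in Definition 2.4.
w = np.array([1.0, -1.0, 0.5])
w = w / np.linalg.norm(w)

print(h_w(np.array([0.3, -0.2]), w))   # predicted class in {-1, +1}
```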

Page 3


• F is isomorphic to W (each f_w corresponds to exactly one weight vector w); the task of learning reduces to finding the best classifier in the hypothesis space F.

• Properties for the goodness of a classifier:
  - It depends on the unknown P_Z.
  - It should make the maximization task computationally easier.
  - It is pointwise w.r.t. the object-class pairs (due to the independence of the samplings).

• Expected risk: R[f] = E_XY[l(f(X), Y)], minimized by f* = argmin_{f∈F} E_XY[l(f(X), Y)].

• Example 2.7 (Cost matrices): in classifying handwritten digits, the 0-1 loss function is inappropriate, because there are approximately 10 times more "no pictures of 1" than "pictures of 1"; a cost matrix assigns different costs to the two kinds of error.

F = {x ↦ ⟨x, w⟩ | w ∈ K} ⊆ ℝ^X,  W = {w ∈ K | ‖w‖ = 1},  H = {h_w = sign(f_w) | f_w ∈ F} ⊆ Y^X

Zero-one loss: l_{0-1}(f(x), y) = I_{y·f(x) ≤ 0}

Cost-matrix loss (Example 2.7): l_C(f(x), y) = C_{y, sign(f(x))}, e.g. l_C(f(x), y) = c_1 if y = +1 and f(x) ≤ 0, c_2 if y = −1 and f(x) > 0, and 0 otherwise.
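A minimal sketch (illustrative, not from the slides) that estimates the expected risk empirically under the zero-one loss and under an asymmetric cost-matrix loss; the cost values c1 = 10, c2 = 1 and the toy data are assumptions:

```python
import numpy as np

def zero_one_loss(fx, y):
    """l_{0-1}(f(x), y) = I_{y * f(x) <= 0}."""
    return 1.0 if y * fx <= 0 else 0.0

def cost_matrix_loss(fx, y, c1=10.0, c2=1.0):
    """Asymmetric loss: cost c1 for missing a positive example,
    cost c2 for a false positive, 0 otherwise (c1, c2 are illustrative)."""
    if y == 1 and fx <= 0:
        return c1
    if y == -1 and fx > 0:
        return c2
    return 0.0

def risk(f, xs, ys, loss):
    """Empirical (Monte-Carlo) estimate of R[f] = E_XY[l(f(X), Y)]."""
    return np.mean([loss(f(x), y) for x, y in zip(xs, ys)])

# Toy data and a fixed linear function f(x) = <x, w>.
rng = np.random.default_rng(0)
xs = rng.normal(size=(200, 2))
ys = np.where(xs[:, 0] + xs[:, 1] > 0, 1, -1)
w = np.array([1.0, 0.5])
f = lambda x: np.dot(x, w)

print(risk(f, xs, ys, zero_one_loss))
print(risk(f, xs, ys, cost_matrix_loss))
```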

Page 4


• Remark 2.8 (Geometrical picture) Linear classifiers, parameterized by weight w, are hyperplanes passing through the origin in feature space K.

[Figure: the hypothesis space (unit weight vectors w ∈ W) and the feature space K; each weight vector corresponds to a hyperplane through the origin.]

Page 5


2.2 Learning by Risk Minimization

• Definition 2.9 (Learning algorithm): a mapping A: ∪_{m=1}^∞ (X × Y)^m → F

(where X: object space, Y: output space, F: the hypothesis space of functions) - we have no knowledge of the function (or P_Z) to be optimized

• Definition 2.10 (Generalization error)

• Definition 2.11 (Empirical risk): the empirical risk functional over F, or training error of f, is defined as

R_emp[f, z] = (1/m) Σ_{i=1}^m l(f(x_i), y_i),  where z = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m.

The empirical risk minimization (ERM) algorithm is A_ERM(z) = argmin_{f∈F} R_emp[f, z].

Generalization error (Definition 2.10): R[A, z] := R[A(z)] − inf_{f∈F} R[f]; the algorithm is consistent if for every ε > 0, lim_{m→∞} P_{Z^m}(R[A(Z)] − R[f*] > ε) = 0.
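A small sketch of empirical risk minimization with the zero-one loss; since the slides do not fix a concrete set to search, the finite grid of candidate unit-length weight vectors below is only an illustrative approximation of the argmin over F:

```python
import numpy as np

def emp_risk(w, xs, ys):
    """R_emp[f_w, z] = (1/m) * sum_i l_{0-1}(<x_i, w>, y_i)."""
    preds = xs @ w
    return np.mean(ys * preds <= 0)

def a_erm(xs, ys, candidates):
    """A_ERM(z): the candidate with the smallest empirical risk."""
    risks = [emp_risk(w, xs, ys) for w in candidates]
    return candidates[int(np.argmin(risks))]

rng = np.random.default_rng(1)
xs = rng.normal(size=(100, 2))
ys = np.where(xs @ np.array([2.0, -1.0]) > 0, 1, -1)

# Finite grid of unit-length candidate weights (illustrative only).
angles = np.linspace(0, 2 * np.pi, 100, endpoint=False)
candidates = [np.array([np.cos(a), np.sin(a)]) for a in angles]

w_best = a_erm(xs, ys, candidates)
print(w_best, emp_risk(w_best, xs, ys))
```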

Page 6


2.2.1 The (Primal) Perceptron Algorithm

When an example (x_i, y_i) is misclassified by the linear classifier w_t, the update step changes w_t into w_{t+1} = w_t + y_i x_i, and thus x_i attracts the hyperplane.

Misclassification means y_i ⟨x_i, w_t⟩ ≤ 0. After the update w_{t+1} = w_t + y_i x_i, the positive half-space {x | ⟨x, w_t⟩ ≥ 0} becomes {x | ⟨x, w_{t+1}⟩ ≥ 0}, moving toward a misclassified positive example (x_i, +1).
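A sketch of the primal perceptron in Python; the update rule w_{t+1} = w_t + y_i x_i is the one from the slide, while the toy data and stopping criterion are illustrative assumptions:

```python
import numpy as np

def perceptron(xs, ys, max_epochs=1000):
    """Primal perceptron: whenever (x_i, y_i) is misclassified,
    i.e. y_i * <x_i, w_t> <= 0, update w_{t+1} = w_t + y_i * x_i."""
    w = np.zeros(xs.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y * np.dot(x, w) <= 0:
                w = w + y * x          # the update step from the slide
                mistakes += 1
        if mistakes == 0:              # every example correctly classified
            break
    return w

# Linearly separable toy data with a margin (illustrative).
rng = np.random.default_rng(2)
raw = rng.normal(size=(200, 2))
w_true = np.array([1.0, 2.0]) / np.sqrt(5.0)
xs = raw[np.abs(raw @ w_true) > 0.2]   # keep points away from the boundary
ys = np.where(xs @ w_true > 0, 1, -1)

w = perceptron(xs, ys)
print(np.all(ys * (xs @ w) > 0))       # True once the sample is separated
```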

Page 7


• Definition 2.12 (Version space): the set of all classifiers consistent with the training sample. Given the training sample z = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m and a hypothesis space H ⊆ Y^X,

V_H(z) = {h ∈ H | ∀i ∈ {1, ..., m}: h(x_i) = y_i}.

For linear classifiers, the set of consistent weights is called the version space:

V(z) = {w ∈ W | ∀i ∈ {1, ..., m}: y_i ⟨x_i, w⟩ > 0}.
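An illustrative sketch (not from the slides) that tests membership in the version space V(z) by checking y_i ⟨x_i, w⟩ > 0 for every training example, here for randomly sampled unit-length weight vectors:

```python
import numpy as np

def in_version_space(w, xs, ys):
    """w lies in V(z) iff y_i * <x_i, w> > 0 for every training example."""
    return bool(np.all(ys * (xs @ w) > 0))

rng = np.random.default_rng(3)
xs = rng.normal(size=(30, 2))
ys = np.where(xs @ np.array([0.0, 1.0]) > 0, 1, -1)

# Sample random unit-length weight vectors and keep the consistent ones.
ws = rng.normal(size=(1000, 2))
ws /= np.linalg.norm(ws, axis=1, keepdims=True)
version_space = [w for w in ws if in_version_space(w, xs, ys)]
print(len(version_space), "of 1000 sampled weights are consistent with z")
```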

Page 8


2.2.2 Regularized Risk Functional

• Drawbacks of minimizing the empirical risk: ERM makes the learning task an ill-posed one (a slight variation of the training sample can cause a large deviation of the expected risk, i.e. overfitting).

• Regularization is one way to overcome this problem:
  - Introduce a regularizer Ω[f] a priori.
  - This restricts the space of solutions to compact subsets of the (originally overly large) space F; it can be achieved by requiring the set F(ε) to be compact for each positive number ε.
  - If we decrease λ for increasing sample size m in the right way, it can be shown that the regularization method leads to R[A_reg(z)] → R[f*] as m → ∞.
  - λ = 0 minimizes only the empirical risk; λ → ∞ minimizes only the regularizer (a minimal sketch follows the formulas below).

Ω: F → ℝ⁺,  A_reg(z) = argmin_{f∈F} (R_emp[f, z] + λ Ω[f]),

F(ε) = {f ∈ F | Ω[f] ≤ ε},  λ → 0 as m → ∞.
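A sketch of the regularized risk functional under the assumed choices Ω[f_w] = ‖w‖² and a small λ; the candidate set and grid search only approximate A_reg(z):

```python
import numpy as np

def emp_risk(w, xs, ys):
    """Empirical (training) error of the linear classifier h_w."""
    return np.mean(ys * (xs @ w) <= 0)

def reg_risk(w, xs, ys, lam):
    """Regularized risk R_emp[f_w, z] + lambda * Omega[f_w],
    with the assumed regularizer Omega[f_w] = ||w||^2."""
    return emp_risk(w, xs, ys) + lam * np.dot(w, w)

def a_reg(xs, ys, candidates, lam):
    """A_reg(z): the candidate minimizing the regularized risk."""
    values = [reg_risk(w, xs, ys, lam) for w in candidates]
    return candidates[int(np.argmin(values))]

rng = np.random.default_rng(4)
xs = rng.normal(size=(60, 2))
ys = np.where(xs @ np.array([1.0, -1.0]) > 0, 1, -1)

# Candidates of varying length: the penalty favours short weight vectors.
candidates = [r * np.array([np.cos(a), np.sin(a)])
              for r in (0.5, 1.0, 2.0)
              for a in np.linspace(0, 2 * np.pi, 60, endpoint=False)]

print(a_reg(xs, ys, candidates, lam=0.0))   # lambda = 0: pure ERM
print(a_reg(xs, ys, candidates, lam=0.1))   # lambda > 0: penalized solution
```

With λ = 0 the selection reduces to pure empirical risk minimization; with λ > 0 the penalty prefers the shortest weight vector among those with minimal training error.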

Page 9


• Structural risk minimization (SRM) by Vapnik: define a structuring of the hypothesis space F into nested subsets F₀ ⊆ F₁ ⊆ ··· ⊆ F of increasing complexity. In each hypothesis space, empirical risk minimization is performed. SRM returns the classifier with the smallest risk and can be used with complexity values of the hypothesis spaces.

• Maximum-a-posteriori (MAP) estimation: interpret the empirical risk as the negative log-probability of the training sample z, for a classifier f. The MAP estimate maximizes the mode of the posterior density. The choice of regularizer is comparable to the choice of prior probability in the Bayesian framework and reflects prior knowledge.

Page 10


P_{Z^m|F=f}(z) = ∏_{i=1}^m P_{Y|X=x_i, F=f}(y_i) · P_{X|F=f}(x_i)

P_{Y|X=x, F=f}(y) = exp(−l(f(x), y)) / Σ_{ỹ∈Y} exp(−l(f(x), ỹ)) = exp(−l(f(x), y)) / C(x)

Assuming a prior density P_F(f) = exp(−λm Ω[f]), by Bayes' theorem the posterior density becomes:

P_{F|Z^m=z}(f) ∝ ∏_{i=1}^m exp(−l(f(x_i), y_i)) · exp(−λm Ω[f]) = exp(−m (R_emp[f, z] + λ Ω[f])),

so maximizing the posterior (the MAP estimate) is equivalent to minimizing the regularized risk.
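A small numerical check of the derivation above, with an assumed (logistic-style) loss and the regularizer Ω[f_w] = ‖w‖²: the unnormalized log-posterior, i.e. log-likelihood plus log-prior with the constants C(x_i) dropped, equals −m(R_emp[f_w, z] + λΩ[f_w]):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 20
xs = rng.normal(size=(m, 2))
ys = np.where(xs @ np.array([1.0, 1.0]) > 0, 1, -1)
lam = 0.05

def loss(fx, y):
    """Illustrative margin-based loss l(f(x), y)."""
    return np.log(1.0 + np.exp(-y * fx))

def omega(w):
    """Assumed regularizer Omega[f_w] = ||w||^2."""
    return np.dot(w, w)

def log_likelihood(w):
    """Sum of log P_{Y|X=x_i, F=f_w}(y_i), dropping the constants C(x_i)."""
    return -sum(loss(np.dot(x, w), y) for x, y in zip(xs, ys))

def log_prior(w):
    """log P_F(f_w) for the prior exp(-lambda * m * Omega[f_w])."""
    return -lam * m * omega(w)

def neg_m_reg_risk(w):
    """-m * (R_emp[f_w, z] + lambda * Omega[f_w])."""
    r_emp = np.mean([loss(np.dot(x, w), y) for x, y in zip(xs, ys)])
    return -m * (r_emp + lam * omega(w))

w = np.array([0.7, -0.3])
print(np.isclose(log_likelihood(w) + log_prior(w), neg_m_reg_risk(w)))  # True
```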