Kernel Classifiers from a Machine Learning Perspective (sec. 2.1-2.2)
Jin-San Yang
Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
(C) 2005, SNU Biointelligence Lab, http://bi.snu.ac.kr/
2.1 The Basic Setting
• Definition 2.1 (Learning problem) Finding the unknown (functional) relationship h between objects x and targets y, based on a sample z of size m.
• For a given object x, evaluate the class distribution and decide on the class y by the decision rule given below.
• Problems with estimating P_Z directly from the given sample:
  • It cannot predict classes for a new object.
  • We need to constrain the set of possible mappings from objects to classes.
• Definition 2.2 (Features and feature space) Objects are mapped into a feature space K via a feature map (see below).
• Definition 2.4 (Linear function and linear classifier) Similar objects are mapped to similar classes via linearity. A linear classifier is unaffected by the scale of the weight vector, and hence the weight is assumed to be of unit length.
$$h(x) = \operatorname*{argmax}_{y \in Y} \mathbf{P}_{Y|X=x}(y), \qquad \mathbf{P}_{Y|X=x}(y) = \frac{\mathbf{P}_Z((x,y))}{\mathbf{P}_X(x)} = \frac{\mathbf{P}_Z((x,y))}{\sum_{\tilde{y} \in Y} \mathbf{P}_Z((x,\tilde{y}))}, \qquad Z = X \times Y$$

Estimating P_Z by the empirical measure,
$$v_z((x,y)) = \left|\left\{ i \in \{1, \ldots, m\} \mid z_i = (x,y) \right\}\right|, \qquad h(x) = \operatorname*{argmax}_{y \in Y} \sum_{i=1}^{m} \mathbf{I}_{z_i = (x,y)},$$
assigns zero probability to any object not in the sample.
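As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the decision rule above for a toy finite joint distribution; the dictionary p_z, the helper bayes_classify, and all probabilities are hypothetical.

```python
# Minimal sketch: the decision rule h(x) = argmax_y P_{Y|X=x}(y)
# for a toy finite joint distribution P_Z.  All names are illustrative.

# Joint probabilities P_Z((x, y)) over X = {"a", "b"} and Y = {-1, +1}.
p_z = {
    ("a", +1): 0.35, ("a", -1): 0.15,
    ("b", +1): 0.10, ("b", -1): 0.40,
}

def bayes_classify(x, classes=(-1, +1)):
    """Return argmax_y P_Z((x, y)); the normalizer P_X(x) cancels."""
    return max(classes, key=lambda y: p_z.get((x, y), 0.0))

print(bayes_classify("a"))  # +1, since 0.35 > 0.15
print(bayes_classify("b"))  # -1, since 0.40 > 0.10
```

Note that bayes_classify returns an arbitrary class for any x outside the sample support, which is exactly the problem noted in the bullet above.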
$$\phi : X \to K \subseteq \ell_2^n, \qquad \mathbf{x} = \phi(x)$$
$$f_{\mathbf{w}}(x) = \langle \phi(x), \mathbf{w} \rangle = \langle \mathbf{x}, \mathbf{w} \rangle, \qquad h_{\mathbf{w}}(x) = \operatorname{sign}(f_{\mathbf{w}}(x)) = \operatorname{sign}(\langle \mathbf{x}, \mathbf{w} \rangle), \qquad \|\mathbf{w}\| = 1$$
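The following minimal sketch (an illustration added here, not from the slides) implements a linear classifier through an explicit feature map; phi, f_w, h_w, and the concrete weight vector are hypothetical.

```python
import math

# Sketch of Definition 2.4: a feature map phi: X -> K and the induced
# linear classifier h_w(x) = sign(<phi(x), w>).

def phi(x):
    """Hypothetical feature map from a scalar object into K, a subset of R^3."""
    return (1.0, x, x * x)

def f_w(x, w):
    """Linear function f_w(x) = <phi(x), w>."""
    return sum(a * b for a, b in zip(phi(x), w))

def h_w(x, w):
    """Linear classifier h_w(x) = sign(f_w(x))."""
    return 1 if f_w(x, w) >= 0 else -1

# Scaling w does not change h_w, so we may normalize w to unit length.
w = (0.5, -1.0, 0.25)
norm = math.sqrt(sum(v * v for v in w))
w = tuple(v / norm for v in w)
print(h_w(2.0, w))
```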
• F is isomorphic to W, so the task of learning reduces to finding the best classifier in the hypothesis space F (definitions below).
• Properties for a measure of the goodness of a classifier:
  • It depends on the unknown P_Z.
  • It should make the optimization task computationally easier.
  • It is pointwise w.r.t. the object-class pairs (due to the independence of the samplings).
• Expected risk: the expectation of the loss, defined below together with the zero-one loss.
• Example 2.7 (Cost matrices) In classifying handwritten digits, the 0-1 loss function is inappropriate: there are approximately 10 times more "no pictures of 1" than "pictures of 1", so the two error types should carry different costs (see the cost-matrix loss below).
$$F = \left\{ x \mapsto \langle \mathbf{x}, \mathbf{w} \rangle \;\middle|\; \mathbf{w} \in K \right\} \subseteq \mathbb{R}^X, \qquad W = \left\{ \mathbf{w} \in K \;\middle|\; \|\mathbf{w}\| = 1 \right\}, \qquad H = \left\{ \operatorname{sign}(f) \;\middle|\; f \in F \right\} \subseteq Y^X$$
$$f^* = \operatorname*{argmin}_{f \in F} \mathbf{E}_{XY}\left[ l(f(X), Y) \right], \qquad R[f] = \mathbf{E}_{XY}\left[ l(f(X), Y) \right], \qquad l_{0\text{-}1}(f(x), y) = \mathbf{I}_{y f(x) \le 0}$$
$$l_{\mathbf{C}}(f(x), y) = \mathbf{C}_{y, \operatorname{sign}(f(x))} = \begin{cases} c_{12} & \text{if } y = +1 \text{ and } f(x) < 0 \\ c_{21} & \text{if } y = -1 \text{ and } f(x) \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
where $\mathbf{C}$ is a cost matrix over the classes $\{+1, -1\}$.
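As an illustration of the cost-matrix loss (the concrete costs below are hypothetical, not from the text), a minimal sketch:

```python
# Sketch of Example 2.7: a cost-sensitive loss l_C(f(x), y) driven by a
# cost matrix C over the classes {-1, +1}.

# C[(true_y, predicted_y)]: misclassifying a true +1 is 10x more costly.
C = {(+1, -1): 10.0, (-1, +1): 1.0, (+1, +1): 0.0, (-1, -1): 0.0}

def sign(v):
    return 1 if v >= 0 else -1

def loss_C(fx, y):
    """l_C(f(x), y) = C_{y, sign(f(x))}."""
    return C[(y, sign(fx))]

print(loss_C(-0.3, +1))  # 10.0: a costly false negative
print(loss_C(+0.3, -1))  # 1.0:  a cheaper false positive
```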
• Remark 2.8 (Geometrical picture) Linear classifiers, parameterized by weight w, are hyperplanes passing through the origin in feature space K.
[Figure: two panels, (Hypothesis space) and (Feature space), illustrating the correspondence between weight vectors w in W and hyperplanes through the origin in the feature space K.]
2.2 Learning by Risk Minimization

• Definition 2.9 (Learning algorithm) A mapping A as defined below, where X is the object space, Y the output space, and F the space of candidate functions. We have no knowledge of the function (or of P_Z) to be optimized.
• Definition 2.10 (Generalization error) Defined below.
• Definition 2.11 (Empirical risk) The empirical risk functional over F, or training error of f, is defined below.
$$A : \bigcup_{m=1}^{\infty} (X \times Y)^m \to F$$
Generalization error:
$$R[A, z] = R[A(z)] - \inf_{f \in F} R[f]$$
Empirical risk and the empirical risk minimization (ERM) algorithm:
$$R_{\mathrm{emp}}[f, z] = \frac{1}{m} \sum_{i=1}^{m} l(f(x_i), y_i) \quad \text{for } z \in (X \times Y)^m, \qquad A_{\mathrm{ERM}}(z) = \operatorname*{argmin}_{f \in F} R_{\mathrm{emp}}[f, z]$$
ERM is consistent if, for every $\varepsilon > 0$,
$$\lim_{m \to \infty} \mathbf{P}_{Z^m}\left( R[A_{\mathrm{ERM}}(Z)] - R[f^*] > \varepsilon \right) = 0$$
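A minimal sketch of the empirical risk and ERM over a tiny, finite hypothesis set (the sample and the threshold classifiers are illustrative, not from the slides):

```python
# Sketch of Definition 2.11: the empirical (zero-one) risk of a
# classifier f on a training sample z = ((x_1, y_1), ..., (x_m, y_m)).

def zero_one_loss(fx, y):
    """l_{0-1}(f(x), y) = I_{y f(x) <= 0}."""
    return 1.0 if y * fx <= 0 else 0.0

def empirical_risk(f, z):
    """R_emp[f, z] = (1/m) * sum_i l(f(x_i), y_i)."""
    return sum(zero_one_loss(f(x), y) for x, y in z) / len(z)

# ERM over a finite hypothesis set: pick the f minimizing R_emp.
z = [(-2.0, -1), (-0.5, -1), (1.0, +1), (2.5, +1)]
hypotheses = [lambda x, t=t: x - t for t in (-1.0, 0.0, 1.5)]
f_erm = min(hypotheses, key=lambda f: empirical_risk(f, z))
print(empirical_risk(f_erm, z))  # 0.0 for this separable toy sample
```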
2.2.1 The (Primal) Perceptron Algorithm
• When an example $(\mathbf{x}_i, y_i)$ is misclassified by the linear classifier $h_{\mathbf{w}_t}$, the update step
$$\mathbf{w}_{t+1} = \mathbf{w}_t + y_i \mathbf{x}_i$$
changes $\mathbf{w}_t$ into $\mathbf{w}_{t+1}$, and thus $\mathbf{x}_i$ attracts the decision hyperplane. A runnable sketch follows.
[Figure: the hyperplanes $\{\mathbf{x} \mid \langle \mathbf{x}, \mathbf{w}_t \rangle = 0\}$ and $\{\mathbf{x} \mid \langle \mathbf{x}, \mathbf{w}_{t+1} \rangle = 0\}$ before and after an update on a misclassified example $(\mathbf{x}_i, +1)$.]
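A minimal sketch of the primal perceptron update described above; the toy data and the epoch limit are illustrative assumptions.

```python
# Primal perceptron: on a misclassified example (x_i, y_i),
# update w <- w + y_i * x_i.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(samples, n_features, max_epochs=100):
    """Return a weight vector consistent with the sample, if one is found."""
    w = [0.0] * n_features
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            if y * dot(w, x) <= 0:          # misclassified (or on the plane)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                    # converged: sample is separated
            break
    return w

# Linearly separable toy sample in feature space K = R^2.
samples = [((1.0, 2.0), +1), ((2.0, 1.0), +1), ((-1.0, -1.5), -1)]
print(perceptron(samples, n_features=2))
```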
• Definition 2.12 (Version space) The set of all classifiers consistent with the training sample. Given the training sample $z = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$ and a hypothesis space $H \subseteq Y^X$,
$$V_H(z) = \left\{ h \in H \;\middle|\; \forall i \in \{1, \ldots, m\} : h(x_i) = y_i \right\}$$
• For linear classifiers, the set of consistent weights is called the version space (a membership check is sketched below):
$$V(z) = \left\{ \mathbf{w} \in W \;\middle|\; \forall i \in \{1, \ldots, m\} : y_i \langle \mathbf{x}_i, \mathbf{w} \rangle > 0 \right\}$$
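A minimal sketch of testing version-space membership for a weight vector (helper names and data are illustrative):

```python
# Sketch of Definition 2.12: w lies in the version space V(z) iff
# y_i * <x_i, w> > 0 for every training example.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_version_space(w, z):
    """True iff w is consistent with every (x_i, y_i) in z."""
    return all(y * dot(x, w) > 0 for x, y in z)

z = [((1.0, 2.0), +1), ((2.0, 1.0), +1), ((-1.0, -1.5), -1)]
print(in_version_space((0.5, 0.5), z))   # True: separates the sample
print(in_version_space((-1.0, 0.0), z))  # False: inconsistent with z
```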
2.2.2 Regularized Risk Functional

• Drawbacks of minimizing the empirical risk: ERM makes the learning task ill-posed, since a slight variation of the training sample can cause a large deviation of the expected risk (overfitting).
• Regularization is one way to overcome this problem (a code sketch follows this list):
• Introduce a regularizer $\Omega : F \to \mathbb{R}^+$ a priori and minimize the regularized risk:
$$A(z) = \operatorname*{argmin}_{f \in F} R_{\mathrm{reg}}[f, z], \qquad R_{\mathrm{reg}}[f, z] = R_{\mathrm{emp}}[f, z] + \lambda \Omega[f]$$
• This restricts the space of solutions to compact subsets of the (originally overly large) space F, which can be achieved by requiring the set $F(\varepsilon) = \{ f \in F \mid \Omega[f] \le \varepsilon \}$ to be compact for each positive number $\varepsilon > 0$.
• If we decrease $\lambda$ in the right way as the sample size m increases, it can be shown that the regularization method leads to $f^*$ as $m \to \infty$.
• $\lambda$ controls the trade-off: as $\lambda \to 0$ we minimize only the empirical risk; as $\lambda \to \infty$ we minimize only the regularizer.
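A minimal sketch of regularized risk minimization for a linear classifier, using gradient descent. The squared surrogate loss, step size, and data are assumptions made here for a differentiable illustration; the slides leave the loss abstract.

```python
# Minimize R_emp[f_w, z] + lambda * ||w||^2 by gradient descent,
# with a squared loss as a differentiable surrogate.

def reg_risk_minimize(z, n_features, lam=0.1, lr=0.05, steps=500):
    w = [0.0] * n_features
    m = len(z)
    for _ in range(steps):
        grad = [2.0 * lam * wi for wi in w]           # gradient of lam*||w||^2
        for x, y in z:
            fx = sum(wi * xi for wi, xi in zip(w, x))
            err = fx - y                               # squared-loss residual
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / m
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

z = [((1.0, 2.0), +1), ((2.0, 1.0), +1), ((-1.0, -1.5), -1)]
print(reg_risk_minimize(z, n_features=2))
```

Increasing lam shrinks the returned weights toward zero, which is exactly the trade-off the last bullet describes.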
• Structural risk minimization (SRM) by Vapnik: Define a structuring of the hypothesis space F into nested subsets of increasing complexity,
$$F_0 \subseteq F_1 \subseteq \cdots \subseteq F$$
In each hypothesis space, empirical risk minimization is performed. SRM returns the classifier with the smallest combination of training error and complexity value, and can be used with arbitrary complexity values (a sketch follows below).
• Maximum-a-posteriori (MAP) estimation: Interpret the empirical risk as the negative log-probability of the training sample z given a classifier f. The MAP estimate maximizes the mode of the posterior density (the derivation follows the sketch below). The choice of regularizer is comparable to the choice of prior probability in the Bayesian framework and reflects prior knowledge.
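A minimal sketch of the SRM idea over a nested family of polynomial-threshold classifiers. The linear-in-degree complexity penalty, the candidate thresholds, and the data are illustrative assumptions, not Vapnik's exact bound.

```python
# SRM sketch: nested hypothesis spaces F_0 (constants) through F_3
# (cubic thresholds), scored by empirical risk plus a complexity penalty.

def emp_risk(f, z):
    return sum(1.0 for x, y in z if y * f(x) <= 0) / len(z)

def srm_select(z, max_degree=3, c=0.05):
    best = None
    for d in range(max_degree + 1):
        # F_d: classifiers x -> x**d - t for a few candidate thresholds t.
        for t in (-1.0, 0.0, 1.0):
            f = lambda x, d=d, t=t: x ** d - t
            score = emp_risk(f, z) + c * d   # risk + complexity penalty
            if best is None or score < best[0]:
                best = (score, d, t)
    return best

z = [(-2.0, -1), (-0.5, -1), (1.0, +1), (2.0, +1)]
print(srm_select(z))  # (score, chosen degree, threshold)
```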
Interpreting the loss as a negative log-likelihood,
$$\mathbf{P}_{Y|X=x, F=f}(y) = \frac{\exp\left(-l(f(x), y)\right)}{\sum_{\tilde{y} \in Y} \exp\left(-l(f(x), \tilde{y})\right)},$$
so that, for the training sample $z = (\mathbf{x}, \mathbf{y})$,
$$\mathbf{P}_{Z^m|F=f}(z) = \prod_{i=1}^{m} \mathbf{P}_{Y|X=x_i, F=f}(y_i)\, \mathbf{P}_X(x_i).$$
Assuming a prior density $\mathbf{f}_F(f) \propto \exp\left(-\lambda m \Omega[f]\right)$, by Bayes' theorem the posterior density becomes
$$\mathbf{f}_{F|Z^m=z}(f) \propto \prod_{i=1}^{m} \exp\left(-l(f(x_i), y_i)\right) \cdot \exp\left(-\lambda m \Omega[f]\right) = \exp\left(-m\left(R_{\mathrm{emp}}[f, z] + \lambda \Omega[f]\right)\right),$$
up to factors not depending on f. Hence the MAP estimate, the mode of this posterior, is exactly the minimizer of the regularized risk.
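A small numeric sanity check of this equivalence (added here as an illustration): over a grid of one-parameter linear classifiers, the posterior density $\exp(-m(R_{\mathrm{emp}} + \lambda \Omega))$ peaks exactly where the regularized risk is smallest. The grid, zero-one loss, and squared-norm regularizer are assumptions.

```python
import math

# MAP vs. regularized risk: the argmax of exp(-m * R_reg) equals the
# argmin of R_reg, since exp is strictly monotone.

z = [(-1.0, -1), (0.5, +1), (2.0, +1)]
lam, m = 0.1, len(z)

def reg_risk(w):
    r_emp = sum(1.0 for x, y in z if y * (w * x) <= 0) / m
    return r_emp + lam * w * w                 # R_emp + lambda * Omega[f_w]

grid = [i / 10.0 for i in range(-20, 21)]
w_map = max(grid, key=lambda w: math.exp(-m * reg_risk(w)))
w_reg = min(grid, key=reg_risk)
print(w_map == w_reg)  # True: same classifier wins both criteria
```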