Review Rong Jin

Post on 22-Dec-2015


Page 1:
Review
Rong Jin

Page 2:
Comparison of Different Classification Models
The goal of all classifiers:
Predicting class label y for an input x
Estimating p(y|x)

Page 3:
K Nearest Neighbor (kNN) Approach

[Figure: kNN classification of a query point with k=1 and k=4]

Probability interpretation: estimate p(y|x) as

$$p(y\,|\,x) = \frac{\left|\{(x_i, y_i) \mid y_i = y,\ x_i \in N(x)\}\right|}{|N(x)|}$$

where N(x) is the neighborhood around x.
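As a concrete illustration, this counting estimate can be sketched in a few lines of Python (the helper name and the 1-D toy data are illustrative assumptions, not from the slides):

```python
from collections import Counter

def knn_posterior(x, data, k):
    """Estimate p(y|x) as the fraction of the k nearest neighbors with label y."""
    # Sort training points by squared distance to x and keep the k closest.
    neighbors = sorted(data, key=lambda p: (p[0] - x) ** 2)[:k]
    counts = Counter(y for _, y in neighbors)
    return {y: c / k for y, c in counts.items()}

# Toy 1-D data: label "a" clusters near 0, label "b" near 10.
data = [(0.0, "a"), (0.5, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
posterior = knn_posterior(0.2, data, k=3)
```

With k=3 and a query at 0.2, all three neighbors carry label "a", so the estimated p(a|x) is 1.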

Page 4:
K Nearest Neighbor Approach (kNN)
What is the appropriate size for the neighborhood N(x)?
Leave-one-out approach
Weighted K nearest neighbor:
A neighbor is defined through a weight function

$$w(x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)$$

Estimate p(y|x):

$$p(y\,|\,x) = \frac{\sum_i w(x_i)\,\delta(y, y_i)}{\sum_i w(x_i)}$$

How to estimate the appropriate value for σ²?

Page 7:
Weighted K Nearest Neighbor
Leave-one-out + maximum likelihood
Estimate the leave-one-out probability (note that w(x_i) = 1 for the left-out point x_i):

$$p(y_i\,|\,x_i) = \frac{\sum_{j\neq i} w(x_j)\,\delta(y_j, y_i)}{\sum_{j\neq i} w(x_j)} = \frac{\sum_{j} w(x_j)\,\delta(y_j, y_i) - 1}{\sum_{j} w(x_j) - 1}$$

Leave-one-out likelihood of the training data:

$$l_{\mathrm{LOO}} = \sum_{i=1}^{n}\log p(y_i\,|\,x_i) = \sum_{i=1}^{n}\log\frac{\sum_{j\neq i} w(x_j)\,\delta(y_j, y_i)}{\sum_{j\neq i} w(x_j)}$$

Search for the optimal σ² by maximizing the leave-one-out likelihood.
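The σ² search can be sketched as a grid search over the leave-one-out log-likelihood (the toy data, the candidate grid, and the small floor inside the log are illustrative assumptions):

```python
import math

def loo_log_likelihood(xs, ys, sigma2):
    """Leave-one-out log-likelihood of the Gaussian-weighted kNN estimate."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num = den = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            if j == i:
                continue  # leave example i out
            w = math.exp(-((xi - xj) ** 2) / (2 * sigma2))
            den += w
            num += w * (yj == yi)
        total += math.log(max(num / den, 1e-12))  # guard against log(0)
    return total

# Two well-separated 1-D clusters.
xs = [0.0, 0.3, 0.6, 5.0, 5.3, 5.6]
ys = ["a", "a", "a", "b", "b", "b"]
# Grid search for the sigma^2 that maximizes the LOO likelihood.
best = max([0.01, 0.1, 1.0, 10.0, 100.0], key=lambda s2: loo_log_likelihood(xs, ys, s2))
```

On this data a small σ² wins, since large σ² spreads weight across both clusters and lowers every leave-one-out probability.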

Page 9:
Gaussian Generative Model
p(y|x) ~ p(x|y) p(y): posterior = likelihood × prior
Estimate p(x|y) and p(y)
Allocate a separate set of parameters for each class: θ = {θ_1, θ_2, …, θ_c}, with p(x|y; θ) = p(x; θ_y)
Maximum likelihood estimation:

$$p(x\,|\,y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\exp\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)$$

$$l = \sum_{i=1}^{N}\log\left[p(x_i\,|\,y_i)\,p_{y_i}\right] = \sum_{i=1}^{N}\left[-\frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} - \frac{1}{2}\log 2\pi - \log\sigma_{y_i} + \log p_{y_i}\right]$$
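A minimal sketch of these maximum likelihood estimates and the resulting Bayes classifier for 1-D inputs (the toy data and helper names are illustrative assumptions):

```python
import math

def fit_gaussian_classes(xs, ys):
    """MLE of (mu_y, sigma_y^2) and the prior p_y for each class."""
    params = {}
    for y in set(ys):
        pts = [x for x, label in zip(xs, ys) if label == y]
        mu = sum(pts) / len(pts)
        var = sum((x - mu) ** 2 for x in pts) / len(pts)
        params[y] = (mu, var, len(pts) / len(xs))
    return params

def log_joint(x, mu, var, prior):
    # log p(x|y) + log p(y) for a 1-D Gaussian class model
    return -((x - mu) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var) + math.log(prior)

def predict(x, params):
    # Bayes rule: the class with the largest joint likelihood wins.
    return max(params, key=lambda y: log_joint(x, *params[y]))

xs = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
ys = ["a", "a", "a", "b", "b", "b"]
params = fit_gaussian_classes(xs, ys)
```

Since the normalizer p(x) is shared across classes, comparing log p(x|y) + log p(y) is equivalent to comparing p(y|x).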

Page 11:
Gaussian Generative Model
It is difficult to estimate p(x|y) if x is of high dimensionality.
Naïve Bayes:

$$p(x\,|\,y;\theta) = p(x^1\,|\,y;\theta)\,p(x^2\,|\,y;\theta)\cdots p(x^d\,|\,y;\theta)$$

Essentially a linear model.
How to make a Gaussian generative model discriminative? (μ_y, σ_y) of each class are estimated only from the data belonging to that class, which lacks discriminative power.

Page 12:
Gaussian Generative Model
Maximum likelihood estimation of the conditional likelihood:

$$p(y\,|\,x) = \frac{p(x\,|\,y)\,p(y)}{\sum_{y'=1}^{c} p(x\,|\,y')\,p(y')} = \frac{\frac{p_y}{\sigma_y}\exp\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)}{\sum_{y'=1}^{c}\frac{p_{y'}}{\sigma_{y'}}\exp\left(-\frac{(x-\mu_{y'})^2}{2\sigma_{y'}^2}\right)}$$

$$l = \sum_{i=1}^{N}\log p(y_i\,|\,x_i) = \sum_{i=1}^{N}\left[-\frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} - \log\sigma_{y_i} + \log p_{y_i} - \log\sum_{y'=1}^{c}\frac{p_{y'}}{\sigma_{y'}}\exp\left(-\frac{(x_i-\mu_{y'})^2}{2\sigma_{y'}^2}\right)\right]$$

How to optimize this objective function?

Page 13:
Gaussian Generative Model
Bound optimization algorithm:
θ = {p_1, …, p_c, μ_1, …, μ_c, σ_1, …, σ_c}: parameters of the current iteration
θ' = {p'_1, …, p'_c, μ'_1, …, μ'_c, σ'_1, …, σ'_c}: parameters of the last iteration

$$l(\theta) - l(\theta') = \sum_{i=1}^{N}\left[\frac{(x_i-\mu'_{y_i})^2}{2\sigma_{y_i}'^2} - \frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} + \log\frac{\sigma'_{y_i}\,p_{y_i}}{\sigma_{y_i}\,p'_{y_i}} - \log\frac{\sum_{y'=1}^{c}\frac{p_{y'}}{\sigma_{y'}}\exp\left(-\frac{(x_i-\mu_{y'})^2}{2\sigma_{y'}^2}\right)}{\sum_{y'=1}^{c}\frac{p'_{y'}}{\sigma'_{y'}}\exp\left(-\frac{(x_i-\mu'_{y'})^2}{2\sigma_{y'}'^2}\right)}\right]$$

Page 14:
Gaussian Generative Model
Using log x ≤ x − 1 on the last term:

$$l(\theta) - l(\theta') \geq \sum_{i=1}^{N}\left[\frac{(x_i-\mu'_{y_i})^2}{2\sigma_{y_i}'^2} - \frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} + \log\frac{\sigma'_{y_i}\,p_{y_i}}{\sigma_{y_i}\,p'_{y_i}} + 1 - \frac{\sum_{y'=1}^{c}\frac{p_{y'}}{\sigma_{y'}}\exp\left(-\frac{(x_i-\mu_{y'})^2}{2\sigma_{y'}^2}\right)}{\sum_{y'=1}^{c}\frac{p'_{y'}}{\sigma'_{y'}}\exp\left(-\frac{(x_i-\mu'_{y'})^2}{2\sigma_{y'}'^2}\right)}\right]$$

We have decomposed the interaction of parameters between different classes.
Question: how to handle x with multiple features?

Page 15:
Logistic Regression Model
A linear decision boundary: w·x + b

$$w\cdot x + b \geq 0 \;\Rightarrow\; \text{positive}, \qquad w\cdot x + b < 0 \;\Rightarrow\; \text{negative}$$

A probabilistic model p(y|x):

$$p(y\,|\,x) = \frac{1}{1+\exp\left(-y\,(w\cdot x + b)\right)}, \qquad y \in \{-1, +1\}$$

Maximum likelihood approach for estimating the weights w and the threshold b:

$$l(D_{\mathrm{train}}) = \sum_{i=1}^{N}\log p(y_i\,|\,x_i) = \sum_{i=1}^{N}\log\frac{1}{1+\exp\left(-y_i\,(w\cdot x_i + b)\right)}$$
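The maximum likelihood fit can be sketched with plain gradient ascent on l(D_train); the slides do not prescribe an optimizer, and the learning rate, step count, and toy data below are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Maximize sum_i log p(y_i|x_i) by gradient ascent (1-D inputs, y in {-1,+1})."""
    w = b = 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            # d/dw log sigmoid(y (w x + b)) = y x * sigmoid(-y (w x + b))
            g = y * sigmoid(-y * (w * x + b))
            gw += g * x
            gb += g
        w += lr * gw
        b += lr * gb
    return w, b

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [-1, -1, -1, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```

On separable data like this, the unregularized likelihood keeps pushing |w| upward, which is exactly the overfitting issue the next slide addresses.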

Page 16:
Logistic Regression Model
Overfitting issue. Example: text classification
Words that appear in only one document will be assigned infinitely large weights.
Solution: regularization

$$l(D_{\mathrm{train}}) = \sum_{i=1}^{N}\log p(y_i\,|\,x_i) - s\sum_{j=1}^{m} w_j^2 = \sum_{i=1}^{N}\log\frac{1}{1+\exp\left(-y_i\,(w\cdot x_i + b)\right)} - s\sum_{j=1}^{m} w_j^2$$

The term s·Σ_j w_j² is the regularization term.

Page 17:
Non-linear Logistic Regression Model
Kernelized logistic regression model:

$$x \to \phi(x), \qquad w \to \sum_{i=1}^{N}\alpha_i\,\phi(x_i), \qquad w\cdot\phi(x) = \sum_{i=1}^{N}\alpha_i\,K(x_i, x)$$

$$p(y\,|\,x) = \frac{1}{1+\exp\left(-y\sum_{i=1}^{N}\alpha_i K(x_i, x)\right)}$$

$$l_{\mathrm{reg}} = \sum_{i=1}^{N}\log\frac{1}{1+\exp\left(-y_i\sum_{j=1}^{N}\alpha_j K(x_j, x_i)\right)} - c\sum_{i,j=1}^{N}\alpha_i\alpha_j K(x_i, x_j)$$
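A sketch of this kernelized objective optimized by gradient ascent on the coefficients α (the RBF kernel choice, hyperparameters, and toy data are assumptions for illustration):

```python
import math

def rbf(a, b, gamma=1.0):
    return math.exp(-gamma * (a - b) ** 2)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_kernel_logistic(xs, ys, c=0.01, lr=0.1, steps=500):
    """Gradient ascent on the dual weights alpha of kernelized logistic regression."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(steps):
        f = [sum(alpha[j] * K[j][i] for j in range(n)) for i in range(n)]
        grad = [
            sum(ys[i] * K[k][i] * sigmoid(-ys[i] * f[i]) for i in range(n))
            - 2 * c * sum(alpha[j] * K[k][j] for j in range(n))  # regularization term
            for k in range(n)
        ]
        alpha = [a + lr * g for a, g in zip(alpha, grad)]
    return alpha

def predict(x, xs, alpha):
    return 1 if sum(a * rbf(xi, x) for a, xi in zip(alpha, xs)) >= 0 else -1

# A 1-D pattern (outer points +1, inner points -1) that no single
# linear boundary w*x + b can separate.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, -1, -1, 1]
alpha = fit_kernel_logistic(xs, ys)
```

The RBF kernel lets the model carve out a non-linear boundary that fits this pattern, which a plain linear logistic model cannot.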

Page 18:
Non-linear Logistic Regression Model
Hierarchical Mixture Expert Model: group linear classifiers into a tree structure.

[Figure: a tree with a router r(x) at the root, group classifiers g1(x) and g2(x) in the group layer, and experts m1,1(x), m1,2(x), m2,1(x), m2,2(x) in the expert layer]

$$p(y\,|\,x) = r(+1\,|\,x)\left[g_1(+1\,|\,x)\,m_{1,1}(y\,|\,x) + g_1(-1\,|\,x)\,m_{1,2}(y\,|\,x)\right] + r(-1\,|\,x)\left[g_2(+1\,|\,x)\,m_{2,1}(y\,|\,x) + g_2(-1\,|\,x)\,m_{2,2}(y\,|\,x)\right]$$

Products generate nonlinearity in the prediction function.
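The tree-structured prediction above can be sketched as a forward pass in which every node is a 1-D linear classifier passed through a sigmoid (all parameter values below are arbitrary illustrations):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hme_posterior(x, r, g1, g2, experts):
    """p(y=+1|x) for a two-level HME; r, g1, g2 and each expert are (w, b) pairs."""
    def p(params):  # linear classifier -> probability of the +1 branch/label
        w, b = params
        return sigmoid(w * x + b)

    (m11, m12), (m21, m22) = experts
    # The router picks a group, each group gate picks an expert,
    # and the experts predict y; all choices are mixed probabilistically.
    return p(r) * (p(g1) * p(m11) + (1 - p(g1)) * p(m12)) \
         + (1 - p(r)) * (p(g2) * p(m21) + (1 - p(g2)) * p(m22))

prob = hme_posterior(
    0.5,
    r=(1.0, 0.0), g1=(2.0, 0.0), g2=(-2.0, 0.0),
    experts=(((3.0, 0.0), (-3.0, 0.0)), ((1.0, 0.0), (-1.0, 0.0))),
)
```

Because the gates and experts are multiplied together, the resulting p(y|x) is a non-linear function of x even though every node is linear.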

Page 19:
Non-linear Logistic Regression Model
It can be a rough assumption that all data points fit a single linear model.
But it is usually appropriate to assume a local linear model.
kNN can be viewed as a localized model without any parameters.
Can we extend the kNN approach by introducing a localized linear model?

Page 20:
Localized Logistic Regression Model
Similar to the weighted kNN:
Weight each training example by

$$w(x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)$$

Build a logistic regression model using the weighted examples. The regularized objective

$$l_{\mathrm{reg}} = \sum_{i=1}^{N}\log\frac{1}{1+\exp\left(-y_i\,(w\cdot x_i + b)\right)} - c\,\|w\|^2$$

becomes localized:

$$l_{\mathrm{reg}}(x) = \sum_{i=1}^{N} w(x_i)\log\frac{1}{1+\exp\left(-y_i\,(w\cdot x_i + b)\right)} - c\,\|w\|^2$$

Page 22:
Conditional Exponential Model
An extension of the logistic regression model to the multi-class case.
A different set of weights w_y and threshold b_y for each class y:

$$p(y\,|\,x;\theta) = \frac{1}{Z(x)}\exp\left(b_y + x\cdot w_y\right), \qquad Z(x) = \sum_{y'}\exp\left(b_{y'} + x\cdot w_{y'}\right)$$

Translation invariance: shifting all parameters by a constant leaves p(y|x) unchanged, so we can fix w_1 = 0, b_1 = 0.
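The model and its translation invariance can be illustrated directly (the toy parameter values are assumptions):

```python
import math

def softmax_posterior(x, ws, bs):
    """p(y|x) = exp(b_y + x*w_y) / Z(x) for classes y = 0..c-1 (1-D x)."""
    scores = [b + x * w for w, b in zip(ws, bs)]
    z = sum(math.exp(s) for s in scores)  # the normalizer Z(x)
    return [math.exp(s) / z for s in scores]

p1 = softmax_posterior(2.0, ws=[0.0, 1.0, -1.0], bs=[0.0, 0.5, 0.5])
# Translation invariance: shifting every w_y by +3 and every b_y by -1
# shifts all scores by the same amount, which cancels in Z(x).
p2 = softmax_posterior(2.0, ws=[3.0, 4.0, 2.0], bs=[-1.0, -0.5, -0.5])
```

Because the shift cancels, one class's parameters (w_1, b_1) can always be pinned at zero without losing expressiveness.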

Page 23:
Maximum Entropy Model
Finding the simplest model that matches the data:

$$\max_{p(y|x)}\ H(y\,|\,x) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y\,|\,x_i)\log p(y\,|\,x_i)$$

subject to

$$\frac{1}{N}\sum_{i=1}^{N} p(y\,|\,x_i)\,x_i = \frac{1}{N}\sum_{i=1}^{N} x_i\,\delta(y, y_i)\ \ \text{for every } y, \qquad \sum_{y} p(y\,|\,x_i) = 1$$

Maximizing entropy prefers the uniform distribution.
The constraints enforce the model to be consistent with the observed data.
Iterative scaling methods are used for optimization.
Page 24:
Support Vector Machine
Classification margin
Maximum margin principle: separate the data far away from the decision boundary w·x + b = 0
Two objectives:
Minimize the classification error over the training data
Maximize the classification margin
Support vectors: only the support vectors have an impact on the location of the decision boundary.

[Figure: two classes (+1 and -1) separated by the boundary w·x + b = 0; the points lying on the margin are the support vectors]

Page 26:
Support Vector Machine
Separable case:

$$\{w^*, b^*\} = \arg\min_{w,b}\ \sum_{j=1}^{m} w_j^2 \qquad \text{subject to} \qquad y_i\,(w\cdot x_i + b) \geq 1, \quad i = 1, \dots, N$$

Noisy case:

$$\{w^*, b^*\} = \arg\min_{w,b}\ \sum_{j=1}^{m} w_j^2 + c\sum_{i=1}^{N}\varepsilon_i \qquad \text{subject to} \qquad y_i\,(w\cdot x_i + b) \geq 1 - \varepsilon_i,\ \ \varepsilon_i \geq 0, \quad i = 1, \dots, N$$

Quadratic programming!
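The slides pose both cases as quadratic programs. As a rough illustrative alternative (not the QP solver the slides imply), the noisy case can be sketched by subgradient descent on the equivalent unconstrained hinge-loss form; the hyperparameters and toy data below are assumptions:

```python
def fit_svm_subgradient(xs, ys, c=10.0, lr=0.01, steps=2000):
    """Minimize ||w||^2 + c * sum_i max(0, 1 - y_i (w x_i + b)) by subgradient descent."""
    w = b = 0.0
    for _ in range(steps):
        gw, gb = 2 * w, 0.0  # gradient of the ||w||^2 term
        for x, y in zip(xs, ys):
            if y * (w * x + b) < 1:  # margin violated: hinge subgradient is -y x
                gw -= c * y * x
                gb -= c * y
        w -= lr * gw
        b -= lr * gb
    return w, b

xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [-1, -1, -1, 1, 1, 1]
w, b = fit_svm_subgradient(xs, ys)
```

Eliminating each slack variable as ε_i = max(0, 1 − y_i(w·x_i + b)) is what turns the constrained QP into this unconstrained objective.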

Page 28:
Logistic Regression Model vs. Support Vector Machine
Logistic regression model:

$$\{w^*, b^*\} = \arg\min_{w,b}\ \sum_{i=1}^{N}\log\left[1+\exp\left(-y_i\,(w\cdot x_i + b)\right)\right] + s\sum_{j=1}^{m} w_j^2$$

Support vector machine:

$$\{w^*, b^*\} = \arg\min_{w,b}\ \sum_{j=1}^{m} w_j^2 + c\sum_{i=1}^{N}\varepsilon_i \qquad \text{subject to} \qquad y_i\,(w\cdot x_i + b) \geq 1 - \varepsilon_i,\ \ \varepsilon_i \geq 0$$

Different loss functions for punishing mistakes; the regularization terms are identical.
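The two penalties can be compared numerically as functions of the margin y(w·x + b) (the sample margin values are illustrative):

```python
import math

def logistic_loss(margin):
    # penalty used by (regularized) logistic regression
    return math.log(1 + math.exp(-margin))

def hinge_loss(margin):
    # penalty implied by the SVM slack variables: max(0, 1 - margin)
    return max(0.0, 1.0 - margin)

# Both losses punish small or negative margins; they differ in shape:
# hinge is exactly zero once the margin reaches 1, logistic only decays toward zero.
margins = [-2.0, 0.0, 1.0, 3.0]
pairs = [(logistic_loss(m), hinge_loss(m)) for m in margins]
```

This makes the point of the next slide concrete: past a margin of 1 the SVM stops caring, while logistic regression never quite does.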

Page 29:
Logistic Regression Model vs. Support Vector Machine

[Plot: loss as a function of wx+b (horizontal axis, roughly -3 to 1; vertical axis, 0 to 3.5), comparing the loss function for logistic regression with the loss function for the SVM; the SVM hinge loss reaches zero at y(wx+b) = 1, while the logistic loss decays smoothly]

Logistic regression differs from the support vector machine only in the loss function.

Page 30:
Kernel Tricks
Introducing nonlinearity into the discriminative models.
Diffusion kernel:
A graph Laplacian L for local similarity:

$$L_{i,j} = \begin{cases} s(x_i, x_j) & i \neq j \\ -\sum_{k\neq i} s(x_i, x_k) & i = j \end{cases}$$

Diffusion kernel: propagate local similarity information into a global one:

$$K = e^{\gamma L}, \qquad \text{or equivalently} \qquad \frac{dK}{d\gamma} = LK$$
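A minimal sketch of the diffusion kernel on a tiny similarity graph, using a truncated Taylor series for the matrix exponential (the 3-node chain and the truncation depth are illustrative assumptions; for real use a proper matrix-exponential routine would be preferable):

```python
def graph_laplacian(s):
    """L[i][j] = s_ij off-diagonal, -sum_k s_ik on the diagonal."""
    n = len(s)
    return [[s[i][j] if i != j else -sum(s[i][k] for k in range(n) if k != i)
             for j in range(n)] for i in range(n)]

def diffusion_kernel(L, gamma, terms=30):
    """K = exp(gamma * L) via a truncated Taylor series (small matrices only)."""
    n = len(L)
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in K]
    for t in range(1, terms):
        # term <- term @ (gamma * L) / t, the next Taylor term
        term = [[sum(term[i][k] * gamma * L[k][j] for k in range(n)) / t
                 for j in range(n)] for i in range(n)]
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return K

# Local similarity on a 3-node chain: 0-1 and 1-2 are similar, 0-2 is not.
s = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
K = diffusion_kernel(graph_laplacian(s), gamma=1.0)
```

The key effect: K[0][2] is positive even though s(x_0, x_2) = 0, i.e. the local similarity has been propagated into a global one.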

Page 31:
Fisher Kernel
Derive a kernel function from a generative model.
Key idea: map a point x in the original input space into the model space; the similarity of two data points is measured in the model space.

[Figure: points x1 and x2 in the original input space are mapped to φ(x1) and φ(x2) in the model space, where their similarity is measured]

Page 32:
Kernel Methods in Generative Models
Usually, kernels can be introduced to a generative model through a Gaussian process.
Define a "kernelized" covariance matrix.
It must be positive semi-definite, similar to Mercer's condition.

Page 33:
Multi-class SVM
SVMs can only handle two-class outputs. One-against-all:
Learn N SVMs:
SVM 1 learns "Output==1" vs "Output != 1"
SVM 2 learns "Output==2" vs "Output != 2"
:
SVM N learns "Output==N" vs "Output != N"

Page 34:
Error-Correcting Output Code (ECOC)
Encode each class into a bit vector over the binary classifiers S1..S4:

      S1  S2  S3  S4
  A    1   0   0   1
  B    0   1   0   1
  C    0   1   1   0

  x    1   1   1   0

The output for x is decoded by finding the codeword with the smallest Hamming distance (here class C).
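Decoding the table above by Hamming distance can be sketched as (the helper names are illustrative):

```python
def ecoc_decode(output, codebook):
    """Pick the class whose codeword has the smallest Hamming distance to the output bits."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(codebook, key=lambda cls: hamming(output, codebook[cls]))

# Codewords over the four binary classifiers S1..S4 (the table above).
codebook = {"A": [1, 0, 0, 1], "B": [0, 1, 0, 1], "C": [0, 1, 1, 0]}
label = ecoc_decode([1, 1, 1, 0], codebook)
```

For the output 1 1 1 0, the distances to A, B, and C are 3, 3, and 1, so the decoder returns C; redundant bits let a few classifier errors be corrected.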

Page 35:
Ordinal Regression
A special class of multi-class classification problems: there is a natural ordinal relationship between the multiple classes.
Maximum margin principle: the computation of the margin involves multiple classes.

[Figure: data from the classes 'good', 'OK', and 'bad' separated along the direction w by two parallel decision boundaries]

Page 36:
Ordinal Regression

$$\{w^*, b_1^*, b_2^*\} = \arg\max_{w, b_1, b_2}\ \mathrm{margin}(w, b_1, b_2) = \arg\max_{w, b_1, b_2}\ \min\left(\mathrm{margin}_1(w, b_1),\ \mathrm{margin}_2(w, b_2)\right)$$

$$= \arg\max_{w, b_1, b_2}\ \min\left(\min_{x_i \in D_g \cup D_o} \frac{|x_i\cdot w - b_1|}{\|w\|},\ \min_{x_i \in D_o \cup D_b} \frac{|x_i\cdot w - b_2|}{\|w\|}\right)$$

subject to

$$x_i \in D_g:\ x_i\cdot w - b_1 > 0; \qquad x_i \in D_o:\ x_i\cdot w - b_1 < 0,\ \ x_i\cdot w - b_2 > 0; \qquad x_i \in D_b:\ x_i\cdot w - b_2 < 0$$

This is equivalent to

$$\{w^*, b_1^*, b_2^*\} = \arg\min_{w, b_1, b_2}\ \|w\|^2$$

subject to

$$x_i \in D_g:\ x_i\cdot w - b_1 \geq 1; \qquad x_i \in D_o:\ x_i\cdot w - b_1 \leq -1,\ \ x_i\cdot w - b_2 \geq 1; \qquad x_i \in D_b:\ x_i\cdot w - b_2 \leq -1$$

where D_g, D_o, and D_b are the 'good', 'OK', and 'bad' examples, and b_1 > b_2.

Page 37:

Decision Tree

From slides of Andrew Moore

Page 38:
Decision Tree
A greedy approach for generating a decision tree:
1. Choose the most informative feature, using the mutual information measure.
2. Split the data set according to the values of the selected feature.
3. Recurse until each data item is classified correctly.
Attributes with real values: quantize the real value into a discrete one.
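Step 1, choosing the most informative feature by mutual information, can be sketched as (the toy data are illustrative):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def information_gain(feature_values, labels):
    """Mutual information between a discrete feature and the class label."""
    n = len(labels)
    cond = 0.0  # conditional entropy H(label | feature)
    for v in set(feature_values):
        subset = [y for f, y in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels    = ["+", "+", "-", "-"]
feat_good = ["a", "a", "b", "b"]  # splits the classes perfectly
feat_bad  = ["a", "b", "a", "b"]  # carries no information about the label
```

The greedy step picks the feature with the largest gain; here feat_good yields a full bit of information while feat_bad yields none.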

Page 39:
Decision Tree
The overfitting problem.
Tree pruning:
Reduced-error pruning
Rule post-pruning

Page 41:

Generalized Decision Tree

[Figure: left, a decision tree with a simple data partition, splitting on Attribute 1 and Attribute 2; right, a decision tree using classifiers for the data partition, where each node is a linear classifier]