Review Rong Jin

Post on 22-Dec-2015


Page 1:
Review
Rong Jin

Page 2:
Comparison of Different Classification Models
The goal of all classifiers:
Predicting class label y for an input x
Estimating p(y|x)

Page 3:
K Nearest Neighbor (kNN) Approach

[Figure: kNN classification of a query point with k=1 and k=4]

Probability interpretation: estimate p(y|x) as

$$p(y\,|\,x) = \frac{\left|\{(x_i, y_i) \mid y_i = y,\ x_i \in N(x)\}\right|}{|N(x)|}$$

where N(x) is the neighborhood around x.
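As a concrete illustration, this counting estimate can be sketched in a few lines of Python (the helper name and the 1-D toy data are illustrative assumptions, not from the slides):

```python
from collections import Counter

def knn_posterior(x, data, k):
    """Estimate p(y|x) as the fraction of the k nearest neighbors with label y."""
    # Sort training points by squared distance to x and keep the k closest.
    neighbors = sorted(data, key=lambda p: (p[0] - x) ** 2)[:k]
    counts = Counter(y for _, y in neighbors)
    return {y: c / k for y, c in counts.items()}

# Toy 1-D data: label "a" clusters near 0, label "b" near 10.
data = [(0.0, "a"), (0.5, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
posterior = knn_posterior(0.2, data, k=3)
```

With k=3 and a query at 0.2, all three neighbors carry label "a", so the estimated p(a|x) is 1.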

Page 4:
K Nearest Neighbor Approach (kNN)
What is the appropriate size for the neighborhood N(x)?
Leave-one-out approach
Weighted K nearest neighbor:
A neighbor is defined through a weight function

$$w(x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)$$

Estimate p(y|x):

$$p(y\,|\,x) = \frac{\sum_i w(x_i)\,\delta(y, y_i)}{\sum_i w(x_i)}$$

How to estimate the appropriate value for σ²?

Page 7:
Weighted K Nearest Neighbor
Leave-one-out + maximum likelihood
Estimate the leave-one-out probability (note that w(x_i) = 1 for the left-out point x_i):

$$p(y_i\,|\,x_i) = \frac{\sum_{j\neq i} w(x_j)\,\delta(y_j, y_i)}{\sum_{j\neq i} w(x_j)} = \frac{\sum_{j} w(x_j)\,\delta(y_j, y_i) - 1}{\sum_{j} w(x_j) - 1}$$

Leave-one-out likelihood of the training data:

$$l_{\mathrm{LOO}} = \sum_{i=1}^{n}\log p(y_i\,|\,x_i) = \sum_{i=1}^{n}\log\frac{\sum_{j\neq i} w(x_j)\,\delta(y_j, y_i)}{\sum_{j\neq i} w(x_j)}$$

Search for the optimal σ² by maximizing the leave-one-out likelihood.
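The σ² search can be sketched as a grid search over the leave-one-out log-likelihood (the toy data, the candidate grid, and the small floor inside the log are illustrative assumptions):

```python
import math

def loo_log_likelihood(xs, ys, sigma2):
    """Leave-one-out log-likelihood of the Gaussian-weighted kNN estimate."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num = den = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            if j == i:
                continue  # leave example i out
            w = math.exp(-((xi - xj) ** 2) / (2 * sigma2))
            den += w
            num += w * (yj == yi)
        total += math.log(max(num / den, 1e-12))  # guard against log(0)
    return total

# Two well-separated 1-D clusters.
xs = [0.0, 0.3, 0.6, 5.0, 5.3, 5.6]
ys = ["a", "a", "a", "b", "b", "b"]
# Grid search for the sigma^2 that maximizes the LOO likelihood.
best = max([0.01, 0.1, 1.0, 10.0, 100.0], key=lambda s2: loo_log_likelihood(xs, ys, s2))
```

On this data a small σ² wins, since large σ² spreads weight across both clusters and lowers every leave-one-out probability.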

Page 9:
Gaussian Generative Model
p(y|x) ~ p(x|y) p(y): posterior = likelihood × prior
Estimate p(x|y) and p(y)
Allocate a separate set of parameters for each class: θ = {θ_1, θ_2, …, θ_c}, with p(x|y; θ) = p(x; θ_y)
Maximum likelihood estimation:

$$p(x\,|\,y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\exp\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)$$

$$l = \sum_{i=1}^{N}\log\left[p(x_i\,|\,y_i)\,p_{y_i}\right] = \sum_{i=1}^{N}\left[-\frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} - \frac{1}{2}\log 2\pi - \log\sigma_{y_i} + \log p_{y_i}\right]$$
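A minimal sketch of these maximum likelihood estimates and the resulting Bayes classifier for 1-D inputs (the toy data and helper names are illustrative assumptions):

```python
import math

def fit_gaussian_classes(xs, ys):
    """MLE of (mu_y, sigma_y^2) and the prior p_y for each class."""
    params = {}
    for y in set(ys):
        pts = [x for x, label in zip(xs, ys) if label == y]
        mu = sum(pts) / len(pts)
        var = sum((x - mu) ** 2 for x in pts) / len(pts)
        params[y] = (mu, var, len(pts) / len(xs))
    return params

def log_joint(x, mu, var, prior):
    # log p(x|y) + log p(y) for a 1-D Gaussian class model
    return -((x - mu) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var) + math.log(prior)

def predict(x, params):
    # Bayes rule: the class with the largest joint likelihood wins.
    return max(params, key=lambda y: log_joint(x, *params[y]))

xs = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
ys = ["a", "a", "a", "b", "b", "b"]
params = fit_gaussian_classes(xs, ys)
```

Since the normalizer p(x) is shared across classes, comparing log p(x|y) + log p(y) is equivalent to comparing p(y|x).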

Page 11:
Gaussian Generative Model
It is difficult to estimate p(x|y) if x is of high dimensionality.
Naïve Bayes:

$$p(x\,|\,y;\theta) = p(x^1\,|\,y;\theta)\,p(x^2\,|\,y;\theta)\cdots p(x^d\,|\,y;\theta)$$

Essentially a linear model.
How to make a Gaussian generative model discriminative? (μ_y, σ_y) of each class are estimated only from the data belonging to that class, which lacks discriminative power.

Page 12:
Gaussian Generative Model
Maximum likelihood estimation of the conditional likelihood:

$$p(y\,|\,x) = \frac{p(x\,|\,y)\,p(y)}{\sum_{y'=1}^{c} p(x\,|\,y')\,p(y')} = \frac{\frac{p_y}{\sigma_y}\exp\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)}{\sum_{y'=1}^{c}\frac{p_{y'}}{\sigma_{y'}}\exp\left(-\frac{(x-\mu_{y'})^2}{2\sigma_{y'}^2}\right)}$$

$$l = \sum_{i=1}^{N}\log p(y_i\,|\,x_i) = \sum_{i=1}^{N}\left[-\frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} - \log\sigma_{y_i} + \log p_{y_i} - \log\sum_{y'=1}^{c}\frac{p_{y'}}{\sigma_{y'}}\exp\left(-\frac{(x_i-\mu_{y'})^2}{2\sigma_{y'}^2}\right)\right]$$

How to optimize this objective function?

Page 13:
Gaussian Generative Model
Bound optimization algorithm:
θ = {p_1, …, p_c, μ_1, …, μ_c, σ_1, …, σ_c}: parameters of the current iteration
θ' = {p'_1, …, p'_c, μ'_1, …, μ'_c, σ'_1, …, σ'_c}: parameters of the last iteration

$$l(\theta) - l(\theta') = \sum_{i=1}^{N}\left[\frac{(x_i-\mu'_{y_i})^2}{2\sigma_{y_i}'^2} - \frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} + \log\frac{\sigma'_{y_i}\,p_{y_i}}{\sigma_{y_i}\,p'_{y_i}} - \log\frac{\sum_{y'=1}^{c}\frac{p_{y'}}{\sigma_{y'}}\exp\left(-\frac{(x_i-\mu_{y'})^2}{2\sigma_{y'}^2}\right)}{\sum_{y'=1}^{c}\frac{p'_{y'}}{\sigma'_{y'}}\exp\left(-\frac{(x_i-\mu'_{y'})^2}{2\sigma_{y'}'^2}\right)}\right]$$

Page 14:
Gaussian Generative Model
Using log x ≤ x − 1 on the last term:

$$l(\theta) - l(\theta') \geq \sum_{i=1}^{N}\left[\frac{(x_i-\mu'_{y_i})^2}{2\sigma_{y_i}'^2} - \frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} + \log\frac{\sigma'_{y_i}\,p_{y_i}}{\sigma_{y_i}\,p'_{y_i}} + 1 - \frac{\sum_{y'=1}^{c}\frac{p_{y'}}{\sigma_{y'}}\exp\left(-\frac{(x_i-\mu_{y'})^2}{2\sigma_{y'}^2}\right)}{\sum_{y'=1}^{c}\frac{p'_{y'}}{\sigma'_{y'}}\exp\left(-\frac{(x_i-\mu'_{y'})^2}{2\sigma_{y'}'^2}\right)}\right]$$

We have decomposed the interaction of parameters between different classes.
Question: how to handle x with multiple features?

Page 15:
Logistic Regression Model
A linear decision boundary: w·x + b

$$w\cdot x + b \geq 0 \;\Rightarrow\; \text{positive}, \qquad w\cdot x + b < 0 \;\Rightarrow\; \text{negative}$$

A probabilistic model p(y|x):

$$p(y\,|\,x) = \frac{1}{1+\exp\left(-y\,(w\cdot x + b)\right)}, \qquad y \in \{-1, +1\}$$

Maximum likelihood approach for estimating the weights w and the threshold b:

$$l(D_{\mathrm{train}}) = \sum_{i=1}^{N}\log p(y_i\,|\,x_i) = \sum_{i=1}^{N}\log\frac{1}{1+\exp\left(-y_i\,(w\cdot x_i + b)\right)}$$
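The maximum likelihood fit can be sketched with plain gradient ascent on l(D_train); the slides do not prescribe an optimizer, and the learning rate, step count, and toy data below are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Maximize sum_i log p(y_i|x_i) by gradient ascent (1-D inputs, y in {-1,+1})."""
    w = b = 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            # d/dw log sigmoid(y (w x + b)) = y x * sigmoid(-y (w x + b))
            g = y * sigmoid(-y * (w * x + b))
            gw += g * x
            gb += g
        w += lr * gw
        b += lr * gb
    return w, b

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [-1, -1, -1, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```

On separable data like this, the unregularized likelihood keeps pushing |w| upward, which is exactly the overfitting issue the next slide addresses.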

Page 16:
Logistic Regression Model
Overfitting issue. Example: text classification
Words that appear in only one document will be assigned infinitely large weights.
Solution: regularization

$$l(D_{\mathrm{train}}) = \sum_{i=1}^{N}\log p(y_i\,|\,x_i) - s\sum_{j=1}^{m} w_j^2 = \sum_{i=1}^{N}\log\frac{1}{1+\exp\left(-y_i\,(w\cdot x_i + b)\right)} - s\sum_{j=1}^{m} w_j^2$$

The term s·Σ_j w_j² is the regularization term.

Page 17:
Non-linear Logistic Regression Model
Kernelized logistic regression model:

$$x \to \phi(x), \qquad w \to \sum_{i=1}^{N}\alpha_i\,\phi(x_i), \qquad w\cdot\phi(x) = \sum_{i=1}^{N}\alpha_i\,K(x_i, x)$$

$$p(y\,|\,x) = \frac{1}{1+\exp\left(-y\sum_{i=1}^{N}\alpha_i K(x_i, x)\right)}$$

$$l_{\mathrm{reg}} = \sum_{i=1}^{N}\log\frac{1}{1+\exp\left(-y_i\sum_{j=1}^{N}\alpha_j K(x_j, x_i)\right)} - c\sum_{i,j=1}^{N}\alpha_i\alpha_j K(x_i, x_j)$$
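A sketch of this kernelized objective optimized by gradient ascent on the coefficients α (the RBF kernel choice, hyperparameters, and toy data are assumptions for illustration):

```python
import math

def rbf(a, b, gamma=1.0):
    return math.exp(-gamma * (a - b) ** 2)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_kernel_logistic(xs, ys, c=0.01, lr=0.1, steps=500):
    """Gradient ascent on the dual weights alpha of kernelized logistic regression."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(steps):
        f = [sum(alpha[j] * K[j][i] for j in range(n)) for i in range(n)]
        grad = [
            sum(ys[i] * K[k][i] * sigmoid(-ys[i] * f[i]) for i in range(n))
            - 2 * c * sum(alpha[j] * K[k][j] for j in range(n))  # regularization term
            for k in range(n)
        ]
        alpha = [a + lr * g for a, g in zip(alpha, grad)]
    return alpha

def predict(x, xs, alpha):
    return 1 if sum(a * rbf(xi, x) for a, xi in zip(alpha, xs)) >= 0 else -1

# A 1-D pattern (outer points +1, inner points -1) that no single
# linear boundary w*x + b can separate.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, -1, -1, 1]
alpha = fit_kernel_logistic(xs, ys)
```

The RBF kernel lets the model carve out a non-linear boundary that fits this pattern, which a plain linear logistic model cannot.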

Page 18:
Non-linear Logistic Regression Model
Hierarchical Mixture Expert Model: group linear classifiers into a tree structure.

[Figure: a tree with a router r(x) at the root, group classifiers g1(x) and g2(x) in the group layer, and experts m1,1(x), m1,2(x), m2,1(x), m2,2(x) in the expert layer]

$$p(y\,|\,x) = r(+1\,|\,x)\left[g_1(+1\,|\,x)\,m_{1,1}(y\,|\,x) + g_1(-1\,|\,x)\,m_{1,2}(y\,|\,x)\right] + r(-1\,|\,x)\left[g_2(+1\,|\,x)\,m_{2,1}(y\,|\,x) + g_2(-1\,|\,x)\,m_{2,2}(y\,|\,x)\right]$$

Products generate nonlinearity in the prediction function.
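The tree-structured prediction above can be sketched as a forward pass in which every node is a 1-D linear classifier passed through a sigmoid (all parameter values below are arbitrary illustrations):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hme_posterior(x, r, g1, g2, experts):
    """p(y=+1|x) for a two-level HME; r, g1, g2 and each expert are (w, b) pairs."""
    def p(params):  # linear classifier -> probability of the +1 branch/label
        w, b = params
        return sigmoid(w * x + b)

    (m11, m12), (m21, m22) = experts
    # The router picks a group, each group gate picks an expert,
    # and the experts predict y; all choices are mixed probabilistically.
    return p(r) * (p(g1) * p(m11) + (1 - p(g1)) * p(m12)) \
         + (1 - p(r)) * (p(g2) * p(m21) + (1 - p(g2)) * p(m22))

prob = hme_posterior(
    0.5,
    r=(1.0, 0.0), g1=(2.0, 0.0), g2=(-2.0, 0.0),
    experts=(((3.0, 0.0), (-3.0, 0.0)), ((1.0, 0.0), (-1.0, 0.0))),
)
```

Because the gates and experts are multiplied together, the resulting p(y|x) is a non-linear function of x even though every node is linear.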

Page 19:
Non-linear Logistic Regression Model
It can be a rough assumption that all data points fit a single linear model.
But it is usually appropriate to assume a local linear model.
kNN can be viewed as a localized model without any parameters.
Can we extend the kNN approach by introducing a localized linear model?

Page 20:
Localized Logistic Regression Model
Similar to the weighted kNN:
Weight each training example by

$$w(x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)$$

Build a logistic regression model using the weighted examples. The regularized objective

$$l_{\mathrm{reg}} = \sum_{i=1}^{N}\log\frac{1}{1+\exp\left(-y_i\,(w\cdot x_i + b)\right)} - c\,\|w\|^2$$

becomes localized:

$$l_{\mathrm{reg}}(x) = \sum_{i=1}^{N} w(x_i)\log\frac{1}{1+\exp\left(-y_i\,(w\cdot x_i + b)\right)} - c\,\|w\|^2$$

Page 22:
Conditional Exponential Model
An extension of the logistic regression model to the multi-class case.
A different set of weights w_y and threshold b_y for each class y:

$$p(y\,|\,x;\theta) = \frac{1}{Z(x)}\exp\left(b_y + x\cdot w_y\right), \qquad Z(x) = \sum_{y'}\exp\left(b_{y'} + x\cdot w_{y'}\right)$$

Translation invariance: shifting all parameters by a constant leaves p(y|x) unchanged, so we can fix w_1 = 0, b_1 = 0.
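The model and its translation invariance can be illustrated directly (the toy parameter values are assumptions):

```python
import math

def softmax_posterior(x, ws, bs):
    """p(y|x) = exp(b_y + x*w_y) / Z(x) for classes y = 0..c-1 (1-D x)."""
    scores = [b + x * w for w, b in zip(ws, bs)]
    z = sum(math.exp(s) for s in scores)  # the normalizer Z(x)
    return [math.exp(s) / z for s in scores]

p1 = softmax_posterior(2.0, ws=[0.0, 1.0, -1.0], bs=[0.0, 0.5, 0.5])
# Translation invariance: shifting every w_y by +3 and every b_y by -1
# shifts all scores by the same amount, which cancels in Z(x).
p2 = softmax_posterior(2.0, ws=[3.0, 4.0, 2.0], bs=[-1.0, -0.5, -0.5])
```

Because the shift cancels, one class's parameters (w_1, b_1) can always be pinned at zero without losing expressiveness.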

Page 23:
Maximum Entropy Model
Finding the simplest model that matches the data:

$$\max_{p(y|x)}\ H(y\,|\,x) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y\,|\,x_i)\log p(y\,|\,x_i)$$

subject to

$$\frac{1}{N}\sum_{i=1}^{N} p(y\,|\,x_i)\,x_i = \frac{1}{N}\sum_{i=1}^{N} x_i\,\delta(y, y_i)\ \ \text{for every } y, \qquad \sum_{y} p(y\,|\,x_i) = 1$$

Maximizing entropy prefers the uniform distribution.
The constraints enforce the model to be consistent with the observed data.
Iterative scaling methods are used for optimization.
Page 24:
Support Vector Machine
Classification margin
Maximum margin principle: separate the data far away from the decision boundary w·x + b = 0
Two objectives:
Minimize the classification error over the training data
Maximize the classification margin
Support vectors: only the support vectors have an impact on the location of the decision boundary.

[Figure: two classes (+1 and -1) separated by the boundary w·x + b = 0; the points lying on the margin are the support vectors]

Page 26:
Support Vector Machine
Separable case:

$$\{w^*, b^*\} = \arg\min_{w,b}\ \sum_{j=1}^{m} w_j^2 \qquad \text{subject to} \qquad y_i\,(w\cdot x_i + b) \geq 1, \quad i = 1, \dots, N$$

Noisy case:

$$\{w^*, b^*\} = \arg\min_{w,b}\ \sum_{j=1}^{m} w_j^2 + c\sum_{i=1}^{N}\varepsilon_i \qquad \text{subject to} \qquad y_i\,(w\cdot x_i + b) \geq 1 - \varepsilon_i,\ \ \varepsilon_i \geq 0, \quad i = 1, \dots, N$$

Quadratic programming!
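The slides pose both cases as quadratic programs. As a rough illustrative alternative (not the QP solver the slides imply), the noisy case can be sketched by subgradient descent on the equivalent unconstrained hinge-loss form; the hyperparameters and toy data below are assumptions:

```python
def fit_svm_subgradient(xs, ys, c=10.0, lr=0.01, steps=2000):
    """Minimize ||w||^2 + c * sum_i max(0, 1 - y_i (w x_i + b)) by subgradient descent."""
    w = b = 0.0
    for _ in range(steps):
        gw, gb = 2 * w, 0.0  # gradient of the ||w||^2 term
        for x, y in zip(xs, ys):
            if y * (w * x + b) < 1:  # margin violated: hinge subgradient is -y x
                gw -= c * y * x
                gb -= c * y
        w -= lr * gw
        b -= lr * gb
    return w, b

xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [-1, -1, -1, 1, 1, 1]
w, b = fit_svm_subgradient(xs, ys)
```

Eliminating each slack variable as ε_i = max(0, 1 − y_i(w·x_i + b)) is what turns the constrained QP into this unconstrained objective.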

Page 28:
Logistic Regression Model vs. Support Vector Machine
Logistic regression model:

$$\{w^*, b^*\} = \arg\min_{w,b}\ \sum_{i=1}^{N}\log\left[1+\exp\left(-y_i\,(w\cdot x_i + b)\right)\right] + s\sum_{j=1}^{m} w_j^2$$

Support vector machine:

$$\{w^*, b^*\} = \arg\min_{w,b}\ \sum_{j=1}^{m} w_j^2 + c\sum_{i=1}^{N}\varepsilon_i \qquad \text{subject to} \qquad y_i\,(w\cdot x_i + b) \geq 1 - \varepsilon_i,\ \ \varepsilon_i \geq 0$$

Different loss functions for punishing mistakes; the regularization terms are identical.
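The two penalties can be compared numerically as functions of the margin y(w·x + b) (the sample margin values are illustrative):

```python
import math

def logistic_loss(margin):
    # penalty used by (regularized) logistic regression
    return math.log(1 + math.exp(-margin))

def hinge_loss(margin):
    # penalty implied by the SVM slack variables: max(0, 1 - margin)
    return max(0.0, 1.0 - margin)

# Both losses punish small or negative margins; they differ in shape:
# hinge is exactly zero once the margin reaches 1, logistic only decays toward zero.
margins = [-2.0, 0.0, 1.0, 3.0]
pairs = [(logistic_loss(m), hinge_loss(m)) for m in margins]
```

This makes the point of the next slide concrete: past a margin of 1 the SVM stops caring, while logistic regression never quite does.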

Page 29:
Logistic Regression Model vs. Support Vector Machine

[Plot: loss as a function of wx+b (horizontal axis, roughly -3 to 1; vertical axis, 0 to 3.5), comparing the loss function for logistic regression with the loss function for the SVM; the SVM hinge loss reaches zero at y(wx+b) = 1, while the logistic loss decays smoothly]

Logistic regression differs from the support vector machine only in the loss function.

Page 30:
Kernel Tricks
Introducing nonlinearity into the discriminative models.
Diffusion kernel:
A graph Laplacian L for local similarity:

$$L_{i,j} = \begin{cases} s(x_i, x_j) & i \neq j \\ -\sum_{k\neq i} s(x_i, x_k) & i = j \end{cases}$$

Diffusion kernel: propagate local similarity information into a global one:

$$K = e^{\gamma L}, \qquad \text{or equivalently} \qquad \frac{dK}{d\gamma} = LK$$
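A minimal sketch of the diffusion kernel on a tiny similarity graph, using a truncated Taylor series for the matrix exponential (the 3-node chain and the truncation depth are illustrative assumptions; for real use a proper matrix-exponential routine would be preferable):

```python
def graph_laplacian(s):
    """L[i][j] = s_ij off-diagonal, -sum_k s_ik on the diagonal."""
    n = len(s)
    return [[s[i][j] if i != j else -sum(s[i][k] for k in range(n) if k != i)
             for j in range(n)] for i in range(n)]

def diffusion_kernel(L, gamma, terms=30):
    """K = exp(gamma * L) via a truncated Taylor series (small matrices only)."""
    n = len(L)
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in K]
    for t in range(1, terms):
        # term <- term @ (gamma * L) / t, the next Taylor term
        term = [[sum(term[i][k] * gamma * L[k][j] for k in range(n)) / t
                 for j in range(n)] for i in range(n)]
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return K

# Local similarity on a 3-node chain: 0-1 and 1-2 are similar, 0-2 is not.
s = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
K = diffusion_kernel(graph_laplacian(s), gamma=1.0)
```

The key effect: K[0][2] is positive even though s(x_0, x_2) = 0, i.e. the local similarity has been propagated into a global one.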

Page 31:
Fisher Kernel
Derive a kernel function from a generative model.
Key idea: map a point x in the original input space into the model space; the similarity of two data points is measured in the model space.

[Figure: points x1 and x2 in the original input space are mapped to φ(x1) and φ(x2) in the model space, where their similarity is measured]

Page 32:
Kernel Methods in Generative Models
Usually, kernels can be introduced to a generative model through a Gaussian process.
Define a "kernelized" covariance matrix.
It must be positive semi-definite, similar to Mercer's condition.

Page 33:
Multi-class SVM
SVMs can only handle two-class outputs. One-against-all:
Learn N SVMs:
SVM 1 learns "Output==1" vs "Output != 1"
SVM 2 learns "Output==2" vs "Output != 2"
:
SVM N learns "Output==N" vs "Output != N"

Page 34:
Error-Correcting Output Code (ECOC)
Encode each class into a bit vector over the binary classifiers S1..S4:

      S1  S2  S3  S4
  A    1   0   0   1
  B    0   1   0   1
  C    0   1   1   0

  x    1   1   1   0

The output for x is decoded by finding the codeword with the smallest Hamming distance (here class C).
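Decoding the table above by Hamming distance can be sketched as (the helper names are illustrative):

```python
def ecoc_decode(output, codebook):
    """Pick the class whose codeword has the smallest Hamming distance to the output bits."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(codebook, key=lambda cls: hamming(output, codebook[cls]))

# Codewords over the four binary classifiers S1..S4 (the table above).
codebook = {"A": [1, 0, 0, 1], "B": [0, 1, 0, 1], "C": [0, 1, 1, 0]}
label = ecoc_decode([1, 1, 1, 0], codebook)
```

For the output 1 1 1 0, the distances to A, B, and C are 3, 3, and 1, so the decoder returns C; redundant bits let a few classifier errors be corrected.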

Page 35:
Ordinal Regression
A special class of multi-class classification problems: there is a natural ordinal relationship between the multiple classes.
Maximum margin principle: the computation of the margin involves multiple classes.

[Figure: data from the classes 'good', 'OK', and 'bad' separated along the direction w by two parallel decision boundaries]

Page 36:
Ordinal Regression

$$\{w^*, b_1^*, b_2^*\} = \arg\max_{w, b_1, b_2}\ \mathrm{margin}(w, b_1, b_2) = \arg\max_{w, b_1, b_2}\ \min\left(\mathrm{margin}_1(w, b_1),\ \mathrm{margin}_2(w, b_2)\right)$$

$$= \arg\max_{w, b_1, b_2}\ \min\left(\min_{x_i \in D_g \cup D_o} \frac{|x_i\cdot w - b_1|}{\|w\|},\ \min_{x_i \in D_o \cup D_b} \frac{|x_i\cdot w - b_2|}{\|w\|}\right)$$

subject to

$$x_i \in D_g:\ x_i\cdot w - b_1 > 0; \qquad x_i \in D_o:\ x_i\cdot w - b_1 < 0,\ \ x_i\cdot w - b_2 > 0; \qquad x_i \in D_b:\ x_i\cdot w - b_2 < 0$$

This is equivalent to

$$\{w^*, b_1^*, b_2^*\} = \arg\min_{w, b_1, b_2}\ \|w\|^2$$

subject to

$$x_i \in D_g:\ x_i\cdot w - b_1 \geq 1; \qquad x_i \in D_o:\ x_i\cdot w - b_1 \leq -1,\ \ x_i\cdot w - b_2 \geq 1; \qquad x_i \in D_b:\ x_i\cdot w - b_2 \leq -1$$

where D_g, D_o, and D_b are the 'good', 'OK', and 'bad' examples, and b_1 > b_2.

Page 37:

Decision Tree

From slides of Andrew Moore

Page 38:
Decision Tree
A greedy approach for generating a decision tree:
1. Choose the most informative feature, using the mutual information measure.
2. Split the data set according to the values of the selected feature.
3. Recurse until each data item is classified correctly.
Attributes with real values: quantize the real value into a discrete one.
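Step 1, choosing the most informative feature by mutual information, can be sketched as (the toy data are illustrative):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def information_gain(feature_values, labels):
    """Mutual information between a discrete feature and the class label."""
    n = len(labels)
    cond = 0.0  # conditional entropy H(label | feature)
    for v in set(feature_values):
        subset = [y for f, y in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels    = ["+", "+", "-", "-"]
feat_good = ["a", "a", "b", "b"]  # splits the classes perfectly
feat_bad  = ["a", "b", "a", "b"]  # carries no information about the label
```

The greedy step picks the feature with the largest gain; here feat_good yields a full bit of information while feat_bad yields none.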

Page 39:
Decision Tree
The overfitting problem.
Tree pruning:
Reduced-error pruning
Rule post-pruning

Page 41:

Generalized Decision Tree

[Figure: left, a decision tree with a simple data partition, splitting on Attribute 1 and Attribute 2; right, a decision tree using classifiers for the data partition, where each node is a linear classifier]