Linear Models (I)
Rong Jin
Review of Information Theory
• What is information? What is entropy?
• Entropy: average information, minimum coding length
• An important inequality:
H(P) = \sum_i p_i \log \frac{1}{p_i}

H(P) = \sum_i p_i \log \frac{1}{p_i} \le \sum_i p_i \log \frac{1}{q_i}

where p_i is the distribution for generating symbols and q_i is the distribution for coding symbols.
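A minimal numerical check of the inequality above (the distributions here are assumed for illustration, not taken from the slides): coding with the true distribution p achieves the entropy, while coding with any other distribution q costs at least as many bits on average.

import numpy as np

def entropy(p):
    """Average information H(P) = sum_i p_i * log(1/p_i), in bits."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log2(1.0 / p)))

def cross_entropy(p, q):
    """Expected code length when symbols come from p but codes are built for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(1.0 / q)))

p = [0.5, 0.25, 0.25]        # generating distribution
q = [1/3, 1/3, 1/3]          # coding distribution
print(entropy(p))            # 1.5 bits
print(cross_entropy(p, q))   # ~1.585 bits, never less than H(P)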
Review of Information Theory (cont’d)
• Mutual information
  Measures the correlation between two random variables; symmetric
• Kullback-Leibler distance
  Measures the difference between two distributions
I(X;Y) = H(X) - H(X|Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}

KL(P_D, P_M) = \sum_x P_D(x) \log \frac{P_D(x)}{P_M(x)} = E_{x \sim P_D}\left[\log \frac{P_D(x)}{P_M(x)}\right]
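As a sketch (the joint distribution and example vectors below are assumed, not from the slides), both quantities can be computed directly from probability tables:

import numpy as np

def mutual_information(P_xy):
    """I(X;Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x)P(y)) )."""
    P_xy = np.asarray(P_xy, dtype=float)
    P_x = P_xy.sum(axis=1, keepdims=True)   # marginal over y
    P_y = P_xy.sum(axis=0, keepdims=True)   # marginal over x
    mask = P_xy > 0
    return float(np.sum(P_xy[mask] * np.log2(P_xy[mask] / (P_x @ P_y)[mask])))

def kl_divergence(P_D, P_M):
    """KL(P_D, P_M) = sum_x P_D(x) * log2( P_D(x) / P_M(x) )."""
    P_D, P_M = np.asarray(P_D, float), np.asarray(P_M, float)
    mask = P_D > 0
    return float(np.sum(P_D[mask] * np.log2(P_D[mask] / P_M[mask])))

# joint distribution of two correlated binary variables
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(mutual_information(P_xy))               # > 0: X and Y are correlated
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # > 0: the distributions differ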
Outline
• Classification problems
• Information theory for text classification
• Gaussian generative model
• Naïve Bayes
• Logistic regression
Classification Problems
f : X → Y    (input X → output Y)
• Given input X = {x_1, x_2, …, x_m}
• Predict the class label y
• y ∈ {-1, 1}: binary classification problems
• y ∈ {1, 2, 3, …, c}: multi-class classification problems
• Goal: need to learn the function f : X → Y
Examples of Classification Problems
• Text categorization:
  Input features: words ‘campaigning’, ‘efforts’, ‘Iowa’, ‘Democrats’, …
  Class label: ‘politics’ vs. ‘non-politics’
• Image classification:
  Input features: color histogram, texture distribution, edge distribution, …
  Class label: ‘bird image’ vs. ‘non-bird image’
Doc: Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …
Topic: politics
Which is a bird image?
Learning Setup for Classification Problems
• Training examples:
  D_train = {⟨x_1, y_1⟩, ⟨x_2, y_2⟩, …, ⟨x_n, y_n⟩}
• Independent and identically distributed (i.i.d.):
  training examples are similar to testing examples
• Goal: find a model or function that is consistent with the training data
Information Theory for Text Classification
• If the coding distribution is similar to the generating distribution → short coding length → good compression rate

H(P) = \sum_i p_i \log \frac{1}{p_i} \le \sum_i p_i \log \frac{1}{q_i}

where p_i is the distribution for generating symbols and q_i is the distribution for coding symbols.
Compression Algorithm for TC
• Compression model M1: built from ‘Politics’ documents
• Compression model M2: built from ‘Sports’ documents
• Compress a new document with both models: M1 → 16K bits, M2 → 10K bits
• The shorter code (M2) wins → Topic: Sports
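A minimal sketch of this idea (a hypothetical implementation with made-up toy corpora, not the slides' actual system): build a per-topic word distribution from training text and pick the topic whose distribution gives the new document the shortest coding length.

import math
from collections import Counter

def code_length_bits(doc_words, topic_counts, vocab_size, total):
    """Coding length of a document under a topic's (Laplace-smoothed) word distribution."""
    bits = 0.0
    for w in doc_words:
        p = (topic_counts.get(w, 0) + 1) / (total + vocab_size)
        bits += math.log2(1.0 / p)
    return bits

# toy training corpora (assumed example data)
politics_words = "campaign vote iowa democrats election senate".split()
sports_words = "game score team season coach playoff".split()
vocab = set(politics_words) | set(sports_words)

models = {
    "politics": (Counter(politics_words), len(politics_words)),
    "sports": (Counter(sports_words), len(sports_words)),
}

new_doc = "team score game vote".split()
lengths = {t: code_length_bits(new_doc, c, len(vocab), n) for t, (c, n) in models.items()}
print(min(lengths, key=lengths.get))  # topic with the shortest coding length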
Probabilistic Models for Classification Problems
• Apply statistical inference methods
• Key: finding the best parameters θ
• Maximum likelihood (MLE) approach
  Log-likelihood of data:
  l(D_train) = \sum_{i=1}^{n} \log p(y_i | x_i; \theta)
  Find the parameters that maximize the log-likelihood:
  \theta^* = \arg\max_\theta \sum_{i=1}^{n} \log p(y_i | x_i; \theta)
• Pipeline: training examples {⟨x_i, y_i⟩} → learning a statistical model → prediction with p(y|x; θ)
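A small sketch of the MLE recipe (the 1-D model and toy data are assumed for illustration): write the conditional log-likelihood of the labels and search for the parameter value that maximizes it.

import numpy as np

def log_likelihood(theta, xs, ys):
    """l(D) = sum_i log p(y_i | x_i; theta) for a simple 1-D logistic-style model."""
    p1 = 1.0 / (1.0 + np.exp(-theta * xs))   # p(y = 1 | x; theta)
    p = np.where(ys == 1, p1, 1.0 - p1)      # p(y_i | x_i; theta)
    return np.sum(np.log(p))

# toy data: positive x tends to come with y = 1
xs = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
ys = np.array([-1, -1, 1, -1, 1, 1])

# crude grid search for the maximizing theta (gradient methods would be used in practice)
thetas = np.linspace(-5, 5, 1001)
best = max(thetas, key=lambda t: log_likelihood(t, xs, ys))
print(best, log_likelihood(best, xs, ys))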
Generative Models
• Do not estimate p(y|x; θ) directly; use Bayes’ rule:

  p(y|x; \theta) = \frac{p(y; \theta)\, p(x|y; \theta)}{\sum_{y'} p(y'; \theta)\, p(x|y'; \theta)}

• Estimate p(x|y; θ) instead of p(y|x; θ)
• Why p(x|y; θ)? Most well-known distributions are of the form p(x|θ)
• Allocate a separate set of parameters for each class: θ = {θ_1, θ_2, …, θ_c}
  p(x|y; θ) → p(x|θ_y): describes the special input patterns of each class y
Gaussian Generative Model (I)
• Assume a Gaussian model for each class; one-dimensional case:

  p(x|y; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_y} \exp\left(-\frac{(x - \mu_y)^2}{2\sigma_y^2}\right)
  \theta = \{\theta_1, \theta_2, \ldots, \theta_c\}, \quad \theta_k = \{\mu_k, \sigma_k, p(y=k)\}

• Results for MLE:

  \mu_k = \frac{1}{|\{i : y_i = k\}|} \sum_{i: y_i = k} x_i, \quad
  \sigma_k^2 = \frac{1}{|\{i : y_i = k\}|} \sum_{i: y_i = k} (x_i - \mu_k)^2, \quad
  p(y=k) = \frac{|\{i : y_i = k\}|}{n}

• Prediction via Bayes’ rule:

  p(y|x; \theta) = \frac{p(y; \theta)\, p(x|y; \theta)}{\sum_{y'} p(y'; \theta)\, p(x|y'; \theta)}
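A compact sketch of these MLE formulas and the Bayes-rule prediction (helper names and toy data are assumed, not from the slides):

import numpy as np

def fit_gaussian_generative(xs, ys):
    """Per-class MLE: mean, variance, and prior p(y=k) for a 1-D Gaussian model."""
    params = {}
    for k in np.unique(ys):
        xk = xs[ys == k]
        params[k] = (xk.mean(), xk.var(), len(xk) / len(xs))  # mu_k, sigma_k^2, p(y=k)
    return params

def predict_proba(x, params):
    """p(y|x) via Bayes' rule from the class-conditional Gaussians."""
    joint = {k: prior * np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
             for k, (mu, var, prior) in params.items()}
    z = sum(joint.values())
    return {k: v / z for k, v in joint.items()}

# toy 1-D data with two classes
xs = np.array([1.0, 1.2, 0.9, 3.0, 3.2, 2.8])
ys = np.array([0, 0, 0, 1, 1, 1])
params = fit_gaussian_generative(xs, ys)
print(predict_proba(2.0, params))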
Example
• Height histogram for males and females
• Using the Gaussian generative model:
  μ_male = 1.7, σ_male = 0.1, p_male = 0.5
  μ_female = 1.5, σ_female = 0.2, p_female = 0.5
• P(male|1.8) = ?, P(female|1.4) = ?
[Figure: empirical height histograms and fitted Gaussian distributions for males and females]
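Using the parameters from the example slide, the posterior can be computed directly (a worked check, assuming those values):

import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# parameters from the example slide
mu_m, s_m, p_m = 1.7, 0.1, 0.5
mu_f, s_f, p_f = 1.5, 0.2, 0.5

for x in (1.8, 1.4):
    pm = p_m * gaussian(x, mu_m, s_m)
    pf = p_f * gaussian(x, mu_f, s_f)
    print(f"P(male|{x}) = {pm / (pm + pf):.3f}, P(female|{x}) = {pf / (pm + pf):.3f}")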
Gaussian Generative Model (II)
• Consider multiple input features x = (x_1, x_2, …, x_m)
• Multi-variate Gaussian distribution (Σ_y is an m×m covariance matrix):

  p(x|y; \theta) \sim N(\mu_y, \Sigma_y) = \frac{1}{(2\pi)^{m/2} |\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)\right)
  \theta = (\mu_1, \Sigma_1, p(y=1), \ldots, \mu_c, \Sigma_c, p(y=c))

• Results for MLE:

  \mu_y = \frac{1}{|\{i : y_i = y\}|} \sum_{i: y_i = y} x_i, \quad
  [\Sigma_y]_{s,t} = \frac{1}{|\{i : y_i = y\}|} \sum_{i: y_i = y} (x_{i,s} - \mu_{y,s})(x_{i,t} - \mu_{y,t})

• Problem: singularity of Σ_y; too many parameters
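A sketch of the multi-variate MLE (toy data assumed; with few examples per class the estimated covariance can indeed be singular, which is exactly the problem noted above):

import numpy as np

def fit_multivariate_gaussians(X, y):
    """Per-class MLE of mean vector, covariance matrix, and class prior."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        # MLE covariance (divide by the class count, not count - 1)
        Sigma = (Xk - mu).T @ (Xk - mu) / len(Xk)
        params[k] = (mu, Sigma, len(Xk) / len(X))
    return params

# toy data: 2 classes, 3 features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(2, 1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
params = fit_multivariate_gaussians(X, y)
print(params[0][0], np.linalg.det(params[0][1]))  # mean and |Sigma| for class 0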
Overfitting Issue
• Complex model + insufficient training data
• Consider a classification problem with 100 input features, 5 classes, and 1000 training examples
• Total number of parameters for a full Gaussian model:
  5 means → 500 parameters
  5 covariance matrices → 50,000 parameters
  ≈ 50,500 parameters in total, far more than the training data can support
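A quick parameter count reproducing the slide's arithmetic (the slide's 50,000 treats each covariance as a full 100×100 matrix; exploiting symmetry would roughly halve it):

m, c, n = 100, 5, 1000                   # features, classes, training examples

mean_params = c * m                      # 5 * 100 = 500
cov_params_full = c * m * m              # 5 * 100 * 100 = 50,000 (full matrices)
cov_params_sym = c * m * (m + 1) // 2    # ~25,250 if symmetry is exploited

print(mean_params + cov_params_full, "parameters vs.", n, "training examples")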
Another Example of Overfitting
[Figure: a sequence of four plots illustrating overfitting; only the slide titles and axis ranges survive extraction]
Naïve Bayes
• Simplify the model complexity: diagonalize the covariance matrix Σ_y

  \Sigma_y = \mathrm{diag}(\sigma_{y,1}^2, \sigma_{y,2}^2, \ldots, \sigma_{y,m}^2)

• Simplified Gaussian distribution:

  p(x|y; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma_{y,i}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right) = \prod_{i=1}^{m} p(x_i|y; \theta)
  \theta = \{\theta_1, \ldots, \theta_c\}, \quad \theta_k = \{\mu_{k,1}, \ldots, \mu_{k,m}, \sigma_{k,1}, \ldots, \sigma_{k,m}, p(y=k)\}

• Feature independence assumption: the Naïve Bayes assumption
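A minimal Gaussian Naïve Bayes sketch under this diagonal-covariance assumption (toy data assumed; per-feature means and variances are estimated independently):

import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class, per-feature MLE: means, variances, and class prior."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), Xk.var(axis=0), len(Xk) / len(X))
    return params

def log_joint(x, params):
    """log p(y=k) + sum_i log p(x_i | y=k) for every class k."""
    out = {}
    for k, (mu, var, prior) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        out[k] = np.log(prior) + log_lik
    return out

# toy data: 2 classes, 4 features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(1.5, 1, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
params = fit_gaussian_nb(X, y)
scores = log_joint(X[0], params)
print(max(scores, key=scores.get))   # predicted class for the first example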
Naïve Bayes
• A terrible estimator for p(x|y; θ), but a very reasonable estimator for p(y|x; θ)
• Why? The ratio of likelihoods is what matters:

  p(y|x; \theta) = \frac{p(y; \theta)\, p(x|y; \theta)}{\sum_{y'=1}^{c} p(y'; \theta)\, p(x|y'; \theta)}
                 = \frac{1}{1 + \sum_{y' \ne y} \frac{p(y'; \theta)\, p(x|y'; \theta)}{p(y; \theta)\, p(x|y; \theta)}}

• Naïve Bayes does a reasonable job on the estimation of the ratio p(x|y; θ) / p(x|y'; θ)
The Ratio of Likelihood
• Binary class; both classes share the same variance: Σ = diag(σ_1^2, …, σ_m^2)

  \log\frac{p(y=1|x)}{p(y=-1|x)} = \log\frac{p(x|y=1)\, p(y=1)}{p(x|y=-1)\, p(y=-1)}
  = \sum_{i=1}^{m}\frac{(x_i - \mu_{-1,i})^2 - (x_i - \mu_{1,i})^2}{2\sigma_i^2} + \log\frac{p(y=1)}{p(y=-1)}
  = \sum_{i=1}^{m}\frac{\mu_{1,i} - \mu_{-1,i}}{\sigma_i^2}\, x_i + \sum_{i=1}^{m}\frac{\mu_{-1,i}^2 - \mu_{1,i}^2}{2\sigma_i^2} + \log\frac{p(y=1)}{p(y=-1)}

• A linear model in x = (x_1, …, x_m)!
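As a sketch of this reduction (the parameter values below are assumed): the Gaussian parameters collapse into a weight vector w and bias b, so the classifier is just sign(w·x + b).

import numpy as np

# assumed per-feature Gaussian parameters for the two classes (shared variances)
mu_pos = np.array([1.5, 0.8])    # means for y = +1
mu_neg = np.array([0.5, 0.2])    # means for y = -1
sigma2 = np.array([0.25, 0.16])  # shared per-feature variances
p_pos, p_neg = 0.5, 0.5          # class priors

# coefficients of the induced linear model
w = (mu_pos - mu_neg) / sigma2
b = np.sum((mu_neg ** 2 - mu_pos ** 2) / (2 * sigma2)) + np.log(p_pos / p_neg)

x = np.array([1.2, 0.5])
score = w @ x + b                # log p(y=1|x) - log p(y=-1|x)
print("predict +1" if score > 0 else "predict -1", score)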
Decision Boundary
[Figure: empirical height histograms and fitted Gaussian distributions for males and females]
• Gaussian Generative Models == Finding a linear decision boundary
• Why not do it directly?