Linear Models (I)
Rong Jin
Review of Information Theory
• What is information? What is entropy?
• Entropy: average information, minimum coding length
• An important inequality:
H(P) = \sum_i p_i \log \frac{1}{p_i}

H(P) = \sum_i p_i \log \frac{1}{p_i} \le \sum_i p_i \log \frac{1}{q_i}

where p_i is the distribution for generating symbols and q_i is the distribution for coding symbols.
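A minimal numerical check of the inequality above (the distributions here are assumed for illustration, not taken from the slides): coding with the true distribution p achieves the entropy, while coding with any other distribution q costs at least as many bits on average.

import numpy as np

def entropy(p):
    """Average information H(P) = sum_i p_i * log(1/p_i), in bits."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log2(1.0 / p)))

def cross_entropy(p, q):
    """Expected code length when symbols come from p but codes are built for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(1.0 / q)))

p = [0.5, 0.25, 0.25]        # generating distribution
q = [1/3, 1/3, 1/3]          # coding distribution
print(entropy(p))            # 1.5 bits
print(cross_entropy(p, q))   # ~1.585 bits, never less than H(P)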
Review of Information Theory (cont’d)
• Mutual information
  Measures the correlation between two random variables; symmetric
• Kullback-Leibler distance
  Measures the difference between two distributions
I(X;Y) = H(X) - H(X|Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}

KL(P_D, P_M) = \sum_x P_D(x) \log \frac{P_D(x)}{P_M(x)} = E_{x \sim P_D}\left[\log \frac{P_D(x)}{P_M(x)}\right]
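As a sketch (the joint distribution and example vectors below are assumed, not from the slides), both quantities can be computed directly from probability tables:

import numpy as np

def mutual_information(P_xy):
    """I(X;Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x)P(y)) )."""
    P_xy = np.asarray(P_xy, dtype=float)
    P_x = P_xy.sum(axis=1, keepdims=True)   # marginal over y
    P_y = P_xy.sum(axis=0, keepdims=True)   # marginal over x
    mask = P_xy > 0
    return float(np.sum(P_xy[mask] * np.log2(P_xy[mask] / (P_x @ P_y)[mask])))

def kl_divergence(P_D, P_M):
    """KL(P_D, P_M) = sum_x P_D(x) * log2( P_D(x) / P_M(x) )."""
    P_D, P_M = np.asarray(P_D, float), np.asarray(P_M, float)
    mask = P_D > 0
    return float(np.sum(P_D[mask] * np.log2(P_D[mask] / P_M[mask])))

# joint distribution of two correlated binary variables
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(mutual_information(P_xy))               # > 0: X and Y are correlated
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # > 0: the distributions differ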
Outline
• Classification problems
• Information theory for text classification
• Gaussian generative model
• Naïve Bayes
• Logistic regression
Classification Problems
f : X → Y    (input X → output Y)
• Given input X = {x_1, x_2, …, x_m}
• Predict the class label y
• y ∈ {-1, 1}: binary classification problems
• y ∈ {1, 2, 3, …, c}: multi-class classification problems
• Goal: need to learn the function f : X → Y
Examples of Classification Problems
• Text categorization:
  Input features: words ‘campaigning’, ‘efforts’, ‘Iowa’, ‘Democrats’, …
  Class label: ‘politics’ vs. ‘non-politics’
• Image classification:
  Input features: color histogram, texture distribution, edge distribution, …
  Class label: ‘bird image’ vs. ‘non-bird image’
Doc: Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …
Topic: politics
Which is a bird image?
Learning Setup for Classification Problems
• Training examples:
  D_train = {⟨x_1, y_1⟩, ⟨x_2, y_2⟩, …, ⟨x_n, y_n⟩}
• Independent and identically distributed (i.i.d.):
  training examples are similar to testing examples
• Goal: find a model or function that is consistent with the training data
Information Theory for Text Classification
• If the coding distribution is similar to the generating distribution → short coding length → good compression rate

H(P) = \sum_i p_i \log \frac{1}{p_i} \le \sum_i p_i \log \frac{1}{q_i}

where p_i is the distribution for generating symbols and q_i is the distribution for coding symbols.
Compression Algorithm for TC
• Compression model M1: built from ‘Politics’ documents
• Compression model M2: built from ‘Sports’ documents
• Compress a new document with both models: M1 → 16K bits, M2 → 10K bits
• The shorter code (M2) wins → Topic: Sports
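A minimal sketch of this idea (a hypothetical implementation with made-up toy corpora, not the slides' actual system): build a per-topic word distribution from training text and pick the topic whose distribution gives the new document the shortest coding length.

import math
from collections import Counter

def code_length_bits(doc_words, topic_counts, vocab_size, total):
    """Coding length of a document under a topic's (Laplace-smoothed) word distribution."""
    bits = 0.0
    for w in doc_words:
        p = (topic_counts.get(w, 0) + 1) / (total + vocab_size)
        bits += math.log2(1.0 / p)
    return bits

# toy training corpora (assumed example data)
politics_words = "campaign vote iowa democrats election senate".split()
sports_words = "game score team season coach playoff".split()
vocab = set(politics_words) | set(sports_words)

models = {
    "politics": (Counter(politics_words), len(politics_words)),
    "sports": (Counter(sports_words), len(sports_words)),
}

new_doc = "team score game vote".split()
lengths = {t: code_length_bits(new_doc, c, len(vocab), n) for t, (c, n) in models.items()}
print(min(lengths, key=lengths.get))  # topic with the shortest coding length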
Probabilistic Models for Classification Problems
• Apply statistical inference methods
• Key: finding the best parameters θ
• Maximum likelihood (MLE) approach
  Log-likelihood of data:
  l(D_train) = \sum_{i=1}^{n} \log p(y_i | x_i; \theta)
  Find the parameters that maximize the log-likelihood:
  \theta^* = \arg\max_\theta \sum_{i=1}^{n} \log p(y_i | x_i; \theta)
• Pipeline: training examples {⟨x_i, y_i⟩} → learning a statistical model → prediction with p(y|x; θ)
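A small sketch of the MLE recipe (the 1-D model and toy data are assumed for illustration): write the conditional log-likelihood of the labels and search for the parameter value that maximizes it.

import numpy as np

def log_likelihood(theta, xs, ys):
    """l(D) = sum_i log p(y_i | x_i; theta) for a simple 1-D logistic-style model."""
    p1 = 1.0 / (1.0 + np.exp(-theta * xs))   # p(y = 1 | x; theta)
    p = np.where(ys == 1, p1, 1.0 - p1)      # p(y_i | x_i; theta)
    return np.sum(np.log(p))

# toy data: positive x tends to come with y = 1
xs = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
ys = np.array([-1, -1, 1, -1, 1, 1])

# crude grid search for the maximizing theta (gradient methods would be used in practice)
thetas = np.linspace(-5, 5, 1001)
best = max(thetas, key=lambda t: log_likelihood(t, xs, ys))
print(best, log_likelihood(best, xs, ys))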
Generative Models
• Do not estimate p(y|x; θ) directly; use Bayes’ rule:

  p(y|x; \theta) = \frac{p(y; \theta)\, p(x|y; \theta)}{\sum_{y'} p(y'; \theta)\, p(x|y'; \theta)}

• Estimate p(x|y; θ) instead of p(y|x; θ)
• Why p(x|y; θ)? Most well-known distributions are of the form p(x|θ)
• Allocate a separate set of parameters for each class: θ = {θ_1, θ_2, …, θ_c}
  p(x|y; θ) → p(x|θ_y): describes the special input patterns of each class y
Gaussian Generative Model (I)
• Assume a Gaussian model for each class; one-dimensional case:

  p(x|y; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_y} \exp\left(-\frac{(x - \mu_y)^2}{2\sigma_y^2}\right)
  \theta = \{\theta_1, \theta_2, \ldots, \theta_c\}, \quad \theta_k = \{\mu_k, \sigma_k, p(y=k)\}

• Results for MLE:

  \mu_k = \frac{1}{|\{i : y_i = k\}|} \sum_{i: y_i = k} x_i, \quad
  \sigma_k^2 = \frac{1}{|\{i : y_i = k\}|} \sum_{i: y_i = k} (x_i - \mu_k)^2, \quad
  p(y=k) = \frac{|\{i : y_i = k\}|}{n}

• Prediction via Bayes’ rule:

  p(y|x; \theta) = \frac{p(y; \theta)\, p(x|y; \theta)}{\sum_{y'} p(y'; \theta)\, p(x|y'; \theta)}
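A compact sketch of these MLE formulas and the Bayes-rule prediction (helper names and toy data are assumed, not from the slides):

import numpy as np

def fit_gaussian_generative(xs, ys):
    """Per-class MLE: mean, variance, and prior p(y=k) for a 1-D Gaussian model."""
    params = {}
    for k in np.unique(ys):
        xk = xs[ys == k]
        params[k] = (xk.mean(), xk.var(), len(xk) / len(xs))  # mu_k, sigma_k^2, p(y=k)
    return params

def predict_proba(x, params):
    """p(y|x) via Bayes' rule from the class-conditional Gaussians."""
    joint = {k: prior * np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
             for k, (mu, var, prior) in params.items()}
    z = sum(joint.values())
    return {k: v / z for k, v in joint.items()}

# toy 1-D data with two classes
xs = np.array([1.0, 1.2, 0.9, 3.0, 3.2, 2.8])
ys = np.array([0, 0, 0, 1, 1, 1])
params = fit_gaussian_generative(xs, ys)
print(predict_proba(2.0, params))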
Example
• Height histogram for males and females
• Using the Gaussian generative model:
  μ_male = 1.7, σ_male = 0.1, p_male = 0.5
  μ_female = 1.5, σ_female = 0.2, p_female = 0.5
• P(male|1.8) = ?, P(female|1.4) = ?
[Figure: empirical height histograms and fitted Gaussian distributions for males and females]
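Using the parameters from the example slide, the posterior can be computed directly (a worked check, assuming those values):

import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# parameters from the example slide
mu_m, s_m, p_m = 1.7, 0.1, 0.5
mu_f, s_f, p_f = 1.5, 0.2, 0.5

for x in (1.8, 1.4):
    pm = p_m * gaussian(x, mu_m, s_m)
    pf = p_f * gaussian(x, mu_f, s_f)
    print(f"P(male|{x}) = {pm / (pm + pf):.3f}, P(female|{x}) = {pf / (pm + pf):.3f}")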
Gaussian Generative Model (II)
• Consider multiple input features x = (x_1, x_2, …, x_m)
• Multi-variate Gaussian distribution (Σ_y is an m×m covariance matrix):

  p(x|y; \theta) \sim N(\mu_y, \Sigma_y) = \frac{1}{(2\pi)^{m/2} |\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)\right)
  \theta = (\mu_1, \Sigma_1, p(y=1), \ldots, \mu_c, \Sigma_c, p(y=c))

• Results for MLE:

  \mu_y = \frac{1}{|\{i : y_i = y\}|} \sum_{i: y_i = y} x_i, \quad
  [\Sigma_y]_{s,t} = \frac{1}{|\{i : y_i = y\}|} \sum_{i: y_i = y} (x_{i,s} - \mu_{y,s})(x_{i,t} - \mu_{y,t})

• Problem: singularity of Σ_y; too many parameters
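A sketch of the multi-variate MLE (toy data assumed; with few examples per class the estimated covariance can indeed be singular, which is exactly the problem noted above):

import numpy as np

def fit_multivariate_gaussians(X, y):
    """Per-class MLE of mean vector, covariance matrix, and class prior."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        # MLE covariance (divide by the class count, not count - 1)
        Sigma = (Xk - mu).T @ (Xk - mu) / len(Xk)
        params[k] = (mu, Sigma, len(Xk) / len(X))
    return params

# toy data: 2 classes, 3 features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(2, 1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
params = fit_multivariate_gaussians(X, y)
print(params[0][0], np.linalg.det(params[0][1]))  # mean and |Sigma| for class 0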
Overfitting Issue
• Complex model + insufficient training data
• Consider a classification problem with 100 input features, 5 classes, and 1000 training examples
• Total number of parameters for a full Gaussian model:
  5 means → 500 parameters
  5 covariance matrices → 50,000 parameters
  ≈ 50,500 parameters in total, far more than the training data can support
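A quick parameter count reproducing the slide's arithmetic (the slide's 50,000 treats each covariance as a full 100×100 matrix; exploiting symmetry would roughly halve it):

m, c, n = 100, 5, 1000                   # features, classes, training examples

mean_params = c * m                      # 5 * 100 = 500
cov_params_full = c * m * m              # 5 * 100 * 100 = 50,000 (full matrices)
cov_params_sym = c * m * (m + 1) // 2    # ~25,250 if symmetry is exploited

print(mean_params + cov_params_full, "parameters vs.", n, "training examples")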
Another Example of Overfitting
[Figure: a sequence of four plots illustrating overfitting; only the slide titles and axis ranges survive extraction]
Naïve Bayes
• Simplify the model complexity: diagonalize the covariance matrix Σ_y

  \Sigma_y = \mathrm{diag}(\sigma_{y,1}^2, \sigma_{y,2}^2, \ldots, \sigma_{y,m}^2)

• Simplified Gaussian distribution:

  p(x|y; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma_{y,i}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right) = \prod_{i=1}^{m} p(x_i|y; \theta)
  \theta = \{\theta_1, \ldots, \theta_c\}, \quad \theta_k = \{\mu_{k,1}, \ldots, \mu_{k,m}, \sigma_{k,1}, \ldots, \sigma_{k,m}, p(y=k)\}

• Feature independence assumption: the Naïve Bayes assumption
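A minimal Gaussian Naïve Bayes sketch under this diagonal-covariance assumption (toy data assumed; per-feature means and variances are estimated independently):

import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class, per-feature MLE: means, variances, and class prior."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), Xk.var(axis=0), len(Xk) / len(X))
    return params

def log_joint(x, params):
    """log p(y=k) + sum_i log p(x_i | y=k) for every class k."""
    out = {}
    for k, (mu, var, prior) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        out[k] = np.log(prior) + log_lik
    return out

# toy data: 2 classes, 4 features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(1.5, 1, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
params = fit_gaussian_nb(X, y)
scores = log_joint(X[0], params)
print(max(scores, key=scores.get))   # predicted class for the first example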
Naïve Bayes
• A terrible estimator for p(x|y; θ), but a very reasonable estimator for p(y|x; θ)
• Why? The ratio of likelihoods is what matters:

  p(y|x; \theta) = \frac{p(y; \theta)\, p(x|y; \theta)}{\sum_{y'=1}^{c} p(y'; \theta)\, p(x|y'; \theta)}
                 = \frac{1}{1 + \sum_{y' \ne y} \frac{p(y'; \theta)\, p(x|y'; \theta)}{p(y; \theta)\, p(x|y; \theta)}}

• Naïve Bayes does a reasonable job on the estimation of the ratio p(x|y; θ) / p(x|y'; θ)
The Ratio of Likelihood
• Binary class; both classes share the same variance: Σ = diag(σ_1^2, …, σ_m^2)

  \log\frac{p(y=1|x)}{p(y=-1|x)} = \log\frac{p(x|y=1)\, p(y=1)}{p(x|y=-1)\, p(y=-1)}
  = \sum_{i=1}^{m}\frac{(x_i - \mu_{-1,i})^2 - (x_i - \mu_{1,i})^2}{2\sigma_i^2} + \log\frac{p(y=1)}{p(y=-1)}
  = \sum_{i=1}^{m}\frac{\mu_{1,i} - \mu_{-1,i}}{\sigma_i^2}\, x_i + \sum_{i=1}^{m}\frac{\mu_{-1,i}^2 - \mu_{1,i}^2}{2\sigma_i^2} + \log\frac{p(y=1)}{p(y=-1)}

• A linear model in x = (x_1, …, x_m)!
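As a sketch of this reduction (the parameter values below are assumed): the Gaussian parameters collapse into a weight vector w and bias b, so the classifier is just sign(w·x + b).

import numpy as np

# assumed per-feature Gaussian parameters for the two classes (shared variances)
mu_pos = np.array([1.5, 0.8])    # means for y = +1
mu_neg = np.array([0.5, 0.2])    # means for y = -1
sigma2 = np.array([0.25, 0.16])  # shared per-feature variances
p_pos, p_neg = 0.5, 0.5          # class priors

# coefficients of the induced linear model
w = (mu_pos - mu_neg) / sigma2
b = np.sum((mu_neg ** 2 - mu_pos ** 2) / (2 * sigma2)) + np.log(p_pos / p_neg)

x = np.array([1.2, 0.5])
score = w @ x + b                # log p(y=1|x) - log p(y=-1|x)
print("predict +1" if score > 0 else "predict -1", score)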
Decision Boundary
[Figure: empirical height histograms and fitted Gaussian distributions for males and females]
• Gaussian Generative Models == Finding a linear decision boundary
• Why not do it directly?