
© Eric Xing @ CMU, 2006-2008 1

Machine Learning

10-701/15-781, Fall 2008

Logistic Regression -- generative versus discriminative classifier

Eric Xing

Lecture 5, September 22, 2008

Reading: Chap. 4.3 CB

© Eric Xing @ CMU, 2006-2008 2

Announcements

HW 1 is due at the end of class today

My office hour is 11:50-12:30 today (catching a flight in the afternoon)

Steve Hanneke will give the lecture on Wed


© Eric Xing @ CMU, 2006-2008 3

Generative vs. Discriminative Classifiers

Goal: Wish to learn f: X → Y, e.g., P(Y|X)

Generative classifiers (e.g., Naïve Bayes):
- Assume some functional form for P(X|Y), P(Y)
- This is a 'generative' model of the data!
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X = x)

Discriminative classifiers:
- Directly assume some functional form for P(Y|X)
- This is a 'discriminative' model of the data!
- Estimate parameters of P(Y|X) directly from training data

[Graphical models: generative, $Y_i \rightarrow X_i$; discriminative, $X_i \rightarrow Y_i$]

© Eric Xing @ CMU, 2006-2008 4

Discussion: Generative and discriminative classifiers

Generative: modeling the joint distribution of all data

Discriminative: modeling only points at the boundary

How? Regression!


© Eric Xing @ CMU, 2006-2008 5

Linear regression

The data: $\{ (x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N) \}$

Both nodes are observed:
- X is an input vector
- Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator)

A regression scheme can be used to model p(y|x) directly, rather than p(x, y)

[Graphical model: $X_i \rightarrow Y_i$, both observed; plate over $N$]

© Eric Xing @ CMU, 2006-2008 6

Classification and logistic regression


© Eric Xing @ CMU, 2006-2008 7

The logistic function

© Eric Xing @ CMU, 2006-2008 8

Logistic regression (sigmoid classifier)

The conditional distribution: a Bernoulli

where µ is a logistic function

We can use the brute-force gradient method as in LR

But we can also apply generic results by observing that p(y|x) is an exponential family distribution; more specifically, a generalized linear model (see future lectures …)

$p(y \mid x) = \mu(x)^y \, \big(1 - \mu(x)\big)^{1-y}$

$\mu(x) = \dfrac{1}{1 + e^{-\theta^T x}}$
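To make the Bernoulli model above concrete, here is a minimal sketch in Python/NumPy; the parameter vector `theta` and data point `x` are made-up values for illustration (the last entry of `x` acts as a bias feature).

```python
import numpy as np

def mu(theta, x):
    """Logistic function mu(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

def p_y_given_x(y, theta, x):
    """Bernoulli conditional p(y|x) = mu(x)^y * (1 - mu(x))^(1 - y)."""
    m = mu(theta, x)
    return m**y * (1.0 - m)**(1 - y)

# Made-up example: 2 features plus a constant bias feature.
theta = np.array([0.5, -1.2, 0.3])
x = np.array([2.0, 1.0, 1.0])
print(mu(theta, x))              # P(y = 1 | x)
print(p_y_given_x(1, theta, x))  # same value
print(p_y_given_x(0, theta, x))  # 1 - P(y = 1 | x)
```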


© Eric Xing @ CMU, 2006-2008 9

Training Logistic Regression: MCLE

Estimate parameters θ=<θ0, θ1, ... θm> to maximize the conditional likelihood of training data

Training data: $\{(x_n, y_n)\}_{n=1}^N$

Data likelihood $= \prod_n P(x_n, y_n \mid \theta)$

Data conditional likelihood $= \prod_n P(y_n \mid x_n, \theta)$

© Eric Xing @ CMU, 2006-2008 10

Expressing Conditional Log Likelihood

Recall the logistic function: $\mu(x) = \dfrac{1}{1 + e^{-\theta^T x}}$

and the conditional likelihood: $P(y_n \mid x_n, \theta) = \mu(x_n)^{y_n}\,\big(1 - \mu(x_n)\big)^{1 - y_n}$, so that

$l(\theta) = \sum_n \big[\, y_n \log \mu(x_n) + (1 - y_n) \log\big(1 - \mu(x_n)\big) \,\big] = \sum_n \big[\, y_n\, \theta^T x_n - \log\big(1 + e^{\theta^T x_n}\big) \,\big]$
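As a sketch of the expression above, assuming a design matrix `X` whose rows are the $x_n$ (with a constant bias column absorbed into `X`) and labels `y` in {0, 1}:

```python
import numpy as np

def cond_log_likelihood(theta, X, y):
    """l(theta) = sum_n [ y_n * theta^T x_n - log(1 + exp(theta^T x_n)) ]."""
    z = X @ theta
    # log(1 + exp(z)) computed stably as logaddexp(0, z)
    return np.sum(y * z - np.logaddexp(0.0, z))

# Tiny made-up dataset: 4 examples, 2 features + bias column.
X = np.array([[1.0, 2.0, 1.0],
              [0.5, -1.0, 1.0],
              [-1.5, 0.3, 1.0],
              [2.0, 1.0, 1.0]])
y = np.array([1, 0, 0, 1])
print(cond_log_likelihood(np.zeros(3), X, y))  # = 4 * log(0.5)
```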


© Eric Xing @ CMU, 2006-2008 11

Maximizing Conditional Log Likelihood

The objective: $l(\theta) = \sum_n \big[\, y_n\, \theta^T x_n - \log\big(1 + e^{\theta^T x_n}\big) \,\big]$

Good news: l(θ) is a concave function of θ

Bad news: no closed-form solution to maximize l(θ)

© Eric Xing @ CMU, 2006-2008 12

Gradient Ascent

Property of the sigmoid function: $\dfrac{d\mu}{dz} = \mu(1 - \mu)$, where $\mu(z) = \dfrac{1}{1 + e^{-z}}$

The gradient: $\dfrac{\partial l(\theta)}{\partial \theta_i} = \sum_n x_n^i \big( y_n - \mu(x_n) \big)$

The gradient ascent algorithm: iterate until the change is < ε. For all i,

$\theta_i \leftarrow \theta_i + \eta \sum_n x_n^i \big( y_n - \mu(x_n) \big)$

repeat
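A minimal sketch of the gradient ascent loop above; the learning rate `eta`, tolerance `eps`, and the tiny dataset are made-up values, and `X` is assumed to carry a bias column as before.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gradient_ascent(X, y, eta=0.1, eps=1e-6, max_iter=10000):
    """Maximize l(theta) by ascending the gradient sum_n x_n (y_n - mu(x_n))."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (y - sigmoid(X @ theta))          # gradient of l(theta)
        theta_new = theta + eta * grad
        if np.max(np.abs(theta_new - theta)) < eps:    # iterate until change < eps
            return theta_new
        theta = theta_new
    return theta

# Made-up usage: 1 feature + bias column.
X = np.array([[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]])
y = np.array([1, 0, 0, 1])
theta_hat = fit_logistic_gradient_ascent(X, y)
```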


© Eric Xing @ CMU, 2006-2008 13

Overfitting …

© Eric Xing @ CMU, 2006-2008 14

How about MAP?

It is very common to use regularized maximum likelihood.

One common approach is to define priors on θ:
- Normal distribution, zero mean, identity covariance
- Helps avoid very large weights and overfitting

MAP estimate: $\theta_{MAP} = \arg\max_\theta \log \Big[ p(\theta) \prod_n P(y_n \mid x_n, \theta) \Big]$; with a $N(0, \sigma^2 I)$ prior this adds a $-\frac{1}{\sigma^2}\,\theta$ term to the gradient.
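A sketch of one regularized gradient step under the zero-mean Gaussian prior mentioned above; the penalty strength `lam` (playing the role of $1/\sigma^2$) and the choice to penalize the bias feature are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_gradient_step(theta, X, y, eta=0.1, lam=0.1):
    """One ascent step on l(theta) - (lam/2) * ||theta||^2 (MAP with Gaussian prior)."""
    grad = X.T @ (y - sigmoid(X @ theta)) - lam * theta
    return theta + eta * grad
```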


© Eric Xing @ CMU, 2006-2008 15

MLE vs. MAP

Maximum conditional likelihood estimate: $\theta_{MCLE} = \arg\max_\theta \prod_n P(y_n \mid x_n, \theta)$

Maximum a posteriori estimate: $\theta_{MAP} = \arg\max_\theta\; p(\theta) \prod_n P(y_n \mid x_n, \theta)$

© Eric Xing @ CMU, 2006-2008 16

Newton's method

Finding a zero of a function $f(\theta)$: iterate $\theta^{(t+1)} = \theta^{(t)} - \dfrac{f(\theta^{(t)})}{f'(\theta^{(t)})}$


© Eric Xing @ CMU, 2006-2008 17

Newton's method (cont'd)

To maximize the conditional likelihood l(θ):

since l is concave, we need to find θ* where l′(θ*) = 0!

So we can perform the following iteration: $\theta^{(t+1)} = \theta^{(t)} - \dfrac{l'(\theta^{(t)})}{l''(\theta^{(t)})}$
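A sketch of the 1-D iteration above for a generic concave objective, given its first and second derivatives; the example function and starting point are made up.

```python
def newton_maximize(dl, d2l, theta0, eps=1e-10, max_iter=100):
    """Find theta* with dl(theta*) = 0 via theta <- theta - dl(theta) / d2l(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = dl(theta) / d2l(theta)
        theta -= step
        if abs(step) < eps:
            break
    return theta

# Made-up example: maximize l(theta) = -(theta - 3)^2, so dl = -2(theta - 3), d2l = -2.
print(newton_maximize(lambda t: -2 * (t - 3), lambda t: -2.0, theta0=0.0))  # -> 3.0
```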

© Eric Xing @ CMU, 2006-2008 18

The Newton-Raphson method

In LR, θ is vector-valued, so we need the following generalization:

$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\, \nabla_\theta\, l(\theta^{(t)})$

∇ is the gradient operator over the function

H is known as the Hessian of the function


© Eric Xing @ CMU, 2006-2008 19

The Newton-Raphson method

In LR, θ is vector-valued, so we need the following generalization:

$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\, \nabla_\theta\, l(\theta^{(t)})$

∇ is the gradient operator over the function

H is known as the Hessian of the function

© Eric Xing @ CMU, 2006-2008 20

Iterative reweighed least squares (IRLS)

Recall that the least squares estimate in linear regression is

$\theta = (X^T X)^{-1} X^T \vec{y}$

which can also be derived from Newton-Raphson.

Now for logistic regression:


© Eric Xing @ CMU, 2006-2008 21

IRLS

Recall that the least squares estimate in linear regression is $\theta = (X^T X)^{-1} X^T \vec{y}$, which can also be derived from Newton-Raphson.

Now for logistic regression, the Newton-Raphson update can be written as an iteratively reweighted least squares problem:

$\theta^{(t+1)} = (X^T W X)^{-1} X^T W \vec{z}$, where $W = \mathrm{diag}\big(\mu_n(1 - \mu_n)\big)$ and $\vec{z} = X\theta^{(t)} + W^{-1}(\vec{y} - \vec{\mu})$
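A sketch of the IRLS update above, assuming `X` carries a bias column; the small ridge term added before the linear solve is purely a numerical-stability choice of this sketch, not part of the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_irls(X, y, n_iter=20, ridge=1e-8):
    """theta <- (X^T W X)^{-1} X^T W z, with W = diag(mu(1-mu)), z = X theta + W^{-1}(y - mu)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        mu = sigmoid(X @ theta)
        w = mu * (1.0 - mu)                               # diagonal of W
        z = X @ theta + (y - mu) / np.maximum(w, 1e-12)   # working response
        XtW = X.T * w                                     # X^T W without building diag(W)
        theta = np.linalg.solve(XtW @ X + ridge * np.eye(d), XtW @ z)
    return theta
```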

© Eric Xing @ CMU, 2006-2008 22

Logistic regression: practical issues

NR (IRLS) takes O(N + d³) per iteration, where N = number of training cases and d = dimension of input x, but converges in fewer iterations

Quasi-Newton methods, which approximate the Hessian, work faster.

Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.

Stochastic gradient descent can also be used if N is large (cf. the perceptron rule): update $\theta \leftarrow \theta + \eta\, x_n \big( y_n - \mu(x_n) \big)$ one example at a time
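A sketch of the stochastic, one-example-at-a-time update above, which mirrors the perceptron rule but uses the sigmoid probability rather than a hard threshold; the learning rate, epoch count, and random seed are made-up choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_sgd(X, y, eta=0.05, n_epochs=50, seed=0):
    """theta <- theta + eta * x_n * (y_n - mu(x_n)), one randomly chosen example at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(y)):
            theta += eta * X[n] * (y[n] - sigmoid(X[n] @ theta))
    return theta
```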


© Eric Xing @ CMU, 2006-2008 23

Case study

Dataset: 20 News Groups (20 classes)
- Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
- 61,118 words, 18,774 documents
- Class label descriptions

© Eric Xing @ CMU, 2006-2008 24

Experimental setup

Parameters:
- Binary features
- Using (stochastic) gradient descent vs. IRLS to estimate parameters

Training/Test Sets: subset of 20ng (binary classes)
- Use SVD to do dimension reduction (to 100)
- Randomly select 50% for training
- 10 runs; report the average result

Convergence Criteria: RMSE in training set
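A rough sketch of a comparable setup using scikit-learn; this is not the implementation behind the slides, and the two chosen newsgroups, the preprocessing options, and the solver settings are illustrative assumptions that merely mirror the described pipeline (binary features, SVD to 100 dimensions, a 50/50 split).

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two illustrative classes from 20 Newsgroups (the slides use several binary pairs).
data = fetch_20newsgroups(subset='all',
                          categories=['alt.atheism', 'comp.graphics'],
                          remove=('headers', 'footers', 'quotes'))

Xb = CountVectorizer(binary=True).fit_transform(data.data)              # binary word features
Xr = TruncatedSVD(n_components=100, random_state=0).fit_transform(Xb)   # SVD to 100 dimensions

X_tr, X_te, y_tr, y_te = train_test_split(Xr, data.target, train_size=0.5, random_state=0)

# scikit-learn's quasi-Newton ('lbfgs') solver stands in for the slides' gradient descent / IRLS.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```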


© Eric Xing @ CMU, 2006-2008 25

Convergence curves

© Eric Xing @ CMU, 2006-2008

Panels: alt.atheism vs. comp.graphics; rec.autos vs. rec.sport.baseball; comp.windows.x vs. rec.motorcycles

Legend: x-axis: iteration #; y-axis: error. In each figure, red is IRLS and blue is gradient descent.

© Eric Xing @ CMU, 2006-2008 26

Generative vs. Discriminative Classifiers

Goal: Wish to learn f: X → Y, e.g., P(Y|X)

Generative classifiers (e.g., Naïve Bayes):
- Assume some functional form for P(X|Y), P(Y)
- This is a 'generative' model of the data!
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X = x)

Discriminative classifiers:
- Directly assume some functional form for P(Y|X)
- This is a 'discriminative' model of the data!
- Estimate parameters of P(Y|X) directly from training data

[Graphical models: generative, $Y_i \rightarrow X_i$; discriminative, $X_i \rightarrow Y_i$]


© Eric Xing @ CMU, 2006-2008 27

Gaussian Discriminative Analysis

Learning f: X → Y, where
- X is a vector of real-valued features, <X1 … Xm>
- Y is boolean

What does that imply about the form of P(Y|X)?

The joint probability of a datum and its label is:

$p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu_k, \sigma_k) = \pi_k \frac{1}{(2\pi\sigma_k^2)^{1/2}} \exp\Big\{ -\frac{1}{2\sigma_k^2} (x_n - \mu_k)^2 \Big\}$

Given a datum $x_n$, we predict its label using the conditional probability of the label given the datum:

$p(y_n^k = 1 \mid x_n, \mu, \sigma) = \dfrac{\pi_k \frac{1}{(2\pi\sigma_k^2)^{1/2}} \exp\big\{ -\frac{1}{2\sigma_k^2} (x_n - \mu_k)^2 \big\}}{\sum_{k'} \pi_{k'} \frac{1}{(2\pi\sigma_{k'}^2)^{1/2}} \exp\big\{ -\frac{1}{2\sigma_{k'}^2} (x_n - \mu_{k'})^2 \big\}}$

[Graphical model: $Y_i \rightarrow X_i$]

© Eric Xing @ CMU, 2006-2008 28

Naïve Bayes Classifier

When X is a multivariate Gaussian vector, the joint probability of a datum and its label is:

$p(\vec{x}_n, y_n^k = 1 \mid \vec{\mu}, \Sigma) = p(y_n^k = 1)\, p(\vec{x}_n \mid y_n^k = 1, \vec{\mu}_k, \Sigma) = \pi_k \frac{1}{(2\pi|\Sigma|)^{1/2}} \exp\Big\{ -\frac{1}{2} (\vec{x}_n - \vec{\mu}_k)^T \Sigma^{-1} (\vec{x}_n - \vec{\mu}_k) \Big\}$

The naïve Bayes simplification:

$p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \prod_j p(x_n^j \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j}) = \pi_k \prod_j \frac{1}{(2\pi\sigma_{k,j}^2)^{1/2}} \exp\Big\{ -\frac{1}{2\sigma_{k,j}^2} (x_n^j - \mu_{k,j})^2 \Big\}$

More generally:

$p(x_n, y_n \mid \pi, \eta) = p(y_n \mid \pi) \prod_{j=1}^m p(x_n^j \mid y_n, \eta_j)$

where p(· | ·) is an arbitrary conditional (discrete or continuous) 1-D density.

[Graphical model: $Y_i$ with children $X_i^1, X_i^2, \ldots, X_i^m$]


© Eric Xing @ CMU, 2006-2008 29

The predictive distribution

Understanding the predictive distribution:

$p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \vec{\pi}) = \dfrac{p(y_n^k = 1, x_n \mid \vec{\mu}, \Sigma)}{p(x_n \mid \vec{\mu}, \Sigma)} = \dfrac{\pi_k\, N(x_n \mid \mu_k, \Sigma)}{\sum_{k'} \pi_{k'}\, N(x_n \mid \mu_{k'}, \Sigma)} \qquad (*)$

Under the naïve Bayes assumption:

$p(y_n^k = 1 \mid x_n, \mu, \sigma, \pi) = \dfrac{\pi_k \exp\big\{ -\sum_j \big( \frac{1}{2\sigma_{k,j}^2} (x_n^j - \mu_{k,j})^2 - \log C_{k,j} \big) \big\}}{\sum_{k'} \pi_{k'} \exp\big\{ -\sum_j \big( \frac{1}{2\sigma_{k',j}^2} (x_n^j - \mu_{k',j})^2 - \log C_{k',j} \big) \big\}} \qquad (**)$

where $C_{k,j} = (2\pi\sigma_{k,j}^2)^{-1/2}$ is the Gaussian normalizer.

For two classes (i.e., K = 2), when the two classes have the same variances, (**) turns out to be the logistic function:

$p(y_n^1 = 1 \mid x_n) = \dfrac{1}{1 + \exp\big\{ -\sum_j \frac{\mu_{1j} - \mu_{2j}}{\sigma_j^2}\, x_n^j + \sum_j \frac{\mu_{1j}^2 - \mu_{2j}^2}{2\sigma_j^2} + \log\frac{1 - \pi_1}{\pi_1} \big\}} = \dfrac{1}{1 + e^{-\theta^T x_n}}$

where $\theta$ collects the coefficients above and $x_n$ is augmented with a constant feature.
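To make the two-class reduction above concrete, here is a small numeric check with made-up Gaussian parameters in one dimension: the posterior computed from the generative model matches the sigmoid of a linear function of x whose weights are the ones given above.

```python
import numpy as np

# Made-up 1-D Gaussian class-conditionals with shared variance.
mu1, mu2, sigma, pi1 = 2.0, -1.0, 1.5, 0.3

def posterior_generative(x):
    """p(y=1|x) from Bayes rule with Gaussian class-conditionals (normalizers cancel)."""
    n1 = np.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
    n2 = np.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
    return pi1 * n1 / (pi1 * n1 + (1 - pi1) * n2)

def posterior_logistic(x):
    """The same posterior written as a logistic function of x."""
    theta1 = (mu1 - mu2) / sigma ** 2
    theta0 = -(mu1 ** 2 - mu2 ** 2) / (2 * sigma ** 2) - np.log((1 - pi1) / pi1)
    return 1.0 / (1.0 + np.exp(-(theta0 + theta1 * x)))

x = np.linspace(-4, 4, 9)
print(np.allclose(posterior_generative(x), posterior_logistic(x)))  # True
```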

© Eric Xing @ CMU, 2006-2008 30

The decision boundary

The predictive distribution:

$p(y_n^1 = 1 \mid x_n) = \dfrac{1}{1 + \exp\big\{ -\theta_0 - \sum_{j=1}^M \theta_j x_n^j \big\}} = \dfrac{1}{1 + e^{-\theta^T x_n}}$

The Bayes decision rule:

$\ln \dfrac{p(y_n^1 = 1 \mid x_n)}{p(y_n^2 = 1 \mid x_n)} = \ln\left( \dfrac{\frac{1}{1 + e^{-\theta^T x_n}}}{\frac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}}} \right) = \theta^T x_n$

so we predict class 1 when $\theta^T x_n > 0$; the decision boundary is the hyperplane $\theta^T x_n = 0$.

For multiple classes (i.e., K > 2), (*) corresponds to a softmax function:

$p(y_n^k = 1 \mid x_n) = \dfrac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}$
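A sketch of the softmax predictive distribution and the resulting decision rule above, with made-up parameters $\theta_k$ stacked as the rows of `Theta`.

```python
import numpy as np

def softmax_predict(Theta, x):
    """p(y=k|x) = exp(theta_k^T x) / sum_j exp(theta_j^T x) for each class k."""
    scores = Theta @ x
    scores -= scores.max()          # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Made-up parameters: 3 classes, 2 features + bias.
Theta = np.array([[1.0, -0.5, 0.2],
                  [-0.3, 0.8, 0.0],
                  [0.1, 0.1, -0.4]])
x = np.array([0.5, 1.5, 1.0])
p = softmax_predict(Theta, x)
print(p, "-> predict class", p.argmax())   # Bayes decision rule: pick the most probable class
```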


© Eric Xing @ CMU, 2006-2008 31

Naïve Bayes vs Logistic Regression

Consider Y boolean and X continuous, X = <X1 ... Xm>.

Number of parameters to estimate:
- NB: 4m + 1 (for each of the m features, a mean and a variance per class, plus the class prior)
- LR: m + 1 (one weight per feature plus a bias)

Estimation method:
- NB parameter estimates are uncoupled
- LR parameter estimates are coupled

© Eric Xing @ CMU, 2006-2008 32

Naïve Bayes vs Logistic Regression

Asymptotic comparison (# training examples → infinity)

When model assumptions are correct: NB and LR produce identical classifiers

When model assumptions are incorrect: LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB


© Eric Xing @ CMU, 2006-2008 33

Naïve Bayes vs Logistic Regression

Non-asymptotic analysis (see [Ng & Jordan, 2002] )

Convergence rate of parameter estimates – how many training examples are needed to assure good estimates?

- NB: order log m (where m = # of attributes in X)
- LR: order m

NB converges more quickly to its (perhaps less helpful) asymptotic estimates

© Eric Xing @ CMU, 2006-2008 34

Rate of convergence: logistic regression

Let $h_{Dis,m}$ be logistic regression trained on n examples in m dimensions. Then with high probability:

Implication: if we want the error to be within some small constant ε0 of the asymptotic error, it suffices to pick order m examples

It converges to its asymptotic classifier in order m examples

This result follows from Vapnik's structural risk bound, plus the fact that the "VC dimension" of m-dimensional linear separators is m


© Eric Xing @ CMU, 2006-2008 35

Rate of convergence: naïve Bayes parameters

Let any ε1, δ>0, and any n ≥ 0 be fixed. Assume that for some fixed ρ0 > 0, we have that

Let

Then with probability at least 1-δ, after n examples:

1. For discrete input, for all i and b

2. For continuous inputs, for all i and b

© Eric Xing @ CMU, 2006-2008 36

Some experiments from UCI data sets


© Eric Xing @ CMU, 2006-2008 37

Back to our 20 NG Case study

Dataset: 20 News Groups (20 classes)
- 61,118 words, 18,774 documents

Experiment:
- Solve only a two-class subset: 1 vs. 2
- 1768 instances, 61,188 features
- Use dimensionality reduction on the data (SVD)
- Use 90% as the training set, 10% as the test set
- Test prediction error used as the accuracy measure

© Eric Xing @ CMU, 2006-2008 38

Versus training size

Generalization error (1)

• 30 features
• A fixed test set
• Training set varied from 10% to 100% of the available training data


© Eric Xing @ CMU, 2006-2008 39

Versus model size

Generalization error (2)

• Number of dimensions of the data varied from 5 to 50 in steps of 5

• The features were chosen in decreasing order of their singular values

• 90% versus 10% split on training and test

© Eric Xing @ CMU, 2006-2008 40

Summary

Logistic regression

– Functional form follows from Naïve Bayes assumptions
  - For Gaussian Naïve Bayes with class-independent variances
  - For discrete-valued Naïve Bayes too

But the training procedure picks parameters without the conditional independence assumption

– MLE training: pick θ to maximize P(Y | X; θ)
– MAP training: pick θ to maximize P(θ | X, Y)

'Regularization' helps reduce overfitting

Gradient ascent/descent/IRLS
– General approach when closed-form solutions are unavailable

Generative vs. discriminative classifiers
– Bias vs. variance tradeoff