
School of Computer Science

Learning generalized linear models and tabular CPT of structured full BN

Probabilistic Graphical Models (10-708)

Lecture 9, Oct 15, 2007

Eric Xing

[Figure: example Bayesian networks used throughout the course: a signaling-pathway network (Receptor A, Receptor B, Kinase C, Kinase D, Kinase E, TF F, Gene G, Gene H) and its abstraction as variables X1–X8.]

Reading: J-Chap. 7,8.

Eric Xing 2

Linear Regression

Let us assume that the target variable and the inputs are related by the equation:

$$ y_i = \theta^T x_i + \varepsilon_i $$

where $\varepsilon_i$ is an error term of unmodeled effects or random noise.

Now assume that $\varepsilon$ follows a Gaussian $N(0, \sigma)$; then we have:

$$ p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right) $$
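To make the noise model concrete, here is a minimal NumPy sketch (not from the slides; the data and parameter values are illustrative) that simulates pairs from $y_i = \theta^T x_i + \varepsilon_i$ and evaluates the Gaussian conditional density above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground-truth parameters (assumed for this sketch).
theta_true = np.array([2.0, -1.0])
sigma = 0.5

# Simulate inputs and targets: y_i = theta^T x_i + eps_i, eps_i ~ N(0, sigma^2).
X = rng.normal(size=(100, 2))
y = X @ theta_true + sigma * rng.normal(size=100)

def gaussian_cond_density(y, X, theta, sigma):
    """p(y_i | x_i; theta) under the Gaussian noise model."""
    resid = y - X @ theta
    return np.exp(-resid**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Log-likelihood of the simulated data under the true parameters.
print(np.log(gaussian_cond_density(y, X, theta_true, sigma)).sum())
```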


Eric Xing 3

Logistic Regression (sigmoid classifier)

The conditional distribution: a Bernoulli

$$ p(y \mid x) = \mu(x)^y \big(1 - \mu(x)\big)^{1-y} $$

where $\mu$ is a logistic function

$$ \mu(x) = \frac{1}{1 + e^{-\theta^T x}} $$

We can use the brute-force gradient method as in LR.

But we can also apply generic laws by observing that $p(y \mid x)$ is an exponential family function, more specifically, a generalized linear model.
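A minimal sketch (names and toy numbers are illustrative, not from the slides) of the sigmoid conditional mean and the Bernoulli log-likelihood it defines:

```python
import numpy as np

def sigmoid(z):
    # mu(x) = 1 / (1 + exp(-theta^T x))
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_loglik(theta, X, y):
    """sum_n [ y_n log mu_n + (1 - y_n) log(1 - mu_n) ]"""
    mu = sigmoid(X @ theta)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Toy usage with arbitrary numbers.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1, 0, 1])
theta = np.zeros(2)
print(bernoulli_loglik(theta, X, y))   # = 3 * log(0.5)
```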

Eric Xing 4

Exponential family

For a numeric random variable X,

$$ p(x \mid \eta) = h(x)\exp\left\{ \eta^T T(x) - A(\eta) \right\} = \frac{1}{Z(\eta)}\, h(x)\exp\left\{ \eta^T T(x) \right\} $$

is an exponential family distribution with natural (canonical) parameter η.

Function T(x) is a sufficient statistic.
Function A(η) = log Z(η) is the log normalizer.
Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
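As a concrete check (a sketch with assumed values, not part of the slides): the Bernoulli distribution with mean µ fits this form with η = log(µ/(1−µ)), T(x) = x, A(η) = log(1 + e^η), and h(x) = 1.

```python
import numpy as np

mu = 0.3                        # Bernoulli mean (illustrative value)
eta = np.log(mu / (1 - mu))     # natural parameter
A = np.log(1 + np.exp(eta))     # log normalizer A(eta)

for x in (0, 1):
    p_exp_family = np.exp(eta * x - A)        # h(x) = 1, T(x) = x
    p_direct = mu**x * (1 - mu)**(1 - x)
    print(x, p_exp_family, p_direct)          # the two columns agree
```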


Eric Xing 5

Multivariate Gaussian Distribution

For a continuous vector random variable X ∈ R^k:

$$ p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left\{ -\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) \right\} = \frac{1}{(2\pi)^{k/2}}\exp\left\{ -\tfrac{1}{2}\operatorname{tr}\big(\Sigma^{-1} x x^T\big) + \mu^T\Sigma^{-1}x - \tfrac{1}{2}\mu^T\Sigma^{-1}\mu - \tfrac{1}{2}\log|\Sigma| \right\} $$

Exponential family representation:

$$ \eta = \left[ \Sigma^{-1}\mu;\ -\tfrac{1}{2}\operatorname{vec}\big(\Sigma^{-1}\big) \right] = \big[ \eta_1;\ \operatorname{vec}(\eta_2) \big], \qquad \eta_1 = \Sigma^{-1}\mu,\ \ \eta_2 = -\tfrac{1}{2}\Sigma^{-1} $$
$$ T(x) = \left[ x;\ \operatorname{vec}\big( x x^T \big) \right] $$
$$ A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\eta_1^T\eta_2^{-1}\eta_1 - \tfrac{1}{2}\log\big|{-2\eta_2}\big| $$
$$ h(x) = (2\pi)^{-k/2} $$

This is the mapping between the moment parameters (µ, Σ) and the natural parameters η.

Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positive-definiteness, the parameters are constrained and have lower degrees of freedom).
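A numerical sanity check of this representation (a sketch; the mean, covariance, and evaluation point are arbitrary, and SciPy is assumed to be available): assemble η, T(x), A(η), h(x) as above and compare against the standard Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

k = 2
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 0.8]])
P = np.linalg.inv(Sigma)                     # precision matrix Sigma^{-1}

# Natural parameters, log normalizer, base measure.
eta1 = P @ mu                                # Sigma^{-1} mu
eta2 = -0.5 * P                              # -1/2 Sigma^{-1}
A = 0.5 * mu @ P @ mu + 0.5 * np.log(np.linalg.det(Sigma))
h = (2 * np.pi) ** (-k / 2)

x = np.array([0.2, 0.7])                     # arbitrary evaluation point
T1, T2 = x, np.outer(x, x)                   # T(x) = [x; vec(x x^T)]

p_expfam = h * np.exp(eta1 @ T1 + np.sum(eta2 * T2) - A)
p_direct = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
print(p_expfam, p_direct)                    # the two values agree
```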

Eric Xing 6

Multinomial distribution

For a binary vector random variable x ~ multi(x | π):

$$ p(x \mid \pi) = \pi_1^{x_1}\pi_2^{x_2}\cdots\pi_K^{x_K} = \exp\left\{ \sum_k x_k\ln\pi_k \right\} $$
$$ = \exp\left\{ \sum_{k=1}^{K-1} x_k\ln\pi_k + \left( 1 - \sum_{k=1}^{K-1} x_k \right)\ln\left( 1 - \sum_{k=1}^{K-1}\pi_k \right) \right\} $$
$$ = \exp\left\{ \sum_{k=1}^{K-1} x_k\ln\frac{\pi_k}{1 - \sum_{k=1}^{K-1}\pi_k} + \ln\left( 1 - \sum_{k=1}^{K-1}\pi_k \right) \right\} $$

Exponential family representation:

$$ \eta = \left[ \ln\frac{\pi_k}{\pi_K};\ 0 \right] $$
$$ T(x) = [x] $$
$$ A(\eta) = -\ln\left( 1 - \sum_{k=1}^{K-1}\pi_k \right) = \ln\left( \sum_{k=1}^{K} e^{\eta_k} \right) $$
$$ h(x) = 1 $$


Eric Xing 7

Why exponential family?

Moment generating property:

$$ \frac{dA}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta) = \int T(x)\, h(x)\exp\left\{ \eta^T T(x) \right\}\frac{dx}{Z(\eta)} = E\big[ T(x) \big] $$

$$ \frac{d^2 A}{d\eta^2} = \int T(x)\, h(x)\frac{\exp\left\{ \eta^T T(x) \right\}}{Z(\eta)}\, T(x)\, dx \;-\; \int T(x)\, h(x)\frac{\exp\left\{ \eta^T T(x) \right\}}{Z(\eta)}\, dx\;\frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta} = E\big[ T^2(x) \big] - E\big[ T(x) \big]^2 = \operatorname{Var}\big[ T(x) \big] $$
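A quick numerical illustration (a sketch with assumed values, not from the slides): for the Poisson distribution, η = log λ and A(η) = e^η, so dA/dη should match the mean and d²A/dη² the variance of T(x) = x.

```python
import numpy as np

lam = 3.5
eta = np.log(lam)
A = lambda e: np.exp(e)          # Poisson log normalizer: A(eta) = exp(eta) = lambda

# Finite-difference derivatives of A(eta).
h = 1e-5
dA = (A(eta + h) - A(eta - h)) / (2 * h)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2

# Compare against sample moments of T(x) = x.
rng = np.random.default_rng(2)
x = rng.poisson(lam, size=1_000_000)
print(dA, x.mean())    # both approximately lambda = 3.5
print(d2A, x.var())    # both approximately lambda = 3.5
```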

Eric Xing 8

Moment estimation

We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η). The qth derivative gives the qth centered moment:

$$ \frac{dA(\eta)}{d\eta} = \text{mean}, \qquad \frac{d^2 A(\eta)}{d\eta^2} = \text{variance}, \qquad \ldots $$

When the sufficient statistic is a stacked vector, partial derivatives need to be considered.


Eric Xing 9

Moment vs canonical parameters

The moment parameter µ can be derived from the natural (canonical) parameter:

$$ \frac{dA(\eta)}{d\eta} = E\big[ T(x) \big] \stackrel{\text{def}}{=} \mu $$

A(η) is convex since

$$ \frac{d^2 A(\eta)}{d\eta^2} = \operatorname{Var}\big[ T(x) \big] > 0 $$

Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):

$$ \eta \stackrel{\text{def}}{=} \psi(\mu) $$

A distribution in the exponential family can be parameterized not only by η (the canonical parameterization), but also by µ (the moment parameterization).

[Figure: the convex log normalizer A(η).]

Eric Xing 10

MLE for Exponential Family

For iid data, the log-likelihood is

$$ \ell(\eta; D) = \log\prod_n h(x_n)\exp\left\{ \eta^T T(x_n) - A(\eta) \right\} = \sum_n \log h(x_n) + \eta^T\left( \sum_n T(x_n) \right) - N A(\eta) $$

Take derivatives and set to zero:

$$ \frac{\partial\ell}{\partial\eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial\eta} = 0 \;\Rightarrow\; \frac{\partial A(\eta)}{\partial\eta} = \frac{1}{N}\sum_n T(x_n) \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n) $$

This amounts to moment matching.

We can infer the canonical parameters using

$$ \hat{\eta}_{MLE} = \psi\big( \hat{\mu}_{MLE} \big) $$
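A minimal sketch of moment matching for the Bernoulli case (data and values are illustrative): the MLE of the moment parameter is the sample mean of T(x) = x, and the canonical parameter follows from η = ψ(µ) = log(µ/(1−µ)).

```python
import numpy as np

rng = np.random.default_rng(3)

# iid Bernoulli data with true mean 0.7 (illustrative).
x = rng.binomial(1, 0.7, size=10_000)

# Moment matching: mu_MLE = (1/N) sum_n T(x_n), with T(x) = x.
mu_mle = x.mean()

# Invert the moment/canonical map: eta_MLE = psi(mu_MLE) = log(mu / (1 - mu)).
eta_mle = np.log(mu_mle / (1 - mu_mle))

print(mu_mle, eta_mle)   # mu_mle ~ 0.7, eta_mle ~ log(0.7 / 0.3)
```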


Eric Xing 11

Sufficiency

For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).

We can throw away X for the purpose of inference w.r.t. θ.

Bayesian view:  $ p(\theta \mid T(x), x) = p(\theta \mid T(x)) $

Frequentist view:  $ p(x \mid T(x), \theta) = p(x \mid T(x)) $

The Neyman factorization theorem: T(x) is sufficient for θ if

$$ p(x, T(x), \theta) = \psi_1\big( T(x), \theta \big)\,\psi_2\big( x, T(x) \big) $$
$$ \Rightarrow\ p(x \mid \theta) = g\big( T(x), \theta \big)\, h\big( x, T(x) \big) $$

Eric Xing 12

Examples

Gaussian:

$$ \eta = \left[ \Sigma^{-1}\mu;\ -\tfrac{1}{2}\operatorname{vec}(\Sigma^{-1}) \right], \quad T(x) = \left[ x;\ \operatorname{vec}(x x^T) \right], \quad A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|, \quad h(x) = (2\pi)^{-k/2} $$
$$ \Rightarrow\ \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n $$

Multinomial:

$$ \eta = \left[ \ln\frac{\pi_k}{\pi_K};\ 0 \right], \quad T(x) = [x], \quad A(\eta) = -\ln\left( 1 - \sum_{k=1}^{K-1}\pi_k \right) = \ln\left( \sum_{k=1}^{K} e^{\eta_k} \right), \quad h(x) = 1 $$
$$ \Rightarrow\ \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n $$

Poisson:

$$ \eta = \log\lambda, \quad T(x) = x, \quad A(\eta) = \lambda = e^{\eta}, \quad h(x) = \frac{1}{x!} $$
$$ \Rightarrow\ \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n $$


Eric Xing 13

Generalized Linear Models (GLIMs)

The graphical model  [Figure: plate model X_n → Y_n, repeated N times.]
Linear regression
Discriminative linear classification
Commonality: model $E_p(Y) = \mu = f(\theta^T X)$
What is p()? The conditional distribution of Y.
What is f()? The response function.

GLIM:
The observed input x is assumed to enter into the model via a linear combination of its elements, $\xi = \theta^T x$.
The conditional mean µ is represented as a function f(ξ) of ξ, where f is known as the response function.
The observed output y is assumed to be characterized by an exponential family distribution with conditional mean µ.

Eric Xing 14

GLIM, cont.

The choice of exponential family is constrained by the nature of the data Y.
Example: y is a continuous vector → multivariate Gaussian; y is a class label → Bernoulli or multinomial.

$$ p(y \mid \eta) = h(y)\exp\left\{ \eta(x)^T y - A(\eta) \right\} \;\Rightarrow\; p(y \mid \eta) = h(y, \phi)\exp\left\{ \tfrac{1}{\phi}\big( \eta(x)^T y - A(\eta) \big) \right\} $$

The choice of the response function: following some mild constraints, e.g., boundedness in [0,1], positivity, ...

Canonical response function:

$$ f = \psi^{-1}(\cdot) $$

In this case $\theta^T x$ directly corresponds to the canonical parameter η.

[Figure: the GLIM chain x → ξ = θᵀx → µ = f(ξ) → η = ψ(µ).]


Eric Xing 15

MLE for GLIMs with natural response

Log-likelihood:

$$ \ell = \sum_n \log h(y_n) + \sum_n \left( \theta^T x_n\, y_n - A(\eta_n) \right) $$

Derivative of log-likelihood:

$$ \frac{d\ell}{d\theta} = \sum_n \left( x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta} \right) = \sum_n (y_n - \mu_n)\, x_n = X^T(y - \mu) $$

This is a fixed-point function because µ is a function of θ.

Online learning for canonical GLIMs: stochastic gradient ascent = least mean squares (LMS) algorithm:

$$ \theta^{t+1} = \theta^t + \rho\left( y_n - \mu_n^t \right) x_n, \qquad \text{where } \mu_n^t = (\theta^t)^T x_n \text{ and } \rho \text{ is a step size} $$

Eric Xing 16

Batch learning for canonical GLIMs

The Hessian matrix:

$$ H = \frac{d^2\ell}{d\theta\, d\theta^T} = \frac{d}{d\theta^T}\sum_n (y_n - \mu_n)\, x_n = -\sum_n x_n\frac{d\mu_n}{d\theta^T} = -\sum_n x_n\frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T} = -\sum_n x_n\frac{d\mu_n}{d\eta_n}\, x_n^T = -X^T W X $$

since $\eta_n = \theta^T x_n$, where $X = [x_n^T]$ is the design matrix,

$$ X = \begin{bmatrix} -\ x_1^T\ - \\ -\ x_2^T\ - \\ \vdots \\ -\ x_N^T\ - \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, $$

and

$$ W = \operatorname{diag}\left( \frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N} \right), $$

which can be computed by calculating the 2nd derivative of A(η_n).


Eric Xing 17

Recall LMS

Cost function in matrix form:

$$ J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left( x_i^T\theta - y_i \right)^2 = \frac{1}{2}\left( X\theta - \vec{y} \right)^T\left( X\theta - \vec{y} \right) $$

with the design matrix X and response vector $\vec{y}$ as before.

To minimize J(θ), take the derivative and set it to zero:

$$ \nabla_\theta J = \tfrac{1}{2}\nabla_\theta\left( \theta^T X^T X\theta - \theta^T X^T\vec{y} - \vec{y}^T X\theta + \vec{y}^T\vec{y} \right) = \tfrac{1}{2}\left( 2X^T X\theta - 2X^T\vec{y} \right) = X^T X\theta - X^T\vec{y} = 0 $$

$$ \Rightarrow\ X^T X\theta = X^T\vec{y} \qquad \text{(the normal equations)} $$

$$ \theta^* = \left( X^T X \right)^{-1} X^T\vec{y} $$
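A minimal NumPy sketch of solving the normal equations (data and names are illustrative); in practice a least-squares solver is numerically preferable to forming $(X^T X)^{-1}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative regression data.
theta_true = np.array([0.7, -1.3, 2.0])
X = rng.normal(size=(200, 3))
y = X @ theta_true + 0.1 * rng.normal(size=200)

# Normal equations: theta* = (X^T X)^{-1} X^T y, solved as a linear system.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent and more stable: a least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_star, theta_lstsq)   # both close to theta_true
```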

Eric Xing 18

Iteratively Reweighted Least Squares (IRLS)

Recall the Newton-Raphson method with cost function J:

$$ \theta^{t+1} = \theta^t - H^{-1}\nabla_\theta J $$

We now have:

$$ \nabla_\theta\ell = X^T(y - \mu), \qquad H = -X^T W X $$

Now:

$$ \theta^{t+1} = \theta^t - H^{-1}\nabla_\theta\ell\big|_{\theta^t} = \left( X^T W^t X \right)^{-1}\left[ X^T W^t X\,\theta^t + X^T\big( y - \mu^t \big) \right] = \left( X^T W^t X \right)^{-1} X^T W^t z^t $$

where the adjusted response is

$$ z^t = X\theta^t + \big( W^t \big)^{-1}\big( y - \mu^t \big) $$

This can be understood as solving the following "iteratively reweighted least squares" problem:

$$ \theta^{t+1} = \arg\min_\theta\, (z - X\theta)^T W\, (z - X\theta) $$


Eric Xing 19

Example 1: logistic regression (sigmoid classifier)

The conditional distribution: a Bernoulli

$$ p(y \mid x) = \mu(x)^y\big( 1 - \mu(x) \big)^{1-y} $$

where µ is a logistic function

$$ \mu(x) = \frac{1}{1 + e^{-\eta(x)}} $$

p(y|x) is an exponential family function, with mean

$$ E[y \mid x] = \mu = \frac{1}{1 + e^{-\eta(x)}} $$

and canonical response function

$$ \eta = \xi = \theta^T x $$

IRLS:

$$ \frac{d\mu}{d\eta} = \mu(1 - \mu), \qquad W = \operatorname{diag}\big( \mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N) \big) $$

Eric Xing 20

Logistic regression: practical issues

It is very common to use regularized maximum likelihood:

$$ p(y = \pm 1 \mid x, \theta) = \sigma\big( y\,\theta^T x \big) = \frac{1}{1 + e^{-y\theta^T x}}, \qquad p(\theta) = \mathrm{Normal}\big( 0, \lambda^{-1} I \big) $$
$$ \ell(\theta) = \sum_n \log\sigma\big( y_n\theta^T x_n \big) - \frac{\lambda}{2}\theta^T\theta $$

IRLS takes O(Nd³) per iteration, where N = number of training cases and d = dimension of input x.
Quasi-Newton methods, which approximate the Hessian, work faster.
Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
Stochastic gradient descent can also be used if N is large; c.f. the perceptron rule:

$$ \nabla_\theta\ell_n = \big( 1 - \sigma(y_n\theta^T x_n) \big)\, y_n x_n - \lambda\theta $$


Eric Xing 21

Example 2: linear regression

The conditional distribution: a Gaussian

$$ p(y \mid x, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left\{ -\tfrac{1}{2}\big( y - \mu(x) \big)^T\Sigma^{-1}\big( y - \mu(x) \big) \right\} \;\Rightarrow\; h(x)\exp\left\{ \eta(x)^T y - A(\eta) \right\} $$

where µ is a linear function

$$ \mu(x) = \theta^T x = \eta(x) $$

p(y|x) is an exponential family function, with mean

$$ E[y \mid x] = \mu = \theta^T x $$

and canonical response function

$$ \eta = \xi = \theta^T x $$

IRLS:

$$ \frac{d\mu}{d\eta} = 1, \qquad W = I $$

$$ \theta^{t+1} = \left( X^T W^t X \right)^{-1} X^T W^t z^t = \theta^t + \left( X^T X \right)^{-1} X^T\big( y - \mu^t \big) \;\xrightarrow{\ t\to\infty\ }\; \theta = \left( X^T X \right)^{-1} X^T Y $$

The update is a steepest-descent step rescaled by (XᵀX)⁻¹, and its fixed point is the normal-equation solution.

Eric Xing 22

Classification: generative and discriminative approaches.
Regression: linear, conditional mixture, nonparametric.
Density estimation: parametric and nonparametric methods.

[Figures: the corresponding small GMs, e.g., Q and X for classification, X → Y for regression, and (µ, σ) → X for density estimation.]

Simple GMs are the building blocks of complex BNs.


Eric Xing 23

An (incomplete) genealogy of graphical models

[Figure: a genealogy chart of graphical model families.]

The structures of most GMs (e.g., all listed here) are not learned from data, but designed by humans.

But such designs are useful and indeed favored, because they put human knowledge to good use.

Eric Xing 24

MLE for general BNs

If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

$$ \ell(\theta; D) = \log p(D \mid \theta) = \log\prod_n\left( \prod_i p\big( x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i \big) \right) = \sum_i\left( \sum_n \log p\big( x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i \big) \right) $$

[Figure: the example BN over X1–X8 with tabular CPDs indexed by parent configurations, e.g., X2 = 0, 1 and X5 = 0, 1.]


Eric Xing 25

[Figure: the Earthquake–Burglary–Alarm–Radio–Call network.]

Factorization:

$$ p(\mathbf{X}) = \prod_{i=1}^{M} p\big( x_i \mid \mathbf{x}_{\pi_i} \big) $$

Local distributions defined by, e.g., multinomial parameters:

$$ p\big( x_i^k \mid \mathbf{x}_{\pi_i}^j \big) = \theta_{x_i^k \mid \mathbf{x}_{\pi_i}^j} $$

How to define the parameter prior p(θ | G)?

Assumptions (Geiger & Heckerman 97, 99):
Complete Model Equivalence
Global Parameter Independence
Local Parameter Independence
Likelihood and Prior Modularity

Eric Xing 26

Global & Local Parameter Independence

Global Parameter Independence: for every DAG model,

$$ p(\theta_m \mid G) = \prod_{i=1}^{M} p(\theta_i \mid G) $$

Local Parameter Independence: for every node,

$$ p(\theta_i \mid G) = \prod_{j=1}^{q_i} p\big( \theta_{x_i^k \mid \mathbf{x}_{\pi_i}^j} \mid G \big) $$

For example, θ_{Call | Alarm = YES} is independent of θ_{Call | Alarm = NO}.


Eric Xing 27

Parameter Independence, Graphical View

Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!

[Figure: a meta-BN in which parameters θ1 and θ2|1 are parents of X1 and X2 across samples 1, 2, ..., M, illustrating global and local parameter independence.]

Eric Xing 28

Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97,99)

Discrete DAG models:  $ x_i \mid \mathbf{x}_{\pi_i} \sim \mathrm{Multi}(\theta) $

Dirichlet prior:

$$ P(\theta \mid \alpha) = \frac{\Gamma\left( \sum_k \alpha_k \right)}{\prod_k \Gamma(\alpha_k)}\prod_k \theta_k^{\alpha_k - 1} = C(\alpha)\prod_k \theta_k^{\alpha_k - 1} $$

Gaussian DAG models:  $ x_i \mid \mathbf{x}_{\pi_i} \sim \mathrm{Normal}(\mu, \Sigma) $

Normal prior:

$$ p(\mu \mid \nu, \Psi) = (2\pi)^{-n/2}\,|\Psi|^{-1/2}\exp\left\{ -\tfrac{1}{2}(\mu - \nu)^T\Psi^{-1}(\mu - \nu) \right\} $$

Normal-Wishart prior:

$$ p(W \mid \alpha_w, T_w) = c(n, \alpha_w)\,|T_w|^{\alpha_w/2}\,|W|^{(\alpha_w - n - 1)/2}\exp\left\{ -\tfrac{1}{2}\operatorname{tr}(T_w W) \right\}, \qquad \text{where } W = \Sigma^{-1} $$
$$ p(\mu \mid \nu, \alpha_\mu, W) = \mathrm{Normal}\big( \nu, (\alpha_\mu W)^{-1} \big) $$


Eric Xing 29

MLE for general BNs

If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

$$ \ell(\theta; D) = \log p(D \mid \theta) = \log\prod_n\left( \prod_i p\big( x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i \big) \right) = \sum_i\left( \sum_n \log p\big( x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i \big) \right) $$

Eric Xing 30

Example: decomposable likelihood of a directed model

Consider the distribution defined by the directed acyclic GM:

$$ p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4) $$

This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the four-node DAG X1 → X2, X1 → X3, (X2, X3) → X4, decomposed into four one-family BNs.]


Eric Xing 31

MLE for BNs with tabular CPDs

Assume each CPD is represented as a table (multinomial) where

$$ \theta_{ijk} \stackrel{\text{def}}{=} p\big( X_i = j \mid X_{\pi_i} = k \big) $$

Note that in case of multiple parents, X_{π_i} will have a composite state, and the CPD will be a high-dimensional table. The sufficient statistics are counts of family configurations:

$$ n_{ijk} \stackrel{\text{def}}{=} \sum_n x_{n,i}^j\, x_{n,\pi_i}^k $$

The log-likelihood is

$$ \ell(\theta; D) = \log\prod_{i,j,k}\theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk}\log\theta_{ijk} $$

Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get:

$$ \theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}} $$
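A sketch of this counting procedure for a toy two-node network X1 → X2 with binary variables (the data-generating probabilities and names are illustrative): count family configurations, then normalize within each parent configuration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy fully observed data for a two-node BN: X1 -> X2, both binary.
N = 5000
x1 = rng.binomial(1, 0.6, size=N)
# CPD used only to generate data: p(X2=1 | X1=0) = 0.2, p(X2=1 | X1=1) = 0.9.
x2 = rng.binomial(1, np.where(x1 == 1, 0.9, 0.2))

# Sufficient statistics: counts of family configurations n[j, k] = #(X2 = j, X1 = k).
counts = np.zeros((2, 2))
for j in (0, 1):
    for k in (0, 1):
        counts[j, k] = np.sum((x2 == j) & (x1 == k))

# MLE: theta[j, k] = n[j, k] / sum_j' n[j', k]  (normalize within each parent state k).
theta_ml = counts / counts.sum(axis=0, keepdims=True)
print(theta_ml)   # column k approximates p(X2 = . | X1 = k)
```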

Eric Xing 32

MLE and Kullback-Leibler divergence

KL divergence:

$$ D\big( q(x)\,\|\,p(x) \big) = \sum_x q(x)\log\frac{q(x)}{p(x)} $$

Empirical distribution:

$$ \tilde{p}(x) \stackrel{\text{def}}{=} \frac{1}{N}\sum_{n=1}^{N}\delta(x, x_n) $$

where δ(x, x_n) is a Kronecker delta function.

$$ D\big( \tilde{p}(x)\,\|\,p(x \mid \theta) \big) = \sum_x \tilde{p}(x)\log\frac{\tilde{p}(x)}{p(x \mid \theta)} = \sum_x \tilde{p}(x)\log\tilde{p}(x) - \sum_x \tilde{p}(x)\log p(x \mid \theta) $$
$$ = \sum_x \tilde{p}(x)\log\tilde{p}(x) - \frac{1}{N}\sum_n \log p(x_n \mid \theta) = C - \frac{1}{N}\ell(\theta; D) $$

Hence max_θ (MLE) ⇔ min_θ (KL).


Eric Xing 33

Parameter sharing

Consider a time-invariant (stationary) 1st-order Markov model.  [Figure: the chain X1 → X2 → X3 → ... → XT.]

Initial state probability vector:

$$ \pi_k \stackrel{\text{def}}{=} p\big( X_1^k = 1 \big) $$

State transition probability matrix:

$$ A_{ij} \stackrel{\text{def}}{=} p\big( X_t^j = 1 \mid X_{t-1}^i = 1 \big) $$

The joint:

$$ p\big( X_{1:T} \mid \theta \big) = p(x_1 \mid \pi)\prod_{t=2}^{T} p\big( X_t \mid X_{t-1} \big) $$

The log-likelihood:

$$ \ell(\theta; D) = \sum_n \log p\big( x_{n,1} \mid \pi \big) + \sum_n\sum_{t=2}^{T}\log p\big( x_{n,t} \mid x_{n,t-1}, A \big) $$

Again, we optimize each parameter separately. π is a multinomial frequency vector, and we've seen it before. What about A?

Eric Xing 34

Learning a Markov chain transition matrix

A is a stochastic matrix: $\sum_j A_{ij} = 1$; each row of A is a multinomial distribution.

So the MLE of A_ij is the fraction of transitions from i to j:

$$ A_{ij}^{ML} = \frac{\#(i\to j)}{\#(i\to\cdot)} = \frac{\sum_n\sum_{t=2}^{T} x_{n,t}^j\, x_{n,t-1}^i}{\sum_n\sum_{t=2}^{T} x_{n,t-1}^i} $$

Application: if the states X_t represent words, this is called a bigram language model.

Sparse data problem: if i → j did not occur in the data, we will have A_ij = 0, and then any further sequence with the word pair i → j will have zero probability. A standard hack: backoff smoothing or deleted interpolation,

$$ \tilde{A}_{i\to\cdot} = \lambda\,\eta + (1 - \lambda)\, A_{i\to\cdot}^{ML} $$
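A minimal sketch of the counting estimator for A (the toy state sequences are illustrative):

```python
import numpy as np

# Toy state sequences over K = 3 states (illustrative data).
sequences = [
    [0, 1, 1, 2, 0, 1],
    [2, 2, 1, 0, 0, 1, 2],
]
K = 3

# Count transitions #(i -> j) over all sequences.
counts = np.zeros((K, K))
for seq in sequences:
    for prev, cur in zip(seq[:-1], seq[1:]):
        counts[prev, cur] += 1

# MLE: A_ij = #(i -> j) / #(i -> .); rows with no outgoing transitions are left at zero.
row_totals = counts.sum(axis=1, keepdims=True)
A_ml = np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)
print(A_ml)   # each observed row sums to 1
```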


Eric Xing 35

Bayesian language model

Global and local parameter independence:

[Figure: the chain X1 → X2 → X3 → ... → XT with start distribution π and transition-row parameters A_{i·}, A_{i'·}, governed by hyperparameters α and β.]

The posterior of A_{i·} and A_{i'·} is factorized despite the v-structure on X_t, because X_{t-1} acts like a multiplexer.

Assign a Dirichlet prior β_i to each row of the transition matrix:

$$ A_{ij}^{Bayes} \stackrel{\text{def}}{=} p\big( j \mid i, D, \beta_i \big) = \frac{\#(i\to j) + \beta_{i,j}}{\#(i\to\cdot) + |\beta_i|} = \lambda_i\,\frac{\beta_{i,j}}{|\beta_i|} + (1 - \lambda_i)\, A_{ij}^{ML}, \qquad \text{where } \lambda_i = \frac{|\beta_i|}{\#(i\to\cdot) + |\beta_i|} $$

We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.).

Eric Xing 36

Example: HMM: two scenarios

Supervised learning: estimation when the "right answer" is known.

Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands.
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.

Unsupervised learning: estimation when the "right answer" is unknown.

Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.

QUESTION: Update the parameters θ of the model to maximize P(x|θ): maximum likelihood (ML) estimation.


Eric Xing 37

Recall definition of HMM

[Figure: the HMM graphical model, hidden chain y1 → y2 → ... → yT with emissions x1, ..., xT and shared transition matrix A.]

Transition probabilities between any two states:

$$ p\big( y_t^j = 1 \mid y_{t-1}^i = 1 \big) = a_{i,j} $$

or

$$ p\big( y_t \mid y_{t-1}^i = 1 \big) \sim \mathrm{Multinomial}\big( a_{i,1}, a_{i,2}, \ldots, a_{i,M} \big), \quad \forall i \in I $$

Start probabilities:

$$ p(y_1) \sim \mathrm{Multinomial}\big( \pi_1, \pi_2, \ldots, \pi_M \big) $$

Emission probabilities associated with each state:

$$ p\big( x_t \mid y_t^i = 1 \big) \sim \mathrm{Multinomial}\big( b_{i,1}, b_{i,2}, \ldots, b_{i,K} \big), \quad \forall i \in I $$

or in general:

$$ p\big( x_t \mid y_t^i = 1 \big) \sim f\big( \cdot \mid \theta_i \big), \quad \forall i \in I $$

Eric Xing 38

Supervised ML estimation

Given x = x1…xN for which the true state path y = y1…yN is known,

Define:
A_ij = # times state transition i → j occurs in y
B_ik = # times state i in y emits k in x

We can show that the maximum likelihood parameters θ are:

$$ a_{ij}^{ML} = \frac{\#(i\to j)}{\#(i\to\cdot)} = \frac{\sum_n\sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n\sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}} $$

$$ b_{ik}^{ML} = \frac{\#(i\to k)}{\#(i\to\cdot)} = \frac{\sum_n\sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n\sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}} $$

What if x is continuous? We can treat $\{ (x_{n,t}, y_{n,t}) : t = 1{:}T,\ n = 1{:}N \}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian ...
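A minimal sketch of these counting formulas on toy fully observed sequences (state paths, emissions, and sizes are illustrative):

```python
import numpy as np

# Toy fully observed HMM data: state paths y and emission sequences x.
M, K = 2, 3                       # number of states, number of emission symbols
y_paths = [[0, 0, 1, 1, 0], [1, 0, 0, 1, 1]]
x_seqs  = [[2, 1, 0, 0, 2], [0, 2, 1, 0, 0]]

A = np.zeros((M, M))              # A[i, j] = # transitions i -> j in y
B = np.zeros((M, K))              # B[i, k] = # times state i emits symbol k
for ys, xs in zip(y_paths, x_seqs):
    for prev, cur in zip(ys[:-1], ys[1:]):
        A[prev, cur] += 1
    for state, sym in zip(ys, xs):
        B[state, sym] += 1

a_ml = A / A.sum(axis=1, keepdims=True)    # a_ij = A_ij / sum_j' A_ij'
b_ml = B / B.sum(axis=1, keepdims=True)    # b_ik = B_ik / sum_k' B_ik'
print(a_ml)
print(b_ml)
```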


Eric Xing 39

Supervised ML estimation, ctd.

Intuition:
When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data.

Drawback:
Given little data, there may be overfitting: P(x|θ) is maximized, but θ is unreasonable. Zero probabilities are VERY BAD.

Example:
Given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F

Then: a_FF = 1; a_FL = 0
b_F1 = b_F3 = .2; b_F2 = .3; b_F4 = 0; b_F5 = b_F6 = .1

Eric Xing 40

Pseudocounts

Solution for small training sets: add pseudocounts.

A_ij = # times state transition i → j occurs in y + R_ij
B_ik = # times state i in y emits k in x + S_ik

R_ij, S_ik are pseudocounts representing our prior belief.

Total pseudocounts: R_i = Σ_j R_ij, S_i = Σ_k S_ik, the "strength" of the prior belief, i.e., the total number of imaginary instances in the prior.

Larger total pseudocounts ⇒ stronger prior belief.

Small total pseudocounts: just enough to avoid 0 probabilities (smoothing).

This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts.