![Page 1: Probabilistic Graphical Modelsepxing/Class/10708-16/slide/lecture4... · 2016. 2. 8. · Learning Graphical Models Scenarios: completely observed GMs directed undirected partially](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff148e9e93f4910853fd37b/html5/thumbnails/1.jpg)
School of Computer Science
Probabilistic Graphical Models
Parameter Est. in fully observed BNs
Eric Xing
Lecture 4, January 25, 2016
Reading: KF-chap 17
[Figure: example graphical-model structures over X1–X4]
© Eric Xing @ CMU, 2005-2016 1
Learning Graphical Models

The goal: given a set of independent samples (assignments of random variables), find the best (e.g., the most likely) Bayesian network, both the DAG and the CPDs, e.g.:

(B,E,A,C,R) = (T,F,F,T,F)
(B,E,A,C,R) = (T,F,T,T,F)
...
(B,E,A,C,R) = (F,T,T,T,F)
[Figure: BN over nodes B, E, A, C, R, with an example CPT P(A | E, B) whose rows over the configurations of (B, E) are (0.9, 0.1), (0.2, 0.8), (0.9, 0.1), (0.01, 0.99)]
Structural learning
Parameter learning
Learning Graphical Models

Scenarios:
- completely observed GMs: directed / undirected
- partially or unobserved GMs: directed / undirected (an open research topic)

Estimation principles:
- Maximum likelihood estimation (MLE)
- Bayesian estimation
- Maximal conditional likelihood
- Maximal "margin"
- Maximum entropy

We use "learning" as a name for the process of estimating the parameters, and in some cases the topology of the network, from data.
ML Structural Learning for completely observed GMs
[Figure: example BN over E, R, B, A, C]

Data:
$(x_1^{(1)}, \ldots, x_n^{(1)})$
$(x_1^{(2)}, \ldots, x_n^{(2)})$
$\ldots$
$(x_1^{(M)}, \ldots, x_n^{(M)})$
Two "Optimal" Approaches

"Optimal" here means the employed algorithms guarantee to return a structure that maximizes the objective (e.g., log-likelihood).
- Many heuristics used to be popular, but they provide no guarantee of attaining optimality or interpretability, or even lack an explicit objective; e.g., structural EM, module networks, greedy structural search, deep learning via auto-encoders, etc.

We will learn two classes of algorithms for guaranteed structure learning, which are likely the only known methods enjoying such guarantees, but they apply only to certain families of graphs:
- Trees: the Chow-Liu algorithm (this lecture)
- Pairwise MRFs: covariance selection, neighborhood selection (later)
Structural Search

How many graphs over n nodes? $O(2^{n^2})$
How many trees over n nodes? $O(n!)$

But it turns out that we can find the exact optimal tree (under MLE)!
- Trick: the MLE score decomposes into edge-related terms
- Trick: in a tree, each node has only one parent!
- The Chow-Liu algorithm
Information-Theoretic Interpretation of ML

$$
\begin{aligned}
\ell(\theta_G; D) &= \log p(D \mid \theta_G, G) \\
&= \log \prod_n \prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i\mid\pi_i(G)}\big) \\
&= \sum_i \sum_n \log p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i\mid\pi_i(G)}\big) \\
&= M \sum_i \sum_{x_i,\,\mathbf{x}_{\pi_i(G)}} \frac{\operatorname{count}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)}{M} \log p\big(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i\mid\pi_i(G)}\big) \\
&= M \sum_i \sum_{x_i,\,\mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log p\big(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i\mid\pi_i(G)}\big)
\end{aligned}
$$

From sum over data points to sum over counts of variable states.
Information-Theoretic Interpretation of ML (cont'd)

$$
\begin{aligned}
\ell(\hat\theta_G; D) &= \log \hat{p}(D \mid \hat\theta_G, G) \\
&= M \sum_i \sum_{x_i,\,\mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log \hat{p}\big(x_i \mid \mathbf{x}_{\pi_i(G)}\big) \\
&= M \sum_i \sum_{x_i,\,\mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log \frac{\hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)}{\hat{p}(x_i)\,\hat{p}\big(\mathbf{x}_{\pi_i(G)}\big)} + M \sum_i \sum_{x_i} \hat{p}(x_i) \log \hat{p}(x_i) \\
&= M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) - M \sum_i \hat{H}(x_i)
\end{aligned}
$$

Decomposable score and a function of the graph structure.
Chow-Liu Tree Learning Algorithm

Objective function:
$$\ell(\hat\theta_G; D) = \log \hat{p}(D \mid \hat\theta_G, G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) - M \sum_i \hat{H}(x_i)
\quad\Rightarrow\quad C(G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)$$

Chow-Liu:
- For each pair of variables $x_i$ and $x_j$:
  - Compute the empirical distribution: $\hat{p}(X_i, X_j) = \dfrac{\operatorname{count}(x_i, x_j)}{M}$
  - Compute the mutual information: $\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \dfrac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\,\hat{p}(x_j)}$
- Define a graph with nodes $x_1, \ldots, x_n$
- Edge $(i, j)$ gets weight $\hat{I}(X_i, X_j)$
Chow-Liu Algorithm (cont'd)

Objective function:
$$C(G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)$$

Chow-Liu:
- Optimal tree BN: compute the maximum-weight spanning tree
- Direction in the BN: pick any node as root, then do a breadth-first search to define edge directions
- I-equivalence: different rootings of the same spanning tree yield I-equivalent BNs with the same score; e.g., for a tree over A, B, C, D, E,
$$C(G) = \hat{I}(A,B) + \hat{I}(A,C) + \hat{I}(C,D) + \hat{I}(C,E)$$
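The two steps above (pairwise empirical mutual information as edge weights, then a maximum-weight spanning tree) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code; the function names and the toy chain data in the test are hypothetical.

```python
import numpy as np
from itertools import combinations

def empirical_mi(x, y):
    """Plug-in estimate of I(X;Y) from paired discrete samples."""
    mi = 0.0
    for xv in set(x.tolist()):
        for yv in set(y.tolist()):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def chow_liu_tree(X):
    """Edges of the maximum-weight spanning tree under pairwise MI.
    X: (M, n) array of M samples over n discrete variables."""
    M, n = X.shape
    # weight every candidate edge (i, j) by empirical mutual information
    edges = sorted(((empirical_mi(X[:, i], X[:, j]), i, j)
                    for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))          # union-find for Kruskal's algorithm
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Rooting the returned tree at any node and orienting edges by BFS gives one member of the I-equivalence class.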
Structure Learning for General Graphs

Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2.

Most structure learning approaches use heuristics that exploit score decomposition. Two heuristics that exploit decomposition in different ways:
- Greedy search through the space of node orders
- Local search over graph structures
ML Parameter Est. for completely observed GMs of given structure

[Figure: two-node GM Z → X]

The data: {(z1, x1), (z2, x2), (z3, x3), ..., (zN, xN)}
Parameter Learning

Assume G is known and fixed:
- from expert design, or
- from an intermediate outcome of iterative structure learning.

Goal: estimate θ from a dataset of N independent, identically distributed (iid) training cases D = {x_1, ..., x_N}.

In general, each training case $x_n = (x_{n,1}, \ldots, x_{n,M})$ is a vector of M values, one per node; the model can be:
- completely observable, i.e., every element in $x_n$ is known (no missing values, no hidden variables), or
- partially observable, i.e., $\exists i$ s.t. $x_{n,i}$ is not observed.

In this lecture we consider learning parameters for a BN with given structure that is completely observable:
$$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big) = \sum_i \sum_n \log p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big)$$
Review of Density Estimation

- Can be viewed as single-node graphical models
- Instances of exponential family distributions
- Building blocks of general GMs
- MLE and Bayesian estimates

[GM: iid observations x1, x2, ..., xN; plate notation x_i, N]
Discrete Distributions

Bernoulli distribution: Ber(p)
$$P(x) = \begin{cases} 1 - p & \text{for } x = 0 \\ p & \text{for } x = 1 \end{cases}
\qquad\Leftrightarrow\qquad P(x) = p^x (1-p)^{1-x}$$

Multinomial distribution: Mult(1, θ)

Multinomial (indicator) variable:
$$X = [X_1, X_2, X_3, X_4, X_5, X_6], \quad X_j \in \{0, 1\}, \quad \sum_{j \in [1,\ldots,6]} X_j = 1,$$
where $X_j = 1$ w.p. $\theta_j$ and $\sum_{j \in [1,\ldots,6]} \theta_j = 1$.

$$p(x) = P(\{X_j = 1\},\ \text{where } j \text{ indexes the die face}) = \theta_j,
\qquad\text{e.g. } p(x) = \theta_A^{x_A}\,\theta_C^{x_C}\,\theta_G^{x_G}\,\theta_T^{x_T} = \prod_k \theta_k^{x_k}$$
Discrete Distributions (cont'd)

Multinomial distribution: Mult(n, θ)

Count variable:
$$n = \begin{bmatrix} n_1 \\ \vdots \\ n_K \end{bmatrix}, \qquad \text{where } \sum_k n_k = N$$

$$p(n) = \frac{N!}{n_1!\, n_2! \cdots n_K!}\, \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_K^{n_K} = \frac{N!}{n_1!\, n_2! \cdots n_K!} \prod_k \theta_k^{n_k}$$
Example: Multinomial Model

Data:
- We observe N iid die rolls (K-sided): D = {5, 1, K, ..., 3}

Representation:
- Unit basis vectors: $x_n = [x_{n,1}, x_{n,2}, \ldots, x_{n,K}]$, where $x_{n,k} \in \{0, 1\}$ and $\sum_{k=1}^K x_{n,k} = 1$

Model:
$$X_n:\quad P(\{X_{n,k} = 1\}) = \theta_k,\ k = 1, \ldots, K, \qquad \sum_k \theta_k = 1$$

How to write the likelihood of a single observation $x_n$?
$$P(x_n \mid \theta) = P\big(\{x_{n,k} = 1,\ \text{where } k \text{ indexes the die side of the } n\text{th roll}\}\big) = \theta_k = \prod_{k=1}^K \theta_k^{x_{n,k}}$$

The likelihood of dataset D = {x_1, ..., x_N}:
$$P(x_1, x_2, \ldots, x_N \mid \theta) = \prod_{n=1}^N P(x_n \mid \theta) = \prod_{n=1}^N \prod_k \theta_k^{x_{n,k}} = \prod_k \theta_k^{\sum_n x_{n,k}} = \prod_k \theta_k^{n_k}$$

[GM: iid observations x1, x2, ..., xN; plate notation x_i, N]
MLE: Constrained Optimization with Lagrange Multipliers

Objective function:
$$\ell(\theta; D) = \log P(D \mid \theta) = \log \prod_k \theta_k^{n_k} = \sum_k n_k \log \theta_k$$

We need to maximize this subject to the constraint $\sum_{k=1}^K \theta_k = 1$.

Constrained cost function with a Lagrange multiplier:
$$\tilde{\ell} = \sum_{k=1}^K n_k \log \theta_k + \lambda \Big( 1 - \sum_{k=1}^K \theta_k \Big)$$

Take derivatives w.r.t. $\theta_k$:
$$\frac{\partial \tilde{\ell}}{\partial \theta_k} = \frac{n_k}{\theta_k} - \lambda = 0
\;\Rightarrow\; \lambda = \sum_k n_k = N
\;\Rightarrow\; \theta_{k,\mathrm{MLE}} = \frac{n_k}{N}
\quad\text{or}\quad \theta_{k,\mathrm{MLE}} = \frac{1}{N} \sum_n x_{n,k}$$
Frequency as sample mean.

Sufficient statistics:
- The counts $n = (n_1, \ldots, n_K)$, with $n_k = \sum_n x_{n,k}$, are sufficient statistics of data D.
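The closed-form result above, θ_k = n_k/N, i.e. the sample frequency, is easy to check numerically. A minimal sketch in NumPy; the eight die rolls are hypothetical toy data:

```python
import numpy as np

# hypothetical observations: 8 rolls of a 6-sided die (sides 1..6)
rolls = np.array([5, 1, 6, 3, 3, 5, 2, 5])
K = 6
X = np.eye(K)[rolls - 1]        # one-hot representation x_{n,k}
n_k = X.sum(axis=0)             # sufficient statistics: the counts n_k
theta_mle = n_k / len(rolls)    # theta_k = n_k / N (frequency as sample mean)
```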
Bayesian Estimation

Dirichlet distribution (prior):
$$P(\theta) = \frac{\Gamma\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1} = C(\alpha) \prod_k \theta_k^{\alpha_k - 1}$$

Posterior distribution of θ:
$$P(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} \propto \prod_k \theta_k^{n_k} \prod_k \theta_k^{\alpha_k - 1} = \prod_k \theta_k^{\alpha_k + n_k - 1}$$

Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior. The Dirichlet parameters can be understood as pseudo-counts.

Posterior mean estimation:
$$\theta_k = \int \theta_k\, p(\theta \mid D)\, d\theta = C \int \theta_k \prod_{k'} \theta_{k'}^{\alpha_{k'} + n_{k'} - 1}\, d\theta = \frac{n_k + \alpha_k}{N + |\alpha|}$$

[GM: iid observations x1, x2, ..., xN with Dirichlet prior over θ; plate notation x_i, N]
More on the Dirichlet Prior

Where does the normalizing constant C(α) come from?
$$\int \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1}\, d\theta_1 \cdots d\theta_K = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma\big(\sum_k \alpha_k\big)} = C(\alpha)^{-1}$$
- Integration by parts; Γ(·) is the gamma function: $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$. For integers, $\Gamma(n+1) = n!$

Marginal likelihood:
$$p(x_1, \ldots, x_N \mid \alpha) = \int p(x \mid \theta)\, p(\theta \mid \alpha)\, d\theta = \frac{C(\alpha)}{C(\alpha + n)}$$

Posterior in closed form:
$$P(\theta \mid \{x_1, \ldots, x_N\}, \alpha) = \frac{p(x \mid \theta)\, p(\theta \mid \alpha)}{p(n \mid \alpha)} = C(n + \alpha) \prod_k \theta_k^{\alpha_k + n_k - 1} = \mathrm{Dir}(\alpha + n)$$

Posterior predictive rate:
$$p(x_{N+1} = i \mid \{x_1, \ldots, x_N\}, \alpha) = \int p(x_{N+1} = i \mid \theta)\, p(\theta \mid n, \alpha)\, d\theta = \frac{n_i + \alpha_i}{N + |\alpha|}$$
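The posterior predictive rate reduces to smoothed counts, (n_i + α_i)/(N + |α|). A small numeric check; the pseudo-counts and observed counts below are hypothetical:

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])   # hypothetical Dirichlet pseudo-counts
n = np.array([5, 1, 0])             # observed counts, N = 6
# p(x_{N+1} = i | D, alpha) = (n_i + alpha_i) / (N + |alpha|)
pred = (n + alpha) / (n.sum() + alpha.sum())
```

Even the never-observed third outcome gets nonzero predictive probability, which is exactly the smoothing effect of the pseudo-counts.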
Sequential Bayesian Updating

- Start with a Dirichlet prior: $P(\theta \mid \alpha) = \mathrm{Dir}(\theta : \alpha)$
- Observe N′ samples with sufficient statistics n′. The posterior becomes: $P(\theta \mid \alpha, n') = \mathrm{Dir}(\theta : \alpha + n')$
- Observe another N″ samples with sufficient statistics n″. The posterior becomes: $P(\theta \mid \alpha, n', n'') = \mathrm{Dir}(\theta : \alpha + n' + n'')$

So sequentially absorbing data in any order is equivalent to a batch update.
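Because the Dirichlet update is just addition of count vectors, the order-independence claim above can be checked directly. A sketch with hypothetical counts:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior pseudo-counts
n1 = np.array([3, 0, 1])            # sufficient statistics of first batch (N' = 4)
n2 = np.array([2, 4, 0])            # sufficient statistics of second batch (N'' = 6)

seq = (alpha + n1) + n2             # sequential absorption: Dir(alpha + n') then + n''
batch = alpha + (n1 + n2)           # single batch update: Dir(alpha + n' + n'')
posterior_mean = seq / seq.sum()    # (alpha_k + n_k) / (|alpha| + N)
```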
Hierarchical Bayesian Models

- θ are the parameters for the likelihood p(x | θ)
- α are the parameters for the prior p(θ | α)
- We can have hyper-hyper-parameters, etc.
- We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically we make the top-level hyper-parameters constants.

Where do we get the prior?
- Intelligent guesses
- Empirical Bayes (Type-II maximum likelihood): computing point estimates of α:
$$\alpha_{\mathrm{MLE}} = \arg\max_\alpha\, p(n \mid \alpha)$$
Limitation of the Dirichlet Prior

[Figure: plate model with Dirichlet prior α over θ and observations x_i, i = 1..N]
The Logistic Normal Prior

$$\theta \sim \mathrm{LN}_{K-1}(\mu, \Sigma): \qquad
\gamma \sim \mathcal{N}_{K-1}(\mu, \Sigma),\quad \gamma_K = 0,\quad
\theta_i = \frac{e^{\gamma_i}}{C(\gamma)},\quad C(\gamma) = \sum_{i=1}^K e^{\gamma_i}$$

$$\log C(\gamma) = \log \sum_{i=1}^K e^{\gamma_i}
\quad\text{— log partition function / normalization constant}$$

[GM: γ ~ N(μ, Σ) generating θ, with observations x_i, i = 1..N]

- Pro: covariance structure
- Con: non-conjugate (we will discuss how to solve this later)
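Sampling from a logistic normal is just a Gaussian draw followed by a softmax. A minimal sketch, assuming the γ_K = 0 convention above; μ and Σ below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.zeros(2)                                    # K - 1 = 2 free components
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])                      # correlated components (the "pro")
gamma = rng.multivariate_normal(mu, Sigma, size=5)  # gamma ~ N_{K-1}(mu, Sigma)
gamma = np.hstack([gamma, np.zeros((5, 1))])        # fix gamma_K = 0
# theta_i = exp(gamma_i) / C(gamma), with C(gamma) = sum_i exp(gamma_i)
theta = np.exp(gamma) / np.exp(gamma).sum(axis=1, keepdims=True)
```

Unlike a Dirichlet draw, nearby components of θ can be positively or negatively correlated through Σ.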
Logistic Normal Densities
Continuous Distributions

Uniform probability density function:
$$p(x) = \begin{cases} 1/(b - a) & \text{for } a \le x \le b \\ 0 & \text{elsewhere} \end{cases}$$

Normal (Gaussian) probability density function:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / 2\sigma^2}$$
- The distribution is symmetric, and is often illustrated as a bell-shaped curve.
- Two parameters, µ (mean) and σ (standard deviation), determine the location and shape of the distribution.
- The highest point on the normal curve is at the mean, which is also the median and mode.
- The mean can be any numerical value: negative, zero, or positive.

Multivariate Gaussian:
$$p(X; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu) \Big\}$$
MLE for a Multivariate Gaussian

It can be shown that the MLE for µ and Σ is
$$\mu_{\mathrm{MLE}} = \frac{1}{N} \sum_n x_n, \qquad
\Sigma_{\mathrm{MLE}} = \frac{1}{N} \sum_n (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^T = \frac{1}{N} S$$

where the scatter matrix is
$$S = \sum_n (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^T = \sum_n x_n x_n^T - N \mu_{\mathrm{ML}} \mu_{\mathrm{ML}}^T$$

The sufficient statistics are $\sum_n x_n$ and $\sum_n x_n x_n^T$.

With $x_n = (x_{n,1}, \ldots, x_{n,K})^T$ stacked as the rows of the data matrix $X = (x_1^T; x_2^T; \ldots; x_N^T)$, note that $X^T X = \sum_n x_n x_n^T$ may not be full rank (e.g., if N < D), in which case $\Sigma_{\mathrm{ML}}$ is not invertible.
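The two equivalent forms of Σ_ML above (centered outer products vs. the sufficient-statistics form) can be verified numerically. A minimal NumPy sketch on hypothetical synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))              # N = 500 samples, D = 3
N = len(X)
mu_ml = X.mean(axis=0)                     # mu_ML = (1/N) sum_n x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / N       # (1/N) sum_n (x_n - mu)(x_n - mu)^T
# the same estimate from the sufficient statistics sum_n x_n and sum_n x_n x_n^T
Sigma_alt = X.T @ X / N - np.outer(mu_ml, mu_ml)
```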
Bayesian Parameter Estimation for a Gaussian

There are various reasons to pursue a Bayesian approach:
- We would like to update our estimates sequentially over time.
- We may have prior knowledge about the expected magnitude of the parameters.
- The MLE for Σ may not be full rank if we don't have enough data.

We will restrict our attention to conjugate priors.

We will consider various cases, in order of increasing complexity:
- Known σ, unknown µ
- Known µ, unknown σ
- Unknown µ and σ
Bayesian Estimation: Unknown µ, Known σ

Normal prior:
$$P(\mu) = (2\pi\sigma_0^2)^{-1/2} \exp\big\{ -(\mu - \mu_0)^2 / 2\sigma_0^2 \big\}$$

Joint probability:
$$P(x, \mu) = (2\pi\sigma^2)^{-N/2} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{n=1}^N (x_n - \mu)^2 \Big\} \times (2\pi\sigma_0^2)^{-1/2} \exp\big\{ -(\mu - \mu_0)^2 / 2\sigma_0^2 \big\}$$

Posterior:
$$P(\mu \mid x) = (2\pi\tilde\sigma^2)^{-1/2} \exp\big\{ -(\mu - \tilde\mu)^2 / 2\tilde\sigma^2 \big\}$$
where
$$\tilde\mu = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0,
\qquad \tilde\sigma^2 = \Big( \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \Big)^{-1},$$
and $\bar{x} = \frac{1}{N}\sum_n x_n$ is the sample mean.

[GM: observations x1, x2, ..., xN generated from µ; plate notation x_i, N]
Bayesian Estimation: Unknown µ, Known σ (cont'd)

$$\mu_N = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0,
\qquad \frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}$$

- The posterior mean is a convex combination of the prior mean and the MLE, with weights proportional to the relative noise levels.
- The precision of the posterior, $1/\sigma_N^2$, is the precision of the prior, $1/\sigma_0^2$, plus one contribution of data precision, $1/\sigma^2$, for each observed data point.
- Sequentially updating the mean: µ∗ = 0.8 (unknown), (σ²)∗ = 0.1 (known)
- Effect of a single data point:
$$\mu_1 = \mu_0 + \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}\,(x - \mu_0)$$
- Uninformative (vague/flat) prior, $\sigma_0^2 \to \infty$: $\mu_N \to \bar{x}$
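The convex-combination and additive-precision properties above can be checked with a short computation. A sketch using the slide's µ∗ = 0.8, (σ²)∗ = 0.1 scenario; the prior settings and sample size are hypothetical:

```python
import numpy as np

sigma2 = 0.1                                # known observation variance
mu0, sigma0_2 = 0.0, 1.0                    # prior N(mu0, sigma0^2)
rng = np.random.default_rng(2)
x = rng.normal(0.8, np.sqrt(sigma2), size=50)
N, xbar = len(x), x.mean()

prec_N = N / sigma2 + 1.0 / sigma0_2        # posterior precision = prior + N data terms
mu_N = (N / sigma2 * xbar + mu0 / sigma0_2) / prec_N
sigma_N2 = 1.0 / prec_N
```

The posterior mean always lies between the prior mean and the sample mean, and the posterior variance is strictly smaller than both σ²/N and σ₀².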
Other Scenarios

- Known µ, unknown λ = 1/σ²:
  - The conjugate prior for λ is a Gamma with shape a₀ and rate (inverse scale) b₀
  - The conjugate prior for σ² is an Inverse-Gamma
- Unknown µ and unknown σ²:
  - The conjugate prior is Normal-Inverse-Gamma
  - A semi-conjugate prior is also possible
- Multivariate case:
  - The conjugate prior is Normal-Inverse-Wishart

© Eric Xing @ CMU, 2005-2016 31
Estimation of Conditional Density

- Can be viewed as two-node graphical models
- Instances of GLIM
- Building blocks of general GMs
- MLE and Bayesian estimates

See supplementary slides.

[Figure: two-node GMs over Q and X]
MLE for General BNs

If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big) = \sum_i \sum_n \log p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big)$$

[Figure: decomposition illustrated with local data slices, e.g. conditioning on X2 = 0/1 and X5 = 0/1]
Plates

A plate is a "macro" that allows subgraphs to be replicated.

For iid (exchangeable) data, the likelihood is
$$p(D \mid \theta) = \prod_n p(x_n \mid \theta)$$

We can represent this as a Bayes net with N nodes. The rules of plates are simple:
- Repeat every structure in a box a number of times given by the integer in the corner of the box (e.g., N), updating the plate index variable (e.g., n) as you go.
- Duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.

[Figure: θ → X1, X2, ..., XN drawn explicitly, and the equivalent plate θ → Xn with plate index N]
Decomposable Likelihood of a BN

Consider the distribution defined by the directed acyclic GM:
$$p(x \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$$

This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the BN over X1, ..., X4 decomposed into four (node, parents) fragments]
MLE for BNs with Tabular CPDs

Assume each CPD is represented as a table (multinomial):
$$\theta_{ijk} \stackrel{\text{def}}{=} p\big(X_i = j \mid X_{\pi_i} = k\big)$$
Note that in the case of multiple parents, $X_{\pi_i}$ will have a composite state, and the CPD will be a high-dimensional table.

The sufficient statistics are counts of family configurations:
$$n_{ijk} = \sum_n x_{n,i}^j\, x_{n,\pi_i}^k$$

The log-likelihood is
$$\ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}$$

Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get:
$$\theta_{ijk}^{\mathrm{ML}} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$$
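For one family (a node and its parent), the count-and-normalize recipe above amounts to a few lines of NumPy. A minimal sketch; the parent/child data are hypothetical:

```python
import numpy as np

# hypothetical fully observed data for one family: parent X_pa -> child X_i
pa    = np.array([0, 0, 1, 1, 1, 0, 1, 0])   # parent configurations k
child = np.array([1, 0, 1, 1, 0, 0, 1, 1])   # child states j
K_child, K_pa = 2, 2

n = np.zeros((K_child, K_pa))
for j, k in zip(child, pa):
    n[j, k] += 1                             # family-configuration counts n_jk

# theta_jk = n_jk / sum_j' n_j'k: each column (parent config) is a distribution
theta = n / n.sum(axis=0, keepdims=True)
```

In a full BN, the same computation is repeated independently for every node, thanks to the decomposition of the log-likelihood.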
How to Define a Parameter Prior?

[Figure: BN with nodes Earthquake, Radio, Burglary, Alarm, Call]

Factorization:
$$p(\mathbf{x} \mid \theta) = \prod_{i=1}^M p\big(x_i \mid \mathbf{x}_{\pi_i}, \theta_i\big)$$

Local distributions defined by, e.g., multinomial parameters:
$$p\big(x_i^k \mid \mathbf{x}_{\pi_i}^j\big) = \theta_{x_i^k \mid \mathbf{x}_{\pi_i}^j}$$

How to define the parameter prior $p(\theta \mid G)$?

Assumptions (Geiger & Heckerman 97, 99):
- Complete model equivalence
- Global parameter independence
- Local parameter independence
- Likelihood and prior modularity
Global & Local Parameter Independence

Global parameter independence: for every DAG model,
$$p(\theta_m \mid G) = \prod_{i=1}^M p(\theta_i \mid G)$$

Local parameter independence: for every node,
$$p(\theta_i \mid G) = \prod_{j=1}^{q_i} p\big(\theta_{x_i \mid \mathbf{x}_{\pi_i}^j} \mid G\big)$$

Example: $P(\text{Call} \mid \text{Alarm} = \text{YES})$ is independent of $P(\text{Call} \mid \text{Alarm} = \text{NO})$.

[Figure: BN with nodes Earthquake, Radio, Burglary, Alarm, Call]
Parameter Independence, Graphical View

Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!

[Figure: samples 1 and 2 of (X1, X2) with parameters θ1 and θ2|1; global parameter independence across CPDs, local parameter independence within a CPD]
Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97, 99)

Discrete DAG models: $x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Multi}(\theta)$, with Dirichlet prior:
$$P(\theta) = \frac{\Gamma\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1} = C(\alpha) \prod_k \theta_k^{\alpha_k - 1}$$

Gaussian DAG models: $x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Normal}(\mu, \Sigma)$, with normal prior:
$$p(\mu \mid \nu, \Psi) = (2\pi)^{-n/2} |\Psi|^{-1/2} \exp\big\{ -\tfrac{1}{2} (\mu - \nu)' \Psi^{-1} (\mu - \nu) \big\}$$

Normal-Wishart prior:
$$p(\mu \mid W, \alpha_\mu, \nu) = \mathrm{Normal}\big(\nu, (\alpha_\mu W)^{-1}\big),$$
$$p(W \mid \alpha_w, T) = c(n, \alpha_w)\, |T|^{\alpha_w/2}\, |W|^{(\alpha_w - n - 1)/2} \exp\big\{ -\tfrac{1}{2} \operatorname{tr}(T W) \big\},
\qquad \text{where } W = \Sigma^{-1}.$$
Parameter Sharing

Consider a time-invariant (stationary) 1st-order Markov model:

[Figure: chain X1 → X2 → X3 → ... → XT with shared transition matrix A]

- Initial state probability vector: $\pi_k \stackrel{\text{def}}{=} p(X_1^k = 1)$
- State transition probability matrix: $A_{ij} \stackrel{\text{def}}{=} p\big(X_t^j = 1 \mid X_{t-1}^i = 1\big)$
- The joint:
$$p(X_{1:T} \mid \theta) = p(x_1 \mid \pi) \prod_{t=2}^T p(X_t \mid X_{t-1})$$
- The log-likelihood:
$$\ell(\theta; D) = \sum_n \log p(x_{n,1} \mid \pi) + \sum_n \sum_{t=2}^T \log p(x_{n,t} \mid x_{n,t-1}, A)$$

Again, we optimize each parameter separately:
- π is a multinomial frequency vector, and we've seen it before
- What about A?
Learning a Markov Chain Transition Matrix

- A is a stochastic matrix: $\sum_j A_{ij} = 1$
- Each row of A is a multinomial distribution, so the MLE of $A_{ij}$ is the fraction of transitions from i to j:
$$A_{ij}^{\mathrm{ML}} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^T x_{n,t-1}^i\, x_{n,t}^j}{\sum_n \sum_{t=2}^T x_{n,t-1}^i}$$

Application: if the states $X_t$ represent words, this is called a bigram language model.

Sparse data problem:
- If $i \to j$ did not occur in the data, we will have $A_{ij} = 0$; then any future sequence with word pair $i \to j$ will have zero probability.
- A standard hack is backoff smoothing or deleted interpolation with a smoothing distribution η:
$$\tilde{A}_{i\cdot} = \lambda \eta + (1 - \lambda) A_{i\cdot}^{\mathrm{ML}}$$
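The count-normalize-interpolate pipeline above fits in a few lines. A minimal sketch; the state sequence, the interpolation weight λ, and the uniform smoothing distribution η are hypothetical choices:

```python
import numpy as np

states = [0, 1, 1, 2, 0, 1, 2, 2, 0]           # one observed chain (hypothetical)
K = 3
counts = np.zeros((K, K))
for i, j in zip(states[:-1], states[1:]):
    counts[i, j] += 1                           # #(i -> j)

A_ml = counts / counts.sum(axis=1, keepdims=True)   # row-wise multinomial MLE

lam, eta = 0.1, np.full(K, 1.0 / K)             # deleted-interpolation style backoff
A_smooth = lam * eta + (1 - lam) * A_ml         # A~ = lam*eta + (1-lam)*A_ML
```

After smoothing, every row still sums to one, and no transition has probability exactly zero.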
Bayesian Language Model

Global and local parameter independence:

[Figure: chain X1 → X2 → X3 → ... → XT with independent rows A_i·, A_i′· of the transition matrix]

The posterior of $A_{i\cdot}$ and $A_{i'\cdot}$ is factorized despite the v-structure on $X_t$, because $X_{t-1}$ acts like a multiplexer.

Assign a Dirichlet prior $\beta_i$ to each row of the transition matrix:
$$A_{ij}^{\mathrm{Bayes}} \stackrel{\text{def}}{=} p(j \mid i, D, \beta_i)
= \frac{\#(i \to j) + \beta_{i,j}}{\#(i \to \cdot) + |\beta_i|}
= \lambda_i \frac{\beta_{i,j}}{|\beta_i|} + (1 - \lambda_i) A_{ij}^{\mathrm{ML}},
\qquad \text{where } \lambda_i = \frac{|\beta_i|}{\#(i \to \cdot) + |\beta_i|}$$

We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.).
Example: HMM — Two Scenarios

Supervised learning: estimation when the "right answer" is known. Examples:
- GIVEN: a genomic region x = x1...x1,000,000 where we have good (experimental) annotations of the CpG islands
- GIVEN: the casino player allows us to observe him one evening as he changes dice and produces 10,000 rolls

Unsupervised learning: estimation when the "right answer" is unknown. Examples:
- GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
- GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

QUESTION: update the parameters θ of the model to maximize P(x | θ) — maximum likelihood (ML) estimation
Recall the Definition of an HMM

[Figure: HMM with hidden states y1, y2, ..., yT, emissions x1, x2, ..., xT, and shared transition matrix A]

Transition probabilities between any two states:
$$p\big(y_t^j = 1 \mid y_{t-1}^i = 1\big) = a_{i,j},$$
or
$$p\big(y_t \mid y_{t-1}^i = 1\big) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in I.$$

Start probabilities:
$$p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M).$$

Emission probabilities associated with each state:
$$p\big(x_t \mid y_t^i = 1\big) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in I,$$
or in general:
$$p\big(x_t \mid y_t^i = 1\big) \sim f(\cdot \mid \theta_i),\ \forall i \in I.$$
Supervised ML Estimation

Given x = x1...xN for which the true state path y = y1...yN is known, define:
- $A_{ij}$ = # times state transition i→j occurs in y
- $B_{ik}$ = # times state i in y emits k in x

We can show that the maximum likelihood parameters are (with data $x_{n,t}, y_{n,t}$, $t = 1{:}T$, $n = 1{:}N$):
$$a_{ij}^{\mathrm{ML}} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^T y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^T y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$$
$$b_{ik}^{\mathrm{ML}} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^T y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^T y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$$

What if x is continuous? We can treat $\{(x_{n,t}, y_{n,t})\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian.
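With the state path observed, both estimators above are count-and-normalize, one multinomial per row. A minimal sketch on a single hypothetical labeled sequence:

```python
import numpy as np

# hypothetical fully labeled HMM data: hidden states y and emissions x
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])
x = np.array([2, 0, 1, 1, 2, 0, 1, 2])
K_y, K_x = 2, 3

A = np.zeros((K_y, K_y))
B = np.zeros((K_y, K_x))
for i, j in zip(y[:-1], y[1:]):
    A[i, j] += 1                            # transition counts A_ij
for i, k in zip(y, x):
    B[i, k] += 1                            # emission counts B_ik

a_ml = A / A.sum(axis=1, keepdims=True)     # a_ij = A_ij / sum_j' A_ij'
b_ml = B / B.sum(axis=1, keepdims=True)     # b_ik = B_ik / sum_k' B_ik'
```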
Supervised ML estimation, ctd.
 Intuition:
   When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data
 Drawback: Given little data, there may be overfitting:
   P(x|θ) is maximized, but θ is unreasonable
   0 probabilities – VERY BAD
 Example: Given 10 casino rolls, we observe
   x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
   y = F, F, F, F, F, F, F, F, F, F
 Then: a_FF = 1; a_FL = 0
   b_F1 = b_F3 = b_F6 = .2; b_F2 = .3; b_F4 = 0; b_F5 = .1
Pseudocounts
 Solution for small training sets: add pseudocounts
   A_ij = # times state transition i→j occurs in y + R_ij
   B_ik = # times state i in y emits k in x + S_ik
 R_ij, S_ik are pseudocounts representing our prior belief
 Total pseudocounts: R_i = Σ_j R_ij, S_i = Σ_k S_ik
 --- the "strength" of the prior belief, i.e., the total number of imaginary instances in the prior
 Larger total pseudocounts ⇒ stronger prior belief
 Small total pseudocounts: just enough to avoid 0 probabilities --- smoothing
 This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts
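A minimal sketch of pseudocount smoothing, assuming uniform pseudocounts R_ij = S_ik = 1 (Laplace smoothing); the counts below replay the overfit casino example, and the specific values are illustrative:

```python
# Smoothed ML estimates from raw counts plus pseudocounts.
import numpy as np

def smoothed_mle(A, B, R=1.0, S=1.0):
    A = np.asarray(A, float) + R     # transition counts + pseudocounts R_ij
    B = np.asarray(B, float) + S     # emission counts + pseudocounts S_ik
    return (A / A.sum(axis=1, keepdims=True),
            B / B.sum(axis=1, keepdims=True))

# The overfit casino example: 10 Fair rolls, no Loaded data at all.
A = [[9, 0], [0, 0]]                        # only F->F transitions observed
B = [[2, 3, 2, 0, 1, 2], [0, 0, 0, 0, 0, 0]]
a, b = smoothed_mle(A, B)
# a_FL is no longer 0, and the never-seen Loaded row falls back to uniform.
```

With small pseudocounts the estimates stay close to the raw frequencies but no event is assigned probability 0, exactly as the slide describes.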
Summary: Learning GMs
 For fully observed BNs, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored
 Structural learning: Chow-Liu; neighborhood selection
 Learning single-node GMs – density estimation: exponential family dist.; typical discrete distributions; typical continuous distributions; conjugate priors
 Learning two-node BNs: GLIMs; conditional density est.; classification
 Learning BNs with more nodes: local operations
Supplemental review:
Classification
 Generative and discriminative approaches
 (GM: two-node models, Q → X vs. Q ← X)
 Linear/logistic regression
 Two-node fully observed BNs
 Conditional mixtures
Classification:
 Goal: wish to learn f: X → Y
 Generative: modeling the joint distribution of all data
 Discriminative: modeling only the points at the boundary
Conditional Gaussian
 The data: {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_N, y_N)}
 Both nodes are observed:
   Y is a class indicator vector: p(y_n) = multi(y_n : π) = Π_k (π_k)^{y_n^k}
   X is a conditional Gaussian variable with a class-specific mean:
     p(x_n | y_n^k = 1, μ, σ) = (2πσ²)^{-1/2} exp( −(x_n − μ_k)² / (2σ²) )
   ⇒ p(x, y | μ, σ, π) = Π_n Π_k [ π_k N(x_n : μ_k, σ) ]^{y_n^k}
 (GM: Y_i → X_i, plate over N)
MLE of conditional Gaussian
 Data log-likelihood:
   l(θ; D) = log Π_n p(x_n, y_n) = log Π_n p(y_n | π) + log Π_n p(x_n | y_n, μ, σ)
 MLE:
   π_k^MLE = argmax_π l(θ; D)  ⇒  π_k^MLE = Σ_n y_n^k / N = n_k / N
   --- the fraction of samples of class k
   μ_k^MLE = argmax_μ l(θ; D)  ⇒  μ_k^MLE = Σ_n y_n^k x_n / Σ_n y_n^k
   --- the average of samples of class k
 (GM: Y_i → X_i, plate over N)
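The two MLEs above reduce to class fractions and class-conditional means. A minimal sketch with made-up scalar data (K = 2 classes):

```python
# Conditional-Gaussian MLEs: pi_k = class fraction, mu_k = class mean.
import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.0, 5.2])
y = np.array([0, 0, 0, 1, 1])     # class labels (indicator form y_n^k)
K = 2
N = len(x)

pi_mle = np.array([(y == k).sum() / N for k in range(K)])  # n_k / N
mu_mle = np.array([x[y == k].mean() for k in range(K)])    # class averages
# pi_mle ≈ [0.6, 0.4]; mu_mle ≈ [1.0, 5.1]
```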
Bayesian estimation of conditional Gaussian
 Prior:
   P(π | α) = Dir(π : α)
   P(μ_k | ν, τ) = Normal(μ_k : ν, τ), ∀k
 Posterior mean (Bayesian est.):
   π_k^Bayes = (Σ_n y_n^k + α_k) / (N + Σ_{k'} α_{k'})
   μ_k^Bayes = [ (N_k/σ²) / (N_k/σ² + 1/τ²) ] μ_k^ML + [ (1/τ²) / (N_k/σ² + 1/τ²) ] ν,  where N_k = Σ_n y_n^k
 (GM: Y_i → X_i, plate over N)
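A sketch of the two posterior-mean formulas, reusing the toy data from the MLE example; the hyperparameter values (α, ν, τ²) and the known variance σ² are illustrative choices, not from the slides:

```python
# Posterior means under the conjugate Dirichlet and Normal priors.
import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.0, 5.2])
y = np.array([0, 0, 0, 1, 1])
K, N = 2, len(x)
sigma2 = 1.0                 # known observation variance sigma^2
alpha = np.ones(K)           # Dir(1, ..., 1) prior on pi
nu, tau2 = 0.0, 10.0         # Normal(nu, tau^2) prior on each mu_k

Nk = np.array([(y == k).sum() for k in range(K)])
mu_ml = np.array([x[y == k].mean() for k in range(K)])

# pi_k^Bayes = (N_k + alpha_k) / (N + sum_k' alpha_k')
pi_bayes = (Nk + alpha) / (N + alpha.sum())
# mu_k^Bayes shrinks the ML mean toward the prior mean nu
w = (Nk / sigma2) / (Nk / sigma2 + 1.0 / tau2)
mu_bayes = w * mu_ml + (1 - w) * nu
```

The shrinkage weight w approaches 1 as N_k grows, so the Bayesian estimate recovers the MLE in the large-sample limit.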
Classification: Gaussian discriminative analysis
 The joint probability of a datum and its label is:
   p(y_n^k = 1, x_n | μ, σ) = p(y_n^k = 1) p(x_n | y_n^k = 1, μ, σ)
                            = π_k (2πσ²)^{-1/2} exp( −(x_n − μ_k)² / (2σ²) )
 Given a datum x_n, we predict its label using the conditional probability of the label given the datum:
   p(y_n^k = 1 | x_n, μ, σ) = π_k (2πσ²)^{-1/2} exp( −(x_n − μ_k)² / (2σ²) ) / Σ_{k'} π_{k'} (2πσ²)^{-1/2} exp( −(x_n − μ_{k'})² / (2σ²) )
 This is basic inference: introduce evidence, and then normalize
 (GM: Y_i → X_i)
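The "introduce evidence, then normalize" step is one line of code. A sketch using illustrative parameters (e.g., the MLEs fitted earlier):

```python
# Class posterior under the conditional-Gaussian model.
import numpy as np

pi = np.array([0.6, 0.4])      # illustrative class priors
mu = np.array([1.0, 5.1])      # illustrative class means
sigma2 = 1.0

def class_posterior(x_n):
    # joint p(y^k=1, x_n), dropping the shared (2*pi*sigma^2)^(-1/2)
    # factor, which cancels in the normalization
    joint = pi * np.exp(-(x_n - mu) ** 2 / (2 * sigma2))
    return joint / joint.sum()  # normalize over classes

p = class_posterior(0.9)
# p[0] is close to 1: x = 0.9 is far more likely under class 0
```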
Transductive classification
 Given x_n, what is its corresponding y_n, when we know the answer for a set of training data?
 Frequentist prediction: we fit π̂, μ̂, and σ̂ from the data first, and then plug in:
   p(y_n^k = 1 | x_n, π̂, μ̂, σ̂) = π̂_k N(x_n : μ̂_k, σ̂) / Σ_{k'} π̂_{k'} N(x_n : μ̂_{k'}, σ̂)
 Bayesian: we compute the posterior dist. of the parameters first …
 (GM: training pairs Y_i → X_i in a plate over N, plus the query pair Y_n → X_n)
Linear regression
 The data: {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_N, y_N)}
 Both nodes are observed:
   X is an input vector
   Y is a response vector
   (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator)
 A regression scheme can be used to model p(y|x) directly, rather than p(x, y)
 (GM: X_i → Y_i, plate over N)
A discriminative probabilistic model
 Let us assume that the target variable and the inputs are related by the equation:
   y_i = θ^T x_i + ε_i
 where ε_i is an error term of unmodeled effects or random noise
 Now assume that ε_i follows a Gaussian N(0, σ); then we have:
   p(y_i | x_i; θ) = 1/(√(2π) σ) · exp( −(y_i − θ^T x_i)² / (2σ²) )
 By the independence assumption:
   L(θ) = Π_{i=1}^n p(y_i | x_i; θ) = (1/(√(2π) σ))^n · exp( −(1/(2σ²)) Σ_{i=1}^n (y_i − θ^T x_i)² )
Linear regression
 Hence the log-likelihood is:
   l(θ) = n log( 1/(√(2π) σ) ) − (1/σ²) · (1/2) Σ_{i=1}^n (y_i − θ^T x_i)²
 Do you recognize the last term?
 Yes, it is:
   J(θ) = (1/2) Σ_{i=1}^n (y_i − θ^T x_i)²
 It is the same as the MSE (up to a constant factor)! Maximizing the likelihood is minimizing the squared error.
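A quick sketch of this equivalence: the normal-equation solution (derived next) minimizes J(θ), and on synthetic noiseless data (made up here) it recovers the generating θ exactly, driving J(θ) to 0:

```python
# Check that minimizing J(theta) = 0.5 * sum (y_i - theta^T x_i)^2
# via the normal equations recovers theta on noiseless synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # 50 inputs, 3 features
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true                        # noiseless, so theta is exact

# theta* = (X^T X)^{-1} X^T y, solved without forming the inverse
theta_star = np.linalg.solve(X.T @ X, X.T @ y)

J = 0.5 * ((y - X @ theta_star) ** 2).sum()  # least-squares cost at theta*
```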
A recap:
 LMS update rule:
   θ^{t+1} = θ^t + α (y_n − (θ^t)^T x_n) x_n
   Pros: on-line, low per-step cost
   Cons: stochastic, maybe slow-converging
 Steepest descent:
   θ^{t+1} = θ^t + α Σ_{i=1}^n (y_i − (θ^t)^T x_i) x_i
   Pros: fast-converging, easy to implement
   Cons: a batch algorithm
 Normal equations:
   θ* = (X^T X)^{−1} X^T y
   Pros: a single-shot algorithm! Easiest to implement.
   Cons: need to compute the pseudo-inverse (X^T X)^{−1}; expensive; numerical issues (e.g., the matrix is singular …)
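The on-line vs. single-shot trade-off can be sketched by running the LMS rule against the normal-equation solution on the same synthetic data; the step size α and the data are illustrative choices:

```python
# LMS (stochastic) updates converging toward the normal-equation solution.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.7])                    # noiseless synthetic data

theta_star = np.linalg.solve(X.T @ X, X.T @ y)   # single-shot solution

theta = np.zeros(2)
alpha = 0.05                                     # step size (illustrative)
for _ in range(20):                              # a few passes over the data
    for x_n, y_n in zip(X, y):
        theta += alpha * (y_n - theta @ x_n) * x_n   # LMS update

# theta is now close to theta_star
```

Each LMS step costs O(d) per example, while the normal equations cost O(nd² + d³) once; this is exactly the pros/cons trade-off listed above.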
Bayesian linear regression
Classification: generative and discriminative approaches (Q → X vs. Q ← X)
Regression: linear, conditional mixture, nonparametric (X → Y)
Density estimation: parametric and nonparametric methods (single-node X)
Simple GMs are the building blocks of complex BNs