# Learning Graphical Models


• Machine Learning

Learning Graphical Models

Eric Xing

Lecture 12, August 14, 2010

• Inference and Learning

 A BN M describes a unique probability distribution P

 Task 1: How do we answer queries about P?

 We use inference as a name for the process of computing answers to such queries

 So far we have learned several algorithms for exact and approximate inference

 Task 2: How do we estimate a plausible model M from data D?

 i. We use learning as a name for the process of obtaining a point estimate of M.

 ii. But Bayesians seek p(M | D), which is actually an inference problem.

 iii. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.

• Learning Graphical Models

 The goal:

 Given a set of independent samples (assignments of random variables), find the best (the most likely?) graphical model (both the graph and the CPDs)

 [Figure: a five-node network over (B, E, A, C, R) together with a CPT for P(A | E, B); the visible entries include 0.9/0.1, 0.2/0.8, and 0.01/0.99]

 Example samples: (B,E,A,C,R) = (T,F,F,T,F), (B,E,A,C,R) = (T,F,T,T,F), …, (B,E,A,C,R) = (F,T,T,T,F)

• Learning Graphical Models

 Scenarios:

 completely observed GMs

 directed
 undirected

 partially observed GMs

 directed
 undirected (an open research topic)

 Estimation principles:

 Maximal likelihood estimation (MLE)
 Bayesian estimation
 Maximal conditional likelihood
 Maximal "Margin"

 We use learning as a name for the process of estimating the parameters, and in some cases, the topology of the network, from data.

• ML Parameter Estimation for completely observed GMs of given structure

 The model: $Z \rightarrow X$

 The data: $\{(z^{(1)}, x^{(1)}),\ (z^{(2)}, x^{(2)}),\ (z^{(3)}, x^{(3)}),\ \ldots,\ (z^{(N)}, x^{(N)})\}$

• The basic idea underlying MLE

 Likelihood (for now let's assume that the structure is given), for the network $X_1 \rightarrow X_3 \leftarrow X_2$:

 $L(\theta \mid X) = p(X \mid \theta) = p(X_1 \mid \theta_1)\, p(X_2 \mid \theta_2)\, p(X_3 \mid X_1, X_2, \theta_3)$

 Log-likelihood:

 $\ell(\theta \mid X) = \log p(X_1 \mid \theta_1) + \log p(X_2 \mid \theta_2) + \log p(X_3 \mid X_1, X_2, \theta_3)$

 Data log-likelihood:

 $\ell(\theta \mid \mathrm{DATA}) = \sum_n \log p(X_n \mid \theta) = \sum_n \log p(X_{n,1} \mid \theta_1) + \sum_n \log p(X_{n,2} \mid \theta_2) + \sum_n \log p(X_{n,3} \mid X_{n,1}, X_{n,2}, \theta_3)$

 MLE:

 $\{\theta_1^*, \theta_2^*, \theta_3^*\} = \arg\max\ \ell(\theta \mid \mathrm{DATA})$

 $\theta_1^* = \arg\max \sum_n \log p(X_{n,1} \mid \theta_1), \quad \theta_2^* = \arg\max \sum_n \log p(X_{n,2} \mid \theta_2), \quad \theta_3^* = \arg\max \sum_n \log p(X_{n,3} \mid X_{n,1}, X_{n,2}, \theta_3)$
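Because the data log-likelihood decomposes into one term per CPD, each $\theta_i$ can be fit separately. The sketch below is my own illustration (not from the slides): it assumes the $X_1 \rightarrow X_3 \leftarrow X_2$ network with binary variables and estimates each CPD by normalized counts.

```python
import numpy as np

# Minimal sketch (assumption: binary X1, X2, X3 and the network X1 -> X3 <- X2).
# Each CPD is estimated independently by counting, which is exactly the
# per-term maximization of the decomposed data log-likelihood.
def mle_fully_observed(data):
    data = np.asarray(data)                        # shape (N, 3), entries in {0, 1}
    theta1 = data[:, 0].mean()                     # P(X1 = 1)
    theta2 = data[:, 1].mean()                     # P(X2 = 1)
    theta3 = np.zeros((2, 2))                      # theta3[a, b] = P(X3 = 1 | X1 = a, X2 = b)
    for a in (0, 1):
        for b in (0, 1):
            rows = data[(data[:, 0] == a) & (data[:, 1] == b)]
            if len(rows) > 0:                      # leave 0 if the configuration never occurs
                theta3[a, b] = rows[:, 2].mean()
    return theta1, theta2, theta3

# Example: four fully observed samples of (X1, X2, X3)
print(mle_fully_observed([(1, 0, 1), (1, 0, 0), (0, 1, 1), (0, 0, 0)]))
```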

• Example 1: conditional Gaussian

 The completely observed model: $Z \rightarrow X$

 $Z$ is a class indicator vector:

 $Z = [Z^1, \ldots, Z^M]^\top$, where $Z^m \in \{0, 1\}$ and $\sum_m Z^m = 1$

 $p(z) = \prod_{m=1}^{M} (\pi_m)^{z^m}$ (all except one of these factors will be one), and a datum is in class $m$ w.p. $\pi_m$

 $X$ is a conditional Gaussian variable with a class-specific mean:

 $p(x \mid z^m = 1, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu_m)^2}{2\sigma^2} \right)$

 i.e., $p(x \mid z, \mu, \sigma) = \prod_m \mathcal{N}(x;\, \mu_m, \sigma)^{z^m}$
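As a concrete check of this generative story, here is a small Python sketch (my own illustration; the function name and arguments are my own choices): a class is drawn from the multinomial $\pi$, and $x$ is then drawn from the Gaussian with that class's mean.

```python
import numpy as np

# Minimal sketch of sampling from the Z -> X model: pi is the multinomial
# parameter over M classes, mu holds the class-specific means, sigma is shared.
def sample_conditional_gaussian(pi, mu, sigma, N, seed=0):
    rng = np.random.default_rng(seed)
    pi, mu = np.asarray(pi), np.asarray(mu)
    z = rng.choice(len(pi), size=N, p=pi)        # class indicator, stored as an index
    x = rng.normal(mu[z], sigma)                 # x | z = m  ~  N(mu_m, sigma^2)
    return z, x

z, x = sample_conditional_gaussian(pi=[0.3, 0.7], mu=[-2.0, 2.0], sigma=1.0, N=5)
```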

• Example 1: conditional Gaussian (cont.)

 Data log-likelihood:

 $\ell(\theta \mid D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)$
 $= \sum_n \log p(z_n \mid \pi) + \sum_n \log p(x_n \mid z_n, \mu, \sigma)$
 $= \sum_n \sum_m z_n^m \log \pi_m \;-\; \sum_n \sum_m z_n^m \frac{1}{2\sigma^2}(x_n - \mu_m)^2 + C$

 MLE:

 $\pi_m^* = \arg\max\ \ell(\theta \mid D)$, s.t. $\sum_m \pi_m = 1$, $\forall m$ $\;\Rightarrow\; \frac{\partial \ell(\theta \mid D)}{\partial \pi_m} = 0 \;\Rightarrow\; \pi_{m,\mathrm{MLE}}^* = \frac{\sum_n z_n^m}{N} = \frac{n_m}{N}$, the fraction of samples of class $m$

 $\mu_m^* = \arg\max\ \ell(\theta \mid D) \;\Rightarrow\; \mu_{m,\mathrm{MLE}}^* = \frac{\sum_n z_n^m x_n}{\sum_n z_n^m}$, the average of samples of class $m$

• Example 2: HMM: two scenarios

 Supervised learning: estimation when the "right answer" is known

 Examples:
 GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
 GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

 Unsupervised learning: estimation when the "right answer" is unknown

 Examples:
 GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
 GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

 QUESTION: Update the parameters θ of the model to maximize P(x | θ) --- Maximal likelihood (ML) estimation

• Recall definition of HMM

 [Figure: HMM graphical model with hidden states y1, y2, y3, …, yT and emissions x1, x2, x3, …, xT]

 Transition probabilities between any two states:

 $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$, or $p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in I$

 Start probabilities:

 $p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$

 Emission probabilities associated with each state:

 $p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in I$

 or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in I$
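To make this parameterization concrete, here is a short Python sketch (my own illustration; the helper name sample_hmm is hypothetical) that stores $\pi$, $A = (a_{i,j})$, and $B = (b_{i,k})$ as arrays and draws one state/observation sequence of length T.

```python
import numpy as np

# Minimal sketch: pi (start probs), A (transition matrix a_ij), B (discrete
# emission matrix b_ik). Returns one hidden state path y and observation path x.
def sample_hmm(pi, A, B, T, seed=0):
    rng = np.random.default_rng(seed)
    pi, A, B = np.asarray(pi), np.asarray(A), np.asarray(B)
    y = np.empty(T, dtype=int)
    x = np.empty(T, dtype=int)
    y[0] = rng.choice(len(pi), p=pi)                   # y_1 ~ Multinomial(pi)
    x[0] = rng.choice(B.shape[1], p=B[y[0]])           # x_1 | y_1 ~ Multinomial(b_{y_1, .})
    for t in range(1, T):
        y[t] = rng.choice(A.shape[1], p=A[y[t - 1]])   # y_t | y_{t-1} ~ Multinomial(a_{y_{t-1}, .})
        x[t] = rng.choice(B.shape[1], p=B[y[t]])       # x_t | y_t ~ Multinomial(b_{y_t, .})
    return y, x
```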

• Supervised ML estimation

 Given x = x1…xN for which the true state path y = y1…yN is known,

 Define:
 Aij = # times state transition i→j occurs in y
 Bik = # times state i in y emits k in x

 $\ell(\theta;\, \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \sum_n \left( \log p(y_{n,1}) + \sum_{t=2}^{T} \log p(y_{n,t} \mid y_{n,t-1}) + \sum_{t=1}^{T} \log p(x_{n,t} \mid y_{n,t}) \right)$

 We can show that the maximum likelihood parameters θ are:

 $a_{ij}^{\mathrm{ML}} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$

 $b_{ik}^{\mathrm{ML}} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$

 If y is continuous, we can treat $\{(x_{n,t}, y_{n,t}) : t = 1 \ldots T,\ n = 1 \ldots N\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for Gaussians …
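These estimators translate directly into counting code. The sketch below is my own illustration (not from the slides), assuming integer-coded states and observation symbols and that every state occurs at least once in the labeled data.

```python
import numpy as np

# Minimal sketch of supervised ML estimation for a discrete HMM: count
# transitions A_ij and emissions B_ik from fully labeled sequences, then
# normalize each row (assumes every state appears at least once).
def supervised_hmm_mle(sequences, M, K):
    A = np.zeros((M, M))        # A[i, j] = # times transition i -> j occurs in y
    B = np.zeros((M, K))        # B[i, k] = # times state i emits symbol k in x
    for y, x in sequences:      # each element is a (state path, observation path) pair
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1
    a_ml = A / A.sum(axis=1, keepdims=True)    # a_ij = A_ij / sum_j' A_ij'
    b_ml = B / B.sum(axis=1, keepdims=True)    # b_ik = B_ik / sum_k' B_ik'
    return a_ml, b_ml

# Example: two labeled sequences over M = 2 states and K = 3 symbols
seqs = [([0, 0, 1, 1], [2, 0, 1, 1]), ([1, 0, 0, 1], [1, 2, 0, 1])]
print(supervised_hmm_mle(seqs, M=2, K=3))
```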

• Supervised ML estimation, ctd.

 Intuition:
 When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data

 Drawback:
 Given little data, there may be overfitting: P(x | θ) is maximized, but θ is unreasonable (0 probabilities – VERY BAD)