Learning Graphical Models
bcmi.sjtu.edu.cn/ds/Lecture12_Xing.pdf · 2018-10-10


Page 1: Learning Graphical Models

Machine Learning
Learning Graphical Models
Eric Xing
Lecture 12, August 14, 2010
Reading:

Page 2: Inference and Learning

A BN M describes a unique probability distribution P.

Typical tasks:

Task 1: How do we answer queries about P?
  We use "inference" as a name for the process of computing answers to such queries.
  So far we have learned several algorithms for exact and approximate inference.

Task 2: How do we estimate a plausible model M from data D?
  i. We use "learning" as a name for the process of obtaining a point estimate of M.
  ii. But a Bayesian seeks p(M | D), which is actually an inference problem.
  iii. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.

Page 3: Learning Graphical Models

The goal:

Given a set of independent samples (assignments of random variables), find the best (the most likely?) graphical model: both the graph and the CPDs.

[Figure: a network over the variables (B, E, A, C, R), learned from samples such as
  (B,E,A,C,R) = (T,F,F,T,F)
  (B,E,A,C,R) = (T,F,T,T,F)
  ........
  (B,E,A,C,R) = (F,T,T,T,F)
together with a CPT P(A | E,B) whose four rows (one per parent configuration) contain the entries 0.9/0.1, 0.2/0.8, 0.9/0.1, and 0.01/0.99.]

Page 4: Learning Graphical Models

Scenarios:
  completely observed GMs
    directed
    undirected
  partially observed GMs
    directed
    undirected (an open research topic)

Estimation principles:
  Maximum likelihood estimation (MLE)
  Bayesian estimation
  Maximum conditional likelihood
  Maximum "margin"

We use "learning" as a name for the process of estimating the parameters, and in some cases the topology of the network, from data.

Page 5: ML Parameter Estimation for Completely Observed GMs of Given Structure

[Figure: a two-node model Z -> X, both observed.]

The data:
  {(z^{(1)}, x^{(1)}), (z^{(2)}, x^{(2)}), (z^{(3)}, x^{(3)}), ..., (z^{(N)}, x^{(N)})}

Page 6: The Basic Idea Underlying MLE

(For now let's assume that the structure is given.) Consider a BN over X_1, X_2, X_3 in which X_1 and X_2 are the parents of X_3.

Likelihood:

  L(\theta; X) = p(X \mid \theta) = p(X_1 \mid \theta_1)\, p(X_2 \mid \theta_2)\, p(X_3 \mid X_1, X_2, \theta_3)

Log-likelihood:

  \ell(\theta; X) = \log p(X_1 \mid \theta_1) + \log p(X_2 \mid \theta_2) + \log p(X_3 \mid X_1, X_2, \theta_3)

Data log-likelihood:

  \ell(\theta; D) = \sum_n \log p(X_n \mid \theta)
                  = \sum_n \log p(X_{1,n} \mid \theta_1) + \sum_n \log p(X_{2,n} \mid \theta_2) + \sum_n \log p(X_{3,n} \mid X_{1,n}, X_{2,n}, \theta_3)

MLE:

  \{\theta_1, \theta_2, \theta_3\}_{MLE} = \arg\max_\theta \ell(\theta; D)

  \theta_1^* = \arg\max \sum_n \log p(X_{1,n} \mid \theta_1),
  \theta_2^* = \arg\max \sum_n \log p(X_{2,n} \mid \theta_2),
  \theta_3^* = \arg\max \sum_n \log p(X_{3,n} \mid X_{1,n}, X_{2,n}, \theta_3)
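Below is a minimal numpy sketch (not from the slides; the dataset and the CPT values are hypothetical) of what this decomposition buys us: the data log-likelihood is a sum of per-node terms, so each theta_i can be evaluated and maximized on its own.

```python
import numpy as np

# Hypothetical data: columns X1, X2, X3; X3 has parents (X1, X2).
D = np.array([[1, 0, 1],
              [0, 0, 0],
              [1, 1, 1],
              [1, 0, 0]])

def log_lik_node1(theta1):          # sum_n log p(X1_n | theta1)
    x1 = D[:, 0]
    return np.sum(x1 * np.log(theta1) + (1 - x1) * np.log(1 - theta1))

def log_lik_node3(theta3):          # sum_n log p(X3_n | X1_n, X2_n, theta3)
    x1, x2, x3 = D[:, 0], D[:, 1], D[:, 2]
    p = theta3[x1, x2]              # theta3[a, b] = P(X3 = 1 | X1 = a, X2 = b)
    return np.sum(x3 * np.log(p) + (1 - x3) * np.log(1 - p))

theta3 = np.array([[0.1, 0.5],
                   [0.4, 0.9]])     # hypothetical CPT values
total_ll = log_lik_node1(0.7) + log_lik_node3(theta3)   # (the X2 term is analogous)

# Because the total is a sum of such terms, each theta_i is maximized separately;
# e.g. the MLE of theta1 is just the empirical frequency of X1 = 1:
theta1_mle = D[:, 0].mean()
```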

Page 7: Example 1: Conditional Gaussian

The completely observed model Z -> X:

Z is a class-indicator vector:

  Z = [Z^1, \ldots, Z^M]^T, where Z^m \in \{0, 1\} and \sum_m Z^m = 1

  p(z) = \prod_m (\pi_m)^{z^m}

(All except one of these terms will be one.) A datum is in class m with probability \pi_m.

X is a conditional Gaussian variable with a class-specific mean:

  p(x \mid z^m = 1, \mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\{-\tfrac{1}{2\sigma^2}(x - \mu_m)^2\}

  p(x \mid z, \mu, \sigma) = \prod_m N(x; \mu_m, \sigma)^{z^m}

Page 8: Example 1: Conditional Gaussian (MLE)

Data log-likelihood:

  \ell(\theta; D) = \log \prod_n p(z_n, x_n)
                  = \log \prod_n p(z_n \mid \pi) + \log \prod_n p(x_n \mid z_n, \mu, \sigma)
                  = \sum_n \sum_m z_n^m \log \pi_m + \sum_n \sum_m z_n^m \log N(x_n; \mu_m, \sigma)
                  = \sum_n \sum_m z_n^m \log \pi_m - \sum_n \sum_m z_n^m \tfrac{1}{2\sigma^2}(x_n - \mu_m)^2 + C

MLE:

  \pi_m^* = \arg\max \ell(\theta; D) \;\Rightarrow\; \frac{\partial \ell}{\partial \pi_m} = 0 \;\forall m, \text{ s.t. } \sum_m \pi_m = 1
          \;\Rightarrow\; \pi_m^{MLE} = \frac{\sum_n z_n^m}{N} = \frac{n_m}{N}        (the fraction of samples of class m)

  \mu_m^* = \arg\max \ell(\theta; D) \;\Rightarrow\; \mu_m^{MLE} = \frac{\sum_n z_n^m x_n}{\sum_n z_n^m}        (the average of the samples of class m)
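A small numpy sketch of these two closed-form estimates, assuming hypothetical data with one-hot class indicators z (shape N x M) and scalar observations x; it is not part of the slides, just an illustration of the formulas above.

```python
import numpy as np

# Hypothetical fully observed data: one-hot class labels z and scalar x.
z = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])   # N = 5 samples, M = 2 classes
x = np.array([1.0, 1.2, 3.0, 2.8, 3.2])

N, M = z.shape
pi_mle = z.sum(axis=0) / N                               # pi_m = fraction of samples in class m
mu_mle = (z * x[:, None]).sum(axis=0) / z.sum(axis=0)    # mu_m = class-wise sample mean

print(pi_mle)   # [0.4 0.6]
print(mu_mle)   # [1.1 3.0]
```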

Page 9: Example 2: HMM, Two Scenarios

Supervised learning: estimation when the "right answer" is known. Examples:
  GIVEN: a genomic region x = x_1 ... x_{1,000,000} where we have good (experimental) annotations of the CpG islands.
  GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.

Unsupervised learning: estimation when the "right answer" is unknown. Examples:
  GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
  GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.

QUESTION: Update the parameters \theta of the model to maximize P(x | \theta) --- maximum likelihood (ML) estimation.

Page 10: Recall the Definition of an HMM

[Figure: the state chain y_1, y_2, y_3, ..., y_T with emissions x_1, x_2, x_3, ..., x_T.]

Transition probabilities between any two states:

  p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}

or

  p(y_t \mid y_{t-1}^i = 1) \sim \text{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \;\forall i \in I

Start probabilities:

  p(y_1) \sim \text{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)

Emission probabilities associated with each state:

  p(x_t \mid y_t^i = 1) \sim \text{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \;\forall i \in I

or in general:

  p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \;\forall i \in I

Page 11: Supervised ML Estimation

Given x = x_1 ... x_N for which the true state path y = y_1 ... y_N is known:

  \ell(\theta; x, y) = \log p(x, y) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)

Define:
  A_ij = # times state transition i -> j occurs in y
  B_ik = # times state i in y emits k in x

We can show that the maximum likelihood parameters are:

  a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}

  b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}

If y is continuous, we can treat \{(x_{n,t}, y_{n,t}) : t = 1..T, n = 1..N\} as N x T observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian.
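A sketch of this supervised counting estimator in Python (the example sequence, state coding, and the optional `pseudo` argument are illustrative assumptions, not from the slides):

```python
import numpy as np

def supervised_hmm_mle(ys, xs, n_states, n_symbols, pseudo=0.0):
    """Count transitions and emissions in a labeled sequence and normalize.
    `pseudo` adds optional pseudocounts (see the smoothing slide)."""
    A = np.full((n_states, n_states), pseudo)    # A[i, j] = # transitions i -> j
    B = np.full((n_states, n_symbols), pseudo)   # B[i, k] = # emissions of symbol k from state i
    for t in range(len(ys)):
        B[ys[t], xs[t]] += 1
        if t > 0:
            A[ys[t - 1], ys[t]] += 1
    a = A / A.sum(axis=1, keepdims=True)         # a_ij = A_ij / sum_j' A_ij'
    b = B / B.sum(axis=1, keepdims=True)         # b_ik = B_ik / sum_k' B_ik'
    return a, b

# Hypothetical casino-style data: states 0 = Fair, 1 = Loaded; symbols 0..5 are die faces.
ys = [0, 0, 0, 1, 1, 0]
xs = [1, 0, 4, 5, 5, 2]
a, b = supervised_hmm_mle(ys, xs, n_states=2, n_symbols=6)
```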

Page 12: Supervised ML Estimation, ctd.

Intuition:
  When we know the underlying states, the best estimate of \theta is the average frequency of transitions and emissions that occur in the training data.

Drawback:
  Given little data, there may be overfitting: P(x | \theta) is maximized, but \theta is unreasonable. Zero probabilities: VERY BAD.

Example: Given 10 casino rolls, we observe
  x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
  y = F, F, F, F, F, F, F, F, F, F

Then: a_FF = 1; a_FL = 0
      b_F1 = b_F3 = b_F6 = .2; b_F2 = .3; b_F4 = 0; b_F5 = .1

Page 13: Pseudocounts

Solution for small training sets: add pseudocounts

  A_ij = # times state transition i -> j occurs in y + R_ij
  B_ik = # times state i in y emits k in x + S_ik

R_ij, S_ik are pseudocounts representing our prior belief.
Total pseudocounts: R_i = \sum_j R_ij, S_i = \sum_k S_ik
  --- the "strength" of the prior belief,
  --- the total number of imaginary instances in the prior.

Larger total pseudocounts => stronger prior belief.

Small total pseudocounts: just to avoid 0 probabilities --- smoothing.

This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts.

Page 14: MLE for General BN Parameters

If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

  \ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p(x_{n,i} \mid x_{n,\pi_i}, \theta_i) = \sum_i \Big( \sum_n \log p(x_{n,i} \mid x_{n,\pi_i}, \theta_i) \Big)

[Figure: a small network with local count tables collected for configurations such as X_2 = 0, X_2 = 1 and X_5 = 0, X_5 = 1.]

Page 15: Example: Decomposable Likelihood of a Directed Model

Consider the distribution defined by the directed acyclic GM:

  p(x \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)

This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the network X_1 -> X_2, X_1 -> X_3, {X_2, X_3} -> X_4, split into four node-plus-parents fragments.]

Page 16: E.g.: MLE for BNs with Tabular CPDs

Assume each CPD is represented as a table (multinomial), where

  \theta_{ijk} \overset{\text{def}}{=} p(X_i = j \mid X_{\pi_i} = k)

Note that in the case of multiple parents, X_{\pi_i} will have a composite state, and the CPD will be a high-dimensional table.

The sufficient statistics are counts of family configurations:

  n_{ijk} \overset{\text{def}}{=} \sum_n x_{n,i}^j x_{n,\pi_i}^k

The log-likelihood is

  \ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}

Using a Lagrange multiplier to enforce \sum_j \theta_{ijk} = 1, we get:

  \theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}
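A sketch of this counting estimator for a single node (hypothetical integer-coded data and helper names of my own choosing); it normalizes the family-configuration counts over the child's values, exactly as in the formula above:

```python
import numpy as np

def tabular_cpd_mle(child_vals, parent_cfgs, n_child_states, n_parent_cfgs, alpha=0.0):
    """theta[k, j] ~ theta_ijk = P(X_i = j | X_pa(i) = k), estimated as n_ijk / sum_j' n_ij'k.
    `alpha` adds optional pseudocounts."""
    n = np.full((n_parent_cfgs, n_child_states), alpha)
    for j, k in zip(child_vals, parent_cfgs):
        n[k, j] += 1                                # family-configuration counts n_ijk
    return n / n.sum(axis=1, keepdims=True)         # normalize over the child's values j

# Hypothetical data: a binary child whose parents take 3 joint configurations.
child  = np.array([0, 1, 1, 0, 1, 1])
parent = np.array([0, 0, 1, 1, 2, 2])
theta = tabular_cpd_mle(child, parent, n_child_states=2, n_parent_cfgs=3)
# theta[0] == [0.5, 0.5], theta[1] == [0.5, 0.5], theta[2] == [0.0, 1.0]
```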

Page 17: Learning Partially Observed GMs

[Figure: the two-node model Z -> X, with Z now unobserved.]

The data:
  {(x^{(1)}), (x^{(2)}), (x^{(3)}), ..., (x^{(N)})}

Page 18: What If Some Nodes Are Not Observed?

Consider the distribution defined by the directed acyclic GM:

  p(x \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)

[Figure: the same network X_1 -> X_2, X_1 -> X_3, {X_2, X_3} -> X_4, but with X_1 unobserved.]

We need to compute p(x_H | x_V)  =>  inference.

Page 19: Recall: EM Algorithm

A way of maximizing the likelihood function for latent variable models. It finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces:

1. Estimate some "missing" or "unobserved" data from the observed data and the current parameters.
2. Using this "complete" data, find the maximum likelihood parameter estimates.

Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:

  E-step:  q^{t+1} = \arg\max_q F(q, \theta^t)
  M-step:  \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)

In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.

Page 20: EM for General BNs

  while not converged
    % E-step
    for each node i
      ESS_i = 0                                    % reset expected sufficient statistics
    for each data sample n
      do inference with X_{n,H}
      for each node i
        ESS_i += < SS_i(X_{n,i}, X_{n,\pi_i}) >_{p(X_{n,H} | X_{n,V})}
    % M-step
    for each node i
      \theta_i := MLE(ESS_i)

Page 21: Example: HMM

Supervised learning: estimation when the "right answer" is known. Examples:
  GIVEN: a genomic region x = x_1 ... x_{1,000,000} where we have good (experimental) annotations of the CpG islands.
  GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.

Unsupervised learning: estimation when the "right answer" is unknown. Examples:
  GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
  GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.

QUESTION: Update the parameters \theta of the model to maximize P(x | \theta) --- maximum likelihood (ML) estimation.

Page 22: The Baum-Welch Algorithm

The complete log-likelihood:

  \ell_c(\theta; x, y) = \log p(x, y) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)

The expected complete log-likelihood:

  \langle \ell_c(\theta; x, y) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1}|x_n)} \log \pi_i
    + \sum_n \sum_{i,j} \sum_{t=2}^{T} \langle y_{n,t-1}^j y_{n,t}^i \rangle_{p(y_{n,t-1}, y_{n,t}|x_n)} \log a_{j,i}
    + \sum_n \sum_{i,k} \sum_{t=1}^{T} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t}|x_n)} \log b_{i,k}

EM: the E-step

  \gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n)
  \xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid x_n)

The M-step ("symbolically" identical to MLE):

  a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}

  b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}

  \pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}
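A sketch of just the M-step above, assuming the E-step posteriors gamma and xi have already been produced by forward-backward (that computation is omitted) and that observations are one-hot encoded; the array shapes and names are assumptions for illustration:

```python
import numpy as np

def baum_welch_m_step(gamma, xi, x_onehot):
    """gamma[n, t, i] = p(y_t = i | x_n), shape (N, T, M)
    xi[n, t, i, j]    = p(y_t = i, y_{t+1} = j | x_n), shape (N, T-1, M, M)
    x_onehot[n, t, k] = 1 if sample n emits symbol k at time t, shape (N, T, K)"""
    N = gamma.shape[0]
    pi = gamma[:, 0, :].sum(axis=0) / N                                        # pi_i
    a = xi.sum(axis=(0, 1)) / gamma[:, :-1, :].sum(axis=(0, 1))[:, None]       # a_ij
    b = np.einsum('nti,ntk->ik', gamma, x_onehot) \
        / gamma.sum(axis=(0, 1))[:, None]                                      # b_ik
    return pi, a, b
```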

Page 23: Unsupervised ML Estimation

Given x = x_1 ... x_N for which the true state path y = y_1 ... y_N is unknown:

EXPECTATION MAXIMIZATION

0. Start with our best guess of a model M and parameters \theta.
1. Estimate A_ij, B_ik in the training data:
     A_ij = \sum_{n,t} \langle y_{n,t-1}^i y_{n,t}^j \rangle,   B_ik = \sum_{n,t} \langle y_{n,t}^i \rangle x_{n,t}^k
   How? These are the expected counts under the E-step posteriors.
2. Update \theta according to A_ij, B_ik: now a "supervised learning" problem.
3. Repeat 1 & 2 until convergence.

This is called the Baum-Welch algorithm.

We can get to a provably more (or equally) likely parameter set at each iteration.

Page 24: ML Structural Learning for Completely Observed GMs

[Figure: the network over E, B, R, A, C to be recovered.]

Data:

  (x_1^{(1)}, \ldots, x_n^{(1)})
  (x_1^{(2)}, \ldots, x_n^{(2)})
  \ldots
  (x_1^{(M)}, \ldots, x_n^{(M)})

Page 25: Information-Theoretic Interpretation of ML

  \ell(\theta_G; D, G) = \log p(D \mid \theta_G, G)
    = \log \prod_n \prod_i p(x_{n,i} \mid x_{n,\pi_i(G)}, \theta_{i|\pi_i(G)})
    = \sum_i \sum_n \log p(x_{n,i} \mid x_{n,\pi_i(G)}, \theta_{i|\pi_i(G)})
    = M \sum_i \sum_{x_i, x_{\pi_i(G)}} \frac{count(x_i, x_{\pi_i(G)})}{M} \log p(x_i \mid x_{\pi_i(G)}, \theta_{i|\pi_i(G)})
    = M \sum_i \sum_{x_i, x_{\pi_i(G)}} \hat{p}(x_i, x_{\pi_i(G)}) \log p(x_i \mid x_{\pi_i(G)}, \theta_{i|\pi_i(G)})

From a sum over data points to a sum over counts of variable states.

Page 26: Information-Theoretic Interpretation of ML (con'd)

  \ell(\hat{\theta}_G; D, G) = \log p(D \mid \hat{\theta}_G, G)
    = M \sum_i \sum_{x_i, x_{\pi_i(G)}} \hat{p}(x_i, x_{\pi_i(G)}) \log \hat{p}(x_i \mid x_{\pi_i(G)})
    = M \sum_i \sum_{x_i, x_{\pi_i(G)}} \hat{p}(x_i, x_{\pi_i(G)}) \log \frac{\hat{p}(x_i, x_{\pi_i(G)})}{\hat{p}(x_i)\, \hat{p}(x_{\pi_i(G)})}
      + M \sum_i \sum_{x_i} \hat{p}(x_i) \log \hat{p}(x_i)
    = M \sum_i \hat{I}(x_i, x_{\pi_i(G)}) - M \sum_i \hat{H}(x_i)

A decomposable score, and a function of the graph structure.

Page 27: Structural Search

How many graphs over n nodes?  O(2^{n^2})

How many trees over n nodes?  O(n!)

But it turns out that we can find an exact solution for the optimal tree (under MLE)!
  Trick: in a tree, each node has only one parent!
  Chow-Liu algorithm

Page 28: Chow-Liu Tree Learning Algorithm

Objective function:

  \log p(D \mid \hat{\theta}_G, G) = M \sum_i \hat{I}(x_i, x_{\pi_i(G)}) - M \sum_i \hat{H}(x_i)
  \Rightarrow  C(G) = M \sum_i \hat{I}(x_i, x_{\pi_i(G)})     (the entropy term does not depend on G)

Chow-Liu:

For each pair of variables x_i and x_j:
  Compute the empirical distribution:  \hat{p}(X_i, X_j) = \frac{count(x_i, x_j)}{M}
  Compute the mutual information:  \hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)}

Define a graph with nodes x_1, ..., x_n; edge (i, j) gets weight \hat{I}(X_i, X_j).
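A sketch of this first Chow-Liu step, computing the empirical pairwise mutual information used as edge weights (the function names and the integer-coded data format are assumptions for illustration):

```python
import numpy as np
from itertools import combinations

def empirical_mi(xi, xj):
    """I_hat(Xi, Xj) from two discrete (integer-coded) columns."""
    M = len(xi)
    pij = np.zeros((xi.max() + 1, xj.max() + 1))
    for a, b in zip(xi, xj):
        pij[a, b] += 1.0 / M                       # joint empirical distribution
    pi, pj = pij.sum(axis=1), pij.sum(axis=0)      # marginals
    nz = pij > 0
    return np.sum(pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz]))

def mi_weights(D):
    """Edge weight I_hat(Xi, Xj) for every pair of columns of the M x n data matrix D."""
    n = D.shape[1]
    return {(i, j): empirical_mi(D[:, i], D[:, j]) for i, j in combinations(range(n), 2)}
```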

Page 29: Chow-Liu Algorithm (con'd)

Objective function:

  C(G) = M \sum_i \hat{I}(x_i, x_{\pi_i(G)})

Chow-Liu: optimal tree BN
  Compute the maximum-weight spanning tree.
  Direction in the BN: pick any node as root, and do a breadth-first search to define the directions.
  I-equivalence: different roots give differently directed trees but the same score; e.g., for a tree with edges {A-B, A-C, C-D, C-E},

    C(G) = I(A,B) + I(A,C) + I(C,D) + I(C,E)
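Continuing the sketch: the maximum-weight spanning tree plus a breadth-first orientation, using networkx for brevity (this assumes the mi_weights dictionary from the previous sketch; negating the weights lets the minimum-spanning-tree routine return the maximum-weight tree):

```python
import networkx as nx

def chow_liu_tree(mi, n_nodes, root=0):
    G = nx.Graph()
    G.add_nodes_from(range(n_nodes))
    for (i, j), w in mi.items():
        G.add_edge(i, j, weight=-w)             # negated weight: min spanning tree == max weight tree
    T = nx.minimum_spanning_tree(G)
    # Orient edges away from an arbitrary root via BFS; any root gives an I-equivalent BN.
    return list(nx.bfs_tree(T, root).edges())   # directed (parent, child) pairs
```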

Page 30: Structure Learning for General Graphs

Theorem:
  The problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d >= 2.

Most structure-learning approaches use heuristics:
  Exploit score decomposition.
  Two heuristics that exploit decomposition in different ways:
    Greedy search through the space of node orderings.
    Local search over graph structures.

Page 31: Inferring Gene Regulatory Networks

Network of cis-regulatory pathways

Success stories in sea urchin, fruit fly, etc., come from decades of experimental research.

Statistical modeling and automated learning have just started.

Page 32: Gene Expression Profiling by Microarrays

[Figure: an example signaling pathway with nodes Receptor A (X1), Receptor B (X2), Kinase C (X3), Kinase D (X4), Kinase E (X5), Trans. Factor F (X6), Gene G (X7), and Gene H (X8), whose expression levels are measured by microarrays.]

Page 33: Structure Learning Algorithms

[Figure: expression data -> learning algorithm -> a learned network such as the E/B/R/A/C example.]

Structural EM (Friedman 1998)
  The original algorithm

Sparse Candidate Algorithm (Friedman et al.)
  Discretizing array signals
  Hill-climbing search using local operators: add/delete/swap of a single edge
  Feature extraction: Markov relations, order relations
  Re-assemble high-confidence sub-networks from features

Module network learning (Segal et al.)
  Heuristic search of structure in a "module graph"
  Module assignment
  Parameter sharing
  Prior knowledge: possible regulators (TF genes)

Page 34: Learning GM Structure

Learning the best CPDs given the DAG is easy:
  collect statistics of the values of each node given a specific assignment to its parents.

Learning the graph topology (structure) is NP-hard:
  heuristic search must be applied, and it generally leads to a locally optimal network.

Overfitting:
  It turns out that richer structures give higher likelihood P(D | G) to the data (adding an edge is always preferable):

    P(C \mid A) \le P(C \mid A, B)

  more parameters to fit => more freedom => there always exists a more "optimal" CPD(C).

We prefer simpler (more explanatory) networks.
  Practical scores regularize the likelihood improvement of complex networks.

Page 35: Learning Graphical Model Structure via Neighborhood Selection

Page 36: Undirected Graphical Models

Why?

Sometimes an UNDIRECTED association graph makes more sense and/or is more informative:
  gene expressions may be influenced by unobserved factors that are post-transcriptionally regulated.

[Figure: three small graphs relating A, B, and C, with B unobserved.]

The unavailability of the state of B results in a constraint over A and C.

Page 37: Gaussian Graphical Models

Multivariate Gaussian density:

  p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\{ -\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \}

WLOG: let \mu = 0 and write Q = \Sigma^{-1} = (q_{ij}); then

  p(x_1, \ldots, x_n \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\{ -\tfrac{1}{2} \sum_i q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j \}

We can view this as a continuous Markov random field with potentials defined on every node and edge.

Page 38: The Covariance and the Precision Matrices

Covariance matrix \Sigma
  Graphical model interpretation?

Precision matrix Q = \Sigma^{-1}
  Graphical model interpretation?

Page 39: Sparse Precision vs. Sparse Covariance in GGM

[Figure: the chain graph 1 - 2 - 3 - 4 - 5, shown with its precision matrix (nonzero entries only on the diagonal and for adjacent pairs) and its covariance matrix (entries generally nonzero).]

  X_1 \perp X_5 \mid X_{nbrs(5)} \text{ (or } X_{nbrs(1)})  \Longleftrightarrow  Q_{15} = 0

  \Sigma_{15} = 0  \Longleftrightarrow  X_1 \perp X_5
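A quick numpy check of this picture, using a hypothetical chain-structured precision matrix: the precision is sparse, but its inverse (the covariance) is dense.

```python
import numpy as np

# Hypothetical precision matrix for the chain 1 - 2 - 3 - 4 - 5:
# only adjacent pairs have nonzero off-diagonal entries.
Q = np.array([[ 2, -1,  0,  0,  0],
              [-1,  2, -1,  0,  0],
              [ 0, -1,  2, -1,  0],
              [ 0,  0, -1,  2, -1],
              [ 0,  0,  0, -1,  2]], dtype=float)

Sigma = np.linalg.inv(Q)
print(np.round(Sigma, 2))   # every entry is nonzero: X1 and X5 are marginally dependent,
                            # yet Q[0, 4] == 0, so X1 is independent of X5 given the rest.
```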

Page 40: Another Example

How to estimate this MRF?

What if p >> n?
  The MLE does not exist in general!
  What about only learning a "sparse" graphical model?
  This is possible when s = o(n).

Very often it is the structure of the GM that is more interesting ...

Page 41: Single-Node Conditional

The conditional distribution of a single node i given the rest of the nodes can be written in terms of the precision matrix Q: for a zero-mean Gaussian,

  p(X_i \mid X_{-i}) = N\Big( \sum_{j \ne i} \big(-\tfrac{q_{ij}}{q_{ii}}\big) X_j, \; \tfrac{1}{q_{ii}} \Big)

WLOG, letting \theta_{ij} = -q_{ij}/q_{ii}, we can write the following conditional auto-regression function for each node:

  X_i = \sum_{j \ne i} \theta_{ij} X_j + \varepsilon_i,   where \varepsilon_i \sim N(0, 1/q_{ii}) is independent of X_{-i}.

Page 42: Conditional Independence

From the auto-regression above, X_j drops out of p(X_i \mid X_{-i}) exactly when \theta_{ij} = 0, i.e., when q_{ij} = 0.

Let:  s_i \overset{\text{def}}{=} \{ j : q_{ij} \ne 0 \}   (the nonzero entries in row i of Q)

Given an estimate of the neighborhood s_i, we have:

  p(X_i \mid X_{-i}) = p(X_i \mid X_{s_i}),   i.e.,   X_i \perp X_{\{-i\} \setminus s_i} \mid X_{s_i}

Thus the neighborhood s_i defines the Markov blanket of node i.

Page 43: Recall Lasso

  \hat{\beta} = \arg\min_{\beta} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_1

The L1 penalty drives many coefficients exactly to zero, which is what makes it suitable for selecting a sparse neighborhood.

Page 44: Graph Regression

Lasso: neighborhood selection.

[Figure: regress one node on all of the remaining nodes with the lasso; the nonzero coefficients give its estimated neighborhood.]

Page 45: Graph Regression (con'd)

[Figure: the same lasso regression is repeated with each node in turn as the response.]

Page 46: Graph Regression (con'd)

It can be shown that, given iid samples, and under several technical conditions (e.g., "irrepresentable"), the recovered structure is "sparsistent" even when p >> n.

Page 47: Consistency

Theorem: for the graphical regression algorithm, the recovered neighborhoods are consistent under certain verifiable conditions (omitted here for simplicity).

Note that from this theorem one should see that the regularizer is not actually used to introduce an "artificial" sparsity bias, but is a device to ensure consistency under finite-data and high-dimension conditions.

Page 48: Learning (Sparse) GGM

Multivariate Gaussian over all continuous expressions:

  p([x_1, \ldots, x_n]^T) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\{ -\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \}

The precision matrix Q reveals the topology of the (undirected) network.

Learning algorithm: covariance selection
  We want a sparse matrix Q.
  As shown in the previous slides, we can use L_1-regularized linear regression to obtain a sparse estimate of the neighborhood of each variable.
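A sketch of this neighborhood-selection idea with scikit-learn's Lasso (the data matrix, the regularization value, and the OR symmetrization rule are assumptions for illustration, not choices fixed by the slide):

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, lam=0.1):
    """Regress each variable on all the others with an L1 penalty; the nonzero
    coefficients define the estimated neighborhood (Markov blanket) of that node."""
    n, p = X.shape
    nbrs = {}
    for i in range(p):
        others = np.delete(np.arange(p), i)
        beta = Lasso(alpha=lam).fit(X[:, others], X[:, i]).coef_
        nbrs[i] = set(others[np.abs(beta) > 1e-8])
    # Symmetrize with the OR rule: keep edge (i, j) if either regression selects it.
    edges = {(i, j) for i in nbrs for j in nbrs[i]}
    return {(min(i, j), max(i, j)) for (i, j) in edges}
```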

Page 49: Recent Trends in GGM

Covariance selection (classical method)
  Dempster [1972]: sequentially pruning the smallest elements in the precision matrix
  Drton and Perlman [2008]: improved statistical tests for pruning
  Serious limitations in practice: breaks down when the covariance matrix is not invertible

L1-regularization based methods (hot!)
  Meinshausen and Bühlmann [Ann. Stat. 06]: used LASSO regression for neighborhood selection
  Banerjee [JMLR 08]: block sub-gradient algorithm for finding the precision matrix
  Friedman et al. [Biostatistics 08]: efficient fixed-point equations based on a sub-gradient algorithm
  ...
  Structure learning is possible even when # variables > # samples

Page 50: Learning an Ising Model (i.e., a Pairwise MRF)

Assuming the nodes are discrete and the edges are weighted, then for a sample x_d we have

  p(x_d \mid \Theta) = \exp\Big\{ \sum_{i \in V} \theta_{ii} x_{d,i} + \sum_{(i,j) \in E} \theta_{ij} x_{d,i} x_{d,j} - A(\Theta) \Big\}

It can be shown, following the same logic, that we can use L_1-regularized logistic regression to obtain a sparse estimate of the neighborhood of each variable in the discrete case.
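A matching sketch for the discrete case with scikit-learn's L1-penalized logistic regression (the binary 0/1 coding, the regularization strength C, and the thresholding are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighborhoods(X, C=0.5):
    """For each node, fit an L1-regularized logistic regression of that node on all
    the others; the nonzero coefficients give a sparse neighborhood estimate."""
    n, p = X.shape
    nbrs = {}
    for i in range(p):
        others = np.delete(np.arange(p), i)
        clf = LogisticRegression(penalty='l1', solver='liblinear', C=C)
        clf.fit(X[:, others], X[:, i])
        nbrs[i] = set(others[np.abs(clf.coef_.ravel()) > 1e-8])
    return nbrs
```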