Learning Graphical Models - sjtubcmi.sjtu.edu.cn/ds/lecture12_xing.pdf (2018-10-10)
TRANSCRIPT
Machine Learning
Learning Graphical Models
Eric Xing
Lecture 12, August 14, 2010
© Eric Xing @ CMU, 2006-2010
Reading:
Inference and Learning
A BN M describes a unique probability distribution P.
Typical tasks:
Task 1: How do we answer queries about P?
We use inference as a name for the process of computing answers to such queries
So far we have learned several algorithms for exact and approximate inference.
Task 2: How do we estimate a plausible model M from data D?
i. We use learning as a name for the process of obtaining a point estimate of M.
ii. But Bayesians seek p(M|D), which is actually an inference problem.
iii. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.
Learning Graphical Models
The goal: given a set of independent samples (assignments of random variables), find the best (the most likely?) graphical model (both the graph and the CPDs).
[Figure: the five-node alarm network, with edges E → R, E → A, B → A, A → C]
Samples: (B,E,A,C,R) = (T,F,F,T,F), (B,E,A,C,R) = (T,F,T,T,F), ..., (B,E,A,C,R) = (F,T,T,T,F)
Example CPD, P(A | E, B) (columns: A = T, A = F):
  e, b:    0.9   0.1
  e, ~b:   0.2   0.8
  ~e, b:   0.9   0.1
  ~e, ~b:  0.01  0.99
Learning Graphical Models
Scenarios:
  completely observed GMs: directed / undirected
  partially observed GMs: directed / undirected (an open research topic)
Estimation principles:
  Maximum likelihood estimation (MLE)
  Bayesian estimation
  Maximal conditional likelihood
  Maximal "margin"
We use learning as a name for the process of estimating the parameters, and in some cases the topology of the network, from data.
ML Parameter Estimation for Completely Observed GMs of Given Structure
[Figure: a two-node model Z → X]
The data: {(z(1),x(1)), (z(2),x(2)), (z(3),x(3)), ..., (z(N),x(N))}
The Basic Idea Underlying MLE
(for now, let's assume that the structure is given)
[Figure: a three-node BN, X1 → X3 ← X2]
Likelihood:
$L(\theta; X) = p(X \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid \theta_2)\, p(x_3 \mid x_1, x_2, \theta_3)$
Log-likelihood:
$\ell(\theta; X) = \log p(x_1 \mid \theta_1) + \log p(x_2 \mid \theta_2) + \log p(x_3 \mid x_1, x_2, \theta_3)$
Data log-likelihood:
$\ell(\theta; \mathrm{DATA}) = \sum_n \log p(\mathbf{x}_n \mid \theta) = \sum_n \log p(x_{n,1} \mid \theta_1) + \sum_n \log p(x_{n,2} \mid \theta_2) + \sum_n \log p(x_{n,3} \mid x_{n,1}, x_{n,2}, \theta_3)$
MLE:
$\{\theta_1^*, \theta_2^*, \theta_3^*\} = \arg\max \ell(\theta; \mathrm{DATA})$
$\theta_1^* = \arg\max \sum_n \log p(x_{n,1} \mid \theta_1), \quad \theta_2^* = \arg\max \sum_n \log p(x_{n,2} \mid \theta_2), \quad \theta_3^* = \arg\max \sum_n \log p(x_{n,3} \mid x_{n,1}, x_{n,2}, \theta_3)$
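Since the objective decomposes per node, each CPD can be fit independently by counting. A minimal Python sketch of this (the binary toy data and variable names are hypothetical, not from the slides):

```python
import numpy as np

# Hypothetical binary samples of (X1, X2, X3) from the BN X1 -> X3 <- X2.
D = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1]])

# theta1*, theta2*: root nodes, so each MLE is just the empirical marginal.
theta1 = D[:, 0].mean()          # estimate of p(X1 = 1)
theta2 = D[:, 1].mean()          # estimate of p(X2 = 1)

# theta3*: one Bernoulli per parent configuration (x1, x2).
theta3 = {}
for x1 in (0, 1):
    for x2 in (0, 1):
        rows = D[(D[:, 0] == x1) & (D[:, 1] == x2)]
        if len(rows):            # undefined if a configuration never occurs
            theta3[(x1, x2)] = rows[:, 2].mean()  # p(X3 = 1 | x1, x2)
```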
Example 1: Conditional Gaussian
The completely observed model Z → X:
Z is a class indicator vector, $Z = [z^1, \ldots, z^M]^T$, where $z^m \in \{0, 1\}$ and $\sum_m z^m = 1$, with
$p(z \mid \pi) = \prod_m (\pi_m)^{z^m}$
(a multinomial: all except one of these terms will be one, and a datum is in class m w.p. $\pi_m$).
X is a conditional Gaussian variable with a class-specific mean:
$p(x \mid z^m = 1, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu_m)^2 \right\}$,
i.e., $p(x \mid z, \mu, \sigma) = \prod_m N(x; \mu_m, \sigma)^{z^m}$.
Example 1: Conditional Gaussian
Data log-likelihood:
$\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)$
$= \sum_n \log p(z_n \mid \pi) + \sum_n \log p(x_n \mid z_n, \mu, \sigma)$
$= \sum_n \sum_m z_n^m \log \pi_m - \sum_n \sum_m z_n^m \frac{1}{2\sigma^2}(x_n - \mu_m)^2 + C$
MLE:
$\pi_m^* = \arg\max \ell(\theta; D)$ s.t. $\sum_m \pi_m = 1$; setting $\frac{\partial \ell}{\partial \pi_m} = 0\ \forall m$ gives $\pi_m^* = \frac{\sum_n z_n^m}{N}$, the fraction of samples of class m.
$\mu_m^* = \arg\max \ell(\theta; D)$ gives $\mu_m^* = \frac{\sum_n z_n^m x_n}{\sum_n z_n^m}$, the average of samples of class m.
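Both estimators are one line of code each. A minimal sketch, with hypothetical one-hot labels z and observations x:

```python
import numpy as np

# Hypothetical labeled data: z holds one-hot class indicators, x the observations.
z = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)  # shape (N, M)
x = np.array([1.0, 1.2, 4.8, 5.1, 5.3])                              # shape (N,)
N, M = z.shape

pi_hat = z.sum(axis=0) / N          # pi_m* = (sum_n z_n^m) / N
mu_hat = (z.T @ x) / z.sum(axis=0)  # mu_m* = (sum_n z_n^m x_n) / (sum_n z_n^m)
# pi_hat -> [0.4, 0.6]; mu_hat -> [1.1, 5.0667]
```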
Example 2: HMM, Two Scenarios
Supervised learning: estimation when the "right answer" is known. Examples:
GIVEN: a genomic region x = x1 ... x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
Unsupervised learning: estimation when the "right answer" is unknown. Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: update the parameters $\theta$ of the model to maximize $P(x \mid \theta)$: maximum likelihood (ML) estimation
Recall: Definition of HMM
[Figure: a hidden chain y1 → y2 → y3 → ... → yT, each yt emitting xt]
Transition probabilities between any two states:
$p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$, or
$p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, \ldots, a_{i,M}),\ \forall i \in I$.
Start probabilities:
$p(y_1) \sim \mathrm{Multinomial}(\pi_1, \ldots, \pi_M)$.
Emission probabilities associated with each state:
$p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, \ldots, b_{i,K}),\ \forall i \in I$,
or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in I$.
Supervised ML Estimation
Given x = x1 ... xN for which the true state path y = y1 ... yN is known, define:
$\ell(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \sum_n \left( \log p(y_{n,1}) + \sum_{t=2}^{T} \log p(y_{n,t} \mid y_{n,t-1}) + \sum_{t=1}^{T} \log p(x_{n,t} \mid y_{n,t}) \right)$
$A_{ij}$ = # times state transition i→j occurs in y; $B_{ik}$ = # times state i in y emits k in x.
We can show that the maximum likelihood parameters are:
$a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$
$b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$
If x is continuous, we can treat $\{(x_{n,t}, y_{n,t}) : t = 1{:}T,\ n = 1{:}N\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian.
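These formulas amount to counting and normalizing. A minimal sketch using the casino example from the next slide (the 0-based integer encoding is my own):

```python
import numpy as np

M, K = 2, 6                          # states {F=0, L=1}, die faces 1..6 as 0..5
x = [1, 0, 4, 5, 0, 1, 2, 5, 1, 2]   # rolls 2,1,5,6,1,2,3,6,2,3 (0-based symbols)
y = [0] * 10                         # all ten rolls from the Fair state

A = np.zeros((M, M))                 # A_ij = # transitions i -> j in y
B = np.zeros((M, K))                 # B_ik = # times state i emits symbol k
for t in range(1, len(y)):
    A[y[t - 1], y[t]] += 1
for t in range(len(y)):
    B[y[t], x[t]] += 1

with np.errstate(invalid="ignore"):  # rows of never-visited states become NaN
    a = A / A.sum(axis=1, keepdims=True)   # a_ij = A_ij / sum_j' A_ij'
    b = B / B.sum(axis=1, keepdims=True)   # b_ik = B_ik / sum_k' B_ik'
# b[0] == [.2, .3, .2, 0, .1, .2]: the slide's b_F estimates, with b_F4 = 0
```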
Supervised ML Estimation, ctd.
Intuition: when we know the underlying states, the best estimate of $\theta$ is the average frequency of transitions & emissions that occur in the training data.
Drawback: given little data, there may be overfitting: $P(x \mid \theta)$ is maximized, but $\theta$ is unreasonable; 0 probabilities are VERY BAD.
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F
Then: aFF = 1; aFL = 0; bF1 = bF3 = bF6 = .2; bF2 = .3; bF4 = 0; bF5 = .1
Pseudocounts
Solution for small training sets: add pseudocounts.
$A_{ij}$ = # times state transition i→j occurs in y $+\ R_{ij}$
$B_{ik}$ = # times state i in y emits k in x $+\ S_{ik}$
$R_{ij}$, $S_{ik}$ are pseudocounts representing our prior belief.
Total pseudocounts $R_i = \sum_j R_{ij}$, $S_i = \sum_k S_{ik}$: the "strength" of the prior belief, i.e., the total number of imaginary instances in the prior.
Larger total pseudocounts mean a stronger prior belief.
Small total pseudocounts: just enough to avoid 0 probabilities (smoothing).
This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts.
MLE for General BN Parameters
If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) = \sum_i \left( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \right)$
[Figure: an example network with a data table, e.g. columns for X2 = 0/1 and X5 = 0/1]
Example: Decomposable Likelihood of a Directed Model
Consider the distribution defined by the directed acyclic GM:
$p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$
This is exactly like learning four separate small BNs, each of which consists of a node and its parents.
[Figure: the network X1 → X2, X1 → X3, X2 → X4, X3 → X4, split into four node-plus-parents fragments]
E.g.: MLE for BNs with Tabular CPDs
Assume each CPD is represented as a table (multinomial) where
$\theta_{ijk} \overset{\mathrm{def}}{=} p(X_i = j \mid X_{\pi_i} = k)$
Note that in the case of multiple parents, $X_{\pi_i}$ will have a composite state, and the CPD will be a high-dimensional table.
The sufficient statistics are counts of family configurations:
$n_{ijk} \overset{\mathrm{def}}{=} \sum_n x_{n,i}^j x_{n,\pi_i}^k$
The log-likelihood is
$\ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}$
Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get:
$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$
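The closed form is again a normalized count. A minimal sketch for one node i (encoding each composite parent configuration as a single integer k is an assumption for simplicity):

```python
import numpy as np

def tabular_cpd_mle(child_vals, parent_cfgs, n_states, n_cfgs):
    """theta_ijk = n_ijk / sum_j' n_ij'k for a single node i.

    child_vals[n]  = j: state of X_i in sample n
    parent_cfgs[n] = k: composite parent configuration in sample n
    """
    counts = np.zeros((n_states, n_cfgs))              # family counts n_ijk
    for j, k in zip(child_vals, parent_cfgs):
        counts[j, k] += 1
    return counts / counts.sum(axis=0, keepdims=True)  # normalize over j per k
```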
Learning Partially Observed GMs
[Figure: a two-node model Z → X, with Z unobserved]
The data: {(x(1)), (x(2)), (x(3)), ..., (x(N))}
What if some nodes are not observed? Consider the distribution defined by the directed acyclic GM:
$p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$
[Figure: the network X1 → X2, X1 → X3, X2 → X4, X3 → X4]
Need to compute $p(\mathbf{x}_H \mid \mathbf{x}_V)$, i.e., inference.
Recall: EM Algorithm
A way of maximizing the likelihood function for latent variable models. Finds the MLE of parameters when the original (hard) problem can be broken up into two (easy) pieces:
1. Estimate some "missing" or "unobserved" data from observed data and current parameters.
2. Using this "complete" data, find the maximum likelihood parameter estimates.
Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.
EM for General BNs
while not converged
  % E-step
  for each node i: ESSi := 0   % reset expected sufficient statistics
  for each data sample n:
    do inference with Xn,H
    for each node i: ESSi += <SSi(Xn,i, Xn,pi_i)> under p(Xn,H | Xn,V)
  % M-step
  for each node i: thetai := MLE(ESSi)
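As a concrete instance of this loop, here is a minimal sketch of EM for Example 1 when Z is unobserved (i.e., a unit-variance Gaussian mixture; the initialization scheme and function name are my own choices). The E-step fills in the expected sufficient statistics tau, and the M-step reuses the fully observed MLE formulas with z replaced by tau:

```python
import numpy as np

def em_cond_gaussian(x, M, n_iter=100, seed=0):
    """EM for the Z -> X model with Z hidden; unit variance for simplicity."""
    rng = np.random.default_rng(seed)
    pi = np.ones(M) / M
    mu = rng.choice(x, size=M, replace=False)      # crude initialization
    for _ in range(n_iter):
        # E-step: responsibilities tau_nm = p(z_n^m = 1 | x_n, theta)
        logp = np.log(pi) - 0.5 * (x[:, None] - mu[None, :]) ** 2
        tau = np.exp(logp - logp.max(axis=1, keepdims=True))
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: same formulas as the fully observed MLE, z replaced by tau
        pi = tau.mean(axis=0)                      # pi_m = sum_n tau_nm / N
        mu = (tau.T @ x) / tau.sum(axis=0)         # mu_m = weighted sample mean
    return pi, mu
```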
Example: HMM
Supervised learning: estimation when the "right answer" is known. Examples:
GIVEN: a genomic region x = x1 ... x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
Unsupervised learning: estimation when the "right answer" is unknown. Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: update the parameters $\theta$ of the model to maximize $P(x \mid \theta)$: maximum likelihood (ML) estimation
The Baum-Welch Algorithm
The complete log-likelihood:
$\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \sum_n \left( \log p(y_{n,1}) + \sum_{t=2}^{T} \log p(y_{n,t} \mid y_{n,t-1}) + \sum_{t=1}^{T} \log p(x_{n,t} \mid y_{n,t}) \right)$
The expected complete log-likelihood:
$\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j} + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k}$
The E step:
$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$
$\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$
The M step ("symbolically" identical to MLE):
$\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}, \quad a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}, \quad b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$
Unsupervised ML Estimation
Given x = x1 ... xN for which the true state path y = y1 ... yN is unknown:
EXPECTATION MAXIMIZATION
0. Start with our best guess of a model M, parameters $\theta$.
1. Estimate $A_{ij}$, $B_{ik}$ in the training data: $A_{ij} = \sum_{n,t} \langle y_{n,t-1}^i y_{n,t}^j \rangle$, $B_{ik} = \sum_{n,t} \langle y_{n,t}^i \rangle x_{n,t}^k$. How? Inference.
2. Update $\theta$ according to $A_{ij}$, $B_{ik}$: now a "supervised learning" problem.
3. Repeat 1 & 2 until convergence.
This is called the Baum-Welch algorithm.
We can get to a provably more (or equally) likely parameter set each iteration.
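A minimal runnable sketch of the full loop: a scaled forward-backward pass for the E-step and count normalization for the M-step. The function names, random initialization, and scaling scheme are implementation choices of mine, not from the slides:

```python
import numpy as np

def forward_backward(x, pi, A, B):
    """E-step for one sequence x of 0-based symbols. Returns
    gamma[t, i] = p(y_t = i | x) and xi[t, i, j] = p(y_t = i, y_{t+1} = j | x)."""
    x = np.asarray(x)
    T, M = len(x), len(pi)
    alpha, beta, c = np.zeros((T, M)), np.zeros((T, M)), np.zeros(T)
    alpha[0] = pi * B[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]        # scale to avoid underflow
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    xi = np.zeros((T - 1, M, M))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :] / c[t + 1]
    return gamma, xi

def baum_welch(xs, M, K, n_iter=50, seed=0):
    """Fit start/transition/emission probabilities from unlabeled sequences xs."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(M))
    A = rng.dirichlet(np.ones(M), size=M)
    B = rng.dirichlet(np.ones(K), size=M)
    for _ in range(n_iter):
        pi_c, A_c, B_c = np.zeros(M), np.zeros((M, M)), np.zeros((M, K))
        for x in xs:                               # E-step: expected counts
            x = np.asarray(x)
            gamma, xi = forward_backward(x, pi, A, B)
            pi_c += gamma[0]
            A_c += xi.sum(axis=0)
            for k in range(K):
                B_c[:, k] += gamma[x == k].sum(axis=0)
        pi = pi_c / pi_c.sum()                     # M-step: normalize counts
        A = A_c / A_c.sum(axis=1, keepdims=True)
        B = B_c / B_c.sum(axis=1, keepdims=True)
    return pi, A, B
```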
ML Structural Learning for Completely Observed GMs
[Figure: the five-node alarm network, with edges E → R, E → A, B → A, A → C]
Data:
$(x_1^{(1)}, \ldots, x_n^{(1)})$
$(x_1^{(2)}, \ldots, x_n^{(2)})$
$\ldots$
$(x_1^{(M)}, \ldots, x_n^{(M)})$
Information-Theoretic Interpretation of ML
$\ell(\theta_G, G; D) = \log p(D \mid \theta_G, G) = \log \prod_n \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i \mid \pi_i(G)}) = \sum_i \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i \mid \pi_i(G)})$
$= M \sum_i \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \frac{\mathrm{count}(x_i, \mathbf{x}_{\pi_i(G)})}{M} \log p(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i \mid \pi_i(G)})$
$= M \sum_i \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}(x_i, \mathbf{x}_{\pi_i(G)}) \log \hat{p}(x_i \mid \mathbf{x}_{\pi_i(G)})$
(from a sum over data points to a sum over counts of variable states)
Information-Theoretic Interpretation of ML (con'd)
$\ell(\theta_G, G; D) = \log \hat{p}(D \mid \theta_G, G) = M \sum_i \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}(x_i, \mathbf{x}_{\pi_i(G)}) \log \hat{p}(x_i \mid \mathbf{x}_{\pi_i(G)})$
$= M \sum_i \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}(x_i, \mathbf{x}_{\pi_i(G)}) \log \left( \frac{\hat{p}(x_i, \mathbf{x}_{\pi_i(G)})}{\hat{p}(\mathbf{x}_{\pi_i(G)})\, \hat{p}(x_i)} \cdot \hat{p}(x_i) \right)$
$= M \sum_i \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}(x_i, \mathbf{x}_{\pi_i(G)}) \log \frac{\hat{p}(x_i, \mathbf{x}_{\pi_i(G)})}{\hat{p}(\mathbf{x}_{\pi_i(G)})\, \hat{p}(x_i)} + M \sum_i \sum_{x_i} \hat{p}(x_i) \log \hat{p}(x_i)$
$= M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)}) - M \sum_i \hat{H}(x_i)$
This is a decomposable score and a function of the graph structure.
Structural Search
How many graphs over n nodes? $O(2^{n^2})$
How many trees over n nodes? $O(n!)$
But it turns out that we can find the exact solution of an optimal tree (under MLE)!
Trick: in a tree, each node has only one parent!
This gives the Chow-Liu algorithm.
Chow-Liu Tree Learning Algorithm
Objective function:
$\ell(\theta_G, G; D) = \log \hat{p}(D \mid \theta_G, G) = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)}) - M \sum_i \hat{H}(x_i)$
Since the entropy term does not depend on G, the score is $C(G) = M \sum_i \hat{I}(x_i, x_{\pi_i(G)})$.
Chow-Liu: for each pair of variables $x_i$ and $x_j$,
compute the empirical distribution: $\hat{p}(X_i, X_j) = \frac{\mathrm{count}(x_i, x_j)}{M}$
compute the mutual information: $\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)}$
Define a graph with nodes $x_1, \ldots, x_n$; edge (i, j) gets weight $\hat{I}(X_i, X_j)$.
Chow-Liu Algorithm (con'd)
Objective function: $C(G) = M \sum_i \hat{I}(x_i, x_{\pi_i(G)})$
Chow-Liu gives the optimal tree BN:
compute the maximum weight spanning tree;
for directions in the BN, pick any node as root and do a breadth-first search to define them.
I-equivalence: all orientations of the same spanning tree score identically.
[Figure: three differently rooted orientations of one tree over A, B, C, D, E]
$C(G) = I(A, B) + I(A, C) + I(C, D) + I(C, E)$
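A minimal sketch of the whole procedure for discrete data (assuming NetworkX is available; the helper name is mine):

```python
import numpy as np
import networkx as nx

def chow_liu_tree(D, n_states):
    """Chow-Liu tree from an (M, n) array of discrete samples in {0..n_states-1}."""
    M, n = D.shape
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            joint = np.zeros((n_states, n_states))
            for a, b in zip(D[:, i], D[:, j]):
                joint[a, b] += 1
            joint /= M                             # p_hat(x_i, x_j) = count / M
            pi, pj = joint.sum(axis=1), joint.sum(axis=0)
            nz = joint > 0                         # avoid log(0) on empty cells
            mi = (joint[nz] * np.log(joint[nz] / np.outer(pi, pj)[nz])).sum()
            G.add_edge(i, j, weight=mi)            # edge (i, j) weighted by I_hat
    return nx.maximum_spanning_tree(G)
```

Picking any node r as root and calling nx.bfs_tree(tree, r) then orients the edges into a directed BN.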
Structure Learning for General Graphs
Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2.
Most structure-learning approaches use heuristics that exploit score decomposition. Two heuristics that exploit decomposition in different ways:
greedy search through the space of node orders;
local search over graph structures.
Inferring Gene Regulatory Networks
Network of cis-regulatory pathways:
success stories in sea urchin, fruit fly, etc., from decades of experimental research;
statistical modeling and automated learning have just started.
Gene Expression Profiling by Microarrays
[Figure: a signaling-pathway network shown alongside microarray data: Receptor A (X1), Receptor B (X2), Kinase C (X3), Kinase D (X4), Kinase E (X5), Trans. Factor F (X6), Gene G (X7), Gene H (X8)]
Structure Learning Algorithms
Structural EM (Friedman 1998): the original algorithm.
Sparse Candidate Algorithm (Friedman et al.):
discretizing array signals;
hill-climbing search using local operators: add/delete/swap of a single edge;
feature extraction: Markov relations, order relations;
re-assembling high-confidence sub-networks from features.
Module network learning (Segal et al.):
heuristic search of structure in a "module graph";
module assignment;
parameter sharing;
prior knowledge: possible regulators (TF genes).
[Figure: expression data fed into a learning algorithm that outputs a network]
Learning GM Structure
Learning the best CPDs given a DAG is easy: collect statistics of the values of each node given specific assignments to its parents.
Learning the graph topology (structure) is NP-hard: heuristic search must be applied, and it generally leads to a locally optimal network.
Overfitting: it turns out that richer structures give higher likelihood $P(D \mid G)$ to the data (adding an edge is always preferable):
$P(C \mid A) \le P(C \mid A, B)$: more parameters to fit => more freedom => there always exists a more "optimal" CPD(C).
We prefer simpler (more explanatory) networks: practical scores regularize the likelihood improvement of complex networks.
Learning Graphical Model Structure via Neighborhood Selection
Undirected Graphical Models
Why? Sometimes an UNDIRECTED association graph makes more sense and/or is more informative:
gene expressions may be influenced by unobserved factors that are post-transcriptionally regulated;
[Figure: three small graphs over A, B, C with B latent] the unavailability of the state of B results in a constraint over A and C.
Gaussian Graphical Models
Multivariate Gaussian density:
$p(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right\}$
WOLG, let $\mu = 0$ and $Q = \Sigma^{-1}$:
$p(x_1, \ldots, x_n \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\left\{ -\frac{1}{2} \sum_i q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j \right\}$
We can view this as a continuous Markov random field with potentials defined on every node and edge.
The Covariance and the Precision Matrices
Covariance matrix $\Sigma$: graphical model interpretation?
Precision matrix $Q = \Sigma^{-1}$: graphical model interpretation?
Sparse Precision vs. Sparse Covariance in GGM
[Figure: a five-node GGM with its sparse precision matrix and the corresponding dense covariance matrix]
$Q_{15} = 0 \Leftrightarrow X_1 \perp X_5 \mid X_{\mathrm{nbrs}(1)}$ (or $X_{\mathrm{nbrs}(5)}$)
$\Sigma_{15} = 0 \Leftrightarrow X_1 \perp X_5$
Another Example
How to estimate this MRF? What if p >> n?
MLE does not exist in general!
What about only learning a "sparse" graphical model?
This is possible when s = o(n). Very often it is the structure of the GM that is more interesting ...
Single-Node Conditional
The conditional distribution of a single node i given the rest of the nodes can be written as:
$p(x_i \mid \mathbf{x}_{-i}) = N\left( -\frac{1}{q_{ii}} \sum_{j \ne i} q_{ij} x_j,\; \frac{1}{q_{ii}} \right)$
WOLG, let $\theta_{ij} = -q_{ij} / q_{ii}$. We can write the following conditional auto-regression function for each node:
$x_i = \sum_{j \ne i} \theta_{ij} x_j + \epsilon_i, \qquad \epsilon_i \sim N(0, 1/q_{ii})$
Conditional Independence
From the auto-regression coefficients, let $s_i \overset{\mathrm{def}}{=} \{ j : \theta_{ij} \ne 0 \}$, the neighborhood of node i.
Given an estimate of the neighborhood $s_i$, we have $x_i \perp x_j \mid \mathbf{x}_{s_i}$ for every $j \notin s_i \cup \{i\}$.
Thus the neighborhood $s_i$ defines the Markov blanket of node i.
Recall: Lasso
$\hat{\theta} = \arg\min_\theta \| \mathbf{y} - X\theta \|_2^2 + \lambda \|\theta\|_1$
Graph Regression
Lasso: neighborhood selection. Regress each node on all the other nodes; the nonzero lasso coefficients give its estimated neighborhood (a sketch follows below).
[Figure: one node regressed on the rest of the graph]
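A minimal sketch of this node-wise lasso (assuming scikit-learn; the helper name, the fixed regularization level, and the OR symmetrization rule are my choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, lam=0.1):
    """Lasso-regress each column of X (n samples x p variables) on the others;
    the nonzero coefficients define that node's estimated neighborhood."""
    n, p = X.shape
    nbrs = {}
    for i in range(p):
        others = [j for j in range(p) if j != i]
        fit = Lasso(alpha=lam).fit(X[:, others], X[:, i])
        nbrs[i] = [others[k] for k in np.flatnonzero(fit.coef_)]
    # OR rule: keep edge (i, j) if either node selected the other
    return {(min(i, j), max(i, j)) for i in nbrs for j in nbrs[i]}
```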
Graph Regression
[Figure]
Graph Regression
It can be shown that, given iid samples, and under several technical conditions (e.g., "irrepresentability"), the recovered structure is "sparsistent" even when p >> n.
Consistency
Theorem: for the graphical regression algorithm, under certain verifiable conditions (omitted here for simplicity), the recovered neighborhoods are consistent.
Note that from this theorem one should see that the regularizer is not actually used to introduce an "artificial" sparsity bias, but is a device to ensure consistency under finite-data and high-dimension conditions.
Learning (Sparse) GGMs
Multivariate Gaussian over all continuous expressions:
$p([x_1, \ldots, x_n]^T) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right\}$
The precision matrix $Q = \Sigma^{-1}$ reveals the topology of the (undirected) network.
Learning algorithm: covariance selection. We want a sparse matrix Q.
As shown in the previous slides, we can use L1-regularized linear regression to obtain a sparse estimate of the neighborhood of each variable.
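Alternatively, the whole sparse precision matrix can be estimated at once with the graphical lasso, in the spirit of Friedman et al. [Biostatistics 08] on the next slide. A sketch assuming scikit-learn, with random placeholder data:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

X = np.random.default_rng(0).normal(size=(200, 10))  # placeholder: n=200, p=10
model = GraphicalLasso(alpha=0.1).fit(X)             # L1-penalized Gaussian MLE
Q = model.precision_                                 # sparse estimate of Sigma^-1
edges = [(i, j) for i in range(10) for j in range(i + 1, 10)
         if abs(Q[i, j]) > 1e-8]                     # nonzeros = graph edges
```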
Recent Trends in GGM
Covariance selection (the classical method):
Dempster [1972]: sequentially pruning the smallest elements of the precision matrix;
Drton and Perlman [2008]: improved statistical tests for pruning.
Serious limitations in practice: breaks down when the covariance matrix is not invertible.
L1-regularization based methods (hot!):
Meinshausen and Bühlmann [Ann. Stat. 06]: LASSO regression for neighborhood selection;
Banerjee [JMLR 08]: block sub-gradient algorithm for finding the precision matrix;
Friedman et al. [Biostatistics 08]: efficient fixed-point equations based on a sub-gradient algorithm.
Structure learning is possible even when # variables > # samples.
Learning Ising Models (i.e., Pairwise MRFs)
Assuming the nodes are discrete and the edges are weighted, then for a sample $\mathbf{x}_d$ we have
$p(\mathbf{x}_d \mid \Theta) = \exp\left\{ \sum_{i \in V} \theta_{ii} x_{d,i} + \sum_{(i,j) \in E} \theta_{ij} x_{d,i} x_{d,j} - A(\Theta) \right\}$
It can be shown, following the same logic, that we can use L1-regularized logistic regression to obtain a sparse estimate of the neighborhood of each variable in the discrete case.
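A minimal sketch of that discrete-case recipe (assuming scikit-learn and that every column of the binary data X takes both values; the helper name and fixed regularization are mine):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighborhoods(X, lam=0.1):
    """L1-regularized logistic regression of each binary node on the rest;
    nonzero coefficients give a sparse neighborhood estimate."""
    n, p = X.shape
    nbrs = {}
    for i in range(p):
        others = [j for j in range(p) if j != i]
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / lam)
        clf.fit(X[:, others], X[:, i])
        nbrs[i] = [others[k] for k in np.flatnonzero(clf.coef_[0])]
    return nbrs
```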