TRANSCRIPT
School of Computer Science
Probabilistic Graphical Models
Gaussian graphical models and Ising models: modeling networks
Eric Xing, Lecture 10, February 20, 2017
Reading: see class website
Network Research
[Figure: example networks: a social network, the Internet, a gene regulatory network]
Where do networks come from?
● The Jesus network
Evolving networks
[Figure: three snapshots of a senate voting network: March 2005, January 2006, August 2006]
● Can I get his vote?
● Cooperativity, antagonism, cliques, … over time?
Evolving networks
[Figure: a sequence of network snapshots at t = 1, 2, 3, …, T]
Recall: ML Structural Learning for completely observed GMs
[Figure: a directed graph over nodes E, R, B, A, C, to be learned from data]
● Data: $(x_1^{(1)}, \ldots, x_n^{(1)}),\ (x_1^{(2)}, \ldots, x_n^{(2)}),\ \ldots,\ (x_1^{(M)}, \ldots, x_n^{(M)})$
Two “Optimal” approaches
● “Optimal” here means the employed algorithms are guaranteed to return a structure that maximizes the objective (e.g., log-likelihood)
● Many heuristics used to be popular, but they provide no guarantee of attaining optimality or interpretability, and some do not even have an explicit objective
  ● E.g., structured EM, module networks, greedy structural search, etc.
● We will learn two classes of algorithms with guaranteed structure learning, which are likely the only known methods enjoying such guarantees, but they apply only to certain families of graphs:
  ● Trees: the Chow-Liu algorithm (this lecture)
  ● Pairwise MRFs: covariance selection, neighborhood selection (later)
Key Idea: network inference as parameter estimation
[Figure: a graph over x1, …, x4, the corresponding parameter matrix, and the graph recovered from its nonzero pattern]
Model: Pairwise Markov Random Fields
● Nodal states can be either discrete (Ising/Potts model), continuous (Gaussian graphical model), or heterogeneous
● The parameter matrix encodes the graph structure
[Figure: a six-node undirected graph whose edges correspond to nonzero entries of the parameter matrix]
Recall: Multivariate Gaussian
● Multivariate Gaussian density:

$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

● WLOG: let $\mu = 0$ and $Q = \Sigma^{-1}$
● We can view this as a continuous Markov random field with potentials defined on every node and edge:

$$p(x_1, x_2, \ldots, x_n \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\left\{ -\tfrac{1}{2} \sum_{i} q_{ii} x_i^2 - \sum_{i < j} q_{ij} x_i x_j \right\}$$
Gaussian Graphical Model
[Figure: a microarray data matrix (samples by genes, grouped by cell type); the GGM encodes dependencies among genes]
Precision Matrix Encodes Non-Zero Edges in Gaussian Graphical Model
● An edge corresponds to a non-zero element of the precision matrix
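To make the correspondence concrete, here is a minimal numpy sketch (the 4-node chain and its precision values are made up for illustration): zeros in Q mark absent edges, while the covariance Σ = Q⁻¹ is generally dense.

```python
import numpy as np

# Hypothetical sparse precision matrix for a 4-node chain x1 - x2 - x3 - x4
Q = np.array([[ 2.0, -0.8,  0.0,  0.0],
              [-0.8,  2.0, -0.8,  0.0],
              [ 0.0, -0.8,  2.0, -0.8],
              [ 0.0,  0.0, -0.8,  2.0]])

Sigma = np.linalg.inv(Q)            # covariance: dense, few exact zeros
adjacency = np.abs(Q) > 1e-10       # non-zero pattern of Q
np.fill_diagonal(adjacency, False)  # ignore the diagonal

print(np.round(Sigma, 3))           # marginal dependence everywhere
print(adjacency.astype(int))        # recovers exactly the chain's edges
```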
Markov versus Correlation Network
● A correlation network is based on the covariance matrix
● A GGM is a Markov network based on the precision matrix
  ● Conditional independence / partial correlation coefficients are a more sophisticated dependence measure
● With small sample size, the empirical covariance matrix cannot be inverted
Sparsity
● One common assumption to make: sparsity
● Makes empirical sense: genes are assumed to interact with only small groups of other genes
● Makes statistical sense: learning is now feasible in high dimensions with small sample size
[Figure: a dense precision matrix versus a sparse one]
Network Learning with the LASSO
● Assume the network is a Gaussian graphical model
● Perform a LASSO regression of each target node on all the other nodes
Network Learning with the LASSO
● LASSO can select the neighborhood of each node
L1 Regularization (LASSO)
● A convex relaxation
● Enforces sparsity!
● Can be written in a constrained form or an equivalent Lagrangian form (both shown below)
[Figure: regression of y onto x1, …, xn with coefficients β1, …, βn]
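For reference, the two standard forms the slide names (written here in generic LASSO notation; $t$ and $\lambda$ are the usual tuning parameters):

$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le t \qquad \text{(constrained form)}$$

$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \qquad \text{(Lagrangian form)}$$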
Theoretical Guarantees
● Assumptions
  ● Dependency condition: relevant covariates are not overly dependent
  ● Incoherence condition: a large number of irrelevant covariates cannot be too correlated with the relevant covariates
  ● Strong concentration bounds: sample quantities converge to their expected values quickly
● If these assumptions are met, LASSO will asymptotically recover the correct subset of covariates that are relevant
Network Learning with the LASSO
● Repeat this for every node
● Form the total edge set
Consistent Structure Recovery [Meinshausen and Bühlmann 2006; Wainwright 2009]
● If the regularization parameter and the sample size satisfy the conditions of the theorem, then with high probability the estimated edge set matches the true edge set
Why does this algorithm work?
● What is the intuition behind graphical regression?
  ● Continuous nodal attributes
  ● Discrete nodal attributes
● Are there other algorithms?
● More general scenarios: non-iid samples and evolving networks
● Case study
Multivariate Gaussian
● Multivariate Gaussian density:

$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

● A joint Gaussian:

$$p\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\middle|\, \mu, \Sigma \right) = \mathcal{N}\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)$$

● How to write down $p(x_2)$, $p(x_1 \mid x_2)$ or $p(x_2 \mid x_1)$ using the block elements in $\mu$ and $\Sigma$?
● Formulas to remember:

$$p(x_2) = \mathcal{N}(x_2 \mid m_2, V_2), \qquad m_2 = \mu_2, \quad V_2 = \Sigma_{22}$$

$$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}), \qquad m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2), \quad V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$$
The matrix inverse lemma
● Consider a block-partitioned matrix:

$$M = \begin{bmatrix} E & F \\ G & H \end{bmatrix}$$

● First we diagonalize $M$:

$$\begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} E & F \\ G & H \end{bmatrix} \begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix} = \begin{bmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{bmatrix}$$

● Schur complement: $M/H = E - FH^{-1}G$
● Then we invert, using the formula $XYZ = W \Rightarrow Y^{-1} = Z W^{-1} X$:

$$M^{-1} = \begin{bmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{bmatrix} = \begin{bmatrix} E^{-1} + E^{-1} F (M/E)^{-1} G E^{-1} & -E^{-1} F (M/E)^{-1} \\ -(M/E)^{-1} G E^{-1} & (M/E)^{-1} \end{bmatrix}$$

where $M/E = H - G E^{-1} F$.
● Matrix inverse lemma (equate the top-left blocks of the two expressions):

$$\left( E - F H^{-1} G \right)^{-1} = E^{-1} + E^{-1} F \left( H - G E^{-1} F \right)^{-1} G E^{-1}$$
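A quick numpy check of the block-inverse formula on a random, well-conditioned matrix (purely illustrative):

```python
import numpy as np

# A made-up 4x4 matrix split into 2x2 blocks E, F, G, H
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4)) + 4 * np.eye(4)   # well-conditioned example
E, F = M[:2, :2], M[:2, 2:]
G, H = M[2:, :2], M[2:, 2:]

Hinv = np.linalg.inv(H)
S = E - F @ Hinv @ G                  # Schur complement M/H
Sinv = np.linalg.inv(S)

# Block formula for M^{-1} in terms of (M/H)^{-1}
Minv_blocks = np.block([
    [Sinv,              -Sinv @ F @ Hinv],
    [-Hinv @ G @ Sinv,   Hinv + Hinv @ G @ Sinv @ F @ Hinv],
])
assert np.allclose(Minv_blocks, np.linalg.inv(M))   # matches the direct inverse
```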
The covariance and the precision matrices
● Partition out the first variable and apply the block-inverse formula:

$$\Sigma = \begin{bmatrix} \sigma_{11} & \vec{\sigma}_1^{\,T} \\ \vec{\sigma}_1 & \Sigma_{-1} \end{bmatrix} \quad\Longrightarrow\quad Q = \Sigma^{-1} = \begin{bmatrix} q_{11} & \vec{q}_1^{\,T} \\ \vec{q}_1 & Q_{-1} \end{bmatrix} = \begin{bmatrix} \big(\sigma_{11} - \vec{\sigma}_1^{\,T} \Sigma_{-1}^{-1} \vec{\sigma}_1\big)^{-1} & -q_{11}\, \vec{\sigma}_1^{\,T} \Sigma_{-1}^{-1} \\ -q_{11}\, \Sigma_{-1}^{-1} \vec{\sigma}_1 & \Sigma_{-1}^{-1} + q_{11}\, \Sigma_{-1}^{-1} \vec{\sigma}_1 \vec{\sigma}_1^{\,T} \Sigma_{-1}^{-1} \end{bmatrix}$$
Single-node Conditional
● The conditional distribution of a single node $i$ given the rest of the nodes can be written in closed form. WLOG let $\mu = 0$ and take $i = 1$, with $\Sigma$ partitioned as on the previous slide:

$$p(x_1 \mid x_{-1}) = \mathcal{N}\left( x_1 \,\middle|\, \vec{\sigma}_1^{\,T} \Sigma_{-1}^{-1} x_{-1}, \;\; \sigma_{11} - \vec{\sigma}_1^{\,T} \Sigma_{-1}^{-1} \vec{\sigma}_1 \right)$$

● By the block inverse of the previous slide, the same conditional can be written in terms of the precision matrix:

$$p(x_1 \mid x_{-1}) = \mathcal{N}\left( x_1 \,\middle|\, -\tfrac{1}{q_{11}} \vec{q}_1^{\,T} x_{-1}, \;\; \tfrac{1}{q_{11}} \right)$$
Conditional auto-regression
● From $p(x_i \mid x_{-i}) = \mathcal{N}\big( x_i \mid -\tfrac{1}{q_{ii}} \vec{q}_i^{\,T} x_{-i}, \tfrac{1}{q_{ii}} \big)$,
● we can write the following conditional auto-regression function for each node:

$$x_i = \sum_{j \ne i} \beta_{ij} x_j + \epsilon_i, \qquad \beta_{ij} = -\frac{q_{ij}}{q_{ii}}, \qquad \epsilon_i \sim \mathcal{N}\big(0, \tfrac{1}{q_{ii}}\big)$$

● Neighborhood estimation can thus be based on the auto-regression coefficients: $\beta_{ij} \ne 0$ exactly when $q_{ij} \ne 0$ (a numeric check follows below)
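A quick numpy check of this identity on a made-up 3-node precision matrix: the population regression coefficients of the first node on the rest equal $-q_{1j}/q_{11}$, and they vanish exactly where Q has zeros.

```python
import numpy as np

# Hypothetical GGM: sparse, diagonally dominant (hence PD) precision matrix
Q = np.array([[ 2.0, -0.6,  0.0],
              [-0.6,  2.0, -0.5],
              [ 0.0, -0.5,  2.0]])
Sigma = np.linalg.inv(Q)

# Population regression of x_0 on (x_1, x_2): beta = Sigma_{-0,-0}^{-1} Sigma_{-0,0}
beta = np.linalg.solve(Sigma[1:, 1:], Sigma[1:, 0])

# Matches the auto-regression coefficients -q_{0j} / q_{00}
assert np.allclose(beta, -Q[0, 1:] / Q[0, 0])
print(beta)   # zero exactly where Q[0, j] == 0 (no edge between nodes 0 and 2)
```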
Conditional independence
● From the auto-regression above, $x_i$ depends on the remaining nodes only through its neighbors
● Given an estimate of the neighborhood $s_i$, we have:

$$x_i \perp x_j \mid x_{s_i} \quad \text{for all } j \notin s_i \cup \{i\}$$

● Thus the neighborhood $s_i$ defines the Markov blanket of node $i$
Recent trends in GGM
● Covariance selection (classical method)
  ● Dempster [1972]: sequentially pruning the smallest elements of the precision matrix
  ● Drton and Perlman [2008]: improved statistical tests for pruning
  ● Serious limitation in practice: breaks down when the covariance matrix is not invertible
● L1-regularization based methods (hot!)
  ● Meinshausen and Bühlmann [Ann. Stat. 06]: used LASSO regression for neighborhood selection
  ● Banerjee [JMLR 08]: block sub-gradient algorithm for finding the precision matrix
  ● Friedman et al. [Biostatistics 08]: efficient fixed-point equations based on a sub-gradient algorithm
  ● …
  ● Structure learning is possible even when # variables > # samples
The Meinshausen-Bühlmann (MB) algorithm
● Solve a separate LASSO regression for every single variable (a sketch follows below):
  ● Step 1: Pick one variable (say the k-th)
  ● Step 2: Treat it as “y”, and the rest as “z”
  ● Step 3: Solve the LASSO regression problem of y on z
  ● Step 4: Connect the k-th node to those nodes having nonzero weight in the solution w
● Caveat: the resulting coefficients do not correspond element-wise to the entries of Q
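A minimal sketch of the four steps (assuming scikit-learn is available; the OR-rule for symmetrizing neighborhoods and the fixed penalty are illustrative choices, not prescribed by the slide):

```python
import numpy as np
from sklearn.linear_model import Lasso

def mb_neighborhood_selection(X, lam=0.1):
    """Meinshausen-Buhlmann: one LASSO regression per node.

    X: (n_samples, p) data matrix, assumed standardized.
    Returns a boolean (p, p) adjacency matrix.
    """
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for k in range(p):                        # Step 1: pick variable k
        y = X[:, k]                           # Step 2: treat it as "y"
        Z = np.delete(X, k, axis=1)           # ...and the rest as "z"
        w = Lasso(alpha=lam).fit(Z, y).coef_  # Step 3: LASSO of y on z
        others = np.delete(np.arange(p), k)
        adj[k, others] = w != 0               # Step 4: connect nonzero weights
    return adj | adj.T                        # OR-rule to symmetrize

# Usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
print(mb_neighborhood_selection(X).astype(int))
```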
L1-regularized maximum likelihood learning
● Input: sample covariance matrix S
  ● Assumes standardized data (mean = 0, variance = 1)
  ● S is generally rank-deficient, so its inverse does not exist
● Output: sparse precision matrix Q
  ● Ideally Q would be the inverse of S, but S is not directly invertible
  ● We need to find a sparse matrix that can be thought of as an inverse of S
● Approach: solve an L1-regularized maximum likelihood problem:

$$\hat{Q} = \arg\max_{Q \succ 0} \; \underbrace{\log \det Q - \operatorname{tr}(SQ)}_{\text{log likelihood}} - \underbrace{\lambda \|Q\|_1}_{\text{regularizer}}$$
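This is the graphical lasso objective of Banerjee [JMLR 08] and Friedman et al. [Biostatistics 08]; scikit-learn ships an implementation, so a usage sketch on synthetic data (alpha plays the role of λ above and is chosen arbitrarily here):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Synthetic data from a hypothetical sparse 4-node GGM
rng = np.random.default_rng(1)
Q_true = np.array([[ 2.0, -0.7,  0.0,  0.0],
                   [-0.7,  2.0, -0.7,  0.0],
                   [ 0.0, -0.7,  2.0, -0.7],
                   [ 0.0,  0.0, -0.7,  2.0]])
X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(Q_true), size=500)

model = GraphicalLasso(alpha=0.05).fit(X)   # L1-penalized max likelihood
Q_hat = model.precision_
print(np.round(Q_hat, 2))                   # near-zero where Q_true is zero
```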
From matrix optimization to vector optimization: coupled LASSO for every single variable
● Focus on one row (column) at a time, keeping the others fixed
● The optimization problem for that row/column vector can be shown to be a LASSO problem (L1-regularized quadratic programming)
● Difference from MB's approach: the resulting LASSO problems are coupled
  ● The remaining part of Q is not actually constant; it changes after each LASSO problem is solved (because the entire Q optimizes a single loss function, whereas in MB each LASSO has its own loss function)
● This coupling is essential for stability under noise
Learning Ising Models (i.e., pairwise MRFs)
● Assuming the nodes are discrete (e.g., the voting outcome of a person) and the edges are weighted, then for a sample x we have

$$p(x \mid \Theta) \propto \exp\left\{ \sum_{i} \theta_{i} x_i + \sum_{i < j} \theta_{ij} x_i x_j \right\}$$

● It can be shown that the pseudo-conditional likelihood of node $k$ given the rest reduces to a logistic regression of $x_k$ on $x_{-k}$, with weights $\{\theta_{kj}\}$
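The pseudo-likelihood view turns structure learning into one L1-regularized logistic regression per node; a hedged sketch (the penalty C and OR-rule are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighborhoods(X, C=0.5):
    """One L1-regularized logistic regression per node (pseudo-likelihood).

    X: (n_samples, p) matrix with entries in {0, 1}.
    Returns a boolean (p, p) adjacency matrix, symmetrized by an OR-rule.
    """
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for k in range(p):
        y = X[:, k]                          # node k's states
        Z = np.delete(X, k, axis=1)          # all other nodes
        clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
        adj[k, np.arange(p) != k] = clf.fit(Z, y).coef_.ravel() != 0
    return adj | adj.T
```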
Question: vector-valued nodes
New Problem: Evolving Social Networks
[Figure: three snapshots of a senate voting network: March 2005, January 2006, August 2006]
● Can I get his vote?
● Cooperativity, antagonism, cliques, … over time?
Reverse engineering time-specific "rewiring" networks
[Figure: a Drosophila development time course from T0 to TN; at any given time t*, only n = 1 or some other small number of samples is available]
Inference I [Song, Kolar and Xing, Bioinformatics 09]
● KELLER: Kernel-Weighted L1-regularized Logistic Regression
● Constrained convex optimization, with a kernel-weighted LASSO-style objective
● Estimates time-specific networks one by one, based on "virtual iid" samples
● Could scale to ~10^4 genes, but under stronger smoothness assumptions
Algorithm: nonparametric neighborhood selection
● Conditional likelihood
● Neighborhood selection
● Time-specific graph regression: estimate the neighborhood of each node at time t* by a kernel-weighted, L1-regularized regression, where the kernel weights localize the samples around t* (a sketch follows below)
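A hedged sketch of one time-specific neighborhood fit (the Gaussian kernel, bandwidth, and penalty are illustrative assumptions; sklearn's sample_weight carries the kernel weights that create the "virtual iid" samples):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def keller_neighborhood(X, times, t_star, k, bandwidth=0.1, C=0.5):
    """Estimate the neighborhood of node k at time t_star.

    X: (n_samples, p) binary data; times: timestamps in [0, 1].
    Observations near t_star receive larger kernel weights, acting as
    the "virtual iid" samples for that time point.
    """
    w = np.exp(-0.5 * ((times - t_star) / bandwidth) ** 2)  # Gaussian kernel
    y = X[:, k]                                             # node k as target
    Z = np.delete(X, k, axis=1)                             # remaining nodes
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(Z, y, sample_weight=w)                          # weighted L1 logistic fit
    return clf.coef_.ravel() != 0                           # estimated neighborhood
```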
Structural consistency of KELLER
● Assumptions:
  ● A1: Dependency condition
  ● A2: Incoherence condition
  ● A3: Smoothness condition
  ● A4: Bounded kernel
Theorem [Kolar and Xing, 09]
Inference II [Ahmed and Xing, PNAS 2009; AOAS 2009]
● TESLA: Temporally Smoothed L1-regularized Logistic Regression
● Constrained convex optimization
● Scales to ~5000 nodes, does not need a smoothness assumption, and can accommodate abrupt changes
Temporally Smoothed Graph Regression
● TESLA estimates all time-specific networks jointly (a sketch of the objective follows below)
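A hedged sketch of the general form of the TESLA objective for node $i$ (a per-time logistic loss $\ell$, a sparsity penalty, and a total-variation penalty tying consecutive networks together; $\lambda_1, \lambda_2$ are tuning parameters):

$$\hat{\theta}_i^{1:T} = \arg\min_{\theta_i^{1:T}} \; \sum_{t=1}^{T} \ell\big(\theta_i^{t}; \mathbf{x}^{t}\big) + \lambda_1 \sum_{t=1}^{T} \big\|\theta_i^{t}\big\|_1 + \lambda_2 \sum_{t=2}^{T} \big\|\theta_i^{t} - \theta_i^{t-1}\big\|_1$$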
Modified estimation procedure
● Estimate the block partition on which the coefficient functions are constant (*)
● Estimate the coefficient functions on each block of the partition (**)
Structural Consistency of TESLA [Kolar and Xing, 2009]
● I. It can be shown that, by applying the model-selection results for the LASSO to a temporal-difference transformation of (*), the blocks are estimated consistently
● II. It can further be shown that, by applying the LASSO to (**), the neighborhood of each node on each of the estimated blocks is estimated consistently
● Further advantages of the two-step procedure:
  ● choosing the tuning parameters is easier
  ● the optimization procedure is faster
Senate network – 109th congress
● Voting records from the 109th congress (2005 - 2006)
● There are 100 senators whose votes were recorded on the 542 bills; each vote is a binary outcome
Senate network – 109th congress
[Figure: estimated senate networks in March 2005, January 2006, August 2006]
Senator Chafee
Senator Ben Nelson
[Figure: estimated neighborhoods of Senator Ben Nelson at T = 0.2 and T = 0.8]
Progression and Reversion of Breast Cancer Cells
[Figure: cell types S1 (normal), T4 (malignant), and three revertants T4 revert 1, T4 revert 2, T4 revert 3]
Estimate Neighborhoods Jointly Across All Cell Types
[Figure: a tree over the cell types S1, T4, T4R1, T4R2, T4R3]
● How to share information across the cell types?
Sparsity of Difference
● Penalize differences between the networks of adjacent cell types
[Figure: the cell-type tree over S1, T4, T4R1, T4R2, T4R3, with difference penalties along its edges]
Tree-Guided Graphical Lasso (Treegl)
● Objective: RSS for all cell types + sparsity + sparsity of difference (a sketch follows below)
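A hedged sketch of a Treegl-style objective matching the three labeled terms ($\mathcal{T}$ denotes the set of adjacent cell-type pairs in the tree; $\lambda_1, \lambda_2$ are tuning parameters):

$$\min_{\{\beta^{c}\}} \; \sum_{c} \mathrm{RSS}\big(\beta^{c}\big) + \lambda_1 \sum_{c} \big\|\beta^{c}\big\|_1 + \lambda_2 \sum_{(c, c') \in \mathcal{T}} \big\|\beta^{c} - \beta^{c'}\big\|_1$$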
Network Overview
[Figure: estimated networks for S1 and T4 cells, highlighting EGFR-ITGB1, PI3K-MAPKK, and MMP modules]
Interactions – Biological Processes
● T4 cells vs. S1 cells: increased cell proliferation, growth, signaling, and locomotion
Interactions – Biological Processes
● MMP-T4R cells vs. T4 cells: significantly reduced interactions
Interactions – Biological Processes
● PI3K-MAPKK-T4R cells vs. T4 cells: reduced growth, locomotion, and signaling
Fancier network estimation scenarios
● Dynamic directed (auto-regressive) networks [Song, Kolar and Xing, NIPS 2009]
● Missing data [Kolar and Xing, ICML 2012]
● Multi-attribute data [Kolar, Liu and Xing, JMLR 2013]
Summary
● Gaussian graphical models
  ● The precision matrix encodes the structure
  ● Not estimable when p >> n
● Neighborhood selection
  ● Conditional distributions under GGMs/MRFs
  ● Graphical lasso
  ● Sparsistency
● Time-varying Markov networks
  ● Kernel-reweighting estimators
  ● Total-variation estimators