Stochastic Block Models of Mixed Membership


Page 1: Stochastic Block Models of Mixed Membership

School of Computer Science

Stochastic Block Models of Mixed Membership

Edo Airoldi 1,2, Dave Blei 2, Steve Fienberg 1, Eric Xing 1

1 Carnegie-Mellon University & 2 Princeton University

SAMSI, High Dimensional Inference and Random Matrices, September 17th, 2006

Page 2: Stochastic Block Models of Mixed Membership


The Scientific Problem

• Protein-protein interactions in Yeast

• Different studies test protein interactions with different technologies (precision)

Expression graphs

Interaction graphs

Page 3: Stochastic Block Models of Mixed Membership


M = 871 nodes; M² ≈ 750K entries

The Data: Interaction Graphs

• M proteins in a graph (nodes)
• M² observations on pairs of proteins
  – Edges are random quantities, Y[n,m]

• Interactions are not independent
  – Interacting proteins form a protein complex

• T graphs on the same set of proteins
• Partial annotations for each protein, X[n]

Page 4: Stochastic Block Models of Mixed Membership


The Scientific Problems

• What are stable protein complexes?
  – They carry out many cellular processes
  – A protein may belong to several complexes

• How many are there?

• How do stable protein complexes interact?
  – Test hypotheses (inform new analyses)
  – Learn complex-to-complex interaction patterns

Page 5: Stochastic Block Models of Mixed Membership


More Network Data

[Figure: example networks — disease spread, social network, food web, electronic circuit, Internet]

Page 6: Stochastic Block Models of Mixed Membership


An Abstraction of the Data

• A collection of unipartite graphs: G_1:T = ( Y_1:T, N )

• Integer, real, or multivariate edge weights: Y_t = { Y_t[n,m] : n, m ∈ N }

• Node-specific (multivariate) attributes: X_1:T = { X_t[n] : n ∈ N }

• Partially observable Y_1:T and X_1:T
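The data abstraction above can be sketched as a small container type. This is a hypothetical illustration, not the authors' code; the class name and fields are assumptions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GraphCollection:
    """T unipartite graphs over a common node set of size M, with
    node-specific attributes; missing entries can be encoded as np.nan."""
    Y: list   # T arrays, each M x M (edge weights Y_t[n, m])
    X: list   # T arrays, each M x d (node attributes X_t[n])

    @property
    def num_graphs(self):
        return len(self.Y)

    @property
    def num_nodes(self):
        return self.Y[0].shape[0]

# toy example: T = 2 binary graphs over M = 4 nodes
rng = np.random.default_rng(0)
M = 4
G = GraphCollection(
    Y=[rng.integers(0, 2, size=(M, M)).astype(float) for _ in range(2)],
    X=[np.zeros((M, 3)) for _ in range(2)],
)
```

Partial observability simply means some entries of the `Y` and `X` arrays are missing rather than measured.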

Page 7: Stochastic Block Models of Mixed Membership


The Challenge

• Given the data abstraction and the goals of the analysis

• Can we posit a rich class of models that is instrumental for thinking about the scientific problems we face, and amenable to theoretical analysis?

Page 8: Stochastic Block Models of Mixed Membership


Modeling Ideas

• Hierarchical Bayes
  – Latent variables encode semantic elements
  – Assume structure on observable-latent elements

• Combination of two classes of models:

  1. Models of mixed membership

  2. Network models (block models)

Mixed membership models + block models = stochastic block models of mixed membership

Page 9: Stochastic Block Models of Mixed Membership


Graphical Model Representation

[Figure: graphical model combining the mixed-membership and stochastic-blocks components]

Page 10: Stochastic Block Models of Mixed Membership


A Hierarchical Likelihood

[Figure: observed interactions y_ij between nodes i and j; latent mixed-membership vectors π_i, π_j over K = 3 groups; latent group-to-group interaction patterns B, e.g. B_23 = 0.9]

Pr( y_ij = 1 | π_i, π_j, B ) = π_iᵀ B π_j
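The edge likelihood Pr( y_ij = 1 | π_i, π_j, B ) = π_iᵀ B π_j is a bilinear form and is easy to check numerically. A minimal sketch (the concrete values of B and the membership vectors are illustrative, not from the data):

```python
import numpy as np

def edge_prob(pi_i, pi_j, B):
    """Pr(y_ij = 1 | pi_i, pi_j, B) = pi_i' B pi_j, the bilinear form
    from the hierarchical likelihood."""
    return float(pi_i @ B @ pi_j)

K = 3
B = np.full((K, K), 0.05)
B[1, 2] = 0.9                       # the slide's B_23 = 0.9 (0-indexed here)
pi_i = np.array([0.0, 1.0, 0.0])    # node i fully in group 2
pi_j = np.array([0.0, 0.0, 1.0])    # node j fully in group 3
p = edge_prob(pi_i, pi_j, B)        # recovers B_23 = 0.9
```

With pure (one-hot) memberships the bilinear form picks out a single entry of B; with genuinely mixed memberships it averages the block probabilities weighted by both nodes' group proportions.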

Page 11: Stochastic Block Models of Mixed Membership


More Modeling Issues

• Technical :: Sparsity
  – Introduce a parameter that modulates the relative importance of ones and zeros (binary edges) in the cost function that drives the clustering

• Biological :: Ribosomes & Distress
  – Some protein complexes act like hubs because they are involved, e.g., in protein production or cell recovery (the Y2H technology is invasive)

Page 12: Stochastic Block Models of Mixed Membership


Large Scale Computation

• Masses of data
  – 750K observations in a small problem (M = 871)
  – 2.5M observations with M = 1578
  – 3M expression measurements for 6K genes/proteins in Yeast

• Variational inference [Jordan et al., 2001]
  – A naïve implementation does not work
  – We develop a novel “nested” variational algorithm

Page 13: Stochastic Block Models of Mixed Membership


Example: A Scientific Question

• Do PPI contain information about functions?

[Figure: raw data → model → approximate posterior on membership vectors, compared against functional annotations; example protein: YLD014W]

Page 14: Stochastic Block Models of Mixed Membership


Interactions in Yeast (MIPS)

• Do PPI contain information about functions?

[Figure: estimated membership vector of protein YLD014W over functional categories 1–15]

Page 15: Stochastic Block Models of Mixed Membership


Results: Identifiability

• In this example we map latent groups to known functional categories

[Figure: mapping of latent groups to functional categories, for proteins with known vs. unknown annotations]

Page 16: Stochastic Block Models of Mixed Membership


Results: Functional Annotations

Page 17: Stochastic Block Models of Mixed Membership


Results: Mixed Membership


• The estimated membership vectors support the mixed membership assumption

Page 18: Stochastic Block Models of Mixed Membership


Results: Stochastic Block Model

Page 19: Stochastic Block Models of Mixed Membership


General Bayesian Formulation

• Assumptions for unipartite graphs
  – Population: existence of K sub-populations
  – Latent variable: mixed-membership vectors π[n] ~ D
  – Subject: exchangeable edges given blocks & memberships, Y[n,m] ~ f( · | π[n]ᵀ B π[m] )
  – Sampling scheme: the T graphs are IID

• Additional data, e.g., attributes, annotations
  – Integrated model formulation (descriptive/predictive)
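The generative assumptions above can be turned into a short sampler. This is a sketch under stated assumptions, not the authors' implementation: D is taken to be a symmetric Dirichlet, f is Bernoulli with per-pair group indicators, and the sparsity parameter from the earlier slide is folded in as a down-weighting factor (1 − ρ) on the edge probability (one common formulation):

```python
import numpy as np

def sample_mmsb(M, K, alpha, B, rho=0.0, seed=0):
    """Sample one binary directed graph from a mixed-membership block model.

    pi[n]      ~ Dirichlet(alpha)                    (membership vector, length K)
    z_ij, z_ji ~ Multinomial(1, pi[i]), Multinomial(1, pi[j])
    y_ij       ~ Bernoulli((1 - rho) * z_ij' B z_ji)
    """
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(alpha, size=M)          # M x K membership vectors
    Y = np.zeros((M, M), dtype=int)
    for i in range(M):
        for j in range(M):
            if i == j:
                continue                       # no self-edges
            z_i = rng.multinomial(1, pi[i])    # groups taken for this pair
            z_j = rng.multinomial(1, pi[j])
            p = (1.0 - rho) * float(z_i @ B @ z_j)
            Y[i, j] = rng.binomial(1, p)
    return pi, Y

# toy run: K = 3 groups with strong within-group interaction
K = 3
B = np.full((K, K), 0.05)
np.fill_diagonal(B, 0.8)
pi, Y = sample_mmsb(M=30, K=K, alpha=np.ones(K) * 0.5, B=B, rho=0.1, seed=42)
```

Note the exchangeability assumption in action: edges are conditionally independent given the membership vectors and B, which is what makes the T graphs IID given the latent structure.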

Page 20: Stochastic Block Models of Mixed Membership


Variational Algorithms

• Naïve algorithm:
  – init ( γ_i ∀i, φ_ij ∀ij )
  – while ( Δ log-lik > ε )
      update ( φ_ij ∀ij )
      update ( γ_i ∀i )

• Nested algorithm:
  – init ( γ_i ∀i )
  – while ( Δ log-lik > ε )
      loop over pairs ij
        • init φ_ij
        • while ( Δ log-lik > ε ) update φ_ij
        • partially update ( γ_i, γ_j )

We trade space for time but …
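The space/time trade-off can be made concrete with a toy accounting sketch of the two schedules. The `phi`/`gamma` updates below are stand-ins, not the real variational formulas; only the loop structure follows the slide:

```python
def naive_schedule(M):
    """One outer sweep of the naive scheme: all pairwise variational
    parameters phi_ij are stored and updated together, then every gamma_i.
    Returns (peak number of phi blocks in memory, gamma updates)."""
    phi = {(i, j): None for i in range(M) for j in range(M) if i != j}
    for key in phi:
        phi[key] = "updated"             # stand-in for the real phi update
    gamma_updates = M                    # one full gamma update per node
    return len(phi), gamma_updates

def nested_schedule(M):
    """One outer sweep of the nested scheme: phi_ij lives only inside the
    loop over the pair (i, j); after its inner loop converges, gamma_i and
    gamma_j are partially updated.
    Returns (peak phi blocks in memory, partial gamma updates)."""
    peak_phi = 1                         # only one phi block alive at a time
    partial_gamma_updates = 0
    for i in range(M):
        for j in range(M):
            if i == j:
                continue
            phi_ij = "init"              # inner loop would iterate phi_ij here
            partial_gamma_updates += 2   # gamma_i and gamma_j
    return peak_phi, partial_gamma_updates
```

The naive scheme holds all M(M−1) pair parameters at once, while the nested scheme holds one at a time at the cost of many more (partial) gamma updates per sweep.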

Page 21: Stochastic Block Models of Mixed Membership


Variational Algorithms for MMSB

On a single machine we empirically observed faster convergence (offsetting the extra computation) and more stable paths to convergence.

[Figure: convergence paths of the naïve vs. nested algorithms]

Page 22: Stochastic Block Models of Mixed Membership


Take Home Points

• The Bayesian formulation is integral to the biology

• A novel class of models that combines mixed membership for soft clustering & network models for dependent data

• Latent aspects reveal patterns that correlate with, and help predict, functional processes in the cell

• The current implementation allows fast inference on large matrices through a variational approximation; there is considerable opportunity to improve both the computation and the efficiency of the approximation

Page 23: Stochastic Block Models of Mixed Membership


• Data & Problems: Gavin et al. (2002) Nature; Ho et al. (2002) Nature; Mewes et al. (2004) Nucleic Acids Research; Krogan et al. (2006) Nature.

• Mixed Membership Models
  – Pritchard et al. (2000); Erosheva (2002); Rosenberg et al. (2002); Blei et al. (2003); Xing et al. (2003ab); Erosheva et al. (2004); Airoldi et al. (2005); Blei & Lafferty (2006); Xing et al. (2006)

• Stochastic network models
  – Wasserman et al. (1980, 1994, 1996); Fienberg et al. (1985); Frank & Strauss (1986); Nowicki & Snijders (2001); Hoff et al. (2002); Airoldi et al. (2006)

• More material on the Web at: http://www.cs.cmu.edu/~eairoldi/

• ICML Workshop on “Statistical Network Analysis: Models, Issues and New Directions” on June 29 at Carnegie Mellon, Pittsburgh PA: http://nlg.cs.cmu.edu/