module networks sushmita roy bmi/cs 576 [email protected] nov 18 th & 20th, 2014

34
Module networks Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576 [email protected] Nov 18 th & 20th, 2014

Upload: meredith-harmon

Post on 04-Jan-2016

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Module networks

Sushmita RoyBMI/CS 576

www.biostat.wisc.edu/[email protected] 18th & 20th, 2014

Page 2: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

RECAP

• Probabilistic graphical models (PGMs) provide a natural representation of molecular networks

• Learning PGMs from data requires us to solve two problems – Parameter estimation given a graph structure– Structure learning which requires us to learn both structure and parameters

• Bayesian networks are a type of PGMs popular for representing molecular networks

• Learning Bayesian networks from gene expression data requires additional considerations– Sparse candidate algorithm: heuristics to reduce the number of candidate

parents– Bootstrap to assess model confidence– Module networks

Page 3: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

What you should know

• Motivation of module networks• What is a module network?• Compare and contrast module network

learning algorithm to other network inference algorithms– E.g. Sparse candidate

Page 4: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Module Networks

• Motivation: – Most complex systems have too many variables– Not enough data to robustly learn networks– Large networks are hard to interpret

• Key idea: Group similarly behaving variables into “modules” and learn the same parents and parameters for each module

• Relevance to gene regulatory networks– Genes that are co-expressed are likely regulated in

similar ways

Segal et al 2005

Page 5: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Definition of a module

• Statistical definition (specific to module networks by Segal 2005)– A set of random variables that share a statistical

model• Biological definition of a module– Set of genes that are co-expressed and co-

regulated

Page 6: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

An expression module

Set of genes that behave similarly across conditions

Modules

Gasch & Eisen, 2002

Genes

Genes

Genes

Page 7: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Key questions of Module Networks

• How to represent the Conditional Probability Distributions (CPD) for children?– Regression Tree

• How to learn module networks?

Page 8: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Defining a Module Network

• A probabilistic graphical model over N random variables

• Set of module variables M1.. MK

• Module assignments A that specifies the module (1-to-K) for each Xi

• CPD per module P(Mj|PaMj), PaMj are parents of module Mj

– Each variable Xi in Mj has the same conditional distribution

Page 9: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Bayesian network vs Module network

Each variable takes three values: UP, DOWN, SAME

Page 10: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Bayesian network vs Module network

• Bayesian network– Different CPD per random variable– Learning only requires to search for parents

• Module network– CPD per module• Same CPD for all random variables in the same module

– Learning requires parent search and module membership assignment

Page 11: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Learning a Module Network

• Given training dataset , fixed number of modules (K)

• Learn– Module assignments A of each variable to a

module – The parents of each module

Page 12: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Score of a Module networkModule network

Data

K: number of modules, Xj : jth module PaMj Parents of module Mj

Likelihood of module j

Page 13: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Module network learning algorithm

Page 14: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Module assignment search

• Happens in two places• Module initialization– Interpret as clustering of the random variables

• Module re-assignment

Page 15: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Module initialization as clustering of variables

for module network

Page 16: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Module re-assignment

• Must preserve the acyclic graph structure• Must improve score• Module re-assignment happens using a

sequential update procedure:– Update only one variable at a time– The change in score of moving a variable from one

module to another while keeping the other variables fixed

Page 17: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Module re-assignment via sequential update

Page 18: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Representing the Conditional probability distribution

• Xi are continuous variables

• How to represent the distribution of Xi given the state of its parents?

• How to capture context-specific dependencies?

• Module networks use a regression tree

Page 19: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Modeling the relationship between regulators and targets

• suppose we have a set of (8) genes that all have in their upstream regions the same activator/repressor binding sites

Page 20: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

A regression tree

• A rooted binary tree T• Each node in the tree is either an interior node

or a leaf node• Interior nodes are labeled with a binary test

Xi<u, u is a real number observed in the data• Leaf nodes are associated with univariate

distributions of the child

Page 21: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

A regression tree to capture a CPD

X1 > e1

X2 > e2

YES

NO

NO YESLeaf

Interior node

X3

X1 X2

Expression of gene represented by X3 modeled using Gaussians at each leaf node

e1, e2 are values seen in the data

Page 22: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

An example regression tree for a Module network

Page 23: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

A very simple regression tree

X2

X3

e1 e2

X2 > e1

X2 > e2

YESNO

NO YESX3

Page 24: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Algorithm for growing a regression tree

• Input: dataset D, child variable Xi, candidate parents Ci of Xi

• Output: Tree T• Initialize T to a leaf node, estimated from all

samples of Xi

• While not converged– For every leaf node l in T

• Find with the best split at l• If split improves score

– add two leaf nodes, i and j below l– Update samples and parameters associated with , i and j

Page 25: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Learning a regression tree• Assume we are searching for the parents of a variable X3 and it

already has two parents X1 and X2

• X4 will be considered using “split” operations of existing leaf nodes

X1 > e1

X2 > e2

YES

NO

NO YES

X4 > e3

NO YES

X1 > e1

X2 > e2

YES

NO

NO YES

N1 N2N3

Nl: Gaussian associated with leaf l

N2N3

N4N5

Page 26: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Convergence in regression tree

• Depth of tree• Improvement in score• Maximum number of parents• Minimum number of samples per leaf node

Page 27: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Assessing the value of using Module Networks

• Using simulated data– Generate data from a known module network – Known module network was in turn learned from real data

• 10 modules, 500 variables

– Evaluate using• Test data likelihood• Recovery of true parent-child relationships are recovered in learned

module network

• Using gene expression data– External validation of modules (Gene ontology, motif

enrichment)– Cross-check with literature

Page 28: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Test data likelihood

Each line type represents size of training data

10 Modules is the best for almost all training data set sizes

Page 29: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Recovery of graph structure

Page 30: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Application of Module networks to yeast expression data

Segal et al, Regev, Pe’er, Gasch 2005

Page 31: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Module networks has better performance than simple Bayesian network

Gain in test data likelihood over Bayesian network using expression data

Page 32: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

The Respiration and Carbon Module

Regulation tree

Page 33: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Global View of Modules

• modules for common processes often share common– regulators– binding site motifs

Page 34: Module networks Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Nov 18 th & 20th, 2014

Summary

• Module networks– A type of Bayesian network– Identifies modules (sets of similarly behaving random

variables) and learns parents for each module– Conditional probability distributions capture “rules” of

regulatory relationships– Learning requires inferring parent->module

relationships and module assignments• In practice give more realistic networks compared

to Bayesian networks