network analysis for computational biology

DESCRIPTION

INSERM seminar, I2C, Toulouse, October 11th, 2013

TRANSCRIPT

Page 1: Network analysis for computational biology

Network analysis for computational biology

Nathalie Villa-Vialaneix - http://www.nathalievilla.org

IUT STID (Carcassonne) & SAMM (Université Paris 1)

INSERM seminar - October 11, 2013

1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 1 / 19

Outline

1 An introduction to networks/graphs

2 Analysis of the effect of low-calorie diet on obese women

An introduction to networks/graphs

What is a network/graph? (French: réseau/graphe)

A mathematical object used to model relational data between entities.

The entities are called the nodes or the vertices (French: nœuds/sommets).

A relation between two entities is modeled by an edge (French: arête).
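As a minimal sketch of these definitions (the gene names and edges below are made up for illustration and are not taken from the slides), a small undirected graph can be stored in R either as an edge list or as an adjacency matrix:

```r
# Hypothetical toy graph: 4 entities (nodes) and 3 relations (edges).
nodes <- c("geneA", "geneB", "geneC", "geneD")
edges <- data.frame(from = c("geneA", "geneA", "geneC"),
                    to   = c("geneB", "geneC", "geneD"),
                    stringsAsFactors = FALSE)

# Adjacency matrix: A[i, j] = 1 if an edge links node i and node j.
A <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
A[cbind(edges$from, edges$to)] <- 1
A[cbind(edges$to, edges$from)] <- 1   # undirected graph: symmetric matrix

n_edges <- sum(A) / 2                 # each edge is counted twice in A
```

The adjacency matrix is convenient for the small computations sketched later; for large sparse networks an edge list is usually lighter.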

Examples: bibliographic network

nodes: genes, TFs, enzymes...

edges: a "relationship" between two nodes that has already been reported in the literature

Examples: network inferred from expression data

nodes: genes

edges: a "strong direct co-expression" between two nodes, observed in the gene expression data

Standard issues associated with networks

Inference

Given expression data, how can we build a graph whose edges represent the direct links between genes?
Example: co-expression networks built from microarray/RNA-seq data (nodes = genes; edges = significant "direct links" between the expressions of two genes).

Graph mining (examples)

1 Network visualization: nodes are not a priori associated with a given position. How can the network be represented in a meaningful way?

2 Network clustering: identify "communities" (groups of nodes that are densely connected and share comparatively few links with the other groups).

[Figure (network visualization): the same network drawn with random positions, and with positions chosen to place connected nodes closer together]

Network inference

Data: large-scale gene expression data

$$X = \begin{pmatrix} \ddots & \vdots & \\ \cdots & X_i^j & \cdots \\ & \vdots & \ddots \end{pmatrix}$$

with rows = individuals ($n \simeq 30$ to $50$) and columns = variables (gene expressions, $p \simeq 10^3$ to $10^4$).

What we want to obtain: a network with

• nodes: genes;

• edges: significant and direct co-expression between two genes (to track transcription regulations)

Advantages of inferring a network from large-scale transcription data

1 over raw data: the network focuses on the strongest direct relationships; irrelevant or indirect relations are removed (more robust), and the data are easier to visualize and understand. Expression data are analyzed all together rather than pair by pair.

2 over a bibliographic network: the inferred network can handle interactions with as yet unknown (not annotated) genes and deals with data collected in a particular condition.

Using correlations: relevance network [Butte and Kohane, 1999, Butte and Kohane, 2000]

First (naive) approach: compute the correlations between expressions for all pairs of genes, discard the smallest ones by thresholding, and build the network from the pairs that remain.

"Correlations" → Thresholding → Graph
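The three steps above can be sketched in a few lines of base R. This is only an illustration on simulated data: the function name `relevance_network`, the threshold value 0.8 and the use of the absolute Pearson correlation are assumptions, not choices stated in the slides.

```r
set.seed(1)
n <- 40; p <- 6
# Simulated expression matrix: rows = individuals, columns = genes.
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("gene", 1:p)))
X[, 2] <- X[, 1] + rnorm(n, sd = 0.1)   # gene2 is strongly co-expressed with gene1

relevance_network <- function(X, threshold) {
  C <- cor(X)                        # step 1: pairwise correlations
  A <- (abs(C) >= threshold) * 1     # step 2: keep only the strong ones
  diag(A) <- 0                       # no self-loops
  A                                  # step 3: adjacency matrix of the graph
}

A <- relevance_network(X, threshold = 0.8)
```

The result is sensitive to the threshold, which is one reason why this approach is called naive.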

But correlation is not causality...

[Figure: three nodes x, y and z, with a strong indirect correlation between y and z]

set.seed(2807); x <- runif(100)

y <- 2*x + 1 + rnorm(100, 0, 0.1); cor(x, y)   # [1] 0.9988261

z <- 2*x + 1 + rnorm(100, 0, 0.1); cor(x, z)   # [1] 0.998751

cor(y, z)                                      # [1] 0.9971105

# Partial correlation

cor(lm(y ~ x)$residuals, lm(z ~ x)$residuals)  # [1] -0.1933699

Networks are built using partial correlations, i.e., correlations between gene expressions given the expression of all the other genes (residual correlations).
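With more than three variables, all partial correlations can be read off the inverse of the covariance matrix. The sketch below assumes that there are more individuals than variables, so that the covariance matrix is invertible; this is not the case for typical expression data, where sparse or regularized estimators are needed. The function name `partial_cor` is illustrative.

```r
# Partial correlations from the precision matrix K = solve(cov(X)):
# pcor_ij = -K_ij / sqrt(K_ii * K_jj)
partial_cor <- function(X) {
  K <- solve(cov(X))                     # requires n > p
  P <- -K / sqrt(diag(K) %o% diag(K))
  diag(P) <- 1
  P
}

set.seed(2807)
x <- runif(100)
y <- 2 * x + 1 + rnorm(100, 0, 0.1)
z <- 2 * x + 1 + rnorm(100, 0, 0.1)

P <- partial_cor(cbind(x, y, z))
pc_matrix   <- P["y", "z"]                               # from the precision matrix
pc_residual <- cor(resid(lm(y ~ x)), resid(lm(z ~ x)))   # from regression residuals
```

Both routes give the same value for the pair (y, z), which is why partial correlations are also described as residual correlations.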

Visualization tools help understand the graph macro-structure

Purpose: how to display the nodes in a meaningful and aesthetic way?

Standard approach: force-directed placement (FDP) algorithms (French: algorithmes de forces), e.g., [Fruchterman and Reingold, 1991]

• attractive forces: similar to springs along the edges

• repulsive forces: similar to electric forces between all pairs of vertices

The algorithm iterates until the vertex positions stabilize.
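The principle can be sketched in base R as follows. This is a simplified illustration in the spirit of Fruchterman and Reingold, not their exact algorithm: the function name `fdp_layout`, the force formulas (repulsion k^2/d, attraction d^2/k), the cooling schedule and the fixed number of iterations are all assumptions made for the example.

```r
# Minimal force-directed layout for an undirected graph (0/1 adjacency matrix A).
fdp_layout <- function(A, n_iter = 200, seed = 1) {
  set.seed(seed)
  n <- nrow(A)
  pos <- matrix(runif(2 * n, -1, 1), n, 2)   # random initial positions
  k <- sqrt(4 / n)                           # reference edge length (assumed)
  temp <- 0.2                                # maximal move per iteration
  for (it in seq_len(n_iter)) {
    disp <- matrix(0, n, 2)
    for (i in seq_len(n)) {
      delta <- matrix(pos[i, ], n, 2, byrow = TRUE) - pos  # vectors from j to i
      d <- pmax(sqrt(rowSums(delta^2)), 1e-9)
      rep_force <- k^2 / d                   # repulsion between all pairs
      rep_force[i] <- 0
      att_force <- A[i, ] * d^2 / k          # spring-like attraction along edges
      disp[i, ] <- colSums(delta / d * (rep_force - att_force))
    }
    len <- pmax(sqrt(rowSums(disp^2)), 1e-9)
    pos <- pos + disp / len * pmin(len, temp)  # move, capped by the temperature
    temp <- temp * 0.98                        # cooling: smaller and smaller moves
  }
  pos
}

# Toy graph: two triangles joined by one edge.
A <- matrix(0, 6, 6)
e <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
A[e] <- 1; A[e[, 2:1]] <- 1
xy <- fdp_layout(A)
```

Calling `plot(xy)` and drawing segments between linked rows of `xy` displays the layout; in practice a tested implementation from a graph library is preferable.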

Visualization software

• package igraph (http://igraph.sourceforge.net/) [Csardi and Nepusz, 2006] (static representation with useful tools for graph mining)

• free software Gephi (http://gephi.org) (interactive software, supports zooming and panning)

Extracting important nodes

1 vertex degree (French: degré): number of edges incident to a given vertex. Vertices with a high degree are called hubs; the degree measures the vertex popularity.

2 vertex betweenness (French: centralité): number of shortest paths between all pairs of vertices that pass through the vertex. Betweenness is a centrality measure (vertices with a large betweenness are the most likely to disconnect the network if removed).

[Figure: example network] The orange node's degree is equal to 2, its betweenness to 4.
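Both measures can be computed with a short base-R sketch, assuming an unweighted, undirected graph stored as a 0/1 adjacency matrix with a zero diagonal. The function name `betweenness_unweighted` and the 5-node path graph are illustrative and are not the network shown on the slide; betweenness is accumulated over breadth-first searches (Brandes' algorithm).

```r
betweenness_unweighted <- function(A) {
  n <- nrow(A)
  bc <- numeric(n)
  for (s in seq_len(n)) {
    dist <- rep(-1, n); dist[s] <- 0      # BFS distances from s
    sigma <- numeric(n); sigma[s] <- 1    # number of shortest paths from s
    preds <- vector("list", n)            # predecessors on shortest paths
    visited <- integer(0)
    queue <- s
    while (length(queue) > 0) {
      v <- queue[1]; queue <- queue[-1]
      visited <- c(visited, v)
      for (w in which(A[v, ] != 0)) {
        if (dist[w] < 0) { dist[w] <- dist[v] + 1; queue <- c(queue, w) }
        if (dist[w] == dist[v] + 1) {
          sigma[w] <- sigma[w] + sigma[v]
          preds[[w]] <- c(preds[[w]], v)
        }
      }
    }
    delta <- numeric(n)                   # dependency of s on each vertex
    for (w in rev(visited)) {
      for (v in preds[[w]]) {
        delta[v] <- delta[v] + sigma[v] / sigma[w] * (1 + delta[w])
      }
      if (w != s) bc[w] <- bc[w] + delta[w]
    }
  }
  bc / 2                                  # undirected: each pair was counted twice
}

# Path graph 1 - 2 - 3 - 4 - 5
A <- matrix(0, 5, 5)
for (i in 1:4) { A[i, i + 1] <- 1; A[i + 1, i] <- 1 }

deg <- rowSums(A)                  # 1 2 2 2 1
btw <- betweenness_unweighted(A)   # 0 3 4 3 0
```

In this path graph the middle vertex has degree 2 and betweenness 4: it lies on the shortest paths of the pairs (1,4), (1,5), (2,4) and (2,5).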

Vertex clustering (French: classification)

Cluster vertices into groups that are densely connected and share comparatively few links with the other groups. Clusters are often called communities (French: communautés; social sciences) or modules (biology).

Example 1: Natty's facebook network [figure]

Node modules are known to be more robust and meaningful than individual relationships between pairs of nodes.

Find clusters by modularity optimization (French: modularité)

The modularity [Newman and Girvan, 2004] of the partition $(C_1, \ldots, C_K)$ is equal to:

$$Q(C_1, \ldots, C_K) = \frac{1}{2m} \sum_{k=1}^{K} \sum_{x_i, x_j \in C_k} (W_{ij} - P_{ij})$$

with $P_{ij}$: weight of a "null model" (graph with the same degree distribution but no preferential attachment):

$$P_{ij} = \frac{d_i d_j}{2m}$$

with $d_i = \sum_{j \neq i} W_{ij}$ the degree of $x_i$ and $m = \frac{1}{2} \sum_i d_i$ the total edge weight.

A good clustering should maximize the modularity:

• $Q \nearrow$ when $(x_i, x_j)$ are in the same cluster and $W_{ij} \gg P_{ij}$

• $Q \searrow$ when $(x_i, x_j)$ are in two different clusters and $W_{ij} \gg P_{ij}$

Modularity optimization helps separate hubs (an edge weighs all the more in the criterion when it is attached to a vertex with a low degree).
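To make the criterion concrete, here is a small base-R sketch that evaluates Q for a given partition. It assumes the usual Newman-Girvan conventions: d_i is the sum of the weights of the edges attached to x_i, and 2m is the sum of all degrees. The function name `modularity_score` and the toy graph are illustrative.

```r
# Modularity of a partition. W: symmetric weight matrix with zero diagonal;
# cluster: vector giving the cluster label of each vertex.
modularity_score <- function(W, cluster) {
  d <- rowSums(W)                         # vertex degrees d_i
  two_m <- sum(d)                         # 2m
  P <- outer(d, d) / two_m                # null model: P_ij = d_i d_j / (2m)
  same <- outer(cluster, cluster, "==")   # TRUE when x_i and x_j share a cluster
  sum((W - P)[same]) / two_m
}

# Toy graph: two triangles {1,2,3} and {4,5,6} joined by the edge 3-4.
W <- matrix(0, 6, 6)
e <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
W[e] <- 1; W[e[, 2:1]] <- 1

Q_good <- modularity_score(W, c(1, 1, 1, 2, 2, 2))   # 5/14, about 0.357
Q_bad  <- modularity_score(W, c(1, 2, 1, 2, 1, 2))   # negative
Q_one  <- modularity_score(W, rep(1, 6))             # 0 for a single cluster
```

The partition that follows the two triangles gets the highest score, as expected. Evaluating Q is easy; finding the partition that maximizes it is the hard part and requires a dedicated optimization algorithm.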

Analysis of the effect of low-calorie diet on obese women

Data

Experimental protocol

135 obese women observed at 3 time steps: before the low-calorie diet (LCD), after a 2-month LCD, and 6 months later (between the end of the LCD and the last measurement, women are randomized into one of 5 recommended diet groups).

At every time step: 221 gene expressions, 28 fatty acids and 15 clinical variables (e.g., weight, HDL, ...).

Correlations between two gene expressions and correlations between a gene expression and a fatty acid level are not of the same order: the inference method must therefore differ within a group of variables and between two groups.

Data pre-processing

At CID3 (the last time step), individuals are split into three groups: weight loss, weight regain and stable weight (group membership is not associated with the diet group according to a χ² test).

The partition is similar to the one obtained with the weight increase/decrease ratio.
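The independence check mentioned above can be sketched with `chisq.test` from base R. The labels below are simulated at random purely for illustration (balanced groups of 45 and 27 women are an assumption); they are not the study data and the resulting p-value says nothing about the study.

```r
set.seed(42)
# Simulated labels only: NOT the study data.
weight_group <- sample(rep(c("loss", "regain", "stable"), each = 45))
diet_group   <- sample(rep(paste0("diet", 1:5), each = 27))

tab <- table(weight_group, diet_group)   # 3 x 5 contingency table, 135 women
res <- chisq.test(tab)                   # chi-squared test of independence
p_value <- res$p.value
```

A large p-value means that the data give no evidence of an association between the two groupings.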

Method

Network inference → Clustering → Mining

Network inference: 3 intra-dataset networks (sparse partial correlation) and 3 inter-dataset networks (sparse CCA), merged for each time step into one network. This gives 5 networks: CID1, CID2 and 3 × CID3.

Mining: extract important nodes; study/compare clusters.

Brief overview on results

5 networks inferred with 264 nodes each:

              CID1       CID2       CID3g1     CID3g2     CID3g3
size LCC      244        251        240        259        258
density       2.3%       2.3%       2.3%       2.3%       2.3%
transitivity  17.2%      11.9%      21.6%      10.6%      10.4%
nb clusters   14 (2-52)  10 (4-52)  11 (2-46)  12 (2-51)  12 (3-54)
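For reference, two of the indicators in this table can be computed as follows for an unweighted, undirected graph. The sketch assumes the usual definitions (density = proportion of possible edges that are present; transitivity = 3 × number of triangles / number of connected triples), which the slides do not spell out; the function name `graph_summary` and the toy graph are illustrative.

```r
# Density and transitivity of a graph given by a 0/1 adjacency matrix A.
graph_summary <- function(A) {
  n <- nrow(A)
  d <- rowSums(A)
  n_edges <- sum(A) / 2
  triangles <- sum(diag(A %*% A %*% A)) / 6   # each triangle is counted 6 times
  triples <- sum(d * (d - 1) / 2)             # connected triples (paths of length 2)
  c(density = n_edges / (n * (n - 1) / 2),
    transitivity = if (triples > 0) 3 * triangles / triples else NA)
}

# Toy graph: two triangles joined by one edge.
A <- matrix(0, 6, 6)
e <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
A[e] <- 1; A[e[, 2:1]] <- 1

s <- graph_summary(A)   # density = 7/15, transitivity = 0.6
```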

References

Butte, A. and Kohane, I. (1999). Unsupervised knowledge discovery in medical databases using relevance networks. In Proceedings of the AMIA Symposium, pages 711-715.

Butte, A. and Kohane, I. (2000). Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In Proceedings of the Pacific Symposium on Biocomputing, pages 418-429.

Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems.

Fruchterman, T. and Reingold, E. (1991). Graph drawing by force-directed placement. Software: Practice and Experience, 21:1129-1164.

Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69:026113.