network analysis for computational biology
DESCRIPTION
Séminaire INSERM, I2C, Toulouse October 11th, 2013TRANSCRIPT
Network analysis for computational biology
Nathalie Villa-Vialaneix - [email protected]://www.nathalievilla.org
IUT STID (Carcassonne) & SAMM (Université Paris 1)
Séminaire INSERM - 11 octobre 2011
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 1 / 19
Outline
1 An introduction to networks/graphs
2 Analysis of the e�ect of low-calorie diet on obese women
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 2 / 19
An introduction to networks/graphs
Outline
1 An introduction to networks/graphs
2 Analysis of the e�ect of low-calorie diet on obese women
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 3 / 19
An introduction to networks/graphs
What is a network/graph? réseau/grapheMathematical object used to model relational data between entities.
A relation between two entities is modeled by an edgearête
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 4 / 19
An introduction to networks/graphs
What is a network/graph? réseau/grapheMathematical object used to model relational data between entities.
The entities are called the nodes or the verticesn÷uds/sommets
A relationbetween two entities is modeled by an edgearête
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 4 / 19
An introduction to networks/graphs
What is a network/graph? réseau/grapheMathematical object used to model relational data between entities.
A relation between two entities is modeled by an edgearête
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 4 / 19
An introduction to networks/graphs
ExamplesBibliographic network
nodes gene, TF, enzyme...
edges a �relationship� between two nodes that is already reported inthe litterature
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 5 / 19
An introduction to networks/graphs
ExamplesNetwork inferred from expression data
nodes genesedges a �strong direct co-expression� between two nodes observed
in the gene expression data
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 5 / 19
An introduction to networks/graphs
Standard issues associated with networks
Inference
Giving expression data, how to build a graph whose edges represent thedirect links between genes?Example: co-expression networks built from microarray/RNAseq data (nodes = genes;edges = signi�cant �direct links� between expressions of two genes)
Graph mining (examples)
1 Network visualization: nodes are not a priori associated to a givenposition. How to represent the network in a meaningful way?
2 Network clustering: identify �communities� (groups of nodes that are
densely connected and share a few links (comparatively) with the other groups)
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 6 / 19
An introduction to networks/graphs
Standard issues associated with networks
Inference
Giving expression data, how to build a graph whose edges represent thedirect links between genes?
Graph mining (examples)
1 Network visualization: nodes are not a priori associated to a givenposition. How to represent the network in a meaningful way?
Random positionsPositions aiming at representingconnected nodes closer
2 Network clustering: identify �communities� (groups of nodes that are
densely connected and share a few links (comparatively) with the other groups)
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 6 / 19
An introduction to networks/graphs
Standard issues associated with networks
Inference
Giving expression data, how to build a graph whose edges represent thedirect links between genes?
Graph mining (examples)
1 Network visualization: nodes are not a priori associated to a givenposition. How to represent the network in a meaningful way?
2 Network clustering: identify �communities� (groups of nodes that are
densely connected and share a few links (comparatively) with the other groups)
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 6 / 19
An introduction to networks/graphs
Network inference
Data: large scale gene expression data
individuals
n ' 30/50
X =
. . . . . .
. . X ji . . .
. . . . . .
︸ ︷︷ ︸variables (genes expression), p'103/4
What we want to obtain: a network with
• nodes: genes;
• edges: signi�cant and direct co-expression between two genes (tracktranscription regulations)
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 7 / 19
An introduction to networks/graphs
Advantages of inferring a network from large scaletranscription data
1 over raw data: focuses on the strongest direct relationships:irrelevant or indirect relations are removed (more robust) and the dataare easier to visualize and understand.Expression data are analyzed all together and not by pairs.
2 over bibliographic network: can handle interactions with yetunknown (not annotated) genes and deal with data collected in aparticular condition.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 8 / 19
An introduction to networks/graphs
Advantages of inferring a network from large scaletranscription data
1 over raw data: focuses on the strongest direct relationships:irrelevant or indirect relations are removed (more robust) and the dataare easier to visualize and understand.Expression data are analyzed all together and not by pairs.
2 over bibliographic network: can handle interactions with yetunknown (not annotated) genes and deal with data collected in aparticular condition.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 8 / 19
An introduction to networks/graphs
Using correlations: relevance network[Butte and Kohane, 1999,Butte and Kohane, 2000]
First (naive) approach: calculate correlations between expressions for allpairs of genes, threshold the smallest ones and build the network.
�Correlations� Thresholding Graph
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 9 / 19
An introduction to networks/graphs
But correlation is not causality...
strong indirect correlationy z
x
Networks are built using partial correlations, i.e., correlations betweengene expressions knowing the expression of all the other genes(residual correlations).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 10 / 19
An introduction to networks/graphs
But correlation is not causality...
strong indirect correlationy z
x
set.seed(2807); x <- runif(100)
y <- 2*x+1+rnorm(100,0,0.1); cor(x,y); [1] 0.9988261
z <- 2*x+1+rnorm(100,0,0.1); cor(x,z); [1] 0.998751
cor(y,z); [1] 0.9971105
] Partial correlation
cor(lm(y∼x)$residuals,lm(z∼x)$residuals) [1] -0.1933699
Networks
are built using partial correlations, i.e., correlations between geneexpressions knowing the expression of all the other genes (residualcorrelations).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 10 / 19
An introduction to networks/graphs
But correlation is not causality...
strong indirect correlationy z
x
set.seed(2807); x <- runif(100)
y <- 2*x+1+rnorm(100,0,0.1); cor(x,y); [1] 0.9988261
z <- 2*x+1+rnorm(100,0,0.1); cor(x,z); [1] 0.998751
cor(y,z); [1] 0.9971105
] Partial correlation
cor(lm(y∼x)$residuals,lm(z∼x)$residuals) [1] -0.1933699
Networks
are built using partial correlations, i.e., correlations between geneexpressions knowing the expression of all the other genes (residualcorrelations).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 10 / 19
An introduction to networks/graphs
But correlation is not causality...
strong indirect correlationy z
x
Networks are built using partial correlations, i.e., correlations betweengene expressions knowing the expression of all the other genes(residual correlations).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 10 / 19
An introduction to networks/graphs
Visualization tools help understand the graphmacro-structurePurpose: How to display the nodes in a meaningful and aesthetic way?
Standard approach: force directed placement algorithms (FDP)algorithmes de forces (e.g., [Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
iterative algorithm until stabilization of the vertex positions.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 11 / 19
An introduction to networks/graphs
Visualization tools help understand the graphmacro-structurePurpose: How to display the nodes in a meaningful and aesthetic way?Standard approach: force directed placement algorithms (FDP)algorithmes de forces (e.g., [Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
iterative algorithm until stabilization of the vertex positions.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 11 / 19
An introduction to networks/graphs
Visualization tools help understand the graphmacro-structurePurpose: How to display the nodes in a meaningful and aesthetic way?Standard approach: force directed placement algorithms (FDP)algorithmes de forces (e.g., [Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
iterative algorithm until stabilization of the vertex positions.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 11 / 19
An introduction to networks/graphs
Visualization tools help understand the graphmacro-structurePurpose: How to display the nodes in a meaningful and aesthetic way?Standard approach: force directed placement algorithms (FDP)algorithmes de forces (e.g., [Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
iterative algorithm until stabilization of the vertex positions.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 11 / 19
An introduction to networks/graphs
Visualization tools help understand the graphmacro-structurePurpose: How to display the nodes in a meaningful and aesthetic way?Standard approach: force directed placement algorithms (FDP)algorithmes de forces (e.g., [Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
iterative algorithm until stabilization of the vertex positions.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 11 / 19
An introduction to networks/graphs
Visualization software• package igraph1 [Csardi and Nepusz, 2006] (staticrepresentation with useful tools for graph mining)
• free software Gephi2 (interactive software, supports zoomingand panning)
1http://igraph.sourceforge.net/
2http://gephi.org
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 12 / 19
An introduction to networks/graphs
Visualization software• package igraph1 [Csardi and Nepusz, 2006] (staticrepresentation with useful tools for graph mining)
• free software Gephi2 (interactive software, supports zoomingand panning)
1http://igraph.sourceforge.net/
2http://gephi.org
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 12 / 19
An introduction to networks/graphs
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertexpopularity.
2 vertex betweenness centralité: number of shortest paths between allpairs of vertices that pass through the vertex. Betweenness is acentrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
An introduction to networks/graphs
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertexpopularity.
2 vertex betweenness centralité: number of shortest paths between allpairs of vertices that pass through the vertex. Betweenness is acentrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
The orange node's degree is equal to 2, its betweenness to 4.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
An introduction to networks/graphs
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertexpopularity.
2 vertex betweenness centralité: number of shortest paths between allpairs of vertices that pass through the vertex. Betweenness is acentrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
An introduction to networks/graphs
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertexpopularity.
2 vertex betweenness centralité: number of shortest paths between allpairs of vertices that pass through the vertex. Betweenness is acentrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
An introduction to networks/graphs
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertexpopularity.
2 vertex betweenness centralité: number of shortest paths between allpairs of vertices that pass through the vertex. Betweenness is acentrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
An introduction to networks/graphs
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertexpopularity.
2 vertex betweenness centralité: number of shortest paths between allpairs of vertices that pass through the vertex. Betweenness is acentrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
An introduction to networks/graphs
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertexpopularity.
2 vertex betweenness centralité: number of shortest paths between allpairs of vertices that pass through the vertex. Betweenness is acentrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
An introduction to networks/graphs
Vertex clustering classi�cationCluster vertexes into groups that are densely connected and share a fewlinks (comparatively) with the other groups. Clusters are often calledcommunities communautés (social sciences) or modules modules(biology).Example 1: Natty's facebook network
Node modules are known to be more robust and meaningful than individualrelationships between pairs of nodes.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 14 / 19
An introduction to networks/graphs
Find clusters by modularity optimization modularitéThe modularity [Newman and Girvan, 2004] of the partition(C1, . . . , CK ) is equal to:
Q(C1, . . . , CK ) =1
2m
K∑k=1
∑xi ,xj∈Ck
(Wij − Pij)
with Pij : weight of a �null model� (graph with the same degree distributionbut no preferential attachment):
Pij =didj2m
with di =12
∑j 6=i Wij .
A good clustering should maximize the modularity:
• Q ↗ when (xi , xj) are in the same cluster and Wij � Pij
• Q ↘ when (xi , xj) are in two di�erent clusters and Wij � Pij
Modularity optimization helps separate hubs (an edge is all the moreimportant in the criterion that it is attached to a vertex with a low degree).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 15 / 19
An introduction to networks/graphs
Find clusters by modularity optimization modularitéThe modularity [Newman and Girvan, 2004] of the partition(C1, . . . , CK ) is equal to:
Q(C1, . . . , CK ) =1
2m
K∑k=1
∑xi ,xj∈Ck
(Wij − Pij)
with Pij : weight of a �null model� (graph with the same degree distributionbut no preferential attachment):
Pij =didj2m
with di =12
∑j 6=i Wij .
A good clustering should maximize the modularity:
• Q ↗ when (xi , xj) are in the same cluster and Wij � Pij
• Q ↘ when (xi , xj) are in two di�erent clusters and Wij � Pij
Modularity optimization helps separate hubs (an edge is all the moreimportant in the criterion that it is attached to a vertex with a low degree).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 15 / 19
An introduction to networks/graphs
Find clusters by modularity optimization modularitéThe modularity [Newman and Girvan, 2004] of the partition(C1, . . . , CK ) is equal to:
Q(C1, . . . , CK ) =1
2m
K∑k=1
∑xi ,xj∈Ck
(Wij − Pij)
with Pij : weight of a �null model� (graph with the same degree distributionbut no preferential attachment):
Pij =didj2m
with di =12
∑j 6=i Wij .
A good clustering should maximize the modularity:
• Q ↗ when (xi , xj) are in the same cluster and Wij � Pij
• Q ↘ when (xi , xj) are in two di�erent clusters and Wij � Pij
Modularity optimization helps separate hubs (an edge is all the moreimportant in the criterion that it is attached to a vertex with a low degree).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 15 / 19
Analysis of the e�ect of low-calorie diet on obese women
Outline
1 An introduction to networks/graphs
2 Analysis of the e�ect of low-calorie diet on obese women
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 16 / 19
Analysis of the e�ect of low-calorie diet on obese women
Data
Experimental protocol
135 obese women and 3 times: before LCD, after a 2-month LCD and 6months later (between the end of LCD and the last measurement, womenare randomized into one of 5 recommended diet groups).At every time step, 221 gene expressions, 28 fatty acids and 15 clinicalvariables (i.e., weight, HDL, ...)
Data pre-processing
At CID3, individuals are split into three groups: weight loss, weightregain and stable weight (groups are not correlated to the diet groupaccording to χ2-test).
Partition similar to the one obtained with weight increase/decrease ratio.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 17 / 19
Analysis of the e�ect of low-calorie diet on obese women
Data
Experimental protocol
135 obese women and 3 times: before LCD, after a 2-month LCD and 6months later (between the end of LCD and the last measurement, womenare randomized into one of 5 recommended diet groups).At every time step, 221 gene expressions, 28 fatty acids and 15 clinicalvariables (i.e., weight, HDL, ...)
Correlations between gene expressions and between a gene expression and afatty acid levels are not of the same order: inference method must bedi�erent inside the groups and between two groups.
Data pre-processing
At CID3, individuals are split into three groups: weight loss, weightregain and stable weight (groups are not correlated to the diet groupaccording to χ2-test).
Partition similar to the one obtained with weight increase/decrease ratio.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 17 / 19
Analysis of the e�ect of low-calorie diet on obese women
Data
Data pre-processing
At CID3, individuals are split into three groups: weight loss, weightregain and stable weight (groups are not correlated to the diet groupaccording to χ2-test).
Partition similar to the one obtained with weight increase/decrease ratio.1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 17 / 19
Analysis of the e�ect of low-calorie diet on obese women
Method
Network inference Clustering Mining
3 inter-dataset networkssparse CCA
merge for each
time step into
one network
3 intra-dataset networkssparse partialcorrelation
5 networksCID1 CID23×CID3
Extract important nodes
Study/Compare clusters
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 18 / 19
Analysis of the e�ect of low-calorie diet on obese women
Brief overview on results5 networks inferred with 264 nodes each:
CID1 CID2 CID3g1 CID3g2 CID3g3
size LCC 244 251 240 259 258density 2.3% 2.3% 2.3% 2.3% 2.3%transitivity 17.2% 11.9% 21.6% 10.6% 10.4%
nb clusters 14 (2-52) 10 (4-52) 11 (2-46) 12 (2-51) 12 (3-54)
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 19 / 19
Analysis of the e�ect of low-calorie diet on obese women
References
Butte, A. and Kohane, I. (1999).
Unsupervised knowledge discovery in medical databases using relevance networks.In Proceedings of the AMIA Symposium, pages 711�715.
Butte, A. and Kohane, I. (2000).
Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements.In Proceedings of the Paci�c Symposium on Biocomputing, pages 418�429.
Csardi, G. and Nepusz, T. (2006).
The igraph software package for complex network research.InterJournal, Complex Systems.
Fruchterman, T. and Reingold, B. (1991).
Graph drawing by force-directed placement.Software, Practice and Experience, 21:1129�1164.
Newman, M. and Girvan, M. (2004).
Finding and evaluating community structure in networks.Physical Review, E, 69:026113.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 19 / 19