Network analysis for computational biology

Nathalie Villa-Vialaneix - [email protected] - http://www.nathalievilla.org

IUT STID (Carcassonne) & SAMM (Université Paris 1)

INSERM Seminar - 11 October 2011

1 October 2013 (INSERM Seminar) Network Nathalie Villa-Vialaneix 1 / 19

Outline

1 An introduction to networks/graphs

2 Analysis of the effect of low-calorie diet on obese women


An introduction to networks/graphs





What is a network/graph? (French: réseau/graphe)

A mathematical object used to model relational data between entities.

The entities are called the nodes or the vertices (French: nœuds/sommets).

A relation between two entities is modeled by an edge (French: arête).



Examples: bibliographic network

nodes: genes, TFs, enzymes...

edges: a "relationship" between two nodes that is already reported in the literature



Examples: network inferred from expression data

nodes: genes

edges: a "strong direct co-expression" between two nodes, observed in the gene expression data



Standard issues associated with networks

Inference

Given expression data, how can we build a graph whose edges represent the direct links between genes?
Example: co-expression networks built from microarray/RNA-seq data (nodes = genes; edges = significant "direct links" between the expressions of two genes).

Graph mining (examples)

1 Network visualization: nodes are not a priori associated with a given position. How can the network be represented in a meaningful way?

2 Network clustering: identify "communities" (groups of nodes that are densely connected and share comparatively few links with the other groups).







[Figure panels: random positions; positions aiming at representing connected nodes closer together]













Network inference

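Data: large-scale gene expression data

$$
X = \begin{pmatrix}
\cdot & \cdots & \cdot \\
\vdots & X_i^j & \vdots \\
\cdot & \cdots & \cdot
\end{pmatrix}
$$

Rows: $n \simeq 30$ to $50$ individuals. Columns: $p \simeq 10^3$ to $10^4$ variables (gene expressions). $X_i^j$ is the expression of gene $j$ for individual $i$.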

What we want to obtain: a network with

• nodes: genes;

• edges: significant and direct co-expression between two genes (track transcription regulations)



Advantages of inferring a network from large-scale transcription data

1 over raw data: focuses on the strongest direct relationships. Irrelevant or indirect relations are removed (more robust) and the data are easier to visualize and understand. Expression data are analyzed all together and not by pairs.

2 over a bibliographic network: can handle interactions with yet unknown (not annotated) genes and deal with data collected in a particular condition.



Using correlations: relevance network [Butte and Kohane, 1999, Butte and Kohane, 2000]

First (naive) approach: calculate the correlations between expressions for all pairs of genes, discard the smallest ones by thresholding, and build the network from the remaining pairs.

"Correlations" → Thresholding → Graph
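The naive approach above can be sketched in a few lines of base R. This is only an illustration on simulated data: the function name `relevance_network`, the threshold of 0.8 and the simulated genes are assumptions made for the example, not values from the talk.

```r
# Relevance-network sketch on simulated data (all names are illustrative).
set.seed(1)
n <- 40; p <- 6
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("gene", 1:p)))
# Make gene2 strongly depend on gene1 so that at least one edge survives.
X[, "gene2"] <- X[, "gene1"] + rnorm(n, sd = 0.1)

relevance_network <- function(X, threshold = 0.8) {
  C <- cor(X)                       # pairwise Pearson correlations
  A <- (abs(C) >= threshold) * 1    # keep only the strongest correlations
  diag(A) <- 0                      # no self-loops
  A                                 # 0/1 adjacency matrix
}

A <- relevance_network(X, threshold = 0.8)
which(A == 1 & upper.tri(A), arr.ind = TRUE)  # edge list (each pair once)
```

The threshold is the only tuning parameter here, and the resulting graph can change a lot with it, which is one practical weakness of this approach.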



But correlation is not causality...

[Figure: three nodes x, y and z, with a strong indirect correlation between y and z]

Networks are built using partial correlations, i.e., correlations between gene expressions knowing the expression of all the other genes (residual correlations).





set.seed(2807); x <- runif(100)
y <- 2*x + 1 + rnorm(100, 0, 0.1); cor(x, y)
# [1] 0.9988261
z <- 2*x + 1 + rnorm(100, 0, 0.1); cor(x, z)
# [1] 0.998751
cor(y, z)
# [1] 0.9971105
# Partial correlation
cor(lm(y ~ x)$residuals, lm(z ~ x)$residuals)
# [1] -0.1933699
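With more than three variables, "knowing all the other genes" means regressing on every remaining variable at once. A common shortcut, sketched below under the assumption that there are more individuals than variables (so that the covariance matrix is invertible), reads all partial correlations off the inverse covariance matrix. The helper name `partial_cor` is illustrative.

```r
# Partial correlations given all other variables, from the inverse covariance.
# Assumes n > p; with p much larger than n, as in expression data,
# a regularized (sparse) estimator is needed instead.
partial_cor <- function(X) {
  K <- solve(cov(X))     # precision (concentration) matrix
  P <- -cov2cor(K)       # P[i, j] = -K[i, j] / sqrt(K[i, i] * K[j, j])
  diag(P) <- 1
  P
}

# Same simulated variables as in the example above.
set.seed(2807); x <- runif(100)
y <- 2*x + 1 + rnorm(100, 0, 0.1)
z <- 2*x + 1 + rnorm(100, 0, 0.1)
P <- partial_cor(cbind(x, y, z))
P["y", "z"]  # equals the correlation between the residuals of y ~ x and z ~ x
```

With three variables, the only "other" variable for the pair (y, z) is x, so this value coincides with the residual-based computation.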




Visualization tools help understand the graph macro-structure

Purpose: how to display the nodes in a meaningful and aesthetic way?

Standard approach: force-directed placement (FDP) algorithms (French: algorithmes de forces), e.g., [Fruchterman and Reingold, 1991]

• attractive forces: similar to springs along the edges

• repulsive forces: similar to electric forces between all pairs of vertices

The algorithm iterates until the vertex positions stabilize.
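The two kinds of forces can be made concrete with a simplified force-directed layout in base R. This is a rough sketch in the spirit of Fruchterman and Reingold, not their exact algorithm: the function name `fr_layout`, the ideal edge length `k`, the cooling schedule and the example graph are all assumptions chosen for illustration.

```r
# Simplified force-directed placement for an undirected 0/1 adjacency matrix A.
fr_layout <- function(A, n_iter = 300, seed = 1) {
  set.seed(seed)
  n <- nrow(A)
  pos <- matrix(runif(2 * n, -1, 1), ncol = 2)  # random initial positions
  k <- 1 / sqrt(n)       # target edge length (assumed scale)
  temp <- 0.1            # maximum displacement per iteration
  for (it in seq_len(n_iter)) {
    disp <- matrix(0, n, 2)
    for (i in seq_len(n)) {
      delta <- matrix(pos[i, ], n, 2, byrow = TRUE) - pos  # vectors from j to i
      d <- pmax(sqrt(rowSums(delta^2)), 1e-9)              # distances, no /0
      f <- k^2 / d - A[i, ] * d^2 / k  # repulsion (all pairs) minus attraction (edges)
      f[i] <- 0                        # a vertex exerts no force on itself
      disp[i, ] <- colSums(delta / d * f)
    }
    len <- pmax(sqrt(rowSums(disp^2)), 1e-9)
    pos <- pos + disp / len * pmin(len, temp)  # move, capped by the temperature
    temp <- temp * 0.98                        # cooling: smaller and smaller moves
  }
  pos
}

# Example graph: two triangles joined by one edge.
A <- matrix(0, 6, 6)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
A[edges] <- 1; A[edges[, 2:1]] <- 1
pos <- fr_layout(A)
plot(pos, pch = 19, xlab = "", ylab = "")
segments(pos[edges[, 1], 1], pos[edges[, 1], 2],
         pos[edges[, 2], 1], pos[edges[, 2], 2])
```

The cooling step is what makes the positions stabilize: without it, vertices keep oscillating around their equilibrium.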






Visualization software

• package igraph [Csardi and Nepusz, 2006] (static representation with useful tools for graph mining): http://igraph.sourceforge.net/

• free software Gephi (interactive software, supports zooming and panning): http://gephi.org




Extracting important nodes

1 vertex degree (French: degré): number of edges adjacent to a given vertex. Vertices with a high degree are called hubs: the degree measures the vertex popularity.

2 vertex betweenness (French: centralité): number of shortest paths between all pairs of vertices that pass through the vertex. Betweenness is a centrality measure (vertices with a large betweenness are the most likely to disconnect the network if removed).
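Both measures can be computed by hand on a small graph. The sketch below uses base R only; the function names are illustrative, and the betweenness routine follows the usual convention (an assumption here) that when several shortest paths join two vertices, each one counts for an equal fraction. The example graph is a simple path on five vertices, chosen only because the result is easy to check.

```r
vertex_degree <- function(A) rowSums(A != 0)

# Betweenness for an unweighted, undirected graph (Brandes-style accumulation).
betweenness_unweighted <- function(A) {
  n <- nrow(A)
  bet <- numeric(n)
  for (s in seq_len(n)) {
    dist <- rep(-1, n); dist[s] <- 0      # BFS distances from s (-1 = unseen)
    sigma <- numeric(n); sigma[s] <- 1    # number of shortest paths from s
    preds <- vector("list", n)            # predecessors on shortest paths
    visited <- integer(0)
    queue <- s
    while (length(queue) > 0) {
      v <- queue[1]; queue <- queue[-1]
      visited <- c(visited, v)
      for (w in which(A[v, ] != 0)) {
        if (dist[w] < 0) { dist[w] <- dist[v] + 1; queue <- c(queue, w) }
        if (dist[w] == dist[v] + 1) {
          sigma[w] <- sigma[w] + sigma[v]
          preds[[w]] <- c(preds[[w]], v)
        }
      }
    }
    delta <- numeric(n)                   # dependency of s on each vertex
    for (w in rev(visited)) {
      for (v in preds[[w]]) {
        delta[v] <- delta[v] + sigma[v] / sigma[w] * (1 + delta[w])
      }
      if (w != s) bet[w] <- bet[w] + delta[w]
    }
  }
  bet / 2   # each unordered pair of end points was counted twice
}

# Path graph 1 - 2 - 3 - 4 - 5
A <- matrix(0, 5, 5)
for (i in 1:4) A[i, i + 1] <- A[i + 1, i] <- 1
vertex_degree(A)           # 1 2 2 2 1
betweenness_unweighted(A)  # 0 3 4 3 0
```

The middle vertex has degree 2 only, yet the highest betweenness: removing it splits the path in two, which is the sense in which betweenness identifies vertices likely to disconnect the network.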







The orange node's degree is equal to 2, its betweenness to 4.



Vertex clustering (French: classification)

Cluster vertices into groups that are densely connected and share comparatively few links with the other groups. Clusters are often called communities (French: communautés; social sciences) or modules (French: modules; biology).

Example 1: Natty's Facebook network

Node modules are known to be more robust and meaningful than individual relationships between pairs of nodes.



Find clusters by modularity optimization (French: modularité)

The modularity [Newman and Girvan, 2004] of the partition $(C_1, \ldots, C_K)$ is equal to:

$$Q(C_1, \ldots, C_K) = \frac{1}{2m} \sum_{k=1}^{K} \sum_{x_i, x_j \in C_k} (W_{ij} - P_{ij})$$

with $P_{ij}$: weight of a "null model" (graph with the same degree distribution but no preferential attachment):

$$P_{ij} = \frac{d_i d_j}{2m}$$

with $d_i = \sum_{j \neq i} W_{ij}$ the degree of $x_i$ and $m = \frac{1}{2} \sum_i d_i$ the total edge weight.

A good clustering should maximize the modularity:

• $Q$ ↗ when $(x_i, x_j)$ are in the same cluster and $W_{ij} \gg P_{ij}$

• $Q$ ↘ when $(x_i, x_j)$ are in two different clusters and $W_{ij} \gg P_{ij}$

Modularity optimization helps separate hubs: the lower the degree of the vertices an edge is attached to, the more that edge matters in the criterion.
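As a small worked example, the modularity of a given partition can be evaluated directly from the weight matrix. The sketch below assumes the usual Newman-Girvan convention: $d_i$ is the full (weighted) degree of vertex $i$, $m$ is half the sum of the degrees, and the inner sum runs over all ordered pairs in a cluster, including $i = j$. The function name `modularity_Q` and the example graph are illustrative.

```r
# Modularity of a partition for a symmetric weight matrix W with zero diagonal.
modularity_Q <- function(W, clusters) {
  d <- rowSums(W)                           # vertex degrees d_i
  m <- sum(W) / 2                           # total edge weight
  P <- outer(d, d) / (2 * m)                # null-model weights P_ij
  same <- outer(clusters, clusters, "==")   # TRUE when i and j share a cluster
  sum((W - P)[same]) / (2 * m)
}

# Two triangles joined by one edge (7 edges in total).
W <- matrix(0, 6, 6)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
W[edges] <- 1; W[edges[, 2:1]] <- 1

modularity_Q(W, c(1, 1, 1, 2, 2, 2))  # one cluster per triangle: 5/14
modularity_Q(W, rep(1, 6))            # everything in one cluster: 0
```

Putting each triangle in its own cluster gives a clearly positive modularity, while the trivial one-cluster partition gives zero, since observed and expected weights then cancel out exactly.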


Analysis of the effect of low-calorie diet on obese women





Data

Experimental protocol

135 obese women are measured at 3 time points: before the low-calorie diet (LCD), after a 2-month LCD, and 6 months later (between the end of the LCD and the last measurement, the women are randomized into one of 5 recommended diet groups).
At every time step, 221 gene expressions, 28 fatty acids and 15 clinical variables (e.g., weight, HDL, ...) are recorded.

Correlations between gene expressions and correlations between a gene expression and a fatty acid level are not of the same order: the inference method must be different within a group of variables and between two groups of variables.

Data pre-processing

At CID3, individuals are split into three groups: weight loss, weight regain and stable weight (the groups are not correlated to the diet group according to a χ² test).

The partition is similar to the one obtained with the weight increase/decrease ratio.


Method

Network inference → Clustering → Mining

Network inference:

• 3 inter-dataset networks (sparse CCA)

• 3 intra-dataset networks (sparse partial correlation)

• merged, for each time step, into one network

• 5 networks: CID1, CID2, 3 × CID3

Mining:

• Extract important nodes

• Study/compare clusters



Brief overview of results

5 networks inferred with 264 nodes each:

               CID1        CID2        CID3g1      CID3g2      CID3g3
size LCC       244         251         240         259         258
density        2.3%        2.3%        2.3%        2.3%        2.3%
transitivity   17.2%       11.9%       21.6%       10.6%       10.4%
nb clusters    14 (2-52)   10 (4-52)   11 (2-46)   12 (2-51)   12 (3-54)
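For readers who want to reproduce this kind of summary on their own graph, here is a base R sketch of three of the indicators. It assumes the usual definitions, which the slides do not spell out: LCC is the largest connected component, density is the number of edges divided by the number of possible edges, and transitivity is the fraction of connected triplets that are closed into triangles. The function name `graph_summary` and the toy graph are illustrative.

```r
# Summary statistics for an undirected 0/1 adjacency matrix with zero diagonal.
graph_summary <- function(A) {
  n <- nrow(A)
  density <- sum(A) / (n * (n - 1))   # edges present / edges possible
  d <- rowSums(A)
  closed <- sum((A %*% A) * A)        # 6 x number of triangles
  triplets <- sum(d * (d - 1))        # 2 x number of connected triplets
  transitivity <- closed / triplets
  comp <- integer(n); k <- 0          # component labels found by BFS
  for (s in seq_len(n)) {
    if (comp[s] == 0) {
      k <- k + 1; comp[s] <- k; queue <- s
      while (length(queue) > 0) {
        v <- queue[1]; queue <- queue[-1]
        nb <- which(A[v, ] != 0 & comp == 0)  # unvisited neighbours
        comp[nb] <- k; queue <- c(queue, nb)
      }
    }
  }
  c(size_LCC = max(table(comp)), density = density, transitivity = transitivity)
}

# Two triangles joined by one edge, plus one isolated vertex.
A <- matrix(0, 7, 7)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
A[edges] <- 1; A[edges[, 2:1]] <- 1
graph_summary(A)
```

On the toy graph the largest component has 6 of the 7 vertices, a third of the possible edges are present, and 60% of the connected triplets are closed.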



References

Butte, A. and Kohane, I. (1999). Unsupervised knowledge discovery in medical databases using relevance networks. In Proceedings of the AMIA Symposium, pages 711-715.

Butte, A. and Kohane, I. (2000). Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In Proceedings of the Pacific Symposium on Biocomputing, pages 418-429.

Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems.

Fruchterman, T. and Reingold, E. (1991). Graph drawing by force-directed placement. Software, Practice and Experience, 21:1129-1164.

Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69:026113.

