Topological Analysis in PPI Networks & Network Motif Discovery

Jin Chen, MSU CSE891-001

2012 Fall

Layout

• Topological properties of real networks
– Degree distribution (power-law & exponential)
– Path distance (small-world, non-small-world)

• Network motif
– Definitions
– Algorithms

WWW has power-law degree distribution

Distribution of links on the www

a) Outgoing links. The tail of the distribution follows $P(k) \approx k^{-\gamma}$, with $\gamma_{out} = 2.45$

b) Incoming links, with $\gamma_{in} = 2.1$

c) Average of the shortest path between two documents as a function of system size

R. Albert, H. Jeong, A.-L. Barabási, Nature 401, 130 (1999)

The degree distribution scales as a power-law
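To make the power-law claim concrete, the sketch below estimates the exponent from a list of node degrees by fitting a straight line to log P(k) against log k. The function names are illustrative assumptions, and a least-squares fit on a log-log plot is only a rough estimator, not the procedure used in the cited paper.

```python
import math
from collections import Counter

def degree_distribution(degrees):
    """Return {k: P(k)} for a list of node degrees (zero-degree nodes ignored)."""
    counts = Counter(k for k in degrees if k > 0)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def loglog_slope(pk):
    """Least-squares slope of log P(k) against log k; a power law gives -gamma."""
    xs = [math.log(k) for k in pk]
    ys = [math.log(p) for p in pk.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic degree list whose counts follow k^-2 exactly (hypothetical data).
degrees = [k for k in range(1, 7) for _ in range(3600 // (k * k))]
gamma = -loglog_slope(degree_distribution(degrees))
```

On real data the fit is usually restricted to the tail of the distribution, since small degrees often deviate from the power law.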

Power grid has exponential degree distribution

R. Albert et al, Phys. Rev. E 69, 025103(R) (2004)

Metabolic networks have a power-law degree distribution

H. Jeong et al., Nature 407, 651 (2000)

[Figure panels: Archaeoglobus fulgidus; E. coli; Caenorhabditis elegans; All]

The regulatory network of E. coli has a power-law out-degree distribution & an exponential in-degree distribution

Shen-Orr et al. Nature Genetics 31, 64 - 68 (2002)

from RegulonDB (Salgado et al. 2006)

The distribution of the number of transcription factors controlling a gene is exponential

The distribution of the number of genes regulated by a transcription factor is power-law with an average of ~5

Small-world networks

• A small-world network is a network in which most nodes are not neighbors of one another, but most nodes can be reached from every other by a small number of hops or steps

• A small-world network is defined as: $L \propto \log N$, where L is the distance between two randomly chosen nodes and N is the number of nodes in the network

• Small-world properties are found in many real-world phenomena
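The definition can be checked numerically by measuring L directly. The sketch below computes the mean shortest-path length with breadth-first search; the adjacency-dict representation and the `ring` helper are assumptions made for illustration.

```python
from collections import deque

def average_path_length(adj):
    """Mean BFS distance over ordered pairs of distinct, mutually reachable nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def ring(n):
    """Hypothetical test network: n nodes on a cycle, each linked to two neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

L_small = average_path_length(ring(4))
L_large = average_path_length(ring(400))
```

A plain ring is a useful counterexample: its L grows roughly linearly with N rather than with log N, so it is not small-world.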

Six degrees of separation

Six degrees of separation = everyone is on average approximately six steps away from any other person on Earth

But if two people are linked only when they knew each other, then the number of degrees of separation between Albert Einstein and Alexander the Great is almost certainly greater than 30

http://en.wikipedia.org/wiki/Six_degrees_of_separation

Relationship between power-law & small-world

• If a network has a degree distribution that can be fit with a power-law distribution, it is taken as a sign that the network is small-world

• But a small-world network does not necessarily have a power-law degree distribution (e.g. a clique)

Robustness

• Barabasi AL hypothesized that the prevalence of small world networks in biological systems may reflect an evolutionary advantage of such an architecture

• One possibility is that small-world networks are more robust to perturbations than other network architectures

• It would provide an advantage to biological systems that are subject to damage by mutation or viral infection

True PPIs fit small-world, false PPIs distributed randomly

• Hypothesis: true PPIs fit the pattern of a small-world network; false PPIs are distributed randomly in the network

• By studying the local cohesiveness for each PPI, true and false PPIs can be separated
– Incorporate a set of clustering coefficient measures of neighborhood cohesiveness
– Look for “network motifs” as an index of how well the PPIs are locally connected

Debra S. Goldberg, Frederick P. Roth (2003). PNAS, 100(8) 4372–4376.

Concept of Network Motif

• “Network Motifs: Simple Building Blocks of Complex Networks”
– Focused on directed, cyclic subgraphs of 3 or 4 nodes in yeast (no self-loops)
– Used exhaustive enumeration and random networks as a comparison

Milo et al. Science (2002) Vol. 298 no. 5594 pp. 824-827

Concept of Network Motif

• Of the 13 possible 3-node networks, one predominates in gene expression networks (feed-forward loop)

• Of the 199 possible 4-node networks, one predominates (bi-fan)

[Figure: feed-forward loop on nodes X, Y, Z; bi-fan on nodes X, Y, Z, W]
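A feed-forward loop is commonly drawn as X regulating Y, Y regulating Z, and X also regulating Z directly. Assuming that edge layout, the sketch below counts such triples by exhaustive enumeration in a directed edge list; it counts non-induced occurrences (extra edges among the three nodes are tolerated), which is a simplification.

```python
def count_feed_forward_loops(edges):
    """Count ordered triples (x, y, z) with edges x->y, y->z and x->z."""
    succ = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            succ.setdefault(u, set()).add(v)
    count = 0
    for x, x_out in succ.items():
        for y in x_out:
            for z in succ.get(y, ()):
                if z != x and z in x_out:
                    count += 1
    return count

# Hypothetical regulatory edges: two feed-forward loops share the X->Y edge.
regulation = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Y", "W"), ("X", "W")]
n_ffl = count_feed_forward_loops(regulation)
```

To call the pattern a motif, this count would still have to be compared with counts in randomized networks.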

Concept of Network Motif

• Efficient sampling algorithm for detecting network motifs
– Focused on directed, cyclic graphs
– Used a sampling approach to estimate motif frequency
– Found motifs of size 6 & 7

Kashtan et al. Bioinformatics (2004) Vol. 20 no. 11 pp. 1746-1758

Problem Definition

• Given a PPI network
– Unlabelled & undirected subgraphs
– Find repeated and unique motifs of size 2 to K (5 to 25)

• Mining Maximal Frequent Subgraphs from Graph Databases (SPIN, FFSM)
– Looks for frequent labelled subgraphs from a database of graphs
– Counts whether a subgraph occurs at least once in a graph

Huan et al. SIGKDD (2004)

Tough problem
1. Number of motifs increases exponentially with size
2. Motif frequency is not known a priori
3. Graph isomorphism has no known polynomial-time solution

Concepts of frequency
• f1: allow arbitrary overlaps of nodes & edges (NOT downward closed!)
• f2: allow overlaps of nodes, but edges must be disjoint
• f3: no overlap allowed (edge- and node-disjoint)
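The three frequency concepts can be compared on a small list of occurrences. In the sketch below each occurrence is given as a list of undirected edges; f1 counts distinct occurrences, while f2 and f3 are computed greedily in input order, so they are lower bounds rather than exact maxima (finding the maximum disjoint set is itself hard). The function name and input format are assumptions.

```python
def frequencies(occurrences):
    """Return (f1, f2, f3) for a list of occurrences, each a list of (u, v) edges.

    f1: distinct occurrences, any overlap allowed.
    f2: greedy count of pairwise edge-disjoint occurrences.
    f3: greedy count of pairwise node-disjoint occurrences.
    """
    occs = [frozenset(frozenset(e) for e in occ) for occ in occurrences]
    f1 = len(set(occs))

    used_edges, f2 = set(), 0
    used_nodes, f3 = set(), 0
    for occ in occs:
        nodes = set().union(*occ)
        if not (occ & used_edges):
            used_edges |= occ
            f2 += 1
        if not (nodes & used_nodes):
            used_nodes |= nodes
            f3 += 1
    return f1, f2, f3

# Occurrences of a 3-node path inside the hypothetical path graph 1-2-3-4-5.
paths = [[(1, 2), (2, 3)], [(2, 3), (3, 4)], [(3, 4), (4, 5)]]
f1, f2, f3 = frequencies(paths)
```

Because f1 counts overlapping occurrences, a larger pattern can be more frequent than one of its sub-patterns, which is why f1 does not support Apriori-style pruning.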

Algorithm parameters

• Input a Protein-Protein Interaction (PPI) network G
– K : maximal motif size
– F : frequency threshold
– S : uniqueness threshold

• Output set U of frequent and unique motifs of size 3 to K

• Since motifs are small (2 to 25 nodes), use adjacency matrices. Further, represent motifs as Canonical Adjacency Matrices (CAM)

Chen et al SIGKDD 2006
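The point of a canonical adjacency matrix is that two isomorphic motifs map to the same code, so isomorphism checks reduce to string comparison. The sketch below is a brute-force stand-in: it tries every node ordering and keeps the lexicographically largest upper-triangle bit string. This encoding is an assumption for illustration and may differ from the CAM construction in the cited paper; it is only practical for very small motifs because it enumerates k! orderings.

```python
from itertools import permutations

def canonical_code(nodes, edges):
    """Largest upper-triangle adjacency bit string over all node orderings."""
    edge_set = {frozenset(e) for e in edges}
    best = ""
    for order in permutations(nodes):
        code = "".join(
            "1" if frozenset((order[i], order[j])) in edge_set else "0"
            for i in range(len(order))
            for j in range(i + 1, len(order))
        )
        best = max(best, code)
    return best

# Two differently labelled 3-node paths and one triangle (hypothetical motifs).
path_a = canonical_code(["a", "b", "c"], [("a", "b"), ("b", "c")])
path_b = canonical_code([1, 2, 3], [(1, 3), (3, 2)])
triangle = canonical_code([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
```

Here the two paths receive the same code and the triangle a different one, which is the property the motif search relies on.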

Find Repeated size-k Trees

• Given a graph G
• Let K = 5 (max motif size)
• Let F = 2 (min frequency)
• Let S = 0.95 (uniqueness threshold)

[Figure: example graph G with nodes 1-5]

Find Repeated size-k Trees

Find all subgraphs of size 2 to 5.

Fig 2. Size 2 to 5 trees: t2, t3, t4_1, t4_2, t5_1, t5_2, t5_3

Find Repeated size-k Trees

Occurrences of t4_1 in G.

[Figure: six copies of G (nodes 1-5), one per occurrence of t4_1]

Find Repeated size-k Trees

Tree   t2   t3   t4_1   t4_2   t5_1   t5_2   t5_3
Freq.  7    13   6      17     1      5      7

F = 2
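Applying the threshold to the frequencies in the table is a one-line filter. The sketch below groups the surviving trees by size, assuming the threshold is inclusive (frequency at least F) and that the size can be read from the tree name; both are assumptions for illustration.

```python
# Frequencies copied from the table; F is the minimum frequency.
tree_freq = {"t2": 7, "t3": 13, "t4_1": 6, "t4_2": 17,
             "t5_1": 1, "t5_2": 5, "t5_3": 7}
F = 2

def frequent_trees_by_size(freq, threshold):
    """Keep trees with frequency >= threshold, grouped by node count."""
    groups = {}
    for name, count in freq.items():
        if count >= threshold:
            size = int(name[1:].split("_")[0])  # "t4_1" -> 4
            groups.setdefault(size, []).append(name)
    return groups

T = frequent_trees_by_size(tree_freq, F)
```

Only t5_1, with a single occurrence, falls below the threshold.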

Find Repeated size-k Trees

Remaining frequent trees

T2 = {t2}
T3 = {t3}
T4 = {t4_1, t4_2}
T5 = {t5_2, t5_3}

Use Repeated Size-k Trees to Partition Graph

Take each graph in Tk and use it to partition G (e.g. T4)

[Figure: five subgraphs of G (nodes 1-5), together labelled GD4]

Perform graph join operation to find repeated size-k graphs

[Figure: trees t4_1 and t4_2]

Perform graph join operation to find repeated size-k graphs

Generate all k-node, (k-1)-edge graphs from each graph in Tk (e.g. 4-node, 3-edge subgraphs from T4).

[Figure: t4_1 and t4_2 joined (&) with graphs h1, h2, h3, h4, h5]

Perform graph join operation to find repeated size-k graphs

Join each tree with its cousins to produce frequent motif candidates Ck.

[Figure: t4_1 and t4_2 joined (&) with graphs h1, h2, h3, h4, h5, giving candidate set C4]

Perform graph join operation to find repeated size-k graphs

Count the frequency of each graph Ck in GDk.

[Figure: GD4 with subgraphs on nodes {1,2,3,5}, {1,3,4,5}, {2,3,4,5}, {1,2,3,4}, {1,2,4,5}; candidates g1_1 and g1_2; frequency labels F = 4 and F = 2]

Generate k node, k+1 edge graphs from k node, k edge graphs

Perform graph join operation to find repeated size-k graphs.

[Figure: graphs g1_2, h6 and g2, with steps "move edge" and "merge"; F = 2 in GD4]

Graph Cousins

Type I : Direct Cousin

h is isomorphic to a subgraph which has the same number of nodes & edges as g, and g != h

[Figure: graphs g, h and g' illustrating a Type I cousin via isomorphism]

Graph Cousins

[Figure: g and h in GD4, with subgraph lists G4_1, G4_2, G4_3, G4_4, G4_5 and G4_1, G4_2, G4_3, G4_5]

Graph Cousins

Type II : Twin Cousin

h is isomorphic to a subgraph g.

[Figure: h and g, where h is isomorphic to g]

Graph Cousins

Type III : Distant Cousin

h is a disconnected subgraph of g.

[Figure: h and g, where h is a disconnected subgraph of g]

Graph Cousins

• Saves time when counting graph frequency

• GDk partitions the network into several subgraphs

• If they can limit the isomorphism search to a subset of those graphs, they can save time

Determine subgraph frequency in random networks

• A frequent subgraph may appear frequently by chance

• In order to determine the significance of a subgraph, generate random networks with the same number of nodes and the same number of edges

• Also impose the constraint that each node must have the same number of neighbors as its counterpart in the real network
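One standard way to satisfy these constraints is repeated edge swapping: pick two edges (a, b) and (c, d) and rewire them to (a, d) and (c, b), which changes the wiring but leaves every node's degree unchanged. The sketch below assumes a simple undirected input graph with at least two edges; the function name, the swap count and the retry limit are illustrative choices, not taken from the slides.

```python
import random

def degree_preserving_shuffle(edges, n_swaps, seed=None):
    """Randomize a simple undirected graph by edge swaps that keep all degrees."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # consider both orientations of the second edge
            c, d = d, c
        # Proposed rewiring: (a, b), (c, d) -> (a, d), (c, b)
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

# Hypothetical 6-node network: a ring plus one chord.
real = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
randomized = degree_preserving_shuffle(real, n_swaps=20, seed=1)
```

Counting a candidate subgraph in many such randomized networks gives the baseline against which its frequency in the real network is judged.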

Performance Test

• Uetz dataset : 957 PPIs, 104 proteins
– In budding yeast

• MIPS CYGD dataset : 10199 PPIs, 4341 proteins
– Also in budding yeast

• Compared with
– Exhaustive enumeration
– Sampling
– FPF

Performance : runtime

[Figure: runtime plot, annotated ~2.8 hrs; F = 50, U = 0.95]

Performance : max. motif size