Optimal Network Alignment with Graphlet Degree Vectors
Tijana Milenković (Department of Computing, Imperial College London, and Department of
Computer Science, University of California)
Weng Leong Ng (Department of Computer Science, University of California)
Wayne Hayes (Department of Computer Science, University of California, and Department
of Mathematics, Imperial College London)
Nataša Pržulj (Department of Computing, Imperial College London)
Cancer Informatics 2010
Presented by: Lila Shnaiderman
Motivation
• Recent advances in experimental techniques:
  – yeast two-hybrid assays
  – mass spectrometry of purified complexes
  – genome-wide chromatin immunoprecipitation
  – etc.
• As a result, increasing amounts of biological network data are becoming available.
• Comparative analyses of biological networks are expected to have as large an impact as comparative genomics on our understanding of:
  – biology
  – evolution
  – disease
• Meaningful network comparison across species is therefore one of the foremost problems in evolutionary and systems biology.
Background
• Subgraph isomorphism problem:
  – Does one graph exist as an exact subgraph of another graph?
  – NP-complete.
  – So exact network comparison is computationally infeasible for large networks.
• Network alignment:
  – The most common network comparison method.
  – A more general problem: find the best way to “fit” one graph into another, even when it is not an exact subgraph.
• Unclear:
  – how to guide the alignment process
  – how to measure the “goodness” of an inexact fit
  – so heuristic strategies must be sought
Background – alignment types
• Local alignment:
  – The majority of existing methods.
  – Matches a small subnetwork of one network to one or more subnetworks of another network.
  – Can be ambiguous: one node may receive different pairings in different local alignments.
• Global alignment:
  – Measures the overall similarity between two networks.
  – Aligns every node in the smaller network to exactly one node in the larger network.
  – Most existing methods incorporate a priori information external to network topology, such as protein sequence similarity in PPI networks.
• Best-known global network alignment algorithm based solely on network topology:
  – GRAph ALigner (GRAAL): uses a heuristic search strategy to quickly find approximate alignments.
Current solution: H-GRAAL
• Hungarian-algorithm-based GRAAL.
• More computationally expensive than GRAAL.
• Guaranteed to find optimal alignments relative to any fixed, deterministic cost function.
• Relies solely and explicitly on a strong and direct measure of network topological similarity.
• Applicable to any type of network.
• Allows knowledge to be transferred between aligned networks.
Graphlets
• Graphlet: a small connected induced subgraph of a larger network.
Graphlet degree vectors (1)
[Figure: the nine graphlets on 2 to 4 nodes, G0 to G8, with their automorphism orbits numbered 0 to 14.]
Graphlet degree vectors (2)
• Graphlet degree vector (GDV) of node v: for each orbit, counts the number of graphlets that touch v at that orbit (over all graphlets on 2 to 5 nodes).

Orbit    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14
GDV(v)   4   2   5   1   0   4   0   2   1   0   0   2   0   0   0

[Figure: example network around node v, with the edges that touch v at orbit 0 highlighted.]
Graphlet degree vectors (3) to (7)
[Figures: the same example network and Orbit/GDV(v) table, repeated on each slide, with the graphlets that touch v at orbits 1 to 11 highlighted in turn.]
• What is the degree of node v according to the vector? It is the orbit-0 entry, here 4.
• There are 73 different orbits across all 2- to 5-node graphlets.
• The resulting 73-dimensional vector is the signature of node v.
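To make the counting concrete, here is a minimal sketch that computes only the first four coordinates of the vector (orbits 0 to 3, i.e. the 2- and 3-node graphlets). The function name `gdv_orbits_0_to_3`, the adjacency-dictionary input, and the toy network are assumptions made for illustration. The full 73-orbit signature needs a dedicated graphlet counter.

```python
from itertools import combinations


def gdv_orbits_0_to_3(adj, v):
    """Return [orbit0, orbit1, orbit2, orbit3] for node v.

    adj maps each node to the set of its neighbors (undirected, no self-loops).
    """
    nbrs = adj[v]
    degree = len(nbrs)  # orbit 0: number of edges touching v
    # orbit 3: triangles through v (pairs of neighbors that are adjacent)
    triangles = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # orbit 2: v is the middle of an induced 3-node path
    # (pairs of neighbors that are NOT adjacent)
    path_middle = degree * (degree - 1) // 2 - triangles
    # orbit 1: v is an end of an induced 3-node path v - a - b.
    # Each triangle through v is counted twice in the sum, so remove it twice.
    path_end = sum(len(adj[a]) - 1 for a in nbrs) - 2 * triangles
    return [degree, path_end, path_middle, triangles]


# Hypothetical toy network: triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
for node in adj:
    print(node, gdv_orbits_0_to_3(adj, node))
```

The same identity can be checked against the table above: a node of degree 4 with one triangle has C(4,2) − 1 = 5 at orbit 2, which matches GDV(v).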
Degree vector – signature
• Many real-world networks have a small-world nature.
• So the degree vector is an effective measure:
  – It looks at a network distance of up to 4 around a node.
  – It captures a large portion of the network topology.
• Thus, comparing two signatures gives a highly constraining measure of local topological similarity between nodes.
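Signature similarity
• For a node u ∈ G, u_i denotes the ith coordinate of its signature vector.
• Distance between the ith orbits of nodes u and v:
  $D_i(u,v) = w_i \times \frac{|\log(u_i + 1) - \log(v_i + 1)|}{\log(\max\{u_i, v_i\} + 2)}$
• w_i is the weight of orbit i:
  – accounts for dependencies between orbits
  – assigns higher weights to orbits that are not affected by many other orbits
• Questions:
  – Why log?
  – Why “+1”?
Distance and similarity
• Total distance:
  $D(u,v) = \frac{\sum_{i=0}^{72} D_i(u,v)}{\sum_{i=0}^{72} w_i}$
  – D(u,v) lies in [0, 1).
  – 0 means the signatures of u and v are identical.
• Similarity: S(u,v) = 1 − D(u,v)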
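A small sketch of the distance and similarity computation, assuming the per-orbit distance has the form D_i = w_i × |log(u_i + 1) − log(v_i + 1)| / log(max(u_i, v_i) + 2) used in the graphlet-signature literature. The function names and the uniform weights in the example are assumptions; real orbit weights depend on orbit dependencies and are not given in these slides.

```python
import math


def signature_distance(sig_u, sig_v, weights):
    """Weighted, normalized distance between two signature vectors."""
    total = 0.0
    for u_i, v_i, w_i in zip(sig_u, sig_v, weights):
        # "+1" keeps log() defined when an orbit count is 0; the logs damp
        # the influence of very large orbit counts.
        numerator = abs(math.log(u_i + 1) - math.log(v_i + 1))
        # "+2" keeps the denominator positive even when both counts are 0.
        denominator = math.log(max(u_i, v_i) + 2)
        total += w_i * numerator / denominator
    return total / sum(weights)


def signature_similarity(sig_u, sig_v, weights):
    """S(u, v) = 1 - D(u, v)."""
    return 1.0 - signature_distance(sig_u, sig_v, weights)


# Hypothetical 4-orbit signatures with uniform weights (illustration only).
u = [4, 2, 5, 1]
v = [4, 0, 5, 1]
w = [1.0, 1.0, 1.0, 1.0]
print(signature_distance(u, v, w))
print(signature_similarity(u, u, w))
```

Only orbit 1 differs in the example, so the distance reduces to (log 3 / log 4) / 4.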
H-GRAAL algorithm – definitions
• G1 and G2 are networks with |V(G1)| < |V(G2)|.
• Alignment of G1 to G2:
  – a set of ordered pairs (u,v), with u ∈ V(G1) and v ∈ V(G2)
  – no two ordered pairs share the same G1-node or the same G2-node
  – each pair is called an aligned pair
• Maximum alignment:
  – every G1-node is in some aligned pair
  – from now on, “alignment” means “maximum alignment”
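The definitions above translate directly into a validity check. This is only a sketch: the function name `is_maximum_alignment` and the list-of-pairs representation are assumptions made for illustration.

```python
def is_maximum_alignment(pairs, nodes_g1, nodes_g2):
    """Check that pairs is a one-to-one alignment covering every G1-node."""
    g1_side = [u for u, _ in pairs]
    g2_side = [v for _, v in pairs]
    # Every pair must join a G1-node to a G2-node.
    valid_nodes = set(g1_side) <= set(nodes_g1) and set(g2_side) <= set(nodes_g2)
    # No two pairs may share a G1-node or a G2-node.
    one_to_one = (len(set(g1_side)) == len(g1_side)
                  and len(set(g2_side)) == len(g2_side))
    # Maximum alignment: every G1-node appears in some aligned pair.
    covers_g1 = set(g1_side) == set(nodes_g1)
    return valid_nodes and one_to_one and covers_g1


print(is_maximum_alignment([("a", 1), ("b", 3)], ["a", "b"], [1, 2, 3]))
print(is_maximum_alignment([("a", 1), ("b", 1)], ["a", "b"], [1, 2, 3]))
```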
H-GRAAL algorithm
• H-GRAAL: Hungarian-algorithm-based GRAph ALigner.
• Produces an alignment of minimum total cost between the networks:
  – total cost: the sum of the costs of all aligned pairs
  – aligned-pair cost: based on signature similarity
• The cost of aligning u and v:
  – favors alignment of the densest parts of the networks
  – is reduced as the degrees of both nodes increase, since higher-degree nodes with similar signatures provide a tighter constraint
  – α ∈ [0, 1] weights the contribution of the signature similarity between u and v
  – 1 − α weights the contribution of the node degrees
Alignment cost
• $T(u,v) = \frac{\deg(u) + \deg(v)}{\mathrm{max\_deg}(G_1) + \mathrm{max\_deg}(G_2)}$
• $C(u,v) = 2 - \big((1 - \alpha) \times T(u,v) + \alpha \times S(u,v)\big)$
• Cost = 0: a pair of topologically identical nodes u and v.
• Cost close to 2: a pair of topologically very different nodes.
• Any problem with this formula?
• T(u,v) is very low for most nodes:
  – there is only a small number of hubs (highly linked nodes)
  – so max_deg(G1) and max_deg(G2) are much larger than deg(u) and deg(v)
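A sketch of the cost computation, assuming the cost has the form C(u,v) = 2 − ((1 − α) × T(u,v) + α × S(u,v)) with T(u,v) = (deg(u) + deg(v)) / (max_deg(G1) + max_deg(G2)). Both the formula and the function names are assumptions reconstructed from the terms these slides mention, not a confirmed copy of the H-GRAAL implementation.

```python
def degree_term(deg_u, deg_v, max_deg_g1, max_deg_g2):
    """T(u, v): combined degree of the pair, scaled by the two maximum degrees."""
    return (deg_u + deg_v) / (max_deg_g1 + max_deg_g2)


def alignment_cost(similarity, deg_u, deg_v, max_deg_g1, max_deg_g2, alpha):
    """Assumed cost: 2 - ((1 - alpha) * T(u, v) + alpha * S(u, v))."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    t = degree_term(deg_u, deg_v, max_deg_g1, max_deg_g2)
    return 2.0 - ((1.0 - alpha) * t + alpha * similarity)


# Hypothetical numbers: two low-degree nodes in networks that contain hubs.
print(degree_term(2, 3, 100, 200))             # T is small when hubs dominate
print(alignment_cost(0.8, 2, 3, 100, 200, 0.5))
```

The first printed value illustrates the slide's point: with hubs of degree 100 and 200, a typical pair contributes a degree term of only 5/300.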
Hungarian algorithm
• Solves the assignment problem in polynomial time:
  – Build a bipartite graph whose two node sets are V(G1) and V(G2).
  – Label each edge (u,v), with u ∈ V(G1) and v ∈ V(G2), with the node alignment cost.
  – Find a minimum-cost matching that covers every node of V(G1).
• More than one optimal alignment is possible:
  – Which one is found depends heavily on the implementation details of the underlying Hungarian algorithm.
  – For example, on the order in which the nodes are presented to the algorithm.
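For reference, a compact pure-Python sketch of the assignment step. It is the standard potentials-based formulation of the Hungarian algorithm for a rectangular cost matrix with no more rows than columns. The function name `hungarian` and the toy matrices are assumptions; this is not H-GRAAL's own implementation.

```python
def hungarian(cost):
    """Minimum-cost assignment of every row to a distinct column.

    cost is an n x m list of lists with n <= m.
    Returns (assignment, total) where assignment[i] is the column of row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j]: row currently matched to column j (0 = free)
    way = [0] * (m + 1)   # predecessor columns on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:      # reached a free column: augment
                break
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total


# Hypothetical cost matrices: rows are G1-nodes, columns are G2-nodes.
print(hungarian([[4, 1, 3], [2, 0, 5], [3, 2, 2]]))
print(hungarian([[4, 1, 3], [2, 0, 5]])[1])
```

The second matrix has two different alignments with the same minimum total, which is exactly the situation described in the last bullet: which one is returned depends on implementation details.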
Finding a few optimal alignments
• We would like to learn about all possible optimal matchings.
• Making H-GRAAL produce more alignments, starting from an optimal alignment A0:
  – “Remove” (u,v): raise the alignment cost of a node pair (u,v) in A0 to +∞.
  – Run H-GRAAL again.
  – If the alignment found has a higher cost than A0, “remove” a different pair instead.
  – If every pair has been tried and no other alignment of optimal cost was found, no more optimal alignments exist.
• This process is too expensive:
  – O(|V(G1)|^3 × |E(G1)|)
  – A faster variant exists, O(|V(G1)|^2 × |E(G1)|), based on the dynamic Hungarian algorithm.
  – My remark: still very slow (can take months…)
Few optimal alignments – algorithm
• Optimizing aligned pair: a pair that appears in at least one optimal alignment.
• The set of optimizing pairs:
  – can be computed in at worst O(n^4) time
  – can be easily parallelized
• My remark: too slow…
Few optimal alignments – analysis
• Significance of an aligned pair:
  – measured by the number of optimizing pairs per G1-node u
  – if (u,v) is the only optimizing pair for u, every optimal alignment contains (u,v), i.e., (u,v) is highly significant
• Core alignment:
  – the set of all such unique optimizing pairs
  – a large core alignment indicates a stable alignment
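The notions of optimizing pair and core alignment can be illustrated by brute force on a tiny cost matrix. This sketch enumerates every alignment, so it is exponential and only a stand-in for the O(n^4) method mentioned above. All function names and the cost matrix are assumptions, and integer costs are used so that ties can be compared exactly.

```python
from itertools import permutations


def optimal_alignments(cost):
    """Enumerate all minimum-cost alignments of rows (G1) to columns (G2)."""
    n, m = len(cost), len(cost[0])
    best, best_cost = [], float("inf")
    for cols in permutations(range(m), n):
        total = sum(cost[i][cols[i]] for i in range(n))
        if total < best_cost:
            best, best_cost = [cols], total
        elif total == best_cost:
            best.append(cols)
    return best, best_cost


def optimizing_pairs(alignments):
    """Pairs (u, v) that appear in at least one optimal alignment."""
    return {(u, v) for cols in alignments for u, v in enumerate(cols)}


def core_alignment(alignments):
    """Pairs (u, v) where v is the only optimizing partner of u."""
    partners = {}
    for u, v in optimizing_pairs(alignments):
        partners.setdefault(u, set()).add(v)
    return {(u, next(iter(vs))) for u, vs in partners.items() if len(vs) == 1}


# Hypothetical costs: rows 0 and 1 are interchangeable, row 2 has a clear partner.
cost = [[1, 1, 5],
        [1, 1, 5],
        [5, 5, 0]]
alignments, best_cost = optimal_alignments(cost)
print(best_cost, sorted(alignments))
print(sorted(optimizing_pairs(alignments)))
print(sorted(core_alignment(alignments)))
```

Here two optimal alignments exist, nodes 0 and 1 each have two optimizing partners, and only the pair (2, 2) belongs to the core.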
Measures of alignment quality (1)
• Edge correctness (EC):
  – percentage of edges in one graph that are aligned to edges in the other graph
• The following two measures require knowing the “true alignment”:
• Node correctness (NC):
  – percentage of nodes in one network that are correctly aligned to nodes in the other network
• Interaction correctness (IC):
  – percentage of interactions that are aligned correctly
• IC is stricter than EC:
  – EC does not require that the alignment partners are the correct ones
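A sketch of the three measures on a toy example. The function names and data are assumptions, and IC is read here as "a G1 edge counts only if both endpoints are mapped to their true partners and the image is an edge of G2", which is one plausible interpretation of the slide's definition.

```python
def edge_correctness(edges_g1, edges_g2, alignment):
    """EC: percentage of G1 edges whose image under the alignment is a G2 edge."""
    g2 = {frozenset(e) for e in edges_g2}
    aligned = sum(1 for u, v in edges_g1
                  if frozenset((alignment[u], alignment[v])) in g2)
    return 100.0 * aligned / len(edges_g1)


def node_correctness(alignment, true_alignment):
    """NC: percentage of G1 nodes mapped to their true partner."""
    correct = sum(1 for u, v in alignment.items() if true_alignment.get(u) == v)
    return 100.0 * correct / len(alignment)


def interaction_correctness(edges_g1, edges_g2, alignment, true_alignment):
    """IC (assumed reading): G1 edges aligned to a G2 edge with both endpoints correct."""
    g2 = {frozenset(e) for e in edges_g2}
    correct = sum(1 for u, v in edges_g1
                  if alignment[u] == true_alignment[u]
                  and alignment[v] == true_alignment[v]
                  and frozenset((alignment[u], alignment[v])) in g2)
    return 100.0 * correct / len(edges_g1)


# Hypothetical example: G1 is the path a-b-c, G2 is the path 1-2-3-4.
edges_g1 = [("a", "b"), ("b", "c")]
edges_g2 = [(1, 2), (2, 3), (3, 4)]
true_alignment = {"a": 1, "b": 2, "c": 3}
found = {"a": 3, "b": 2, "c": 1}   # the path is aligned "backwards"

print(edge_correctness(edges_g1, edges_g2, found))
print(node_correctness(found, true_alignment))
print(interaction_correctness(edges_g1, edges_g2, found, true_alignment))
```

The reversed alignment preserves every edge, so EC is perfect, while only one node is correct and no interaction is aligned correctly. This is the sense in which IC is stricter than EC.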
Measures of alignment quality (2)
• Usually the “true alignment” is not known:
  – so only EC can be measured
  – two alignments can have similar ECs while one is “good” and the other is “bad”, so EC alone is not enough
• To uncover regions of similar topology:
  – the aligned edges must cluster together and form large and dense connected subgraphs
• Common connected subgraph (CCS):
  – a connected subgraph that appears in both networks
• A good alignment has:
  – large and dense CCSs
  – a large EC
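One way to extract the largest CCS from an alignment is to keep only the G1 edges that are aligned to G2 edges and take the largest connected component of what remains. The sketch below does that with a breadth-first search. The function name and example data are assumptions, and "largest" is taken to mean "most nodes".

```python
from collections import deque


def largest_common_connected_subgraph(edges_g1, edges_g2, alignment):
    """Return (nodes, edges) of the largest component formed by aligned edges."""
    g2 = {frozenset(e) for e in edges_g2}
    # Keep only G1 edges whose image under the alignment is an edge of G2.
    common = [(u, v) for u, v in edges_g1
              if frozenset((alignment[u], alignment[v])) in g2]
    adj = {}
    for u, v in common:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    best_nodes, seen = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:                       # breadth-first search
            x = queue.popleft()
            for y in adj[x]:
                if y not in component:
                    component.add(y)
                    queue.append(y)
        seen |= component
        if len(component) > len(best_nodes):
            best_nodes = component
    best_edges = [(u, v) for u, v in common if u in best_nodes]
    return best_nodes, best_edges


# Hypothetical example: edge c-d is not conserved, edge e-f forms a second component.
edges_g1 = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
edges_g2 = [(1, 2), (2, 3), (5, 6)]
alignment = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}
nodes, edges = largest_common_connected_subgraph(edges_g1, edges_g2, alignment)
print(sorted(nodes), edges)
```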
Statistical significance
• Random alignment of real-world networks:
  – the probability of obtaining a given or better EC at random
• Null model of random alignment:
  – random mapping g: E1 → V2 × V2
  – n1 = |V1|, n2 = |V2|, m1 = |E1|, and m2 = |E2|
  – p = n2(n2 − 1)/2: the number of node pairs in G2
  – EC = x%: the edge correctness of the given alignment
  – k = [m1 × x/100]: the number of edges of G1 aligned to edges of G2
• P: the probability of successfully aligning k or more edges by chance (the tail of the hypergeometric distribution):
  $P = \sum_{i=k}^{m_1} \frac{\binom{m_2}{i}\binom{p - m_2}{m_1 - i}}{\binom{p}{m_1}}$
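The tail probability can be evaluated exactly with integer binomial coefficients. This sketch assumes the null model above, in which each of the m1 edges of G1 lands on a distinct node pair of G2 chosen at random; the function name `ec_p_value` and the toy numbers are illustrative only.

```python
from math import comb


def ec_p_value(n2, m1, m2, k):
    """P(at least k of the m1 edges of G1 land on edges of G2 by chance)."""
    p = n2 * (n2 - 1) // 2          # number of node pairs in G2
    total = comb(p, m1)             # ways to place m1 edges on p node pairs
    # Terms with i > m2 vanish, so the sum can stop at min(m1, m2).
    tail = sum(comb(m2, i) * comb(p - m2, m1 - i)
               for i in range(k, min(m1, m2) + 1))
    return tail / total


# Hypothetical tiny case: G2 has 4 nodes and 3 edges, G1 has 2 edges.
print(ec_p_value(n2=4, m1=2, m2=3, k=1))
```

In the tiny case there are 6 node pairs, and the chance that neither edge hits one of the 3 real edges is C(3,2)/C(6,2) = 3/15, so the probability of at least one hit is 0.8.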
More statistical significance metrics
• H-GRAAL’s alignment of random model networks:
  – checks the significance of an alignment by comparing it with alignments of random networks:
    • align two PPI networks
    • align them with random networks
    • compare the results
• Biological validation:
  – find the number of aligned protein pairs sharing a Gene Ontology (GO) term
  – compute its statistical significance
• Significance of functional enrichments:
  – align metabolic networks of different species
  – generate phylogenetic trees based on H-GRAAL’s ECs
  – compute their statistical significance
Results (1)
• H-GRAAL produces better alignments than GRAAL for all values of α.
• Using only degrees (α = 0) gives bad results.
  – So graphlet-based signatures are far more valuable than a measure based on degree alone.
Results (2)
• The largest common connected subgraph in the alignment of the yeast and human PPI networks:
  – consists of 1,290 interactions amongst 317 proteins
  – appears, in its entirety, in the PPI networks of both species
Results (3)
• Statistics of H-GRAAL’s core yeast-human alignment for α = 0.5.
• The percentage of yeast proteins, out of 2,390, that participate in n “optimizing pairs”.
• Shows the quality of H-GRAAL.
Results (4)
• Comparison of the phylogenetic trees for protists and fungi.
• H-GRAAL’s and GRAAL’s trees are slightly different from the sequence-based one.
• Sequence-based trees are built from:
  – multiple alignments of gene sequences
  – whole-genome alignments
Results (5)
• Multiple alignments have a few problems:
  – They can be misleading due to gene rearrangements, inversions, transpositions, and translocations (at the substring level).
  – Different species might have an unequal number of genes or genomes of vastly different lengths.
• Whole-genome alignments can be misleading:
  – Noncontiguous copies of a gene or non-decisive gene order.
  – The trees are built incrementally from smaller pieces that are “patched” together probabilistically, so probabilistic errors are expected.
• H-GRAAL’s and GRAAL’s trees have none of these problems, but they are affected by:
  – noise in the network data
  – incompleteness of PPI networks
• There is no reason to believe that either the sequence-based tree or GRAAL’s tree should a priori be considered the correct one.
Conclusions
• Presented the H-GRAAL algorithm for global alignment between networks.
• Presented different statistics to evaluate the quality of an alignment.
• Experimented with different PPI networks, and not only PPI networks.
• Showed that H-GRAAL produces better alignments than GRAAL, the best-known topology-based global alignment algorithm.
• H-GRAAL can have a large influence on research into biological networks.
Thank you for your attention!