Graph mining in bioinformatics
Laur Tooming
Graphs in biology
• Graphs are often used in bioinformatics for describing processes in the cell
• Vertices are genes or proteins• The meaning of an edge depends on the type of
the graph – Protein-protein interaction– Gene regulation
What we’re looking for
• We want to find sets of genes that have a biological meaning.
• Idea: find graph-theoretically relevant sets of vertices and find out if they are also biologically meaningful.
• Simple example: connected components• A more advanced idea: graph clustering.
Find subgraphs that have a high edge density.
Markov Cluster Algorithm (MCL)
• If there is cluster structure in a graph, random walks tend to remain in a cluster for a long time
• Graph modelled as a stochastic matrix: sum of entries in a column is 1
• aij - probability that randomly walking out of j will go to i on the next step
• Bigger edge weight means greater probability of choosing that edge
Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000. http://micans.org/mcl/
Markov Cluster Algorithm (MCL)
• Two procedures, inflation and expansion, are applied alternatively
• Expansion: matrix squaring– considers longer random walks
• Inflation: raising entries to some power, rescaling to remain stochastic– Weakens weak edges and strengthens strong
ones
• Converges to a steady state
Markov Cluster Algorithm (MCL)
Images from http://micans.org/mcl/ani/mcl-animation.html
Betweenness centrality clustering
• An edge between different clusters is on many shortest paths from one cluster to another.
• An edge inside a cluster is on less shortest paths, because there are more alternative paths inside a cluster.
• Betweenness centrality of an edge - the number of shortest paths in the graph containing that edge.
• Remove edges with the highest centrality from the graph to obtain clustering.
• Optimisations:– instead of all shortest paths, pick a sample of vertices
and calculate shortest paths from them– remove several edges at once
GraphWeb• Web interface for analysing biological graphs
• Simple syntax for entering graphs– multiple datasets
– directed edges
– edge weights
• Visualising graphs with GraphViz• Finding biological meaning with g:Profiler
ds1: A > B 10
ds2: A > B 4
ds1: B C 5
ds2: C > D 12
Combining several datasets
• Whether or not there is an edge between two vertices is determined in biological experiments, which may sometimes give false results.
• For a given graph different sources may give different information. Some sources may be more trustworthy than others.
• We would like to combine different sources and assess the trustworthyness of each edge in the resulting graph.
• Edge weight in summary graph: sum over datasets– w(e,G) = Σ w(e,Gi)*w(Gi)
Combining several datasets
The end