Graph mining in bioinformatics Laur Tooming

Graph mining in bioinformatics

Laur Tooming

Graphs in biology

• Graphs are often used in bioinformatics for describing processes in the cell

• Vertices are genes or proteins

• The meaning of an edge depends on the type of the graph
  – Protein-protein interaction
  – Gene regulation

What we’re looking for

• We want to find sets of genes that have a biological meaning.

• Idea: find graph-theoretically relevant sets of vertices and find out if they are also biologically meaningful.

• Simple example: connected components

• A more advanced idea: graph clustering. Find subgraphs that have a high edge density.
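
The two ideas above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names and the gene names are hypothetical, and edge density is taken here to mean the fraction of possible vertex pairs inside a set that are actually joined by an edge.

```python
from collections import deque


def connected_components(edges):
    """Return the connected components of an undirected graph as a list of sets."""
    # Build an adjacency map; every edge endpoint becomes a vertex.
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def edge_density(vertices, edges):
    """Fraction of vertex pairs inside `vertices` that are joined by an edge."""
    vertices = set(vertices)
    n = len(vertices)
    if n < 2:
        return 0.0
    inside = {
        frozenset((u, v))
        for u, v in edges
        if u in vertices and v in vertices and u != v
    }
    return len(inside) / (n * (n - 1) / 2)


# Hypothetical gene names, used only for illustration.
example_edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneD", "geneE")]
example_components = connected_components(example_edges)
example_density = edge_density({"geneA", "geneB", "geneC"}, example_edges)
```

Each returned set is a candidate gene group whose biological meaning can then be checked separately.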

Markov Cluster Algorithm (MCL)

• If there is cluster structure in a graph, random walks tend to remain in a cluster for a long time

• Graph modelled as a column-stochastic matrix: the entries in each column sum to 1

• a_ij: the probability that a random walk leaving vertex j goes to vertex i on the next step

• A larger edge weight means a greater probability of choosing that edge

Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000. http://micans.org/mcl/
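
As a small sketch of the matrix described on this slide, the snippet below turns a weighted adjacency matrix into a column-stochastic one. The function name and the example weights are assumptions made for illustration; it is not taken from the MCL software.

```python
def column_stochastic(weights):
    """Normalise each column of a square weight matrix so that it sums to 1.

    Entry [i][j] of the result is the probability that a random walk
    leaving vertex j goes to vertex i on the next step.
    """
    n = len(weights)
    column_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    if any(total == 0 for total in column_sums):
        raise ValueError("every vertex needs at least one weighted edge")
    return [[weights[i][j] / column_sums[j] for j in range(n)] for i in range(n)]


# Hypothetical symmetric edge weights for a three-vertex graph.
example_weights = [
    [0, 3, 1],
    [3, 0, 1],
    [1, 1, 0],
]
transition = column_stochastic(example_weights)
```

In this example vertex 1 has an edge of weight 3 to vertex 0 and an edge of weight 1 to vertex 2, so a walk leaving vertex 1 goes to vertex 0 with probability 0.75.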

Markov Cluster Algorithm (MCL)

• Two procedures, expansion and inflation, are applied alternately

• Expansion: matrix squaring
  – Considers longer random walks

• Inflation: raising each entry to some power, then rescaling columns to remain stochastic
  – Weakens weak edges and strengthens strong ones

• Converges to a steady state
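
The alternation of expansion and inflation can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions, not the reference MCL implementation: adding self-loops before normalising, the inflation power of 2, the stopping tolerance, and the way clusters are read off the final matrix (each row that still holds non-negligible entries is treated as an attractor, and its non-zero columns as the cluster members) are all choices made for this example.

```python
def normalise_columns(matrix):
    """Rescale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(matrix):
    """Expansion: square the matrix, i.e. consider two-step random walks."""
    n = len(matrix)
    return [
        [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inflate(matrix, power):
    """Inflation: raise each entry to `power`, then rescale the columns."""
    return normalise_columns([[x ** power for x in row] for row in matrix])


def mcl(adjacency, inflation=2.0, max_iterations=100, tolerance=1e-9):
    """Alternate expansion and inflation until the matrix stops changing."""
    n = len(adjacency)
    # Assumption: add a self-loop to every vertex before normalising.
    with_loops = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    matrix = normalise_columns(with_loops)
    for _ in range(max_iterations):
        updated = inflate(expand(matrix), inflation)
        change = max(
            abs(updated[i][j] - matrix[i][j]) for i in range(n) for j in range(n)
        )
        matrix = updated
        if change < tolerance:
            break
    return matrix


def clusters_from_matrix(matrix, threshold=1e-6):
    """Read clusters off the final matrix: one cluster per distinct non-empty row."""
    clusters = set()
    for row in matrix:
        members = frozenset(j for j, x in enumerate(row) if x > threshold)
        if members:
            clusters.add(members)
    return clusters


# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
example_adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
final_matrix = mcl(example_adjacency)
example_clusters = clusters_from_matrix(final_matrix)
```

A higher inflation power generally yields more, smaller clusters, which makes it the main parameter to tune.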

Markov Cluster Algorithm (MCL)

Images from http://micans.org/mcl/ani/mcl-animation.html

Betweenness centrality clustering

• An edge between different clusters is on many shortest paths from one cluster to another.

• An edge inside a cluster is on fewer shortest paths, because there are more alternative paths inside a cluster.

• Betweenness centrality of an edge: the number of shortest paths in the graph containing that edge.

• Remove edges with the highest centrality from the graph to obtain clustering.

• Optimisations:
  – instead of all shortest paths, pick a sample of vertices and calculate shortest paths from them
  – remove several edges at once
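
The procedure on this slide can be sketched as below for an unweighted, undirected graph. Several details are assumptions made for the example rather than statements from the talk: edge betweenness is accumulated with a Brandes-style breadth-first search, so when a pair of vertices has several shortest paths each path counts fractionally; edges are removed one at a time, with the scores recomputed after each removal; and removal stops once a requested number of connected components (the hypothetical `target_clusters` parameter) is reached. Neither optimisation from the list above is implemented.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency map; self-loops are skipped."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set())
        adjacency.setdefault(v, set())
        if u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def components(adjacency):
    """Connected components as a list of sets."""
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        result.append(component)
    return result


def edge_betweenness(adjacency):
    """Edge betweenness over all vertex pairs (fractional for tied paths)."""
    scores = {frozenset((u, v)): 0.0 for u in adjacency for v in adjacency[u]}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from `source`.
        distance = {source: 0}
        path_count = {source: 1.0}
        predecessors = {source: []}
        order = []
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    path_count[neighbour] = 0.0
                    predecessors[neighbour] = []
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    path_count[neighbour] += path_count[node]
                    predecessors[neighbour].append(node)
        # Walk back from the farthest vertices, crediting each edge used.
        dependency = {node: 0.0 for node in order}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node] * (1.0 + dependency[node])
                scores[frozenset((pred, node))] += share
                dependency[pred] += share
    # Every unordered pair was visited from both of its endpoints.
    return {edge: score / 2.0 for edge, score in scores.items()}


def betweenness_clustering(edges, target_clusters):
    """Remove the highest-betweenness edge until enough components exist."""
    adjacency = build_adjacency(edges)
    while len(components(adjacency)) < target_clusters:
        scores = edge_betweenness(adjacency)
        if not scores:
            break  # no edges left to remove
        u, v = max(scores, key=scores.get)
        adjacency[u].discard(v)
        adjacency[v].discard(u)
    return components(adjacency)


# Two triangles joined by the bridge c-d (hypothetical vertex names).
example_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]
example_scores = edge_betweenness(build_adjacency(example_edges))
example_clusters = betweenness_clustering(example_edges, target_clusters=2)
```

In the example the bridge c-d lies on all nine shortest paths between the two triangles, so it is the first edge removed.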

GraphWeb

• Web interface for analysing biological graphs

• Simple syntax for entering graphs
  – multiple datasets
  – directed edges
  – edge weights

• Visualising graphs with GraphViz

• Finding biological meaning with g:Profiler

ds1: A > B 10

ds2: A > B 4

ds1: B C 5

ds2: C > D 12
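
To make the structure of these example lines concrete, here is a hypothetical parser. It is not GraphWeb code, and the grammar is inferred only from the four lines above: a dataset label followed by a colon, two vertex names with an optional ">" marking a directed edge, and a numeric weight. The real GraphWeb syntax may allow more than this.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    dataset: str
    source: str
    target: str
    directed: bool
    weight: float


def parse_edge_line(line):
    """Parse one line of the assumed form 'dataset: source [>] target weight'."""
    dataset, separator, rest = line.partition(":")
    if not separator:
        raise ValueError(f"missing dataset label in {line!r}")
    tokens = rest.split()
    # Assumption: a standalone '>' between the vertex names marks a directed edge.
    directed = ">" in tokens
    tokens = [token for token in tokens if token != ">"]
    if len(tokens) != 3:
        raise ValueError(f"expected two vertices and a weight in {line!r}")
    source, target, weight = tokens
    return Edge(dataset.strip(), source, target, directed, float(weight))


example_lines = ["ds1: A > B 10", "ds2: A > B 4", "ds1: B C 5", "ds2: C > D 12"]
parsed_edges = [parse_edge_line(line) for line in example_lines]
```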

Combining several datasets

• Whether or not there is an edge between two vertices is determined in biological experiments, which may sometimes give false results.

• For a given graph, different sources may give different information. Some sources may be more trustworthy than others.

• We would like to combine different sources and assess the trustworthiness of each edge in the resulting graph.

• Edge weight in summary graph: weighted sum over datasets
  – w(e, G) = Σ_i w(e, G_i) · w(G_i), where w(G_i) is the weight given to dataset G_i
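
A worked instance of the summary-weight formula, using the four example edges from the GraphWeb slide. The dataset weights (1.0 for ds1 and 0.5 for ds2) are made up for illustration, and an edge that is missing from a dataset is assumed to contribute 0 to the sum.

```python
def combine_datasets(datasets, dataset_weights):
    """Compute w(e, G) as the sum over datasets of w(e, G_i) * w(G_i)."""
    combined = {}
    for name, edges in datasets.items():
        for edge, weight in edges.items():
            # Edges absent from a dataset simply add nothing (assumption).
            combined[edge] = combined.get(edge, 0.0) + weight * dataset_weights[name]
    return combined


# Edge weights per dataset, taken from the GraphWeb example lines.
example_datasets = {
    "ds1": {("A", "B"): 10, ("B", "C"): 5},
    "ds2": {("A", "B"): 4, ("C", "D"): 12},
}
# Hypothetical trust in each source.
example_dataset_weights = {"ds1": 1.0, "ds2": 0.5}

summary = combine_datasets(example_datasets, example_dataset_weights)
```

With these weights the edge from A to B gets 10 · 1.0 + 4 · 0.5 = 12, while the edge from C to D, seen only in the less trusted ds2, gets 12 · 0.5 = 6.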

The end