pan-genome graphs biodata14

Post on 02-Jul-2015


DESCRIPTION

Pan-genome graphs for bacteria and the web.

TRANSCRIPT


Background

• “Pan Genome” - way to think about, compute on, visualize the differences and similarities of many genomes at once

• Reference free structure

• Many, many genomes

de Bruijn Graph Construction

• Dk = (V, E)

• V = All length-k subfragments

• E = Directed edges between consecutive subfragments

• Nodes overlap by k-1 words

• Locally constructed graph reveals the global sequence structure

• Overlaps between sequences implicitly computed

Slide: http://cbcb.umd.edu/confcour/CMSC828H-materials/Lecture12-MSchatz-DeBruijnAssembly.pptx

Example (k = 4 words): the original fragment “It was the best of” yields the directed edge “It was the best” → “was the best of”.

de Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, Waterman, 2001

Strategy: find all k-mers, build graph

• Every k-mer becomes a node

• Two nodes are linked with an edge if they share a k-1 mer

Example (k = 6): GACTGGGACTCC

GACTGG → ACTGGG → CTGGGA → TGGGAC → GGGACT → GGACTC → GACTCC
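The k-mer strategy on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's implementation: the function name `build_de_bruijn` is hypothetical, and the sketch ignores reverse complements and k-mer multiplicity.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Return {k-mer: sorted list of successor k-mers}.

    Every distinct k-mer is a node. An edge u -> v exists when the last
    k-1 characters of u equal the first k-1 characters of v.
    """
    # Collect every distinct k-mer from every input sequence.
    nodes = set()
    for seq in sequences:
        nodes.update(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Index nodes by their (k-1)-prefix so successors are a direct lookup.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)

    # A node's successors are the nodes whose prefix equals its (k-1)-suffix.
    return {node: sorted(by_prefix.get(node[1:], ())) for node in nodes}


graph = build_de_bruijn(["GACTGGGACTCC"], k=6)
for kmer in sorted(graph):
    print(kmer, "->", graph[kmer])
```

For the slide's sequence this produces seven nodes linked in a single path, because no (k-1)-mer is repeated at k = 6.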

Strategy: k-mers from feature families, build graph

• Every k-mer becomes a node

– If it is present in m genomes

• Two nodes are linked with an edge if they share a k-1 mer

• d# = a feature family

Example (k = 6): d1 d2 d3 d4 d5 d6 d7 d8 d9

d1d2d3d4d5d6 → d2d3d4d5d6d7 → d3d4d5d6d7d8 → d4d5d6d7d8d9
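The feature-family variant can be sketched the same way, with each genome given as an ordered list of family identifiers instead of a nucleotide string. The sketch rests on several assumptions not stated on the slide: "present in m genomes" is read as "present in at least m genomes", strand and orientation are ignored, and the function name `build_family_graph` and the genome names are hypothetical.

```python
from collections import defaultdict


def build_family_graph(genomes, k, m):
    """genomes maps a genome name to its ordered list of family IDs.

    Returns {family k-mer (tuple): sorted list of successor k-mers},
    keeping only k-mers seen in at least m genomes (an assumed reading).
    """
    # Record which genomes contain each run of k consecutive families.
    seen_in = defaultdict(set)
    for name, families in genomes.items():
        for i in range(len(families) - k + 1):
            seen_in[tuple(families[i:i + k])].add(name)

    # Keep a k-mer as a node only if enough genomes contain it.
    nodes = {kmer for kmer, names in seen_in.items() if len(names) >= m}

    # Link nodes that share k-1 consecutive families (suffix == prefix).
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)
    return {node: sorted(by_prefix.get(node[1:], ())) for node in nodes}


shared = [f"d{i}" for i in range(1, 10)]          # d1 .. d9, as on the slide
genomes = {
    "genome_a": shared,
    "genome_b": shared,
    "genome_c": shared[:7] + ["x1", "x2"],        # diverges after d7
}
graph = build_family_graph(genomes, k=6, m=2)
print(len(graph))  # number of family k-mers kept as nodes
```

With m = 2, the k-mers unique to `genome_c` are dropped, leaving the four k-mers from the slide's example.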

rf-graph de Bruijn “like”

Create pg-graph

Similarities and Differences

10 groups of 10

Organism         Sum Pairwise Distances (Phylogenetic)
E. coli          0.07
Coxiella         10.42
Mycobacterium    2.70
Brucella         0.08
Rickettsia       8.62
Burkholderia     7.21
Clostridium      9.05
Bacillus         4.48
Staph.           2.08
Strep.           4.79

Similarities and Differences

Node Increase = (Nodes – Max(Families)) / Nodes

Diversity Score = Sum of maximum pairwise distances in Order level tree

Figure: Node Increase vs. MUMi (x-axis: MUMi, 0.850 to 1.000; y-axis: Node Increase, 0 to 0.9)

Figure: Node Increase vs. Diversity Score (x-axis: Diversity Score, 0.00 to 12.00; y-axis: Node Increase, 0 to 0.9)

MUMi = Maximum of all pairwise MUMi in a group
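The Node Increase definition on this slide is simple arithmetic, sketched below. The reading of Max(Families) as the largest per-genome family count in the group is an assumption, the function name `node_increase` is hypothetical, and the numbers in the usage line are made up for illustration only.

```python
def node_increase(nodes, families_per_genome):
    """Node Increase = (Nodes - Max(Families)) / Nodes.

    nodes: number of nodes in the group's graph.
    families_per_genome: family count for each genome in the group
    (assumed meaning of "Families" on the slide).
    """
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return (nodes - max(families_per_genome)) / nodes


# Made-up counts: 1200 graph nodes, largest genome has 1000 families.
print(round(node_increase(1200, [900, 950, 1000]), 3))
```

Under this reading, a value near 0 means the graph is barely larger than the largest single genome, and a value near 1 means most nodes come from differences between genomes.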

Layout

• Gephi Toolkit

• Yifan Hu’s Multilevel

• Force Atlas 2

Colors and Lines

Dealing with many Genomes

N=2, K=5, M=2, B. abortus

N=40, K=5, M=2, B. suis

N=20, K=5, M=2, Brucella

N=400, K=5, M=2, All Brucella

N=1000, K=10, M=100, E. coli

Information Compounded

For the Web

• GEXF

– NetworkX, Gephi, Cytoscape, Gexf-JS, D3-Gexf

• BGZF GFF

– Backing store

– Byte range loading
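To make the GEXF bullet concrete, here is a minimal sketch that serializes a small directed graph to GEXF 1.2 using only the Python standard library. It is not the presenter's export code: the function name `graph_to_gexf` is hypothetical, and only node labels and edges are written. In practice NetworkX's `write_gexf` produces the same format with attribute support.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def graph_to_gexf(edges):
    """edges maps a node label to an iterable of successor labels.

    Returns a GEXF 1.2 document as a string.
    """
    # Give every node that appears anywhere a stable numeric id.
    labels = sorted(set(edges) | {v for vs in edges.values() for v in vs})
    ids = {label: str(i) for i, label in enumerate(labels)}

    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(root, "graph", {"defaultedgetype": "directed"})

    nodes_el = ET.SubElement(graph, "nodes")
    for label in labels:
        ET.SubElement(nodes_el, "node", {"id": ids[label], "label": label})

    edges_el = ET.SubElement(graph, "edges")
    edge_id = 0
    for source in labels:
        for target in sorted(edges.get(source, ())):
            ET.SubElement(edges_el, "edge", {
                "id": str(edge_id),
                "source": ids[source],
                "target": ids[target],
            })
            edge_id += 1
    return ET.tostring(root, encoding="unicode")


gexf_text = graph_to_gexf({"GACTGG": ["ACTGGG"], "ACTGGG": []})
print(gexf_text)
```

The resulting text can be saved with a `.gexf` extension and opened in the tools listed on this slide.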

Other Uses

• “Rearrangement” detection

Other Uses

• “Scaffolding”

– e.g. 86 contigs

• Closing

– Predicted primers

Other Uses

• Rearrangements

– Insertions/Deletions

– Islands

– Inversions

Other Uses

• Synthetic BAM

Takeaways

• A new way to leverage protein family databases

• “Reference free” structure for many bacterial genomes using feature families

• Quickly investigate whole genome relationships and speed up potentially expensive calculations

Acknowledgements

• Eric Nordberg

• Lenny Heath

• CID at VBI (PATRIC)

• RAST – Argonne (PATRIC)

https://github.com/aswarren

https://twitter.com/aswarren
