![Page 1: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/1.jpg)
De novo repeat classification and fragment assembly:
from de Bruijn to A-Bruijn graphs
![Page 2: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/2.jpg)
IS30 repeat family
Repeats in Bacterial Genome: IS30 repeat family in N. meningitidis
Edge label: 130(2) means length 130bp and multiplicity 2
![Page 3: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/3.jpg)
De novo repeat classification and fragment assembly with A-Bruijn graph approach
Pavel Pevzner (1), Haixu Tang (2), Glenn Tesler (3)
(1) Department of Computer Science and Engineering, UCSD
(2) Department of Computer Science, University of Indiana
(3) Department of Mathematics, UCSD
![Page 4: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/4.jpg)
Adapted from http://www.hhmi.org/research/investigators/eichler.html
Duplication landscape of 2p11
Mosaic Arrangements in Segmental Duplications in Human Genome
![Page 5: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/5.jpg)
Two-step model of segmental duplications (Evan Eichler, 1997)
Adapted from Horvath et al 2005
Ancestral duplication units are called duplicons
![Page 6: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/6.jpg)
Mosaic repeats structure: an imaginary example
A B C D E F G H I J
A B C D E F G H I J C
A B C D E F G H I J C B C D
A B C D E F G H I J C B C D F G C
• The mosaic structure of segmental duplications in the human genome is revealed using the A-Bruijn graph approach: Jiang, Tang, She, Pevzner, Eichler. Evolutionary reconstruction of segmental duplications reveals punctuated cores of human gene innovations (Nature Genetics, 2007)
![Page 7: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/7.jpg)
The Segmental Duplication in Human Genome: The Ancestral Duplicon Problem
• The duplicon problem: can we identify duplicons?
• Case-by-case studies:
* Jackson et al 1999; …, Stankiewicz et al 2004; Horvath et al 2005
• All duplicons: Hard
* Eichler group only had limited success with chr22 (Bailey et al 2002)
• The mosaic structure of segmental duplications in the human genome is revealed using the A-Bruijn graph approach: Zhaoshi Jiang, Haixu Tang, Xinwei She, Pavel Pevzner, Evan Eichler. Evolutionary reconstruction of segmental duplications reveals punctuated cores of human gene innovations (Nature Genetics, in press)
![Page 8: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/8.jpg)
The Duplicon Problem
• Tang et al 2007 proposed a computational method for identifying all duplicons.
Adapted from Tang et al 2005
![Page 9: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/9.jpg)
Repeat Classification
• Repeat representation as a mosaic of sub-repeats
• Detailed study of each sub-repeat and its further classification into repeat sub-families
![Page 10: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/10.jpg)
Mosaic Structure of Segmental Duplication in Human Chromosome 22
Bailey et al. 2002, Am. J. Hum. Genet.
![Page 11: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/11.jpg)
Algorithmic Challenge
• Problem: find all repeat elements and reveal the sub-repeat mosaic structure.
– Perfect repeats: de Bruijn graph, suffix tree.
– Imperfect repeats: OPEN PROBLEM.
• Goal: Generalize the de Bruijn graph for imperfect repeats.
![Page 12: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/12.jpg)
De Novo Repeat Classification: messy and ill-defined problem
“The problem of automated repeat sequence family classification is inherently messy and ill-defined and does not appear to be amenable to a clean algorithmic attack.”
Bao & Eddy 2002, Genome Research
![Page 14: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/14.jpg)
De Novo Repeat Classification
Library of repeat elements
Element 1 AGCCTACG … …
Element 2 TGCATTTT … …
Element 3 GAACTCAC … …
… …
De novo compilation
?
REPuter
Pairwise similarity
![Page 15: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/15.jpg)
Similarity matrix
A B C D E F G H I J C B C D F G C
![Page 16: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/16.jpg)
De novo Repeat Classification: Previous Studies
1. RepeatFinder: Volfovsky et al. 2001
2. RECON: Bao & Eddy, 2002
– Heuristic algorithms that work well for real data
– Do not reveal the mosaic structure
![Page 17: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/17.jpg)
Annotating Known Repeat Elements
Library of (known) repeat elements
Element 1 AGCCTACG … …
Element 2 TGCATTTT … …
Element 3 GAACTCAC … …
… …
De novo Repeat Classification Problem: Given a newly sequenced genome, find all repeat elements (sub-repeats) in this genome.
Genome
RepeatMasker
![Page 18: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/18.jpg)
Mosaic Structure of Repeats: A Real Example (Clone G12 from Human Chromosome Y)
Segment: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Length: 8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442
RECON (Bao and Eddy, 2002) does not reveal the mosaic:
A-Bruijn representation 2 copies
3 copies
2 copies
4 copies
?
![Page 19: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/19.jpg)
De Bruijn Graph: Applications in Bioinformatics
• Sequencing by hybridization (Pevzner, 1989)
• Re-sequencing with DNA arrays (Shamir and Tsur, 2001, Peer et al., 2002)
• Fragment assembly (Idury and Waterman, 1995, Pevzner et al. 2001)
• EST analysis (Heber et al., 2002)
• Computational mass-spectrometry (Bocker, 2003, Bandeira et al., 2007)
![Page 20: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/20.jpg)
De Bruijn Graph: Classification of Perfect Repeats
ABCDEFCGHBCDIFCGJ
Vertices: (k-1)-mers from the sequence
AB BC CD DE EF FC CG GH HB DI IF GJ
Edges: k-mers from the sequence
Every sub-repeat is represented as a repeat edge in the graph.
Repeat edges: BCD, FCG
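The construction on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function name `de_bruijn_edges` is made up here, and k = 3 is assumed because the vertices listed on the slide are two letters long.

```python
from collections import Counter

def de_bruijn_edges(sequence, k):
    """Count every k-mer (edge) of the sequence. Each edge joins its
    (k-1)-mer prefix vertex to its (k-1)-mer suffix vertex."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

sequence = "ABCDEFCGHBCDIFCGJ"   # toy sequence from the slide
edges = de_bruijn_edges(sequence, k=3)   # k = 3 is an assumption

# Vertices are the (k-1)-mers that appear as a prefix or suffix of an edge.
vertices = {kmer[:-1] for kmer in edges} | {kmer[1:] for kmer in edges}

# A repeat edge is a k-mer that occurs more than once in the sequence.
repeat_edges = sorted(kmer for kmer, count in edges.items() if count > 1)

print(sorted(vertices))
print(repeat_edges)
```

Edges with a count above one are exactly the perfect sub-repeats of length k; longer perfect repeats show up as paths of such edges.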
![Page 21: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/21.jpg)
Repeat Gluing
[Figure: gluing of repeat copies; diagram labels x and y]
![Page 22: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/22.jpg)
Repeat Gluing
[Figure: gluing of repeat copies; diagram labels x and y; annotation: gluing instruction]
![Page 23: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/23.jpg)
Similarity matrix
A B C D E F G H I J C B C D F G C
![Page 24: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/24.jpg)
A B C D E F G H I J C B C D F G C
[Figure: repeat graph with edges A to J]
Sub-repeats: edges in the repeat graph
C: 4 copies; B, D, F, G: 2 copies each
![Page 25: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/25.jpg)
In reality, repeats are usually imperfect
Segment: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Length: 8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442
… … AG-CCATCGACGTCACC … …
… … AGTGCCTCG-CGTCTCC … …
![Page 26: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/26.jpg)
Similarity matrix
A B C D E F G H I J C B C D F G C
![Page 27: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/27.jpg)
Inconsistent Gluing
[Figure: diagram labels x and y]
Consistent Gluing
[Figure: diagram labels x and y]
![Page 28: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/28.jpg)
Challenge: Generalize the Notion of De Bruijn Graph for Imperfect Repeats
• Input
– a genomic sequence
– all significant local pairwise alignments
• Output
– repeat graph representing all repeats as a mosaic of sub-repeats
![Page 29: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/29.jpg)
A-Bruijn Graph Construction
Genomic sequence: … acat … acgt … ccat …
Pairwise local alignments:
1. acat / acgt
2. acgt / ccat
3. acat / ccat
[Figure: similarity matrix of … a c a t … a c g t … c c a t … against itself; the A-graph on the letters of acat, acgt and ccat; the A-Bruijn graph with vertex labels a,c; c; a,g; t]
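The gluing step for this toy example can be sketched with a union-find structure: every aligned pair of positions is merged into one vertex. This is an illustrative sketch only. The three words are concatenated without the elided "…" stretches, the alignments are assumed to be gapless, and the helper names `find` and `glue` are hypothetical.

```python
def find(parent, i):
    """Return the representative of position i (with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def glue(n_positions, aligned_pairs):
    """Merge every pair of aligned positions; return the vertex of each position."""
    parent = list(range(n_positions))
    for i, j in aligned_pairs:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)   # keep the smallest position as root
    return [find(parent, i) for i in range(n_positions)]

words = ["acat", "acgt", "ccat"]   # the three similar regions from the slide
text = "".join(words)              # assumption: the "…" stretches are dropped
starts = [0, 4, 8]

# Gapless alignments 1-3 from the slide, expressed as pairs of word start positions.
alignments = [(0, 4), (4, 8), (0, 8)]
aligned_pairs = [(s + o, t + o) for s, t in alignments for o in range(4)]
vertex_of = glue(len(text), aligned_pairs)

# Label each glued vertex with the set of letters it absorbed.
labels = {}
for pos, v in enumerate(vertex_of):
    labels.setdefault(v, set()).add(text[pos])

# Consecutive positions inside a word become edges between glued vertices.
edges = {(vertex_of[s + o], vertex_of[s + o + 1]) for s in starts for o in range(3)}
path = [",".join(sorted(labels[v])) for v in sorted(labels)]
print(path, sorted(edges))
```

With gapped or mutually inconsistent alignments the same gluing produces the bulges and whirls discussed on the following slides.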
![Page 30: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/30.jpg)
A-Bruijn Graph Construction: Bulges
Genomic sequence: … at … act … acat …
Consistent pairwise local alignments:
1. a-t / act
2. ac-t / acat
3. a--t / acat
[Figure: similarity matrix of … a t … a c t … a c a t … against itself; the A-graph on the letters of at, act and acat; the A-Bruijn graph with vertex labels a, c, a, t]
![Page 31: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/31.jpg)
A-Bruijn Graph Construction: Whirls
Genomic sequence: … at … act … acat …
Inconsistent pairwise local alignments:
1. a-t / act
2. ac-t / acat
3. --at / acat
[Figure: similarity matrix of … a t … a c t … a c a t … against itself; the A-graph on the letters of at, act and acat; the A-Bruijn graph with vertex labels a, c, t]
![Page 32: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/32.jpg)
Repeat Graph
Segment: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Length: 8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442
[Figure: A-Bruijn graph and the corresponding repeat graph; diagram labels x and y]
![Page 33: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/33.jpg)
Problem: Simplifying A-Bruijn Graph
[Figure: A-Bruijn graph and the repeat graph obtained from it]
![Page 34: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/34.jpg)
Removing Bulges and Whirls: Solving MSLG Problem
Maximum Subgraph with Large Girth (MSLG) Problem:
Input: a graph.
Output: a maximum weight subgraph that does not contain short cycles, i.e. cycles shorter than a given girth parameter.
A solution is known only when the girth is infinite: the Maximum Spanning Tree Problem (maximum weight acyclic subgraph).
In general, the problem is NP-hard (Skiena, 2002).
![Page 35: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/35.jpg)
Minimum (or Maximum) Spanning Trees
• The first algorithm for finding an MST was developed by Borůvka in 1926 to minimize the cost of electrical coverage in Moravia.
• The MST Problem
– Connect all of the cities using the least amount of wire possible
![Page 36: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/36.jpg)
Solution: Maximum Spanning Tree Approximation to MSLG Problem
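The slides do not say how the spanning tree is computed, so the following is only a sketch of one standard choice: Kruskal's algorithm run on edges sorted by decreasing weight. The function name, the edge format `(weight, u, v)` and the toy weights are all assumptions made for illustration.

```python
def maximum_spanning_tree(n_vertices, weighted_edges):
    """Kruskal's algorithm on edges sorted by decreasing weight.
    weighted_edges is a list of (weight, u, v) with vertices 0..n_vertices-1.
    Returns the kept edges; on a disconnected graph this is a spanning forest."""
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    kept = []
    for weight, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:              # the edge joins two components, so no cycle
            parent[ru] = rv
            kept.append((weight, u, v))
    return kept

# Hypothetical toy graph: heavier edges stand for better-supported adjacencies.
toy_edges = [(5, 0, 1), (4, 1, 2), (1, 0, 2), (3, 2, 3), (2, 1, 3)]
tree = maximum_spanning_tree(4, toy_edges)
print(tree)
```

The result has no cycles at all, which is stricter than the MSLG requirement of having no short cycles; that gap is why the slide calls it an approximation.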
![Page 37: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/37.jpg)
RepeatGluer Algorithm
A-Bruijn graph
Bulge and whirl removal
Zigzag path straightening
Erosion
[Figure: vertices a to j; glued vertex labels a, c, b, (d,h), (e,g,i), (f,j)]
![Page 38: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/38.jpg)
IS30 repeat family
Repeats in Bacterial Genome: IS30 repeat family in N. meningitidis
Edge label: 130(2) means length 130bp and multiplicity 2
![Page 39: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/39.jpg)
The Repeats Graph of N. meningitidis
![Page 40: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/40.jpg)
Repeats in Human Genome (ALU)
Edge label: 17(80-127) means length 17bp and multiplicity between 80 and 127
![Page 41: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/41.jpg)
Multiple Alignment = Finding Repeats in a Concatenate of all Sequences
Raphael et al., 2004 (Genome Research)
![Page 42: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/42.jpg)
Fragment Assembly in Genome Sequencing
• Genomes are very long (the human genome has 3 billion base pairs)
• Current technology can only reliably “read” a short DNA fragment (a read, typically 500 to 1000 base pairs)
• Whole genome shotgun sequencing: break the genome into millions of overlapping pieces (reads) and sequence each read
• Fragment assembly: assembly of the genome from millions of overlapping reads
• Celera Genomics has assembled the human genome (2001)
• Over 100 large genomes are waiting for assembly
• Current assembly programs may produce mis-assemblies
– Recent mammalian genome assemblies (e.g. mouse, rat) are estimated to have thousands of mis-assemblies
• Repeats in the genomic sequence are the main cause of mis-assembly
![Page 43: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/43.jpg)
Fragment Assembly Using Repeat Graph
Reads
Genome: A B C D E F G H I J C B C D F G C
Alternative reconstruction: A B C D I F G H E J C B C D F G C
[Figure: repeat graph with edges A to J]
Every possible genome reconstruction corresponds to an Eulerian path in the repeat graph.
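As a sketch of what "Eulerian path" means here, the following Python snippet runs Hierholzer's algorithm on the de Bruijn graph of the toy sequence used earlier in the talk and spells a sequence from the path it finds. The choice of k = 3, the function names and the use of a de Bruijn graph instead of a full repeat graph are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm for a directed multigraph given as (u, v) pairs.
    Assumes an Eulerian path exists; returns it as a list of vertices."""
    out_edges = defaultdict(list)
    balance = Counter()                     # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: emit the vertex
    return path[::-1]

def kmers(s, k):
    return [s[i:i + k] for i in range(len(s) - k + 1)]

genome = "ABCDEFCGHBCDIFCGJ"
k = 3                                          # assumption, as in the de Bruijn slide
edges = [(kmer[:-1], kmer[1:]) for kmer in kmers(genome, k)]
walk = eulerian_path(edges)
reconstruction = walk[0] + "".join(v[-1] for v in walk[1:])
print(reconstruction)
```

The printed sequence uses every edge exactly once, so it has the same k-mers as the genome, but it need not be the genome itself: the graph admits more than one Eulerian path, which is the ambiguity that repeats introduce into assembly.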
![Page 44: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/44.jpg)
Fragment Assembly = Building Repeat Graph from Concatenated Reads
Key idea: The repeat graph built from concatenated reads is identical to the repeat graph built from genomic sequence if the reads “cover” the genomic sequence.
![Page 45: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/45.jpg)
Fragment Assembly: Building Repeat Graph from Reads
Similarity matrix
… 1 0 … 1 0 0 … 0 0 1 0 …
… … … … … … … … … … … … …
… 0 1 … 0 0 1 … 0 0 0 1 …
… … … … … … … … … … … … …
… 0 0 … 0 1 0 … 0 1 0 0 …
… 1 0 … 1 0 0 … 1 0 0 0 …
… 0 1 … 0 0 1 … 0 0 0 1 …
… … … … … … … … … … … … …
… 1 0 … 1 0 0 … 1 0 0 0 …
… 0 0 … 0 0 0 … 0 0 1 0 …
… 0 0 … 0 1 0 … 0 1 0 0 …
… 0 1 … 0 0 1 … 0 0 0 1 …
… … … … … … … … … … … … …
[Figure: rows and columns of the matrix labelled a, b, c]
![Page 46: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/46.jpg)
Fragment Assembly: Building Repeat Graph from Reads
[Figure: snapshots of the similarity matrix with rows and columns labelled a, b, c, and the repeat graph with labels x and y]
![Page 47: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/47.jpg)
FragmentGluer Algorithm (outline)
• Concatenate reads (in an arbitrary order!) into a single sequence
• Compute the similarity matrix for this concatenated sequence (overlap detection between reads)
• Use this matrix as a “glue” to build the repeat graph with the RepeatGluer algorithm
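The outline above can be illustrated with a deliberately simplified toy. In this sketch the "similarity matrix" is replaced by exact shared k-mers between error-free reads, reverse complements are ignored, and gluing is done with union-find; none of this is taken from the actual FragmentGluer implementation, and the names `glue_reads`, `genome` and `reads` are hypothetical.

```python
def glue_reads(reads, k):
    """Concatenate reads, glue positions that lie in identical k-mers,
    and return the number of glued vertices."""
    text = "".join(reads)
    parent = list(range(len(text)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}     # k-mer -> position of its first occurrence in text
    offset = 0
    for read in reads:
        # k-mers are taken inside each read, never across a read boundary.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            pos = offset + i
            if kmer in first_seen:
                for d in range(k):          # glue the two occurrences letter by letter
                    a, b = find(first_seen[kmer] + d), find(pos + d)
                    if a != b:
                        parent[a] = b
            else:
                first_seen[kmer] = pos
        offset += len(read)
    return len({find(i) for i in range(len(text))})

genome = "ACGTTGCAAGCTCATG"                 # toy genome with no repeated 4-mer
reads = [genome[s:s + 8] for s in range(0, len(genome) - 7, 2)]
shuffled = [reads[3], reads[0], reads[4], reads[2], reads[1]]
print(glue_reads(reads, 4), glue_reads(shuffled, 4))
```

Both orders give one glued vertex per genome position, which is the point of the "arbitrary order" remark: the glue depends on which positions are similar, not on where the reads sit in the concatenation.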
![Page 48: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/48.jpg)
FragmentGluer algorithm
• Identify and remove chimeric reads;
• Concatenate the remaining reads into a sequence and compose the similarity matrix A from the pairwise alignments of reads;
• Construct the A-Bruijn graph from the similarity matrix A;
• Remove bulges and whirls;
• Thread each read through the resulting graph and form the consensus sequence from reads;
• Define the coverage of a vertex in the graph as the number of reads that are threaded through this vertex;
• Define the coverage of a simple path as the average coverage of its vertices;
![Page 49: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/49.jpg)
FragmentGluer algorithm (cont’d)
• Form the repeat graph by collapsing simple paths in the graph;
• The consensus sequence of an edge in the repeat graph is defined as the consensus sequence of the corresponding simple path;
• Output repeat families as tangles in the repeat graph;
• Every tangle is a collection of edges (sub-repeats) with corresponding consensus sequences;
• Transform mate-pairs into mate-paths in the graph obtained and perform equivalent transformations on the resulting set of mate-paths;
• Use mate-pairs to resolve differences between nearly identical copies of repeats;
• Define contigs as consensus sequences of simple paths in the resulting graph;
• Assemble the resulting contigs into scaffolds by the EULER Scaffolding algorithm
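The step "collapsing simple paths" can be sketched as finding maximal non-branching paths in a directed graph. The code below is an illustration under simplifying assumptions, not the talk's implementation: the graph is a plain list of `(u, v)` edges with hypothetical vertex names, and isolated cycles made only of one-in, one-out vertices are ignored.

```python
from collections import defaultdict

def collapse_simple_paths(edges):
    """Return the maximal non-branching paths of a directed graph.
    Each returned path would become one edge of the collapsed graph."""
    out_nbrs = defaultdict(list)
    in_deg = defaultdict(int)
    for u, v in edges:
        out_nbrs[u].append(v)
        in_deg[v] += 1
    vertices = set(out_nbrs) | set(in_deg)

    def is_inner(v):
        # Exactly one edge in and one edge out: the vertex lies inside a simple path.
        return in_deg[v] == 1 and len(out_nbrs[v]) == 1

    paths = []
    for v in sorted(vertices):
        if is_inner(v):
            continue                      # paths start only at branching or end vertices
        for w in out_nbrs[v]:
            path = [v, w]
            while is_inner(w):            # extend through inner vertices
                w = out_nbrs[w][0]
                path.append(w)
            paths.append(path)
    return paths

# Hypothetical toy graph with one branch between c and e.
toy = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
       ("c", "f"), ("f", "e"), ("e", "g")]
print(collapse_simple_paths(toy))
```

In the setting of the slide, the consensus sequence of each returned path would label the corresponding edge of the repeat graph.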
![Page 50: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/50.jpg)
Benchmarking EULER+, Phrap and Arachne on all BACs from Human Chromosome 20
• EULER+ produced the fewest misassembled contigs:
Misassembled contigs by Phrap: 37
Misassembled contigs by ARACHNE: 17
Misassembled contigs by EULER+: 7
• EULER+ also had the fewest collapsed repeat copies (4), ahead of Phrap (5) and ARACHNE (9).
• The average number of contigs per BAC was lowest for EULER+ (6.2), followed by Phrap (6.8) and ARACHNE (13.8).
Pevzner et al., 2004 (Genome Research)
![Page 51: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/51.jpg)
Comparison of Euler and Newbler Assemblies
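| Genome | Assembler | No. contigs | N50 | Misassembled | Coverage | Net size |
|---|---|---|---|---|---|---|
| E. coli | Euler | 199 | 46887 | 3 | 94.7 | 4277 |
| E. coli | Newbler | 141 | 60757 | 0 | 99.1 | 4531 |
| E. coli | Repeat graph | 94 | 125693 | - | 92.1 | 4560 |
| S. pneumoniae | Euler | 127 | 32619 | 1 | 96.8 | 2001 |
| S. pneumoniae | Newbler | 253 | 11905 | 0 | 95.0 | 2000 |
| S. pneumoniae | Repeat graph | 136 | 36004 | - | 97.5 | 2091 |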
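The slides do not define the N50 column. Assuming the usual definition (the largest length L such that contigs of length at least L together cover at least half of the total assembled length), it can be computed as in the sketch below. The contig lengths in the example are made up and are not taken from the table.

```python
def n50(contig_lengths):
    """N50 under the common definition: walk contigs from longest to shortest
    and return the length at which half of the total is first reached."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0   # empty assembly

# Hypothetical contig lengths, for illustration only.
example = [80, 70, 50, 40, 30, 20]
print(n50(example))
```

A larger N50 with a similar net size means the assembly is broken into fewer, longer pieces, which is how the table compares the assemblers.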
![Page 52: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/52.jpg)
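Comparison of Euler and Newbler Assemblies (without flowgrams)
| Genome | Assembler | No. contigs | N50 | Misassembled | Coverage | Net size |
|---|---|---|---|---|---|---|
| E. coli | Euler | 199 | 46887 | 3 | 94.7 | 4277 |
| E. coli | Newbler | 311 | 28475 | 0 | 99.1 | 4531 |
| E. coli | Repeat graph | 94 | 125693 | - | 92.1 | 4560 |
| S. pneumoniae | Euler | 127 | 32619 | 1 | 96.8 | 2001 |
| S. pneumoniae | Newbler | 253 | 11905 | 0 | 95.0 | 2000 |
| S. pneumoniae | Repeat graph | 136 | 36004 | - | 97.5 | 2091 |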
![Page 53: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/53.jpg)
454 Reads + Sanger Reads
[Plot: x-axis Sanger coverage, 1x to 5x; y-axis 0 to 50,000; series: S. pneumoniae; S. pneumoniae, 3kb mates; S. pneumoniae 30x 454]
![Page 54: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/54.jpg)
Acknowledgements
• Evan Eichler, Genome Sciences, University of Washington
![Page 55: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/55.jpg)
A B C D E F G H I J C B C D F G C
[Figure: repeat graph with edges A to J]
Sub-repeats: tangle edges
C: 4 copies; B, D, F, G: 2 copies each
![Page 56: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/56.jpg)
Segment: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Length: 8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442
[Figure: the 15 segments threaded through the repeat graph]
Repeat graph edges (segments: length):
1: 8328
2, 9, 13: 140 (3 copies)
3, 6, 10, 14: 628 (4 copies)
4, 7, 11: 1185 (3 copies)
5: 2905
8, 12: 381 (2 copies)
15: 161442
Sub-repeats: tangle edges
![Page 57: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/57.jpg)
Mosaic repeats structure: an imaginary example
A B C D E F G H I J
A B C D E F G H I J C
A B C D E F G H I J C B C D
A B C D E F G H I J C B C D F G C
![Page 58: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/58.jpg)
Mosaic structure of repeats: a real example (BAC G12 from human chromosome Y)
Segment: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Length: 8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442
![Page 59: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/59.jpg)
Mosaic structure of repeats
A B C D E F G H I J C B C D F G C
Consensus repeat elements
Repeat boundary problem (Bao & Eddy, 2002)
Mosaic representation: C with 4 copies; B, D, F, G with 2 copies each
![Page 60: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/60.jpg)
A-Bruijn graph approach: an overview
All local alignments (genomic dot plot)
Gluing
A-Bruijn Graph
Removing bulges and whirls
Repeat Graph
![Page 61: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach](https://reader035.vdocuments.mx/reader035/viewer/2022062603/5f02672b7e708231d4041ab1/html5/thumbnails/61.jpg)
Bacterial genomes assembly
All 5 bacterial genomes were mis-assembled by Phrap (repeat collapsing, joining of non-contiguous regions, or both). ARACHNE and EULER+ made no assembly errors except for one genome.