Genome Sequencing Algorithms (Graph Algorithms)
William Hamilton (1805 – 1865)
Leonhard Euler (1707 – 1783)
Nicolaas Govert de Bruijn (1918 – 2012)
The Genome Sequencing Problem
● Determining the order of nucleotides in a genome
● Human genome contains about 3 billion nucleotides
● Applications in Medicine, Agriculture, Biotechnology, ...
The Genome Sequencing Problem
● There is no technology to read the genome from one end to another.
– Short snippets, called reads (200-300 nucleotides), can be identified.
– The location of a read within the genome is not known.
● Assembling individual reads into the entire genome is akin to solving a giant overlapping puzzle.
● The newspaper explosion analogy
History of Genome Sequencing
● 1977: Walter Gilbert and Frederick Sanger developed independent DNA sequencing methods.
● 1990: Human Genome Project, Francis Collins.
● 1997: Celera Genomics, Craig Venter.
● 2000: Human genome is sequenced.
Next Generation Sequencing
● Illumina sequences human genomes for $10,000
● Complete Genomics sequences 100s of genomes per month
● Beijing Genome Institute has 100s of sequencing machines and is the world's biggest sequencing center.
Next Generation Sequencing
● Identification of mutations in personal genomes for health diagnosis
● Genome 10K project
Sequencing in India
Genome Assembly – The Computational Problem
[Figure: multiple identical copies of the genome]
ATGTGCATACTAAGCATACTAAGCATACTAAGCATACTAATGTGCATACTAAGCATGCTA
Sequencing Machine generates reads
A String Reconstruction Problem
The Genome Sequencing Problem
Reconstruct a genome from reads
Input: A collection of strings, Reads
Output: A string, Genome, reconstructed from all the Reads
k-mer Composition
Composition_3(TAATGCCATGGGATGTT) =
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
Lexicographical ordering of k-mers
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
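The k-mer composition can be computed with a sliding window. The following is a minimal sketch; the function name `composition` is an illustrative choice, not something defined on the slides.

```python
def composition(text, k):
    """Return every k-mer of text in lexicographic order (repeats are kept)."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    return sorted(kmers)


# The example genome from the slide, with k = 3.
print(" ".join(composition("TAATGCCATGGGATGTT", 3)))
# AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
```

Sorting discards the original order of the k-mers, which is exactly the information a sequencing machine does not provide.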
The String Reconstruction Problem
Reconstruct a string from its k-mer composition.
Input: A collection of k-mers
Output: A Genome, such that Composition_k(Genome) is equal to the collection of k-mers
Naive String Reconstruction Approach
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
TAA AAT ATG TGT GTT
No 3-mer begins with TT!
Representing a Genome as a Path
Composition_3(TAATGCCATGGGATGTT) =
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
The Genome
Nodes in a Graph
Connect k-mer1 with k-mer2 if suffix(k-mer1) = prefix(k-mer2)
Path turns into a Graph
[Figure: the 15 3-mers as nodes, with an edge wherever the suffix of one 3-mer equals the prefix of another]
Nodes are ordered lexicographically.
How does one find the genome string?
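The rule "connect k-mer1 with k-mer2 if suffix(k-mer1) = prefix(k-mer2)" can be sketched as follows. Nodes are stored as list indices because ATG occurs three times and each occurrence is a separate node; the name `overlap_graph` and the choice to skip self-loops (GGG overlaps itself) are assumptions made for this illustration.

```python
from collections import defaultdict


def overlap_graph(kmers):
    """Map each k-mer index to the indices of the k-mers that may follow it."""
    by_prefix = defaultdict(list)
    for j, kmer in enumerate(kmers):
        by_prefix[kmer[:-1]].append(j)
    # An edge i -> j exists when the suffix of kmers[i] is the prefix of kmers[j].
    return {
        i: [j for j in by_prefix.get(kmer[1:], []) if j != i]
        for i, kmer in enumerate(kmers)
    }


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
graph = overlap_graph(kmers)
for i, successors in graph.items():
    print(kmers[i], "->", " ".join(kmers[j] for j in successors))
```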
Genome Path in the Graph
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TAATGCCATGGGATGTT
The genome string is a Hamiltonian walk in the graph
Hamiltonian Path Problem
Find a Hamiltonian path in the graph
Input: A graph.
Output: A path visiting every node in the graph exactly once.
Hamiltonian Path: A path in a graph that traverses every node exactly once
William R Hamilton (1805 – 1865)
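For a graph as small as the 15-node example, a Hamiltonian path can be found by plain backtracking. The sketch below is only an illustration: it takes exponential time in the worst case, and the names `hamiltonian_path` and `spell` are hypothetical helpers.

```python
def hamiltonian_path(graph):
    """Return one path visiting every node exactly once, or None if none exists."""
    n = len(graph)

    def extend(path, visited):
        if len(path) == n:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                found = extend(path + [nxt], visited)
                if found:
                    return found
                visited.remove(nxt)  # backtrack
        return None

    for start in graph:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(kmers_in_order):
    """Glue overlapping k-mers: the first k-mer plus the last letter of each other."""
    return kmers_in_order[0] + "".join(k[-1] for k in kmers_in_order[1:])


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
# Edge i -> j when the suffix of kmers[i] equals the prefix of kmers[j].
graph = {
    i: [j for j, b in enumerate(kmers) if i != j and a[1:] == b[:-1]]
    for i, a in enumerate(kmers)
}
path = hamiltonian_path(graph)
genome = spell([kmers[i] for i in path])
print(genome)
```

The printed string has the given 3-mer composition, but it need not be the original genome: this collection of 3-mers is also the composition of TAATGGGATGCCATGTT.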
A Different Path
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
3-mers as nodes
3-mers as edges, with nodes labeled by the prefixes and suffixes of the corresponding 3-mers
[Figure: the genome as a path of 15 edges (3-mers) through 16 nodes:]
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT
Glue Identical Nodes
[Figure: nodes with identical labels (AT, TG, GG) are glued together step by step; the edges (3-mers) are unchanged]
De Bruijn Graph of the Genome
[Figure: De Bruijn graph with nodes TA AA AT TG GC CC CA GG GA GT TT and edges TAA AAT ATG ATG ATG TGC GCC CCA CAT TGG GGG GGA GAT TGT GTT]
The genome string is an Eulerian walk in the De Bruijn graph
TAATGCCATGGGATGTT
Eulerian Path Problem
Leonhard Euler (1707 – 1783)
Find an Eulerian path in a graph
Input: A graph.
Output: A path visiting every edge in the graph exactly once.
Eulerian Path: A path in a graph that traverses every edge exactly once.
Hamiltonian Path vs. Eulerian Path
[Figure: De Bruijn graph of the genome]
Euler presented an efficient solution to the Eulerian path problem. No fast algorithm is known for the Hamiltonian Path Problem, which is NP-Complete.
NP-Complete Problems
Garey and Johnson, Computers and Intractability, 1979
Genome Assembly
TAATGCCATGGGATGTT
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
To Do ...
[Figure: the genome, its 3-mers, and the De Bruijn graph of the genome]
Constructing De Bruijn Graph
The composition of the genome is known
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
Each 3-mer becomes an isolated edge from its prefix to its suffix:
TAA: TA → AA
AAT: AA → AT
ATG: AT → TG
TGC: TG → GC
GCC: GC → CC
CCA: CC → CA
CAT: CA → AT
ATG: AT → TG
TGG: TG → GG
GGG: GG → GG
GGA: GG → GA
GAT: GA → AT
ATG: AT → TG
TGT: TG → GT
GTT: GT → TT
Constructing De Bruijn Graph
[Figure: the 15 edges laid out in a row, with node labels:]
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT
Glue Identical Nodes
[Figure: nodes with identical labels (AT, TG, GG) are glued together step by step]
De Bruijn Graph of the Genome Composition
[Figure: De Bruijn graph with nodes TA AA AT TG GC CC CA GG GA GT TT]
De Bruijn Graph(Genome Composition) == De Bruijn Graph(Genome)
Constructing the De Bruijn Graph
● De Bruijn graph of a collection of k-mers:
– Represent every k-mer as an edge between its prefix and suffix
– Glue all nodes with identical labels
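The two steps above can be sketched with a dictionary of adjacency lists. Using the node label as the dictionary key performs the gluing implicitly, since equal labels map to the same entry. The function name `de_bruijn` is an illustrative choice.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Return adjacency lists: prefix -> list of suffixes, one entry per k-mer."""
    graph = defaultdict(list)
    for kmer in kmers:
        # Each k-mer is one edge from its (k-1)-prefix to its (k-1)-suffix.
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
graph = de_bruijn(kmers)
for node in sorted(graph):
    print(node, "->", " ".join(graph[node]))
```

Repeated k-mers such as ATG give parallel edges (AT appears with TG listed three times), so this structure is a multigraph. Nodes without outgoing edges, such as TT, appear only as targets.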
Universal String Problem (Nicolaas De Bruijn, 1946): Find a circular string containing each binary k-mer exactly once.
Euler Cycle Problem
Find an Eulerian cycle in a graph
Input: A graph.
Output: A cycle visiting every edge in the graph exactly once.
The Königsberg Bridges
Eulerian Graph
A graph is Eulerian if it contains an Eulerian cycle
Every balanced and strongly connected directed graph is Eulerian (balanced: every node has as many incoming as outgoing edges)
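The two conditions can be checked directly. The sketch below assumes a directed multigraph stored as a dictionary from node to list of successors; `is_eulerian` is a hypothetical helper, and the extra edge TT → TA added at the end is an assumption made only to turn the example graph into a balanced one.

```python
from collections import defaultdict


def is_eulerian(graph):
    """Return True if the directed multigraph is balanced and strongly connected."""
    nodes = set(graph) | {v for succs in graph.values() for v in succs}
    if not nodes:
        return False
    reverse = defaultdict(list)
    for u, succs in graph.items():
        for v in succs:
            reverse[v].append(u)
    # Balanced: in-degree equals out-degree at every node.
    if any(len(graph.get(u, [])) != len(reverse.get(u, [])) for u in nodes):
        return False

    def reachable(adj, start):
        seen, stack = {start}, [start]
        while stack:
            for v in adj.get(stack.pop(), []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    # Strongly connected: everything is reachable forwards and backwards.
    start = next(iter(nodes))
    return reachable(graph, start) == nodes and reachable(reverse, start) == nodes


kmers = "TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT".split()
genome_graph = defaultdict(list)
for kmer in kmers:
    genome_graph[kmer[:-1]].append(kmer[1:])

print(is_eulerian(genome_graph))   # False: TA and TT are not balanced
genome_graph["TT"].append("TA")    # hypothetical closing edge
print(is_eulerian(genome_graph))   # True
```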
Algorithm to Find the Eulerian Cycle
[Figure: worked example on a small graph with numbered labels, built up over several slides]
EulerianCycle(Graph)
    form a cycle Cycle by randomly walking in Graph
    while there are unexplored edges in Graph
        select a node newStart in Cycle with unexplored edges
        form Cycle' by traversing Cycle (starting at newStart) and then randomly walking
        Cycle ← Cycle'
    return Cycle
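A runnable counterpart of the pseudocode is sketched below. Instead of literally re-traversing Cycle from newStart, it uses the common stack-based formulation of the same idea (often attributed to Hierholzer): walk until stuck, then back up to a node that still has unexplored edges and continue from there. The input is assumed to be balanced and strongly connected; this is not checked.

```python
from itertools import product


def eulerian_cycle(graph):
    """Return an Eulerian cycle as a list of nodes (first node == last node).

    graph maps each node to a list of successors and is assumed to be
    balanced and strongly connected.
    """
    unused = {u: list(vs) for u, vs in graph.items()}  # unexplored edges
    start = next(iter(unused))
    stack, cycle = [start], []
    while stack:
        u = stack[-1]
        if unused.get(u):
            stack.append(unused[u].pop())   # keep walking along an unexplored edge
        else:
            cycle.append(stack.pop())       # stuck: emit node and back up
    return cycle[::-1]


# Usage: a universal circular string for binary 3-mers.
# Nodes are all binary (k-1)-mers, edges are all binary k-mers.
k = 3
graph = {
    "".join(p): ["".join(p[1:]) + bit for bit in "01"]
    for p in product("01", repeat=k - 1)
}
cycle = eulerian_cycle(graph)
universal = "".join(node[-1] for node in cycle[1:])
print(universal)
```

Reading the printed string circularly, every binary 3-mer occurs exactly once, which is the Universal String Problem for k = 3.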
From Reads to De Bruijn Graph to Genome
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
[Figure: De Bruijn graph of the 3-mers above, with an Eulerian path traced through it]
TAATGCCATGGGATGTT
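The whole pipeline (3-mers, De Bruijn graph, Eulerian path, genome) fits in one short function. This is a sketch under stated assumptions: the input is error-free, all k-mers have the same length, and the path starts at the node with one more outgoing than incoming edge. The name `reconstruct` is illustrative.

```python
from collections import defaultdict


def reconstruct(kmers):
    """Spell one string whose k-mer composition equals kmers."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # An Eulerian path must start where out-degree exceeds in-degree by one.
    starts = [node for node, b in balance.items() if b == 1]
    start = starts[0] if starts else kmers[0][:-1]

    # Stack-based Eulerian traversal; edges are consumed as they are used.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(kmers) + 1:
        raise ValueError("the k-mers do not form a single Eulerian path")
    return path[0] + "".join(node[-1] for node in path[1:])


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
genome = reconstruct(kmers)
print(genome)
```

The result always has the requested composition, but when the graph has more than one Eulerian path the function returns whichever one the traversal happens to find.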
Multiple Eulerian Paths
[Figure: De Bruijn graph of the genome with alternative Eulerian paths traced through it]
TAATGCCATGGGATGTT
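To see how many genomes are consistent with a set of k-mers, all Eulerian paths can be enumerated by backtracking: each k-mer is an edge, and a partial string may be extended by any unused k-mer whose prefix matches its last k-1 letters. This is a brute-force sketch for small inputs only, and `all_reconstructions` is a hypothetical name.

```python
def all_reconstructions(kmers):
    """Return the set of all strings whose k-mer composition equals kmers."""
    results = set()
    used = [False] * len(kmers)
    overlap = len(kmers[0]) - 1

    def extend(text, count):
        if count == len(kmers):
            results.add(text)
            return
        tried = set()  # identical k-mers lead to identical strings; try each once
        for i, kmer in enumerate(kmers):
            if used[i] or kmer in tried:
                continue
            if kmer[:-1] == text[-overlap:]:
                tried.add(kmer)
                used[i] = True
                extend(text + kmer[-1], count + 1)
                used[i] = False  # backtrack

    for first in sorted(set(kmers)):
        i = kmers.index(first)
        used[i] = True
        extend(first, 1)
        used[i] = False
    return results


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
for genome in sorted(all_reconstructions(kmers)):
    print(genome)
# TAATGCCATGGGATGTT
# TAATGGGATGCCATGTT
```

The two strings differ only in the order in which the two loops through the node TG are traversed, so 3-mers alone cannot tell them apart.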
DNA Sequencing with Read-pairs
● A read-pair is a pair of reads separated by a fixed distance d.
Genome: TAATGCCATGGGATGTT. AAT-CCA is a (3,1) read-pair.
AAT-CCA represents the sequence AATGCCA in the original De Bruijn graph; the one nucleotide between the two reads (here G) is not read.
DNA Sequencing with Read-pairs
Composition_3(TAATGCCATGGGATGTT) =
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
PairedComposition_3,1(TAATGCCATGGGATGTT) =
TAA|GCC AAT|CCA ATG|CAT TGC|ATG GCC|TGG CCA|GGG CAT|GGA ATG|GAT TGG|ATG GGG|TGT GGA|GTT
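The paired composition can be computed with the same sliding window as the ordinary composition, reading two k-mers that are d positions apart. A minimal sketch, with `paired_composition` as an illustrative name:

```python
def paired_composition(text, k, d):
    """Return all (k,d)-mers of text as (first, second) tuples, sorted."""
    span = 2 * k + d  # total length covered by one read-pair
    pairs = [
        (text[i:i + k], text[i + k + d:i + span])
        for i in range(len(text) - span + 1)
    ]
    return sorted(pairs)


pairs = paired_composition("TAATGCCATGGGATGTT", 3, 1)
print(" ".join(first + "|" + second for first, second in pairs))
# AAT|CCA ATG|CAT ATG|GAT CAT|GGA CCA|GGG GCC|TGG GGA|GTT GGG|TGT TAA|GCC TGC|ATG TGG|ATG
```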
Paired Composition
TAA|GCC AAT|CCA ATG|CAT TGC|ATG GCC|TGG CCA|GGG CAT|GGA ATG|GAT TGG|ATG GGG|TGT GGA|GTT
Lexicographical order:
AAT|CCA ATG|CAT ATG|GAT CAT|GGA CCA|GGG GCC|TGG GGA|GTT GGG|TGT TAA|GCC TGC|ATG TGG|ATG
String Reconstruction from Read-pairs
Input: A collection of paired k-mers
Output: A string Text such that PairedComposition(Text) is equal to the collection of paired k-mers
Paired De Bruijn Graphs
Each paired 3-mer is an edge from its paired prefix to its paired suffix:
TAA|GCC: TA|GC → AA|CC
AAT|CCA: AA|CC → AT|CA
ATG|CAT: AT|CA → TG|AT
TGC|ATG: TG|AT → GC|TG
GCC|TGG: GC|TG → CC|GG
CCA|GGG: CC|GG → CA|GG
CAT|GGA: CA|GG → AT|GA
ATG|GAT: AT|GA → TG|AT
TGG|ATG: TG|AT → GG|TG
GGG|TGT: GG|TG → GG|GT
GGA|GTT: GG|GT → GA|TT
Combine nodes with identical labels
[Figure: the paired De Bruijn graph after gluing; TG|AT is the only label that occurs on more than one edge endpoint of the same kind]
Paired DeBruijn graphs obtained from the paired composition and the genome are identical.
Paired De Bruijn Graphs
[Figure: paired De Bruijn graph with nodes TA|GC, AA|CC, AT|CA, TG|AT, GC|TG, CC|GG, CA|GG, AT|GA, GG|TG, GG|GT, GA|TT]
Unique genome string: TAATGCCATGGGATGTT
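Reconstruction from read-pairs can be sketched in the same way as for single k-mers: build the paired De Bruijn graph, find an Eulerian path, and spell the string from the first and second components of the path. The sketch assumes error-free pairs with exact distance d; `reconstruct_from_pairs` is a hypothetical name. In general an Eulerian path in a paired graph need not spell a consistent string, so the overlap of the two components is checked explicitly.

```python
from collections import defaultdict


def reconstruct_from_pairs(pairs, k, d):
    """Spell a string from (k,d)-mers given as (first, second) tuples."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for first, second in pairs:
        u = (first[:-1], second[:-1])  # paired prefix
        v = (first[1:], second[1:])    # paired suffix
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1

    starts = [node for node, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(graph))

    # Stack-based Eulerian traversal; edges are consumed as they are used.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(pairs) + 1:
        raise ValueError("the pairs do not form a single Eulerian path")

    # Spell the strings formed by the first and by the second components.
    top = path[0][0] + "".join(node[0][-1] for node in path[1:])
    bottom = path[0][1] + "".join(node[1][-1] for node in path[1:])
    # The second string is the first one shifted by k + d positions.
    if top[k + d:] != bottom[:-(k + d)]:
        raise ValueError("this Eulerian path does not spell a consistent string")
    return top + bottom[-(k + d):]


pairs = [
    ("AAT", "CCA"), ("ATG", "CAT"), ("ATG", "GAT"), ("CAT", "GGA"),
    ("CCA", "GGG"), ("GCC", "TGG"), ("GGA", "GTT"), ("GGG", "TGT"),
    ("TAA", "GCC"), ("TGC", "ATG"), ("TGG", "ATG"),
]
print(reconstruct_from_pairs(pairs, 3, 1))  # TAATGCCATGGGATGTT
```

For this example the paired graph has a single Eulerian path: from TG|AT the branch through GG|TG ends in the dead end GA|TT, so the loop through GC|TG has to be taken first.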