![Page 1: Genome&Sequencing:&Introduc2on& to&FragmentAssembly&cs680/Slides/lecture5.pdfTradi2onal&(“Sanger”)&Sequencing& • Sequence&shotgun&fragments&of&length&600&bp& using&Sanger&sequencing.&](https://reader034.vdocuments.mx/reader034/viewer/2022052010/60210013b78c356e514a4469/html5/thumbnails/1.jpg)
Genome Sequencing: Introduction to Fragment Assembly
Lecture 5: September 4, 2012
Review from Last Lecture
Sample Preparation
Fragments
Pipeline: Sample Preparation → Sequencing
Products: Fragments → Reads
Reads:
ACGTAGAATCGACCATG
GGGACGTAGAATACGAC
ACGTAGAATACGTAGAA
Next Generation Sequencing (NGS)
Pipeline: Sample Preparation → Sequencing → Assembly
Products: Fragments → Reads → Contigs
Reads:
ACGTAGAATCGACCATG
GGGACGTAGAATACGAC
ACGTAGAATACGTAGAA
Contig:
ACGTAGAATACGTAGAAACAGATTAGAGAG…
Pipeline: Sample Preparation → Sequencing → Assembly → Analysis
Products: Fragments → Reads → Contigs
“…the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.”
- New York Times, November 30, 2011
Algorithms for Fragment Assembly
Whole Genome Shotgun Sequencing
Genome amplified and sliced into smaller fragments (>=600bp)
Genome
Build consensus sequence from overlap
Traditional (“Sanger”) Sequencing
• Sequence shotgun fragments of length 600 bp using Sanger sequencing.
• Fragment assembly is accomplished using the “overlap-layout-consensus” approach:
  – overlap: compare all pairs of reads and find the ones that overlap.
  – layout: find the order of the reads along the DNA and put them together.
  – consensus: derive how the sequence will appear based on the layout.
Overlap-Layout-Consensus Approach
• Build an overlap graph where each node represents a read. An edge exists between two reads if they overlap.
• Traverse the graph to find unambiguous paths, which form the contigs.
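The overlap step can be sketched in a few lines of Python. This is only a toy illustration, not how production assemblers compute overlaps: the three reads, the function names, and the minimum overlap of 3 are assumptions made for the example, and only exact suffix-prefix matches are considered (no sequencing errors, no reverse complements).

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Map each read to {other_read: overlap_length} over all ordered pairs of reads."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        n = overlap_length(a, b, min_len)
        if n > 0:
            graph[a][b] = n
    return graph


# Hypothetical toy reads, chosen so that they chain together.
reads = ["ACGTAG", "TAGAAT", "AATCGA"]
graph = build_overlap_graph(reads)
for read, edges in graph.items():
    print(read, "->", edges)
```

Every ordered pair of reads is compared, so the work grows quadratically with the number of reads.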
Problems!
• The main problem with this approach is that it is very slow, so it only works on small genomes or low coverage.
• It is not commonly used for complete assembly; however, some software tools still use this approach:
  – Celera: genome assembler for 454, PacBio, and Illumina data
  – LOCAS: resequencing genomes
  – HapAssembler: for sequencing highly polymorphic genomes
Problems!
Unfortunately, the overlap-layout-consensus approach will not work for NGS data or significantly large genomes:
  – There is too much data. Calculating the overlap for each pair of reads would take far too much time.
  – There has to be a new method for fragment assembly.
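To get a feel for "too much data", the number of unordered read pairs for n reads is n(n-1)/2. The short sketch below evaluates that formula; the function name is made up for the illustration, and the largest read count is the 1.5 billion reads quoted elsewhere in this lecture as an example data size.

```python
def read_pairs(n_reads: int) -> int:
    """Number of unordered pairs among n_reads reads: n * (n - 1) / 2."""
    return n_reads * (n_reads - 1) // 2


for n in (1_000, 1_000_000, 1_500_000_000):
    print(f"{n:>13,} reads -> {read_pairs(n):,} pairs")
```

At 1.5 billion reads this is on the order of 10^18 pairs, before any cost per comparison is counted.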
De Bruijn Graph Approach to Assembly
De Bruijn Graph for Assembly
• Introduced in 1989.
• Adapted for next generation sequencing data.
Pevzner. J Biomol Struct Dyn (1989) 7:63-73.
Idury & Waterman. J Comput Biol (1995) 2:291-306.
Euler-SR: Chaisson & Pevzner. Genome Res (2008) 18:324-330.
Velvet: Zerbino & Birney. Genome Res (2008) 18:821-829.
ALLPATHS: Butler et al. Genome Res (2008) 18(5):810-820.
ABySS: Simpson et al. Genome Res (2009) 19:1117-1123.
De Bruijn Graph Construction
I. Choose a value of k.
II. For each k-mer that exists in any sequence, create an edge with one vertex labeled as its (k−1)-mer prefix and one vertex labeled as its (k−1)-mer suffix.
III. Glue all vertices that have the same label.
(Pevzner, Tang & Tesler, 2004)
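The three construction steps can be sketched as follows. This is a minimal illustration rather than any particular assembler's implementation: the function name is an assumption, and "gluing" happens implicitly because vertices are dictionary keys, so equal labels are the same vertex. The input is the example sequence used on the construction slides, with k = 5.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Return {prefix (k-1)-mer: set of suffix (k-1)-mers}, one edge per distinct k-mer."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Edge from the (k-1)-mer prefix to the (k-1)-mer suffix of this k-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


graph = de_bruijn_graph(["GTCTATTCGCTAATTCACTA"], k=5)
print(sorted(graph["ATTC"]))  # successors of the glued vertex ATTC
```

The vertex ATTC ends up with two outgoing edges, one for the k-mer ATTCG and one for ATTCA.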
De Bruijn Graph Construction
GTCTATTCGCTAATTCACTA
k-mer ATTCG: edge ATTC → TTCG
k-mer ATTCA: edge ATTC → TTCA
(Pevzner, Tang & Tesler, 2004)
De Bruijn Graph Construction
GTCTATTCGCTAATTCACTA
After gluing, the single vertex ATTC has two outgoing edges: ATTCG (to TTCG) and ATTCA (to TTCA).
(Pevzner, Tang & Tesler, 2004)
Challenges in Fragment Assembly
• Repeats in the genome.
• Sequencing errors, which vary by platform.
• Size of the data, e.g. 1.5 billion reads.
ACCAGTTGACTGGGATCCTTTTTAAAGACTGGGATTTTAACGCG
CAGTTGACTG
TGGGATCC TGGGATTT
TGGGAATT TGGGACTT TGGGA--T
TGGGAACTTATT
Error types: Substitution, Deletion, Insertion
De Bruijn Graph of a Genome
Example Genome: ABCDEFGHICDEFGKL
(k−1)-mers: ABC BCD CDE DEF EFG FGH GHI HIC ICD FGK GKL
k-mers: ABCD BCDE CDEF DEFG EFGH FGHI GHIC HICD ICDE EFGK FGKL
De Bruijn Graph of a Genome
Example Genome: ABCDEFGHICDEFGKL
Main path: ABC → BCD → CDE → DEF → EFG → FGK → GKL
Loop: EFG → FGH → GHI → HIC → ICD → CDE
De Bruijn Graph of a Genome
Example Genome: ABCDEFGHICDEFGKL
Bulges (undirected cycles) and whirls (directed cycles) occur because of sequencing errors or repeats in the genome.
[Figure: de Bruijn graph of the example genome, with vertices ABC BCD CDE DEF EFG FGK GKL on the main path and FGH GHI HIC ICD on the loop]
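A whirl in the example genome can be found with an ordinary depth-first search for a directed cycle. The sketch below is a toy under stated assumptions: the function names are invented for the illustration, k = 4 is inferred from the three-letter vertices on the slide, and the recursive search is only suitable for tiny graphs like this one.

```python
from collections import defaultdict


def de_bruijn_edges(genome, k):
    """Adjacency sets between (k-1)-mers, one edge per distinct k-mer of the genome."""
    graph = defaultdict(set)
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def find_directed_cycle(graph):
    """Return one directed cycle (a whirl) as a list of vertices, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every vertex starts WHITE (unvisited)
    path = []                  # vertices on the current DFS path

    def visit(v):
        colour[v] = GREY
        path.append(v)
        for w in sorted(graph.get(v, ())):
            if colour[w] == GREY:          # back edge: w is on the current path
                return path[path.index(w):]
            if colour[w] == WHITE:
                cycle = visit(w)
                if cycle:
                    return cycle
        colour[v] = BLACK
        path.pop()
        return None

    for v in sorted(graph):
        if colour[v] == WHITE:
            cycle = visit(v)
            if cycle:
                return cycle
    return None


graph = de_bruijn_edges("ABCDEFGHICDEFGKL", k=4)
whirl = find_directed_cycle(graph)
print(" -> ".join(whirl))
```

The cycle it reports passes through CDE, DEF and EFG, which are exactly the vertices covered by the repeat CDEFG in the genome.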
De Bruijn Graph of a Genome
Example Genome: ABCDEFGHICDEFGKL
[Figure: the same graph (vertices ABC BCD CDE DEF EFG FGK GKL and FGH GHI HIC ICD), with parts of the graph labeled 1, 2 and 3]
Typical De Bruijn Graph
… However, this is over a billion vertices (for a very small bacterial genome).
De Bruijn Graph of a Genome
Example Genome: ABCDEFGHICDEFGKL
ABC → BCD → CDE → DEF → EFG → FGK → GKL
De Bruijn Graph of a Genome
Example Genome: ABCDEFGHICDEFGKL
ABC → BCD → CDE → DEF → EFG → FGK → GKL
Resulting Erroneous Genome: ABCDEFGKL
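Why the shortened graph yields the wrong genome can be checked by spelling out the walks. In this small sketch (the helper name is an assumption), a walk through (k−1)-mer vertices is turned back into a sequence by taking the first vertex and then the last letter of each following vertex.

```python
def spell_path(vertices):
    """Spell the sequence of a walk whose consecutive vertices overlap by all but one letter."""
    return vertices[0] + "".join(v[-1] for v in vertices[1:])


# Walk that goes around the loop once before leaving through FGK.
full_walk = ["ABC", "BCD", "CDE", "DEF", "EFG", "FGH", "GHI", "HIC", "ICD",
             "CDE", "DEF", "EFG", "FGK", "GKL"]
# Walk along the main path only, i.e. with the loop removed.
short_walk = ["ABC", "BCD", "CDE", "DEF", "EFG", "FGK", "GKL"]

print(spell_path(full_walk))   # the example genome
print(spell_path(short_walk))  # the erroneous genome
```

Skipping the loop drops the segment between the two copies of the repeat, along with one copy of the repeat itself.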
Paired-end Reads
• Random fragment with an approximately known size (the insert size).
• Both ends are sequenced.
• The insert size is specified prior to data acquisition.
[Figure: reads ACTATAAT and ACCGCGAT at the two ends of a fragment, with the insert size marked]
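A rough sketch of how a paired-end read relates to its fragment is given below. Several things here are assumptions for illustration only: the fragment is hypothetical, constructed so that the sketch reproduces the two reads shown on the slide; the second read is taken as the reverse complement of the fragment's far end, which is a common convention but is not stated on the slide; and the function names are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base and reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_len: int):
    """Return (read 1, read 2): one read from each end of the fragment.

    Assumption: read 2 comes from the opposite strand, so it is the reverse
    complement of the last `read_len` bases of the fragment.
    """
    if read_len > len(fragment):
        raise ValueError("read length exceeds fragment length")
    return fragment[:read_len], reverse_complement(fragment[-read_len:])


fragment = "ACTATAATGGGGATCGCGGT"  # hypothetical 20 bp fragment
read1, read2 = paired_end_reads(fragment, read_len=8)
print(read1, read2, "insert size:", len(fragment))
```

Because the fragment size is only approximately known in practice, assemblers treat the insert size as an estimate rather than an exact distance.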
Standard (Multi-cell) Data
[Figure: coverage plot] (Chitsaz et al., 2011)
Single-cell Data
[Figure: coverage plots] (Chitsaz et al., 2011)
Detangling the de Bruijn Graph
Even using mate-pair information, the de Bruijn graph is highly tangled. There are the following options for detangling the de Bruijn graph:
1. Error correction of reads.
2. Bulge and whirl removal.
PROBLEM! Both inevitably end up causing errors rather than correcting them.
Assembly Demonstration
Notes About Demo
• We used Velvet because it’s the simplest to use.
• Write a shell script to run the assembler, to keep track of the parameters you used and to avoid writing out the command each time.
• Almost all assemblers require you to specify the following:
  – the value of k, whether the data is paired-end, the insert length, and the minimum contig length
  – and sometimes, whether it is single-cell data.
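One way to keep track of the parameters is a small wrapper that builds the commands from named settings. The sketch below is written in Python rather than shell and is not the script used in the demo. It only prints the commands; the output directory, file name and parameter values are placeholders, and the Velvet options shown (`-fastq`, `-short`, `-shortPaired`, `-ins_length`, `-min_contig_lgth`) should be checked against the manual for the installed version, including how it expects paired reads to be laid out in the input file.

```python
import shlex


def velvet_commands(out_dir, k, reads, insert_length=None, min_contig_length=None):
    """Build the velveth and velvetg command lines for one assembly run.

    Passing an insert length is taken to mean that the data is paired-end.
    """
    read_type = "-shortPaired" if insert_length is not None else "-short"
    velveth = ["velveth", out_dir, str(k), "-fastq", read_type, reads]
    velvetg = ["velvetg", out_dir]
    if insert_length is not None:
        velvetg += ["-ins_length", str(insert_length)]
    if min_contig_length is not None:
        velvetg += ["-min_contig_lgth", str(min_contig_length)]
    return [velveth, velvetg]


# Placeholder values; record the ones actually used for each run.
commands = velvet_commands("asm_k31", 31, "reads.fastq",
                           insert_length=300, min_contig_length=500)
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run the assembler, each command list could be passed to `subprocess.run(cmd, check=True)` on a machine where Velvet is installed.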
Notes About Demo
• How you specify the mate-pair information varies from assembler to assembler. You have to read the manual and write a (Perl) script to put the data in the correct format!
Notes About Demo
• Assemblers can be challenging programs to run. All of them have intricacies, even in the installation of the program.
• Therefore, running an assembler requires:
  1. Some knowledge of Unix/Linux commands.
  2. Access to a server with large amounts of memory (64 GB for small bacterial genomes, 512 GB for larger genomes).
Notes About Demo
• Be aware that your assembler may not always produce decent results. Can you tell whether it did? Yes.
Assembly Evaluation
What has been sequenced?
• We’ve sequenced a number of genomes, but several genomes remain difficult.
• Plant genomes are very hard because they are extremely long, contain huge repeat regions, and are polyploid.
• Note: we do not distinguish between genotypes… that is a separate problem.
| Organism | Type | Genome Size | No. of predicted genes |
|---|---|---|---|
| Homo sapiens | Human | 3.2 Gb | 20,251 |
| Takifugu rubripes | Puffer fish | 390 Mb | 22,000-29,000 |
| Oryza sativa | Rice | 420 Mb | 32,000-50,000 |
| Anopheles gambiae | Mosquito | 278 Mb | 13,700 |
| Saccharomyces cerevisiae | Baker’s yeast | 12.1 Mb | 6,200 |
| Cucumis sativus | Cucumber | 367 Mb | 27,000 |

• kb (= kbp) = kilo base pairs = 1,000 bp
• Mb = mega base pairs = 1,000,000 bp
• Gb = giga base pairs = 1,000,000,000 bp
Assembly Evaluation
• How can we tell the difference between a good assembly and a bad assembly?
  – Answer: the N50 statistic, which is a metric of the length of a set of sequences, with greater weight given to longer sequences.
  – Given a set of sequences of varying lengths, the N50 length is defined as the length N for which half of all bases in the sequences are in a sequence of length L < N (so the other half are in sequences of length L ≥ N).
  – There are some contradictions among the definitions of the N50 value.
Calculating N50
Alternative definition: the largest entity E such that at least half of the total size of the entities is contained in entities larger than E.
1. Read the FASTA file and calculate the sequence lengths.
2. Sort the lengths in descending order.
3. Calculate the total size.
4. Calculate N50.
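The four steps can be sketched as below. Since the definitions of N50 differ in their details, this sketch adopts one common convention as an assumption: walk down the lengths in descending order and report the length at which the running total first reaches half of the total size. The function names and the tiny FASTA text are made up for the illustration.

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence in FASTA-formatted lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):        # header line starts a new sequence
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)         # sequences may span several lines
    if length is not None:
        yield length


def n50(lengths):
    """Length at which the descending running total first reaches half the total size."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequences given")


fasta = """>contig1
ACGTACGTAC
>contig2
ACG
TAC
>contig3
ACGT
>contig4
ACG
>contig5
AC
"""
lengths = list(read_fasta_lengths(fasta.splitlines()))
print(lengths, "N50 =", n50(lengths))
```

For a real assembly, the lines would come from the contig file, for example by iterating over an open file handle instead of `fasta.splitlines()`.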
Other Evaluations
• Number of insertion, deletion, and substitution errors in an assembly
• Misassembly of contigs (chimeric indels)
>=500 bp
Next Lecture
Detangling the de Bruijn Graph
Even using mate-pair information, the de Bruijn graph is highly tangled. There are the following options for detangling the de Bruijn graph:
1. Error correction of reads
2. Bulge and whirl removal