![Page 1: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/1.jpg)
Biological Motivation for Fragment Assembly
Rhys Price Jones
Anne R. Haake
![Page 2: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/2.jpg)
What is fragment assembly?
• The reconstruction of a contiguous chromosomal DNA sequence from short, experimentally generated fragments (i.e. sequence reassembly)
• The sequence reassembly process must realign the short fragments, in the correct order, and then generate a consensus sequence.
![Page 3: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/3.jpg)
A Simple Case
• Suppose the target sequence is known to be about 10 bp long
• Sequenced fragments are:
ACCGT
CGTGC
TTAC
TACCGT
![Page 4: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/4.jpg)
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
_________
TTACCGTGC
Overlaps between fragments and the estimated length of the target sequence guide the assembly
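The overlap-guided assembly of the four fragments (ACCGT, CGTGC, TTAC, TACCGT) can be sketched in a few lines of Python. This is only a minimal greedy illustration, not the method of any particular assembler: the function names `overlap` and `greedy_assemble` are hypothetical, reads are assumed error-free and in the same orientation, and the estimated target length is not used.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    # Drop duplicates and fragments wholly contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Join the best pair, writing the shared bases only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0]


print(greedy_assemble(["ACCGT", "CGTGC", "TTAC", "TACCGT"]))  # TTACCGTGC
```

The result is 9 bp long, which agrees with the estimate of about 10 bp. Greedy merging is a heuristic; on real data with errors and repeats it can choose a wrong overlap.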
![Page 5: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/5.jpg)
Why is fragment assembly important?
• We need reliable, complete genomic sequences for humans and for model organisms
• Base-pair sequence is the most basic piece of DNA information (gene structure and function described by sequence)
![Page 6: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/6.jpg)
Why fragment the DNA in the first place?
• Human genome is large: ~3 × 10^9 base pairs long
• Sequencers can read only about 500-600 bp at a time
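A rough back-of-the-envelope calculation shows the scale of the problem. The sketch below assumes every read has the same length and that "coverage" means total sequenced bases divided by genome length; the function name `reads_needed` and the 8-fold coverage figure are illustrative assumptions, not values from the slides.

```python
import math


def reads_needed(genome_bp, read_bp, coverage):
    """Reads required so that total sequenced bases = coverage x genome length."""
    return math.ceil(coverage * genome_bp / read_bp)


# ~3 x 10^9 bp genome, 500 bp reads
print(reads_needed(3_000_000_000, 500, 1))  # 6000000
print(reads_needed(3_000_000_000, 500, 8))  # 48000000
```

Even a single pass over the genome takes millions of reads, which is why the reassembly step has to be automated.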
![Page 7: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/7.jpg)
Solutions?
• Directed Sequencing: use custom primers to sequence step by step along the genomic DNA. This is a slow and expensive process
• Shotgun Sequencing: DNA is extracted, fragmented (e.g. sheared), cloned, sequenced from both ends of clone, reassembled, and finished (gaps are closed)
![Page 8: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/8.jpg)
Solutions?
• Cloning of fragments is accomplished using different vectors, chosen according to the size of the fragments (inserts into the vector).
• Large fragments: YACs (~1 Mb), BACs (100-200 kb)
• Intermediate: Cosmids, Lambda
• Small: Plasmids, M13
![Page 9: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/9.jpg)
Genome Sequencing Strategies
• Human Genome Project: map-based strategy
– initially used “tiling set” of large clones that cover genome
– ends of the tiling set clones sequenced to allow ordering/mapping to the chromosome
– individual clones subjected to shotgun sequencing
– the sequences from the clones (shotgun fragments) then reassembled
• Celera: whole genome sequence strategy
– shotgun sequencing
![Page 10: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/10.jpg)
Celera: Whole Genome Sequencing
• Celera (which won the race for the draft human sequence) took a whole genome sequence strategy
• cloned all of the fragmented human genome into 3 different-sized clone libraries
• sequenced both ends of each clone
• reassembly
• advances in automated sequencing speed and accuracy were key to the success of the Celera approach
![Page 11: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/11.jpg)
Another Reason Fragment Assembly is Important:
• Assembling and/or clustering sets of expressed sequence tags (ESTs)
• The problem is that ESTs are partial sequences and may span more than one exon (intron sequences, present in the genomic sequence, have been spliced out)
• Identifying ESTs and assigning them to genes is aided by finding overlaps with other ESTs.
![Page 12: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/12.jpg)
Experimental issues present some challenges for algorithm development
• DNA sequencing data is imperfect
• Every base in the DNA should be covered several times (at least twice; once in each direction) to minimize effects of random errors
• Base-calling errors can occur (base calling is determining the base identity from the DNA sequencer trace); the quality of traces is not always high. Capillary tube sequencing has reduced the errors caused by lane bleed-through in slab gel sequencing
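Covering each base several times helps because a random error in one read can be outvoted by the other reads at that position. The sketch below is a simple majority vote over reads that are assumed to be already aligned and padded with `-` to equal length; the function name `consensus` is hypothetical, and real assemblers typically also weigh base quality.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over equal-length, gap-padded reads."""
    result = []
    for column in zip(*aligned_reads):
        # Count only real bases; padding does not vote.
        votes = Counter(base for base in column if base != gap)
        result.append(votes.most_common(1)[0][0] if votes else gap)
    return "".join(result)


# Three reads cover the same region; the second has an error at position 3.
print(consensus(["ACGTAC", "ACCTAC", "ACGTAC"]))  # ACGTAC
```

With only two reads covering a position, a disagreement cannot be resolved by voting, which is one reason higher coverage is preferred.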
![Page 13: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/13.jpg)
• Base-calling software (e.g. Phred) attempts to assign a base to each position in the sequence, along with a quality value
• The quality of the sequence tends to degrade at the ends.
• Vector sequence also contaminates the ends.
• NHGRI standard: 99.99% accuracy before submission of sequence to GenBank.
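Phred-style quality values relate to error probability by Q = -10 log10(P), so 99.99% accuracy (an error probability of 0.0001) corresponds to Q40. The sketch below converts between the two and trims low-quality bases from both ends of a read. The helper names and the cutoff of Q20 are assumptions chosen for illustration, not part of Phred itself.

```python
import math


def error_to_phred(p_error):
    """Phred quality for a given per-base error probability."""
    return -10 * math.log10(p_error)


def phred_to_error(q):
    """Per-base error probability for a given Phred quality."""
    return 10 ** (-q / 10)


def trim_low_quality_ends(seq, quals, min_q=20):
    """Strip bases below min_q from both ends (min_q=20 is an assumed cutoff)."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]


print(round(error_to_phred(0.0001)))  # 40
print(trim_low_quality_ends("NACGTAN", [5, 12, 30, 35, 40, 31, 8]))  # CGTA
```

Low-quality bases in the interior of the read are kept here; only the ends, where quality tends to degrade, are removed.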
![Page 14: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/14.jpg)
SeqMan
Contig assembler and trace viewer. Can align against a reference sequence
http://www.dnastar.com/images2/r13a_lg.gif
![Page 15: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/15.jpg)
A big issue:
• Human genome contains repetitive sequences
– Highly repetitive: not transcribed, role unknown, present in millions of copies, e.g. satellite DNA (5-50 bp)
– Moderately repetitive: some are transcribed, present in up to hundreds of thousands of copies
• Tandem repeats, e.g. minisatellites (12-100 bp), microsatellites (2-6 bp), telomeres
• Interspersed repeats: larger repeats with high copy number, e.g. SINEs (Alu), LINEs, tRNAs, rRNAs
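Repeats matter for assembly because a read that falls entirely inside a repeat matches the target in more than one place, so its overlaps are ambiguous. A toy illustration follows; the sequence and the helper name `occurrences` are invented for this example.

```python
def occurrences(genome, read):
    """Start positions of every (possibly overlapping) exact match of read."""
    hits = []
    pos = genome.find(read)
    while pos != -1:
        hits.append(pos)
        pos = genome.find(read, pos + 1)
    return hits


# Toy sequence with two copies of the tandem repeat CACACA.
genome = "GGTCACACATTGCACACAAAG"
print(occurrences(genome, "CACA"))  # [3, 5, 12, 14]
print(occurrences(genome, "TTGC"))  # [9]
```

The read taken from unique sequence has one possible placement, while the read from inside the repeat has four.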
![Page 16: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/16.jpg)
Another issue:
• Orientation of the fragments is unknown
• Is the input fragment or its reverse complement a substring of the consensus?
CACGT CACGT--------
ACGT -ACGT--------
ACTACG --CGTAGT-----
GTACT -----AGTAC---
ACTGA --------ACTGA
CTGA ---------CTGA
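The orientation question can be made concrete with a reverse-complement helper. In the sketch below the consensus CACGTAGTACTGA is read off the layout on this slide; in a real assembly the consensus is not known in advance, so assemblers generally have to consider both orientations of every fragment when computing overlaps. The names `reverse_complement` and `orient` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def orient(fragment, consensus):
    """Report which orientation of fragment occurs in consensus, if any."""
    if fragment in consensus:
        return "forward"
    if reverse_complement(fragment) in consensus:
        return "reverse"
    return None


consensus = "CACGTAGTACTGA"
print(reverse_complement("ACTACG"))  # CGTAGT
print(orient("CACGT", consensus))    # forward
print(orient("ACTACG", consensus))   # reverse
```

Note that `orient` checks the forward strand first. A short fragment such as GTACT occurs in this consensus in both orientations, so a match on its own does not always settle the orientation.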
![Page 17: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/17.jpg)
Yet another issue:
• Chimeras (mixed or heterogeneous DNA) may be introduced during the cloning process
• DNA from non-contiguous regions of the chromosome may be introduced as well as host DNA (for example, when growing plasmids in E. coli, the E. coli chromosomal DNA often contaminates clones)
![Page 18: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/18.jpg)
General Considerations:
• The algorithms used to generate the consensus sequence must take the biological issues into account (although some don’t!).
• Need to consider prior biological information when analyzing a program’s assembly output.
– e.g. known chromosomal sites or DNA fingerprinting data may be inconsistent with the program’s assembly output.
![Page 19: Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake](https://reader036.vdocuments.mx/reader036/viewer/2022083008/56649ed15503460f94be07fb/html5/thumbnails/19.jpg)
Fragment Assembly Programs
• GeneSkipper
• Phred/Phrap/Consed