genome sequence assembly concepts and methods shih-jon wang may 13, 2008

Genome sequence assembly

concepts and methods

Shih-Jon Wang

May 13, 2008

OUTLINE

• Assembly Process Overview
• Assembly algorithms
• Repeats
• Scaffolding
• Phred/Phrap/Consed
• Assembly pipelines

Assembly process overview

A Genome Sequencing Project

Building a Library

• Break DNA into random fragments (8-10x)

SHOTGUNs

• Whole Genome Shotgun

• BAC-by-BAC Shotgun

• Size of inserts:
-- BAC insert: ~150 kb
-- Fosmid insert: ~30 kb
-- Normal insert: ~3 kb

Clone and scaffold

(a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly.

Computer 35 (7):47-54

Building a Library

• Break DNA into random fragments (~10x)
-- Amplify the fragments in a vector
-- Sequence 800-1000 bases at each end

Assembling the fragments

• Break DNA into random fragments
• Sequence the ends of the fragments
• Assemble the sequenced ends

Forward-reverse constraints

• The sequenced ends are facing towards each other
• The distance between the two sequenced ends is known (approximately the insert size)
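The two constraints can be checked mechanically once both mates have been placed on the same contig. The following is a minimal Python sketch, not taken from the slides: the `Placement` record, the `insert_size` and the `tolerance` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    start: int   # leftmost coordinate of the read on the contig
    end: int     # rightmost coordinate of the read (exclusive)
    strand: str  # '+' for forward strand, '-' for reverse strand


def mates_consistent(a: Placement, b: Placement,
                     insert_size: int, tolerance: int) -> bool:
    """Return True if two mated reads satisfy both forward-reverse constraints."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Constraint 1: the reads face each other (left on '+', right on '-').
    facing = left.strand == '+' and right.strand == '-'
    # Constraint 2: the outer span is close to the expected insert size.
    span = right.end - left.start
    return facing and abs(span - insert_size) <= tolerance


# Two 800-base reads from a ~3 kb insert, placed 3000 bases apart.
print(mates_consistent(Placement(100, 900, '+'),
                       Placement(2300, 3100, '-'),
                       insert_size=3000, tolerance=500))
```

A pair that fails either test is a hint of a misassembly or a chimeric clone, which is why assemblers use these constraints as a consistency check.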

Building Scaffolds

Assembly Gaps

-- Sequencing gap: the order and orientation of the contigs are known, and at least one clone spans the gap
-- Physical gap: no information about the adjacent contigs, nor about the DNA spanning the gap

Finishing the Project

Unifying View of Assembly

Assembly Algorithms

Assembly Methods

• Overlap-layout-consensus

– greedy (Phrap, CAP3, TIGR...)

– graph-based (Euler)

Phrap/CAP3

Greedy

• Build a rough map of fragment overlaps
• Pick the highest-scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done

!!! IDEAL CASE !!!
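The greedy loop can be illustrated on error-free toy reads. This is only a sketch of the idea under strong assumptions (exact suffix-prefix overlaps, a single strand, no quality values); it is not how Phrap or CAP3 are implemented, and the function names and the `min_len` parameter are made up for the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return frags  # no more merges can be done
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]


print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
# ['ATGGCGTACGTTA']
```

With sequencing errors, repeats or reads from the opposite strand, exact matching like this breaks down, which is the point of the "ideal case" warning.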

Real World Problems

• Sequencing errors

• Chimera

• Repeats

• Contaminants

• Polymorphism

• Orientation

Error Correction

Overlap between two sequences

All pairs alignment

• Try all pairs – must consider ~ n^2 pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
– Build a table of k-mers contained in sequences (single pass through the genome)
– Generate the pairs from k-mer table (single pass through k-mer table)
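The two passes can be sketched in a few lines of Python. The read names, sequences and the choice of k = 4 below are illustrative assumptions; a real assembler would use longer k-mers and also index the reverse complement.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads: dict[str, str], k: int) -> set[tuple[str, str]]:
    """Return pairs of read names that share at least one k-mer."""
    # Pass 1: table mapping each k-mer to the reads that contain it.
    table = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(name)
    # Pass 2: every pair of reads listed under the same k-mer is a candidate.
    pairs = set()
    for names in table.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


reads = {"r1": "ACGTACGGA", "r2": "ACGGATTC", "r3": "TTTTTTTT"}
print(candidate_pairs(reads, k=4))
# {('r1', 'r2')}
```

Only the candidate pairs are passed on to the expensive alignment step, so reads that share nothing (here `r3`) are never compared.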

Repeats

Repeat sequence

The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly.

Computer 35 (7):47-54

Repeat sequences

■ Classified by repeat frequency

Interspersed repeats
• Short interspersed element (SINE), e.g. Alu, <300 bp
• Long interspersed element (LINE), ca. 5 kb

Tandem repeats
• Satellite DNA
• Minisatellites and variable number of tandem repeats (VNTR)
• Microsatellites: mono-, di-, tri-, tetra-nucleotide

■ Classified by repeat orientation
• Direct repeats
• Inverted repeats

Repeat detection

Pre-assembly: find fragments that belong to repeats

• statistically (Reps)

• repeat database (RepeatMasker)

Statistical repeat detection

• Significant deviations from average coverage flagged as repeats.
- frequent k-mers are ignored
- "arrival" rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)
• Problem 1: assumption of uniform distribution of fragments - leads to false positives
- non-random libraries
- poor clonability regions
• Problem 2: repeats with low copy number are missed
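The arrival-rate example works out as read length divided by coverage: 800 / 8 = 100 bp between read starts. A rough Python sketch of the comparison follows; the function names and the `factor` threshold are hypothetical, and real tools use a proper statistical test rather than a fixed ratio.

```python
def expected_spacing(read_len: int, coverage: float) -> float:
    """Average distance between read starts under uniform sampling."""
    return read_len / coverage


def looks_repetitive(contig_len: int, n_reads: int,
                     read_len: int, coverage: float,
                     factor: float = 2.0) -> bool:
    """Flag a contig holding far more reads than uniform coverage predicts.

    'factor' is an arbitrary illustrative threshold, not a published value.
    """
    expected_reads = contig_len / expected_spacing(read_len, coverage)
    return n_reads > factor * expected_reads


print(expected_spacing(800, 8))               # 100.0
print(looks_repetitive(5000, 150, 800, 8))    # True: 150 reads vs. ~50 expected
print(looks_repetitive(5000, 55, 800, 8))     # False
```

A two-copy repeat collapsed into one contig would show roughly double the expected read count, which sits right at this threshold; that is Problem 2 in practice.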

Scaffolding

Sequencing hierarchy

• Random sequencing – unrelated reads, ~700 base pairs
• Assembly – unrelated contigs, 5K-10K base pairs
• Scaffolding – unrelated scaffolds, 30K-50K base pairs
• Finishing/gap closure – completed genomes, millions-billions of base pairs

Definition

Scaffolder output

• order and orientation of contigs
• size of gaps between contigs
• linking evidence: mate-pairs spanning gaps
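Gap sizes follow from the mate-pairs that span a gap: the known insert size minus the part of the insert lying inside each contig. A small Python sketch, assuming both contigs are already oriented left to right and that every link comes from a library with the same insert size; all names and numbers are illustrative.

```python
from statistics import mean


def gap_estimate(insert_size: int, len_a: int,
                 fwd_start_a: int, rev_end_b: int) -> int:
    """Gap implied by one mate-pair linking contig A (left) to contig B (right).

    fwd_start_a: start of the forward read, measured from the start of A.
    rev_end_b:   end of the reverse read, measured from the start of B.
    """
    inside_a = len_a - fwd_start_a  # insert bases lying in contig A
    inside_b = rev_end_b            # insert bases lying in contig B
    return insert_size - inside_a - inside_b


def scaffold_gap(links, insert_size: int, len_a: int) -> float:
    """Average the estimates from all mate-pairs spanning the same gap."""
    return mean([gap_estimate(insert_size, len_a, s, e) for s, e in links])


links = [(9000, 1200), (9500, 1700)]  # (fwd_start_a, rev_end_b) per mate-pair
print(scaffold_gap(links, insert_size=3000, len_a=10000))
```

A negative estimate would suggest that the two contigs actually overlap.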

Clone-mates

Linking information

Hierarchical scaffolding

Ambiguous scaffold

Phred/Phrap/Consed Analysis

What is Phred/Phrap/Consed ?

Phred/Phrap/Consed is a worldwide distributed package for:
a. Trace file (chromatograms) reading;
b. Quality (confidence) assignment to each individual base;
c. Vector & repeat sequences identification and masking;
d. Sequence assembly;
e. Assembly visualization and editing;
f. Automatic finishing.

How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?

Phred Genome Research 8: 175-194

Phred is a program that performs several tasks:

a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI, ESD (MegaBACE) and LI-COR.

b. Calls bases – attributes a base for each identified peak with a lower error rate than the standard base calling programs.

c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base.

d. Creates output files – base calls and quality values are written to output files.

File Directories

• chromat_dir/

• edit_dir/

• phd_dir/

Trace File: High quality region – no ambiguities (Ns)

Trace File: Medium quality region – some ambiguities (Ns)

Trace File: Poor quality region – low confidence

Phred value formula

$q = -10 \log_{10}(p)$

where
q - quality value
p - estimated error probability for a base call

Examples:

$q = 20$ means $p = 10^{-2}$ (1 error in 100 bases)
$q = 40$ means $p = 10^{-4}$ (1 error in 10,000 bases)
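The formula and its inverse are easy to check numerically. A short Python sketch (the function names are chosen here for illustration and are not part of Phred):

```python
import math


def phred_quality(p: float) -> float:
    """Quality value q for an estimated error probability p."""
    return -10 * math.log10(p)


def error_probability(q: float) -> float:
    """Inverse of the formula: error probability for a quality value q."""
    return 10 ** (-q / 10)


for q in (20, 40):
    print(q, error_probability(q))
```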

Base Calling

• phred -id . -p -pd ../phd_dir

• phred -view pf84c05.s1

The structure of a phd file

BEGIN_SEQUENCE 01EBV10201A02.g

BEGIN_COMMENT

CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big

END_COMMENT

BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
...
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
...
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
c 9 8371
g 13 8386
g 14 8397
a 7 8417
g 9 8427
g 4 8445
...
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA

END_SEQUENCE
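Each line between BEGIN_DNA and END_DNA holds a base call, its quality value and its peak position in the trace. A minimal Python reader for that layout might look as follows; it is a sketch that handles only the fields shown on the slide, and `parse_phd` is a name made up for the example.

```python
def parse_phd(text: str):
    """Return (sequence name, [(base, quality, trace position), ...])."""
    name = None
    calls = []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("BEGIN_SEQUENCE"):
            name = line.split(maxsplit=1)[1]
        elif line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            base, qual, pos = line.split()
            calls.append((base, int(qual), int(pos)))
    return name, calls


sample = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""
name, calls = parse_phd(sample)
print(name, "".join(base for base, _, _ in calls))
print([qual for _, qual, _ in calls])
```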

phd2fasta

• phd2fasta program
– converts .phd files to sequence in multifasta format
– writes .qual file (quality scores) for each trace file
– phd2fasta -id ../phd_dir -os CLONE.fasta -oq CLONE.fasta.qual

• Output:
– fasta.seq contains fasta sequences
– fasta.seq.qual contains quality scores

Vector Sequence Cleaning (1)

• DNA sequence cleaning: quality trimming and vector removal --- Lucy

• Lucy Steps:
– Read input seq#, seq info, and quality info
– Chop off splice sites
– Remove vector insert
– Produce output seq for fragment assembly.

Vector Sequence Cleaning (2)

• Restriction on file name: can’t contain any symbol, e.g. “–” “.” “_”
• Lucy major parameters to set up:
-vector vector_completeSeq splice_site_file
(splice_site_file: 2 splice-site seq before and after the insertion point on the vector)

• Lucy Output:
– identified locations of good/clean region
– trim seq without vector, linker, Ns (<3% Ns)

splice_site_file

• ~ 150 bases, 50 bases overlap around splice

>PUCsplice.for.begin
gattaagttgggtaacgccagggttttcccagtcacgacgttgtaaaacgacggccagtgccaagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga
>PUCsplice.for.end
acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtgaaattgttatccgctcacaattccacacaacatacgagccggaagcataaa
>PUCsplice.rev.begin
tttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcacacaggaaacagctatgaccatgattacgaattcgagctcggtacccggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt
>PUCsplice.rev.end
tcacacaggaaacagctatgaccatgattacgaattcgagctcggtacccggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgtcgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatc

Cross_match

• cross_match -minmatch 12 -penalty -2 -minscore 20 -screen CLONE.fasta /net/share/sequence_pipeline/vector.fasta

Phrap -- Phragment Assembly Program (or Phil’s Revised Assembly Program)

• Phrap is a program for assembling shotgun DNA sequence data
• Key Features:
a. Uses the entire read content – no need for trimming.
b. User supplied (e.g. Repbase) + internally computed data – better accuracy of assembly in the presence of repeats.
c. Contig sequence is constituted by a mosaic of the highest quality parts of the reads – it’s not a consensus!
d. Provides extensive information about assembly – contained in phrap.out, *.ace and *.screen.contigs.qual files
e. Handles very large datasets – hundreds of thousands of reads are easily manipulated.
f. Generate output files – contain some important data and enable visualization by other programs

Banded Search

K-mers

• >GL1234.b1

gattaagttgggtaacgccagggttttcccagtcac…

gattaagttgggta

attaagttgggtaa

ttaagttgggtaac

taagttgggtaacg
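The four words listed above are the first 14-mers of the read, obtained by sliding a window one base at a time. A short Python sketch reproducing them, using only the visible prefix of GL1234.b1 (the rest of the read is elided on the slide):

```python
def kmers(seq: str, k: int) -> list[str]:
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


read = "gattaagttgggtaacgccagggttttcccagtcac"  # visible prefix of GL1234.b1
for word in kmers(read, 14)[:4]:
    print(word)
# gattaagttgggta
# attaagttgggtaa
# ttaagttgggtaac
# taagttgggtaacg
```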

Phrap output files

• *.contigs – fasta file containing the contigs
– Contigs with more than one read
– Singletons (single reads with a match to some other contig but that couldn’t be merged consistently to it)

• *.singlets – fasta file of the singlet reads
– Reads with no match to other read

• *.ace – allows for viewing the assembly using Consed

• *.view – required for viewing the assembly using Phrapview

Phrap parameters

• phrap -new_ace CLONE.fasta.screen >outfile

OPTIONS      DEFAULT     FUNCTION
-penalty     -2          ↑ => ↑ Stringency
-gap_init    penalty-2
-gap_ext     penalty-1
-minmatch*   14          ↑ => ↓ time, ↓ Matches
-bandwidth   14          ↓ => ↓ time, ↓ Stringency
-minscore    30          ↑ => ↑ Stringency

* highly sensitive! bigger genomes, bigger value

Phrap parameters

OPTIONS              DEFAULT         FUNCTION
-forcelevel          0~10            ↓ => ↑ Stringency
-repeat_stringency   0.95 (0<x<1)    ↑ => ↑ Stringency
-force_high*                         ↑ => ↑ Stringency
-revise_greedy**                     ↓ Misassembly
-shatter_greedy**                    ↓ ContigLength

* Ignore edited high-quality discrepancies
** break assembly at weak joins

Phrap parameters

OPTIONS              DEFAULT   FUNCTION
-max_subclone_size   5000      Forward-reverse check
-default_qual        15
-preassemble*
-group_delim*        _

* used together

Consed Genome Research 8: 195-202, 1998

Consed

A program for viewing and editing assemblies produced by Phrap

Key Features:

a. Assembly viewer - allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence.

b. Trace file viewer – single and multiple trace files can be visualized allowing for comparison of a given sequence in several reads.

c. Navigation – identify and list regions which are below a given quality threshold, contain high quality discrepancies, single-strand coverage, etc.

d. Autofinish – automatic set of functions for: gap closure, improvement of sequence quality, determination of relative orientation of contigs, identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.

Phred/Phrap/Consed Pipeline

Directories: Chromat_dir, Phd_dir, Edit_dir

• Input
– chromatogram files

• Quality (confidence) values assignment: Phred
– phd files - *.phd

• Conversion - phd to fasta: phd2fasta.pl
– nucleotide sequences - seqs_fasta
– quality values - seqs_fasta.screen.qual

• Vector screening and masking: Cross_Match (local alignment program) x vector.seq
– screened/masked file - seqs_fasta.screen

• Assembly: Phrap
– assembled contigs - seqs_fasta.screen.contigs
– assembly file - seqs_fasta.screen.ace#

• Assembly viewing/editing: Consed
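The order of the steps can be written down as a list of command lines. The Python sketch below only builds and prints the commands; it does not run anything. The options are copied from the earlier slides, while the helper name `pipeline_commands` and its arguments are hypothetical, and whether these exact invocations suit a given installation is not verified here.

```python
def pipeline_commands(clone: str, vector_fasta: str) -> list[list[str]]:
    """Command lines for base calling, conversion, screening and assembly."""
    fasta = f"{clone}.fasta"
    return [
        # Base calling, run from chromat_dir; writes phd files to phd_dir.
        ["phred", "-id", ".", "-p", "-pd", "../phd_dir"],
        # phd files to FASTA sequence plus quality file.
        ["phd2fasta", "-id", "../phd_dir", "-os", fasta, "-oq", f"{fasta}.qual"],
        # Vector screening; writes <fasta>.screen.
        ["cross_match", "-minmatch", "12", "-penalty", "-2",
         "-minscore", "20", "-screen", fasta, vector_fasta],
        # Assembly of the screened reads.
        ["phrap", "-new_ace", f"{fasta}.screen"],
    ]


for cmd in pipeline_commands("CLONE", "/net/share/sequence_pipeline/vector.fasta"):
    print(" ".join(cmd))
```

Keeping the commands as data makes it easy to log them or hand them to `subprocess.run` one at a time, stopping at the first failure.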

Comparison of shotgun sequence data from Wolbachia genome Project

Computer 35 (7):47-54

CAP3    3XA       6,189   57   443
PHRAP   3XA       6,396   54   529
CAP3    3XB      12,368   44    71
PHRAP   3XB      13,116   47   228
CAP3    3XC      10,709   49   227
PHRAP   3XC      11,406   45   332
CAP3    3XD      11,408   43   115
PHRAP   3XD      11,350   49   240
CAP3    5XA      10,582   42   249
PHRAP   5XA      18,268   31   252
CAP3    5XB      26,034   17   100
PHRAP   5XB      33,693   18   115
CAP3    5XC      20,939   29   172
PHRAP   5XC      20,912   27   261
CAP3    5XD      14,219   35    46
PHRAP   5XD      14,696   33   129
CAP3    8XA      71,025   12    83
PHRAP   8XA      71,395    8    80
CAP3    8XB      53,127    8    59
PHRAP   8XB      53,078    7    36
CAP3    8XC      52,134    8     4
PHRAP   8XC      76,922    6     6
CAP3    8XD      72,690    7    35
PHRAP   8XD     102,523    6    60
CAP3    10XA     91,380    4    28
PHRAP   10XA     91,329    3    11
CAP3    10XB    167,655    1     5
PHRAP   10XB    138,551    2     7
CAP3    10XC    106,631    5    44
PHRAP   10XC     77,747    4    12
CAP3    10XD     79,900    4     2
PHRAP   10XD     79,978    3     2

Software

• CAP3 (for EST): http://genome.cs.mtu.edu/cap/cap3.html

• Phrap (for large genome): http://www.phrap.org

-- Similar algorithm
-- Insufficient documentation and support
-- Always have to write scripts to parse outputs
-- NO PERFECT PROGRAM!!!

Questions??

wangsj@yahoo.com
