hierarchical sequencing

273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing

Upload: delora

Post on 24-Feb-2016


TRANSCRIPT

Page 1: Hierarchical Sequencing

CS273a Lecture 4, Autumn 08, Batzoglou

Hierarchical Sequencing

Page 2: Hierarchical Sequencing

Hierarchical Sequencing Strategy

1. Obtain a large collection of BAC clones
2. Map them onto the genome (Physical Mapping)
3. Select a minimum tiling path
4. Sequence each clone in the path with shotgun
5. Assemble
6. Put everything together

(Figure labels: a BAC clone, map, genome)
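
Step 3 can be illustrated with a small sketch. It assumes the physical map already gives each clone a start and end coordinate, which the slides do not specify; the names `Clone` and `minimum_tiling_path` are hypothetical. The sketch uses the standard greedy interval-cover idea, not necessarily the method used in the course.

```python
from typing import List, Tuple

# Hypothetical representation: (name, start, end) in map coordinates.
Clone = Tuple[str, int, int]


def minimum_tiling_path(clones: List[Clone], region_start: int,
                        region_end: int) -> List[Clone]:
    """Greedy interval cover: among the clones that start at or before the
    covered frontier, always take the one that reaches farthest right."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[Clone] = []
    frontier = region_start
    i = 0
    while frontier < region_end:
        best = None
        # Scan every not-yet-seen clone that starts at or before the frontier.
        while i < len(ordered) and ordered[i][1] <= frontier:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= frontier:
            raise ValueError(f"gap in clone coverage at position {frontier}")
        path.append(best)
        frontier = best[2]
    return path


clones = [("A", 0, 150), ("B", 100, 250), ("C", 120, 300),
          ("D", 280, 420), ("E", 290, 400)]
print([name for name, _, _ in minimum_tiling_path(clones, 0, 420)])
```

In this toy map, clones B and E are never sequenced because A, C and D already cover the region. A real tiling path would also require a minimum overlap between neighbouring clones; the sketch accepts clones that merely touch.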

Page 4: Hierarchical Sequencing

Methods of physical mapping

Goal:

Make a map of the locations of each clone relative to one another
Use the map to select a minimal set of clones to sequence

Methods:

• Hybridization
• Digestion

Page 5: Hierarchical Sequencing

1. Hybridization

Short words, the probes, attach to complementary words

1. Construct many probes
2. Treat each BAC with all probes
3. Record which ones attach to it
4. If the same probes attach to BACs X and Y, then X and Y likely overlap

(Figure labels: probes p1 … pn)
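
The overlap test in step 4 can be sketched as a set intersection. The input format (a dictionary from BAC name to the set of probes that attached to it) and the function name `probe_overlaps` are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def probe_overlaps(hits: Dict[str, Set[str]],
                   min_shared: int = 1) -> List[Tuple[str, str, int]]:
    """Return (bac_a, bac_b, shared_probe_count) for every pair of BACs
    that share at least `min_shared` probes."""
    result = []
    for a, b in combinations(sorted(hits), 2):
        shared = len(hits[a] & hits[b])
        if shared >= min_shared:
            result.append((a, b, shared))
    return result


# Hypothetical hybridization results: BAC name -> probes that attached.
hits = {"X": {"p1", "p2", "p3"}, "Y": {"p3", "p4"}, "Z": {"p5"}}
print(probe_overlaps(hits))
```

Raising `min_shared` makes the test stricter, which may help when a probe happens to fall in repeated sequence and attaches to clones that do not actually overlap.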

Page 6: Hierarchical Sequencing

2. Digestion

Restriction enzymes cut DNA where specific words appear

1. Cut each clone separately with an enzyme
2. Run fragments on a gel and measure length
3. If clones Ca, Cb share fragments of length { li, lj, lk }, they likely overlap

Double digestion:
Cut with enzyme A, enzyme B, then enzymes A + B
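
Because gel measurements are approximate, comparing two clones' fragment lengths needs a tolerance. The following is a minimal sketch of that comparison; the 2% relative tolerance and the function name `shared_fragments` are assumptions, not values from the lecture.

```python
from typing import List


def shared_fragments(frags_a: List[int], frags_b: List[int],
                     tolerance: float = 0.02) -> int:
    """Count fragment lengths of clone A that pair up with a distinct
    fragment of clone B within a relative tolerance (two-pointer scan
    over the sorted length lists)."""
    a, b = sorted(frags_a), sorted(frags_b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches


# Hypothetical fragment lengths (bp) measured for two clones.
clone_a = [1200, 3400, 5100, 800]
clone_b = [3410, 5090, 9000, 640]
print(shared_fragments(clone_a, clone_b))
```

A pair of clones with many matching fragments is a candidate overlap; how many matches are enough depends on the enzyme and the clone size.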

Page 7: Hierarchical Sequencing

Online Clone-by-clone
The Walking Method

Page 8: Hierarchical Sequencing

The Walking Method

1. Build a very redundant library of BACs with sequenced clone-ends (cheap to build)

2. Sequence some “seed” clones

3. “Walk” from seeds using clone-ends to pick library clones that extend left & right
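
A toy version of step 3, walking to the right only, might look as follows. It is a sketch under strong assumptions: clones are plain strings, a clone-end is the first `end_len` letters of a clone, clone-ends are placed by exact string search, and "sequencing" a chosen clone simply means reading its string. The name `walk_right` is hypothetical.

```python
from typing import List


def walk_right(seed: str, library: List[str], end_len: int = 4) -> str:
    """Extend a sequenced seed to the right. In each round, pick the unused
    library clone whose left clone-end lies inside the current contig and
    which sticks out farthest, then append the part that sticks out."""
    contig = seed
    remaining = list(library)
    while True:
        best_clone, best_pos, best_overhang = None, -1, 0
        for clone in remaining:
            pos = contig.rfind(clone[:end_len])  # place the clone-end
            if pos == -1:
                continue
            overhang = len(clone) - (len(contig) - pos)
            if overhang > best_overhang:
                best_clone, best_pos, best_overhang = clone, pos, overhang
        if best_clone is None:
            return contig
        # "Sequence" the chosen clone and add its new letters to the contig.
        contig += best_clone[len(contig) - best_pos:]
        remaining.remove(best_clone)


genome = "ACGTTGCAAGGCTTACCGATAGGC"
library = [genome[6:16], genome[4:12], genome[12:24]]
print(walk_right(genome[:10], library) == genome)
```

Walking to the left is symmetric, using the right clone-ends. Picking the clone that sticks out farthest keeps the number of sequenced clones low, which is the point of the method.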

Page 9: Hierarchical Sequencing

Walking: An Example

Page 10: Hierarchical Sequencing

Some Terminology

insert: a fragment that was incorporated in a circular genome, and can be copied (cloned)

vector: the circular genome (host) that incorporated the fragment

BAC: Bacterial Artificial Chromosome, a type of insert–vector combination, typically of length 100-200 kb

read: a 500-900 bp long word that comes out of a sequencing machine

coverage: the average number of reads (or inserts) that cover a position in the target DNA piece

shotgun sequencing: the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble
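
The definition of coverage translates into simple arithmetic: with N reads of average length L over a target of length G, the average coverage is N·L / G. The sketch below works this out for a BAC; the concrete numbers (a 150 kb BAC, 700 bp reads, 10x target coverage) are illustrative values chosen within the ranges above, and the function names are hypothetical.

```python
import math


def coverage(num_reads: int, read_len: int, target_len: int) -> float:
    """Average number of reads covering a position: N * L / G."""
    return num_reads * read_len / target_len


def reads_needed(target_coverage: float, read_len: int, target_len: int) -> int:
    """Smallest number of reads whose average coverage reaches the target."""
    return math.ceil(target_coverage * target_len / read_len)


print(coverage(1000, 500, 100_000))
print(reads_needed(10, 700, 150_000))
```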

Page 11: Hierarchical Sequencing

Whole Genome Shotgun Sequencing

(Figure: the genome is cut many times at random; the pieces are cloned into plasmids (2 – 10 Kbp) and cosmids (40 Kbp); each insert yields forward-reverse paired reads of ~800 bp each, separated by a known distance)

Page 12: Hierarchical Sequencing

Fragment Assembly (in whole-genome shotgun sequencing)

Page 13: Hierarchical Sequencing

Fragment Assembly

Given N reads, where N ~ 30 million…

We need to use a linear-time algorithm

Page 14: Hierarchical Sequencing

Steps to Assemble a Genome

1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence ..ACGATTACAATAGGTT..

Some Terminology

read: a 500-900 bp long word that comes out of a sequencer

mate pair: a pair of reads from two ends of the same insert fragment

contig: a contiguous sequence formed by several overlapping reads with no gaps

supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs

consensus sequence: the sequence derived from the multiple alignment of reads in a contig

Page 15: Hierarchical Sequencing

1. Find Overlapping Reads

(Figure: each read, e.g. aaactgcagtacggatct, gggcccaaactgcagtac, gtacggatctactacaca, is cut into its overlapping words: aaactgcag, aactgcagt, actgcagta, … The words are first listed as (read, pos., word, orient.) tuples, then sorted by word into (word, read, orient., pos.) tuples, so that reads sharing a word sit next to each other in the sorted list.)
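
The sorted word table in the figure can be sketched in a few lines. This is a simplified illustration: it indexes only the forward strand, so the orientation field is always "+", whereas a fuller version would also add the reverse-complement words. The function names are hypothetical; the three reads and the word length 9 are taken from the figure.

```python
from itertools import combinations, groupby
from typing import List, Set, Tuple

Row = Tuple[str, int, str, int]  # (word, read, orient., pos.)


def sorted_kmer_table(reads: List[str], k: int) -> List[Row]:
    """List every length-k word of every read, then sort by word."""
    table = [(read[pos:pos + k], read_id, "+", pos)
             for read_id, read in enumerate(reads)
             for pos in range(len(read) - k + 1)]
    table.sort()
    return table


def candidate_pairs(table: List[Row]) -> Set[Tuple[int, int]]:
    """Pairs of reads that share at least one word. After sorting, rows with
    the same word are adjacent, so one pass over the table finds them."""
    pairs = set()
    for _, rows in groupby(table, key=lambda row: row[0]):
        read_ids = sorted({row[1] for row in rows})
        pairs.update(combinations(read_ids, 2))
    return pairs


reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
table = sorted_kmer_table(reads, k=9)
print(table[0])
print(sorted(candidate_pairs(table)))
```

Read 0 shares words with both other reads, while reads 1 and 2 share none, so only the pairs (0, 1) and (0, 2) would go on to full alignment.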

Page 16: Hierarchical Sequencing

1. Find Overlapping Reads

• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment – throw away if not >98% similar

TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC

(Figure: further alignment fragments showing a few mismatching columns, e.g. TAGA against TACA)

• Caveat: repeats
  A k-mer that occurs N times causes O(N²) read/read comparisons
  ALU k-mers could cause up to 1,000,000² comparisons

• Solution: Discard all k-mers that occur “too often”

• Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available
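
The repeat caveat and the cutoff can be made concrete with a small sketch. A k-mer found in N reads proposes N(N-1)/2 read pairs, which is the quadratic growth mentioned above. The helper names and the toy cutoff of 3 reads are assumptions for illustration; a real cutoff would be tuned as the last bullet describes.

```python
from collections import defaultdict
from typing import Dict, List, Set


def kmer_index(reads: List[str], k: int) -> Dict[str, Set[int]]:
    """Map each k-mer to the set of reads that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(read_id)
    return dict(index)


def pair_count(num_reads: int) -> int:
    """Read/read comparisons proposed by a k-mer found in num_reads reads."""
    return num_reads * (num_reads - 1) // 2


def drop_frequent(index: Dict[str, Set[int]],
                  max_reads: int) -> Dict[str, Set[int]]:
    """Discard k-mers that occur in more than max_reads reads."""
    return {kmer: ids for kmer, ids in index.items() if len(ids) <= max_reads}


reads = ["acgtacgt", "cgtacgta", "gtacgtac", "ttttacgt"]
index = kmer_index(reads, k=4)
kept = drop_frequent(index, max_reads=3)
print(pair_count(len(index["acgt"])), "acgt" in kept, "tttt" in kept)
print(pair_count(1_000_000))
```

Lowering `max_reads` saves comparisons but may hide true overlaps that are supported only by frequent k-mers, which is the sensitivity/speed tradeoff.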

Page 17: Hierarchical Sequencing

1. Find Overlapping Reads

Create local multiple alignments from the overlapping reads

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

Page 18: Hierarchical Sequencing

1. Find Overlapping Reads

• Correct errors using multiple alignment

Isolated errors: insert A, replace T with C

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA

Correlated errors, probably caused by repeats: disentangle overlaps

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA

After disentangling, the reads fall into two groups:

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA

TAG-TTACACAGATTATTGA
TAG-TTACACAGATTATTGA

In practice, error correction removes up to 98% of the errors
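
The two cases on this slide can be told apart with a simple column count. The sketch below is an assumed simplification, not the assembler's actual procedure: `column_consensus` takes a majority vote per column, and `correlated_columns` flags columns where the second most common symbol appears at least twice, a hint that the reads may come from two copies of a repeat. Both function names and the threshold of 2 are hypothetical.

```python
from collections import Counter
from typing import List


def column_consensus(rows: List[str]) -> str:
    """Majority vote in every alignment column; gap columns are dropped."""
    consensus = []
    for column in zip(*rows):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)


def correlated_columns(rows: List[str], min_minority: int = 2) -> List[int]:
    """Columns where a minority symbol is seen in at least min_minority reads."""
    flagged = []
    for index, column in enumerate(zip(*rows)):
        counts = Counter(column).most_common(2)
        if len(counts) == 2 and counts[1][1] >= min_minority:
            flagged.append(index)
    return flagged


isolated = ["TAGATTACACAGATTACTGA", "TAGATTACACAGATTACTGA",
            "TAGATTACACAGATTATTGA", "TAGATTACACAGATTACTGA",
            "TAG-TTACACAGATTACTGA"]
correlated = ["TAGATTACACAGATTACTGA", "TAGATTACACAGATTACTGA",
              "TAG-TTACACAGATTATTGA", "TAGATTACACAGATTACTGA",
              "TAG-TTACACAGATTATTGA"]
print(column_consensus(isolated), correlated_columns(isolated))
print(correlated_columns(correlated))
```

In the first alignment each error stands alone, so the vote restores the common sequence and nothing is flagged. In the second, the same two reads disagree in the same two columns, so those columns are flagged and the reads are better split into two groups than corrected.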

Page 19: Hierarchical Sequencing

2. Merge Reads into Contigs

• Overlap graph:
  Nodes: reads r1…..rn
  Edges: overlaps (ri, rj, shift, orientation, score)

(Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat)

Note: of course, we don’t know the “color” of these nodes
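
A minimal sketch of the overlap graph, using the edge fields listed above. Several simplifications are assumed: overlaps are exact suffix-prefix matches rather than alignments, only the forward orientation is considered (so the orientation field is always "+"), the score is simply the overlap length, and all pairs of reads are compared directly instead of going through a k-mer index. The names `Overlap` and `build_overlap_graph` are hypothetical.

```python
from typing import List, NamedTuple


class Overlap(NamedTuple):
    ri: int           # index of the left read
    rj: int           # index of the right read
    shift: int        # offset of rj's start relative to ri's start
    orientation: str  # always "+" in this sketch
    score: int        # here: length of the exact overlap


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if it is shorter than min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads: List[str], min_len: int = 3) -> List[Overlap]:
    """Nodes are read indices; one directed edge per suffix-prefix overlap."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            length = suffix_prefix_overlap(a, b, min_len)
            if length:
                edges.append(Overlap(i, j, len(a) - length, "+", length))
    return edges


reads = ["TAGATTACAC", "TTACACAGAT", "ACAGATTACT"]
for edge in build_overlap_graph(reads):
    print(edge)
```

Following the two edges 0 → 1 → 2 lays the reads out as one contig. Reads from two copies of a repeat would produce extra edges that look just as good, which is why the "color" of the nodes cannot be read off the graph.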