cse182-l16 - university of california, san diego · • the signal decays after 1000 bases....

Post on 30-Oct-2019

4 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

CSE182-L16

LW statistics/Assembly

Silly Quiz

•  Who are these people, and what is the occasion?

Genome Sequencing and Assembly

Sequencing

•  A break at T is shown here.

•  Measuring the lengths using electrophoresis allows us to get the position of each T

•  The same can be done with every nucleotide. Fluorescent labeling can help separate different nucleotides

November 13 Bafna

•  Automated detectors ‘read’ the terminating bases.

•  The signal decays after 1000 bases.

November 13 Bafna

Sequencing Genomes: Clone by Clone

•  Clones are constructed to span the entire length of the genome.

•  These clones are ordered and oriented correctly (Mapping)

•  Each clone is sequenced individually

November 13 Bafna

Shotgun Sequencing

•  Shotgun sequencing of clones was considered viable

•  However, researchers in 1999 proposed shotgunning the entire genome.

November 13 Bafna

Library

•  Create vectors of the sequence and introduce them into bacteria. As bacteria multiply you will have many copies of the same clone.

November 13 Bafna

Whole Genome Shotgun

•  Break up the entire genome into pieces

•  Sequence ends, and assemble using a computer

•  LW statistics & Repeats argue against the success of such an approach

Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together

November 13 Bafna

Whole Genome Shotgun

•  Break up the entire genome into pieces

•  Sequence ends, and assemble using a computer

•  LW statistics & Repeats argue against the success of such an approach

Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together

Shotgun Sequencing

•  Shotgun sequencing of clones was considered viable for small genomes

•  However, researchers in 1999 proposed shotgunning the entire genome.

Massively parallel sequencing

•  Sanger sequencing allows us to sequence <=1000 bp in one lane, up to 96 lanes, in one run.

•  Today, we can sequence many Mbp in a single run

Questions

•  Algorithmic: How do you put the genome back together from the pieces?

•  Statistical? How many pieces do you need to sequence, etc.? –  The answer to the statistical questions had already

been given in the context of mapping, by Lander and Waterman.

Lander Waterman Statistics

G

L

•  The fragments are falling randomly on the genome •  Overlapping fragments form islands of contiguous sequence. •  Ideally, we want one island for each chromosome. How many

fragments should we sequence?

Lander Waterman Statistics

G

L €

G = Genome LengthL = Fragment LengthN = Number of FragmentsT = Required Overlapc = Coverage = LN/Gα = N/Gθ = T/Lσ = 1-θ

LW statistics: questions

•  As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island of overlapping contigs.

•  Q1: What is the expected number of islands?

•  The number should increase at first, and gradually decrease.

Analysis: Expected Number Islands

•  Computing Expected # islands. •  Let Xi=1 if an island ends at position i, Xi=0

otherwise. •  Number of islands = ∑i Xi •  Expected # islands = E(∑i Xi) = ∑i E(Xi)

Prob. of an island ending at i

•  E(Xi) = Prob (Island ends at pos. i) •  =Prob(clone began at position i-L+1

AND no clone began in the next L-T positions)

i L T

E(Xi) =α 1−α( )L−T =αe−cσ

Expected # islands = E(Xi) =i∑ Gαe−cσ = Ne−cσ

Computing # islands

•  As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island.

•  Q1: What is the expected number of islands?

•  Ans: N exp(-cσ) •  The number

increases at first, and gradually decreases.

Expected # of clones in an island

•  Expected # of clones in an island =

ecσ

Q: How? Why do we care?

Often, at the beginning of a genome project, we do not know the length of the genome. This equation helps us determine the length.

Problem 1: size of contigs

•  Islands might simply be too small in length •  σ = (1-T/L) = (1-50/500) = 0.9, c = 8. •  #Islands = N e-cσ = 36K •  Size of an island = 54K •  Not enough to make it an acceptable assembly! •  PLUS, there is the problem of Repeats, Chimerism etc.

Assembly Basics

•  Three main components: –  Overlap –  Layout –  Consensus

Overlap

•  Given a pair of fragments s1 and s2, do they belong together?

•  Yes, if a prefix of s2 matches a suffix of s1

•  How would you compute such a match?

Overlap

•  S[i,j] = optimum score of an alignment of s1[1..i] against a substring of that starts anywhere, but ends in j. s2[*..j]

i

j

•  The best prefix-suffix alignment is given by:

•  Maxi {S[i,n] }

Overlap Detection

•  Compute the best prefix-suffix alignments between each pair of fragments.

•  Keep the “high-scoring” ones as evidence of true overlap.

•  What is the problem?

Overlap detection problem

•  Consider the number of fragments. The LW statistics say that we need good coverage (c=8, 10) to get most of the base-pairs. –  G = 3000Mb, L=500 –  Coverage LN/G = 10 –  N = 10*3*109/500 = 6*107

–  Number of comparisons needed = 3.6 * 1015

–  Number of alignments per minute=6 –  Number of compute nodes = 100 –  Time needed (Number of years) =

•  Not good! (Only a small fraction are true overlaps)

3.6⋅ 1015

(6⋅ 60⋅ 24⋅ 365⋅ 100)=11M

k-mer based overlap (Piegeonhole principle again)

•  Consider a 25bp sequence. –  Expected number of occurrences in

the genome –  3*109*4-25 = 2*10-6

•  A 25-bp sequence appears is unique to the genome!

•  Two overlapping sequences should share a 25-mer

•  Two non-overlapping sequences should not!

25bp

Sorting k-mers

•  Build a list of k-mers that appear in the sequences and their reverse complements

•  Create a record with 4 entries: –  K-mer –  Sequence number –  Position in the sequence –  Reverse complementation flag

•  Sort a vector of these according to k-mer •  How many records per k-mer are

expected? •  If number of records exceeds threshold,

discard (why?)

K-mer S.id Pos.

Alignment module

•  Coalesce k-mer hits into longer, gap-free partial alignments.

•  These extended k-mer hits are saved.

•  For each pair of sequences, form a directed graph.

•  For each maximal path in the graph, construct an alignment.

•  Refine alignment via banded DP

Problem2: Size

•  Islands might simply be too small in length •  σ = (1-T/L) = (1-50/500) = 0.9, c = 8. •  #Islands = N e-cσ = 36K •  Size of an island = 54K •  Not enough to make it an acceptable assembly! •  PLUS, there is the problem of Repeats, Chimerism etc.

Solution 2: Clones can have mate-pairs

•  Recall that we sequence about 1000bp of the end of a clone •  If we sequenced both ends, we get extra information,

particularly if we know the length of the original clone.

Mate Pairs

•  Mate-pairs allow you to merge islands (contigs) into super-contigs

Super-contigs are quite large

•  Make clones of truly predictable length. EX: 3 sets can be used: 2Kb, 10Kb and 50Kb. The variance in these lengths should be small.

•  Use the mate-pairs to order and orient the contigs, and make super-contigs.

Problem 3: Repeats

Repeats & Chimerisms

•  40-50% of the human genome is made up of repetitive elements.

•  Repeats can cause great problems in the assembly!

•  Chimerism causes a clone to be from two different parts of the genome. Can again give a completely wrong assembly

Repeat detection

•  Lander Waterman strikes again! •  The expected number of clones in a Repeat containing island is

MUCH larger than in a non-repeat containing island (contig). •  Thus, every contig can be marked as Unique, or non-unique. In

the first step, throw away the non-unique islands.

Repeat

Detecting Repeat Contigs 1: Read Density

•  Compute the log-odds ratio of two hypotheses:

•  H1: The contig is from a unique region of the genome.

•  The contig is from a region that is repeated at least twice

Detecting Chimeric reads

•  Chimeric reads: Reads that contain sequence from two genomic locations.

•  Good overlaps: G(a,b) if a,b overlap with a high score

•  Transitive overlap: T(a,c) if G(a,b), and G(b,c)

•  Find a point x across which only transitive overlaps occur. X is a point of chimerism

Whole genome shotgun

•  Input: –  Shotgun sequence fragments (reads) –  Mate pairs

•  Output: –  A single sequence created by consensus of overlapping reads

•  First generation of assemblers did not include mate-pairs (Phrap, CAP..)

•  Second generation: CA, Arachne, Euler

Assembly

•  Use k-mers to detect potential overlaps •  Use alignments to build contig graphs •  Decide the unique contigs based on LW

statistics –  Discard repeat contigs –  Break chimeric contigs

•  Use mate-pairs to build scaffolds •  Fill gaps

Assembly

•  Use k-mers to detect potential overlaps •  Use alignments to build contig graphs •  Decide the unique contigs based on LW

statistics –  Discard repeat contigs –  Break chimeric contigs

•  Use mate-pairs to build scaffolds

Consensus Derivation

•  Consensus sequence is created by converting pairwise read alignments into multiple-read alignments

Summary

•  Whole genome shotgun is now routine: –  Human, Mouse, Rat, Dog, Chimpanzee.. –  Many Prokaryotes (One can be sequenced in a day) –  Plant genomes: Arabidopsis, Rice –  Model organisms: Worm, Fly, Yeast

•  A lot is not known about genome structure, organization and function. –  Comparative genomics offers low hanging fruit

Final exam syllabus

•  Take home •  The entire course, but emphasis will be given to

post-midterm lectures •  HMMs, •  Gene-finding, •  mass spectrometry, •  Micro-array analysis, •  genome sequencing and assembly

What we did not cover

top related