cse182-l16 - university of california, san diego · • the signal decays after 1000 bases....
Post on 30-Oct-2019
4 Views
Preview:
TRANSCRIPT
CSE182-L16
LW statistics/Assembly
Silly Quiz
• Who are these people, and what is the occasion?
Genome Sequencing and Assembly
Sequencing
• A break at T is shown here.
• Measuring the lengths using electrophoresis allows us to get the position of each T
• The same can be done with every nucleotide. Fluorescent labeling can help separate different nucleotides
November 13 Bafna
• Automated detectors ‘read’ the terminating bases.
• The signal decays after 1000 bases.
November 13 Bafna
Sequencing Genomes: Clone by Clone
• Clones are constructed to span the entire length of the genome.
• These clones are ordered and oriented correctly (Mapping)
• Each clone is sequenced individually
November 13 Bafna
Shotgun Sequencing
• Shotgun sequencing of clones was considered viable
• However, researchers in 1999 proposed shotgunning the entire genome.
November 13 Bafna
Library
• Create vectors of the sequence and introduce them into bacteria. As bacteria multiply you will have many copies of the same clone.
November 13 Bafna
Whole Genome Shotgun
• Break up the entire genome into pieces
• Sequence ends, and assemble using a computer
• LW statistics & Repeats argue against the success of such an approach
Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together
November 13 Bafna
Whole Genome Shotgun
• Break up the entire genome into pieces
• Sequence ends, and assemble using a computer
• LW statistics & Repeats argue against the success of such an approach
Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together
Shotgun Sequencing
• Shotgun sequencing of clones was considered viable for small genomes
• However, researchers in 1999 proposed shotgunning the entire genome.
Massively parallel sequencing
• Sanger sequencing allows us to sequence <=1000 bp in one lane, up to 96 lanes, in one run.
• Today, we can sequence many Mbp in a single run
Questions
• Algorithmic: How do you put the genome back together from the pieces?
• Statistical? How many pieces do you need to sequence, etc.? – The answer to the statistical questions had already
been given in the context of mapping, by Lander and Waterman.
Lander Waterman Statistics
G
L
• The fragments are falling randomly on the genome • Overlapping fragments form islands of contiguous sequence. • Ideally, we want one island for each chromosome. How many
fragments should we sequence?
Lander Waterman Statistics
G
L €
G = Genome LengthL = Fragment LengthN = Number of FragmentsT = Required Overlapc = Coverage = LN/Gα = N/Gθ = T/Lσ = 1-θ
LW statistics: questions
• As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island of overlapping contigs.
• Q1: What is the expected number of islands?
• The number should increase at first, and gradually decrease.
Analysis: Expected Number Islands
• Computing Expected # islands. • Let Xi=1 if an island ends at position i, Xi=0
otherwise. • Number of islands = ∑i Xi • Expected # islands = E(∑i Xi) = ∑i E(Xi)
Prob. of an island ending at i
• E(Xi) = Prob (Island ends at pos. i) • =Prob(clone began at position i-L+1
AND no clone began in the next L-T positions)
i L T
€
E(Xi) =α 1−α( )L−T =αe−cσ
€
Expected # islands = E(Xi) =i∑ Gαe−cσ = Ne−cσ
Computing # islands
• As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island.
• Q1: What is the expected number of islands?
• Ans: N exp(-cσ) • The number
increases at first, and gradually decreases.
Expected # of clones in an island
• Expected # of clones in an island =
€
ecσ
Q: How? Why do we care?
Often, at the beginning of a genome project, we do not know the length of the genome. This equation helps us determine the length.
Problem 1: size of contigs
• Islands might simply be too small in length • σ = (1-T/L) = (1-50/500) = 0.9, c = 8. • #Islands = N e-cσ = 36K • Size of an island = 54K • Not enough to make it an acceptable assembly! • PLUS, there is the problem of Repeats, Chimerism etc.
Assembly Basics
• Three main components: – Overlap – Layout – Consensus
Overlap
• Given a pair of fragments s1 and s2, do they belong together?
• Yes, if a prefix of s2 matches a suffix of s1
• How would you compute such a match?
Overlap
• S[i,j] = optimum score of an alignment of s1[1..i] against a substring of that starts anywhere, but ends in j. s2[*..j]
i
j
• The best prefix-suffix alignment is given by:
• Maxi {S[i,n] }
Overlap Detection
• Compute the best prefix-suffix alignments between each pair of fragments.
• Keep the “high-scoring” ones as evidence of true overlap.
• What is the problem?
Overlap detection problem
• Consider the number of fragments. The LW statistics say that we need good coverage (c=8, 10) to get most of the base-pairs. – G = 3000Mb, L=500 – Coverage LN/G = 10 – N = 10*3*109/500 = 6*107
– Number of comparisons needed = 3.6 * 1015
– Number of alignments per minute=6 – Number of compute nodes = 100 – Time needed (Number of years) =
• Not good! (Only a small fraction are true overlaps)
€
3.6⋅ 1015
(6⋅ 60⋅ 24⋅ 365⋅ 100)=11M
k-mer based overlap (Piegeonhole principle again)
• Consider a 25bp sequence. – Expected number of occurrences in
the genome – 3*109*4-25 = 2*10-6
• A 25-bp sequence appears is unique to the genome!
• Two overlapping sequences should share a 25-mer
• Two non-overlapping sequences should not!
25bp
Sorting k-mers
• Build a list of k-mers that appear in the sequences and their reverse complements
• Create a record with 4 entries: – K-mer – Sequence number – Position in the sequence – Reverse complementation flag
• Sort a vector of these according to k-mer • How many records per k-mer are
expected? • If number of records exceeds threshold,
discard (why?)
K-mer S.id Pos.
Alignment module
• Coalesce k-mer hits into longer, gap-free partial alignments.
• These extended k-mer hits are saved.
• For each pair of sequences, form a directed graph.
• For each maximal path in the graph, construct an alignment.
• Refine alignment via banded DP
Problem2: Size
• Islands might simply be too small in length • σ = (1-T/L) = (1-50/500) = 0.9, c = 8. • #Islands = N e-cσ = 36K • Size of an island = 54K • Not enough to make it an acceptable assembly! • PLUS, there is the problem of Repeats, Chimerism etc.
Solution 2: Clones can have mate-pairs
• Recall that we sequence about 1000bp of the end of a clone • If we sequenced both ends, we get extra information,
particularly if we know the length of the original clone.
Mate Pairs
• Mate-pairs allow you to merge islands (contigs) into super-contigs
Super-contigs are quite large
• Make clones of truly predictable length. EX: 3 sets can be used: 2Kb, 10Kb and 50Kb. The variance in these lengths should be small.
• Use the mate-pairs to order and orient the contigs, and make super-contigs.
Problem 3: Repeats
Repeats & Chimerisms
• 40-50% of the human genome is made up of repetitive elements.
• Repeats can cause great problems in the assembly!
• Chimerism causes a clone to be from two different parts of the genome. Can again give a completely wrong assembly
Repeat detection
• Lander Waterman strikes again! • The expected number of clones in a Repeat containing island is
MUCH larger than in a non-repeat containing island (contig). • Thus, every contig can be marked as Unique, or non-unique. In
the first step, throw away the non-unique islands.
Repeat
Detecting Repeat Contigs 1: Read Density
• Compute the log-odds ratio of two hypotheses:
• H1: The contig is from a unique region of the genome.
• The contig is from a region that is repeated at least twice
Detecting Chimeric reads
• Chimeric reads: Reads that contain sequence from two genomic locations.
• Good overlaps: G(a,b) if a,b overlap with a high score
• Transitive overlap: T(a,c) if G(a,b), and G(b,c)
• Find a point x across which only transitive overlaps occur. X is a point of chimerism
Whole genome shotgun
• Input: – Shotgun sequence fragments (reads) – Mate pairs
• Output: – A single sequence created by consensus of overlapping reads
• First generation of assemblers did not include mate-pairs (Phrap, CAP..)
• Second generation: CA, Arachne, Euler
Assembly
• Use k-mers to detect potential overlaps • Use alignments to build contig graphs • Decide the unique contigs based on LW
statistics – Discard repeat contigs – Break chimeric contigs
• Use mate-pairs to build scaffolds • Fill gaps
Assembly
• Use k-mers to detect potential overlaps • Use alignments to build contig graphs • Decide the unique contigs based on LW
statistics – Discard repeat contigs – Break chimeric contigs
• Use mate-pairs to build scaffolds
Consensus Derivation
• Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
Summary
• Whole genome shotgun is now routine: – Human, Mouse, Rat, Dog, Chimpanzee.. – Many Prokaryotes (One can be sequenced in a day) – Plant genomes: Arabidopsis, Rice – Model organisms: Worm, Fly, Yeast
• A lot is not known about genome structure, organization and function. – Comparative genomics offers low hanging fruit
Final exam syllabus
• Take home • The entire course, but emphasis will be given to
post-midterm lectures • HMMs, • Gene-finding, • mass spectrometry, • Micro-array analysis, • genome sequencing and assembly
What we did not cover
top related