dna sequencing
DESCRIPTION
DNA Sequencing. DNA sequencing. How we obtain the sequence of nucleotides of a species. …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…. Human Genome Project. 1990 : Start. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/1.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
DNA Sequencing
![Page 2: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/2.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
DNA sequencing
How we obtain the sequence of nucleotides of a species
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…
![Page 3: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/3.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Human Genome Project
1990: Start
2000: Bill Clinton:2001: Draft
2003: Finished$3 billion
3 billion basepairs
“most important scientific discovery in the 20th century”
now what?
![Page 4: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/4.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Which representative of the species?
Which human?
Answer one:
Answer two: it doesn’t matter
Polymorphism rate: number of letter changes between two different members of a species
Humans: ~1/1,000
Other organisms have much higher polymorphism rates Population size!
![Page 5: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/5.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Why humans are so similar
A small population that interbred reduced the genetic variation
Out of Africa ~ 40,000 years ago
Out of Africa
Heterozygosity: HH = 4Nu/(1 + 4Nu)u ~ 10-8, N ~ 104
H ~ 410-4
N
![Page 6: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/6.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
There is never “enough” sequencing
100 million species
7 billion individuals
Somatic mutations (e.g., HIV, cancer)
Sequencing is a functional assay
![Page 7: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/7.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Sequencing Growth
Cost of one human genome• 2004: $30,000,000• 2008: $100,000• 2010: $10,000• 2011: $4,000 (today)• 2012-13: $1,000• ???: $300
How much would you pay for a smartphone?
![Page 8: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/8.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
DNA Sequencing
Goal:Find the complete sequence of A, C, G, T’s in DNA
Challenge:There is no machine that takes long DNA as an input, and gives the complete sequence as output
Can only sequence ~150 letters at a time
![Page 9: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/9.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Ancient sequencing technology – SangerVectors
+ =
DNA
Shake
DNA fragments
VectorCircular genome(bacterium, plasmid)
Knownlocation
(restrictionsite)
![Page 10: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/10.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Ancient sequencing technology – SangerGel Electrophoresis
1. Start at primer(restriction site)
2. Grow DNA chain
3. Include dideoxynucleoside (modified a, c, g, t)
4. Stops reaction at all possible points
5. Separate products with length, using gel electrophoresis
![Page 11: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/11.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
DNA Sequencing – Illumina TruSeq
![Page 12: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/12.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Method to sequence longer regions
cut many times at random (Shotgun)
genomic segment
Get one or two reads from each segment
~900 bp ~900 bp
![Page 13: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/13.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Two main assembly problems
• De Novo Assembly
• Resequencing
![Page 14: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/14.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Human Genome Variation
SNP TGCTGAGATGCCGAGA Novel Sequence TGCTCGGAGA
TGC - - - GAGA
Inversion Mobile Element orPseudogene Insertion
Translocation Tandem Duplication
Microdeletion TGC - - AGATGCCGAGA Transposition
Large Deletion Novel Sequenceat Breakpoint
TGC
![Page 15: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/15.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Outline
• De novo assembly, outline of algorithms
• Read mapping for resequencing Sequence alignment
• Variation detection Applications to human genomes, cancer genomes
![Page 16: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/16.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Reconstructing the Sequence (De Novo Assembly)
Cover region with high redundancy
Overlap & extend reads to reconstruct the original genomic region
reads
![Page 17: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/17.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Definition of Coverage
Length of genomic segment: GNumber of reads: NLength of each read: L
Definition: Coverage C = N L / G
How much coverage is enough?
Lander-Waterman model: Prob[ not covered bp ] = e-C
Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides
C
![Page 18: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/18.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Repeats
Bacterial genomes: 5%Mammals: 50%
Repeat types:
• Low-Complexity DNA (e.g. ATATATATACATA…)
• Microsatellite repeats (a1…ak)N where k ~ 3-6(e.g. CAGCAGTAGCAGCACCAG)
• Transposons SINE (Short Interspersed Nuclear Elements)
e.g., ALU: ~300-long, 106 copies LINE (Long Interspersed Nuclear Elements)
~4000-long, 200,000 copies LTR retroposons (Long Terminal Repeats (~700 bp) at each end)
cousins of HIV
• Gene Families genes duplicate & then diverge (paralogs)
• Recent duplications ~100,000-long, very similar copies
![Page 19: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/19.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Sequencing and Fragment Assembly
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
3x109 nucleotides
50% of human DNA is composed of repeats
Error!Glued together two distant regions
![Page 20: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/20.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
What can we do about repeats?
Two main approaches:• Cluster the reads
• Link the reads
![Page 21: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/21.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
What can we do about repeats?
Two main approaches:• Cluster the reads
• Link the reads
![Page 22: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/22.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
What can we do about repeats?
Two main approaches:• Cluster the reads
• Link the reads
![Page 23: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/23.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Sequencing and Fragment Assembly
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
3x109 nucleotides
C R D
ARB, CRD
or
ARD, CRB ?
A R B
![Page 24: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/24.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Sequencing and Fragment Assembly
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
3x109 nucleotides
![Page 25: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/25.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Fragment Assembly(in whole-genome shotgun sequencing)
![Page 26: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/26.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Fragment Assembly
Given N reads…Where N ~ 30
million…
We need to use a linear-time algorithm
![Page 27: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/27.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Steps to Assemble a Genome
1. Find overlapping reads
4. Derive consensus sequence ..ACGATTACAATAGGTT..
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
Some Terminology
read a 500-900 long word that comes out of sequencer
mate pair a pair of reads from two endsof the same insert fragment
contig a contiguous sequence formed by several overlapping readswith no gaps
supercontig an ordered and oriented set(scaffold) of contigs, usually by mate
pairs
consensus sequence derived from thesequene multiple alignment of reads
in a contig
![Page 28: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/28.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
1. Find Overlapping Reads
aaactgcagtacggatctaaactgcag aactgcagt… gtacggatct tacggatctgggcccaaactgcagtacgggcccaaa ggcccaaac… actgcagta ctgcagtacgtacggatctactacacagtacggatc tacggatct… ctactacac tactacaca
(read, pos., word, orient.)aaactgcagaactgcagtactgcagta… gtacggatctacggatctgggcccaaaggcccaaacgcccaaact…actgcagtactgcagtacgtacggatctacggatctacggatcta…ctactacactactacaca
(word, read, orient., pos.)aaactgcagaactgcagtacggatcta actgcagta actgcagtacccaaactgcggatctacctactacacctgcagtacctgcagtacgcccaaactggcccaaacgggcccaaagtacggatcgtacggatctacggatcttacggatcttactacaca
![Page 29: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/29.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
1. Find Overlapping Reads
• Find pairs of reads sharing a k-mer, k ~ 24• Extend to full alignment – throw away if not >98% similar
TAGATTACACAGATTAC
TAGATTACACAGATTAC|||||||||||||||||
T GA
TAGA| ||
TACA
TAGT||
• Caveat: repeats A k-mer that occurs N times, causes O(N2) read/read comparisons ALU k-mers could cause up to 1,000,0002 comparisons
• Solution: Discard all k-mers that occur “too often”
• Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available
![Page 30: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/30.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG TTACACAGATTATTGATAGATTACACAGATTACTGATAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG TTACACAGATTATTGATAGATTACACAGATTACTGA
![Page 31: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/31.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
1. Find Overlapping Reads
• Correct errors using multiple alignment
TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAGATTACACAGATTATTGATAGATTACACAGATTACTGATAG-TTACACAGATTACTGA
TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG-TTACACAGATTATTGATAGATTACACAGATTACTGATAG-TTACACAGATTATTGA
insert Areplace T with C
correlated errors—probably caused by repeats disentangle overlaps
TAGATTACACAGATTACTGATAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGAIn practice, error correction removes up to 98% of the errors
![Page 32: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/32.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
2. Merge Reads into Contigs
• Overlap graph: Nodes: reads r1…..rn
Edges: overlaps (ri, rj, shift, orientation, score)
Note:of course, we don’tknow the “color” ofthese nodes
Reads that comefrom two regions ofthe genome (blueand red) that containthe same repeat
![Page 33: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/33.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
2. Merge Reads into Contigs
We want to merge reads up to potential repeat boundaries
repeat region
Unique Contig
Overcollapsed Contig
![Page 34: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/34.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
2. Merge Reads into Contigs
• Remove transitively inferable overlaps If read r overlaps to the right reads r1, r2,
and r1 overlaps r2, then (r, r2) can be inferred by (r, r1) and (r1, r2)
r r1 r2 r3
![Page 35: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/35.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
2. Merge Reads into Contigs
![Page 36: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/36.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Repeats, errors, and contig lengths
• Repeats shorter than read length are easily resolved Read that spans across a repeat disambiguates order of flanking regions
• Repeats with more base pair diffs than sequencing error rate are OK We throw overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
Increase read length Decrease sequencing error rate
Role of error correction:Discards up to 98% of single-letter sequencing errors
decreases error rate decreases effective repeat content increases contig length
![Page 37: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/37.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
3. Link Contigs into Supercontigs
Too dense Overcollapsed
Inconsistent links Overcollapsed?
Normal density
![Page 38: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/38.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Find all links between unique contigs
3. Link Contigs into Supercontigs
Connect contigs incrementally, if 2 forward-reverse links
supercontig(aka scaffold)
![Page 39: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/39.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Fill gaps in supercontigs with paths of repeat contigsComplex algorithmic step• Exponential number of paths• Forward-reverse links
3. Link Contigs into Supercontigs
![Page 40: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/40.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
De Brujin Graph formulation
• Given sequence x1…xN, k-mer length k,Graph of 4k vertices,Edges between words with (k-1)-long overlap
![Page 41: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/41.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
4. Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA TTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAAACTATAG TTACACAGATTATTGACTTCATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted voting
(Alternative: take maximum-quality letter)
![Page 42: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/42.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Panda Genome
![Page 43: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/43.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Hominid lineage
![Page 44: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/44.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
Orangutan genome
1 Sumatran genome with Sanger5 Sumatran, 5 Bornean genomes with HiSeq
![Page 45: DNA Sequencing](https://reader035.vdocuments.mx/reader035/viewer/2022062501/56816345550346895dd3d46c/html5/thumbnails/45.jpg)
CS273a Lecture 4, Autumn 08, BatzoglouCS273a 2011
History of WGA
• 1982: -virus, 48,502 bp
• 1995: h-influenzae, 1 Mbp
• 2000: fly, 100 Mbp
• 2001 – present human (3Gbp), mouse (2.5Gbp), rat*, chicken, dog, chimpanzee,
several fungal genomes
Gene Myers
Let’s sequence the human
genome with the shotgun
strategy
That is impossible, and
a bad idea anyway
Phil Green
1997