sequencing - kth · genome assembly assembly quality genes gene finding estscan modern assembly...
Genome assembly Assembly quality Genes Gene finding ESTScan
Today
• Genome assembly
• Ab initio gene finding
Genome sequencing
• Full overview
• Localizing genes: order, proximity, ...
• Reveal gene regulation
• Understand pseudogenes
• Foundation for further studies
Technologies for whole genome sequencing
• Whole Genome Shotgun (WGS): Dominates today. Read random parts of the genome, then assemble.
  Advantage: Easy to automate, only one type of data.
  Disadvantage: Hard to assemble the pieces!
Technologies for whole genome sequencing
• Compartmental shotgun: Break the genome into chunks, put each chunk in a "BAC" (Bacterial Artificial Chromosome), then do WGS on each BAC.
  Advantage: Easier to assemble. Manages duplications better.
  Disadvantage: More steps, more data: needs a physical map.
Mapping a genome
• Physical map: a localization and ordering of markers on a genome
• What markers are on your BACs?
• In what order are they?
Genome assembly
• Create an overlap graph for all reads from the genome/chromosome/BAC.
• Find a path through the graph, as for EST data.
• HUGE computational problem (Dog: 31.5 × 10^6 reads)
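A toy sketch of the overlap-graph step, assuming error-free reads and exact suffix/prefix matches. The function names, the `min_len` threshold and the three example reads are made up for illustration and do not come from any particular assembler.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """Map each directed edge (a, b) to its overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k:
            graph[(a, b)] = k
    return graph


# Hypothetical reads: each overlaps the next one by four bases.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
graph = overlap_graph(reads)
for (a, b), k in graph.items():
    print(a, "->", b, "overlap", k)
```

Comparing all pairs is quadratic in the number of reads, which is one way to see why tens of millions of reads make this a huge computational problem.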
Assembly problems
• Compute the overlaps (solved)
• What overlap is important?
  • How long?
  • How many changes allowed?
• Repeats are hard
• There are pieces missing...
• There are differences between individuals!
Assembly problem: duplications
Compressions such as this can easily total 1% or more of the genome, and the "orphan" regions can be quite long, 5 000–10 000 bp or more. (Salzberg and Yorke, 2005)
Find duplicated regions by looking at the number of reads covering them!
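The read-depth idea can be sketched roughly as below: count how many reads cover each assembled position and flag positions with unusually high depth. The interval representation, the threshold `factor=1.5` and the example reads are assumptions chosen for illustration only.

```python
def read_depth(length, reads):
    """Per-position depth from (start, end) read intervals, end exclusive."""
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for d in diff[:length]:
        running += d
        depth.append(running)
    return depth


def suspicious_positions(depth, factor=1.5):
    """Positions whose depth exceeds factor times the mean depth.

    The factor is a hypothetical cut-off, not a recommended value.
    """
    mean = sum(depth) / len(depth)
    return [i for i, d in enumerate(depth) if d > factor * mean]


# Hypothetical assembly of length 10 where positions 4-5 attract extra reads,
# as a collapsed duplication would.
reads = [(0, 4), (2, 6), (4, 8), (6, 10), (4, 6), (4, 6), (4, 6)]
depth = read_depth(10, reads)
print(depth)
print(suspicious_positions(depth))
```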
Modern assembly
• Need constraints
• Physical map gives constraints
• Paired-end reads, or mate pairs, give constraints for "real" WGS
• Create clones of a given length (2 kb, 10 kb, and 50 kb) and do standard reads at both ends.
• Paired-ends give scaffolds, "super-contigs".
Scaffolds from paired-ends
Regular reads: [Figure: Reads, Contig 1, Contig 2]
With paired-ends: [Figure: Reads, Contig 1, Contig 2, "A scaffold"]
CCCAGTCCATTCTCCCACTGATGTCTGTAACTATATTCATCAATCTCTTTGATTTCCAAAGCGCATACCATGGCCTCTGAATGTACTTTGCAGCTTGCCCTTCACACACCCTATACCAATAGTTTTCTGGCTCCTGACCATCAAACTGCCTCCATATGACTGTGCTCTTGTCTTTCCTTAGTTGCATGGGTGTCATCTTATGGGTCACGACCTTCTTAACTGGAACTTTCTTCTTATGGGAGCGATCCCATTTCTTCCAACCTCTCAAAAATTCACCCTCTTCAACAATTGACGCCTCCTCCTTAAGATGCTCAAAATCAAAATAAAACCTAAAATCCTTCCCCTCCTGATCCTTCCCATCCGGATTATAATTACCTGCCAAGCATATACTCAAGTCCATGACAAATGTCCTGTTCAAATACATAGCCTCCCCCAACCCACACAAGAAACTCCACATGTAATGATTCATTCCCTTGCAATAATCCCCTCCTCTTGAATAATACAAGTACTTCCCTTTTCTAAAGTTCGTTTCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCTCTCAAACTATCGTGAGTACGGGATTATATTTCTAAAGTAAGTAAATTATCCAAGATGGGCCGATCCATTACGAAAAGGAAAGATAATCCACTATTATTTCTTAATATAAGGGCAAATTTAAATATATAAAGATGTAATTGTTGTTGGCGAGTGCCTCTCTTGGTTGTTGAGAGTGAAATTGACAGCAAAGTTGTAGATTGTGACAGCCAATGTAACTTATTCACAAATTGGCCTGCCAATGGTACATCATGAATCGCTATGCCACATACTTGATTATACCTTCTAAAGTACCTGTGAATTTTATTTATTTTTTCCATTTTAAAAGTATGTTTTATTTGGAAAAAATATCAAATTATTATTTTACTATTATATTTTAATATATTCTTAAATAAAAAACACTATTAAAAAATATTATGCACCGCAATAATAAACACATATTAAACAAACAGATAATTTTTATATGGCATTTCACTATTGTTGTGGAATAATATCTTT
Assembly quality: C
• Popular measure of quality: coverage
• C = average number of reads covering each nucleotide in the genome
• "Coverage is 10x": every position is in ≈ 10 PCR reads.

Genome                     C
D. melanogaster            ∼ 14x
Human, HGP                 7x
Human, Celera              ∼ 5.5x
Hedgehog, cat, and more    1.8x – 2x
Dog                        8x
Horse                      7.8x
Assembly quality: N50
• "The latest", N50:
  1. Sort all contigs by size
  2. Count from the largest contig and downward, stop when half the genome is covered.
  3. Length of last contig is N50.
• Given a 30 Mbp genome with contigs (sizes in Mbp): 5, 4.5, 4.5, 4, 3.5, 2.6, 2.5, 2.5, 1.7, 0.9, 0.8
  The four largest contigs sum to 5 + 4.5 + 4.5 + 4 = 18
  N50 is 4 Mbp, because 18 > 30/2
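The three N50 steps can be sketched in a few lines. The function name and the optional `genome_size` argument are illustrative choices, and the contig list is my reading of the numbers in the slide's figure.

```python
def n50(contig_lengths, genome_size=None):
    """Length of the contig at which half the genome is covered.

    If genome_size is not given, the total contig length is used instead.
    Returns None if the contigs cover less than half of genome_size.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= total / 2:
            return length
    return None


# Contig sizes in Mbp, as read from the slide example (assumed parsing).
contigs = [4.5, 4.5, 4, 3.5, 5, 2.5, 2.6, 2.5, 0.8, 0.9, 1.7]
print(n50(contigs, genome_size=30))
```

Some tools take half of the total contig length rather than half of the genome size as the target. For this example both choices stop at the same contig.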
N50 characteristics
• High N50 ⇒ Good contigs ⇒ Good assembly
• Low N50 ⇒ Many small contigs to cover genome ⇒ Bad assembly
• Note: a completely wrong assembly could have high N50!
  "The standard of judging assembly quality by size of contigs is questionable. Large contigs may simply reflect overly aggressive joining of contigs, thereby creating larger contigs with mis-assemblies. As a consequence, genome scientists who are not experts at assembly can be completely misled by statistics about contig sizes, and as a result might prefer the 'larger' but incorrect assembly when given a choice."
  Salzberg and Yorke, 2005
Lander-Waterman model
• How many times will a position be sequenced?
• Key assumption: Reads are uniformly distributed
• Coverage is C
• "Let $X_i$ be the number of times position $i$ is sequenced; $X_i$ is Poisson(C) distributed."
  $\Pr(X_i = k) = C^k e^{-C} / k!$
• Special case: Fraction of genome not sequenced:
  $\Pr(X_i = 0) = e^{-C}$
Example on Lander-Waterman
Coverage    Sequenced fraction
2           0.865
3           0.950
4           0.982
5           0.993
6           0.998
7           0.999
10          0.99995
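The table values follow from the Poisson model: the sequenced fraction is one minus the probability that a position is covered zero times. A minimal sketch, with function names chosen here for illustration:

```python
import math


def poisson_pmf(k, c):
    """Pr(X = k) for X ~ Poisson(c)."""
    return c**k * math.exp(-c) / math.factorial(k)


def sequenced_fraction(c):
    """Expected fraction of positions covered by at least one read."""
    return 1.0 - poisson_pmf(0, c)


for c in (2, 3, 4, 5, 6, 7, 10):
    print(c, round(sequenced_fraction(c), 5))
```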
More Lander-Waterman results
Require a fractional overlap $\theta$, $0 < \theta < 1$, to join reads into a contig.
• Expected number of contigs given $N$ reads: $N e^{-C(1-\theta)}$
  Dog: 8x, require e.g. 10% overlap, 32 × 10^6 reads: 24 000 contigs
• Expected contig size, with read length $L$: $L \left( \frac{e^{C(1-\theta)} - 1}{C} + \theta \right)$
  Dog, assume L = 500: contigs are ∼ 83 700 bp
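The dog numbers can be checked with a short calculation. This sketch reads the contig-size formula as L · ((e^{C(1−θ)} − 1)/C + θ), which is the grouping that reproduces the ∼ 83 700 bp figure on the slide. The function names are illustrative.

```python
import math


def expected_contigs(n_reads, coverage, theta):
    """Expected number of contigs: N * exp(-C * (1 - theta))."""
    return n_reads * math.exp(-coverage * (1 - theta))


def expected_contig_length(read_length, coverage, theta):
    """Expected contig size: L * ((exp(C * (1 - theta)) - 1) / C + theta)."""
    growth = math.exp(coverage * (1 - theta))
    return read_length * ((growth - 1) / coverage + theta)


# Dog example from the slide: 8x coverage, 10% overlap, 32 million reads, L = 500.
print(round(expected_contigs(32e6, 8, 0.1)))
print(round(expected_contig_length(500, 8, 0.1)))
```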
Lander-Waterman and reality
• "For both a simulated unassisted 2x mouse genome assembly (Margulies et al. 2005) and the assisted 1.9x cat genome assembly of Pontius et al. (2007) euchromatic genome coverage by assembled contigs was only 65%, significantly less than the theoretical Poisson expectation (Lander and Waterman 1988) of 85%."
  Green, 2007
• Why the discrepancy?
Gene finding/prediction
• Goal: Find all genes
• Which organism?
• What is a gene?
  • With or without introns?
  • With or without UTR?
  • Count RNA genes?
  • Only the protein coding part?
  • Regulatory elements (TFBS)?
• Here: Find the coding parts, the CDS
History of gene definitions
• 1866: Mendel and heredity
• 1909: The word "gene"
• 1910: Morgan, beads-on-a-string model
• 1941: Gene a blueprint for a protein
• 1950s: A gene is a molecule
• 1960s: A gene is transcribed
• 1970s: A gene is an ORF
• 1990s: One or several regions on the genome
A modern gene definition
"The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products."
Prokaryote gene structure
• First 35 bp promoter
• Then one or more genes, an operon
• No introns
• Densely packed, 80
  "Almost everything is a gene"
• Genes may overlap
Finding prokaryote genes
Method 1: Look for ORFs ≥ 300 bp (or so).
• ORF = Open Reading Frame, i.e. a sequence with no stop codon (some demand starting with ATG).
Method 2: Compare with a gene model.
• Larsen & Krogh: "It is a common misconception that identification of genes in prokaryotes is almost trivial."
• The ORF method has many false predictions
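Method 1 can be sketched as a scan over the three forward reading frames. This is a simplified illustration, not a real gene finder: it ignores the reverse strand, reports only ORFs that end in a stop codon, and the function name and parameters are made up here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=300, require_atg=True):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive.

    Each reported ORF includes its terminating stop codon and is at least
    min_len bases long. With require_atg=False an ORF starts right after
    the previous stop codon (or at the start of the frame).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None if require_atg else frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if start is not None and i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None if require_atg else i + 3
            elif require_atg and start is None and codon == "ATG":
                start = i
    return orfs


# Tiny hypothetical sequence, so min_len is lowered far below 300 bp.
seq = "CCATGAAATTTGGGTAACC"
print(find_orfs(seq, min_len=9))
```

As the slide notes, a length threshold alone gives many false predictions, which is the motivation for comparing against a gene model instead.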
Larsen & Krogh: EasyGene
• HMM for prokaryote gene finding
• Every box is several model states
• Coding part has several possible paths to handle length variation
Codon model for EasyGene
One of three submodels for coding region
• Finds basically everything. ≈ 5 % false predictions
• Extra: States are of "order 4", i.e. they consider the four previous emissions for the current emission. A standard HMM is of order 0.
Eukaryote gene structure
From Anders Krogh
Eukaryote gene structure: Signals
• Kozak sequence (for vertebrates?) prior to ATG
• Acceptor-donor sites for exon/intron boundaries.
• Poly-A signal, consensus: "AAATAAAA"
• Length distributions
• Codon frequencies
Signal: Kozak sequence
• Kozak-logo for cow
Signal: donor site
Munch & Krogh: Agene
Signal: acceptor site
Munch & Krogh: Agene
Signal: exon lengths
Munch & Krogh: Agene
• Coding-length distribution in fly
Signal: intron lengths
Munch & Krogh: Agene
• Quite strong limitations on intron length in fly
Signal: Genetic code
• Codon frequencies correspond to tRNA availability
• Deviate strongly from random nucleotide triples
• What determines tRNA availability?
Robustness with genetic code
A gene model: Agene
Munch & Krogh: Agene
Quality of gene predictions
• Genscan (test fav) suggests 68101 genes in human!
• Error frequency varies by species
• Exons easily missed
• Pseudogenes often misclassified. We have about 5000 pseudogenes.
• Expect to miss almost 50 % of true genes, and find almost 50 % junk.
  At best: 90 % sensitivity, 90 % specificity
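The slide does not define its terms. The sketch below assumes the convention common in gene-prediction evaluation, where sensitivity is the fraction of true genes that are found and specificity is the fraction of predictions that are real (elsewhere called precision). The counts are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of true genes that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predictions that are true genes: TP / (TP + FP).

    Assumed gene-finding convention; other fields call this precision.
    """
    return tp / (tp + fp)


# Hypothetical case matching "miss almost 50 %, find almost 50 % junk":
# 100 true genes, 100 predictions, 50 of them correct.
tp, fn, fp = 50, 50, 50
print(sensitivity(tp, fn), specificity(tp, fp))
```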
Species specific models
• Must find training data for new species. Hard!
• Want to use info from close species. May be surprisingly bad.
• Most interesting development: Gene finding in two species simultaneously, in orthologous regions.
Alternative transcripts
• How are genes spliced?
• Example: DSCAM in fly, 38 016 possible transcripts
(Florea, 2006)
Alternative transcripts
• Hard to do ab initio? Augustus is one initiative
• Use cDNA, EST, other experiments.
• Example: Sequence/hybridize mRNA from possible exon/exon boundaries.
Initiatives in gene finding
ENCODE: Find what parts of the human genome are transcribed
HAVANA: "Human And Vertebrate Analysis aNd Annotation Background"
Both projects are delivering data.
ESTScan
• Gene finding in EST data
• Simplified variant of GenScan: no introns etc.
• Tries finding coding region in EST/contig. With error correction, dangerous
My experience of ESTScan
• Testing on "junk sequences": 20 % claimed as coding.
• How to get a good model for chicken?
  • No good training data
  • Try prepared models for other species
  • Human bad, Arabidopsis better
• Hints off the web: Choose species with similar hexamer distribution
• Combine with ORF length analysis
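One way to act on the hexamer hint is to compare hexamer frequency profiles between species. The sketch below uses cosine similarity, which is my choice of measure and not something the slides or ESTScan prescribe. The sequences are toy examples, and each input is assumed to be at least six bases long.

```python
import math
from collections import Counter


def hexamer_freqs(seq, k=6):
    """Relative frequency of every overlapping k-mer in seq (len(seq) >= k)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(kmer, 0.0) for kmer, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


# Toy sequences standing in for coding sequence from two species.
profile_a = hexamer_freqs("ATGATGATGATG")
profile_b = hexamer_freqs("CCCCCCCCCCCC")
print(cosine_similarity(profile_a, profile_a))
print(cosine_similarity(profile_a, profile_b))
```

A candidate species whose profile scores closest to the target species would then be the one whose prepared model is tried first.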