Gene Finding in Eukaryotes
Jan-Jaap Wesselink
Computational and Structural Biology Group, Centro Nacional de Investigaciones Oncologicas
Madrid, July 2008
Jan-Jaap Wesselink [email protected] Gene Finding in Eukaryotes Madrid, July 2008 1 / 24
Outline
1 Gene Structure
    Eukaryotes
    Find Signals
2 Different Approaches To Gene Finding
    Different Information
    Search by Signal
    PWM
    Search by Content
    Coding Statistics
    HMM
    Homology
    Comparative
3 Accuracy of Predictions
4 References
Eukaryotic Gene Structure
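Schematic Gene Structure
[Figure: schematic of a gene drawn as Exon, Intron, Exon, Intron, Exon. ATG marks the start and TGA the stop of the coding sequence; each intron begins with GT and ends with AG. Legend: UTR, CDS.]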
Gene prediction programs only predict the coding fraction of genes
Signals                  Exons      Regions
Start (ATG)              Single     Exons
Stops (TGA, TAA, TAG)    First      Introns
Donor (GT)               Internal   Intergenic
Acceptor (AG)            Terminal   5’ and 3’ UTRs
Signals Are Difficult To Find (1)
Example
Try reading this sentence:
LOOKITSMUCHEASIERLIKETHIS
Look! It’s much easier like this!
Signals Are Difficult To Find (2)
Example
Genomic DNA sequence
GTTTCAAGTGATCCTCCCGCCTCAGCCTGCCCAGGTGCTGAGATTACATGTATGAGCCACTGCACCTGGAAAGGAGCCAGAAATGTGAAGTGCTAGCTGAAGGATGAGCAGCAGCTAGCCAGGCAAAGGTAGGGTTTGGGGAAGGAAAGTGCACATTCTCTTCCCATCTGTGTTTCAGGGGGCAATGGCGGCTTCCTGTGTTCTACTGCACACTGGGCAGAAGATGCCTCTGATTGGTCTGGGTACCTGGAAGAGTGAGCCTGGTCAGGTGAGGGATGGGGGAAGAAAAAAGAAACCTCTGCTTCTCTCACCTGGCAGGTAAAAGCAGCTGTTAAGTATGCCCTTAGCGTAGGCTACCGCCACATTGATTGTGCTGCTATCTACGGCAATGAGCCTGAGATTGGGGAGGCCCTGAAGGAGGACGTGGGACCAGGCAAGGTAAGGACTGGGGTTGTAAATAGAGCTGTGGGCCCTGCCCCCTGCACTAG
Only beginning and end of introns are shown
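To make the point concrete, the short signals from the gene-structure table can be counted in a piece of this sequence. The following is only a minimal sketch: the function names `find_all` and `candidate_signals` are made up for this illustration, and the fragment is the first 60 bases of the sequence above. Most of the hits it reports are expected to be chance occurrences rather than real sites.

```python
import re

def find_all(seq, motif):
    """Return 0-based start positions of every (possibly overlapping) motif hit."""
    return [m.start() for m in re.finditer(f"(?={re.escape(motif)})", seq)]

def candidate_signals(seq):
    """List every position where a start, stop, donor or acceptor motif occurs."""
    seq = seq.upper()
    return {
        "start (ATG)": find_all(seq, "ATG"),
        "donor (GT)": find_all(seq, "GT"),
        "acceptor (AG)": find_all(seq, "AG"),
        "stop (TGA/TAA/TAG)": sorted(
            pos for stop in ("TGA", "TAA", "TAG") for pos in find_all(seq, stop)
        ),
    }

# First 60 bases of the genomic sequence shown on the slide.
fragment = "GTTTCAAGTGATCCTCCCGCCTCAGCCTGCCCAGGTGCTGAGATTACATGTATGAGCCAC"
signals = candidate_signals(fragment)
for name, positions in signals.items():
    print(f"{name}: {len(positions)} candidates at {positions}")
```

Every GT is a candidate donor and every AG a candidate acceptor, which is why a bare motif search is not enough and signals have to be scored and combined with other evidence.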
All Signals Predicted by geneid in a Genomic DNA Sequence
All Exons Predicted by geneid in a Genomic DNA Sequence
Different Approaches to Gene Finding
Different types of information can be used:
- Signals: search for signals of transcription, splicing and translation. Typically, these signals are assigned a score, and the highest-scoring signals are combined.
- Content: here, one tries to discriminate protein-coding from non-coding regions, using statistical models of nucleotide frequencies and dependencies in codons.
- Homology: significant sequence similarity of a genomic DNA sequence to a known gene implies that it is likely to share its function. This information may be used in the gene prediction process.
Search by Signal
Signals are usually represented as patterns
Example (patterns)
- Strings: P = GCCACCTAGG
- Consensus sequences: substitutions occur at certain positions
- Regular expressions: describe the set of strings generated by a regular language
- Decision trees
- Position Weight Matrices
Patterns: examples
Example (patterns 2)
Consensus sequence:
p1 = CTAAAAATAA
p2 = TTAAAAATAA
p3 = TTTAAAATAA
p4 = CTATAAATAA
p5 = TTATAAATAA
p6 = CTTAAAATAG
p7 = TTTAAAATAG
P = YTWWAAATAR
Y = pyrimidine (C or T), W = A or T, R = purine (A or G)
Regular expression (Prosite pattern):
P = G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]
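Both pattern types translate directly into ordinary regular expressions. The sketch below is illustrative only: `consensus_to_regex` and `prosite_to_regex` are hypothetical helper names, the IUPAC table covers just the symbols used on this slide, and the Prosite converter handles only the elements that appear in the pattern above (literal residues, bracketed sets, `x` wildcards and `(n)` repeats).

```python
import re

# Only the IUPAC symbols used on this slide; the full code has more.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "[CT]", "R": "[AG]", "W": "[AT]"}

def consensus_to_regex(consensus):
    """Turn an IUPAC consensus such as YTWWAAATAR into a regular expression."""
    return "".join(IUPAC[symbol] for symbol in consensus)

def prosite_to_regex(pattern):
    """Convert a simple Prosite pattern into a regular expression."""
    regex = ""
    for element in pattern.split("-"):
        match = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)\))?", element.strip())
        if match is None:
            raise ValueError(f"unsupported Prosite element: {element!r}")
        core, repeat = match.groups()
        regex += "." if core == "x" else core   # x means any residue
        if repeat:
            regex += "{" + repeat + "}"          # x(2) means exactly two residues
    return regex

sites = ["CTAAAAATAA", "TTAAAAATAA", "TTTAAAATAA", "CTATAAATAA",
         "TTATAAATAA", "CTTAAAATAG", "TTTAAAATAG"]
consensus_regex = consensus_to_regex("YTWWAAATAR")
prosite_regex = prosite_to_regex("G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]")

print(consensus_regex)
print(all(re.fullmatch(consensus_regex, s) for s in sites))
print(prosite_regex)
```

A consensus or regular expression either matches or it does not; it cannot express that one base is merely more likely than another at a position, which is the motivation for weight matrices.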
Position Weight Matrices
Construction of a PWM for splice sites (...GT.... or ...AG....):
1 align sequences for true splice sites
2 calculate relative frequencies
3 do the same for false donor sites
4 calculate the log-odds ratio of true/false frequencies
\[ M_{i,j} = \ln\left(\frac{f_{\mathrm{true}}}{f_{\mathrm{false}}}\right) \]
5 the score of a given sequence is the sum of the matrix coefficients
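The five steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the procedure used by any particular program: the aligned "true" and "false" sites are invented, the function names are hypothetical, and a pseudocount (not mentioned on the slide) is added so that a base never seen at a position does not lead to a logarithm of zero.

```python
import math

BASES = "ACGT"

def position_frequencies(sites, pseudocount=1.0):
    """Steps 1-3: relative base frequencies per column of aligned, equal-length sites."""
    freqs = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}   # pseudocount is an assumption
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs

def build_pwm(true_sites, false_sites, pseudocount=1.0):
    """Step 4: log-odds of true versus false frequencies for each position and base."""
    f_true = position_frequencies(true_sites, pseudocount)
    f_false = position_frequencies(false_sites, pseudocount)
    return [{b: math.log(ft[b] / ff[b]) for b in BASES}
            for ft, ff in zip(f_true, f_false)]

def score(pwm, seq):
    """Step 5: sum the matrix coefficient selected by each base of seq."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Invented 6-mers around a GT dinucleotide (positions 1-2); not real data.
true_sites = ["GGTAAG", "GGTGAG", "AGTAAG", "GGTAAG"]
false_sites = ["GGTCCC", "AGTTTC", "CGTGCA", "TGTACT"]
pwm = build_pwm(true_sites, false_sites)

print(score(pwm, "GGTAAG"))
print(score(pwm, "CGTCCC"))
```

Because GT occurs in both the true and the false sets, its columns contribute a log-odds of zero; the discrimination comes entirely from the flanking positions.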
Example: Donor Sites
Search by Content
- Coding statistics
- Hidden Markov Models
Coding Statistics
- there are 64 codons for 20 amino acids (plus stop signals)
- codon usage in exons differs from the triplet composition of introns
- compute codon usage from coding DNA → frequencies
- for a sequence S, the more codons in S that occur frequently in coding sequences, the higher the probability that S codes for a protein
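Coding Statistics
The probability of a sequence S, read as codons C1, C2, ..., Cn, being protein coding:
\[ p(S) = \frac{p(C_1) \cdot p(C_2) \cdots p(C_n)}{\frac{1}{64} \cdot \frac{1}{64} \cdots \frac{1}{64}} \]
\[ p(S) \approx \frac{f(C_1) \cdot f(C_2) \cdots f(C_n)}{\frac{1}{64} \cdot \frac{1}{64} \cdots \frac{1}{64}} \]
Here f(Ci) is the frequency of codon Ci observed in known coding DNA. Strictly, p(S) is a likelihood ratio against a model in which all 64 codons are equally likely, so values above 1 favor coding.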
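As a rough sketch of how this ratio might be computed, the code below estimates codon frequencies from a set of coding sequences and evaluates the ratio in log space, which avoids numerical underflow for long sequences. The training sequences are invented, the function names are hypothetical, and the pseudocount is an added assumption so that unseen codons do not get a frequency of zero.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def codons_of(seq):
    """Split seq into consecutive codons in frame 0, dropping any incomplete tail."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate f(C) for all 64 codons from known coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        counts.update(codons_of(seq))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}

def log_coding_ratio(seq, freqs):
    """ln of [f(C1) ... f(Cn)] / [(1/64) ... (1/64)]; positive values favor coding."""
    return sum(math.log(freqs[c] * 64) for c in codons_of(seq))

# Invented training data; a real table would come from annotated genes.
training = ["ATGGCCGCCGAGGAG", "ATGGAGGCCGCCGAG"]
freqs = codon_frequencies(training)

print(log_coding_ratio("GCCGAGGCC", freqs))
print(log_coding_ratio("TTTAAATTT", freqs))
```

In practice the score would be computed for each of the reading frames, since the codon boundaries of an unannotated sequence are not known in advance.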
Hidden Markov Models
- can have HMMs for entire genes
- can have HMMs for coding/non-coding regions
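The slides do not specify a model, so the following is purely an illustrative sketch of the second case: a two-state HMM (coding, non-coding) decoded with the Viterbi algorithm. All probabilities are invented for the example (the coding state is arbitrarily made GC-rich), and real gene finders use far richer state structures.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters, chosen only to make the example readable.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}

def viterbi(seq):
    """Return the most probable state path for a non-empty DNA sequence."""
    # Log-probability of the best path ending in each state after the first base.
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for base in seq[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][base]))
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

path = viterbi("ATATATATATGCGCGCGCGC")
print("".join("C" if s == "coding" else "n" for s in path))
```

The hidden states are the labels one wants to recover, and the observed bases are the emissions; decoding assigns one label to every position of the sequence.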
Search by Homology
Various Uses of Homology in Gene Prediction
- one can compare a genomic DNA sequence with a database of ESTs (using e.g. blastn)
- genomic DNA sequences can be compared to a database of protein sequences (using blastx) to identify coding regions
- comparison of predicted peptides with a protein sequence database can be used to assign putative functions
- the genome of one species can be compared to the genome of another, closely related species: conserved regions often correspond to conserved functions (e.g. exons, parts of promoters)
Comparative Gene Prediction
Comparing the human FOS gene with:
(a) Mouse (b) Chicken (c) Pufferfish
using tblastx
Gene Prediction Accuracy
- measured in annotated sequences
- can measure at nucleotide, exon and gene level
\[ Sn = \frac{TP}{TP + FN} \qquad Sp = \frac{TP}{TP + FP} \]
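At the nucleotide level these two measures can be computed by comparing the set of annotated coding positions with the set of predicted ones. The sketch below assumes exons are given as 0-based, end-exclusive (start, end) intervals; the function names and the example coordinates are invented for illustration.

```python
def coding_positions(exons):
    """Expand (start, end) exon intervals (0-based, end exclusive) into positions."""
    return {pos for start, end in exons for pos in range(start, end)}

def nucleotide_accuracy(annotated_exons, predicted_exons):
    """Return (Sn, Sp) at the nucleotide level, using the definitions above."""
    actual = coding_positions(annotated_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(actual & predicted)   # coding and predicted coding
    fn = len(actual - predicted)   # coding but missed
    fp = len(predicted - actual)   # predicted coding but not annotated
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Invented example: the first exon is predicted too short, the second too long.
annotated = [(10, 20), (30, 40)]
predicted = [(12, 20), (30, 45)]
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"Sn = {sn:.3f}, Sp = {sp:.3f}")
```

Here 18 of the 20 annotated coding bases are recovered and 18 of the 23 predicted bases are correct. The same two ratios apply at the exon and gene level, with whole exons or genes counted instead of single nucleotides.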
References
1 http://genome.imim.es/courses/BioinformaticaUPF/
2 http://genome.imim.es/~jjw/oeiras05/
3 Guigo, R. (1999). DNA Composition, Codon Usage and Exon Prediction. In Bishop, M., ed. Genetic Databases, Academic Press.
4 Eddy, S. (2004). What is a Hidden Markov Model? Nature Biotechnology, 22:1315–6.
5 Burset, M. and Guigo, R. (1996). Evaluation of gene structure prediction programs. Genomics, 34:353–67.
6 Brent, M.R. and Guigo, R. (2004). Recent advances in gene structure prediction. Curr. Opin. Struct. Biol., 14:264–72.
7 Guigo, R. and Reese, M.G. (2005). EGASP: collaboration through competition to find human genes. Nature Methods, 2:575–6.