
Gene Finding in Eukaryotes

Jan-Jaap Wesselink
jjwesselink@cnio.es

Computational and Structural Biology Group, Centro Nacional de Investigaciones Oncológicas

Madrid, July 2008


Outline

1 Gene Structure
    Eukaryotes
    Find Signals

2 Different Approaches To Gene Finding
    Different Information
    Search by Signal
    PWM
    Search by Content
    Coding Statistics
    HMM
    Homology
    Comparative

3 Accuracy of Predictions

4 References


Eukaryotic Gene Structure


Schematic Gene Structure

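[Figure: schematic gene with three exons separated by two introns. The coding sequence (CDS) starts at ATG and ends at TGA; each intron begins with GT and ends with AG; untranslated regions (UTRs) are marked alongside the CDS.]

Gene prediction programs only predict the coding fraction of genes.

Signals                  Exons      Regions
Start (ATG)              Single     Exons
Stops (TGA, TAA, TAG)    First      Introns
Donor (GT)               Internal   Intergenic
Acceptor (AG)            Terminal   5' and 3' UTRs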
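
To make the signal column of the table concrete, here is a minimal Python sketch that lists every candidate start, stop, donor and acceptor site on the forward strand of a sequence. The function name `find_signals` and the toy sequence are illustrative assumptions; this is not how geneid or any other program named in these slides is implemented.

```python
import re

# Signal motifs taken from the table above (forward strand only).
SIGNALS = {
    "start": ["ATG"],
    "stop": ["TGA", "TAA", "TAG"],
    "donor": ["GT"],
    "acceptor": ["AG"],
}

def find_signals(seq):
    """Return {signal_name: sorted list of 0-based start positions}."""
    seq = seq.upper()
    hits = {}
    for name, motifs in SIGNALS.items():
        positions = []
        for motif in motifs:
            # A lookahead reports overlapping occurrences as well.
            positions += [m.start() for m in re.finditer(f"(?={motif})", seq)]
        hits[name] = sorted(positions)
    return hits

if __name__ == "__main__":
    toy = "CCATGGTAAGTCTAGGATGA"  # hypothetical 20-base sequence
    for name, positions in find_signals(toy).items():
        print(name, positions)
```

Even this 20-base toy sequence yields two or three candidates of every type, which hints at why exact motif matches alone produce many false sites.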


Signals are difficult to find (1)

Example: try reading this sentence:

LOOKITSMUCHEASIERLIKETHIS


Look! It’s much easier like this!


Signals Are Difficult To Find (2)

Example: genomic DNA sequence

GTTTCAAGTGATCCTCCCGCCTCAGCCTGCCCAGGTGCTGAGATTACATGTATGAGCCACTGCACCTGGAAAGGAGCCAGAAATGTGAAGTGCTAGCTGAAGGATGAGCAGCAGCTAGCCAGGCAAAGGTAGGGTTTGGGGAAGGAAAGTGCACATTCTCTTCCCATCTGTGTTTCAGGGGGCAATGGCGGCTTCCTGTGTTCTACTGCACACTGGGCAGAAGATGCCTCTGATTGGTCTGGGTACCTGGAAGAGTGAGCCTGGTCAGGTGAGGGATGGGGGAAGAAAAAAGAAACCTCTGCTTCTCTCACCTGGCAGGTAAAAGCAGCTGTTAAGTATGCCCTTAGCGTAGGCTACCGCCACATTGATTGTGCTGCTATCTACGGCAATGAGCCTGAGATTGGGGAGGCCCTGAAGGAGGACGTGGGACCAGGCAAGGTAAGGACTGGGGTTGTAAATAGAGCTGTGGGCCCTGCCCCCTGCACTAG

Only beginning and end of introns are shown


All Signals Predicted by geneid in a Genomic DNA Sequence


All Exons Predicted by geneid in a Genomic DNA Sequence


Different Approaches to Gene Finding

Different types of information can be used:

- Signals: search for signals of transcription, splicing, translation. Typically, these signals are assigned a score, and the highest-scoring signals are combined.
- Content: here, one tries to discriminate the protein-coding from the non-coding regions. Statistical models of nucleotide frequencies and dependencies in codons are used here.
- Homology: significant sequence similarity of a genomic DNA sequence to a known gene implies that it is likely to share its function. This information may be used in the gene prediction process.


Search by Signal

Signals are usually represented as patterns

Example (patterns):

- Strings: P = GCCACCTAGG
- Consensus sequences: substitutions occur at certain positions
- Regular expressions: describe the set of strings generated by a regular language
- Decision trees
- Position Weight Matrices


Patterns: examples

Example (patterns 2)

Consensus sequence:

p1 = CTAAAAATAA
p2 = TTAAAAATAA
p3 = TTTAAAATAA
p4 = CTATAAATAA
p5 = TTATAAATAA
p6 = CTTAAAATAG
p7 = TTTAAAATAG

P  = YTWWAAATAR

Y = pyrimidine (C or T), W = A or T, R = purine (A or G)

Regular expression (Prosite pattern):

P = G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]
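
Both pattern types above can be turned into ordinary regular expressions. The following Python sketch is one way to do that, under stated assumptions: the helper names `consensus_to_regex` and `prosite_to_regex` are hypothetical, only the three ambiguity codes used on this slide are handled, and the Prosite converter covers only the elements that appear in the pattern above (`x`, `x(n)` and bracketed sets).

```python
import re

# Only the IUPAC codes that occur in the consensus above.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "[CT]", "W": "[AT]", "R": "[AG]"}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regex."""
    return re.compile("".join(IUPAC[c] for c in consensus))

def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern (x, x(n) and [..] elements only)."""
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")                 # x = any residue
        element = re.sub(r"\((\d+)\)", r"{\1}", element)    # x(2) -> .{2}
        parts.append(element)
    return re.compile("".join(parts))

if __name__ == "__main__":
    dna = consensus_to_regex("YTWWAAATAR")
    print(dna.pattern)
    print(bool(dna.fullmatch("CTATAAATAA")))  # p4 from the list above
    print(prosite_to_regex("G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]").pattern)
```

Note that the consensus accepts more strings than the seven it was built from (2 x 2 x 2 x 2 = 16 in total), which is the usual price of summarising an alignment this way.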


Position Weight Matrices

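Construction of a PWM for splice sites (...GT... or ...AG...):

1 align sequences for true splice sites
2 calculate the relative frequency of each nucleotide at each position
3 do the same for false sites (sequences that contain GT or AG but are not real splice sites)
4 calculate the log odds ratio of the true and false frequencies:

$$M_{i,j} = \ln\left(\frac{f_{\mathrm{true}}}{f_{\mathrm{false}}}\right)$$

5 the score of a given sequence is the sum of the matrix coefficients selected by its nucleotides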
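
The five steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the procedure of any particular gene finder: the function names are hypothetical, the sites are assumed to be pre-aligned strings of equal length over A, C, G, T, and a pseudocount (not mentioned on the slide) is added so that a nucleotide never seen at a position does not produce a logarithm of zero.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def position_frequencies(sites, pseudocount=1.0):
    """Steps 1-2: relative nucleotide frequencies per alignment column."""
    total = len(sites) + pseudocount * len(ALPHABET)
    freqs = []
    for j in range(len(sites[0])):
        counts = Counter(site[j] for site in sites)
        freqs.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return freqs

def build_pwm(true_sites, false_sites, pseudocount=1.0):
    """Steps 3-4: log odds of true versus false frequencies."""
    f_true = position_frequencies(true_sites, pseudocount)
    f_false = position_frequencies(false_sites, pseudocount)
    return [{b: math.log(ft[b] / ff[b]) for b in ALPHABET}
            for ft, ff in zip(f_true, f_false)]

def score(pwm, seq):
    """Step 5: sum the coefficient chosen by each nucleotide of seq."""
    return sum(column[base] for column, base in zip(pwm, seq))

if __name__ == "__main__":
    # Hypothetical toy data: every site has GT at positions 2-3.
    true_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
    false_sites = ["TTGTCCAT", "ACGTTTGC", "GAGTCATT", "CTGTGCCA"]
    pwm = build_pwm(true_sites, false_sites)
    print(score(pwm, "AGGTAAGT"), score(pwm, "TTGTCCAT"))
```

Because the invariant GT is equally frequent in true and false sites, its columns contribute a log odds of zero; the discrimination comes entirely from the flanking positions.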


Example: Donor Sites


Search by Content

- Coding statistics
- Hidden Markov Models

Coding Statistics

- there are 64 codons but only 20 standard amino acids (22 if selenocysteine and pyrrolysine are counted), so most amino acids are encoded by more than one codon
- codon (triplet) usage differs between exons and introns
- compute codon usage from coding DNA → frequencies
- for a sequence S, the higher the number of codons in S that occur frequently in coding sequences, the higher the probability that S is coding for a protein


Coding Statistics

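The probability of a sequence S, read as codons C_1 C_2 ... C_n, being protein coding is expressed relative to a random model in which all 64 codons are equally likely:

$$p(S) = \frac{p(C_1) \cdot p(C_2) \cdots p(C_n)}{\frac{1}{64} \cdot \frac{1}{64} \cdots \frac{1}{64}}$$

$$p(S) \approx \frac{f(C_1) \cdot f(C_2) \cdots f(C_n)}{\frac{1}{64} \cdot \frac{1}{64} \cdots \frac{1}{64}}$$

where f(C_i) is the frequency of codon C_i observed in known coding DNA. Strictly speaking this quantity is a likelihood ratio rather than a probability: values above 1 favour the coding model, values below 1 the random model.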
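
A minimal Python sketch of this score follows. It is an assumption-laden illustration: the function names are hypothetical, sequences are assumed to contain only A, C, G and T, a pseudocount (not on the slide) keeps unseen codons from getting frequency zero, and the ratio is computed as a sum of logarithms because the raw product quickly underflows for long sequences.

```python
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

def codons(seq, frame=0):
    """Split seq into complete codons, starting at the given frame offset."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate f(C) for all 64 codons from known coding sequences."""
    counts = Counter(c for s in coding_seqs for c in codons(s))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def log_likelihood_ratio(seq, freqs, frame=0):
    """ln p(S): sum over codons of ln(f(C_i) / (1/64))."""
    return sum(math.log(freqs[c] * 64) for c in codons(seq, frame))

if __name__ == "__main__":
    # Hypothetical training set; real codon tables use many genes.
    training = ["ATGGCTGCTGCTTAA", "ATGGCTGAAGCTTGA"]
    freqs = codon_frequencies(training)
    print(log_likelihood_ratio("GCTGCTGCT", freqs))
    print(log_likelihood_ratio("CCCCCCCCC", freqs))
```

A positive value corresponds to p(S) > 1 in the formula above. Scoring all three frames of a window and comparing them is a common way to use such a statistic, though the slide does not specify that step.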


Hidden Markov Models

- can have HMMs for entire genes
- can have HMMs for coding/non-coding regions
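
As a rough illustration of the second bullet, the sketch below decodes a sequence with a two-state coding/non-coding HMM using the Viterbi algorithm. Every number in it is a hypothetical parameter chosen for the example, the function name is an assumption, and the input is assumed to be a non-empty string over A, C, G, T. Real gene-finding HMMs use many more states (frames, splice sites, exon types) and trained probabilities.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters, for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    # best[s] = log-probability of the best path ending in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][base]))
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

if __name__ == "__main__":
    print(viterbi("ATATATATATGCGCGCGCGC"))
```

With these toy parameters the GC-rich half of the example sequence is labelled coding and the AT-rich half non-coding, because ten better-fitting emissions outweigh the cost of one state switch.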


Search by Homology

Various Uses of Homology in Gene Prediction

- one can compare a genomic DNA sequence with a database of ESTs (using e.g. blastn)
- genomic DNA sequences can be compared to a database of protein sequences (using blastx) to identify coding regions
- comparison of predicted peptides with a protein sequence database can be used to assign putative functions
- the genome of one species can be compared to the genome of another, closely related species: conserved regions often correspond to conserved functions (e.g. exons, parts of promoters)


Comparative Gene Prediction

Comparing the human FOS gene with (a) mouse, (b) chicken and (c) pufferfish, using tblastx.


Gene Prediction Accuracy

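- measured in annotated sequences
- can measure at nucleotide, exon and gene level

$$Sn = \frac{TP}{TP + FN} \qquad Sp = \frac{TP}{TP + FP}$$

Here TP (true positives) are annotated features that were predicted, FN (false negatives) are annotated features that were missed, and FP (false positives) are predictions with no matching annotation. Note that Sp as defined here is the quantity called precision in other fields.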
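
At the nucleotide level both measures reduce to set arithmetic on coding positions. The Python sketch below assumes predictions and annotations are given as collections of coding coordinates; the function name and the example coordinates are hypothetical.

```python
def sensitivity_specificity(predicted, annotated):
    """Nucleotide-level Sn and Sp from collections of coding positions."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # annotated coding, predicted coding
    fn = len(annotated - predicted)   # annotated coding, missed
    fp = len(predicted - annotated)   # predicted coding, not annotated
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp

if __name__ == "__main__":
    annotated = range(100, 200)   # hypothetical annotated exon
    predicted = range(150, 250)   # hypothetical predicted exon
    print(sensitivity_specificity(predicted, annotated))
```

The same exon pair would count as a complete miss at the exon level, where a prediction is usually only correct if both boundaries match, so the three levels can tell quite different stories about one program.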


References

1 http://genome.imim.es/courses/BioinformaticaUPF/

2 http://genome.imim.es/~jjw/oeiras05/

3 Guigó, R. (1999). DNA Composition, Codon Usage and Exon Prediction. In Bishop, M., ed. Genetic Databases, Academic Press.

4 Eddy, S. (2004). What is a Hidden Markov Model? Nature Biotechnology, 22:1315–6.

5 Burset, M. and Guigó, R. (1996). Evaluation of gene structure prediction programs. Genomics, 34:353–67.

6 Brent, M.R. and Guigó, R. (2004). Recent advances in gene structure prediction. Curr. Opin. Struct. Biol., 14:264–72.

7 Guigó, R. and Reese, M.G. (2005). EGASP: collaboration through competition to find human genes. Nature Methods, 2:575–6.
