computational approaches to gene findingmolsim.sci.univr.it/2013_bioinfo2/genefinding_ale.pdf ·...
TRANSCRIPT
![Page 1: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/1.jpg)
Gene Prediction
Alejandro Giorgetti
![Page 2: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/2.jpg)
Annotation
• Defining the bological role of a molecule in all itscomplexity and mapping this knowledge onto the relevant gene products encoded by genomes.
• Computational gene prediction is the cornerstoneupon which a genome annotation is built
![Page 3: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/3.jpg)
1. Basic Information
• What types of predictions can we make?
• What are they based on?
• Each gene prediction is a hypothesis waiting to betested and the results of testing then inform our nextset of hypothesis
![Page 4: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/4.jpg)
Bioinformatics asExtrapolation
• Computational gene finding is a process of:
– Identifying common phenomena in known genes
– Building a computational framework/model that can accurately describe the common phenomena
– Using the model to scan uncharacterized sequence toidentify regions that match the model, which becomeputative genes
– Test and validate the predictions
![Page 5: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/5.jpg)
Different Types of Gene Finding
• RNA genes– tRNA, rRNA, snRNA, snoRNA, microRNA
• Protein coding genes– Prokaryotic
• No introns, simpler regulatory features– Eukaryotic
• Exon-intron structure• Complex regulatory features
![Page 6: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/6.jpg)
Approaches to Gene Finding
• Direct– Exact or near-exact matches of EST, cDNA, or Proteins
from the same, or closely related organism
• Indirect1. Look for something that looks like one gene (homology)
1. Look for something that looks like all genes (ab initio)
1. Hybrid, combining homology and ab initio (and perhaps even direct) methods
![Page 7: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/7.jpg)
Prokaryotes• Identify long Open Reading Frames (ORF), likely to
code for proteins.– Identification of start codon ATG that maximizes the length
of the ORF.– End codons (UAA,UAG, UGA)– Pribnow box (TATAAT), the -35 sequence or ribosomal
binding sites: transcriptional and translational start sites.– Codon bias (codon usage).
• GLIMMER, GeneMark. 90% for both sensitivity and specificity.
![Page 8: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/8.jpg)
Pieces of a (Eukaryotic) Gene
(on the genome)
5’
3’
3’
5’
~ 1-100 Mbp
5’
3’
3’
5’……
……
~ 1-1000 kbp
exons (cds & utr) / introns(~ 102-103 bp) (~ 102-105 bp)
Polyadenylationsite
promoter (~103 bp)
enhancers (~101-102 bp) other regulatory sequences(~ 101-102 bp)
![Page 9: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/9.jpg)
What is it about genes thatwe can measure (and
model)?• Most of our knowledge is biased towards protein-
coding characteristics– ORF (Open Reading Frame): a sequence defined by in-
frame AUG and stop codon, which in turn defines a putative amino acid sequence.
– Codon Usage: most frequently measured by CAI (Codon Adaptation Index)
• Other phenomena– Nucleotide frequencies and correlations:
• value and structure– Functional sites:
• splice sites, promoters, UTRs, polyadenylation sites
![Page 10: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/10.jpg)
A simple measure: ORF length Comparison of Annotation and Spurious ORFs in S. cerevisiae
Basrai MA, Hieter P, and Boeke J Genome Research 1997 7:768-771
![Page 11: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/11.jpg)
Codon Adaptation Index(CAI)
CAI= ∏i= codons
f codon i
f �codoni�max
• Parameters are empirically determined by examininga “large” set of example genes
• This is not perfect– Genes sometimes have unusual codons for a reason– The predictive power is dependent on length of sequence
![Page 12: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/12.jpg)
General Things to Rememberabout (Protein-coding) Gene
Prediction Software• It is, in general, organism-specific
• It works best on genes that are reasonably similar tosomething seen previously
• It finds protein coding regions far better than non-codingregions
• In the absence of external (direct) information, alternative forms will not be identified
• It is imperfect! (It’s biology, after all…)
![Page 13: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/13.jpg)
2. Classes of Information
• Extrinsic information: cDNA, EST, protein products. Not itself a genome sequence.
• Intrinsic information: De novo gene predictors usingthe sequences of one or more genomes as their onlyinput. Ab initio: do not use informant genomes.
![Page 14: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/14.jpg)
Extrinsic Information
• BLAST family, FASTA, etc.– Pros: fast, statistically well founded– Cons: no understanding/model of gene structure
• BLAT, Sim4, EST_GENOME, etc.– Pros: gene structure is incorporated– Cons: non-canonical splicing, slower than blast
![Page 15: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/15.jpg)
Intrinsic Information
• The most ab initio would be a program that simulates the transcription, splicing and postprocessing of a transcript using only the information present in the cell.
• Rudimentary understanding: we must rely on metrics derived from known genes and features.– Signal sensors– Content sensors
![Page 16: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/16.jpg)
Intrinsic Information: Signals• Nucleic acid motifs that are recognized by the cellular
machinery.• Start and stop codons.• Donor and acceptor splice sites• Splicing enhancers• Silencer elements• Transcription start and termination sites• Polyadenylation signals• Proximal and distal promoter sequences
• Signals modeled as Position Weight Matrices (PWM)• Capture the intrinsic variability characteristics.• Derived from a set of aligned sequences which are funtinally
related.• Sliding windows.
![Page 17: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/17.jpg)
Intrinsic Information: Content
• Proper detection the start and end codons is still a challenge. Coding bias in sequence composition.
• Uneven usage of aa in real proteins.• Uneven usage of synonymous codons• Coding statistcs as functions that compute a real number
related to the likelihood that a given DNA sequence codes for protein.
• Codon or di-codon usage• Periodicity in base occurrence (GRAIL).
![Page 18: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/18.jpg)
Conservation
• When one or more informant genomes are available it is possible to detect the characteristic conservation pattern of coding sequence and use it as an orthogonal measure for coding potential.
• TWINSCAN• GeneID• SGP2• CONTRAST• Genscan• Fgenesh
![Page 19: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/19.jpg)
2. Some MathematicalConcepts and Definitions
• Models
• Bayesian Statistics
• Markov Models & Hidden Markov Models
![Page 20: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/20.jpg)
Models in Computational(Molecular) Biology
• In gene finding, models can best be thought of as “sequence generators” (e.g., Hidden Markov Models) or “sequence classifiers” (e.g., Neural Networks)
• The better (and usually more complex) a model is, the better the performance is likely to be
![Page 21: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/21.jpg)
Models of Sequence Generation:Markov Chains
• Different types of structure components (exons or introns) are characterized by a state
• The gene model is thought to be generated by a state machine: starting form 5’ to 3’,each base is generated by a emission probability conditioned on the current state, and the transition from one state to an other: transition probability (an intron can only follow an exon, reading frames of to adjacent exonsshould be compatible, etc.)
• All the parameters of the emission probabilities and transition probabilities are learned.
![Page 22: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/22.jpg)
Basic HMM
Described bya set of possible states: start, exon, donor, intron, acceptor, stop set of possible observations: A,C,G,Ttransition probability matrixemision probability matrix (frequencies of nucleotidesoccurring in a particular state)initial state probabilities.
![Page 23: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/23.jpg)
Models of SequenceGeneration:
Markov Chains
• A Markov chain is a model for stochastic generation of sequential phenomena
• Every position in a chain is equivalent
• The order of the Markov chain is the number of previous positions on which the current position depends– e.g., in nucleic acid sequence, 0-order is mononucleotide, 1st-order
is dinucleotide, 2nd-order is trinucleotide, etc.
• The model parameters are the frequencies of the elements at each position (possibly as a function of preceding elements)
![Page 24: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/24.jpg)
Markov Chains as Models ofSequence Generation
• 0th-order
• 1st-order
• 2nd-order
( ) ( ) ( ) ( ) ( ) ( )∏=
−⋅=⋅⋅=N
iii sssssssssP
211231211 |pp|p|pp L
( ) ( ) ( ) ( ) ( ) ( )∏=
−−⋅=⋅⋅=N
iiii ssssssssssssssP
31221324213212 |pp|p|pp L
( ) ( ) ( ) ( ) ( )∏=
=⋅⋅=N
iisssssP
13210 pppp L
L4321 sssss =
( ) ( ) ( ) ( ) ( ) ( ) ( )∏=
−⋅=⋅⋅⋅=N
iii sssactatttsP
2111 |pp|p|p|pp L
( ) ( ) ( ) ( ) ( ) ( ) ( )∏=
−−⋅=⋅⋅⋅=N
iiii sssssacgtacttattsP
312212 |pp|p|p|pp L
( ) ( ) ( ) ( ) ( ) ( ) ( )∏=
=⋅⋅⋅⋅=N
iisgcattsP
10 pppppp L
ttacggts =
![Page 25: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/25.jpg)
Hidden Markov Models
In general, sequences are not monolithic, but can be made up of discrete segments
Hidden Markov Models (HMMs) allow us to model complex sequences, in which the character emissionprobabilities depend upon the state
Think of an HMM as a probabilistic or stochastic sequence generator, and what is hidden is the current state of the model
![Page 26: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/26.jpg)
HMM DetailsAn HMM is completely defined by its:
State-to-state transition matrix (Φ)Emission matrix (H)State vector (x)
We want to determine the probability of any specific (query) sequence having been generated by the model; with multiple models, we then use Bayes’ rule to determine the best model for the sequence
Two algorithms are typically used for the likelihood calculation:ViterbiForward
Models are trained with known examples
![Page 27: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/27.jpg)
The HMM Matrixes: Φ and H
⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢
⎣
⎡
=Φ
0002.0001.000996.0001.05.00002.0998.05.00000
⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢
⎣
⎡
=
32.018.018.032.0
25.025.022.028.0
H
xm(i) = probability of being in state m at position i;
H(m,yi) = probability of emitting character yi in state m;
Φmk = probability of transition from state k to m.
![Page 28: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/28.jpg)
A more realistic (and complex) HMM model for Gene Prediction (Genie)
Kulp, D., PhD Thesis, UCSC 2003
![Page 29: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/29.jpg)
Scoring an HMM: Viterbi, Forward, and Forward-
BackwardTwo algorithms are typically used for the likelihood
calculation: Viterbi and ForwardViterbi is an approximation;
The probability of the sequence is determined by using the most likely mapping of the sequence to the model
in many cases good enough (gene finding, e.g.), but not always
Forward is the rigorous calculation;The probability of the sequence is determined by summing over
all mappings of the sequence to the model
Forward-Backward produces a probabilistic map of the model to the sequence
![Page 30: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/30.jpg)
Eukaryotic Gene Prediction: GRAIL II: Neural network based
prediction
(Uberbacher and Mural 1991; Uberbacher et al. 1996)
![Page 31: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/31.jpg)
Open Challenges in Predicting Eukaryotic
(Protein-Coding) Genes• Alternative Processing of Transcripts
– Splice variants, Start/stop variants
• Overlapping Genes– Mostly UTRs or intronic, but coding is possible
• Non-canonical functional elements– Splice w/o GT-AG,
• UTR predictions– Especially with introns
• Small (mini) exons
![Page 32: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/32.jpg)
Open Challenges in Predicting Prokaryotic
(Protein-Coding) Genes• Start site prediction
– Most algorithms are greedy, taking the largest ORF
• Overlapping Genes– This can be very problematic, esp. with use of Viterbi-like
algorithms
• Non-canonical coding
![Page 33: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/33.jpg)
Eukaryotic gene predictiontools and web servers
• Genscan (ab initio), GenomeScan (hybrid) – (http://genes.mit.edu/)
• Twinscan (hybrid)– (http://genes.cs.wustl.edu/)
• FGENESH (ab initio)– (http://www.softberry.com/berry.phtml?topic=gfind)
• GeneMark.hmm (ab initio)– (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi)
• MZEF (ab initio)– (http://rulai.cshl.org/tools/genefinder/)
• GrailEXP (hybrid)– (http://grail.lsd.ornl.gov/grailexp/)
• GeneID (hybrid)– (http://www1.imim.es/geneid.html)
![Page 34: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/34.jpg)
Prokaryotic Gene Prediction
• Glimmer– http://www.tigr.org/~salzberg/glimmer.html
• GeneMark– http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.
cgi
• Critica– http://www.ttaxus.com/index.php?pagename=Software
• ORNL Annotation Pipeline– http://compbio.ornl.gov/GP3/pro.shtml
![Page 35: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/35.jpg)
Non-protein Coding Gene Tools and Information
• tRNA– tRNA-ScanSE
• http://www.genetics.wustl.edu/eddy/tRNAscan-SE/– FAStRNA
• http://bioweb.pasteur.fr/seqanal/interfaces/fastrna.html
• snoRNA– snoRNA database
• http://rna.wustl.edu/snoRNAdb/
• microRNA– Sfold
• http://www.bioinfo.rpi.edu/applications/sfold/index.pl– SIRNA
• http://bioweb.pasteur.fr/seqanal/interfaces/sirna.html
![Page 36: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/36.jpg)
Assessing performance:Sensitivity and Specificity
Testing of predictions is performed on sequences where the gene structure is known
Sensitivity is the fraction of known genes (or bases or exons) correctly predicted“Am I finding the things that I’m supposed to find”
Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes“What fraction of my predictions are true?”
In general, increasing one decreases the other
![Page 37: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/37.jpg)
Graphic View of Specificity and Sensitivity
iveFalseNegatveTruePositiveTruePositi
AllTrueveTruePositiSn
+==
iveFalsePositveTruePositiveTruePositi
eAllPositivveTruePositiSp
+==
![Page 38: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/38.jpg)
Quantifying the tradeoff:Correlation Coefficient
( )( ) ( )( )[ ]( )( )( )( )
FNTNPNFPTPPPFNTPAPFPTNAN
PNAPPPANFNFPTNTPCC
+=+=+=+=
−=
;;;
![Page 39: Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches](https://reader034.vdocuments.mx/reader034/viewer/2022043020/5f3c0545b21ab212dd4d6a1f/html5/thumbnails/39.jpg)
Specificity/Sensitivity Tradeoffs
• Ideal Distribution of Scores
0
200
400
600
800
1000
1200
0 5 10 15 20 25 30 35 40 45 50score (arb units)
coun
t
random sequence true sites
• More Realistically…
0
200
400
600
800
1000
1200
0 10 20 30 40 50
score (arb units)
coun
t
random sequence true sites