10/24/05 promoter prediction rna structure & function prediction
10/24/05 D Dobbs ISU - BCB 444/544X: Promoter Prediction 1
10/24/05
Promoter Prediction
RNA Structure & Function Prediction
Announcements
Seminar (Mon Oct 24) (several additional seminars listed in email sent to class)
12:10 PM IG Faculty Seminar in 101 Ind Ed II
"Laser capture microdissection-facilitated transcriptional profiling of abscission zones in Arabidopsis"
Coralie Lashbrook, EEOB
http://www.bb.iastate.edu/%7Emarit/GEN691.html
Mark your calendars:
1:10 PM Nov 14 Baker Seminar in Howe Hall Auditorium
"Discovering transcription factor binding sites"
Douglas Brutlag, Dept of Biochemistry & Medicine, Stanford University School of Medicine
Announcements
544 Semester Projects
Thanks to all who sent already!
Others: Information needed [email protected]
Briefly describe:
• Your background & current grad research
• Is there a problem related to your research you would like to learn more about & develop as a project for this course? or
• What would your ‘dream’ project be?
Announcements
Exam 2 - this Friday
Posted Online:
• Exam 2 Study Guide
• 544 Reading Assignment (2 papers)
Office Hours:
• David: Mon 1-2 PM in 209 Atanasoff
• Drena: Tues 10-11 AM in 106 MBB
• Michael: none this week
Thurs: No Lab - Extra Office Hrs instead:
• David: 1-3 PM in 209 Atanasoff
• Drena: 1-3 PM in 106 MBB
Announcements
• Updated PPTs & PDFs for Gene Prediction lectures (covered on Exam 2) will be posted today (changes are minor)
• Is everyone on BCB 444/544 mailing list? Auditors?
Promoter Prediction & RNA Structure/Function Prediction
Mon: Quite a few more words re: Gene prediction
     Promoter prediction
Wed: RNA structure & function
     RNA structure prediction: 2° & 3° structure prediction
     miRNA & target prediction
Thurs: No Lab
Fri: Exam 2
Reading Assignment - previous
Mount Bioinformatics
• Chp 9 Gene Prediction & Regulation
• pp 361-401
• Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html
* Brown Genomes 2 (NCBI textbooks online)
• Sect 9 Overview: Assembly of Transcription Initiation Complex
• http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.chapter.7002
• Sect 9.1-9.3 DNA binding proteins, Transcription initiation
• http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.section.7016
* NOTEs: Don’t worry about the details!!
• See Study Guide for Exam 2 re: Sections covered
Optional - but very helpful reading:
1) Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709
http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html
2) Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
03489059922
(that's a hint!)
Check this out: http://www.phylofoot.org/NRG_testcases/
Reading Assignment (for Wed)
Mount Bioinformatics
• Chp 8 Prediction of RNA Secondary Structure
• pp. 327-355
• Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html
Cates (Online) RNA Secondary Structure Prediction Module
• http://cnx.rice.edu/content/m11065/latest/
Review last lecture: Gene Prediction
(formerly Gene Prediction - 3)
• Overview of steps & strategies
• Algorithms
• Gene prediction software
Predicting Genes - Basic steps:
• Obtain genomic DNA sequence
• Translate in all 6 reading frames
• Compare with protein sequence database
• Also perform database similarity search with EST & cDNA databases, if available
• Use gene prediction programs to locate genes
• Analyze gene regulatory sequences
Note: Several important details missing above:
1. Mask to "remove" repetitive elements (ALUs, etc.)
2. Perform database search on translated DNA (BlastX, TFasta)
3. Use several programs to predict genes (GenScan, GeneMark.hmm)
4. Translate putative ORFs and search for functional motifs (Blocks, Motifs, etc.) & regulatory sequences
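The "translate in all 6 reading frames" step can be sketched in a few lines of Python. This is only an illustration, not the code used by any of the programs named above; the function names and the "+1".."-3" frame labels are made up for this example, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frame_translation(dna):
    """Return translations for frames +1..+3 and -1..-3 (labels are hypothetical)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # M, A, stop
```

Each translated frame could then be compared with a protein database, which is what BlastX-style searches do internally.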
Gene prediction flowchart
Fig 5.15 Baxevanis & Ouellette 2005
Overview of gene prediction strategies
What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator
• Processing signals: splice donor/acceptors, polyA signal
• Translation: start (AUG = Met) & stop (UGA, UAA, UAG), ORFs, codon usage
What other types of information can be used?
• cDNAs & ESTs (pairwise alignment)
• homology (sequence comparison, BLAST)
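As a small illustration of the translation signals, the sketch below scans the forward strand for ORFs that begin with ATG and end at the first in-frame stop codon (TAA, TAG, TGA in DNA). The function name and the `min_codons` parameter are hypothetical; real gene finders combine this kind of signal with codon usage and the other evidence listed here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start/end are 0-based, end-exclusive, and include the stop codon.
    min_codons counts the codons before the stop (hypothetical cutoff).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGGTAACC"))
```

Scanning the reverse complement in the same way would cover the other three frames.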
Examples of gene prediction software
1) Similarity-based or Comparative
• BLAST
• SGP2 (extension of GeneID)
2) Ab initio = "from the beginning"
• GeneID - (used in lab last week)
• GENSCAN - (used in lab last week)
• GeneMark.hmm - (should try this!)
3) Combined "evidence-based"
• GeneSeqer (Brendel et al., ISU)
BEST? GENSCAN, GeneMark.hmm, GeneSeqer, but depends on organism & specific task
Annotated lists of gene prediction software
• URLs from Mount Chp 9, available online: Table 9.1 http://www.bioinformaticsonline.org/links/ch_09_t_1.html
• from Pevsner Chps 14 & 16:
  http://www.bioinfbook.org/chapt14.htm - prokaryotic
  http://www.bioinfbook.org/chapt16.htm - eukaryotic
• Table in Zhang Nat Rev Genet article: http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html
• Another list: Kozar, Stanford http://cmgm.stanford.edu/classes/genefind/
Performance Evaluation? Guigó, Barcelona (& sites above)
http://www1.imim.es/courses/SeqAnalysis/GeneIdentification/Evaluation.html
Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes
Methods? Previously, mostly HMM-based
Now: similarity-based methods, because so many genomes are available
Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available,
e.g., GeneMark.hmm
      TIGR Comprehensive Microbial Resource (CMR)
      NCBI Microbial Genomes
see Mount Fig 9.7 (E.coli gene)
UCSC Browser view of 1000 kb region (Human URO-D gene)
Fig 5.10 Baxevanis & Ouellette 2005
Spliced Alignment Algorithm
GeneSeqer - Brendel et al. http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
• Perform pairwise alignment with large gaps in one sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
  • Using a Bayesian model
• Score coding constraints in translated exons
  • Using a Bayesian model
[Figure: intron flanked by its splice sites: donor (GT) and acceptor (AG)]
Brendel 2005
Brendel et al (2004) Bioinformatics 20: 1157
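To make the GT...AG picture concrete, here is a minimal sketch that lists every candidate intron with a GT at its 5' end and an AG at its 3' end. It is not the GeneSeqer algorithm, which scores splice sites and alignments rather than enumerating dinucleotides; the function name and the `min_len`/`max_len` cutoffs are assumptions made for this example.

```python
def candidate_introns(dna, min_len=60, max_len=1000):
    """Return (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and end-exclusive; length limits are
    hypothetical defaults, not values from the lecture.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(dna) - 1) if dna[j:j + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if min_len <= a + 2 - d <= max_len
    ]


toy = "AAGT" + "C" * 10 + "AGAA"
print(candidate_introns(toy, min_len=4))
```

Most spans found this way are false positives, which is why the surrounding sequence context and the cDNA/EST/protein evidence are scored as well.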
Brendel - Spliced Alignment I: Compare with cDNA or EST probes
[Figure: genomic DNA (start codon, stop codon) aligned with mRNA (Cap, 5’-UTR, start codon, stop codon, 3’-UTR, Poly(A))]
Brendel 2005
Brendel - Spliced Alignment II: Compare with protein probes
[Figure: genomic DNA (start codon, stop codon) aligned with protein]
Brendel 2005
Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
• Information content $I_i$:
  $I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2 f_{iB}$
• Extent of splice signal window:
  $\bar{I} + 1.96\,\sigma_{\bar{I}} \le I_i$
  i: ith position in sequence
  $f_{iB}$: frequency of base B at position i
  $\bar{I}$: avg information content over all positions >20 nt from splice site
  $\sigma_{\bar{I}}$: avg sample standard deviation of $\bar{I}$
Brendel 2005
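A rough sketch of the information content calculation, 2 plus the sum over bases of f·log2(f) at each aligned position, with the 1.96-standard-deviation cutoff used to decide which positions stand out from background. The function names are hypothetical, and taking the sample standard deviation over the background positions is an assumption about how the spread is estimated.

```python
import math
from statistics import mean, stdev


def information_content(column):
    """I_i = 2 + sum over B in {U, C, A, G} of f_iB * log2(f_iB)."""
    n = len(column)
    total = 2.0
    for base in "UCAG":
        f = column.count(base) / n
        if f > 0:  # treat 0 * log2(0) as 0
            total += f * math.log2(f)
    return total


def profile(seqs):
    """Information content at every column of equal-length aligned RNAs."""
    return [information_content([s[i] for s in seqs]) for i in range(len(seqs[0]))]


def signal_positions(info, background_idx):
    """Positions whose I_i reaches mean + 1.96 * SD of the background.

    background_idx: positions assumed to be far (>20 nt) from the site.
    """
    bg = [info[i] for i in background_idx]
    threshold = mean(bg) + 1.96 * stdev(bg)
    return [i for i, v in enumerate(info) if v >= threshold]


print(profile(["GUAA", "GUAC", "GUCG", "GUGU"]))
```

A fully conserved column (such as the G and U of the donor site) scores 2 bits, and a column with all four bases equally frequent scores 0.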
Information content vs position
[Figure: information content (y-axis, 0.0 to 0.8) vs position (x-axis, -50 to +50); panels: HumanT2_GT, HumanT2_AG]
Brendel 2005
Which sequences are exons & which are introns? How can you tell?
Brendel et al (2004) Bioinformatics 20: 1157
Bayesian Splice Site Prediction
Let $S = s_{-l}\, s_{-l+1}\, s_{-l+2} \ldots s_{-1}\,\mathrm{GT}\, s_1 s_2 s_3 \ldots s_r$
$P\{H \mid S\} = \dfrac{P\{S \mid H\}\,P\{H\}}{\sum_{H} P\{S \mid H\}\,P\{H\}}$
where H indexes the hypotheses of GT or AG at
- True site in reading phase 1, 2, or 0
- False within-exon site in reading phase 1, 2, or 0
- False within-intron site
$P\{S\} = p\{s_{-l}\} \prod_{i=-l+1}^{r} p\{s_i \mid s_{i-1}\}$, with each probability given by the corresponding frequency f
Brendel 2005
Brendel et al (2004) Bioinformatics 20: 1157
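The Bayes' rule step can be sketched as follows: each hypothesis supplies a first-order Markov model for the sequence window, and the posterior is the likelihood times prior, normalized over all hypotheses. This is a simplified illustration, not Brendel's implementation: it uses one transition table per hypothesis rather than position-specific ones, only two hypotheses ("T" and "F") instead of seven, and all parameter values are invented.

```python
def markov_likelihood(seq, initial, transition):
    """P{S|H} = p(s_first) * product of p(s_i | s_{i-1}) under one hypothesis."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p


def posterior(seq, models, priors):
    """P{H|S} for every hypothesis H, by Bayes' rule."""
    joint = {h: markov_likelihood(seq, *models[h]) * priors[h] for h in models}
    total = sum(joint.values())
    return {h: j / total for h, j in joint.items()}


# Toy, made-up parameters (not from the lecture), for illustration only.
UNIFORM = (
    {b: 0.25 for b in "ACGT"},
    {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"},
)
gt_rich_initial = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
gt_rich_transition = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
gt_rich_transition["G"] = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}
MODELS = {"T": (gt_rich_initial, gt_rich_transition), "F": UNIFORM}
PRIORS = {"T": 0.5, "F": 0.5}

post = posterior("GT", MODELS, PRIORS)
print(post)
```

With realistic priors a true site is far rarer than a false GT, so the prior term matters much more than in this equal-prior toy case.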
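Bayes Factor as Decision Criterion
H0: H = T
$BF = \dfrac{p\{T \mid S\} \,/\, (1 - p\{T \mid S\})}{p\{T\} \,/\, (1 - p\{T\})}$
2-class model: $BF = \dfrac{p\{S \mid T\}}{p\{S \mid F\}}$
7-class model: $BF = \dfrac{\sum_{x=1,2,0} p\{S \mid T_x\}\,p\{T_x\} \,\big/\, \sum_{x=1,2,0} p\{T_x\}}{\sum_{x=1,2,0,i} p\{S \mid F_x\}\,p\{F_x\} \,\big/\, \sum_{x=1,2,0,i} p\{F_x\}}$
Brendel 2005
Brendel et al (2004) Bioinformatics 20: 1157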
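A small numerical sketch of the Bayes factor, written three ways: as posterior odds over prior odds, as the 2-class likelihood ratio, and as the 7-class prior-weighted ratio. Function names, class labels ("T1", "F0", "Fi", ...) and all numbers are made up for illustration.

```python
def bayes_factor_from_posterior(p_t_given_s, p_t):
    """BF = posterior odds of T divided by prior odds of T."""
    return (p_t_given_s / (1 - p_t_given_s)) / (p_t / (1 - p_t))


def bayes_factor_two_class(lik_true, lik_false):
    """2-class model: BF = p{S|T} / p{S|F}."""
    return lik_true / lik_false


def bayes_factor_seven_class(lik, prior):
    """7-class model: prior-weighted mean likelihood of true vs false classes."""
    true_classes = ["T1", "T2", "T0"]
    false_classes = ["F1", "F2", "F0", "Fi"]

    def weighted(classes):
        return sum(lik[c] * prior[c] for c in classes) / sum(prior[c] for c in classes)

    return weighted(true_classes) / weighted(false_classes)


# Invented numbers: likelihoods 0.49 (true) and 0.0625 (false), prior p{T} = 0.2.
p_t = 0.2
posterior_t = 0.49 * p_t / (0.49 * p_t + 0.0625 * (1 - p_t))
print(bayes_factor_from_posterior(posterior_t, p_t))
print(bayes_factor_two_class(0.49, 0.0625))
```

In the 2-class case the two forms agree, which shows that the Bayes factor does not depend on the prior.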
Markov Model for Spliced Alignment
[Figure: state diagram with states e_n, e_n+1, i_n, i_n+1; transition labels: PG, PA(n)PG, (1-PG)PD(n+1), (1-PG)(1-PD(n+1)), 1-PA(n)]
Brendel 2005
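Evaluation of Splice Site Prediction
                   Actual True    Actual False
Predicted True     TP             FP             PP = TP+FP
Predicted False    FN             TN             PN = FN+TN
                   AP = TP+FN     AN = FP+TN
• Sensitivity: $S_n = TP/AP = 1 - \alpha$ (= Coverage)
• Specificity: $S_p = TP/PP = \dfrac{1-\alpha}{1-\alpha+r\beta}$, where $r = AN/AP$
• Misclassification rates: $\alpha = FN/AP$, $\beta = FP/AN$
• Normalized specificity: $\sigma = \dfrac{1-\alpha}{1-\alpha+\beta}$
Brendel 2005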
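The measures on this slide can be checked numerically from a 2x2 table. The sketch assumes Sn = TP/AP and Sp = TP/PP, which is the reading implied by the ratio r = AN/AP; the function name and the counts are invented.

```python
def brendel_measures(tp, fp, fn, tn):
    """Sn, Sp, misclassification rates and normalized specificity (sigma)."""
    ap, an = tp + fn, fp + tn  # actual positives / actual negatives
    pp = tp + fp               # predicted positives
    alpha = fn / ap            # fraction of true sites missed
    beta = fp / an             # fraction of false sites called true
    r = an / ap
    return {
        "Sn": tp / ap,
        "Sp": tp / pp,
        "alpha": alpha,
        "beta": beta,
        "r": r,
        # Same Sp, rewritten in terms of alpha, beta and r.
        "Sp_from_rates": (1 - alpha) / (1 - alpha + r * beta),
        # Sp as it would be if AP == AN (r = 1).
        "sigma": (1 - alpha) / (1 - alpha + beta),
    }


m = brendel_measures(tp=80, fp=40, fn=20, tn=860)
print(m["Sn"], m["Sp"], m["sigma"])
```

Because false sites vastly outnumber true ones (large r), Sp can be low even when beta is small; sigma removes that dependence on class sizes.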
[Figure: Sn and σ (y-axis, 0.00 to 1.00) plotted over an x-axis from -10 to 20; panels: Human GT site, Human AG site, A. thaliana GT site, A. thaliana AG site]
Brendel 2005
Performance?
Note: these are not ROC curves (plots of Sn vs 1-Sp, where Sp = TN/(TN+FP))
• But plots such as these (& ROCs) are much better than using a "single number" to compare different methods
• Both types of plots illustrate the trade-off: Sn vs Sp
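The Sn vs Sp trade-off can be illustrated by sweeping a score threshold over a set of scored candidate sites. This sketch uses the ROC convention Sp = TN/(TN+FP); the function name, scores and labels are invented, and it assumes at least one true and one false site.

```python
def tradeoff_curve(scores, labels):
    """Return (threshold, Sn, Sp) for each distinct score used as a cutoff.

    A site is predicted true when its score >= threshold.
    labels: True for real sites, False otherwise.
    """
    points = []
    for t in sorted(set(scores), reverse=True):
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, y in pairs if s >= t and y)
        fp = sum(1 for s, y in pairs if s >= t and not y)
        fn = sum(1 for s, y in pairs if s < t and y)
        tn = sum(1 for s, y in pairs if s < t and not y)
        points.append((t, tp / (tp + fn), tn / (tn + fp)))
    return points


for t, sn, sp in tradeoff_curve([0.9, 0.8, 0.4, 0.2], [True, False, True, False]):
    print(t, sn, sp)
```

Lowering the threshold raises Sn and lowers Sp; plotting Sn against 1-Sp for all thresholds gives the ROC curve.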
Evaluation of Splice Site Prediction
Fig 5.11 Baxevanis & Ouellette 2005
What do measures really mean?
Sp =
Careful: different definitions for "Specificity"
Brendel definitions
                   Actual True    Actual False
Predicted True     TP             FP             PP = TP+FP
Predicted False    FN             TN             PN = FN+TN
                   AP = TP+FN     AN = FP+TN
• Sensitivity: Sn = TP/AP
• Specificity: Sp = TP/PP
cf. Guigó definitions
Sn: Sensitivity = TP/(TP+FN)
Sp: Specificity = TN/(TN+FP) = Sp-
AC: Approximate Coefficient = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1
Other measures? Predictive Values, Correlation Coefficient
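For comparison, the Guigó-style measures exactly as written on this slide can be computed from the same four counts. The function name is hypothetical, and the sketch assumes every denominator is non-zero.

```python
def guigo_measures(tp, fp, fn, tn):
    """Sn, Sp and AC as written on the slide (Sp here is TN/(TN+FP))."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = 0.5 * (
        tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn)
    ) - 1
    return sn, sp, ac


print(guigo_measures(50, 0, 0, 50))    # perfect prediction
print(guigo_measures(25, 25, 25, 25))  # no better than chance
```

AC runs from -1 to 1, like a correlation coefficient: 1 for a perfect prediction and 0 when all four conditional proportions are 0.5.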
Best measures for comparing different methods?
• ROC curves (Receiver Operating Characteristic?!!)
http://www.anaesthetist.com/mnm/stats/roc/
"The Magnificent ROC" - has fun applets & quotes:
"There is no statistical test, however intuitive and simple, which will not be abused by medical researchers"
• Correlation Coefficient (Matthews correlation coefficient, MCC)
  MCC =  1 for a perfect prediction
         0 for a completely random assignment
        -1 for a "perfectly incorrect" prediction
  $\mathrm{MCC} = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
  Do not memorize this!
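No need to memorize the formula when it is a few lines of code. A minimal sketch of the standard Matthews correlation coefficient, MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); returning 0 when the denominator is zero is a common convention, assumed here.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


print(mcc(50, 0, 0, 50))    # perfect prediction
print(mcc(25, 25, 25, 25))  # random assignment
print(mcc(0, 50, 50, 0))    # "perfectly incorrect" prediction
```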
Performance of GeneSeqer vs other methods?
• Comparison with ab initio gene prediction
(e.g., GENSCAN)
• Depends on:
  • Availability of ESTs
  • Availability of protein homologs
Brendel 2005
Other Performance Evaluations? Guigó
http://www1.imim.es/courses/SeqAnalysis/GeneIdentification/Evaluation.html
GeneSeqer vs GENSCAN (Exon prediction)
[Figure: Exon (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100); curves: GeneSeqer, NAP, GENSCAN]
Brendel 2005
GENSCAN - Burge, MIT
GeneSeqer vs GENSCAN (Intron prediction)
[Figure: Intron (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100); curves: GeneSeqer, NAP, GENSCAN]
Brendel 2005
GENSCAN - Burge, MIT
Other Resources
Current Protocols in Bioinformatics
http://www.4ulr.com/products/currentprotocols/bioinformatics.html
Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences
New Today: Promoter Prediction
• A few more words about Gene prediction
• Predicting regulatory regions (focus on promoters)
  Brief review promoters & enhancers
  Predicting in eukaryotes vs prokaryotes
• Introduction to RNA Structure & function
Predicting Promoters
What signals are there?
Algorithms
Promoter prediction software
What signals are there? Simple ones in prokaryotes
BIOS Scientific Publishers Ltd, 1999
Brown Fig 9.17
Prokaryotic promoters
• RNA polymerase complex recognizes promoter sequences located very close to & on 5’ side (“upstream”) of initiation site
• RNA polymerase complex binds directly to these, with no requirement for “transcription factors”
• Prokaryotic promoter sequences are highly conserved
  • -10 region
  • -35 region
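Because the -35 and -10 regions are so well conserved, a naive promoter scan is easy to sketch. The consensus hexamers TTGACA (-35) and TATAAT (-10) and the 16-18 bp spacer are the textbook E. coli sigma-70 values, which are not stated on the slide; the function names and the per-box mismatch limit are assumptions for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(dna, box35="TTGACA", box10="TATAAT",
                   spacers=range(16, 19), max_mismatch=2):
    """Return (pos35, pos10, spacer, total_mismatches) for candidate promoters.

    max_mismatch is applied to each box separately (hypothetical cutoff).
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        m35 = hamming(dna[i:i + 6], box35)
        if m35 > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap  # start of the -10 box
            if j + 6 > len(dna):
                break
            m10 = hamming(dna[j:j + 6], box10)
            if m10 <= max_mismatch:
                hits.append((i, j, gap, m35 + m10))
    return hits


toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GGGG"
print(find_promoters(toy))
```

Practical prokaryotic promoter predictors replace the mismatch count with position weight matrices for each box and a spacer-length score.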
What signals are there? Complex ones in eukaryotes!
Fig 9.13 Mount 2004
Simpler view of complex promoters in eukaryotes:
Fig 5.12 Baxevanis & Ouellette 2005
Eukaryotic genes are transcribed by 3 different RNA polymerases
BIOS Scientific Publishers Ltd, 1999
Brown Fig 9.18
Recognize different types of promoters & enhancers:
Eukaryotic promoters & enhancers
• Promoters located “relatively” close to initiation site
(but can be located within gene, rather than upstream!)
• Enhancers also required for regulated transcription (these control expression in specific cell types, developmental stages, in response to environment)
• RNA polymerase complexes do not specifically recognize promoter sequences directly
• Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes
Eukaryotic transcription factors
• Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription
• TFs contain characteristic “DNA binding motifs”
  http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039
• TFs recognize specific short DNA sequence motifs: “transcription factor binding sites”
• Several databases for these, e.g. TRANSFAC
  http://www.generegulation.com/cgibin/pub/databases/transfac
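Binding-site collections such as those in TRANSFAC are usually turned into position weight matrices (PWMs) and scanned along a sequence. The sketch below builds a log-odds PWM from a handful of aligned sites and reports windows scoring above a threshold. The example sites, pseudocount, uniform background and threshold are all invented, and the input is assumed to contain only A, C, G and T.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm


def scan(dna, pwm, threshold):
    """Return (position, score) for every window scoring >= threshold."""
    dna = dna.upper()
    w = len(pwm)
    hits = []
    for i in range(len(dna) - w + 1):
        score = sum(pwm[k][dna[i + k]] for k in range(w))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits


# Made-up TATA-like training sites, for illustration only.
pwm = build_pwm(["TATAAA", "TATAAA", "TATATA", "TATAAA"])
print(scan("GGGTATAAAGGG", pwm, threshold=5.0))
```

Short motifs like this match by chance very often in a genome, which is one reason promoter prediction also relies on conservation between species and on clusters of sites.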
Zinc finger-containing transcription factors
• Common in eukaryotic proteins
• Estimated 1% of mammalian genes encode zinc-finger proteins
• In C. elegans, there are 500!
• Can be used as highly specific DNA binding modules
BIOS Scientific Publishers Ltd, 1999
Brown Fig 9.12
• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy
Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)
Fig 5.14 Baxevanis & Ouellette 2005
Reading Assignment (for Wed)
Mount Bioinformatics
• Chp 8 Prediction of RNA Secondary Structure
• pp. 327-355
• Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html
Cates (Online) RNA Secondary Structure Prediction Module
• http://cnx.rice.edu/content/m11065/latest/