Experiences and suggestions for the annotation of tomato BAC clones
2005-09-28
Dr. Cheol-Goo Hur
Plant Genome Lab.
Genome Research Center
KRIBB, Korea
Contents
• Phase-I Annotation
• Define gene structures
• Sample Annotations
• Future Works
• Acknowledgements
Phase-I Annotation
Target: Analysis, with Tools / Data listed for the SGN Guideline*1 and for KRIBB
• Protein Coding Genes: Computational Gene Prediction
– SGN Guideline: GeneMark.hmm, FGENESH, GlimmerM, GENSCAN+, EuGene
– KRIBB: FGENESH (N. tabacum)
• Protein Coding Genes: Experimental Gene Identification
– SGN Guideline: GeneSeqer, SIM4, BLAST (Tomato cDNAs, ESTs, unigenes)
– KRIBB: BLAT, SIM4, GMAP, GeneSeqer (dbEST, GenBank mRNAs), GeneWise (GenPept Proteins)
• Protein Coding Genes: Resolution of Conflict
– SGN Guideline: PASA, GeneSeqer (Automatic); Apollo Genome Viewer (Manual)
– KRIBB: Combined Modeller (Automatic)*2; Apollo Genome Viewer (Manual)
• tRNA: Computational tRNA Prediction
– SGN Guideline: tRNAscan-SE
– KRIBB: tRNAscan-SE
• Other RNAs: Similarity-based RNA Identification (microRNAs, snoRNAs)
– SGN Guideline: -
– KRIBB: Cross-match (GenBank rRNA, Rfam)
• Repeats: Repeat Scanning
– SGN Guideline: -
– KRIBB: RepeatMasker/Cross-match (RepBase/TIGR Plant Repeats)
*1. Version 0.9 March 31, 2005
*2. KRIBB
Functional Annotation
Target: Function of Protein Coding Genes. Each analysis lists Tools / Data for the SGN Guideline and for KRIBB
• Conserved Functional Domains*1
– SGN Guideline: InterProScan (InterPro Databases)
– KRIBB: InterProScan (InterPro Databases)
• Homology to Proteins*1,2
– SGN Guideline: BLASTx (Arabidopsis, rice, Medicago, Swiss-Prot, GenBank nr)
– KRIBB: BLASTx (UniProt)
• Gene Ontology assignment
– SGN Guideline: -
– KRIBB: BLASTx (Arabidopsis Proteins associated with GOA, TAIR GO data)*3
• EC/Pathway
– SGN Guideline: -
– KRIBB: BLASTx (Arabidopsis*3 Proteins associated with KEGG EC/Pathway data)
• Protein Location Predictions
– SGN Guideline: Transmembrane Domains (TMHMM), Subcellular Location (TargetP)
– KRIBB: Transmembrane Domains (TMHMM), Subcellular Location (TargetP)
*1. Automated annotation should be converted to GO code for easier comparisons.
*2. Classify into 5 classes of gene annotation based on sequence similarity and availability of expression data (known / putative / similar to / expressed / no evidence).
*3. Use the Arabidopsis full protein set to maximize the number of GO-assigned genes.
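The five-class scheme in footnote *2 can be sketched as a small decision function. The slides name the classes but give no cutoffs, so the identity thresholds below (`known_min`, `putative_min`) and the function name are hypothetical placeholders, not values from the guideline.

```python
from typing import Optional

def annotation_class(best_hit_identity: Optional[float],
                     has_expression: bool,
                     known_min: float = 90.0,
                     putative_min: float = 60.0) -> str:
    """Assign one of the five annotation classes.

    best_hit_identity: percent identity of the best protein hit, or None
        when no significant hit was found.
    has_expression: True when an EST/cDNA supports the gene.
    known_min, putative_min: hypothetical cutoffs (NOT from the slides).
    """
    if best_hit_identity is not None:
        if best_hit_identity >= known_min:
            return "known"
        if best_hit_identity >= putative_min:
            return "putative"
        return "similar to"
    # No similarity evidence: fall back on expression data.
    return "expressed" if has_expression else "no evidence"
```

For example, `annotation_class(None, True)` gives `"expressed"`: a gene without a protein hit but with EST support.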
Data Set for gene structure and annotation (Aug. 2005)
• BACs
– Sequenced: 29 (4 BACs overlapping in 2 pairs)
– Annotated: 22
• ESTs: 200 015 (cf: Potato 193 233, Pepper 115 598)
• Full-length mRNAs (GenBank): 596
• Full-length Proteins (UniProt 5.1): 1 044
• Protein DB (UniProt Release 5.1)
– Swiss-PROT/TREMBL: 181 821 / 1 748 002
• Arabidopsis Proteins
– GO associated (TAIR): 26 196
– Pathway/EC associated (KEGG): 1 520
Defining the Gene Structure
• New Genomes, New Challenges: lack of data
• To get the best performance from the available data, a well-combined method is needed
– Combine experimental data-based gene models
– Extend the gene boundary and make up for the missing parts with predicted gene models
– Final manual curation
• Ex) EuGene for Medicago Genome Annotation
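A minimal sketch of the "extend the gene boundary with predicted models" step, assuming exons are plain `(start, end)` coordinate pairs on one strand. The function name and the rule used here (keep every evidence exon, add only predicted exons lying wholly outside the evidence span) are illustrative assumptions, not the Combined Modeller's actual logic.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, same strand

def extend_with_prediction(evidence: List[Exon],
                           predicted: List[Exon]) -> List[Exon]:
    """Keep evidence exons; add predicted exons that flank the evidence span."""
    if not evidence:
        # No experimental support: the prediction is all we have.
        return sorted(predicted)
    lo = min(start for start, _ in evidence)
    hi = max(end for _, end in evidence)
    # Predicted exons inside the evidence span are ignored, so that
    # experimental data always wins where both exist.
    flanking = [(s, e) for s, e in predicted if e < lo or s > hi]
    return sorted(evidence + flanking)
```

Manual curation would still follow, since a flanking predicted exon may belong to a neighbouring gene.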
Structure of Protein Coding Genes
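[Figure: structure of a protein-coding gene. Labels: P (promoter), CpG, TSS, TIS with ATG (Met), CDS, intron splicing signal (GT---AG), Stop (TAA / TAG / TGA), poly-A signal (AATAAA), Poly-A Site; transcripts (alternative spliced forms) and ESTs aligned below the gene.]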
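The sequence signals named in the figure (ATG start, TAA/TAG/TGA stop, GT---AG intron boundaries) lend themselves to a quick sanity check on a candidate gene model. This is only a sketch: the function name and the returned keys are made up for illustration, and non-canonical splice sites, which do occur, would be flagged as failures.

```python
from typing import Dict, List

STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_signals(cds: str, introns: List[str]) -> Dict[str, bool]:
    """Check a CDS and its intron sequences against the canonical signals."""
    cds = cds.upper()
    return {
        "starts_with_ATG": cds.startswith("ATG"),
        "ends_with_stop": cds[-3:] in STOP_CODONS,
        "length_multiple_of_3": len(cds) % 3 == 0,
        # Canonical introns begin with GT and end with AG.
        "introns_GT_AG": all(
            i.upper().startswith("GT") and i.upper().endswith("AG")
            for i in introns
        ),
    }
```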
1. Define gene structure from several kinds of evidence
• Full-length evidenced genes (mRNAs / Proteins)
• Full-length clue evidenced genes (Full-length clue ESTs from Kazusa full-length cDNA library)
• Partially evidenced genes (Other partial ESTs)
• No-evidenced genes (Prediction only)
[Figure: evidence tracks aligned to gene loci: Predict, mRNA and Protein for one gene; Predict and EST for another.]
1) Full-length Evidenced Genes
• Gene locus with full-length mRNA / Protein (GMAP, GeneWise)
• Almost complete gene structure: Gene boundary (mRNA: TSS/poly-A, protein: CDS), Exon/Intron, (some alternative splicing structure)
• Requirement: more than 1 mRNA or Protein
• Processing:
– Merge the same AS forms
– mRNA evidence: Predict CDS (ESTscan etc.)
– Protein evidence: Mend gene boundary (TSS, poly-A)
[Figure: mRNA, Protein and Predict tracks aligned to the gene locus.]
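"Merge the same AS forms" can be sketched by grouping transcript alignments that share an identical intron chain. That criterion is an assumption made for illustration; the slides do not say how two alignments are judged to be the same alternative-splicing form. The sketch also assumes every alignment passed in belongs to a single locus and strand.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive

def intron_chain(exons: List[Exon]) -> Tuple[Tuple[int, int], ...]:
    """Return the introns implied by consecutive exons."""
    exons = sorted(exons)
    return tuple((exons[i][1] + 1, exons[i + 1][0] - 1)
                 for i in range(len(exons) - 1))

def merge_same_forms(alignments: Dict[str, List[Exon]]) -> List[dict]:
    """Group alignments with identical intron chains into one form each."""
    groups = defaultdict(list)
    for name, exons in alignments.items():
        groups[intron_chain(exons)].append(name)
    merged = []
    for chain, names in groups.items():
        # The merged form spans the outermost ends of its members.
        start = min(s for n in names for s, _ in alignments[n])
        end = max(e for n in names for _, e in alignments[n])
        merged.append({"members": sorted(names), "introns": chain,
                       "start": start, "end": end})
    return merged
```

Two mRNAs that differ only in where their first or last exon ends collapse into one form whose boundaries are the outermost ends.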
2) Full-length Clue Evidenced Genes
• Gene locus with full-length clue ESTs from Kazusa full-length cDNA library (GMAP)
• Gene boundary (TSS, poly-A), some Exon/Intron
• Requirement: more than 1 full-length clue EST
• Processing:
– Merge the same AS forms
– Link the same-cloned ESTs
– Mend incomplete portion with predicted model
– CDS to be predicted (ESTscan / orfPredictor etc.)
[Figure: EST and Predict tracks aligned to the gene locus.]
3) Partially Evidenced Genes
• Gene locus with general ESTs (GMAP)
• Some Exon/Intron, poly-A
• More ESTs, more information expected
• Requirement: more than 2 ESTs with more than 2 pairs of overlapping hard edges
• Processing:
– Merge the same AS forms
– Link the same-cloned ESTs
– Mend incomplete portion with predicted model
– CDS to be predicted (ESTscan / orfPredictor etc.)
[Figure: EST1, EST2 and Predict tracks aligned to the gene locus.]
4) No-evidenced Genes
• Predicted model only (hypothetical gene)
• Predicted CDS
[Figure: Predict track only.]
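The four evidence tiers above can be expressed as one ordered decision. The counts are compared literally as "more than 1" and "more than 2", following the wording of the slides; if "at least" was intended, the hypothetical `min_full` and `min_partial` parameters can be lowered. The function and parameter names are illustrative.

```python
def evidence_tier(full_length: int, fl_clue_ests: int, ests: int,
                  shared_edge_pairs: int = 0,
                  min_full: int = 1, min_partial: int = 2) -> str:
    """Return the strongest evidence tier a gene locus qualifies for.

    full_length: number of full-length mRNAs/proteins mapped to the locus.
    fl_clue_ests: number of full-length clue ESTs (Kazusa library).
    ests: number of other ESTs.
    shared_edge_pairs: pairs of overlapping hard edges among those ESTs.
    min_full, min_partial: thresholds, read literally from the slides.
    """
    if full_length > min_full:
        return "full-length evidenced"
    if fl_clue_ests > min_full:
        return "full-length clue evidenced"
    if ests > min_partial and shared_edge_pairs > min_partial:
        return "partially evidenced"
    return "no-evidenced"
```

Checking the tiers from strongest to weakest means a locus with both mRNA and EST support is always reported under its best evidence.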
2. Transcript-Genome mappers
• Test BLAT/SIM4/GMAP/GeneSeqer
– BLAT: fast but inaccurate
– SIM4/GMAP/GeneSeqer: approximately the same results
• KRIBB: prefiltering ESTs by BLAT, then mapping with GMAP
• Cutoff: Coverage > 80%, Identity > 92%
• Problem of repeats and similarity? Or misassembly?
• Similarity cutoff needed
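The coverage and identity cutoff can be applied as a simple filter over mapped transcripts. The slides do not define the two measures, so the definitions here are assumptions: coverage is the aligned fraction of the transcript, identity the matching fraction of the aligned bases. The `Hit` record and its field names are hypothetical.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    name: str
    aligned_len: int   # transcript bases placed on the genome
    query_len: int     # full transcript length
    matches: int       # identical bases within the aligned part

def passes_cutoff(hit: Hit, min_cov: float = 80.0,
                  min_ident: float = 92.0) -> bool:
    """Apply Coverage > 80% and Identity > 92% (assumed definitions)."""
    if hit.query_len <= 0 or hit.aligned_len <= 0:
        return False
    coverage = 100.0 * hit.aligned_len / hit.query_len
    identity = 100.0 * hit.matches / hit.aligned_len
    return coverage > min_cov and identity > min_ident
```

Usage: `kept = [h for h in hits if passes_cutoff(h)]`.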
3. Protein-based Gene Models
• GeneWise / FGENESH+
• KRIBB: GeneWise after prefiltering Proteins by BLASTx
– BLASTx Cutoff: Coverage > 80%, Identity > 80%
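A sketch of the BLASTx prefilter, assuming the hits are in the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). Coverage is taken here as the fraction of the protein spanned by a single HSP, which is a simplification, and the protein lengths must be supplied separately because that format does not carry them. The function name is illustrative.

```python
from typing import Dict, Iterable, List

def filter_blastx(lines: Iterable[str], protein_len: Dict[str, int],
                  min_cov: float = 80.0, min_ident: float = 80.0) -> List[str]:
    """Return IDs of proteins with Coverage > 80% and Identity > 80%."""
    kept = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        sseqid, pident = fields[1], float(fields[2])
        sstart, send = int(fields[8]), int(fields[9])
        # Fraction of the protein covered by this single HSP (assumption).
        cov = 100.0 * (abs(send - sstart) + 1) / protein_len[sseqid]
        if cov > min_cov and pident > min_ident:
            kept.add(sseqid)
    return sorted(kept)
```

Only the proteins returned here would then be passed to GeneWise together with the BAC sequence.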
Sample Annotations: define gene structure and annotation
1) Full-length Evidenced Gene: C02HBa0025N15.220
• mRNA/Protein evidence
• Annotation
– Product: SNF1 [Lycopersicon esculentum]
– IPR000719 Prot_kinase
– GO:0006468(P) protein amino acid phosphorylation
– GO:0004672(F) protein kinase activity
– EC:2.7.1.-: Snf1-related protein kinase (KIN10) (SKIN10)
– TMHMM: outside
2) Full-length Evidenced Gene: C02HBa0066C13.60
• Protein evidence
• Annotation
– Product: phytochrome E [Lycopersicon esculentum]
– IPR001294 Phytochrome
– GO:0006355(P) regulation of transcription, DNA-dependent
– GO:0008020(F) G-protein coupled photoreceptor activity
– TargetP/TMHMM: C/outside
– FunCat: 30.01 intracellular signalling, 70.01 cell wall
3) Full-length Clue Evidenced Gene: C02HBa0060J03.170
• Kazusa full-length cDNA/EST evidence
• Annotation
– Product: putative protein [Arabidopsis thaliana]
– IPR001251: CRAL_bd_TRIO_C
– TMHMM: outside
[Figure: gene model, ~1 kb, 3 exons.]
4) Partially Evidenced Gene: C02HBa0060J03.90
• EST evidence
• Annotation
– Product: putative protein [Arabidopsis thaliana]
– IPR000719 Prot_kinase
– GO:0006468(P) protein amino acid phosphorylation
– GO:0004672(F) protein kinase activity
– GO:0016020(C) membrane
– TMHMM: outside
5) Gene with alternative splicing: C02HBa0060J03.40-4
• EST evidence
• Annotation
– Product: transformer-SR ribonucleoprotein [N. tabacum]
– IPR000504 RNA-binding region RNP-1
– GO:0003676(F) nucleic acid binding
– GO:0030529(C) ribonucleoprotein complex
– TargetP/TMHMM: C/outside
Annotation Results
Property                     Value        Unit
BAC (Annotated/Sequenced)    22 / 29      BAC
Length (Average/Total)       122 / 2698   kb
Putative Protein CDSs        620          gene
Gene Density                 4.6          kb/gene
Gene Length, Average         3.1          kb
Exon Length, Average         272          bp
Exons per Gene, Average      7.3          exon/gene
With ESTs                    352 (57%)    gene
Protein Annotated            446 (72%)    gene
Domain Annotated             424 (68%)    gene
GO Annotated                 338 (55%)    gene
Pathway Annotated            25 (4%)      gene
EC Annotated                 29 (5%)      gene
tRNA                         13           gene
Repeats                      144 (5.3%)   kb
*1. All values from annotated 22 BACs.
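The percentages in the results are consistent with each count divided by the 620 putative protein CDSs, and the repeat fraction with 144 kb out of the 2698 kb total. A short check, using only numbers from the table (the variable names are illustrative):

```python
TOTAL_GENES = 620      # putative protein CDSs
TOTAL_KB = 2698        # total length of the annotated BACs, in kb

counts = {
    "With ESTs": 352,
    "Protein Annotated": 446,
    "Domain Annotated": 424,
    "GO Annotated": 338,
    "Pathway Annotated": 25,
    "EC Annotated": 29,
}

# Share of genes in each category, rounded to whole percent.
percent = {name: round(100 * n / TOTAL_GENES) for name, n in counts.items()}

# Share of sequence masked as repeats, to one decimal place.
repeat_percent = round(100 * 144 / TOTAL_KB, 1)

for name, value in percent.items():
    print(f"{name}: {value}%")
print(f"Repeats: {repeat_percent}%")
```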
Future Works
• Training data set for Tomato gene HMM models
• Automation
• Performance assessment
• Manual curation (Apollo)
Tool: Author; Source
• BLAT: Jim Kent; UCSC (http://www.cse.ucsc.edu/~kent/)
• FGENESH: Solovyev, et al.; SoftBerry, Inc. (http://www.softberry.com/)
• GMAP: Thomas D. Wu, Colin K. Watanabe; Genentech, Inc. (http://www.gene.com/share/gmap/)
• GeneSeqer: V. Brendel, et al.; Iowa State/Stanford University (http://bioinformatics.iastate.edu/bioinformatics2go/gs/download.html)
• GeneWise: Ewan Birney; EBI (http://www.ebi.ac.uk/Wise2/)
• InterProScan: EBI (http://www.ebi.ac.uk/InterProScan/)
• Miropeats: Parsons J.D.; Washington University (http://genomeold.wustl.edu/groups/informatics/software/miropeats/)
• BLAST (NCBI): S.F. Altschul; NCBI (http://www.ncbi.nlm.nih.gov/blast/)
• Phred/phrap/cross_match: Phil Green; University of Washington (http://www.phrap.org/)
• RepeatMasker: Arian Smit, P. Green (http://www.repeatmasker.org/)
• SIM4: Liliana Florea et al.; Penn State University (http://globin.cse.psu.edu/)
• TargetP: Olof Emanuelsson, et al.; CBS, Technical University of Denmark (http://www.cbs.dtu.dk/services/TargetP/)
• TMHMM: A. Krogh, et al.; CBS, Technical University of Denmark (http://www.cbs.dtu.dk/services/TMHMM/)
• tRNAscan-SE: T.M. Lowe, S.R. Eddy; Washington University (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/)
Acknowledgement
http://sol.kribb.re.kr
Thank you for your attention!