Solanaceae 2006 BAC Annotation


Page 1: Solanaceae 2006  BAC Annotation

Solanaceae 2006 BAC Annotation

2006. 07. 26

Plant Genome Research Center

KRIBB, KOREA

Page 2: Solanaceae 2006  BAC Annotation

Developmental Environments

• OS: SGI IRIX 6.5
• CPU: MIPS 500MHz, 12 CPUs
• MEM: 12288 MB

• OS: SUSE Linux 9.0 (kernel version 2.6.11.4-21.11-bigsmp)
• CPU: Intel(R) Xeon(TM) CPU 2.80GHz
• MEM: 6231 MB

• DBMS: MySQL 4.0.25
• Languages: PHP 5.0.4, Perl 5.8.7
• Web server: Apache 2.0.54

Page 3: Solanaceae 2006  BAC Annotation

Data Sets

• BACs (SGN test BACs)
– Annotated: 10
• ESTs: 200,015 (cf. 202,043 current)
• Full-length mRNAs (GenBank): 596
• Protein DB (UniProt Release 7.7)
– Swiss-Prot / TrEMBL: 228,917 / 2,914,826
– Swiss-Prot / TrEMBL (plant): 15,203 / 219,361
• Arabidopsis Proteins
– Proteins, Genomes (TAIR): 30,693
– GO associated (TAIR): 28,812
– Pathway/EC associated (KEGG): 1,521
• Tomato Chip Data: Tomato Expression Database (Cornell)

Page 4: Solanaceae 2006  BAC Annotation

Structural Annotation

Target | Analysis | Tools / Data: SGN Guideline | Tools / Data: KRIBB
Protein Coding Genes | Computational Gene Prediction | GeneMark.hmm, FGENESH, GlimmerM, GENSCAN+, EuGene | FGENESH (N. tabacum), GENSCAN
Protein Coding Genes | Experimental Gene Identification | GeneSeqer, SIM4, BLAST (Tomato cDNAs, ESTs, unigenes) | BLAT, SIM4, GMAP, GeneSeqer (dbEST, GenBank mRNAs), GeneWise2.0 (GenPept Proteins)
Protein Coding Genes | Resolution of Conflict | PASA, GeneSeqer (Automatic); Apollo Genome Viewer (Manual) | Combined Modeller (Automatic); Apollo Genome Viewer (Manual)
tRNA | Computational tRNA Prediction | tRNAscan-SE | tRNAscan-SE
Other RNAs | Similarity-based RNA Identification (microRNAs, snoRNAs) | - | Cross-match (GenBank rRNA, Rfam)
Promoter | TFBS/Promoter analysis | - | Transfac, MEME, Gibbs, Pratt
Repeats | Repeat Scanning | - | RepeatMasker/Cross-match (RepBase/TIGR Plant Repeats)

Page 5: Solanaceae 2006  BAC Annotation

Functional Annotation

Target | Analysis | Tools / Data: SGN Guideline | Tools / Data: KRIBB
Function of Protein Coding Genes | Conserved Functional Domains | InterProScan (InterPro Databases) | InterProScan (InterPro Databases)
Function of Protein Coding Genes | Homology to Proteins | BLASTx (Arabidopsis, rice, Medicago, Swiss-Prot, GenBank nr) | BLASTx, WU-BLAST-2.0 (Swiss-Prot, TrEMBL, Arabidopsis)
Function of Protein Coding Genes | Gene Ontology assignment | - | BLASTx (Arabidopsis Proteins associated with GOA, TAIR GO data)
Function of Protein Coding Genes | EC/Pathway | - | BLASTx (Arabidopsis Proteins associated with KEGG EC/Pathway data)
Function of Protein Coding Genes | TFBS / Promoter | | WU-BLAST2 (blastx) (Arabidopsis proteins associated with TFBS/Promoter)
Function of Protein Coding Genes | Protein Location Predictions | Transmembrane Domains (TMHMM), Subcellular Location (TargetP) | Transmembrane Domains (TMHMM), Subcellular Location (TargetP)

Page 6: Solanaceae 2006  BAC Annotation

Define gene structure by various data evidence

• Full-length evidenced genes (mRNAs / Proteins)

• Full-length clue evidenced genes (Full-length clue ESTs from Kazusa full-length cDNA library)

• Partially evidenced genes (Other partial ESTs)

• No-evidenced genes (Prediction only)

[Figure: predicted gene model (Predict) shown against mRNA and Protein evidence, and against EST evidence]

Page 7: Solanaceae 2006  BAC Annotation

1) Full-length Evidenced Genes

• Gene locus with full-length mRNA / Protein (GMAP, GeneWise)
• Almost complete gene structure: Gene boundary (mRNA: TSS/poly-A, protein: CDS), Exon/Intron, (some alternative splicing structure)
• Requirement: more than 1 mRNA or Protein
• Processing:
– Merge the same AS forms
– mRNA evidence: Predict CDS (ESTscan etc.)
– Protein evidence: Mend gene boundary (TSS, poly-A)

[Figure: schematic of a predicted model (Predict) against mRNA and Protein evidence; Sample browser view with tracks mRNAs, TIGR TC, stackPACK, ESTs, Predicted Genes]

Page 8: Solanaceae 2006  BAC Annotation

2) Full-length Clue Evidenced Genes

• Gene locus with full-length clue ESTs from Kazusa full-length cDNA library (GMAP)
• Gene boundary (TSS, poly-A), some Exon/Intron
• Requirement: more than 1 full-length clue EST
• Processing:
– Merge the same AS forms
– Link the same-cloned ESTs
– Mend incomplete portion with predicted model
– CDS to be predicted (ESTscan / orfPredictor etc.)

[Figure: schematic of a predicted model (Predict) against EST evidence; Sample browser view with tracks Full length Clue ESTs (Kazusa), Predicted Genes, ESTs]

Page 9: Solanaceae 2006  BAC Annotation

3) Partially Evidenced Genes

• Gene locus with general ESTs (GMAP)
• Some Exon/Intron, poly-A
• More ESTs, more information expected
• Requirement: more than 2 ESTs with more than 2 pairs of overlapping hard edges
• Processing:
– Merge the same AS forms
– Link the same-cloned ESTs
– Mend incomplete portion with predicted model
– CDS to be predicted (ESTscan / orfPredictor etc.)

[Figure: schematic of a predicted model (Predict) against EST1 and EST2; Sample browser view with tracks ESTs, Predicted Genes]

Page 10: Solanaceae 2006  BAC Annotation

4) No-evidenced Genes

• Predicted model only (hypothetical gene)

• Predicted CDS

[Figure: Sample view of a predicted model (Predict) with no supporting evidence, labeled "No Evidence !!"]
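The four evidence categories can be read as a cascade in which a locus is assigned to the strongest category whose requirement it meets. The sketch below is only an illustration of that reading: the `LocusEvidence` structure, its field names, and the assumption that categories are checked in the order 1) to 4) are hypothetical and are not taken from the KRIBB pipeline. Whether each requirement is met is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class LocusEvidence:
    """Evidence flags for one gene locus (hypothetical structure)."""
    has_full_length: bool = False           # full-length mRNA/protein requirement met
    has_full_length_clue_est: bool = False  # Kazusa full-length clue EST requirement met
    has_partial_est: bool = False           # general EST requirement met


def classify_locus(ev: LocusEvidence) -> str:
    """Return the evidence category, checking the strongest evidence first.

    Assumption: categories are tested in the order they are numbered
    on the slides, so a locus with several kinds of evidence lands in
    the first category that applies.
    """
    if ev.has_full_length:
        return "full-length evidenced"
    if ev.has_full_length_clue_est:
        return "full-length clue evidenced"
    if ev.has_partial_est:
        return "partially evidenced"
    return "no-evidenced"
```

A locus supported only by a gene predictor would be built as `LocusEvidence()` and classified as "no-evidenced".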

Page 11: Solanaceae 2006  BAC Annotation

Gene Structure Annotation - Problems

• False positive intergenic region: 2 annotated genes actually correspond to a single gene
• False negative intergenic region: One annotated gene structure actually contains 2 genes
• False negative gene prediction: Missing gene (no annotation)
• Other: partially incorrect gene annotation, missing annotation of alternative transcripts (Alternative Splicing)
• Pseudo-genes
• Promoter / Regulatory Elements

Page 12: Solanaceae 2006  BAC Annotation

Estimated Gene Prediction

CATEGORY | NUMBER
Predicted Genes | 301
– TSS | 294
– Start Codon | 296
– Stop Codon | 297
– PAS signals 1) | 100
– PolyA (≥ 7) | 296
Genes overlapping EST Clusters | 148
– Genes hitting multiple EST Clusters | 61
– Genes hitting single EST Clusters | 87
Genes overlapping ESTs | 165
– EST mapping Genes (≥ 2) | 109
– EST mapping Genes (= 1) | 56
Genes hitting mRNAs | 6
Genes hitting Full-length cDNAs | 20

1) PAS: predicted polyadenylation signal, hexamer signal A(A/U)AAA

Page 13: Solanaceae 2006  BAC Annotation

Gene Structure Browser

• Test BLAT/SIM4/GMAP/GeneSeqer
– BLAT: Fast / Inaccurate
– SIM4/GMAP/GeneSeqer: Approx. the same results
• KRIBB: Prefiltering ESTs by BLAT + GMAP
• Cutoff: Coverage > 80%, Identity > 90%

[Figure: browser tracks: dbESTs, TIGR TC, Unigene, Kazusa Full ESTs, Protein, FGENESH, GENSCAN, mRNA, Repeats / Domain]
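The coverage and identity cutoff can be illustrated with a small filter over EST-to-BAC alignments. This is a minimal sketch, not the KRIBB implementation: the `Alignment` record and its fields are hypothetical, and coverage and identity are assumed to mean aligned EST bases over EST length and matching bases over aligned bases, which the slides do not spell out.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """One EST-to-genome alignment (hypothetical record layout)."""
    est_id: str
    est_length: int     # total length of the EST in bases
    aligned_bases: int  # EST bases covered by the alignment
    matches: int        # identical bases within the aligned region


def passes_cutoff(aln: Alignment,
                  min_coverage: float = 0.80,
                  min_identity: float = 0.90) -> bool:
    """True if coverage > min_coverage and identity > min_identity."""
    if aln.est_length <= 0 or aln.aligned_bases <= 0:
        return False
    coverage = aln.aligned_bases / aln.est_length  # assumed definition
    identity = aln.matches / aln.aligned_bases     # assumed definition
    return coverage > min_coverage and identity > min_identity


def filter_alignments(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep only alignments that pass the default cutoffs."""
    return [aln for aln in alignments if passes_cutoff(aln)]
```

Both comparisons are strict, matching the "greater than" wording of the cutoff, so an alignment with exactly 80% coverage is rejected.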

Page 14: Solanaceae 2006  BAC Annotation

[Screenshot with "Click !!" callout]

Page 15: Solanaceae 2006  BAC Annotation

[Screenshot with "Click !!" callout]

Page 16: Solanaceae 2006  BAC Annotation

Functional Annotation

[Screenshot panel: Protein DB / EC / GO]

Page 17: Solanaceae 2006  BAC Annotation

Functional Annotation

[Screenshot panels: TFBS / Promoter; Protein DB / GO]

Page 18: Solanaceae 2006  BAC Annotation

Functional Annotation

[Screenshot panels: TargetP/TMHMM; Enzyme / Pathway; Domain / Motif]

Page 19: Solanaceae 2006  BAC Annotation

Expression Annotation (Digital Expression)

Principle of identifying differentially expressed genes by Hypergeometric Test

• N: ESTs for all genes in all tissues
• n: ESTs for selected gene in all tissues
• K: ESTs for all genes in selected tissue
• k: ESTs for selected gene in selected tissue
• P: Significance of over- or under-expression in selected tissue
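With N, n, K and k defined as above, the hypergeometric test asks how likely it is to see k or more (over-expression) or k or fewer (under-expression) ESTs of the gene among the K ESTs of the selected tissue, if the n ESTs of that gene were spread at random over all N ESTs. The following is a minimal sketch using only the Python standard library; the function name and the choice to report both one-sided tails are assumptions, since the slides do not say how P was computed.

```python
from math import comb
from typing import Tuple


def hypergeom_tails(k: int, N: int, K: int, n: int) -> Tuple[float, float]:
    """Return (P[X >= k], P[X <= k]) for a hypergeometric X.

    N: ESTs for all genes in all tissues
    n: ESTs for the selected gene in all tissues
    K: ESTs for all genes in the selected tissue
    k: ESTs for the selected gene in the selected tissue
    """
    lo = max(0, n - (N - K))  # smallest possible overlap
    hi = min(n, K)            # largest possible overlap
    total = comb(N, n)

    def pmf(i: int) -> float:
        # Probability that exactly i of the gene's n ESTs fall in the tissue.
        return comb(K, i) * comb(N - K, n - i) / total

    p_over = sum(pmf(i) for i in range(max(k, lo), hi + 1))
    p_under = sum(pmf(i) for i in range(lo, min(k, hi) + 1))
    return p_over, p_under
```

A small `p_over` suggests over-expression in the selected tissue and a small `p_under` suggests under-expression. Exact integer binomials are used for clarity; for very large EST counts a log-gamma formulation would be faster.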

Page 20: Solanaceae 2006  BAC Annotation

Expression Annotation (Array Chip)

Page 21: Solanaceae 2006  BAC Annotation

Expression Annotation (Tissue Specific Genes)

Principle of identifying differentially expressed genes by Audic's Test

• x: number of cognate ESTs of a given gene in the selected library
• N1: total number of ESTs in the selected library
• y: number of cognate ESTs of a given gene in the other library
• N2: total number of ESTs in the other library
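The Audic and Claverie statistic gives the probability of observing y ESTs of a gene in a library of size N2, given that x were observed in a library of size N1, under the hypothesis of equal expression: p(y|x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)). The sketch below evaluates that formula in log space and sums it into an upper tail. The function names are hypothetical, and the use of a one-sided upper tail as the significance measure is an assumption, as the slides do not state which tail or threshold was applied.

```python
from math import exp, lgamma, log


def audic_claverie_prob(x: int, y: int, n1: int, n2: int) -> float:
    """p(y | x) for EST counts x, y from libraries of size n1, n2."""
    ratio = n2 / n1
    log_p = (y * log(ratio)
             + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
             - (x + y + 1) * log(1 + ratio))
    return exp(log_p)


def audic_claverie_upper_tail(x: int, y: int, n1: int, n2: int) -> float:
    """P(Y >= y | x): chance of seeing y or more ESTs in the other library."""
    below = sum(audic_claverie_prob(x, j, n1, n2) for j in range(y))
    return max(0.0, 1.0 - below)
```

For two libraries of equal size, `audic_claverie_prob(1, 1, 1000, 1000)` evaluates 2! / (1! 1! 2^3) = 0.25.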

Page 22: Solanaceae 2006  BAC Annotation

Pepper tissue-specific gene analysis

[Figure: expression panel. Rows: CaActin, CacnA (16), CacnB (18), CacnC (13), CacnD (10), CacnE (25), CacnF (31), CacnG (20). Lane labels: Leaf, stem, root, Buf, Xag, IM, M.G, Breaker, M.R, Floral bud, Flower, Bark. Group labels: Flower, Pathogen, Fruit.]

* 25 cycles, annealing temp. 55℃
* (# of ESTs)

Page 23: Solanaceae 2006  BAC Annotation

Annotation Results

Property | Value | Unit
BAC (Annotated) | 10 | BAC
Length (Average) | 120 | kb
Putative Protein CDSs | 301 | gene
Gene Density | 4.2 | kb/gene
Gene Length, Average | 3.1 | kb
Exon Length, Average | 338 | bp
Exons per Gene, Average | 8.4 | exon/gene
With ESTs | 165 | gene
Protein Annotated | 196 | gene
Domain Annotated | 213 | gene
GO Annotated | 144 | gene
Pathway Annotated | 17 | gene
EC Annotated | 17 | gene
TFBS/Promoter Annotated | 127 | gene
Tissue specific Annotated | 56 | gene
Expression Annotated | 18 | gene
tRNA | 0 | gene
Repeats | 144 | kb

Page 24: Solanaceae 2006  BAC Annotation

Thanks !!

Solanaceae 2006 BAC Annotation Test page

http://crop.kribb.re.kr/SOL-Test/

http://sol.kribb.re.kr/