The genome sequence of Melampsora larici-populina, the causal agent of the poplar rust disease
Gene content in the Mlp genome (automated annotation)
Mlp Summer workshop – INRA Nancy, August 20-21, 2008
Duplessis Sébastien (INRA Nancy)
Tree/Microbe Interactions Joint Unit, INRA/University Nancy, UMR 1136 IAM
Annotation of Mlp Genome – Gene prediction (2006-2007)
[Diagram: gene prediction pipeline combining intrinsic and extrinsic approaches]
Components: coding potential search, SpliceMachine, Netstart, repeats, Blastn, Blastx, tBlastx
Databases: Puccinia, Sporobolomyces, basidiomycetes, Swissprot, Mlp ESTs
Gene predictors: EuGene, FGeneSH, Genewise
=> Predicted genes (manual curation)
Mlp Genome Project – Summer 2007
Pre-release of Mlp genome assembly (16.4% gaps – Assembled with JAZZ)
Main genome scaffold total: 2,682
ESTs from 50/50 spores and germtubes of Mlp 98AG31
INRA Nancy => ~4,000 (2004)
JGI => ~60,000 (2007)
=> ~52,000 ESTs
ESTs from spores and germlings of Melampsora spp. [Mlp, Mmd, Mmt, Mo]
CFS Laval => ~3,000 Mlp / ~4,200 Mmd / ~3,000 Mo / ~3,000 Mmt
In planta ESTs from Mlp haustoria => ~1,700 Mlp H3B
=> ~15,000 ESTs
Blast against Mlp scaffolds
Blast against Mlp ESTs
Blast against available basidiomycete genomes
Melampsora IAM website => summer 2007 (B. Hilselberger) updated in 2008 (E. Tisserant)
Files to help in annotation using Artemis
=> fasta of genome scaffolds
=> gff files of EST clusters
=> gff files of blastn hits vs. Puccinia, Sporobolomyces & Ustilago gene models
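A minimal sketch of how the tab-separated feature lines of such gff files can be read, for instance to check feature coordinates before loading them in Artemis. The sample record, the scaffold name and the cluster ID are hypothetical; the actual files on the IAM website may follow different source, type and attribute conventions.

```python
from collections import namedtuple

# The nine standard GFF columns.
GffRecord = namedtuple(
    "GffRecord",
    "seqid source type start end score strand phase attributes",
)

def parse_gff_line(line):
    """Parse one GFF feature line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attributes = fields
    return GffRecord(seqid, source, ftype, int(start), int(end),
                     score, strand, phase, attributes)

# Hypothetical record: an EST cluster aligned on a genome scaffold.
sample = "scaffold_1\tEST_cluster\tmatch\t1200\t1850\t.\t+\t.\tID=cluster_0001"
record = parse_gff_line(sample)

# GFF coordinates are 1-based and inclusive at both ends.
feature_length = record.end - record.start + 1
```

Keeping the parser limited to the nine standard columns makes no assumption about how the attribute column is structured in the real files.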
Annotation of FL sequences = TRAINING SET for gene predictors (EuGene, fgenesh, ...)
Gene models annotation based on complete EST support & homology
Coding for known ubiquitous functions (metabolism, cytoskeleton elements…)
Coding for hypothetical proteins and new genes?
Coding for proteins of various sizes
Manual curation performed with Artemis (Nancy & Québec)
=> 348 GM curated
Edition of annotation cards => Melampsora Genome Consortium website
TRAINING SET for gene prediction (EuGene, fgenesh, ...)
=> 348 GM curated
=> 52,269 ESTs from Mlp 98AG31
=> raw TE prediction based on Mlp genome pre-release
• 39 scaffolds (43.9 Mbp)
• 409 repetitive elements provided by collaborator, 87 generated in pipeline
• nr: N. crassa, M. grisea, F. graminearum
• ESTs
– 3941 uniseqs described in 2003 paper
– 6318 uniseqs described in 2008 paper
– 8799 JGI cluster consensi (includes external ESTs)
• 5 C. parasitica CDSs from NCBI
JGI Gene prediction (Andrea Aerts – Jan-Mar 2008)
Outputs
feature            Mellp1         Sporo1         Lacbi1         Phchr1         Pospl1
Scaffolds (Mbp)    101.1          21.2           64.9           35.1           90.9
Gaps (Mbp)         3.4 (3.4%)     0.33 (1.6%)    6.2 (9.6%)     N/A            21.9 (24.1%)
Repeats (Mbp)      49.4 (48.9%)   0.31 (1.5%)    14.4 (22.2%)   0.32 (0.91%)   4.96 (5.46%)
Gene length (Mbp)  25.0 (24.7%)   13.2 (62.3%)   31.6 (48.7%)   16.8 (47.9%)   35.6 (39.2%)
# genes            15,410         5,536          20,614         10,048         17,173
# genes / Mbp      152.42         261.13         317.63         286.27         188.92
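The "# genes / Mbp" row is simply the gene count divided by the assembly size. A short sketch reproducing it from the two rows above; the dictionary layout and function name are illustrative choices, not part of the JGI pipeline.

```python
# (assembly size in Mbp, number of genes) per genome, copied from the table above.
genomes = {
    "Mellp1": (101.1, 15410),
    "Sporo1": (21.2, 5536),
    "Lacbi1": (64.9, 20614),
    "Phchr1": (35.1, 10048),
    "Pospl1": (90.9, 17173),
}

def genes_per_mbp(size_mbp, n_genes):
    """Gene density: number of gene models per megabase of assembly."""
    return n_genes / size_mbp

density = {name: round(genes_per_mbp(size, n), 2)
           for name, (size, n) in genomes.items()}

for name, value in density.items():
    print(f"{name}: {value:.2f} genes / Mbp")
```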
What do the genes look like?
                        Mellp1    Sporo1    Lacbi1    Phchr1    Pospl1
Gene length (bp)        1622.89   2389.05   1533.42   1667.04   2075.26
Transcript length (bp)  1241.87   1750.21   1134.45   1365.73   1438.85
Protein length (aa)     383.36    564.80    367.19    455.18    458.46
Exon length (bp)        256.26    242.77    210.13    233.64    211.92
Intron length (bp)      101.07    104.88    92.70     64.18     111.92
Exon frequency          4.85      7.21      5.40      5.85      6.79
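As a rough consistency check on these averages, the transcript length should be close to exon length times exons per gene, and the gene length close to the transcript length plus one intron between each pair of exons. The sketch below applies this to the Mellp1 column. It is only an approximation, since a product of means is not the mean of a product, and the 1% tolerance used in the check is an arbitrary assumption.

```python
# Mean values for Mellp1, copied from the table above.
exon_len = 256.26        # bp
intron_len = 101.07      # bp
exons_per_gene = 4.85
transcript_len = 1241.87 # bp, as reported
gene_len = 1622.89       # bp, as reported

# Transcript = all exons joined together.
est_transcript = exon_len * exons_per_gene

# Gene = transcript plus one intron between each pair of adjacent exons.
est_gene = est_transcript + intron_len * (exons_per_gene - 1)

def rel_error(estimate, reported):
    """Relative difference between an estimate and the reported value."""
    return abs(estimate - reported) / reported

print(f"transcript: estimated {est_transcript:.1f} bp, reported {transcript_len} bp")
print(f"gene:       estimated {est_gene:.1f} bp, reported {gene_len} bp")
```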
How were the genes predicted?
Method        Mellp1         Sporo1       Lacbi1         Phchr1         Pospl1
KGs and ESTs  1377 (8.9%)    54 (1%)      64 (0.3%)      12 (0.1%)      61 (0.4%)
homology      2653 (17.2%)   2713 (49%)   3699 (18%)     3526 (35.1%)   7549 (43.9%)
Eug           5603 (36.4%)   -            9848 (47.7%)   -              -
ab initio     5777 (37.5%)   2769 (50%)   7003 (34%)     6510 (64.8%)   9563 (55.7%)
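The per-method counts above add up to the "# genes" totals of the outputs table, which is a quick way to confirm how the columns line up. A sketch of that check; the data structure is an illustrative assumption, and the "Eug" (EuGene) entries are only reported for Mellp1 and Lacbi1.

```python
# Gene counts per prediction method, copied from the table above.
by_method = {
    "Mellp1": {"KGs and ESTs": 1377, "homology": 2653, "Eug": 5603, "ab initio": 5777},
    "Sporo1": {"KGs and ESTs": 54, "homology": 2713, "ab initio": 2769},
    "Lacbi1": {"KGs and ESTs": 64, "homology": 3699, "Eug": 9848, "ab initio": 7003},
    "Phchr1": {"KGs and ESTs": 12, "homology": 3526, "ab initio": 6510},
    "Pospl1": {"KGs and ESTs": 61, "homology": 7549, "ab initio": 9563},
}

# Total gene counts from the "# genes" row of the outputs table.
total_genes = {"Mellp1": 15410, "Sporo1": 5536, "Lacbi1": 20614,
               "Phchr1": 10048, "Pospl1": 17173}

def method_shares(counts):
    """Percentage of gene models contributed by each prediction method."""
    total = sum(counts.values())
    return {method: round(100 * n / total, 1) for method, n in counts.items()}

sums = {genome: sum(counts.values()) for genome, counts in by_method.items()}
shares = {genome: method_shares(counts) for genome, counts in by_method.items()}
```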
How good are the genes?
metric        Mellp1        Sporo1       Lacbi1        Phchr1       Pospl1
start + stop  14432 (94%)   3891 (70%)   18218 (88%)   8352 (83%)   14569 (85%)
nr            6664 (43%)    4446 (80%)   10925 (53%)   ND           13374 (78%)
Pfam          4101 (27%)    3272 (59%)   7653 (37%)    4769 (47%)   7681 (45%)
EST           3230 (21%)    1759 (32%)   2468 (12%)    ND           4038 (23%)
KOG assignments
                                  Mellp1       Sporo1       Lacbi1       Phchr1       Pospl1
Cellular Processes & Signaling    2769 (18%)   1525 (28%)   3351 (16%)   2132 (21%)   3482 (20%)
Information Storage & Processing  1864 (12%)   1149 (21%)   2196 (11%)   1456 (14%)   2251 (13%)
Metabolism                        2127 (14%)   1358 (25%)   2294 (11%)   2044 (20%)   3589 (21%)
KEGG assignments
[Figure: bar chart of KEGG category assignments, axis 0 to 0.09, for Mellp1, Sporo1, Lacbi1, Phchr1 and Pospl1. Categories: Amino Acid Metabolism, Biodegradation of Xenobiotics, Biosynthesis of Secondary Metabolites, Carbohydrate Metabolism, Energy Metabolism, Lipid Metabolism, Metabolism of Cofactors and Vitamins, Metabolism of Complex Carbohydrates, Metabolism of Complex Lipids, Metabolism of Other Amino Acids, Nucleotide Metabolism]
Prediction of Gene Models using EuGene (VIB - Ghent)
Annotation performed with Mlp genome pre-release
M-P Oudot Le Secq - EuGene annotation using Laccaria bicolor annotation parameters
=> ~17,000 Mlp gene models (<1,500 TEs) => Mlp GM v0.0
Yao-Cheng Lin - EuGene annotation using parameters specifically defined for M. larici-populina
=> ~9,000 Mlp gene models (> 200 aa)
Annotation performed with Mlp genome assembly release Jan 2008
Yao-Cheng Lin - EuGene annotation using specific training for M. larici-populina
=> 12,386 Mlp gene models
4308 hits vs yeast
4899 hits against Uniprot (7487 no hits - 1/3 ; 2/3)
4708 supported by ESTs
Yao-Cheng Lin – Last EuGene annotation (summer 2008)
including 454 data (~5000 contigs) and adjusted parameters for small secreted proteins prediction
=> 17,167 Mlp gene models (6,989 < 300 aa)
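Counts such as "6,989 < 300 aa" can be obtained from a protein FASTA file of the gene models with a few lines of code. A minimal sketch under stated assumptions: the two sequences and their identifiers are made up for illustration, and the real EuGene output may be organised differently.

```python
import io

def read_fasta_lengths(handle):
    """Return {sequence id: length in residues} for a protein FASTA stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word of the header line.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            # Ignore a trailing '*' that some tools use to mark the stop codon.
            lengths[name] += len(line.rstrip("*"))
    return lengths

def count_shorter_than(lengths, threshold=300):
    """Number of proteins strictly shorter than the threshold (in aa)."""
    return sum(1 for n in lengths.values() if n < threshold)

# Hypothetical two-sequence example: one 5 aa and one 350 aa protein.
example = ">model_1 hypothetical\nMKTLL*\n>model_2 hypothetical\n" + "M" + "A" * 349 + "\n"
lengths = read_fasta_lengths(io.StringIO(example))
n_short = count_shorter_than(lengths)

# Share of short models in the summer 2008 EuGene annotation quoted above.
share_short = 6989 / 17167
```

With the figures quoted above, the share of models shorter than 300 aa comes to roughly 41%.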
• Genewise – 9193 models
• Fgenesh_pm – 3147 models
• estExt_fpm – 2438 models
JGI Gene prediction (Andrea Aerts – 03/28/2008)
JGI gene models prediction + EuGene prediction => reconciliation and release in April 2008
16694 Gene models
4465 EuGene models (27%)
4810 fgenesh1 (29%) + 5422 fgenesh2 (32%)
=> 65.5% fgenesh models
1997 Genewise/GenewisePlus models (12%)
21% of fgenesh/genewise models were consolidated with EST Extension
Prediction method:
– Ab initio: 51%
– EuGene: 27%
– Homology based: 14%
– EST based: 8%
16,694 gene models predicted by JGI (& EuGene)
Gene model validation:
– Complete (5'M-3'*): 94%
– Alignment with nr: 43%
– Alignment with Pfam: 25%
– EST support: 27%
Mean exon size: 250 bp (Laccaria: 210 bp)
Mean intron size: 120 bp (Laccaria: 93 bp)
Mean protein size: 378 aa (Laccaria: 367 aa)
Mean gene length: 1685 bp (Laccaria: 1.5 kb)
Mean transcript length: 1224 b (Laccaria: 1.1 kb)
Exon # / gene: 4.90 (Laccaria: 5.4)
Protein length < 300 aa
– Laccaria: 52%, Coprinus: 40%
– Melampsora: 49%, Puccinia: 54%
JGI Gene Models prediction – Intron donors and acceptors
Gene model density on the 20 largest scaffolds
Mean gene density of 2.04 / 10 kb => 1 gene / 4.9 kb (Laccaria: 1 gene / 3.1 kb)
28% of the genome is coding sequence
16,694 putative proteins (gene models) = JGI prediction + extra putative proteins identified with EuGene
15,725 proteins > 100 aa
Laccaria: >17,000
Phanerochaete: 10,048
Coprinopsis: 8,759
Ustilago: 6,522
7,830 with homologs in nr (47%), including 3,893 hypothetical proteins (Puccinia, Laccaria, mostly basidiomycete)
5,461 with homologs in Swissprot (33%)
6,820 with homologs in Laccaria (41%)
4,507 supported by Mlp ESTs (27%)
A large proportion (30%) of Mlp genes do not have homologs in other fungal genomes, including P. graminis (Pucciniales) and Sporobolomyces roseus
JGI Gene Models prediction – The Mlp gene space
[Figure: Blast vs. other fungal deduced proteomes. Bar chart of matches (%), axis 0 to 70, for ESTs, Phakopsora, Puccinia, Sporobolomyces, Ustilago, Phanerochaete, Coprinus, Laccaria and Magnaporthe]
33% of gene models are specific to Melampsora larici-populina (5,500 models with no homologs, but ~300 Pfam/IPR hits)
10,344 homologs in P. graminis (62%)
~25% of orthologs with P. graminis
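The percentages quoted for the Mlp gene space are all fractions of the 16,694 gene models. A short sketch recomputing them from the counts given above; the labels are illustrative, and the 5,500 figure is the rounded value from the slide.

```python
TOTAL_GENE_MODELS = 16694

# Counts quoted above for the Mlp gene space.
counts = {
    "homologs in nr": 7830,
    "homologs in Swissprot": 5461,
    "homologs in Laccaria": 6820,
    "supported by Mlp ESTs": 4507,
    "homologs in P. graminis": 10344,
    "no homologs (Mlp-specific)": 5500,
}

# Percentage of all gene models, rounded to the nearest integer.
percent = {label: round(100 * n / TOTAL_GENE_MODELS)
           for label, n in counts.items()}

for label, value in percent.items():
    print(f"{label}: {value}%")
```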
Mlp gene models functional classification
GO classification: 27.8%
Cellular component: cell, macromolecular complex, organelle, extracellular region, envelope
Molecular function: catalytic activity, binding, transporter activity, enzyme regulator activity, molecular transducer activity, motor activity, transcription regulator activity, structural molecule activity, nutrient reservoir activity, antioxidant activity
Biological process: metabolic process, establishment of localization, cellular process, biological regulation, response to stimulus, reproduction
• KEGG pathways: 2758 gene models (16.5%)
[Figure: bar chart of KEGG pathway categories (%), axis 0 to 35, for Melampsora, Puccinia and Sporobolomyces. Categories: Amino Acid Metabolism, Biodegradation of Xenobiotics, Biosynthesis of Secondary Metabolites, Carbohydrate Metabolism, Energy Metabolism, Lipid Metabolism, Metabolism of Cofactors and Vitamins, Metabolism of Other Amino Acids, Nucleotide Metabolism]
JGI summary – A complete table to help in annotating Mlp gene models
Emilie Tisserant & Benoît Hilselberger (INRA Nancy) – Mlp Bioinfo
Yao-Cheng Lin (VIB, Ghent, BE) – EuGene prediction, Mlp gene families
Mlp 98AG31
Marie-Pierre Oudot-Le Secq (INRA Nancy) – early EuGene gene prediction
the 'bad guy' genomic team at INRA
UMR 1136 IAM Duplessis Sébastien & Francis Martin