Eukaryotic Genome Annotation



Post on 08-Jan-2016



1 Bioinformatics & Evolutionary Genomics Division, Plant Systems Biology, VIB/Ugent, Technologiepark 927, B-9052 Gent, Belgium

2 INRA-associated to Bioinformatics & Evolutionary Genomics Division, Plant Systems Biology, VIB/Ugent, Technologiepark 927, B-9052 Gent, Belgium

E-mail: [email protected] URL: http://bioinformatics.psb.ugent.be/

Lieven Sterck¹, Stéphane Rombauts¹, Jeffrey Fawcett¹, Yao-Cheng Lin¹, Steven Robbens¹, Jan Wuyts¹, Francis Dierick¹, Pierre Rouzé² and Yves Van de Peer¹

References

1: Schiex T, Moisan A, and Rouzé P. (2001) EuGène: An Eukaryotic Gene Finder that combines several sources of evidence. Computational Biology, Eds. O. Gascuel and M-F. Sagot, LNCS 2066, pp. 111-125

2: Tuskan et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray ex Brayshaw). Science 313, 1596-1604

3: Derelle et al. (2006) Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. Proc. Natl. Acad. Sci. USA 103, 11647-11652

This work is supported by the European Commission (QLRI-CT-2001-00006)

Gene prediction and genome annotation have always been among the main research topics of our group. Over the past years we have demonstrated the strength of our annotation platform and built a reputation in the field of genome annotation through a number of collaborative efforts to annotate newly sequenced plant genomes. Although we are still involved in several annotation projects for higher plants, we are increasingly asked to produce automatic genome annotations for a broader diversity of eukaryotic genomes, such as fungi and algae.

Introduction

Raw sequence data is of little direct use to biologists. To be meaningful, it has to be converted into biologically significant knowledge: markers, genes, RNAs, protein sequences. Genome annotation is the first step toward acquiring this knowledge.

A thorough annotation must take into account:

• similarities with known sequences (proteins, ESTs, other genomes, …)

• region content analysis

• signal prediction software (ATG, splice sites)

• integrated prediction tools (GenScan, FgenesH, …)

• all available significant biological knowledge

[Figure: annotation workflow. Labels: Genomic sequence; Intrinsic approaches; Extrinsic approaches; RepeatMasker, Blastn, Blastx; EuGene; Predicted genes (structural annotation).]

Example genomic sequence shown in the figure:

ATCCGTAAGATGGTGCGATGCCCTAAATGGGTCGGTTTATAAAGGCGCGTAGGTAAGTGCAATTTATTCTTCAAGTTCCGAATTTTATATGCGCATATCGTCAGTTCTTCTGTTGCAGTTGGCGCACTTGGACTACCTGCAATTTATTCTTCAAGTTCCGAATTTTATAT

Example predicted gene structures shown in the figure:

join(9265..9395,9749..99342)

complement(join(10164..10295,10349..10420,10467..10514,10566..10626,10681..10770,10823..10949,11001))
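The predicted gene structures in the figure are written as feature locations with join(...) and complement(...) operators, in the style of the GenBank/EMBL feature table. As a minimal sketch of how such a string can be turned into exon coordinates, the Python function below handles only the forms that appear on this poster (plain ranges, single positions, join and complement). The function name parse_location and the returned tuple layout are assumptions made for illustration; they are not part of EuGene.

```python
import re
from typing import List, Tuple


def parse_location(location: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Parse a feature location into (strand, [(start, end), ...]).

    Hypothetical helper: supports only 'a..b', single positions,
    join(...) and complement(...), which are the forms shown on the poster.
    """
    text = location.strip().rstrip(".")  # tolerate a trailing full stop
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]

    segments = []
    for part in text.split(","):
        match = re.fullmatch(r"(\d+)(?:\.\.(\d+))?", part.strip())
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        start = int(match.group(1))
        # A single position such as '11001' is treated as a 1-base segment.
        end = int(match.group(2)) if match.group(2) else start
        segments.append((start, end))
    return strand, segments


forward = parse_location("join(9265..9395,9749..99342)")
reverse = parse_location(
    "complement(join(10164..10295,10349..10420,10467..10514,"
    "10566..10626,10681..10770,10823..10949,11001))"
)
print(forward)
print(reverse[0], len(reverse[1]), "segments")
```

The coordinates are taken verbatim from the poster and are not corrected or validated here; a real pipeline would more likely rely on an established parser than on a hand-written one.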

EuGene is developed by T. Schiex and co-workers (INRA-Toulouse, France) in cooperation with our group.

Strengths of EuGene

• can be specifically adapted to the particularities of newly sequenced genomes, which leads to higher-quality predictions

• exploits probabilistic models, such as Markov models, to discriminate coding from non-coding sequences

• integrates information from several signal prediction programs (splice sites, translation start, ...) and other third-party software

• exploits the wealth of existing sequences (mRNA, 5'/3' EST pairs, proteins, homologous genomic sequences)

• integrates each source of information through small independent software components, called "plugins"
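To illustrate how a Markov model can discriminate coding from non-coding sequence, the sketch below trains two fixed-order Markov models and scores a sequence by their log-likelihood ratio. This is a deliberate simplification: it uses a plain second-order model with pseudocounts rather than the interpolated Markov models (IMMs) named on the poster, the training sequences are invented toy data, and all function names are hypothetical. Input is assumed to be uppercase A, C, G, T only.

```python
import math
from collections import defaultdict
from itertools import product


def train_markov(sequences, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0
    model = {}
    for context in ("".join(p) for p in product("ACGT", repeat=order)):
        total = sum(counts[context].values()) + 4 * pseudocount
        model[context] = {
            base: (counts[context][base] + pseudocount) / total for base in "ACGT"
        }
    return model


def log_likelihood(seq, model, order=2):
    """Sum of log transition probabilities along the sequence."""
    return sum(
        math.log(model[seq[i - order:i]][seq[i]]) for i in range(order, len(seq))
    )


def coding_log_odds(seq, coding_model, noncoding_model, order=2):
    """Positive values favour the coding model, negative the non-coding one."""
    return log_likelihood(seq, coding_model, order) - log_likelihood(
        seq, noncoding_model, order
    )


# Toy training data (made up for illustration, not real annotations).
coding_train = ["ATGGCGGCGGCGGCGGCGTAA", "ATGGCCGCGGCGGCCGCGTGA"]
noncoding_train = ["ATATATTTATATAAATATTTA", "TTTATATAAATTTATATATAT"]

coding_model = train_markov(coding_train)
noncoding_model = train_markov(noncoding_train)

gc_rich_score = coding_log_odds("GCGGCGGCG", coding_model, noncoding_model)
at_rich_score = coding_log_odds("ATATATATA", coding_model, noncoding_model)
print(gc_rich_score > 0, at_rich_score < 0)
```

In a real annotation setting the models would be trained on curated coding, intron and intergenic sequences from the genome of interest, which is one way a predictor can be adapted to a newly sequenced genome.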

The EuGene Annotation Platform

• each base of the genomic sequence is represented individually (nodes)

• weighting, removal and addition of edges according to available information

• shortest path in the graph = a possible gene structure

Based on all the available information, EuGene will output a prediction of maximal score, i.e. maximally consistent with the provided information.
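The graph formulation above can be sketched as a dynamic program over a layered graph: one layer per base, one node per annotation state, and edge weights that encode the available evidence. The code below is a toy illustration under stated assumptions, not EuGene's actual implementation: it uses three hypothetical states, treats scores as costs (for example negative log scores, so the shortest path is the best-scoring prediction), and uses a single position-independent switch cost where a real system would use position-specific signal scores.

```python
import math

STATES = ("intergenic", "exon", "intron")


def best_structure(content_cost, switch_cost):
    """Find the cheapest state labelling of a sequence.

    content_cost[i][state]: cost of labelling base i with `state`.
    switch_cost[(prev, nxt)]: cost of changing state between adjacent bases;
    pairs missing from the dict are forbidden. Staying in a state is free.
    """
    n = len(content_cost)
    best = [{s: math.inf for s in STATES} for _ in range(n)]
    back = [{s: None for s in STATES} for _ in range(n)]
    for state in STATES:
        best[0][state] = content_cost[0][state]

    for i in range(1, n):
        for state in STATES:
            for prev in STATES:
                step = 0.0 if prev == state else switch_cost.get((prev, state), math.inf)
                total = best[i - 1][prev] + step + content_cost[i][state]
                if total < best[i][state]:
                    best[i][state] = total
                    back[i][state] = prev

    # Trace the shortest path back from the cheapest final node.
    state = min(STATES, key=lambda s: best[n - 1][s])
    cost = best[n - 1][state]
    path = [state]
    for i in range(n - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    path.reverse()
    return path, cost


def toy_costs(cheap_state):
    """Per-base costs for the toy example: one state is cheap, the others expensive."""
    return {s: (0.1 if s == cheap_state else 2.0) for s in STATES}


# Eight bases: content evidence favours intergenic, then exon, then intergenic.
content = [toy_costs("intergenic")] * 2 + [toy_costs("exon")] * 4 + [toy_costs("intergenic")] * 2
switches = {
    ("intergenic", "exon"): 1.0,
    ("exon", "intergenic"): 1.0,
    ("exon", "intron"): 1.0,
    ("intron", "exon"): 1.0,
}

path, cost = best_structure(content, switches)
print(path, round(cost, 3))
```

Adding, removing or reweighting entries in the two cost tables plays the role of the "weighting, removal and addition of edges" step: new evidence changes the graph, and the shortest path is simply recomputed.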

[Figure: EuGene platform schematic. Labels: Start sites; Splice sites; SpliceMachine; Content potential for coding, intron and intergenic (Coding IMM, Intron IMM, Intergenic IMM).]

Schematic representation of the EuGene platform. Depicted above is the basic set-up of EuGene; this scheme can be modified according to the genome to be annotated and the available data.

Information incorporation

Try to automate this as much as possible through the use of annotation platforms.