Annotation of Anopheline Genomes at VectorBase Dan Lawson, VectorBase & The Anopheles Genomes Cluster Consortium EMBL-EBI

Post on 29-Dec-2015

Page 1:

Annotation of Anopheline Genomes at VectorBase

Dan Lawson, VectorBase & The Anopheles Genomes Cluster Consortium

EMBL-EBI

Page 2:

Anopheline species in this study: Current status

Genome sequencing

• 9 of 16 species assembled and annotated

RNAseq

• 10 of 12 species sequenced

Isolate re-sequencing

• 12 of 12 species sequenced

Page 3:

Genome annotation

• First-pass genome annotation is almost always based on “automatic” computational approaches

• ab initio

• Similarity based

• Transcript (ESTs, RNAseq)

• Protein (nr protein database)

Page 6:

Genome annotation - building a pipeline

• Genome assembly

• Map repeats

• Genefinding

• Protein-coding genes (map transcripts, map peptides)

• nc-RNAs

• Functional annotation

• Submission to archival databases (release)

Page 7:

Automatic annotation strategies

• similarity

• ab initio

Page 8:

Genome annotation: resources

• ab initio predictions using SNAP and Augustus

• Mixed whole animal RNAseq datasets generated using Illumina sequencing

• Assembled using Trinity (Broad Institute)

• Many dipteran proteomes (including 4 mosquitoes & D. melanogaster)

• All arthropod/metazoan proteomes

Page 9:

MAKER annotation with RNAseq and reference proteomes

• Aim:

• Gene prediction aggregation for the masses.

• Used for a number of arthropod genome projects

• Touted as the default pipeline for many more (part of the GMOD toolkit)

• Overview

• ab-initio gene predictions from SNAP, Augustus & FGENESH

• Final gene models from MAKER

• Similarity alignments from both EXONERATE and BLAST

• Repeats from RepeatFinder & RepeatMasker

• Additional data sets integrated via GFF3 files (RNA-Seq)

• Uses MPI for parallelization over a compute farm

• Summary

• Iterative runs give acceptable reference gene sets.

• Used for Heliconius, Glossina, sandflies and the first tranche of the 16 Anophelines

Page 10:

Current VectorBase annotation pipeline

• MAKER based automatic annotation

• includes SNAP training and ab initio

• RNAseq based transcript similarity prediction

• Taxonomically constrained peptide similarity prediction

• 2 rounds of prediction refinement & final round includes all peptide similarity

• Community annotation phase

• Capture gene structure changes

• Metadata associated with locus (symbol, description, citation)

• Submission to INSDC, propagation to UniProt

• Presentation through VectorBase

Gene set milestones: Start, 1.0 set (automatic), 1.1 set (published)

Page 11:

Projection from a reference annotation

Page 12:

Gene prediction based on projection from reference annotation

• Local alignment of An. gambiae CDS to the assemblies provides a platform for improving gene predictions.

• Example locus: Rps7 (AGAP008916)

• Potential for transcript based assembly improvement via seqedits of genome sequence
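As a rough illustration of what "local alignment" means in the projection step above, the sketch below scores the best local match between a short query and a longer target using the Smith-Waterman recurrence. It is not the tool used at VectorBase (the slides do not name the aligner for this step); the function name, the toy sequences and the scoring values are all assumptions chosen for the example.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Scoring values are illustrative assumptions, not parameters from the slides.
    Only the score is computed (no traceback), using two rows of the matrix.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 in a local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            score = max(0, diag, up, left)  # 0 lets an alignment restart anywhere
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Toy example: a 4-base "CDS" fragment found inside a longer "scaffold".
    print(smith_waterman_score("ACGT", "TTACGTTT"))
```

A real projection would align full CDS sequences with a splice-aware aligner and keep the coordinates, not just the score; the sketch only shows why a local (rather than end-to-end) alignment can place a coding sequence inside a much longer scaffold.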

Page 13:

Annotation: Preliminary genesets

• 10,738 - 13,162 predictions

• no ncRNAs yet predicted

Page 14:

Preliminary comparative analysis

• OrthoMCL runs including 17 species

• An. gambiae PEST 12,810 protein-coding genes

Figure labels: An. darlingi, Glossina morsitans, Lutzomyia longipalpis, Phlebotomus papatasi

Page 20:

Preliminary comparative analysis

• OrthoMCL runs including 17 species

• No. of clusters containing all 13 mosquitoes: 4961 (≃ 39%)

• No. of clusters containing all 11 Anophelines: 5463 (≃ 43%)

• No. of clusters containing 10 Anophelines (minus darlingi): 6606 (≃ 52%)

• No. of clusters containing 9 Anophelines (minus darlingi & christyi): 7477 (≃ 58%)

• No. of clusters containing representatives of the gambiae complex (ar/ga/qu): 9089 (≃ 71%)

• No. of clusters containing 8 Anophelines (minus darlingi & christyi) but not gambiae: 600

Figure labels: An. darlingi, Glossina morsitans, Lutzomyia longipalpis, Phlebotomus papatasi
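The slides do not say what the percentages are measured against. One reading that reproduces every quoted figure is that each cluster count is divided by the 12,810 protein-coding genes given for An. gambiae PEST; the sketch below checks that arithmetic and also shows one way to count clusters that contain a required set of species. The denominator choice, the function names and the toy cluster data are assumptions for illustration and do not reflect OrthoMCL's own output format.

```python
# Assumption: percentages are relative to the An. gambiae PEST gene count
# quoted elsewhere in the slides (12,810 protein-coding genes).
REFERENCE_GENE_COUNT = 12810

# Cluster counts copied from the slide.
CLUSTER_COUNTS = {
    "all 13 mosquitoes": 4961,
    "all 11 Anophelines": 5463,
    "10 Anophelines (minus darlingi)": 6606,
    "9 Anophelines (minus darlingi & christyi)": 7477,
    "gambiae complex (ar/ga/qu)": 9089,
}


def percent_of_reference(count, reference=REFERENCE_GENE_COUNT):
    """Express a cluster count as a whole-number percentage of the reference."""
    return round(100 * count / reference)


def count_clusters_with_all(clusters, required_species):
    """Count clusters whose member species include every required species.

    `clusters` is an iterable of species collections, one per cluster
    (a hypothetical in-memory representation).
    """
    required = set(required_species)
    return sum(1 for members in clusters if required <= set(members))


if __name__ == "__main__":
    for label, count in CLUSTER_COUNTS.items():
        print(f"{label}: {count} ({percent_of_reference(count)}%)")

    # Toy clusters using the slide's abbreviations for the gambiae complex.
    toy_clusters = [{"ar", "ga", "qu"}, {"ar", "ga"}, {"ar", "ga", "qu", "darlingi"}]
    print(count_clusters_with_all(toy_clusters, ["ar", "ga", "qu"]))
```

Under this assumption the rounded values come out as 39, 43, 52, 58 and 71, matching the slide.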

Page 21:

All genomes deserve a home

• Genome browser

• Similarity searches

• BLAST/BLAT

• Query tools

• Simple keyword

• Complex queries

• Downloads

Screenshot labels: Similarity searches, Query tool, Downloads, Browser, Compara

Page 22:

VectorBase

• Long term home for these genomes is VectorBase.

• NIAID-funded Bioinformatics Resource Center focused on arthropod vectors of human pathogens

• Ensembl genome browser

• Similarity searches

• File downloads

Page 23:

Anopheles Genomes Cluster wiki site

Page 24:

Thematic analysis groups & community annotation

• Community led annotation of the genomes using the Community Annotation Portal (CAP)

Page 25:

Community annotation decision tree

Page 29:

Community annotation workflow

Steps: Identify gene, Modify model, Submit (CAP)

Editors: ARTEMIS, APOLLO

File formats: GFF3, FASTA

GFF3 example (before editing):

scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
scf7180000638805 ptn2genome ptn_match 52 305 696 + . ID=xxxx3;Name=sp|Q91VD9|
scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|

GFF3 example (after editing):

scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|

FASTA example:

>MY SUPERCONTIG
ATATATGCGTTGAGCTGCGTTACGTTCGGGATGCGTTAGGCTTGTGAGCTGGATCGGTCCTGCCTGCGTCGATATAAACGACCT…
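To make the GFF3 side of the workflow concrete, here is a minimal sketch of reading one nine-column feature line such as those shown on the slide. It is not CAP's own parser or validation logic (which the slides do not describe); the `Feature` type and `parse_gff3_line` function are hypothetical names. Real GFF3 is tab-delimited; the sketch splits on any whitespace only so that it also accepts the space-separated lines as they appear in this transcript.

```python
from typing import Dict, NamedTuple


class Feature(NamedTuple):
    """One GFF3 feature line (the nine standard columns)."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Feature:
    """Parse a single GFF3 feature line into a Feature.

    Assumption: columns are split on any whitespace (tabs in real GFF3,
    spaces in the slide transcript); the ninth column keeps the rest.
    """
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    start, end = int(fields[3]), int(fields[4])
    if start > end:
        raise ValueError("start must not be greater than end")
    attributes = {}
    for pair in fields[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return Feature(fields[0], fields[1], fields[2], start, end,
                   fields[5], fields[6], fields[7], attributes)


if __name__ == "__main__":
    example = ("scf7180000638805 ptn2genome ptn_match 52 605 892 + . "
               "ID=xxxx;Name=tr|Q3UIQ2|")
    feature = parse_gff3_line(example)
    print(feature.seqid, feature.start, feature.end, feature.attributes["Name"])
```

Checks like the column count and the start/end ordering are the kind of simple structural test a submission could fail on, but which checks CAP actually applies is not stated in the slides.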

Page 30:

CAP reporting

• Email report back to submitter to show status

• If successful then the model is stored in a local database and then presented to the genome browser via DAS

• Failed submissions have (some) information as to why. Submitters then need to correct these errors and re-submit

Page 31:

CAP submissions displayed in the genome browser

• Similarity track for supporting evidence (from previous updates)

Page 32:

Genome annotation metrics

• Metrics for quality of a gene set are far from standardised but...

• Simple statistics (length, number of exons, intron size)

• Level of support from transcript data (how many genes have overlapping EST/RNAseq)

• Junction data (confirmation of introns)

• Comparison to public datasets (UniProt)

• Protein domains (InterPro)

• Comparative analysis - orthologs/paralogs
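Two of the metrics listed above are easy to sketch: the simple per-model statistics (exon count, spliced length, intron sizes) and the fraction of genes with overlapping transcript evidence. The code below is a minimal illustration on toy coordinates, not the metrics code used at VectorBase. The function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and all intervals are assumed to lie on the same scaffold.

```python
def gene_model_stats(exons):
    """Summarise one transcript given its exons as (start, end) pairs.

    Assumption: coordinates are 1-based and inclusive, and exons do not overlap.
    """
    exons = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in exons]
    # An intron is the gap between the end of one exon and the start of the next.
    intron_sizes = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]
    return {
        "exon_count": len(exons),
        "spliced_length": sum(exon_lengths),
        "intron_sizes": intron_sizes,
    }


def fraction_supported(genes, evidence):
    """Fraction of gene spans overlapped by at least one evidence interval.

    `genes` and `evidence` are lists of (start, end) pairs on one scaffold;
    evidence could be EST or RNAseq alignments.
    """
    if not genes:
        return 0.0

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    supported = sum(1 for g in genes if any(overlaps(g, e) for e in evidence))
    return supported / len(genes)


if __name__ == "__main__":
    print(gene_model_stats([(100, 200), (301, 400), (501, 520)]))
    print(fraction_supported([(1, 100), (200, 300)], [(90, 120)]))
```

The quadratic overlap test is fine for a toy example; a genome-scale comparison would normally sort the intervals or use an interval index.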

Page 33:

Still to do...

Primary annotation

• Still 7 genomes outstanding from the Broad Institute - de novo repeat finding and MAKER annotation

Analysis

• Whole genome alignments (and the 12 Drosophilid analysis pipelines from the Kellis group - Rob Waterhouse)

• Data presentation (Trinity clusters, correlation with legacy Hittinger clusters, Velvet-assembled 37 bp reads)

• Variation (SNP calls) from each of the 16 species

Other genomes

• New version of the An. darlingi genome (Osvaldo Marinotti, recently published in NAR)

• New version of the Indian strain of An. stephensi (Jake Tu)

Page 34:

Acknowledgements

EMBL-EBI: Daniel Lawson, Gareth Maslen, Mikkel Christensen, Nick Langridge, Derek Wilson, Gautier Koscielny, Karyn Megy, Martin Hammond, Daniel Hughes, Ewan Birney, Paul Kersey

Imperial College: Fotis Kafatos, Bob MacCallum, George Christophides, Seth Redmond, Timo Tiirikka

Notre Dame: Frank Collins, Greg Madey, Rob Bruggner, Nate Konopinski, EO Stinson, Scott Emrich, Andrew Sheehan, Rory Carmichael, Dave Cieslak, Dave Campbell, Ryan Butler, Katie Cybulski, Neil Lobo, Gloria Calderon, Greg Davis

Harvard: Bill Gelbart, Susan Russo, Dave Emmert, Pinglei Zhou, Lynn Crosby, Kathy Campbell

IMBB: Kitsos Louis, Pantelis Topalis, Emmanuel Dialynas, Vicky Dritsou

New Mexico: Maggie Werner-Washburne, Phil Baker

Sequencers: TIGR/JCVI, WashU, Broad Institute, Baylor College

Ensembl Genomes

Dan Neafsey, Brian Haas

Nora Besansky, Michael Fontaine

Michael Nuhn

Rob Waterhouse, Paul Howell

Page 36:

Anopheles Genomes Cluster Consortium

Steering committee

Community liaisons