Post on 05-Jan-2016

Importing community annotations into VectorBase

Aims

• Provide the VectorBase community with tools for improving genome annotation.

• Tools must have low entry requirements, be scalable and be (relatively) simple to use

Genome annotation

• First-pass genome annotation is almost always based on “automatic” computational approaches

• ab initio

• Similarity based

• Transcript (ESTs, RNAseq)

• Protein (nr protein database)

Genome annotation - building a pipeline

[Figure: pipeline flowchart. Box labels: Genome assembly; Map Repeats; Genefinding; Protein-coding genes; Map Transcripts; Map Peptides; nc-RNAs; Functional annotation; Submission to archival databases (Release)]

Current VectorBase annotation pipeline

• MAKER based automatic annotation

• includes SNAP training and ab initio

• RNAseq based transcript similarity prediction

• Taxonomically constrained peptide similarity prediction

• 2 rounds of prediction refinement & final round includes all peptide similarity

• Community annotation phase

• Capture gene structure changes

• Metadata associated with locus (symbol, description, citation)

• Submission to INSDC, propagation to UniProt

• Presentation through VectorBase

[Figure labels: Start; 1.0 set (automatic); 1.1 set (published)]

Processing submissions

• 4 phases

• Capture

• Moderation

• Storage

• Integration

Capture: Community annotation decision tree

Tool of choice: WebApollo

• Web-based

• Eliminates main drawback of deprecated CAP system - GFF3 format validation
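To illustrate the kind of format check that submitters to a GFF3-based system had to pass, here is a minimal sketch of a per-line structural check. It is not the CAP validator; the function name `check_gff3_line` is hypothetical, and only a handful of rules from the GFF3 specification are covered (nine tab-separated columns, 1-based coordinates with start <= end, a valid strand, a phase on CDS features, tag=value attributes).

```python
def check_gff3_line(line):
    """Return a list of problems found in one GFF3 feature line (empty if none).

    Hypothetical helper for illustration; covers only a few GFF3 rules.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated columns, found {len(cols)}"]

    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    problems = []

    # Coordinates are 1-based integers with start <= end.
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("start must be >= 1 and <= end")

    if strand not in {"+", "-", ".", "?"}:
        problems.append(f"invalid strand {strand!r}")

    # Phase is required for CDS features.
    if ftype == "CDS" and phase not in {"0", "1", "2"}:
        problems.append("CDS features need a phase of 0, 1 or 2")

    # Attributes are ';'-separated tag=value pairs.
    if attrs != "." and any("=" not in a for a in attrs.split(";") if a):
        problems.append("attributes must be tag=value pairs separated by ';'")

    return problems


if __name__ == "__main__":
    line = "2L\tVectorBase\tCDS\t500\t100\t.\t+\t.\tID=cds1"
    for problem in check_gff3_line(line):
        print(problem)
```

A web-based editor avoids this step for the user because the gene model is drawn in the browser and the file is never written by hand.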

WebApollo example

Tool of choice: Web forms

Moderation & Storage

• Gene metadata captured through forms to spreadsheets

• Batch submissions use similar spreadsheet format
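As a rough sketch of how spreadsheet rows might be screened during moderation, the example below reads a tab-separated export and separates usable rows from incomplete ones. The column names (`gene_id`, `symbol`, `description`, `citation`), the acceptance rule and the function name `moderate_rows` are assumptions made for illustration; the actual spreadsheet layout is not given in the slides.

```python
import csv
import io

# Hypothetical column names; the real spreadsheet layout is not documented here.
FIELDS = ("gene_id", "symbol", "description", "citation")
METADATA = ("symbol", "description", "citation")


def moderate_rows(handle):
    """Split tab-separated submission rows into (accepted, rejected).

    Assumed rule: a row needs a gene ID plus at least one metadata value.
    """
    accepted, rejected = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Normalise: missing cells become empty strings, whitespace is trimmed.
        clean = {name: (row.get(name) or "").strip() for name in FIELDS}
        if not clean["gene_id"]:
            rejected.append((clean, "missing gene_id"))
        elif not any(clean[name] for name in METADATA):
            rejected.append((clean, "no symbol, description or citation"))
        else:
            accepted.append(clean)
    return accepted, rejected


if __name__ == "__main__":
    sheet = io.StringIO(
        "gene_id\tsymbol\tdescription\tcitation\n"
        "GENE0001\tabc1\t\t\n"
        "\tfoo\tbar\t\n"
    )
    ok, bad = moderate_rows(sheet)
    print(len(ok), "accepted;", len(bad), "rejected")
```

Using one format for both form submissions and batch submissions means a single check like this can serve both routes.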

Integration: Dataflow for ‘patch’ build

[Figure: dataflow diagram for the patch build. Labels: CAP GFF3; WebApollo; Reference core; Updated geneset; TXT; Patch; Users; Stable IDs; Reports; Updated core; IDs; CAP; Release core; Google Fusion Table; Xrefs; Release; Google Form; Metadata; Commit]

Presentation of community annotation

Usage (as of 2015-03-30)

• 31 WebApollo instances (Organisms)

• 3,407 gene models

• Gene metadata (protein-coding loci)

• 4,987 gene symbols

• 512 gene synonyms

• 57,878 gene descriptions

• 910 loci citations from 208 publications

Supplementing annotations

• Community jamborees

• ‘Standard’ improvement (e.g. Sandfly, snail communities)

• Glossina community (e.g. March 2015, Kenya)

• VectorBase

• Default Xref run includes symbol/description assignment via UniProt

• Projection of gene descriptions via orthology from key marker species (e.g. An. gambiae). Due to be deployed for the June (VB-2015-06) release.

• Supplemental data from genome papers (e.g. 16 Anopheles spp, Musca)
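The projection of gene descriptions via orthology mentioned above could be sketched as follows. This is an illustration under stated assumptions, not the VectorBase implementation: the function name `project_descriptions`, the dictionary inputs, and the rule of projecting only across one-to-one orthologs and never overwriting an existing description are all hypothetical choices.

```python
def project_descriptions(target_descriptions, orthologs, reference_descriptions):
    """Suggest descriptions for undescribed target genes from a reference species.

    target_descriptions:    {target_gene_id: description or ""/None}
    orthologs:              {target_gene_id: [reference_gene_id, ...]}
    reference_descriptions: {reference_gene_id: description}

    Assumed policy: project only across one-to-one orthologs and never
    overwrite a description the target gene already has.
    """
    projected = {}
    for gene, description in target_descriptions.items():
        if description:
            continue  # keep existing annotation
        refs = orthologs.get(gene, [])
        if len(refs) != 1:
            continue  # no ortholog, or an ambiguous one-to-many relationship
        ref_description = reference_descriptions.get(refs[0])
        if ref_description:
            projected[gene] = ref_description
    return projected


if __name__ == "__main__":
    # Made-up identifiers for illustration only.
    target = {"T1": "", "T2": "kinase"}
    orth = {"T1": ["R1"], "T2": ["R2"]}
    ref = {"R1": "odorant receptor", "R2": "protein kinase"}
    print(project_descriptions(target, orth, ref))
```

Returning the projected values separately, rather than editing the input in place, keeps the inferred descriptions distinguishable from community-supplied ones.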

Deprecated CAP system example
