Variation data in VectorBase NIH/NIAID VectorBase site visit March 2015

Upload: michael-baldwin

Post on 14-Jan-2016

TRANSCRIPT

Page 1: Variation data in VectorBase NIH/NIAID VectorBase site visit March 2015

Variation data in VectorBase

NIH/NIAID VectorBase site visit, March 2015

Page 2

Variation data types

VectorBase captures both sequence and structural variations (stable chromosomal inversions in An. gambiae)

The bulk of the data is sequence variants, primarily SNPs, based on high-throughput genomic re-sequencing of isolates. Data is formatted using VCF (Variant Call Format).
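As an illustration of the VCF layout mentioned above, here is a minimal sketch that reads the fixed leading columns (CHROM, POS, ID, REF, ALT) of tab-separated VCF data lines and keeps only single-base substitutions. The example records are invented for illustration and are not VectorBase data; a production pipeline would more likely use a dedicated VCF library.

```python
from typing import NamedTuple, Tuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: Tuple[str, ...]


def parse_vcf_line(line: str) -> Variant:
    """Parse the first five tab-separated columns of one VCF data line."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return Variant(chrom, int(pos), ref, tuple(alt.split(",")))


def is_snp(v: Variant) -> bool:
    """True when REF and every ALT allele are a single nucleotide."""
    bases = {"A", "C", "G", "T"}
    return v.ref in bases and all(a in bases for a in v.alts)


# Invented example content (not real data): one SNP and one deletion.
example_vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "2L\t10523\t.\tA\tG\t50\tPASS\t.\n"
    "2L\t10790\t.\tCT\tC\t43\tPASS\t.\n"
)

# Skip meta-information (##) and header (#) lines, then parse the rest.
records = [
    parse_vcf_line(line)
    for line in example_vcf.splitlines()
    if line and not line.startswith("#")
]
snps = [v for v in records if is_snp(v)]
```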

Page 3

Variation data bridges between PopBio and genome browser (Ensembl)

Variants are stored in a MySQL database using the Ensembl RDB schema.

Sample metadata is stored in PopBio (Chado, PostgreSQL).

These are large and complex data sets that depend on the accuracy of the genome assembly and the parameterization of variant-calling algorithms.

Page 4

Summary of VectorBase variation datasets (2015-03)

1) Aedes aegypti contains the latest SNP chip data set from Evans et al., 2015 (PMID:25721127).

2) An. stephensi (Indian strain) is a new database, distinct from that of the SDA-500 strain.

3) Four further variation datasets for An. farauti, An. merus, An. sinensis and An. melas are available and will be loaded after updates to the assemblies for these organisms.

Page 5

Representation in PopBio

Genomic (re)sequencing is an assay type in PopBio

Sample metadata stored in PopBio and BioSamples databases

MR4 colony sequencing (VBP0000002)

Page 6

Variant presentation @ VectorBase

Page 7

Variant presentation @ VectorBase

Page 8

Variant presentation @ VectorBase

Page 9

Querying and using variation data @ VectorBase

Browser tracks

Page 10

Querying and using variation data @ VectorBase

Browser tracks

BioMart datasets

Page 11

Querying and using variation data @ VectorBase

Browser tracks

BioMart datasets

Sample metadata

Page 12

Querying and using variation data @ VectorBase

Browser tracks

BioMart datasets

Sample metadata

VEP tool (Variant Effect Predictor)

Page 13

Internal VectorBase variation + PopBio dataflows.

Diagram labels: VCF; ISA-TAB; sample + variation set IDs; Ensembl variation database; PopBio; display of variant data in genomic context; display of detailed sample metadata, e.g. geodata.
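The role of shared sample IDs in this dataflow can be sketched as a simple join: variant calls carry a sample ID, and the same ID is the key to the sample's metadata. The record layout, field names, and values below are hypothetical and invented for illustration; they do not reflect the actual Ensembl or Chado schemas.

```python
def attach_metadata(calls, metadata):
    """Merge each variant call with its sample's metadata.

    Returns the annotated calls and a sorted list of sample IDs that
    had no metadata entry, so broken links are reported, not hidden.
    """
    annotated, missing = [], set()
    for call in calls:
        meta = metadata.get(call["sample_id"])
        if meta is None:
            missing.add(call["sample_id"])
            continue
        annotated.append({**call, **meta})
    return annotated, sorted(missing)


# Hypothetical variant calls (variation database side).
calls = [
    {"variant": "2L:10523:A>G", "sample_id": "sample_A"},
    {"variant": "2L:10523:A>G", "sample_id": "sample_B"},
    {"variant": "3R:771:C>T", "sample_id": "sample_C"},
]

# Hypothetical sample metadata (PopBio side), keyed by the same IDs.
metadata = {
    "sample_A": {"latitude": 12.5, "longitude": -8.0},
    "sample_B": {"latitude": 0.3, "longitude": 32.6},
}

annotated, missing = attach_metadata(calls, metadata)
```

Reporting unmatched IDs separately is a design choice in this sketch: it makes sample-tracking gaps visible before data is displayed.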

Page 14

Use of Apache Solr to provide unified variation search across VectorBase site.

Diagram labels: VCF; ISA-TAB; Ensembl variation database; PopBio; display of variant data in genomic context; display of detailed sample metadata, e.g. geodata.
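To make the idea of a single search entry point concrete, the following sketch builds a query URL for Solr's standard `/select` handler using the common `q`, `rows`, and `wt` parameters. The host, the core name `variation`, and the field name `species` are assumptions made for illustration; the slides do not describe the actual VectorBase Solr configuration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_solr_query(base_url: str, core: str, text: str, rows: int = 10) -> str:
    """Build a URL for Solr's /select handler, requesting a JSON response."""
    params = {"q": text, "rows": rows, "wt": "json"}
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names.
url = build_solr_query(
    "http://localhost:8983/solr",
    "variation",
    'species:"Anopheles gambiae" AND SNP',
)

# Decode the query string again to confirm the parameters round-trip.
parsed = parse_qs(urlsplit(url).query)
```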

Page 15

Identification and management of redundant variant records via MongoDB NoSQL db.

Slide courtesy of Christoph Grabmüller – Ensembl Genomes 2014.
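One way to think about redundant variant records is to group them by a normalized key. The sketch below does this in memory with plain Python, not MongoDB, and it assumes that two records are redundant when they share chromosome, position, reference allele, and alternate allele. Both that key and the record layout are assumptions; the slide does not specify the actual matching rule.

```python
from collections import defaultdict


def group_redundant(records):
    """Group records describing the same allele at the same locus."""
    groups = defaultdict(list)
    for rec in records:
        # Upper-case the alleles so "a" and "A" compare as equal.
        key = (rec["chrom"], rec["pos"], rec["ref"].upper(), rec["alt"].upper())
        groups[key].append(rec["source"])
    return dict(groups)


# Invented records from two hypothetical submissions.
records = [
    {"chrom": "2L", "pos": 10523, "ref": "A", "alt": "G", "source": "dataset_1"},
    {"chrom": "2L", "pos": 10523, "ref": "a", "alt": "g", "source": "dataset_2"},
    {"chrom": "3R", "pos": 771, "ref": "C", "alt": "T", "source": "dataset_1"},
]

groups = group_redundant(records)
# Keep only the keys reported by more than one source.
redundant = {key: sources for key, sources in groups.items() if len(sources) > 1}
```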

Page 16

VectorBase interactions with external data sources relevant to variation studies.

Diagram labels: dbSNP; EVA; BioSamples; Community; VectorBase; initial submission of variation data (multiple formats); VCF format data; sample metadata (ontology compliant); long-term variation archive; ongoing curation of data, either solely by community or in collaboration with VectorBase.

Page 17

EVA - long-term storage of variant data

• Processing of variation data for VectorBase species by dbSNP is too slow to be useful (>1 year).

• VectorBase accepts community variation data submissions and processes these rapidly (this involves active collaboration with submitters to convert submissions into suitable data formats and link entries to ontologies and other metadata tracking systems).

• Store submitted variation data long term as VCF files in the European Variation Archive (www.ebi.ac.uk/eva).

• EVA to broker submission of VCF data to dbSNP, which can then resolve duplicate submissions and allocate persistent IDs that can be reincorporated into VectorBase variation records.

• VectorBase has submitted data for Anopheles coluzzii + Anopheles gambiae to EVA. The Anopheles 16 genomes data will be submitted to the archive once remaining sample tracking issues have been resolved.

Page 18

BioSample - sample metadata

• Joint EBI/NCBI database that stores submitter-supplied data relating to samples used in other primary NCBI archives such as

– SRA (Sequence Read Archive)

– dbGaP (Genotypes and Phenotypes database)

– GenBank

• VectorBase works with community members to ensure sample metadata is captured and tagged with appropriate ontology terms (e.g. “A Multipurpose High Throughput SNP Chip for the Dengue and Yellow Fever Mosquito, Aedes aegypti.” Evans et al. 2015 PMID:25721127).

• Joint VectorBase/researcher submitters allow samples to be curated by the community.

Page 19

Future plans

Consolidation:

• Continue to broker and improve sample metadata submissions with BioSamples

• Work with EVA developers to pilot VCF brokerage with dbSNP

• Improve “Sample picker” interface

New data:

• MalariaGEN 1000 Anopheles project (AR2 data release)

• Individuals from 8 countries, min. 30 samples per population, est. 50-100 million variants

Data queries:

• Increase use of Solr to replace and augment BioMart functionality

• Use of other database solutions for specific queries (e.g. MongoDB)