VectorBase PopBio Introduction NIH/NIAID VectorBase site visit March 2015

VectorBase PopBio Introduction

NIH/NIAID VectorBase site visit, March 2015

What is PopBio?

Flexible database for sample and assay metadata for field- or lab-derived population biology data.

● collection event & location (GeoData)
● basic sample information
● assays
  o species identification
  o phenotypes (host species [e.g. from blood meal], insecticide resistance, ...)
  o genotypes
  o manipulations (sampleA + sampleB -> sampleC)

What is it for?

Allows integration of individual studies (e.g. insecticide resistance studies conducted in individual countries).

Enables meta-analysis of community data.

Data sources

Legacy:
● IRbase
● UC Davis/UCLA (but updates planned)

Recent:
● Bulk imports (e.g. Malaria Atlas Project surveillance data)
● Publications (typically with extra data direct from authors)
● MalariaGen & 16 Anopheles
● Other unpublished/in progress

Future data sources

● ICEMRs
● National/international IR surveillance
● MalariaGen
● Partners (Vestergaard, Oxford University MAP)
● Smaller published and unpublished datasets

Data model

GMOD Chado schema

Heavy reliance on CVs/ontologies → flexibility → computability

Vastly oversimplified explanation of schema: projects have samples, which have assays, which have results.
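The "projects have samples have assays have results" hierarchy can be pictured with a few nested record types. This is only a minimal sketch: the class and field names are illustrative assumptions and do not correspond to actual Chado tables or to the PopBio API, and the project ID is a made-up placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Result:
    term: str        # ontology term name, e.g. a species name
    accession: str   # ontology accession for that term


@dataclass
class Assay:
    id: str
    type: str
    results: List[Result] = field(default_factory=list)


@dataclass
class Sample:
    id: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Project:
    id: str
    samples: List[Sample] = field(default_factory=list)

    def all_results(self) -> List[Result]:
        """Walk project -> samples -> assays -> results."""
        return [r for s in self.samples for a in s.assays for r in a.results]


# "EXAMPLE-PROJECT" is a hypothetical placeholder, not a real PopBio project ID.
project = Project("EXAMPLE-PROJECT", [
    Sample("VBS0015615", [
        Assay("VBA0046035", "species identification assay",
              [Result("Anopheles arabiensis", "VBsp:0002224")]),
    ]),
])
```

Calling `project.all_results()` flattens the hierarchy, which is the kind of traversal a meta-analysis across projects would need.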

Ontologies

VectorBase ontologies: insecticide resistance, malaria, dengue & anatomy

Third party ontologies: sample properties, genomic variation types, placenames, phenotypic qualities

Curation and data import

ISA-Tab spreadsheet format

Investigation - Study - Assay

Widely used for 'omics metadata

Ontology-based annotation is well supported

Ontology term suggestion tools available in Google Spreadsheets

Challenges

● consistent representation of data and choice of ontology terms by curator(s) through time

● too complex for casual submitters

ISA-Tab's Study and its associated list of samples maps to PopBio's project and samples, while Assay maps to… assay!
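As a rough illustration of that mapping, the sketch below reads a tiny, made-up ISA-Tab-style study sample table and turns each row into a sample record with ontology-annotated properties. The table contents, the output dictionary layout and the assumption that `Term Source REF` and `Term Accession Number` immediately follow each `Characteristics[...]` column are all illustrative; this is not the PopBio loader.

```python
import csv
import io

# A tiny, made-up study sample table in ISA-Tab style (tab-delimited).
S_SAMPLES = (
    "Source Name\tSample Name\tCharacteristics[sex]\t"
    "Term Source REF\tTerm Accession Number\n"
    "village-1\tG05-2019\tfemale\tPATO\t0000383\n"
)


def read_samples(text):
    """Turn each data row into a sample dict with a list of properties."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(rows)
    samples = []
    for row in rows:
        sample = {"name": None, "props": []}
        for i, (column, value) in enumerate(zip(header, row)):
            if column == "Sample Name":
                sample["name"] = value
            elif column.startswith("Characteristics["):
                # Assumption: the next two columns hold the ontology source
                # and the accession number for this characteristic.
                sample["props"].append({
                    "property": column[len("Characteristics["):-1],
                    "value": value,
                    "accession": f"{row[i + 1]}:{row[i + 2]}",
                })
        samples.append(sample)
    return samples
```

A positional reader is used rather than `csv.DictReader` because ISA-Tab repeats column names such as `Term Source REF`, which a dictionary keyed by header would silently overwrite.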

High level "object relational mapper" Perl API handles storage into and retrieval from Chado database for consistency and maintainability.

Example: a sample may have several species identification assays. Our API provides a method for the sample object which returns the best single species term to summarise those results.
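The slides do not say how the API chooses the "best single species term", so the following is only one plausible rule, sketched in Python rather than Perl: return the most specific result if all other results are its ancestors, otherwise fall back to the nearest term shared by all results. The `PARENT` table is a hypothetical slice of a species taxonomy, and the function names are invented for illustration.

```python
# Hypothetical parent links for a tiny slice of a species taxonomy.
PARENT = {
    "Anopheles arabiensis": "Anopheles gambiae sensu lato",
    "Anopheles gambiae sensu stricto": "Anopheles gambiae sensu lato",
    "Anopheles gambiae sensu lato": "Anopheles",
    "Anopheles": None,
}


def lineage(term):
    """Return the term followed by its ancestors, most specific first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def best_species_term(assay_results):
    """Pick one term that summarises several species identification results."""
    if not assay_results:
        return None
    # The deepest term in the taxonomy is the most specific claim made.
    most_specific = max(assay_results, key=lambda t: len(lineage(t)))
    if all(r in lineage(most_specific) for r in assay_results):
        return most_specific  # every result agrees with, or is broader than, it
    # Conflicting results: climb until a term is shared by every lineage.
    paths = [lineage(r) for r in assay_results]
    for candidate in lineage(most_specific):
        if all(candidate in path for path in paths):
            return candidate
    return None
```

Under this rule a morphological "sensu lato" result plus a PCR result of Anopheles arabiensis summarises to Anopheles arabiensis, while two conflicting species-level results summarise to their shared parent.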

Updating existing data

1. Edit ISA-Tab, delete project and reload project from new ISA-Tab (stable IDs for project, samples and assays are retained)

2. Edit ISA-Tab but apply simple SQL updates or an API script to modify the database (as delete+reload can be slow)

No database → ISA-Tab route at present.

Scalability (storage + maintenance)

Current size: 121 projects, 57,637 samples, 172,636 assays (of which 4,387 are IR)

API overhead ⇒ some tasks take overnight:
● loading for 1000+ sample datasets
● search index generation

No issues yet with maintenance (e.g. backup and transfer of databases).

Scalability (web-based retrieval)

"Dumb" API-based retrieval for "smart" web client (see next slide) is too slow on its own.

Currently using pre-filled RAM-based cache to speed up API requests for web-users. Not necessarily scalable. Still not very fast!
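The idea of a pre-filled RAM-based cache can be sketched as follows. This is a generic illustration, not the VectorBase implementation: `slow_api_fetch` is a hypothetical stand-in for an expensive API or database call, and the class name is invented.

```python
import json


def slow_api_fetch(sample_id):
    """Hypothetical stand-in for an expensive API/database call."""
    return {"id": sample_id, "type": "individual"}


class PrefilledCache:
    """Serialise every known record once, then serve requests from memory."""

    def __init__(self, fetch, ids):
        self._fetch = fetch
        # Warm-up step: pay the API cost up front, before any web request.
        self._store = {i: json.dumps(fetch(i)) for i in ids}
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            # Not pre-filled: fall back to the slow path and remember it.
            self.misses += 1
            self._store[key] = json.dumps(self._fetch(key))
        return self._store[key]


cache = PrefilledCache(slow_api_fetch, ["VBS0015615"])
```

Because every record is held in memory as a serialised string, memory use grows with the number of samples and assays, which is one reason such a cache is "not necessarily scalable".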

See future plans...

{"sample_manipulations":[], "name":"G05-2019",
"species_identification_assays":[{"result_summary":"<span class=\"species_name\">Anopheles arabiensis</span> (PCR-based species identification)", "name":"G05-2019.species", "description":null, "props":[{"cvterms":[{"name":"species assay result", "accession":"VBcv:0000961"}, {"name":"Anopheles arabiensis", "accession":"VBsp:0002224"}]}], "protocols":[{"props":[], "name":"VBA0046035:PROTO2", "type":{"name":"PCR-based species identification", "accession":"MIRO:30000040"}, "description":"Mosquito DNA was extracted from the carcass and identified to species and molecular form using rDNA-based PCR assays.", "uri":""}], "performers":[], "id":"VBA0046035", "type":"species identification assay"}],
"species":{"name":"Anopheles arabiensis", "accession":"VBsp:0002224"},
"description":null,
"genotype_assays":[{"result_summary":"inversion: 2La/a; inversion: 2Rjb/b (cytological chromosome examination)", "genome_browser_path":null, "name":"G05-2019.karyotyping", "description":null, "genotypes":[{"uniquename":"VBA0046036:2La/a", "props":[{"value":"2La/a", "cvterms":[{"name":"inversion", "accession":"SO:1000036"}]}, {"value":"2L", "cvterms":[{"name":"chromosome_arm", "accession":"SO:0000105"}]}], "name":"2La/a", "type":{"name":"paracentric_inversion", "accession":"SO:1000047"}, "description":"inversion: 2La/a"}, {"uniquename":"VBA0046036:2Rjb/b", "props":[{"value":"2Rjb/b", "cvterms":[{"name":"inversion", "accession":"SO:1000036"}]}, {"value":"2R", "cvterms":[{"name":"chromosome_arm", "accession":"SO:0000105"}]}], "name":"2Rjb/b", "type":{"name":"paracentric_inversion", "accession":"SO:1000047"}, "description":"inversion: 2Rjb/b"}], "vcf_file":null, "props":[], "protocols":[{"props":[{"value":"microscope manufacturer: Olympus", "cvterms":[{"name":"protocol component", "accession":"VBcv:autocreated:protocol component"}]}, {"cvterms":[{"name":"protocol component", "accession":"VBcv:autocreated:protocol component"}, {"name":"Giemsa staining", "accession":"IDOMAL:0000552"}]}], "name":"VBA0046036:PROTO3", "type":{"name":"cytological chromosome examination", "accession":"MIRO:30000037"}, "description":"Ovaries were prepared for karyotype analysis according to standard procedures. The banding pattern was observed under a phase-contrast microscope (400×) and interpreted with reference to the chromosomal map and nomenclature of Coluzzi and colleagues. ", "uri":""}], "performers":[], "type":"genotype assay", "id":"VBA0046036"}],
"props":[{"cvterms":[{"name":"sex", "accession":"EFO:0000695"}, {"name":"female", "accession":"PATO:0000383"}]}, {"cvterms":[{"name":"developmental stage", "accession":"EFO:0000399"}, {"name":"adult", "accession":"IDOMAL:0000655"}]}],
"field_collections":[{"result_summary":"Burkina Faso (pyrethrum spray catch)", "name":"G05-2019.collect", "description":null, "geolocation":{"longitude":"-0.05727", "props":[{"cvterms":[{"name":"collection site", "accession":"VBcv:0000831"}, {"name":"Burkina Faso", "accession":"GAZ:00000905"}]}, {"value":"Bonsse", "cvterms":[{"name":"location", "accession":"VBcv:0000698"}]}, {"value":"Burkina Faso", "cvterms":[{"name":"country", "accession":"VBcv:0000701"}]}], "latitude":"12.1693", "geodetic_datum":"WGS 84", "name":"Burkina Faso", "altitude":null}, "props":[{"value":"2005-08-02", "cvterms":[{"name":"date", "accession":"VBcv:0000705"}]}], "protocols":[{"props":[], "name":"VBA0046034:PROTO1", "type":{"name":"pyrethrum spray catch", "accession":"MIRO:30000023"}, "description":"Freshly-fed female An. gambiae s.l. were collected in the morning while resting inside human dwellings by manual aspiration with the aid of electrical aspirators. Mosquitoes were kept in small cages wrapped in wet towels and stored inside cool boxes. Additionally, indoor insecticide space-sprays were carried out in the early afternoon.", "uri":"\n"}], "performers":[], "type":"field collection", "id":"VBA0046034"}],
"species_qualifications":[{"name":"unambiguous", "accession":"VBcv:autocreated:unambiguous"}],
"type":{"name":"individual", "accession":"EFO:0000542"},
"id":"VBS0015615",
"phenotype_assays":[]}
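To show how a client might consume a record like the one above, here is a minimal sketch that pulls out the species, the coordinates and the ontology-annotated properties. It embeds a trimmed excerpt of the same record so that it runs on its own. The `summarise` helper is hypothetical, and the reading that a property without a `value` carries its value as a second ontology term is an inference from this one example, not documented behaviour.

```python
import json

# Trimmed excerpt of the sample record above (most fields omitted).
record = json.loads("""
{"name": "G05-2019",
 "id": "VBS0015615",
 "species": {"name": "Anopheles arabiensis", "accession": "VBsp:0002224"},
 "field_collections": [
   {"id": "VBA0046034",
    "geolocation": {"latitude": "12.1693", "longitude": "-0.05727",
                    "geodetic_datum": "WGS 84"},
    "props": [{"value": "2005-08-02",
               "cvterms": [{"name": "date", "accession": "VBcv:0000705"}]}]}],
 "props": [
   {"cvterms": [{"name": "sex", "accession": "EFO:0000695"},
                {"name": "female", "accession": "PATO:0000383"}]}]}
""")


def summarise(sample):
    """Flatten one sample record into a simple key/value summary."""
    collection = sample["field_collections"][0]
    geo = collection["geolocation"]
    summary = {
        "id": sample["id"],
        "species": sample["species"]["name"],
        # Coordinates arrive as strings in the record, so convert them.
        "latitude": float(geo["latitude"]),
        "longitude": float(geo["longitude"]),
    }
    for prop in collection["props"] + sample["props"]:
        terms = [t["name"] for t in prop["cvterms"]]
        if "value" in prop:
            summary[terms[0]] = prop["value"]   # free-text value
        elif len(terms) == 2:
            summary[terms[0]] = terms[1]        # value given as ontology term
    return summary
```

For the excerpt, `summarise(record)` yields the sample ID, species, numeric coordinates, the collection date and the sex of the specimen.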

Web interface

PopBio browser: https://www.vectorbase.org/popbio/

A good example project page: https://www.vectorbase.org/popbio/project/?id=VBP0000010

New entry page currently in development: http://funcgen.vectorbase.org/popbio-map-preview/vb_geohashes_mean.html

Web interface

Plan to develop or modify something similar to MalariaGen's Panoptes with richer/more flexible metadata capabilities.

Plans

Map interface: deliver for the June (VB-2015-06) release and present/demo at the Kolymbari and ICEMR meetings.

Spreadsheet submission wizard development scheduled for Fall 2015.

Year 2: Sample x genotype browser development, including Ensembl (e!) REST and variation Solr work.

Year 2: Refactor project pages with scalable (but still flexible) data transfer (probably also Solr-driven) & update graphics.