Genomic Database Performance Improvements With Document-Based Database Architecture

Wade L. Schulz MD/PhD
Donn K. Felker
Brent G. Nelson MD

Sponsor: Michael Linden MD/PhD

presentations.wadeschulz.com/aclps2014




Disclosures

Stakeholders in AgileMedicine, which does not provide any genomics-related software, products, or services.

Whole-genome sequencing

“The obvious laboratory test to determine the basis of every disease process.”


Sequencing Evolution

1980: “Shotgun sequencing” described
1986: First commercial sequencer (ABI Prism)
1995: Capillary electrophoresis released
1998: Pyrosequencing developed
2005: First commercial pyrosequencer (454 Life Sciences)
2010: Semiconductor sequencer released (Ion Torrent)

ACLPS 2014 Genomic Database Performance


Database Evolution

1970: Relational database defined
1984: Sybase released
1995: MySQL released
2007: First NoSQL databases begin to emerge
2009: MongoDB released



Experimental Goals

1. Assess efficiency of relational and document databases for storing genomic annotations
2. Quantify the benefit of in-memory indexing to query genomic annotations
3. Determine whether traditional disk or solid state drives improve database performance


Experimental Design

Parse: Create data set from dbSNP annotation
Load: Write records or documents into database
Index: Create indexes
Query: Query documents and single/multi-table records

61,268,661 records
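The four phases above can be sketched as a small timing harness. This is a minimal illustration under stated assumptions, not the authors' benchmark code: SQLite (from the Python standard library) stands in for MySQL, the rows are synthetic rather than parsed from dbSNP, and the helper names `parse`, `timed`, and `run_pipeline` are hypothetical.

```python
import sqlite3
import time


def parse(n):
    """Stand-in for the Parse step: yield synthetic SNP rows instead of reading dbSNP."""
    for i in range(n):
        # (rsid, chr, has_sig): placeholder values, not real dbSNP annotations
        yield (f"rs{i}", str(i % 22 + 1), int(i % 10 == 0))


def timed(label, fn, timings):
    """Run fn(), store the elapsed seconds under label, and return fn's result."""
    start = time.perf_counter()
    result = fn()
    timings[label] = time.perf_counter() - start
    return result


def run_pipeline(n=10_000):
    """Time the parse, load, index, and query phases on n synthetic records."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
    conn.execute(
        "CREATE TABLE snp (id INTEGER PRIMARY KEY, rsid TEXT, chr TEXT, has_sig INTEGER)"
    )
    timings = {}

    rows = timed("parse", lambda: list(parse(n)), timings)
    timed("load", lambda: conn.executemany(
        "INSERT INTO snp (rsid, chr, has_sig) VALUES (?, ?, ?)", rows), timings)
    timed("index", lambda: conn.execute(
        "CREATE INDEX idx_has_sig ON snp (has_sig)"), timings)
    count = timed("query", lambda: conn.execute(
        "SELECT count(id) FROM snp WHERE has_sig = 1").fetchone()[0], timings)

    conn.close()
    return count, timings


if __name__ == "__main__":
    count, timings = run_pipeline()
    print(f"records with has_sig set: {count}")
    for phase, seconds in timings.items():
        print(f"{phase:>5}: {seconds:.4f} s")
```

Timing each phase separately keeps write speed, index creation, and query time from blurring into one number, which is the separation the design above relies on.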


Virtual Hardware

                    Amazon EC2                      Digital Ocean
Operating System    Amazon Linux (x64)              CentOS (x64)
Processors          4 vCPU/8 ECU                    8 vCPU
Memory              15 GB                           16 GB
Disk Type           Elastic Block Store or PIOPS    Solid State

Data Models

MongoDB

{
  _id: ObjectId(),
  has_sig: bool,
  rsid: string,
  chr: string,
  loci: [
    {
      gene: string,
      mrna_acc: string,
      class: string
    }
  ]
}

MySQL

[Schema diagram]
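To make the contrast between the two models concrete, the sketch below builds one SNP and renders it both as a single document with embedded loci and as relational rows. The table names `snp` and `locus` and the `snp_id` link are taken from the SQL on the Query Efficiency slide; which table holds `chr`, `mrna_acc`, and `class` is an assumption, and the class names, helper functions, and sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Locus:
    gene: str
    mrna_acc: str
    locus_class: str  # "class" is a Python keyword, so the field is renamed here


@dataclass
class Snp:
    rsid: str
    chr: str
    has_sig: bool
    loci: list = field(default_factory=list)


def to_document(snp):
    """Document form: loci are embedded in the SNP (the _id is left to the database)."""
    return {
        "rsid": snp.rsid,
        "chr": snp.chr,
        "has_sig": snp.has_sig,
        "loci": [
            {"gene": locus.gene, "mrna_acc": locus.mrna_acc, "class": locus.locus_class}
            for locus in snp.loci
        ],
    }


def to_rows(snp, snp_id):
    """Relational form: one snp row plus one locus row per locus, linked by snp_id."""
    snp_row = (snp_id, snp.rsid, snp.chr, snp.has_sig)
    locus_rows = [
        (snp_id, locus.gene, locus.mrna_acc, locus.locus_class) for locus in snp.loci
    ]
    return snp_row, locus_rows


# Placeholder values for illustration only, not real dbSNP annotations.
example = Snp("rs1", "22", True, [
    Locus("COMT", "NM_PLACEHOLDER_1", "intron"),
    Locus("COMT", "NM_PLACEHOLDER_2", "missense"),
])

if __name__ == "__main__":
    print(to_document(example))
    print(to_rows(example, snp_id=1))
```

The document form writes one object per SNP, while the relational form writes one `snp` row plus one `locus` row per locus, so reading a SNP back with its loci requires a join.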


Results

What actually happens?


Write Speed


Index Creation

[Chart panels: MongoDB, MySQL]


Query Efficiency

String: Search for number of SNPs with gene code (COMT: 842 records)
- MongoDB: {"loci.gene": "COMT"}
- MySQL: "SELECT count(distinct s.rsid)
          FROM locus l, snp s
          WHERE l.snp_id = s.id AND l.gene = 'COMT'"

Boolean: Search for number of records with clinical significance annotation
- MongoDB: {"has_sig": true}
- MySQL: "SELECT count(s.id)
          FROM locus l, snp s
          WHERE l.snp_id = s.id AND s.has_sig = true"

[Chart panels: MongoDB, MySQL]
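The two query styles can be tried on a toy data set. The sketch below is an illustration under stated assumptions rather than the benchmark itself: SQLite stands in for MySQL, the MongoDB filters are imitated with plain Python functions (`count_by_gene` and `count_significant` are hypothetical helpers), and the three documents are synthetic. Against a real MongoDB collection, the filters would typically be passed to pymongo's `count_documents`.

```python
import sqlite3

# Synthetic documents following the MongoDB data model (values are placeholders).
docs = [
    {"rsid": "rs1", "has_sig": True,
     "loci": [{"gene": "COMT"}, {"gene": "COMT"}]},  # two loci on the same gene
    {"rsid": "rs2", "has_sig": False, "loci": [{"gene": "COMT"}]},
    {"rsid": "rs3", "has_sig": True, "loci": [{"gene": "GRIN2B"}]},
]


def count_by_gene(documents, gene):
    """Analogue of the filter {"loci.gene": gene}: match if any embedded locus has the gene."""
    return sum(any(locus["gene"] == gene for locus in d["loci"]) for d in documents)


def count_significant(documents):
    """Analogue of the filter {"has_sig": true}."""
    return sum(d["has_sig"] is True for d in documents)


def load_relational(documents):
    """Flatten the documents into snp and locus tables (SQLite standing in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE snp (id INTEGER PRIMARY KEY, rsid TEXT, has_sig INTEGER)")
    conn.execute("CREATE TABLE locus (snp_id INTEGER, gene TEXT)")
    for snp_id, d in enumerate(documents, start=1):
        conn.execute("INSERT INTO snp VALUES (?, ?, ?)",
                     (snp_id, d["rsid"], int(d["has_sig"])))
        conn.executemany("INSERT INTO locus VALUES (?, ?)",
                         [(snp_id, locus["gene"]) for locus in d["loci"]])
    return conn


conn = load_relational(docs)

# String query: DISTINCT collapses the two COMT loci of rs1 into one SNP.
sql_gene_count = conn.execute(
    "SELECT count(DISTINCT s.rsid) FROM locus l, snp s "
    "WHERE l.snp_id = s.id AND l.gene = ?", ("COMT",)).fetchone()[0]

# Boolean query: the join yields one row per locus of each significant SNP.
sql_sig_rows = conn.execute(
    "SELECT count(s.id) FROM locus l, snp s "
    "WHERE l.snp_id = s.id AND s.has_sig = 1").fetchone()[0]

print("gene COMT:", count_by_gene(docs, "COMT"), sql_gene_count)
print("has_sig:  ", count_significant(docs), sql_sig_rows)
```

On this toy data both string queries return 2. The Boolean results differ: the document filter counts 2 documents, while the joined `count(s.id)` counts 3 rows, because the join produces one row per locus. That is worth keeping in mind when comparing counts across the two models.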


Why? In-place update?

[Diagram: blocks A, B, and C on HDD versus SSD]


Conclusions

Write speed
- Drive type can drastically affect write speed
- MongoDB has significantly higher write speeds, especially for large imports

Indexing
- MySQL is more efficient at creating Boolean indexes
- Index creation is otherwise comparable on traditional disk
- MySQL index creation rates may suffer on SSD

Queries
- In-memory indexing of MongoDB provides a significant performance advantage (~150x)


Team

Wade Schulz: Clinical Pathology Resident, Yale University
Donn Felker: Healthcare Software Architect, Mobile and Cloud Computing
Brent Nelson: Neuromodulation Fellow, University of Minnesota


Questions


Presentation Resources


presentations.wadeschulz.com/aclps2014

github.com/wadeschulz/research_snpdb