
Commodity Computing in Genomics Research

Michael Schatz and Mihai Pop

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA

http://www.cbcb.umd.edu/research/cloud/

Abstract

Recent advances in DNA sequencing technology from Illumina, 454 Life Sciences, ABI, and Helicos are enabling next generation sequencing instruments to sequence the equivalent of the human genome (~3 billion bp) in a few days and at low cost. In contrast, the sequencing for the human genome project of the late '90s and early '00s required years of work on hundreds of machines, with sequencing costs measured in hundreds of millions of dollars. This dramatic increase in efficiency has spurred tremendous growth in applications for DNA sequencing.

For example, whereas the human genome project sought to sequence the genome of a small group of individuals, the 1000 Genomes Project aims to catalog the genomes of 1000 individuals from all regions of the globe in just three years. Related projects aim to catalog all of the biologically active transcribed regions of the genome over a wide variety of environmental and disease conditions. Similar studies are also underway for model organisms such as mouse, rat, chicken, rice, and yeast, and other organisms of interest.

Cheap and fast sequencing technologies are also providing scientists with the tools to analyze the largely unknown microbial biosphere. The majority of microbes inhabiting our world and our bodies are unknown and cannot be easily manipulated in the laboratory. In recent years a new scientific field has emerged, metagenomics, that aims to characterize entire microbial communities by sequencing the DNA directly extracted from an environment. Several studies have already targeted a range of natural environments (ocean, soil, mine drainage) as well as the commensal microbes inhabiting the bodies of humans and other animals and insects. The latter are the target of a new NIH initiative, the Human Microbiome Project, an effort to characterize the diversity of human-associated microbial communities and to understand their contributions to human health.

The raw data generated by the new sequencing instruments often exceed 1 terabyte and are straining the computational infrastructure typically available in an average research lab. Furthermore, biological datasets are only increasing in size, as data for more individuals and more environments are collected, further complicating computational analyses. Even seemingly simple tasks, such as mapping a collection of sequencing reads to the human reference genome, can require days of computation, and de novo assembly of an entire human genome using new generation sequence data has only been possible with specialized compute resources. The only long-term solution to the challenges posed by the massive biological datasets being generated is to combine computational biology research with advances from high performance computing. Here we explore the use of commodity computing within a cloud computing paradigm to tackle these problems.


CloudBurst: Highly Sensitive Short Read Mapping

CloudBurst is a new open-source read-mapping algorithm for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics. It is modeled after the short read mapping program RMAP, but uses Hadoop to compute alignments in parallel. CloudBurst's running time scales linearly with the number of reads mapped, with near linear speedup as the number of processors increases. In a large remote compute cloud with 96 cores, CloudBurst achieves over a 100-fold speedup over a serial execution of RMAP, reducing the running time from hours to mere minutes to map millions of short reads to the human genome.

Schatz, MC (2009) CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics 25(11):1363-1369. http://www.cbcb.umd.edu/software/cloudburst
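To make the pattern concrete, here is a minimal Python sketch of the seed-and-extend MapReduce idea behind CloudBurst: seeds shared between reads and the reference become shuffle keys, and the reducer pairs each read seed with each reference occurrence. The seed length, function names, and the omitted extension step are illustrative assumptions, not CloudBurst's actual code.

    # Seed-and-extend read mapping as MapReduce (illustrative sketch).
    SEED = 12  # assumed seed length

    def map_record(record_type, name, seq):
        """Emit (seed, origin) pairs for reads and the reference alike.
        A read aligned with at most k mismatches must match the reference
        exactly in at least one of its k+1 non-overlapping seeds."""
        if record_type == "read":
            for off in range(0, len(seq) - SEED + 1, SEED):
                yield seq[off:off + SEED], ("read", name, off, seq)
        else:  # reference: every position starts a seed
            for off in range(len(seq) - SEED + 1):
                yield seq[off:off + SEED], ("ref", name, off)

    def reduce_seed(seed, values):
        """Pair every read seed with every reference occurrence of the
        same seed; extending/verifying the full read around the shared
        seed (using rseq) is omitted here."""
        reads = [v for v in values if v[0] == "read"]
        refs = [v for v in values if v[0] == "ref"]
        for _, rname, roff, rseq in reads:
            for _, chrom, goff in refs:
                yield rname, (chrom, goff - roff)  # candidate alignment start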

Crossbow: Searching for SNPs with Cloud Computing

Map step is short read alignment. Many instances of Bowtie run in parallel across the cluster. Input tuples are reads and output tuples are alignments.

A cluster may consist of any number of nodes. Hadoop handles the details of routing data, distributing and invoking programs, providing fault tolerance, etc.

The user first uploads reads to a filesystem visible to the Hadoop cluster. If the Hadoop cluster is in EC2, the filesystem might be an S3 bucket. If the Hadoop cluster is local, the filesystem might be an NFS share.

Sort step bins alignments according to a primary key (genome chromosome) and sorts according to a secondary key (offset into the chromosome). This is handled efficiently by Hadoop.

Reduce step calls SNPs for each reference partition. Many instances of SOAPsnp run in parallel across the cluster. Input tuples are sorted alignments for a partition and output tuples are SNP calls.
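The three steps above can be summarized schematically. The Python sketch below shows only the key layout that drives Hadoop's shuffle; the align and call_snps callables stand in for Bowtie and SOAPsnp and are assumed placeholders, not Crossbow's real interfaces.

    def map_align(read, align):
        """Map step: align one read with the supplied aligner (Bowtie, in
        Crossbow) and emit a composite key for Hadoop's shuffle."""
        chrom, offset, alignment = align(read)
        # primary key = chromosome: chooses the reduce partition;
        # secondary key = offset: Hadoop sorts within each partition
        return (chrom, offset), alignment

    def reduce_partition(chrom, alignments_in_offset_order, call_snps):
        """Reduce step: scan one chromosome's alignments in offset order
        with the supplied caller (SOAPsnp, in Crossbow) and emit SNPs."""
        yield from call_snps(chrom, alignments_in_offset_order)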

Crossbow is a cloud-computing software system that combines the speed of the short read aligner Bowtie and the accuracy of the SOAPsnp consensus and SNP caller. Executing these tools in parallel with the Hadoop implementation of MapReduce, Crossbow aligns reads and makes SNP calls with >99% accuracy from a dataset comprising 38-fold coverage of the human genome in less than one day on a cluster of 40 computer cores, and in less than three hours using a 320-core cluster rented from a commercial cloud computing service. Crossbow's ability to run in the cloud means that users need not own or operate an expensive computer cluster to run Crossbow.

Langmead, B, Schatz, MC, Lin, J, Pop, M, Salzberg, SL (2009) Searching for SNPs with Cloud Computing. Submitted for publication. http://bowtie-bio.sf.net/crossbow

Genome Assembly with MapReduce

De novo assembly of human-sized genomes from short reads on a single machine is not feasible with a current assembler such as Velvet, EULER-USR, or ALLPATHS, as those algorithms would require >10 TB of RAM to execute. However, MapReduce enables de Bruijn graph assembly algorithms to scale to these large datasets, as outlined below.

1. De Bruijn Graph Construction and Compression

Construction of the de Bruijn graph is naturally implemented in MapReduce. The map function emits key-value pairs (k_i, k_{i+1}) for consecutive k-mers in the reads, which are then globally shuffled and reduced to build an adjacency list for all k-mers in the reads. Regions of the genome between repeat boundaries form non-branching simple paths, and are efficiently compressed in O(log S) MapReduce rounds using a parallel list ranking algorithm (a pointer-jumping sketch follows the figure below).
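As a toy illustration, this self-contained Python sketch builds the adjacency list for the 3-mer example in the figure below, simulating Hadoop's global shuffle with an in-memory dictionary; it sketches the idea, not the poster's implementation.

    from collections import defaultdict

    K = 3  # k-mer length; the figure below uses 3-mers

    def map_read(read):
        """Emit (k_i, k_{i+1}) for consecutive k-mers of one read."""
        for i in range(len(read) - K):
            yield read[i:i + K], read[i + 1:i + K + 1]

    def reduce_kmer(kmer, successors):
        """Build the adjacency list for one k-mer node."""
        return kmer, sorted(set(successors))

    # Simulate the global shuffle in memory: group map output by key.
    reads = ["ACTG", "ATCT", "CTGA", "CTGG", "CTGC", "GACT",
             "GCTG", "GGCT", "TCTG", "TGAC", "TGCA", "TGGC"]
    grouped = defaultdict(list)
    for read in reads:
        for kmer, succ in map_read(read):
            grouped[kmer].append(succ)
    graph = dict(reduce_kmer(k, v) for k, v in grouped.items())
    print(graph["CTG"])  # ['TGA', 'TGC', 'TGG']: a branching repeat node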

[Figure: Reads (ACTG, ATCT, CTGA, CTGG, CTGC, GACT, GCTG, GGCT, TCTG, TGAC, TGCA, TGGC); the De Bruijn Graph of their 3-mers, branching at the repeat node CTG; and the Compressed Graph, with the non-branching paths collapsed to ATCT, TGACT, TGGCT, and TGCA around CTG.]
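The path compression relies on pointer jumping, the core of parallel list ranking: each node repeatedly replaces its successor with its successor's successor, so surviving path lengths at least halve per synchronous round, giving the O(log S) bound. A sequential Python sketch of one plausible formulation follows; in the MapReduce version each sweep would be one job over (node, successor) tuples.

    def compress_paths(succ):
        """succ maps each node on a simple path to its successor (None at
        the path end). Each sweep jumps every node over its successor;
        updates within a sweep may use already-jumped values, which only
        speeds convergence."""
        changed = True
        while changed:
            changed = False
            for node, nxt in list(succ.items()):
                if nxt is not None and succ.get(nxt) is not None:
                    succ[node] = succ[nxt]  # jump over the successor
                    changed = True
        return succ

    path = {"a": "b", "b": "c", "c": "d", "d": "e", "e": None}
    print(compress_paths(path))
    # {'a': 'e', 'b': 'e', 'c': 'e', 'd': 'e', 'e': None}
    # every node now points directly at its path's terminal node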

2. Error Correction

Errors in the reads distort the graph structure, creating dead-ends (left) or bubbles (middle). These graph structures are recognized and resolved in a single MapReduce cycle, creating additional simple paths (right); a sketch of the dead-end case follows the figure below.

[Figure: a dead-end branch B' hanging off the path A-B-C (left), a bubble in which B and B' both connect A to C (middle), and the resolved simple path A-B*-C (right).]
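For the dead-end case, a minimal sequential Python sketch of the recognize-and-remove idea is below; the length threshold and the adjacency-set representation are assumptions, and bubble popping follows the same recognize-and-merge pattern.

    from collections import defaultdict

    def remove_short_tips(graph, max_tip=4):
        """graph: node -> set of successors. A tip is a non-branching
        chain that dead-ends after hanging off a branch point; chains
        shorter than max_tip (an assumed threshold) are deleted as error
        artifacts. Chains that never reach a branch point are legitimate
        path ends and are kept."""
        preds = defaultdict(set)
        for node, succs in graph.items():
            for s in succs:
                preds[s].add(node)
        for tip in [n for n in graph if not graph[n]]:  # dead-end nodes
            if tip not in graph:
                continue  # already deleted as part of another tip
            chain, at_branch = [tip], False
            while len(chain) < max_tip:  # walk back while non-branching
                back = preds[chain[-1]] - set(chain)
                if len(back) != 1:
                    break
                (p,) = back
                if len(graph.get(p, ())) > 1:  # reached a branch point
                    at_branch = True
                    break
                chain.append(p)
            if at_branch and len(chain) < max_tip:
                for node in chain:
                    for p in preds[node]:
                        if p in graph:
                            graph[p].discard(node)
                    graph.pop(node, None)
        return graph

    # The dead end B' hanging off A -> B -> C is removed; C, the real
    # path end, is kept:
    g = {"A": {"B", "B'"}, "B": {"C"}, "B'": set(), "C": set()}
    print(remove_short_tips(g))  # {'A': {'B'}, 'B': {'C'}, 'C': set()}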

3. Graph Contraction

Additional simplification techniques such as x-cut (left) and cycle tree compaction (right) further simplify the graph structure and create more opportunities for simple path compression.

[Figure: x-cut (left) and cycle tree compaction (right) applied to a subgraph of nodes A, B, C, D, G around a repeat node r.]

4. Scaffolding

Finally, mate-pairs, if available, are analyzed to further resolve ambiguities. MapReduce is used to identify the connected components of the assembly graph (a label-propagation sketch follows the figure below), which are then separately but concurrently analyzed using the scaffolding component of an assembler such as Velvet or the Celera Assembler. The final sequences resolve larger regions of the genome, revealing new biology not accessible through purely comparative techniques.

[Figure: mate-pair links order contigs A, C, D, and B around a repeat node R into a single scaffold.]
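The connected-components step maps naturally onto iterated MapReduce rounds of minimum-label propagation: every node starts labeled with its own id, and in each round neighbors exchange labels and keep the minimum until no label changes. A sequential Python sketch, assuming comparable node ids:

    from collections import defaultdict

    def connected_components(edges):
        """edges: undirected (u, v) pairs with comparable node ids.
        Returns node -> component label (the smallest reachable id).
        Each while-iteration corresponds to one MapReduce round: the map
        sends each node's label to its neighbors and the reduce keeps
        the minimum; rounds are bounded by the component diameter."""
        adj, label = defaultdict(set), {}
        for u, v in edges:
            adj[u].add(v); adj[v].add(u)
            label[u], label[v] = u, v
        changed = True
        while changed:
            changed = False
            for u in adj:
                best = min([label[u]] + [label[v] for v in adj[u]])
                if best < label[u]:
                    label[u] = best
                    changed = True
        return label

    # Two components: {a, b, c} and {x, y}
    print(connected_components([("a", "b"), ("b", "c"), ("x", "y")]))
    # {'a': 'a', 'b': 'a', 'c': 'a', 'x': 'x', 'y': 'x'}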

[Figure (Crossbow): Read 1 and Read 2 from human chromosome 1 flow through map, shuffle, and reduce, emerging as alignment tuples: Read 1, Chromosome 1, 12345-12365; Read 2, Chromosome 1, 12350-12370.]

Simulated Data

Dataset                               Crossbow      Crossbow    Nodes                  Time / Cost
                                      sensitivity   precision
Human Chr. 22 (40 M reads, 1.8 GB)    99.01%        99.14%      1 master + 2 workers   0h 30m / $2.40
Human Chr. X (172 M reads, 5.6 GB)    98.97%        99.64%      1 master + 8 workers   0h 26m / $7.20

Genuine Data

Dataset                               Autosomal     Chr. X      Nodes                  Time / Cost
                                      agreement     agreement
Whole human, versus Illumina 1M
BeadChip (2.7 B reads, 103 GB)        99.5%         99.6%       1 master + 40 workers  2h 53m / $98.40

Acknowledgements

We would like to thank UMD students and faculty Ben Langmead, Jimmy Lin, Steven Salzberg, and Dan Sommer for their contributions.

This work is supported by the NSF Cluster Exploratory (CluE) program grant IIS-0844494.