commodity compung in genomics research -...
TRANSCRIPT
CommodityCompu+nginGenomicsResearch
MichaelSchatzandMihaiPop
CenterforBioinforma-csandComputa-onalBiology,UniversityofMaryland,CollegePark,MD,USA
h;p://www.cbcb.umd.edu/research/cloud/
RecentadvancesinDNAsequencingtechnologyfromIllumina,454LifeSciences,ABI,andHelicos,are enabling next genera-on sequencing instruments to sequence the equivalent of the humangenome (~3 billion bp) in few days and at low cost. In contrast, the sequencing for the humangenomeproject of the late 90's andearly '00s required years ofworkonhundredsofmachineswith sequencing costs measured in hundreds of millions of dollars. This drama-c increase inefficiencyhasspurredtremendousgrowthinapplica-onsforDNAsequencing.
Forexample,whereasthehumangenomeprojectsoughttosequencethegenomeofasmallgroupofindividuals,the1000genomesprojectaimstocatalogthegenomesof1000individualsfromallregionsoftheglobeinjustthreeyears.Relatedprojectsaimtocatalogallofthebiologicallyac-vetranscribed regionsof the genomeover awide varietyof environmental anddisease condi-ons.Similarstudiesarealsounderwayformodelorganismssuchasmouse,rat,chicken,rice,andyeast,andotherorganismsofinterest.
Cheapandfastsequencingtechnologiesarealsoprovidingscien-stswiththetoolstoanalyzethelargelyunknownmicrobialbiosphere.Themajorityofmicrobesinhabi-ngourworldandourbodiesareunknownandcannotbeeasilymanipulated inthe laboratory. Inrecentyearsanewscien-ficfield has emerged ‐ metagenomics ‐ that aims to characterize en-re microbial communi-es bysequencingtheDNAdirectlyextractedfromanenvironment.Severalstudieshavealreadytargeteda rangeofnaturalenvironments (ocean, soil,minedrainage)aswellas thecommensalmicrobesinhabi-ngthebodiesofhumansandotheranimalsandinsects.Thela[erarethetargetofanewNIHini-a-ve–theHumanMicrobiomeProject‐anefforttocharacterizethediversityofhuman‐associatedmicrobialcommuni-esandtounderstandtheircontribu-onstohumanhealth.
The raw data generated by the new sequencing instruments o^en exceed 1 terabyte and arestraining the computa-onal infrastructure typically available in an average research lab.Furthermore,biologicaldatasetsareonlyincreasinginsize,asdataformoreindividualsandmoreenvironments are collected, further complica-ng computa-onal analyses. Even seemingly simpletasks,suchasmappingacollec-onofsequencingreadstooneof thehumanreferencegenome,can require days of computa-on, andde novo assembly of an en-re human genomeusing newgenera-on sequence data has only been possible with specialized compute resources. The onlylong‐termsolu-ontothechallengesposedbythemassivebiologicaldata‐setsbeinggeneratedistocombine computa-onalbiology researchwithadvances fromhighperformance compu-ng.Hereweexplore theuseofcommoditycompu-ngwithinacloudcompu-ngparadigmto tackle theseproblems.
Abstract
CloudBurst:HighlySensi+veShortReadMapping
CloudBurstisanewopen‐sourceread‐mappingalgorithm,foruseinavarietyofbiologicalanalysesincludingSNPdiscovery,genotyping,andpersonal genomics. It is modeled a^er the short read mappingprogramRMAP,butusesHadooptocomputealignments inparallel.CloudBurst's running -me scales linearly with the number of readsmapped,andwithnear linearspeedupasthenumberofprocessorsincreases.Inalargeremotecomputecloudwith96cores,CloudBurstachieves over a 100‐fold speedup over a serial execu-on of RMAP,reducing the running -me from hours to mere minutes to mapmillionsofshortreadstothehumangenome.
Schatz,MC(2009)CloudBurst:HighlySensi+veReadMappingwithMapReduce.Bioinforma-cs.25(11):1363‐1369.h[p://www.cbcb.umd.edu/so^ware/cloudburst
GenomeAssemblywithMapReduceCrossbow:SearchingforSNPswithCloudCompu+ng
Mapstepisshortreadalignment.ManyinstancesofBow+eruninparallelacrossthe cluster. Input tuples are reads andoutputtuplesarealignments.
A clustermay consist of any number ofnodes. Hadoop handles the details ofrou-ng data, distribu-ng and invokingprograms,providingfaulttolerance,etc.
The user first uploads reads to afilesystemvisible to theHadoop cluster.If the Hadoop cluster is in EC2, thefilesystemmightbeanS3bucket. IftheHadoop cluster is local, the filesystemmightbeanNFSshare.
Sort step bins alignments according toprimarykey (genomechromosome)andsorts according to a secondary key(offset into chromosome). This ishandledefficientlybyHadoop.
Reduce step calls SNPs for eachreference par--on. Many instances ofSOAPsnp run in parallel across thecluster. Input tuples are sortedalignments for a par--on and outputtuplesareSNPcalls.
Crossbowisacloud‐compu-ngso^waresystemthatcombinesthespeedoftheshortreadalignerBow-e and the accuracy of the SOAPsnp consensus and SNP caller. Execu-ng these tools inparallelwiththeHadoop implementa-onofMapReduce,CrossbowalignsreadsandmakesSNPcallswith>99%accuracyfromadatasetcomprising38‐foldcoverageofthehumangenomeinlessthanonedayona clusterof40 computer cores, and in less than threehoursusinga320‐coreclusterrentedfromacommercialcloudcompu-ngservice.Crossbow’sabilitytoruninthecloudsmeansthatusersneednotownoroperateanexpensivecomputerclustertorunCrossbow.
Langmead, B, Schatz, MC, Lin, J, Pop, M, Salzberg, SL (2009) Searching for SNPs with CloudCompu+ng.SubmiCedforpublica-on.h[p://bow-e‐bio.sf.net/crossbow
Denovo assemblyofhuman‐sizedgenomes fromshort readsonasinglemachine isnot feasiblewith a current assembler such as Velvet, EULER‐USR, or ALLPATHS, as those algorithms wouldrequire >10TB of RAM to execute. However,MapReduce enables the de Bruijn graph assemblyalgorithmstoscaletotheselargedatasetsasoutlinedbelow.
1.DeBruijnGraphConstruc+onandCompressionConstruc-on of the de Bruijn graph is naturally implemented inMapReduce. Themap func-onemitskeyvaluepairs(ki,ki+1)forconsecu-vek‐mersinthereads,whicharethengloballyshuffledandreducedtobuildanadjacencylistforallk‐mersinthereads.Regionsofthegenomebetweenrepeat boundaries formnon‐branching simple paths, and are efficiently compressed inO(log(S))MapReduceroundsusingaparallellistrankingalgorithm.
Reads
ACTG ATCT CTGA CTGG CTGC GACT GCTG GGCT TCTG TGAC TGCA TGGC
DeBruijnGraph
ATC TCT CTG TGC GCA
TGA
GAC
ACT
TGG
GGC
GCT
ATCT
TGACT
TGGCT
TGCA
CompressedGraph
CTG
2.ErrorCorrec+onErrorsinthereadsdistortthegraphstructurecrea-ngdead‐ends(le^)orbubbles(middle).Thesegraph structures are recognized and resolved in a single MapReduce cycle crea-ng addi-onalsimplepaths(right).
BC
B’A
B
B’A C B*A C
3.Graphcontrac+onAddi-onal simplifica-on techniques suchas x‐cut (le^) and cycle tree compac-on (right) furthersimplifythegraphstructure,andcreatemoreopportuni-esforsimplepathcompression.
C
BAr
D
G
C
BA r
Dr
G
4.ScaffoldingFinallymate‐pairs, ifavailable,areanalyzedtofurtherresolveambigui-es.MapReduce isusedtoiden-fy the connected components of the assembly graph, which are then separately butconcurrentlyanalyzedusingthescaffoldingcomponentofanassemblersuchasVelvetortheCeleraAssembler.Thefinal sequences resolves larger regionsof thegenome, revealingnewbiologynotaccessiblethroughpurelycompara-vetechniques.
A
C
DRB
A C DR B R R
Humanchromosome1
Read1
Read2
map shuffle
…
…
reduce
Read 1, Chromosome 1, 12345-12365
Read 2, Chromosome 1, 12350-12370
Simulated DataCrossbowsensi+vity
Crossbowprecision
Nodes Time/Cost
HumanChr.2240Mreads1.8GB
99.01% 99.14% 1 master + 2 worder
0h30m$2.40
HumanChr.X172Mreads
5.6GB98.97% 99.64% 1 master +
8 workers 0h26m$7.20
Genuine DataAutosomalagreement
Chr.Xagreement
Nodes Time/Cost
Wholehuman,versusIllumina1MBeadChip
2.7Breads103GB
99.5% 99.6%1master+40workers
2h53m$98.40
AcknowledgementsWewouldliketothankUMDstudentsandfacultyBenLangmead,JimmyLin,StevenSalzberg,andDanSommerfortheircontribu-ons.
ThisworkissupportedbytheNSFClusterExploratory(CluE)programgrantIIS‐0844494.