
Commodity Computing in Genomics Research

Michael Schatz and Mihai Pop

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA

http://www.cbcb.umd.edu/research/cloud/

Abstract

Recent advances in DNA sequencing technology from Illumina, 454 Life Sciences, ABI, and Helicos are enabling next generation sequencing instruments to sequence the equivalent of the human genome (~3 billion bp) in a few days and at low cost. In contrast, the sequencing for the human genome project of the late '90s and early '00s required years of work on hundreds of machines, with sequencing costs measured in hundreds of millions of dollars. This dramatic increase in efficiency has spurred tremendous growth in applications for DNA sequencing.

For example, whereas the human genome project sought to sequence the genome of a small group of individuals, the 1000 Genomes Project aims to catalog the genomes of 1000 individuals from all regions of the globe in just three years. Related projects aim to catalog all of the biologically active transcribed regions of the genome over a wide variety of environmental and disease conditions. Similar studies are also underway for model organisms such as mouse, rat, chicken, rice, and yeast, and other organisms of interest.

Cheap and fast sequencing technologies are also providing scientists with the tools to analyze the largely unknown microbial biosphere. The majority of microbes inhabiting our world and our bodies are unknown and cannot be easily manipulated in the laboratory. In recent years a new scientific field has emerged, metagenomics, that aims to characterize entire microbial communities by sequencing the DNA directly extracted from an environment. Several studies have already targeted a range of natural environments (ocean, soil, mine drainage) as well as the commensal microbes inhabiting the bodies of humans and other animals and insects. The latter are the target of a new NIH initiative, the Human Microbiome Project, an effort to characterize the diversity of human-associated microbial communities and to understand their contributions to human health.

The raw data generated by the new sequencing instruments often exceed 1 terabyte and are straining the computational infrastructure typically available in an average research lab. Furthermore, biological datasets are only increasing in size, as data for more individuals and more environments are collected, further complicating computational analyses. Even seemingly simple tasks, such as mapping a collection of sequencing reads to the human reference genome, can require days of computation, and de novo assembly of an entire human genome using new generation sequence data has only been possible with specialized compute resources. The only long-term solution to the challenges posed by the massive biological datasets being generated is to combine computational biology research with advances from high performance computing. Here we explore the use of commodity computing within a cloud computing paradigm to tackle these problems.


CloudBurst: Highly Sensitive Short Read Mapping

CloudBurst is a new open-source read-mapping algorithm for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics. It is modeled after the short read mapping program RMAP, but uses Hadoop to compute alignments in parallel. CloudBurst's running time scales linearly with the number of reads mapped, with near linear speedup as the number of processors increases. In a large remote compute cloud with 96 cores, CloudBurst achieves over a 100-fold speedup over a serial execution of RMAP, reducing the running time from hours to mere minutes to map millions of short reads to the human genome.

Schatz, MC (2009) CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics 25(11):1363-1369. http://www.cbcb.umd.edu/software/cloudburst
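To make the pattern concrete, here is a minimal Python sketch of the seed-and-extend MapReduce idea behind CloudBurst: seeds shared between reads and the reference become shuffle keys, and the reducer pairs each read seed with each reference occurrence. The seed length, function names, and the omitted extension step are illustrative assumptions, not CloudBurst's actual code.

    # Seed-and-extend read mapping as MapReduce (illustrative sketch).
    SEED = 12  # assumed seed length

    def map_record(record_type, name, seq):
        """Emit (seed, origin) pairs for reads and the reference alike.
        A read aligned with at most k mismatches must match the reference
        exactly in at least one of its k+1 non-overlapping seeds."""
        if record_type == "read":
            for off in range(0, len(seq) - SEED + 1, SEED):
                yield seq[off:off + SEED], ("read", name, off, seq)
        else:  # reference: every position starts a seed
            for off in range(len(seq) - SEED + 1):
                yield seq[off:off + SEED], ("ref", name, off)

    def reduce_seed(seed, values):
        """Pair every read seed with every reference occurrence of the
        same seed; extending/verifying the full read around the shared
        seed (using rseq) is omitted here."""
        reads = [v for v in values if v[0] == "read"]
        refs = [v for v in values if v[0] == "ref"]
        for _, rname, roff, rseq in reads:
            for _, chrom, goff in refs:
                yield rname, (chrom, goff - roff)  # candidate alignment start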

Crossbow: Searching for SNPs with Cloud Computing

Map step is short read alignment. Many instances of Bowtie run in parallel across the cluster. Input tuples are reads and output tuples are alignments.

A cluster may consist of any number of nodes. Hadoop handles the details of routing data, distributing and invoking programs, providing fault tolerance, etc.

The user first uploads reads to a filesystem visible to the Hadoop cluster. If the Hadoop cluster is in EC2, the filesystem might be an S3 bucket. If the Hadoop cluster is local, the filesystem might be an NFS share.

Sort step bins alignments according to a primary key (genome chromosome) and sorts according to a secondary key (offset into the chromosome). This is handled efficiently by Hadoop.

Reduce step calls SNPs for each reference partition. Many instances of SOAPsnp run in parallel across the cluster. Input tuples are sorted alignments for a partition and output tuples are SNP calls.
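The three steps above can be summarized schematically. The Python sketch below shows only the key layout that drives Hadoop's shuffle; the align and call_snps callables stand in for Bowtie and SOAPsnp and are assumed placeholders, not Crossbow's real interfaces.

    def map_align(read, align):
        """Map step: align one read with the supplied aligner (Bowtie, in
        Crossbow) and emit a composite key for Hadoop's shuffle."""
        chrom, offset, alignment = align(read)
        # primary key = chromosome: chooses the reduce partition;
        # secondary key = offset: Hadoop sorts within each partition
        return (chrom, offset), alignment

    def reduce_partition(chrom, alignments_in_offset_order, call_snps):
        """Reduce step: scan one chromosome's alignments in offset order
        with the supplied caller (SOAPsnp, in Crossbow) and emit SNPs."""
        yield from call_snps(chrom, alignments_in_offset_order)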

Crossbow is a cloud-computing software system that combines the speed of the short read aligner Bowtie and the accuracy of the SOAPsnp consensus and SNP caller. Executing these tools in parallel with the Hadoop implementation of MapReduce, Crossbow aligns reads and makes SNP calls with >99% accuracy from a dataset comprising 38-fold coverage of the human genome in less than one day on a cluster of 40 computer cores, and in less than three hours using a 320-core cluster rented from a commercial cloud computing service. Crossbow's ability to run in the cloud means that users need not own or operate an expensive computer cluster to run Crossbow.

Langmead, B, Schatz, MC, Lin, J, Pop, M, Salzberg, SL (2009) Searching for SNPs with Cloud Computing. Submitted for publication. http://bowtie-bio.sf.net/crossbow

Genome Assembly with MapReduce

De novo assembly of human-sized genomes from short reads on a single machine is not feasible with a current assembler such as Velvet, EULER-USR, or ALLPATHS, as those algorithms would require >10 TB of RAM to execute. However, MapReduce enables de Bruijn graph assembly algorithms to scale to these large datasets, as outlined below.

1. De Bruijn Graph Construction and Compression

Construction of the de Bruijn graph is naturally implemented in MapReduce. The map function emits key-value pairs (k_i, k_{i+1}) for consecutive k-mers in the reads, which are then globally shuffled and reduced to build an adjacency list for all k-mers in the reads. Regions of the genome between repeat boundaries form non-branching simple paths, and are efficiently compressed in O(log S) MapReduce rounds using a parallel list ranking algorithm (a pointer-jumping sketch follows the figure below).
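As a toy illustration, this self-contained Python sketch builds the adjacency list for the 3-mer example in the figure below, simulating Hadoop's global shuffle with an in-memory dictionary; it sketches the idea, not the poster's implementation.

    from collections import defaultdict

    K = 3  # k-mer length; the figure below uses 3-mers

    def map_read(read):
        """Emit (k_i, k_{i+1}) for consecutive k-mers of one read."""
        for i in range(len(read) - K):
            yield read[i:i + K], read[i + 1:i + K + 1]

    def reduce_kmer(kmer, successors):
        """Build the adjacency list for one k-mer node."""
        return kmer, sorted(set(successors))

    # Simulate the global shuffle in memory: group map output by key.
    reads = ["ACTG", "ATCT", "CTGA", "CTGG", "CTGC", "GACT",
             "GCTG", "GGCT", "TCTG", "TGAC", "TGCA", "TGGC"]
    grouped = defaultdict(list)
    for read in reads:
        for kmer, succ in map_read(read):
            grouped[kmer].append(succ)
    graph = dict(reduce_kmer(k, v) for k, v in grouped.items())
    print(graph["CTG"])  # ['TGA', 'TGC', 'TGG']: a branching repeat node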

[Figure: Reads (ACTG, ATCT, CTGA, CTGG, CTGC, GACT, GCTG, GGCT, TCTG, TGAC, TGCA, TGGC); the De Bruijn Graph of their 3-mers, branching at the repeat node CTG; and the Compressed Graph, with the non-branching paths collapsed to ATCT, TGACT, TGGCT, and TGCA around CTG.]
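The path compression relies on pointer jumping, the core of parallel list ranking: each node repeatedly replaces its successor with its successor's successor, so surviving path lengths at least halve per synchronous round, giving the O(log S) bound. A sequential Python sketch of one plausible formulation follows; in the MapReduce version each sweep would be one job over (node, successor) tuples.

    def compress_paths(succ):
        """succ maps each node on a simple path to its successor (None at
        the path end). Each sweep jumps every node over its successor;
        updates within a sweep may use already-jumped values, which only
        speeds convergence."""
        changed = True
        while changed:
            changed = False
            for node, nxt in list(succ.items()):
                if nxt is not None and succ.get(nxt) is not None:
                    succ[node] = succ[nxt]  # jump over the successor
                    changed = True
        return succ

    path = {"a": "b", "b": "c", "c": "d", "d": "e", "e": None}
    print(compress_paths(path))
    # {'a': 'e', 'b': 'e', 'c': 'e', 'd': 'e', 'e': None}
    # every node now points directly at its path's terminal node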

2. Error Correction

Errors in the reads distort the graph structure, creating dead-ends (left) or bubbles (middle). These graph structures are recognized and resolved in a single MapReduce cycle, creating additional simple paths (right); a sketch of the dead-end case follows the figure below.

[Figure: a dead-end branch B' hanging off the path A-B-C (left), a bubble in which B and B' both connect A to C (middle), and the resolved simple path A-B*-C (right).]
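For the dead-end case, a minimal sequential Python sketch of the recognize-and-remove idea is below; the length threshold and the adjacency-set representation are assumptions, and bubble popping follows the same recognize-and-merge pattern.

    from collections import defaultdict

    def remove_short_tips(graph, max_tip=4):
        """graph: node -> set of successors. A tip is a non-branching
        chain that dead-ends after hanging off a branch point; chains
        shorter than max_tip (an assumed threshold) are deleted as error
        artifacts. Chains that never reach a branch point are legitimate
        path ends and are kept."""
        preds = defaultdict(set)
        for node, succs in graph.items():
            for s in succs:
                preds[s].add(node)
        for tip in [n for n in graph if not graph[n]]:  # dead-end nodes
            if tip not in graph:
                continue  # already deleted as part of another tip
            chain, at_branch = [tip], False
            while len(chain) < max_tip:  # walk back while non-branching
                back = preds[chain[-1]] - set(chain)
                if len(back) != 1:
                    break
                (p,) = back
                if len(graph.get(p, ())) > 1:  # reached a branch point
                    at_branch = True
                    break
                chain.append(p)
            if at_branch and len(chain) < max_tip:
                for node in chain:
                    for p in preds[node]:
                        if p in graph:
                            graph[p].discard(node)
                    graph.pop(node, None)
        return graph

    # The dead end B' hanging off A -> B -> C is removed; C, the real
    # path end, is kept:
    g = {"A": {"B", "B'"}, "B": {"C"}, "B'": set(), "C": set()}
    print(remove_short_tips(g))  # {'A': {'B'}, 'B': {'C'}, 'C': set()}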

3. Graph Contraction

Additional simplification techniques such as x-cut (left) and cycle tree compaction (right) further simplify the graph structure and create more opportunities for simple path compression.

[Figure: x-cut (left) and cycle tree compaction (right) applied to a subgraph of nodes A, B, C, D, G around a repeat node r.]

4. Scaffolding

Finally, mate-pairs, if available, are analyzed to further resolve ambiguities. MapReduce is used to identify the connected components of the assembly graph (a label-propagation sketch follows the figure below), which are then separately but concurrently analyzed using the scaffolding component of an assembler such as Velvet or the Celera Assembler. The final sequences resolve larger regions of the genome, revealing new biology not accessible through purely comparative techniques.

[Figure: mate-pair links order contigs A, C, D, and B around a repeat node R into a single scaffold.]
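The connected-components step maps naturally onto iterated MapReduce rounds of minimum-label propagation: every node starts labeled with its own id, and in each round neighbors exchange labels and keep the minimum until no label changes. A sequential Python sketch, assuming comparable node ids:

    from collections import defaultdict

    def connected_components(edges):
        """edges: undirected (u, v) pairs with comparable node ids.
        Returns node -> component label (the smallest reachable id).
        Each while-iteration corresponds to one MapReduce round: the map
        sends each node's label to its neighbors and the reduce keeps
        the minimum; rounds are bounded by the component diameter."""
        adj, label = defaultdict(set), {}
        for u, v in edges:
            adj[u].add(v); adj[v].add(u)
            label[u], label[v] = u, v
        changed = True
        while changed:
            changed = False
            for u in adj:
                best = min([label[u]] + [label[v] for v in adj[u]])
                if best < label[u]:
                    label[u] = best
                    changed = True
        return label

    # Two components: {a, b, c} and {x, y}
    print(connected_components([("a", "b"), ("b", "c"), ("x", "y")]))
    # {'a': 'a', 'b': 'a', 'c': 'a', 'x': 'x', 'y': 'x'}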

[Figure (Crossbow): Read 1 and Read 2 from human chromosome 1 flow through map, shuffle, and reduce, emerging as alignment tuples: Read 1, Chromosome 1, 12345-12365; Read 2, Chromosome 1, 12350-12370.]

Simulated Data

Dataset                               Crossbow      Crossbow    Nodes                  Time / Cost
                                      sensitivity   precision
Human Chr. 22 (40 M reads, 1.8 GB)    99.01%        99.14%      1 master + 2 workers   0h 30m / $2.40
Human Chr. X (172 M reads, 5.6 GB)    98.97%        99.64%      1 master + 8 workers   0h 26m / $7.20

Genuine Data

Dataset                               Autosomal     Chr. X      Nodes                  Time / Cost
                                      agreement     agreement
Whole human, versus Illumina 1M
BeadChip (2.7 B reads, 103 GB)        99.5%         99.6%       1 master + 40 workers  2h 53m / $98.40

Acknowledgements

We would like to thank UMD students and faculty Ben Langmead, Jimmy Lin, Steven Salzberg, and Dan Sommer for their contributions.

This work is supported by the NSF Cluster Exploratory (CluE) program grant IIS-0844494.