BioPig for scalable analysis of big sequencing data
DESCRIPTION
This talk was adapted from my presentation at Finishing in the Future 2011, Santa Fe, NM.
TRANSCRIPT
![Page 1: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/1.jpg)
BioPig: Hadoop-based Analytic Toolkit for Next-Generation Sequence Data
Zhong Wang, Ph.D.
Computational Biology Staff Scientist
![Page 2: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/2.jpg)
Cellulase
The deep metagenome approach to discover cellulases for biofuel research
![Page 3: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/3.jpg)
Large data, large reward
http://www.cazy.org/
Only 1% shared (>=95% identity)
50% validated activity
Science. 2011 Jan 28;331(6016):463-7.
![Page 4: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/4.jpg)
Sequence data
More data would be even better
![Page 5: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/5.jpg)
Rumen (2009): 17 Gb
Rumen (2010): 250 Gb
Rumen (2012): 1000 Gb
But, can analysis keep up with data growth?
![Page 6: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/6.jpg)
Ideal solutions for the terabase problem
1. Scalable to 1 Tb?
2. Performance (within hours)?
![Page 7: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/7.jpg)
High-Mem cluster
Input/Output (IO)
Memory
![Page 8: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/8.jpg)
MP/MPI solution: k-mer counting
[Figure: Raw data -> Data slices (1-4) -> Each node/core has data and table slices -> Count table]
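The slide only names the pieces (data slices, table slices, count table), so the following is a minimal single-process sketch of one common way to arrange them, not the speaker's MPI code. It assumes each worker owns the k-mers that hash to its index; the helper names (`owner`, `slice_reads`, `partitioned_count`, `merge_tables`) and the use of CRC32 as the partitioning hash are illustrative assumptions.

```python
import zlib
from collections import Counter


def owner(kmer, n_workers):
    """Pick the worker that owns this k-mer's table slice (assumed hash partitioning)."""
    return zlib.crc32(kmer.encode("ascii")) % n_workers


def slice_reads(reads, n_workers):
    """Split the raw reads into one data slice per worker (round-robin)."""
    return [reads[i::n_workers] for i in range(n_workers)]


def partitioned_count(reads, k, n_workers):
    """Simulate the workers: each scans its data slice and routes every
    k-mer to the table slice of the worker that owns it."""
    tables = [Counter() for _ in range(n_workers)]
    for data_slice in slice_reads(reads, n_workers):
        for seq in data_slice:
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                tables[owner(kmer, n_workers)][kmer] += 1
    return tables


def merge_tables(tables):
    """Combine the per-worker table slices into one count table."""
    total = Counter()
    for table in tables:
        total.update(table)
    return total


tables = partitioned_count(["attagc", "gttagg"], k=4, n_workers=4)
merged = merge_tables(tables)
```

Because every k-mer has exactly one owner, the table slices never overlap and merging is a plain union. In a real MPI program the routing step would be network communication between ranks, which is where the engineering effort listed on the next slide comes from.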
![Page 9: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/9.jpg)
MP/MPI performance
MPI version: 412 Gb, 4.5B reads; 2.7 hours on 128x24 cores; NERSC Hopper II
MP threaded version: 268 Gb, 3B reads; 5 days on 32 cores; High-Mem cluster
Fast, scalable
Problems:
• Experienced software engineers
• Six months of development time
• One node fails, all fail
![Page 10: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/10.jpg)
Hadoop/Map Reduce framework
• Google MapReduce
  – Data-parallel programming model to process petabyte data
  – Generally has a map and a reduce step
• Apache Hadoop
  – Distributed file system (HDFS) and job handling for scalability and robustness
  – Data locality to bring compute to data, avoiding network transfer bottleneck
![Page 11: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/11.jpg)
Programmability: Hadoop vs Pig
Finding the top 5 websites young people visit
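The side-by-side code on this slide is an image and is not part of the transcript. As a rough illustration of the query itself, here is a plain-Python sketch of the dataflow (filter users by age, join with page visits, group, count, order, take the top 5). It is neither the Pig script nor the Hadoop program from the slide; the function name `top_sites`, the record layouts, and the 18 to 25 age range used for "young" are all assumptions.

```python
from collections import Counter


def top_sites(users, visits, min_age=18, max_age=25, n=5):
    """Return the n most visited URLs among users in the assumed age range.

    users  : iterable of (name, age) pairs   (hypothetical layout)
    visits : iterable of (name, url) pairs   (hypothetical layout)
    """
    # Filter: keep only users in the age range.
    young = {name for name, age in users if min_age <= age <= max_age}
    # Join + group + count: tally URLs visited by those users.
    counts = Counter(url for name, url in visits if name in young)
    # Order + limit: most visited first, top n only.
    return counts.most_common(n)


users = [("amy", 19), ("bob", 42), ("cat", 23)]
visits = [
    ("amy", "a.example"),
    ("bob", "b.example"),
    ("cat", "a.example"),
    ("cat", "c.example"),
]
top = top_sites(users, visits)
```

Each commented step corresponds to one relational operation, which is the level at which a dataflow language lets the analyst work; on a cluster, the same steps would have to be distributed rather than run in one process.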
![Page 12: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/12.jpg)
BioPig: design goals
• Flexible
  – every dataset is unique, and data analysts have domain knowledge that is essential to optimize the analysis,
  – pluggable modules that analysts can use to build custom analytic pipelines.
• High-level
  – a domain-specific language enables data analysts to create custom pipelines,
  – hides details of parallelism (too complex for most people).
• Scalability
  – leverage data parallelism to speed up analytics,
  – integrate external tools and applications where necessary,
  – scale from 1 to hundreds of compute nodes with minimal effort and linear scalability.
• Robustness
  – data and computation are replicated across nodes to combat failures.
BioPIG
![Page 13: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/13.jpg)
Runs on any hardware supporting Hadoop
• JGI Titanium (commodity Hadoop cluster)
  – Up to 20 nodes: 16 cores, 32 GB RAM, 1.799 GHz, 1G Ethernet
• NERSC Magellan Cloud Testbed
  – Up to 200 nodes: 8 cores, 24 GB RAM, 2.67 GHz Nehalem processors, 10Gbit InfiniBand, GPFS
• Amazon AWS
  – Elastic MapReduce with cluster compute nodes (23 GB of memory, 2 x Intel quad-core “Nehalem” architecture, 1690 GB of instance storage, 10G Ethernet)
![Page 14: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/14.jpg)
BioPig Modules
Blast
Input/Output (FASTA, FASTQ)
K-mer Counter
Assembly
![Page 15: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/15.jpg)
How k-mer count is implemented
Load: <id1, header, ‘attagc’>, <id2, header, ‘gttagg’>
Mapper: <id1, ‘atta’>, <id1, ‘ttag’>, <id2, ‘gtta’>, <id2, ‘ttag’>
Shuffle/sort: <‘atta’, id1>, <‘ttag’, id1, id2>, <‘gtta’, id2>, <‘tagg’, id2>
Reducer: <‘atta’, 1>, <‘ttag’, 2>, <‘gtta’, 1>, <‘tagg’, 1>
Merge: <‘atta’, 3>, <‘ttag’, 2>, <‘gtta’, 2>, <‘tagg’, 1>
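The stages on this slide can be imitated in a few lines of ordinary Python. This is a single-process sketch of the map, shuffle/sort, and reduce steps using the slide's two reads and k = 4; it is not BioPig or Hadoop code, and the function names are illustrative. The final merge across reducers is omitted because there is only one reducer here.

```python
from collections import defaultdict


def map_kmers(read_id, sequence, k):
    """Mapper: emit one (k-mer, read id) pair per window of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], read_id


def shuffle(pairs):
    """Shuffle/sort: group read ids by k-mer, with keys in sorted order."""
    groups = defaultdict(list)
    for kmer, read_id in pairs:
        groups[kmer].append(read_id)
    return dict(sorted(groups.items()))


def reduce_counts(groups):
    """Reducer: the count of a k-mer is the number of pairs grouped under it."""
    return {kmer: len(read_ids) for kmer, read_ids in groups.items()}


def count_kmers(reads, k):
    """Run the three stages over (read id, sequence) records."""
    pairs = [
        pair
        for read_id, sequence in reads
        for pair in map_kmers(read_id, sequence, k)
    ]
    return reduce_counts(shuffle(pairs))


reads = [("id1", "attagc"), ("id2", "gttagg")]
counts = count_kmers(reads, k=4)
```

The sketch emits every 4-mer of each read, so its result also contains ‘tagc’, which the slide leaves out. ‘ttag’ occurs in both reads and is counted twice, matching the reducer line on the slide.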
![Page 16: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/16.jpg)
A 7-liner BioPig script for k-mer counting
![Page 17: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/17.jpg)
Rumen metagenome gene discovery pipeline
Read preprocess (remove artifacts)
pigBlast (blast reads against known cellulases)
pigAssembler (assemble reads into contigs)
pigExtender (extend contigs into full-length enzymes)
![Page 18: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/18.jpg)
Cloud solution to large data
BioPig-Blaster
BioPig-Assembler
BioPig-Extender
BioPIG
BioPig: 61 lines of code
MPI-extender: ~12,000 lines (vs 31 in BioPig)
Flexibility
Programmability
Scalability
![Page 19: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/19.jpg)
Conclusions
Hadoop-based BioPig shows great potential for scalable analysis of very large sequence data; it is robust and easy to use.
![Page 20: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/20.jpg)
Challenges in application
• IO optimization, e.g., reduce data copying
• Some problems do not easily fit into the map/reduce framework, e.g., graph-based algorithms
• Integration into existing frameworks, e.g., Galaxy
![Page 21: BioPig for scalable analysis of big sequencing data](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5562647dd8b42aab1a8b4b77/html5/thumbnails/21.jpg)
Acknowledgement
• Karan Bhatia
• Henrik Nordberg
• Kai Wang
• Rob Egan
• Alex Sczyrba
• Jeremy Brand @JGI/NERSC
• Shane Cannon @NERSC