large scale analytical workflows
TRANSCRIPT
![Page 1: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/1.jpg)
bioexcel.eu
Partners Funding
Large-scale analytical workflows on the cloud using Galaxy and Globus
Presenter: Ravi Madduri
Host: Adam Carter
BioExcel Educational Webinar Series #8
16 November, 2016
![Page 2: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/2.jpg)
This webinar is being recorded
![Page 3: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/3.jpg)
BioExcel Overview
• Excellence in Biomolecular Software
  - Improve the performance, efficiency and scalability of key codes
• Excellence in Usability
  - Devise efficient workflow environments with associated data integration
• Excellence in Consultancy and Training
  - Promote best practices and train end users
[Diagram: workflow architecture. Components: Portal / Workbench, DMI Gateway, DMI Planner, DMI Optimiser, DMI Validator, DMI Enactor, DMI Executor, DMI Monitor, Repository, Registry, Data Source, Data Delivery Point. Roles: Domain Expert, DMI Expert, DADC Engineer. Connections: DMI Request, Service Invocation, Data flow, Monitoring flow.]
![Page 4: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/4.jpg)
Interest Groups
• Integrative Modeling IG
• Free Energy Calculations IG
• Best practices for performance tuning IG
• Hybrid methods for biomolecular systems IG
• Biomolecular simulations entry level users IG
• Practical applications for industry IG
• Training
• Workflows
Support platforms
http://bioexcel.eu/contact
Forums, Code Repositories, Chat channel, Video Channel
![Page 5: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/5.jpg)
Today’s Presenter
Ravi Madduri is a Scientist at Argonne National Laboratory and Senior Research Fellow at the University of Chicago.
Ravi is actively involved in developing innovative software and networking technology. As lead architect of the Reliable File Transfer, he designed novel testing and profiling capabilities, ensuring that it met the needs of key communities such as TeraGrid.
He implemented Grid file transfer patterns in the Java CoG Kit and developed a remote application virtualization infrastructure; the Grid-enabled extension was incorporated in the Grid Service Authoring Toolkit and is used by NCI Information Systems.
He is applying new technology in diverse science and engineering domains. For example, he is a key contributor to the Cancer Bioinformatics Grid. He played a lead role in the evolution of GridFTP and its adoption by researchers for the Laser Interferometer Gravitational Wave Observatory and the Large Hadron Collider. Moreover, as part of the NEESgrid project, he helped scientific teams incorporate Grid technology into their earthquake engineering research.
![Page 6: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/6.jpg)
globus.org/genomics
Large Scale Analytical Workflows on the Cloud using Galaxy and Globus
Ravi K. Madduri
Argonne National Laboratory, University of Chicago
![Page 7: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/7.jpg)
Who We Are
• Globus is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute
  – University of Chicago / Argonne National Lab
• We are a non-profit organization building solutions for non-profit researchers
• Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions
![Page 8: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/8.jpg)
Challenges in Large Scale NGS Analysis

Data Movement and Access Challenges
[Diagram: Sequencing Centers, Public Data, Storage, Seq Center, Research Lab, Local Cluster/Cloud]
• Data is distributed in different locations
• Research labs need access to the data for analysis
• Be able to share data with other researchers/collaborators
• Inefficient ways of data movement
• Data needs to be available on the local and distributed compute resources
  – Local clusters, cloud, grid

Manual Data Analysis
How do we analyze this sequence data once we have the sequence data?
[Diagram: inputs Fastq and Ref Genome; stages Alignment and Variant Calling; tools Picard and GATK; cycle of Install, Modify, (Re)Run Script]
• Manually move the data to the compute node
• Install all the tools required for the analysis
  – BWA, Picard, GATK, filtering scripts, etc.
• Shell scripts to sequentially execute the tools
• Manually modify the scripts for any change
• Error prone, difficult to keep track, messy
• Difficult to maintain and transfer the knowledge
![Page 9: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/9.jpg)
Additional Challenges in Big Data
• Rapidly validating a hypothesis
• Scaling up the analysis after validation
• Trivially applying the same techniques on other/all datasets of interest
• Reproducibility
  – Unique Identifiers for inputs and outputs
  – Publishable Results
  – Discoverable Results
11/23/16 BIG DATA for DISCOVERY SCIENCE
![Page 10: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/10.jpg)
Globus Genomics

Data management
[Diagram: Sequencing Centers, Public Data, Storage, Seq Center, Research Lab, Local Cluster/Cloud]
Globus provides for
• High-performance
• Fault-tolerant
• Secure
file transfer between all data-endpoints

Data analysis
[Diagram: inputs Fastq and Ref Genome; stages Alignment and Variant Calling; tools Picard and GATK]
Galaxy-based workflow management
• Globus integrated within Galaxy
• Web-based UI
• Drag-Drop workflow creations
• Easily modify workflows with new tools

Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible
![Page 11: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/11.jpg)
Technologies/Services
• EBS/S3 for scratch and semi-permanent storage
• EC2 – on-demand, reserved, spot
• VPCs
• ELB
• HTCondor
• Globus transfer, identity management
• Chef
• CloudTrail, SNS, SES – monitoring, notifications, audit
• IAM – access management, key management
• RDS – state management
![Page 12: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/12.jpg)
Additional Capabilities
• Professionally managed and supported platform
• Best practice pipelines
  – Whole Genome, Exome, RNA-Seq, ChIP-Seq, …
• Enhanced workbench with breadth of analytic tools
• Technical support and bioinformatics consulting
• Access to pre-integrated end-points for reliable and high-performance data transfer (e.g. Broad Institute, Perkin Elmer, university sequencing centers, etc.)
• Cost-effective solution with subscription-based pricing
![Page 13: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/13.jpg)
A Cloud Tool Profiling Service
[Diagram: a Profiler and a pool of Workers, each running a Worker web service, PCP, and HTCondor]
1. Submit profiling request
2. Provision workers
3. Start/monitor profiling
4. Parse PCP log and store profiles
5. Return profiles
• Describe profile requests in JSON
• Provision resources and apply a profiling Web Service
  – Use Performance Co-Pilot (PCP) to capture usage
• Capture and process PCP logs
• Return profiles as JSON (or logs via S3)
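The slide states only that profile requests and returned profiles are JSON; it gives no schema. The following is a minimal sketch under that limitation: every field name (`tool`, `command`, `instance_types`, `metrics`) and both helper functions are hypothetical, and the usage samples stand in for values that would be parsed out of a PCP log.

```python
import json
from statistics import mean


def build_profile_request(tool, command, instance_types, metrics):
    """Serialize a profiling request as JSON (hypothetical schema)."""
    request = {
        "tool": tool,
        "command": command,
        "instance_types": list(instance_types),
        "metrics": list(metrics),
    }
    return json.dumps(request, indent=2, sort_keys=True)


def summarize_samples(samples):
    """Reduce per-interval usage samples to a peak/mean profile per metric.

    `samples` is a list of dicts such as {"cpu_pct": 85.0, "mem_gb": 3.1};
    in the real service these values would come from PCP logs.
    """
    profile = {}
    for metric in sorted({key for sample in samples for key in sample}):
        values = [sample[metric] for sample in samples if metric in sample]
        profile[metric] = {"peak": max(values), "mean": mean(values)}
    return profile


request_json = build_profile_request(
    tool="bwa",
    command="bwa mem ref.fa reads.fastq",
    instance_types=["m4.xlarge", "c4.2xlarge"],
    metrics=["cpu_pct", "mem_gb"],
)
profile = summarize_samples([
    {"cpu_pct": 80.0, "mem_gb": 2.0},
    {"cpu_pct": 90.0, "mem_gb": 4.0},
])
print(request_json)
print(json.dumps(profile, indent=2))
```

Keeping both the request and the profile as plain JSON-serializable dictionaries means either side can be stored or logged without extra conversion.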
![Page 14: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/14.jpg)
Cost-aware Provisioning
1. Filter instance types with profiles
2. Determine price for each instance type across all AZs
3. Rank potential requests
4. Make requests and monitor
5. Cancel or repurpose excess active requests once one is fulfilled
R. Chard et al. Cost-aware cloud provisioning. Proceedings of the 11th IEEE International Conference on e-Science (e-Science), 2015.
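The five steps can be sketched roughly as below. This is an illustration under stated assumptions, not the algorithm from the cited paper: the ranking criterion (hourly price times profiled runtime), the `Candidate` type, the function names, and all prices and runtimes are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    instance_type: str
    zone: str
    hourly_price: float
    est_runtime_hours: float

    @property
    def est_cost(self):
        # Assumed ranking metric: price per hour times profiled runtime.
        return self.hourly_price * self.est_runtime_hours


def rank_requests(profiles, prices):
    """Steps 1-3: filter by profile, price per zone, rank by estimated cost.

    profiles: {instance_type: estimated runtime in hours}
    prices:   {(instance_type, zone): hourly price}
    """
    candidates = [
        Candidate(itype, zone, price, profiles[itype])
        for (itype, zone), price in prices.items()
        if itype in profiles  # step 1: keep only profiled instance types
    ]
    # Step 3: cheapest estimated total cost first; ties broken by name.
    return sorted(candidates, key=lambda c: (c.est_cost, c.instance_type, c.zone))


def requests_to_cancel(active, fulfilled):
    """Step 5: once one request is fulfilled, the rest are excess."""
    return [request for request in active if request != fulfilled]


# Hypothetical profiles and prices, for illustration only.
profiles = {"m4.xlarge": 4.0, "c4.2xlarge": 2.0}
prices = {
    ("m4.xlarge", "us-east-1a"): 0.05,
    ("m4.xlarge", "us-east-1b"): 0.04,
    ("c4.2xlarge", "us-east-1a"): 0.12,
    ("r3.large", "us-east-1a"): 0.02,  # no profile, so it is filtered out
}
ranked = rank_requests(profiles, prices)
active = ranked[:2]                               # step 4: submit the top two
excess = requests_to_cancel(active, active[0])    # step 5
for candidate in ranked:
    print(candidate.instance_type, candidate.zone, round(candidate.est_cost, 2))
```

Ranking on estimated total cost rather than raw hourly price lets a more expensive but faster instance type win when the profile shows a shorter runtime.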
![Page 15: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/15.jpg)
Globus Genomics
• Workflows can be easily defined and automated with integrated Galaxy Platform capabilities
• Data movement is streamlined with integrated Globus file-transfer functionality
• Resources can be provisioned on-demand with Amazon Web Services cloud-based infrastructure
![Page 16: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/16.jpg)
Packaging data for interchange
https://github.com/ini-bdds/bdbag
![Page 17: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/17.jpg)
Packaging data for interchange
BDBag
A packaging format for encapsulating
– Payload: arbitrary content
– Tags: metadata describing the payload
– Checksums: supports verification of content

Bio_data_bag/
|-- data
|   \-- genomic
|       \-- 2a673.fastq
|-- manifest-md5.txt
|     afbfa231324812378123bfa data/genomic/2a673.fasta
|-- bagit.txt
      Contact-Name: John Smith
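The payload, tags, and checksums above can be illustrated with a minimal sketch that writes a BagIt-style directory using only the Python standard library. It does not use the bdbag tool itself; `make_bag` and the sample FASTQ bytes are hypothetical. One assumption to note: the sketch writes `Contact-Name` to `bag-info.txt`, which is where the BagIt format keeps such metadata tags, while `bagit.txt` holds only the bag declaration.

```python
import hashlib
import tempfile
from pathlib import Path

FASTQ = b"@read1\nACGT\n+\nIIII\n"  # stand-in payload content


def make_bag(bag_dir, payload, contact_name):
    """Write a minimal BagIt-style bag.

    payload: {path relative to data/: file content as bytes}
    """
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        target = bag / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)  # payload: arbitrary content
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    # Bag declaration.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    # Checksums: one "<md5>  <path>" line per payload file.
    (bag / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")
    # Tags: metadata describing the payload.
    (bag / "bag-info.txt").write_text(f"Contact-Name: {contact_name}\n")
    return bag


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(
        Path(tmp) / "Bio_data_bag",
        {"genomic/2a673.fastq": FASTQ},
        "John Smith",
    )
    files = sorted(
        p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file()
    )
    manifest_text = (bag / "manifest-md5.txt").read_text()

print(files)
print(manifest_text, end="")
```

Because the manifest records a checksum for every payload file, a recipient can confirm that the bag arrived complete and unmodified before analyzing it.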
![Page 18: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/18.jpg)
Minimal viable identifiers (minid)
• Every data item that you create can be automatically assigned a digital id
• You can reference it, share it, resolve it
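The idea of assigning, referencing, and resolving an identifier can be illustrated with a toy in-memory registry. This is not the minid client or service API: `IdentifierRegistry`, `Record`, the `example:/` prefix, and the URLs are all hypothetical, and a real service would persist its records and mint identifiers in a managed namespace.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass
class Record:
    identifier: str
    checksum: str
    locations: list = field(default_factory=list)
    title: str = ""


class IdentifierRegistry:
    """Toy registry binding an identifier to a checksum and locations."""

    def __init__(self, prefix="example:/"):
        self._prefix = prefix
        self._by_id = {}
        self._by_checksum = {}

    def register(self, content, location, title=""):
        """Return the identifier for this content, minting one if needed."""
        checksum = hashlib.sha256(content).hexdigest()
        existing = self._by_checksum.get(checksum)
        if existing is not None:
            # Same bytes seen before: record the extra location only.
            if location not in existing.locations:
                existing.locations.append(location)
            return existing.identifier
        identifier = self._prefix + uuid.uuid4().hex[:10]
        record = Record(identifier, checksum, [location], title)
        self._by_id[identifier] = record
        self._by_checksum[checksum] = record
        return identifier

    def resolve(self, identifier):
        """Look up the checksum and known locations for an identifier."""
        return self._by_id[identifier]


registry = IdentifierRegistry()
first = registry.register(b"ACGT", "https://example.org/a.fastq", title="sample A")
second = registry.register(b"ACGT", "https://example.org/mirror/a.fastq")
record = registry.resolve(first)
print(first, record.checksum[:12], record.locations)
```

Keying records on the content checksum means two copies of the same data share one identifier, and anyone resolving it can verify the bytes they download.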
![Page 19: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/19.jpg)
Resolve a minid
![Page 20: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/20.jpg)
Generating Data Identifiers
https://github.com/ini-bdds/minid
![Page 21: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/21.jpg)
DNase Hypersensitivity Analysis
[Diagram: numbered flow across BDDS services]
1. Create a Query
2. Query
3. Query
4. BDBag MinID
5. Execute Big Data Analysis pipelines
6. Results
7. Publish Results
8. Index Results
Components: BDDS Data, Encode to BDBag Service, BDDS Analysis Services, BDDS ERMRest Service, BDDS Galaxy Service, BDDS Publication Service
Data labels: CEL, FASTQ, BDBAG, TRN
![Page 22: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/22.jpg)
Extending Globus Genomics
• BDBags – Interchangeable data objects for collections of files
  – Checksums
  – Unique identifiers
• Batch Execution on BDBags generating results as “bags” of results
• Strong data provenance for reproducibility
• Elastic File System for scratch and S3 for results
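Checksums are what make a bag safe to use as a batch input: each payload file can be checked against the manifest before a pipeline runs. A minimal sketch of that check, assuming an md5 manifest with one "checksum path" pair per line; `verify_bag` is a hypothetical helper, not part of Globus Genomics or the bdbag tool.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_bag(bag_dir):
    """Return manifest paths that are missing or whose md5 does not match."""
    bag = Path(bag_dir)
    failures = []
    for line in (bag / "manifest-md5.txt").read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        if actual != expected:
            failures.append(rel_path)
    return failures


# Build a tiny throwaway bag, verify it, then corrupt it and verify again.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    (bag / "data").mkdir()
    payload = bag / "data" / "reads.fastq"
    payload.write_bytes(b"@r1\nACGT\n+\nIIII\n")
    digest = hashlib.md5(payload.read_bytes()).hexdigest()
    (bag / "manifest-md5.txt").write_text(f"{digest}  data/reads.fastq\n")
    clean = verify_bag(bag)
    payload.write_bytes(b"corrupted")
    tampered = verify_bag(bag)

print(clean, tampered)
```

A batch runner could refuse to start any workflow whose input bag returns a non-empty failure list, which keeps provenance records tied to verified inputs.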
![Page 23: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/23.jpg)
Extending Globus Genomics
• 500+ skinny Docker containers
• Instrumented with cAdvisor
• Extended the application profiling service to generate profiles for CPU, I/O, memory, disk
• Dashboards using Grafana
• RDS to store the profiles
![Page 24: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/24.jpg)
![Page 25: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/25.jpg)
Globus Genomics at a glance
• 30 institutions, groups, labs
• 10s of millions of core hours
• 2 PB of raw sequences analyzed
• >1500 analysis tools
• 1000s of genomes processed
• >50 workflows
• 99% uptime over the past two years
• 43 steps: number of steps in a single workflow
• 5 days: longest running workflow
• 100s of different species
![Page 26: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/26.jpg)
Typical Use Cases
A profile of inherited predisposition to breast cancer among Nigerian women
Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade
A case study for high throughput analysis of NGS data for translational research using Globus Genomics
D. Sulakhe, A. Rodriguez, K. Bhuvaneshwar, Y. Gusev, R. Madduri, L. Lacinski, U. Dave, I. Foster, S. Madhavan
![Page 27: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/27.jpg)
Globus Genomics users
Dobyns Lab
Cox Lab
Volchenboum Lab
Olopade Lab
Nagarajan Lab
![Page 28: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/28.jpg)
• More information on Globus Genomics: www.globus.org/genomics
• More information on Globus: www.globus.org
![Page 29: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/29.jpg)
Our work is supported by: U.S. DEPARTMENT OF ENERGY
![Page 30: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/30.jpg)
Team
![Page 31: Large scale analytical workflows](https://reader036.vdocuments.mx/reader036/viewer/2022062316/58804f6e1a28ab22088b56f9/html5/thumbnails/31.jpg)
Audience Q&A session
Please use the Questions function in the GoToWebinar application.
Any other questions or points to discuss after the live webinar? Join the discussion at http://ask.bioexcel.eu or jump straight to the topic at http://bit.ly/2fghe8B.