![Page 1: Using&the&Open&Science&DataCloud&& …ciara.fiu.edu/osdc-overview-13-v3.pdfUsing&the&Open&Science&DataCloud&& for&DataScience&Research& RobertGrossman& University&of&Chicago& OpenCloudConsorum](https://reader031.vdocuments.mx/reader031/viewer/2022041504/5e23b2f6a45e560ead4bebd8/html5/thumbnails/1.jpg)
Using the Open Science Data Cloud for Data Science Research
Robert Grossman University of Chicago
Open Cloud Consortium June 17, 2013
Data: 1 PB of OSDC data across several disciplines
+ Instrument: 3000 cores / 5 PB OSDC science cloud
+ Team: you and your colleagues
+ Correlation algorithms
= Discoveries
Part 1. What Instrument Do We Use to Make Big Data Discoveries?
How do we build a “datascope”?
What is big data?
TB? PB? EB?
W? KW? MW?
An algorithm and computing infrastructure is “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
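One way to read this definition is as a weak-scaling condition: data and processors grow together and wall-clock time stays flat. The following is a minimal sketch under an idealized model. The function name and the per-rack capacity and throughput figures are illustrative assumptions, not OSDC measurements.

```python
def scan_time_hours(racks: int, tb_per_rack: float, tb_per_hour_per_rack: float) -> float:
    """Idealized time to scan all data when each rack processes its own data.

    Assumes no coordination overhead and perfectly even data placement
    (simplifying assumptions, not properties of any real system).
    """
    total_data_tb = racks * tb_per_rack
    total_throughput = racks * tb_per_hour_per_rack
    return total_data_tb / total_throughput


# Hypothetical figures: 100 TB per rack, 25 TB/hour of scan throughput per rack.
for racks in (1, 2, 8):
    hours = scan_time_hours(racks, tb_per_rack=100.0, tb_per_hour_per_rack=25.0)
    print(f"{racks} rack(s): {racks * 100.0:.0f} TB scanned in {hours:.1f} h")
```

In this model the scan time is the same for every rack count, which is the behavior the definition asks for. Real systems fall short of it to the extent that shuffles, skew, or shared network links add overhead.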
Commercial Cloud Service Provider (CSP): 15 MW Data Center
• 100,000 servers
• 1 PB DRAM
• 100’s of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer Facing Portal
• Data center network
• ~1 Tbps egress bandwidth
• 25 operators for 15 MW Commercial Cloud
OSDC’s vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure.
Data: 1 PB of OSDC data across several disciplines
+ Instrument: 3000 cores / 5 PB OSDC science cloud
+ Team: you and your colleagues
+ Correlation algorithms
= Discoveries
Some Examples of Big Data Science

| Discipline | Duration | Size | # Devices |
| --- | --- | --- | --- |
| HEP - LHC | 10 years | 15 PB/year* | One |
| Astronomy - LSST | 10 years | 12 PB/year** | One |
| Genomics - NGS | 2-4 years | 0.5 TB/genome | 1000’s |

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
One large instrument
Many smaller instruments
Part 2. What is a Cloud and Why Do We Care?
There Are Two Essential Characteristics of a Cloud
1. Self service
2. Scale
• Clouds enable you to compute over large amounts of data without the necessity of first downloading the data.
• Clouds can be designed to be secure and compliant.
Self Service
Scale
Types of Clouds
• Public Clouds – Amazon
• Private Clouds – Run internally by universities or companies
• Community Clouds – Run by organizations (either formally or informally), such as the Open Cloud Consortium
Amazon Web Services (AWS)?
• Scale
• Simplicity of a credit card
• Wide variety of offerings
vs.
Community clouds, science clouds, etc.
• Lower cost (at medium scale)
• Data too important for commercial cloud
• Computing over scientific data is a core competency
• Can support any required governance / security
OCC supports AWS interop and bursting when permissible.
Science Clouds

| | NFP Science Clouds | Commercial Clouds |
| --- | --- | --- |
| POV | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds. |
| Data & Storage | Data intensive computing & HP storage | Internet style scale out and object-based storage |
| Flows | Large & small data flows | Lots of small web flows |
| Streams | Streaming processing required | NA |
| Accounting | Essential | Essential |
| Lock in | Moving environment between CSPs essential | Lock in is good |
| Interop | Critical, but difficult | Customers will drive to some degree |
Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
Datascope – Science Cloud Service Provider (Sci CSP)
Data scientist
Sci CSP services
Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
• Compare to a Network Operations Center (NOC).
• Both are an important part of cyberinfrastructure for big data science.
Datascope – Science Cloud Service Provider (Sci CSP)
Data scientist
Sci CSP services
Cloud Service Operations Center (CSOC)
Part 3. Data Science
Data
Foundations of data science
General and discipline specific software applications and tools
Models and algorithms
Establish best practices, strategies for data science in general and discipline specific data science in particular
Analytic infrastructure
Data
What are the foundations for data science?
Theory to Big Data Spectrum
• No data: mathematical theorems
• Small data: traditional statistical modeling
• Medium data: (semi-)automating statistical modeling
• Big data: simple counts and statistics over big data
Scale: GB, TB, PB
OSDC Datascope: 0.5-2.0 MW
Part 4. The Open Science Data Cloud
www.opensciencedatacloud.org
Data: 1 PB of OSDC data across several disciplines
+ Instrument: 3000 cores / 5 PB OSDC science cloud
+ Team: you and your colleagues
+ Correlation algorithms
= Discoveries
2013 Open Science Data Cloud (IaaS)
• 5 PB 2013 (OpenStack & GlusterFS)
• Infrastructure automation & management (Yates)
• Compliance & security (OpenFISMA)
• Accounting & billing (Salesforce.com)
• Customer Facing Portal (Tukey)
• Data center network
• ~10-100 Gbps bandwidth
• 5 engineers to operate 0.5 MW Science Cloud
Science Cloud SW & Services
• Virtual Machine (VM) containing common applications & pipelines
• Tukey (OSDC portal & middleware v0.3)
• Yates (infrastructure automation and management v0.1)
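The staffing figures on this slide and on the commercial CSP slide can be compared directly. A small sketch of that arithmetic follows. The function name is hypothetical, and the numbers are the ones quoted in the slides (25 operators for 15 MW, 5 engineers for 0.5 MW).

```python
def staff_per_mw(staff: int, megawatts: float) -> float:
    """Return the number of operations staff per megawatt of facility power."""
    return staff / megawatts


commercial = staff_per_mw(25, 15.0)  # commercial CSP slide
science = staff_per_mw(5, 0.5)       # OSDC science cloud slide

print(f"Commercial cloud: {commercial:.2f} staff per MW")
print(f"Science cloud:    {science:.2f} staff per MW")
print(f"Ratio: {science / commercial:.1f}x")
```

Under these figures the boutique facility uses about six times as many staff per megawatt as the commercial one.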
Tukey
• Tukey (based in part on Horizon).
• We have factored out digital ID service, file sharing, and transport from Bionimbus and Matsu.
Yates
• Automated installation of the OSDC software stack on a rack of computers.
• Based upon Chef
• Version 0.1
UDR
• UDT is a high performance network transport protocol
• UDR = rsync + UDT
• It is easy for an average systems administrator to keep 100’s of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC
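To give a sense of scale for distributing about 1 PB, the sketch below estimates ideal transfer times at the 10 and 100 Gbps link speeds quoted for the OSDC data center network. It assumes decimal units (1 PB = 10^15 bytes) and a fully utilized link, which real transfers will not achieve. The function name is hypothetical, and nothing here reflects measured UDR performance.

```python
def transfer_days(petabytes: float, gbps: float) -> float:
    """Ideal transfer time in days for a payload at a given line rate."""
    bits = petabytes * 1e15 * 8      # decimal petabytes to bits
    seconds = bits / (gbps * 1e9)    # gigabits per second to bits per second
    return seconds / 86_400          # seconds per day


for rate in (10, 100):
    print(f"1 PB at {rate} Gbps: {transfer_days(1.0, rate):.1f} days")
```

Even at full line rate, 1 PB takes on the order of days at 10 Gbps, so sustained throughput over wide area links matters a great deal at this scale.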
Open Science Data Cloud Services
• Digital ID services
• Data sharing services
• Data transport services (UDR)
• What other core services are essential?
• Of course, working groups and applications always add their own services
• These core services will hopefully make the OSDC attractive as a platform (PaaS) for scientific discovery.
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
• Manages cloud computing infrastructure to support medical and health care research: Biomedical Commons Cloud
• Manages cloud computing testbeds: Open Cloud Testbed.
OCC Members & Partners
• Companies: Cisco, Yahoo!, Intel, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA
• International Partners: Univ. Edinburgh, AIST (Japan), Univ. Amsterdam, …
• Partners: National Lambda Rail
Third party open source software
+ Open source software developed by the OCC (Tukey, Yates) and open standards
+ Data center
+ Data with permissions
+ Authorization of users' access to data
+ Policies, procedures, controls, etc.
+ Governance, legal agreements
+ Sustainability model
Part 5. OSDC Data
Data: 1 PB of OSDC data across several disciplines
+ Instrument: 3000 cores / 5 PB OSDC science cloud
+ Team: you and your colleagues
+ Correlation algorithms
= Discoveries
OSDC Public Data Sets
• Over 800 TB of open access data in the OSDC
• Earth sciences data
• Biological sciences data
• Social sciences data
• Digital humanities
Part 6. OSDC Working Groups
Just look around you
Matsu Working Group: Clouds to Support Earth Science
matsu.opensciencedatacloud.org
Matsu Architecture
• Hadoop HDFS
• Matsu Web Map Tile Service (WMTS)
• Matsu MR-based Tiling Service
• NoSQL Database
• Images at different zoom layers suitable for OGC Web Mapping Server
• Level 0, Level 1 and Level 2 images
• MapReduce used to process Level n to Level n+1 data and to partition images for different zoom levels
• NoSQL-based Analytic Services
• Streaming Analytic Services
• MR-based Analytic Services
• Analytic Services
• Storage for WMS tiles and derived data products
• Presentation Services
• Web Coverage Processing Service (WCPS)
• Workflow Services
Hadoop-Based Re-Analysis
• Zoom Level 1: 4 images
• Zoom Level 2: 16 images
• Zoom Level 3: 64 images
• Zoom Level 4: 256 images
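The image counts follow a quadtree pattern: each zoom level splits every tile into four, so level n has 4^n images. A minimal sketch, assuming that pattern holds for all levels (the function name is hypothetical):

```python
def tiles_at_zoom(level: int) -> int:
    """Number of tiles at a zoom level when each tile splits into 2 x 2 children."""
    if level < 0:
        raise ValueError("zoom level must be non-negative")
    return 4 ** level


for level in range(1, 5):
    print(f"Zoom Level {level}: {tiles_at_zoom(level)} images")
```

Because the count quadruples per level, the deepest zoom levels dominate the tiling work.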
Bionimbus Working Group
bionimbus.opensciencedatacloud.org (biological data)
Bionimbus Protected Data Cloud
Analyzing Data From The Cancer Genome Atlas (TCGA)
Current Practice
1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate a secure, compliant computing environment to manage 10-100+ TB of data.
3. Get environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CG-Hub (takes days to weeks).
6. Begin analysis.
With Protected Data Cloud (PDC)
1. Apply to dbGaP for access to data.
2. Use your eRA Commons credentials to log in to the PDC, select the data that you want to analyze, and the pipelines that you want to use.
3. Begin analysis.
One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB
• With compression, it may be about 100 PB
• At $1000/genome, the sequencing would cost about $1B
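The estimates above are straightforward arithmetic, reproduced in the sketch below. It uses decimal units (1 PB = 1000 TB, 1 EB = 1000 PB). The 10x compression ratio is an assumption inferred from the "1000 PB to about 100 PB" figures and is not stated explicitly on the slide.

```python
TB_PER_PB = 1_000
PB_PER_EB = 1_000

genomes = 1_000_000
tb_per_patient = 1            # from the slide: about 1 TB per patient
cost_per_genome_usd = 1_000   # from the slide: $1000/genome
compression_ratio = 10        # assumption implied by 1000 PB -> about 100 PB

raw_pb = genomes * tb_per_patient / TB_PER_PB
raw_eb = raw_pb / PB_PER_EB
compressed_pb = raw_pb / compression_ratio
sequencing_cost_usd = genomes * cost_per_genome_usd

print(f"Raw: {raw_pb:,.0f} PB ({raw_eb:.0f} EB)")
print(f"Compressed: {compressed_pb:,.0f} PB")
print(f"Sequencing cost: ${sequencing_cost_usd / 1e9:.0f}B")
```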
Big data driven discovery on 1,000,000 genomes and 1 EB of data.
• Genomic-driven diagnosis
• Improved understanding of genomic science
• Genomic-driven drug development
• Precision diagnosis and treatment. Preventive health care.
Biomedical Commons Cloud (BCC) Working Group
Cloud for Public Data
Cloud for Controlled Genomic Data
Cloud for EMR, PHI, data
Example: Open Cloud Consortium’s Biomedical Commons Cloud (BCC)
Medical Research Center A
Medical Research Center B
Hospital D
Medical Research Center C
| Resource | Who uses | Who operates |
| --- | --- | --- |
| Open Science Data Cloud (OSDC) | Pan science data for researchers | Open Cloud Consortium (OCC) supported by University OCC members |
| Biomedical Commons Clouds (BCC) | (International) biomedical researchers | OCC Biomedical Commons Cloud Working Group supported by OCC University members |
| Bionimbus Protected Data Cloud | Genomics researchers | University of Chicago supported by the OCC |
OpenFlow-Enabled Hadoop WG
• When running Hadoop, some map and reduce jobs take significantly longer than others.
• These are stragglers and can significantly slow down a MapReduce computation.
• Stragglers are common (dirty secret about Hadoop)
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
• We have a testbed for a wide area version of this project.
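The effect of a straggler can be sketched with a toy model: a parallel phase of a MapReduce job finishes only when its slowest task does, so the phase time is the maximum task time. The model assumes every task runs concurrently in its own slot. The task durations are made up for illustration and the function name is hypothetical.

```python
def phase_time(task_seconds: list[float]) -> float:
    """A fully parallel phase completes when its slowest task completes."""
    return max(task_seconds)


normal_tasks = [60.0] * 99               # hypothetical: 99 tasks take one minute each
with_straggler = normal_tasks + [300.0]  # one extra task takes five minutes

print(f"Without straggler: {phase_time(normal_tasks):.0f} s")
print(f"With one straggler: {phase_time(with_straggler):.0f} s")
```

In this model a single slow task out of 100 makes the whole phase five times longer, which is why speeding up only the stragglers can pay off.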
OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow. www.opensciencedatacloud.org (look for PIRE)
Part 7. Key Questions for This Workshop
• Question 1. How can we add partner sites at other locations that extend the OSDC? In particular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?
• Question 2. What data can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?
• Question 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communities?
• Question 4. What tools and applications can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?
• Question 5. How can we better integrate digital IDs and file sharing services into the OSDC?
• Question 6. What are 3-5 grand challenge questions that leverage the OSDC?
Questions
Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informatics Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computation Institute and the Institute for Genomics and Systems Biology, and a Professor of Medicine in the Section of Genetic Medicine. His research group focuses on big data, biomedical informatics, data science, cloud computing, and related areas. He is also the Founder and a Partner of Open Data Group, which has been building predictive models over big data for companies for over ten years. He recently wrote a book for the general reader that discusses big data (among other topics) called The Structure of Digital Computing: From Mainframes to Big Data, which can be purchased from Amazon. He blogs occasionally about big data at rgrossman.com.