
Page 1

Exploring Clouds for Acceleration of Science
A project funded under a cooperative agreement with the National Science Foundation
NSF Award number 190444
Jamie Sunderland, Executive Director, Service Development, Internet2

Page 2

If you had unlimited cloud compute power, how would your application run?

[Chart: MPI Rank vs Runtime – E-CAS Purdue “Building Clouds”, Nov 2019. Time (hrs) vs MPI rank (cores). Baseline: 20 hours using 48 cores.]
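The scaling question behind this chart can be made concrete with a few lines of Python. Only the 20-hour/48-core baseline comes from the slide; the other (cores, hours) pairs below are hypothetical, purely to illustrate the speedup/efficiency calculation:

```python
# Speedup and parallel efficiency relative to the stated baseline:
# 20 hours on 48 cores. All other (cores, hours) pairs below are
# hypothetical illustrations, not measured E-CAS results.

baseline_cores, baseline_hours = 48, 20.0

runs = [
    (48, 20.0),    # baseline
    (96, 11.0),    # hypothetical
    (192, 6.5),    # hypothetical
    (384, 4.8),    # hypothetical
]

for cores, hours in runs:
    speedup = baseline_hours / hours
    # Efficiency: achieved speedup divided by the ideal speedup
    # implied by the increase in core count.
    efficiency = speedup / (cores / baseline_cores)
    print(f"{cores:4d} cores: {hours:5.1f} h  "
          f"speedup {speedup:4.2f}x  efficiency {efficiency:5.1%}")
```

Falling efficiency at higher core counts is exactly the flattening curve several of the projects observed.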

Page 3

What is E-CAS?
A project funded by NSF and supported by commercial cloud providers (Amazon Web Services and Google Cloud Platform).

To:
• Investigate the viability of commercial clouds for leading-edge research computing and computational science supporting a range of academic disciplines.
• Leverage the novel capabilities of cloud-based heterogeneous hardware resources and platforms, such as CPUs, GPUs, and FPGAs, for scientific applications and workflows.
• Achieve the best time-to-solution for workflows and scalability for scientific applications using cloud computing.

Funding structure:
• Phase 1 (1 year) – supported by AWS/GCP credits & NSF grants: 6 sub-awards selected through a peer-review process, each with an $80k NSF grant and $100k in AWS/GCP credits.
• Phase 2 (1 year) – funded by NSF grants: 2 NSF grants of ≤ $880k each (2x $500k and 2x $380k across the Acceleration and Innovation tracks), including staff and overheads.

Page 4

Focus Areas
The program will include two primary classes of studies:

• Acceleration of Science: The goal of the Acceleration of Science studies is to achieve the best time-to-solution for scientific applications/workflows using the cloud. The measures of acceleration may include end-to-end performance (e.g., wall clock and data movement), or other relevant measures such as the number of concurrent simulations or workflows, or the ability to process near real-time streaming data.

• Innovation: The goal of the Innovation studies is to explore the innovative use of heterogeneous hardware resources such as CPUs, GPUs, and FPGAs to support and extend application workflows.

Page 5

The phase 1 projects are:

• “Accelerating Science by Integrating Commercial Cloud Resources in the CIPRES Science Gateway”, Mark Miller, San Diego Supercomputer Center (UCSD).

• “Investigating Heterogeneous Computing at the Large Hadron Collider”, Phillip Harris, Massachusetts Institute of Technology (MIT).

• “IceCube computing in the cloud”, Benedikt Riedel, University of Wisconsin–Madison.

• “Building Clouds: Worldwide building typology modelling from images”, Daniel Aliaga, Purdue University.

• “Deciphering the Brain's Neural Code Through Large-Scale Detailed Simulation of Motor Cortex Circuits”, William Lytton, State University of New York (SUNY Downstate Medical Center).

• “Development of BioCompute Objects for Integration into Galaxy in a Cloud Computing Environment”, Raja Mazumder, George Washington University.

The provider technologies (listed in the same order as the projects above):
• AWS + NVIDIA V100 GPUs, bursting from XSEDE Comet using I2 Cloud Connect.
• AWS FPGAs + Google machine learning framework.
• AWS FPGAs, GPUs + TensorFlow machine learning in GCP.
• Computer vision, procedural modelling, and ML on GCP.
• NetPyNE and Slurm to burst from campus HPC to GCP, up to 50,000 cores (see the sketch below).
• AWS and I2 Direct Connect between campus HPC and the Galaxy service on AWS.
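As a rough illustration of that burst pattern, here is a minimal Python sketch that submits a NetPyNE-style job to a Slurm partition backed by cloud nodes. The partition name, script contents, and resource sizes are hypothetical placeholders, not the SUNY team's actual configuration; real cloud bursting also depends on site-specific Slurm elastic-node setup.

```python
import subprocess

# Hypothetical sbatch script: the partition "gcp-burst" stands in for
# a Slurm partition whose nodes are provisioned in GCP. Paths, module
# names, and resource sizes are illustrative only.
SBATCH_SCRIPT = """#!/bin/bash
#SBATCH --job-name=netpyne-sim
#SBATCH --partition=gcp-burst
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=16
#SBATCH --time=96:00:00

module load mpi
srun python run_netpyne_model.py --config motor_cortex.json
"""

def submit():
    # sbatch reads the job script from stdin when no file is given.
    result = subprocess.run(
        ["sbatch"], input=SBATCH_SCRIPT, text=True,
        capture_output=True, check=True,
    )
    print(result.stdout.strip())  # e.g. "Submitted batch job 12345"

if __name__ == "__main__":
    submit()
```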

Page 6

MIT: Heterogeneous Computing of Large Hadron Collider Data

Only a small fraction of the 40 million collisions per second at the LHC is stored and analyzed, due to the huge volume of data and the compute power required to process it. With the high-luminosity upgrade, data rates will more than double.
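Rough arithmetic makes clear why a trigger is unavoidable. A sketch in Python, where the ~1 MB/event size and the kept fraction are assumed round figures for illustration (only the 40 million collisions/second comes from the slide):

```python
# Why the LHC cannot store every collision: rough data-rate arithmetic.
# The ~1 MB/event figure and the kept fraction are assumptions.

collision_rate_hz = 40e6   # 40 million collisions per second (from the slide)
event_size_bytes = 1e6     # assumed ~1 MB per recorded event

raw_rate = collision_rate_hz * event_size_bytes        # bytes/second
print(f"Unfiltered: {raw_rate / 1e12:.0f} TB/s")       # ~40 TB/s

# A trigger keeping roughly 1 in 40,000 events brings this down
# to a storable rate.
kept_fraction = 1 / 40_000
print(f"After triggering: {raw_rate * kept_fraction / 1e9:.0f} GB/s")
```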

Processing more collisions could lead to foundational discoveries of new physics.

Advances in computational techniques will flow through to other sciences.

Image credit: CERN

Page 7

UW Madison: Cloud Computing for the IceCube Neutrino Observatory

The IceCube Neutrino Observatory, located at the South Pole, supports science from a number of disciplines including astrophysics, particle physics, and geographical sciences; it operates continuously and is simultaneously sensitive to the whole sky.

Astrophysical neutrinos yield understanding of the most energetic events in the universe and could reveal the origin of cosmic rays.

Being able to burst into the cloud supports follow-up computations of observed events and alerts to and from the community, such as other telescopes and LIGO.

Image Credit: Erik Beiser, IceCube/NSF

Page 8

SDSC: CIPRES Phylogenetic Analysis Gateway

The CIPRES Science Gateway is used to create phylogenetic trees (maps of evolutionary relationships between organisms, derived from gene/genome sequences).

Currently using XSEDE Comet: >20,000 users, >2,400 jobs/month.

Cloud use overcomes insufficient allocations on XSEDE and provides access to newer GPU technology.

Page 9

GWU: BioCompute Objects in Galaxy Science Gateway

BioCompute Objects allow researchers to describe bioinformatic analyses composed of any number of algorithmic steps and variables, making computational experimental results clearly understandable and easier to repeat (see the sketch below).

Galaxy is a widely used bioinformatics platform that aims to make computational biology accessible to research scientists who do not have programming experience.
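To make the BioCompute idea concrete, here is a loose sketch of a BCO as structured metadata. It follows the general IEEE 2791 domain layout but is abbreviated and illustrative; the field values are placeholders, not a schema-valid instance:

```python
# A loosely sketched BioCompute Object: the domain names follow the
# general IEEE 2791 layout, but this is abbreviated and illustrative,
# not a validated instance of the schema.
bco = {
    "provenance_domain": {
        "name": "Example variant-calling analysis",
        "version": "1.0",
        "contributors": [{"name": "A. Researcher", "affiliation": "Example U."}],
    },
    "description_domain": {
        # Each algorithmic step is recorded so the analysis can be repeated.
        "pipeline_steps": [
            {"step_number": 1, "name": "bwa mem", "description": "align reads"},
            {"step_number": 2, "name": "samtools sort", "description": "sort BAM"},
            {"step_number": 3, "name": "bcftools call", "description": "call variants"},
        ],
    },
    "execution_domain": {
        "script": ["run_pipeline.sh"],
        "environment_variables": {"THREADS": "8"},
    },
    "io_domain": {
        "input_subdomain": [{"uri": "s3://example-bucket/reads.fastq"}],
        "output_subdomain": [{"uri": "s3://example-bucket/variants.vcf"}],
    },
}
```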

Page 10

Purdue: Building Typology for Urban Climate Modelling

Seeks to improve weather and climate modelling using urban canopy parameters by creating a set of 3D building models for urban areas worldwide.

Combines computer vision, procedural modeling, and machine learning techniques (e.g., convolutional neural networks) to infer, from photographs and partial metadata, 3D models suitable for calculating the building-scale properties behind the urban canopy parameters used in weather modeling.
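Since the program pairs this project with TensorFlow-style ML tooling in GCP, a minimal Keras sketch suggests the kind of CNN classifier involved; the input size and class count are placeholders, not the project's actual architecture:

```python
import tensorflow as tf

# Minimal CNN that maps a building photograph to one of N typology
# classes. Input size and class count are placeholders; the actual
# Purdue pipeline is more elaborate (procedural modeling + CV stages).
NUM_TYPOLOGY_CLASSES = 16  # hypothetical

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_TYPOLOGY_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```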

Page 11

SUNY Downstate: Deciphering the Brain’s Neural Code

Using extremely large-scale and detailed modelling of the brain's cortical circuits, this project seeks to decipher the codes through which the brain stores, processes, and transmits information.

Page 12

Project overview. The first year…
• Fast ramp-up: RFP, review, selection, contracting, and projects running in under 6 months.

• Attracted ~20 proposals, most of high quality (peer-reviewed selection).

• Good range of scientific disciplines, geographic spread – skewed to R1, HEP, Biotech.

• Bi-weekly all-teams project meetings. Monthly progress reports. Credit & invoicing all OK.

• Some projects have already demonstrated computation they could not previously do (scale/allocations/performance).

• Identifying challenges in migrating workflows and software pipelines to cloud platforms.

• Projects truly “exploring” cloud to see how it can accelerate their science and provide avenues for innovation. Starting to identify some limitations and performance issues.

Page 13

Some early successes…
• SUNY Downstate “Neural Code” project was able to run larger simulations on GCP than could be done on previous XSEDE allocations. This was due in part to requiring extra-large memory instances, but also to 90+ hour run-times and 30,000+ cores.
• SDSC CIPRES project achieved sufficient throughput using I2 Cloud Connect to use NFS mounts for data staging, with no need to upload to S3 buckets. This enables “real-time” feedback as jobs run, using faster NVIDIA V100 GPUs in the us-east-1 region with data located at SDSC (see the staging sketch after this list).
• MIT (together with CERN and Fermilab) held a workshop on Accelerated Machine Learning: demonstration and training in reusable FPGA, TPU, and GPU-as-a-service ML triggering algorithms, with ~80 participants. Also developing MLaaS, TPUaaS, etc. to share with the community.
• 3 teams have already won further awards/contracts based on foundational work in E-CAS:
  – UW IceCube team worked with OSG under another NSF grant to demonstrate scaling up to 50,000 GPU cores across AWS/GCP/Azure: 28 regions, 8 GPU models, an estimated 200 PetaFLOPS.
  – GWU BioCompute Objects project has secured a contract with the FDA.
  – SUNY has been awarded a grant from NIH to continue neural code research.
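A hedged sketch of the two staging patterns behind the CIPRES result above; bucket names, keys, and mount paths are hypothetical:

```python
import shutil
import boto3  # only needed for the S3 pattern

# Pattern 1: explicit object-store staging. Data must be uploaded to a
# bucket before the job and downloaded after it finishes.
def stage_via_s3(local_path: str):
    s3 = boto3.client("s3")
    s3.upload_file(local_path, "example-cipres-bucket", "inputs/job1.phy")
    # ... run job, then download results ...
    s3.download_file("example-cipres-bucket", "outputs/job1.tre", "job1.tre")

# Pattern 2: NFS mount over a fast, dedicated path (e.g. Internet2
# Cloud Connect). The cloud instance sees campus storage as a normal
# filesystem, so results can be watched in near real time as they land.
def stage_via_nfs(local_path: str):
    shutil.copy(local_path, "/mnt/sdsc-nfs/inputs/job1.phy")
    # ... job writes incrementally to /mnt/sdsc-nfs/outputs/ ...
```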

Page 14

Some early general learnings
• Cloud login accounts must be university-backed for large-scale projects. Resources are limited by default; paid support = better support, faster upgrades.
• Projects combine funds/credits from multiple sources. Clouds use credits before debit.
• Teams prefer to meet as a group and share (most of) what they are doing.
• There are several different ways of doing the same thing. Which is best? (Time is spent re-inventing and re-doing benchmarks, etc.)
• Newer resources = limited availability or significantly different spot prices (see the price-check sketch after this list).
• Porting code takes significant time.
• There are many “gotchas” behind the scenes.
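On the spot-price point, prices can be checked programmatically before committing to a region or instance family. A minimal boto3 sketch; the region and instance types are example choices, not project settings:

```python
from datetime import datetime, timedelta
import boto3

# Query recent Linux spot prices for a few GPU instance types.
# Region and instance types are illustrative choices.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_spot_price_history(
    InstanceTypes=["p3.2xlarge", "p3.8xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=1),
)
for item in resp["SpotPriceHistory"]:
    print(item["AvailabilityZone"], item["InstanceType"], item["SpotPrice"])
```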

Page 15

Examples of issues (Note: investigation still ongoing)

[Chart: WRF Simulation Run-Time vs MPI Rank – Hurricane Florence simulation; simulation time (hours) vs cluster nodes (1, 2, 4, 8, 8, 16) / MPI rank (cores).]

Purdue baseline: 20 hours using 48 cores
• Switched from N1 (600) to C1 (60)
• Best time: 14 hrs with C1 (240)
• Does not scale further

• Working with GCP and NCAR to resolve/improve

• Other parts of the pipeline have sped up significantly, especially GPU-based image identification and ingest.


Page 16

Examples of issues (Note: Investigation still ongoing)

• SDSC is seeing a consistent slow-down of approx. 30–40% on V100 GPUs in the cloud vs. Comet HPC.
• Appears to be related to data-set size; “modest” jobs are affected.
• Large 16GB datasets run fine.
• Looking at V100 feature settings.
• Trying p3dn instances (100G networking).
• Have provided Singularity images and sample data sets to AWS for diagnosis; in progress (see the timing sketch below).

Tech details (AWS):
• Custom AMI, Singularity container
• CUDA parallelizing to GPUs
• p3.2xlarge and p3.8xlarge instances
• 4x V100/16GB, 244GB RAM, 10Gbps

Tech details (Comet):
• 4x V100/16GB, 190GB RAM, 100Gbps
• Hyperthreading off
• Xeon Gold 6130 CPUs, 190GB RAM
• Practically all computation done on GPU
• Small jobs only need 1 GPU
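For apples-to-apples diagnosis like this, one simple habit is timing the identical Singularity container on both systems so the software stack is held constant. A sketch, with a placeholder image name and command (`--nv` is Singularity's flag for exposing host GPUs):

```python
import subprocess
import time

# Time the same Singularity container on cloud and on Comet so the
# software stack is held constant. Image and command are placeholders.
CMD = [
    "singularity", "exec", "--nv",   # --nv exposes the host GPUs
    "cipres-gpu.sif",
    "python", "run_job.py", "--input", "sample.phy",
]

start = time.perf_counter()
subprocess.run(CMD, check=True)
elapsed = time.perf_counter() - start
print(f"wall clock: {elapsed / 3600:.2f} h")
```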

Page 17

Examples of issues (Note: Investigation still ongoing)

GWU is seeing a slowdown on Galaxy (HIVE) (CPU workloads):

• “Ramp-up” time vs. pre-launch vs. reserved instances (RI) – see the break-even sketch below.

• Looking at economics of RI/on-demand combinations.

• Further testing required
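The reserved-vs-on-demand question reduces to a utilization break-even. A sketch with hypothetical prices; the structure of the calculation, not the numbers, is the point:

```python
# Break-even utilization for a reserved instance (RI) vs on-demand.
# Prices are hypothetical placeholders, not current AWS rates.
on_demand_per_hour = 0.40
ri_effective_per_hour = 0.25   # 1-year commitment amortized per hour

hours_per_year = 365 * 24
ri_annual_cost = ri_effective_per_hour * hours_per_year

# An RI is billed whether used or not, so it wins only when the
# instance would otherwise run more than this many on-demand hours.
break_even_hours = ri_annual_cost / on_demand_per_hour
print(f"RI pays off above {break_even_hours / hours_per_year:.0%} utilization")
# about 62% with these example prices
```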

Page 18

Some of the other issues identified

• Large pools require university-backed contracts, and support tickets to increase limits
• Very large pools require multiple zones and providers, particularly for GPUs
• Continuous discussion required – e.g., don't run large pools on Friday afternoons
• Social engineering to build excitement vs. paid support tiers (tickets)
• Cloud FPGA tools have limited flexibility
• While Google has been very helpful, discussion groups usually get results
• Time spent porting code to capture frame-buffers from cloud GPUs
• Time to build custom images with appropriate libraries

Page 19

What could be better in the E-CAS program?
• DIVERSITY: Only 1 proposal was received from a non-R1 institution (not selected).
• DIVERSITY: Most came from well-funded science disciplines: HE physics, biotech, etc.
• PLANNING: Most projects appear to be “learning the technologies as they go”, spending a lot of time on the technology rather than the science.
• SUPPORT: Groups that have institutional-level contracts with cloud providers, including a level of paid support, were able to get faster responses on queries and resource-limit increases.
• SUPPORT: Several projects are performance testing (and repeating) certain elements of their projects, duplicating the work of other projects.
• SUPPORT: Greater assistance and interaction is needed between cloud providers, Internet2, and project groups: active participation across projects, more than basic support/tracking.
• MONITORING: Management software is needed to enable better tracking of spend and alerting on best practices, including security.
• TIMING & FUNDING: No overlap between phases means it may be difficult to keep grad students or research assistants between phases. There are many more people working on the projects than are actually funded; projects need more people $$ overall.

Page 20

Summary
• Projects are now well into the “learning and exploring” stage of phase 1. Some have demonstrated innovative use of cloud technologies and some have demonstrated extreme scale.

• There have been many small learnings along the way that are being captured.

• Teams like to hear each other's experiences and share tips on performance improvements or areas to look for faults.

• We are excited to see progress; all 6 teams will report in detail at a workshop alongside the Internet2 Global Summit in Indianapolis, March/April 2020.

Project page: https://internet2.edu/ecas · Wiki: https://spaces.at.internet2.edu/display/ECAS/