
Jet Propulsion Laboratory, California Institute of Technology, National Aeronautics and Space Administration

High-Resiliency and Auto-Scaling of Large-Scale Cloud Computing for NASA’s OCO-2 L2 Full Physics Processing

HySDS Team: Hook Hua, Gerald Manipon, Paul Ramirez, Phillip Southam, Michael Starch, Lan Dang, Brian Wilson
OCO-2 SDOS Team: Charlie Avis, Albert Chang, Cecilia Cheng, Lan Dang, James McDuffie, Mike Smyth

Jet Propulsion Laboratory, California Institute of Technology. Poster Number: IN43B-1733

Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California. http://www.nasa.gov
Copyright 2015 California Institute of Technology. Government sponsorship acknowledged.

Background
•  OCO-2 launched on July 2, 2014, at the head of the A-Train
•  Collects global measurements of atmospheric carbon dioxide with the precision, resolution, and coverage needed to characterize sources and sinks, improving our understanding of the global carbon cycle
•  NASA OCO-2 Science Data Operations System (SDOS): forward (1X) and bulk (4X) processing
•  L2 bulk processing ported to the NASA Ames Pleiades Supercomputer
•  L2 full physics processing on ~200 nodes (15X), running 48 PGE processes on each compute node

Motivation
•  Frequent science requirement changes: needed an agile science data system approach
•  Increase in science computing needs
•  Pleiades Supercomputer scheduled downtimes conflict with science processing requirements
•  Need elastic and large-scale processing capability

Provenance Support
•  Data production compliant with the Provenance for Earth Science (PROV-ES) specification (an illustrative record follows below)
•  Near real-time view of provenance from production
•  Faceted search of provenance
•  Visualization of provenance
•  NASA ESDSWG PROV-ES Working Group
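To make the provenance bullet concrete, below is an illustrative, minimal W3C PROV-JSON-style record for one L2FP job. The namespace prefix and identifiers are hypothetical, and PROV-ES layers Earth-science-specific attributes (instrument, mission, location, and so on) on top of this core structure that are not shown here; this is a shape sketch, not the project's actual records.

```python
# Minimal PROV-JSON-style sketch of one L2FP job (hypothetical identifiers).
import json

prov_doc = {
    "prefix": {"hysds": "http://example.org/hysds#"},            # hypothetical namespace
    "entity": {
        "hysds:oco2-l1b-granule-0001": {"prov:label": "L1B input granule"},
        "hysds:oco2-l2fp-granule-0001": {"prov:label": "L2 full physics output"},
    },
    "activity": {
        "hysds:job-l2fp-0001": {
            "prov:startTime": "2015-06-01T00:00:00Z",
            "prov:endTime": "2015-06-01T01:00:00Z",
        }
    },
    # The job "used" the L1B input...
    "used": {
        "_:u1": {"prov:activity": "hysds:job-l2fp-0001",
                 "prov:entity": "hysds:oco2-l1b-granule-0001"}
    },
    # ...and the L2FP output "wasGeneratedBy" that job.
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "hysds:oco2-l2fp-granule-0001",
                 "prov:activity": "hysds:job-l2fp-0001"}
    },
}

print(json.dumps(prov_doc, indent=2))
```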

Rapid Hybrid Cloud Enablement
•  Day 1
   –  Team formed to bring in cloud capabilities via HySDS (Hybrid Science Data System)
   –  Introduction to HySDS, bursting out to the cloud, and current cloud computing strategies
•  Day 8
   –  HySDS team successfully integrated the Level-2 Full Physics (L2FP) executable into HySDS and demonstrated parallel runs on Amazon Web Services (AWS)
   –  L2FP developers validated outputs from AWS
•  Day 17
   –  End-to-end testing between SDOS (at JPL) and HySDS (in AWS)
   –  Benchmarking on AWS to optimize cost-effectiveness
•  Day 21
   –  SDOS takes ownership of the PCS-HySDS subsystem
   –  Technology transfer training: AWS basics and HySDS operations
   –  Access to HySDS source control
•  Day 37
   –  HySDS official delivery and operational handoff to SDOS
   –  Introduction of on-premise HySDS workers on the OCO-2 cluster to handle uploads and downloads
   –  High-resiliency operations on the AWS "spot market": ~90% cheaper than "on-demand"

[Figure: L2FP processing deployments, April 2015 to June 2015: the OCO-2 cluster at JPL (PCS, PBS, local processing), PPS on Pleiades (Ames), and HySDS on AWS.]

•  Workers processed a subset of soundings per granule (see the partitioning sketch below)

[Figure: a granule's sounding ids are subsetted; each AWS EC2 worker reads its subset of sounding ids as input from AWS S3 and writes the processed soundings back to S3 as output.]
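A minimal sketch of that fan-out: split one granule's sounding ids into fixed-size subsets, one subset per parallel L2FP worker job. The chunk size and sounding-id format below are illustrative, not the operational values.

```python
# Split a granule's sounding ids into per-worker subsets (illustrative values).
from typing import Iterable, List

def partition_soundings(sounding_ids: Iterable[str],
                        per_worker: int) -> List[List[str]]:
    """Split a granule's sounding ids into fixed-size subsets, one per worker."""
    ids = list(sounding_ids)
    return [ids[i:i + per_worker] for i in range(0, len(ids), per_worker)]

# Example: 96 soundings, 8 per worker -> 12 jobs to enqueue.
subsets = partition_soundings((f"2015060112{i:04d}" for i in range(96)), 8)
assert len(subsets) == 12
```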

[Figure: A-Train coordinated observations: AIRS (T, P, H2O, CO2, CH4); MODIS (clouds, aerosols, albedo); TES (T, P, H2O, O3, CH4, CO); MLS (O3, H2O, CO); OMI (O3); 3-D clouds; 3-D aerosols; aerosol polarization; CO2, ps, clouds, and aerosols.]

Hybrid-Cloud Science Data System
•  The SDS can be dynamically provisioned on-premise at JPL, at a DAAC, and/or at public cloud providers
•  Scalable workers enable control over data throughput and monitoring of data movement between data centers
•  Compute and object storage are horizontally scalable
•  Data products in cloud object stores
   –  Redundancy and replication policies
   –  Scalable high-performance storage
   –  Accessible as URLs
•  Use of AWS GovCloud addresses export-control issues

HySDS Architecture: Auto-Scaling Science Data System

[Architecture diagram: JPL MOS/GDS delivers raw instrument L0a data, ancillary data, and instrument housekeeping to the JPL on-premise cloud and/or AWS. Resource management localizes inputs and publishes data products, while an auto-scaling monitor alerts and provisions compute. L0-L2 pipeline and ingest workers (VMs/containers) get jobs from workflow management, which orchestrates jobs; a crawler worker monitors for events handled by event management, ingest workers extract data, triage workers write work data to an ops work-data cache, and product-localize workers write data products to DAAC staging. Data products are published to an object store and metadata is published to discovery services, which sends notifications.]

Partitioning Jobs in AWS

•  The number of science data system compute nodes can automatically grow or shrink based on processing demand (see the scaling sketch below)
•  Auto-scaling tested up to 3,000 compute workers (96,000 simultaneous l2_fp processes)
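A minimal sketch of demand-driven scaling with boto3, assuming a hypothetical Auto Scaling group name and a caller that reports the queued-job count; the production monitor applies smoothing, cost limits, and spot-market logic not shown here.

```python
# Grow/shrink the worker fleet so capacity tracks the job backlog (sketch).
import boto3

autoscaling = boto3.client("autoscaling")
GROUP = "hysds-l2fp-workers"      # hypothetical Auto Scaling group name
JOBS_PER_NODE = 32                # e.g. one job per vCPU on a 32-vCPU instance
MAX_NODES = 3000

def rescale(queued_jobs: int) -> int:
    """Set desired capacity from the backlog; the ASG's MinSize/MaxSize still bound it."""
    desired = min(MAX_NODES, max(0, -(-queued_jobs // JOBS_PER_NODE)))  # ceil division
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=GROUP,
        DesiredCapacity=desired,
        HonorCooldown=True,
    )
    return desired
```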

High-Resiliency & Spot Market

US-West-2 (Oregon) pricing, hourly and per-vCPU:

| instance | vCPU | memory (GiB) | memory/vCPU | instance storage (GB) | on-demand ($/hr) | reserved 1-yr upfront ($/hr) | reserved 3-yr upfront ($/hr) | spot Linux ($/hr) | on-demand ($/vCPU/hr) | reserved 1-yr upfront ($/vCPU/hr) | reserved 3-yr upfront ($/vCPU/hr) | spot Linux ($/vCPU/hr) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| m2.4xlarge | 8 | 68.4 | 8.55 | 2 x 840 | $1.0780 | $0.4087 | $0.2444 | $0.1000 | $0.1348 | $0.0511 | $0.0306 | $0.0125 |
| cc2.8xlarge | 32 | 60.5 | 1.89 | 4 x 840 | $2.0000 | $0.9131 | $0.6137 | $0.2705 | $0.0625 | $0.0285 | $0.0192 | $0.0085 |
| m3.2xlarge | 8 | 30.0 | 3.75 | SSD 2 x 80 | $0.6160 | $0.3750 | $0.2300 | $0.0700 | $0.0770 | $0.0469 | $0.0288 | $0.0088 |
| c3.8xlarge | 32 | 60.0 | 1.88 | SSD 2 x 320 | $1.6800 | $0.9920 | $0.6280 | $2.4001 | $0.0525 | $0.0310 | $0.0196 | $0.0750 |
| r3.8xlarge | 32 | 244.0 | 7.63 | SSD 2 x 320 | $2.8000 | $1.4860 | $0.9820 | $2.8000 | $0.0875 | $0.0464 | $0.0307 | $0.0875 |
| c3.xlarge | 4 | 7.5 | 1.88 | SSD 2 x 40 | $0.2310 | $0.1370 | $0.0870 | $0.0353 | $0.0578 | $0.0343 | $0.0218 | $0.0088 |

•  Major cost savings (~10X) if spot instances can be used
•  On the spot market, AWS will terminate compute instances if market prices exceed the bid threshold
•  HySDS is able to self-recover from spot instance terminations (see the termination-notice sketch below)
•  Running in the spot market forces the data system to be more resilient to compute failures

[Figure: spot price history. This OCO-2 production run of 1,000 x 32-vCPU instances itself affected the market prices.]
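A minimal sketch (not the HySDS implementation) of how a spot worker can detect AWS's two-minute termination warning through the EC2 instance metadata endpoint and stop pulling new work so queued jobs are retried elsewhere. It assumes the metadata service is reachable without a session token (IMDSv1-style), and the job-queue callables are hypothetical placeholders.

```python
# Detect the spot two-minute warning and drain gracefully (illustrative sketch).
import time
import urllib.error
import urllib.request

# AWS publishes a spot termination notice at this instance-metadata URL;
# it returns 404 until a termination has been scheduled.
TERMINATION_URL = ("http://169.254.169.254/latest/meta-data/"
                   "spot/termination-time")

def termination_imminent(timeout: float = 2.0) -> bool:
    """Return True once AWS has scheduled this spot instance for termination."""
    try:
        with urllib.request.urlopen(TERMINATION_URL, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False      # 404: no termination notice yet
    except urllib.error.URLError:
        return False      # metadata service unreachable (e.g. not on EC2)

def worker_loop(get_job, run_job, requeue_job):
    """Poll for jobs, but stop accepting work once a spot notice arrives."""
    while True:
        if termination_imminent():
            # Stop pulling new work; anything still queued "drains" back to
            # the orchestrator and is reassigned to surviving workers.
            break
        job = get_job()
        if job is not None:
            try:
                run_job(job)
            except Exception:
                requeue_job(job)   # fail soft: another worker retries it
        time.sleep(5)
```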

Large-Scale Considerations
•  Spot market terminations: if the market price passes the bid tolerance threshold, instances are terminated
•  Availability Zone (AZ) load rebalancing: nodes terminated to rebalance across AZs
•  Instance failures: 99.9% reliability means about 1 hardware failure per 1,000 nodes
   –  E.g. "EC2 has detected degradation of the underlying hardware hosting one or more of your Amazon EC2 instances in the us-west-2 region. Due to this degradation, your instance(s) could already be unreachable."
   –  E.g. disk failures
•  "Job drain": addressing failures that drain jobs from the work queues
•  "Thundering herd": API rate limits exceeded when thousands of workers call AWS at once (see the backoff sketch after this list)
•  "Market maker": a run this large affects spot market prices
•  S3 object store performance optimizations needed
•  Instance startup with data caching
•  Auto-scaling: slow scale-up needs AWS tweaks; scale-down via the group vs. self-terminating instances
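For the "thundering herd" item, the standard mitigation is to retry throttled AWS API calls with exponential backoff plus random jitter so workers do not retry in lockstep. The sketch below is a generic helper, not the project's code; the retry limits are illustrative.

```python
# Retry throttled AWS API calls with exponential backoff and jitter (sketch).
import random
import time

import botocore.exceptions

def call_with_backoff(fn, *args, max_retries=8, base=1.0, cap=60.0, **kwargs):
    """Call an AWS API, backing off (with jitter) on throttling errors."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except botocore.exceptions.ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in ("Throttling", "RequestLimitExceeded", "SlowDown"):
                raise                       # not a throttle: surface it
            # Sleep a random interval up to the exponentially growing cap.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    return fn(*args, **kwargs)              # final attempt; let errors propagate

# Usage example: call_with_backoff(ec2.describe_instances)
#   where ec2 = boto3.client("ec2")
```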

HySDS for OCO-2 L2 Full Physics
•  Scalability: scaled up to 3,000 x 32 vCPUs = 96,000 vCPUs
•  Operability: flight operational hardening
•  Visibility: faceted search interface for everything
•  Cost: running on the AWS spot market is ~10X cheaper
•  Accepted by OCO-2 as usable and cost-effective for the reprocessing scenario
•  Interfaced with existing SDSes at JPL (e.g., OCO-2 SDOS)
•  Passing along lessons learned

Large-Scaling "Market Maker"

[Figure: data as objects. Each EC2 instance (VM) has ephemeral instance storage (HDD/SSD, e.g. /dev/sda1, /dev/sdb1) used as the PGE's fast local scratch disk; data products live in object storage buckets addressed as URLs (e.g. http://s3/bucket1/object1).]

•  Compute instances can scale up to demand
•  Object storage can scale up data volume and aggregate data throughput to demand

Interoperable interface between the OODT SDS at JPL and HySDS in AWS (see the staging sketch below):
•  OCO-2 SDOS keeps tarball directories and sounding subsets on NFS
•  Tarballs and sounding subsets are uploaded to S3; workers localize the tarballs and download the sounding subset list
•  Workers upload processed soundings (HDF5) to S3; SDOS downloads the processed soundings from S3
•  Scalable storage, scalable compute, scalable data transfer
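A minimal boto3 sketch of that staging interface, assuming a hypothetical bucket name and key layout; the operational paths, credentials, and bookkeeping differ.

```python
# Stage inputs to S3 and pull back processed soundings (hypothetical layout).
import boto3

s3 = boto3.client("s3")
BUCKET = "oco2-l2fp-staging"   # hypothetical bucket name

def upload_inputs(tarball_path: str, subset_list_path: str, granule_id: str) -> None:
    """Stage the input tarball and the sounding-subset list for one granule."""
    s3.upload_file(tarball_path, BUCKET, f"inputs/{granule_id}/inputs.tar.gz")
    s3.upload_file(subset_list_path, BUCKET, f"inputs/{granule_id}/soundings.txt")

def download_results(granule_id: str, dest_dir: str) -> None:
    """Pull back the processed soundings (HDF5) for one granule."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"outputs/{granule_id}/"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.download_file(BUCKET, key, f"{dest_dir}/{key.rsplit('/', 1)[-1]}")
```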

Compute and Storage Usage

AWS Transport and Compute

[Figure: AWS transport and compute. Data in/out flows through US-West-1 while compute runs in US-West-2, with observed transfer rates of ~5-40 MB/s, ~40 MB/s, and ~200 MB/s on the different links.]

•  Entry points into AWS differ: JPL to US-West-1 over a 10 Gbps link; JPL to US-West-2 over the open internet
•  Transport approach (see the transfer sketch below):
   –  Stream data from JPL to S3 in US-West-1 to maximize JPL-AWS network speeds
   –  Compute in other AWS regions (e.g., EC2 in US-West-2), moving input data from S3/US-West-1
   –  Results are moved back to S3/US-West-1
   –  Asynchronously localize results back to JPL from S3/US-West-1
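On the compute side, per-node S3 throughput can be tuned with parallel multipart transfers; aggregate throughput then scales with the number of nodes. The sketch below uses boto3's transfer configuration against a hypothetical staging bucket; the part size and concurrency are illustrative, not the values used operationally.

```python
# Tune S3 transfer concurrency for pulling inputs to local scratch (sketch).
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3", region_name="us-west-1")
config = TransferConfig(multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
                        max_concurrency=10)                    # parallel streams

def localize(key: str, local_path: str) -> None:
    """Download one input object from the us-west-1 staging bucket to local scratch."""
    s3.download_file("oco2-staging-usw1", key, local_path, Config=config)  # hypothetical bucket
```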

[Figures: number of compute nodes over time; per-node transfer rate over time; scalable internal data throughput at 32,000 PGEs on 1,000 nodes.]

Leveraging the Hybrid-cloud Science Data System (HySDS):
•  A set of loosely coupled science data system cloud services: data management, data processing, data access and discovery, operations, and analytics
•  A data system fabric over heterogeneous computing infrastructure
•  A large-scale, high-resiliency, and cost-effective hybrid approach
•  AWS spot market support
•  Faceted navigation for all SDS interfaces
•  Real-time metrics
•  Crowd-sourcing and collaboration
•  Real-time analytics
•  Provenance for Earth Science (PROV-ES)