ecmwf 2014 craig tierney final1 · ncar (lsf) noaa (moab/torque) noaa/ornl (moab/torque) noaa...



Page 1: ECMWF 2014 Craig Tierney final1 · NCAR (LSF) NOAA (Moab/Torque) NOAA/ORNL (Moab/Torque) NOAA (Moab/Torque) NOAA/WCOSS (LSF) Sites Running Rocoto U. of Miami (LSF) Coastal Carolina

Craig Tierney (1)

Nathan Dauchy (2)

Chris Harrop (1)

Forrest Hobbs (3)

(1) Cooperative Institute for Research in Environmental Sciences, University of Colorado at Boulder

(2) Computer Sciences Corporation

(3) National Oceanic and Atmospheric Administration, Earth System Research Laboratory, Global Systems Division

What is Deadline Driven Science?

• Completing the workflow by its deadline is critical to its value

– Real‐time experiments

– Guidance products

• Similar to operational, except

– No guarantees provided to product users

– No impact to life and property when runs are missed


What Are the Challenges?

• Most R&D HPC Systems

– FIFO queue, possibly with fair‐share

– Large mix of users, job sizes, varying operating modes

• Complex time, file, and job dependencies

• Need guarantees to meet deadlines

• Need reliable/resilient/robust workflow management

• No operational staff to monitor job completion 

Solutions need to meet our philosophy of portability


Standing Reservations

Workflow Management

Distributed CRON

Workflow Management with Rocoto

Chris Harrop


What is Workflow Management?

Describe and manage the execution of a collection of tasks in a scientific application.

That’s Easy!!!


What is Workflow Management?

Ensure completion of workflows with complex dependencies on tasks, files, and times, on systems where component failures will happen (when, not if), and with no human actively monitoring jobs.

That’s Not So Easy…
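To make the harder definition concrete, here is a minimal sketch of the loop such a manager needs: run each task only after its dependencies are done, and retry failures without human help. It is not Rocoto's implementation; `Task`, `run_workflow`, `max_tries`, and the `flaky` helper are hypothetical names invented for this illustration, and only task-on-task dependencies are modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    action: Callable[[], bool]            # returns True on success
    deps: List[str] = field(default_factory=list)
    max_tries: int = 3                    # hypothetical retry limit
    tries: int = 0
    state: str = "pending"                # pending | done | dead


def run_pass(tasks: Dict[str, Task]) -> None:
    """One scheduling pass: try every pending task whose dependencies are done."""
    for task in tasks.values():
        if task.state != "pending":
            continue
        if any(tasks[d].state != "done" for d in task.deps):
            continue
        task.tries += 1
        if task.action():
            task.state = "done"
        elif task.tries >= task.max_tries:
            task.state = "dead"           # retries exhausted


def run_workflow(tasks: Dict[str, Task], max_passes: int = 20) -> bool:
    """Repeat passes until every task is done or a task has run out of retries."""
    for _ in range(max_passes):
        run_pass(tasks)
        if all(t.state == "done" for t in tasks.values()):
            return True
        if any(t.state == "dead" for t in tasks.values()):
            return False
    return False


def flaky(failures: int) -> Callable[[], bool]:
    """Build an action that fails `failures` times, then succeeds."""
    remaining = [failures]

    def action() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return action


def example_workflow() -> Dict[str, Task]:
    """prep -> model -> post, where the model step fails twice before succeeding."""
    return {
        "prep": Task("prep", flaky(0)),
        "model": Task("model", flaky(2), deps=["prep"]),
        "post": Task("post", flaky(0), deps=["model"]),
    }
```

Calling `run_workflow(example_workflow())` completes the chain even though the model step fails twice, because each pass simply retries whatever is still pending. In practice each pass would be triggered by a scheduler such as cron rather than by a loop.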

Rocoto

• Supports weather and climate community modeling paradigms

• Runs in user‐space

• Portable across many different batch systems

– Moab/Torque, LSF, Grid Engine, SLURM

Rocoto manages almost all work by the Development Testbed Center: http://www.dtcenter.org/


Rocoto – Key Features

• Real‐time and retrospective modes

• Fault Tolerance

• Complex dependencies based on Time, File and Task

• Generic and portable batch specifications

• Multi‐threaded job submission

• Workflow throttling

• Meta‐tasks conveniently describe multiple similar tasks
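Two of the features above, meta-tasks and workflow throttling, are easy to picture in code. The sketch below is an illustration only and does not reproduce Rocoto's own syntax or internals: the function names and the `#var#` placeholder convention used here are assumptions of this example.

```python
from typing import List


def expand_metatask(template: str, var: str, values: List[str]) -> List[str]:
    """Expand one task-name template into one concrete task name per value."""
    return [template.replace(f"#{var}#", value) for value in values]


def throttle(ready: List[str], running: int, max_running: int) -> List[str]:
    """Return the ready tasks that can be submitted without exceeding the cap."""
    slots = max(0, max_running - running)
    return ready[:slots]


# One post-processing task per forecast hour, 00 through 15.
forecast_hours = [f"{hour:02d}" for hour in range(16)]
post_tasks = expand_metatask("post_#fhr#", "fhr", forecast_hours)

# With 3 tasks already running and a cap of 5, only 2 more are submitted.
to_submit = throttle(post_tasks, running=3, max_running=5)
```

A single template stands in for sixteen near-identical task definitions, and the throttle keeps one workflow from flooding the batch queue.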

Sites Running Rocoto

• NCAR (LSF)

• NOAA (Moab/Torque)

• NOAA/ORNL (Moab/Torque)

• NOAA (Moab/Torque)

• NOAA/WCOSS (LSF)

• U. of Miami (LSF)

• Coastal Carolina U. (SLURM)

• U. of Wisconsin (SLURM, Grid Engine)

• Presidency of Meteorology and Environment, Saudi Arabia (Torque)

• U. of Maryland (SLURM)

• NREL (Moab/Torque)

• Thomas J. Watson Research Center, IBM (SLURM)

• IBM Research Laboratory, China (SLURM)

• U. of Colorado at Boulder (SLURM)

A Typical Workflow

[Figure: workflow diagram. Stages shown: Input Data, Pre-processing, Model, Output Data, Post-processing, Grid Interpolation, Verification, Graphics. Annotations: one set of post tasks per output file; one to many cores; one to several cores; many output files.]

CASE: High Resolution Rapid Refresh

• 15 hour forecast, runs every hour

• 3 km resolution

• Continental U.S. domain

• Used in Aviation, Severe Weather, Renewable Energy, Forecasting

• Up to 263 different tasks per run

– Data Preparation

– Data Assimilation

– Model Execution

– Post Processing and Visualization


• Dependency trees vary depending on start time

• Uses meta‐tasks to describe each forecast hour

• Complex dependencies allow workflow to advance in absence of timely data arrival

HRRR was transitioned to operations at the National Weather Service in September 2014
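One way a dependency can let a workflow advance when data is late is to combine a file dependency with a time dependency: start when the input arrives, or when a wait limit after the cycle time has passed. The sketch below shows that idea only; the function names, the 45-minute limit, and the file name are hypothetical and are not taken from the HRRR workflow.

```python
import os
from datetime import datetime, timedelta


def file_ready(path: str) -> bool:
    """File dependency: satisfied once the input exists."""
    return os.path.exists(path)


def wait_expired(cycle: datetime, max_wait: timedelta, now: datetime) -> bool:
    """Time dependency: satisfied once we have waited long enough past the cycle time."""
    return now >= cycle + max_wait


def can_start(path: str, cycle: datetime, max_wait: timedelta, now: datetime) -> bool:
    """Start when the data has arrived, or proceed without it after the wait limit."""
    return file_ready(path) or wait_expired(cycle, max_wait, now)


cycle = datetime(2014, 9, 1, 0, 0)
limit = timedelta(minutes=45)          # hypothetical wait limit
missing = "no_such_observation_file.bufr"

early = can_start(missing, cycle, limit, now=cycle + timedelta(minutes=30))
late = can_start(missing, cycle, limit, now=cycle + timedelta(minutes=45))
```

Here `early` is false because the file is absent and the limit has not been reached, while `late` is true because the limit has passed. The task that runs in the second case must of course be written to cope with the missing input.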


Distributed, Highly‐Available CRON Services

Craig Tierney

Nathan Dauchy


Why we use CRON

• Weather forecasting is driven by the clock!

• Model cycles start every 1‐6 hours

• Workflow management scripts run every 1‐5 minutes

• Input/output data pull/push/sync

• Systems management scripts

D‐CRON – Distributed CRON System

• Provide a unified crontab across the system

• Distribute cron tasks across multiple systems

• Peer‐to‐peer reliability daemon

• Functionality is transparent to the users

D‐CRON Secondary Benefits

• Fewer help tickets from users asking why their workflows did not start or complete

• No more questions about “lost” crontabs

• Operations staff no longer need to monitor and maintain individual front‐end nodes


How D‐CRON Works

User creates a crontab entry.

# Realtime FIM run
0-57/3 * * * * rocotorun -w FIMG8UJET.xml -d FIMG8UJET.db

The user crontab is transparently modified to work with the D‐CRON system
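The minute field `0-57/3` in the example entry means "every third minute from 0 through 57", which matches the 1 to 5 minute cadence for workflow management scripts. The helper below is a small sketch, written for this note and not part of cron or D-CRON, that expands such a field so the schedule can be checked by hand. It handles `*`, ranges, lists, and steps only.

```python
from typing import List


def expand_cron_field(field: str, lo: int, hi: int) -> List[int]:
    """Expand one cron field such as '0-57/3', '*/5' or '1,15' into sorted values."""
    values = set()
    for part in field.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            first, last = rng.split("-")
            start, end = int(first), int(last)
        else:
            start = end = int(rng)
        values.update(range(start, end + 1, step))
    return sorted(values)


minutes = expand_cron_field("0-57/3", 0, 59)   # minute field of the entry above
hours = expand_cron_field("*", 0, 23)          # hour field of the entry above
runs_per_day = len(minutes) * len(hours)
```

The entry fires 20 times per hour, so this one workflow accounts for 480 cron launches per day.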

Prior to each scheduling iteration, the status of all service nodes is checked.

Service nodes: Service1, Service2, Service3, …, ServiceN

Work is distributed to all service nodes.

A hash function is used to determine which node does the work.

The local CRON daemon executes the work on a single node.

Work will always be scheduled on the same node unless there is an issue with that service node.
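The slides do not say which hash function D-CRON uses. As one illustration of how a hash can keep a job on the same node until that node fails, the sketch below uses rendezvous (highest-random-weight) hashing: every node scores the job, and the healthy node with the highest score runs it. The function names and the job key format are assumptions made for this example.

```python
import hashlib
from typing import Iterable, Optional


def score(node: str, job_key: str) -> int:
    """Deterministic score for a (node, job) pair."""
    digest = hashlib.sha256(f"{node}|{job_key}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def pick_node(job_key: str, healthy_nodes: Iterable[str]) -> Optional[str]:
    """Return the healthy node with the highest score for this job, or None."""
    nodes = list(healthy_nodes)
    if not nodes:
        return None
    return max(nodes, key=lambda node: score(node, job_key))


nodes = ["service1", "service2", "service3", "service4"]
job = "user:0-57/3 * * * * rocotorun -w FIMG8UJET.xml -d FIMG8UJET.db"

owner = pick_node(job, nodes)
# If the owner is reported unhealthy, the job moves to the next-best node.
fallback = pick_node(job, [n for n in nodes if n != owner])
```

Because every node can compute the same scores from the same health list, each one can decide locally whether to execute the entry, and a job only moves when its own node drops out of the healthy set.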


How D‐CRON is used

• 81,522 CRON tasks launched daily on Jet

– versus 48,140 batch jobs (Sept. 2014)

• 123,785 CRON tasks launched daily on Zeus

– versus 80,239 batch jobs (Sept. 2014)
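To put these counts in perspective, the short calculation below converts them to launches per minute and to cron launches per batch job. It assumes the batch-job figures are daily counts comparable to the cron figures, which the slide does not state explicitly.

```python
MINUTES_PER_DAY = 24 * 60

usage = {
    "Jet": {"cron_tasks": 81522, "batch_jobs": 48140},
    "Zeus": {"cron_tasks": 123785, "batch_jobs": 80239},
}


def summarize(cron_tasks: int, batch_jobs: int) -> dict:
    """Derive per-minute launch rate and the cron-to-batch ratio."""
    return {
        "cron_per_minute": cron_tasks / MINUTES_PER_DAY,
        "cron_per_batch_job": cron_tasks / batch_jobs,
    }


summary = {system: summarize(**counts) for system, counts in usage.items()}
```

Under that assumption, cron launches run at roughly 57 per minute on Jet and 86 per minute on Zeus, and outnumber batch jobs by a factor of about 1.7 and 1.5 respectively.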


Guaranteeing Resources for Real‐Time Experiments

Craig Tierney, CIRES

Christopher Harrop, CIRES

Standing Reservations

• Pre‐allocated blocks of the system that guarantee availability

• Finite reservation

– Can be released by the user when not needed

• Infinite reservation

Typical Standing Reservations

[Figure: reservation blocks drawn as nodes versus time, starting from an epoch. Pre‐Processing: 1 to several cores. Model Run(s): 10s‐1000s of cores. Post‐Processing: 1 to several cores.]

Infinite Reservations (IR)

• No end time

• Required when models cannot cold‐start

• On our system, IRs are stressful

– Cause problems with the scheduler

– Often block non‐realtime jobs from using idle reserved resources

• In 2014, we moved to a system based on preemption

– Reduced stress on the system

– Allowed for more non‐realtime work

Page 45: ECMWF 2014 Craig Tierney final1 · NCAR (LSF) NOAA (Moab/Torque) NOAA/ORNL (Moab/Torque) NOAA (Moab/Torque) NOAA/WCOSS (LSF) Sites Running Rocoto U. of Miami (LSF) Coastal Carolina

0

2

4

6

8

10

12

0

10

20

30

40

50

60

70

80

90

100

0 2 4 6 8 10 12 14 16 18 20 22

Bac

klog

(Hou

rs)

Util

izat

ion

(pct

)

Simulation Time (hours)Util, NoRes Backlog,NoRes

Page 46: ECMWF 2014 Craig Tierney final1 · NCAR (LSF) NOAA (Moab/Torque) NOAA/ORNL (Moab/Torque) NOAA (Moab/Torque) NOAA/WCOSS (LSF) Sites Running Rocoto U. of Miami (LSF) Coastal Carolina

0

2

4

6

8

10

12

0

10

20

30

40

50

60

70

80

90

100

0 2 4 6 8 10 12 14 16 18 20 22

Bac

klog

(Hou

rs)

Util

izat

ion

(pct

)

Simulation Time (hours)Util, NoRes Res Usage Backlog,NoRes

Page 47: ECMWF 2014 Craig Tierney final1 · NCAR (LSF) NOAA (Moab/Torque) NOAA/ORNL (Moab/Torque) NOAA (Moab/Torque) NOAA/WCOSS (LSF) Sites Running Rocoto U. of Miami (LSF) Coastal Carolina

0

2

4

6

8

10

12

0

10

20

30

40

50

60

70

80

90

100

0 2 4 6 8 10 12 14 16 18 20 22

Bac

klog

(Hou

rs)

Util

izat

ion

(pct

)

Simulation Time (hours)Util, NoRes Util,WithRes Res Usage Backlog,NoRes Backlog,WithRes

Page 48: ECMWF 2014 Craig Tierney final1 · NCAR (LSF) NOAA (Moab/Torque) NOAA/ORNL (Moab/Torque) NOAA (Moab/Torque) NOAA/WCOSS (LSF) Sites Running Rocoto U. of Miami (LSF) Coastal Carolina

0

2

4

6

8

10

12

0

10

20

30

40

50

60

70

80

90

100

0 2 4 6 8 10 12 14 16 18 20 22

Bac

klog

(Hou

rs)

Util

izat

ion

(pct

)

Simulation Time (hours)Util, NoRes Util,WithRes Res Usage Backlog,NoRes Backlog,WithRes

When there are reservations utilization drops because no jobs can be backfilled just before the reservation starts.

Page 49: ECMWF 2014 Craig Tierney final1 · NCAR (LSF) NOAA (Moab/Torque) NOAA/ORNL (Moab/Torque) NOAA (Moab/Torque) NOAA/WCOSS (LSF) Sites Running Rocoto U. of Miami (LSF) Coastal Carolina

0

2

4

6

8

10

12

0

10

20

30

40

50

60

70

80

90

100

0 2 4 6 8 10 12 14 16 18 20 22

Bac

klog

(Hou

rs)

Util

izat

ion

(pct

)

Simulation Time (hours)Util, NoRes Util,WithRes Res Usage Backlog,NoRes Backlog,WithRes

Backlog is larger with reservations, especially during the reservation and afterwards as the system tries to drain the backlog.
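The utilization dip can be reproduced with a very small model of backfill. In the sketch below, a queued job may start only if it fits in the free nodes and its requested walltime ends before the reservation begins. This is a simplified illustration, not the simulator behind the plot: it assumes every free node belongs to the upcoming reservation, and the job list is invented.

```python
from typing import List, Tuple

Job = Tuple[str, int, float]   # (name, nodes, requested walltime in hours)


def backfill(queue: List[Job], free_nodes: int, now: float,
             reservation_start: float) -> Tuple[List[str], int]:
    """Start queued jobs that fit the free nodes and finish before the reservation.

    Returns the names of the started jobs and the number of nodes left idle.
    """
    started = []
    for name, nodes, walltime in queue:
        fits_nodes = nodes <= free_nodes
        fits_time = now + walltime <= reservation_start
        if fits_nodes and fits_time:
            started.append(name)
            free_nodes -= nodes
    return started, free_nodes


queue = [("a", 4, 6.0), ("b", 2, 0.5), ("c", 4, 2.0)]

# One hour before a reservation, only the short job can run.
with_res = backfill(queue, free_nodes=8, now=10.0, reservation_start=11.0)

# With no reservation ahead, the same nodes are nearly filled.
no_res = backfill(queue, free_nodes=8, now=10.0, reservation_start=float("inf"))
```

With the reservation one hour away, only job `b` starts and 6 of 8 nodes sit idle; without it, jobs `a` and `b` start and 2 nodes are idle. The jobs that could not start are what accumulates as backlog.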


Current Reservation Usage

• 2014 Hurricane Season

– 25196 total cores

– 105 reservations per day, 50% of total core hours

– Maximum of 8332 cores available via preemption (33% of available core hours)

83% of total resources under reservation/preemption
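The 83% figure follows from the two numbers above it. The check below treats the preemptible share of cores as the same share of core hours, as the slide does; the variable names are chosen for this example.

```python
total_cores = 25196
preemptible_cores = 8332
reserved_fraction = 0.50       # share of total core hours under reservation

# 8332 of 25196 cores is about one third of the machine.
preemptible_fraction = preemptible_cores / total_cores

# Reservations plus preemption together cover the quoted share.
combined_fraction = reserved_fraction + preemptible_fraction

preemptible_pct = round(preemptible_fraction * 100)
combined_pct = round(combined_fraction * 100)
```

Rounded to whole percentages this gives 33% available via preemption and 83% under reservation or preemption, matching the slide.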

Summary

• Portable and resilient workflow management allows us to reliably complete experiments

• Extending CRON services to be distributed improves fault‐tolerance and reduces support requirements

• Using standing reservations allows real‐time experiments to reliably finish in traditional R&D HPC environments

Contact: [email protected]

Rocoto: http://rdhpcs.noaa.gov/rocoto/

[email protected]


Backup Slides