
Page 1: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 1

An Inconvenient Question: Are We Going to Get the Algorithms and Computing Technology We Need to Make Critical Climate Predictions in Time?

Rich Loft
Director, Technology Development
Computational and Information Systems Laboratory
National Center for Atmospheric Research
[email protected]

Page 2: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

Main Points

• The nature of the climate system makes it a grand challenge computing problem.
• We are at a critical juncture: we need regional climate prediction capabilities!
• Computer clock/thread speeds are stalled: massive parallelism is the future of supercomputing.
• Our best algorithms, parallelization strategies and architectures are inadequate to the task.
• We need model acceleration improvements in all three areas if we are to meet the challenge.

11/18/08 2

Page 3: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

Options for Application Acceleration

• Scalability
  – Eliminate bottlenecks
  – Find more parallelism
  – Load-balancing algorithms
• Algorithmic Acceleration
  – Bigger timesteps
    • Semi-Lagrangian transport
    • Implicit or semi-implicit time integration (solvers)
  – Fewer points
    • Adaptive mesh refinement methods
• Hardware Acceleration
  – More threads
    • CMP, GP-GPUs
  – Faster threads
    • Device innovations (high-K)
  – Smarter threads
    • Architecture: old tricks, new tricks… magic tricks
    • Vector units, GPUs, FPGAs

11/18/08 3

Page 4: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 4

Viner (2002)

A Very Grand Challenge: Coupled Models of the Earth System

Typical model computation:
- 15-minute time steps
- 1 peta-flop per model year
- There are 3.5 million timesteps in a century

[Figure: ~150 km grid cell, with an air column above a water column]
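A quick check of the timestep count (assuming 15-minute steps and 365-day years):

\[
\underbrace{100}_{\text{years}}\times\underbrace{365}_{\text{days}}\times\underbrace{96}_{\text{15-min steps/day}}\;\approx\;3.5\times10^{6}\ \text{timesteps per century.}
\]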

Page 5: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 5

Multicomponent Earth System Model

Components: Atmosphere, Ocean, Sea Ice, Land, and Coupler, plus C/N cycle, dynamic vegetation, ecosystem & BGC, gas chemistry, prognostic aerosols, upper atmosphere, land use, and ice sheets.

Software Challenges:
• Increasing complexity
• Validation and verification
• Understanding the output

Key concept: A flexible coupling framework is critical!

Page 6: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

Climate Change

Credit: Caspar Ammann, NCAR

11/18/08 6

Page 7: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 7

o IPCC AR4: “Warming of the climate system is unequivocal” …

o …and it is “very likely” caused by human activities.

o Most of the observed changes over the past 50 years are now simulated by climate models, adding confidence to future projections.

o Model Resolutions: O(100 km)

IPCC AR4 - 2007

Page 8: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

Climate Change Research Epochs

Before IPCC AR4 (2007), curiosity driven:
• Reproduce historical trends
• Investigate climate change
• Run IPCC scenarios

After IPCC AR4 (2007), policy driven:
• Assess regional impacts
• Simulate adaptation strategies
• Simulate geoengineering solutions

11/18/08 8

Page 9: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 9

ESSL - The Earth & Sun Systems Laboratory

Where we want to go: the Exascale Earth System Model Vision, a coupled ocean-land-atmosphere model:
• Atmosphere: ~1 km x ~1 km (cloud-resolving), 100 levels, whole atmosphere, unstructured adaptive grids
• Land: ~100 m, 10 levels, landscape-resolving
• Ocean: ~10 km x ~10 km (eddy-resolving), 100 levels, unstructured adaptive grids

Requirement: computing power enhancement by as much as a factor of 10^10-10^12

YIKES!

Page 10: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

Compute Factors for an Ultra-High-Resolution Earth System Model

11/18/08 10

Factor                    Purpose                       Multiplier
Spatial resolution        Provide regional details      10^3-10^5
Model completeness        Add "new" science             10^2
New parameterizations     Upgrade to "better" science   10^2
Run length                Long-term implications        10^2
Ensembles, scenarios      Range of model variability    10
Total compute factor                                    10^10-10^12

(courtesy of John Drake, ORNL)
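The total is simply the product of the row entries:

\[
(10^{3}\text{--}10^{5})\times 10^{2}\times 10^{2}\times 10^{2}\times 10 \;=\; 10^{10}\text{--}10^{12}.
\]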

Page 11: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

Why run length: the global thermohaline circulation timescale is ~3,000 years

11/18/08 11

Page 12: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 12

Why resolution: atmospheric convective (cloud) scales are O(1 km)

Page 13: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 13

Why High Resolution in the Ocean?

[Figure: ocean component of CCSM at 1° (Collins et al., 2006) vs. eddy-resolving POP at 0.1° (Maltrud & McClean, 2005)]

Page 14: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 14

High Resolution and the Land Surface

Page 15: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 15

Performance Improvements are not coming fast enough!

…suggests the 10^10 to 10^12 improvement will take ~40 years
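A rough way to recover that figure, assuming (my assumption, not the slide's) that sustained performance keeps doubling roughly every 13-15 months, near the historical Top 500 rate:

\[
\log_{2}(10^{10})\approx 33,\qquad \log_{2}(10^{12})\approx 40,
\]

so 33-40 doublings at 13-15 months each is roughly 35-50 years, on the order of the ~40 years quoted above.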

Page 16: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

ITRS Roadmap: feature size is dropping ~14%/year.

By 2050 it reaches the size of an atom - oops!
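As a rough illustration (assuming a ~45 nm feature size in 2008, which is my assumption, not from the slide), shrinking 14% per year gives

\[
45\ \text{nm}\times(0.86)^{42}\approx 0.08\ \text{nm}
\]

by 2050, below the ~0.1-0.5 nm scale of individual atoms.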

11/18/08 16

Page 17: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 17

National Security Agency: "The power consumption of today's advanced computing systems is rapidly becoming the limiting factor with respect to improved/increased computational ability."

Page 18: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 18

Chip Level Trends: Stagnant Clock Speed

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

• Chip density continues to increase ~2x every 2 years
  – Clock speed is not
  – The number of cores is doubling instead
• There is little or no additional hidden parallelism (ILP)
• Parallelism must be exploited by software

Page 19: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 19

Moore’s Law -> More’s Law: speed-up through increasing parallelism

How long can we keep doubling the number of cores per chip?

Page 20: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 20

NCAR and the University of Colorado Partner to Experiment with Blue Gene/L

Dr. Henry Tufo and me with “frost” (2005)

Characteristics:
• 2048 processors / 5.7 TF
• PPC 440 (750 MHz)
• Two processors per node
• 512 MB memory per node
• 6 TB file system

Page 21: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 21

Status and immediate plans for high resolution Earth System Modeling

Page 22: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 22

Current high-resolution CCSM runs (CPU-hours arithmetic sketched below)

• 0.25 ATM,LND + 0.1 OCN,ICE [ATLAS/LLNL]
  – 3280 processors
  – 0.42 simulated years/day (SYPD)
  – 187K CPU hours/year
• 0.50 ATM,LND + 0.1 OCN,ICE [FRANKLIN/NERSC]
  – Current:
    • 5416 processors
    • 1.31 SYPD
    • 99K CPU hours/year
  – “Efficiency goal”:
    • 4932 processors
    • 1.80 SYPD
    • 66K CPU hours/year
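The CPU-hours figures follow directly from the processor counts and throughputs above. A minimal sketch of that bookkeeping (illustrative only, not the actual CCSM accounting code; the function name is mine):

# Relates throughput (simulated years per day, SYPD) to cost
# (CPU hours per simulated year): nprocs held for 24 wall-clock
# hours produce `sypd` simulated years.
def cpu_hours_per_sim_year(nprocs, sypd):
    return nprocs * 24.0 / sypd

# Figures from the slide:
for name, nprocs, sypd in [
    ("0.25 ATM + 0.1 OCN [ATLAS]",    3280, 0.42),
    ("0.50 ATM + 0.1 OCN [FRANKLIN]", 5416, 1.31),
    ("0.50 efficiency goal",          4932, 1.80),
]:
    print(f"{name}: {cpu_hours_per_sim_year(nprocs, sypd):,.0f} CPU hours/year")
# -> roughly 187K, 99K, and 66K CPU hours/year, matching the bullets above.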

Page 23: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 23

Current 0.5 CCSM “fuel efficient” configuration [franklin]

5416 processors, 168 sec. total:
• OCN [np=3600]: 120 sec.
• ATM [np=1664]: 52 sec.
• CPL [np=384]: 21 sec.
• LND [np=16]
• ICE [np=1800]: 91 sec.

Page 24: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 24

Efficiency issues in current 0.5 CCSM configuration: ocean component (120 sec.)

Page 25: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 25

Load Balancing: Partitioning with Space Filling Curves

Partition for 3 processors

Page 26: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 26

Space-Filling Curve Partitioning for the Ocean Model running on 8 processors

Key concept: no need to compute over land!

Static Load Balancing…
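A minimal sketch of the idea (using a Morton/Z-order curve here, which may differ from the curve actually used in the ocean model; the helper names are mine): order the ocean blocks along the curve, skip land-only blocks, then cut the curve into equal contiguous pieces, one per processor.

import numpy as np

def morton_index(i, j, bits=8):
    # Interleave the bits of (i, j) to get a Z-order (space-filling curve) position.
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (2 * b + 1)
        idx |= ((j >> b) & 1) << (2 * b)
    return idx

def partition_ocean(mask, nprocs):
    # mask[i, j] is True for ocean blocks; land blocks are simply never assigned.
    ocean = [(i, j) for i in range(mask.shape[0])
                    for j in range(mask.shape[1]) if mask[i, j]]
    ocean.sort(key=lambda ij: morton_index(*ij))             # walk the curve
    chunks = np.array_split(np.arange(len(ocean)), nprocs)   # equal work per rank
    return {rank: [ocean[k] for k in chunk] for rank, chunk in enumerate(chunks)}

# Example: an 8x8 block grid with a land strip down the middle.
mask = np.ones((8, 8), dtype=bool)
mask[:, 3:5] = False
parts = partition_ocean(mask, nprocs=8)
print([len(blocks) for blocks in parts.values()])   # 6 ocean blocks per rank, land excluded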

Page 27: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 27

Ocean Model 1/10 Degree performance

Key concept: You need routine access to > 1k procs to discover true scaling behaviour!

Page 28: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 28

Efficiency issues in current 0.5 CCSM configuration: land [np=16] and sea ice [np=1800] (91 sec.)

Page 29: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 29

Static, Weighted Load Balancing Example: Sea Ice Model CICE4 @ 1° on 20 processors

Small domains @ high latitudes; large domains @ low latitudes

Courtesy of John Dennis

Page 30: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 30

Efficiency issues in current 0.5 CCSM configuration: coupler

CPL [np=384]: 21 sec.

Unresolved scalability issues in the coupler. Options: better interconnect, nested grids, PGAS language paradigm.

Page 31: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 31

Efficiency issues in current 0.5 CCSM configuration: atmospheric component

ATM [np=1664]: 52 sec.

Scalability limitation in 0.5° fv-CAM [MPI]: shift to the hybrid OpenMP/MPI version.

Page 32: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 32

Projected 0.5 CCSM “capability” configuration: 3.8 years/day

19,460 processors, 62 sec. total:
• OCN [np=6100]: 62 sec.
• ATM [np=5200]: 31 sec.
• CPL [np=384]: 21 sec.
• LND [np=40]
• ICE [np=8120]: 10 sec.

Action: run the hybrid atmospheric model

Page 33: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 33

Projected 0.5 CCSM “capability” configuration, version 2: 3.8 years/day

14,260 processors, 62 sec. total:
• OCN [np=6100]: 62 sec.
• ATM [np=5200]: 31 sec.
• CPL [np=384]: 21 sec.
• LND [np=40]
• ICE [np=8120]: 10 sec.

Action: thread the ice model

Page 34: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 34

Scalable Geometry Choice: Cube-Sphere

• Sphere is decomposed into 6 identical regions using a central projection (Sadourny, 1972) with equiangular grid (Rancic et al., 1996).

• Avoids pole problems, quasi-uniform.

• Non-orthogonal curvilinear coordinate system with identical metric terms

[Figure: Ne=16 cubed sphere, showing the degree of non-uniformity]
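For reference, the equiangular mapping on each face (following Rancic et al., 1996) can be written with central angles \(\alpha,\beta\):

\[
X=\tan\alpha,\qquad Y=\tan\beta,\qquad \alpha,\beta\in\left[-\tfrac{\pi}{4},\tfrac{\pi}{4}\right],
\]

with gridlines equally spaced in \(\alpha\) and \(\beta\) rather than in \(X\) and \(Y\), which is what keeps the cell sizes quasi-uniform.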

Page 35: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 35

Scalable Numerical Method: High-Order Methods

• Algorithmic advantages of high-order methods
  – h-p element-based method on quadrilaterals (Ne x Ne)
  – Exponential convergence in polynomial degree (N)
• Computational advantages of high-order methods
  – Naturally cache-blocked N x N computations
  – Nearest-neighbor communication between elements (explicit)
  – Well suited to parallel microprocessor systems

Page 36: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 36

HOMME: Computational Mesh

• Elements:
  – A quadrilateral “patch” of N x N gridpoints
  – Gauss-Lobatto grid
  – Typically N = 4-8
• Cube:
  – Ne = elements on an edge
  – 6 x Ne x Ne elements total (see the sketch below)
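A small sketch of this bookkeeping (illustrative only, not HOMME's grid code; it uses NumPy's Legendre routines), computing the Gauss-Lobatto-Legendre points in an element and the mesh counts for Ne = 16, N = 4:

import numpy as np
import numpy.polynomial.legendre as leg

def gll_points(n):
    # n Gauss-Lobatto-Legendre points on [-1, 1]:
    # the endpoints plus the roots of P'_{n-1}.
    coeffs = np.zeros(n)
    coeffs[-1] = 1.0                       # Legendre polynomial P_{n-1}
    interior = leg.legroots(leg.legder(coeffs))
    return np.concatenate(([-1.0], interior, [1.0]))

ne, n = 16, 4                              # elements per cube edge, points per element edge
print(gll_points(n))                       # [-1, -0.447, 0.447, 1]
print("elements:", 6 * ne * ne)            # 6 x Ne x Ne = 1536
print("unique gridpoints:", 6 * ne * ne * (n - 1) ** 2 + 2)  # shared element edges counted once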

Page 37: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 37

Partitioning a cube-sphere on 8 processors

Page 38: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 38

Partitioning a cubed-sphere on 8 processors

Page 39: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 39

Aqua-Planet CAM/HOMME Dycore

• Full CAM physics / HOMME dycore
• Parallel I/O library used for physics aerosol input and input data (this work could not have been done without parallel I/O)
• Work underway to couple to other CCSM components
• 5 years/day

Page 40: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 40

Projected 0.25 CCSM “capability” configuration, version 2: 4.0 years/day

30,000 processors, 60 sec. total:
• OCN [np=6000]: 60 sec.
• HOMME ATM [np=24000]: 47 sec.
• CPL [np=3840]: 8 sec.
• LND [np=320]
• ICE [np=16240]: 5 sec.

Action: insert the scalable atmospheric dycore

Page 41: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 41

Using a bigger parallel machine can't be the only answer

• Progress in the Top 500 list is not fast enough
• Amdahl's Law is a formidable opponent
• The dynamical timestep goes like N^-1
  – Merciless effect of the Courant limit
  – The cost of dynamics relative to physics increases as N
  – e.g. if dynamics takes 20% of the time at 25 km, it will take 86% at 1 km (worked below)
• Traditional parallelization of the horizontal leaves N^2 per-thread cost (vertical x horizontal)
  – Must inevitably slow down with stalled thread speeds
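One way to recover the 86% figure (assuming, as the slide argues, that the cost of dynamics relative to physics grows linearly with the resolution factor, here 25 km / 1 km = 25): if dynamics is 20% of the cost at 25 km, then at 1 km

\[
\frac{25\times 0.20}{25\times 0.20 + 0.80}=\frac{5}{5.8}\approx 0.86,
\]

i.e. about 86% of the time is spent in dynamics.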

Page 42: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

Options for Application Acceleration

• Scalability
  – Eliminate bottlenecks
  – Find more parallelism
  – Load-balancing algorithms
• Algorithmic Acceleration
  – Bigger timesteps
    • Semi-Lagrangian transport
    • Implicit or semi-implicit time integration (solvers)
  – Fewer points
    • Adaptive mesh refinement methods
• Hardware Acceleration
  – More threads
    • CMP, GP-GPUs
  – Faster threads
    • Device innovations (high-K)
  – Smarter threads
    • Architecture: old tricks, new tricks… magic tricks
    • Vector units, GPUs, FPGAs

11/18/08 42

Page 43: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 43

Accelerator Research

• Graphics cards: Nvidia 9800 / CUDA
  – Measured 109x on WRF microphysics on a 9800GX2
• FPGA: Xilinx (data-flow model)
  – 21.7x simulated on the sw-radiation code
• IBM Cell processor (8 cores)
• Intel Larrabee

Page 44: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 44

DG+NH+AMR

• Curvilinear elements
• Overhead of parallel AMR at each time-step: less than 1%

Idea based on Fischer, Kruse, Loth (02)

Courtesy of Amik St. Cyr

Page 45: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 45

SLIM ocean model
• Louvain-la-Neuve University
• DG, implicit, AMR, unstructured
• To be coupled to a prototype unstructured ATM model

(Courtesy of J-F Remacle, LNU)

Page 46: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

NCAR Summer Internships in Parallel Computational Science (SIParCS), 2007-2008

• Open to:
  – Upper-division undergrads
  – Graduate students
• In disciplines such as:
  – CS, software engineering
  – Applied math, statistics
  – Earth system (ES) science
• Support:
  – Travel, housing, per diem
  – 10 weeks salary
• Number of interns selected:
  – 7 in 2007
  – 11 in 2008

http://www.cisl.ucar.edu/siparcs

Page 47: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 47

Meanwhile - the clock is ticking

Page 48: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 48

The Size of the Interdisciplinary/Interagency Team Working on Climate Scalability

• Contributors:
  D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), A. St. Cyr (NCAR), J. Dennis (NCAR), J. Edwards (IBM), B. Fox-Kemper (MIT, CU), E. Hunke (LANL), B. Kadlec (CU), D. Ivanova (LLNL), E. Jedlicka (ANL), E. Jessup (CU), R. Jacob (ANL), P. Jones (LANL), S. Peacock (NCAR), K. Lindsay (NCAR), W. Lipscomb (LANL), R. Loy (ANL), J. Michalakes (NCAR), A. Mirin (LLNL), M. Maltrud (LANL), J. McClean (LLNL), R. Nair (NCAR), M. Norman (NCSU), T. Qian (NCAR), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), P. Worley (ORNL), M. Zhang (SUNYSB)

• Funding:
  – DOE-BER CCPP Program Grant
    • DE-FC03-97ER62402
    • DE-PS02-07ER07-06
    • DE-FC02-07ER64340
    • B&R KP1206000
  – DOE-ASCR
    • B&R KJ0101030
  – NSF Cooperative Grant NSF01
  – NSF PetaApps Award

• Computer time:
  – Blue Gene/L time: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson), LLNL, Stony Brook & BNL
  – Cray XT3/4 time: ORNL, Sandia

Page 49: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 49

Thanks! Any Questions?

Page 50: Rich Loft Director, Technology Development Computational and Information Systems Laboratory

11/18/08 50

Q. If you had a petascale computer, what would you do with it?

A. Use it as a prototype of an exascale computer.