
CGAM: Running the Met Office Unified Model on HPCx

Paul Burton

CGAM, University of Reading
[email protected]

www.cgam.nerc.ac.uk/~paul

Overview

• CGAM: Who, what, why and how

• The Met Office Unified Model

• Ensemble Climate Models

• High Resolution Climate Models

• Unified Model Performance

• Future Challenges and Directions

Who is CGAM?

[Diagram] CGAM (the Centre for Global Atmospheric Modelling) sits within the NERC Centres for Atmospheric Science, shown alongside related NERC centres and facilities:

• Atmospheric Chemistry Modelling Support Unit
• Universities' Weather and Environment Research Network
• Distributed Institute for Atmospheric Composition
• British Atmospheric Data Centre
• University Facilities for Atmospheric Measurement
• Facility for Airborne Atmospheric Measurements
• Data Assimilation Research Centre
• British Geological Survey
• Centre for Ecology and Hydrology
• Proudman Oceanographic Laboratory
• Southampton Oceanography Centre
• Centre for Terrestrial Carbon Dynamics
• Environmental Systems Science Centre
• British Antarctic Survey
• Tyndall Centre for Climate Change Research
• National Institute for Environmental e-Science
• Centre for Polar Observations and Modelling

What does CGAM do?

• Climate Science
  – UK centre of expertise for climate science
  – Lead UK research in climate science
  – Understand and simulate the highly non-linear dynamics and feedbacks of the climate system
  – Earth System Modelling
  – From seasonal to 100s of years
  – Close links to the Met Office
• Computational Science
  – Support scientists using the Unified Model
  – Porting and optimisation
  – Development of new tools

Why does CGAM exist?

• Will there be an El Niño this year?
  – How severe will it be?
• Are we seeing increases in extreme weather events in the UK?
  – 2000 autumn floods
  – Drought?

• Will the milder winters of the last decade continue?

• Can we reproduce and understand past abrupt changes in climate?

How does CGAM answer such questions?

• Models are our laboratory
  – Investigate predictability
  – Explore forcings and feedbacks
  – Test hypotheses

Met Office Unified Model

• Standardise on using a single model
• Met Office's Hadley Centre recognised as a world leader in climate research
• Two-way collaboration with the Met Office
• Very flexible model
  – Forecast
  – Climate
  – Global or Limited Area
  – Coupled ocean model
  – Easy configuration via a GUI
  – User-configurable diagnostic output

Unified Model: Technical Details

• Climate configuration uses "old" vn4.5
  – vn5 has an updated dynamical core
  – Next-generation "HadGEM" climate configuration will use this
• Grid-point model
  – Regular latitude/longitude grid
• Dynamics
  – Split-explicit finite-difference scheme
  – Diffusion and polar filtering
• Physical Parameterisation
  – Almost all constrained to a vertical column

Unified Model: Parallelisation

• Domain decomposition (see the sketch below)
  – Atmosphere: 2D regular decomposition
  – Ocean: 1D (latitude) decomposition
• GCOM library for communications
  – Interface to a selectable communications library: MPI, SHMEM, ???
  – Basic communication primitives
  – Specialised communications for the UM
• Communication patterns
  – Halo update (SWAPBOUNDS)
  – Gather/scatter
  – Global/partial summations
• Designed/optimised for the Cray T3E!
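The UM's communications are implemented in Fortran through GCOM; purely as an illustration of the 2D regular decomposition and SWAPBOUNDS-style halo update listed above, here is a minimal sketch using mpi4py and NumPy. The grid sizes, halo width and function name are assumptions, not UM or GCOM code.

```python
# Illustrative sketch: 2D regular decomposition with a one-point halo exchange,
# loosely mirroring the SWAPBOUNDS pattern. Not UM/GCOM code.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 2)            # processor grid (rows x columns)
cart = comm.Create_cart(dims, periods=[False, True])   # periodic in longitude only
lo_y, hi_y = cart.Shift(0, 1)                          # latitude neighbours
lo_x, hi_x = cart.Shift(1, 1)                          # longitude neighbours

ny, nx = 36, 48                                        # assumed local sub-domain size
field = np.zeros((ny + 2, nx + 2))                     # one-point halo all round
field[1:-1, 1:-1] = cart.Get_rank()                    # dummy interior data

def swap_bounds(f):
    """Exchange halo rows/columns with the neighbouring sub-domains."""
    # Latitude direction: boundary rows are contiguous, so views are used directly.
    cart.Sendrecv(f[1, :], dest=lo_y, recvbuf=f[-1, :], source=hi_y)
    cart.Sendrecv(f[-2, :], dest=hi_y, recvbuf=f[0, :], source=lo_y)
    # Longitude direction: columns are strided, so copy through contiguous buffers.
    recv = np.empty(f.shape[0])
    cart.Sendrecv(f[:, 1].copy(), dest=lo_x, recvbuf=recv, source=hi_x)
    f[:, -1] = recv
    cart.Sendrecv(f[:, -2].copy(), dest=hi_x, recvbuf=recv, source=lo_x)
    f[:, 0] = recv

swap_bounds(field)   # one halo update, analogous to a SWAPBOUNDS call
```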

Model Configurations

• Currently
  – HadAM3 / HadCM3
    • Low resolution (270 km: 96 x 73 x 19L)
    • Running on ~10-40 CPUs
  – Turing (T3E1200), Green (O3800), Beowulf cluster
• Over the next year
  – More of the same
  – Ensembles
    • Low resolution (HadAM3/HadCM3)
    • 10-100 members
  – High resolution
    • 90 km: 288 x 217 x 30L
    • 60 km: 432 x 325 x 40L

Ensemble Methods in Weather Forecasting

• Used operationally for many years (e.g. ECMWF)
  – Perturbed starting conditions
  – Reduced resolution
• Multi-model ensembles
  – Perturbed starting conditions
  – Different models
• Why are they used?
  – Give some indication of predictability
  – Allow objective assessment of weather-related risks
  – More chance of seeing extreme events

Climate Ensembles

• Predictability
• What confidence do we have in climate change?
• What effect do different forcings have?
  – CO2 – different scenarios
  – Volcanic eruptions
  – Deforestation
• How sensitive is the model?
  – Twiddle the knobs and see what happens
• How likely are extreme events?
  – Allows governments to take defensive action now

Ensembles Implementation

• Setup
  – Allow users to specify and design an ensemble experiment
• Runtime
  – Allow the ensemble to run as a single job on the machine, for easy management
• Analysis
  – How to view and process the vast amounts of data produced

Setup: Normal UM workflow

[Diagram] The UMUI generates a UM Job: a shell script (run via the poe executable) plus Fortran namelists. The job reads its Data (starting data and forcing data) and produces Output (diagnostics and restart data).

Setup: UM Ensemble workflow

[Diagram] A Control wrapper ("poe UM_Job") plus a Config file (N_MEMBERS=3, together with per-member Differences) sit around the normal UM Job (shell script and Fortran namelists). The run script sets $MEMBERid=… and does cd "Job.$MEMBERid", so the single job expands into Job.1, Job.2, Job.3, etc. (each with its own shell script and Fortran namelists), each reading Data.1, Data.2, Data.3, etc. (starting and forcing data) and writing Out.1, Out.2, Out.3, etc. (diagnostics and restart data). A setup sketch follows below.
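To make the setup step concrete, here is a rough sketch of how the per-member Job.N, Data.N and Out.N directories could be generated from a single UM job plus a list of differences. All names, paths and the differences format are illustrative assumptions, not the actual UMUI or ensemble tooling.

```python
# Illustrative sketch only: expand one UM job into N_MEMBERS per-member copies.
import shutil
from pathlib import Path

N_MEMBERS = 3
# Hypothetical per-member differences (e.g. perturbed starting conditions).
differences = {1: {}, 2: {"PERTURB_SEED": "17"}, 3: {"PERTURB_SEED": "42"}}

for member in range(1, N_MEMBERS + 1):
    job = Path(f"Job.{member}")
    shutil.copytree("UM_Job", job, dirs_exist_ok=True)   # shell script + Fortran namelists
    Path(f"Data.{member}").mkdir(exist_ok=True)          # starting + forcing data
    Path(f"Out.{member}").mkdir(exist_ok=True)           # diagnostics + restart data
    # Record this member's differences for its run script to pick up.
    with open(job / "member_differences.txt", "w") as fh:
        for key, value in differences[member].items():
            fh.write(f"{key}={value}\n")
```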

UM Ensemble: Runtime (1)

• "poe" called at top level – calls a "top_level_script"
  – Works out which CPU it's on
  – Hence which member it is
  – Hence which directory/model script to run
• Model scripts run in a separate directory for each member
• Each model script calls the executable (see the sketch below)
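The top-level script itself is a shell script; as a hedged illustration of the member-selection logic it performs, here is a small Python sketch. The MP_CHILD task-index variable, the member size and the run-script name are assumptions.

```python
# Illustrative sketch of the per-task launcher logic, not the actual UM script.
import os
import subprocess

CPUS_PER_MEMBER = 8                             # assumed size of each member
task_id = int(os.environ.get("MP_CHILD", "0"))  # which CPU/task am I? (assumed poe variable)
member_id = task_id // CPUS_PER_MEMBER + 1      # hence which member it belongs to
os.chdir(f"Job.{member_id}")                    # hence which directory/model script to run
subprocess.run(["./run_model"], check=True)     # the member's script calls the executable
```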

UM Ensemble: Runtime (2)

• Uses "MPH" to change the global communicator
  – http://www.nersc.gov/research/SCG/acpi/MPH/
  – Freely available tool from NERSC
  – MPH designed for running coupled multi-model experiments
• Each member has a unique MPI communicator replacing the global communicator (see the sketch below)
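MPH provides rather more than this (named components, registration, a configuration file), but the core idea of giving each member its own communicator can be sketched with a plain MPI communicator split; the member size below is an assumption.

```python
# Minimal sketch: each ensemble member gets a private communicator that the
# model then uses wherever it would otherwise use MPI_COMM_WORLD.
from mpi4py import MPI

CPUS_PER_MEMBER = 8                                   # assumed member size
world = MPI.COMM_WORLD
member_id = world.Get_rank() // CPUS_PER_MEMBER

# Ranks sharing a colour end up in the same new communicator.
member_comm = world.Split(color=member_id, key=world.Get_rank())

print(f"world rank {world.Get_rank()} -> member {member_id}, "
      f"rank {member_comm.Get_rank()} of {member_comm.Get_size()} within the member")
```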

UM Ensemble: Future Work

• Run time tools
• Control and monitoring of ensemble members
• Real-time production of diagnostics
  – Currently each member writes its own diagnostics files
    • Lots of disk space
    • I/O performance?
  – Have a dedicated diagnostics process
    • Only output statistical analysis

UK-HiGEM

• National "Grand Challenge" Programme for High Resolution Modelling of the Global Environment
• Collaboration between a number of academic groups and the Met Office's Hadley Centre
• Develop a high resolution version of HadGEM (~1° atmosphere, 1/3° ocean)
• Better understanding and prediction of
  – Extreme events
  – Predictability
  – Feedbacks and interactions
  – Climate "surprises"
• Regional impacts of climate change

UK HiGEM Status

• Project only just starting
• Plan to use Earth Simulator for production runs
• Preliminary runs carried out
  – Earth Simulator
  – Very encouraging results
• HPCx is a useful platform
  – For development
  – Possibly for some production runs

UM Performance

• Two configurations
  – Low resolution: 96 x 73 x 19L
  – High resolution: 288 x 217 x 30L
• Built-in comprehensive timer diagnostics
  – Wallclock time
  – Communications
  – Not yet implemented: I/O, memory, hardware counters, ???
• Outputs an XML file
• Analysed using a PHP web page (see the sketch below)
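The slides do not show the timer file's schema, so the element and attribute names below are purely hypothetical; this is only a sketch (in Python rather than PHP) of the kind of post-processing the analysis page performs.

```python
# Hypothetical timer-file layout: <timer name="..." wallclock="..." comms="..."/>
import xml.etree.ElementTree as ET

timers = list(ET.parse("um_timers.xml").iter("timer"))     # assumed file name
total = sum(float(t.get("wallclock")) for t in timers)

# Report the most expensive sections and how much of each is communication.
for t in sorted(timers, key=lambda t: float(t.get("wallclock")), reverse=True):
    wall = float(t.get("wallclock"))
    comms = float(t.get("comms", 0.0))
    print(f"{t.get('name'):<20} {wall:8.2f} s  "
          f"{100 * wall / total:5.1f}% of run  {100 * comms / wall:5.1f}% comms")
```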

LowRes Scalability

[Chart] Total wallclock time in seconds (1.0E+01 to 1.0E+03) against Nproc (0 to 25), with curves for Overall, Dynamics and Physics.

LowRes: Communication Time

[Chart] Send/receive time as a percentage of section time (0% to 25%) against Nproc (0 to 25), for Overall, Dynamics and Physics.

LowRes: Load Imbalance

[Chart] Barrier time as a percentage of section time (0% to 30%) against Nproc (0 to 25), for Overall, Dynamics and Physics.

LowRes: Relative Costs

[Chart] Dynamics and Physics as a percentage of overall time (0% to 90%) against Nproc (0 to 25).

HiRes Scalability

[Chart] Total wallclock time in seconds (1.0E+01 to 1.0E+03) against Nproc (0 to 130), with curves for Overall, Dynamics and Physics.

HiRes Communication Time

[Chart] Send/receive time as a percentage of section time (0% to 40%) against Nproc (0 to 130), for Overall, Dynamics and Physics.

HiRes Load Imbalance

[Chart] Barrier time as a percentage of section time (0% to 25%) against Nproc (0 to 130), for Overall, Dynamics and Physics.

HiRes Relative Costs

[Chart] Dynamics and Physics as a percentage of overall time (0% to 100%) against Nproc (0 to 130).

HiRes Exclusive Timer

• QT_POS has large "Collective" time
  – Unexpected!
• Call to global_MAX routine in gather/scatter
  – Not needed, so deleted!

HiRes: After "optimisation"

• QT_POS reduced from 65s to 35s
• Improved scalability
• And repeat…

Optimisation Strategy

• Low Res
  – Aiming for 8-CPU runs as ensemble members (typically ~50 members)
  – Physics optimisation a priority
    • Load imbalance (SW radiation)
    • Single-processor optimisation
• Hi Res
  – As many CPUs as is feasible
  – Dynamics optimisation a priority
    • Remove/optimise collective operations
    • Increase average message length

Future Challenges

• Diagnostics and I/O
  – UM does huge amounts of diagnostic I/O in a typical climate run
  – All I/O through a single processor (see the sketch below)
    • Cost of gather
    • Non-parallel I/O
• Ocean models
  – Only 1D decomposition, so limited scalability
  – T3E optimised!
• Next generation UM5.x
  – Much more expensive
  – Better parallelisation for dynamics scheme
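As a rough illustration of why funnelling all diagnostic I/O through one processor hurts, here is a hedged sketch of the gather-then-write pattern; field sizes and the output file name are assumptions.

```python
# Illustrative sketch: every rank's sub-domain is gathered onto rank 0,
# which then writes the whole field serially.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local_rows, nx = 8, 96                                  # assumed local sub-domain shape
local_field = np.full((local_rows, nx), float(comm.Get_rank()))

# The gather is a collective whose cost grows with resolution and CPU count,
# and the write that follows is entirely non-parallel.
pieces = comm.gather(local_field, root=0)
if comm.Get_rank() == 0:
    full_field = np.concatenate(pieces, axis=0)
    full_field.tofile("diagnostic_field.dat")           # single-processor write
```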