CGAM: Running the Met Office Unified Model on HPCx
Paul Burton
CGAM, University of Reading
[email protected]
www.cgam.nerc.ac.uk/~paul
Overview
• CGAM : Who, what, why and how
• The Met Office Unified Model
• Ensemble Climate Models
• High Resolution Climate Models
• Unified Model Performance
• Future Challenges and Directions
Who is CGAM?
[Diagram: CGAM (Centre for Global Atmospheric Modelling) shown among the NERC centres and facilities it works with:]
• Atmospheric Chemistry Modelling Support Unit
• Universities’ Weather and Environment Research Network
• Distributed Institute for Atmospheric Composition
• British Atmospheric Data Centre
• University Facilities for Atmospheric Measurement
• Facility for Airborne Atmospheric Measurements
• Data Assimilation Research Centre
• British Geological Survey
• Centre for Ecology and Hydrology
• Proudman Oceanographic Laboratory
• Southampton Oceanography Centre
• Centre for Terrestrial Carbon Dynamics
• Environmental Systems Science Centre
• British Antarctic Survey
• Tyndall Centre for Climate Change Research
• National Institute for Environmental e-Science
• Centre for Polar Observations and Modelling
• NERC Centres for Atmospheric Science
• NERC
What does CGAM do?
• Climate Science
  – UK centre of expertise for climate science
  – Lead UK research in climate science
  – Understand and simulate the highly non-linear dynamics and feedbacks of the climate system
  – Earth System Modelling, from seasonal timescales to hundreds of years
  – Close links to the Met Office
• Computational Science
  – Support scientists using the Unified Model
  – Porting and optimisation
  – Development of new tools
Why does CGAM exist?
• Will there be an El Niño this year?
  – How severe will it be?
• Are we seeing increases in extreme weather events in the UK?
  – The autumn 2000 floods
  – Drought?
• Will the milder winters of the last decade continue?
• Can we reproduce and understand past abrupt changes in climate?
How does CGAM answer such questions?
• Models are our laboratory
  – Investigate predictability
  – Explore forcings and feedbacks
  – Test hypotheses
Met Office Unified Model
• Standardise on using a single model
• The Met Office’s Hadley Centre is recognised as a world leader in climate research
• Two-way collaboration with the Met Office
• Very flexible model
  – Forecast
  – Climate
  – Global or limited area
  – Coupled ocean model
  – Easy configuration via a GUI
  – User-configurable diagnostic output
Unified Model: Technical Details
• Climate configuration uses the “old” vn4.5
  – vn5 has an updated dynamical core
  – The next-generation “HadGEM” climate configuration will use this
• Grid-point model
  – Regular latitude/longitude grid
• Dynamics
  – Split-explicit finite-difference scheme
  – Diffusion and polar filtering
• Physical parameterisation
  – Almost all constrained to a vertical column
Unified Model: Parallelisation
• Domain decomposition
  – Atmosphere: 2D regular decomposition
  – Ocean: 1D (latitude) decomposition
• GCOM library for communications
  – Interface to a selectable communications library: MPI, SHMEM, ???
  – Basic communication primitives
  – Specialised communications for the UM
• Communication patterns (see the sketch below)
  – Halo update (SWAPBOUNDS)
  – Gather/scatter
  – Global/partial summations
• Designed/optimised for the Cray T3E!
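GCOM's actual interfaces are not shown in the slides; purely as a sketch of the halo-update pattern that SWAPBOUNDS implements, a minimal MPI version for a field decomposed in one dimension might look like this (all names are illustrative, not GCOM's):

```c
#include <mpi.h>

/* Illustrative halo update for a 1D (north/south) decomposition:
 * each rank exchanges one halo row with its neighbours. field has
 * nlocal interior rows of ncols points, plus one halo row at each
 * end. This mimics the SWAPBOUNDS pattern, not GCOM's API. */
void halo_update(double *field, int nlocal, int ncols, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int north = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int south = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send first interior row north, receive the south halo row. */
    MPI_Sendrecv(&field[1 * ncols],            ncols, MPI_DOUBLE, north, 0,
                 &field[(nlocal + 1) * ncols], ncols, MPI_DOUBLE, south, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send last interior row south, receive the north halo row. */
    MPI_Sendrecv(&field[nlocal * ncols], ncols, MPI_DOUBLE, south, 1,
                 &field[0],              ncols, MPI_DOUBLE, north, 1,
                 comm, MPI_STATUS_IGNORE);
}
```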
Model Configurations
• Currently
  – HadAM3 / HadCM3
    • Low resolution (270 km: 96 x 73 x 19L)
    • Running on ~10-40 CPUs
  – Turing (T3E1200), Green (O3800), Beowulf cluster
• Over the next year
  – More of the same
  – Ensembles
    • Low resolution (HadAM3/HadCM3)
    • 10-100 members
  – High resolution
    • 90 km: 288 x 217 x 30L
    • 60 km: 432 x 325 x 40L
Ensemble Methods in Weather Forecasting
• Have been used operationally for many years (e.g. ECMWF)
  – Perturbed starting conditions
  – Reduced resolution
• Multi-model ensembles
  – Perturbed starting conditions
  – Different models
• Why are they used?
  – Give some indication of predictability
  – Allow objective assessment of weather-related risks
  – More chance of seeing extreme events
Climate Ensembles
• Predictability
  – What confidence do we have in climate change?
  – What effect do different forcings have?
    • CO2: different scenarios
    • Volcanic eruptions
    • Deforestation
• How sensitive is the model?
  – Twiddle the knobs and see what happens
• How likely are extreme events?
  – Allows governments to take defensive action now
Ensembles Implementation
• Setup
  – Allow users to specify and design an ensemble experiment
• Runtime
  – Allow the ensemble to run as a single job on the machine, for easy management
• Analysis
  – How to view and process the vast amounts of data produced
Setup: Normal UM Workflow
[Diagram: the UMUI generates a UM job (a shell script run under poe, plus Fortran namelists); the job reads starting data and forcing data, and writes diagnostics and restart data as output]
Setup: UM Ensemble Workflow
[Diagram: a config (N_MEMBERS=3, plus per-member differences) expands the single UM job (shell script plus Fortran namelists) into Job.1, Job.2 and Job.3. A control script (“poe UM_Job”) sets $MEMBERid and does cd “Job.$MEMBERid” before running each member’s script; each member reads its own starting and forcing data (Data.1-3) and writes its own diagnostics and restart data (Out.1-3)]
UM Ensemble: Runtime (1)
• “poe” is called at the top level and runs a “top_level_script”, which:
  – Works out which CPU it’s on
  – Hence which member it is
  – Hence which directory/model script to run
• Model scripts run in a separate directory for each member
• Each model script calls the executable (see the sketch below)
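The real top-level logic lives in a shell script launched by poe; a minimal C sketch of the same idea follows. The Job.N directory naming comes from the workflow slide, while procs_per_member is an assumed parameter:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* chdir */

/* Illustrative member selection: each task derives its ensemble
 * member from its global rank, then moves into that member's
 * run directory (Job.$MEMBERid in the workflow diagram). */
int main(int argc, char **argv)
{
    int rank;
    int procs_per_member = 8;   /* assumed: e.g. 8 CPUs per low-res member */
    char dir[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int member = rank / procs_per_member + 1;   /* members numbered 1..N */
    snprintf(dir, sizeof(dir), "Job.%d", member);
    if (chdir(dir) != 0) {
        MPI_Abort(MPI_COMM_WORLD, 1);           /* member directory missing */
    }

    /* ... run this member's model from here ... */

    MPI_Finalize();
    return 0;
}
```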
UM Ensemble: Runtime (2)
• Uses “MPH” to change the global communicator
  – http://www.nersc.gov/research/SCG/acpi/MPH/
  – Freely available tool from NERSC
  – MPH was designed for running coupled multi-model experiments
• Each member gets a unique MPI communicator, replacing the global one (see the sketch below)
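MPH's own API is not reproduced here; as a sketch of the effect it achieves, plain MPI_Comm_split can carve the global communicator into one communicator per member (procs_per_member again assumed):

```c
#include <mpi.h>

/* Sketch of the per-member communicator: all tasks with the same
 * "colour" (member id) end up in the same communicator. The model
 * then uses member_comm everywhere it previously used the global
 * communicator, so members cannot interfere with each other. */
MPI_Comm make_member_comm(int procs_per_member)
{
    int rank;
    MPI_Comm member_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int member = rank / procs_per_member;   /* colour = member id */

    MPI_Comm_split(MPI_COMM_WORLD, member, rank, &member_comm);
    return member_comm;
}
```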
UM Ensemble: Future Work
• Run-time tools
• Control and monitoring of ensemble members
• Real-time production of diagnostics
  – Currently each member writes its own diagnostics files
    • Lots of disk space
    • I/O performance?
  – Alternative: a dedicated diagnostics process that outputs only statistical analysis
UK-HiGEM
• National “Grand Challenge” programme for high-resolution modelling of the global environment
• Collaboration between a number of academic groups and the Met Office’s Hadley Centre
• Develop a high-resolution version of HadGEM (~1° atmosphere, 1/3° ocean)
• Better understanding and prediction of:
  – Extreme events
  – Predictability
  – Feedbacks and interactions
  – Climate “surprises”
• Regional impacts of climate change
UK-HiGEM Status
• Project only just starting
• Plan to use the Earth Simulator for production runs
• Preliminary runs carried out on the Earth Simulator
  – Very encouraging results
• HPCx is a useful platform
  – For development
  – Possibly for some production runs
UM Performance
• Two configurations
  – Low resolution: 96 x 73 x 19L
  – High resolution: 288 x 217 x 30L
• Built-in comprehensive timer diagnostics
  – Wallclock time
  – Communications
  – Not yet implemented: I/O, memory, hardware counters, ???
• Outputs an XML file (see the sketch below)
• Analysed using a PHP web page
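The UM's timer code is not shown in the slides (and is Fortran); purely as an illustration of the idea, a section timer that records wallclock time per section and emits simple XML could look like this (the file name and XML layout are invented):

```c
#include <mpi.h>
#include <stdio.h>

/* Illustrative section timing in the spirit of the UM's built-in
 * diagnostics: time each major section with MPI_Wtime, then have
 * one task write the results as XML for post-processing. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();
    /* ... dynamics section ... */
    double t_dyn = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    /* ... physics section ... */
    double t_phys = MPI_Wtime() - t0;

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        FILE *f = fopen("timers.xml", "w");
        fprintf(f, "<timers>\n");
        fprintf(f, "  <section name=\"dynamics\" wallclock=\"%f\"/>\n", t_dyn);
        fprintf(f, "  <section name=\"physics\" wallclock=\"%f\"/>\n", t_phys);
        fprintf(f, "</timers>\n");
        fclose(f);
    }

    MPI_Finalize();
    return 0;
}
```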
LowRes Scalability
[Chart: total wallclock time (seconds, log scale, ~10-1000 s) vs number of processors (up to 25), for Overall, Dynamics and Physics]
LowRes: Communication Time
[Chart: send/receive time as a percentage of section time (0-25%) vs number of processors (up to 25), for Overall, Dynamics and Physics]
LowRes: Load Imbalance
[Chart: barrier time as a percentage of section time (0-30%) vs number of processors (up to 25), for Overall, Dynamics and Physics]
LowRes: Relative Costs
[Chart: Dynamics and Physics as a percentage of overall time (0-90%) vs number of processors (up to 25)]
HiRes Scalability
[Chart: total wallclock time (seconds, log scale, ~10-1000 s) vs number of processors (10-130), for Overall, Dynamics and Physics]
HiRes: Communication Time
[Chart: send/receive time as a percentage of section time (0-40%) vs number of processors (10-130), for Overall, Dynamics and Physics]
HiRes: Load Imbalance
[Chart: barrier time as a percentage of section time (0-25%) vs number of processors (10-130), for Overall, Dynamics and Physics]
HiRes: Relative Costs
[Chart: Dynamics and Physics as a percentage of overall time (0-100%) vs number of processors (10-130)]
HiRes Exclusive Timer
• QT_POS has a large “Collective” time
  – Unexpected!
• Traced to a call to a global_MAX routine in the gather/scatter
  – Not needed, so deleted! (see the sketch below)
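The slides don't show the offending code, so the following is only a generic illustration of the kind of redundant collective that can hide inside a gather: a global MAX used to size a buffer whose size is in fact already known from the regular decomposition.

```c
#include <mpi.h>

/* Generic illustration only (not the UM's code): the receive size is
 * derived via a global MAX reduction, even though the regular
 * decomposition makes every local size identical and known in
 * advance. Deleting the MPI_Allreduce saves one collective per call. */
void gather_field(const double *local, double *global,
                  int nlocal, MPI_Comm comm)
{
    int nmax;

    /* Redundant: for a regular decomposition nmax == nlocal always. */
    MPI_Allreduce(&nlocal, &nmax, 1, MPI_INT, MPI_MAX, comm);
    (void)nmax;   /* after the "optimisation" this collective is removed */

    MPI_Gather(local, nlocal, MPI_DOUBLE,
               global, nlocal, MPI_DOUBLE, 0, comm);
}
```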
HiRes: After “Optimisation”
• QT_POS reduced from 65s to 35s
• Improved scalability
• And repeat…
Optimisation Strategy
• Low res
  – Aiming for 8-CPU runs as ensemble members (typically ~50 members)
  – Physics optimisation a priority
    • Load imbalance (SW radiation)
    • Single-processor optimisation
• High res
  – As many CPUs as is feasible
  – Dynamics optimisation a priority
    • Remove/optimise collective operations
    • Increase average message length (see the sketch below)
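As an illustration of increasing the average message length, several small per-field halo messages can be packed into a single larger one, so the per-message latency is paid once rather than once per field (all names here are assumptions, not UM routines):

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of message aggregation: pack one halo row from each of
 * several fields into one buffer and send it as a single message. */
void packed_halo_send(double *fields[], int nfields, int ncols,
                      int neighbour, MPI_Comm comm)
{
    double *buf = malloc((size_t)nfields * ncols * sizeof(double));

    for (int f = 0; f < nfields; f++)
        memcpy(&buf[f * ncols], fields[f], ncols * sizeof(double));

    /* One large send replaces nfields small ones. */
    MPI_Send(buf, nfields * ncols, MPI_DOUBLE, neighbour, 0, comm);
    free(buf);
}
```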
Future Challenges
• Diagnostics and I/O
  – The UM does huge amounts of diagnostic I/O in a typical climate run
  – All I/O goes through a single processor (see the sketch after this list)
    • Cost of the gather
    • Non-parallel I/O
• Ocean models
  – Only a 1D decomposition, so limited scalability
  – T3E optimised!
• Next-generation UM 5.x
  – Much more expensive
  – Better parallelisation for the dynamics scheme
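A minimal sketch of the single-processor I/O pattern described above (not the UM's actual I/O code): every task's data is funnelled to one rank, which then writes serially, so both the gather and the write become bottlenecks as processor counts grow.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of non-parallel I/O: gather a distributed field to rank 0
 * and write it from there. Assumes every rank holds nlocal points
 * and nglobal == nlocal * nranks. */
void write_field(const double *local, int nlocal, int nglobal,
                 MPI_Comm comm, FILE *f)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    double *global = NULL;
    if (rank == 0)
        global = malloc((size_t)nglobal * sizeof(double));

    /* All tasks funnel their data through rank 0 ... */
    MPI_Gather(local, nlocal, MPI_DOUBLE,
               global, nlocal, MPI_DOUBLE, 0, comm);

    /* ... which then does all the I/O serially. */
    if (rank == 0) {
        fwrite(global, sizeof(double), (size_t)nglobal, f);
        free(global);
    }
}
```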