Post on 17-Dec-2015
www.ccsm.ucar.edu
Running CCSM
Tony Craig
CCSM Software Engineering Group
ccsm@ucar.edu

Outline
• General review of CCSM
• Setting up and running a simple case
• Datasets
• Production
• Modifying source code
• Errors
• Tools
• Performance

Review of CCSM
• Five components / Ten models
  – Atmosphere (3) : atm, datm, latm
  – Ocean (2) : ocn, docn
  – Land (2) : lnd, dlnd
  – Ice (2+) : ice, ice (prescribed mode), ice (mixed layer ocean mode), dice
  – Coupler (1) : cpl
• Communication via MPI between components and coupler only
• Each component runs on multiple processors via MPI, OpenMP, or MPI/OpenMP

Component parallelization
• atm : MPI, OpenMP, or MPI/OpenMP
• lnd : MPI, OpenMP, or MPI/OpenMP
• ice : MPI only
• ocn : MPI only
• cpl : OpenMP only
• The data models (datm, docn, dice, dlnd, and latm) : serial only, 1 processor

Configurations
• A = datm, dlnd, docn, dice, cpl
• B = atm, lnd, ocn, ice, cpl
• C = datm, dlnd, ocn, dice, cpl
• D = datm, dlnd, docn, ice, cpl
• F = atm, lnd, docn, ice (prescribed mode), cpl
• G = latm, dlnd, ocn, ice, cpl
• H = atm, dlnd, docn, dice, cpl
• I = datm, lnd, docn, dice, cpl
• K = atm, lnd, docn, dice, cpl
• M = latm, dlnd, docn, ice (ml ocn mode), cpl

Resolutions
• atm/lnd/datm/dlnd = T42, T31
• ocn/ice/docn/dice = gx1v3, gx3, gx3v4
• latm = T62
• Scientifically validated combinations
  – B, T42_gx1v3 = b20.007 control run (test.a1 case)
  – B, T31_gx3v4 = paleo control run (test.a2 case)

“Available” configurations
Configurations: A B C D F G H I K M
T42_gx1v3 : * * * * * * * *
T31_gx3 : * * * * * * *
T31_gx3v4 : *
T62_gx1v3 : * *
T62_gx3 : * *
Legend (three marker styles): * = supported (subject to change), * = b20.007 control, * = paleo control

Platforms
• IBM
• SGI
• Compaq*

Review of scripts
• Main script (test.a1.run)
  – Sets primary ccsm environment variables
  – Calls $model.setup.csh
      • Gets input datasets
      • Builds components
  – Runs model
  – Archives
  – Harvests

Setting up a simple case
• Use the GUI!
  – The GUI modifies the scripts and creates a new case for you
  – Input $CASE, $CSMROOT, $CSMDATA, $EXEROOT
  – Input resolution
  – Input configuration (A-M)
  – Sets processor layout based on configuration (first guess)
  – Sets some batch environment variables
  – Works well in the NCAR environment; other sites require tuning after the scripts are generated

Setting up a simple case, without GUI
• Create new case directory under scripts, copy over test.a1 files
• Rename file test.a1.run to $CASE.run
  – Edit $CASE, $CSMROOT, $CSMDATA, $EXEROOT, $ARCROOT
  – Edit batch environment parameters
  – Edit $GRID
  – Edit $SETUPS
  – Edit $NTASKS, $NTHRDS

$NTASKS, $NTHRDS, batch
• $NTASKS is the total number of MPI tasks for each component
• $NTHRDS is the number of OpenMP threads per MPI task
• $NTASKS*$NTHRDS = total number of processors for each component
• Tuning is required to get optimal load balance
• Batch parameters should match the processors used; consistency is important
• task_geometry (LoadLeveler) is very powerful

Component parallelization
• atm : MPI, OpenMP, or MPI/OpenMP
• lnd : MPI, OpenMP, or MPI/OpenMP
• ice : MPI only, NTHRDS=1
• ocn : MPI only, NTHRDS=1
• cpl : OpenMP only, NTASKS=1
• The data models (datm, docn, dice, dlnd, and latm) : serial only, 1 processor, NTASKS=1, NTHRDS=1
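
The parallelization rules above can be checked mechanically before a job is submitted. The following is a minimal sketch in Python, not part of the CCSM scripts (which are csh); the function name check_layout and the rule tables are hypothetical and simply transcribe the slide.

```python
# Hypothetical layout checker; the rule sets are transcribed from the slide above.
MPI_ONLY = {"ice", "ocn"}                          # NTHRDS must be 1
OPENMP_ONLY = {"cpl"}                              # NTASKS must be 1
SERIAL = {"datm", "docn", "dice", "dlnd", "latm"}  # NTASKS and NTHRDS must be 1


def check_layout(setup, ntasks, nthrds):
    """Return a list of problems for one component (an empty list means OK)."""
    problems = []
    if setup in (MPI_ONLY | SERIAL) and nthrds != 1:
        problems.append(f"{setup}: NTHRDS must be 1, got {nthrds}")
    if setup in (OPENMP_ONLY | SERIAL) and ntasks != 1:
        problems.append(f"{setup}: NTASKS must be 1, got {ntasks}")
    return problems


if __name__ == "__main__":
    for setup, ntasks, nthrds in [("ocn", 40, 1), ("cpl", 2, 4), ("datm", 2, 2)]:
        print(setup, check_layout(setup, ntasks, nthrds) or "ok")
```

atm and lnd accept any combination of tasks and threads, so they never produce a message here.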

Main script configuration summary
• B case
MODELS ( atm   lnd   ocn  ice  cpl )
SETUPS ( atm   lnd   ocn  ice  cpl )
NTASKS ( 8     2     40   8    1   )
NTHRDS ( 4     4     1    1    4   )
• datm/dlnd/ocn/ice case
MODELS ( atm   lnd   ocn  ice  cpl )
SETUPS ( datm  dlnd  ocn  ice  cpl )
NTASKS ( 1     1     64   16   1   )
NTHRDS ( 1     1     1    1    4   )
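
As a worked example of $NTASKS*$NTHRDS, the processor count for the B case above can be tallied as follows. This is an illustrative Python sketch; the helper name processors_per_component is an assumption, and only the numbers come from the slide.

```python
# Values copied from the B case on the slide; the helper name is hypothetical.
MODELS = ["atm", "lnd", "ocn", "ice", "cpl"]
NTASKS = [8, 2, 40, 8, 1]
NTHRDS = [4, 4, 1, 1, 4]


def processors_per_component(models, ntasks, nthrds):
    """Processors used by each component = MPI tasks * OpenMP threads per task."""
    return {m: t * h for m, t, h in zip(models, ntasks, nthrds)}


procs = processors_per_component(MODELS, NTASKS, NTHRDS)
total = sum(procs.values())
print(procs)   # atm 32, lnd 8, ocn 40, ice 8, cpl 4
print(total)   # 92
```

The batch request should then ask for the same total, per the consistency point made earlier.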

$RUNTYPE
• Startup - initial startup of model using arbitrary initialization
  – set $CASE, $BASEDATE
• Continue - continuation of case, bit-for-bit guaranteed, uses model restart files
  – set $CASE
• Branch - start new case as a bit-for-bit continuation of another case, uses model restart files, requires continuous date
  – set $CASE, $REFCASE, $REFDATE
• Hybrid - start new case, not bit-for-bit continuation, uses model initial files in atm and land, can change starting date
  – set $CASE, $BASEDATE, $REFCASE, $REFDATE
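
The variables each $RUNTYPE needs can be written down as a small lookup table. The sketch below is illustrative Python, not taken from the CCSM scripts; REQUIRED and missing_variables are hypothetical names, and the variable lists are those given on the slide.

```python
# Variables to set for each $RUNTYPE, as listed on the slide.
REQUIRED = {
    "startup": ["CASE", "BASEDATE"],
    "continue": ["CASE"],
    "branch": ["CASE", "REFCASE", "REFDATE"],
    "hybrid": ["CASE", "BASEDATE", "REFCASE", "REFDATE"],
}


def missing_variables(runtype, env):
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED[runtype.lower()] if not env.get(name)]


if __name__ == "__main__":
    # A branch run with only $CASE set is missing $REFCASE and $REFDATE.
    print(missing_variables("branch", {"CASE": "test.b1"}))
```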

Coupler namelist
• Stop_option : ndays, nmonths, newmonth, halfyear, newyear, newdecade
• Stop_n : integer (ndays, nmonths)
• Rest_freq : ndays, monthly, quarterly, halfyear, yearly
• Rest_n : integer (ndays)
• Diag_freq : daily, weekly, biweekly, monthly, quarterly, yearly, ndays
• Diag_n : integer (ndays)
• info_bcheck : integer

Data Sets
• Types
  – Grid files, binary
  – Namelist input, ascii
  – Initial datasets, binary/netcdf
  – Restart datasets, binary
  – History datasets, netcdf
  – Log files, ascii
• inputdata directory
  – This is usually pointed to by $CSMDATA

Data Flow, Input
• Everything is copied to $EXEROOT
• Tools and scripts attempt to automate most of the “get input files” work
• Main script variables include $CSMDATA, $LFSINP, $LMSINP, $MACINP, $RFSINP, $RMSINP
[Diagram: data flow into $EXEROOT. Labels: Mass Store, $ARCROOT/restart, $CSMDATA = inputdata, scripts/$CASE, Setup scripts]

Data Flow, Output
• Output files are moved out of $EXEROOT
• Harvesting is a separate process
• Writing of restart files is coordinated by the coupler
• Writing of history files is not coordinated between components; monthly average is the default
• Main script variables include $LMSOUT, $MACOUT, $RFSOUT
[Diagram: data flow out of $EXEROOT. Labels: $EXEROOT, $ARCROOT, Mass Store, Scripts, archiving, harvesting]

Log Files
• Each component produces a log file, $model.log.$LID
• $LID is a system date stamp
• Date stamps are the same on all log files for a run
• Log files are written into the $EXEROOT/$model directories during execution
• Log files are copied to $SCRIPTS/logs at the end of a run
• Separate stdout and stderr files sometimes contain additional output

Archiving, ccsm_archive
• Archiving means moving model output to a separate area on a local disk; this is done by ccsm_archive
• Local disk area is set by $ARCROOT in the main script
• Benefits
  – Allows separation of running and harvesting
  – Mass storage availability does not prevent continued execution of the model
  – Allows users to run in volatile temporary space
  – Supports simple harvesting in a clustered machine environment (like nirvana)

Harvesting, $CASE.har
• Harvesting means copying model output to the local mass store
• Separate script in scripts/$CASE, $CASE.har
• Typically submitted in batch, can also be run interactively
• Submitted by main script after model run, off by default
• Sources ccsm_joe for important environment variables
• Harvests all files in $ARCROOT/{atm,lnd,ocn,ice,cpl}
• Verifies accurate copy on mass store before removing the local file
• Can scp files to remote machines
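
The "verify before removing" step is the part of harvesting most worth getting right. The sketch below illustrates the idea in Python under loud assumptions: a plain directory stands in for the mass store, a SHA-256 comparison stands in for whatever check $CASE.har actually performs (the slides do not say), and the names harvest_file and _sha256 are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path):
    """Checksum a file in chunks so large history files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def harvest_file(src, dest_dir):
    """Copy src into dest_dir, verify the copy, and only then remove src."""
    src = Path(src)
    dest = Path(dest_dir) / src.name
    shutil.copy2(src, dest)
    if _sha256(src) != _sha256(dest):
        # Keep the original if the copy cannot be verified.
        raise IOError(f"copy of {src} could not be verified; original kept")
    src.unlink()
    return dest
```

Because the source is deleted only after verification, a mass store outage leaves the archived output untouched in $ARCROOT.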

Exact Restart
• CCSM can stop and restart exactly
• The coupler controls the frequency of restart file writes
• Restart files guarantee bit-for-bit continuity at a checkpoint boundary
• rpointer files are updated in the scripts/$CASE directory after each run

Restart file management (1)
• ccsm_archive
  – In scripts/$CASE
  – Called from main script after model run is complete, commented out by default
  – $ARCROOT/restart contains the latest full set of restart files
  – ccsm_archive copies full set of restart datasets into $ARCROOT/restart after each run
  – ccsm_archive then tars up that restart set into the $ARCROOT/restart.tars directory
  – These tar files can be large; regular cleanup is required

Restart file management (2)
• ccsm_getrestart
  – In scripts/tools
  – Called from main script before model run starts, commented out by default
  – Copies the latest set of restart files from $ARCROOT/restart to the appropriate directories
• To “back up” a model run to a previous model date
  – Assumes both ccsm_archive and ccsm_getrestart have been active in the main script
  – Delete all files in $ARCROOT/restart
  – Untar an $ARCROOT/restart.tars file into $ARCROOT/restart
  – Resubmit
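
The "back up" procedure above (delete, untar, resubmit) can be sketched as follows. This is illustrative Python rather than anything shipped with CCSM; restore_restart_set is a hypothetical name, and the sketch assumes the tar file stores its members relative to the restart directory, which the slides do not confirm.

```python
import shutil
import tarfile
from pathlib import Path


def restore_restart_set(arcroot, tar_name):
    """Replace $ARCROOT/restart with the contents of one restart.tars file."""
    restart_dir = Path(arcroot) / "restart"
    tar_path = Path(arcroot) / "restart.tars" / tar_name
    # Check the tar file first so a typo cannot wipe the current restart set.
    if not tar_path.is_file():
        raise FileNotFoundError(tar_path)
    # Step 1: delete all files in $ARCROOT/restart.
    if restart_dir.exists():
        shutil.rmtree(restart_dir)
    restart_dir.mkdir(parents=True)
    # Step 2: untar the chosen restart set into $ARCROOT/restart.
    with tarfile.open(tar_path) as tar:
        tar.extractall(restart_dir)
    return sorted(p.name for p in restart_dir.iterdir())
```

Resubmitting the main script is left to the user; with ccsm_getrestart active, the restored files are then copied into place before the run starts.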

Auto-Resubmit
• RESUBMIT file in scripts/$CASE directory
  – Contains a single integer
  – If the integer is >0, main script resubmits itself and decrements the integer
• Runaway jobs
  – FIRST! Set value in RESUBMIT file to 0
  – Attempt to kill running jobs
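
The RESUBMIT counter logic is simple enough to state in a few lines. The sketch below is an illustrative Python rendering of the rule described above, not the main script's actual csh; should_resubmit is a hypothetical name.

```python
from pathlib import Path


def should_resubmit(resubmit_file):
    """Return True and decrement the counter if the RESUBMIT integer is > 0."""
    path = Path(resubmit_file)
    count = int(path.read_text().strip())
    if count > 0:
        path.write_text(f"{count - 1}\n")
        return True
    return False
```

Writing 0 into the file makes the next call return False, which is why zeroing RESUBMIT comes first when stopping a runaway job: killing the running job alone does not stop the chain.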

Production
• Modify coupler namelist in cpl.setup.csh: set run length and restart frequency, turn down diagnostic frequency, set info_bcheck to 0
• Run a startup, hybrid, or branch case ($RUNTYPE)
• Transition to continue $RUNTYPE
• Turn on archiving, harvesting, and ccsm_getrestart
• Edit RESUBMIT file to initiate auto-resubmission

Monitoring a run
• Monitor the batch jobs using llq, bjobs, qstat
• Verify that runs complete successfully; check for timing information at the end of a log file
• tail -f $EXEROOT/cpl/cpl.log*
• If runs are not succeeding,
  – tail each log file
  – grep for ENDRUN in atm and lnd log files
  – Check stdout and stderr files for component messages or system messages
  – Look for core files in $EXEROOT/$model
  – Look for zero length files in $EXEROOT/$model
  – Check email
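
Several of the checks above (ENDRUN in log files, core files, zero length files) can be gathered in one pass over a component directory. The sketch below is illustrative Python; scan_component_dir is a hypothetical name, and it assumes log files follow the $model.log.$LID naming described on the Log Files slide and that core file names begin with "core".

```python
from pathlib import Path


def scan_component_dir(model_dir):
    """Report ENDRUN log files, core files, and zero-length files in one directory."""
    report = {"endrun": [], "core": [], "zero_length": []}
    for path in sorted(Path(model_dir).iterdir()):
        if not path.is_file():
            continue
        if path.stat().st_size == 0:
            report["zero_length"].append(path.name)
        if path.name.startswith("core"):
            report["core"].append(path.name)
        # Only log files are searched for ENDRUN; binary output is skipped.
        if ".log." in path.name and "ENDRUN" in path.read_text(errors="replace"):
            report["endrun"].append(path.name)
    return report
```

Running it on each $EXEROOT/$model directory gives a quick first look before reading the logs by hand.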

Modifying source code
• Modifying files in the ccsm models directory is not recommended
• Create directories under scripts/$CASE
  – src.atm, src.lnd, src.ocn, src.ice, src.cpl
  – Copy subset of model source code to these directories and modify it
  – Has highest priority with respect to build
• Benefits include
  – Release source code remains unmodified and available
  – Allows implementation of case dependent code modifications
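
"Highest priority with respect to build" amounts to an ordered search path in which the case's src.$model directory is consulted first. The sketch below shows that idea in Python; it is not the CCSM build system, and the name resolve_source and the release_dirs argument are hypothetical.

```python
from pathlib import Path


def resolve_source(filename, case_dir, model, release_dirs):
    """Return the first match for filename, searching src.<model> before the release tree."""
    search_path = [Path(case_dir) / f"src.{model}"]
    search_path += [Path(d) for d in release_dirs]
    for directory in search_path:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None
```

A file copied into scripts/$CASE/src.atm therefore shadows the release copy of the same name, while every other file still comes from the unmodified release tree.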

Multiple Machine Support
• Should run on blackforest, babyblue, and ute “out of the box”
• “Other” machines include seaborg, nirvana, eagle, falcon, cheetah
• Supported platforms are indicated in $OS, $SITE, $MACH, $ARCH environment variables in the main script
• See also scripts/tools/test.a1.mods.$MACH for suggested changes to test.a1.run for “other” machines.

Running on a “New” Machine
• Main script
  – Set batch queue commands
  – Add new $OS, $SITE, $MACH, $ARCH options
  – Set standard CCSM path names, $CSMROOT, …
  – Harvester submission issues
  – Set data movement variables, $LMSINP, …
• Harvester script
  – May require modification
• Tools
  – May need to modify ccsm_msread, ccsm_mswrite
• Build
  – Modify models/bld/Macros.$OS file

ccsm_joe
• Created by main script
• Updated every time the main script runs
• Case dependent
• Records important ccsm environment variables
• Can be “sourced” by other scripts to inherit ccsm environment variables

Interactive/Batch Issues
• Can run main script interactively
• Typically used to build and pre-stage initial data
• Uncomment “exit” command in main script to stop the script before it starts ccsm execution
• Batch environment highly site dependent
  – NQS
  – LoadLeveler
  – LSF
  – PBS

Common Errors (1)
• Model won’t build
  – Try rebuilding clean
  – Remove all obj directories; these are $OBJROOT/model/obj, which is normally equivalent to $EXEROOT/model/obj
  – When rebuilding, make sure $SETBLD is true in main script
• Model won’t continue due to restart problem
  – Determine cause of problem: quota, hardware, script, zero length files, rpointer problems
  – Fix if possible
  – Back up to latest “good” restart dataset
  – Rerun

Common Errors (2)
• Ice model stops due to mp transport error
  – Double ndte in ice.setup.csh ice model namelist
  – Back up to latest “good” restart dataset
  – Run past previous stop date
  – Reset ndte value
• Ocean model non-convergence
  – Add about 10% to the number of model timesteps/hour in ocn.setup.csh, DT_COUNT
  – Back up to latest “good” restart dataset
  – Run past previous stop date
  – Reset DT_COUNT
  – Non-convergence on first timestep is special case
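
The two temporary adjustments above are simple arithmetic. The sketch below works them through in Python with made-up starting values (the slides give no defaults); rounding the 10% increase up to the next integer is an assumption, since the slide only says "about 10%", and both function names are hypothetical.

```python
def bump_dt_count(dt_count):
    """Increase DT_COUNT by 10%, rounded up, using integer arithmetic only."""
    return (dt_count * 11 + 9) // 10


def double_ndte(ndte):
    """Double the ice model's ndte value."""
    return 2 * ndte


if __name__ == "__main__":
    # Hypothetical starting values, for illustration only.
    print(bump_dt_count(23))   # 26
    print(double_ndte(120))    # 240
```

Integer arithmetic avoids floating-point surprises when rounding up. Both values should be reset to their originals once the run is past the previous stop date.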

Tools
• Under scripts/tools
  – ccsm_getfile : hierarchical search for file
  – ccsm_getinput : hierarchical search for input file
  – ccsm_msread : copies a file from local mass store
  – ccsm_mswrite : copies a file to local mass store
  – ccsm_checkenvs : echo ccsm environment variables, used to create ccsm_joe
  – ccsm_getrestart : copies restart files from $ARCROOT/restart to appropriate $EXEROOT and scripts/$CASE directories

Performance
• This is complicated!
• Issues
  – Performance of components and system as a function of resolution and configuration
  – Scalability of individual components, scaling efficiency of individual components
  – Task/Thread counts
  – Components sharing nodes, overloading nodes with multiple components, overloading threads, overloading tasks
  – Load balance of coupled system

Component Timings
[Figure: seconds per simulated day (axis 0 to 300) versus number of processors (4, 8, 16, 32, 64) for atm, lnd, ice, ocn]

CCSM Load Balancing
Processors:
40 ocean
32 atm
16 ice
12 land
4 cpl
104 total
[Figure: load-balance diagram, timings in seconds per day. Values shown: 9.4, 3.0, 6.2, 15.0, 8.6, 40.4, 53.2, 10.0, 10.0, 55, 3, 2, 5]

Component/Hardware layout
• Machine, set of nodes
• Nodes, group of processors that share memory
• Processors, individual computing elements
• General rules
  – Do not oversubscribe processors, place only 1 MPI task or 1 thread on each processor
  – Minimize the number of nodes used for a given component and processor requirement
  – Multiple components can share a node as long as there is no oversubscription of processors
  – Test several decompositions, layouts, task/thread combinations to try to optimize performance

Summary
• CCSM is a complicated multi-executable climate model; expect some “spin-up” time
• CCSM is a scientific research code
• There are many possible components, configurations, platforms, and resolutions; we are unable to test everything
• Users are responsible for validating their science
• NCAR can help with software/configuration problems, ccsm@ucar.edu
• Please report bugs, fixes, improvements, and ports to new hardware, so we can incorporate those changes! ccsm@ucar.edu