Discover Cluster Upgrades:
Hello Haswells and SLES11 SP3, Goodbye Westmeres
February 3, 2015
NCCS Brown Bag
NASA Center for Climate Simulation
Agenda
• Discover Cluster Hardware Changes & Schedule – Brief Update
• Using Discover SCU10 Haswell / SLES11 SP3
• Q & A
Discover: Haswells & SLES11 SP3, Feb. 3, 2015 2
Discover Hardware Changes & Schedule update
Discover’s New Intel Xeon “Haswell” Nodes
• Discover’s Intel Xeon “Haswell” nodes:
– 28 cores per node, 2.6 GHz
– Usable memory: 120 GB per node, ~4.25 GB per core (128 GB total)
– FDR InfiniBand (56 Gbps), 1:1 blocking
– SLES11 SP3
– NO SWAP space, but DO have lscratch and shmem disk space
• SCU10: 720* Haswell nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total
– *Up to 360 of the 720 nodes may be episodically allocated for priority work
• SCU11: ~600 Haswell nodes, 16,800 cores total, 683 TFLOPS peak
Discover Hardware Changes in a Nutshell
• January 30, 2015 (-70 TFLOPS):
– Removed: 516 Westmere (12-core) nodes (SCU3, SCU4)
• February 2, 2015 (+806 TFLOPS for general work):
– Added: ~720* Haswell (28-core) nodes (2/3 of SCU10)
• *Up to 360 of the 720 nodes may be episodically allocated to a priority project
• Week of February 9, 2015 (-70 TFLOPS):
– Removed: 516 Westmere (12-core) nodes (SCU1, SCU2)
– Removed: 7 oldest (‘Dunnington’) Dalis (dali02-dali08)
• Late February/early March 2015 (+713 TFLOPS for general work):
– Added: 600 Haswell (28-core) nodes (SCU11)
[Chart: TFLOPS for General User Work]
Discover Node Count for General Work – Fall/Winter Evolution
Discover Processor Cores for General Work – Fall/Winter Evolution
Oldest Dali Nodes to Be Decommissioned
• The oldest Dali nodes (dali02 – dali08) will be decommissioned starting February 9 (plenty of newer Dali nodes remain).
• You should see no impact from the decommissioning of old Dali nodes, provided you have not been explicitly specifying one of the dali02 – dali08 node names when logging in.
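One quick way to audit your own scripts for hard-coded old Dali names is a `grep` for the retiring hostnames. The sketch below creates a scratch file as a stand-in for one of your scripts, purely for illustration:

```shell
# Create a scratch file standing in for one of your own scripts
# (illustration only; in practice you would grep your real files).
tmpfile=$(mktemp)
echo 'ssh dali03 "run_analysis"' > "$tmpfile"

# dali02 through dali08 are the nodes being decommissioned;
# any match means the script should drop the explicit node name.
if grep -qE 'dali0[2-8]' "$tmpfile"; then
    result="needs updating"
else
    result="ok"
fi
echo "$result"

rm -f "$tmpfile"
```

Scripts that simply log in to the generic Dali pool need no change.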
Using Discover SCU10 and Haswell / SLES11 SP3
How to use SCU10
• 720 Haswell nodes on SCU10 available in sp3 partition
• To be placed on a login node with the SP3 development environment, after providing your NCCS LDAP password, specify “discover-sp3” at the “Host” prompt:
Host: discover-sp3
• However, you may submit to the sp3 partition from any login node.
How to use SCU10
• To submit a job to the sp3 partition, use either:
– Command line:
sbatch --partition=sp3 --constraint=hasw myjob.sh
– Or inline directives:
#SBATCH --partition=sp3
#SBATCH --constraint=hasw
Porting your work: the fine print…
• There is a small (but non-zero) chance your scripts and binaries will run with no changes at all.
• Nearly all scripts and binaries will require changes to make best use of SCU10.
Porting your work: the fine print…
• There is a small (but non-zero) chance your scripts and binaries will run with no changes at all.
• Nearly all scripts and binaries will require changes to make best use of SCU10, sooo…
With great power comes great responsibility.
- Ben Parker (2002)
Adjust for new core count
• Haswell nodes have 28 cores, 128 GB– >x2 memory/core from Sandy Bridge
• Specify total cores/tasks needed, not nodes.– Example: for Sandy Bridge nodes:
#SBATCH --ntasks=800
Not
#SBATCH --nodes=50
• This allows SLURM to allocate whatever resources are available.
If you must control the details…
• … still don’t use --nodes.
• If you need more than ~4 GB/core, use fewer cores/node.
#SBATCH --ntasks-per-node=N…
– Assumes 1 task/core (the usual case).
• Or specify required memory:
#SBATCH --mem-per-cpu=N_MB…
• SLURM will figure out how many nodes are needed to meet this specification.
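A rough sketch of the arithmetic SLURM performs, using the Haswell figures from these slides; the 6,000 MB per-task request is just an illustrative number, not a recommendation:

```shell
ntasks=800                   # as in --ntasks=800
cores_per_node=28            # Haswell cores per node
mem_per_cpu_mb=6000          # illustrative --mem-per-cpu request
usable_node_mem_mb=122880    # ~120 GB usable per Haswell node

# Nodes needed if limited only by cores (ceiling division).
nodes_by_cores=$(( (ntasks + cores_per_node - 1) / cores_per_node ))

# With a per-task memory request, fewer tasks may fit on each node.
tasks_per_node=$(( usable_node_mem_mb / mem_per_cpu_mb ))
nodes_by_mem=$(( (ntasks + tasks_per_node - 1) / tasks_per_node ))

# The stricter of the two constraints wins.
if [ "$nodes_by_mem" -gt "$nodes_by_cores" ]; then
    nodes=$nodes_by_mem
else
    nodes=$nodes_by_cores
fi
echo "$nodes"    # 40 nodes: here memory, not cores, is the binding limit
```

With no memory request at all, the same 800 tasks would pack onto 29 nodes, which is why stating only what you actually need gives the scheduler the most room to work.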
Script changes summary
• Avoid specifying --partition unless absolutely necessary.
– And sometimes not even then…
• Avoid specifying --nodes.
– Ditto.
• Let SLURM do the work for you.
– That’s what it’s there for, and it allows for better resource utilization.
Source code changes
• You might not need to recompile…
– … but the SP3 upgrade may require it.
• SCU10 hardware is brand-new, possibly needing a recompile.
– New features, e.g. AVX2 vector registers
– SGI nodes, not IBM
– FDR vs. QDR InfiniBand
– NO SWAP SPACE!
And did I mention…
• … NO SWAP SPACE!
• This is critical.
– When you run out of memory now, you won’t start to swap – your code will throw an exception.
• Ameliorated by the higher GB/core ratio…
– … but we still expect some problems from this.
• Use policeme to monitor the memory requirements of your code.
If you do recompile…
• Current working compiler modules:
– All Intel C compilers (ifort not tested yet)
– gcc 4.5, 4.8.1, 4.9.1
– g95 0.93
• Current working MPI modules:
– SGI MPT
– Intel MPI 4.1.1.036 and later
– MVAPICH2 1.8.1, 1.9, 1.9a, 2.0, 2.0a, 2.1a
– OpenMPI 1.8.1, 1.8.2, 1.8.3
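Loading one of the working combinations might look like the following; the exact module names below are assumptions for illustration, so check `module avail` on an SP3 node for the real names:

```shell
# Module names are illustrative, not verified on Discover.
module purge
module load comp/intel-13.0.1     # hypothetical Intel compiler module name
module load mpi/impi-4.1.1.036    # Intel MPI 4.1.1.036, listed above as working
```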
MPI “gotchas”
• Programs using old Intel MPI must be upgraded.
• MVAPICH2 and OpenMPI have only been tested on single-node jobs.
• All MPI modules (except SGI MPT) may experience stability issues when node counts are >~300.
– Symptom: abnormally long MPI teardown times.
cron jobs
• discover-cron is still at SP1.
– When running SP3-specific code, you may need to ssh to an SP3 node for proper execution.
– Not extensively tested yet.
Sequential job execution
• Jobs may not execute in submission order.
– Small and interactive jobs favored during the day.
– Large jobs favored at night.
• If execution order is important, the dependencies must be specified to SLURM.
• Multiple dependencies can be specified with the --dependency option.
– Can depend on start, end, failure, error, etc.
Dependency example
# String to hold the colon-separated job IDs.
job_ids=''
# Submit the first parallel processing job; sbatch prints
# "Submitted batch job <id>", so field 4 is the job ID.
job_id=$(sbatch job1.sh | cut -d ' ' -f 4)
job_ids="$job_ids:$job_id"
# Submit the second parallel processing job, save the job ID.
job_id=$(sbatch job2.sh | cut -d ' ' -f 4)
job_ids="$job_ids:$job_id"
# Submit the third parallel processing job, save the job ID.
job_id=$(sbatch job3.sh | cut -d ' ' -f 4)
job_ids="$job_ids:$job_id"
# Wait for all three processing jobs to finish successfully, then
# run the post-processing job. Since $job_ids already begins with
# ':', this expands to --dependency=afterok:<id1>:<id2>:<id3>.
sbatch --dependency=afterok$job_ids postjob.sh
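The `cut` in the example relies on sbatch’s standard one-line message, “Submitted batch job <id>”. In isolation, the parsing looks like this (the output string is simulated here, since no job is actually submitted):

```shell
# Simulated sbatch output; field 4 (space-delimited) is the job ID.
output="Submitted batch job 12345"
job_id=$(echo "$output" | cut -d ' ' -f 4)
echo "$job_id"    # prints 12345
```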
Coming attraction: shared nodes
• SCU10 nodes will initially be exclusive: 1 job/node
• This is how we roll on Discover now.
• May leave a lot of unused cores and/or memory.
• Eventually, SCU10 nodes (and maybe others) will be shared among jobs.
– Same or different users.
• What does this mean?
Shared nodes (future)
• You will no longer be able to assume that all of the node resources are for you.
• Specifying task and memory requirements will ensure SLURM gets you what you need.
• Your jobs must learn to “work and play well with others”.
– Unexpected job interactions, esp. with I/O, may cause unusual behavior when nodes are shared.
Shared nodes (future, continued)
• If you absolutely must have a minimum number of CPUs in a node, the --mincpus=N option to sbatch will ensure you get it.
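In a batch script that would look like the following; the 16-CPU figure is just an example value:

```shell
# Require at least 16 CPUs on each node allocated to this job
# (example value; set N to what your job actually needs).
#SBATCH --mincpus=16
```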
Questions & Answers
NCCS User Services: [email protected]
301-286-9120
https://www.nccs.nasa.gov
Thank you
Supplemental Slides
Discover Compute Nodes, February 3, 2015 (Peak ~1,629 TFLOPS)
• “Haswell” nodes, 28 cores per node, 2.6 GHz (new)
– SLES11 SP3
– SCU10, 4.5 GB memory per core (new)
• 720* nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total (*360 nodes episodically allocated for priority work)
• “Sandy Bridge” nodes, 16 cores per node, 2.6 GHz (no change)
– SLES11 SP1
– SCU8, 2 GB memory per core
• 480 nodes, 7,680 cores, 160 TFLOPS peak
– SCU9, 4 GB memory per core
• 480 nodes, 7,680 cores, 160 TFLOPS peak
• “Westmere” nodes, 12 cores per node, 2 GB memory per core, 2.6 GHz
– SLES11 SP1
– SCU1, SCU2 (SCUs 3, 4, and 7 already removed)
• 516 nodes, 6,192 cores total, 70 TFLOPS peak
Discover Compute Nodes, March 2015 (Peak ~2,200 TFLOPS)
• “Haswell” nodes, 28 cores per node
– SLES11 SP3
– SCU10, 4.5 GB memory per core
• 720* nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total (*360 nodes episodically allocated for priority work)
– SCU11, 4.5 GB memory per core (new)
• ~600 nodes, 16,800 cores total, 683 TFLOPS peak
• “Sandy Bridge” nodes, 16 cores per node (no change)
– SLES11 SP1
– SCU8, 2 GB memory per core
• 480 nodes, 7,680 cores, 160 TFLOPS peak
– SCU9, 4 GB memory per core
• 480 nodes, 7,680 cores, 160 TFLOPS peak
• No remaining “Westmere” nodes
Discover Compute – Fall/Winter Transition Timeline
[Gantt-style timeline; weekly columns: Jan. 26-30, Feb. 2-6, Feb. 9-13, Feb. 17-20, Feb. 23-27, Mar. 2-27]
• SCU10 (SLES11 SP3; 1,080 Intel Haswell nodes; 30,240 cores; 1,229 TF peak): SCU10 integration, then SCU10 general access: +720* nodes. SCU10 arrived in mid-November 2014. Following installation and resolution of initial power issues, the NCCS provisioned SCU10 with Discover images and integrated it with GPFS storage. NCCS stress testing and targeted high-priority use occurred in January 2015. (*360 nodes episodically allocated for priority work)
• SCUs 8 and 9 (SLES11 SP1; 960 Intel Sandy Bridge nodes; 15,360 cores; 320 TF peak): No changes during this period (January – March 2015). In November 2014, 480 nodes previously allocated for a high-priority project were made available for all user processing.
• SCU11 (SLES11 SP3; 600 Intel Haswell nodes; 16,800 cores; 683 TF peak): Physical installation, configuration, and stress testing, then SCU11 general access: +600 nodes. SCU11 has been delivered and will be installed starting Feb. 9th. The NCCS will then provision the system with Discover images and integrate it with GPFS storage. Power and I/O connections from Westmere SCUs 1, 2, 3, and 4 are needed for SCU11, so those SCUs must be removed prior to SCU11 integration.
• SCUs 1, 2, 3, and 4 (SLES11 SP1; 1,032 Intel Westmere nodes; 12,384 cores; 139 TF peak): Decommissioning in two stages of 516 nodes each (drain, then remove). To make room for the new SCU11 compute nodes, the nodes of Scalable Units 1, 2, 3, and 4 (12-core Westmeres installed in 2011) are being removed from operations during February. Removal of the first half coincides with general access to SCU10, and of the remaining half with installation of SCU11.
Discover “SBU” Computational Capacity for General Work – Fall/Winter Evolution
Total Discover Peak Computing Capability as a Function of Time (Intel Xeon Processors Only)
Total Number of Discover Intel Xeon Processor Cores as a Function of Time
Storage Augmentations
• Dirac (Mass Storage) Disk Augmentation
– 4 Petabytes usable (5 Petabytes “raw”), installed
– Gradual data move: starts week of February 9 (many files, “inodes” to move)
• Discover Storage Expansion
– 8 Petabytes usable (10 Petabytes “raw”), installed
– For both general use and the targeted “Climate Downscaling” project
– Phased deployment, including optimizing the arrangement of existing project and user nobackup space