Scaling, Throughput and an Historical Perspective:
Application Performance on Multi-core Processors
M.F. Guest≠, C.A. Kitchen≠, M. Foster† and D. Cho§
≠ Cardiff University, † Atos, § Mellanox Technologies
Outline
I. Performance Benchmarks and Cluster Systems
   a. Synthetic code performance: STREAM and IMB
   b. Application code performance: DL_POLY, GROMACS, AMBER, GAMESS-UK, VASP and Quantum Espresso
   c. Interconnect performance: Intel MPI and Mellanox's HPC-X
   d. Processor family and interconnect – "core to core" and "node to node" benchmarks
II. Impact of Environmental Issues in Cluster Acceptance Tests
   a. Security patches, turbo mode and throughput testing
III. Performance Profile of DL_POLY and GAMESS-UK over the Past Two Decades
IV. Acknowledgements and Summary
Contents
I. Review of parallel application performance featuring synthetics and end-user applications across a variety of clusters.
   ¤ End-user codes – DL_POLY, GROMACS, AMBER, NAMD, LAMMPS, GAMESS-UK, Quantum Espresso, VASP, CP2K, ONETEP & OpenFOAM
   • Ongoing focus on Intel's Xeon Scalable processors ("Skylake") and AMD's Naples EPYC processor, plus NVIDIA GPUs, including:
   ¤ Clusters with dual-socket nodes – Intel Xeon Gold 6148 (20c, 27.5 MB cache, 2.40 GHz) & Xeon Gold 6138 (20c, 27.5 MB cache, 2.00 GHz) + AMD Naples EPYC 7551 (2.00 GHz) & EPYC 7601 (2.20 GHz) CPUs.
   ¤ Updated review of Intel MPI and Mellanox HPC-X performance analysis.
II. How these benchmarks have been deployed in the framework of procurement and acceptance testing, dealing with a variety of issues, e.g. (a) security patches, turbo mode etc. and (b) throughput testing.
III. An historical perspective of two of these codes – DL_POLY and GAMESS-UK – briefly overviewing the development and performance profile of both over the past two decades.
The Xeon Skylake Architecture
• The architecture of Skylake is very different from that of the prior "Haswell" and "Broadwell" Xeon chips.
• Three basic variants now cover what was formerly the Xeon E5 and Xeon E7 product lines, with Intel converging the Xeon E5 and E7 chips into a single socket.
• Product segmentation – Platinum, Gold, Silver & Bronze – with 51 variants of the SP chip.
• Also custom versions requested by hyperscale and OEM customers.
• All of these chips differ from each other in a number of ways, including number of cores, clock speed, L3 cache capacity, number and speed of UltraPath links between sockets, number of sockets supported, main memory capacity, width of the AVX vector units, etc.
Intel Xeon: Westmere – Skylake

Feature | Xeon 5600 (Westmere-EP) | Xeon E5-2600 (Sandy Bridge-EP) | Xeon E5-2600 v4 (Broadwell-EP) | Intel Xeon Scalable Processor (Skylake)
Cores / Threads | Up to 6 cores / 12 threads | Up to 8 cores / 16 threads | Up to 22 cores / 44 threads | Up to 28 cores / 56 threads
Last-level cache | 12 MB | Up to 20 MB | Up to 55 MB | Up to 38.5 MB (non-inclusive)
Max memory channels, speed / socket | 3 × DDR3 channels, 1333 | 4 × DDR3 channels, 1600 | 4 channels of up to 3 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2400 MHz | 6 channels of up to 2 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2666 MHz
New instructions | AES-NI | AVX 1.0, 8 DP Flops/clock | AVX 2.0, 16 DP Flops/clock | AVX-512, 32 DP Flops/clock
QPI / UPI speed (GT/s) | 1 QPI channel @ 6.4 GT/s | 2 QPI channels @ 8.0 GT/s | 2 QPI channels @ 9.6 GT/s | Up to 3 UPI @ 10.4 GT/s
PCIe lanes / controllers / speed (GT/s) | 36 lanes PCIe 2.0 on chipset | 40 lanes / socket, integrated PCIe 3.0 | 40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s) | 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
Server / Workstation TDP | Server / Workstation: 130 W | Up to 130 W Server; 150 W Workstation | 55 – 145 W | 70 – 205 W
AMD® EPYC™ 7000 Series – SKU Map and FLOP/cycle

SKU | 7601 | 7551 | 7501 | 7451 | 7401 | 7351 | 7301
Freq (base, GHz) | 2.2 | 2.0 | 2.0 | 2.3 | 2.0 | 2.4 | 2.2
Turbo boost, all cores active (GHz) | 2.7 | 2.6 | 2.6 | 2.9 | 2.8 | 2.9 | 2.7
Turbo boost, one core active (GHz) | 3.2 | 3.0 | 3.0 | 3.2 | 3.0 | 2.9 | 2.7
Cores / socket | 32 | 32 | 32 | 24 | 24 | 16 | 16
L3 cache size | 64 MB (all SKUs)
Memory channels | 8 (all SKUs)
Memory frequency | 2667 MT/s (all SKUs)
TDP (W) | 180 | 180 | 155/170 | 180 | 155/170 | 155/170 | 155/170
FLOP/cycle by architecture:

Architecture | Sandy Bridge | Haswell | Skylake | EPYC
ISA* | AVX | AVX2 | AVX-512 | AVX2
op/cycle | 2 (1 ADD, 1 MUL) | 4 (2 FMA) | 4 (2 FMA) | 4 (2 ADD, 2 MUL)
Vector size (DP = 64 bits) | 4 | 4 | 8 | 2
FLOP/cycle | 8 | 16 | 32 | 8
* Instruction Set Architecture
The AMD EPYC natively supports only 2 × 128-bit AVX, so there is a large gap to Intel SKL with its 2 × 512-bit FMAs. Thus the FP peak on AMD is 4 × lower than on Intel SKL.
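To make this concrete, the per-socket floating-point peak follows from cores × clock × FLOP/cycle. The figures below are a worked illustration using the Gold 6148 and EPYC 7601 base clocks quoted in this talk, ignoring turbo and any AVX frequency offsets:

Peak (per socket) = cores × clock × FLOP/cycle
SKL Gold 6148: 20 × 2.4 GHz × 32 = 1,536 GFLOP/s
EPYC 7601:     32 × 2.2 GHz × 8  ≈   563 GFLOP/s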
• Zen cores
  ¤ Private L1/L2 cache
• CCX
  ¤ 4 Zen cores (or fewer)
  ¤ 8 MB shared L3 cache
• Zeppelin
  ¤ 2 CCXs (or fewer)
  ¤ 2 DDR4 channels
  ¤ 2 × 16 PCIe lanes
• Naples
  ¤ 4 Zeppelin SoC dies fully connected by Infinity Fabric
  ¤ 4 NUMA nodes!
EPYC Architecture – Naples, Zeppelin & CCX
[Diagram: a Naples package of four Zeppelin dies joined by Infinity Fabric coherent links; each die carries 2 CCXs (4 Zen cores with private L2 plus an 8 MB shared L3 per CCX), 2 DDR4 channels and 2 × 16 PCIe lanes.]
• Delivers 32 cores / 64 threads, 16 MB L2 cache and 64 MB L3 cache per socket.
• The design also means there are four NUMA nodes per socket, or eight NUMA nodes in a dual-socket system, i.e. memory latency differs depending on whether the data sits in memory attached to the die that needs it or to another die on the fabric.
• The key difference from Intel's Skylake-SP architecture is that AMD needs to go off-die within the same socket, whereas Intel stays on a single piece of silicon.
Intel Skylake and AMD EPYC Cluster Systems

Cluster / Configuration
• "Hawk" – Supercomputing Wales cluster at Cardiff comprising 201 nodes, totalling 8,040 cores and 46.08 TB of memory.
  CPU: 2 × Intel Xeon Gold 6148 @ 2.40 GHz, 20 cores each; RAM: 192 GB (384 GB on high-memory and GPU nodes); GPU: 26 × NVIDIA P100 GPUs with 16 GB of RAM on 13 nodes.
• "Helios" – 32-node HPC Advisory Council cluster running SLURM: Supermicro SYS-6029U-TR4 / Foxconn Groot 1A42USF00-600-G nodes; dual-socket Intel Xeon Gold 6138 @ 2.00 GHz; Mellanox ConnectX-5 EDR 100 Gb/s InfiniBand/VPI adapters with Socket Direct; Mellanox Switch-IB 2 SB7800 36-port 100 Gb/s EDR InfiniBand switches; 192 GB DDR4 2666 MHz RDIMMs per node.
• 20-node Bull|ATOS AMD EPYC cluster running SLURM: AMD EPYC 7551, 32 cores / 64 threads per CPU; base clock 2.0 GHz, max boost clock 3.0 GHz; default TDP 180 W; Mellanox EDR 100 Gb/s.
• 32-node Dell|EMC PowerEdge R7425 AMD EPYC cluster running SLURM: AMD EPYC 7601, 32 cores / 64 threads per CPU; base clock 2.2 GHz, max boost clock 3.2 GHz; default TDP 180 W; Mellanox EDR 100 Gb/s.
Baseline Cluster Systems
Cluster | Configuration

Intel Sandy Bridge clusters
"Raven" | 128 × Bull|ATOS b510 EP nodes, each with 2 × Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR InfiniBand.
Supercomputing Wales | 384 × Fujitsu CX250 EP nodes, each with 2 × Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR InfiniBand.

Intel Broadwell clusters
Dell PE R730/R630, Broadwell E5-2697A v4 2.6 GHz 16C | HPC Advisory Council "Thor" cluster, Dell PowerEdge R730/R630 36-node cluster: 2 × Xeon E5-2697A v4 @ 2.6 GHz, 16 cores, 145 W TDP, 40 MB cache, 256 GB DDR4 2400 MHz; Interconnect: ConnectX-4 EDR.
ATOS Broadwell E5-2680 v4 2.4 GHz 14C | 32-node cluster; node configuration: 2 × Xeon E5-2680 v4 @ 2.4 GHz, 14 cores, 120 W TDP, 35 MB cache, 128 GB DDR4 2400 MHz; Interconnect: Mellanox ConnectX-4 EDR and Intel OPA.

IBM Power8
IBM Power8 S822LC with Mellanox EDR | 20 cores, 3.49 GHz with performance CPU governor; 256 GB memory; 1 × IB (EDR) port; 2 × NVIDIA K80 GPUs; IBM PE (Parallel Environment); Operating system: RHEL 7.2 LE; Compilers: xlC 13.1.3, xlf 15.1.3, gcc 4.8.5 (Red Hat), gcc 5.2.1 (from IBM Advance Toolchain 9.0).
The Performance Benchmarks
• The test suite comprises both synthetics and end-user applications. Synthetics include the HPCC (http://icl.cs.utk.edu/hpcc/) and IMB (http://software.intel.com/en-us/articles/intel-mpi-benchmarks) benchmarks, IOR and STREAM. A representative IMB launch is sketched below.
• A variety of "open source" and commercial end-user application codes:
  – GROMACS, LAMMPS, AMBER, NAMD, DL_POLY Classic & DL_POLY 4 (molecular dynamics)
  – Quantum Espresso, SIESTA, CP2K, ONETEP, CASTEP and VASP (ab initio materials properties)
  – NWChem, GAMESS-US and GAMESS-UK (molecular electronic structure)
• These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed, e.g. memory bandwidth and latency, node floating-point performance, interconnect performance (both latency and bandwidth) and sustained I/O performance.
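A minimal sketch of how the IMB runs reported here can be launched (Intel MPI syntax; the rank counts and -ppn values are illustrative):

# Inter-node PingPong: one rank per node across two nodes
mpirun -np 2 -ppn 1 IMB-MPI1 PingPong

# Alltoallv on 128 ranks (as in the collectives comparison below)
mpirun -np 128 -ppn 32 IMB-MPI1 Alltoallv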
EPYC – Compiler and Run-time Options

Compilation: Intel Compilers 2018, Intel MPI 2017 Update 3, FFTW-3.3.5
  Intel SKL:  -O3 -xCORE-AVX512
  AMD EPYC:   -O3 -xAVX2
  AMD EPYC:   -axCORE-AVX-I

# Preload the amd-cputype library to navigate the Intel "Genuine CPU" test
module use /opt/amd/modulefiles
module load AMD/amd-cputype/1.0
export LD_PRELOAD=$AMD_CPUTYPE_LIB
export OMP_PROC_BIND=true
# export KMP_AFFINITY=granularity=fine
export I_MPI_DEBUG=5
export MKL_DEBUG_CPU_TYPE=5

STREAM (Atos clusters):
module load AMD/amd-cputype/1.0
icc -o stream.x stream.c -DSTATIC -Ofast -xCORE-AVX2 -qopenmp \
    -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=large -shared-intel
export OMP_NUM_THREADS=16
export OMP_PROC_BIND=true
export OMP_PLACES="{0:4:1}:16:4"   # 1 thread per CCX
export OMP_DISPLAY_ENV=true

STREAM (Dell|EMC EPYC):
export OMP_NUM_THREADS=32
export OMP_PROC_BIND=true
export OMP_DISPLAY_ENV=true
export OMP_PLACES="{0},{16},{8},{24},{2},{18},{10},{26},{4},{20},{12},{28},{6},{22},{14},{30},{1},{17},{9},{25},{3},{19},{11},{27},{5},{21},{13},{29},{7},{23},{15},{31}"
Memory B/W – STREAM TRIAD performance [Rate (MB/s)], full node (OMP_NUM_THREADS set per node, KMP_AFFINITY=physical):

System | TRIAD (MB/s)
Bull b510 "Raven" SNB e5-2670 / 2.6 GHz | 74,309
ClusterVision IVB e5-2650v2 2.6 GHz | 93,486
Dell R730 HSW e5-2697v3 2.6 GHz (T) | 118,605
Dell HSW e5-2660v3 2.6 GHz (T) | 114,367
Thor BDW e5-2697A v4 2.6 GHz (T) | 132,035
ATOS BDW e5-2680v4 2.4 GHz (T) | 128,083
Mellanox SKL Gold 6138 2.0 GHz (T) | 169,830
Dell SKL Gold 6142 2.6 GHz (T) | 185,863
"Hawk" Atos SKL Gold 6148 2.4 GHz | 196,721
IBM Power8 S822LC 2.92 GHz | 184,087
AMD EPYC 7551 2.0 GHz | 303,797
AMD EPYC 7601 2.2 GHz | 279,640
Memory B/W – STREAM TRIAD performance per core [Rate (MB/s)]:

System | TRIAD per core (MB/s)
Bull b510 "Raven" SNB e5-2670 / 2.6 GHz | 4,644
ClusterVision IVB e5-2650v2 2.6 GHz | 5,843
Dell R730 HSW e5-2697v3 2.6 GHz (T) | 4,236
Dell HSW e5-2660v3 2.6 GHz (T) | 5,718
Thor BDW e5-2697A v4 2.6 GHz (T) | 4,126
ATOS BDW e5-2680v4 2.4 GHz (T) | 4,574
Mellanox SKL Gold 6138 2.0 GHz (T) | 4,246
Dell SKL Gold 6142 2.6 GHz (T) | 5,808
"Hawk" Atos SKL Gold 6148 2.4 GHz | 4,918
IBM Power8 S822LC 2.92 GHz | 9,204
AMD EPYC 7551 2.0 GHz | 4,747
AMD EPYC 7601 2.2 GHz | 4,369
MPI Performance – PingPong (IMB benchmark, 1 PE per node)
[Chart: latency and bandwidth (MB/s) versus message length (bytes) for the ATOS AMD EPYC 7601 EDR, Intel SKL Gold 6148 OPA, Dell SKL Gold 6150 EDR, IBM Power8 S822LC EDR, Thor BDW e5-2697A v4 EDR, Intel BDW e5-2690v4 OPA, Dell e5-2660v3 OPA, Bull HSW E5-2680v3 Connect-IB, Dell R720 e5-2680v2 Connect-IB, Azure A9 (e5-2670) IB RDMA and Merlin Xeon E5472 QC IB (MVAPICH2 1.4) systems.]
Note: export I_MPI_DAPL_TRANSLATION_CACHE=1 enables the memory-resident cache feature in DAPL.
MPI Collectives – Alltoallv (128 PEs), IMB benchmark
[Chart: measured time (µsec) versus message length (bytes) across the SNB, BDW, SKL and EPYC clusters listed above, over EDR and OPA interconnects.]
EPYC performance with Intel MPI is ~4-6 × worse than that with the SKL processors.
The time-consuming messages are those called by Alltoall & Alltoallv (IPM profile).

Application Performance on Multi-core Processors
I.1 The Codes: DL_POLY, GROMACS, NAMD, LAMMPS, GAMESS, NWChem, GAMESS-UK, ONETEP, VASP, SIESTA, CASTEP, Quantum Espresso, CP2K – on a variety of HPC systems.
Allinea (ARM) Performance Reports
Allinea Performance Reports provides a mechanism to characterise and understand the performance of HPC application runs through a single-page HTML report.
• Based on Allinea MAP's adaptive sampling technology, which keeps the data volumes collected and the application overhead low.
• Modest application slowdown (ca. 5%) even with thousands of MPI processes.
• Runs on existing codes: a single command is added to the execution script.
• If submitted through a batch queuing system, the submission script is modified to load the Allinea module and add the 'perf-report' command in front of the required mpiexec command, e.g.
  perf-report mpiexec -n 4 $code
• The report summary characterises how the application's wallclock time was spent, broken down into CPU, MPI and I/O.
• All examples updated on the Broadwell Mellanox cluster (E5-2697A v4).
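A minimal sketch of how this looks in a SLURM submission script (the module name and resource values are illustrative, not taken from the clusters above):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00

# Load the site-specific Performance Reports module
module load allinea-reports

# Prefix the usual MPI launch line; a single-page HTML/text report
# is written alongside the job output when the run completes.
perf-report mpirun -np $SLURM_NTASKS $code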
Molecular Simulation I: DL_POLY
Molecular dynamics codes: AMBER, DL_POLY, CHARMM, NAMD, LAMMPS, GROMACS etc.
DL_POLY – developed as the CCP5 parallel MD code by W. Smith, T.R. Forester and I. Todorov
• UK CCP5 + international user community
• DL_POLY_Classic (replicated data) and DL_POLY_3 & _4 (distributed data – domain decomposition)
• Areas of application: liquids, solutions, spectroscopy, ionic solids, molecular crystals, polymers, glasses, membranes, proteins, metals, solid and liquid interfaces, catalysis, clathrates, liquid crystals, biopolymers, polymer electrolytes.
DL_POLY Classic – NaCl Simulation (27,000 atoms; 500 time steps)
[Chart: performance at 32-256 PEs, relative to the Fujitsu HTC X5650 2.67 GHz 6-C (16 PEs), for the Fujitsu CX250 SNB e5-2670 QDR, Intel BDW e5-2690v4 OPA, "Helios" SKL Gold 6138 EDR, "Hawk" Atos SKL Gold 6148 EDR, Dell SKL Gold 6150 EDR, ATOS AMD EPYC 7551 EDR and Dell|EMC AMD EPYC 7601 EDR clusters.]
DL_POLY 4 – Distributed Data (Domain Decomposition)
W. Smith and I. Todorov
http://www.scd.stfc.ac.uk//research/app/ccg/software/DL_POLY/44516.aspx
• Distribute atoms and forces across the nodes
  ¤ More memory efficient; can address much larger cases (10^5-10^7 atoms)
• SHAKE and short-range forces require only neighbour communication
  ¤ Communications scale linearly with the number of nodes
• Coulombic energy remains global
  ¤ Adopts the Smoothed Particle Mesh Ewald (SPME) scheme, which includes a Fourier transform of the smoothed charge density (reciprocal-space grid typically 64×64×64 to 128×128×128)
Benchmarks (a representative launch is sketched below):
1. NaCl simulation: 216,000 ions, 200 time steps, cutoff = 12 Å
2. Gramicidin in water (rigid bonds + SHAKE): 792,960 ions, 50 time steps
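A minimal sketch of launching one of these DL_POLY 4 benchmarks, assuming the standard DLPOLY.Z executable with the usual CONTROL, CONFIG and FIELD input files in the working directory (the rank count is illustrative):

# 256-rank run; DL_POLY 4 reads CONTROL/CONFIG/FIELD from the current
# directory and writes OUTPUT, STATIS and REVCON on completion
mpirun -np 256 ./DLPOLY.Z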
DL_POLY 4 – Gramicidin Simulation (792,960 atoms; 50 time steps)
[Chart: performance at 64-256 PEs, relative to the Fujitsu CX250 e5-2670 2.6 GHz 8-C (32 PEs), for the Fujitsu CX250 SNB QDR, Bull|ATOS BDW e5-2680v4 OPA, Thor Dell|EMC e5-2697A v4 EDR, "Helios" SKL Gold 6138 EDR, Intel SKL Platinum 8170 OPA, Dell|EMC SKL Gold 6130 OPA, Intel SKL Gold 6148 OPA, "Hawk" Atos SKL Gold 6148 EDR, Dell|EMC SKL Gold 6142 EDR and Dell|EMC SKL Gold 6150 EDR clusters.]
The SKL 6142 2.6 GHz is ~1.06 × the e5-2697v4 2.6 GHz.
DL_POLY 4 – Gramicidin Simulation – EPYC (792,960 atoms; 50 time steps)
[Chart: performance at 64-256 PEs, relative to the Fujitsu CX250 e5-2670 2.6 GHz 8-C (32 PEs), with the ATOS AMD EPYC 7551 EDR and Dell|EMC AMD EPYC 7601 EDR clusters added to the SNB, BDW and SKL systems above.]
DL_POLY 4 – Gramicidin Simulation Performance Report (Smoothed Particle Mesh Ewald scheme)
[Charts: total wallclock time breakdown (CPU % vs MPI %) and CPU time breakdown (scalar numeric ops, vector numeric ops, memory accesses, %) at 32, 64, 128 and 256 PEs.]
"DL_POLY_4 and Xeon Phi: Lessons Learnt", Alin Marin Elena, Christian Lalanne, Victor Gamayunov, Gilles Civario, Michael Lysaght and Ilian Todorov.
Molecular Simulation II: GROMACS
GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics package designed for simulations of proteins, lipids and nucleic acids [University of Groningen].
• Single and double precision
• Efficient GPU implementations
Versions under test:
• Version 4.6.1 – 5 March 2013
• Version 5.0.7 – 14 October 2015
• Version 2016.3 – 14 March 2017
• Version 2018.2 – 14 June 2018 (optimised for "Hawk" by Ade Fewings)
Berk Hess et al., "GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation", Journal of Chemical Theory and Computation 4 (3): 435-447.
http://manual.gromacs.org/documentation/
GROMACS Benchmark Cases
Ion channel system
• The 142k-particle ion channel system is the membrane protein GluCl, a pentameric chloride channel embedded in a DOPC membrane and solvated in TIP3P water, using the Amber ff99SB-ILDN force field. This system is a challenging parallelization case due to its small size, but is one of the most wanted target sizes for biomolecular simulations.
Lignocellulose
• GROMACS Test Case B from the UEA Benchmark Suite: a model of cellulose and lignocellulosic biomass in an aqueous solution. This system of 3.3M atoms is inhomogeneous, and uses reaction-field electrostatics instead of PME and therefore should scale well.
A representative mdrun launch for these cases is sketched below.
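A minimal sketch of the MPI launch for these cases (GROMACS 2018.x MPI build; the input file name and thread settings are illustrative):

# 128-rank single-precision run of the ion channel case, one OpenMP
# thread per rank; -resethway/-noconfout give cleaner benchmark timings
mpirun -np 128 gmx_mpi mdrun -s ion_channel.tpr -deffnm ion_channel \
       -ntomp 1 -maxh 0.5 -resethway -noconfout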
GROMACS – Ion-channel Performance Report
[Charts: total wallclock time breakdown (CPU % vs MPI %) and CPU time breakdown (scalar numeric ops, vector numeric ops, memory accesses, %) at 32, 64, 128 and 256 PEs.]
GROMACS – Ion Channel Simulation (142k-particle system, single precision)
[Chart: performance (ns/day) at 64, 128 and 256 PEs on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz, EDR, plus dual-P100 GPU nodes), comparing GROMACS 4.6.1, 5.0, 2016.3 (single-precision AVX) and 2018.2; throughput ranges from ~45 ns/day at 64 PEs to ~192 ns/day at 256 PEs across the versions.]
GROMACS – Ion Channel Simulation: Impact of Single Precision (GROMACS 5.0.7, 142k-particle system)
[Chart: performance (ns/day) at 64-256 PEs for the Fujitsu CX250 SNB QDR, Thor Dell|EMC BDW e5-2697A v4 EDR (HPC-X), Dell|EMC SKL Gold 6130 OPA, Intel SKL Gold 6148 OPA, Dell|EMC SKL Gold 6142 EDR and Dell|EMC SKL Gold 6150 EDR clusters in double precision, and for the "Helios" SKL Gold 6138 EDR, Intel SKL Gold 6148 OPA and "Hawk" Atos SKL Gold 6148 EDR clusters in single precision {S}.]
GROMACS – GPU Performance: Ion Channel Simulation (GROMACS 2018.2, 142k-particle system)
[Chart: relative performance on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz, EDR) for CPU-only runs at 64-320 PEs and for dual-P100 GPU nodes (N=1, 2 GPUs up to N=6, 12 GPUs).]
GROMACS – Lignocellulose Simulation (3,316,463 atoms, reaction-field electrostatics instead of PME; single precision)
[Chart: performance (ns/day) at 64, 128 and 256 PEs on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz, EDR, plus dual-P100 GPU nodes), comparing GROMACS 4.6.1, 5.0, 2016.3 and 2018.2; throughput reaches ~13.5 ns/day at 256 PEs.]
GROMACS – Lignocellulose Simulation: Impact of Single Precision (GROMACS 5.0.7, 3,316,463 atoms, reaction-field electrostatics)
[Chart: performance (ns/day) at 64-256 PEs for the SNB, BDW and SKL clusters in double precision, and for the "Helios" SKL Gold 6138 EDR, Intel SKL Gold 6148 OPA and "Hawk" Atos SKL Gold 6148 EDR clusters in single precision {S}.]
GROMACS – GPU Performance: Lignocellulose Simulation (GROMACS 2018.2, 3,316,463 atoms, reaction-field electrostatics)
[Chart: relative performance on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz, EDR) for CPU-only runs at 64-320 PEs and for dual-P100 GPU nodes (N=1, 2 GPUs up to N=6, 12 GPUs).]
Molecular Simulation III: The AMBER Benchmarks
• AMBER 16/1 is used, specifically PMEMD and GPU-accelerated PMEMD.
• M01 benchmark: Major Urinary Protein (MUP) + IBM ligand (21,736 atoms)
• M06 benchmark: cluster of six MUPs (134,013 atoms)
• M27 benchmark: cluster of 27 MUPs (657,585 atoms)
• M45 benchmark: cluster of 45 MUPs (932,751 atoms)
All test cases run 30,000 steps × 2 fs = 60 ps of simulation time, with periodic boundary conditions, constant pressure and T = 300 K. Position data are written every 500 steps.
R. Salomon-Ferrer, D.A. Case, R.C. Walker, "An overview of the Amber biomolecular simulation package", WIREs Comput. Mol. Sci. 3, 198-210 (2013).
D.A. Case, T.E. Cheatham III, T. Darden, H. Gohlke, R. Luo, K.M. Merz Jr., A. Onufriev, C. Simmerling, B. Wang and R. Woods, "The Amber biomolecular simulation programs", J. Comput. Chem. 26, 1668-1688 (2005).
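A representative launch for the CPU and GPU PMEMD runs (the input and topology file names are illustrative):

# MPI (CPU) run of the M45 benchmark on 256 ranks
mpirun -np 256 pmemd.MPI -O -i mdin -p m45.prmtop -c m45.inpcrd -o m45.out

# GPU-accelerated run on a dual-P100 node
mpirun -np 2 pmemd.cuda.MPI -O -i mdin -p m45.prmtop -c m45.inpcrd -o m45_gpu.out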
AMBER – SKL vs. SNB: M06, M27 and M45
[Chart: relative performance of the SKL 6148 2.4 GHz / EDR cluster versus the SNB e5-2670 2.6 GHz / QDR cluster at 64-320 PEs for the M06, M27 and M45 benchmarks; the speedup factors lie between roughly 1.1 and 1.7.]
AMBER – GPU Performance: M45 Simulation (cluster of 45 Major Urinary Proteins + IBM ligand, 932,751 atoms)
[Chart: performance relative to 64 "Raven" SNB e5-2670 PEs for CPU runs on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz, EDR) at 64-320 PEs, reaching ~3.5 ×, and for single-node dual-P100 GPU runs, which reach ~2.7-4.2 ×.]
GAMESS-UK – Moving to Distributed Data: the MPI/ScaLAPACK Implementation of the GAMESS-UK SCF/DFT Module
• Pragmatic approach to the replicated-data constraints:
  ¤ MPI-based tools (such as ScaLAPACK) are used in place of Global Arrays
  ¤ All data structures except those required for the Fock matrix build (F, P) are fully distributed
• The partially distributed model was chosen because, in the absence of efficient one-sided communications, it is difficult to load balance a distributed Fock matrix build efficiently.
• Obvious drawback - some large replicated data structures are required.
  ¤ These are kept to a minimum: for a closed-shell HF or DFT calculation only two replicated matrices are required, 1 × Fock and 1 × density (doubled for UHF).
"The GAMESS-UK electronic structure package: algorithms, developments and applications", M.F. Guest, I.J. Bush, H.J.J. van Dam, P. Sherwood, J.M.H. Thomas, J.H. van Lenthe, R.W.A. Havenith, J. Kendrick, Mol. Phys. 103, No. 6-8, 2005, 719-747.
GAMESS-UK Performance – Zeolite Y Cluster (SioSi7, DZVP (Si,O), DZVP2 (H), B3LYP, 3975 GTOs)
[Chart: performance at 128 and 256 PEs, relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs), across the SNB, Haswell, Broadwell, SKL (Gold 6130/6138/6142/6148/6150) and IBM Power8 clusters with QDR, Connect-IB, EDR and OPA interconnects.]
The SKL 6142 2.6 GHz is ~1.05 × the e5-2697v4 2.6 GHz.
GAMESS-UK MPI/ScaLAPACK Code – EPYC Performance (Zeolite Y cluster, SioSi7, DZVP (Si,O), DZVP2 (H), B3LYP, 3975 GTOs)
[Chart: performance at 128 and 256 PEs, relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs), with the ATOS AMD EPYC 7551 EDR and Dell|EMC AMD EPYC 7601 EDR clusters added to the Intel and IBM Power8 systems above.]
GAMESS-UK.MPI – DFT Performance Report (cyclosporin, 6-31G** basis, 1855 GTOs, DFT B3LYP)
[Charts: total wallclock time breakdown (CPU % vs MPI %) and CPU time breakdown (scalar numeric ops, vector numeric ops, memory accesses, %) at 32, 64, 128 and 256 PEs.]
Computational Materials – Advanced Materials Software
• VASP – performs ab initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane-wave basis set.
• Quantum Espresso – an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale. It is based on density-functional theory (DFT), plane waves and pseudopotentials.
• SIESTA – an O(N) DFT code for electronic-structure calculations and ab initio molecular dynamics simulations of molecules and solids. It uses norm-conserving pseudopotentials and a linear combination of numerical atomic orbitals (LCAO) basis set.
• CP2K – a program to perform atomistic and molecular simulations of solid-state, liquid, molecular and biological systems. It provides a framework for different methods, e.g. DFT using a mixed Gaussian and plane-waves approach (GPW), and classical pair and many-body potentials.
• ONETEP (Order-N Electronic Total Energy Package) – a linear-scaling code for quantum-mechanical calculations based on DFT.
Quantum Espresso
Quantum Espresso is an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale, based on density-functional theory, plane waves and pseudopotentials. Transition from v5.2 to v6.1.
Capabilities: ground-state calculations; structural optimization; transition states and minimum energy paths; ab initio molecular dynamics; response properties (DFPT); spectroscopic properties; quantum transport.

Benchmark details:
DEISA AU112 | Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions: (180, 90, 288)
PRACE GRIR443 | Carbon-Iridium complex (C200Ir243), 2,233,063 G-vectors, 8 k-points, FFT dimensions: (180, 180, 192)
Quantum Espresso – Au112 (Version 5.2)
[Chart: performance at 32-320 PEs, relative to the Fujitsu e5-2670 2.6 GHz 8-C (32 PEs), for the Fujitsu CX250 SNB QDR, Bull|ATOS BDW e5-2680v4 OPA, Thor Dell|EMC e5-2697A v4 (EDR IMPI DAPL and OPA), Dell|EMC SKL Gold 6130 OPA, Dell|EMC SKL Gold 6142 EDR and Intel SKL Gold 6148 OPA clusters; speedups reach ~8.8 at the highest core counts.]
Quantum Espresso – Au112 Performance Report (Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions: (180, 90, 288))
[Charts: total wallclock time breakdown (CPU % vs MPI %) and CPU time breakdown (scalar numeric ops, vector numeric ops, memory accesses, %) at 32, 64, 128 and 256 PEs.]
Parallelism in Quantum Espresso
• Quantum ESPRESSO implements several MPI parallelization levels, with processors organized in a hierarchy of groups identified by different MPI communicator levels. The group hierarchy is:
• Images: processors are divided into different "images", each corresponding to a different SCF or linear-response calculation, loosely coupled to the others.
• Pools and bands: each image can be sub-partitioned into "pools", each taking care of a group of k-points. Each pool is sub-partitioned into "band groups", each taking care of a group of Kohn-Sham orbitals.
• PW parallelisation: orbitals in the PW basis set, as well as charges and density in either reciprocal or real space, are distributed across processors. All linear-algebra operations on arrays of PW / real-space grids are automatically and effectively parallelized.
• Tasks: allows good parallelization of the 3D FFT when the number of CPUs exceeds the number of FFT planes; FFTs on Kohn-Sham states are redistributed to "task groups".
Parallelism in Quantum Espresso
• Linear-algebra group: a further level, independent of the PW or k-point parallelization, is the parallelization of subspace diagonalization / iterative orthonormalization.
• Communications: images and pools are loosely coupled (CPUs communicate between different images and pools only once in a while), whereas CPUs within each pool are tightly coupled and communications are significant.
• Choosing parameters: the number of CPUs in each group is controlled by the command-line switches -nimage, -npools, -nband, -ntg, and -ndiag or -northo. Thus for Au112 the following command line is used:
  mpirun $code -inp ausurf.in -npool $NPOOL -ntg $NT -ndiag $ND
• This executes an energy calculation on $NP processors, with the k-points distributed across $NPOOL pools of $NP/$NPOOL processors each, the 3D FFT performed using $NT task groups, and the diagonalization of the subspace Hamiltonian distributed over a square grid of $ND processors.
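A concrete instance of this launch for the Au112 case (the values are illustrative, corresponding to a 128-rank run with two k-point pools):

# 128 MPI ranks: 2 pools of 64 ranks each, 4 FFT task groups,
# subspace diagonalisation on a square 8 x 8 = 64-rank grid
mpirun -np 128 pw.x -inp ausurf.in -npool 2 -ntg 4 -ndiag 64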
Impact of npool – Au112 (Version 5.2)
[Chart: relative performance at 32-320 PEs on "Hawk" (SKL Gold 6148, EDR) and "Raven" (SNB e5-2670, QDR), comparing runs with NPOOL=1 against runs with NPOOL=2 and ND=nP.]
Impact of npool – GRIR443
[Chart: relative performance at 32-320 PEs on "Hawk" (SKL Gold 6148, EDR) and "Raven" (SNB e5-2670, QDR), comparing runs with NPOOL=1 against runs with NPOOL=2 and ND=nP.]
Quantum Espresso – GRIR443
[Chart: performance at 96-160 PEs, relative to the Fujitsu e5-2670 2.6 GHz 8-C (96 PEs), for the Fujitsu CX250 SNB QDR, Bull|ATOS BDW e5-2680v4 OPA, Thor Dell|EMC e5-2697A v4 EDR (IMPI DAPL), Dell|EMC SKL Gold 6130 OPA, Dell|EMC SKL Gold 6142 EDR, Dell|EMC SKL Gold 6150 EDR and ATOS AMD EPYC 7601 EDR clusters.]
VASP – Vienna Ab-initio Simulation Package
VASP (5.4.4) performs ab initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane-wave basis set.

Zeolite benchmark
• Zeolite with the MFI structure unit cell, running a single-point calculation with a plane-wave cut-off of 400 eV using the PBE functional.
• 2 k-points; maximum number of plane waves: 96,834
• FFT grid: NGX=65, NGY=65, NGZ=43, giving a total of 181,675 points

Pd-O benchmark
• Pd-O complex (Pd75O12), 5×4 3-layer supercell, running a single-point calculation with a plane-wave cut-off of 400 eV. Uses the RMM-DIIS algorithm for the SCF and is calculated in real space.
• 10 k-points; maximum number of plane waves: 34,470
• FFT grid: NGX=31, NGY=49, NGZ=45, giving a total of 68,355 points

Benchmark details:
MFI Zeolite | Zeolite (Si96O192), 2 k-points, FFT grid: (65, 65, 43); 181,675 points
Pd-O complex | Palladium-Oxygen complex (Pd75O12), 10 k-points, FFT grid: (31, 49, 45); 68,355 points
VASP 5.4.4 – Pd-O Benchmark (Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points)
[Chart: performance at 64-256 PEs, relative to the Fujitsu CX250 SNB e5-2670 2.6 GHz (32 PEs), for the SNB, BDW (OPA and EDR/HPC-X), "Helios" SKL Gold 6138 (default and HPCX 2.3.0), Dell|EMC SKL Gold 6130/6142/6150, Intel SKL Gold 6148 OPA and "Hawk" Atos SKL Gold 6148 EDR clusters.]
VASP 5.4.4 – Pd-O Benchmark: Parallelisation over k-points (Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points)
[Chart: performance at 64-256 PEs, relative to the Fujitsu CX250 SNB e5-2670 2.6 GHz (32 PEs), with KPAR=2 runs on "Helios" (SKL Gold 6138, EDR) and "Hawk" (SKL Gold 6148, EDR) shown alongside the default settings and the Dell|EMC AMD EPYC 7601 EDR (KPAR=2) result.]
NPEs | KPAR | NPAR
64   | 2    | 2
128  | 2    | 4
256  | 2    | 8
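The k-point parallelisation above is set through standard INCAR tags; a minimal sketch for the 128-core case in the table (values follow the table, comments are explanatory):

KPAR = 2    ! split the ranks into 2 groups, each handling half the k-points
NPAR = 4    ! band parallelisation within each k-point group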
VASP – Pd-O Benchmark Performance Report (Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points)
[Charts: total wallclock time breakdown (CPU % vs MPI %) and CPU time breakdown (scalar numeric ops, vector numeric ops, memory accesses, %) at 32, 64, 128 and 256 PEs.]
VASP 5.4.4 – Zeolite Benchmark (Zeolite (Si96O192) with the MFI structure unit cell; single-point calculation, 400 eV plane-wave cut-off, PBE functional; maximum number of plane waves: 96,834; 2 k-points; FFT grid: (65, 65, 43); 181,675 points)
[Chart: performance at 64-256 PEs, relative to the Fujitsu CX250 SNB e5-2670 2.6 GHz (64 PEs), for the SNB, BDW (EDR/HPC-X and OPA), "Helios" SKL Gold 6138, Dell|EMC SKL Gold 6130/6142/6150, Intel SKL Gold 6148 OPA and "Hawk" Atos SKL Gold 6148 EDR clusters.]
VASP 5.4.4 – Zeolite Benchmark: Parallelisation over k-points
[Chart: as above, with KPAR=2 runs on "Helios" (SKL Gold 6138, EDR) and "Hawk" (SKL Gold 6148, EDR) added; performance at 64-256 PEs, relative to the Fujitsu CX250 SNB e5-2670 2.6 GHz (64 PEs).]
VASP 5.4.1 – Zeolite Benchmark on EPYC (Zeolite (Si96O192); single-point calculation, 400 eV plane-wave cut-off, PBE functional; 96,834 plane waves, 2 k-points, FFT grid: (65, 65, 43); 181,675 points)
[Chart: performance at 64-128 PEs, relative to the Fujitsu CX250 SNB e5-2670 2.6 GHz (64 PEs), for the Dell|EMC SKL Gold 6130 OPA, Intel SKL Gold 6148 OPA, ATOS AMD EPYC 7601 EDR and ATOS AMD EPYC 7601 EDR (16 cores/socket) configurations; the SKL systems reach ~2.6-2.7 at 128 PEs compared with ~1.6-2.1 for the EPYC runs.]
Application Performance on Multi-core Processors
I.2 Selecting Fabrics and Optimising Performance: Intel MPI and Mellanox HPC-X
Selecting Fabrics – MPI Optimisation
• The Intel MPI Library can select a communication fabric at runtime without the application having to be recompiled. By default it automatically selects the most appropriate fabric based on both the software and hardware configuration, i.e. in most cases you do not have to select a fabric manually.
• Specifying a particular fabric can nevertheless boost performance. Fabrics can be specified for both intra-node and inter-node communications; the available fabrics are listed below.
• For inter-node communication, Intel MPI uses the first available fabric from the default fabric list. The list is defined automatically for each hardware and software configuration (see I_MPI_FABRICS_LIST).
• For most configurations this list is: dapl, ofa, tcp, tmi, ofi.

Fabric | Network hardware and software used
shm | Shared memory (for intra-node communication only).
dapl | Direct Access Programming Library (DAPL) fabrics, such as InfiniBand (IB) and iWarp (through DAPL).
ofa | OpenFabrics Alliance (OFA) fabrics, e.g. InfiniBand (through OFED verbs).
tcp | TCP/IP network fabrics, such as Ethernet and InfiniBand (through IPoIB).
tmi | Tag Matching Interface (TMI) fabrics, such as Intel True Scale Fabric, Intel Omni-Path Architecture and Myrinet (through TMI).
ofi | OpenFabrics Interfaces (OFI)-capable fabrics, such as Intel True Scale Fabric, Intel Omni-Path Architecture, IB and Ethernet (through the OFI API).
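For example, the fabric choice can be pinned explicitly through environment variables before the launch (a sketch; the best setting depends on the cluster):

# Shared memory within a node, DAPL (InfiniBand) between nodes
export I_MPI_FABRICS=shm:dapl

# Or restrict the automatic selection order
export I_MPI_FABRICS_LIST=dapl,ofa,tcp

mpirun -np 256 ./application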
Mellanox HPC-X Toolkit
The Mellanox HPC-X Toolkit provides an MPI, SHMEM and UPC software suite for HPC environments. It delivers "enhancements to significantly increase the scalability & performance of message communications in the network". It includes:
  ¤ Complete MPI, SHMEM and UPC packages, including the Mellanox MXM and FCA acceleration engines
  ¤ Offload of collective communications from the MPI process onto the Mellanox interconnect hardware
  ¤ Maximised application performance with the underlying hardware architecture; optimized for Mellanox InfiniBand and VPI interconnects
  ¤ Increased application scalability and resource efficiency
  ¤ Multiple transport support, including RC, DC and UD
  ¤ Intra-node shared-memory communication
• The performance comparison was conducted on the Mellanox SKL 6138 / 2.00 GHz EDR-based "Helios" cluster.
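Switching a run onto HPC-X typically means loading the HPC-X environment and launching through its Open MPI with the UCX transport and accelerated collectives (a sketch; the module name and MCA parameters are indicative and site-specific):

module load hpcx        # provides Open MPI built against UCX and HCOLL

# UCX for point-to-point transport (RC/DC/UD), HCOLL/FCA for collectives
mpirun -np 512 -mca pml ucx -mca coll_hcoll_enable 1 ./application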
Application Performance & MPI Libraries
A performance comparison exercise was undertaken to capture the impact of the latest releases of Intel MPI and Mellanox's HPC-X.
  ¤ In 2017, on the Mellanox E5-2697A v4 EDR-based "Thor" cluster, Intel MPI and Mellanox HPC-X were compared for the following applications (and associated data sets):
    – DL_POLY 4 (NaCl and Gramicidin) & GROMACS (ion channel and lignocellulose)
    – VASP (Pd-O complex & Zeolite system)
    – Quantum ESPRESSO (Au112 and GRIR443)
    – OpenFOAM (Cavity 3D-3M)
  ¤ We simply compared the time to solution for each application, i.e. T(HPC-X) / T(Intel MPI), across multiple core counts.
Application Performance & MPI Libraries
• Optimum performance was found to be a function of both application and core count.
  ¤ With the materials-based codes and OpenFOAM, and at high core count (> 512 cores), HPC-X exhibited a clear performance advantage over Intel MPI.
  ¤ This was not the case for the classical MD codes, where Intel MPI showed a distinct advantage at all but the highest core counts.
• The exercise was repeated on the "Helios" partition of the Skylake cluster using the latest releases of HPC-X, v2.2.0 and 2.3.0-pre.
http://www.mellanox.com/related-docs/prod_acceleration_software/PB_HPC-X.pdf
DL_POLY 4 – Intel MPI vs. HPC-X – December 2017
[Chart: % Intel MPI performance vs. HPC-X (85%-120%) against processor core count (up to 1,024 cores) for the DL_POLY 4 NaCl and Gramicidin test cases.]
Intel MPI is seen to outperform HPC-X for the DL_POLY 4 NaCl test case at all core counts, and at lower core counts for Gramicidin.
DL_POLY 4 – Intel MPI vs. HPC-X – December 2018
[Chart: % Intel MPI performance vs. HPC-X (85%-120%) against processor core count (up to 1,024 cores) for the DL_POLY 4 NaCl and Gramicidin test cases.]
The advantage of Intel MPI is now reduced at most core counts for both NaCl and Gramicidin.
GROMACS – Intel MPI vs. HPC-X – December 2017
[Chart: % Intel MPI performance vs. HPC-X (95%-125%) against processor core count (up to 1,024 cores) for the ion channel and lignocellulose cases.]
At no point does the HPC-X run of GROMACS outperform that using Intel MPI.
GROMACS – Intel MPI vs. HPC-X – December 2018
[Chart: % Intel MPI performance vs. HPC-X (95%-125%) against processor core count (up to 1,024 cores) for the ion channel and lignocellulose cases.]
Similar findings to DL_POLY, with the advantage of Intel MPI over the HPC-X run of GROMACS significantly reduced compared to the 2017 findings.
VASP 5.4.1 – Intel MPI vs. HPC-X – December 2017
[Chart: % Intel MPI performance vs. HPC-X (60%-120%) against processor core count (up to 512 cores) for the Palladium complex and Zeolite cluster cases.]
Significantly different to the classical MD codes: HPC-X is seen to outperform Intel MPI for the Zeolite cluster at all core counts, and at larger core counts for the Palladium complex.
VASP 5.4.4 – Intel MPI vs. HPC-X – December 2018
[Chart: % Intel MPI performance vs. HPC-X (60%-120%) against processor core count (up to 512 cores) for the Palladium complex and Zeolite cluster cases.]
Significantly different to the 2017 findings: there is little difference between Intel MPI and HPC-X at larger core counts, with Intel MPI superior at lower core counts.
Quantum Espresso v5.2 – Intel MPI vs. HPC-X – December 2017
[Chart: % Intel MPI performance vs. HPC-X (65%-125%) against processor core count (up to 768 cores) for the GRIR443 and Au112 cases.]
Significantly different to the classical MD codes: as with VASP, HPC-X is seen to outperform Intel MPI at the larger core counts.
Quantum Espresso v6.1 – Intel MPI vs. HPC-X – December 2018
[Chart: % Intel MPI performance vs. HPC-X (65%-125%) against processor core count (up to 512 cores) for the GRIR443 and Au112 cases.]
Significantly different to the 2017 findings: Intel MPI is superior at lower core counts, with HPC-X somewhat more effective at higher core counts.
Application Performance on Multi-core Processors
I.3 Relative Performance as a Function of Processor Family and Interconnect – SKL and SNB Clusters
Target Codes and Data Sets – 128 PEs
[Chart: normalised 128-PE performance (0.0-1.0) for DL_POLY 4 (Gramicidin, NaCl), GROMACS (ion channel, lignocellulose), OpenFOAM (3d3M), Quantum Espresso (Au112, GRIR443), VASP (Pd-O complex, Zeolite complex) and BSMBench Balance, across the Bull b510 "Raven" SNB QDR, Fujitsu CX250 SNB QDR, ATOS BDW e5-2680v4 OPA, Thor Dell|EMC e5-2697A v4 (EDR IMPI and EDR HPC-X), Dell SKL Gold 6130 OPA, Intel SKL Gold 6148 OPA, Dell SKL Gold 6142 EDR and Dell SKL Gold 6150 EDR clusters.]
Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T), EDR) vs. Raven (ATOS b510 Sandy Bridge e5-2670 / 2.6 GHz, IB-QDR) at 80 PEs:

OpenFOAM - Cavity3d-3M | 1.11
WRF - 4dbasic | 1.29
Gromacs 2016.3 - ion channel | 1.33
Gromacs 2016.3 - lignocellulose | 1.36
Gromacs 5.0 - lignocellulose | 1.37
Gromacs 4.6.1 - lignocellulose | 1.38
Gromacs 4.6.1 - ion channel | 1.40
CP2K - H2O-512 | 1.41
QE 5.2 - Au112 | 1.42
CP2K - H2O-256 | 1.43
Gromacs 5.0 - ion channel | 1.45
WRF - conus 2.5km | 1.49
VASP 5.4.4 - Zeolite | 1.53
DLPOLY Classic - Bench7 | 1.53
GAMESS-UK - SiOSi7 | 1.54
GAMESS-UK - DFT.cyclo.6-31G-dp | 1.58
DLPOLY Classic - Bench5 | 1.58
DLPOLY Classic - Bench4 | 1.59
DL_POLY 4.08 - NaCl | 1.65
DL_POLY 4.08 - Gramicidin | 1.67
QE 5.2 - GRIR443 | 1.71
VASP 5.4.4 - PdO Complex | 1.95

Average factor = 1.49 (SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR, NPEs = 80)
Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T), EDR) vs. Raven (ATOS b510 Sandy Bridge e5-2670 / 2.6 GHz, IB-QDR) at 160 PEs:

OpenFOAM - Cavity3d-3M | 1.23
Gromacs 2016.3 - ion channel | 1.32
QE 5.2 - Au112 | 1.33
Gromacs 2016.3 - lignocellulose | 1.36
WRF - 4dbasic | 1.36
Gromacs 5.0 - lignocellulose | 1.36
Gromacs 4.6.1 - lignocellulose | 1.39
DLPOLY Classic - Bench5 | 1.45
Gromacs 5.0 - ion channel | 1.46
Gromacs 4.6.1 - ion channel | 1.48
WRF - conus 2.5km | 1.49
CP2K - H2O-512 | 1.49
QE 5.2 - GRIR443 | 1.49
DLPOLY Classic - Bench4 | 1.50
CP2K - H2O-256 | 1.53
DLPOLY Classic - Bench7 | 1.56
GAMESS-UK - SiOSi7 | 1.56
GAMESS-UK - DFT.cyclo.6-31G-dp | 1.59
DL_POLY 4.08 - NaCl | 1.71
DL_POLY 4.08 - Gramicidin | 1.76
VASP 5.4.4 - Zeolite | 2.02
VASP 5.4.4 - PdO Complex | 2.23

Average factor = 1.53 (SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR, NPEs = 160)
Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T), EDR) vs. Raven (ATOS b510 Sandy Bridge e5-2670 / 2.6 GHz, IB-QDR) at 320 PEs:

QE 5.2 - GRIR443 | 1.34
Gromacs 2016.3 - ion channel | 1.34
DLPOLY Classic - Bench5 | 1.37
Gromacs 2016.3 - lignocellulose | 1.38
Gromacs 5.0 - lignocellulose | 1.39
WRF - 4dbasic | 1.39
CP2K - H2O-512 | 1.40
Gromacs 5.0 - ion channel | 1.40
Gromacs 4.6.1 - lignocellulose | 1.41
DLPOLY Classic - Bench4 | 1.41
OpenFOAM - Cavity3d-3M | 1.44
WRF - conus 2.5km | 1.45
Gromacs 4.6.1 - ion channel | 1.53
CP2K - H2O-256 | 1.56
GAMESS-UK - DFT.cyclo.6-31G-dp | 1.58
GAMESS-UK - SiOSi7 | 1.60
DL_POLY 4.08 - Gramicidin | 1.74
DL_POLY 4.08 - NaCl | 1.80
VASP 5.4.4 - Zeolite | 1.88
QE 5.2 - Au112 | 1.97
DLPOLY Classic - Bench7 | 2.16
VASP 5.4.4 - PdO Complex | 2.71

Average factor = 1.60 (SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR, NPEs = 320)
Performance Benchmarks – Node to Node
• Analysis of performance metrics across a variety of data sets
  ¤ "Core to core" and "node to node" workload comparisons
• The previous charts are based on a core-to-core comparison, i.e. performance for jobs with a fixed number of cores.
• A node-to-node comparison is typical of the performance when running a workload (real-life production), and is expected to reveal the major benefits of increasing core count per socket.
  ¤ Focus on a 4- and 6-node "node to node" comparison of the following:
    1. Raven – Bull b510 Sandy Bridge e5-2670 / 2.6 GHz IB-QDR [64 cores] vs. Hawk – Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [160 cores]
    2. Raven – Bull b510 Sandy Bridge e5-2670 / 2.6 GHz IB-QDR [96 cores] vs. Hawk – Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [240 cores]
  ¤ Benchmarks based on a set of 10 applications & 19 data sets.
Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T), EDR, 160 cores) vs. Raven (Bull b510 Sandy Bridge e5-2670 / 2.6 GHz, IB-QDR, 64 cores) – 4-node comparison:

CP2K - H2O-256 | 2.50
QE 5.2 - Au112 | 2.62
CP2K - H2O-512 | 2.70
DLPOLY Classic - Bench4 | 2.81
GAMESS-UK (DFT.cyclo.6-31G-dp) | 2.94
VASP 5.4.4 - Pd-O complex | 2.95
VASP 5.4.4 - Zeolite complex | 2.96
WRF 3.4 - 4dbasic | 2.98
DLPOLY-4 - NaCl | 3.09
GROMACS 2016.3 - ion channel | 3.11
QE 5.2 - GRIR443 | 3.26
GROMACS 2016.3 - lignocellulose | 3.28
DLPOLY-4 - Gramicidin | 3.31
GAMESS-UK (DFT.siosi7.3975) | 3.40
WRF 3.4 - conus 2.5km | 3.46
OpenFOAM - Cavity3d-3M | 3.50

Average factor = 3.05 (SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR)
Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T), EDR, 240 cores) vs. Raven (Bull b510 Sandy Bridge e5-2670 / 2.6 GHz, IB-QDR, 96 cores) – 6-node comparison:

VASP 5.4.4 - Zeolite complex | 2.59
CP2K - H2O-256 | 2.63
GROMACS 2016.3 - ion channel | 2.64
VASP 5.4.4 - Pd-O complex | 2.67
WRF 3.4 - 4dbasic | 2.78
GAMESS-UK (DFT.cyclo.6-31G-dp) | 2.78
CP2K - H2O-512 | 2.79
DLPOLY-4 - Gramicidin | 2.96
DLPOLY Classic - Bench4 | 2.96
DLPOLY-4 - NaCl | 3.01
QE 5.2 - GRIR443 | 3.14
WRF 3.4 - conus 2.5km | 3.18
GAMESS-UK (DFT.siosi7.3975) | 3.19
QE 5.2 - Au112 | 3.19
GROMACS 2016.3 - lignocellulose | 3.27
OpenFOAM - Cavity3d-3M | 3.88

Average factor = 2.98 (SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR)
EPYC – Target Codes and Data Sets – 128 PEs
[Chart: normalised 128-PE performance (0.0-1.0) for DLPOLY Classic Bench4, DL_POLY 4 (Gramicidin, NaCl), GROMACS (ion channel, lignocellulose), GAMESS-UK (cyclosporin, SiOSi7), Quantum Espresso (Au112, GRIR443) and VASP (Pd-O complex, Zeolite complex), across the Fujitsu CX250 SNB QDR, ATOS BDW e5-2680v4 OPA, Thor Dell|EMC e5-2697A v4 EDR, Dell SKL Gold 6130 OPA, Intel SKL Gold 6148 OPA, Dell SKL Gold 6142 EDR, Dell SKL Gold 6150 EDR, Bull|ATOS SKL Gold 6150 EDR and Dell|EMC AMD EPYC 7601 EDR clusters.]
78Application Performance on Multi-core Processors
Performance Benchmarks – Node to Node
• Analysis of performance metrics across a variety of data sets
¤ “Core to core” and “node to node” workload comparisons
• Previous EPYC charts were based on a core-to-core comparison, i.e. performance for jobs with a fixed number of cores
• A node-to-node comparison is typical of the performance when running a workload (real-life production), and is expected to reveal the major benefits of the increased core count per socket
¤ Focus on a “node to node” comparison of the following:
¤ Benchmarks based on a set of 6 applications & 15 data sets.
1. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores] vs. Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores]
2. Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [128 cores] vs. Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores]
12 December 2018
[Chart: relative performance factors from 1.55 to 4.19 for 12 data sets – DLPOLYclassic Bench7, Bench5 & Bench4, VASP Pd-O & Zeolite complexes, DLPOLY-4 NaCl & Gramicidin, GROMACS ion-channel & lignocellulose, QE Au112, GAMESS-UK cyc-sporin & valino.A2]
Relative Performance of Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores]
79Application Performance on Multi-core Processors
Average Factor = 2.92
Dell|EMC EPYC 7601 2.2 GHz (T) EDR vs. SB e5-2670 2.6 GHz QDR
12 December 2018
4 Node Comparison
[Chart: relative performance factors from 0.74 to 1.83 for 15 data sets – VASP Pd-O & Zeolite complexes, QE Au112 & GRIR443, DLPOLY-4 NaCl & Gramicidin, DLPOLYclassic Bench7, Bench5 & Bench4, GROMACS ion-channel & lignocellulose, GAMESS-UK cyc-sporin, valino.A2, Siosi7 & hf12z]
Relative Performance of Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [128 cores]
80Application Performance on Multi-core Processors
Average Factor = 1.28
SKL “Gold” 6130 2.1 GHz OPA vs. AMD EPYC 7601 2.2 GHz (T) EDR
12 December 2018
4 Node Comparison
Summary
• Ongoing focus on performance benchmarks and clusters featuring Intel’s SKL processors, with the addition of the “Gold” 6138, 2.0 GHz [20c] and 6148, 2.4 GHz [20c], alongside the 6142, 2.6 GHz [16c] and 6150, 2.7 GHz [18c].
• Performance comparison with current SNB systems and those
based on dual Intel BDW processor EP nodes (16-core, 14-core)
with Mellanox EDR and Intel’s Omnipath OPA interconnects.
• Measurements of parallel application performance based on
synthetic and end user applications – DLPOLY, Gromacs, Amber,
GAMESS-UK, Quantum ESPRESSO and VASP.
¤ Use of Allinea Performance reports to guide analysis, and
updated comparison of Mellanox’s HPC-X and Intel MPI on
EDR-based systems
• Results augmented through consideration of two AMD Naples
EPYC clusters, featuring the 7601 (2.20 GHz) and 7551 (2.00 GHz)
processors.
81Application Performance on Multi-core Processors 12 December 2018
Summary II
• Relative Code Performance: Processor Family and Interconnect – “core
to core” and “node to node” benchmarks.
• A core-to-core comparison focusing on the Skylake “Gold” 6148 cluster (EDR) across 19 data sets (7 applications) suggests average speedups from 1.49 (80 cores) to 1.60 (320 cores) when comparing to the Sandy Bridge-based “Raven” e5-2670 2.6 GHz cluster with its QDR interconnect.
¤ Some applications, however, show much higher factors, e.g. GROMACS and VASP, depending on the level of optimisation undertaken on Hawk.
• A node-to-node comparison, typical of the performance when running a workload, shows increased factors.
¤ A 4-node benchmark (160 cores), based on examples from 9 applications and 16 data sets, shows an average improvement factor of 3.05 compared to the corresponding 4-node runs (64 cores) on the Raven cluster.
¤ This factor is reduced somewhat, to 2.98, for the 6-node benchmarks, comparing 240 SKL cores to 96 SNB cores.
82Application Performance on Multi-core Processors 12 December 2018
Summary III
• An updated comparison of Intel MPI and Mellanox’s HPCX
conducted on the “Helios” cluster suggests that the clear
delineation between MD (DLPOLY, GROMACS) and Materials-
based codes (VASP, Quantum Espresso) is no longer evident.
• Ongoing studies on the EPYC 7601 show a complex performance dependency on the EPYC architecture.
¤ Codes with heavy usage of vector instructions (GROMACS, VASP and Quantum Espresso) perform, at best, in somewhat modest fashion.
¤ The AMD EPYC only supports 2 × 128-bit AVX natively, so there is a large gap to Intel’s 2 × 512-bit FMAs (see the arithmetic sketch below).
¤ The floating-point peak on AMD is 4 × lower than on Intel and, given that e.g. GROMACS has a native AVX-512 kernel for Skylake, performance inevitably suffers.
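A back-of-envelope check of that factor – per core, per clock, in double precision, using the nominal FMA widths quoted above (a sketch, not vendor-measured figures):
\[
\text{SKL: } 2\ \text{FMA} \times \tfrac{512}{64}\ \text{DP lanes} \times 2\ \text{flops} = 32\ \text{DP flops/cycle};
\qquad
\text{EPYC (Naples): } 2 \times \tfrac{128}{64} \times 2 = 8\ \text{DP flops/cycle}
\;\Rightarrow\; 32/8 = 4\times .
\]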
83Application Performance on Multi-core Processors 12 December 2018
II. Acceptance Test Challenges and the
Impact of Environment.
Application Performance on Multi-
core Processors
Background - Supercomputing Wales, New HPC Systems
• Multi-million £ procurement exercise for new hubs agreed by all
partners
• Tender issued in May 2017 following 6-9 month review of research
community requirements and development of technical reference
design
• Budgetary challenges due to currency devaluation and increase in
component costs since budgets agreed in 2016
• Contracts awarded to Atos, March 2018. Hubs now installed and
operational, based on Intel Skylake Gold 6148, supported by Nvidia
GPU accelerators:
Lot 1 – “Hawk” system - Cardiff hub. 7,000 HPC + 1,040 HTC cores
Lot 2 – “Sunbird” system - Swansea hub. 5,000 HPC cores
Lot 3 – “Sparrow” – Cardiff High Performance Data Analytics
development system
Suppliers to provide development opportunities and other activities
through a programme of Community Benefits
12 December 2018Application Performance on Multi-core Processors 85
Performance Acceptance Tests
1. Consideration of the Performance Acceptance tests undertaken as
part of the Supercomputing Wales procurement. Carried out by Atos
on the “Hawk” HPC Skylake 6148 Cluster at Cardiff University.
2. Performance targets were built on benchmarks specified in the ITT – but subsequent developments, e.g. SPECTRE / Meltdown, impacted the testing.
3. Assess performance through analyses of results generated under three distinct run-time environment settings, characterised by:
¤ Turbo Mode – ON or OFF. Impact considerably more complicated with
Skylake compared to previous Intel processor families.
¤ Security patches – DISABLED or ENABLED on the Skylake 6148 compute
nodes
¤ Distribution of processing cores – PACKED or UNPACKED on each node
e.g. 256 cores on either 7 or 8 × 40-core nodes.
4. Total of 8 combinations – what is the impact on performance?
¤ The ITT defined that all “Application benchmarks should be in ‘PACKED’ mode; HPCC in non-turbo mode”.
12 December 2018Application Performance on Multi-core Processors 86
Process Adopted
1. Performance benchmark results generated by Atos (Martyn Foster)
on the Hawk HPC Skylake 6148 Cluster at Cardiff University
2. MF adopted a systematic approach to assessing performance
through the analyses of results generated across four distinct
environments (a subset of the 8 possible environments)
¤ “base (switch contained)” – Turbo mode off, security patches disabled on
the Skylake 6148 compute nodes
¤ “turbo + packed” - Turbo mode activated, with packed nodes – Slurm
default, with 40 cores per Skylake 6148 node
¤ “turbo + spread” - Turbo mode activated, de-populated nodes (32 cores /
node)
¤ “base + spectre” – base configuration above with security patches enabled
3. Identify those applications where the committed performance from the SCW ITT submission (“Target”) is not achieved. A 10% shortfall was allowed.
12 December 2018Application Performance on Multi-core Processors 87
GLOBAL_SETTINGS
export SPECTRE="clush -b -w $SLURM_NODELIST sudo /apps/slurm/disablekpti"
export SPEC="disable"
##### OR #####
export SPECTRE="clush -b -w $SLURM_NODELIST sudo /apps/slurm/enablekpti"
export SPEC="enable"

export TURBO="clush -b -w $SLURM_NODELIST sudo /apps/slurm/turbo_on" ; export TSTR=TURBO
##### OR #####
export TURBO="clush -b -w $SLURM_NODELIST sudo /apps/slurm/turbo_off" ; export TSTR=OFF

export SRUN_PACKING="-m Pack" ; export PSTR=Packed
##### OR #####
export SRUN_PACKING="-m NoPack" ; export PSTR=Spread

export LAUNCHER="srun ${SRUN_PACKING} --cpu_bind=verbose,cores --export LD_LIBRARY_PATH"
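A minimal sketch (a hypothetical job-script fragment, not the actual Atos harness) of how one of the eight combinations would be applied ahead of a benchmark run; looping over the three OR-pairs above generates all eight combinations:

source ./GLOBAL_SETTINGS                     # pick one option from each OR-pair above
eval "$SPECTRE"                              # enable or disable KPTI on the allocated nodes
eval "$TURBO"                                # switch turbo mode on or off
echo "Environment: patches=$SPEC turbo=$TSTR packing=$PSTR"
$LAUNCHER -n $SLURM_NTASKS ./benchmark.x     # srun with the chosen packing and core binding (benchmark.x is a placeholder)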
12 December 2018Application Performance on Multi-core Processors 88
SCW Application Performance Benchmarks
• The Benchmark suite comprises both synthetics & end-user
applications. Synthetics include HPCC (http://icl.cs.utk.edu/hpcc) &
IMB benchmarks (http://software.intel.com/en-us/articles/intel-mpi-
benchmarks), IOR and STREAM
• Variety of “open source” & commercial end-user application codes:
¤ GROMACS and DL_POLY-4 (molecular dynamics)
¤ Quantum Espresso and VASP (ab initio materials properties)
¤ BSMBench (particle physics – Lattice Gauge Theory benchmarks)
¤ OpenFOAM (computational engineering)
• These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed.
12 December 2018Application Performance on Multi-core Processors 89
“Sunbird” Acceptance Tests – User Applications
90
[Chart: measured performance (78% to 113%) for DL_POLY Gramicidin (64/128/256 cores), GROMACS ion-channel (64/128) and lignocellulose (128/256), VASP PdO (64/128) and Zeolite (128/256), QE Au112 (64/128) and GRIR443 (256/512), OpenFOAM (128/256), and BSMBench Comms, Balance and Compute (256/512/1024 cores)]
Basket of Synthetic (HPCC, IOR, STREAM, IMB) and end-user application codes – DL_POLY, GROMACS, VASP, ESPRESSO, OpenFOAM & BSMBENCH
12 December 2018Application Performance on Multi-core Processors
Impact of Turbo Mode on Performance (Security Patches Enabled)
[Chart: relative performance (%), T Turbo-OFF / T Turbo-ON, for DL_POLY Gramicidin, GROMACS ion-channel and lignocellulose, VASP PdO and Zeolite, QE Au112 and GRIR443, OpenFOAM and BSMBench Comms/Balance/Compute at 64–1024 cores; higher is better; y-axis 85%–115%]
Normalised to the corresponding performance with Turbo OFF; security patches enabled
12 December 2018Application Performance on Multi-core Processors 91
Impact of Turbo Mode on Performance (Security Patches Disabled)
[Chart: relative performance (%), T Turbo-OFF / T Turbo-ON, for the same DL_POLY, GROMACS, VASP, QE, OpenFOAM and BSMBench data sets at 64–1024 cores; higher is better; y-axis 85%–115%]
Normalised to the corresponding performance with Turbo OFF; security patches disabled
12 December 2018Application Performance on Multi-core Processors 92
Impact of Security Patches on Performance (Turbo Mode OFF)
[Chart: relative performance (%), T DISABLED / T ENABLED, for the same DL_POLY, GROMACS, VASP, QE, OpenFOAM and BSMBench data sets at 64–1024 cores; higher is better; y-axis 90%–110%]
Normalised to the corresponding performance with the security patches disabled on the compute nodes; Turbo OFF
12 December 2018Application Performance on Multi-core Processors 93
Impact of Security Patches on Performance (Turbo Mode ON)
[Chart: relative performance (%), T DISABLED / T ENABLED, for the same DL_POLY, GROMACS, VASP, QE, OpenFOAM and BSMBench data sets at 64–1024 cores; higher is better; y-axis 90%–110%]
Normalised to the corresponding performance with the security patches disabled on the compute nodes; Turbo ON
12 December 2018Application Performance on Multi-core Processors 94
Overall Impact of Environment on Performance
[Chart: relative performance (%), T CONSTRAIN / T MIN, for the same DL_POLY, GROMACS, VASP, QE, OpenFOAM and BSMBench data sets at 64–1024 cores; higher is better; y-axis 90%–115%]
Normalised with respect to the most constrained environment – Turbo OFF, security patches enabled, “packed” nodes
12 December 2018Application Performance on Multi-core Processors 95
Workload validation and Throughput tests
• Aim: Throughput designed to illustrate the Stability of the system
over an observed period of a week, while hardening the system
• Benchmarks based on multiple, concurrent instantiations of a
number of data sets associated with five of the end user application
codes and two of the synthetic benchmarks.
• Each data set is run a number of times on a variety of processor
(core) counts - typically 40, 80, 160, 320, 640 and 1024. This
combination of jobs has been designed to run for approximately 6
hours (elapsed time) on a 2720-core, 68 node cluster partition.
• Note that the metrics for success of these tests are twofold:
1. All jobs comprising a given run complete successfully, and
2. There is consistency of run time across each of the tests. The measured time is simply the window from the time at which the first of the jobs is launched to the time at which the last job finishes (a minimal extraction sketch follows).
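A minimal sketch of how that first-launch-to-last-finish window could be extracted from standard Slurm accounting, assuming the jobs of one run share a common (hypothetical) job-name tag:

sacct -n -X --name=SCW_throughput_run06 --format=Start | sort | head -1   # earliest job start in the run
sacct -n -X --name=SCW_throughput_run06 --format=End   | sort | tail -1   # latest job finish in the run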
12 December 2018Application Performance on Multi-core Processors 96
Workload validation and Throughput tests
• Based around multiple instantiations
of a number of data sets associated
with the five codes, DLPOLY4,
Gromacs (v5.2), Quantum Espresso,
OpenFOAM and VASP, and the two
synthetic benchmarks, IMB and IOR.
• DLPOLY4 - NaCl & Gramicidin
• Gromacs - ion_channel &
lignocellulose
• QE 6.1 - AUSURF112 & GRIR443
• OpenFOAM - cavity3d-3M
• VASP 5.4.4 – PdO complex and
Zeolite
12 December 2018Application Performance on Multi-core Processors 97
SLURM Scripts
DLPOLY4.test2+test8.SCW.40.q
DLPOLY4.test2+test8.SCW.80.q
DLPOLY4.test2+test8.SCW.160.q
DLPOLY4.test2+test8.SCW.320.q
DLPOLY4.test2+test8.SCW.640.q
GROMACS.All.SCW.80.q
GROMACS.All.SCW.160.q
GROMACS.All.SCW.320.q
GROMACS.All.SCW.640.q
GROMACS.All.SCW.1024.q
IMB3.SCW.160.q
IMB3.SCW.320.q
IOR.SCW.4.q
IOR.SCW.8.q
OpenFOAM_cavity3d-3M.SCW.80.q
OpenFOAM_cavity3d-3M.SCW.160.q
OpenFOAM_cavity3d-3M.SCW.320.q
OpenFOAM_cavity3d-3M.SCW.640.q
QE.AUSURF112.SCW.160.q
QE.AUSURF112.SCW.320.q
QE.GRIR443.SCW.320.q
QE.GRIR443.SCW.640.q
VASP.example3.SCW.80.q
VASP.example3.SCW.160.q
VASP.example3.SCW.320.q
VASP.example4.SCW.160.q
VASP.example4.SCW.320.q
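The scripts themselves are not reproduced in the slides; a minimal hypothetical skeleton of what, e.g., DLPOLY4.test2+test8.SCW.320.q might contain (the directives, module name and file layout are assumptions, not the actual Cardiff/Atos scripts):

#!/bin/bash
#SBATCH --job-name=DLPOLY4.test2+test8.320
#SBATCH --ntasks=320
#SBATCH --ntasks-per-node=40              # packed dual Skylake 6148 nodes (2 x 20 cores)
#SBATCH --time=06:00:00
module load dlpoly                        # assumed module name
cd "$SLURM_SUBMIT_DIR"
for test in TEST2 TEST8; do               # NaCl and Gramicidin data sets in turn
  cp $test/CONTROL $test/CONFIG $test/FIELD .
  srun -n 320 --cpu_bind=cores DLPOLY.Z   # DL_POLY_4 executable
  mv OUTPUT OUTPUT.$test.320
done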
Throughput Tests – Hawk System – Two partition Approach
The throughput tests were undertaken on two separate partitions of the
Hawk cluster – compute64 and compute64b – to enable other testing and
early pilot user service. Each partition comprised 68 nodes.
Partition 1 – Compute 64 (68 Nodes)
• The first set of trial runs was executed between 12-14 May. A number of the
runs failed to complete, subsequently attributed to an apparent VASP related
error peculiar to the lustre file system:
forrtl: severe (121): Cannot access current working directory for unit 18, file "Unknown"
Image PC Routine Line Source
vasp_std 00000000014F3E09 Unknown Unknown Unknown
vasp_std 000000000150E10F Unknown Unknown Unknown
vasp_std 000000000134C950 Unknown Unknown Unknown
vasp_std 000000000040AF5E Unknown Unknown Unknown
libc-2.17.so 00002B450F32EC05 __libc_start_main Unknown Unknown
vasp_std 000000000040AE69 Unknown Unknown Unknown
forrtl: error (76): Abort trap signal
• This transient error affected perhaps one in twenty identical jobs, and although
reported into the appropriate Level 3 service regimes, has still not been formally
addressed. A workaround module was developed by Cardiff’s Tom Green when
it became clear that the formal channels were struggling.
module load lustre_getcwd_fix
12 December 2018Application Performance on Multi-core Processors 98
Throughput Tests – Hawk System II
Partition 1:
• A second set of trial runs was carried out over the bank holiday weekend and successfully passed the associated tests over the period 30 May – 3 June.
• Partition 2:
• Runs 11 – 22: Initial runs using compute64b, conducted between 7 – 10 June, revealed a number of issues pointing to the readiness of the nodes. Timings from the first completed run suggested some variability in run times for a given application/core count, with the total run time significantly longer than those on compute64.
Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
6       30 May 21:21   31 May 03:25   6:02
7       31 May 23:33   01 Jun 05:38   6:05
8       02 Jun 00:04   02 Jun 06:06   6:02
9       02 Jun 13:24   02 Jun 19:27   6:04
10      02 Jun 22:57   03 Jun 05:00   6:03
11      03 Jun 05:45   03 Jun 11:47   6:02
12      03 Jun 15:57   03 Jun 22:12   6:15
12 December 2018Application Performance on Multi-core Processors 99
Throughput Tests – Hawk System III
• Following a lustre upgrade, a further set of runs was undertaken between 21 June and 25 June. Runs 8 – 12 ran without error, so formally compute64b, along with compute64, can be judged to have passed the Acceptance Test throughput requirement of five consecutive error-free runs, although the variations in the individual run times are perhaps larger than hoped.
• Testing on Hawk commenced on 12 May 2018 and was finally
completed on the 25 June 2018.
Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
8       23 Jun 15:44   23 Jun 20:57   5:13
9       23 Jun 21:35   24 Jun 02:55   5:20
10      24 Jun 11:34   24 Jun 17:06   5:32
11      24 Jun 18:56   25 Jun 00:16   5:20
12      25 Jun 00:31   25 Jun 06:03   5:32
12 December 2018Application Performance on Multi-core Processors 100
Throughput Tests – Sunbird System – Two partition approach
Partition 1: Runs 1 – 4:
¤ Run 3 did not complete, with JOBID #11050 hanging, while JOBID #11372 of Run 4 suffered the same fate. Both jobs failed with the all-too-familiar VASP/lustre error diagnostics. The scripts used were identical to those used on Hawk in June, and did not include the workaround introduced at the time.
• Runs 5 – 10: Completed successfully, with two occurrences of the VASP/lustre error trapped through the added module
module load lustre_getcwd_fix
Partition 2: Runs 11 – 22: Three jobs in one of the runs hung when hitting problems on scs0105. That node had been taken out when setting up the user-facing file systems and needed the playbooks running. Several of the runs showed the impact of the lustre issue with VASP.
• However, there were significant variations in the overall run times.
¤ At least three of the nodes appeared to be either defective or to have different BIOS settings (scs0064, scs0092 and scs0096). These were subsequently removed from service.
¤ Turbo was in an inconsistent state across the compute nodes. It is usually reset by the Slurm prologue scripts, but these appear to have been commented out.
12 December 2018 101Application Performance on Multi-core Processors
Throughput Tests – Sunbird System
• Runs 23 – 30: Certainly acceptable from the metric of job completion, for all completed successfully. Note there was no recurrence of the lustre-related issue during this set of runs.
• Testing on Sunbird commenced on 10 August 2018, and was finally
completed on the evening of 19 August 2018
Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
23      17 Aug 17:52   17 Aug 23:07   5:15
24      17 Aug 23:54   18 Aug 05:11   5:17
25      18 Aug 05:19   18 Aug 10:36   5:17
26      18 Aug 14:00   18 Aug 19:16   5:16
27      18 Aug 19:51   19 Aug 01:08   5:17
28      19 Aug 02:45   19 Aug 07:57   5:12
29      19 Aug 13:21   19 Aug 18:32   5:11
30      19 Aug 19:06   20 Aug 00:26   5:20 (SLURM CG issue)
12 December 2018Application Performance on Multi-core Processors 102
Throughput Tests – Nottingham OCF Cluster
• Tests were modified to run on two partitions of the OCF cluster at Nottingham, “martyn” and “colin”, each comprising 50 nodes with an EDR interconnect. All component nodes comprised dual Gold 6138 2.0GHz 20c SKL processors.
• Initial runs of the workload failed to complete successfully, with each of the 8 x 320-core IMB jobs hanging and consuming all of their allocated time. This was traced to an issue with the gatherv collective, which failed to complete across all specified message lengths (msglens).
• Navigated around the issue by removing those environment variables
deemed likely to trigger the problem, specifically:
¤ export I_MPI_JOB_FAST_STARTUP=enable
¤ export I_MPI_SCALABLE_OPTIMIZATION=enable
¤ export I_MPI_DAPL_UD=enable
¤ export I_MPI_TIMER_KIND=rdtsc
• With these removed (see the one-line sketch below), runs proceeded to complete successfully.
• One of the allocated nodes (compute099) was rendered unusable as a result of the tests and removed from service. The subsequent runs therefore used 49 nodes, rather than the intended 50.
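In practice, dropping those four Intel MPI variables amounts to simply not exporting them (or explicitly unsetting them) in the job scripts – a one-line sketch, assuming they had previously been exported:

unset I_MPI_JOB_FAST_STARTUP I_MPI_SCALABLE_OPTIMIZATION I_MPI_DAPL_UD I_MPI_TIMER_KIND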
12 December 2018Application Performance on Multi-core Processors 103
Throughput Tests – Acceptance Achieved (OCF System)
Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
2       31 Jul 18:04   01 Aug 00:49   6:45
3       01 Aug 01:22   01 Aug 08:07   6:45
4       01 Aug 08:32   01 Aug 15:18   6:46
5       01 Aug 17:08   01 Aug 23:52   6:44
6       02 Aug 03:20   02 Aug 10:06   6:46
12 December 2018Application Performance in Materials Science 104
Table. Overall run times for the throughput runs on the “martyn” partition.
Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
1       02 Aug 12:24   02 Aug 19:05   6:41
2       03 Aug 02:41   03 Aug 09:18   6:37
3       03 Aug 10:34   03 Aug 17:18   6:44
4       03 Aug 17:35   04 Aug 00:26   6:51
5       04 Aug 01:00   04 Aug 07:45   6:45
Table. Overall run times for the throughput runs on the “colin” partition.
Results of "throughput benchmarks" carried out on the new OCF Skylake
cluster at Nottingham University between 31 July and 4 August 2018.
III. The Performance Evolution of two
Community Codes, DL_POLY and
GAMESS-UK
Application Performance on Multi-
core Processors
Outline and Contents
1. Introduction – DL_POLY and GAMESS-UK
¤ Background and Flagship community codes for the UK’s
CCP5 & CCP1 – Collaboration!
2. HPC Technology – Impact of Processor & Interconnect
developments
¤ The last 10 years of Intel dominance – Nehalem to Skylake
3. DL_POLY and GAMESS-UK Performance
¤ Benchmarks & Test Cases
¤ Overview of two decades of Code Performance: From the Cray
T3E/900 to Intel Skylake clusters
12 December 2018Application Performance on Multi-core Processors 106
“DL_POLY - A Performance Overview. Analysing, Understanding and Exploiting
available HPC Technology”, Martyn F Guest, Alin M Elena and Aidan B G Chalk,
Molecular Simulation, Accepted for publication (2019).
The Story of Two Community Codes
DL_POLY and GAMESS-UK - A Performance
Overview
HPC Technology –
Processor and
Networks
Computer Systems
• Benchmark timings cover a wide variety of systems, starting with the Cray T3E/1200 in 1999. Access was initially undertaken as part of Daresbury’s Distributed Computing support programme (DiSCO), with the benchmarks presented at the annual Machine Evaluation Workshops (1989-2014) and STFC’s successor Computing Insight (CIUK) conferences (2015 onwards).
¤ Access was typically short-lived, as systems were provided by suppliers to enhance their profile at the MEW workshops, leaving limited opportunity for in-depth benchmarking.
• Systems include a wide range of CPU offerings: representatives from over a dozen generations of Intel processors, from the early days of single-processor nodes housing Pentium 3 and Pentium 4 CPUs, through dual-processor nodes featuring dual-core Woodcrest and quad-core Clovertown & Harpertown processors, along with the Itanium and Itanium2 CPUs, through to the extensive range of multi-core offerings from Westmere to Skylake.
12 December 2018Application Performance on Multi-core Processors 108
Computer Systems
• A variety of processors from AMD (Athlon, Opteron, MagnyCours,
Interlagos etc.) along with the “power” processors from the IBM
pSeries have also featured (typically dual processor configurations).
• Just as a wide variety of processors feature, so too does a range of network interconnects. Fast Ethernet and GBit Ethernet were rapidly superseded by the increasing capabilities of the family of Infiniband interconnects from Voltaire and Mellanox (SDR, DDR, QDR, FDR, EDR and soon HDR), along with the now defunct offerings from Myrinet, Quadrics and QLogic. The Truescale interconnect from Intel, along with its successor, Omnipath, also features.
• Dating from the appearance of Intel’s SNB processors, many of the timings were generated with the Turbo mode feature enabled by the system administrators. Such systems are tagged with the “(T)” notation.
• As for software, most of the commodity clusters featuring Intel CPUs used successive generations of Intel compilers along with Intel MPI, although a range of MPI libraries have been used – OpenMPI, MPICH, MVAPICH and MVAPICH2. Proprietary systems (Cray and IBM) used system-specific compilers and associated MPI libraries.
12 December 2018Application Performance on Multi-core Processors 109
Intel Xeon : Westmere - Skylake
(Columns: Xeon 5600 “Westmere-EP” | Xeon E5-2600 “Sandy Bridge-EP” | Xeon E5-2600 v4 “Broadwell-EP” | Intel Xeon Scalable Processor “Skylake”)
Cores / Threads: up to 6 cores / 12 threads | up to 8 cores / 16 threads | up to 22 cores / 44 threads | up to 28 cores / 56 threads
Last-level cache: 12 MB | up to 20 MB | up to 55 MB | up to 38.5 MB (non-inclusive)
Max memory channels, speed / socket: 3 x DDR3 channels, 1333 | 4 x DDR3 channels, 1600 | 4 channels of up to 3 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2400 MHz | 6 channels of up to 2 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2666 MHz
New instructions: AES-NI | AVX 1.0, 8 DP Flops/Clock | AVX 2.0, 16 DP Flops/Clock | AVX 512, 32 DP Flops/Clock
QPI / UPI Speed (GT/s): 1 QPI channel @ 6.4 GT/s | 2 QPI channels @ 8.0 GT/s | 2 x QPI channels @ 9.6 GT/s | up to 3 x UPI @ 10.4 GT/s
PCIe Lanes / Controllers / Speed (GT/s): 36 lanes PCIe 2.0 on chipset | 40 lanes / socket, integrated PCIe 3.0 | 40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s) | 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
Server / Workstation TDP: Server / Workstation: 130W | up to 130W Server; 150W Workstation | 55 - 145W | 70 - 205W
12 December 2018Application Performance on Multi-core Processors 110
The Story of Two Community Codes
DL_POLY and GAMESS-UK - A Performance
Overview
Overview of two
decades of
DL_POLY
Performance
[Diagram: domain decomposition of the simulation cell into regions A, B, C and D, one per node]
• Distribute atoms and forces across the nodes
¤ More memory efficient; can address much larger cases (10^5 - 10^7)
• SHAKE and short-range forces require only neighbour communication
¤ Communications scale linearly with the number of nodes
• Coulombic energy remains global
¤ Adopt the Smooth Particle Mesh Ewald (SPME) scheme (see the sketch below)
• Includes a Fourier transform of the smoothed charge density (reciprocal-space grid typically 64x64x64 - 128x128x128)
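For context, a sketch of the standard Ewald splitting that SPME evaluates (textbook form in Gaussian units, not the code’s exact expression): the short-range part is summed over neighbouring domains only, while the reciprocal-space sum uses the FFT of the B-spline-smoothed charge density:
\[
E_{\mathrm{Coul}}
= \underbrace{\tfrac{1}{2}\sum_{i \ne j} q_i q_j\, \frac{\operatorname{erfc}(\alpha r_{ij})}{r_{ij}}}_{\text{real space, within cutoff}}
+ \underbrace{\frac{2\pi}{V}\sum_{\mathbf{k} \ne 0} \frac{e^{-k^{2}/4\alpha^{2}}}{k^{2}}\,\bigl|S(\mathbf{k})\bigr|^{2}}_{\text{reciprocal space, FFT grid}}
- \underbrace{\frac{\alpha}{\sqrt{\pi}}\sum_{i} q_i^{2}}_{\text{self term}},
\qquad
S(\mathbf{k}) = \sum_{j} q_j\, e^{i\mathbf{k}\cdot\mathbf{r}_j}.
\]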
https://www.scd.stfc.ac.uk/Pages/DL_POLY.aspx
W. Smith and I. Todorov
Domain Decomposition - Distributed data:
DL_POLY 3/4 – Distributed data
Benchmarks:
1. NaCl Simulation; 216,000 ions, 200 time steps, Cutoff = 12 Å
2. Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps
12 December 2018Application Performance on Multi-core Processors 112
The DLPOLY Benchmarks
DL_POLY 4
• Test2 Benchmark
¤ NaCl Simulation; 216,000 ions, 200 time steps, Cutoff = 12 Å
• Test8 Benchmark
¤ Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps
DL_POLY Classic
• Bench4
¤ NaCl Melt Simulation with Ewald
sum electrostatics & a MTS
algorithm. 27,000 atoms; 500 time
steps.
• Bench5
¤ Potassium disilicate glass (with 3-
body forces). 8,640 atoms: 3,000
time steps
• Bench7
¤ Simulation of gramicidin A molecule
in 4012 water molecules using
neutral group electrostatics. 12,390
atoms: 5,000 time steps
12 December 2018Application Performance on Multi-core Processors 113
DL_POLY Classic: Bench 4
Performance Relative to the Cray T3E/1200 (32 CPUs)
[Chart: “DLPOLY 2 - Bench 4 (32 PEs)” – performance relative to the Cray T3E/1200E across some fifty systems, from the Cray T3E/1200E, IBM SP/Winterhawk2 and SGI Origin 3800 through successive commodity clusters (Pentium, Opteron, Itanium2, Woodcrest, Clovertown, Harpertown, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Power8) to the Intel SKL Platinum 8170, Dell SKL Gold 6142 and Bull|ATOS SKL Gold 6150 clusters; maximum factor 112]
12 December 2018Application Performance on Multi-core Processors 114
DL_POLY V2: Bench 7
Performance Relative to the Cray T3E/1200 (32 CPUs)
[Chart: “DLPOLY 2 - Bench 7 (32 PEs)” – the same set of systems as the Bench 4 chart, from the Cray T3E/1200E to the Skylake Platinum 8170 and Gold 6142/6150 clusters; maximum factor 47]
12 December 2018Application Performance on Multi-core Processors 115
DL_POLY 3/4 – Gramicidin (128 Cores)
Performance Relative to the IBM e326 Opteron 280/2.4GHz + GbitE
[Chart: “DLPOLY 3/4 - Gramicidin (128 cores)” – DL_POLY 3 and DL_POLY 4 performance relative to the IBM e326 Opteron 280/2.4 GHz GbitE cluster, across systems from dual-core Opteron and Woodcrest clusters through Harpertown, Nehalem, Westmere, Interlagos, Sandy Bridge, Ivy Bridge, Haswell and Broadwell to the Skylake Platinum 8170 and Gold 6142/6150 clusters; maximum factor 61.5]
12 December 2018Application Performance on Multi-core Processors 116
DL_POLY 4 – Gramicidin (128 cores)
Performance Relative to the IBM e326 Opteron 280/2.4GHz / GbitE
[Chart: “DLPOLY 4 - Gramicidin (128 cores)” – systems grouped by processor generation (E5-26xx, E5-26xx v2, E5-26xx v3, E5-26xx v4, Intel SKL), from Westmere and Sandy Bridge clusters through Ivy Bridge, Haswell and Broadwell to the Skylake Gold 6130/6148/6142/6150 clusters and the Atos AMD EPYC 7601; maximum factor 61.5]
12 December 2018Application Performance on Multi-core Processors 117
DL_POLY4 – Gramicidin Perf Report
Smooth Particle Mesh Ewald Scheme
Performance Data (32-256 PEs)
[Charts: CPU Time Breakdown – CPU scalar numeric ops (%), CPU vector numeric ops (%) and CPU memory accesses (%) at 32, 64, 128 and 256 PEs; Total Wallclock Time Breakdown – CPU (%) vs. MPI (%) at 32, 64, 128 and 256 PEs]
12 December 2018Application Performance on Multi-core Processors 118
The Story of Two Community Codes
DL_POLY and GAMESS-UK - A Performance
Overview
Overview of two
decades of
GAMESS-UK
Performance
Large-Scale Parallel Ab-Initio Calculations
• GAMESS-UK now has two parallelisation schemes:
¤ The traditional version based on the Global Array tools
• retains a lot of replicated data
• limited to about 4000 atomic basis functions
¤ Subsequent developments by Ian Bush (High Performance
Applications Group, Daresbury, now at Oxford University via NAG
Ltd.) have extended the system sizes available for treatment by
both GAMESS-UK (molecular systems) and CRYSTAL (periodic
systems)
• Partial introduction of “Distributed Data” architecture…
• MPI/ScaLAPACK based
12 December 2018Application Performance on Multi-core Processors 120
The GAMESS-UK Benchmarks
Five representative examples of increasing
complexity.
• Cyclosporin 6-31g basis (1000 GTOs) DFT B3LYP (direct
SCF)
• Cyclosporin 6-31g-dp basis (1855 GTOs) DFT B3LYP
(direct SCF)
• Valinomycin (dodecadepsipeptide) in water; DZVP2 DFT
basis, HCTH functional (1620 GTOs) (direct SCF)
• Mn(CO)5H TZVP/DZP MP2 - geometry optimization
• ((C6H4(CF3))2 6-31g basis DFT B3LYP opt geom + analytic
2nd Derivatives
12 December 2018Application Performance on Multi-core Processors 121
GAMESS-UK. DFT B3LYP Performance
Performance Relative to the Cray T3E/1200 (32 CPUs)
Valinomycin, 1620 GTOs; Basis: DZVP2_A2 (Dgauss)
[Chart: “Valinomycin DFT - DZVP2 1620 GTOs” – performance relative to the Cray T3E/1200 across systems from the Cray T3E and IBM pSeries 690 through Opteron, Pentium-4, Itanium2, Harpertown, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell and Broadwell clusters to the Atos Skylake Gold 6148 2.4GHz (T) IB/EDR; callout values 28.5 and 65.1, with annotations highlighting the CS48 Bull R422 Harpertown E5472 3.0GHz QC // IB and the Atos Skylake Gold 6148 2.4GHz (T) // IB/EDR systems]
12 December 2018Application Performance on Multi-core Processors 122
Performance of MP2 Gradient Module
Performance Relative to the Cray T3E/900 (32 CPUs)
Mn(CO)5H – MP2 geometry optimisation; BASIS: TZVP + f (217 GTOs)
[Chart: “MP2 Mn(CO)5H” – performance relative to the Cray T3E/900 across systems from the Cray T3E/900 and IBM SP/P2SC through Alpha, Origin, Pentium, Opteron, Itanium2, power4/5/6, Harpertown, Nehalem, Westmere and Sandy Bridge clusters to Broadwell and the Atos Skylake Gold 6148 2.4GHz (T) IB/EDR; callout values 45.8 and 55.3, with annotations highlighting the CS48 Bull Xeon E5472 3.0GHz QC + DDR and the Intel SNB e5-2670 [8c] 2.6GHz // IB/QDR (ppn=8) systems]
12 December 2018Application Performance on Multi-core Processors 123
GAMESS-UK – DFT Performance Report
Cyclosporin 6-31G** basis (1855 GTOs); DFT B3LYP
Performance Data (32-256 PEs)
[Charts: CPU Time Breakdown – CPU scalar numeric ops (%), CPU vector numeric ops (%) and CPU memory accesses (%) at 32, 64, 128 and 256 PEs; Total Wallclock Time Breakdown – CPU (%) vs. MPI (%) at 32, 64, 128 and 256 PEs]
12 December 2018Application Performance on Multi-core Processors 124
Summary
1. Introduction – DL_POLY and GAMESS-UK
¤ Background and Flagship codes for the UK’s CCP5 & CCP1
¤ Critical role of collaborative developments
2. HPC Technology - Processor & Interconnect Technologies
¤ The last 10 years of Intel dominance – Nehalem to Skylake
3. DL_POLY and GAMESS-UK Performance
¤ Benchmarks & Test Cases
¤ Overview of two decades of Code Performance: From
T3E/1200E to Intel Skylake clusters
4. Understanding Performance – Useful Tools
5. Acknowledgements and Summary
12 December 2018Application Performance on Multi-core Processors 125
Acknowledgements
• Ludovic Sauge, Enguerrand Petit, Martyn Foster, Nick Allsopp and John Humphries (Bull/ATOS) for informative discussions and access to the Skylake & EPYC clusters at the Bull HPC Competency Centre.
• David Cho, Gilad Shainer, Colin Bridger & Steve Davey
for access to and considerable assistance with the “Helios”
cluster at the HPC Advisory Council.
• Joshua Weage, Martin Hilgeman, Dave Coughlin, Gilles
Civario and Christopher Huggins for access to, and
assistance with, the variety of Skylake and EPYC SKUs at
the Dell Benchmarking Centre.
• Alin Marin Elena and Ilian Todorov (STFC) for discussions
around the DL_POLY software
• The DisCO programme at Daresbury Laboratory.
Application Performance on Multi-core Processors 12612 December 2018
127Application Performance on Multi-core Processors
Final Thoughts & Summary
I. Performance Benchmarks and Cluster Systems
a. Synthetic Code Performance: STREAM and IMB
b. Application Code Performance: DLPOLY, GROMACS,
AMBER,GAMESS_UK, VASP and Quantum Espresso
c. Interconnect Performance: Intel MPI and Mellanox’s HPCX
d. Processor Family and Interconnect – “core to core” and “node
to node” benchmarks
II. Impact of Environmental Issues in Cluster acceptance
tests.
a. Security patches, turbo mode and Throughput testing
III. Performance profile of DL_POLY and GAMESS-UK over
the past two decades
12 December 2018