Scaling, Throughput and an Historical Perspective:
Application Performance on Multi-core Processors
M.F. Guest≠, C.A. Kitchen≠, M. Foster† and D. Cho§
≠ Cardiff University, † Atos, § Mellanox Technologies
Outline
I. Performance Benchmarks and Cluster Systems
   a. Synthetic code performance: STREAM and IMB
   b. Application code performance: DL_POLY, GROMACS, AMBER, GAMESS-UK, VASP and Quantum Espresso
   c. Interconnect performance: Intel MPI and Mellanox's HPC-X
   d. Processor family and interconnect – "core to core" and "node to node" benchmarks
II. Impact of Environmental Issues in Cluster Acceptance Tests
   a. Security patches, turbo mode and throughput testing
III. Performance Profile of DL_POLY and GAMESS-UK over the Past Two Decades
IV. Acknowledgements and Summary
Contents
I. Review of parallel application performance featuring synthetics and end-user applications across a variety of clusters.
   ¤ End-user codes – DL_POLY, GROMACS, AMBER, NAMD, LAMMPS, GAMESS-UK, Quantum Espresso, VASP, CP2K, ONETEP & OpenFOAM
   • Ongoing focus on Intel's Xeon Scalable processors ("Skylake") and AMD's Naples EPYC processor, plus NVIDIA GPUs, including:
   ¤ Clusters with dual-socket nodes – Intel Xeon Gold 6148 (20c, 27.5 MB cache, 2.40 GHz) & Xeon Gold 6138 (20c, 27.5 MB cache, 2.00 GHz) + AMD Naples EPYC 7551 (2.00 GHz) & EPYC 7601 (2.20 GHz) CPUs.
   ¤ Updated review of Intel MPI and Mellanox HPC-X performance analysis.
II. How these benchmarks have been deployed in the framework of procurement and acceptance testing, dealing with a variety of issues, e.g. (a) security patches, turbo mode etc. and (b) throughput testing.
III. An historical perspective of two of these codes – DL_POLY and GAMESS-UK – briefly overviewing the development and performance profile of both over the past two decades.
The Xeon Skylake Architecture
• The architecture of Skylake is very different from that of the prior "Haswell" and "Broadwell" Xeon chips.
• Three basic variants now cover what was formerly the Xeon E5 and Xeon E7 product lines, with Intel converging the Xeon E5 and E7 chips into a single socket.
• Product segmentation – Platinum, Gold, Silver & Bronze – with 51 variants of the SP chip.
• Also custom versions requested by hyperscale and OEM customers.
• All of these chips differ from each other in a number of ways, including number of cores, clock speed, L3 cache capacity, number and speed of UltraPath links between sockets, number of sockets supported, main memory capacity, width of the AVX vector units, etc.
Intel Xeon: Westmere – Skylake

Feature | Xeon 5600 (Westmere-EP) | Xeon E5-2600 (Sandy Bridge-EP) | Xeon E5-2600 v4 (Broadwell-EP) | Intel Xeon Scalable Processor (Skylake)
Cores / Threads | Up to 6 cores / 12 threads | Up to 8 cores / 16 threads | Up to 22 cores / 44 threads | Up to 28 cores / 56 threads
Last-level cache | 12 MB | Up to 20 MB | Up to 55 MB | Up to 38.5 MB (non-inclusive)
Max memory channels, speed / socket | 3 × DDR3 channels, 1333 | 4 × DDR3 channels, 1600 | 4 channels of up to 3 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2400 MHz | 6 channels of up to 2 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2666 MHz
New instructions | AES-NI | AVX 1.0, 8 DP Flops/clock | AVX 2.0, 16 DP Flops/clock | AVX-512, 32 DP Flops/clock
QPI / UPI speed (GT/s) | 1 QPI channel @ 6.4 GT/s | 2 QPI channels @ 8.0 GT/s | 2 QPI channels @ 9.6 GT/s | Up to 3 UPI @ 10.4 GT/s
PCIe lanes / controllers / speed (GT/s) | 36 lanes PCIe 2.0 on chipset | 40 lanes / socket, integrated PCIe 3.0 | 40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s) | 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
Server / Workstation TDP | Server / Workstation: 130 W | Up to 130 W Server; 150 W Workstation | 55 – 145 W | 70 – 205 W
AMD® EPYC™ 7000 Series – SKU Map and FLOP/cycle

SKU | 7601 | 7551 | 7501 | 7451 | 7401 | 7351 | 7301
Freq (base, GHz) | 2.2 | 2.0 | 2.0 | 2.3 | 2.0 | 2.4 | 2.2
Turbo boost, all cores active (GHz) | 2.7 | 2.6 | 2.6 | 2.9 | 2.8 | 2.9 | 2.7
Turbo boost, one core active (GHz) | 3.2 | 3.0 | 3.0 | 3.2 | 3.0 | 2.9 | 2.7
Cores / socket | 32 | 32 | 32 | 24 | 24 | 16 | 16
L3 cache size | 64 MB (all SKUs)
Memory channels | 8 (all SKUs)
Memory frequency | 2667 MT/s (all SKUs)
TDP (W) | 180 | 180 | 155/170 | 180 | 155/170 | 155/170 | 155/170
FLOP/cycle by architecture:

Architecture | Sandy Bridge | Haswell | Skylake | EPYC
ISA* | AVX | AVX2 | AVX-512 | AVX2
op/cycle | 2 (1 ADD, 1 MUL) | 4 (2 FMA) | 4 (2 FMA) | 4 (2 ADD, 2 MUL)
Vector size (DP = 64 bits) | 4 | 4 | 8 | 2
FLOP/cycle | 8 | 16 | 32 | 8
* Instruction Set Architecture
The AMD EPYC natively supports only 2 × 128-bit AVX, so there is a large gap to Intel SKL with its 2 × 512-bit FMAs. Thus the FP peak on AMD is 4 × lower than on Intel SKL.
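To make this concrete, the per-socket floating-point peak follows from cores × clock × FLOP/cycle. The figures below are a worked illustration using the Gold 6148 and EPYC 7601 base clocks quoted in this talk, ignoring turbo and any AVX frequency offsets:

Peak (per socket) = cores × clock × FLOP/cycle
SKL Gold 6148: 20 × 2.4 GHz × 32 = 1,536 GFLOP/s
EPYC 7601:     32 × 2.2 GHz × 8  ≈   563 GFLOP/s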
• Zen cores
  ¤ Private L1/L2 cache
• CCX
  ¤ 4 Zen cores (or fewer)
  ¤ 8 MB shared L3 cache
• Zeppelin
  ¤ 2 CCXs (or fewer)
  ¤ 2 DDR4 channels
  ¤ 2 × 16 PCIe lanes
• Naples
  ¤ 4 Zeppelin SoC dies fully connected by Infinity Fabric
  ¤ 4 NUMA nodes!
EPYC Architecture – Naples, Zeppelin & CCX
[Diagram: a Naples package of four Zeppelin dies joined by Infinity Fabric coherent links; each die carries 2 CCXs (4 Zen cores with private L2 plus an 8 MB shared L3 per CCX), 2 DDR4 channels and 2 × 16 PCIe lanes.]
• Delivers 32 cores / 64 threads, 16 MB L2 cache and 64 MB L3 cache per socket.
• The design also means there are four NUMA nodes per socket, or eight NUMA nodes in a dual-socket system, i.e. memory latency differs depending on whether the data sits in memory attached to the die that needs it or to another die on the fabric.
• The key difference from Intel's Skylake-SP architecture is that AMD needs to go off-die within the same socket, whereas Intel stays on a single piece of silicon.
Intel Skylake and AMD EPYC Cluster Systems

Cluster / Configuration
• "Hawk" – Supercomputing Wales cluster at Cardiff comprising 201 nodes, totalling 8,040 cores and 46.08 TB of memory.
  CPU: 2 × Intel Xeon Gold 6148 @ 2.40 GHz, 20 cores each; RAM: 192 GB (384 GB on high-memory and GPU nodes); GPU: 26 × NVIDIA P100 GPUs with 16 GB of RAM on 13 nodes.
• "Helios" – 32-node HPC Advisory Council cluster running SLURM: Supermicro SYS-6029U-TR4 / Foxconn Groot 1A42USF00-600-G nodes; dual-socket Intel Xeon Gold 6138 @ 2.00 GHz; Mellanox ConnectX-5 EDR 100 Gb/s InfiniBand/VPI adapters with Socket Direct; Mellanox Switch-IB 2 SB7800 36-port 100 Gb/s EDR InfiniBand switches; 192 GB DDR4 2666 MHz RDIMMs per node.
• 20-node Bull|ATOS AMD EPYC cluster running SLURM: AMD EPYC 7551, 32 cores / 64 threads per CPU; base clock 2.0 GHz, max boost clock 3.0 GHz; default TDP 180 W; Mellanox EDR 100 Gb/s.
• 32-node Dell|EMC PowerEdge R7425 AMD EPYC cluster running SLURM: AMD EPYC 7601, 32 cores / 64 threads per CPU; base clock 2.2 GHz, max boost clock 3.2 GHz; default TDP 180 W; Mellanox EDR 100 Gb/s.
Baseline Cluster Systems
Cluster | Configuration

Intel Sandy Bridge clusters
"Raven" | 128 × Bull|ATOS b510 EP nodes, each with 2 × Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR InfiniBand.
Supercomputing Wales | 384 × Fujitsu CX250 EP nodes, each with 2 × Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR InfiniBand.

Intel Broadwell clusters
Dell PE R730/R630, Broadwell E5-2697A v4 2.6 GHz 16C | HPC Advisory Council "Thor" cluster, Dell PowerEdge R730/R630 36-node cluster: 2 × Xeon E5-2697A v4 @ 2.6 GHz, 16 cores, 145 W TDP, 40 MB cache, 256 GB DDR4 2400 MHz; Interconnect: ConnectX-4 EDR.
ATOS Broadwell E5-2680 v4 2.4 GHz 14C | 32-node cluster; node configuration: 2 × Xeon E5-2680 v4 @ 2.4 GHz, 14 cores, 120 W TDP, 35 MB cache, 128 GB DDR4 2400 MHz; Interconnect: Mellanox ConnectX-4 EDR and Intel OPA.

IBM Power8
IBM Power8 S822LC with Mellanox EDR | 20 cores, 3.49 GHz with performance CPU governor; 256 GB memory; 1 × IB (EDR) port; 2 × NVIDIA K80 GPUs; IBM PE (Parallel Environment); Operating system: RHEL 7.2 LE; Compilers: xlC 13.1.3, xlf 15.1.3, gcc 4.8.5 (Red Hat), gcc 5.2.1 (from IBM Advance Toolchain 9.0).
The Performance Benchmarks
• The test suite comprises both synthetics and end-user applications. Synthetics include the HPCC (http://icl.cs.utk.edu/hpcc/) and IMB (http://software.intel.com/en-us/articles/intel-mpi-benchmarks) benchmarks, IOR and STREAM. A representative IMB launch is sketched below.
• A variety of "open source" and commercial end-user application codes:
  – GROMACS, LAMMPS, AMBER, NAMD, DL_POLY Classic & DL_POLY 4 (molecular dynamics)
  – Quantum Espresso, SIESTA, CP2K, ONETEP, CASTEP and VASP (ab initio materials properties)
  – NWChem, GAMESS-US and GAMESS-UK (molecular electronic structure)
• These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed, e.g. memory bandwidth and latency, node floating-point performance, interconnect performance (both latency and bandwidth) and sustained I/O performance.
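A minimal sketch of how the IMB runs reported here can be launched (Intel MPI syntax; the rank counts and -ppn values are illustrative):

# Inter-node PingPong: one rank per node across two nodes
mpirun -np 2 -ppn 1 IMB-MPI1 PingPong

# Alltoallv on 128 ranks (as in the collectives comparison below)
mpirun -np 128 -ppn 32 IMB-MPI1 Alltoallv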
EPYC – Compiler and Run-time Options

Compilation: Intel Compilers 2018, Intel MPI 2017 Update 3, FFTW-3.3.5
  Intel SKL:  -O3 -xCORE-AVX512
  AMD EPYC:   -O3 -xAVX2
  AMD EPYC:   -axCORE-AVX-I

# Preload the amd-cputype library to navigate the Intel "Genuine CPU" test
module use /opt/amd/modulefiles
module load AMD/amd-cputype/1.0
export LD_PRELOAD=$AMD_CPUTYPE_LIB
export OMP_PROC_BIND=true
# export KMP_AFFINITY=granularity=fine
export I_MPI_DEBUG=5
export MKL_DEBUG_CPU_TYPE=5

STREAM (Atos clusters):
module load AMD/amd-cputype/1.0
icc -o stream.x stream.c -DSTATIC -Ofast -xCORE-AVX2 -qopenmp \
    -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=large -shared-intel
export OMP_NUM_THREADS=16
export OMP_PROC_BIND=true
export OMP_PLACES="{0:4:1}:16:4"   # 1 thread per CCX
export OMP_DISPLAY_ENV=true

STREAM (Dell|EMC EPYC):
export OMP_NUM_THREADS=32
export OMP_PROC_BIND=true
export OMP_DISPLAY_ENV=true
export OMP_PLACES="{0},{16},{8},{24},{2},{18},{10},{26},{4},{20},{12},{28},{6},{22},{14},{30},{1},{17},{9},{25},{3},{19},{11},{27},{5},{21},{13},{29},{7},{23},{15},{31}"
Memory B/W – STREAM TRIAD performance [Rate (MB/s)], full node (OMP_NUM_THREADS set per node, KMP_AFFINITY=physical):

System | TRIAD (MB/s)
Bull b510 "Raven" SNB e5-2670 / 2.6 GHz | 74,309
ClusterVision IVB e5-2650v2 2.6 GHz | 93,486
Dell R730 HSW e5-2697v3 2.6 GHz (T) | 118,605
Dell HSW e5-2660v3 2.6 GHz (T) | 114,367
Thor BDW e5-2697A v4 2.6 GHz (T) | 132,035
ATOS BDW e5-2680v4 2.4 GHz (T) | 128,083
Mellanox SKL Gold 6138 2.0 GHz (T) | 169,830
Dell SKL Gold 6142 2.6 GHz (T) | 185,863
"Hawk" Atos SKL Gold 6148 2.4 GHz | 196,721
IBM Power8 S822LC 2.92 GHz | 184,087
AMD EPYC 7551 2.0 GHz | 303,797
AMD EPYC 7601 2.2 GHz | 279,640
Memory B/W – STREAM TRIAD performance per core [Rate (MB/s)]:

System | TRIAD per core (MB/s)
Bull b510 "Raven" SNB e5-2670 / 2.6 GHz | 4,644
ClusterVision IVB e5-2650v2 2.6 GHz | 5,843
Dell R730 HSW e5-2697v3 2.6 GHz (T) | 4,236
Dell HSW e5-2660v3 2.6 GHz (T) | 5,718
Thor BDW e5-2697A v4 2.6 GHz (T) | 4,126
ATOS BDW e5-2680v4 2.4 GHz (T) | 4,574
Mellanox SKL Gold 6138 2.0 GHz (T) | 4,246
Dell SKL Gold 6142 2.6 GHz (T) | 5,808
"Hawk" Atos SKL Gold 6148 2.4 GHz | 4,918
IBM Power8 S822LC 2.92 GHz | 9,204
AMD EPYC 7551 2.0 GHz | 4,747
AMD EPYC 7601 2.2 GHz | 4,369
MPI Performance – PingPong (IMB benchmark, 1 PE per node)
[Chart: latency and bandwidth (MB/s) versus message length (bytes) for the ATOS AMD EPYC 7601 EDR, Intel SKL Gold 6148 OPA, Dell SKL Gold 6150 EDR, IBM Power8 S822LC EDR, Thor BDW e5-2697A v4 EDR, Intel BDW e5-2690v4 OPA, Dell e5-2660v3 OPA, Bull HSW E5-2680v3 Connect-IB, Dell R720 e5-2680v2 Connect-IB, Azure A9 (e5-2670) IB RDMA and Merlin Xeon E5472 QC IB (MVAPICH2 1.4) systems.]
Note: export I_MPI_DAPL_TRANSLATION_CACHE=1 enables the memory-resident cache feature in DAPL.
MPI Collectives – Alltoallv (128 PEs), IMB benchmark
[Chart: measured time (µsec) versus message length (bytes) across the SNB, BDW, SKL and EPYC clusters listed above, over EDR and OPA interconnects.]
EPYC performance with Intel MPI is ~4-6 × worse than that with the SKL processors.
The time-consuming messages are those called by Alltoall & Alltoallv (IPM profile).

Application Performance on Multi-core Processors
I.1 The Codes: DL_POLY, GROMACS, NAMD, LAMMPS, GAMESS, NWChem, GAMESS-UK, ONETEP, VASP, SIESTA, CASTEP, Quantum Espresso, CP2K – on a variety of HPC systems.
Allinea (ARM) Performance Reports
Allinea Performance Reports provides a mechanism to characterise and understand the performance of HPC application runs through a single-page HTML report.
• Based on Allinea MAP's adaptive sampling technology, which keeps the data volumes collected and the application overhead low.
• Modest application slowdown (ca. 5%) even with thousands of MPI processes.
• Runs on existing codes: a single command is added to the execution script.
• If submitted through a batch queuing system, the submission script is modified to load the Allinea module and add the 'perf-report' command in front of the required mpiexec command, e.g.
  perf-report mpiexec -n 4 $code
• The report summary characterises how the application's wallclock time was spent, broken down into CPU, MPI and I/O.
• All examples updated on the Broadwell Mellanox cluster (E5-2697A v4).
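A minimal sketch of how this looks in a SLURM submission script (the module name and resource values are illustrative, not taken from the clusters above):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00

# Load the site-specific Performance Reports module
module load allinea-reports

# Prefix the usual MPI launch line; a single-page HTML/text report
# is written alongside the job output when the run completes.
perf-report mpirun -np $SLURM_NTASKS $code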
Molecular Simulation I: DL_POLY
Molecular dynamics codes: AMBER, DL_POLY, CHARMM, NAMD, LAMMPS, GROMACS etc.
DL_POLY – developed as the CCP5 parallel MD code by W. Smith, T.R. Forester and I. Todorov
• UK CCP5 + international user community
• DL_POLY_Classic (replicated data) and DL_POLY_3 & _4 (distributed data – domain decomposition)
• Areas of application: liquids, solutions, spectroscopy, ionic solids, molecular crystals, polymers, glasses, membranes, proteins, metals, solid and liquid interfaces, catalysis, clathrates, liquid crystals, biopolymers, polymer electrolytes.
DL_POLY Classic – NaCl Simulation (27,000 atoms; 500 time steps)
[Chart: performance at 32-256 PEs, relative to the Fujitsu HTC X5650 2.67 GHz 6-C (16 PEs), for the Fujitsu CX250 SNB e5-2670 QDR, Intel BDW e5-2690v4 OPA, "Helios" SKL Gold 6138 EDR, "Hawk" Atos SKL Gold 6148 EDR, Dell SKL Gold 6150 EDR, ATOS AMD EPYC 7551 EDR and Dell|EMC AMD EPYC 7601 EDR clusters.]
DL_POLY 4 – Distributed Data (Domain Decomposition)
W. Smith and I. Todorov
http://www.scd.stfc.ac.uk//research/app/ccg/software/DL_POLY/44516.aspx
• Distribute atoms and forces across the nodes
  ¤ More memory efficient; can address much larger cases (10^5-10^7 atoms)
• SHAKE and short-range forces require only neighbour communication
  ¤ Communications scale linearly with the number of nodes
• Coulombic energy remains global
  ¤ Adopts the Smoothed Particle Mesh Ewald (SPME) scheme, which includes a Fourier transform of the smoothed charge density (reciprocal-space grid typically 64×64×64 to 128×128×128)
Benchmarks (a representative launch is sketched below):
1. NaCl simulation: 216,000 ions, 200 time steps, cutoff = 12 Å
2. Gramicidin in water (rigid bonds + SHAKE): 792,960 ions, 50 time steps
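A minimal sketch of launching one of these DL_POLY 4 benchmarks, assuming the standard DLPOLY.Z executable with the usual CONTROL, CONFIG and FIELD input files in the working directory (the rank count is illustrative):

# 256-rank run; DL_POLY 4 reads CONTROL/CONFIG/FIELD from the current
# directory and writes OUTPUT, STATIS and REVCON on completion
mpirun -np 256 ./DLPOLY.Z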
DL_POLY 4 – Gramicidin Simulation (792,960 atoms; 50 time steps)
[Chart: performance at 64-256 PEs, relative to the Fujitsu CX250 e5-2670 2.6 GHz 8-C (32 PEs), for the Fujitsu CX250 SNB QDR, Bull|ATOS BDW e5-2680v4 OPA, Thor Dell|EMC e5-2697A v4 EDR, "Helios" SKL Gold 6138 EDR, Intel SKL Platinum 8170 OPA, Dell|EMC SKL Gold 6130 OPA, Intel SKL Gold 6148 OPA, "Hawk" Atos SKL Gold 6148 EDR, Dell|EMC SKL Gold 6142 EDR and Dell|EMC SKL Gold 6150 EDR clusters.]
The SKL 6142 2.6 GHz is ~1.06 × the e5-2697v4 2.6 GHz.
DL_POLY 4 – Gramicidin Simulation – EPYC (792,960 atoms; 50 time steps)
[Chart: performance at 64-256 PEs, relative to the Fujitsu CX250 e5-2670 2.6 GHz 8-C (32 PEs), with the ATOS AMD EPYC 7551 EDR and Dell|EMC AMD EPYC 7601 EDR clusters added to the SNB, BDW and SKL systems above.]
DL_POLY 4 – Gramicidin Simulation Performance Report (Smoothed Particle Mesh Ewald scheme)
[Charts: total wallclock time breakdown (CPU % vs MPI %) and CPU time breakdown (scalar numeric ops, vector numeric ops, memory accesses, %) at 32, 64, 128 and 256 PEs.]
"DL_POLY_4 and Xeon Phi: Lessons Learnt", Alin Marin Elena, Christian Lalanne, Victor Gamayunov, Gilles Civario, Michael Lysaght and Ilian Todorov.
Molecular Simulation II: GROMACS
GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics package designed for simulations of proteins, lipids and nucleic acids [University of Groningen].
• Single and double precision
• Efficient GPU implementations
Versions under test:
• Version 4.6.1 – 5 March 2013
• Version 5.0.7 – 14 October 2015
• Version 2016.3 – 14 March 2017
• Version 2018.2 – 14 June 2018 (optimised for "Hawk" by Ade Fewings)
Berk Hess et al., "GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation", Journal of Chemical Theory and Computation 4 (3): 435-447.
http://manual.gromacs.org/documentation/
GROMACS Benchmark Cases
Ion channel system
• The 142k-particle ion channel system is the membrane protein GluCl, a pentameric chloride channel embedded in a DOPC membrane and solvated in TIP3P water, using the Amber ff99SB-ILDN force field. This system is a challenging parallelization case due to its small size, but is one of the most wanted target sizes for biomolecular simulations.
Lignocellulose
• GROMACS Test Case B from the UEA Benchmark Suite: a model of cellulose and lignocellulosic biomass in an aqueous solution. This system of 3.3M atoms is inhomogeneous, and uses reaction-field electrostatics instead of PME and therefore should scale well.
A representative mdrun launch for these cases is sketched below.
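A minimal sketch of the MPI launch for these cases (GROMACS 2018.x MPI build; the input file name and thread settings are illustrative):

# 128-rank single-precision run of the ion channel case, one OpenMP
# thread per rank; -resethway/-noconfout give cleaner benchmark timings
mpirun -np 128 gmx_mpi mdrun -s ion_channel.tpr -deffnm ion_channel \
       -ntomp 1 -maxh 0.5 -resethway -noconfout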
GROMACS – Ion-channel Performance Report
[Charts: total wallclock time breakdown (CPU % vs MPI %) and CPU time breakdown (scalar numeric ops, vector numeric ops, memory accesses, %) at 32, 64, 128 and 256 PEs.]
GROMACS – Ion Channel Simulation (142k-particle system, single precision)
[Chart: performance (ns/day) at 64, 128 and 256 PEs on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz, EDR, plus dual-P100 GPU nodes), comparing GROMACS 4.6.1, 5.0, 2016.3 (single-precision AVX) and 2018.2; throughput ranges from ~45 ns/day at 64 PEs to ~192 ns/day at 256 PEs across the versions.]
GROMACS – Ion Channel Simulation: Impact of Single Precision (GROMACS 5.0.7, 142k-particle system)
[Chart: performance (ns/day) at 64-256 PEs for the Fujitsu CX250 SNB QDR, Thor Dell|EMC BDW e5-2697A v4 EDR (HPC-X), Dell|EMC SKL Gold 6130 OPA, Intel SKL Gold 6148 OPA, Dell|EMC SKL Gold 6142 EDR and Dell|EMC SKL Gold 6150 EDR clusters in double precision, and for the "Helios" SKL Gold 6138 EDR, Intel SKL Gold 6148 OPA and "Hawk" Atos SKL Gold 6148 EDR clusters in single precision {S}.]
GROMACS – GPU Performance: Ion Channel Simulation (GROMACS 2018.2, 142k-particle system)
[Chart: relative performance on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz, EDR) for CPU-only runs at 64-320 PEs and for dual-P100 GPU nodes (N=1, 2 GPUs up to N=6, 12 GPUs).]
GROMACS – Lignocellulose Simulation (3,316,463 atoms, reaction-field electrostatics instead of PME; single precision)
[Chart: performance (ns/day) at 64, 128 and 256 PEs on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz, EDR, plus dual-P100 GPU nodes), comparing GROMACS 4.6.1, 5.0, 2016.3 and 2018.2; throughput reaches ~13.5 ns/day at 256 PEs.]
GROMACS – Lignocellulose Simulation: Impact of Single Precision (GROMACS 5.0.7, 3,316,463 atoms, reaction-field electrostatics)
[Chart: performance (ns/day) at 64-256 PEs for the SNB, BDW and SKL clusters in double precision, and for the "Helios" SKL Gold 6138 EDR, Intel SKL Gold 6148 OPA and "Hawk" Atos SKL Gold 6148 EDR clusters in single precision {S}.]
GROMACS – GPU Performance: Lignocellulose Simulation (GROMACS 2018.2, 3,316,463 atoms, reaction-field electrostatics)
[Chart: relative performance on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz, EDR) for CPU-only runs at 64-320 PEs and for dual-P100 GPU nodes (N=1, 2 GPUs up to N=6, 12 GPUs).]
Molecular Simulation III: The AMBER Benchmarks
• AMBER 16/1 is used, specifically PMEMD and GPU-accelerated PMEMD.
• M01 benchmark: Major Urinary Protein (MUP) + IBM ligand (21,736 atoms)
• M06 benchmark: cluster of six MUPs (134,013 atoms)
• M27 benchmark: cluster of 27 MUPs (657,585 atoms)
• M45 benchmark: cluster of 45 MUPs (932,751 atoms)
All test cases run 30,000 steps × 2 fs = 60 ps of simulation time, with periodic boundary conditions, constant pressure and T = 300 K. Position data are written every 500 steps.
R. Salomon-Ferrer, D.A. Case, R.C. Walker, "An overview of the Amber biomolecular simulation package", WIREs Comput. Mol. Sci. 3, 198-210 (2013).
D.A. Case, T.E. Cheatham III, T. Darden, H. Gohlke, R. Luo, K.M. Merz Jr., A. Onufriev, C. Simmerling, B. Wang and R. Woods, "The Amber biomolecular simulation programs", J. Comput. Chem. 26, 1668-1688 (2005).
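A representative launch for the CPU and GPU PMEMD runs (the input and topology file names are illustrative):

# MPI (CPU) run of the M45 benchmark on 256 ranks
mpirun -np 256 pmemd.MPI -O -i mdin -p m45.prmtop -c m45.inpcrd -o m45.out

# GPU-accelerated run on a dual-P100 node
mpirun -np 2 pmemd.cuda.MPI -O -i mdin -p m45.prmtop -c m45.inpcrd -o m45_gpu.out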
AMBER – SKL vs. SNB: M06, M27 and M45
[Chart: relative performance of the SKL 6148 2.4 GHz / EDR cluster versus the SNB e5-2670 2.6 GHz / QDR cluster at 64-320 PEs for the M06, M27 and M45 benchmarks; the speedup factors lie between roughly 1.1 and 1.7.]
AMBER – GPU Performance: M45 Simulation (cluster of 45 Major Urinary Proteins + IBM ligand, 932,751 atoms)
[Chart: performance relative to 64 "Raven" SNB e5-2670 PEs for CPU runs on the "Hawk" Atos cluster (SKL Gold 6148 2.4 GHz, EDR) at 64-320 PEs, reaching ~3.5 ×, and for single-node dual-P100 GPU runs, which reach ~2.7-4.2 ×.]
GAMESS-UK – Moving to Distributed Data: the MPI/ScaLAPACK Implementation of the GAMESS-UK SCF/DFT Module
• Pragmatic approach to the replicated-data constraints:
  ¤ MPI-based tools (such as ScaLAPACK) are used in place of Global Arrays
  ¤ All data structures except those required for the Fock matrix build (F, P) are fully distributed
• The partially distributed model was chosen because, in the absence of efficient one-sided communications, it is difficult to load balance a distributed Fock matrix build efficiently.
• Obvious drawback - some large replicated data structures are required.
  ¤ These are kept to a minimum: for a closed-shell HF or DFT calculation only two replicated matrices are required, 1 × Fock and 1 × density (doubled for UHF).
"The GAMESS-UK electronic structure package: algorithms, developments and applications", M.F. Guest, I.J. Bush, H.J.J. van Dam, P. Sherwood, J.M.H. Thomas, J.H. van Lenthe, R.W.A. Havenith, J. Kendrick, Mol. Phys. 103, No. 6-8, 2005, 719-747.
GAMESS-UK Performance – Zeolite Y Cluster (SioSi7, DZVP (Si,O), DZVP2 (H), B3LYP, 3975 GTOs)
[Chart: performance at 128 and 256 PEs, relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs), across the SNB, Haswell, Broadwell, SKL (Gold 6130/6138/6142/6148/6150) and IBM Power8 clusters with QDR, Connect-IB, EDR and OPA interconnects.]
The SKL 6142 2.6 GHz is ~1.05 × the e5-2697v4 2.6 GHz.
GAMESS-UK MPI/ScaLAPACK Code – EPYC Performance (Zeolite Y cluster, SioSi7, DZVP (Si,O), DZVP2 (H), B3LYP, 3975 GTOs)
[Chart: performance at 128 and 256 PEs, relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs), with the ATOS AMD EPYC 7551 EDR and Dell|EMC AMD EPYC 7601 EDR clusters added to the Intel and IBM Power8 systems above.]
GAMESS-UK.MPI – DFT Performance Report (cyclosporin, 6-31G** basis, 1855 GTOs, DFT B3LYP)
[Charts: total wallclock time breakdown (CPU % vs MPI %) and CPU time breakdown (scalar numeric ops, vector numeric ops, memory accesses, %) at 32, 64, 128 and 256 PEs.]
Computational Materials – Advanced Materials Software
• VASP – performs ab initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane-wave basis set.
• Quantum Espresso – an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale. It is based on density-functional theory (DFT), plane waves and pseudopotentials.
• SIESTA – an O(N) DFT code for electronic-structure calculations and ab initio molecular dynamics simulations of molecules and solids. It uses norm-conserving pseudopotentials and a linear combination of numerical atomic orbitals (LCAO) basis set.
• CP2K – a program to perform atomistic and molecular simulations of solid-state, liquid, molecular and biological systems. It provides a framework for different methods, e.g. DFT using a mixed Gaussian and plane-waves approach (GPW), and classical pair and many-body potentials.
• ONETEP (Order-N Electronic Total Energy Package) – a linear-scaling code for quantum-mechanical calculations based on DFT.
Quantum Espresso
Quantum Espresso is an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale, based on density-functional theory, plane waves and pseudopotentials. Transition from v5.2 to v6.1.
Capabilities: ground-state calculations; structural optimization; transition states and minimum energy paths; ab initio molecular dynamics; response properties (DFPT); spectroscopic properties; quantum transport.

Benchmark details:
DEISA AU112 | Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions: (180, 90, 288)
PRACE GRIR443 | Carbon-Iridium complex (C200Ir243), 2,233,063 G-vectors, 8 k-points, FFT dimensions: (180, 180, 192)
Quantum Espresso – Au112 (Version 5.2)
[Chart: performance at 32-320 PEs, relative to the Fujitsu e5-2670 2.6 GHz 8-C (32 PEs), for the Fujitsu CX250 SNB QDR, Bull|ATOS BDW e5-2680v4 OPA, Thor Dell|EMC e5-2697A v4 (EDR IMPI DAPL and OPA), Dell|EMC SKL Gold 6130 OPA, Dell|EMC SKL Gold 6142 EDR and Intel SKL Gold 6148 OPA clusters; speedups reach ~8.8 at the highest core counts.]
Quantum Espresso – Au112 Performance Report (Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions: (180, 90, 288))
[Charts: total wallclock time breakdown (CPU % vs MPI %) and CPU time breakdown (scalar numeric ops, vector numeric ops, memory accesses, %) at 32, 64, 128 and 256 PEs.]
Parallelism in Quantum Espresso
• Quantum ESPRESSO implements several MPI parallelization levels, with processors organized in a hierarchy of groups identified by different MPI communicator levels. The group hierarchy is:
• Images: processors are divided into different "images", each corresponding to a different SCF or linear-response calculation, loosely coupled to the others.
• Pools and bands: each image can be sub-partitioned into "pools", each taking care of a group of k-points. Each pool is sub-partitioned into "band groups", each taking care of a group of Kohn-Sham orbitals.
• PW parallelisation: orbitals in the PW basis set, as well as charges and density in either reciprocal or real space, are distributed across processors. All linear-algebra operations on arrays of PW / real-space grids are automatically and effectively parallelized.
• Tasks: allows good parallelization of the 3D FFT when the number of CPUs exceeds the number of FFT planes; FFTs on Kohn-Sham states are redistributed to "task groups".
Parallelism in Quantum Espresso
• Linear-algebra group: a further level, independent of the PW or k-point parallelization, is the parallelization of subspace diagonalization / iterative orthonormalization.
• Communications: images and pools are loosely coupled (CPUs communicate between different images and pools only once in a while), whereas CPUs within each pool are tightly coupled and communications are significant.
• Choosing parameters: the number of CPUs in each group is controlled by the command-line switches -nimage, -npools, -nband, -ntg, and -ndiag or -northo. Thus for Au112 the following command line is used:
  mpirun $code -inp ausurf.in -npool $NPOOL -ntg $NT -ndiag $ND
• This executes an energy calculation on $NP processors, with the k-points distributed across $NPOOL pools of $NP/$NPOOL processors each, the 3D FFT performed using $NT task groups, and the diagonalization of the subspace Hamiltonian distributed over a square grid of $ND processors.
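A concrete instance of this launch for the Au112 case (the values are illustrative, corresponding to a 128-rank run with two k-point pools):

# 128 MPI ranks: 2 pools of 64 ranks each, 4 FFT task groups,
# subspace diagonalisation on a square 8 x 8 = 64-rank grid
mpirun -np 128 pw.x -inp ausurf.in -npool 2 -ntg 4 -ndiag 64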
Impact of npool – Au112 (Version 5.2)
[Chart: relative performance at 32-320 PEs on "Hawk" (SKL Gold 6148, EDR) and "Raven" (SNB e5-2670, QDR), comparing runs with NPOOL=1 against runs with NPOOL=2 and ND=nP.]
Impact of npool – GRIR443
[Chart: relative performance at 32-320 PEs on "Hawk" (SKL Gold 6148, EDR) and "Raven" (SNB e5-2670, QDR), comparing runs with NPOOL=1 against runs with NPOOL=2 and ND=nP.]
Quantum Espresso – GRIR443
[Chart: performance at 96-160 PEs, relative to the Fujitsu e5-2670 2.6 GHz 8-C (96 PEs), for the Fujitsu CX250 SNB QDR, Bull|ATOS BDW e5-2680v4 OPA, Thor Dell|EMC e5-2697A v4 EDR (IMPI DAPL), Dell|EMC SKL Gold 6130 OPA, Dell|EMC SKL Gold 6142 EDR, Dell|EMC SKL Gold 6150 EDR and ATOS AMD EPYC 7601 EDR clusters.]
VASP – Vienna Ab-initio Simulation Package
VASP (5.4.4) performs ab initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane-wave basis set.

Zeolite benchmark
• Zeolite with the MFI structure unit cell, running a single-point calculation with a plane-wave cut-off of 400 eV using the PBE functional.
• 2 k-points; maximum number of plane waves: 96,834
• FFT grid: NGX=65, NGY=65, NGZ=43, giving a total of 181,675 points

Pd-O benchmark
• Pd-O complex (Pd75O12), 5×4 3-layer supercell, running a single-point calculation with a plane-wave cut-off of 400 eV. Uses the RMM-DIIS algorithm for the SCF and is calculated in real space.
• 10 k-points; maximum number of plane waves: 34,470
• FFT grid: NGX=31, NGY=49, NGZ=45, giving a total of 68,355 points

Benchmark details:
MFI Zeolite | Zeolite (Si96O192), 2 k-points, FFT grid: (65, 65, 43); 181,675 points
Pd-O complex | Palladium-Oxygen complex (Pd75O12), 10 k-points, FFT grid: (31, 49, 45); 68,355 points
VASP 5.4.4 – Pd-O Benchmark (Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points)
[Chart: performance at 64-256 PEs, relative to the Fujitsu CX250 SNB e5-2670 2.6 GHz (32 PEs), for the SNB, BDW (OPA and EDR/HPC-X), "Helios" SKL Gold 6138 (default and HPCX 2.3.0), Dell|EMC SKL Gold 6130/6142/6150, Intel SKL Gold 6148 OPA and "Hawk" Atos SKL Gold 6148 EDR clusters.]
VASP 5.4.4 – Pd-O Benchmark: Parallelisation over k-points (Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points)
[Chart: performance at 64-256 PEs, relative to the Fujitsu CX250 SNB e5-2670 2.6 GHz (32 PEs), with KPAR=2 runs on "Helios" (SKL Gold 6138, EDR) and "Hawk" (SKL Gold 6148, EDR) shown alongside the default settings and the Dell|EMC AMD EPYC 7601 EDR (KPAR=2) result.]
NPEs | KPAR | NPAR
64   | 2    | 2
128  | 2    | 4
256  | 2    | 8
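The k-point parallelisation above is set through standard INCAR tags; a minimal sketch for the 128-core case in the table (values follow the table, comments are explanatory):

KPAR = 2    ! split the ranks into 2 groups, each handling half the k-points
NPAR = 4    ! band parallelisation within each k-point group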
VASP – Pd-O Benchmark Performance Report (Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points)
[Charts: total wallclock time breakdown (CPU % vs MPI %) and CPU time breakdown (scalar numeric ops, vector numeric ops, memory accesses, %) at 32, 64, 128 and 256 PEs.]
VASP 5.4.4 – Zeolite Benchmark (Zeolite (Si96O192) with the MFI structure unit cell; single-point calculation, 400 eV plane-wave cut-off, PBE functional; maximum number of plane waves: 96,834; 2 k-points; FFT grid: (65, 65, 43); 181,675 points)
[Chart: performance at 64-256 PEs, relative to the Fujitsu CX250 SNB e5-2670 2.6 GHz (64 PEs), for the SNB, BDW (EDR/HPC-X and OPA), "Helios" SKL Gold 6138, Dell|EMC SKL Gold 6130/6142/6150, Intel SKL Gold 6148 OPA and "Hawk" Atos SKL Gold 6148 EDR clusters.]
VASP 5.4.4 – Zeolite Benchmark: Parallelisation over k-points
[Chart: as above, with KPAR=2 runs on "Helios" (SKL Gold 6138, EDR) and "Hawk" (SKL Gold 6148, EDR) added; performance at 64-256 PEs, relative to the Fujitsu CX250 SNB e5-2670 2.6 GHz (64 PEs).]
VASP 5.4.1 – Zeolite Benchmark on EPYC (Zeolite (Si96O192); single-point calculation, 400 eV plane-wave cut-off, PBE functional; 96,834 plane waves, 2 k-points, FFT grid: (65, 65, 43); 181,675 points)
[Chart: performance at 64-128 PEs, relative to the Fujitsu CX250 SNB e5-2670 2.6 GHz (64 PEs), for the Dell|EMC SKL Gold 6130 OPA, Intel SKL Gold 6148 OPA, ATOS AMD EPYC 7601 EDR and ATOS AMD EPYC 7601 EDR (16 cores/socket) configurations; the SKL systems reach ~2.6-2.7 at 128 PEs compared with ~1.6-2.1 for the EPYC runs.]
Application Performance on Multi-core Processors
I.2 Selecting Fabrics and Optimising Performance: Intel MPI and Mellanox HPC-X
Selecting Fabrics – MPI Optimisation
• The Intel MPI Library can select a communication fabric at runtime without the application having to be recompiled. By default it automatically selects the most appropriate fabric based on both the software and hardware configuration, i.e. in most cases you do not have to select a fabric manually.
• Specifying a particular fabric can nevertheless boost performance. Fabrics can be specified for both intra-node and inter-node communications; the available fabrics are listed below.
• For inter-node communication, Intel MPI uses the first available fabric from the default fabric list. The list is defined automatically for each hardware and software configuration (see I_MPI_FABRICS_LIST).
• For most configurations this list is: dapl, ofa, tcp, tmi, ofi.

Fabric | Network hardware and software used
shm | Shared memory (for intra-node communication only).
dapl | Direct Access Programming Library (DAPL) fabrics, such as InfiniBand (IB) and iWarp (through DAPL).
ofa | OpenFabrics Alliance (OFA) fabrics, e.g. InfiniBand (through OFED verbs).
tcp | TCP/IP network fabrics, such as Ethernet and InfiniBand (through IPoIB).
tmi | Tag Matching Interface (TMI) fabrics, such as Intel True Scale Fabric, Intel Omni-Path Architecture and Myrinet (through TMI).
ofi | OpenFabrics Interfaces (OFI)-capable fabrics, such as Intel True Scale Fabric, Intel Omni-Path Architecture, IB and Ethernet (through the OFI API).
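For example, the fabric choice can be pinned explicitly through environment variables before the launch (a sketch; the best setting depends on the cluster):

# Shared memory within a node, DAPL (InfiniBand) between nodes
export I_MPI_FABRICS=shm:dapl

# Or restrict the automatic selection order
export I_MPI_FABRICS_LIST=dapl,ofa,tcp

mpirun -np 256 ./application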
Mellanox HPC-X Toolkit
The Mellanox HPC-X Toolkit provides an MPI, SHMEM and UPC software suite for HPC environments. It delivers "enhancements to significantly increase the scalability & performance of message communications in the network". It includes:
  ¤ Complete MPI, SHMEM and UPC packages, including the Mellanox MXM and FCA acceleration engines
  ¤ Offload of collective communications from the MPI process onto the Mellanox interconnect hardware
  ¤ Maximised application performance with the underlying hardware architecture; optimized for Mellanox InfiniBand and VPI interconnects
  ¤ Increased application scalability and resource efficiency
  ¤ Multiple transport support, including RC, DC and UD
  ¤ Intra-node shared-memory communication
• The performance comparison was conducted on the Mellanox SKL 6138 / 2.00 GHz EDR-based "Helios" cluster.
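Switching a run onto HPC-X typically means loading the HPC-X environment and launching through its Open MPI with the UCX transport and accelerated collectives (a sketch; the module name and MCA parameters are indicative and site-specific):

module load hpcx        # provides Open MPI built against UCX and HCOLL

# UCX for point-to-point transport (RC/DC/UD), HCOLL/FCA for collectives
mpirun -np 512 -mca pml ucx -mca coll_hcoll_enable 1 ./application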
Application Performance & MPI Libraries
A performance comparison exercise was undertaken to capture the impact of the latest releases of Intel MPI and Mellanox's HPC-X.
  ¤ In 2017, on the Mellanox E5-2697A v4 EDR-based "Thor" cluster, Intel MPI and Mellanox HPC-X were compared for the following applications (and associated data sets):
    – DL_POLY 4 (NaCl and Gramicidin) & GROMACS (ion channel and lignocellulose)
    – VASP (Pd-O complex & Zeolite system)
    – Quantum ESPRESSO (Au112 and GRIR443)
    – OpenFOAM (Cavity 3D-3M)
  ¤ We simply compared the time to solution for each application, i.e. T(HPC-X) / T(Intel MPI), across multiple core counts.
Application Performance & MPI Libraries
• Optimum performance was found to be a function of both application and core count.
  ¤ With the materials-based codes and OpenFOAM, and at high core count (> 512 cores), HPC-X exhibited a clear performance advantage over Intel MPI.
  ¤ This was not the case for the classical MD codes, where Intel MPI showed a distinct advantage at all but the highest core counts.
• The exercise was repeated on the "Helios" partition of the Skylake cluster using the latest releases of HPC-X, v2.2.0 and 2.3.0-pre.
http://www.mellanox.com/related-docs/prod_acceleration_software/PB_HPC-X.pdf
DL_POLY 4 – Intel MPI vs. HPC-X – December 2017
[Chart: % Intel MPI performance vs. HPC-X (85%-120%) against processor core count (up to 1,024 cores) for the DL_POLY 4 NaCl and Gramicidin test cases.]
Intel MPI is seen to outperform HPC-X for the DL_POLY 4 NaCl test case at all core counts, and at lower core counts for Gramicidin.
DL_POLY 4 – Intel MPI vs. HPC-X – December 2018
[Chart: % Intel MPI performance vs. HPC-X (85%-120%) against processor core count (up to 1,024 cores) for the DL_POLY 4 NaCl and Gramicidin test cases.]
The advantage of Intel MPI is now reduced at most core counts for both NaCl and Gramicidin.
GROMACS – Intel MPI vs. HPC-X – December 2017
[Chart: % Intel MPI performance vs. HPC-X (95%-125%) against processor core count (up to 1,024 cores) for the ion channel and lignocellulose cases.]
At no point does the HPC-X run of GROMACS outperform that using Intel MPI.
GROMACS – Intel MPI vs. HPC-X – December 2018
[Chart: % Intel MPI performance vs. HPC-X (95%-125%) against processor core count (up to 1,024 cores) for the ion channel and lignocellulose cases.]
Similar findings to DL_POLY, with the advantage of Intel MPI over the HPC-X run of GROMACS significantly reduced compared to the 2017 findings.
VASP 5.4.1 – Intel MPI vs. HPC-X – December 2017
[Chart: % Intel MPI performance vs. HPC-X (60%-120%) against processor core count (up to 512 cores) for the Palladium complex and Zeolite cluster cases.]
Significantly different to the classical MD codes: HPC-X is seen to outperform Intel MPI for the Zeolite cluster at all core counts, and at larger core counts for the Palladium complex.
VASP 5.4.4 – Intel MPI vs. HPC-X – December 2018
[Chart: % Intel MPI performance vs. HPC-X (60%-120%) against processor core count (up to 512 cores) for the Palladium complex and Zeolite cluster cases.]
Significantly different to the 2017 findings: there is little difference between Intel MPI and HPC-X at larger core counts, with Intel MPI superior at lower core counts.
Quantum Espresso v5.2 – Intel MPI vs. HPC-X – December 2017
[Chart: % Intel MPI performance vs. HPC-X (65%-125%) against processor core count (up to 768 cores) for the GRIR443 and Au112 cases.]
Significantly different to the classical MD codes: as with VASP, HPC-X is seen to outperform Intel MPI at the larger core counts.
Quantum Espresso v6.1 – Intel MPI vs. HPC-X – December 2018
[Chart: % Intel MPI performance vs. HPC-X (65%-125%) against processor core count (up to 512 cores) for the GRIR443 and Au112 cases.]
Significantly different to the 2017 findings: Intel MPI is superior at lower core counts, with HPC-X somewhat more effective at higher core counts.
Application Performance on Multi-core Processors
I.3 Relative Performance as a Function of Processor Family and Interconnect – SKL and SNB Clusters
Target Codes and Data Sets – 128 PEs
[Chart: normalised 128-PE performance (0.0-1.0) for DL_POLY 4 (Gramicidin, NaCl), GROMACS (ion channel, lignocellulose), OpenFOAM (3d3M), Quantum Espresso (Au112, GRIR443), VASP (Pd-O complex, Zeolite complex) and BSMBench Balance, across the Bull b510 "Raven" SNB QDR, Fujitsu CX250 SNB QDR, ATOS BDW e5-2680v4 OPA, Thor Dell|EMC e5-2697A v4 (EDR IMPI and EDR HPC-X), Dell SKL Gold 6130 OPA, Intel SKL Gold 6148 OPA, Dell SKL Gold 6142 EDR and Dell SKL Gold 6150 EDR clusters.]
Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T), EDR) vs. Raven (ATOS b510 Sandy Bridge e5-2670 / 2.6 GHz, IB-QDR) at 80 PEs:

OpenFOAM - Cavity3d-3M | 1.11
WRF - 4dbasic | 1.29
Gromacs 2016.3 - ion channel | 1.33
Gromacs 2016.3 - lignocellulose | 1.36
Gromacs 5.0 - lignocellulose | 1.37
Gromacs 4.6.1 - lignocellulose | 1.38
Gromacs 4.6.1 - ion channel | 1.40
CP2K - H2O-512 | 1.41
QE 5.2 - Au112 | 1.42
CP2K - H2O-256 | 1.43
Gromacs 5.0 - ion channel | 1.45
WRF - conus 2.5km | 1.49
VASP 5.4.4 - Zeolite | 1.53
DLPOLY Classic - Bench7 | 1.53
GAMESS-UK - SiOSi7 | 1.54
GAMESS-UK - DFT.cyclo.6-31G-dp | 1.58
DLPOLY Classic - Bench5 | 1.58
DLPOLY Classic - Bench4 | 1.59
DL_POLY 4.08 - NaCl | 1.65
DL_POLY 4.08 - Gramicidin | 1.67
QE 5.2 - GRIR443 | 1.71
VASP 5.4.4 - PdO Complex | 1.95

Average factor = 1.49 (SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR, NPEs = 80)
Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T), EDR) vs. Raven (ATOS b510 Sandy Bridge e5-2670 / 2.6 GHz, IB-QDR) at 160 PEs:

OpenFOAM - Cavity3d-3M | 1.23
Gromacs 2016.3 - ion channel | 1.32
QE 5.2 - Au112 | 1.33
Gromacs 2016.3 - lignocellulose | 1.36
WRF - 4dbasic | 1.36
Gromacs 5.0 - lignocellulose | 1.36
Gromacs 4.6.1 - lignocellulose | 1.39
DLPOLY Classic - Bench5 | 1.45
Gromacs 5.0 - ion channel | 1.46
Gromacs 4.6.1 - ion channel | 1.48
WRF - conus 2.5km | 1.49
CP2K - H2O-512 | 1.49
QE 5.2 - GRIR443 | 1.49
DLPOLY Classic - Bench4 | 1.50
CP2K - H2O-256 | 1.53
DLPOLY Classic - Bench7 | 1.56
GAMESS-UK - SiOSi7 | 1.56
GAMESS-UK - DFT.cyclo.6-31G-dp | 1.59
DL_POLY 4.08 - NaCl | 1.71
DL_POLY 4.08 - Gramicidin | 1.76
VASP 5.4.4 - Zeolite | 2.02
VASP 5.4.4 - PdO Complex | 2.23

Average factor = 1.53 (SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR, NPEs = 160)
Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T), EDR) vs. Raven (ATOS b510 Sandy Bridge e5-2670 / 2.6 GHz, IB-QDR) at 320 PEs:

QE 5.2 - GRIR443 | 1.34
Gromacs 2016.3 - ion channel | 1.34
DLPOLY Classic - Bench5 | 1.37
Gromacs 2016.3 - lignocellulose | 1.38
Gromacs 5.0 - lignocellulose | 1.39
WRF - 4dbasic | 1.39
CP2K - H2O-512 | 1.40
Gromacs 5.0 - ion channel | 1.40
Gromacs 4.6.1 - lignocellulose | 1.41
DLPOLY Classic - Bench4 | 1.41
OpenFOAM - Cavity3d-3M | 1.44
WRF - conus 2.5km | 1.45
Gromacs 4.6.1 - ion channel | 1.53
CP2K - H2O-256 | 1.56
GAMESS-UK - DFT.cyclo.6-31G-dp | 1.58
GAMESS-UK - SiOSi7 | 1.60
DL_POLY 4.08 - Gramicidin | 1.74
DL_POLY 4.08 - NaCl | 1.80
VASP 5.4.4 - Zeolite | 1.88
QE 5.2 - Au112 | 1.97
DLPOLY Classic - Bench7 | 2.16
VASP 5.4.4 - PdO Complex | 2.71

Average factor = 1.60 (SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR, NPEs = 320)
Performance Benchmarks – Node to Node
• Analysis of performance metrics across a variety of data sets
  ¤ "Core to core" and "node to node" workload comparisons
• The previous charts are based on a core-to-core comparison, i.e. performance for jobs with a fixed number of cores.
• A node-to-node comparison is typical of the performance when running a workload (real-life production), and is expected to reveal the major benefits of increasing core count per socket.
  ¤ Focus on a 4- and 6-node "node to node" comparison of the following:
    1. Raven – Bull b510 Sandy Bridge e5-2670 / 2.6 GHz IB-QDR [64 cores] vs. Hawk – Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [160 cores]
    2. Raven – Bull b510 Sandy Bridge e5-2670 / 2.6 GHz IB-QDR [96 cores] vs. Hawk – Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [240 cores]
  ¤ Benchmarks based on a set of 10 applications & 19 data sets.
Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T), EDR, 160 cores) vs. Raven (Bull b510 Sandy Bridge e5-2670 / 2.6 GHz, IB-QDR, 64 cores) – 4-node comparison:

CP2K - H2O-256 | 2.50
QE 5.2 - Au112 | 2.62
CP2K - H2O-512 | 2.70
DLPOLY Classic - Bench4 | 2.81
GAMESS-UK (DFT.cyclo.6-31G-dp) | 2.94
VASP 5.4.4 - Pd-O complex | 2.95
VASP 5.4.4 - Zeolite complex | 2.96
WRF 3.4 - 4dbasic | 2.98
DLPOLY-4 - NaCl | 3.09
GROMACS 2016.3 - ion channel | 3.11
QE 5.2 - GRIR443 | 3.26
GROMACS 2016.3 - lignocellulose | 3.28
DLPOLY-4 - Gramicidin | 3.31
GAMESS-UK (DFT.siosi7.3975) | 3.40
WRF 3.4 - conus 2.5km | 3.46
OpenFOAM - Cavity3d-3M | 3.50

Average factor = 3.05 (SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR)
Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T), EDR, 240 cores) vs. Raven (Bull b510 Sandy Bridge e5-2670 / 2.6 GHz, IB-QDR, 96 cores) – 6-node comparison:

VASP 5.4.4 - Zeolite complex | 2.59
CP2K - H2O-256 | 2.63
GROMACS 2016.3 - ion channel | 2.64
VASP 5.4.4 - Pd-O complex | 2.67
WRF 3.4 - 4dbasic | 2.78
GAMESS-UK (DFT.cyclo.6-31G-dp) | 2.78
CP2K - H2O-512 | 2.79
DLPOLY-4 - Gramicidin | 2.96
DLPOLY Classic - Bench4 | 2.96
DLPOLY-4 - NaCl | 3.01
QE 5.2 - GRIR443 | 3.14
WRF 3.4 - conus 2.5km | 3.18
GAMESS-UK (DFT.siosi7.3975) | 3.19
QE 5.2 - Au112 | 3.19
GROMACS 2016.3 - lignocellulose | 3.27
OpenFOAM - Cavity3d-3M | 3.88

Average factor = 2.98 (SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR)
EPYC – Target Codes and Data Sets – 128 PEs
[Chart: normalised 128-PE performance (0.0-1.0) for DLPOLY Classic Bench4, DL_POLY 4 (Gramicidin, NaCl), GROMACS (ion channel, lignocellulose), GAMESS-UK (cyclosporin, SiOSi7), Quantum Espresso (Au112, GRIR443) and VASP (Pd-O complex, Zeolite complex), across the Fujitsu CX250 SNB QDR, ATOS BDW e5-2680v4 OPA, Thor Dell|EMC e5-2697A v4 EDR, Dell SKL Gold 6130 OPA, Intel SKL Gold 6148 OPA, Dell SKL Gold 6142 EDR, Dell SKL Gold 6150 EDR, Bull|ATOS SKL Gold 6150 EDR and Dell|EMC AMD EPYC 7601 EDR clusters.]
78Application Performance on Multi-core Processors
Performance Benchmarks – Node to Node
• Analysis of performance metrics across a variety of data sets
¤ “Core to core” and “node to node” workload comparisons
• Previous EPYC charts were based on a core-to-core comparison, i.e. performance for jobs with a fixed number of cores
• A node-to-node comparison is typical of the performance when running a workload (real-life production), and is expected to reveal the major benefits of the increased core count per socket
¤ Focus on a “node to node” comparison of the following:
¤ Benchmarks based on a set of 6 applications & 15 data sets.
1. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores] vs. Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores]
2. Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [128 cores] vs. Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores]
12 December 2018
[Chart: relative performance factors from 1.55 to 4.19 for 12 data sets – DLPOLYclassic Bench7, Bench5 & Bench4, VASP Pd-O & Zeolite complexes, DLPOLY-4 NaCl & Gramicidin, GROMACS ion-channel & lignocellulose, QE Au112, GAMESS-UK cyc-sporin & valino.A2]
Relative Performance of Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores]
79Application Performance on Multi-core Processors
Average Factor = 2.92
Dell|EMC EPYC 7601 2.2 GHz (T) EDR vs. SB e5-2670 2.6 GHz QDR
12 December 2018
4 Node Comparison
[Chart: relative performance factors from 0.74 to 1.83 for 15 data sets – VASP Pd-O & Zeolite complexes, QE Au112 & GRIR443, DLPOLY-4 NaCl & Gramicidin, DLPOLYclassic Bench7, Bench5 & Bench4, GROMACS ion-channel & lignocellulose, GAMESS-UK cyc-sporin, valino.A2, Siosi7 & hf12z]
Relative Performance of Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [128 cores]
80Application Performance on Multi-core Processors
Average Factor = 1.28
SKL “Gold” 6130 2.1 GHz OPA vs. AMD EPYC 7601 2.2 GHz (T) EDR
12 December 2018
4 Node Comparison
Summary
• Ongoing focus on performance benchmarks and clusters featuring Intel’s SKL processors, with the addition of the “Gold” 6138, 2.0 GHz [20c] and 6148, 2.4 GHz [20c], alongside the 6142, 2.6 GHz [16c] and 6150, 2.7 GHz [18c].
• Performance comparison with current SNB systems and those
based on dual Intel BDW processor EP nodes (16-core, 14-core)
with Mellanox EDR and Intel’s Omnipath OPA interconnects.
• Measurements of parallel application performance based on
synthetic and end user applications – DLPOLY, Gromacs, Amber,
GAMESS-UK, Quantum ESPRESSO and VASP.
¤ Use of Allinea Performance reports to guide analysis, and
updated comparison of Mellanox’s HPC-X and Intel MPI on
EDR-based systems
• Results augmented through consideration of two AMD Naples
EPYC clusters, featuring the 7601 (2.20 GHz) and 7551 (2.00 GHz)
processors.
81Application Performance on Multi-core Processors 12 December 2018
Summary II
• Relative Code Performance: Processor Family and Interconnect – “core
to core” and “node to node” benchmarks.
• A core-to-core comparison focusing on the Skylake “Gold” 6148 cluster (EDR) across 19 data sets (7 applications) suggests average speedups from 1.49 (80 cores) to 1.60 (320 cores) when comparing to the Sandy Bridge-based “Raven” e5-2670 2.6 GHz cluster with its QDR interconnect.
¤ Some applications, however, show much higher factors, e.g. GROMACS and VASP, depending on the level of optimisation undertaken on Hawk.
• A node-to-node comparison, typical of the performance when running a workload, shows increased factors.
¤ A 4-node benchmark (160 cores), based on examples from 9 applications and 16 data sets, shows an average improvement factor of 3.05 compared to the corresponding 4-node runs (64 cores) on the Raven cluster.
¤ This factor is reduced somewhat, to 2.98, for the 6-node benchmarks, comparing 240 SKL cores to 96 SNB cores.
82Application Performance on Multi-core Processors 12 December 2018
Summary III
• An updated comparison of Intel MPI and Mellanox’s HPCX
conducted on the “Helios” cluster suggests that the clear
delineation between MD (DLPOLY, GROMACS) and Materials-
based codes (VASP, Quantum Espresso) is no longer evident.
• Ongoing studies on the EPYC 7601 show a complex performance dependency on the EPYC architecture.
¤ Codes with heavy usage of vector instructions (GROMACS, VASP and Quantum Espresso) perform, at best, in somewhat modest fashion.
¤ The AMD EPYC only supports 2 × 128-bit AVX natively, so there is a large gap to Intel’s 2 × 512-bit FMAs (see the arithmetic sketch below).
¤ The floating-point peak on AMD is 4 × lower than on Intel and, given that e.g. GROMACS has a native AVX-512 kernel for Skylake, performance inevitably suffers.
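A back-of-envelope check of that factor – per core, per clock, in double precision, using the nominal FMA widths quoted above (a sketch, not vendor-measured figures):
\[
\text{SKL: } 2\ \text{FMA} \times \tfrac{512}{64}\ \text{DP lanes} \times 2\ \text{flops} = 32\ \text{DP flops/cycle};
\qquad
\text{EPYC (Naples): } 2 \times \tfrac{128}{64} \times 2 = 8\ \text{DP flops/cycle}
\;\Rightarrow\; 32/8 = 4\times .
\]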
83Application Performance on Multi-core Processors 12 December 2018
II. Acceptance Test Challenges and the
Impact of Environment.
Application Performance on Multi-
core Processors
Background - Supercomputing Wales, New HPC Systems
• Multi-million £ procurement exercise for new hubs agreed by all
partners
• Tender issued in May 2017 following 6-9 month review of research
community requirements and development of technical reference
design
• Budgetary challenges due to currency devaluation and increase in
component costs since budgets agreed in 2016
• Contracts awarded to Atos, March 2018. Hubs now installed and
operational, based on Intel Skylake Gold 6148, supported by Nvidia
GPU accelerators:
Lot 1 – “Hawk” system - Cardiff hub. 7,000 HPC + 1,040 HTC cores
Lot 2 – “Sunbird” system - Swansea hub. 5,000 HPC cores
Lot 3 – “Sparrow” – Cardiff High Performance Data Analytics
development system
Suppliers to provide development opportunities and other activities
through a programme of Community Benefits
12 December 2018Application Performance on Multi-core Processors 85
Performance Acceptance Tests
1. Consideration of the Performance Acceptance tests undertaken as
part of the Supercomputing Wales procurement. Carried out by Atos
on the “Hawk” HPC Skylake 6148 Cluster at Cardiff University.
2. Performance targets were built on benchmarks specified in the ITT – but subsequent developments, e.g. SPECTRE / Meltdown, impacted the testing.
3. Assess performance through analyses of results generated under three distinct run-time environment settings, characterised by:
¤ Turbo Mode – ON or OFF. Impact considerably more complicated with
Skylake compared to previous Intel processor families.
¤ Security patches – DISABLED or ENABLED on the Skylake 6148 compute
nodes
¤ Distribution of processing cores – PACKED or UNPACKED on each node
e.g. 256 cores on either 7 or 8 × 40-core nodes.
4. Total of 8 combinations – what is the impact on performance?
¤ The ITT defined that all “Application benchmarks should be in ‘PACKED’ mode; HPCC in non-turbo mode”.
12 December 2018Application Performance on Multi-core Processors 86
Process Adopted
1. Performance benchmark results generated by Atos (Martyn Foster)
on the Hawk HPC Skylake 6148 Cluster at Cardiff University
2. MF adopted a systematic approach to assessing performance
through the analyses of results generated across four distinct
environments (a subset of the 8 possible environments)
¤ “base (switch contained)” – Turbo mode off, security patches disabled on
the Skylake 6148 compute nodes
¤ “turbo + packed” - Turbo mode activated, with packed nodes – Slurm
default, with 40 cores per Skylake 6148 node
¤ “turbo + spread” - Turbo mode activated, de-populated nodes (32 cores /
node)
¤ “base + spectre” – base configuration above with security patches enabled
3. Identify those applications where the committed performance from the SCW ITT submission (“Target”) is not achieved. A 10% shortfall was allowed.
12 December 2018Application Performance on Multi-core Processors 87
GLOBAL_SETTINGS
export SPECTRE="clush -b -w $SLURM_NODELIST sudo /apps/slurm/disablekpti"
export SPEC="disable"
##### OR #####
export SPECTRE="clush -b -w $SLURM_NODELIST sudo /apps/slurm/enablekpti"
export SPEC="enable"

export TURBO="clush -b -w $SLURM_NODELIST sudo /apps/slurm/turbo_on" ; export TSTR=TURBO
##### OR #####
export TURBO="clush -b -w $SLURM_NODELIST sudo /apps/slurm/turbo_off" ; export TSTR=OFF

export SRUN_PACKING="-m Pack" ; export PSTR=Packed
##### OR #####
export SRUN_PACKING="-m NoPack" ; export PSTR=Spread

export LAUNCHER="srun ${SRUN_PACKING} --cpu_bind=verbose,cores --export LD_LIBRARY_PATH"
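A minimal sketch (a hypothetical job-script fragment, not the actual Atos harness) of how one of the eight combinations would be applied ahead of a benchmark run; looping over the three OR-pairs above generates all eight combinations:

source ./GLOBAL_SETTINGS                     # pick one option from each OR-pair above
eval "$SPECTRE"                              # enable or disable KPTI on the allocated nodes
eval "$TURBO"                                # switch turbo mode on or off
echo "Environment: patches=$SPEC turbo=$TSTR packing=$PSTR"
$LAUNCHER -n $SLURM_NTASKS ./benchmark.x     # srun with the chosen packing and core binding (benchmark.x is a placeholder)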
12 December 2018Application Performance on Multi-core Processors 88
SCW Application Performance Benchmarks
• The Benchmark suite comprises both synthetics & end-user
applications. Synthetics include HPCC (http://icl.cs.utk.edu/hpcc) &
IMB benchmarks (http://software.intel.com/en-us/articles/intel-mpi-
benchmarks), IOR and STREAM
• Variety of “open source” & commercial end-user application codes:
¤ GROMACS and DL_POLY-4 (molecular dynamics)
¤ Quantum Espresso and VASP (ab initio materials properties)
¤ BSMBench (particle physics – Lattice Gauge Theory benchmarks)
¤ OpenFOAM (computational engineering)
• These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed.
12 December 2018Application Performance on Multi-core Processors 89
“Sunbird” Acceptance Tests – User Applications
90
[Chart: measured performance (78% to 113%) for DL_POLY Gramicidin (64/128/256 cores), GROMACS ion-channel (64/128) and lignocellulose (128/256), VASP PdO (64/128) and Zeolite (128/256), QE Au112 (64/128) and GRIR443 (256/512), OpenFOAM (128/256), and BSMBench Comms, Balance and Compute (256/512/1024 cores)]
Basket of Synthetic (HPCC, IOR, STREAM, IMB) and end-user application codes – DL_POLY, GROMACS, VASP, ESPRESSO, OpenFOAM & BSMBENCH
12 December 2018Application Performance on Multi-core Processors
Impact of Turbo Mode on Performance (Security Patches Enabled)
[Chart: relative performance (%), T Turbo-OFF / T Turbo-ON, for DL_POLY Gramicidin, GROMACS ion-channel and lignocellulose, VASP PdO and Zeolite, QE Au112 and GRIR443, OpenFOAM and BSMBench Comms/Balance/Compute at 64–1024 cores; higher is better; y-axis 85%–115%]
Normalised to the corresponding performance with Turbo OFF; security patches enabled
12 December 2018Application Performance on Multi-core Processors 91
Impact of Turbo Mode on Performance (Security Patches Disabled)
[Chart: relative performance (%), T Turbo-OFF / T Turbo-ON, for the same DL_POLY, GROMACS, VASP, QE, OpenFOAM and BSMBench data sets at 64–1024 cores; higher is better; y-axis 85%–115%]
Normalised to the corresponding performance with Turbo OFF; security patches disabled
12 December 2018Application Performance on Multi-core Processors 92
Impact of Security Patches on Performance (Turbo Mode OFF)
[Chart: relative performance (%), T DISABLED / T ENABLED, for the same DL_POLY, GROMACS, VASP, QE, OpenFOAM and BSMBench data sets at 64–1024 cores; higher is better; y-axis 90%–110%]
Normalised to the corresponding performance with the security patches disabled on the compute nodes; Turbo OFF
12 December 2018Application Performance on Multi-core Processors 93
Impact of Security Patches on Performance (Turbo Mode ON)
[Chart: relative performance (%), T DISABLED / T ENABLED, for the same DL_POLY, GROMACS, VASP, QE, OpenFOAM and BSMBench data sets at 64–1024 cores; higher is better; y-axis 90%–110%]
Normalised to the corresponding performance with the security patches disabled on the compute nodes; Turbo ON
12 December 2018Application Performance on Multi-core Processors 94
Overall Impact of Environment on Performance
[Chart: relative performance (%), T CONSTRAIN / T MIN, for the same DL_POLY, GROMACS, VASP, QE, OpenFOAM and BSMBench data sets at 64–1024 cores; higher is better; y-axis 90%–115%]
Normalised with respect to the most constrained environment – Turbo OFF, security patches enabled, “packed” nodes
12 December 2018Application Performance on Multi-core Processors 95
Workload validation and Throughput tests
• Aim: Throughput designed to illustrate the Stability of the system
over an observed period of a week, while hardening the system
• Benchmarks based on multiple, concurrent instantiations of a
number of data sets associated with five of the end user application
codes and two of the synthetic benchmarks.
• Each data set is run a number of times on a variety of processor
(core) counts - typically 40, 80, 160, 320, 640 and 1024. This
combination of jobs has been designed to run for approximately 6
hours (elapsed time) on a 2720-core, 68 node cluster partition.
• Note that the metrics for success of these tests are twofold:
1. All jobs comprising a given run complete successfully, and
2. There is consistency of run time across each of the tests. The measured time is simply the window from the time at which the first of the jobs is launched to the time at which the last job finishes (a minimal extraction sketch follows).
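A minimal sketch of how that first-launch-to-last-finish window could be extracted from standard Slurm accounting, assuming the jobs of one run share a common (hypothetical) job-name tag:

sacct -n -X --name=SCW_throughput_run06 --format=Start | sort | head -1   # earliest job start in the run
sacct -n -X --name=SCW_throughput_run06 --format=End   | sort | tail -1   # latest job finish in the run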
12 December 2018Application Performance on Multi-core Processors 96
Workload validation and Throughput tests
• Based around multiple instantiations
of a number of data sets associated
with the five codes, DLPOLY4,
Gromacs (v5.2), Quantum Espresso,
OpenFOAM and VASP, and the two
synthetic benchmarks, IMB and IOR.
• DLPOLY4 - NaCl & Gramicidin
• Gromacs - ion_channel &
lignocellulose
• QE 6.1 - AUSURF112 & GRIR443
• OpenFOAM - cavity3d-3M
• VASP 5.4.4 – PdO complex and
Zeolite
12 December 2018Application Performance on Multi-core Processors 97
SLURM Scripts
DLPOLY4.test2+test8.SCW.40.q
DLPOLY4.test2+test8.SCW.80.q
DLPOLY4.test2+test8.SCW.160.q
DLPOLY4.test2+test8.SCW.320.q
DLPOLY4.test2+test8.SCW.640.q
GROMACS.All.SCW.80.q
GROMACS.All.SCW.160.q
GROMACS.All.SCW.320.q
GROMACS.All.SCW.640.q
GROMACS.All.SCW.1024.q
IMB3.SCW.160.q
IMB3.SCW.320.q
IOR.SCW.4.q
IOR.SCW.8.q
OpenFOAM_cavity3d-3M.SCW.80.q
OpenFOAM_cavity3d-3M.SCW.160.q
OpenFOAM_cavity3d-3M.SCW.320.q
OpenFOAM_cavity3d-3M.SCW.640.q
QE.AUSURF112.SCW.160.q
QE.AUSURF112.SCW.320.q
QE.GRIR443.SCW.320.q
QE.GRIR443.SCW.640.q
VASP.example3.SCW.80.q
VASP.example3.SCW.160.q
VASP.example3.SCW.320.q
VASP.example4.SCW.160.q
VASP.example4.SCW.320.q
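The scripts themselves are not reproduced in the slides; a minimal hypothetical skeleton of what, e.g., DLPOLY4.test2+test8.SCW.320.q might contain (the directives, module name and file layout are assumptions, not the actual Cardiff/Atos scripts):

#!/bin/bash
#SBATCH --job-name=DLPOLY4.test2+test8.320
#SBATCH --ntasks=320
#SBATCH --ntasks-per-node=40              # packed dual Skylake 6148 nodes (2 x 20 cores)
#SBATCH --time=06:00:00
module load dlpoly                        # assumed module name
cd "$SLURM_SUBMIT_DIR"
for test in TEST2 TEST8; do               # NaCl and Gramicidin data sets in turn
  cp $test/CONTROL $test/CONFIG $test/FIELD .
  srun -n 320 --cpu_bind=cores DLPOLY.Z   # DL_POLY_4 executable
  mv OUTPUT OUTPUT.$test.320
done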
Throughput Tests – Hawk System – Two partition Approach
The throughput tests were undertaken on two separate partitions of the
Hawk cluster – compute64 and compute64b – to enable other testing and
early pilot user service. Each partition comprised 68 nodes.
Partition 1 – Compute 64 (68 Nodes)
• The first set of trial runs was executed between 12-14 May. A number of the
runs failed to complete, subsequently attributed to an apparent VASP related
error peculiar to the lustre file system:
forrtl: severe (121): Cannot access current working directory for unit 18, file "Unknown"
Image PC Routine Line Source
vasp_std 00000000014F3E09 Unknown Unknown Unknown
vasp_std 000000000150E10F Unknown Unknown Unknown
vasp_std 000000000134C950 Unknown Unknown Unknown
vasp_std 000000000040AF5E Unknown Unknown Unknown
libc-2.17.so 00002B450F32EC05 __libc_start_main Unknown Unknown
vasp_std 000000000040AE69 Unknown Unknown Unknown
forrtl: error (76): Abort trap signal
• This transient error affected perhaps one in twenty identical jobs, and although
reported into the appropriate Level 3 service regimes, has still not been formally
addressed. A workaround module was developed by Cardiff’s Tom Green when
it became clear that the formal channels were struggling.
module load lustre_getcwd_fix
12 December 2018Application Performance on Multi-core Processors 98
Throughput Tests – Hawk System II
Partition 1:
• A second set of trial runs was carried out over the bank holiday weekend and successfully passed the associated tests over the period 30 May – 3 June.
• Partition 2:
• Runs 11 – 22: Initial runs using compute64b, conducted between 7 – 10 June, revealed a number of issues pointing to the readiness of the nodes. Timings from the first completed run suggested some variability in run times for a given application/core count, with the total run time significantly longer than those on compute64.
Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
6       30 May 21:21   31 May 03:25   6:02
7       31 May 23:33   01 Jun 05:38   6:05
8       02 Jun 00:04   02 Jun 06:06   6:02
9       02 Jun 13:24   02 Jun 19:27   6:04
10      02 Jun 22:57   03 Jun 05:00   6:03
11      03 Jun 05:45   03 Jun 11:47   6:02
12      03 Jun 15:57   03 Jun 22:12   6:15
12 December 2018Application Performance on Multi-core Processors 99
Throughput Tests – Hawk System III
• Following a lustre upgrade, a further set of runs was undertaken between 21 June and 25 June. Runs 8 – 12 ran without error, so formally compute64b, along with compute64, can be judged to have passed the Acceptance Test throughput requirement of five consecutive error-free runs, although the variations in the individual run times are perhaps larger than hoped.
• Testing on Hawk commenced on 12 May 2018 and was finally
completed on the 25 June 2018.
Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
8       23 Jun 15:44   23 Jun 20:57   5:13
9       23 Jun 21:35   24 Jun 02:55   5:20
10      24 Jun 11:34   24 Jun 17:06   5:32
11      24 Jun 18:56   25 Jun 00:16   5:20
12      25 Jun 00:31   25 Jun 06:03   5:32
12 December 2018Application Performance on Multi-core Processors 100
Throughput Tests – Sunbird System – Two partition approach
Partition 1: Runs 1 – 4:
¤ Run 3 did not complete, with JOBID #11050 hanging, while JOBID #11372 of Run 4 suffered the same fate. Both jobs failed with the all-too-familiar VASP/lustre error diagnostics. The scripts used were identical to those used on Hawk in June, and did not include the workaround introduced at the time.
• Runs 5 – 10: Completed successfully, with two occurrences of the VASP/lustre error trapped through the added module
module load lustre_getcwd_fix
Partition 2: Runs 11 – 22: Three jobs in one of the runs hung when hitting problems on scs0105. That node had been taken out when setting up the user-facing file systems and needed the playbooks running. Several of the runs showed the impact of the lustre issue with VASP.
• However, there were significant variations in the overall run times.
¤ At least three of the nodes appeared to be either defective or to have different BIOS settings (scs0064, scs0092 and scs0096). These were subsequently removed from service.
¤ Turbo was in an inconsistent state across the compute nodes. It is usually reset by the Slurm prologue scripts, but these appear to have been commented out.
12 December 2018 101Application Performance on Multi-core Processors
Throughput Tests – Sunbird System
• Runs 23 – 30: Certainly acceptable from the metric of job completion, for all completed successfully. Note there was no recurrence of the lustre-related issue during this set of runs.
• Testing on Sunbird commenced on 10 August 2018, and was finally
completed on the evening of 19 August 2018
Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
23      17 Aug 17:52   17 Aug 23:07   5:15
24      17 Aug 23:54   18 Aug 05:11   5:17
25      18 Aug 05:19   18 Aug 10:36   5:17
26      18 Aug 14:00   18 Aug 19:16   5:16
27      18 Aug 19:51   19 Aug 01:08   5:17
28      19 Aug 02:45   19 Aug 07:57   5:12
29      19 Aug 13:21   19 Aug 18:32   5:11
30      19 Aug 19:06   20 Aug 00:26   5:20 (SLURM CG issue)
12 December 2018Application Performance on Multi-core Processors 102
Throughput Tests – Nottingham OCF Cluster
• Tests were modified to run on two partitions of the OCF cluster at Nottingham, “martyn” and “colin”, each comprising 50 nodes with an EDR interconnect. All component nodes comprised dual Gold 6138 2.0GHz 20c SKL processors.
• Initial runs of the workload failed to complete successfully, with each of the 8 x 320-core IMB jobs hanging and consuming all of their allocated time. This was traced to an issue with the gatherv collective, which failed to complete across all specified message lengths (msglens).
• Navigated around the issue by removing those environment variables
deemed likely to trigger the problem, specifically:
¤ export I_MPI_JOB_FAST_STARTUP=enable
¤ export I_MPI_SCALABLE_OPTIMIZATION=enable
¤ export I_MPI_DAPL_UD=enable
¤ export I_MPI_TIMER_KIND=rdtsc
• With these removed (see the one-line sketch below), runs proceeded to complete successfully.
• One of the allocated nodes (compute099) was rendered unusable as a result of the tests and removed from service. The subsequent runs therefore used 49 nodes, rather than the intended 50.
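In practice, dropping those four Intel MPI variables amounts to simply not exporting them (or explicitly unsetting them) in the job scripts – a one-line sketch, assuming they had previously been exported:

unset I_MPI_JOB_FAST_STARTUP I_MPI_SCALABLE_OPTIMIZATION I_MPI_DAPL_UD I_MPI_TIMER_KIND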
12 December 2018Application Performance on Multi-core Processors 103
Throughput Tests – Acceptance Achieved (OCF System)
Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
2       31 Jul 18:04   01 Aug 00:49   6:45
3       01 Aug 01:22   01 Aug 08:07   6:45
4       01 Aug 08:32   01 Aug 15:18   6:46
5       01 Aug 17:08   01 Aug 23:52   6:44
6       02 Aug 03:20   02 Aug 10:06   6:46
12 December 2018Application Performance in Materials Science 104
Table. Overall run times for the throughput runs on the “martyn” partition.
Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
1       02 Aug 12:24   02 Aug 19:05   6:41
2       03 Aug 02:41   03 Aug 09:18   6:37
3       03 Aug 10:34   03 Aug 17:18   6:44
4       03 Aug 17:35   04 Aug 00:26   6:51
5       04 Aug 01:00   04 Aug 07:45   6:45
Table. Overall run times for the throughput runs on the “colin” partition.
Results of "throughput benchmarks" carried out on the new OCF Skylake
cluster at Nottingham University between 31 July and 4 August 2018.
III. The Performance Evolution of two
Community Codes, DL_POLY and
GAMESS-UK
Application Performance on Multi-
core Processors
Outline and Contents
1. Introduction – DL_POLY and GAMESS-UK
¤ Background and Flagship community codes for the UK’s
CCP5 & CCP1 – Collaboration!
2. HPC Technology – Impact of Processor & Interconnect
developments
¤ The last 10 years of Intel dominance – Nehalem to Skylake
3. DL_POLY and GAMESS-UK Performance
¤ Benchmarks & Test Cases
¤ Overview of two decades of Code Performance: From the Cray
T3E/900 to Intel Skylake clusters
12 December 2018Application Performance on Multi-core Processors 106
“DL_POLY - A Performance Overview. Analysing, Understanding and Exploiting
available HPC Technology”, Martyn F Guest, Alin M Elena and Aidan B G Chalk,
Molecular Simulation, Accepted for publication (2019).
The Story of Two Community Codes
DL_POLY and GAMESS-UK - A Performance
Overview
HPC Technology –
Processor and
Networks
Computer Systems
• Benchmark timings cover a wide variety of systems, starting with the Cray T3E/1200 in 1999. Access was initially undertaken as part of Daresbury’s Distributed Computing support programme (DiSCO), with the benchmarks presented at the annual Machine Evaluation Workshops (1989-2014) and STFC’s successor Computing Insight (CIUK) conferences (2015 onwards).
¤ Access was typically short-lived, as systems were provided by suppliers to enhance their profile at the MEW workshops, leaving limited opportunity for in-depth benchmarking.
• Systems include a wide range of CPU offerings: representatives from over a dozen generations of Intel processors, from the early days of single-processor nodes housing Pentium 3 and Pentium 4 CPUs, through dual-processor nodes featuring dual-core Woodcrest and quad-core Clovertown & Harpertown processors, along with the Itanium and Itanium2 CPUs, through to the extensive range of multi-core offerings from Westmere to Skylake.
12 December 2018Application Performance on Multi-core Processors 108
Computer Systems
• A variety of processors from AMD (Athlon, Opteron, MagnyCours,
Interlagos etc.) along with the “power” processors from the IBM
pSeries have also featured (typically dual processor configurations).
• Just as a wide variety of processors feature, so too does a range of network interconnects. Fast Ethernet and GBit Ethernet were rapidly superseded by the increasing capabilities of the family of Infiniband interconnects from Voltaire and Mellanox (SDR, DDR, QDR, FDR, EDR and soon HDR), along with the now defunct offerings from Myrinet, Quadrics and QLogic. The Truescale interconnect from Intel, along with its successor, Omnipath, also features.
• Dating from the appearance of Intel’s SNB processors, many of the timings were generated with the Turbo mode feature enabled by the system administrators. Such systems are tagged with the “(T)” notation.
• As for software, most of the commodity clusters featuring Intel CPUs used successive generations of Intel compilers along with Intel MPI, although a range of MPI libraries have been used – OpenMPI, MPICH, MVAPICH and MVAPICH2. Proprietary systems (Cray and IBM) used system-specific compilers and associated MPI libraries.
12 December 2018Application Performance on Multi-core Processors 109
Intel Xeon : Westmere - Skylake
(Columns: Xeon 5600 “Westmere-EP” | Xeon E5-2600 “Sandy Bridge-EP” | Xeon E5-2600 v4 “Broadwell-EP” | Intel Xeon Scalable Processor “Skylake”)
Cores / Threads: up to 6 cores / 12 threads | up to 8 cores / 16 threads | up to 22 cores / 44 threads | up to 28 cores / 56 threads
Last-level cache: 12 MB | up to 20 MB | up to 55 MB | up to 38.5 MB (non-inclusive)
Max memory channels, speed / socket: 3 x DDR3 channels, 1333 | 4 x DDR3 channels, 1600 | 4 channels of up to 3 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2400 MHz | 6 channels of up to 2 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2666 MHz
New instructions: AES-NI | AVX 1.0, 8 DP Flops/Clock | AVX 2.0, 16 DP Flops/Clock | AVX 512, 32 DP Flops/Clock
QPI / UPI Speed (GT/s): 1 QPI channel @ 6.4 GT/s | 2 QPI channels @ 8.0 GT/s | 2 x QPI channels @ 9.6 GT/s | up to 3 x UPI @ 10.4 GT/s
PCIe Lanes / Controllers / Speed (GT/s): 36 lanes PCIe 2.0 on chipset | 40 lanes / socket, integrated PCIe 3.0 | 40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s) | 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
Server / Workstation TDP: Server / Workstation: 130W | up to 130W Server; 150W Workstation | 55 - 145W | 70 - 205W
12 December 2018Application Performance on Multi-core Processors 110
The Story of Two Community Codes
DL_POLY and GAMESS-UK - A Performance
Overview
Overview of two
decades of
DL_POLY
Performance
[Diagram: domain decomposition of the simulation cell into regions A, B, C and D, one per node]
• Distribute atoms and forces across the nodes
¤ More memory efficient; can address much larger cases (10^5 - 10^7)
• SHAKE and short-range forces require only neighbour communication
¤ Communications scale linearly with the number of nodes
• Coulombic energy remains global
¤ Adopt the Smooth Particle Mesh Ewald (SPME) scheme (see the sketch below)
• Includes a Fourier transform of the smoothed charge density (reciprocal-space grid typically 64x64x64 - 128x128x128)
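For context, a sketch of the standard Ewald splitting that SPME evaluates (textbook form in Gaussian units, not the code’s exact expression): the short-range part is summed over neighbouring domains only, while the reciprocal-space sum uses the FFT of the B-spline-smoothed charge density:
\[
E_{\mathrm{Coul}}
= \underbrace{\tfrac{1}{2}\sum_{i \ne j} q_i q_j\, \frac{\operatorname{erfc}(\alpha r_{ij})}{r_{ij}}}_{\text{real space, within cutoff}}
+ \underbrace{\frac{2\pi}{V}\sum_{\mathbf{k} \ne 0} \frac{e^{-k^{2}/4\alpha^{2}}}{k^{2}}\,\bigl|S(\mathbf{k})\bigr|^{2}}_{\text{reciprocal space, FFT grid}}
- \underbrace{\frac{\alpha}{\sqrt{\pi}}\sum_{i} q_i^{2}}_{\text{self term}},
\qquad
S(\mathbf{k}) = \sum_{j} q_j\, e^{i\mathbf{k}\cdot\mathbf{r}_j}.
\]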
https://www.scd.stfc.ac.uk/Pages/DL_POLY.aspx
W. Smith and I. Todorov
Domain Decomposition - Distributed data:
DL_POLY 3/4 – Distributed data
Benchmarks:
1. NaCl Simulation; 216,000 ions, 200 time steps, Cutoff = 12 Å
2. Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps
12 December 2018Application Performance on Multi-core Processors 112
The DLPOLY Benchmarks
DL_POLY 4
• Test2 Benchmark
¤ NaCl Simulation; 216,000 ions, 200 time steps, Cutoff = 12 Å
• Test8 Benchmark
¤ Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps
DL_POLY Classic
• Bench4
¤ NaCl Melt Simulation with Ewald
sum electrostatics & a MTS
algorithm. 27,000 atoms; 500 time
steps.
• Bench5
¤ Potassium disilicate glass (with 3-
body forces). 8,640 atoms: 3,000
time steps
• Bench7
¤ Simulation of gramicidin A molecule
in 4012 water molecules using
neutral group electrostatics. 12,390
atoms: 5,000 time steps
12 December 2018Application Performance on Multi-core Processors 113
DL_POLY Classic: Bench 4
Performance Relative to the Cray T3E/1200 (32 CPUs)
[Chart: “DLPOLY 2 - Bench 4 (32 PEs)” – performance relative to the Cray T3E/1200E across some fifty systems, from the Cray T3E/1200E, IBM SP/Winterhawk2 and SGI Origin 3800 through successive commodity clusters (Pentium, Opteron, Itanium2, Woodcrest, Clovertown, Harpertown, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Power8) to the Intel SKL Platinum 8170, Dell SKL Gold 6142 and Bull|ATOS SKL Gold 6150 clusters; maximum factor 112]
12 December 2018Application Performance on Multi-core Processors 114
DL_POLY V2: Bench 7
Performance Relative to the Cray T3E/1200 (32 CPUs)
[Chart: “DLPOLY 2 - Bench 7 (32 PEs)” – the same set of systems as the Bench 4 chart, from the Cray T3E/1200E to the Skylake Platinum 8170 and Gold 6142/6150 clusters; maximum factor 47]
12 December 2018Application Performance on Multi-core Processors 115
DL_POLY 3/4 – Gramicidin (128 Cores)
Performance Relative to the IBM e326 Opteron 280/2.4GHz + GbitE
[Chart: “DLPOLY 3/4 - Gramicidin (128 cores)” – DL_POLY 3 and DL_POLY 4 performance relative to the IBM e326 Opteron 280/2.4 GHz GbitE cluster, across systems from dual-core Opteron and Woodcrest clusters through Harpertown, Nehalem, Westmere, Interlagos, Sandy Bridge, Ivy Bridge, Haswell and Broadwell to the Skylake Platinum 8170 and Gold 6142/6150 clusters; maximum factor 61.5]
12 December 2018Application Performance on Multi-core Processors 116
DL_POLY 4 – Gramicidin (128 cores)
Performance Relative to the IBM e326 Opteron 280/2.4GHz / GbitE
[Chart: “DLPOLY 4 - Gramicidin (128 cores)” – systems grouped by processor generation (E5-26xx, E5-26xx v2, E5-26xx v3, E5-26xx v4, Intel SKL), from Westmere and Sandy Bridge clusters through Ivy Bridge, Haswell and Broadwell to the Skylake Gold 6130/6148/6142/6150 clusters and the Atos AMD EPYC 7601; maximum factor 61.5]
12 December 2018Application Performance on Multi-core Processors 117
DL_POLY4 – Gramicidin Perf Report
Smooth Particle Mesh Ewald Scheme
Performance Data (32-256 PEs)
[Charts: CPU Time Breakdown – CPU scalar numeric ops (%), CPU vector numeric ops (%) and CPU memory accesses (%) at 32, 64, 128 and 256 PEs; Total Wallclock Time Breakdown – CPU (%) vs. MPI (%) at 32, 64, 128 and 256 PEs]
12 December 2018Application Performance on Multi-core Processors 118
The Story of Two Community Codes
DL_POLY and GAMESS-UK - A Performance
Overview
Overview of two
decades of
GAMESS-UK
Performance
Large-Scale Parallel Ab-Initio Calculations
• GAMESS-UK now has two parallelisation schemes:
¤ The traditional version based on the Global Array tools
• retains a lot of replicated data
• limited to about 4000 atomic basis functions
¤ Subsequent developments by Ian Bush (High Performance
Applications Group, Daresbury, now at Oxford University via NAG
Ltd.) have extended the system sizes available for treatment by
both GAMESS-UK (molecular systems) and CRYSTAL (periodic
systems)
• Partial introduction of “Distributed Data” architecture…
• MPI/ScaLAPACK based
12 December 2018Application Performance on Multi-core Processors 120
The GAMESS-UK Benchmarks
Five representative examples of increasing
complexity.
• Cyclosporin 6-31g basis (1000 GTOs) DFT B3LYP (direct
SCF)
• Cyclosporin 6-31g-dp basis (1855 GTOs) DFT B3LYP
(direct SCF)
• Valinomycin (dodecadepsipeptide) in water; DZVP2 DFT
basis, HCTH functional (1620 GTOs) (direct SCF)
• Mn(CO)5H TZVP/DZP MP2 - geometry optimization
• ((C6H4(CF3))2 6-31g basis DFT B3LYP opt geom + analytic
2nd Derivatives
12 December 2018Application Performance on Multi-core Processors 121
GAMESS-UK. DFT B3LYP Performance
Performance Relative to the Cray T3E/1200 (32 CPUs)
Valinomycin, 1620 GTOs; Basis: DZVP2_A2 (Dgauss)
[Chart: “Valinomycin DFT - DZVP2 1620 GTOs” – performance relative to the Cray T3E/1200 across systems from the Cray T3E and IBM pSeries 690 through Opteron, Pentium-4, Itanium2, Harpertown, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell and Broadwell clusters to the Atos Skylake Gold 6148 2.4GHz (T) IB/EDR; callout values 28.5 and 65.1, with annotations highlighting the CS48 Bull R422 Harpertown E5472 3.0GHz QC // IB and the Atos Skylake Gold 6148 2.4GHz (T) // IB/EDR systems]
12 December 2018Application Performance on Multi-core Processors 122
Performance of MP2 Gradient Module
Performance Relative to the Cray T3E/900 (32 CPUs)
Mn(CO)5H – MP2 geometry optimisation; BASIS: TZVP + f (217 GTOs)
[Chart: “MP2 Mn(CO)5H” – performance relative to the Cray T3E/900 across systems from the Cray T3E/900 and IBM SP/P2SC through Alpha, Origin, Pentium, Opteron, Itanium2, power4/5/6, Harpertown, Nehalem, Westmere and Sandy Bridge clusters to Broadwell and the Atos Skylake Gold 6148 2.4GHz (T) IB/EDR; callout values 45.8 and 55.3, with annotations highlighting the CS48 Bull Xeon E5472 3.0GHz QC + DDR and the Intel SNB e5-2670 [8c] 2.6GHz // IB/QDR (ppn=8) systems]
12 December 2018Application Performance on Multi-core Processors 123
GAMESS-UK – DFT Performance Report
Cyclosporin 6-31G** basis (1855 GTOs); DFT B3LYP
Performance Data (32-256 PEs)
[Charts: CPU Time Breakdown – CPU scalar numeric ops (%), CPU vector numeric ops (%) and CPU memory accesses (%) at 32, 64, 128 and 256 PEs; Total Wallclock Time Breakdown – CPU (%) vs. MPI (%) at 32, 64, 128 and 256 PEs]
12 December 2018Application Performance on Multi-core Processors 124
Summary
1. Introduction – DL_POLY and GAMESS-UK
¤ Background and Flagship codes for the UK’s CCP5 & CCP1
¤ Critical role of collaborative developments
2. HPC Technology - Processor & Interconnect Technologies
¤ The last 10 years of Intel dominance – Nehalem to Skylake
3. DL_POLY and GAMESS-UK Performance
¤ Benchmarks & Test Cases
¤ Overview of two decades of Code Performance: From
T3E/1200E to Intel Skylake clusters
4. Understanding Performance – Useful Tools
5. Acknowledgements and Summary
12 December 2018Application Performance on Multi-core Processors 125
Acknowledgements
• Ludovic Sauge, Enguerrand Petit, Martyn Foster, Nick Allsopp and John Humphries (Bull/ATOS) for informative discussions and access to the Skylake & EPYC clusters at the Bull HPC Competency Centre.
• David Cho, Gilad Shainer, Colin Bridger & Steve Davey
for access to and considerable assistance with the “Helios”
cluster at the HPC Advisory Council.
• Joshua Weage, Martin Hilgeman, Dave Coughlin, Gilles
Civario and Christopher Huggins for access to, and
assistance with, the variety of Skylake and EPYC SKUs at
the Dell Benchmarking Centre.
• Alin Marin Elena and Ilian Todorov (STFC) for discussions
around the DL_POLY software
• The DisCO programme at Daresbury Laboratory.
Application Performance on Multi-core Processors 12612 December 2018
127Application Performance on Multi-core Processors
Final Thoughts & Summary
I. Performance Benchmarks and Cluster Systems
a. Synthetic Code Performance: STREAM and IMB
b. Application Code Performance: DLPOLY, GROMACS,
AMBER,GAMESS_UK, VASP and Quantum Espresso
c. Interconnect Performance: Intel MPI and Mellanox’s HPCX
d. Processor Family and Interconnect – “core to core” and “node
to node” benchmarks
II. Impact of Environmental Issues in Cluster acceptance
tests.
a. Security patches, turbo mode and Throughput testing
III. Performance profile of DL_POLY and GAMESS-UK over
the past two decades
12 December 2018