HPC platform efficiency and challenges for a system builder
Martin Hilgeman
Technical Director, HPC
2 HPC Saudi 2018
The landscape is changing
“We are no longer in the general purpose era… the argument of tuning software for hardware is moot. Now, to get the best bang for the buck, you have to tune both.”
- Kushagra Vaid, general manager of server engineering, Microsoft Cloud Solutions
https://www.nextplatform.com/2017/03/08/arm-amd-x86-server-chips-get-mainstream-lift-microsoft/amp/
4 nodes in 2U – the HPC standard
The evolution of the dual socket
Model    6100         6100         6220         6220         6320         6320         6420
Year     2010         2011         2012         2013         2014         2015         2017
CPU      X5570        X5690        E5-2690      E5-2695 v2   E5-2699 v3   E5-2699A v4  8180
Cores    4            6            8            12           18           22           28
GHz      3.33         3.46         2.9          2.7          2.3          2.4          2.5
TDP (W)  95           130          135          135          145          145          205
DIMM     12/(3 ch)    12/(3 ch)    16/(4 ch)    16/(4 ch)    16/(4 ch)    16/(4 ch)    16/(6 ch)
PSU (W)  1400         1400         1400         1400         1600         1600         2000
Price    $1,500       $1,700       $2,050       $3,549       $3,805       $4,938       $10,009
The de-facto standard node in HPC
Improving performance - what levels do we have?
• Challenge: Sustain performance trajectory without massive increases in cost, power, real estate, and unreliability
• Solutions: No single answer, must intelligently turn “Architectural Knobs”
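$$
\underbrace{\underbrace{\text{Freq}}_{1} \times \underbrace{\frac{\text{cores}}{\text{socket}}}_{2} \times \underbrace{\#\text{sockets}}_{3} \times \underbrace{\frac{\text{inst or ops}}{\text{core} \times \text{clock}}}_{4}}_{\text{Hardware performance}} \times \underbrace{\text{Efficiency}}_{5}
$$
Knobs 1 to 4 set the hardware performance. Knob 5 (efficiency) is the software performance and determines what you really get.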
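As a rough illustration of the knob formula, the following Python sketch multiplies the five terms for one node. The 28 cores at 2.5 GHz come from the 2017 column of the dual-socket table; the 32 double-precision FLOPs per core per clock (AVX-512 with two FMA units) and the 70% efficiency are assumptions chosen for the example, not figures from the slides. The function names are illustrative.

```python
def peak_gflops(freq_ghz, cores_per_socket, sockets, flops_per_core_per_clock):
    """Knobs 1-4: theoretical hardware peak in GFLOP/s."""
    return freq_ghz * cores_per_socket * sockets * flops_per_core_per_clock


def sustained_gflops(peak, efficiency):
    """Knob 5: scale the hardware peak by the fraction the software achieves."""
    return peak * efficiency


# Assumed example node: 2 sockets, 28 cores per socket at 2.5 GHz,
# 32 double-precision FLOPs per core per clock (an assumption, see above).
peak = peak_gflops(2.5, 28, 2, 32)
print(f"hardware peak: {peak:.0f} GFLOP/s")

# Hypothetical software efficiency of 70%.
print(f"what you really get: {sustained_gflops(peak, 0.70):.0f} GFLOP/s")
```

Changing any single argument shows how much each knob contributes; the efficiency term is the only one that software tuning moves.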
Turning the knobs 1 - 4
1. Frequency is unlikely to change much: thermal/power/leakage challenges.
2. Moore’s Law still holds: 130 -> 14 nm. LOTS of transistors.
3. Number of sockets per system is the easiest knob. Challenging for power/density/cooling/networking.
4. IPC still grows: FMA, AVX, accelerator implementations for algorithms. Challenging for the user/developer.
Turning knob #5
Hardware tuning knobs are limited, but there’s far more possible in the software layer. Tuning options per layer, ordered from easy to hard:
• Hardware (easy): BIOS, P-states, memory profile
• Operating system: I/O cache tuning, process affinity, memory allocation
• Middleware: MPI (parallel) tuning, use of performance libs (Math, I/O, IPP)
• Application (hard): compiler hints, source changes, adding parallelism
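Process affinity is one of the easier software knobs listed above. As a sketch only, the following Python function shows two common placement policies for ranks on a dual-socket node. The function name `rank_to_core`, the policy names, and the assumption that cores are numbered contiguously per socket (socket 0 holds cores 0 to 27, socket 1 holds cores 28 to 55) are all hypothetical; real systems can number cores differently, so check the actual topology first.

```python
def rank_to_core(rank, n_sockets=2, cores_per_socket=28, policy="compact"):
    """Return the core ID a rank would be pinned to.

    Assumes cores are numbered contiguously per socket; this is an
    illustration, not a description of any particular system.
    """
    total = n_sockets * cores_per_socket
    if not 0 <= rank < total:
        raise ValueError("rank does not fit on the node")
    if policy == "compact":
        # Fill socket 0 completely before using socket 1.
        return rank
    if policy == "scatter":
        # Alternate between sockets so ranks spread over both memory controllers.
        socket = rank % n_sockets
        slot = rank // n_sockets
        return socket * cores_per_socket + slot
    raise ValueError(f"unknown policy: {policy}")


for policy in ("compact", "scatter"):
    cores = [rank_to_core(r, policy=policy) for r in range(4)]
    print(policy, cores)
```

With few ranks per node, a scatter-style placement gives each rank more memory bandwidth, while a compact placement keeps communicating ranks on the same socket. Which one wins depends on the application.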
New capabilities according to Intel
Year             2007    2009    2012    2013    2014    2015    2017
Instruction set  SSSE3   SSE4    AVX     AVX     AVX2    AVX2    AVX-512
The state of ISV software
Segment              Applications                  Vectorization support
CFD                  Fluent, LS-DYNA, STAR-CCM+    Limited SSE2 support
CSM                  CFX, RADIOSS, Abaqus          Limited SSE2 support
Weather              WRF, UM, NEMO, CAM            Yes
Oil and Gas          Seismic processing            Not applicable
Oil and Gas          Reservoir Simulation          Yes
Chemistry            Gaussian, GAMESS, Molpro      Not applicable
Molecular dynamics   NAMD, GROMACS, Amber,…        PME kernels support SSE2
Biology              BLAST, Smith-Waterman         Not applicable
Molecular mechanics  CPMD, VASP, CP2k, CASTEP      Yes
Bottom line: ISV support for new instructions is poor. Less of an issue
for in-house developed codes, but programming is hard
What does Intel do about these trends?
Generation    Inter CPU bandwidth   Memory bandwidth                              Core frequency
Westmere      No problem            No problem                                    No problem
Sandy Bridge  Even better           Extra memory channel                          More cores; AVX; better Turbo
Ivy Bridge    Two snoop modes       Larger cache                                  Even more cores; above TDP Turbo
Haswell       Three snoop modes     Extra load/store units                        Still more cores; AVX2/FMA; per-core Turbo
Broadwell     Four (!) snoop modes  Larger cache                                  Again even more cores; optimized FMA; per-core Turbo based on instruction type
Skylake       UPI; COD snoop modes  Extra load/store units; +50% memory channels  More cores; larger OOO engine; AVX-512; 3 different core frequency modes
Result: more variation
[Figure: scatter plot of points labelled au001 to au240, first socket on the x-axis and second socket on the y-axis; both axes run from 310 to 360.]
A case study on power
Key Aspects of Acceleration
We have lots of transistors… Moore’s law is holding; this isn’t
necessarily the problem
We don’t really need lower power per transistor, we need lower power
per operation
How to do this?
Performance and Efficiency with Intel® AVX-512
LINPACK Performance
Metric                             SSE4.2  AVX     AVX2    AVX-512
GFLOPs                             669     1178    2034    3259
System power (W)                   760     768     791     767
Core frequency (GHz)               3.1     2.8     2.5     2.1
GFLOPs/Watt, normalized to SSE4.2  1.00    1.74    2.92    4.83
GFLOPs/GHz, normalized to SSE4.2   1.00    1.95    3.77    7.19
Intel® AVX-512 delivers significant performance and efficiency gains
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
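The normalized efficiency figures in the LINPACK charts can be recomputed from the raw values. The following Python sketch assumes the chart values read as 669, 1178, 2034 and 3259 GFLOPs, 760, 768, 791 and 767 W, and 3.1, 2.8, 2.5 and 2.1 GHz for SSE4.2, AVX, AVX2 and AVX-512 respectively; the names `linpack` and `normalized` are illustrative.

```python
# Values as read from the LINPACK chart: (GFLOPs, system power in W, core frequency in GHz).
linpack = {
    "SSE4.2": (669, 760, 3.1),
    "AVX": (1178, 768, 2.8),
    "AVX2": (2034, 791, 2.5),
    "AVX-512": (3259, 767, 2.1),
}


def normalized(metric):
    """Evaluate a metric per instruction set and normalize it to SSE4.2."""
    base = metric(*linpack["SSE4.2"])
    return {isa: round(metric(*row) / base, 2) for isa, row in linpack.items()}


per_watt = normalized(lambda gflops, watts, ghz: gflops / watts)
per_ghz = normalized(lambda gflops, watts, ghz: gflops / ghz)

print("GFLOPs/Watt:", per_watt)
print("GFLOPs/GHz: ", per_ghz)
```

System power stays nearly flat across the four instruction sets while the core frequency drops, so almost all of the gain in GFLOPs per watt comes from doing more work per clock.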
Powerful instructions to save power
For LINPACK, powerful instructions can bring significant performance gains.
What about real applications?
The NAS Parallel Benchmarks are “mini applications” containing kernels for major HPC workload types.
NPB kernels
Kernel  Description                      Workload
CG      Conjugate gradient               Memory latency bound
MG      Multigrid                        Memory intensive
FT      Fourier transform                Compute and transpose
BT      Block tridiagonal solver
SP      Scalar pentadiagonal solver
LU      Lower-upper Gauss-Seidel solver
Conjugate Gradient
[Figure: two panels comparing no-vec, SSE4.2, AVX, AVX2 and AVX-512: frequency in MHz (axis 0 to 4000) and power in Watt (axis 0 to 250).]
Multigrid
[Figure: two panels comparing no-vec, SSE4.2, AVX, AVX2 and AVX-512: frequency in MHz (axis 0 to 4000) and power in Watt (axis 0 to 250).]
Block tridiagonal solver
[Figure: two panels comparing no-vec, SSE4.2, AVX, AVX2 and AVX-512: frequency in MHz (axis 0 to 4000) and power in Watt (axis 0 to 250).]
Fourier Transformation
[Figure: two panels comparing no-vec, SSE4.2, AVX, AVX2 and AVX-512: frequency in MHz (axis 0 to 4000) and power in Watt (axis 0 to 250).]