
Page 1: HPC platform efficiency and challenges for a system builder

HPC platform efficiency and challenges for a system builder

Martin Hilgeman

Technical Director, HPC

[email protected]


2 HPC Saudi 2018

The landscape is changing

“We are no longer in the general purpose era… the argument of tuning software for hardware is moot. Now, to get the best bang for the buck, you have to tune both.”

- Kushagra Vaid, general manager of server engineering, Microsoft Cloud Solutions

https://www.nextplatform.com/2017/03/08/arm-amd-x86-server-chips-get-mainstream-lift-microsoft/amp/


4 nodes in 2U – the HPC standard


The evolution of the dual socket

Per chassis model:

Model     6100              6220              6320              6420
DIMM      12/(3 channels)   16/(4 channels)   16/(4 channels)   16/(6 channels)
PSU       1400              1400              1600              2000

Per CPU generation:

Year      2010     2011     2012      2013         2014         2015          2017
CPU       X5570    X5690    E5-2690   E5-2695 v2   E5-2699 v3   E5-2699A v4   8180
Cores     4        6        8         12           18           22            28
GHz       3.33     3.46     2.9       2.7          2.3          2.4           2.5
TDP (W)   95       130      135       135          145          145           205
Price     $1,500   $1,700   $2,050    $3,549       $3,805       $4,938        $10,009
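To make the trend in the table easier to read, the following sketch derives a few per-core figures from it. The numbers are copied from the table; the derived metrics (Watt per core, dollars per core, and cores times GHz as a rough aggregate clock) are illustrative choices for this example, not metrics used on the slide.

```python
# Values copied from the table above, one entry per CPU generation.
years = [2010, 2011, 2012, 2013, 2014, 2015, 2017]
cores = [4, 6, 8, 12, 18, 22, 28]
ghz = [3.33, 3.46, 2.9, 2.7, 2.3, 2.4, 2.5]
tdp_w = [95, 130, 135, 135, 145, 145, 205]
price_usd = [1500, 1700, 2050, 3549, 3805, 4938, 10009]


def per_core(values, core_counts):
    """Divide each value by the core count of the same generation."""
    return [v / c for v, c in zip(values, core_counts)]


watts_per_core = per_core(tdp_w, cores)
usd_per_core = per_core(price_usd, cores)
# Illustrative aggregate: cores x GHz per socket (not a slide metric).
aggregate_ghz = [c * f for c, f in zip(cores, ghz)]

for row in zip(years, watts_per_core, usd_per_core, aggregate_ghz):
    print("%d  %6.2f W/core  %7.2f $/core  %6.2f core-GHz" % row)
```

Under these assumptions, the TDP per core falls from 23.75 W (2010) to about 7.3 W (2017), while the socket TDP itself more than doubles.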


The de-facto standard node in HPC


Improving performance - what levers do we have?

• Challenge: Sustain performance trajectory without massive increases in cost, power, real estate, and unreliability

• Solutions: No single answer, must intelligently turn “Architectural Knobs”

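\[
\underbrace{\mathit{Freq}}_{1}
\times \underbrace{\frac{\mathit{cores}}{\mathit{socket}}}_{2}
\times \underbrace{\#\mathit{sockets}}_{3}
\times \underbrace{\frac{\mathit{inst\ or\ ops}}{\mathit{core} \times \mathit{clock}}}_{4}
\times \underbrace{\mathit{Efficiency}}_{5}
\]

Terms 1 to 4: hardware performance. Term 5 (efficiency): software performance, which determines what you really get.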
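The product of the five knobs can be sketched as a small function. The node parameters below are assumptions for illustration only: a dual-socket node with 28 cores per socket at a nominal 2.5 GHz, 32 double-precision operations per core per clock (two AVX-512 FMA units), and a made-up efficiency of 0.7. In practice the clock under heavy vector load is lower than nominal, so the peak value here is an upper bound.

```python
def sustained_gflops(freq_ghz, cores_per_socket, sockets,
                     ops_per_core_per_clock, efficiency):
    """Multiply the five knobs; the result is in GFLOP/s."""
    hardware_peak = (freq_ghz * cores_per_socket * sockets
                     * ops_per_core_per_clock)   # knobs 1 to 4
    return hardware_peak * efficiency            # knob 5


# Hypothetical node; every value below is an assumption for this example.
peak = sustained_gflops(2.5, 28, 2, 32, 1.0)
real = sustained_gflops(2.5, 28, 2, 32, 0.7)

print("hardware peak: %.0f GFLOP/s" % peak)
print("with 70%% efficiency: %.0f GFLOP/s" % real)
```

Because the terms multiply, a 10% gain in efficiency is worth as much as a 10% gain in frequency or core count.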


Turning the knobs 1 - 4

1. Frequency is unlikely to change much. Thermal/Power/Leakage challenges

2. Moore’s Law still holds: 130 -> 14 nm. LOTS of transistors

3. Number of sockets per system is the easiest knob. Challenging for power/density/cooling/networking

4. IPC still grows. FMA, AVX, accelerator implementations for algorithms. Challenging for the user/developer


Turning knob #5

Hardware tuning knobs are limited, but there’s far more possible in the software layer

Tuning options per layer, ordered from easy to hard:

Hardware: BIOS, P-states, Memory profile

Operating system: I/O cache tuning, Process affinity, Memory allocation

Middleware: MPI (parallel) tuning, Use of performance libs (Math, I/O, IPP)

Application: Compiler hints, Source changes, Adding parallelism
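As one example from the operating system layer, process affinity can be set without touching application source. The sketch below is a minimal illustration using Python's standard `os.sched_setaffinity` call, which is available on Linux only; the helper name `pin_to_cpu` is hypothetical, and production HPC jobs would more commonly set affinity through the MPI launcher or a tool such as `taskset` or `numactl`.

```python
import os


def pin_to_cpu(cpu):
    """Pin the calling process to one logical CPU and return the new mask."""
    os.sched_setaffinity(0, {cpu})      # pid 0 means the calling process
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):    # the call exists on Linux only
    allowed = sorted(os.sched_getaffinity(0))
    new_mask = pin_to_cpu(allowed[0])   # pin to the first allowed CPU
    print("allowed CPUs:", allowed, "-> pinned to:", sorted(new_mask))
    os.sched_setaffinity(0, set(allowed))   # restore the original mask
```

Pinning keeps a process on one core, so it is not migrated away from the caches and local memory it has been using.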


New capabilities according to Intel

Year          2007    2009   2012   2013   2014   2015   2017
Instructions  SSSE3   SSE4   AVX    AVX    AVX2   AVX2   AVX-512


The state of ISV software

Segment               Applications                 Vectorization support
CFD                   Fluent, LS-DYNA, STAR CCM+   Limited SSE2 support
CSM                   CFX, RADIOSS, Abaqus         Limited SSE2 support
Weather               WRF, UM, NEMO, CAM           Yes
Oil and Gas           Seismic processing           Not applicable
                      Reservoir Simulation         Yes
Chemistry             Gaussian, GAMESS, Molpro     Not applicable
Molecular dynamics    NAMD, GROMACS, Amber,…       PME kernels support SSE2
Biology               BLAST, Smith-Waterman        Not applicable
Molecular mechanics   CPMD, VASP, CP2k, CASTEP     Yes

Bottom line: ISV support for new instructions is poor. Less of an issue for in-house developed codes, but programming is hard


What does Intel do about these trends?

Problem: Inter CPU bandwidth
  Westmere:      No problem
  Sandy Bridge:  Even better
  Ivy Bridge:    Two snoop modes
  Haswell:       Three snoop modes
  Broadwell:     Four (!) snoop modes
  Skylake:       UPI; COD snoop modes

Problem: Memory bandwidth
  Westmere:      No problem
  Sandy Bridge:  Extra memory channel
  Ivy Bridge:    Larger cache
  Haswell:       Extra load/store units
  Broadwell:     Larger cache
  Skylake:       Extra load/store units; +50% memory channels

Problem: Core frequency
  Westmere:      No problem
  Sandy Bridge:  More cores; AVX; Better Turbo
  Ivy Bridge:    Even more cores; Above TDP Turbo
  Haswell:       Still more cores; AVX2/FMA; Per-core Turbo
  Broadwell:     Again even more cores; optimized FMA; Per-core Turbo based on instruction type
  Skylake:       More cores; Larger OOO engine; AVX-512; 3 different core frequency modes


Result: more variation

[Figure: scatter plot with one labelled point per node, au001 to au240. Axes: first socket and second socket, both spanning 310 to 360.]


A case study on power


Key Aspects of Acceleration

We have lots of transistors… Moore’s law is holding; this isn’t necessarily the problem

We don’t really need lower power per transistor, we need lower power per operation

How to do this?


Performance and Efficiency with Intel® AVX-512

LINPACK Performance

                                         SSE4.2   AVX    AVX2   AVX512
GFLOPs                                   669      1178   2034   3259
Power (W)                                760      768    791    767
Frequency (GHz)                          3.1      2.8    2.5    2.1
GFLOPs / Watt (normalized to SSE4.2)     1.00     1.74   2.92   4.83
GFLOPs / GHz (normalized to SSE4.2)      1.00     1.95   3.77   7.19

Intel® AVX-512 delivers significant performance and efficiency gains
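The normalized efficiency figures on this slide follow directly from the raw GFLOPs, power, and frequency values. The sketch below recomputes them from the numbers shown; the energy per operation (Watt divided by GFLOP/s, which equals nanojoules per floating-point operation) is an extra derived quantity added here for illustration and does not appear on the slide.

```python
# Raw LINPACK values from the slide, in the order SSE4.2, AVX, AVX2, AVX512.
isa = ["SSE4.2", "AVX", "AVX2", "AVX512"]
gflops = [669, 1178, 2034, 3259]
power_w = [760, 768, 791, 767]
freq_ghz = [3.1, 2.8, 2.5, 2.1]


def normalized_ratio(numer, denom):
    """Element-wise numer/denom, scaled so the first entry equals 1.00."""
    ratios = [n / d for n, d in zip(numer, denom)]
    return [round(r / ratios[0], 2) for r in ratios]


per_watt = normalized_ratio(gflops, power_w)
per_ghz = normalized_ratio(gflops, freq_ghz)
# W / (GFLOP/s) = nJ per floating-point operation (derived, not on the slide).
nj_per_flop = [p / g for p, g in zip(power_w, gflops)]

for name, w, g, e in zip(isa, per_watt, per_ghz, nj_per_flop):
    print("%-7s GFLOPs/Watt x%.2f  GFLOPs/GHz x%.2f  %.2f nJ/FLOP"
          % (name, w, g, e))
```

System power stays nearly flat across the four instruction sets while the clock drops from 3.1 to 2.1 GHz, so the gain shows up as lower energy per operation rather than lower power.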

1

Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.


Powerful instructions to save power

For LINPACK, powerful instructions can bring significant performance gains.

What about real applications?

NAS Parallel Benchmarks: “mini applications” containing kernels for major HPC workload types


NPB kernels

Kernel   Description                       Workload
CG       Conjugate gradient                Memory latency bound
MG       Multigrid                         Memory intensive
FT       Fourier transform                 Compute and transpose
BT       Block tridiagonal solver
SP       Scalar pentadiagonal solver
LU       Lower-upper Gauss-Seidel solver


Conjugate Gradient

[Figure: two bar charts for the no-vec, SSE4.2, AVX, AVX2 and AVX-512 variants. One shows core frequency in MHz (axis 0 to 4000), the other power in Watt (axis 0 to 250).]


Multigrid

[Figure: two bar charts for the no-vec, SSE4.2, AVX, AVX2 and AVX-512 variants. One shows core frequency in MHz (axis 0 to 4000), the other power in Watt (axis 0 to 250).]


Block tridiagonal solver

[Figure: two bar charts for the no-vec, SSE4.2, AVX, AVX2 and AVX-512 variants. One shows core frequency in MHz (axis 0 to 4000), the other power in Watt (axis 0 to 250).]


Fourier Transformation

[Figure: two bar charts for the no-vec, SSE4.2, AVX, AVX2 and AVX-512 variants. One shows core frequency in MHz (axis 0 to 4000), the other power in Watt (axis 0 to 250).]
