analytical performance modeling and simulation for blue waters · 2018. 3. 5. · analytical...

69
Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions from: Greg Bauer, Steven Gottlieb, William Gropp, William Kramer, Marc Snir, IBM, and the Blue Waters team Keynote at the Workshop on Productivity and Performance (PROPER 2010) August 30 th Ischia Italy

Upload: others

Post on 19-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Analytical Performance Modeling and

Simulation for Blue Waters

Torsten Hoefler

With contributions from: Greg Bauer, Steven Gottlieb, William

Gropp, William Kramer, Marc Snir, IBM, and the Blue Waters team

Keynote at the Workshop on Productivity and Performance (PROPER 2010)

August 30th Ischia Italy

Page 2: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Imagine …

• … you‟re planning to construct a multi-million

Dollar Supercomputer …

• … that consumes as much energy as a small

[european] town …

• … to solve computational problems at an

international scale and advance science to the

next level …

• … with “hero-runs” of [insert verb here] scientific

applications that cost $10k and more per run …

2Performance Modeling and Simulation on Blue Waters

Page 3: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

… and all you have (now) is …

• … then you better plan ahead! (same for Exascale)

3Performance Modeling and Simulation on Blue Waters

Page 4: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

HPC is all about “Performance” – What is this?

• FLOP/s?

• Merriam Webster “flop: to fail completely”

• HPCC: MB/s? GUPS? FFT-rate?

• It‟s a multi-dimensional vector space

• Identify independent vectors

• network: bandwidth, latency, injection rate, …

• memory and I/O: bandwidth, latency, random access rate, …

• CPU: latency (pipeline depth), # execution units, clock speed, …

• Categorize each vector for a particular machine

• open a “feasible region” in the n-dimensional space

4Performance Modeling and Simulation on Blue Waters

Page 5: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Example: Memory Subsystem

• each application has particular coordinates

5Performance Modeling and Simulation on Blue Waters

La

ten

cy

Injection Rate

some graph or

“informatics”

applications regular mesh

computations

highly irregular

mesh computations

• Application A

• Application B

Page 6: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Multi-dimensional Performance Metrics

• Very helpful for HW/SW co-design

• Good for comparison of applications

• Good for comparison of machines

• Design application-optimized (specific) HW

• Not good for absolute estimates

• App. A on machine A vs. app. B on machine B

• “Absolute performance” is very hard to define

• Most useful (universal) metric: “time to solve problem X”

• Also most important as time is fundamentally limited

6Performance Modeling and Simulation on Blue Waters

Page 7: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

What is Performance Modeling?

• Understand the resource usage of an application

on a particular architecture

• We focus mostly on time as a resource

• Generate analytic expressions to estimate runtime

• Closely related to “Performance Engineering”

• Often builds on empirical techniques

• Also cutting into complexity theory

• More pragmatic (asymptotes often insufficient)

• Complex (low-order terms cannot be dropped)

Performance Modeling and Simulation on Blue Waters 7

Page 8: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Performance Modeling and Productivity?

• Predict resources and costs to solve a particular

problem instance

• Evaluate effectiveness of a computing platform to

solve a particular problem

• Understand bottlenecks, i.e., HW/SW Co-Design

• Find performance bugs and assess upper and

lower bounds of code optimizations

• Makes programmers think about the structured

performance profile of an application or platform

8Performance Modeling and Simulation on Blue Waters

Page 9: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Performance Modeling from 10.000 Feet

Platform or System Model

(Hardware, Middleware)

Application Model

(Algorithm, Structure)

Performance Model

Performance Modeling and Simulation on Blue Waters 9

Page 10: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Of course it’s more complex than that

10Performance Modeling and Simulation on Blue Waters

Page 11: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

How to Derive a Platform Model

Platform or System Model

(Hardware, Middleware)

Application Model

(Algorithm, Structure)

Performance Model

Performance Modeling and Simulation on Blue Waters 11

Page 12: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Example: Blue Waters - Overview

• >300.000 compute cores

• Based on POWER7

• 10 PF/s peak (not important, more later)

• 1 PF/s sustained

• Direct network with 1.1 TB/s per hub

• Total approx. 10 PB/s

• >1.2 PB main memory

• >18 PB on-line disk storage

• >500 PB near-line storage

12Performance Modeling and Simulation on Blue Waters

Page 13: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

From Chip to Entire Integrated System

13

On-line Storage

Near-line Storage

L-L

ink C

ab

les

Super Node(32 Nodes / 4 CEC)

P7 Chip

(8 cores)

SMP node

(32 cores)

Drawer

(256 cores)

SuperNode

(1024 cores)

Building Block

Blue Waters System

NPCF

Performance Modeling and Simulation on Blue Waters Source: IBM

Page 14: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

14Performance Modeling and Simulation on Blue Waters Source: IBM

Page 15: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

POWER7 Chip (8 cores)

• Base Technology

• 45 nm, 576 mm2

• 1.2 B transistors

• Chip

• 8 cores

• 4 FMAs/cycle/core

• 32 MB L3 (private/shared)

• Dual DDR3 memory

• 128 GiB/s peak bandwidth

• (1/2 byte/flop)

• Clock range of 3.5 – 4 GHz

15

Quad-chip MCM

Performance Modeling and Simulation on Blue Waters Source: IBM

Page 16: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

L3 Cache/On-Chip Communication

16

• L1 32KB Instruction / core

• L1 32KB Data / core

• L2 = 256KB / core

• L3 = 4MB eDRAM / core

• Fast private and

shared region

Performance Modeling and Simulation on Blue Waters Source: IBM

Page 17: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

MC

1M

C 0

8c uP

MC

1M

C0

8c uP

MC

0M

C1

8c uP

MC

0M

C1

8c uP

P7-0 P7-1

P7-3 P7-2

A

B

X

W

B

A

W

X

B A

Y X

A B

X Y

C

Z

C

Z

W

Z

Z

W

C

Y

C

Y

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

MC

0M

C 1

MC

0M

C 1

MC

0M

C 1

MC

0M

C 1

Quad Chip Module (4 chips)

• 32 cores

• 32 cores*8 F/core*4 GHz = 1 TF

• 4 threads per core (max)

• 128 threads per package

• 4x32 MiB L3 cache

• 512 GB/s RAM BW (0.5 B/F)

• 800 W (0.8 W/F)

17Performance Modeling and Simulation on Blue Waters Source: IBM

Page 18: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Adding a Network Interface (Hub)

18

DIM

M 1

5D

IMM

8D

IMM

9D

IMM

14

DIM

M 4

DIM

M 1

2D

IMM

13

DIM

M 5

DIM

M 1

1D

IMM

3D

IMM

2D

IMM

10

DIM

M 0

DIM

M 7

DIM

M 6

DIM

M 1

P7-IH Quad Block Diagram32w SMP w/ 2 Tier SMP Fabric, 4 chip Processor MCM + Hub SCM w/ On-Board Optics

MC

1

Mem

Mem

Mem

Mem

MC

0

Mem

Mem

8c uP

MC

1

Mem

Mem

Mem

Mem

MC

0

Mem

Mem

Mem

Mem

8c uP

7 Inter-Hub Board Level L-Buses

3.0Gb/s @ 8B+8B, 90% sus. peak

D0-D15 Lr0-Lr23

320 GB/s 240 GB/s

28x XMIT/RCV pairs@ 10 Gb/s

832

624

5+

5G

B/s

(6

x=

5+

1)

Hub Chip Module

22

+2

2G

B/s

164

Ll0

22

+2

2G

B/s

164

Ll1

22

+2

2G

B/s

164

Ll2

22

+2

2G

B/s

164

Ll3

22

+2

2G

B/s

164

Ll4

22

+2

2G

B/s

164

Ll5

22

+2

2G

B/s

164

Ll6

12x 12x

12x 12x

12x 12x

12x 12x

12x 12x

12x 12x

10

+1

0G

B/s

(1

2x=

10

+2

)12x 12x 12x 12x

12x 12x 12x 12x

12x 12x 12x 12x

12x 12x 12x 12x

7+

7G

B/s

72

EG2

PC

Ie 6

1x

7+

7G

B/s

72

EG1

PC

Ie 1

6x

7+

7G

B/s

72

EG2

PC

Ie 8

x

MC

0

Mem

Mem

Mem

Mem

MC

1

Mem

Mem

Mem

Mem

8c uP

MC

0

Mem

Mem

Mem

Mem

MC

1

Mem

Mem

Mem

Mem

8c uP

Mem

Mem

P7-0 P7-1

P7-3 P7-2

A

B

X

W

B

A

W

X

B A

Y X

A B

X Y

C

Z

C

Z

W

Z

Z

W

C

Y

C

Y

Z

W

Y

X

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

A Clk Grp

B Clk Grp

C Clk Grp

D Clk Grp

MC

0M

C 1

MC

0M

C 1

MC

0M

C 1

MC

0M

C 1

DIM

M 1

Mem

Mem

DIM

M 0

Mem

Mem

DIM

M 5Mem

Mem

DIM

M 4Mem

Mem

DIM

M 7

Mem

Mem

DIM

M 6

Mem

Mem

DIM

M 1

0

Mem

Mem

DIM

M 1

1

Mem

Mem

DIM

M 2

Mem

Mem

DIM

M 3

Mem

Mem

DIM

M 1

4Mem

Mem

DIM

M 1

5Mem

Mem

DIM

M 8Mem

Mem

DIM

M 9Mem

Mem

DIM

M 1

2Mem

Mem

DIM

M 1

3Mem

Mem

• Connects QCM to PCI-e

• Two 16x and one 8x PCI-e slot

• Connects 8 QCM's via low latency,

high bandwidth, copper fabric.

• Provides a message passing

mechanism with very

high bandwidth

• Provides the lowest possible

latency between 8 QCM's

Performance Modeling and Simulation on Blue Waters Source: IBM

Page 19: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

EI-

3 P

HY

s

Torrent

Diff PHYs

L lo

ca

l

HU

B T

o H

UB

Co

pp

er

Bo

ard

Wir

ing

L remote

4 Drawer Interconnect to Create a Supernode

Optical

LR

0 B

us

Op

tic

al

6x 6x

LR

23

Bu

s

Op

tic

al

6x 6x

LL0 Bus

Copper

8B

8B

8B

8B

LL1 Bus

Copper

8B

8B

LL2 Bus

Copper

8B

8B

LL4 Bus

Copper

8B

8B

LL5 Bus

Copper

8B

8B

LL6 Bus

Copper

8B

8B

LL3 Bus

Copper

Dif

f P

HY

s

PX

0 B

us

16

x

16

x

PC

I-E

IO P

HY

Ho

t P

lug

Ctl

PX

1 B

us

16

x

16

x

PC

I-E

IO P

HY

Ho

t P

lug

Ctl

PX

2 B

us

8x8x

PC

I-E

IO P

HY

Ho

t P

lug

Ctl

FS

I

FS

P1

-A

FS

I

FS

P1

-B

I2C

TP

MD

-A, T

MP

D-B

SV

IC

MD

C-A

SV

IC

MD

C-B

I2C

SE

EP

RO

M 1

I2C

SE

EP

RO

M 2

24

L remote

BusesHUB to QCM Connections

Address/Data

D B

us

Inte

rco

nn

ec

t o

f S

up

ern

od

es

Op

tic

al

D0 Bus

Optical

12x

12x

D15 Bus

Optical

12x

12x

16

D B

us

es

28

I2C

I2C_0 + Int

I2C_27 + Int

I2C

To

Op

tic

al

Mo

du

les

TO

D S

yn

c

8B

Z-B

us

8B

Z-B

us

TO

D S

yn

c

8B

Y-B

us

8B

Y-B

us

TO

D S

yn

c

8B

X-B

us

8B

X-B

us

TO

D S

yn

c

8B

W-B

us

8B

W-B

us

1.1 TB/s HUB

19

• 192 GB/s Host Connection

• 336 GB/s to 7 other local nodes

• 240 GB/s to local-remote nodes

• 320 GB/s to remote nodes

• 40 GB/s to general purpose I/O

• cf. “The PERCS interconnect” @HotI‟10

Performance Modeling and Simulation on Blue Waters

Hub Chip

Source: IBM

Page 20: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

20

First Level Interconnect

L-Local

HUB to HUB Copper Wiring

256 Cores

DCA-0 Connector (Top DCA)

DCA-1 Connector (Bottom DCA)

1st Level Local Interconnect (256 cores)

HUB

7

HUB

6

HUB

4

HUB

3

HUB

5

HUB

1

HUB

0

HUB

2

P

C

I

e

9

P

C

I

e

10

P

C

I

e

11

P

C

I

e

12

P

C

I

e

13

P

C

I

e

14

P

C

I

e

15

P

C

I

e

16

P

C

I

e

17

P1-C

17

-C1

P

C

I

e

1

P

C

I

e

2

P

C

I

e

3

P

C

I

e

4

P

C

I

e

5

P

C

I

e

6

P

C

I

e

7

P

C

I

e

8

Op

tica

l F

an-o

ut fr

om

HU

B M

od

ule

s

2,3

04

Fib

er

'L-L

ink'

64/40 Optical'D-Link'

64/40 Optical'D-Link'

P7-0

P7-2

P7-3P7-1

QCM 0

U-P1-M1

P7-0

P7-2

P7-3P7-1

QCM 1

U-P1-M2

P7-0

P7-2

P7-3P7-1

QCM 2

U-P1-M3

P7-0

P7-2

P7-3P7-1

QCM 3

U-P1-M4

P7-0

P7-2

P7-3P7-1

QCM 4

U-P1-M5

P7-0

P7-2

P7-3P7-1

QCM 5

U-P1-M6

P7-0

P7-2

P7-3P7-1

QCM 6

U-P1-M7

P7-0

P7-2

P7-3P7-1

QCM 7

U-P1-M8

P1-C

16

-C1

P1-C

15

-C1

P1-C

14

-C1

P1-C

13

-C1

P1-C

12

-C1

P1-C

11

-C1

P1-C

10

-C1

P1-C

9-C

1

P1-C

8-C

1

P1-C

7-C

1

P1-C

6-C

1

P1-C

5-C

1

P1-C

4-C

1

P1-C

3-C

1

P1-C

2-C

1

P1-C

1-C

1

N0

-DIM

M1

5N

0-D

IMM

14

N0

-DIM

M1

3N

0-D

IMM

12

N0

-DIM

M1

1N

0-D

IMM

10

N0

-DIM

M0

9N

0-D

IMM

08

N0

-DIM

M0

7N

0-D

IMM

06

N0

-DIM

M0

5N

0-D

IMM

04

N0

-DIM

M0

3N

0-D

IMM

02

N0

-DIM

M0

1N

0-D

IMM

00

N1

-DIM

M1

5N

1-D

IMM

14

N1

-DIM

M1

3N

1-D

IMM

12

N1

-DIM

M1

1N

1-D

IMM

10

N1

-DIM

M0

9N

1-D

IMM

08

N1

-DIM

M0

7N

1-D

IMM

06

N1

-DIM

M0

5N

1-D

IMM

04

N1

-DIM

M0

3N

1-D

IMM

02

N1

-DIM

M0

1N

1-D

IMM

00

N2

-DIM

M1

5N

2-D

IMM

14

N2

-DIM

M1

3N

2-D

IMM

12

N2

-DIM

M1

1N

2-D

IMM

10

N2

-DIM

M0

9N

2-D

IMM

08

N2

-DIM

M0

7N

2-D

IMM

06

N2

-DIM

M0

5N

2-D

IMM

04

N2

-DIM

M0

3N

2-D

IMM

02

N2

-DIM

M0

1N

2-D

IMM

00

N3

-DIM

M1

5N

3-D

IMM

14

N3

-DIM

M1

3N

3-D

IMM

12

N3

-DIM

M1

1N

3-D

IMM

10

N3

-DIM

M0

9N

3-D

IMM

08

N3

-DIM

M0

7N

3-D

IMM

06

N3

-DIM

M0

5N

3-D

IMM

04

N3

-DIM

M0

3N

3-D

IMM

02

N3

-DIM

M0

1N

3-D

IMM

00

N4

-DIM

M1

5N

4-D

IMM

14

N4

-DIM

M1

3N

4-D

IMM

12

N4

-DIM

M1

1N

4-D

IMM

10

N4

-DIM

M0

9N

4-D

IMM

08

N4

-DIM

M0

7N

4-D

IMM

06

N4

-DIM

M0

5N

4-D

IMM

04

N4

-DIM

M0

3N

4-D

IMM

02

N4

-DIM

M0

1N

4-D

IMM

00

N5

-DIM

M1

5N

5-D

IMM

14

N5

-DIM

M1

3N

5-D

IMM

12

N5

-DIM

M1

1N

5-D

IMM

10

N5

-DIM

M0

9N

5-D

IMM

08

N5

-DIM

M0

7N

5-D

IMM

06

N5

-DIM

M0

5N

5-D

IMM

04

N5

-DIM

M0

3N

5-D

IMM

02

N5

-DIM

M0

1N

5-D

IMM

00

N6

-DIM

M1

5N

6-D

IMM

14

N6

-DIM

M1

3N

6-D

IMM

12

N6

-DIM

M1

1N

6-D

IMM

10

N6

-DIM

M0

9N

6-D

IMM

08

N6

-DIM

M0

7N

6-D

IMM

06

N6

-DIM

M0

5N

6-D

IMM

04

N6

-DIM

M0

3N

6-D

IMM

02

N6

-DIM

M0

1N

6-D

IMM

00

N7

-DIM

M1

5N

7-D

IMM

14

N7

-DIM

M1

3N

7-D

IMM

12

N7

-DIM

M1

1N

7-D

IMM

10

N7

-DIM

M0

9N

7-D

IMM

08

N7

-DIM

M0

7N

7-D

IMM

06

N7

-DIM

M0

5N

7-D

IMM

04

N7

-DIM

M0

3N

7-D

IMM

02

N7

-DIM

M0

1N

7-D

IMM

00

Drawer

• 8 nodes

• 32 chips

• 256 cores

Performance Modeling and Simulation on Blue Waters Source: IBM

Page 21: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

21Performance Modeling and Simulation on Blue Waters Photo taken at SC09

Page 22: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

P7-IH Board Layout 2nd Level of Interconnect (1,024 cores)

4.6 TB/s

Bisection BWBW of 1150

10G-E ports

DCA-0 Connector (Top DCA)

DCA-1 Connector (Bottom DCA)

2nd

Level Interconnect (1,024 cores)

HUB

7

61x96mm

HUB

6

61x96mm

HUB

4

61x96mm

HUB

3

61x96mm

HUB

5

61x96mm

HUB

1

61x96mm

HUB

0

61x96mm

HUB

2

61x96mm

P

C

I

e

9

P

C

I

e

10

P

C

I

e

11

P

C

I

e

12

P

C

I

e

13

P

C

I

e

14

P

C

I

e

15

P

C

I

e

16

P

C

I

e

17

P

C

I

e

1

P

C

I

e

2

P

C

I

e

3

P

C

I

e

4

P

C

I

e

5

P

C

I

e

6

P

C

I

e

7

P

C

I

e

8

Op

tica

l Fa

n-o

ut fro

m

HU

B M

od

ule

s

2,3

04

Fib

er 'L

-Lin

k'

64/40 Optical'D-Link'

FSP/CLK-A

64/40 Optical'D-Link'

FSP/CLK-B

P7-0

P7-2

P7-3P7-1

QCM 0

P7-0

P7-2

P7-3P7-1

QCM 1

P7-0

P7-2

P7-3P7-1

QCM 2

P7-0

P7-2

P7-3P7-1

QCM 3

P7-0

P7-2

P7-3P7-1

QCM 4

P7-0

P7-2

P7-3P7-1

QCM 5

P7-0

P7-2

P7-3P7-1

QCM 6

P7-0

P7-2

P7-3P7-1

QCM 7

DCA-0 Connector (Top DCA)

DCA-1 Connector (Bottom DCA)

2nd

Level Interconnect (1,024 cores)

HUB

7

61x96mm

HUB

6

61x96mm

HUB

4

61x96mm

HUB

3

61x96mm

HUB

5

61x96mm

HUB

1

61x96mm

HUB

0

61x96mm

HUB

2

61x96mm

P

C

I

e

9

P

C

I

e

10

P

C

I

e

11

P

C

I

e

12

P

C

I

e

13

P

C

I

e

14

P

C

I

e

15

P

C

I

e

16

P

C

I

e

17

P

C

I

e

1

P

C

I

e

2

P

C

I

e

3

P

C

I

e

4

P

C

I

e

5

P

C

I

e

6

P

C

I

e

7

P

C

I

e

8

Op

tica

l Fa

n-o

ut fro

m

HU

B M

od

ule

s

2,3

04

Fib

er 'L

-Lin

k'

64/40 Optical'D-Link'

FSP/CLK-A

64/40 Optical'D-Link'

FSP/CLK-B

P7-0

P7-2

P7-3P7-1

QCM 0

P7-0

P7-2

P7-3P7-1

QCM 1

P7-0

P7-2

P7-3P7-1

QCM 2

P7-0

P7-2

P7-3P7-1

QCM 3

P7-0

P7-2

P7-3P7-1

QCM 4

P7-0

P7-2

P7-3P7-1

QCM 5

P7-0

P7-2

P7-3P7-1

QCM 6

P7-0

P7-2

P7-3P7-1

QCM 7

DCA-0 Connector (Top DCA)

DCA-1 Connector (Bottom DCA)

2nd

Level Interconnect (1,024 cores)

HUB

761x96mm

HUB

661x96mm

HUB

461x96mm

HUB

361x96mm

HUB

561x96mm

HUB

161x96mm

HUB

061x96mm

HUB

261x96mm

P

C

I

e

9

P

C

I

e

10

P

C

I

e

11

P

C

I

e

12

P

C

I

e

13

P

C

I

e

14

P

C

I

e

15

P

C

I

e

16

P

C

I

e

17

P

C

I

e

1

P

C

I

e

2

P

C

I

e

3

P

C

I

e

4

P

C

I

e

5

P

C

I

e

6

P

C

I

e

7

P

C

I

e

8

Op

tica

l F

an-o

ut fr

om

HU

B M

od

ule

s

2,3

04

Fib

er

'L-L

ink'

64/40 Optical'D-Link'

FSP/CLK-A

64/40 Optical'D-Link'

FSP/CLK-B

P7-0

P7-2

P7-3P7-1

QCM 0

P7-0

P7-2

P7-3P7-1

QCM 1

P7-0

P7-2

P7-3P7-1

QCM 2

P7-0

P7-2

P7-3P7-1

QCM 3

P7-0

P7-2

P7-3P7-1

QCM 4

P7-0

P7-2

P7-3P7-1

QCM 5

P7-0

P7-2

P7-3P7-1

QCM 6

P7-0

P7-2

P7-3P7-1

QCM 7

DCA-0 Connector (Top DCA)

DCA-1 Connector (Bottom DCA)

2nd

Level Interconnect (1,024 cores)

HUB

761x96mm

HUB

661x96mm

HUB

461x96mm

HUB

361x96mm

HUB

561x96mm

HUB

161x96mm

HUB

061x96mm

HUB

261x96mm

P

C

I

e

9

P

C

I

e

10

P

C

I

e

11

P

C

I

e

12

P

C

I

e

13

P

C

I

e

14

P

C

I

e

15

P

C

I

e

16

P

C

I

e

17

P

C

I

e

1

P

C

I

e

2

P

C

I

e

3

P

C

I

e

4

P

C

I

e

5

P

C

I

e

6

P

C

I

e

7

P

C

I

e

8

Op

tica

l F

an-o

ut fr

om

HU

B M

od

ule

s

2,3

04

Fib

er

'L-L

ink'

64/40 Optical'D-Link'

FSP/CLK-A

64/40 Optical'D-Link'

FSP/CLK-B

P7-0

P7-2

P7-3P7-1

QCM 0

P7-0

P7-2

P7-3P7-1

QCM 1

P7-0

P7-2

P7-3P7-1

QCM 2

P7-0

P7-2

P7-3P7-1

QCM 3

P7-0

P7-2

P7-3P7-1

QCM 4

P7-0

P7-2

P7-3P7-1

QCM 5

P7-0

P7-2

P7-3P7-1

QCM 6

P7-0

P7-2

P7-3P7-1

QCM 7

22

Second Level Interconnect

Optical „L-Remote‟ Links from HUB

4 drawers

1,024 Cores

L-L

ink C

ab

les

Super Node(32 Nodes / 4 CEC)

Supernode

Performance Modeling and Simulation on Blue Waters Source: IBM

Page 23: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

National Petascale Computing Facility

23Performance Modeling and Simulation on Blue Waters Photo: NCSA

Page 24: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Designing a Platform Model

• Key Platform parameters:

• Memory

• CPU and Caches

• Network

• I/O subsystem (e.g., disk)

• OS (e.g., noise)

• Middleware Parameters (e.g., MPI)

24Performance Modeling and Simulation on Blue Waters

Page 25: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Determine a Memory Model

• Memory parameters

• Latency – time to access a single element (cf. L)

• Access Rate – per-element overhead (cf. g)

• Bandwidth - linear access time (cf. G)

• Netgauge memory benchmark

• Random pointer-chase to measure latency

• Bulk random access to measure access rate (cf. GUPS)

• Bulk linear access to measure bandwidth (cf. stream)

25Performance Modeling and Simulation on Blue Waters

Page 26: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Single Core Memory on the IBM 780 (P7 MR)

26

512 MB (8 Byte) random read: 143 MB/s, write: 320 MB/s, copy: 121 MB/s

latency: 132 ns

compiled with gcc 4.3.4

Performance Modeling and Simulation on Blue Waters

cached uncached

Page 27: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Single Core Modeling Strategy

• Modeling performance of modern CPUs is well

understood but very complex

• Upper bounds are (among others) FLOP-rate and

memory performance

• Details are not too important as we mostly care

about scalability

• Semi-empirical modeling: run application

benchmarks to determine single-core models

• Simulation: simulate CPU performance (Mambo)

27Performance Modeling and Simulation on Blue Waters

Page 28: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Determining a Network Model

• The network is most important

• Becomes dominant at scale

• We cannot simply “run” benchmarks to determine a

model as in the single-core case (too expensive)

• Parameters:

• Topology (bisection bandwidth)

• Collective parameters (e.g., Alltoall bandwidth)

• Point-to-point parameters (e.g., LogGP)

• latency, CPU overhead, injection rate/gap, bandwidth

28Performance Modeling and Simulation on Blue Waters

Page 29: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

• LL Topology

• 24 GB/s

• 7 links/Hub

• Fully connected

• 8 Hubs

29Performance Modeling and Simulation on Blue WatersSource: B. Arimilli et al. “The PERCS High-

Performance Interconnect”

Page 30: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

• LR Topology

• 5 GB/s

• 24 links/Hub

• Fully connected

• 4 Drawers

• 32 Hubs

30Performance Modeling and Simulation on Blue WatersSource: B. Arimilli et al. “The PERCS High-

Performance Interconnect”

Page 31: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

31

• D Topology

• 10 GB/s

• 16 links/Hub

• Fully

connected

• 512 SNs

• 2048 Drawers

• 16384 Hubs

Performance Modeling and Simulation on Blue WatersSource: B. Arimilli et al. “The PERCS High-

Performance Interconnect”

Page 32: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Routing

• Deterministic routing

• Hits Borodin-Hopcroft bound ( )

• Bad worst-case performance

• Indirect (random) routing

• Increases latency

• Effectively halves (global) bandwidth

• Worst-case becomes very unlikely

32Performance Modeling and Simulation on Blue Waters

Page 33: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

SN-SN Direct Routing

SuperNode A

Direct Routing

SuperNode B

L-Hops: 2

D-Hops: 1

Total: 3 hops

D3

LR12

LR21

Performance Modeling and Simulation on Blue Waters 33Source: B. Arimilli et al. “The PERCS High-

Performance Interconnect”

Page 34: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

SN-SN Indirect Routing

SuperNode A SuperNode B

SuperNode x

L-Hops: 3

D-Hops: 2

Total: 5 hops

D5

D12

LR21

LL7

LR30

Total paths = # SNs – 2

Performance Modeling and Simulation on Blue Waters 34Source: B. Arimilli et al. “The PERCS High-

Performance Interconnect”

Page 35: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Example Model: BW Global Alltoall Bandwidth

• Assume all communications happen simultaneously

• Derive upper (expected) bound for direct routing

• Simple counting argument

• Each CN can be reached through series of LL, LR, D

• Not more than one D link or LL-LR, LR-LL, LL-LL, LR-LR

• Denote e(P) as number of nodes reachable through path P

• e(LL) =7, e(LR)=24, e(D)=d (variable number of D-links)

• # nodes reachable in two hops:

• e(LL-D)=e(D-LL)=7d

• e(LR-D)=e(D-LR) = 24d

35Performance Modeling and Simulation on Blue WatersSource: B. Arimilli et al. “The PERCS High-

Performance Interconnect”

Page 36: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Example: Alltoall Bandwidth Continued …

• # nodes reachable in three hops:

• e(LL-D-LL) = 49d

• e(LL-D-LR)=e(LR-D-LL) = 168d

• e(LR-D-LR)=576d

• Number of paths from each source: 31+1024d

• Now count the number of paths (i.e., congestion) through

each LL, LR, D link c(L): (omitted uninteresting parts of summation)

• c(LL) = 1+64d

• c(LR) = 1+64d

• c(D) = 1024

36Performance Modeling and Simulation on Blue WatersSource: B. Arimilli et al. “The PERCS High-

Performance Interconnect”

Page 37: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Example: Alltoall Bandwidth Continued …

• Effective (expected) bandwidths for each link-type:

• bandwidth = link bandwidth / congestion

• b(LL) = 24 GB/s / (1+64d)

• b(LR) = 5 GB/s / (1+64d)

• b(D) = 10 GB/s / 1024

• If d<8, D is the bottleneck, otherwise LR

• Bandwidth per PE: 5 GB/s / 1+64d * total # paths

• d=9 (294912 cores): 80.13 GB/s

• d=10 (327680 cores): 80.11 GB/s total ~ 0.8 PB/s

• Connection to P7 chips: 4*24 GB/s = 96 GB/s ~ 83%

37Performance Modeling and Simulation on Blue WatersSource: B. Arimilli et al. “The PERCS High-

Performance Interconnect”

Page 38: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Influence of the Routing Strategy

• Indirect routing would effectively half the bandwidth

• First route to intermediate supernodes

• Not desirable for global Alltoall case

• Other patterns could benefit

• Example: bad pattern for direct routing:

• all nodes in SN A send to all nodes in SN B

• congestion of 32, i.e., 312.5 MB/s

• (optimal) indirect routing would improve to 5 GB/s

• Detailed routing strategy is still under discussion

• May be influenced by outcome of application modeling

38Performance Modeling and Simulation on Blue Waters

Page 39: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Complete Network Models

• LogGP parameters can be measured (Netgauge)

• Or estimated based on documentation

• Traffic patterns need to be modeled or simulated

• Performance depends on task-to-node mapping

• Significant effort but only done once per machine

• NCSA is working on a detailed system model of

Blue Waters

39Performance Modeling and Simulation on Blue Waters

Page 40: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

How to Derive an Application Model

Platform or System Model

(Hardware, Middleware)

Application Model

(Algorithm, Structure)

Performance Model

Performance Modeling and Simulation on Blue Waters 40

Page 41: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Application Models

• Are often much more complex than platform models

• Serial performance models:

• CPU + Cache requirements

• Memory requirements

• I/O requirements

• Parallel performance models:

• Amdahl‟s law (scaling)

• Communication overhead

• Communication/computation overlap

41Performance Modeling and Simulation on Blue Waters

Page 42: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Rough Application Classification

42

Static Adaptive Dynamic

fixed control-

and dataflowfixed control-flow

and variable dataflow

variable control-

and dataflow

e.g., structured

n-dimensional

meshes/tori

e.g., MILC, WRF,

Abinit, Sweep3D

e.g., TDDFT/Octopus,

Enzo, ACES III

e.g., AMR, N-body,

unstructured mesh

e.g., high-impact AMR,

graph computations

e.g., PBGL, BFS

easy to model … … hard to model

Performance Modeling and Simulation on Blue Waters

Page 43: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

A Strategy for Static Applications

• Two of the three NSF Petascale benchmarks are static

• Turbulence & Quantum Chromodynamics

• Application modeling in three simple steps:

1. Identify all performance-critical parameters

2. Identify code-blocks that share a performance signature

• derive serial performance model for each

3. Determine application communication pattern

• including communication sizes

43Performance Modeling and Simulation on Blue Waters

Page 44: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

An Application Modeling Example: MILC

• MIMD Lattice Computation

• Gains deeper insights in

fundamental laws of physics

• Determine the predictions of

lattice field theories (QCD &

Beyond Standard Model)

• Major NSF application

• Challenge:

• High accuracy (computationally intensive) required for

comparison with results from experimental programs in

high energy & nuclear physics

44Performance Modeling and Simulation on Blue Waters

Page 45: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

MILC - Performance-critical Parameters

45

Name simple complex comment

P X Number of processes

nx, ny, nz, nt X Lattice size in x,y,z,t

warms, trajecs X Warmup rounds and trajectories

traj_between_meas X Number of “steps” in each trajectory

beta, mass1, mass2, error_for_propagator

X Physical parameters – influence convergence of conjugate gradient

max_cg_iterations X Limits CG iterations per step

• If parameters are more complex (e.g., input files) then the

user has to distill them into single values (domain specific)

Performance Modeling and Simulation on Blue Waters

Page 46: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

MILC – Critical Blocks

• Identify sub-trees in

call-graph with same

performance characteristic

• Five blocks in MILC

46

Name Function

LL load_longlinks

FL load_fatlinks

CG ks_congrad

GF imp_gauge_force

FF eo_fermion_force

Ignored

insignificant

sub-trees

Performance Modeling and Simulation on Blue Waters

Page 47: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Communication Pattern

• Four-dimensional p2p communication topology

• Prime-factor decomposition of P (→ square)

• Total number of p2p messages

• Counted manually (profiling tools and source)

• Collective Communication

• Single MPI_Allreduce per CG iteration

47

Type Number of Messages

FF (trajecs + warms) · steps · 1616

GF … (for LL, FL, CG)

Performance Modeling and Simulation on Blue Waters

Page 48: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Ideas for Automation – Tool support

• Model each function as a critical block

• Automatic decomposition leads to more blocks

• User needs to provide asymptotic scaling function

• e.g., T ~ nx*ny*nz*nt,

• Statistical runs could automatically fit the

parameters (can model cache linearly)

• Similar techniques can be used for message

counting

• Should investigate tool support for this!

48Performance Modeling and Simulation on Blue Waters

Page 49: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

How to Derive an Application Model

Platform or System Model

(Hardware, Middleware)

Application Model

(Algorithm, Structure)

Performance Model

Performance Modeling and Simulation on Blue Waters 49

Page 50: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Applying a Machine Model

1. Determine sequential baseline

• Use single-core application runs as accurate

CPU/memory model

2. Process-to-node mapping

• On-node vs. off-node communication

• Network congestion

3. Model communication overheads

• Uses message counts and collective counts

50Performance Modeling and Simulation on Blue Waters

Page 51: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Test system for Model Validation

• BW hardware details are still confidential

• Single Power7 MR is not useful for validation

• No scaling runs possible

• We use a 1920-core Power5+ system

• 120 nodes 16 cores per node

• HPS switch

• POE/IBM MPI Version 5.1

51Performance Modeling and Simulation on Blue Waters

Page 52: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Sequential Baseline

• Stepwise linear function to represent cache influence

• Chose two steps, different CPUs might need more

• Volume V = nx*ny*nz*nt; Type B = {LL, FL, GF, CG, FF}

• Cache holds s(B) data elements

52Performance Modeling and Simulation on Blue Waters

Page 53: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Example block: GF

53Performance Modeling and Simulation on Blue Waters

Page 54: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

System Model: Communication Parameters

55

Intra-node Inter-node

Performance Modeling and Simulation on Blue Waters

Page 55: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

On-node Memory Contention

• Two cores share one memory controller

• Congestion has non-trivial performance effects

• Empirical analysis

• Assume fixed 10%

slowdown

56Performance Modeling and Simulation on Blue Waters

Page 56: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Process-to-Node Mapping

• Trivial linear

default mapping

• With 4 processes

per node:

• 6 internal edges

• 10 remote edges

• Wrap-around

• Looses two internal edges

• Unbalanced communication

57Performance Modeling and Simulation on Blue Waters

Page 57: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Optimized Process-to-Node Mapping

• Optimal mapping

• cf. Lagrange multiplier

• 8 internal edges

• 8 remote edges

• Similar for 4d mapping

• 16 cores, optimal sub-block:

• ½ remote edges

58Performance Modeling and Simulation on Blue Waters

Page 58: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Parallel Performance Example: GF

59Performance Modeling and Simulation on Blue Waters

Page 59: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Parallel Performance Model

• Example: quantify maximum

benefit of MPI

datatypes!

• cf. EuroMPI‟10

60Performance Modeling and Simulation on Blue Waters

Page 60: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Scaling with the Number of Processes

61Performance Modeling and Simulation on Blue Waters

V=64 V=104

Page 61: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Generating a P7 Model and Prediction

• Back to this guy:

62Performance Modeling and Simulation on Blue Waters

Page 62: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Sequential Baseline – Now very Simple!

63

Power5+ Power7 MR

Performance Modeling and Simulation on Blue Waters

Page 63: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Power7 MR Sequential Model

• relative error is less than 10%

64Performance Modeling and Simulation on Blue Waters

Page 64: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Communication Parameters (P7 MR)

• Intra-node parameters benchmarked (Netgauge)

• Network parameters from (public) documentation

• Latency ~ 1

• Overhead ~ 0.5

• Gap ~ 0.5

• Gap per Byte (bandwidth) ~ 0.2 ns (5 GB/s)

• Allreduce :

65Performance Modeling and Simulation on Blue Waters

Page 65: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Process-to-Node Mapping

• assume two 2x2x2x2 sub-blocks per node

• 8 edges per process

• 16 * 4 local edges per sub-block

• 2*2*2 edges between the two blocks

• 72 local edges

• 56 remote edges

• local : remote ratio = 1 : 0.7778

66Performance Modeling and Simulation on Blue Waters

Page 66: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Complete P7 MR Prediction

• Shared memory communication causes high overhead

67Performance Modeling and Simulation on Blue Waters

Page 67: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Weak Scaling to 300000 Cores

68Performance Modeling and Simulation on Blue Waters

V=64 V=104

Page 68: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Benefits of the Model• Allows to estimate (with varying P and V)

• Communication overheads

• on-node vs. off-node communication

• evaluate decomposition and mapping strategies offline

• Time spent in message packing

• upper bound for benefits of datatypes (cf. EuroMPI‟10)

• Scaling bottlenecks

• Allreduce is relatively non-critical for scaling up

• ignoring OS noise for now

• Predict runtime at scale

• estimate overheads at large scale

• …

69Performance Modeling and Simulation on Blue Waters

Page 69: Analytical Performance Modeling and Simulation for Blue Waters · 2018. 3. 5. · Analytical Performance Modeling and Simulation for Blue Waters Torsten Hoefler With contributions

Takeaways, Questions & Discussion

Performance Modeling and Simulation on Blue Waters 70

• Performance Modeling should become routine!

• Modeling during application and system design

• Maintain models through

application lifetime

• Improves productivity

• Needs tool support

• Existing, needs glue

• It‟s relatively simple!

• For static applications