tobias gysi, tobias grosser and torsten hoefler absinthe… · absinthe halide polymage....

18
spcl.inf.ethz.ch @spcl_eth T OBIAS GYSI, TOBIAS GROSSER, AND T ORSTEN HOEFLER Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot

Upload: others

Post on 19-Jun-2021

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

TOBIAS GYSI, TOBIAS GROSSER, AND TORSTEN HOEFLER

Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot

Page 2: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

COSMO Atmospheric Model

• Regional atmospheric model used by 7 national weather services• Implements many different stencil programs

Page 3: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

3

Optimizing the Fastwaves Kernel from the COSMO Atmospheric Model

unfused

tiled

64x64x1

20 1

3 4

6

5

7 8

model prediction [ms]

measu

red

exe

cuti

on t

ime [

ms]

1.08

0.94

absinthe

64x4x3

64x4x5

20 1

3 4

6

5

78

unfused

tiled

64x64x1

20 1

3 4

6

5

7 8

model prediction [ms]

measu

red

exe

cuti

on t

ime [

ms]

1.080.67

0.94

0.62

absinthe

64x4x3

64x4x5

20 1

3 4

6

5

78

unfused

tiled

64x64x1

20 1

3 4

6

5

7 8

auto-tuning

64x4x1

64x4x4

model prediction [ms]

measu

red

exe

cuti

on t

ime [

ms]

1.080.730.67

0.94

0.62

0.58-6.5%

20 1

3 4

6

5

78

Michael Baldauf, Axel Seifert, Jochen Förstner, Detlev Majewski, Matthias Raschendorfer, and Thorsten Reinhardt, Operational Convective-Scale Numerical Weather Prediction with the COSMO Model: Description and Sensitivities. 2011.

Page 4: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

4

Stencil Programs Execute Multiple Stencils in Sequence

for (int y = ybeg; y < yend; y++)

for (int x = xbeg; x < xend; x++)

A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);

for (int y = ybeg; y < yend; y++)

for (int x = xbeg; x < xend; x++)

B(x,y) = A(x,y+1) + A(x,y);

• element-wise computation• position independent access pattern

xbeg xend

x

y

yend

ybeg

1 load / 1 store2 loads / 1 store

Page 5: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

5

Loop Tiling and Loop Fusion

for (int idx = 0; idx < 4; ++idx) {

int xbeg = tiles[idx].xbeg;

int xend = tiles[idx].xend;

int ybeg = tiles[idx].ybeg;

int yend = tiles[idx].yend;

Buffer A(xbeg, xend, ybeg, yend+1);

for (int y = ybeg; y < yend+1; ++y)

for (int x = xbeg; x < xend; ++x)

A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);

for (int y = ybeg; y < yend; y++)

for (int x = xbeg; x < xend; x++)

B(x,y) = A(x,y+1) + A(x,y);

}

x

y

idx = 0 idx = 1

xbeg xendxbeg xend

1 load / 0 store0 loads / 1 store

idx = 2 idx = 3

yend

ybeg

yend

ybeg

Page 6: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

6

Architecture Overview

1 model learner

2 optimizer ILP solver

learned parameters

target system

benchmark

3 code generatorcode transformations

fast code

𝑡(𝑝, 𝑏) = 𝑃𝑝 + 𝐵𝑏

Page 7: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

7

Performance Model Ideas

• memory accesses dominate the execution time

scalar peel loops vectorized loop body

• execution time of innermost loop

fast memory (L1 cache) slow memory (L3 cache/DDR)

Page 8: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

8

Performance Model Design

• linear cost functions for peel and body cost

• model the entire program

𝑡 = 𝑃𝑝 + 𝐵𝑏

• slow and fast memory

𝑡 = max 𝑃1𝑝1, 𝑃2𝑝2 +max(𝐵1𝑏1, 𝐵2𝑏2)

𝑡 =

𝑖=0..8

𝑡𝑖

+

𝑃𝑝 𝐵𝑏

+𝑃1𝑝1

𝑃2𝑝2

𝐵2𝑏2

𝐵1𝑏1

0 1 2 3 4 5 6 7 8

Page 9: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

• # cache accesses

• estimated execution time

9

Evaluating the Fast Memory Model

3 loads / 1 store

𝑝𝑓 = 𝑛𝑥(𝐷𝑦 + 𝑒𝑦𝑛𝑦)

𝑏𝑓 = 𝐷𝑥𝐷𝑦 + 𝐷𝑥𝑒𝑦𝑛𝑦

𝑛𝑥 = 2 , 𝑛𝑦 = 2

𝑡 = 𝑃𝑓𝑝𝑓 + 𝐵𝑓𝑏𝑓

learn the model parameters 𝑃𝑓 , 𝐵𝑓

0 1

2 3

𝐷𝑥

𝐷𝑦 𝑒𝑦

𝑒𝑦

+

𝑝𝑓 = 𝑛𝑥𝐷𝑦

𝑏𝑓 = 𝐷𝑥𝐷𝑦

𝑝𝑓 = (3 + 1)𝑛𝑥(𝐷𝑦 + 𝑒𝑦𝑛𝑦)

𝑏𝑓 = (3 + 1)𝐷𝑥𝐷𝑦 + 𝐷𝑥𝑒𝑦𝑛𝑦

Page 10: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

10

Learning the Fast Memory Model

p=12

p=16

p=20

0.00

0.05

0.10

20 40 60 80x

exec

uti

on

tim

e [m

s]

fast memory

𝑃𝑓 , 𝐵𝑓 = argmin(𝑃,𝐵)∈ℝ

𝑟∈[0,𝑅]

𝑃𝑝𝑟 − 𝐵𝑏𝑟 − 𝑡𝑟

=

k

+

k+1

+

k-1

p=12

=

k

+

k+1

+

k-1

p=16

=

k

+

k+1

+

k-1

p=20

output array input array

Page 11: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

11

Linear Multiplication of Bounded Integer Variables

result 0 𝑥

limit range 𝟎 ≤ 𝒑 ≤ 𝒙

force result 𝒑 − 𝑿𝒃 ≤ 𝟎 𝒑 − 𝒙 − 𝑿𝒃 ≥ −𝑿

binary representation 𝑦 =

𝑖=0

⌊log2(𝑌)⌋

2𝑖𝑦𝑖

sum binary products 𝑝 =

𝑖=0

⌊log2(𝑌)⌋

2𝑖𝑥𝑦𝑖

b

p

𝑝 ≤ 𝑥

𝑝 ≥ 0

𝑏 = 0 𝑏 = 1

𝑝 ≤ 0

𝑝 ≥ 𝑥

https://blog.adamfurmanek.pl/2015/09/26/ilp-part-6/

• the binary product 𝑝 = 𝑥𝑏 given the upper bound 𝑋

• the integer product 𝑝 = 𝑥𝑦 given the upper bounds 𝑋 and 𝑌

Page 12: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

12

Comparison to Auto-tuning, Heuristics, Hand-tuned, and Random Variants

absinthe

0.5

0.7

0.9

1.1

1.3

0.5 0.7 0.9 1.1 1.3estimated time [ms]

mea

sure

d t

ime

[ms]

fastwaves

auto-tuning(-6.5%)

min

max(74.0%)

handabsinthe

hand

min

max

auto-tuning(-0.8%)

0.4

0.8

1.2

1.6

0.4 0.8 1.2 1.6estimated time [ms]

mea

sure

d t

ime

[ms]

diffusion

absinthe

hand

min

max

auto-tuning(-3.4%)0.3

0.4

0.5

0.6

0.7

0.8

0.3 0.4 0.5 0.6 0.7 0.8estimated time [ms]

mea

sure

d t

ime

[ms]

advection

Page 13: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

13

Comparison to Halide and Polymage

R. T. Mullapudi, A. Adams, D. Sharlet, J. Ragan-Kelley, and K. Fatahalian, Automatically scheduling halide image processing pipelines. 2016.A. Jangda and U. Bondhugula, An effective fusion and tile size model for optimizing image processing pipelines. 2018.

1.66x

1.29x

1x

0

10

20

fastwaves

2.03x

3.7x

1x

advection

1.4x

1.06x1x

diffusion

exec

uti

on

tim

e [m

s]

Absinthe

Halide

Polymage

Page 14: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

14

Conclusionslearned performance model

integer linear programming

loop fusion and loop tiling

close to auto-tuning

Page 15: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

15

Backup Slides

Page 16: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

16

Model the Space of Possible Code Transformations

stencils

64x4x3 64x4x5

0 1 2 3 4 5 6 7 8

fusion choices 𝑔3 = 0𝑔1 = 0𝑔2 = 0𝑔0 = 0 𝑔4 = 0 𝑔5 = 0 𝑔6 = 1 𝑔7 = 1 𝑔8 = 1

0 ≤ 𝑔𝑖+1 − 𝑔𝑖 ≤ 1 ∀𝑖 ∈ 0,7

Page 17: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

17

Model the Space of Possible Code Transformations

stencils

64x4x3 64x4x5

0 1 2 3 4 5 6 7 8

tile sizes𝑛0𝑥 = 1

𝑛0𝑦= 16

𝑛0𝑧 = 20

…𝑛5𝑥 = 1

𝑛5𝑦= 16

𝑛5𝑧 = 20

𝑛6𝑥 = 1

𝑛6𝑦= 16

𝑛6𝑧 = 12

...

𝑛8𝑥 = 1

𝑛8𝑦= 16

𝑛8𝑧 = 12

1 ≤ 𝑛𝑖𝑥 ≤ 𝐷𝑥, 1 ≤ 𝑛𝑖

𝑦≤ 𝐷𝑦, 1 ≤ 𝑛𝑖

𝑧 ≤ 𝐷𝑧 ∀𝑖 ∈ 0,8

equality constraints

Page 18: TOBIAS GYSI, TOBIAS GROSSER AND TORSTEN HOEFLER Absinthe… · Absinthe Halide Polymage. spcl.inf.ethz.ch @spcl_eth 14 Conclusions learned performance model integer linear programming

spcl.inf.ethz.ch

@spcl_eth

18

Limit the Cache Utilization

𝐹02 = 6 𝐹12 = 5 𝐹22 = 4

𝑓2 ≥ 𝐹22

𝑓2 + 𝐹12 𝑔2 − 𝑔1 ≥ 𝐹12

𝑓2 + 𝐹02 𝑔2 − 𝑔0 ≥ 𝐹02

𝐶𝑛2𝑥𝑛2

𝑦𝑛2𝑧 − 𝑓2 ≥ 0

stencils 0 1 2