spcl.inf.ethz.ch
@spcl_eth
TOBIAS GYSI, TOBIAS GROSSER, AND TORSTEN HOEFLER
Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot
COSMO Atmospheric Model
• Regional atmospheric model used by 7 national weather services
• Implements many different stencil programs
Optimizing the Fastwaves Kernel from the COSMO Atmospheric Model
Figure: measured execution time [ms] against model prediction [ms] for three variants of the fastwaves kernel, each drawn with its grouping of stencils 0 to 8.
• unfused, tiled 64x64x1
• absinthe, tiles 64x4x3 and 64x4x5
• auto-tuning, tiles 64x4x1 and 64x4x4
Values on the model prediction axis: 1.08, 0.73, 0.67. Values on the measured axis: 0.94, 0.62, 0.58. The auto-tuning point is annotated -6.5%.
Michael Baldauf, Axel Seifert, Jochen Förstner, Detlev Majewski, Matthias Raschendorfer, and Thorsten Reinhardt, Operational Convective-Scale Numerical Weather Prediction with the COSMO Model: Description and Sensitivities. 2011.
Stencil Programs Execute Multiple Stencils in Sequence
// first stencil: A sums three x-neighbors of I
for (int y = ybeg; y < yend; y++)
  for (int x = xbeg; x < xend; x++)
    A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);
// second stencil: B consumes A at y and y+1
for (int y = ybeg; y < yend; y++)
  for (int x = xbeg; x < xend; x++)
    B(x,y) = A(x,y+1) + A(x,y);
• element-wise computation
• position independent access pattern
Figure: iteration domain from xbeg to xend in x and from ybeg to yend in y; the two stencils are annotated 1 load / 1 store and 2 loads / 1 store.
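The two loop nests above can be made runnable with a small array type. The following is a minimal sketch, not the COSMO implementation: the `Grid` struct, its one-cell zero halo, and the function name `run_unfused` are assumptions introduced here so that the accesses at x-1, x+1, and y+1 stay in bounds.

```cpp
#include <vector>

// Hypothetical 2D array with a one-cell zero halo around [0, nx) x [0, ny).
struct Grid {
    int nx, ny;
    std::vector<double> data;
    Grid(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_ + 2) * (ny_ + 2), 0.0) {}
    double& operator()(int x, int y) { return data[(y + 1) * (nx + 2) + (x + 1)]; }
};

// Executes the two stencils in sequence, as on the slide, and returns B.
Grid run_unfused(Grid& I) {
    Grid A(I.nx, I.ny), B(I.nx, I.ny);
    // First stencil: the whole array A is written before B is started.
    for (int y = 0; y < I.ny; y++)
        for (int x = 0; x < I.nx; x++)
            A(x, y) = I(x, y) + I(x - 1, y) + I(x + 1, y);
    // Second stencil: reads A at y and y+1 (the row y = ny is the zero halo here).
    for (int y = 0; y < I.ny; y++)
        for (int x = 0; x < I.nx; x++)
            B(x, y) = A(x, y + 1) + A(x, y);
    return B;
}
```

Because A is fully materialized between the two loop nests, every value of A travels through memory once as a store and again as a load.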
Loop Tiling and Loop Fusion
for (int idx = 0; idx < 4; ++idx) {
  // bounds of the current tile
  int xbeg = tiles[idx].xbeg;
  int xend = tiles[idx].xend;
  int ybeg = tiles[idx].ybeg;
  int yend = tiles[idx].yend;
  // tile-local buffer for A with one extra row in y
  Buffer A(xbeg, xend, ybeg, yend+1);
  // first stencil, including the extra row read by the second stencil
  for (int y = ybeg; y < yend+1; ++y)
    for (int x = xbeg; x < xend; ++x)
      A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);
  // second stencil, fused into the same tile
  for (int y = ybeg; y < yend; y++)
    for (int x = xbeg; x < xend; x++)
      B(x,y) = A(x,y+1) + A(x,y);
}
Figure: the domain split into four tiles, idx = 0 to idx = 3, each with its own xbeg, xend, ybeg, and yend; the two stencils are annotated 1 load / 0 store and 0 loads / 1 store.
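To check that the fused and tiled loop nest computes the same result as the unfused one, both can be run on a small grid. This is a sketch under stated assumptions: `Grid`, `Tile`, `run_unfused`, and `run_fused` are hypothetical names, the halo of `I` is zero, and the unfused reference computes A for one extra row so that both versions read the same values at y+1.

```cpp
#include <vector>

// Hypothetical 2D array with a one-cell zero halo around [0, nx) x [0, ny).
struct Grid {
    int nx, ny;
    std::vector<double> data;
    Grid(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_ + 2) * (ny_ + 2), 0.0) {}
    double& operator()(int x, int y) { return data[(y + 1) * (nx + 2) + (x + 1)]; }
};

struct Tile { int xbeg, xend, ybeg, yend; };

// Reference: two separate loop nests over the full domain.
Grid run_unfused(Grid& I) {
    Grid A(I.nx, I.ny), B(I.nx, I.ny);
    for (int y = 0; y < I.ny + 1; ++y)  // one extra row, read by B at y+1
        for (int x = 0; x < I.nx; ++x)
            A(x, y) = I(x, y) + I(x - 1, y) + I(x + 1, y);
    for (int y = 0; y < I.ny; ++y)
        for (int x = 0; x < I.nx; ++x)
            B(x, y) = A(x, y + 1) + A(x, y);
    return B;
}

// Fused and tiled: A lives in a small per-tile buffer and never reaches main memory.
Grid run_fused(Grid& I, const std::vector<Tile>& tiles) {
    Grid B(I.nx, I.ny);
    for (const Tile& t : tiles) {
        int w = t.xend - t.xbeg;
        int h = t.yend - t.ybeg + 1;  // extra row: redundant computation at the tile edge
        std::vector<double> A(w * h);
        for (int y = t.ybeg; y < t.yend + 1; ++y)
            for (int x = t.xbeg; x < t.xend; ++x)
                A[(y - t.ybeg) * w + (x - t.xbeg)] = I(x, y) + I(x - 1, y) + I(x + 1, y);
        for (int y = t.ybeg; y < t.yend; ++y)
            for (int x = t.xbeg; x < t.xend; ++x)
                B(x, y) = A[(y + 1 - t.ybeg) * w + (x - t.xbeg)]
                        + A[(y - t.ybeg) * w + (x - t.xbeg)];
    }
    return B;
}
```

The extra row of A per tile is the price of fusion: neighboring tiles in y both compute the row they share, trading redundant arithmetic for fewer memory accesses.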
Architecture Overview
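1 model learner: runs a benchmark on the target system and produces the learned parameters of $t(p, b) = P p + B b$
2 optimizer: an ILP solver uses the learned parameters to select the code transformations
3 code generator: applies the code transformations and emits fast code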
Performance Model Ideas
• memory accesses dominate the execution time: fast memory (L1 cache) and slow memory (L3 cache/DDR)
• execution time of innermost loop: scalar peel loops plus vectorized loop body
Performance Model Design
• linear cost functions for peel and body cost
  $t = P p + B b$
• slow and fast memory
  $t = \max(P_1 p_1, P_2 p_2) + \max(B_1 b_1, B_2 b_2)$
• model the entire program
  $t = \sum_{i=0}^{8} t_i$
Figure: stencils 0 to 8, each contributing a peel term and a body term to the total.
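The three formulas above compose into a small evaluation routine. The sketch below assumes that the subscripts 1 and 2 denote the two memory levels and that access counts and learned parameters are supplied by the caller; the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Peel (p) and body (b) access counts of one stencil for memory levels 1 and 2.
struct Accesses { double p1, b1, p2, b2; };

// Learned cost per peel (P) and body (B) access for memory levels 1 and 2.
struct Params { double P1, B1, P2, B2; };

// t = max(P1 p1, P2 p2) + max(B1 b1, B2 b2): the more expensive memory level
// determines the peel cost and the body cost separately.
double stencil_time(const Accesses& a, const Params& m) {
    return std::max(m.P1 * a.p1, m.P2 * a.p2) + std::max(m.B1 * a.b1, m.B2 * a.b2);
}

// t = sum over all stencils of the program.
double program_time(const std::vector<Accesses>& stencils, const Params& m) {
    double t = 0.0;
    for (const Accesses& a : stencils) t += stencil_time(a, m);
    return t;
}
```

Taking the maximum rather than the sum reflects the idea that accesses to the two memory levels overlap, so only the slower of the two limits the loop.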
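Evaluating the Fast Memory Model
• # cache accesses
• estimated execution time
Example: a stencil with 3 loads / 1 store on a domain of size $D^x \times D^y$, split into tiles 0 to 3 with $n^x = 2$, $n^y = 2$, and $e^y$ extra halo rows per tile.
Peel and body accesses without halo:
$p_f = n^x D^y$
$b_f = D^x D^y$
With the halo rows:
$p_f = n^x (D^y + e^y n^y)$
$b_f = D^x D^y + D^x e^y n^y$
With 3 loads and 1 store per point:
$p_f = (3 + 1)\, n^x (D^y + e^y n^y)$
$b_f = (3 + 1)(D^x D^y + D^x e^y n^y)$
Estimated execution time:
$t = P_f p_f + B_f b_f$
learn the model parameters $P_f$, $B_f$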
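The access counts can be evaluated directly from the domain size, the halo, and the number of tiles. The sketch below assumes that the factor (3 + 1) applies to the whole peel and body terms, and that the halo adds e_y rows for each of the n_y tiles; the function names and the parameter values in the usage note are hypothetical.

```cpp
// Peel and body accesses to fast memory for one stencil.
struct FastCounts { long p; long b; };

// accesses: loads plus stores per point (3 + 1 on the slide)
// Dx, Dy:   domain size, ey: halo rows per tile, nx, ny: number of tiles
FastCounts fast_counts(long accesses, long Dx, long Dy, long ey, long nx, long ny) {
    long rows = Dy + ey * ny;             // domain rows plus one halo per tile in y
    return { accesses * nx * rows,        // p_f: one peel per tile in x and per row
             accesses * Dx * rows };      // b_f: Dx body points per row
}

// t = P_f p_f + B_f b_f with learned parameters Pf and Bf.
double fast_time(const FastCounts& c, double Pf, double Bf) {
    return Pf * c.p + Bf * c.b;
}
```

For example, `fast_counts(4, 64, 64, 1, 2, 2)` covers 66 rows and yields 528 peel and 16896 body accesses.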
Learning the Fast Memory Model
Figure: execution time [ms] against x (20 to 80) for fast memory, with one line each for p=12, p=16, and p=20. The benchmark stencils write an output array at k from an input array read at k, k+1, and k-1.
$(P_f, B_f) = \arg\min_{P, B \in \mathbb{R}} \sum_{r \in [0, R]} \lvert P p_r + B b_r - t_r \rvert$
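A fit of this kind can be sketched in a few lines, assuming the objective is the sum of absolute residuals of the model t = P p + B b over the benchmark runs. The code below is not the authors' solver: it relies on the known property that a least-absolute-deviations fit with two parameters has an optimum that reproduces at least two samples exactly, and simply tries every pair. The names `Sample`, `Fit`, and `fit_lad` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One benchmark run: peel accesses, body accesses, measured time.
struct Sample { double p, b, t; };
struct Fit { double P, B; };

// Sum of absolute residuals of the model P p + B b against the measurements.
double total_abs_error(const std::vector<Sample>& s, double P, double B) {
    double e = 0.0;
    for (const Sample& r : s) e += std::fabs(P * r.p + B * r.b - r.t);
    return e;
}

// Brute-force fit: solve the 2x2 system for every pair of samples
// (Cramer's rule) and keep the candidate with the smallest total error.
Fit fit_lad(const std::vector<Sample>& s) {
    Fit best{0.0, 0.0};
    double best_err = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < s.size(); ++i) {
        for (std::size_t j = i + 1; j < s.size(); ++j) {
            double det = s[i].p * s[j].b - s[j].p * s[i].b;
            if (std::fabs(det) < 1e-12) continue;  // pair does not determine P and B
            double P = (s[i].t * s[j].b - s[j].t * s[i].b) / det;
            double B = (s[i].p * s[j].t - s[j].p * s[i].t) / det;
            double err = total_abs_error(s, P, B);
            if (err < best_err) { best_err = err; best = {P, B}; }
        }
    }
    return best;
}
```

The pairwise search costs cubic time in the number of runs, which is acceptable for a handful of benchmark configurations; a linear programming formulation would scale better.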
Linear Multiplication of Bounded Integer Variables
• the binary product $p = x b$ given the upper bound $X$
    limit range: $0 \le p \le x$
    force result: $p - X b \le 0$, $p - x - X b \ge -X$
    result: $b = 0$ leaves $p \ge 0$ and $p \le 0$, so $p = 0$; $b = 1$ leaves $p \le x$ and $p \ge x$, so $p = x$
• the integer product $p = x y$ given the upper bounds $X$ and $Y$
    binary representation: $y = \sum_{i=0}^{\lfloor \log_2(Y) \rfloor} 2^i y_i$
    sum binary products: $p = \sum_{i=0}^{\lfloor \log_2(Y) \rfloor} 2^i x y_i$
https://blog.adamfurmanek.pl/2015/09/26/ilp-part-6/
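The linearization can be checked by brute force for small bounds. The sketch below tests the four linear constraints for the binary product and rebuilds the integer product from the bits of y; it only verifies the constraints and is not an ILP formulation for any particular solver. The function names are hypothetical.

```cpp
// True if p satisfies the linear constraints that encode p = x * b,
// where 0 <= x <= X and b is either 0 or 1.
bool binary_product_ok(int p, int x, int b, int X) {
    return 0 <= p && p <= x         // limit range
        && p - X * b <= 0           // b = 0 forces p <= 0
        && p - x - X * b >= -X;     // b = 1 forces p >= x
}

// Integer product p = x * y built from binary products x * y_i,
// where y_i are the bits of y and 0 <= y <= Y.
int integer_product(int x, int y, int Y) {
    int p = 0;
    for (int i = 0; (1 << i) <= Y; ++i) {  // i = 0 .. floor(log2(Y))
        int yi = (y >> i) & 1;             // i-th bit of y
        p += (1 << i) * (x * yi);          // each x * y_i is one binary product
    }
    return p;
}
```

Each term x * y_i is a product of a bounded integer and a binary variable, so the integer product needs one set of binary-product constraints per bit of Y.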
Comparison to Auto-tuning, Heuristics, Hand-tuned, and Random Variants
Figure: measured time [ms] against estimated time [ms] for three kernels. Each panel marks the absinthe, hand, min, max, and auto-tuning variants.
• fastwaves: auto-tuning (-6.5%), max (74.0%)
• diffusion: auto-tuning (-0.8%)
• advection: auto-tuning (-3.4%)
Comparison to Halide and Polymage
R. T. Mullapudi, A. Adams, D. Sharlet, J. Ragan-Kelley, and K. Fatahalian, Automatically scheduling Halide image processing pipelines. 2016.
A. Jangda and U. Bondhugula, An effective fusion and tile size model for optimizing image processing pipelines. 2018.
Figure: execution time [ms] of Absinthe, Halide, and Polymage for three kernels, with bars labeled relative to 1x.
• fastwaves: 1.66x, 1.29x, 1x
• advection: 2.03x, 3.7x, 1x
• diffusion: 1.4x, 1.06x, 1x
Conclusions
• learned performance model
• integer linear programming
• loop fusion and loop tiling
• close to auto-tuning
Backup Slides
Model the Space of Possible Code Transformations
stencils 0 1 2 3 4 5 6 7 8
fusion choices
$g_0 = 0$, $g_1 = 0$, $g_2 = 0$, $g_3 = 0$, $g_4 = 0$, $g_5 = 0$ (64x4x3)
$g_6 = 1$, $g_7 = 1$, $g_8 = 1$ (64x4x5)
$0 \le g_{i+1} - g_i \le 1 \quad \forall i \in [0, 7]$
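The constraint on consecutive fusion variables is easy to test in code. The sketch below reads g_i as the group index of stencil i, with stencils that share a value being fused; this reading and the function names are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// True if 0 <= g[i+1] - g[i] <= 1 holds for all consecutive stencils,
// so that group indices never decrease and never skip a value.
bool valid_fusion(const std::vector<int>& g) {
    for (std::size_t i = 0; i + 1 < g.size(); ++i) {
        int d = g[i + 1] - g[i];
        if (d < 0 || d > 1) return false;
    }
    return true;
}

// Number of fused groups of a valid assignment: since the indices grow in
// steps of at most one, it is the index range plus one.
int group_count(const std::vector<int>& g) {
    return g.empty() ? 0 : g.back() - g.front() + 1;
}
```

With the values from the slide, `{0, 0, 0, 0, 0, 0, 1, 1, 1}`, the assignment is valid and describes two groups: stencils 0 to 5 and stencils 6 to 8.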
Model the Space of Possible Code Transformations
stencils 0 1 2 3 4 5 6 7 8
64x4x3 64x4x5
tile sizes
$n_0^x = 1$, $n_0^y = 16$, $n_0^z = 20$
…
$n_5^x = 1$, $n_5^y = 16$, $n_5^z = 20$
$n_6^x = 1$, $n_6^y = 16$, $n_6^z = 12$
...
$n_8^x = 1$, $n_8^y = 16$, $n_8^z = 12$
$1 \le n_i^x \le D^x$, $1 \le n_i^y \le D^y$, $1 \le n_i^z \le D^z \quad \forall i \in [0, 8]$
equality constraints
Limit the Cache Utilization
stencils 0 1 2
$F_{0,2} = 6$, $F_{1,2} = 5$, $F_{2,2} = 4$
$f_2 \ge F_{2,2}$
$f_2 + F_{1,2}(g_2 - g_1) \ge F_{1,2}$
$f_2 + F_{0,2}(g_2 - g_0) \ge F_{0,2}$
$C n_2^x n_2^y n_2^z - f_2 \ge 0$
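The footprint constraints can be evaluated for a given fusion choice. The sketch below assumes that F_{j,k} is the footprint of stencil k when it is fused with stencils j to k, that g holds the fusion variables, and that C is the cache capacity; these readings and the function names are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Smallest f_k that satisfies
//   f_k >= F[k]  and  f_k + F[j] * (g[k] - g[j]) >= F[j]  for all j < k,
// where F[j] stands for F_{j,k} and k is the last index of F.
long min_footprint(const std::vector<long>& F, const std::vector<int>& g) {
    std::size_t k = F.size() - 1;
    long f = F[k];
    for (std::size_t j = 0; j < k; ++j) {
        // If stencil j is in the same group (g[k] == g[j]) the bound is F[j];
        // otherwise the term F[j] * (g[k] - g[j]) makes the constraint inactive.
        f = std::max(f, F[j] - F[j] * (g[k] - g[j]));
    }
    return f;
}

// Cache constraint from the slide: C * nx * ny * nz - f >= 0.
bool fits_cache(long f, long C, long nx, long ny, long nz) {
    return C * nx * ny * nz - f >= 0;
}
```

With the values from the slide, fusing all three stencils gives a footprint of 6, fusing only stencils 1 and 2 gives 5, and leaving stencil 2 unfused gives 4.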