automatic performance tuning jeremy johnson dept. of computer science drexel university
TRANSCRIPT
Automatic Performance TuningAutomatic Performance Tuning
Jeremy Johnson
Dept. of Computer Science
Drexel University
OutlineOutline
• Scientific Computation Kernels– Matrix Multiplication– Fast Fourier Transform (FFT)
• Automated Performance Tuning (IEEE Proc. Vol. 93, No. 2, Feb. 2005)
– ATLAS– FFTW– SPIRAL
Matrix Multiplication and the FFTMatrix Multiplication and the FFT
n
kkjikij BAC
1
1
0
1
0
2112
1
0
12
22
12
11
,,
R
lSR
S
l
kl
NS
l
N
l
kl
Nk
xy
llkk
xy
kklklk
Skk
RlSkRSN
Basic Linear Algebra Subprograms (BLAS)Basic Linear Algebra Subprograms (BLAS)
• Level 1 – vector-vector, O(n) data, O(n) operations• Level 2 – matrix-vector, O(n2) data, O(n2) operations• Level 3 – matrix-matrix, O(n2) data, O(n3) operations = data reuse =
locality!
• LAPACK built on top of BLAS (level 3)– Blocking (for the memory hierarchy) is the single most important
optimization for linear algebra algorithms
• GEMM – General Matrix Multiplication
– SUBROUTINE DGEMM (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
– C := alpha*op( A )*op( B ) + beta*C, – where op(X) = X or X’
DGEMMDGEMM
…* Form C := alpha*A*B + beta*C.* DO 90, J = 1, N IF( BETA.EQ.ZERO )THEN DO 50, I = 1, M C( I, J ) = ZERO 50 CONTINUE ELSE IF( BETA.NE.ONE )THEN DO 60, I = 1, M C( I, J ) = BETA*C( I, J ) 60 CONTINUE END IF DO 80, L = 1, K IF( B( L, J ).NE.ZERO )THEN TEMP = ALPHA*B( L, J ) DO 70, I = 1, M C( I, J ) = C( I, J ) + TEMP*A( I, L ) 70 CONTINUE END IF 80 CONTINUE 90 CONTINUE…
Matrix Multiplication PerformanceMatrix Multiplication Performance
Matrix Multiplication PerformanceMatrix Multiplication Performance
Numeric RecipesNumeric Recipes
• Numeric Recipes in C – The Art of Scientific Computing, 2nd Ed.– William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P.
Flannery, Cambridge University Press, 1992.
• “This book is unique, we think, in offering, for each topic considered, a certain amount of general discussion, a certain amount of analytical mathematics, a certain amount of discussion of algorithmics, and (most important) actual implementations of these ideas in the form of working computer routines.
• 1. Preliminarys
• 2. Solutions of Linear Algebraic Equations
• …
• 12. Fast Fourier Transform
• 19. Partial Differential Equations
• 20. Less Numerical Algorithms
four1four1
four1 (cont)four1 (cont)
FFT PerformanceFFT Performance
Atlas Architecture and Search ParametersAtlas Architecture and Search Parameters
• NB – L1 data cache tile size
• NCNB – L1 data cache tile size for non-copying version
• MU, NU – Register tile size
• KU – Unroll factor for k’ loop
• LS – Latency for computation scheduling
• FMA – 1 if fused multiply-add available, 0 otherwise
• FF, IF, NF – Scheduling of loadsYotov et al., Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE, Vol. 93, No. 2, Feb. 2005
ATLAS Code GenerationATLAS Code Generation
• Optimization for locality – Cache tiling, Register tiling
ATLAS Code GenerationATLAS Code Generation
• Register Tiling– MU + NU + MU×NU ≤ NR
• Loop unrolling• Scalar replacement• Add/mul interleaving• Loop skewing
• Ci’’j’’ = Ci’’j’’ + Ai’’k’’*Bk’’j’’
A C
B
NU
MU
K
K
NB
NB
mul1
mul2
…
mulLs
add1
mulLs+1
add2
…
mulMu×Nu
addMu×Nu-Ls+2
…
addMu×Nu
ATLAS SearchATLAS Search
• Estimate Machine Parameters (C1, NR, FMA, LS)
– Used to bound search
• Orthogonal Line Search (fix all parameters except one and search for the optimal value of this parameter)– Search order
• NB
• MU, NU
• KU
• LS
• FF, IF, NF
• NCNB
• Cleanup codes
Using FFTWUsing FFTW
FFTW InfrastructureFFTW Infrastructure
• Use dynamic programming to find an efficient way to combine code sequences.
• Combine code sequences using divide and conquer structure in FFT
• Codelets (optimized code sequences for small FFTs)
• Plan encodes divide and conquer strategy and stores “twiddle factors”
• Executor computes FFT of given data using algorithm described by plan.
15
3 12
4 8
3 5
Right Recursive
SPIRAL SPIRAL systemsystem
DSP transformspecifies
user
goes for a coffee
Formula Generator
SPL Compiler Sea
rch
En
gin
e
runtime on given platform
controlsimplementation options
controlsalgorithm generation
fast algorithmas SPL formula
C/Fortran/SIMDcodeS P
I R
A L
(or an
espres
so fo
r small tran
sfo
rms)
platform-adaptedimplementation
comes back
Mathematic
ian
Expert
Program
mer
DSPDSP Algorithms: Example 4-point DFT Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4):
algorithms reduce arithmetic cost O(n^2)O(nlog(n)) product of structured sparse matrices mathematical notation exhibits structure
1000
0010
0100
0001
1100
1100
0011
0011
000
0100
0010
0001
1010
0101
1010
0101
11
1111
11
1111
iii
ii
4222
42224 )()( LDFTITIDFTDFT
Fourier transform
Identity Permutation
Diagonal matrix (twiddles)
Kronecker product
AlgorithmsAlgorithms = Ruletrees = Formulas = Ruletrees = Formulas)(
8IIDCT
)(4IIDCT
)(4IVDCT
R1)()( 2/2
)(2/
)(2/
)(n
IVn
IIn
IIn IFDCTDCTPDCT
2FR3 R6
2FR4
R3
R1
R6
2F
2FR4
2)(
22
1FDCT II
)(2IIDCT
)(2IIDST
)(2IVDCT
)(2IIDST
)(2IIDCT
R1R6 SDCTPDCT II
nIV
n )()(
)(4IIDCT)(
2IVDCT
GeneratedGenerated DFT Vector Code: Pentium 4, SSE DFT Vector Code: Pentium 4, SSE(P
seu
do
) g
flo
p/s
DFT 2n single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0
n
speedups (to C code) up to factor of 3.1
0
1
2
3
4
5
6
7
4 5 6 7 8 9 10 11 12 13
Spiral SSE
Intel MKL interl.
FFTW 2.1.3
Spiral C
Spiral C vect
SIMD-FFT
hand-tuned vendor assembly code
Best Best DFT Trees, size DFT Trees, size 210 = 1024 = 1024
scalar
C vect
SIMD
Pentium 4float
Pentium 4double
Pentium IIIfloat
AthlonXPfloat
10
6
4
4
10
8
7
5
10
8
6
4
2
2
2
2 2
2 2
2 2
2
2
1
2
2 3
10
8
5
2
2
2 3
10
9
7
5
1
2
22 3
10
6
2
4
2 2
2 2
4
10
4
2
6
2 4
2 2
2
10
5
2
5
2 3 3
10
6
3
4
2 2 3
10
8
5
2
2
2 3
10
6
4
4
2 2
2 2
2
10
5
2
5
2 3 3
trees platform/datatype dependent
CrosstimingCrosstiming of best trees on Pentium 4 of best trees on Pentium 4S
low
do
wn
fac
tor
w.r
.t.
bes
t
DFT 2n single precision, runtime of best found of other platforms
n
software adaptation is necessary
1.00
1.50
2.00
2.50
3.00
3.50
4.00
4.50
5.00
4 5 6 7 8 9 10 11 12 13
Pentium 4 SSE
Pentium 4 SSE2
AthlonXP SSE
PentiumIII SSE
Pentium 4 float