automatic performance tuning jeremy johnson dept. of computer science drexel university

Automatic Performance TuningAutomatic Performance Tuning

Jeremy Johnson

Dept. of Computer Science

Drexel University

OutlineOutline

• Scientific Computation Kernels– Matrix Multiplication– Fast Fourier Transform (FFT)

• Automated Performance Tuning (IEEE Proc. Vol. 93, No. 2, Feb. 2005)

– ATLAS– FFTW– SPIRAL

Matrix Multiplication and the FFTMatrix Multiplication and the FFT

n

kkjikij BAC

1

1

0

1

0

2112

1

0

12

22

12

11

,,

R

lSR

S

l

kl

NS

l

N

l

kl

Nk

xy

llkk

xy

kklklk

Skk

RlSkRSN

Basic Linear Algebra Subprograms (BLAS)Basic Linear Algebra Subprograms (BLAS)

• Level 1 – vector-vector, O(n) data, O(n) operations• Level 2 – matrix-vector, O(n2) data, O(n2) operations• Level 3 – matrix-matrix, O(n2) data, O(n3) operations = data reuse =

locality!

• LAPACK built on top of BLAS (level 3)– Blocking (for the memory hierarchy) is the single most important

optimization for linear algebra algorithms

• GEMM – General Matrix Multiplication

– SUBROUTINE DGEMM (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )

– C := alpha*op( A )*op( B ) + beta*C, – where op(X) = X or X’

DGEMMDGEMM

…* Form C := alpha*A*B + beta*C.* DO 90, J = 1, N IF( BETA.EQ.ZERO )THEN DO 50, I = 1, M C( I, J ) = ZERO 50 CONTINUE ELSE IF( BETA.NE.ONE )THEN DO 60, I = 1, M C( I, J ) = BETA*C( I, J ) 60 CONTINUE END IF DO 80, L = 1, K IF( B( L, J ).NE.ZERO )THEN TEMP = ALPHA*B( L, J ) DO 70, I = 1, M C( I, J ) = C( I, J ) + TEMP*A( I, L ) 70 CONTINUE END IF 80 CONTINUE 90 CONTINUE…

Matrix Multiplication PerformanceMatrix Multiplication Performance

Numeric RecipesNumeric Recipes

• Numeric Recipes in C – The Art of Scientific Computing, 2nd Ed.– William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P.

Flannery, Cambridge University Press, 1992.

• “This book is unique, we think, in offering, for each topic considered, a certain amount of general discussion, a certain amount of analytical mathematics, a certain amount of discussion of algorithmics, and (most important) actual implementations of these ideas in the form of working computer routines.

• 1. Preliminarys

• 2. Solutions of Linear Algebraic Equations

• …

• 12. Fast Fourier Transform

• 19. Partial Differential Equations

• 20. Less Numerical Algorithms

four1four1

four1 (cont)four1 (cont)

FFT PerformanceFFT Performance

Atlas Architecture and Search ParametersAtlas Architecture and Search Parameters

• NB – L1 data cache tile size

• NCNB – L1 data cache tile size for non-copying version

• MU, NU – Register tile size

• KU – Unroll factor for k’ loop

• LS – Latency for computation scheduling

• FMA – 1 if fused multiply-add available, 0 otherwise

• FF, IF, NF – Scheduling of loadsYotov et al., Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE, Vol. 93, No. 2, Feb. 2005

ATLAS Code GenerationATLAS Code Generation

• Optimization for locality – Cache tiling, Register tiling

ATLAS Code GenerationATLAS Code Generation

• Register Tiling– MU + NU + MU×NU ≤ NR

• Loop unrolling• Scalar replacement• Add/mul interleaving• Loop skewing

• Ci’’j’’ = Ci’’j’’ + Ai’’k’’*Bk’’j’’

A C

B

NU

MU

K

K

NB

NB

mul1

mul2

…

mulLs

add1

mulLs+1

add2

…

mulMu×Nu

addMu×Nu-Ls+2

…

addMu×Nu

ATLAS SearchATLAS Search

• Estimate Machine Parameters (C1, NR, FMA, LS)

– Used to bound search

• Orthogonal Line Search (fix all parameters except one and search for the optimal value of this parameter)– Search order

• NB

• MU, NU

• KU

• LS

• FF, IF, NF

• NCNB

• Cleanup codes

Using FFTWUsing FFTW

FFTW InfrastructureFFTW Infrastructure

• Use dynamic programming to find an efficient way to combine code sequences.

• Combine code sequences using divide and conquer structure in FFT

• Codelets (optimized code sequences for small FFTs)

• Plan encodes divide and conquer strategy and stores “twiddle factors”

• Executor computes FFT of given data using algorithm described by plan.

15

3 12

4 8

3 5

Right Recursive

SPIRAL SPIRAL systemsystem

DSP transformspecifies

user

goes for a coffee

Formula Generator

SPL Compiler Sea

rch

En

gin

e

runtime on given platform

controlsimplementation options

controlsalgorithm generation

fast algorithmas SPL formula

C/Fortran/SIMDcodeS P

I R

A L

(or an

espres

so fo

r small tran

sfo

rms)

platform-adaptedimplementation

comes back

Mathematic

ian

Expert

Program

mer

DSPDSP Algorithms: Example 4-point DFT Algorithms: Example 4-point DFT

Cooley/Tukey FFT (size 4):

algorithms reduce arithmetic cost O(n^2)O(nlog(n)) product of structured sparse matrices mathematical notation exhibits structure

1000

0010

0100

0001

1100

1100

0011

0011

000

0100

0010

0001

1010

0101

1010

0101

11

1111

11

1111

iii

ii

4222

42224 )()( LDFTITIDFTDFT

Fourier transform

Identity Permutation

Diagonal matrix (twiddles)

Kronecker product

AlgorithmsAlgorithms = Ruletrees = Formulas = Ruletrees = Formulas)(

8IIDCT

)(4IIDCT

)(4IVDCT

R1)()( 2/2

)(2/

)(2/

)(n

IVn

IIn

IIn IFDCTDCTPDCT

2FR3 R6

2FR4

R3

R1

R6

2F

2FR4

2)(

22

1FDCT II

)(2IIDCT

)(2IIDST

)(2IVDCT

)(2IIDST

)(2IIDCT

R1R6 SDCTPDCT II

nIV

n )()(

)(4IIDCT)(

2IVDCT

GeneratedGenerated DFT Vector Code: Pentium 4, SSE DFT Vector Code: Pentium 4, SSE(P

seu

do

) g

flo

p/s

DFT 2n single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0

n

speedups (to C code) up to factor of 3.1

0

1

2

3

4

5

6

7

4 5 6 7 8 9 10 11 12 13

Spiral SSE

Intel MKL interl.

FFTW 2.1.3

Spiral C

Spiral C vect

SIMD-FFT

hand-tuned vendor assembly code

Best Best DFT Trees, size DFT Trees, size 210 = 1024 = 1024

scalar

C vect

SIMD

Pentium 4float

Pentium 4double

Pentium IIIfloat

AthlonXPfloat

10

6

4

4

10

8

7

5

10

8

6

4

2

2

2

2 2

2 2

2 2

2

2

1

2

2 3

10

8

5

2

2

2 3

10

9

7

5

1

2

22 3

10

6

2

4

2 2

2 2

4

10

4

2

6

2 4

2 2

2

10

5

2

5

2 3 3

10

6

3

4

2 2 3

10

8

5

2

2

2 3

10

6

4

4

2 2

2 2

2

10

5

2

5

2 3 3

trees platform/datatype dependent

CrosstimingCrosstiming of best trees on Pentium 4 of best trees on Pentium 4S

low

do

wn

fac

tor

w.r

.t.

bes

t

DFT 2n single precision, runtime of best found of other platforms

n

software adaptation is necessary

1.00

1.50

2.00

2.50

3.00

3.50

4.00

4.50

5.00

4 5 6 7 8 9 10 11 12 13

Pentium 4 SSE

Pentium 4 SSE2

AthlonXP SSE

PentiumIII SSE

Pentium 4 float

automatic performance tuning jeremy johnson dept. of computer science drexel university

Documents