Carnegie Mellon
Parallelism in Spiral
This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel
Franz Franchetti
Electrical and Computer Engineering, Carnegie Mellon University
Joint work with Yevgen Voronenko and Markus Püschel
… and the Spiral team (only part shown)
The Problem
Figure: Discrete Fourier Transform (single precision) on a 3 GHz 2 x Core2 Extreme; performance [Gflop/s] (0 to 26) vs. problem size (16 to 262144). The best code is about 30x faster than Numerical Recipes.
What’s going on?
Automatic Performance Tuning
Current vicious circle: whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized
BLAS: ATLAS
Linear algebra: Bebop, Spike, Flame
Sorting
Fourier transform: FFTW
Linear transforms: Spiral
… others
New compiler techniques
Proceedings of the IEEE special issue, Feb. 2005
But what about parallelism … ?
Vision Behind Spiral
Current: the path from numerical problem to computing platform goes through algorithm selection, implementation (C program), and compilation; algorithm selection and implementation are human effort, only compilation is automated.
Future: algorithm selection, implementation, and compilation are all automated.
C code is a singularity: the compiler has no access to high-level information.
Challenge: conquer the high abstraction level for complete automation
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
Spiral
Library generator for linear transforms (DFT, DCT, DWT, filters, …) and recently more …
Wide range of platforms supported: scalar, fixed point, vector, parallel, Verilog, GPU
Research goal: “Teach” computers to write fast libraries
Complete automation of implementation and optimization
Conquer the “high” algorithm level for automation
When a new platform comes out: Regenerate a retuned library
When a new platform paradigm comes out (e.g., vector or CMPs):Update the tool rather than rewriting the library
Intel has started to use Spiral to generate parts of their MKL library
How Spiral Works
Algorithm Generation, Algorithm Optimization
Implementation, Code Optimization
Compilation, Compiler Optimizations
algorithm
C code
performance
Problem specification (transform)
Fast executable
Search controls the algorithm and implementation choices
Spiral
Spiral: Complete automation of the implementation and optimization task
Basic idea: declarative representation of algorithms
Rewriting systems to generate and optimize algorithms
What is a (Linear) Transform?
Mathematically: matrix-vector multiplication
Example: Discrete Fourier transform (DFT)
output vector (signal) = transform matrix · input vector (signal)
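As an illustration (a minimal numpy sketch, not Spiral code; `dft_matrix` is a name chosen here), a transform is just a matrix applied to the input vector:

```python
import numpy as np

def dft_matrix(n):
    # DFT_n = [w^(k*l)] with w = exp(-2*pi*j/n)
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

x = np.arange(4, dtype=complex)        # input vector (signal)
y = dft_matrix(4) @ x                  # output vector = transform matrix * input
assert np.allclose(y, np.fft.fft(x))   # agrees with the library FFT
```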
Transform Algorithms: Example 4-point FFT
Cooley/Tukey fast Fourier transform (FFT):
Algorithms reduce arithmetic cost from O(n²) to O(n log n)
Product of structured sparse matrices
Mathematical notation exhibits structure: SPL (signal processing language)
DFT_4 = [1  1  1  1;  1 -j -1  j;  1 -1  1 -1;  1  j -1 -j]
      = (DFT_2 ⊗ I_2) · diag(1, 1, 1, -j) · (I_2 ⊗ DFT_2) · L^4_2
where DFT_2 = [1 1; 1 -1] is the 2-point Fourier transform, I_2 the identity, L^4_2 the stride-2 permutation, diag(1, 1, 1, -j) the diagonal matrix of twiddles, and ⊗ the Kronecker product (convention: ω = e^(-2πj/4) = -j).
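The factorization can be checked numerically; this sketch uses the convention ω = e^(-2πj/n) (the signs of the twiddles depend on the DFT convention):

```python
import numpy as np

DFT2 = np.array([[1, 1], [1, -1]], dtype=complex)
I2 = np.eye(2)
L = np.eye(4)[[0, 2, 1, 3]]          # L^4_2: stride-2 permutation
T = np.diag([1, 1, 1, -1j])          # twiddle diagonal, w = exp(-2*pi*j/4) = -j

# Cooley/Tukey factorization of the 4-point DFT
DFT4 = np.kron(DFT2, I2) @ T @ np.kron(I2, DFT2) @ L

# direct definition of DFT_4 for comparison
k = np.arange(4)
direct = np.exp(-2j * np.pi * np.outer(k, k) / 4)
assert np.allclose(DFT4, direct)
```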
Examples: Transforms
Spiral currently contains 55 transforms
Examples: Breakdown Rules (currently ≈220)
Base case rules
“Teaches” Spiral about existing algorithm knowledge (~200 journal papers)
Includes many new ones (algebraic theory; Püschel, Moura, Voronenko)
SPL to Sequential Code
Example: tensor product
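For example, y = (I_m ⊗ A_n) x becomes a loop over m independent, contiguous blocks (a Python sketch; the function name is chosen here for illustration):

```python
import numpy as np

def apply_I_kron_A(A, x, m):
    # y = (I_m (x) A) x: apply A to m contiguous blocks of x
    n = A.shape[0]
    y = np.empty(m * n, dtype=complex)
    for i in range(m):
        y[i*n:(i+1)*n] = A @ x[i*n:(i+1)*n]   # block i, independent of the rest
    return y

A = np.array([[1, 1], [1, -1]], dtype=complex)    # DFT_2
x = np.arange(8, dtype=complex)
assert np.allclose(apply_I_kron_A(A, x, 4), np.kron(np.eye(4), A) @ x)
```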
Program Generation in Spiral (Sketched)
Transform: user specified
Fast algorithm in SPL: many choices
∑-SPL
C code
Iteration of this process to search for the fastest
But that’s not all …
Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling, …
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
SPL to Shared Memory Code: Basic Idea
Governing construct: tensor product
Independent operation, load-balanced
Diagram: y = (I_4 ⊗ A) x, i.e., four independent copies of A applied to contiguous blocks of x, one per processor (processors 0 to 3).
Problematic construct: permutations produce false sharing
Task: Rewrite formulas to extract tensor product + keep contiguous blocks
[SC 06]
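A sketch of why I_p ⊗ A is the “good” shared-memory form (illustrative Python threads, not Spiral’s generated code): each of the p workers touches only its own contiguous block, so the work is load-balanced and no cache line is written by two processors:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_I_kron_A(A, x, p):
    # y = (I_p (x) A) x: one independent, contiguous block per processor
    n = A.shape[0]
    y = np.empty(p * n, dtype=complex)
    def work(i):
        y[i*n:(i+1)*n] = A @ x[i*n:(i+1)*n]   # writes stay inside block i
    with ThreadPoolExecutor(max_workers=p) as pool:
        list(pool.map(work, range(p)))
    return y

A = np.array([[1, 1], [1, -1]], dtype=complex)
x = np.arange(8, dtype=complex)
assert np.allclose(parallel_I_kron_A(A, x, 4), np.kron(np.eye(4), A) @ x)
```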
Step 1: Shared Memory Tags
Identify crucial hardware parameters:
Number of processors: p
Cache line size: μ
Introduce them as tags in SPL
This means: formula A is to be optimized for p processors and cache line size μ
Step 2: Identify “Good” Formulas
Load-balanced, avoiding false sharing
Tagged operators (no further rewriting necessary)
Definition: A formula is fully optimized if it is one of the above or of the form
where A and B are fully optimized.
Step 3: Identify Rewriting Rules
Goal: transform formulas into fully optimized formulas
Formulas are rewritten, tags propagated
There may be choices
Simple Rewriting Example
fully optimized
Loop splitting + loop exchange
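In loop terms, such a rewrite splits the m-iteration loop of I_m ⊗ A into an outer p-iteration (parallel) loop and an inner m/p-iteration loop, i.e. I_m ⊗ A → I_p ⊗ (I_{m/p} ⊗ A). A Python sketch of the split loops (assuming p divides m; names are chosen here for illustration):

```python
import numpy as np

def split_loops(A, x, m, p):
    # I_m (x) A  ->  I_p (x) (I_{m/p} (x) A): loop splitting
    n = A.shape[0]
    y = np.empty(m * n, dtype=complex)
    chunk = m // p                     # assumes p divides m
    for i in range(p):                 # outer loop: one iteration per processor
        for k in range(chunk):         # inner loop: sequential on that processor
            b = (i * chunk + k) * n
            y[b:b+n] = A @ x[b:b+n]
    return y

A = np.array([[1, 1], [1, -1]], dtype=complex)
x = np.arange(16, dtype=complex)
assert np.allclose(split_loops(A, x, 8, 4), np.kron(np.eye(8), A) @ x)
```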
Parallelization by Rewriting
Fully optimized (load-balanced, no false sharing) in the sense of our definition
Same Approach for Other Parallel Paradigms
Vectorization: [IPDPS 02, VecPar 06]
Message passing (MPI): [ISPA 06], with Bonelli, Lorenz, Ueberhuber, TU Vienna
Cg/OpenGL for GPUs: with Shen, TU Denmark
Verilog for FPGAs: [DAC 05], with Milder, Hoe, CMU
Going Beyond Transforms
Transform = linear operator with one vector input and one vector output
Key ideas:
Generalize to (possibly nonlinear) operators with several inputs and several outputs
Generalize SPL (including the tensor product) to OL (operator language)
Cooley-Tukey FFT in OL:
Viterbi in OL:
Mat-Mat-Mult:
OL Rewriting Rules
SPL rules reused
Only a few OL-specific rules required
New OL rules
Example: Viterbi Decoder in OL
Viterbi decoder (forward part) as operator
Viterbi kernel (butterfly)
Viterbi algorithm as breakdown rule
First non-transform supported by Spiral
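For reference, the forward-pass butterfly is an add-compare-select step; a scalar Python sketch (variable names and metric layout are chosen here for illustration, not Spiral’s kernel):

```python
def viterbi_butterfly(m0, m1, b0, b1):
    # Add-compare-select: predecessor path metrics m0, m1 and
    # branch metrics b0, b1 produce two successor metrics plus
    # decision bits for the later traceback.
    s0 = max(m0 + b0, m1 + b1)
    s1 = max(m0 + b1, m1 + b0)
    d0 = m0 + b0 < m1 + b1          # which predecessor won for s0
    d1 = m0 + b1 < m1 + b0          # which predecessor won for s1
    return s0, s1, d0, d1

assert viterbi_butterfly(0, 2, 1, 0) == (2, 3, True, True)
```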
Viterbi: Vectorization Through Rewriting
Sufficient to vectorize one input
Vectorized kernel
In-register shuffle operation
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
Benchmarks
Platforms: vector, dual/quad core, GPU, FPGA, FPGA+CPU
Kernels: DFT
All Spiral code shown is “push-button” generated from scratch (“click”)
Benchmark: Vector and SMP
Figure: DFT (single precision) on a 3 GHz 2 x Core 2 Extreme; performance [Gflop/s] (0 to 30) vs. input size (256 to 262144). Compared: Spiral 5.0 SMP, Spiral 5.0 sequential, Intel IPP 5.0, FFTW 3.2 alpha SMP, FFTW 3.2 alpha sequential; 2 and 4 processors; memory footprint < L1 cache of one processor.
25 Gflop/s! 4-way vectorized + up to 4-threaded + adapted to the memory hierarchy.
Benchmark: Cell (1 processor = SPE)
Figure: DFT (single precision) on a 3.2 GHz Cell BE (single SPE); performance [Gflop/s] (0 to 16) vs. input size (16 to 1024).
Generated using the simulator; run at Mercury (thanks to Robert Cooper)
Joint work with Th. Peter (ETH Zurich), S. Chellappa, M. Telgarsky, J. Moura (CMU)
DCT4, Multiples of 32: 4-way Vectorized
Figure: DCT4 (single precision) on a 2.66 GHz Core2 (4-way 32-bit SSE); performance [Gflop/s] (0 to 10) vs. input size (32 to 256). Compared: Spiral DCT4, FFTW 3.1.2 DCT4 (k11), IPP 5.1 DCT2 (?).
novel algorithm (algebraic algorithm theory)
DFT, 8-way Vectorized: All Sizes Up To 80
Figure: DFT (16-bit integer) on a 2.66 GHz Core2 Duo (8-way SSE2); performance [Gflop/s] (0 to 18) vs. input size (2 to 80). Compared: Spiral SSE2, Intel IPP 5.1.
First 8-way DFTs for all sizes
Arbitrary vector lengths / arbitrary DFT sizes in principle solved
Benchmark: GPU
Figure: WHT (single precision) on a 3.6 GHz Pentium 4 with an Nvidia 7900 GTX; performance [Gflop/s] (0 to 6) vs. log2(input size) (8 to 24). Compared: Spiral CPU, Spiral GPU (CPU + GPU).
Joint work with H. Shen, TU Denmark
Benchmark: FPGA
Figure: DFT 256 on a Xilinx Virtex 2 Pro FPGA; inverse throughput (gap) [μs] (0 to 8) vs. area [slices] (0 to 25000; lower left is better). Compared: Spiral, Xilinx LogiCore 3.2.
Joint work with P. Milder, J. Hoe (CMU)
Competitive with professional designs
Much larger set of performance/area trade-offs
(Pareto-optimal designs)
Benchmark: Hardware Accelerator (FPGA)
Xilinx Virtex 2 Pro FPGA: 1M gates @ 100 MHz + 2 PowerPC 405 @ 300 MHz
Fixed set of accelerators speeds up a whole library
Figure: performance [Mflop/s] (0 to 700) vs. problem size (16 to 8192; higher is better), software only vs. software + hardware: 6.5x speedup at the native sizes.
Joint work with P. D’Alberto (Yahoo), A. Sandryhaila, P. Milder, J. Hoe, J. M. F. Moura (CMU), J. Johnson (Drexel)
Benchmarks
Platforms: vector, dual/quad core, GPU, FPGA, FPGA+CPU
Kernels: DFT, filter, GEMM, Viterbi
All Spiral code shown is “push-button” generated from scratch (“click”)
Benchmark: Finite Impulse Response Filter
Figure: FIR filter (double precision) on a 2.33 GHz 2 x Core 2 Quad (8 threads); performance [Gflop/s] (0 to 45) vs. log2(input size) (6 to 20). Compared: Spiral 8 taps, Spiral 32 taps, IPP 8 taps, IPP 32 taps. Theoretical peak performance: 74.66 Gflop/s.
Beyond Transforms: Viterbi Decoding
Figure: Viterbi decoding (8-bit) on a 2.66 GHz Core 2 Duo; performance [Gbutterflies/s] (0 to 2.5) vs. log2(size of state machine) (6 to 13). Compared: Spiral 16-way vectorized, Spiral scalar, Karn’s Viterbi decoder (hand-tuned assembly). 1 butterfly ≈ 22 ops.
Vectorized using practically the same rules as for the DFT
Joint work with S. Chellappa, CMU
Karn’s decoder: http://www.ka9q.net/code/fec/
First Results: Matrix-Matrix-Multiply
Figure: DGEMM on a 3 GHz Core 2 Duo (1 thread); performance [Gflop/s] (0 to 12) vs. N (matrix size N x N, 2 to 512). Compared: Goto, Spiral, triple loop.
work with F. de Mesmay, CMU
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
Conclusions
Automatic generation of very fast, often the fastest, numerical kernels is possible and desirable
High-level language and approach:
Algorithm generation
Algorithm optimization
Same approach for loop optimization, different forms of parallelism, SW and HW implementations
Spiral Web Interface @spiral.net (beta version)
1. Select platform
2. Select functionality
3. “click”