carnegie mellon spiral: automatic generation of industry strength performance libraries franz...
TRANSCRIPT
Carnegie MellonCarnegie Mellon
Spiral:Automatic Generation of Industry Strength Performance Libraries
Franz Franchetti
Carnegie Mellon Universitywww.ece.cmu.edu/~franzf
CTO and Co-Founder, SpiralGenwww.spiralgen.com
This work was supported by DARPA DESA program, NSF, ONR, Mercury Inc., Intel, and Nvidia
Carnegie MellonCarnegie Mellon
The Future is Parallel and Heterogeneous
multicore ?2009
2012 and later
Cell BE8+1 cores
before 2000
Core2 DuoCore2 Extreme
Virtex 5FPGA+ 4 CPUs
SGI RASC Itanium + FPGA
Nvidia GPUs240 streaming cores
Sun Niagara32 threads
IBM Cyclops6480 cores
Intel Larrabee
Xtreme DATA Opteron + FPGA
ClearSpeed192 cores
Programmability?Performance portability?Rapid prototyping?
CPU platforms
AMD Fusion
BlueGene/Q
Intel Haswellvector coprocessors
Tilera TILEPro64 cores
IBM POWER72x8 cores
Intel Sandy Bridge8-way float vectors
Nvidia Fermi
Carnegie MellonCarnegie Mellon
Spiral: Computer Writes Best SAR Code
Special hardwareCell blade
COTSIntel
Synthesis [2]Programming [1]
[1] Rudin, J., Implementation of Polar Format SAR Image Formation on the IBM Cell Broadband Engine, in Proceedings High Performance Embedded Computing (HPEC), 2007. Best Paper Award.
[2] D. McFarlin, F. Franchetti, M. Püschel, and J. M. F. Moura: High Performance Synthetic Aperture Radar Image Formation On Commodity Multicore Architectures. in Proceedings SPIE, 2009.
Result Same performance, 1/10th human effort, non-expert user
Key ideasrestrict domain, use mathematics, program synthesis
Carnegie MellonCarnegie Mellon
What is Spiral?Traditionally Spiral Approach
High performance libraryoptimized for given platform
Spiral
High performance libraryoptimized for given platform
Comparable performance
Carnegie MellonCarnegie Mellon
Spiral’s Domain-Specific Program Synthesis
νpμ
Architectural parameter:Vector length, #processors, …
rewritingdefines
Kernel: problem size, algorithm choice
picksearch
abstraction abstraction
Model: common abstraction= spaces of matching formulas
architecturespace
algorithmspace
optimization
Carnegie MellonCarnegie Mellon
Related WorkCompiler-Based AutotuningSynthesis from Domain Math
Autotuning Numerical Libraries Autotuning Primer
Polyhedral frameworkIBM XL, Pluto, CHiLL
Transformation prescriptionCHiLL, POET
Profile guided optimizationIntel C, IBM XL
SpiralSignal and image processing, SDR
Tensor Contraction EngineQuantum Chemistry Code Synthesizer
FLAMENumerical linear algebra (LAPACK)
ATLASBLAS generator
FFTWkernel generator
Vendor math librariesCode generation scripts
Carnegie MellonCarnegie Mellon
Organization
M. Püschel, F. Franchetti, Y. Voronenko: Spiral. Encyclopedia of Parallel Computing, D. A. Padua (Editor), 2011.
Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong,Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo:SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.
Spiral overview
Validation and Verification
Results
Concluding remarks
Carnegie MellonCarnegie Mellon
Transform = Matrix-vector multiplicationExample: Discrete Fourier transform (DFT)
Fast algorithm = sparse matrix factorization = SPL formulaExample: Cooley-Tukey FFT algorithm
Spiral’s Origin: Linear Transforms
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1
j j
j j j
input vector (signal)
output vector (signal) transform = matrix
Carnegie MellonCarnegie Mellon
Beyond Transforms: General Operators Transform =
linear operator with one vector input and one vector output
Key ideas: Generalize to (possibly nonlinear) operators with several inputs and
several outputs Generalize SPL (including tensor product) to OL (operator language) Generalize rewriting systems for parallelizations
linear
Carnegie MellonCarnegie Mellon
Software Defined RadioLinear Transforms
Matrix-Matrix Multiplication Synthetic Aperture Radar (SAR)
convolutionalencoder
Viterbidecoder
010001 11 10 00 01 10 01 11 00 01000111 10 01 01 10 10 11 00
= £
Expressing Kernels as Operator Formulas
Interpolation 2D FFT
Carnegie MellonCarnegie Mellon
One Approach for all Types of Parallelism Multithreading (Multicore)
Vector SIMD (SSE, VMX/Altivec,…)
Message Passing (Clusters, MPP)
Streaming/multibuffering (Cell)
Graphics Processors (GPUs)
Gate-level parallelism (FPGA)
HW/SW partitioning (CPU + FPGA)
Carnegie MellonCarnegie Mellon
Autotuning in Constraint Solution SpaceIntel MIC
Base cases
DFT256
Breakdown rulesTransformation rules
Carnegie MellonCarnegie Mellon
Translating a Formula into Code
C Code:
Output =OL Formula:
∑-OL:
Constraint Solver Input:
Carnegie MellonCarnegie Mellon
Auto-Generation of Performance Library
High-Performance Library(FFTW-like, MKL-like, IPP-like)
Spiral
Input: Transform: Algorithms: Vectorization: 2-way SSE Threading: Yes
Output: Optimized library (10,000 lines of C++) For general input size
(not collection of fixed sizes) Vectorized Multithreaded With runtime adaptation mechanism Performance competitive with hand-written code
Carnegie MellonCarnegie Mellon
Core Idea: Recursion Step Closure Input: transform T and a breakdown rules Output: problem specifications for recursive function and codelets
Algorithm:1. Apply the breakdown rule
2. Convert to -SPL
3. Apply loop merging + index simplification rules.
4. Extract recursion steps
5. Repeat until closure is reached
Carnegie MellonCarnegie Mellon
void dft64(float *Y, float *X) { __m512 U912, U913, U914, U915, U916, U917, U918, U919, U920, U921, U922, U923, U924, U925,...; a2153 = ((__m512 *) X); s1107 = *(a2153); s1108 = *((a2153 + 4)); t1323 = _mm512_add_ps(s1107,s1108); ... U926 = _mm512_swizupconv_r32(_mm512_set_1to16_ps(0.70710678118654757),_MM_SWIZ_REG_CDAB); s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi( _mm512_set_1to16_ps(0.70710678118654757),0xAAAA,a2154,U926),t1341), _mm512_mask_sub_ps(_mm512_set_1to16_ps(0.70710678118654757),0x5555,a2154,U926), _mm512_swizupconv_r32(t1341,_MM_SWIZ_REG_CDAB)); U927 = _mm512_swizupconv_r32(_mm512_set_16to16_ps(0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757)),_MM_SWIZ_REG_CDAB); ... s1166 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(_mm512_set_16to16_ps( 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757)), 0xAAAA,a2154,U951),t1362), _mm512_mask_sub_ps(_mm512_set_16to16_ps(0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757), 0.70710678118654757, (-0.70710678118654757)),0x5555,a2154,U951), _mm512_swizupconv_r32(t1362,_MM_SWIZ_REG_CDAB)); ...}
Spiral-Generated Code (Intel MIC/LRBni)
Carnegie MellonCarnegie Mellon
Support For Library-Specific Interfaces
IPPAPI(IppStatus, ippgDFTFwd_CToC_32fc, (const Ipp32fc *pSrc, Ipp32fc *pDst, int length, int flag) )
IPPAPI(IppStatus, ippgWHT_32f, (const Ipp32f *pSrc, Ipp32f *pDst, int order, int flag, Ipp8u *pBuf)
IPPAPI(IppStatus, ippgWHTGetBufferSize_32f, (int order, Ipp32u *pBufferSize) )
Complex FFT name data type
Walsh-Hadamard Transformlog(size) scaling memory
size scaling
memory
Carnegie MellonCarnegie Mellon
Industry-Strength Code: Spiral and Intel IPP 6.0
Spiral-generated code in Intel’s Library IPP• IPP = Intel’s performance primitives, used by 1000s of companies• Generated: 3984 C functions (signal processing) = 1M lines of code• Full parallelism support• Computer-generated code: Faster than what was achievable by hand
Carnegie MellonCarnegie Mellon
Organization Spiral overview
Validation and Verification
Results
Concluding remarks
Carnegie MellonCarnegie Mellon
Transform = Matrix-vector multiplicationmatrix fully defines the operation
Algorithm = Formularepresents a matrix expression, can be evaluated to a matrix
Symbolic Verification
= ?
Carnegie MellonCarnegie Mellon
Run program on all basis vectors,compare to columns of transform matrix
Compare program output on random vectorsto output of a random implementation of same kernel
Empirical Verification
= ?DFT4([0,1,0,0])
DFT4_rnd([0.1,1.77,2.28,-55.3]))
DFT4([0.1,1.77,2.28,-55.3])
= ?
Carnegie MellonCarnegie Mellon
Rule replaces left-hand side by right-hand side when preconditions match
Test rule by evaluating expressions before and after rule application and compare result
Verification of the Generator
= ?
Carnegie MellonCarnegie Mellon
Verification of Autotuning Libraries
Auto-generated FFTW-like library Need verifier for each function Auto-generated from specification Auto-generate test harness Drop-in replacement into
existing infrastructure
Carnegie MellonCarnegie Mellon
Organization Spiral overview
Validation and Verification
Results
Concluding remarks
Carnegie MellonCarnegie Mellon
Results: Spiral Outperforms Humans
FFT on Multicore
FFT on FPGA
SAR
SDR
Carnegie MellonCarnegie Mellon
Samsung i9100 Galaxy S IIDual-core ARM at 1.2GHz with NEON ISA
SIMD vectorization + multi-threading
From Cell Phone To Supercomputer
G. Almási, B. Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y. Voronenko, X. Xue:2010 IBM HPC Challenge Class II Submission. Winner of the 2010 HPC Challenge Class II Award (Most Productive System).
Global FFT (1D FFT, HPC Challenge)performance [Gflop/s]
BlueGene/P at Argonne National Laboratory128k cores (quad-core CPUs) at 850 MHz
SIMD vectorization + multi-threading + MPI
6.4 Tflop/s
BlueGene/P
Carnegie MellonCarnegie Mellon
Organization Spiral overview
Validation and Verification
Results
Concluding remarks
Carnegie MellonCarnegie Mellon
Summary: Spiral in a NutshellVerificationJoint Abstraction
Target Machines Application DomainsSignal Processing
Matrix Algorithms
Software Defined Radio
Image Formation (SAR)Academic @ CMU: www.spiral.netCommercial: www.spiralgen.com
= ?DFT4([0,1,0,0])
Carnegie MellonCarnegie Mellon
AcknowledgementJames C. HoeJeremy JohnsonJosé M. F. MouraDavid PaduaMarkus PüschelVolodymyr Arbatov Paolo D’AlbertoPeter A. MilderYevgen VoronenkoQian YuBerkin AkinChristos Angelopoulos Srinivas ChellappaFrédéric de MesmayDaniel S. McFarlinMarek R. Telgarsky
Special thanks to:Randi Rost, Scott Buck (Intel), Jon Greene (Mercury Inc.), Yuanwei Jin (UMES)Gheorghe Almasi, Jose E. Moreira, Jim Sexton (IBM), Saeed Maleki (UIUC)Francois Gygi (LLNL, UC Davis), Kim Yates (LLNL), Kalyan Kumaran (ANL)
Carnegie MellonCarnegie Mellon
More Information:www.spiral.netwww.spiralgen.com