
Post on 16-Dec-2015


TRANSCRIPT

Page 1: Carnegie Mellon Spiral: Automatic Generation of Industry Strength Performance Libraries Franz Franchetti Carnegie Mellon University franzf

Carnegie Mellon

Spiral: Automatic Generation of Industry Strength Performance Libraries

Franz Franchetti

Carnegie Mellon University
www.ece.cmu.edu/~franzf

CTO and Co-Founder, SpiralGen
www.spiralgen.com

This work was supported by DARPA DESA program, NSF, ONR, Mercury Inc., Intel, and Nvidia

Page 2:

The Future is Parallel and Heterogeneous

Timeline: before 2000, 2009, 2012 and later

CPU platforms

multicore ?

Core2 Duo, Core2 Extreme
Cell BE: 8+1 cores
Virtex 5: FPGA + 4 CPUs
SGI RASC: Itanium + FPGA
Nvidia GPUs: 240 streaming cores
Sun Niagara: 32 threads
IBM Cyclops64: 80 cores
Intel Larrabee
Xtreme DATA: Opteron + FPGA
ClearSpeed: 192 cores
AMD Fusion
BlueGene/Q
Intel Haswell: vector coprocessors
Tilera TILEPro: 64 cores
IBM POWER7: 2x8 cores
Intel Sandy Bridge: 8-way float vectors
Nvidia Fermi

Programmability?
Performance portability?
Rapid prototyping?

Page 3:

Spiral: Computer Writes Best SAR Code

Special hardware: Cell blade, Programming [1]

COTS: Intel, Synthesis [2]

[1] Rudin, J., Implementation of Polar Format SAR Image Formation on the IBM Cell Broadband Engine, in Proceedings High Performance Embedded Computing (HPEC), 2007. Best Paper Award.

[2] D. McFarlin, F. Franchetti, M. Püschel, and J. M. F. Moura: High Performance Synthetic Aperture Radar Image Formation On Commodity Multicore Architectures, in Proceedings SPIE, 2009.

Result: same performance, 1/10th human effort, non-expert user

Key ideas: restrict domain, use mathematics, program synthesis

Page 4:

What is Spiral?

Traditionally

High performance library optimized for given platform

Spiral Approach

Spiral

High performance library optimized for given platform

Comparable performance

Page 5:

Spiral’s Domain-Specific Program Synthesis

Architectural parameter: vector length, #processors, … (ν, p, μ)

Kernel: problem size, algorithm choice

Model: common abstraction = spaces of matching formulas

Diagram labels: architecture space, algorithm space, abstraction, rewriting, defines, pick, search, optimization

Page 6:

Related Work

Compiler-Based Autotuning
Polyhedral framework: IBM XL, Pluto, CHiLL
Transformation prescription: CHiLL, POET
Profile guided optimization: Intel C, IBM XL

Synthesis from Domain Math
Spiral: Signal and image processing, SDR
Tensor Contraction Engine: Quantum Chemistry Code Synthesizer
FLAME: Numerical linear algebra (LAPACK)

Autotuning Numerical Libraries
ATLAS: BLAS generator
FFTW: kernel generator
Vendor math libraries: Code generation scripts

Autotuning Primer

Page 7:

Organization

M. Püschel, F. Franchetti, Y. Voronenko: Spiral. Encyclopedia of Parallel Computing, D. A. Padua (Editor), 2011.

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.

Spiral overview

Validation and Verification

Results

Concluding remarks

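Page 8:

Spiral’s Origin: Linear Transforms

Transform = Matrix-vector multiplication
Example: Discrete Fourier transform (DFT)

$$
y = \mathrm{DFT}_4\, x, \qquad
\mathrm{DFT}_4 =
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & j & -1 & -j \\
1 & -1 & 1 & -1 \\
1 & -j & -1 & j
\end{bmatrix}
$$

x: input vector (signal); y: output vector (signal); transform = matrix

Fast algorithm = sparse matrix factorization = SPL formula
Example: Cooley-Tukey FFT algorithm

$$
\mathrm{DFT}_4 =
\begin{bmatrix}
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 \\
1 & 0 & -1 & 0 \\
0 & 1 & 0 & -1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & j
\end{bmatrix}
\begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & -1 & 0 & 0 \\
0 & 0 & 1 & 1 \\
0 & 0 & 1 & -1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
= (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2
$$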
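The claim "fast algorithm = sparse matrix factorization" can be checked numerically for the size-4 case. The following is a minimal sketch in plain C, not Spiral output: it multiplies the four sparse factors of the Cooley-Tukey step and compares the product with the dense DFT_4 matrix. The function names (`matmul4`, `dft4_factorization_mismatches`) are hypothetical, and the sketch assumes the sign convention in which the second row of DFT_4 reads 1, j, -1, -j.

```c
#include <assert.h>
#include <complex.h>

#define N 4

/* c = a * b for 4x4 complex matrices; c must not alias a or b. */
static void matmul4(double complex a[N][N], double complex b[N][N],
                    double complex c[N][N]) {
    assert(c != a && c != b);
    for (int r = 0; r < N; r++) {
        for (int s = 0; s < N; s++) {
            double complex sum = 0;
            for (int k = 0; k < N; k++) {
                sum += a[r][k] * b[k][s];
            }
            c[r][s] = sum;
        }
    }
}

/* Counts the entries in which the dense DFT_4 matrix differs from the
   product (DFT_2 (x) I_2) * T * (I_2 (x) DFT_2) * L.
   Exact comparison is safe here: every entry is 0, +-1 or +-j,
   so the arithmetic involves no rounding. */
int dft4_factorization_mismatches(void) {
    const double complex im = I; /* imaginary unit */

    double complex dft4[N][N] = {
        {1, 1, 1, 1}, {1, im, -1, -im}, {1, -1, 1, -1}, {1, -im, -1, im}};
    double complex f2_x_i2[N][N] = { /* DFT_2 (x) I_2 */
        {1, 0, 1, 0}, {0, 1, 0, 1}, {1, 0, -1, 0}, {0, 1, 0, -1}};
    double complex twiddle[N][N] = { /* diag(1, 1, 1, j) */
        {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, im}};
    double complex i2_x_f2[N][N] = { /* I_2 (x) DFT_2 */
        {1, 1, 0, 0}, {1, -1, 0, 0}, {0, 0, 1, 1}, {0, 0, 1, -1}};
    double complex stride[N][N] = { /* stride permutation */
        {1, 0, 0, 0}, {0, 0, 1, 0}, {0, 1, 0, 0}, {0, 0, 0, 1}};

    double complex p1[N][N], p2[N][N], p3[N][N];
    matmul4(f2_x_i2, twiddle, p1);
    matmul4(p1, i2_x_f2, p2);
    matmul4(p2, stride, p3);

    int mismatches = 0;
    for (int r = 0; r < N; r++) {
        for (int s = 0; s < N; s++) {
            if (p3[r][s] != dft4[r][s]) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

Calling `dft4_factorization_mismatches()` returns the number of differing entries, so a correct factorization yields zero. For larger sizes the entries are no longer exactly representable and a tolerance-based comparison would be needed instead.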

Page 9:

Beyond Transforms: General Operators

Transform = linear operator with one vector input and one vector output

Key ideas:
Generalize to (possibly nonlinear) operators with several inputs and several outputs
Generalize SPL (including tensor product) to OL (operator language)
Generalize rewriting systems for parallelizations

Page 10:

Expressing Kernels as Operator Formulas

Linear Transforms

Software Defined Radio: convolutional encoder, Viterbi decoder

Matrix-Matrix Multiplication

Synthetic Aperture Radar (SAR): Interpolation, 2D FFT

Page 11:

One Approach for all Types of Parallelism

Multithreading (Multicore)

Vector SIMD (SSE, VMX/Altivec,…)

Message Passing (Clusters, MPP)

Streaming/multibuffering (Cell)

Graphics Processors (GPUs)

Gate-level parallelism (FPGA)

HW/SW partitioning (CPU + FPGA)

Page 12:

Autotuning in Constraint Solution Space

Intel MIC

Base cases

DFT256

Breakdown rules

Transformation rules

Page 13:

Translating a Formula into Code

Constraint Solver Input:

Output = OL Formula:

Σ-OL:

C Code:

Page 14:

Auto-Generation of Performance Library

Input:
Transform:
Algorithms:
Vectorization: 2-way SSE
Threading: Yes

Spiral

Output:
Optimized library (10,000 lines of C++)
For general input size (not collection of fixed sizes)
Vectorized
Multithreaded
With runtime adaptation mechanism
Performance competitive with hand-written code

High-Performance Library (FFTW-like, MKL-like, IPP-like)

Page 15:

Core Idea: Recursion Step Closure

Input: transform T and a breakdown rule
Output: problem specifications for recursive functions and codelets

Algorithm:
1. Apply the breakdown rule
2. Convert to Σ-SPL
3. Apply loop merging and index simplification rules
4. Extract recursion steps
5. Repeat until closure is reached

Page 16:

void dft64(float *Y, float *X) {
    __m512 U912, U913, U914, U915, U916, U917, U918, U919, U920, U921, U922, U923, U924, U925,...;
    a2153 = ((__m512 *) X);
    s1107 = *(a2153);
    s1108 = *((a2153 + 4));
    t1323 = _mm512_add_ps(s1107,s1108);
    ...
    U926 = _mm512_swizupconv_r32(_mm512_set_1to16_ps(0.70710678118654757),_MM_SWIZ_REG_CDAB);
    s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(
            _mm512_set_1to16_ps(0.70710678118654757),0xAAAA,a2154,U926),t1341),
        _mm512_mask_sub_ps(_mm512_set_1to16_ps(0.70710678118654757),0x5555,a2154,U926),
        _mm512_swizupconv_r32(t1341,_MM_SWIZ_REG_CDAB));
    U927 = _mm512_swizupconv_r32(_mm512_set_16to16_ps(
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757)),_MM_SWIZ_REG_CDAB);
    ...
    s1166 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(_mm512_set_16to16_ps(
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757)),
            0xAAAA,a2154,U951),t1362),
        _mm512_mask_sub_ps(_mm512_set_16to16_ps(
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757),
            0.70710678118654757, (-0.70710678118654757)),0x5555,a2154,U951),
        _mm512_swizupconv_r32(t1362,_MM_SWIZ_REG_CDAB));
    ...
}

Spiral-Generated Code (Intel MIC/LRBni)

Page 17:

Support For Library-Specific Interfaces

IPPAPI(IppStatus, ippgDFTFwd_CToC_32fc, (const Ipp32fc *pSrc, Ipp32fc *pDst, int length, int flag) )

IPPAPI(IppStatus, ippgWHT_32f, (const Ipp32f *pSrc, Ipp32f *pDst, int order, int flag, Ipp8u *pBuf) )

IPPAPI(IppStatus, ippgWHTGetBufferSize_32f, (int order, Ipp32u *pBufferSize) )

Annotations on the complex FFT signature: Complex FFT, name, data type, size, scaling

Annotations on the Walsh-Hadamard Transform signatures: Walsh-Hadamard Transform, log(size), scaling, memory
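To make the "log(size)" annotation concrete: the `order` argument is the base-2 logarithm of the transform length. The sketch below is a plain-C reference Walsh-Hadamard transform written for illustration only. `wht_inplace` is a hypothetical name, not an IPP function, and unlike the IPP signature above it works in place, applies no scaling, and needs no scratch buffer.

```c
#include <assert.h>
#include <stddef.h>

/* In-place, unscaled Walsh-Hadamard transform of length n = 2^order.
   Hypothetical reference routine for illustration; not part of IPP. */
void wht_inplace(float *x, int order) {
    assert(x != NULL && order >= 0 && order < 31);
    size_t n = (size_t)1 << order;

    /* order butterfly stages; stage with half-width h pairs x[j] and x[j+h] */
    for (size_t h = 1; h < n; h <<= 1) {
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; j++) {
                float a = x[j];
                float b = x[j + h];
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```

With `order = 2` the routine transforms four samples, with `order = 10` it transforms 1024. Applying it twice returns the input multiplied by n, which is why library interfaces typically expose a scaling flag.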

Page 18:

Industry-Strength Code: Spiral and Intel IPP 6.0

Spiral-generated code in Intel’s Library IPP
• IPP = Intel’s performance primitives, used by 1000s of companies
• Generated: 3984 C functions (signal processing) = 1M lines of code
• Full parallelism support
• Computer-generated code: Faster than what was achievable by hand

Page 19:

Organization

Spiral overview

Validation and Verification

Results

Concluding remarks

Page 20:

Symbolic Verification

Transform = Matrix-vector multiplication: the matrix fully defines the operation

Algorithm = Formula: represents a matrix expression, can be evaluated to a matrix

Check: matrix of the evaluated formula = ? transform matrix

Page 21:

Empirical Verification

Run program on all basis vectors, compare to columns of transform matrix
DFT4([0,1,0,0]) = ? second column of the DFT4 matrix

Compare program output on random vectors to output of a random implementation of same kernel
DFT4([0.1,1.77,2.28,-55.3]) = ? DFT4_rnd([0.1,1.77,2.28,-55.3])
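Both empirical checks can be sketched in a few lines of C. The code below is an illustration, not the Spiral test harness: `dft4_fast` stands in for the program under test, `dft4_ref` is a direct matrix-vector product used as the independent implementation, and all function names are hypothetical. Real and imaginary parts are stored in separate arrays, and the sketch assumes the convention in which DFT4([0,1,0,0]) = [1, j, -1, -j].

```c
#include <assert.h>

/* Powers of j: j^0 = 1, j^1 = j, j^2 = -1, j^3 = -j (real and imaginary parts). */
static const double WR[4] = {1.0, 0.0, -1.0, 0.0};
static const double WI[4] = {0.0, 1.0, 0.0, -1.0};

static double absd(double v) { return v < 0.0 ? -v : v; }

/* Stand-in for the program under test: 4-point DFT with two butterfly stages. */
void dft4_fast(const double xr[4], const double xi[4], double yr[4], double yi[4]) {
    assert(xr != yr && xi != yi); /* out-of-place only */
    double ar = xr[0] + xr[2], ai = xi[0] + xi[2];
    double br = xr[0] - xr[2], bi = xi[0] - xi[2];
    double cr = xr[1] + xr[3], ci = xi[1] + xi[3];
    double dr = xr[1] - xr[3], di = xi[1] - xi[3];
    double tr = -di, ti = dr; /* twiddle: j * (dr + j*di) = -di + j*dr */
    yr[0] = ar + cr; yi[0] = ai + ci;
    yr[1] = br + tr; yi[1] = bi + ti;
    yr[2] = ar - cr; yi[2] = ai - ci;
    yr[3] = br - tr; yi[3] = bi - ti;
}

/* Independent implementation: y[l] = sum_k j^(l*k) * x[k]. */
void dft4_ref(const double xr[4], const double xi[4], double yr[4], double yi[4]) {
    for (int l = 0; l < 4; l++) {
        yr[l] = 0.0;
        yi[l] = 0.0;
        for (int k = 0; k < 4; k++) {
            int e = (l * k) % 4;
            yr[l] += WR[e] * xr[k] - WI[e] * xi[k];
            yi[l] += WR[e] * xi[k] + WI[e] * xr[k];
        }
    }
}

/* Check 1: run on every basis vector e_k and compare with column k of the
   transform matrix. Returns the number of failing basis vectors. */
int count_basis_vector_failures(double tol) {
    int failures = 0;
    for (int k = 0; k < 4; k++) {
        double xr[4] = {0.0}, xi[4] = {0.0}, yr[4], yi[4];
        xr[k] = 1.0;
        dft4_fast(xr, xi, yr, yi);
        int bad = 0;
        for (int l = 0; l < 4; l++) {
            int e = (l * k) % 4;
            if (absd(yr[l] - WR[e]) > tol || absd(yi[l] - WI[e]) > tol) {
                bad = 1;
            }
        }
        failures += bad;
    }
    return failures;
}

/* Check 2: largest componentwise difference between the program under test
   and the independent implementation on one input vector. */
double max_diff_vs_reference(const double xr[4], const double xi[4]) {
    double ar[4], ai[4], br[4], bi[4], worst = 0.0;
    dft4_fast(xr, xi, ar, ai);
    dft4_ref(xr, xi, br, bi);
    for (int l = 0; l < 4; l++) {
        double d = absd(ar[l] - br[l]);
        if (d > worst) worst = d;
        d = absd(ai[l] - bi[l]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

The basis-vector check is exhaustive for a linear kernel, since the columns determine the matrix. The random-vector check only samples the input space, but it also applies to kernels for which no dense matrix is available.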

Page 22:

Verification of the Generator

Rule replaces left-hand side by right-hand side when preconditions match

Test rule by evaluating the expressions before and after rule application and comparing the results

Check: expression before rule application = ? expression after rule application
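Rule testing by evaluation can be illustrated with one well-known tensor-product identity, A ⊗ B = (A ⊗ I_2)(I_2 ⊗ B). The sketch below evaluates both sides to dense matrices and reports the largest entrywise difference. It assumes 2x2 real operands, the function names are hypothetical, and the rule was chosen for illustration rather than taken from Spiral's rule set.

```c
#include <assert.h>

/* k = a (x) b (Kronecker product) for 2x2 real matrices; k is 4x4. */
void kron2(double a[2][2], double b[2][2], double k[4][4]) {
    for (int r = 0; r < 4; r++) {
        for (int s = 0; s < 4; s++) {
            k[r][s] = a[r / 2][s / 2] * b[r % 2][s % 2];
        }
    }
}

/* c = a * b for 4x4 real matrices; c must not alias a or b. */
void matmul4r(double a[4][4], double b[4][4], double c[4][4]) {
    assert(c != a && c != b);
    for (int r = 0; r < 4; r++) {
        for (int s = 0; s < 4; s++) {
            double sum = 0.0;
            for (int t = 0; t < 4; t++) {
                sum += a[r][t] * b[t][s];
            }
            c[r][s] = sum;
        }
    }
}

/* Evaluates the left-hand side a (x) b and the right-hand side
   (a (x) I_2) * (I_2 (x) b) of the illustrative rule and returns the
   largest absolute entrywise difference between the two matrices. */
double rule_max_difference(double a[2][2], double b[2][2]) {
    double id[2][2] = {{1.0, 0.0}, {0.0, 1.0}};
    double lhs[4][4], a_x_i[4][4], i_x_b[4][4], rhs[4][4];

    kron2(a, b, lhs);
    kron2(a, id, a_x_i);
    kron2(id, b, i_x_b);
    matmul4r(a_x_i, i_x_b, rhs);

    double worst = 0.0;
    for (int r = 0; r < 4; r++) {
        for (int s = 0; s < 4; s++) {
            double d = lhs[r][s] - rhs[r][s];
            if (d < 0.0) d = -d;
            if (d > worst) worst = d;
        }
    }
    return worst;
}
```

A correct rule gives a difference at rounding level for every operand pair that satisfies its preconditions. A nonzero difference on any tested pair exposes a faulty rule before any code is generated from it.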

Page 23:

Verification of Autotuning Libraries

Auto-generated FFTW-like library
Need verifier for each function
Auto-generated from specification
Auto-generate test harness
Drop-in replacement into existing infrastructure

Page 24:

Organization

Spiral overview

Validation and Verification

Results

Concluding remarks

Page 25:

Results: Spiral Outperforms Humans

FFT on Multicore

FFT on FPGA

SAR

SDR

Page 26:

From Cell Phone To Supercomputer

Samsung i9100 Galaxy S II
Dual-core ARM at 1.2GHz with NEON ISA
SIMD vectorization + multi-threading

BlueGene/P at Argonne National Laboratory
128k cores (quad-core CPUs) at 850 MHz
SIMD vectorization + multi-threading + MPI

Plot: Global FFT (1D FFT, HPC Challenge), performance [Gflop/s]; labeled 6.4 Tflop/s (BlueGene/P)

G. Almási, B. Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y. Voronenko, X. Xue: 2010 IBM HPC Challenge Class II Submission. Winner of the 2010 HPC Challenge Class II Award (Most Productive System).

Page 27:

Organization

Spiral overview

Validation and Verification

Results

Concluding remarks

Page 28:

Summary: Spiral in a Nutshell

Joint Abstraction

Target Machines

Application Domains: Signal Processing, Matrix Algorithms, Software Defined Radio, Image Formation (SAR)

Verification: DFT4([0,1,0,0]) = ?

Academic @ CMU: www.spiral.net
Commercial: www.spiralgen.com

Page 29:

Acknowledgement

James C. Hoe, Jeremy Johnson, José M. F. Moura, David Padua, Markus Püschel, Volodymyr Arbatov, Paolo D’Alberto, Peter A. Milder, Yevgen Voronenko, Qian Yu, Berkin Akin, Christos Angelopoulos, Srinivas Chellappa, Frédéric de Mesmay, Daniel S. McFarlin, Marek R. Telgarsky

Special thanks to:
Randi Rost, Scott Buck (Intel), Jon Greene (Mercury Inc.), Yuanwei Jin (UMES)
Gheorghe Almasi, Jose E. Moreira, Jim Sexton (IBM), Saeed Maleki (UIUC)
Francois Gygi (LLNL, UC Davis), Kim Yates (LLNL), Kalyan Kumaran (ANL)

Page 30:

More Information:
www.spiral.net
www.spiralgen.com