
Lecture 1
Introduction

I.   Why Study Compilers?
II.  Example: Machine Learning
III. Course Syllabus

Reading: Chapters 1.1-1.5, 8.4, 8.5, 9.1
CS243: Introduction, M. Lam, Carnegie Mellon

Why Study Compilers?

Reasons for Studying Program Analysis & Optimizations
• Implementation of a programming abstraction
  - Improves software productivity by hiding low-level details while delivering performance
  - Domain-specific languages
  [Figure: a compiler bridges the gap from the programming language to the machine]
• A tool for designing and evaluating computer architectures
  - Inspired RISC and VLIW machines
  - Machine performance is measured on compiled code
• Techniques for developing other programming tools
  - Example: error detection tools
• Program translation can be used to solve other problems
  - Example: binary translation (processor change, adding virtualization support)
• Compilers have impact: they affect all programs

Compiler Study Trains Good Developers
• Excellent software engineering case study
  - Optimizing compilers are hard to build: the input is all programs, and the desired
    solutions are often NP-complete or undecidable
  - Key to success: formulate the right approximation!
• Methodology for solving complex real-life problems: where theory meets practice
  - Can't be solved by pure hacking; theory aids generality and correctness
  - Can't be solved by theory alone; experimentation validates and provides feedback to
    the problem formulation
• Reasoning about programs, reliability & security makes you a better programmer
• There are programmers, and there are tool builders ...

Example: Machine Learning
• Machine learning
  - Important to improve programmers' productivity
  - Powerful, widely applicable
  - Many variants of common ideas
  - Needs programmability: amenable to languages + compiler techniques
  - GPUs are difficult to optimize
• Demonstrates the importance of
  - Programming languages (programming abstraction): a beautiful hierarchy of abstractions
    bridges the gap from neural nets to GPUs
  - Compiler optimizations (performance for high-level programming): automatic analysis
    and transformation hide complex machine details and provide machine independence

Outline
1. Why is optimizing machine learning important and hard?
2. Today's approach: programming abstractions + handcoded libraries
3. Need and opportunities for compiler optimizations

1. Machine Learning is Expensive
• 1 ExaFLOPS = 10^18 FLOPS

Nvidia Volta GV100 GPU
• 21B transistors, 815 mm^2, 1455 MHz
• 80 Stream Multiprocessors (SMs)
https://wccftech.com/nvidia-volta-tesla-v100-cards-detailed-150w-single-slot-300w-dual-slot-gv100-powered-pcie-accelerators/
In Each SM
• 64 FP32 cores, 64 INT cores, 32 FP64 cores, 8 Tensor cores
• Tensor cores: D = A x B + C, where A, B, C, D are 4x4 matrices
  - 4x4x4 matrix processing array, 1024 floating-point ops/clock
  - (a scalar C++ sketch of this 4x4x4 operation appears after the "One Popular Stack"
    slide below)
• FP32: 15 TFLOPS; FP64: 7.5 TFLOPS; Tensor: 120 TFLOPS
https://wccftech.com/nvidia-volta-tesla-v100-cards-detailed-150w-single-slot-300w-dual-slot-gv100-powered-pcie-accelerators/

General Algorithm
• Training: generate parameters with a large dataset
  - Start with initial values
  - Iterate: forward pass on labeled data, then backpropagate
• Inference: forward pass with trained parameters
• Note: many popular architectures, common building blocks

Example: Inception, a Convolutional Neural Network for Object Recognition
[Figure: the Inception network architecture]
https://towardsdatascience.com/transfer-learning-using-keras-d804b2e04ef8

Neural Networks to ExaFLOPS
• 8 Nvidia Volta Tesla V100 GPUs (960 TFLOPS)
• 2 Intel Xeon E5-2698 V4 processors, each with 20 cores / 40 threads at 2.2 GHz

Challenges
• Many applications of deep neural nets: want to enable many developers to write
  efficient neural net code
• Can we transform a high-level neural net algorithm into efficient code for a complex target?
  - Heterogeneous and distributed/parallel architectures:
    x86 (multi-cores, SIMD instructions), GPU (blocks of streams, Tensor cores)
• For different architectures?
  - Architecture upgrades: e.g. Pascal to Volta (Nvidia)
  - Alternative architectures: Intel, AMD, TPU (tensor processing unit), FPGA
  - Training vs. inference: servers to mobile devices

2. Today's Solution
• Bridge the semantic gap from neural networks to GPUs via domain-specific abstractions
  - Computation model abstractions
  - Hand-optimized primitives

Levels of Abstraction
  Neural Net Model:      Inception, AlexNet, Seq2Seq, YOLO
  High Level Framework:  Keras, TF slim
  Low Level Framework:   Tensorflow, Caffe, MxNet, PyTorch
  High Level Library:    cuDNN
  Low Level Library:     cuBLAS, cuFFT, cuSPARSE, Eigen, OpenBLAS
  GPU Language:          CUDA, OpenCL
  Hardware:              x86, Nvidia GPU, Google TPU, AMD, FPGA

One Popular Stack
  Neural Net Model:      network of neural-network modules
                         (conv2d; VGG: convolutional NN for image recognition)
  High Level Framework:  Keras
  Low Level Framework:   Tensorflow (data-flow graphs of math kernels)
  High Level Library:    cuDNN
  Low Level Library:     cuBLAS, cuFFT, cuSPARSE, Eigen
                         cuX: handwritten libraries for Nvidia architectures
                         (cuDNN: convolution, pooling, normalization, activation;
                          cuBLAS: basic linear algebra)
                         Eigen: open-source general maths library
  GPU Language:          CUDA (explicit CPU/GPU code + memory management)
  Hardware:              x86, GPU
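To make the tensor-core numbers concrete, here is a scalar C++ sketch of the single
operation a Volta tensor core performs each clock: D = A x B + C on 4x4 matrices, i.e.
64 multiply-adds (128 floating-point ops). This is only an illustration of the arithmetic;
the function name tensor_core_step is made up, and real code would go through CUDA's
tensor-core (WMMA) intrinsics or cuDNN/cuBLAS rather than a scalar loop nest.

#include <array>
#include <cstdio>

// Scalar reference for one tensor-core step: D = A x B + C on 4x4 matrices.
using Mat4 = std::array<std::array<float, 4>, 4>;

Mat4 tensor_core_step(const Mat4& A, const Mat4& B, const Mat4& C) {
  Mat4 D = C;                                // start from the accumulator C
  for (int i = 0; i < 4; ++i)                // 4x4x4 = 64 multiply-adds
    for (int j = 0; j < 4; ++j)
      for (int k = 0; k < 4; ++k)
        D[i][j] += A[i][k] * B[k][j];
  return D;
}

int main() {
  Mat4 A{}, B{}, C{};                        // zero-initialized
  for (int i = 0; i < 4; ++i) { A[i][i] = 1.0f; B[i][i] = 2.0f; }
  Mat4 D = tensor_core_step(A, B, C);
  std::printf("D[0][0] = %.1f\n", D[0][0]);  // prints 2.0 for these inputs
  return 0;
}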
Keras: Python Framework to Call a Library of Neural Network Modules

model = applications.VGG19(weights="imagenet", include_top=False,
                           input_shape=(img_width, img_height, 3))

x = Conv2D(64, (3, 3), activation='relu', padding='same',
           name='block1_conv1')(img_input)

Tensorflow: Graph of Operations in a Neural Network Module

W = tf.get_variable(name='W', shape=(K_in, K_out), dtype=tf.float32, trainable=True)
x = tf.placeholder(shape=(None, K_in), dtype=tf.float32)
y = tf.placeholder(shape=(None, K_out), dtype=tf.float32)
y_hat = tf.matmul(x, W)
y_predict = y_hat >= 0
J = tf.nn.sigmoid_cross_entropy_with_logits(logits=y_hat, labels=y)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
optimization = optimizer.minimize(J)

[Figure: the logistic-regression flow graph built by this code. x and W feed MatMul; its
result feeds Sigmoid and the cross-entropy loss J; GradientDescent adds SigmoidGrad, Log,
Mul, Rcp, and ApplyGrad nodes for the backward pass.]

Tensorflow: Execute the Flow Graph
• Dynamically interpret the flow graph
  - Processors have state: data and caches
  - Assign flow-graph nodes to processors
    Time: satisfying dependences
    Space: based on state or user preference
  - Distributed machines: generate receive & send code
• A user-level library differentiates expressions to implement backpropagation automatically
• Management: fault tolerance, coordination and backup workers

Eigen: Open C++ Math Libraries
• Open-source maths library
  - Dense & sparse matrices, different data types
  - Matrix operations, decompositions, ...
• Generates code for Nvidia (CUDA), x86, TPUs, ...
  - Explicit vectorization
  - Optimizations for fixed-size matrices, e.g. loop unrolling
  - Implemented using expression templates
  - (a minimal Eigen usage sketch appears after the "cuX" slide below)

cuX: Proprietary Libraries in CUDA for Nvidia Processors
• Carefully hand-tuned libraries for Nvidia architectures
  - cuBLAS: basic linear algebra subroutines
  - cuFFT: fast Fourier transform
  - cuSPARSE: BLAS for sparse matrices
  - cuDNN: deep neural networks (convolution, pooling, normalization, activation);
    enables optimizations across the low-level libraries
• Tuned for different matrix sizes, aspect ratios, architectures
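Since the stack above leans on Eigen's fixed-size matrix support, here is a minimal usage
sketch. It assumes an Eigen 3 installation (add its include directory to the compiler's
search path); the matrix sizes and values are arbitrary. For a compile-time-sized type
such as Matrix4f, the expression A * B + C is an expression template, so Eigen can unroll
and vectorize the whole computation when it is assigned to D.

#include <Eigen/Dense>   // assumes Eigen 3 headers are installed
#include <iostream>

int main() {
  // Fixed-size 4x4 single-precision matrices: the dimensions are part of the
  // type, so there is no heap allocation and loops can be fully unrolled.
  Eigen::Matrix4f A = Eigen::Matrix4f::Random();
  Eigen::Matrix4f B = Eigen::Matrix4f::Random();
  Eigen::Matrix4f C = Eigen::Matrix4f::Identity();

  // "A * B + C" builds a lightweight expression object; the arithmetic is
  // generated only when the expression is assigned to D.
  Eigen::Matrix4f D = A * B + C;

  std::cout << D << "\n";
  return 0;
}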
CUDA: Programming Language from Nvidia
• Programming model
  - User controls the grids of threads on the GPU
  - User manages memory
• CUDA is a run-time system that controls
  - Kernel dispatch to GPUs
  - Synchronization
  - Memory management

CUDA

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);

__global__ void add(int n, float *x, float *y) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/

Levels of Abstraction (revisited)
[Same stack diagram as before: neural-net models at the top; Keras and TF slim; Tensorflow,
Caffe, MxNet, PyTorch; cuDNN; cuBLAS, cuFFT, cuSPARSE, Eigen, OpenBLAS; CUDA and OpenCL;
x86, Nvidia GPU, Google TPU, AMD, FPGA at the bottom.]

3. Problems with Handcoded Libraries
• Cost: expensive to write the tuned routines
• Proprietary
  - Nvidia's proprietary libraries are NOT available for other architectures
  - A competitive advantage that reduces innovation
• Machine dependence: cannot generate code for different targets automatically
• Performance: cannot optimize across library calls
• Compilers
  - XLA: Tensorflow optimized code for different targets
  - TensorRT: trained-model inference engine (e.g. on CPU)
• Optimization opportunities
  - Cross-kernel optimization
  - Optimization of each kernel

Cross-Kernel Optimization: Fusion

s[j] = softmax[j](ReLU(bias[j] + matmul_result[j]))

https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html

Optimizing Kernels: Matrix Multiply

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    for (k = 0; k < N; k++) {
      m3(i, j) += m1(i, k) * m2(k, j);
}}}

Permute the loops to make data access contiguous for vectorization:

for (k = 0; k < N; k++) {
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      m3(i, j) += m1(i, k) * m2(k, j);
}}}

Tiling: to Increase Reuse

for (k = 0; k < N; k++) {
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      m3(i, j) += m1(i, k) * m2(k, j);
}}}

Tile the outermost loop (this code is cut off in the transcript; the loop structure matches
the parallel 1-d tiled version shown later, minus the OpenMP pragmas):

for (k1 = 0; k1 < N; k1 += B) {
  for (i = 0; i < N; i++) {
    for (k2 = k1; k2 < k1 + B; k2++) {
      for (j = 0; j < N; j++) {
        m3(i, j) += m1(i, k2) * m2(k2, j);
}}}}

Experiment
• Square float32 matrices of various sizes, initialized with random (0, 1) normal values;
  average of 10 iterations
• Intel i7-4770HQ CPU @ 2.20 GHz (Haswell), with SSE4.2 and AVX2, no turbo
  - 32K L1 cache, 256K L2, 6M L3, 132M L4 cache (LLC, shared with the GPU)
• Compiled with g++ 7.2.1 20170915, as provided in Fedora 27
  - Common options: --std=c++14 -Wall -g
  - (The production version of clang does not support loop optimizations)
  - (a sketch of a timing harness for this experiment follows this slide)
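A harness along the following lines could drive the experiment just described. This is a
sketch under stated assumptions, not the course's benchmark code: the slides access matrix
elements as m(i, j) without showing the representation, so a row-major std::vector is used
here, and the sizes in main are placeholders. Building it with the flag sets listed on the
next slide would reproduce the naive and permuted variants.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

using Matrix = std::vector<float>;                  // row-major N x N

static void naive(const Matrix& m1, const Matrix& m2, Matrix& m3, int N) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        m3[i * N + j] += m1[i * N + k] * m2[k * N + j];
}

static void permuted(const Matrix& m1, const Matrix& m2, Matrix& m3, int N) {
  for (int k = 0; k < N; k++)                       // k outermost: j accesses are contiguous
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        m3[i * N + j] += m1[i * N + k] * m2[k * N + j];
}

template <typename F>
static double time_avg(F f, int N, int iters = 10) {
  std::mt19937 gen(42);
  std::normal_distribution<float> dist(0.0f, 1.0f); // random (0, 1) normal initialization
  Matrix m1(N * N), m2(N * N), m3(N * N, 0.0f);
  for (auto& v : m1) v = dist(gen);
  for (auto& v : m2) v = dist(gen);

  double total = 0.0;
  for (int it = 0; it < iters; it++) {              // average over the iterations
    std::fill(m3.begin(), m3.end(), 0.0f);
    auto t0 = std::chrono::steady_clock::now();
    f(m1, m2, m3, N);
    auto t1 = std::chrono::steady_clock::now();
    total += std::chrono::duration<double>(t1 - t0).count();
  }
  return total / iters;
}

int main() {
  const int sizes[] = {500, 1000, 1500};            // placeholder sizes
  for (int N : sizes)
    std::printf("N=%d  naive: %.3fs  permuted: %.3fs\n",
                N, time_avg(naive, N), time_avg(permuted, N));
  return 0;
}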
Experiment Variants
• Naive with no optimizations (-O0 -march=native)
• Naive with std optimizations (-O3 -march=native)
• Naive with full optimizations (-O3 -march=native -ftree-vectorize -floop-interchange
  -floop-nest-optimize -floop-block -funroll-loops)
• Permuted
• 1-d tiled with std optimizations
• 1-d tiled with full optimizations

Performance
[Figure: single-thread performance of the variants across matrix sizes; the permuted
version is labeled.]

Optimizing the Matrix Multiply Kernel

#pragma omp parallel for
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    for (k = 0; k < N; k++) {
      m3(i, j) += m1(i, k) * m2(k, j);
}}}

Permute the loops to make data access contiguous:

#pragma omp parallel
for (k = 0; k < N; k++) {
  #pragma omp for
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      m3(i, j) += m1(i, k) * m2(k, j);
}}}

Parallel 1-d Tiled Algorithm

#pragma omp parallel
for (k1 = 0; k1 < N; k1 += B) {
  #pragma omp for
  for (i = 0; i < N; i++) {
    for (k2 = k1; k2 < k1 + B; k2++) {
      for (j = 0; j < N; j++) {
        m3(i, j) += m1(i, k2) * m2(k2, j);
}}}}

[Figure: each tile covers B columns of m1 and B rows of m2; m3 is fetched N/B times.]

Parallel Scaling (matrix size 1500)
[Figure: performance vs. thread count for the variants; the permuted version is labeled.]

Data Locality Optimization
• Data locality is important for sequential and parallel machines
  - Matrix multiplication has N^2 parallelism
  - Parallelization does not necessarily mean performance
• 2 important classes of loop transformations
  - Affine: permutation, skewing, reversal, fusion, fission
  - Tiling
• Compilation
  - Machine-independent analysis: is it legal to parallelize a loop? to transform a loop?
  - Machine-dependent optimization: which transformation to apply, for a given machine?

Loop Transformations
[Figure: programs (static statements, dynamic executions) are abstracted into mathematical
models (graphs, matrices, integer programs, relations); solutions found in the model are
mapped back to generated code.]

Use of Mathematical Abstraction
• Design of the mathematical model & algorithm
  - Generality, power, simplicity and efficiency
[Figure: the same abstraction diagram as above.]

Course Syllabus
1. Basic compiler optimizations
   Goal:   Eliminates redundancy in high-level language programs; allocates registers;
           schedules instructions (for instruction-level parallelism)
   Scope:  Simple scalar variables, intraprocedural, flow-sensitive
   Theory: Data-flow analysis (graphs & solving fix-point equations); a tiny fix-point
           iteration sketch follows this slide
2. Pointer alias analysis
   Goal:   Used in program understanding and in concrete type inference in OO programs
           (resolve the target of a method invocation, inline, and optimize)
   Scope:  Pointers, interprocedural, flow-insensitive
   Theory: Relations, binary decision diagrams (BDDs)
3. Parallelization and memory hierarchy optimization
   Goal:   Parallelizes sequential programs (for multiprocessors); optimizes for the
           memory hierarchy
   Scope:  Arrays, loops
   Theory: Linear algebra
4. Garbage collection (run-time system)
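As a preview of the "data-flow analysis" theory listed under topic 1, here is a tiny C++
sketch that solves a liveness analysis by iterating transfer functions over a hand-built
control-flow graph until the solution stops changing (a fix point). The three-block CFG,
the bit assignment of variables, and the use/def sets are made up purely for illustration.

#include <bitset>
#include <cstdio>
#include <vector>

// Liveness: a backward, may data-flow problem over sets of variables,
// represented here as bit vectors of up to 8 variables.
struct Block {
  std::bitset<8> use, def;          // upward-exposed uses / definitions
  std::vector<int> succ;            // successor block ids
};

int main() {
  // Hypothetical CFG:  b0 -> b1,  b1 -> b1 (loop),  b1 -> b2
  std::vector<Block> cfg(3);
  cfg[0] = {/*use*/ 0b001, /*def*/ 0b010, {1}};     // uses v0, defines v1
  cfg[1] = {/*use*/ 0b010, /*def*/ 0b100, {1, 2}};  // uses v1, defines v2
  cfg[2] = {/*use*/ 0b100, /*def*/ 0b000, {}};      // uses v2

  std::vector<std::bitset<8>> in(cfg.size()), out(cfg.size());
  bool changed = true;
  while (changed) {                 // iterate to a fixed point
    changed = false;
    for (int b = (int)cfg.size() - 1; b >= 0; --b) {
      std::bitset<8> o;
      for (int s : cfg[b].succ) o |= in[s];               // meet: union over successors
      std::bitset<8> i = cfg[b].use | (o & ~cfg[b].def);  // transfer: use + (out - def)
      if (i != in[b] || o != out[b]) { in[b] = i; out[b] = o; changed = true; }
    }
  }
  for (std::size_t b = 0; b < cfg.size(); ++b)
    std::printf("IN[b%zu] = %s\n", b, in[b].to_string().c_str());
  return 0;
}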
Tentative Course Schedule
1  Course introduction
2  Basic optimization; data-flow analysis: introduction
3  Data-flow analysis: theoretic foundation
4  Optimization: constant propagation
5  (joeq framewo...