a scheduling language for gpus. vinod grover | …vinod grover | dec 5, 2019 fireiron - a scheduling...
TRANSCRIPT
![Page 1: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/1.jpg)
Vinod Grover | Dec 5, 2019
Fireiron - A Scheduling Language for GPUs.
![Page 2: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/2.jpg)
2
Acknowledgments
Optimizedode
Joint work with :● Bastian Hagedorn ● Sam Elliott● Henrik Barthels● Ras Bodik
And contributions from many others at NVIDIA.
![Page 3: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/3.jpg)
3
OVERVIEWHigh Performance DSL for linear algebra on GPUs
Optimizedode
A hierarchical scheduling language based on Halide and TVM● designed to express GPU optimizations for maximum performance
Can directly represent elements of● storage hierarchy
○ registers, fragments, shared memory● compute hierarchy
○ threads, warps, blocks, kernels
Can reason about tensorcore and machine level operations.
Suitable for auto-scheduling and auto-tuning
![Page 4: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/4.jpg)
4
DECOMPOSING MATMULExploiting Hierarchical Structure of GPU Kernels
Matrix Multiplication Kernel
Describing the problem this box implements
Hierarchical Structure:Original Problem is decomposed into “smaller” instances of the same type of problem
![Page 5: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/5.jpg)
5
DECOMPOSING MATMULExploiting Hierarchical Structure of GPU Kernels
Matrix Multiplication Kernel
Describing the problem this box implements
Hierarchical Structure:Original Problem is decomposed into “smaller” instances of the same type of problem
![Page 6: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/6.jpg)
6
INTRODUCTIONGEMM Spec(ification)
}Specs define the current problem to optimize
Fireiron MatMul Spec
MatMul(Kernel,A: Matrix(1536,2048,GL,FP32,RowMajor),B: Matrix(2048,1024,GL,FP32,ColMajor),C: Matrix(1536,1024,GL,FP32,ColMajor))
and contain enough information to fully describe it
Idea: A programmer should be able to provide a valid implementation for a given spec!
![Page 7: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/7.jpg)
7
INTRODUCTIONWorking with Specs
}Given a Spec, you can:
a) Provide a handwritten microkernel, orb) Arrive at an executable Spec, orc) Decompose it into a “smaller” spec
Fireiron MatMul Spec
MatMul(Kernel,A: Matrix(1536,2048,GL,FP32,RowMajor),B: Matrix(2048,1024,GL,FP32,ColMajor),C: Matrix(1536,1024,GL,FP32,ColMajor))
Goal: Generate high-performance MatMul Kernel-> We start with Kernel-level Spec
![Page 8: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/8.jpg)
8
INTRODUCTIONWorking with Specs
}Fireiron MatMul Spec
MatMul(Kernel,A: Matrix(1536,2048,GL,FP32,RowMajor),B: Matrix(2048,1024,GL,FP32,ColMajor),C: Matrix(1536,1024,GL,FP32,ColMajor))
Goal: Generate high-performance MatMul Kernel-> We start with Kernel-level Spec
Given a Spec, you can:
a) Provide a handwritten microkernel, orb) Arrive at an executable Spec, orc) Decompose it into a “smaller” spec
![Page 9: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/9.jpg)
9
DECOMPOSITIONSHalide-like transformations constructing the IR
Every Decomposition:1. is a function: Spec -> Spec (returning a “smaller” subspec)2. provides a partial implementation to our code generator
Two Main Decompositions: ● .tile(m,n) - enables descending the compute-hierarchy● .load(matrix, loc, impl) - enables descending the memory hierarchy
We allow to define operation-specific Decompositions:● .split(k)● .epilog(...)● ...
![Page 10: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/10.jpg)
10
.tile(128,128)
MatMul(Kernel,A:Matrix(1536,2048,GL,FP32,RowMajor),B:Matrix(2048,1024,GL,FP32,ColMajor),C:Matrix(1536,1024,GL,FP32,ColMajor))
Current Spec1024
2048
1536
MatMul(Kernel,A:Matrix(128 ,2048,GL,FP32,RowMajor),B:Matrix(2048, 128,GL,FP32,ColMajor),C:Matrix(128 , 128,GL,FP32,ColMajor))
New Spec
128
128
DESCENDING THE COMPUTE HIERARCHY.tile(m,n)
![Page 11: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/11.jpg)
11
.tile(128,128)
1024
2048
1536
.to(Block)
128
128
“Refinement”:Adding implementation
details
DESCENDING THE COMPUTE HIERARCHY.tile(m,n)
![Page 12: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/12.jpg)
12
OUTER PRODUCT BLOCKED GEMM
MatMul(Block,A:Matrix(128,2048,GL,FP32,RowMajor),B:Matrix(2048,128,GL,FP32,ColMajor),C:Matrix(128 ,128,GL,FP32,ColMajor))
Current Spec
New SpecMatMul(Block,
A:Matrix(128, 8,GL,FP32,RowMajor),B:Matrix(8 ,128,GL,FP32,ColMajor),C:Matrix(128,128,GL,FP32,ColMajor))
2048
128
128
2048
8
8
128
128
128
88
128
128
.split(8)
.split(kBlock)
![Page 13: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/13.jpg)
13
DESCENDING THE MEMORY HIERARCHY.load(Matrix, Location, Strategy)
.load(A,SH,strategy)GL
GL
GL
SH
GL
GL
new Spec describingdata movement
this spec is decomposed with the given strategy
![Page 14: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/14.jpg)
14
WMMA IN FIREIRONadding support for CUDA’s WMMA API
GL
SH
RF
GLSHFR<M,N,K>RFM
emor
y H
iera
rchy
Updating Fireiron’s Memory Hierarchy
“Before the MMA operation is performed the operand matrices must be represented in the registers of the GPU. As an MMA is a warp-wide operation these registers are distributed amongst the threads of a warp with each thread holding a fragment of the overall matrix.”
![Page 15: A Scheduling Language for GPUs. Vinod Grover | …Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs. 2 Acknowledgments Optimized ode Joint work with : Bastian Hagedorn](https://reader034.vdocuments.mx/reader034/viewer/2022042123/5e9e83c45fdf0951106c0584/html5/thumbnails/15.jpg)
15
fp16 performance on Volta