TensorCore and Tensorization
Siyuan Feng, Dec 5, 2019
Contents
1. TensorCore Introduction
2. TensorCore Support in TVM
3. Future Work
What are TensorCores
TensorCores are specialized units on NVIDIA GPUs (Volta and later) that accelerate mixed-precision matrix multiply-accumulate.
Warp-Level Operation
TensorCore operations are issued per warp (warp = 32 threads): all 32 threads of a warp cooperatively execute each wmma call, e.g. `wmma::fill_fragment(Cmat, 0.0f);`.
Programming TensorCore
A 16x16x16 MatMul written against the CUDA WMMA API (the slide elided the fragment template parameters; the full form is shown here):

```cuda
#include <mma.h>
using namespace nvcuda;

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
    // Create fragments
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    // Load fragments
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);

    // Perform MatMul
    wmma::fill_fragment(Cmat, 0.0f);
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Store results
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
```
TensorCore Summary
• TensorCores are hardware accelerators for matrix multiply-accumulate
• Warp-level operation: each wmma call is executed by a full warp
• New memory scope: fragment
Steps for TensorCore Support in TVM
1. Memory Scope
2. Create Schedule
3. Tensorization
Current Memory Scope
(diagram: numbered steps of data movement through TVM's existing GPU memory scopes)
Special Memory Scope
(diagram: numbered steps of data movement once the new TensorCore fragment scopes are added)
Traditional GPU Memory Scope Order
Global → Shared → Local → Global
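For reference, this order maps directly onto TVM's cache_read staging. A minimal sketch, assuming the TVM 0.6-era Python API (the matmul shape and the names A, B, C are illustrative):

```python
import tvm

# fp16 inputs, fp32 accumulation, as TensorCores expect.
n = 1024
A = tvm.placeholder((n, n), dtype='float16', name='A')
B = tvm.placeholder((n, n), dtype='float16', name='B')
k = tvm.reduce_axis((0, n), name='k')
C = tvm.compute((n, n),
                lambda i, j: tvm.sum(A[i, k].astype('float32') *
                                     B[k, j].astype('float32'), axis=k),
                name='C')

s = tvm.create_schedule(C.op)
# Traditional staging: global -> shared -> local -> global.
AS = s.cache_read(A, 'shared', [C])  # global -> shared
AL = s.cache_read(AS, 'local', [C])  # shared -> local (registers)
BS = s.cache_read(B, 'shared', [C])
BL = s.cache_read(BS, 'local', [C])
```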
Enhanced TensorCore Memory Scope Order
Global → Shared → Fragment → Global (the new fragment scope takes over the role local memory plays in the traditional chain)
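With the new scope, operand and accumulator tiles are staged into fragments instead of plain local memory. Continuing the sketch above (the scope strings wmma.matrix_a, wmma.matrix_b, and wmma.accumulator are the ones this work introduces):

```python
s = tvm.create_schedule(C.op)  # fresh schedule for the TensorCore path
# Enhanced staging: global -> shared -> fragment -> global.
AS = s.cache_read(A, 'shared', [C])
AF = s.cache_read(AS, 'wmma.matrix_a', [C])  # operand fragment for A
BS = s.cache_read(B, 'shared', [C])
BF = s.cache_read(BS, 'wmma.matrix_b', [C])
CF = s.cache_write(C, 'wmma.accumulator')    # accumulate in a fragment
```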
Warp Level Schedule
blockDim.x = warp_size = 32
(diagram: the thread block is a grid of warps laid out along blockDim.y and blockDim.z; blockDim.x spans the 32 threads of a single warp)
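In schedule terms, threadIdx.x is pinned to the 32 lanes of one warp while warps are arranged along threadIdx.y and threadIdx.z. A hypothetical binding sketch continuing the schedule above (the tile sizes are illustrative):

```python
# Tile the output so each block computes a 64x64 region (hypothetical).
i, j = s[C].op.axis
bi, ti = s[C].split(i, factor=64)
bj, tj = s[C].split(j, factor=64)
s[C].bind(bi, tvm.thread_axis('blockIdx.x'))
s[C].bind(bj, tvm.thread_axis('blockIdx.y'))

# Each warp owns one 16x16 TensorCore tile inside the block.
wi, ii = s[C].split(ti, factor=16)
wj, jj = s[C].split(tj, factor=16)
s[C].bind(wi, tvm.thread_axis('threadIdx.y'))  # warp rows -> blockDim.y
s[C].bind(wj, tvm.thread_axis('threadIdx.z'))  # warp cols -> blockDim.z
# threadIdx.x (blockDim.x = warp_size = 32) is left to the lanes inside
# each warp, so every wmma intrinsic is issued by a complete warp.
```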
Tensorization
Tensorization pattern-matches the 16x16x16 loop nest and replaces it with a single warp-level intrinsic call:

```
for (i, 0, 16) {
  for (j, 0, 16) {
    for (k, 0, 16) {
      C[i*16 + j] = C[i*16 + j] + (float32(A[i*16 + k]) * float32(B[k*16 + j]))
    }
  }
}
```

becomes

```
tvm_mma_sync(C, 0, A, 0, B, 0, C, 0);
```
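The tvm_mma_sync above comes from a tensor intrinsic that tells TVM which computation to match and what to emit in its place. A minimal declaration sketch, again assuming the TVM 0.6-era API (the real implementation also handles strides, warp indices, and fragment offsets):

```python
def intrin_wmma_gemm():
    # The 16x16x16 computation the compiler should pattern-match.
    A = tvm.placeholder((16, 16), dtype='float16', name='A')
    B = tvm.placeholder((16, 16), dtype='float16', name='B')
    k = tvm.reduce_axis((0, 16), name='k')
    C = tvm.compute((16, 16),
                    lambda i, j: tvm.sum(A[i, k].astype('float32') *
                                         B[k, j].astype('float32'), axis=k),
                    name='C')
    BA = tvm.decl_buffer(A.shape, A.dtype, scope='wmma.matrix_a')
    BB = tvm.decl_buffer(B.shape, B.dtype, scope='wmma.matrix_b')
    BC = tvm.decl_buffer(C.shape, C.dtype, scope='wmma.accumulator')

    def intrin_func(ins, outs):
        ba, bb = ins
        bc, = outs
        ib = tvm.ir_builder.create()
        # Emit the single warp-level intrinsic in place of the loop nest.
        ib.emit(tvm.call_intrin('handle', 'tvm_mma_sync',
                                bc.data, 0, ba.data, 0, bb.data, 0,
                                bc.data, 0))
        return ib.get()

    return tvm.decl_tensor_intrin(C.op, intrin_func,
                                  binds={A: BA, B: BB, C: BC})

# Applied to the 16x16 warp-tile loops of the accumulator stage, e.g.:
#   s[CF].tensorize(tile_i, intrin_wmma_gemm())
```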
Performance Improvements over non-TensorCore

Speedup of TVM w/ TensorCores over TVM w/o TensorCores (baseline = 1):

| Workload | TVM w/o TensorCores | TVM w/ TensorCores |
|---|---|---|
| Large MatMul | 1.00x | 4.87x |
| BatchConv | 1.00x | 5.17x |
| Small MatMul | 1.00x | 5.02x |
| BatchMatMul | 1.00x | 4.97x |
Performance Comparison vs cuDNN

Throughput of TVM w/ TensorCores relative to cuDNN w/ TensorCores (cuDNN = 1):

| Workload | cuDNN w/ TensorCores | TVM w/ TensorCores |
|---|---|---|
| Large MatMul | 1.00x | 0.76x |
| BatchConv | 1.00x | 0.83x |
| Small MatMul | 1.00x | 1.16x |
| BatchMatMul | 1.00x | 1.44x |

Comparable on traditional workloads; ~1.4x on emerging workloads (BERT).
TVM TensorCore Support Summary
• Massive speedup over non-TensorCore TVM
• Competitive performance with cuDNN
• Based on tensor intrinsics (tensorize)
Future Work
1. Use TensorCores in TOPI and Relay
2. Apply TensorCores to popular ML models, such as BERT
Thank you
Siyuan Feng, Dec 5, 2019