TensorCore and Tensorization
Siyuan Feng, Dec 5, 2019
Contents
1. TensorCore Introduction
2. TensorCore Support in TVM
3. Future Work
What are TensorCores
Warp-Level Operation
wmma::fill_fragment(Cmat, 0.0f);
A warp = 32 threads; each TensorCore instruction is issued cooperatively by all 32 threads of a warp.
Programming TensorCores: a 16x16x16 MatMul

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
  // Create fragments
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;
  // Load fragments
  wmma::load_matrix_sync(Amat, a, 16);
  wmma::load_matrix_sync(Bmat, b, 16);
  // Perform MatMul
  wmma::fill_fragment(Cmat, 0.0f);
  wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
  // Store results
  wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
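To make the warp-level semantics concrete, here is a pure-Python sketch of what the 16x16x16 wmma step computes: D = A·B + C, with half-precision inputs accumulated in float32. The function name and list-of-lists layout are illustrative, not the CUDA API.

```python
# Reference semantics of one warp-level 16x16x16 TensorCore step.
# Inputs a, b are 16x16 tiles (conceptually fp16); c is the 16x16 fp32
# accumulator tile. Returns the 16x16 result tile d = a @ b + c.

M = N = K = 16

def wmma_16x16x16(a, b, c):
    d = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = c[i][j]            # start from the accumulator value
            for k in range(K):
                # float() mirrors the fp16 -> fp32 widening of the inputs
                acc += float(a[i][k]) * float(b[k][j])
            d[i][j] = acc
    return d
```

On real hardware the whole tile product is produced by one warp-synchronous mma_sync, not a scalar loop; this sketch only pins down the result.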
TensorCore Summary
• TensorCores are hardware accelerators for matrix multiplication
• TensorCore instructions are warp-level operations
• They introduce a new memory scope: the fragment
Steps for TensorCore Support in TVM
1. Memory Scope
2. Create Schedule
3. Tensorization
Current Memory Scope
Special Memory Scope
Traditional GPU Memory Scope Order
Global → Shared → Local → Global
Enhanced TensorCore Memory Scope Order
Global → Shared → Fragment → Global
(the new fragment scope takes the place of local memory for TensorCore operands)
Warp Level Schedule
blockDim.x = warp_size = 32
Warp Level Schedule
blockDim.x = warp_size = 32
The thread block forms a 2-D grid of warps laid out along blockDim.y and blockDim.z.
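With blockDim.x bound to warp_size = 32 as above, threadIdx.x is the lane id and each (threadIdx.y, threadIdx.z) pair addresses one whole warp. A small Python sketch of that decomposition (warp_of is an illustrative helper, not a TVM or CUDA API):

```python
# Warp/lane decomposition under the warp-level schedule, assuming
# blockDim.x == WARP_SIZE so the x dimension is exactly one warp's lanes.

WARP_SIZE = 32

def warp_of(tx, ty, tz, block_dim_y):
    """Return (warp_index_in_block, lane) for thread (tx, ty, tz)."""
    assert 0 <= tx < WARP_SIZE
    warp = tz * block_dim_y + ty   # one warp per (y, z) coordinate
    lane = tx                      # x indexes the 32 lanes inside the warp
    return warp, lane
```

Because every TensorCore instruction needs all 32 lanes of a warp to participate, the schedule keeps a warp intact by never splitting the x dimension.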
Tensorization
for (i, 0, 16) {
  for (j, 0, 16) {
    for (k, 0, 16) {
      C[i*16 + j] = (C[i*16 + j] + (float32(A[i*16 + k]) * float32(B[k*16 + j])))
    }
  }
}

⇓ tensorize ⇓

tvm_mma_sync(C, 0, A, 0, B, 0, C, 0);
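A toy sketch of the substitution that tensorize performs: the scalar loop nest and the intrinsic compute the same 16x16x16 tile product, so the compiler may replace one with the other. Both functions and the flat-array layout here are illustrative stand-ins, not TVM's real IR or the real tvm_mma_sync.

```python
# Tensorization replaces a scalar multiply-accumulate loop nest with one
# intrinsic call that computes the whole tile at once.

def run_loop_nest(C, A, B):
    """The original schedule: an explicit 16x16x16 MAC loop nest."""
    for i in range(16):
        for j in range(16):
            for k in range(16):
                # float() mirrors the float32() casts in the TVM IR above
                C[i * 16 + j] += float(A[i * 16 + k]) * float(B[k * 16 + j])

def tvm_mma_sync_stub(C, A, B):
    """Stand-in intrinsic: the same tile product in a single call, the way
    tvm_mma_sync hands the whole 16x16x16 tile to the TensorCore."""
    for i in range(16):
        for j in range(16):
            C[i * 16 + j] += sum(float(A[i * 16 + k]) * float(B[k * 16 + j])
                                 for k in range(16))
```

Tensorization is legal precisely because the intrinsic's declared computation matches the loop nest it replaces.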
Performance Improvements over non-TensorCore
Speedup of TVM w/ TensorCores over TVM w/o TensorCores (baseline = 1):
Large MatMul: 4.87x | BatchConv: 5.17x | Small MatMul: 5.02x | BatchMatMul: 4.97x
Performance Comparison vs CuDNN
TVM w/ TensorCores relative to CuDNN w/ TensorCores (baseline = 1):
Large MatMul: 0.76 | BatchConv: 0.83 | Small MatMul: 1.16 | BatchMatMul: 1.44
Comparable on traditional workloads
1.4x on emerging workloads (BERT)
TVM TensorCore Support Summary
• Massive speedup over non-TensorCore schedules
• Competitive performance with CuDNN
• Based on tensor intrinsics
Future Work
1. Use TensorCore in TOPI and Relay
2. Apply TensorCore to popular ML models, such as BERT
Thank you