AI Systems (System for AI) · Lecture Introduction · 2021. 1. 22.
TRANSCRIPT
-
Computation scheduling optimization
Operator expression IR
Computation graph IR (DAG)
[Figure: graph with inputs x, w, b; x * w feeds a + node with b, producing y]
-
2. Memory optimization
3. Kernel optimization
4. Scheduling optimization
-
Add    Log          While
Sub    MatMul       Merge
Mul    Conv         BroadCast
Div    BatchNorm    Reduce
Relu   Loss         Map
Tanh   Transpose    Reshape
Exp    Concatenate  Select
Floor  Sigmoid      …
-
[Figure: two identical computation subgraphs, x * w + b → y]
-
p1 / (p0 + (p3 - p4))
-
Batch same-type operators to leverage GPU massive parallelism
[Figure: data-flow graph of a GRU cell. xt and ht-1 each feed three matmuls (M) with weights Rf, Ro, Rz and Wf, Wo, Wz; the per-gate sums (+) pass through σ, σ, and tanh, whose outputs are combined with × and + to produce ht]
Data-flow graph of a GRU cell
-
By concatenating the input tensors into one large tensor, multiple instances of the same operator can be merged into a single larger operator, making better use of the hardware's parallelism.
[Figure: the same GRU cell before and after batching. Before: six separate matmuls of xt and ht-1 with Rf, Ro, Rz and Wf, Wo, Wz. After: the inputs are concatenated into [xt; ht-1] and the weights into a single matrix W, so one large matmul feeds the σ, σ, tanh gates that produce ht]
Data-flow graph of a GRU cell
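The batching the figure describes can be checked with a small NumPy sketch (shapes here are illustrative, not from the slide): three per-gate matmuls against Wf, Wo, Wz produce the same values as one matmul against their concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_dim, hid = 4, 8, 16
xt = rng.standard_normal((batch, in_dim))
# Per-gate input weights, as in the unbatched graph.
Wf, Wo, Wz = (rng.standard_normal((in_dim, hid)) for _ in range(3))

# Unbatched: three separate matmuls, one per gate.
sep = [xt @ Wf, xt @ Wo, xt @ Wz]

# Batched: one matmul against the concatenated weight matrix,
# then split the result back into the three gate inputs.
W = np.concatenate([Wf, Wo, Wz], axis=1)   # [in_dim, 3*hid]
fused = xt @ W
f, o, z = np.split(fused, 3, axis=1)

assert np.allclose(f, sep[0]) and np.allclose(o, sep[1]) and np.allclose(z, sep[2])
```

One large matmul keeps the GPU's SMs busier than three small ones, which is the point the slide makes about hardware parallelism.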
-
1. Computation graph optimization
2. Memory optimization
3. Kernel optimization
4. Scheduling optimization
-
AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming
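To make the idea concrete, here is a toy version of the placement decision AutoTM formulates as an ILP: choose, per tensor, fast or slow memory so the total benefit is maximal without exceeding fast-memory capacity. This brute-force sketch is not the paper's formulation; all sizes and costs are made up for illustration.

```python
from itertools import product

# Hypothetical tensors: name -> (size, cost saving if kept in fast memory).
tensors = {"act0": (4, 10), "act1": (2, 6), "weight": (3, 9)}
FAST_CAPACITY = 6

best = None
names = list(tensors)
for choice in product([0, 1], repeat=len(names)):   # 1 = place in fast memory
    size = sum(tensors[n][0] for n, c in zip(names, choice) if c)
    if size > FAST_CAPACITY:
        continue                                    # violates capacity
    saving = sum(tensors[n][1] for n, c in zip(names, choice) if c)
    if best is None or saving > best[0]:
        best = (saving, dict(zip(names, choice)))
```

An ILP solver explores the same 0/1 decision space, but scales to whole training graphs where enumeration would not.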
-
1. Computation graph optimization
2. Memory optimization
3. Kernel optimization
4. Scheduling optimization
-
Computation graph IR (DAG): x * w + b → y
Problem: every backend platform must hand-implement at least one kernel per operator, taking into account the programming model, data layout, threading model, cache sizes, and so on.
-
Computation graph IR (DAG): x * w + b → y
Tensor expression: C[i] = A[i] + B[i]
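A tensor expression like `C[i] = A[i] + B[i]` separates *what* to compute from *how*. As a minimal sketch (Python, illustrative only), the naive loop nest such an expression lowers to looks like this; a real stack such as TVM would then apply scheduling transforms (tiling, vectorization, parallelization) to the same loop.

```python
import numpy as np

def lower_add(A, B):
    # The explicit loop a compiler could emit for C[i] = A[i] + B[i].
    C = np.empty_like(A)
    for i in range(A.shape[0]):
        C[i] = A[i] + B[i]
    return C
```

Because the expression carries no hardware detail, the same definition can be lowered differently for each backend instead of hand-writing one kernel per platform.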
-
1. 计算图优化
2. 内存优化
3. 内核优化
4. 调度优化
-
Computation graph optimization: XLA, TVM, nGraph
Operator generation: AutoTVM, TC, Halide, Tiramisu, etc.
MatMul: C[m,n] = reduce_sum(A[m,k] * B[k,n], axis=k)
C[i] = A[i] + B[i]
D[i] = max(C[i], 0)
D[i] = max(A[i] + B[i], 0)
__global__ void add_relu_kernel(args…) {
  // … fused add + relu code
}
launch_kernel(add_relu_kernel);
[Figure: the separate add and relu kernels are fused into a single add_relu kernel that runs across the GPU's SMs]
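The rewrite above, D[i] = max(A[i] + B[i], 0), can be sketched in NumPy to show what fusion buys: the unfused version materializes a temporary tensor C, while the fused version makes a single pass with no intermediate.

```python
import numpy as np

def add_then_relu(A, B):
    C = A + B                  # materializes a temporary tensor C
    return np.maximum(C, 0)

def add_relu_fused(A, B):
    # Single pass: D[i] = max(A[i] + B[i], 0), no temporary tensor;
    # this mirrors the add_relu rewrite on the slide.
    D = np.empty_like(A)
    for i in range(A.size):
        s = A.flat[i] + B.flat[i]
        D.flat[i] = s if s > 0 else 0.0
    return D
```

On a GPU the fused kernel additionally saves a kernel launch and a round trip through global memory for C.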
How is each operator's computation finally mapped onto the hardware?
-
Task-parallel Graph
BlockFusionOptimizer
NNFusion: global computation scheduling optimization
-
APPEND
GROUP_SYNC
Task-level scheduling is supported by introducing new scheduling primitives.
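A rough sketch of how the two primitives named above could behave; the class and function names here are assumptions for illustration, modeled on NNFusion-style block scheduling, not the actual API.

```python
# APPEND enqueues an operator's task onto a virtual execution unit;
# GROUP_SYNC inserts a barrier on every unit so that later tasks wait
# for the whole group to finish.
class VirtualEU:
    def __init__(self, name):
        self.name, self.queue = name, []

    def append(self, task):          # APPEND primitive (hypothetical)
        self.queue.append(task)

def group_sync(eus):                 # GROUP_SYNC primitive (hypothetical)
    for eu in eus:
        eu.append("BARRIER")

eus = [VirtualEU(f"BE{i}") for i in range(2)]
eus[0].append("Conv")
eus[1].append("MatMul")
group_sync(eus)                      # both must finish before what follows
eus[0].append("Relu")
```

Making the barrier an explicit item in each queue is what lets the compiler reason about task-level dependencies instead of relying on implicit kernel-launch ordering.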
-
Challenge: scheduling operators' tasks onto the GPU
- incorrect dependencies
- deadlock
[Figure: co-scheduling two operators on a GPU with six SMs. Conv and MatMul thread blocks (block0 through block10) occupy the SMs around a barrier; blocks waiting for the barrier are unable to execute, so the remaining Conv cannot be scheduled]
-
Mapping of dependencies
[Figure: the MatMul, barrier, and Conv are compiled into virtual blocks vblock0 through vblock10 and mapped onto block executors BE 0 through BE 5, one per SM of the GPU]
-
Mapping of parallelism
[Figure: executing one operator level at a time (Level 1: σ and MatMul of [64, 512] × [512, 128] → [64, 128]; Level 2: Conv) leaves vblocks idle on block executors BE 0 through BE 5 and under-utilizes the GPU; interleaving vblocks across levels gives better utilization and lower latency]
-