Micah Villmow
Senior TensorRT Software Engineer
S8822 – OPTIMIZING NMT WITH TENSORRT
Over 100x faster, is it really possible?
Neural Machine Translation Unit
DOUGLAS ADAMS – BABEL FISH
OVER 100X FASTER, IS IT REALLY POSSIBLE?
Over 200 years
NVIDIA TENSORRT
Programmable Inference Accelerator
developer.nvidia.com/tensorrt
TensorRT: Optimizer, Runtime
Connects FRAMEWORKS to GPU PLATFORMS: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4, TESLA V100
TENSORRT LAYERS
Built-in Layer Support:
• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element-wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution
Custom Layer API: custom layers run in the deployed application alongside the TensorRT Runtime, on top of the CUDA Runtime.
TENSORRT OPTIMIZATIONS
• Kernel Auto-Tuning
• Layer & Tensor Fusion
• Dynamic Tensor Memory
• Weights & Activation Precision Calibration

[Chart] 40x Faster CNNs on V100 vs. CPU-Only, Under 7ms Latency (ResNet50). Throughput (images/sec): 140 (CPU-Only), 305 (V100 + TensorFlow), 5700 (V100 + TensorRT); latency: 14 ms (CPU-Only), 6.67 ms (V100 + TensorFlow), 6.83 ms (V100 + TensorRT).

Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.6 GHz, 3.5 GHz Turbo (Broadwell), HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6 GHz, 3.5 GHz Turbo (Broadwell), HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.
Agenda
• What is NMT?
• What is current state?
• What are the problems?
• How did we solve it?
• What perf is possible?
ACRONYMS AND DEFINITIONS
NMT: Neural Machine Translation
OpenNMT: Open-source NMT project for academia and industry
Token: The minimum representation used for encoding (symbol, word, character, subword)
Sequence: A number of tokens wrapped by special start and end sequence tokens
Beam Search: A directed, partial breadth-first tree search algorithm
TopK: Partial sort resulting in the N min/max elements
Unk: Special token that represents unknown translations
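The TopK definition above (a partial sort that yields the N largest elements without fully sorting the input) can be sketched in NumPy; the function name and the sample data are illustrative, not from the talk:

```python
import numpy as np

def topk(values, k):
    """Partial sort: return the k largest values and their indices.

    np.argpartition does an O(n) partial selection, so only the top-k
    elements are then ordered, matching the TopK definition above.
    """
    idx = np.argpartition(values, -k)[-k:]    # unordered indices of the k largest
    idx = idx[np.argsort(values[idx])[::-1]]  # order just those k, descending
    return values[idx], idx

probs = np.array([0.1, 0.9, 0.3, 0.8, 0.05])
vals, idx = topk(probs, 2)
print(idx.tolist(), vals.tolist())  # [1, 3] [0.9, 0.8]
```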
OPENNMT INFERENCE
Input → Input Setup → Encoder (Encoder RNN) → Decoder
Decoder: Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Beam Scoring, Batch Reduction) → Output
DECODER EXAMPLE
Pipeline: Input Embedding → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Beam Scoring, Batch Reduction) → Output Embedding
Iteration 0: input <S>; the top-5 beam candidates are This, The, He, What, The.
Iteration 1: each beam is extended with its next token, e.g. This + is, The + house, He + ran, What + time, The + cow.
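The two decoder iterations above can be sketched as a toy beam-search loop. Everything here is a hypothetical stand-in: the vocabulary, the random score table (which replaces the real decoder RNN, attention, and projection), and the function name.

```python
import numpy as np

vocab = ["<s>", "this", "the", "he", "is", "house", "</s>"]
np.random.seed(0)
# Fake "decoder step": next-token scores indexed by previous token.
step_scores = np.random.rand(len(vocab), len(vocab))

def beam_decode(start_token, beam_width, steps):
    """Iteration 0 expands <s> into beam_width candidates; later
    iterations extend every surviving beam and keep the best overall."""
    beams = [([start_token], 0.0)]  # (sequence, cumulative score)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            row = step_scores[vocab.index(seq[-1])]
            # intra-beam TopK: best extensions of this beam
            for t in np.argsort(row)[::-1][:beam_width]:
                candidates.append((seq + [vocab[t]], score + row[t]))
        # inter-beam TopK: keep the best beam_width hypotheses overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

for seq, score in beam_decode("<s>", beam_width=2, steps=2):
    print(seq, round(score, 3))
```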
TRAINING VS INFERENCE
Training: Input → Input Setup → Encoder RNN → Decoder RNN → Attention Model → Projection → Output
Inference: Input → Input Setup → Encoder RNN → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Beam Scoring, Batch Reduction) → Output
Agenda
• What is NMT?
• What is current state?
• What are the problems?
• How did we solve it?
• What perf is possible?
INFERENCE TIME IS BEAM SEARCH TIME
• Wu et al., 2016, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", arXiv:1609.08144
• Sharan Narang, June 2017, Baidu's DeepBench, https://github.com/baidu-research/DeepBench
• Rui Zhao, December 2017, "Why does inference run 20x slower than training", https://github.com/tensorflow/nmt/issues/204
• David Levinthal, Ph.D., January 2018, "Evaluating RNN performance across hardware platforms"
Agenda
• What is NMT?
• What is current state?
• What are the problems?
• How did we solve it?
• What perf is possible?
PERF ANALYSIS
KERNEL ANALYSIS
Agenda
• What is NMT?
• What is current state?
• What are the problems?
• How did we solve it?
• What perf is possible?
ENCODER
Input → Input Setup → Encoder RNN
INPUT SETUP
Input: "Hello. This is a test. Bye."
Tokenization: "Hello . This is a test . Bye ."
Gather → Encoder Input (token IDs, padded):
42 23 0 0 0 0
73 3 8 19 23 0
98 23 0 0 0 0
Sequence Length Buffer (PrefixSumPlugin): 2, 5, 2
Also prepared for the decoder: Decoder Start Tokens and a Constant Zero State buffer.
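The input-setup stage above can be sketched as follows. The token-id table reuses the example IDs from the slide; the helper name and fixed `max_len` are illustrative, and the real pipeline does this with a gather and the PrefixSumPlugin rather than a Python loop.

```python
import numpy as np

# Hypothetical token-id table standing in for the real vocabulary.
token_ids = {"Hello": 42, "This": 73, "is": 3, "a": 8, "test": 19,
             "Bye": 98, ".": 23}

def setup_input(sentences, max_len=6):
    """Tokenize, look up ids, pad to max_len, and record per-sentence
    lengths (the Sequence Length Buffer fed to the encoder)."""
    batch = np.zeros((len(sentences), max_len), dtype=np.int32)
    lengths = np.zeros(len(sentences), dtype=np.int32)
    for i, sent in enumerate(sentences):
        toks = [token_ids[t] for t in sent.split()]
        batch[i, :len(toks)] = toks
        lengths[i] = len(toks)
    return batch, lengths

batch, lengths = setup_input(["Hello .", "This is a test .", "Bye ."])
print(batch)
print(lengths)  # [2 5 2]
```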
ENCODER
Encoder Input (token IDs) and Sequence Lengths feed an Embedding Plugin and then a PackedRNN:
42 23 0 0 0 0 (length 2)
73 3 8 19 23 0 (length 5)
98 23 0 0 0 0 (length 2)
The PackedRNN starts from the Trained Hidden State and Trained Cell State and produces the Encoder Hidden State, Encoder Cell State, and Context Vector:
.1 .35 0 0 0 0
.123 .93 1.4 1 .01 0
.42 .20 0 0 0 0
DECODER
Input → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Beam Scoring, Batch Reduction) → Output
Decoder, 1st Iteration
Decoder Input: the start-sentence token <S> for every batch entry (Batch0 … BatchN).
Embedding Plugin → RNN, initialized from the Encoder Hidden State and Encoder Cell State.
Outputs: Decode Hidden State, Decode Cell State, and the Decoder Output (e.g. Batch0: .124, BatchN: .912).
Decoder, 2nd+ Iteration
Decoder Input: one token per beam, e.g.
Batch | Beam0 Beam1 Beam2 Beam3 Beam4
0     | こ ん に ち は
N     | さ よ う な ら
Embedding Plugin → RNN, fed the Prev Hidden State and Prev Cell State and producing the Next Hidden State and Next Cell State.
Decoder Output: one score per beam, e.g.
Batch | Beam0 Beam1 Beam2 Beam3 Beam4
0     | .18 .32 .85 .39 .75
N     | .79 .27 .81 .93 .73
Global Attention Model
Inputs: Decoder Output (scores per beam), Context Vector, and the Sequence Length Buffer.
BatchedGemm → RaggedSoftmax (masked by sequence lengths) → BatchedGemm → Concat → FullyConnected (weights) → TanH → attention output.
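The RaggedSoftmax step normalizes attention scores per row while ignoring padded encoder positions. A NumPy sketch of that masking (the function name is illustrative; this is not the TensorRT kernel):

```python
import numpy as np

def ragged_softmax(scores, lengths):
    """Softmax where row i only attends to its first lengths[i] encoder
    positions; padded positions get exactly zero weight."""
    mask = np.arange(scores.shape[1])[None, :] < lengths[:, None]
    # Subtract the row max for numerical stability, zero out padding.
    e = np.exp(scores - scores.max(axis=1, keepdims=True)) * mask
    return e / e.sum(axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 5.0, 5.0],    # length 2: last two are padding
                   [0.5, 0.5, 0.5, 0.5]])   # length 4: full row
w = ragged_softmax(scores, np.array([2, 4]))
print(w.round(3))
```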
Projection
Attention Output: one vector per beam, e.g.
Batch | Beam0 Beam1 Beam2 Beam3 Beam4
0     | [.9,…,.1] [0,…,.3] [.1,…,0] [.6,…,.8] [.3,…,.2]
N     | [.4,…,.9] [.5,…,.2] [0,…,.7] [0,…,2] [.1,…,.9]
FullyConnected (weights) → Softmax → Log → Projection Output: per-beam log probabilities over the vocabulary.
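The projection step (FullyConnected, then Softmax followed by Log) can be sketched in the numerically stable log-sum-exp form. Shapes, names, and the random data are illustrative:

```python
import numpy as np

def projection_log_probs(hidden, weights):
    """FullyConnected to vocabulary logits, then log-softmax.
    hidden: (batch*beam, d), weights: (d, vocab)."""
    logits = hidden @ weights
    m = logits.max(axis=1, keepdims=True)  # stabilizer: subtract the row max
    return logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))

np.random.seed(1)
logp = projection_log_probs(np.random.rand(2, 4), np.random.rand(4, 10))
print(np.exp(logp).sum(axis=1))  # each row's probabilities sum to 1
```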
TopK Part 1
Projection Output (per batch row, per beam) → TopK (intra-beam).
Intra-beam Output: [index, probability] pairs per beam, e.g.
Beam0: [1,3] [.9,.8] | Beam1: [2,4] [.99,.5] | Beam2: [9,0] [.3,.8] | Beam3: [5,0] [.1,.93] | Beam4: [7,6] [.85,.99]
→ Gather

TopK Part 2
Gather Output: Prob [.9,.8,.99,.55,.3,.8,.1,.93,.85,.99]
TopK (inter-beam): Indices [2,9,7,0,8], Prob [.99,.99,.93,.9,.85]
Beam Mapping Plugin → Output: Beam1,Idx2 Beam4,Idx6 Beam3,Idx0 Beam0,Idx1 Beam4,Idx7
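The two-stage TopK above, an intra-beam TopK over the vocabulary followed by a gather and an inter-beam TopK with mapping back to (beam, vocab index), can be sketched in NumPy. The function name and the small score matrix are illustrative:

```python
import numpy as np

def two_stage_topk(log_probs, k):
    """Stage 1 (intra-beam): top-k over the vocabulary within each beam.
    Stage 2 (inter-beam): top-k over the gathered stage-1 candidates,
    then map winners back to (beam, vocab index).
    log_probs: (beams, vocab)."""
    # Stage 1: top-k per beam (full argsort here for clarity)
    intra_idx = np.argsort(log_probs, axis=1)[:, ::-1][:, :k]
    intra_val = np.take_along_axis(log_probs, intra_idx, axis=1)
    # Gather into one flat candidate list of beams*k entries
    flat = intra_val.ravel()
    # Stage 2: top-k across all beams' candidates
    winners = np.argsort(flat)[::-1][:k]
    beam_of = winners // k
    token_of = intra_idx[beam_of, winners % k]
    return list(zip(beam_of.tolist(), token_of.tolist()))

lp = np.array([[0.9, 0.1, 0.8],     # beam 0
               [0.2, 0.99, 0.5]])   # beam 1
print(two_stage_topk(lp, 2))  # [(1, 1), (0, 0)]
```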
Beam Search – Beam Shuffle
Selected: Beam1,Idx2 Beam4,Idx6 Beam3,Idx0 Beam0,Idx1 Beam4,Idx7
The Beam Shuffle Plugin gathers the per-beam states to match:
[Beam0 State, Beam1 State, Beam2 State, Beam3 State, Beam4 State] → [Beam1 State+1, Beam4 State+1, Beam3 State+1, Beam0 State+1, Beam4 State+1]
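Beam shuffle is a gather: each surviving hypothesis may descend from a different parent beam, so per-beam state is reordered by the selected parent indices. A NumPy sketch with the same selection as the slide (function name illustrative):

```python
import numpy as np

def beam_shuffle(states, selected_beams):
    """Gather per-beam RNN states so slot i now holds the state of its
    selected parent beam. states: (beams, state_dim)."""
    return states[np.array(selected_beams)]

states = np.array([[0.0, 0.0],   # Beam0 state
                   [1.0, 1.0],   # Beam1 state
                   [2.0, 2.0],   # Beam2 state
                   [3.0, 3.0],   # Beam3 state
                   [4.0, 4.0]])  # Beam4 state
# Winners from the slide: Beam1, Beam4, Beam3, Beam0, Beam4
shuffled = beam_shuffle(states, [1, 4, 3, 0, 4])
print(shuffled[:, 0].tolist())  # [1.0, 4.0, 3.0, 0.0, 4.0]
```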
Beam Search – Beam Scoring
Selected: Beam1,Idx2 Beam4,Idx6 Beam3,Idx0 Beam0,Idx1 Beam4,Idx7
The Beam Scoring Plugin takes the shuffled states (Beam0–Beam4 State+1) and the previous states (Beam0–Beam4 State) and performs:
• EOS Detection
• Sentence Probability Update
• Backtrack State Storage
• Sequence Length Increment
• End of Beam/Batch Heuristic
Result: Batch Finished Bitmap [0001100011…010]
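A sketch of one beam-scoring step: detect EOS per beam, increment the sequence lengths of live beams, and mark a batch entry finished once all of its beams hit EOS (one simple end-of-beam heuristic; the real plugin also updates sentence probabilities and backtrack storage). The EOS id and function name are hypothetical:

```python
import numpy as np

EOS = 0  # hypothetical end-of-sequence token id

def beam_scoring_step(tokens, seq_lens, finished, max_len=50):
    """tokens: (batch, beams) token ids emitted this iteration;
    seq_lens: (batch, beams) running lengths; finished: (batch,) bitmap."""
    hit_eos = tokens == EOS
    seq_lens = seq_lens + (~hit_eos).astype(np.int32)  # only live beams grow
    finished = finished | hit_eos.all(axis=1) | (seq_lens.max(axis=1) >= max_len)
    return seq_lens, finished

tokens = np.array([[0, 0, 0],    # batch 0: every beam emitted EOS
                   [5, 0, 7]])   # batch 1: still translating
lens = np.zeros((2, 3), dtype=np.int32)
done = np.zeros(2, dtype=bool)
lens, done = beam_scoring_step(tokens, lens, done)
print(done.tolist())  # [True, False]
```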
Beam Search – Batch Reduction
Batch Finished Bitmap [0001100011…010] → Reduce Operation (Sum) → transfer a 32-bit count to the host as the new batch size.
The Encoder/State Reduction Plugin gathers the Encoder Output and Beam State down to the surviving entries before the next TopK and Gather.
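Batch reduction compacts the batch so later iterations only process live sentences: count the unfinished entries (on the GPU, a sum over the finished bitmap transferred to the host) and gather encoder output and beam state down to those rows. A NumPy sketch with illustrative names:

```python
import numpy as np

def batch_reduce(finished, encoder_out, beam_state):
    """Return the new batch size plus encoder output and beam state
    gathered down to the unfinished entries."""
    live = np.flatnonzero(~finished)
    return len(live), encoder_out[live], beam_state[live]

finished = np.array([False, True, False, True])
enc = np.arange(4)[:, None] * np.ones((1, 3))   # (batch, features)
beam = np.arange(4)[:, None] * np.ones((1, 5))  # (batch, beams)
n, enc2, beam2 = batch_reduce(finished, enc, beam)
print(n, enc2[:, 0].tolist())  # 2 [0.0, 2.0]
```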
Output
Selected: Beam1,Idx2 Beam4,Idx6 Beam3,Idx0 Beam0,Idx1 Beam4,Idx7
All done? No → feed back as the next Decoder Input. Yes → copy the Beam State device to host; on the host, the stored beam state is unwound into the output:
こんにちは。これはテストです。さようなら。 ("Hello. This is a test. Goodbye.")
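The host-side output step walks the stored backtrack state from the winning beam at the final iteration back to iteration 0, collecting one token per step. A sketch with a hypothetical 3-step, 2-beam search (the parent/token tables stand in for the Backtrack State Storage):

```python
def backtrack(parents, tokens, final_beam):
    """parents[t][b]: parent beam of beam b at step t;
    tokens[t][b]: token emitted by beam b at step t.
    Walk backwards from final_beam, then reverse into reading order."""
    seq, beam = [], final_beam
    for t in range(len(tokens) - 1, -1, -1):
        seq.append(tokens[t][beam])
        beam = parents[t][beam]
    return list(reversed(seq))

# Toy 3-step search with 2 beams (hypothetical tokens).
parents = [[0, 0], [1, 0], [0, 1]]
tokens = [["こ", "さ"], ["ん", "に"], ["ち", "は"]]
print("".join(backtrack(parents, tokens, final_beam=0)))
```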
TENSORRT ANALYSIS
TENSORRT KERNEL ANALYSIS
Agenda
• What is NMT?
• What is current state?
• What are the problems?
• How did we solve it?
• What perf is possible?
RESULTS
[Chart] 140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT). Throughput (sentences/sec): 425 (V100 + Torch), 550 (V100 + TensorRT); latency: 280 ms (CPU-Only + Torch), 153 ms (V100 + Torch), 117 ms (V100 + TensorRT).

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6 GHz, 3.5 GHz Turbo (Broadwell), HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6 GHz, 3.5 GHz Turbo (Broadwell), HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.6 GHz, 3.5 GHz Turbo (Broadwell), HT On.
SUMMARY
• Show that TopK no longer dominates sequence inference time.
• Show that RNN inference is compute bound, not memory bound.
• TensorRT accelerates sequence inferencing.
• Over two orders of magnitude higher throughput than CPU.
• Latency reduced by more than half versus CPU.
LEARN MORE
PRODUCT PAGE: developer.nvidia.com/tensorrt
DOCUMENTATION: docs.nvidia.com/deeplearning/sdk
TRAINING: nvidia.com/dli
Q&A