
Micah Villmow

Senior TensorRT Software Engineer

S8822 – OPTIMIZING NMT WITH TENSORRT

100倍以上速く、本当に可能ですか? (Over 100x faster, is it really possible?)

Neural Machine Translation Unit

DOUGLAS ADAMS – BABEL FISH

OVER 100X FASTER, IS IT REALLY POSSIBLE?

Over 200 years

NVIDIA TENSORRT
Programmable Inference Accelerator

developer.nvidia.com/tensorrt

[Diagram] TensorRT (Optimizer + Runtime) sits between FRAMEWORKS and GPU PLATFORMS: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4, TESLA V100

TENSORRT LAYERS

Built-in Layer Support:
• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element-wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution

Custom Layer API: a Custom Layer runs through the CUDA Runtime alongside the TensorRT Runtime inside the Deployed Application.

TENSORRT OPTIMIZATIONS

• Kernel Auto-Tuning
• Layer & Tensor Fusion
• Dynamic Tensor Memory
• Weights & Activation Precision Calibration

[Chart: ResNet50 inference throughput and latency. CPU-Only: 140 images/sec at 14 ms; V100 + TensorFlow: 305 images/sec at 6.67 ms; V100 + TensorRT: 5700 images/sec at 6.83 ms]

Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to account for Intel's stated claim of a 2x performance improvement on Skylake with AVX512.

40x Faster CNNs on V100 vs. CPU-Only, Under 7ms Latency (ResNet50)

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve them?
• What perf is possible?

9

ACRONYMS AND DEFINITIONS

NMT: Neural Machine Translation

OpenNMT: Open source NMT project for academia and industry

Token: The minimum representation used for encoding(symbol, word, character, subword)

Sequence: A number of tokens wrapped by special start and end sequence tokens.

Beam Search: directed partial breadth-first tree search algorithm

TopK: Partial sort resulting in N min/max elements

Unk: Special token that represents unknown translations.
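TopK in practice is a partial selection rather than a full sort. A minimal NumPy illustration of the definition above (mine, not from the deck):

import numpy as np

def topk(x, k):
    # Partial sort: argpartition places the k largest last, unordered...
    idx = np.argpartition(x, -k)[-k:]
    idx = idx[np.argsort(x[idx])[::-1]]  # ...then order only those k
    return x[idx], idx

probs = np.array([0.1, 0.9, 0.05, 0.8, 0.3])
print(topk(probs, 2))  # (array([0.9, 0.8]), array([1, 3]))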

OPENNMT INFERENCE

[Diagram] Input → Input Setup → Encoder (Encoder RNN) → Decoder → Output. Inside the decoder, each step runs: Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Batch Reduction, Beam Scoring), which feeds the next step's input.

DECODER EXAMPLE

[Diagram] Per decode step: Input Embedding → Decoder RNN → Attention Model → Projection → TopK → Output Embedding, wrapped by Beam Search (Beam Shuffle, Batch Reduction, Beam Scoring).

Iteration 0: the start token <S> goes in; the top-5 candidates come out: This, The, He, What, The.
Iteration 1: each candidate is fed back in and extended by its beam's next token: is, house, ran, time, cow.

TRAINING VS INFERENCE

[Diagram] Training pipeline: Input → Input Setup → Encoder RNN → Decoder RNN → Attention Model → Projection → Output.
Inference pipeline: the same, plus the search machinery after Projection: TopK → Beam Search (Beam Shuffle, Batch Reduction, Beam Scoring) looping back into the decoder. The search stages exist only at inference time.

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve them?
• What perf is possible?

INFERENCE TIME IS BEAM SEARCH TIME

• Wu et al., 2016, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", arXiv:1609.08144
• Sharan Narang, June 2017, Baidu's DeepBench, https://github.com/baidu-research/DeepBench
• Rui Zhao, December 2017, "Why does inference run 20x slower than training?", https://github.com/tensorflow/nmt/issues/204
• David Levinthal, Ph.D., January 2018, "Evaluating RNN performance across hardware platforms"

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve them?
• What perf is possible?


PERF ANALYSIS


KERNEL ANALYSIS

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve them?
• What perf is possible?

ENCODER

[Diagram] The encoder portion of the pipeline: Input Setup feeding the Encoder RNN.

SETUP

Input: "Hello. This is a test. Bye."
Tokenization: Hello . | This is a test . | Bye .
A Gather packs the token IDs into the padded Encoder Input:

42 23  0  0  0  0
73  3  8 19 23  0
98 23  0  0  0  0

The PrefixSumPlugin fills the Sequence Length Buffer: 2, 5, 2. Setup also prepares the constant Decoder Start Tokens and a Zero State buffer.
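This setup step is easy to picture in NumPy. A minimal sketch, assuming the slide's example token IDs (the variable names are mine):

import numpy as np

# "Hello ." -> [42, 23]; "This is a test ." -> [73, 3, 8, 19, 23]; "Bye ." -> [98, 23]
sequences = [[42, 23], [73, 3, 8, 19, 23], [98, 23]]

max_len = max(len(s) for s in sequences)
encoder_input = np.zeros((len(sequences), max_len), dtype=np.int32)
for row, seq in enumerate(sequences):
    encoder_input[row, :len(seq)] = seq  # right-pad with zeros

seq_lengths = np.array([len(s) for s in sequences], dtype=np.int32)  # [2, 5, 2]
# A prefix sum over the lengths gives row offsets into a packed buffer,
# which is presumably what the PrefixSumPlugin computes: [0, 2, 7, 9]
offsets = np.concatenate(([0], np.cumsum(seq_lengths)))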

ENCODER

[Diagram] The Encoder Input and Sequence Lengths (2, 5, 2) pass through the Embedding Plugin into a PackedRNN, whose initial hidden and cell states are the trained values. It produces the Encoder Hidden State, the Encoder Cell State, and the Context Vector, zero-padded past each sequence's length:

.1   .35 0   0 0   0
.123 .93 1.4 1 .01 0
.42  .20 0   0 0   0
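Packing is what lets the RNN skip padded timesteps instead of wasting compute on them. A hedged PyTorch sketch of the same idea (an analogy, not the TensorRT PackedRNN itself; sizes are arbitrary):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

embed = nn.Embedding(num_embeddings=100, embedding_dim=8, padding_idx=0)
rnn = nn.LSTM(input_size=8, hidden_size=8, batch_first=True)

ids = torch.tensor([[42, 23, 0, 0, 0], [73, 3, 8, 19, 23], [98, 23, 0, 0, 0]])
lengths = torch.tensor([2, 5, 2])

x = embed(ids)                                       # Embedding Plugin equivalent
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
out, (h, c) = rnn(packed)                            # padded steps are never computed
ctx, _ = pad_packed_sequence(out, batch_first=True)  # context vectors, zero-padded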

DECODER

[Diagram] The decoder loop: Input → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Batch Reduction, Beam Scoring) → Output.

Decoder, 1st Iteration

[Diagram] The RNN is seeded with the Encoder Hidden State and Encoder Cell State. The Decoder Input is the start-of-sentence token <S> for every batch entry (Batch 0 ... Batch N), run through the Embedding Plugin. The step emits the Decode Hidden State, the Decode Cell State, and one Decoder Output per batch entry (Batch 0: .124, ..., Batch N: .912).
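The seeding itself is one line of state plumbing. Continuing the PyTorch sketch above (START is a hypothetical <S> token id):

START = 1                                    # hypothetical <S> token id
dec_rnn = nn.LSTM(input_size=8, hidden_size=8, batch_first=True)
start_ids = torch.full((ids.size(0), 1), START, dtype=torch.long)
dec_in = embed(start_ids)                    # Embedding Plugin equivalent
dec_out, (h1, c1) = dec_rnn(dec_in, (h, c))  # seeded with encoder (h, c)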

Decoder, 2nd+ Iteration

[Diagram] From the second step on, the RNN carries the Prev Hidden/Cell State to the Next Hidden/Cell State, and the Decoder Input is one token per beam, run through the Embedding Plugin:

Batch | Beam0 | Beam1 | Beam2 | Beam3 | Beam4
0     | こ    | ん    | に    | ち    | は
N     | さ    | よ    | う    | な    | ら

Decoder Output:

Batch | Beam0 | Beam1 | Beam2 | Beam3 | Beam4
0     | .18   | .32   | .85   | .39   | .75
N     | .79   | .27   | .81   | .93   | .73

Global Attention Model

[Diagram] Inputs: the encoder's Context Vector, the Sequence Length Buffer, and the Decoder Output above. A BatchedGemm (with FullyConnected weights) scores the decoder output against the context vectors; a RaggedSoftmax normalizes those scores while respecting each sequence's true length; a second BatchedGemm forms the attention-weighted context; Concat joins it with the decoder output; and a FullyConnected layer followed by TanH produces the attention output.
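Read this way, the block is Luong-style global attention. A NumPy sketch under that assumption (shapes and names are mine, not the plugin's interface):

import numpy as np

def ragged_softmax(scores, lengths):
    # Mask positions beyond each sequence's true length, then softmax.
    T = scores.shape[1]
    mask = np.arange(T)[None, :] < lengths[:, None]
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

B, T, H = 3, 6, 8                         # batch*beam, max source length, hidden
ctx = np.random.randn(B, T, H)            # encoder context vectors
dec = np.random.randn(B, H)               # decoder output for this step
lengths = np.array([2, 5, 2])

scores = np.einsum('bth,bh->bt', ctx, dec)        # BatchedGemm #1: alignment scores
alpha = ragged_softmax(scores, lengths)           # RaggedSoftmax
attn_ctx = np.einsum('bt,bth->bh', alpha, ctx)    # BatchedGemm #2: weighted context
concat = np.concatenate([attn_ctx, dec], axis=1)  # Concat
W = np.random.randn(2 * H, H)                     # FullyConnected weights
attn_out = np.tanh(concat @ W)                    # FullyConnected + TanH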

Projection

[Diagram] The Attention Output, one vector per batch entry and beam:

Batch | Beam0     | Beam1     | Beam2    | Beam3     | Beam4
0     | [.9,…,.1] | [0,…,.3]  | [.1,…,0] | [.6,…,.8] | [.3,…,.2]
N     | [.4,…,.9] | [.5,…,.2] | [0,…,.7] | [0,…,2]   | [.1,…,.9]

is multiplied by the FullyConnected weights to give vocabulary logits; Softmax followed by Log turns them into the per-token log-probabilities of the Projection Output.
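Softmax followed by Log is usually fused into one numerically stable log-softmax; a sketch of the projection step (mine, with an arbitrary vocabulary size):

import numpy as np

def log_softmax(logits):
    # Stable log(softmax(x)): shift by the max, subtract log-sum-exp.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

B, H, V = 10, 8, 50000                 # batch*beam, hidden, vocabulary size
attn_out = np.random.randn(B, H)
W = np.random.randn(H, V)              # FullyConnected (projection) weights
log_probs = log_softmax(attn_out @ W)  # Projection Output: (B, V) log-probs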

TopK Part 1

[Diagram] An intra-beam TopK first selects the best candidates within each beam of the Projection Output. Intra-beam output (top-2 per beam, one batch entry shown):

Beam  | Index | Prob
Beam0 | [1,3] | [.9,.8]
Beam1 | [2,4] | [.99,.55]
Beam2 | [9,0] | [.3,.8]
Beam3 | [5,0] | [.1,.93]
Beam4 | [7,6] | [.85,.99]

A Gather then flattens these per-beam results for the inter-beam pass.

TopK Part 2

[Diagram] Gather output (flattened probabilities):

Prob [.9, .8, .99, .55, .3, .8, .1, .93, .85, .99]

An inter-beam TopK over the flattened list yields:

Indices [2, 9, 7, 0, 8]
Prob [.99, .99, .93, .9, .85]

The Beam Mapping Plugin translates the flat indices back to the intra-beam output's (beam, index) pairs:

Output: Beam1,Idx2 | Beam4,Idx6 | Beam3,Idx0 | Beam0,Idx1 | Beam4,Idx7
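The two-stage TopK is easy to reproduce in NumPy with the slide's example values (the reshape bookkeeping is my own; tie order between the two .99 entries may vary):

import numpy as np

# Intra-beam output from Part 1: top-2 (token index, prob) per beam.
probs  = np.array([[.9, .8], [.99, .55], [.3, .8], [.1, .93], [.85, .99]])
tokens = np.array([[1, 3],   [2, 4],     [9, 0],   [5, 0],    [7, 6]])

flat = probs.reshape(-1)                # Gather: [.9 .8 .99 .55 .3 .8 .1 .93 .85 .99]
k = 5
top = np.argpartition(flat, -k)[-k:]    # inter-beam TopK, unordered
top = top[np.argsort(flat[top])[::-1]]  # -> [2, 9, 7, 0, 8]

# Beam Mapping Plugin: flat index -> (source beam, vocabulary token).
beams = top // probs.shape[1]               # [1, 4, 3, 0, 4]
toks = tokens[beams, top % probs.shape[1]]  # [2, 6, 0, 1, 7]
print(list(zip(beams, toks, flat[top])))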

Beam Search – Beam Shuffle

Selected candidates: Beam1,Idx2 | Beam4,Idx6 | Beam3,Idx0 | Beam0,Idx1 | Beam4,Idx7

[Diagram] The Beam Shuffle Plugin reorders the per-beam states to match the survivors: the slots holding Beam0...Beam4 State now receive Beam1, Beam4, Beam3, Beam0, and Beam4 State+1. Beam4's state is duplicated because two of its candidates survived, while Beam2's state is dropped.
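The shuffle is a gather along the beam axis; duplication and dropping fall out of the indexing (a sketch, names mine):

import numpy as np

survivors = np.array([1, 4, 3, 0, 4])  # source beam of each surviving candidate
hidden = np.random.randn(5, 8)         # one state row per beam
cell = np.random.randn(5, 8)

# Beam Shuffle Plugin equivalent: Beam4 is copied twice, Beam2 disappears.
hidden, cell = hidden[survivors], cell[survivors]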

Beam Search – Beam Scoring

Selected candidates: Beam1,Idx2 | Beam4,Idx6 | Beam3,Idx0 | Beam0,Idx1 | Beam4,Idx7

[Diagram] Given the previous states (Beam0...Beam4 State) and the shuffled new states (Beam0...Beam4 State+1), the Beam Scoring Plugin performs, per step:

• EOS detection
• Sentence probability update
• Backtrack state storage
• Sequence length increment
• End-of-beam/batch heuristic

Its result is the Batch Finished Bitmap [0001100011…010] marking completed sentences (see the sketch below).
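A sketch of that bookkeeping, assuming an EOS token id and per-step log-probabilities (all names are mine):

import numpy as np

EOS = 2  # hypothetical end-of-sequence token id

def score_step(tokens, step_logprobs, scores, lengths, finished):
    newly_done = (tokens == EOS) & ~finished                     # EOS detection
    scores = np.where(finished, scores, scores + step_logprobs)  # probability update
    lengths = np.where(finished, lengths, lengths + 1)           # length increment
    finished = finished | newly_done                             # the finished bitmap
    return scores, lengths, finished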

Beam Search – Batch Reduction

[Diagram] The Batch Finished Bitmap [0001100011…010] goes through a reduce operation (sum), and a single 32-bit value is transferred to the host as the new batch size. TopK and Gather, driven by the Encoder/State Reduction Plugin, then compact the Encoder Output and Beam State down to the still-active entries.
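In NumPy terms the reduction is a sum plus a gather that packs live entries to the front, so later kernels launch with the smaller batch (a sketch, not the plugin's real mechanics):

import numpy as np

finished = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1], dtype=bool)
new_batch = int((~finished).sum())      # reduce(sum) + 32-bit host transfer
live = np.nonzero(~finished)[0]         # indices of still-active sentences

enc_out = np.random.randn(10, 6, 8)     # [batch, source len, hidden]
beam_state = np.random.randn(10, 5, 8)  # [batch, beam, hidden]
enc_out = enc_out[live]                 # Encoder/State Reduction:
beam_state = beam_state[live]           # both now have new_batch rows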

Output

Selected candidates: Beam1,Idx2 | Beam4,Idx6 | Beam3,Idx0 | Beam0,Idx1 | Beam4,Idx7

[Diagram] All done? If no, the candidates become the next Decoder Input. If yes, the Beam State is copied device-to-host, and on the host the stored beam state is backtracked into the final output:

こんにちは。これはテストです。さようなら。 ("Hello. This is a test. Bye.")
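The host-side backtrack walks stored backpointers from a final beam to the start; a compact sketch with toy data (the layout of back/tok is my assumption):

# back[t][b]: source beam of beam b at step t; tok[t][b]: token it emitted.
def backtrack(back, tok, best_beam):
    out, b = [], best_beam
    for t in range(len(tok) - 1, -1, -1):
        out.append(tok[t][b])
        b = back[t][b]
    return out[::-1]  # tokens in sentence order

back = [[0, 0], [0, 0], [0, 1]]
tok = [["こ", "さ"], ["ん", "よ"], ["に", "う"]]
print("".join(backtrack(back, tok, best_beam=0)))  # こんに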


TENSORRT ANALYSIS


TENSORRT KERNEL ANALYSIS

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve them?
• What perf is possible?

RESULTS

[Chart: inference throughput (sentences/sec, bars) and latency (ms, line) for CPU-Only + Torch, V100 + Torch, and V100 + TensorRT. Data labels on the slide: 425 and 550 sentences/sec; latencies 280 ms, 153 ms, and 117 ms respectively]

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on.

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

SUMMARY

• Showed that TopK no longer dominates sequence inference time.
• Showed that RNN inference is compute bound, not memory bound.
• TensorRT accelerates sequence inference.
• Over two orders of magnitude higher throughput than CPU-only.
• Latency reduced by more than half versus CPU-only.

developer.nvidia.com/tensorrt
PRODUCT PAGE

LEARN MORE

PRODUCT PAGE: developer.nvidia.com/tensorrt
DOCUMENTATION: docs.nvidia.com/deeplearning/sdk
TRAINING: nvidia.com/dli


Q&A
