
Micah Villmow

Senior TensorRT Software Engineer

S8822 – OPTIMIZING NMT WITH TENSORRT

100倍以上速く、本当に可能ですか? (Over 100x faster, is it really possible?)

Neural Machine Translation Unit

DOUGLAS ADAMS – BABEL FISH

OVER 100X FASTER, IS IT REALLY POSSIBLE?

Over 200 years

NVIDIA TENSORRT
Programmable Inference Accelerator

developer.nvidia.com/tensorrt

[Diagram] TensorRT (Optimizer + Runtime) sits between FRAMEWORKS and GPU PLATFORMS: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4, TESLA V100

TENSORRT LAYERS

Built-in Layer Support:
• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element-wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution

Custom Layer API: a Custom Layer runs through the CUDA Runtime alongside the TensorRT Runtime inside the Deployed Application.

TENSORRT OPTIMIZATIONS

• Kernel Auto-Tuning
• Layer & Tensor Fusion
• Dynamic Tensor Memory
• Weights & Activation Precision Calibration

[Chart: ResNet50 inference throughput and latency. CPU-Only: 140 images/sec at 14 ms; V100 + TensorFlow: 305 images/sec at 6.67 ms; V100 + TensorRT: 5700 images/sec at 6.83 ms]

Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to account for Intel's stated claim of a 2x performance improvement on Skylake with AVX512.

40x Faster CNNs on V100 vs. CPU-Only, Under 7ms Latency (ResNet50)

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve them?
• What perf is possible?

9

ACRONYMS AND DEFINITIONS

NMT: Neural Machine Translation

OpenNMT: Open source NMT project for academia and industry

Token: The minimum representation used for encoding(symbol, word, character, subword)

Sequence: A number of tokens wrapped by special start and end sequence tokens.

Beam Search: directed partial breadth-first tree search algorithm

TopK: Partial sort resulting in N min/max elements

Unk: Special token that represents unknown translations.
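TopK in practice is a partial selection rather than a full sort. A minimal NumPy illustration of the definition above (mine, not from the deck):

import numpy as np

def topk(x, k):
    # Partial sort: argpartition places the k largest last, unordered...
    idx = np.argpartition(x, -k)[-k:]
    idx = idx[np.argsort(x[idx])[::-1]]  # ...then order only those k
    return x[idx], idx

probs = np.array([0.1, 0.9, 0.05, 0.8, 0.3])
print(topk(probs, 2))  # (array([0.9, 0.8]), array([1, 3]))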

OPENNMT INFERENCE

[Diagram] Input → Input Setup → Encoder (Encoder RNN) → Decoder → Output. Inside the decoder, each step runs: Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Batch Reduction, Beam Scoring), which feeds the next step's input.

DECODER EXAMPLE

[Diagram] Per decode step: Input Embedding → Decoder RNN → Attention Model → Projection → TopK → Output Embedding, wrapped by Beam Search (Beam Shuffle, Batch Reduction, Beam Scoring).

Iteration 0: the start token <S> goes in; the top-5 candidates come out: This, The, He, What, The.
Iteration 1: each candidate is fed back in and extended by its beam's next token: is, house, ran, time, cow.

TRAINING VS INFERENCE

[Diagram] Training pipeline: Input → Input Setup → Encoder RNN → Decoder RNN → Attention Model → Projection → Output.
Inference pipeline: the same, plus the search machinery after Projection: TopK → Beam Search (Beam Shuffle, Batch Reduction, Beam Scoring) looping back into the decoder. The search stages exist only at inference time.

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve them?
• What perf is possible?

INFERENCE TIME IS BEAM SEARCH TIME

• Wu et al., 2016, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", arXiv:1609.08144
• Sharan Narang, June 2017, Baidu's DeepBench, https://github.com/baidu-research/DeepBench
• Rui Zhao, December 2017, "Why does inference run 20x slower than training?", https://github.com/tensorflow/nmt/issues/204
• David Levinthal, Ph.D., January 2018, "Evaluating RNN performance across hardware platforms"

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve them?
• What perf is possible?


PERF ANALYSIS


KERNEL ANALYSIS

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve them?
• What perf is possible?

ENCODER

[Diagram] The encoder portion of the pipeline: Input Setup feeding the Encoder RNN.

SETUP

Input: "Hello. This is a test. Bye."
Tokenization: Hello . | This is a test . | Bye .
A Gather packs the token IDs into the padded Encoder Input:

42 23  0  0  0  0
73  3  8 19 23  0
98 23  0  0  0  0

The PrefixSumPlugin fills the Sequence Length Buffer: 2, 5, 2. Setup also prepares the constant Decoder Start Tokens and a Zero State buffer.
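This setup step is easy to picture in NumPy. A minimal sketch, assuming the slide's example token IDs (the variable names are mine):

import numpy as np

# "Hello ." -> [42, 23]; "This is a test ." -> [73, 3, 8, 19, 23]; "Bye ." -> [98, 23]
sequences = [[42, 23], [73, 3, 8, 19, 23], [98, 23]]

max_len = max(len(s) for s in sequences)
encoder_input = np.zeros((len(sequences), max_len), dtype=np.int32)
for row, seq in enumerate(sequences):
    encoder_input[row, :len(seq)] = seq  # right-pad with zeros

seq_lengths = np.array([len(s) for s in sequences], dtype=np.int32)  # [2, 5, 2]
# A prefix sum over the lengths gives row offsets into a packed buffer,
# which is presumably what the PrefixSumPlugin computes: [0, 2, 7, 9]
offsets = np.concatenate(([0], np.cumsum(seq_lengths)))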

ENCODER

[Diagram] The Encoder Input and Sequence Lengths (2, 5, 2) pass through the Embedding Plugin into a PackedRNN, whose initial hidden and cell states are the trained values. It produces the Encoder Hidden State, the Encoder Cell State, and the Context Vector, zero-padded past each sequence's length:

.1   .35 0   0 0   0
.123 .93 1.4 1 .01 0
.42  .20 0   0 0   0
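Packing is what lets the RNN skip padded timesteps instead of wasting compute on them. A hedged PyTorch sketch of the same idea (an analogy, not the TensorRT PackedRNN itself; sizes are arbitrary):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

embed = nn.Embedding(num_embeddings=100, embedding_dim=8, padding_idx=0)
rnn = nn.LSTM(input_size=8, hidden_size=8, batch_first=True)

ids = torch.tensor([[42, 23, 0, 0, 0], [73, 3, 8, 19, 23], [98, 23, 0, 0, 0]])
lengths = torch.tensor([2, 5, 2])

x = embed(ids)                                       # Embedding Plugin equivalent
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
out, (h, c) = rnn(packed)                            # padded steps are never computed
ctx, _ = pad_packed_sequence(out, batch_first=True)  # context vectors, zero-padded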

DECODER

[Diagram] The decoder loop: Input → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Batch Reduction, Beam Scoring) → Output.

Decoder, 1st Iteration

[Diagram] The RNN is seeded with the Encoder Hidden State and Encoder Cell State. The Decoder Input is the start-of-sentence token <S> for every batch entry (Batch 0 ... Batch N), run through the Embedding Plugin. The step emits the Decode Hidden State, the Decode Cell State, and one Decoder Output per batch entry (Batch 0: .124, ..., Batch N: .912).
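The seeding itself is one line of state plumbing. Continuing the PyTorch sketch above (START is a hypothetical <S> token id):

START = 1                                    # hypothetical <S> token id
dec_rnn = nn.LSTM(input_size=8, hidden_size=8, batch_first=True)
start_ids = torch.full((ids.size(0), 1), START, dtype=torch.long)
dec_in = embed(start_ids)                    # Embedding Plugin equivalent
dec_out, (h1, c1) = dec_rnn(dec_in, (h, c))  # seeded with encoder (h, c)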

Decoder, 2nd+ Iteration

[Diagram] From the second step on, the RNN carries the Prev Hidden/Cell State to the Next Hidden/Cell State, and the Decoder Input is one token per beam, run through the Embedding Plugin:

Batch | Beam0 | Beam1 | Beam2 | Beam3 | Beam4
0     | こ    | ん    | に    | ち    | は
N     | さ    | よ    | う    | な    | ら

Decoder Output:

Batch | Beam0 | Beam1 | Beam2 | Beam3 | Beam4
0     | .18   | .32   | .85   | .39   | .75
N     | .79   | .27   | .81   | .93   | .73

Global Attention Model

[Diagram] Inputs: the encoder's Context Vector, the Sequence Length Buffer, and the Decoder Output above. A BatchedGemm (with FullyConnected weights) scores the decoder output against the context vectors; a RaggedSoftmax normalizes those scores while respecting each sequence's true length; a second BatchedGemm forms the attention-weighted context; Concat joins it with the decoder output; and a FullyConnected layer followed by TanH produces the attention output.
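Read this way, the block is Luong-style global attention. A NumPy sketch under that assumption (shapes and names are mine, not the plugin's interface):

import numpy as np

def ragged_softmax(scores, lengths):
    # Mask positions beyond each sequence's true length, then softmax.
    T = scores.shape[1]
    mask = np.arange(T)[None, :] < lengths[:, None]
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

B, T, H = 3, 6, 8                         # batch*beam, max source length, hidden
ctx = np.random.randn(B, T, H)            # encoder context vectors
dec = np.random.randn(B, H)               # decoder output for this step
lengths = np.array([2, 5, 2])

scores = np.einsum('bth,bh->bt', ctx, dec)        # BatchedGemm #1: alignment scores
alpha = ragged_softmax(scores, lengths)           # RaggedSoftmax
attn_ctx = np.einsum('bt,bth->bh', alpha, ctx)    # BatchedGemm #2: weighted context
concat = np.concatenate([attn_ctx, dec], axis=1)  # Concat
W = np.random.randn(2 * H, H)                     # FullyConnected weights
attn_out = np.tanh(concat @ W)                    # FullyConnected + TanH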

Projection

[Diagram] The Attention Output, one vector per batch entry and beam:

Batch | Beam0     | Beam1     | Beam2    | Beam3     | Beam4
0     | [.9,…,.1] | [0,…,.3]  | [.1,…,0] | [.6,…,.8] | [.3,…,.2]
N     | [.4,…,.9] | [.5,…,.2] | [0,…,.7] | [0,…,2]   | [.1,…,.9]

is multiplied by the FullyConnected weights to give vocabulary logits; Softmax followed by Log turns them into the per-token log-probabilities of the Projection Output.
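Softmax followed by Log is usually fused into one numerically stable log-softmax; a sketch of the projection step (mine, with an arbitrary vocabulary size):

import numpy as np

def log_softmax(logits):
    # Stable log(softmax(x)): shift by the max, subtract log-sum-exp.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

B, H, V = 10, 8, 50000                 # batch*beam, hidden, vocabulary size
attn_out = np.random.randn(B, H)
W = np.random.randn(H, V)              # FullyConnected (projection) weights
log_probs = log_softmax(attn_out @ W)  # Projection Output: (B, V) log-probs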

TopK Part 1

[Diagram] An intra-beam TopK first selects the best candidates within each beam of the Projection Output. Intra-beam output (top-2 per beam, one batch entry shown):

Beam  | Index | Prob
Beam0 | [1,3] | [.9,.8]
Beam1 | [2,4] | [.99,.55]
Beam2 | [9,0] | [.3,.8]
Beam3 | [5,0] | [.1,.93]
Beam4 | [7,6] | [.85,.99]

A Gather then flattens these per-beam results for the inter-beam pass.

TopK Part 2

[Diagram] Gather output (flattened probabilities):

Prob [.9, .8, .99, .55, .3, .8, .1, .93, .85, .99]

An inter-beam TopK over the flattened list yields:

Indices [2, 9, 7, 0, 8]
Prob [.99, .99, .93, .9, .85]

The Beam Mapping Plugin translates the flat indices back to the intra-beam output's (beam, index) pairs:

Output: Beam1,Idx2 | Beam4,Idx6 | Beam3,Idx0 | Beam0,Idx1 | Beam4,Idx7
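The two-stage TopK is easy to reproduce in NumPy with the slide's example values (the reshape bookkeeping is my own; tie order between the two .99 entries may vary):

import numpy as np

# Intra-beam output from Part 1: top-2 (token index, prob) per beam.
probs  = np.array([[.9, .8], [.99, .55], [.3, .8], [.1, .93], [.85, .99]])
tokens = np.array([[1, 3],   [2, 4],     [9, 0],   [5, 0],    [7, 6]])

flat = probs.reshape(-1)                # Gather: [.9 .8 .99 .55 .3 .8 .1 .93 .85 .99]
k = 5
top = np.argpartition(flat, -k)[-k:]    # inter-beam TopK, unordered
top = top[np.argsort(flat[top])[::-1]]  # -> [2, 9, 7, 0, 8]

# Beam Mapping Plugin: flat index -> (source beam, vocabulary token).
beams = top // probs.shape[1]               # [1, 4, 3, 0, 4]
toks = tokens[beams, top % probs.shape[1]]  # [2, 6, 0, 1, 7]
print(list(zip(beams, toks, flat[top])))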

Beam Search – Beam Shuffle

Selected candidates: Beam1,Idx2 | Beam4,Idx6 | Beam3,Idx0 | Beam0,Idx1 | Beam4,Idx7

[Diagram] The Beam Shuffle Plugin reorders the per-beam states to match the survivors: the slots holding Beam0...Beam4 State now receive Beam1, Beam4, Beam3, Beam0, and Beam4 State+1. Beam4's state is duplicated because two of its candidates survived, while Beam2's state is dropped.
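The shuffle is a gather along the beam axis; duplication and dropping fall out of the indexing (a sketch, names mine):

import numpy as np

survivors = np.array([1, 4, 3, 0, 4])  # source beam of each surviving candidate
hidden = np.random.randn(5, 8)         # one state row per beam
cell = np.random.randn(5, 8)

# Beam Shuffle Plugin equivalent: Beam4 is copied twice, Beam2 disappears.
hidden, cell = hidden[survivors], cell[survivors]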

Beam Search – Beam Scoring

Selected candidates: Beam1,Idx2 | Beam4,Idx6 | Beam3,Idx0 | Beam0,Idx1 | Beam4,Idx7

[Diagram] Given the previous states (Beam0...Beam4 State) and the shuffled new states (Beam0...Beam4 State+1), the Beam Scoring Plugin performs, per step:

• EOS detection
• Sentence probability update
• Backtrack state storage
• Sequence length increment
• End-of-beam/batch heuristic

Its result is the Batch Finished Bitmap [0001100011…010] marking completed sentences (see the sketch below).
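A sketch of that bookkeeping, assuming an EOS token id and per-step log-probabilities (all names are mine):

import numpy as np

EOS = 2  # hypothetical end-of-sequence token id

def score_step(tokens, step_logprobs, scores, lengths, finished):
    newly_done = (tokens == EOS) & ~finished                     # EOS detection
    scores = np.where(finished, scores, scores + step_logprobs)  # probability update
    lengths = np.where(finished, lengths, lengths + 1)           # length increment
    finished = finished | newly_done                             # the finished bitmap
    return scores, lengths, finished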

Beam Search – Batch Reduction

[Diagram] The Batch Finished Bitmap [0001100011…010] goes through a reduce operation (sum), and a single 32-bit value is transferred to the host as the new batch size. TopK and Gather, driven by the Encoder/State Reduction Plugin, then compact the Encoder Output and Beam State down to the still-active entries.
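In NumPy terms the reduction is a sum plus a gather that packs live entries to the front, so later kernels launch with the smaller batch (a sketch, not the plugin's real mechanics):

import numpy as np

finished = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1], dtype=bool)
new_batch = int((~finished).sum())      # reduce(sum) + 32-bit host transfer
live = np.nonzero(~finished)[0]         # indices of still-active sentences

enc_out = np.random.randn(10, 6, 8)     # [batch, source len, hidden]
beam_state = np.random.randn(10, 5, 8)  # [batch, beam, hidden]
enc_out = enc_out[live]                 # Encoder/State Reduction:
beam_state = beam_state[live]           # both now have new_batch rows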

Output

Selected candidates: Beam1,Idx2 | Beam4,Idx6 | Beam3,Idx0 | Beam0,Idx1 | Beam4,Idx7

[Diagram] All done? If no, the candidates become the next Decoder Input. If yes, the Beam State is copied device-to-host, and on the host the stored beam state is backtracked into the final output:

こんにちは。これはテストです。さようなら。 ("Hello. This is a test. Bye.")
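The host-side backtrack walks stored backpointers from a final beam to the start; a compact sketch with toy data (the layout of back/tok is my assumption):

# back[t][b]: source beam of beam b at step t; tok[t][b]: token it emitted.
def backtrack(back, tok, best_beam):
    out, b = [], best_beam
    for t in range(len(tok) - 1, -1, -1):
        out.append(tok[t][b])
        b = back[t][b]
    return out[::-1]  # tokens in sentence order

back = [[0, 0], [0, 0], [0, 1]]
tok = [["こ", "さ"], ["ん", "よ"], ["に", "う"]]
print("".join(backtrack(back, tok, best_beam=0)))  # こんに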


TENSORRT ANALYSIS


TENSORRT KERNEL ANALYSIS

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve them?
• What perf is possible?

RESULTS

[Chart: inference throughput (sentences/sec, bars) and latency (ms, line) for CPU-Only + Torch, V100 + Torch, and V100 + TensorRT. Data labels on the slide: 425 and 550 sentences/sec; latencies 280 ms, 153 ms, and 117 ms respectively]

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on.

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

SUMMARY

• Showed that TopK no longer dominates sequence inference time.
• Showed that RNN inference is compute bound, not memory bound.
• TensorRT accelerates sequence inference.
• Over two orders of magnitude higher throughput than CPU-only.
• Latency reduced by more than half versus CPU-only.

developer.nvidia.com/tensorrt
PRODUCT PAGE

LEARN MORE

PRODUCT PAGE: developer.nvidia.com/tensorrt
DOCUMENTATION: docs.nvidia.com/deeplearning/sdk
TRAINING: nvidia.com/dli


Q&A
