Why Machine Translation?

Posted on 09-Jul-2020

TRANSCRIPT

Page 1: Why Machine Translation?

Page 2:

Why Machine Translation?

• Of great business value

https://nimdzi.com/wp-content/uploads/2018/03/2018-Nimdzi-100-First-Edition.pdf

Page 3:

Page 4:

Page 5:

e = (Economic, growth, has, slowed, down, in, recent, years, .)
f = (经济, 发展, 变, 慢, 了, 近, 几年, .)

[Figure: an encoder-decoder model; the encoder reads one sentence and the decoder emits the other, with illustrative per-word scores such as 0.2 and 0.9.]

Page 6:

Encoder Recurrent Neural Network and Decoder Recurrent Neural Network

[Figure: the encoder RNN reads the source words x_i into source states h_i; the decoder RNN updates its recurrent state s_i and samples target words y_i one at a time.]

f = (经济, 发展, 变, 慢, 了, 近, 几年, .)
e = (Economic, growth, has, slowed, down, in, recent, years, .)

Sutskever et al., NIPS, 2014

Page 7:

[Figure: attention-based encoder-decoder. The encoder turns the source words into source vectors h_j; attention weights (⊙) are summed (⊕) into a context vector c_i; the decoder combines c_i with its recurrent state s_i to sample the next word y_i. Decoding has produced 近, 几年 so far.]

f = (经济, 发展, 变, 慢, 了, 近, 几年, .)
e = (Economic, growth, has, slowed, down, in, recent, years, .)

Bahdanau et al., ICLR, 2015

Page 8:

Bahdanau et al., ICLR, 2015

[Figure: the same attention-based encoder-decoder one step later; the decoder has now also produced 经济, and the attention module again combines the source vectors h_j into a context vector c_i for the recurrent state s_i and the next word sample y_i.]

f = (经济, 发展, 变, 慢, 了, 近, 几年, .)
e = (Economic, growth, has, slowed, down, in, recent, years, .)

Page 9:

Page 10:

Page 11:

Page 12:

[Figure: embedding visualizations for a Translation model and a Word2Vec model, with rare words and popular words marked separately.]

Page 13:

[Figure: the same rare-word vs. popular-word embedding visualization for Translation and Word2Vec.]

Page 14:

1. We found that more than 50% of rare words are semantically related to popular words (e.g., citizen-citizenship), but such relationships are not reflected in the embedding space.

2. This in turn limits the performance of downstream tasks that use the embeddings.

• Text classification: "Peking is a wonderful city" != "Beijing is a wonderful city"

Page 15:

We want popular words and rare words to be mixed together in the embedding space, so that one cannot tell the frequency of a word (popular or rare) from its embedding.

To achieve this, we train a discriminator together with the NLP model using adversarial training.
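The adversarial idea can be sketched in a few lines of numpy (an illustration only, not the authors' FRAGE implementation; the sizes, learning rates, and the logistic discriminator here are all hypothetical): a discriminator is trained to predict word frequency from the embeddings, and an adversarial gradient step then pushes the embeddings to fool it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all sizes hypothetical): 20 "popular" and 20 "rare" words
# with 8-dimensional embeddings.
E = rng.normal(size=(40, 8))
is_rare = np.array([0.0] * 20 + [1.0] * 20)

# Logistic-regression discriminator trying to predict word frequency
# (popular vs. rare) from the embedding alone.
v = np.zeros(8)
b = 0.0

def disc_loss(emb):
    """Binary cross-entropy of the frequency discriminator."""
    p = 1.0 / (1.0 + np.exp(-(emb @ v + b)))
    eps = 1e-9
    return -np.mean(is_rare * np.log(p + eps) + (1.0 - is_rare) * np.log(1.0 - p + eps))

# Step 1: train the discriminator on the current embeddings.
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(E @ v + b)))
    g = p - is_rare                      # d(loss)/d(logits)
    v -= 0.5 * (E.T @ g) / len(E)
    b -= 0.5 * g.mean()

before = disc_loss(E)

# Step 2 (adversarial): update the embeddings to *increase* the discriminator
# loss, so that frequency becomes harder to read off the embedding space.
p = 1.0 / (1.0 + np.exp(-(E @ v + b)))
grad_E = np.outer(p - is_rare, v) / len(E)
E_adv = E + 1.0 * grad_E                 # gradient ascent on the embeddings

after = disc_loss(E_adv)
print(before < after)
```

In the full method this adversarial step is interleaved with the task loss of the NLP model, so the embeddings stay useful while becoming frequency-agnostic.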

Page 16:

[Figure: embedding-space visualizations, FRAGE vs. Baseline.]

Page 17:

BLEU scores:

Task        | Baseline Transformer | Ours
WMT En-De   | 28.4                 | 29.11
IWSLT De-En | 33.12                | 33.97

Page 18:

Word similarity

Language modeling

Text classification

Page 19:

Page 20:

AI Tasks: Primal Task (X → Y) vs. Dual Task (Y → X)

AI task             | Primal Task (X → Y)       | Dual Task (Y → X)
Machine translation | Translation from EN to CH | Translation from CH to EN
Speech processing   | Speech recognition        | Text to speech
Image understanding | Image captioning          | Image generation
Conversation        | Question answering        | Question generation (e.g., Jeopardy!)
Search engine       | Query-document matching   | Query/keyword suggestion

Currently, most machine learning algorithms do not exploit structural duality for training and inference.

Page 21:

Dual Learning

• A new learning framework that leverages the primal-dual structure of AI tasks to obtain effective feedback or regularization signals to enhance the learning/inference process.

• Algorithms:
  • Dual unsupervised learning (NIPS 2016)
  • Dual supervised learning (ICML 2017)
  • Multi-agent dual learning (ongoing work)

Page 22:

Page 23:

Dual Unsupervised Learning

English sentence x → Chinese sentence y = f(x) → new English sentence x' = g(y)

Primal Task f: x → y (En→Ch translation, one agent)
Dual Task g: y → x (Ch→En translation, another agent)

Feedback signals during the loop:
• R(x, f, g) = s(x, x'): BLEU of x' given x
• L(y) and L(x'): language-model scores of y and x'

Reinforcement learning algorithms can be used to improve both the primal and dual models according to these feedback signals.
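The feedback loop can be illustrated with a toy sketch (the word-for-word "translators" and the word-overlap stand-in for BLEU below are hypothetical; the real system uses NMT models, BLEU, language models, and policy-gradient updates):

```python
def jaccard(a, b):
    """Stand-in for BLEU: word-overlap similarity of two sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Hypothetical word-for-word "translators" standing in for f (En->Ch) and g (Ch->En).
EN2CH = {"economic": "经济", "growth": "发展", "slowed": "变慢"}
CH2EN = {v: k for k, v in EN2CH.items()}

def f(x):  # primal model
    return " ".join(EN2CH.get(w, w) for w in x.split())

def g(y):  # dual model
    return " ".join(CH2EN.get(w, w) for w in y.split())

def lm_score(sentence, vocab):
    """Toy language model: fraction of words that are in-vocabulary."""
    words = sentence.split()
    return sum(w in vocab for w in words) / len(words)

def loop_feedback(x):
    y = f(x)           # primal translation
    x_prime = g(y)     # dual reconstruction
    # R(x, f, g) = s(x, x'), plus L(y) and L(x') language-model signals.
    return jaccard(x, x_prime), lm_score(y, set(CH2EN)), lm_score(x_prime, set(EN2CH))

print(loop_feedback("economic growth slowed"))
```

The key point is that all three signals are computed from monolingual input x alone, which is what lets dual learning exploit unlabeled data.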

Page 24:

Experimental Setting

• Baseline: Neural Machine Translation (NMT)
  • One-layer RNN model, trained on 100% of the bilingual data (10M)
  • "Neural Machine Translation by Jointly Learning to Align and Translate", Bengio's group (ICLR 2015)

• Our algorithm:
  • Step 1: Initialization. Start from a weak NMT model learned from only 10% of the training data (1M).
  • Step 2: Dual learning. Use the policy gradient algorithm to update the dual models based on monolingual data.

[Chart: BLEU score, French→English, for three systems: NMT with 10% bilingual data, dual learning with 10% bilingual data, and NMT with 100% bilingual data. Dual learning with 10% of the data comes within 0.3 BLEU (1%) of NMT with 100% of the data, while plain NMT with 10% of the data is 5.0 BLEU (17%) below it.]

Starting from initial models obtained from only 10% bilingual data, dual learning can achieve accuracy similar to the NMT model learned from 100% bilingual data!

Page 25:

Probabilistic View of Structural Duality

• The structural duality implies strong probabilistic connections between the models of dual AI tasks.

• This can be used beyond unsupervised learning:
  • As a structural regularizer to enhance supervised learning
  • As an additional criterion to improve inference

P(x, y) = P(x) P(y|x; f)   [primal view]
        = P(y) P(x|y; g)   [dual view]

Page 26:

Page 27:

Dual Supervised Learning

Labeled data: x with label y = f(x); conversely x = g(y)

Primal Task f: x → y (one agent)
Dual Task g: y → x (another agent)

Feedback signal during the loop:
• R(x, f, g) = |P(x) P(y|x; f) − P(y) P(x|y; g)|: the gap between the joint probability P(x, y) computed in the two directions

Training objectives:
  max log P(y|x; f)
  max log P(x|y; g)
  min |P(x) P(y|x; f) − P(y) P(x|y; g)|

Probabilistic duality: P(x, y) = P(x) P(y|x; f) = P(y) P(x|y; g)

Page 28:

Results

Theoretical analysis:
• Dual supervised learning generalizes better than standard supervised learning.
• It searches the product space of the two models satisfying probabilistic duality: P(x) P(y|x; f) = P(y) P(x|y; g).

[Chart: BLEU improvement of dual supervised learning over NMT: En→Fr ↑2.1 (7%), Fr→En ↑0.9 (3%), En→De ↑1.4 (8%), De→En ↑0.1 (0.5%).]

Page 29:

Dual Learning for Deep NMT Models

[1] Deep recurrent models with fast-forward connections for neural machine translation. TACL, 2016.
[2] Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
[3] Convolutional sequence to sequence learning. ICML 2017.
[4] Attention Is All You Need. NIPS 2017.
[5] Our own deep LSTM model: 4-layer encoder, 4-layer decoder. NIPS 2017.

[Chart: English→French BLEU. Baidu Deep LSTM [1]: 39.2; Google Deep LSTM + reinforcement learning [2]: 39.92; Facebook CNN [3]: 40.55; Google Transformer [4]: 41; our Deep LSTM + dual learning [5]: 41.53.]

Page 30:

Human Parity in Machine Translation

newstest2017: AI score 69.5, human score 69.0

Dual learning
Deliberation learning

March 2018

Page 31:

Page 32:

Refresh of Dual Learning

English sentence x → Chinese sentence y = f(x) → new English sentence x' = g(y)

Primal Task f: x → y (En→Zh translation, one agent)
Dual Task g: y → x (Zh→En translation, another agent)

Feedback signals during the loop:
• Δ_x(x, x'; f, g): similarity of x' and x
• Δ_y(y, y'; g, f): similarity of y' and y

Training objective function:

(1/‖M_x‖) Σ_{x ∈ M_x} Δ_x(x, g(f(x))) + (1/‖M_y‖) Σ_{y ∈ M_y} Δ_y(y, f(g(y)))

Page 33:

Motivation

English sentence x → Chinese sentence y = f(x) → new English sentence x' = g(y)

In this loop, f is the generator (y' = f(x)) and g is the evaluator of f (via Δ_x(x, g(y'))). Evaluation quality plays a central role.

Primal Task f: x → y (En→Zh translation); Dual Task g: y → x (Zh→En translation)

Employing multiple agents can improve evaluation quality: Multi-Agent Dual Learning.

Page 34:

Framework

English sentence x → Chinese sentence y = F_α(x) → new English sentence x' = G_β(y)

Primal Task F_α: x → y (En→Zh translation): F_α = Σ_{i=0}^{N-1} α_i f_i, where each f_i with i ≥ 1 is fixed
Dual Task G_β: y → x (Zh→En translation): G_β = Σ_{j=0}^{N-1} β_j g_j, where each g_j with j ≥ 1 is fixed

Feedback signals during the loop:
• Δ_x(x, x'; F_α, G_β): similarity of x' and x
• Δ_y(y, y'; G_β, F_α): similarity of y' and y

Training objective function:

(1/‖M_x‖) Σ_{x ∈ M_x} Δ_x(x, G_β(F_α(x))) + (1/‖M_y‖) Σ_{y ∈ M_y} Δ_y(y, F_α(G_β(y)))

Only f_0 and g_0 are trained and updated.
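The ensemble combination F_α = Σ_i α_i f_i can be sketched as follows (toy next-token distributions; the agents and weights below are hypothetical):

```python
import numpy as np

def ensemble_predict(models, weights, x):
    """Multi-agent combination (sketch): F_alpha(. | x) = sum_i alpha_i f_i(. | x).
    `models` give per-agent next-token distributions; `weights` sum to 1."""
    probs = np.stack([m(x) for m in models])        # (num_agents, vocab_size)
    return np.average(probs, axis=0, weights=weights)

# Hypothetical toy agents over a 4-word vocabulary.
f0 = lambda x: np.array([0.7, 0.1, 0.1, 0.1])  # trainable agent
f1 = lambda x: np.array([0.4, 0.4, 0.1, 0.1])  # fixed, pre-trained agent
f2 = lambda x: np.array([0.5, 0.2, 0.2, 0.1])  # fixed, pre-trained agent

alpha = [0.5, 0.25, 0.25]
p = ensemble_predict([f0, f1, f2], alpha, x=None)
print(p, p.sum())
```

Because the fixed agents only score (never get updated), the ensemble acts as a stronger evaluator in the dual loop while gradients flow only into f_0 and g_0.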

Page 35:

IWSLT 2014 (<200k bilingual data), BLEU:

Direction | Transformer | Standard Dual | Multi-Agent Dual
De→En     | 33.61       | 34.57         | 35.44
En→De     | 27.95       | 29.07         | 29.52
Es→En     | 40.5        | 42.14         | 42.41
En→Es     | 37.95       | 38.09         | 38.62
Ru→En     | 18.94       | 21.14         | 21.56
En→Ru     | 16.06       | 16.31         | 16.71
He→En     | 33.25       | 35.42         | 36.08
En→He     | 22.2        | 22.75         | 23.8

Languages: German (De), Spanish (Es), Russian (Ru), Hebrew (He).

State-of-the-art results. De↔En: 2 × 5 agents; {Es, Ru, He}↔En: 2 × 3 agents.

Page 36:

WMT 2014 (4.5M bilingual data), BLEU:

Setting                 | Transformer | Standard Dual | Multi-Agent Dual
En→De (+0M monolingual) | 28.4        | 29.44         | 30.05
De→En (+0M monolingual) | 32.15       | 32.99         | 33.4
En→De (+8M monolingual) | 28.4        | 29.93         | 30.67
De→En (+8M monolingual) | 32.15       | 34.98         | 35.64

State-of-the-art results with WMT 2014 data only. En↔De: 2 × 3 agents.

Page 37:

Dual unsupervised learning:
• Improves the efficiency of unlabeled data
• Also works for semi-supervised learning

Dual supervised learning:
• Improves the efficiency of labeled data
• Focuses on the probabilistic connection implied by structural duality

Multi-agent dual learning:
• Ensembles multiple primal and dual models to improve data efficiency
• Works for both labeled and unlabeled data

Page 38:

More on Dual Learning

• DualGAN for image translation (ICCV 2017)
• Dual face manipulation (CVPR 2017)
• Semantic image segmentation (CVPR 2017)
• Question generation/answering (EMNLP 2017)
• Image captioning (CIKM 2017)
• Dual transfer learning (AAAI 2018)
• Unsupervised machine translation (ICLR 2018)

Page 39:

More on Dual Learning

• Model-level dual learning (ICML 2018)
• Conditional image translation (CVPR 2018)
• Visual question generation/answering (CVPR 2018)
• Face aging/rejuvenation (IJCAI 2018)
• Safe semi-supervised learning (IEEE Access 2018)
• Image rain removal (BMVC 2018)
• ...

Page 40:

Page 41:

Background

• Neural machine translation models are usually based on an autoregressive factorization:

P(y|x) = P(y_1|x) × P(y_2|y_1, x) × ... × P(y_T|y_1, ..., y_{T-1}, x)

[Figure: the Transformer architecture. The encoder (×M) embeds the source words x_1..x_3 and applies multi-head self-attention and FFN layers to produce the context. The decoder embeds <sos>, y_1..y_3, applies masked multi-head self-attention, multi-head encoder-to-decoder attention, and FFN layers, and emits y_1..y_4 through per-position softmax layers.]

Page 42:

Inference Latency Bottleneck

• Parallelizable training vs. non-parallelizable inference

[Figure: two copies of the Transformer encoder-decoder, contrasting training, where all target positions are computed in parallel, with inference, where tokens must be generated one after another.]

Inference time per example:
  Transformer (6-layer encoder-decoder NMT): 700 ms
  ResNet (50-layer classifier): 60 ms

Page 43:

Non-autoregressive NMT

[Figure: the non-autoregressive Transformer. The encoder (×M) reads x_1..x_3; the decoder (×N) input is copied from the source embeddings; a multi-head positional attention layer is added; all outputs y_1..y_4 are emitted through softmax layers in parallel.]

• Use a deep neural network to predict the target length
• Use a deep neural network to copy source embeddings to target embeddings
• Generate target tokens in parallel
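Contrast this with the autoregressive factorization: once the target length T is predicted, every token probability is conditioned only on the source, so all tokens can be scored and decoded in parallel (a sketch with hypothetical probabilities):

```python
import math

def nonautoregressive_logprob(p_len, token_probs):
    """log P(y | x) in a non-autoregressive model (sketch):
    P(y|x) = P(T|x) * prod_t P(y_t | x), where the target length T is
    predicted first and the tokens are then generated independently of
    each other, hence in parallel."""
    return math.log(p_len) + sum(math.log(p) for p in token_probs)

# Length T=3 predicted with probability 0.5; the three token probabilities
# do not depend on each other, so decoding needs one parallel step.
logp = nonautoregressive_logprob(0.5, [0.9, 0.5, 0.8])
print(math.exp(logp))  # ~0.18 = 0.5 * 0.9 * 0.5 * 0.8
```

Dropping the dependence on previously generated tokens is what buys the speedup, and it is also why quality degrades (repetitions, omissions), as the later slides show.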

Page 44:

Non-autoregressive NMT (NART vs. ART)

[Figure: the non-autoregressive Transformer architecture, as on the previous slide.]

WMT En-De:
        | Transformer | NART
BLEU    | 27.3        | 17.69
Speedup | 1×          | 15×

Page 45:

Page 46:

[Figure: side-by-side comparison. Left: the autoregressive Transformer decoder, fed <sos>, y_1..y_3 with masked multi-head self-attention. Right: the non-autoregressive decoder, whose input is copied from the source embeddings and which adds a multi-head positional attention layer.]

Page 47:

[Figure: a non-autoregressive Transformer in which the decoder inputs y'_1..y'_4 are obtained from the source words x_1..x_3 by phrase-to-phrase translation instead of plain copying.]

Page 48:

[Figure: a non-autoregressive Transformer in which the copied source embeddings are passed through a linear transform, trained adversarially against a discriminator.]

Page 49:

Page 50:

Page 51:

Source:      vor einem jahr oder so , las ich eine studie , die mich wirklich richtig umgehauen hat .
Target:      i read a study a year or so ago that really blew my mind wide open .
Transformer: one year ago , or so , i read a study that really blew me up properly .
NART:        so a year , , i was to a a a year or , read that really really really me me me .

Problems of the NART output:
• Repetitive translation
• Incomplete translation

Page 52:

[Figure: the NART model with two regularizers. The NAT encoder (×M) reads x_1..x_4; the NAT decoder (×N) applies multi-head self-attention, positional attention, inter-attention, and feed-forward layers to produce hidden states h_1..h_5 and outputs y_1..y_5. Here s_t(h) denotes s_cos(h_t, h_{t+1}) and s_t(y) denotes s_cos(y_t, y_{t+1}) in Eqn. 2; they feed the similarity loss L_sim. A backward encoder/decoder translates the output back to x_1..x_4 (backward translation, with a uniform mapping between the stacked units).]

• Handle repetitive translations by similarity regularization
• Handle incomplete translations by dual-translation regularization
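A possible form of the similarity regularizer (a sketch; the paper's Eqn. 2 may differ in details such as the exact similarity function and scaling):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_regularizer(states):
    """Sketch of a similarity regularizer for NART hidden states: penalize
    consecutive decoder states that are too similar, since near-identical
    states tend to emit the same (repeated) token. Averages cos(h_t, h_{t+1})
    over adjacent pairs."""
    pairs = zip(states[:-1], states[1:])
    return sum(cosine(a, b) for a, b in pairs) / (len(states) - 1)

# Identical consecutive states give the maximal penalty; orthogonal ones give none.
h_repeat = [np.array([1.0, 0.0])] * 3
h_varied = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(similarity_regularizer(h_repeat), similarity_regularizer(h_varied))
```

Adding such a term to the training loss pushes adjacent decoder states apart, which discourages the "really really really" style repetitions shown on the previous slide.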

Page 53:

Page 54:

• Hard word translation: use a phrase translation table to enhance the decoder input.
• Soft embedding mapping: linearly transform source word embeddings to target word embeddings to enhance the decoder input.
• Similarity regularization: regularize the hidden states of the NART model to avoid repeated translations.
• Dual-translation regularization: handle incomplete translations through back-translation.

Page 56:

Page 57:

deep learning + reinforcement learning

[email protected]
http://research.Microsoft.com/~taoqin