An Introduction to Neural Machine Translation

Prof. John D. Kelleher (@johndkelleher)
ADAPT Centre for Digital Content Technology
Dublin Institute of Technology, Ireland
June 25, 2018




Outline

The Neural Machine Translation Revolution

Neural Networks 101

Word Embeddings

Language Models

Neural Language Models

Neural Machine Translation

Beyond NMT: Image Annotation


Image from https://blogs.msdn.microsoft.com/translation/


Image from https://www.blog.google/products/translate


Image from https://techcrunch.com/2017/08/03/facebook-finishes-its-move-to-neural-machine-translation/


Neural Networks 101


What is a function?

A function maps a set of inputs (numbers) to an output (number).¹

sum(2, 5, 4) → 11

¹This introduction to neural networks and machine translation is based on Kelleher (2016).

What is a weightedSum function?

weightedSum([x1, x2, ..., xm], [w1, w2, ..., wm])

= (x1 × w1) + (x2 × w2) + ... + (xm × wm)

where the first list holds the input numbers and the second the weights, e.g.:

weightedSum([3, 9], [−3, 1])

= (3 × −3) + (9 × 1)

= −9 + 9

= 0
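The weightedSum function is one line of Python (the name weighted_sum and the example values follow the slide):

```python
def weighted_sum(inputs, weights):
    # Multiply each input by its weight and add up the results.
    return sum(x * w for x, w in zip(inputs, weights))

print(weighted_sum([3, 9], [-3, 1]))  # (3 * -3) + (9 * 1) = 0
```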


What is an activation function?

An activation function takes the output of our weightedSum function and applies another mapping to it.


What is an activation function?

(Plots: three common activation functions over z.)

logistic(z) = 1 / (1 + e^(−z))

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

rectifier(z) = max(0, z)
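The three activation functions above, sketched in Python (math.tanh could be used directly; the explicit form mirrors the slide's formula):

```python
import math

def logistic(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Squashes any real number into the range (-1, 1).
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def rectifier(z):
    # Also known as ReLU: zero for negative inputs, identity otherwise.
    return max(0.0, z)

print(logistic(0.0), tanh(0.0), rectifier(-3.0))  # 0.5 0.0 0.0
```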


What is an activation function?

activation = logistic(weightedSum([x1, x2, ..., xm], [w1, w2, ..., wm]))

logistic(weightedSum([3, 9], [−3, 1]))

= logistic((3 × −3) + (9 × 1))

= logistic(−9 + 9)

= logistic(0)

= 0.5


What is a Neuron?

The simple list of operations that we have just described defines the fundamental building block of a neural network: the neuron.

Neuron = activation(weightedSum([x1, x2, ..., xm], [w1, w2, ..., wm]))
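Putting the two pieces together, a minimal Python sketch of a neuron (the default logistic activation follows the earlier slides):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, activation=logistic):
    # A neuron is just an activation function applied to a weighted sum.
    z = sum(x * w for x, w in zip(inputs, weights))
    return activation(z)

print(neuron([3, 9], [-3, 1]))  # logistic(0) = 0.5
```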


What is a Neuron?

(Diagram: inputs x0 ... xm are multiplied by weights w0 ... wm, summed (Σ), and passed through an activation function (φ).)

Image generated using code from Martin Thoma https://github.com/MartinThoma


What is a Neural Network?

(Diagram: a fully connected network with an input layer, three hidden layers, and an output layer.)


Training a Neural Network

- We train a neural network by iteratively updating the weights.

- We start by randomly assigning weights to each edge.

- We then show the network examples of inputs and expected outputs and update the weights using backpropagation so that the network outputs match the expected outputs.

- We keep updating the weights until the network is working the way we want.
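A toy sketch of this loop for a single linear neuron trained on one example; the simple error-times-input update below stands in for full backpropagation, which pushes the same kind of update through every layer of a deep network:

```python
import random

random.seed(0)
inputs, target = [3.0, 9.0], 1.0

# 1. Start with randomly assigned weights.
weights = [random.uniform(-1, 1) for _ in inputs]

# 2. Repeatedly show the example and nudge the weights.
for _ in range(100):
    output = sum(x * w for x, w in zip(inputs, weights))
    error = output - target                 # compare with the expected output
    weights = [w - 0.01 * error * x         # gradient-descent update
               for w, x in zip(weights, inputs)]

output = sum(x * w for x, w in zip(inputs, weights))
print(round(output, 6))  # converges to the target, 1.0
```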


Word Embeddings


Word Embeddings

- Language is sequential and has lots of words.


“a word is characterized by the company it keeps”

— Firth, 1957


Word Embeddings

1. Train a network to predict the word that is missing from the middle of an n-gram (or to predict the n-gram from the word).

2. Use the trained network weights to represent the word in vector space.


Word Embeddings

Each word is represented by a vector of numbers that positions the word in a multi-dimensional space, e.g.:

king = ⟨55, −10, 176, 27⟩

man = ⟨10, 79, 150, 83⟩

woman = ⟨15, 74, 159, 106⟩

queen = ⟨60, −15, 185, 50⟩
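With the toy vectors above, the analogy arithmetic can be checked directly in Python (these four-dimensional vectors are illustrative; real embeddings have hundreds of dimensions):

```python
king  = [55, -10, 176, 27]
man   = [10,  79, 150, 83]
woman = [15,  74, 159, 106]
queen = [60, -15, 185, 50]

# king - man + woman, computed dimension by dimension
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # with these toy vectors the match is exact: [60, -15, 185, 50]
```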


Word Embeddings

vec(King) − vec(Man) + vec(Woman) ≈ vec(Queen)²

²Linguistic Regularities in Continuous Space Word Representations, Mikolov et al. (2013)


Language Models


Language Models

- Language is sequential and has lots of words.


1,2,?


(Bar chart: a probability distribution over the digits 0–9 for the next symbol in the sequence 1, 2, ?.)


Th?


(Bar chart: a probability distribution over the letters a–z for the next character after "Th".)


- A language model can compute:

1. the probability of an upcoming symbol:

   P(wn | w1, ..., wn−1)

2. the probability of a sequence of symbols:³

   P(w1, ..., wn)

³We can go from 1 to 2 using the chain rule of probability: P(w1, w2, w3) = P(w1) P(w2|w1) P(w3|w1, w2)
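A minimal Python sketch of both computations; the hard-coded bigram probabilities are made up for illustration, and the chain rule from the footnote turns the next-symbol probabilities into a sequence probability:

```python
# Hypothetical bigram model: P(next word | previous word).
bigram = {
    ("<s>", "feel"): 0.2,
    ("feel", "the"): 0.9,
    ("the", "force"): 0.1,
}

def sequence_probability(words, probs):
    # Chain rule: P(w1, w2, w3) = P(w1) * P(w2|w1) * P(w3|w1, w2);
    # a bigram model approximates each factor by P(wi | wi-1).
    p = 1.0
    for prev, word in zip(["<s>"] + words, words):
        p *= probs.get((prev, word), 0.0)
    return p

print(sequence_probability(["feel", "the", "force"], bigram))  # 0.2 * 0.9 * 0.1
```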


- Language models are useful for machine translation because they help with:

1. word ordering

   P(Yes I can help you) > P(Help you I can yes)⁴

2. word choice

   P(Feel the Force) > P(Eat the Force)

⁴Unless it's Yoda that's speaking.


Neural Language Models


Recurrent Neural Networks

A particular type of neural network that is useful for processing sequential data (such as language) is the Recurrent Neural Network (RNN).


Recurrent Neural Networks

Using an RNN we process our sequential data one input at a time.

In an RNN the outputs of some of the neurons for one input are fed back into the network as part of the next input.


Simple Feed-Forward Network

(Diagram: Input 1, Input 2, Input 3, ... feed through an input layer and a hidden layer to an output layer.)

Recurrent Neural Networks

(Diagrams, slides 35–41: the same network redrawn step by step. A buffer is added alongside the hidden layer; at each time step the hidden layer's outputs are copied into the buffer, and at the next time step the buffer's contents are fed back into the hidden layer together with the new input.)


(Figure: a Recurrent Neural Network; the previous hidden state h_{t−1} is fed back in with the current input x_t to produce the new hidden state h_t and output y_t.)

h_t = φ(W_hh · h_{t−1} + W_xh · x_t)

y_t = φ(W_hy · h_t)
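The two update equations, sketched in plain Python with tanh as the activation φ (the matvec helper and the tiny weight matrices are assumptions for illustration):

```python
import math

def matvec(M, v):
    # Matrix-vector product in plain Python.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rnn_step(x_t, h_prev, W_hh, W_xh, W_hy):
    # h_t = phi(W_hh . h_{t-1} + W_xh . x_t), with phi = tanh here
    h_t = [math.tanh(a + b)
           for a, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t))]
    # y_t = phi(W_hy . h_t)
    y_t = [math.tanh(a) for a in matvec(W_hy, h_t)]
    return h_t, y_t

# Process a sequence one input at a time, carrying the hidden state:
h = [0.0]
for x in ([1.0], [0.5], [-0.5]):
    h, y = rnn_step(x, h, W_hh=[[0.1]], W_xh=[[1.0]], W_hy=[[1.0]])
```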


Recurrent Neural Networks

(Diagram: the RNN unrolled through time; inputs x1, x2, x3, ..., xt, xt+1 feed hidden states h1, h2, h3, ..., ht, ht+1, which produce outputs y1, y2, y3, ..., yt, yt+1.)

Figure: RNN Unrolled Through Time


Hallucinating Text

(Diagram: Word1 is the only real input; the network's predicted words *Word2, *Word3, *Word4, ..., *Wordt+1 are fed back in as the following inputs, through hidden states h1, h2, h3, ..., ht.)


Hallucinating Shakespeare

PANDARUS: Alas, I think he shall be come approached and the dayWhen little srain would be attain’d into being never fed, And who isbut a chain and subjects of his death, I should not sleep.

Second Senator: They are away this miseries, produced upon mysoul, Breaking and strongly should be buried, when I perish Theearth and thoughts of many states.

DUKE VINCENTIO: Well, your wit is in the care of side and that.

From: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
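A sketch of the feedback loop that generates text like this; the hard-coded next-character distribution is a made-up stand-in for what a trained RNN language model would output, but the sample-and-feed-back loop is the same:

```python
import random

# Hypothetical next-character probabilities, standing in for the
# output layer of a trained character-level RNN.
next_char = {
    "t": {"h": 0.9, "o": 0.1},
    "h": {"e": 0.8, "a": 0.2},
    "e": {" ": 1.0},
    "a": {" ": 1.0},
    "o": {" ": 1.0},
    " ": {"t": 1.0},
}

def hallucinate(seed, length, rng):
    # Sample a character, then feed it back in as the next input.
    text = seed
    for _ in range(length):
        dist = next_char[text[-1]]
        chars, probs = zip(*dist.items())
        text += rng.choices(chars, weights=probs)[0]
    return text

print(hallucinate("t", 8, random.Random(0)))
```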


Neural Machine Translation


Neural Machine Translation

1. RNN Encoders

2. RNN Language Models


Encoders

(Diagram: the encoder RNN reads Word1, Word2, ..., Wordm, <eos> through hidden states h1, h2, ..., hm; the final state C encodes the whole sequence.)

Figure: Using an RNN to Generate an Encoding of a Word Sequence


Language Models

(Diagram: inputs Word1, Word2, Word3, ..., Wordt feed hidden states h1, h2, h3, ..., ht, which output the predictions *Word2, *Word3, *Word4, ..., *Wordt+1.)

Figure: RNN Language Model Unrolled Through Time


Decoder

(Diagram: as in the hallucination setup, only Word1 is given; each predicted word *Word2, *Word3, *Word4, ..., *Wordt+1 is fed back in as the next input.)

Figure: Using an RNN Language Model to Generate (Hallucinate) a Word Sequence


Encoder-Decoder Architecture

(Diagram: the encoder reads Source1, Source2, ..., <eos> through states h1, h2, ..., ending in the context vector C; the decoder states d1, ..., dn then generate Target1, Target2, ..., <eos>.)

Figure: Sequence to Sequence Translation using an Encoder-Decoder Architecture
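A minimal sketch of the whole encoder-decoder idea in plain Python, assuming tanh activations and raw hidden-state vectors standing in for words; a real decoder would also have learned embeddings, a softmax output layer, and a stopping criterion at <eos>:

```python
import math

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rnn_encode(source_vectors, W_hh, W_xh):
    # Encoder: fold the whole source sequence into one context vector C.
    h = [0.0] * len(W_hh)
    for x in source_vectors:
        h = [math.tanh(a + b)
             for a, b in zip(matvec(W_hh, h), matvec(W_xh, x))]
    return h

def translate(source_vectors, max_len, params):
    C = rnn_encode(source_vectors, params["enc_hh"], params["enc_xh"])
    # Decoder: an RNN language model seeded with C; each output vector
    # is fed back in as the next input.
    outputs, h, x = [], C, C
    for _ in range(max_len):
        h = [math.tanh(a + b)
             for a, b in zip(matvec(params["dec_hh"], h),
                             matvec(params["dec_xh"], x))]
        outputs.append(h)
        x = h
    return outputs
```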


Neural Machine Translation

(Diagram: the encoder reads the French source "La vie est belle", shown fed in reversed order (belle, est, vie, La, <eos>), through states h1–h4 into the context vector C; the decoder states d1–d3 then generate "Life is beautiful <eos>".)

Figure: Example Translation using an Encoder-Decoder Architecture


Beyond NMT: Image Annotation


Image Annotation

Image from Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al. (2015)


Thank you for your attention.

[email protected]

@johndkelleher

www.machinelearningbook.com

https://ie.linkedin.com/in/johndkelleher


Acknowledgements: The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Books: Kelleher et al. (2015), Kelleher and Tierney (2018)


References I

Kelleher, J. (2016). Fundamentals of machine learning for neural machine translation. In Translating Europe Forum 2016: Focusing on Translation Technologies. The European Commission Directorate-General for Translation, European Commission. doi: 10.21427/D78012.

Kelleher, J. D., Mac Namee, B., and D'Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples and Case Studies. MIT Press, Cambridge, MA.

Kelleher, J. D. and Tierney, B. (2018). Data Science. MIT Press.


References II

Mikolov, T., Yih, W.-t., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
