An Introduction to Neural Machine Translation
Prof. John D. Kelleher (@johndkelleher)
ADAPT Centre for Digital Content Technology, Dublin Institute of Technology, Ireland
June 25, 2018
1 / 57
Outline
The Neural Machine Translation Revolution
Neural Networks 101
Word Embeddings
Language Models
Neural Language Models
Neural Machine Translation
Beyond NMT: Image Annotation
Image from https://blogs.msdn.microsoft.com/translation/
Image from https://techcrunch.com/2017/08/03/facebook-finishes-its-move-to-neural-machine-translation/
Image from https://slator.com/technology/linguees-founder-launches-deepl-attempt-challenge-google-translate/
Neural Networks 101
7 / 57
What is a function?
A function maps a set of inputs (numbers) to an output (number)¹
sum(2, 5, 4) → 11
¹ This introduction to neural networks and machine translation is based on Kelleher (2016).
8 / 57
What is a weightedSum function?
weightedSum(inputs [x1, x2, ..., xm], weights [w1, w2, ..., wm])
  = (x1 × w1) + (x2 × w2) + · · · + (xm × wm)

weightedSum([3, 9], [−3, 1])
  = (3 × −3) + (9 × 1)
  = −9 + 9
  = 0
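As a minimal sketch (not part of the original slides), weightedSum can be written directly in Python:

```python
def weighted_sum(inputs, weights):
    """Multiply each input by its weight and sum the results."""
    return sum(x * w for x, w in zip(inputs, weights))

# Reproduces the worked example from the slide:
print(weighted_sum([3, 9], [-3, 1]))  # (3 * -3) + (9 * 1) = 0
```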
9 / 57
What is an activation function?
An activation function takes the output of our weightedSum function and applies another mapping to it.
10 / 57
What is an activation function?
Figure: Plots of activation(z) against z for the logistic, tanh, and rectifier activation functions

logistic(z) = 1 / (1 + e^−z)
tanh(z) = (e^z − e^−z) / (e^z + e^−z)
rectifier(z) = max(0, z)
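These three functions translate directly into Python; a minimal sketch using only the standard math module:

```python
import math

def logistic(z):
    """Squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes z into the range (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def rectifier(z):
    """ReLU: returns z if positive, otherwise 0."""
    return max(0.0, z)

print(logistic(0))    # 0.5
print(tanh(0))        # 0.0
print(rectifier(-2))  # 0.0
```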
11 / 57
What is an activation function?
activation = logistic(weightedSum(inputs [x1, x2, ..., xm], weights [w1, w2, ..., wm]))

logistic(weightedSum([3, 9], [−3, 1]))
  = logistic((3 × −3) + (9 × 1))
  = logistic(−9 + 9)
  = logistic(0)
  = 0.5
12 / 57
What is a Neuron?
The simple list of operations that we have just described defines the fundamental building block of a neural network: the Neuron.
Neuron = activation(weightedSum(inputs [x1, x2, ..., xm], weights [w1, w2, ..., wm]))
13 / 57
What is a Neuron?
Figure: A single neuron: inputs x0, x1, x2, x3, ..., xm are multiplied by weights w0, w1, w2, w3, ..., wm, summed (Σ), and passed through an activation function (ϕ)
Image generated using code from Martin Thoma https://github.com/MartinThoma
14 / 57
What is a Neural Network?
Figure: A feed-forward neural network with an input layer, three hidden layers, and an output layer
15 / 57
Training a Neural Network
• We train a neural network by iteratively updating the weights
• We start by randomly assigning weights to each edge
• We then show the network examples of inputs and expected outputs and update the weights using backpropagation so that the network outputs match the expected outputs
• We keep updating the weights until the network is working the way we want (a minimal single-neuron sketch of this loop is given below)
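The slides do not spell out the update rule, so the following is only a hedged sketch of the idea for a single neuron (the delta rule, i.e. backpropagation with no hidden layers); the target 1.0, the starting weights, and the learning rate 0.1 are made up for illustration, and the code reuses the neuron function defined earlier:

```python
def train_step(inputs, target, weights, learning_rate=0.1):
    """One gradient-descent update for a single logistic neuron with squared error."""
    y = neuron(inputs, weights)        # forward pass
    error = y - target                 # how far the output is from the expected output
    grad_z = error * y * (1.0 - y)     # gradient of the error w.r.t. the weighted sum
    return [w - learning_rate * grad_z * x for w, x in zip(weights, inputs)]

weights = [0.5, -0.5]                  # arbitrary starting weights
for _ in range(100):                   # repeatedly show the network the same example
    weights = train_step([3, 9], 1.0, weights)
print(neuron([3, 9], weights))         # now close to the expected output 1.0
```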
16 / 57
Word Embeddings
17 / 57
Word Embeddings
• Language is sequential and has lots of words.
18 / 57
“a word is characterized by the company it keeps”
— Firth, 1957
19 / 57
Word Embeddings
1. Train a network to predict the word that is missing from the middle of an n-gram (or to predict the n-gram from the word)
2. Use the trained network weights to represent the word in vector space
20 / 57
Word Embeddings
Each word is represented by a vector of numbers that positions the word in a multi-dimensional space, e.g.:

king = <55, −10, 176, 27>
man = <10, 79, 150, 83>
woman = <15, 74, 159, 106>
queen = <60, −15, 185, 50>
21 / 57
Word Embeddings
vec(King) − vec(Man) + vec(Woman) ≈ vec(Queen)²

² Linguistic Regularities in Continuous Space Word Representations, Mikolov et al. (2013)
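Using the toy vectors from the previous slide, the analogy can be checked with plain list arithmetic; with real embeddings the left-hand side is only approximately equal to the nearest word vector:

```python
king  = [55, -10, 176, 27]
man   = [10, 79, 150, 83]
woman = [15, 74, 159, 106]
queen = [60, -15, 185, 50]

# vec(King) - vec(Man) + vec(Woman)
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)           # [60, -15, 185, 50]
print(result == queen)  # True for these toy vectors
```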
22 / 57
Language Models
23 / 57
Language Models
• Language is sequential and has lots of words.
24 / 57
1, 2, ?
25 / 57
Figure: A bar chart of the probability of each digit 0–9 being the next symbol in the sequence 1, 2, ?

26 / 57
Th?
27 / 57
Figure: A bar chart of the probability of each letter a–z being the next character after “Th”

28 / 57
• A language model can compute:
  1. the probability of an upcoming symbol:
     P(wn | w1, ..., wn−1)
  2. the probability of a sequence of symbols³
     P(w1, ..., wn)

³ We can go from 1. to 2. using the Chain Rule of Probability: P(w1, w2, w3) = P(w1) P(w2 | w1) P(w3 | w1, w2)
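As a toy illustration of the chain rule (the conditional probabilities below are invented for the example):

```python
# A toy conditional model: P(next word | history), with made-up probabilities.
cond_prob = {
    (): {"Yes": 0.4},
    ("Yes",): {"I": 0.5},
    ("Yes", "I"): {"can": 0.6},
}

def sequence_probability(words):
    """Chain rule: P(w1, ..., wn) = P(w1) * P(w2|w1) * ... * P(wn|w1..wn-1)."""
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob[tuple(words[:i])][w]
    return p

print(sequence_probability(["Yes", "I", "can"]))  # 0.4 * 0.5 * 0.6 = 0.12
```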
29 / 57
• Language models are useful for machine translation because they help with:
  1. word ordering
     P(Yes I can help you) > P(Help you I can yes)⁴
  2. word choice
     P(Feel the Force) > P(Eat the Force)
⁴ Unless it's Yoda that's speaking.
30 / 57
Neural Language Models
31 / 57
Recurrent Neural Networks
A particular type of neural network that is useful for processing sequential data (such as language) is a Recurrent Neural Network.
32 / 57
Recurrent Neural Networks
Using an RNN we process our sequential data one input at a time.

In an RNN the outputs of some of the neurons for one input are fed back into the network as part of the next input.
33 / 57
Simple Feed-Forward Network
Figure: A simple feed-forward network: Input 1, Input 2, Input 3, ... flow from the input layer through a hidden layer to the output layer
34 / 57
Recurrent Neural Networks

Figure: An RNN with input, hidden, buffer, and output layers processing Input 1, Input 2, Input 3, ... one input at a time; at each step the hidden layer's outputs are copied into the buffer and fed back in alongside the next input (animated across slides 35–41)

35–41 / 57
Figure: Recurrent Neural Network: the input xt and the previous hidden state ht−1 feed into the hidden state ht, which produces the output yt
ht = φ((Whh · ht−1) + (Wxh · xt))
yt = φ(Why · ht)
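A direct NumPy transcription of these two equations, assuming tanh as the activation φ; the layer sizes and random weights are arbitrary placeholders:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One RNN time step: h_t = tanh(W_hh . h_prev + W_xh . x_t), y_t = tanh(W_hy . h_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = np.tanh(W_hy @ h_t)
    return h_t, y_t

# Tiny random weights: 3-dimensional inputs, 4-dimensional hidden state, 2-dimensional outputs.
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))

h = np.zeros(4)                        # initial hidden state
for x in rng.normal(size=(5, 3)):      # process a sequence of five inputs, one at a time
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy)
    print(y)
```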
42 / 57
Recurrent Neural Networks
Figure: RNN Unrolled Through Time: inputs x1, x2, x3, ..., xt, xt+1 produce hidden states h1, h2, h3, ..., ht, ht+1 and outputs y1, y2, y3, ..., yt, yt+1

43 / 57
Hallucinating Text
Figure: Hallucinating text with an RNN: the network is given only Word1 as input, and each predicted output (*Word2, *Word3, *Word4, ..., *Wordt+1) is fed back in as the next input
44 / 57
Hallucinating Shakespeare
PANDARUS: Alas, I think he shall be come approached and the day When little srain would be attain’d into being never fed, And who is but a chain and subjects of his death, I should not sleep.

Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.
DUKE VINCENTIO: Well, your wit is in the care of side and that.
From: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
45 / 57
Neural Machine Translation
46 / 57
Neural Machine Translation
1. RNN Encoders
2. RNN Language Models
47 / 57
Encoders
Figure: Using an RNN to Generate an Encoding of a Word Sequence: the inputs Word1, Word2, ..., Wordm, <eos> are read one at a time, producing encodings h1, h2, ..., hm and a final encoding C
48 / 57
Language Models
Figure: RNN Language Model Unrolled Through Time: inputs Word1, Word2, Word3, ..., Wordt produce hidden states h1, h2, h3, ..., ht and predicted next words *Word2, *Word3, *Word4, ..., *Wordt+1
49 / 57
Decoder
Figure: Using an RNN Language Model to Generate (Hallucinate) a Word Sequence: only Word1 is supplied; each predicted word (*Word2, *Word3, *Word4, ..., *Wordt+1) is fed back in as the next input
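A sketch of this feed-the-output-back-in loop; the toy_rnn_lm lookup table below is a made-up stand-in for a trained RNN language model (a real model would update the hidden state and predict from it):

```python
# Toy stand-in for a trained RNN language model: given the current hidden state and
# the previous word, return the new hidden state and a predicted next word.
transitions = {"La": "vie", "vie": "est", "est": "belle", "belle": "<eos>"}

def toy_rnn_lm(h, word):
    return h, transitions.get(word, "<eos>")

def hallucinate(first_word, max_len=10):
    """Feed each predicted word back in as the next input until <eos>."""
    h, word, output = None, first_word, [first_word]
    while word != "<eos>" and len(output) < max_len:
        h, word = toy_rnn_lm(h, word)
        output.append(word)
    return output

print(hallucinate("La"))  # ['La', 'vie', 'est', 'belle', '<eos>']
```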
50 / 57
Encoder-Decoder Architecture
Figure: Sequence to Sequence Translation using an Encoder-Decoder Architecture: the encoder reads Source1, Source2, ..., <eos> into hidden states h1, h2, ... and a context encoding C; the decoder states d1, ..., dn then generate Target1, Target2, ..., <eos>
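The following is only a shape-of-the-architecture sketch, not the system on the slides: the weights are random and untrained, the vocabularies are just the example sentence from the next slide, and all names (E_src, W_enc, W_dec, W_out, encode, decode) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
src_vocab = ["La", "vie", "est", "belle", "<eos>"]
tgt_vocab = ["Life", "is", "beautiful", "<eos>"]
d = 8  # hidden / embedding size

# Untrained parameters (random): a real system learns these with backpropagation.
E_src = {w: rng.normal(size=d) for w in src_vocab}
E_tgt = {w: rng.normal(size=d) for w in tgt_vocab}
W_enc, W_dec = rng.normal(size=(d, 2 * d)), rng.normal(size=(d, 2 * d))
W_out = rng.normal(size=(len(tgt_vocab), d))

def encode(source_words):
    """Run the encoder RNN over the source; the final hidden state is the encoding C."""
    h = np.zeros(d)
    for w in source_words:
        h = np.tanh(W_enc @ np.concatenate([h, E_src[w]]))
    return h

def decode(C, max_len=10):
    """Start the decoder from C and feed each predicted word back in as the next input."""
    h, word, output = C, "<eos>", []
    while len(output) < max_len:
        h = np.tanh(W_dec @ np.concatenate([h, E_tgt[word]]))
        word = tgt_vocab[int(np.argmax(W_out @ h))]
        output.append(word)
        if word == "<eos>":
            break
    return output

print(decode(encode(["La", "vie", "est", "belle", "<eos>"])))  # gibberish until trained
```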
51 / 57
Neural Machine Translation
Figure: Example Translation using an Encoder-Decoder Architecture: the encoder reads the source words belle, est, vie, La, <eos> (the French sentence “La vie est belle”) into hidden states h1–h4 and encoding C; the decoder states d1–d3 generate “Life is beautiful” followed by <eos>
52 / 57
Beyond NMT: Image Annotation
53 / 57
Image Annotation
Image from Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al. (2015)
54 / 57
Thank you for your attention
john.d.kelleher@dit.ie
@johndkelleher
www.machinelearningbook.com
https://ie.linkedin.com/in/johndkelleher
[Image: cover of Data Science by John D. Kelleher and Brendan Tierney, The MIT Press Essential Knowledge Series]
Acknowledgements: The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. Books: Kelleher et al. (2015), Kelleher and Tierney (2018)
References I
Kelleher, J. (2016). Fundamentals of machine learning for neural machine translation. In Translating Europe Forum 2016: Focusing on Translation Technologies. The European Commission Directorate-General for Translation, European Commission. doi: 10.21427/D78012.

Kelleher, J. D., Mac Namee, B., and D’Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Analytics: Algorithms, Worked Examples and Case Studies. MIT Press, Cambridge, MA.
Kelleher, J. D. and Tierney, B. (2018). Data Science. MIT Press.
56 / 57
References II
Mikolov, T., Yih, W.-t., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
57 / 57