An Introduction to Statistical Machine Translation
An Introduction to Statistical Machine Translation
Dept. of CSIE, NCKU
Yao-Sheng Chang
Date: 2011.04.12
Outline
Introduction
Peter Brown et al., "The Mathematics of Machine Translation: Parameter Estimation," Computational Linguistics, vol. 19, 1993, pp. 263-311.
Model 1
Introduction (1)
Machine translation has become feasible thanks to statistical methods and information theory, faster computers with large storage, and machine-readable corpora.
Statistical methods have already proven their value in automatic speech recognition, lexicography, and natural language processing.
Introduction (2)
Translation involves many cultural aspects. We consider only the translation of individual sentences, and we aim only for acceptable translations.
Every sentence in one language is a possible translation of any sentence in the other. To each pair (S, T) we assign a probability Pr(T|S): the probability that a translator will produce T in the target language when presented with S in the source language.
Statistical Machine Translation (SMT)
Noisy channel problem
Fundamentals of SMT
Given a string of French f, the job of our translation system is to find the string e that the speaker had in mind when he produced f. By Bayes' theorem,
$$\Pr(e \mid f) = \frac{\Pr(e)\,\Pr(f \mid e)}{\Pr(f)}$$
Since the denominator Pr(f) does not depend on e, the best e is the one that maximizes the product Pr(e) Pr(f|e):
$$\hat{e} = \arg\max_{e} \Pr(e)\,\Pr(f \mid e)$$
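The selection rule above can be sketched in a few lines of Python. This is only a toy illustration: the candidate strings and all probabilities in `language_model` and `translation_model` are hypothetical values invented for the example, and a real decoder searches a far larger space.

```python
import math

# Hypothetical toy probabilities, chosen only for illustration.
language_model = {            # Pr(e)
    "the house": 0.6,
    "a house": 0.3,
    "house the": 0.1,
}
translation_model = {         # Pr(f | e) for the French string f = "la maison"
    ("la maison", "the house"): 0.5,
    ("la maison", "a house"): 0.2,
    ("la maison", "house the"): 0.5,
}

def decode(f, candidates):
    """Return the candidate e that maximizes log Pr(e) + log Pr(f | e)."""
    def score(e):
        return math.log(language_model[e]) + math.log(translation_model[(f, e)])
    return max(candidates, key=score)

best = decode("la maison", list(language_model))
print(best)
```

Note that "house the" explains the French words as well as "the house" does under this toy translation model; it is the language model Pr(e) that rules it out.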
Practical Challenges
Computation of the translation model Pr(f|e)
Computation of the language model Pr(e)
Decoding (i.e., searching for the e that maximizes Pr(f|e) Pr(e))
Alignment of case 1
Alignment of case 2
Alignment of case 3
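Formulation of Alignment (1)
Let $e = e_1^l \equiv e_1 e_2 \dots e_l$ and $f = f_1^m \equiv f_1 f_2 \dots f_m$.
An alignment between a pair of strings e and f is a mapping from every word $f_j$ to some word $e_i$ (or to the null word).
In other words, an alignment a between e and f says that the word $f_j$, $1 \le j \le m$, is generated by the word $e_{a_j}$, $a_j \in \{0, 1, \dots, l\}$, where $a_j = 0$ means that $f_j$ is generated by the null word.
There are $(l+1)^m$ different alignments between e and f (including null, i.e. no mapping).
$e = e_1 e_2 \dots e_i \dots e_l$
$f = f_1 f_2 \dots f_j \dots f_m$
$a_j = i$ means that $f_j$ is connected to $e_i$, i.e. $e_{a_j} = e_i$.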
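As a small check on the alignment count, the sketch below enumerates every alignment for a toy sentence pair. It assumes the convention that each French position j receives a value a_j in {0, ..., l}, with 0 standing for the null word; the example words are hypothetical.

```python
from itertools import product

def all_alignments(l, m):
    """Yield every alignment (a_1, ..., a_m) with each a_j in {0, ..., l}."""
    return product(range(l + 1), repeat=m)

english = ["NULL", "the", "house"]   # position 0 is the null word, so l = 2
french = ["la", "maison"]            # m = 2
l, m = len(english) - 1, len(french)

alignments = list(all_alignments(l, m))
for a in alignments:
    # Pair each French word with the English word it is connected to.
    links = [(french[j], english[a_j]) for j, a_j in enumerate(a)]
    print(a, links)

print(len(alignments), "alignments")
```

The count grows as (l+1)^m, which is why explicit enumeration is only practical for very short sentences.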
Formulation of Alignment(2)
Probability of an alignment
Translation Model
The alignment a can be represented by a series $a_1^m = a_1 a_2 \dots a_m$ of m values, each between 0 and l, such that if the word in position j of the French string is connected to the word in position i of the English string, then $a_j = i$, and if it is not connected to any English word, then $a_j = 0$ (null).
IBM Model I (1)
IBM Model I (2)
The alignment is determined by specifying the values of $a_j$ for j from 1 to m, each of which can take any value from 0 to l.
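The slide's equation did not survive in the transcript. As a hedged reconstruction based on Brown et al. (1993), summing over every choice of the values a_1, ..., a_m gives the Model 1 likelihood below. The symbols are assumptions taken from the paper's notation rather than from the transcript: ε is the probability of the French length m, assumed constant, and t(f_j | e_{a_j}) is the probability of translating the English word e_{a_j} as the French word f_j.

```latex
\Pr(\mathbf{f} \mid \mathbf{e})
  = \frac{\epsilon}{(l+1)^m}
    \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l}
    \prod_{j=1}^{m} t(f_j \mid e_{a_j})
```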
Constrained Maximization
We wish to adjust the translation probabilities so as to maximize $\Pr(\mathbf{f} \mid \mathbf{e})$ subject to the constraint that, for each e,
$$\sum_{f} t(f \mid e) = 1$$
Lagrange Multipliers (1)
Method of Lagrange multipliers with one constraint:
If f(x, y) has a maximum or minimum subject to the constraint g(x, y) = 0, then it occurs at one of the critical points of the function F defined by
$$F(x, y, \lambda) = f(x, y) + \lambda\, g(x, y)$$
f(x, y) is called the objective function.
g(x, y) = 0 is called the constraint equation.
$F(x, y, \lambda)$ is called the Lagrange function.
$\lambda$ is called the Lagrange multiplier.
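Lagrange Multipliers (2)
Example 1: Maximize $V = xyz$ subject to $6x + 4y + 3z - 24 = 0$.
Let
$$F(x, y, z, \lambda) = xyz + \lambda(6x + 4y + 3z - 24)$$
Set
$$\begin{aligned}
F_x &= yz + 6\lambda = 0 && (1)\\
F_y &= xz + 4\lambda = 0 && (2)\\
F_z &= xy + 3\lambda = 0 && (3)\\
F_\lambda &= 6x + 4y + 3z - 24 = 0 && (4)
\end{aligned}$$
From (1), $\lambda = -\frac{yz}{6}$. Substituting into (2) and (3) gives
$$xz - \tfrac{4}{6}\,yz = 0 \;\Rightarrow\; y = \tfrac{3}{2}\,x \qquad (5)$$
$$xy - \tfrac{3}{6}\,yz = 0 \;\Rightarrow\; z = 2x \qquad (6)$$
Substituting (5) and (6) into (4) gives
$$6x + 4\left(\tfrac{3}{2}x\right) + 3(2x) - 24 = 18x - 24 = 0$$
from which $x = \frac{4}{3}$, $y = 2$, $z = \frac{8}{3}$. Therefore, the maximum value is
$$V = xyz = \tfrac{4}{3} \cdot 2 \cdot \tfrac{8}{3} = \tfrac{64}{9}$$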
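The example (maximize V = xyz subject to 6x + 4y + 3z = 24) can be checked with exact rational arithmetic. The sketch below assumes the sign convention F = xyz + λ(6x + 4y + 3z - 24); the helper names are hypothetical and only serve this check.

```python
from fractions import Fraction as Fr

def gradient_of_F(x, y, z, lam):
    """Partial derivatives (F_x, F_y, F_z, F_lambda) of F = xyz + lam*(6x + 4y + 3z - 24)."""
    return (y * z + 6 * lam,
            x * z + 4 * lam,
            x * y + 3 * lam,
            6 * x + 4 * y + 3 * z - 24)

def volume_on_constraint(x, y):
    """V = xyz with z eliminated through the constraint 6x + 4y + 3z = 24."""
    z = (24 - 6 * x - 4 * y) / 3
    return x * y * z

x, y, z = Fr(4, 3), Fr(2), Fr(8, 3)
lam = -y * z / 6                      # from F_x = 0

grad = gradient_of_F(x, y, z, lam)    # every component should vanish
volume = x * y * z
print(grad, volume)

# Nearby points on the constraint plane should give a smaller volume.
step = Fr(1, 10)
print(volume_on_constraint(x + step, y) < volume,
      volume_on_constraint(x, y - step) < volume)
```

Using `Fraction` avoids floating-point round-off, so the stationarity conditions can be tested for exact equality with zero.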
Lagrange Multipliers (3)
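Following standard practice for constrained maximization, we introduce Lagrange multipliers $\lambda_e$ and seek an unconstrained extremum of the auxiliary function
$$h(t, \lambda) \equiv \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) - \sum_{e} \lambda_e \Bigl( \sum_{f} t(f \mid e) - 1 \Bigr)$$
where the first term is $\Pr(\mathbf{f} \mid \mathbf{e})$ under Model 1 and the second term enforces the normalization constraints.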
Derivation (1)
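The partial derivative of h with respect to t(f|e) is
$$\frac{\partial h}{\partial t(f \mid e)} = \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j})\, t(f \mid e)^{-1} \prod_{k=1}^{m} t(f_k \mid e_{a_k}) - \lambda_e$$
where $\delta$ is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise.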
Derivation (2)
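We call the expected number of times that e connects to f in the translation $(\mathbf{f} \mid \mathbf{e})$ the count of f given e for $(\mathbf{f} \mid \mathbf{e})$ and denote it by $c(f \mid e; \mathbf{f}, \mathbf{e})$. By definition,
$$c(f \mid e; \mathbf{f}, \mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{a} \mid \mathbf{e}, \mathbf{f}) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j})$$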
Derivation (3)
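Replacing $\lambda_e$ by $\lambda_e \Pr(\mathbf{f} \mid \mathbf{e})$, Equation (11) can be written very compactly as
$$t(f \mid e) = \lambda_e^{-1}\, c(f \mid e; \mathbf{f}, \mathbf{e})$$
In practice, our training data consists of a set of translations $(\mathbf{f}^{(1)} \mid \mathbf{e}^{(1)}), (\mathbf{f}^{(2)} \mid \mathbf{e}^{(2)}), \dots, (\mathbf{f}^{(S)} \mid \mathbf{e}^{(S)})$, so this equation becomes
$$t(f \mid e) = \lambda_e^{-1} \sum_{s=1}^{S} c(f \mid e; \mathbf{f}^{(s)}, \mathbf{e}^{(s)})$$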
Derivation (4)
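The sum over alignments can be rewritten as an expression that can be evaluated efficiently:
$$\sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) = \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)$$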
Derivation (5)
Thus, the number of operations necessary to calculate a count is proportional to l + m rather than to $(l + 1)^m$, as Equation (12) might suggest.
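The saving comes from exchanging the sum over alignments with the product over French positions. The sketch below compares a brute-force sum over all (l+1)^m alignments with the factorized form on a toy example; the table `t` holds hypothetical translation probabilities t(f|e), and the function names are illustrative only.

```python
from itertools import product
from math import prod, isclose

# Hypothetical translation probabilities t(f | e); each English word's entries sum to 1.
t = {
    ("la", "NULL"): 0.5, ("maison", "NULL"): 0.5,
    ("la", "the"): 0.9, ("maison", "the"): 0.1,
    ("la", "house"): 0.2, ("maison", "house"): 0.8,
}
english = ["NULL", "the", "house"]   # index 0 is the null word
french = ["la", "maison"]

def sum_over_alignments(french, english):
    """Brute force: add up prod_j t(f_j | e_{a_j}) over every alignment a."""
    total = 0.0
    for a in product(range(len(english)), repeat=len(french)):
        total += prod(t[(f, english[a_j])] for f, a_j in zip(french, a))
    return total

def factorized_sum(french, english):
    """Same quantity as prod_j sum_i t(f_j | e_i), without enumerating alignments."""
    return prod(sum(t[(f, e)] for e in english) for f in french)

brute = sum_over_alignments(french, english)
fast = factorized_sum(french, english)
print(brute, fast, isclose(brute, fast))
```

Both functions return the same value up to round-off, but only the first one has to visit every alignment.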
EM Algorithm
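A minimal sketch of how the count and normalization equations of the derivation turn into an EM training loop for Model 1. The two-sentence corpus, the `"NULL"` token, the uniform initialization, and the function name `train_model1` are all assumptions made for this illustration; it is not the slide's own algorithm listing.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (french_words, english_words) pairs; returns t[(f, e)]."""
    french_vocab = {f for fs, _ in corpus for f in fs}
    # Uniform start for every co-occurring (French word, English word) pair.
    t = {}
    for fs, es in corpus:
        for f in fs:
            for e in ["NULL"] + es:
                t[(f, e)] = 1.0 / len(french_vocab)

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f | e), summed over the corpus
        total = defaultdict(float)   # normalizer for each English word e
        # E-step: distribute each French word over the English words of its sentence.
        for fs, es in corpus:
            es = ["NULL"] + es
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize the counts so that sum_f t(f | e) = 1 for each e.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_model1(corpus, iterations=20)
for (f, e), p in sorted(t.items()):
    print(f"t({f}|{e}) = {p:.3f}")
```

Because "la" co-occurs with "the" in both sentences, its probability given "the" grows over the iterations, while "maison" is increasingly attributed to "house".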
Introduction(1)
In statistical computing, an expectation-maximization (EM) algorithm is an algorithm for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM is frequently used for data clustering in machine learning and computer vision.
Introduction(2)
EM alternates between performing an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found on the E step. The parameters found on the M step are then used to begin another E step, and the process is repeated.
(From: http://en.wikipedia.org/wiki/Expectation-maximization_algorithm )
The EM algorithm can be seen as a soft version of K-means clustering.
The idea is that the observed data are generated by several underlying causes.
Each cause contributes independently to the generation process, but we only see the final mixture, without information about which cause contributed what.
Observable data
$X = \{\mathbf{x}_i\}$
Each $\mathbf{x}_i = (x_{i1}, \dots, x_{im})^T$
Unobservable / hidden data
$Z = \{\mathbf{z}_i\}$, where $\mathbf{z}_i = (z_{i1}, z_{i2}, \dots, z_{ik})^T$
Each $z_{ij}$ can be interpreted as a cluster membership probability.
The component $z_{ij}$ is 1 if object i is a member of cluster j.
Initial Assumption
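At first, suppose we have a data set $X = \{\mathbf{x}_i\}$, where each $\mathbf{x}_i = (x_{i1}, \dots, x_{im})^T$ is the vector that corresponds to the i-th data point.
Further, assume the samples are drawn from a mixture of k Gaussians $c_j$, $1 \le j \le k$.
Notice that the p.d.f. of the multivariate normal distribution is
$$p(\mathbf{x}; \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_j|}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \mu_j)^T \Sigma_j^{-1} (\mathbf{x} - \mu_j) \right)$$
A normal distribution in a variate x with mean $\mu$ and variance $\sigma^2$ is a statistical distribution with probability density function
$$p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$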
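To make the two density formulas concrete, here is a small pure-Python sketch. It is restricted to the univariate case and to two dimensions (m = 2), where the inverse and determinant of the covariance matrix have simple closed forms; the function names and test values are hypothetical.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mvn_pdf_2d(x, mu, cov):
    """Bivariate normal density; cov is a 2x2 covariance matrix given as nested pairs."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu), using the closed-form 2x2 inverse.
    quad = (dx0 * (d * dx0 - b * dx1) + dx1 * (a * dx1 - c * dx0)) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

# With a diagonal covariance the bivariate density factorizes into two univariate ones.
density = mvn_pdf_2d((1.0, 0.3), (0.0, 0.0), ((2.0, 0.0), (0.0, 0.5)))
factorized = normal_pdf(1.0, 0.0, 2.0) * normal_pdf(0.3, 0.0, 0.5)
print(density, factorized)
```

For higher dimensions one would normally rely on a linear-algebra library to compute the determinant and solve against the covariance matrix instead of inverting it by hand.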
E-step
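Let $H = [h_{ij}]$ be an n by k matrix, where
$$h_{ij} = E(z_{ij} \mid \mathbf{x}_i; \Theta) = 1 \cdot P(z_{ij} = 1 \mid \mathbf{x}_i; \Theta) = \frac{p(\mathbf{x}_i \mid c_j; \Theta)}{\sum_{l=1}^{k} p(\mathbf{x}_i \mid c_l; \Theta)}$$
Notice that, if we set $p(c_j) = \frac{1}{k}$ for $1 \le j \le k$, then by Bayes' formula we have
$$p(c_j \mid \mathbf{x}_i) = \frac{p(\mathbf{x}_i \mid c_j)\, p(c_j)}{\sum_{l=1}^{k} p(\mathbf{x}_i \mid c_l)\, p(c_l)} = \frac{p(\mathbf{x}_i \mid c_j)}{\sum_{l=1}^{k} p(\mathbf{x}_i \mid c_l)}$$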
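The E-step amounts to normalizing each row of a table of per-cluster densities. A minimal sketch, assuming the densities have already been evaluated; the 2 by 2 table below uses illustrative values, and `weights=None` corresponds to the uniform prior p(c_j) = 1/k.

```python
def e_step(densities, weights=None):
    """Return H with h_ij = w_j * p_ij / sum_l (w_l * p_il).

    densities[i][j] is the density of point i under cluster j;
    weights are the cluster priors (uniform if omitted).
    """
    k = len(densities[0])
    if weights is None:
        weights = [1.0 / k] * k
    H = []
    for row in densities:
        joint = [w * p for w, p in zip(weights, row)]
        total = sum(joint)
        H.append([v / total for v in joint])
    return H

densities = [[0.2, 0.3],
             [0.1, 0.4]]
H = e_step(densities)
for row in H:
    print([round(h, 2) for h in row])
```

With a uniform prior the weights cancel, so each row of H is just the corresponding row of densities divided by its sum.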
M-step
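$$\mu_j' = \frac{\sum_{i=1}^{n} h_{ij}\, \mathbf{x}_i}{\sum_{i=1}^{n} h_{ij}} = \sum_{i=1}^{n} \frac{h_{ij}}{\sum_{i'=1}^{n} h_{i'j}}\, \mathbf{x}_i$$
$$\Sigma_j' = \frac{\sum_{i=1}^{n} h_{ij}\, (\mathbf{x}_i - \mu_j')(\mathbf{x}_i - \mu_j')^T}{\sum_{i=1}^{n} h_{ij}} = \sum_{i=1}^{n} \frac{h_{ij}}{\sum_{i'=1}^{n} h_{i'j}}\, (\mathbf{x}_i - \mu_j')(\mathbf{x}_i - \mu_j')^T$$
$$\pi_j' = \frac{\sum_{i=1}^{n} h_{ij}}{\sum_{j=1}^{k} \sum_{i=1}^{n} h_{ij}} = \frac{\sum_{i=1}^{n} h_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{k} h_{ij}} = \frac{\sum_{i=1}^{n} h_{ij}}{\sum_{i=1}^{n} 1} = \frac{1}{n} \sum_{i=1}^{n} h_{ij}$$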
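The three M-step updates (new means, covariances, and cluster weights) can be sketched in pure Python as follows. The data points and the responsibility matrix H are illustrative values, and the function name `m_step` is hypothetical.

```python
def m_step(points, H):
    """Return (means, covariances, weights) computed from responsibilities H.

    points[i] is a data vector; H[i][j] is the responsibility of cluster j for point i.
    """
    n, k, dim = len(points), len(H[0]), len(points[0])
    means, covariances, weights = [], [], []
    for j in range(k):
        total = sum(H[i][j] for i in range(n))
        # Weighted mean of the points.
        mean = [sum(H[i][j] * points[i][d] for i in range(n)) / total
                for d in range(dim)]
        # Weighted covariance around the new mean.
        cov = [[sum(H[i][j] * (points[i][r] - mean[r]) * (points[i][c] - mean[c])
                    for i in range(n)) / total
                for c in range(dim)]
               for r in range(dim)]
        means.append(mean)
        covariances.append(cov)
        weights.append(total / n)   # pi'_j = (1/n) * sum_i h_ij
    return means, covariances, weights

points = [(1.0, 0.0), (0.0, 1.0)]
H = [[0.4, 0.6],
     [0.2, 0.8]]
means, covariances, weights = m_step(points, H)
print(means)
print(covariances)
print(weights)
```

With only two data points the estimated covariance matrices are singular, so this toy input is useful for checking the arithmetic but not for running further EM iterations.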
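Log likelihood
The log likelihood of the data set X given the parameters $\Theta$ is
$$l(X \mid \Theta) = \sum_{i=1}^{n} \log P(\mathbf{x}_i) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j\, p(\mathbf{x}_i; \mu_j, \Sigma_j),$$
where $\Theta = (\theta_1, \dots, \theta_k)^T$, $\theta_j = (\pi_j, \mu_j, \Sigma_j)$, and $\pi_j$ is the weight of cluster j.
Notice that
$$\sum_{j=1}^{k} \pi_j = 1$$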
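A short sketch of the log-likelihood computation, assuming the per-cluster densities have already been evaluated for every data point. The weights and densities are illustrative values, the natural logarithm is used, and the function name is hypothetical.

```python
import math

def log_likelihood(densities, weights):
    """Sum over points of log( sum_j weights[j] * densities[i][j] )."""
    return sum(math.log(sum(w * p for w, p in zip(weights, row)))
               for row in densities)

weights = [0.3, 0.7]            # cluster weights, summing to 1
densities = [[0.2, 0.3],        # densities of point 1 under clusters 1 and 2
             [0.1, 0.4]]        # densities of point 2 under clusters 1 and 2
ll = log_likelihood(densities, weights)
print(round(ll, 2))
```

In an EM loop this value would be tracked across iterations, since each iteration should not decrease it.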
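Worked example (1)
Suppose
$$\mathbf{x}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad \mathbf{x}_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$
$$N = \begin{pmatrix} p(\mathbf{x}_1; \mu_1, \Sigma_1) & p(\mathbf{x}_1; \mu_2, \Sigma_2) \\ p(\mathbf{x}_2; \mu_1, \Sigma_1) & p(\mathbf{x}_2; \mu_2, \Sigma_2) \end{pmatrix} = \begin{pmatrix} 0.2 & 0.3 \\ 0.1 & 0.4 \end{pmatrix}$$
Then
$$H = \begin{pmatrix} \frac{0.2}{0.2 + 0.3} & \frac{0.3}{0.2 + 0.3} \\ \frac{0.1}{0.1 + 0.4} & \frac{0.4}{0.1 + 0.4} \end{pmatrix} = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}$$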
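Worked example (2)
With
$$H = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}, \qquad \mathbf{x}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad \mathbf{x}_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$
the new means are
$$\mu_1' = \frac{0.4}{0.4 + 0.2} \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \frac{0.2}{0.4 + 0.2} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0.67 \\ 0.33 \end{pmatrix}$$
$$\mu_2' = \frac{0.6}{0.6 + 0.8} \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \frac{0.8}{0.6 + 0.8} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0.43 \\ 0.57 \end{pmatrix}$$
the new covariance matrices are
$$\Sigma_1' = \frac{0.4}{0.4 + 0.2} \begin{pmatrix} 1 - 0.67 \\ 0 - 0.33 \end{pmatrix} \begin{pmatrix} 1 - 0.67 \\ 0 - 0.33 \end{pmatrix}^T + \frac{0.2}{0.4 + 0.2} \begin{pmatrix} 0 - 0.67 \\ 1 - 0.33 \end{pmatrix} \begin{pmatrix} 0 - 0.67 \\ 1 - 0.33 \end{pmatrix}^T = \begin{pmatrix} 0.22 & -0.22 \\ -0.22 & 0.22 \end{pmatrix}$$
$$\Sigma_2' = \frac{0.6}{0.6 + 0.8} \begin{pmatrix} 1 - 0.43 \\ 0 - 0.57 \end{pmatrix} \begin{pmatrix} 1 - 0.43 \\ 0 - 0.57 \end{pmatrix}^T + \frac{0.8}{0.6 + 0.8} \begin{pmatrix} 0 - 0.43 \\ 1 - 0.57 \end{pmatrix} \begin{pmatrix} 0 - 0.43 \\ 1 - 0.57 \end{pmatrix}^T \approx \begin{pmatrix} 0.25 & -0.25 \\ -0.25 & 0.25 \end{pmatrix}$$
and the new cluster weights are
$$\pi_1' = \frac{0.4 + 0.2}{2} = 0.3, \qquad \pi_2' = \frac{0.6 + 0.8}{2} = 0.7$$
The log likelihood is then
$$l(X \mid \Theta) = \log(0.3 \times 0.2 + 0.7 \times 0.3) + \log(0.3 \times 0.1 + 0.7 \times 0.4) = -2.48$$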