Atef Ben Youssef, Pierre Badin & Gérard Bailly
GIPSA-lab (DPC / ICP), UMR 5216 CNRS / INP / UJF / U. Stendhal, Grenoble, France
{Atef.Ben-Youssef, Pierre.Badin, Gerard.Bailly}@gipsa-lab.grenoble-inp.fr
[Figure: vocal-tract images for /a/, /i/, /y/ and /u/]
– Visual articulatory feedback
– Estimation of the articulatory movements from the speech signal
– Speech inversion system: HMM-based acoustic recognition and articulatory synthesis
● Acoustic HMMs
– Corpus: one French male speaker, parallel data (17 min, 5100 phones, 36 phonemes): acoustic features x (MFCC + E + ∆) // articulatory features y (12 EMA coordinates + ∆)
– Speech production model: left-to-right, 3-state, multi-stream HMM λ trained using Maximum Likelihood Estimation (MLE)
● Tied states and multi-Gaussian mixtures
– Decision-tree-based state tying
  → improves the statistics when the number of occurrences is low
  → improves the context dependency
– Multiple-mixture-component Gaussian distributions
● Acoustic-to-articulatory inversion
– ŷ = argmax_y p(y|x), where p(y|x) = p(y|λ,q) · p(x|λ,q) · P(λ)
– Acoustic decoding (+ language model): Viterbi algorithm
– Articulatory synthesis: Maximum Likelihood Parameter Generation (MLPG)
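The two inversion stages can be sketched in code: Viterbi decoding of the HMM state sequence from acoustic likelihoods, then MLPG-style trajectory generation from the decoded per-frame state means and variances. This is a simplified toy illustration, not the poster's actual implementation: single-Gaussian states, a hypothetical ±1-frame delta window, and one articulatory dimension are my assumptions.

```python
import numpy as np

def viterbi(log_A, log_pi, log_b):
    """Most likely state path; log_b[t, s] = log p(x_t | state s),
    log_A[i, j] = log transition probability from state i to j."""
    T, S = log_b.shape
    delta = log_pi + log_b[0]
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)           # best predecessor
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

def mlpg_1d(mu_s, mu_d, var_s, var_d):
    """MLPG for one articulatory dimension: find the trajectory y that
    maximizes the Gaussian likelihood of the static and delta features,
    with the (assumed) delta window Δy_t = (y_{t+1} - y_{t-1}) / 2."""
    T = len(mu_s)
    W = np.zeros((2 * T, T))                     # stacks static/delta rows
    for t in range(T):
        W[2 * t, t] = 1.0                        # static row
        W[2 * t + 1, max(t - 1, 0)] -= 0.5       # delta row
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    mu = np.ravel(np.column_stack([mu_s, mu_d]))
    prec = np.ravel(np.column_stack([1 / var_s, 1 / var_d]))
    A = W.T @ (prec[:, None] * W)                # normal equations
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

In the full system the decoded state sequence q supplies mu_s, mu_d, var_s, var_d frame by frame from the articulatory HMMs; here they would simply be looked up per state and passed to `mlpg_1d`.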
● Minimum Generation Error (MGE) training of the articulatory HMMs
– Generation error defined as the Euclidean distance between the generated Ŷ and the measured Y articulatory trajectories:
  D(Y, Ŷ) = ‖Y − Ŷ‖² = Σ_{t=1}^{T} ‖y_t − ŷ_t‖²
– Articulatory HMM parameters (mean µ and variance σ²) updated by probabilistic descent with step size ε:
  µ_update = µ_old − ε (µ_gen − µ_orig)
  σ²_update = σ²_old − ε · (1 / (N·T)) · Σ_{n=1}^{N} Σ_{t=1}^{T} (o_{n,t} − ô_{n,t}) (ô_{n,t} − µ)
[Figure: acoustic state decoder — chain of phone HMMs λ1 … λk, λl … λn, each left-to-right with states q1, q2, q3]
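A minimal sketch of the MGE quantities above, assuming single-Gaussian states and one articulatory dimension. The gradient of D with respect to a state mean is approximated here by a simple state-occupancy weighting of the per-frame error; the exact gradient goes through the MLPG generation step, so this is an illustrative simplification, not the poster's precise update.

```python
import numpy as np

def generation_error(Y, Y_hat):
    """D(Y, Ŷ) = Σ_t ||y_t - ŷ_t||², summed over the frames of an utterance."""
    return float(np.sum((Y - Y_hat) ** 2))

def gpd_mean_update(mu_old, Y, Y_hat, occupancy, eps=0.01):
    """One probabilistic-descent step on a state mean (one dimension).
    occupancy[t] = 1 where this state is occupied at frame t, else 0.
    dD/dŷ_t = 2 (ŷ_t - y_t) is pushed onto the mean via the occupancy
    weights (a stand-in for the full MLPG chain rule)."""
    grad = np.sum(occupancy * 2.0 * (Y_hat - Y))
    return mu_old - eps * grad
```

When Ŷ already matches Y the gradient vanishes and the mean is left unchanged, which is the fixed point MGE training aims for.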
– French acoustic-articulatory corpus
[Figure: measured (Original) and synthesized (MLE, MGE) articulatory spaces; tongue tip, tongue middle and tongue back y-coordinates for the sequence /a/ /k/ /a/]
[Bar charts: phone recognition accuracy (%) — 85.46, 84.31, 86.19, 86.35 for no-ctx, L-ctx, ctx-R, L-ctx-R — and inversion RMSE (mm) from audio alone and from audio and text, MLE vs MGE, ranging from 1.35 to 1.88 mm across the four contexts]
– English acoustic-articulatory corpus (MOCHA-TIMIT, fsew0): 21 min, 14000 phones, 45 phonemes
[Bar charts: phone recognition accuracy (%) — 55.82, 67.89, 70.20, 66.30 for no-ctx, L-ctx, ctx-R, L-ctx-R — and inversion RMSE (mm) from audio alone and from audio and text, MLE vs MGE, ranging from 1.59 to 1.96 mm across the four contexts]
– Christophe Savariaux for his involvement in the recording of the EMA data.
– Work partially supported by the French ANR-08-EMER-001-02 project ARTIS.