

ISSP2011_Poster_HMM-MGE_v6.2 · Author: Atef · Created: 6/14/2011 5:50:17 PM

Atef Ben Youssef, Pierre Badin & Gérard Bailly
GIPSA-lab (DPC / ICP), UMR 5216 CNRS / INP / UJF / U. Stendhal, Grenoble, France

{Atef.Ben-Youssef, Pierre.Badin, Gerard.Bailly}@gipsa-lab.grenoble-inp.fr

[Figure: articulatory model views for the vowels /a/ /i/ /y/ /u/]

– Visual articulatory feedback
– Estimation of the articulatory movements from the speech signal
– Speech inversion system: HMM-based acoustic recognition and articulatory synthesis

– Acoustic HMMs
– Corpus: one French male speaker, parallel data (17 min, 5100 phones, 36 phonemes): acoustic features x (MFCC + E + ∆) // articulatory features y (12 EMA coordinates + ∆)
– Speech production model: left-to-right, 3-state, multi-stream HMM λ trained using Maximum Likelihood Estimation (MLE)
● Tied states and multi-Gaussian mixtures
–Decision tree-based state tying → improve the context-dependency
–Multiple-mixture-component Gaussian distributions → improve statistics when the number of occurrences is low
– Acoustic-to-articulatory inversion:

ŷ = argmax_y p(y | x), with p(y | x) = p(y | λ, q) · p(x | λ, q) · P(λ)

– Acoustic decoding (+ language model): Viterbi algorithm
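The decoding step can be sketched as a generic log-domain Viterbi pass; this is a textbook numpy illustration, not the HTK-style decoder presumably behind the poster, and the toy model in the test is an assumption:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Log-domain Viterbi decoding.

    log_pi : (S,)   initial state log-probabilities
    log_A  : (S, S) transition log-probabilities
    log_B  : (T, S) per-frame state emission log-likelihoods
    Returns the most likely state sequence (length T).
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]          # best log-score ending in each state
    psi = np.zeros((T, S), dtype=int)  # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[from, to]
        psi[t] = np.argmax(scores, axis=0)       # best predecessor per state
        delta = scores[psi[t], np.arange(S)] + log_B[t]
    # backtrack from the best final state
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

With a 2-state left-to-right toy model whose emissions favor state 0 then state 1, the decoded path switches states once, as expected.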

– Articulatory HMMs
–Articulatory synthesis: Maximum Likelihood Parameter Generation (MLPG)
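The MLPG idea can be illustrated with a minimal 1-D numpy sketch: given per-frame means and variances for a static feature and its delta, solve the normal equations (WᵀΣ⁻¹W)c = WᵀΣ⁻¹µ for the smooth static trajectory. The delta window (c[t+1] − c[t−1])/2 and the function name are assumptions, not the poster's exact configuration:

```python
import numpy as np

def mlpg_1d(mu_s, var_s, mu_d, var_d):
    """Maximum Likelihood Parameter Generation, 1-D sketch.

    mu_s, var_s : (T,) per-frame means/variances of the static feature
    mu_d, var_d : (T,) per-frame means/variances of the delta feature
    Returns the static trajectory c (T,) solving (W' P W) c = W' P mu,
    with P = diag(1 / var).
    """
    T = len(mu_s)
    # W stacks one static row and one delta row per frame;
    # delta_t = (c[t+1] - c[t-1]) / 2, with indices clamped at the edges.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                   # static row
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[2 * t + 1, hi] += 0.5             # delta row
        W[2 * t + 1, lo] -= 0.5
    mu = np.empty(2 * T)
    mu[0::2], mu[1::2] = mu_s, mu_d
    prec = np.empty(2 * T)
    prec[0::2], prec[1::2] = 1 / np.asarray(var_s), 1 / np.asarray(var_d)
    A = W.T @ (prec[:, None] * W)           # W' P W
    b = W.T @ (prec * mu)                   # W' P mu
    return np.linalg.solve(A, b)
```

With constant static targets and zero delta targets the solution is the constant itself, and targets consistent with a linear ramp are reproduced exactly; in general the delta constraints smooth the trajectory across state boundaries.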

● Minimum Generation Error (MGE)
–Generation error defined as the Euclidean distance between the generated Ŷ and the measured Y articulatory trajectories

D(Ŷ, Y) = ‖Ŷ − Y‖² = Σ_{t=1}^{T} ‖ŷ_t − y_t‖²

–Articulatory HMM parameters (mean µ and variance σ²) updated as

µ_update = µ_old − (µ_gen − µ_orig)

σ²_update = σ²_old − (1 / (N·T)) Σ_{n=1}^{N} Σ_{t=1}^{T} (ô_{n,t} − o_{n,t}) (ô_{n,t} − µ_{n,t})

– Acoustic state decoder:

[Figure: the acoustic state decoder chains left-to-right 3-state HMMs (λk, ..., λl; states q1, q2, q3)]
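The MGE quantities can be illustrated with a toy sketch (not the authors' training code): the generation error D, plus one mean update in which µ_gen and µ_orig are taken as per-state means of the generated and measured frames under a given alignment. That per-state reading of the update is an assumption on my part:

```python
import numpy as np

def generation_error(Y_hat, Y):
    """D(Y_hat, Y) = sum_t ||y_hat_t - y_t||^2 (squared Euclidean distance)."""
    Y_hat, Y = np.asarray(Y_hat), np.asarray(Y)
    return float(np.sum((Y_hat - Y) ** 2))

def mge_mean_update(mu_old, Y_hat, Y, align):
    """One MGE-style mean update per state:
    mu_update = mu_old - (mu_gen - mu_orig),
    where mu_gen / mu_orig are the means of the generated / measured
    frames aligned to each state (align[t] = state index of frame t)."""
    mu_new = mu_old.copy()
    for s in range(len(mu_old)):
        idx = np.flatnonzero(align == s)
        if idx.size:  # leave states with no aligned frames unchanged
            mu_new[s] = mu_old[s] - (Y_hat[idx].mean(axis=0) - Y[idx].mean(axis=0))
    return mu_new
```

The update shifts each state mean against the bias of its generated frames, so a state that over-generates by 1 mm has its mean lowered by 1 mm.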

– French acoustic-articulatory corpus
[Figure: measured and synthesized articulatory spaces, and Original / MLE / MGE tongue tip, tongue middle and tongue back y trajectories for the sequence /a/ /k/ /a/]

[Charts: French corpus, per context model (no-ctx, L-ctx, ctx-R, L-ctx-R). Phone recognition accuracy (%): 85.46, 84.31, 86.19, 86.35. Inversion RMSE (mm), MLE vs. MGE, from audio alone and from audio and text; recovered value groups: 1.72 / 1.55 / 1.48 / 1.58; 1.88 / 1.54 / 1.56 / 1.44; 1.62 / 1.38 / 1.40 / 1.35]
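The RMSE scores in the charts are millimetre errors between synthesized and measured EMA trajectories; a pooled computation can be sketched as follows (whether the poster pools over all frames and coordinates exactly this way is an assumption):

```python
import numpy as np

def rmse_mm(Y_hat, Y):
    """Root mean squared error (mm) between synthesized and measured
    EMA trajectories (T frames x D coordinates), pooled over all
    frames and coordinates."""
    Y_hat, Y = np.asarray(Y_hat, float), np.asarray(Y, float)
    return float(np.sqrt(np.mean((Y_hat - Y) ** 2)))
```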

– English acoustic-articulatory corpus (MOCHA-TIMIT, fsew0): 21 min, 14,000 phones, 45 phonemes
[Charts: per context model (no-ctx, L-ctx, ctx-R, L-ctx-R). Phone recognition accuracy (%): 55.82, 67.89, 70.20, 66.30. Inversion RMSE (mm), MLE vs. MGE, from audio alone and from audio and text; recovered value groups: 1.96 / 1.76 / 1.77 / 1.74; 1.68 / 1.61 / 1.59 / 1.62; 1.85 / 1.68 / 1.66 / 1.79]

– Thanks to Christophe Savariaux for his involvement in the recording of the EMA data.
– Work partially supported by the French ANR-08-EMER-001-02 project ARTIS.