Learning optimal audiovisual phasing for an HMM-based control model for facial animation
Learning optimal audiovisual phasing for an HMM-based control model for facial animation
O. Govokhina (1,2), G. Bailly (2), G. Breton (1)
(1) France Telecom R&D – Rennes
(2) GIPSA-lab, dpt. Parole&Cognition – Grenoble
SSW6, Bonn, August 2007
Agenda
1. Facial animation
2. Data and articulatory model
3. Trajectory formation models: state of the art; first improvement: Task-Dynamics for Animation (TDA)
4. Multimodal coordination: AV asynchrony; PHMM (Phased Hidden Markov Model); results and conclusions
1. Facial Animation
Facial Animation
Domain: visual speech synthesis
• Control model: computes multiparametric trajectories from phonetic input
• Shape model: specifies how facial geometry is modified by the articulatory parameters
• Appearance model: final image rendering
• Data from motion capture
[Diagram: AV data from motion capture are analysed to learn the models; at synthesis, phonetic input drives the control model, then the shape model, then the appearance model to produce the facial animation.]
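To make the data flow concrete, here is a minimal, hypothetical Python sketch of the three-stage chain named on this slide; the function bodies are placeholders and the names, shapes and frame counts are illustrative, not taken from the original system.

```python
import numpy as np

# Hypothetical sketch of the synthesis chain:
# phonetic input -> control model -> shape model -> appearance model.
def control_model(phones):
    """Phonetic input -> multiparametric (articulatory) trajectories."""
    return np.zeros((len(phones) * 10, 6))        # e.g. 10 frames/phone, 6 parameters

def shape_model(trajectories):
    """Articulatory parameters -> facial geometry, frame by frame."""
    return [np.zeros((500, 3)) for _ in trajectories]   # hypothetical 500-vertex mesh

def appearance_model(meshes):
    """Facial geometry -> final rendered images."""
    return [np.zeros((256, 256, 3), dtype=np.uint8) for _ in meshes]

images = appearance_model(shape_model(control_model(["x~", "h", "i"])))
```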
2. Data and articulatory model
Data and articulatory model
Audiovisual database (FT): 540 sentences, one female subject, 150 colored beads with automatic tracking
Cloning methodology developed at ICP (Badin et al., 2002; Revéret et al., 2000)
Visual parameters:
• 3 geometric parameters: lip aperture/closure, lip width, lip protrusion
• 6 articulatory parameters: jaw opening, jaw advance, lip rounding, upper lip movements, lower lip movements, throat movements
[Figure: illustration of the lip protrusion, aperture and width measurements.]
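As a reading aid, here is a hypothetical container for one frame of this parameterization; the field names paraphrase the slide and are not the corpus's actual labels.

```python
from dataclasses import dataclass

# One frame of visual parameters (3 geometric + 6 articulatory), as listed above.
@dataclass
class VisualFrame:
    # geometric parameters
    lip_aperture: float
    lip_width: float
    lip_protrusion: float
    # articulatory parameters
    jaw_opening: float
    jaw_advance: float
    lip_rounding: float
    upper_lip: float
    lower_lip: float
    throat: float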
3. Trajectory formation systems
Trajectory formation systems: state of the art
Control models
Visual-only
• Coarticulation models: Massaro & Cohen; Öhman, …
• Triphones, kinematic models: Deng; Okadome, Kaburagi & Honda, …
From acoustics: linear vs. nonlinear mappings
• Yehia et al.; Berthommier
• Nakamura et al.: voice conversion techniques (GMM, HMM) used for speech-to-articulatory inversion
Multimodal
• Synthesis by concatenation: Minnis et al.; Bailly et al., …
• HMM synthesis: Masuko et al.; Tokuda et al., …
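As a concrete example of the simplest of these approaches, here is a minimal sketch of a linear acoustic-to-visual mapping estimated by least squares (the idea cited above for Yehia et al.); the array names and dimensions are illustrative, not the cited authors' code.

```python
import numpy as np

# Fit a linear regression from acoustic feature vectors to visual parameters.
def fit_linear_mapping(acoustic, visual):
    """acoustic: (frames x d_a), visual: (frames x d_v); returns (W, b)."""
    X = np.hstack([acoustic, np.ones((acoustic.shape[0], 1))])   # add a bias column
    coeffs, *_ = np.linalg.lstsq(X, visual, rcond=None)
    return coeffs[:-1], coeffs[-1]                               # weights, bias

def apply_linear_mapping(acoustic, W, b):
    """Map acoustic frames to predicted visual parameters."""
    return acoustic @ W + b
```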
Trajectory formation systems: concatenation
[Diagram: linguistic processing → unit selection/concatenation (informed by a prosodic model) → parametric synthesis.]
Principles
• Multi-represented multimodal segments
• Selection & concatenation costs; optimal selection by DTW (a sketch of the selection step follows below)
• Selection costs:
  • between features or more complex phonological structures
  • between stored cues and cues computed by external models (e.g. prosody)
• Post-processing: smoothing
Advantages/disadvantages
+ Quality of the synthetic speech (units come from natural speech). MOS test (rule-based, concatenation, linear acoustic-to-visual mapping): concatenation is judged almost equivalent to the original movements (Gibert et al., IEEE SS 2002)
- Requires a very large audiovisual database
- Bad joins and/or inappropriate units are very visible
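Below is a hypothetical sketch of the selection step: a Viterbi-style dynamic program over the candidate units of each target segment that minimizes the sum of selection and concatenation costs. The cost functions and data structures are illustrative placeholders, not the authors' implementation (the slide mentions DTW for the optimal selection).

```python
# targets: list of target segment descriptions; candidates[i]: units for target i.
# selection_cost(target, unit) and concat_cost(left_unit, right_unit) are any
# nonnegative functions, e.g. weighted distances between feature vectors.
def select_units(targets, candidates, selection_cost, concat_cost):
    n = len(targets)
    best = [dict() for _ in range(n)]            # best[i][j] = (total cost, backpointer)
    for j, u in enumerate(candidates[0]):
        best[0][j] = (selection_cost(targets[0], u), None)
    for i in range(1, n):
        for j, u in enumerate(candidates[i]):
            sel = selection_cost(targets[i], u)
            cost, back = min(((best[i - 1][k][0] + concat_cost(candidates[i - 1][k], u) + sel, k)
                              for k in best[i - 1]), key=lambda x: x[0])
            best[i][j] = (cost, back)
    # backtrack the optimal unit sequence
    j = min(best[-1], key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```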
Trajectory formation systems: HMM-based synthesis
Principles
Learning
• Contextual phone-sized HMMs
• Static & dynamic parameters
• Gaussian / multi-Gaussian pdfs
Generation
• Selection of the HMMs
• Distribution of phone durations among states (z-scoring)
• Solving linear equations (see the sketch below)
• Smoothing due to the dynamic pdfs
Advantages/disadvantages
+ Statistical parametric synthesis:
  • requires a relatively small database
  • can easily be adapted to different applications (languages, speaking rates, emotions, …)
MOS test (concatenation, HMM, linear acoustic-to-visual mapping): on average, HMM synthesis is rated better than concatenation… but under-articulated (Govokhina et al., Interspeech 2006)
[Diagram: training: audio and visual parameters are segmented and used for HMM learning, producing HMMs and state duration models; synthesis: phonetic input selects the HMM sequence, state durations are generated, and visual parameter generation yields the synthetic trajectories.]
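The "solving linear equations" step corresponds to maximum-likelihood parameter generation with static and dynamic features. Here is a minimal numpy sketch for one visual parameter with a single delta feature; the delta window, array shapes and names are assumptions, not the authors' exact implementation.

```python
import numpy as np

def mlpg(means, variances, delta_window=(-0.5, 0.0, 0.5)):
    """Solve (W' S^-1 W) c = W' S^-1 mu for the static trajectory c,
    where o = W c stacks the static and delta features per frame.
    means, variances: (T x 2) arrays of per-frame [static, delta] Gaussian stats."""
    T = means.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                       # static row: o_static(t) = c(t)
        for k, w in zip((-1, 0, 1), delta_window):
            if 0 <= t + k < T:
                W[2 * t + 1, t + k] = w         # delta row: o_delta(t) = 0.5*(c(t+1)-c(t-1))
    mu = means.reshape(-1)                      # interleaved [s0, d0, s1, d1, ...]
    prec = 1.0 / variances.reshape(-1)          # diagonal precisions
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)                # smooth static trajectory
```

Here `means[t]` and `variances[t]` would come from the Gaussian of the HMM state occupied at frame t, once the phone durations have been distributed among the states; the returned vector is the smoothed trajectory of that parameter.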
First improvement: TDA (HMM + concatenation)
[Diagram: phonetic input feeds a planning stage (HMM synthesis, geometric score) and an execution stage (unit selection/concatenation, articulatory score), drawing on a dictionary of visual segments (geometric and articulatory).]
Task dynamics (Saltzman & Munhall, 1989): specifying phonetic targets with geometric/aerodynamic gauges
Testing TDA: encouraging results [Govokhina et al., Interspeech 2006], but HMM planning fails to predict precise AV timing relations
Planning: geometric goals; HMM synthesis (smooth, coherent)
Execution: articulatory parameters; synthesis by concatenation (detailed articulation & intrinsic multimodal synchronization); a sketch of the execution step follows below
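A hypothetical sketch of the execution step under these assumptions: for each planned segment, the dictionary unit whose stored geometric trajectory is closest to the HMM-planned geometric target is selected, and the articulatory trajectories of the chosen units are concatenated. The data layout, the RMS distance and the 3/6-dimensional shapes are illustrative, not the actual TDA scores.

```python
import numpy as np

def select_and_concatenate(planned_geometry, dictionary):
    """planned_geometry: list of (frames x 3) arrays, one per segment (geometric goals).
    dictionary: list of candidate lists; each candidate is a dict with
    'geometry' (frames x 3) and 'articulation' (frames x 6) arrays."""
    chosen = []
    for target, candidates in zip(planned_geometry, dictionary):
        def cost(unit):
            n = min(len(target), len(unit["geometry"]))
            return np.sqrt(np.mean((target[:n] - unit["geometry"][:n]) ** 2))
        chosen.append(min(candidates, key=cost))          # geometric score
    # execution: detailed articulation from the stored units
    return np.concatenate([u["articulation"] for u in chosen], axis=0)
```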
4. PHMM: Phased Hidden Markov Model
AV asynchrony
Possible/known asynchrony
• Non-audible gestures: during silences (e.g. pre-phonatory gestures), plosives, etc.
• Visually salient gestures with little acoustic impact:
  • anticipatory gestures: rounding within consonants (/stri/ vs. /stry/)
  • predominance of phonatory modes over articulation for determining phone boundaries
• Cause (articulation) precedes effect (sound)
Modeling synchrony
• Few attempts in AV recognition:
  • coupled HMMs: Alissali, 1996; Luettin et al., 2001; Gravier et al., 2002
  • no significant improvements (Hazen, 2005), but AV fusion is more problematic than timing
• Very few attempts in AV synthesis: Okadome et al.
PHMM: Phased Hidden Markov Model
Visual speech synthesis: state-of-the-art systems synchronize gestures with sound boundaries
Simultaneous automatic learning:
• classical HMM learning applied to the articulatory parameters
• the proposed audiovisual delay learning algorithm, an iterative analysis-by-synthesis procedure based on the Viterbi algorithm
Simple phasing model: an average delay associated with each context-dependent HMM (see the sketch below)
Tested on the FT AV database
[Diagram: training: synchronous audiovisual data plus the acoustic segmentation feed HMM training and Viterbi alignment, yielding context-dependent HMMs and a context-dependent phasing model; synthesis: TTS output drives parameter generation from the HMMs to produce the synthesized articulatory trajectories.]
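How the learned phasing model could be applied at synthesis time is sketched below under simple assumptions: each acoustic phone boundary is shifted by the average audiovisual delay learned for that phone, with a minimal gestural duration enforced. The sign convention, the per-boundary application and the function itself are illustrative, not the paper's exact procedure.

```python
def visual_segmentation(acoustic_starts, labels, avg_delay, min_dur=0.030):
    """acoustic_starts: phone start times (s), plus the final end time.
    labels: phone (or context) labels, one per phone.
    avg_delay: label -> mean AV delay (s); assumed convention: positive = gesture leads the sound."""
    visual = [t - avg_delay.get(lab, 0.0) for t, lab in zip(acoustic_starts, labels)]
    visual.append(acoustic_starts[-1])
    # keep the visual boundaries monotone and enforce a minimal phone duration
    for i in range(1, len(visual)):
        visual[i] = max(visual[i], visual[i - 1] + min_dur)
    return visual
```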
Results
Rapid convergence
• within a few iterations
• but with constraints: simple phasing model; minimal durations for gestures
Large improvement
• 10% for context-independent HMMs
• combines with context: further and larger improvement for context-dependent HMMs
Significant delays
• largest for the first & last segments (prephonatory gestures, ~150 ms)
• positive for vowels, glides and bilabials; negative for back and nasal consonants
• in accordance with Öhman's numerical theory of coarticulation: slow vocalic gestures expand whereas rapid consonantal gestures shrink
[Figure: mean reconstruction error (mm, roughly 1.35 to 1.85) over 0-10 iterations, for monophones without context and monophones with next-viseme context.]
[Figure: average audiovisual delays (ms, roughly -100 to 250) per class: F.ph., L.ph., Unr.V., R.V., Sv., Blb., Alv., Lbd., Cons.]
Illustration
Features
• Prephonation
• Postphonation (see the final /o/)
• Rounding (see /ɥi/): a longer gestural duration enables complete protrusion
[Figure: for the utterance "un huis-clos" (phone sequence ._ x~ h i k l o .), the audio signal and the lip aperture, lip width and protrusion trajectories (mm) over 0-2700 ms, comparing the original, HMM synthesis and PHMM synthesis.]
Conclusions
Speech-specific trajectory formation models, trainable and parameterized by data
• TDA: robustness & detailed articulation
• PHMM: learning phasing relations between the modalities
Perspectives
• Combining TDA and PHMM, notably segmenting multimodal units using PHMM
• Subjective evaluation: intelligibility, adequacy & cognitive load
• PHMM:
  • more sophisticated phasing models: regression trees, etc.
  • using state boundaries as possible anchor points
  • applying to other gestures: CS, deictic/iconic gestures that should be coordinated with speech
Examples
Thank you for your attention
For further details, mail me at: [email protected]
PHMM
1. Classical context-dependent HMM learning on the articulatory parameters, using the temporal information on phonetic boundaries (initially the audio segmentation SA).
2. Phoneme realignment on the articulatory parameters by Viterbi.
3. Computation of the average audiovisual delay per model, with a constraint of minimal phoneme duration (30 ms).
4. Visual segmentation SV(i) calculated from the average audiovisual delay model and the audio segmentation SA.
5. Stop if Corr(SV(i), SV(i-1)) has converged (the visual segmentation no longer changes); otherwise return to step 1 using SV(i). A sketch of this loop follows below.
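A hypothetical Python sketch of this analysis-by-synthesis loop is given below. The `realign` callable stands in for steps 1-2 (retraining the articulatory HMMs on the current segmentation and Viterbi-realigning the phones) and returns phone start times on the articulatory stream; the delay averaging, the 30 ms minimal duration and the stopping test mirror the steps above, while the sign convention, data layout and tolerance are assumptions.

```python
import numpy as np

def learn_phasing(sa_starts, labels, realign, min_dur=0.030, max_iter=10, tol=1e-3):
    """sa_starts: acoustic phone start times (s), one per label.
    labels: phone (or context) labels.
    realign(boundaries) -> phone start times from HMMs retrained on `boundaries`."""
    sa = np.asarray(sa_starts, dtype=float)
    sv = sa.copy()                                       # SV(0) = SA
    delays = {}
    for _ in range(max_iter):
        aligned = np.asarray(realign(sv), dtype=float)   # Viterbi starts on articulatory data
        per_boundary = sa - aligned                      # assumed: positive = gesture leads the sound
        delays = {lab: float(np.mean([d for d, l in zip(per_boundary, labels) if l == lab]))
                  for lab in set(labels)}                # average delay per (context-dependent) phone
        new_sv = sa - np.array([delays[l] for l in labels])
        new_sv = np.maximum.accumulate(new_sv)           # keep boundaries monotone
        for i in range(1, len(new_sv)):                  # enforce the minimal phoneme duration
            new_sv[i] = max(new_sv[i], new_sv[i - 1] + min_dur)
        # the slide uses Corr(SV(i), SV(i-1)) as the convergence measure;
        # an absolute-change test is used here for simplicity
        if np.max(np.abs(new_sv - sv)) < tol:
            return delays, new_sv
        sv = new_sv
    return delays, sv
```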
Examples