Learning optimal audiovisual phasing for an HMM-based control model for facial animation
Learning optimal audiovisual phasing for an HMM-based control model for facial animation
O. Govokhina (1,2), G. Bailly (2), G. Breton (1)
(1) France Telecom R&D – Rennes
(2) GIPSA-lab, dpt. Parole&Cognition – Grenoble
SSW6, Bonn, August 2007
Agenda
1. Facial animation
2. Data and articulatory model
3. Trajectory formation models: state of the art; first improvement: Task-Dynamics for Animation (TDA)
4. Multimodal coordination: AV asynchrony; PHMM (Phased Hidden Markov Model); results and conclusions
1. Facial Animation
Facial Animation
Domain: visual speech synthesis
• Control model: computes multiparametric trajectories from phonetic input
• Shape model: specifies how facial geometry is modified by the articulatory parameters
• Appearance model: final image rendering
• Data from motion capture
[Diagram: AV data from motion capture are analysed to learn the models; at synthesis, phonetic input drives the control model, then the shape model, then the appearance model to produce the facial animation.]
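To make the data flow concrete, here is a minimal, hypothetical Python sketch of the three-stage chain named on this slide; the function bodies are placeholders and the names, shapes and frame counts are illustrative, not taken from the original system.

```python
import numpy as np

# Hypothetical sketch of the synthesis chain:
# phonetic input -> control model -> shape model -> appearance model.
def control_model(phones):
    """Phonetic input -> multiparametric (articulatory) trajectories."""
    return np.zeros((len(phones) * 10, 6))        # e.g. 10 frames/phone, 6 parameters

def shape_model(trajectories):
    """Articulatory parameters -> facial geometry, frame by frame."""
    return [np.zeros((500, 3)) for _ in trajectories]   # hypothetical 500-vertex mesh

def appearance_model(meshes):
    """Facial geometry -> final rendered images."""
    return [np.zeros((256, 256, 3), dtype=np.uint8) for _ in meshes]

images = appearance_model(shape_model(control_model(["x~", "h", "i"])))
```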
2. Data and articulatory model
Data and articulatory model
Audiovisual database (FT): 540 sentences, one female subject, 150 colored beads with automatic tracking
Cloning methodology developed at ICP (Badin et al., 2002; Revéret et al., 2000)
Visual parameters:
• 3 geometric parameters: lip aperture/closure, lip width, lip protrusion
• 6 articulatory parameters: jaw opening, jaw advance, lip rounding, upper lip movements, lower lip movements, throat movements
[Figure: illustration of the lip protrusion, aperture and width measurements.]
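As a reading aid, here is a hypothetical container for one frame of this parameterization; the field names paraphrase the slide and are not the corpus's actual labels.

```python
from dataclasses import dataclass

# One frame of visual parameters (3 geometric + 6 articulatory), as listed above.
@dataclass
class VisualFrame:
    # geometric parameters
    lip_aperture: float
    lip_width: float
    lip_protrusion: float
    # articulatory parameters
    jaw_opening: float
    jaw_advance: float
    lip_rounding: float
    upper_lip: float
    lower_lip: float
    throat: float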
3. Trajectory formation systems
Trajectory formation systems: state of the art
Control models
Visual-only
• Coarticulation models: Massaro & Cohen; Öhman, …
• Triphones, kinematic models: Deng; Okadome, Kaburagi & Honda, …
From acoustics: linear vs. nonlinear mappings
• Yehia et al.; Berthommier
• Nakamura et al.: voice conversion techniques (GMM, HMM) used for speech-to-articulatory inversion
Multimodal
• Synthesis by concatenation: Minnis et al.; Bailly et al., …
• HMM synthesis: Masuko et al.; Tokuda et al., …
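As a concrete example of the simplest of these approaches, here is a minimal sketch of a linear acoustic-to-visual mapping estimated by least squares (the idea cited above for Yehia et al.); the array names and dimensions are illustrative, not the cited authors' code.

```python
import numpy as np

# Fit a linear regression from acoustic feature vectors to visual parameters.
def fit_linear_mapping(acoustic, visual):
    """acoustic: (frames x d_a), visual: (frames x d_v); returns (W, b)."""
    X = np.hstack([acoustic, np.ones((acoustic.shape[0], 1))])   # add a bias column
    coeffs, *_ = np.linalg.lstsq(X, visual, rcond=None)
    return coeffs[:-1], coeffs[-1]                               # weights, bias

def apply_linear_mapping(acoustic, W, b):
    """Map acoustic frames to predicted visual parameters."""
    return acoustic @ W + b
```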
Trajectory formation systems: concatenation
[Diagram: linguistic processing → unit selection/concatenation (informed by a prosodic model) → parametric synthesis.]
Principles
• Multi-represented multimodal segments
• Selection & concatenation costs; optimal selection by DTW (a sketch of the selection step follows below)
• Selection costs:
  • between features or more complex phonological structures
  • between stored cues and cues computed by external models (e.g. prosody)
• Post-processing: smoothing
Advantages/disadvantages
+ Quality of the synthetic speech (units come from natural speech). MOS test (rule-based, concatenation, linear acoustic-to-visual mapping): concatenation is judged almost equivalent to the original movements (Gibert et al., IEEE SS 2002)
- Requires a very large audiovisual database
- Bad joins and/or inappropriate units are very visible
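Below is a hypothetical sketch of the selection step: a Viterbi-style dynamic program over the candidate units of each target segment that minimizes the sum of selection and concatenation costs. The cost functions and data structures are illustrative placeholders, not the authors' implementation (the slide mentions DTW for the optimal selection).

```python
# targets: list of target segment descriptions; candidates[i]: units for target i.
# selection_cost(target, unit) and concat_cost(left_unit, right_unit) are any
# nonnegative functions, e.g. weighted distances between feature vectors.
def select_units(targets, candidates, selection_cost, concat_cost):
    n = len(targets)
    best = [dict() for _ in range(n)]            # best[i][j] = (total cost, backpointer)
    for j, u in enumerate(candidates[0]):
        best[0][j] = (selection_cost(targets[0], u), None)
    for i in range(1, n):
        for j, u in enumerate(candidates[i]):
            sel = selection_cost(targets[i], u)
            cost, back = min(((best[i - 1][k][0] + concat_cost(candidates[i - 1][k], u) + sel, k)
                              for k in best[i - 1]), key=lambda x: x[0])
            best[i][j] = (cost, back)
    # backtrack the optimal unit sequence
    j = min(best[-1], key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```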
Trajectory formation systems: HMM-based synthesis
Principles
Learning
• Contextual phone-sized HMMs
• Static & dynamic parameters
• Gaussian / multi-Gaussian pdfs
Generation
• Selection of the HMMs
• Distribution of phone durations among states (z-scoring)
• Solving linear equations (see the sketch below)
• Smoothing due to the dynamic pdfs
Advantages/disadvantages
+ Statistical parametric synthesis:
  • requires a relatively small database
  • can easily be adapted to different applications (languages, speaking rates, emotions, …)
MOS test (concatenation, HMM, linear acoustic-to-visual mapping): on average, HMM synthesis is rated better than concatenation… but under-articulated (Govokhina et al., Interspeech 2006)
[Diagram: training: audio and visual parameters are segmented and used for HMM learning, producing HMMs and state duration models; synthesis: phonetic input selects the HMM sequence, state durations are generated, and visual parameter generation yields the synthetic trajectories.]
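The "solving linear equations" step corresponds to maximum-likelihood parameter generation with static and dynamic features. Here is a minimal numpy sketch for one visual parameter with a single delta feature; the delta window, array shapes and names are assumptions, not the authors' exact implementation.

```python
import numpy as np

def mlpg(means, variances, delta_window=(-0.5, 0.0, 0.5)):
    """Solve (W' S^-1 W) c = W' S^-1 mu for the static trajectory c,
    where o = W c stacks the static and delta features per frame.
    means, variances: (T x 2) arrays of per-frame [static, delta] Gaussian stats."""
    T = means.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                       # static row: o_static(t) = c(t)
        for k, w in zip((-1, 0, 1), delta_window):
            if 0 <= t + k < T:
                W[2 * t + 1, t + k] = w         # delta row: o_delta(t) = 0.5*(c(t+1)-c(t-1))
    mu = means.reshape(-1)                      # interleaved [s0, d0, s1, d1, ...]
    prec = 1.0 / variances.reshape(-1)          # diagonal precisions
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)                # smooth static trajectory
```

Here `means[t]` and `variances[t]` would come from the Gaussian of the HMM state occupied at frame t, once the phone durations have been distributed among the states; the returned vector is the smoothed trajectory of that parameter.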
First improvement: TDA (HMM + concatenation)
[Diagram: phonetic input feeds a planning stage (HMM synthesis, geometric score) and an execution stage (unit selection/concatenation, articulatory score), drawing on a dictionary of visual segments (geometric and articulatory).]
Task dynamics (Saltzman & Munhall, 1989): specifying phonetic targets with geometric/aerodynamic gauges
Testing TDA: encouraging results [Govokhina et al., Interspeech 2006], but HMM planning fails to predict precise AV timing relations
Planning: geometric goals; HMM synthesis (smooth, coherent)
Execution: articulatory parameters; synthesis by concatenation (detailed articulation & intrinsic multimodal synchronization); a sketch of the execution step follows below
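A hypothetical sketch of the execution step under these assumptions: for each planned segment, the dictionary unit whose stored geometric trajectory is closest to the HMM-planned geometric target is selected, and the articulatory trajectories of the chosen units are concatenated. The data layout, the RMS distance and the 3/6-dimensional shapes are illustrative, not the actual TDA scores.

```python
import numpy as np

def select_and_concatenate(planned_geometry, dictionary):
    """planned_geometry: list of (frames x 3) arrays, one per segment (geometric goals).
    dictionary: list of candidate lists; each candidate is a dict with
    'geometry' (frames x 3) and 'articulation' (frames x 6) arrays."""
    chosen = []
    for target, candidates in zip(planned_geometry, dictionary):
        def cost(unit):
            n = min(len(target), len(unit["geometry"]))
            return np.sqrt(np.mean((target[:n] - unit["geometry"][:n]) ** 2))
        chosen.append(min(candidates, key=cost))          # geometric score
    # execution: detailed articulation from the stored units
    return np.concatenate([u["articulation"] for u in chosen], axis=0)
```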
4. PHMM: Phased Hidden Markov Model
AV asynchrony
Possible/known asynchrony
• Non-audible gestures: during silences (e.g. pre-phonatory gestures), plosives, etc.
• Visually salient gestures with little acoustic impact:
  • anticipatory gestures: rounding within consonants (/stri/ vs. /stry/)
  • predominance of phonatory modes over articulation for determining phone boundaries
• Cause (articulation) precedes effect (sound)
Modeling synchrony
• Few attempts in AV recognition:
  • coupled HMMs: Alissali, 1996; Luettin et al., 2001; Gravier et al., 2002
  • no significant improvements (Hazen, 2005), but AV fusion is more problematic than timing
• Very few attempts in AV synthesis: Okadome et al.
PHMM: Phased Hidden Markov Model
Visual speech synthesis: state-of-the-art systems synchronize gestures with sound boundaries
Simultaneous automatic learning:
• classical HMM learning applied to the articulatory parameters
• the proposed audiovisual delay learning algorithm, an iterative analysis-by-synthesis procedure based on the Viterbi algorithm
Simple phasing model: an average delay associated with each context-dependent HMM (see the sketch below)
Tested on the FT AV database
[Diagram: training: synchronous audiovisual data plus the acoustic segmentation feed HMM training and Viterbi alignment, yielding context-dependent HMMs and a context-dependent phasing model; synthesis: TTS output drives parameter generation from the HMMs to produce the synthesized articulatory trajectories.]
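How the learned phasing model could be applied at synthesis time is sketched below under simple assumptions: each acoustic phone boundary is shifted by the average audiovisual delay learned for that phone, with a minimal gestural duration enforced. The sign convention, the per-boundary application and the function itself are illustrative, not the paper's exact procedure.

```python
def visual_segmentation(acoustic_starts, labels, avg_delay, min_dur=0.030):
    """acoustic_starts: phone start times (s), plus the final end time.
    labels: phone (or context) labels, one per phone.
    avg_delay: label -> mean AV delay (s); assumed convention: positive = gesture leads the sound."""
    visual = [t - avg_delay.get(lab, 0.0) for t, lab in zip(acoustic_starts, labels)]
    visual.append(acoustic_starts[-1])
    # keep the visual boundaries monotone and enforce a minimal phone duration
    for i in range(1, len(visual)):
        visual[i] = max(visual[i], visual[i - 1] + min_dur)
    return visual
```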
Results
Rapid convergence
• within a few iterations
• but with constraints: simple phasing model; minimal durations for gestures
Large improvement
• 10% for context-independent HMMs
• combines with context: further and larger improvement for context-dependent HMMs
Significant delays
• largest for the first & last segments (prephonatory gestures, ~150 ms)
• positive for vowels, glides and bilabials; negative for back and nasal consonants
• in accordance with Öhman's numerical theory of coarticulation: slow vocalic gestures expand whereas rapid consonantal gestures shrink
[Figure: mean reconstruction error (mm, roughly 1.35 to 1.85) over 0-10 iterations, for monophones without context and monophones with next-viseme context.]
[Figure: average audiovisual delays (ms, roughly -100 to 250) per class: F.ph., L.ph., Unr.V., R.V., Sv., Blb., Alv., Lbd., Cons.]
Illustration
Features
• Prephonation
• Postphonation (see the final /o/)
• Rounding (see /ɥi/): a longer gestural duration enables complete protrusion
[Figure: for the utterance "un huis-clos" (phone sequence ._ x~ h i k l o .), the audio signal and the lip aperture, lip width and protrusion trajectories (mm) over 0-2700 ms, comparing the original, HMM synthesis and PHMM synthesis.]
Conclusions
Speech-specific trajectory formation models, trainable and parameterized by data
• TDA: robustness & detailed articulation
• PHMM: learning phasing relations between the modalities
Perspectives
• Combining TDA and PHMM, notably segmenting multimodal units using PHMM
• Subjective evaluation: intelligibility, adequacy & cognitive load
• PHMM:
  • more sophisticated phasing models: regression trees, etc.
  • using state boundaries as possible anchor points
  • applying to other gestures: CS, deictic/iconic gestures that should be coordinated with speech
Examples
Thank you for your attention
For further details, mail me at: [email protected]
PHMM
1. Classical context-dependent HMM learning on the articulatory parameters, using the temporal information on phonetic boundaries (initially the audio segmentation SA).
2. Phoneme realignment on the articulatory parameters by Viterbi.
3. Computation of the average audiovisual delay per model, with a constraint of minimal phoneme duration (30 ms).
4. Visual segmentation SV(i) calculated from the average audiovisual delay model and the audio segmentation SA.
5. Stop if Corr(SV(i), SV(i-1)) has converged (the visual segmentation no longer changes); otherwise return to step 1 using SV(i). A sketch of this loop follows below.
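A hypothetical Python sketch of this analysis-by-synthesis loop is given below. The `realign` callable stands in for steps 1-2 (retraining the articulatory HMMs on the current segmentation and Viterbi-realigning the phones) and returns phone start times on the articulatory stream; the delay averaging, the 30 ms minimal duration and the stopping test mirror the steps above, while the sign convention, data layout and tolerance are assumptions.

```python
import numpy as np

def learn_phasing(sa_starts, labels, realign, min_dur=0.030, max_iter=10, tol=1e-3):
    """sa_starts: acoustic phone start times (s), one per label.
    labels: phone (or context) labels.
    realign(boundaries) -> phone start times from HMMs retrained on `boundaries`."""
    sa = np.asarray(sa_starts, dtype=float)
    sv = sa.copy()                                       # SV(0) = SA
    delays = {}
    for _ in range(max_iter):
        aligned = np.asarray(realign(sv), dtype=float)   # Viterbi starts on articulatory data
        per_boundary = sa - aligned                      # assumed: positive = gesture leads the sound
        delays = {lab: float(np.mean([d for d, l in zip(per_boundary, labels) if l == lab]))
                  for lab in set(labels)}                # average delay per (context-dependent) phone
        new_sv = sa - np.array([delays[l] for l in labels])
        new_sv = np.maximum.accumulate(new_sv)           # keep boundaries monotone
        for i in range(1, len(new_sv)):                  # enforce the minimal phoneme duration
            new_sv[i] = max(new_sv[i], new_sv[i - 1] + min_dur)
        # the slide uses Corr(SV(i), SV(i-1)) as the convergence measure;
        # an absolute-change test is used here for simplicity
        if np.max(np.abs(new_sv - sv)) < tol:
            return delays, new_sv
        sv = new_sv
    return delays, sv
```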
Examples