Producing Emotional Speech
Thanks to Gabriel Schubiner
Papers
Generation of Affect in Synthesized Speech
Corpus-based approach to expressive speech synthesis
Synthesis of expressive visual speech on a talking head
Demos
Affect Editor Quiz/Demo
Synface Demo
Affect in Speech
Goals
Addition of emotion to synthetic speech
Acoustic Model
Typology of parameters of emotional speech
Quantification
Addresses problem of expressiveness
What benefit is gained from expressive speech?
Emotion Theory/Assumptions
Emotion -> Nervous System -> Speech Output
Binary distinction
Parasympathetic vs Sympathetic
based on physical changes
universal emotions
Approaches to Affect
Generative
Emotion -> Physical -> Acoustic
Descriptive
Observed acoustic parameters imposed directly on the synthesizer
Descriptive Framework
4 Parameter groups
Pitch
Timing
Voice Quality
Articulation
Assumption of independence
How could this affect design and results?
Pitch
Accent Shape
Average Pitch
Contour Slope
Final Lowering
Pitch Range
Reference Line
Exaggeration (not used)
Timing
Fluent Pauses
Hesitation Pauses
Speech Rate
Stress Frequency
Stressed Stressables
Voice Quality
Breathiness
Brilliance
Loudness
Pause Discontinuity
Pitch Discontinuity
Tremor
Laryngealization
Articulation
Precision
Implementation
Each parameter has a scale
Each scale is independent of the other parameters
Scales run from negative through neutral to positive values
Implementation
Settings grouped into preset conditions for each emotion
based on prior studies
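A minimal sketch of this representation in Python: the parameter names come from the framework above, but the -10..10 range and the preset values are illustrative assumptions, not Cahn's published settings.

from dataclasses import dataclass, field

# The four parameter groups of the descriptive framework.
PARAMETERS = [
    # pitch
    "accent_shape", "average_pitch", "contour_slope",
    "final_lowering", "pitch_range", "reference_line",
    # timing
    "fluent_pauses", "hesitation_pauses", "speech_rate", "stress_frequency",
    # voice quality
    "breathiness", "brilliance", "loudness", "pause_discontinuity",
    "pitch_discontinuity", "tremor", "laryngealization",
    # articulation
    "precision",
]

@dataclass
class AffectSetting:
    """One emotion preset: one value per parameter on an independent scale."""
    values: dict = field(default_factory=lambda: {p: 0 for p in PARAMETERS})

    def set(self, param, value):
        # Independence assumption: setting one scale touches no other.
        assert param in PARAMETERS and -10 <= value <= 10  # assumed range
        self.values[param] = value

# Illustrative preset (NOT the paper's numbers): sadness with lower pitch,
# slower rate, and a breathier voice quality.
sadness = AffectSetting()
sadness.set("average_pitch", -5)
sadness.set("speech_rate", -6)
sadness.set("breathiness", 4)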
Program Flow: Input
Emotion -> parameter representation
Utterance -> clauses
Agent, Action, Object, Locative
Clause and lexeme annotations
Finds all possible locations for affect cues and chooses whether or not to use each
Program Flow
Utterance -> Tree structure -> linear phonology
“compiled” for a specific synthesizer, with software simulating effects not available in the hardware
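A compact sketch of that flow in Python; the stage names and trivial helper bodies are mine, standing in for the real text analysis and synthesizer compilation.

def synthesize_with_affect(emotion, utterance):
    params = {"emotion": emotion}                  # emotion -> parameter representation
    clauses = [utterance]                          # utterance -> clauses (agent, action, object, locative)
    # Find all possible affect-cue sites, then choose whether to use each
    # (here: a naive stand-in that keeps longer tokens).
    sites = [w for c in clauses for w in c.split() if len(w) > 3]
    tree = ("utterance", clauses, sites)           # utterance -> tree structure
    phonology = " | ".join(clauses)                # tree -> linear phonology
    # "Compiled" for a specific synthesizer (Cahn targeted DECtalk),
    # with software simulating effects the hardware lacks.
    return {"synth": "DECtalk", "phonology": phonology, "params": params}

print(synthesize_with_affect("sadness", "I'm almost finished"))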
Perception
30 Utterances
5 sentences * 6 affects
Forced choice of one of six affects
Magnitude rating and comments
Elicitation Sentences
Intro
I’m almost finished
I’m going to the city
I saw your name in the paper X
I thought you really meant it
Look at that picture
Pop Quiz!!!
Pop Quiz Solutions
I’m almost finished -> Disgust : Surprise : Sadness : Gladness : Anger : Fear
I’m going to the city -> Surprise : Gladness : Anger : Disgust : Sadness : Fear
I thought you really meant it -> Anger : Disgust : Gladness : Sadness : Fear : Surprise
Look at that picture -> Anger : Fear : Disgust : Sadness : Gladness : Surprise
Results
Approx. 50% recognition rate overall
91% for sadness
Conclusions
Effective?
Thoughts?
Corpus-based Approach to
Expressive Speech Synthesis
Corpus
Collect utterances in each emotion
emotion-dependent semantics
One speaker
Good news, Bad news, Question
Model: Feature Vector
Features:
Lexical stress
Phrase-level stress
Distance from beginning of phrase
Distance from end of phrase
Part of speech (POS)
Phrase type
End-of-syllable pitch
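A hedged sketch of one per-syllable feature vector; the field names are mine, and the values would come from a text-analysis front end (and, for pitch, from context already decided).

def syllable_features(syl):
    return [
        syl["lexical_stress"],      # 0/1
        syl["phrase_stress"],       # 0/1, phrase-level stress
        syl["dist_from_start"],     # syllables from phrase start
        syl["dist_from_end"],       # syllables from phrase end
        syl["pos_id"],              # part of speech, integer-coded
        syl["phrase_type_id"],      # phrase type, integer-coded
        syl["end_pitch"],           # end-of-syllable pitch (Hz)
    ]

syl = dict(lexical_stress=1, phrase_stress=0, dist_from_start=2,
           dist_from_end=5, pos_id=3, phrase_type_id=0, end_pitch=118.0)
print(syllable_features(syl))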
Model: Classification
Predicts F0
5-syllable window
Uses feature vector to predict observation vector
observation vector: log(p), Δp
p = end of syllable pitch
Decision Tree
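A toy version of the classification step, using scikit-learn's regression tree as a stand-in for the paper's decision tree; the window stacking mirrors the 5-syllable context, while the data are fabricated.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_syl, n_feats = 200, 7
X_syl = rng.normal(size=(n_syl, n_feats))      # per-syllable feature vectors
pitch = 120 + 20 * rng.random(n_syl)           # end-of-syllable pitch p (Hz)

def window(X, i, w=2):
    """Concatenate features of syllables i-2 .. i+2 (clamped at edges)."""
    idx = np.clip(np.arange(i - w, i + w + 1), 0, len(X) - 1)
    return X[idx].ravel()

X = np.array([window(X_syl, i) for i in range(n_syl)])
# Observation vector per syllable: (log p, delta p).
y = np.column_stack([np.log(pitch), np.diff(pitch, prepend=pitch[0])])

tree = DecisionTreeRegressor(max_depth=6).fit(X, y)
log_p, delta_p = tree.predict(X[:1])[0]
print(np.exp(log_p), delta_p)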
Model: Target Duration
Similar to predicting F0
Build tree with the goal of obtaining a Gaussian at each leaf
Use mean of class as target duration
discretization
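Sketch of the duration model under the same assumptions: the tree discretizes the feature space, each leaf collects a roughly Gaussian set of training durations, and the leaf mean serves as the target duration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 7))                          # feature vectors as before
dur = np.abs(80 + 30 * X[:, 0] + rng.normal(scale=10, size=500))  # toy durations (ms)

tree = DecisionTreeRegressor(min_samples_leaf=25).fit(X, dur)
leaf = tree.apply(X[:1])[0]                            # leaf for one test syllable
members = dur[tree.apply(X) == leaf]                   # training durations at that leaf
print("leaf mean:", members.mean(), "std:", members.std())  # the leaf's Gaussian
print("target duration:", tree.predict(X[:1])[0])      # equals the leaf mean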
Models
Uses acoustic analogue of n-grams
Captures a sense of context, compared to describing the full emotion as one sequence
Compare to Affect Editor:
Uses only F0 and length (vs. the Affect Editor's fuller typology)
Includes information about which utterance the features are derived from
intentional bias, justified?
Model: Synthesis
Data tagged with original expression and emotion
expression-cost matrix
noted trade-off:
emotional intensity vs. smoothness
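A sketch of how an expression-cost matrix could enter unit selection; the expression labels follow the corpus (good news, bad news, question, plus neutral), while the cost values and the weighting scheme are illustrative.

import numpy as np

EXPRESSIONS = ["neutral", "good_news", "bad_news", "question"]
# COST[i][j]: penalty for using a unit recorded in expression i
# when the target expression is j (values illustrative).
COST = np.array([
    [0.0, 0.5, 0.5, 0.5],
    [0.5, 0.0, 1.0, 0.8],
    [0.5, 1.0, 0.0, 0.8],
    [0.5, 0.8, 0.8, 0.0],
])

def unit_cost(unit_expr, target_expr, acoustic_cost, w=1.0):
    """Total cost = acoustic cost + weighted expression mismatch.

    Raising w buys emotional intensity at the price of fewer usable
    candidate units, hence rougher joins: the noted trade-off.
    """
    i, j = EXPRESSIONS.index(unit_expr), EXPRESSIONS.index(target_expr)
    return acoustic_cost + w * COST[i, j]

print(unit_cost("neutral", "good_news", acoustic_cost=0.3))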
Paralinguistic events
SSML
Compare to Cahn’s typology
Abstraction layers
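For concreteness, a minimal markup example generated from Python: <speak> and <prosody> (with rate/pitch attributes) are standard SSML, while the <paralinguistic> tag is a hypothetical extension of the kind such events would require.

def ssml_bad_news(text):
    # Standard SSML prosody attributes; slower and lower for bad news.
    return (
        '<speak version="1.0">'
        '<paralinguistic type="sigh"/>'   # hypothetical extension tag
        '<prosody rate="85%" pitch="-10%">'
        f"{text}"
        "</prosody></speak>"
    )

print(ssml_bad_news("I thought you really meant it"))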
Perception Experiment
Distinguish same utterance spoken with neutral and affected prosody
Semantic content problematic?
Results
Binary decision
Reasonable gain over baseline?
Conclusion
Major contributions?
Paths forward?
Synthesis of Expressive Visual Speech on a
Talking Head
(Not these Talking Heads...)
Synthesis Background
Manipulation of video images
Virtual model with deformation parameters
Synchronized with time-aligned transcription
Articulatory Control Model
Cohen & Massaro (1993)
Data
Single actor
Given specific emotion as instruction
6 emotions + neutral
Facial Animation Parameters
Face independent
Position = FAP matrix * scaling factor + position_0 (neutral position)
Weighted deformations of distance between vertices and feature point
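A numpy sketch of applying face-independent FAPs to one face: each FAP contributes a displacement field scaled by a face-specific factor and added to the neutral positions; all shapes and weights here are illustrative.

import numpy as np

n_vertices, n_faps = 4, 2
position0 = np.zeros((n_vertices, 3))            # neutral vertex positions
fap_matrix = np.zeros((n_faps, n_vertices, 3))   # displacement per unit of FAP
# Weights fall off with a vertex's distance from the FAP's feature point.
fap_matrix[0, :, 1] = [1.0, 0.6, 0.2, 0.0]       # e.g. jaw-lowering FAP
scaling = np.array([0.8, 0.0])                   # face-specific FAP amplitudes

# position = FAP matrix * scaling factor + position_0
position = position0 + np.tensordot(scaling, fap_matrix, axes=1)
print(position)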
Modeling
Phonetic segments assigned target parameter vector
temporal blending over dominance functions
Principal components
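A sketch of dominance-function blending in the spirit of Cohen & Massaro (1993): each segment proposes a target value for a control parameter, and targets are averaged with time-varying negative-exponential weights. The constants are illustrative.

import numpy as np

def dominance(t, center, alpha=1.0, theta=8.0, p=1.0):
    # Negative-exponential dominance around the segment center.
    return alpha * np.exp(-theta * np.abs(t - center) ** p)

segments = [  # (center time in s, target value for one parameter)
    (0.10, 0.2),   # e.g. a bilabial: lips nearly closed
    (0.25, 0.9),   # e.g. an open vowel: jaw open
]
t = np.linspace(0.0, 0.4, 200)
D = np.array([dominance(t, c) for c, _ in segments])   # (2, 200) weights
T = np.array([v for _, v in segments])                 # (2,) targets
blended = (D * T[:, None]).sum(axis=0) / D.sum(axis=0) # dominance-weighted mean
print(blended[:5])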
ML
Separate models for each emotion
6:1 training:testing ratio
Models -> PC trajectories -> FAP trajectories * emotion parameter matrix
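A sketch of that chain with made-up dimensions: model output in PC space is mapped back to FAP space through the PCA basis, then scaled by an emotion parameter matrix (here a per-FAP diagonal).

import numpy as np

T_frames, n_pc, n_fap = 100, 5, 68
rng = np.random.default_rng(2)
pc_traj = rng.normal(size=(T_frames, n_pc))       # model output: PC trajectories
basis = rng.normal(size=(n_pc, n_fap))            # PCA basis (illustrative)
fap_traj = pc_traj @ basis                        # PC traj -> FAP traj
emotion_scale = np.ones(n_fap)
emotion_scale[:10] = 1.3                          # exaggerate some FAPs for emotion
print((fap_traj * emotion_scale).shape)           # FAP traj * emotion params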
Results
More extreme emotions easier to perceive
73% sad, 60% angry; weaker emotions closer to 40%
Synface Demo
Discussion
Changes in approach from Cahn to Eide
Production compared to Detection