Producing Emotional Speech
Thanks to Gabriel Schubiner
Papers
Generation of Affect in Synthesized Speech
Corpus-based approach to expressive speech synthesis
Synthesis of expressive visual speech on a talking head
Demos
Affect Editor Quiz/Demo
Synface Demo
Affect in Speech
Goals
Addition of emotion to synthetic speech
Acoustic Model
Typology of parameters of emotional speech
Quantification
Addresses problem of expressiveness
What benefit is gained from expressive speech?
Emotion Theory/Assumptions
Emotion -> Nervous System -> Speech Output
Binary distinction
Parasympathetic vs Sympathetic
based on physical changes
universal emotions
Approaches to Affect
Generative
Emotion -> Physical -> Acoustic
Descriptive
Observed acoustic parameters imposed directly on the synthesizer
Descriptive Framework
4 Parameter groups
Pitch
Timing
Voice Quality
Articulation
Assumption of independence
How could this affect design and results?
Pitch
Accent Shape
Average Pitch
Contour Slope
Final Lowering
Pitch Range
Reference Line
Exaggeration (not used)
Timing
Fluent Pauses
Hesitation Pauses
Speech Rate
Stress Frequency
Stressed Stressables
Voice Quality
Breathiness
Brilliance
Loudness
Pause Discontinuity
Pitch Discontinuity
Tremor
Laryngealization
Articulation
Precision
Implementation
Each parameter has a scale
Each scale is independent of the other parameters
Scales run from negative through neutral to positive values
Implementation
Settings grouped into preset conditions for each emotion
based on prior studies
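A minimal sketch of this representation in Python: the parameter names come from the framework above, but the -10..10 range and the preset values are illustrative assumptions, not Cahn's published settings.

from dataclasses import dataclass, field

# The four parameter groups of the descriptive framework.
PARAMETERS = [
    # pitch
    "accent_shape", "average_pitch", "contour_slope",
    "final_lowering", "pitch_range", "reference_line",
    # timing
    "fluent_pauses", "hesitation_pauses", "speech_rate", "stress_frequency",
    # voice quality
    "breathiness", "brilliance", "loudness", "pause_discontinuity",
    "pitch_discontinuity", "tremor", "laryngealization",
    # articulation
    "precision",
]

@dataclass
class AffectSetting:
    """One emotion preset: one value per parameter on an independent scale."""
    values: dict = field(default_factory=lambda: {p: 0 for p in PARAMETERS})

    def set(self, param, value):
        # Independence assumption: setting one scale touches no other.
        assert param in PARAMETERS and -10 <= value <= 10  # assumed range
        self.values[param] = value

# Illustrative preset (NOT the paper's numbers): sadness with lower pitch,
# slower rate, and a breathier voice quality.
sadness = AffectSetting()
sadness.set("average_pitch", -5)
sadness.set("speech_rate", -6)
sadness.set("breathiness", 4)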
Program Flow: Input
Emotion -> parameter representation
Utterance -> clauses
Agent, Action, Object, Locative
Clause and lexeme annotations
Finds all possible locations for affect cues and chooses whether or not to use each
Program Flow
Utterance -> Tree structure -> linear phonology
“compiled” for a specific synthesizer, with software simulating effects not available in the hardware
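A compact sketch of that flow in Python; the stage names and trivial helper bodies are mine, standing in for the real text analysis and synthesizer compilation.

def synthesize_with_affect(emotion, utterance):
    params = {"emotion": emotion}                  # emotion -> parameter representation
    clauses = [utterance]                          # utterance -> clauses (agent, action, object, locative)
    # Find all possible affect-cue sites, then choose whether to use each
    # (here: a naive stand-in that keeps longer tokens).
    sites = [w for c in clauses for w in c.split() if len(w) > 3]
    tree = ("utterance", clauses, sites)           # utterance -> tree structure
    phonology = " | ".join(clauses)                # tree -> linear phonology
    # "Compiled" for a specific synthesizer (Cahn targeted DECtalk),
    # with software simulating effects the hardware lacks.
    return {"synth": "DECtalk", "phonology": phonology, "params": params}

print(synthesize_with_affect("sadness", "I'm almost finished"))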
Perception
30 Utterances
5 sentences * 6 affects
Forced choice of one of six affects
Magnitude rating and comments
Elicitation Sentences
Intro
I’m almost finished
I’m going to the city
I saw your name in the paper X
I thought you really meant it
Look at that picture
Pop Quiz!!!
Pop Quiz Solutions
I’m almost finished -> Disgust : Surprise : Sadness : Gladness : Anger : Fear
I’m going to the city -> Surprise : Gladness : Anger : Disgust : Sadness : Fear
I thought you really meant it -> Anger : Disgust : Gladness : Sadness : Fear : Surprise
Look at that picture -> Anger : Fear : Disgust : Sadness : Gladness : Surprise
Results
Approx. 50% recognition rate overall
91% for sadness
Conclusions
Effective?
Thoughts?
Corpus-based Approach to
Expressive Speech Synthesis
Corpus
Collect utterances in each emotion
emotion-dependent semantics
One speaker
Good news, Bad news, Question
Model: Feature Vector
Features:
Lexical stress
Phrase-level stress
Distance from beginning of phrase
Distance from end of phrase
Part of speech (POS)
Phrase type
End-of-syllable pitch
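A hedged sketch of one per-syllable feature vector; the field names are mine, and the values would come from a text-analysis front end (and, for pitch, from context already decided).

def syllable_features(syl):
    return [
        syl["lexical_stress"],      # 0/1
        syl["phrase_stress"],       # 0/1, phrase-level stress
        syl["dist_from_start"],     # syllables from phrase start
        syl["dist_from_end"],       # syllables from phrase end
        syl["pos_id"],              # part of speech, integer-coded
        syl["phrase_type_id"],      # phrase type, integer-coded
        syl["end_pitch"],           # end-of-syllable pitch (Hz)
    ]

syl = dict(lexical_stress=1, phrase_stress=0, dist_from_start=2,
           dist_from_end=5, pos_id=3, phrase_type_id=0, end_pitch=118.0)
print(syllable_features(syl))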
Model: Classification
Predicts F0
5-syllable window
Uses feature vector to predict observation vector
observation vector: log(p), Δp
p = end of syllable pitch
Decision Tree
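A toy version of the classification step, using scikit-learn's regression tree as a stand-in for the paper's decision tree; the window stacking mirrors the 5-syllable context, while the data are fabricated.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_syl, n_feats = 200, 7
X_syl = rng.normal(size=(n_syl, n_feats))      # per-syllable feature vectors
pitch = 120 + 20 * rng.random(n_syl)           # end-of-syllable pitch p (Hz)

def window(X, i, w=2):
    """Concatenate features of syllables i-2 .. i+2 (clamped at edges)."""
    idx = np.clip(np.arange(i - w, i + w + 1), 0, len(X) - 1)
    return X[idx].ravel()

X = np.array([window(X_syl, i) for i in range(n_syl)])
# Observation vector per syllable: (log p, delta p).
y = np.column_stack([np.log(pitch), np.diff(pitch, prepend=pitch[0])])

tree = DecisionTreeRegressor(max_depth=6).fit(X, y)
log_p, delta_p = tree.predict(X[:1])[0]
print(np.exp(log_p), delta_p)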
Model: Target Duration
Similar to predicting F0
Build tree with the goal of obtaining a Gaussian at each leaf
Use mean of class as target duration
discretization
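Sketch of the duration model under the same assumptions: the tree discretizes the feature space, each leaf collects a roughly Gaussian set of training durations, and the leaf mean serves as the target duration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 7))                          # feature vectors as before
dur = np.abs(80 + 30 * X[:, 0] + rng.normal(scale=10, size=500))  # toy durations (ms)

tree = DecisionTreeRegressor(min_samples_leaf=25).fit(X, dur)
leaf = tree.apply(X[:1])[0]                            # leaf for one test syllable
members = dur[tree.apply(X) == leaf]                   # training durations at that leaf
print("leaf mean:", members.mean(), "std:", members.std())  # the leaf's Gaussian
print("target duration:", tree.predict(X[:1])[0])      # equals the leaf mean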
Models
Uses acoustic analogue of n-grams
Captures a sense of context, compared to describing the full emotion as one sequence
Compare to Affect Editor:
Uses only F0 and length (vs. the Affect Editor's fuller typology)
Includes information about which utterance the features are derived from
intentional bias, justified?
Model: Synthesis
Data tagged with original expression and emotion
expression-cost matrix
noted trade-off:
emotional intensity vs. smoothness
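A sketch of how an expression-cost matrix could enter unit selection; the expression labels follow the corpus (good news, bad news, question, plus neutral), while the cost values and the weighting scheme are illustrative.

import numpy as np

EXPRESSIONS = ["neutral", "good_news", "bad_news", "question"]
# COST[i][j]: penalty for using a unit recorded in expression i
# when the target expression is j (values illustrative).
COST = np.array([
    [0.0, 0.5, 0.5, 0.5],
    [0.5, 0.0, 1.0, 0.8],
    [0.5, 1.0, 0.0, 0.8],
    [0.5, 0.8, 0.8, 0.0],
])

def unit_cost(unit_expr, target_expr, acoustic_cost, w=1.0):
    """Total cost = acoustic cost + weighted expression mismatch.

    Raising w buys emotional intensity at the price of fewer usable
    candidate units, hence rougher joins: the noted trade-off.
    """
    i, j = EXPRESSIONS.index(unit_expr), EXPRESSIONS.index(target_expr)
    return acoustic_cost + w * COST[i, j]

print(unit_cost("neutral", "good_news", acoustic_cost=0.3))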
Paralinguistic events
SSML
Compare to Cahn’s typology
Abstraction layers
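For concreteness, a minimal markup example generated from Python: <speak> and <prosody> (with rate/pitch attributes) are standard SSML, while the <paralinguistic> tag is a hypothetical extension of the kind such events would require.

def ssml_bad_news(text):
    # Standard SSML prosody attributes; slower and lower for bad news.
    return (
        '<speak version="1.0">'
        '<paralinguistic type="sigh"/>'   # hypothetical extension tag
        '<prosody rate="85%" pitch="-10%">'
        f"{text}"
        "</prosody></speak>"
    )

print(ssml_bad_news("I thought you really meant it"))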
Perception Experiment
Distinguish same utterance spoken with neutral and affected prosody
Semantic content problematic?
Results
Binary decision
Reasonable gain over baseline?
Conclusion
Major contributions?
Paths forward?
Synthesis of Expressive Visual Speech on a
Talking Head
(Not these Talking Heads...)
Synthesis Background
Manipulation of video images
Virtual model with deformation parameters
Synchronized with time-aligned transcription
Articulatory Control Model
Cohen & Massaro (1993)
Data
Single actor
Given specific emotion as instruction
6 emotions + neutral
Facial Animation Parameters
Face independent
Position = FAP matrix * scaling factor + position_0 (neutral position)
Weighted deformations of distance between vertices and feature point
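A numpy sketch of applying face-independent FAPs to one face: each FAP contributes a displacement field scaled by a face-specific factor and added to the neutral positions; all shapes and weights here are illustrative.

import numpy as np

n_vertices, n_faps = 4, 2
position0 = np.zeros((n_vertices, 3))            # neutral vertex positions
fap_matrix = np.zeros((n_faps, n_vertices, 3))   # displacement per unit of FAP
# Weights fall off with a vertex's distance from the FAP's feature point.
fap_matrix[0, :, 1] = [1.0, 0.6, 0.2, 0.0]       # e.g. jaw-lowering FAP
scaling = np.array([0.8, 0.0])                   # face-specific FAP amplitudes

# position = FAP matrix * scaling factor + position_0
position = position0 + np.tensordot(scaling, fap_matrix, axes=1)
print(position)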
Modeling
Phonetic segments assigned target parameter vector
temporal blending over dominance functions
Principal components
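A sketch of dominance-function blending in the spirit of Cohen & Massaro (1993): each segment proposes a target value for a control parameter, and targets are averaged with time-varying negative-exponential weights. The constants are illustrative.

import numpy as np

def dominance(t, center, alpha=1.0, theta=8.0, p=1.0):
    # Negative-exponential dominance around the segment center.
    return alpha * np.exp(-theta * np.abs(t - center) ** p)

segments = [  # (center time in s, target value for one parameter)
    (0.10, 0.2),   # e.g. a bilabial: lips nearly closed
    (0.25, 0.9),   # e.g. an open vowel: jaw open
]
t = np.linspace(0.0, 0.4, 200)
D = np.array([dominance(t, c) for c, _ in segments])   # (2, 200) weights
T = np.array([v for _, v in segments])                 # (2,) targets
blended = (D * T[:, None]).sum(axis=0) / D.sum(axis=0) # dominance-weighted mean
print(blended[:5])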
ML
Separate models for each emotion
6:1 training:testing ratio
Models -> PC trajectories -> FAP trajectories * emotion parameter matrix
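A sketch of that chain with made-up dimensions: model output in PC space is mapped back to FAP space through the PCA basis, then scaled by an emotion parameter matrix (here a per-FAP diagonal).

import numpy as np

T_frames, n_pc, n_fap = 100, 5, 68
rng = np.random.default_rng(2)
pc_traj = rng.normal(size=(T_frames, n_pc))       # model output: PC trajectories
basis = rng.normal(size=(n_pc, n_fap))            # PCA basis (illustrative)
fap_traj = pc_traj @ basis                        # PC traj -> FAP traj
emotion_scale = np.ones(n_fap)
emotion_scale[:10] = 1.3                          # exaggerate some FAPs for emotion
print((fap_traj * emotion_scale).shape)           # FAP traj * emotion params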
Results
More extreme emotions easier to perceive
73% sad, 60% angry; weaker emotions closer to 40%
Synface Demo
Discussion
Changes in approach from Cahn to Eide
Production compared to Detection