PF-STAR: emotional speech synthesis
Istituto di Scienze e Tecnologie della Cognizione, Sezione di Padova – “Fonetica e Dialettologia”, CNR
Analysis of emotive speech: audio
Recordings: /’aba/, /’ava/, /m’amma/
Cues extraction and analysis: Intensity, duration, pitch, pitch range, formants.
The mean F0 of the stressed vowel and the F0mid values are strongly correlated.
Voice quality correlates: shimmer, jitter, HNR, Hammarberg’s index, spectral flatness, spectral energy distribution.
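As a rough illustration of how some of these cues can be computed, the sketch below derives F0 mean, F0 range, local jitter, and local shimmer from a pitch track and from per-cycle measurements. It assumes the common "local" definitions of jitter and shimmer (mean absolute difference between consecutive cycles, divided by the mean); the slides do not say which definitions were used. The function names and the input values are hypothetical.

```python
def f0_stats(f0_hz):
    """Mean and range of F0 over voiced frames (unvoiced frames are coded as 0)."""
    voiced = [f for f in f0_hz if f > 0]
    return {"mean": sum(voiced) / len(voiced), "range": max(voiced) - min(voiced)}


def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to their mean.

    Applied to glottal periods this is local jitter; applied to per-cycle
    peak amplitudes it is local shimmer.
    """
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical measurements for one vowel (not data from the study).
pitch_track = [0, 100, 120, 110, 0]          # Hz, 0 = unvoiced
periods = [0.010, 0.011, 0.010, 0.011]       # seconds per glottal cycle
amplitudes = [0.80, 0.82, 0.79, 0.81]        # peak amplitude per cycle

stats = f0_stats(pitch_track)
jitter = local_perturbation(periods)
shimmer = local_perturbation(amplitudes)
```

Using one helper for both jitter and shimmer reflects that the two measures share the same form and differ only in the quantity tracked from cycle to cycle.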
F0mean (global and for stressed vowel), F0mid, and F0range for /’aba/
Emotions: anger (A), joy (J), fear (F), sadness (SA), disgust (D), surprise (SU), neutral (N)
Analysis of emotive speech: voice quality
Discriminant analysis:
classification scores: 60/70 % for stressed and unstressed vowel
Best score: Fear, Anger
Worst score: Surprise
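For readers unfamiliar with the method, a classification score of this kind can be obtained roughly as follows. This is a minimal sketch of linear discriminant analysis on two features with a pooled covariance matrix, not the authors' actual setup: the feature choice (F0 mean and jitter), the three classes, and all numbers are made-up toy values, so the perfect score it produces says nothing about the reported results.

```python
import math
from collections import defaultdict


def fit_lda(samples, labels):
    """Fit a two-feature linear discriminant with a pooled covariance matrix."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    means = {}
    sxx = sxy = syy = 0.0
    for y, pts in groups.items():
        mx = sum(p[0] for p in pts) / len(pts)
        my = sum(p[1] for p in pts) / len(pts)
        means[y] = (mx, my)
        for px, py in pts:  # accumulate within-class scatter
            sxx += (px - mx) ** 2
            sxy += (px - mx) * (py - my)
            syy += (py - my) ** 2
    dof = len(samples) - len(groups)
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))  # 2x2 inverse
    priors = {y: len(pts) / len(samples) for y, pts in groups.items()}
    return means, inv, priors


def predict(model, x):
    """Return the class with the highest linear discriminant score."""
    means, inv, priors = model

    def score(y):
        mx, my = means[y]
        wx = inv[0][0] * mx + inv[0][1] * my
        wy = inv[1][0] * mx + inv[1][1] * my
        return x[0] * wx + x[1] * wy - 0.5 * (mx * wx + my * wy) + math.log(priors[y])

    return max(means, key=score)


def classification_score(model, samples, labels):
    """Fraction of samples assigned to their true class."""
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)


# Toy data: (F0 mean in Hz, jitter in %). Invented values, not measurements.
samples = [(120, 0.5), (125, 0.6), (118, 0.4),
           (200, 1.5), (210, 1.4), (205, 1.7),
           (100, 1.0), (95, 1.1), (98, 0.9)]
labels = ["neutral"] * 3 + ["anger"] * 3 + ["sadness"] * 3

model = fit_lda(samples, labels)
score = classification_score(model, samples, labels)
```

In practice the score would be computed on held-out data or with cross-validation rather than on the training samples as done here.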
Voice quality characterization:
Anger: harsh voice (/’a/)
Disgust: creaky voice (/a/)
Joy, Fear, Surprise: breathy voice
VOQUAL 2003 paper: “Emotions and Voice Quality: Experiments with Sinusoidal Modeling”
Processing of emotive speech
Audio examples: Neutral; Target Disgust; Target Sadness; Disgust; Sadness
Results:
• Time-stretch and (formant-preserving) pitch shift alone cannot account for the principal emotion-related cues
• Spectral conversion can account for some of the emotion cues
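To make the two basic operations concrete, the sketch below applies a formant-preserving pitch shift and a time-stretch to frames of a harmonic (sinusoidal) model. It is an illustration under simplifying assumptions, not the authors' algorithm: each frame is reduced to an F0 and a list of harmonic amplitudes, the spectral envelope is approximated by linear interpolation between harmonics, and time-stretching is done by nearest-frame resampling. All names and values are hypothetical.

```python
def envelope_at(freq, harmonics):
    """Linearly interpolate the spectral envelope given sorted (freq, amp) pairs."""
    if freq <= harmonics[0][0]:
        return harmonics[0][1]
    if freq >= harmonics[-1][0]:
        return harmonics[-1][1]
    for (f_lo, a_lo), (f_hi, a_hi) in zip(harmonics, harmonics[1:]):
        if f_lo <= freq <= f_hi:
            t = (freq - f_lo) / (f_hi - f_lo)
            return a_lo + t * (a_hi - a_lo)


def pitch_shift_frame(f0, amps, ratio):
    """Scale F0 by `ratio` while resampling amplitudes from the original envelope.

    Because the new harmonics read their amplitudes from the unchanged
    envelope, the formant positions stay where they were.
    """
    harmonics = [((k + 1) * f0, a) for k, a in enumerate(amps)]
    new_f0 = f0 * ratio
    n_harmonics = int(harmonics[-1][0] // new_f0)  # stay within the original band
    new_amps = [envelope_at((k + 1) * new_f0, harmonics) for k in range(n_harmonics)]
    return new_f0, new_amps


def time_stretch(frames, factor):
    """Change duration by `factor` via nearest-frame resampling of the frame sequence."""
    n_out = max(1, round(len(frames) * factor))
    return [frames[min(len(frames) - 1, int(i / factor))] for i in range(n_out)]


# One hypothetical frame: F0 = 100 Hz with four harmonic amplitudes.
shifted_f0, shifted_amps = pitch_shift_frame(100, [1.0, 0.5, 0.25, 0.125], 2.0)
stretched = time_stretch(["f0", "f1", "f2"], 2.0)
```

Both operations leave the envelope itself untouched, which is consistent with the observation above that cues tied to spectral shape need a separate spectral conversion step.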
Audio examples: Disgust (Ps+Ts); Sadness (Ps+Ts)
Neutral → Emotive transformation based on sinusoidal modeling
Processing of emotive speech
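Neutral → Emotive transformation based on sinusoidal modeling:
Audio examples: Neutral, then anger, disgust, joy, fear, surprise, and sadness, each in three versions: Ps+Ts, Ps+Ts+Sc, Target (Ps: pitch shift, Ts: time-stretch, Sc: spectral conversion)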
SI voice processing for TTS systems
Processing of emotive speech: results
Emotive synthesis based on FESTIVAL MBROLA (Male Voice)
Audio examples: Neutral, then Anger, Disgust, Joy, Fear, Surprise, and Sadness, each in three versions: Ps+Ts, Ps+Ts+VQtr, Target
E-TTS Audio Examples
“Neutral” Prosody, then Anger, Disgust, Fear, Joy, Surprise, and Sadness, each in two versions: E-Prosody, E-Prosody+VQ
Mark-Up Languages for E-TTS
Hierarchic description of emotive voice:
High Level: emotive tag (e.g., <anger>, <joy>, <fear>, etc.)
Medium Level: phonetic voice description (e.g., <modal>, <soft>, <pressed>, etc.)
Low Level: acoustic description (e.g., <spectral tilt>, <shimmer>, <jitter>, etc.)
Definition of speaker-independent rules to control voice quality within a text-to-speech synthesizer.
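One way to picture such a rule is as a rewrite from a high-level emotive tag to a medium-level voice description. The sketch below is purely illustrative: the mapping reuses the voice quality characterization reported earlier in these slides (anger: harsh, disgust: creaky, joy/fear/surprise: breathy), while the tag names `<harsh>`, `<creaky>`, `<breathy>`, and `<utterance>`, the `emotion` attribute, and the function name are assumptions, not part of any mark-up language defined by the project.

```python
import xml.etree.ElementTree as ET

# Assumed rule table: emotive tag -> phonetic voice description tag.
EMOTION_TO_VOICE = {
    "anger": "harsh",
    "disgust": "creaky",
    "joy": "breathy",
    "fear": "breathy",
    "surprise": "breathy",
}


def expand_emotive_tags(xml_text):
    """Rewrite high-level emotive tags as medium-level voice quality tags.

    The original emotion is kept as an attribute so that a later stage
    could still apply emotion-specific prosody.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        voice = EMOTION_TO_VOICE.get(elem.tag)
        if voice is not None:
            elem.set("emotion", elem.tag)
            elem.tag = voice
    return ET.tostring(root, encoding="unicode")


marked_up = "<utterance><anger>Go away</anger> now</utterance>"
expanded = expand_emotive_tags(marked_up)
```

A further stage of the same kind could map each medium-level tag to low-level acoustic settings such as spectral tilt, shimmer, and jitter; those values are not given in the slides, so they are left out here.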