PF-STAR: emotional speech synthesis

Istituto di Scienze e Tecnologie della Cognizione, Sezione di Padova – “Fonetica e Dialettologia”, CNR

Analysis of emotive speech: audio

Recordings: /’aba/, /’ava/, /m’amma/

Cue extraction and analysis: intensity, duration, pitch, pitch range, formants.

Mean F0 on the stressed vowel and F0mid are strongly correlated. Voice quality correlates: shimmer, jitter, HNR, Hammarberg index, spectral flatness, spectral energy distribution.
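As a rough illustration of two of the voice quality correlates listed above, the sketch below computes local jitter and local shimmer with the common textbook definition: the mean absolute difference between consecutive glottal cycles, divided by the mean value. The slides do not say which definition or tool was used, so the function names and the sample measurements are hypothetical.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive cycles, relative to the mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def jitter_local(periods_s):
    """Cycle-to-cycle variation of the glottal period (seconds in, ratio out)."""
    return local_perturbation(periods_s)


def shimmer_local(peak_amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (linear units in, ratio out)."""
    return local_perturbation(peak_amplitudes)


# Hypothetical per-cycle measurements for one stressed vowel (not data from the slides).
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
amplitudes = [0.80, 0.78, 0.82, 0.79, 0.81]

print(f"jitter  = {100 * jitter_local(periods):.2f} %")
print(f"shimmer = {100 * shimmer_local(amplitudes):.2f} %")
```

A perfectly periodic signal gives zero for both measures; larger values indicate more cycle-to-cycle irregularity.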

F0mean (global and for stressed vowel), F0mid, and F0range for /’aba/

Legend: anger (A), joy (J), fear (F), sadness (SA), disgust (D), surprise (SU), neutral (N)

Analysis of emotive speech: voice quality

Discriminant analysis:

classification scores: 60/70 % for stressed and unstressed vowel

Best score: Fear, Anger

Worst score: Surprise

Voice quality characterization:

Anger: harsh voice (/’a/)

Disgust: creaky voice (/a/)

Joy, Fear, Surprise: breathy voice

VOQUAL 2003 paper: “Emotions and Voice Quality: Experiments with Sinusoidal Modeling”

Processing of emotive speech

[Figure panels: Neutral; Target Disgust; Target Sadness]

Results:

• Time-stretch and (formant-preserving) pitch shift alone cannot account for the principal emotion-related cues

• Spectral conversion can account for some of the emotion cues
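To make the two prosodic operations concrete, here is a minimal sketch of formant-preserving pitch shift and time-stretch applied to sinusoidal-model parameters. The slides do not describe the algorithm, so everything here is an assumption: each frame is represented as a list of (frequency in Hz, amplitude) partials, the spectral envelope is approximated by linear interpolation between the original partials, and all function names are hypothetical.

```python
from bisect import bisect_right


def envelope_at(freq, env):
    """Linearly interpolate an envelope given as sorted (freq_hz, amp) points."""
    freqs = [f for f, _ in env]
    if freq <= freqs[0]:
        return env[0][1]
    if freq >= freqs[-1]:
        return env[-1][1]
    i = bisect_right(freqs, freq)
    (f0, a0), (f1, a1) = env[i - 1], env[i]
    return a0 + (a1 - a0) * (freq - f0) / (f1 - f0)


def pitch_shift_frame(partials, ratio):
    """Scale partial frequencies by `ratio`, re-reading each amplitude from the
    original envelope so that spectral peaks (formants) stay where they were."""
    env = sorted(partials)
    return [(f * ratio, envelope_at(f * ratio, env)) for f, _ in env]


def time_stretch(frames, factor):
    """Change duration by `factor` using nearest-frame lookup; the content of
    each frame (and therefore pitch and envelope) is left untouched."""
    n_out = max(1, round(len(frames) * factor))
    last = len(frames) - 1
    return [frames[min(last, int(i / factor))] for i in range(n_out)]


# Hypothetical analysis frame: three harmonics of a 100 Hz voice.
frame = [(100.0, 1.0), (200.0, 0.5), (300.0, 0.25)]
shifted = pitch_shift_frame(frame, 1.5)   # raise pitch by a factor of 1.5
stretched = time_stretch([frame] * 4, 2)  # double the duration
print(shifted)
print(len(stretched))
```

Both operations change only frequency positions and timing. The envelope itself is untouched, which is consistent with the observation that a separate spectral conversion step is needed to capture the remaining cues.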

[Figure panels: Disgust (Ps+Ts); Sadness (Ps+Ts)]

Neutral-to-emotive transformation based on sinusoidal modeling:

Processing of emotive speech

Neutral-to-emotive transformation based on sinusoidal modeling:

[Audio examples: Neutral, followed by anger, disgust, joy, fear, surprise, sadness; each emotion as Ps+Ts, Ps+Ts+Sc, and Target]

SI voice processing for TTS systems

Processing of emotive speech: results

Emotive synthesis based on FESTIVAL MBROLA (Male Voice)

[Audio examples: Neutral, followed by Anger, Disgust, Joy, Fear, Surprise, Sadness; each emotion as Ps+Ts, Ps+Ts+VQtr, and Target]

ETTS Audio Examples

[Audio examples: Anger, Disgust, Fear, Joy, Surprise, Sadness; each with “Neutral” Prosody, E-Prosody, and E-Prosody+VQ]

Mark-Up Languages for E-TTS

Hierarchic description of emotive voice:

High Level: emotive tag (e.g., <anger>, <joy>, <fear>, etc.)

Medium Level: phonetic voice description (e.g., <modal>, <soft>, <pressed>, etc.)

Low Level: acoustic description (e.g., <spectral tilt>, <shimmer>, <jitter>, etc.)

Definition of speaker-independent rules to control voice quality within a text-to-speech synthesizer.
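One way to picture the three-level hierarchy is as two lookup tables: an emotive tag resolves to a phonetic voice description, which in turn resolves to acoustic controls. The sketch below is only an illustration. The emotion-to-voice pairs follow the voice quality characterization reported earlier (anger: harsh, disgust: creaky, joy, fear, and surprise: breathy). The numeric values are placeholders, not measured data, and the fallback to a modal voice for tags without a reported characterization is an assumption.

```python
# High level -> medium level, following the voice qualities reported in the slides.
EMOTION_TO_VOICE = {
    "anger": "harsh",
    "disgust": "creaky",
    "joy": "breathy",
    "fear": "breathy",
    "surprise": "breathy",
}

# Medium level -> low level. All numbers are HYPOTHETICAL placeholders chosen
# only to show the shape of a rule table; they are not values from the slides.
VOICE_TO_ACOUSTICS = {
    "modal":   {"spectral tilt": 0.0, "shimmer": 0.0, "jitter": 0.0},
    "harsh":   {"spectral tilt": 1.0, "shimmer": 1.0, "jitter": 1.0},
    "creaky":  {"spectral tilt": 0.5, "shimmer": 1.0, "jitter": 2.0},
    "breathy": {"spectral tilt": -1.0, "shimmer": 0.5, "jitter": 0.5},
}


def resolve(emotive_tag):
    """Resolve a high-level tag to (voice description, acoustic controls).

    Tags with no reported voice quality fall back to "modal" (an assumption).
    """
    voice = EMOTION_TO_VOICE.get(emotive_tag, "modal")
    return voice, dict(VOICE_TO_ACOUSTICS[voice])


for tag in ("anger", "joy", "neutral"):
    voice, controls = resolve(tag)
    print(f"<{tag}> -> <{voice}> -> {controls}")
```

Keeping the two tables separate mirrors the hierarchy: a markup author can write at any of the three levels, and only the medium-to-low table would need to encode the speaker-independent rules.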
