Modelling Prosody for Speech Synthesis: example from Polish

Dominika Oliver

IGK Colloquium, 22 July 2004

04/18/23 2

Outline

Goal: prosodic modelling for TTS

Review of past studies: intonational investigations

Current state: latest modelling results

TTS Cycle

Text Input (raw or annotated)

Text Processing: Text Normalisation (names, abbreviations, numbers); Linguistic Analysis (morphology, syntax, semantics)

Phonetic Analysis: Grapheme-to-Phoneme Conversion (rules, dictionary)

Prosodic Analysis: Pitch, Phrasing & Duration Modelling

Speech Synthesis: Voice Rendering

TTS Cycle

Prosodic analysis/modelling

Prosodic components (focus, stress, duration etc.)

Prosodic phrasing

Intonation: accent types, pitch contour

Overview

Procedure

Resources

Modelling techniques

Modelling prosody

Problems & solutions

Suggested improvements

Procedure

Prosodic modelling shopping list:

Language specific intonation description

Accent type and placement prediction & F0 generation methods

Research and evaluation tool (Festival)

Language specific intonation description

Quantitative analysis of Polish intonation (accent types)

Standard description of Polish intonation (Jassem, 1961, 1984; Demenko, 1999)

Falling: HL, HM, ML, xL

Rising: LM, MH, LH

Level: MM

Rise-fall: LHL

Broad-Narrow Focus/Peak alignment study (Andreeva and Oliver, 2003)

Accent types

Falling

Accent types

Rising

Resources

Speech corpora: PoInt (Polish Intonation Database) (Karpiński, 2001)

350 MB, multi-speaker (~40 speakers)

Read and (semi-)spontaneous speech

Transcribed: syllable-based IPA segmental transcription and syllable-based prosodic annotation

Resources

PoInt Prosodic transcription

Tone heights: xH, H, M, L, xL

Phrase boundary indication

Resources

Falling

[Figure: waveform and F0 contour against time (0 to 3.48 s), with syllable tier "fst ci va wem vled vje vi dot ne d ur ci ter pen t ne" and tone labels H L followed by a phrase boundary (|)]

Resources

Rising

[Figure: waveform and F0 contour against time (0 to 0.71 s), with syllable tier "t to prav da" and tone labels L H followed by a phrase boundary (|)]

Resources

Festival TTS (Black & Taylor, 1998)

A general multi-lingual speech synthesis system

Offers a full text-to-speech system

An environment for development and research of speech synthesis techniques

Modelling techniques

Default prosodic assignment from simple text analysis

Hand-built rule-based system: hard to modify and adapt to new domains

Corpus-based approaches (Sproat et al. ’92): train prosodic variation on large labelled corpora using machine learning techniques

Modelling techniques – accent type/placement prediction

Classification and regression trees (CART) (Breiman, Friedman, Olshen & Stone 1984, 1993)

In speech synthesis, widely used to model:

• segment durations (e.g. Riley 1992)

• accent prediction (Syrdal, Hirschberg, McGory & Beckman 2001)

• pitch contour generation (Dusterhoff 1997; Dusterhoff, Black & Taylor 1999)

Modelling techniques - F0 prediction

Linear regression (Black & Hunt, 1996)

Used e.g. for F0 contour prediction/generation

Find the appropriate F0 target per syllable based on available features; trained from data

The predicted variable (p) can be modelled as a sum of a set of weighted real-valued factors:

$p = w_0 + w_1 f_1 + w_2 f_2 + w_3 f_3 + \dots + w_n f_n$

factors ($f_i$): parameterised properties of the data

weights ($w_i$): usually trained using a stepwise least squares technique
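As a rough illustration of the weighted-sum model above, the sketch below fits the weights by ordinary least squares on the normal equations (a simplification, not the stepwise procedure named on the slide) and predicts one F0 target. The two binary features and the toy targets are hypothetical and are not taken from PoInt.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations touch both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_weights(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 carries w0
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, feature_vector):
    """p = w0 + w1*f1 + ... + wn*fn"""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], feature_vector))


# Hypothetical per-syllable features: (is_stressed, carries_H_accent).
toy_features = [(0, 0), (1, 0), (0, 1), (1, 1)]
# Toy normalised F0 targets, generated from p = 0.1 + 0.3*f1 + 0.5*f2.
toy_targets = [0.1, 0.4, 0.6, 0.9]

weights = fit_weights(toy_features, toy_targets)
prediction = predict(weights, (1, 1))
```

Because the toy targets are exactly linear in the features, the fit recovers the generating weights; real F0 data would leave a residual.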

Prerequisite

F0 normalisation (Ladd, 1995; Clark, 2003)

(PoInt: 40 speakers, mixed sex)

$f0^{n}_{i} = \frac{f0_{i} - \mu}{\sigma}$

where $\mu$ is the f0 mean and $\sigma$ is the f0 standard deviation of the utterance

The rescaling uses the standard deviation $\sigma_D$ and the mean f0 $\mu_D$ of the database:

$f0'_{i} = f0^{n}_{i} \, \sigma_D + \mu_D$
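A minimal sketch of the normalisation step described above, assuming a z-score per utterance followed by a rescaling with database-level statistics. The use of the population standard deviation, the function names, and all numeric values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def normalise_f0(f0_values):
    """Z-score the F0 values of one utterance: (f0 - mean) / standard deviation."""
    mu = mean(f0_values)
    sigma = pstdev(f0_values)  # population SD; an assumption, the slide does not specify
    return [(f0 - mu) / sigma for f0 in f0_values]


def rescale_f0(normalised_values, database_mean, database_sd):
    """Map normalised values back onto the database-level F0 scale."""
    return [z * database_sd + database_mean for z in normalised_values]


# Toy F0 values for one utterance (hypothetical, not from PoInt).
utterance_f0 = [180.0, 220.0, 200.0, 160.0, 240.0]
normalised = normalise_f0(utterance_f0)

# Hypothetical database statistics.
rescaled = rescale_f0(normalised, database_mean=150.0, database_sd=20.0)
```

After normalisation every utterance has mean 0 and standard deviation 1, which is what makes data from male and female speakers comparable before training.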

Modelling

Steps

Building the utterance structure of the database speech files

Incorporating database intonation labelling

Extracting features for accent prediction and F0 generation

Building CART model: PoInt intonation labels

Building LR model: 3 points per syllable

Incorporating model parameters into voice description

Modelling - accent type/placement prediction

Model based on PoInt: multiple speakers (male, female)

Accent inventory (L, H, M)

Accent prediction method: CART

Features (31):

POS window

Position of candidate syllable in word and sentence

Stress information window, etc.
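The feature types listed above could be extracted along the following lines. This is only a sketch: the window size of one syllable either side, the dictionary keys, the POS tags, and the toy utterance are all hypothetical, and it covers a handful of features rather than the full set of 31.

```python
def window(values, index, size=1, pad="NONE"):
    """Return values from index-size to index+size, padding past the edges."""
    return [
        values[i] if 0 <= i < len(values) else pad
        for i in range(index - size, index + size + 1)
    ]


def syllable_features(syllables, index):
    """Build a flat feature dict for the candidate syllable at `index`."""
    pos_tags = [s["pos"] for s in syllables]
    stresses = [s["stress"] for s in syllables]
    candidate = syllables[index]
    features = {
        "position_in_word": candidate["position_in_word"],
        "position_in_sentence": index,
    }
    # POS window: tag of the previous, current and next syllable's word.
    for offset, tag in zip((-1, 0, 1), window(pos_tags, index)):
        features[f"pos[{offset:+d}]"] = tag
    # Stress window: lexical stress of the previous, current and next syllable.
    for offset, stress in zip((-1, 0, 1), window(stresses, index, pad=0)):
        features[f"stress[{offset:+d}]"] = stress
    return features


# Toy three-syllable utterance with hypothetical tags and stress marks.
toy_syllables = [
    {"text": "to", "pos": "pron", "stress": 1, "position_in_word": 0},
    {"text": "praw", "pos": "noun", "stress": 1, "position_in_word": 0},
    {"text": "da", "pos": "noun", "stress": 0, "position_in_word": 1},
]
first = syllable_features(toy_syllables, 0)
last = syllable_features(toy_syllables, 2)
```

Each dict is one training row for a tree learner; the padding value marks syllables at an utterance edge so that the tree can split on that fact.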

Results – accent prediction

Train set (total 963, correct 897, 93.146%)

Accents   NONE     H     L     M   Correct/Total   Accuracy
NONE       839     0     3     0   839/842         99.60%
H           15     5     5     4   5/29            17.20%
L            7     1    37     5   37/50           74%
M           13     1    12    16   16/42           38%

Test set (total 1070, correct 996, 93.084%)

Accents   NONE     H     L     M   Correct/Total   Accuracy
NONE       953     3     6     4   953/966         98.70%
H           11     8     2     6   8/27            29.63%
L            4     1    29    12   29/46           63.04%
M           11     3    11     6   6/31            19.40%
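The overall and per-class figures can be recomputed from the confusion matrices, as in the sketch below. It assumes that each row holds the counts for one reference label and each column one predicted label; the slide does not state the orientation, but the row sums match the totals it reports.

```python
LABELS = ["NONE", "H", "L", "M"]

# Counts copied from the results above; rows are assumed to be reference labels.
TRAIN_CONFUSION = [
    [839, 0, 3, 0],
    [15, 5, 5, 4],
    [7, 1, 37, 5],
    [13, 1, 12, 16],
]
TEST_CONFUSION = [
    [953, 3, 6, 4],
    [11, 8, 2, 6],
    [4, 1, 29, 12],
    [11, 3, 11, 6],
]


def per_class_accuracy(confusion, labels):
    """Diagonal count divided by the row total, for each label."""
    return {
        label: confusion[i][i] / sum(confusion[i])
        for i, label in enumerate(labels)
    }


def overall_accuracy(confusion):
    """Sum of the diagonal divided by the sum of all cells."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


train_overall = overall_accuracy(TRAIN_CONFUSION)
test_overall = overall_accuracy(TEST_CONFUSION)
train_by_class = per_class_accuracy(TRAIN_CONFUSION, LABELS)
```

The high overall accuracy is dominated by the NONE class, which accounts for 842 of the 963 training syllables; the per-class values show how much weaker the prediction of H and M is.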

Modelling - F0 prediction/generation

F0 generation: linear regression

Features:

• accent type

• POS window

• position of candidate syllable in word and sentence

• stress information window, etc.

Results – F0 shape prediction

Position   Train RMSE   Train correlation   Test RMSE   Test correlation
start      48.56        0.46                50.17       0.40
mid        55.87        0.49                59.13       0.49
end        58.99        0.45                54.50       0.48
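For reference, the two measures reported above can be computed as in this sketch, assuming RMSE is the root mean squared error and the correlation is Pearson's r between predicted and observed F0 targets. The toy values are hypothetical.

```python
from math import sqrt


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed values."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return sqrt(squared / len(observed))


def pearson_correlation(predicted, observed):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / sqrt(var_p * var_o)


# Toy observed and predicted F0 targets (hypothetical values).
toy_observed = [200.0, 180.0, 220.0, 160.0]
toy_predicted = [190.0, 190.0, 210.0, 170.0]

toy_rmse = rmse(toy_predicted, toy_observed)
toy_r = pearson_correlation(toy_predicted, toy_observed)
```

The two measures are complementary: RMSE penalises absolute distance from the target, while the correlation only rewards getting the direction of movement right.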

Potential problems

Data: not enough tokens to learn from

Annotation inconsistencies (noisy data, messy accent class assignment)

Inappropriate technique / suboptimal feature set

Potential data problems

PoInt Analysis

Peak alignment

Addressing data issues

F0 tracking errors

Identifying outliers / annotation inconsistencies

Re-classifying accent types

When everything else fails – blame it on the data

Labelling errors

Unmarked disfluencies / wrong reading

Phonemic labelling

Missing phrasing

No indication of sentence mode in annotation

Inconsistent labelling

Misleading transcription description

No independent labellers

Data fixes

Automatically identifying outliers / annotation inconsistencies: statistical analysis of acoustic parameters

Manual data inspection

Insertion of phrase boundaries

Marking of disfluencies

Aligning speech with text

Deriving Gold Standard (hard)
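One simple way to flag candidates for manual inspection is a z-score test on an acoustic parameter, sketched below. The threshold of two standard deviations and the toy F0 peak values are assumptions; the slides do not say which statistic was actually used.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=2.0):
    """Return indices of values more than `threshold` SDs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Toy F0 peak values for tokens sharing one accent label (hypothetical).
# The value 410.0 mimics an F0 tracking error such as pitch doubling.
toy_f0_peaks = [210.0, 205.0, 198.0, 202.0, 207.0, 200.0, 410.0, 203.0]
suspect = flag_outliers(toy_f0_peaks)
```

Flagged tokens are only candidates: a large deviation may be a tracking error, a mislabelled accent, or a genuine but rare realisation, which is why manual inspection follows.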

Accent classification studies

Hierarchical clustering (Klabbers & van Santen 2004)

Linear regression (Keller & Zellner Keller, 2003)

EM bagging & boosting (Sun, 2002)

HMMs

(Kumpf & King, 2004) (Blackburn, Vonwiller & King, 1993) (Batliner et al., 1999, 2001) (Maragoudakis, 2003; Zervas, 2004) (Chan, Feng, Heinen & Niederjohn, 1994)

Accent type re-classification

Two-stage procedure

Self-organising maps (Kohonen, 1982, 1995) (Kaski, 1997) (Vesanto & Alhoniemi, 2000)

Create a set of prototype vectors representative of the data

Project the prototypes onto a low-dimensional space

Hierarchical agglomerative clustering (HAC)

Method for finding good candidates for map unit clusters: cut the dendrogram where there is a large distance between two clusters
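The second stage can be sketched as follows. The example covers only the agglomerative clustering and the dendrogram cut; the self-organising map is not implemented, and a handful of hypothetical 2-D points stand in for its prototype vectors. Average linkage and Euclidean distance are assumptions.

```python
from math import dist


def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    total = sum(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points):
    """Merge the two closest clusters until one is left.

    Returns one (merge_distance, clusters_before_merge) entry per merge.
    """
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        pairs = [
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        a, b = min(
            pairs,
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        distance = average_linkage(clusters[a], clusters[b], points)
        history.append((distance, [c[:] for c in clusters]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history


def cut_at_largest_gap(history):
    """Return the clustering that exists just before the biggest jump in merge distance."""
    distances = [d for d, _ in history]
    gaps = [distances[k + 1] - distances[k] for k in range(len(distances) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return history[k + 1][1]


# Hypothetical prototype vectors standing in for SOM map units.
prototypes = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0),
    (10.0, 0.0), (10.0, 0.2),
]
clusters = cut_at_largest_gap(agglomerate(prototypes))
```

Cutting at the largest gap lets the data decide the number of accent classes instead of fixing it in advance.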

Acoustic data parameterisation

Accent type classification: (Demenko, 1999)

1. Difference between start F0 (first vowel) and F0 extreme value (on a vowel or consonant)

2. Difference between F0 extreme value and end point F0

3. Difference between F0 max and F0 min

4. Difference between utterance mean F0 and mean F0 for all utterances by the same voice

5. Difference between utterance min F0 and global mean min F0 for the same voice

$x_1 = F_{p} - F_{ev}$

$x_2 = F_{e} - F_{k}$

$x_3 = F_{max} - F_{min}$

$x_4 = F_{sr} - F_{gsr}$

$x_5 = F_{min} - F_{gmin}$
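The five parameters could be computed from an F0 track roughly as follows. This is a sketch under several assumptions: each difference is taken as the first-named quantity minus the second, the "extreme value" is read as the sample furthest from the start value, and the function and argument names are invented for the example.

```python
def accent_parameters(f0_track, utterance_mean, voice_mean,
                      utterance_min, voice_mean_min):
    """Return the five difference parameters for one accent as a dict."""
    start, end = f0_track[0], f0_track[-1]
    f0_max, f0_min = max(f0_track), min(f0_track)
    # Assumption: the extreme is the sample furthest from the start value.
    extreme = max(f0_track, key=lambda f0: abs(f0 - start))
    return {
        "x1": start - extreme,                 # start vs. extreme value
        "x2": extreme - end,                   # extreme value vs. end point
        "x3": f0_max - f0_min,                 # F0 range
        "x4": utterance_mean - voice_mean,     # utterance mean vs. voice mean
        "x5": utterance_min - voice_mean_min,  # utterance min vs. voice mean min
    }


# Toy rise-fall track and hypothetical utterance / voice statistics.
toy_track = [180.0, 230.0, 150.0]
params = accent_parameters(
    toy_track,
    utterance_mean=200.0,
    voice_mean=190.0,
    utterance_min=150.0,
    voice_mean_min=140.0,
)
```

Expressing every parameter as a difference removes most of the dependence on a speaker's absolute pitch level, which suits a multi-speaker database.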

Clusters description

Accent label   HAC cluster 1   HAC cluster 2   HAC cluster 3   Total
HH                         3              30               7      40
HL                         5              15             108     128
HM                         5              51              37      93
HxH                        0               6               2       8
LH                       149              18               2     169
LL                        58               2               3      63
LM                        53               9               2      64
LxL                       81               3               3      87
MH                        60              65               9     134
ML                        88              24              48     160
MM                        42              54              14     110
Total                    544             277             235    1056

Clusters characteristics

New results – Accent placement prediction

Train data: Accent 89/103, 86.40%

Test data: Accent 88/97, 90.70%

New results – Accent type prediction

Train data

Accents   Correct/Total   Accuracy   Previous
H         13/24           54.20%     17.20%
L         49/55           89.10%     74.00%
M         14/30           46.70%     38.00%

Test data

Accents   Correct/Total   Accuracy   Previous
H         12/24           50.00%     29.00%
L         42/44           95.50%     63.00%
M         13/29           44.80%     19.00%

Evaluation

Self-organising maps: a potential method for categorisation

The results are relatively successful and consistent

Data pre-processing is the most critical phase

The automatic training phase requires solid and consistent (manual) preparation

Need for better data

Based on problems encountered

Further analysis of clusters

A large amount of data from a single speaker (primary need)

A large amount of prosodic variation

A balanced set of pitch events

Clear speech which can be easily tracked

Complex prosodic structure

Suggested improvements

Model modification

More data, e.g. Peak Alignment study

Separate models for different sentence types (yes/no questions, statements)

Re-estimation of parameters based on new intonationally rich data

Next

Closer inspection of automatically assigned accent classes (clusters)

Evaluation: perception experiments

The End
