Modelling Prosody for Speech Synthesis: example from Polish
Dominika Oliver
IGK Colloquium, 22 July 2004
04/18/23 2
Outline
• Goal: prosodic modelling for TTS
• Review of past studies: intonational investigations
• Current state: latest modelling results
TTS Cycle
• Text Input (raw or annotated)
• Text Processing
  • Text Normalisation: names, abbreviations, numbers
  • Linguistic Analysis: morphology, syntax, semantics
• Phonetic Analysis
  • Grapheme-to-Phoneme Conversion: rules, dictionary
• Prosodic Analysis
  • Pitch, Phrasing & Duration Modelling
• Speech Synthesis
  • Voice Rendering
TTS Cycle
Prosodic analysis/modelling
• Prosodic components (focus, stress, duration etc.)
• Prosodic phrasing
• Intonation: accent types, pitch contour
Overview
• Procedure
• Resources
• Modelling techniques
• Modelling prosody
• Problems & solutions
• Suggested improvements
Procedure
Prosodic modelling shopping list:
• Language-specific intonation description
• Accent type and placement prediction & F0 generation methods
• Research and evaluation tool (Festival)
Language-specific intonation description
• Quantitative analysis of Polish intonation (accent types)
• Standard description of Polish intonation (Jassem, 1961, 1984; Demenko, 1999)
  • Falling: HL, HM, ML, xL
  • Rising: LM, MH, LH
  • Level: MM
  • Rise-fall: LHL
• Broad-Narrow Focus/Peak alignment study (Andreeva and Oliver, 2003)
Accent types
Falling
Accent types
Rising
Resources
Speech corpora: PoInt (Polish Intonation Database) (Karpiński, 2001)
• 350 MB, multi-speaker (~40)
• Read and (semi-)spontaneous speech
• Transcribed
  • Syllable-based IPA segmental transcription
  • Syllable-based prosodic annotation
Resources
PoInt prosodic transcription
• Tone heights: xH, H, M, L, xL
• Phrase boundary indication
Resources
Falling
[Figure: waveform and F0 contour of a falling example (time axis 0 to 3.48 s), with syllable-level transcription and tone labels "H L |"]
Resources
Rising
[Figure: waveform and F0 contour of a rising example (time axis 0 to 0.71 s), with syllable-level transcription and tone labels "L H |"]
Resources
Festival TTS (Black & Taylor, 1998)
• A general multi-lingual speech synthesis system
• Offers a full text-to-speech system
• Environment for development and research of speech synthesis techniques
Modelling techniques
• Default prosodic assignment from simple text analysis
• Hand-built rule-based system: hard to modify and adapt to new domains
• Corpus-based approaches (Sproat et al. ’92): train prosodic variation on large labelled corpora using machine learning techniques
Modelling techniques – accent type/placement prediction
Classification and regression trees (CART) (Breiman, Friedman, Olshen & Stone 1984, 1993)
In speech synthesis widely used to model:
• segment durations (e.g. Riley 1992)
• accent prediction (Syrdal, Hirschberg, McGory, Beckman 2001)
• pitch contour generation (Dusterhoff 1997; Dusterhoff, Black, Taylor 1999)
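To make the CART idea concrete, here is a minimal sketch of the split search that such a tree repeats at every node: try each question of the form "feature == value?" and keep the one with the lowest weighted Gini impurity. The feature names (`stressed`, `pos`) and the six toy syllables are hypothetical illustrations, not PoInt data, and a full tree builder would apply this step recursively with a stopping criterion.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, value, impurity) for the binary question
    'feature == value?' with the lowest weighted Gini impurity."""
    best = None
    for feat in rows[0]:
        for val in sorted({r[feat] for r in rows}):
            yes = [l for r, l in zip(rows, labels) if r[feat] == val]
            no = [l for r, l in zip(rows, labels) if r[feat] != val]
            if not yes or not no:
                continue  # question does not separate anything
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if best is None or score < best[2]:
                best = (feat, val, score)
    return best

# Hypothetical toy data: one feature dict per syllable.
rows = [
    {"stressed": 1, "pos": "noun"},
    {"stressed": 0, "pos": "noun"},
    {"stressed": 1, "pos": "verb"},
    {"stressed": 0, "pos": "verb"},
    {"stressed": 0, "pos": "func"},
    {"stressed": 1, "pos": "func"},
]
labels = ["H", "NONE", "L", "NONE", "NONE", "NONE"]

split = best_split(rows, labels)
print(split)
```

On this toy set the stress question wins, because all unstressed syllables are unaccented and form a pure branch.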
Modelling techniques - F0 prediction
Linear regression (Black & Hunt, 1996)
• Used e.g. for F0 contour prediction/generation
• Find the appropriate F0 target per syllable based on available features, trained from data
• Predicted variable (p) can be modelled as a sum of a set of weighted real-valued factors:
  $p = w_0 + w_1 f_1 + w_2 f_2 + w_3 f_3 + \dots + w_n f_n$
• Factors ($f_i$): parameterised properties of the data
• Weights ($w_i$): trained, usually using a stepwise least squares technique
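As an illustration of the weighted-sum model, the sketch below fits the weights by ordinary least squares and then predicts an F0 target. It is a simplification under stated assumptions: all factors are fitted at once rather than stepwise, the two binary factors (accented, stressed) are hypothetical, and the target values are made up so that the true weights are known.

```python
def fit_least_squares(features, targets):
    """Fit w0..wn for p = w0 + w1*f1 + ... + wn*fn by solving the normal
    equations (X^T X) w = X^T y with Gauss-Jordan elimination."""
    rows = [[1.0] + list(f) for f in features]  # leading 1 gives the intercept w0
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, targets))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def predict(weights, factors):
    """Weighted sum of factors plus intercept."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], factors))

# Hypothetical factors per syllable: (accented, stressed); made-up F0 targets.
features = [(0, 0), (1, 0), (0, 1), (1, 1)]
targets = [100.0, 130.0, 110.0, 140.0]

weights = fit_least_squares(features, targets)
print(weights, predict(weights, (1, 1)))
```

A stepwise procedure would add or drop factors one at a time and refit, keeping only those that improve the fit.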
Prerequisite
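F0 normalisation (Ladd, 1995; Clark, 2003)
(PoInt: 40 speakers, mixed sex)
$$f0^{n}_{i} = \frac{f0_{i} - \mu}{\sigma}$$
- where $\mu$ is the f0 mean and $\sigma$ is the f0 standard deviation of the utterance
- the rescaling uses the standard deviation $\sigma_{D}$ and mean $\mu_{D}$ of the f0 of the database:
$$f0_{i} = f0^{n}_{i}\,\sigma_{D} + \mu_{D}$$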
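The normalisation and rescaling steps on this slide can be sketched as a z-score transform followed by its inverse with database-level statistics. The function names, the use of the population standard deviation, and the numeric values are assumptions for illustration only.

```python
from statistics import mean, pstdev

def normalise(f0_values):
    """Z-score each f0 value using the utterance's own mean and standard deviation.
    Assumes the utterance is not perfectly flat (standard deviation > 0)."""
    mu, sigma = mean(f0_values), pstdev(f0_values)
    return [(f - mu) / sigma for f in f0_values]

def rescale(normalised, db_mean, db_std):
    """Map normalised values back to an f0 scale using database-level statistics."""
    return [z * db_std + db_mean for z in normalised]

# Hypothetical utterance and database statistics.
utterance_f0 = [180.0, 200.0, 220.0]
z = normalise(utterance_f0)
rescaled = rescale(z, db_mean=150.0, db_std=20.0)
print(z, rescaled)
```

After normalisation every utterance has mean 0 and standard deviation 1, which is what makes contours from male and female speakers comparable before a single model is trained.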
Modelling
Steps
• Building the utterance structure of the database speech files
• Incorporating database intonation labelling
• Extracting features for accent prediction and F0 generation
• Building CART model: PoInt intonation labels
• Building LR model: 3 points per syllable
• Incorporating model parameters into voice description
Modelling - accent type/placement prediction
• Model based on PoInt: multiple speakers (male, female)
• Accent inventory: L, H, M
• Accent prediction method: CART
• Features (31):
  • POS window
  • Position of candidate syllable in word and sentence
  • Stress information window etc.
Results – accent prediction
Train set (total 963, correct 897, 93.146%)
Accents  NONE  H  L   M   Total    Accuracy
NONE     839   0  3   0   839/842  99.60%
H        15    5  5   4   5/29     17.20%
L        7     1  37  5   37/50    74%
M        13    1  12  16  16/42    38%

Test set (total 1070, correct 996, 93.084%)
Accents  NONE  H  L   M   Total    Accuracy
NONE     953   3  6   4   953/966  98.70%
H        11    8  2   6   8/27     29.63%
L        4     1  29  12  29/46    63.04%
M        11    3  11  6   6/31     19.40%
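The per-class and overall figures can be recomputed from the confusion counts. The sketch below uses the train-set counts from this slide (the matrix whose diagonal sums to 897 of 963); reading rows as reference labels and columns as predicted labels is an assumption, since the slide does not state the orientation.

```python
LABELS = ["NONE", "H", "L", "M"]

# Train-set counts from the slide; rows assumed to be reference labels,
# columns assumed to be predicted labels.
train = [
    [839, 0, 3, 0],
    [15, 5, 5, 4],
    [7, 1, 37, 5],
    [13, 1, 12, 16],
]

def per_class_accuracy(matrix):
    """Fraction of each row that falls on the diagonal."""
    return {lab: row[i] / sum(row)
            for i, (lab, row) in enumerate(zip(LABELS, matrix))}

def overall_accuracy(matrix):
    """Return (correct, total, accuracy) over the whole matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total

correct, total, acc = overall_accuracy(train)
by_class = per_class_accuracy(train)
print(correct, total, round(100 * acc, 3), by_class)
```

The recomputation shows why the overall figure is high despite weak H and M scores: 842 of the 963 tokens are NONE, so that class dominates the total.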
Modelling - F0 prediction/generation
F0 generation: Linear regression
Features
• Accent type
• POS window
• Position of candidate syllable in word and sentence
• Stress information window etc.
Results – F0 shape prediction
          Train                Test
Position  RMSE   Correlation   RMSE   Correlation
start     48.56  0.46          50.17  0.40
mid       55.87  0.49          59.13  0.49
end       58.99  0.45          54.50  0.48
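For readers unfamiliar with the two measures reported here, the following sketch computes RMSE and Pearson correlation between predicted and observed F0 targets. The four value pairs are invented for illustration and are unrelated to the reported results.

```python
from math import sqrt

def rmse(predicted, observed):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical predicted and observed F0 targets.
predicted = [110.0, 110.0, 150.0, 150.0]
observed = [100.0, 120.0, 140.0, 160.0]

error = rmse(predicted, observed)
r = pearson(predicted, observed)
print(error, r)
```

RMSE measures how far the predicted targets are from the observed ones in absolute terms, while the correlation measures whether the predicted contour moves in the same direction as the observed one.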
Potential problems
• Data
  • Not enough tokens to learn from
  • Annotation inconsistencies (noisy data, messy accent class assignment)
• Inappropriate technique / suboptimal feature set
Potential data problems
PoInt Analysis
Peak alignment
Addressing data issues
F0 tracking errors
Identifying outliers / annotation inconsistencies
Re-classifying accent types
When everything else fails – blame it on the data
Labelling errors
• Unmarked disfluencies/wrong reading
• Phonemic labelling
• Missing phrasing
• No indication of sentence mode in annotation
Inconsistent labelling
• Misleading transcription description
• No independent labellers
Data fixes
Automatically identifying outliers / annotation inconsistencies
• Statistical analysis of acoustic parameters
Manual data inspection
• Insertion of phrase boundaries
• Marking of disfluencies
• Aligning speech with text
• Deriving Gold Standard (hard)
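One simple form the automatic outlier check could take is sketched below: flag every token whose acoustic parameter lies more than a chosen number of standard deviations from the mean of its own label class. The slides do not say which statistic was used, so the z-score criterion, the threshold of 2.0 and the toy values are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def flag_outliers(tokens, threshold=2.0):
    """tokens: list of (label, value) pairs. Returns the indices of tokens whose
    value is more than `threshold` standard deviations from its class mean."""
    by_label = defaultdict(list)
    for label, value in tokens:
        by_label[label].append(value)
    stats = {lab: (mean(v), pstdev(v)) for lab, v in by_label.items()}
    flagged = []
    for i, (label, value) in enumerate(tokens):
        mu, sigma = stats[label]
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical F0 movement (in arbitrary units) per labelled accent.
tokens = [("H", v) for v in (40, 42, 38, 41, 39, 40, 41, 39, -30)]
tokens += [("L", v) for v in (-40, -42, -38)]

suspects = flag_outliers(tokens)
print(suspects)
```

Here the ninth H token is flagged because it falls while every other H token rises, which is the kind of candidate a labeller would then inspect by hand.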
Accent classification studies
• Hierarchical clustering (Klabbers & van Santen 2004)
• Linear regression (Keller & Zellner Keller, 2003)
• EM bagging & boosting (Sun, 2002)
• HMMs
(Kumpf, King 2004) (Blackburn, Vonwiller, and King, 1993) (Batliner et al. 1999, 2001) (Maragoudakis 2003, Zervas 2004) (Chan, Feng, Heinen, and Niederjohn 1994)
Accent type re-classification
Two-stage procedure
1. Self-organising maps (Kohonen 1982, 1995) (Kaski, 1997) (Vesanto & Alhoniemi, 2000)
   • Create a set of prototype vectors representative of the data
   • Project the prototypes onto a low-dimensional space
2. Hierarchical agglomerative clustering (HAC)
   • Method for finding good candidates for map unit clusters: cut the dendrogram where there is a large distance between two clusters
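The second stage can be illustrated with a small agglomerative clustering sketch that cuts the dendrogram at the largest jump in merge distance. It covers only the HAC step: the six 2-D points are hypothetical stand-ins for SOM prototype vectors, and single linkage with Euclidean distance is an assumption, since the slides do not name the linkage criterion.

```python
from math import dist

def agglomerate(points):
    """Single-linkage HAC. Returns one (merge_distance, clusters_after_merge)
    entry per merge, in the order the merges happen."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append((d, [sorted(c) for c in clusters]))
    return history

def cut_at_largest_gap(points):
    """Keep all merges up to the largest jump in merge distance.
    Needs at least three points so that a jump can be measured."""
    history = agglomerate(points)
    dists = [d for d, _ in history]
    gaps = [dists[k + 1] - dists[k] for k in range(len(dists) - 1)]
    last_kept = gaps.index(max(gaps))
    return sorted(history[last_kept][1])

# Hypothetical prototype vectors (stand-ins for SOM map units).
prototypes = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
groups = cut_at_largest_gap(prototypes)
print(groups)
```

The cut leaves three groups of prototype indices, because merging any of them further would require bridging a much larger distance than the merges made so far.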
Acoustic data parameterisation
Accent type classification (Demenko, 1999):
1. Difference between start F0 (first vowel) and F0 extreme value (on a vowel or consonant): $x_1 = F_{p} - F_{ev}$
2. Difference between F0 extreme value and end point F0: $x_2 = F_{e} - F_{k}$
3. Difference between F0 max and F0 min: $x_3 = F_{max} - F_{min}$
4. Difference between utterance mean F0 and mean F0 for all utterances by the same voice: $x_4 = F_{sr} - F_{gsr}$
5. Difference between utterance min F0 and global mean min F0 for the same voice: $x_5 = F_{min} - F_{gmin}$
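A minimal sketch of how the five parameters could be computed from an F0 contour follows. Several points are assumptions rather than the published definition: each difference is taken as first-named minus second-named quantity, the "extreme value" is taken to be the sample that deviates most from the start value, one contour stands in for both the accent and the utterance, and the per-voice statistics are passed in as hypothetical arguments.

```python
def accent_features(f0, voice_mean_f0, voice_mean_min_f0):
    """Compute x1..x5 from a list of F0 samples.

    voice_mean_f0:     mean F0 over all utterances by the same voice (assumed given)
    voice_mean_min_f0: mean of utterance-level F0 minima for that voice (assumed given)
    """
    start, end = f0[0], f0[-1]
    # Assumption: the extreme value is the sample farthest from the start value.
    extreme = max(f0, key=lambda v: abs(v - start))
    return {
        "x1": start - extreme,                    # start vs extreme value
        "x2": extreme - end,                      # extreme value vs end point
        "x3": max(f0) - min(f0),                  # F0 range
        "x4": sum(f0) / len(f0) - voice_mean_f0,  # utterance mean vs voice mean
        "x5": min(f0) - voice_mean_min_f0,        # utterance min vs voice mean min
    }

# Hypothetical contour: rise from 200 to a 240 peak, then fall to 180.
features = accent_features([200.0, 240.0, 180.0],
                           voice_mean_f0=190.0,
                           voice_mean_min_f0=150.0)
print(features)
```

Vectors of this kind, one per accent token, are what a clustering stage would take as input.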
Accent type re-classification
Clusters description
               HAC cluster
Accent label   1     2     3     Total
HH             3     30    7     40
HL             5     15    108   128
HM             5     51    37    93
HxH            -     6     2     8
LH             149   18    2     169
LL             58    2     3     63
LM             53    9     2     64
LxL            81    3     3     87
MH             60    65    9     134
ML             88    24    48    160
MM             42    54    14    110
Total          544   277   235   1056
Accent type re-classification
Clusters characteristics
New results – Accent placement prediction
Train data: Accent 89/103 86.40%
Test data: Accent 88/97 90.70%
New results – Accent type prediction
Train data
Accents  Total  Accuracy  Previous
H        13/24  54.20%    17.20%
L        49/55  89.10%    74.00%
M        14/30  46.70%    38.00%

Test data
Accents  Total  Accuracy  Previous
H        12/24  50.00%    29.00%
L        42/44  95.50%    63.00%
M        13/29  44.80%    19.00%
Evaluation
• Self-organising maps: a potential method for categorisation
• The results are relatively successful and consistent
• Data pre-processing is the most critical phase
• The automatic training phase requires solid and consistent (manual) preparation
Need for better data
Based on problems encountered
• Further analysis of clusters
• A large amount of data from a single speaker (primary need)
• A large amount of prosodic variation
• A balanced set of pitch events
• Clear speech which can be easily tracked
• Complex prosodic structure
Suggested improvements
Model modification
• More data, e.g. Peak Alignment study
• Separate models for different sentence types (yes/no questions, statements)
• Re-estimation of parameters based on new intonationally rich data
Next
Closer inspection of automatically assigned accent classes (clusters)
Evaluation: perception experiments
The End