sub-project i prosody, tones and text-to-speech synthesis
Post on 14-Jan-2016
Sub-Project I: Prosody, Tones and Text-To-Speech Synthesis
Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang
Outline
Members
Theme of Sub-project I
Research Roadmap
Current Achievements
Research Infrastructure
Future Direction
Members
Sin-Horng Chen, Professor (PI), NCTU
Yih-Ru Wang, Associate Professor (Co-PI), NCTU
Lin-shan Lee, Professor, NTU
Chiu-yu Tseng, Professor & Research Fellow (Co-PI), Academia Sinica
Yuan-Fu Liao, Assistant Professor (Co-PI), NTUT
Hsin-min Wang, Associate Research Fellow, Academia Sinica
[Figure: eigen-prosody space plot, Dimension 1 (0 to 0.12) vs. Dimension 2 (-0.2 to 0.2), with keyword and speaker points; regions labeled fast speakers, slow speakers, more breaks, less breaks]
Theme of Sub-Project I
Prosody Analysis and Modeling
Tone Behavior and Modeling
Applications in Speech/Speaker Recognition
Applications in Text-to-speech Synthesis
Latent factor-based pitch contour model (mean model and shape model)
Tone sandhi
Hierarchical modeling of fluent prosody
High performance TTS
Speaker recognition
Prosodic model-based tone recognizer
Research Focus
How to analyze and model fluent speech prosody
– Approach 1: Hierarchical modeling of fluent speech prosody
• Develop a hierarchical prosody framework of fluent speech
• Construct modular acoustic models for: (1) F0 contours, (2) duration patterns, (3) intensity distribution and (4) boundary breaks
– Approach 2: Latent factor analysis-based modeling
• Assume there are some latent affecting factors
• Latent factor analysis for syllable duration, pitch contour, energy and inter-syllable coarticulation
• Explore the relation between latent factors and syntactic information
How to integrate these two approaches and apply them to
– Text-to-speech synthesis
– Speech/tone/speaker recognition
Research Roadmap
Current Achievements / Future Direction
• Automatic prosodic labeling
• Prosodic phrase analysis
• High performance TTS: Mandarin, Min-south, Hakka
• Eigen prosody analysis-based speaker recognition
• RNN/VQ-based prosodic modeling
• COSPRO corpus/Toolkits
• Hierarchical modeling of fluent speech prosody
• Corpus-based TTS
• Model-based TTS
• Language model + pause, PM
• Tone modeling and recognition, MLP/RNN
• HMM
• Model-based tone recognizer
• Prosodic model-based speaker recognition
• Prosodic cues-dependent LM
• Latent factor analysis: duration, pitch mean, shape, inter-syllable coarticulation
• Investigation in relation to prosody organization: F0 range and reset, naturalness and measurement, voice quality
Hierarchical Prosody Framework of Fluent Speech (1/4)
Hierarchical framework of fluent speech prosody for multi-phrase speech paragraphs
– Hierarchical cross-phrase patterns and contributions are found in all 4 acoustic dimensions.
– Acoustic templates are derived for each prosody level
• F0 template
• Syllable duration templates and temporal allocation patterns
• Intensity distribution patterns
• Boundary break patterns
Hierarchical Prosody Framework of Fluent Speech (2/4)
The Prosody Hierarchy with Prosodic Boundaries
[Figure: prosody hierarchy tree. Levels from top to bottom: Prosodic Group (boundary B5), Breath Group (boundary B4), Prosodic Phrase in initial, middle, and final positions (boundary B3), Prosodic Word (PW, boundary B2)]
Hierarchical Prosody Framework of Fluent Speech (3/4)
F0 cadence of multi-phrase PG (Prosodic Phrase Group): Tide over Wave and Ripple
Syllable duration cadence of multi-phrase PG
[Figure: syllable duration cadence panels for the PW level and the PPh level (PG-initial PPh, PG-medial PPh, PG-final PPh); vertical axis from -1.2 to 1.6, horizontal axis from 1 to 11]
Hierarchical Prosody Framework of Fluent Speech (4/4)
Cross-speaker synthesis: manipulating Speaker A’s duration parameters with Speaker B’s
[Figure: two panel groups, "Duration Re-synthesis, F054C" and "F0 Re-synthesis, F054C", comparing original and modified speech; bars from 0 to 350 at initial, medial (middle), and final positions; the F0 panels are in Hz]
Speaker statistics (Standard / Average / Speech Rate):
F051P: 65.3287 / 199.565 / 236.18
F01S: 72.3278 / 289.832 / 362.346
Latent Factor Analysis-based Prosody Modeling (1/3)
Syllable Duration Model
– Multiplicative model: $Z_n = X_n \, \gamma_{t_n} \gamma_{y_n} \gamma_{j_n} \gamma_{l_n} \gamma_{s_n}$
– Additive model: $Z_n = X_n + \gamma_{t_n} + \gamma_{y_n} + \gamma_{j_n} + \gamma_{l_n} + \gamma_{s_n}$
(each $\gamma$ is a companding factor, CF)
mean: 42.3 frames, 43.9 frames
variance: 180 frame$^2$, 2.52 frame$^2$
RMSE: 1.93 frames (5 ms/frame)
Relations between Prosodic State CFs of Initial/Final and Syllable Duration Models
[Figure: companding factor of initial and final vs. companding factor of syllable; both axes from 0.4 to 2]
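The multiplicative duration model named on this slide can be illustrated with a minimal sketch. It assumes the companding factors are already known, whereas in practice they would be estimated from a corpus. The factor keys `t`, `y`, `j`, `l`, `s` mirror the subscripts on the slide; their values and the function names are hypothetical.

```python
from math import prod

def apply_companding(x_norm, factors):
    """Multiplicative model: observed duration = normalized duration times every CF."""
    return x_norm * prod(factors.values())

def normalize_duration(z_obs, factors):
    """Invert the multiplicative model to recover the normalized duration."""
    return z_obs / prod(factors.values())

# Hypothetical companding factors for one syllable (illustrative values only).
factors = {"t": 1.10, "y": 0.85, "j": 1.05, "l": 0.95, "s": 1.20}

x_norm = 43.9                               # normalized duration in frames
z_obs = apply_companding(x_norm, factors)   # duration predicted by the model
x_back = normalize_duration(z_obs, factors) # round trip back to x_norm
```

Dividing out the factors is what makes the normalized durations comparable across syllables; the additive variant would subtract the factors instead.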
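Latent Factor Analysis-based Prosody Modeling (2/3)
Syllable Pitch Contour Model
– Mean model: $Y_n = X_n + \beta_{t_n} + \beta_{pt_n} + \beta_{ft_n} + \beta_{i_n} + \beta_{f_n} + \beta_{p_n}$, with $Z_n$ related to $Y_n$ through speaker-dependent terms indexed by $s_n$
– Shape model: $Z_n = X_n + b_{tc_n} + b_{q_n} + b_{s_n} + b_{i_n} + b_{f_n}$
Reconstructed pitch mean
[Figure: the patterns of x-3-3; pitch period (ms), from 6 to 9.5, vs. frame, from 0 to 20; legend: 033, 133, 233, 333, 433, 533, 020, 030]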
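As a rough illustration of how an additive factor model of the pitch mean could be fitted, the sketch below assumes every affecting factor is an observed label and estimates the effects by simple backfitting (repeated averaging of residuals). This is a simplification: the latent-factor approach on the slide would also have to infer unobserved factors. The factor names `tone` and `speaker` and all values are hypothetical.

```python
from collections import defaultdict

def fit_additive(observations, n_iter=20):
    """observations: list of (value, {factor_name: level}).
    Returns the global mean and, per factor, the additive effect of each level."""
    mu = sum(v for v, _ in observations) / len(observations)
    names = sorted({name for _, levels in observations for name in levels})
    effects = {name: defaultdict(float) for name in names}
    for _ in range(n_iter):
        for name in names:
            sums, counts = defaultdict(float), defaultdict(int)
            for value, levels in observations:
                # Remove the mean and every other factor's current effect.
                other = sum(effects[o][levels[o]] for o in names if o != name)
                sums[levels[name]] += value - mu - other
                counts[levels[name]] += 1
            for level in sums:
                effects[name][level] = sums[level] / counts[level]
    return mu, effects

def residual(value, levels, mu, effects):
    """What is left of a value after removing the mean and all factor effects."""
    return value - mu - sum(effects[n][levels[n]] for n in effects)

# Toy data: pitch mean = 8.0 + tone effect + speaker effect (no noise).
data = [
    (9.5, {"tone": 1, "speaker": "A"}),
    (8.5, {"tone": 1, "speaker": "B"}),
    (7.5, {"tone": 4, "speaker": "A"}),
    (6.5, {"tone": 4, "speaker": "B"}),
]
mu, effects = fit_additive(data)
res = [residual(v, lv, mu, effects) for v, lv in data]
```

On this balanced toy set the procedure recovers the effects exactly, so the residuals vanish; with real data the residual would carry the variation that none of the factors explains.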
Latent Factor Analysis-based Prosody Modeling (3/3)
Inter-syllable coarticulation pitch contour model
The relationship of syllable pitch contours and affecting factors
Reconstructed pitch contour
Mandarin/Taiwanese TTS
Block diagram of TTS system
[Figure: input Min-Nan or Chinese text goes to the Text Analyzer, which passes the base-syllable sequence to the Acoustic Inventory and linguistic features to the RNN-based Prosody Generator; the waveform sequence and the prosodic parameters feed the PSOLA Speech Synthesizer, which outputs synthetic speech]
TTS samples
– Model-based TTS: female 1 to female 5
– Corpus-based TTS: female 1 to female 5
– Taiwanese
Tone Behavior Modeling and Recognition with Inter-Syllabic Features
Gabor-IFAS-based pitch detection
Four inter-syllabic features
– Ratio of duration of adjacent syllables
– Averaged pitch value over a syllable
– Maximum pitch difference within a syllable
– Averaged slope of the pitch contour over a syllable
Context-dependent tone behavior modeling
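The four features listed above could be computed per syllable roughly as follows. This is a sketch under stated assumptions: the duration ratio is taken as current over previous syllable, and the averaged slope as the mean frame-to-frame pitch difference; the slide does not specify either convention, and the function name is hypothetical.

```python
def inter_syllabic_features(f0, duration, prev_duration):
    """f0: pitch values (Hz) of the voiced frames of one syllable (at least 2 frames).
    duration, prev_duration: durations of this and the preceding syllable (same unit)."""
    steps = [b - a for a, b in zip(f0, f0[1:])]  # frame-to-frame pitch changes
    return {
        "duration_ratio": duration / prev_duration,  # assumed: current / previous
        "mean_pitch": sum(f0) / len(f0),
        "max_pitch_diff": max(f0) - min(f0),
        "avg_slope": sum(steps) / len(steps),        # assumed: Hz per frame
    }

# Toy syllable with a steadily rising contour.
feats = inter_syllabic_features([200.0, 210.0, 220.0, 230.0], duration=20, prev_duration=10)
```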
Eigen-Prosody Analysis-based Robust Speaker Recognition
Use latent semantic analysis (LSA) to efficiently extract useful speaker cues to resist handset mismatch from few training/test data
– Step 1: Automatic prosodic state labeling and speaker-keyword statistics
– Step 2: Eigen-prosody space construction using latent semantic analysis
[Figure: processing pipeline. Prosodic features go through VQ-based prosody modeling and prosody state labeling to give sequences of prosody states; prosody keyword parsing with a dictionary yields prosody keywords; keyword counts per speaker form the co-occurrence matrix A; eigen-prosody analysis (SVD), $A = U S V^T$, maps the high dimensional prosody space to the eigen-prosody space]
[Figure: eigen-prosody space, Dimension 1 (0 to 0.12) vs. Dimension 2 (-0.2 to 0.2), with keyword and speaker points; regions labeled fast speakers, slow speakers, more breaks, less breaks]
Experimental results on HTIMIT corpus
– Ten different handsets
– 302 speakers
– 7/3 utterances for training/test respectively
Average speaker recognition rate (%), avg / unseen handset:
MAP-GMM/CMS: 60.2 / 58.3
+GPD_S: 61.9 / 60.5
+ML-AKI: 74.9 / 69.4
+EPA: 79.3 / 74.6
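The two steps of eigen-prosody analysis can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the prosody keywords, speaker names, and helper names are hypothetical, raw counts are used without any weighting, and new data are projected with the standard LSA folding-in formula.

```python
import numpy as np

def cooccurrence_matrix(speaker_keywords, vocabulary):
    """Rows: prosody keywords; columns: speakers; entries: keyword counts."""
    index = {kw: i for i, kw in enumerate(vocabulary)}
    speakers = sorted(speaker_keywords)
    A = np.zeros((len(vocabulary), len(speakers)))
    for j, spk in enumerate(speakers):
        for kw in speaker_keywords[spk]:
            A[index[kw], j] += 1
    return A, speakers

def eigen_prosody_space(A, k):
    """Truncated SVD A ~ U_k S_k V_k^T; the columns of V_k^T locate the speakers."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

def project(counts, U_k, S_k):
    """Fold a keyword-count vector into the k-dimensional space: S_k^-1 U_k^T a."""
    return (U_k.T @ counts) / S_k

# Hypothetical prosody keywords observed for three speakers.
vocabulary = ["1-2", "2-3", "3-3"]
speaker_keywords = {
    "spk1": ["1-2", "1-2", "2-3"],
    "spk2": ["2-3", "3-3", "3-3"],
    "spk3": ["1-2", "2-3", "3-3"],
}
A, speakers = cooccurrence_matrix(speaker_keywords, vocabulary)
U_k, S_k, Vt_k = eigen_prosody_space(A, k=2)
coords = project(A[:, 0], U_k, S_k)  # coordinates of spk1 in the reduced space
```

A test utterance would be turned into a keyword-count vector in the same way and compared with the stored speaker coordinates, for example by cosine similarity.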
Research Infrastructure (1/2)
Sinica COSPRO and Toolkits: http://www.myet.com/COSPRO/
– 9 sets of Mandarin Chinese fluent speech corpora collected
– Platform developed
– Each corpus was designed to bring out different prosody features involved in fluent speech.
– Annotation processes include labeling and tagging perceived units and boundaries in fluent speech, especially the ultimate unit, the multiple-phrase speech paragraph.
– Framework constructed to bring out speech paragraphs and cross-phrase prosodic relationships characteristic of narrative or discourse organization.
COSPRO Toolkit
– Performing Acoustic Analysis Function: PointProcess, Pitch, Intensity, Formant, Spectrogram
– Re-synthesizing Speech Signals Function: Wav, F0 Block, PitchTier, DurationTier, Duration Block
– Labeling Continuous Fluent Speech Function: Adjustment, Break, Syllable, User-defined Tag, PAC
Research Infrastructure (2/2)
Tree-Bank Speech Database
– Uttered by a single female speaker
– Short paragraphs, 110,000 syllables
– Sentence-based syntactic tree annotated manually
– Pitch contour and syllable segmentation corrected manually
Future Direction (1/5)
Automatic prosodic labeling of Mandarin speech corpus
Analysis of prosodic phrase structure
Model-based tone recognition
High performance TTS
Speech recognition/language modeling using prosodic cues
Prosodic modeling-based robust speaker recognition
Future Direction (2/5)
Automatic prosodic labeling of Mandarin speech corpus
– Goal: To construct a prosody-syntax model by exploiting the relationship between prosodic features and linguistic features, and to use it for automatic labeling of various acoustic cues:
• Prosodic phrase boundary detection
• Inter-syllable/inter-word coarticulation classification
• Full/half/sandhi tone labeling for Tone 3
• Syllable pronunciation clustering
• Homograph determination
• The grouping of monosyllabic words with their neighboring words
Future Direction (3/5)
Analysis of prosodic phrase structure
– 4-level prosody hierarchy: PW, PPh, BG, PG
– Issues to be studied
• Detection and classification of prosodic phrases
• Relation between syntactic phrase structure and prosodic phrase structure
• Other affecting factors: speaking rate, speaking style, emotion type, spontaneity of speech
Model-based tone recognition
– Current approach
• Acoustic feature normalization
• Context-dependent tone modeling
– Main idea: Use the above statistics-based prosody models to compensate for the effects of various affecting factors on syllable pitch contour, duration, and energy contour
Future Direction (4/5)
High performance TTS
– Applying the sophisticated prosody models
• Modular model of fluent speech prosody
• Latent factor analysis-based modeling
– Main idea: with important prosodic cues properly labeled, the search for an optimal synthesis unit sequence in a large database can be more efficient.
• Consider both linguistic information and acoustic cues
• Give special treatment to monosyllabic words
– Use the above prosody-syntax models to assist in the generation of prosodic information
Future Direction (5/5)
Speech recognition/language modeling using prosodic cues
– Automatic prosodic state labeling
– Prosodic state-dependent acoustic modeling
– Prosodic state-dependent language modeling
Prosodic modeling-based robust speaker recognition
– Automatic prosodic cue labeling
– N-gram language model to learn the prosodic behavior of speakers
– Applying principal component analysis (PCA) to N-gram to find a compact prosodic speaker space
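One way the N-gram plus PCA idea might look in code is sketched below, assuming each speaker is represented by a sequence of integer prosodic-state labels. The choice of bigrams, the relative-frequency normalization, the state sequences, and the function names are all illustrative assumptions; PCA is computed through the SVD of the mean-centered feature matrix.

```python
import numpy as np

def bigram_vector(states, n_states):
    """Relative frequencies of adjacent prosodic-state pairs, flattened to one vector."""
    counts = np.zeros(n_states * n_states)
    for a, b in zip(states, states[1:]):
        counts[a * n_states + b] += 1
    total = counts.sum()
    return counts / total if total else counts

def pca_project(X, k):
    """Project rows of X onto the first k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each bigram dimension
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Hypothetical prosodic-state sequences (2 states) for three speakers.
sequences = [[0, 1, 0, 1], [0, 0, 0, 1], [1, 1, 0, 1]]
X = np.vstack([bigram_vector(s, n_states=2) for s in sequences])
speaker_space = pca_project(X, k=2)  # one 2-D point per speaker
```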