LATENT PROSODY MODELS OF CONTINUOUS MANDARIN SPEECH
Speech Lab., CM, NCTU
Chen Yu Chiang
2007/2/8
Outline
• Introduction
• Base Latent Prosody Models (LPM)
  • A Statistical Syllable Duration Model
  • A Statistical Syllable Pitch Contour Model
• Automatic Prosody Labeling based on LPM
• Summary
Introduction (1/11)
What is Prosody?
• Prosody is an inherent supra-segmental feature of human speech.
• It carries the stress, intonation patterns and timing structures of continuous speech, which determine the naturalness and understandability of an utterance.
Introduction (2/11)
From the listener's point of view, prosody consists of the systematic perception and recovery of a speaker's intentions based on:
• Pause: to indicate phrases and avoid running out of air.
• Pitch: rate of vocal-fold cycling (fundamental frequency, or F0) as a function of time.
• Rate/relative duration: phoneme durations, timing, and rhythm.
• Loudness (energy): relative amplitude/volume.
For simplicity, we may say: falling, rising, pausing, turning, light, heavy, slow, fast (抑, 揚, 頓, 挫, 輕, 重, 緩, 急).
Introduction (3/11)
The affecting factors of prosody
• Linguistic: lexical, syntactic, semantic, pragmatic
• Para-linguistic: intentional, attitudinal, stylistic
• Non-linguistic: physical, emotional
Introduction (4/11)
Issues concerned in prosody modeling
• Labeling of important prosodic cues
• Construction of the prosody hierarchy
• Modeling of the syntax-prosody relationship
• Prediction of prosodic phrase boundaries (breaks) from text, etc.
Applications
• Automatic Speech Recognition (ASR): important prosodic cues can be extracted from the input utterance to assist in both acoustic and linguistic decoding.
• Text-to-Speech (TTS): a good prosody model can be used to generate appropriate prosodic features from the input text.
Introduction (5/11)
Important characteristics of Mandarin Chinese
• A tonal language (four lexical tones, one neutral tone)
• The tonality of a monosyllable is mainly characterized by the shape of its fundamental frequency (F0) contour
• A syllable-based language (411 base-syllables)
Introduction (6/11)
• Syllable duration is also strongly affected by the phonetic structure of base-syllables.
• Generally speaking, syllable duration increases as the number of constituent phonemes increases.
• For example:
  • Syllables with single vowels are the shortest.
  • Syllables with stop initials or no initials, and without nasal endings, are pronounced shorter.
  • Syllables with fricative initials and with nasal endings are longer.
Introduction (7/11)
• Standard tone pattern
• Effect of context and intonation
Introduction (8/11)
Because Mandarin is a tonal language, Mandarin speech shows a tight interaction among the four lexical tones, the neutral tone, the base-syllable types and the underlying speech prosody/intonation.
Introduction (9/11)
• To find the underlying prosody/intonation structure, we propose the Latent Prosody Models (LPM).
• LPM considers several Companding Factors (CFs), or affecting factors, on syllable pitch contour and syllable duration, including tone, initial-final type, base-syllable type and prosodic state.
• The prosodic state (treated as a latent variable) is conceptually defined as the state of a syllable in a prosodic phrase and is used as a substitute for high-level linguistic information, such as a word, phrase or syntactic boundary.
• Use of an unlabeled database
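Introduction (10/11)
LPMs are formulated on the assumption that all affecting factors are combined additively or multiplicatively:
$$Z_n = X_n + \gamma_{t_n} + \gamma_{y_n} + \gamma_{j_n} + \gamma_{l_n} + \gamma_{s_n}$$
$$Z_n = X_n \, \gamma_{t_n} \gamma_{y_n} \gamma_{j_n} \gamma_{l_n} \gamma_{s_n}$$
• $Z_n$: observed prosodic feature vector
• $X_n$: normalized feature vector
• $\gamma$: affecting factors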
Introduction (11/11)
• The main purpose of using the prosodic state in place of conventional high-level linguistic information is to separate the effects of low-level and high-level linguistic features on speech.
• This modeling approach avoids some unsolved problems, such as the inconsistency between prosodic and syntactic structures and the ambiguity of word segmentation and word chunking in Mandarin Chinese.
• Hence, based on the LPM, the proposed prosody labeling model can focus on the global effect of mapping high-level linguistic features to the prosodic state and break indices, since the interference caused by low-level linguistic features has already been removed by the LPM.
References
1. Sin-Horng Chen, Wen-Hsing Lai and Yih-Ru Wang, "A new duration modeling approach for Mandarin speech", IEEE Transactions on Speech and Audio Processing, vol. 11, no. 4, Jul. 2003, pp. 308-320
2. Sin-Horng Chen, Wen-Hsing Lai and Yih-Ru Wang, "A statistics-based pitch contour model for Mandarin speech", J. Acoust. Soc. Am. 117(2), Feb. 2005, pp. 908-925
3. Chen-Yu Chiang, Yih-Ru Wang and Sin-Horng Chen, "On the inter-syllable coarticulation effect of pitch modeling for Mandarin speech", INTERSPEECH-2005, pp. 3269-3272
4. Chen-Yu Chiang, Xiao-Dong Wang, Yuan-Fu Liao, Yih-Ru Wang, Sin-Horng Chen and Keikichi Hirose, "Latent prosody model of continuous Mandarin speech", ICASSP 2007
Base Latent Prosody Models (LPM)
• A Statistical Syllable Duration Model
• A Statistical Syllable Pitch Contour Model
A Statistical Syllable Duration Model
• In ASR, state duration models are constructed to assist recognition.
• In TTS, synthesis of proper duration information is essential for natural speech.
• An extension includes the modeling of initial and final durations.
• Multiplicative and additive models are compared.
The Multiplicative Duration Model
$$Z_n = X_n \, \gamma_{t_n} \gamma_{y_n} \gamma_{j_n} \gamma_{l_n} \gamma_{s_n}$$
• $Z_n$: observed duration of the nth syllable
• $X_n$: normalized duration of the nth syllable
• $\gamma_r$: CF of affecting factor $r$
• $t_n$: lexical tone of the nth syllable
• $y_n$: prosodic state of the nth syllable
• $j_n$: base-syllable of the nth syllable
• $l_n$: utterance of the nth syllable
• $s_n$: speaker of the nth syllable
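The multiplicative model scales the normalized duration by one companding factor per affecting factor, so normalizing an observed duration is a single division. The following is a minimal sketch of that relationship only; the function names and the CF values are hypothetical and are not taken from the trained model.

```python
from math import prod


def observed_duration(normalized, cfs):
    """Multiplicative model: observed duration = normalized duration * product of CFs."""
    return normalized * prod(cfs)


def normalized_duration(observed, cfs):
    """Invert the model: divide the observed duration by the product of CFs."""
    return observed / prod(cfs)


# Hypothetical CFs for one syllable, in the order:
# tone, prosodic state, base-syllable, utterance, speaker.
example_cfs = [1.02, 1.22, 0.86, 1.00, 0.95]
example_observed = 50.0  # frames (1 frame = 5 ms)

example_normalized = normalized_duration(example_observed, example_cfs)
print(f"normalized duration: {example_normalized:.2f} frames")
```

Applying `observed_duration` to the result recovers the original 50 frames, which is the sense in which the CFs "compand" the duration.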
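Training of the Model (1/2)
Expectation-Maximization (EM) algorithm
• $\lambda = \{u, v, \gamma_t, \gamma_y, \gamma_j, \gamma_l, \gamma_s\}$: the set of parameters to be estimated
• $N$: the total number of training samples
• $Y$: the total number of prosodic states
Auxiliary function in the E-step ($\lambda$: new set; $\bar{\lambda}$: old set):
$$Q(\lambda, \bar{\lambda}) = \sum_{n=1}^{N} \sum_{y_n=1}^{Y} p(y_n \mid Z_n, \bar{\lambda}) \, \log p(Z_n, y_n \mid \lambda)$$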
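Training of the Model (2/2)
Assumption: $X_n$ follows a normal distribution with mean $u$ and variance $v$, so that
$$p(Z_n, y_n \mid \lambda) = N\!\left(Z_n;\; u\,\gamma_{t_n}\gamma_{y_n}\gamma_{j_n}\gamma_{l_n}\gamma_{s_n},\; v\,\gamma_{t_n}^2\gamma_{y_n}^2\gamma_{j_n}^2\gamma_{l_n}^2\gamma_{s_n}^2\right)$$
Posterior of the prosodic state (with $\bar{\lambda}$ the old parameter set):
$$p(y_n \mid Z_n, \bar{\lambda}) = \frac{p(Z_n, y_n \mid \bar{\lambda})}{\sum_{y_n'=1}^{Y} p(Z_n, y_n' \mid \bar{\lambda})}$$
Sequential optimizations in the M-step
Assign prosodic state:
$$y_n^* = \arg\max_{y_n} \, p(y_n \mid Z_n, \lambda)$$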
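The E-step posterior and the final state assignment can be sketched for a single syllable as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are invented, the product of the tone, base-syllable, utterance and speaker CFs is passed in as one number (`other_cf`), and the parameter values in the example are hypothetical.

```python
import math


def normal_pdf(z, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((z - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def state_posteriors(z, u, v, other_cf, state_cfs):
    """Posterior over prosodic states for one observed duration z.

    Under the multiplicative model, the observed duration in state y is
    normal with mean u * c and variance v * c**2, where c = other_cf * state_cfs[y].
    """
    joint = []
    for state_cf in state_cfs:
        c = other_cf * state_cf
        joint.append(normal_pdf(z, u * c, v * c * c))
    total = sum(joint)  # assumed non-zero; a log-domain version would be safer
    return [value / total for value in joint]


def assign_state(posteriors):
    """Index of the most probable prosodic state."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)


# Hypothetical setting with three candidate states.
example_posteriors = state_posteriors(
    z=44.0, u=43.0, v=2.5, other_cf=1.0, state_cfs=[0.56, 1.00, 1.69]
)
print(assign_state(example_posteriors))
```

In a full EM loop these posteriors would weight the M-step updates of each CF in turn; only the per-syllable computation is shown here.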
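The Additive Duration Model
Model:
$$Z_n = X_n + \gamma_{t_n} + \gamma_{y_n} + \gamma_{j_n} + \gamma_{l_n} + \gamma_{s_n}$$
Auxiliary function: the E-step sum
$$\sum_{n=1}^{N} \sum_{y_n=1}^{Y} p(y_n \mid Z_n, \bar{\lambda}) \, \log p(Z_n, y_n \mid \lambda)$$
plus a constraint term over the CFs $\gamma_{t_n}, \gamma_{y_n}, \gamma_{j_n}, \gamma_{l_n}, \gamma_{s_n}$, summed over $n = 1, \dots, N$.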
Experimental Database (1/2)
MIC: a high-quality, reading-style microphone-speech database
• MIC-sent: 455 phonetically balanced sentential utterances
• MIC-para: 300 paragraphic utterances
• Training: 102,529 syllables
• Testing: 22,109 syllables
• 20 kHz sampling rate, downsampled to 8 kHz
• 1 frame = 5 ms
Experimental Database (2/2)

Data Set   Speaker    Sentence   Paragraph   Syllable
Training   Male A     1-455      1-200       34670
Training   Female B   1-455      1-50        12945
Training   Male C     1-455      1-100       20748
Training   Female D   1-455      1-200       34166
Testing    Female E   None       201-300     22109
Experimental Results (1/7)
Observed and normalized durations (units: mean in frames, variance in frame^2; 1 frame = 5 ms). Both normalized rows use 16 prosodic states.

                                        Training set         Testing set
Unit       Durations                    Mean     Variance    Mean     Variance
Syllable   Observed                     44.31    180.17      41.08    136.26
           Normalized (multiplicative)  42.34      2.52      44.77      4.44
           Normalized (additive)        43.89      2.53      43.77      3.97
Initial    Observed                     17.21     62.28      13.83     40.02
           Normalized (multiplicative)  16.63      0.74      18.36      5.92
           Normalized (additive)        17.20      0.78      17.05      1.73
Final      Observed                     31.75    117.06      30.94    104.15
           Normalized (multiplicative)  31.50      2.12      33.90      3.40
           Normalized (additive)        31.44      1.84      31.38      2.85
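As a quick reading aid for the table, the short sketch below converts a duration from frames to milliseconds and expresses the drop in variance as a fraction. It uses only the syllable-level training figures reported above (mean 44.31 frames, variance 180.17 observed versus 2.52 after multiplicative normalization); the helper names are illustrative.

```python
FRAME_MS = 5.0  # 1 frame = 5 ms, as stated for this database


def frames_to_ms(frames):
    """Convert a duration in frames to milliseconds."""
    return frames * FRAME_MS


def variance_reduction(observed_var, normalized_var):
    """Fraction of the observed variance removed by normalization."""
    return 1.0 - normalized_var / observed_var


mean_ms = frames_to_ms(44.31)
reduction = variance_reduction(180.17, 2.52)
print(f"mean syllable duration: {mean_ms:.2f} ms")
print(f"variance removed by the multiplicative model: {reduction:.1%}")
```

The same two helpers can be applied to any other row of the table, for example the testing-set or additive-model figures.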
Experimental Results (2/7)
[Figure: Histograms of observed (left) and normalized (right) syllable duration in the multiplicative model for the training set. x-axis: duration (frame); y-axis: number.]
Experimental Results (3/7)
Analyses of CFs

CFs for tones
tone   1      2      3      4      5
CF     1.00   1.02   0.99   1.03   0.84

CFs for prosodic states (mult.: multiplicative model; add.: additive model)
         syllable           initial            final
state    mult.   add.       mult.   add.       mult.   add.
0        0.56   -16.07      0.30   -11.20      0.50   -14.28
1        0.72   -12.36      0.49    -6.82      0.68   -10.24
2        0.79    -9.69      0.63    -6.22      0.75    -7.94
3        0.84    -7.71      0.71    -4.98      0.80    -6.45
4        0.89    -5.79      0.80    -3.82      0.84    -5.15
5        0.91    -4.70      0.85    -3.60      0.87    -4.24
6        0.95    -3.14      0.86    -2.92      0.91    -2.99
7        0.98    -1.94      0.89    -2.49      0.95    -1.73
8        1.00     0.00      0.96    -1.40      0.98    -0.86
9        1.02     0.12      1.00    -0.41      1.00     0.00
10       1.05     1.69      1.04     0.00      1.02     0.73
11       1.09     4.10      1.09     0.89      1.08     3.12
12       1.14     5.87      1.12     1.39      1.14     5.10
13       1.22     9.65      1.19     3.56      1.24     8.50
14       1.33    15.08      1.30     6.03      1.40    13.42
15       1.69    28.74      1.61    12.69      1.86    25.49
Experimental Results (4/7)
Examples of Prosodic State Labeling (* denotes word boundary)
用 14* 百 14 子 9 蓮 15* 、 蕾 11 絲 4 花 15* 、 姬 7 百 11 合 15* 、 龍 13 膽 15* 、 土 5 耳 9 其 11 桔 10 梗 13* 和 14* 蒜 4 香 1 藤 12* 為 4 材 15* , 以 14* 維 4 納 6 斯 13* 執 8 壺 2 的 14* 石 10 膏 13* 花 3 器 14* 烘 10 托 15* , 好 4 一 2 趟 11* 春 4 雨 14* 濛 3 濛 10 的 15* 郊 9 外 14* 田 4 野 9 風 13 光 15* 。
Experimental Results (5/7)
[Figure: Decision Tree of Base-Syllable CFs for Syllable Duration Model. The number associated with a node is the mean of the CFs belonging to the cluster. A solid line indicates a positive answer; a dashed line indicates a negative answer.]
Node questions: {b, d, g}?; Single vowel; Compound vowel; Open vowel; {f, s, sh, shi, h}; {ts, ch, chi}; Single vowel
Node means (in transcript order): 1.07, 0.86, 1.10, 0.79, 0.89, 0.83, 0.92, 1.00, 0.91, 1.22, 1.06, 1.21, 1.03, 0.96, 1.05
Experimental Results (6/7)
[Figure: Decision Tree of Base-Syllable CFs for Initial Duration Model]
Node questions: Null initial; {b, d, g}; {ts, ch, chi}; Single vowel; {f, s, sh, shi, h}; Vowel begins with {i}; Single vowel; {p, t, k}; Vowel begins with {i}; With medial; Vowel begins with {u}
Node means (in transcript order): 0.79, 0, 0.87, 0.95, 0.89, 0.76, 0.37, 1.29, 1.42, 1.25, 0.42, 0.35, 1.32, 1.18, 0.91, 1.21, 0.70, 1.00, 0.89, 1.14, 1.22, 1.29, 1.17
Experimental Results (7/7)
[Figure: Decision Tree of Base-Syllable CFs for Final Duration Model]
Node questions: Null initial; Single vowel; With medial; Vowel begins with {i}; With medial; {m, n, l, r}; {m, n, l, r}; Compound vowel; {b, d, g}; {ts, ch, chi}
Node means (in transcript order): 1.07, 1.37, 1.04, 1.08, 1.06, 1.33, 0.96, 1.40, 1.47, 1.35, 0.91, 1.15, 0.83, 1.02, 0.94, 1.01, 1.08, 1.15, 0.94, 1.07, 1.02
A Statistical Syllable Pitch Contour Model (1/7)
• Mandarin is a tonal language. Tonal information is carried by the pitch contour.
• Pitch contour patterns in continuous speech vary widely and can deviate dramatically from their canonical forms.
• An utterance's pitch contour is separated into a global-trend pitch mean model and a locally varying shape model.
• A quantitative description of the coarticulation effect is given.
A Statistical Syllable Pitch Contour Model (2/7)
Gaussian normalization
$$\tilde{f}(t) = \frac{f(t) - \mu_k}{\sigma_k} \, \sigma_{all} + \mu_{all}$$
• $f(t)$: original pitch period of frame t
• $\tilde{f}(t)$: normalized pitch period of frame t
• $\mu_k$: mean of speaker k
• $\sigma_k$: standard deviation of speaker k
• $\mu_{all}$: averaged mean of all speakers
• $\sigma_{all}$: averaged standard deviation of all speakers
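Gaussian normalization of this kind is usually a z-score against the speaker's own statistics, followed by rescaling to the all-speaker statistics. The sketch below assumes exactly that form; the function name and the numeric values are hypothetical.

```python
def normalize_pitch_period(f, mu_k, sigma_k, mu_all, sigma_all):
    """Map a pitch period from speaker k's distribution to the all-speaker one.

    Assumed form: z-score with the speaker's mean and standard deviation,
    then rescale with the averaged mean and standard deviation of all speakers.
    """
    return (f - mu_k) / sigma_k * sigma_all + mu_all


# Hypothetical statistics (pitch period in ms).
speaker_mean, speaker_std = 8.0, 2.0
all_mean, all_std = 6.0, 1.5

print(normalize_pitch_period(10.0, speaker_mean, speaker_std, all_mean, all_std))
```

A frame at the speaker's own mean maps to the all-speaker mean, and a frame one speaker standard deviation above the mean maps to one all-speaker standard deviation above it.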
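A Statistical Syllable Pitch Contour Model (3/7)
Discrete orthogonal polynomial basis functions (discrete Legendre polynomials), for $0 \le i \le M$ and $M \ge 3$:
$$\phi_0\!\left(\tfrac{i}{M}\right) = 1$$
$$\phi_1\!\left(\tfrac{i}{M}\right) = \left[\frac{12M}{M+2}\right]^{1/2} \left[\frac{i}{M} - \frac{1}{2}\right]$$
$$\phi_2\!\left(\tfrac{i}{M}\right) = \left[\frac{180M^3}{(M-1)(M+2)(M+3)}\right]^{1/2} \left[\left(\frac{i}{M}\right)^2 - \frac{i}{M} + \frac{M-1}{6M}\right]$$
$$\phi_3\!\left(\tfrac{i}{M}\right) = \left[\frac{2800M^5}{(M-1)(M-2)(M+2)(M+3)(M+4)}\right]^{1/2} \left[\left(\frac{i}{M}\right)^3 - \frac{3}{2}\left(\frac{i}{M}\right)^2 + \frac{6M^2 - 3M + 2}{10M^2}\,\frac{i}{M} - \frac{(M-1)(M-2)}{20M^2}\right]$$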
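A Statistical Syllable Pitch Contour Model (4/7)
Parameterized pitch contour
$$\hat{f}\!\left(\tfrac{i}{M}\right) = \sum_{j=0}^{3} a_j \, \phi_j\!\left(\tfrac{i}{M}\right), \quad 0 \le i \le M$$
$$a_j = \frac{1}{M+1} \sum_{i=0}^{M} f\!\left(\tfrac{i}{M}\right) \phi_j\!\left(\tfrac{i}{M}\right)$$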
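The four-coefficient parameterization can be sketched directly from the standard discrete Legendre basis over the points i/M, i = 0, ..., M. The code below is an illustrative sketch with invented function names: it evaluates the four basis functions, projects a contour onto them with the 1/(M+1) normalization, and reconstructs the smoothed contour.

```python
import math


def legendre_basis(i, M):
    """Values of the first four discrete Legendre polynomials at i/M (M >= 3)."""
    x = i / M
    phi0 = 1.0
    phi1 = math.sqrt(12.0 * M / (M + 2)) * (x - 0.5)
    phi2 = math.sqrt(180.0 * M**3 / ((M - 1) * (M + 2) * (M + 3))) * (
        x * x - x + (M - 1) / (6.0 * M)
    )
    phi3 = math.sqrt(
        2800.0 * M**5 / ((M - 1) * (M - 2) * (M + 2) * (M + 3) * (M + 4))
    ) * (
        x**3
        - 1.5 * x * x
        + (6.0 * M * M - 3.0 * M + 2.0) / (10.0 * M * M) * x
        - (M - 1) * (M - 2) / (20.0 * M * M)
    )
    return [phi0, phi1, phi2, phi3]


def fit_coefficients(contour):
    """Project a contour of M + 1 samples onto the basis: a_0 (mean) to a_3."""
    M = len(contour) - 1
    coeffs = [0.0, 0.0, 0.0, 0.0]
    for i, value in enumerate(contour):
        for j, phi in enumerate(legendre_basis(i, M)):
            coeffs[j] += value * phi
    return [c / (M + 1) for c in coeffs]


def reconstruct(coeffs, M):
    """Rebuild the contour at the M + 1 sample points from the coefficients."""
    return [
        sum(a * phi for a, phi in zip(coeffs, legendre_basis(i, M)))
        for i in range(M + 1)
    ]


# Hypothetical 11-frame pitch-period contour (ms) that happens to be cubic in i/M.
example_M = 10
example_contour = [
    7.0 + 0.5 * (i / example_M) - 2.0 * (i / example_M) ** 2 + (i / example_M) ** 3
    for i in range(example_M + 1)
]
example_coeffs = fit_coefficients(example_contour)
print([round(a, 4) for a in example_coeffs])
```

Because the basis is orthonormal under the 1/(M+1) inner product, a_0 is the contour mean and a_1 to a_3 describe its shape, which is the split between the pitch mean model and the pitch shape model used on the following slides.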
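A Statistical Syllable Pitch Contour Model (5/7)
Pitch mean modeling
The observed log-pitch mean $Z_n$ is related to the speaker-compensated log-pitch mean $Y_n$ through two CFs of the speaker $s_n$:
• $Z_n$: observed log-pitch mean
• $Y_n$: speaker-compensated log-pitch mean
• speaker's dynamic range change CF (indexed by $s_n$)
• speaker's level shift CF (indexed by $s_n$)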
A Statistical Syllable Pitch Contour Model (6/7)
$$Y_n = X_n + \beta_{t_n} + \beta_{pt_n} + \beta_{ft_n} + \beta_{i_n} + \beta_{f_n} + \beta_{p_n}$$
• $X_n$: normalized log-pitch mean of the nth syllable
• $\beta_r$: CF of affecting factor $r$
• $t_n$: current lexical tone of the nth syllable
• $pt_n$: previous lexical tone of the nth syllable
• $ft_n$: following lexical tone of the nth syllable
• $i_n$: initial class of the nth syllable
• $f_n$: final class of the nth syllable
• $p_n$: prosodic state of the nth syllable
A Statistical Syllable Pitch Contour Model (7/7)
Pitch shape modeling
$$Z_n = X_n + b_{tc_n} + b_{q_n} + b_{s_n} + b_{i_n} + b_{f_n}$$
• $Z_n = [a_1 \; a_2 \; a_3]^T$: observed pitch shape vector of the nth syllable
• $X_n$: normalized pitch shape vector of the nth syllable
• $b_r$: CF vector of affecting factor $r$
• $tc_n$: lexical tone combination of the nth syllable
• $q_n$: prosodic state of pitch shape
• Pause < 13 frames: tight coupling effect; pause >= 13 frames: loose coupling effect
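Experimental Results (1/6)
Observed Log-Pitch (unit of pitch period: ms)

             training set             test set
             mean     (co)variance    mean     (co)variance
mean         1.949    0.0372          1.948    0.0345

Shape (x 0.01), magnitudes as they survive in the transcript (any minus signs are not preserved):
Training set: mean $[3.545 \;\; 0.289 \;\; 0.650]^T$, covariance
$$\begin{bmatrix} 85.055 & 3.922 & 5.041 \\ 3.922 & 9.176 & 0.601 \\ 5.041 & 0.601 & 2.009 \end{bmatrix}$$
Test set: mean $[4.210 \;\; 0.947 \;\; 0.241]^T$, covariance
$$\begin{bmatrix} 94.984 & 3.356 & 4.700 \\ 3.356 & 21.064 & 0.672 \\ 4.700 & 0.672 & 4.653 \end{bmatrix}$$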
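Experimental Results (2/6)
Normalized Log-Pitch with 16 Prosodic States (unit of pitch period: ms)

             training set                       test set
             mean     (co)variance   RMSE       mean     (co)variance   RMSE
mean         1.948    0.000402       0.0203     1.948    0.000344       0.0183

Shape (x 0.01), magnitudes as they survive in the transcript (any minus signs are not preserved):
Training set: mean $[3.066 \;\; 0.699 \;\; 0.401]^T$, RMSE $[3.341 \;\; 1.183 \;\; 1.021]^T$, covariance
$$\begin{bmatrix} 9.568 & 0.453 & 0.670 \\ 0.453 & 1.709 & 0.232 \\ 0.670 & 0.232 & 1.152 \end{bmatrix}$$
Test set: mean $[3.168 \;\; 0.609 \;\; 0.580]^T$, RMSE $[3.306 \;\; 1.267 \;\; 1.505]^T$, covariance
$$\begin{bmatrix} 21.588 & 0.559 & 1.370 \\ 0.559 & 3.101 & 0.808 \\ 1.370 & 0.808 & 2.362 \end{bmatrix}$$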
Experimental Results (3/6)
[Figure: Histograms of Observed (left)/Normalized (right) Log-Pitch Mean for the Training Set. x-axis: pitch mean; y-axis: number.]
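Experimental Results (4/6)
Examples of the Reconstructed Pitch Contours
Inside Test: "在國人消費習慣改變,國民所得提高,信用貸款市場,成為潛力市場。" (As consumers' spending habits change and national income rises, the credit loan market is becoming a market with potential.)
[Figure: original and predicted pitch contours. x-axis: Frame; y-axis: Pitch Period (ms).]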
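Experimental Results (5/6)
Examples of the Reconstructed Pitch Contours
Outside Test: "在意國政經混亂中臨危受命的齊安培,未來在政經兩方面都有不少艱困任務待完成。" (Ciampi, who took office at a critical moment amid Italy's political and economic turmoil, still has many difficult political and economic tasks ahead.)
[Figure: original and predicted pitch contours. x-axis: Frame; y-axis: Pitch Period (ms).]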
Experimental Results (6/6)
[Figure: Influences of the 16 Unified Prosodic States. x-axis: prosodic state; y-axis: pitch period (ms).]
Analyses of the Inferred Model (1/13)
CFs of Current, Previous and Following Tones in Pitch Mean Model

tone                    1        2        3        4        5
current tone (t)       -0.154    0.054    0.160   -0.035    0.128
previous tone (pt)     -0.022   -0.034    0.018    0.024    0.029
following tone (ft)     0.022   -0.003   -0.047    0.011    0.013
Analyses of the Inferred Model (2/13)
[Figure: Comparison of a Tone 3 preceding another Tone 3 with canonical Tone 2 and Tone 3. x-axis: frame; y-axis: pitch period (ms). Legend (tone combinations): 033, 133, 233, 333, 433, 533, 020, 030.]
Analyses of the Inferred Model (3/13)
[Figure: Comparison of a Tone 4 preceding another Tone 4 with canonical Tone 4. x-axis: frame; y-axis: pitch period (ms). Legend (tone combinations): 044, 144, 244, 344, 444, 544, 040.]
Analyses of the Inferred Model (4/13)
CFs of Initial/Final Classes in Pitch Mean Model (unit of pitch period: ms)

class          0        1        2        3        4        5        6
initial (i)   -0.008    0.004    0.011   -0.013    0.003   -0.014    0.003
final (f)      0.011   -0.001   -0.004    0.008   -0.005   -0.019    0.004

Initial classes: Null initial; {b,d,g}; {f,s,sh,shi,h}; {m,n,l,r}; {ts,ch,chi}; {p,t,k}; {tz,j,ji}
Final classes: Low vowels; Middle vowels; High vowels; Compound vowels; Vowel with nasal ending; Retroflexion; Null vowels
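Analyses of the Inferred Model (5/13)
CFs of Initial/Final Classes in Pitch Shape Model (unit of pitch period: ms; x 0.01)
[Table: one three-element CF vector per class 0 to 6, for the initial classes ($b_i$) and for the final classes ($b_f$). The values below are magnitudes in transcript order, grouped in threes; their assignment to classes and any minus signs are not preserved.]
0.845, 1.521, 0.179; 0.020, 0.510, 0.225; 0.123, 0.044, 0.905; 0.796, 0.605, 0.025; 0.846, 0.666, 1.072; 0.983, 0.726, 0.111; 0.570, 0.161, 0.227;
0.590, 0.082, 0.146; 0.670, 0.568, 0.872; 0.490, 0.710, 0.879; 0.661, 0.307, 0.046; 0.080, 0.198, 1.662; 0.192, 0.696, 0.453; 0.281, 0.131, 0.422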
Analyses of the Inferred Model (6/13)
CFs of Speakers in Pitch Mean Model (unit of pitch period: ms)

speakers                    1(M)     2(F)     3(M)     4(F)
dynamic range change CF     1.014    0.971    1.026    0.981
level shift CF             -0.030    0.049   -0.044    0.041
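Analyses of the Inferred Model (7/13)
CFs of Speakers in Pitch Shape Model (unit of pitch period: ms; x 0.01)
[Table: one three-element CF vector $b_s$ per speaker, 1(M), 2(F), 3(M), 4(F). The values below are magnitudes in transcript order, grouped in threes; their assignment to speakers and any minus signs are not preserved.]
0.210, 0.431, 0.192; 0.521, 0.203, 0.423; 0.843, 0.943, 0.612; 0.251, 0.274, 0.103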
Analyses of the Inferred Model (8/13)
CFs of Prosodic States in Pitch Mean Model (unit of pitch period: ms)

state   0        1        2        3        4        5        6        7
CF     -0.400   -0.225   -0.159   -0.113   -0.081   -0.047   -0.016    0.014

state   8        9        10       11       12       13       14       15
CF      0.039    0.073    0.102    0.130    0.161    0.196    0.265    0.348
Analyses of the Inferred Model (9/13)
CFs of Prosodic States in Pitch Shape Model (unit of pitch period: ms; x 0.01)
[Table: one three-element CF vector $b_q$ per prosodic state 0 to 15. The values below are magnitudes in transcript order, grouped in threes; their assignment to states and any minus signs are not preserved.]
0.801, 4.238, 3.266; 1.674, 1.942, 9.453; 1.535, 0.971, 0.740; 0.403, 0.974, 0.461;
0.634, 3.122, 1.761; 0.377, 0.592, 3.707; 0.643, 4.812, 2.792; 1.461, 0.897, 1.043;
0.762, 0.195, 2.542; 0.481, 2.942, 0.948; 0.664, 1.491, 1.855; 0.169, 0.285, 4.330;
0.842, 1.055, 1.761; 1.306, 1.964, 0.490; 0.486, 2.554, 1.055; 0.601, 0.982, 0.972
Analyses of the Inferred Model (10/13)
Analyses of the Inferred Model (11/13)
Analyses of the Inferred Model (12/13)
Statistics of the Prosodic Labeling (rows: PM type; columns: break type)

PM \ Break            Non-boundary   Minor boundary   Major boundary
Non-PM                89.18%          9.80%            1.02%
Minor PM              57.73%         33.48%            8.80%
Secondary Major PM    30.52%         44.65%           24.83%
Major PM              19.31%         31.66%           49.02%

Major PM = { , 。 ! ; ? }, Secondary Major PM = { 、 : }, Minor PM = {brace, bracket, dot}

The break at the location after syllable n is decided from the difference between the prosodic states $p_n$ and $p_{n+1}$:
• major boundary if the difference lies between 10 and 15
• minor boundary if the difference lies between 4 and 9
• non-boundary otherwise
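The boundary rule on this slide is a simple threshold test on the prosodic-state difference across a syllable juncture. A minimal sketch follows; the function name is hypothetical, and the caller is assumed to supply the difference between the two neighbouring prosodic states in the direction used on the slide.

```python
def boundary_type(state_difference):
    """Classify a syllable juncture from the prosodic-state difference.

    Thresholds follow the slide: 10 to 15 is a major boundary,
    4 to 9 is a minor boundary, anything else is a non-boundary.
    """
    if 10 <= state_difference <= 15:
        return "major boundary"
    if 4 <= state_difference <= 9:
        return "minor boundary"
    return "non-boundary"


for difference in (12, 5, 1):
    print(difference, boundary_type(difference))
```

With 16 prosodic states (0 to 15), a difference of 15 is the largest possible jump, so the major-boundary range covers the strongest resets.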
Analyses of the Inferred Model (13/13)
這位約翰霍普金斯大學名譽教授 *在第一屆國際 &性高潮會議中說 *,他對這一始於 &一九八O年代的性趨勢 &感到 ...這場比賽 *將於今日下午2時 &在 &台北 &市立棒球場舉行 *,黑鷹組織 &所屬 &三級棒球隊 *,包括台南六信 *、台東農工 &、屏東鶴聲國中 *、台東鹿野國中 &及台南善化國小等隊 *,將各著球隊服裝&到場加油 *,預計人數有近千人以上 *。黑鷹兩位教練 *黃永裕及&江泰權 *,對於此場比賽 *不敢掉以輕心 *,除了排出鑽石陣容外,也要親自上場 *。黑鷹所 ...商人非法囤積 &大量爆竹 *,萬一發生爆炸事件 *,不但會造成死傷慘劇 *,自己也可能成為 &受害最大 ...世界性的環保潮流 &,使人們日益重視環境汙染的問題 *;而觀光旅遊 & 這個﹁無煙囪工業 *﹂正好吻合此一 *健康訴求 * ,因此可預期& 今年將是遊樂區 ...
Examples of Possible Minor (&) and Major (*) Prosodic Phrase Boundaries
Conclusions
• The models are effective at isolating several main factors
• The variance of the modeled duration/pitch is greatly reduced
• The estimated companding factors (CFs) conform well to prior linguistic knowledge
• The prosodic-state labels produced are linguistically meaningful
Automatic Prosody Labeling based on LPM
Break types
In this study, we define five levels of break type, B0 to B4:
• B0: a tightly coupled syllabic boundary, where the pitch contour across the syllable juncture may be connected and is severely affected by the contextual syllables.
• B1: a normal syllabic boundary, which loosely couples two consecutive syllables and has no pitch reset.
• B2: a prosodic word boundary, which has a short pause or an irregular pitch reset.
• B3/B4: minor/major breaks with medium and long pauses, respectively. They are usually accompanied by medium or large pitch resets.
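Break Labeling Algorithm
$$\begin{aligned}
\mathbf{B}^*, \mathbf{p}^* &= \arg\max_{\mathbf{B}, \mathbf{p}} P(\mathbf{p}, \mathbf{B} \mid \mathbf{x}, \mathbf{Pau}, \mathbf{L}, \mathbf{t}) \\
&= \arg\max_{\mathbf{B}, \mathbf{p}} P(\mathbf{p}, \mathbf{B}, \mathbf{x}, \mathbf{Pau} \mid \mathbf{L}, \mathbf{t}) \\
&= \arg\max_{\mathbf{B}, \mathbf{p}} P(\mathbf{x}, \mathbf{Pau} \mid \mathbf{p}, \mathbf{B}, \mathbf{L}, \mathbf{t}) \, P(\mathbf{p}, \mathbf{B} \mid \mathbf{L}, \mathbf{t})
\end{aligned}$$
• $\mathbf{B}$: break type
• $\mathbf{p}$: prosodic state
• $\mathbf{x}$: pitch contour
• $\mathbf{Pau}$: pause duration
• $\mathbf{L}$: high-level linguistic feature
• $\mathbf{t}$: low-level linguistic feature (tone)
• $P(\mathbf{x}, \mathbf{Pau} \mid \mathbf{p}, \mathbf{B}, \mathbf{L}, \mathbf{t})$: acoustic-prosodic model
• $P(\mathbf{p}, \mathbf{B} \mid \mathbf{L}, \mathbf{t})$: linguistic-prosodic model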
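Acoustic-prosodic model (1/3)
$$\begin{aligned}
P(\mathbf{x}, \mathbf{Pau} \mid \mathbf{p}, \mathbf{B}, \mathbf{L}, \mathbf{t}) &= P(\mathbf{x} \mid \mathbf{p}, \mathbf{B}, \mathbf{L}, \mathbf{t}) \, P(\mathbf{Pau} \mid \mathbf{p}, \mathbf{B}, \mathbf{L}, \mathbf{t}) \\
&\approx P(\mathbf{x} \mid \mathbf{p}, \mathbf{B}, \mathbf{t}) \, P(\mathbf{Pau} \mid \mathbf{B}, \mathbf{L}) \\
&\approx \prod_{k=1}^{K} \prod_{n} P(x_{k,n} \mid p_{k,n}, B_{k,n-1}, B_{k,n}, t_{k,n-1}, t_{k,n}, t_{k,n+1}) \, P(Pau_{k,n} \mid B_{k,n}, L_{k,n})
\end{aligned}$$
where k indexes the K utterances and n the $N_k$ syllables of utterance k.
• $P(x_{k,n} \mid p_{k,n}, B_{k,n-1}, B_{k,n}, t_{k,n-1}, t_{k,n}, t_{k,n+1})$: the syllable pitch contour model (Base LPM)
• $P(Pau_{k,n} \mid B_{k,n}, L_{k,n})$: the pause-break model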
Acoustic-prosodic model (2/3)
The syllable pitch contour model
$$x_{k,n} = y_{k,n} + PT_{t_{k,n}} + PP_{p_{k,n}} + PC^{f}_{B_{k,n-1},\, tp_{k,n-1}} + PC^{b}_{B_{k,n},\, tp_{k,n}} + \mu$$
Acoustic-prosodic model (3/3)
Likelihood of the syllable pitch contour model
$$P(x_{k,n} \mid p_{k,n}, B_{k,n-1}, B_{k,n}, t_{k,n-1}, t_{k,n}, t_{k,n+1}) = N\!\left(x_{k,n};\; PT_{t_{k,n}} + PP_{p_{k,n}} + PC^{f}_{B_{k,n-1},\, tp_{k,n-1}} + PC^{b}_{B_{k,n},\, tp_{k,n}} + \mu,\; R\right)$$
The pause-break model
$$P(Pau_{k,n} \mid B_{k,n}, L_{k,n}) = g\!\left(Pau_{k,n};\; \alpha_{B_{k,n}, L_{k,n}},\; \beta_{B_{k,n}, L_{k,n}}\right)$$
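The pause-break model scores a pause duration with a two-parameter density g whose parameters depend on the break type and the linguistic context. The sketch below assumes that g is a gamma density with a shape and a scale parameter; that reading, the function name and the parameter values are assumptions for illustration, not something the slide states.

```python
import math


def gamma_pdf(pause, shape, scale):
    """Gamma density with the given shape and scale (assumed form of g).

    Computed in the log domain for numerical stability.
    """
    if pause <= 0.0:
        return 0.0
    log_density = (
        (shape - 1.0) * math.log(pause)
        - pause / scale
        - math.lgamma(shape)
        - shape * math.log(scale)
    )
    return math.exp(log_density)


# Hypothetical parameters for one (break type, linguistic context) pair.
# Their product, shape * scale, is the mean pause duration in seconds.
example_shape, example_scale = 2.0, 0.24

for pause_sec in (0.05, 0.5, 1.5):
    print(pause_sec, round(gamma_pdf(pause_sec, example_shape, example_scale), 4))
```

One such parameter pair would be kept per combination of break type and linguistic context, so that long pauses score well under B3/B4 and poorly under B0/B1.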
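Linguistic-prosodic model
$$\begin{aligned}
P(\mathbf{p}, \mathbf{B} \mid \mathbf{L}, \mathbf{t}) &\approx P(\mathbf{p}, \mathbf{B} \mid \mathbf{L}) = P(\mathbf{p} \mid \mathbf{B}, \mathbf{L}) \, P(\mathbf{B} \mid \mathbf{L}) \approx P(\mathbf{p} \mid \mathbf{B}) \, P(\mathbf{B} \mid \mathbf{L}) \\
&= \prod_{k=1}^{K} \left[ P(p_{k,1}) \prod_{n=2} P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1}) \right] \left[ \prod_{n=1} P(B_{k,n} \mid L_{k,n}) \right]
\end{aligned}$$
where k indexes the K utterances and n the $N_k$ syllables of utterance k.
• $P(p_{k,1}) \prod_{n} P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1})$: prosodic state transition model
• $\prod_{n} P(B_{k,n} \mid L_{k,n})$: linguistic-break model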
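For one utterance, the linguistic-prosodic score is an initial-state probability, a chain of break-dependent state transitions, and one break probability per juncture. The sketch below evaluates that product in the log domain for a toy setting. Everything in it is illustrative: the function name, the two-state and two-break tables, the context labels, and the assumption that an utterance of N syllables has N - 1 junctures.

```python
import math


def linguistic_prosodic_log_prob(states, breaks, contexts,
                                 initial, transition, break_given_context):
    """Log-probability of a prosodic-state and break sequence for one utterance.

    states[n]   : prosodic state of syllable n
    breaks[n]   : break type after syllable n (assumed: one per juncture)
    contexts[n] : linguistic context label of that juncture
    """
    total = math.log(initial[states[0]])
    # State transitions, conditioned on the break before the current syllable.
    for n in range(1, len(states)):
        total += math.log(transition[breaks[n - 1]][states[n - 1]][states[n]])
    # Break types, conditioned on the linguistic context of each juncture.
    for n, break_type in enumerate(breaks):
        total += math.log(break_given_context[contexts[n]][break_type])
    return total


# Toy tables: 2 prosodic states, 2 break types, 2 context labels (all hypothetical).
toy_initial = [0.5, 0.5]
toy_transition = {
    0: [[0.9, 0.1], [0.2, 0.8]],
    1: [[0.5, 0.5], [0.6, 0.4]],
}
toy_break_given_context = {"intra-word": [0.8, 0.2], "inter-word": [0.3, 0.7]}

toy_log_prob = linguistic_prosodic_log_prob(
    states=[0, 0, 1],
    breaks=[0, 1],
    contexts=["intra-word", "inter-word"],
    initial=toy_initial,
    transition=toy_transition,
    break_given_context=toy_break_given_context,
)
print(round(toy_log_prob, 4))
```

In the labeling algorithm this score would be added to the acoustic-prosodic log-likelihood, and the break and state sequences maximizing the sum would be selected.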
Training of the Model
To estimate the parameters of the break labeling model, a sequential optimization procedure based on the ML criterion is adopted. It first defines the likelihood function
$$\begin{aligned}
Q = \log \Bigg\{ &\prod_{k=1}^{K} \prod_{n=1} P(x_{k,n} \mid p_{k,n}, B_{k,n-1}, B_{k,n}, t_{k,n-1}, t_{k,n}, t_{k,n+1}) \, P(Pau_{k,n} \mid B_{k,n}, L_{k,n}) \\
&\times \prod_{k=1}^{K} \left[ P(p_{k,1}) \prod_{n=2} P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1}) \prod_{n=1} P(B_{k,n} \mid L_{k,n}) \right] \Bigg\}
\end{aligned}$$
where k indexes the K utterances and n the $N_k$ syllables of utterance k.
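Initialization of Break Labeling
[Figure: decision tree for the initial break labels, with Y/N branches. Decision nodes: Pause ≥ 300 ms; Pause ≥ 125 ms; Pause ≥ 75 ms; PM; Normalized pitch reset ≥ threshold; Pitch pause ≥ 30 ms (two nodes); Interword. Leaves (in transcript order): B4, B3, B3, B2, B1, B0, B1, B0, B2.]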
Experimental Database
• Performance of the proposed pitch modeling method was evaluated using a Mandarin speech database
• The database contained the read speech of a single female professional announcer
• Its texts were all short paragraphs composed of several sentences selected from the Sinica Tree-Bank Corpus
• The database consisted of 380 utterances with 52192 syllables
• Sampling rate: 16 kHz
• All segmentations and F0 values were manually corrected
Experimental Results
The learning curve
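Experimental Results
Covariance matrices of observed and normalized feature vectors
$$R_x = \begin{bmatrix} 932.3 & 0 & 0 & 0 \\ 0 & 89.9 & 0 & 0 \\ 0 & 0 & 17.8 & 0 \\ 0 & 0 & 0 & 5.0 \end{bmatrix} \times 10^{-4}$$
$$R_y = \begin{bmatrix} 9.0 & 0 & 0 & 0 \\ 0 & 31.9 & 0 & 0 \\ 0 & 0 & 11.1 & 0 \\ 0 & 0 & 0 & 3.8 \end{bmatrix} \times 10^{-4}$$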
Experimental Results-syllable pitch contour model(1/12)
The learned pitch contour of 5 tones
Experimental Results-syllable pitch contour model(2/12)
Prosodic state patterns
Experimental Results-syllable pitch contour model(3/12)
Coarticulation patterns
Experimental Results-syllable pitch contour model(4/12)
Experimental Results-syllable pitch contour model(5/12)
Experimental Results-syllable pitch contour model(6/12)
Experimental Results-syllable pitch contour model(7/12)
Experimental Results-syllable pitch contour model(8/12)
Experimental Results-syllable pitch contour model(9/12)
Experimental Results-syllable pitch contour model(10/12)
Experimental Results-syllable pitch contour model(11/12)
Experimental Results-syllable pitch contour model(12/12)
Experimental Results-Pause-break model (1/2)
Pause-break model

Break type                      B0      B1      B2      B3      B4
Pause duration mean (in sec)    0.002   0.009   0.035   0.206   0.479
Experimental Results-Pause-break model (2/2)
$P(Pau_{k,n} \mid B_{k,n} = 4, L_{k,n})$
Experimental Results-length of prosodic units (1/3)
Histogram of length of prosodic group
Experimental Results-length of prosodic units (2/3)
Histogram of length of prosodic phrase
Experimental Results-length of prosodic units (3/3)
Histogram of length of word
Experimental Results
Count of break indices
Experimental Results
Count of prosodic state
Experimental Results
Prob. of prosodic state after B3
Experimental Results
Prob. of prosodic state before B3
Experimental Results
Prob. of prosodic state after B4
Experimental Results
Prob. of prosodic state before B4
Experimental Results-prosodic state transition model (1/5)
$P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1} = 4)$

Pn-1\Pn    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
1       0.01 0.01 0.01 0.02 0.01 0.08 0.08 0.02 0.04 0.13 0.01 0.12 0.13 0.12 0.12 0.09
2       0.00 0.01 0.01 0.01 0.04 0.00 0.10 0.01 0.08 0.16 0.07 0.00 0.18 0.11 0.13 0.08
3       0.00 0.00 0.00 0.03 0.00 0.04 0.00 0.10 0.03 0.07 0.00 0.23 0.10 0.19 0.12 0.08
4       0.00 0.00 0.00 0.02 0.00 0.04 0.00 0.13 0.00 0.14 0.00 0.04 0.13 0.20 0.16 0.13
5       0.00 0.00 0.00 0.01 0.00 0.06 0.00 0.00 0.13 0.00 0.17 0.00 0.33 0.10 0.17 0.00
6       0.00 0.00 0.01 0.00 0.06 0.01 0.00 0.20 0.00 0.03 0.14 0.00 0.07 0.08 0.28 0.10
7       0.00 0.00 0.00 0.00 0.02 0.01 0.00 0.00 0.08 0.01 0.00 0.26 0.00 0.43 0.00 0.17
8       0.00 0.00 0.00 0.03 0.00 0.00 0.00 0.09 0.01 0.00 0.11 0.00 0.38 0.00 0.35 0.00
9       0.01 0.01 0.01 0.01 0.01 0.01 0.07 0.01 0.01 0.01 0.01 0.24 0.01 0.35 0.01 0.25
10      0.01 0.01 0.01 0.01 0.12 0.01 0.01 0.01 0.22 0.03 0.01 0.01 0.31 0.07 0.16 0.01
11      0.02 0.02 0.02 0.02 0.04 0.02 0.02 0.04 0.02 0.06 0.02 0.25 0.04 0.19 0.04 0.15
12      0.01 0.01 0.01 0.02 0.01 0.01 0.01 0.08 0.02 0.10 0.07 0.01 0.08 0.08 0.28 0.17
13      0.02 0.02 0.02 0.02 0.15 0.02 0.02 0.12 0.04 0.10 0.06 0.15 0.12 0.02 0.08 0.04
14      0.03 0.03 0.03 0.05 0.03 0.03 0.19 0.03 0.03 0.11 0.05 0.08 0.16 0.05 0.05 0.03
15      0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.10 0.05 0.05 0.05 0.05 0.05 0.10 0.10 0.05
16      0.03 0.03 0.03 0.03 0.03 0.07 0.03 0.03 0.03 0.17 0.10 0.07 0.07 0.03 0.10 0.10
Experimental Results-prosodic state transition model (2/5)
$P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1} = 3)$

Pn-1\Pn    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
1       0.03 0.03 0.03 0.15 0.05 0.03 0.08 0.03 0.08 0.03 0.20 0.05 0.10 0.05 0.05 0.03
2       0.01 0.03 0.08 0.11 0.01 0.15 0.09 0.03 0.01 0.16 0.15 0.05 0.05 0.05 0.01 0.01
3       0.00 0.01 0.06 0.05 0.10 0.07 0.00 0.18 0.01 0.26 0.00 0.07 0.09 0.03 0.04 0.00
4       0.00 0.01 0.04 0.00 0.11 0.00 0.22 0.00 0.12 0.00 0.20 0.11 0.00 0.10 0.04 0.02
5       0.00 0.00 0.00 0.09 0.13 0.00 0.00 0.10 0.18 0.00 0.00 0.32 0.10 0.00 0.05 0.01
6       0.00 0.00 0.08 0.02 0.00 0.00 0.28 0.00 0.01 0.33 0.01 0.03 0.00 0.19 0.00 0.04
7       0.00 0.00 0.00 0.05 0.00 0.17 0.00 0.28 0.00 0.00 0.25 0.00 0.13 0.01 0.08 0.00
8       0.00 0.00 0.03 0.00 0.15 0.00 0.13 0.00 0.25 0.00 0.00 0.00 0.26 0.12 0.00 0.05
9       0.00 0.00 0.03 0.16 0.00 0.00 0.00 0.11 0.00 0.00 0.46 0.04 0.00 0.00 0.16 0.00
10      0.00 0.01 0.02 0.00 0.10 0.18 0.00 0.00 0.15 0.23 0.00 0.00 0.24 0.01 0.00 0.04
11      0.01 0.05 0.03 0.01 0.10 0.01 0.19 0.06 0.01 0.08 0.01 0.17 0.03 0.06 0.14 0.01
12      0.00 0.00 0.00 0.10 0.00 0.13 0.01 0.07 0.17 0.00 0.00 0.25 0.00 0.19 0.00 0.04
13      0.01 0.01 0.04 0.01 0.14 0.01 0.01 0.37 0.02 0.01 0.14 0.01 0.15 0.01 0.08 0.01
14      0.01 0.01 0.03 0.04 0.05 0.01 0.08 0.10 0.09 0.02 0.22 0.06 0.07 0.14 0.03 0.05
15      0.02 0.02 0.02 0.04 0.02 0.02 0.02 0.17 0.08 0.17 0.13 0.06 0.11 0.02 0.02 0.08
16      0.05 0.05 0.10 0.05 0.05 0.05 0.05 0.10 0.05 0.05 0.10 0.05 0.05 0.05 0.05 0.05
Experimental Results-prosodic state transition model (3/5)
$P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1} = 2)$

Pn-1\Pn    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
1       0.06 0.13 0.15 0.09 0.04 0.02 0.11 0.02 0.06 0.09 0.02 0.02 0.09 0.02 0.02 0.02
2       0.05 0.05 0.09 0.22 0.01 0.05 0.19 0.12 0.03 0.01 0.12 0.01 0.03 0.01 0.01 0.01
3       0.02 0.01 0.05 0.11 0.17 0.03 0.00 0.22 0.04 0.06 0.00 0.16 0.00 0.09 0.02 0.01
4       0.01 0.00 0.04 0.00 0.19 0.00 0.06 0.00 0.35 0.00 0.00 0.15 0.14 0.01 0.04 0.00
5       0.02 0.00 0.03 0.03 0.00 0.00 0.18 0.00 0.00 0.39 0.00 0.13 0.04 0.16 0.00 0.02
6       0.00 0.00 0.00 0.15 0.00 0.00 0.00 0.38 0.00 0.00 0.29 0.00 0.00 0.05 0.11 0.00
7       0.00 0.01 0.04 0.00 0.00 0.00 0.15 0.00 0.13 0.22 0.00 0.00 0.31 0.06 0.06 0.01
8       0.00 0.00 0.00 0.00 0.06 0.00 0.00 0.00 0.17 0.00 0.09 0.37 0.00 0.28 0.00 0.01
9       0.00 0.01 0.03 0.03 0.00 0.01 0.02 0.00 0.00 0.23 0.00 0.00 0.37 0.00 0.24 0.06
10      0.00 0.01 0.00 0.00 0.04 0.00 0.00 0.10 0.00 0.00 0.00 0.36 0.00 0.47 0.00 0.00
11      0.01 0.00 0.00 0.04 0.00 0.04 0.02 0.01 0.11 0.01 0.01 0.00 0.43 0.00 0.23 0.09
12      0.00 0.00 0.01 0.00 0.00 0.04 0.00 0.05 0.00 0.00 0.18 0.00 0.20 0.17 0.26 0.06
13      0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.02 0.03 0.01 0.00 0.08 0.00 0.33 0.37 0.10
14      0.01 0.01 0.01 0.08 0.01 0.02 0.00 0.00 0.12 0.02 0.03 0.00 0.13 0.05 0.29 0.22
15      0.01 0.01 0.02 0.03 0.00 0.10 0.01 0.04 0.01 0.07 0.03 0.13 0.05 0.03 0.21 0.23
16      0.01 0.03 0.04 0.04 0.03 0.01 0.09 0.01 0.01 0.11 0.01 0.12 0.08 0.06 0.14 0.22
Experimental Results-prosodic state transition model (4/5)
$P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1} = 1)$

Pn-1\Pn    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
1       0.09 0.29 0.18 0.07 0.04 0.02 0.02 0.02 0.04 0.04 0.02 0.04 0.02 0.02 0.02 0.02
2       0.09 0.28 0.30 0.10 0.01 0.07 0.00 0.05 0.02 0.02 0.00 0.00 0.01 0.02 0.01 0.00
3       0.04 0.24 0.21 0.30 0.00 0.12 0.03 0.00 0.00 0.01 0.00 0.00 0.03 0.01 0.00 0.00
4       0.02 0.13 0.26 0.17 0.30 0.00 0.00 0.08 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.00
5       0.00 0.05 0.22 0.35 0.00 0.18 0.00 0.14 0.00 0.00 0.00 0.04 0.00 0.00 0.00 0.00
6       0.02 0.10 0.00 0.34 0.07 0.00 0.21 0.00 0.07 0.12 0.00 0.00 0.03 0.03 0.00 0.00
7       0.00 0.03 0.11 0.18 0.22 0.00 0.33 0.00 0.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
8       0.00 0.02 0.15 0.00 0.45 0.00 0.00 0.24 0.00 0.00 0.11 0.00 0.02 0.00 0.00 0.00
9       0.01 0.00 0.00 0.35 0.00 0.20 0.00 0.24 0.00 0.10 0.00 0.00 0.08 0.00 0.00 0.01
10      0.00 0.02 0.06 0.00 0.00 0.00 0.43 0.00 0.34 0.00 0.00 0.15 0.00 0.00 0.00 0.00
11      0.00 0.01 0.05 0.00 0.36 0.00 0.00 0.00 0.00 0.33 0.00 0.09 0.00 0.10 0.05 0.01
12      0.00 0.01 0.00 0.14 0.00 0.16 0.00 0.34 0.00 0.17 0.05 0.00 0.11 0.00 0.00 0.00
13      0.00 0.01 0.04 0.00 0.09 0.00 0.13 0.00 0.24 0.08 0.00 0.29 0.00 0.10 0.02 0.01
14      0.00 0.00 0.01 0.06 0.00 0.07 0.00 0.18 0.02 0.19 0.00 0.17 0.12 0.11 0.04 0.02
15      0.00 0.01 0.00 0.02 0.00 0.00 0.08 0.00 0.12 0.08 0.00 0.19 0.19 0.19 0.09 0.04
16      0.00 0.01 0.01 0.03 0.00 0.00 0.02 0.03 0.03 0.08 0.00 0.12 0.10 0.24 0.23 0.07
Experimental Results-prosodic state transition model (5/5)
$P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1} = 0)$

Pn-1\Pn    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
1       0.19 0.10 0.03 0.13 0.13 0.03 0.03 0.03 0.06 0.03 0.03 0.03 0.03 0.03 0.03 0.03
2       0.11 0.31 0.16 0.13 0.01 0.06 0.02 0.06 0.02 0.01 0.02 0.02 0.01 0.02 0.01 0.01
3       0.03 0.14 0.33 0.24 0.00 0.00 0.12 0.00 0.03 0.04 0.01 0.02 0.01 0.00 0.01 0.00
4       0.02 0.06 0.21 0.10 0.31 0.00 0.00 0.21 0.00 0.00 0.02 0.03 0.00 0.02 0.00 0.00
5       0.00 0.01 0.02 0.38 0.00 0.40 0.00 0.00 0.15 0.00 0.01 0.00 0.02 0.00 0.01 0.00
6       0.02 0.00 0.21 0.00 0.46 0.00 0.00 0.15 0.09 0.01 0.00 0.02 0.00 0.02 0.00 0.01
7       0.01 0.02 0.04 0.00 0.18 0.00 0.46 0.00 0.00 0.17 0.00 0.08 0.00 0.03 0.00 0.00
8       0.00 0.02 0.00 0.22 0.24 0.00 0.00 0.00 0.35 0.00 0.07 0.00 0.06 0.01 0.02 0.00
9       0.00 0.01 0.01 0.00 0.00 0.23 0.00 0.47 0.00 0.00 0.00 0.20 0.06 0.00 0.00 0.00
10      0.00 0.00 0.03 0.00 0.15 0.00 0.34 0.00 0.00 0.36 0.00 0.00 0.00 0.09 0.01 0.01
11      0.00 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.54 0.00 0.16 0.00 0.26 0.00 0.00 0.00
12      0.00 0.01 0.00 0.05 0.00 0.11 0.00 0.20 0.00 0.21 0.00 0.30 0.00 0.08 0.02 0.00
13      0.00 0.00 0.01 0.00 0.03 0.00 0.12 0.03 0.19 0.00 0.16 0.00 0.31 0.06 0.07 0.02
14      0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.10 0.00 0.20 0.00 0.25 0.08 0.17 0.11 0.04
15      0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.05 0.00 0.08 0.16 0.23 0.20 0.17 0.08
16      0.00 0.00 0.01 0.01 0.00 0.00 0.00 0.01 0.03 0.07 0.00 0.02 0.13 0.26 0.28 0.16
Experimental Results-The decision tree of linguistic-break model
Experimental Results-break labeling example
Summary
In base LPM
• The prosodic state was introduced to replace conventional high-level linguistic information, so as to separate the effects of low-level and high-level linguistic features on speech
• The models are effective at isolating several main factors
• The variance of the modeled duration/pitch is greatly reduced
• The estimated companding factors conform well to prior linguistic knowledge
• The prosodic-state labels produced are linguistically meaningful
Summary
In Automatic Prosody Labeling
• We propose a new automatic prosody labeling algorithm based on base LPM
• We treat both break type and prosodic state as latent variables
• The preliminary experimental results are both linguistically and acoustically meaningful
• Further discussion of each model is needed