latent prosody models of continuous mandarin speech speech lab., cm, nctu chen yu chiang 2007/2/8

LATENT PROSODY MODELS OF CONTINUOUS MANDARIN SPEECH

Speech Lab., CM, NCTU

Chen Yu Chiang

2007/2/8

Outline

• Introduction
• Base Latent Prosody Models (LPM)
  - A Statistical Syllable Duration Model
  - A Statistical Syllable Pitch Contour Model
• Automatic Prosody Labeling based on LPM
• Summary

Introduction (1/11)

What is Prosody?

Prosody is an inherent suprasegmental feature of human speech. It carries the stress, intonation patterns, and timing structure of continuous speech, which determine the naturalness and understandability of an utterance.

Introduction (2/11)

From the listener's point of view, prosody consists of the systematic perception and recovery of a speaker's intentions based on:

• Pause: indicates phrases and lets the speaker avoid running out of air.
• Pitch: rate of vocal-fold cycling (fundamental frequency, or F0) as a function of time.
• Rate/relative duration: phoneme durations, timing, and rhythm.
• Loudness (energy): relative amplitude/volume.

For simplicity, we may say “抑, 揚, 頓, 挫, 輕, 重, 緩, 急” (falling, rising, pausing, breaking off, light, heavy, slow, fast).

Introduction (3/11)

Factors affecting prosody

• Linguistic: lexical, syntactic, semantic, pragmatic
• Para-linguistic: intentional, attitudinal, stylistic
• Non-linguistic: physical, emotional

Introduction (4/11)

Issues in prosody modeling

• Labeling of important prosodic cues
• Construction of the prosody hierarchy
• Modeling of the syntax-prosody relationship
• Prediction of prosodic phrase boundaries (breaks) from text, etc.

Applications

• Automatic Speech Recognition (ASR): important prosodic cues can be extracted from the input utterance to assist both acoustic and linguistic decoding.
• Text-to-Speech (TTS): a good prosody model can generate appropriate prosodic features from the input text.

Introduction (5/11)

Important characteristics of Mandarin Chinese

• A tonal language (four lexical tones and one neutral tone)
• The tonality of a monosyllable is mainly characterized by the shape of its fundamental frequency (F0) contour
• A syllable-based language (411 base-syllables)

Introduction (6/11)

Syllable duration is also strongly affected by the phonetic structure of the base-syllable. Generally speaking, syllable duration increases as the number of constituent phonemes increases.

For example:

• Syllables with single vowels are shortest.
• Syllables with stop initials or no initial, and without nasal endings, are shorter.
• Syllables with fricative initials and with nasal endings are longer.

Introduction (7/11)

Figure: Standard tone patterns

Figure: Effect of context and intonation

Introduction (8/11)

Because Mandarin is a tonal language, there is a tight interaction among the four lexical tones, the neutral tone, the base-syllable types, and the underlying speech prosody/intonation.

Introduction (9/11)

• To find the underlying prosody/intonation structure, we propose the Latent Prosody Models (LPM).
• The LPM considers several companding factors (CFs), or affecting factors, of syllable pitch contour and syllable duration, including tone, initial-final type, base-syllable type, and prosodic state.
• The prosodic state, treated as a latent variable, is conceptually defined as the state of a syllable in a prosodic phrase. It is used as a substitute for high-level linguistic information such as a word, phrase, or syntactic boundary.
• Because the prosodic state is latent, an unlabeled database can be used.

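Introduction (10/11)

LPMs are formulated on the assumption that all affecting factors combine additively or multiplicatively. In the multiplicative form:

\[ Z_n = X_n \, \gamma_{t_n} \gamma_{y_n} \gamma_{j_n} \gamma_{l_n} \gamma_{s_n} \]

• Z_n: observed prosodic feature vector
• X_n: normalized feature vector
• \gamma_{t_n}, \gamma_{y_n}, \gamma_{j_n}, \gamma_{l_n}, \gamma_{s_n}: affecting factors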

Introduction (11/11)

• The main purpose of using the prosodic state in place of conventional high-level linguistic information is to separate the effects of low-level and high-level linguistic features on speech.
• This modeling approach avoids some unsolved problems, such as the inconsistency between prosodic and syntactic structures and the ambiguity of word segmentation and word chunking in Mandarin Chinese.
• Hence, based on the LPM, the proposed prosody labeling model can focus on modeling the global effect of mapping high-level linguistic features to the prosodic states and break indices, since the interference caused by low-level linguistic features has been removed by the LPM.

References

1. Sin-Horng Chen, Wen-Hsing Lai, and Yih-Ru Wang, "A new duration modeling approach for Mandarin speech," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 4, pp. 308-320, Jul. 2003.

2. Sin-Horng Chen, Wen-Hsing Lai, and Yih-Ru Wang, "A statistics-based pitch contour model for Mandarin speech," J. Acoust. Soc. Am., vol. 117, no. 2, pp. 908-925, Feb. 2005.

3. Chen-Yu Chiang, Yih-Ru Wang, and Sin-Horng Chen, "On the inter-syllable coarticulation effect of pitch modeling for Mandarin speech," INTERSPEECH 2005, pp. 3269-3272.

4. Chen-Yu Chiang, Xiao-Dong Wang, Yuan-Fu Liao, Yih-Ru Wang, Sin-Horng Chen, and Keikichi Hirose, "Latent prosody model of continuous Mandarin speech," ICASSP 2007.

Base Latent Prosody Models (LPM)

• A Statistical Syllable Duration Model
• A Statistical Syllable Pitch Contour Model

A Statistical Syllable Duration Model

• In ASR, state duration models are constructed to assist recognition.
• In TTS, synthesis of proper duration information is essential for natural speech.
• An extension includes the modeling of initial and final durations.
• Multiplicative and additive models are compared.

The Multiplicative Duration Model

\[ Z_n = X_n \, \gamma_{t_n} \gamma_{y_n} \gamma_{j_n} \gamma_{l_n} \gamma_{s_n} \]

• Z_n: observed duration of the nth syllable
• X_n: normalized duration of the nth syllable
• \gamma_r: affecting factor (CF) for factor r
• t_n: lexical tone of the nth syllable
• y_n: prosodic state of the nth syllable
• j_n: base-syllable of the nth syllable
• l_n: utterance of the nth syllable
• s_n: speaker of the nth syllable
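The multiplicative relation can be illustrated with a minimal sketch: dividing an observed duration by the product of its companding factors gives the normalized duration. The function name and the CF values below are hypothetical and chosen only for illustration; they are not estimates from the talk.

```python
import math

def normalize_duration(z, factors):
    """Return X_n = Z_n / (product of the companding factors)."""
    return z / math.prod(factors)

# Hypothetical CFs for one syllable, in the order:
# tone, prosodic state, base-syllable, utterance, speaker.
cfs = [1.03, 1.22, 0.96, 1.00, 1.05]
observed = 52.0  # observed syllable duration in frames (illustrative)

normalized = normalize_duration(observed, cfs)
print(f"normalized duration: {normalized:.2f} frames")
```

Multiplying the result by the same factors recovers the observed duration, which is the sense in which the CFs "compand" the normalized value.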

Training of the Model (1/2)

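Expectation-Maximization (EM) algorithm

• N: the total number of training samples
• Y: the total number of prosodic states
• \lambda = \{\mu, \nu, \gamma_t, \gamma_y, \gamma_j, \gamma_l, \gamma_s\}: the set of parameters to be estimated
• Auxiliary function in the E-step (\bar{\lambda}: new set, \lambda: old set):

\[ Q(\bar{\lambda}, \lambda) = \sum_{n=1}^{N} \sum_{y_n=1}^{Y} p(y_n \mid Z_n, \lambda) \, \log p(Z_n, y_n \mid \bar{\lambda}) \]

Training of the Model (2/2)

• X_n follows a normal distribution with mean \mu and variance \nu.
• Assumption:

\[ p(y_n \mid Z_n, \lambda) = \frac{p(Z_n, y_n \mid \lambda)}{\sum_{y'=1}^{Y} p(Z_n, y' \mid \lambda)} \]

\[ p(Z_n, y_n \mid \lambda) = N\!\left(Z_n;\ \mu \, \gamma_{t_n}\gamma_{y_n}\gamma_{j_n}\gamma_{l_n}\gamma_{s_n},\ \nu \, \gamma_{t_n}^2\gamma_{y_n}^2\gamma_{j_n}^2\gamma_{l_n}^2\gamma_{s_n}^2\right) \]

• Sequential optimizations in the M-step
• Assign prosodic state:

\[ y_n^{*} = \arg\max_{y_n} \, p(y_n \mid Z_n, \lambda) \]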
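The E-step posterior and the final state assignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prior over prosodic states is taken as uniform, all non-state CFs are collapsed into a single product `other_cf`, and the function names and the values of `mu` and `nu` are hypothetical. The three state CFs used in the example are the multiplicative syllable CFs reported for states 0, 8 and 15.

```python
import math

def log_gaussian(x, mean, var):
    """Log of the normal density N(x; mean, var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def state_posteriors(z, mu, nu, other_cf, state_cfs):
    """Return p(y | Z) for every prosodic state y (uniform prior assumed).

    In the multiplicative model the mean scales with the CF product and
    the variance scales with its square.
    """
    logs = []
    for g in state_cfs:
        scale = other_cf * g
        logs.append(log_gaussian(z, mu * scale, nu * scale ** 2))
    peak = max(logs)  # subtract the peak to avoid numerical underflow
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def assign_state(z, mu, nu, other_cf, state_cfs):
    """Return the index of the state with the largest posterior."""
    post = state_posteriors(z, mu, nu, other_cf, state_cfs)
    return max(range(len(post)), key=lambda y: post[y])

state_cfs = [0.56, 1.00, 1.69]  # CFs of three of the 16 states
mu, nu = 43.0, 2.5              # hypothetical mean and variance of X_n
print(assign_state(72.0, mu, nu, 1.0, state_cfs))
```

Working in the log domain keeps the posterior well defined even when an observation lies far from the mean of most states.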

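The Additive Duration Model

Model:

\[ Z_n = X_n + \gamma_{t_n} + \gamma_{y_n} + \gamma_{j_n} + \gamma_{l_n} + \gamma_{s_n} \]

Auxiliary function:

\[ Q(\bar{\lambda}, \lambda) = \sum_{n=1}^{N} \sum_{y_n=1}^{Y} p(y_n \mid Z_n, \lambda) \, \log p(Z_n, y_n \mid \bar{\lambda}) \]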
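To make the idea of a sequential (factor-by-factor) update concrete, here is a simplified sketch for the additive model. It assumes hard prosodic-state assignments instead of the posterior-weighted sums of the EM procedure, keeps only the tone and state factors, and uses hypothetical names and toy data. Under a Gaussian model with a shared variance, the maximum-likelihood update of one additive CF with the others held fixed is the mean of the residuals.

```python
from collections import defaultdict

def update_tone_cfs(samples, mu, state_cf):
    """Re-estimate additive tone CFs with all other parameters fixed.

    samples  : list of (z, tone, state) tuples, z being the observed duration
    mu       : mean of the normalized duration X_n
    state_cf : dict mapping prosodic state -> additive CF
    Returns a dict mapping tone -> updated additive CF.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for z, tone, state in samples:
        sums[tone] += z - mu - state_cf[state]  # residual left for the tone
        counts[tone] += 1
    return {tone: sums[tone] / counts[tone] for tone in sums}

# Toy data generated from known (hypothetical) CFs, without noise.
mu = 40.0
state_cf = {0: -5.0, 1: 5.0}
true_tone_cf = {1: 0.0, 5: -6.0}
samples = [(mu + true_tone_cf[t] + state_cf[s], t, s)
           for t in (1, 5) for s in (0, 1)]
print(update_tone_cfs(samples, mu, state_cf))
```

Cycling such updates over every factor in turn is one simple way to realize a sequential optimization; the soft-assignment version replaces the counts with posterior weights.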

Experimental Database (1/2)

MIC: a high-quality, read-style microphone speech database

• MIC-sent: 455 phonetically balanced sentential utterances
• MIC-para: 300 paragraphic utterances
• Training: 102,529 syllables
• Testing: 22,109 syllables
• 20 kHz sampling rate, downsampled to 8 kHz
• 1 frame = 5 ms

Experimental Database (2/2)

Data Set   Speaker    Sentence   Paragraph   Syllable
Training   Male A     1-455      1-200       34670
Training   Female B   1-455      1-50        12945
Training   Male C     1-455      1-100       20748
Training   Female D   1-455      1-200       34166
Testing    Female E   None       201-300     22109

Experimental Results (1/7)

            Training set           Testing set
            Mean      Variance     Mean      Variance
Syllable    44.31     180.17       41.08     136.26
            (42.34)   (2.52)       (44.77)   (4.44)
            “43.89”   “2.53”       “43.77”   “3.97”
Initial     17.21     62.28        13.83     40.02
            (16.63)   (0.74)       (18.36)   (5.92)
            “17.20”   “0.78”       “17.05”   “1.73”
Final       31.75     117.06       30.94     104.15
            (31.50)   (2.12)       (33.90)   (3.40)
            “31.44”   “1.84”       “31.38”   “2.85”

(units: mean in frames and variance in frames^2; 1 frame = 5 ms)

Plain values: observed durations
( ): normalized durations in the multiplicative model with 16 prosodic states
“ ”: normalized durations in the additive model with 16 prosodic states
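As a small worked calculation, the relative variance reduction achieved by the multiplicative model on the training set can be computed from the reported observed and normalized variances. The dictionary and function names are illustrative only.

```python
# Training-set variances (frames^2) from the results table.
observed = {"syllable": 180.17, "initial": 62.28, "final": 117.06}
multiplicative = {"syllable": 2.52, "initial": 0.74, "final": 2.12}

def reduction(before, after):
    """Relative variance reduction in percent."""
    return 100.0 * (1.0 - after / before)

for unit in observed:
    pct = reduction(observed[unit], multiplicative[unit])
    print(f"{unit}: {pct:.1f}% of the variance removed")
```

The same function can be applied to the additive-model and testing-set columns to compare the two models.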

Experimental Results (2/7)

Figure: Histograms of observed (left) and normalized (right) syllable duration in the multiplicative model for the training set. x-axis: duration (frame).

Experimental Results (3/7)

Analyses of CFs

CFs for tones

tone   1      2      3      4      5
CF     1.00   1.02   0.99   1.03   0.84

CFs for prosodic states (Mult.: multiplicative model; Add.: additive model)

         syllable          initial           final
state    Mult.   Add.      Mult.   Add.      Mult.   Add.
0        0.56    -16.07    0.30    -11.20    0.50    -14.28
1        0.72    -12.36    0.49    -6.82     0.68    -10.24
2        0.79    -9.69     0.63    -6.22     0.75    -7.94
3        0.84    -7.71     0.71    -4.98     0.80    -6.45
4        0.89    -5.79     0.80    -3.82     0.84    -5.15
5        0.91    -4.70     0.85    -3.60     0.87    -4.24
6        0.95    -3.14     0.86    -2.92     0.91    -2.99
7        0.98    -1.94     0.89    -2.49     0.95    -1.73
8        1.00    0.00      0.96    -1.40     0.98    -0.86
9        1.02    0.12      1.00    -0.41     1.00    0.00
10       1.05    1.69      1.04    0.00      1.02    0.73
11       1.09    4.10      1.09    0.89      1.08    3.12
12       1.14    5.87      1.12    1.39      1.14    5.10
13       1.22    9.65      1.19    3.56      1.24    8.50
14       1.33    15.08     1.30    6.03      1.40    13.42
15       1.69    28.74     1.61    12.69     1.86    25.49

用14* 百14 子9 蓮15* 、 蕾11 絲4 花15* 、 姬7 百11 合15* 、 龍13 膽15* 、 土5 耳9 其11 桔10 梗13* 和14* 蒜4 香1 藤12* 為4 材15* ， 以14* 維4 納6 斯13* 執8 壺2 的14* 石10 膏13* 花3 器14* 烘10 托15* ， 好4 一2 趟11* 春4 雨14* 濛3 濛10 的15* 郊9 外14* 田4 野9 風13 光15* 。

Examples of Prosodic State Labeling (each syllable is followed by its prosodic state; * denotes a word boundary)

Experimental Results (5/7)

Figure: Decision tree of base-syllable CFs for the syllable duration model. The number associated with a node is the mean of the CFs belonging to the cluster. A solid line indicates a positive answer; a dashed line indicates a negative answer. Node questions include: {b, d, g}?, single vowel, compound vowel, open vowel, {f, s, sh, shi, h}, {ts, ch, chi}.

Figure: Decision tree of base-syllable CFs for the initial duration model. Node questions include: null initial, {b, d, g}, {p, t, k}, {ts, ch, chi}, {f, s, sh, shi, h}, single vowel, with medial, vowel begins with {i}, vowel begins with {u}.

Figure: Decision tree of base-syllable CFs for the final duration model. Node questions include: null initial, single vowel, compound vowel, with medial, vowel begins with {i}, {m, n, l, r}, {b, d, g}, {ts, ch, chi}.

A Statistical Syllable Pitch Contour Model (1/7)

• Mandarin is a tonal language; tone information is carried by the pitch contour.
• Pitch contour patterns in continuous speech vary widely and can deviate dramatically from their canonical forms.
• An utterance's pitch contour is separated into a global-trend pitch mean model and a locally varying shape model.
• A quantitative description of the coarticulation effect is given.

A Statistical Syllable Pitch Contour Model (2/7)

Gaussian normalization

\[ f'(t) = \frac{f(t) - \mu_k}{\sigma_k} \, \sigma_{all} + \mu_{all} \]

• f(t): original pitch period of frame t
• \mu_k: mean of speaker k
• \sigma_k: standard deviation of speaker k
• f'(t): normalized pitch period of frame t
• \mu_{all}: averaged mean of all speakers
• \sigma_{all}: averaged standard deviation of all speakers
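The Gaussian normalization step can be sketched in a few lines. The function name and the sample values are hypothetical; the sketch only shows that, after the mapping, the speaker's frames take on the pooled mean and standard deviation.

```python
import statistics

def normalize_pitch(frames, mu_k, sigma_k, mu_all, sigma_all):
    """Map speaker k's pitch periods onto the pooled mean and spread."""
    return [(f - mu_k) / sigma_k * sigma_all + mu_all for f in frames]

# Hypothetical pitch periods (ms) of one speaker and hypothetical pooled statistics.
speaker_frames = [6.0, 7.0, 8.0]
mu_k = statistics.mean(speaker_frames)
sigma_k = statistics.pstdev(speaker_frames)
normalized_frames = normalize_pitch(speaker_frames, mu_k, sigma_k, 7.5, 1.0)
print([round(v, 3) for v in normalized_frames])
```

Because the mapping is affine, the shape of each contour is preserved while level and range differences between speakers are removed.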

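Discrete orthogonal polynomial basis functions (discrete Legendre polynomials), for 0 \le i \le M and M \ge 3:

\[ \phi_0\!\left(\tfrac{i}{M}\right) = 1 \]

\[ \phi_1\!\left(\tfrac{i}{M}\right) = \left[\frac{12M}{M+2}\right]^{1/2} \left[\frac{i}{M} - \frac{1}{2}\right] \]

\[ \phi_2\!\left(\tfrac{i}{M}\right) = \left[\frac{180M^3}{(M-1)(M+2)(M+3)}\right]^{1/2} \left[\left(\frac{i}{M}\right)^2 - \frac{i}{M} + \frac{M-1}{6M}\right] \]

\[ \phi_3\!\left(\tfrac{i}{M}\right) = \left[\frac{2800M^5}{(M-1)(M-2)(M+2)(M+3)(M+4)}\right]^{1/2} \left[\left(\frac{i}{M}\right)^3 - \frac{3}{2}\left(\frac{i}{M}\right)^2 + \frac{6M^2 - 3M + 2}{10M^2}\,\frac{i}{M} - \frac{(M-1)(M-2)}{20M^2}\right] \]

Parameterized pitch contour

\[ \hat{f}\!\left(\tfrac{i}{M}\right) = \sum_{j=0}^{3} a_j \, \phi_j\!\left(\tfrac{i}{M}\right), \quad 0 \le i \le M \]

\[ a_j = \frac{1}{M+1} \sum_{i=0}^{M} f\!\left(\tfrac{i}{M}\right) \phi_j\!\left(\tfrac{i}{M}\right) \]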
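The orthogonal expansion can be checked numerically with a short sketch. It builds the four discrete Legendre basis functions for a contour of M + 1 frames, projects a contour onto them, and reconstructs it. The function names and the sample contour are hypothetical.

```python
import math

def legendre_basis(M):
    """Return phi[j][i] for j = 0..3 and i = 0..M (requires M >= 3)."""
    c1 = math.sqrt(12.0 * M / (M + 2))
    c2 = math.sqrt(180.0 * M**3 / ((M - 1) * (M + 2) * (M + 3)))
    c3 = math.sqrt(2800.0 * M**5 /
                   ((M - 1) * (M - 2) * (M + 2) * (M + 3) * (M + 4)))
    phi = [[], [], [], []]
    for i in range(M + 1):
        x = i / M
        phi[0].append(1.0)
        phi[1].append(c1 * (x - 0.5))
        phi[2].append(c2 * (x * x - x + (M - 1) / (6.0 * M)))
        phi[3].append(c3 * (x**3 - 1.5 * x * x
                            + (6.0 * M * M - 3.0 * M + 2.0) / (10.0 * M * M) * x
                            - (M - 1) * (M - 2) / (20.0 * M * M)))
    return phi

def fit(contour):
    """Project a contour of M + 1 frames onto the basis: a_0 .. a_3."""
    M = len(contour) - 1
    phi = legendre_basis(M)
    return [sum(f * p for f, p in zip(contour, phi[j])) / (M + 1)
            for j in range(4)]

def reconstruct(coeffs, M):
    """Rebuild the smoothed contour from the four coefficients."""
    phi = legendre_basis(M)
    return [sum(coeffs[j] * phi[j][i] for j in range(4)) for i in range(M + 1)]

contour = [7.0, 6.8, 6.5, 6.3, 6.2, 6.3, 6.6]  # hypothetical pitch periods (ms)
a = fit(contour)
print([round(v, 4) for v in a])
```

Here a_0 is the mean of the contour and a_1 to a_3 describe its shape, which matches the split into a pitch mean model and a pitch shape model.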

Pitch mean modeling

\[ Z_n = (Y_n + \beta_{s_n}) \, \gamma_{s_n} \]

• Z_n: observed log-pitch mean
• \gamma_{s_n}: speaker's dynamic range change CF
• \beta_{s_n}: speaker's level shift CF
• Y_n: speaker-compensated log-pitch mean

\[ Y_n = X_n + \beta_{t_n} + \beta_{p_n} + \beta_{pt_n} + \beta_{i_n} + \beta_{ft_n} + \beta_{f_n} \]

• X_n: normalized log-pitch mean of the nth syllable
• \beta_r: affecting factor (CF) for factor r
• t_n: current lexical tone of the nth syllable
• p_n: prosodic state of the nth syllable
• pt_n: previous lexical tone of the nth syllable
• i_n: initial class of the nth syllable
• ft_n: following lexical tone of the nth syllable
• f_n: final class of the nth syllable

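Pitch shape modeling

\[ \mathbf{Z}_n = \mathbf{X}_n + \mathbf{b}_{tc_n} + \mathbf{b}_{q_n} + \mathbf{b}_{s_n} + \mathbf{b}_{i_n} + \mathbf{b}_{f_n} \]

• \mathbf{Z}_n = [a_1 \; a_2 \; a_3]^T: observed pitch shape vector of the nth syllable
• \mathbf{X}_n: normalized pitch shape vector of the nth syllable
• \mathbf{b}_r: CF vector for affecting factor r
• tc_n: lexical tone combination of the nth syllable; a pause shorter than 13 frames is treated as a tight coupling effect, and a pause of 13 frames or more as a loose one
• q_n: prosodic state of pitch shape
• s_n, i_n, f_n: speaker, initial class, and final class of the nth syllable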

Experimental Results (1/6)

Observed log-pitch (unit of pitch period: ms)

         training set              test set
         mean     (co)variance     mean     (co)variance
mean     1.949    0.0372           1.948    0.0345

Shape (co)variance (x 0.01), training set:

85.055    3.922    5.041
 3.922    9.176    0.601
 5.041    0.601    2.009

Shape (co)variance (x 0.01), test set:

94.984    3.356    4.700
 3.356   21.064    0.672
 4.700    0.672    4.653

Normalized log-pitch with 16 prosodic states

         training set                       test set
         mean     (co)variance   RMSE       mean     (co)variance   RMSE
mean     1.948    0.000402       0.0203     1.948    0.000344       0.0183

Shape (co)variance (x 0.01), training set:

9.568    0.453    0.670
0.453    1.709    0.232
0.670    0.232    1.152

Shape (co)variance (x 0.01), test set:

21.588    0.559    1.370
 0.559    3.101    0.808
 1.370    0.808    2.362

Figure: Histograms of observed (left) and normalized (right) log-pitch mean for the training set. x-axis: pitch mean.

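Examples of the Reconstructed Pitch Contours

Inside test: “在國人消費習慣改變，國民所得提高，信用貸款市場，成為潛力市場。” (With consumers' spending habits changing and national income rising, the credit loan market has become a market with potential.)

Figure: original and predicted pitch contours. y-axis: pitch period (ms).

Outside test: “在意國政經混亂中臨危受命的齊安培，未來在政經兩方面都有不少艱困任務待完成。” (Ciampi, who took office at a critical moment amid Italy's political and economic turmoil, still has many difficult political and economic tasks to complete.)

Figure: original and predicted pitch contours. y-axis: pitch period (ms).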

Figure: Influences of the 16 unified prosodic states. x-axis: prosodic state.

Analyses of the Inferred Model (1/13)

CFs of Current, Previous and Following Tones in Pitch Mean Model

tone              1        2        3        4        5
current (t)      -0.154    0.054    0.160   -0.035    0.128
previous (pt)    -0.022   -0.034    0.018    0.024    0.029
following (ft)    0.022   -0.003   -0.047    0.011    0.013

Figure: Comparison of a Tone 3 preceding another Tone 3 with canonical Tone 2 and Tone 3. Curve labels: 033, 133, 233, 333, 433, 533, 020, 030.

Figure: Comparison of a Tone 4 preceding another Tone 4 with canonical Tone 4. Curve labels: 044, 144, 244, 344, 444, 544, 040.

CFs of Initial/Final Classes in Pitch Mean Model

class      0        1        2        3        4        5        6
initial   -0.008    0.004    0.011   -0.013    0.003   -0.014    0.003
final      0.011   -0.001   -0.004    0.008   -0.005   -0.019    0.004

Initial classes: null initial, {b,d,g}, {f,s,sh,shi,h}, {m,n,l,r}, {ts,ch,chi}, {p,t,k}, {tz,j,ji}

Final classes: low vowels, middle vowels, high vowels, compound vowels, vowels with nasal ending, retroflexion, null vowels

CFs of Initial/Final Classes in Pitch Shape Model

class 0 1 2 3 4 5 6

(x 0.01)(x 0.01)

CFs of Speakers in Pitch Mean Model

speaker                     1(M)     2(F)     3(M)     4(F)
dynamic range change CF     1.014    0.971    1.026    0.981
level shift CF             -0.030    0.049   -0.044    0.041

CFs of Speakers in Pitch Shape Model

speakers 1(M) 2(F) 3(M) 4(F)

(x 0.01)(x 0.01)

CFs of Prosodic States in Pitch Mean Model

state   0        1        2        3        4        5        6        7
CF     -0.400   -0.225   -0.159   -0.113   -0.081   -0.047   -0.016    0.014

state   8        9        10       11       12       13       14       15
CF      0.039    0.073    0.102    0.130    0.161    0.196    0.265    0.348

CFs of Prosodic States in Pitch Shape Model

state 0 1 2 3 4 5 6 7

state 8 9 10 11 12 13 14 15

(x 0.01)(x 0.01)

Statistics of the Prosodic Labeling (rows: punctuation mark (PM) type; columns: break type)

PM \ Break            Non-boundary   Minor boundary   Major boundary
Non-PM                89.18%         9.80%            1.02%
Minor PM              57.73%         33.48%           8.80%
Secondary major PM    30.52%         44.65%           24.83%
Major PM              19.31%         31.66%           49.02%

Major PM = {，, 。, ！, ；, ？}, Secondary major PM = {、, ：}, Minor PM = {brace, bracket, dot}

The location after a syllable is labeled as:

• a major boundary if 10 ≤ … ≤ 15
• a minor boundary if 4 ≤ … ≤ 9
• a non-boundary otherwise

這位約翰霍普金斯大學名譽教授 *在第一屆國際 &性高潮會議中說 *，他對這一始於 &一九八Ｏ年代的性趨勢 &感到 ...這場比賽 *將於今日下午２時 &在 &台北 &市立棒球場舉行 *，黑鷹組織 &所屬 &三級棒球隊 *，包括台南六信 *、台東農工 &、屏東鶴聲國中 *、台東鹿野國中 &及台南善化國小等隊 *，將各著球隊服裝&到場加油 *，預計人數有近千人以上 *。黑鷹兩位教練 *黃永裕及&江泰權 *，對於此場比賽 *不敢掉以輕心 *，除了排出鑽石陣容外，也要親自上場 *。黑鷹所 ...商人非法囤積 &大量爆竹 *，萬一發生爆炸事件 *，不但會造成死傷慘劇 *，自己也可能成為 &受害最大 ...世界性的環保潮流 &，使人們日益重視環境汙染的問題 *；而觀光旅遊 & 這個﹁無煙囪工業 *﹂正好吻合此一 *健康訴求 * ，因此可預期& 今年將是遊樂區 ...

Examples of Possible Minor (&) and Major (*) Prosodic Phrase Boundaries

Conclusions

• The models are effective at isolating several main factors.
• The variance of the modeled duration/pitch is greatly reduced.
• The estimated companding factors (CFs) conform well to prior linguistic knowledge.
• The prosodic-state labels produced are linguistically meaningful.

Automatic Prosody Labeling based on LPM

Break types

• In this study, we define five break types, B0 to B4.
• B0: a tightly coupled syllable boundary, where the pitch contour across the syllable juncture may be connected and is strongly affected by the neighboring syllables.
• B1: a normal syllable boundary, which loosely couples two consecutive syllables and has no pitch reset.
• B2: a prosodic word boundary, which has a short pause or an irregular pitch reset.
• B3/B4: minor/major breaks with medium and long pauses, respectively. They are usually accompanied by medium or large pitch resets.

Break Labeling Algorithm

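\[ B^{*}, p^{*} = \arg\max_{B,\,p} P(B, p \mid x, Pau, L, t) \]
\[ \phantom{B^{*}, p^{*}} = \arg\max_{B,\,p} P(B, p, x, Pau \mid L, t) \]
\[ \phantom{B^{*}, p^{*}} = \arg\max_{B,\,p} P(x, Pau \mid p, B, L, t) \, P(p, B \mid L, t) \]

• B: break type
• p: prosodic state
• x: pitch contour
• Pau: pause duration
• L: high-level linguistic feature
• t: low-level linguistic feature (tone)
• P(x, Pau \mid p, B, L, t): acoustic-prosodic model
• P(p, B \mid L, t): linguistic-prosodic model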
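The decomposition into an acoustic-prosodic score and a linguistic-prosodic score can be illustrated with a toy search over candidate labels. This sketch is heavily simplified: it scores a single syllable juncture in isolation and ignores the sequence dependencies of the full model, and all names and log-probability values are hypothetical.

```python
import itertools

def best_label(break_types, states, log_acoustic, log_linguistic):
    """Return the (B, p) pair maximizing the sum of the two log scores.

    log_acoustic[(B, p)]   stands in for log P(x, Pau | p, B, L, t)
    log_linguistic[(B, p)] stands in for log P(p, B | L, t)
    """
    candidates = itertools.product(break_types, states)
    return max(candidates, key=lambda bp: log_acoustic[bp] + log_linguistic[bp])

break_types = ["B1", "B3"]
states = [0, 1]
# Hypothetical log scores for every (break type, prosodic state) pair.
log_acoustic = {("B1", 0): -1.0, ("B1", 1): -2.0, ("B3", 0): -3.0, ("B3", 1): -0.5}
log_linguistic = {("B1", 0): -1.0, ("B1", 1): -1.0, ("B3", 0): -0.2, ("B3", 1): -2.0}

print(best_label(break_types, states, log_acoustic, log_linguistic))
```

In this toy case the acoustically best pair is not the overall winner, which shows how the linguistic term can change the decision.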

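Acoustic-prosodic model (1/3)

\[ P(x, Pau \mid p, B, L, t) = P(x \mid p, B, L, t) \, P(Pau \mid p, B, L, t) \]
\[ \approx P(x \mid p, B, t) \, P(Pau \mid B, L) \]
\[ = \prod_{k=1}^{K} \prod_{n=1}^{N_k} P(x_{k,n} \mid p_{k,n}, B_{k,n-1}, B_{k,n}, t_{k,n-1}, t_{k,n}, t_{k,n+1}) \, P(Pau_{k,n} \mid B_{k,n}, L_{k,n}) \]

Here k = 1, ..., K indexes utterances and n = 1, ..., N_k indexes the syllables of utterance k.

• P(x_{k,n} \mid p_{k,n}, B_{k,n-1}, B_{k,n}, t_{k,n-1}, t_{k,n}, t_{k,n+1}): the syllable pitch contour model (Base LPM)
• P(Pau_{k,n} \mid B_{k,n}, L_{k,n}): the pause-break model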

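The syllable pitch contour model

\[ \mathbf{x}_{k,n} = \mathbf{y}_{k,n} + \mathbf{PT}_{t_{k,n}} + \mathbf{PP}_{p_{k,n}} + \mathbf{PC}^{f}_{B_{k,n-1},\,tp_{k,n-1}} + \mathbf{PC}^{b}_{B_{k,n},\,tp_{k,n}} + \boldsymbol{\mu} \]

\[ P(\mathbf{x}_{k,n} \mid p_{k,n}, B_{k,n-1}, B_{k,n}, t_{k,n-1}, t_{k,n}, t_{k,n+1}) = N\!\left(\mathbf{x}_{k,n};\ \mathbf{PT}_{t_{k,n}} + \mathbf{PP}_{p_{k,n}} + \mathbf{PC}^{f}_{B_{k,n-1},\,tp_{k,n-1}} + \mathbf{PC}^{b}_{B_{k,n},\,tp_{k,n}} + \boldsymbol{\mu},\ \mathbf{R}\right) \]

PT, PP, and PC denote the tone patterns, the prosodic state patterns, and the coarticulation patterns, respectively. The residual \mathbf{y}_{k,n} is thus modeled as zero-mean Gaussian with covariance \mathbf{R}.

The pause-break model

\[ P(Pau_{k,n} \mid B_{k,n}, L_{k,n}) = g\!\left(Pau_{k,n};\ \alpha_{B_{k,n},L_{k,n}},\ \beta_{B_{k,n},L_{k,n}}\right) \]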

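Linguistic-prosodic model

\[ P(p, B \mid L, t) \approx P(p, B \mid L) = P(p \mid B, L) \, P(B \mid L) \approx P(p \mid B) \, P(B \mid L) \]
\[ = \prod_{k=1}^{K} \left[ P(p_{k,1}) \prod_{n=2}^{N_k} P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1}) \right] \left[ \prod_{n=1}^{N_k} P(B_{k,n} \mid L_{k,n}) \right] \]

• P(p_{k,1}) \prod_{n} P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1}): prosodic state transition model
• \prod_{n} P(B_{k,n} \mid L_{k,n}): linguistic-break model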
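The factorization into a prosodic state transition model and a linguistic-break model can be sketched by scoring one labeled utterance. The table layouts, the feature labels "pm" and "intra", and all probabilities below are hypothetical and serve only to show how the terms combine.

```python
import math

def log_linguistic_prosodic(states, breaks, ling, init, trans, break_given_ling):
    """Log-probability of one utterance under the factorized model.

    states[n], breaks[n], ling[n] : prosodic state, break type after syllable n,
                                    and linguistic feature of syllable n
    init[p]                       : P(p_1 = p)
    trans[(p_prev, b_prev)][p]    : P(p_n = p | p_{n-1}, B_{n-1})
    break_given_ling[l][b]        : P(B_n = b | L_n = l)
    """
    logp = math.log(init[states[0]])
    for n in range(1, len(states)):
        logp += math.log(trans[(states[n - 1], breaks[n - 1])][states[n]])
    for b, l in zip(breaks, ling):
        logp += math.log(break_given_ling[l][b])
    return logp

# Hypothetical two-state, two-break toy tables.
init = {0: 0.5, 1: 0.5}
trans = {
    (0, "B1"): {0: 0.9, 1: 0.1}, (0, "B4"): {0: 0.2, 1: 0.8},
    (1, "B1"): {0: 0.3, 1: 0.7}, (1, "B4"): {0: 0.6, 1: 0.4},
}
break_given_ling = {"pm": {"B1": 0.2, "B4": 0.8}, "intra": {"B1": 0.9, "B4": 0.1}}

score = log_linguistic_prosodic([0, 1], ["B4", "B1"], ["pm", "intra"],
                                init, trans, break_given_ling)
print(round(score, 4))
```

Summing such scores over all utterances gives the linguistic-prosodic part of the training objective.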

Training of the Model

To estimate the parameters of the break labeling model, a sequential optimization procedure based on the maximum likelihood (ML) criterion is adopted. It first defines a likelihood function expressed by

\[ Q = \log \Bigg\{ \prod_{k=1}^{K} \prod_{n=1}^{N_k} P(x_{k,n} \mid p_{k,n}, B_{k,n-1}, B_{k,n}, t_{k,n-1}, t_{k,n}, t_{k,n+1}) \, P(Pau_{k,n} \mid B_{k,n}, L_{k,n}) \]
\[ \qquad \times \prod_{k=1}^{K} \left[ P(p_{k,1}) \prod_{n=2}^{N_k} P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1}) \prod_{n=1}^{N_k} P(B_{k,n} \mid L_{k,n}) \right] \Bigg\} \]

Initialization of Break Labeling

Figure: decision tree used to initialize the break labels. Node questions: Pause ≥ 300 ms; Pause ≥ 125 ms; Pause ≥ 75 ms; PM; Normalized pitch reset ≥ threshold; Pitch pause ≥ 30 ms (two nodes); Interword.

Experimental Database

• The performance of the proposed pitch modeling method was evaluated on a Mandarin speech database.
• The database contains read speech of a single female professional announcer.
• Its texts are short paragraphs composed of several sentences selected from the Sinica Tree-Bank Corpus.
• The database consists of 380 utterances with 52,192 syllables.
• Sampling rate: 16 kHz
• All segmentations and F0 values were manually corrected.

Experimental Results

The learning curve

Covariance matrices of observed and normalized feature vectors (each matrix carries a scale factor of x 10…)

Observed:

932.3    0       0       0
0        89.9    0       0
0        0       17.8    0
0        0       0       5.0

Normalized:

9.0      0       0       0
0        31.9    0       0
0        0       11.1    0
0        0       0       3.8

Experimental Results-syllable pitch contour model(1/12)

The learned pitch contour of 5 tones

Prosodic state patterns

Coarticulation patterns

Experimental Results-Pause-break model (1/2)

Pause-break model

Break types: B0, B1, B2, B3, B4

Pause duration mean in …: 0.002, 0.009, 0.035, 0.206

Experimental Results-Pause-break model (2/2)

Figure: P(Pau_{k,n} \mid B_{k,n} = 4, L_{k,n})

Experimental Results-length of prosodic units (1/3)

Histogram of length of prosodic group

Histogram of length of prosodic phrase

Histogram of length of word

Count of break indices

Count of prosodic state

Prob. of prosodic state after B3

Prob. of prosodic state before B3

Prob. of prosodic state after B4

Prob. of prosodic state before B4

Experimental Results-prosodic state transition model(1/5)

Figure: P(p_{k,n} \mid p_{k,n-1}, B_{k,n-1} = 4)