NAIST-IS-DT0161027
Doctoral Thesis
High-Quality and Flexible Speech Synthesis
with Segment Selection and Voice Conversion
Tomoki Toda
March 24, 2003
Department of Information Processing
Graduate School of Information Science
Nara Institute of Science and Technology
Doctoral Thesis
submitted to Graduate School of Information Science,
Nara Institute of Science and Technology
in partial fulfillment of the requirements for the degree of
DOCTOR of ENGINEERING
Tomoki Toda
Thesis committee: Kiyohiro Shikano, Professor
Yuji Matsumoto, Professor
Nick Campbell, Professor
Hiroshi Saruwatari, Associate Professor
High-Quality and Flexible Speech Synthesis
with Segment Selection and Voice Conversion∗
Tomoki Toda
Abstract
Text-to-Speech (TTS) is a useful technology that converts any text into a
speech signal. It can be utilized for various purposes, e.g. car navigation, an-
nouncements in railway stations, response services in telecommunications, and
e-mail reading. Corpus-based TTS makes it possible to dramatically improve
the naturalness of synthetic speech compared with the early TTS. However, no
general-purpose TTS has been developed that can consistently synthesize suffi-
ciently natural speech. Furthermore, there is not yet enough flexibility in corpus-
based TTS.
This thesis addresses two problems in speech synthesis. One is how to improve
the naturalness of synthetic speech in corpus-based TTS. The other is how to
improve control of speaker individuality in order to achieve more flexible speech
synthesis. To deal with the former problem, we focus on two factors: (1) an
algorithm for selecting the most appropriate synthesis units from a speech corpus,
and (2) an evaluation measure for selecting the synthesis units. Moreover, we
focus on a voice conversion technique to control speaker individuality to deal
with the latter problem.
Since various vowel sequences appear frequently in Japanese, it is not re-
alistic to prepare long units that include all possible vowel sequences to avoid
vowel-to-vowel concatenation, which often produces auditory discontinuity. In
order to address this problem, we propose a novel segment selection algorithm
based on both phoneme and diphone units that does not avoid concatenation of
∗Doctoral Thesis, Department of Information Processing, Graduate School of Information
Science, Nara Institute of Science and Technology, NAIST-IS-DT0161027, March 24, 2003.
vowel sequences but alleviates the resulting discontinuity. Experiments testing
concatenation of vowel sequences clarify that better segments can be selected by
considering concatenations not only at phoneme boundaries but also at vowel
centers. Moreover, the results of perceptual experiments show that speech syn-
thesized using the proposed algorithm has better naturalness than that using the
conventional algorithms.
A cost is established as a measure for selecting the optimum waveform seg-
ments from a speech corpus. In order to achieve high-quality segment selection for
concatenative TTS, it is important to utilize a cost that corresponds to perceptual
characteristics. We first clarify the correspondence of the cost to the perceptual
scores and then evaluate various functions to integrate local costs capturing the
degradation of naturalness in individual segments. From the results of perceptual
experiments, we find a novel cost that takes into account not only the degrada-
tion of naturalness over the entire synthetic speech but also the local degradation.
We also clarify that the naturalness of synthetic speech can be slightly improved
by utilizing this cost and investigate the effect of using this cost for segment
selection.
We improve the voice conversion algorithm based on the Gaussian Mixture
Model (GMM), which is a conventional statistical voice conversion algorithm.
The GMM-based algorithm can convert speech features continuously using the
correlations between source and target features. However, the quality of the con-
verted speech is degraded because the converted spectrum is excessively smoothed
by the statistical averaging operation. To overcome this problem, we propose a
novel voice conversion algorithm that incorporates Dynamic Frequency Warping
(DFW) technique. The experimental results reveal that the proposed algorithm
can synthesize speech with higher quality while maintaining conversion accuracy
for speaker individuality equal to that of the GMM-based algorithm.
Keywords:
Text-to-Speech, naturalness, speaker individuality, segment selection, synthesis
unit, measure for selection, voice conversion
Acknowledgments
I would like to express my deepest appreciation to Professor Kiyohiro Shikano
of Nara Institute of Science and Technology, my thesis advisor, for his constant
guidance and encouragement through my master’s course and doctoral course.
I would also like to express my gratitude to Professor Yuji Matsumoto, Profes-
sor Nick Campbell, and Associate Professor Hiroshi Saruwatari, of Nara Institute
of Science and Technology, for their invaluable comments to the thesis.
I would sincerely like to thank Dr. Nobuyoshi Fugono, President of ATR, and
Dr. Seiichi Yamamoto, Director of ATR Spoken Language Translation Research
Laboratories, for giving me the opportunity to work for ATR Spoken Language
Translation Research Laboratories as an Intern Researcher.
I would especially like to express my sincere gratitude to Dr. Hisashi Kawai,
Supervisor of ATR Spoken Language Translation Research Laboratories, for his
continuous support and valuable advice through the doctoral course. The core
of this work originated with his pioneering ideas in speech synthesis, which led
me to a new research idea. This work could not have been accomplished without
his direction. I learned many lessons from his attitude toward research. I have
always been happy to carry out research with him.
I would like to thank Assistant Professor Hiromichi Kawanami and Assis-
tant Professor Akinobu Lee of Nara Institute of Science and Technology, for
their beneficial comments. I would also like to thank former Associate Professor
Satoshi Nakamura, who is currently Head of Department 1 at ATR Spoken Lan-
guage Translation Research Laboratories, and former Assistant Professor Jinlin
Lu, who is currently an Associate Professor at Aichi Prefectural University, for
their helpful discussions. I want to thank all members of the Speech and Acoustics
Laboratory and Applied Linguistics Laboratory in Nara Institute of Science and
Technology for providing fruitful discussions. I would especially like to thank Dr.
Toshio Hirai, Senior Researcher of Arcadia Inc., for providing thoughtful advice
and discussions on speech synthesis techniques. Also, I owe a great deal to Ryuichi
Nishimura, doctoral candidate of Nara Institute of Science and Technology, for
his support in the laboratories.
I greatly appreciate Dr. Hideki Tanaka, the Head of Department 4 at ATR
Spoken Language Translation Research Laboratories, for his encouragement. I
would sincerely like to thank Minoru Tsuzaki, Senior Researcher, and Dr. Jinfu
Ni, a Researcher, of ATR Spoken Language Translation Research Laboratories,
for providing lively and fruitful discussions about speech synthesis. I would also
like to thank my many other colleagues at ATR Spoken Language Translation
Research Laboratories.
I am indebted to many researchers and professors. I would especially like
to express my gratitude to Dr. Masanobu Abe, Associate Manager, Senior Re-
search Engineer of NTT Cyber Space Laboratories, Professor Hideki Kawahara
of Wakayama University, Associate Professor Keiichi Tokuda of Nagoya Institute
of Technology, and Professor Yoshinori Sagisaka of Waseda University, for their
valuable advice and discussions. I would also like to express my gratitude to
Professor Fumitada Itakura, Associate Professor Kazuya Takeda, Associate Pro-
fessor Syoji Kajita, and Research Associate Hideki Banno, of Nagoya University,
and Associate Professor Mikio Ikeda of Yokkaichi University, for their support,
guidance, and having recommended that I enter Nara Institute of Science and
Technology.
Finally, I would like to acknowledge my family and friends for their support.
Contents
Abstract ii
Japanese Abstract iv
Acknowledgments v
List of Figures xiv
List of Tables xv
1 Introduction 1
1.1 Background and Problem Definition . . . . . . . . . . . . . . . . . 1
1.2 Thesis Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Improvement of naturalness of synthetic speech . . . . . . 2
1.2.2 Improvement of control of speaker individuality . . . . . . 4
1.3 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Corpus-Based Text-to-Speech and Voice Conversion 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Structure of Corpus-Based TTS . . . . . . . . . . . . . . . . . . . 8
2.2.1 Text analysis . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Prosody generation . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Unit selection . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.4 Waveform synthesis . . . . . . . . . . . . . . . . . . . . . . 15
2.2.5 Speech corpus . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Statistical Voice Conversion Algorithm . . . . . . . . . . . . . . . 19
2.3.1 Conversion algorithm based on Vector Quantization . . . . 20
2.3.2 Conversion algorithm based on Gaussian Mixture Model . 22
2.3.3 Comparison of mapping functions . . . . . . . . . . . . . . 23
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 A Segment Selection Algorithm for Japanese Speech Synthesis
Based on Both Phoneme and Diphone Units 26
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Cost Function for Segment Selection . . . . . . . . . . . . . . . . 28
3.2.1 Local cost . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Sub-cost on prosody: $C_{pro}$ . . . . . . . . . . . . . . . . . . 30
3.2.3 Sub-cost on F0 discontinuity: $C_{F0}$ . . . . . . . . . . . . . 31
3.2.4 Sub-cost on phonetic environment: $C_{env}$ . . . . . . . . . . 32
3.2.5 Sub-cost on spectral discontinuity: $C_{spec}$ . . . . . . . . . 32
3.2.6 Sub-cost on phonetic appropriateness: $C_{app}$ . . . . . . . . . 33
3.2.7 Integrated cost . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Concatenation at Vowel Center . . . . . . . . . . . . . . . . . . . 35
3.3.1 Experimental conditions . . . . . . . . . . . . . . . . . . . 36
3.3.2 Experiment allowing substitution of phonetic environment 37
3.3.3 Experiment prohibiting substitution of phonetic environment 38
3.4 Segment Selection Algorithm Based on Both Phoneme and Di-
phone Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.1 Conventional algorithm . . . . . . . . . . . . . . . . . . . . 39
3.4.2 Proposed algorithm . . . . . . . . . . . . . . . . . . . . . . 41
3.4.3 Comparison with segment selection based on half-phoneme
units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 46
3.5.1 Experimental conditions . . . . . . . . . . . . . . . . . . . 46
3.5.2 Experimental results . . . . . . . . . . . . . . . . . . . . . 47
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 An Evaluation of Cost Capturing Both Total and Local Degra-
dation of Naturalness for Segment Selection 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Various Integrated Costs . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Perceptual Evaluation of Cost . . . . . . . . . . . . . . . . . . . . 52
4.3.1 Correspondence of cost to perceptual score . . . . . . . . . 52
4.3.2 Preference test on naturalness of synthetic speech . . . . . 57
4.3.3 Correspondence of RMS cost to perceptual score in lower
range of RMS cost . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Segment Selection Considering Both Total Degradation of Natu-
ralness and Local Degradation . . . . . . . . . . . . . . . . . . . . 62
4.4.1 Effect of RMS cost on various costs . . . . . . . . . . . . . 63
4.4.2 Effect of RMS cost on selected segments . . . . . . . . . . 66
4.4.3 Relationship between effectiveness of RMS cost and corpus
size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.4 Evaluation of segment selection by estimated perceptual score 70
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 A Voice Conversion Algorithm Based on Gaussian Mixture Model
with Dynamic Frequency Warping 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 GMM-Based Conversion Algorithm Applied to STRAIGHT . . . 75
5.2.1 Evaluation of spectral conversion-accuracy of GMM-based
conversion algorithm . . . . . . . . . . . . . . . . . . . . . 77
5.2.2 Shortcomings of GMM-based conversion algorithm . . . . 77
5.3 Voice Conversion Algorithm Based on Gaussian Mixture Model
with Dynamic Frequency Warping . . . . . . . . . . . . . . . . . . 79
5.3.1 Dynamic Frequency Warping . . . . . . . . . . . . . . . . 79
5.3.2 Mixing of converted spectra . . . . . . . . . . . . . . . . . 80
5.4 Effectiveness of Mixing Converted Spectra . . . . . . . . . . . . . 82
5.4.1 Effect of mixing-weight on spectral conversion-accuracy . . 83
5.4.2 Preference tests on speaker individuality . . . . . . . . . . 84
5.4.3 Preference tests on speech quality . . . . . . . . . . . . . . 85
5.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.1 Subjective evaluation of speaker individuality . . . . . . . 86
5.5.2 Subjective evaluation of speech quality . . . . . . . . . . . 89
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6 Conclusions 91
6.1 Summary of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Appendix 96
A Frequency of Vowel Sequences . . . . . . . . . . . . . . . . . . . . 96
B Definition of the Nonlinear Function P . . . . . . . . . . . . . . . 97
C Sub-Cost Functions, Ss and Sp, on Mismatch of Phonetic Environ-
ment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References 100
List of Publications 113
List of Figures
1.1 Problems addressed in this thesis. . . . . . . . . . . . . . . . . . . 3
2.1 Structure of corpus-based TTS. . . . . . . . . . . . . . . . . . . . 8
2.2 Schematic diagram of text analysis. . . . . . . . . . . . . . . . . . 9
2.3 Schematic diagram of HMM-based prosody generation. . . . . . . 12
2.4 Schematic diagram of segment selection. . . . . . . . . . . . . . . 15
2.5 Schematic diagram of waveform concatenation. . . . . . . . . . . . 18
2.6 Schematic diagram of speech synthesis with prosody modification
by STRAIGHT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 Various mapping functions. The contour line denotes frequency
distribution of training data in the joint feature space. “x” de-
notes the conditional expectation E[y|x] calculated in each value
of original feature x. . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1 Schematic diagram of cost function. . . . . . . . . . . . . . . . . . 29
3.2 Targets and segments used to calculate each sub-cost in calculation
of the cost of a candidate segment $u_i$ for a target $t_i$. $t_i$ and $u_i$ show
phonemes considered target and candidate segments, respectively. 31
3.3 Schematic diagram of function to integrate local costs LC. . . . . 34
3.4 Spectrograms of vowel sequences concatenated at (a) a vowel bound-
ary and (b) a vowel center. . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Statistical characteristics of static feature and dynamic feature of
spectrum in vowels. “Normalized time” shows the time normalized
from 0 (preceding phoneme boundary) to 1 (succeeding phoneme
boundary) in each vowel segment. . . . . . . . . . . . . . . . . . . 37
3.6 Concatenation methods at a vowel boundary and a vowel center.
“V*” shows all vowels. “Vfh” and “Vlh” show the first half-vowel
and the last half-vowel, respectively. . . . . . . . . . . . . . . . . . 38
3.7 Frequency distribution of distortion caused by concatenation be-
tween vowels in the case of allowing substitution of phonetic envi-
ronment. “S.D. ” shows standard deviation. . . . . . . . . . . . . 40
3.8 Frequency distribution of distortion caused by concatenation be-
tween vowels that have the same phonetic environment. . . . . . . 40
3.9 Example of segment selection based on phoneme units. The input
sentence is “tsuiyas” (“spend” in English). Concatenation at C-V
boundaries is prohibited. . . . . . . . . . . . . . . . . . . . . . . . 41
3.10 Targets and segments used to calculate each sub-cost in calculation
of the cost of candidate segments $u_i^f$, $u_i^l$ for a target $t_i$. . . . . . . 43
3.11 Example of segment selection based on phoneme units and diphone
units. Concatenation at C-V boundaries and selection of isolated
half-vowels are prohibited. . . . . . . . . . . . . . . . . . . . . . . 44
3.12 Example of segment selection based on half-phoneme units. . . . . 45
3.13 Results of comparison with the segment selection based on phoneme
units (“Exp. A”) and those of comparison with the segment selec-
tion allowing only concatenation at vowel center in V-V, V-S, and
V-N sequences (“Exp. B”). . . . . . . . . . . . . . . . . . . . . . . 48
4.1 Distribution of average cost and maximum cost for all synthetic
utterances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Scatter chart of selected test stimuli. . . . . . . . . . . . . . . . . 53
4.3 Correlation coefficient between norm cost and perceptual score as
a function of power coefficient, p. . . . . . . . . . . . . . . . . . . 54
4.4 Correlation between average cost and perceptual score. . . . . . . 55
4.5 Correlation between maximum cost and perceptual score. . . . . . 55
4.6 Correlation between RMS cost and perceptual score. The RMS
cost can be converted into a perceptual score by utilizing the re-
gression line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.7 Correlation coefficient between RMS cost and normalized opinion
score for each listener. . . . . . . . . . . . . . . . . . . . . . . . . 58
4.8 Best correlation between RMS cost and normalized opinion score
(left figure) and worst correlation between RMS cost and normal-
ized opinion score (right figure) in results of all listeners. . . . . . 58
4.9 Examples of local costs of segment sequences selected by the av-
erage costs and by the RMS cost. “Av.” and “RMS” show the
average and the root mean square of local costs, respectively. . . . 59
4.10 Scatter chart of selected test stimuli. Each dot denotes a stimulus
pair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.11 Preference score. . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.12 Correlation between RMS cost and perceptual score in lower range
of RMS cost. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.13 Local costs as a function of corpus size. Mean and standard devi-
ation are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.14 Target cost as a function of corpus size. . . . . . . . . . . . . . . . 65
4.15 Concatenation cost as a function of corpus size. . . . . . . . . . . 65
4.16 Segment length in number of phonemes as a function of corpus size. 68
4.17 Segment length in number of syllables as a function of corpus size. 68
4.18 Increase rate in the number of concatenations as a function of
corpus size. “*” denotes any phoneme. . . . . . . . . . . . . . . . 69
4.19 Concatenation cost in each type of concatenation. The corpus size
is 32 hours. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.20 Differences in costs as a function of corpus size. . . . . . . . . . . 70
4.21 Estimated perceptual score as a function of corpus size. . . . . . . 71
5.1 Mel-cepstral distortion. Mean and standard deviation are shown. . 78
5.2 Example of spectrum converted by GMM-based voice conversion
algorithm (“GMM-converted spectrum”) and target speaker’s spec-
trum (“Target spectrum”). . . . . . . . . . . . . . . . . . . . . . . 78
5.3 GMM-based voice conversion algorithm with Dynamic Frequency
Warping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Example of frequency warping function. . . . . . . . . . . . . . . . 81
5.5 Variations of mixing-weights that correspond to the different pa-
rameters a. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Example of converted spectra by the GMM-based algorithm (“GMM”),
the proposed algorithm without the mix of the converted spectra
(“GMM & DFW”), and the proposed algorithm with the mix of
the converted spectra (“GMM & DFW & Mix of spectra”). . . . . 83
5.7 Mel-cepstral distortion as a function of parameter a of mixing-
weight. “Original speech of source” shows the mel-cepstral distor-
tion before conversion. . . . . . . . . . . . . . . . . . . . . . . . . 84
5.8 Relationship between conversion-accuracy for speaker individuality
and parameter a of mixing-weight. A preference score of 50% shows
that the conversion-accuracy is equal to that of the GMM-based
algorithm, which provides good performance in terms of speaker
individuality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.9 Relationship between converted speech quality and parameter a of
mixing-weight. A preference score of 50% shows that the converted
speech quality is equal to that of the GMM-based algorithm with
DFW, which provides good performance in terms of speech quality. 87
5.10 Correct response for speaker individuality. . . . . . . . . . . . . . 88
5.11 Mean Opinion Score (“MOS”) for speech quality. . . . . . . . . . 89
B.1 Nonlinear function P for sub-cost on prosody. . . . . . . . . . . . 98
List of Tables
3.1 Sub-cost functions . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Number of concatenations in experiment comparing proposed al-
gorithm with segment selection based on phoneme units. “S” and
“N” show semivowel and nasal. “Center” shows concatenation at
vowel center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Number of concatenations in experiment comparing proposed algo-
rithm with segment selection allowing only concatenation at vowel
center in V-V, V-S, and V-N sequences . . . . . . . . . . . . . . . 47
A.1 Frequency of vowel sequences . . . . . . . . . . . . . . . . . . . . 96
Chapter 1
Introduction
1.1 Background and Problem Definition
Speech is the ordinary way for most people to communicate. Moreover, speech
can convey other information such as emotion, attitude, and speaker individuality.
Therefore, it is said that speech is the most natural, convenient, and useful means
of communication.
In recent years, computers have come into common use as computer technology
has advanced. Therefore, it is important to realize a man-machine interface
to facilitate communication between people and computers. Naturally, speech
has attracted attention as a medium for such communication. In general, two technologies for
processing speech are needed. One is speech recognition, and the other is speech
synthesis. Speech recognition is a technique for information input. Necessary in-
formation, e.g. message information, is extracted from input speech that includes
diverse information. Thus, it is important to find a method to extract only useful
information. On the other hand, speech synthesis is a technique for informa-
tion output. This procedure is the reverse of speech recognition. Output speech
includes various types of information, e.g. sound information and prosodic infor-
mation, and is generated from input information. Moreover, other information
such as speaker individuality and emotion is needed in order to realize smoother
communication. Thus, it is important to find a method to generate the various
types of paralinguistic information that are not processed in speech recognition.
Text-to-Speech (TTS) is one of the speech synthesis technologies. TTS is
a technique to convert any text into a speech signal [67], and it is very useful
in many practical applications, e.g. car navigation, announcements in railway
stations, response services in telecommunications, and e-mail reading. Therefore,
it is desirable to realize TTS that can synthesize natural and intelligible speech,
and research and development on TTS has been progressing.
The current trend in TTS is based on a large amount of speech data and sta-
tistical processing. This type of TTS is generally called corpus-based TTS. This
approach makes it possible to dramatically improve the naturalness of synthetic
speech compared with the early TTS. Corpus-based TTS can be used for prac-
tical purposes under limited conditions [15]. However, no general-purpose TTS
has been developed that can synthesize sufficiently natural speech consistently for
any input text. Furthermore, there is not yet enough flexibility in corpus-based
TTS. In general, corpus-based TTS can synthesize only speech having the specific
style included in a speech corpus. Therefore, in order to synthesize other types
of speech, e.g. speech of various speakers, emotional speech, and other speak-
ing styles, various speech samples need to be recorded in advance. Moreover,
large-sized speech corpora are needed to synthesize speech with sufficient natu-
ralness. Speech recording is hard work, and it requires an enormous amount of
time and expense. Therefore, it is necessary to improve the performance
of corpus-based TTS.
1.2 Thesis Scope
This thesis addresses two problems in speech synthesis shown in Figure 1.1.
One is how to improve the naturalness of synthetic speech in corpus-based TTS.
The other is how to improve control of speaker individuality in order to achieve
more flexible speech synthesis.
1.2.1 Improvement of naturalness of synthetic speech
In corpus-based TTS, three main factors determine the naturalness of synthetic
speech: (1) a speech corpus, (2) an algorithm for selecting the most appropriate
synthesis units from the speech corpus, and (3) an evaluation measure to select
the synthesis units. We focus on the latter two factors.
[Figure: a plot of naturalness versus flexibility. The current level of corpus-based TTS synthesizes a specific speaker's speech from a large amount of speech data; the arrows indicate the two goals of this thesis, improvement of the naturalness of synthetic speech and improvement of control of speaker individuality, toward synthesis of various speakers' speech from a small amount of speech data.]
Figure 1.1. Problems addressed in this thesis.
In a speech synthesis procedure, the optimum set of waveform segments, i.e.
portions of speech utterances included in the corpus, are selected, and the syn-
thetic speech is generated by concatenating the selected waveform segments. This
selection is performed based on synthesis units. Various units, e.g. phonemes,
diphones, and syllables, have been proposed. In Japanese speech synthesis,
syllable units are often used since the number of Japanese syllables is small and
transition in the syllables is important for intelligibility. However, syllable units
cannot avoid vowel-to-vowel concatenation, which often produces auditory dis-
continuity, because various vowel sequences appear frequently in Japanese. In
order to alleviate this discontinuity, we propose a novel selection algorithm based
on two synthesis unit definitions.
Moreover, in order to realize high and consistent quality of synthetic speech,
it is important to use an evaluation measure that corresponds to perceptual char-
acteristics in the selection of the most suitable waveform segments. Although a
measure based on acoustic measures is often used, the correspondence of such
a measure to the perceptual characteristics is indistinct. Therefore, we clarify
the correspondence of the measure utilized in our TTS by performing perceptual
experiments on the naturalness of synthetic speech. Moreover, we improve this
measure based on the results of these experiments.
1.2.2 Improvement of control of speaker individuality
We focus on a voice conversion technique to control speaker individuality. In this
technique, conversion rules between two speakers are extracted in advance using
a small amount of training speech data. Once training has been performed, any
utterance of one speaker can be converted to sound like that of another speaker.
Therefore, we can easily synthesize speech of various speakers from only a small
amount of speech data of the speakers by using the voice conversion technique.
However, the performance of conventional voice conversion techniques is in-
adequate. The training of the conversion rules is performed based on statisti-
cal methods. Although accurate conversion rules can be extracted from a small
amount of training data, important information influencing speech quality is lost.
In order to avoid quality degradation caused by losing the information, we in-
troduce Dynamic Frequency Warping (DFW) technique into the statistical voice
conversion. From the results of perceptual experiments, we show that the pro-
posed voice conversion algorithm can synthesize converted speech more naturally
while maintaining conversion accuracy for speaker individuality equal to that of
a conventional voice conversion algorithm.
1.3 Thesis Overview
The thesis is organized as follows.
In Chapter 2, a corpus-based TTS system and conventional voice conversion
techniques are described. We describe the basic structure of the corpus-based
TTS system. Then some techniques in each module are reviewed, and we briefly
introduce the techniques applied to the TTS system under development in ATR
Spoken Language Translation Research Laboratories. Moreover, some conven-
tional voice conversion algorithms are reviewed and conversion functions of the
algorithms are compared.
In Chapter 3, we propose a novel segment selection algorithm for Japanese
speech synthesis. Not only the segment selection algorithms but also our measure
for selection of optimum segments are described. Results of perceptual experi-
ments show that the proposed algorithm can synthesize speech more naturally
than conventional algorithms.
In Chapter 4, the measure is evaluated based on perceptual characteristics.
We clarify correspondence of the measure to the perceptual scores determined
from the results of perceptual experiments. Moreover, we find a novel measure
having better correspondence and investigate the effect of using this measure for
segment selection. We also show the effectiveness of increasing the size of a speech
corpus.
In Chapter 5, control of speaker individuality by voice conversion is de-
scribed. We propose a novel voice conversion algorithm and perform an exper-
imental evaluation on it. The results of experiments show that the proposed
algorithm has better performance compared with a conventional algorithm.
In Chapter 6, we summarize the contributions of this thesis and offer sug-
gestions for future work.
Chapter 2
Corpus-Based Text-to-Speech
and Voice Conversion
Corpus-based TTS is the mainstream of current work on TTS. The naturalness
of synthetic speech has been improved dramatically by the transition from the
early rule-based TTS to corpus-based TTS. In this chapter, we describe the basic
structure of corpus-based TTS and the various techniques used in each module.
Moreover, we review conventional voice conversion algorithms that are useful for
flexibly synthesizing speech of various speakers, and then we compare various
conversion functions.
2.1 Introduction
The early TTS was constructed based on rules that researchers determined from
their objective decisions and experience [67]. In general, this type of TTS is called
rule-based TTS. The researcher extracts the rules for speech production by the
Analysis-by-Synthesis (A-b-S) method [6]. In the A-b-S method, parameters
characterizing a speech production model are adjusted by performing iterative
feedback control so that the error between the observed value and that produced
by the model is minimized. Such rule determination needs professional expertise
since it is difficult to extract consistent and reasonable rules. Therefore, the rule-
based TTS systems developed by researchers usually have different performances.
Moreover, synthetic speech by rule-based TTS has an unnatural quality because a
speech waveform is generated by a speech production model, e.g. terminal analog
speech synthesizer, which generally needs some approximations in order to model
the complex human vocal mechanism [67].
On the other hand, the current TTS is constructed based on a large amount
of data and a statistical process [43][89]. In general, this type of TTS is called
corpus-based TTS in contrast with rule-based TTS. This approach has been devel-
oped through the dramatic improvements in computer performance. In corpus-
based TTS, a large amount of speech data are stored as a speech corpus. In
synthesis, optimum speech units are selected from the speech corpus. An out-
put speech waveform is synthesized by concatenating the selected units and then
modifying their prosody. Corpus-based TTS can synthesize speech more naturally
than rule-based TTS because the degradation of naturalness in synthetic speech
can be alleviated by selecting units while considering factors such as a mismatch
of phonetic environments, differences in prosodic information, and discontinuity
produced by concatenating units. If the selected units need little modification,
natural speech can be synthesized by concatenating speech waveform segments
directly. Furthermore, since the corpus-based approach has hardly any depen-
dency on the type of language, we can apply the approach to other languages
more easily than the rule-based approach.
If a large-sized speech corpus of a certain speaker can be used, corpus-based
TTS can synthesize high-quality and intelligible speech of that speaker. However,
not only quality and intelligibility but also speaker individuality are important
for smooth and full communication. Therefore, it is important to synthesize the
speech of various speakers as well as the speech of a specific speaker. One of
approaches for flexibly synthesizing speech of various speakers is speech modifi-
cation by a voice conversion technique used to convert one speaker’s voice into
another speaker’s voice [68].
In voice conversion, it is important to extract accurate conversion rules from
a small amount of training data. This problem is associated with a mapping
between features. In general, methods for extracting conversion rules are based
on statistical processing, as is often the case in speaker adaptation for speech
recognition.
This chapter is organized as follows. In Section 2.2, we describe the basic
[Figure: block diagram of corpus-based TTS. Input text passes through text analysis, prosody generation, unit selection, and waveform synthesis to produce synthetic speech, drawing on a speech corpus; the intermediate data are contextual information (pronunciation, accent, ...), prosodic information (F0, duration, power, ...), and unit information.]
Figure 2.1. Structure of corpus-based TTS.
structure of corpus-based TTS and review various techniques in each module. In
Section 2.3, conventional voice conversion algorithms and comparison of map-
ping functions of the algorithms are described. Finally, we summarize this chapter
in Section 2.4.
2.2 Structure of Corpus-Based TTS
In general, corpus-based TTS is comprised of five modules: text analysis, prosody
generation, unit selection, waveform synthesis, and speech corpus. The structure
of corpus-based TTS is shown in Figure 2.1.
2.2.1 Text analysis
In the text analysis, an input text is converted into contextual information, i.e.
pronunciation, accent type, part-of-speech, and so on, by natural language pro-
cessing [91][96]. The contextual information plays an important role in the quality
and intelligibility of synthetic speech because prediction accuracy on this infor-
mation affects all of the subsequent procedures.
[Figure: flow of text analysis. Input text → text normalization → morphological analysis → syntactic analysis → phoneme generation → accent generation → contextual information.]
Figure 2.2. Schematic diagram of text analysis.
First, various obstacles, such as unreadable marks like HTML tags and e-mail
headings, are removed if the input text includes these obstacles. This processing
is called text normalization.
The normalized text is then divided into morphemes, which are minimum
units of letter strings having linguistic meaning. These morphemes are tagged
with their parts of speech, and a syntactic analysis is performed. Then, the mod-
ule determines phoneme and prosodic symbols, e.g. accent nucleus, accentual
phrases, boundaries of prosodic clauses, and syntactic structure. Reading rules
and accentual rules for word concatenation are often applied to the determina-
tion of this information [77][88]. Especially in Japanese, the accent information
is crucial to achieving high-quality synthetic speech. In some TTS systems, es-
pecially English TTS systems, ToBI (Tone and Break Indices) labels [97] or Tilt
parameters [108] are predicted [17][38].
A schematic diagram of text analysis is shown in Figure 2.2.
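As a concrete, deliberately simplified illustration of this flow, the Python sketch below chains the stages of Figure 2.2. All function names and bodies are hypothetical stand-ins: a real Japanese front end would call a morphological analyzer and apply reading and accentual rules, whereas here only the normalization step does real work.

```python
import re

# Illustrative text-analysis pipeline following Figure 2.2.
# The analyzer and accent rules are hypothetical stand-ins.
def normalize(text: str) -> str:
    """Text normalization: remove unreadable marks such as HTML tags."""
    return re.sub(r"<[^>]+>", " ", text).strip()

def morphological_analysis(text: str):
    """Stand-in for morphological analysis: (morpheme, part-of-speech) pairs."""
    return [(w, "unknown") for w in text.split()]

def generate_contextual_info(morphemes):
    """Attach pronunciation and accent symbols to each morpheme (dummy rules)."""
    return [{"surface": m, "pos": pos, "pronunciation": m, "accent_type": 0}
            for m, pos in morphemes]

contextual_info = generate_contextual_info(
    morphological_analysis(normalize("<p>kyou wa ii tenki desu</p>")))
print(contextual_info)
```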
2.2.2 Prosody generation
In the prosody generation, prosodic features such as F0 contour, power contour,
and phoneme duration are predicted from the contextual information output from
the text analysis. This prosodic information is important for the intelligibility
and naturalness of synthetic speech.
Fujisaki’s model has been proposed as one of the models that can represent
the F0 contour effectively [39]. This model decomposes the F0 contour into two
components, i.e. a phrase component that decreases gradually toward the end
of a sentence and an accent component that increases and decreases rapidly at
each accentual phrase. Fujisaki’s model is often used to generate the F0 contour
from the contextual information in rule-based TTS, particularly in Japanese TTS
[45][61]. In such systems, control rules arranged by experts are applied. In recent years, auto-
matic extraction algorithms of control parameters and rules from a large amount
of data with statistical methods have been proposed [41][46].
Many data-driven algorithms for prosody generation have been proposed. In
the F0 contour control model proposed by Kagoshima et al. [56], an F0 contour of
a whole sentence is produced by concatenating segmental F0 contours, which are
generated by modifying vectors that are representative of typical F0 contours. The
representative vectors are selected from an F0 contour codebook with contextual
information. The codebook is designed so that the approximation error between
F0 contours generated by this model and real F0 contours extracted from a speech
corpus is minimized. Isogai et al. proposed using not the representative vectors
but natural F0 contours selected from a speech corpus in order to generate an F0
contour of a sentence [50]. In this algorithm, if there is an F0 contour having equal
contextual information to the predicted contextual information in the speech
corpus, the F0 contour is selected and used without modification. In all other
cases, the F0 contour that most suits the predicted contextual information is
selected and used with modification. Moreover, algorithms for predicting the F0
contour from the ToBI labels or Tilt parameters have been proposed [13][37].
As a powerful data-driven algorithm, HMM-based (Hidden Markov model)
speech synthesis has been proposed by Tokuda et al. [111][112][117]. In this
method, the F0 contour, the mel-cepstrum sequence including the power con-
tour, and phoneme duration are generated directly from HMMs trained by a
decision-tree based on a context clustering technique. The F0 is modeled by
multi-space probability distribution HMMs [111], and the duration is modeled by
multi-dimensional Gaussian distribution HMMs in which each dimension shows
the duration in each state of the HMM. The mel-cepstrum is modeled by either
multi-dimensional Gaussian distribution HMMs or multi-dimensional Gaussian
mixture distribution HMMs. Decision-trees are constructed for each feature. The
decision-tree for the F0 and that for the mel-cepstrum are constructed in each
state of the HMM. As for the duration, one decision-tree is constructed. All train-
ing procedures are performed automatically. In synthesis, the smooth parameter
contours, which are static features, are generated from the HMMs by maximizing
the likelihood criterion while considering the dynamic features of speech [112].
Some TTS systems do not perform the prosody generation [24]. In these
systems, contextual information is used instead of prosody information for the
next procedure, unit selection.
In our corpus-based TTS under development, HMM-based speech synthesis
is applied to a prosody generation module. A schematic diagram of HMM-based
prosody generation is shown in Figure 2.3.
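The generation of smooth parameter contours from the HMMs while considering dynamic features [112] can be made concrete with a small numerical sketch. The following is a toy reconstruction under simplifying assumptions (diagonal covariances, one fixed delta window, a single parameter stream); it is not the implementation used in the thesis.

```python
import numpy as np

def mlpg(mu, var):
    """Toy ML parameter generation: mu, var are (T, 2) arrays holding the
    per-frame means/variances of [static, delta] features from the HMMs."""
    T = mu.shape[0]
    W = np.zeros((2 * T, T))          # stacks static and delta windows
    for t in range(T):
        W[2 * t, t] = 1.0             # static feature: identity window
        for k, w in enumerate((-0.5, 0.0, 0.5)):   # delta window
            tau = min(max(t + k - 1, 0), T - 1)
            W[2 * t + 1, tau] += w
    P = np.diag(1.0 / var.reshape(-1))   # inverse (diagonal) covariance
    m = mu.reshape(-1)
    # The smooth static contour c maximizes the likelihood of o = W c:
    # solve (W' P W) c = W' P m.
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ m)

# A step in the static mean is smoothed because the deltas favor continuity.
mu = np.array([[5.0, 0.0]] * 5 + [[5.5, 0.0]] * 5)
var = np.full((10, 2), 0.1)
print(mlpg(mu, var))
```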
2.2.3 Unit selection
In the unit selection, an optimum set of units is selected from a speech corpus by
minimizing the degradation of naturalness caused by various factors, e.g. prosodic
difference, spectral difference, and a mismatch of phonetic environments [52][89].
Various types of units have been proposed to alleviate such degradation.
Nakajima et al. proposed an automatic procedure called Context Oriented
Clustering (COC) [80]. In this technique, the optimum synthesis units are gen-
erated or selected from a speech corpus of a single speaker in advance in order
to alleviate degradation caused by spectral difference. All segments of a given
phoneme in the speech corpus are clustered in advance into equivalence classes
according to their preceding and succeeding phoneme contexts. The decision
trees that perform the clustering are constructed automatically by minimizing
the acoustic differences within the equivalence classes. The centroid segment
of each cluster is saved as a synthesis unit. In the speech synthesis phase, the
optimum synthesis units are selected from leaf clusters that most suit given pho-
[Figure: contextual information selects, via decision trees for spectrum and for F0 in each HMM state and a decision tree for the state duration model, the context-dependent HMMs and duration models; these are concatenated into a sentence HMM, from which the state duration sequence, F0 contour, and mel-cepstrum sequence including the power contour are generated.]
Figure 2.3. Schematic diagram of HMM-based prosody generation.
netic contexts. As the synthesis units, either spectral parameter sequences [40]
or waveform segments [51] are utilized.
Kagoshima and Akamine proposed an automatic generation method of synthe-
sis units with Closed-Loop Training (CLT) [5][55]. In this approach, an optimum
set of synthesis units is selected or generated from a speech corpus in advance to
minimize the degradation caused by synthesis processing such as prosodic modifi-
cation. A measure capturing this degradation is defined as the difference between
a natural speech segment prepared as training data cut off from the speech corpus
and a synthesized speech segment with prosodic modification. The selection or
generation of the best synthesis unit is performed on the basis of the evaluation
and minimization of the measure in each unit cluster represented by a diphone
[87]. Although the number of diphone waveform segments used as synthesis units
is very small (only 302 segments), speech with natural and smooth sounding
quality can be synthesized.
There are many types of basic synthesis units, e.g. phoneme, diphone, syl-
lable, VCV units [92], and CVC units [93]. The units comprised of more than
two phonemes can preserve transitions between phonemes. Therefore, the con-
catenation between phonemes that often produces perceptual discontinuity can
be avoided by utilizing these units. The diphone units have unit boundaries at
phoneme centers [32][84][87]. In the VCV units, concatenation points are vowel
centers, where formant trajectories are more stable and clearer than those in con-
sonant centers [92]. In contrast, in the CVC units, concatenation is performed at the
consonant centers in which waveform power is often smaller than that in the
vowel centers [93]. In Japanese, CV (C: Consonant, V: Vowel) units are often
used since nearly all Japanese syllables consist of CV or V.
In order to use stored speech data effectively and flexibly, Sagisaka et al.
proposed Non-Uniform Units (NUU) [52][89][106]. In this approach, the specific
units are not selected or generated from a speech corpus in advance. An optimum
set of synthesis units is selected by minimizing a cost capturing the degradation
caused by spectral difference, difference in phonetic environment, and concate-
nation between units in a synthesis procedure. Since it is possible to use all
phoneme subsequences as synthesis units, the selected units, i.e. NUU, have
variable lengths. The ATR ν-talk speech synthesis system is based on the NUU
represented by a spectral parameter sequence [90]. Hirokawa et al. proposed that
not only factors related to spectrum and phonetic environment but also prosodic
difference are considered in selecting the optimum synthesis units [44]. In this
approach, speech is synthesized by concatenating the selected waveform segments
and then modifying their prosody. Campbell et al. also proposed utilization of
prosodic information in selecting the synthesis units [19][20]. Based on these
works, Black et al. proposed a general algorithm for unit selection by using two
costs [11][22][47]. One is a target cost, which captures the degradation caused
by prosodic difference and difference in phonetic environment, and the other is
a concatenation cost, which captures the degradation caused by concatenating
units. In this algorithm, the sum of the two costs is minimized using a dynamic
programming search based on phoneme units. By introducing these techniques to
ν-talk, CHATR was constructed as a generic speech synthesis system [10][12][21].
Since the number of factors considered increases, a larger speech corpus is
utilized than in ν-talk. If the size of the corpus is large enough and it is possi-
ble to select waveform segments satisfying target prosodic features predicted by
the prosody generation, it is not necessary to perform prosody modification [23].
Therefore, natural speech without degradation caused by signal processing can
be synthesized by concatenating the waveform segments directly. This waveform
segment selection has become the mainstream of corpus-based TTS systems for
any language. In recent years, Conkie proposed a waveform segment selection
based on half-phoneme units to improve the robustness of the selection [28].
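The two-cost dynamic programming search described above is compact enough to sketch directly. The following Python is a minimal, generic reconstruction of such a Viterbi search over candidate segments; the cost functions are passed in as parameters, and nothing here is specific to any particular system.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """targets: per-phoneme target specs; candidates[i]: segments for target i.
    Returns the index of the chosen candidate at each position."""
    # best[i][j] = (cheapest cumulative cost ending at candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            k = min(range(len(candidates[i - 1])),
                    key=lambda p: best[i - 1][p][0]
                    + concat_cost(candidates[i - 1][p], c))
            row.append((best[i - 1][k][0]
                        + concat_cost(candidates[i - 1][k], c) + tc, k))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda p: best[-1][p][0])
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    return list(reversed(path))
```

For a corpus-scale system the inner minimization would be pruned, but the structure, a lattice of candidates scored by target and concatenation costs, is the same.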
CV* units [60] and multiform units [105] have been proposed as synthesis
units by Kawai et al. and Takano et al., respectively. These units can preserve
the important transitions for Japanese, i.e. V-V transitions, in order to alleviate
the perceptual discontinuity caused by concatenation. The units are stored in a
speech corpus as sequences of phonemes with phonetic environments. A stored
unit can have multiple waveform segments with different F0 or phoneme dura-
tion. Therefore, optimum waveform segments can be selected while considering
both the degradation caused by concatenation and that caused by prosodic mod-
ification. In general, the number of concatenations becomes smaller by utilizing
longer units. However, the longer units cannot always synthesize natural speech,
since the number of candidate units becomes small and the flexibility of prosody
synthesis is lost.
Shorter units have also been proposed. Donovan et al. proposed HMM state-
based units [33][34]. In this approach, decision-tree state-clustered HMMs are
trained automatically with a speech corpus in advance. In order to determine the
segment sequence to concatenate, a dynamic programming search is performed
over all waveform segments aligned to each leaf of the decision-trees in synthesis.
In the HMM-based speech synthesis proposed by Yoshimura et al. [117], the
optimum HMM sequence is selected from decision-trees by utilizing phonetic and
prosodic context information.
In our corpus-based TTS, the waveform segment selection technique is applied
[Figure: for the target sequence "e r a b u" with predicted phonetic and prosodic information, candidate segments in the speech corpus are searched, and the segment sequence causing the least degradation of naturalness is selected.]
Figure 2.4. Schematic diagram of segment selection.
to a unit selection module. A schematic diagram of the segment selection is shown
in Figure 2.4.
2.2.4 Waveform synthesis
An output speech waveform is synthesized from the selected units in the last
procedure of TTS. In general, two approaches to waveform synthesis have been
used. One is waveform concatenation without speech modification, and the other
is speech synthesis with speech modification.
In the waveform concatenation, speech is synthesized by concatenating wave-
form segments selected from a speech corpus using prosodic information to remove
the need for signal processing [23]. In this case, instead of performing prosody mod-
ification, raw waveform segments are used. Therefore, synthetic speech has no
degradation caused by signal processing. However, if the prosody of the selected
waveform segments is different from the predicted target prosody, degradation is
caused by the prosodic difference [44]. In order to alleviate the degradation, it
is necessary to prepare a large-sized speech corpus that contains abundant wave-
form segments. Although synthetic speech by waveform concatenation sounds
very natural, the naturalness is not always consistent.
In the speech synthesis, signal processing techniques are used to generate a
speech waveform with the target prosody. The Time-Domain Pitch-Synchronous
OverLap-Add (TD-PSOLA) algorithm is often used for prosody modification
[79]. TD-PSOLA does not need any analysis algorithm except for determina-
tion of pitch marks throughout the segments. Speech analysis-synthesis methods
can also modify the prosody. In the HMM-synthesis method, a mel-cepstral
analysis-synthesis technique is performed [117]. Speech is synthesized from a
mel-cepstrum sequence generated directly from the selected HMMs and the ex-
citation source by utilizing a Mel Log Spectrum Approximation (MLSA) filter
[49]. A vocoder type algorithm such as this can modify speech easily by varying
speech parameters, i.e. spectral parameter and source parameter [36]. However,
the quality of the synthetic speech is often degraded. As a high-quality vocoder
type algorithm, Kawahara et al. proposed the STRAIGHT (Speech Transforma-
tion and Representation using Adaptive Interpolation of weiGHTed spectrum)
analysis-synthesis method [58]. STRAIGHT uses pitch-adaptive spectral analy-
sis combined with a surface reconstruction method in the time-frequency region
to remove signal periodicity and designs an excitation source based on phase
manipulation. Moreover, STRAIGHT can manipulate such speech parameters
as pitch, vocal tract length, and speaking rate while maintaining high reproduc-
tive quality. Stylianou proposed the Harmonic plus Noise Model (HNM) as a
high-quality speech modification technique [100]. In this model, speech signals
are represented as a time-varying harmonic component plus a modulated noise
component. Speech synthesis with these modification algorithms is very useful in
the case of a small-sized speech corpus. Synthetic speech by this speech synthesis
sounds very smooth, and the quality is consistent. However, the naturalness of
the synthetic speech is often not as good as that of synthetic speech by waveform
concatenation.
In our corpus-based TTS, both the waveform concatenation technique and
STRAIGHT synthesis method are applied in the waveform synthesis module. In
the waveform concatenation, we control waveform power in each phoneme seg-
ment by multiplying the segment by a certain value so that average power in a
phoneme segment selected from a speech corpus becomes equal to average target
power in the phoneme. When the segments modified by this power are con-
catenated, an overlap-add technique is applied in the frame-pair with the highest
correlation around a concatenation boundary between the segments. A schematic
diagram of the waveform concatenation is shown in Figure 2.5. In the other syn-
thesis method based on STRAIGHT, speech waveforms in voiced phonemes are
synthesized with STRAIGHT by using a concatenated spectral sequence, a con-
catenated aperiodic energy sequence, and target prosodic features. In unvoiced
phonemes, we use original waveforms modified only by power. A schematic dia-
gram of the speech synthesis with prosody modification by STRAIGHT is shown
in Figure 2.6.
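As a rough illustration of the waveform concatenation just described, the sketch below scales a segment to a target average power and joins two segments by overlap-add at the frame pair with the highest correlation near the boundary. The frame and search lengths are arbitrary illustrative values, not those of the actual system.

```python
import numpy as np

def match_power(seg, target_power):
    """Scale a segment so its average power equals the target power."""
    p = np.mean(np.asarray(seg, dtype=float) ** 2)
    return np.asarray(seg, dtype=float) * np.sqrt(target_power / max(p, 1e-12))

def join(a, b, frame=80, search=240):
    """Overlap-add two segments at the best-correlated frame pair near the
    boundary. Assumes both segments are longer than `search` samples."""
    tail, head = a[-search:], b[:search]
    best_r, (bi, bj) = -np.inf, (0, 0)
    for i in range(len(tail) - frame):
        for j in range(len(head) - frame):
            r = float(np.dot(tail[i:i + frame], head[j:j + frame]))
            if r > best_r:
                best_r, (bi, bj) = r, (i, j)
    fade = np.linspace(1.0, 0.0, frame)  # cross-fade over the matched frames
    cross = tail[bi:bi + frame] * fade + head[bj:bj + frame] * (1.0 - fade)
    return np.concatenate([a[:len(a) - search + bi], cross, b[bj + frame:]])
```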
2.2.5 Speech corpus
A speech corpus directly influences the quality of synthetic speech in corpus-
based TTS. In order to realize a consistently high quality of synthetic speech,
it is important to prepare a speech corpus containing abundant speech segments
with various phonemes, phonetic environments, and prosodies, which should be
recorded while maintaining high quality.
Abe et al. developed a Japanese sentence set in which phonetic coverage is
controlled [1]. This sentence set is often used not only in the field of speech
synthesis but also in speech recognition. Kawai et al. proposed an effective
method for designing a sentence set for utterances by taking into account prosodic
coverage as well as phonetic coverage [62]. This method selects the optimum
sentence set from a large number of sentences by maximizing the measure of
coverage. The size of the sentence set, i.e. the number of sentences, is decided in
advance. The coverage measure captures two factors, i.e. (1) the distributions of
F0 and phoneme duration predicted by the prosody generation and (2) perceptual
degradation of naturalness due to the prosody modification.
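Such coverage-driven sentence selection can be approximated by a simple greedy algorithm, sketched below. Each candidate sentence is scored by how many uncovered units (here, just phoneme pairs) it would add, and sentences are picked until the preset set size is reached; the cited method [62] additionally weights prosodic coverage and perceptual degradation, which are omitted here.

```python
def design_sentence_set(candidates, set_size):
    """candidates: sentences as phoneme strings, e.g. 'a r a y u r u'.
    Greedily picks the sentence adding the most uncovered phoneme pairs."""
    pool = [(s, set(zip(s.split(), s.split()[1:]))) for s in candidates]
    covered, selected = set(), []
    for _ in range(min(set_size, len(pool))):
        s, units = max(pool, key=lambda x: len(x[1] - covered))
        covered |= units
        selected.append(s)
        pool.remove((s, units))
    return selected

print(design_sentence_set(['a r a', 'a r a y u', 'k a k i'], 2))
```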
In general, the degradation of naturalness caused by a mismatch of phonetic
environments and prosodic difference can be alleviated by increasing the size of
the speech corpus. However, variation in voice quality is caused by recording the
speech of a speaker for a long time in order to construct the large-sized corpus [63].
Concatenation between speech segments with different voice qualities produces
[Figure: waveform segments are modified in power to the target power, optimum frame pairs around the concatenation boundaries are searched for, and the segments are joined by overlap-add to produce synthetic speech.]
Figure 2.5. Schematic diagram of waveform concatenation.
[Figure: parameter segments in voiced phonemes are concatenated and synthesized by STRAIGHT with the target prosody; waveform segments in unvoiced phonemes are modified only in power; the resulting waveforms are concatenated to produce synthetic speech.]
Figure 2.6. Schematic diagram of speech synthesis with prosody modification by STRAIGHT.
audible discontinuity. To deal with this problem, previous studies have proposed
using a measure capturing the difference in voice quality to avoid concatenation
between such segments [63] and normalization of power spectral densities [94].
In our corpus-based TTS, a large-sized corpus of speech spoken by a Japanese
male who is a professional narrator is under construction. The maximum size
of the corpus used in this thesis is 32 hours. A sentence set for utterances is
extracted from TV news articles, newspaper articles, phrase books for foreign
tourists, and so on by taking into account prosodic coverage as well as phonetic
coverage.
2.3 Statistical Voice Conversion Algorithm
The codebook mapping method has been proposed as a voice conversion method
by Abe et al. [2]. In this method, voice conversion is formulated as a map-
ping problem between two speakers’ codebooks; this idea was originally proposed
as a method for speaker adaptation by Shikano et al. [95]. This algorithm has
been improved by introducing the fuzzy Vector Quantization (VQ) algorithm [81].
Moreover, the fuzzy VQ-based algorithm using difference vectors between map-
ping code vectors and input code vectors has been proposed in order to represent
various spectra beyond the limitation caused by the codebook size [75][76]. These
VQ-based algorithms are basic to statistical voice conversion. Abe et al. have also
proposed a segment-based approach [4]. In this approach, speech segments of a
source speaker selected by HMM are replaced by the corresponding segments of a
target speaker. Both static and dynamic characteristics of speaker individuality
can be preserved in the segment.
A voice conversion algorithm using Linear Multivariate Regression (LMR) was proposed by Valbret et al. [114]. In the LMR algorithm, the spectrum of a source speaker is converted with a simple linear transformation for each class. In the algorithm using Dynamic Frequency Warping (DFW) [114], frequency warping is performed to convert the spectrum. As a conversion method similar to the DFW algorithm, modification of formant frequencies and spectral intensity has been proposed [78]. Moreover, an algorithm using neural networks has been proposed [82].
A voice conversion algorithm by speaker interpolation has been proposed by
Iwahashi and Sagisaka [53]. The converted spectrum is synthesized by interpolat-
ing the spectra of multiple speakers. In HMM-based speech synthesis, HMM pa-
rameters among representative speakers’ HMM sets are interpolated [118]. Speech
with the voice quality of various speakers can be synthesized from the interpolated
HMMs directly [49][112]. Moreover, some speaker adaptation methods, e.g. VFS
(Vector Field Smoothing) [83], MAP (Maximum A Posteriori) [69], VFS/MAP
[104], and MLLR (Maximum Likelihood Linear Regression) [71], can be applied
to HMM-based speech synthesis [74][107]. In voice conversion by HMM-based speech synthesis, the average voice is often used in place of the source speaker's voice. Although the quality of the synthetic speech is not yet adequate, this attractive approach can flexibly synthesize speech of various speakers.
In this thesis, we focus on the voice conversion algorithm based on a Gaussian Mixture Model (GMM) proposed by Stylianou et al. [98][99]. In this algorithm, the feature space is represented continuously by multiple distributions, i.e. a Gaussian mixture model. A distinguishing characteristic of this algorithm is its use of the correlation between the features of the two speakers. The VQ-based algorithms mentioned above and the GMM-based algorithm are described in the following subsections. In these algorithms, only the speech data of the source and target speakers are needed, and both the training and conversion procedures are performed automatically.
2.3.1 Conversion algorithm based on Vector Quantization
In the codebook mapping method, the converted spectrum is represented by a mapping codebook, which is calculated as a linear combination of the target speaker's code vectors [2]. A code vector $C_i^{(map)}$ of class $i$ in the mapping codebook for a code vector $C_i^{(x)}$ of class $i$ in the source speaker's codebook is generated as follows:

$$C_i^{(map)} = \sum_{j=1}^{m} w_{i,j} \cdot C_j^{(y)}, \qquad (2.1)$$

$$w_{i,j} = \frac{h_{i,j}}{\sum_{k=1}^{m} h_{i,k}}, \qquad (2.2)$$

where $C_j^{(y)}$ denotes a code vector of class $j$ in the target speaker's codebook having $m$ code vectors, and $h_{i,j}$ denotes a histogram of the frequency of correspondence between the code vectors $C_i^{(x)}$ and $C_j^{(y)}$ in the training data. In
the conversion-synthesis step, the source speaker’s speech features are vector-
quantized into a code vector sequence with the source speaker’s codebook, and
then each code vector is mapped into a corresponding code vector in the mapping
codebook. Finally, the converted speech is synthesized from the mapped code
vector sequence having characteristics of the target speaker. Since representation
of the features is limited by the codebook size, i.e. the number of code vectors,
quantization errors are caused by vector quantization. Therefore, the converted
speech includes unnatural sounds.
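To make the procedure concrete, the following is a minimal sketch of the codebook-mapping conversion of Equations (2.1) and (2.2) in Python with NumPy; the function and variable names are illustrative, not from the thesis.

```python
import numpy as np

def build_mapping_codebook(tgt_codebook, histogram):
    """Equations (2.1)-(2.2): each mapping code vector is a histogram-weighted
    linear combination of the target speaker's code vectors.
    histogram[i, j]: how often source class i aligns with target class j."""
    weights = histogram / histogram.sum(axis=1, keepdims=True)  # w_{i,j}
    return weights @ tgt_codebook                               # rows are C_i^(map)

def convert_frame(x, src_codebook, mapping_codebook):
    """Hard VQ conversion: quantize the input frame with the source codebook,
    then replace it with the corresponding mapping code vector."""
    i = int(np.argmin(np.linalg.norm(src_codebook - x, axis=1)))
    return mapping_codebook[i]
```

Because each frame can only be mapped to one of $m$ vectors, the quantization error described above is directly visible in this sketch.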
The quantization errors are decreased by introducing a fuzzy VQ technique
that can represent various kinds of vectors beyond the limitation caused by the
codebook size [81]. The vectors are represented not as one code vector but as a
linear combination of several code vectors. The fuzzy VQ is defined as follows:
$$x' = \sum_{i=1}^{k} w_i^{(f)} \cdot C_i^{(x)}, \qquad (2.3)$$

$$w_i^{(f)} = \frac{(u_i)^f}{\sum_{j=1}^{k} (u_j)^f}, \qquad (2.4)$$

where $x'$ denotes the decoded vector of an input vector $x$, $k$ denotes the number of code vectors nearest to the input vector, and $u_i$ denotes the fuzzy membership function of class $i$, given by

$$u_i = \frac{1}{\sum_{j=1}^{k} \left( \frac{d_i}{d_j} \right)^{\frac{1}{f-1}}}, \qquad (2.5)$$

$$d_i = \| x - C_i^{(x)} \|, \qquad (2.6)$$

where $f$ denotes fuzziness. The conversion is performed by replacing the source speaker's code vectors with the mapping code vectors. The converted vector $x^{(map)}$ is given by

$$x^{(map)} = \sum_{i=1}^{k} w_i^{(f)} \cdot C_i^{(map)}. \qquad (2.7)$$
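As an illustration, Equations (2.4)-(2.7) can be sketched as follows, assuming NumPy arrays, a fuzziness $f > 1$, and an input frame that does not coincide exactly with a code vector; the names are illustrative.

```python
import numpy as np

def fuzzy_vq_convert(x, src_codebook, mapping_codebook, k=4, f=1.5):
    d = np.linalg.norm(src_codebook - x, axis=1)   # d_i = ||x - C_i^(x)||, Eq. (2.6)
    nearest = np.argsort(d)[:k]                    # k nearest classes
    # Fuzzy membership of Equation (2.5), computed over the k nearest classes
    u = 1.0 / np.sum((d[nearest, None] / d[None, nearest]) ** (1.0 / (f - 1.0)),
                     axis=1)
    w = u ** f / np.sum(u ** f)                    # Equation (2.4)
    return w @ mapping_codebook[nearest]           # Equation (2.7)
```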
Furthermore, it is possible to represent various additional input vectors by
introducing difference vectors between the mapping code vectors and the code
vectors in the fuzzy VQ-based voice conversion algorithm [75]. In this algorithm,
the converted vector $x^{(map)}$ is given by

$$x^{(map)} = D^{(map)} + x, \qquad (2.8)$$

$$D^{(map)} = \sum_{i=1}^{k} w_i^{(f)} \cdot D_i, \qquad (2.9)$$

where $w_i^{(f)}$ is given by Equation (2.4) and $D_i$ denotes the difference vector:

$$D_i = C_i^{(map)} - C_i^{(x)}. \qquad (2.10)$$
2.3.2 Conversion algorithm based on Gaussian Mixture Model
We assume that $p$-dimensional time-aligned acoustic features $x = [x_0, x_1, \ldots, x_{p-1}]^{\mathrm{T}}$ (the source speaker's) and $y = [y_0, y_1, \ldots, y_{p-1}]^{\mathrm{T}}$ (the target speaker's) are determined by Dynamic Time Warping (DTW), where $\mathrm{T}$ denotes transposition of the vector.
In the GMM algorithm, the probability distribution of acoustic features $x$ can be written as

$$p(x) = \sum_{i=1}^{m} \alpha_i N(x; \mu_i, \Sigma_i), \quad \text{subject to} \;\; \sum_{i=1}^{m} \alpha_i = 1, \;\; \alpha_i \geq 0, \qquad (2.11)$$

where $\alpha_i$ denotes the weight of class $i$, and $m$ denotes the total number of Gaussian mixtures. $N(x; \mu, \Sigma)$ denotes the normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$, given by

$$N(x; \mu, \Sigma) = \frac{|\Sigma|^{-1/2}}{(2\pi)^{p/2}} \exp\left[ -\frac{1}{2} (x - \mu)^{\mathrm{T}} \Sigma^{-1} (x - \mu) \right]. \qquad (2.12)$$
The mapping function [98][99] converting acoustic features of the source speaker to those of the target speaker is given by

$$F(x) = E[y|x] = \sum_{i=1}^{m} h_i(x) \left[ \mu_i^{(y)} + \Sigma_i^{(yx)} \left( \Sigma_i^{(xx)} \right)^{-1} \left( x - \mu_i^{(x)} \right) \right], \qquad (2.13)$$

$$h_i(x) = \frac{\alpha_i N(x; \mu_i^{(x)}, \Sigma_i^{(xx)})}{\sum_{j=1}^{m} \alpha_j N(x; \mu_j^{(x)}, \Sigma_j^{(xx)})}, \qquad (2.14)$$

where $\mu_i^{(x)}$ and $\mu_i^{(y)}$ denote the mean vectors of class $i$ for the source and target speakers, respectively. $\Sigma_i^{(xx)}$ denotes the covariance matrix of class $i$ for the source speaker, and $\Sigma_i^{(yx)}$ denotes the cross-covariance matrix of class $i$ between the target and source speakers.
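A minimal sketch of the mapping function of Equations (2.13) and (2.14), under the thesis's assumption of diagonal covariance blocks (stored here as vectors, so the matrix product reduces to elementwise operations); SciPy supplies the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_map(x, alpha, mu_x, mu_y, var_xx, cov_yx):
    """alpha: (m,) mixture weights; mu_x, mu_y: (m, p) mean vectors;
    var_xx, cov_yx: (m, p) diagonals of Sigma_i^(xx) and Sigma_i^(yx)."""
    m = len(alpha)
    lik = np.array([alpha[i] * multivariate_normal.pdf(x, mu_x[i], np.diag(var_xx[i]))
                    for i in range(m)])
    h = lik / lik.sum()                          # posteriors h_i(x), Eq. (2.14)
    # Conditional mean E[y|x] accumulated over mixtures, Eq. (2.13)
    return sum(h[i] * (mu_y[i] + cov_yx[i] / var_xx[i] * (x - mu_x[i]))
               for i in range(m))
```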
In order to estimate the parameters $\alpha_i$, $\mu_i^{(x)}$, $\mu_i^{(y)}$, $\Sigma_i^{(xx)}$, and $\Sigma_i^{(yx)}$, the probability distribution of the joint vectors $z = [x^{\mathrm{T}}, y^{\mathrm{T}}]^{\mathrm{T}}$ of the source and target speakers is represented by a GMM [57] as follows:

$$p(z) = \sum_{i=1}^{m} \alpha_i N(z; \mu_i^{(z)}, \Sigma_i^{(z)}), \quad \text{subject to} \;\; \sum_{i=1}^{m} \alpha_i = 1, \;\; \alpha_i \geq 0, \qquad (2.15)$$

where $\Sigma_i^{(z)}$ and $\mu_i^{(z)}$ denote the covariance matrix and the mean vector of class $i$ for the joint vectors, given by

$$\Sigma_i^{(z)} = \begin{bmatrix} \Sigma_i^{(xx)} & \Sigma_i^{(xy)} \\ \Sigma_i^{(yx)} & \Sigma_i^{(yy)} \end{bmatrix}, \qquad \mu_i^{(z)} = \begin{bmatrix} \mu_i^{(x)} \\ \mu_i^{(y)} \end{bmatrix}. \qquad (2.16)$$

These parameters are estimated by the EM algorithm [30]. In this thesis, we assume that the covariance matrices $\Sigma_i^{(xx)}$ and $\Sigma_i^{(yy)}$ and the cross-covariance matrices $\Sigma_i^{(xy)}$ and $\Sigma_i^{(yx)}$ are diagonal.
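Joint-density training can be sketched with scikit-learn's EM implementation. Note one deliberate simplification: GaussianMixture with covariance_type="full" estimates unconstrained covariance blocks, whereas the thesis constrains the blocks to be diagonal, so this is an approximation of the setup rather than the exact procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_joint_gmm(x_frames, y_frames, m=64):
    """x_frames, y_frames: DTW-aligned (T, p) source/target feature matrices."""
    z = np.hstack([x_frames, y_frames])     # joint vectors z = [x^T, y^T]^T
    gmm = GaussianMixture(n_components=m, covariance_type="full").fit(z)
    p = x_frames.shape[1]
    # Block decomposition of the joint statistics, Equation (2.16)
    mu_x, mu_y = gmm.means_[:, :p], gmm.means_[:, p:]
    var_xx = np.array([np.diag(c[:p, :p]) for c in gmm.covariances_])
    cov_yx = np.array([np.diag(c[p:, :p]) for c in gmm.covariances_])
    return gmm.weights_, mu_x, mu_y, var_xx, cov_yx
```

The returned diagonals plug directly into the mapping-function sketch above.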
2.3.3 Comparison of mapping functions
Figure 2.7 shows various mapping functions: that in the codebook mapping
algorithm (“VQ”), that in the fuzzy VQ mapping algorithm (“Fuzzy VQ”), that
in the fuzzy VQ mapping algorithm using the difference vector ("Fuzzy VQ using difference vector"), and that in the GMM-based algorithm ("GMM"). The number of classes or the number of Gaussian mixtures is set to 2. We also show values of the conditional expectation E[y|x] calculated directly from the training samples.

Figure 2.7. Various mapping functions. The contour line denotes the frequency distribution of the training data in the joint feature space. "x" denotes the conditional expectation E[y|x] calculated at each value of the original feature x.
The mapping function in the codebook mapping algorithm is discontinuous
because hard decision clustering is performed. The mapping function becomes
continuous by performing fuzzy clustering in the fuzzy VQ algorithm. However,
the accuracy of the mapping function is poor: it deviates noticeably from the conditional expectation values. Although the mapping function more closely approximates the conditional expectation when the difference vector is introduced, the accuracy is still not high enough. On the other hand, the mapping function of the GMM-based algorithm is close to the conditional expectation because the correlation between the source feature and the target feature can be utilized. Moreover, Gaussian mixtures can represent the probability
distribution of features more accurately than the VQ-based algorithms, since the
covariance can be considered in the GMM-based algorithm. Therefore, the map-
ping function in the GMM-based algorithm is the most reasonable and has the highest conversion accuracy among the conventional algorithms.
The GMM-based algorithm can convert spectra more smoothly and synthesize converted speech of higher quality than the codebook mapping algorithm [109].
2.4 Summary
This chapter described the basic structure of corpus-based Text-to-Speech (TTS)
and reviewed the various techniques in each module. We also introduced some
techniques applied to the corpus-based TTS under development.
We also described conventional voice conversion algorithms. From the results of comparing various mapping functions, the mapping function of the voice conversion algorithm based on the Gaussian Mixture Model (GMM) proved the most practical, with the highest conversion accuracy among the conventional algorithms.
Corpus-based TTS improves the naturalness of synthetic speech dramatically compared with rule-based TTS. However, its naturalness is still inadequate, and flexible synthesis has not yet been achieved.
Chapter 3
A Segment Selection Algorithm
for Japanese Speech Synthesis
Based on Both Phoneme and
Diphone Units
This chapter describes a novel segment selection algorithm for Japanese TTS sys-
tems. Since Japanese syllables consist of CV (C: Consonant or consonant cluster,
V: Vowel or syllabic nasal /N/) or V, except when a vowel is devoiced, and these
correspond to symbols in the Japanese ‘Kana’ syllabary, CV units are often used
in concatenative TTS systems for Japanese. However, speech synthesized with
CV units sometimes has discontinuities due to V-V or V-semivowel concatena-
tion. In order to alleviate such discontinuities, longer units, e.g. CV* units,
have been proposed. However, since various vowel sequences appear frequently in
Japanese, it is not realistic to prepare long units that include all possible vowel
sequences. To address this problem, we propose a novel segment selection al-
gorithm that incorporates not only phoneme units but also diphone units. The
concatenation in the proposed algorithm is allowed at the vowel center as well
as at the phoneme boundary. The advantage of considering both types of units
is examined by experiments on concatenation of vowel sequences. Moreover, the
results of perceptual evaluation experiments clarify that the proposed algorithm
outperforms the conventional algorithms.
3.1 Introduction
In Japanese, a speech corpus can be constructed efficiently by considering CV
(C: Consonant or consonant cluster, V: Vowel or syllabic nasal /N/) syllables as
synthesis units, since Japanese syllables consist of CV or V except when a vowel
is devoiced. CV syllables correspond to symbols in the Japanese ‘Kana’ syllabary
and the number of the syllables is small (about 100). It is also well known
that transitions from C to V, or from V to V, are very important in auditory
perception [89][105]. Therefore, CV units are often used in concatenative TTS
systems for Japanese. On the other hand, other units are often used in TTS
systems for English because the number of syllables is enormous (over 10,000)
[67]. In recent years, an English TTS system based on CHATR has been adapted
for diphone units by AT&T [7]. Furthermore, the NextGen TTS system based on
half-phoneme units has been constructed [8][28][102], and this system has proved
to be an improvement over the previous system.
In Japanese TTS, speech synthesized with CV units has discontinuities due
to V-V or V-semivowel concatenation. In order to alleviate these discontinuities,
Kawai et al. extended the CV unit to the CV* unit [60]. Sagisaka proposed
non-uniform units to use stored speech data effectively and flexibly [89]. In this
algorithm, optimum units are selected from a speech corpus to minimize the to-
tal cost calculated as the sum of some sub-costs [52][90][106]. As a result of
dynamic programming search based on phoneme units, variously sized sequences of phonemes are selected [11][22][47]. However, it is not realistic to construct a
corpus that includes all possible vowel sequences, since various vowel sequences
appear frequently in Japanese. The frequency of vowel sequences is described
in Appendix A. If the coverage of prosody is also to be considered, the cor-
pus becomes enormous [62]. Therefore, the concatenation between V and V is
unavoidable.
Formant transitions are more stationary at vowel centers than at vowel bound-
aries. Therefore, concatenation at the vowel centers tends to reduce audible dis-
continuities compared with that at the vowel boundaries. VCV units are based
on this view [92], which has been supported by our informal listening test. As
typical Japanese TTS systems that utilize the concatenation at the vowel centers,
TOS Drive TTS (Totally Speaker Driven Text-to-Speech) has been constructed
by TOSHIBA [55] and Final Fluet has been constructed by NTT [105]. The
former TTS is based on diphone units. In the latter TTS, diphone units are used
if the desirable CV* units are not stored in the corpus. Thus, both TTS systems
take into account only the concatenation at the vowel centers in vowel sequences.
However, concatenation at the vowel boundaries is not always inferior to that at
the vowel centers. Therefore, both types of concatenation should be considered in
vowel sequences. In this chapter, we propose a novel segment selection algorithm
incorporating not only phoneme units but also diphone units. The proposed al-
gorithm permits the concatenation of synthesis units not only at the phoneme
boundaries but also at the vowel centers. The results of evaluation experiments
clarify that the proposed algorithm outperforms the conventional algorithms.
The chapter is organized as follows. In Section 3.2, cost functions for segment
selection are described. In Section 3.3, the advantage of performing concatena-
tion at the vowel centers is discussed. In Section 3.4, the novel segment selection
algorithm is described. In Section 3.5, evaluation experiments are described.
Finally, we summarize this chapter in Section 3.6.
3.2 Cost Function for Segment Selection
The cost function for segment selection is viewed as a mapping, as shown in
Figure 3.1, of objective features, e.g. acoustic measures and contextual infor-
mation, into a perceptual measure. A cost is considered the perceptual measure
capturing the degradation of naturalness of synthetic speech. In this thesis, only
phonetic information is used as contextual information, and the other contextual
information is converted into acoustic measures by the prosody generation.
The components of the cost function should be determined based on results
of perceptual experiments. A mapping of acoustic measures into a perceptual
measure is generally not practical except when the acoustic measures have simple
structure, as in the case of F0 or phoneme duration. Acoustic measures with
complex structure, such as spectral features that are accurate enough to capture
perceptual characteristics, have not been found so far [31][66][101][115].
On the other hand, a mapping of phonetic information into perceptual mea-
sures can be determined from the results of perceptual experiments [64]. There-
28
Cost function
Acousticmeasure
Contextualinformation
SpectrumDuration
F0
Phonetic contextAccent type
Observable features
Cost
Perceptualmeasure
Mapping observable featuresinto a perceptual measure
Figure 3.1. Schematic diagram of cost function.
fore, it is possible to capture the perceptual characteristics by utilizing such a
mapping. However, acoustic measures that can represent the characteristic of
each segment are still necessary, since phonetic information can only evaluate the
difference between phonetic categories.
Therefore, we utilize both acoustic measures and perceptual measures deter-
mined from the results of perceptual experiments.
3.2.1 Local cost
The local cost shows the degradation of naturalness caused by utilizing an individ-
ual candidate segment. The cost function comprises the five sub-cost functions shown in Table 3.1. Each sub-cost reflects either source information or vocal tract information.
The local cost is calculated as the weighted sum of the five sub-costs. The
local cost $LC(u_i, t_i)$ at a candidate segment $u_i$ is given by

$$LC(u_i, t_i) = w_{pro} \cdot C_{pro}(u_i, t_i) + w_{F_0} \cdot C_{F_0}(u_i, u_{i-1}) + w_{env} \cdot C_{env}(u_i, u_{i-1}) + w_{spec} \cdot C_{spec}(u_i, u_{i-1}) + w_{app} \cdot C_{app}(u_i, t_i), \qquad (3.1)$$

$$w_{pro} + w_{F_0} + w_{env} + w_{spec} + w_{app} = 1, \qquad (3.2)$$

Table 3.1. Sub-cost functions

  Source information        Prosody (F0, duration)      $C_{pro}$
                            F0 discontinuity            $C_{F_0}$
  Vocal tract information   Phonetic environment        $C_{env}$
                            Spectral discontinuity      $C_{spec}$
                            Phonetic appropriateness    $C_{app}$
where $t_i$ denotes a target phoneme. All sub-costs are normalized so that they have positive values with the same mean. These sub-cost functions are described in the following subsections. $w_{pro}$, $w_{F_0}$, $w_{env}$, $w_{spec}$, and $w_{app}$ denote the weights for the individual sub-costs; in this thesis, these weights are equal, i.e. 0.2. The preceding segment $u_{i-1}$ is a candidate segment for the $(i-1)$-th target phoneme $t_{i-1}$. When the candidate segments $u_{i-1}$ and $u_i$ are connected in the corpus, no concatenation between the two segments is performed. Figure 3.2 shows the targets and segments used to calculate each sub-cost in the calculation of the cost of a candidate segment $u_i$ for a target $t_i$.

Figure 3.2. Targets and segments used to calculate each sub-cost in the calculation of the cost of a candidate segment $u_i$ for a target $t_i$. $t_i$ and $u_i$ denote the phonemes regarded as the target and the candidate segment, respectively.
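As an illustration, the local cost can be sketched as follows; the five sub-cost functions are passed in as callables, since their definitions occupy the following subsections and the appendices.

```python
def local_cost(u, u_prev, t, sub, w=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Equations (3.1)-(3.2). sub: object with methods pro(u, t), f0(u, u_prev),
    env(u, u_prev), spec(u, u_prev), and app(u, t); w: sub-cost weights,
    equal (0.2) in this thesis and constrained to sum to 1."""
    assert abs(sum(w) - 1.0) < 1e-9
    return (w[0] * sub.pro(u, t) + w[1] * sub.f0(u, u_prev)
            + w[2] * sub.env(u, u_prev) + w[3] * sub.spec(u, u_prev)
            + w[4] * sub.app(u, t))
```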
3.2.2 Sub-cost on prosody: Cpro
This sub-cost captures the degradation of naturalness caused by the difference in
prosody (F0 contour and duration) between a candidate segment and the target.
In order to calculate the difference in the F0 contour, a phoneme is divided
into several parts, and the difference in an averaged log-scaled F0 is calculated in
each part. In each phoneme, the prosodic cost is represented as an average of the
costs calculated in these parts. The sub-cost Cpro is given by
$$C_{pro}(u_i, t_i) = \frac{1}{M} \sum_{m=1}^{M} P\left( D_{F_0}(u_i, t_i, m),\, D_d(u_i, t_i) \right), \qquad (3.3)$$

where $D_{F_0}(u_i, t_i, m)$ denotes the difference in the averaged log-scaled F0 in the $m$-th divided part; in unvoiced phonemes, $D_{F_0}$ is set to 0. $D_d$ denotes the difference in duration, which is calculated for each phoneme and used in the calculation of the cost in each part. $M$ denotes the number of divisions, and $P$ denotes the nonlinear function described in Appendix B.
The function P was determined from the results of perceptual experiments
on the degradation of naturalness caused by prosody modification, assuming that
the output speech was synthesized with prosody modification. When prosody
modification is not performed, the function should be determined based on other
experiments on the degradation of naturalness caused by using a different prosody
from that of the target.
3.2.3 Sub-cost on F0 discontinuity: CF0
This sub-cost captures the degradation of naturalness caused by an F0 disconti-
nuity at a segment boundary. The sub-cost $C_{F_0}$ is given by

$$C_{F_0}(u_i, u_{i-1}) = P\left( D_{F_0}(u_i, u_{i-1}),\, 0 \right), \qquad (3.4)$$

where $D_{F_0}$ denotes the difference in log-scaled F0 at the boundary. $D_{F_0}$ is set to 0 at unvoiced phoneme boundaries. In order to normalize the dynamic range of the sub-cost, we utilize the function $P$ of Equation (3.3). When the segments $u_{i-1}$ and $u_i$ are connected in the corpus, this sub-cost becomes 0.
3.2.4 Sub-cost on phonetic environment: Cenv
This sub-cost captures the degradation of naturalness caused by the mismatch of
phonetic environments between a candidate segment and the target. The sub-cost
$C_{env}$ is given by

$$C_{env}(u_i, u_{i-1}) = \{ S_s(u_{i-1}, E_s(u_{i-1}), u_i) + S_p(u_i, E_p(u_i), u_{i-1}) \}/2 \qquad (3.5)$$

$$= \{ S_s(u_{i-1}, E_s(u_{i-1}), t_i) + S_p(u_i, E_p(u_i), t_{i-1}) \}/2, \qquad (3.6)$$

where Equation (3.5) becomes Equation (3.6) because the phoneme of $u_i$ equals that of $t_i$ and the phoneme of $u_{i-1}$ equals that of $t_{i-1}$. $S_s$ denotes the sub-cost function capturing the degradation of naturalness caused by a mismatch with the succeeding environment, and $S_p$ denotes that caused by a mismatch with the preceding environment. $E_s$ and $E_p$ denote the succeeding and preceding phonemes in the corpus, respectively. Therefore, $S_s(u_{i-1}, E_s(u_{i-1}), t_i)$ denotes the degradation caused by the mismatch with the succeeding environment of the phoneme for $u_{i-1}$, i.e. replacing $E_s(u_{i-1})$ with the phoneme for $t_i$, and $S_p(u_i, E_p(u_i), t_{i-1})$ denotes the degradation caused by the mismatch with the preceding environment of the phoneme for $u_i$, i.e. replacing $E_p(u_i)$ with the phoneme for $t_{i-1}$. The sub-cost functions $S_s$ and $S_p$ are determined from the results of the perceptual experiments described in Appendix C.
Even if a mismatch of phonetic environments does not occur, the sub-cost
does not necessarily become 0 because this sub-cost reflects the difficulty of con-
catenation caused by the uncertainty of segmentation. When the segments $u_{i-1}$ and $u_i$ are connected in the corpus, this sub-cost is set to 0.
3.2.5 Sub-cost on spectral discontinuity: Cspec
This sub-cost captures the degradation of naturalness caused by the spectral
discontinuity at a segment boundary. This sub-cost is calculated as the weighted
sum of mel-cepstral distortion between frames of a segment and those of the
preceding segment around the boundary. The sub-cost Cspec is given by
$$C_{spec}(u_i, u_{i-1}) = c_s \cdot \sum_{f=-w/2}^{w/2-1} h(f)\, MCD(u_i, u_{i-1}, f), \qquad (3.7)$$

where $h$ denotes a triangular weighting function of length $w$. $MCD(u_i, u_{i-1}, f)$ denotes the mel-cepstral distortion between the $f$-th frame from the concatenation frame ($f = 0$) of the preceding segment $u_{i-1}$ and the $f$-th frame from the concatenation frame ($f = 0$) of the succeeding segment $u_i$ in the corpus. Concatenation is performed between the $(-1)$-th frame of $u_{i-1}$ and the $0$-th frame of $u_i$. $c_s$ is a coefficient that normalizes the dynamic range of the sub-cost. The mel-cepstral distortion calculated for each frame pair is given by

$$\frac{20}{\ln 10} \sqrt{ 2 \sum_{d=1}^{40} \left( mc_{\alpha}^{(d)} - mc_{\beta}^{(d)} \right)^2 }, \qquad (3.8)$$

where $mc_{\alpha}^{(d)}$ and $mc_{\beta}^{(d)}$ denote the $d$-th order mel-cepstral coefficients of frames $\alpha$ and $\beta$, respectively. Mel-cepstral coefficients are calculated from
the smoothed spectrum analyzed by the STRAIGHT analysis-synthesis method
[58][59]. Then, the conversion algorithm proposed by Oppenheim et al. is used
to convert the cepstrum into the mel-cepstrum [85]. When the segments $u_{i-1}$ and $u_i$ are connected in the corpus, this sub-cost becomes 0.
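A minimal sketch of Equations (3.7) and (3.8), assuming 40th-order mel-cepstra stored as NumPy arrays and taking np.bartlett as one possible realization of the triangular window h.

```python
import numpy as np

def mel_cepstral_distortion(mc_a, mc_b):
    """Equation (3.8): mc_a, mc_b hold coefficients mc^(1)..mc^(40) of two frames."""
    diff = np.asarray(mc_a) - np.asarray(mc_b)
    return (20.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))

def spectral_subcost(frame_pairs, c_s=1.0):
    """Equation (3.7): weighted sum of per-frame-pair MCDs around the boundary.
    frame_pairs: list of (frame of u_{i-1}, frame of u_i) ordered by f."""
    mcd = [mel_cepstral_distortion(a, b) for a, b in frame_pairs]
    h = np.bartlett(len(mcd))          # assumed triangular window of length w
    return c_s * float(h @ np.asarray(mcd))
```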
3.2.6 Sub-cost on phonetic appropriateness: Capp
This sub-cost denotes the phonetic appropriateness and captures the degradation
of naturalness caused by the difference in mean spectra between a candidate
segment and the target. The sub-cost $C_{app}$ is given by

$$C_{app}(u_i, t_i) = c_t \cdot MCD\left( CEN(u_i),\, CEN(t_i) \right), \qquad (3.9)$$

where $CEN$ denotes a mean cepstrum calculated over the frames around the phoneme center, $MCD$ denotes the mel-cepstral distortion between the mean cepstrum of the segment $u_i$ and that of the target $t_i$, and $c_t$ is a coefficient that normalizes the dynamic range of the sub-cost. The mel-cepstral distortion is given
by Equation (3.8). We utilize the mel-cepstrum sequence output from the context-dependent HMMs of the HMM synthesis method [117] to calculate the mean cepstrum of the target, $CEN(t_i)$. In this thesis, this sub-cost is set to 0 for unvoiced phonemes.
3.2.7 Integrated cost
In segment selection, the optimum set of segments is selected from a speech
corpus. Therefore, we integrate local costs for individual segments into a cost for
a segment sequence as shown in Figure 3.3. This cost is defined as an integrated
cost. The optimum segment sequence is selected by minimizing the integrated
cost.

Figure 3.3. Schematic diagram of the function to integrate local costs LC.
The average cost $AC$ is often used as the integrated cost [11][22][25][47][102], and it is given by

$$AC = \frac{1}{N} \sum_{i=1}^{N} LC(u_i, t_i), \qquad (3.10)$$

where $N$ denotes the number of targets in the utterance. $t_0$ ($u_0$) denotes the pause before the utterance and $t_N$ ($u_N$) the pause after the utterance. The sub-costs $C_{pro}$ and $C_{app}$ are set to 0 in pauses. Minimizing the average cost is equivalent to minimizing the sum of the local costs in the selection.
3.3 Concatenation at Vowel Center
Figure 3.4 compares spectrograms of vowel sequences concatenated at a vowel
boundary and at a vowel center. At vowel boundaries, discontinuities can be
observed at the concatenation points. This is because it is not easy to find a
synthesis unit satisfying continuity requirements for both static and dynamic
characteristics of spectral features at once in a restricted-sized speech corpus. At
vowel centers, in contrast, finding a synthesis unit involves only static charac-
teristics, because the spectral characteristics are nearly stable. Therefore, it is
expected that more synthesis units reducing the spectral discontinuities can be
found. As a result, the formant trajectories are continuous at the concatenation
points, and their transition characteristics are well preserved.
In order to investigate the instability of spectral characteristics in the vowel,
the distances of static and dynamic spectral features were calculated between
centroids of individual vowels and all segments of each vowel in a corpus described
in the following subsection. As the spectral feature, we used the mel-cepstrum
described in Section 3.2.5. The results are shown in Figure 3.5. It is clear that the spectral characteristics are more stable around the vowel center than around the vowel boundaries.
From these results, it is assumed that the discontinuities caused by concate-
nating vowels can be reduced if the vowels are concatenated at their centers.
In order to verify this assumption, we need to investigate the effectiveness of concatenation at vowel centers in segment selection. However, it is difficult to
directly show the effectiveness achieved by using the concatenation at vowel cen-
ters since various factors are considered in segment selection. Therefore, we first
investigate this effectiveness in terms of spectral discontinuity, which is one of the
factors considered in segment selection.
In this section, we compare concatenation at vowel boundaries with that at
vowel centers by the mel-cepstral distortion. When a vowel sequence is generated
by concatenating one vowel segment and another vowel segment, the mel-cepstral
distortion caused by the concatenation at the vowel boundary and that at the vowel center are calculated. The vowel center is taken to be the midpoint of each vowel segment's duration.

Figure 3.4. Spectrograms of vowel sequences concatenated at (a) a vowel boundary and (b) a vowel center, for the input phonemes /a o i/.
3.3.1 Experimental conditions
The concatenation methods at a vowel boundary and at a vowel center are shown
in Figure 3.6. We used a speech corpus comprising Japanese utterances of a male
speaker, where segmentation was performed by experts and F0 was revised by
hand. The utterances had a duration of about 30 minutes in total (450 sentences).
Figure 3.5. Statistical characteristics of the static and dynamic spectral features in vowels. "Normalized time" is the time normalized from 0 (preceding phoneme boundary) to 1 (succeeding phoneme boundary) in each vowel segment.
The sampling frequency was 16,000 Hz. The concatenation at vowel boundaries
and that at vowel centers were performed by using all of the vowel sequences in
the corpus. For each segment pair, the weighted sum of mel-cepstral distortion given by Equation (3.7) was calculated, with the coefficient $c_s$ set to 1.

Figure 3.6. Concatenation methods at a vowel boundary and at a vowel center. "V*" denotes any vowel. "Vfh" and "Vlh" denote the first half-vowel and the last half-vowel, respectively.
3.3.2 Experiment allowing substitution of phonetic environment
In this experiment, substitution of phonetic environments was not prohibited. All
segments of “V1” having a vowel as the succeeding phoneme in the corpus were
used, i.e. “V*” in Figure 3.6 shows all vowels.
Figure 3.7 shows frequency distribution of distortion caused by concatena-
tion. Concatenation at vowel centers (“Vowel center”) can generally reduce the
discontinuity caused by the concatenation compared with that at vowel bound-
aries (“Vowel boundary”). In the segment selection, it is important to select the
segments that can reduce not only the spectral discontinuity but also the distor-
tion caused by various factors, e.g. prosodic distance. Therefore, as the frequency
distribution shifts to the left side, the number of segments that can reduce dis-
tortion increases. From this point of view, it was found that segment selection
was better when using concatenation at vowel centers along with substitution of
phonetic environments.
3.3.3 Experiment prohibiting substitution of phonetic environment
In this experiment, substitution of phonetic environments was prohibited. All
segments of “V1” having “V2” as the succeeding phoneme in the corpus were
used, i.e. “V*” = “V2.”
Figure 3.8 shows the frequency distribution of distortion caused by the con-
catenation between vowels that have the same phonetic environment. The distor-
tion caused by the concatenation at vowel centers is almost equal to that at vowel
boundaries. Therefore, the performance of concatenation at vowel centers is the
same as that at vowel boundaries when substitution of phonetic environments is
not performed.
Next, we selected the best type of concatenation in each segment-pair by
allowing both vowel center and vowel boundary concatenations. Frequency dis-
tribution of distortion in this case is shown in Figure 3.8. This approach (“Vowel
boundary & Vowel center”) can reduce the discontinuity in concatenation com-
pared with concatenation performed only at vowel boundaries or only at vowel
centers. This shows that the number of better segments increases by considering
both types of concatenation.
These results clarify that better segment selection can be achieved by consider-
ing not only the concatenation at vowel centers but also that at vowel boundaries
in vowel sequences.
3.4 Segment Selection Algorithm Based on Both Phoneme and Diphone Units
Motivated by the considerations in Section 3.3, we propose a novel segment
selection algorithm based on both phoneme and diphone units. Here, we also
describe the conventional segment selection algorithm based on phoneme units for comparison with the proposed algorithm.
3.4.1 Conventional algorithm
An input sentence, i.e. a target phoneme sequence, is divided into phonemes.
The local costs of candidate segments for each target phoneme are calculated by
Equation (3.1). The optimum set of segments is selected from a speech corpus by
minimizing the average cost given by Equation (3.10), i.e. the sum of the local
costs. As a result, non-uniform units based on phoneme units can be used as
synthesis units.
Figure 3.7. Frequency distribution of distortion caused by concatenation between vowels when substitution of phonetic environment is allowed. "S.D." denotes standard deviation. (Vowel center: mean 8.23, S.D. 1.74; vowel boundary: mean 9.35, S.D. 2.07.)

Figure 3.8. Frequency distribution of distortion caused by concatenation between vowels that have the same phonetic environment. (Vowel boundary & vowel center: mean 7.17, S.D. 1.39; vowel center: mean 7.78, S.D. 1.64; vowel boundary: mean 7.87, S.D. 1.60.)
Figure 3.9. Example of segment selection based on phoneme units. The input sentence is /ts u i y a s/ ("tsuiyas", "spend" in English), i.e. C V V C V C. Concatenation at C-V boundaries is prohibited.
Figure 3.9 shows an example of the conventional segment selection based on
phoneme units. In this thesis, we do not allow C-V concatenation, since the transition from C to V is perceptually very important to intelligibility in Japanese [89][105]. Therefore, segment sequences composed of non-uniform units based on the syllable are selected.
3.4.2 Proposed algorithm
When concatenation is allowed at the vowel centers, the half-vowel segments
dividing the vowel segments are utilized to take account of diphone units. Each
half-vowel segment has a half duration of the original vowel segment.
We assume that candidate segments $u_i^f$ and $u_i^l$ are the first half-vowel segment of the original vowel segment $u1_i$ and the last half-vowel segment of the original vowel segment $u2_i$, respectively. Sub-costs for a target phoneme $t_i$, which is divided into the first half-phoneme $t_i^f$ and the last half-phoneme $t_i^l$, are calculated as follows:
• The $C_{pro}$ sub-cost is calculated as the weighted sum of the sub-costs calculated for the half-vowel segments (see the sketch after this list) and is given by

$$w_f \cdot C_{pro}(u_i^f, t_i^f) + w_l \cdot C_{pro}(u_i^l, t_i^l), \qquad (3.11)$$

$$w_f = \frac{dur(u_i^f)}{dur(u_i^f) + dur(u_i^l)}, \qquad w_l = \frac{dur(u_i^l)}{dur(u_i^f) + dur(u_i^l)}, \qquad (3.12)$$

where the weights $w_f$ and $w_l$ are defined according to the durations of the segments. $dur(u_i^f)$ and $dur(u_i^l)$ denote the durations of the first half-vowel segment $u_i^f$ and the last half-vowel segment $u_i^l$, respectively. In calculating $C_{pro}$ for the half-vowel segments, each half-vowel segment is divided into $M/2$ parts.
• The $C_{F_0}$ sub-cost is calculated as the sum of the sub-costs at the phoneme boundary and at the vowel center and is given by

$$C_{F_0}(u_i^f, u_{i-1}) + C_{F_0}(u_i^l, u_i^f). \qquad (3.13)$$
• The $C_{env}$ sub-cost is calculated as the sum of the sub-costs at the phoneme boundary and at the vowel center and is given by

$$C_{env}(u_i^f, u_{i-1}) + C_{env}(u_i^l, u_i^f) = C_{env}(u1_i, u_{i-1}) + \{ S_s^d(u1_i, E_s(u1_i), t_{i+1}) + S_p^d(u2_i, E_p(u2_i), t_{i-1}) \}/2, \qquad (3.14)$$

where the phonetic environments of a half-vowel segment are equal to those of the original vowel segment from which it was divided. On the other hand, the sub-cost functions $S_s^d$ and $S_p^d$ for concatenation at vowel centers are not equal to the sub-cost functions $S_s$ and $S_p$ for concatenation at phoneme boundaries, which are given by Equation (3.6).
• The $C_{spec}$ sub-cost is calculated as the sum of the sub-costs at the phoneme boundary and at the vowel center and is given by

$$C_{spec}(u_i^f, u_{i-1}) + C_{spec}(u_i^l, u_i^f). \qquad (3.15)$$
• The $C_{app}$ sub-cost is calculated as the weighted sum of the sub-costs calculated for the original vowel segments $u1_i$ and $u2_i$, from which the half-vowel segments $u_i^f$ and $u_i^l$ were divided, and is given by

$$w_f \cdot C_{app}(u1_i, t_i) + w_l \cdot C_{app}(u2_i, t_i), \qquad (3.16)$$

where the weights $w_f$ and $w_l$ are given by Equation (3.12).
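As promised in the first item above, the duration-based weighting of Equations (3.11) and (3.12) can be sketched as follows, with dur() and C_pro() assumed to be available callables.

```python
def halfvowel_pro_cost(u_f, u_l, t_f, t_l, C_pro, dur):
    """u_f, u_l: first/last half-vowel segments; t_f, t_l: half-phoneme targets."""
    total = dur(u_f) + dur(u_l)
    w_f, w_l = dur(u_f) / total, dur(u_l) / total         # Equation (3.12)
    return w_f * C_pro(u_f, t_f) + w_l * C_pro(u_l, t_l)  # Equation (3.11)
```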
Figure 3.10. Targets and segments used to calculate each sub-cost in the calculation of the cost of candidate segments $u_i^f$ and $u_i^l$ for a target $t_i$.
Figure 3.10 shows the targets and segments used to calculate each sub-cost in the calculation of the cost of candidate segments $u_i^f$ and $u_i^l$ for a target $t_i$. If a diphone unit is used, the sub-costs $C_{env}(u_i^f, u_{i-1})$, $C_{spec}(u_i^f, u_{i-1})$, and $C_{F_0}(u_i^f, u_{i-1})$ become 0, since $u_{i-1}$ and $u_i^f$ are connected in the corpus. Furthermore, if a phoneme unit is used, the sub-costs $C_{env}(u_i^l, u_i^f)$, $C_{spec}(u_i^l, u_i^f)$, and $C_{F_0}(u_i^l, u_i^f)$ also become 0, since $u_i^f$ and $u_i^l$ are connected in the corpus.
For the other phonemes, where concatenation is not allowed at the center, the costs are calculated in the same way as in the conventional algorithm. The
optimum segment sequence is selected from a speech corpus by minimizing the
average cost. As a result, non-uniform units based on both phoneme units and
diphone units can be used as synthesis units.
Phoneme units and diphone units have their own advantages and disadvan-
tages. The advantage of using phoneme units is the ability to preserve the char-
acteristics of phonemes. Moreover, it might be assumed that slight spectral dis-
continuities in the transitions in which spectral features change rapidly are hard
to perceive. However, it is not easy to find synthesis units that can reduce the
spectral discontinuities since static and dynamic characteristics of spectral fea-
tures should be considered in the transitions. On the other hand, the advantage
of using diphone units is the ability to preserve transitions between phonemes
and to concatenate at steady parts in phonemes. Therefore, more synthesis units
reducing the spectral discontinuities can be found. However, it might be assumed
that spectral discontinuities in steady parts are easy to perceive. In the proposed
algorithm, the cost is used to determine which of the two units is better.
We allow concatenations at vowel centers not only in transitions from V to V
but also in transitions from V to a semivowel or a nasal. In the transitions from
V to a semivowel or a nasal, diphone units that start from the center of a vowel in
front of consonants are used. In this thesis, the half-vowel segments are not used
except for the segments having silences as phonetic environments. Therefore,
minimum units preserve either the important transitions between phonemes or
the characteristics of Japanese syllables.
An example of the proposed segment selection algorithm is shown in Figure
3.11. Diphone units such as /ts-u/, /u-i/, and /i-y/ as well as phoneme units
are considered in the segment selection.

Figure 3.11. Example of segment selection based on phoneme units and diphone units for the input sentence /ts u i y a s/, where /y/ is a semivowel. Concatenation at C-V boundaries and selection of isolated half-vowels are prohibited.
Figure 3.12. Example of segment selection based on half-phoneme units.
3.4.3 Comparison with segment selection based on half-phoneme units
Our TTS under development is mainly for Japanese speech synthesis. Since
the number of Japanese syllables is much smaller than that of English syllables,
we can construct a speech corpus containing all of the syllables. Therefore, we
restrict minimum synthesis units to syllables or diphones in order to preserve
important transitions. Namely, we don’t use half-phoneme units as used in the
AT&T NextGen TTS system [28]. An example of a segment selection algorithm
based on half-phoneme units is shown in Figure 3.12.
The proposed algorithm can be considered an algorithm based on half-phoneme
units adapted for Japanese speech synthesis by restricting some types of concate-
nation. However, it might be assumed that half-phoneme units would also work
well for Japanese speech synthesis. Therefore, we compared the proposed algo-
rithm with the conventional algorithm based on half-phoneme units. In order to make a fair comparison, the weight $w_{env}$ in Equation (3.1) was set to 0, since we have not yet determined $S_s$ and $S_p$ in the sub-cost $C_{env}$ for most of the half-phoneme units. As a result of a preference test on the naturalness of synthetic
speech, the 95% confidence interval of the proposed algorithm’s preference score
was 66.67 ± 3.98%. This result shows that as the units used in segment selection become shorter, the risk of causing more audible discontinuities through excessive concatenation becomes higher, even though more combinations of units can be considered to reduce the prosodic difference. Therefore, it is important to use
a cost that is accurate enough to capture the audible discontinuities, especially
in segment selection based on short units.
3.5 Experimental Evaluation
In order to evaluate the performance of the proposed algorithm, we compared
the proposed algorithm with the conventional algorithm, which allows concate-
nation only at phoneme boundaries. Moreover, we also compared the proposed
algorithm with another conventional algorithm, which allows concatenation only
at vowel centers in V-V, V-S, and V-N sequences. We call the former comparison
Experiment A and the latter comparison Experiment B.
3.5.1 Experimental conditions
We used the speech corpus comprising Japanese utterances of a male speaker,
which is described in Section 3.3.1.
A preference test was performed with synthesized speech of 10 Japanese sen-
tences. The sentences used in Experiment A were different from those in Experiment B. These sentences were not from the speech corpus used in the segment
selection. The speech was synthesized by the proposed segment selection algo-
rithm or the conventional algorithm in each experiment. The natural prosody
and the mel-cepstrum sequences extracted from the original utterances were used
to investigate the performance of the segment selection algorithms. In Experiment A, the synthesized speech comprised 366 phonemes in total, and the number of concatenations for each algorithm is shown in Table 3.2. In Experiment B, the synthesized speech comprised 453 phonemes in total, and the number of concatenations for each algorithm is shown in Table 3.3.
The speech was synthesized with prosody (F0 contour, duration, and power)
modification by using STRAIGHT. Ten listeners participated in the experiment.
In each trial, a pair of utterances synthesized with the proposed algorithm and
the conventional algorithm was presented in random order, and the listeners were
asked to choose either of the two types of synthetic utterances as sounding more
natural.
Table 3.2. Number of concatenations in the experiment comparing the proposed algorithm with segment selection based on phoneme units. "S" and "N" denote semivowel and nasal; "Center" denotes concatenation at a vowel center.

  Concatenation            V-C   V-V   V-S   V-N   Center
  Proposed algorithm       125     6     3    11       25
  Conventional algorithm   124    16     3    20        -

Table 3.3. Number of concatenations in the experiment comparing the proposed algorithm with segment selection allowing concatenation only at vowel centers in V-V, V-S, and V-N sequences.

  Concatenation            V-C   V-V   V-S   V-N   Center
  Proposed algorithm       137    10     6    23       22
  Conventional algorithm   143     -     -     -       62
3.5.2 Experimental results
Results of Experiment A and those of Experiment B are shown in Figure 3.13.
The preference score of the proposed algorithm in Experiment A was 69.25%,
and that in Experiment B was 64.25%. In both experiments, the preference
scores of the proposed algorithm exceeded 50% by a large margin. These results demonstrate that the proposed algorithm synthesizes more natural-sounding speech than the conventional algorithms.

Figure 3.13. Results of the comparison with segment selection based on phoneme units ("Exp. A") and with segment selection allowing concatenation only at vowel centers in V-V, V-S, and V-N sequences ("Exp. B"). Error bars show the 95% confidence interval.
3.6 Summary
In this chapter, we proposed a novel segment selection algorithm for Japanese
speech synthesis with both phoneme and diphone units to avoid the degrada-
tion of naturalness caused by concatenation at perceptually important transi-
tions between phonemes. In the proposed algorithm, non-uniform units allowing
concatenation not only at phoneme boundaries but also at vowel centers can be
selected from a speech corpus. The experimental results of concatenation of vowel
sequences clarified that better segments reducing the spectral discontinuities in-
creases by considering both types of concatenation. We also performed perceptual
experiments. The results showed that speech synthesized with the proposed al-
gorithm has better naturalness than that of the conventional algorithms.
Although we restrict minimum synthesis units to syllables or diphones, these
units are not always the best units. The best unit definition is expected to be
determined according to various factors, e.g. corpus size, correspondence of the
cost to perceptual characteristics, synthesis methods, and the kinds of languages.
Consequently, we need to investigate various units based on this view. Moreover,
we need to clarify the effectiveness of searching optimal concatenation frames [27]
after segment selection, although we have no acoustic measure that is accurate
enough to capture perceptual characteristics.
Chapter 4
An Evaluation of Cost
Capturing Both Total and Local
Degradation of Naturalness for
Segment Selection
In this chapter, we evaluate various costs for a segment sequence in terms of
correspondence of the cost to perceptual scores determined from results of per-
ceptual experiments on the naturalness of synthetic speech. The results show that
the conventional average cost, which shows the degradation of naturalness over
the entire synthetic speech, has better correspondence to the perceptual scores
than the maximum cost, which shows the local degradation of naturalness. Fur-
thermore, it is shown that RMS (Root Mean Square) cost, which is affected by
both the average cost and the maximum cost, has the best correspondence. We
also clarify that the naturalness of synthetic speech can be slightly improved by
utilizing the RMS cost. Then, we investigate the effect of using the RMS cost
for segment selection. From the results of experiments comparing this approach
with segment selection based on the average cost, it is found that (1) in segment
selection based on the RMS cost a larger number of concatenations causing slight
local degradation are performed to avoid concatenations causing greater local
degradation, and (2) the effect of the RMS cost has little dependence on the size
of the corpus.
4.1 Introduction
In corpus-based concatenative TTS, speech synthesis based on segment selection
has recently become the focus of much work on synthesis [24][102]. In segment
selection, the optimum set of segments is selected from a speech corpus by mini-
mizing the integrated cost for a segment sequence, which is described in Section
3.2.7. Therefore, it is important to utilize a cost that corresponds to the percep-
tual characteristics to synthesize speech naturally [70][113]. However, such a cost
has not been found so far [31][66][103][115]. To realize TTS with high quality and
robustness, it is necessary to improve this cost function [25][86].
In the design process of the cost function, however, it is doubtful whether
this correspondence is preserved, since there are some approximations, e.g. uti-
lization of acoustic measures that are not accurate enough to capture perceptual
characteristics [64][101], and independence among various factors. Moreover, it
might be assumed that local degradation of naturalness would have a great ef-
fect on the naturalness of synthetic speech, although the average cost, which
shows the degradation of naturalness over the entire synthetic speech, is often
used [11][22][47]. Therefore, a direct investigation of the relationship between a
perceptual measure and the cost is worthwhile.
Here, we clarify the correspondence of our cost described in Section 3.2 to
the perceptual scores. Then various functions to integrate the costs are evaluated
in terms of correspondence to the MOS (Mean Opinion Score) determined from
the results of perceptual experiments on the naturalness of synthetic speech. As
a result, we show that the RMS (Root Mean Square) cost, which is affected by
both the average cost and the maximum cost showing the local degradation of
naturalness, has the best correspondence to the perceptual scores. We also clarify
that the naturalness of synthetic speech can be slightly improved by using the
RMS cost in segment selection.
In order to investigate the effect of considering not only the total degradation
of naturalness of synthetic speech but also the local degradation in segment se-
lection, we compare segment selection based on the RMS cost with that based on
the average cost. Selected segment sequences are analyzed from various points
of view to clarify how the local degradation of naturalness can be alleviated by
utilizing the RMS cost. We also clarify the relationship between the effectiveness
of the RMS cost and the size of the corpus.
This chapter is organized as follows. In Section 4.2, various integrated costs
for the segment selection are described. In Section 4.3, we present perceptual
evaluations of the costs. In Section 4.4, the effectiveness of utilizing the RMS
cost in segment selection is discussed. Finally, we summarize this chapter in
Section 4.5.
4.2 Various Integrated Costs
In the conventional segment selection [11][22][25][47][102], the optimum set of
segments is selected from a speech corpus by minimizing the average cost AC
given by Equation (3.10). The average cost shows the degradation of naturalness
over the entire synthetic utterance. Therefore, a segment with a large cost can be included in the output segment sequence even when that sequence is optimal in terms of the average cost.
It might be assumed that the largest cost in the sequence, i.e. the local degra-
dation of naturalness, would have much effect on the degradation of naturalness
in synthetic speech. To investigate this issue, let us define the maximum cost
MC as the integrated cost given by
$$MC = \max_{i} \{ WC(u_i, t_i) \}, \quad 1 \le i \le N, \qquad (4.1)$$

where $N$ denotes the number of targets in the utterance.

In order to evaluate the various integrated costs, we utilize the norm cost $NC_p$, given by

$$NC_p = \left[ \frac{1}{N} \sum_{i=1}^{N} \{ WC(u_i, t_i) \}^p \right]^{\frac{1}{p}}, \qquad (4.2)$$

where $p$ denotes a power coefficient. When $p$ is set to 1, the norm cost equals the average cost; when $p$ is set to infinity, it equals the maximum cost. Thus, the norm cost takes into account both the mean value and the maximum value through the power coefficient. In this chapter, we find an optimum value of the power coefficient from the results of perceptual experiments.
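A minimal sketch of the norm cost of Equation (4.2): p = 1 recovers the average cost, large p approaches the maximum cost, and p = 2 is the RMS cost discussed below.

```python
import numpy as np

def norm_cost(local_costs, p=2.0):
    lc = np.asarray(local_costs, dtype=float)
    return float(np.mean(lc ** p) ** (1.0 / p))

# For local costs [0.1, 0.2, 0.9]:
#   norm_cost(..., p=1.0) = 0.40  (average cost)
#   norm_cost(..., p=2.0) = 0.54  (RMS cost, pulled upward by the outlier 0.9)
```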
4.3 Perceptual Evaluation of Cost
4.3.1 Correspondence of cost to perceptual score
We performed an opinion test on the naturalness of the synthetic speech. In
order to select a proper set of test stimuli, a large number of utterances were
synthesized by varying the corpus size from 0.5 to 32 hours (0.5, 0.7, 1, 1.4, 2,
2.8, 4, 5.7, 8, 11.3, 16, 22.6, 32). Each utterance consisted of a part of a sentence
that was divided by a pause. We synthesized 14,926 utterances that were not
included in the corpus, from which we selected a set of 140 stimuli so that the
set covers a wide field in terms of both average cost and maximum cost. This
selection was performed under the restriction that the number of phonemes in an
utterance, the duration of an utterance, and the number of concatenations are
roughly equal among the selected stimuli. The distribution of the average cost
and the maximum cost for all synthetic utterances and selected test stimuli are
shown in Figure 4.1 and Figure 4.2, respectively.
Natural prosody and the mel-cepstrum sequences extracted from the original
utterances were used as input information for segment selection. In order to alle-
viate audible discontinuity at the boundary between vowel and voiced phoneme,
concatenation at the preceding vowel center is also allowed in the segment selec-
tion. In the waveform synthesis, signal processing for prosody modification was
not performed except for power control. Therefore, we should use a sub-cost on prosody $C_{pro}$ different from that described in Section 3.2.2, where the function $P$ was determined for the case of performing prosody modification. For the case of not performing prosody modification, however, we have not determined any function $P$. Therefore, we approximate $C_{pro}$ by the function $P$ described in Appendix C.
Eight listeners participated in the experiment. They evaluated the naturalness
on a scale of seven levels, namely 1 (very bad) to 7 (very good). These levels
were determined by each listener, so the scores were distributed widely among
the stimuli. The perceptual score, here the MOS, was calculated as an average of
the normalized score calculated as a Z-score (mean = 0, variance = 1) for each
listener in order to equalize the score range among listeners.
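The per-listener normalization can be sketched as follows, assuming a complete listener-by-stimulus matrix of raw ratings; the names are illustrative.

```python
import numpy as np

def normalized_mos(scores):
    """scores[l, s]: listener l's raw rating (1-7) of stimulus s."""
    scores = np.asarray(scores, dtype=float)
    # Z-score within each listener (mean 0, variance 1 over that listener's ratings)
    z = (scores - scores.mean(axis=1, keepdims=True)) / scores.std(axis=1, keepdims=True)
    return z.mean(axis=0)   # average across listeners: one score per stimulus
```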
Figure 4.1. Distribution of average cost and maximum cost for all synthetic utterances.

Figure 4.2. Scatter chart of the selected test stimuli.

Figure 4.3. Correlation coefficient between norm cost and perceptual score as a function of the power coefficient, p.

Figure 4.3 shows the correlation coefficient between the norm cost and the perceptual score as a function of the power coefficient. The average cost
(p = 1) has better correspondence to the perceptual scores (correlation coeffi-
cient = −0.808) than does the maximum cost (correlation coefficient = −0.685).
Therefore, the naturalness of synthetic speech is better estimated by the degra-
dation of naturalness over the entire synthetic utterance than by using only the
local degradation of naturalness. Figure 4.4 shows the correlation between the
average cost and the perceptual score, and Figure 4.5 shows the correlation
between the maximum cost and the perceptual score.
Moreover, when the power coefficient is set to 2, the norm cost, called the
Root Mean Square (RMS) cost, has the best correspondence to the perceptual
scores (correlation coefficient = −0.840). The absolute value of the correlation
coefficient in the case of the RMS cost is statistically larger than those in the
cases of the average cost (t = 2.4696, df = 137, p < 0.05). Therefore, the natural-
ness of synthetic speech is better estimated by considering both the degradation
of naturalness over the entire synthetic speech and the local degradation of naturalness, since the RMS cost is affected by both types of degradation. Figure 4.6 shows the correlation between the RMS cost and the perceptual score. Moreover, Figure 4.7 shows the correlation coefficient between the RMS cost and the normalized opinion score for each listener, and Figure 4.8 shows the best and the worst correlations among all listeners.

Figure 4.4. Correlation between average cost and perceptual score (correlation coefficient = −0.808, residual standard error = 0.502).

Figure 4.5. Correlation between maximum cost and perceptual score (correlation coefficient = −0.685, residual standard error = 0.621).

Figure 4.6. Correlation between RMS cost and perceptual score (correlation coefficient = −0.840, residual standard error = 0.462). The RMS cost can be converted into a perceptual score by utilizing the regression line.
In order to estimate the perceptual scores more accurately, we also performed multiple linear regression analysis, using as predictor variables the norm costs with the power coefficient varied from 1 to 10 together with the maximum cost. As a result, the correspondence to the perceptual scores (multiple correlation coefficient = 0.846) was not statistically improved compared with the correspondence of the RMS cost.
Although the RMS cost is the best integrated cost in this experiment, it is
expected that the best power coefficient depends on the correspondence of the
local cost to perceptual characteristics. It is worth noting that the naturalness
of synthetic speech is better estimated by using both the total degradation of
naturalness and the local degradation than by using only the total degradation.
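For concreteness, the norm-cost family evaluated above can be sketched as follows, assuming the norm cost is the p-th power mean of the local costs, so that p = 1 yields the average cost and p = 2 the RMS cost (function names are ours):

    import numpy as np

    def norm_cost(local_costs, p):
        # p-th power mean of the local costs of one synthetic utterance:
        # the average cost at p = 1 and the RMS cost at p = 2.
        lc = np.asarray(local_costs, dtype=float)
        return np.mean(lc ** p) ** (1.0 / p)

    def correlation_vs_p(per_stimulus_local_costs, scores, p_values=range(1, 11)):
        # Correlation between norm cost and perceptual score as a function
        # of p, in the style of Figure 4.3 (illustrative sketch).
        scores = np.asarray(scores, dtype=float)
        return {p: np.corrcoef([norm_cost(lc, p) for lc in per_stimulus_local_costs],
                               scores)[0, 1]
                for p in p_values}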
4.3.2 Preference test on naturalness of synthetic speech
Figure 4.9 shows an example of local costs of a segment sequence selected by
the conventional average cost and that of another segment sequence selected by
the novel RMS cost. Some large local costs surrounded by circles are shown
in the case of the average cost. On the other hand, such large local costs are
alleviated in the case of the RMS cost. In order to clarify which of the two costs
could select the best segment sequence, we performed a preference test on the
naturalness of synthetic speech. The corpus size was 32 hours, and utterances
used as test stimuli were not included in the corpus. Natural prosody and the
mel-cepstrum sequences extracted from the original utterances were used as input
information for segment selection. Signal processing for prosody modification was
not performed except for power control.
The naturalness of synthetic speech was expected to be nearly equal between
segment sequences having similar costs. Therefore, we used pairs of segment
sequences that had greater cost differences in the test. Each pair is comprised
of the segment sequence selected by the RMS cost and that by the average cost.
In order to fairly compare the performances of these two costs, we selected the
pairs with the larger differences in the average cost as well as those with the
larger differences in the RMS cost. A scatter chart of the test stimuli is shown in
Figure 4.10. The difference in the RMS cost and that in the average cost were
calculated as follows:
RMSC_{ACsel} - RMSC_{RMSCsel},   (4.3)

AC_{ACsel} - AC_{RMSCsel},   (4.4)

where RMSC_{ACsel} and AC_{ACsel} denote the RMS cost and the average cost of the segment sequence selected by minimizing the average cost, respectively, and RMSC_{RMSCsel} and AC_{RMSCsel} denote the RMS cost and the average cost of the segment sequence selected by minimizing the RMS cost, respectively. "Sub-set
A” includes stimulus pairs with larger differences in the RMS cost. On the other
hand, “sub-set B” includes stimulus pairs with larger differences in average cost.
Figure 4.7. Correlation coefficient between RMS cost and normalized opinion score for each listener.

Figure 4.8. Best correlation between RMS cost and normalized opinion score (left figure, correlation coefficient = −0.778) and worst correlation between RMS cost and normalized opinion score (right figure, correlation coefficient = −0.572) in results of all listeners.

Figure 4.9. Examples of local costs of segment sequences selected by the average cost (Av. = 0.294, RMS = 0.412) and by the RMS cost (Av. = 0.316, RMS = 0.360). "Av." and "RMS" show the average and the root mean square of local costs, respectively.
There were 20 stimulus pairs in each sub-set, and the total number of stimulus
pairs was 35, since 5 pairs were included in both sub-sets.
Eight Japanese listeners participated in the experiment. In each trial, syn-
thetic speech by the segment selection based on the average cost and that by
the segment selection based on the RMS cost were presented in random order,
and listeners were asked to choose either of the two types of synthetic speech as
sounding more natural.
The results in Figure 4.11 show that the segment selection based on the
RMS cost can synthesize speech more naturally than that based on the average
cost in all cases: utilizing all stimuli, stimuli in sub-set A only, and stimuli in
sub-set B only. However, this improvement is only slight.
Figure 4.10. Scatter chart of selected test stimuli. Each dot denotes a stimulus pair.
4.3.3 Correspondence of RMS cost to perceptual score
in lower range of RMS cost
We clarified the correspondence of the RMS cost to the perceptual scores when
the size of the corpus was varied in Section 4.3.1. However, our TTS system
utilizes a large-sized corpus that includes many segments with high coverage of both phonetic environment and prosody to synthesize speech more naturally and
consistently. In the case of a large-sized corpus, the RMS costs are expected to be
distributed not in a wide range but in a lower range, since segments causing only a
slight degradation of naturalness can usually be selected. Thus, it is worthwhile
to investigate the correspondence to the perceptual scores in a range of lower
RMS costs.
We performed an opinion test on the naturalness of the synthetic speech to
clarify the correspondence of the RMS cost to the perceptual scores in a lower
range. Test stimuli were included in the region covered in utilizing the 32-hour
corpus, in which the RMS costs were less than 0.4.

Figure 4.11. Preference score for all stimuli, sub-set A, and sub-set B, with 95% confidence intervals (RMS cost vs. average cost).

They were selected from a large
number of utterances synthesized by varying the corpus size. This selection was
performed under the restriction that the number of phonemes in an utterance,
the duration of an utterance, and the number of concatenations were roughly
equal among the selected stimuli. The number of selected stimuli was 160. Eight
Japanese listeners participated in the experiment. They evaluated the naturalness
on a scale of seven levels. These levels were determined by each listener, so the
scores were distributed widely among the stimuli. The perceptual score was
calculated as an average of the normalized score calculated as a Z-score (mean
= 0, variance = 1) for each listener in order to equalize the score range among
listeners.
The correspondence of the RMS cost to the perceptual scores is shown in Fig-
ure 4.12. The correspondence is much worse (correlation coefficient = −0.400)
than that in the case of utilizing stimuli that cover a wide range of the cost (cor-
relation coefficient = −0.840). Therefore, it is obvious that the correspondence
of the RMS cost is inadequate and that we should improve the cost function.
Figure 4.12. Correlation between RMS cost and perceptual score in lower range of RMS cost (correlation coefficient = −0.400; residual standard error = 0.723).
4.4 Segment Selection Considering Both Total Degradation of Naturalness and Local Degradation
In the conventional segment selection based on the average cost, the optimum
segment sequence is selected while taking into account only the total degradation.
In order to consider not only the total degradation but also the local degradation,
we incorporated the RMS cost in segment selection. In this selection, the RMS
cost RMSC is minimized and given by
RMSC = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \{LC_i(u_i, t_i)\}^2}.   (4.5)
In practice, only the sum of the squared local costs is calculated during the selection, since the square root and the factor 1/N do not change the ranking of candidate sequences for a given target; a sketch of this search is given after the cost definitions below.
In this section, in order to represent the costs more easily, we divide the five
sub-costs described in Section 3.2 into two commonly used costs, i.e. a target
cost Ct and a concatenation cost Cc [11][22][47][102]. These costs are given by
C_t(u_i, t_i) = \frac{w_{pro}}{w_t} C_{pro}(u_i, t_i) + \frac{w_{app}}{w_t} C_{app}(u_i, t_i),   (4.6)

C_c(u_i, u_{i-1}) = \frac{w_{env}}{w_c} C_{env}(u_i, u_{i-1}) + \frac{w_{spec}}{w_c} C_{spec}(u_i, u_{i-1}) + \frac{w_{F_0}}{w_c} C_{F_0}(u_i, u_{i-1}),   (4.7)

w_t = w_{pro} + w_{app},   (4.8)

w_c = w_{F_0} + w_{env} + w_{spec},   (4.9)

and then the local cost is written as

LC(u_i, t_i) = w_t C_t(u_i, t_i) + w_c C_c(u_i, u_{i-1}),   (4.10)

w_t + w_c = 1.   (4.11)
We compared the segment selection based on the RMS cost with that based
on the average cost. We utilized 1,131 utterances as an evaluation set. These
utterances were not included in the corpus used for segment selection. In the seg-
ment selection, concatenations at certain phoneme centers, i.e. preceding vowel
centers of voiced phonemes and unvoiced fricative centers, were also allowed in
order to alleviate audible discontinuity.
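As referenced above, the selection itself can be sketched as a Viterbi-style dynamic-programming search that accumulates either the local costs (average-cost selection) or the squared local costs (RMS-cost selection); the interfaces below are illustrative assumptions, not the implementation of our TTS system:

    def select_segments(candidates, local_cost, square=True):
        # candidates: list (one entry per target position) of hashable
        # candidate segments. local_cost(u, v, t) returns the local cost LC
        # of candidate u at target position t given predecessor v (None at
        # the first position). square=True accumulates squared local costs,
        # ranking sequences exactly as the RMS cost of Equation (4.5) for a
        # fixed N; square=False gives the average-cost selection.
        exp = 2 if square else 1
        history = [{u: (local_cost(u, None, 0) ** exp, None) for u in candidates[0]}]
        for t in range(1, len(candidates)):
            best = {}
            for u in candidates[t]:
                acc, prev = min(((history[-1][v][0] + local_cost(u, v, t) ** exp, v)
                                 for v in candidates[t - 1]),
                                key=lambda cv: cv[0])
                best[u] = (acc, prev)
            history.append(best)
        # Trace back the optimum segment sequence.
        u = min(history[-1], key=lambda s: history[-1][s][0])
        path = [u]
        for t in range(len(candidates) - 1, 0, -1):
            u = history[t][u][1]
            path.append(u)
        return path[::-1]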
4.4.1 Effect of RMS cost on various costs
We investigated the effect of the RMS cost on the local cost. Figure 4.13 shows
the local costs as a function of corpus size. In the segment selection based on
the RMS cost (“RMS cost”), the standard deviation of the local cost is smaller
than that of the segment selection based on the average cost (“Average cost”),
although the mean of this cost is slightly worse. This is a consequence of the large
penalty imposed on a segment with a large local cost in the case of the RMS cost.
In order to clarify what causes the decrease in the standard deviation, we
investigated the effects of the RMS cost on both the target cost and the concate-
nation cost. The target cost is shown in Figure 4.14 as a function of corpus size.
Figure 4.13. Local costs as a function of corpus size. Mean and standard deviation are shown.
The mean of the target costs is degraded, and the standard deviation increases
slightly by utilizing the RMS cost. Figure 4.15 shows the concatenation cost as
a function of corpus size. Although the means of concatenation costs are equal
between the average cost and the RMS cost, the standard deviation becomes
smaller by utilizing the RMS cost. The increase in the standard deviation of the
target cost is much smaller than the decrease in the standard deviation of the
concatenation cost. These results show that the decrease in the standard deviation of the local cost is mainly attributable to the concatenation cost. On
the other hand, the mean of the local cost is slightly worse as a consequence of
the degradation of the target cost.
However, it might be assumed that these results would be influenced by the
weights for sub-costs rather than by the local degradation, since we utilized a
weight set in which the weight for the target cost was smaller than that for
the concatenation cost, i.e. wt = 0.4, wc = 0.6 in Equation (4.10). Therefore,
we tried to analyze the effects of utilizing other weight sets.

Figure 4.14. Target cost as a function of corpus size.

Figure 4.15. Concatenation cost as a function of corpus size.

The same results were obtained for all weight sets, in which the ratios of the target cost to the
concatenation cost were set to 1 to 2, 1 to 1.5, 1 to 1, 1.5 to 1, and 2 to 1.
Therefore, the effectiveness mentioned above depends not on the weights for sub-
costs but on the function used to integrate the local costs, which can take account
of the local degradation as well as the total degradation.
4.4.2 Effect of RMS cost on selected segments
In order to clarify what causes the decrease in the standard deviation of the
concatenation cost, we investigated characteristics of the segments selected by
minimizing the RMS cost. The segment length in the number of phonemes is
shown in Figure 4.16 as a function of corpus size. The segment length is shorter
in the selection with the RMS cost compared with that in the case of the average
cost, while the standard deviation is nearly equal.
Moreover, Figure 4.17 shows the segment length in number of syllables as
a function of corpus size for reference, since our segment selection is essentially
based on syllable units, i.e. concatenations at C-V boundaries (C: Consonant and
V: Vowel) are prohibited. In calculating the segment length, the number of syl-
lables for half-phonemes in a diphone unit is set to 0.5 when the syllable is V, or
0.25 when the syllable is comprised of CV. It can be seen that the segment length
is unexpectedly short, only less than 1.4 syllables, although the standard devia-
tion of the segment length is large. In the segment selection, it is not necessarily
best to select the longest segments, since such segments often cause a decrease
in the number of candidate segments and thus cannot always synthesize speech
naturally. The important point is to select the best segment sequence by consid-
ering not only the degradation caused by concatenation but also that caused by
various factors, e.g. prosodic distance. It is also shown that the segment length
becomes shorter as the corpus size increases to a level over 20 hours. This result is
caused by pruning candidate segments to reduce the computational complexity of
segment selection. We perform the pruning process, called pre-selection [29], by
considering the target cost and the mismatch of phonetic environments. Namely,
we do not consider whether segments are connected in the corpus. Therefore, when we use a large-sized corpus that includes many candidate segments having the target phonetic environments, the remaining candidate segments are not always connected in the corpus even if they have the target phonetic environments.
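A minimal sketch of such a pre-selection step (the ranking criterion and cut-off below are illustrative assumptions):

    def pre_select(candidates, target, target_cost, env_mismatch, n_keep=50):
        # Pre-selection [29]: prune the candidate list with cheap criteria
        # (target cost and phonetic-environment mismatch) before the full
        # search; connectivity in the corpus is deliberately not checked.
        ranked = sorted(candidates,
                        key=lambda u: target_cost(u, target) + env_mismatch(u, target))
        return ranked[:n_keep]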
The rate of increase in the number of concatenations is shown in Figure
4.18 as a function of corpus size. This rate is calculated by dividing the num-
ber of concatenations in the case of utilizing the RMS cost by that in the case
of utilizing the average cost. By utilizing the RMS cost, the number of concatenations at a boundary between any phoneme and a voiced consonant ("* - Voiced consonant") decreases for every corpus size. However, the concatenations at both a phoneme cen-
ter (“Phoneme center”) and a boundary between any phoneme and an unvoiced
consonant (“* - Unvoiced consonant”) increase. Figure 4.19 shows the concate-
nation cost in each type of concatenation when the corpus size is 32 hours. The
concatenation between any phoneme and an unvoiced consonant can often reduce
the concatenation cost compared with that between any phoneme and a voiced
consonant, since the former type of concatenation has no discontinuity caused by
concatenating F0s at a segment boundary. It was also found that the concate-
nation at the phoneme center tends to reduce the concatenation cost compared
with the other types of concatenation, although the number of concatenations is
small.
These results show a tendency to avoid performing concatenations that cause
much local degradation of naturalness by instead performing more concatenations
that cause slight audible discontinuity. As a whole, the number of concatenations
increases rather than decreases. Therefore, a larger number of segments with
shorter lengths, which only cause slight local degradation, are selected by utilizing
the RMS cost.
4.4.3 Relationship between effectiveness of RMS cost
and corpus size
In order to clarify the relationship between the effectiveness of using RMS cost and
the corpus size, we investigated the differences in average costs, RMS costs, and
maximum costs between the segment sequences selected by utilizing the average
cost and those by utilizing the RMS cost.
The results are shown in Figure 4.20. The cost differences are calculated
by subtracting the costs of the segment sequences selected by utilizing the RMS
cost from those of the segment sequences selected by utilizing the average cost
as described in Section 4.3.2.

Figure 4.16. Segment length in number of phonemes as a function of corpus size. Mean and 1/10 standard deviation are shown.

Figure 4.17. Segment length in number of syllables as a function of corpus size. Mean and 1/10 standard deviation are shown.

Figure 4.18. Increase rate in the number of concatenations as a function of corpus size. "*" denotes any phoneme.

Figure 4.19. Concatenation cost in each type of concatenation. The corpus size is 32 hours.

Figure 4.20. Differences in costs as a function of corpus size.

From the results, the RMS cost works well for
alleviating the local degradation of naturalness, since the maximum cost becomes
small, i.e. the differences in the maximum cost are positive. Moreover, the
differences in all costs have little dependence on the corpus size. Therefore, the
effectiveness of utilizing the RMS cost can be found in a corpus of any size.
4.4.4 Evaluation of segment selection by estimated perceptual score
The performance of segment selection is reflected in the cost. However, it is difficult
to estimate the naturalness of synthetic speech from the cost value directly. In
order to indicate the performance of segment selection in a more intuitive quantity
than the cost, we converted the cost value into a perceptual score by using the
regression line on the RMS cost shown in Figure 4.6. Chu and Peng have also
estimated the MOS from their cost [25][86].
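Since only the correlation coefficient and the residual standard error of that regression line are reported here, a sketch of this conversion fits the line from data rather than hard-coding its coefficients:

    import numpy as np

    def fit_regression(rms_costs, perceptual_scores):
        # Fit the regression line of Figure 4.6; its slope and intercept
        # are estimated from the perceptual-experiment data.
        slope, intercept = np.polyfit(rms_costs, perceptual_scores, deg=1)
        return slope, intercept

    def estimate_perceptual_score(rms_cost, slope, intercept):
        # Convert an RMS cost into an estimated perceptual score
        # (normalized MOS), as done for Figure 4.21.
        return slope * rms_cost + intercept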
Estimated perceptual score is shown in Figure 4.21 as a function of corpus
size. Small-sized corpora have been constructed so that various phonetic contexts are included.

Figure 4.21. Estimated perceptual score as a function of corpus size.

As the corpus size becomes larger, the estimated perceptual score
becomes higher and its standard deviation becomes smaller. This result means
that the quality of the segment selection becomes higher and more consistent when a larger corpus is utilized.
4.5 Summary
In segment selection for concatenative TTS, it is important to utilize a cost that
corresponds to the perceptual characteristics. In this chapter, we evaluated a
cost for segment selection based on a comparison with perceptual scores deter-
mined from the results of perceptual experiments on the naturalness of synthetic
speech. As a result, we clarified that the average cost, which captures the total
degradation, has a better correspondence to the perceptual scores than does the
maximum cost, which captures the local degradation. Furthermore, we found
that the RMS (Root Mean Square) cost, which takes into account both the aver-
age cost and the maximum cost, has the best correspondence. We also clarified
that the naturalness of synthetic speech could be slightly improved by utilizing
the RMS cost.
We investigated the effect of considering not only the degradation of natu-
ralness over the entire synthetic speech but also local degradation in segment
selection. In this selection, the optimum segment sequences are selected by min-
imizing the RMS cost instead of the conventional average cost. From the results
of experiments comparing this approach with segment selection based on the av-
erage cost, it was found that segment selection based on RMS cost performed a
larger number of concatenations that caused slight local degradation in order to
avoid concatenations causing greater local degradation. Namely, a larger number
of segments with shorter units were selected. Moreover, the effectiveness of this
selection was found for any size of corpus.
When the RMS costs were distributed widely, the correspondence of the RMS
cost seemed to be good. Namely, it was possible to accurately estimate the
perceptual score from the RMS cost for synthetic speech of various qualities.
Therefore, we evaluated the performance of segment selection by the estimated
perceptual score by varying the corpus size. As a result, the quality of the segment
selection was higher and more consistent by utilizing the larger corpus.
We also performed the perceptual evaluation of the RMS cost in a lower range
of the RMS cost. The results clarified that the correspondence of the RMS cost
to the perceptual scores is inadequate in this case. Therefore, it is obvious that
the RMS cost is not accurate enough for making comparisons between similar
segments, which is naturally a difficult problem. However, since our TTS does not
consistently synthesize sufficiently natural speech, we should further improve the
cost function based on perceptual characteristics. In particular, it is necessary to
determine the optimum weight set for sub-costs. We will determine this weight set
from the results of perceptual experiments on the naturalness of synthetic speech
with a set of stimuli covering a wide range in terms of individual sub-costs.
Chapter 5
A Voice Conversion Algorithm
Based on Gaussian Mixture
Model with Dynamic Frequency
Warping
Speech of various speakers can be synthesized by utilizing a voice conversion
technique that can control speaker individuality. The voice conversion algorithm
based on the Gaussian Mixture Model (GMM), which is a conventional statistical
voice conversion algorithm, can convert speech features continuously by using the
correlation between a source feature and a target feature. However, the quality of
the converted speech is degraded because the converted spectrum is excessively
smoothed by the statistical averaging operation. In this chapter, we propose a
GMM-based algorithm with Dynamic Frequency Warping (DFW) to avoid such
over-smoothing. In the proposed algorithm, the converted spectrum is calcu-
lated by mixing the GMM-based converted spectrum and the DFW-based con-
verted spectrum to avoid deterioration of spectral conversion-accuracy. Results
of evaluation experiments clarify that the converted speech by the proposed algo-
rithm has a higher quality than that by the GMM-based algorithm and that the
conversion-accuracy for speaker individuality is the same as that by the GMM-
based algorithm.
5.1 Introduction
The speech of various speakers can be synthesized by preparing speech corpora
of the various speakers. In order to synthesize natural speech consistently, each
corpus must include a sufficient amount of waveform segments having various
phonetic environments and prosodic variety. Therefore, it is necessary to record
a large number of speech utterances for each speaker. However, recording speech
is very difficult and expensive work. Furthermore, it requires an enormous amount of time [63].
Another approach for synthesizing speech of various speakers is speech modi-
fication by a voice conversion technique used to convert one speaker’s voice into
another speaker’s voice [68]. In voice conversion, conversion rules are extracted in
advance through training data comprised of utterance pairs of a source speaker
and a target speaker. By utilizing the conversion rules, any utterance of the
source speaker can be converted into that of the target speaker. A large amount
of speech data is not required in the training. Therefore, speech of any speaker
who is regarded as the target speaker can be synthesized from speech of a specific
speaker who is regarded as the source speaker by only preparing a small number
of the target speaker’s utterances to extract the conversion rules. Namely, by
applying voice conversion to corpus-based TTS, it is possible to synthesize any utterance of any speaker easily. However, the quality of the converted
speech is degraded considerably. Therefore, we aim to develop a high-quality
voice conversion system.
In voice conversion, it is important to realize both high speech quality and
high conversion-accuracy for speaker individuality. Although we can synthesize
high-quality speech using a simple conversion rule, e.g. converting only F0, such
converted speech cannot sound like that of the target speaker. Moreover, even if
the converted speech sounds like that of the target speaker, the speech quality is
not always high, e.g. in telephone speech, the speech quality is degraded although
speaker individuality is kept. Therefore, we need to evaluate the performance of
voice conversion in terms of both speech quality and conversion-accuracy for
speaker individuality.
To continuously represent the acoustic space of a speaker, a voice conver-
sion algorithm based on the Gaussian Mixture Model (GMM) has been proposed
by Stylianou et al. [98][99]. In this GMM-based algorithm, the acoustic space
is modeled by the GMM without the use of vector quantization, and acoustic
features are converted from a source speaker to a target speaker by a mapping
function based on the feature-parameter correlation between two speakers.
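For reference, the mapping function of this class of algorithms is commonly written as follows (our notation; the exact formulation is given in [98][99]):

F(x) = \sum_{i=1}^{m} p_i(x) \left[ \mu_i^y + \Sigma_i^{yx} \left( \Sigma_i^{xx} \right)^{-1} \left( x - \mu_i^x \right) \right],

p_i(x) = \frac{\alpha_i \, \mathcal{N}(x; \mu_i^x, \Sigma_i^{xx})}{\sum_{j=1}^{m} \alpha_j \, \mathcal{N}(x; \mu_j^x, \Sigma_j^{xx})},

where x is a source feature vector, m is the number of mixtures, \alpha_i are the mixture weights, \mu_i^x and \mu_i^y are mean vectors, \Sigma_i^{xx} and \Sigma_i^{yx} are (cross-)covariance matrices, and p_i(x) is the posterior probability that x belongs to the i-th mixture.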
The quality of an analysis-synthesis method is also important for achiev-
ing a high-quality voice conversion since voice conversion is usually performed
with an analysis-synthesis method. As a high-quality analysis-synthesis method,
STRAIGHT has been proposed by Kawahara et al. [58][59], which provides a high-quality vocoder-type framework.
In the GMM-based voice conversion algorithm applied to STRAIGHT, the
quality of converted speech is degraded because the converted spectrum is exces-
sively smoothed by statistical processing. In this chapter, we propose a GMM-
based algorithm with Dynamic Frequency Warping (DFW) [114] to avoid over-
smoothing. However, the spectral conversion-accuracy of the DFW algorithm is
worse than that of the GMM-based algorithm because spectral intensity cannot
be converted by the DFW. Therefore, the converted spectrum in the proposed
algorithm is calculated by mixing the GMM-based converted spectrum and the
DFW-based converted spectrum in order to avoid the deterioration of spectral
conversion-accuracy. We clarify that the proposed algorithm can improve the
naturalness of synthetic speech by experimental evaluation.
This chapter is organized as follows. In Section 5.2, the GMM-based voice
conversion algorithm applied to STRAIGHT is described. In Section 5.3, the
proposed voice conversion algorithm is described, and the effect of mixing con-
verted spectra is described in Section 5.4. In Section 5.5, experimental evalu-
ation is described. Finally, we summarize this chapter in Section 5.6.
5.2 GMM-Based Conversion Algorithm Applied
to STRAIGHT
The mel-cepstrum of the smoothed spectrum analyzed by STRAIGHT [58] is
used as an acoustic feature. In this chapter, the sampling frequency is 16000 Hz and the order of the mel-cepstrum is set to 40, i.e. the quefrency is 2.5 ms. In order to perform voice conversion, the 1st to 40th order cepstral coefficients are converted.
The 0-th order cepstral coefficient is converted so that the power of the converted
speech becomes equal to that of the original speech. As a feature on source
information, the log-scaled fundamental frequency extracted by STRAIGHT [59]
is used.
In the training of the mapping function, we use 50 utterance-pairs comprised
of the source speaker’s speech and the target speaker’s speech. The GMM pa-
rameters for spectral conversion are trained as follows:
Step 1: All utterances are converted into the mel-cepstrum sequences. Then,
silence frames are removed by performing silence detection based on power
information. In each pair, the correspondence between mel-cepstrum se-
quences extracted from two speakers’ utterances is determined by using
Dynamic Time Warping (DTW).
Step 2: GMM parameters for spectral conversion are trained with the time-
aligned mel-cepstrum sequences.
Step 3: Differences between the source speaker’s speech and the target speaker’s
speech have a bad influence on the accuracy of time-alignment. Therefore,
the source speaker’s utterances are converted into the target speaker’s in
order to perform more accurate time-alignment between the mel-cepstrum
sequences. Then, mel-cepstral distortion between two speakers is calculated.
Step 4: Steps 2 and 3 are repeated to refine the GMM parameters until the
mel-cepstral distortion calculated in Step 3 is saturated.
The GMM parameters for F0 conversion are trained as follows:
Step 1: F0s are extracted from all utterances. In each pair, the correspondence
of F0 sequences from two speakers is determined by using time-alignment
of the mel-cepstrum sequences determined in the training of spectral con-
version. Then, frame-pairs including unvoiced frames are removed from the
time-aligned F0 sequences.
Step 2: GMM parameters for F0 conversion are trained with the time-aligned
F0 sequences.
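The two training procedures above can be sketched compactly as follows, assuming a joint-vector GMM as one common realization (the thesis's exact estimation procedure may differ); librosa's DTW and scikit-learn's GaussianMixture stand in for the actual alignment and training code:

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def align(src_mcep, tgt_mcep):
        # Steps 1 and 3: time-align two mel-cepstrum sequences
        # (frames x order) with Dynamic Time Warping.
        _, wp = librosa.sequence.dtw(X=src_mcep.T, Y=tgt_mcep.T)
        wp = wp[::-1]  # warping path from start to end
        return src_mcep[wp[:, 0]], tgt_mcep[wp[:, 1]]

    def train_spectral_gmm(pairs, n_mix=64, n_iters=5):
        # Steps 2 and 4: iterate GMM training and re-alignment until the
        # mel-cepstral distortion saturates. `pairs` is a list of
        # (source_mcep, target_mcep) arrays with silence frames removed.
        gmm = None
        for _ in range(n_iters):
            joint = []
            for src, tgt in pairs:
                # Step 3 would re-align using converted source features;
                # this sketch keeps plain DTW for brevity.
                s, t = align(src, tgt)
                joint.append(np.hstack([s, t]))
            gmm = GaussianMixture(n_components=n_mix,
                                  covariance_type='full').fit(np.vstack(joint))
        return gmm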
5.2.1 Evaluation of spectral conversion-accuracy of GMM-
based conversion algorithm
We evaluated spectral conversion-accuracy of the GMM-based algorithm by the
mel-cepstral distortion given by Equation (3.8). The evaluation set was comprised
of 50 utterances that were not included in the training data. We performed
voice conversions between male A and male B, between female A and female B,
between male A and female A, and between male B and female B. In each case,
we performed two kinds of conversion by alternating the speakers as a source
speaker and a target speaker. The mel-cepstral distortion was calculated in every
frame-pair in which the silence frames were removed. The frame shift is set to
5 ms. The number of Gaussian mixtures is set to 64.
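For reference, a sketch of the distortion measure, assuming Equation (3.8) takes the standard mel-cepstral distortion form over the 1st-40th coefficients (the constant below is that standard form, not a quotation of Equation (3.8)):

    import numpy as np

    def mel_cepstral_distortion(mcep_x, mcep_y):
        # Mean mel-cepstral distortion [dB] between time-aligned frames.
        # mcep_x, mcep_y: arrays of shape (frames, 40), 0th coefficient
        # excluded. Assumed standard form:
        # (10 / ln 10) * sqrt(2 * sum_d (c_d^x - c_d^y)^2).
        diff = np.asarray(mcep_x) - np.asarray(mcep_y)
        per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=-1))
        return per_frame.mean()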
The results (Figure 5.1) clarify that the GMM-based voice conversion can
decrease the acoustic distance, that is, the mel-cepstral distortion, between the
source speaker’s speech and the target speaker’s speech. The distances between
speakers of opposite sex are larger than those between speakers of the same sex in the case of original speech, shown as "Before conversion." However, it can be seen
that the distances become almost equal in all cases of voice conversion shown as
“After conversion.”
5.2.2 Shortcomings of GMM-based conversion algorithm
In the GMM-based algorithm applied to STRAIGHT, the quality of the converted
speech is degraded because the converted spectrum is excessively smoothed by
the statistical averaging operation. Figure 5.2 shows an example of the GMM-
based converted spectrum and the target speaker’s spectrum. As shown in this
figure, there is over-smoothing on the GMM-based converted spectrum.
As mentioned in Section 2.3, most statistical approaches to speaker adapta-
tion can be applied to voice conversion. In speech recognition, it is not necessary
to preserve the information removed by statistical processing. However, in speech
synthesis, such information is important for synthesizing natural speech. There-
fore, it is necessary to improve the adaptation algorithms in order to achieve
high-quality voice conversion.
Figure 5.1. Mel-cepstral distortion. Mean and standard deviation are shown.

Figure 5.2. Example of spectrum converted by GMM-based voice conversion algorithm ("GMM-converted spectrum") and target speaker's spectrum ("Target spectrum").

Figure 5.3. GMM-based voice conversion algorithm with Dynamic Frequency Warping: (1) the source speaker's STRAIGHT spectrum is converted by the GMM-based algorithm; (2) a warping function is calculated by Dynamic Frequency Warping between the source spectrum and the GMM-based converted spectrum; (3) the source spectrum is frequency-warped with this function; (4) the frequency-warped spectrum and the GMM-based converted spectrum are mixed with weight w(f) (0 < w(f) < 1).
5.3 Voice Conversion Algorithm Based on Gaussian Mixture Model with Dynamic Frequency Warping
To avoid over-smoothing, we propose a GMM-based algorithm with DFW. An
overview of the proposed algorithm is shown in Figure 5.3. This procedure is
performed in each frame.
5.3.1 Dynamic Frequency Warping
Spectral conversion is performed with DFW [72][114] to avoid over-smoothing
of the converted spectrum. In this technique, the correspondence between an
original frequency axis and a converted frequency axis is represented by a warping
function. This function is calculated as a path minimizing a normalized mel-
spectral distance between the STRAIGHT logarithmic mel-spectrum of the source
speaker and the GMM-based converted logarithmic mel-spectrum.
The normalized mel-spectral distance D(i, j) is given by

D(i, j) = \frac{\sqrt{g(i, j)}}{i + j},   (5.1)

g(i, j) = \min \begin{cases} g(i-1, j-2) + 2d(i, j-1) + d(i, j), \\ g(i-1, j-1) + 2d(i, j), \\ g(i-2, j-1) + 2d(i-1, j) + d(i, j), \end{cases}   (5.2)

where g(i, j) denotes the accumulated distance at grid point (i, j), and d(i, j) denotes the mel-spectral distance at grid point (i, j), given by

d(i, j) = \left( 20 \log_{10} |S_x(i)| - 20 \log_{10} |S_y(j)| \right)^2,   (5.3)

where S_x(i) denotes the mel-spectrum of the source speaker at the i-th bin, and S_y(j) denotes the mel-spectrum converted by the GMM-based algorithm at the j-th bin. In
this chapter, the number of FFT points is set to 1024, and the number of bins to
calculate the mel-spectral distance is 513. Figure 5.4 shows an example of the
frequency warping function for a certain frame.
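A minimal sketch of this DP search over the frequency-frequency grid (boundary handling is simplified, and the global path constraints of a full DFW implementation are not reproduced):

    import numpy as np

    def dfw_path(log_spec_x, log_spec_y):
        # log_spec_x, log_spec_y: 1-D arrays of 20*log10 magnitudes of the
        # source and GMM-converted mel-spectra over the frequency bins.
        n = len(log_spec_x)
        # Eq. (5.3): squared log-magnitude difference at grid point (i, j).
        d = (np.asarray(log_spec_x)[:, None] - np.asarray(log_spec_y)[None, :]) ** 2
        g = np.full((n, n), np.inf)
        back = np.zeros((n, n, 2), dtype=int)
        g[0, 0] = d[0, 0]
        for i in range(n):
            for j in range(n):
                if i == 0 and j == 0:
                    continue
                moves = []  # Eq. (5.2): three admissible local path shapes
                if i >= 1 and j >= 2:
                    moves.append((g[i-1, j-2] + 2*d[i, j-1] + d[i, j], (i-1, j-2)))
                if i >= 1 and j >= 1:
                    moves.append((g[i-1, j-1] + 2*d[i, j], (i-1, j-1)))
                if i >= 2 and j >= 1:
                    moves.append((g[i-2, j-1] + 2*d[i-1, j] + d[i, j], (i-2, j-1)))
                if moves:
                    cost, prev = min(moves, key=lambda m: m[0])
                    g[i, j], back[i, j] = cost, prev
        # Trace the warping function back from the end of the grid.
        path, ij = [], (n - 1, n - 1)
        while ij != (0, 0):
            path.append(ij)
            ij = tuple(int(v) for v in back[ij])
        path.append((0, 0))
        return path[::-1]  # (original bin, converted bin) pairs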
Synthesizing with DFW-based converted spectra often produces unnatural sound, since such spectra contain unnaturally changing parts. These parts can be removed by smoothing or liftering. Therefore, we apply the smoothing processing used in STRAIGHT [58].
5.3.2 Mixing of converted spectra
The spectral conversion-accuracy of DFW is slightly worse than that of the GMM-
based algorithm because spectral intensity cannot be converted. Therefore, we
also propose that the converted spectrum be calculated by mixing the GMM-
based converted spectrum and the DFW-based converted spectrum in order to
avoid the deterioration of the spectral conversion-accuracy. In the proposed al-
gorithm, the converted spectrum Sc(f ) is written as
|S_c(f)| = \exp\left[ w(f) \ln |S_g(f)| + \{1 - w(f)\} \ln |S_d(f)| \right], \quad \text{subject to } 0 \le w(f) \le 1,   (5.4)
where S_d(f) and S_g(f) denote the DFW-based converted spectrum and the GMM-based converted spectrum, respectively, and w(f) denotes the weight for mixing spectra. As the mixing-weight is closer to 1, the converted spectrum is closer to the GMM-based converted spectrum.

Figure 5.4. Example of frequency warping function.
The results of preliminary experiments clarified that the quality of the con-
verted speech is degraded considerably when a spectrum is excessively smoothed
in the low-frequency regions [110]. Therefore, we use the mixing-weight as follows:
w(f) = \frac{1}{\pi} \left| \frac{2\pi f}{f_s} + 2 \tan^{-1}\left( \frac{a \sin(2\pi f / f_s)}{1 - a \cos(2\pi f / f_s)} \right) \right|, \quad \text{subject to } -1 < a < 1, \; -f_s/2 \le f \le f_s/2,   (5.5)
where fs denotes the sampling frequency, and a denotes the parameter that
changes the mixing-weight. Figure 5.5 shows the variations of the mixing-
weights that correspond to the different parameters a.

Figure 5.5. Variations of mixing-weights that correspond to the different parameters a.

Figure 5.6 shows an example of converted spectra by the GMM-based algorithm ("GMM"), the GMM-based algorithm with DFW ("GMM & DFW"), and the GMM-based algorithm with DFW and the mixing of converted spectra when the parameter a is set to 0 ("GMM & DFW & Mix of spectra"). The converted spectrum by the GMM-based algorithm with DFW has clearer peaks than that by the GMM-based algorithm. The mixed converted spectrum is close to the converted spectrum by the GMM-based algorithm with DFW in the low-frequency regions but close to that by the GMM-based algorithm in the high-frequency regions.
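A direct implementation of Equations (5.4) and (5.5) can be sketched as follows (function names are ours):

    import numpy as np

    def mixing_weight(f, fs=16000.0, a=0.0):
        # Eq. (5.5): mixing weight w(f), the magnitude of a first-order
        # all-pass warping phase normalized to [0, 1].
        omega = 2.0 * np.pi * np.asarray(f, dtype=float) / fs
        return np.abs(omega + 2.0 * np.arctan2(a * np.sin(omega),
                                               1.0 - a * np.cos(omega))) / np.pi

    def mix_spectra(S_g, S_d, freqs, fs=16000.0, a=0.0):
        # Eq. (5.4): mix the GMM-based converted magnitude spectrum S_g and
        # the DFW-based converted magnitude spectrum S_d in the log domain.
        w = mixing_weight(freqs, fs, a)
        return np.exp(w * np.log(S_g) + (1.0 - w) * np.log(S_d))

With a = 0, w(f) grows monotonically from 0 at f = 0 to 1 at the Nyquist frequency, so the mixed spectrum follows the DFW-based result in the low-frequency regions and the GMM-based result in the high-frequency regions, consistent with Figure 5.6.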
5.4 Effectiveness of Mixing Converted Spectra
Evaluation experiments were performed to investigate the effects of the mixing-
weight. We performed objective experiments on spectral conversion-accuracy and
subjective experiments on speech quality and speaker individuality. The amount
of training data was set to 58 sentences. The total duration of this data is
about 4 or 5 minutes. The male-to-male voice conversion and female-to-female
voice conversion were performed, and 10 listeners participated in each subjective
experiment.

Figure 5.6. Example of converted spectra by the GMM-based algorithm ("GMM"), the proposed algorithm without the mix of the converted spectra ("GMM & DFW"), and the proposed algorithm with the mix of the converted spectra ("GMM & DFW & Mix of spectra").
5.4.1 Effect of mixing-weight on spectral conversion-accuracy
In order to investigate the effects of the mixing-weight on spectral conversion-
accuracy, we performed the experimental evaluation on the mel-cepstral distortion
given by Equation (3.8) between the converted speech and the target speech. Ten
sentences that were not included in the training data were used in the evaluation.
The experimental results shown in Figure 5.7 clarify that the conversion-
accuracy of the GMM-based algorithm with DFW (“GMM & DFW”) is worse
than that of the GMM-based algorithm (“GMM”). It is also shown that the
deterioration of the conversion-accuracy can be alleviated by mixing the converted
spectra.
Figure 5.7. Mel-cepstral distortion as a function of parameter a of mixing-weight. "Original speech of source" shows the mel-cepstral distortion before conversion.

As the parameter a of the mixing-weight becomes closer to 1, the mel-cepstral
distortion in the GMM-based algorithm with DFW and the mixing of converted
spectra (“GMM & DFW & Mix of spectra”) becomes small. This is because the
converted spectra by the proposed algorithm include more components of the GMM-based converted spectra than of the DFW-based converted spectra when the parameter a is near 1.
5.4.2 Preference tests on speaker individuality
In order to evaluate the relationship between the conversion-accuracy for speaker
individuality and the parameter a of the mixing-weight, the preference (XAB)
test was performed. Two sentences that were not included in the training data
were used in the evaluation. In the XAB test, X was the synthesized speech by
converting F0 and replacing the source speaker’s spectra with those of the target
speaker, i.e. the perfect spectral conversion. A and B were the converted speech
by the voice conversion algorithms. In each trial, X was presented first, and A
and B were presented in random order. Listeners were asked to choose either A
or B as being the most similar to X.
The preference score of the proposed voice conversion algorithm compared
with the GMM-based algorithm is shown in Figure 5.8. The conversion-accuracy
for speaker individuality of the GMM-based algorithm with DFW (“GMM &
DFW”) tends to be slightly worse than that of the GMM-based algorithm. How-
ever, the conversion-accuracy becomes equal to that of the GMM-based algo-
rithm by mixing the converted spectra (“GMM & DFW & Mix of spectra”).
The tendency of these results is similar to that of the experimental results for
the mel-cepstral distortion. However, even if the mel-cepstral distortion is de-
graded, the conversion-accuracy for speaker individuality is not always degraded.
This demonstrates that the mel-cepstral distortion cannot capture the conversion-
accuracy for speaker individuality with sufficient accuracy.
5.4.3 Preference tests on speech quality
In order to evaluate the relationship between the quality of the converted speech
and the parameter a of the mixing-weight, the preference test was performed.
Four sentences that were not included in the training data were used in the eval-
uation. In each trial, stimuli synthesized by the two voice conversion algorithms
were presented in random order. Listeners were asked to choose the converted
speech having better speech quality.
The experimental results are shown in Figure 5.9. The converted speech
quality of the GMM-based algorithm with DFW is significantly better than that
of the GMM-based algorithm (“GMM”). As the mixing-weight becomes close to
1, the converted speech quality of the GMM-based algorithm with DFW and the
mix of the converted spectra (“GMM & DFW & Mix of spectra”) becomes worse
compared with the GMM-based algorithm with DFW. These results show the
opposite tendency to that of the conversion-accuracy for speaker individuality.
The results of the two experiments clarify that both the converted speech qual-
ity and the conversion-accuracy for speaker individuality show good performance
when the parameter a of the mixing-weight is set to 0 or −0.5.
Figure 5.8. Relationship between conversion-accuracy for speaker individuality and parameter a of mixing-weight. A preference score of 50% shows that the conversion-accuracy is equal to that of the GMM-based algorithm, which provides good performance in terms of speaker individuality.
5.5 Experimental Evaluation
In order to evaluate the performance of the proposed algorithm, we performed
subjective evaluation experiments on speaker individuality and speech quality.
The experimental conditions were the same as those described in the previous
section. The parameter a was set to 0 and −0.5 in the male-to-male and the
female-to-female voice conversions, respectively.
Figure 5.9. Relationship between converted speech quality and parameter a of mixing-weight. A preference score of 50% shows that the converted speech quality is equal to that of the GMM-based algorithm with DFW, which provides good performance in terms of speech quality.

5.5.1 Subjective evaluation of speaker individuality

In order to evaluate the conversion-accuracy for speaker individuality of the proposed algorithm, a preference (ABX) test was performed. In the ABX test, A and B were the source and target speaker's speech, respectively, and X was one
of the types of converted speech as follows:
• converted speech by the proposed algorithm · · · “Proposed algorithm”,
• converted speech by the GMM-based algorithm · · · “GMM”,
• synthesized speech by performing only the F0 conversion · · · “F0 only”,
• synthesized speech by performing the F0 conversion and replacing the source
speaker’s spectrum with that of the target speaker · · · “F0 & Spectrum”,
• target speaker’s speech synthesized by STRAIGHT · · · “STRAIGHT.”
“F0 & Spectrum” were used to evaluate the conversion-accuracy for speaker indi-
viduality when the spectral conversion was performed perfectly.

Figure 5.10. Correct response for speaker individuality.

"STRAIGHT"
was used to evaluate the conversion-accuracy when both the spectral conversion
and the conversion of source information were performed perfectly. In each trial,
A and B were presented in random order, and then X was presented. Listeners
were asked to choose either A or B as being the most similar to X. Two sentences
that were not included in the training data were used in the evaluation.
The experimental results are shown in Figure 5.10. The conversion-accuracy
for speaker individuality of the proposed algorithm (“Proposed algorithm”) is the
same as that of the GMM-based algorithm (“GMM”). The results clarify that the
conversion-accuracy of only the F0 conversion (“F0 only”) is insufficient, and it
can be improved by converting spectra. The difference in the conversion-accuracy
between the “F0 & Spectrum” and the proposed algorithm shows that the spectral
conversion is not sufficient. Moreover, the difference between STRAIGHT and
“F0 & Spectrum” shows that conversion of prosody should be improved in order
to achieve better performance.
Figure 5.11. Mean Opinion Score ("MOS") for speech quality, with 95% confidence intervals and standard deviations; the right axis shows the equivalent Q value [dB].
5.5.2 Subjective evaluation of speech quality
In order to evaluate the quality of the converted speech by the proposed algorithm,
an opinion test was performed. An opinion score for evaluation was set to a 5-
point scale (5: excellent, 4: good, 3: fair, 2: poor, 1: bad). Four sentences that
were not included in the training data were used in the evaluation.
The experimental results are shown in Figure 5.11. The converted speech
quality by the proposed algorithm (“Proposed algorithm”) is significantly better
than that of the GMM-based algorithm (“GMM”) because over-smoothing of the
converted spectrum can be avoided in the proposed algorithm. However, the
converted speech quality is degraded compared with that of the synthetic speech
by STRAIGHT analysis-synthesis. Moreover, it is shown that the quality of
STRAIGHT cannot reach that of the original speech. Therefore, it is necessary
to improve both the quality of the voice conversion algorithm and that of the
analysis-synthesis method in order to achieve higher-quality voice conversion.
The results of the two subjective evaluations clarify that the proposed algo-
rithm can improve the naturalness of synthetic speech while maintaining equal
conversion-accuracy on speaker individuality compared with the GMM-based al-
gorithm.
5.6 Summary
In this chapter, we proposed a voice conversion algorithm based on the Gaus-
sian Mixture Model (GMM) with Dynamic Frequency Warping (DFW) of the
STRAIGHT spectrum. We also proposed a converted spectrum calculated by
mixing the GMM-based converted spectrum and the DFW-based converted spec-
trum. The GMM-based voice conversion algorithm, which is a statistical algo-
rithm, can convert speech features continuously with the correlation between a
source feature and a target feature. However, the converted spectrum is exces-
sively smoothed by statistical processing. The proposed algorithm can avoid the
typical over-smoothing of a converted spectrum, which causes the degradation of
speech quality. In order to evaluate the proposed algorithm, we performed ex-
perimental evaluations of speech quality and speaker individuality by comparison
with the conventional GMM-based algorithm. The experimental results revealed
that the proposed algorithm can improve converted speech quality while main-
taining equal conversion-accuracy for speaker individuality compared with the
GMM-based algorithm.
However, the performance of the proposed voice conversion is not high enough.
In order to reach a practical level of performance, it is necessary to improve
not only the conversion algorithm but also the speech analysis-synthesis method.
Moreover, the conversion of source information, i.e. prosody conversion, is needed
since prosody also contains important information on speaker individuality.
Chapter 6
Conclusions
6.1 Summary of the Thesis
Corpus-based Text-to-Speech (TTS) enables us to dramatically improve the nat-
uralness of synthetic speech over that of rule-based TTS. However, so far no
general-purpose TTS has been developed that can consistently synthesize suffi-
ciently natural speech. Furthermore, the flexibility of TTS is still inadequate.
In order to improve the performance of corpus-based TTS, we addressed two
problems in speech synthesis in this thesis. One is how to improve the naturalness
of synthetic speech in corpus-based TTS. The other is how to improve the control
of speaker individuality in order to realize a more flexible speech synthesis.
We first described the structure of a corpus-based TTS system in Chapter
2. Almost all corpus-based TTS systems have been developed on the basis of
this structure. The various techniques in each module were reviewed. We also
described some conventional statistical voice conversion algorithms and compared
conversion functions of the algorithms.
In Chapter 3, we proposed a novel segment selection algorithm for Japanese
speech synthesis in order to improve the naturalness of synthetic speech. Since
Japanese syllables consist of CV (C: Consonant or consonant cluster, V: Vowel or
syllabic nasal /N/) or V, except when a vowel is devoiced, CV units are often used
in concatenative TTS systems for Japanese. However, speech synthesized with
CV units sometimes has auditory discontinuity due to V-V and V-semivowel
concatenations. Since various vowel sequences appear frequently in Japanese, it
is not realistic to prepare long units that include all possible vowel sequences to
avoid V-V concatenation. In order to address this problem, we proposed a novel
segment selection algorithm that does not avoid the concatenation of the vowel
sequences but alleviates the discontinuity by utilizing both phoneme and diphone
units. In the proposed algorithm, non-uniform units allowing concatenation not
only at phoneme boundaries but also at vowel centers can be selected from a
speech corpus. The experiments on concatenation of vowel sequences clarified
that the number of better candidate segments increases by considering concate-
nations both at phoneme boundaries and at vowel centers. We also performed
perceptual experiments. The results showed that speech synthesized with the
proposed algorithm has better naturalness than that of the conventional algo-
rithms. We also compared the proposed algorithm with the algorithm based on
half-phoneme units. Moreover, a cost function for selecting the optimum wave-
form segments in our TTS system was described.
In Chapter 4, we performed a perceptual evaluation of costs for segment se-
lection. In order to achieve high-quality segment selection for concatenative TTS,
it is important to utilize a cost that corresponds to the perceptual characteristics.
From the results of perceptual experiments, we clarified the correspondence of the
cost to the perceptual scores and then evaluated various functions to integrate
local costs capturing degradation in individual segments. As a result, it was clari-
fied that the average cost, which captures the degradation of naturalness over the
entire synthetic speech, has better correspondence to the perceptual scores than
the maximum cost, which captures the local degradation of naturalness. Further-
more, the RMS (Root Mean Square) cost taking account of both average cost and
maximum cost has the best correspondence. We also showed that the naturalness
of synthetic speech can be slightly improved by utilizing the RMS cost. Then,
we investigated the effect of using the RMS cost for segment selection. From the
results of experiments comparing this approach with segment selection based on
the conventional average cost, it was found that (1) in segment selection based on
the RMS cost, a larger number of concatenations causing slight local degradation
were performed in order to avoid concatenations causing greater local degrada-
tion, and (2) the effect of the RMS cost has little dependence on the size of the
corpus. We also clarified that utilizing a larger corpus improves the quality and
consistency of the segment selection.
In Chapter 5, control of speaker individuality by voice conversion was de-
scribed. The voice conversion algorithm based on the Gaussian Mixture Model
(GMM), which is one of the conventional algorithms, can convert speech fea-
tures continuously with the correlation between a source feature and a target
feature. However, the quality of the converted speech is degraded because the
converted spectrum is excessively smoothed by the statistical averaging operation.
To overcome this problem, we proposed the GMM-based algorithm with Dynamic
Frequency Warping (DFW). The converted spectrum in the proposed algorithm
is calculated by mixing the GMM-based converted spectrum and the DFW-
based converted spectrum in order to avoid deterioration of spectral conversion-
accuracy. In order to evaluate the proposed algorithm, we performed experimen-
tal evaluation of speech quality and speaker individuality by comparison with
the conventional GMM-based algorithm. The experimental results revealed that
the proposed algorithm can improve converted speech quality while maintaining
equal conversion-accuracy for speaker individuality compared with the GMM-
based algorithm.
In summary, we confirmed that the proposed segment selection algorithm
and the proposed cost function based on perceptual evaluation are effective for
improving the naturalness of synthetic speech. In addition, the proposed voice
conversion algorithm can help improve the control of speaker individuality.
6.2 Future Work
Although we have improved the corpus-based TTS system, a number of problems
still remain to be solved.
Effective search algorithm: The computational complexity of segment selec-
tion needs to be reduced while maintaining the naturalness of synthetic
speech. As an approach to this problem, some clustering algorithms have
been proposed to decrease the number of candidate segments [14][35]. Deci-
sion trees are constructed by utilizing target features in segment selection in
advance in order to cluster similar candidate segments into the same classes.
Other approaches have proposed pre-selection by sub-costs with small com-
putational cost [29], which is applied in our TTS, and construction of a
practical and efficient cache of sub-costs with high computational cost [9].
Moreover, a pruning algorithm that considers concatenation between can-
didate segments has been proposed [18]. It is expected that a larger-sized
corpus will be used in the future to synthesize high-quality speech more con-
sistently. More approaches from various viewpoints are needed to address
this problem.
Measures against voice-quality variation: Variation in voice quality arises when the speech of a speaker is recorded over a long period. Some approaches as
described in Section 2.2.5 have been proposed [63][94]. However, much
more research is needed to solve this problem.
Utilization of multiple targets: Almost all algorithms use the most suitable
target information predicted from contextual information. However, the
predicted target information is not always the best in cases where only a
small number of candidate segments having the predicted target exist in the
corpus. It is assumed that segment sequences with smooth concatenations
cannot be selected under such conditions. As an interesting approach to
this problem, Bulyko et al. proposed a selection algorithm that considers
multiple targets [15]. Moreover, Hirai et al. proposed a selection algorithm
based on acceptable prosodic targets in Japanese speech synthesis [42]. Es-
pecially in Japanese, since accent information is crucial, it is important to
avoid selecting segment sequences with unacceptable accent information.
Moreover, it might be promising to utilize not only accent information but
also other contextual information, e.g. syntactic structure in segment se-
lection.
Improvement of cost: Cost functions for segment selection should be improved
based on perceptual characteristics [25][86]. In this thesis, we did not
determine the optimum weight set for the sub-costs. Although a
weight-determination algorithm based on linear regression has been proposed,
it uses an acoustic measure, e.g. cepstral distortion, as the objective
measure in the regression [47] (see the sketch below). However, such
acoustic measures do not correspond sufficiently well to perceptual
characteristics. Therefore, it is necessary to explore a measure having
better correspondence to perceptual characteristics.
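A minimal sketch of the regression-based weight determination, under the assumption that each training example provides a vector of sub-cost values and one objective score: in [47] the objective is an acoustic measure such as cepstral distortion, whereas a perceptually grounded score is the direction argued for here. The interface is hypothetical.

import numpy as np

def fit_subcost_weights(subcosts, objective):
    """Least-squares fit of sub-cost weights (illustrative sketch).

    subcosts:  (examples, n_subcosts) matrix of local sub-cost values.
    objective: (examples,) objective scores, e.g. cepstral distortion
               as in [47]; ideally a perceptual score instead.
    """
    weights, *_ = np.linalg.lstsq(np.asarray(subcosts, dtype=float),
                                  np.asarray(objective, dtype=float),
                                  rcond=None)
    return weights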
Various applications of TTS: Not only general-purpose TTS but also
limited-domain TTS has been studied [15][26]. As an effective approach in
the case of a small-sized corpus, spectral modification has been applied to
concatenative speech synthesis [116]. Moreover, in order to realize rich
communication between humans and machines, it is necessary to synthesize not
only speech in a reading style but also speech in various speaking styles
[54][65] as well as expressive speech [16][48]. A number of these problems
still remain to be solved.
Further improvement of voice conversion: Voice conversion is an attractive
technique; however, no practical voice conversion system has been developed
yet, and further improvement in performance is indispensable. Moreover, a
prosodic conversion algorithm is also needed, since prosody is an important
factor in representing speaker individuality.
Cross-language voice conversion: As an interesting application of voice
conversion, cross-language voice conversion has been proposed by Abe et al.
[3]. This technique synthesizes speech in a language in the voice of a
speaker who cannot actually speak that language. We applied our voice
conversion algorithm to this type of cross-language voice conversion and
evaluated its performance [73]. More research is needed to advance
cross-language voice conversion.
Appendix
A Frequency of Vowel Sequences
Table A.1 shows the frequency of vowel sequences in newspaper articles
comprising 571,283 sentences. Phoneme sequences were divided into CV* units
[60], and then we calculated the frequency of vowel sequences while ignoring
consonants; e.g. both /kai/ (CVV) and /ai/ (VV) are counted as /ai/
(length = 2). Semivowels were treated as vowels. The length of a long vowel
is set to 1, and that of a devoiced vowel to 0.
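The counting procedure can be summarized by the following sketch: consonants delimit vowel runs, semivowels count as vowels, and a devoiced vowel belongs to a run but adds nothing to its length (a long-vowel marker would likewise add nothing beyond its base vowel). The one-letter phoneme labels are placeholder assumptions; the corpus uses its own label set.

from collections import Counter

VOWELS = set("aiueojw")          # /j/, /w/: semivowels treated as vowels
DEVOICED = set("AIUEO")          # assumed labels for devoiced vowels

def vowel_run_lengths(phonemes):
    """Yield the length of every vowel sequence, ignoring consonants."""
    length, in_run = 0, False
    for p in phonemes:
        if p in VOWELS:
            length, in_run = length + 1, True
        elif p in DEVOICED:
            in_run = True        # part of the run, contributes length 0
        else:                    # a consonant closes the current run
            if in_run:
                yield length
            length, in_run = 0, False
    if in_run:
        yield length

# Both /kai/ and /ai/ yield a single sequence of length 2:
counts = Counter()
for sentence in ["kai", "ai", "kakI"]:   # toy corpus; "I" is devoiced
    counts.update(vowel_run_lengths(sentence))
print(counts)   # Counter({2: 2, 1: 1, 0: 1})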
Table A.1. Frequency of vowel sequences

Length    Frequency    Normalized frequency [%]    Number of different sequences
  0          518553         2.7590                              -
  1        16069643        85.4999                             11
  2         1677866         8.9272                             99
  3          459538         2.4450                            575
  4           57090         0.3038                           1154
  5           10306         0.0548                           1155
  6            1582         0.0084                            473
  7             306         0.0016                            146
  8              28         0.0002                             30
  9               4         0.0000                              4
B Definition of the Nonlinear Function P
We performed perceptual experiments on the degradation of naturalness caused
by prosody modification with STRAIGHT [58][59] in order to define a sub-cost
function on prosodic difference. Listeners evaluated the degradation on a
seven-point scale, from 1 (very bad) to 7 (very good). The perceptual score
was calculated as the average of the normalized scores, where each
listener's ratings were converted to Z-scores (mean = 0, variance = 1) in
order to equalize the score range among listeners.
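Written out, the normalization is simply a per-listener Z-score followed by averaging over listeners; the array layout below is an assumption.

import numpy as np

def perceptual_scores(ratings):
    """Z-score each listener's ratings, then average over listeners.

    ratings: (listeners, stimuli) array of 7-point ratings.
    Returns one normalized perceptual score per stimulus.
    """
    r = np.asarray(ratings, dtype=float)
    z = (r - r.mean(axis=1, keepdims=True)) / r.std(axis=1, keepdims=True)
    return z.mean(axis=0)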
The perceptual scores were modeled by a nonlinear function z as follows:
z(x, y) = \min\{s, g(x, y)\},  (B.1)

g(x, y) = a \exp\left[-\left\{\left(\frac{x}{S_x}\right)^2 - 2 r \left(\frac{x}{S_x}\right)\left(\frac{y}{S_y}\right) + \left(\frac{y}{S_y}\right)^2\right\}\right] + b,  (B.2)

S_x = \begin{cases} S_{xp} & (x > 0) \\ S_{xm} & (x \le 0) \end{cases}, \qquad
S_y = \begin{cases} S_{yp} & (y > 0) \\ S_{ym} & (y \le 0) \end{cases},  (B.3)
where x and y denote the F0 modification ratio and the duration modification
ratio in octaves, respectively. The parameters a, b, r, s, Sxp, Sxm, Syp,
and Sym are given by
a = 2.649368, b = −1.574549,
r = 0.062317, s = 0.942505,
Sxp = 0.400522, Sxm = 0.568179,
Syp = 0.924295, Sym = 0.913813.
(B.4)
These parameters were estimated by minimizing the error between the perceptual
score and that estimated by the function z(x, y). The nonlinear function P in
Equation (3.3) is defined as follows:
P(x, y) = −z(x, y) + s.  (B.5)
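The following is a direct transcription of Equations (B.1)-(B.5) with the parameter values of Equation (B.4), assuming the reconstructed form of Equation (B.2) above:

import math

# Parameter values from Equation (B.4).
A, B_PARAM = 2.649368, -1.574549
R, S = 0.062317, 0.942505
SXP, SXM = 0.400522, 0.568179
SYP, SYM = 0.924295, 0.913813

def P(x, y):
    """Sub-cost on prosody; x, y are the F0 and duration modification
    ratios in octaves (Equations B.1-B.5)."""
    sx = SXP if x > 0 else SXM
    sy = SYP if y > 0 else SYM
    u, v = x / sx, y / sy
    g = A * math.exp(-(u * u - 2.0 * R * u * v + v * v)) + B_PARAM
    z = min(S, g)
    return -z + S

print(P(0.0, 0.0))   # 0.0: no prosody modification incurs no sub-cost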
The function P is shown as the contour line in Figure B.1.
[Figure B.1: contour plot of the nonlinear function P for the sub-cost on
prosody, over the F0 modification ratio [octave] (horizontal axis) and the
duration modification ratio [octave] (vertical axis).]

In the cases of 1) the sub-cost on F0 discontinuity as described in Equation
(3.4) and 2) no signal processing for prosody modification in waveform synthesis,
the parameters are given by
a = 2.649368, b = −1.574549,
r = 0.0, s = 1.074819,
Sxp = 0.400522, Sxm = 0.400522,
Syp = 0.913813, Sym = 0.913813.
(B.6)
C Sub-Cost Functions, Ss and Sp, on Mismatch
of Phonetic Environment
The sub-cost functions were determined from the results of perceptual
experiments in which listeners evaluated the degradation of naturalness by
listening to speech stimuli synthesized by concatenating phonemes extracted
from various phonetic environments. The experimental method is described in
[64]. Listeners evaluated the degradation on a seven-point scale, from 1
(very bad) to 7 (very good). The perceptual score was calculated as the
average of the normalized scores, where each listener's ratings were
converted to Z-scores (mean = 0, variance = 1) in order to equalize the
score range among listeners.
The sub-cost Ss, capturing the degradation caused by a mismatch of the
succeeding phonetic environment, is given by

Ss(Ph, Phe, Phu) = −z(Ph, Phe, Phu) + b,  (C.1)

where z(Ph, Phe, Phu) denotes the perceptual score in the case of a mismatch
of the succeeding environment of a phoneme Ph, i.e. replacing a phoneme Phe
with a phoneme Phu. The bias b is set to 1.5 in order to convert the
perceptual score into a positive value. The sub-cost Sp, capturing the
degradation caused by a mismatch of the preceding phonetic environment, is
determined in a similar way.
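In code form, Equation (C.1) is a lookup followed by a bias; the table entry shown below is an assumed value for illustration, not a measured score from the experiments.

B_BIAS = 1.5   # converts the perceptual score into a positive cost

def S_s(z_score):
    """Sub-cost on mismatch of the succeeding phonetic environment,
    Ss = -z + b (Equation C.1)."""
    return -z_score + B_BIAS

# z(Ph, Phe, Phu): perceptual score for replacing succeeding phoneme
# Phe with Phu after phoneme Ph, tabulated from the experiments.
z_table = {("a", "k", "t"): -0.8}        # assumed entry, for illustration
print(S_s(z_table[("a", "k", "t")]))     # 2.3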
References
[1] M. Abe, Y. Sagisaka, T. Umeda, and H. Kuwabara. Speech database user's
manual. ATR Technical Report, TR-I-0166, 1990.
[2] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara. Voice conversion
through vector quantization. J. Acoust. Soc. Jpn. (E), Vol. 11, No. 2, pp.
71–76, 1990.
[3] M. Abe, K. Shikano, and H. Kuwabara. Statistical analysis of bilingual
speaker’s speech for cross-language voice conversion. J. Acoust. Soc. Am.,
Vol. 90, No. 1, pp. 76–82, 1991.
[4] M. Abe and S. Sagayama. A voice conversion based on phoneme segment
mapping. J. Acoust. Soc. Jpn. (E), Vol. 13, No. 3, pp. 131–139, 1992.
[5] M. Akamine and T. Kagoshima. Analytic generation of synthesis units by
closed loop training for totally speaker driven text to speech system (TOS
Drive TTS). Proc. ICSLP, pp. 1927–1930, Sydney, Australia, Dec. 1998.
[6] C.G. Bell, H. Fujisaki, J.M. Heinz, K.N. Stevens, and A.S. House. Reduction
of speech spectra by analysis-by-synthesis techniques. J. Acoust. Soc. Am.,
Vol. 33, No. 12, pp. 1725–1736, 1961.
[7] M. Beutnagel, A. Conkie, and A.K. Syrdal. Diphone synthesis using unit
selection. Proc. 3rd ESCA/COCOSDA International Workshop on Speech
Synthesis, pp. 185–190, Jenolan Caves, Australia, Nov. 1998.
[8] M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A.K. Syrdal. The
AT&T Next-Gen TTS system. Joint Meeting of ASA, EAA, and DAGA,
Berlin, Germany, Mar. 1999.
http://www.research.att.com/projects/tts/pubs.html
[9] M. Beutnagel, M. Mohri, and M. Riley. Rapid unit selection from a large
speech corpus for concatenative speech synthesis. Proc. EUROSPEECH, pp.
607–610, Budapest, Hungary, Sep. 1999.
[10] A.W. Black and P. Taylor. CHATR: a generic speech synthesis system. Proc.
COLING, pp. 983–986, Kyoto, Japan, Aug. 1994.
[11] A.W. Black and N. Campbell. Optimizing selection of units from speech
database for concatenative synthesis. Proc. EUROSPEECH, pp. 581–584,
Madrid, Spain, Sep. 1995.
[12] A.W. Black. Building practical speech synthesis systems. ATR Technical
Report, TR-IT-0163, 1996.
[13] A.W. Black and A.J. Hunt. Generating F0 contours from ToBI labels using
linear regression. Proc. ICSLP, pp. 1385–1388, Philadelphia, U.S.A., Oct.
1996.
[14] A.W. Black and P. Taylor. Automatically clustering similar units for unit
selection in speech synthesis. Proc. EUROSPEECH, pp. 601–604, Rhodes,
Greece, Sep. 1997.
[15] A.W. Black and K. Lenzo. Limited domain synthesis. Proc. ICSLP, Vol. 2,
pp. 411–414, Beijing, China, Sep. 2000.
[16] M. Bulut, S.S. Narayanan, and A.K. Syrdal. Expressive speech synthesis using
a concatenative synthesizer. Proc. ICSLP, pp. 1265–1268, Denver, U.S.A.,
Sep. 2002.
[17] I. Bulyko and M. Ostendorf. Efficient integrated response generation from
multiple targets using weighted finite state transducers. Computer Speech
and Language, Vol. 16, No. 3–4, pp. 533–550, 2002.
[18] I. Bulyko, M. Ostendorf, and J. Bilmes. Robust splicing costs and efficient
search with BMM models for concatenative speech synthesis. Proc. ICASSP,
pp. 461–464, Orlando, U.S.A., May 2002.
[19] W.N. Campbell and C.W. Wightman. Prosodic encoding of syntactic struc-
ture for speech synthesis. Proc. ICSLP, pp. 369–372, Banff, Canada, Oct.
1992.
[20] W.N. Campbell. Prosody and the selection of units for concatenation syn-
thesis. Proc. 2nd ESCA/IEEE Workshop on Speech Synthesis, pp. 61–64,
New York, U.S.A., Sep. 1994.
[21] W.N. Campbell. CHATR: a high-definition speech re-sequencing system.
Proc. Joint Meeting of ASA and ASJ, pp. 1223–1228, Hawaii, U.S.A., Dec.
1996.
[22] W.N. Campbell and A.W. Black. Prosody and the selection of source units
for concatenative synthesis. Progress in Speech Synthesis, J.P.H. Van Santen,
R.W. Sproat, J.P. Olive, and J. Hirschberg, N.Y., Springer, pp. 279–292,
1997.
[23] W.N. Campbell. Processing a speech corpus for CHATR synthesis. Proc.
ICSP, pp. 183–186, Seoul, Korea, Aug. 1997.
[24] M. Chu, H. Peng, H. Yang, and E. Chang. Selecting non-uniform units from
a very large corpus for concatenative speech synthesizer. Proc. ICASSP, pp.
785–788, Salt Lake City, U.S.A., May 2001.
[25] M. Chu and H. Peng. An objective measure for estimating MOS of synthe-
sized speech. Proc. EUROSPEECH, pp. 2087–2090, Aalborg, Denmark, Sep.
2001.
[26] M. Chu, C. Li, H. Peng, and E. Chang. Domain adaptation for TTS systems.
Proc. ICASSP, pp. 453–456, Orlando, U.S.A., May 2002.
[27] A. Conkie and S. Isard. Optimal coupling of diphones. Progress in Speech
Synthesis, J.P.H. Van Santen, R.W. Sproat, J.P. Olive, and J. Hirschberg,
N.Y., Springer, pp. 293–304, 1997.
[28] A. Conkie. Robust unit selection system for speech synthesis. Joint Meeting
of ASA, EAA, and DAGA, Berlin, Germany, Mar. 1999.
http://www.research.att.com/projects/tts/pubs.html
[29] A. Conkie, M. Beutnagel, A.K. Syrdal, and P.E. Brown. Preselection of
candidate units in a unit selection-based text-to-speech synthesis system.
Proc. ICSLP, Vol. 3, pp. 279–282, Beijing, China, Oct. 2000.
[30] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. J. Royal Statist. Soc., Vol. 39, No.
1, pp. 1–38, 1977.
[31] W. Ding, K. Fujisawa, and N. Campbell. Improving speech synthesis
of CHATR using a perceptual discontinuity function and constraints of
prosodic modification. Proc. 3rd ESCA/COCOSDA International Workshop
on Speech Synthesis, pp. 191–194, Jenolan Caves, Australia, Nov. 1998.
[32] N.R. Dixon and H.D. Maxey. Terminal analog synthesis of continuous speech
using the diphone method of segment assembly. IEEE Trans. Audio Elec-
troacoust., Vol. 16, No. 1, pp. 40–50, 1968.
[33] R.E. Donovan and E.M. Eide. The IBM trainable speech synthesis system.
Proc. ICSLP, pp. 1703–1706, Sydney, Australia, Dec. 1998.
[34] R.E. Donovan and P.C. Woodland. A hidden Markov-model-based trainable
speech synthesizer. Computer Speech and Language, Vol. 13, No. 3, pp.
223–241, 1999.
[35] R.E. Donovan. Segment pre-selection in decision-tree based speech synthesis
systems. Proc. ICASSP, pp. 937–940, Istanbul, Turkey, June 2000.
[36] H. Dudley. Remaking speech. J. Acoust. Soc. Am., Vol. 11, No. 2, pp.
169–177, 1939.
[37] K.E. Dusterhoff and A.W. Black. Generating F0 contours for speech synthesis
using the Tilt intonation theory. Proc. ESCA Workshop on Intonation, pp.
107–110, Athens, Greece, Sep. 1997.
[38] K.E. Dusterhoff, A.W. Black, and P. Taylor. Using decision trees within
the Tilt intonation model to predict F0 contours. Proc. EUROSPEECH,
Budapest, Hungary, pp. 1627–1630, Sep. 1999.
[39] H. Fujisaki and K. Hirose. Analysis of voice fundamental frequency contours for
declarative sentence of Japanese. J. Acoust. Soc. Jpn. (E), Vol. 5, No. 4, pp.
233–242, 1984.
[40] K. Hakoda, S. Nakajima, T. Hirokawa, and H. Mizuno. A new Japanese
text-to-speech synthesizer based on COC synthesis method. Proc. ICSLP,
pp. 809–812, Kobe, Japan, Nov. 1990.
[41] T. Hirai, N. Iwahashi, N. Higuchi, and Y. Sagisaka. Automatic extraction of
F0 control rules using statistical analysis. Progress in Speech Synthesis, J.P.H.
Van Santen, R.W. Sproat, J.P. Olive, and J. Hirschberg, N.Y., Springer, pp.
333–346, 1997.
[42] T. Hirai, S. Tenpaku, and K. Shikano. Speech unit selection based on target
values driven by speech data in concatenative speech synthesis. Proc. IEEE
2002 Workshop on Speech Synthesis, Santa Monica, U.S.A., Sep. 2002.
[43] T. Hirokawa. Speech synthesis using a waveform dictionary. Proc. EU-
ROSPEECH, pp. 140–143, Paris, France, Sep. 1989.
[44] T. Hirokawa and K. Hakoda. Segment selection and pitch modification for
high quality speech synthesis using waveform segments. Proc. ICSLP, pp.
337–340, Kobe, Japan, Nov. 1990.
[45] K. Hirose, H. Fujisaki, and H. Kawai. Generation of prosodic symbols for
rule-synthesis of connected speech of Japanese. Proc. ICASSP, pp. 2415–
2418, Tokyo, Japan, Apr. 1986.
[46] K. Hirose, M. Eto, and N. Minematsu. Improved corpus-based synthesis
of fundamental frequency contours using generation process model. Proc.
ICSLP, pp. 2085–2088, Denver, U.S.A., Sep. 2002.
[47] A.J. Hunt and A.W. Black. Unit selection in a concatenative speech synthesis
system using a large speech database. Proc. ICASSP, pp. 373–376, Atlanta,
U.S.A., May 1996.
[48] A. Iida, N. Campbell, S. Iga, F. Higuchi, and M. Yasumura. A speech
synthesis system with emotion for assisting communication. Proc. ISCA
Workshop on Speech and Emotion, pp. 167–172, Belfast, Northern Ireland,
Sep. 2000.
[49] S. Imai, K. Sumita, and C. Furuichi. Mel log spectrum approximation
(MLSA) filter for speech synthesis. IEICE Trans., Vol. J66-A, No. 2, pp.
122–129, 1983 (in Japanese).
[50] M. Isogai and H. Mizuno. A new F0 contour control method based on vector
representation of F0 contour. Proc. EUROSPEECH, pp. 727–730, Budapest,
Hungary, Sep. 1999.
[51] K. Itoh, S. Nakajima, and T. Hirokawa. A new waveform speech synthesis
approach based on the COC speech spectrum. Proc. ICASSP, pp. 577–580,
Adelaide, Australia, Apr. 1994.
[52] N. Iwahashi, N. Kaiki, and Y. Sagisaka. Speech segment selection for con-
catenative synthesis based on spectral distortion minimization. IEICE Trans.
Fundamentals, Vol. E76-A, No. 11, pp. 1942–1948, 1993.
[53] N. Iwahashi and Y. Sagisaka. Speech spectrum conversion based on speaker
interpolation and multi-functional representation with weighting by radial
basis function networks. Speech Communication, Vol. 16, No. 2, pp. 139–
151, 1995.
[54] K. Iwano, M. Yamada, T. Togawa, and S. Furui. Speech-rate-variable HMM-
based Japanese TTS system. Proc. IEEE 2002 Workshop on Speech Synthe-
sis, Santa Monica, U.S.A., Sep. 2002.
[55] T. Kagoshima and M. Akamine. Automatic generation of speech synthesis
units based on closed loop training. Proc. ICASSP, pp. 963–966, Munich,
Germany, Apr. 1997.
[56] T. Kagoshima and M. Akamine. An F0 contour control model for totally
speaker driven text to speech system. Proc. ICSLP, pp. 1975–1978, Sydney,
Australia, Dec. 1998.
[57] A. Kain and M.W. Macon. Spectral voice conversion for text-to-speech syn-
thesis. Proc. ICASSP, Seattle, U.S.A., pp. 285–288, May 1998.
[58] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne. Restructuring speech
representations using a pitch-adaptive time-frequency smoothing and an
instantaneous-frequency-based F0 extraction: possible role of a repetitive
structure in sounds. Speech Communication, Vol. 27, No. 3–4, pp. 187–207,
1999.
[59] H. Kawahara, H. Katayose, A. de Cheveigne, and R.D. Patterson. Fixed
point analysis of frequency to instantaneous frequency mapping for accu-
rate estimation of F0 and periodicity. Proc. EUROSPEECH, pp. 2781–2784,
Budapest, Hungary, Sep. 1999.
[60] H. Kawai, N. Higuchi, T. Shimizu, and S. Yamamoto. Development of a text-
to-speech system for Japanese based on waveform splicing. Proc. ICASSP,
pp. 569–572, Adelaide, Australia, Apr. 1994.
[61] H. Kawai, K. Hirose, and H. Fujisaki. Rules for generating prosodic features
for text-to-speech synthesis of Japanese. J. Acoust. Soc. Jpn. (J), Vol. 50,
No. 6, pp. 433–442, 1994 (in Japanese).
[62] H. Kawai, S. Yamamoto, N. Higuchi, and T. Shimizu. A design method of
speech corpus for text-to-speech synthesis taking account of prosody. Proc.
ICSLP, Vol. 3, pp. 420–425, Beijing, China, Oct. 2000.
[63] H. Kawai and M. Tsuzaki. A study on time-dependent voice quality variation
in a large-scale single speaker speech corpus used for speech synthesis. Proc.
IEEE 2002 Workshop on Speech Synthesis, Santa Monica, U.S.A., Sep. 2002.
[64] H. Kawai and M. Tsuzaki. Acoustic measures vs. phonetic features as predic-
tors of audible discontinuity in concatenative speech synthesis. Proc. ICSLP,
pp. 2621–2624, Denver, U.S.A., Sep. 2002.
[65] H. Kawanami, T. Masuda, T. Toda, and K. Shikano. Designing Japanese
speech database covering wide range in prosody. Proc. ICSLP, pp. 2425–
2428, Denver, U.S.A., Sep. 2002.
[66] E. Klabbers and R. Veldhuis. Reducing audible spectral discontinuities.
IEEE Trans. Speech and Audio Processing, Vol. 9, No. 1, pp. 39–51, 2001.
[67] D.H. Klatt. Review of text-to-speech conversion for English. J. Acoust. Soc.
Am., Vol. 82, No. 3, pp. 737–793, 1987.
[68] H. Kuwabara and Y. Sagisaka. Acoustic characteristics of speaker individ-
uality: control and conversion. Speech Communication, Vol. 16, No. 2, pp.
165–173, 1995.
[69] C.H. Lee, C.H. Lin, and B.H. Juang. A study on speaker adaptation of
the parameters of continuous density hidden Markov models. IEEE Trans.
Signal Processing, Vol. 39, No. 4, pp. 806–814, 1991.
[70] M. Lee. Perceptual cost functions for unit searching in large corpus-based
concatenative text-to-speech. Proc. EUROSPEECH, pp. 2227–2230, Aal-
borg, Denmark, Sep. 2001.
[71] C.J. Leggetter and P.C. Woodland. Maximum likelihood linear regression for
speaker adaptation of continuous density hidden Markov models. Computer
Speech and Language, Vol. 9, No. 2, pp. 171–185, 1995.
[72] N. Maeda, H. Banno, S. Kajita, K. Takeda, and F. Itakura. Speaker conver-
sion through non-linear frequency warping of STRAIGHT spectrum. Proc.
EUROSPEECH, Budapest, Hungary, pp. 827–830, Sep. 1999.
[73] M. Mashimo, T. Toda, H. Kawanami, K. Shikano, and N. Campbell. Cross-
language voice conversion evaluation using bilingual databases. IPSJ Jour-
nal, Vol. 43, No. 7, pp. 2177–2185, 2002.
[74] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai. Voice characteristics
conversion for HMM-based speech synthesis system. Proc. ICASSP, pp.1611–
1614, Munich, Germany, Apr. 1997.
[75] H. Matsumoto and H. Inoue. A minimum distortion spectral mapping ap-
plied to voice quality conversion. Proc. ICSLP, pp. 161–164, Kobe, Japan,
Nov. 1990.
[76] H. Matsumoto and Y. Yamashita. Unsupervised speaker adaptation from
short utterances based on a minimized fuzzy objective function. J. Acoust.
Soc. Jpn. (E), Vol. 14, No. 5, pp. 353–361, 1993.
[77] N. Minematsu, R. Kita, and K. Hirose. Automatic estimation of accentual
attribute values of words to realize accent sandhi in Japanese text-to-speech
conversion. Proc. IEEE 2002 Workshop on Speech Synthesis, Santa Monica,
U.S.A., Sep. 2002.
[78] H. Mizuno and M. Abe. Voice conversion algorithm based on piecewise linear
conversion rules of formant frequency and spectrum tilt. Speech Communi-
cation, Vol. 16, No. 2, pp. 153–164, 1995.
[79] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing
techniques for text-to-speech synthesis using diphones. Speech Communica-
tion, Vol. 9, No. 5–6, pp. 453–467, 1990.
[80] S. Nakajima and H. Hamada. Automatic generation of synthesis units based
on context oriented clustering. Proc. ICASSP, pp. 659–662, New York,
U.S.A., Apr. 1988.
[81] S. Nakamura and K. Shikano. Speaker adaptation applied to HMM and
neural networks. Proc. ICASSP, pp. 89–92, Glasgow, Scotland, May 1989.
[82] M. Narendranath, H.A. Murthy, S. Rajendran, and B. Yegnanarayana. Trans-
formation of formants for voice conversion using artificial neural networks.
Speech Communication, Vol. 16, No. 2, pp. 207–216, 1995.
[83] K. Ohkura, M. Sugiyama, and S. Sagayama. Speaker adaptation based
on transfer vector field smoothing with continuous mixture density HMMs.
Proc. ICSLP, pp. 369–372, Banff, Canada, Oct. 1992.
[84] J.P. Olive. Rule synthesis of speech from dyadic units. Proc. ICASSP, pp.
568–570, Hartford, U.S.A., May 1977.
[85] A.V. Oppenheim and D.H. Johnson. Discrete representation of signals. Proc.
IEEE, Vol. 60, No. 6, pp. 681–691, 1972.
[86] H. Peng, Y. Zhao, and M. Chu. Perpetually optimizing the cost function for
unit selection in a TTS system with one single run of MOS evaluation. Proc.
ICSLP, pp. 2613–2616, Denver, U.S.A., Sep. 2002.
[87] G. Peterson, W. Wang, and E. Sivertsen. Segmentation techniques in speech
synthesis. J. Acoust. Soc. Am., Vol. 30, No. 8, pp. 739–742, 1958.
[88] Y. Sagisaka and H. Satoh. Accentuation rules for Japanese word concatena-
tion. IEICE Trans., Vol. J66-D, No. 7, pp. 849–856, 1983 (in Japanese).
[89] Y. Sagisaka. Speech synthesis by rule using an optimal selection of non-
uniform synthesis units. Proc. ICASSP, pp. 679–682, New York, U.S.A.,
Apr. 1988.
[90] Y. Sagisaka, N. Kaiki, N. Iwahashi, and K. Mimura. ATR ν-talk speech
synthesis system. Proc. ICSLP, pp. 483–486, Banff, Canada, Oct. 1992.
[91] Y. Sagisaka. Natural language processing in speech synthesis. IPSJ Journal,
Vol. 34, No. 10, pp. 1281–1286, 1993 (in Japanese).
[92] H. Sato. Speech synthesis on the basis of PARCOR-VCV concatenation
units. IEICE Trans., Vol. J61-D, No. 11, pp. 858–865, 1978 (in Japanese).
[93] H. Sato. Speech synthesis using CVC concatenation units and excitation
waveform elements. Trans. Comm. Speech Res., Acoust. Soc. Jpn., S83-69,
pp. 541–546, 1984 (in Japanese).
[94] Y. Shi, E. Chang, H. Peng, and M. Chu. Power spectral density based
channel equalization of large speech database for concatenative TTS system.
Proc. ICSLP, pp. 2369–2372, Denver, U.S.A., Sep. 2002.
[95] K. Shikano, K.F. Lee, and R. Reddy. Speaker adaptation through vector
quantization. Proc. ICASSP, pp. 2643–2646, Tokyo, Japan, Apr. 1986.
[96] T. Shimizu, N. Higuchi, H. Kawai, and S. Yamamoto. A morphological
analyzer for a Japanese text-to-speech system based on the strength of con-
nection between words. J. Acoust. Soc. Jpn. (J), Vol. 51, No. 1, pp. 3–13,
1995 (in Japanese).
[97] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price,
J. Pierrehumbert, and J. Hirschberg. ToBI: a standard for labeling English
prosody. Proc. ICSLP, pp. 867–870, Banff, Canada, Oct. 1992.
[98] Y. Stylianou, O. Cappe, and E. Moulines. Statistical methods for voice
quality transformation. Proc. EUROSPEECH, Madrid, Spain, pp. 447–450,
Sep. 1995.
[99] Y. Stylianou and O. Cappe. A system for voice conversion based on probabilis-
tic classification and a harmonic plus noise model. Proc. ICASSP, Seattle,
U.S.A., pp. 281–284, May 1998.
[100] Y. Stylianou. Applying the harmonic plus noise model in concatenative
speech synthesis. IEEE Trans. Speech and Audio Processing, Vol. 9, No. 1,
pp. 21–29, 2001.
[101] Y. Stylianou and A.K. Syrdal. Perceptual and objective detection of dis-
continuities in concatenative speech synthesis. Proc. ICASSP, pp. 837–840,
Salt Lake City, U.S.A., May 2001.
[102] A.K. Syrdal, C.W. Wightman, A. Conkie, Y. Stylianou, M. Beutnagel, J.
Schroeter, V. Strom, K-S. Lee, and M.J. Makashay. Corpus-based techniques
in the AT&T NextGen synthesis system. Proc. ICSLP, Vol. 3, pp. 410–415,
Beijing, China, Oct. 2000.
[103] A.K. Syrdal. Phonetic effects on listener detection of vowel concatenation.
Proc. EUROSPEECH, pp. 979–982, Aalborg, Denmark, Sep. 2001.
[104] J. Takahashi and S. Sagayama. Vector-field-smoothed Bayesian learning
for incremental speaker adaptation. Proc. ICASSP, pp. 696–699, Detroit,
U.S.A., May 1995.
[105] S. Takano, K. Tanaka, H. Mizuno, M. Abe, and S. Nakajima. A Japanese
TTS system based on multiform units and a speech modification algorithm
with harmonics reconstruction. IEEE Trans. Speech and Audio Processing,
Vol. 9, No. 1, pp. 3–10, 2001.
[106] K. Takeda, K. Abe, and Y. Sagisaka. On the basic scheme and algorithms
in non-uniform unit speech synthesis. Talking machines: theories, models,
and designs, G. Bailly, C. Benoit, and T. Sawallis, North-Holland, Elsevier
Science, pp. 93–105, 1992.
[107] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi. Adaptation of pitch
and spectrum for HMM-based speech synthesis using MLLR. Proc. ICASSP,
pp. 805–808, Salt Lake City, U.S.A., May 2001.
[108] P. Taylor and A.W. Black. Synthesizing conversational intonation from a
linguistically rich input. Proc. 2nd ESCA/IEEE Workshop on Speech Syn-
thesis, pp. 175–178, New York, U.S.A., Sep. 1994.
[109] T. Toda, J. Lu, H. Saruwatari, and K. Shikano. STRAIGHT-based voice
conversion algorithm based on Gaussian mixture model. Proc. ICSLP, Vol.
3, pp. 279–282, Beijing, China, Oct. 2000.
[110] T. Toda, H. Saruwatari, and K. Shikano. Voice conversion algorithm based
on Gaussian mixture model with dynamic frequency warping of STRAIGHT
spectrum. Proc. ICASSP, pp. 841–844, Salt Lake City, U.S.A., May 2001.
[111] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Hidden Markov
models based on multi-space probability distribution for pitch pattern mod-
eling. Proc. ICASSP, pp. 229–232, Phoenix, U.S.A., May 1999.
[112] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura.
Speech parameter generation algorithms for HMM-based speech synthesis.
Proc. ICASSP, pp. 1315–1318, Istanbul, Turkey, June 2000.
[113] M. Tsuzaki and H. Kawai. Feature extraction for unit selection in concate-
native speech synthesis: comparison between AIM, LPC, and MFCC. Proc.
ICSLP, pp. 137–140, Denver, U.S.A., Sep. 2002.
[114] H. Valbret, E. Moulines and J.P. Tubach. Voice transformation using
PSOLA technique. Speech Communication, Vol. 11, No. 2–3, pp. 175–187,
1992.
[115] J. Wouters and M.W. Macon. A perceptual evaluation of distance measures
for concatenative speech synthesis. Proc. ICSLP, pp. 2747–2750, Sydney,
Australia, Dec. 1998.
[116] J. Wouters and M.W. Macon. Control of spectral dynamics in concatenative
speech synthesis. IEEE Trans. Speech and Audio Processing, Vol. 9, No. 1,
pp. 30–38, 2001.
[117] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Si-
multaneous modeling of spectrum, pitch and duration in HMM-based speech
synthesis. Proc. EUROSPEECH, pp. 2347–2350, Budapest, Hungary, Sep.
1999.
[118] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura.
Speaker interpolation for HMM-based speech synthesis system. J. Acoust.
Soc. Jpn. (E), Vol. 21, No. 4, pp. 199–206, 2000.
List of Publications
Journal Papers
1. T. Toda, H. Banno, S. Kajita, K. Takeda, F. Itakura, and K. Shikano. Im-
provement of STRAIGHT method under noisy conditions based on lateral
inhibitive weighting. IEICE Trans., Vol. J83-D-II, No. 11, pp. 2180–2189,
Nov. 2000 (in Japanese).
2. T. Toda, J. Lu, H. Saruwatari, and K. Shikano. Voice conversion algorithm
based on Gaussian mixture model with dynamic frequency warping. IEICE
Trans., Vol. J84-D-II, No. 10, pp. 2181–2189, Oct. 2001 (in Japanese).
3. T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano. A segment selection algo-
rithm for Japanese concatenative speech synthesis based on both phoneme
unit and diphone unit. IEICE Trans., Vol. J85-D-II, No. 12, pp. 1760–
1770, Dec. 2002 (in Japanese).
4. M. Mashimo, T. Toda, H. Kawanami, K. Shikano, and N. Campbell. Cross-
language voice conversion evaluation using bilingual databases. IPSJ Jour-
nal, Vol. 43, No. 7, pp. 2177–2185, July 2002.
International Conferences
1. T. Toda, J. Lu, H. Saruwatari, and K. Shikano. STRAIGHT-based voice
conversion algorithm based on Gaussian mixture model. Proc. ICSLP, Vol.
3, pp. 279–282, Beijing, China, Oct. 2000.
2. T. Toda, J. Lu, S. Nakamura, and K. Shikano. Voice conversion algorithm
based on Gaussian mixture model applied to STRAIGHT. Proc. WEST-
PRAC VII, pp. 169–172, Kumamoto, Japan, Oct. 2000.
3. T. Toda, H. Saruwatari, and K. Shikano. Voice conversion algorithm based
on Gaussian mixture model with dynamic frequency warping of STRAIGHT
spectrum. Proc. ICASSP, pp. 841–844, Salt Lake City, U.S.A., May 2001.
4. T. Toda, H. Saruwatari, and K. Shikano. High quality voice conversion
based on Gaussian mixture model with dynamic frequency warping. Proc.
EUROSPEECH, pp. 349–352, Aalborg, Denmark, Sep. 2001.
5. T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano. Unit selection algorithm
for Japanese speech synthesis based on both phoneme unit and diphone
unit. Proc. ICASSP, pp. 465–468, Orlando, U.S.A., May 2002.
6. T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano. Perceptual evaluation of
cost for segment selection in concatenative speech synthesis. Proc. IEEE
2002 Workshop on Speech Synthesis, Santa Monica, U.S.A., Sep. 2002.
7. T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano. Segment selection consid-
ering local degradation of naturalness in concatenative speech synthesis.
Proc. ICASSP, Hong Kong, China, Apr. 2003 (accepted).
8. M. Mashimo, T. Toda, K. Shikano, and N. Campbell. Evaluation of cross-
language voice conversion based on GMM and STRAIGHT. Proc. EU-
ROSPEECH, pp. 361–364, Aalborg, Denmark, Sep. 2001.
9. H. Kawanami, T. Masuda, T. Toda, and K. Shikano. Designing speech
database with prosodic variety for expressive TTS system. Proc. LREC,
pp. 2039–2042, Las Palmas, Spain, May 2002.
10. M. Mashimo, T. Toda, H. Kawanami, H. Kashioka, K. Shikano, and N.
Campbell. Evaluation of cross-language voice conversion using bilingual
and non-bilingual databases. Proc. ICSLP, pp. 293–296, Denver, U.S.A.,
Sep. 2002.
11. H. Kawanami, T. Masuda, T. Toda, and K. Shikano. Designing Japanese
speech database covering wide range in prosody. Proc. ICSLP, pp. 2425–
2428, Denver, U.S.A., Sep. 2002.
Technical Reports
1. T. Toda, H. Banno, S. Kajita, K. Takeda, and F. Itakura. Improvement
of STRAIGHT method under noisy conditions based on lateral inhibitive
weighting. IEICE Tech. Rep., EA99-7, pp. 25–32, Aichi, May 1999 (in
Japanese).
2. T. Toda, J. Lu, S. Nakamura, and K. Shikano. Voice conversion algorithm
based on Gaussian mixture model applied to STRAIGHT. IEICE Tech.
Rep., SP2000-7, pp. 49–54, Kyoto, May 2000 (in Japanese).
3. T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano. A unit selection algorithm
for Japanese speech synthesis based on both phoneme unit and diphone unit.
IEICE Tech. Rep., SP2001-120, pp. 45–52, Shiga, Jan. 2002 (in Japanese).
4. T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano. Perceptual evaluation of
cost for segment selection in concatenative text-to-speech synthesis. IEICE
Tech. Rep., SP2002-69, pp. 19–24, Fukuoka, Aug. 2002 (in Japanese).
5. T. Masuda, T. Toda, H. Kawanami, H. Saruwatari, and K. Shikano. A
study on the speech synthesis method by using database with variety of
speech-rate. IEICE Tech. Rep., SP2001-122, pp. 61–68, Shiga, Jan. 2002
(in Japanese).
6. T. Shiraishi, T. Toda, H. Kawanami, H. Saruwatari, and K. Shikano. A
visual speech synthesis method using real image database. IPSJ SIG Notes,
2002-SLP-42, pp. 23–29, Gunma, July 2002 (in Japanese).
7. H. Kawai and T. Toda. An evaluation of automatic phoneme segmentation
for concatenative speech synthesis. IEICE Tech. Rep., SP2002-170, pp.
5–10, Shiga, Jan. 2003 (in Japanese).
8. Y. Iwami, T. Toda, H. Kawanami, H. Saruwatari, and K. Shikano. Emo-
tional speech synthesis using GMM-based voice conversion technique. IE-
ICE Tech. Rep., SP2002-171, pp. 11–16, Shiga, Jan. 2003 (in Japanese).
Meetings
1. T. Toda, J. Lu, S. Nakamura, and K. Shikano. Application of voice con-
version algorithm based on Gaussian mixture model to STRAIGHT. Proc.
Spring Meeting, Acoust. Soc. Jpn., 1-7-6, pp. 207–208, Chiba, Mar. 2000
(in Japanese).
2. T. Toda, J. Lu, H. Saruwatari, and K. Shikano. Voice conversion algorithm
based on Gaussian Mixture Model with dynamic frequency warping. Proc.
Fall Meeting, Acoust. Soc. Jpn., 2-1-6, pp. 185–186, Iwate, Sep. 2000 (in
Japanese).
3. T. Toda, H. Saruwatari, and K. Shikano. Study on conversion-accuracy on
speaker individuality of voice conversion algorithm with dynamic frequency
warping. Proc. Spring Meeting, Acoust. Soc. Jpn., 1-6-5, pp. 237–238,
Ibaraki, Mar. 2001 (in Japanese).
4. T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano. A unit selection algorithm
for Japanese speech synthesis allowing concatenation at vowel center. Proc.
Spring Meeting, Acoust. Soc. Jpn., 2-10-2, pp. 267–268, Kanagawa, Mar.
2002 (in Japanese).
5. T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano. Evaluation based on per-
ceptual characteristic of cost for segment selection in concatenative speech
synthesis. Proc. Fall Meeting, Acoust. Soc. Jpn., 1-10-6, pp. 235–236,
Akita, Sep. 2002 (in Japanese).
6. T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano. Evaluation of segment se-
lection considering local degradation of naturalness in concatenative speech
synthesis. Proc. Spring Meeting, Acoust. Soc. Jpn., 1-6-13, pp. 247–248,
Tokyo, Mar. 2003 (in Japanese).
7. T. Masuda, T. Toda, H. Kawanami, H. Saruwatari, and K. Shikano.
STRAIGHT-based prosody modification of CHATR output. Proc. Fall
Meeting, Acoust. Soc. Jpn., 1-2-20, pp. 245–246, Ohita, Oct. 2001 (in
Japanese).
8. M. Mashimo, T. Toda, H. Kawanami, K. Shikano, and N. Campbell.
Adaptation of voice conversion method based on GMM and STRAIGHT
to cross-language speech synthesis. Proc. Fall Meeting, Acoust. Soc. Jpn.,
1-P-17, pp. 389–390, Ohita, Oct. 2001 (in Japanese).
9. M. Mashimo, T. Toda, H. Kawanami, K. Shikano, and N. Campbell.
Cross-language voice conversion and vowel pronunciation. Proc. Spring
Meeting, Acoust. Soc. Jpn., 1-10-16, pp. 261–262, Kanagawa, Mar. 2002
(in Japanese).
10. T. Masuda, T. Toda, H. Kawanami, H. Saruwatari, and K. Shikano. De-
signing and evaluation of speech database with prosodic variety. Proc.
Spring Meeting, Acoust. Soc. Jpn., 2-10-13, pp. 289–290, Kanagawa, Mar.
2002 (in Japanese).
11. Y. Iwami, T. Toda, H. Kawanami, H. Saruwatari, and K. Shikano. Evalu-
ation of emotional control with GMM-based voice conversion. Proc. Fall
Meeting, Acoust. Soc. Jpn., 1-10-24, pp. 277–278, Akita, Sep. 2002 (in
Japanese).
12. T. Shiraishi, T. Toda, H. Kawanami, H. Saruwatari, and K. Shikano. A
visual speech synthesis method using real image corpus. Proc. Fall Meeting,
Acoust. Soc. Jpn., 2-10-15, pp. 315–316, Akita, Sep. 2002 (in Japanese).
13. H. Kawai and T. Toda. An evaluation of a design method of speech corpus
for concatenative speech synthesis. Proc. Spring Meeting, Acoust. Soc.
Jpn., 1-6-20, pp. 261–262, Tokyo, Mar. 2003 (in Japanese).
14. Y. Iwami, T. Toda, H. Kawanami, H. Saruwatari, and K. Shikano. Emo-
tion control of speech using GMM-based voice conversion. Proc. Spring
Meeting, Acoust. Soc. Jpn., 1-6-23, pp. 267–268, Tokyo, Mar. 2003 (in
Japanese).
15. T. Shiraishi, T. Toda, H. Kawanami, H. Saruwatari, and K. Shikano. Corpus-
based visual speech synthesis using perceptual definition of viseme. Proc.
Spring Meeting, Acoust. Soc. Jpn., 2-Q-13, pp. 399–400, Tokyo, Mar. 2003
(in Japanese).
Master’s Thesis
1. T. Toda. High quality voice conversion based on STRAIGHT analysis-
synthesis method. Master’s thesis, Department of Information Process-
ing, Graduate School of Information Science, Nara Institute of Science and
Technology, NAIST-IS-MT9951077, Feb. 2001 (in Japanese).