

doi: 10.1006/csla.2002.0190. Available online at http://www.idealibrary.com

Computer Speech and Language (2002) 16, 183–203

Selection of the most significant parameters for duration modelling in a Spanish text-to-speech system using neural networks

Ricardo Córdoba,†§ Juan M. Montero,† Juana M. Gutiérrez,† José A. Vallejo,† Emilia Enríquez‡ and José M. Pardo†

†Grupo de Tecnología del Habla—Departamento de Ingeniería Electrónica, E.T.S.I. Telecomunicación—Universidad Politécnica de Madrid, Ciudad Universitaria s/n, 28040 Madrid, Spain. ‡Facultad de Filología, Universidad Nacional de Educación a Distancia, Senda del Rey s/n, 28040 Madrid, Spain

Abstract

Accurate prediction of segmental duration from text in a text-to-speech system is difficult for several reasons. One which is especially relevant is the great quantity of contextual factors that affect timing, and it is difficult to find the right way to model them. There are many parameters that affect duration, but not all of them are always relevant and some can even be counterproductive because of the possibility of overtraining.

The main motivation of this paper has been to reduce the error in the duration estimation. To this end, it is of the utmost importance to find the factors that most influence duration in a given language. The approach we have taken is to use a neural network, which is completely configurable, and experiment with the different combinations of parameters that yield the minimum error in the estimation.

We have oriented our work mainly towards the following aspects: the most significant parameters that can be used as input to the automatic model, and the best way to code these parameters. We have studied first the effect of each parameter alone and, after that, we have included all parameters together to obtain our final system.

Another important aspect of this study is the generation of a suite of software tools and design protocols that will be used in future tasks with different speakers and databases. The applications for automatic modelling are obvious: adapt the prosody to a new speaker, to a new environment, to "restricted-domain" sentences, etc., in a fast, semi-automatic and inexpensive way. After the database labelling, it is a matter of minutes to prepare the inputs to the network for the new situation, and the network is trained in 1 h.

The result has been a system that predicts duration with very good results (19 ms RMS) and that clearly improves on our previous rule-based system.

© 2002 Elsevier Science Ltd. All rights reserved.

§ Author for correspondence. E-mail: [email protected]

0885–2308/02/$35.00 © 2002 Elsevier Science Ltd. All rights reserved.


1. Introduction

The primary goal of this study was to develop an automatic system to model duration for a Spanish text-to-speech system and achieve better results than those obtained with our previous system, which used a rule-based approach (Santos, 1984; Pardo, Martínez, Quilis & Muñoz, 1987). We will compare them in this paper. The rule-based approach follows a classic multiplicative model, also known as the Klatt model (Allen, Hunnicutt & Klatt, 1987). Beginning from an initial inherent duration based on the phoneme identity, this value is multiplied by a series of factors based on the characteristics of the phoneme, its position, stress, etc.
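As a concrete illustration of this multiplicative scheme, here is a minimal sketch; the inherent durations and factor values are invented for illustration and are not the rule values of the original system.

```python
# A Klatt-style multiplicative duration model in miniature. The inherent
# durations and factors below are illustrative placeholders, NOT the
# values from the rule-based system described in the text.

INHERENT_MS = {"a": 80, "s": 95, "t": 70}  # per-phoneme baseline duration (ms)

def rule_duration(phoneme, stressed, prepausal):
    """Start from the inherent duration, then multiply by rule factors."""
    dur = INHERENT_MS[phoneme]
    if stressed:
        dur *= 1.20   # lengthening of stressed segments
    if prepausal:
        dur *= 1.30   # final lengthening before a pause
    return dur

print(round(rule_duration("a", stressed=True, prepausal=False), 1))  # 96.0
```

A real rule system multiplies many more factors (position, context, open/closed syllable); the point here is only the structure: one inherent value, then a chain of multiplicative adjustments.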

Now, our objective is to develop a self-learning model: using a general-prosody segmented database for training and considering the characteristics (or parameters) of every phoneme, the model has to predict the estimated duration of the phoneme, without inferring any rule. At the same time, we wanted the system to be very flexible, so we could easily adapt it to new contexts, especially to restricted-domain prosody, an area of research that we are also very interested in. For example, in some banking applications, the text-to-speech system has to generate sentences with a very similar structure. Applying the environment presented in this paper to this application has produced outstanding results.

Many studies have been successfully carried out lately in the field of automatic estimation of a duration model, using different techniques and input parameters to obtain the model. These automatic techniques are mainly of two types: decision trees and neural networks (the objective of this paper). Another line of investigation with very good results is the statistical sum-of-products method. We will now highlight recent work in all these areas.

In all the systems, regardless of the technique used for the modelling, it is crucial to find the parameters that are most significant for duration modelling. So, we can take advantage of previous studies dedicated to duration modelling, but using other techniques. For example, the technique using a sum-of-products model from van Santen (1994) used the following parameters: phoneme identity, surrounding phonemes, pitch accent, syllabic stress, within-syllable position, within-word position (initial/not initial, final/not final), within-utterance position (last syllable/penultimate syllable/other). In van Son and van Santen (1997) they used the following parameters: phoneme identity, stress, position in the word (initial, medial, and final), word length in syllables (one, two, three, or more than three) and the frontedness of the syllabic vowel. In Möbius and van Santen (1996) the sum-of-products model is applied to German. A very large set of possible parameters is considered, and the most relevant parameters (statistically determined) are the syllable stress, the word class, and the presence of phrase and word boundaries. In van Santen, Shih, Möbius, Tzoukermann and Tanenblatt (1997), where multi-lingual duration modelling is described, they distinguish three groups of core factors: phone identity, stress-related and locational. We have studied all of them in this paper and adopted a similar solution for the locational factors: location of the segment in the syllable, the syllable in the word, the word in the phrase and the phrase in the utterance.

Eloquens (Quazza & Heuvel, 1997) is a system for Italian using decision trees. The parameters considered in this system are: phoneme identity, stress, window of phonemes, and characteristics of higher-order units (syllable, word, and phrase). In Deans, Breen and Jackson (1999), they propose a CART-based approach using prosodic features obtained from the TTS system, but the results were not as good as expected.

Neural networks have previously been used with success. In Campbell (1992) a neural network was trained to predict syllable timing. The parameters used are the number of phonemes in the syllable, nature of the syllabic peak, position in tone-group, type of foot, stress and word-class. In Riedi (1995) they apply a neural network to duration modelling in German with very good results. In Tournemire (1997) a syllable duration model is shown, in which the neural network is used to predict the standard deviation of the duration (called the "syllable elasticity factor"). The parameters used are the breaks, the stress, and the information on the composition of the syllable. Corrigan and Massey (1997a,b) propose a hybrid system using a neural network and a rule-based system. It is the neural network which is responsible for the final estimation of the duration, and the rule-based system is only used to obtain the inputs to the network (which are all binary). In concept, the solution is similar to ours, as we always need a preprocessing step to obtain the inputs to the network, the difference being that we accept different inputs, not only binary. Morlec, Bailly and Aubergé (1997) present the "sequential neural networks", where there is a set of neural networks working in parallel to predict the prosody of the sentence. In Fackrell, Vereecken, Martens and Van Coile (1999) they compare the performance of neural networks and CART techniques for six different languages, including Spanish. The results for both are very similar, which shows that either of them can be used. They also propose an architecture with three levels of cascaded modules for prosody prediction. Another possibility is to use "multivariate adaptive regression splines" (Riedi, 1997). Its objective is to determine automatically the structure and parameters of the model with sparsely distributed data.

Regarding the application of these techniques to Spanish, there are very few references and none is dedicated to neural networks or CART approaches. For example, the TTS from the dominant Spanish telephone company—Telefónica—uses a multiplicative model (Rodríguez-Crespo, Escalada & Torre, 1998; Macarrón, Escalada & Rodríguez, 1991), similar to our previous solution. The factors that were considered significant in this system were: position of the phoneme within the phrase, phrase length, position of the phoneme within the word with respect to the stressed vowel, and left and right context of the phoneme. They suggested that they would use "type of phrase" in future work. As we will see in this paper, we have covered all these parameters in our system, and some more. In Villar, López-Gonzalo and Relaño (1999) they use a database of syllables, each one associated with its linguistic properties, and search for the closest syllable in the database using a dynamic programming algorithm. This way, they model f0 and duration at the same time. In previous works, this group proposed only slight variations on this approach, as in López-Gonzalo and Rodríguez-García (1996), but they do not provide any result about the accuracy of the estimation (no measure of error in the estimation), so we cannot compare our results with theirs. In Febrer, Padrell and Bonafonte (1998) they implement a sum-of-products model for Catalan and compare it with the multiplicative model. There is a deep study of the factors that affect duration in Catalan using a CART tree. In Fernández-Salgado and Banga (1999), there is a study on duration for Galician, which is phonetically very close to Spanish. They decided that the most significant parameters for duration were: number of allophones in the syllable, position in the syllable, broad phonetic class, stress and final position (they discarded phoneme identity and contextual information). They used a look-up table for the estimation.

In this paper, we will show and analyse new alternatives for the automatic techniques mentioned earlier, specifically for the parameters used and their coding as inputs to the neural network, and for the way to model the duration, as there are several alternatives: the duration itself, a normalized duration, the logarithm, the standard deviation, and different combinations of these measures.


TABLE I. Number of units in the database

List   Phonemes   Syllables   Words   Phrases
1        3 224       1355      654       94
2        2 354       1015      541       90
3        3 867       1671      882      138
4        2 440        987      497      164
5        3 256       1314      612      198
All     15 141       6342     3186      684

2. Database used for the modelling

2.1. Contents

Our database consists of five sets of phrases of different lengths and patterns, giving a total of 732 phrases (15 141 phonemes). We have used a single speaker with a neutral voice, looking for homogeneity. In previous studies using many speakers, the variation in our parameters was too high and it was difficult to draw reliable statistical conclusions.

We have used this database to model both fundamental frequency (see a thorough study in Vallejo, 1998) and duration. So, some considerations in the design of our database are dedicated to intonation modelling. We split the database into five lists with the following types of sentence:

• List 1: a discourse, with a predominance of declarative patterns, especially all patterns relative to complex phrases, but with very few interrogative and exclamatory sentences. This discourse has the same content as the one used to develop our first rule-based system.

• Lists 2 and 3: colloquial text, read as an interview, with more interrogative and exclamatory sentences to complete list 1.

• Lists 4 and 5: laboratory sentences, designed to complement the previous text. They include at least one sample of all the possible patterns with one or two stressed syllables in sentences of up to eight syllables. The distribution of this set follows the distribution of a large set of texts in Spanish. The objective is to improve the coverage of all possible vectors of parameters that we will have in our system.

In its design, the goal was to cover not only intonation patterns, but all syntactic and stress patterns in Spanish, and to have an appropriate number of each of them to be able to train our model. To this end we have followed the studies developed in Navarro Tomás (1948). Similarly, we wanted to have enough examples of words with a different number of syllables, with one or two stressed syllables and with different positions of the stressed syllable inside the word.

The detailed figures of our database are shown in Table I. Throughout the paper we will also mention the results we have obtained in similar experiments using a female speaker in a restricted-domain environment (Córdoba, Montero, Gutiérrez-Arriola & Pardo, 2001). This restricted domain offers several advantages for the modelling: the variation in the different patterns is reduced, and there are more instances of each vector of parameters in the database. Besides, this database is slightly larger (1735 phrases, 3594 words, 6551 syllables, 20 089 phonemes).


2.2. Separation in training and testing

We broke the database down into three parts: 60% is used for the "training set" (i.e. to adjust the parameters in our system), 20% for the "development test set", and 20% for the "validation set". The topology of the network is determined considering the results on the "development test set", and the final system is checked against the "validation set". In this division, we decided to assign one complete phrase of each type alternately to training, testing, and validation.
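The round-robin assignment above can be sketched as follows; phrases are assumed to arrive as (type, text) pairs, and a 3:1:1 cycle per type reproduces the 60/20/20 proportions.

```python
# Sketch of the per-type round-robin 60/20/20 split: of every five phrases
# of a given type, three go to training and one each to development test
# and validation. The "type" labels are whatever phrase categorisation is
# available; this is an illustrative reconstruction, not the paper's code.

def split_phrases(phrases):
    """phrases: list of (type_label, phrase) pairs, in corpus order."""
    buckets = {"train": [], "devtest": [], "validation": []}
    cycle = ["train", "train", "train", "devtest", "validation"]
    counters = {}  # how many phrases of each type seen so far
    for ptype, phrase in phrases:
        i = counters.get(ptype, 0)
        buckets[cycle[i % len(cycle)]].append(phrase)
        counters[ptype] = i + 1
    return buckets
```

Splitting whole phrases (rather than individual phonemes) keeps every phoneme of a sentence in the same set, so the test sets contain genuinely unseen sentences.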

2.3. Database segmentation

The segmentation of the database was automatic, using a continuous speech recognition system with HMM models (Ferreiros, Córdoba, Savoji & Pardo, 1995), followed by a manual revision of the marks to correct some of the accuracy problems of the system. In this manual segmentation we were limited to 10 ms steps. The segmentation was made at the phoneme level, as the phoneme is the base unit in our modelling. To obtain the syllable boundaries we have developed a module that generates those boundaries from text using Spanish rules; in Spanish the syllabification rules are known, fixed and well documented. The module is the same as the one used in our TTS system.
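As a rough illustration of rule-based Spanish syllabification, here is a much-simplified sketch based on onset maximisation over a few legal consonant clusters; the actual module implements the full documented rules (hiatus, "h", prefixes), which this sketch ignores.

```python
# Very simplified Spanish syllabifier (lowercase orthographic input).
# Onset maximisation over a short list of legal clusters; an illustrative
# sketch only, not the module described in the text.

VOWELS = set("aeiouáéíóúü")
ONSETS = {"pr", "br", "tr", "dr", "cr", "gr", "fr",
          "pl", "bl", "cl", "gl", "fl", "ch", "ll", "rr"}

def syllabify(word):
    # 1. Split the word into alternating consonant/vowel groups.
    groups, i = [], 0
    while i < len(word):
        is_v = word[i] in VOWELS
        j = i
        while j < len(word) and (word[j] in VOWELS) == is_v:
            j += 1
        groups.append(word[i:j])
        i = j
    # 2. Attach consonant groups to the neighbouring vowel nuclei.
    syls, onset = [], ""
    for k, g in enumerate(groups):
        if g[0] in VOWELS:
            syls.append(onset + g)
            onset = ""
        elif k == 0:
            onset = g                       # word-initial consonants
        elif k == len(groups) - 1:
            syls[-1] += g                   # word-final coda
        else:
            # medial cluster: the last one or two consonants open the
            # next syllable (two if they form a legal onset cluster)
            cut = len(g) - (2 if g[-2:] in ONSETS else 1)
            syls[-1] += g[:cut]
            onset = g[cut:]
    return syls

print(syllabify("palabra"))  # ['pa', 'la', 'bra']
```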

3. Neural network system development methodology

In order to develop a system based on neural networks, we have to make several decisions: Which topology should we use for our network? Which is the best way to code the parameters? Which parameters should we use (i.e. which are the most representative for our modelling)?

3.1. Topology of the neural network

We have used a multilayer perceptron (MLP) with the sigmoid as the activation function; the training algorithm is backpropagation. The experiments in this section were for the purpose of fine-tuning the network. We worked with what we call the "base experiment", which includes as parameters just the phoneme identity and the stress.

Our basic unit is the phoneme. For each phoneme, we compute a series of parameters, which we code and whose values we use as inputs to the neural network. There is one output in our network: the duration of the phoneme.

In our experiments, we divided the training into three phases of 200 iterations each. This way we could analyse the evolution of the error on the test set as a function of the number of iterations and detect a condition of overfitting (a network too adapted to the training data).

It is very difficult to know the optimum number of neurons and layers that the net should have. A large number usually causes overtraining (results worsen for the test set although they improve for the training set), whilst a small number may be insufficient. Our approach has been to begin from the very beginning: just three neurons. The best procedure is to increase the number of neurons one by one and observe the test results (training results are not significant for our problem, as the greater the number of neurons, the greater the improvement), and stop when there is an overtraining symptom (e.g. an increase in RMS on the development test set). As we will see, in some cases we got better results with just three or four neurons. So, we have to be careful in the design, as using two layers and many neurons is rarely the best approach.
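The grow-until-overtraining procedure can be sketched as follows; `train_and_eval` is a hypothetical stand-in for training the MLP with a given hidden-layer size and returning the development-test RMS.

```python
# Sketch of the incremental topology search: grow the hidden layer one
# neuron at a time and stop as soon as the development-test RMS stops
# improving (an overtraining symptom). `train_and_eval` is assumed, not
# from the paper: it trains a network with n hidden neurons and returns
# its dev-test RMS.

def find_hidden_size(train_and_eval, start=3, max_neurons=30):
    best_n, best_rms = start, train_and_eval(start)
    for n in range(start + 1, max_neurons + 1):
        rms = train_and_eval(n)
        if rms >= best_rms:   # dev-test error no longer improves: stop
            break
        best_n, best_rms = n, rms
    return best_n, best_rms
```

Stopping on the development-test error rather than the training error is the whole point: training error keeps falling as neurons are added, so it cannot signal overtraining.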


3.2. Coding of the parameters

We have considered different ways of presenting the parameters to the neural network, i.e. the way they are coded, as we have different kinds of parameters. A good explanation of all the possible codings can be found in Masters (1993).

1. Binary coding: this is the standard coding for true/false parameters. A binary "one" is assigned to the true condition and a binary "zero" to the false condition.

2. One-of-n coding: we use n neurons and only one of them is active, namely the one that corresponds to the class or category.

For ordinal values we have more possibilities, because there is an order relationship between the different values (for example, the position of a unit inside a higher-order unit). For these values we have considered three codings:

3. Percentage transformation: we divide the current value by the maximum value to obtain a percentage, giving a floating-point value between 0 and 1 as input. This transformation presents many problems when the distribution of the parameter is not uniform. If there are values very far from the average (the maximum value is too high), the side effect is a compression of the information that is close to the average, which means a decrease in the performance of the transformation.

4. Thermometer: we divide all the possible values into different classes (intervals). We activate all the neurons up to the one corresponding to the current class and leave the remaining neurons inactive. If we use equidistant classes, we have the same problem that we had with the percentage transformation: if the distribution of the parameter is not uniform, the classes will be unbalanced, which has a bad effect on the training of a neural network. So, to decide the size of each class we developed an algorithm that obtains a uniform distribution across all the classes. This algorithm considers the whole database and the histogram of all the values of the parameter.

5. Z-score mapping: we apply a normalization to the floating-point value that takes into account the average and the standard deviation of the variable. It is a good coding for high-variance parameters, as is the case for all the "number of units" parameters.
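The three ordinal codings might be sketched as follows; the values and bin edges are illustrative, and `uniform_bin_edges` approximates the equal-occupancy algorithm described above by taking quantiles of the observed values.

```python
# Illustrative sketches of the ordinal codings (values are made up).

def percentage(value, max_value):
    """Percentage transformation: a single input in [0, 1]."""
    return value / max_value

def thermometer(value, bin_edges):
    """One neuron per class boundary, active while the value reaches it.
    Edges should come from the value histogram so that classes are
    equally populated, not equally wide."""
    return [1.0 if value >= e else 0.0 for e in bin_edges]

def uniform_bin_edges(values, n_bins):
    """Quantile-based edges: each resulting class holds roughly the same
    number of database values (the equal-occupancy idea, in spirit)."""
    vs = sorted(values)
    return [vs[(k * len(vs)) // n_bins] for k in range(1, n_bins)]

def zscore(value, mean, std):
    """Z-score mapping: normalize by the variable's mean and deviation."""
    return (value - mean) / std

print(thermometer(7, bin_edges=[2, 5, 9]))  # [1.0, 1.0, 0.0]
```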

3.3. Network evaluation

To evaluate and compare the networks we have considered different metrics for the error (the difference between the prediction from the network and the optimum value) (Masters, 1993).

1. The RMS is equal to the square root of the mean square error (MSE). It is the most important metric, and is more reliable than the average absolute error.

2. To make our comparisons it is better to use a dimensionless metric, because it will be independent of the way we code the duration. We decided to use the following metric because it does not include any offset and is independent of the average value of the magnitude compared. This way, we can use it to compare results obtained with databases of different sizes.

Relative RMS error = RMS / sqrt( (1/N) Σ_i (t_i − t̄)² )

where t_i are the target values, t̄ is their mean and N the number of values; the denominator is the standard deviation of the targets, which makes the metric dimensionless.
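Both metrics can be written compactly; this sketch follows the definition above, taking the denominator of the relative RMS to be the standard deviation of the target values.

```python
import math

# RMS error and the dimensionless relative RMS used for comparisons:
# the RMS of the prediction error divided by the standard deviation of
# the targets, so the metric is offset-free and size-independent.

def rms_error(targets, predictions):
    n = len(targets)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n)

def relative_rms(targets, predictions):
    n = len(targets)
    mean = sum(targets) / n
    std = math.sqrt(sum((t - mean) ** 2 for t in targets) / n)
    return rms_error(targets, predictions) / std
```

A relative RMS of 1.0 corresponds to always predicting the mean target duration; the values around 0.82 reported later are improvements on that baseline.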


3.4. Parameters to be used

3.4.1. Background

The different parameters considered in our system are the result of thorough statistical studies applied to Spanish that were carried out previously in our laboratory (Santos, 1984) using a multiplicative model (Allen et al., 1987): each phoneme is assigned an initial inherent duration based on the phoneme identity, and then this value is multiplied by a series of factors. The following factors were determined to be relevant for duration in Spanish:

• Poly-syllabic shortening: depending on the number of syllables in the word, the initial duration of the vowel is reduced accordingly.

• Stress, syllable structure (syllable ending with/without vowel) and position of the syllable in the word. All three parameters were relevant for duration, but the efforts to model them independently within the multiplicative model were unsuccessful. That was the reason to model them together, i.e. a single value was obtained for each combination of the three parameters.

• Post-vocalic context: considering the type of the consonant that followed the vowel (plosive, voiceless/voiced fricative, voiceless affricate, nasal, lateral), they computed the different factors for the multiplicative model.

• Final lengthening: they found that the vowel in a syllable before a pause is lengthened by between 11% and 47%, depending on the number of syllables in the word and the syllable structure.

In a later work by Berrojo (1994), significance tests were run to guarantee that these factors were significant in the database used in this paper. All of them passed, and the following factors were found to be significant too:

• Voiced/unvoiced post-vocalic context: in Spanish, unvoiced contexts lengthen the vowels (5% over voiced contexts).

• Function words: the duration of vowels in function words is shortened by 30% compared to vowels in non-function words. The difference is significant.

• Number of phonemes in the syllable: there are significant differences between the three classes: one phoneme, two phonemes, and three or more phonemes.

• Position in the phrase: considering initial, medial and final syllables, there are significant differences.

• Size of the phrase: there were no significant differences for the classes considered.

These studies are the basis we have followed for the selection of our parameters. We have considered all of them (the only difference is the way we code them for the neural network) and introduced new variations for the type of phrase, position and "number of units" parameters.

3.4.2. Methodology

As we found out in Córdoba et al. (1999), it is too difficult to decide which parameters are relevant, and the best way to code them, using a very large network with many parameters, because the differences in performance are too small and not always consistent. So, we have used a base experiment with only the phoneme identity and the stress, which are, without doubt, the most relevant ones, as all previous work related to duration estimation—and our own pilot experiments—has shown.


Then, we have added the different parameters one by one to see their effect and the significance of each of them. We have considered the introduction of contextual information in some of them, i.e. windowing information.

We have carried out the experiments in four directions:

• First, we used a fixed number of neurons (three) to test each possible parameter.

• As increasing the number of parameters demands more neurons in the network, we increased the number of neurons until we detected an overtraining symptom.

• We tested different codings for non-binary parameters.

• We included all the parameters to get the final network.

Now, we will describe in detail the procedures followed for all the parameters that we have considered.

3.4.3. Phoneme identity

This is the most obvious one. In our first approach, we considered 51 different phonemes with all the allophonic variations for Spanish, including four variants of each vowel. These units can be seen in Table II. The first column shows the class of the phonemes; the third and fourth columns show examples of each phoneme in Spanish and English (if there is any correspondence). In Quilis and Fernández (1989) there is a complete description of Spanish phonetics and its relation to English.

We analysed the distribution of phonemes in the database and observed that many of them had too few samples, which is an undesirable situation for the training of the network. So, we decided to cluster the allophones with a very low number of repetitions in the database: the variants of each vowel and the variants of the vibrant (the "r"). The fifth column in Table II shows this reduction to 33 phonemes. The coding used is one-of-n: a "1" in the input which corresponds to the phoneme and a "0" for all the other inputs.

Contextual phonemes. As our previous studies of duration showed, the kind of phoneme adjacent to the phoneme considered significantly affects its duration. In our neural network, we decided to use the phonemes to the right and to the left of the current one. As the number of phonemes is too high (we need 99 inputs for the three phonemes in the window), the results obtained are poor, showing that there are not enough examples in the database to train all the possible contexts.

The solution to this problem is to make clusters of phonemes of the same class, considering mainly their manner of articulation (only in two cases, where the number of instances of the phoneme in the database was very low, did we instead consider the similarity of their duration distributions). Then, we divide our set of phonemes into 13 classes and classify the left and right context phonemes into these classes. The result of the clustering can be seen in the sixth column of Table II.

This way we reduce the inputs to 59 (33 + 13 + 13). The results improve with this approach, so we will use it in our base system. The next step would be to consider a two-phoneme context (a window of five values, instead of three), but the results worsened slightly for the test set. The most probable reason is the excessive number of inputs—and, at the same time, of weights to be trained—needed for this window. So, we discarded this option at this point.
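A sketch of how such an input vector could be assembled; the inventories below are small placeholders standing in for the paper's 33 phonemes and 13 context classes.

```python
# Building the phoneme input vector: one-of-n for the current phoneme
# plus one-of-n context classes for the left and right neighbours.
# PHONEMES and CLASSES here are tiny illustrative stand-ins, not the
# paper's actual 33-phoneme / 13-class inventories.

PHONEMES = ["a", "e", "i", "o", "u", "p", "t", "k", "s", "n"]
CLASSES = ["vowel", "plosive", "fricative", "nasal"]
CLASS_OF = {"a": "vowel", "e": "vowel", "i": "vowel", "o": "vowel",
            "u": "vowel", "p": "plosive", "t": "plosive", "k": "plosive",
            "s": "fricative", "n": "nasal"}

def one_of_n(item, inventory):
    return [1.0 if item == x else 0.0 for x in inventory]

def phoneme_inputs(left, current, right):
    """Identity of the current phoneme, class of each neighbour."""
    return (one_of_n(current, PHONEMES)
            + one_of_n(CLASS_OF[left], CLASSES)
            + one_of_n(CLASS_OF[right], CLASSES))

vec = phoneme_inputs("p", "a", "n")
print(len(vec))  # 10 + 4 + 4 = 18 here; 33 + 13 + 13 = 59 in the paper
```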


TABLE II. Phonemes considered in the system

Class          Phoneme (51)          Example      English   Reduced (33)   Cluster
Vowels         a                     palabra      aisle     a              all vowels
               e                     reduce       fail      e
               i                     realidad     it        i
               o                     teatro       november  o
               u                     justicia     good      u
Stressed       a'                    palabra      —         a
vowels         e'                    reto         —         e
               i'                    mito         —         i
               o'                    logro        —         o
               u'                    júbilo       —         u
Nasal          ∼a                    andar        —         a
vowels         ∼e                    en           —         e
               ∼i                    ministro     —         i
               ∼o                    autonomía    —         o
               ∼u                    unido        —         u
Stressed       ∼a'                   mano         —         a
nasal          ∼e'                   fomento      —         e
vowels         ∼i'                   mínimo       —         i
               ∼o'                   tonto        —         o
               ∼u'                   mundo        —         u
Diphthongs     j (i before a, e, o)  pie          yes       j              j w
               w (u before a, e, o)  cuatro       wine      w
               I (i after a, e, o)   aire         —         I              I U
               U (u after a, e, o)   raudo        —         U
Lateral        l                     lado         leaf      l              l L
               L (Spanish "ll")      calle        —         L
Vibrant        r (between vowels)    pero         red       r              r
               R (begin of word)     ropa         —         R              R
               R/ (cons. + r)        patrón       —         R
               R* (final r)          jugar        —         R
               rr                    perro        —         R
Plosives       p                     par          pay       p              p t k
               t                     tope         tea       t
               k                     cama         cold      k
               b                     bar          bar       b              b d g
               d                     dar          day       d
               g                     gana         go        g
Fricatives     B                     saber        —         B              B D G
               D                     codo         then      D
               G                     paga         —         G
               s                     sol          see       s              s f T X
               f                     fin          foot      f
               T (Spanish "z")       zumo         thin      T
               X (Spanish "j")       jota         —         X
Nasals         n                     no           no        n              n m J
               m                     mamá         make      m
               N                     tango        long      N              N J/
               N∼ (Spanish "ñ")      caña         —         N∼             N∼ T/
Affricates     T/ (Spanish "ch")     chico        cheap     T/
               J (Spanish "y")       ayer         —         J
               J/                    cónyuge      jump      J/

3.4.4. The stress

The effect of this parameter is always relevant for duration modelling (the duration of stressed vowels is 20% longer on average than that of unstressed vowels), so we have included it in our base experiment. We use the lexical stress, which could be defined as a phonological feature by which a syllable is heard as more prominent than others. The perceived prominence may be due to a longer duration, F0, intensity, or a combination of all of them, although usually it also includes a component of phonation. In Spanish, words only have one stressed syllable (with just one exception).

The coding is binary: the phoneme may or may not have stress. In this section, we studied the inclusion of contextual information about the stress (adjacent values). We experimented with different sizes for the window of stress values and obtained better results using five values.
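A five-value stress window of this kind might be built as follows; padding with zeros at utterance boundaries is our assumption, since the text does not state how edges are handled.

```python
# Binary stress window centred on the current phoneme: five inputs, one
# per phoneme in the window. Zero-padding at the edges is an assumption
# of this sketch, not stated in the text.

def stress_window(stresses, i, width=5):
    """stresses: 0/1 stress flags for the phoneme sequence; i: index of
    the current phoneme. Returns `width` inputs centred on i."""
    half = width // 2
    return [float(stresses[j]) if 0 <= j < len(stresses) else 0.0
            for j in range(i - half, i + half + 1)]

print(stress_window([0, 1, 0, 0, 1], 1))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```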

In Table III, first row, we can see the results of the base experiment considering just the phoneme identity and the stress (both with the windowing mentioned earlier). In this table, the relative RMS (see Section 3.3) on the train and test sets is presented for all experiments dedicated to the study of individual parameters. Each row corresponds to an experiment in which a new parameter has been added to the base experiment.

3.4.5. Binary parameters

Below, we present the other binary parameters that have been considered in our experiments. As their coding is fixed, we have studied their effect for different numbers of neurons. All of them have shown an improvement over the base experiment (see experiments 1–4 in Table III). That is what we expected, as all of them affect the duration.

• Stress in the syllable. It is similar to "stress in the phoneme", used in the base experiment, but now all the phonemes of the syllable are considered as "stressed" when the main vowel is stressed. In spite of being somewhat redundant, the inclusion of this parameter has been positive, showing again the relevance of stress-related factors. Analysing our database, stressed syllables were 25% longer on average than unstressed ones (Berrojo, 1994).

• Syllable structure (syllable ending with/without vowel). A syllable ending with a vowel is called an open syllable, which affects duration significantly, as we have observed in the statistical analysis mentioned in Section 3.4.1. Vowels in an open syllable are longer: 13% for stressed vowels and 9% for unstressed vowels (on average).

• Diphthong ("i/u" before/after "a/e/o"). In Spanish, we differentiate both cases as different allophones, and they follow different rules for duration. In previous studies (Santos, 1984) we observed that there were different degrees of shortening for them.

Duration modelling in a Spanish text-to-speech system 193

TABLE III. Summary of results (relative RMS)

Experiment                                     | N | Train  | Test
Base experiment                                | 4 | 0.7877 | 0.8261
1—Base + stress in syllable                    | 6 | 0.7887 | 0.8215
2—Base + diphthong                             | 6 | 0.7827 | 0.8200
3—Base + syllable structure                    | 5 | 0.7767 | 0.8232
4—Base + function word                         | 3 | 0.7870 | 0.8237
5—Base + type of phrase (8 types)              | 3 | 0.7877 | 0.8221
   Base + type of phrase (4 types)             | 3 | 0.7850 | 0.8180
6—Base + pos. of P in S (3 classes)            | 5 | 0.7836 | 0.8189
   Base + pos. of P in W (5 cl.)               | 4 | 0.7773 | 0.8204
   Base + pos. of P in PHR (5 cl.)             | 4 | 0.7848 | 0.8215
7—Base + pos. of S in W (4 cl.)                | 3 | 0.7863 | 0.8190
   Base + pos. of S in PHR (6 cl.)             | 5 | 0.7735 | 0.8116
8—Base + pos. of W in PHR (3 cl.)              | 5 | 0.7960 | 0.8251
9—Base + number of P in S                      | 6 | 0.7882 | 0.8196
   Base + number of P in W                     | 6 | 0.7750 | 0.8109
   Base + number of P in PHR                   | 6 | 0.7881 | 0.8187
10—Base + number of S in W                     | 6 | 0.7755 | 0.8119
   Base + number of S in PHR                   | 6 | 0.7877 | 0.8183
11—Base + number of W in PHR                   | 6 | 0.7777 | 0.8172
12—Base + position in the phrase relative to first/last stress | 5 | 0.7755 | 0.8130

N = number of neurons, P = phoneme, S = syllable, W = word, PHR = phrase.

• Phoneme in a function word. In our database, syllables in a function word are 30% shorter on average.

The last two apply only to special cases, so they do not show a significant improvement in the overall system, but they improve the prediction in their specific cases.
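Taken together, the four binary parameters above could be extracted as in the following sketch. The string-based syllable representation and the helper names are hypothetical; only the definitions of the features come from the text:

```python
VOWELS = set("aeiou")
STRONG = set("aeo")   # strong vowels
WEAK = set("iu")      # weak vowels, forming diphthongs next to a/e/o

def binary_features(syllable, stressed, in_function_word):
    """The four binary parameters of this section for one syllable.

    `syllable` is a lowercase orthographic-style string, a hypothetical
    simplification of the system's internal phoneme representation.
    """
    open_syllable = syllable[-1] in VOWELS  # syllable structure
    # diphthong: a weak vowel (i/u) adjacent to a strong one (a/e/o)
    diphthong = any((a in WEAK and b in STRONG) or (a in STRONG and b in WEAK)
                    for a, b in zip(syllable, syllable[1:]))
    return {
        "stress_in_syllable": int(stressed),
        "open_syllable": int(open_syllable),
        "diphthong": int(diphthong),
        "function_word": int(in_function_word),
    }

print(binary_features("tien", stressed=True, in_function_word=False))
```

Each of the four values feeds a single binary input of the network.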

3.4.6. Type of phrase

After a statistical analysis of duration values according to type of phrase we found that there were significant differences between them. So, we decided to include this parameter. In our text-to-speech system, it is the type of punctuation mark that is at the end of the phrase. We are using punctuation in place of explicit prosodic phrase prediction because punctuation is a primary factor in phrase prediction in Spanish. We have considered the following punctuation marks:

• Closing exclamation (!)
• Opening bracket ( ( )
• Closing bracket ( ) )
• Comma (,)
• Period (.) and semicolon (;)
• Colon (:)
• Final question mark (?)
• Any other punctuation mark.

Our first experiment was to use a one-of-n coding (with eight inputs), but the results were not as good as expected. A thorough examination of the database showed that the distribution of phrases was not uniform: some types had significantly fewer examples than the others.


The solution was to reduce the number of types to four, obtaining a more uniform distribution:

• Closing exclamation (!) and final question mark (?)
• Opening bracket ( ( ), closing bracket ( ) ) and comma (,)
• Period (.), colon (:) and semicolon (;)
• Any other punctuation mark.

Using a one-of-n coding as before, the results were good (see Table III, item 5), showing an improvement of 1·0% for the test set.
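The grouping of the eight marks into four balanced classes, followed by the one-of-n coding, can be sketched as follows (an illustrative helper, not the paper's code):

```python
# Collapse the eight punctuation marks into the four balanced classes,
# then produce a one-of-n input vector for the network.
GROUP = {
    "!": 0, "?": 0,          # closing exclamation and final question mark
    "(": 1, ")": 1, ",": 1,  # brackets and comma
    ".": 2, ":": 2, ";": 2,  # period, colon and semicolon
}                            # everything else falls into class 3

def phrase_type_inputs(mark):
    cls = GROUP.get(mark, 3)
    vec = [0] * 4            # one-of-n coding: exactly one input fires
    vec[cls] = 1
    return vec

print(phrase_type_inputs("?"))    # [1, 0, 0, 0]
print(phrase_type_inputs("..."))  # [0, 0, 0, 1]
```

With four classes instead of eight, each class keeps enough training examples for the network to learn a stable effect.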

3.4.7. Position parameters

We have considered different alternatives for parameters related to position:

• Position of the phoneme in the syllable, word and phrase.
• Position of the syllable in the word and phrase.
• Position of the word in the phrase.

Phrase boundaries are determined using the punctuation marks mentioned in the previous section. In our first approach we carried out the following steps for the coding:

1. Normalize the value of the position by dividing it by the total number of units in the higher-order unit, subtracting 1 from the position prior to the division; we obtain a floating point value between 0 and 1.

2. This value is coded using four classes. The intervals that define these classes are computed automatically. We consider all the values of the parameter in the training database and compute the boundary values looking for uniform distributions, i.e. so that the number of examples in each class is balanced.

3. The four classes use a thermometer-type coding with three inputs (always the number of classes minus 1).
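The three steps can be sketched as follows. The reading of the normalization (subtract 1, then divide by the unit total minus 1) and the use of empirical quantiles for the automatic boundaries are our interpretation of the description above:

```python
def normalize(pos, total):
    """Step 1: positions 1..total mapped into [0, 1] (subtracting 1
    before dividing, one plausible reading of the paper's wording)."""
    return (pos - 1) / (total - 1) if total > 1 else 0.0

def class_boundaries(values, n_classes):
    """Step 2: boundaries computed from the training data so that each
    class receives a balanced number of examples (empirical quantiles)."""
    s = sorted(values)
    return [s[min(round(k * len(s) / n_classes), len(s) - 1)]
            for k in range(1, n_classes)]

def thermometer(x, boundaries):
    """Step 3: n_classes - 1 binary inputs; input j fires when the
    value lies strictly above boundary j."""
    return [1 if x > b else 0 for b in boundaries]

# toy training distribution of normalized positions, four classes
train = [0.0, 0.0, 0.25, 0.5, 0.5, 0.75, 1.0, 1.0]
b = class_boundaries(train, 4)          # [0.25, 0.5, 1.0]
print(thermometer(normalize(3, 4), b))  # [1, 1, 0]
```

Unlike one-of-n coding, the thermometer coding preserves the ordering of the classes, which suits an ordinal quantity such as position.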

The first results using just three neurons showed very little improvement. After analysing the results we decided to increase the number of neurons and use a different number of classes for each parameter. It is difficult to find an exact rule for deciding the optimum number of classes. Our approach has been to use the average value of the parameter in the whole database as the first approximation of the number of classes, and experiment with different values around that number.

These are the considerations made for each parameter:

• Position of the phoneme in the syllable: the average value is 2·27 and the optimum number of classes is three. The intervals obtained automatically are x = 0, 0 < x < 0·75, x > 0·75.

• Position of the phoneme in the word: the average value is 4·5 and the optimum number of classes is five. The intervals obtained automatically are x = 0, 0 < x < 0·3684, 0·3684 < x < 0·6429, 0·6429 < x < 0·9474, x > 0·9474.

• Position of the phoneme in the phrase: the average value is 21 and has a high variability. We experimented with five, 13, and 21 classes, but the results were not good for any of them. It was a foreseeable result, as this information is not relevant to duration modelling.

• Position of the syllable in the word: the average value is slightly lower than two and the optimum number of classes is four.


Figure 1. Histograms for position parameters (occurrences vs. value): position of phoneme in syllable, position of syllable in word, position of word in phrase, and position of phoneme in phrase.

• Position of the syllable in the phrase: the average value is nine and the optimum number of classes is six. The results are low again, showing the high variability of this parameter and the difficulty of generalizing from it.

• Position of the word in the phrase: the average value is five and the optimum number of classes is three.

The best results for each parameter are shown in items 6–8 of Table III. All of them have improved the base experiment, and—although the results do not show significant differences between them—we have decided to use: phoneme in the syllable, syllable in the word, and word in the phrase. The reason is that they provide different information to the network (they are not redundant), their range of values is smaller, and, because of their smaller range, they need fewer neurons and classes to reach an optimum. Another advantage of this solution is that it automatically responds to effects like the lengthening before a pause, where all inputs to the network will be close to 1 for these position parameters.

In the first three histograms of Figure 1 we can see the distribution of the selected parameters. The fourth histogram of Figure 1 displays the distribution of "position of the phoneme in the phrase", which shows its great variability compared to the three selected parameters. That explains the difficulties we had in modelling it.


Figure 2. Histograms for "number of units" parameters (occurrences vs. value): number of phonemes in syllable, number of syllables in word, and number of words in phrase.

3.4.8. “Number of units” parameters

In a similar way as for the parameters related to position, we have considered the number of phonemes in the syllable, word and phrase; syllables in the word and phrase; and words in the phrase. Because of their different distribution, we needed a different coding for this kind of parameter.

We followed these steps for their coding:

1. Normalize the value by the maximum observed value—we obtain a floating point value between 0 and 1.

2. Apply a Z-score (using the average and standard deviation), which is the usual recommendation in the neural network literature (Masters, 1993): this lets us restrict at will the operating range of the parameter, which is otherwise too variable.
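A minimal sketch of this two-step coding follows. The clipping value used to bound the operating range is our assumption; the paper only says that out-of-range values are rejected:

```python
import math

def zscore_coding(counts, clip=3.0):
    """Code a "number of units" parameter: normalize by the maximum,
    then Z-score. `clip` bounds the operating range (the value 3.0 is
    a hypothetical choice, not taken from the paper)."""
    m = max(counts)
    norm = [c / m for c in counts]
    mu = sum(norm) / len(norm)
    var = sum((x - mu) ** 2 for x in norm) / (len(norm) - 1)
    sd = math.sqrt(var)
    # clip to [-clip, clip] so extreme counts cannot dominate training
    return [max(-clip, min(clip, (x - mu) / sd)) for x in norm]

print(zscore_coding([1, 2, 3, 4, 5]))
```

In practice the mean and standard deviation would be estimated once on the training set and reused at synthesis time.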

The results can be seen in items 9–11 of Table III. The conclusions that can be extracted are very similar to those for the position parameters. All parameters referring to the phrase provide worse results, which is due to their broad range of values. As can be seen in Figure 2, the range of values for "number of words in phrase" is bigger.

In order to check the suitability of this coding we tested the thermometer-type coding instead of the floating point one (as for position-related parameters). We applied it to the number of phonemes in the word. The average value is 4·5 and the variability is low. We considered three, four, five, and nine classes, the optimum being three classes, but in all cases the results were not as good as those obtained with the floating point coding.

At this point, we wondered whether the floating point coding was better for position-related parameters than the thermometer-type coding. We carried out a similar experiment, but the results were slightly worse. In any case, when results are similar, in a neural network it is always safer to use binary inputs, which are more suitable for the training (considering the final network, where the training is more difficult because of the large number of inputs and neurons).

3.4.9. Position in the phrase in relation to first/last stress

The motivation for this parameter is the explicit inclusion of the "lengthening before pause" effect, which is relevant for all languages. We code each syllable into five possible classes: beginning of phrase till the first stress; first stressed syllable; after the first stress and before the last stress; last stressed syllable; from the last stress till the end of the phrase. The improvement obtained (1·6%) was remarkably good (see Table III, item 12). In Subsection 3.5 we will check whether this improvement can be additive, combining in the same model this parameter with the position parameters.

We compared the five-class coding with using just two classes: beginning of phrase till the last stress, and from the last stress till the end of the phrase. We thought that it could capture better the "lengthening before pause" effect, but the improvement was only 0·6%, so we decided to use the coding with five classes.
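The five-class coding can be sketched as a simple classification of each syllable index against the first and last stressed syllables of the phrase. The 0-based index convention and the handling of a single-stress phrase are our assumptions:

```python
def stress_region(syl_idx, first_stress, last_stress):
    """Five-class coding of a syllable's position relative to the first
    and last stressed syllables of the phrase (0-based indices, a
    hypothetical convention).

    0: beginning of phrase till the first stress
    1: first stressed syllable
    2: after the first and before the last stress
    3: last stressed syllable
    4: from the last stress till the end of the phrase
    """
    if syl_idx < first_stress:
        return 0
    if syl_idx == first_stress:
        return 1  # a single-stress phrase also lands here (assumption)
    if syl_idx < last_stress:
        return 2
    if syl_idx == last_stress:
        return 3
    return 4

# phrase of nine syllables, stresses on syllables 2 and 6
print([stress_region(i, 2, 6) for i in range(9)])  # [0, 0, 1, 2, 2, 2, 3, 4, 4]
```

Class 4 is the one that captures the pre-pausal lengthening region explicitly.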

3.4.10. Summary of results for parameter evaluation

The summary of the most relevant results is shown in Table III. We have been able to find a good coding for all the parameters, as there is an improvement in all of them. We have obtained the best results for: type of phrase, position in the phrase in relation to first/last stress, and the "number of units" parameters. In any case, the maximum improvement with a single parameter has been 1·8% for the development test set.

We have applied the same methodology to a female speaker in a restricted-domain environment (Cordoba et al., 2001). The most remarkable differences with the unrestricted-domain database presented in this paper are the following: the improvements are bigger, up to 5%; all position and "number of units" parameters give consistent improvements (between 2 and 3·5%); syllable structure and function word yield an improvement close to 2·5%; the parameters related to the phrase provide better results; the best parameter is "position of the word in the phrase"; and the window of five phonemes for the phoneme identity is 5% better than the window of three phonemes.

3.4.11. Modelling of the duration

To model the duration, the first decision to make is whether it should be normalized. As the results in Table IV show (experiment B), we found that it is better to normalize the duration of the phoneme by the average phoneme duration in the phrase; this way the system is less affected by changes in speed in the database recordings. The improvement is more than 3% in the development test set and 4% for the train set.

After that conclusion, we considered several transformations for the duration. The objective of these transformations is to balance the distribution and to have a magnitude that


TABLE IV. Results for different ways to code the duration (relative RMS)

Experiment                             | Train  | Test
A—Duration not normalized              | 0.7877 | 0.8261
B—Normalized duration                  | 0.7539 | 0.7993
C—Log of normalized duration           | 0.7891 | 0.8359
D—Standard deviation                   | 0.7495 | 0.7991
E—Standard deviation of the logarithm  | 0.7493 | 0.7980

can be more easily trained. These are the options considered (all references are to results in Table IV):

• The duration itself (experiment B).

• The logarithm of the duration (experiment C). The objective of taking the logarithm is to balance the distribution of the duration. The results are bad and this option is discarded.

• Find the average duration for each phoneme and model the standard deviation, i.e. the normalized difference of duration from a phoneme-dependent mean (experiment D, Z-score mapping, see Section 3.2). The objective is to minimize the error made by the network, as it includes the characteristic duration of each phoneme in the final prediction. The improvement over option A is close to 3·3% for the test set.

• The standard deviation of the logarithm of the duration: it tries to cover both objectives (experiment E). The improvement over option A is 3·4% for the test set.
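The winning coding (experiment E) and its inverse can be sketched as follows, assuming the per-phoneme mean and standard deviation of the log-normalized duration are estimated on training data (the parameter names are hypothetical):

```python
import math

def duration_target(dur_ms, phrase_mean_ms, log_mu, log_sd):
    """Experiment E target: Z-score of the log of the phrase-normalized
    duration. `log_mu`/`log_sd` are the per-phoneme mean and standard
    deviation of the log-normalized duration on the training set."""
    x = math.log(dur_ms / phrase_mean_ms)  # normalize by phrase, then log
    return (x - log_mu) / log_sd

def duration_from_target(t, phrase_mean_ms, log_mu, log_sd):
    """Inverse mapping, used to turn the network output back into ms."""
    return phrase_mean_ms * math.exp(t * log_sd + log_mu)

# a 90 ms phoneme in a phrase whose phonemes average 75 ms
t = duration_target(90.0, 75.0, log_mu=0.0, log_sd=0.2)
print(round(duration_from_target(t, 75.0, 0.0, 0.2), 1))  # 90.0
```

Because the characteristic duration of each phoneme is folded into `log_mu`, the network only has to predict the contextual deviation from it.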

The results are shown in Table IV. Clearly, options B, D and E are the best possibilities. Although the differences between them are not significant, the best behaviour in general (considering several experiments not shown in this paper which use different topologies) corresponds to the last option, which we will use from now on.

3.5. Putting everything together

The next set of experiments was dedicated to including all the parameters together. This is the crucial step, because often the improvements obtained for one parameter are not reflected when this parameter is used in conjunction with others. There are two possible reasons for this behaviour:

• The parameters are closely correlated, so one of them does not offer additional information.

• The topology of the network needs to be tuned: including more parameters, the number of inputs also increases, so a larger number of neurons can help to discriminate the information.

It is difficult to determine which of the two reasons is responsible for a bad result when many parameters are included at the same time. Our solution has been to try different topologies each time a new parameter has been added to the network.
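This retuning loop can be sketched as a small search over hidden-layer sizes. The `evaluate` callback stands in for training the network and measuring the test RMS, which is not shown here:

```python
def retune_topology(evaluate, sizes=(4, 6, 8, 10)):
    """Each time a parameter set grows, re-try several hidden-layer
    sizes and keep the one with the lowest test RMS. `evaluate` is a
    caller-supplied function mapping a neuron count to a test RMS
    (a hypothetical stand-in for training and scoring the MLP)."""
    best = min(sizes, key=evaluate)
    return best, evaluate(best)

# toy stand-in: pretend the test RMS bottoms out at 8 neurons
rms = {4: 0.800, 6: 0.790, 8: 0.777, 10: 0.778}
print(retune_topology(rms.get))  # (8, 0.777)
```

Separating a bad coding from an undersized topology then amounts to checking whether any size in the sweep recovers the single-parameter improvement.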

In Table V we can see the summary of results using the parameters together. The numbers in the description of the experiments refer to the experiments specified in Table III. In all experiments the optimum was obtained using eight or 10 neurons in the network.

The first experiments in Table V are new windowing experiments that show an improvement:


TABLE V. Results with the inclusion of all parameters (relative RMS)

Experiment                                            | Train  | Test
Base                                                  | 0.7493 | 0.7980
13—Base + window of 5 phonemes                        | 0.7293 | 0.7842
14—Base + window of 7 phonemes                        | 0.7170 | 0.7844
15—13 + window of 5 in stress                         | 0.7505 | 0.8055
16—13 + 1 + 4                                         | 0.7447 | 0.7824
17—13 + 1 + 4 + 6 + 7 + 8                             | 0.7238 | 0.7791
18—13 + 1 + 4 + 6 + 7 + 8 + 9 + 10 + 11               | 0.7071 | 0.7774
19—13 + 1 + 4 + 6 + 7 + 8 + 9 + 10 + 11 + 5           | 0.7120 | 0.7814
21—13 + 1 + 4 + 6 + 7 + 8 + 9 + 10 + 11 + 14          | 0.7036 | 0.7730
22—13 + 1 + 4 + 6 + 7 + 8 + 9 + 10 + 11 + 14 + 2 + 3  | 0.7059 | 0.7719

• Experiment 13: it is the base experiment, now using a window of five phonemes (clustering the adjacent phonemes into 13 classes, as we did in Section 3.4.3). The improvement was 1·7% for the test set and 2·7% for the train set.

• Experiment 14: we experimented with a window of seven phonemes, but the results were similar for the test set, which was probably due to an excess of inputs to the network. So, from that point we used the window of five phonemes in all the experiments.

• Experiment 15: we used a window of five values for the stress, but with worse results, so we returned to the window of three values.

The last set of experiments (16–22) shows the inclusion in successive steps of the different parameters studied in Section 3.4. The main conclusion is that all inclusions provide an improvement (except the type of phrase), but as could be expected the improvements are not additive. One important aspect is that the inclusion of the parameter "position in the phrase in relation to first/last stress" improved the global system (already using the position parameters), which means that it provides complementary information.

The global improvement obtained with the inclusion of all parameters is 3·3%. In the restricted-domain environment with a female speaker the inclusion of all the parameters meant an improvement of 11·2% for a similar number of neurons, and 18·7% after increasing the number of neurons to 20. This is another advantage of the better coverage of that database: we could increase the number of neurons and still improve the results for the test set.

3.6. Final results

In our best experiment the average absolute error is 230 samples, equivalent to 14·3 ms, which is really close to the maximum accuracy of the segmentation of our database (we were limited to 10 ms steps). The absolute RMS for that experiment is equal to 19·3 ms.

We applied the same topology to the validation set. These are the results: average absolute error 14·2 ms, absolute RMS 19·1 ms and relative RMS 0·7667. So, the results are very similar in both sets. There is a slight improvement in the validation set, but it is obviously not significant. In any case, it shows the robustness of the selected topology.

In our restricted-domain environment with a female speaker the relative RMS was 0·4536, the average absolute error was 11·8 ms, and the absolute RMS was 15·5 ms. Using the rule-based system, the absolute RMS was 28·5 ms. Again, the better coverage of that database guaranteed better results.


We can compare ourselves to Fernandez-Salgado and Banga (1999) in Galician, where the RMS was 19·6 ms, but they used a database of 300 short sentences, much more uniform than ours and similar in size (as they admit in the paper). In Riley (1992), the RMS was 23 ms in a system based on CART. In any case, the results are not really comparable as they use very different databases.

3.7. Comparison to the rule-based system

Our previous rule-based system had a relative RMS equal to 0·9055, which is clearly worse than our best result (17·3% worse).

To check the significance of the comparison of both systems, we applied Student's t-test to the comparison of the normalized duration obtained with both systems. We obtained a two-tailed significance of 0·000 ≪ 0·05, so the means are clearly different. The 95% confidence band for the difference in means is (−0·138, −0·077).
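A minimal version of this paired comparison can be written with the standard library alone. The 1.96 quantile is the large-sample normal approximation, used here instead of the exact Student t quantile:

```python
import math
import statistics

def paired_ttest(a, b):
    """Paired comparison of per-phoneme normalized durations from two
    systems: returns the t statistic and an approximate 95% confidence
    interval for the mean difference (1.96 is the large-sample normal
    quantile, an approximation to the exact t quantile)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    t = mean / se
    return t, (mean - 1.96 * se, mean + 1.96 * se)

# toy paired samples; with real data, one value per phoneme occurrence
t, ci = paired_ttest([1.0, 1.2, 0.9, 1.3], [0.9, 1.0, 0.85, 1.1])
print(round(t, 2), [round(c, 3) for c in ci])
```

A confidence interval that excludes 0, like the paper's (−0·138, −0·077), indicates that the two systems' mean durations differ significantly.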

We have visually observed that the results provided by the net follow the peak values of the duration very accurately, which shows the correct training of our system. Sometimes the neural network cannot reach the value of the peaks, which is the main factor in the error. But in these cases, we have observed that they are mostly due to emotional changes made by the speaker, who is using special emphasis and/or greater speaking-rate variation. In these experiments, we did not want to model these kinds of variations, but it is another line of investigation that we are following, and it will be reported when it is finished.

This effect can be seen in Figure 3, which shows the comparison between the predicted duration value and the manually labelled value for a typical part (40 phonemes) of the recognition set.

4. Conclusions

Compared to our previous rule-based system, the results are much better, even when using a limited number of parameters. We have a system that can be easily adapted to specific contexts and/or new databases. We have applied the same methodology to a database with restricted prosody and have obtained a major improvement in these circumstances, as the database is more uniform and homogeneous, with many instances of each type of phrase.

With our environment, the generation of the network for a new speaker can be fast, semi-automatic and inexpensive. Two approaches can be taken:

1. The cheapest way is to adopt the existing configuration and directly generate the new network. We have done that for the female speaker with very similar results.

2. If there are no time constraints, some of the experiments can be repeated to fine-tune the configuration.

Regarding the topology, it is difficult to find the optimum of the network. It is better to begin with a low number of neurons and increase it step by step. The same applies to the inclusion of parameters: it is better to decide their best coding in small networks. We have found that in most cases a second hidden layer is not necessary. It is good for the training set but not for the test set, so we decided not to use it. The "Z-score" normalization used with numeric parameters shows an optimum behaviour: it adjusts the margin of accepted values automatically and rejects the out-of-range values.


Figure 3. Performance of the neural network (network evaluated: e133–3–wg3; database: d–s14–e133.mat): output of the network against sample number, comparing the NN value with the optimum value.

Regarding the parameters, the most important ones are the phoneme identity and the stress, which is just as we expected. The inclusion of contextual information for them has been positive. For the phoneme identity we needed to cluster the phonemes to reduce the number of inputs to the network. Another important aspect is the way to code the duration: we have found better results by normalizing it and modelling the standard deviation of the logarithm. The effect of the other parameters has been positive in general, although they have had more impact on the training set, which shows not only their relevance but also the difficulty to generalize.

In general, we can say that we have found a good compromise between network topologyand the parameters considered, with good results that are stable.

The system will be included in a high quality text-to-speech system in Spanish, Boris-GTH (Pardo et al., 1995). This system has been commercially distributed and can be tested on-line at the web address: http://www-gth.die.upm.es/index.html.

This work has been financed by the Interministerial Commission of Science and Technology projects Sicovox 194/89 and Demostenes TIC 95-0147. We acknowledge the help of M. A. Berrojo and J. Ferreiros in the development of the database used in this work, and Miguel A. Lopez-Carmona in the preparation of the environment and the experiments.


References

Allen, J., Hunnicut, S. & Klatt, D. H. (1987). From Text to Speech: The MITalk System. Cambridge University Press, Cambridge.

Berrojo, M. A. (1994). Preliminary work in duration modeling in Spanish. Internal Report TR/GTH-DIE-ETSIT-UPM/2-94. Departamento de Ingeniería Electrónica, Universidad Politécnica de Madrid.

Campbell, W. N. (1992). Syllable-based segmental duration. In Talking Machines: Theories, Models and Designs (G. Bailly, C. Benoit and T. R. Sawallis, eds), pp. 211–224. Elsevier, Oxford.

Cordoba, R., Vallejo, J. A., Montero, J. M., Gutierrez-Arriola, J., Lopez, M. A. & Pardo, J. M. (1999). Automatic modeling of duration in a Spanish text-to-speech system using neural networks. European Conference on Speech Communication and Technology, 1619–1622.

Cordoba, R., Montero, J. M., Gutierrez-Arriola, J. & Pardo, J. M. (2001). Duration modeling in a restricted-domain female-voice synthesis in Spanish using neural networks. International Conference on Acoustics, Speech and Signal Processing, II-793–796.

Corrigan, G. & Massey, N. (1997a). Generating segment durations in a text-to-speech system: a hybrid rule-based/neural network approach. European Conference on Speech Communication and Technology, 2675–2678.

Corrigan, G. & Massey, N. (1997b). Text-to-speech conversion with neural networks: a recurrent TDNN approach. European Conference on Speech Communication and Technology, 561–564.

Deans, P., Breen, A. & Jackson, P. (1999). CART-based duration modeling using a novel method of extracting prosodic features. European Conference on Speech Communication and Technology, 1823–1826.

Fackrell, J. W. A., Vereecken, H., Martens, J. P. & Van Coile, B. (1999). Multilingual prosody modeling using cascades of regression trees and neural networks. European Conference on Speech Communication and Technology, 1835–1838.

Febrer, A., Padrell, J. & Bonafonte, A. (1998). Modeling phone duration: application to Catalan TTS. 3rd ESCA/COCOSDA Workshop on Speech Synthesis, 43–46.

Fernandez-Salgado, X. & Banga, E. R. (1999). Segmental duration modeling in a text-to-speech system for the Galician language. European Conference on Speech Communication and Technology, 1635–1638.

Ferreiros, J., Cordoba, R., Savoji, M. H. & Pardo, J. M. (1995). Continuous speech HMM training system: application to speech recognition and phonetic label alignment. In NATO ASI "Speech Recognition and Coding, New Advances and Trends" (A. J. Rubio and J. M. Lopez, eds), pp. 68–71. Springer, Berlin.

Lopez-Gonzalo, E. & Rodríguez-García, J. M. (1996). Statistical methods in data-driven modeling of Spanish prosody for text-to-speech. International Conference on Speech and Language Processing, 1373–1376.

Macarron, A., Escalada, G. & Rodríguez, M. A. (1991). Generation of duration rules for a Spanish text-to-speech synthesizer. European Conference on Speech Communication and Technology, 617–620.

Masters, T. (1993). Practical Neural Network Recipes in C++. Academic Press, San Diego, CA.

Mobius, B. & van Santen, J. P. H. (1996). Modeling segmental duration in German text-to-speech synthesis. International Conference on Spoken Language Processing, 2395–2398.

Morlec, Y., Bailly, G. & Auberge, V. (1997). Synthesising attitudes with global rhythmic and intonation contours. European Conference on Speech Communication and Technology, 219–222.

Navarro Tomas, T. (1948). Manual de entonación española (Manual of Spanish Intonation), 2nd edition. Hispanic Institute, New York, NY.

Pardo, J. M., Martínez, M., Quilis, A. & Munoz, E. (1987). Improving text-to-speech conversion in Spanish: linguistic analysis and prosody. Proceedings of the European Conference on Speech Technology, volume 2, pp. 173–176. CEP Consultants, Edinburgh.

Pardo, J. M., Gimenez de los Galanes, F. M., Vallejo, J. A., Berrojo, M. A., Montero, J. M., Enríquez, E. & Romero, A. (1995). Spanish text-to-speech, from prosody to acoustics. 15th International Congress on Acoustics, 133–136.

Quazza, S. & Heuvel, H. (1997). The use of lexica in text-to-speech systems. Course at Nijmegen University.

Quilis, A. & Fernandez, J. A. (1989). Curso de fonética y fonología españolas (Course in Spanish Phonetics and Phonology). Ed. C.S.I.C.

Riedi, M. (1995). A neural-network-based model of segmental duration for speech synthesis. European Conference on Speech Communication and Technology, 599–602.

Riedi, M. (1997). Modeling segmental duration with multivariate adaptive regression splines. European Conference on Speech Communication and Technology, 2627–2630.

Riley, M. D. (1992). Tree-based modeling of segmental durations. In Talking Machines: Theories, Models and Designs (G. Bailly, C. Benoit and T. R. Sawallis, eds), pp. 265–273. Elsevier, Oxford.

Rodríguez-Crespo, M. A., Escalada, J. G. & Torre, D. (1998). Conversor texto-voz multilingüe para español, catalán, gallego y euskera (Multilingual text-to-speech converter for Spanish, Catalan, Galician and Basque). SEPLN (Spanish Society of Natural Language Processing), Rev. 23, 16–23.

Santos, A. (1984). A system for multi-voice synthesis in real-time from a written text. PhD Thesis, Universidad Politécnica de Madrid, E.T.S.I. Telecomunicación.

Tournemire, S. de (1997). Identification and automatic generation of prosodic contours for a text-to-speech synthesis system in French. European Conference on Speech Communication and Technology, 191–194.

Vallejo, J. A. (1998). Improvements to the fundamental frequency in the text-to-speech conversion. PhD Thesis, Universidad Politécnica de Madrid, E.T.S.I. Telecomunicación.

van Santen, J. P. H. (1994). Assignment of segmental duration in text-to-speech synthesis. Computer Speech and Language, 8, 95–128.

van Santen, J. P. H., Shih, C., Mobius, B., Tzoukermann, E. & Tanenblatt, M. (1997). Multi-lingual duration modeling. European Conference on Speech Communication and Technology, 2651–2655.

van Son, R. J. J. H. & van Santen, J. P. H. (1997). Strong interaction between factors influencing consonant duration. European Conference on Speech Communication and Technology, 319–322.

Villar, J. M., Lopez-Gonzalo, E. & Relano, J. (1999). A mixed strategy approach to Spanish prosody. European Conference on Speech Communication and Technology, 1879–1882.

(Received 25 July 2000 and accepted for publication 25 November 2001)