
MPhil in Machine Learning, Speech and Language Technology

2015-2016

MODULE COURSEWORK FEEDBACK

Student Name: Riashat Islam
Module Title: Statistical Speech Synthesis
CRSiD: ri258
Module Code: MLSALT10
College: St John's
Coursework Number: 1

I confirm that this piece of work is my own unaided effort and adheres to the Department of Engineering's guidelines on plagiarism.

Date Marked:
Marker's Name(s):
Marker's Comments:

This piece of work has been completed to the following standard (please circle as appropriate): Distinction / Pass / Fail (C+ - marginal fail)

Overall assessment (circle grade): Outstanding  A+     A      A-     B+     B      C+     C      Unsatisfactory
Guideline mark (%):                90-100       80-89  75-79  70-74  65-69  60-64  55-59  50-54  0-49

Penalties: 10% of mark for each day late (Sunday excluded).

The assignment grades are given for information only; results are provisional and are subject to confirmation at the Final Examiners Meeting and by the Department of Engineering Degree Committee.


MLSALT10: Parametric Speech Synthesis

Riashat Islam
Department of Engineering, University of Cambridge
Trumpington Street, Cambridge, CB2 1PZ
[email protected]

Abstract—This work presents an overview of the basic techniques in statistical parametric speech synthesis based on HMM systems. We present experimental demonstrations and discuss the different approaches for accurate speech synthesis. Our experimental results demonstrate the approaches for synthesis and parameter trajectory generation, and how the global variance method can alter the form of the generated parameter trajectories.

I. INTRODUCTION

A typical speech synthesis system consists of training and synthesis parts, as shown by the block diagram in figure 1. In a text-to-speech synthesis system, the training part consists of extracting both spectrum and excitation parameters from a speech database and modelling them with context-dependent HMMs. During the synthesis stage, the text for a given speech utterance is first converted to a context-dependent sequence of labels, and the utterance HMM is constructed by concatenating multiple HMMs according to the given label sequence. A sequence of speech parameters is then generated, from which the speech waveform can finally be synthesized.

Figure 1. HMM-based Speech Synthesis System [1]

In section II, we first provide a brief background on the basic task of statistical speech synthesis. Section III then considers the basic tasks of generating label sequences, generating the waveform parameters and the speech waveform, and how trajectories are generated based on forced alignment of the waveform data with the acoustic model. Finally, section IV considers the approach of combining multiple acoustic models within the Product of Experts (PoE) framework, and the significance of using the global variance model. For each section, we first present a brief overview of the significance of the approach, then include experimental results followed by a discussion of the results obtained.

II. BACKGROUND

We provide a brief background on statistical parametric approaches to text-to-speech synthesis based on Hidden Markov Models (HMMs). The task of speech synthesis is to generate speech given only text as input. We consider approaches based on parametric models that describe the speech through the statistics of its parameters. The spectral and excitation parameters of speech are first extracted from a speech database. The models for HMM synthesis are trained on labelled data, with the model parameters estimated by maximum likelihood using the EM algorithm. In the synthesis phase, a sequence of context labels is produced from the input text, a sequence of models is constructed from the sequence of labels, and a sequence of parameters is then generated under the maximum likelihood criterion.

The speech waveform can be constructed from the parametric representations of the speech. One major difference of this process, compared to a speech recognition system, is that linguistic and prosodic contexts are taken into account in addition to the phonetic contexts. The given word sequence or text is converted into a context-dependent label sequence, and context-dependent HMMs are concatenated according to that label sequence. A sequence of spectral and excitation parameters is then generated from the utterance HMM. Additionally, in generating speech using HMMs, we need to model the rate of change of the statistical parameters to produce natural-sounding speech. Hence, delta coefficients (first-order derivatives) also need to be considered in addition to the static parameters (coefficients). The training phase therefore learns the distributions of these parameters, and in the synthesis phase the model is used to generate parameter trajectories that have the appropriate statistical properties.

In the synthesis process, a smoothly varying speech parameter trajectory is generated by maximizing the likelihood of the composite HMM subject to the constraints between static and dynamic features. However, one important drawback of this approach is the inconsistency between the training and synthesis criteria: training maximizes the likelihood of the joint static and dynamic features, whereas synthesis maximizes the likelihood of only the static feature vectors.

III. SYNTHESIS AND TRAJECTORIES

A. Question 1: Speech Spectrogram

The utt1.wav file can be played to listen to the waveform. The speech in the file is: "Keep the hatch tight and the watch constant". The spectrogram of the waveform was observed using Matlab, as follows:

Figure 2. Spectrogram for the utt1.wav waveform

The spectrogram is for the original waveform of the utterance. A spectrogram is a good representation of phones and their properties: it shows how the frequency content of speech varies with time. Figure 2 shows that there is more energy in the low-frequency regions than in the high-frequency regions of this time-frequency representation. The dark (yellowish) regions indicate the formants, or peaks in the spectrum. We consider spectrogram representations since sounds can be identified much better by the formants and their transitions; the transitions are represented by the intervals between the dark regions. Spectrograms are useful for evaluating text-to-speech systems. In figure 2 we consider the speech, its phonetic components and their transitions, given the text in the utt1.txt file. A high-quality text-to-speech system produces synthesized speech whose spectrograms nearly match those of the natural sentences.
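For reference, the following is a minimal sketch of how such a spectrogram can be computed and plotted in Python; the window length and overlap are illustrative assumptions, not the settings used for figure 2.

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

# Read the utterance and compute a log-magnitude spectrogram.
rate, x = wavfile.read("utt1.wav")
f, t, Sxx = spectrogram(x, fs=rate, nperseg=400, noverlap=320)

plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10))
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()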

B. Question 2: Generate Labels for Sentence

We consider the text analysis stage of the speech synthesis system. For the utt1.txt file containing the sentence of the speech uttered, text analysis includes dividing the text into words and sentences, assigning syntactic categories to words, grouping words into phrases, and so on.

The reference sentence in utt1.txt can be broken down into sub-word units to generate a sequence of labels for the sequence of trajectory HMMs. The labels can be generated using the text analysis tool Festival, and the label file utt1.lab can then be obtained.

The generated label file is an explicit representation of the linguistic structure of the message encoded in the original text. The phonetic interpretation of the TTS system provides quantitative phonetic values for the representation: durations of phonetic segments and F0 target values for pitch accents.

./scripts/txt2lab.sh \
    original/txt/utt1.txt lab
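To inspect the resulting label file programmatically, a minimal sketch is shown below; it assumes HTK-style lines of the form "start end phone" with times in 100 ns units, matching the CTXREAD function in the appendix.

# Print each phone and its duration in seconds from the label file.
with open("lab/utt1.lab") as f:
    for line in f:
        start, end, phone = line.split()
        print("{0}: {1:.3f} s".format(phone, (int(end) - int(start)) * 1e-7))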

C. Question 3: Generate Trajectory

The speech parameters, namely the spectrum, fundamental frequency (F0) and phoneme durations, can be statistically modelled and generated using HMMs under the maximum likelihood criterion. The problem of generating the speech parameters from the HMM amounts to obtaining a vector sequence via the most likely state sequence, similar to the Viterbi algorithm.
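Concretely, with the per-state means and variances of the static and dynamic features stacked along the chosen state sequence into mu and Sigma, and a window matrix W mapping static features to static-plus-dynamic observations, the most likely static trajectory is c = (W' Sigma^-1 W)^-1 W' Sigma^-1 mu. The following is a minimal numpy sketch of this solve with toy inputs; the appendix function get_traj_gaussian_params implements the same computation.

import numpy as np

def mlpg(mu, sigma_inv, W):
    # Most likely static trajectory under the joint static+dynamic Gaussian:
    # c = (W' S^-1 W)^-1 W' S^-1 mu
    A = W.T.dot(sigma_inv).dot(W)
    b = W.T.dot(sigma_inv).dot(mu)
    return np.linalg.solve(A, b)

# Toy example: 4 frames of a 1-D static feature, with a first-difference
# delta window stacked below the identity (shapes only; values illustrative).
T = 4
delta = np.eye(T) - np.eye(T, k=-1)
W = np.vstack([np.eye(T), delta])
mu = np.arange(2.0 * T)
sigma_inv = np.eye(2 * T)
c = mlpg(mu, sigma_inv, W)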

The label file can then be used to generate a trajectory of the model parameters using the script below. The label file is used to produce time-varying functions of the control parameters for an acoustic speech synthesis model, which can then be used to calculate samples of the speech waveform.

./scripts/lab2traj.sh -hmmdir models/hts -labdir lab \
    -outdir traj -filename utt1

The trajectories can then be read using the Matlab script loadtraj.m. We consider the nature of the generated fundamental frequency parameter trajectory in figure 3. For synthesizing speech, it is necessary to model and generate the fundamental frequency (F0) as well as the spectral sequences.

Figure 3. Trajectory of generated F0 patterns

Figure 3 shows the F0 pattern generated for the sentence and utterance in utt1. Figure 3 shows only the one-dimensional continuous values. Since F0 is not defined in unvoiced regions, the F0 pattern cannot be modelled by conventional discrete or continuous HMMs; the observations of F0 are drawn from a one-dimensional space in the voiced regions only. Also, if we compare figure 3 with figure 2, we can see that the F0 parameter trajectory is similar to the pattern observed in the spectrogram. The spectrogram also shows only the voiced regions, and the transitions are similar between the two figures. The generated F0 pattern is almost identical to the real F0 pattern observed from the spectrogram.
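As an illustration, a minimal sketch of plotting such an F0 trajectory; the file name follows the utt1.f0.txt convention used later, and the plain-text layout (one value per frame) is an assumption.

import numpy as np
import matplotlib.pyplot as plt

# Load the generated F0 trajectory and mask the unvoiced frames (F0 == 0)
# so that only the voiced regions are drawn.
f0 = np.loadtxt("traj/utt1.f0.txt")
f0 = np.where(f0 > 0, f0, np.nan)

plt.plot(f0)
plt.xlabel("Frame")
plt.ylabel("F0 (Hz)")
plt.show()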

D. Question 4: Generate Waveform from Trajectories

The traj2wav script can then be used to generate a waveform from the sequence of parameter trajectories.

./scripts/traj2wav.sh -trajdir traj -outdir wav \
    -filename utt1

The synthesized speech is quite different from the original speech. In the synthesized speech, the uttered words are more stretched out than in the original utterance. The original utterance is also much quicker, whereas in the synthesized speech there is a lot of stretching in the pronunciation of the word "constant".

Figure 4. Comparison of spectrogram of original and generated waveform

Figure 4 shows the differences in the speech spectrograms of the original and generated waveforms. The spectrogram of the generated waveform shows a shift in the fundamental frequency components, and the transitions between the words are not smooth. The generated spectrogram shows that the F0 components span a longer period than in the original waveform, and there is a shift in time. This is also consistent with the speech heard, since the words seemed to be stretched during the utterance and span a longer period of time.

E. Question 5: Manipulate Trajectories

The utt1.f0.txt file has been replaced with the file utt1.f0.Zero.txt, which is of the same length but contains all zeros. This text file contains the F0 parameter trajectory, except that the parameter values are now all set to zero.

This modified trajectory can then be used to generate a waveform, and the spectrogram of that waveform can be observed.
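A minimal sketch of how such an all-zero F0 file can be produced, assuming the trajectory is stored as plain text with one value per frame:

import numpy as np

# Load the generated F0 trajectory and write an all-zero file of the same length.
f0 = np.loadtxt("traj/utt1.f0.txt")
np.savetxt("traj/utt1.f0.Zero.txt", np.zeros_like(f0))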

The quality of the synthesized speech is much worse than the previously generated waveform. The synthesized speech now contains only unvoiced regions; there are no voiced frequencies in the utterance. This is because the fundamental frequency is zero everywhere after the generated trajectory was manipulated to contain all-zero values.

On observing the spectrogram, we further see that there are no differences between this spectrogram and the previous one. This is because the spectrogram can only represent the voiced regions, and the synthesized speech in this waveform now contains only unvoiced regions. If we observe the F0 component (not included here), we see only a horizontal line, since all the F0 components are zero.

F. Question 6: Forced Alignment of Parameter Trajectories

This section considers generating the trajectories based on forced alignment. The waveform data can be force-aligned with the acoustic model using the alignment script included in the appendix. The modelling of F0 is difficult due to the differing nature of F0 observations within voiced and unvoiced speech regions. Any practical F0 modelling approach must be capable of dealing with two issues: classifying each speech frame as voiced or unvoiced, and modelling the F0 observations in both voiced and unvoiced speech regions.

We have original waveforms from recorded databases, and we need to produce a linguistic specification for all the utterances in the speech corpus of the training data. However, manually labelling all the utterances is a difficult task. Therefore, when synthesizing new sentences, we need to predict the linguistic specification based on the text corresponding to the speech corpus. Forced alignment, borrowed from the speech recognition literature, can therefore be applied to improve the accuracy of the labelling, identifying the pause locations and pronunciation variations in the synthesized speech.

The phonetic and prosodic information is conveyed through the spectral envelope, the fundamental frequency and the durations of individual phones. Previously, we considered extracting the parameters, i.e., the spectral and F0 features. For the durations of individual phones, forced alignment using pre-trained HMMs is usually used to model speech variations instead of manually labelling individual phone durations. The trajectory parameters therefore need to be modelled in separate streams due to their different characteristics and time-scales. To reduce the effect of the duration models when comparing the generated F0 trajectories, the state-level durations are obtained by force-aligning the natural speech waveform from the test set. With these natural speech durations, voicing classifications can be obtained for each state, followed by F0 value generation within the voiced regions. This allows the natural speech parameters to be aligned with the synthesized speech, so that the two waveforms can be compared frame by frame or by observing the spectrograms.

Figure 5. Generated trajectories based on forced alignment. Comparison between original and generated parameter trajectory

Figure 5 shows the significance of alignment for generating trajectories based on forced alignment. In figure 5 we first show the F0 pattern of the real utterance in the original waveform; the second sub-figure shows the F0 pattern generated without any alignment. These two sub-figures show significant differences between the two parameter trajectories, since the number of samples in the generated trajectory differs from the number of original samples. Figure 5 then shows that after aligning the generated parameters with the natural speech, the generated parameter trajectory is much more similar to the original trajectory. By forced alignment, we can therefore better model the duration of each state. The resulting generated parameter trajectory is similar in length to the original trajectory although, as figure 5 shows, it can still differ slightly due to side-effects of the alignment.

Considering the label file utt1.lab, we can use this to generate trajectories that have the same length as the original waveform. Taking the first 60 features of the mel-cepstrum parameter trajectories, we can then align the original and generated parameters using the alignment script in the appendix. The cost of aligning two time points is given by an L2 norm, and a separate parameter assigns a cost for skipping a time point, trading off matches against skips.
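As an illustration of the pairwise cost underlying the alignment, the sketch below computes the L2 cost between every pair of original and generated frames over the first 60 mel-cepstrum features; the file names and layout are assumptions. A dynamic-programming pass over this matrix, as in Alignment.m in the appendix, yields the alignment.

import numpy as np
from scipy.spatial.distance import cdist

# Frames are rows; keep the first 60 mel-cepstrum features of each trajectory.
orig = np.loadtxt("traj/utt1.mcep.txt", delimiter=",")[:, :60]
gen = np.loadtxt("traj/utt1.gen.mcep.txt", delimiter=",")[:, :60]

# cost[i, j] = L2 distance between original frame i and generated frame j.
cost = cdist(orig, gen, metric="euclidean")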

Having aligned the two sets of parameters, we also obtain the state durations. After alignment, as shown in the bottom two sub-figures of figure 5, we find that the generated F0 patterns are now quite similar to the original F0 pattern, having considered labels occurring in the sentence that were observed in the training data.

Figure 6. Spectrogram for the synthesized speech, based on the generated parameter trajectory with forced alignment

Figure 6 then considers the spectrum based on the alignment of 10 mel-cepstrum features, having computed the alignment using the L2 error over all the features. Our results in figure 6 show that the spectra of the features are now quite similar, unlike in figure 4. This shows that F0 modelling can be significantly improved by considering forced alignment, and that the generated speech waveform, by accounting for phonetic and pronunciation variations, can now better synthesize the speech. Using forced alignment, the state durations can therefore be obtained by aligning the natural speech with the known phonetic context transcriptions. The models then generate F0 trajectories for the voiced regions by considering the phone durations.

G. Question 7: Replacing elements of synthesized trajectories

We considered replacing elements of the trajectories to analyse the impact on the synthesized speech waveform and how it compares to the original waveform. Considering a trajectory size of 509, the trajectory-stripping script checks whether the size of the trajectory is more or less than the size of the generated mel-cepstrum parameter trajectories. If the size is larger, the trajectories are stripped; padding is applied if the size is smaller, so that the trajectories being compared have the same length.
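A minimal sketch of this strip-or-pad step, assuming the trajectory is a numpy array with time along the first axis:

import numpy as np

def fit_length(traj, target=509):
    # Truncate if too long; zero-pad along the time axis if too short.
    if len(traj) >= target:
        return traj[:target]
    pad = np.zeros((target - len(traj),) + traj.shape[1:])
    return np.concatenate([traj, pad])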

Replacing the first component of the mel-cepstrum parameters: We first consider replacing only the first element and generating a speech waveform based on the trajectory with the first feature replaced. The generated speech waveform is quite bad, and there is a lot of disruption in the utterance. It is similar to the speech waveform obtained with all-zero F0.

Replacing all mel-cepstrum parameters: We then considered replacing all the parameters in the trajectory. Surprisingly, this generated waveform sounds better than the previous one, although it is quite poor compared to the original synthesized speech.

The generated speech spectrograms are further given in figure 7 for the two replaced trajectories, compared to the original trajectory.

Figure 7. Spectrogram after replacing particular or all trajectory components of the waveform

IV. TRAJECTORY GENERATION

In an HMM-based speech synthesis system, the conventional algorithm for generating a trajectory of static features maximizes the output probability of a parameter sequence that consists of static and dynamic features from the HMMs. We consider the task of generating trajectories of the parameters and the role of the global variance (GV) method. The GV method has been shown to improve synthetic speech quality, since it takes into account the variance of the static feature vectors calculated over a time sequence during the parameter generation process.

The static feature vectors generated from the HMMs are usually over-smoothed, causing a muffled effect in HMM-synthesized speech. The GV is inversely correlated with these smoothing effects, and hence a metric on the GV of the generated parameters is used as a penalty term in the parameter generation process. Related work has shown that synthetic speech quality can be improved by generating the parameter trajectory while keeping the global variance close to that of natural speech.

A. Question 1: Trajectory Generation based on Experts

In this section, we consider the task of combining multiple acoustic models, which can be expressed as a Product of Experts (PoE). In a speech synthesis system, the speech parameter trajectory generated from the acoustic models should satisfy many levels of constraints. To achieve high-quality speech synthesis, multiple acoustic models are combined, and the acoustic features of the training data are extracted and modelled individually. During speech synthesis, the generated speech parameters jointly maximize the output probabilities of the multiple acoustic models. The PoE framework is used to jointly estimate the multiple acoustic models: the output probabilities of the individual models (experts) are multiplied together and then normalized, effectively forming an intersection of the distributions.
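For Gaussian experts, this product has a closed form: precisions add, and the mean is a precision-weighted average. A minimal one-dimensional sketch:

import numpy as np

def product_of_gaussians(mus, variances):
    # N(x; mu1, v1) * N(x; mu2, v2) * ... is proportional to a Gaussian whose
    # precision is the sum of the expert precisions.
    precisions = 1.0 / np.asarray(variances, dtype=float)
    var = 1.0 / precisions.sum()
    mu = var * (precisions * np.asarray(mus, dtype=float)).sum()
    return mu, var

mu, var = product_of_gaussians([0.0, 1.0], [1.0, 0.5])  # -> (0.667, 0.333)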

The trajectory_generation.py script included in the appendix is used to generate a sequence of trajectories given the experts within the PoE framework. First, the getexpert.sh script can be used to extract the parameters of the expert for a given dimension. This can be done as:

./scripts/getexpert.sh -hmmdir models/htk \
    -labdir original/lab -stream 1 \
    -dimension 4 \
    -outdir expts \
    -filename utt1

The following script is then used to generate the trajectories.

# To generate the trajectories:
python trajectory_generation.py

This generates files that contain the experts for the mel-cepstrum and the emitting-state durations. Using the commands above, we can therefore efficiently model the high-dimensional training data such that each expert within the PoE satisfies one of the low-dimensional constraints.

Figure 8 shows the experimental results for the generated parameter trajectory compared to the original waveform. Considering dimension 4 only, as in figure 8, we show that the PoE framework can train multiple acoustic models jointly to obtain better estimates of the generated parameter trajectory, from which the synthesized speech can be obtained. Using the PoE framework, the likelihood contributions from each of the models are weighted, and the output is the most likely trajectory from the combined distribution.


Figure 8. Comparison of Original and Generated Trajectory of Dimension 4 (c3)

B. Question 2: Trajectory Generation based on Global Variance

Another way to compensate for the over-smoothing is to integrate multiple-level statistical models to generate the speech parameter trajectories. In practice, GV is used because the dynamic range of the mel-cepstral coefficients in the generated speech is smaller than that of the coefficients for natural speech; GV tries to recover the dynamic range of the generated trajectories towards the natural ones. In the global variance method, all the training utterances are modelled using a single multivariate Gaussian distribution.
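A minimal sketch of fitting this GV Gaussian, with stand-in random data in place of the real training trajectories:

import numpy as np

def gv_vector(c):
    # v(c): per-dimension variance of the static features over the frames
    # of one utterance (c has shape [frames, dimensions]).
    return np.var(c, axis=0)

# Stand-in training data; in practice these are the extracted static trajectories.
rng = np.random.default_rng(0)
training_trajectories = [rng.normal(size=(200, 60)) for _ in range(10)]

vs = np.stack([gv_vector(c) for c in training_trajectories])
mu_gv, var_gv = vs.mean(axis=0), vs.var(axis=0)  # per-dimension GV Gaussian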

Considering a speech parameter trajectory c, the GV method maximizes the following objective function with respect to the parameter trajectory:

F_{GV}(c; \lambda, \lambda_{GV}) = w \log \mathcal{N}(Wc; \mu_q, \Sigma_q) + \log \mathcal{N}(v(c); \mu_{GV}, \Sigma_{GV})    (1)

where w is used to balance the HMM and GV probabilities, v(c) denotes the global variance of the static trajectory, and the second term acts as a penalty that prevents over-smoothing. In this section, we consider the global variance model, implementing two different approaches based on a product of experts and on constrained optimisation. Since the maximum-likelihood parameter training algorithm causes perceptual distortion in speech synthesis, GV-based training is introduced into the HTS training structure. The GV method tries to enlarge the variance of the generated spectrum and F0.
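A minimal sketch of evaluating this objective for a one-dimensional trajectory, assuming diagonal covariances; this mirrors the "expert" branch of the global variance script in the appendix, where the GV term is additionally weighted.

import numpy as np

def log_gauss(x, mu, var):
    # Log-density of independent Gaussians, summed over dimensions.
    x, mu, var = np.atleast_1d(x, mu, var)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def f_gv(c, W, mu_q, var_q, mu_gv, var_gv, w=1.0):
    o = W.dot(c)              # static+dynamic observations from the statics
    v = np.var(c)             # v(c): global variance of the trajectory
    return w * log_gauss(o, mu_q, var_q) + log_gauss(v, mu_gv, var_gv)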

1) Products of Experts Framework: Considering combinations of multiple acoustic models, the acoustic features of the training data can be extracted and modelled individually. During the synthesis stage, we can then consider speech parameters that jointly maximize the output probabilities of the multiple acoustic models. The multiple acoustic models can be jointly estimated within a Products of Experts (PoE) framework.

Figure 9 shows the generated parameter trajectory using the global variance model in the PoE framework.

Figure 9. Generated Trajectory with Global Variance considering a Product of Experts Framework

Comparing the spectrograms of the natural speech and the generated speech, with and without global variance, the spectral structure is clearer when we consider GV; in other words, the spectral frequencies can be observed more clearly in the generated spectrogram. Additionally, when we listened to the waveform, it seemed that with the GV approach in PoE the synthesized speech has some additional artificial sounds associated with it.

2) Global Variance with Constraint: Another approach is to use trajectory training subject to a constraint on the global variance. In this approach, the update is made to both the means and the variances, and it is applied to both the spectral and F0 components. Parameter generation is then performed under a constraint on the variance of the generated parameter trajectory.
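For intuition, the constrained solve in Listing 3 can be derived as follows; writing the trajectory variance as v(c) = \frac{1}{T} c^\top J c with J = I - \frac{1}{T}\mathbf{1}\mathbf{1}^\top and P = \Sigma^{-1}, the stationarity condition of the Lagrangian is

\nabla_c \left[ -\tfrac{1}{2}(c - \mu)^\top P\,(c - \mu) + \lambda\, v(c) \right] = 0 \;\Rightarrow\; (P - \lambda J)\, c = P\mu = b,

where constant factors have been absorbed into \lambda. The constrained trajectory is thus the solution of a linear system, and \lambda is tuned by a scalar search until v(c) matches the GV mean, as done with scipy.optimize.minimize_scalar in the appendix.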

Figure 10. Generated Trajectory with Global Variance considering a constrained optimisation approach

3) Expert and Constraint for Other Dimensions of the System: We then evaluate whether the same trend in results can be observed for both the GV model with products of experts and the optimisation approach. We first examine the global variance model for dimension 50, as shown in figures 11 and 12.


Figure 11. Generated Trajectory with Global Variance considering a Product of Experts Framework - For Dimension 50

Figure 12. Generated Trajectory with Global Variance considering an optimisation approach - For Dimension 50

Furthermore, we also examined the significance of the global variance model for dimension 30, as shown in figures 13 and 14.

Figure 13. Generated Trajectory with Global Variance considering a Product of Experts Framework - For Dimension 30

Our results above, considering dimensions 50 and 30 instead of dimension 4, show similar trends. In the figures above, we compared the GV model under both the PoE and the optimisation approach.

Figure 14. Generated Trajectory with Global Variance considering an optimisation approach - For Dimension 30

C. Question 3: Comparison with other utterances

We repeated our experiments for the other speech utterances. In this section, we present the results for speech utterances 2 and 5. For each utterance, we repeated the parameter generation process considering multiple acoustic models in the PoE framework. From there, for each utterance, the global variance model within PoE and the optimisation approach were then considered. We again present our results only for dimension 4. As with the previous utterance, the experimental results again show that the global variance model within PoE plays a significant role in altering the generated parameter trajectory.

Figure 15. Generated Trajectory given experts for dimension 4 in utterance 2

The first set of experimental results is shown for utterance 2 (UTT2). We re-ran the parameter generation process for synthesizing the speech. Figure 15 shows the generated trajectory given the experts for dimension 4. Figure 16 further considers the global variance model within a product of experts framework; it shows that the generated parameter trajectory aligns better with the original parameters than in figure 15. Finally, figure 17 shows the generated parameter trajectory for utterance 2 using the global variance model with constraints. We then repeat the same set of experiments for utterance 5 (UTT5).


Figure 16. Generated parameter trajectory using global variance model for utterance 2

Figure 17. Generated parameter trajectory using global variance model with constraint for utterance 2

Figure 18. Generated Trajectory given experts for dimension 4 in utterance 5

Our experiments above were repeated for all the utterances. By considering multiple acoustic models in the PoE framework, we again show that we can compensate for the over-smoothing effect in the generated parameter trajectory. Our results in this section again demonstrate that GV can better recover the dynamic range of the generated trajectory towards that of the natural trajectory. The same trend in results is therefore observed for almost all the utterances. The synthesized speech for all the utterances is smoothed using the PoE framework, considering the global variance model.

Figure 19. Generated parameter trajectory using global variance model for utterance 5

Figure 20. Generated parameter trajectory using global variance model with constraint for utterance 5

V. SUMMARY

In this work, we have demonstrated the basic techniques of parametric speech synthesis. By generating a sequence of labels for sentences and the corresponding parameter trajectories, we first demonstrated how a waveform can be generated from the generated set of parameter trajectories. We then evaluated the different approaches for parameter generation. In particular, we demonstrated the usefulness of the global variance method and showed that it can improve the quality of the synthetic speech. Our experimental results showed the significance of the global variance method in parametric speech synthesis and demonstrated its usefulness across multiple utterances.

REFERENCES

[1] Alan W. Black, Heiga Zen, and Keiichi Tokuda. Statistical parametric speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), Honolulu, Hawaii, USA, April 15-20, 2007, pages 1229-1232.


APPENDIX: Speech Synthesis Practical CODE

April 17, 2016

1 Forced Alignment

function [original_alignment, generated_alignment] = Alignment(original_trajectory, generated_trajectory, skip_cost)
% Dynamic-programming alignment of two parameter trajectories.
% Columns of each trajectory are frames; skip_cost is the penalty for
% skipping a frame in either trajectory.

O = size(original_trajectory, 2);
G = size(generated_trajectory, 2);

% L2 cost of pairing frame i of the original with frame j of the generated
pair_cost = @(i, j) norm(original_trajectory(:, i) - generated_trajectory(:, j));

X = ones(O, G) * Inf;   % accumulated cost
Y = zeros(O, G);        % back-pointers: 1 = match, 2 = skip original, 3 = skip generated
X(1, 1) = pair_cost(1, 1);
Y(1, 1) = 1;

for i = 2:O
    X(i, 1) = X(i-1, 1) + skip_cost;
    Y(i, 1) = 2;
end

for j = 2:G
    X(1, j) = X(1, j-1) + skip_cost;
    Y(1, j) = 3;
end

for i = 2:O
    for j = 2:G
        A = ones(3, 1) * Inf;
        A(1) = X(i-1, j-1) + pair_cost(i, j);  % match frames i and j
        A(2) = X(i-1, j) + skip_cost;          % skip a frame of the original
        A(3) = X(i, j-1) + skip_cost;          % skip a frame of the generated
        [X(i, j), Y(i, j)] = min(A);
    end
end

% Trace back from (O, G) to (1, 1), marking the matched frames
original_alignment = false(O, 1);
generated_alignment = false(G, 1);
i = O; j = G;

while i > 1 || j > 1
    action = Y(i, j);
    if action == 2
        i = i - 1;
    elseif action == 3
        j = j - 1;
    else
        original_alignment(i) = true;
        generated_alignment(j) = true;
        i = i - 1;
        j = j - 1;
    end
end
original_alignment(1) = true;
generated_alignment(1) = true;

end

Listing 1: Alignment.m script

2 Trajectory Generation

#!/usr/bin/env python
import re
import scipy.linalg
import numpy as np

from collections import namedtuple

GaussianParams = namedtuple('GaussianParams', ['mean', 'var'])
DurationExpert = namedtuple('DurationExpert', ['phone', 'params'])
MelCepExpert = namedtuple('MelCepExpert', ['phone', 'static', 'delta', 'delta_delta'])

NUM_FEATURES = 1
EMITTING_STATES = range(2, 7)


def CMPREAD(path):
    # Read the mel-cepstrum experts: for each phone, a Gaussian (mean, var)
    # per emitting state for the static, delta and delta-delta features.
    cmpExpts = dict()
    with open(path) as f:
        while True:
            try:
                phone = next(f).strip()[1:-1]
                params = {'static': dict(), 'delta': dict(), 'delta_delta': dict()}
                for state in EMITTING_STATES:
                    line = map(float, next(f).strip().split(" "))
                    for i, k in enumerate(['static', 'delta', 'delta_delta']):
                        params[k][state] = GaussianParams(line[i], line[3 + i])
                cmpExpts[phone] = MelCepExpert(phone, **params)
            except StopIteration:
                break
    return cmpExpts


def CTXREAD(path):
    # Read an HTK-style label file: phone sequence plus durations in frames
    # (times are in 100 ns units; the frame shift is 5 ms).
    PhoneSequence = list()
    Durations = dict()
    with open(path) as f:
        for line in f:
            start, end, phone = re.split(r"\s+", line.strip())
            PhoneSequence.append(phone)
            Durations[phone] = int((int(end) - int(start)) * 1e-7 / 0.005)
    return PhoneSequence, Durations


def DURATIONREAD(path):
    # Read the duration experts: one Gaussian (mean, var) per emitting state.
    durationExperts = dict()
    with open(path) as f:
        while True:
            try:
                phone = next(f).strip()[1:-1]
                line = map(float, next(f).strip().split(" "))
                gaussianParams = map(
                    lambda i: (i, GaussianParams(line[2 * (i - 2)], line[2 * (i - 2) + 1])),
                    EMITTING_STATES)
                durationExperts[phone] = DurationExpert(phone, dict(gaussianParams))
            except StopIteration:
                break
    return durationExperts


def get_traj_gaussian_params(cmpExpt, q):
    # Stack the per-state means and variances along the state sequence q,
    # build the window matrix W mapping static features to
    # static/delta/delta-delta observations, and solve for the Gaussian
    # over the smoothed static trajectory.
    mu = np.concatenate(map(lambda state: np.array([
        cmpExpt.static[state].mean,
        cmpExpt.delta[state].mean,
        cmpExpt.delta_delta[state].mean]),
        q))
    sigma = scipy.linalg.block_diag(*map(lambda state: scipy.linalg.block_diag(*[
        cmpExpt.static[state].var,
        cmpExpt.delta[state].var,
        cmpExpt.delta_delta[state].var]),
        q))

    # One block of W: identity for the statics, followed by the standard
    # 5-point delta and delta-delta regression windows.
    block = np.vstack([
        np.hstack([
            np.zeros((NUM_FEATURES, NUM_FEATURES)),
            np.zeros((NUM_FEATURES, NUM_FEATURES)),
            np.eye(NUM_FEATURES),
            np.zeros((NUM_FEATURES, NUM_FEATURES)),
            np.zeros((NUM_FEATURES, NUM_FEATURES))]),
        np.hstack([
            -0.2 * np.eye(NUM_FEATURES),
            -0.1 * np.eye(NUM_FEATURES),
            np.zeros((NUM_FEATURES, NUM_FEATURES)),
            0.1 * np.eye(NUM_FEATURES),
            0.2 * np.eye(NUM_FEATURES)]),
        np.hstack([
            0.285714 * np.eye(NUM_FEATURES),
            -0.142857 * np.eye(NUM_FEATURES),
            -0.285714 * np.eye(NUM_FEATURES),
            -0.142857 * np.eye(NUM_FEATURES),
            0.285714 * np.eye(NUM_FEATURES)])
    ])
    paddedBlock = np.pad(block, ((0, 0), (0, len(q) - 1)), mode="constant")

    W = np.concatenate(map(lambda i: np.roll(paddedBlock, i, axis=1), range(len(q))))
    W = W[:, 2:-2]

    Wtp = np.transpose(W)
    sigmaInv = scipy.linalg.inv(sigma)
    Wtp_dot_sigmaInv = np.dot(Wtp, sigmaInv)
    Wtp_sigmaInv_W = np.dot(Wtp_dot_sigmaInv, W)

    sigmaFull = scipy.linalg.inv(Wtp_sigmaInv_W)
    muFull = np.dot(sigmaFull, np.dot(Wtp_dot_sigmaInv, mu))

    return muFull, sigmaFull


def gen_traj(durationExperts, cmpExpts, PhoneSequence, Durations):
    traj = []
    for phone in PhoneSequence:
        expertDursMle = map(lambda state: durationExperts[phone].params[state].mean,
                            EMITTING_STATES)
        # Distribute the phone duration over the states in proportion to
        # the state duration means.
        framesPerState = map(lambda x: int(round(x * Durations[phone] / sum(expertDursMle))),
                             expertDursMle)

        q = []
        for state, numFrames in zip(EMITTING_STATES, framesPerState):
            q.extend([state] * numFrames)

        cmpExpt = cmpExpts[phone]
        muFull, _ = get_traj_gaussian_params(cmpExpt, q)
        traj.append(muFull)
    return np.concatenate(traj)


def main():
    durationExperts = DURATIONREAD('/home/ri258/Documents/MLSALT10/expts/utt1.dur.expt')
    cmpExpts = CMPREAD('/home/ri258/Documents/MLSALT10/expts/utt1.cmp.expt')
    PhoneSequence, Durations = CTXREAD('/home/ri258/Documents/MLSALT10/original/lab/utt1.lab')

    print "Generating trajectory"
    traj = gen_traj(durationExperts, cmpExpts, PhoneSequence, Durations)

    writePath = "/home/ri258/Documents/MLSALT10/trajectories/generated_trajectory.csv"
    print "Writing to {0}".format(writePath)
    np.savetxt(writePath, traj, delimiter=",")


if __name__ == "__main__":
    main()

Listing 2: Trajectory Generation script

3 Global Variance

#!/usr/bin/env python
import scipy.optimize

from trajectory_generation import *

np.random.seed(44)

DIMENSION = 4


def GV_EXPTS(path):
    # Read the global variance experts: one Gaussian (mean, var) per dimension.
    with open(path) as f:
        entries = map(lambda x: float(x.strip()), f.readlines())
    GV_Experts = map(lambda x: GaussianParams(*x),
                     zip(entries[:len(entries) / 2], entries[len(entries) / 2:]))
    return GV_Experts


def gen_GV_problem(duration_Experts, CMP_expts, GV_Experts, PhoneSequence, phoneDurs):
    # Build the HMM-side Gaussian (muCmp, covCmp) over the whole utterance and
    # look up the GV Gaussian for the chosen dimension.
    T = 0
    muCmp = []
    covCmp = []
    for phone in PhoneSequence:
        expertDursMle = map(lambda state: duration_Experts[phone].params[state].mean,
                            EMITTING_STATES)
        framesPerState = map(lambda x: int(round(x * phoneDurs[phone] / sum(expertDursMle))),
                             expertDursMle)

        q = []
        for state, numFrames in zip(EMITTING_STATES, framesPerState):
            q.extend([state] * numFrames)
            T += numFrames

        cmpExpt = CMP_expts[phone]
        mu, cov = get_traj_gaussian_params(cmpExpt, q)
        muCmp.append(mu)
        covCmp.append(cov)

    muCmp = np.concatenate(muCmp)
    covCmp = scipy.linalg.block_diag(*covCmp)

    muGV = GV_Experts[DIMENSION].mean
    varGV = GV_Experts[DIMENSION].var

    return muCmp, covCmp, muGV, varGV, T


def generate_trajectory_GV(duration_Experts, CMP_expts, GV_Experts, PhoneSequence,
                           phoneDurs, method="expert"):
    muCmp, covCmp, muGV, varGV, T = gen_GV_problem(
        duration_Experts, CMP_expts, GV_Experts, PhoneSequence, phoneDurs)
    x0 = np.random.normal(size=T)
    if method == "expert":
        # Treat the GV term as an extra expert, weighted by alpha, and
        # minimize the negative log-probability by gradient-based optimisation.
        alpha = 3 * T
        f = lambda x: -1 * (
            -0.5 * np.dot(np.dot(np.transpose(x - muCmp), scipy.linalg.inv(covCmp)),
                          (x - muCmp)) +
            alpha * (-0.5 * ((np.var(x) - muGV) ** 2) / varGV))
        fprime = lambda x: -1 * (
            -np.dot(np.transpose(x - muCmp), scipy.linalg.inv(covCmp)) -
            alpha * (1.0 / (varGV * T)) * (np.var(x) - muGV) * 2 * (x - np.mean(x)))
        traj = scipy.optimize.minimize(f, x0, jac=fprime)
        return traj.x
    elif method == "constraint":
        # Constrained generation: solve (P - lam*J) x = b, searching over the
        # Lagrange multiplier lam until the trajectory variance matches muGV.
        J = np.eye(T) - (1.0 / T) * np.ones((T, T))
        P = scipy.linalg.inv(covCmp)
        b = np.dot(P, muCmp)
        f = lambda lam: (np.var(np.dot(scipy.linalg.inv(P - lam * J), b)) - muGV) ** 2

        soln = scipy.optimize.minimize_scalar(f, tol=1e-9)
        lam = soln.x

        return np.dot(scipy.linalg.inv(P - lam * J), b)
    else:
        raise Exception("Unknown method {0} for global variance trajectory generation"
                        .format(method))


def main():
    GV_Experts = GV_EXPTS("/home/ri258/Documents/MLSALT10/expts/GV.expt")
    duration_Experts = DURATIONREAD('/home/ri258/Documents/MLSALT10/expts/utt1.dur.expt')
    CMP_expts = CMPREAD('/home/ri258/Documents/MLSALT10/expts/utt1.cmp.expt')
    PhoneSequence, phoneDurs = CTXREAD('/home/ri258/Documents/MLSALT10/original/lab/utt1.lab')

    print("Generating trajectory (expert)")
    traj = generate_trajectory_GV(duration_Experts, CMP_expts, GV_Experts,
                                  PhoneSequence, phoneDurs, method="expert")

    writePath = "/home/ri258/Documents/MLSALT10/generated_trajectory_GlobalVariance.csv"
    print("Writing to {0}".format(writePath))
    np.savetxt(writePath, traj, delimiter=",")

    print("Generating trajectory (constraint)")
    traj = generate_trajectory_GV(duration_Experts, CMP_expts, GV_Experts,
                                  PhoneSequence, phoneDurs, method="constraint")

    writePath = "/home/ri258/Documents/MLSALT10/generated_trajectory_GlobalVarianceConstraint.csv"
    print("Writing to {0}".format(writePath))
    np.savetxt(writePath, traj, delimiter=",")


if __name__ == "__main__":
    main()

Listing 3: Global Variance Model script
