

Towards conversational speech synthesis - Experiments with data quality, prosody modification, and non-verbal signals

BAJIBABU BOLLEPALLI

Licentiate Thesis
Stockholm, Sweden 2017


TRITA-CSC-A-2017:04
ISSN 1653-5723
ISRN KTH/CSC/A-17/04-SE
ISBN 978-91-7729-235-7

KTH School of Computer Science and Communication
SE-100 44 Stockholm

SWEDEN

Academic thesis which, with the permission of KTH Royal Institute of Technology (Kungl Tekniska högskolan), is presented for public examination for the degree of Licentiate of Technology on Thursday 19 January 2017 at 15:00 in Fantum, KTH Speech, Music and Hearing (Tal, musik och hörsel), Lindstedtsvägen 24, Stockholm.

© Bajibabu Bollepalli, January 2017

Printed by: Universitetsservice US AB



Abstract

The aim of a text-to-speech synthesis (TTS) system is to generate a human-like speech waveform from a given input text. Current TTS systems have already reached a high degree of intelligibility, and they can be readily used to read aloud a given text. For many applications, e.g. public address systems, a reading style is enough to convey the message to the people. However, more recent applications, such as human-machine interaction and speech-to-speech translation, call for TTS systems to be increasingly human-like in their conversational style. The goal of this thesis is to address a few issues involved in building a conversational speech synthesis system.

First, we discuss issues involved in data collection for conversational speech synthesis. It is important that the data are of good quality and at the same time contain conversational characteristics. In this direction we studied two methods: 1) harvesting the world wide web (WWW) for conversational speech corpora, and 2) imitation of natural conversations by professional actors. In the former method, we studied the effect of compression on the performance of TTS systems. Speech data available on the WWW are often stored in compressed form, mostly using standard compression techniques such as MPEG. Thus, in papers 1 and 2, we systematically studied the effect of MPEG compression on TTS systems. The results showed that the synthesis quality is indeed affected by compression; however, the perceptual differences become strongly significant only when the bit-rate falls below 32 kbit/s. Even when natural conversational speech can be collected, it is not always suitable for training a TTS system due to problems involved in its production. Thus, in the latter method, we asked whether conversational speech can be imitated by professional actors in a recording studio. In this direction we studied the speech characteristics of acted and read speech. Second, we asked whether a technique from the field of voice conversion can be borrowed to convert read speech into conversational speech. In paper 3, we proposed a method to transform pitch contours using artificial neural networks. The results indicated that neural networks are able to transform pitch values better than the traditional linear approach. Finally, we presented a study on laughter synthesis, since non-verbal sounds, particularly laughter, play a prominent role in human communication. In paper 4 we present an experimental comparison of state-of-the-art vocoders for the application of HMM-based laughter synthesis.

Keywords: MPEG compression, Voice Conversion, Artificial Neural Networks, Laughter synthesis, HTS


Contents

Abstract
Contents
List of publications
Author's contribution
1 Introduction
    1.1 Intro to Text-to-speech synthesis
    1.2 Motivation
    1.3 Progress in TTS systems
    1.4 HMM-based speech synthesis (HTS)
    1.5 Characteristics of conversational speech
2 Background
3 Research questions
4 Preliminary study: analysis of read and acted speech
5 Summary of publications
6 Conclusions and future work
Bibliography
Publication I
Publication II
Publication III
Publication IV


Acknowledgements

I am extremely grateful for an amazing set of family, friends, colleagues and advisers, without whom my work would not have been possible.

My sincere thanks to my supervisors Joakim Gustafsson and Jonas Beskow for their support and guidance at every stage of my research work. They also provided me with a financially secure position, which has given the required amount of confidence and continuity to this work. Especially Jonas: without his encouragement this thesis would not have been finished.

My thanks to Martin, Raveesh, Samer, Saeed, Catha, Kalin, Simon, Niklas, Jose, Jana and other colleagues at TMH for making my stay at KTH more fun and memorable.

My thanks to Olov and David for their corrections, comments and feedback on the thesis. All remaining errors and mistakes are solely due to me.

My thanks to Paavo Alku and Tuomo Raitio for their guidance and support during my stay at Aalto University, and also to Junichi Yamagishi for his guidance and support during my stay at NII, Tokyo.

I am especially grateful to my family for giving me the opportunity to follow my dreams and the love to make them a reality. My sincere gratitude to my fiancée Manvisha Kodali for her unconditional love and support in all matters.



List of Publications

This thesis consists of an introduction and of the following publications, which are referred to in the text by their Roman numerals.

I Bajibabu Bollepalli, Tuomo Raitio, and Paavo Alku. “Effect of MPEG audio compression on HMM-based speech synthesis”. In Proceedings of INTERSPEECH, Lyon, France, pp. 1062–1066, August 2013.

II Bajibabu Bollepalli and Tuomo Raitio. “Effect of MPEG audio compression on vocoders used in statistical parametric speech synthesis”. In Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, pp. 1237–1241, September 2014.

III Bajibabu Bollepalli, Jonas Beskow, and Joakim Gustafson. “Non-linear pitch modification in voice conversion using artificial neural networks”. In Proceedings of the ISCA Workshop on Non-Linear Speech Processing (NOLISP), Mons, Belgium, pp. 97–103, June 2013.

IV Bajibabu Bollepalli, Jérôme Urbain, Tuomo Raitio, Joakim Gustafson and Hüseyin Çakmak. “A comparative evaluation of vocoding techniques for HMM-based laughter synthesis”. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, pp. 254–259, May 2014.



Author’s contribution

Publication I: “Effect of MPEG audio compression on HMM-based speech synthesis”

Publication II: “Effect of MPEG audio compression on vocoders used in statistical parametric speech synthesis”

The main objective of papers I and II was motivated by the Blizzard Challenge 2012, where the speech data were audio books from LibriVox. These audio books were compressed using MP3 technology. This work started when the author visited Aalto University for one month in 2013. The idea emerged from discussions with the co-authors. The author designed and implemented the objective evaluations and primarily wrote both of the papers, except for the description of the GlottHMM vocoder. The co-author (Tuomo) synthesized the Finnish voices using HTS and helped with the report writing.

Publication III: “Non-linear pitch modification in voice conversion using artificial neural networks”

The author developed and implemented the proposed method, ran all the objective as well as subjective tests, analyzed the results and primarily wrote the article. The co-authors proofread the paper.

Publication IV: “A comparative evaluation of vocoding techniques for HMM-based laughter synthesis”

The main idea was proposed by the author. Jérôme Urbain and Hüseyin Çakmak provided the data and a basic HTS system for laughter synthesis using the DSM vocoder. Tuomo Raitio was responsible for writing/formatting the paper. The author conducted the experiments, analyzed the results and was also responsible for the report writing. Joakim proofread the paper.



1 Introduction

1.1 Intro to Text-to-speech synthesis

The aim of a text-to-speech synthesis (TTS) system is to generate a human-like speech waveform from a given input text. Typical TTS systems comprise two main components. The first one is text analysis, which converts text into a phonetic or some other linguistic representation. The second one is speech generation, which generates a speech waveform based on the linguistic representation from the text analysis. This thesis primarily focuses on the speech generation component.
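To make the two-component view concrete, the sketch below shows the interface of such a pipeline in Python. Both functions are placeholders invented for the illustration; real text analysis and waveform generation are of course far more involved.

```python
def text_analysis(text: str) -> list[str]:
    """Placeholder front end: map text to a phoneme-like symbol sequence."""
    return [ch for ch in text.lower() if ch.isalpha()]

def speech_generation(symbols: list[str]) -> list[float]:
    """Placeholder back end: map the linguistic representation to waveform samples."""
    return [0.0] * (len(symbols) * 160)   # e.g. 160 samples (10 ms at 16 kHz) per symbol

waveform = speech_generation(text_analysis("Hello world"))
```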

Applications

The most important and useful TTS applications are for individuals with a wide range of disabilities [1, 2], for example screen readers for blind and dyslexic users, and personalized voices for individuals who are deaf, vocally handicapped or have motor neurone disease [3]. Stephen Hawking, an English theoretical physicist and cosmologist, is probably one of the most famous people using a TTS system for communication purposes. More recent applications include GPS navigation, e-book readers, speech-to-speech translation, singing synthesizers and all sorts of human-machine interaction systems such as embodied conversational agents with believable virtual characters [4].

1.2 Motivation

Speech produced by humans is heterogeneous and varies greatly based on the environment, the physical and emotional state of the speaker, and the state of the interlocutor. For example, speech produced by an individual in a quiet environment is very different from the speech produced in a noisy environment. People tend to speak a bit louder (increased vocal effort) and elongate the duration of vowels in a noisy compared to a quiet environment (also known as Lombard speech [5]); the same phenomenon can be observed when the distance between talker and listener is increased. Similarly, one can notice an increase in the fundamental frequency of speech when people are under emotional stress. Also, speech produced during spontaneous conversation varies greatly in vocabulary and complexity of syntax when compared to speech produced by simply reading a text aloud.


An ideal speech synthesis system should be able to imitate all these variations based on the environment. However, current TTS systems are effective only at reading out a given text. Their intelligibility has reached human level [6], but achieving human-level naturalness in such systems is still an open challenge. Instead of incorporating all the variations at once into the TTS system, in this work we focus only on the integration of conversational speech phenomena.

The main focus of a synthesizer's read mode is to transmit the propositional content of a message, in which the synthesizer speaks and a human listens, with almost no interaction between the two. In contrast, our daily communication occurs by talking to each other, where both speakers and listeners take active roles in the interaction, instead of reading text to each other. Moreover, today's spoken dialogue systems have potential for use in areas and applications far beyond directory assistance and travel bookings, for example social and collaborative applications in entertainment and education, which need systems that behave in a human-like way in conversation.

1.3 Progress in TTS systems

Research on speech synthesis dates back to the 1700s, when researchers employed reeds and bellows to synthesize vowels. However, the most notable invention, the VODER, which produced speech with an electrical device, was introduced by Homer Dudley at the 1939 World's Fair in New York [7]. After that, several techniques were proposed by researchers with the help of advances in computer science. Broadly, all these techniques can be grouped into two paradigms. The first one is rule-based speech synthesis, largely dependent on expert guesses (rules) for speech generation, e.g. formant synthesis [8] and articulatory synthesis [9]. The second one is data-driven speech synthesis, which requires a large amount of speech corpora and sophisticated mathematical models for the generation of speech. With the developments in memory storage and computational power of machines, research in speech synthesis has progressed from the rule-based approach to the data-driven one. The data-driven approaches can be divided into two methods: 1) concatenative speech synthesis (CS) and 2) statistical parametric speech synthesis (SPSS). Techniques such as diphone [10] and unit-selection [11] synthesis fall under the umbrella of CS, where speech is produced by stitching together segments (various units) of recorded speech, whereas techniques such as CLUSTERGEN [12] and HMM-based speech synthesis (HTS) [13] belong to the field of SPSS, where speech is produced from statistical models. The quality of speech synthesized using CS is directly proportional to the quality and size of the speech corpus used for building the voice. However, it requires much time and expertise to collect such large databases incorporating a variety of styles. Thus, it is a tedious task to build voices for everyone with different speaking styles. Fortunately, SPSS, particularly HTS, has been shown to have many benefits in comparison to CS.


Techniques such as speaker adaptation [14] and speaking style adaptation [15], which need less data in order to synthesize different speaking styles, were designed to alleviate the need for a large database. The flexibility of HMMs has convinced us to use them as the tool for investigating the research questions stated in section 3.

1.4 HMM-based speech synthesis (HTS)

In this method, several acoustic parameters of speech are modeled using a generative model called the hidden Markov model (HMM). Largely, the acoustic parameters are extracted based on the source-filter theory of speech production. In this theory, speech is defined as the output of a time-varying vocal tract system excited by a time-varying excitation signal [16]. The time-varying vocal tract can be relatively well approximated by a digital filter, and the time-varying excitation signal can be approximated by an impulse train. Generally, spectral features such as mel-cepstral coefficients (MCEPs) [17], line spectral pairs (LSPs) [18], or mel-generalized cepstral coefficients (MGCEPs) [19] can be used to represent the shape of the vocal tract, i.e. the filter. Similarly, features such as the fundamental frequency (F0) and aperiodicities [20] can be used to characterize the excitation signal. Vocoders such as MLSA [21], STRAIGHT [22], DSM [23] and GlottHMM [24] provide an efficient source and filter representation of a speech signal. Often, these vocoders are used to represent the source-filter information in HTS. Publications II and IV present a detailed description of the state-of-the-art vocoders used in HTS.
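As a rough illustration of the source-filter view described above, the following Python sketch passes an impulse-train excitation through a stable all-pole "vocal tract" filter. The formant frequencies, pole radii and F0 are made-up values chosen only for the example, not parameters taken from any of the vocoders named above.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
# Illustrative two-formant "vocal tract": a stable all-pole filter with
# resonances near 500 Hz and 1500 Hz (values chosen only for this sketch).
poles = []
for freq, radius in [(500.0, 0.97), (1500.0, 0.95)]:
    w = 2 * np.pi * freq / fs
    poles += [radius * np.exp(1j * w), radius * np.exp(-1j * w)]
a = np.real(np.poly(poles))              # denominator coefficients of 1/A(z)

# Impulse-train excitation at a constant F0 of 120 Hz: one second of voiced source.
f0 = 120.0
excitation = np.zeros(fs)
excitation[::int(fs / f0)] = 1.0

# The source passed through the filter approximates a (very crude) voiced sound.
speech = lfilter([1.0], a, excitation)
```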

Figure 1.1 shows a block diagram of a basic HTS system, illustrating both the training and synthesis parts. In HTS, parameters such as the spectrum, fundamental frequency, and duration of phonemes are modeled in a unified framework [25]. These parameters are modeled and generated using HMMs based on the maximum likelihood criterion. In the training part, the maximum likelihood criterion is used to estimate the model parameters:

λ_max = arg max_λ p(O | λ, W)        (1.1)

where λ denotes the set of model parameters, and O and W denote the set of speech parameters and the corresponding linguistic specifications (such as phoneme labels) of the training data, respectively. The following contextual information is considered in training to build context-dependent HMMs (a small sketch of how such factors can be flattened into a full-context label follows the list):

• phoneme:

– {preceding, current, succeeding} phoneme
– position of current phoneme in current syllable

• syllable:

– number of phonemes in {preceding, current, succeeding} syllable
– accent and stress of {preceding, current, succeeding} syllable
– position of current syllable in current word
– number of {preceding, succeeding} accented and stressed syllables in current phrase
– number of syllables {from previous, to next} accented and stressed syllable
– vowel within current syllable

• word:

– guess at part of speech of {preceding, current, succeeding} word
– number of syllables in {preceding, current, succeeding} word
– position of current word in current phrase
– number of {preceding, succeeding} content words in current phrase
– number of words {from previous, to next} content word

• phrase:

– number of syllables in {preceding, current, succeeding} phrase
– position in major phrase
– ToBI endtone of current phrase

• utterance:

– number of syllables in current utterance
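The sketch below illustrates, under heavy simplification, how a handful of such contextual factors can be flattened into one full-context label string per phoneme. The field layout here is invented for the example; real HTS labels follow a much longer fixed format with dozens of fields.

```python
def make_full_context_label(prev_ph, cur_ph, next_ph,
                            pos_in_syllable, n_syllables_in_word, pos_word_in_phrase):
    """Hypothetical, simplified full-context label for one phoneme."""
    return (f"{prev_ph}-{cur_ph}+{next_ph}"
            f"/A:{pos_in_syllable}"
            f"/B:{n_syllables_in_word}"
            f"/C:{pos_word_in_phrase}")

# One label per phoneme; during training these strings drive the decision-tree
# questions used to cluster the context-dependent HMM states.
print(make_full_context_label("sil", "h", "e", 1, 2, 1))
```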

At synthesis time, the input text is transformed into a linguistic representation w, and then the most probable speech parameter vector sequence is generated using the speech parameter generation algorithm (equation 1.2).

O_max = arg max_O p(O | λ_max, w)        (1.2)

Finally, the predicted speech parameters are rendered into a speech waveform using the same vocoder that was used for parameter extraction. Studies [26] and [13] further discuss the applications, advantages and drawbacks of HTS. Demonstration scripts for HTS are publicly available [27].

In [28], we developed an HTS system for the Swedish language. In the literature, only one study [29] had applied the older version HTS-1.1.1 to the Swedish language. However, many improvements have since been made in both acoustic modeling and vocoding techniques. In that work, we used the latest HTS-2.2 with the STRAIGHT vocoder for Swedish speech synthesis. A total of 1000 Swedish sentences, with an average duration of 3 seconds, were employed for training. The contextual labels were generated using RULSYS, which was developed at KTH and is well established (Carlson et al., 1982).


Figure 1.1: Overview of the HMM-based speech synthesis system (taken from [13]).

Table 1.1: MOS-based evaluation of the HMM-based speech synthesis system

System   Naturalness   Intelligibility
HTS      3.8           4.2

Five-state left-to-right HMMs without skip paths were used for training. Each state has a single Gaussian probability distribution function (pdf) with a diagonal covariance matrix as the state output pdf, and a single Gaussian pdf with a scalar variance as the state duration pdf.
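A minimal sketch of the model topology just described: a five-state left-to-right transition matrix without skip paths, and one diagonal-covariance Gaussian per state. The self-loop probability and feature dimension are placeholders; in HTS these values are re-estimated from data.

```python
import numpy as np

n_states, stay = 5, 0.6          # self-loop probability is a placeholder
A = np.zeros((n_states, n_states))
for i in range(n_states - 1):
    A[i, i] = stay               # stay in the current state
    A[i, i + 1] = 1.0 - stay     # or move one state forward (no skips)
A[-1, -1] = 1.0                  # final emitting state

dim = 40                         # e.g. spectral + excitation features per frame
state_means = np.zeros((n_states, dim))
state_vars = np.ones((n_states, dim))   # diagonal covariance stored as a vector
```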

A perceptual evaluation was conducted to assess the performance of the system. The evaluation is based on the mean opinion score (MOS) concerning naturalness and intelligibility on a scale of one to five, where one stands for “bad” and five stands for “excellent”. A total of 5 sentences were synthesized by the system. A group of 10 listeners, comprising both speech and non-speech experts, were asked to express their opinion of each sentence on the MOS scale. The results of the test are shown in Table 1.1. The results indicate that the Swedish speech synthesized by the basic HTS system was perceived as natural and intelligible.
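For reference, the means reported in Table 1.1 (and the confidence intervals reported for the listening tests later in the thesis) can be computed as in the sketch below; the ratings used here are dummy values.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Mean opinion score and a t-distribution confidence interval."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    half = stats.sem(ratings) * stats.t.ppf((1 + confidence) / 2, len(ratings) - 1)
    return mean, (mean - half, mean + half)

# e.g. naturalness ratings from 10 listeners on a 1-5 scale (dummy data)
print(mos_with_ci([4, 4, 3, 5, 4, 3, 4, 4, 5, 3]))
```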


1.5 Characteristics of conversational speech

Conversational speech is a speech mode often adopted by a talker when he or she converses with an interlocutor. Studies [30, 31, 32] showed that the intelligibility of conversational speech is significantly lower than that of clear speech. Clear speech is often produced by talkers when they are speaking in challenging communication situations, e.g. when talking to a hearing-impaired person or in a reverberant environment. Study [33] discussed how the interlocutor's familiarity with the talker offers a significant advantage in understanding the message. Conversational speech plays an important role in human speech communication by providing cues such as the linguistic context of the message, nonverbal signals, and the conversational milieu [34]. Usually these conversational cues can be grouped into the following [35]:

• Paralinguistic cues: falsetto, whisper, creak, laughter, giggle, cry, sob.

• Disfluency patterns: words such as and, oh, so, well, okay, etc., repetitions, and filled pauses such as uh and um.

• Reflexes: throat clearing, sniff/gulp, clucking of the tongue, lip smacking, and breathing in/out.

These cues are of utmost importance for the interactional aspects of conversation, and without them, we are left strangely lacking in interactional skills. Being uncommon in read speech, these are systematically removed in speech synthesis regardless of the synthesis method, the rationale being that they do not carry propositional content. However, Clark [36] argued that most spoken disfluencies are not problems in speaking, but solutions to problems in speaking. Given this complexity, it is no wonder that we have yet to develop machines that speak like humans in different situations.

Comparison between conversational and read speech

Conversational speech sounds much more natural and less “flat” than read speech, which sounds more rehearsed and indeed “flat” (because people already know what they want to say). Humans can distinguish between these two modes of speech by listening to them [37]. Acoustically, conversational speech differs from read speech in many aspects (e.g. prosody) because of the differences in manner of speaking [38]. Characteristics such as an overall increase in speaking rate [39], more variation in the duration of words and syllables based on their position [40, 41], spectral space reductions [38], more reduced pronunciations [42], and more deviation in phoneme pronunciations [43] are associated with speech produced spontaneously when compared to speech produced by reading aloud. Moreover, the dynamics of acoustic features in conversational speech are higher than in read speech. Linguistically, conversational speech often introduces disfluencies such as filled pauses, restarts, repeats and repairs.


Even if the disfluencies are removed, differences exist in the distribution of functional words (e.g. yeah, okay, know, etc.). More often, the functional words in conversational speech play a role in regulating the conversational flow, which may be the reason for the distributional differences [39]. Furthermore, conversational speech contains no punctuation marks, as opposed to read speech, which implicitly contains punctuation marks from the text [44].


2 Background

Over the last decade, researchers have made several attempts to include conversational phenomena in synthesized speech. Most of these used concatenative speech synthesis as their TTS system [45, 35, 46, 47, 48]; recently HTS has also been used for this task [39, 49].

In [35], VoiceFonts [50] such as laughter, breaths, lip smacking and filled pauses were incorporated into a limited domain synthesizer (LDS). They used a limited domain corpus, i.e. lecture monologues, to train the LDS, and the VoiceFonts were tagged using unique words. At synthesis time they inserted these unique words as part of the input text to generate the corresponding VoiceFonts. The results showed that the LDS with VoiceFonts was three times as likely to be confused with natural speech as the LDS without VoiceFonts. Similarly, [46] inserted paralinguistic elements such as laughter and hesitations using acted prompts, which contain all the transitions between speech and paralinguistic elements (called anticipation phases), in the French language. In [47, 51], a very large conversational speech corpus (about 1500 hours) was recorded from normal everyday conversations. They built a concatenative speech synthesizer for a female speaker using part of that large corpus (approx. 600 hours), which also generates non-verbal (laughter) sounds. However, these earlier works did not discuss how and when to synthesize conversational speech phenomena in unknown contexts. In [52, 53, 45], disfluency synthesis was demonstrated by modifying the prosody (duration and pitch), and the locations at which to insert disfluencies in the text were also predicted. There are many phenomena that could be categorized as belonging to the LNRE class (Large Numbers of Rare Events) and that for practical reasons cannot be added to concatenative speech databases [54]. This can be solved by HTS, as opposed to the concatenative synthesis method, as it is possible to generalize from phenomena found in speech corpora and generate them in new places.

Current data-driven speech synthesis systems, including HTS, are designed such that they benefit from using read speech, which is often recorded by a professional voice actor with clear articulation in a quiet studio environment. Spontaneous speech, in contrast to read speech, contains sloppy pronunciation and much more variation in both prosodic and segmental-level features. Thus, one cannot directly use read speech for the synthesis of conversational speech. In [48, 39], two methods of synthesizing conversational characteristics using HTS were studied.


The first method used traditional HTS without any modification, and the voice was trained on carefully selected spontaneous speech utterances, excluding utterances with mispronunciations, word fragments, heavily reduced pronunciations and paralinguistics. The results showed that the voice built with spontaneous speech was perceived as more natural than the voice built with read-aloud speech. In the second method, they used a blending technique [55] in which both the read and the spontaneous speech were pooled to train the voice using HTS. To distinguish between read and conversational speech, they used extra contextual information in the full-context labels of HTS: the speaking style (spontaneous or read). The results showed that the resulting voice lay somewhere between the conversational and read styles. In [49], a problem in duration modeling was shown for the synthesis of filled pauses using an HTS system trained on read speech. Thus, either we have to redesign our present HTS systems such that they are able to model conversational speech in an efficient manner, or we have to come up with a smart strategy for acquiring spontaneous conversational speech under high-quality conditions suitable for an HTS system. Section 4 discusses the latter, i.e. how to collect high-quality conversational speech in studios while maintaining the conversational phenomena.


3 Research questions

Many aspects of speech synthesis, from subtle voice quality changes to prosodic melody, are in dire need of development to create synthesis that can be used for human-like conversations. Attempting to improve everything at once, however, would likely fail. Instead, this work primarily focused on a few critical dimensions:

1. How to obtain conversational speech data for speech synthesis?
Since conventional TTS systems follow the data-driven approach, data play a crucial role. Often, a professional voice actor reads the data with clear articulation in a quiet studio environment, and the styles of synthesized speech are confined to the styles of the training data [55, 56]. However, if our goal is to synthesize speech in a conversational manner, then it is necessary to have conversational speech styles in the training data. This raises the question of how to obtain conversational speech phenomena in high-quality recordings that can be used for the purpose of speech synthesis. In this context, we experimented with two methods:

• Imitate natural conversations (acted speech)
Previous research showed that various aspects inherent to natural speech, such as sloppy pronunciation, a high speaking rate and disfluencies, are hard to model/train in current TTS systems. Thus, in this method we propose a novel approach – imitating natural conversations (acted speech) – to collect high-quality conversational speech that can be used in speech synthesis training. However, one can question the intelligibility and naturalness of acted speech. Section 4 shows that the intelligibility of acted speech is greater than that of natural conversational speech and less than or equal to that of read speech, whereas for naturalness, in the sense of conversational manner, the order is reversed.

• Harvesting the world wide web (WWW) for conversational speech corpora
Due to the growth of the Internet, one can find a lot of spontaneous speech data for free, e.g. on YouTube, podcasts, and audio books. However, issues such as background noise, mixed-in music and various recording settings degrade the quality of the speech signal. Most of the time, these audio files are compressed before being uploaded to the Internet because of bandwidth limits.


In this work, we focused on the effect of audio compression on the quality of a TTS system. This is addressed in Publications I and II of this thesis. It is shown that the level of audio compression can limit the quality of a TTS system.

2. What if we don't have conversational speech data to train a TTS system?
It is not always possible to obtain conversational speech suited for speech synthesis applications. We can view this problem as being similar to voice conversion, where the goal is to convert speech from a given source speaker to a given target speaker, and we want to use the same techniques here. In the literature, many studies have addressed voice transformation. However, most of them focused on the conversion of segmental features. To transform read speech into conversational speech, it is equally important to convert both segmental and suprasegmental features (e.g. duration, pitch and energy). Publication III discusses the transformation of pitch contours using neural networks.

3. Conventional TTS systems are not designed to model conversational/social phenomena
To bring expressiveness into speech synthesis systems, concentrating on improving the verbal signals alone would give no recognition to the non-verbal signals, which play an important role in expressing emotions and moods in human communication [57]. Laughter is one such non-verbal signal, playing a key role in our daily conversations. In the literature, there has been only one study reported on HMM-based laughter synthesis, using a single vocoder [58]. Thus, there is a need for further analysis, particularly of the vocoders, as they control the quality of the synthesized speech. Publication IV analyzes four state-of-the-art vocoders commonly employed in statistical parametric speech synthesis for the application of acoustic laughter synthesis.


4 Preliminary study: analysis of read and acted speech

This preliminary study¹ presents an analysis of two speaking styles, 1) read speech and 2) acted speech, for the application of speech synthesis. In this context, read speech is produced by reading a given text aloud. Acted speech is produced by reading the same text in an expressive style that suits the given text context. One can relate acted speech to the speech used in our daily conversations, since it conveys both the message and the emotional state of the speaker. It is emphasized that acted speech is different from emotional speech, where subjects are forced to act out a particular emotion which is very often irrelevant to the given context. Although acted speech is not the same as natural conversational speech, where humans communicate with each other without any text, one can place acted speech somewhere between read speech and natural conversational speech. Two hypotheses were proposed in this preliminary study:

• Intelligibility: Natural conversational speech < Acted speech ≤ Read speech

• Naturalness: Read speech < Acted speech ≤ Natural conversational speech

Analysis of acted speech and read speech

Database

For this study we used Japanese MULTEXT [59], a prosodic corpus with a total duration of 3 hours and 37 minutes, recorded by 6 speakers (3 male, 3 female). The text corpus used for the recordings was translated from passages in five different European languages (English, French, German, Italian and Spanish) which were part of the MULTEXT prosodic corpus distributed by the European Language Resources Association (ELRA). The total text consists of 40 different passages representing various topics such as a telephone complaint, an urgent report, an apology, a boast, an occasional thought, a novel, etc. The speakers participating in the recordings were professional narrators and actors.

¹ This work was done under the guidance of Dr. Junichi Yamagishi, NII, Tokyo, where the author did a research internship for six months.


Parameters        Acted speech   Read speech
F0_max (Hz)       371            187
F0_min (Hz)       84             59
F0_mean (Hz)      242            120
F0_std (Hz)       55             25
En_mean_v (dB)    19             18
En_std_v (dB)     0.7            0.7
En_mean_uv (dB)   -18            -15
En_std_uv (dB)    1.6            1.9

Table 4.1: Analysis between acted and read speech for male speakers.

Each sentence was recorded in two different speaking styles by each speaker: 1) reading style, where the speakers were instructed to read each text aloud, and 2) acting style, where the speakers pretended to be taking part as a participant in each of the situations. The performance of each speaker in the acting style depended on the context of each text, not on his or her own emotional state. Perceptual tests showed that the acted speech conveyed paralinguistic cues such as sarcasm, dissatisfaction, and a sincere apology, when compared to read speech [59].

Analysis

To investigate the intelligibility of acted speech, we first asked which differences in the acoustic parameters of conversational speech are responsible for its lower intelligibility compared to clear speech. In [60], the intelligibility differences between natural conversational speech and clear speech were related to the acoustic differences between them. One cannot attribute the intelligibility of acted speech to a single acoustic parameter difference; rather, it is a combination of multiple differences such as higher intensity, lower speaking rate, an increased number of pauses, a higher fundamental frequency (F0) with a larger range, and higher spectrum values at high frequencies, etc. In the same way, we compared the acoustic parameters of the acted and read speech styles. Unfortunately, due to time and economic reasons, we did not have natural conversations available for the analysis. The following parameters were compared between the acted and read speech styles: fundamental frequency (F0), intensity (En), and long-term average spectrum (LTAS) values. The F0 values were provided along with the corpus, whereas the En and LTAS values were extracted frame-wise with a 25 ms window length and a 10 ms window shift.
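A sketch of how the frame-wise intensity and the LTAS can be computed with the 25 ms / 10 ms framing mentioned above is given below; it is a plain NumPy illustration, not the exact tooling used in the study.

```python
import numpy as np

def frame_energy_db(x, fs, win_ms=25, shift_ms=10):
    """Frame-wise intensity in dB (25 ms window, 10 ms shift)."""
    win, shift = int(fs * win_ms / 1000), int(fs * shift_ms / 1000)
    frames = [x[i:i + win] for i in range(0, len(x) - win + 1, shift)]
    return np.array([10 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames])

def ltas_db(x, fs, win_ms=25, shift_ms=10):
    """Long-term average spectrum: magnitude spectrum averaged over all frames."""
    win, shift = int(fs * win_ms / 1000), int(fs * shift_ms / 1000)
    window = np.hanning(win)
    spectra = [np.abs(np.fft.rfft(x[i:i + win] * window))
               for i in range(0, len(x) - win + 1, shift)]
    freqs = np.fft.rfftfreq(win, 1.0 / fs)
    return freqs, 20 * np.log10(np.mean(spectra, axis=0) + 1e-12)
```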

Table 4.1 and Table 4.2 show the statistics of the F0 and En values, where F0_max, F0_min, F0_mean and F0_std denote the maximum, minimum, mean, and standard deviation of the F0 values in hertz, En_mean_v and En_std_v represent the mean and standard deviation of the intensity for voiced sounds in dB, and En_mean_uv and En_std_uv represent the mean and standard deviation of the intensity for unvoiced sounds in dB, respectively.


Parameters        Acted speech   Read speech
F0_max (Hz)       422            247
F0_min (Hz)       115            65
F0_mean (Hz)      258            145
F0_std (Hz)       60             34
En_mean_v (dB)    22             21
En_std_v (dB)     1              0.7
En_mean_uv (dB)   -18            -16
En_std_uv (dB)    1.8            1.9

Table 4.2: Analysis between acted and read speech for female speakers.


Figure 4.1: LTAS values of read and acted speech of a male (left) and a female (right) speaker.

These parameters were extracted for each speaker, and then the average over all male and all female speakers was calculated separately. We can observe that the mean and range of the F0 values of acted speech were higher than those of read speech for both male and female speakers; similarly, acted speech has higher intensity values when compared to read speech. Figure 4.1 shows the average LTAS values for both male and female speakers. The LTAS values ignore the short-time variations caused by different phones and retain the long-term behavior of the speech signals. The figure illustrates that acted speech has higher values at higher frequencies when compared to read speech. From this analysis, perhaps, one can relate acted speech to clear speech, which shows a similar trend in the literature. This may shed light on the conversational speech data collection problem, i.e. one can benefit from using professional actors to imitate natural conversations. It should be noted that this is work in progress; as of now we have only a few results. Much more analysis remains to be done to test our hypotheses, which we will consider in future work.


Discussion

Previous studies have shown how the acoustic parameters of clear speech behave when compared to normal and conversational speech. Our analysis of the acoustic parameters of acted and read speech showed that the parameters of acted speech behave similarly to those of clear speech. Thus, perhaps, one can interpret this as the intelligibility of acted speech being equal to or higher than that of read speech. Since this is still work in progress, we postpone further analysis to future work. We have also trained HTS voices using the standard HTS procedure [61]. Informal perceptual tests showed that the naturalness of the voice built on acted speech is better than that of the voice built on read speech. In future work we want to do a more extensive analysis to test our proposed hypotheses, and we also want to compare the naturalness and intelligibility of both speaking styles before and after synthesis in formal listening tests.


5 Summary of publications

This section summarises the publications in this thesis.

Publication I: “Effect of MPEG audio compression on HMM-based speech synthesis”

In this conference paper, we studied the effects of speech compression on vocoding and HTS. In [62] and [63], it was shown that the quality of acoustic parameters is affected by speech compression. However, to the best of our knowledge, no studies have specifically addressed how compression of speech affects vocoding and statistical parametric speech synthesis [26]. In this paper, we used MPEG-1 Audio Layer III (MP3) [64], the most commonly used audio compression technique, to compress speech signals at various bit-rates. All the analysis and synthesis steps were performed using the GlottHMM vocoder [24]. Two databases designed for TTS development were used in the experiments. The first corpus consists of 599 sentences read by a Finnish male (labeled MV), and the second one consists of 513 sentences read by a Finnish female (labeled HK). All audio files were PCM encoded and sampled at 16 kHz with a resolution of 16 bits, resulting in a data rate of 256 kbit/s. The effect of compression (bit-rates shown in Table 5.1) was evaluated by comparing the vocoder parameters extracted from the MP3-processed sounds to those obtained from the corresponding uncompressed sentences.

Table 5.1: Bit-rates and corresponding theoretical and realized compression ratios with respect to 256 kbit/s 16 kHz PCM speech.

Bit-rate (kbit/s)   Compression ratio w.r.t. bit-rate   Compression ratio w.r.t. file size
160                 1.6                                 1.56
128                 2                                   1.92
64                  4                                   3.13
32                  8                                   6.25
24                  10.67                               8.33
16                  16                                  12.50
8                   32                                  25.00
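The objective comparison described above boils down to computing a relative error between a parameter track extracted from the PCM original and the same track extracted from the MP3-processed version. A simplified version of such a measure is sketched below; it is an illustration, not the exact error measure used in the paper.

```python
import numpy as np

def relative_error_percent(ref_track, test_track):
    """Mean relative error (in %) between parameter tracks from PCM and MP3 speech."""
    ref = np.asarray(ref_track, dtype=float)
    test = np.asarray(test_track, dtype=float)
    n = min(len(ref), len(test))          # the tracks may differ by a frame or two
    return 100.0 * np.mean(np.abs(ref[:n] - test[:n]) / (np.abs(ref[:n]) + 1e-12))

# e.g. F0 contours extracted from the uncompressed and a compressed version (dummy data)
print(relative_error_percent([120.0, 122.0, 125.0], [119.0, 123.0, 124.0]))
```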



Figure 5.1: MOS scores for naturalness, where e.8, e.16, e.32, and e.128 refer to encoded speech with bit-rates of 8, 16, 32, and 128 kbit/s, respectively. Vocoded speech is denoted by v.8, v.32, and v.128, corresponding to bit-rates of 8, 32, and 128 kbit/s, respectively.

The objective evaluations of all parameters show a significant increase in error between the bit-rates of 64 and 32 kbit/s. In conclusion, if the bit-rate is 32 kbit/s or less, the compressed acoustic signal is significantly different from the original speech signal, which has a clear effect on the vocoder. A subjective evaluation was performed to study how the quality of vocoded signals is affected by compression. Figure 5.1 shows the means and 95% confidence intervals of the naturalness ratings. The results show that the subjects rated the 128 kbit/s and 32 kbit/s compressed signals as completely natural, whereas the speech sounds compressed at 16 kbit/s or lower show a clear drop in naturalness. An interesting observation is that the gap between the compressed signals and the vocoded signals is reduced along with the decreasing bit-rate.

HMM-based synthetic voices were built with the GlottHMM vocoder. The standard HTS procedure [61, 27] was used for training the voices, with the modifications needed to accommodate the increased number of parameters of the vocoder [24]. Subjective evaluations were conducted to assess the performance of the HMM-based voices. A total of 10 Finnish listeners participated in the MOS naturalness evaluation. Figure 5.2 shows the means and 95% confidence intervals of the MOS naturalness ratings. The difference between the naturalness of the 128 kbit/s and 32 kbit/s voices is not significant, suggesting that the compression degradations in the 32 kbit/s signal do not affect the training of an HMM-based synthetic voice. The difference between the 32 kbit/s and 16 kbit/s voices is greater, indicating degraded quality.

Publication II: “Effect of MPEG audio compression on vocoders used in statistical parametric speech synthesis”



Figure 5.2: Naturalness ratings for HMM-based synthetic voices trained with the following bit-rates: 8, 16, 32, and 128 kbit/s.

This conference paper extends the work presented in Publication I by introducing the STRAIGHT vocoder for both analysis and synthesis purposes, in addition to more detailed subjective evaluations. We used the same experimental setup as in the previous paper. In the objective analysis, three types of parameters were analyzed: 1) F0, 2) HNR, and 3) LSF for the GlottHMM vocoder, and 1) F0, 2) AP, and 3) MCEP for the STRAIGHT vocoder. The relative error of F0 is 1–5 % for GlottHMM and 1–3 % for STRAIGHT, but the differences between bit-rates were not statistically significant across the two voices. However, GlottHMM seems to be slightly more affected by the compression than STRAIGHT at low bit-rates. For both vocoders, the error of HNR/AP is rather small at high bit-rates such as 160 kbit/s and 128 kbit/s, with only a small increase in error at 64 kbit/s. With bit-rates of 32 kbit/s and lower, however, the error increases substantially. The relative error of LSF for GlottHMM shows a similar effect: high bit-rates (64 kbit/s or more) show small errors, while 32 kbit/s and lower bit-rates show larger errors. In particular, the 8 kbit/s voice shows very high errors, especially in the perceptually important low and mid frequencies. The MCEP parameters of STRAIGHT also show that high bit-rates (≥ 64 kbit/s) yield small errors, while higher compression significantly affects the spectral parameters. In conclusion, although the degradations are gradual, the compressed acoustic signals are significantly different from the original signal if the bit-rate is 32 kbit/s or less, which has a clear effect on both vocoders.

Experiments using different LSF (LP analysis) and MCEP orders were also conducted. The LSF order was varied from 14 to 30 and the MCEP order from 10 to 40. The results indicate that increasing the LSF order had the effect of reducing the average parameter error, whereas increasing the MCEP order had the opposite effect of increasing the error due to compression.

HMM-based synthetic voices were built with both vocoders, for both speakers, and at five bit-rates. The standard HTS procedure [61, 27] was used for training the voices. Subjective evaluations were conducted to assess the quality of the HMM-based voices.



Figure 5.3: HMM-based synthesis naturalness scores as a function of bit-rate for GlottHMM (top left) and STRAIGHT (top right) for the female (F) and male (M) speakers, and averaged MOS scores for both vocoders (bottom).

Figure 5.3 shows the means and 95% confidence intervals of the MOS ratings for each vocoder and speaker, and also the averages across gender. For GlottHMM, the male and female voices are generally rated similarly, with a slight preference for the male voice except at the lowest bit-rate.

Publication III: “Non-linear pitch modification in voice conversion using artificial neural networks”

In this workshop paper, we proposed a new method to transform pitch in voice conversion using neural networks. In the literature, the majority of voice conversion techniques have focused mainly on the modification of short-term spectral features [65], [66]. However, prosodic features, such as the pitch contour and speaking rhythm, also contain important cues to speaker identity. The most common method for pitch contour transformation is:

log(f0^t) = ((log(f0^s) − μ^s_logf0) / σ^s_logf0) · σ^t_logf0 + μ^t_logf0        (5.1)

where f0^s and f0^t represent the source and target pitch values at the frame level, and μ^s_logf0, σ^s_logf0, μ^t_logf0 and σ^t_logf0 represent the mean and standard deviation of the log-domain pitch values for the source and target speakers, respectively.
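A direct implementation of equation (5.1) is a few lines of NumPy, sketched below; the statistics are assumed to have been computed over the voiced frames of the source and target training data.

```python
import numpy as np

def linear_f0_transform(f0_voiced, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Equation (5.1): mean/variance normalisation of log-F0 for voiced frames."""
    log_f0 = np.log(f0_voiced)
    return np.exp((log_f0 - mu_src) / sigma_src * sigma_tgt + mu_tgt)
```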


Table 5.2: RMSE (in Hz) between target and converted contours with the linear and non-linear transformation methods.

Speaker pair   Linear modification   Non-linear modification
RMS-to-SLT     18.28                 14.36
SLT-to-RMS     15.92                 12.50

In this paper, we refer to this method as the linear transformation. However, it is difficult to model and transform the dynamics of the pitch contour using linear transformation methods. Thus, we proposed a non-linear transformation method using multi-layer feed-forward neural networks (MLFNN). The pitch contours over the voiced segments were represented by their discrete cosine transform (DCT) coefficients [67], [68], [69].

The experiments were carried out on the CMU ARCTIC database, consisting of utterances recorded by seven speakers [70]. We used the STRAIGHT vocoder for analysis and synthesis. During the training process, the acoustic features of the source and target speakers were given as input and output to the network. The network learns from these two data sets and tries to capture a non-linear mapping function based on the minimum mean square error. Generalized back-propagation learning [71] is used to adjust the weights of the neural network so as to minimize the mean squared error between the desired and the actual output values. We considered four-layer networks with architectures 40L-80N-80N-40L, 21L-42N-42N-21L, 9L-18N-18N-9L and 5L-10N-10N-5L for mapping the MCEP, BAP, F0-shape and F0-limit features, respectively. The first and fourth layers are input and output layers with linear units (L) and have the same dimension as the input and output acoustic features. The second layer (first hidden layer) and third layer (second hidden layer) have non-linear nodes (N), which help in capturing the non-linear relationship that may exist between the input and output features.
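The sketch below shows one way such a mapping network could be set up for the F0-shape stream (the 9L-18N-18N-9L architecture), using DCT coefficients of a voiced-segment pitch contour as input and output. PyTorch and the dummy contours are assumptions made for the illustration; the original work used its own MLFNN implementation trained with generalized back-propagation.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dct

def contour_to_dct(f0_segment, n_coef=9):
    """Represent a voiced-segment pitch contour by its first DCT coefficients."""
    return dct(np.asarray(f0_segment, dtype=float), norm="ortho")[:n_coef]

# 9L-18N-18N-9L: linear input/output layers, two non-linear hidden layers.
net = nn.Sequential(nn.Linear(9, 18), nn.Tanh(),
                    nn.Linear(18, 18), nn.Tanh(),
                    nn.Linear(18, 9))
optimiser = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Dummy source/target voiced-segment contours standing in for aligned training data.
source_f0 = np.linspace(110.0, 160.0, 50)
target_f0 = np.linspace(180.0, 260.0, 50)
src = torch.tensor(contour_to_dct(source_f0), dtype=torch.float32)
tgt = torch.tensor(contour_to_dct(target_f0), dtype=torch.float32)

loss = loss_fn(net(src), tgt)      # one mean-squared-error training step
optimiser.zero_grad()
loss.backward()
optimiser.step()
```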

In order to evaluate the performance of the proposed method, we estimated the root mean square error (RMSE) between the target and converted pitch contours of the test set. The RMSE was calculated after the durations of the predicted contours were normalized with respect to the actual contours of the target speaker. It can be seen from Table 5.2 that the non-linear transformation method performed better than the linear method. The results of an informal perceptual test also confirmed this.

Publication IV: “A comparative evaluation of vocoding techniques for HMM-based laughter synthesis”

In this conference paper, we compared state-of-the-art vocoders for HMM-based laughter synthesis.


Table 5.3: Speaker similarity score

Speaker pair   Linear modification   Non-linear modification
RMS-to-SLT     3                     3.3
SLT-to-RMS     2.55                  3.1

In the last decade, a considerable amount of research has been done on the analysis and detection of laughter (see e.g. [72]), whereas only a few studies have been conducted on its synthesis (see e.g. [73], [74], [75]). Recently, Urbain et al. [58] used HMMs to synthesize laughs from phonetic transcriptions, similarly to the traditional methods used in statistical parametric speech synthesis. Models were trained using the HMM-based speech synthesis system (HTS) [27] on a range of phonetic clusters encountered in 64 laughs from one person. It is clear that research on HMM-based laughter synthesis is at an early stage: there exists only one study on HMM-based laughter synthesis, using a single vocoder. Thus, in this paper, we studied the role of four state-of-the-art vocoders commonly used in statistical parametric speech synthesis for the application of HMM-based laughter synthesis. The following vocoders were chosen for comparison: 1) an impulse-train excited mel-cepstrum based vocoder, 2) STRAIGHT [20, 22] using mixed excitation, 3) the deterministic plus stochastic model (DSM) [23], and 4) the GlottHMM vocoder [24].

Two voices from the AVLaughterCycle database [76] were selected: a female voice (subject 5, 54 laughs) and a male voice (subject 6, the same voice as in previous work [58], 64 laughs). A subjective evaluation was carried out to compare the performance of the 4 vocoders in synthesizing natural laughs. For each vocoder, two types of samples were used: a) copy-synthesis, which consists of extracting the parameters from a laugh signal and re-synthesizing the same laugh from the extracted parameters; and b) HMM-based synthesis, where an HMM-based system is trained on a laughter database and laughs are then synthesized using the models and the original phonetic transcriptions of laughter. Copy-synthesis can be seen as the theoretically best synthesis that can be obtained with a particular vocoder, while HMM-based synthesis shows the current performance that can be achieved when synthesizing new laughs. Human laughs were also included in the evaluation for reference.
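Copy-synthesis, as described above, is simply analysis followed immediately by re-synthesis with the same vocoder. The sketch below does this with the freely available WORLD vocoder via the pyworld package, as a stand-in for the four vocoders compared in the paper; the file name is hypothetical.

```python
import numpy as np
import soundfile as sf   # assumed I/O library
import pyworld as pw     # WORLD vocoder, used here as a stand-in for the paper's vocoders

# Analysis: decompose the laugh into F0, spectral envelope and aperiodicity.
x, fs = sf.read("laugh.wav")          # hypothetical input file (mono)
x = x.astype(np.float64)
f0, t = pw.harvest(x, fs)
sp = pw.cheaptrick(x, f0, t, fs)
ap = pw.d4c(x, f0, t, fs)

# Re-synthesis from the extracted parameters gives the copy-synthesis reference.
y = pw.synthesize(f0, sp, ap, fs)
sf.write("laugh_copy_synthesis.wav", y, fs)
```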

Subjects were asked to rate the quality of the synthesized laughter signals on a 5-point Likert scale [77]. The 45 laughter signals were presented in random order. Eighteen participants evaluated the male voice while 15 evaluated the female one. All listeners were between 25 and 35 years of age, and a few of them were speech experts. Figure 5.4 shows the means and 95% confidence intervals of the naturalness ratings for copy-synthesis (left) and HMM synthesis (right) of the male (upper) and female (lower) voices. These results indicate that MCEP and DSM are in general good choices for laughter synthesis. Both vocoders use a simple parameter representation in statistical modeling: only F0 and the spectrum are modeled, and all other features are fixed.

Figure 5.4: Naturalness scores for copy-synthesis (left) and HMM synthesis (right) for the male (upper) and female (lower) speakers. Each panel plots naturalness (1–5) for the vocoders MCEP, STR (STRAIGHT), GlottHMM and DSM; natural laughs are included as a reference in the copy-synthesis panels.

Accordingly, the synthesis procedure of these vocoders is very simple: the excitation generation depends only on the modeled F0. In DSM, Fm, the residual waveform and the noise time envelope are fixed, and thus they cannot produce additional artefacts beyond possible errors in F0 and the spectrum. MCEP attained the best naturalness scores for the female voice, although the known drawback of this method is its buzziness. This was likely not too disturbing, as the female voice contained few voiced segments. The buzziness could, however, explain why male laughs synthesized with MCEP were perceived as less natural than female laughs, since the male laughs contained more and longer voiced segments.


Conclusions and future work

Conversational speech synthesis is a new and difficult problem. Developing a text-to-speech (TTS) system adequate for human-like discourse requires not only the successful synthesis of read speech, but also the generation of appropriate extralinguistic cues, such as speaking style and speaker characteristics, as well as non-verbal sounds. Thus, conversational speech synthesis requires both the successful prediction of context and meaning from the input text and the modelling and reproduction of all conversational cues in speech. This thesis focuses on the latter by providing ideas on data collection and laughter synthesis.

This thesis presented two ideas on conversational speech data collection: 1) harvesting the world wide web (WWW) for conversational speech corpora, and 2) imitation of natural conversations by professional actors. It is well known that most of the media content available on the WWW is stored in compressed form. Thus, we explored the effect of compression on the performance of TTS systems. The results in paper 1 showed that the synthesis quality is indeed affected by compression; however, the perceptual differences became significant only when the bit-rate was below 32 kbit/s. Natural conversational speech, on the other hand, contains various phenomena such as paralinguistic cues, disfluency patterns and reflexes, which are harder to model in traditional TTS systems. To overcome this issue, we proposed imitating conversational speech with professional actors in recording studios. Our results showed that the acoustic properties of acted speech are closer to those of clean speech while exhibiting more expressiveness than read speech. Moreover, we presented an idea for transforming the pitch contours of one speaker to those of another speaker using artificial neural networks. This idea can be applied to transform read speech into conversational speech when conversational speech data is limited. One class of non-verbal sounds that plays an important role in human communication is laughter, which motivated the study on laughter synthesis using hidden Markov model (HMM) based speech synthesis.

Conversational speech synthesis is still at a very early stage, and there is a lot of room for improvement. We want to study the proposed techniques further by combining them in a single system and evaluating them formally through a large subjective evaluation. Recently, deep neural networks have shown a promising direction in speech synthesis; thus, we also want to extend the current study to deep neural network based speech synthesis systems.



Bibliography

[1] D. H. Klatt, “Review of text-to-speech conversion for English,” The Journal of the Acoustical Society of America, vol. 82, no. 3, pp. 737–793, 1987.

[2] I. R. Murray, J. L. Arnott, N. Alm, and A. F. Newell, “A communicationsystem for the disabled with emotional synthetic speech produced by rule.” inEUROSPEECH, 1991.

[3] J. Yamagishi, C. Veaux, S. King, and S. Renals, “Speech synthesis technolo-gies for individuals with vocal disabilities: Voice banking and reconstruction,”Acoustical Science and Technology, vol. 33, no. 1, pp. 1–5, 2012.

[4] J. Beskow, “Talking heads-communication, articulation and animation,” inproceedings of Fonetik, vol. 96, 1996, pp. 29–31.

[5] W. Van Summers, D. B. Pisoni, R. H. Bernacki, R. I. Pedlow, and M. A. Stokes,“Effects of noise on speech production: Acoustic and perceptual analyses,” TheJournal of the Acoustical Society of America, vol. 84, no. 3, pp. 917–928, 1988.

[6] “Blizzard Challenges.” [Online]. Available: http://synsig.org/index.php/Blizzard_Challenge

[7] H. Dudley, R. Riesz, and S. Watkins, “A synthetic speaker,” Journal of theFranklin Institute, vol. 227, no. 6, pp. 739–764, 1939.

[8] R. Carlson and B. Granström, “Rule-based speech synthesis,” in SpringerHandbook of Speech Processing, 2008, pp. 429–436.

[9] H. Cecil, “A model of articulatory dynamics and control,” Proc. of the IEEE,vol. 64, no. 4, pp. 452–460, 1976.

[10] E. Moulines and F. Charpentier, “Pitch-synchronous waveform processingtechniques for text-to-speech synthesis using diphones,” Speech Communica-tion, vol. 9, no. 5-6, pp. 453–467, 1990.

[11] A. J. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesissystem using a large speech database,” in Proc. Int. Conf. Acoust., Speech,Signal Process., vol. 1, 1996, pp. 373–376.

[12] A. W. Black, “Clustergen: a statistical parametric synthesizer using trajectorymodeling.” in Proc. of INTERSPEECH, 2006, pp. 1762–1765.


[13] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech synthesis based on hidden Markov models,” Proc. of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.

[14] J. Yamagishi, T. Kobayashi, Y. Nakano, K. Ogata, and J. Isogai, “Analysisof speaker adaptation algorithms for hmm-based speech synthesis and a con-strained smaplr adaptation algorithm,” IEEE Trans. on Audio, Speech, andLang. Proc., vol. 17, no. 1, pp. 66–83, 2009.

[15] T. Nose and T. Kobayashi, “An intuitive style control technique in hmm-based expressive speech synthesis using subjective style intensity and multiple-regression global variance model,” Speech Communication, vol. 55, no. 2, pp.347–357, 2013.

[16] G. Fant, Acoustic theory of speech production: with calculations based on X-raystudies of Russian articulations. Walter de Gruyter, 1971, vol. 2.

[17] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptive algorithmfor mel-cepstral analysis of speech,” in Proc. Int. Conf. Acoust., Speech, SignalProcess., vol. 1, 1992, pp. 137–140.

[18] F. K. Soong and B.-H. Juang, “Line spectrum pair (lsp) and speech datacompression,” in Proc. Int. Conf. Acoust., Speech, Signal Process., vol. 9, 1984,pp. 37–40.

[19] T. Kobayashi, S. Imai, and Y. Fukuda, “Mel generalized-log spectrum approx-imation (mglsa) filter,” Journal of IEICE, vol. 68, pp. 610–611, 1985.

[20] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, “Restructuringspeech representations using a pitch-adaptive time–frequency smoothing andan instantaneous-frequency-based f0 extraction: Possible role of a repetitivestructure in sounds,” Speech communication, vol. 27, no. 3, pp. 187–207, 1999.

[21] S. Imai, “Cepstral analysis synthesis on the mel frequency scale,” in Proc. Int.Conf. Acoust., Speech, Signal Process., vol. 8, 1983, pp. 93–96.

[22] H. Kawahara, J. Estill, and O. Fujimura, “Aperiodicity extraction and controlusing mixed mode excitation and group delay manipulation for a high qualityspeech analysis, modification and synthesis system STRAIGHT,” in 2nd Inter-national Workshop on Models and Analysis of Vocal Emissions for BiomedicalApplications (MAVEBA), 2001, pp. 59–64.

[23] T. Drugman and T. Dutoit, “The deterministic plus stochastic model of theresidual signal and its applications,” IEEE Trans. on Audio, Speech, and Lang.Proc., vol. 20, no. 3, pp. 968–981, 2012.


[24] T. Raitio, A. Suni, J. Yamagishi, H. Pulakka, J. Nurminen, M. Vainio, andP. Alku, “Hmm-based speech synthesis utilizing glottal inverse filtering,” IEEETrans. on Audio, Speech, and Lang. Proc., vol. 19, no. 1, pp. 153–165, 2011.

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Si-multaneous modeling of spectrum, pitch and duration in hmm-based speechsynthesis,” in Proc. of Eurospeech, 1999, pp. 2347–2350.

[26] H. Zen, K. Tokuda, and A. W. Black, “Statistical parametric speech synthesis,”Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.

[27] [Online], “HMM-based speech synthesis system (HTS),”http://hts.sp.nitech.ac.jp/.

[28] B. Bollepalli, J. Beskow, and J. Gustafson, “HMM based speech synthesis system for Swedish language,” in The Fourth Swedish Language Technology Conference, 2012, pp. 9–10.

[29] A. Lundgren, “HMM-baserad talsyntes,”Master’s Thesis, KTH, Sweden, 2005.

[30] K. L. Payton, R. M. Uchanski, and L. D. Braida, “Intelligibility of conversa-tional and clear speech in noise and reverberation for listeners with normal andimpaired hearing,” The Journal of the Acoustical Society of America, vol. 95,no. 3, pp. 1581–1592, 1994.

[31] S. H. Ferguson, “Talker differences in clear and conversational speech: Vowelintelligibility for normal-hearing listeners,” The Journal of the Acoustical So-ciety of America, vol. 116, no. 4, pp. 2365–2373, 2004.

[32] S. H. Ferguson, “Talker differences in clear and conversational speech: Vowelintelligibility for older adults with hearing loss,” Journal of Speech, Language,and Hearing Research, vol. 55, no. 3, pp. 779–790, 2012.

[33] P. Flipsen Jr, “Speaker-listener familiarity: Parents as judges of delayed speechintelligibility,” Journal of Communication Disorders, vol. 28, no. 1, pp. 3–19,1995.

[34] P. Flipsen Jr, “Measuring the intelligibility of conversational speech in chil-dren,” Clinical linguistics & phonetics, vol. 20, no. 4, pp. 303–312, 2006.

[35] S. Sundaram and S. Narayanan, “Spoken language synthesis: Experiments insynthesis of spontaneous monologues,” in Proc. of IEEE Workshop on SpeechSynthesis, 2002, pp. 203–206.

[36] H. H. Clark, “Speaking in time,” Speech communication, vol. 36, no. 1, pp.5–13, 2002.


[37] G. P. Laan, “The contribution of intonation, segmental durations, and spectralfeatures to the perception of a spontaneous and a read speaking style,” SpeechCommunication, vol. 22, no. 1, pp. 43–65, 1997.

[38] M. Nakamura, K. Iwano, and S. Furui, “Differences between acoustic charac-teristics of spontaneous and read speech and their effects on speech recognitionperformance,” Computer Speech & Language, vol. 22, no. 2, pp. 171–184, 2008.

[39] S. Andersson, J. Yamagishi, and R. A. Clark, “Synthesis and evaluation ofconversational characteristics in hmm-based speech synthesis,” Speech Com-munication, vol. 54, no. 2, pp. 175–188, 2012.

[40] M. P. Aylett and A. Turk, “Vowel quality in spontaneous speech: What makesa good vowel?” in Proc. of the Int. Conf. on Spoken Language Processing,1998.

[41] A. Bell, D. Jurafsky, E. Fosler-Lussier, C. Girand, M. Gregory, and D. Gildea,“Effects of disfluencies, predictability, and utterance position on word formvariation in english conversation,” The Journal of the Acoustical Society ofAmerica, vol. 113, no. 2, pp. 1001–1024, 2003.

[42] C. Cucchiarini, H. Strik, D. Binnenpoorte, and L. Boves, “Pronunciation eval-uation in read and spontaneous speech: A comparison between human ratingsand automatic scores,” Proc. of the New Sounds, 2000.

[43] E. Blaauw, “Phonetic differences between read and spontaneous speech,” inProc. of the Int. Conf. on Spoken Language Processing, 1992.

[44] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper,“Enriching speech recognition with automatic detection of sentence boundariesand disfluencies,” IEEE Trans. on Audio, Speech, and Lang. Proc., vol. 14,no. 5, pp. 1526–1540, 2006.

[45] J. Adell, D. Escudero, and A. Bonafonte, “Production of filled pauses in con-catenative speech synthesis based on the underlying fluent sentence,” SpeechCommunication, vol. 54, no. 3, pp. 459–476, 2012.

[46] D. Cadic and L. Segalen, “Paralinguistic elements in speech synthesis,” in Proc.of INTERSPEECH, 2008, pp. 1861–1864.

[47] N. Campbell, “Conversational speech synthesis and the need for some laugh-ter,” IEEE Trans. on Audio, Speech, and Lang. Proc., vol. 14, no. 4, pp.1171–1178, 2006.

[48] S. Andersson, K. Georgila, D. Traum, M. Aylett, and R. A. Clark, “Predictionand realisation of conversational characteristics by utilising spontaneous speechfor unit selection,” in Proc. of Int. Conf. on Speech Prosody, 2010.


[49] R. Dall, M. Tomalin, M. Wester, W. J. Byrne, and S. King, “Investigatingautomatic & human filled pause insertion for speech synthesis.” in Proc. ofINTERSPEECH, 2014, pp. 51–55.

[50] N. Campbell, “Where is the information in speech? (And to what extent can it be modelled in synthesis?),” in Proc. of the 3rd ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis, 1998, pp. 17–20.

[51] N. Campbell, “Towards conversational speech synthesis; lessons learned fromthe expressive speech processing project.” in Sixth ISCA Workshop on SpeechSynthesis, 2007, pp. 22–27.

[52] J. Adell, A. Bonafonte, and D. Escudero, “Filled pauses in speech synthesis:towards conversational speech,” in Text, Speech and Dialogue, 2007, pp. 358–365.

[53] J. Adell, A. Bonafonte, and D. E. Mancebo, “Synthesis of filled pauses based ona disfluent speech model.” in Proc. Int. Conf. Acoust., Speech, Signal Process.,2010, pp. 4810–4813.

[54] B. Möbius, “Rare events and closed domains: Two delicate concepts in speechsynthesis,” International Journal of Speech Technology, vol. 6, no. 1, pp. 57–71,2003.

[55] J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi, “Acoustic modelingof speaking styles and emotional expressions in hmm-based speech synthesis,”IEICE TRANSACTIONS on Information and Systems, vol. 88, no. 3, pp.502–509, 2005.

[56] L. Badino, J. S. Andersson, J. Yamagishi, and R. A. Clark, “Identification ofcontrast and its emphatic realization in hmm-based speech synthesis,” in Proc.of INTERSPEECH, 2009, pp. 520–523.

[57] J. Robson and B. Janet, “Hearing smiles-perceptual, acoustic and productionaspects of labial spreading,” in Proc. of the XIVth International Congress ofPhonetic Sciences, vol. 1, 1999, pp. 219–222.

[58] J. Urbain, H. Cakmak, and T. Dutoit, “Evaluation of hmm-based laughtersynthesis,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 2013, pp. 7835–7839.

[59] K. Shigeyoshi, K. Shinya, I. Toshihiko, and N. Campbell, “Japanese MUL-TEXT: a prosodic corpus,” in Proc. of Int. Language Resources and Evalua-tion, 2004, pp. 2167–2170.

[60] R. M. Uchanski, “Clear speech,” The handbook of speech perception, pp. 207–235, 2008.


[61] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. W. Black, andK. Tokuda, “The HMM-based speech synthesis system (HTS) version 2.0.” inProc. of 6th ISCA Workshop on Speech Synthesis (SSW), 2007, pp. 294–299.

[62] J. Gonzalez and T. Cervera, “The effect of mpeg audio compression on multi-dimensional set of voice parameters,” Logopedics Phoniatrics Vocology, vol. 26,no. 3, pp. 124–138, 2001.

[63] R. J. Van Son, “A study of pitch, formant, and spectral estimation errorsintroduced by three lossy speech compression algorithms,” Acta acustica unitedwith acustica, vol. 91, no. 4, pp. 771–778, 2005.

[64] ISO, “Information technology – Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s – Part 3: Audio,” ISO/IEC 11172-3:1993, International Organization for Standardization, 1993.

[65] T. Dutoit, A. Holzapfel, M. Jottrand, A. Moinet, J. Perez, and Y. Stylianou,“Towards a voice conversion system based on frame selection,” in Proc. Int.Conf. Acoust., Speech, Signal Process., vol. 4, 2007, pp. IV–513.

[66] Y. Stylianou, “Voice transformation: a survey,” in Proc. Int. Conf. Acoust.,Speech, Signal Process., 2009, pp. 3585–3588.

[67] J. Teutenberg, C. Watson, and P. Riddle, “Modelling and synthesising f0 con-tours with the discrete cosine transform,” in Proc. Int. Conf. Acoust., Speech,Signal Process., 2008, pp. 3973–3976.

[68] C. Veaux and X. Rodet, “Intonation conversion from neutral to expressivespeech.” in Proc. of INTERSPEECH, 2011, pp. 2765–2768.

[69] E. E. Helander and J. Nurminen, “A novel method for prosody prediction invoice conversion,” in Proc. Int. Conf. Acoust., Speech, Signal Process., vol. 4,2007, pp. 504–509.

[70] J. Kominek and A. W. Black, “The CMU arctic speech databases,” in Proc.of 5th ISCA Workshop on Speech Synthesis, 2004.

[71] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice-Hall Inc., NJ, 1999.

[72] S. Petridis and M. Pantic, “Audiovisual discrimination between speech andlaughter: Why and when visual information might help,” IEEE Transactionson Multimedia, vol. 13, no. 2, pp. 216–234, 2011.

[73] S. Sundaram and S. Narayanan, “Automatic acoustic synthesis of human-like laughter,” The Journal of the Acoustical Society of America, vol. 121, no. 1, pp. 527–535, 2007.


[74] E. Lasarcyk and J. Trouvain, “Imitating conversational laughter with an ar-ticulatory speech synthesizer,” in Proc. of Interdisciplinary Workshop on thePhonetics of Laughter, 2008, pp. 43–48.

[75] S. A. Thati, S. Kumar, and B. Yegnanarayana, “Synthesis of laughter bymodifying excitation characteristics,” The Journal of the Acoustical Societyof America, vol. 133, no. 5, pp. 3072–3082, 2013.

[76] J. Urbain, E. Bevacqua, T. Dutoit, A. Moinet, R. Niewiadomski, C. Pelachaud,B. Picart, J. Tilmanne, and J. Wagner, “The AVLaughterCycle database.” inProc. of Int. Language Resources and Evaluation, 2010, pp. 2996–3001.

[77] R. Likert, “A technique for the measurement of attitudes.” Archives of psy-chology, 1932.


Publication I


Effect of MPEG Audio Compression on HMM-based Speech Synthesis

Bajibabu Bollepalli¹, Tuomo Raitio², Paavo Alku²

¹Department of Speech, Music and Hearing, KTH, Stockholm, Sweden
²Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland


Abstract

In this paper, the effect of MPEG audio compression on HMM-based speech synthesis is studied. Speech signals are encoded with various compression rates and analyzed using the GlottHMM vocoder. Objective evaluation results show that the vocoder parameters start to degrade from encoding with bit-rates of 32 kbit/s or less, which is also confirmed by the subjective evaluation of the vocoder analysis-synthesis quality. Experiments with HMM-based speech synthesis show that the subjective quality of a synthetic voice trained with 32 kbit/s speech is comparable to a voice trained with uncompressed speech, but lower bit rates induce clear degradation in quality.

Index Terms: speech synthesis, HMM, MP3, GlottHMM

1. Introduction

The paradigm of text-to-speech (TTS) has shifted from read-aloud corpus based synthesis of short sentences to audio-book based synthesis of longer paragraphs [1]. Nowadays, one can find extensive amounts of speech data on, e.g., the world wide web. However, due to the limitations in storage and bandwidth on the Internet, speech data is typically available in compressed form. In addition, speech data are expressed in various forms which may also involve mixtures of voice signals, music and video. Thus, instead of using speech-specific compression methods, general audio compression methods are increasingly used when speech data is disseminated on the Internet. Depending on the optimization of the video and audio data rate, compression may introduce severe artefacts in the speech signal.

In [2], it was shown that both the 29 acoustical voice parameters obtained from the multi-dimensional voice program (MDVP) model of Kay Elemetrics Corp. and the amplitude-frequency spectrum have high fidelity at high data rates (128 kbit/s and 168 kbit/s with a sampling rate of 44.1 kHz), whereas lower bit-rates, such as 64 kbit/s or lower, introduce substantial modifications in the speech signal and the amplitude spectrum parameters. In [3], it was shown that bit-rates of 80 kbit/s and up (with a sampling rate of 44.1 kHz) can be used for acoustic analysis (pitch and formant extraction, global spectral measure, the spectral center of gravity) without any degradation in quality, while a low bit-rate of 40 kbit/s introduces larger errors in formant measurements. However, to the best of our knowledge, there are no studies which have specifically addressed how compression of speech affects vocoding and statistical parametric speech synthesis [4].

The aim of this investigation is to study how compression of speech affects hidden Markov model (HMM) based speech synthesis. One of the most commonly used audio compression techniques, MPEG-1 Audio Layer III (MP3) [5], is used to compress speech signals with various bit-rates. All the analysis and synthesis steps are performed using the GlottHMM vocoder [6].

First, the extent to which the speech signal and the vocoder parameters are affected by the compression at different bit-rates is studied with objective methods. Second, the analysis-synthesis quality of the vocoder is evaluated by subjective listening tests by varying the bit-rate of the input speech. Finally, the role of speech compression in the quality of HMM-based synthesis is studied by building voices with various bit-rates and evaluating the subjective quality of the resulting synthetic voices.

2. Speech Compression

For compression of speech, we used the MPEG-1 Audio Layer 3 compression method [5], commonly known as MP3. MPEG (moving pictures expert group) is a standard in audio coding which enables high compression rates while preserving high quality. MP3 takes advantage of the characteristics of the human auditory mechanism to compress audio. MP3 compression is lossy; it uses psychoacoustic models to discard or reduce the precision of components less audible to human hearing, and encodes the remaining material with high efficiency. First, the audio signal is converted into spectral components using a filter bank analysis. For each spectral component, the perceptual masking effect caused by other components is calculated. Later, the low-level signals (maskee) are replaced by a simultaneously occurring stronger signal (masker) as long as the masker and maskee are close enough to each other in frequency or time [7].

In this work, we have used a freely available software, the LAME v3.99 [8] encoder, to compress speech signals with different bit-rates. The standard options of the encoder were used, i.e., the fixed bit-rate encoding scheme. All manipulations were done on a PC workstation running Linux. Table 1 shows the bit-rates along with the compression ratios used in this study. Here, compression ratios are calculated with respect to the original speech utterances, which were recorded at a sampling rate of 16 kHz with 16-bit resolution, resulting in a data rate of 256 kbit/s with pulse code modulation (PCM) encoding.

Table 1: Bit-rates and corresponding theoretical and realized compression ratios with respect to 256 kbit/s 16 kHz PCM speech.

Bit-rate (kbit/s)   Compression ratio    Compression ratio
                    w.r.t. bit-rate      w.r.t. file size
160                 1.6                  1.56
128                 2                    1.92
64                  4                    3.13
32                  8                    6.25
24                  10.67                8.33
16                  16                   12.50
8                   32                   25.00


3. Vocoder

The GlottHMM statistical parametric speech synthesizer [6] is used in the experiments of this study. GlottHMM aims to accurately model the speech production mechanism by decomposing speech into the vocal tract filter and the voice source signal using glottal inverse filtering. It is built on the basic framework of an HMM-based speech synthesis system (HTS) [9, 10], but it uses a distinct type of vocoder for parameterizing and synthesizing speech. GlottHMM has been shown to yield high-quality synthetic speech [6, 11, 12, 13], better than or comparable to the quality of STRAIGHT [14], which is currently the most commonly used vocoder in statistical parametric speech synthesis.

In the parametrization of speech with GlottHMM, iterative adaptive inverse filtering (IAIF) [15] is used to estimate the vocal tract filter and the voice source signal. Linear prediction (LP) is used for spectral estimation in the IAIF method, and the estimated vocal tract filter is converted to line spectral frequencies (LSF) [16] for better representation of the LP information in HMM training [17]. From the estimated voice source signal, the fundamental frequency (F0) is estimated with the autocorrelation method, and the harmonic-to-noise ratio (HNR) is estimated in five bands according to the equivalent rectangular bandwidth (ERB) [18] scale: 0–241 Hz, 241–731 Hz, 731–1735 Hz, 1735–3791 Hz, and 3791–8000 Hz. The voice source spectrum is estimated with LP and converted to LSFs.
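For illustration, a very simplified autocorrelation-based F0 estimator of the kind mentioned above is sketched below; this is not GlottHMM's actual implementation, and the frame length, search range and crude voicing threshold are assumptions.

```python
# Simplified frame-wise autocorrelation F0 estimation.
import numpy as np

def f0_autocorrelation(x: np.ndarray, fs: int, frame_len: float = 0.025,
                       hop: float = 0.005, fmin: float = 60.0, fmax: float = 400.0):
    n, h = int(frame_len * fs), int(hop * fs)
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    f0 = []
    for start in range(0, len(x) - n, h):
        frame = x[start:start + n] * np.hanning(n)
        ac = np.correlate(frame, frame, mode="full")[n - 1:]   # non-negative lags
        peak = lag_min + np.argmax(ac[lag_min:lag_max])
        # Very crude voicing decision: require a reasonably strong peak.
        f0.append(fs / peak if ac[peak] > 0.3 * ac[0] else 0.0)
    return np.array(f0)
```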

In synthesis, natural glottal flow pulses are used for reconstructing the excitation signal. Pulses are interpolated in time and scaled in amplitude to match F0 and energy. In order to match the degree of voicing in the excitation, noise is added according to the HNR of the five bands in the spectral domain. Also, in order to control the phonation type, the excitation is filtered with an infinite impulse response (IIR) filter to match the voice source spectrum. Finally, the created excitation is filtered with the vocal tract filter to synthesize speech.

Besides being used as a key element of statistical speech synthesis, GlottHMM can also be used as a general speech analysis tool. In addition to the parameters described above, two additional voice source quantities are extracted: the difference between the first and the second harmonic (H1–H2) [19], describing the spectral tilt of the voice source, and the normalized amplitude quotient (NAQ) [20], describing the phonation type. The parameters extracted with GlottHMM are shown in Table 2.

4. Experiments

4.1. Speech material

Two databases designed for TTS development were used in the experiments. The first corpus consists of 599 sentences read by a Finnish male (labeled as MV), and the second one consists of 513 sentences read by a Finnish female (labeled as HK).

Table 2: Speech features and the number of parameters.

Feature                              Number of parameters
Vocal tract spectrum                 30
Voice source spectrum                10
Harmonic-to-noise ratio (HNR)        5
Energy                               1
Fundamental frequency (F0)           1
H1–H2                                1 (only for analysis)
NAQ                                  1 (only for analysis)

Figure 1: Relative error of F0 (%) as a function of bit-rate. Data is represented as means and 95% confidence intervals over the two voices (male MV and female HK).

Figure 2: Relative error of HNR (%) for each of the five HNR bands at bit-rates of 8, 16, 24, 32, 64, 128, and 160 kbit/s.

All audio files were PCM encoded and sampled at 16 kHz with a resolution of 16 bits, resulting in a data rate of 256 kbit/s.

4.2. Objective evaluations of vocoder parameters

The effects of compression (with the bit-rates shown in Table 1) were evaluated by comparing the vocoder parameters extracted from the MP3-processed sounds to those obtained from the corresponding uncompressed sentences. For each compression rate, the relative error was determined between the parameter values computed from the uncompressed and compressed sound for both speakers. The following six parameters were analyzed: 1) F0, 2) HNR, 3) LSF of the voice source, 4) LSF of the vocal tract, 5) H1–H2, and 6) NAQ.
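A sketch of this objective comparison is shown below, assuming frame-aligned parameter matrices (e.g., LSFs) extracted from the uncompressed and the MP3-processed versions of the same utterance; the exact error definition used in the paper is paraphrased here as a mean absolute relative error per coefficient.

```python
# Relative error between vocoder parameters from original and compressed speech.
import numpy as np

def relative_error(ref: np.ndarray, test: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Mean relative error (%) per parameter dimension over all frames."""
    err = np.abs(test - ref) / (np.abs(ref) + eps)
    return 100.0 * err.mean(axis=0)

# lsf_ref, lsf_mp3: [n_frames, 30] vocal tract LSF matrices (hypothetical names)
# print(relative_error(lsf_ref, lsf_mp3))
```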

Figure 1 shows the relative error of F0 as a function of bit-rate. It can be observed that the error of F0 for high bit-rates (64 kbit/s or more) is almost negligible, and even for low bit-rates the average error is less than 5%. Figure 2 shows the relative error of HNR for each of the five bands. The error in the high-frequency bands is larger than in the lower bands, which indicates that the strong harmonic structure of the low-frequency bands is fairly well preserved while the details in the high-frequency bands suffer from compression. Figures 3 and 4 show the relative error of the LSFs of the voice source and the vocal tract, respectively.


Figure 3: Relative error of voice source LSFs (%) as a function of bit-rate.

Figure 4: Relative error of vocal tract LSFs (%) as a function of bit-rate.

The figures show that the error is greatest in the low frequencies. Table 3 shows the correlation coefficient of H1–H2 and NAQ between the original and compressed speech signals. The table shows that the correlation declines gradually with decreasing bit-rates. Especially H1–H2 suffers greatly from high compression, while NAQ remains relatively stable.
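A minimal sketch of the correlation measure reported in Table 3 is given below: the Pearson correlation between a parameter track (H1–H2 or NAQ) extracted from the original and from the compressed version of the same speech. The variable names are hypothetical.

```python
# Pearson correlation between two equally long parameter tracks.
import numpy as np

def parameter_correlation(orig_track: np.ndarray, comp_track: np.ndarray) -> float:
    return float(np.corrcoef(orig_track, comp_track)[0, 1])

# h1h2_orig, h1h2_32k: per-frame H1-H2 values for PCM and 32 kbit/s speech (hypothetical)
# print(parameter_correlation(h1h2_orig, h1h2_32k))
```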

The objective evaluations of all parameters show a significant increase in error between the bit-rates of 64 and 32 kbit/s. In conclusion, if the bit-rate is 32 kbit/s or less, the compressed acoustic signal is significantly different from the original speech signal, which has a clear effect on the vocoder.

4.3. Evaluation of analysis-synthesis quality

A subjective evaluation was performed to study how the quality of vocoded signals is affected by compression. As subjective evaluations are more laborious than objective ones, only four bit-rates were included in the subjective tests: 128 kbit/s, 32 kbit/s, 16 kbit/s, and 8 kbit/s. For each bit-rate, two sets of signals were selected: 1) compressed signals and 2) vocoded signals. From each category, six randomly selected sentences (3 male and 3 female) were used. A total of 10 native Finnish listeners participated in the evaluations. The sentences were presented in random order to the subjects, who rated the naturalness of the signals on the mean opinion score (MOS) scale, ranging from 1 to 5 (1 – completely unnatural, 5 – completely natural).
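The reported means and 95% confidence intervals of such MOS ratings can be computed as sketched below, assuming the ratings for one condition are pooled over listeners and sentences; a normal-approximation interval is used here, which is an assumption rather than the paper's stated procedure.

```python
# Mean opinion score with a 95% confidence interval half-width.
import numpy as np

def mos_with_ci(ratings) -> tuple:
    """Return (mean, 95% CI half-width) for a list of 1-5 naturalness ratings."""
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    ci = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mean, ci

# Example: ratings of the 32 kbit/s condition pooled over listeners and sentences.
# print(mos_with_ci([4, 5, 4, 3, 5, 4, 4, 5]))
```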

Figure 5: MOS scores for naturalness, where e.8, e.16, e.32, and e.128 refer to encoded speech with bit-rates of 8, 16, 32, and 128 kbit/s, respectively. Vocoded speech is denoted by v.8, v.16, v.32, and v.128, corresponding to the same bit-rates.

Figure 5 shows the means and 95% confidence intervals of the naturalness ratings. The results show that the subjects rated the 128 kbit/s and 32 kbit/s compressed signals as completely natural, whereas the speech sounds compressed with 16 kbit/s or lower show a clear drop in naturalness. The result suggests that speech sampled at 16 kHz can be compressed with a bit-rate as low as 32 kbit/s with very little or no degradation in quality. A similar trend can also be observed for vocoded signals: vocoded speech corresponding to the bit-rates of 128 kbit/s and 32 kbit/s is rated equal in naturalness, although significantly lower than the non-vocoded signals. An interesting observation is that the gap between the compressed signals and the vocoded signals narrows along with the decreasing bit-rate.

4.4. Evaluation of HMM-synthesis quality

HMM-based synthetic voices were built with the GlottHMM vocoder. The standard HTS procedure [9, 10] was used for training the voices, with the modifications needed to accommodate the increased number of parameters of the vocoder [6]. Compressed speech signals with different bit-rates were used for building the voices: full-rate PCM, and 160, 128, 64, 32, 24, 16, and 8 kbit/s. Figure 6 shows the average spectra of original speech and HMM-based synthetic voices for the male MV and female HK speakers. Voices with bit-rates of 24 kbit/s or less show a large decrease in magnitude from 5 kHz to 8 kHz and also some deviation from 2 kHz to 5 kHz, for both natural compressed and HMM-based voices. However, the low-bit-rate HMM-based voices also show more distortion in their average spectra compared to compressed and original speech.
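The long-term average spectra behind Figure 6 can be computed roughly as sketched below, assuming the speech of one voice/condition has been concatenated into a single array; Welch averaging of periodogram frames is used here as an assumed estimation method.

```python
# Long-term average spectrum of a (concatenated) speech signal.
import numpy as np
from scipy.signal import welch

def long_term_average_spectrum(x: np.ndarray, fs: int = 16000):
    """Return frequencies (Hz) and magnitude (dB) of the long-term average spectrum."""
    f, pxx = welch(x, fs=fs, nperseg=1024)        # averaged power spectral density
    return f, 10.0 * np.log10(pxx + 1e-12)        # to dB, guarding against log(0)

# f, mag_db = long_term_average_spectrum(speech_32k)   # speech_32k: hypothetical array
```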

Table 3: Correlation coefficient (ρ) of H1–H2 and NAQ between compressed and original uncompressed speech.

Bit-rate (kbit/s)   ρ (H1–H2)   ρ (NAQ)
160                 0.98        0.94
128                 0.97        0.94
64                  0.96        0.92
32                  0.90        0.88
24                  0.85        0.86
16                  0.71        0.83
8                   0.54        0.81


Figure 6: Long-term average spectra of compressed natural speech (upper graphs) and synthetic speech (lower graphs) with different bit-rates (8, 16, 24, 32, 64, 128, 160 kbit/s, and PCM) for the male (left) and female (right) speaker; magnitude (dB) is plotted against frequency (0–8 kHz).

Subjective evaluations were conducted to assess the performance of the HMM-based voices. Four voices were selected for the tests: 1) the 128 kbit/s voice, which is considered comparable to training with uncompressed speech since it was rated fully natural in the subjective listening tests, and 2) the 32 kbit/s, 3) 16 kbit/s, and 4) 8 kbit/s voices, which are heavily compressed. For the subjective evaluations, the same setup as described in Section 4.3 was used. A total of 10 Finnish listeners participated in the MOS naturalness evaluation.

Figure 7 shows the means and 95% confidence intervals of the MOS naturalness ratings. The difference between the naturalness of the 128 kbit/s and 32 kbit/s voices is not significant, thus suggesting that the degradations caused by compression in the 32 kbit/s signal do not affect the training of an HMM-based synthetic voice. The difference between the 32 kbit/s and 16 kbit/s voices is greater, indicating degraded quality. The naturalness of the 8 kbit/s voice is rated very low, indicating that such high compression rates are not suitable for building HMM-based synthetic voices.

5. Discussion

The results of the experiments show that the vocoder parameters corresponding to high bit-rates such as 160, 128 and 64 kbit/s maintain high fidelity in comparison to the parameters of the original, uncompressed speech, whereas the parameters corresponding to low bit-rates (32 kbit/s or lower) are distorted. However, the subjective evaluations indicate that the naturalness of the speech signals is affected by the compression scheme when the bit-rate is 16 kbit/s or less.

Figure 7: Naturalness ratings for HMM-based synthetic voices trained with the following bit-rates: 8, 16, 32, and 128 kbit/s (hts_8, hts_16, hts_32, hts_128).

In this study, compression of speech signals was done with the MPEG-1 Audio Layer 3 technique, which utilizes a psychoacoustic model for determining masked signals that are less relevant for perception and can thus be removed. The error introduced by this compression on F0 is very low and thus irrelevant from a practical point of view. The compression technique, which removes less relevant high-frequency content, may explain the large error in the high-frequency bands of the HNR values. The large error in the low-frequency LSFs is partly explained by the higher sensitivity of the lower coefficients to relative error, but the high error of H1–H2 also indicates that the low frequencies are distorted at high compression rates.

The results of the subjective evaluation of synthetic speech suggest that the effect of compression is smaller when the MP3-processed data are used in the training of HMM-based voices. This is confirmed with the bit-rate of 16 kbit/s, where the degradation caused by compression is clearly audible, but only a slight degradation can be observed with the HMM-based voice built from the same signals. This may be due to the statistical training, which averages out occasional audible artefacts but preserves the main characteristics of speech.

6. Conclusions

In this paper, the effects of using MP3-compressed, degraded speech in vocoding and HMM-based speech synthesis were studied. Speech signals were encoded with various compression rates and experiments were performed using the GlottHMM vocoder. Both objective and subjective evaluations were used to study the effect of compression on the vocoder and on HMM-based speech synthesis. Objective evaluation results showed that the vocoder parameters were degraded by encoding with bit-rates of 32 kbit/s or less, which was also confirmed by the subjective evaluation of the vocoder analysis-synthesis quality. Experiments with HMM-based speech synthesis showed that the subjective quality of a synthetic voice trained with 32 kbit/s speech was comparable to a voice trained with uncompressed speech, but lower bit rates induced clear degradation in quality.

7. Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under grant agreement n° 287678 and from the Academy of Finland (256961, 135003).


8. References

[1] King, S. and Karaiskos, V., “The Blizzard Challenge 2012”, The Blizzard Challenge 2012 workshop, 2012, http://festvox.org/blizzard

[2] Gonzalez, J. and Cervera, T., “The Effect of MPEG Audio Compression on a Multi-dimensional Set of Voice Parameters”, Log. Phon. Vocol., 26(3):124–138, 2001.

[3] van Son, R.J.J.H., “A Study of Pitch, Formant, and Spectral Estimation Errors Introduced by Three Lossy Speech Compression Algorithms”, Acta Acustica United With Acustica, 91(4):771–778, 2005.

[4] Zen, H., Tokuda, K. and Black, A.W., “Statistical parametric speech synthesis”, Speech Commun., 51(11):1039–1064, 2009.

[5] ISO, “Information Technology – Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s – Part 3: Audio”, ISO/IEC 11172-3:1993, International Organization for Standardization, 1993.

[6] Raitio, T., Suni, A., Yamagishi, J., Pulakka, H., Nurminen, J., Vainio, M. and Alku, P., “HMM-based speech synthesis utilizing glottal inverse filtering”, IEEE Trans. on Audio, Speech, and Lang. Proc., 19(1):153–165, 2011.

[7] Tzanetakis, G. and Cook, P., “Sound Analysis Using MPEG Compressed Audio”, Proc. of ICASSP, vol. 2, pp. 761–764, 2000.

[8] LAME encoder, [online] http://lame.sourceforge.net/

[9] Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A.W. and Tokuda, K., “The HMM-based speech synthesis system (HTS) version 2.0”, Sixth ISCA Workshop on Speech Synthesis, pp. 294–299, 2007.

[10] [Online] HMM-based speech synthesis system (HTS), http://hts.sp.nitech.ac.jp

[11] Suni, A., Raitio, T., Vainio, M. and Alku, P., “The GlottHMM speech synthesis entry for Blizzard Challenge 2010”, The Blizzard Challenge 2010 workshop, 2010, http://festvox.org/blizzard

[12] Suni, A., Raitio, T., Vainio, M. and Alku, P., “The GlottHMM entry for Blizzard Challenge 2011: Utilizing source unit selection in HMM-based speech synthesis for improved excitation generation”, The Blizzard Challenge 2011 workshop, 2011, http://festvox.org/blizzard

[13] Suni, A., Raitio, T., Vainio, M. and Alku, P., “The GlottHMM entry for Blizzard Challenge 2012 – Hybrid approach”, The Blizzard Challenge 2012 workshop, 2012, http://festvox.org/blizzard

[14] Kawahara, H., Masuda-Katsuse, I. and de Cheveigne, A., “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds”, Speech Commun., 27(3–4):187–207, 1999.

[15] Alku, P., “Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering”, Speech Commun., 11(2–3):109–118, 1992.

[16] Soong, F.K. and Juang, B.-H., “Line spectrum pair (LSP) and speech data compression”, Proc. ICASSP, vol. 9, 1984, pp. 37–40.

[17] Marume, M., Zen, H., Nankaku, Y., Tokuda, K. and Kitamura, T., “An investigation of spectral parameters for HMM-based speech synthesis”, Proc. Autumn Meeting of Acoust. Soc. of Japan, 2006 (in Japanese).

[18] Moore, B. and Glasberg, B., “A revision of Zwicker's loudness model”, ACTA Acustica, 82:335–345, 1996.

[19] Titze, I. and Sundberg, J., “Vocal intensity in speakers and singers”, J. Acoust. Soc. Am., 91(5):2936–2946, 1992.

[20] Alku, P., Backstrom, T. and Vilkman, E., “Normalized amplitude quotient for parametrization of the glottal flow”, J. Acoust. Soc. Am., 112(2):701–710, 2002.


Publication II


EFFECT OF MPEG AUDIO COMPRESSION ON VOCODERS USED IN STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Bajibabu Bollepalli*, Tuomo Raitio†

* Department of Speech, Music and Hearing, KTH, Stockholm, Sweden
† Department of Signal Processing and Acoustics, Aalto University, Finland

ABSTRACT

This paper investigates the effect of MPEG audio compression on HMM-based speech synthesis using two state-of-the-art vocoders. Speech signals are first encoded with various compression rates and analyzed using the GlottHMM and STRAIGHT vocoders. Objective evaluation results show that the parameters of both vocoders gradually degrade with increasing compression rates, but with a clear increase in degradation with bit-rates of 32 kbit/s or less. Experiments with HMM-based synthesis with the two vocoders show that the degradation in quality is already perceptible with bit-rates of 32 kbit/s, and both vocoders show a similar trend in degradation with respect to compression ratio. The most perceptible artefacts induced by the compression are spectral distortion and reduced bandwidth, while prosody is better preserved.

Index Terms – Statistical parametric speech synthesis, HMM, MPEG, MP3, GlottHMM, STRAIGHT

1. INTRODUCTION

Research on text-to-speech (TTS) synthesis has taken steps from read-aloud corpus based synthesis of short sentences to audio-book based synthesis of longer paragraphs [1]. Nowadays, one can find extensive amounts of speech data from, e.g., the world wide web. However, due to the limitations in storage and bandwidth, speech data is typically available in compressed forms. In addition, speech data are expressed in various forms involving also mixtures of speech, music and video. Thus, instead of using speech-specific compression methods, general audio compression methods are often used when speech data is distributed on the Internet. Depending on the optimization of the video and audio data rate, compression may introduce severe artefacts in the speech signal.

There are a few studies that have addressed the degradation of speech parameters due to compression (see e.g. [2, 3]). In [4], the authors of the current paper conducted the first study on how the compression of speech affects vocoding and statistical parametric speech synthesis. The results of the study indicated that building voices from compressed speech data was not severely affected if the compression rate was 32 kbit/s or more. In this paper, the previous study is elaborated by including two different vocoding techniques and using a more detailed subjective evaluation. Different feature extraction algorithms in the two vocoders are expected to behave differently in relation to speech compression, and the different data representation in statistical modeling and synthesis technique may also affect the quality of synthesized speech. Moreover, listening tests are conducted with only synthetic speech in order to more accurately study the effect of compression, while in the previous study, vocoded and natural speech were included in the same listening test.

The paper is structured as follows. In Section 2, speech compression using the MPEG-1 Audio Layer III (MP3) audio compression technique is described. Section 3 describes the two vocoders, GlottHMM and STRAIGHT, used in this study. In Section 4, the effect of compression at different bit-rates on the vocoder parameters is first studied using objective methods, after which the role of speech compression in the quality of HMM-based synthesis is studied using subjective listening tests. Finally, Section 5 discusses the obtained results and summarizes the findings of the paper.

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under grant agreement n° 287678 (Simple4All).

2. SPEECH COMPRESSION

The MPEG-1 Audio Layer 3 compression method [5], commonly known as MP3, was used for compressing speech in this study. MPEG (moving pictures expert group) is a standard in audio coding which enables high compression rates while preserving high quality. MP3 takes advantage of the characteristics of the human auditory mechanism to compress audio. MP3 compression is lossy; it uses psychoacoustic models to reduce the precision of components less audible to human hearing, and encodes the remaining material with high efficiency. In MPEG compression, the audio signal is first converted into spectral components using a filter bank analysis. For each spectral component, the perceptual masking effect caused by other components is first calculated. Then, each spectral component is quantized so that the low-level signals (maskee) can be coded with fewer bits than the simultaneously occurring stronger signal (masker), as long as the masker and maskee are close enough to each other in frequency and time [6], thus keeping the quantization noise below the masking threshold. With very low bit-rates, low-pass filtering is used in order to reduce the audio bandwidth and thus the required bit-rate.

Table 1. Bit-rates and corresponding theoretical and realized compression ratios with respect to 256 kbit/s 16 kHz PCM speech.

Bit-rate (kbit/s)   Compression ratio    Compression ratio
                    w.r.t. bit-rate      w.r.t. file size
160                 1.6                  1.56
128                 2                    1.92
64                  4                    3.13
32                  8                    6.25
24                  10.67                8.33
16                  16                   12.50
8                   32                   25.00

In this work, a freely available software called the LAME v3.99 [7] encoder is used to compress speech signals with standard options (fixed bit-rate encoding scheme). Table 1 shows the bit-rates along with the compression ratios used in this study. Here, compression ratios are calculated with respect to the original speech utterances recorded at a sampling rate of 16 kHz with 16-bit resolution, resulting in a data rate of 256 kbit/s with pulse code modulation (PCM) encoding.

3. VOCODERS

3.1. GlottHMM

GlottHMM [8, 9] is designed for parameter extraction and speech waveform generation for HMM-based speech synthesis. GlottHMM aims to accurately model the speech production mechanism by using glottal inverse filtering. GlottHMM has been shown to yield high-quality synthetic speech [8–12], better than or comparable to the quality of STRAIGHT [13], the most widely used vocoder in HMM-based speech synthesis.

In GlottHMM speech parametrization, iterative adaptive inverse filtering (IAIF) [14] is used to estimate the vocal tract filter and the voice source signal. Linear prediction (LP) is used for spectral estimation in the IAIF method, and the estimated vocal tract filter is converted to line spectral frequencies (LSF) [15] for better representation of the LP information in HMM training. From the estimated voice source signal, the fundamental frequency (F0) is estimated with the autocorrelation method, and the log harmonic-to-noise ratio (HNR) of five frequency bands is estimated by comparing the upper and lower spectral envelopes constructed from the harmonic peaks and the interharmonic valleys, respectively. HNR values are then averaged over five frequency bands according to the equivalent rectangular bandwidth (ERB) [16] scale. Additionally, the voice source spectrum is estimated with LP (converted to LSFs) in order to control the phonation characteristics in synthesis. The GlottHMM parameters are shown in Table 2.

Table 2. Speech features for the GlottHMM vocoder.

GlottHMM features                   Number of parameters
Vocal tract spectrum                30
Voice source spectrum               10
Harmonic-to-noise ratio (HNR)       5
Energy                              1
Fundamental frequency (F0)          1

In synthesis, a pre-stored natural glottal flow pulse is used for reconstructing the excitation signal. The pulse is first interpolated to a duration according to F0 and scaled in amplitude according to the energy parameter. In order to match the degree of voicing in the excitation, noise is added according to the HNR of the five bands in the spectral domain. In order to control the phonation type, the excitation spectrum is matched to the given voice source LP spectrum. Finally, the excitation is filtered with the vocal tract filter to synthesize speech.

3.2. STRAIGHT

STRAIGHT [13, 17] was originally proposed as a speech manipulation tool, but nowadays it is widely used for HMM synthesis [18]. STRAIGHT extracts three types of parameters: F0, spectrum, and aperiodicity parameters (AP). In STRAIGHT analysis, F0 and the voiced-unvoiced decision are first estimated using an instantaneous-frequency based algorithm and a fixed-point analysis, TEMPO [19]. In order to estimate the speech spectrum, F0-adaptive smoothing is applied to remove the effect of signal periodicity, after which filter coefficients are estimated with mel-cepstrum (MCEP) [20]. The AP for mixed excitation are based on an amplitude ratio between the lower and upper smoothed spectral envelopes [17] and averaged across 21 frequency sub-bands. The STRAIGHT parameters are shown in Table 3.

STRAIGHT synthesis uses mixed excitation [21] consisting of impulses and a noise component weighted according to the AP. The pitch-synchronous overlap add (PSOLA) [22] method is used to reconstruct the excitation signal, which excites a mel log spectrum approximation (MLSA) filter [23].

4. EXPERIMENTS

4.1. Speech material

Two databases designed for TTS development were used in the experiments. The first corpus consists of 599 sentences by a Finnish male (labeled as MV), and the second one consists of 513 sentences by a Finnish female (labeled as HK).

Table 3. Speech features for the STRAIGHT vocoder.

STRAIGHT features                   Number of parameters
Mel-cepstrum                        40
Aperiodicity parameters (AP)        21
Fundamental frequency (F0)          1


Fig. 1. Relative error of HNR and LSF (GlottHMM, left) and AP and MCEP (STRAIGHT, right) as a function of bit-rate (8, 16, 24, 32, 64, 128, and 160 kbit/s).

All audio files were PCM encoded and sampled at 16 kHz with a resolution of 16 bits, resulting in a data rate of 256 kbit/s.

4.2. Objective evaluations of vocoder parameters

The effect of compression was evaluated by comparing the vocoder parameters extracted from the MP3-processed sounds to those obtained from the uncompressed ones. For each compression rate, the relative error was determined between the parameter values computed from the uncompressed and compressed sound for both speakers. Note that the relative error depends on the scale of the original parameter values. However, relative error was used in order to have a common error measure for all the parameters of the two vocoders, and because it seems to describe the effects of compression fairly well. The following three types of parameters were analyzed: 1) F0, 2) HNR/AP, 3) LSF/MCEP.
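As a rough illustration of this comparison, a minimal Python sketch is given below; the function name, the epsilon guard, and the frame-wise averaging are assumptions made for the example, not the exact error measure implementation used in the paper.

```python
import numpy as np

def relative_error(ref, test, eps=1e-12):
    """Mean relative error between reference (uncompressed) and test
    (compressed) parameter tracks of shape (frames, dims)."""
    ref = np.asarray(ref, dtype=float)
    test = np.asarray(test, dtype=float)
    # Guard against division by very small reference values.
    return np.mean(np.abs(test - ref) / (np.abs(ref) + eps))

# Example: compare LSF matrices extracted from the PCM original and an
# MP3-coded copy (hypothetical arrays of shape (n_frames, 30)).
# err = relative_error(lsf_pcm, lsf_mp3)
# print(f"relative LSF error: {100 * err:.1f} %")
```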

Compression has only a small effect on the F0 error, stemming mainly from slight differences in the estimated F0 values and voicing decisions. The relative error of F0 is 1-5% for GlottHMM and 1-3% for STRAIGHT, but the differences between bit-rates were not statistically significant across the two voices. However, GlottHMM seems to be slightly more affected by the compression than STRAIGHT at low bit-rates. Figure 1 shows the relative error of HNR and LSF for GlottHMM and of AP and MCEP for STRAIGHT. For both vocoders, the error of HNR/AP is rather small at high bit-rates such as 160 kbit/s and 128 kbit/s, with only a small increase in error at 64 kbit/s. With bit-rates of 32 kbit/s and lower, however, the error increases substantially. The relative error of LSF for GlottHMM shows a similar effect: high bit-rates (64 kbit/s or more) show small errors, while 32 kbit/s and lower bit-rates show larger errors. Particularly the 8 kbit/s voice shows very high errors, especially in the perceptually important low and mid-frequencies. The MCEP parameters of STRAIGHT likewise show that high bit-rates (≥ 64 kbit/s) yield small errors, while stronger compression significantly affects the spectral parameters. In conclusion, although the degradations are gradual, the compressed acoustic signals are significantly different from the original signal if the bit-rate is 32 kbit/s or less, which has a clear effect on both vocoders.

Experiments using different LSF (LP analysis) and MCEP orders were also conducted. The LSF order was varied from 14 to 30 and the MCEP order from 10 to 40. The results indicate that increasing the LSF order reduced the average parameter error, whereas increasing the MCEP order had the opposite effect of increasing the error due to compression.

4.3. Evaluation of HMM synthesis quality

HMM-based synthetic voices were built with both vocoders for both speakers and five bit-rates, using the standard HTS procedure [24, 25] for training.




Fig. 2. HMM-based synthesis naturalness scores as a function of bit-rate for GlottHMM (leftmost figure) and STRAIGHT (center figure) for

the female (F) and male (M) speakers, and averaged MOS scores for both vocoders (rightmost figure).

Subjective evaluations were conducted to assess the quality of the HMM-based voices. As subjective evaluations are more laborious than objective ones, only five bit-rates were included in the tests, concentrating on the low bit-rates where the differences are expected to be perceptible [4]: 1) full PCM (256 kbit/s), 2) 32 kbit/s, 3) 24 kbit/s, 4) 16 kbit/s, and 5) 8 kbit/s. For each voice (5 bit-rates, 2 vocoders, 2 genders), 3 randomly selected sentences were used, totaling 60 sentences per test. The sentences were presented in random order to subjects (15 listeners, of which 11 were native Finnish speakers) who rated the naturalness of the signals on the mean opinion score (MOS) scale, ranging from 1 to 5 (1 = completely unnatural, 5 = completely natural).
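To make the reporting in Figure 2 concrete, the sketch below shows one common way to compute per-condition MOS means with 95% confidence intervals from listener ratings; the function name and the t-distribution interval are illustrative assumptions, not necessarily the exact analysis used in the paper.

```python
import numpy as np
from scipy import stats

def mos_summary(ratings):
    """Mean and 95% confidence interval for a list of 1-5 MOS ratings."""
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    # Half-width of a 95% CI based on the t-distribution.
    half_width = stats.t.ppf(0.975, df=len(r) - 1) * r.std(ddof=1) / np.sqrt(len(r))
    return mean, mean - half_width, mean + half_width

# Hypothetical ratings for one condition (e.g. GlottHMM, male, 32 kbit/s):
# m, lo, hi = mos_summary([3, 4, 3, 3, 4, 2, 3, 4, 3, 3])
```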

Figure 2 shows the means and 95% confidence intervals of the MOS ratings for each vocoder and speaker, and also averages across gender. For GlottHMM, the male and female voices are rated similarly in general, with a slight preference for the male voice except at the lowest bit-rate. The low quality of the male 8 kbit/s voice seems to stem from overly sharp formants in the mid-frequencies, which can be seen in Figure 3 as over-emphasized frequencies from 4 kHz to 5 kHz. The female voice, however, rather exhibits a loss of mid-frequencies at low bit-rates and is thus perceived as overly soft, but not as low in quality as the male voice. For STRAIGHT, the male and female voices are rated completely differently; the STRAIGHT male voices are comparable to the GlottHMM voices, but the female voice is rated very low. This may stem from the simpler mixed excitation used instead of the glottal-flow excitation of GlottHMM. Both the male and female low bit-rate STRAIGHT voices exhibit overly sharp formants at mid-frequencies, which can be seen in Figure 3. On average, GlottHMM is always rated better than STRAIGHT, but the degradation due to compression seems to follow the same general trend; the quality gradually decreases as a function of bit-rate, which is perceptible at bit-rates of 32 kbit/s and lower (although not statistically significant).

Figure 3 shows the long-term average spectra of natural and synthetic speech at different bit-rates, plotted separately for each vocoder and speaker. The spectra of natural compressed speech show that low-pass filtering is used in the encoding. At 32 kbit/s, spectral components are missing above 7 kHz, and for lower bit-rates the cut-off frequency is between 5 kHz and 6 kHz. This introduces clearly audible effects, and seems to affect the spectral estimation and modeling at the boundary frequencies in the synthetic voices. In addition, serious deviations in the spectrum from 1 kHz to 5 kHz can be observed at the lowest bit-rates.

The most perceptible artefacts induced by the compression were spectral distortion due to the overly sharp spectral components at 4 kHz-5 kHz, especially with the two lowest bit-rates, and the reduced bandwidth induced by the compression at 24 kbit/s or lower bit-rates. However, the prosody of all voices was rather well preserved.

5. CONCLUSIONS

In this paper, the effects of using MP3-compressed speech in HMM-based speech synthesis were studied. Speech signals were encoded at various compression rates and experiments were performed using the GlottHMM and STRAIGHT vocoders. Objective evaluations showed that the parameters of both vocoders gradually degraded with increasing compression rates, with a clear increase in degradation at bit-rates of 32 kbit/s or less. Experiments with HMM-based speech synthesis showed that the degradation of subjective quality was perceptible at bit-rates of 32 kbit/s or less, and both vocoders showed a similar trend in degradation with respect to compression ratio. The most perceptible artefacts induced by the compression were spectral distortion and reduced bandwidth, while prosody was better preserved.

6. REFERENCES

[1] S. King and V. Karaiskos, "The Blizzard Challenge 2012," in The Blizzard Challenge 2012 workshop, 2012, http://festvox.org/blizzard.

[2] J. Gonzalez and T. Cervera, "The effect of MPEG audio compression on a multi-dimensional set of voice parameters," Log. Phon. Vocol., vol. 26, no. 3, pp. 124-138, 2001.

[3] R.J.J.H. van Son, "A study of pitch, formant, and spectral estimation errors introduced by three lossy speech compression algorithms," Acta Acustica United With Acustica, vol. 91, no. 4, pp. 771-778, 2005.




Fig. 3. Long-term average spectra of compressed and PCM speech (upper graphs) and synthetic speech with GlottHMM (middle graphs) and STRAIGHT (bottom graphs) with different bit-rates for the male (left) and female (right) speakers.

[4] B. Bollepalli, T. Raitio, and P. Alku, "Effect of MPEG audio compression on HMM-based speech synthesis," in Proc. Interspeech, 2013, pp. 1062-1066.

[5] ISO, "Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s - Part 3: Audio," ISO/IEC 11172-3:1993, International Organization for Standardization, 1993.

[6] G. Tzanetakis and P. Cook, "Sound analysis using MPEG compressed audio," in Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., 2000, vol. 2, pp. 761-764.

[7] [Online], "LAME encoder," 2013, http://lame.sourceforge.net/.

[8] T. Raitio, A. Suni, J. Yamagishi, H. Pulakka, J. Nurminen, M. Vainio, and P. Alku, "HMM-based speech synthesis utilizing glottal inverse filtering," IEEE Trans. Audio Speech Lang. Proc., vol. 19, no. 1, pp. 153-165, 2011.

[9] T. Raitio, A. Suni, H. Pulakka, M. Vainio, and P. Alku, "Utilizing glottal source pulse library for generating improved excitation signal for HMM-based speech synthesis," in Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., 2011, pp. 4564-4567.

[10] A. Suni, T. Raitio, M. Vainio, and P. Alku, "The GlottHMM speech synthesis entry for Blizzard Challenge 2010," in The Blizzard Challenge 2010 workshop, 2010, http://festvox.org/blizzard.

[11] A. Suni, T. Raitio, M. Vainio, and P. Alku, "The GlottHMM entry for Blizzard Challenge 2011: Utilizing source unit selection in HMM-based speech synthesis for improved excitation generation," in The Blizzard Challenge 2011 workshop, 2011, http://festvox.org/blizzard.

[12] A. Suni, T. Raitio, M. Vainio, and P. Alku, "The GlottHMM entry for Blizzard Challenge 2012 - Hybrid approach," in The Blizzard Challenge 2012 workshop, 2012, http://festvox.org/blizzard.

[13] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Commun., vol. 27, no. 3-4, pp. 187-207, 1999.

[14] P. Alku, "Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering," Speech Commun., vol. 11, no. 2-3, pp. 109-118, 1992.

[15] F. K. Soong and B.-H. Juang, "Line spectrum pair (LSP) and speech data compression," in Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., 1984, vol. 9, pp. 37-40.

[16] B. C. J. Moore and B. R. Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acoust. Soc. Am., vol. 74, pp. 750-753, 1983.

[17] H. Kawahara, Jo Estill, and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," in 2nd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA), 2001.

[18] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Commun., vol. 51, no. 11, pp. 1039-1064, 2009.

[19] H. Kawahara, H. Katayose, A. de Cheveigne, and R. Patterson, "Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity," in Proc. Eurospeech, 1999, pp. 2781-2784.

[20] S. Imai, "Cepstral analysis synthesis on the mel frequency scale," in Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., 1983, vol. 8, pp. 93-96.

[21] T. Yoshimura, K. Tokuda, T. Masuko, and T. Kitamura, "Mixed-excitation for HMM-based speech synthesis," in Proc. Eurospeech, 2001, pp. 2259-2262.

[22] E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Commun., vol. 9, no. 5-6, pp. 453-467, 1990.

[23] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," in Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., 1992, vol. 1, pp. 137-140.

[24] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. Black, and K. Tokuda, "The HMM-based speech synthesis system (HTS) version 2.0," in Sixth ISCA Workshop on Speech Synthesis, 2007, pp. 294-299.

[25] [Online], "HMM-based speech synthesis system (HTS)," 2013, http://hts.sp.nitech.ac.jp.


Publication III


Non-linear Pitch Modification

in Voice Conversion Using Artificial Neural Networks

Bajibabu Bollepalli, Jonas Beskow, and Joakim Gustafson

Department of Speech, Music and Hearing, KTH, Sweden

Abstract. The majority of current voice conversion methods do not focus on modelling the local variations of the pitch contour, but only on linear modification of the pitch values based on means and standard deviations. However, a significant amount of speaker-related information is also present in the pitch contour. In this paper we propose a non-linear pitch modification method for mapping the pitch contours of the source speaker according to the target speaker's pitch contours. This work is done within the framework of Artificial Neural Network (ANN) based voice conversion. The pitch contours are represented with Discrete Cosine Transform (DCT) coefficients at the segmental level. The results, evaluated using subjective and objective measures, confirm that the proposed method performed better in mimicking the target speaker's speaking style when compared to the linear modification method.

1 Introduction

The aim of a voice conversion system is to transform the utterance of an arbitrary speaker, referred to as the source speaker, to sound as if spoken by a specific speaker, referred to as the target speaker. Listeners then perceive the source speaker's speech as if uttered by the target speaker. Voice conversion can also be referred to as voice transformation or voice morphing. For the past two decades voice conversion has been an active research topic in the area of speech synthesis [1], [2], [3], [4]. Applications like text-to-speech (TTS), speech-to-speech translation, mimicry generation and human-machine interaction systems greatly benefit from having a voice conversion module.

In the literature, the majority of voice conversion techniques have focused mainly on the modification of short-term spectral features [5], [6]. However, prosodic features, such as the pitch contour and speaking rhythm, also contain important cues of speaker identity. In [8] it was shown that pure prosody alone can be used, to an extent, to recognize speakers that are familiar to us. To build a good quality voice conversion system, the prosodic features need to be modified along with the spectral features. The pitch contour is one of the most important prosodic features related to speaker identity.

The most common method for pitch contour transformation is:

$$\log(f_0^t) = \frac{\log(f_0^s) - \mu^s_{\log f_0}}{\sigma^s_{\log f_0}} \, \sigma^t_{\log f_0} + \mu^t_{\log f_0} \qquad (1)$$

T. Drugman and T. Dutoit (Eds.): NOLISP 2013, LNAI 7911, pp. 97-103, 2013. © Springer-Verlag Berlin Heidelberg 2013


where $f_0^s$ and $f_0^t$ represent the pitch values at frame level, and $\mu^s_{\log f_0}$, $\sigma^s_{\log f_0}$, $\mu^t_{\log f_0}$, and $\sigma^t_{\log f_0}$ represent the mean and standard deviation of the pitch values in the log domain for the source and target speakers, respectively. In this paper, we refer to this method as linear transformation. The local shapes of the pitch contour segments are not modelled and transformed in the linear transformation method. To capture the local dynamics of the pitch contour, we propose a non-linear transformation method using artificial neural networks (ANNs). The pitch contours over the voiced segments are represented by their discrete cosine transform (DCT) coefficients.
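For reference, a minimal Python sketch of the baseline linear transformation of Equation (1) is given below; the function name and the handling of unvoiced frames (F0 = 0) are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def linear_f0_transform(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Map source log-F0 statistics to target statistics (Equation 1).
    Unvoiced frames (F0 == 0) are left untouched."""
    f0_out = np.array(f0_src, dtype=float)
    voiced = f0_out > 0
    log_f0 = np.log(f0_out[voiced])
    log_f0 = (log_f0 - mu_src) / sigma_src * sigma_tgt + mu_tgt
    f0_out[voiced] = np.exp(log_f0)
    return f0_out
```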

There are some studies which have used the DCT for parametric representation of the pitch contour and its modelling [9], [10], [11]. In [9], it is shown that the use of DCT for analysis and synthesis of pitch contours is beneficial. In [10], DCT is used to model the pitch contours of syllables for the conversion of neutral speech into expressive speech using Gaussian mixture models (GMMs). In [11], a DCT representation is used for modelling and transformation of prosodic information in a voice conversion system using a code book generated by classification and regression tree (CART) methods. The work presented in this paper differs from [11] in the following aspects:

1. The proposed method does not use any linguistic information for pitch contour modification.

2. The proposed method uses ANNs to model the non-linear mapping between the pitch contours of the source and target speakers.

3. The proposed method represents the pitch contour of a voiced segment using two sets of parameters: one set represents the statistics, and the other set represents the fine variations of a pitch contour.

This paper is organised as follows: Section 2 describes the database, feature extraction and parametrization of the pitch contour. Section 3 outlines the ANN-based voice conversion system. The experimental results obtained using both subjective and objective tests are presented in Section 4. Section 5 gives a summary of the work.

2 Database and Feature Extraction

The experiments are carried out on the CMU ARCTIC database consisting of utterances recorded by seven speakers. Each speaker has recorded the same set of 1132 phonetically balanced utterances. The ARCTIC database contains the utterances of SLT (US Female), CLB (US Female), BDL (US Male), RMS (US Male), JMK (Canadian Male), AWB (Scottish Male), and KSP (Indian Male).

To extract the features from a given speech signal we used a high quality analysis tool, the STRAIGHT vocoder [12]. The features were extracted for every 5 ms of speech. The features are: 1) mel-cepstral coefficients (MCEPs), 2) band aperiodicity coefficients (BAPs) and 3) fundamental frequency (pitch contour). All three features were used for voice conversion. Section 2.1 explains the parametrization of the pitch contour.


2.1 Parametrization of Pitch Contour

The proposed pitch contour model is defined on a voiced segment basis. For voiced speech, the pitch contour varies slowly and continuously over time. It is therefore well modelled by the DCT, an orthogonal transform. One advantage of the DCT representation is that the mean square error between two linearly time-aligned pitch contours can be simply estimated from the mean square error between coefficients. The following steps explain the parametrization of a pitch contour:

1. Derive the pitch contours from the utterances spoken by the source speaker.

2. Segment the pitch contour with respect to the voiced segments present in the utterance.

3. Consider a voiced segment only if its duration is ≥ 50 ms. If the duration is less than 50 ms, use the linear transformation to transform the pitch values.

4. Map the pitch contour of each voiced segment onto the equivalent rectangular bandwidth (ERB) scale [7] using Equation 2:

$$F_{0\,\mathrm{ERB}} = \log_{10}(0.00437 \cdot F_0 + 1) \qquad (2)$$

5. Compute the DCT coefficients for each voiced segment using Equation 3:

$$c_n = \sum_{i=0}^{M-1} F_0(i)\,\cos\!\left(\frac{\pi}{M}\,n\left(i+\tfrac{1}{2}\right)\right) \qquad (3)$$

where a pitch contour $F_0$ of length $M$ is decomposed into $N$ DCT coefficients $[c_0, c_1, c_2, c_3, \ldots, c_{N-1}]$. The first coefficient represents the mean value and the remaining DCT coefficients represent the variations in the pitch contour, such as those due to syllable stress.

6. Each segment is represented by two sets of parameters:

$$F_{0\,\mathrm{shape}} = [c_1, c_2, c_3, \ldots, c_{N-1}] \quad \text{and} \quad F_{0\,\mathrm{limits}} = [c_0, \mathrm{var}_{F_0}, \max_{F_0}, \min_{F_0}, \log(\mathrm{dur})] \qquad (4)$$

where F0shape and F0limits represent the local variations and the constraints of a pitch contour, respectively. $[c_0, c_1, c_2, c_3, \ldots, c_{N-1}]$ are the DCT coefficients, and $\mathrm{var}_{F_0}$, $\max_{F_0}$, $\min_{F_0}$, and $\log(\mathrm{dur})$ are the variance, maximum value, minimum value, and logarithm of the duration of a pitch contour, respectively.
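As an illustration of steps 4-6, here is a minimal Python sketch that turns one voiced F0 segment into the F0shape and F0limits sets; the helper name, the 5 ms frame shift, the fixed number of coefficients, and the choice to compute the statistics on the ERB-scaled contour are assumptions made for the example, not necessarily the exact configuration of the original system.

```python
import numpy as np

def parametrize_segment(f0_segment, frame_shift=0.005, n_coef=10):
    """Parametrize one voiced F0 segment (in Hz) into F0shape and F0limits."""
    f0 = np.asarray(f0_segment, dtype=float)
    # Step 4: map to the ERB scale (Equation 2).
    f0_erb = np.log10(0.00437 * f0 + 1.0)
    # Step 5: DCT coefficients c_0 ... c_{N-1} (Equation 3).
    m = len(f0_erb)
    i = np.arange(m)
    c = np.array([np.sum(f0_erb * np.cos(np.pi / m * n * (i + 0.5)))
                  for n in range(n_coef)])
    # Step 6: split into shape and limits parameters (Equation 4).
    f0_shape = c[1:]
    f0_limits = np.array([c[0], f0_erb.var(), f0_erb.max(), f0_erb.min(),
                          np.log(m * frame_shift)])
    return f0_shape, f0_limits
```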

3 Voice Conversion Using ANNs

Figure 1 shows the block diagram of both the training and transformation processes in a voice conversion system. In this work, we used parallel utterances to build a mapping function between the source and target speakers. Even though both speakers speak the same utterances, they still differ in their durations.


Fig. 1. A block diagram of the voice conversion system

To align the feature vectors of the source speaker with respect to the target speaker, we use the dynamic time warping (DTW) method. It enables us to build a mapping function at the frame level.
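Below is a compact, generic Python sketch of this alignment step: a textbook DTW over Euclidean frame distances that returns the frame pairing. The step pattern and the absence of path constraints are assumptions for the example, not the specific DTW configuration used in this work.

```python
import numpy as np

def dtw_align(src, tgt):
    """Align src (m, d) to tgt (n, d) frames; return a list of (i, j) index pairs."""
    m, n = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    # Backtrack from (m, n) to recover the warping path.
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```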

For mapping the acoustic features between the source and target speakers, various models have been explored in the literature. These models are specific to the kind of features used for mapping. For instance, GMMs [3], vector quantization (VQ) [1] and ANNs [4] are widely used for mapping the vocal tract characteristics. The changes in the vocal tract shape for different speakers are highly non-linear; therefore, to model these non-linearities, it is required to capture the non-linear relations present in the patterns. Hence, to capture the non-linear relations between acoustic features, we use a neural network based model (multi-layer feed-forward neural networks) for mapping the MCEPs, BAPs and pitch contour coefficients.

During the training process, the acoustic features of the source and target speakers are given as input-output pairs to the network. The network learns from these two data sets and tries to capture a non-linear mapping function based on the minimum mean square error. Generalized back-propagation learning [13] is used to adjust the weights of the neural network so as to minimize the mean squared error between the desired and the actual output values. The selection of initial weights, the architecture of the ANNs, the learning rate, momentum and the number of iterations are some of the optimization parameters in training. Once the training is complete, we get a weight matrix that represents the mapping function between the acoustic features of the given source and target speakers. Such a weight matrix can be used to predict the acoustic features of the target speaker from the acoustic features of the source speaker.

Different network structures are possible by varying the number of hidden layers and the number of nodes in each hidden layer. In [14] it is shown that a four-layer network is optimal for mapping the vocal tract characteristics of the source speaker to the target speaker. Therefore, we consider four-layer networks.



Fig. 2. Conversion of pitch contour from source speaker to target speaker. (a) original source speaker pitch contour, (b) linear modification of source speaker pitch contour, (c) non-linear modification of source speaker pitch contour and (d) original target speaker pitch contour.

The four-layer networks have architectures 40L-80N-80N-40L, 21L-42N-42N-21L, 9L-18N-18N-9L and 5L-10N-10N-5L for mapping the MCEP, BAP, F0shape and F0limits features, respectively. The first and fourth layers are input-output layers with linear units (L) and have the same dimension as the input-output acoustic features. The second layer (first hidden layer) and third layer (second hidden layer) have non-linear nodes (N), which help in capturing the non-linear relationship that may exist between the input-output features.
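A minimal sketch of one such four-layer mapping network (the 9L-18N-18N-9L configuration used here for the F0shape features) is given below, assuming PyTorch, a tanh non-linearity for the hidden nodes, and a plain SGD step with momentum; the paper does not fix these details, so they are illustrative choices rather than the original implementation.

```python
import torch
import torch.nn as nn

# 9 linear input units -> 18 and 18 non-linear hidden units -> 9 linear outputs,
# trained to map source-speaker F0shape vectors to target-speaker F0shape vectors.
f0shape_net = nn.Sequential(
    nn.Linear(9, 18), nn.Tanh(),
    nn.Linear(18, 18), nn.Tanh(),
    nn.Linear(18, 9),
)

def train_step(model, optimizer, src_batch, tgt_batch):
    """One back-propagation step minimizing the mean squared error."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(src_batch), tgt_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.SGD(f0shape_net.parameters(), lr=0.01, momentum=0.9)
```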

4 Experiments and Results

As described in Section 2, we picked one male speaker (RMS) and one female speaker (SLT) from the ARCTIC database for our experiments. For each speaker, we considered 80 parallel utterances for training and a separate set of 32 utterances for testing. We extracted acoustic features, MCEPs of dimension 40, BAPs of dimension 21, and 10 DCT coefficients, for every 5 ms of speech. Given these features for training, they are aligned using dynamic time warping to obtain paired feature vectors as explained in Section 3. We build a separate mapping function for the spectral, band aperiodicity and pitch contour transformations. After the mapping functions are trained, we use the test sentences of the source speaker to predict the acoustic features of the target speaker. The pitch contour is constructed back by applying the IDCT to the predicted features. An instance of a converted


pitch contour from the source speaker (RMS) to the target speaker (SLT) is illustrated in Figure 2. From Figure 2(b), we can observe that the linear modification of the pitch contour is not able to model the local variations of the target speaker, whereas in Figure 2(c) the non-linear method is able to model these local variations. Please note that here we have used the same durations as the source speaker.

Table 1. RMSE (in Hz) between target and converted contours with linear and non-linear transformation methods

Speaker pair Linear modification Non-linear modification

RMS-to-SLT 18.28 14.36

SLT-to-RMS 15.92 12.50

In order to evaluate the performance of the proposed method, we estimate the root mean square error (RMSE) between the target and converted pitch contours of the test set. The RMSE is calculated after the durations of the predicted contours are normalized with respect to the actual contours of the target speaker. It can be seen from Table 1 that the non-linear transformation method performed better than the linear method.

Table 2. Speaker similarity score

Speaker pair Linear modification Non-linear modification

RMS-to-SLT 3 3.3

SLT-to-RMS 2.55 3.1

An informal perceptual test was also conducted with 10 transformed speech signals randomly chosen for both conversion pairs and presented to 10 listeners. We used the STRAIGHT vocoder to synthesize the transformed speech signals. The subjects were asked to compare the similarity of the transformed speech signals with respect to the original target speaker's speech signals. The ratings were given on a scale of 1-5, with 5 for an excellent match and 1 for no match at all. The scores are shown in Table 2. It can be observed from Table 2 that the non-linear modification performs better than the linear modification in the perceptual tests as well.

5 Conclusion

A non-linear pitch modification method was proposed for mapping the pitch contours of the source speaker according to the target speaker's pitch contours. In this method, the pitch contour was compressed into a few coefficients using the DCT. A four-layer ANN model was used for modelling the non-linear patterns of the pitch contour between the source and target speakers. The results showed that both the objective and subjective scores gave a very clear preference to the proposed method in mimicking the target speaker's speaking style when compared to the linear modification method.


References

1. Abe, M., Nakamura, S., Shikano, K., Kuwabara, H.: Voice conversion through vector quantization. In: Proc. of ICASSP, New York, USA, pp. 655-658 (April 1988)

2. Stylianou, Y., Cappe, O., Moulines, E.: Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing 6(2), 131-142 (1998)

3. Ohtani, Y., Toda, T., Saruwatari, H., Shikano, K.: Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation. In: Proc. of INTERSPEECH, Pittsburgh, USA, pp. 2266-2269 (September 2006)

4. Bollepalli, B., Black, A.W., Prahallad, K.: Modeling a noisy-channel for voice conversion using articulatory features. In: Proc. of INTERSPEECH, Portland, USA (August 2012)

5. Dutoit, T., Holzapfel, A., Jottrand, M., Moinet, A., Perez, J., Stylianou, Y.: Towards a voice conversion system based on frame selection. In: Proc. of ICASSP, pp. 513-516 (2007)

6. Stylianou, Y.: Voice transformation: A survey. In: Proc. of ICASSP, pp. 3585-3588 (2009)

7. Smith, J.O., Abel, J.S.: Bark and ERB bilinear transforms. IEEE Transactions on Speech and Audio Processing 7(6), 697-708 (1999)

8. Helander, E., Nurminen, J.: On the importance of pure prosody in the perception of speaker identity. In: Proc. of INTERSPEECH, pp. 2665-2668 (2007)

9. Teutenberg, J., Watson, C., Riddle, P.: Modeling and synthesizing F0 contours with the discrete cosine transform. In: Proc. of ICASSP, pp. 3973-3976 (2008)

10. Veaux, C., Rodet, X.: Intonation conversion from neutral to expressive speech. In: Proc. of INTERSPEECH, pp. 2765-2768 (2011)

11. Helander, E., Nurminen, J.: A novel method for prosody prediction in voice conversion. In: Proc. of ICASSP, pp. IV-509-IV-512 (2007)

12. Kawahara, H., Masuda-Katsuse, I., de Cheveigne, A.: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication 27, 187-207 (1999)

13. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice-Hall Inc., NJ (1999)

14. Desai, S., Black, A.W., Yegnanarayana, B., Prahallad, K.: Spectral mapping using artificial neural networks for voice conversion. IEEE Trans. Audio, Speech and Language Processing 18(5), 954-964 (2010)


Publication IV


A COMPARATIVE EVALUATION OF VOCODING TECHNIQUES FOR HMM-BASED LAUGHTER SYNTHESIS

Bajibabu Bollepalli1, Jerome Urbain2, Tuomo Raitio3, Joakim Gustafson1, Huseyin Cakmak2

1 Department of Speech, Music and Hearing, KTH, Stockholm, Sweden
2 TCTS Lab, University of Mons, Belgium
3 Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland

ABSTRACT

This paper presents an experimental comparison of various leading vocoders for the application of HMM-based laughter synthesis. Four vocoders, commonly used in HMM-based speech synthesis, are used in copy-synthesis and HMM-based synthesis of both male and female laughter. Subjective evaluations are conducted to assess the performance of the vocoders. The results show that all vocoders perform relatively well in copy-synthesis. In HMM-based laughter synthesis using original phonetic transcriptions, all synthesized laughter voices were significantly lower in quality than copy-synthesis, indicating a challenging task and room for improvements. Interestingly, two vocoders using rather simple and robust excitation modeling performed the best, indicating that robustness in speech parameter extraction and simple parameter representation in statistical modeling are key factors in successful laughter synthesis.

Index Terms— Laughter synthesis, vocoder, mel-cepstrum, STRAIGHT, DSM, GlottHMM, HTS, HMM

1. INTRODUCTION

Text-to-speech (TTS) synthesis systems have already reached a high degree of intelligibility and naturalness, and they can be readily used for reading aloud a given text. However, applications such as human-machine interaction and speech-to-speech translation require that the synthetic speech includes more expressiveness and conversational characteristics. To bring expressiveness into speech synthesis systems, it is not sufficient to concentrate on improving the verbal signals alone, since non-verbal signals also play an important role in expressing emotions and moods in human communication [1].

Laughter is one such non-verbal signal playing a key role in our daily conversations. It conveys information about emotions and fulfills important social functions, such as back-channeling. Integrating laughter into a speech synthesis system can bring the synthesis closer to natural human conversation [2]. Hence, research on the analysis, detection, and synthesis of laughter signals has seen a significant increase in the last decade. In this paper, we focus on acoustic laughter synthesis, and explore the role of vocoder techniques in statistical parametric laughter synthesis.

The paper is organized as follows. Section 2 gives the background of work done in laughter processing and laughter synthesis in particular.

The research leading to these results has received funding from the Swedish research council project InkSynt (VR #2013-4935) and the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements n◦ 270780 (ILHAIRE) and n◦ 287678 (Simple4All). H. Cakmak receives a Ph.D. grant from the Fonds de la Recherche pour l'Industrie et l'Agriculture (F.R.I.A.), Belgium.

Section 3 describes the different vocoders compared in this work. Section 4 focuses on the perceptual evaluation experiment carried out to compare the vocoders in their capabilities to produce natural laughter. The results of these experiments are discussed in Section 5. Finally, Section 6 presents the conclusions of this work.

2. BACKGROUND

In the last decade, a considerable amount of research has been done on the analysis and detection of laughter (see e.g. [3]), whereas only a few studies have been conducted on synthesis. The characteristics of laughter and speech are slightly different. Formant frequencies in laughter have been reported to correspond to those of central vowels in speech, but acoustic features like the fundamental frequency (F0) have been shown to have higher variability in laughter than in speech [4]. Importantly, the proportion of fricatives in laughter has been reported to be about 40-50% [5], which is much higher than in speech. Despite the differences, the same speech processing algorithms have been applied for laughter analysis as for speech analysis.

As the acoustic behavior of laughter is different from speech, it is relatively easy to discriminate laughter from speech. Classification usually depends upon various machine learning methods, such as Gaussian mixture models (GMMs), support vector machines (SVMs), multi-layer perceptrons (MLPs), or hidden Markov models (HMMs), which all use traditional acoustic features (MFCCs, PLP, F0, energy, etc.). Equal error rates (EER) vary between 2% and 15% depending on the data and classification method used [6, 7, 8].

On the other hand, acoustic laughter synthesis is an almost unexplored domain. In [9], Sundaram and Narayanan modeled the temporal behaviour of laughter using the principle of damped simple harmonic motion of a mass-spring model. Laughs synthesized with this method were perceived as non-natural by naive listeners (average naturalness score of 1.71 on a 5-point Likert scale [10], ranging from 1 (very poor) to 5 (excellent)). Lasarcyk and Trouvain [11] compared two laughter synthesis approaches: articulatory synthesis resulting from a 3D modeling of the vocal organs, and diphone concatenation (obtained from a speech database). The 3D modeling led to the best results, but the laughs could still not compete with natural human laughs in terms of naturalness. Recently two other methods have been proposed. Sathya et al. [12] synthesized voiced laughter bouts by controlling several excitation parameters of laughter vowels: pitch period, strength of excitation, amount of frication, number of laughter syllables, intensity ratio between the first and the last syllables, and duration of fricative and vowel in each syllable. The synthesized laughs reached relatively high scores in perceived quality and acceptability, with values around 3 on a scale ranging from 1 to 5. However, it must be noted that no human laugh was



included in the evaluation, which might have had a positive influence on the scores obtained by the synthesized laughs (as there is no "perfect" reference to compare with in the evaluation). Also, the method only enables the synthesis of voiced bouts (there is no control over unvoiced laughter parts). Finally, Urbain et al. [13] used HMMs to synthesize laughs from phonetic transcriptions, similar to the traditional methods used in statistical parametric speech synthesis. Models were trained using the HMM-based speech synthesis system (HTS) [14] on a range of phonetic clusters encountered in 64 laughs from one person. Subjective evaluation resulted in an average naturalness score of 2.6 out of 5 for the synthesized laughs.

From this brief review of the literature, it is clear that research on HMM-based laughter synthesis is scarce: there exists only one study on HMM-based laughter synthesis, using a single vocoder. In this work, we report on the role of four state-of-the-art vocoders commonly used in statistical parametric speech synthesis for the application of HMM-based laughter synthesis.

3. VOCODERS

The following vocoders were chosen for comparison: 1) impulse train excited mel-cepstrum based vocoder, 2) STRAIGHT [15, 16] using mixed excitation, 3) deterministic plus stochastic model (DSM) [17], and 4) GlottHMM vocoder [18]. All the vocoders use the source-filter principle for synthesis, and thus there are two components that mostly differ among the systems: the type of spectral envelope extraction and representation, and the method for modeling and generating the excitation signal. The vocoders are depicted in Table 1 and described in more detail in the following sections.

3.1. Impulse train excited mel-cepstral vocoder

The impulse train excited mel-cepstrum based vocoder (denoted in this work as MCEP) describes speech with only two acoustic features: F0 and the speech spectrum. The speech spectrum is estimated using the algorithm described in [19]. Mel-cepstral coefficients are commonly used as the spectral representation of speech as they provide a good approximation of the perceptually relevant speech spectrum. By changing the values of α (frequency warping) and γ (a factor defining the generalization between LP and cepstrum), various types of coefficients for spectral representation can be obtained [19]. Here, we use α = 0.42 and γ = 0, which correspond to simple mel-cepstral coefficients. Both F0 and the mel-cepstrum are estimated using the pitch function in the speech signal processing toolkit (SPTK) [20], which uses the RAPT method [21]. Speech is synthesized by exciting the mel-generalized log spectral approximation (MGLSA) filter [22] with either a simple impulse train for voiced speech or white noise for unvoiced speech. This simple excitation method has the effect that the synthesized signal often sounds buzzy.
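To make the excitation scheme concrete, the sketch below generates the simple impulse-train/noise excitation from a frame-level F0 track; the function name, the 5 ms frame shift and the unit-energy pulse scaling are assumptions made for illustration, and the subsequent MGLSA filtering step is omitted.

```python
import numpy as np

def simple_excitation(f0_track, fs=16000, frame_shift=0.005):
    """Impulse train for voiced frames (F0 > 0), white noise for unvoiced frames."""
    hop = int(fs * frame_shift)
    excitation = np.zeros(len(f0_track) * hop)
    next_pulse = 0.0
    for i, f0 in enumerate(f0_track):
        start = i * hop
        if f0 > 0:
            period = fs / f0
            while next_pulse < start + hop:
                if next_pulse >= start:
                    excitation[int(next_pulse)] = np.sqrt(period)  # roughly unit-energy pulse
                next_pulse += period
        else:
            excitation[start:start + hop] = np.random.randn(hop)
            next_pulse = start + hop
    return excitation
```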

System      Parameters                                      Excitation
MCEP        mcep: 35 + F0: 1                                Impulse + noise
STRAIGHT    mcep: 35 + F0: 1 + band aperiodicity: 21        Mixed excitation + noise
DSM         mcep: 35 + F0: 1                                DSM + noise
GlottHMM    F0: 1 + Energy: 1 + HNR: 5 +                    Stored glottal flow pulse + noise
            source LSF: 10 + vocal tract LSF: 30

Table 1. Vocoders in the test and their parameters and excitation type.

3.2. STRAIGHT

STRAIGHT [15, 16] was proposed mainly for the high quality analysis, synthesis, and modification of speech signals. However, more often STRAIGHT is used as a reference for comparison between different vocoders in HMM-based speech synthesis, since it is the most widely used vocoder, is robust, and can produce synthetic speech of good quality [23]. STRAIGHT decomposes the speech signal into three components: 1) spectral features extracted using pitch-adaptive spectral smoothing and represented as mel-cepstrum, 2) band-aperiodicity features, which represent the ratios between periodic and aperiodic components in 21 sub-bands, and 3) F0 extracted using instantaneous-frequency-based pitch estimation. In synthesis, STRAIGHT uses mixed excitation [24] in which impulse and noise excitations are mixed according to the band-aperiodicity parameters in voiced speech. The excitation of unvoiced speech is white Gaussian noise. Overlap-add is used to construct the excitation, which is then used to excite a mel log spectrum approximation (MLSA) filter [25] corresponding to the STRAIGHT mel-cepstral coefficients.

3.3. Deterministic plus stochastic model (DSM)

The deterministic plus stochastic model (DSM) of the residual signal [26] first estimates the speech spectrum, and uses the inverse of the filter to reveal the speech residual. Glottal closure instant (GCI) detection is used to extract individual GCI-centered residual waveforms, which are further resampled to a fixed duration. The residual waveforms are then decomposed into deterministic and stochastic parts in the frequency domain, separated by the maximum voiced frequency Fm fixed at 4 kHz. The deterministic part is computed as the first principal component of a codebook of residual frames centered on glottal closure instants and having a duration of two pitch periods. The stochastic part consists of white Gaussian noise filtered with the linear prediction (LP) model of the average high-pass filtered residual signal, and time-modulated according to the average Hilbert envelope of the stochastic part of the residual. White Gaussian noise is used as the excitation for unvoiced speech. The DSM excitation is then passed through the MGLSA filter. The DSM vocoder has been shown to reduce buzziness and to achieve synthesis quality comparable to that of STRAIGHT [26]. The DSM vocoder was also used in the previous HMM-based laughter synthesis work [13]. In this paper, STRAIGHT is used to extract F0 and the mel-cepstrum for the DSM analysis, but the extraction of voice source features and the synthesis are performed using the DSM vocoder.

3.4. GlottHMM

The GlottHMM vocoder uses glottal inverse filtering (GIF) in order to separate the speech signal into the vocal tract filter contribution and the voice source signal. Iterative adaptive inverse filtering (IAIF) [27] is used for the GIF, inside which LP is used for the estimation of the spectrum. IAIF is based on repetitively estimating and canceling the vocal tract filter and voice source spectral contributions from the speech signal. The outputs of the IAIF are the LP coefficients, which are converted to line spectral frequencies (LSFs) [28] in order to achieve a better parameter representation for the statistical modeling, and the voice source signal, which is further parameterized into various features. First, pitch is estimated from the voice source signal using the autocorrelation method. The harmonic-to-noise ratio (HNR) of five frequency bands is estimated by comparing the upper and lower smoothed spectral envelopes constructed from the harmonic peaks and the interharmonic valleys, respectively. In addition, the voice source spectrum is estimated with LP and converted to LSFs.



In synthesis, a pre-stored natural glottal flow pulse is used for creating the excitation. First, the pulse is interpolated to achieve a desired duration according to F0, scaled in energy, and mixed with noise according to the HNR measures. The spectrum of the excitation is then matched to the voice source LP spectrum, after which the excitation is fed to the vocal tract filter to create speech.

4. EVALUATION

A subjective evaluation was carried out to compare the performance of the four vocoders in synthesizing natural laughs. For each vocoder, two types of samples were used: a) copy-synthesis, which consists of extracting the parameters from a laugh signal and re-synthesizing the same laugh from the extracted parameters; b) HMM-based synthesis, where an HMM-based system is trained from a laughter database and laughs are then synthesized using the models and the original phonetic transcriptions of a laugh. Copy-synthesis can be seen as the theoretically best synthesis that can be obtained with a particular vocoder, while HMM-based synthesis shows the current performance that can be achieved when synthesizing new laughs. Human laughs were also included in the evaluation for reference.

Our initial hypotheses were the following:

• H1: Human laughs are more natural than copy-synthesis and HMM laughs.

• H2: Copy-synthesis laughs are more natural than HMM laughs, as they omit the modeling stage.

• H3: All vocoders are equivalent for laughter synthesis.

The third hypothesis concerns the comparison of the vocoders among themselves, which is the main objective of this work. The way this hypothesis is formulated illustrates the fact that we do not have a priori expectations that one vocoder would be better suited for laughter than the other vocoders.

4.1. Data

For the purpose of this work, two voices from the AVLaughterCycle database [29] were selected: a female voice (subject 5, 54 laughs) and a male voice (subject 6, the same voice as in previous work [13], 64 laughs). As in [13], phonetic clusters were formed by grouping acoustically close phones found in the narrow phonetic annotations of the laughs [30]. This resulted in 10 phonetic clusters used for synthesis: 3 for consonants (nasals, fricatives and plosives), 4 for vowels (@, a, I and o), and 3 additional clusters formed by typical laughter sounds: grunts, cackles, and nareal fricatives (noisy airflow expelled through the nostrils). Inhalation and exhalation phones are distinguished and form separate clusters. Hence there are 20 clusters in total when considering both inhalation and exhalation clusters. For each voice, the phonetic clusters that did not have at least 11 occurrences were assigned to a garbage class.

For each voice and each of the considered vocoders and extracted parameters (see Table 1), HMM-based systems were trained with the standard HTS procedure [14, 31] using all the available laughs. For the test, five laughs lasting at least 3.5 seconds were randomly selected for each voice. For each vocoder, these laughs were synthesized from their phonetic transcriptions (HMM synthesis) as well as re-synthesized directly from their extracted parameters (copy-synthesis). The 5 original laughs were also included in the evaluation. This makes a total of 5 (original laughs) + 5 × 2 (HMM and copy-synthesis) × 4 (number of vocoders) = 45 laughs in the evaluation set for each voice.

4.2. Evaluation setup

A subjective evaluation was carried out using a web-based listening test, where listeners were asked to rate the quality of synthesized laughter signals on a 5-point Likert scale [10]. Participants were advised to use headphones, and were then presented one laugh at a time. Participants could listen to the laugh as many times as they wanted and were asked to rate its naturalness on a 5-point Likert scale where only the highest (completely natural) and lowest (completely unnatural) options were labeled. The 45 laughter signals were presented in random order. 18 participants evaluated the male voice while 15 evaluated the female one. All listeners were between 25-35 years of age, and some of them were speech experts.

5. RESULTS

Figure 1 shows the means and 95% confidence intervals of the naturalness ratings for copy-synthesis (left) and HMM synthesis (right) of the male (upper) and female (lower) voices. The pairwise p-values (using the Bonferroni correction) between vocoders are shown in Table 2 for copy-synthesis and in Table 3 for HMM synthesis.
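As a rough illustration of such an analysis, the sketch below runs pairwise significance tests between systems with a Bonferroni correction; the choice of a paired Wilcoxon test and the data layout are assumptions made for the example, not necessarily the exact statistical procedure of the paper.

```python
from itertools import combinations
import numpy as np
from scipy import stats

def pairwise_bonferroni(ratings_by_system):
    """ratings_by_system: dict mapping system name -> array of naturalness
    ratings, with the same listener/stimulus order for every system."""
    pairs = list(combinations(sorted(ratings_by_system), 2))
    results = {}
    for a, b in pairs:
        _, p = stats.wilcoxon(ratings_by_system[a], ratings_by_system[b])
        # Bonferroni: multiply by the number of comparisons, cap at 1.
        results[(a, b)] = min(p * len(pairs), 1.0)
    return results

# Example with hypothetical rating arrays r1..r4:
# p_values = pairwise_bonferroni({"MCEP": r1, "STR": r2, "GlottHMM": r3, "DSM": r4})
```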

As expected (H1), original human laughs were perceived as more natural than all other laughs (copy-synthesis and HMM). In addition, H2 was also confirmed: for each vocoder, the naturalness achieved with copy-synthesis was significantly higher than with HMM synthesis. The most interesting is the comparison between the vocoders (H3). In copy-synthesis, GlottHMM was rated as less natural than all other vocoders (for both female and male voices), MCEP and DSM obtained similar naturalness scores, while STRAIGHT was slightly preferred for female laughs (but not for male laughs). This may indicate that STRAIGHT is potentially the most suitable vocoder for laughter synthesis with the female voice, while MCEP, DSM, and STRAIGHT are equally good for the male voice. This trend is generally confirmed when looking at HMM-based laughter synthesis (right plots), where it appears that MCEP obtained the best results for the female voice, followed by DSM, STRAIGHT and finally GlottHMM. For the male laughs, DSM achieved the best results, slightly over STRAIGHT, and finally MCEP and GlottHMM, which were rated as similar. However, the only statistically significant differences in HMM synthesis were for the female voice with MCEP (significantly more natural than STRAIGHT and GlottHMM) and DSM (significantly better than GlottHMM).

These results indicate that MCEP and DSM are in general good choices for laughter synthesis. Both vocoders use a simple parameter representation in statistical modeling: only F0 and the spectrum are modeled, and all other features are fixed.

Female
System   DSM     Glott   MCEP    STR     Nat
DSM      -       0.006   1       1       0
Glott    0.006   -       0.04    0.002   0
MCEP     1       0.04    -       1       0
STR      1       0.002   1       -       0
Nat      0       0       0       0       -

Male
System   DSM     Glott   MCEP    STR     Nat
DSM      -       0.003   1       1       0
Glott    0.003   -       0       0.002   0
MCEP     1       0       -       1       0.027
STR      1       0.002   1       -       0
Nat      0       0       0.027   0       -

Table 2. Pairwise p-values between the vocoders' copy-synthesis and natural laughs. Statistically significant results are marked in bold.




Fig. 1. Naturalness scores for copy-synthesis (left) and HMM synthesis (right) for the male (upper) and female (lower) speakers.

Accordingly, the synthesis procedure of these vocoders is very simple: the excitation generation depends only on the modeled F0. In DSM, Fm, the residual waveform, and the noise time envelope are fixed, and thus they cannot produce additional artefacts beyond possible errors in F0 and spectrum. MCEP obtained the best naturalness scores for the female voice, although the known drawback of this method is its buzziness. This was likely not too disturbing as the female voice used few voiced segments. The buzziness could, however, explain why male laughs synthesized with MCEP were perceived as less natural than female laughs, since the male laughs contained more and longer voiced segments.

STRAIGHT performed better in copy-synthesis with the female voice but cannot hold this advantage in HMM-based laughter synthesis, when statistical modeling is involved. This may well be due to the modeled aperiodicity parameters, which are difficult to estimate from the challenging laughter signals, which contain a lot of partly voiced sounds. Moreover, STRAIGHT pitch estimation is known to be unreliable with non-modal voices (see e.g. [32]), which are very common in laughter.

Female
System   DSM     Glott   MCEP    STR
DSM      -       0.003   1       0.16
Glott    0.003   -       0       0.34
MCEP     1       0       -       0.02
STR      0.16    0.34    0.02    -

Male
System   DSM     Glott   MCEP    STR
DSM      -       0.14    0.46    1
Glott    0.14    -       1       1
MCEP     0.46    1       -       1
STR      1       1       1       -

Table 3. Pairwise p-values between HMM synthesis of different vocoders. Statistically significant results are marked in bold.

Thus, the estimated aperiodicity parameters may have a lot of inconsistent variation, degrading the statistical modeling of the parameters. Therefore, in HMM synthesis, the mixed excitation may fail to produce an appropriate excitation.

GlottHMM also suffers occasionally from pitch estimation errors, especially if the voicing settings are not accurately set or the speech material is challenging. At least the latter is true with laughter, in which the vocal folds do not reach a complete closure as in modal speech [33]. Pitch estimation errors are even more harmful for the GlottHMM vocoder than for the other vocoders, since the analysis of voiced and unvoiced sounds is treated in a completely different manner. Thus, voicing errors generate severe errors in the output parameters of GlottHMM. GlottHMM is also considerably more complex than the other systems, making the statistical modeling of all its parameters challenging with a small amount of data.

Finally, the role of the training material was not studied in this experiment, but it is expected to also have a significant effect, especially when dealing with challenging material such as laughter.

6. SUMMARY AND CONCLUSIONS

This paper presented an experimental comparison of four vocoders for HMM-based laughter synthesis. The results show that all vocoders perform relatively well in copy-synthesis. However, in HMM-based laughter synthesis, all synthesized laughter voices were significantly lower in quality than in copy-synthesis. The evaluation results revealed that the two vocoders using rather simple and robust excitation modeling performed the best, while the two other vocoders, using more complex analysis, parameter representation, and synthesis, suffered from the statistical modeling. These findings suggest that the robustness of parameter extraction and representation is a key factor in laughter synthesis, and increased efforts should be directed at enhancing the robust estimation and representation of the acoustic parameters of laughter.



7. REFERENCES

[1] J. Robson and J. Mackenzie Beck, "Hearing smiles - perceptual, acoustic and production aspects of labial spreading," in Proc. of Int. Conf. of the Phon. Sci. (ICPhS), San Francisco, USA, 1999, pp. 219-222.

[2] N. Campbell, "Conversational speech synthesis and the need for some laughter," IEEE Trans. on Audio, Speech, and Lang. Proc., vol. 14, no. 4, pp. 1171-1178, 2006.

[3] S. Petridis and M. Pantic, "Audiovisual discrimination between speech and laughter: Why and when visual information might help," IEEE Transactions on Multimedia, vol. 13, no. 2, pp. 216-234, 2011.

[4] J.-A. Bachorowski, M. J. Smoski, and M. J. Owren, "The acoustic features of human laughter," J. Acoust. Soc. Am., vol. 110, no. 3, pp. 1581-1597, 2001.

[5] J.-A. Bachorowski and M. J. Owren, "Not all laughs are alike: Voiced but not unvoiced laughter readily elicits positive affect," Psychological Science, vol. 12, pp. 252-257, 2001.

[6] K. P. Truong and D. A. van Leeuwen, "Automatic discrimination between laughter and speech," Speech Commun., vol. 49, pp. 144-158, 2007.

[7] M. T. Knox and N. Mirghafori, "Automatic laughter detection using neural networks," in Proc. Interspeech, Antwerp, Belgium, 2007, pp. 2973-2976.

[8] L. Kennedy and D. Ellis, "Laughter detection in meetings," in NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, 2004, pp. 118-121.

[9] S. Sundaram and S. Narayanan, "Automatic acoustic synthesis of human-like laughter," J. Acoust. Soc. Am., vol. 121, no. 1, pp. 527-535, 2007.

[10] R. Likert, "A technique for the measurement of attitudes," Archives of Psychology, 1932.

[11] E. Lasarcyk and J. Trouvain, "Imitating conversational laughter with an articulatory speech synthesis," in Proc. of the Interdisciplinary Workshop on the Phonetics of Laughter, Saarbrucken, Germany, 2007, pp. 43-48.

[12] T. Sathya Adithya, K. Sudheer Kumar, and B. Yegnanarayana, "Synthesis of laughter by modifying excitation characteristics," J. Acoust. Soc. Am., vol. 133, no. 5, pp. 3072-3082, 2013.

[13] J. Urbain, H. Cakmak, and T. Dutoit, “Evaluation of hmm-based laughter synthesis,” in Proc. IEEE Int. Conf. on Acoust.Speech and Signal Proc. (ICASSP), Vancouver, Canada, 2013,pp. 7835–7839.

[14] [Online], “HMM-based speech synthesis system (HTS),”http://hts.sp.nitech.ac.jp/.

[15] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, “Re-structuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-basedF0 extraction: possible role of a repetitive structure in sounds,”Speech Commun., vol. 27, no. 3–4, pp. 187–207, 1999.

[16] H. Kawahara, Jo Estill, and O. Fujimura, “Aperiodicity extrac-tion and control using mixed mode excitation and group delaymanipulation for a high quality speech analysis, modificationand synthesis system STRAIGHT,” in 2nd International Work-shop on Models and Analysis of Vocal Emissions for Biomedi-cal Applications (MAVEBA), 2001.

[17] T. Drugman and T. Dutoit, “The deterministic plus stochasticmodel of the residual signal and its applications,” IEEE Trans.

on Audio, Speech, and Lang. Proc., vol. 20, no. 3, pp. 968–981,2012.

[18] T. Raitio, A. Suni, J. Yamagishi, H. Pulakka, J. Nurminen,M. Vainio, and P. Alku, “Hmm-based speech synthesis uti-lizing glottal inverse filtering,” IEEE Trans. on Audio, Speech,and Lang. Proc., vol. 19, no. 1, pp. 153–165, 2011.

[19] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, “Mel-generalized cepstral analysis – A unified approach to speechspectral estimation,” in Proc. ICSLP, 1994, vol. 94, pp. 18–22.

[20] [Online], “Speech signal processing toolkit (SPTK) v. 3.6,”2013.

[21] D. Talkin, “A robust algorithm for pitch tracking (rapt),” inSpeech Coding and Synthesis, W. B. Klein and K. K. Palival,Eds. Elsevier, 1995.

[22] T. Kobayashi, S. Imai, and T. Fukuda, “Mel generalized logspectrum approximation (MGLSA) filter,” Journal of IEICE,vol. J68-A, no. 6, pp. 610–611, 1985.

[23] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, “Details ofnitech hmm-based speech synthesis system for the blizzardchallenge 2005,” in IEICE Trans. Inf. and Syst., 2007, vol.E90-D, pp. 325–333.

[24] T. Yoshimura, K. Tokuda, T. Masuko, and T. Kitamura,“Mixed-excitation for HMM-based speech synthesis,” Proc.Eurospeech, pp. 2259–2262, 2001.

[25] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptivealgorithm for mel-cepstral analysis of speech,” in Proc. IEEEInt. Conf. on Acoust. Speech and Signal Proc. (ICASSP), 1992,vol. 1, pp. 137–140.

[26] T. Drugman and T. Dutoit, “The deterministic plus stochasticmodel of the residual signal and its applications,” IEEE Trans.on Audio, Speech, and Lang. Proc., vol. 20, no. 3, pp. 968–981,2012.

[27] P. Alku, “Glottal wave analysis with pitch synchronous itera-tive adaptive inverse filtering,” Speech Commun., vol. 11, no.2–3, pp. 109–118, 1992.

[28] F. K. Soong and B.-H. Juang, “Line spectrum pair (LSP) andspeech data compression,” in Proc. IEEE Int. Conf. on Acoust.Speech and Signal Proc. (ICASSP), Mar. 1984, vol. 9, pp. 37–40.

[29] J. Urbain, E. Bevacqua, T. Dutoit, A. Moinet, R. Niewiadom-ski, C. Pelachaud, B. Picart, J. Tilmanne, and J. Wagner, “TheAVLaughterCycle database,” in Proc. of Seventh conference onIntl Language Resources and Evaluation (LREC’10), Valletta,Malta, 2010, pp. 2996–3001.

[30] J. Urbain and T. Dutoit, “A phonetic analysis of natural laugh-ter, for use in automatic laughter processing systems,” in Proc.of 4th bi-annual Intl Conf. of the HUMAINE Association onAffective Computing and Intelligent Interaction (ACII2011),Memphis, Tennesse, 2011, pp. 397–406.

[31] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. Black,and K. Tokuda, “The HMM-based speech synthesis system(HTS) version 2.0,” in Sixth ISCA Workshop on Speech Syn-thesis, 2007, pp. 294–299.

[32] T. Raitio, J. Kane, T. Drugman, and C. Gobl, “HMM-basedsynthesis of creaky voice,” in Proc. Interspeech, 2013, pp.2316–2320.

[33] Wallace Chafe, The Importance of not being earnest. The feel-ing behind laughter and humor., vol. 3 of Consciousness &Emotion Book Series, John Benjamins Publishing Company,Amsterdam, The Nederlands, paperback 2009 edition, 2007.
