Audio Morphing

Tony Ezzat, Jim Glass, with Prof. T. Poggio, CBCL

The Problem: In this work, we tackle the problem of morphing between different audio sequences. The system should take as input two audio sequences, and produce as output intermediate audio sequences that represent natural exemplars lying between the two input sequences.

Motivation: Audio morphing might have important applications in speech recognition, speech synthesis, music synthesis, and other applications where large corpora are recorded and there is a strong need to interpolate between the exemplars in the corpora to produce new exemplars.

Previous Work: There has been a spate of recent work on voice conversion [2, 9, 8], in which a reference speaker's speech sample is warped to match the statistical properties of a target speaker. Most authors resort to mixed time- and frequency-domain methods to alter pitch, duration, and spectral features.

In audio morphing, instead of warping original audio to new speaker characteristics, natural audio exemplars are interpolated to generate novel audio. The work by Slaney [4] is closest in spirit to the goal of this work. The authors used dynamic time-warping to time-align two speech samples, cross-faded the respective smoothed spectrograms, and warped a pitch residual to morph between two sounds.

Approach: There are two variants of our work: inter-voice morphing and intra-voice morphing. In the intra-voice morphing scenario, a single person's voice is recorded uttering a wide range of utterances. The speaker's phones are then morphed in time to generate new utterances of the speaker. We note that intra-voice morphing addresses the same problem that concatenative speech synthesis algorithms address, except that instead of simply re-ordering and concatenating original speech samples, intra-voice morphing methods smoothly morph between the different recorded sounds. This will help to reduce the amount of recorded audio needed for speech synthesis systems. For example, current state-of-the-art concatenative speech synthesis systems require around 40 hours of recorded audio in order to produce natural-sounding speech.

In the inter-voice morphing scenario, several people are recorded uttering the same utterance. Inter-voice morphing techniques are then developed to morph between the recorded utterances and generate the same utterance with a new voice characteristic that is a morph of the recorded voice characteristics. Applications for these algorithms include the development of text-to-speech systems with tunable voice characteristics that can be set by the user or client.

We are motivated in this work by the multidimensional morphable model (MMM) framework developed in [1] and [7]. In that work, individual prototype images of various face identities [1] or mouth configurations [7] are extracted from a collected corpus and morphed together in a multidimensional setting to produce novel face identities and novel mouth configurations.

In this work, we aim to explore and develop a multidimensional morphable model of audio. In the intra-voice morphing setting, we envision a multidimensional space where the example prototypes are individual pitch periods of a single person's speech. The prototypes are embedded in a multidimensional space whose axes are pitch, duration, and spectral characteristics. Particular phones constitute regions in this MMM space. A particular utterance constitutes a trajectory in this MMM audio space that passes smoothly through the relevant phone regions.
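
To make the embedding concrete, the sketch below is our illustration, not the abstract's specification: the names and the simple linear blending rule are assumptions. Each prototype is a point whose coordinates are pitch, duration, and a spectral envelope, and a morph is a convex combination of prototypes.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class Prototype:
        """One example pitch period, embedded as a point in the MMM space."""
        pitch_hz: float        # pitch axis
        duration_s: float      # duration axis
        envelope: np.ndarray   # spectral axis (e.g., LPC or cepstral coefficients)

    def morph(prototypes, weights):
        """Convex combination of prototypes: a new point in the MMM space."""
        w = np.asarray(weights, dtype=float)
        assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
        return Prototype(
            pitch_hz=sum(wi * p.pitch_hz for wi, p in zip(w, prototypes)),
            duration_s=sum(wi * p.duration_s for wi, p in zip(w, prototypes)),
            envelope=sum(wi * p.envelope for wi, p in zip(w, prototypes)),
        )

    # An utterance is then a trajectory through this space: a sequence of
    # such points passing smoothly through the relevant phone regions.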

In the inter-voice morphing setting, the example prototypes constitute the (same) utterance spoken by the various speakers in the database. The prototypes are embedded in a multidimensional space in which the prototypes are smoothly blended together to create the same utterance but spoken with morphed pitch, duration, and voice quality.

Our goal in this work is to develop the appropriate algorithmic machinery to construct these MMM audio spaces, and to be able to synthesize novel utterances, as well as utterances with different voice characteristics.

Progress: As a first step, we have collected two audio databases with which to conduct our experiments. The first database is a set of 90 TIMIT utterances from one speaker, with which to explore our intra-voice morphing algorithms. The second database was collected for our inter-voice morphing experiments, and consists of 15 people uttering the same sentence. Both databases were transcribed and phonetically aligned. The pitch was manually labeled in the second database.

We initially implemented the TD-PSOLA [6] technique for manipulating the pitch and duration aspects of audio sequences. In TD-PSOLA, each pitch period in a sequence is windowed and extracted. Duration is shortened or lengthened by removing or replicating pitch periods, with overlap-add. Pitch is manipulated by overlap-adding the pitch periods with the desired pitch contour profile. Our experiments indicate that, for its simplicity, TD-PSOLA is a surprisingly effective technique for pitch and duration manipulation, but it is not capable of representing spectral transformations between sounds.
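
The following is a minimal sketch of the overlap-add mechanics just described; it illustrates the general TD-PSOLA idea from [6], not the authors' implementation, and assumes the pitch marks have already been estimated elsewhere.

    import numpy as np

    def td_psola(x, marks, time_factor=1.0, pitch_factor=1.0):
        """Minimal TD-PSOLA sketch.

        x            -- mono signal (1-D numpy array)
        marks        -- strictly increasing pitch-mark sample indices
                        (assumed to be estimated elsewhere)
        time_factor  -- > 1 lengthens the sound (pitch periods replicated)
        pitch_factor -- > 1 raises the pitch (overlap-add spacing shrinks)
        """
        marks = np.asarray(marks)
        periods = np.diff(marks)
        out_len = int(len(x) * time_factor)
        y = np.zeros(out_len + len(x))          # generous output buffer
        t_out = float(marks[0])
        while t_out < out_len:
            # Map the output position back to input time; take the nearest mark.
            t_in = t_out / time_factor
            i = min(int(np.argmin(np.abs(marks - t_in))), len(periods) - 1)
            p = int(periods[i])
            # Extract two pitch periods around the mark, Hann-windowed.
            lo, hi = max(marks[i] - p, 0), min(marks[i] + p, len(x))
            seg = x[lo:hi] * np.hanning(hi - lo)
            start = int(t_out) - (marks[i] - lo)
            if start >= 0 and start + len(seg) <= len(y):
                y[start:start + len(seg)] += seg    # overlap-add
            t_out += p / pitch_factor               # new spacing sets the new pitch
        return y[:out_len]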

With regard to spectral transformations, we are looking at LPC techniques, which separate the formant structure of speech from the excitation signals [5]. Our observation is that the formant structures during speech transform smoothly but nonlinearly from one sound to the next. We have begun to examine techniques which allow for the nonparametric transformation of one formant to another.
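
As an illustration of this separation, here is a sketch of LPC analysis via the textbook autocorrelation method and Levinson-Durbin recursion, followed by inverse filtering to obtain the residual; the frame length, order, and windowing choices are our assumptions, not values from the abstract.

    import numpy as np
    from scipy.signal import lfilter

    def lpc_coeffs(frame, order):
        """LPC polynomial [1, a1, ..., a_order] via the autocorrelation
        method and the Levinson-Durbin recursion (frame length > order)."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = np.array([1.0])
        err = r[0]
        for i in range(1, order + 1):
            k = -np.dot(a, r[i:0:-1]) / err     # reflection coefficient
            a = np.append(a, 0.0)
            a = a + k * a[::-1]
            err *= 1.0 - k * k
        return a

    # Inverse-filtering a frame with A(z) removes the formant (envelope)
    # structure and leaves the excitation residual e[n] = A(z) x[n].
    frame = np.random.randn(400)                # stand-in for a speech frame
    a = lpc_coeffs(frame * np.hamming(len(frame)), order=16)
    residual = lfilter(a, [1.0], frame)
    # Morphing can then treat the envelope (a) and the residual separately.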

Finally, we have begun examining techniques to manipulate and interpolate the excitation residual extracted by LPC. In particular, we are currently examining prototype waveform interpolation methods for the residual [3].

Future: Once we have identified the relevant features to morph between two audio samples, we plan to implement a 1-dimensional morphing algorithm that can morph between two examples. The goal will be to generate intermediate sounds that sound realistic, without any audible degradations.
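
One plausible reading of such a 1-dimensional morph, sketched below under the assumption that the two examples have already been time-aligned and reduced to pitch, duration, and envelope features: a single parameter alpha blends each feature stream. The log-scale pitch interpolation is our assumption, not a choice stated in the abstract.

    import numpy as np

    def morph_1d(feats_a, feats_b, alpha):
        """Blend two time-aligned examples with one morph parameter alpha.

        alpha = 0 reproduces sound A, alpha = 1 reproduces sound B.
        Pitch is interpolated on a log scale, which we assume sounds more
        natural than linear interpolation in Hz.
        """
        return {
            "pitch": np.exp((1 - alpha) * np.log(feats_a["pitch"])
                            + alpha * np.log(feats_b["pitch"])),
            "duration": (1 - alpha) * feats_a["duration"]
                        + alpha * feats_b["duration"],
            "envelope": (1 - alpha) * feats_a["envelope"]
                        + alpha * feats_b["envelope"],
        }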

After that, we plan to extend the 1-dimensional algorithm to multiple dimensions, in order that we may morph all the sounds together and make better use of larger datasets.

Research Support: Research at CBCL is sponsored by grants from: Office of Naval Research (DARPA) Contract No. N00014-00-1-0907, National Science Foundation (ITR/IM) Contract No. IIS-0085836, National Science Foundation (ITR) Contract No. IIS-0112991, National Science Foundation (KDI) Contract No. DMS-9872936, and National Science Foundation Contract No. IIS-9800032.

Additional support was provided by: AT&T, Central Research Institute of Electric Power Industry, Center for e-Business (MIT), DaimlerChrysler AG, Compaq/Digital Equipment Corporation, Eastman Kodak Company, Honda R&D Co., Ltd., ITRI, Komatsu Ltd., Merrill-Lynch, Mitsubishi Corporation, NEC Fund, Nippon Telegraph & Telephone, Oxygen, Siemens Corporate Research, Inc., Sumitomo Metal Industries, Toyota Motor Corporation, WatchVision Co., Ltd., and The Whitaker Foundation.

    References:

[1] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Alyn Rockwood, editor, Proceedings of SIGGRAPH 99, Computer Graphics Proceedings, Annual Conference Series, pages 187–194, Los Angeles, 1999. ACM Press / ACM SIGGRAPH.

[2] A. Kain and M. Macon. Spectral voice conversion for text-to-speech synthesis. In Proc. ICASSP, pages 285–288, 1998.

[3] W. B. Kleijn and J. Haagen. Waveform interpolation for coding and synthesis. In W. B. Kleijn and K. K. Paliwal, editors, Speech Coding and Synthesis, chapter 5, pages 175–207. Elsevier Science B.V., 1995.

[4] M. Slaney, M. Covell, and B. Lassiter. Automatic audio morphing. In Proc. ICASSP, Atlanta, Georgia, 1996.

[5] J. D. Markel and A. H. Gray, Jr. Linear Prediction of Speech. Springer-Verlag, 1976.

[6] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9:453–467, 1990.

[7] T. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic facial animation. In Proceedings of SIGGRAPH 2002, volume 21, pages 388–398, San Antonio, Texas, 2002.

[8] H. Valbret, E. Moulines, and J. P. Tubach. Voice transformation using PSOLA technique. Speech Communication, 11:175–187, 1992.

[9] Y. Stylianou, O. Cappé, and E. Moulines. Statistical methods for voice quality transformation. In Proc. Eurospeech, pages 447–450, Madrid, Spain, 1995.