MODERN METHODS OF SPEECH PROCESSING

THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE

VLSI, COMPUTER ARCHITECTURE AND DIGITAL SIGNAL PROCESSING

Consulting Editor Jonathan Allen

Other books in the series:

FORMAL SEMANTICS FOR VHDL, Carlos Delgado Kloos ISBN: 0-7923-9552-2

ON OPTIMAL INTERCONNECTIONS FOR VLSI, Andrew B. Kahng, Gabriel Robins ISBN: 0-7923-9483-6

SIMULATION TECHNIQUES AND SOLUTIONS FOR MIXED-SIGNAL COUPLING IN INTEGRATED CIRCUITS, Nishath K. Verghese, Timothy J. Schmerbeck, David J. Allstot ISBN: 0-7923-9544-1

MIXED-MODE SIMULATION AND ANALOG MULTILEVEL SIMULATION, Resve Saleh, Shyh-Jye, A. Richard Newton ISBN: 0-7923-9473-9

CAD FRAMEWORKS: Principles and Architectures, Pieter van der Wolf ISBN: 0-7923-9501-8

PIPELINED ADAPTIVE DIGITAL FILTERS, Naresh R. Shanbhag, Keshab K. Parhi ISBN: 0-7923-9463-1

TIMED BOOLEAN FUNCTIONS: A UNIFIED FORMALISM FOR EXACT TIMING ANALYSIS, William K. C. Lam, Robert K. Brayton ISBN: 0-7923-9454-2

AN ANALOG VLSI SYSTEM FOR STEREOSCOPIC VISION, Misha Mahowald ISBN: 0-7923-9444-5

ANALOG DEVICE-LEVEL LAYOUT AUTOMATION, John M. Cohn, David J. Garrod, Rob A. Rutenbar, L. Richard Carley ISBN: 0-7923-9431-3

VLSI DESIGN METHODOLOGIES FOR DIGITAL SIGNAL PROCESSING ARCHITECTURES, Magdy A. Bayoumi ISBN: 0-7923-9428-3

CIRCUIT SYNTHESIS WITH VHDL, Roland Airiau, Jean-Michel Berge, Vincent Olive ISBN: 0-7923-9429-1

ASYMPTOTIC WAVEFORM EVALUATION, Eli Chiprout, Michel S. Nakhla ISBN: 0-7923-9413-5

WAVE PIPELINING: THEORY AND CMOS IMPLEMENTATION, C. Thomas Gray, Wentai Liu, Ralph K. Cavin, III ISBN: 0-7923-9398-8

CONNECTIONIST SPEECH RECOGNITION: A Hybrid Approach, H. Bourlard, N. Morgan ISBN: 0-7923-9396-1

BiCMOS TECHNOLOGY AND APPLICATIONS, SECOND EDITION, A.R. Alvarez ISBN: 0-7923-9384-8

TECHNOLOGY CAD-COMPUTER SIMULATION OF IC PROCESSES AND DEVICES, R. Dutton, Z. Yu ISBN: 0-7923-9379

VHDL '92, THE NEW FEATURES OF THE VHDL HARDWARE DESCRIPTION LANGUAGE, J. Berge, A. Fonkoua, S. Maginot, J. Rouillard ISBN: 0-7923-9356-2

APPLICATION DRIVEN SYNTHESIS, F. Catthoor, L. Svenson ISBN: 0-7923-9355-4

MODERN METHODS OF SPEECH PROCESSING

edited by

Ravi P. Ramachandran Richard J. Mammone

CAIP Center, Rutgers University

SPRINGER SCIENCE+BUSINESS MEDIA, LLC

ISBN 978-1-4613-5962-3 ISBN 978-1-4615-2281-2 (eBook) DOI 10.1007/978-1-4615-2281-2

Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available from the Library of Congress.

Copyright © 1995 by Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers in 1995. Softcover reprint of the hardcover 1st edition 1995.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.

To my wife, parents and grandparents.

R.P.R.

To my wife and sons.

R.J.M.

CONTENTS

CONTRIBUTORS

PREFACE

ACKNOWLEDGEMENTS

PART 1: SPEECH CODING

1 THE USE OF PITCH PREDICTION IN SPEECH CODING, Ravi P. Ramachandran

2 VECTOR QUANTIZATION OF LINEAR PREDICTOR COEFFICIENTS, John S. Collura

3 LINEAR PREDICTIVE ANALYSIS BY SYNTHESIS CODING, Peter Kroon and W. Bastiaan Kleijn

4 WAVEFORM INTERPOLATION, Jesper Haagen and W. Bastiaan Kleijn

5 VARIABLE RATE SPEECH CODING, Vladimir Cuperman and Peter Lupini

PART 2: SPEECH RECOGNITION

6 WORD SPOTTING, Jan Robin Rohlicek

7 SPEECH RECOGNITION USING NEURAL NETWORKS, Stephen V. Kosonocky

8 CURRENT METHODS IN CONTINUOUS SPEECH RECOGNITION, P. S. Gopalakrishnan

9 LARGE VOCABULARY ISOLATED WORD RECOGNITION, Vishwa Gupta and Matthew Lennig

10 RECENT DEVELOPMENTS IN ROBUST SPEECH RECOGNITION, B. H. Juang

11 HOW DO HUMANS PROCESS AND RECOGNIZE SPEECH?, Jont B. Allen

PART 3: SPEAKER RECOGNITION

12 DATA FUSION TECHNIQUES FOR SPEAKER RECOGNITION, Kevin R. Farrell and Richard J. Mammone

13 SPEAKER RECOGNITION OVER TELEPHONE CHANNELS, Yu-Hung Kao, Lorin Netsch and P. K. Rajasekaran

PART 4: TEXT TO SPEECH SYNTHESIS

14 APPROACHES TO IMPROVE AUTOMATIC SPEECH SYNTHESIS, Douglas O'Shaughnessy

PART 5: APPLICATIONS OF MODELS

15 MICROPHONE ARRAY FOR HANDS-FREE VOICE COMMUNICATION IN A CAR, Stephen Oh and Vishu Viswanathan

16 THE PITCH MODE MODULATION MODEL AND ITS APPLICATION IN SPEECH PROCESSING, Michael A. Ramalho and Richard J. Mammone

17 AUDITORY MODELS AND HUMAN PERFORMANCE IN TASKS RELATED TO SPEECH CODING AND SPEECH RECOGNITION, Oded Ghitza

18 APPLICATIONS OF WAVELETS TO SPEECH PROCESSING: A CASE STUDY OF A CELP CODER, James Ooi and Vishu Viswanathan

INDEX

CONTRIBUTORS

Jont B. Allen, AT&T Bell Laboratories, Murray Hill, New Jersey

John S. Collura, Department of Defense, Ft. Meade, Maryland

Vladimir Cuperman, Simon Fraser University, Burnaby, B. C., Canada

Kevin R. Farrell, Dictaphone Corporation, Stratford, Connecticut

Oded Ghitza, AT&T Bell Laboratories, Murray Hill, New Jersey

P. S. Gopalakrishnan, IBM T. J. Watson Research Center, Yorktown Heights, NY

Vishwa Gupta, Bell Northern Research, Montreal, Canada

Jesper Haagen, Tele Denmark Research, Horsholm, Denmark

B. H. Juang, AT&T Bell Laboratories, Murray Hill, New Jersey

Yu-Hung Kao, Texas Instruments, Dallas, Texas

W. Bastiaan Kleijn, AT&T Bell Laboratories, Murray Hill, New Jersey

Stephen V. Kosonocky, IBM T. J. Watson Research Center, Yorktown Heights, NY

Peter Kroon, AT&T Bell Laboratories, Murray Hill, New Jersey

Matthew Lennig, Bell Northern Research, Montreal, Canada

Peter Lupini, Simon Fraser University, Burnaby, B. C., Canada

Richard J. Mammone, Rutgers University, Piscataway, New Jersey

Lorin Netsch, Texas Instruments, Dallas, Texas

Stephen Oh, Texas Instruments, Dallas, Texas

James Ooi, Massachusetts Institute of Technology, Cambridge, Massachusetts

Douglas O'Shaughnessy, INRS Telecommunications, Montreal, Canada

P. K. Rajasekaran, Texas Instruments, Dallas, Texas

Ravi P. Ramachandran, Rutgers University, Piscataway, New Jersey

Michael A. Ramalho, Bell Communications Research, Red Bank, New Jersey

Jan Robin Rohlicek, BBN HARK Systems Corporation, Cambridge, Massachusetts

Vishu Viswanathan, Texas Instruments, Dallas, Texas

PREFACE

The term speech processing refers to the scientific discipline concerned with the analysis and processing of speech signals to obtain the greatest benefit in a variety of practical scenarios. These scenarios correspond to a wide range of applications of speech processing research, including enhancement, coding, synthesis, recognition and speaker recognition. The field has grown very rapidly, particularly during the past ten years, through the efforts of many leading scientists. The ideal aim is to develop algorithms for a given task that maximize performance, are computationally feasible and are robust to a wide class of conditions.

The purpose of this book is to provide a cohesive collection of articles that describe recent advances in various branches of speech processing. The main focus is on describing specific research directions through a detailed analysis and review of both the theoretical and practical settings. The intended audience includes graduate students who are embarking on speech research as well as experienced researchers already working in the field. For graduate students taking a course, this book serves as a supplement to the course material. As the student focuses on a particular topic, the corresponding set of articles in this book will serve as an initiation by exposing the student to research issues and by providing an extensive reference list with which to commence a literature survey. Experienced researchers can use this book as a reference guide and can expand their horizons in this rather broad area.

With the above thoughts, we now expand on the various topics covered, bearing in mind that, as is the case with any book, the areas covered are by no means exhaustive. Although we have tried to partition this vast field into sections for the purposes of effective book organization, we realize that there are no strict boundaries of knowledge. Each part of the book is devoted to a goal of speech research. Part 1 deals with effectively communicating speech from one point to another through coding. Part 2 covers the issue of recognizing a word or other speech unit independently of the speaker it came from. The topic of Part 3 is the task of successfully recognizing a speaker from his or her speech utterance. Part 4 concentrates on the transformation of text into speech. In Part 5, specific applications of signal processing concepts and modeling phenomena to speech are illustrated.

We now further describe the contents of the book by expanding on each part separately. First, consider Part 1. The article by Ramachandran on pitch prediction focuses on a specific component of predictive speech coders that is used to regenerate the periodicity in the signal. The second article by Collura describes quantization strategies for coding another component of predictive coders, namely, the parameters of the near-sample predictor which reinserts the formant structure. The article by Kroon and Kleijn on the analysis-by-synthesis paradigm describes a particular technique used in low bit rate predictive coding. A recent concept, waveform interpolation, for producing high quality speech at low bit rates is discussed in the fourth article by Haagen and Kleijn. Part 1 ends with an exposition of speech coding at variable bit rates in the article by Cuperman and Lupini.

Part 2 commences with an article on word spotting by Rohlicek. The following article by Kosonocky describes the use of neural networks for speech recognition. The next two articles deal with techniques to recognize a word or sequence of words that form part of a large vocabulary. The first of these articles, by Gopalakrishnan, focuses on the concepts of feature extraction and modeling in continuous speech recognition. The second, by Gupta and Lennig, looks at isolated word recognition. The fifth article by Juang addresses the very important issue of achieving high recognition accuracy under different environmental conditions. Doing so enhances robustness, yielding automatic systems that perform well even under unexpected or adverse conditions. The final article by Allen discusses human speech recognition. A deep understanding of how humans recognize speech will play a vital role in improving automatic machine-based recognition systems.

Both articles in Part 3 examine the speaker recognition problem. The use of data fusion to augment performance for both text-independent speaker identification and text-dependent speaker verification is illustrated in the first article by Farrell and Mammone. The second article by Kao, Netsch and Rajasekaran concentrates on speech transmitted over long distance telephone channels. The theme of robustness to channel effects is central to this article.

Part 4 has one article, by O'Shaughnessy, devoted to producing natural-sounding speech from text. Current approaches that attempt to alleviate the inadequacy in modeling human speech production are examined.

In Part 5, four different areas having applications to speech processing are depicted. First, the use of a beamforming algorithm with a microphone array for hands-free voice communication in a car is described in the article by Oh and Viswanathan. The next article by Ramalho and Mammone discusses a new speech model (known as the Pitch Mode Modulation Model) and its applications in speech enhancement, speaker identification and speech synthesis. The area of auditory modeling and its use in coding and recognition is the topic of the third article by Ghitza. This part ends with an article by Ooi and Viswanathan that discusses the use of wavelets in speech with particular emphasis on a type of analysis-by-synthesis coder. The above constitutes a synopsis of the material contained in this book.

Ravi P. Ramachandran

Richard J. Mammone

ACKNOWLEDGEMENTS

At the outset, we thank all the authors for their contributions. It is worthwhile to note that some have contributed more than one chapter. We are particularly happy that Vishu Viswanathan took the initiative in offering his second chapter on wavelets. We are grateful for the research support provided by the CAIP Center at Rutgers. For timely help in giving suggestions on the use of LaTeX for typesetting this book and for supplying the macros for certain fonts, we thank the computer support staff at CAIP, the staff at Kluwer and Peter Kroon of Bell Laboratories. The secretarial assistance given by Kathyrn Bryan is gratefully acknowledged. Portions of the manuscript were proofread by John Collura, Jesper Haagen, Vidhya Ramanujam and Roopashri Ramachandran. Their suggestions improved the quality of certain chapters. We appreciate the assistance of Peter Kroon and Kevin Farrell in helping coordinate the material for Part 1 and Part 3 respectively. Sincere thanks go to our respective families for their constant encouragement in this very rewarding effort.