DIGITAL SPEECH PROCESSING

Biing Hwang Juang, Bell Labs, Lucent Technologies

M. Mohan Sondhi, Bell Labs, Lucent Technologies

Lawrence R. Rabiner, AT&T Labs

I. Introduction

II. Speech Analysis & Representation

III. Speech Coding

IV. Speech Synthesis

V. Speech Recognition

VI. Speaker Verification

VII. Speech Enhancement

VIII. Concluding Comment

GLOSSARY

Speech Analysis: The process of using computational algorithms to measure properties of the speech signal.

Speech Production Model: A characterization of the sound excitation source and the articulatory mechanism used in producing speech.

Formant: Resonance of the vocal tract manifested acoustically as a concentration of energy in the frequency spectrum.

Speech Coding: Process of converting a speech signal into a digital format for transmission over a network, or for storage.

Adaptive Coding: A coding process that adapts its attributes and process parameters according to specific characteristics of the signal that is being encoded.

Predictive Coding: A coding process that involves using past (known) values of a signal to hypothesize the current value of the signal in order to reduce the variance in the residual (uncoded) signal for increased coding efficiency.

Subjective Quality: The quality of a speech signal, often the result of processing, as perceived by human listeners.

Text-to-Speech Synthesis: Process of synthesizing speech from printed text, often unlimited in scope.

Text Analysis: Parsing and analyzing a sequence of words to make explicit the underlying syntactic and semantic structure so as to allow proper pronunciation and grouping of the words, and to facilitate automatic understanding of the meaning in the given word sequence.

Isolated Word Recognition: Automatic recognition of words spoken one at a time, with distinct pauses between individual words.

Continuous Speech Recognition: Automatic recognition of speech utterances in which words are spoken continuously, without pauses between them.

Keyword Spotting: Process of automatically detecting and recognizing a key word, or a key phrase, embedded in a naturally spoken sentence.

Speech Understanding: High-level processing of a sequence of spoken words that extracts meaning from the spoken input.

Speaker Verification: Process of authenticating a claimed identity based on spoken utterances.

Verbal Information Verification: Process of verifying a claimed identity based on user-specific information, such as a password, mother’s maiden name, or a personal identification number (PIN), contained in spoken utterances.

Speech Enhancement: Processing of speech to make it less noisy, clearer, more intelligible, or easier to listen to.

ABSTRACT

Speech is the most fundamental form of communication among humans. Digital speech processing is the science and technology of transducing, analyzing, representing, transmitting, transforming, and reconstituting speech information by digital techniques. The technology is intended to improve communication between humans as well as to enable communications between humans and machines. Communications between humans can be enhanced by digital encoding and decoding of speech for efficient transmission, storage and privacy, and by digital enhancement for noise suppression, distortion correction, and hearing loss compensation. Communication between humans and machines embodies automatic recognition and understanding of speech, to give machines an “ear” with which to listen to spoken utterances, and digital synthesis of speech, to give machines a “mouth” with which to respond to humans.

I. Introduction

Speech is the most sophisticated signal naturally produced by humans. The speech signal carries linguistic content for sharing information and ideas, and it allows people to express emotions and verbally share feelings. It is the most fundamental form of communication among humans. The aim of digital speech processing is to take advantage of digital computing techniques to process the speech signal for increased understanding, improved communication, and increased efficiency and productivity in speech-related activities.

The field of speech processing includes speech analysis and representation, speech coding, speech synthesis, speech recognition and understanding, speaker verification, and speech enhancement. Speech is a complex signal that is characterized by varying distributions of energy in time as well as in frequency, depending on the specific sound that is being produced. The speech signal also possesses other characteristics that make it a very efficient means for carrying semantic (meaning) as well as pragmatic (task-dependent) information. The processes of speech analysis and representation, which lie at the technical basis of digital speech processing, attempt to use computational algorithms to “discover”, to measure, and to represent the important properties of speech for many applications.

One of the most important applications of digital speech processing is speech coding, which is concerned with efficient and reliable communication between people who may be separated by geographical distance or by time. The former forms the basis of modern telephony, which enables people to converse regardless of their locations; the latter forms the basis for applications like “voice mail,” which lets people create and retrieve verbal messages at arbitrary times. Speech coding enables both telephony and voice messaging by converting the speech signal into a digital format suitable for either transmission or storage. Relevant issues in speech coding are conservation of bandwidth (the rate of the voice coder), voice quality requirements, processing and transmission delay, and processing power, as well as techniques for privacy and secure communication.

Speech synthesis is concerned with providing a machine with the ability to talk to people in as intelligible and natural a voice as possible. A speech synthesis system can be as simple as a 'prerecorded' announcement machine with a limited collection of utterances or sentences, or as complicated as a full text-to-speech conversion system, which automatically converts unlimited printed text into speech. Speech recognition, often viewed as the counterpart to speech synthesis, is concerned with systems that have the ability to recognize speech (literally transcribe word-for-word what was spoken), and then understand the intended meaning of the spoken words (or more properly the recognized spoken words). Speech recognition systems range in sophistication from the simplest speaker-dependent, isolated-word or phrase recognizer to fully conversational systems that attempt to deal with virtually unlimited vocabularies and complex task syntax comparable to that of a natural language system.

Speaker recognition is concerned with machine verification or identification of individual talkers, based on their speech, for authorization of access to information, networks, computing systems, services, or physical premises. At times, as an important biometric feature, a talker’s speech may also be used for forensic or criminal investigations.

Speech enhancement attempts to improve the intelligibility and quality of a speech signal that may have been corrupted with noise or distortion, causing loss of intelligibility or quality and compromising its effectiveness in communication. Speech enhancement methods are also useful in the design of hearing aid devices, which transform the speech signal in a way to compensate for the hearing loss in the auditory system of the wearer of the device.

Digital speech processing has continued to make substantial advances in the past decade, primarily due to the explosive growth in computational capabilities (both processing and storage), the introduction of key mathematical algorithms (see below), and the vast amount of real-world data that are being systematically made available through organized efforts among government agencies and research institutions. According to Moore's Law, the computing capabilities of general purpose processors grow at the rate of a doubling of processor speed and memory every 18 months, while keeping the cost at approximately the same level. Advances in very-large-scale-integrated (VLSI) circuits have led to the realization of extremely fast digital signal processors (DSPs) with very low power consumption, helping the proliferation of miniaturized devices such as cellular mobile phones. These advances in computing power are particularly beneficial in dealing with signals as complex as human speech. And finally, in the area of software, the convergence of programming languages (C and C++), the widespread use of command and interpretive languages for scientific computing such as MATLAB, and user-friendly DSP assembly-code development environments have made digital speech processing a very accessible, practical, and easily realizable concept.

In the past decade, computing and digitization devices also have become ever more readily available for data recording and management. The availability of easily accessible databases has added a new direction to digital speech processing research, going beyond the traditional paradigm of hypothesis and observation made by experts in the field, into a data-driven mode in which computational algorithms are designed to learn from real-world data directly. This change of research paradigm is significant in making digital speech processing a useful practice because speech signals usually exhibit a wide range of variability, which can only be adequately observed (in a statistical sense) in a large collection of data. A human expert inherently lacks the ability to deal with such a large quantity of data. Realizing this, a number of government agencies and industrial research institutions have, in the past two decades, separately collected a number of task-specific speech databases. Such databases include speech sentences chosen for coder evaluations; the set of Naval Resource Management (RM) sentences for the study of continuous speech recognition; the set of Air Travel Information System (ATIS) sentences for speech recognition and dialog research; the set of North America News Broadcast (NAB) sentences for automatic speech transcription; and the Switchboard recordings for recognition of conversational telephone speech, just to name a few. These databases have guided many speech research programs and have been used for evaluating the progress in digital speech processing, thereby helping to make many system designs practically useful.

Many digital speech-processing systems have been deployed for real-world services in the past decade. Major growth in the utility of voice systems exists in at least four major market sectors: telecommunications, business applications, consumer products, and government. In the telecommunications sector, speech coders are essential in digital and packet telephony, for traditional voice circuits as well as over the Internet. Specialized speech announcement systems use coded speech to provide timely information to customers, and speech recognition systems are used for automation of operator and attendant services as well as for account information retrieval. In business applications, voice mail and store-and-forward messaging systems are in widespread use; automatic speech recognizers help people direct calls in a private branch exchange (PBX) arrangement; and voice interactive terminals let users dictate correspondence or issue voice commands for command and control of various devices or parts of the computer operating system. In the consumer products sector, toys incorporating speech synthesis and recognition have been available for years, and talking features and alarm announcements are also beginning to appear in household appliances. In the area of government communications, anticipated uses of speech processing include coding for secure communications, speech recognition for command and control of military systems, and automatic speech understanding and summarization for intelligence applications. These examples, by no means exhaustive, illustrate the burgeoning applications of digital speech processing and point to a growing market in the coming years.

II. Speech Analysis and Representations

The traditional framework for analyzing speech is the source-tract model first proposed by Homer Dudley at Bell Laboratories in the 1930s. In this model, as depicted in Figure 1, a speech excitation signal is produced by an excitation source and processed by a filter system that “modulates” the spectral characteristics of the excitation signal based on the shape of the vocal tract for the specific sound being generated. The excitation source has two components - a “buzz” source and a “hiss” source. The signal produced by the buzz source is a sequence of pulses with a controllable repetition rate (the fundamental frequency, or F0) and provides the carrier for voiced sounds such as /a/ in father, /e/ in met, and /o/ in hello. The hiss source produces a noise-like signal and provides the carrier for unvoiced sounds such as /s/ in sell, /sh/ in shout, and /k/ in kitten. The source signal is then filtered by a time-varying linear system with an adjustable frequency response, thereby creating a (time-varying) distribution of energy (the final speech signal) as the output of the overall speech production model. This time-varying filter system, which models the effects of the human vocal tract, is realized by a bank (10 channels in Dudley’s original embodiment) of bandpass filters that span the range of speech frequencies. Any desired vocal tract frequency response characteristic can be achieved by adjusting the amplitudes of the outputs of the bandpass filters. Dudley was able to demonstrate (very successfully at the 1939 New York World’s Fair) that a machine based on the production model of Fig. 1 could produce sounds that are very close in quality to that of natural speech.

Figure 1 The speech production model -- the basis for speech analysis

Since Dudley’s work, the source-tract model of speech production has remained the dominant framework for speech analysis and representation over the past 60 years. Key elements in the model, namely the specification of the frequency response of the slowly time-varying vocal tract filters and the characteristics of the source signal, are thus the targets of speech analysis, which aims at finding optimal parameter values that best define the source-tract model to match a given speech signal.

To obtain the (time-varying) frequency response characteristics of the vocal tract filters, well-known methods of spectral estimation are normally employed. One such method uses a bank of bandpass filters, each followed by a non-linearity, to measure the power level at the output of each channel. This method can be used to generate a local estimate of the spectral profile of the speech sound at a given time. Digital filter-bank design techniques are very well understood and are relatively straightforward to implement. Alternatively, one can use digital spectral analysis methods, such as the discrete Fourier transform (DFT) or the fast Fourier transform (FFT), to implement the filterbank according to a prescribed structure and spectral matching criterion. The resulting spectral profile of the time-varying speech signal, when plotted in the time-frequency plane using the density of the print to indicate the corresponding power level, is called a spectrogram (or sonogram; the instrument that produces it is the sound spectrograph). Figure 2 shows an example of a spectrogram corresponding to a speech segment of 2.34 seconds duration.

Figure 2 An example of a spectrogram for a speech segment
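
A spectrogram like the one in Figure 2 can be produced with nothing more than short-time DFT analysis. The sketch below (in Python with NumPy; the 8 kHz sampling rate, 25 ms frame length, and 10 ms frame advance are assumed, typical values rather than figures from the text) illustrates the idea:

    import numpy as np

    def spectrogram(signal, sample_rate=8000, frame_ms=25, hop_ms=10):
        """Log power spectrum of successive short frames (one row per frame)."""
        frame_len = int(sample_rate * frame_ms / 1000)    # e.g., 200 samples at 8 kHz
        hop_len = int(sample_rate * hop_ms / 1000)        # frame advance, e.g., 80 samples
        window = np.hamming(frame_len)                    # taper to reduce spectral leakage
        rows = []
        for start in range(0, len(signal) - frame_len + 1, hop_len):
            frame = window * signal[start:start + frame_len]
            power = np.abs(np.fft.rfft(frame)) ** 2       # DFT power spectrum of the frame
            rows.append(10.0 * np.log10(power + 1e-12))   # power level in dB
        return np.array(rows)

    # Example: a synthetic 2.34 s signal, matching the duration shown in Figure 2.
    t = np.arange(0, 2.34, 1.0 / 8000)
    x = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(len(t))
    print(spectrogram(x).shape)   # (number of frames, number of frequency bins)

Plotting the rows against time, with darkness proportional to the dB value, yields a display of the kind shown in Figure 2.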

Another important method for analyzing and representing the time-varying vocal tract frequency response is known as linear prediction. Linear prediction, or linear predictive coding (LPC), is motivated by the assumption that the speech signal, at an arbitrary time, can be approximately predicted by a linear combination of its past values. The difference between the predicted value and the true sample value is called the residual, or the prediction error. Three important properties make LPC the analysis method of choice in most speech processing systems. First, the resulting set of optimal weights (the so-called linear prediction coefficients) that achieve the best prediction, in the sense of minimizing the residual for each short-time window of speech, defines an all-pole filter that best characterizes the behavior of the vocal tract. Second, the residual signal, when used as the excitation signal to drive the optimal tract filter, produces a signal essentially identical to the original signal. The residual signal retains the properties of the source signal, resembling a noise-like signal for unvoiced sounds and a pulse train for voiced sounds. Third, the optimal residual signal has a much-reduced dynamic range compared with the original signal and is thus a preferred signal in speech coding (see Speech Coding below).

To model the time-varying characteristics of the speech signal, the LPC analysis procedure updates the estimate progressively over time, once every few centiseconds (typically every 10 to 30 ms). This process is generally referred to as short-time spectral analysis.
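
As a concrete illustration, the sketch below computes the prediction coefficients for one short-time frame using the autocorrelation method and the Levinson-Durbin recursion (Python/NumPy; the predictor order of 10, the Hamming window, and the synthetic test frame are assumed choices for illustration):

    import numpy as np

    def lpc(frame, order=10):
        """LPC coefficients a[0..order] (with a[0] = 1) for one windowed frame,
        computed with the autocorrelation method and Levinson-Durbin recursion."""
        r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]                                   # prediction error energy
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err                           # reflection coefficient
            a_prev = a.copy()
            for j in range(1, i):
                a[j] = a_prev[j] + k * a_prev[i - j]
            a[i] = k
            err *= 1.0 - k * k
        return a, err

    # One 30 ms frame (240 samples at 8 kHz) of a synthetic "voiced" signal.
    rng = np.random.default_rng(0)
    frame = np.hamming(240) * (np.sin(2 * np.pi * 120 * np.arange(240) / 8000.0)
                               + 0.01 * rng.standard_normal(240))
    a, err = lpc(frame)
    residual = np.convolve(frame, a)[:len(frame)]    # e[n] = x[n] + sum_k a[k] x[n-k]
    print(a.round(3), residual.var() < frame.var())  # residual has much smaller variance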

Another important property of the speech signal is the excitation mode of the vocal tract. The excitation can be either voiced (the vocal cords are vibrating during the production of the sound) or unvoiced (the vocal cords are not vibrating and a noise-like excitation is produced by the air rushing through the open vocal tract). For voiced sounds, the key characteristic of the source is the fundamental frequency, which is the frequency of vocal-cord vibration. Another important speech parameter is the status of speech/non-speech activity--i.e., whether speech is being produced or whether silence is being observed during periods of non-activity of the vocal tract.

To estimate the fundamental frequency for a voiced sound, one usually measures the repetitiveness and the repetition rate in the signal. In the time domain, the repetitiveness can be measured by finding the strongest peak in the autocorrelation function at a lag corresponding to the pitch period (the reciprocal of the fundamental frequency). Another method for estimating the fundamental frequency of voiced sounds, which capitalizes on the regularity in the spectrum, is the so-called method of cepstral analysis. The cepstrum is the (inverse) Fourier transform of the log spectrum of the speech signal. Regularity in the log spectrum, such as repetitive spikes at essentially equal spacing (corresponding to the fundamental frequency), results in a clear spike in the cepstrum at a location (the “quefrency” index) corresponding to the pitch period. Cepstral analysis, although computationally more costly than more traditional algorithms, is used extensively in fundamental frequency estimation and often also in feature measurement for speech recognition.
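
A minimal sketch of both estimators follows (Python/NumPy; the 8 kHz sampling rate, the 60-400 Hz search range, and the synthetic voiced frame are assumed, typical values for illustration):

    import numpy as np

    def pitch_autocorrelation(frame, fs=8000, fmin=60, fmax=400):
        """F0 estimate (Hz) from the strongest autocorrelation peak in the lag
        range corresponding to plausible pitch periods."""
        corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(fs / fmax), int(fs / fmin)      # shortest/longest period in samples
        lag = lo + int(np.argmax(corr[lo:hi]))
        return fs / lag

    def pitch_cepstrum(frame, fs=8000, fmin=60, fmax=400):
        """F0 estimate from the cepstral peak (inverse FFT of the log magnitude
        spectrum) in the same quefrency range."""
        cep = np.fft.ifft(np.log(np.abs(np.fft.fft(frame)) + 1e-12)).real
        lo, hi = int(fs / fmax), int(fs / fmin)
        q = lo + int(np.argmax(cep[lo:hi]))
        return fs / q

    # A synthetic voiced frame: harmonics of 125 Hz with decaying amplitudes.
    n = np.arange(400)
    frame = np.hamming(400) * sum(np.sin(2 * np.pi * 125 * k * n / 8000.0) / k
                                  for k in range(1, 25))
    print(pitch_autocorrelation(frame), pitch_cepstrum(frame))
    # both estimates should come out near 125 Hz for this synthetic voiced frame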

III. Speech Coding

The goal of speech coding is to transform the speech waveform into a digital representation so as to allow efficient transmission and storage of the signal. The transformation, in general, will result in a certain loss of fidelity; hence, the efficiency of the speech coder is measured in terms of the bit rate required to meet a given quality requirement. The higher the bit rate, the easier it is to preserve the quality. Besides quality, a related dimension of significance is the delay incurred during processing. For real-time telephony between humans, a delay (processing plus transmission) of over 200 ms would make fluent two-way communication quite difficult. To achieve high coding efficiency, however, one needs to take advantage of the slowly varying nature of speech; as a result, many coding algorithms use a delay buffer on the order of a few centiseconds to meet the efficiency requirements. Trade-offs among the three coding dimensions – bit rate, quality, and delay – often have to be made in specific applications.

Quality of speech is a subjective measure, often expressed in terms of the so-called mean opinion score (MOS). The MOS is obtained by averaging the quality judgment from a pool of human listeners in response to a set of stimuli (listening samples). The subjective quality judgment is measured on a five-point scale ranging from 1 (unacceptable) to 5 (excellent). An MOS of 4 or above is generally regarded as high quality. A descriptive taxonomy of speech quality is the following:

· broadcast quality (FM bandwidth of ~7kHz with MOS greater than 4),

· toll quality (telephone bandwidth of ~3.2 kHz with MOS around 4), and

· communication quality (military and mobile radio with MOS around 3).

Speech intelligibility, often obtained via the diagnostic rhyme test (DRT), is another measure that indicates the level of retained clarity in the coded speech after processing. Monosyllabic word pairs with confusable word initials (e.g., met vs net) or finals (e.g., flat vs flak) according to six major linguistic dimensions are used in DRT tests. Speech processing can introduce distortions that can lead to increases in confusability, as judged by human listeners. Except in rare military applications, speech processing algorithms need to be able to maintain a minimum DRT score of 90% in order to maintain acceptable intelligibility of the coded speech.

Analysis methods in speech coding are generally based on two approaches. A waveform coder encodes the speech such that when the signal is reconstituted at the receiver for playback, the original waveform is reproduced as faithfully as possible. A model-based speech coder, or vocoder, transforms the speech into a representation that specifies a model emulating the way the articulatory system works in producing the speech sound. These two approaches represent the two ends of the technological spectrum that covers most speech coder designs.

The most widely used method of waveform coding is called Pulse Code Modulation (PCM), which was historically motivated by transmission applications (and hence the term modulation). In linear PCM, the amplitude of each speech sample is expressed in a binary format (0s and 1s), assuming one of a number of values uniformly spread across the amplitude range. To be able to faithfully reproduce a speech signal without detrimental perceptual effects, at least 4096 values are needed, which means a 12-bit linear PCM representation. When the amplitude distribution of the speech samples is taken into account, a linear PCM system is generally preceded by amplitude compression during encoding and followed by amplitude expansion during decoding. The resulting so-called A-law or μ-law companders (compressor-expanders) are more efficient than straightforward linear PCM, requiring around 8 bits per sample to achieve a similar quality. For telephony applications, the speech signal is usually sampled at 8 kHz and encoded with 8-bit companded PCM, with a resulting transmission rate of 64 kb/s.
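
The μ-law characteristic itself is simple to state. The sketch below (Python/NumPy) applies the standard μ = 255 compression curve followed by uniform 8-bit quantization, which approximates what a G.711-style compander does (the real standard uses a piecewise-linear segment approximation of this curve, so this is an illustrative sketch, not the standard codec):

    import numpy as np

    MU = 255.0   # mu-law constant used in North American and Japanese telephony

    def mulaw_encode(x, bits=8):
        """Compress samples in [-1, 1] with the mu-law curve, then quantize
        uniformly to 2**bits levels."""
        y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)   # compressed to [-1, 1]
        levels = 2 ** bits
        return np.round((y + 1.0) / 2.0 * (levels - 1)).astype(int)

    def mulaw_decode(codes, bits=8):
        """Invert the quantization and the compression (expansion)."""
        levels = 2 ** bits
        y = codes / (levels - 1) * 2.0 - 1.0
        return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

    x = 0.05 * np.sin(2 * np.pi * np.arange(160) / 40.0)      # a quiet 200 Hz tone at 8 kHz
    err = x - mulaw_decode(mulaw_encode(x))
    print(err.std() / x.std())   # relative error stays small even at low signal amplitude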

Speech is a slowly varying signal and displays substantial correlation among the values of adjacent samples. A coding method that makes use of this correlation is thus generally more efficient than simple PCM encoding. The simplest way to utilize the adjacent sample correlation in speech is the scheme of Differential PCM (DPCM), which encodes not the speech signal directly but the difference between the current sample and the previous one. When the coding scheme is made adaptive to certain time-varying properties of the signal (rather than remaining fixed for all time, independent of the signal characteristics), still higher efficiency is possible. The method of Adaptive DPCM (or ADPCM) allows toll quality transmission of speech at 32 kb/s. Note that these simple differential schemes only incur a single sample (0.125 ms) of delay.
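
A minimal first-order DPCM sketch follows (Python/NumPy; the fixed quantizer step size is an assumed value, and an ADPCM coder would adapt that step to the local signal level):

    import numpy as np

    def dpcm_encode(x, step=0.01):
        """Quantize the difference between each sample and the decoder's
        reconstruction of the previous one (closed-loop prediction)."""
        codes = np.empty(len(x), dtype=int)
        recon = 0.0                          # what the decoder will have reconstructed
        for n, sample in enumerate(x):
            codes[n] = int(round((sample - recon) / step))
            recon += codes[n] * step         # keep encoder and decoder in lock step
        return codes

    def dpcm_decode(codes, step=0.01):
        return np.cumsum(codes) * step       # accumulate the quantized differences

    x = 0.3 * np.sin(2 * np.pi * np.arange(400) / 50.0)       # slowly varying test signal
    x_hat = dpcm_decode(dpcm_encode(x))
    print(np.max(np.abs(x - x_hat)) <= 0.01)                  # error bounded by the step size

Because the encoder quantizes the difference against its own copy of the decoder's reconstruction, quantization errors do not accumulate at the decoder.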

More sophisticated coding schemes, such as the adaptive predictive coding (APC) method, use more previous speech data samples to better predict the current speech sample value, resulting in a residual signal of much reduced variance and hence one that is easier to encode. The idea of using a linear weighted combination of the previous sample values to predict the current value, as a way to reduce the variance of the residual signal and to describe the behavior of the speech signal, gained tremendous attention in the late 60’s and early 70’s (see Speech Analysis and Representation above). This is the method of Linear Predictive Coding (LPC), in which a set of optimal weight coefficients (the so-called predictor coefficients) is calculated at a regular time interval, say every 20 ms, to minimize the variance (or energy) of the residual within the short window of speech samples. Studies in articulatory modeling were able to relate the predictor coefficients to the vocal tract shape and its resonance structure (the formants), enabling LPC to become one of the most successful methods in digital speech processing.

The concept of model-based coding has its roots in Homer Dudley’s original source-tract framework. In the source-tract model, the key components are the parameters that define the tract filter, and the pitch and voicing information that defines the excitation signal. In the encoder, these components are obtained by applying speech analysis techniques to the signal, with the resulting parameters being quantized with efficient coding schemes for transmission. The speech signal can then be reconstituted (synthesized) from these parameters (the quantized spectrum and the quantized excitation signal), which vary relatively slowly with time. By transmitting these data at a slower rate (normally fewer than 30 data values every 20 milliseconds) than the original sampled waveform (normally 8 sample values every millisecond), the resulting vocoder provides the potential for speech transmission at much higher efficiency. The advent of LPC made analysis and representation of the time-varying spectral shape of the corresponding vocal tract straightforward and reliable. LPC-based vocoders operating at 2.4 kb/s have been in military deployment for secure communications since the 70’s. At this bit rate, the coarse buzz-hiss model is usually used to represent the excitation, and the quality of the re-synthesized speech is quite limited, with an MOS normally below 3 and a DRT score in the low 90% range.

Many new digital speech coders fall into the category of a hybrid coder, in which the concept of vocal tract representation is integrated with a waveform-tracking scheme for the excitation signal to achieve a quality suitable for telephony applications. These include the multi-pulse excitation LPC coder, the code-excited LPC (CELP) coder, and the low-delay CELP (LD-CELP) coder. These coders normally operate at bit rates ranging from 4.8 kb/s to 16 kb/s.

Another major advance in speech coding, namely vector quantization, took place in the late 70’s. The basic idea of vector quantization is to encode several samples (i.e., a vector) at a time, as opposed to a single sample (i.e., a scalar). Shannon’s communication theorems provided strong motivation for the idea of vector quantization. Furthermore, a number of algorithms were developed to automatically design the collection of reconstruction vectors (the codebook) according to the data distribution, so as to minimize the potential distortion incurred as a result of coding. Many recent speech coders employ vector quantization. When applied to model-based coding, vector quantization vocoders allow transmission of digital speech at 800 b/s and below, with adequate speech quality for special communication needs (e.g., communication in electronic warfare).
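
The sketch below (Python/NumPy) designs a small codebook with ordinary k-means iterations, which form the core of the LBG design procedure (the codebook size, the squared Euclidean distortion measure, and the toy data are assumed choices for illustration):

    import numpy as np

    def train_codebook(vectors, codebook_size=16, iterations=20, seed=0):
        """k-means style codebook design on an (N, dim) array of training vectors."""
        rng = np.random.default_rng(seed)
        codebook = vectors[rng.choice(len(vectors), codebook_size, replace=False)].copy()
        for _ in range(iterations):
            # Nearest-codeword assignment under squared Euclidean distortion.
            d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Move each codeword to the centroid of the vectors assigned to it.
            for k in range(codebook_size):
                members = vectors[labels == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
        return codebook

    def quantize(vectors, codebook):
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)          # each vector is replaced by one small index

    # Toy data standing in for, e.g., 10-dimensional LPC-derived spectral vectors.
    data = np.random.default_rng(1).normal(size=(2000, 10))
    cb = train_codebook(data)
    print(quantize(data[:5], cb))        # 5 indices, each representable in log2(16) = 4 bits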

For speech coding to be useful in telecommunication applications, coding systems have to be inter-operable and thus the coding scheme has to be standardized before deployment. Speech coding standards are established by various standards organizations such as the International Telecommunications Union (ITU), the Telecommunications Industry Association (TIA), the Research and Development Center for Radio Systems (RCR) in Japan, the International Maritime Satellite Corporation (Inmarsat), the European Telecommunications Standards Institute (ETSI), and some Government Agencies.

Figure 3 illustrates the performance of various speech coders in terms of the perceptual quality of the decoded speech at the corresponding operating bit rate. The companded PCM coder at 64 kb/s, the ADPCM coder at 32 kb/s, the LD-CELP coder at 16 kb/s and the algebraic-coded CELP (A-CELP) coder at 8 kb/s are all capable of achieving an MOS of more than 4, and thus are readily usable in telecommunication applications. The CELP coder at 4.8 kb/s has an MOS of slightly less than 4 and is useful in several communication applications such as secure voice. The LPC vocoder at 2.4 kb/s can only achieve an MOS of around 2.5 and has so far been used strictly in the military. A new multi-band excitation LPC coder (ME-LPC), in which the characteristics of the excitation are determined for each band separately, is able to provide a perceptual quality at 2.4 kb/s approaching that of CELP at twice the bit rate. The chart also shows that the current challenge, in terms of the rate-distortion performance of a coding scheme, lies at and below 4 kb/s, where the perceptual quality drops radically.

Figure 3 Speech quality in terms of MOS for various coders at typical bit rates (in kb/s).

The advent of the Internet and the possibility of packet telephony using the Internet Protocol (IP), the so-called voice over IP (or VoIP), gives rise to another challenge in speech coding. The added dimensions from the new packet networks (as opposed to the traditional synchronous telephony network hierarchy) are:

1) network delay and delay jitter, which vary substantially depending on the network traffic conditions, and

2) packet loss potential due to congestion in packet networks from multiple services running on a common infrastructure.

These new applications call for a coding scheme that performs robustly even under these potentially adverse conditions.

IV. Speech Synthesis

Modern speech synthesis is the product of a rich history of attempts to generate speech by mechanical means. The earliest known device to mimic human speech was constructed by Wolfgang von Kempelen over two hundred years ago. His machine consisted of elements that mimicked various organs used by humans to produce speech – a bellows for the lungs, a tube for the vocal tract, a side branch for the nostrils, etc. Interest in such mechanical analogs of the human vocal apparatus continued well into the twentieth century. In the latter half of the nineteenth century, Helmholtz and others began synthesizing vowels and other sonorants by superposition of harmonic waveforms with appropriate amplitudes. A significantly different direction was taken by Homer Dudley in the 1930’s, with his discovery of the carrier nature of speech, and its corollary, the source-filter model shown in Figure 1. By using a keyboard to control the time-varying filter and the choice of excitation for the system shown in that figure, he was able to synthesize fluent speech of quite good quality.

The first digital speech synthesizer was demonstrated by Cecil Coker around 1967. In some sense that synthesizer was a throwback to von Kempelen’s machine, with one major difference. Instead of manipulating a mechanical model, the synthesizer computed what a mechanical model would do if it were implemented. On the basis of rules derived from a study of human speech production, a computer program computed the sequence of shapes that a vocal tract would have to go through in order to generate speech corresponding to any text presented as an input. From these shapes, and the knowledge of the appropriate acoustic excitation (periodic pulses for voiced sounds, noise-like excitation for unvoiced sounds), the program solved the wave equation with appropriate boundary conditions, to compute the acoustic pressure at the lips. Finally, an electrical signal with the same waveform as the computed pressure was applied to a loudspeaker to produce the desired speech.

This method of speech synthesis has a strong appeal because it mimics the way a human being produces speech. However, in spite of considerable effort, it has not yet proven possible to automatically generate good quality speech from arbitrary text by this method. This is because it has not been possible to derive rules that work correctly in all circumstances. So far, good synthesis by rule seems to require too frequent ad hoc modification of the rules to be useful in practice. Also, it is hard to generate certain speech sounds by rule, e.g., the burst in sounds like the /k/ of kitten.

Modern text-to-speech synthesis (TTS) is based on a much less fundamental, but much more effective, procedure called concatenative synthesis. Basically, a desired speech signal is assembled from “units” selected from an inventory compiled during a training phase of the synthesizer. The units are acoustical representations of small sub-word elements, e.g., phonemes, diphones, frequently occurring consonant clusters, etc. The representation itself can take several alternative forms, e.g., the LPC coefficients along with the residual as described earlier. The complete process of concatenative synthesis, from text input to speech output, consists of several steps. These are outlined in Figure 4 for one such system for the synthesis of English. The details would differ for other languages, but the general framework would be similar for a large class of languages.

Starting with the input text in some suitable format, e.g., ASCII, the text is first normalized. Normalization consists of detecting blank spaces, sentences, paragraphs and capital letters, and converting commonly occurring abbreviations to normal spelling. The abbreviations include symbols, such as $, %, &, etc., as well as titles (e.g., Mr., Mrs., Dr., etc.), abbreviations for months (Jan., Feb., etc.) and so on. Note that, as illustrated in the figure, the same abbreviation (Dr.) can give rise to different letter sequences (Doctor in the first example and Drive in the second) depending on the context.
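
A toy illustration of this normalization step is sketched below (Python; the abbreviation table and the capitalization heuristic for “Dr.” are hypothetical stand-ins for the much larger tables and context rules a real system uses):

    # Hypothetical mini table of unambiguous abbreviations.
    ABBREVIATIONS = {"Mr.": "Mister", "Mrs.": "Missus", "Jan.": "January", "Feb.": "February"}

    def normalize(text):
        """Expand abbreviations; 'Dr.' is resolved by context, as in the
        Doctor-versus-Drive example mentioned above."""
        tokens = text.split()
        out = []
        for i, tok in enumerate(tokens):
            if tok == "Dr.":
                nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
                # Crude rule: a following capitalized word suggests a title.
                out.append("Doctor" if nxt[:1].isupper() else "Drive")
            else:
                out.append(ABBREVIATIONS.get(tok, tok))
        return " ".join(out)

    print(normalize("Dr. Smith moved to 42 Oak Dr. last Jan."))
    # -> Doctor Smith moved to 42 Oak Drive last January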

The next step shown in Figure 4 deals with the problem that a given string of letters must be pronounced differently depending on the part of speech or the meaning. So, for instance, “lives” as a verb is pronounced differently from “lives” as a noun; and “axes” is pronounced one way if it is the plural of “axe” and another way if it is the plural of “axis”. There are many such confusions that the parser needs to disambiguate.

At the output of the syntactic/semantic parser the text has been normalized and the intended pronunciation of all the words has been established. At this point the pronunciation dictionary is consulted to determine the sequence of phonemes for each word. If a word does not exist in the dictionary (e.g., a foreign word or an unknown abbreviation), then the synthesizer falls back on letter-to-sound rules that make a best guess at the intended pronunciation.
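
In code, this lookup-with-fallback logic is simple; the sketch below uses a hypothetical two-entry lexicon and deliberately crude single-letter rules (a real letter-to-sound module would use context-sensitive rules or a trained model):

    # Hypothetical mini pronunciation dictionary (ARPAbet-like phoneme symbols).
    LEXICON = {
        "speech": ["S", "P", "IY", "CH"],
        "hello":  ["HH", "AH", "L", "OW"],
    }

    # Deliberately crude one-letter-to-one-phoneme fallback rules.
    LETTER_TO_SOUND = {
        "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
        "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
        "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
        "v": "V", "w": "W", "x": "K", "y": "Y", "z": "Z",
    }

    def pronounce(word):
        """Dictionary lookup first; fall back to letter-to-sound rules for
        out-of-dictionary words such as names or unknown abbreviations."""
        word = word.lower()
        if word in LEXICON:
            return LEXICON[word]
        return [LETTER_TO_SOUND[ch] for ch in word if ch in LETTER_TO_SOUND]

    print(pronounce("speech"))    # from the dictionary: ['S', 'P', 'IY', 'CH']
    print(pronounce("zorbel"))    # best guess from the fallback rules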

Once the phoneme sequence has been established the synthesizer specifies prosodic information, i.e., information regarding the relative durations of various speech sounds and the intensity and pitch variations during the course of the utterance. Such information is critical to the generation of natural sounding speech. If the wrong words are stressed in a sentence, or if the durations are inappropriate, the speech sounds unnatural. Not only that, inappropriate stress can alter the meaning of a sentence. Similarly, without a natural variation of pitch the utterance might sound monotonous. Also, inappropriate pitch variation might, for instance, change a simple declaration into a question.

The final box in Figure 4 does the actual synthesis of the text. Its task is to select the appropriate sequence of units from the inventory, modify the pitch, amplitude, and/or duration of each unit and concatenate these modified units to produce the desired speech.
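
As a highly reduced sketch of this final step, the code below simply joins a sequence of waveform units with a short crossfade at each boundary (Python/NumPy; the tones standing in for stored units are synthetic, and a real synthesizer would also modify pitch and duration, e.g., with PSOLA-style techniques, before joining):

    import numpy as np

    def concatenate_units(units, fs=8000, crossfade_ms=5):
        """Join waveform units with a linear crossfade to avoid audible clicks
        at the unit boundaries."""
        fade = int(fs * crossfade_ms / 1000)
        ramp = np.linspace(0.0, 1.0, fade)
        out = units[0].astype(float)
        for u in units[1:]:
            u = u.astype(float)
            out[-fade:] = out[-fade:] * (1.0 - ramp) + u[:fade] * ramp
            out = np.concatenate([out, u[fade:]])
        return out

    # Three synthetic tones standing in for stored diphone units.
    units = [np.sin(2 * np.pi * f * np.arange(800) / 8000.0) for f in (220, 180, 200)]
    print(len(concatenate_units(units)))    # 800 + 2 * (800 - 40) samples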

Currently there are several directions in which concatenative text-to-speech synthesis is being extended and improved. One major effort is concerned with the collection of the inventory of units. Until recently, because of considerations of computational complexity and memory requirements, inventories had one (or at most a few) tokens for each needed unit. This token was then modified in pitch, amplitude, and duration as required by the context. Considerable effort was spent in the optimal selection of the tokens. More recently, the trend has been to have very large inventories in which most units might appear in many contexts, and with many different pitches and durations. With such an inventory, units can be concatenated after much less modification than was needed with the earlier inventories. Of course, much larger memory is required, and also the problem of searching the large inventory requires much more computation. Fortunately, both available memory and processor speed are increasing at a very rapid rate, and becoming quite affordable.

Another direction is towards “multilingual” TTS. Clearly, the synthesis of a given language will have many features not shared with other languages. For example, word boundaries are not marked in Chinese; the grouping and order in which the digits of a given number are spoken in German is very different from that in English; the number and choice of units in the inventory need to be different for different languages; and so on. However, rather than creating a collection of language-specific synthesizers, multilingual TTS aims at developing more versatile algorithms. The ultimate aim is that all language-specific information should be in tables of data, and all algorithms should be shared by all languages. Although this ideal is unlikely to be achieved in the foreseeable future, its pursuit focuses research more on language-independent aspects of synthesis.

Another direction of interest is that of adapting synthesis to mimic the speech of a specified person. Clearly, speech synthesized in the manner described above, will sound somewhat similar to that of the person who contributed the units in the inventory. Can one make it sound like the speech of any specified person? This appears to be highly unlikely. However, it might be possible to make the synthesizer sound like several different voices by manipulating the prosodic rules and making systematic modifications of the given inventory.

We conclude this section by noting that concatenative synthesis is by no means to be considered as the ultimate in speech synthesis. Although the quality of speech generated by this method is better than that of other methods, it is by no means an accurate mimic of human speech. Ultimately, we believe, accurate modeling of the human vocal apparatus might still be necessary to achieve a significantly higher level of speech quality.

V. Speech Recognition

Speech recognition by machine, in a limited and strict sense, can be considered as a problem of converting a speech waveform into words. It requires analysis of the speech signal, conversion of the signal into elementary units of speech such as phonemes or words, and interpretation of the converted sequence in order to reconstruct the sentence or for other linguistic processing such as parsing and speech understanding. Applications of speech recognition include a voice typewriter, voice control of communication services and terminal devices, information services such as voice access to news and messages, and price inquiry and order entry in tele-commerce, just to name a few. Sometimes the area of speech recognition is extended to “speech understanding”, because the utility of a speech recognizer often involves understanding of the spoken words in order to initiate a certain service action (for example, routing a call to the operator).

Speech recognition was mostly considered, until the 70s, to be a speech analysis problem. The fundamental belief was that if a proper analysis method were available that could reliably produce the identity of a speech sound, speech recognition would be readily attainable. Researchers in acoustic-phonetics in the past advocated this deterministic view of the speech recognition problem, citing such examples as “A stitch in dime saves nine” (in contrast to “A stitch in time saves nine”), which they believed could only be recognized correctly via the use of acoustic-phonetic features. This view may be appropriate in a microscopic sense (e.g., to make the isolated distinction between /d/ and /t/ in the above example) but does not address the macroscopic question of how a recognizer should be designed such that on average (in dealing with all the input sounds), it achieves the fewest errors or the lowest error rate.

The introduction of the statistical pattern matching approach to speech recognition, which matured in the 80s and continues to flourish in the 90s and into the 21st century, helped set the problem on solid analytical ground. Statistical pattern matching and recognition is mostly motivated by Bayes’ decision theory, which asserts that a pattern recognizer needs to possess knowledge of the variation in the observations in order to be able to achieve the lowest recognition error probability. Error probability means, essentially, how likely the recognizer is to make a mistake, on average, over all unknown observations (i.e., samples within the recognition test set). The knowledge of statistical variation is expressed in the form of a model, the parameters of which have to be learned from real data (called the training set), usually a large collection with known (or manually labeled) identities. A model is a probability measure and can be considered a typical “pattern” with associated statistical variances. The process of learning the statistical regularities as well as the associated variation from the data is often referred to as recognizer “training”. Since the error probability, or error rate, is the most intuitive and reasonable measure of the performance of a recognition system, this formulation forms the basis of the design principle of modern speech recognizers.

The most successful model used for characterizing the variation in speech today is the hidden Markov model (HMM). A hidden Markov model is a doubly stochastic process, in which the observation sequence is described, locally in time, by one of a set of random processes, and chronologically in terms of the probable changes among these random processes, by a finite-state Markov chain. This model was found to be quite suitable for the speech signal, which indeed displays two levels of variation, one pertaining to the uncertainty in the realization of a particular sound and the other the sequential changes from one sound to another. Today, most if not all of the automatic speech recognition systems in the field are based on the hidden Markov model technique.
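
To make the decoding side of this concrete, the sketch below implements the Viterbi algorithm, which finds the most likely hidden state sequence given per-frame log observation likelihoods (Python/NumPy; in a real recognizer those likelihoods would come from, e.g., Gaussian mixture densities over spectral feature vectors, and the 3-state model and frame scores here are made-up illustrative values):

    import numpy as np

    def viterbi(log_b, log_A, log_pi):
        """Most likely state path for one observation sequence.
        log_b:  (T, N) log p(observation_t | state j)
        log_A:  (N, N) log transition probabilities (row: from-state)
        log_pi: (N,)   log initial state probabilities
        Working in the log domain avoids numerical underflow on long utterances."""
        T, N = log_b.shape
        delta = np.empty((T, N))
        psi = np.zeros((T, N), dtype=int)
        delta[0] = log_pi + log_b[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A      # score of reaching each state from each predecessor
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_b[t]
        path = np.zeros(T, dtype=int)
        path[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):                  # backtrack
            path[t] = psi[t + 1, path[t + 1]]
        return path, delta[-1].max()

    # A 3-state left-to-right model and some made-up per-frame likelihoods.
    log_A = np.log(np.array([[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]) + 1e-12)
    log_pi = np.log(np.array([1.0, 0.0, 0.0]) + 1e-12)
    log_b = np.log(np.array([[0.8, 0.1, 0.1]] * 4 + [[0.1, 0.8, 0.1]] * 4 + [[0.1, 0.1, 0.8]] * 4))
    print(viterbi(log_b, log_A, log_pi)[0])             # -> mostly 0s, then 1s, then 2s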

Despite the advances in the last three decades, speech recognition research is still far from achieving its ultimate goal---namely recognition of unlimited speech from any speaker in any environment. There are several key reasons why automatic speech recognition by machine remains a challenging problem. One important factor is the ambiguity in the set of speech units used for recognition. Units as large as words and phrases are acceptable for limited task environments (e.g., dialing telephone numbers, voice command words for text processing). However, these large speech units are totally intractable for speech recognition involving large vocabularies and/or continuous, naturally spoken sentences for two main reasons. One is associated with the amount of data to be collected and labeled for establishing the needed statistical knowledge, and the other with the degree of complexity in search combinatorics in the recognition decision process. (Consider the extreme case of using sentences as the units in continuous speech recognition. The number of possible sentences easily becomes astronomical even with a medium vocabulary size. Collecting all the realizations of these sentences is infeasible if not outright impossible.) In this case a subword recognition unit (e.g., dyad, diphone, syllable, fractional syllable, or even phoneme) is preferred, and techniques for composing words from such subword units are needed (such as lexical access from stored pronouncing dictionaries). Since such units are often not well articulated in natural continuous speech, the inference of words from an error-prone sequence of the decoded units in reference to a “standard” lexicon becomes a very difficult task.

Another important factor affecting performance is the size of the user population. So-called speaker‑trained or speaker-dependent systems adapt to the voice patterns of an individual user via an enrollment procedure. For a limited number of frequent users of a particular speech recognizer, this type of procedure is reasonable and generally leads to good recognizer performance. However, for applications where the user population is large and the usage casual (e.g., users of automatic number dialers or order entry services), it is infeasible to train the system on individual users. In such cases the recognizer must be speaker-independent and able to adapt to a broad range of accents and voice characteristics. Other factors that affect recognizer performance include:

· vocabulary complexity,

· the transmission medium over which recognition is performed (e.g., over a telephone line, in an airplane cockpit, or using a hands-free speakerphone),

· task limitations in the form of syntactic and semantic constraints on what can be spoken, and

· cost and method of implementation.

These factors (which are only a partial list) illustrate why automatic speech recognition, in the broadest context, still requires rigorous research in the foreseeable future.

Currently, speech recognition has achieved modest success by limiting the scope of its applications. Systems designed to recognize small‑ to moderate‑size vocabularies (10‑500 words) in a speaker‑trained manner, in a controlled environment, with a well‑defined task (e.g., order entry, voice editing, telephone number dialing), and spoken cooperatively (as opposed to conversational utterances full of disfluencies such as partial words, incomplete sentences and extraneous sounds like uh and um) have been reasonably successful. Such systems have been, for the most part, based entirely on the techniques of statistical pattern recognition, as discussed above. In the simplest context, each word in the vocabulary is represented as a distinct pattern (or set of patterns) in the recognizer memory as shown in Fig.5. (The pattern can be either a sample of the vocabulary word, stored as a temporal sequence of spectral vectors, or a statistical model of the spectrum of the word as a function of time.) Each time a word or sequence of words (either isolated or connected) is spoken, a match between the unknown pattern and each of the stored word vocabulary patterns is made, and the best match is used as the recognized string. In order to do the matching between the unknown speech pattern and the stored vocabulary patterns, a time alignment procedure is required to register the unknown and reference patterns properly because variation in the speaking rate usually does not alter the identity of the spoken word. Several algorithms based on the techniques of dynamic programming have been devised for optimally performing the match.

Figure 5 Automatic speech recognition using a pattern recognition framework
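
One classical dynamic-programming technique for the time alignment just described is dynamic time warping (DTW); a minimal sketch follows (Python/NumPy; the step pattern, length normalization, and toy reference patterns are common, assumed choices for illustration):

    import numpy as np

    def dtw_distance(test, reference):
        """Accumulated distance of the best time alignment between two
        sequences of spectral vectors (one row per frame)."""
        T, R = len(test), len(reference)
        local = np.linalg.norm(test[:, None, :] - reference[None, :, :], axis=2)
        D = np.full((T + 1, R + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, T + 1):
            for j in range(1, R + 1):
                D[i, j] = local[i - 1, j - 1] + min(D[i - 1, j],      # test advances alone
                                                    D[i, j - 1],      # reference advances alone
                                                    D[i - 1, j - 1])  # both advance
        return D[T, R] / (T + R)        # normalize so patterns of different length compare

    def recognize(test, reference_patterns):
        """Pick the stored vocabulary pattern with the smallest DTW distance."""
        return min(reference_patterns, key=lambda w: dtw_distance(test, reference_patterns[w]))

    rng = np.random.default_rng(0)
    refs = {"yes": rng.normal(size=(30, 12)), "no": rng.normal(size=(25, 12))}
    test = refs["yes"][::2] + 0.1 * rng.normal(size=(15, 12))    # a time-compressed "yes"
    print(recognize(test, refs))                                 # -> yes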

Tables I and II provide a summary of typical speech recognition performance for systems of the type shown in Fig. 5. Table I shows average word error rate for context-free recognition (i.e., no task syntax or semantics to help detect and correct errors) of isolated words for both speaker-dependent (SD) and speaker-independent (SI) systems. It can be seen that, for the same vocabulary, SD and SI recognizers can achieve comparable performance. It is further seen that performance is more sensitive to vocabulary complexity than to vocabulary size. Thus, the 39-word alpha-digits vocabulary (letters A-Z, digits 0-9, three command words) with highly confusable word subsets such as B, C, D, E, G, P, T, V, Z, and 3 has an average error rate of about 5-7%, whereas a 1109-word basic English vocabulary has an average error rate of ~4%.

Task/Application                  Vocabulary Size   Mode   Word Accuracy
Digits                            10                SI     ~100%
Voice Dialer Words                37                SD     100%
Alpha-digits plus Command Words   39                SD     96%
                                                    SI     93%
Computer terms                    54                SI     96%
Airline Words                     129               SD     99%
                                                    SI     97%
Japanese City Names               200               SD     97%
Basic English                     1109              SD     96%

Table I Performance of isolated word recognition systems

Table II shows the typical performance of connected‑word recognizers applied to tasks with various types of constraints. The performance on connected digits (both speaker trained and speaker independent) reflects significant advances in the training procedures for recognition. Perplexity in the table is defined as the average number of words that can follow an arbitrary word in the corresponding domain of language, due to linguistic (syntactic, grammatical and semantic) constraints. It is usually substantially less than the size of the vocabulary and is a rough measure of the difficulty of the task. It is estimated in the context of a statistical language model, which defines the sequential dependence between words. For example, an N-gram model defines the probability of a sequence of N words (or units) as observed in a large collection of sentences according to the language. Language models are obtained, in essence, according to word counting procedures.
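
A toy example of such a word-counting language model, and of the perplexity it assigns to a new sentence, is sketched below (Python; the two-sentence corpus, the vocabulary size, and the add-one smoothing are purely illustrative assumptions):

    import math
    from collections import Counter

    def train_bigram(sentences):
        """Count unigrams and bigrams over a (toy) training corpus."""
        unigrams, bigrams = Counter(), Counter()
        for s in sentences:
            words = ["<s>"] + s.split() + ["</s>"]
            unigrams.update(words[:-1])
            bigrams.update(zip(words[:-1], words[1:]))
        return unigrams, bigrams

    def perplexity(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
        """Per-word perplexity under the bigram model, with add-alpha smoothing
        so that unseen word pairs keep a nonzero probability."""
        words = ["<s>"] + sentence.split() + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(words[:-1], words[1:]):
            p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(words) - 1))

    corpus = ["show me flights to boston", "show me fares to denver"]
    uni, bi = train_bigram(corpus)
    print(perplexity("show me flights to denver", uni, bi, vocab_size=8))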

Task/Application                 Vocabulary Size   Perplexity   Word Accuracy
Connected Digit Strings          10                10           ~99%
Naval Resource Management        991               <60          97%
Air Travel Information System    1,800             <25          97%
Wall St. Journal Transcription   64,000            <140         94%
Broadcast News Transcription     64,000            <140         86%

Table II Performance benchmark of HMM-based automatic speech recognition systems for various connected and continuous speech recognition tasks or applications

The Naval Resource Management task involves a particular kind of language used in naval duties and is highly stylized. Sentences in the Air Travel Information System are query utterances focusing on flight information such as flight time, fare, and the origin and destination of the intended travel. The vocabulary size reflects the extent of city or airport names included in the database. The grammatical structure in the query sentences is quite limited, compared to many other application domain languages. In the Wall Street Journal transcription task, the input is “read” speech via a microphone (with reasonably high quality), which is known to be much easier than spontaneous, conversational speech. The Broadcast News transcription task involves signals often referred to as “found” speech, such as radio or television news announcements. The signal may have adverse components due to noise and distortion, with spontaneity somewhat higher than for “read” speech. The degradation in recognition accuracy is a clear indication that this is a more difficult task. These tasks, however, are all considered far easier than transcribing and understanding truly spontaneous, conversational speech.

These systems use phoneme-like statistical unit models to represent words, as do many speech recognition software packages offered on the market for personal computing (PC) applications such as dictating correspondence. These phoneme-like unit models are also qualified by the context they appear in, i.e., they are context-dependent units; for example, an /l/-like unit qualified by a preceding /e/ and a following /i/ would be used when the word under recognition hypothesis is “element”. Many systems employ thousands of such context-dependent unit models in continuous speech recognition. However, the benchmark results in the table are based on extremely extensive speaker-independent training, while PC-based software systems normally require speaker adaptation. The user of PC-based speech recognition software is asked to speak a designated set of sentences, ranging in duration from 5 minutes to a few hours, so that the parameters of the baseline system can be modified for improved performance for the specific user.
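
Generating such context-dependent labels from a phoneme string is straightforward; the sketch below uses a common "left-center+right" naming convention (Python; the phoneme symbols for “element” are an illustrative transcription):

    def triphones(phones):
        """Qualify each phone by its left and right neighbors ('sil' marks
        silence at the utterance boundaries)."""
        padded = ["sil"] + list(phones) + ["sil"]
        return ["{}-{}+{}".format(padded[i - 1], padded[i], padded[i + 1])
                for i in range(1, len(padded) - 1)]

    print(triphones(["eh", "l", "ih", "m", "ah", "n", "t"]))   # "element"
    # -> ['sil-eh+l', 'eh-l+ih', 'l-ih+m', 'ih-m+ah', 'm-ah+n', 'ah-n+t', 'n-t+sil']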

As illustrated above, the performance of current systems is barely acceptable for large-vocabulary tasks, even with isolated word inputs, speaker training, and a favorable talking environment. Almost every aspect of continuous speech recognition, from training to system implementation, represents a challenge in performance, reliability, and robustness.

Another approach to machine recognition and understanding of speech, particularly for automated services in a limited domain, is the technique of keyword spotting. A word-spotting system aims at identifying a keyword, which may be embedded in a naturally spoken sentence. This is very useful because it makes the interaction between the user and the machine more natural and more robust than a rigid command-word recognition system. Experience shows that many (non-frequent) users of a speech recognition system in telecommunication applications often speak words or phrases that are not part of the recognition vocabulary, creating the so-called out-of-vocabulary (OOV) or out-of-grammar (OOG) errors. Rather than attempting to recognize every word in the utterance, a word-spotting system hypothesizes the presence of a keyword in appropriate portions of the speech utterance and verifies the hypothesis by computing two matching scores, one between the hypothesized portion of the speech signal and the keyword model, and the other between the same portion and a background speech model. These scores are subject to a ratio test against a threshold for the final decision. By avoiding forced recognition decisions on the unrecognizable and inconsequential regions of the speech signal, the system can accommodate natural command sentences with good results as long as the number of keywords is limited (less than 20). Today, a word-spotting system with five key-phrases (“collect”, “credit card”, “third party”, “person-to-person”, and “operator”) is in deployment, automating a very substantial number (in the billions) of the telephone calls traditionally categorized as “operator assisted” calls, resulting in tremendous savings in operating cost.
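
The accept/reject logic at the heart of such a system is essentially a likelihood-ratio test; a minimal sketch follows (Python; the per-frame normalization, the zero threshold, and the model scores are assumed, illustrative choices):

    def keyword_decision(keyword_logprob, filler_logprob, num_frames, threshold=0.0):
        """Accept the hypothesized keyword only if its log likelihood beats that
        of a general background ('filler') speech model by enough, per frame."""
        score = (keyword_logprob - filler_logprob) / num_frames
        return score > threshold, score

    # Made-up model scores for one hypothesized 120-frame region of an utterance.
    accept, score = keyword_decision(keyword_logprob=-410.0, filler_logprob=-455.0,
                                     num_frames=120)
    print(accept, round(score, 3))    # True 0.375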

VI. Speaker Verification

The objective of speaker verification is the authentication of a claimed identity from measurements on the voice signal. Applications of speaker verification include entry control to restricted premises, access to privileged information, funds transfer, credit card authorization, voice banking, and similar transactions.

There are two types of voice authentication: one verifies the talker's identity based on the talker-specific articulatory characteristics reflected in the spoken utterance, while the other relies on the content of a spoken password or pass-phrase (such as a personal identification number (PIN), a social security number, or a mother's maiden name). In the former case, the test phrase may be in the open and even shared by talkers in the population, while in the latter case the password information is assumed to be known only to the authorized talker. We often refer to the former as speaker verification (SV) and the latter as Verbal Information Verification (VIV).

Research in speaker verification has a much longer history than VIV and has encompassed studies of the acoustic and linguistic features in the speech signal that carry the characteristics of the talker. In recent years, the dominance of the statistical, data-driven pattern-matching approach (see Speech Recognition above) has substantially changed the research landscape. The assumption is that a powerful statistical model, such as the hidden Markov model, can automatically "find" talker-specific characteristics in the speech spectral sequence, provided that a sufficient number of spoken utterances from the particular talker is available to allow such learning. Indeed, statistical modeling has made a similar impact on speaker verification as on speech recognition, based on an almost identical principle to that illustrated in Figure 1, with the vocabulary models replaced by talker models or voice patterns. Each of the stored talker-specific models has to be trained through an enrollment procedure before the system is put to use; in telephony applications these models may be stored remotely.

For verification, the talker makes an identity claim (e.g., an account number) and speaks a test phrase (or simply the account number itself). The system compares the input speech with the stored model or pattern for the claimed identity and, on the basis of a similarity score and a carefully selected decision threshold, accepts or rejects the speaker. The features useful for verification are those that distinguish talkers, independent of the spoken material; in contrast, the features useful for speech recognition are those that distinguish different words, independent of the talker. The decision threshold is often made dependent on the type of transaction that will follow the verification: clearly, a more stringent acceptance threshold is required for transferring money between accounts than for reporting the current balance of a checking account. Key factors affecting the performance of speaker verification systems are the type of input string, the features that characterize the voice pattern, and the type of transmission system over which the verification system is used. Best performance is achieved when sentence-long utterances are used in a relatively noise-free speaking environment; a state-of-the-art system using a text-dependent test sentence can achieve a 1-2% equal error rate (obtained when the threshold is adjusted so that the probabilities of false acceptance and false rejection become equal). Poorer performance is obtained for short, unconstrained utterances in a noisy environment (4-8% equal error rate using text-independent isolated words).
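The equal-error-rate operating point mentioned above can be illustrated with a short calculation: given similarity scores for genuine (true-talker) and impostor trials, one sweeps the acceptance threshold until the false-acceptance and false-rejection rates coincide. The Python sketch below uses such a brute-force sweep; the score arrays are assumed inputs, and a deployed system would choose its threshold from held-out data rather than from the test scores themselves.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Find the threshold where false acceptance ~= false rejection."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, best_threshold, best_eer = np.inf, None, None
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # false acceptance rate at threshold t
        frr = np.mean(genuine < t)     # false rejection rate at threshold t
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            best_threshold, best_eer = t, (far + frr) / 2
    return best_eer, best_threshold
```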

The challenge in speaker verification is to build adaptive talker models, based on a small amount of training, that perform well even for short input strings (e.g., one to four words). To achieve this goal, more research is needed in the area of talker modeling as well as in the area of robust analysis of noisy signals.

A VIV system stores the confidential information about a talker in a profile. When an identity claim is made, the talker is asked one question, or a series of questions, based on the stored information; for example, "Please say your PIN", "When is your birth date?", or "What is your mother's maiden name?" Using utterance verification techniques, which compare the spoken utterance against a composite speech model constructed from the phone-unit models corresponding to the expected password or pass-phrase, the system can achieve perfect (zero-error) verification after three question-and-answer turns.
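A rough sketch of the multi-turn decision logic is given below, assuming a hypothetical utterance-verification function that returns a confidence score for a spoken answer against the composite phone-model of the expected response. The three-turn limit and the accept-only-if-all-turns-pass policy follow the description above, while the scoring function and threshold are placeholders.

```python
def verify_claim(spoken_answers, expected_answers, verify_utterance,
                 threshold, max_turns=3):
    """Accept the identity claim only if every answered turn verifies
    against the composite model of the expected response."""
    for answer, expected in zip(spoken_answers[:max_turns],
                                expected_answers[:max_turns]):
        # verify_utterance is a hypothetical utterance-verification scorer.
        if verify_utterance(answer, expected) < threshold:
            return False
    return True
```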

VII. Speech Enhancement

During transmission from talker to listener, a speech signal may be degraded in a variety of ways. In this section we will discuss some of these degradations and some of the methods that have been devised to deal with them.

Most speech enhancement algorithms are concerned with alleviating the degradation that results from additive noise. Noise can get mixed into a speech signal in several ways. For instance, the speech may originate in a noisy environment, e.g., a noisy airport, railway station, or shopping mall; a microphone placed in such an environment picks up the sum of the desired speech and the ambient noise. Practical devices subject to such degradation include cellular handsets, speakerphones, and the like. Even low levels of ambient noise can become a problem in multi-point teleconferencing, because each participant in such a teleconference hears the ambient noise from all of the remote locations. Thus, assuming similar conditions at all locations, the noise would be about 10 dB higher in a teleconference among ten participants than in a two-way conversation. Noise can also enter a conversation electrically, in the transmission lines, filters, amplifiers, and other elements of a telephone circuit.
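The 10 dB figure can be checked with a short calculation, under the stated assumption of similar (independent, roughly equal-power) noise at every site: each listener in a ten-party call hears noise from nine remote locations instead of one, so the total noise power rises by about a factor of nine.

```latex
10 \log_{10}\!\left(\frac{9\,N_0}{N_0}\right) = 10 \log_{10} 9 \approx 9.5\ \mathrm{dB} \approx 10\ \mathrm{dB}
```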

In most such situations, all that is available to a speech enhancement algorithm is a single microphone signal representing the sum of the desired speech signal and the noise. Hence, to improve the quality of such degraded speech, the enhancement algorithm must operate “blindly”. That is, it must operate in the absence of any prior knowledge of the properties of the interfering noise. It must estimate the noise component from the noisy signal, and then attempt to reduce its perceptual effects.

There are, however, important applications in which additional information is available, besides the noisy speech signal. A prime example of such applications, and one that has been of interest for several decades, is echo cancellation on long distance telephone circuits. Due to impedance mismatches on such circuits, an undesirable echo of a speech signal gets added to the desired signal being transmitted. This echo is the “noise” that needs to be eliminated from the “noisy” signal. If the speech signal that is responsible for the echo is observable, then the echo can be estimated and subtracted from the noisy signal. Another example of such applications might occur when a microphone is used to record the speech of a talker in a room with a noise source – say an air conditioning duct. If it is possible to place a secondary microphone very close to the noise source, then its output can be used to estimate and cancel the noise present at the recording microphone.

A third type of speech enhancement is required to overcome the loss of intelligibility due to hearing loss, which in general increases with age. Hearing loss is becoming an increasingly important issue in telephony as life expectancy rises and older people make up a growing share of the population.

Finally, speech enhancement is of interest in situations where the speech may originate in a reasonably quiet environment, but is to be received by a listener who is in a noisy environment. An example of such a situation is announcements over the public address system at a railway platform. In such cases the noise itself is not under the control of a speech enhancer. However, if some estimate of the noise is available, then it is possible to pre-process the speech signal so as to improve its intelligibility as well as its spectral balance, when listened to in the noisy environment.

Let us take a closer look at each of these four applications of speech enhancement.

In the case when only the noisy speech signal is available, speech enhancement is accomplished by systems typified by the block diagram shown in Figure 6. The noisy speech is passed through a bank of contiguous band-pass filters that span the frequency range of the speech signal (say 200 Hz to 3500 Hz for telephone speech). At the output of the i-th filter, the short-term power Pi is an estimate of the power spectrum of the noisy speech at the center frequency of that filter. If the desired speech signal and the noise are uncorrelated, then Pi = Si + Ni, where Si is the power of the speech signal and Ni is the power of the noise at that frequency. Suppose for a moment that the noise power Ni can be estimated independently. (One way to estimate the noise power is to detect the time intervals during which speech is present and measure the output power only during the remaining intervals; several algorithms are known for detecting the presence of speech.) Then Si = Pi − Ni is an estimate of the signal power. The i-th channel signal is next adjusted by multiplying it by the gain factor Si/Pi, and the adjusted signals are added together to yield the enhanced speech. The output of a channel is thus attenuated more if it contains more noise, and this is what provides the enhancement.
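The channel-gain rule just described (closely related to classical spectral-subtraction and Wiener-filtering noise suppressors) can be sketched in a few lines of Python. The band-pass filtering and the speech-pause noise estimate are assumed to be provided elsewhere, and for simplicity the gain is computed once per channel over the analysis segment rather than frame by frame; the spectral floor is an illustrative safeguard.

```python
import numpy as np

def enhance_channels(channel_signals, noise_power, floor=0.01):
    """Apply the gain Si/Pi to each band-pass channel and recombine.

    channel_signals: list of equal-length arrays, one per band-pass filter
    noise_power:     list of per-channel noise power estimates Ni
    """
    enhanced = []
    for x, n in zip(channel_signals, noise_power):
        p = np.mean(x ** 2)             # short-term noisy power Pi
        s = max(p - n, floor * p)       # estimated speech power Si, floored
        gain = s / p if p > 0 else 0.0  # noisier channels are attenuated more
        enhanced.append(gain * x)
    return sum(enhanced)                # sum of adjusted channels = enhanced speech
```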

An alternative method is shown in Figure 7. Instead of introducing time-varying gains in the channels of a filter bank, this approach derives control signals of the type shown in Figure 1 from the noisy speech and then re-synthesizes the speech from those control signals. The first step, as before, is to estimate the signal power Si in each channel and, from it, the signal amplitude (the square root of the signal power); these estimated amplitudes characterize the time-varying filter of Figure 1. In a parallel path, the noisy signal is analyzed to determine the intervals in which the signal is voiced and those in which it is unvoiced, and during voiced intervals the fundamental frequency is determined. It turns out that such analysis can be performed reliably even when the noise is only 6 to 10 dB below the level of the speech signal. In this manner the characteristics of the excitation source of Figure 1 are determined, and the enhanced signal is then synthesized as in Figure 1.

All single-microphone enhancement systems in use today are variations of one of the two methods described in Figures 6 and 7. They differ in the characteristics of the band-pass filters, the method of estimating the noise power, the method of determining the fundamental frequency, etc.

These methods reduce the level of noise but also introduce artifacts of their own. The main artifact, known as "musical noise", consists of bursts of periodic signals with randomly varying frequency, and many people prefer the original noisy signal to the processed one. However, this type of artifact is almost inaudible when the noise in the original signal is roughly 10 dB or more below the speech, and parameters can be adjusted to trade off this type of distortion against residual noise. It should also be noted that, except in one study with specialized test signals, these methods have not been shown to increase intelligibility; their main advantage is that they reduce the fatigue that results from prolonged listening to noisy speech.

Algorithms for the second type of enhancement, echo cancellation or noise cancellation, are exemplified by the block diagram shown in Figure 8. In that figure, the signal S is the desired signal to be recorded or transmitted. To it, an undesired signal N' is added. The signal N' is not known; however, it is known that N' is the output of an (as yet unknown) linear filter H whose input is a known signal N. In the echo cancellation example mentioned earlier, N' is the echo generated by the signal N, and H is the filter representing the transfer function of the echo path. In the other example, N' is the noise at the recording microphone due to the noise N from the air conditioning duct as picked up by the secondary microphone, and H is the acoustic transfer function of the room from the duct to the recording microphone. An enhancement system for such a situation is shown in Figure 9; it is the setup of Figure 8 with an added portion shown in dotted lines. The box marked H* denotes an estimate of the filter H. If H* could be made equal to H, the noise would be removed and the output signal O would be the same as the desired noise-free signal S. To drive the transfer function of the filter H* toward that of H, an iterative gradient algorithm is used: during silent intervals of the speech signal S, the algorithm adjusts H* so as to drive the output O toward zero. If O can be made exactly zero in those intervals, then H* provides a good estimate of H. If the adaptation is then turned off whenever the speech signal is present, the noise portion of the signal O continues to be canceled, and O approximates the desired noise-free signal S.
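A minimal sketch of this adaptive cancellation scheme is given below in Python, with the unknown path H modeled as an FIR filter and a normalized LMS update standing in for the iterative gradient algorithm mentioned above. The filter length, step size, and the externally supplied speech-activity flag are illustrative assumptions, not values from the text.

```python
import numpy as np

def adaptive_canceller(noisy, reference, speech_active, taps=128, mu=0.1, eps=1e-8):
    """Estimate H* and subtract its output (the estimated echo/noise) from
    the noisy signal, adapting only when the desired speech S is absent."""
    h = np.zeros(taps)                        # current estimate H* of the path H
    output = np.asarray(noisy, dtype=float).copy()
    for k in range(taps, len(output)):
        x = np.asarray(reference[k - taps:k], dtype=float)[::-1]
        output[k] = output[k] - h @ x         # O = noisy minus estimated echo
        if not speech_active[k]:              # adapt only during speech pauses,
            h += (mu / (eps + x @ x)) * output[k] * x   # driving O toward zero
    return output
```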

Such algorithms have been highly successful in the echo cancellation application; indeed, several million such devices for echo cancellation have been deployed in various telephone systems. Their use for the other application, reduction of room noise, is far less common. First of all, there are far fewer occasions where this type of noise reduction is needed. Moreover, except at low frequencies, the noise at the recording microphone in a room cannot be accurately modeled as a single signal passed through a linear filter. However, in some special cases, e.g., in an airplane cockpit or a racing car, communication from the pilot or the driver can be considerably improved by this technique.

Let us now turn to the third source of degradation mentioned above, i.e., the degradation due to hearing loss. Figure 10(a) shows the threshold of hearing as a function of frequency for a person with normal hearing, along with the threshold of discomfort as a function of frequency; any sound louder than the latter is painful and can be harmful to the ear. Figure 10(b) shows the same two curves for a person with hearing loss. The figure is for illustration only; there are many types of hearing loss with different characteristics. But the main noteworthy feature is common to all of them: the threshold of hearing becomes higher than that of a normal ear, while the threshold of discomfort remains the same. Also, in general, at high enough intensities the loudness at any frequency is the same for a person with normal hearing as for one with hearing loss. This last phenomenon is known as loudness recruitment.

The vast majority of hearing aids in use today provide only linear amplification with frequency-dependent gain. From Figures 10(a,b) it is clear that this is not a satisfactory solution: if the gain is adjusted to be correct for a low-level sound, it will be too high for a high-level sound, and vice versa. What is needed is to compress the dynamic range of sound at each frequency to fit the reduced dynamic range caused by the hearing loss. Thus an amplifier is needed whose gain depends not only on the frequency but also on the signal level at that frequency, so as to provide the appropriate compression. Hearing aids are now available that filter the speech signal into a number of bands (usually limited to two or three by computational complexity) and provide a compression in each band that is appropriate for the hearing loss being corrected. This technique is called multi-band compression.
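A sketch of the per-band, level-dependent gain is shown below in Python. The frame length, compression ratio, and target level are made-up parameters for illustration, not fitted values for any actual hearing loss, and a real hearing aid would apply such a rule within each band produced by its band-splitting filters.

```python
import numpy as np

def compress_band(band, ratio=3.0, target_db=-20.0, frame=256):
    """Apply a level-dependent gain to one band so that its dynamic range
    is compressed by `ratio` about the target level (dB re full scale)."""
    out = np.asarray(band, dtype=float).copy()
    for start in range(0, len(out) - frame + 1, frame):
        seg = out[start:start + frame]
        level_db = 10 * np.log10(np.mean(seg ** 2) + 1e-12)
        gain_db = (target_db - level_db) * (1 - 1 / ratio)  # softer frames get more gain
        out[start:start + frame] = seg * 10 ** (gain_db / 20)
    return out
```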

Much improvement is still possible in this type of hearing aid. We do not yet know the optimal control strategy for the signal-dependent gain, nor do we know the best procedures for fitting such a hearing aid. The fundamental reason is that the information summarized in Figures 10(a,b) is obtained from tests with stationary signals and is inaccurate for a signal like speech, which is continually changing. Note also that the high gain required for low-level sounds amplifies the noise as well; for this reason hearing aids tend to be unsatisfactory in noisy environments. In the future, with a better understanding of the hearing process and with the availability of smaller, faster digital processors with low power requirements, hearing aids that combine noise reduction with signal-dependent gain will no doubt become available and will provide much better solutions to the problem of hearing loss.

Finally, let us consider the pre-processing of a speech signal to make it more intelligible in the presence of ambient noise. It turns out that noise induces something akin to hearing loss in a person with normal hearing: at any frequency, the threshold of hearing is elevated, while a signal sufficiently above the noise level sounds almost as loud as in the absence of noise. The amount of this loss depends on the noise level at each frequency. Thus, if the clean speech signal is amplified with a gain that has the appropriate dependence on frequency and intensity, it can be made much more intelligible when listened to in the presence of noise. Of course, the speech will still sound quite noisy, since the noise itself is unchanged.

VIII. Concluding Remarks

This cursory overview of digital speech processing has aimed to highlight recent advances, current areas of research, and key issues for which new fundamental understanding is needed. Future progress in speech processing will surely be linked closely with advances in computation, microelectronics, and algorithm design.


[Figure labels (source-model block diagram and spectrogram): pulse generator, noise source, time-varying filter, amplifier, fundamental frequency F0, speech output; spectrogram axes: frequency (kHz) vs. time (sec).]