
9

Speech Processing

The use of digital speech has become widespread, and it provides enormous advantages in some applications over analog voice signals. One of the most important applications of digital speech is in long distance telephone communications. Digital speech can be passed through an almost unlimited number of repeaters with negligible degradation. Analog signals, on the other hand, suffer a small amount of degradation each time they are passed through a repeater.

Digital speech can easily be switched using multiplexers, and a large number of digital signals can be combined on a single link by time division multiplexing (TDM). Time division multiplexing eliminates the need for banks of analog filters required for frequency division multiplexing (FDM), which was used in older, analog systems.

Digital speech has several other advantages. For example, in military systems it can easily be encrypted for secure voice transmission. It is much easier to scramble the bits of a digital signal than to scramble an analog signal. Another application where digital speech has an advantage is in military frequency hopping systems. Here, it is possible to packetize the data and still eliminate the undesirable clicks caused by frequency hopping.

Another advantage in some links, where the transmission path is poor, is that powerful error correction codes can be used to reduce the error rate as much as required, whereas with analog transmission, distortions and noise in the transmission path cannot be removed. Digital speech may also be stored and retrieved without loss of quality.


Digital speech has a major disadvantage, one that is serious in many applications. High-quality digital speech requires a high bit rate that, if transmitted, results in a significantly wider bandwidth than the analog speech from which it originated. Sophisticated digital signal processing techniques can be employed to reduce the data rate, but they produce varying degrees of quality degradation. This chapter addresses several of the more widely used methods for digitizing speech. Because of the different requirements and constraints, we find that widely differing methods are used.

The telephone industry has requirements for reasonably high-quality speech at low cost. The bandwidth, while important, is not a critical parameter. Pulse code modulation (PCM) with companding (a process in which compression is followed by expansion) has been adopted for this service. The sample rate is 8 ks/s, and 8-bit words are used, resulting in a data rate of 64 kbps. At baseband, this data rate would require approximately 32 kHz of bandwidth, compared to 3 kHz for the analog signal. This is an increase of approximately 10:1. More sophisticated modems, as discussed in Chapter 8, can be used to reduce the bandwidth significantly. Nevertheless, because the data rate is rather directly related to cost, there is much emphasis on reducing the data rate. As a result, 32 kbps ADPCM and some 16 kbps systems are also used. In many applications, the permissible bandwidth is severely restricted, and this leads to requirements for much lower data rates. The radio spectrum is a prime example of how severe crowding of the available spectrum space has become. Many services are channelized on 25 kHz centers, and it is necessary to restrict the spectrum to somewhat less than this value. One method used in these applications is continuously variable slope delta modulation (CVSD). The hardware required is relatively simple, and the data rate is often 16 kbps. While the speech quality is not as good as 64 kbps PCM, it is easily recognizable, and an error rate of 5 to 10 percent can be tolerated before it becomes unintelligible. Other systems used are adaptive differential PCM (ADPCM), with a data rate of 32 kbps, and adaptive predictive coding at 9.6 kbps.

A relatively new system, referred to as codebook excited linear prediction (CELP), is presently being investigated for military and commercial applications as a near toll-quality system. The data rate is in the 4 to 9.6 kbps range. The IS-54 standard (VSELP) is a system using CELP techniques, and it has a data rate of 8 kbps (see EIA/TIA IS-54 Interim Standard [87]). Federal Standard 1016 [52] defines a CELP system with a data rate of 4.8 kbps.

Still lower data rates can be obtained with linear predictive vocoders (LPC), where a data rate of 2.4 kbps is common. The quality of LPC speech leaves something to be desired but, in a quiet environment, it is quite understandable. The error rate that can be tolerated is only a few percent, however.

Still lower bit rates can be obtained by special encoding of the LPC parameters using vector quantization techniques (see Rebolledo et al. [88]). Data rates of 400 to 800 bps are possible. As one would expect, there is additional degradation over 2,400 bps LPC. A great deal of research is being conducted on low data rate speech algorithms at the time of this writing. Another important area of research is directed toward obtaining toll-quality speech, similar to PCM, at modest data rates in the 4,800 to 9,600 bps range. Obviously, any reduction in the data rate required to obtain toll-quality speech can result in significant additional channel capacity and reduce the cost of communications.

The remainder of this chapter describes several of the more widely used methods of digitizing speech.

PULSE CODE MODULATION

Pulse code modulation is really nothing more than digitizing speech with an A/D converter, as discussed in Chapter 3. If the digital speech is to be processed locally, there may be no particular motivation to minimize the data rate. A sample rate in the vicinity of 16 ks/s with a 12-bit A/D converter may then be a good choice. A relatively simple anti-aliasing filter can be used with a stopband above 8 kHz. The passband should extend at least to 3 kHz.

As indicated previously, the telephone system has adopted a sample rate of 8 ks/s. This places a more severe requirement on the anti-aliasing filter, and a switched capacitor filter, implemented in an analog IC along with the A/D converter, is often used. A lowpass filter in series with a highpass filter is sometimes used in the PCM chips. The passband extends from approximately 200 Hz to 3.4 kHz. The attenuation is on the order of 14 dB at 4 kHz, increasing to over 32 dB by 4.6 kHz. Since only eight bits are used, the system leaves something to be desired with regard to dynamic range. Noise is reduced using a nonlinear voltage shaping function preceding the A/D converter. This provides greater resolution for small signals at the expense of large signals. The non-uniform quantization reduces the noticeable quantization noise at small signal levels, where it is most apparent. Most North American systems use the μ-law curve, which is given by

|V_OUT| = V_MAX [ln(1 + μ|V_IN|/V_MAX) / ln(1 + μ)],  μ = 255   (9.1)

The sign of V_IN is attached to V_OUT. The output involves the ratio of two logs, so any base logarithm gives identical results. Equation (9.1) is shown graphically in Fig. 9.1. An A/D converter with this characteristic is referred to as a CODEC.

FIGURE 9.1 μ-law characteristic for CODEC (CODEC output voltage vs. CODEC input voltage, both normalized to the range −1.00 to +1.00)

The digital signal must be linearized either prior to performing any filtering operations or before it is converted back to an analog signal. The linearizing function for Eq. (9.1) is given in Eq. (9.2). As before, the sign of V₁ is attached to V₂.

|V₂| = (V_MAX/μ) [(1 + μ)^(|V₁|/V_MAX) − 1]   (9.2)

European systems have adopted the A-law curve, which is similar but not identical to the μ-law. The A-law is given in Eqs. (9.3) and (9.4).

|V_OUT| = V_MAX [A(|V_IN|/V_MAX) / (1 + ln A)],  for 0 ≤ |V_IN|/V_MAX ≤ 1/A   (9.3)

|V_OUT| = V_MAX [(1 + ln(A|V_IN|/V_MAX)) / (1 + ln A)],  for 1/A ≤ |V_IN|/V_MAX ≤ 1   (9.4)

A typical value for A is 87.6.
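The companding curves translate directly into code. The following is a minimal Python/NumPy sketch (not from the original text) of μ-law compression per Eq. (9.1), the linearizing function of Eq. (9.2), and a crude 8-bit quantization; the function names and the v_max normalization are illustrative assumptions.

import numpy as np

MU = 255.0   # mu in Eq. (9.1)

def mu_compress(v, v_max=1.0):
    # mu-law curve of Eq. (9.1); the sign of V_IN is attached to V_OUT
    return np.sign(v) * v_max * np.log1p(MU * np.abs(v) / v_max) / np.log1p(MU)

def mu_expand(v, v_max=1.0):
    # linearizing function of Eq. (9.2)
    return np.sign(v) * (v_max / MU) * ((1.0 + MU) ** (np.abs(v) / v_max) - 1.0)

# round-trip through a hypothetical 8-bit CODEC word (sign + 7 magnitude bits)
x = 0.02 * np.sin(2 * np.pi * 1000 * np.arange(160) / 8000)  # quiet 1 kHz tone at 8 ks/s
code = np.round(mu_compress(x) * 127) / 127
y = mu_expand(code)

Because the compression expands small signals before quantization, the quiet tone above retains far more resolution than it would with uniform 8-bit quantization.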


DIFFERENTIAL PULSE CODE MODULATION

When an analog signal is sampled, it is often found that adjacent samples are not significantly different from one another. This implies that the high frequency content is relatively small. When this is the case, the data rate can be reduced by transmitting only the differences from the previous values rather than the absolute value of each point. A prediction filter is used in the receiver to reconstruct the original input from the transmitted differences. A block diagram of the receiver is shown in Fig. 9.2. The predictor may be as simple as a single delay, which represents an integrator that adds the old value to the new difference input. A similar prediction filter is used in the transmitter, as shown in Fig. 9.3.

The prediction filter in both the transmitter and the receiver have the same input. Therefore, the received signal is also reproduced at the transmitter. This signal is subtracted from the analog input to produce the next difference to be transmitted.

The block diagrams show a digital implementation, and we have assumed that the input signal has already been digitized with sufficient accuracy for audio, perhaps to 12 bits. An alternate implementation is to quantize the signal after the summation. The output of the prediction filter must then be reconstructed to produce an analog signal prior to subtracting it from the analog input. This implementation with a one-bit word forms the basis of delta modulation, which is discussed in the next section.

FIGURE 9.2 DPCM receiver block diagram (received differences Δe, K-bit words, plus the predictor output reconstruct the digital signal, N > K bits)

FIGURE 9.3 DPCM transmitter block diagram (digitized analog input, N bits, minus the predictor output yields the transmitted difference Δe, K < N bits)
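As a concrete illustration of the block diagrams, here is a minimal Python sketch (an assumption-laden illustration, not the text's implementation) of a DPCM transmitter and receiver using the simplest predictor, a single delay acting as an integrator; the step size and word length are arbitrary choices.

import numpy as np

def dpcm_transmit(x, n_bits=4, step=0.05):
    # transmitter of Fig. 9.3: quantize the difference between the input and
    # the predictor output; the predictor is a single delay (an integrator)
    levels = 2 ** (n_bits - 1)
    predicted, codes = 0.0, []
    for sample in x:
        q = int(np.clip(round((sample - predicted) / step), -levels, levels - 1))
        codes.append(q)
        predicted += q * step   # the transmitter tracks the receiver's value
    return codes

def dpcm_receive(codes, step=0.05):
    # receiver of Fig. 9.2: accumulate the received differences
    predicted, out = 0.0, []
    for q in codes:
        predicted += q * step
        out.append(predicted)
    return np.array(out)

x = np.sin(2 * np.pi * 300 * np.arange(200) / 8000)  # slowly changing input
y = dpcm_receive(dpcm_transmit(x))

Note that the transmitter accumulates the quantized differences, not the true ones, so transmitter and receiver predictors stay in step and quantization error does not build up.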

DELTA MODULATION

Delta modulation is often used in applications where a lower data rate is required than that used for PCM, but where it is desirable to use a single IC to digitize the signal. In these applications, CVSD is often used. Before discussing CVSD, we will first describe delta modulation. It is then a small step to introduce the variable slope parameter.

A block diagram of a delta modulator is shown in Fig. 9.4. The companion receiver is shown in Fig. 9.5. Referring to Fig. 9.4, the analog input is compared with the output of the integrator. If the integrator output is too low, a +V signal is generated. This causes the integrator output to increase at a rate S = V/RC, where RC is the time constant of the integrator. If the integrator output is still too low at the next clock cycle, another +V output is generated. Otherwise, a −V output is produced.

FIGURE 9.4 Block diagram of delta modulator (analog input and analog integrator output feed a comparator; a clocked latch produces the transmitted bit stream)

FIGURE 9.5 Block diagram of delta modulator receiver (received bit stream drives an analog integrator to produce the recovered audio output)

The slope, which we will call S, obviously must be as large as the highest slope of the incoming signal if the system is to track. A large value of S results in higher quantization noise due to "hunting" around the correct value for slowly changing signals. Therefore, we wish to make S large enough but not excessively so.

For a sine wave input, e = A cos(2πf_m t), the slope, found by differentiating, is

de/dt = −2πA f_m sin(2πf_m t)

The maximum slope occurs at t = 1/(4f_m). Setting this value equal to the slope, we have

S = 2πA f_m, or A = S/(2πf_m)   (9.5)

The noise power can be found by noting that the smallest step size is

δ = ST − (−ST) = 2ST = 2S/f_s

where

T = the sample time
f_s = the sample rate

As determined in Chapter 3, the noise power is given by

N = δ²/12 = S²/(3f_s²)   (9.6)

for a uniformly distributed variable. Since the quantization noise occurs in a system with feedback, the output noise spectrum from the modulator has a slope proportional to frequency. After the integrator in the demodulator, however, it is flat. Equation (9.6) represents the noise at the demodulator. The maximum signal power is

P_S = A²/2

Substituting for A from Eq. (9.5) gives

P_S = S²/(8π²f_m²)   (9.7)

Combining Eq. (9.7) with Eq. (9.6), the maximum ratio of signal to total noise power is

P_S/N = 3f_s²/(8π²f_m²)   (9.8)

The noise is uniformly distributed over the frequency range from 0 to f_s/2. Therefore, the noise density is Eq. (9.6) divided by f_s/2, and the maximum signal-to-noise density ratio is

P_S/N₀ = 3f_s³/(16π²f_m²)   (9.9)

As can be seen from this equation, the signal-to-noise density ratio improves as the third power of the sample rate. It is often undesirable to increase the sample rate, however, since this implies a larger transmission bandwidth. Another approach is to use a small slope when the signal is small, and to increase the slope when the rate of change of the signal becomes large. As with companded PCM, the quantization noise is not as objectionable when a large signal is being digitized. This is the essence of CVSD, discussed in the next section.
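The delta modulator loop is compact enough to simulate directly. The sketch below (illustrative Python; the sample rate and amplitude are arbitrary) implements Figs. 9.4 and 9.5 and sets the slope to the tracking limit S = 2πA f_m from Eq. (9.5).

import numpy as np

def delta_modulate(x, fs, slope):
    # modulator of Fig. 9.4: one bit per clock; the integrator moves by
    # +/- S*T each sample, where S is the fixed slope in units per second
    step, integrator, bits = slope / fs, 0.0, []
    for sample in x:
        bit = 1 if sample >= integrator else 0
        integrator += step if bit else -step
        bits.append(bit)
    return bits

def delta_demodulate(bits, fs, slope):
    # receiver of Fig. 9.5: the same integrator, normally followed by a lowpass filter
    step = slope / fs
    return np.cumsum([step if b else -step for b in bits])

fs, fm, amp = 16000, 300, 0.2
x = amp * np.cos(2 * np.pi * fm * np.arange(fs // 10) / fs)
slope = 2 * np.pi * amp * fm          # tracking limit of Eq. (9.5)
y = delta_demodulate(delta_modulate(x, fs, slope), fs, slope)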

CONTINUOUSLY VARIABLE SLOPE DELTA MODULATION

As the name implies, continuously variable slope delta modulation changes the slope in a delta modulator as needed. One way to accomplish this is to monitor the output bits and, if more than three or four consecutive ones are transmitted, the slope is increased. Likewise, if more than three or four consecutive zeros are transmitted, the rate of change of the negative slope is increased. A block diagram of the resulting CVSD modulator is shown in Fig. 9.6. The output data is also read into the local shift register. When the output consists of three consecutive ones or three consecutive zeros, the slope integrator output increases. Otherwise, it decays toward a minimum fixed value.

FIGURE 9.6 Block diagram of CVSD modulator (comparator and clocked latch as in Fig. 9.4, with a shift register and ones/zeros detection logic driving a slope integrator)

The receiver for CVSD data has a similar shift register and decoding network, so that the voltage, e₂, is approximately reproduced at the receiver. CVSD modulators have been used successfully in various equipment, including military communication transmitters and receivers. The data rate for military equipment is often 16 kbps. The speech is reasonably good and is easily understood in a quiet environment, but it is not as good as 64 kbps PCM. An error rate of 5 to 10 percent can be sustained before the speech becomes unintelligible. CVSD tends to be more robust with respect to errors than PCM, because an error only affects the slope for one sample time. A PCM error, on the other hand, generates "pops" in the received signal, particularly in the most significant bit positions. CVSD has also been used at 32 kbps to reproduce reasonably high-quality speech.
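A hedged sketch of the CVSD adaptation logic just described: the step grows when the last three output bits agree and otherwise decays toward a fixed minimum. The growth factor, decay constant, and step limits are illustrative assumptions, not values from the text.

import numpy as np

def cvsd_modulate(x, step_min=0.002, step_max=0.1, decay=0.98, growth=1.3, run=3):
    # when the last `run` output bits agree, the slope integrator output grows;
    # otherwise it decays toward the fixed minimum, as described above
    integrator, step = 0.0, step_min
    bits, history = [], []
    for sample in x:
        bit = 1 if sample >= integrator else 0
        bits.append(bit)
        history = (history + [bit])[-run:]
        if len(history) == run and len(set(history)) == 1:
            step = min(step * growth, step_max)   # coincidence detected
        else:
            step = max(step * decay, step_min)
        integrator += step if bit else -step
    return bits

# The receiver runs the same shift register and step logic on the received
# bits, so its integrator approximately reproduces the transmitter's e2.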

LINEAR PREDICTIVE CODING

There are many advantages to digitizing speech at a low data rate, and LPC is an effective way to dramatically reduce the required data rate. In some services (e.g., telephone communications), the advantage is primarily economic, especially for long distance communications. It is theoretically possible to multiplex 26 users in a 64 kbps PCM channel with 2,400 bps LPC coded speech. This has not been done to any great extent. A more accepted method of utilizing LPC at 2,400 bps is to multiplex four channels into a 9,600 bps modem using QAM and to transmit the signal over an analog telephone channel. Unfortunately, 2,400 bps LPC, although acceptably intelligible, has a certain machine-like quality that has prevented its widespread acceptance by the general public.

LPC is most useful in other systems where it is simply not possible to obtain the needed bandwidth for PCM. A good example of this is the HF radio channel, where the signal must often be contained in a 3 kHz bandwidth. There are several ways to transmit 2,400 bps digital signals in a 3 kHz channel, as discussed in Chapter 8.

The linear predictive technique is most often used to digitize speech at the 2,400 bps rate, although some success has also been achieved as low as 400 bps. A small amount of error correction coding can be included in the 2,400 bps data stream, and an error rate of 1 or 2 percent can reasonably be tolerated.

The basic principles of LPC speech encoding are discussed in the following paragraphs (see Rabiner and Schafer [36, 44]).* A military standard called LPC-10e has evolved, and some of the material in this section is drawn from that approach (see Tremain [38]). LPC is not so much a method of digitizing a signal as a technique for analyzing speech to determine certain parameters and representing them digitally. To understand the system, it is helpful to look briefly at the human vocal tract first.

During voiced sounds, the vocal cords vibrate with a specific pitch frequency. The output from the vocal cords resembles a pulse excitation, which is rich in harmonics. This signal is modulated by the cavities of the throat, mouth, and nose. The various harmonics are increased or decreased relative to one another to form the various sounds. The resonant frequencies of the vocal system are called formants and represent second-order (complex pole pair) resonances, as seen in Eq. (9.12). The average pitch excitation for a male voice is about 130 Hz. The female voice is approximately one octave higher, so LPC intelligibility scores are often slightly better for male speakers.

In addition to voiced sounds, there are also unvoiced (fricative) sounds, such as the f in "five" and the s in "six." In this case, the vocal cords do not vibrate. Instead, the excitation to the vocal tract resembles white noise and is caused by a turbulent air flow. Therefore, one may postulate a speech model such as shown in Fig. 9.7.

The basic idea underlying linear predictive coding is to approximate the vocal tract filters on a short-term basis (20 to 30 ms) and provide the right excitation to the synthesis filter to approximate speech. We can adequately model the cavities

FIGURE 9.7 Approximate equivalent circuit of human vocal tract (output: speech signal)

*Parts of this section are adapted from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, pp. 399-402, 411, 413-414, 443, 444. Copyright © 1978. Adapted by permission of Prentice Hall, Inc., Englewood Cliffs, NJ.

with an all-pole filter, which we will call the synthesis filter. We will designate the transfer function by 1/A(z). The parameters of the synthesis filter can be derived from a digital transversal filter whose coefficients have been optimized to predict the present speech sample from past values. We will spend a considerable amount of time showing why this is possible and how to find the coefficients of the prediction and synthesis filters. Subsequently, we will also discuss algorithms to determine the pitch period, how to make voiced/non-voiced decisions, and so on.

We will begin by letting the excitation signal from the vocal cords be represented by v(n). For voiced sounds, when the vocal cords are vibrating, v(n) can be approximated by a series of impulses. In the z plane, the excitation can be represented by

V(z) = σ Σ_{n=0}^{∞} (z^{−k})^n   (9.10)

where the pitch period is k sample times. For unvoiced sounds, v(n) is approximated by white noise.

The glottal shaping model can be approximated by the transfer function *

(9.11)

The majority of the shaping of the acoustic spectrum is accomplished by the vocal tract, consisting of the nose and mouth. This shaping is approximated by an all-pole filter model of the form

H(z) = K / ∏_{j=1}^{n} [1 − 2e^{−c_j T} cos(B_j T) z^{−1} + e^{−2c_j T} z^{−2}]   (9.12)

The jth formant frequency is B_j/(2π), and the formant bandwidth is c_j/π.

*Equations (9.11), (9.12), (9.15), (9.19), (9.21), (9.22), and (9.25) were adapted from J.D. Markel and A.H. Gray, Linear Prediction of Speech, by permission. Copyright © 1976 by Springer-Verlag.

The speech output is given by

S(z) = V(z) G(z) H(z)   (9.13)

We now lump the glottal shaping filter, the vocal tract, and the additional lip radiation shaping into a single synthesis filter, 1/A(z). The resulting model is shown in Fig. 9.8. Using an all-pole model approximation, we have

S(z) / [G V(z)] = 1 / A(z) = 1 / (1 − Σ_{i=1}^{P} a_i z^{−i})   (9.14)

where

A(z) = 1 − Σ_{i=1}^{P} a_i z^{−i}   (9.15)

The filter 1/A(z) is called the synthesis filter, and

S(z) = [1/A(z)] G V(z)

The parameters needed to characterize a speech sample are the values of the "a" coefficients, the gain factor G, the pitch period, and the voiced/unvoiced decision.

FIGURE 9.8 Simplified equivalent circuit of vocal tract (an impulse source with the pitch period and a white noise source feed a voiced/unvoiced switch; variable gain G and the digital filter 1/A(z) produce s(n))

These parameters all vary slowly with time. If the filter size, P, is large enough, the all-pole model can accurately represent the speech sample. For most LPC applications, P is chosen to be 10. The resulting system is referred to as LPC-10.

We will now study the two most popular ways to estimate the filter parameters (the "a" terms). The first of these is the autocovariance method, which is used in the US government and NATO system. The second is the autocorrelation method. The performance of the methods is fairly similar, and the particular method used is often a matter of the designer's choice. The basic difference is that the speech data is windowed for the autocorrelation method so that the values are zero at the edges of the analysis window. This results in a correlation matrix with equal diagonal elements that can be inverted more easily. A recursion process can also be used to avoid inversion altogether.

The covariance method, on the other hand, has a small performance advantage because it does not throw away some of the speech samples by windowing. However, it requires more computation.

One of the major advantages of the all-pole speech model is that the coefficients can be estimated in a straightforward manner. We will now proceed to describe the methods used.

From Eq. (9.14), the speech samples are related to the excitation by

S(z) = Σ_{i=1}^{P} a_i S(z) z^{−i} + G V(z)   (9.16)

From Eq. (9.15), we may write

1 − A(z) = Σ_{i=1}^{P} a_i z^{−i}

Substituting this in Eq. (9.16) gives

S(z) = [1 − A(z)] S(z) + G V(z)

or

S(z) = G V(z) / A(z)   (9.17)

We may also write the time domain response by taking the inverse z-transform of Eq. (9.16), which gives

s(n) = Σ_{i=1}^{P} a_i s(n−i) + G v(n)   (9.18)

To use this model, we must find a reasonable method to determine the values of the "a" terms. Knowing these values, along with the excitation v(n) and the gain, will then allow us to reconstruct the speech samples at the receiver.

To determine the values of the "a" terms, we first define a prediction filter to estimate the present value of the speech from the past samples. The present value cannot be perfectly predicted, of course, but a reasonable approximation can be made. We let our estimate of the present value be defined as

ŝ(n) = Σ_{i=1}^{P} α_i s(n−i)   (9.19)

Taking the z-transform gives

Ŝ(z) = Σ_{i=1}^{P} α_i z^{−i} S(z)

The prediction filter is then

P(z) = Ŝ(z) / S(z) = Σ_{i=1}^{P} α_i z^{−i}   (9.20)

The error between the actual sample point and the predicted sample point is referred to as the prediction residual. This error is given by

e(n) = s(n) − ŝ(n)   (9.21)

Substituting Eq. (9.19), we have

e(n) = s(n) − Σ_{i=1}^{P} α_i s(n−i)   (9.22)

And taking the z-transform,

E(z) = S(z) [1 − Σ_{i=1}^{P} α_i z^{−i}]   (9.23)

We will use this equation later to optimize the α values to obtain the minimum RMS error, e(n).

Now let us suppose that we know the correct values for the "a" terms in Eq. (9.16) and set the α terms equal to these values. Let

α_i = a_i,  for i = 1, 2, … P   (9.24)

Making this substitution in Eq. (9.23) gives

E(z) = S(z) [1 − Σ_{i=1}^{P} a_i z^{−i}]

Now, substituting for A(z) from Eq. (9.15), we have

E(z) = S(z) A(z)   (9.25)

Now consider Eq. (9.17) and substitute for S(z). This gives

E(z) = G V(z)   (9.26)

This seems plausible, since the nonpredictable part of the speech is due to the excitation.

Now, if we use Eq. (9.22) to optimize the α terms for minimum error, it will be found that the residual error, e(n)_MIN, results from the non-predictable part of the speech which, of course, is the excitation, G v(n). This is precisely the error we obtained in Eq. (9.26) if we use the correct filter values for the all-pole model (the "a" terms). Therefore, the coefficients giving the minimum prediction error in Eq. (9.22) are also the coefficients of the all-pole speech model proposed in Eq. (9.14).

FIGURE 9.9 Block diagram of the speech analysis and synthesis process for a vocoder (a vocal cord impulse or white noise V(z) drives the vocal tract acoustic filter G(z)H(z) ≈ 1/A(z) to produce S(z); the analysis filter A(z) yields the prediction residual E(z); an artificially generated residual drives the all-pole synthesis filter 1/A(z) to give synthesized speech Ŝ(z))

We now take a moment to review what we have presented thus far before calculating the coefficient values of the prediction filter. Referring to Fig. 9.9, we have approximated the vocal tract as an all-pole filter, 1/A(z), with as yet unknown coefficients. The output of the vocal tract filter, S(z), is passed through the analysis filter, A(z) (also called an inverse or whitening filter). The output, called the prediction residual, resembles the excitation signal V(z) to the degree that the all-pole filter approximates the vocal tract. We then approximate the prediction residual by an impulse train or by white noise, as shown in Fig. 9.8. This signal is shaped by the all-pole synthesis filter to give a reconstructed approximation of the speech at the receiver.

We will now derive the equations necessary to solve for the coefficients, α, of Eq. (9.22) on a short-term basis (i.e., for a short segment of the speech waveform) to minimize the error. If the coefficients were constant, the longer the analysis window, the more accurately the coefficients could be determined. If the window is made too long, however, the coefficients change appreciably during the analysis interval. A compromise is therefore required, which for LPC-10 is often chosen to be 22.5 ms.

The short-term average prediction error is defined to be

E = Σ_{n=n₀}^{n₁} e²(n)   (9.27)

where

E = the error for the interval n₀ ≤ n ≤ n₁

Substituting for e(n) from Eq. (9.21) gives

E = Σ_{n=n₀}^{n₁} [s(n) − ŝ(n)]²   (9.28)

and, from Eq. (9.19),

E = Σ_{n=n₀}^{n₁} [s(n) − Σ_{i=1}^{P} α_i s(n−i)]²   (9.29)

Physically, the error signal to be minimized can be visualized as shown in Fig. 9.10. E is the mean square error averaged over the interval n₀ ≤ n ≤ n₁. The quantity e(n) is referred to as the prediction residual, as noted earlier. We will see later that the prediction residual contains a considerable amount of useful information, particularly about the excitation.

FIGURE 9.10 Block diagram of analysis filter showing physical interpretation of error (input s(n), output e(n))

The filter with a transfer function

A(z) = 1 − Σ_{i=1}^{P} α_i z^{−i}

is called the analysis filter. The filter with a transfer function

P(z) = Σ_{i=1}^{P} α_i z^{−i}

is called the prediction filter.

The minimum value of E is found by taking the partial derivatives of E with respect to each of the coefficients and setting the partial derivatives to zero. Hence,

∂E/∂α_i = 0,  i = 1, 2, … P   (9.30)

The differentiation is relatively simple to perform and is similar to the adaptive filter analysis performed in Chapter 8. Differentiating Eq. (9.29) gives

∂E/∂α_i = −2 Σ_{n=n₀}^{n₁} [s(n) − Σ_{j=1}^{P} α_j s(n−j)] s(n−i) = 0,  i = 1, 2, … P   (9.31)

This leads to the set of simultaneous equations

Σ_{n=n₀}^{n₁} s(n) s(n−i) = Σ_{n=n₀}^{n₁} [s(n−i) Σ_{j=1}^{P} α_j s(n−j)],  i = 1, 2, … P

Interchanging the two summations in the right-hand term gives

Σ_{n=n₀}^{n₁} s(n) s(n−i) = Σ_{j=1}^{P} α_j Σ_{n=n₀}^{n₁} s(n−i) s(n−j),  1 ≤ i ≤ P   (9.32)

We now define the covariance matrix to be

φ(i, j) = Σ_{n=n₀}^{n₁} s(n−i) s(n−j)   (9.33)

Note that the summation requires values of s outside of the interval n₀ ≤ n ≤ n₁, since i and j take on values up to P, which is typically 10.

The autocovariance matrix calculation can be reduced somewhat by first calculating the values φ(0, j) and then using the end correction procedure [38]

φ(i+1, j+1) = φ(i, j) + s(n₀−1−i) s(n₀−1−j) − s(n₁−i) s(n₁−j)

Given the values of φ(0, j) for j = 1, 2, … P, the values φ(1, j) can be found. Then the values φ(2, j), and so on, can be calculated.

Now, substituting Eq. (9.33) into Eq. (9.32) gives

Σ_{j=1}^{P} α_j φ(i, j) = φ(0, i),  i = 1, 2, … P   (9.34)

This represents a set of P simultaneous equations that can be solved for the coefficients, α_j.

Because of the way the autocovariance matrix was formed, the values are symmetrical about the diagonal. As a result, the equations can be solved using a technique that is less computationally intensive than a general matrix inversion. Later, we will discuss a simpler method that can be used if the data is windowed so that the values are zero at the beginning and at the end of the analysis window. This method, called autocorrelation, results in a symmetrical matrix in which the diagonal elements are all equal. By contrast, the autocovariance method does not result in identical diagonal elements.

The standard adopted by the US Government and NATO uses the autocovariance method with the correlation matrix formed as described in Eq. (9.33). There are also procedures that have been developed to optimally choose the placement of the analysis window within the speech frame. These will be discussed later, along with conventions for prefiltering the speech. For the present, we will concentrate on the matrix inversion process for the autocovariance method.

Autocovariance Method

Equation (9.34) defines a series of simultaneous equations. For example, if we choose P = 4, the equations are as follows:

Page 20: Digital Signal Processing in Communication Systems || Speech Processing

Speech Processing 509

i = 1: α₁φ(1,1) + α₂φ(1,2) + α₃φ(1,3) + α₄φ(1,4) = φ(0,1)

i = 2: α₁φ(2,1) + α₂φ(2,2) + α₃φ(2,3) + α₄φ(2,4) = φ(0,2)

i = 3: α₁φ(3,1) + α₂φ(3,2) + α₃φ(3,3) + α₄φ(3,4) = φ(0,3)

i = 4: α₁φ(4,1) + α₂φ(4,2) + α₃φ(4,3) + α₄φ(4,4) = φ(0,4)

For a practical vocoder, as discussed earlier, P is usually chosen to be 10. For illustration purposes, in this chapter we will often use P = 4. This serves to demonstrate the principles without adding undue mathematical detail.
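For readers who want to see the normal equations in code, the sketch below (illustrative Python, not from the text) builds the autocovariance matrix of Eq. (9.33) and solves Eq. (9.34) with a general linear solver; a real-time coder would instead exploit the symmetry, as discussed next. The function name and the requirement n₀ ≥ P are assumptions of this illustration.

import numpy as np

def covariance_lpc(s, n0, n1, P=10):
    # build the autocovariance matrix of Eq. (9.33); samples before n0 are
    # required, so n0 >= P is assumed
    s = np.asarray(s, dtype=float)
    n = np.arange(n0, n1 + 1)
    phi = np.empty((P + 1, P + 1))
    for i in range(P + 1):
        for j in range(P + 1):
            phi[i, j] = np.sum(s[n - i] * s[n - j])
    # solve Eq. (9.34): sum_j alpha_j phi(i, j) = phi(0, i) for i = 1..P;
    # phi[1:, 1:] is symmetric positive definite, so Cholesky also applies
    return np.linalg.solve(phi[1:, 1:], phi[0, 1:])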

Cholesky Decomposition*

Since the analysis equations must be solved in real time, it is very desirable to minimize the amount of computation required. As indicated previously, this can be accomplished by exploiting the symmetrical properties of the matrix. The autocovariance matrix is positive definite and symmetrical about the diagonal. The diagonal elements are in general not equal, but are related by the relationship

φ(i+1, i+1) = φ(i, i) + s(n₀−1−i) s(n₀−1−i) − s(n₁−i) s(n₁−i)   (9.35)

Equation (9.35) can be used to simplify calculation of the diagonal elements. Because the matrix is symmetrical about the diagonal, it can be solved for the coefficients using Cholesky decomposition, also called the square root method. This method leads to a recursive system of equations to find the α values.

Once the α values have been obtained, it is necessary to determine if they result in a stable filter. In theory, this can be done by factoring the synthesis filter and determining if all the poles are within the unit circle in the z plane. A more convenient procedure is normally used, however. This involves solving Eq. (9.34) for an equivalent lattice filter whose parameters are the reflection coefficients. This filter resembles a tubular model of the vocal tract. If any reflection coefficient has a magnitude larger than one, the filter is unstable. An unstable filter is not used; instead, the values from previous frames are substituted.

Reflection Coefficients

Reflection coefficients are also referred to as partial correlation (PARCOR) coefficients.* It is relatively easy to calculate the α values of the prediction filter from the reflection coefficients. [See Eqs. (9.44) and (9.45), to be discussed later.] The calculation of the reflection coefficients, represented by K, and the transformation to α values will now be discussed. We begin with the autocovariance matrix, as defined in Eq. (9.33).

*Adapted from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, pp. 399-402, 411, 413-414, 443, 444. Copyright © 1978. Adapted by permission of Prentice Hall, Inc., Englewood Cliffs, NJ.

The reflection coefficients of the predictor filter can be found as follows.* Since the φ matrix is symmetrical, it can be factored into the product of a lower and an upper triangular matrix. This gives

Φ = W Vᵀ   (9.36)

For an example where P = 4, we have

W = ⎡ W₁₁  0    0    0   ⎤
    ⎢ W₂₁  W₂₂  0    0   ⎥
    ⎢ W₃₁  W₃₂  W₃₃  0   ⎥
    ⎣ W₄₁  W₄₂  W₄₃  W₄₄ ⎦   (9.37)

Vᵀ = ⎡ 1  W₂₁/W₁₁  W₃₁/W₁₁  W₄₁/W₁₁ ⎤
    ⎢ 0  1        W₃₂/W₂₂  W₄₂/W₂₂ ⎥
    ⎢ 0  0        1        W₄₃/W₃₃ ⎥
    ⎣ 0  0        0        1       ⎦   (9.38)

The values of the lower triangular matrix can be calculated using the recursive equations

W_i1 = φ(i, 1),  i = 1, 2, … P   (9.39)

W_ij = φ(i, j) − Σ_{n=1}^{j−1} (W_in W_jn) / W_nn   (9.40)

*Equations (9.38), (9.40), (9.42), and (9.74) are adapted from T.E. Tremain, The Government Standard Linear Prediction Coding Algorithm, Speech Technology, April 1982.

After the W_i1 values have been determined from Eq. (9.39), the values for W_i2 are found from Eq. (9.40) for i = 2, … P. The calculation then proceeds to find the W_i3 values, and so forth, until all the W elements have been determined.

The reflection coefficients can then be found using a second set of recursive equations. We have

K₁ = φ(0, 1) / W₁₁   (9.41)

and

K_i = (1/W_ii) [φ(0, i) − Σ_{j=1}^{i−1} K_j W_ij]   (9.42)

The coefficient values can now be checked for stability by ensuring that none of the magnitudes are greater than unity. The reflection coefficients are then encoded for transmission. We will have additional comments on encoding subsequently. At the receiver, the data is decoded to recover the reflection coefficients. Unless a lattice filter is used, the reflection coefficients are then converted to the coefficients of the analysis filter: the "a" values (which we set equal to the α values early in the analysis). The synthesis filter is then easily constructed to have the response

H(z) = 1 / A(z)   (9.43)

This can be implemented as shown in the lower diagram of Fig. 9.11.

The procedure for finding the "a" values is unusual in that the coefficients of all systems of order less than P are calculated successively until, finally, the values for a Pth-order predictor are calculated. The coefficients for the lower-order systems are not used except as required to obtain the coefficients for the Pth-order system. The recursive relationships are (see Rabiner and Schafer [36])

a_i^(i) = K_i,  for i = 1, 2, … P   (9.44)

and, for each value of i,

a_j^(i) = a_j^(i−1) − K_i a_{i−j}^(i−1),  for j = 1, 2, … i−1   (9.45)

The superscript here refers to the order of the system to which the coefficient belongs.

Page 23: Digital Signal Processing in Communication Systems || Speech Processing

512 Digital Signal Processing in Communication Systems

FIGURE 9.11 Relationship between prediction filter, analysis filter, and synthesis filter (s(n) into the prediction filter P(z) gives ŝ(n); s(n) into the analysis filter A(z) gives e(n); e(n) into the synthesis filter 1/A(z) recovers s(n))

Let us again consider the example for P = 4. We must calculate the "a" values for a first-order system first. Thus, using Eq. (9.44), for i = 1 we have

a₁^(1) = K₁

Since the first-order system has only one prediction coefficient, Eq. (9.45) is not required for i = 1. We now proceed to calculate the prediction coefficients for a second-order system (i = 2). Using Eq. (9.44),

a₂^(2) = K₂

Then, using Eq. (9.45),

a_j^(2) = a_j^(1) − K₂ a_{2−j}^(1),  for j = 1

or

a₁^(2) = a₁^(1) − K₂ a₁^(1)

All the required values were calculated previously. This completes the prediction coefficients for the second-order system. Now, going on with i = 3 and using Eq. (9.44), we have

a₃^(3) = K₃

Then, using Eq. (9.45) for j = 2,

a₂^(3) = a₂^(2) − K₃ a₁^(2)

And for j = 1,

a₁^(3) = a₁^(2) − K₃ a₂^(2)

This completes the third-order system. Now we are in a position to calculate the values for i = P = 4, which corresponds to the coefficients we set out to find. Using Eq. (9.44),

a₄^(4) = K₄

Then, using Eq. (9.45), for values j = 3, 2, and 1, we have

a₃^(4) = a₃^(3) − K₄ a₁^(3)
a₂^(4) = a₂^(3) − K₄ a₂^(3)
a₁^(4) = a₁^(3) − K₄ a₃^(3)

For a tenth-order system, the procedure would, of course, continue until the values of the coefficients a^(10) were found.

Note that, in this case, the values of j can be ordered in either ascending or descending order, since the required quantities are available from the previous, lower-order system calculations.
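The step-up recursion of Eqs. (9.44) and (9.45) is compact in code. The sketch below (illustrative Python; the function name and list layout are assumptions) converts reflection coefficients K₁ … K_P into the prediction coefficients a₁^(P) … a_P^(P), exactly as in the worked example above.

def reflection_to_lpc(K):
    # K[0], ..., K[P-1] hold K_1, ..., K_P; returns a_1, ..., a_P
    a = []
    for i, k in enumerate(K, start=1):
        prev = a[:]                                                # the (i-1)th-order set
        a = [prev[j] - k * prev[i - 2 - j] for j in range(i - 1)]  # Eq. (9.45)
        a.append(k)                                                # Eq. (9.44)
    return a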

Autocorrelation Method

It was indicated previously that a simpler system, called the autocorrelation method, has been used to find the values of the prediction coefficients. To use this method, the speech samples must be windowed (e.g., with a Hamming window)

to ensure zero values at the boundary of the analysis window. Because of the windowing, some part of the information is lost. Nevertheless, the system has been used successfully, particularly in commercial systems. Windowing the data results in an autocorrelation matrix, φ(i, j), which has symmetry about the diagonal and in which all the diagonal elements are also equal. Equal diagonal elements are the extra condition that allows simplification as compared with the autocovariance method described earlier.

Let us assume a segment of the speech waveform, s(n) = 0, outside 0 ≤ n ≤ N − 1. We obtain s by windowing the speech samples x_s(n) such that

s(n) = x_s(n) w(n)   (9.46)

In the above equation, w(n) is the window function, which is zero for n < 0 and n > N − 1. We will again provide a prediction filter of the form

ŝ(n) = Σ_{i=1}^{P} α_i s(n−i)

The error function (prediction residual) is

e(n) = s(n) − ŝ(n)

e(n) = s(n) − Σ_{i=1}^{P} α_i s(n−i)   (9.47)

For a Pth-order predictor, the error will be nonzero only over the interval 0 ≤ n ≤ N − 1 + P. Therefore, the mean square error can be expressed as

E = Σ_{n=0}^{N+P−1} e²(n)   (9.48)

Now define

φ(i, j) = Σ_{n=0}^{N+P−1} s(n−i) s(n−j)   (9.49)

We can also express φ as

φ(i, j) = Σ_{n=0}^{N−1−(i−j)} s(n) s(n+i−j),  1 ≤ i ≤ P and 1 ≤ j ≤ P   (9.50)

since we are still multiplying all the values displaced by i − j from each other. Since φ now has the characteristics of an autocorrelation function, we write

φ(i, j) = R(τ) = Σ_{n=0}^{N−1−τ} s(n) s(n+τ)   (9.51)

where

τ = |i − j|,  i = 1, 2, … P and j = 1, 2, … P

Then, referring to Eq. (9.34) with the substitution of Eq. (9.51), we have

Σ_{j=1}^{P} α_j R(|i − j|) = R(i),  i = 1, 2, … P   (9.52)

If P = 4, this leads to the set of simultaneous equations

i = 1: R(0)α₁ + R(1)α₂ + R(2)α₃ + R(3)α₄ = R(1)

i = 2: R(1)α₁ + R(0)α₂ + R(1)α₃ + R(2)α₄ = R(2)

i = 3: R(2)α₁ + R(1)α₂ + R(0)α₃ + R(1)α₄ = R(3)

i = 4: R(3)α₁ + R(2)α₂ + R(1)α₃ + R(0)α₄ = R(4)

and the autocorrelation matrix has the form

R = ⎡ R(0)  R(1)  R(2)  R(3) ⎤
    ⎢ R(1)  R(0)  R(1)  R(2) ⎥
    ⎢ R(2)  R(1)  R(0)  R(1) ⎥
    ⎣ R(3)  R(2)  R(1)  R(0) ⎦

The P×P matrix of autocorrelation values is a Toeplitz matrix (i.e., it is symmetrical, and all elements along the diagonal are equal). As a result, the equations can be solved using Durbin's recursive solution,* also called the Levinson-Durbin method. To solve for the coefficients of order P, we first must solve for all the coefficients of order P − 1. To solve for the coefficients of order P − 1, we must solve for all the coefficients of order P − 2, and so on. As before, we will use a superscript to represent the order of the system being analyzed.

The process begins by finding the residual error or initial condition

E^(0) = R(0)   (9.53)

A set of four recursive equations is then used for i = 1, … P:

K_i = [R(i) − Σ_{j=1}^{i−1} α_j^(i−1) R(i−j)] / E^(i−1)   (9.54)

α_i^(i) = K_i   (9.55)

α_j^(i) = α_j^(i−1) − K_i α_{i−j}^(i−1),  j = 1, 2, … i−1   (9.56)

E^(i) = (1 − K_i²) E^(i−1)   (9.57)

Since the subscripts and superscripts may be a bit confusing, we will consider an example for a second-order, P = 2, system. The calculations are made as follows:

E^(0) = R(0)

i = 1: K₁ = R(1)/E^(0)
       α₁^(1) = K₁
       E^(1) = (1 − K₁²)E^(0)

i = 2: K₂ = [R(2) − α₁^(1)R(1)]/E^(1)
       α₂^(2) = K₂
       α₁^(2) = α₁^(1) − K₂α₁^(1)

"Material in this section adapted from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals. Chapter 8, p. 411, with permission. Copyright © 1978, Prentice Hall, Inc.

Page 28: Digital Signal Processing in Communication Systems || Speech Processing

Speech Processing 517

Note that E^(i) is the predictor error for an ith-order system. We note also that the equations can just as well be solved using normalized autocorrelation coefficients

r(j) = R(j) / R(0)

This may be convenient in a machine with fixed-point arithmetic.

A few additional comments are in order with regard to the Levinson-Durbin recursion method. The intermediate quantities, represented by K and known as reflection coefficients, are also called PARCOR coefficients (partial correlation coefficients) because they can be expressed in the form of a normalized cross-correlation function. We will not be concerned with this form here.

Starting with Eqs. (9.15), (9.25), and (9.56), it is possible to show that the analysis filter can also be implemented using the reflection coefficients in a lattice filter (see Rabiner and Schafer [36]). This is shown in Fig. 9.12. From this figure, we may write the iterative expression for the lower portion of the lattice filter:

b_i(n) = b_{i−1}(n−1) − K_i e_{i−1}(n)   (9.58)

Here, b_i is called the backward prediction error sequence. By inspection, the iterative expression for the upper portion of the lattice filter is

e_i(n) = e_{i−1}(n) − K_i b_{i−1}(n−1)   (9.59)

FIGURE 9.12 Signal flow diagram for lattice implementation of analysis filter. Adapted from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, p. 415, with permission. Copyright © 1978, Prentice Hall, Inc.

Page 29: Digital Signal Processing in Communication Systems || Speech Processing

518 Digital Signal Processing in Communication Systems

Here, e_i is the forward prediction error sequence. The combination of Eqs. (9.58) and (9.59) forms the basis of a lattice analysis filter which gives the same result as the FIR analysis filter shown in Fig. 9.10, with the use of reflection coefficients (K) rather than the prediction coefficients (α).

It is now a simple matter to transform the lattice analysis filter to a lattice synthesis filter. To accomplish this, we solve Eq. (9.59) for e_{i−1}(n) and make the notational change e_i(n) = f_i(n). This gives

f_{i−1}(n) = f_i(n) + K_i b_{i−1}(n−1)   (9.60)

A block diagram of the lattice synthesis filter is shown in Fig. 9.13. In this filter, f_i is the forward signal and b_i is the backward signal. From the diagram, it can be seen that the following iterative expressions can be used to compute the output signal:

f_P(n) = u(n)   (9.61)

f_{i−1}(n) = f_i(n) + K_i b_{i−1}(n−1),  for i = P, … 2, 1   (9.62)

b_i(n) = b_{i−1}(n−1) − K_i f_{i−1}(n),  for i = 1, 2, … P   (9.63)

and b₀(n) = f₀(n). The equations can be clarified by an example. In this case, we will use P = 3.

We begin with n = 1 and assume the initial stored values to be zero.

n = 1 Start: f₃(1) = u(1)
i = 3: f₂(1) = f₃(1)  (since b₂(0) = 0)
       b₃(1) = −K₃f₂(1)
i = 2: f₁(1) = f₂(1)  (since b₁(0) = 0)
       b₂(1) = −K₂f₁(1)
i = 1: f₀(1) = f₁(1)  (since b₀(0) = 0)
       b₁(1) = −K₁f₀(1)

This completes the calculations for the output at n = 1, and the output is given by

s(1) = f₀(1)

FIGURE 9.13 Signal flow diagram for lattice implementation of synthesis filter

We now calculate the output for n = 2, noting that b₀(1) = f₀(1):

n = 2 Start: f₃(2) = u(2)
i = 3: f₂(2) = f₃(2) + K₃b₂(1)
       b₃(2) = b₂(1) − K₃f₂(2)
i = 2: f₁(2) = f₂(2) + K₂b₁(1)
       b₂(2) = b₁(1) − K₂f₁(2)
i = 1: f₀(2) = f₁(2) + K₁b₀(1)
       b₁(2) = b₀(1) − K₁f₀(2)

and the output is s(2) = f₀(2).

The reflection coefficients for the lattice filter are easier to determine than the α values for the synthesis filter based on the inversion of an FIR filter. We note, however, that twice as many multiplications are required to compute each point. For this reason, reflection coefficients are often converted to the coefficients of the FIR analysis filter in a vocoder.

The lattice filter gives exactly the same results as the FIR-based synthesis filter. The lattice filter can be viewed as a model of a lossless acoustic tube filter with P sections of equal length but different areas, A_m. Then, K_m is the reflection coefficient between sections m and m − 1.
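The per-sample loop of Eqs. (9.61) through (9.63) can be written out directly. The sketch below (illustrative Python; the state layout is an assumption) synthesizes one output sample per excitation sample, exactly as in the P = 3 example above.

def lattice_synthesize(u, K):
    # K[0..P-1] hold K_1..K_P; b[i] stores b_i(n-1) for i = 0..P-1
    P = len(K)
    b, out = [0.0] * P, []
    for x in u:
        f = x                                             # Eq. (9.61): f_P(n) = u(n)
        b_new = [0.0] * P
        for i in range(P, 0, -1):
            f_next = f + K[i - 1] * b[i - 1]              # Eq. (9.62)
            if i <= P - 1:
                b_new[i] = b[i - 1] - K[i - 1] * f_next   # Eq. (9.63); b_P is unused
            f = f_next
        b_new[0] = f                                      # b_0(n) = f_0(n)
        b = b_new
        out.append(f)                                     # s(n) = f_0(n)
    return out

For P = 1 this reduces to s(n) = u(n) + K₁ s(n−1), i.e., the all-pole filter 1/(1 − K₁z⁻¹), consistent with a₁^(1) = K₁ from the step-up recursion.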

Page 31: Digital Signal Processing in Communication Systems || Speech Processing

520 Digital Signal Processing in Communication Systems

It can be shown that

K_m = (Z_{m+1} − Z_m) / (Z_{m+1} + Z_m)   (9.64)

where

Z_m = characteristic impedance of mth section
A_m = area of mth section

and the characteristic impedance is inversely proportional to the area. This leads to the relationship

A_m / A_{m+1} = (1 + K_m) / (1 − K_m)   (9.65)

This is the basis for a method of coding the reflection coefficients called the log area ratios. These result from coding the logs of Eq. (9.65).

If the α values are known for the FIR-based synthesis filter, it is also possible to calculate the reflection coefficients. The recursive equations are as follows (see Rabiner and Schafer [36]). Note that the α values for the prediction filter are for a Pth-order system, so the givens are α₁^(P), α₂^(P), … α_P^(P). For each i = P, P−1, … 2, 1,

K_i = α_i^(i)   (9.66)

Also, for each i, let m = 1, 2, … i−1 and find†

α_m^(i−1) = [α_m^(i) + K_i α_{i−m}^(i)] / (1 − K_i²)   (9.67)

After m = i−1, reduce i by 1 and return to Eq. (9.66).

This concludes our discussion on finding the values of the coefficients of the synthesis filter. The values are normally computed every 22.5 ms in the analysis mode. At the receiving (synthesis) end of the link, the coefficient values are updated more frequently by interpolation. The interpolation is normally performed on the reflection coefficients prior to converting to the coefficients of the FIR-based synthesis filter. In some systems, the number of interpolations per frame is variable, depending on how rapidly the parameters are changing. Interpolations may be performed as frequently as every 5 ms.

*Equations (9.64) and (9.65) reprinted with permission from C. Bristow, Electronic Speech Synthesis: Techniques, Technology, and Applications [40]. Copyright © 1984, McGraw-Hill.
†Equation (9.67) reprinted from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, with permission. Copyright © 1978, Prentice Hall, Inc.

Page 32: Digital Signal Processing in Communication Systems || Speech Processing

Speech Processing 521

Gain Determination

Referring to Fig. 9.8, we see that it is necessary to adjust the gain parameter, G. This is required so that the energy response to G v(n) equals the energy in the signal for both voiced and unvoiced sounds. The required value is given by

G² = R(0) − Σ_{i=1}^{P} α_i R(i)   (9.68)

for an excitation v(n) with unit energy. This can be determined from Eq. (9.18).

s(n) = Σ_{i=1}^{P} α_i s(n−i) + G v(n)   (9.69)

We assume here that the α values are substituted for the "a" values, and that the excitation, v(n), is a unit impulse, v(n) = δ(n). We also assume that the predicted signal, ŝ(n), has the same energy as the actual signal, s(n). Now, define the autocorrelation function to be

R(τ) = Σ_{n=0}^{∞} s(n) s(n+τ)   (9.70)

Substituting Eq. (9.69) for one occurrence of s in Eq. (9.70) gives

R(τ) = Σ_{i=1}^{P} α_i Σ_{n=0}^{∞} s(n−i) s(n+τ) + G s(−τ)   (9.71)

For τ ≠ 0, this simplifies to

R(τ) = Σ_{i=1}^{P} α_i R(|τ − i|)   (9.72)

assuming s(n) = 0 for n < 0. For τ = 0, from Eq. (9.71), we have

Page 33: Digital Signal Processing in Communication Systems || Speech Processing

522 Digital Signal Processing in Communication Systems

R(0) = Σ_{n=0}^{∞} [Σ_{i=1}^{P} α_i s(n−i) s(n) + δ(n) G s(n)]

or

R(0) = Σ_{i=1}^{P} α_i R(i) + G s(0)

But from Eq. (9.69), we see that s(0) = G. Therefore,

R(0) = Σ_{i=1}^{P} α_i R(i) + G²   (9.73)

which is the criterion proposed in Eq. (9.68).

For unvoiced sounds, we assume that v(n) is a white noise process with the autocorrelation function R_v(0) = 1 and R_v(τ) = 0 for τ ≠ 0. This is the same autocorrelation function obtained for the assumed voiced excitation v(n) = δ(n). Consequently, the derivation gives the same result for white noise excitation.

Pitch Period

One of the most difficult aspects of LPC vocoder technology is making the correct determination of the pitch period. Many different algorithms have been tried, and most work well on strongly voiced sounds. The difficulty occurs in the transition regions from voiced to unvoiced sounds and vice versa. Many of the algorithms are discussed by Rabiner [41].

We will discuss two algorithms here. The first is the average magnitude difference function (AMDF) used by US Government and NATO LPC speech coders. The second is the autocorrelation method applied to the prediction residual.

The AMDF was developed during a period when multiplication was a particularly time consuming operation to perform, and it avoids the multiplications required to compute the autocorrelation function. Basically, the algorithm compares the speech signal with a delayed version of itself and attempts to find the delay that results in the minimum average difference. Presumably, if the delay is a multiple of the pitch period, the differences will be small, and the shortest pitch period exhibiting a minimum should correspond to the pitch period of the voiced sound.

The AMDF function is given by Tremain [38] as

AMDF(τ) = Σ_{i=1}^{N} |s(i) − s(i+τ)|   (9.74)

Page 34: Digital Signal Processing in Communication Systems || Speech Processing

Speech Processing 523

For an 8 kHz sample rate, it is typical to perform the search for 60 values of τ from 20 to 156.

The pitch is smoothed over five frames. A confidence factor may also be developed, based on the minimum AMDF over several frames. This is used in the voiced/non-voiced decision discussed later.

The signal is normally lowpass filtered before applying it to the AMDF algorithm. A cutoff frequency on the order of 800 Hz works well. This passes all the information required to make a pitch determination and removes extraneous information. In some instances, a low order prediction filter is used to form a whitened error signal, or residual, and the residual is used as the input to the pitch determination algorithm. Recall from Eq. (9.26) that the prediction residual corresponds to the excitation function.

A block diagram of a pitch determination system based on a second-order prediction residual is shown in Fig. 9.14.

FIGURE 9.14 Block diagram of pitch preprocessing circuit (input x(i); K₁ and K₂ to the voicing algorithm; the AMDF output to the pitch algorithm)
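A minimal AMDF search per Eq. (9.74), sketched in Python. The uniform lag range of 20 to 156 samples and the 130-sample window are simplifying assumptions (the government coder searches 60 selected lags); the max-to-min ratio returned here anticipates the voicing measure discussed in the next section.

import numpy as np

def amdf_pitch(s, tau_min=20, tau_max=156, N=130):
    # requires len(s) >= tau_max + N; evaluate Eq. (9.74) at each lag
    s = np.asarray(s, dtype=float)
    taus = np.arange(tau_min, tau_max + 1)
    amdf = np.array([np.sum(np.abs(s[:N] - s[tau:tau + N])) for tau in taus])
    pitch = taus[np.argmin(amdf)]
    ratio = amdf.max() / max(amdf.min(), 1e-9)   # max-to-min ratio, higher when voiced
    return pitch, ratio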

Voicing Decision

Another required determination is the voiced/non-voiced decision. This decision is normally made based on several parameters, the most important of which are as follows:

1. The low-band energy, given by LBE = Σ|x(i)|, where x(i) are the lowpass filtered speech samples. The low-band energy is higher for voiced sounds.

2. The maximum-to-minimum ratio of the AMDF(τ) function. This also is higher for voiced sounds.

3. The rate of zero crossings. This is perhaps less than 500 per second for voiced sounds.

Several other parameters may be used to assist in making the voicing decision. One of these is the first reflection coefficient (see Kemp et al. [57]), given by


Page 35: Digital Signal Processing in Communication Systems || Speech Processing

524 Digital Signal Processing in Communication Systems

RC₁ = Σ[s(i) s(i−1)] / Σ[s²(i)]   (9.75)

Another parameter that can be used is the weighted high-band energy, which is higher for non-voiced sounds. A measure of the high-band energy (see Kemp et al. [57]) is given by

QS = Σ|s(i) − s(i−1)| / Σ s²(i)   (9.76)

Two other parameters that have been used are the causal (reverse) prediction gain, which tends to be higher for voiced sounds, and the non-causal (forward) prediction gain (see Kemp et al. [57]). The former is given by

ARB = [Σ s(i) s(i−τ)]² / [Σ s²(i) Σ s²(i−τ)]   (9.77)

and the latter is given by

ARF = [Σ s(i) s(i+τ)]² / [Σ s²(i) Σ s²(i+τ)]   (9.78)

A tentative voicing decision is made based on a weighted sum of the deviations of the individual parameters from their threshold values. The final voicing decisions are then made by smoothing the tentative classifications based on the previous and next determination. A voicing decision is typically made every half frame.
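The first few voicing parameters are simple frame statistics. The sketch below is illustrative Python (x_low is assumed to be the same frame after lowpass filtering); it computes the low-band energy, the zero-crossing count, RC₁ of Eq. (9.75), and QS of Eq. (9.76).

import numpy as np

def voicing_features(s, x_low):
    # s: frame of speech samples; x_low: the same frame, lowpass filtered
    s, x_low = np.asarray(s, float), np.asarray(x_low, float)
    lbe = np.sum(np.abs(x_low))                            # low-band energy
    zc = np.sum(np.signbit(s[1:]) != np.signbit(s[:-1]))   # zero crossings per frame
    rc1 = np.sum(s[1:] * s[:-1]) / np.sum(s * s)           # Eq. (9.75)
    qs = np.sum(np.abs(np.diff(s))) / np.sum(s * s)        # Eq. (9.76)
    return lbe, zc, rc1, qs

A weighted sum of the deviations of such features from their thresholds then yields the tentative voicing decision described above.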

Another method has been proposed to make the voicing decision from the autocorrelation function of the prediction residual. The pitch period can also be determined from this function, which is given by

R(τ) = E[e(n) e(n−τ)] / (0.5 {E[e²(n)] + E[e²(n−τ)]})   (9.79)

We recall that, to the extent that the vocal tract can be modeled by the prediction filter, the prediction residual corresponds to the excitation. Consequently, for voiced sounds a value of τ will be found for which R(τ) has a strong peak. Conversely, the absence of such a peak is indicative of a non-voiced sound.

The voicing decision is one of the most difficult parts of the vocoder algorithm. It is responsible for a significant part of the performance degradation in a noisy background environment.

Page 36: Digital Signal Processing in Communication Systems || Speech Processing

Speech Processing 525

Window Placement

As we have seen, a considerable number of subtle features may be incorporated in the LPC vocoder technique to enhance performance. Therefore, a few remarks are in order with regard to the placement of the analysis window within the frame for determination of the pitch period and reflection coefficients.

Human speech tends to be characterized by periods of relatively constant pa­rameters interspersed with abrupt changes called onsets. The speech immediately following an onset often contains important information, and it is desirable to start the analysis window for the reflection coefficients at-or slightly after-an onset. If there is more than one onset in a frame, the window is placed at the first onset. For voiced sounds it is also desirable to place the pitch window synchronously with the previous window. This is done by placing the start of the window a mul­tiple of the pitch period from the start of the previous window. The analysis win­dow is often a different length from the frame, giving some latitude in the place­ment of the starting position within the frame. For a 22.5 ms frame with an 8 ks/s rate, a frame consists of 180 samples. It is not unusual to make the analysis win­dow variable (say, from 90 to 156) with a nominal value of 130 samples.

The voicing window, used to determine the pitch and the voicing decision, should be placed to avoid onsets. If there are no onsets, it is centered in the pitch window. If there is an onset, the voicing window is placed before it, if possible; otherwise, it is placed after the onset. If there are two onsets, the voicing window is placed between them, if possible.

The determination of onsets is not difficult. One method is to examine the sample-by-sample prediction coefficient for a first-order linear predictor; if it changes abruptly over about 16 samples, an onset is present. The first-order predictor coefficient is given by

f(i) = E\left[\frac{s(i)\, s(i-1)}{s^2(i-1)}\right]    (9.80)

In this case, the expectation is formed by a running average of 63/64 of the old sum and 1/64 of the latest calculation. If the difference given by

d(i) = \sum_{j=0}^{7} f(i-j) - \sum_{j=8}^{15} f(i-j)    (9.81)

exceeds a threshold value (typically about 0.25), an onset may be present [57].

There are many other refinements and strategies used to make a successful LPC vocoder, such as the way the signal is preconditioned (e.g., to remove the dc bias), the way the parameters are interpolated, and so on. For additional details, the reader is referred in particular to Rabiner and Schafer [36], Tremain [38], Campbell et al. [39], Bristow [40], Rabiner et al. [41], Kang and Everett [42], Kang [43], Federal Standard 1015 [44], and Kemp et al. [57].
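As an illustration of this onset detector, the Python sketch below tracks the first-order predictor coefficient of Eq. (9.80), forming the expectation as the ratio of two 63/64-1/64 running sums (one plausible reading of the text), and then computes the difference of Eq. (9.81) against the nominal 0.25 threshold.

```python
import numpy as np

def detect_onsets(s, threshold=0.25):
    """Return sample indices where the difference d(i) of Eq. (9.81)
    exceeds the threshold, indicating a possible onset."""
    f = np.zeros(len(s))
    num = den = 0.0
    for i in range(1, len(s)):
        # 63/64-1/64 running averages standing in for the expectations
        num = (63.0 / 64.0) * num + (1.0 / 64.0) * s[i] * s[i - 1]
        den = (63.0 / 64.0) * den + (1.0 / 64.0) * s[i - 1] * s[i - 1]
        f[i] = num / den if den > 1e-12 else 0.0
    onsets = []
    for i in range(16, len(s)):
        d = np.sum(f[i - 7:i + 1]) - np.sum(f[i - 15:i - 7])  # Eq. (9.81)
        if abs(d) > threshold:
            onsets.append(i)
    return onsets
```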

Synthesis

As indicated in Fig. 9.8, the voice signal is reconstructed at the receiver by exciting the synthesis filter with an impulse train or with random noise. For voiced sounds, the impulse train has a repetition period determined by the pitch algorithm of the analysis system. Unfortunately, the extreme regularity of the excitation causes the speech to sound machine-like and tense. As a result, it has been found desirable to introduce a bit of phase jitter into the waveform. One approach is to use a multiple-pulse excitation consisting of about 25 sample pulses. The entire sequence is repeated at the pitch period and lowpass filtered. In addition, highpass-filtered random noise is added to the excitation. A randomly spaced doublet may also be added, resulting in a synthesis system such as that shown in Fig. 9.15. This results in a more natural sound.

FIGURE 9.15 Modified excitation signal for voiced sounds (pitch-period repetition plus doublet, feeding the 10-pole synthesis filter)

PERFORMANCE EVALUATION

The evaluation or comparison of the many speech coding techniques, and the many variations of each, is a matter that requires careful consideration. It is no trivial matter to objectively determine which technique is better than another. The ear (in conjunction with the human mind) is amazingly adaptable, and investigators working with a particular coder often become quite adept at understanding its output even though an outsider would think its quality is poor. This learning progresses to the point where it eventually becomes difficult to make objective comparisons. Fortunately, several relatively objective measures have been developed that can be used to effectively compare systems.

The first of these, called the diagnostic rhyme test (DRT), deals with intelligibility.* The DRT uses a basic list of 192 words consisting of 96 rhyming pairs. Each pair is normally presented twice in the course of the testing session. The listener's task is to indicate which member of the pair was actually spoken. For example, when the stimulus word is "zeal," the options available to the listener are "zeal" and "seal." A correct response indicates that the speaker has conveyed a sufficient number of acoustic features with regard to the voicing attribute. Depending on the word pair involved, each item serves to test for one of the following elementary phonetic attributes:

1. voicing
2. nasality
3. sustention
4. sibilation
5. graveness
6. compactness

There are 16 word pairs to test each attribute. Each pair differs only in a single attribute in the initial phoneme. Typical examples for voicing are the word pairs "veal" and "feel," and "goat" and "coat." Nasality is tested by such word pairs as "meat" and "beat," and "news" and "dues." At 2,400 b/s, the LPC-10 algorithm can be expected to score from the high 80s to about 90 percent, depending on the individual speakers. This degrades rather rapidly to the mid 80s with an error rate of only 1 or 2 percent.

A second test that is often used to measure vocoder performance is the diagnostic acceptability measure (DAM). This test consists of 12 phonetically balanced six-syllable sentences from each talker. A listener hears the 12 sentences as a group and then rates the overall quality on 21 separate rating scales. The ratings address such factors as speech quality, background noise, cracking, intelligibility, nasal sound, naturalness, and so forth. A 2,400 b/s LPC system typically scores in the lower 50s.

*This test is often scored by Dynastat, Inc., of Austin, Texas. The company maintains a stable crew of trained listeners.

GOVERNMENT STANDARD ALGORITHM: LPC-10

A brief discussion will now be given of the US government standard LPC-10 coder. This example will serve to summarize several of the concepts developed earlier in the chapter and will also describe a typical coding arrangement for the parameters. Additional details on the vocoder are presented in the references, particularly Tremain [38], Federal Standard 1015 [44], and Kemp et al. [57].

A summary of the major characteristics is given in Table 9.1. A block diagram of the transmitter is shown in Fig. 9.16. The input bandwidth to the A/D converter is 100 Hz to 3,600 Hz. The signal is attenuated 23 dB above 4,000 Hz. A 12-bit A/D converter is used with a sample rate of 8 kHz. A digital preemphasis filter is provided to boost the high-frequency energy. The transfer function of this filter is H(z) = 1 - 0.9375z^{-1}. The bit allocation for the 54 bits in the LPC frame is listed in Table 9.2. The synchronization bit alternates between zero and one from frame to frame.

TABLE 9.1 Summary of Parameters for LPC-10

Predictor order: 10
Sampling rate: 8 kHz
Bit rate: 2,400 bps
Frame: 22.5 ms (54 bits per frame)
Pitch algorithm: AMDF (51 to 400 Hz)
Voicing: Two decisions per frame
Matrix load: Covariance
Reflection coefficient coding: Log area ratio for RC1 and RC2, linear for others
Error correction coding: Hamming codes on selected bits

Table based on Federal Std-1015, November 28, 1984.

Pitch and voicing are encoded as a 7-bit field. The specific 7-bit codes assigned to each of the 60 pitch frequencies are defined in Federal Standard 1015 [44] and are not repeated here. For error protection, a non-voiced frame is encoded as seven zeros, and frames in voicing transition are encoded as seven ones. Since there are 128 decoding states, many received combinations can be allocated for error correction. As assigned by Federal Standard 1015, eight received characters are recognized as non-voiced: these are either seven zeros or a single one and six zeros. Thus, a single error is effectively corrected. Likewise, there are eight received characters interpreted as a transition frame; these contain all ones or a single zero and six ones.
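Because the voiced/unvoiced/transition decision depends only on the number of ones in the received field, the decoding rule can be expressed compactly, as in the Python sketch below. The pitch_table mapping is a hypothetical stand-in for the 60 code assignments of Federal Standard 1015, which are not reproduced here.

```python
def decode_pitch_voicing(code7, pitch_table):
    """Interpret a received 7-bit pitch/voicing field with the
    single-error tolerance described above."""
    ones = bin(code7 & 0x7F).count("1")
    if ones <= 1:        # seven zeros, or a single one: non-voiced frame
        return "unvoiced", None
    if ones >= 6:        # seven ones, or a single zero: voicing transition
        return "transition", None
    return "voiced", pitch_table.get(code7)   # one of the 60 pitch codes
```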

FIGURE 9.16 Block diagram of US government standard LPC speech coder (analog speech input; outputs: RMS power level, pitch, voicing, and reflection coefficients)


TABLE 9.2 Bit Allocation for Vocoder

Bits Allocated per Frame

Parameter            Voiced    Non-voiced
Pitch and voicing      7           7
RMS amplitude          5           5
RC(1)                  5           5
RC(2)                  5           5
RC(3)                  5           5
RC(4)                  5           5
RC(5)                  4           0
RC(6)                  4           0
RC(7)                  4           0
RC(8)                  4           0
RC(9)                  3           0
RC(10)                 2           0
Error control          0          20
Synchronization        1           1
Unused                 0           1
Total                 54          54

The RMS amplitude is scaled from 512 possible levels (nine bits) to 32 levels (five bits) using a table. The table levels tend to be more coarsely quantized at the higher signal values.
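A minimal Python sketch of such a companded table is given below. The logarithmic spacing is an assumption chosen only to reproduce the coarser quantization at high levels that the text describes; the actual table of the standard is different.

```python
import numpy as np

# 32 levels spanning the 9-bit range 1..511, log-spaced so that the step
# size grows with signal level (assumed shape, not the standard's table)
rms_table = np.logspace(0.0, np.log10(511.0), 32)

def encode_rms(rms):
    """Return the 5-bit index of the nearest table level."""
    return int(np.argmin(np.abs(rms_table - rms)))
```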

The first two reflection coefficients are encoded with five bits, again using a table look-up. Table 9.3 gives the nonlinear convention for these quantities and is included to give the reader a feel for the types of tables used.

The reflection coefficient RC3 is encoded using a similar five-bit table. In this case, the conversion is linear except for limiting in the region where (RC3) > 0.6. RC4 is also linearly encoded using five bits, with limiting for values of (RC4) > 0.76. RC5 through RC8 are encoded with four bits for voiced frames only. Encoding is basically linear, but with slightly different saturation characteristics for each; therefore, a separate table is used to encode each reflection coefficient. RC9 is encoded using a three-bit table, and RC10 using a two-bit table.

During non-voiced frames, since RC5 through RC10 are not transmitted, an additional 20 bits are available for error correction. A Hamming code is used for the four most significant bits of the RMS amplitude and the first four reflection coefficients. The error correction encoding convention is listed in Table 9.4.

At the receiver, the speech is synthesized using a tenth-order all-pole filter excited by pitch-synchronous pulses. A block diagram of the synthesizer (receiver) section is shown in Fig. 9.17.


TABLE 9.3 Reflection Coefficients 1 and 2

Coefficient Range     Binary Encoded Value     Decoded Value
-.999 to -.984              -15                   -.984
-.984 to -.969              -14                   -.969
-.969 to -.953              -13                   -.953
-.953 to -.938              -12                   -.938
-.938 to -.906              -11                   -.922
-.906 to -.875              -10                   -.891
-.875 to -.828               -9                   -.844
-.828 to -.766               -8                   -.781
-.766 to -.688               -7                   -.719
-.688 to -.609               -6                   -.641
-.609 to -.531               -5                   -.563
-.531 to -.422               -4                   -.469
-.422 to -.313               -3                   -.359
-.313 to -.203               -2                   -.250
-.203 to -.094               -1                   -.141
-.094 to +.094                0                   +.031
 .094 to  .203                1                    .141
(Remaining positive ranges are identical to the negative values but with positive signs.)
 .984 to  .999               15                    .984

TABLE 9.4 Pulse Amplitude Values for Voicing Excitation

Index  Amplitude    Index  Amplitude    Index  Amplitude
  1       249         15      -20         29       19
  2      -262         16      138         30      -15
  3       363         17      -62         31      -29
  4      -362         18     -315         32      -21
  5       100         19     -247         33      -18
  6       367         20      -78         34      -27
  7        79         21      -82         35      -31
  8        78         22     -123         36      -22
  9        10         23      -39         37      -12
 10      -277         24       65         38      -10
 11       -82         25       64         39      -10
 12       376         26       19         40       -4
 13       288         27       16
 14       -65         28       32


FIGURE 9.17 Block diagram of LPC receiver

The incoming signal is first examined for frame sync, using the characteristic that the synchronization bit toggles. Each bit in the serial stream is correlated with the bit delayed by 54 clocks, and running averages are maintained for 54 different bit positions. Since the sync bit is the only position that toggles, it soon correlates to a value of negative one, which establishes frame synchronization. Once frame sync is established, the serial bit stream can be converted to a 53-bit parallel pattern.

The excitation used in the coder is not a simple sequence of impulses at the pitch period. Rather, it is a sequence of levels that are generated at the 8 kHz rate. The waveform sequence is repeated at the pitch frequency. Table 9.4 lists these values. If the pitch period is 40, all values of the table are used in sequence and repeated for the duration of the 200 Hz pitch. If the pitch period is longer than 40, the excitation is followed by as many zeros as necessary to complete each pitch period. If the pitch period is shorter than 40, the remaining values are added to the values at the beginning of the table for the next pitch period. For example, if the pitch period is 38, excitation value 39 of the table would be added to value 1 of the next pitch period, and so forth. The table values are scaled by the RMS amplitude parameter.
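The wrap-around behavior is easy to mechanize: simply add all 40 table values starting at each pitch-period boundary, so that values extending past a short period land in, and add to, the beginning of the next one. A minimal Python sketch (with hypothetical function names) follows.

```python
def voiced_excitation(table, pitch_period, n_samples, rms_scale=1.0):
    """Build the voiced excitation from the 40 stored values (Table 9.4).

    Periods longer than 40 samples are automatically zero-filled; for
    periods shorter than 40, the table's tail adds into the next period.
    """
    out = [0.0] * n_samples
    pos = 0
    while pos < n_samples:
        for k, v in enumerate(table):        # lay down (or add) the 40 values
            if pos + k < n_samples:
                out[pos + k] += rms_scale * v
        pos += pitch_period                  # advance one pitch period
    return out
```

For a pitch period of 38, for example, table value 39 (index 38) falls on the first sample of the next period and is added to that period's value 1, exactly as described above.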

The reflection coefficients are interpolated and converted to prediction coefficients ("a" values) to produce the synthesis filter with a response 1/A(z). The filter shapes the excitation signal.

The pitch, RMS amplitude, and reflection coefficients are all interpolated. As part of the interpolation, the decoded parameters are converted from frame blocks to pitch periods. Pitch and log RMS are linearly interpolated. The beginning of RMS interpolation is delayed at the onset of voiced sounds; this increases the sharpness of the voice attacks.


The interpolation of the reflection coefficients is accomplished by forming the area ratios [see Eq. (9.65)] and performing a linear interpolation on the log of the area ratios. The reflection coefficients are interpolated once per pitch period.

The reconstructed signal at the output of the synthesis filter is deemphasized before application to the D/A converter. This filter undoes the effect of the original preemphasis filter. The transfer function of the deemphasis filter is given by

H(z) = \frac{1}{1 - 0.75z^{-1}}    (9.82)
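Both emphasis filters are one-line difference equations. The Python sketch below implements the transmitter's FIR preemphasis, H(z) = 1 - 0.9375z^{-1}, and the receiver's IIR deemphasis of Eq. (9.82).

```python
def preemphasis(x, a=0.9375):
    """y(n) = x(n) - a*x(n-1), i.e., H(z) = 1 - a*z^-1."""
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

def deemphasis(x, a=0.75):
    """y(n) = x(n) + a*y(n-1), i.e., H(z) = 1/(1 - a*z^-1), Eq. (9.82)."""
    y, prev = [], 0.0
    for v in x:
        prev = v + a * prev
        y.append(prev)
    return y
```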

This completes our consideration of the basic LPC algorithm. Before leaving the subject of speech coding, we will discuss a more efficient method of encoding LPC parameters using line spectrum pairs (LSPs) to reduce the data rate to the 400 to 800 bps range. We will also discuss several methods of obtaining near-toll-quality speech at 4,800 bps using code excited linear prediction (CELP).

VERY LOW DATA RATE SPEECH CODING*

The data rate communicated by the human voice is much less than 2,400 bps. This can be demonstrated easily by an example. Suppose a person is reading English text at a typical rate of 150 words per minute. This rate corresponds to an average of 12.5 characters per second. Each character can be coded with 5 bits, which results in a data rate of 62.5 bps. On the other hand, a number of subtle features are communicated that allow us, for example, to recognize the speaker's voice, his level of excitement, and so on. At any rate, if we are willing to settle for the basic information, a considerable reduction in the data rate should be possible over 2,400 bps LPC. This is indeed possible, and it has been found that more sophisticated coding of the prediction filter coefficients can result in a reduction of the data rate to the 400 to 800 bps range. Only a small additional degradation in quality from the 2,400 bps LPC is sustained, on the order of 1 to 2 percent on the diagnostic rhyme test scores (see Kang and Jewett [46]).

As we might suspect, a variety of techniques have been considered to reduce the data rate. One of the most successful solutions has involved the use of line spectrum pairs (LSPs). With this technique, the basic LPC vocoder is still used; LSPs are then applied to transmit the prediction filter coefficients. The transfer function of the analysis filter is represented by two functions that have their zeros on the unit circle. The movement of the zeros with time is particularly well behaved, especially the spacing of the pairs. These properties make it possible to transmit the characteristics of the analysis filter with fewer bits. Since the LSP conversion takes place after the LPC analysis, the majority of the algorithms used in the speech coder are unchanged. A block diagram of the resulting speech coder showing the LSP addition is given in Fig. 9.18. The standard LPC-10 speech coder is shown only as one block. However, it forms all the parameters discussed previously, including the pitch period, RMS level, prediction filter coefficients, and the voiced/non-voiced decision. The prediction filter coefficients are then represented in a different way using line spectrum pairs.

*A significant part of the material in this section is adapted from Kang and Jewett [46, 71] and Kang and Fransen [72], to which the reader is referred for additional details.

Coefficient Conversion (PC to LSP)

The transfer function of the LPC analysis filter, as given by Eq. (9.15), is

A(z) = 1 - \sum_{n=1}^{N} a_n z^{-n}    (9.83)

where

a_n = the nth prediction coefficient

The corresponding LPC synthesis filter, as discussed previously, is given by 1/A(z). The prediction coefficients are readily obtainable using the autocovariance or autocorrelation methods discussed earlier. A serious limitation of the prediction filter expressed in this way is that an error in one coefficient affects the entire speech spectrum. However, the same function can also be expressed in terms of the zeros in the Z plane. If this is done, each pair of zeros corresponds to a resonant frequency and a bandwidth for the resonance.

FIGURE 9.18 Block diagram of 800 bps speech coder (speech input and output, excitation parameters, and reflection coefficients)


To develop the idea, let us first note that Eq. (9.83) can also be expressed as the product of the terms given by

A(z) = \prod_{i=1}^{N/2} (1 - z_i z^{-1})(1 - z_i^* z^{-1})    (9.84)

where

z_i = the ith root of the transfer function

The advantage of expressing the transfer function in this way is that each root primarily affects the transfer function only in the vicinity of that frequency.

We now decompose A(z) into two functions, consisting of the function and its conjugate

P(z) = A(z) - z^{-(N+1)} A(z^{-1})    (9.85)

and

Q(z) = A(z) + z^{-(N+1)} A(z^{-1})    (9.86)

Then, the prediction filter can be reconstructed by

A(z) = \frac{1}{2}\,[P(z) + Q(z)]    (9.87)

The impulse response of P(z) is odd with respect to its midpoint. It has one real root at z = 1. The other zeros are on the unit circle at

z = e^{j 2\pi f_K T_s}    (9.88)

where

f_K = frequency of the zero
T_s = sample time

P(z) can be factored in the form

P(z) = (1 - z^{-1}) \prod_{K=1}^{N/2} (1 - e^{j 2\pi f_K T_s} z^{-1})(1 - e^{-j 2\pi f_K T_s} z^{-1})    (9.89)


Multiplying out the expression gives

P(z) = (1 - z^{-1}) \prod_{K=1}^{N/2} \left[1 - z^{-1}\left(e^{j 2\pi f_K T_s} + e^{-j 2\pi f_K T_s}\right) + z^{-2}\right]

Using the Euler identity, this can be written in the form

P(z) = (1 - z^{-1}) \prod_{K=1}^{N/2} \left[1 - 2 z^{-1} \cos(2\pi f_K T_s) + z^{-2}\right]    (9.90)

The other expression, Q(z), has even symmetry about the midpoint. It has one real root at z = -1. The other roots are also on the unit circle at

z = e^{j 2\pi f_K T_s}    (9.91)

Consequently, we may write

Q(z) = (1 + z^{-1}) \prod_{K=1}^{N/2} \left[1 - 2 z^{-1} \cos(2\pi f_K T_s) + z^{-2}\right]    (9.92)

It turns out that the roots of P(z) and Q(z) are interleaved (i.e., they alternate around the unit circle), as illustrated in Fig. 9.19.

The closer a pair of zeros of P(z) and Q(z) are to each other, the closer the corresponding zero of A(z) is to the unit circle, which indicates a sharper (higher-Q) resonance. These roots are referred to as line spectrum pairs.
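The decomposition of Eqs. (9.85) through (9.87) can be exercised numerically. The Python sketch below forms the coefficient vectors of P(z) and Q(z) from the prediction coefficients and extracts the LSP frequencies as the angles of their unit-circle roots. A general root finder is used here purely for illustration; the FFT-based search developed next is the approach the text pursues.

```python
import numpy as np

def lpc_to_lsp(a, fs=8000.0):
    """LSP frequencies (Hz) from prediction coefficients a = [a1..aN],
    assuming the convention A(z) = 1 - sum a_n z^-n of Eq. (9.83)."""
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    Ar = A[::-1]                                   # z^-(N+1) A(1/z) coefficients
    P = np.concatenate((A, [0.0])) - np.concatenate(([0.0], Ar))  # Eq. (9.85)
    Q = np.concatenate((A, [0.0])) + np.concatenate(([0.0], Ar))  # Eq. (9.86)
    out = []
    for poly in (P, Q):
        r = np.roots(poly)                         # all roots lie on |z| = 1
        ang = np.angle(r[np.imag(r) > 1e-9])       # keep upper-half-plane roots
        out.append(np.sort(ang) * fs / (2.0 * np.pi))
    return out[0], out[1]                          # zeros of P(z), zeros of Q(z)
```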

There are several ways to solve for the roots of P(z) and Q(z), given the impulse response [the "a" values of the prediction filter A(z)]. Since the roots of P(z) and Q(z) lie on the unit circle, the task is simplified. A zero on the unit circle implies that the zeros correspond to real frequencies. We can find the frequency response of P(z) or Q(z) by taking the discrete Fourier transform of the impulse response.

Since the location of the zeros must be known with some precision, a fairly large FFT is suggested. The impulse response is therefore appended with zeros to fill the FFT. A 256-point FFT may be appropriate. For LPC-10, there are 12 real input points, and the FFT input is then padded with 244 zero values. The frequency resolution for an 8 ks/s sample rate is 8000/256 = 31.25 Hz.

FIGURE 9.19 Zeros of P(z) and Q(z) in the Z plane (the zeros of P(z) and Q(z) alternate on the unit circle; the zeros of A(z) lie nearby, inside the circle)

The amplitude of the FFT outputs is found by taking y(k) = \sqrt{I^2(k) + Q^2(k)} and finding the indexes (k) corresponding to the minimum values. The worst-case errors are then ±15.625 Hz. This is larger than one would like to allow. A parabolic interpolation has been used to refine the zero locations, using the two adjacent values along with the minimum. This is shown graphically in Fig. 9.20. It can be shown (see Problem 9-7) that the minimum of a parabola through the points (x_1, y_1), (x_2, y_2), and (x_3, y_3) is given by

x_{min} = x_2 - \frac{1}{2}\,\frac{(x_2 - x_1)^2 (y_2 - y_3) - (x_2 - x_3)^2 (y_2 - y_1)}{(x_2 - x_1)(y_2 - y_3) - (x_2 - x_3)(y_2 - y_1)}    (9.93)

FIGURE 9.20 Parabolic interpolation of function zeros

This expression can be simplified considerably for the present application. Let us designate the FFT output value giving the minimum magnitude as X(k). Then

M(k) = \sqrt{\mathrm{Re}^2[X(k)] + \mathrm{Im}^2[X(k)]}

The adjacent value on the low side is then given as M(k - 1), and the value on the high side is M(k + 1). Substituting these values into Eq. (9.93) leads to the expression

\mathrm{MIN}\,k = k + \frac{1}{2}\left[\frac{M(k-1) - M(k+1)}{M(k-1) - 2M(k) + M(k+1)}\right]    (9.94)

where

k = index of the value M(k) nearest the zero

We have made use of the condition here that x_3 = x_2 + 1 and x_1 = x_2 - 1 for the FFT outputs. The actual frequency of the zero is then given by

F_z = \frac{(\mathrm{MIN}\,k) \times F_s}{N}    (9.95)

where

F_s = sample frequency
N = size of the FFT used

We have not used the condition here that the value M(MIN k) = 0. Because of this condition, we could have used only two values of the FFT output to solve for the minimum. Unfortunately, this procedure results in an expression requiring calculation of a square root, which may be more cumbersome than using three values with Eq. (9.94).
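A Python sketch of the complete FFT-based search follows: zero-pad the impulse response of P(z) or Q(z), locate the local minima of the magnitude response, refine each with the parabolic correction of Eq. (9.94), and convert to hertz with Eq. (9.95).

```python
import numpy as np

def lsp_zeros_by_fft(impulse_response, fs=8000.0, nfft=256):
    """Locate the unit-circle zeros of P(z) or Q(z) from a zero-padded FFT."""
    X = np.fft.fft(impulse_response, nfft)    # e.g., 12 points padded to 256
    M = np.abs(X[:nfft // 2])                 # magnitude from 0 to fs/2
    zeros_hz = []
    for k in range(1, len(M) - 1):
        if M[k] < M[k - 1] and M[k] < M[k + 1]:            # local minimum
            num = M[k - 1] - M[k + 1]
            den = M[k - 1] - 2.0 * M[k] + M[k + 1]
            min_k = k + 0.5 * num / den                    # Eq. (9.94)
            zeros_hz.append(min_k * fs / nfft)             # Eq. (9.95)
    return zeros_hz
```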

The LSPs are plotted in Fig. 9.21 for a short segment of speech (the sentence shown). Several properties are interesting to note. First, we observe that there are periods of time when the LSPs are relatively constant, interspersed with abrupt changes. We note also that the line spectrum pairs tend to track each other. In addition, there is a significant amount of correlation between neighboring line spectrum pairs. These tendencies present a variety of opportunities for efficient coding of the LSPs for transmission. One obvious procedure is to absolutely code only one frequency of each LSP and transmit a differential value to determine the position of the other member of the pair. The differential value can then be transmitted with fewer bits. Other schemes can be devised to transmit changes in the parameters and the positions where the changes take place. It should be noted that clever encoding schemes often carry a price in terms of greater susceptibility to errors.

FIGURE 9.21 Typical LSP trajectories and spectrogram of original speech ("Here is an easy way."): (A) spectrogram; (B) LSP trajectory. Reprinted with permission from G. S. Kang and W. M. Jewett, NRL Report No. 9318, December 1986, Naval Research Laboratory, Washington, DC 20375-5000.

Another method for encoding LSPs is to use vector quantization. This method involves comparing the LSP trajectories with prestored templates and transmitting the number of the most similar template. Vector quantization is discussed in more detail in the following section.

At the receiver, the LSPs are converted back to coefficients of the prediction filter. This is done by substituting the values for P(z) and Q(z) [see Eqs. (9.90), (9.91), and (9.92)] into Eq. (9.87). The resulting expression is

A(z) = \frac{1}{2}\,(1 - z^{-1}) \prod_{i=1}^{N/2} \left[1 - 2 z^{-1} \cos(2\pi f_i T) + z^{-2}\right] + \frac{1}{2}\,(1 + z^{-1}) \prod_{i=1}^{N/2} \left[1 - 2 z^{-1} \cos(2\pi f_i' T) + z^{-2}\right]    (9.96)

where the f_i are the zero frequencies of P(z) and the f_i' are those of Q(z).

Multiplying out this expression and collecting like powers of z gives the transfer function for the prediction filter. The multiplication is rather involved because of the number of terms. Nevertheless, the procedure is straightforward.
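The multiplication is easily mechanized by repeated polynomial convolution, as in the Python sketch below, where fp and fq are the zero frequencies of P(z) and Q(z) recovered at the receiver.

```python
import numpy as np

def lsp_to_lpc(fp, fq, fs=8000.0):
    """Rebuild the coefficients of A(z) from LSP frequencies, Eq. (9.96).
    Returns the coefficient vector in ascending powers of z^-1."""
    T = 1.0 / fs

    def expand(freqs, real_root):             # real_root: +1 for P, -1 for Q
        poly = np.array([1.0, -real_root])    # (1 - z^-1) or (1 + z^-1)
        for f in freqs:
            factor = np.array([1.0, -2.0 * np.cos(2.0 * np.pi * f * T), 1.0])
            poly = np.convolve(poly, factor)  # multiply out one LSP factor
        return poly

    return 0.5 * (expand(fp, +1.0) + expand(fq, -1.0))   # Eq. (9.87)
```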

Vector Quantization

Another approach that has been studied for encoding LSPs is to use a set of 4,096 templates: 3,840 for voiced sounds and 256 for unvoiced sounds (see Kang and Jewett [46]). The LSPs for a given frame are compared with all the templates, and the number of the closest template is transmitted. The distance measure of each template is formed based on the frequency error of each line spectrum with those in the template. Here, it is taken into account that the ear is more sensitive to errors in the low-frequency region than in the higher portion of the spectrum. Hence, the sensitivity weighting decreases linearly between 100 Hz and 1 kHz, and logarithmically between 1 kHz and 4 kHz.
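The sketch below shows one way such a weighted template search might be computed in Python. The weighting curve is only an assumed shape consistent with the description (flat at low frequencies, falling linearly to 1 kHz and logarithmically to 4 kHz); the exact weights of the cited reference are not reproduced here.

```python
import numpy as np

def lsp_weight(f_hz):
    """Assumed perceptual weight for an LSP frequency error."""
    if f_hz <= 1000.0:
        return 1.0 - 0.5 * max(f_hz - 100.0, 0.0) / 900.0             # linear region
    return 0.5 * (1.0 - np.log10(f_hz / 1000.0) / np.log10(4.0))      # log region

def closest_template(lsp, templates):
    """Index of the template minimizing the weighted LSP frequency error
    (the index is the value transmitted)."""
    def dist(t):
        return sum(lsp_weight(f) * abs(f - tf) for f, tf in zip(lsp, t))
    return min(range(len(templates)), key=lambda i: dist(templates[i]))
```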

Obviously, great effort must be taken in forming a good set of templates. For the cited reference, the templates were formed from sentences spoken by more than 50 speakers. In this example, 800 bps digital speech was obtained with the bit allocation for each three frames as follows:

Synchronization: 1
Pitch period: 5
Amplitude information: 4 + 4 + 4
Filter parameters: 12 + 12 + 12

This results in 54 bits per three frames, or 800 bps.

A considerable amount of work is presently being done to further reduce the bit rate of LSP coders, and it is anticipated that rates in the 300 bps range can be achieved.

CODE EXCITED LINEAR PREDICTION CODER (CELP)

The potential applications for low data rate digital voice are enormous but, to date, public acceptance of linear predictive coders at 2,400 bps or less has been very limited. This is because of the somewhat unnatural, machine-like characteristics of the sound. It is therefore necessary to provide an algorithm that produces at least toll-quality speech, or it will not be generally accepted. Speech of this type is normally provided by 64 kbps pulse code modulation (PCM), 32 kbps adaptive differential pulse code modulation (ADPCM), or 32 kbps continuously variable slope delta (CVSD) modulation. Excellent quality is also obtained at 16 kbps using adaptive predictive coding with hybrid quantization (APC-HQ). The government standard APC-SQ operates at 9.6 kbps with good results.

Until recently, it has not been possible to reproduce toll-quality speech at a data rate below approximately 9.6 kbps. The codebook excited linear prediction coder (CELP) has made it possible to use a 4,800 bps rate and still provide excellent quality speech reproduction. The CELP algorithm has been reported to score 93 on the DRT test and 68 on the DAM test (see Campbell et al. [20, 49]). A 1 percent error rate degrades the DRT to about 90.

A brief explanation of the technique is given below.* As the name implies, linear predictive techniques are used. However, the excitation is generated in a manner that differs greatly from the 2,400 bps LPC-10 discussed previously. In the CELP coder, a stochastic codebook stores a fairly large number of short excitation waveforms, on the order of 128 to 512. The speech encoder determines which of the stored waveforms best serves as the excitation for the analysis period and transmits the number of that waveform along with the prediction filter coefficients, pitch information, and the like. The prediction filter coefficients are converted to line spectrum pairs (LSPs), as discussed earlier, to provide efficient and channel-error-resilient coding.

First, the LPC parameters are determined using the autocorrelation method. The CELP coder then passes each of the adaptive and stochastic codebook excitation waveforms through the LPC synthesis filter and compares the output with the actual speech signal to determine which of the waveforms produces the best perceptual replica of the actual waveform for the analysis period of interest.

The manner in which the adaptive codebook ("pitch") information is used to modify the excitation waveforms is somewhat complex, and it requires further explanation; this will be addressed in more detail later. The CELP coder (as proposed in Federal Standard 1016, 31 August 1989) uses a 30 ms analysis window and an 8 kHz sample rate with a 12-bit A/D converter. The autocorrelation method of LPC analysis is used with a 30 ms Hamming window function. A tenth-order synthesis filter is used, as in the LPC-10 coder discussed earlier. The LPC coefficients are computed once per frame; however, the codebook search for the excitation and the pitch analysis are made at a 7.5 ms subframe rate. The spectrum is coded using a total of 34 bits for the 10 LSPs, and 144 total bits are used for each 30 ms frame. The number of bits for the LPC filter can be minimized because each LSP tends to occur in a fairly limited frequency range, and only three or four bits are required per LSP. The LSPs are linearly interpolated over the frame, as shown in Table 9.5. The past and future spectra are centered at the beginning and ending of the present frame's excitation parameters, respectively. This requires that the LPC parameters be computed half a frame (two subframes) ahead of the excitation parameters.

*For more information, see Campbell et al. [39, 49], Kemp et al. [50], Tremain et al. [51], and Federal Standard 1016 [52].

The codebook excitation consists of the sum of two parts, the stochastic codebook and the adaptive codebook, as shown in Fig. 9.22. The stochastic codebook consists of 512 sequences of 60 samples. Each sequence consists of the ternary values -1, 0, and +1. Each sequence differs from the adjacent sequence in only two places: it is a shift of two samples from the previous sequence with two new values appended. This has been found to give as subjectively good a result as independent random values. Moreover, it simplifies the search calculations by allowing end-point correction algorithms. A stochastic codebook gain, G1, is determined along with the optimum stochastic codebook sequence for the subframe being synthesized. As shown, the adaptive codebook is somewhat of a simplification. It is formed using values of a memory or shift register holding past values of the filter excitation.
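The overlapped codebook can be generated from a single long ternary sequence, as in the Python sketch below. The sparsity (the probability of a zero sample) is an assumption for illustration; the standard specifies the exact sequence.

```python
import numpy as np

def make_stochastic_codebook(n_codes=512, length=60, shift=2, seed=0):
    """Overlapping ternary codebook: each codeword is the previous one
    shifted by two samples with two new values appended, so adjacent
    codewords differ in only two places."""
    rng = np.random.default_rng(seed)
    n_total = length + shift * (n_codes - 1)
    seq = rng.choice([-1.0, 0.0, 1.0], size=n_total, p=[0.1, 0.8, 0.1])
    return [seq[i * shift : i * shift + length] for i in range(n_codes)]
```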

Initially, the memory is filled with zeros. During the first subframe of 7.5 ms, the excitation values, ex(n), are stored in a memory, as shown in Fig. 9.23. We use the convention that the present sample has an index of zero, the value stored before that an index of negative one, the preceding value negative two, and so on. Therefore, the index number represents a delay from the present. The adaptive codebook is calculated using the values in memory as they exist at the beginning of the subframe. Thus, index -1 is the last value stored in the previous subframe. To understand how the adaptive codebook values are calculated, we first consider the case where the pitch period is less than 60 (60 being the number of samples in a 7.5 ms subframe). Table 9.6 shows the way the values stored in the memory are used, or grouped, to produce the adaptive codebook output during the second subframe. The table is formed from the data points as they were in the memory at the end of the previous subframe.

TABLE 9.5 Interpolation Weighting for LSPs

Subframe    Past Spectrum    Future Spectrum
   1             7/8               1/8
   2             5/8               3/8
   3             3/8               5/8
   4             1/8               7/8

FIGURE 9.22 CELP synthesizer (stochastic codebook with gain G1 summed with adaptive codebook with gain G2)

FIGURE 9.23 Adaptive codebook storage after the first subframe (indices -147 to -61 hold zeros; indices -60 to -1 hold ex(n))

Consider the case when the pitch period is 20 samples long. This corresponds to the bottom row in the adaptive codebook. The first entry in this codebook position corresponds to the value in the memory delayed by 20 samples. The next codebook value (2) corresponds to the value that was delayed 19 samples, and so on, until 20 samples are used. The same 20 samples are then repeated twice to form the 60-sample subframe. The adaptive codebook row is used starting with the left-most value and working toward the right.

Now suppose a pitch period of 21 had been chosen, corresponding to the table index of 3. In this table position, the past excitations, starting 21 samples delayed, are used. After 21 samples, the values are repeated twice as before, except that on the second repetition only words -21 through -4 are needed to fill out the 60-sample codeword. This is consistent, since when the table is loaded for the next subframe, the first entry will be delayed 21 samples from the excitation at that time, which corresponds to the next position of the periodic waveform if the period has not changed. It corresponds to a point on the waveform three periods later, however.

Now, consider the case of a pitch period longer than the subframe (e.g., a delay of 147), which corresponds to the top row in Table 9.6. In this case, the first value stored in the table is delayed 147 samples. The 60th value is delayed 88 samples.

Obviously, it is possible to produce the adaptive codebook from a single series of 147 memory locations by clever manipulation of pointers. Nevertheless, it may be easier to think of the operations performed as calculating all the entries in the adaptive codebook at the beginning of each subframe.
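In code, the repeating structure of Table 9.6 reduces to a single modular index calculation. The Python sketch below forms one integer-delay codeword directly from the excitation memory.

```python
def adaptive_codeword(memory, delay, subframe=60):
    """One adaptive codebook entry from past excitation values.

    memory : past excitation; memory[-1] is the most recent sample, and at
             least 147 past samples are assumed to be held
    delay  : integer pitch delay, 20..147
    """
    # For delay < subframe, n % delay re-reads the same `delay` past
    # samples periodically, repeating them to fill the 60-sample codeword.
    return [memory[-delay + (n % delay)] for n in range(subframe)]
```

With delay = 21, for example, the last entry is memory[-21 + 59 % 21] = memory[-4], matching the third row of Table 9.6.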

The 128 integer delay codewords are formed by repeating samples from the adaptive codebook as indicated above.


TABLE 9.6 Adaptive Codebook Structure

Index   Delay   Adaptive Codebook Numbers
 255     147    -147, -146, -145, ... -89, -88
 131      61    -61, -60, -59, -58, ... -3, -2
 128      60    -60, -59, -58, -57, ... -2, -1
   3      21    -21, -20, ... -1, -21, -20, ... -1, -21, -20, ... -4
   0      20    -20, -19, ... -1, -20, -19, ... -1, -20, -19, ... -1

The 128 noninteger delay codewords are formed by interpolation of the adaptive codebook's samples and are assigned the index numbers between the integer values. The 256 delays specified by the proposed Federal Standard 1016 are:

Delay Range       Resolution
20 to 25 2/3         1/3
26 to 33 3/4         1/4
34 to 79 2/3         1/3
80 to 147             1

We note in Fig. 9.22 that a pitch gain, G2, is supplied with the adaptive codebook position to be used. The value of the pitch gain, which may be between -1 and +2.0, determines the strength of the periodic component in the excitation, as well as how quickly a periodic component builds up or dies away.

The best stochastic and adaptive codebook values for each subframe are determined by passing each entry through the LPC filter and comparing the results with the actual speech. This is shown in Fig. 9.24. The weighting filter is used to emphasize those areas of the spectrum to which the human ear is most sensitive. During each subframe, the procedure is first to try the adaptive codebook entries and the pitch gain, G2. After these are determined, all values in the stochastic codebook are tried, and the gain, G1, is optimized.

FIGURE 9.24 CELP analyzer (error minimization loop producing the excitation ex(n))
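The search itself is a brute-force analysis-by-synthesis loop. The Python sketch below omits the perceptual weighting filter and the end-point correction speedups for clarity; the gain shown is the least-squares optimum for each candidate codeword.

```python
import numpy as np
from scipy.signal import lfilter

def search_codebook(codebook, target, lpc_a):
    """Pick the codeword and gain whose synthesized output best matches
    the target speech segment in the squared-error sense.

    lpc_a : synthesis filter denominator [1, -a1, ..., -aN], so that
            lfilter([1.0], lpc_a, c) applies 1/A(z) to codeword c
    """
    best_idx, best_gain, best_err = None, 0.0, np.inf
    for idx, c in enumerate(codebook):
        y = lfilter([1.0], lpc_a, c)                       # synthesized candidate
        g = np.dot(y, target) / max(np.dot(y, y), 1e-12)   # optimal gain
        err = np.sum((target - g * y) ** 2)
        if err < best_err:
            best_idx, best_gain, best_err = idx, g, err
    return best_idx, best_gain, best_err
```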

All values in the adaptive codebook are coded on odd subframes. During even subframes, the delay is delta coded and can take on values within -31 to +32 indices of the delay used in the previous subframe.

The analog audio input to the CELP coder is bandpass filtered to a frequency range from 100 to 3,800 Hz. A Hamming error correction code is used on some of the most vulnerable bits prior to encoding. The bit allocation for the various parameters is given in Table 9.7.

The use of the stochastic codebook in connection with the adaptive codebook allows great flexibility in determining the excitation for the synthesis filter and is primarily responsible for the improved performance as compared with the LPC vocoder described previously.

The improvements are not without cost, however, and the amount of computation, which depends on the size of the codebook searches, is roughly one order of magnitude higher than for an impulse-excited vocoder. As higher-speed signal processors are developed, this will become less of a factor.

There are additional subtle features, some of which may be proprietary to individual manufacturers, in the CELP algorithm. The interested reader is referred to the cited references for a more exact and detailed explanation of the coder.


TABLE 9.7 CELP Bit Allocation

                       Subframe
Parameter           1    2    3    4    Frame
LSP1                                      3
LSP2                                      4
LSP3                                      4
LSP4                                      4
LSP5                                      4
LSP6                                      3
LSP7                                      3
LSP8                                      3
LSP9                                      3
LSP10                                     3
Pitch delay         8    6    8    6     28
Pitch gain          5    5    5    5     20
Codebook index      9    9    9    9     36
Codebook gain       5    5    5    5     20
Future expansion                          1
Hamming parity                            4
Synchronization                           1
Total                                   144

PROBLEMS

9-1 a) Derive the frequency response of a DPCM modulator when the prediction filter is a one-sample delay (see Fig. 9.3).
    b) Sketch the frequency response normalized to the sample rate.

9-2 Using the Levinson-Durbin method, write the expressions for the prediction filter coefficients (α values) for a third-order predictor. The autocorrelation matrix is given below for the frame under consideration.

R(\tau) = \begin{bmatrix} 1 & 0.5 & 0.25 \\ 0.5 & 1.0 & 0.5 \\ 0.25 & 0.5 & 1 \end{bmatrix}

Also, R(3) = 0.1.


9-3 Draw the schematic diagram of an analysis filter using the α values for the results found in Problem 9-2. Calculate the impulse response for the first four output samples (see Fig. 9.11).

9-4 a) Draw the schematic diagram of a synthesis filter using the α values found in Problem 9-2.
    b) Write the difference equations for the output signal.
    c) Calculate the impulse response for the first four output samples.

9-5 A method of converting the prediction filter coefficients (α) to reflection coefficients (K) was presented in Eqs. (9.66) and (9.67). Using these recursion formulas, convert the α values in Problem 9-2 to reflection coefficients. Check the results against the reflection coefficients found in Problem 9-2 as a by-product of the Levinson-Durbin method.

9-6 a) Draw the schematic diagram of a lattice synthesis filter using the reflection coefficients found in Problem 9-2.
    b) List the iterative equations for the filter.
    c) Write a computer program to calculate the impulse response of the filter and list the results for the first four samples. Compare the results with the impulse response calculated for the synthesis filter in Problem 9-4.

9-7 Given three arbitrary points of a parabola, (x_1, y_1), (x_2, y_2), and (x_3, y_3), show that the x coordinate of the minimum (or maximum) is given by the expression in Eq. (9.93).

9-8 The impulse responses of the functions used to solve for line spectrum pairs are known to contain zeros on the unit circle in the Z plane. This indicates that there are real frequencies for which the FFT of the impulse response has zero values. The FFT output is examined for the value of K giving a minimum response. The amplitudes of the adjacent points are also noted, so that

K - 1 → A1
K → A2
K + 1 → A3


We have A1 > A2 and A3 > A2. In this case, the expression derived in Problem 9-7 can be used to find the exact index of the zero. Show that the expression can be simplified to the form of Eq. (9.94).