9
Speech Processing
The use of digital speech has become widespread, and it provides enormous advantages in some applications over analog voice signals. One of the most important applications of digital speech is in long distance telephone communications. Digital speech can be passed through an almost unlimited number of repeaters with negligible degradation. Analog signals, on the other hand, suffer a small amount of degradation each time they are passed through a repeater.
Digital speech can easily be switched using multiplexers, and a large number of digital signals can be combined on a single link by time division multiplexing (TDM). Time division multiplexing eliminates the need for banks of analog filters required for frequency division multiplexing (FDM), which was used in older, analog systems.
Digital speech has several other advantages. For example, in military systems it can easily be encrypted for secure voice transmission. It is much easier to scramble the bits of a digital signal than to scramble an analog signal. Another application where digital speech has an advantage is in military frequency hopping systems. Here, it is possible to packetize the data and still eliminate the undesirable clicks caused by frequency hopping.
Another advantage in some links, where the transmission path is poor, is that powerful error correction codes can be used to reduce the error rate as much as required, whereas with analog transmission, distortions and noise in the transmission path cannot be removed. Digital speech may also be stored and retrieved without loss of quality.
M. E. Frerking, Digital Signal Processing in Communication Systems. © Springer Science+Business Media New York 1994
Digital speech has a major disadvantage, one that is serious in many applications. High-quality digital speech requires a high bit rate that, if transmitted directly, results in a significantly wider bandwidth than that of the analog speech from which it originated. Sophisticated digital signal processing techniques can be employed to reduce the data rate, but they produce varying degrees of quality degradation. This chapter addresses several of the more widely used methods for digitizing speech. Because of the different requirements and constraints, we find that widely differing methods are used.
The telephone industry has requirements for reasonably high-quality speech at low cost. The bandwidth, while important, is not a critical parameter. Pulse code modulation (PCM) with companding (a process in which compression is followed by expansion) has been adopted for this service. The sample rate is 8 ks/s, and 8-bit words are used, resulting in a data rate of 64 kbps. At baseband, this data rate would require approximately 32 kHz of bandwidth, compared to 3 kHz for the analog signal, an increase of approximately 10:1. More sophisticated modems, as discussed in Chapter 8, can be used to reduce the bandwidth significantly. Nevertheless, because the data rate is rather directly related to cost, there is much emphasis on reducing the data rate. As a result, 32 kbps ADPCM and some 16 kbps systems are also used. In many applications, the permissible bandwidth is severely restricted, and this leads to requirements for much lower data rates. The radio spectrum is a prime example of how severe the crowding of the available spectrum space has become. Many services are channelized on 25 kHz centers, and it is necessary to restrict the spectrum to somewhat less than this value. One method used in these applications is continuously variable slope delta modulation (CVSD). The hardware required is relatively simple, and the data rate is often 16 kbps. While the speech quality is not as good as 64 kbps PCM, it is easily recognizable, and an error rate of 5 to 10 percent can be tolerated before it becomes unintelligible. Other systems used are adaptive differential PCM (ADPCM), with a data rate of 32 kbps, and adaptive predictive coding, at 9.6 kbps.
A relatively new system, referred to as codebook excited linear prediction (CELP), is presently being investigated for military and commercial applications as a near toll-quality system. The data rate is in the 4 to 9.6 kbps range. IS-54 (VSELP) is a system using CELP techniques, and it has a data rate of 8 kbps (see EIA/TIA IS-54 Interim Standard [87]). Federal Standard 1016 [52] defines a CELP system with a data rate of 4.8 kbps.
Still lower data rates can be obtained with linear predictive vocoders (LPC), where a data rate of 2.4 kbps is common. The quality of LPC speech leaves something to be desired but, in a quiet environment, it is quite understandable. The error rate that can be tolerated is only a few percent, however.
Still lower bit rates can be obtained by special encoding of the LPC parameters using vector quantization techniques (see Rebolledo et al. [88]). Data rates of 400 to 800 bps are possible. As one would expect, there is additional degradation relative to 2,400 bps LPC. A great deal of research is being conducted on low data rate
speech algorithms at the time of this writing. Another important area of research is directed toward obtaining toll-quality speech, similar to PCM, at modest data rates in the 4,800 to 9,600 bps range. Obviously, any reduction in the data rate required to obtain toll-quality speech can result in significant additional channel capacity and reduce the cost of communications.
The remainder of this chapter describes several of the more widely used methods of digitizing speech.
PULSE CODE MODULATION
Pulse code modulation is really nothing more than digitizing speech with an A/D converter, as discussed in Chapter 3. If the digital speech is to be processed locally, there may be no particular motivation to minimize the data rate. A sample rate in the vicinity of 16 ks/s with a 12-bit A/D converter may then be a good choice. A relatively simple anti-aliasing filter can be used with a stopband above 8 kHz. The passband should extend at least to 3 kHz.
As indicated previously, the telephone system has adopted a sample rate of 8 ks/s. This places a more severe requirement on the anti-aliasing filter, and a switched-capacitor filter, implemented in an analog IC along with the A/D converter, is often used. A lowpass filter in series with a highpass filter is sometimes used in the PCM chips. The passband extends from approximately 200 Hz to 3.4 kHz. The attenuation is on the order of 14 dB at 4 kHz, increasing to over 32 dB by 4.6 kHz. Since only eight bits are used, the system leaves something to be desired with regard to dynamic range. Noise is reduced using a nonlinear voltage-shaping function preceding the A/D converter. This provides greater resolution for small signals at the expense of large signals. The non-uniform quantization reduces the quantization noise at small signal levels, where it is most noticeable. Most North American systems use the μ-law curve, which is given by
|V_OUT| = V_MAX · ln(1 + μ|V_IN|/V_MAX) / ln(1 + μ),  μ = 255   (9.1)
The sign of V_IN is attached to V_OUT. Because the output involves the ratio of two logarithms, any base of logarithm gives identical results. Equation (9.1) is shown graphically in Fig. 9.1. An A/D converter with this characteristic is referred to as a CODEC.
FIGURE 9.1 μ-law characteristic for CODEC (CODEC output voltage versus CODEC input voltage, both axes normalized from −1.00 to 1.00)
The digital signal must be linearized either prior to performing any filtering operations or before it is converted back to an analog signal. The linearizing function for Eq. (9.1) is given in Eq. (9.2). As before, the sign of V₁ is attached to V₂.
|V₂| = (V_MAX/μ)[(1 + μ)^(|V₁|/V_MAX) − 1]   (9.2)
European systems have adopted the A-law curve, which is similar but not identical to the μ-law. The A-law is given in Eqs. (9.3) and (9.4).
For 0 ≤ |V_IN|/V_MAX ≤ 1/A:

|V_OUT| = (A|V_IN|/V_MAX) / (1 + ln A)   (9.3)

For 1/A ≤ |V_IN|/V_MAX ≤ 1:

|V_OUT| = [1 + ln(A|V_IN|/V_MAX)] / (1 + ln A)   (9.4)

A typical value for A is 87.6.
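The companding curves above are easy to check numerically. The following sketch implements the continuous μ-law and A-law characteristics of Eqs. (9.1) through (9.4); note that practical CODEC chips typically realize piecewise-linear (segmented) approximations of these curves, so this is the idealized form only.

```python
import math

MU = 255.0      # North American mu-law parameter, Eq. (9.1)
A_PARAM = 87.6  # typical European A-law parameter

def mu_compress(v, v_max=1.0):
    """Eq. (9.1); the sign of the input is attached to the output."""
    s = 1.0 if v >= 0 else -1.0
    return s * v_max * math.log(1.0 + MU * abs(v) / v_max) / math.log(1.0 + MU)

def mu_expand(v, v_max=1.0):
    """Eq. (9.2), the inverse of Eq. (9.1)."""
    s = 1.0 if v >= 0 else -1.0
    return s * (v_max / MU) * ((1.0 + MU) ** (abs(v) / v_max) - 1.0)

def a_compress(v, v_max=1.0, a=A_PARAM):
    """Eqs. (9.3) and (9.4): linear below 1/A, logarithmic above."""
    s = 1.0 if v >= 0 else -1.0
    x = abs(v) / v_max
    if x < 1.0 / a:
        return s * a * x / (1.0 + math.log(a))
    return s * (1.0 + math.log(a * x)) / (1.0 + math.log(a))
```

For a small input such as 0.01·V_MAX, the μ-law output is roughly 0.23·V_MAX, illustrating the increased resolution given to low-level signals.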
DIFFERENTIAL PULSE CODE MODULATION
When an analog signal is sampled, it is often found that adjacent samples are not significantly different from one another. This implies that the high frequency content is relatively small. When this is the case, the data rate can be reduced by transmitting only the differences from the previous values rather than the absolute value of each point. A prediction filter is used in the receiver to reconstruct the original input from the transmitted differences. A block diagram of the receiver is shown in Fig. 9.2. The predictor may be as simple as a single delay, which represents an integrator that adds the old value to the new difference input. A similar prediction filter is used in the transmitter, as shown in Fig. 9.3.
The prediction filters in both the transmitter and the receiver have the same input. Therefore, the received signal is also reproduced at the transmitter. This signal is subtracted from the analog input to produce the next differential to be transmitted.
The block diagrams show a digital implementation, and we have assumed that the input signal has already been digitized with sufficient accuracy for audio, perhaps to 12 bits. An alternate implementation is to quantize the signal after the
FIGURE 9.2 DPCM receiver block diagram (K-bit received differences in; reconstructed digital signal out, N > K bits)

FIGURE 9.3 DPCM transmitter block diagram (N-bit digitized analog input; transmitted differences of K < N bits)
summation. The output of the prediction filter must then be reconstructed to produce an analog signal prior to subtracting it from the analog input. This implementation with a one-bit word forms the basis of delta modulation, which is discussed in the next section.
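The transmitter/receiver loop of Figs. 9.2 and 9.3 can be sketched in a few lines. Here the quantizer is reduced to plain rounding and the predictor to a single delay, purely for illustration:

```python
def dpcm_encode(samples, quantize=round):
    """DPCM transmitter (Fig. 9.3) with the simplest predictor: a single
    delay, so the prediction is the previously reconstructed value."""
    predicted, diffs = 0, []
    for x in samples:
        d = quantize(x - predicted)   # the K-bit difference word
        diffs.append(d)
        predicted += d                # same reconstruction the receiver forms
    return diffs

def dpcm_decode(diffs):
    """DPCM receiver (Fig. 9.2): the predictor integrates the differences."""
    predicted, out = 0, []
    for d in diffs:
        predicted += d
        out.append(predicted)
    return out

diffs = dpcm_encode([0, 2, 5, 9, 12, 14, 15])
# The slowly changing input produces small difference words rather than
# the absolute sample values.
```

Because the transmitter predicts from its own reconstructed values, quantization errors do not accumulate between the two ends.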
DELTA MODULATION
Delta modulation is often used in applications where a lower data rate is required than that used for PCM, but where it is desirable to use a single IC to digitize the signal. In these applications, CVSD is often used. Before discussing CVSD, we will first describe delta modulation. It is then a small step to introduce the variable slope parameter.
A block diagram of a delta modulator is shown in Fig. 9.4. The companion receiver is shown in Fig. 9.5. Referring to Fig. 9.4, the analog input is compared with the output of the integrator. If the integrator output is too low, a +V signal is generated. This causes the integrator output to increase at a rate S = V/RC, where RC is the time constant of the integrator. If the integrator output is still too low at the next clock cycle, another + V output is generated. Otherwise a -V output is produced.
FIGURE 9.4 Block diagram of delta modulator (analog input, comparator, clocked latch, analog integrator in the feedback path; transmitted bit stream out)

FIGURE 9.5 Block diagram of delta modulator receiver (received bit stream into an analog integrator; recovered audio output)
The slope, which we will call S, obviously must be as large as the highest slope of the incoming signal if the system is to track. A large value of S results in higher quantization noise due to "hunting" around the correct value for slowly changing signals. Therefore, we wish to make S large enough but not excessively so.
For a sine wave input, e = A cos(2πf_m t), the slope, found by differentiating, is

de/dt = −2πA f_m sin(2πf_m t)

The maximum slope magnitude occurs at t = 1/(4f_m). Setting this value equal to the slope S, we have

A = S/(2πf_m)   (9.5)

The noise power can be found by noting that the smallest step size is

δ = ST − (−ST) = 2S/f_s

where

T = the sample time
f_s = the sample rate

As determined in Chapter 3, the noise power is given by

N = δ²/12 = S²/(3f_s²)   (9.6)

for a uniformly distributed variable. Since the quantization noise occurs in a system with feedback, the output noise spectrum from the modulator has a slope proportional to frequency. After the integrator in the demodulator, however, it is flat. Equation (9.6) represents the noise at the demodulator. The maximum signal power is

P_S = A²/2
Substituting for A from Eq. (9.5) gives

P_S = S²/(8π²f_m²)   (9.7)
Combining Eq. (9.7) with Eq. (9.6), the maximum ratio of signal to total noise power is

P_S/N = 3f_s²/(8π²f_m²)   (9.8)
The noise is uniformly distributed over the frequency range from 0 to f_s/2. Therefore, the noise power density is Eq. (9.6) divided by f_s/2, and the maximum signal-to-noise density ratio becomes

P_S/N₀ = 3f_s³/(16π²f_m²)   (9.9)
As can be seen from this equation, the signal-to-noise density ratio improves as the third power of the sample rate. It is often undesirable to increase the sample rate, however, since this implies a larger transmission bandwidth. Another approach is to use a small slope when the signal is small and to increase the slope when the rate of change of the signal becomes large. As with companded PCM, the quantization noise is not as objectionable when a large signal is being digitized. This is the essence of CVSD, discussed in the next section.
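The tracking behavior described above can be simulated directly. The following sketch implements the loops of Figs. 9.4 and 9.5 with illustrative parameter values, choosing the step from Eq. (9.5) with a 20 percent margin so the loop can track the test tone:

```python
import math

def delta_modulate(samples, step):
    """One-bit delta modulator (Fig. 9.4): compare the input with a local
    integrator and emit +1 or -1; the integrator moves one step per clock."""
    bits, integ = [], 0.0
    for x in samples:
        b = 1 if x >= integ else -1
        integ += b * step
        bits.append(b)
    return bits

def delta_demodulate(bits, step):
    """Companion receiver (Fig. 9.5): integrate the received bit stream."""
    integ, out = 0.0, []
    for b in bits:
        integ += b * step
        out.append(integ)
    return out

# Track a 200 Hz tone sampled at 16 ks/s.  Eq. (9.5) gives the largest
# amplitude the loop can follow, A = S/(2*pi*fm); the per-sample step is
# S/fs, chosen here with a 20 percent margin over the minimum.
fs, fm, amp = 16000.0, 200.0, 0.5
step = 1.2 * 2.0 * math.pi * fm * amp / fs
x = [amp * math.cos(2.0 * math.pi * fm * n / fs) for n in range(400)]
y = delta_demodulate(delta_modulate(x, step), step)
```

After the initial acquisition ramp, the reconstructed waveform hunts about the input by roughly one step, which is the quantization noise analyzed above.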
CONTINUOUSLY VARIABLE SLOPE DELTA MODULATION
As the name implies, continuously variable slope delta modulation changes the slope in a delta modulator as needed. One way to accomplish this is to monitor the output bits and, if more than three or four consecutive ones are transmitted, increase the slope. Likewise, if more than three or four consecutive zeros are transmitted, the rate of change of the negative slope is increased. A block diagram of the resulting CVSD modulator is shown in Fig. 9.6. The output data is also read into a local shift register. When the output consists of three consecutive ones or three consecutive zeros, the slope integrator output increases. Otherwise, it decays toward a minimum fixed value.
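A sketch of that three-bit run rule follows. The decay factor, growth factor, and step limits below are illustrative assumptions, not values from any particular CVSD chip or standard:

```python
def cvsd_step_update(bits, step, step_min=0.01, step_max=1.0,
                     decay=0.98, growth=1.2):
    """If the last three output bits agree, raise the slope magnitude;
    otherwise let it decay toward a fixed minimum value."""
    if len(bits) >= 3 and bits[-1] == bits[-2] == bits[-3]:
        return min(step * growth, step_max)
    return max(step * decay, step_min)

def cvsd_modulate(samples, step0=0.05):
    """Delta modulator loop with the adaptive step of Fig. 9.6."""
    bits, integ, step = [], 0.0, step0
    for x in samples:
        b = 1 if x >= integ else -1
        bits.append(b)
        step = cvsd_step_update(bits, step)
        integ += b * step
    return bits
```

Because the receiver sees the same bit stream, it can run the identical step update and reproduce the integrator output, which is why no side information about the slope needs to be transmitted.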
FIGURE 9.6 Block diagram of CVSD modulator (comparator, clocked latch, shift register with three-bit coincidence detector driving the slope integrator, output data)

The receiver for CVSD data has a similar shift register and decoding network, so that the voltage, e₂, is approximately reproduced at the receiver. CVSD modulators have been used successfully in various equipment, including military communication transmitters and receivers. The data rate for military equipment is often 16 kbps. The speech is reasonably good and is easily understood in a quiet environment, but it is not as good as 64 kbps PCM. An error rate of 5 to 10 percent can be sustained before the speech becomes unintelligible. CVSD tends to be more robust with respect to errors than PCM, because an error only affects the slope for one sample time. A PCM error, on the other hand, generates "pops" in the received signal, particularly when it occurs in the most significant bit positions. CVSD has also been used at 32 kbps to reproduce reasonably high-quality speech.
LINEAR PREDICTIVE CODING
There are many advantages to digitizing speech at a low data rate, and LPC is an effective way to dramatically reduce the required data rate. In some services (e.g., telephone communications), the advantage is primarily economic, especially for long distance communications. It is theoretically possible to multiplex 26 users in a 64 kbps PCM channel with 2,400 bps LPC coded speech. This has not been done to any great extent. A more accepted method of utilizing LPC at 2,400 bps is to multiplex four channels into a 9,600 bps modem using QAM and to transmit the signal over an analog telephone channel. Unfortunately, 2,400 bps LPC, although acceptably intelligible, has a certain machine-like quality that has prevented its widespread acceptance by the general public.
LPC is most useful in other systems where it is simply not possible to obtain the needed bandwidth for PCM. A good example of this is over the HF radio channel, where the signal must often be contained in a 3 kHz bandwidth. There are several ways to transmit 2,400 bps digital signals in a 3 kHz channel as discussed in Chapter 8.
The linear predictive technique is most often used to digitize speech at the 2,400 bps rate, although some success has also been achieved as low as 400 bps. A small amount of error correction coding can be included in the 2,400 bps data stream, and an error rate of 1 or 2 percent can reasonably be tolerated.
The basic principles of LPC speech encoding are discussed in the following paragraphs (see Rabiner and Schafer [36,44]).* A military standard called LPC-10e has evolved, and some of the material in this section is drawn from that approach (see Tremain [38]). LPC is not so much a method of digitizing a signal as a technique for analyzing speech to determine certain parameters and representing them digitally. To understand the system, it is helpful to look briefly at the human vocal tract first.
During voiced sounds, the vocal cords vibrate with a specific pitch frequency. The output from the vocal cords resembles a pulse excitation, which is rich in harmonics. This signal is shaped by the cavities of the throat, mouth, and nose. The various harmonics are increased or decreased relative to one another to form the various sounds. The resonant frequencies of the vocal system are called formants and represent two-pole, second-order resonances, as seen in Eq. (9.12). The average pitch excitation for a male voice is about 130 Hz. The female voice is approximately one octave higher, so LPC intelligibility scores are often slightly better for male speakers.
In addition to voiced sounds, there are also unvoiced sounds, such as the fricatives f in "five" and s in "six." In this case, the vocal cords do not vibrate. Instead, the excitation to the vocal tract resembles white noise and is caused by turbulent air flow. Therefore, one may postulate a speech model such as that shown in Fig. 9.7.
The basic idea underlying linear predictive coding is to approximate the vocal tract filters on a short-term basis (20 to 30 ms) and provide the right excitation to the synthesis filter to approximate speech. We can adequately model the cavities
FIGURE 9.7 Approximate equivalent circuit of human vocal tract (excitation source driving the vocal tract filtering to produce the speech signal)
*Parts of this section are adapted from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, pp. 399-402, 411, 413-414, 443, 444. Copyright © 1978. Adapted by permission of Prentice Hall, Inc., Englewood Cliffs, NJ.
with an all-pole filter, which we will call the synthesis filter. We will designate its transfer function by 1/A(z). The parameters of the synthesis filter can be derived from a digital transversal filter whose coefficients have been optimized to predict the present speech sample from past values. We will spend a considerable amount of time showing why this is possible and how to find the coefficients of the prediction and synthesis filters. Subsequently, we will also discuss algorithms to determine the pitch period, how to make voiced/non-voiced decisions, and so on.
We will begin by letting the excitation signal from the vocal cords be represented by v(n). For voiced sounds, when the vocal cords are vibrating, v(n) can be approximated by a series of impulses. In the z plane, the excitation can be represented by
V(z) = σ Σ_{n=0}^{∞} (z^(−k))^n   (9.10)
where the pitch period is k sample times. For unvoiced sounds, v(n) is approximated by white noise.
The glottal shaping model can be approximated by the transfer function *
(9.11)
The majority of the shaping of the acoustic spectrum is accomplished by the vocal tract, consisting of the nose and mouth. This shaping is approximated by an all-pole filter model of the form

H(z) = K / ∏_{j=1}^{n} [1 − 2e^(−c_j T) cos(B_j T) z^(−1) + e^(−2c_j T) z^(−2)]   (9.12)
The jth formant frequency is

F_j = B_j/(2π)

and the formant bandwidth is

BW_j = c_j/π
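Assuming the conventional mapping from one conjugate-pole section of Eq. (9.12) to a formant (pole angle 2πF_j/f_s and pole radius e^(−πBW_j/f_s), per the relations above), a single resonator can be sketched as follows. The 500 Hz formant with 60 Hz bandwidth is merely an illustrative first-formant value:

```python
import cmath
import math

def formant_section(f_formant, bandwidth, fs):
    """Denominator of one second-order resonance of Eq. (9.12):
    1 - 2 r cos(theta) z^-1 + r^2 z^-2, with r = exp(-pi*BW/fs)
    and theta = 2*pi*F/fs."""
    r = math.exp(-math.pi * bandwidth / fs)
    theta = 2.0 * math.pi * f_formant / fs
    return [1.0, -2.0 * r * math.cos(theta), r * r]

def magnitude_response(den, f, fs):
    """|1/D(z)| evaluated on the unit circle at frequency f."""
    z = cmath.exp(2j * math.pi * f / fs)
    d = den[0] + den[1] * z**-1 + den[2] * z**-2
    return 1.0 / abs(d)

den = formant_section(500.0, 60.0, 8000.0)  # illustrative first formant
```

The magnitude response peaks near the formant frequency and falls away on either side, which is how a cascade of such sections shapes the harmonics of the excitation.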
*Equations (9.11), (9.12), (9.15), (9.19), (9.21), (9.22), and (9.25) were adapted from J.D. Markel and A.H. Gray, Linear Prediction of Speech, by permission. Copyright © 1976 by Springer-Verlag.
The speech output is given by
S(z) = V(z) G(z) H(z)   (9.13)
We now lump the glottal shaping filter, the vocal tract, and the additional lip radiation shaping into a single synthesis filter 1/A(z). The resulting model is shown in Fig. 9.8. Using an all-pole model approximation, we have
S(z)/(G V(z)) = 1/A(z) = 1 / (1 − Σ_{i=1}^{p} a_i z^(−i))   (9.14)

Then

A(z) = 1 − Σ_{i=1}^{p} a_i z^(−i)

The filter 1/A(z) is called the synthesis filter, and

S(z) = [1/A(z)] G V(z)   (9.15)
The parameters needed to characterize a speech sample are the values of the a coefficients, the gain factor G, the pitch period, and the voiced/unvoiced decision.
FIGURE 9.8 Simplified equivalent circuit of vocal tract (impulse source at the pitch period, voiced/unvoiced switch, variable gain G, digital filter 1/A(z), output s(n))
These parameters all vary slowly with time. If the filter size, P, is large enough, the all-pole model can accurately represent the speech sample. For most LPC applications, P is chosen to be 10. The resulting system is referred to as LPC-10.
We will now study the two most popular ways to estimate the filter parameters (the "a" terms). The first of these is the autocovariance method, which is used in the US government and NATO system. The second is the autocorrelation method. The performance of the two methods is fairly similar, and the particular method used is often a matter of the designer's choice. The basic difference is that the speech data is windowed for the autocorrelation method so that the values are zero at the edges of the analysis window. This results in a correlation matrix with equal diagonal elements that can be inverted more easily. A recursion process can also be used to avoid inversion altogether.
The covariance method, on the other hand, has a small performance advantage because it does not throw away some of the speech samples by windowing. However, it requires more computation.
One of the major advantages of the all-pole speech model is that the coefficients can be estimated in a straightforward manner. We will now proceed to describe the methods used.
From Eq. (9.14), the speech samples are related to the excitation by

S(z) = Σ_{i=1}^{p} a_i S(z) z^(−i) + G V(z)   (9.16)

From Eq. (9.15), we may write

1 − A(z) = Σ_{i=1}^{p} a_i z^(−i)

Substituting this in Eq. (9.16) gives

S(z) = [1 − A(z)] S(z) + G V(z)

or

S(z) = G V(z)/A(z)   (9.17)
We may also write the time domain response by taking the inverse z-transform of Eq. (9.16), which gives

s(n) = Σ_{i=1}^{p} a_i s(n−i) + G v(n)   (9.18)
To use this model, we must find a reasonable method to determine the values of the "a" terms. Knowing these values, along with the excitation v(n) and the gain, will then allow us to reconstruct the speech samples at the receiver.
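Equation (9.18), together with the excitation model of Fig. 9.8, can be written out directly as a difference equation. In the sketch below the second-order coefficients, gain, and pitch period are illustrative values only, chosen to keep the example small (P = 10 would be used in practice):

```python
def lpc_synthesize(a, excitation, gain):
    """Eq. (9.18): s(n) = sum_{i=1..p} a_i s(n-i) + G v(n)."""
    p = len(a)
    s = []
    for n, v in enumerate(excitation):
        acc = gain * v
        for i in range(1, p + 1):
            if n - i >= 0:
                acc += a[i - 1] * s[n - i]
        s.append(acc)
    return s

def impulse_train(length, pitch_period):
    """Voiced excitation v(n): a unit impulse every pitch period."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]

# 80 Hz pitch at an 8 ks/s sample rate gives a period of 100 samples.
v = impulse_train(400, 100)
s = lpc_synthesize([1.2, -0.8], v, gain=0.5)  # stable 2nd-order example
```

Each pitch impulse excites a decaying oscillation from the all-pole filter, which is the mechanism by which the synthesizer reproduces voiced speech.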
To determine the values of the "a" terms, we first define a prediction filter to estimate the present value of the speech from the past samples. The present value cannot be perfectly predicted, of course, but a reasonable approximation can be made. We let our estimate of the present value be defined as

ŝ(n) = Σ_{i=1}^{p} α_i s(n−i)   (9.19)

Taking the z-transform gives

Ŝ(z) = Σ_{i=1}^{p} α_i z^(−i) S(z)

The prediction filter is then

P(z) = Ŝ(z)/S(z) = Σ_{i=1}^{p} α_i z^(−i)   (9.20)
The error between the actual sample point and the predicted sample point is referred to as the prediction residual. This error is given by

e(n) = s(n) − ŝ(n)   (9.21)

Substituting Eq. (9.19), we have

e(n) = s(n) − Σ_{i=1}^{p} α_i s(n−i)   (9.22)
Taking the z-transform gives

E(z) = S(z)[1 − Σ_{i=1}^{p} α_i z^(−i)]   (9.23)
We will use this equation later to optimize the α values to obtain the minimum RMS error, e(n).
Now let us suppose that we know the correct values for the "a" terms in Eq. (9.16) and set the α terms equal to these values. Let

α_i = a_i  for i = 1, 2, … p   (9.24)
Making this substitution in Eq. (9.23) gives

E(z) = S(z)[1 − Σ_{i=1}^{p} a_i z^(−i)]

Now, substituting for A(z) from Eq. (9.15), we have
E(z) = S(z) A(z)   (9.25)
Now consider Eq. (9.17) and substitute for S(z). This gives

E(z) = G V(z)   (9.26)
This seems plausible, since the nonpredictable part of the speech is due to the excitation.
Now, if we use Eq. (9.22) to optimize the α terms for minimum error, it will be found that the residual error, e(n)_MIN, results from the non-predictable part of the speech which, of course, is the excitation, G v(n). This is precisely the error we obtained in Eq. (9.26) when we used the correct filter values for the all-pole model (the "a" terms). Therefore, the coefficients giving the minimum prediction error in Eq. (9.22) are also the coefficients of the all-pole speech model proposed in Eq. (9.14).
We now take a moment to review what we have presented thus far before calculating the coefficient values of the prediction filter. Referring to Fig. 9.9, we have approximated the vocal tract as an all-pole filter, 1/A(z), with as yet unknown coefficients. The output of the vocal tract filter, S(z), is passed through the prediction filter (also called an inverse or whitening filter). The
FIGURE 9.9 Block diagram of speech analysis and synthesis process for vocoder: the vocal cord excitation V(z) (impulse or white noise) drives the vocal tract acoustic filter G(z)H(z) = 1/A(z) to produce S(z); the prediction filter A(z) forms the prediction residual e(z); an artificially generated excitation drives the all-pole synthesis filter 1/A(z) to produce the synthesized speech Ŝ(z)
output, called the prediction residual, resembles the excitation signal V(z) to the degree that the all-pole filter approximates the vocal tract. We then approximate the prediction residual by an impulse train or by white noise, as shown in Fig. 9.8.
This signal is shaped by the all-pole synthesis filter to give a reconstructed approximation of the speech at the receiver.
We will now derive the equations necessary to solve for the coefficients, α, of Eq. (9.22) on a short-term basis (i.e., for a short segment of the speech waveform) to minimize the error. If the coefficients were constant, the longer the analysis window, the more accurately the coefficients could be determined. If the window is made too long, however, the coefficients change appreciably during the analysis interval. A compromise is therefore required, which for LPC-10 is often chosen to be 22.5 ms.
The short-term average prediction error is defined to be

E = Σ_{n=n₀}^{n₁} e²(n)   (9.27)

where

E = error for the interval n₀ ≤ n ≤ n₁
Substituting for e(n) from Eq. (9.21) gives

E = Σ_{n=n₀}^{n₁} [s(n) − ŝ(n)]²   (9.28)

and, from Eq. (9.19),

E = Σ_{n=n₀}^{n₁} [s(n) − Σ_{i=1}^{p} α_i s(n−i)]²   (9.29)
Physically, the error signal to be minimized can be visualized as shown in Fig. 9.10. E is the mean square error averaged over the interval n₀ ≤ n ≤ n₁. The quantity e(n) is referred to as the prediction residual, as noted earlier. We will see later that the prediction residual contains a considerable amount of useful information, particularly about the excitation.
FIGURE 9.10 Block diagram of analysis filter showing physical interpretation of error (speech s(n) in, prediction residual e(n) out)
The filter with the transfer function

A(z) = 1 − Σ_{i=1}^{p} α_i z^(−i)

is called the analysis filter. The filter with the transfer function

P(z) = Σ_{i=1}^{p} α_i z^(−i)

is called the prediction filter. The minimum value of E is found by taking the partial derivatives of E with respect to each of the coefficients and setting the partial derivatives to zero. Hence,
∂E/∂α_i = 0,  i = 1, 2, … P   (9.30)
The differentiation is relatively simple to perform and is similar to the adaptive filter analysis performed in Chapter 8. Differentiating Eq. (9.29) gives
−2 Σ_{n=n₀}^{n₁} [s(n) − Σ_{j=1}^{p} α_j s(n−j)] s(n−i) = 0,  i = 1, 2, … P   (9.31)
This leads to the set of simultaneous equations

Σ_{n=n₀}^{n₁} s(n) s(n−i) = Σ_{n=n₀}^{n₁} [s(n−i) Σ_{j=1}^{p} α_j s(n−j)],  i = 1, 2, … P

Interchanging the two summations in the right-hand term gives

Σ_{n=n₀}^{n₁} s(n) s(n−i) = Σ_{j=1}^{p} α_j Σ_{n=n₀}^{n₁} s(n−i) s(n−j),  1 ≤ i ≤ P   (9.32)
We now define the covariance matrix to be

φ(i, j) = Σ_{n=n₀}^{n₁} s(n−i) s(n−j)   (9.33)

Note that the summation requires values of s outside of the interval n₀ ≤ n ≤ n₁, since i and j take on values up to P, which is typically 10.
The autocovariance matrix calculation can be reduced somewhat by first calculating the values φ(0, j) and then using the end correction procedure [38]. Given the values of φ(0, j) for j = 1, 2, … P, the values φ(1, j) can be found. Then the values φ(2, j), and so on, can be calculated.
Now, substituting Eq. (9.33) into Eq. (9.32) gives

Σ_{j=1}^{P} α_j φ(i, j) = φ(0, i),  i = 1, 2, … P   (9.34)

This represents a set of P simultaneous equations that can be solved for the coefficients, α_j.
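A minimal numerical sketch of the procedure: φ(i, j) of Eq. (9.33) is accumulated directly, and the equations of Eq. (9.34) are solved here by plain Gaussian elimination for clarity (the symmetry-exploiting solution is discussed below). The test signal is an illustrative second-order autoregressive sequence so the recovered coefficients can be checked against known values:

```python
def covariance_matrix(s, n0, n1, p):
    """phi(i, j) of Eq. (9.33); needs samples back to s[n0 - p]."""
    return {(i, j): sum(s[n - i] * s[n - j] for n in range(n0, n1 + 1))
            for i in range(p + 1) for j in range(p + 1)}

def solve_lpc(s, n0, n1, p):
    """Solve Eq. (9.34), sum_j alpha_j*phi(i,j) = phi(0,i), by elimination."""
    phi = covariance_matrix(s, n0, n1, p)
    A = [[phi[(i, j)] for j in range(1, p + 1)] for i in range(1, p + 1)]
    b = [phi[(0, i)] for i in range(1, p + 1)]
    for k in range(p):                      # forward elimination with pivoting
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        b[k], b[piv] = b[piv], b[k]
        for r in range(k + 1, p):
            f = A[r][k] / A[k][k]
            for c in range(k, p):
                A[r][c] -= f * A[k][c]
            b[r] -= f * b[k]
    alpha = [0.0] * p                       # back substitution
    for k in range(p - 1, -1, -1):
        alpha[k] = (b[k] - sum(A[k][c] * alpha[c]
                               for c in range(k + 1, p))) / A[k][k]
    return alpha

s = [1.0, 1.2]
for n in range(2, 80):
    s.append(1.2 * s[-1] - 0.8 * s[-2])     # exact AR(2) test signal
alpha = solve_lpc(s, 5, 60, 2)              # recovers [1.2, -0.8]
```

Because the test signal satisfies its own difference equation exactly, the prediction residual is zero and the solution reproduces the generating coefficients; real speech yields a small but nonzero residual.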
Because of the way the autocovariance matrix was formed, the values are symmetrical about the diagonal. As a result, the equations can be solved using a technique that is less computationally intensive than a general matrix inversion. Later, we will discuss a simpler method that can be used if the data is windowed so that the values are zero at the beginning and at the end of the analysis window. This method, called autocorrelation, results in a symmetrical matrix in which the diagonal elements are all equal. By contrast, the autocovariance method does not result in identical diagonal elements.
The standard adopted by the US Government and NATO uses the autocovariance method with the correlation matrix formed as described in Eq. (9.33). There are also procedures that have been developed to optimally choose the placement of the analysis window within the speech frame. These will be discussed later, along with conventions for prefiltering the speech. For the present, we will concentrate on the matrix inversion process for the autocovariance method.
Autocovariance Method
Equation (9.34) defines a set of simultaneous equations. For example, if we choose P = 4, the equations are as follows:
i = 1: α₁φ(1,1) + α₂φ(1,2) + α₃φ(1,3) + α₄φ(1,4) = φ(0,1)
i = 2: α₁φ(2,1) + α₂φ(2,2) + α₃φ(2,3) + α₄φ(2,4) = φ(0,2)
i = 3: α₁φ(3,1) + α₂φ(3,2) + α₃φ(3,3) + α₄φ(3,4) = φ(0,3)
i = 4: α₁φ(4,1) + α₂φ(4,2) + α₃φ(4,3) + α₄φ(4,4) = φ(0,4)
For a practical vocoder, as discussed earlier, P is usually chosen to be 10. For illustration purposes, in this chapter we will often use P = 4. This serves to demonstrate the principles without adding undue mathematical detail.
Cholesky Decomposition*  Since the analysis equations must be solved in real time, it is very desirable to minimize the amount of computation required. As indicated previously, this can be accomplished by exploiting the symmetry properties of the matrix. The autocovariance matrix is positive definite and symmetrical about the diagonal. The diagonal elements are in general not equal, but are related by the relationship

φ(i+1, i+1) = φ(i, i) + s²(n₀ − 1 − i) − s²(n₁ − i)   (9.35)
Equation (9.35) can be used to simplify calculation of the diagonal elements. Because the matrix is symmetrical about the diagonal, it can be solved for the coefficients using Cholesky decomposition, also called the square root method. This method leads to a recursive system of equations for finding the α values.
Once the α values have been obtained, it is necessary to determine if they result in a stable filter. In theory, this can be done by factoring the synthesis filter and determining whether all the poles are within the unit circle in the z plane. A more convenient procedure is normally used, however. This involves solving Eq. (9.34) for an equivalent lattice filter whose parameters are the reflection coefficients. This filter resembles a tubular model of the vocal tract. If any reflection coefficient has a magnitude larger than one, the filter is unstable. An unstable filter is not used; instead, the values from previous frames are substituted.
Reflection Coefficients  Reflection coefficients are also referred to as partial correlation (PARCOR) coefficients. It is relatively easy to calculate the α values of the prediction filter
*Adapted from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, pp. 399-402, 411, 413-414, 443, 444. Copyright © 1978. Adapted by permission of Prentice Hall, Inc., Englewood Cliffs, NJ.
from the reflection coefficients. [See Eqs. (9.44) and (9.45), to be discussed later.] The calculation of the reflection coefficients, represented by K, and the transformation to α values will now be discussed. We begin with the autocovariance matrix, as defined in Eq. (9.33).
The reflection coefficients of the predictor filter can be found as follows.* Since the Φ matrix is symmetrical, it can be factored into an upper and a lower triangular matrix. This gives

Φ = W V^T    (9.36)
For an example where P = 4 we have

W = | W_11  0     0     0    |
    | W_21  W_22  0     0    |
    | W_31  W_32  W_33  0    |    (9.37)
    | W_41  W_42  W_43  W_44 |

V^T = | 1  W_21/W_11  W_31/W_11  W_41/W_11 |
      | 0  1          W_32/W_22  W_42/W_22 |
      | 0  0          1          W_43/W_33 |    (9.38)
      | 0  0          0          1         |
The values of the lower triangular matrix can be calculated using the recursive equations

W_{i1} = Φ(i, 1),   i = 1, 2, … P    (9.39)

and

W_{ij} = Φ(i, j) − Σ_{n=1}^{j−1} W_{in} W_{jn} / W_{nn}    (9.40)
*Equations (9.38), (9.40), (9.42), and (9.74) are adapted from T.E. Tremain, The Government Standard Linear Prediction Coding Algorithm, Speech Technology, April 1982.
After the W_{i1} values have been determined from Eq. (9.39), the values for W_{i2} are found from Eq. (9.40) for i = 2, … P. The calculation then proceeds to find the W_{i3} values, and so forth, until all the W elements have been determined.
The reflection coefficients can then be found using a second set of recursive equations. We have

K_1 = Φ(0, 1) / W_{11}    (9.41)

and

K_i = (1 / W_{ii}) [ Φ(0, i) − Σ_{j=1}^{i−1} K_j W_{ij} ]    (9.42)
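The factorization and reflection-coefficient recursion of Eqs. (9.39) through (9.42) can be sketched in Python. This is a minimal illustration, not the optimized fixed-point form used in a real vocoder; the function name and data layout are my own, and the loops use 0-based indices where the text uses 1-based ones.

```python
def reflection_from_covariance(phi, phi0, P):
    """Compute reflection coefficients from an autocovariance matrix.

    phi  : P x P list of lists, phi[i-1][j-1] holding Phi(i, j)
    phi0 : list of Phi(0, 1) ... Phi(0, P)
    Follows the W-matrix recursion of Eqs. (9.39)-(9.42)."""
    W = [[0.0] * P for _ in range(P)]
    K = [0.0] * P
    for i in range(P):
        for j in range(i + 1):
            # Eq. (9.39) when j == 0 (the sum is empty), Eq. (9.40) otherwise
            W[i][j] = phi[i][j] - sum(
                W[i][n] * W[j][n] / W[n][n] for n in range(j))
        # Eq. (9.41) for i == 0, Eq. (9.42) in general
        K[i] = (phi0[i] - sum(K[j] * W[i][j] for j in range(i))) / W[i][i]
    return K
```

Each K value can then be checked against unit magnitude for the stability test described above.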
The coefficient values can now be checked for stability by ensuring that none of the magnitudes are greater than unity. The reflection coefficients are then encoded for transmission. We will have additional comments on encoding subsequently. At the receiver, the data is decoded to recover the reflection coefficients. Unless a lattice filter is used, the reflection coefficients are then converted to the coefficients of the analysis filter ("a" values, which we set equal to the α values earlier in the analysis). The synthesis filter is then easily constructed to have the response
H(z) = 1 / A(z)    (9.43)

This can be implemented as shown in the lower diagram of Fig. 9.11.
The procedure for finding the "a" values is unusual in that the coefficients of all systems of order less than P are calculated successively until, finally, the values for a pth-order predictor are calculated. The coefficients for the lower-order systems are not used except as required to obtain the coefficients for the pth-order system. The recursive relationships are (see Rabiner and Schafer [36])
a_i^{(i)} = K_i   for i = 1, 2, … P    (9.44)

and for each value of i

a_j^{(i)} = a_j^{(i−1)} − K_i a_{i−j}^{(i−1)}   for j = 1, 2, … i − 1    (9.45)
The superscript here refers to the order of the system to which the coefficient belongs.
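The step-up recursion of Eqs. (9.44) and (9.45) can be sketched as follows (a minimal illustration; the function name is my own, and the list index m corresponds to the 1-based coefficient index j = m + 1):

```python
def alphas_from_reflection(K):
    """Step-up recursion, Eqs. (9.44)-(9.45): build the predictor
    ("a") coefficients from reflection coefficients K = [K1, ..., KP].
    Lower-order systems are computed successively and then discarded."""
    a = []                               # order-0 system: no coefficients
    for i, k in enumerate(K, start=1):   # build the order-i system
        prev = a                         # coefficients of order i-1
        # Eq. (9.45): a_j^(i) = a_j^(i-1) - K_i * a_(i-j)^(i-1)
        a = [prev[m] - k * prev[i - 2 - m] for m in range(i - 1)]
        a.append(k)                      # Eq. (9.44): a_i^(i) = K_i
    return a
```

For P = 2 this reproduces the worked example below: a_1^{(2)} = K_1 − K_2 K_1 and a_2^{(2)} = K_2.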
FIGURE 9.11 Relationship between prediction filter, analysis filter, and synthesis filter (prediction filter: s(n) → ŝ(n); analysis filter: s(n) → e(n); synthesis filter and its realization: e(n) → s(n))
Let us again consider the example for P = 4. We must calculate the "a" values for a first-order system first. Thus, using Eq. (9.44), for i = 1 we have

a_1^{(1)} = K_1
Since the first-order system has only one prediction coefficient, Eq. (9.45) is not required for i = 1. We now proceed to calculate the prediction coefficients for a second-order system (i = 2). Using Eq. (9.44),

a_2^{(2)} = K_2
Then, using Eq. (9.45),

a_j^{(2)} = a_j^{(1)} − K_2 a_{2−j}^{(1)}   for j = 1

or

a_1^{(2)} = a_1^{(1)} − K_2 a_1^{(1)}
Speech Processing 513
All the required values were calculated previously. This completes the prediction coefficients for the second-order system. Now, going on with i = 3 and using Eq. (9.44), we have
a_3^{(3)} = K_3
Then, using Eq. (9.45), for j = 2,

a_2^{(3)} = a_2^{(2)} − K_3 a_1^{(2)}
And for j = 1,

a_1^{(3)} = a_1^{(2)} − K_3 a_2^{(2)}
This completes the third-order system. Now we are in a position to calculate the values for i = P = 4, which corresponds to the coefficients we set out to find. Using Eq. (9.44)
a_4^{(4)} = K_4
Then, using Eq. (9.45), for values j = 3, 2, and 1, we have

a_3^{(4)} = a_3^{(3)} − K_4 a_1^{(3)}
a_2^{(4)} = a_2^{(3)} − K_4 a_2^{(3)}
a_1^{(4)} = a_1^{(3)} − K_4 a_3^{(3)}
For a tenth-order system, the procedure would, of course, continue until the values of the coefficients a^{(10)} were found.
Note that, in this case, the values of j can be taken in either ascending or descending order, since the required quantities are available from the previous, lower-order system calculations.
Autocorrelation Method
It was indicated previously that a simpler system, called the autocorrelation method, has been used to find the values of the prediction coefficients. To use this method, the speech samples must be windowed (e.g., with a Hamming window)
to ensure zero values at the boundary of the analysis window. Because of the windowing, some of the information is lost. Nevertheless, the system has been used successfully, particularly in commercial systems. Windowing the data results in an autocorrelation matrix, Φ(i, j), which is symmetrical about the diagonal and in which all the diagonal elements are also equal. Equal diagonal elements is the extra condition that allows simplification as compared with the covariance method described earlier.
Let us assume a segment of the speech waveform with s(n) = 0 outside 0 ≤ n ≤ N − 1. We obtain s(n) by windowing the speech samples x_s(n) such that

s(n) = x_s(n) w(n)    (9.46)
In the above equation, w(n) is the window function, which is zero for n < 0 and n > N - 1. We will again provide a prediction filter of the form
ŝ(n) = Σ_{i=1}^{P} α_i s(n−i)
The error function (prediction residual) is
e(n) = s(n) − ŝ(n)

e(n) = s(n) − Σ_{i=1}^{P} α_i s(n−i)    (9.47)
For a pth-order predictor, the error will be nonzero only over the interval 0 ≤ n ≤ N − 1 + P. Therefore, the mean square error can be expressed as
E = Σ_{n=0}^{N+P−1} e²(n)    (9.48)
Now define
Φ(i, j) = Σ_{n=0}^{N+P−1} s(n−i) s(n−j)    (9.49)
We can also express Φ as

Φ(i, j) = Σ_{n=0}^{N−1−(i−j)} s(n) s(n + i − j),   1 ≤ i ≤ P and 1 ≤ j ≤ P    (9.50)
since we are still multiplying all the values displaced by i − j from each other. Since Φ now has the characteristics of an autocorrelation function, we write

Φ(i, j) = R(τ) = Σ_{n=0}^{N−1−τ} s(n) s(n+τ)    (9.51)

where

τ = |i − j|,   i = 1, 2, … P and j = 1, 2, … P
Then, referring to Eq. (9.34) with the substitution of Eq. (9.51), we have
Σ_{j=1}^{P} a_j R(|i − j|) = R(i),   i = 1, 2, … P    (9.52)
If P = 4, this leads to the set of simultaneous equations
i = 1: R(0)a_1 + R(1)a_2 + R(2)a_3 + R(3)a_4 = R(1)
i = 2: R(1)a_1 + R(0)a_2 + R(1)a_3 + R(2)a_4 = R(2)
i = 3: R(2)a_1 + R(1)a_2 + R(0)a_3 + R(1)a_4 = R(3)
i = 4: R(3)a_1 + R(2)a_2 + R(1)a_3 + R(0)a_4 = R(4)
and the autocorrelation matrix has the form
R(τ) = | R(0)  R(1)  R(2)  R(3) |
       | R(1)  R(0)  R(1)  R(2) |
       | R(2)  R(1)  R(0)  R(1) |
       | R(3)  R(2)  R(1)  R(0) |
The P×P matrix of autocorrelation values is a Toeplitz matrix (i.e., it is symmetrical, and all elements along each diagonal are equal). As a result, the equations can be solved using Durbin's recursive solution,* also called the Levinson-Durbin method. To solve for the coefficients of order P, we first must solve for all the coefficients of order P − 1. To solve for the coefficients of order P − 1, we must solve for all the coefficients of order P − 2, and so on. As before, we will use a superscript to represent the order of the system being analyzed.
The process begins by finding the residual error or initial condition

E^{(0)} = R(0)    (9.53)
A set of four recursive equations is then used for i = 1, … P:

K_i = [ R(i) − Σ_{j=1}^{i−1} α_j^{(i−1)} R(i−j) ] / E^{(i−1)}    (9.54)

α_i^{(i)} = K_i    (9.55)

α_j^{(i)} = α_j^{(i−1)} − K_i α_{i−j}^{(i−1)},   j = 1, 2, … i − 1    (9.56)

E^{(i)} = (1 − K_i²) E^{(i−1)}    (9.57)
Since the subscripts and superscripts may be a bit confusing, we will consider an example for a second-order, P = 2, system. The calculations are made as follows:
E^{(0)} = R(0)

i = 1:  K_1 = R(1) / E^{(0)}
        α_1^{(1)} = K_1
        E^{(1)} = (1 − K_1²) E^{(0)}

i = 2:  K_2 = [ R(2) − α_1^{(1)} R(1) ] / E^{(1)}
        α_2^{(2)} = K_2
        α_1^{(2)} = α_1^{(1)} − K_2 α_1^{(1)}
*Material in this section adapted from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Chapter 8, p. 411, with permission. Copyright © 1978, Prentice Hall, Inc.
Note that E(i) is the predictor error for an ith order system. We note also that the equations can just as well be solved using normalized autocorrelation coefficients
r(j) = R(j) / R(0)
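The complete Levinson-Durbin recursion of Eqs. (9.53) through (9.57) can be sketched in Python (a minimal illustration; the function name and return convention are my own, and list index m holds the 1-based coefficient α_{m+1}):

```python
def levinson_durbin(R, P):
    """Solve the autocorrelation normal equations for a Pth-order
    predictor using the recursion of Eqs. (9.53)-(9.57).

    R : autocorrelation values [R(0), R(1), ..., R(P)]
    Returns (a, K, E): predictor coefficients, reflection
    coefficients, and the final prediction error energy E^(P)."""
    E = R[0]                                     # Eq. (9.53)
    a, K = [], []
    for i in range(1, P + 1):
        # Eq. (9.54): partial correlation for the new stage
        k = (R[i] - sum(a[m] * R[i - 1 - m] for m in range(i - 1))) / E
        K.append(k)
        # Eqs. (9.55)-(9.56): update the coefficient set
        a = [a[m] - k * a[i - 2 - m] for m in range(i - 1)] + [k]
        E *= (1.0 - k * k)                       # Eq. (9.57)
    return a, K, E
```

Feeding it normalized coefficients r(j) = R(j)/R(0) simply scales E and leaves the a and K values unchanged, which is why the normalized form is convenient for fixed-point machines.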
This may be convenient in a machine with fixed-point arithmetic. A few additional comments are in order with regard to the Levinson-Durbin recursion method. The intermediate quantities, represented by K and known as reflection coefficients, are also called PARCOR coefficients (partial correlation coefficients) because they can be expressed in the form of a normalized cross-correlation function. We will not be concerned with this form here.
Starting with Eqs. (9.15), (9.25), and (9.56), it is possible to show that the analysis filter can also be implemented using the reflection coefficients in a lattice filter (see Rabiner and Schafer [36]). This is shown in Fig. 9.12. From this figure, we may write the iterative expression for the lower portion of the lattice filter:
b_i(n) = b_{i−1}(n−1) − K_i e_{i−1}(n)    (9.58)
Here, b_i is called the backward prediction error sequence. By inspection, the iterative expression for the upper portion of the lattice filter is
e_i(n) = e_{i−1}(n) − K_i b_{i−1}(n−1)    (9.59)
FIGURE 9.12 Signal flow diagram for lattice implementation of analysis filter. Adapted from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, p. 415, with permission. Copyright © 1978, Prentice Hall, Inc.
Here, e_i is the forward prediction error sequence. The combination of Eqs. (9.58) and (9.59) forms the basis of a lattice analysis filter that gives the same result as the FIR analysis filter shown in Fig. 9.10, with the use of reflection coefficients (K) rather than the prediction coefficients (a).
It is now a simple matter to transform the lattice analysis filter to a lattice synthesis filter. To accomplish this, we solve Eq. (9.59) for e_{i−1}(n) and make the notational change e_i(n) = f_i(n). This gives

f_{i−1}(n) = f_i(n) + K_i b_{i−1}(n−1)    (9.60)
A block diagram of the lattice synthesis filter is shown in Fig. 9.13. In this filter, f_i is the forward signal and b_i is the backward signal. From the diagram, it can be seen that the following iterative expressions can be used to compute the output signal:
f_P(n) = u(n)    (9.61)

f_{i−1}(n) = f_i(n) + K_i b_{i−1}(n−1)   for i = P, … 2, 1    (9.62)

b_i(n) = b_{i−1}(n−1) − K_i f_{i−1}(n)   for i = 1, 2, … P    (9.63)
and b_0(n) = f_0(n).

The equations can be clarified by an example. In this case, we will use P = 3. We begin with n = 1 and assume the initial stored values to be zero.

n = 1  Start: f_3(1) = u(1)
i = 3:  f_2(1) = f_3(1)          note: b_2(0) = 0
        b_3(1) = −K_3 f_2(1)
i = 2:  f_1(1) = f_2(1)          note: b_1(0) = 0
        b_2(1) = −K_2 f_1(1)
i = 1:  f_0(1) = f_1(1)          note: b_0(0) = 0
        b_1(1) = −K_1 f_0(1)

This completes the calculations for the output at n = 1, and the output is given by

s(1) = f_0(1)
FIGURE 9.13 Signal flow diagram for lattice implementation of synthesis filter
We now calculate the output for n = 2.

n = 2  Start: f_3(2) = u(2), and the recursion of Eqs. (9.62) and (9.63) proceeds as before, now using the stored values b_2(1), b_1(1), and b_0(1) = f_0(1) from the previous step.
The reflection coefficients for the lattice filter are easier to determine than the α values for the synthesis filter based on the inversion of an FIR filter. We note, however, that twice as many multiplications are required to compute each point. For this reason, reflection coefficients are often converted to the coefficients of the FIR analysis filter in a vocoder.
The lattice filter gives exactly the same results as the FIR-based synthesis filter. The lattice filter can be viewed as a model of a lossless acoustic tube with P sections of equal length but different areas, A_m. Then, K_m is the reflection coefficient between sections m and m − 1.
It can be shown that

K_m = (Z_m − Z_{m−1}) / (Z_m + Z_{m−1})    (9.64)

where

Z_m = characteristic impedance of mth section
A_m = area of mth section

Since the characteristic impedance of a tube section is inversely proportional to its area, this leads to the relationship

A_{m−1} / A_m = (1 + K_m) / (1 − K_m)    (9.65)
This is the basis for a method of coding the reflection coefficients called log area ratios, which result from coding the logs of Eq. (9.65).
If the α values are known for the FIR-based synthesis filter, it is also possible to calculate the reflection coefficients. The recursive equations are as follows (see Rabiner and Schafer [36]). Note that the α values for the prediction filter are for a Pth-order system, so the givens are α_1^{(P)}, α_2^{(P)}, … α_P^{(P)}. For each i = P, P − 1, … 2, 1,
K_i = α_i^{(i)}    (9.66)

Also, for each i, let m = 1, 2, … i − 1 and find†

α_m^{(i−1)} = [ α_m^{(i)} + α_i^{(i)} α_{i−m}^{(i)} ] / (1 − K_i²)    (9.67)
After m = i − 1, reduce i by 1 and return to Eq. (9.66).

This concludes our discussion on finding the values of the coefficients of the synthesis filter. The values are normally computed every 22.5 ms in the analysis mode. At the receiving (synthesis) end of the link, the coefficient values are updated more frequently by interpolation. The interpolation is normally performed on the reflection coefficients prior to converting to the coefficients of the FIR-based synthesis filter. In some systems the number of interpolations per frame is variable, depending on how rapidly the parameters are changing. Interpolations may be performed as frequently as every 5 ms.
*Equations (9.64) and (9.65) reprinted with permission from C. Bristow, Electronic Speech Synthesis: Techniques, Technology, and Applications [40]. Copyright © 1984, McGraw-Hill.
†Equation (9.67) reprinted from L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, with permission. Copyright © 1978, Prentice Hall, Inc.
Gain Determination
Referring to Fig. 9.8, we see that it is necessary to adjust the gain parameter, G. This is required so that the energy in the response to Gv(n) equals the energy in the signal for both voiced and unvoiced sounds. The required value is given by
G² = R(0) − Σ_{i=1}^{P} a_i R(i)    (9.68)
for an excitation v(n) with unit energy. This can be determined from Eq. (9.18):
s(n) = Σ_{i=1}^{P} a_i s(n−i) + G v(n)    (9.69)
We assume here that the α values are substituted for the "a" values, and that the excitation, v(n), is a unit impulse, v(n) = δ(n). We also assume that the predicted signal, ŝ(n), has the same energy as the actual signal, s(n). Now, define the autocorrelation function to be
R(τ) = Σ_{n=0}^{∞} s(n) s(n+τ)    (9.70)
Substituting Eq. (9.69) for the first occurrence of s(n) in Eq. (9.70) gives
R(τ) = Σ_{n=0}^{∞} [ Σ_{i=1}^{P} α_i s(n−i) + G δ(n) ] s(n+τ)    (9.71)
For τ ≠ 0, this simplifies to

R(τ) = Σ_{i=1}^{P} α_i R(|τ − i|)    (9.72)
assuming s(n) = 0 for n < 0. For τ = 0, from Eq. (9.71), we have
R(0) = Σ_{n=0}^{∞} [ Σ_{i=1}^{P} α_i s(n−i) s(n) + δ(n) G s(n) ]

R(0) = Σ_{i=1}^{P} α_i R(i) + G s(0)
But from Eq. (9.69), we see that s(0) = G. Therefore,
R(0) = Σ_{i=1}^{P} α_i R(i) + G²    (9.73)

which is the criterion proposed in Eq. (9.68).
For unvoiced sounds, we assume that v(n) is a white noise process with the autocorrelation function R_u(0) = 1 and R_u(τ) = 0 for τ ≠ 0. This is the same autocorrelation function obtained for the assumed voiced excitation v(n) = δ(n). Consequently, the derivation gives the same result for white noise excitation.
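The gain criterion of Eq. (9.68) reduces to a one-line computation; a minimal sketch (function name my own) is:

```python
import math

def lpc_gain(R, a):
    """Gain parameter of Eq. (9.68): G^2 = R(0) - sum_i a_i R(i),
    for a unit-energy excitation.

    R : autocorrelation values [R(0), R(1), ...]
    a : predictor coefficients [a1, a2, ...]"""
    g_squared = R[0] - sum(ai * R[i + 1] for i, ai in enumerate(a))
    return math.sqrt(g_squared)
```

For the first-order case R = [1, 0.5] with a_1 = 0.5, this gives G² = 1 − 0.25 = 0.75, matching the residual energy E^{(1)} from the Levinson-Durbin recursion.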
Pitch Period
One of the most difficult aspects of LPC vocoder technology is making the correct determination of the pitch period. Many different algorithms have been tried, and most work well on strongly voiced sounds. The difficulty occurs in the transition regions from voiced to unvoiced sounds and vice versa. Many of the algorithms are discussed by Rabiner [41].
We will discuss two algorithms here. The first is the average magnitude difference function (AMDF) used by US Government and NATO LPC speech coders. The second is the autocorrelation method applied to the prediction residual.
The AMDF was developed during a period when multiplication was a particularly time-consuming operation to perform, and it avoids the multiplications required to compute the autocorrelation function. Basically, the algorithm compares the speech signal with a delayed version of itself and attempts to find the delay that results in the minimum average difference. Presumably, if the delay is a multiple of the pitch period, the differences will be small, and the shortest delay exhibiting a minimum should correspond to the pitch period of the voiced sound.
The AMDF function is given by Tremain [38] as

AMDF(τ) = Σ_{n=1}^{N} | s(n) − s(n+τ) |    (9.74)
For an 8 kHz sample rate, it is typical to perform the search for 60 values of τ from 20 to 156.
The pitch is smoothed over five frames. A confidence factor may also be developed, based on the minimum AMDF over several frames. This is used in the voiced/non-voiced decision discussed later.
The signal is normally lowpass filtered before applying it to the AMDF algorithm. A cutoff frequency on the order of 800 Hz works well. This passes all the information required to make a pitch determination and removes extraneous information. In some instances a low-order prediction filter is used to form a whitened error signal or residual, and the residual is used as the input to the pitch determination algorithm. Recall from Eq. (9.26) that the prediction residual corresponds to the excitation function.
A block diagram of a pitch determination system based on a second-order prediction residual is shown in Fig. 9.14.
Voicing Decision
Another required determination is the voiced/non-voiced decision. This decision is normally made based on several parameters, the most important of which are as follows:
1. The low-band energy, given by LBE = Σ|x(i)|, where x(i) are the lowpass filtered speech samples. The low-band energy is higher for voiced sounds.
2. The maximum-to-minimum ratio of the AMDF(τ) function. This also is higher for voiced sounds.
3. The rate of zero crossings. This is typically less than 500 per second for voiced sounds.
Several other parameters may be used to assist in making the voicing decision. One of these is the first reflection coefficient (see Kemp et al. [57]), given by
FIGURE 9.14 Block diagram of pitch preprocessing circuit (input x(i); outputs K1 and K2 to the voicing algorithm and the AMDF to the pitch algorithm)
RC_1 = Σ[ s(i) s(i−1) ] / Σ[ s²(i) ]    (9.75)
Another parameter that can be used is the weighted high-band energy, which is higher for non-voiced sounds. A measure of the high-band energy (see Kemp et al. [57]) is given by
QS = Σ| s(i) − s(i−1) | / Σ s²(i)    (9.76)
Two other parameters that have been used are the causal (backward) prediction gain, which tends to be higher for voiced sounds, and the corresponding non-causal (forward) prediction gain (see Kemp et al. [57]). The former is given by
ARB = [ Σ s(i) s(i−τ) ]² / [ Σ s²(i) · Σ s²(i−τ) ]    (9.77)

and the latter is given by

ARF = [ Σ s(i) s(i+τ) ]² / [ Σ s²(i) · Σ s²(i+τ) ]    (9.78)
A tentative voicing decision is made based on a weighted sum of the deviations of the individual parameters from their threshold values. The final voicing decisions are then made by smoothing the tentative classifications based on the previous and next determinations. A voicing decision is typically made every half-frame.
Another method has been proposed to make the voicing decision from the autocorrelation function of the prediction residual. The pitch period can also be determined from this function, which is given by
R(τ) = E[ e(n) e(n−τ) ] / ( 0.5 { E[e²(n)] + E[e²(n−τ)] } )    (9.79)
We recall that, to the extent that the vocal tract can be modeled by the prediction filter, the prediction residual corresponds to the excitation. Consequently, for voiced sounds a value of τ will be found for which R(τ) has a strong peak. Conversely, the absence of such a peak is indicative of a non-voiced sound.
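The normalized residual autocorrelation of Eq. (9.79) can be sketched as follows, with the expectations replaced by finite sums over a frame of N samples (an assumed implementation choice; the function name is my own):

```python
def residual_autocorr(e, tau, N):
    """Normalized autocorrelation of the prediction residual,
    Eq. (9.79).  Returns a value near 1 at the pitch lag of a
    strongly voiced residual and near 0 for a non-voiced one.

    e   : residual samples e(n)
    tau : candidate pitch lag
    N   : frame length over which the sums are taken"""
    num = sum(e[n] * e[n - tau] for n in range(tau, N))
    den = 0.5 * (sum(e[n] * e[n] for n in range(tau, N)) +
                 sum(e[n - tau] * e[n - tau] for n in range(tau, N)))
    return num / den
```

A strong peak of R(τ) over the searched lags supports a voiced classification, and the peak location itself provides the pitch period.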
The voicing decision is one of the most difficult parts of the vocoder algorithm. It is responsible for a significant part of the performance degradation in a noisy background environment.
Window Placement
As we have seen, a considerable number of subtle features may be incorporated in the LPC vocoder technique to enhance performance. Therefore, a few remarks are in order with regard to the placement of the analysis window within the frame for determination of the pitch period and reflection coefficients.
Human speech tends to be characterized by periods of relatively constant parameters interspersed with abrupt changes called onsets. The speech immediately following an onset often contains important information, and it is desirable to start the analysis window for the reflection coefficients at, or slightly after, an onset. If there is more than one onset in a frame, the window is placed at the first onset. For voiced sounds it is also desirable to place the pitch window synchronously with the previous window. This is done by placing the start of the window a multiple of the pitch period from the start of the previous window. The analysis window is often a different length from the frame, giving some latitude in the placement of the starting position within the frame. For a 22.5 ms frame at an 8 ks/s rate, a frame consists of 180 samples. It is not unusual to make the analysis window variable (say, from 90 to 156 samples) with a nominal value of 130 samples.
The voicing window, used to determine the pitch and the voicing decision, should be placed to avoid onsets. If there are no onsets, it is centered in the pitch window. If there is an onset, the voicing window is placed before it, if possible; otherwise, it is placed after the onset. If there are two onsets, the voicing window is placed between them, if possible.
The determination of onsets is not difficult. One method used is to examine the sample-by-sample prediction coefficient for a first-order linear predictor. If it changes abruptly over about 16 samples, an onset is present. The first-order predictor coefficient is given by

f(i) = E[ s(i) s(i−1) ] / E[ s²(i−1) ]    (9.80)
In this case, the expectation is formed by a running average of 63/64 of the old sum and 1/64 of the latest calculation. If the difference given by
d(i) = Σ_{j=0}^{7} f(i−j) − Σ_{j=8}^{15} f(i−j)    (9.81)
exceeds a threshold value (typically about 0.25), an onset may be present [57].

There are many other refinements and strategies used to make a successful LPC vocoder, such as the way the signal is preconditioned (e.g., to remove the dc bias), the way the parameters are interpolated, and so on. For additional details, the reader is referred in particular to Rabiner and Schafer [36], Tremain [38], Campbell et al. [39], Bristow [40], Rabiner et al. [41], Kang and Everett [42], Kang [43], Federal Standard 1015 [44], and Kemp et al. [57].
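The block-difference onset measure of Eq. (9.81) is simple enough to sketch directly (function name my own; the 0.25 threshold is the typical value quoted above):

```python
def onset_measure(f, i):
    """Onset measure of Eq. (9.81): the difference between sums of the
    first-order predictor coefficient f over two adjacent 8-sample
    blocks ending at index i.  An onset may be present when this
    exceeds a threshold (typically about 0.25)."""
    return (sum(f[i - j] for j in range(0, 8)) -
            sum(f[i - j] for j in range(8, 16)))
```

A constant coefficient track yields zero, while a step change in f between the two blocks produces a large value that trips the threshold.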
Synthesis
As indicated in Fig. 9.8, the voice signal is reconstructed at the receiver by exciting the synthesis filter with an impulse train or with random noise. For voiced sounds, the impulse train has a repetition period determined by the pitch algorithm of the analysis system. Unfortunately, the extreme regularity of the excitation causes the speech to sound machine-like and tense. As a result, it has been found desirable to introduce a bit of phase jitter into the waveform. One approach is to use a multiple-pulse sequence consisting of about 25 sample pulses. The entire sequence is repeated at the pitch period and lowpass filtered. In addition, highpass filtered random noise is added to the excitation. A randomly spaced doublet may also be added, resulting in a synthesis system such as that shown in Fig. 9.15. This results in a more natural sound.
FIGURE 9.15 Modified excitation signal for voiced sounds (pitch-period repetition of the pulse sequence, with added doublet, feeding the 10-pole synthesis filter)

PERFORMANCE EVALUATION
The evaluation or comparison of the many speech coding techniques, and the many variations of each, is a matter that requires careful consideration. It is no trivial matter to objectively determine which technique is better than another. The ear (in conjunction with the human mind) is amazingly adaptable, and investigators working with a particular coder often become quite adept at understanding its output even though an outsider would think its quality is poor. This learning progresses to the point where it eventually becomes difficult to make objective comparisons. Fortunately, several relatively objective measures have been developed that can be used to effectively compare systems.
The first of these, called the diagnostic rhyme test (DRT), deals with intelligibility.* The DRT uses a basic list of 192 words consisting of 96 rhyming pairs. Each pair is normally presented twice in the course of the testing session. The listener's task is to indicate which member of the pair was actually spoken. For example, when the stimulus word is "zeal" the options available to the listener are "zeal" and "seal." A correct response indicates that the speaker has conveyed a sufficient number of acoustic features with regard to the voicing attribute. Depending on the word pair involved, each item serves to test for one of the following elementary phonetic attributes:
1. voicing
2. nasality
3. sustention
4. sibilation
5. graveness
6. compactness
There are 16 word pairs to test each attribute. Each pair differs only in a single attribute in the initial phoneme. Typical examples for voicing are the word pairs "veal" and "feel," and "goat" and "coat." Nasality is tested by such word pairs as "meat" and "beat," and "news" and "dues." At 2,400 b/s, the LPC-10 algorithm can be expected to score from the high 80s to about 90 percent, depending on the individual speakers. This degrades rather rapidly to the mid 80s with a channel bit error rate of only 1 or 2 percent.
A second test that is often used to measure vocoder performance is the diagnostic acceptability measure (DAM). This test consists of 12 phonetically balanced six-syllable sentences from each talker. A listener hears the 12 sentences as a group. He then rates the overall quality on 21 separate rating scales. The ratings address such factors as speech quality, background noise, cracking, intelligibility, nasal sound, naturalness, and so forth. A 2,400 b/s LPC system typically scores in the lower 50s.
GOVERNMENT STANDARD ALGORITHM: LPC-10
A brief discussion will now be given of the US government standard LPC-10 coder. This example will serve to summarize several of the concepts developed earlier in the chapter and will also describe a typical coding arrangement for the parameters. Additional details on the vocoder are presented in the references, particularly Tremain [38], Federal Standard 1015 [44], and Kemp et al. [57].
A summary of the major characteristics is given in Table 9.1. A block diagram of the transmitter is shown in Fig. 9.16. The input bandwidth to the A/D converter is 100 Hz to 3,600 Hz. The signal is attenuated 23 dB above 4,000 Hz. A 12-bit A/D converter is used with a sample rate of 8 kHz. A digital preemphasis filter is
*This test is often scored by Dynastat, Inc., of Austin, Texas. The company maintains a stable crew of trained listeners.
TABLE 9.1 Summary of Parameters for LPC-10

Predictor order                 10
Sampling rate                   8 kHz
Bit rate                        2,400 bps
Frame                           22.5 ms (54 bits per frame)
Pitch algorithm                 AMDF (51 to 400 Hz)
Voicing                         Two decisions per frame
Matrix load                     Covariance
Reflection coefficient coding   Log area ratio for RC1 and RC2, linear for others
Error correction coding         Hamming codes on selected bits

Table based on Federal Std-1015, November 28, 1984.
provided to boost the high-frequency energy. The transfer function of this filter is H(z) = 1 − 0.9375z⁻¹. The bit allocation for the 54 bits in the LPC frame is listed in Table 9.2. The synchronization bit alternates between zero and one from frame to frame.
Pitch and voicing are encoded as a 7-bit field. The specific 7-bit codes assigned to each of the 60 pitch frequencies are defined in Federal Standard 1015 [44] and are not repeated here. For error protection, a non-voiced frame is encoded as seven zeros, and frames in voicing transition are encoded as seven ones. Obviously, since there are 128 decoding states, many received combinations can be allowed for error correction. As assigned by Federal Standard 1015, eight received characters are recognized as non-voiced. These are either seven zeros or a single one and six zeros. Thus, a single error is effectively corrected. Likewise, there are eight received characters interpreted as a transition frame. These contain all ones or a single zero and six ones.
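The error-tolerant interpretation of the 7-bit pitch/voicing field described above amounts to a Hamming-distance-1 decode around the two protected codewords; a minimal sketch (the function name and string return values are my own, and the pitch code table of Federal Standard 1015 is deliberately not reproduced):

```python
def decode_pitch_voicing(code):
    """Illustrative decode of the 7-bit pitch/voicing field.

    All-zeros, or any single-bit error from it, is non-voiced;
    all-ones, or any single-bit error from it, is a voicing
    transition; the remaining codes map to the 60 pitch values
    tabulated in Federal Standard 1015 (table omitted here)."""
    ones = bin(code & 0x7F).count("1")
    if ones <= 1:
        return "non-voiced"    # 0000000 plus the seven single-error patterns
    if ones >= 6:
        return "transition"    # 1111111 plus the seven single-error patterns
    return "voiced"            # look up the pitch code in the standard's table
```

Counting set bits makes the single-error correction explicit: eight received words decode as non-voiced and eight as a transition, exactly as the text describes.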
FIGURE 9.16 Block diagram of US government standard LPC speech coder (analog speech input; outputs are the RMS power level, pitch, voicing, and reflection coefficients)
TABLE 9.2 Bit Allocation for Vocoder

                    Bits Allocated per Frame
Parameter            Voiced    Non-voiced
Pitch and voicing       7          7
RMS amplitude           5          5
RC(1)                   5          5
RC(2)                   5          5
RC(3)                   5          5
RC(4)                   5          5
RC(5)                   4          0
RC(6)                   4          0
RC(7)                   4          0
RC(8)                   4          0
RC(9)                   3          0
RC(10)                  2          0
Error control           0         20
Synchronization         1          1
Unused                  0          1
Total                  54         54
The RMS amplitude is scaled from 512 possible levels (nine bits) to 32 levels (five bits) using a table. The table levels tend to be more coarsely quantized at the higher signal values.
The first two reflection coefficients are encoded with five bits, again using a table look-up. Table 9.3 gives the nonlinear convention for these quantities and is included to give the reader a feel for the types of tables used.
The reflection coefficient RC3 is encoded using a similar five-bit table. In this case, the conversion is linear except for limiting in the region |RC3| > 0.6. RC4 is also linearly encoded using five bits with limiting for values of |RC4| > 0.76. RC5 through RC8 are encoded with four bits for voiced frames only. Encoding is basically linear, but with slightly different saturation characteristics for each. Therefore, a separate table is used to encode each reflection coefficient. RC9 is encoded using a three-bit table, and RC10 using a two-bit table.
During non-voiced frames, since RC5 through RC10 are not transmitted, an additional 20 bits are available for error correction. A Hamming code is used for the four most significant bits of the RMS amplitude and the first four reflection coefficients. The error correction encoding convention is listed in Table 9.4.
At the receiver, the speech is synthesized using a tenth-order all-pole filter excited by pitch-synchronous pulses. A block diagram of the synthesizer (receiver) section is shown in Fig. 9.17. The incoming signal is first examined for frame sync using the characteristic that the synchronization bit toggles. Each bit in the serial stream is correlated with the bit delayed by 54 clocks, and running averages are
TABLE 9.3 Reflection Coefficients 1 and 2
Coefficient Range Binary Encoded Value Decoded Value
-.999 to -.984 -15 -.984
-.984 to -.969 -14 -.969
-.969 to -.953 -13 -.953
-.953 to -.938 -12 -.938
-.938 to -.906 -11 -.922
-.906 to -.875 -10 -.891
-.875 to -.828 - 9 -.844
-.828 to -.766 - 8 -.781
-.766 to -.688 - 7 -.719
-.687 to -.609 - 6 -.641
-.609 to -.531 - 5 -.563
-.531 to -.422 - 4 -.469
-.422 to -.313 - 3 -.359
-.312 to -.203 - 2 -.250
-.203 to -.094 - 1 -.141
-.094 to +.094 0 +.031
.094 to .203 1 .141
Identical to negative values but with positive signs.
.984 to .999 15 .984
TABLE 9.4 Pulse Amplitude Values for Voicing Excitation

Index  Amplitude    Index  Amplitude    Index  Amplitude
  1       249         15      -20         29       19
  2      -262         16      138         30      -15
  3       363         17      -62         31      -29
  4      -362         18     -315         32      -21
  5       100         19     -247         33      -18
  6       367         20      -78         34      -27
  7        79         21      -82         35      -31
  8        78         22     -123         36      -22
  9        10         23      -39         37      -12
 10      -277         24       65         38      -10
 11       -82         25       64         39      -10
 12       376         26       19         40       -4
 13       288         27       16
 14       -65         28       32
FIGURE 9.17 Block diagram of LPC receiver
maintained for 54 different bit positions. Since the sync bit is the only position that toggles, it soon correlates to a value of negative one, which establishes frame synchronization. Once frame sync is established, the serial bit stream can be converted to a 53-bit parallel pattern.
The excitation used in the coder is not a simple sequence of impulses at the pitch period. Rather, it is a sequence of levels that are generated at the 8 kHz rate. The waveform sequence is repeated at the pitch frequency. Table 9.4 lists these values. If the pitch period is 40, all values of the table are used in sequence and repeated for the duration of the 200 Hz pitch. If the pitch period is longer than 40, the excitation is followed by as many zeros as necessary to complete each pitch period. If the pitch period is shorter than 40, the remaining values are added to the values at the beginning of the table for the next pitch period. For example, if the pitch period is 38, excitation value 39 of the table would be added to value 1 of the next pitch period, and so forth. The table values are scaled by the RMS amplitude parameter.
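The table-lookup rules above can be sketched as follows. This is an illustrative assumption, not the standard's code: the function name is invented, and only the first ten entries of Table 9.4 are filled in here, with the remainder zeroed for brevity.

```python
# Hypothetical stand-in for Table 9.4: the real coder uses all 40 tabulated
# pulse-amplitude values; any 40-element list illustrates the mechanism.
TABLE = [249, -262, 363, -362, 100, 367, 79, 78, 10, -277] + [0] * 30

def excitation(pitch_period, n_samples, rms=1.0):
    """Pitch-synchronous excitation per the text: the 40 table values are
    zero-padded when the period exceeds 40 and wrapped into the start of
    the next period when it is shorter than 40."""
    out = [0.0] * n_samples
    carry = []                      # leftover table values from a short period
    pos = 0
    while pos < n_samples:
        period = [0.0] * pitch_period
        for i in range(min(40, pitch_period)):
            period[i] = TABLE[i]
        for i, v in enumerate(carry):   # wrap-around values (period < 40)
            period[i] += v
        carry = TABLE[pitch_period:] if pitch_period < 40 else []
        for i in range(min(pitch_period, n_samples - pos)):
            out[pos + i] = rms * period[i]
        pos += pitch_period
    return out
```

For a pitch period of 50, samples 40 through 49 of each period are zero; for a period of 38, table values 39 and 40 fold into the start of the following period.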
The reflection coefficients are interpolated and converted to prediction coefficients ("a" values) to produce the synthesis filter with a response 1/A(z). The filter shapes the excitation signal.
The pitch, RMS amplitude, and reflection coefficients are all interpolated. As part of the interpolation, the decoded parameters are converted from frame blocks to pitch periods. Pitch and log RMS are linearly interpolated. The beginning of RMS interpolation is delayed at the onset of voiced sounds. This increases the sharpness of the voice attacks.
The interpolation of the reflection coefficients is accomplished by forming the area ratios [see Eq. (9.65)] and performing a linear interpolation on the log of the area ratios. The reflection coefficients are interpolated once per pitch period.
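The log-area-ratio interpolation can be sketched as below. The sketch assumes the common area-ratio definition g = (1 - k)/(1 + k) for Eq. (9.65), which is not visible in this excerpt; the function names are likewise assumptions.

```python
import math

def rc_to_log_area(k):
    # Assumed form of Eq. (9.65): area ratio g = (1 - k) / (1 + k);
    # its log is the quantity that gets linearly interpolated.
    return math.log((1.0 - k) / (1.0 + k))

def log_area_to_rc(la):
    # Inverse mapping back to a reflection coefficient in (-1, 1)
    g = math.exp(la)
    return (1.0 - g) / (1.0 + g)

def interpolate_rc(k_prev, k_next, w):
    """Interpolate between two frames' reflection coefficients in the
    log-area-ratio domain; weight w in [0, 1] selects the pitch period's
    position between the frames."""
    la = (1.0 - w) * rc_to_log_area(k_prev) + w * rc_to_log_area(k_next)
    return log_area_to_rc(la)
```

Interpolating in this domain keeps every intermediate coefficient inside (-1, +1), so the synthesis filter stays stable at every pitch period.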
The reconstructed signal at the output of the synthesis filter is deemphasized before application to the D/A converter. This filter undoes the effect of the original preemphasis filter. The transfer function of the deemphasis filter is given by
H(z) = 1 / (1 - 0.75 z^-1)    (9.82)
This completes our consideration of the basic LPC algorithm. Before leaving the subject of speech coding we will discuss a more efficient method of encoding LPC parameters using line spectrum pairs (LSPs) to reduce the data rate to the 400 to 800 bps range. We will also discuss several methods of obtaining near-toll quality speech at 4,800 bps using code excited linear prediction (CELP).
VERY LOW DATA RATE SPEECH CODING*
The data rate communicated by the human voice is much less than 2,400 bps. This can be demonstrated easily by an example. Suppose a person is reading English text at a typical rate of 150 words per minute. This rate corresponds to an average of 12.5 characters per second. Each character can be coded with 5 bits, which results in a data rate of 62.5 bps. On the other hand, a number of subtle features are communicated that allow us, for example, to recognize the speaker's voice, his level of excitement, and so on. At any rate, if we are willing to settle for the basic information, a considerable reduction in the data rate should be possible over 2,400 bps LPC. This is indeed possible, and it has been found that more sophisticated coding of the prediction filter coefficients can result in a reduction of the data rate to the 400 to 800 bps range. Only a small additional degradation in quality from the 2,400 bps LPC is sustained, on the order of 1 to 2 percent on the diagnostic rhyme test scores (see Kang and Jewett [46]).
As we might suspect, a variety of techniques have been considered to reduce the data rate. One of the most successful solutions has involved the use of line spectrum pairs (LSPs). With this technique, the basic LPC vocoder is still used; LSPs are then applied to transmit the prediction filter coefficients. The transfer function of the analysis filter is represented by two functions that have their zeros on the unit circle. The movement of the zeros with time is particularly well behaved, especially the spacing of the pairs. These properties make it possible to transmit the characteristics of the analysis filter with fewer bits. Since the LSP conversion takes place after the LPC analysis, the majority of the algorithms used in the speech coder are unchanged. A block diagram of the resulting speech coder showing the LSP addition is given in Fig. 9.18. The standard LPC-10 speech coder is shown only as one block. However, it forms all the parameters discussed previously, including the pitch period, RMS level, prediction filter coefficients, and the voiced/non-voiced decision. The prediction filter coefficients are then represented in a different way using line spectrum pairs.

* A significant part of the material in this section is adapted from Kang and Jewett [46, 71] and Kang and Fransen [72], to which the reader is referred for additional details.
Coefficient Conversion (PC to LSP)
The transfer function of the LPC analysis filter, as given by Eq. (9.15), is

A(z) = 1 - Σ (n = 1 to N) a_n z^-n    (9.83)

where

a_n = the nth prediction coefficient
The corresponding LPC synthesis filter, as discussed previously, is given by 1/A(z). The prediction coefficients are readily obtainable using the autocovariance or autocorrelation methods discussed earlier. A serious limitation of the prediction filter expressed in this way is that an error in one coefficient affects the entire speech spectrum. However, the same function can also be expressed in terms of its zeros in the Z plane. If this is done, each pair of zeros corresponds to a resonant frequency and a bandwidth for the resonance. To develop the idea, let
FIGURE 9.18 Block diagram of 800 bps speech coder
us first note that Eq. (9.83) can also be expressed as the product of the terms given by

A(z) = Π (i = 1 to N/2) (1 - z_i z^-1)(1 - z_i* z^-1)    (9.84)

where

z_i = the ith root of the transfer function
The advantage of expressing the transfer function in this way is that each root primarily affects the transfer function only in the vicinity of that frequency.
We now decompose A(z) into two functions, formed from the function and its time-reversed (conjugate) counterpart:

P(z) = A(z) - z^-(N+1) A(z^-1)    (9.85)

and

Q(z) = A(z) + z^-(N+1) A(z^-1)    (9.86)

Then, the prediction filter can be reconstructed by

A(z) = (1/2) [P(z) + Q(z)]    (9.87)
The impulse response of P(z) is odd with respect to its midpoint. It has one real root at z = 1. The other zeros are on the unit circle at

z = e^(±j2πf_K T_s)    (9.88)

where

f_K = frequency of the zero
T_s = sample time

P(z) can be factored in the form

P(z) = (1 - z^-1) Π (K = 1 to N/2) [1 - e^(j2πf_K T_s) z^-1] [1 - e^(-j2πf_K T_s) z^-1]    (9.89)
Multiplying out the expression gives

P(z) = (1 - z^-1) Π (K = 1 to N/2) {1 - z^-1 [e^(j2πf_K T_s) + e^(-j2πf_K T_s)] + z^-2}

Using the Euler identity, this can be written in the form

P(z) = (1 - z^-1) Π (K = 1 to N/2) [1 - 2 z^-1 cos(2πf_K T_s) + z^-2]    (9.90)
The other expression, Q(z), has even symmetry about the midpoint. It has one real root at z = -1. The other roots are also on the unit circle, at

z = e^(±j2πf'_K T_s)    (9.91)

where the f'_K are the zero frequencies of Q(z). Consequently, we may write

Q(z) = (1 + z^-1) Π (K = 1 to N/2) [1 - 2 z^-1 cos(2πf'_K T_s) + z^-2]    (9.92)
It turns out that the roots of P(z) and Q(z) are interleaved (i.e., they alternate around the unit circle), as illustrated in Fig. 9.19.
The closer a pair of zeros of P(z) and Q(z) are to each other, the closer the corresponding zero of A(z) is to the unit circle, which indicates a sharper (higher-Q) resonance. These roots are referred to as line spectrum pairs.
There are several ways to solve for the roots of P(z) and Q(z), given the impulse response [the "a" values of the prediction filter A(z)]. Since the roots of P(z) and Q(z) lie on the unit circle, the task is simplified: zeros on the unit circle correspond to real frequencies. We can find the frequency response of P(z) or Q(z) by taking the discrete Fourier transform of the impulse response.
Since the location of the zeros must be known with some precision, a fairly large FFT is suggested. The impulse response is therefore appended with zeros to fill the FFT. A 256-point FFT may be appropriate. For LPC-10, there are 12 real input points, and the FFT input is then padded with 244 zero values. The frequency resolution for an 8 ks/s sample rate is 8000/256 = 31.25 Hz.
The amplitude of the FFT outputs is found by taking y(k) = √[I²(k) + Q²(k)] and finding the indexes (k) corresponding to the minimum values. The worst-case errors are then ±15.625 Hz. This is larger than one would like to allow. A parabolic interpolation has been used to refine the zero locations using the two adjacent
FIGURE 9.19 Zeros of P(z) and Q(z) in the Z plane (legend: zeros of P(z), zeros of Q(z), × zeros of A(z))
values along with the minimum. This is shown graphically in Fig. 9.20. It can be shown (see Problem 9-7) that the minimum using a parabolic fit is given by

x_min = x2 - (1/2) [(x2 - x1)² (y2 - y3) - (x2 - x3)² (y2 - y1)] / [(x2 - x1)(y2 - y3) - (x2 - x3)(y2 - y1)]    (9.93)

This expression can be simplified considerably for the present application. Let us designate the FFT output value giving the minimum magnitude as X(k). Then,
FIGURE 9.20 Parabolic interpolation of function zeros
M(k) = √(Re[X(k)]² + Im[X(k)]²). The adjacent value on the low side is then given as M(k - 1), and the value on the high side is M(k + 1). Substituting these values in Eq. (9.93) leads to the expression

MIN k = k + (1/2) [M(k - 1) - M(k + 1)] / [M(k - 1) - 2M(k) + M(k + 1)]    (9.94)

where

k = index of the value M(k) nearest the zero
We have made use of the condition here that x3 = x2 + 1 and x1 = x2 - 1 for the FFT outputs. The actual frequency of the zero is then given by
F_z = (MIN k) × F_s / N    (9.95)

where

F_s = sample frequency
N = size of the FFT used
We have not used the condition here that the value M(MIN k) = 0. Using this condition, we could have solved for the minimum with only two values of the FFT output. Unfortunately, that procedure results in an expression requiring calculation of a square root, which may be more cumbersome than using three values with Eq. (9.94).
The LSPs are plotted in Fig. 9.21 for a short segment of speech (the sentence shown). Several properties are interesting to note. First, we observe that there are periods of time when the LSPs are relatively constant, interspersed with abrupt changes. We note also that the line spectrum pairs tend to track each other. In addition, there is a significant amount of correlation between neighboring line spectrum pairs. These tendencies present a variety of opportunities for efficient coding of the LSPs for transmission. One obvious procedure is to code only one frequency of each LSP absolutely and transmit a differential value to determine the position of the other member of the pair. The differential value can then be transmitted with fewer bits. Other schemes can be devised to transmit changes in the parameters and the positions where the changes take place. It should be noted that clever encoding schemes often carry a price in terms of greater susceptibility to errors.
FIGURE 9.21 Typical LSP trajectories and spectrogram of original speech ("Here is an easy way."): (A) spectrogram; (B) LSP trajectory. Reprinted with permission from G. S. Kang and W. M. Jewett, NRL Report No. 9318, December 1986, Naval Research Laboratory, Washington, DC 20375-5000.
Another method for encoding LSPs is to use vector quantization. This method involves comparing the LSP trajectories with prestored templates and transmitting the number of the most similar template. Vector quantization is discussed in more detail in the following section.
At the receiver, the LSPs are converted back to coefficients of the prediction filter. This is done by substituting the values for P(z) and Q(z) [see Eqs. (9.90), (9.91), and (9.92)] into Eq. (9.87). The resulting expression is
A(z) = (1/2) (1 - z^-1) Π (i = 1 to N/2) [1 - 2 z^-1 cos(2πf_i T) + z^-2]
     + (1/2) (1 + z^-1) Π (i = 1 to N/2) [1 - 2 z^-1 cos(2πf'_i T) + z^-2]    (9.96)

where the f_i are the LSP frequencies of P(z) and the f'_i those of Q(z).
Multiplying out this expression and collecting like powers of z gives the transfer function for the prediction filter. The multiplication is rather involved because of the number of terms. Nevertheless, the procedure is straightforward.
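The multiplication is mechanical polynomial arithmetic, which the following sketch carries out (function names are assumptions; frequencies are in Hz at an assumed 8 kHz sample rate):

```python
import math

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists in z^-1."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def lsp_to_a(p_freqs, q_freqs, ts=1.0 / 8000.0):
    """Rebuild the A(z) coefficients per Eq. (9.96) from the LSP frequencies
    of P(z) (p_freqs) and Q(z) (q_freqs)."""
    P = [1.0, -1.0]                       # real root of P(z) at z = 1
    for f in p_freqs:
        P = poly_mul(P, [1.0, -2.0 * math.cos(2 * math.pi * f * ts), 1.0])
    Q = [1.0, 1.0]                        # real root of Q(z) at z = -1
    for f in q_freqs:
        Q = poly_mul(Q, [1.0, -2.0 * math.cos(2 * math.pi * f * ts), 1.0])
    # Eq. (9.87): average, then drop the z^-(N+1) coefficient, which cancels
    return [(a + b) / 2 for a, b in zip(P, Q)][:-1]
```

For N = 2 with one P frequency at 1000 Hz and one Q frequency at 2000 Hz, this yields A(z) = 1 - (√2/2) z^-1 + ((2 + √2)/2) z^-2.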
Vector Quantization
Another approach that has been studied for encoding LSPs is to use a set of 4,096 templates: 3,840 for voiced sounds and 256 for unvoiced sounds (see Kang and Jewett [46]). The LSPs for a given frame are compared with all the templates, and the number of the closest template is transmitted. The distance measure for each template is formed from the frequency error of each line spectrum relative to those in the template. Here, it is taken into account that the ear is more sensitive to errors in the low-frequency region than in the higher portion of the spectrum. Hence, the error sensitivity decreases linearly between 100 Hz and 1 kHz, and logarithmically between 1 kHz and 4 kHz.
Obviously, great care must be taken in forming a good set of templates. For the cited reference, the templates were formed from sentences spoken by more than 50 speakers. In this example, 800 bps digital speech was obtained with the bit allocation for each group of three frames as follows:
Synchronization          1
Pitch Period             5
Amplitude Information    4 + 4 + 4
Filter Parameters        12 + 12 + 12

This results in 54 bits per 3 frames, or 800 bps.
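The 800 bps figure can be checked arithmetically. The 22.5 ms frame duration used below is the standard LPC-10 frame (180 samples at 8 kHz); it is assumed here, since the text quotes only the bits per three frames:

```python
# Sum the bit allocation: sync, pitch, RMS amplitude, filter parameters
bits_per_3_frames = 1 + 5 + (4 + 4 + 4) + (12 + 12 + 12)

# Assumed LPC-10 frame of 180 samples at 8 kHz = 22.5 ms
frame_duration = 180 / 8000.0
bit_rate = bits_per_3_frames / (3 * frame_duration)   # bits per second
```

The same frame duration with the 54-bit LPC-10 frame gives the familiar 2,400 bps rate, so the two figures are mutually consistent.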
A considerable amount of work is presently being done to further reduce the bit rate of LSP coders, and it is anticipated that rates in the 300 bps range can be achieved.
CODE EXCITED LINEAR PREDICTION CODER (CELP)
The potential applications for low data rate digital voice are enormous but, to date, public acceptance of linear predictive coders at 2,400 bps or less has been very limited. This is because of the somewhat unnatural, machine-like characteristics
of the sound. It is therefore necessary to provide an algorithm that produces at least toll-quality speech, or it will not be generally accepted. Speech of this type is normally provided by 64 kbps pulse code modulation (PCM), 32 kbps adaptive differential pulse code modulation (ADPCM), or 32 kbps continuously variable slope delta (CVSD) modulation. Excellent quality is also obtained at 16 kbps using adaptive predictive coding with hybrid quantization (APC-HQ). The government standard APC-SQ operates at 9.6 kbps with good results.
Until recently, it has not been possible to reproduce toll-quality speech at a data rate below approximately 9.6 kbps. The code excited linear prediction (CELP) coder has made it possible to use a 4,800 bps rate and still provide excellent quality speech reproduction. The CELP algorithm has been reported to score 93 on the DRT test and 68 on the DAM test (see Campbell et al. [20, 49]). A 1 percent error rate degrades the DRT to about 90.
A brief explanation of the technique is given below.* As the name implies, linear predictive techniques are used. However, the excitation is generated in a manner that differs greatly from the 2,400 bps LPC-10 discussed previously. In the CELP coder, a stochastic codebook stores a fairly large number of short excitation waveforms, on the order of 128 to 512. The speech encoder determines which of the stored waveforms best serves as the excitation for the analysis period and transmits the number of that waveform along with the prediction filter coefficients, pitch information, and the like. The prediction filter coefficients are converted to line spectrum pairs (LSPs), as discussed earlier, to provide efficient and channel-error-resilient coding.
First, the LPC parameters are determined using the autocorrelation method. The CELP coder then passes each of the adaptive and stochastic codebook excitation waveforms through the LPC synthesis filter and compares the output with the actual speech signal to determine which of the waveforms produces the best perceptual replica of the actual waveform for the analysis period of interest.
The manner in which the adaptive codebook ("pitch") information is used to modify the excitation waveforms is somewhat complex, and it requires further explanation. This will be addressed in more detail later. The CELP coder (as proposed in Federal Standard 1016, 31 August 1989) uses a 30 ms analysis window and an 8 kHz sample rate with a 12-bit A/D converter. The autocorrelation method of LPC analysis is used with a 30 ms Hamming window function. A tenth-order synthesis filter is used, as in the LPC-10 coder discussed earlier. The LPC coefficients are computed once per frame; however, the codebook search for the excitation and the pitch analysis are made at a 7.5 ms subframe rate. The spectrum is coded using a total of 34 bits for the 10 LSPs out of 144 total bits for each 30 ms frame. The number of bits for the LPC filter can be minimized because each LSP tends to occur in a fairly limited frequency range, and only three or four bits are
"For more information, see Campbell et al. [39,49], Kemp et al. [50], Tremain et al. [51], and Federal Standard 1016 [52].
required per LSP. The LSPs are linearly interpolated over the frame, as shown in Table 9.5. The past and future spectra are centered at the beginning and ending of the present frame's excitation parameters, respectively. This requires that the LPC parameters be computed half a frame (two subframes) ahead of the excitation parameters.
The codebook excitation consists of the sum of two parts, the stochastic codebook and the adaptive codebook, as shown in Fig. 9.22. The stochastic codebook consists of 512 sequences of 60 samples. Each sequence consists of the ternary values -1, 0, and +1. Each sequence differs from the adjacent sequence in only two places: it is a shift of two samples from the previous sequence with two new values appended. This has been found to give as subjectively good a result as independent random values. Moreover, it simplifies the search calculations by allowing end-point correction algorithms. A stochastic codebook gain, G1, is determined along with the optimum stochastic codebook sequence for the subframe being synthesized. As shown, the adaptive codebook is somewhat of a simplification. It is formed using the values of a memory or shift register holding past values of the filter excitation.
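The overlapped 512 × 60 ternary layout can be sketched as below. The shape follows the text; the sparsity of the random values and the seed are illustrative assumptions, not the standard's actual center-clipped sequence:

```python
import random

def make_stochastic_codebook(n_words=512, length=60, shift=2, seed=1):
    """Overlapped ternary codebook: draw one long stream of -1/0/+1 values
    and take each codeword as a two-sample shift of the previous one, so
    adjacent codewords share 58 of their 60 samples."""
    rng = random.Random(seed)
    # Ternary values; the 80% zero probability here is an assumption
    values = [-1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    stream = [rng.choice(values)
              for _ in range(length + shift * (n_words - 1))]
    return [stream[i * shift:i * shift + length] for i in range(n_words)]
```

Because codeword i + 1 is codeword i shifted by two samples, a search routine can update its filtered-codeword correlations incrementally (the end-point correction mentioned above) instead of refiltering all 60 samples per entry.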
Initially, the memory is filled with zeros. During the first subframe of 7.5 ms, the excitation values, ex(n), are stored in a memory, as shown in Fig. 9.23. We use the convention that the present sample has an index of zero, the value stored before that an index of negative one, the value preceding that negative two, and so on. Therefore, the index number represents a delay from the present. The adaptive codebook is calculated using the values in memory as they exist at the beginning of the subframe. Thus, index -1 is the last value stored in the previous subframe. To understand how the adaptive codebook values are calculated, we first consider the case where the pitch period is less than 60 (60 being the number of samples in a 7.5 ms subframe). Table 9.6 shows the way the values stored in the memory are used or grouped to produce the adaptive codebook output during the second subframe. The table is formed from the data points as they were in the memory at the end of the previous subframe.
Consider the case when the pitch period is 20 samples long. This corresponds to the bottom row in the adaptive codebook. The first entry in this codebook position corresponds to the value in the memory delayed by 20 samples. The next codebook value (2) corresponds to the value that was delayed 19 samples, and so on, until 20 samples are used. The same 20 samples are then repeated twice to
TABLE 9.5 Interpolation Weighting for LSPs

Subframe    Past Spectrum    Future Spectrum
1           7/8              1/8
2           5/8              3/8
3           3/8              5/8
4           1/8              7/8
FIGURE 9.22 CELP synthesizer
FIGURE 9.23 Adaptive codebook storage after first subframe
form the 60-sample subframe. The adaptive codebook row is used starting with the left-most value and working toward the right.
Now suppose a pitch period of 21 had been chosen, corresponding to the table index of 3. In this table position, the past excitations, starting 21 samples delayed, are used. After 21 samples, the values are repeated twice as before, except that on the second repetition only words -21 through -4 are needed to fill out the 60-sample codeword. This is consistent, since when the table is loaded for the next subframe, the first entry will be delayed 21 samples from the excitation at that time, which corresponds to the next position of the periodic waveform if the period has not changed. It corresponds to a point on the waveform three periods later, however.
Now, consider the case of a pitch period longer than the subframe (e.g., a delay of 147), which corresponds to the top row in Table 9.6. In this case, the first value stored in the table is delayed 147 samples. The 60th value is delayed 88 samples.
Obviously, it is possible to produce the adaptive codebook from a single series of 147 memory locations by clever manipulation of pointers. Nevertheless, it may be easier to think of the operations performed as calculating all the entries in the adaptive codebook at the beginning of each subframe.
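The grouping in Table 9.6 amounts to periodically extending the most recent "delay" excitation samples to fill the subframe. A minimal sketch for integer delays (function name and the zero-history handling are assumptions):

```python
def adaptive_codeword(past_ex, delay, subframe=60):
    """Build one integer-delay adaptive-codebook entry from past excitation.

    past_ex[-1] is the most recent sample (index -1 in the text's
    convention). For delay < subframe, the delayed segment is repeated to
    fill the 60-sample codeword, as in Table 9.6."""
    if delay <= len(past_ex):
        segment = past_ex[-delay:]      # samples at delays -delay ... -1
    else:
        segment = [0] * delay           # not yet enough history: zeros
    word = []
    while len(word) < subframe:
        word.extend(segment)            # periodic extension of the segment
    return word[:subframe]
```

For delay = 20 the segment repeats three times; for delay = 147 only the first 60 samples of the delayed history (delays -147 through -88) are used, matching the top row of Table 9.6.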
The 128 integer delay codewords are formed by repeating samples from the adaptive codebook as indicated above. The 128 noninteger delay codewords are
TABLE 9.6 Adaptive Codebook Structure

Index   Delay   Adaptive Codebook Numbers
255     147     -147, -146, -145, ... -89, -88
131      61     -61, -60, -59, -58, ... -3, -2
128      60     -60, -59, -58, -57, ... -2, -1
3        21     -21, -20, ... -1, -21, -20, ... -1, -21, -20, ... -4
0        20     -20, -19, ... -1, -20, -19, ... -1, -20, -19, ... -1
formed by interpolation of the adaptive codebook's samples and are assigned index numbers between the integer values. The 256 delays specified by the proposed Federal Standard 1016 are

Delay Range       Resolution
20 to 25 2/3      1/3
26 to 33 3/4      1/4
34 to 79 2/3      1/3
80 to 147         1
We note in Fig. 9.22 that a pitch gain, G2, is supplied with the adaptive codebook position to be used. The value of the pitch gain, which may be between -1 and +2.0, determines the strength of the periodic component in the excitation, as well as how quickly a periodic component builds up or dies away.
The best stochastic and adaptive codebook values for each subframe are determined by passing each entry through the LPC filter and comparing the results with the actual speech. This is shown in Fig. 9.24. The weighting filter is used to emphasize those areas of the spectrum to which the human ear is most sensitive. During each subframe, the procedure is first to try the adaptive codebook entries and the pitch gain, G2. After these are determined, all values in the stochastic codebook are tried, and the gain, G1, is optimized.
FIGURE 9.24 CELP analyzer
All values in the adaptive codebook are coded on odd subframes. During even subframes, the delay is delta coded and can take on values within -31 to +32 indices of the delay used in the previous subframe.
The analog audio input to the CELP coder is bandpass filtered to a frequency range from 100 to 3,800 Hz. A Hamming error correction code is used on some of the most vulnerable bits prior to encoding. The bit allocation for the various parameters is given in Table 9.7.
The use of the stochastic codebook in connection with the adaptive codebook allows great flexibility in determining the excitation for the synthesis filter and is primarily responsible for the improved performance as compared with the LPC vocoder described previously.
The improvements are not without cost, however, and the amount of computation, which depends on the size of the codebook searches, is roughly one order of magnitude higher than for an impulse-excited vocoder. As higher-speed signal processors are developed, this will become less of a factor.
There are additional subtle features, some of which may be proprietary to individual manufacturers, in the CELP algorithm. The interested reader is referred to the cited references for a more exact and detailed explanation of the coder.
TABLE 9.7 CELP Bit Allocation

                      Subframe
Parameter          1    2    3    4    Frame
LSP1                                   3
LSP2                                   4
LSP3                                   4
LSP4                                   4
LSP5                                   4
LSP6                                   3
LSP7                                   3
LSP8                                   3
LSP9                                   3
LSP10                                  3
Pitch delay        8    6    8    6    28
Pitch gain         5    5    5    5    20
Codebook index     9    9    9    9    36
Codebook gain      5    5    5    5    20
Future expansion                       1
Hamming parity                         4
Synchronization                        1
Total                                  144
PROBLEMS
9-1 a) Derive the frequency response of a DPCM modulator when the prediction filter is a one-sample delay (see Fig. 9.3).
b) Sketch the frequency response normalized to the sample rate.
9-2 Using the Levinson-Durbin method, write the expressions for the prediction filter coefficients (a values) for a third-order predictor. The autocorrelation matrix for the frame under consideration is

R(τ) = | 1.00  0.50  0.25 |
       | 0.50  1.00  0.50 |
       | 0.25  0.50  1.00 |

Also, R(3) = 0.1.
9-3 Draw the schematic diagram of an analysis filter using the a values for the results found in Problem 9-2. Calculate the impulse response for the first four output samples (see Fig. 9.11).
9-4 a) Draw the schematic diagram of a synthesis filter using the a values found in Problem 9-2.
b) Write the difference equations for the output signal. c) Calculate the impulse response for the first four output samples.
9-5 A method of converting the prediction filter coefficients (a) to reflection coefficients (K) was presented in Eqs. (9.66) and (9.67). Using these recursion formulas, convert the a values from Problem 9-2 to reflection coefficients. Check the results against the reflection coefficients found in Problem 9-2 as a by-product of the Levinson-Durbin method.
9-6 a) Draw the schematic diagram of a lattice synthesis filter using the reflection coefficients found in Problem 9-2.
b) List the iterative equations for the filter.
c) Write a computer program to calculate the impulse response of the filter, and list the results for the first four samples. Compare the results with the impulse response calculated for the synthesis filter in Problem 9-4.
9-7 Given three arbitrary points on a parabola, (x1, y1), (x2, y2), and (x3, y3), show that the x coordinate of the minimum (or maximum) is given by the expression in Eq. (9.93).
9-8 The impulse responses of the functions used to solve for line spectrum pairs are known to contain zeros on the unit circle in the Z plane. This indicates that there are real frequencies for which the FFT of the impulse response has zero values. The FFT output is examined for the value of K giving a minimum response. The amplitudes of the adjacent points are also noted, so that

K - 1 → A1
K → A2
K + 1 → A3
We have A1 > A2 and A3 > A2. In this case, the expression derived in Problem 9-7 can be used to find the exact index of the zero. Show that the expression can be simplified to the form given in Eq. (9.94).