Formant estimation algorithm based on pole focusing offering improved noise tolerance and feature resolution

G. Duncan, PhD, MIEE

Prof. M.A. Jack, PhD, MIEE

Indexing terms: Signal processing, Codes and decoding, Noise and interference, Algorithms

Abstract: The ability to measure the centre frequencies of areas of resonance (formants) in the short-time power spectrum of speech is of paramount importance in the recognition of voiced speech sounds in a feature-extraction-based continuous speech recognition system. Additionally, the provision of a tracking algorithm, by which the loci of formants with respect to time can be estimated, yields formant transition information which helps identify phonetic features which are of short duration. Noise robustness in formant estimation is an essential attribute for recognition systems which are used in the office environment and in military applications. The novel technique presented in the paper provides a noise-robust method of extracting formant centre-frequency information from the short-time speech spectrum, and consequently improves the signal/noise performance of the associated formant tracking algorithm. Formant estimation is based on modelling the vocal tract frequency response using linear prediction coding (LPC) techniques. However, the estimation of formant centre frequency in any given analysis frame is greatly improved by employing off-axis spectral estimation coupled with a progressive increase in vocal tract model order, which together provide vocal tract pole enhancement. Finally, the use of a formant-weighting filter function applied within each frame aids in conferring high noise immunity to the estimation process. The pole focusing technique is shown to offer an improvement of at least 14 dB in signal/noise immunity as a formant frequency estimator over conventional LPC-based spectral estimation. In its application to formant tracking, it is shown that the technique also offers improved separation of formants which tend to merge, besides offering a general improvement in the provision of formant detail, in particular with regard to weak nasal formants. An additional advantage of the technique is its relative insensitivity to choice of vocal tract model order, which produces an inherently speaker-independent formant estimation algorithm.

1 Introduction

Paper 5772F (E10, E14), received 19th May 1987

The authors are with the Centre for Speech Technology Research, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, United Kingdom

The measurement of areas of spectral resonance (formants) in the spectrum of speech is essential in the identification of voiced speech sounds in recognition systems employing knowledge-based feature extraction and interpretation. In particular, measurement of formant centre frequency is crucial in identification of vowels. The ability to track formant movement with respect to time also yields important information in the identification of pre- and postvocalic consonants, and the second formant (F2) has generally been identified as providing the most useful information in this respect [1-5]. Relative formant amplitude, and to some extent formant bandwidth, can be regarded as primarily providing information relating to voice quality, but both are useful as secondary parameters in the identification of voiced phonetic features.

Several techniques are available for formant estimation which variously model either the speech production process [6-10] or the auditory perception process [11-13]. Linear prediction coding (LPC) techniques [6, 14] in particular offer a readily implementable processing paradigm for real-time analysis of the speech waveform, but characteristically offer only moderate signal/noise immunity.

The novel pole focusing technique presented in this paper employs spectral estimation using LPC techniques, but offers a substantial improvement both in the signal/noise performance of formant frequency estimation and in the extraction of formant detail from the short-time speech signal segment. The basis of the technique is essentially to enhance the detection of singularities (poles) in the estimated transfer function of the vocal tract inherent in the LPC-based model.

Standard LPC-based formant estimation algorithms suffer from severe restrictions on the length of inverse filter which can be used to model the inverse vocal tract impulse response. Models which are of too low an order tend to provide poor spectral separation of formants in the frequency domain, whereas too high an order causes deterioration of the noise immunity of the spectral estimator by creating a profusion of candidate peaks in the estimated vocal tract frequency response. The new pole focusing technique presented here is the antithesis of standard LPC analysis, in that high noise immunity in formant estimation is specifically conferred by the progressive overestimation of model order in any single analysis frame, coupled with the use of off-axis spectral estimation, i.e. calculation of the frequency response of the vocal tract within the unit circle of the digital z-transform plane, as opposed to relying solely on unit-circle spectra. Additionally, the pole focusing technique does not require the signal to be preconditioned by the application of a digital pre-emphasis filter prior to analysis. An averaging filter function is applied as the final step in each analysis frame to yield noise-robust estimates of formant centre frequency.

IEE PROCEEDINGS, Vol. 135, Pt. F, No. 1, FEBRUARY 1988

The formant estimation properties of the pole focusing technique are explored by the use of both synthetic speech-like signals and real speech. The study of spectrum-based formant tracking in itself demands consideration of two sets of properties: those of the formant estimator operating within a single analysis frame of time data, and those of the spectral tracking algorithm operating across several frames of extracted formant data. The latter relies heavily on a knowledge of the dynamics of the speech mechanism, but the design and comparison of such algorithms is not considered in this paper. However, the examination of the performance of the spectral estimation component of the formant tracking system provides a reasonable measure of the likely performance of the overall system, particularly in its ability to maintain the integrity of any formant track and to maintain separation of tracks which may tend to merge. To assess the noise robustness of the pole focusing technique in comparison to standard LPC spectral estimators, a synthetic speech-like signal with well controlled parameters is used. Much of the theory employed in the design of feature-extraction-based speech recognition systems relies heavily on the characterisation of the human speech production and perceptual processes. Therefore, the signal/noise performance criteria employed in this study are based in part on difference limen related to the human perception of fine variations in the spectral structure of voiced speech [15, 16]. It is shown that the pole focusing technique provides at least a 14 dB improvement in signal/noise immunity compared to standard LPC analysis.

Assessment of the technique within a formant tracking system is provided by a comparison with an advanced LPC technique [7], which also employs off-axis spectral estimation but uses a fixed model order, with both techniques sharing a common line-tracking paradigm. It is shown that the pole focusing technique has a superior performance in terms of ability to separate formants which merge together. Finally, the ability of the pole focusing technique to provide superior estimation and consistency of formant detail in relation to the formant structure of the signal is exemplified by application to nasal speech and connected voiced speech.

2 Autocorrelation-based linear predictive coding analysis

The digitised speech signal can be considered as an uncorrelated pseudorandom sequence in the long term, over periods of a few hundred milliseconds and longer. A short-term sequence of speech of the order of 20-40 ms, on the other hand, can be considered as a correlated set of sampled data points, since the articulatory setting, and hence the filtering action of the vocal tract, can be considered essentially constant over this time period [17]. Thus, the amplitude value of any given sampled data point on the speech waveform, $s_n$, can be predicted approximately by a weighted linear sum of several, say $p$, past data points; i.e.

$$\hat{s}_n = -\sum_{k=1}^{p} a_k s_{n-k} \tag{1}$$

where $\hat{s}_n$ is the predicted value of the actual data point, $s_n$. Thus, for an optimum sequence of prediction error coefficients $a_k$, then

$$s_n - \hat{s}_n = e_n \tag{2}$$

where $e_n^2 \to 0$.

The coefficients themselves can be found by applying the condition of least mean squares, which is a consequence of the applied constraint that, during the time period (consisting of $N$ sampled data points) over which the vocal tract is considered to have a fixed configuration, the total error power $\sigma^2$ associated with the choice of prediction error coefficients must be minimal, where

$$\sigma^2 = \sum_{n=0}^{N+p-1} e_n^2 = \sum_{n=0}^{N+p-1} \left( s_n + \sum_{k=1}^{p} a_k s_{n-k} \right)^2 \tag{3}$$

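Applying the usual minimum error criterion for multiple regression, and differentiating eqn. 3 with respect to each predictor coefficient, gives the system of $p$ equations:

$$\frac{\partial \sigma^2}{\partial a_i} = 2 \sum_{n=0}^{N+p-1} \left( s_n + \sum_{k=1}^{p} a_k s_{n-k} \right) s_{n-i} = 0, \quad 1 \le i \le p \tag{4}$$

Therefore

$$\sum_{k=1}^{p} a_k \sum_{n=0}^{N+p-1} s_{n-k} s_{n-i} = -\sum_{n=0}^{N+p-1} s_n s_{n-i}, \quad 1 \le i \le p \tag{5}$$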

For a short-time sampled data series with a tapered window function, such that the sampled data values taper smoothly to zero amplitude towards the window edges at $n = 0$ and $n = N - 1$, i.e. $s_n = 0$ for $n < 0$ and $n \ge N$, the $j$th autocorrelation coefficient is given by

$$R(j) = \sum_{n=0}^{N-1-j} s_n s_{n+j} \tag{6}$$

and since, for a real-valued data sequence, $R(j) = R(-j)$, eqn. 5 can be written as

$$\sum_{k=1}^{p} a_k R(|i-k|) = -R(i), \quad 1 \le i \le p \tag{7}$$

where the set of $p$ equations is referred to as the Yule-Walker normal equations [18, 19].

It is possible to solve for each value of $a_k$ by matrix inversion techniques, given the above system of equations. A more tractable solution is, however, given by explicitly expanding eqn. 3 and substituting the results of eqn. 5, to give

$$\sigma^2 = \sum_{n=0}^{N-1} s_n^2 + \sum_{k=1}^{p} a_k \sum_{n=k}^{N-1} s_n s_{n-k} \tag{8}$$

and, by simply extending $k$ to cover the range $0 \le k \le p$, then

$$\sum_{k=0}^{p} a_k \sum_{n=k}^{N-1} s_n s_{n-k} = \sigma^2, \quad a_0 = 1 \tag{9}$$

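Employing eqn. 9 to augment the Yule-Walker equation system into a $(p+1)$th-order system gives the matrix equation

$$\begin{bmatrix} R(0) & R(1) & \cdots & R(p) \\ R(1) & R(0) & \cdots & R(p-1) \\ \vdots & \vdots & \ddots & \vdots \\ R(p) & R(p-1) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} \sigma^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{10}$$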

The autocorrelation matrix on the left-hand side of eqn. 10 is Toeplitz in form and is positive-definite, lending itself readily to solution of the predictor coefficients $a_k$ by the recursive formulation of Levinson-Durbin [20], obviating any explicit matrix inversion. For a $p$th-order linear predictor, the recursive method applicable to eqn. 10, with a computing time of order $p^2$, is much faster than the matrix inversion solution required by eqn. 7, where the computing time is of order $p^3$.
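The recursive solution can be illustrated with a minimal Python sketch, assuming the sign convention used above ($a_0 = 1$ and $e_n = s_n + \sum_k a_k s_{n-k}$). The function names `autocorrelation` and `levinson_durbin` are illustrative only and do not come from the paper.

```python
def autocorrelation(s, p):
    """R(j) = sum_{n=0}^{N-1-j} s[n] * s[n+j] for j = 0..p (windowed data assumed)."""
    n_samples = len(s)
    return [sum(s[n] * s[n + j] for n in range(n_samples - j)) for j in range(p + 1)]


def levinson_durbin(r, p):
    """Solve sum_{k=1}^{p} a_k R(|i-k|) = -R(i), i = 1..p, with a_0 = 1.

    Returns (a, err): a = [1, a_1, ..., a_p] and err is the final
    prediction error power (sigma^2 in the augmented system).
    """
    a = [1.0] + [0.0] * p
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for stage i, from the stage i-1 solution
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        k_i = -acc / err
        # Update a_1..a_{i-1} from the previous-stage values, then append a_i
        prev = a[:]
        for k in range(1, i):
            a[k] = prev[k] + k_i * prev[i - k]
        a[i] = k_i
        # Error power shrinks by (1 - k_i^2) at every stage
        err *= (1.0 - k_i * k_i)
    return a, err
```

Each stage costs on the order of $i$ operations, which gives the overall order-$p^2$ computing time quoted above.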

The key to understanding the nature of the new pole focusing technique proposed in this paper, however, is best found by an interpretation of the properties of linear prediction in the transfer function and frequency domains. First, given that $a_0 = 1$, rewriting eqn. 2 in the form

$$\sum_{k=0}^{p} a_k s_{n-k} = e_n \tag{11}$$

demonstrates that the sequence of prediction error coefficients can in fact be viewed as a finite-impulse-response (FIR) prediction error filter (PEF) which is digitally convolved with the input speech signal to produce the prediction error (or residual) $e_n$. That is, with a sampling period of $T_s$ seconds,

$$a(nT_s) * s(nT_s) = e(nT_s) \tag{12}$$

and hence, from convolution theory, in the frequency domain,

$$A(nF)\,S(nF) = E(nF) \tag{13a}$$

or

$$S(nF) = \frac{E(nF)}{A(nF)} \tag{13b}$$

where $F$ is the frequency resolution given an $N$-point data window: $F = 1/(NT_s)$. That is, LPC-based spectral analysis attempts to minimise the difference between the actual signal spectrum $S(nF)$ and its estimate $1/A(nF)$. In reality, the residual spectrum $E(nF)$ will be coloured, with different amplitude values at various frequencies. However, under the basic assumptions of LPC analysis, while $A(nF)$ is directly calculable given the PEF coefficient series $a_k$, if the PEF is indeed optimum, the residual signal should be a Gaussian random variable with zero mean. Hence, the estimate of the residual spectrum, $E(nF)$, is considered to be of uniform amplitude, i.e. white, with constant power spectral density $\sigma^2$. An estimate of the signal spectrum $S(nF)$ can then be derived from

$$S(nF) = \frac{E(nF)}{A(nF)} \approx \frac{\sigma}{A(nF)} \tag{14}$$

Using a source-filter interpretation of $S(nF)$, $E(nF)$ can be considered as the estimate of the glottal excitation spectrum, and therefore $1/A(nF)$ is an estimate of the vocal tract frequency response $H(nF)$. With the choice of $E(nF) = \sigma$ (a constant), and since $A(nF)$ is evaluated from a very short ($(p+1)$-length) FIR filter sequence, $S(nF)$ is spectrally smooth, with periodicity effectively deconvolved from the short-time signal spectrum estimate. This is the most important advantage of LPC-based spectral analysis, facilitating the use of simple peak detection algorithms in estimating resonance (formant) centre frequencies, bandwidths and amplitudes.

Since the impulse response of the prediction error filter is simply the coefficient sequence, with z-transform

$$A(z) = a_0 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_k z^{-k} + \cdots + a_p z^{-p} \tag{15}$$

then $A(nF)$, the inverse vocal tract frequency response, can readily be calculated by employing the usual discrete Fourier transform along the z-plane unit circle, with $z = e^{j2\pi m/M}$, where the length $M$ of the DFT can be set to any desired value to give any arbitrarily required spectral resolution, $F = 1/(MT_s)$.

Of course, eqn. 15 also represents a system possessing a transfer function characterised by z-transform zeros. Thus, the inverse frequency response amplitudes of formants with high Q-factors tend to zero. This is an important observation in relation to the pole focusing technique. There is no reason why the z-plane unit circle need be chosen as the sole search path for calculation of the DFT. Since the vocal tract can be considered a stable filter system, the zeros of eqn. 15 must all lie within the unit circle. Therefore, choosing a search path in the z-plane such that $|z| < 1$, i.e. $z = re^{j2\pi m/M}$, $0 \le r < 1$, will enhance the effect of zeros, or vocal tract poles, on the inverse vocal tract frequency response. The spectral 'amplitude' in these 'off-axis' spectra, however, now has little meaning in respect of providing a direct physical interpretation of spectral features, other than indicating when the off-axis search path approaches the location of a vocal tract pole.
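The effect of moving the search path inside the unit circle can be sketched in a few lines of Python. This is an illustration only: `inverse_spectrum` and `two_pole_pef` are hypothetical helper names, and the test filter (a single conjugate zero pair of $A(z)$ at radius 0.9) is an assumed example, not data from the paper.

```python
import cmath
import math


def inverse_spectrum(a, r=1.0, m_points=512):
    """|A(z)| at z = r * exp(j*2*pi*m/M), m = 0..M/2, for A(z) = sum_k a[k] * z**-k."""
    mags = []
    for m in range(m_points // 2 + 1):
        z = r * cmath.exp(2j * math.pi * m / m_points)
        mags.append(abs(sum(a_k * z ** (-k) for k, a_k in enumerate(a))))
    return mags


def two_pole_pef(r_p, theta):
    """PEF coefficients [1, a_1, a_2] whose zeros lie at r_p * exp(+/- j*theta)."""
    return [1.0, -2.0 * r_p * math.cos(theta), r_p * r_p]


# Assumed example: zero pair of A(z) at radius 0.9 and angle pi/4 (bin 64 of 512)
a = two_pole_pef(0.9, math.pi / 4)
on_axis = inverse_spectrum(a, r=1.0)[64]   # finite dip as seen from the unit circle
off_axis = inverse_spectrum(a, r=0.9)[64]  # search path passes through the zero
```

In this example the inverse spectrum dips to a finite value on the unit circle but falls essentially to zero on the path of radius 0.9, which is the enhancement described above.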

In the application of LPC-based spectral analysis to formant tracking [6], however, there is no absolute requirement to conserve all of the spectral estimate for evaluation. In any case, the LPC smooth spectrum employed depends on the arbitrary requirements for desired spectral resolution. The choice of DFT length, though, does not affect the resolution of the LPC model in terms of the features modelled into the frequency response estimate, which depend crucially on the adopted LPC model order $p$. Certainly, if several off-axis spectra are calculated for the PEF sequence of eqn. 15, then formant bandwidth and amplitude can be estimated from observing the relative changes in 'formant' bandwidth from one z-transform radius to another. Decreasing bandwidth as z-transform radius $r$ is decreased will indicate that the vocal tract pole lies on a lower DFT radial search path than that currently in use, whereas increasing bandwidth will indicate that the pole characterising the formant lies closer to the unit circle than the value of DFT search path radius currently in use. Each vocal tract pole position can be estimated by simple triangulation techniques, with z-transform radius supplanting the usual unit-circle formant parameters of bandwidth and amplitude. Of course, such a pole focusing technique might very well choose a search path within the unit circle which directly crosses the position of a vocal tract pole, with an attendant infinite spectral amplitude. However, the use of the inverse spectrum $A(nF)$ provides a fail-safe mechanism in this case, since the spectral amplitude merely falls to zero.
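A much simplified version of this search can be sketched as follows. Instead of triangulating from bandwidth changes, the sketch simply scans a set of radii and records where the inverse spectrum has its deepest minimum; this substitution, the helper names, and the two-pole test filter are all assumptions made for illustration.

```python
import cmath
import math


def pef_magnitude(a, z):
    """|A(z)| for A(z) = sum_k a[k] * z**-k."""
    return abs(sum(a_k * z ** (-k) for k, a_k in enumerate(a)))


def focus_scan(a, radii, m_points=512):
    """For each radius, return (radius, bin of deepest minimum, |A| at that bin)."""
    results = []
    for r in radii:
        mags = [pef_magnitude(a, r * cmath.exp(2j * math.pi * m / m_points))
                for m in range(m_points // 2 + 1)]
        m_min = min(range(len(mags)), key=mags.__getitem__)
        results.append((r, m_min, mags[m_min]))
    return results


# Twelve search paths from the unit circle down to r = 0.89 in steps of 0.01
radii = [round(1.0 - 0.01 * i, 2) for i in range(12)]
# Assumed test filter: one conjugate zero pair of A(z) at r_p = 0.93, bin 64 of 512
r_p, theta = 0.93, 2.0 * math.pi * 64 / 512
a = [1.0, -2.0 * r_p * math.cos(theta), r_p * r_p]
scan = focus_scan(a, radii)
# The radius whose minimum is deepest is taken as the pole radius estimate
best_radius, best_bin, best_mag = min(scan, key=lambda t: t[2])
```

Because the inverse spectrum is used, a search path that crosses the pole exactly yields a value near zero rather than a division by zero.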

Using several off-axis spectra, then, allows implementation of a 'focusing' mechanism to locate vocal tract poles. Estimation of formant trajectories with respect to time depends of course on provision of formant centre frequency estimates, which in the case of pole focusing will be different from the centre frequencies calculated from the resonant peaks of unit-circle spectra. Within limits, this is unlikely to cause major problems, since much of phonetic feature interpretation is dependent not on absolute formant frequency values, but rather on relative frequency ratios and the direction and duration of formant frequency transitions [21, 22]. Formant amplitude, although giving important indications of voice quality, is largely irrelevant in many speech recognition strategies, although identification of weak nasal formants through amplitude and bandwidth discrimination is perhaps the obvious exception [23, 24].

The use of off-axis spectra alone is nonetheless insufficient in providing any significant improvement over standard LPC analysis using spectral estimation calculated along the z-plane unit circle. The choice of LPC model order at each z-transform radius, and whether or not to apply signal pre-emphasis prior to analysis, are both found to greatly affect the performance of the pole focusing technique, both in its ability to provide improved detection of weak (low-intensity) features in the vocal tract frequency response, and also in the robustness of the algorithm to additive uncorrelated noise.

3 Effects of signal conditioning on noise tolerance of spectral estimate

In voiced speech, during which the vocal cords are the main source of excitation of the vocal tract, the glottal excitation spectrum has been observed to exhibit a characteristic roll-off in power of the order of −12 dB/octave [25]. The combined effect of the vocal tract filter and the vocal cord excitation source is to produce a pressure waveform travelling towards the lips whose short-time power spectrum (over 20-40 ms or so) will reflect the periodicity of the excitation source and will also contain concentrations of energy at the formant frequencies F1, F2, F3 etc. The lips themselves typically act as an acoustic horn radiator, thereby imparting an emphasis of +6 dB/octave to the overall spectral slope of the signal emanating from the lips, giving the characteristic speech power spectrum roll-off of −6 dB/octave [26]. Signal conditioning in the form of pre-emphasis is a well-established pre-processing operation for spectral analysis of speech using standard LPC techniques. The application of a digital filter with the impulse response

$$1 - \mu z^{-1} \tag{16}$$

imparts a compensating +6 dB/octave slope to the power spectrum of the speech signal.
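The filter of eqn. 16 is a first-order difference, which can be sketched as below. The function names are illustrative, the defaults $\mu = 0.976$ and $f_s = 16$ kHz are the values used later in the paper, and passing the first sample through unchanged is an assumed boundary convention.

```python
import cmath
import math


def pre_emphasise(s, mu=0.976):
    """y[n] = s[n] - mu * s[n-1], i.e. the FIR filter 1 - mu * z^-1 of eqn. 16."""
    if not s:
        return []
    # First sample has no predecessor; it is passed through unchanged (assumed convention)
    return [s[0]] + [s[n] - mu * s[n - 1] for n in range(1, len(s))]


def pre_emphasis_gain_db(f, fs=16000.0, mu=0.976):
    """Magnitude response of 1 - mu * z^-1 at frequency f (Hz), in dB."""
    z = cmath.exp(2j * math.pi * f / fs)
    return 20.0 * math.log10(abs(1.0 - mu / z))
```

Evaluating `pre_emphasis_gain_db` at two frequencies an octave apart, well above the corner frequency and well below $f_s/2$, gives a difference close to the nominal 6 dB.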

The application of pre-emphasis can be viewed as introducing a further zero into the transfer function of the vocal tract filter, with the +3 dB corner frequency of the spectral pre-emphasis slope being located at $f_\mu = (1 - \mu) f_s / \pi$, where $f_s$ is the signal sample frequency. From residue theory, with the impulse response of the filter being interpreted as a sum of damped sinusoids, the inclusion of a zero term in the transfer function merely serves to increase the power of high-frequency components in the filter impulse response. The presence of the zero neither alters pole locations in the transfer function plane, nor affects their resonance centre frequencies and bandwidths in the associated frequency response. This view of the effects of pre-emphasis is of course accurate as far as the actual signal itself is concerned, given full freedom to represent the vocal tract filter system with unconstrained analytic techniques.

Linear predictive coding analysis is, however, highly constrained. Any estimate of the vocal tract function, from eqns. 14 and 15, must conform to an all-pole model. Any spectral effects of pre-emphasis therefore, given a fixed model order $p$, must be reflected in a change of values for the coefficients $a_1, a_2, \ldots, a_p$. As shown by eqn. 13b, LPC analysis can be interpreted spectrally as the minimisation of the error between the smooth spectrum estimate $1/A(nF)$ and the actual signal spectrum $S(nF)$. If the resonance features of $S(nF)$ are given a boost in peak power through pre-emphasis, then the only way that error in LPC spectral representation can be minimised given an all-pole model is through readjustment of pole positions in the all-pole model. Specifically, under the assumption that the distribution of energy around a spectral peak is parabolic, this is achieved by decreasing the bandwidth of corresponding poles, i.e. by maintaining the z-transform angle argument but moving poles closer to the unit circle.

Employing standard LPC analysis techniques, where the inverse spectrum $A(nF)$ is calculated using the DFT path along the z-plane unit circle, pre-emphasis is certainly more than a convenient method of conserving as many poles as possible for modelling formant features through dissipation of the underlying spectral power roll-off. Indeed, it is absolutely essential if both low-intensity and high-bandwidth formants are to be detectable from the unit circle.

Fig. 1 shows an analogue spectrogram, obtained by wideband filtering techniques, of the word 'deed' spoken by a male talker in continuous speech. An example of the effects of pre-emphasis on standard LPC analysis applied to this signal segment is illustrated in Fig. 2. A short-time Hamming-windowed signal segment of 25.6 ms duration was used in this analysis, and was extracted from the midportion of the voiced vowel /ee/, with sampling rate $f_s$ = 16 kHz. Both Figs. 2a and b employ a fixed 14th-order LPC model, but only the signal segment used for Fig. 2b was pre-emphasised ($\mu$ = 0.976). Each diagram shows the first quadrant of the z-plane between $f$ = 0 and $f$ = 4 kHz. A total of 12 smooth spectra were calculated in each case, using z-transform radii between values of $|z| = r = 1$ and $r = 0.89$ inclusive, with a uniform step decrease in radius ($\delta r = -0.01$) between successive spectra. A simple peak-detection algorithm was applied to each such spectrum and a standard parabolic curve-fitting technique applied to each extracted peak to yield estimates for peak (formant) centre frequency and bandwidth. The solid arcs shown in each diagram represent formant bandwidth estimates from parabolic interpolation, centred evenly about the estimate for formant centre frequency. As the DFT path in the z-plane approaches any formant pole location, the formant bandwidth decreases, giving rise to the focusing effect.

Fig. 1 Wideband spectrogram of the word 'deed' spoken by a male talker
IPA transcription is indicated (time-scale marker: 130 ms)

Fig. 2 z-transform plane between 0 and 4 kHz ($f_s$ = 16 kHz)
Solid arcs represent formant bandwidth about centre frequency for 12 spectra separated by $\delta r = -0.01$, $p$ = 14
Approximate pole positions (minimum formant bandwidths) are marked with an x
a $\mu$ = 0
b $\mu$ = 0.976
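The parabolic curve-fitting step applied to each detected peak can be sketched as follows. The exact fitting procedure used for Fig. 2 is not given in the text, so the common three-point form is assumed here, and the function name `parabolic_peak` is illustrative. Only the peak position and height are returned; the bandwidth estimate is not covered by this sketch.

```python
def parabolic_peak(y_left, y_centre, y_right):
    """Fit a parabola through three equally spaced samples around a peak.

    Returns (offset, height): offset is the peak position relative to the
    centre sample, in units of the sample spacing, and height is the
    interpolated peak value.
    """
    denom = y_left - 2.0 * y_centre + y_right
    if denom == 0.0:
        return 0.0, y_centre  # no curvature, so nothing to interpolate
    offset = 0.5 * (y_left - y_right) / denom
    height = y_centre - 0.25 * (y_left - y_right) * offset
    return offset, height
```

With spectral samples spaced $F$ Hz apart and the peak found at bin $m$, the refined centre frequency would be $(m + \text{offset})F$.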

The use of a 14th-order model is of course marginal, since with $f_s$ = 16 kHz at the very least a 16th-order model is indicated by order-determining criteria. Nonetheless, the LPC spectrum derived should still produce some optimum fit to $S(nF)$, with the error minimisation condition implying that those features with highest energy will dominate the shape of the spectral estimate $1/A(nF)$.

The migration of the F2 and F3 formant poles nearer to the unit circle due to the effects of pre-emphasis can clearly be seen in Fig. 2b. Note in particular that formant F3 in Fig. 2a is not visible from the unit circle if pre-emphasis is not applied, with characteristic poles deep within the z-plane at or beyond $r = 0.89$, and indeed can only be observed using a DFT search path at $r \le 0.98$. Conversely, it can be seen that the pole corresponding to formant F1 has receded deeper into the z-plane, and this phenomenon can lead to a severe mismatch of the LPC model to the assumed actual vocal tract model in the low-frequency area of the LPC spectral estimate around $f_\mu$. It is worth noting here that these formant features are relatively very intense compared to, say, nasal formants. Pre-emphasis is not found to significantly help in the detection of such weak formants, but this issue is dealt with later.

Pre-emphasis, then, is absolutely essential in standard LPC analysis to render formant features detectable from unit-circle-based smooth spectra. On the other hand, it is known that pre-emphasis causes some mismatch of the LPC smooth spectrum to the vocal tract frequency response at low frequencies [27], and indeed, the application of essentially a differentiating filter can be expected to adversely affect the noise tolerance of standard LPC analysis to additive Gaussian-type noise. In this respect, the likely effects of signal/noise ratio (SNR) on the spectral estimate can be appreciated from an examination of the effects of varying levels of signal and additive noise on the short-time autocorrelation series. This is convenient from several points of view; differences between the values of the LPC coefficients of one series from another do not necessarily imply that the estimated positions for vocal tract poles lie in different z-plane locations, unless all poles can be considered distinct. On the other hand, from the Wiener-Khinchin theorem [28],

$$R(\tau) = \int_{-\infty}^{+\infty} P(f)\, e^{j2\pi f\tau}\, df \tag{17}$$

where $P(f)$ is the power spectrum of the signal, changes in the autocorrelation coefficients necessarily entail a change in spectral shape. That is, changes in the relative ratios of the autocorrelation coefficients, with respect to $R(0)$ say, directly affect the differential power amplitudes of spectral components and hence overall spectral shape, rather than merely affecting their relative phase relationships, since the power spectrum has a uniform (zero) phase characteristic. Since the LPC coefficient series is calculated directly and solely from the autocorrelation series, changes in the autocorrelation series can therefore be expected to affect the location of formant features in the LPC smooth spectrum.
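One way to quantify such a change is to compare the normalised autocorrelation ratio $R(n)/R(0)$ of a noise-contaminated segment with that of a reference segment, expressed as a percentage of the reference value. The sketch below assumes both segments are already windowed and of equal length; the function names are illustrative.

```python
def autocorrelation(s, max_lag):
    """R(j) = sum_{n=0}^{N-1-j} s[n] * s[n+j] for j = 0..max_lag."""
    n_samples = len(s)
    return [sum(s[n] * s[n + j] for n in range(n_samples - j))
            for j in range(max_lag + 1)]


def ratio_change_percent(reference, contaminated, lag):
    """Percentage change in R(lag)/R(0) of `contaminated` relative to `reference`."""
    r_ref = autocorrelation(reference, lag)
    r_con = autocorrelation(contaminated, lag)
    rr_ref = r_ref[lag] / r_ref[0]
    rr_con = r_con[lag] / r_con[0]
    return 100.0 * (rr_con - rr_ref) / abs(rr_ref)
```

Because each coefficient is normalised by $R(0)$, a pure change of signal level leaves the measure at zero; only a change in spectral shape registers.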

The signal shown in Fig. 1, whose average quantisation SNR is approximately 65 dB, was contaminated with computer-generated Gaussian white noise at several values of SNR down to a minimum of 0 dB SNR. Autocorrelation coefficients for the Hamming-windowed segment used in Figs. 2a and b were calculated, with autocorrelation ratios $R_r(n)$ being computed with respect to $R(0)$, $R_r(n) = R(n)/R(0)$, $n \ge 1$, for each noise-contaminated signal segment. The percentage change in each coefficient ratio was then taken with respect to the 'quiet' original signal segment with SNR = 65 dB, such that $\delta R_r(n) = 100\% \times [R_{r,\,x\,\mathrm{dB}}(n) - R_{r,\,65\,\mathrm{dB}}(n)] / |R_{r,\,65\,\mathrm{dB}}(n)|$, where 0 dB $\le x \le$ 40 dB. Figs. 3a and b demonstrate SNR against the percentage changes found for (a) $R(0)$ and (b) $R_r(1)$, both with pre-emphasis ($\mu$ = 0.976) and without pre-emphasis applied to the 25.6 ms analysis segment. It is clear that the effects of pre-emphasis on the noise component of the signal are radically altering the autocorrelation coefficient series even in relatively 'quiet' noise environments of the order of SNR = 30 dB, and hence LPC spectral estimates would be expected to be detrimentally affected. On the other hand, the non-application of pre-emphasis to the speech + noise gives approximately an 18 dB advantage in the deferment of similar percentage changes in the autocorrelation series, with analogous results obtained for $R_r(3)$ etc. Indeed, Figs. 4a-f demonstrate that at 10 dB SNR the autocorrelation series of the pre-emphasised signal + noise (Fig. 4f) more closely resembles that of the pre-emphasised noise alone (Fig. 4d), in comparison to the nonpre-emphasised signal + noise.

Fig. 3 Effects of pre-emphasis and SNR on autocorrelation series
a $\delta R(0)$
b $\delta R(1)$
Both are referred to 65 dB SNR. Horizontal axis: SNR, dB; curves shown for $\mu$ = 0 and $\mu$ = 0.976

Since signal conditioning is applied equally to both speech and noise, the local SNR in any spectral region is not affected by pre-emphasis. In any given analysis frame, if the additive noise can indeed be assumed to be white Gaussian noise, such that its spectrum is flat, then the primary effect of the additive noise itself on spectral peaks is to increase their bandwidth, and application of pre-emphasis will not alter this signal degradation characteristic. However, in applying LPC analysis to the signal, spectral peaks (on which a high bandwidth is conferred due to the effects of low SNR) may be rendered detectable in the unit-circle LPC spectrum through the mechanism of pole migration towards the unit circle due to pre-emphasis, as discussed earlier. Paradoxically, this effect is likely to mitigate to some extent aberrations in the LPC spectrum expected owing to changes in the autocorrelation series with worsening SNR as illustrated above, in that formant peaks may remain detectable at lower SNRs, compared to those of the nonpre-emphasised signal + noise. However, the extent of such an effect is difficult to quantify. In addition, it is unlikely that the additive noise will possess a flat spectrum within any given short-time analysis frame.

4 Provision of formant estimates using pole focusing technique

In standard LPC analysis, formant candidates are usually extracted by applying a peak detection algorithm to the unit-circle smooth spectrum estimate of $1/A(nF)$. Notwithstanding the effects of pre-emphasis detailed in Section 3, the calculation of a smooth spectrum on the unit circle with fixed order entails two major dilemmas. On the one hand, although the chosen model order and associated LPC coefficient series may implicitly embody solutions for weak formant features, such as nasal formants, there is no guarantee that the prior application of pre-emphasis will render such features visible from the unit-circle spectrum. On the other hand, even if it is suspected that vocal tract poles may lie deep within the z-plane, then if off-axis spectral analysis is used, there is no a priori guarantee that the LPC coefficient series of a fixed-order model in fact contains any terms which implicitly relate to such formants. In particular, it is observed that from one analysis frame to another an apparently consistent formant may become undetectable, possibly through the effects of a non-optimally located analysis window with respect to glottal cycle [29, 30]. Experiments with techniques which employ off-axis LPC spectral estimation but use a fixed model order [7] (i.e. $\delta p/\delta r = 0$) consistently fail to recover the 'missing' formant, even when exploring deep inside the z-plane down to a radius of $r = 0.88$.

An equitable solution to these shortcomings is to increase model order simultaneously as the off-axis distance of the search path is increased, i.e. δp/δr = K(r), where K(r) is in general terms a function of the z-transform radius r. Each increase in model order offers the inclusion of more vocal tract transfer function detail into the inverse filter impulse response of eqn. 15, improving the minimisation of error between estimated and actual vocal tract transfer functions. The simultaneous use of off-axis spectral estimation increases the probability that weak formant detail manifests itself as a detectable spectral peak as the search path for the off-axis spectrum approaches the corresponding pole positions. Most importantly, note that the increase in model order does not necessarily entail 'cluttered' spectra at radii r < 1. Just as weak formant detail is not visible from the unit circle so, conversely, intense formant detail is not detectable in off-axis spectra of low-radius search paths. Of course, as discussed previously, the definition of formant centre frequency must be altered to accommodate such operations, since values for formant centre frequency are determined (as will be shown) as a function of all formant candidates extracted from all off-axis spectra.
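The idea of moving the DFT search path inside the unit circle can be illustrated with a minimal sketch. It assumes the inverse filter is A(z) = sum of a_k z^(-k) with a_0 = 1, and evaluates 1/|A(z)| on the circle |z| = r by scaling each coefficient by r^(-k) before the transform. The function name `off_axis_spectrum`, the FFT length and the single test resonance are illustrative assumptions, not taken from the paper, and the sketch does not perform the LPC fit or the model order increase itself.

```python
import numpy as np

def off_axis_spectrum(a, radius, n_fft=512):
    """Magnitude of 1/A(z) on the circle |z| = radius.

    a: inverse-filter coefficients [1, a_1, ..., a_p], with
       A(z) = sum_k a[k] * z**(-k).
    """
    k = np.arange(len(a))
    # A(r e^{j theta}) = sum_k (a[k] * r**(-k)) * e^{-j k theta},
    # so scaling the coefficients by r**(-k) moves the DFT search path.
    scaled = np.asarray(a, dtype=float) * radius ** (-k.astype(float))
    A = np.fft.rfft(scaled, n_fft)
    return 1.0 / np.abs(A)

# Hypothetical single resonance: pole pair at radius 0.85, angle pi/4.
r_p, theta_p = 0.85, np.pi / 4
a = [1.0, -2 * r_p * np.cos(theta_p), r_p ** 2]

peak_unit = off_axis_spectrum(a, 1.00).max()   # viewed from the unit circle
peak_near = off_axis_spectrum(a, 0.90).max()   # search path closer to the pole
```

With these assumed values the peak is markedly stronger on the r = 0.90 path than on the unit circle, which is the effect the text relies on for weak formants.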

If a formant peak is manifest in several off-axis spectra in any given analysis frame, then its centre frequency can be expected to move in a deterministic manner as the DFT search path approaches and recedes from the pole positions characterising the formant peak. This effect can best be understood by examining the properties of the damping factor ζ, related to the position of poles in the analogue Laplace transform s-plane. ζ itself is a measure of the closeness of any pole pair to the jω-axis, and hence a measure of the Q-factor of the associated resonance peak in the system frequency response, although evidently this will also depend on the locations of other

IEE PROCEEDINGS, Vol. 135, Pt. F, No. 1, FEBRUARY 1988 23



transfer function components in a multistage filter system. The well-established relationship between ζ and resonance characteristic for a simple two-pole system is illustrated in Fig. 5 [31]. Note in particular the trace of resonance centre frequency with ζ. It is a well-known result that the equation governing the relation between the undamped natural frequency of oscillation f_0 and the resonance centre frequency f_m of the frequency response of a simple two-pole system, viewed from the jω-axis, is

$$f_m = f_0 \sqrt{1 - 2\zeta^2} \qquad (18a)$$

where the complex conjugate pole pair is located at

$$p_1, p_2 = -\zeta\omega_0 \pm j\omega_0\sqrt{1 - \zeta^2} \qquad (18b)$$

with ω_0 = 2πf_0. In the digital domain, the z-plane radial position of the pole pair is therefore

$$r_p = |z_{p_1,p_2}| = e^{-\zeta\omega_0 T_s} \qquad (19)$$

Conversely, using a generalised DFT search path with constant radius r_s, then the relationship between damping factor (and hence bandwidth) of any spectral peak and radial distance r_d between any z-plane pole position and search path, r_d = r_p - r_s, say, is given by

$$\ln(1 - |r_p - r_s|) = \ln(1 - |r_d|) = -\zeta\omega_0 T_s \qquad (20)$$

From eqn. 20, substituting for ζ in eqn. 18a gives

$$\frac{f_m}{f_0} = \sqrt{1 - 2\left[\frac{\ln(1 - |r_d|)}{\omega_0 T_s}\right]^2} \qquad (21a)$$

and so, solving for r_d,

$$|r_d| = 1 - \exp\left(-\omega_0 T_s \sqrt{\frac{1 - (f_m/f_0)^2}{2}}\right) \qquad (21b)$$

Fig. 4 Autocorrelation series R(0) to R(34) for a 25.6 ms signal segment (horizontal axis: coefficient n)
a Speech signal, μ = 0
b Speech signal, μ = 0.976
c Uncorrelated Gaussian noise
d Pre-emphasised Gaussian noise, μ = 0.976
e Speech + noise, SNR = 10 dB, μ = 0
f Pre-emphasised speech + noise, SNR = 10 dB, μ = 0.976

Fig. 5 Resonance centre frequency movement against damping factor ζ for an analogue two-pole resonating system
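The relations of eqns. 18a and 20 can be explored numerically with a short sketch. It assumes the standard two-pole results f_m = f_0 sqrt(1 - 2ζ²) and ln(1 - |r_d|) = -ζ ω_0 T_s; the function names, the 16 kHz sample rate (borrowed from the experiment in Section 5) and the example distances are illustrative assumptions only.

```python
import math

def damping_from_distance(r_d, f0_hz, fs_hz):
    """Eqn. 20 solved for zeta: ln(1 - |r_d|) = -zeta * w0 * Ts."""
    w0_ts = 2 * math.pi * f0_hz / fs_hz
    return -math.log(1 - abs(r_d)) / w0_ts

def centre_frequency_ratio(r_d, f0_hz, fs_hz):
    """f_m / f_0 = sqrt(1 - 2 zeta^2); None once no resonance peak remains."""
    zeta = damping_from_distance(r_d, f0_hz, fs_hz)
    arg = 1 - 2 * zeta ** 2
    return math.sqrt(arg) if arg > 0 else None

def distance_for_damping(zeta, f0_hz, fs_hz):
    """|r_d| at which a pole of natural frequency f0 is seen with damping zeta."""
    w0_ts = 2 * math.pi * f0_hz / fs_hz
    return 1 - math.exp(-zeta * w0_ts)

fs = 16000.0
for f0 in (500.0, 2000.0):
    ratio = centre_frequency_ratio(0.02, f0, fs)   # same search distance for both poles
    reach = distance_for_damping(0.3, f0, fs)      # distance at which zeta reaches 0.3
    print(f0, ratio, reach)
```

Under these assumptions the 500 Hz pole shows the smaller ratio f_m/f_0 at a given distance, and reaches ζ = 0.3 at a smaller |r_d| than the 2 kHz pole, in line with the two observations drawn from eqn. 21.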

There are two very important observations from eqn. 21. First, from eqn. 21a, for a fixed sample period T_s and for (say) both high- and low-frequency poles located at the same radius, then from any given search path following a radial contour, the ratio f_m/f_0 is less at low frequencies. That is, for a frequency response calculated from a radial DFT search path for a set of coradial poles, the observed Q-factor of peaks in the frequency response is less at low frequencies than at high frequencies. Secondly and conversely, from eqn. 21b, for a given Q-factor of any peak in the frequency response, the radial observation distance r_d of the DFT search path from the corresponding poles is less at the low-frequency end of the spectrum. The two observations together mean that if a series of off-axis spectra are calculated for the LPC coefficient series, then low-frequency formants will be observable over a short range of r_d, over which range the formant centre frequency will rapidly follow the characteristic curve of Fig. 5 as |r_d| increases, and therefore may be present in only a few of the off-axis spectra before critical damping is reached. On the other hand, high-frequency formants will follow the centre frequency and Q-factor characteristics of Fig. 5 more slowly as |r_d| increases, and so should be observable in many more spectra than low-frequency formants before the point of critical damping occurs.

In a complex resonating system with many pole pairs, the above characteristics for formant centre frequency and Q-factor will only be followed as the radial distance |r_d| between the DFT search path and any pole pair tends to zero; i.e. when the pole pair effectively dominates the off-axis frequency response. At larger distances, the presence of other poles in the transfer function will have an effect on the movement of formant centre frequency and Q-factor with |r_d|. It has been found in practice that formant peaks are at best observable in LPC spectra at Q-factors greater than or equal to 1.65, i.e. with an equivalent damping factor ζ ≤ 0.3. This has an important bearing on the method of estimating formant centre frequency.

In the pole focusing technique, the method adopted for provision of formant centre frequency data has been to apply a weighted averaging operation to the collected set of peaks extracted from all off-axis spectra. The peak values, with associated Q-factor weightings and DFT radius values, are arranged in frequency order in a one-dimensional list. First, the peak with the highest Q-factor in the list is located and assumed to be an estimate of the location of a vocal tract pole in the z-plane, say p_a, with undamped natural frequency f_a0. From a knowledge of the maximum DFT radial distance from the assumed pole to the most distant DFT search path possible, a value for f_am/f_a0 is calculated from eqn. 21a. This specifies the frequency range over which peak values in the list are to be averaged to yield a value for formant candidate F_a. All peaks in the list lying between f_a0 ± f_am are averaged and weighted according to their Q-factor. Thus a value for formant frequency is given by

$$F_a = \frac{\sum_{u=0}^{U} f_{au}\, Q_{au}/Q_{a0}}{\sum_{u=0}^{U} Q_{au}/Q_{a0}} \qquad (22)$$

where U is some unspecified number dependent on f_am, r_dmax and K(r). Q_a0 is the Q-factor of the pivotal peak at f_a0. Peaks within the averaging range are deleted as future candidates for vocal tract poles, but may still be used in the averaging process of other such poles. Note that the averaging range extends to f_a0 + f_am. There is no theoretical requirement to include this upper band, but it is found to be necessary as a consequence of choosing any positive-valued function for K(r), since changes in LPC model order may significantly alter the centre frequencies of spectral peaks.

The peak in the list with the next-highest Q-factor, excluding those previously included in the averaging operation above, is chosen as the pivot for the next pass of the averaging process, and this operation is continued until no more pivotal peaks are available. Finally, the extracted formant candidates F_a, F_b etc. are arranged in frequency order, together with associated averaging weights and radius at which the pivotal peak was found, which is taken to be the approximate radial position of a vocal tract pole. The averaging weights can be used to filter out spurious 'formant' peaks (those with a low weight), and retention of radial pole position allows identification of possible formant type (nasal or oral). The use of the averaging method in itself has important ramifications for the consistency of formant trajectories in signals with a low SNR, and this is discussed in Section 5.
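The pivot-and-average procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Peak` record, the function name `average_formants`, the caller-supplied `half_width_hz` function (standing in for the range derived from eqn. 21a) and the use of the summed normalised Q values as the reported averaging weight are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Peak:
    freq_hz: float   # peak centre frequency from one off-axis spectrum
    q: float         # Q-factor of that peak
    radius: float    # DFT search-path radius at which it was found

def average_formants(peaks, half_width_hz):
    """Q-weighted averaging around successive highest-Q pivot peaks (cf. eqn. 22).

    half_width_hz(f0) returns the averaging half-range for a pivot at f0;
    here it is supplied by the caller (an assumption of this sketch).
    """
    available = set(range(len(peaks)))   # peaks still eligible as pivots
    formants = []
    while available:
        pivot_i = max(available, key=lambda i: peaks[i].q)
        pivot = peaks[pivot_i]
        width = half_width_hz(pivot.freq_hz)
        # All peaks in range contribute, even ones already used by another pivot.
        in_range = [i for i, p in enumerate(peaks)
                    if abs(p.freq_hz - pivot.freq_hz) <= width]
        weights = [peaks[i].q / pivot.q for i in in_range]
        total = sum(weights)
        freq = sum(w * peaks[i].freq_hz for w, i in zip(weights, in_range)) / total
        formants.append((freq, total, pivot.radius))
        available -= set(in_range)       # in-range peaks cannot become pivots
    return sorted(formants)              # frequency order

# Hypothetical peak list gathered from several off-axis spectra.
peaks = [Peak(498.0, 8.0, 0.97), Peak(503.0, 10.0, 0.96), Peak(510.0, 4.0, 0.93),
         Peak(1002.0, 6.0, 0.95), Peak(990.0, 3.0, 0.92)]
result = average_formants(peaks, lambda f0: 0.05 * f0)
```

Each returned tuple holds the averaged frequency, the summed weight and the radius of the pivotal peak, so low-weight candidates can be filtered out afterwards as the text suggests.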

5 Comparative noise tolerance evaluation of the pole focusing technique

The spectral effects of model order and pre-emphasis on the LPC analysis of a noisy signal, with a view to estimating the likely effect on formant tracking systems, can be measured by the use of synthetic signals where the underlying signal structure is completely predetermined. Spectrum-based formant tracking itself depends on two sets of properties: (i) those of the spectral estimator providing formant candidates on a per-frame basis, and (ii) those of the line tracking algorithm with associated phonetic constraints, which selects optimum formant candidates and tracks formant movement from frame to frame. The former has already been investigated to some extent in Section 3 on the effects of signal conditioning. The performance of the tracking algorithm, on the other hand, depends heavily on the ability of the spectral estimator to provide good consistency and fidelity in the extraction of formant data, and hence examination of the frame to frame noise tolerance of the spectral estimation component of the formant tracking system provides a reasonable measure of the likely performance of the overall system, particularly in its ability to maintain the integrity of any formant track in high noise environments.

The investigation of noise tolerance performance is exemplified here by employing an artificially generated speech-like signal of 100 ms duration. This signal is characterised by three formant-like features in its short-time spectrum at nominally 500 Hz, 1 kHz and 2 kHz with relative power amplitudes of 0 dB, -6 dB and -12 dB, respectively, and each formant having a bandwidth of 100 Hz. The signal is generated by exciting a simulated parallel formant synthesiser with an impulse train of frequency 150 Hz ± 3%. The sample frequency for the signal is 16 kHz, 12-bit resolution, with an average quantisation SNR of 60 dB.

For the purpose of comparison, several standard LPC analysis models are used: (i) a 16th-order model with no prior pre-emphasis, which yields an adequate spectral representation of the signal at high SNRs; (ii) the same model but with pre-emphasis applied; and (iii) 22nd- and 24th-order LPC models both employing pre-emphasis. The pre-emphasis factor when used was μ = 0.976. All values for formant centre frequency in each standard LPC analysis are extracted using peak detection and parabolic interpolation applied to unit-circle based DFT spectra. In both the pole focusing technique and each standard LPC analysis, a 25.6 ms Hamming-windowed analysis frame is used, and the total analysis is performed by moving the window exactly one sample point at a time across the signal, to give a total of 1192 analysis frames.
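Peak detection with parabolic interpolation, as mentioned above, can be sketched in a few lines. The paper does not give its exact procedure, so this is a common three-point form offered as an assumption: a parabola is fitted through each local maximum and its two neighbours, and the vertex is taken as the refined peak. The function name `parabolic_peaks` and the `bin_hz` parameter are hypothetical.

```python
import numpy as np

def parabolic_peaks(spectrum_db, bin_hz):
    """Return (frequency_hz, level_db) for each local maximum, refined by
    fitting a parabola through the peak bin and its two neighbours."""
    s = np.asarray(spectrum_db, dtype=float)
    peaks = []
    for k in range(1, len(s) - 1):
        if s[k] > s[k - 1] and s[k] >= s[k + 1]:
            # Negative for any local maximum selected above, so no division by zero.
            denom = s[k - 1] - 2 * s[k] + s[k + 1]
            # Offset of the parabola vertex from bin k, in bins.
            delta = 0.5 * (s[k - 1] - s[k + 1]) / denom
            level = s[k] - 0.25 * (s[k - 1] - s[k + 1]) * delta
            peaks.append(((k + delta) * bin_hz, level))
    return peaks

# Illustrative check on a sampled parabola whose true vertex lies at bin 10.3.
x = np.arange(21)
found = parabolic_peaks(-(x - 10.3) ** 2, bin_hz=31.25)
```

For a spectrum that is locally parabolic, as in this check, the vertex is recovered exactly; on real LPC spectra the refinement is only approximate.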

In the pole focusing technique, 12 spectral estimates are produced using z-transform radii in the range 0.89 ≤ r ≤ 1, decreasing in steps of δr = -0.01. The initial model order is p_{r=1} = 12 on the unit circle, increasing by +2 for each step decrease in radius, i.e. K(r) = +2. The value of radius step has been chosen to some extent arbitrarily, and experimentation with real speech has indicated that decreasing the radius much below 0.9 does not appear to improve the performance of the technique. Experimentation has also found that the technique is relatively insensitive to choice of initial model order (values of between 8 and 16 may be used with little observable change to the results presented here). Figs. 6a and b demonstrate the results of formant extraction in the range 0-4 kHz from both the 16th-order LPC analysis (no pre-emphasis) and the pole focusing technique at a simulated SNR of 0 dB. Note especially that F1, F2 and F3 should remain exactly constant with time. The results demonstrate that standard LPC analysis (see Fig. 6a) suffers from poor formant consistency from one analysis frame to another at such a low SNR. Also, note that F2, which is nominally situated at 1 kHz, is missing in the majority of frames. The formant-like features at 3 kHz and above in both diagrams are thought to relate to the additive Gaussian noise.
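The radius and model order schedule quoted above (12 spectra, radius from 1.00 down to 0.89 in steps of 0.01, order starting at 12 and rising by 2 per step) can be written out explicitly. The function name and argument names are hypothetical; only the default values come from the text.

```python
def pole_focusing_schedule(r_start=1.00, r_stop=0.89, dr=-0.01, p_start=12, k=2):
    """List of (radius, model order) pairs, one per off-axis spectrum."""
    n_steps = round((r_stop - r_start) / dr)   # number of steps inside the unit circle
    return [(round(r_start + i * dr, 2), p_start + i * k)
            for i in range(n_steps + 1)]

schedule = pole_focusing_schedule()
for radius, order in schedule:
    print(radius, order)
```

With the default values the final entry is radius 0.89 with model order 34, which matches the autocorrelation range R(0) to R(34) shown in Fig. 4.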

The evaluation process used here employs a statistical averaging routine to calculate average and standard deviation (with respect to the average) for each nominal formant centre frequency. This routine utilised formant values both from each signal frame and from each analysis technique on a 'nearest-neighbour' criterion, with the nominal formant frequencies of the artificial signal, as detailed above, being used to select the formant candidates.

Fig. 6 Point-by-point analysis of synthetic speech with fixed formant frequencies (horizontal axis: time, ms)
F1 = 500 Hz, F2 = 1 kHz, F3 = 2 kHz, SNR = 0 dB
a LPC analysis, p = 22, μ = 0.976
b Pole focusing technique, μ = 0

The 100 ms signal was contaminated by progressively more intense additive white noise in 5 dB steps down to 15 dB SNR, and then in 1 dB steps down to -6 dB. Of interest here is the deviation characteristic and percentage of analysis frames yielding a value for any given formant. Formant trajectories are of paramount importance as discriminant cues for pre- and postvocalic consonant sounds, as well as identifying the voiced phoneme itself, and formant F2 is particularly crucial in this respect. Formant line tracking algorithms generally have some inbuilt tolerance to formants which are absent in one or several frames, enabling tracking over the missing formant value by substitution of some average value based on formant consistency both prior to and after the affected analysis frame, and this tolerance can cope with up to a 20% loss of formant information (although losses must usually be randomly dispersed). The results for the standard deviation of formant frequency estimation against SNR for each technique for formant F2 are shown in Fig. 7a. The y-axis here is calculated from y = 20 log10(100σ/F2av), where F2av is taken as the average formant frequency value across those frames in which a value for F2 was present, and σ is the associated standard deviation of F2 across these frames. Fig. 7b shows the percentage of frames in which a value for F2 could be extracted against SNR.
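The two evaluation measures can be written down directly from the definitions above. This is a small sketch under stated assumptions: the function names are hypothetical, missing formants are represented by `None`, and the population standard deviation is used because the text does not say which estimator was applied.

```python
import math
import statistics

def deviation_db(f2_values_hz):
    """y = 20 log10(100 * sigma / mean) over frames that yielded a value.

    Requires at least two distinct values, otherwise sigma is zero.
    """
    mean = statistics.fmean(f2_values_hz)
    sigma = statistics.pstdev(f2_values_hz)
    return 20 * math.log10(100 * sigma / mean)

def frame_yield_percent(f2_per_frame):
    """Percentage of frames with a value; None marks a missing formant."""
    found = [f for f in f2_per_frame if f is not None]
    return 100.0 * len(found) / len(f2_per_frame)

# A 5% deviation (the difference limen used as a criterion) expressed in dB.
limen_threshold_db = 20 * math.log10(5.0)
```

A deviation of 5% of the average formant frequency maps to roughly 14 dB on this scale, which is the threshold drawn in Fig. 7a.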

Initial assessment of the performance of each technique is based here on a formant difference limen of 5% of average formant centre frequency [15, 16], i.e. 100σ/F2av ≈ 14 dB, and a minimum percentage F2 yield of 80%. The results of Fig. 7a demonstrate that the pole focusing technique offers superior performance in terms of provision of consistent values for formant frequency compared to standard LPC analysis with pre-emphasis. With the synthetic signal parameters chosen, 16th-order standard LPC analysis with no pre-emphasis performs adequately down to SNR = 11 dB, at which point it can be seen from Fig. 7b that the percentage of frames yielding a value for F2 drops off sharply beyond the 80% minimum adopted here. The same model order with pre-emphasis is ostensibly found to perform better, but it should be noted that values for F1 could not be extracted (although not shown here). With pre-emphasis applied, F1 could only in fact be found using a model order of p = 22 and above. From Fig. 7b, results for the 22nd-order model in this case demonstrate that below an SNR of 50 dB there is an insufficient yield of F2 to satisfy the requirements of an eventual formant line tracking algorithm. On the other hand, results for a 24th-order model, although satisfying F2 yield conditions from Fig. 7b, fail to satisfy the minimum formant difference limen criterion (see Fig. 7a). The pole focusing technique, although employing large model orders in excess of p = 22 at z-transform radii r < 1, satisfies well the adopted test criteria down to -3 dB SNR, at which point the formant difference limen exceeds 100σ/F2av = 14 dB. Note, however, that there is still 100% F2 yield beyond -3 dB SNR compared to all other standard LPC methods.

Fig. 7 Comparison of F2 deviation and frame yield for F2 against SNR (horizontal axis: SNR, dB; curves for p = 16, μ = 0; p = 24, μ = 0.976; and pole focusing)
a F2 deviation (expressed in decibels with respect to average value of F2) against SNR for several model orders, and for the pole focusing technique. The 5% difference limen (deviation) threshold (14 dB) is indicated
b Frame yield for F2 against SNR

The pole focusing technique thus gives some 14 dB improvement in noise immunity, as compared to the 16th-order model without pre-emphasis. However, in the analysis of speech (as opposed to the predefined synthetic signal used above), as has been shown previously, pre-emphasis is essential in the application of standard LPC analysis. Based on the above results, the pole focusing technique is therefore likely to impart substantially more than a 14 dB improvement in noise immunity, although the exact nature of the advantage is difficult to quantify.

The improvement offered by the pole focusing technique in formant consistency from frame to frame, despite a low SNR in additive Gaussian noise, can be related to the choice of K(r), and especially to the use of an averaging technique in the provision of formant centre frequency, as given by eqn. 22. At each new DFT radius, as the model order is increased, the pole positions for each formant inherent in eqn. 15 will alter as each additional noise-corrupted value for the autocorrelation coefficient R(p_{r=1} + K(r)) is included in the solution for the new a_k coefficient series. If the (non-pre-emphasised) noise is approximately Gaussian in nature, then pole positions associated with each model order can also be expected to 'dither' randomly about some mean position. In off-axis standard LPC analysis employing a fixed model order, this dithering effect will not occur since the model order is explicitly fixed. If K(r) = 0, then pole positions remain absolutely fixed with respect to the model order chosen, although evidently, if the signal characteristics remain fixed with time as in the above experiment, then pole dither can be expected from one analysis frame to the next. The use of prior within-frame averaging, as is inherent in the pole focusing method used here, succeeds in reducing the amount of frame to frame dither. These effects are illustrated in Figs. 8a-d. Fig. 8a shows a set of formant centre frequencies extracted from off-axis spectra in which there are detectable values for F1, using δr = -0.01 as above, but with K(r) = 0 and p = 16, with no pre-emphasis. The thickened solid curve joins centre frequencies for F1 from the 'quiet' synthetic signal as digitised with average quantisation SNR ≈ 60 dB. The thin solid curve joins centre frequencies extracted from the signal + noise at 15 dB SNR. Note that the curves are well separated. Similar results are shown in Fig. 8b for F2. Once again, the centre frequency traces are separated out through the effects of additive noise. From frame to frame the amount of such separation is found to vary.

Figs. 8c and d relate to the pole focusing technique as used above. Note that the peak centre frequencies for each formant dither at random around the 'quiet' formant trace characteristic. Thus, although the pole dither associated with each radius may vary from frame to frame, the otherwise damaging effect on consistency is attenuated by the within-frame averaging. If the pole dither can indeed be assumed to be Gaussian both within-frame and from frame to frame, then within-frame averaging will certainly reduce the overall standard deviation of formant centre frequency. If there are ψ off-axis spectra (with K(r) > 0), assuming a true formant pole position of p_F, then employing averaging will yield an approximate pole frequency position

$$\hat{p}_F \approx \frac{1}{\psi}\sum_{i=1}^{\psi} (p_F + n_i) \qquad (23)$$

where n_i is the random dither in pole position for each value of radius and associated model order. This of course leads to a reduction in the standard deviation in the position of p_F by a factor of 1/√ψ, and with ψ = 12 in the above experiment this explains some 11 dB in the noise immunity of pole focusing over standard LPC analysis with no pre-emphasis. The additional advantage is thought to pertain to vertical pole movement within the z-plane, thereby relating to the formant yield across all analysis frames as shown in Fig. 7b. Some hint of this can be gained from Figs. 8a-d, since it will be noted that the approximate pole position using standard LPC analysis has moved further into the z-plane for both F1 and F2 at 15 dB SNR, whereas radial position remains constant under pole focusing analysis.

Fig. 8 Trace of F1 and F2 formant centre frequency for two values of SNR (horizontal axes: F1 frequency, Hz; F2 frequency, kHz)
a F1, standard LPC analysis, p = 16, μ = 0
b F2
c F1, pole focusing technique
d F2
Approximate pole positions (minimum formant bandwidth) are marked with an 'x'. Thick solid curve: 60 dB SNR; thin solid curve: 15 dB SNR

6 Enhanced feature resolution using pole focusing

Notwithstanding the effects of noise on formant estimation, the pole focusing technique is also found to provide enhanced feature resolution compared to standard LPC analysis. The technique inherently allows for formant decay and growth in its use of off-axis spectral analysis, but is unique in its concurrent use of increasing model order. It is of course possible to apply off-axis LPC analysis to the speech waveform, but with a fixed model order [7]. This allows for the inclusion of terms in the PEF coefficient series which relate to low Q-factor formants which may not be visible from the unit circle. However, it has been found that using such a technique with K(r) = 0 is relatively unsuccessful in recovering formants whose characteristic poles may momentarily occupy a z-plane position which is deep within the unit circle. This means that such poles are either simply not present, or perhaps require larger LPC models beyond the usually accepted model orders applicable to speech. However, standard LPC analysis here faces two dilemmas. Using a fixed order and estimating formant position from the unit-circle DFT demands the use of pre-emphasis, with inherent problems in ensuring that low-frequency formants near the pre-emphasis corner frequency are adequately represented. On the other hand, using a large fixed model order and employing off-axis spectral estimation may obviate the requirement for pre-emphasis, but the main criticism here is that the frame to frame variance of formant estimates will increase. The use of large fixed model orders is thus generally much less robust to the effects of noise even in a high (> 40 dB) SNR environment. As illustrated previously, however, the pole focusing technique considerably attenuates the effects of noise through the use of several large model orders at various radii. The explicit use of an averaging mechanism to achieve formant estimates, as given in eqn. 22, is of prime importance in conferring a high noise immunity. In high SNR environments, however, the large model orders employed have the dual benefit of providing unsurpassed consistency of formant retrieval from deep within the z-plane compared to standard LPC analysis.
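The contribution of within-frame averaging to noise immunity (the 1/√ψ argument of eqn. 23 in Section 5) can be checked with a short simulation. It assumes independent, zero-mean Gaussian dither of equal variance at each radius, which is the idealisation made in the text; the function name, trial count and seed are arbitrary choices for this sketch.

```python
import math
import random
import statistics

def averaged_std(psi, sigma=1.0, trials=20000, seed=0):
    """Standard deviation of the mean of psi independent Gaussian dithers."""
    rng = random.Random(seed)
    means = [statistics.fmean([rng.gauss(0.0, sigma) for _ in range(psi)])
             for _ in range(trials)]
    return statistics.pstdev(means)

psi = 12                                                   # off-axis spectra per frame
predicted_gain_db = 20 * math.log10(math.sqrt(psi))        # 1/sqrt(psi) reduction in dB
measured_gain_db = 20 * math.log10(1.0 / averaged_std(psi))
```

The predicted figure is 10 log10(12), a little under 11 dB, and the simulated value should agree with it to within sampling error. Correlated or non-Gaussian dither would give a smaller gain.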

Figs. 9a and b demonstrate the ability of the pole focusing technique to separate merging formants in the time-frequency plane. The analysis is carried out on the speech waveform detailed in Section 3 and whose spectrogram is shown in Fig. 1. In the standard formant tracking technique used to obtain Fig. 9a, off-axis spectral estimation, although used, is not asserted in all analysis frames, unlike the pole focusing technique. Only if it is not possible in the current analysis frame to extract a value for any single formant from the normal unit-circle LPC smooth spectrum (based on a nearest-neighbour consistency test dependent on the previous analysis frame) is the DFT radial search path decreased in steps of δr = -0.004 down to a minimum radius of r = 0.88, or until frequency values are obtained for those formants which are 'missing'. If this search fails, the line tracking algorithm which is finally applied to the time-aligned list of formants may still estimate a substitute value, dependent on the number of frames for which any formant is spectrally undetectable. The pre-emphasis factor used here is μ = 0.976, with a fixed model order p = 16. The pole focusing technique employed the parameters already given in the experiments detailed in Section 5. Both techniques share the same line tracking algorithm, and analysis frames are calculated every 5 ms.

Fig. 9 Formant trajectories for the word 'deed' with formant estimation (horizontal axis: time, ms)
a Standard LPC analysis, but with off-axis spectral estimation, p = 16, μ = 0.976
b Pole focusing technique
Both techniques share same line tracking algorithm

In Fig. 9a note that at 100 ms the standard technique fails to recover merged values for F2 and F3, despite the use of off-axis spectral analysis. There is only a partial yield of values for F4, and the line tracking algorithm fails to maintain the integrity of the trajectory for F3. With the pole focusing technique (Fig. 9b) note that the value of δr = -0.01 is much coarser than that employed by the standard technique above, and the minimum DFT search radius (0.89) employed by the pole focusing technique does not penetrate as deep into the z-plane as the standard LPC technique used in the above comparison, and yet the pole focusing technique successfully improves the integrity of the formant trajectories of both F3 and F4.

The fixed-order off-axis analysis scheme used in the above comparison does not employ any within-frame formant averaging, and moreover requires pre-emphasis. Bearing in mind the degraded frame to frame variance of formant position when employing a large (fixed) model order, as has been shown in Section 5, it is nonetheless interesting to postulate that if a high SNR environment can be guaranteed (perhaps through the use of a noise-cancelling microphone arrangement or prior adaptive filtering) then the pole focusing technique may not strictly require an increasing model order. It is thus theoretically possible to construct a modified pole focusing mechanism, in which off-axis spectral estimation is retained together with formant averaging, such that pre-emphasis remains unnecessary. A reliance is, however, then placed on choosing an appropriate model order according to some criterion. Formant decay and growth is still explicitly catered for through use of DFT search paths within the unit circle, but formant computation time is reduced by using a moderate fixed value of model order, i.e. K(r) = 0. Pole focusing will still take place, and eqns. 20-22 will still hold, although no within-frame formant dither will now take place, so that eqn. 23 no longer applies. However, the explicit use of a signal with high SNR reduces the effects of noise on frame to frame variance.

A comparison of the modified pole focusing technique with fixed model order (p = 18) against the original pole focusing technique with increasing model order and parameters as previously described in Section 3 is shown in Figs. 10a and b. The speech used is from a male talker uttering the isolated word 'nana', digitised to 12-bit resolution with f_s = 16 kHz. The nasalised consonant /n/ presents a difficult challenge to LPC analysis in general since several formant features associated with nasality are characteristically of high bandwidth and low intensity, i.e. their characteristic poles lie deeply embedded within the z-plane. Using the modified pole focusing technique, Fig. 10a illustrates that typical values for F1, F2 and F3 for the back vowel /a/ are adequately detectable. Nasal formants N1, N2, N3 and N5 are also detectable. Theoretical predictions [32] indicate that nasal formants should certainly be present at 0.9 kHz (N2) and 2.4 kHz (N4), although it has not been possible to find the latter using the fixed model order. Nonetheless, the values for those formants which have been detected display good consistency from frame to frame. The pole focusing technique with K(r) = +2 (see Fig. 10b) exhibits some interesting features. First, the formant values for F1 in both occurrences of /a/ are more consistent, i.e. have less variance, from frame to frame. By the same token, however, formant F2 appears to vary considerably from frame to frame, and this phenomenon certainly cannot be attributed to properties relating to expected noise performance, which indeed are counterindicative given the results discussed in Section 5. On the other hand, there is a greater yield of analysis frames containing values for the nasal formant N2 in both occurrences of /n/, and N3 and N5 characteristics are comparable in consistency and value. Note that in Fig. 10b there appear to be pre- and postconsonantal nasal formant trajectories visible for N2, with the latter being found to occupy a frequency location much closer to the theoretical value of 0.9 kHz than in Fig. 10a using the modified pole focusing technique.

Fig. 10 Qualitative comparison of formant analysis in a high SNR environment (horizontal axes: time, × 100 ms in a and b; z-transform radius in c)
a Modified pole focusing technique with a fixed model order, p = 18
b Pole focusing technique with increasing model order
c Development of formant detail for frame at 380 ms marked in b

Most interestingly, using the increasing model order in Fig. 10b has apparently produced values for the fundamental F0, besides giving nasal formant values not only in the expected location for N4 for the nasal consonant /n/ itself, but also throughout each postconsonantal occurrence of the vowel /a/. The obvious connotation is that each occurrence of the vowel /a/ has in fact been nasalised, a reasonable assumption given the consonantal environment of the vowel. Conversely, it can be postulated that the results for N4 in Fig. 10b may simply be some artefact of the focusing technique itself.

To explore the confidence of the above results for nasal formants, both modified and unmodified pole focusing techniques as described are applied to the phrase 'our lawyer will allow your rule', consisting of purely voiced speech in which there is no apparent nasality, and spoken by a phonetically trained male talker. In Fig. 11a, employing the fixed model order, while the formant values correspond well to those expected, there are several segments with a low yield of formant values; specifically, at 1 s into the signal, during /l/ of 'lawyer', there is a low yield of values for F1. Similarly, at about 1.4 s, during transition from /oo/ to /r/ in 'your', F3 is undetectable below 1.4 kHz as it approaches F2. Using an increasing model order, however, Fig. 11b demonstrates that there is excellent yield for all formants F1 to F3. Most importantly, there is no trace of similar formant characteristics to those found in Fig. 10b, thereby decreasing the probability that such results are an artefact of the signal processing itself, and increasing the confidence in the interpretation that the vowel /a/ is also nasalised.

Fig. 11 Formant analysis of the phrase 'our lawyer will allow your rule' (horizontal axis: time, s)
a Modified pole focusing technique, p = 18
b Pole focusing technique, demonstrating increased formant yield

7 Summary and conclusions

Linear predictive coding is one of the most important signal processing techniques in use, both in speech analysis and in speech synthesis from low-bit-rate encoded speech. Its role as an analysis tool extends to several classes of speech recognition system, from isolated word recognition systems employing template matching technology to continuous speech recognition systems based on phonetic feature extraction and interpretation. However, standard LPC analysis traditionally suffers from a poor tolerance of high noise environments and, although otherwise adequate, is not generally regarded as a sufficiently sensitive analysis tool in the provision of detailed signal structure, as witnessed by its most popular use in medium-quality, low-bit-rate encoding systems. The standard use of the Fourier transform of the prediction error coefficient series with a fixed model order requires the concomitant use of digital pre-emphasis. The coalescence of these parameters, as has been demonstrated in this paper, renders the performance of standard LPC analysis highly sensitive to the exact choice of values in the overall parameter mix. A large model order will aid in separating spectral resonances which are closely coupled, but even with no pre-emphasis will increase the variability of centre frequency values from frame to frame, besides cluttering the spectral estimate with many formant-like peaks even in moderate noise environments of 30 dB SNR or so. With prior pre-emphasis applied, the same model order may be insufficient to allow detection of low-frequency formants close to the corner frequency of the pre-emphasis characteristic, and frame to frame variability of formant values increases still further. Pre-emphasis additionally renders the analysis extremely susceptible to the effects of noise. Even with the optimum mix of parameter values including pre-emphasis, use of the z-plane unit circle as the discrete Fourier transform search path does not facilitate detection of weak formant detail such as nasal formants. Even if off-axis search paths within the unit circle are employed for the DFT, the use of a fixed model order often does not provide an adequate yield of formant values to satisfy the conditions required for estimation of time-frequency formant trajectories by line tracking algorithms.

The pole focusing technique detailed in this paper is the antithesis of standard LPC analysis in terms of the conditions found to be necessary to facilitate optimum processing. No pre-emphasis is required, and formant decay and growth within the digital signal z-transform plane are accommodated by the use of several off-axis spectral analyses, coupled with the simultaneous use of a monotonically increasing LPC model order, many of whose values would render the results obtained from standard LPC analysis untenable. Paradoxically, when used with a simple averaging algorithm as given in Section 4, the pole focusing technique nonetheless provides a performance in terms of noise tolerance which is far in excess of that available from standard LPC analysis. As has been shown in Section 5, the improvement in noise immunity is at the very least some 14 dB SNR. The pole focusing technique has likewise been shown in Section 6 to offer improvements in feature resolution even in low noise environments normally highly favourable to optimum performance of standard LPC techniques. Besides improving the yield of formant values from frame to frame, the pole focusing mechanism has been demonstrated to be particularly adept at locating weak spectral detail such as nasal formants. The acoustic correlates of resonance in the nasal cavity are well established. First, nasal resonances and associated antiresonances are introduced into the overall vocal tract transfer function, leading to a modified power spectrum. The most commonly reported nasal formant frequencies are between 200 and 300 Hz and around 1 kHz [33], with another at around 2 kHz, their associated bandwidths being of the order of 300 Hz for the lowest nasal formant and increasing to 1 kHz for those near to 2.5 kHz [34], implying that the poles in the vocal tract transfer function which relate to nasal formants characteristically reside deep within the unit circle of the z-plane. Nasal antiresonances are each paired with a nasal formant, and values of between 500 and 700 Hz have been reported for the lowest [24], with another between 0.9 kHz and 1.8 kHz. Secondly, as regards the acoustic waveform overall, the most general effect is an overall loss of power, which can be directly attributed to the introduction of the nasal antiresonances. A further effect of nasal quality in otherwise unnasalised vowels is the detuning of existing vocal tract formants such that their bandwidths increase, with an attendant decrease in formant peak amplitude [35]. In particular, if the frequency location (i.e. z-plane phase argument) of an oral cavity pole roughly coincides with the position of a nasal cavity zero, then it is likely that there will be an attendant increase in formant centre frequency 'jitter', since formant bandwidth (and hence damping factor; see eqn. 18a) will be affected by any movement in the position of the nasal zero.
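The link between the quoted bandwidths and pole depth can be made concrete with the standard digital-resonator relation, in which a pole at radius r has a bandwidth of approximately B = -(fs/π) ln r. The sketch below is illustrative only: the 10 kHz sampling rate and the 50 Hz comparison bandwidth are assumptions chosen for the example, not values taken from this paper, and the function name is hypothetical.

```python
import math


def pole_radius(bandwidth_hz: float, fs_hz: float) -> float:
    """Radius of a z-plane pole whose resonance has the given bandwidth.

    Uses the standard resonator approximation B = -(fs / pi) * ln(r).
    """
    return math.exp(-math.pi * bandwidth_hz / fs_hz)


FS_HZ = 10_000.0  # assumed sampling rate, for illustration only

# 300 Hz and 1 kHz are the nasal formant bandwidths quoted in the text;
# 50 Hz is an assumed narrow oral-formant bandwidth for comparison.
for label, bw in [("narrow oral formant (assumed)", 50.0),
                  ("lowest nasal formant", 300.0),
                  ("nasal formant near 2.5 kHz", 1000.0)]:
    print(f"{label}: B = {bw:.0f} Hz -> |z| = {pole_radius(bw, FS_HZ):.3f}")
```

Under these assumptions, the wider the bandwidth the smaller the pole radius, so the nasal poles sit well inside the unit circle while a narrow oral pole sits close to it.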

The analysis results demonstrated in Section 6 in relation to the word 'nana' (Figs. 10a-c) certainly indicate that the pole focusing mechanism succeeds in procuring results which conform well to the known characteristics both of nasal consonants and nasalised vowels. Paradoxically, however, although the results provide significant detail in respect of the qualitative nature of the speech segment, the inherent sensitivity of the pole focusing mechanism to formant jitter, due to the coincidence of oral formant and nasal antiresonance, is likely to render calculation of formant trajectories through nasal speech fraught with errors. This is as opposed to unnasalised speech, where the increase in formant yield and formant separation capabilities will enhance the performance of formant tracking. It is suggested here that results from the pole focusing technique would be best supplemented by comparison against those from, say, a standard fixed-order LPC analysis, which is relatively insensitive to such formant jitter, to provide a workable approach to formant tracking. Detailed qualitative data as to the true nature of the segment under analysis can then additionally be deduced from the formant values provided by the pole focusing mechanism.
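The distinction drawn in this section between a unit-circle search path and an off-axis path can be illustrated with a small numerical sketch. It is not the pole focusing algorithm itself; it only evaluates a fixed all-pole model 1/A(z) on a circle of radius less than one. The sampling rate, pole radius, pole frequency, search radius and function names are all assumptions made for the example.

```python
import cmath
import math


def inverse_filter_response(a, radius, freq_hz, fs_hz):
    """|1 / A(z)| at z = radius * exp(j*2*pi*freq/fs), with A(z) = sum_k a[k] z^-k."""
    z = cmath.rect(radius, 2.0 * math.pi * freq_hz / fs_hz)
    a_of_z = sum(a_k * z ** (-k) for k, a_k in enumerate(a))
    return 1.0 / abs(a_of_z)


def peak_frequency(a, radius, fs_hz, n_points=1001):
    """Frequency (Hz) of the largest response on the circle |z| = radius."""
    freqs = [0.5 * fs_hz * i / (n_points - 1) for i in range(n_points)]
    return max(freqs, key=lambda f: inverse_filter_response(a, radius, f, fs_hz))


FS_HZ = 10_000.0        # assumed sampling rate
POLE_RADIUS = 0.80      # assumed heavily damped (wide-bandwidth) pole pair
POLE_FREQ_HZ = 1_000.0  # assumed pole frequency

# Second-order prediction-error filter with a conjugate pole pair.
theta = 2.0 * math.pi * POLE_FREQ_HZ / FS_HZ
a = [1.0, -2.0 * POLE_RADIUS * math.cos(theta), POLE_RADIUS ** 2]

on_axis = peak_frequency(a, 1.0, FS_HZ)    # conventional unit-circle search
off_axis = peak_frequency(a, 0.85, FS_HZ)  # circle inside the unit circle, just outside the pole
print(f"unit-circle peak: {on_axis:.0f} Hz, off-axis peak: {off_axis:.0f} Hz")
```

With these assumed values the off-axis peak lies closer to the true pole frequency than the unit-circle peak, which is displaced by the skirt of the conjugate pole. This is one way to see why a search path nearer to a deeply buried pole can resolve it more faithfully.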

In summary, the new pole focusing technique presented here offers major advantages over standard LPC analysis. Not least of all, the technique avoids the sensitivity to choice of model order which normally besets LPC analysis. In using an increasing model order, the pole focusing technique has been found to have an inherent insensitivity to choice of initial model on the unit circle. Results presented have related in the main to formant centre frequency, with a view to exploring likely effects on formant trajectory estimation. However, the technique can easily and simply be made to produce estimates of vocal tract pole position, although there has been no explicit use here of such data. The quality of the data produced would appear to offer extensive improvements over that obtainable from standard LPC analyses. However, much of the high-bandwidth, low-intensity resonance detail, such as nasal poles or formants, would not generally be manifest in traditional spectral sections or grey-scale spectrograms. Fullest use of the pole focusing technique as an analysis tool is therefore likely to require development of some means of visually encoding the qualitative nature of the estimated transfer function parameters, in particular the pole position inside the z-transform plane. However, this in no way denigrates the usefulness of the technique as an unsurpassed stand-alone research tool in speech analysis, and indeed for the parametric representation of any class of signal where significant information is encoded in resonance features of the signal spectrum.
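One practical point behind the use of an increasing model order is that, under the autocorrelation method of linear prediction, the Levinson-Durbin recursion produces the predictor for every intermediate order on the way to the highest one. The sketch below assumes that method (the implementation used in this paper is not reproduced here); the function name and the toy autocorrelation values are hypothetical.

```python
import math


def levinson_all_orders(r, max_order):
    """Levinson-Durbin recursion on autocorrelation values r[0..max_order].

    Returns a list whose entry p-1 is (a, err): the order-p prediction-error
    filter a = [1, a_1, ..., a_p] and its residual energy err.
    """
    a = [1.0]
    err = r[0]
    models = []
    for p in range(1, max_order + 1):
        # Reflection coefficient for the step from order p-1 to order p.
        acc = sum(a[k] * r[p - k] for k in range(p))
        k_p = -acc / err
        # Order update: a_new[i] = a[i] + k_p * a[p - i], with a[p] taken as 0.
        padded = a + [0.0]
        a = [padded[i] + k_p * padded[p - i] for i in range(p + 1)]
        err *= 1.0 - k_p * k_p
        models.append((a, err))
    return models


# Toy input: exact autocorrelation of a single resonance (assumed values).
radius, theta = 0.9, 2.0 * math.pi * 0.1
a1, a2 = -2.0 * radius * math.cos(theta), radius ** 2
r = [1.0, -a1 / (1.0 + a2)]
for lag in range(2, 5):
    r.append(-a1 * r[lag - 1] - a2 * r[lag - 2])

models = levinson_all_orders(r, max_order=4)
for order, (coeffs, err) in enumerate(models, start=1):
    print(order, [round(c, 4) for c in coeffs], round(err, 4))
```

For this single-resonance input the order-2 model already matches the generating coefficients and the higher-order terms stay at zero. With real speech frames each additional order would instead add further poles to the model.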

8 Acknowledgment

We wish to extend our kind thanks to Dr. J. Dalby and Dr. J. Harrington, Centre for Speech Technology Research, Edinburgh. This work has been supported by a UK Science & Engineering Research Council grant.

9 References

1 HOLBROOK, A., and FAIRBANKS, G.: 'Diphthong formants and their movements', J. Speech & Hear. Res., 1962, 5, pp. 38-58

2 LIBERMAN, A.M., DELATTRE, P.C., COOPER, F.S., and GERSTMAN, L.J.: 'The role of consonant-vowel transitions in the perception of the stop and nasal consonants', Psycholog. Monogr., 1954, 68, (8), pp. 1-13

3 DORMAN, M.F., STUDDART-KENNEDY, M., and RAPHAEL, L.J.: 'Stop-consonant recognition: release bursts and formant transitions as functionally equivalent, context dependent cues', Percept. Psycholog., 1977, 22, pp. 109-122

4 LINDBLOM, D.E.F., and STUDDART-KENNEDY, M.: 'On the role of formant transitions in vowel recognition', Q. Prog. & Status Rep., 1967, STL-QPSR-1, pp. 21-24, Speech Transmission Laboratory, Royal Institute of Technology

5 CARLSON, R., GRANSTROM, B., and PAULI, S.: 'Perceptive evaluation of segmental cues', ibid, 1972, STL-QPSR-1, pp. 18-24

6 MARKEL, J.D.: 'Digital inverse filtering - a new tool for formant trajectory estimation', IEEE Trans., 1972, AU-20, pp. 129-137

7 McCANDLESS, S.S.: 'An algorithm for formant extraction using linear prediction spectra', ibid, 1974, ASSP-22, pp. 135-141

8 PINSON, E.N.: 'Pitch-synchronous time-domain estimation of formant frequencies and bandwidths', J. Acoust. Soc. Am., 1963, 35, pp. 1264-1273

9 SCHAFER, R.W., and RABINER, L.R.: 'System for automatic formant analysis of voiced speech', ibid, 1970, 47, pp. 634-648

10 DUNN, H.K.: 'Methods of measuring vowel formant bandwidths', ibid, 1961, 33, pp. 1737-1746

11 COOKE, M.P.: 'A computer model of peripheral auditory processing'. NPL report DITC 58/85, National Physical Laboratory, May 1985

12 NEIDERJOHN, R.J., and LAHAT, M.: 'A zero-crossing consistency method for formant tracking of voiced speech in high noise levels', IEEE Trans., 1985, ASSP-33, pp. 349-355

13 YOUNG, E.D., and SACHS, M.B.: 'Representation of steady-state vowels in the temporal aspects of the discharge patterns of population of auditory-nerve fibres', J. Acoust. Soc. Am., 1979, 66

14 MAKHOUL, J.: 'Linear prediction - a tutorial review', Proc. IEEE, 1975, 63, pp. 561-581

15 FLANAGAN, J.L.: 'Difference limen for vowel formant frequency', J. Acoust. Soc. Am., 1955, 27, pp. 613-617

16 MERMELSTEIN, P.: 'Difference limens for formant frequencies for steady-state and consonant-bound vowels', ibid, 1978, 63, pp. 572-580

17 DENES, P.B., and PINSON, E.N.: 'The acoustic characteristics of speech', in 'The speech chain' (Bell Telephone Laboratories, 1963), Chap. 7

18 YULE, G.U.: 'On a method of investigating periodicities in disturbed series, with particular reference to Wolfer's sunspot numbers', Philos. Trans. R. Soc. London, 1927, 226, pp. 267-298

19 WALKER, G.: 'On periodicity in series of related terms', Proc. R. Soc. London, 1931, 131, pp. 518-532

20 LEVINSON, N.: 'The Weiner rms error criterion in filter design and prediction', J. Math. Phys., 1947, 25, pp. 261-278

21 VERBRUGGE, R.R., STRANGE, W., SHANKWEILER, D.P., and EDMAN, T.R.: 'What information enables a listener to map a talker's vowel space?', J. Acoust. Soc. Am., 1976, 60, pp. 198-212

22 DELATTRE, P.C., LIBERMAN, A.M., and COOPER, F.S.: 'Acoustic loci and transitional cues for consonants', ibid, 1955, 27, pp. 769-773

23 DICKSON, D.R.: 'An acoustic study of nasality', J. Speech & Hear. Res., 1962, 5, pp. 103-111

24 FUJIMURA, O.: 'Analysis of nasal consonants', J. Acoust. Soc. Am., 1962, 34, pp. 1865-1875

25 MILLER, R.L.: 'Nature of the vocal cord wave', ibid, 1959, 31, pp. 667-677

26 FANT, G.: 'Acoustic theory of speech production' (Mouton, The Hague, 1960)

27 WONG, D.Y., HSIAO, C.C., and MARKEL, J.D.: 'Spectral mismatch due to preemphasis in LPC analysis/synthesis', IEEE Trans., 1980, ASSP-28, pp. 263-264

28 WEINER, N.: 'Generalized harmonic analysis', Acta Math., 1930, 55, pp. 117-258

29 ANANTHAPADMANABHA, T.V., and YEGNANARAYANA, B.: 'Epoch extraction from linear prediction residual for identification of closed glottis interval', IEEE Trans., 1979, ASSP-27, pp. 309-319

30 STEIGLITZ, K., and DICKINSON, B.: 'The use of time-domain selection for improved linear prediction', ibid, 1977, ASSP-25, pp. 34-39

31 KUO, F.F.: 'Network analysis and synthesis' (J. Wiley & Sons, New York, 1966, 2nd edn.)

32 KUROWSKI, K., and BLUMSTEIN, S.E.: 'Perceptual integration of the murmur and formant transitions for place of articulation in nasal consonants', J. Acoust. Soc. Am., 1984, 76, pp. 383-390

33 HOUSE, A.S., and STEVENS, K.N.: 'Analog studies of the nasalization of vowels', J. Speech & Hear. Disorders, 1956, 21, pp. 218-231

34 HOUSE, A.S.: 'Analog studies of nasal consonants', ibid, 1957, 22, pp. 190-204

35 MARTONY, J.: 'The role of formant amplitudes in synthesis of nasal consonants', Q. Prog. Status Rep., 1964, STL-QPSR-3, pp. 28-31, Speech Transmission Laboratory, Royal Institute of Technology, Stockholm
