

IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS, VOL. AU-21, NO. 3, JUNE 1973

Real-Time Pitch Extraction by Adaptive Prediction of the Speech Waveform

JOSEPH N. MAKSYM

Abstract-With the exception of relatively sophisticated methods such as cepstrum analysis, the problem of reliable pitch-period extraction has remained largely unsolved. This paper examines the feasibility of pitch-period extraction by means of the nonstationary error process resulting from adaptive-predictive quantization of speech. A real-time hardware system that may be realized at low cost is described.

I. Introduction

One of the most important parameters in speech analysis, synthesis, and vocoder applications is the fundamental frequency, or pitch, of voiced speech. Its determination, unfortunately, is not easy, and this problem has occupied speech researchers for many years. Of the numerous systems for pitch extraction that have been proposed, none is free from deficiencies either in performance or in excessive complexity. Recently, a new technique based upon linear prediction of the speech waveform was proposed by Atal and Hanauer [1]. Their method consists of finding, by a least squares fitting procedure, that recursive digital filter whose impulse response approximates the speech waveform over the interval of analysis. When the recursive expression so derived is used to predict the waveform from its past sample values, the prediction error increases sharply at the onset of each vocal fold excitation pulse. Provided that the filter coefficients are periodically recomputed in accordance with the variations of the vocal tract during speech, the prediction error remains small, with the exception of pulses showing glottal excitation. These provide instantaneous pitch and voicing indication.

The method described above is very powerful, allowing a number of other parameters, such as formant frequencies and bandwidths, to be computed directly from the parameters of the resulting digital filter, but requires explicit measurement of correlations and

Manuscript received November 1, 1972.
The author was with the Department of Electrical Engineering, Carleton University, Ottawa, Ont., Canada. He is now with the Defence Research Establishment Atlantic, Dartmouth, N.S., Canada.

Fig. 1. Alternative structures suitable for use in pitch extraction. Panels (a), (b), and (c).

matrix inversion. In this paper it will be shown that pitch extraction, as well as the detection of voiced speech, is obtainable from simpler systems, such as the predictive quantizers shown in Fig. 1, provided that suitable algorithms for adaptive adjustment of the predictor coefficients are used. These are developed later in the paper.

II. Pitch-Period Extraction

The known techniques for determining the fundamental frequency of speech may be divided into two categories: those medically oriented methods that attempt direct measurement of the vocal fold closure, such as that described in a recent paper by Fourcin and Aberton [2]; and signal processing methods operating on the speech waveform. Some of the more successful of the latter are the following: 1) filtering to extract the fundamental; 2) nonlinear processing to accentuate peaks in the waveform; 3) pattern recognition; 4) cepstrum analysis; 5) spectrum flattening; and 6) linear prediction of the waveform.

An old, but still used, technique makes use of a fixed low-pass filter that passes the fundamental component of the waveform but suppresses all harmonics. Such a system has several obvious faults: first, the fundamental component of voiced speech is often weak or absent, as in the case of telephone speech; secondly, a fixed filter cannot at the same time satisfy



the requirements of male and female speakers whose pitch may differ by as much as three octaves; and finally, the relatively long response time of such a filter implies that the resulting pitch-period indications will appear some time after the onset of voicing, and some short-voiced plosives may not produce any output at all. Tracking filters solve some of these problems, but have difficulty in following rapid pitch variation, and are still plagued by the response time delay.

A more instantaneous method is to extract pitch markers directly from the periodic peaks in the waveform. The detection of single high-amplitude peaks is aided by a zero memory nonlinearity, such as a cubic function. This method fails, however, if no single distinct peak is present, and also during transitions between phonemes.

Pattern recognition that attempts to mimic the human ability to supply pitch markers on the speech waveform has been demonstrated by Gold [3]. Implementation of pattern recognition methods is complex because of the necessity for measurement and processing of a large number of features in order to achieve reliable operation. Somewhat simpler pattern recognition methods have been suggested for use on the speech spectrum by Schroeder [4] and by Harris and Weiss [5].

Pitch extraction by double-spectrum analysis was demonstrated by Noll [6], using the cepstrum technique of Bogert, Healy, and Tukey [7]. The cepstrum analysis method is readily described in terms of the simplified model of the voiced speech process, which assumes a slowly varying linear system driven by a quasi-periodic train of glottal pulses. The short-term speech spectrum is, then, composed of the closely spaced harmonics of the pulse train, denoted by $U(\omega)$, multiplied by the transfer function $G(\omega)$ of the linear system as follows:

$$X(\omega) = G(\omega) \cdot U(\omega). \tag{1}$$

By taking the logarithm of the spectral magnitudes, the multiplicative relation is converted to a sum. Thus

$$\log |X(\omega)| = \log |G(\omega)| + \log |U(\omega)|. \tag{2}$$

The cepstrum is the inverse Fourier transform of (2), and displays the periodicities in the spectrum associated with $U(\omega)$ and $G(\omega)$. The periodicity corresponding to $U(\omega)$ appears as a single isolated component in the cepstrum at a position in quefrency corresponding to the pitch period. To date, cepstrum analysis provides the most accurate and reliable source of pitch information at a cost of relatively high complexity of implementation. Cepstrum analysis, nevertheless, has a few deficiencies that are important in some applications. It indicates the pitch period as an average value for the segment of speech waveform analyzed (typically several pitch periods) and not the epoch of the glottal pulses. Furthermore, if the analyzed segment is shortened to a pitch period or less, the pitch indication becomes erratic.
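The cepstrum procedure of (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [6]: the function names, the Hamming window, the 60-400 Hz search range, and the two-pole resonator standing in for $G(\omega)$ are all assumptions introduced here.

```python
import numpy as np

def synth_voiced(n_samples, period, r=0.95, theta=0.55):
    """Hypothetical test signal: a unit impulse train (one pulse every
    `period` samples) driving a two-pole resonator that stands in for G."""
    u = np.zeros(n_samples)
    u[::period] = 1.0
    a1, a2 = 2.0 * r * np.cos(theta), -r * r
    x = np.zeros(n_samples)
    for n in range(n_samples):
        x[n] = u[n]
        if n >= 1:
            x[n] += a1 * x[n - 1]
        if n >= 2:
            x[n] += a2 * x[n - 2]
    return x

def cepstrum_pitch_period(x, fs, f_lo=60.0, f_hi=400.0):
    """Return the pitch period in seconds from the largest cepstral peak
    inside the quefrency range that corresponds to f_lo..f_hi (assumed)."""
    frame = np.asarray(x, dtype=float) * np.hamming(len(x))
    # Log magnitude spectrum, as in eq. (2); the small constant avoids log(0).
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)
    # Inverse transform of the log spectrum gives the (real) cepstrum.
    cep = np.fft.irfft(log_mag, n=len(frame))
    q_lo, q_hi = int(fs / f_hi), int(fs / f_lo)
    peak = q_lo + int(np.argmax(cep[q_lo:q_hi]))
    return peak / fs

if __name__ == "__main__":
    fs = 8000
    x = synth_voiced(1024, period=80)          # 100-Hz pitch at 8 kHz
    print(cepstrum_pitch_period(x, fs) * fs)   # period in samples
```

Note that the estimate is a single value for the whole 1024-sample frame, which is the averaging behavior described above: the method reports the period, not the epoch of each pulse.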

A number of more recent methods for pitch extraction, including the spectrum flattening technique of Sondhi [8] and linear prediction of the waveform by Atal and Hanauer [1], involve recovery of the pulse train $u(t)$ from the speech waveform $x(t)$. Both methods indicate the epoch of occurrence of the glottal excitation. Sondhi's technique is readily understood by rewriting $G(\omega)$ in (1) as magnitude and phase functions so that

$$X(\omega) = |G(\omega)|\, e^{-j\theta(\omega)}\, U(\omega). \tag{3}$$

This suggests that $u(t)$ might be recovered by a spectral decomposition of $x(t)$, followed by scaling and phase shifting of the spectral components and summation of the results. The method is difficult to implement since both $|G(\omega)|$ and $\theta(\omega)$ must be adaptively estimated. Estimation of $\theta(\omega)$ may be avoided at the cost of using autocorrelation analysis on the spectrum flattened signal, however.

Pitch Extraction by Linear Prediction

The model for voiced speech implied by (1) may be expressed in terms of a recursive digital filter excited once per pitch period by an impulse. Following a terminology similar to that of [1],

$$X(z) = \frac{1}{1 - A(z)} \cdot U(z). \tag{4}$$

If a linear predictor is used for prediction of the next signal sample $x_n$ according to

$$\hat{x}_n = \sum_{i=1}^{m} a_i x_{n-i}, \tag{5}$$

the transform of the error may be written as

$$E(z) = X(z) - B(z) X(z) \tag{6}$$

where $B(z)$ is the transform of the predictor. After substitution for $X(z)$,

$$E(z) = \frac{1 - B(z)}{1 - A(z)} \cdot U(z). \tag{7}$$

If $B(z) \approx A(z)$, the error signal $e(t)$, recovered by low-pass filtering of the error sequence $\{e_n\}$, approximates the exciting pulse train $u(t)$, and may be used to extract pitch information. There is an obvious similarity to the spectrum-flattening and phase-shifting methods, but estimation of $|G(\omega)|$ and $\theta(\omega)$ is now replaced by predictor adaptation, which is capable of simple implementation.

III. Differential Systems for Pitch Extraction

Examination of the linear prediction method of pitch extraction reveals that two conditions must be



satisfied: low prediction error between glottal excitations and during unvoiced speech segments, and high prediction error at the onset of glottal excitation. These conditions can be met by the differential quantizers shown in Fig. 1, provided that the coefficients $a_i$ (not generally the same for the three different configurations) are suitably adjusted to follow the syllabic variation of the speech source. For ease of reference, the three configurations of Fig. 1 will hereafter be referred to as systems (a), (b), and (c) in conformance with the labeling on the figure.
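The two conditions can be illustrated with the unquantized linear predictor of the preceding section. The sketch below is an added illustration under stated assumptions, not the paper's system: it synthesizes a signal from a hypothetical four-pole filter excited by unit impulses, fits a fourth-order predictor by ordinary least squares (in the spirit of [1]), and computes the prediction error. All function names and the pole positions are invented for the example.

```python
import numpy as np

def example_coefficients():
    """Hypothetical stable four-pole model: two resonators in cascade.
    Returns a such that x[n] = u[n] + sum_i a[i] * x[n-1-i]."""
    d1 = np.array([1.0, -2 * 0.97 * np.cos(0.5), 0.97 ** 2])
    d2 = np.array([1.0, -2 * 0.95 * np.cos(1.5), 0.95 ** 2])
    return -np.convolve(d1, d2)[1:]

def all_pole_voiced(n_samples, period, a, first_pulse=40):
    """Recursive filter excited once per pitch period by a unit impulse."""
    u = np.zeros(n_samples)
    u[first_pulse::period] = 1.0
    x = np.zeros(n_samples)
    for n in range(n_samples):
        acc = u[n]
        for i in range(len(a)):
            if n - 1 - i >= 0:
                acc += a[i] * x[n - 1 - i]
        x[n] = acc
    return x, u

def least_squares_predictor(x, m):
    """Fit m predictor coefficients to the whole record by least squares."""
    rows = np.array([x[n - m:n][::-1] for n in range(m, len(x))])
    coeffs, *_ = np.linalg.lstsq(rows, x[m:], rcond=None)
    return coeffs

def prediction_error(x, coeffs):
    """e[n] = x[n] - sum_i coeffs[i] * x[n-1-i]; the first m samples stay 0."""
    m = len(coeffs)
    e = np.zeros(len(x))
    for n in range(m, len(x)):
        e[n] = x[n] - np.dot(coeffs, x[n - m:n][::-1])
    return e

if __name__ == "__main__":
    x, u = all_pole_voiced(800, 80, example_coefficients())
    e = prediction_error(x, least_squares_predictor(x, 4))
    largest = np.sort(np.argsort(np.abs(e))[-10:])
    print(largest)   # indices of the ten largest error samples
```

In this idealized, noise-free setting the largest error samples coincide with the excitation instants, while the error between pulses stays small. Real speech, quantization, and a time-varying vocal tract make the separation less clean, which is why the paper relies on adaptive adjustment of the coefficients.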

Differential quantizers have the well-known ability to encode with low quantization error signal sequences that exhibit high correlation between samples. Such is the case for speech during voiced segments between glottal excitations, and for unvoiced segments provided that the sampling rate is sufficiently high. In this paper, sampling rates in the range 40-60 kHz are considered. It should be noted that these high sampling rates are in direct contrast to the 10-kHz or lower rates used by Atal and Hanauer, and allow low prediction error to be realized with relatively crude predictors and predictor adaptation algorithms.
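As a rough worked illustration of why the higher sampling rate helps (an added example, not taken from the paper): for a sinusoidal component at frequency $f$ sampled at rate $f_s$, the normalized correlation between adjacent samples is $\rho_1 = \cos(2\pi f / f_s)$. Taking $f = 3.1$ kHz, the cutoff of the input filter in the experimental system, as an assumed worst case gives

$$f_s = 10\ \text{kHz}: \ \rho_1 \approx -0.37, \qquad f_s = 40\ \text{kHz}: \ \rho_1 \approx 0.88, \qquad f_s = 60\ \text{kHz}: \ \rho_1 \approx 0.95.$$

Under this simplified single-sinusoid assumption, adjacent samples remain strongly correlated at 40-60 kHz even at the top of the retained band, whereas at 10 kHz they do not, which is consistent with the remark that crude predictors suffice at the higher rates.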

Referring to system (a) of Fig. 1, and denoting the quantization error at the $n$th sampling instant by $q_n$, we have

$$\tilde{x}_n = y_n + \hat{x}_n = x_n + q_n. \tag{8}$$

The predictor output sample may then be expressed as

$$\hat{x}_n = \sum_{i=1}^{m} a_i (x_{n-i} + q_{n-i}). \tag{9}$$

This is identical to (5) for small quantization error, as would occur if the quantization were sufficiently fine. The predictor in system (a) is complicated by the necessity to operate upon the samples $\{\tilde{x}_{n-i} : i = 1, \ldots, m\}$, since these must be stored as many-bit digital numbers. This problem is avoided by system (b), which in simplest form uses a binary quantizer and a binary shift register to store the samples $\{y_{n-i} : i = 1, 2, \ldots, m\}$. Assuming identical signal sequences, which we can write in transform notation as $X(z)$, and constraining the resulting prediction sequences $\hat{X}(z)$ to be identical, we obtain

$$A_b(z) = \frac{A_a(z)}{1 - A_a(z)} \tag{10}$$

where $A_a(z)$, $A_b(z)$ are the transfer functions of the digital filter blocks for systems (a) and (b), respectively. It should be noted that the number of coefficients in system (b) is not necessarily finite, but that in practice, some finite number suffices to give predictor performance that is only slightly worse than that for system (a).

For input signals whose samples are highly correlated, system (b) requires a large number of coefficients to achieve a low prediction error. System (c) avoids this problem by including an integrator as part of the predictor. Increasing the sampling rate in system (c) has, therefore, the effect of reducing prediction error even if the number of coefficients is small.

Iterative Adjustment of the Predictor

Adaptive adjustment algorithms for the predictor coefficients in systems (a), (b), and (c) of Fig. 1 may be derived by consideration of mean-squared prediction error. For system (a) this may be written

$$\text{mse} = \mathcal{E}\{(x_n - A^T X_n)^2\} \tag{11}$$

where $A = (a_1, \ldots, a_m)^T$ and $X_n = (x_{n-1}, \ldots, x_{n-m})^T$.

Since (11) is a downward convex quadratic hypersurface in the coefficient values, its minimum is attained when the gradient is zero. That is,

$$-\tfrac{1}{2} \nabla\, \text{mse} = \mathcal{E}\{(x_n - A^T X_n) X_n\} = 0. \tag{12}$$

Alternatively, recognizing that the term in the inner parentheses is just $e_n$,

$$-\tfrac{1}{2} \nabla\, \text{mse} = \mathcal{E}\{e_n X_n\} = 0. \tag{13}$$

Theoretically, one could obtain the optimum coefficient vector by measurement of the expected values in (12) and solution of the resulting matrix equation. A more readily implemented recursive algorithm may be derived, however, by noting that (13) is a regression function whose root is the optimum coefficient vector. Selection of a small positive constant $v$ as a step-size parameter yields the following recursive algorithm for the coefficient vector:

$$A(n+1) = A(n) + v\, e_n X_n. \tag{14}$$

A number of modified algorithms in which $e_n$ is replaced by the quantized value $y_n$, or in which the instantaneous gradient term $e_n X_n$ is replaced by its sign, are also possible and lead to essentially the same result for the coefficient vector. This is the case since the regression functions $\mathcal{E}\{y_n X_n\}$ and $\mathcal{E}\{\operatorname{sgn}(e_n X_n)\}$ have essentially the same root, as can be shown by simulating the system and measuring the expectations [9].

An identical development for system (b) of Fig. 1 yields the following modified algorithm:

$$A(n+1) = A(n) + v\, y_n Y_n, \tag{15}$$

while for system (c),

$$A(n+1) = A(n) + v\, y_n \sum_{i=0}^{n} Y_{n-i}. \tag{16}$$

The increment in the coefficient vector as given by (16) for system (c) is a function of all the past quantized error samples. For a nonstationary signal source model, this is undesirable, and in fact, it is found that



Fig. 2. Block structure of experimental system. Double lines indicate vector-valued quantities. Blocks shown: 3.1-kHz low-pass filter, pitch waveform extractor, basic clock, data storage, coefficient storage, integrator, arithmetic unit, digital-to-analog converter. Signals shown: speech $x(t)$, predicted waveform $\hat{x}(t)$, prediction error $x(t) - \hat{x}(t)$, binary output, updated tap gains.

dependence only upon the most recent vector $Y_n$ yields the lowest prediction error [9]. Accordingly, algorithm (15) will be used in the experimental pitch extractor to be described next.

IV. Hardware Implementation

An embodiment of system (c) suitable for real-time pitch extraction is shown in Fig. 2. Eight predictor coefficients, each as an eight-bit binary number, are stored in a recirculating shift-register memory that is clocked at a nominal 2-MHz rate. Thus, each coefficient is presented at the memory output in turn after each successive clock pulse. A similar shift register stores the eight most recent binary quantized prediction error samples. This allows coefficient adaptation and formation of the integrator input in 16 pulses of the 2-MHz clock. A single 12-bit adder is used for both functions: coefficient incrementation by the addition or subtraction of one least significant bit as required by algorithm (15) during the first eight clock pulses, and formation of the integrator input by summation of coefficients with signs determined by the quantized error samples in the data store during the next eight clock pulses. The 2-MHz clock is then shut off in readiness for the next pulse from the basic clock, which operates at a 40-kHz rate. Integration takes place over an interval approximately 15 μs in length prior to the next basic clock pulse, at which time a comparator and bistable form the new binary sample $y_n$.

The prediction error $x(t) - \hat{x}(t)$ is low-pass filtered by an eight-pole 3.1-kHz cutoff Butterworth active filter whose output is used to extract pitch information. A similar filter at the speech waveform input selects that part of the energy which is significant to pitch extraction, while suppressing much of the energy in unvoiced speech that is known to be concentrated at higher frequencies. This filter, therefore, aids in keeping the prediction error small during unvoiced segments of speech.
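The adaptation loop described in the first paragraph of this section can be imitated in software. The sketch below is a loose simulation under explicit assumptions, not a model of the actual circuit: the function name, the integrator gain `scale`, the starting coefficient values, the initial contents of the data store, and the exact ordering of operations within one sample period are all hypothetical. It follows the text only in using eight coefficients held to eight bits, a one-bit quantizer, a one-LSB sign update in the manner of (15), and an integrator fed by the signed sum of the coefficients.

```python
import numpy as np

def one_bit_adaptive_predictor(x, m=8, coeff_bits=8, scale=1e-3):
    """Simulate a one-bit predictive quantizer with an integrating predictor.

    Returns (y, xhat, a): the binary output sequence, the predicted
    waveform seen by the comparator, and the final integer coefficients.
    """
    lo = -(1 << (coeff_bits - 1))
    hi = (1 << (coeff_bits - 1)) - 1
    a = np.zeros(m, dtype=np.int64)
    a[0] = 16                                # assumed start value
    store = np.ones(m, dtype=np.int64)       # assumed initial data store
    xhat = 0.0
    y = np.empty(len(x), dtype=np.int64)
    xhat_out = np.empty(len(x))
    for n, xn in enumerate(x):
        xhat_out[n] = xhat
        yn = 1 if xn >= xhat else -1         # comparator and bistable
        y[n] = yn
        # One-LSB increment or decrement of each coefficient: the sign of
        # y_n times the stored past samples, clipped to the word length.
        a = np.clip(a + yn * store, lo, hi)
        # Shift the new binary sample into the data store.
        store = np.concatenate(([yn], store[:-1]))
        # Integrator input: coefficients summed with signs from the store.
        xhat += scale * float(np.dot(a, store))
    return y, xhat_out, a

if __name__ == "__main__":
    t = np.arange(2000)
    x = np.sin(2 * np.pi * t / 200.0)
    y, xhat, a = one_bit_adaptive_predictor(x)
    e = x - xhat                             # prediction error before filtering
    print(a, float(np.mean(e ** 2)))
```

As a check on the timing quoted in the text: 16 pulses of a 2-MHz clock take 8 μs, and the 40-kHz basic clock has a 25-μs period, which leaves about 17 μs per sample and is consistent with an integration interval of roughly 15 μs.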

The digital implementation of Fig. 2 is by no means the simplest or least expensive. It is possible, by storing coefficient values as analog voltages on capacitors or integrators, to eliminate the 2-MHz clock and associated control circuitry. Because of the various feedback and adaptive loops, the performance is not overly sensitive to the usual tolerances of analog components, and a substantial decrease in overall cost and complexity may be achieved.

V. Performance

The waveforms obtained from the experimental system for typical voiced speech inputs are shown in Figs. 3 and 4. Fig. 3 shows the waveform at various points in the system for the phoneme /i/, as in the word deed. The time scale in the figure is 2 ms/div, while the amplitude scale is 5 V/div. Fig. 3 demonstrates the bursts of error that occur in the system during glottal excitation, and which, after further simple processing, may be used to extract pitch information. It was found in tests with a wide selection of voiced phonemes as input that the duration of the prediction error burst is longest for the phoneme /i/. However, even in this case, the epoch of glottal excitation may be determined without ambiguity if the envelope of the error is used to trigger, for example,



a threshold detector which then produces a standard pulse. It is interesting to note that the speech waveform in Fig. 3 contains a large amplitude sinusoidal component at approximately twice the fundamental frequency, but that only one error burst is produced by the system. The correct pitch is therefore determined, whereas methods that use low-pass filters or nonlinear distortion of the speech waveform would have a tendency to indicate twice the actual pitch.

Fig. 3. Waveforms in the experimental pitch extractor for the phoneme /i/ in beet. (a) $x(t)$, $-\hat{x}(t)$, integrator input, $e(t)$ filtered. (b) $x(t)$, $e(t)$, $e(t)$ filtered, $\{y_n\}$.

Fig. 4. Speech waveform, error waveform, squared-error waveform, and pulse output of experimental pitch extraction system. (a) The phoneme /ae/ in hat. (b) The phoneme /i/ in beet. (c) The phoneme /e/ in bet. (d) The phoneme /ɔ/ in bought.

Examples of the application of a square-law nonlinearity to the error waveform are shown in Fig. 4. The effect is to accentuate peaks in the waveform and invert negative pulses while suppressing much of the low-amplitude noise. The four pictures shown in Fig. 4 were obtained with the voiced phonemes /ae/, /i/, /e/, and /ɔ/ spoken in context by a male voice. The time scale is 2 ms/div, and the amplitude scales in all four parts of the figure are: 5 V/div for the speech waveform, 10 V/div for the error, 2 V/div for the output of the error squaring circuit, and 5 V/div for the pulse output. Although the pictures in Fig. 4 were obtained with the squared error pulses driving a Schmitt trigger directly, it probably would be safer to first obtain the envelope of the squared-error waveform, as this would avoid the danger of ambiguity for the type of error waveform shown in Fig. 3.

A number of tests were conducted to determine if false pitch output pulses would be produced for unvoiced speech. These were not observed, either for unvoiced segments in the context of normal speech or when the system was deliberately excited with high-amplitude unvoiced phonemes such as /s/ in see or /ʃ/ in she. These tests, although they admittedly do not allow an objective comparison between the experimental system and other forms of pitch extraction, do indicate that the experimental system shows promise as a voicing detector at the same time as it extracts pitch.
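The squaring and triggering stage, and the reason an envelope may be the safer input to the trigger, can be sketched as follows. This is an illustrative software analogue, not the circuit used in the experiments: the function names, the threshold values, and the peak-hold envelope with exponential decay are assumptions chosen for the example.

```python
import numpy as np

def schmitt_onsets(signal, high, low):
    """Two-threshold (hysteresis) detector. Returns the sample indices at
    which the detector switches on, i.e. where an output pulse would start."""
    onsets = []
    state = False
    for n, v in enumerate(signal):
        if not state and v >= high:
            state = True
            onsets.append(n)
        elif state and v <= low:
            state = False
    return onsets

def peak_hold_envelope(signal, decay=0.9):
    """Assumed simple envelope: follow peaks instantly, decay exponentially."""
    env = np.empty(len(signal))
    level = 0.0
    for n, v in enumerate(signal):
        level = max(float(v), level * decay)
        env[n] = level
    return env

if __name__ == "__main__":
    # Hypothetical error waveform: two bursts, each swinging through zero.
    e = np.zeros(400)
    e[50:53] = [1.0, 0.0, -0.9]
    e[250:253] = [1.0, 0.0, -0.9]
    squared = e ** 2                         # square-law nonlinearity
    print(schmitt_onsets(squared, high=0.5, low=0.1))
    print(schmitt_onsets(peak_hold_envelope(squared), high=0.5, low=0.1))
```

With this made-up burst shape, triggering directly on the squared error fires twice per burst because the waveform passes through zero, while triggering on the envelope fires once per burst.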



VI. Conclusions

A new technique for pitch extraction and voicing indication has been described. It operates by performing short-term prediction of the speech waveform, and using the resultant prediction error to detect the presence of glottal excitation. It was determined that the proposed method has several useful features. Among them are: ease of implementation, ability to respond quickly to glottal excitations at the beginning of words, insensitivity to unvoiced speech sounds, indication of the epoch of glottal excitation and not simply the period, ability to follow rapid pitch changes, and the ability to operate on waveforms where the fundamental is weak or absent. These features, many of which are not present in the pitch-extraction techniques listed in Section II, would recommend the proposed system for such applications as pitch and voicing input for vocoders, pitch extraction for speech analysis and processing, and speech aids for the deaf. Further research including extensive performance comparisons between the proposed system and other methods is recommended, however, before an objective evaluation of the new system can be made.

Acknowledgment

The author wishes to thank Dr. D. A. George for suggesting predictive encoding as a promising area for research. He also wishes to thank Dr. L. R. Morris for useful discussions on speech processing.

References

[1] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Amer., vol. 50, no. 2, part 2, 1971.
[2] A. J. Fourcin and E. Aberton, "First applications of a new laryngograph," Med. Biol. Illus., vol. 21, July 1971.
[3] B. Gold, "Computer program for pitch extraction," J. Acoust. Soc. Amer., vol. 34, July 1962.
[4] M. R. Schroeder, "Period histogram and product spectrum: New methods for fundamental frequency measurement," J. Acoust. Soc. Amer., vol. 43, no. 4, 1968.
[5] C. M. Harris and M. R. Weiss, "Pitch extraction by computer processing of high resolution Fourier analysis data," J. Acoust. Soc. Amer., vol. 35, Mar. 1963.
[6] A. M. Noll, "Short-time spectrum and cepstrum techniques for vocal-pitch detection," J. Acoust. Soc. Amer., vol. 36, Feb. 1964.
[7] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, Time Series Analysis. New York: Wiley, 1963, ch. 15.
[8] M. M. Sondhi, "New methods of pitch extraction," IEEE Trans. Audio Electroacoust., vol. AU-16, pp. 262-266, June 1968.
[9] J. N. Maksym, "Iterative adjustment of predictive quantizers," Ph.D. dissertation, Dep. Elec. Eng., Carleton Univ., Ottawa, Ont., Canada, 1972.

Application of a Digital Inverse Filter for Automatic Formant and $F_0$ Analysis

JOHN D. MARKEL

Abstract-In this paper, a new algorithm based upon a digital inverse filter formulation is presented for automatically determining VU, a voiced-unvoiced decision (VU = 0 during unvoiced speech and VU = 1 during voiced speech), $F_0$, the fundamental frequency, and $F_i$, $i = 1, 2, 3$, the first three formant frequencies, as a function of time. Formant trajectory estimates are obtained for all speech sounds that satisfy VU = 1.

Manuscript received April 30, 1972. This work was supported by the Office of Naval Research under Contract N00014-67-C-0118 with the Speech Communications Laboratory, Santa Barbara, Calif. 93101.
The author is with the Speech Communications Laboratory (SCRL), Santa Barbara, Calif. 93101.

Introduction

The purpose of this paper is to present a new algorithm for automatically extracting the first three formant frequencies for voiced male speech and the fundamental frequency. Explicit in the fundamental frequency extraction is VU, a voiced-unvoiced decision.

The central element in the analysis is the digital inverse filter. Based upon the first $M + 1$ terms of the input autocorrelation sequence, coefficients of an $M$th degree, all-zero digital filter are calculated. The formant trajectory estimates for each frame are based solely upon the locations of the local minima of the corresponding spectrum of the resultant inverse filter. The VU decision is determined by the amplitude of the largest peak of the normalized autocorrelation sequence of the output of the inverse filter (excluding the origin). If VU = 1, then $F_0$ is defined as the reciprocal of the peak location.

Brief Review of Digital Inverse Filter Formulation

The following formulation has been proposed for extracting the resonance behavior from a sequence of preemphasized speech data $\{x_n\}$ [1]. Given a digital inverse filter
