
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-24, NO. 5, OCTOBER 1976 399

A Comparative Performance Study of Several Pitch Detection Algorithms

LAWRENCE R. RABINER, FELLOW, IEEE, MICHAEL J. CHENG, STUDENT MEMBER, IEEE, AARON E. ROSENBERG, MEMBER, IEEE, AND CAROL A. McGONEGAL

Abstract: A comparative performance study of seven pitch detection algorithms was conducted. A speech data base, consisting of eight utterances spoken by three males, three females, and one child, was constructed. Telephone, close-talking microphone, and wideband recordings were made of each of the utterances. For each of the utterances in the data base, a "standard" pitch contour was semiautomatically measured using a highly sophisticated interactive pitch detection program. The "standard" pitch contour was then compared with the pitch contour that was obtained from each of the seven programmed pitch detectors. The algorithms used in this study were 1) a center clipping, infinite-peak clipping, modified autocorrelation method (AUTOC), 2) the cepstral method (CEP), 3) the simplified inverse filtering technique (SIFT) method, 4) the parallel processing time-domain method (PPROC), 5) the data reduction method (DARD), 6) a spectral flattening linear predictive coding (LPC) method, and 7) the average magnitude difference function (AMDF) method. A set of measurements was made on the pitch contours to quantify the various types of errors which occur in each of the above methods. Included among the error measurements were the average and standard deviation of the error in pitch period during voiced regions, the number of gross errors in the pitch period, and the average number of voiced-unvoiced classification errors. For each of the error measurements, the individual pitch detectors could be rank ordered as a measure of their relative performance as a function of recording condition and pitch range of the various speakers. Performance scores are presented for each of the seven pitch detectors based on each of the categories of error.

I. INTRODUCTION

A PITCH DETECTOR is an essential component in a variety of speech processing systems. Besides providing valuable insights into the nature of the excitation source for speech production, the pitch contour of an utterance is useful for recognizing speakers [1], [2], for speech instruction to the hearing impaired [3], and is required in almost all speech analysis-synthesis (vocoder) systems [4]. Because of the importance of pitch detection, a wide variety of algorithms for pitch detection have been proposed in the speech processing literature (e.g., [5]-[11]). In spite of the proliferation of pitch detectors, very little formal evaluation and comparison among the different types of pitch detectors has been attempted. There are a wide variety of reasons why such an evaluation has not been made. Among these are the difficulty in selection of a reasonable standard of comparison; collection of a comprehensive data base; choice of pitch detectors for evaluation; and the difficulty in interpreting the results in a meaningful and unbiased way. This paper is a report on an attempt to provide such a performance evaluation of seven pitch detection algorithms. Although a wide variety of alternatives were available in almost every aspect of this investigation, several arbitrary (but hopefully reasonable) decisions were made to limit the scope of the performance evaluation to a reasonable size.

Manuscript received March 3, 1976; revised June 7, 1976. The authors are with Bell Laboratories, Murray Hill, NJ 07974.

In this section we provide an overview of the investigation. We begin with a discussion of the general problems and issues in pitch detection. Then we discuss the various types of pitch detection algorithms which have been proposed and review their general characteristics. We conclude with a discussion of the types of performance measures which would be suitable for various applications.

In Section II the detailed implementations of the seven pitch detectors used in this study are reviewed. Included in this section is a brief discussion of the method of operation of each pitch detector and the method used to make a voiced-unvoiced classification. In Section III we discuss the data base selected for evaluating the seven pitch detectors. In Section IV the method used to measure the standard pitch contour is outlined. Section V presents the results of several error analyses. Section VI provides a discussion of the error analyses and Section VII discusses the computational considerations in the implementation of each of the algorithms.

A. Problems in Pitch Detection

Accurate and reliable measurement of the pitch period of a speech signal from the acoustic pressure waveform alone is often exceedingly difficult for several reasons. One reason is that the glottal excitation waveform is not a perfect train of periodic pulses. Although finding the period of a perfectly periodic waveform is straightforward, measuring the period of a speech waveform, which varies both in period and in the detailed structure of the waveform within a period, can be quite difficult. A second difficulty in measuring pitch period is the interaction between the vocal tract and the glottal excitation. In some instances the formants of the vocal tract can alter significantly the structure of the glottal waveform so that the actual pitch period is difficult to detect. Such interactions generally are most deleterious to pitch detection during rapid movements of the articulators when the formants are also changing rapidly. A third problem in reliably measuring pitch is the inherent difficulty in defining the exact beginning and end of each pitch period during voiced speech segments. The choice of the exact beginning and ending locations of the pitch period is often quite arbitrary. For example, based on the acoustic waveform alone, some candidates for defining the beginning and end of the period include the maximum value during the period, the zero crossing prior to the maximum, etc. The only requirement on such a measurement is that it be consistent from period to period in order to be able to define the "exact" location of the beginning and end of each pitch period. The lack of such consistency can lead to spurious pitch period estimates. Fig. 1 shows two possible choices for defining a pitch marker directly based on waveform measurements. The two waveform measurements shown in Fig. 1 can (and often will) give slightly different values for the pitch period. The pitch period discrepancies are due not only to the quasiperiodicity of the speech waveform, but also to the fact that peak measurements are sensitive to the formant structure during the pitch period, whereas zero crossings of a waveform are sensitive to the formants, noise, and any dc level in the waveform. A fourth difficulty in pitch detection is distinguishing between unvoiced speech and low-level voiced speech. In many cases transitions between unvoiced speech segments and low-level voiced speech segments are very subtle and thus are extremely hard to pinpoint.

In addition to the difficulties in measuring pitch period discussed above, additional complications occur when one is faced with the problem of pitch extraction of speech that has been transmitted through the telephone system. Many systems in which pitch detection is required must process telephone-quality speech. The effects of the telephone system on speech include linear filtering, nonlinear processing, and the addition of noise to the speech signal. With regard to linear filtering, the telephone system acts like a bandpass filter (low-frequency cutoff of approximately 200 Hz, high-frequency cutoff of approximately 3200 Hz) which can significantly attenuate the fundamental pitch frequency and many of the higher pitch harmonics. The result is that the periodicity of the signal is much harder to detect. Nonlinear contributions of the telephone system to the speech signals can, depending on the particular transmission system used, include the following.

1) Phase distortion.
2) Fading or amplitude modulation of the speech signal.
3) Crosstalk between two or more messages.
4) Clipping or distortion of extremely high-level sounds.

(It should be noted that one would not expect all the above listed effects to occur simultaneously.) Thus the overall effect of the telephone line is to obscure the periodic structure of the speech waveform such that the pitch period becomes more difficult to detect.

Fig. 1. Two waveform measurements (peak measurement and zero-crossing measurement) which can be used to define pitch markers.

B. Types of Pitch Detectors

As a result of the numerous difficulties in pitch measurements, a wide variety of sophisticated pitch detection methods have been developed. Basically, a pitch detector is a device which makes a voiced-unvoiced decision and, during periods of voiced speech, provides a measurement of the pitch period. However, some pitch detection algorithms just determine the pitch during voiced segments of speech and rely on some other technique for the voiced-unvoiced decisions. Pitch detection algorithms can roughly be divided into the following three broad categories.

1) A group which utilizes principally the time-domain properties of speech signals.

2) A group which utilizes principally the frequency-domain properties of speech signals.

3) A group which utilizes both the time- and frequency-domain properties of speech signals.

Time-domain pitch detectors operate directly on the speech waveform to estimate the pitch period. For these pitch detectors the measurements most often made are peak and valley measurements, zero-crossing measurements, and autocorrelation measurements. The basic assumption that is made in all these cases is that if a quasiperiodic signal has been suitably processed to minimize the effects of the formant structure, then simple time-domain measurements will provide good estimates of the period.

The class of frequency-domain pitch detectors uses the property that if the signal is periodic in the time domain, then the frequency spectrum of the signal will consist of a series of impulses at the fundamental frequency and its harmonics. Thus simple measurements can be made on the frequency spectrum of the signal (or a nonlinearly transformed version of it, as in the cepstral pitch detector [5]) to estimate the period of the signal.

The class of hybrid pitch detectors incorporates features of both the time-domain and the frequency-domain approaches to pitch detection. For example, a hybrid pitch detector might use frequency-domain techniques to provide a spectrally flattened time waveform, and then use autocorrelation measurements to estimate the pitch period.

In this investigation four time-domain, one frequency-domain, and two hybrid pitch detectors were studied. Detailed discussions of the algorithms which were used will be given in Section II.

C. Criteria for Evaluating Pitch Detectors

One of the most difficult problems in comparing and evaluating the performance of pitch detectors is choosing a meaningful objective performance criterion. The basic problem is that a criterion suitable for one application may not be suitable for a different application of pitch detectors.

There are many characteristics of pitch detection algorithms which influence the choice of a set of performance criteria. Among these factors are the following.

1) Accuracy in estimating pitch period.
2) Accuracy in making a voiced-unvoiced decision.
3) Robustness of the measurements, i.e., they must be modified for different transmission conditions, speakers, etc.
4) Speed of operation.
5) Complexity of the algorithm.
6) Suitability for hardware implementation.
7) Cost of hardware implementation.

Depending on the specific application, various weights must be given to each of the above factors in choosing a single objective performance criterion. In this paper we will present results based on factors 1 and 2 in the above list, i.e., only those factors which are most amenable to numerical tabulations. However, whenever possible, we will try to discuss how factors 3-7 enter into an overall assessment of the pitch detectors discussed in this paper.

There is one major factor which was omitted from the above list and which, for many applications, is the dominant factor in evaluating pitch detectors. This factor is the perceptual accuracy of the pitch detectors, i.e., the question of how faithfully the pitch contour measured by the pitch detector matches the natural excitation pitch contour in terms of synthetic speech quality. We have omitted this factor from the list because it is not an objective performance criterion, but is instead a subjective criterion that can only be assessed through a series of extensive perceptual tests using synthetic speech samples. Such a companion investigation is being undertaken by the authors and will be reported on at a later date.

II. PITCH DETECTION ALGORITHMS

As stated earlier, seven distinct algorithms for detecting pitch were investigated. These algorithms were the following.

1) Modified autocorrelation method using clipping (AUTOC) (Dubnowski [11]).
2) Cepstrum method (CEP) (Schafer [12]).
3) Simplified inverse filtering technique (SIFT) (Markel [8]).
4) Data reduction method (DARD) (Miller [9]).
5) Parallel processing method (PPROC) (Rabiner [2]).
6) Spectral equalization LPC method using Newton's transformation (LPC) (Atal, unpublished).
7) Average magnitude difference function (AMDF) (NSA version, [10]).

The names in parentheses are the individual (or group) responsible for providing Fortran code for the computational parts of each algorithm, and the code following the name of the method is the abbreviation which will be used to refer to the pitch detector throughout this paper.

The choice of pitch detectors was based on both practical considerations (i.e., availability of reasonably portable Fortran code) as well as the desire to choose a good cross section of recent examples of each of the three types of pitch detectors discussed in Section I. Thus, included in this study were two time-domain (waveform) pitch detectors (4 and 5), two autocorrelation pitch detectors (1 and 7), one frequency-domain pitch detector (2), and two LPC hybrid pitch detectors (3 and 6).

In the following section we provide a summary of the method of operation of each of the seven pitch detectors. Whenever possible, exact values of parameters of the pitch detector (e.g., section length, window, etc.) will be given.

A. Modified Autocorrelation Method (AUTOC)

The modified autocorrelation pitch detector is based on the center-clipping method of Sondhi [7]. Fig. 2 shows a block diagram of the pitch detection algorithm. The method requires that the speech be low-pass filtered to 900 Hz. (A 99-point linear phase, finite impulse response (FIR) digital filter is used to low-pass filter the speech. Detailed characteristics of this low-pass filter, which is used for several of the pitch detectors, are given in [13].)

The low-pass filtered speech signal is digitized at a 10-kHz sampling rate and sectioned into overlapping 30-ms (300 samples) sections for processing. Since the pitch period computation for all pitch detectors is performed 100 times/s, i.e., every 10 ms, adjacent sections overlap by 20 ms or 200 samples.

The first stage of processing is the computation of a clipping level CL for the current 30-ms section of speech. The clipping level is set at a value which is 64 percent of the smaller of the peak absolute sample values in the first and last 10-ms portions of the section. Following the determination of the clipping level, the 30-ms section of speech is center clipped, and then infinite peak clipped, resulting in a signal which assumes one of three possible values: +1 if the sample exceeds the positive clipping level, -1 if the sample falls below the negative clipping level, and 0 otherwise.

Following clipping, the autocorrelation function for the 30-ms section is computed over a range of lags from 20 samples to 200 samples (i.e., a 2-ms to 20-ms period). Additionally, the autocorrelation at 0 delay is computed for appropriate normalization purposes. The autocorrelation function is then searched for its maximum (normalized) value. If the maximum (normalized) value exceeds 0.3, the section is classified as voiced and the location of the maximum is the pitch period. Otherwise, the section is classified as unvoiced.

In addition to the voiced-unvoiced classification based on the autocorrelation function, a preliminary test is carried out on each section of speech to determine if the peak signal amplitude within the section is sufficiently large to warrant the pitch computation. If the peak signal level within the section is below a given threshold,¹ the section is classified as unvoiced (silence) and no pitch computations are made. This method of eliminating low-level speech sections from further processing was also used for pitch detectors 2 (CEP), 3 (SIFT), and 5 (PPROC).
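The clipping and autocorrelation steps described above can be illustrated with a short sketch. The following Python/NumPy fragment is not the Fortran code used in the study; it is a minimal illustration that assumes the section has already been low-pass filtered to 900 Hz and sampled at 10 kHz, and it omits the preliminary silence test. The function and constant names are chosen here for illustration only.

```python
import numpy as np

# Values quoted in the text; the names themselves are illustrative.
MIN_LAG, MAX_LAG = 20, 200      # 2 ms to 20 ms at 10 kHz
CLIP_FRACTION = 0.64            # clipping level as a fraction of the smaller peak
VOICED_THRESHOLD = 0.3          # normalized autocorrelation threshold


def three_level_clip(section):
    """Center clip, then infinite-peak clip, a 30-ms section to {-1, 0, +1}."""
    third = len(section) // 3   # first and last 10 ms of a 30-ms section
    peak_first = np.max(np.abs(section[:third]))
    peak_last = np.max(np.abs(section[-third:]))
    level = CLIP_FRACTION * min(peak_first, peak_last)
    clipped = np.zeros(len(section), dtype=int)
    clipped[section > level] = 1
    clipped[section < -level] = -1
    return clipped


def autoc_pitch(section):
    """Return the pitch period in samples, or 0 if the section is unvoiced."""
    c = three_level_clip(section)
    r0 = int(np.dot(c, c))      # autocorrelation at zero delay (normalizer)
    if r0 == 0:
        return 0                # nothing survived clipping
    lags = np.arange(MIN_LAG, MAX_LAG + 1)
    r = np.array([np.dot(c[:-k], c[k:]) for k in lags]) / r0
    best = int(np.argmax(r))
    return int(lags[best]) if r[best] > VOICED_THRESHOLD else 0


# Usage: a 300-sample section of a 125-Hz sinusoid (period of 80 samples).
section = np.sin(2 * np.pi * np.arange(300) / 80)
period = autoc_pitch(section)
```

Because the clipped signal takes only the values -1, 0, and +1, each lag product reduces to counting sign agreements and disagreements, which is part of the appeal of this kind of clipping for hardware-oriented implementations.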

B. Cepstral Method (CEP)

Fig. 3 shows a flow diagram of the cepstral pitch detector described in [12]. Each block of 512 samples (51.2 ms) is weighted by a 512-point Hamming window, and then the cepstrum of that block is computed. The peak cepstral value and its location are determined, and if the value of this peak exceeds a fixed threshold, the section is called voiced and the pitch period is the location of the peak. If the peak does not exceed the threshold, a zero-crossing count is made on the block. If the zero-crossing count exceeds a given threshold, the block is called unvoiced. Otherwise it is called voiced and the period is the location of the maximum value of the cepstrum.

¹The threshold chosen is 1/15 of the peak absolute signal value within the utterance.

As in the modified autocorrelation method, a preliminary silence detector (based on the signal level) is used to classify all low-level blocks as silence (unvoiced speech) prior to the cepstral computation. It should also be noted that the cepstral pitch detector uses the full-band speech signal for processing.
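The decision logic of the cepstral method can be sketched as follows. This is a minimal Python/NumPy illustration, not the implementation evaluated in the study. The text does not give the threshold values or the range of quefrencies searched, so `peak_threshold`, `zc_threshold`, `min_lag`, and `max_lag` are hypothetical parameters (the default search range simply borrows the 20 to 200 sample range quoted for AUTOC), and the small constant added before the logarithm is a numerical safeguard introduced here. The preliminary silence detector is omitted.

```python
import numpy as np


def cepstral_pitch(block, peak_threshold, zc_threshold, min_lag=20, max_lag=200):
    """Return the pitch period in samples, or 0 if the block is called unvoiced."""
    windowed = block * np.hamming(len(block))
    # Real cepstrum: inverse transform of the log magnitude spectrum.
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_mag, n=len(block))

    search = cepstrum[min_lag:max_lag + 1]
    period = min_lag + int(np.argmax(search))
    if search.max() > peak_threshold:
        return period                      # strong cepstral peak: voiced

    # Weak peak: fall back on a zero-crossing count of the block.
    signs = np.signbit(block)
    zero_crossings = int(np.count_nonzero(signs[1:] != signs[:-1]))
    return 0 if zero_crossings > zc_threshold else period


# Usage: a 512-sample block containing an impulse train with an 80-sample period.
block = np.zeros(512)
block[::80] = 1.0
period = cepstral_pitch(block, peak_threshold=0.1, zc_threshold=100)
```

The zero-crossing fallback means a block with a weak cepstral peak is still called voiced unless it also looks noise-like, which matches the two-stage decision described above.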

C. Simplified Inverse Filtering Technique (SIFT)

Fig. 4 shows a block diagram of the SIFT method of pitch detection [8]. A block of 400 speech samples (40 ms at a 10-kHz rate) is low-pass filtered to a bandwidth of 900 Hz, and then decimated (down sampled) by a 5 to 1 ratio. The coefficients of a 4th-order inverse filter are obtained using the autocorrelation method of LPC analysis [14]. The 2-kHz speech signal is then inverse filtered to give a spectrally flattened signal which is then autocorrelated. The pitch period is obtained by interpolating the autocorrelation function in the neighborhood of the peak of the autocorrelation function. A voiced-unvoiced decision is made on the basis of the amplitude of the peak of the autocorrelation function. The threshold used for this test is a normalized value of 0.4 for the autocorrelation peak.

As with the previous two pitch detectors, a preliminary silence detector is used to classify low-level sections as silence and eliminate them from further consideration.
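A rough sketch of the SIFT steps is given below. It is a Python/NumPy illustration under several assumptions that are not taken from the text: the input `x2k` is assumed to be already low-pass filtered and decimated to 2 kHz; no analysis window is applied; the lag search range (`min_lag`, `max_lag`, in 2-kHz samples) is hypothetical; and parabolic interpolation around the peak stands in for the interpolation procedure, whose exact form the text does not specify. The silence detector is again omitted.

```python
import numpy as np


def lpc_autocorrelation(x, order):
    """Predictor coefficients a[1..order] from the autocorrelation normal equations."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])


def sift_pitch(x2k, order=4, threshold=0.4, min_lag=4, max_lag=40):
    """Return the pitch period in 10-kHz samples, or 0.0 if unvoiced."""
    if not np.any(x2k):
        return 0.0                                   # all-zero block
    a = lpc_autocorrelation(x2k, order)
    inverse_filter = np.concatenate(([1.0], -a))     # A(z) = 1 - sum a_k z^-k
    residual = np.convolve(x2k, inverse_filter)[:len(x2k)]

    # Normalized autocorrelation of the spectrally flattened signal.
    n = len(residual)
    r = np.array([np.dot(residual[:n - k], residual[k:]) for k in range(max_lag + 2)])
    r = r / r[0]

    k = min_lag + int(np.argmax(r[min_lag:max_lag + 1]))
    if r[k] <= threshold:
        return 0.0
    # Parabolic interpolation around the peak (an assumed stand-in).
    denom = r[k - 1] - 2.0 * r[k] + r[k + 1]
    offset = 0.5 * (r[k - 1] - r[k + 1]) / denom if denom != 0 else 0.0
    return (k + offset) * 5.0                        # 2-kHz lag -> 10-kHz samples
```

The interpolation step matters here because, at the 2-kHz rate, one lag step corresponds to five samples at 10 kHz; without it the period estimate would be quantized to 0.5 ms.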

D. Data Reduction Method (DARD)

Fig. 5 shows a block diagram of the data reduction pitch detector of Miller [9]. This pitch detector places pitch markers directly on a low-pass filtered (0-900 Hz) speech signal and thus is a pitch synchronous pitch detector.

To obtain the appropriate pitch markers, the data reduction method first detects excursion cycles in the waveform based on intervals between major zero crossings. The remainder of the algorithm tries to isolate and identify the principal excursion cycles, i.e., those which correspond to true pitch periods. This is accomplished through a series of steps using energy measurements and logic based on permissible pitch periods and anticipated syllabic rate changes of pitch. An error correction procedure is used to provide a reasonable measure of continuity in the pitch markers.

Since there is no inherent voiced-unvoiced calculation within this pitch detector, regions of unvoiced speech are identified by the lack of pitch markers.

Fig. 2. Block diagram of the AUTOC pitch detector.

Fig. 3. Block diagram of the CEP pitch detector.

Fig. 5. Block diagram of the DARD pitch detector.

E. Parallel Processing Method (PPROC)

Fig. 6 shows a block diagram of the parallel processing pitch detector [6]. The speech signal is first low-pass filtered to a bandwidth of 900 Hz. Then a series of measurements is made on the peaks and valleys of the low-pass filtered signal to give six separate functions. Each of these six functions is processed by an elementary pitch period estimator (PPE), giving six separate estimates of the pitch period. The six pitch estimates are then combined by a sophisticated decision algorithm which determines the pitch period. A voiced-unvoiced decision is obtained based on the degree of agreement among the six pitch detectors. Additionally, the preliminary silence detector described in Section II-A is used to classify low-level segments as silence.

F. Spectral Equalization LPC Method Using Newton's Transformation (LPC)

Fig. 7 shows a block diagram of an LPC pitch detector proposed by Atal (unpublished). The first step in this pitch detector is a voiced-unvoiced detector which uses a pattern-recognition technique to classify each 10-ms interval of speech as voiced or unvoiced [15]. If the speech section is classified as voiced, the 10-kHz sampled speech is digitally low-pass filtered to a bandwidth of about 900 Hz, and then decimated by 5 to 1 to a 2-kHz sampling rate. A 41-pole LPC analysis is performed on a 40-ms frame of speech to give a good representation of the speech spectrum in terms of the pitch harmonics. A Newton transformation is used to spectrally flatten the speech, i.e., to transform the signal into one which has sharp peaks at the pitch impulses, and is approximately zero everywhere else. A peak picker is used to determine the pitch period at the 2-kHz rate and a simple interpolation network is used to obtain higher resolution in the value of the pitch period.

It should be pointed out that the voiced-unvoiced pattern recognition algorithm uses a training set which provides a statistical description of the measurements used in the algorithm for each of the classes. The success of this method of making a voiced-unvoiced decision depends heavily on how well the training set of data characterizes the different speech classes. With careful training, voiced-unvoiced accuracies on the order of 99 percent have been obtained [15].

G. Average Magnitude Difference Function (AMDF)

Fig. 8 shows a block diagram of the AMDF pitch detector [10]. (The version used in this study was kindly supplied by M. Malpass of the Massachusetts Institute of Technology Lincoln Laboratory, based on the NSA version of the AMDF method. Details of implementation differ somewhat from those of [10].) The speech signal, initially sampled at 10 kHz, is decimated to a 6.67-kHz rate using a system of the type discussed in [16]. A zero-crossing measurement (NOZ) is made on the full-band speech file, and an energy measurement (ENG) is made on a low-pass filtered version (0-900 Hz) of the signal. The average magnitude difference function is computed on the low-pass filtered speech signal at 48 lags running from 16 to 124 samples. The pitch period is identified as the value of the lag at which the minimum AMDF occurs. Thus a fairly coarse quantization is obtained for the pitch period. Logic is used to check for pitch period doubling, etc., and to check on continuity of pitch periods with previous pitch estimates (a type of nonlinear smoothing). In addition to the pitch estimate, the ratio between the maximum and minimum values of AMDF (MAX/MIN) is obtained. This measurement, along with NOZ and ENG, is used to make a voiced-unvoiced decision using logical operations.
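The core AMDF computation can be sketched in a few lines. The following Python/NumPy fragment is an illustration only, not the NSA or Lincoln Laboratory implementation. The text states that 48 lags between 16 and 124 samples are used but does not list them, so the set of lags is left as a parameter; the usage example evaluates every integer lag in that range purely for simplicity. The doubling and continuity logic, NOZ, ENG, and the voiced-unvoiced decision are omitted, and the function names are chosen here.

```python
import numpy as np


def amdf(frame, lags):
    """Average magnitude difference of `frame` at each lag in `lags`."""
    n = len(frame) - int(np.max(lags))     # same number of terms at every lag
    return np.array([np.mean(np.abs(frame[:n] - frame[k:k + n])) for k in lags])


def amdf_pitch(frame, lags):
    """Return (lag of the AMDF minimum, MAX/MIN ratio of the AMDF)."""
    d = amdf(frame, lags)
    i = int(np.argmin(d))
    ratio = float(d.max() / max(d[i], 1e-12))   # floor guards against a zero minimum
    return int(lags[i]), ratio


# Usage: a sinusoid with a 100-sample period, searched over lags 16..124.
frame = np.sin(2 * np.pi * np.arange(400) / 100)
period, max_min_ratio = amdf_pitch(frame, np.arange(16, 125))
```

For a strongly periodic frame the AMDF dips close to zero at the pitch period and at its multiples, so the MAX/MIN ratio is large; when more than one multiple of the period falls inside the lag range, the minimum alone is ambiguous, which is the situation the doubling check in the text is meant to handle.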

III. DATA BASE FOR EVALUATION

In order to evaluate the performance of these seven pitch detectors, an appropriately chosen data base was required to span the range of pitch, types of utterances, and recording and transmission environments which are normally encountered in speech processing. In this section we describe the data base used in this study.

A. Speakers

The set of seven speakers for this study included the following.

1) Low-pitched male (LM).
2) Male speaker 1 (M1).
3) Male speaker 2 (M2).
4) Female speaker 1 (F1).
5) Female speaker 2 (F2).
6) Child (4 year old) (C1).
7) Diplophonic speaker (D1).

Fig. 4. Block diagram of the SIFT pitch detector.

[Fig. 5 block labels: s(n); LPF (0-900 Hz); detect excursion cycles using zero-crossing analysis; isolate and identify principal excursion cycles using energy and pitch period limits; error correction; pitch markers on waveform.]

Diplophonia is a condition in which a person's alternate glottal pulses are more strongly correlated (both in length and amplitude) than adjacent glottal pulses. Thus, it is extremely difficult to detect the pitch of a diplophonic speaker, even under the best of conditions. Fig. 9 shows a section of waveform from the diplophonic speaker. It is hard to detect, even by eye, the correct pitch periods. For diplophonic speakers, many pitch detectors calculate the pitch period as the distance between major peaks, and not the distance between major and minor peaks. As a result, the pitch contour for a diplophonic speaker often exhibits a large amount of pitch period doubling.

Fig. 9. Section of the waveform from the diplophonic speaker (horizontal axis: time in samples).

Fig. 6. Block diagram of the PPROC pitch detector.

Fig. 7. Block diagram of the LPC pitch detector.

Fig. 8. Block diagram of the AMDF pitch detector.


To illustrate the range of pitch (both period and frequency) for the speakers in the data base, Fig. 10 shows a plot of the pitch variation for each of the seven speakers for the utterances used in this evaluation (see Section III-B). It can be seen that a wide range of pitch is encompassed by these seven speakers. Additionally, Fig. 11 shows the individual histograms for each of these speakers. It can be seen from this figure that the low-pitched (long period) speakers used in this study (i.e., LM, M1, M2) had a much larger range of pitch period variation than the high-pitched speakers. The histogram for the low-pitched male (LM) shows that on several occasions his pitch period exceeded 200 samples (i.e., the pitch frequency fell below 50 Hz). Since this was outside the anticipated range of pitch variation, all the pitch detectors made errors during these regions.
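The correspondence between pitch period in samples, pitch period in milliseconds, and pitch frequency used above follows directly from the 10-kHz sampling rate. As a worked example (the symbols $P$, $T$, $F_0$, and $f_s$ are notation introduced here for illustration and do not appear in the paper):

```latex
T = \frac{P}{f_s}, \qquad F_0 = \frac{1}{T} = \frac{f_s}{P}

% Upper limit of the anticipated range: P = 200 samples at f_s = 10 kHz
T = \frac{200}{10\,000\ \text{Hz}} = 20\ \text{ms}, \qquad
F_0 = \frac{10\,000\ \text{Hz}}{200} = 50\ \text{Hz}

% Lower limit of the AUTOC lag search: P = 20 samples
T = \frac{20}{10\,000\ \text{Hz}} = 2\ \text{ms}, \qquad
F_0 = \frac{10\,000\ \text{Hz}}{20} = 500\ \text{Hz}
```

A pitch period longer than 200 samples therefore corresponds to a pitch frequency below 50 Hz, which is the condition noted for the low-pitched male speaker.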

B. Recorded Utterances

The utterances used in this study included the four monosyllabic nonsense words:

1) Hayed
2) Heed
3) Hod
4) Hoed

and the four sentences:

5) We were away a year ago.
6) I know when my lawyer is due.
7) Every salt breeze comes from the sea.
8) I was stunned by the beauty of the view.

Sentences 5 and 6 are all voiced (except for the stop gaps), whereas sentences 7 and 8 contain both voiced and unvoiced speech.

C. Recording Conditions

The three types of recording conditions that were used in this study included:

1) Close-talking microphone (M).
2) Standard telephone transmission (T).
3) High-quality microphone (W).

(Conditions 1 and 2 were recorded simultaneously.)

The close-talking microphone recordings were made simultaneously with the telephone recordings since this was the most convenient method of providing a good-quality signal (during voiced regions) for manual pitch detection which could be time aligned with the telephone recordings and which did not interfere with using a standard telephone handset in a natural manner. However, because of its placement close to the mouth of the speaker, the close-talking microphone was quite sensitive to breath noise, plosives, and other unvoiced transients. The telephone recordings were made over the local PBX using an ordinary telephone handset. The close-talking microphone recordings were band-limited to about 3 kHz, as were the telephone recordings. The recordings made on the high-quality microphone were wideband recordings which were filtered at 4 kHz prior to digitization.

IV. MEASUREMENT OF THE STANDARD PITCH CONTOUR

The method used to measure the standard pitch contour for each of the utterances in the data base was the semiautomatic pitch detector of McGonegal et al. [13], which was developed for this study. This method is a highly sophisticated, user-interactive pitch detector which estimated pitch on a 10-ms frame-by-frame basis. Extensive analysis of the results obtained from this semiautomatic pitch detector across several users on the same utterances showed this method to be highly reliable [13].

Using the semiautomatic method, the analysis time for an experienced user was about 30 min to process 1 s of speech (i.e., 100 frames). For the data base used in this study, a total of 60 h of computer processing was required to estimate the standard pitch contours for the entire data base.

Fig. 10. Pitch variation for each of the speakers used in this study. (*At two isolated points the pitch period was 14.7 ms. **At two isolated points the pitch period was 21.8 ms.)

V. ERROR ANALYSIS RESULTS

The way in which objective comparisons of the performance of each of the individual pitch detectors were made was as follows. For each of the utterances in the data base, a standard pitch contour was obtained using the semiautomatic method of Section IV. We denote the standard pitch contour as $p_s(m)$, where $m$ goes from 1 to $M$, and $M$ is the number of 10-ms frames in the utterance. The contour $p_s(m)$ has the value 0 if the $m$th frame is unvoiced; otherwise it has the value of the pitch period for the $m$th frame.

Next, each of the utterances was used as input to each of the seven pitch detectors and a set of pitch contours was obtained as output. We denote the pitch contour from the $j$th pitch detector ($j = 1, 2, \ldots, 7$) as $\{p_j(m),\ m = 1, 2, \ldots, M\}$. Of course, special attention had to be given in the Fortran code to compensate for the processing delay of each pitch detector to ensure that the pitch contours from each of the seven pitch detectors registered properly with the standard pitch contour.
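The contour representation described above (one value per 10-ms frame, with 0 marking an unvoiced frame) lends itself to a simple frame-by-frame comparison once the detector output has been registered with the standard contour. The sketch below is a Python/NumPy illustration, not the Fortran code used in the study; the function names and the `delay` parameter (a detector's processing delay expressed in frames) are assumptions made here for illustration.

```python
import numpy as np


def align(contour, delay, length):
    """Drop `delay` leading frames, then pad with unvoiced (0) frames to `length`."""
    shifted = np.asarray(contour)[delay:delay + length]
    return np.pad(shifted, (0, length - len(shifted)))


def voicing_errors(standard, detected):
    """Count frames where the voiced-unvoiced classification disagrees.

    Returns (unvoiced_to_voiced, voiced_to_unvoiced), both relative to the
    standard contour.
    """
    standard = np.asarray(standard)
    detected = np.asarray(detected)
    unvoiced_to_voiced = int(np.count_nonzero((standard == 0) & (detected != 0)))
    voiced_to_unvoiced = int(np.count_nonzero((standard != 0) & (detected == 0)))
    return unvoiced_to_voiced, voiced_to_unvoiced


# Usage: a detector whose output lags the standard contour by one frame.
standard = [0, 0, 80, 82, 81, 0]
raw_detector_output = [0, 0, 0, 80, 83, 0, 0]
detected = align(raw_detector_output, delay=1, length=len(standard))
errors = voicing_errors(standard, detected)
```

Without the registration step, every voiced-unvoiced boundary would be counted as a classification error for any detector with a nonzero processing delay, which is why the delay compensation mentioned above matters.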

To quantitatively measure the performance of each of the pitch detectors relative to the semiautomatic analysis, a series of error measurements was defined for each utterance. In Section V-A we discuss the problems associated with defining these error measurements and attaching physical significance to their values.

[Fig. 10 content: pitch period axis (time in ms, 2 to 25) and pitch frequency axis (Hz, 500 to 40). Legend with pitch period range per speaker:
C1: Child (No. 1), 2.3 to 4.5 ms
F1: Female (No. 1), 3.8 to 5.1 ms
F2: Female (No. 2), 3.5 to 6.2 ms
D1: Diplophonic (No. 1), 4.3 to 10.6 ms*
M1: Male (No. 1), 6.8 to 9.5 ms
M2: Male (No. 2), 6.5 to 17.2 ms
LM: Low-pitched male, 6.6 to 19.2 ms**]

The result of the error measurements was a set of scores ofthe performance of each pitch detector for each utterance,for each recording condition, and for each speaker. Due to the

excessive amount of data, the individual results were averagedover the utterances of a single speaker, for each recording con-dition. A complete set of these results for all the performanceclasses is given in Section V-B. Finally, where appropriate, theresults were averaged over recording conditions and an absoluteranking of each of the pitch detectors for each speaker was

[Figure: one pitch period histogram per speaker (LM, M1, M2, F1, F2, C1); horizontal axis: pitch period (samples at 10 kHz).]

Fig. 11. Individual pitch period histograms for each of the speakers used in this study.

given. Such rankings provided a good picture of the performance strengths and weaknesses of each of the seven pitch detectors.

Before proceeding to the discussion of the error measurements, an additional dimension to the error analysis should be mentioned. This added dimension was the application of nonlinear smoothing (error correcting) methods to detect and correct several types of errors which occur in pitch detection [17]. Such a nonlinear smoother was incorporated into this investigation to see what the effects would be on this data base. Extensive examples of the applications of nonlinear smoothers to speech processing are given in [17].

A. Definition of Error Parameters

As mentioned above, for every utterance in the data base there is a standard pitch contour, $p_s(m)$, and a pitch contour for each pitch detector, $p_j(m)$, where $j$ denotes the pitch detector used in the comparison, i.e., $j = 1$ is the AUTOC method, $j = 2$ is the CEP method, etc. By comparing $p_s(m)$ to $p_j(m)$ (for each value of $j$) it can be seen that four possibilities can occur for each value of $m$. These four possibilities are the following.

1) $p_s(m) = 0$, $p_j(m) = 0$, in which case both the standard analysis and the pitch detector classified the $m$th interval as unvoiced. No error results here.

2) $p_s(m) = 0$, $p_j(m) \neq 0$, in which case the standard analysis classified the $m$th interval as unvoiced, but the pitch detector classified the $m$th interval as voiced. An unvoiced-to-voiced error results here.

3) $p_s(m) \neq 0$, $p_j(m) = 0$, in which case the standard analysis classified the $m$th interval as voiced, but the pitch detector classified the $m$th interval as unvoiced. A voiced-to-unvoiced error results here.

4) $p_s(m) = P_1 \neq 0$, $p_j(m) = P_2 \neq 0$, in which case both the standard analysis and the pitch detector classified the $m$th interval as voiced. For this case two types of errors can exist, depending on the values of $P_1$ and $P_2$, the pitch periods from the standard analysis and from the pitch detector. If we define the voiced error $e(m)$ as

$$e(m) = P_1 - P_2, \tag{1}$$

then, if $|e(m)| \geq 10$ samples (i.e., an error of at least 1 ms in estimating the pitch period at the 10-kHz sampling rate), the error was classified as a gross pitch period error. For such cases, the pitch detector has failed dramatically in estimating the pitch period. Possible causes of such gross pitch errors are pitch period doubling or tripling, inadequate suppression of formants so as to affect pitch measurements, etc. The second type of pitch error was the fine pitch period error, in which case $|e(m)| < 10$ samples. For such cases the pitch detector has estimated the pitch period sufficiently accurately to attribute the errors (primarily) to the measurement techniques.
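The four possibilities, together with the gross/fine split of the fourth, can be written as a single frame-level decision. The following is a minimal sketch, assuming pitch periods are given in samples at 10 kHz with 0 marking an unvoiced frame; the function name and the returned labels are illustrative and not taken from the study's software.

```python
GROSS_THRESHOLD = 10  # samples, i.e., 1 ms at a 10-kHz sampling rate

def classify_frame(p_s, p_j, threshold=GROSS_THRESHOLD):
    """Compare one frame of the standard contour (p_s) with one detector frame (p_j)."""
    if p_s == 0 and p_j == 0:
        return "no_error_unvoiced"      # case 1: both unvoiced
    if p_s == 0:
        return "unvoiced_to_voiced"     # case 2: detector wrongly says voiced
    if p_j == 0:
        return "voiced_to_unvoiced"     # case 3: detector wrongly says unvoiced
    e = p_s - p_j                       # case 4: voiced error e(m) = P1 - P2
    return "gross" if abs(e) >= threshold else "fine"
```

For instance, a standard period of 80 samples against a detector estimate of 160 samples (a period doubling) is a gross error, whereas an estimate of 82 samples is a fine error.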

Based on the above four possibilities for comparing each frame of the reference pitch contour to each frame of each pitch detector contour, five distinct measurements of the performance of each pitch detector were derived. These five error measurements are the following.

1) Gross Error Count: For this measurement the number of gross pitch period errors (as defined above) per utterance was tabulated.

2) Mean of the Fine Pitch Errors: The mean $\bar{e}$ is defined as

$$\bar{e} = \frac{1}{N} \sum_{j=1}^{N} e(m_j) \tag{2}$$

where $m_j$ is the $j$th interval in the utterance for which $|e(m_j)| < 10$ (fine pitch error), and $N$ is the number of such intervals in the utterance. Thus $\bar{e}$ is a measure of the bias in the pitch measurement during voiced intervals.

3) Standard Deviation of the Fine Pitch Errors: The standard deviation, $\sigma_e$, is defined as

$$\sigma_e = \sqrt{\frac{1}{N} \sum_{j=1}^{N} e^2(m_j) - \bar{e}^2}. \tag{3}$$

The standard deviation of the fine pitch errors is a measure of the accuracy of the pitch detector in measuring pitch period during voiced intervals.

4) Voiced-to-Unvoiced Error Rate: This measurement shows the accuracy in correctly classifying voiced intervals.

5) Unvoiced-to-Voiced Error Rate: This measurement shows the accuracy in correctly classifying unvoiced intervals.
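The five measurements can be gathered in one pass over a pair of contours. The sketch below is illustrative only: it assumes the contours are equal-length sequences of pitch periods in samples with 0 for unvoiced frames, and it assumes each voicing error rate is the number of errors divided by the number of frames of the corresponding class in the standard contour. The function name and dictionary keys are hypothetical.

```python
import math

def error_measurements(p_s, p_j, gross_threshold=10):
    """Return the five error measurements for one utterance and one detector."""
    fine = []                                   # fine pitch errors e(m_j)
    gross = v_to_uv = uv_to_v = voiced = unvoiced = 0
    for s, d in zip(p_s, p_j):
        if s == 0:                              # standard says unvoiced
            unvoiced += 1
            if d != 0:
                uv_to_v += 1
        else:                                   # standard says voiced
            voiced += 1
            if d == 0:
                v_to_uv += 1
            elif abs(s - d) >= gross_threshold:
                gross += 1
            else:
                fine.append(s - d)
    n = len(fine)
    mean = sum(fine) / n if n else 0.0
    # Variance as mean of squares minus squared mean, matching the definition of sigma_e.
    var = sum(e * e for e in fine) / n - mean * mean if n else 0.0
    return {
        "gross_count": gross,
        "fine_mean": mean,
        "fine_std": math.sqrt(max(var, 0.0)),   # clamp tiny negative rounding errors
        "v_to_uv_rate": v_to_uv / voiced if voiced else 0.0,
        "uv_to_v_rate": uv_to_v / unvoiced if unvoiced else 0.0,
    }
```

Note that gross errors are excluded from the mean and standard deviation, so a detector with many period doublings can still show a small $\sigma_e$; the measurements have to be read together.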

Although other error analyses are possible, it was felt that these five error measurements provided a good description of the performance strengths and weaknesses of each of the seven pitch detectors. The results of these error analyses are given in Section V-B. Before presenting these results, we first show some examples of the individual pitch contours for three of the utterances used in this study. Figs. 12-14 show typical sets of pitch contours for the seventh utterance of Section III-B, for the wideband condition, for speakers LM, M2, and C1. The curve at the upper left in each figure is the result of the semiautomatic analysis (i.e., the standard pitch contour). It can be seen from these figures that each of the types of errors discussed above occurs in these examples. Finally, Fig. 15 shows the result of processing each of the pitch contours of Fig. 13 by a nonlinear smoother (a combination of running medians of length 7 and some simple logic). The overall similarity among the smoothed pitch contours is startlingly evident in Fig. 15. As will be seen later, a good nonlinear smoother (error corrector) is able to correct a large number of the errors in pitch detection and considerably improve the performance of a pitch detector. However, if the error rate is too high, no amount of nonlinear smoothing will suffice.
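To make the idea of a running median concrete, here is a minimal sketch of the median part of such a smoother. It is not the smoother of [17]: the "simple logic" mentioned above is not specified in this section and is omitted, and the treatment of the contour edges (truncated windows) is an assumption made only for this illustration.

```python
from statistics import median

def running_median(contour, length=7):
    """Replace each frame by the median of a window of `length` frames centered on it."""
    half = length // 2
    n = len(contour)
    smoothed = []
    for m in range(n):
        lo = max(0, m - half)           # assumed edge handling: shrink the window
        hi = min(n, m + half + 1)
        smoothed.append(median(contour[lo:hi]))
    return smoothed
```

An isolated period doubling, e.g. a single frame of 160 samples inside a run of 80-sample frames, is removed completely, because one outlier cannot move the median of seven values. A long run of gross errors, on the other hand, dominates the window and survives, which is consistent with the remark that smoothing cannot help when the error rate is too high.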

B. Results of Error Analysis

The complete set of error analyses discussed in Section V-A was performed on the entire data base of Section III, and the major results are presented in Tables I-XX, along with corresponding performance scores for each error category. Several points about the analysis should first be noted before discussing the individual tables and the resulting performance scores. First, it must be pointed out that for the microphone and telephone recording conditions all eight utterances were processed, whereas for the wideband case only the four sentences were processed (i.e., the four nonsense words were not used).

[Figure: pitch period contours versus time (seconds); utterance LM07W, unsmoothed.]

Fig. 12. Representative set of raw pitch contours for utterance 7, speaker LM, recording condition W.

[Figure: pitch period contours versus time (seconds); utterance M207W, unsmoothed.]

Fig. 13. Representative set of raw pitch contours for utterance 7, speaker M2, condition W.


[Figure: pitch period contours versus time (seconds), one panel per analysis: SAPD, AUTOC, CEP, SIFT, DARD, PPROC, LPC, AMDF; utterance C107W, unsmoothed.]

Fig. 14. Representative set of raw pitch contours for utterance 7, speaker C1, condition W.

[Figure: pitch period contours versus time (seconds), one panel per analysis: SAPD, AUTOC, CEP, SIFT, DARD, PPROC, LPC, AMDF; utterance M207W, smoothed.]

Fig. 15. Representative set of nonlinearly smoothed pitch contours for utterance 7, speaker M2, condition W.


Second, the results obtained for the diplophonic speaker (D1) were omitted entirely from the error analysis because of the universal difficulties of all the pitch detectors (including the semiautomatic method) in estimating the correct pitch period for this speaker. Some of the raw analysis results for speaker D1 are presented in the M.S. thesis of Cheng [18]. Finally, because of the large number of factors involved in the analysis, each of the error measurements was averaged over the utterances of each speaker. This is justified in that it is not anticipated that the sentence material is a factor in the performance evaluation of any pitch detector.

The format for the results presented in Tables I-XX is as follows. First we present the average (over utterances) error scores for each category for each speaker, recording condition, and pitch detector. Also included in the tables is a sum of the raw averages across the three recording conditions. Each table of raw data is followed by one (or sometimes two) table(s) of performance rankings based on an empirical (but hopefully physically justifiable) evaluation of the average scores for each pitch detector and for each speaker. From these performance rankings for each error category the performance strengths and weaknesses of each pitch detector can readily be seen and evaluated. Following the tables of unsmoothed raw averages, the results obtained after nonlinear smoothing are presented. Comparisons between the unsmoothed and smoothed performance rankings show cases where the error rate is too high to be properly corrected with simple nonlinear smoothing techniques. We now proceed to discuss the results for each of the five error categories of Section V-A.

1) Gross Pitch Errors: Tables I-IV present the results obtained for the gross pitch error measurements. From Table I it can be seen that, for the most part, a great deal of homogeneity existed between the scores for the three recording conditions, although in some cases there were fairly substantial differences in the average gross error scores. Table II shows the performance rankings based on the sum of the average gross error scores across the three recording conditions. The best rankings are the lowest scores in Table II, i.e., 1 is the best score and 5 is the worst score. Rank 1 was given to a score from 0 to 6; rank 2 to a score from 6 to 18; rank 3 to a score from 18 to 42; rank 4 to a score from 42 to 90; and rank 5 to scores over 90. The scale in this case is logarithmic (each interval is twice as wide as the previous one) because the difficulty in detecting and correcting such gross errors inherently appears to be logarithmically related to the number of such errors per utterance. Based on these assumptions, the rankings of Table II show that each pitch detector performed better for some speakers (i.e., range of pitch variation) than for others. For example, the AUTOC pitch detector performed much worse on the two low-pitch speakers (LM and M2) than on the three higher pitch speakers (F1, F2, C1), whereas the CEP pitch detector performed much better on the lower pitch speakers than on the higher pitch speakers. An overall ranking score for each pitch detector (i.e., the sum of the rankings over the speakers) is given at the bottom of Table II, and the ranking scores in the rightmost column of Table II (the sum of the rankings over the pitch detectors) are a measure of the difficulty of detecting pitch for a given speaker. Table II shows that the overall ranking scores for five of the seven pitch detectors were comparable, and that the two others were

TABLE I
NUMBER OF GROSS PITCH ERRORS, UNSMOOTHED

                           Pitch Detector
Speaker  Cond.  AUTOC    CEP   SIFT   DARD  PPROC    LPC   AMDF

LM       M       15.3    0.5    0.6    5.8   10.0    4.4   12.8
         T       26.1    1.1    4.5    5.8   11.0    5.6   15.6
         W       19.5    1.3    4.5   13.8   23.8   13.0   23.8
         Sum     60.9    2.9    9.6   25.4   44.8   23.0   52.2

M1       M        0.6    0.1    0.0    5.9    2.0    0.1    0.3
         T        3.4    0.1    0.8    6.3    3.0    0.8    0.8
         W        2.8    0.5    3.0   23.5    6.0    0.8    2.8
         Sum      6.8    0.7    3.8   35.7   11.0    1.7    3.9

M2       M        6.1    0.4    1.3   15.9    4.9    2.9    7.3
         T        9.9    0.6    3.4    4.0    5.8    4.0    9.8
         W        7.3    1.3    5.3   26.8   12.3    5.5    8.5
         Sum     23.3    2.3   10.0   46.7   23.0   12.4   25.6

F1       M        1.9    9.1    4.4    7.3    4.0    2.4    0.5
         T        1.6    8.5    1.8    6.3    2.8    1.4    0.0
         W        0.0   29.0    8.0    0.8    4.0    2.0    0.0
         Sum      3.5   46.0   14.2   14.4   10.8    5.8    0.5

F2       M        0.4    1.4    2.1    7.1    2.4    1.6    0.6
         T        0.6    2.0    1.5    5.6    1.5    1.5    1.0
         W        2.0    2.5    3.8    8.5    5.0    2.0    1.8
         Sum      3.0    5.9    7.4   21.2    8.9    5.1    3.4

C1       M        1.0   13.6   65.3    6.1    7.8    8.3   10.6
         T        1.9   14.8   62.6   12.9    9.0   12.3    9.1
         W        0.0   12.5   40.8    3.0    7.3    5.5    6.5
         Sum      2.9   40.9  168.7   22.0   24.1   26.1   26.2

TABLE II
PERFORMANCE SCORES BASED ON SUM OF GROSS PITCH ERRORS, UNSMOOTHED

                      Pitch Detector
Speaker  AUTOC  CEP  SIFT  DARD  PPROC  LPC  AMDF  Sum

LM         4     1    2     3     4      3    4     21
M1         2     1    1     3     2      1    1     11
M2         3     1    2     4     3      2    3     18
F1         1     4    2     2     2      1    1     13
F2         1     1    2     3     2      1    1     11
C1         1     3    5     3     3      3    3     21
Sum       12    11   14    18    16     11   13

Code: (0-6) = 1, (6-18) = 2, (18-42) = 3, (42-90) = 4, (90- ) = 5.

somewhat inferior for this error category. Additionally, it is seen that the speakers with the most extreme pitch (LM, C1) presented the most difficulty in terms of this error category.
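The logarithmic ranking scale for the summed gross error scores can be sketched as a lookup against the interval edges 6, 18, 42, and 90 given in the text. The function name is hypothetical, and the handling of a score that falls exactly on an edge is an assumption (it is assigned to the higher rank), since the text does not say which side such a score belongs to.

```python
import bisect

# Upper edges of ranks 1 to 4; the interval widths are 6, 12, 24, 48.
GROSS_RANK_EDGES = [6, 18, 42, 90]

def gross_error_rank(score_sum):
    """Map a summed gross error score (M + T + W) to a rank from 1 (best) to 5 (worst)."""
    return bisect.bisect_right(GROSS_RANK_EDGES, score_sum) + 1
```

Applied to the LM sums of Table I, this gives rank 4 for AUTOC (60.9), rank 1 for CEP (2.9), rank 2 for SIFT (9.6), and rank 3 for DARD (25.4), in agreement with the LM row of Table II.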

Tables III and IV present the results for the gross pitch error category after processing by a nonlinear smoother. This type of error is most easily detected and corrected by the nonlinear smoother used in this study, as verified by Tables III and IV. It can be seen that only 12 of the 42 speaker-detector pairs in Table IV were not given the best ranking of 1. These 12 represent cases where the gross error rate was too high to be corrected entirely by a nonlinear smoother. The overall ranking scores for the smoothed results showed all seven pitch detectors (with the exception of speaker C1 for pitch detector SIFT) to be essentially identical in their overall performance in this error category.

TABLE III
NUMBER OF GROSS PITCH ERRORS, SMOOTHED

                           Pitch Detector
Speaker  Cond.  AUTOC    CEP   SIFT   DARD  PPROC    LPC   AMDF

LM       M        5.3    2.3    2.0    3.5    5.8    3.6    9.4
         T        6.1    1.6    2.0    2.0    6.0    3.1   10.8
         W        3.8    5.0    6.8    5.0   13.3    9.5   17.3
         Sum     15.2    8.9   10.8   10.5   25.1   16.2   37.5

M1       M        0.3    0.4    0.5    0.8    1.3    0.4    0.6
         T        0.0    0.1    0.0    0.5    0.1    0.3    0.5
         W        0.3    0.5    0.5    2.8    1.5    0.8    2.8
         Sum      0.6    1.0    1.0    4.1    2.9    1.5    3.9

M2       M        0.8    1.5    1.5    6.0    2.9    3.5    6.4
         T        1.3    1.8    1.4    2.3    3.0    3.6    7.4
         W        2.3    2.3    3.0    8.3    7.0    5.0    5.3
         Sum      4.4    5.6    5.9   16.6   12.9   12.1   19.1

F1       M        0.0    0.8    0.0    0.1    0.0    0.0    0.0
         T        0.0    0.0    0.0    0.1    0.0    0.0    0.0
         W        0.0    0.0    0.0    0.0    0.0    0.0    0.0
         Sum      0.0    0.8    0.0    0.2    0.0    0.0    0.0

F2       M        0.0    0.1    0.0    0.4    0.3    0.0    0.1
         T        0.1    0.0    0.1    0.6    0.0    0.3    0.3
         W        0.8    0.0    1.3    0.3    1.3    0.0    0.0
         Sum      0.9    0.1    1.4    1.3    1.6    0.3    0.4

C1       M        0.0    0.0   57.4    0.1    0.0    0.0    0.5
         T        0.0    0.0   55.8    0.3    0.0    0.0    1.8
         W        0.0    0.0   15.8    0.0    0.0    0.0    0.0
         Sum      0.0    0.0  129.0    0.4    0.0    0.0    2.3

TABLE IV
PERFORMANCE SCORES BASED ON SUM OF GROSS PITCH ERRORS, SMOOTHED

                      Pitch Detector
Speaker  AUTOC  CEP  SIFT  DARD  PPROC  LPC  AMDF  Sum

LM         2     2    2     2     3      2    3     16
M1         1     1    1     1     1      1    1      7
M2         1     1    1     2     2      2    2     11
F1         1     1    1     1     1      1    1      7
F2         1     1    1     1     1      1    1      7
C1         1     1    5     1     1      1    1     11
Sum        7     7   11     8     9      8    9

Code same as Table II.

2) Fine Pitch Error (Average Value): The results of the analysis of the average value of the fine pitch error indicated that all seven pitch detectors yielded average values of $\bar{e}$ on the order of ±0.5 samples across all utterances, speakers, and recording conditions. No consistent bias (either positive or negative) in the value of $\bar{e}$ was noted in the data. Thus for all practical purposes the average value of the fine pitch error was essentially 0 in all cases and, therefore, no results are tabulated here.

3) Fine Pitch Error (Standard Deviation): Tables V-VIII present the results of the analysis of the standard deviation of the fine pitch error. The units of the standard deviation are samples. The results here were quite homogeneous across recording conditions and thus the sum of the standard deviations over recording conditions was used as the performance measure in Tables VI (raw averages) and VIII (smoothed averages). Based on the analysis results, a standard deviation of less than 0.5 samples per condition, or 1.5 samples for the sum, was given a score of 1. A linear scale was used for this measurement; thus a standard deviation sum from 1.5 to 3 samples was given the next best score (2), etc.
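A linear scale of this kind amounts to dividing the score by the interval width. The following is a small sketch, assuming five ranks with the last one open-ended (as in the code line of Table VI); the function name and its parameters are illustrative. Scores that fall exactly on an interval edge are ambiguous in the text and are not meaningful test cases for this sketch.

```python
def linear_rank(value, step, max_rank=5):
    """Map a score to a rank from 1 (best) to max_rank using equal-width intervals."""
    return min(int(value // step) + 1, max_rank)
```

With a step of 1.5 samples, the LM sums of Table V (2.8 for AUTOC, 5.3 for PPROC, 6.5 for AMDF) map to ranks 2, 4, and 5, matching the LM row of Table VI.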

As seen in Table VI, four of the pitch detectors (AUTOC, CEP, SIFT, and LPC) performed almost uniformly across all speakers, and had comparably good overall performance scores.

TABLE V
STANDARD DEVIATION OF FINE PITCH ERRORS, UNSMOOTHED

                           Pitch Detector
Speaker  Cond.  AUTOC    CEP   SIFT   DARD  PPROC    LPC   AMDF

LM       M        1.0    0.9    0.9    1.1    1.8    1.1    2.3
         T        1.0    1.0    0.9    1.2    1.9    1.2    2.1
         W        0.8    1.0    0.9    1.1    1.6    1.1    2.1
         Sum      2.8    2.9    2.7    3.4    5.3    3.4    6.5

M1       M        0.7    0.5    0.7    0.9    0.9    0.6    1.1
         T        0.6    0.5    0.7    0.8    0.6    0.6    1.0
         W        1.0    0.8    0.8    1.3    1.1    0.7    1.3
         Sum      2.3    1.8    2.2    3.0    2.6    1.9    3.4

M2       M        0.7    0.7    0.7    1.0    1.1    0.8    1.5
         T        0.6    0.7    0.8    1.0    1.0    0.9    1.3
         W        1.0    0.9    1.0    1.2    1.4    0.9    1.6
         Sum      2.3    2.3    2.5    3.2    3.5    2.6    4.4

F1       M        0.6    0.5    0.7    1.1    0.8    0.6    1.0
         T        0.6    0.6    0.7    1.0    0.9    0.6    1.1
         W        0.5    0.5    0.7    0.9    0.8    0.5    0.9
         Sum      1.7    1.6    2.1    3.0    2.5    1.7    3.0

F2       M        0.6    0.6    0.6    0.8    0.8    0.4    1.1
         T        0.6    0.5    0.7    0.9    0.8    0.5    1.3
         W        0.6    0.5    0.7    0.8    0.7    0.6    1.1
         Sum      1.8    1.6    2.0    2.5    2.3    1.5    3.5

C1       M        0.4    0.5    0.8    0.7    0.6    0.5    0.9
         T        0.4    0.5    0.8    0.8    0.6    0.5    1.0
         W        0.5    0.4    0.9    0.8    0.7    0.5    1.0
         Sum      1.3    1.4    2.5    2.3    1.9    1.5    2.9

TABLE VI
PERFORMANCE SCORES BASED ON SUM OF STANDARD DEVIATIONS OF FINE PITCH ERRORS, UNSMOOTHED

                      Pitch Detector
Speaker  AUTOC  CEP  SIFT  DARD  PPROC  LPC  AMDF  Sum

LM         2     2    2     3     4      3    5     21
M1         2     2    2     2     2      2    3     15
M2         2     2    2     3     3      2    3     17
F1         2     2    2     3     2      2    2     15
F2         2     2    2     2     2      1    3     14
C1         1     1    2     2     2      1    2     11
Sum       11    11   12    15    15     11   18

Code: (0-1.5) = 1, (1.5-3) = 2, (3-4.5) = 3, (4.5-6) = 4, (6- ) = 5.

The two simple time-domain pitch detectors (DARD and PPROC) had somewhat higher scores (poorer performance) due to the lower resolution which is obtained in estimating a pitch period directly on the waveform, due to effects discussed earlier. Finally, the worst performance in this category was for the AMDF pitch detector. This result is due to the lack of resolution in the AMDF measurement, which is made only every third or fourth sample; thus the pitch period is only estimated to within a couple of samples.

Tables VII and VIII for the smoothed standard deviations show that the nonlinear smoother does not strongly affect the raw results presented in Tables V and VI. Slight differences in the overall performance scores do exist, both because of the gross pitch period errors which are detected and corrected to fine pitch period errors, and because of the smoothing of the fine pitch errors themselves.

TABLE VII
STANDARD DEVIATION OF FINE PITCH ERRORS, SMOOTHED

                           Pitch Detector
Speaker  Cond.  AUTOC    CEP   SIFT   DARD  PPROC    LPC   AMDF

LM       M        1.1    1.1    1.0    1.2    1.5    1.1    1.9
         T        1.0    1.2    1.0    1.3    1.6    1.3    1.9
         W        1.2    1.3    0.9    1.2    1.5    1.0    2.0
         Sum      3.3    3.6    2.9    3.7    4.6    3.4    5.8

M1       M        0.6    0.6    0.6    0.8    0.7    0.5    1.2
         T        0.6    0.6    0.6    0.7    0.6    0.5    1.0
         W        0.8    0.9    0.9    1.1    0.9    0.7    1.3
         Sum      2.0    2.1    2.1    2.6    2.2    1.7    3.5

M2       M        0.8    0.8    0.8    1.0    1.0    0.8    1.5
         T        0.8    0.9    0.9    0.9    0.8    0.9    1.6
         W        0.9    1.1    1.1    1.2    1.3    0.8    1.5
         Sum      2.5    2.8    2.8    3.1    3.1    2.5    4.6

F1       M        0.5    0.5    0.6    0.8    0.6    0.5    0.9
         T        0.5    0.6    0.6    0.9    0.7    0.5    1.0
         W        0.5    0.5    0.8    0.7    0.6    0.4    0.8
         Sum      1.5    1.6    2.0    2.4    1.9    1.4    2.7

F2       M        0.6    0.6    0.6    0.9    0.7    0.4    1.0
         T        0.5    0.5    0.6    0.8    0.7    0.5    1.2
         W        0.6    0.5    0.6    0.7    0.6    0.5    1.0
         Sum      1.7    1.6    1.8    2.4    2.0    1.4    3.2

C1       M        0.4    0.6    1.1    0.6    0.6    0.5    0.9
         T        0.5    0.5    0.7    0.7    0.5    0.6    1.1
         W        0.5    0.5    0.9    0.6    0.7    0.4    0.8
         Sum      1.4    1.6    2.7    1.9    1.8    1.5    2.8

TABLE VIII
PERFORMANCE SCORES BASED ON SUM OF STANDARD DEVIATIONS OF FINE PITCH ERRORS, SMOOTHED

                      Pitch Detector
Speaker  AUTOC  CEP  SIFT  DARD  PPROC  LPC  AMDF  Sum

LM         3     3    2     3     4      3    4     22
M1         2     2    2     2     2      2    3     15
M2         2     2    2     3     3      2    4     18
F1         1     2    2     2     2      1    2     12
F2         2     2    2     2     2      1    3     14
C1         1     2    2     2     2      1    2     12
Sum       11    13   12    14    15     10   18

Code same as Table VI.

4) Voiced-to-Unvoiced Errors: Tables IX-XIV present the results of the voiced-to-unvoiced errors for each pitch detector. Table IX gives the raw average scores for each recording condition as well as the sum of the scores across recording conditions. Each of the scores is given as a ratio of the number of voiced-to-unvoiced errors to the number of voiced intervals for each condition. As might be anticipated, there is a great lack of homogeneity of the results across recording conditions, especially for the LPC pitch detector.

Table X gives a performance evaluation of the pitch detectors for the raw data of the voiced-to-unvoiced error rate averaged over the three recording conditions. The scores at the top of this table are the percentage of voiced-to-unvoiced errors for each pitch detector. A ranking of 1 was given to a pitch detector with an average error rate less than 5 (percent). A linear scale was used for these performance scores as shown in Table X.
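As a worked illustration of how a Table IX entry becomes a Table X entry, the sketch below converts an error count and a voiced-frame count into a percentage and then into a rank on the 5-percent linear scale. The function name is hypothetical, and rounding the percentage to one decimal place is an assumption made to mirror the precision of the printed tables.

```python
def voiced_to_unvoiced_score(num_errors, num_voiced, step=5.0, max_rank=5):
    """Return (percentage error rate, rank) for one speaker and one pitch detector."""
    percent = 100.0 * num_errors / num_voiced
    rank = min(int(percent // step) + 1, max_rank)   # (0-5) = 1, (5-10) = 2, ...
    return round(percent, 1), rank
```

For speaker LM, the summed entries of Table IX are 101/1795 for AUTOC, 533/1795 for CEP, and 71/1795 for PPROC; these give 5.6 percent (rank 2), 29.7 percent (rank 5), and 4.0 percent (rank 1), respectively.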

Based on the overall rankings, it can be seen that five of the pitch detectors (AUTOC, DARD, PPROC, LPC, and AMDF) had essentially equivalent performance scores and all tended to be homogeneous across speakers. The SIFT pitch detector had a somewhat poorer performance than the top five, and the CEP pitch detector had a poor performance for this error category. We defer a discussion of these results to Section VI.

Because of the lack of homogeneity across recording conditions, a second set of performance ratings was made for this error category based solely on the wideband recordings. These results are presented in Table XI. From this table it can be seen that four of the pitch detectors (AUTOC, PPROC, LPC, and AMDF) performed extremely well on this condition. The SIFT and DARD methods had somewhat poorer performance scores, while the CEP method had the worst score.

Tables XII-XIV show the error scores and performance rankings for voiced-to-unvoiced errors for the smoothed pitch contours. The effect of the smoother is to change slightly the number of voiced-to-unvoiced errors. The performance rankings for the data averaged over recording conditions (Table XIII) show slightly different results than for the raw data; however, the rankings for the wideband condition (Table XIV) are quite similar to the raw data rankings of Table XI.

The results of Tables IX-XIV also show that the most difficult speakers were the two low-pitched speakers (LM, M2) and the high-pitched speaker (C1).

5) Unvoiced-to-Voiced Errors: The last set of tables (Tables XV-XX) show the results of the unvoiced-to-voiced error analysis. The form of the data in these tables is identical to that used in the voiced-to-unvoiced error category. A performance ranking of 1 was given to an unvoiced-to-voiced error rate of less than 10 percent. The remaining ranking scores were assigned linearly as shown in Table XVI. The overall performance scores for the raw data averaged across recording conditions showed the CEP pitch detector to have a very low score (high performance), in contrast to the very high scores it obtained in the previous error category. The AUTOC, SIFT, DARD, PPROC, and AMDF pitch detectors all had similar performance rankings and the LPC pitch detector had a very poor score. (Again we defer discussion of these results to Section VI.)

Table XVII (for the raw wideband data only) shows the performance of the LPC pitch detector to be substantially improved and comparable to all but the CEP pitch detector.

As seen in Tables XVIII-XX, the nonlinear smoother substantially helps almost all the pitch detectors for the unvoiced-to-voiced error category. The performance rankings for all but the LPC pitch detector are almost comparable for the smoothed data averaged over recording conditions (Table XIX); for the wideband smoothed data (Table XX) all seven pitch detectors had comparable performance scores.

VI. DISCUSSION OF ERROR ANALYSIS RESULTS

The error analysis and performance evaluation presented in Section V point up the strengths and weaknesses of each of the pitch detectors used in the study. No single pitch detector was uniformly top ranked across all speakers, recording conditions, and error measurements. In this section we discuss the results presented in Section V with a view towards explaining the general trends in the performance scores and how they relate back to the specific methods of pitch detection used in this study.


TABLE IX
VOICED-TO-UNVOICED ERRORS, UNSMOOTHED

                                        Pitch Detector
Speaker  Cond.  AUTOC      CEP        SIFT       DARD       PPROC      LPC        AMDF

LM       M      32/631     168/631    58/631     66/631     16/631     1/631      27/631
         T      36/631     235/631    105/631    66/631     37/631     78/631     40/631
         W      33/533     130/533    46/533     77/533     18/533     4/533      15/533
         Sum    101/1795   533/1795   209/1795   209/1795   71/1795    83/1795    82/1795

M1       M      19/703     54/703     11/703     30/703     25/703     3/703      28/703
         T      45/703     75/703     37/703     75/703     51/703     36/703     57/703
         W      6/654      88/654     7/654      39/654     14/654     14/654     5/654
         Sum    70/2060    217/2060   55/2060    144/2060   90/2060    53/2060    90/2060

M2       M      48/772     89/772     38/772     65/772     28/772     1/772      40/772
         T      60/772     123/772    60/772     104/772    67/772     194/772    67/772
         W      27/660     123/660    12/660     37/660     15/660     26/660     16/660
         Sum    135/2204   335/2204   110/2204   196/2204   110/2204   221/2204   113/2204

F1       M      10/762     99/762     45/762     15/762     18/762     6/762      21/762
         T      38/762     97/762     42/762     40/762     45/762     148/762    26/762
         W      7/603      70/603     28/603     18/603     14/603     1/603      17/603
         Sum    55/2127    266/2127   115/2127   73/2127    77/2127    155/2127   64/2127

F2       M      18/810     62/810     36/810     14/810     17/810     3/810      23/810
         T      46/810     67/810     37/810     32/810     41/810     68/810     36/810
         W      16/670     68/670     30/670     49/670     33/670     12/670     30/670
         Sum    80/2290    197/2290   103/2290   95/2290    91/2290    83/2290    89/2290

C1       M      38/935     93/935     130/935    27/935     20/935     5/935      21/935
         T      68/935     100/935    137/935    58/935     52/935     43/935     52/935
         W      9/568      66/568     139/568    18/568     12/568     5/568      13/568
         Sum    115/2438   259/2438   406/2438   103/2438   84/2438    53/2438    86/2438

TABLE X
PERFORMANCE SCORES BASED ON SUM OF VOICED-TO-UNVOICED ERRORS, UNSMOOTHED

(a) Percentage Error Rate

                      Pitch Detector
Speaker  AUTOC   CEP   SIFT   DARD  PPROC   LPC   AMDF

LM         5.6  29.7   11.6   11.6    4.0   4.6    4.6
M1         3.4  10.5    2.7    7.0    4.4   2.6    4.4
M2         6.1  15.2    5.0    8.9    5.0  10.0    5.1
F1         2.6  12.5    5.4    3.4    3.6   7.3    3.0
F2         3.5   8.6    4.5    4.1    4.0   3.6    3.9
C1         4.7  10.6   16.7    4.2    3.4   2.2    3.5

(b) Performance Scores

Speaker  AUTOC  CEP  SIFT  DARD  PPROC  LPC  AMDF  Sum

LM         2     5    3     3     1      1    1     16
M1         1     3    1     2     1      1    1     10
M2         2     4    2     2     2      3    2     17
F1         1     3    2     1     1      2    1     11
F2         1     2    1     1     1      1    1      8
C1         1     3    4     1     1      1    1     11
Sum        8    20   13    10     7      9    7

Code: (0-5) = 1, (5-10) = 2, (10-15) = 3, (15-20) = 4, (20- ) = 5.

TABLE XI
PERFORMANCE SCORES BASED ON VOICED-TO-UNVOICED ERRORS, WIDEBAND DATA, UNSMOOTHED

(a) Percentage Error Rate

                      Pitch Detector
Speaker  AUTOC   CEP   SIFT   DARD  PPROC   LPC   AMDF

LM         6.2  24.4    8.6   14.4    3.4   0.8    2.8
M1         0.9  13.5    1.1    6.0    2.1   2.1    0.8
M2         4.1  18.6    1.8    5.6    2.3   3.9    2.4
F1         1.2  11.6    4.6    3.0    2.3   0.2    2.8
F2         2.4  10.1    4.5    7.3    4.9   1.8    4.5
C1         1.6  11.6   24.5    3.2    2.1   0.9    2.3

(b) Performance Scores

Speaker  AUTOC  CEP  SIFT  DARD  PPROC  LPC  AMDF  Sum

LM         2     5    2     3     1      1    1     15
M1         1     3    1     2     1      1    1     10
M2         1     4    1     2     1      1    1     11
F1         1     3    1     1     1      1    1      9
F2         1     3    1     2     1      1    1     10
C1         1     3    5     1     1      1    1     13
Sum        7    21   11    11     6      6    6

Code same as Table X.

The results on the gross pitch period errors (Tables I-IV) showed that the time-domain and hybrid pitch detectors had greatest difficulty with the low-pitched speakers (LM, M2) whereas the spectral pitch detector (CEP) had the greatest difficulty with the high-pitched speakers (C1, F1). The difficulties of time-domain methods for low-pitched speakers are due to the fixed 30-40-ms analysis frame which is generally inadequate for low-pitched speakers. The difficulties of spectral methods for high-pitched speakers are due to the small number of harmonics which are present in their spectra, leading to analysis difficulties in choosing the correct pitch. The poor performance of the SIFT pitch detector on speaker C1 is related to the problem of reliably spectrally flattening (by inverse filtering) a signal in which generally only one harmonic occurs.
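A rough way to see why a fixed frame is a problem for low-pitched speakers is to count how many pitch periods fit in one analysis frame. The sketch below is simple arithmetic rather than part of any of the seven detectors; the function name is hypothetical, and the example periods are the extremes of the speaker pitch period ranges listed earlier in the paper (2.3 ms and 19.2 ms).

```python
def periods_per_frame(frame_ms, period_ms):
    """Number of pitch periods spanned by an analysis frame of the given length."""
    return frame_ms / period_ms

# A 30-ms frame holds about 1.6 periods at a 19.2-ms pitch period,
# but about 13 periods at a 2.3-ms pitch period.
low_pitch = periods_per_frame(30.0, 19.2)
high_pitch = periods_per_frame(30.0, 2.3)
```

With fewer than two complete periods in the frame, a periodicity measure has very little repetition to work with, which is consistent with the difficulty reported for speakers LM and M2.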


TABLE XII
VOICED-TO-UNVOICED ERRORS, SMOOTHED

                                        Pitch Detector
Speaker  Cond.  AUTOC      CEP        SIFT       DARD       PPROC      LPC        AMDF

LM       M      112/626    137/626    48/626     56/626     31/626     6/626      34/626
         T      213/626    226/626    135/626    60/626     60/626     79/626     85/626
         W      69/512     97/512     28/512     61/512     18/512     6/512      11/512
         Sum    404/1764   460/1764   211/1764   117/1764   109/1764   91/1764    130/1764

M1       M      20/706     43/706     8/706      33/706     26/706     0/706      24/706
         T      64/706     65/706     42/706     91/706     67/706     25/706     60/706
         W      9/657      74/657     7/657      44/657     17/657     14/657     5/657
         Sum    93/2069    182/2069   57/2069    168/2069   110/2069   39/2069    89/2069

M2       M      90/782     80/782     47/782     86/782     45/782     0/782      46/782
         T      134/782    104/782    86/782     114/782    89/782     213/782    85/782
         W      32/660     116/660    9/660      43/660     25/660     15/660     16/660
         Sum    256/2224   300/2224   142/2224   243/2224   159/2224   228/2224   147/2224

F1       M      7/769      119/769    50/769     32/769     16/769     2/769      19/769
         T      40/769     92/769     34/769     53/769     45/769     157/769    22/769
         W      7/607      81/607     22/607     24/607     18/607     2/607      21/607
         Sum    54/2145    292/2145   106/2145   109/2145   79/2145    161/2145   62/2145

F2       M      13/815     61/815     29/815     17/815     8/815      4/815      32/815
         T      44/815     70/815     34/815     43/815     46/815     47/815     48/815
         W      15/676     72/676     28/676     83/676     38/676     12/676     32/676
         Sum    72/2306    203/2306   91/2306    143/2306   92/2306    63/2306    112/2306

C1       M      40/941     99/941     198/941    44/941     31/941     31/941     86/941
         T      70/941     107/941    175/941    107/941    65/941     85/941     97/941
         W      8/600      70/600     230/600    26/600     15/600     5/600      33/600
         Sum    118/2482   276/2482   603/2482   177/2482   111/2482   121/2482   216/2482

TABLE XIII
PERFORMANCE SCORES BASED ON SUM OF VOICED-TO-UNVOICED ERRORS, SMOOTHED

(a) Percentage Error Rate

                      Pitch Detector
Speaker  AUTOC   CEP   SIFT   DARD  PPROC   LPC   AMDF

LM        22.9  26.1   12.0   10.0    6.2   5.2    7.4
M1         4.5   8.8    2.8    8.1    5.3   1.9    4.3
M2        11.5  13.5    6.4   10.9    7.1  10.3    6.6
F1         2.5  13.6    4.9    5.1    3.7   7.5    2.9
F2         3.1   8.8    3.9    6.2    4.0   2.7    4.9
C1         4.8  11.1   24.3    7.1    4.5   4.9    8.7

(b) Performance Scores

Speaker  AUTOC  CEP  SIFT  DARD  PPROC  LPC  AMDF  Sum

LM         5     5    3     3     2      2    2     22
M1         1     2    1     2     2      1    1     10
M2         3     3    2     3     2      3    2     18
F1         1     3    1     2     1      2    1     11
F2         1     2    1     2     1      1    1      9
C1         1     3    5     2     1      1    2     15
Sum       12    18   13    14     9     10    9

Code same as Table X.

TABLE XIV
PERFORMANCE SCORES BASED ON VOICED-TO-UNVOICED ERRORS, WIDEBAND DATA, SMOOTHED

(a) Percentage Error Rate

                      Pitch Detector
Speaker  AUTOC   CEP   SIFT   DARD  PPROC   LPC   AMDF

LM        13.5  18.9    5.5   11.9    3.5   1.2    2.1
M1         1.4  11.3    1.1    6.7    2.6   2.1    0.8
M2         4.8  17.6    1.4    6.5    3.8   2.3    2.4
F1         1.2  13.3    3.6    4.0    3.0   0.3    3.5
F2         2.2  10.7    4.1   12.3    5.6   1.8    4.7
C1         1.3  11.7   38.3    4.3    2.5   0.8    5.5

(b) Performance Scores

Speaker  AUTOC  CEP  SIFT  DARD  PPROC  LPC  AMDF  Sum

LM         3     4    2     3     1      1    1     15
M1         1     3    1     2     1      1    1     10
M2         1     4    1     2     1      1    1     11
F1         1     3    1     1     1      1    1      9
F2         1     3    1     3     2      1    1     12
C1         1     3    5     1     1      1    2     14
Sum        8    20   11    12     7      6    7

Code same as Table X.

TABLE XV
UNVOICED-TO-VOICED ERRORS, UNSMOOTHED

                                        Pitch Detector
Speaker  Cond.  AUTOC      CEP        SIFT       DARD       PPROC      LPC        AMDF

LM       M      44/277     16/277     45/277     24/277     52/277     165/277    48/277
         T      35/277     5/277      46/277     25/277     71/277     133/277    69/277
         W      32/180     3/180      33/180     25/180     29/180     46/180     31/180
         Sum    111/734    24/734     124/734    74/734     152/734    344/734    148/734

M1       M      32/292     5/292      52/292     27/292     25/292     73/292     35/292
         T      26/292     5/292      52/292     36/292     74/292     96/292     68/292
         W      14/132     9/132      22/132     9/132      16/132     11/132     18/132
         Sum    72/716     19/716     126/716    72/716     115/716    180/716    121/716

M2       M      42/324     19/324     55/324     88/324     65/324     226/324    43/324
         T      39/324     13/324     59/324     11/324     52/324     141/324    38/324
         W      20/128     4/128      30/128     19/128     26/128     22/128     24/128
         Sum    101/772    36/772     144/772    118/772    143/772    389/772    105/772

F1       M      40/219     9/219      50/219     56/219     48/219     68/219     29/219
         T      30/219     8/219      47/219     25/219     43/219     51/219     29/219
         W      21/125     5/125      31/125     8/125      13/125     15/125     6/125
         Sum    91/563     22/563     128/563    89/563     104/563    134/563    64/563

F2       M      86/400     20/400     91/400     147/400    128/400    254/400    126/400
         T      51/400     23/400     84/400     71/400     107/400    188/400    89/400
         W      16/160     11/160     27/160     3/160      9/160      15/160     13/160
         Sum    153/960    54/960     202/960    221/960    244/960    457/960    228/960

C1       M      43/312     7/312      73/312     29/312     51/312     89/312     37/312
         T      35/312     10/312     57/312     43/312     74/312     102/312    43/312
         W      15/132     7/132      21/132     17/132     15/132     11/132     9/132
         Sum    93/756     24/756     151/756    89/756     140/756    202/756    89/756

TABLE XVI
PERFORMANCE SCORES BASED ON SUM OF UNVOICED-TO-VOICED ERRORS, UNSMOOTHED

(a) Percentage Error Rate

                      Pitch Detector
Speaker  AUTOC   CEP   SIFT   DARD  PPROC   LPC   AMDF

LM        15.1   3.3   16.9   10.0   20.7  46.9   20.2
M1        10.1   2.7   17.6   10.1   16.1  25.1   16.9
M2        13.1   4.7   18.7   15.3   18.5  50.4   13.6
F1        16.2   3.9   22.7   15.8   18.5  23.8   11.4
F2        15.9   5.6   21.0   23.0   25.4  47.6   23.8
C1        12.3   3.2   20.0   11.8   18.5  26.7   11.8

(b) Performance Scores

Speaker  AUTOC  CEP  SIFT  DARD  PPROC  LPC  AMDF  Sum

LM         2     1    2     2     3      5    3     18
M1         2     1    2     2     2      3    2     14
M2         2     1    2     2     2      5    2     16
F1         2     1    3     2     2      3    2     15
F2         2     1    3     3     3      5    3     20
C1         2     1    3     2     2      3    2     15
Sum       12     6   15    13    14     24   14

Code: (0-10) = 1, (10-20) = 2, (20-30) = 3, (30-40) = 4, (40- ) = 5.

TABLE XVII
PERFORMANCE SCORES BASED ON UNVOICED-TO-VOICED ERRORS, WIDEBAND DATA, UNSMOOTHED

(a) Percentage Error Rate

                      Pitch Detector
Speaker  AUTOC   CEP   SIFT   DARD  PPROC   LPC   AMDF

LM        17.8   1.7   18.3   13.9   16.1  25.6   17.2
M1        10.6   6.8   16.7    6.8   12.1   8.3   13.6
M2        15.6   3.1   23.4   14.8   20.3  17.2   18.8
F1        16.8   4.0   24.8    6.4   10.4  12.0    4.8
F2        10.0   6.9   16.9    1.9    5.6   9.4    8.1
C1        11.4   5.3   15.9   12.9   11.4   8.3    6.8

(b) Performance Scores

Speaker  AUTOC  CEP  SIFT  DARD  PPROC  LPC  AMDF  Sum

LM         2     1    2     2     2      3    2     14
M1         2     1    2     1     2      1    2     11
M2         2     1    3     2     3      2    2     15
F1         2     1    3     1     2      2    1     12
F2         2     1    2     1     1      1    1      9
C1         2     1    2     2     2      1    1     11
Sum       12     6   14     9    12     10    9

Code same as Table XVI.

The results on the fine pitch period errors (Tables V-VIII) showed that (aside from the AMDF method which inherently lacked pitch resolution) the time-domain waveform pitch detectors (DARD, PPROC) had somewhat lower resolution than the other methods. This is due to the sensitivity of waveform peaks, valleys, and zero crossings to formant changes, noise, distortion, etc.

The error measurements of voiced-to-unvoiced and unvoiced-to-voiced errors provided several interesting results. These categories cannot be examined separately because they are often intimately related. For example, a voiced-unvoiced detector which is biased towards the category voiced will generally have a low voiced-to-unvoiced error rate, but in compensation will have a high unvoiced-to-voiced error rate. There are three types of voiced-unvoiced decision methods used in the seven pitch detectors. One method is the use of a simple threshold on one or more measurements to classify an interval as voiced or unvoiced. For example, the preliminary voiced-unvoiced detector used in the AUTOC, CEP, SIFT, and PPROC methods used a waveform threshold to remove intervals of silence. The second type of voiced-unvoiced detector is the periodicity measurement. For example, the AUTOC, AMDF, and SIFT methods used a threshold on the autocorrelation peak to decide if the interval was periodic whereas the CEP method used a threshold on the cepstral peak for this purpose. The third type of voiced-unvoiced detector is the pattern recognition statistical approach used in the LPC pitch detector. Each of these methods has some advantages and disadvantages. For example, the periodicity measurement tends to be extremely robust with regard to noise, distortion, and spurious transients in the signal. Thus methods like the AUTOC and AMDF pitch detectors tended to work uni-


TABLE XVIIIUNVOICED-TO-VOICED ERRORS—SMOOTHED

Speaker AUTOC CEP

Pitch De

SIFT

tector

DARD PPROC LPC AMDF

LM MTWSum

5/2823/2823/201

11/765

3/2820/2822/2015/765

6/2829/282

17/20132/765

18/28211/28211/20140/765

24/28225/28220/20169/765

95/28259/28243/201

197/765

51/28264/28235/201

150/765Ml M

TWSum

19/2893/2898/129

30/707

3/2895/2897/129

15/707

45/28933/28920/12998/707

17/2895/2895/129

27/707

9/28929/289

3/12941/707

55/28965/289

9/129129/707

35/28954128915/129

104/707M2 M

TWSum

5/3144/314

17/12826/756

13/3148/3144/128

25/756

18/3145/314

20/12843/756

45/3142/3149/128

56/756

21/31413/31416/12850/756

225/314131/31420/128

376/756

31/31437/31424/12892/756

Fl MTWSum

31/21219/21217/12167/545

5/2121/2125/121

11/545

34/21244/21220/12198/545

3/2122/2120/1215/545

21/21219/2126/121

46/545

33/21219/21211/12163/545

15/21218/2126/121

39/545F2 M

TWSum

34/39526/395

8/15468/944

5/39511/39510/15426/944

52/39564/39520/154

136/944

42/39535/3950/154

77/944

48/39555/3957/154

110/944

82/395132/395

6/154220/944

44/39544/395

9/15497/944

Cl MTWSum

16/30610/30613/13039/742

4/3066/3065/130

15/742

31/30618/30610/13059/742

4/3063/3060/1307/742

16/30615/3067/130

38/742

10/3063/306

10/13023/742

13/3069/3067/130

29/742

Speaker AUTOC CEP

Pitch Detector


LMMlM2FlF2Cl

1.44.23.4

12.37.25.3

0.72.13.32.02.82.0

4.2 5.2 9.013.9 3.8 5.85.7 7.4 6.6

18.0 0.9 8.414.4 8.2 11.7

8.0 0.9 5.1

25.818.249.711.623.3

3.1

19.614.712.27.2

10.33.9



LMMlM2FlF2ClSum

1112117

11

1

1

1

1

6

1 1 1 32 1 1 21 1 1 52 1 1 22 1 2 31 1 1 1

9 6 7 16

222121

10

10101210127


formly well across recording conditions, whereas a methodlike the LPC pitch detector, which used a pattern recognitionvoiced—unvoiced detector, worked much better for widebandrecordings than for microphone or telephone recordings. Thedistortions (especially high-level transients) and band-limitingin both the microphone and telephone recordings made re-liable voiced—unvoiced decisions almost impossible for the

TABLE XX

Error rate (percent)

                          Pitch Detector
Speaker    AUTOC    CEP   SIFT   DARD  PPROC    LPC   AMDF
LM           1.5    1.0    8.5    5.5   10.0   21.4   17.4
M1           6.2    5.4   15.5    3.9    2.3    7.0   11.6
M2          13.3    3.1   15.6    7.0   12.5   15.6   18.8
F1          14.0    4.1   16.5    0.0    5.0    9.1    5.0
F2           5.2    6.5   13.0    0.0    4.5    3.9    5.8
C1          10.0    3.8    7.7    0.0    5.4    7.7    5.4

Coded score

Speaker    AUTOC    CEP   SIFT   DARD  PPROC    LPC   AMDF    Sum
LM             1      1      1      1      2      3      2     11
M1             1      1      2      1      1      1      2      9
M2             2      1      2      1      2      2      2     12
F1             2      1      2      1      1      1      1      9
F2             1      1      2      1      1      1      1      8
C1             2      1      1      1      1      1      1      8
Sum            9      6     10      6      8      9      9


pattern recognition approach (of the LPC method) using the five parameters discussed in [15]. However, for the wideband recordings, this method worked quite well.
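The first two detector types described above can be sketched as a two-stage decision: a waveform level threshold that rejects silence, followed by a threshold on the normalized autocorrelation peak. The Python sketch below is illustrative only. The function names, the threshold values (`silence_level`, `periodicity_threshold`), and the lag range are assumptions chosen for the example; they are not the settings used in any of the seven pitch detectors.

```python
import math
import random


def normalized_autocorr_peak(frame, min_lag, max_lag):
    """Return (peak, lag): the largest autocorrelation value in the lag
    range, normalized by the zero-lag energy, and the lag where it occurs."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0, 0
    best_peak, best_lag = 0.0, 0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r / energy > best_peak:
            best_peak, best_lag = r / energy, lag
    return best_peak, best_lag


def classify_frame(frame, silence_level=0.01, periodicity_threshold=0.3,
                   min_lag=20, max_lag=200):
    """Classify one analysis interval as 'voiced' or 'unvoiced'.

    All default values are hypothetical and only serve the example."""
    # Stage 1: simple waveform threshold to remove intervals of silence.
    if max(abs(x) for x in frame) < silence_level:
        return "unvoiced"
    # Stage 2: periodicity measurement (threshold on the autocorrelation peak).
    peak, _ = normalized_autocorr_peak(frame, min_lag, max_lag)
    return "voiced" if peak >= periodicity_threshold else "unvoiced"


if __name__ == "__main__":
    # Synthetic test signals (not data from the study).
    periodic = [math.sin(2 * math.pi * n / 50) for n in range(400)]
    rng = random.Random(0)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]
    for name, frame in (("periodic", periodic), ("noise", noise)):
        print(name, classify_frame(frame))
```

A pattern recognition detector of the third type would replace the two fixed thresholds with a statistical decision over several measurements, which is outside the scope of this sketch.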

TABLE XIX
PERFORMANCE SCORES BASED ON SUM OF UNVOICED-TO-VOICED ERRORS - SMOOTHED

TABLE XX
PERFORMANCE SCORES BASED ON UNVOICED-TO-VOICED ERRORS - WIDEBAND DATA - SMOOTHED

Code same as Table XVI.

The only method which had no formal voiced-unvoiced detector was the DARD method. This method just identified pitch period markers directly on the speech waveform. The method used to classify an interval as voiced was to measure the spacing between adjacent markers centered around the interval and to call the interval unvoiced if the marker spacing exceeded 200 samples (20-ms period). This method provided surprisingly good results, yielding a reasonable voiced-unvoiced error rate.
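The marker-spacing rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DARD implementation: the function name is hypothetical, the markers are taken to be a sorted list of sample indices, and an interval with no marker on one side is treated as unvoiced, a case the text does not address.

```python
import bisect

MAX_SPACING = 200  # samples; the 20-ms limit quoted in the text


def classify_by_marker_spacing(markers, center, max_spacing=MAX_SPACING):
    """Classify the interval centered at sample index `center`.

    `markers` is a sorted list of pitch-marker sample indices.
    Returns ('voiced', spacing) or ('unvoiced', None)."""
    i = bisect.bisect_right(markers, center)
    if i == 0 or i == len(markers):
        # Assumption: no bracketing marker pair means unvoiced.
        return "unvoiced", None
    spacing = markers[i] - markers[i - 1]
    if spacing > max_spacing:
        return "unvoiced", None
    return "voiced", spacing


if __name__ == "__main__":
    example_markers = [100, 180, 262, 600, 690]  # hypothetical marker positions
    for center in (50, 200, 400, 650):
        print(center, classify_by_marker_spacing(example_markers, center))
```

Because the decision reuses the markers that the pitch detector has already placed, it adds almost no computation beyond a search for the bracketing pair.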

Finally, it can be seen that the CEP pitch detector had a strong tendency to classify voiced intervals as unvoiced. In compensation the unvoiced-to-voiced error rate for the CEP method was very low. Readjustment of the cepstral peak threshold and the following zero-crossing threshold would yield a tradeoff in these scores.
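The coded performance scores used throughout this section (Tables XVI through XX) can be reproduced from the raw error counts with the code legend given with Table XVI. The Python sketch below is illustrative, and its function names are hypothetical. Two details are assumptions rather than statements from the text: each error rate is rounded to one decimal place before coding, and a rate that falls exactly on a bin edge (such as 10.0) takes the higher code, which appears consistent with the tabulated entries.

```python
def error_percent(errors, total):
    """Error rate in percent, rounded to one decimal place (assumed)."""
    return round(100.0 * errors / total, 1)


def error_score(percent):
    """Map an error rate to the code of Table XVI:
    (0-10) = 1, (10-20) = 2, (20-30) = 3, (30-40) = 4, (40- ) = 5."""
    if percent < 0:
        raise ValueError("error rate must be non-negative")
    return min(int(percent // 10) + 1, 5)


# "Sum" entries for speaker LM in Table XVIII (smoothed), each out of 765.
LM_SMOOTHED_SUMS = {
    "AUTOC": 11, "CEP": 5, "SIFT": 32, "DARD": 40,
    "PPROC": 69, "LPC": 197, "AMDF": 150,
}

if __name__ == "__main__":
    total = 0
    for detector, errors in LM_SMOOTHED_SUMS.items():
        pct = error_percent(errors, 765)
        score = error_score(pct)
        total += score
        print(f"{detector:6s} {pct:5.1f} {score}")
    print("row sum", total)
```

Applied to the LM row, the sketch gives a row sum of 10, matching the Sum entry for LM in Table XIX.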

VII. COMPUTATIONAL CONSIDERATIONS

Since none of the pitch detectors used in this study are commercially available, another factor in comparing these pitch detectors is their speed of execution on the computer (a Data General NOVA 800 minicomputer²) on which all the simulations were run. Table XXI shows such a comparison along with other computational considerations for implementing the various algorithms. The execution times given in the table are the time required to process 1 s of speech. It can be seen that the two waveform time-domain pitch detectors (DARD and PPROC) ran the fastest, whereas all the others were 1 to 2 orders of magnitude slower. The AMDF pitch detector would take about four times longer if the resolution in the measurement were increased to 1 sample at a 6.67-kHz rate. The AUTOC pitch detector is a factor of 2 or more faster than the SIFT, LPC, and CEP pitch detectors because of the simplified autocorrelation function which is computed using a counter rather than a multiplier and an adder.
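The remark about the counter can be illustrated as follows. Assuming each sample is first reduced to one of three levels (+1, 0, -1) by combined center and infinite-peak clipping, every product in the autocorrelation sum is itself +1, 0, or -1, so the sum can be accumulated with an up-down counter instead of a multiplier and an adder. The Python sketch below is illustrative only; the function names and the clipping level are assumptions, and it does not reproduce the NOVA 800 implementation.

```python
def three_level_clip(frame, clip_level):
    """Reduce each sample to +1, 0, or -1 around a symmetric clipping level."""
    return [1 if x > clip_level else -1 if x < -clip_level else 0
            for x in frame]


def counter_autocorr(clipped, lag):
    """Autocorrelation of a three-level signal using only a counter."""
    count = 0
    for n in range(len(clipped) - lag):
        a, b = clipped[n], clipped[n + lag]
        if a == 0 or b == 0:
            continue          # product is zero: counter unchanged
        if a == b:
            count += 1        # like signs: count up
        else:
            count -= 1        # unlike signs: count down
    return count


if __name__ == "__main__":
    frame = [0.9, 0.2, -0.8, -0.1, 0.7, 0.3, -0.9, 0.0]  # hypothetical samples
    clipped = three_level_clip(frame, clip_level=0.5)
    print(clipped)
    print([counter_autocorr(clipped, lag) for lag in range(5)])
```

The counter result equals the ordinary sum of products of the clipped samples, so the pitch decision is unchanged while each lag costs only comparisons and increments.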

Table XXI also includes some of the details of how the various pitch detection algorithms were implemented on the NOVA 800 computer. The numerical method of realization (i.e., fixed or floating point) is indicated in the column labeled "arithmetic type." Three of the algorithms were realized in integer arithmetic (DARD, PPROC, and AUTOC); three were realized in floating-point arithmetic (AMDF,³ SIFT, and LPC); the CEP method used both integer arithmetic [for windowing and fast Fourier transforms (FFT's)] and floating-point arithmetic (for the log magnitude operation). The next column indicates whether or not downsampling (i.e., reduction of the sampling rate of the signal to a lower rate) was used in the realization to reduce the computation. Although not used for the AMDF and AUTOC methods, it could easily be incorporated into these methods to speed up the realization. Finally, the last column shows the dependence of the computation on the sampling rate of the input. As seen in this table, all the methods are approximately linearly or quadratically dependent on the sampling rate, assuming all the parameters of the

² Cycle time 800 ns, add time of 1.6 µs, multiply time of 3.6 µs. The machine also had floating-point hardware.

³ The AMDF algorithm as provided to us was implemented in integer arithmetic. However, the 16-bit integer representation of the NOVA 800 is inadequate for this implementation. Consequently, the computations were converted to floating point.

TABLE XXI
COMPUTATIONAL CONSIDERATIONS FOR THE SEVEN PITCH DETECTORS ON THE NOVA 800 MINICOMPUTER

Pitch       Speed/s      Arithmetic        Downsampling   Dependence on
Detector    of Speech    Type              Used           Sampling Rate
DARD        5 s          Integer           No             Linear
PPROC       7.5 s        Integer           No             Linear
AMDF        50 s         Floating point    No (a)         Quadratic
AUTOC       120 s        Integer           No (a)         Quadratic
SIFT        250 s        Floating point    Yes            Quadratic
LPC         300 s        Floating point    Yes            Quadratic
CEP         400 s        Mixed             No             Linear

(a) These algorithms could easily incorporate downsampling.

analysis (i.e., analysis section length, pitch range, etc.) remain the same.
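The quadratic entries in Table XXI can be motivated by a simple operation count. With the analysis section length and the pitch range fixed in seconds, both the number of samples per section and the number of candidate lags grow in proportion to the sampling rate, so a lag-domain method needs roughly (sections per second) × (samples per section) × (lags) operations. A method that does a bounded amount of work per input sample grows only linearly. The Python sketch below is a rough model; the default section length, pitch range, section rate, and per-sample cost are hypothetical values, not figures from this study.

```python
def lag_domain_ops_per_second(fs, section_s=0.03, min_period_s=0.002,
                              max_period_s=0.02, sections_per_s=100):
    """Rough operation count for a method that evaluates every candidate
    lag over every sample of a section (autocorrelation or AMDF style)."""
    samples_per_section = round(section_s * fs)
    lags = round((max_period_s - min_period_s) * fs)
    return sections_per_s * samples_per_section * lags


def sample_domain_ops_per_second(fs, ops_per_sample=10):
    """Rough operation count for a method with fixed work per input sample."""
    return ops_per_sample * fs


if __name__ == "__main__":
    for fs in (5000, 10000, 20000):
        print(fs, lag_domain_ops_per_second(fs),
              sample_domain_ops_per_second(fs))
```

In this model, doubling the sampling rate quadruples the lag-domain count and doubles the per-sample count, which is the sense in which the table entries read "Quadratic" and "Linear".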

VIII. SUMMARY

This paper has reported on the results of a rather extensive performance evaluation of seven pitch detection algorithms. Using a variety of error measurements, the performance strengths and weaknesses of each of the pitch detectors for different speakers and different recording conditions were highlighted.

A major issue which arises when trying to understand the results of this study is how to interpret the various error scores. This is one problem for which we have no simple answer other than it all depends on the intended application of the pitch analysis. For example, classifying a low-level voiced speech interval as unvoiced may be perfectly acceptable for a vocoder, but may cause great problems for a recognition system. Similarly, the level at which various types of errors become significant also depends strongly on the application. We have presented performance scores based on a criterion related to the applications with which the authors are most familiar, i.e., speaker verification systems [2] and digit recognition systems [17].

Finally, an important consideration in interpreting the results presented here is the perceptual effect of each of the types of errors discussed in Section V. A parallel series of investigations is required to provide perceptual comparisons among the seven pitch detectors. Such an investigation is currently being made by the authors.

REFERENCES

[1] B. S. Atal, "Automatic speaker recognition based on pitch contours," J. Acoust. Soc. Amer., vol. 52, pp. 1687-1697, Dec. 1972.

[2] A. E. Rosenberg and M. R. Sambur, "New techniques for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 169-176, Apr. 1975.

[3] H. Levitt, "Speech processing aids for the deaf: An overview," IEEE Trans. Audio Electroacoust. (Special Issue on 1972 Conference on Speech Communication and Processing), vol. AU-21, pp. 269-273, June 1973.

[4] J. L. Flanagan, Speech Analysis, Synthesis, and Perception. New York: Springer-Verlag, 1972.

[5] A. M. Noll, "Cepstrum pitch determination," J. Acoust. Soc. Amer., vol. 41, pp. 293-309, Feb. 1967.


[6] B. Gold and L. R. Rabiner, "Parallel processing techniques for estimating pitch periods of speech in the time domain," J. Acoust. Soc. Amer., vol. 46, pp. 442-448, Aug. 1969.

[7] M. M. Sondhi, "New methods of pitch extraction," IEEE Trans. Audio Electroacoust. (Special Issue on Speech Communication and Processing, Part II), vol. AU-16, pp. 262-266, June 1968.

[8] J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 367-377, Dec. 1972.

[9] N. J. Miller, "Pitch detection by data reduction," IEEE Trans. Acoust., Speech, Signal Processing (Special Issue on IEEE Symposium on Speech Recognition), vol. ASSP-23, pp. 72-79, Feb. 1975.

[10] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, "Average magnitude difference function pitch extractor," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-22, pp. 353-362, Oct. 1974.

[11] J. J. Dubnowski, R. W. Schafer, and L. R. Rabiner, "Real-time digital hardware pitch detector," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 2-8, Feb. 1976.

[12] R. W. Schafer and L. R. Rabiner, "System for automatic formant analysis of voiced speech," J. Acoust. Soc. Amer., vol. 47, pp. 634-648, Feb. 1970.

[13] C. A. McGonegal, L. R. Rabiner, and A. E. Rosenberg, "A semiautomatic pitch detector (SAPD)," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 570-574, Dec. 1975.

[14] J. D. Markel and A. H. Gray, Linear Prediction of Speech. New York: Springer, 1976.

[15] B. S. Atal and L. R. Rabiner, "A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 201-212, June 1976.

[16] R. E. Crochiere and L. R. Rabiner, "Optimum FIR digital filter implementations for decimation, interpolation, and narrow-band filtering," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 444-456, Oct. 1975.

[17] L. R. Rabiner, M. R. Sambur, and C. E. Schmidt, "Applications of a nonlinear smoothing algorithm to speech processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 552-557, Dec. 1975.

[18] M. J. Cheng, "A comparative performance study of several pitch detection algorithms," M.S. thesis, Mass. Inst. Technol., Cambridge, June 1975.

[19] L. R. Rabiner and M. R. Sambur, "Some preliminary experiments in the recognition of connected digits," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 170-182, Apr. 1976.

Maximum Likelihood Pitch Estimation

JAMES D. WISE, STUDENT MEMBER, IEEE, JAMES R. CAPRIO, MEMBER, IEEE, AND THOMAS W. PARKS, MEMBER, IEEE

Abstract-A method for estimating the pitch period of voiced speech sounds is developed based on a maximum likelihood (ML) formulation. It is capable of resolution finer than one sampling period and is shown to perform better in the presence of noise than the cepstrum method.

I. INTRODUCTION

MANY current speech encoding techniques attempt to achieve low-rate digitized speech transmission by modeling the speech source as a linear filter, representing the vocal tract resonances, excited by either random noise or a quasiperiodic pulse train, representing the source signal for unvoiced and voiced speech, respectively. Achieving good-quality resynthesized speech with this model requires that both the filter parameters and the excitation signal be accurately estimated. The mechanical character of the early vocoders was due largely to their inability to extract the excitation signal and motivated the development of the voice-excited vocoder [1]. It is this same difficulty in accurately modeling the excitation signal which is responsible for the popularity of systems which transmit a quantized version of the estimated excitation signal even though they have a higher data rate than those which parameterize the excitation.

Manuscript received October 23, 1975; revised March 30, 1976 and April 6, 1976. This work was supported in part by the National Science Foundation under Grant ENG 70-01349 A03.

J. D. Wise and T. W. Parks are with the Department of Electrical Engineering, Rice University, Houston, TX 77001.

J. R. Caprio is with Comptek Research, Inc., Buffalo, NY.

The difficulty in estimating the excitation is compounded by departures from the idealizations used in developing the method which are encountered in a realistic situation: noisy environment, absence of the fundamental due to band-limiting, simultaneous presence of periodic and random excitation, phase distortion, or rapid changes in pitch period. In particular, the sensitivity of the pitch detector to ambient noise is a serious limitation in many potential applications of analysis-synthesis telephony systems [2].

The pitch detection scheme to be discussed is designed to be resistant to white, Gaussian noise, may be extended to colored noise, and shows promising performance in the presence of realistic environmental noise. In addition, it is capable of determining the pitch period with a resolution finer than one sample period, resulting in improved performance for high-pitched speech.

In revising this paper, the authors discovered that Noll [3] proposed a maximum likelihood (ML) pitch estimation method similar to that described in Section II of this paper. The modifications developed in Section III reduce his problem of multiple peaks with increasing amplitude caused by noise, and enable this method to provide a parameter which is useful in making the voiced-unvoiced decision. An interpretation in terms of the signal autocorrelation function is presented which is useful in suggesting possible efficient implementations and in providing insight into the frequency-domain behavior of the estimator. This method has been evaluated on several utterances with various types of noise and signal-to-
