

O. Das, J. O. Smith III, and C. Chafe, “Improved Real-Time Monophonic Pitch Tracking with the Extended Complex Kalman Filter,” J. Audio Eng. Soc., vol. 68, no. 1/2, pp. 78–86 (2020 January/February). DOI: https://doi.org/10.17743/jaes.2019.0053

Improved Real-Time Monophonic Pitch Tracking with the Extended Complex Kalman Filter

ORCHISAMA DAS, AES Student Member ([email protected]), JULIUS O. SMITH III, AES Fellow ([email protected]), AND CHRIS CHAFE ([email protected])

    Center for Computer Research in Music and Acoustics, Stanford University, Stanford, CA, USA

This paper proposes a real-time, sample-by-sample pitch tracker for monophonic audio signals using the Extended Kalman Filter in the complex domain (Extended Complex Kalman Filter). It improves upon the algorithm proposed by the same authors in a previous paper [1] by fixing the issue of slow tracking of rapid note changes. It does so by detecting harmonic change in the signal and resetting the filter whenever a significant harmonic change is detected. Along with the fundamental frequency, the ECKF also tracks the amplitude envelope and instantaneous phase of the input audio signal. The pitch tracker is ideal for detecting ornaments in solo instrument music, such as slides and vibratos. The improved algorithm is tested to track pitch of bowed string (double bass), plucked string (guitar), and vocal singing samples.

    0 INTRODUCTION

Pitch is a perceptual feature [2] that roughly relates to the fundamental frequency of the sound wave under consideration. Pitch tracking algorithms aim to track the evolution of fundamental frequency with time. Pitch tracking in speech and music has been an active area of research [3], with applications in speech recognition and automatic music transcription.

Algorithms for monophonic pitch detection can be classified into three broad categories: time domain methods, frequency domain methods, and statistical methods. Time domain methods based on the zero-crossing rate and autocorrelation function are particularly popular. One example of this is the YIN [4] estimator, which makes use of a modified autocorrelation function to accurately detect periodicity in signals. Among frequency domain methods, the best known techniques are the cepstrum [5], the harmonic product spectrum [6], and an optimum comb filter algorithm [7]. Statistical methods include the maximum likelihood pitch estimator [6, 8], more recent neural networks [9, 10], and hidden Markov models [11]. Combinations of these methods include pYIN (probabilistic YIN) [12], which takes multiple pitch candidates from YIN with associated probabilities and feeds them to a hidden Markov model, and the extended Kalman filter based pitch trackers proposed in [1, 13], which combine statistical and frequency domain methods. The state of the art in monophonic pitch tracking is CREPE [10], which uses a deep convolutional network. It is to be noted that multi-pitch estimation in a polyphonic context is a more complex problem, and the tools used to tackle it are different [14].

In [15] Cuadra et al. discuss the performance of various pitch detection algorithms in real-time interactive music. They establish that although monophonic pitch detection seems like a well-researched problem with little scope for improvement, that is not true in real-time applications. Some of the most common issues in real-time pitch tracking are optimization, latency, and accuracy in noisy conditions. The Extended Complex Kalman Filter (ECKF) based pitch tracker developed in [1] overcomes some of these limitations. It has low latency, is robust to the presence of noise, and yields pitch estimates on a fine-grained sample-by-sample basis.

The ECKF was originally proposed by Dash et al. in [16] to track frequency fluctuations in a 60 Hz power signal. The ECKF is ideal for use in real time, with a high tolerance for noise, and its complex multiplications can be carried out on a floating point processor. It simultaneously tracks fundamental frequency, amplitude, and phase in the presence of harmonics and noise. Of course, several modifications need to be made before applying it to track pitch in audio signals, such as detecting silent frames that have no pitch, initializing the filter for fast convergence, and an adaptive process noise to accurately detect fine changes in pitch. Perhaps the biggest application of the ECKF pitch tracker is detecting ornaments, such as glissando, vibrato, and trill. Such techniques are essential in adding expression to a musical performance. Playing technique detection [17] is an important task in MIR in which pitch detection is usually the first step.



However, the biggest disadvantage of the method proposed in [1] was that it could not track fast note changes. This is due to the inherent latency of convergence in Kalman tracking [18]. In this paper we propose an improvement by devising a method to detect harmonic change and re-initializing the filter every time a significant change in pitch is detected. This improves performance significantly and gets rid of the lag in the estimated pitch. We also propose a more accurate fundamental frequency estimation method to initialize the filter.

The rest of this paper is organized as follows: in Sec. 1 we give details of the model used for ECKF pitch tracking. In Sec. 2 details of its implementation are given, including methods for silent frame and harmonic change detection, calculation of initial estimates for attaining steady state values quickly, resetting the error covariance matrix, backtracking to avoid transient errors, and an adaptive process noise variance calculation based on the measurement residual. Sec. 3 has the results of running the ECKF pitch tracker on double bass, guitar, and vocal singing samples. It is also compared to the YIN and CREPE pitch trackers. We discuss some parameter selection details in Sec. 4, conclude the paper in Sec. 5, and delineate the scope for future work.

    1 MODEL AND EQUATIONS

1.1 Model

We make use of the sines+noise model for music [19] to derive our state space equations. Since the model is nonlinear, we use the extended Kalman filter [20], which linearizes the function about the current estimate by using its Taylor series expansion up to the first order. Higher order terms are ignored.

Let there be an observation y_k at time instant k, which is a sum of additive sines and a measurement noise:

y_k = Σ_{i=1}^{N} a_i cos(ω_i t_k + φ_i) + v_k    (1)

where a_i, ω_i, and φ_i are the amplitude, frequency, and phase of the ith sinusoid and v_k is normally distributed noise, v ∼ N(0, σ²_v), which is the measurement noise, usually background noise picked up by the microphone. σ²_v is the measurement noise variance, typically given by the SNR. If we ignore the partials and only take into account the fundamental, Eq. (1) reduces to

y_k = a_1 cos(ω_1 k T_s + φ_1) + v_k    (2)

where a_1, ω_1, and φ_1 are the fundamental amplitude, frequency, and phase respectively and T_s is the sampling interval. The state vector is constructed as

x_k = [α, u_k, u*_k]^T    (3)

where

α = exp(jω_1 T_s)
u_k = a_1 exp(jω_1 k T_s + jφ_1)
u*_k = a_1 exp(−jω_1 k T_s − jφ_1)    (4)

This particular selection of state vector ensures that we can track all three parameters that define the fundamental: frequency, amplitude, and phase. The relative advantage of choosing this complex state vector has been described in [16]. The state vector estimate update rule x_{k+1} relates to x_k as

[α, u_{k+1}, u*_{k+1}]^T = [ [1, 0, 0], [0, α, 0], [0, 0, 1/α] ] [α, u_k, u*_k]^T

x_{k+1} = f(x_k) + w_k,   f(x_k) = [α, α u_k, u*_k / α]^T    (5)

y_k relates to x_k as

y_k = H x_k + v_k,   H = [0  0.5  0.5]    (6)

where H is the observation matrix and w_k is the process noise. In our model, the process noise can be assumed to be filtered white noise that represents the residual that remains after removing the harmonic content of the signal. We can see that

H x_k = (a_1/2) [exp(jω_1 k T_s + jφ_1) + exp(−jω_1 k T_s − jφ_1)] = a_1 cos(ω_1 k T_s + φ_1)    (7)

1.2 EKF Equations

The recursive Kalman filter equations aim to minimize the trace of the error covariance matrix. Each iteration reduces the variance of x̂_{k|k−1} until it converges. At each time-step k, the EKF equations are as follows:

K_k = P̂_{k|k−1} H^{*T} [H P̂_{k|k−1} H^{*T} + σ²_v]^{−1}    (8)
x̂_{k|k} = x̂_{k|k−1} + K_k (y_k − H x̂_{k|k−1})    (9)
x̂_{k+1|k} = f(x̂_{k|k})    (10)
P̂_{k|k} = (I − K_k H) P̂_{k|k−1}    (11)
P̂_{k|k+1} = F_k P̂_{k|k} F_k^{*T} + σ²_w I    (12)

where F_k is the Jacobian given by

F_k = ∂f(x_k)/∂x_k |_{x_k = x̂_{k|k}} =
[ 1                            0              0              ]
[ x̂_{k|k}(2)                  x̂_{k|k}(1)    0              ]
[ −x̂_{k|k}(3)/x̂_{k|k}(1)²    0              1/x̂_{k|k}(1)  ]    (13)

• x̂_{k|k−1}, x̂_{k|k}, x̂_{k|k+1} are the a priori, current, and a posteriori state vector estimates respectively.

• P̂_{k|k−1}, P̂_{k|k}, P̂_{k|k+1} are the a priori, current, and a posteriori error covariance matrices respectively.

• K_k is the Kalman gain that acts as a weighting factor between the observation y_k and the a priori prediction x̂_{k|k−1} in determining the current estimate.

• σ²_v is the measurement noise variance, which is fixed to be 1, as in [16]. Ideally, this should depend on the SNR of the measurement. A lower SNR would give a higher measurement noise variance.



• σ²_w is modeled as the process noise variance and I ∈ C^{3×3} is an identity matrix.

• The initial state vector and error covariance matrix are denoted as x̂_{1|0} and P̂_{1|0} respectively.

From Eq. (4) the fundamental frequency, amplitude, and phase estimates at instant k can be calculated as

f_{1,k} = ln(x̂_{k|k}(1)) / (2πj T_s)
a_{1,k} = sqrt( x̂_{k|k}(2) × x̂_{k|k}(3) )
φ_{1,k} = (1/2j) ln( x̂_{k|k}(2) / ( x̂_{k|k}(3) x̂_{k|k}(1)^{2k} ) )    (14)

    2 IMPLEMENTATION DETAILS

2.1 Detection of Silent Frames

It is important to keep track of pitch-off events in the signal because the estimated frequency for such silent regions should be zero. Moreover, whenever there is a transition from pitch-off to pitch-on, the Kalman filter error covariance matrix needs to be reset. This is because the filter quickly converges to a steady state value, and as a result the Kalman gain K_k and error covariance matrix P̂_{k|k} settle to very low values. If there is a sudden change in frequency of the signal (which happens at note onset), the filter will not be able to track it unless the covariance matrix is reset.

To keep track of silent regions, we divide the signal into non-overlapping frames. In real-time processing, this is easily achieved by working with audio buffers. One way to determine whether a frame is silent is to calculate its energy, given as the sum of the squares of all the signal samples in that frame. If the energy is below −50 dB, the frame is classified as silent.

However, for noisy input signals, the energy in silent frames is significant. To find silent frames in noisy signals, we make use of the fact that noise has a fairly flat power spectrum. The power spectral density (PSD) of the observed signal, Φ_yy, is given as

Φ_yy(e^{jω}) = Σ_{n=−∞}^{∞} φ_yy(n) e^{−jωn}    (15)

where φ_yy(n) is the autocorrelation function of the input signal y, given as

φ_yy(n) = Σ_{m=−∞}^{∞} y(m) y(n + m).    (16)

The power spectrum is the DTFT of the autocorrelation function, and one way of estimating it is Welch's method [21], which makes use of the periodogram. Since we are dealing with very short frames here, of the order of a few milliseconds, an alternative is to estimate the autocorrelation function and take its FFT after zero-padding by a

Fig. 1. Spectral flatness varying across frames. A high value indicates a silent frame.

factor ≥ 5. The spectral flatness is defined as the ratio of the geometric mean to the arithmetic mean of the PSD:

spf = ( Π_{k=0}^{K−1} Φ̂_yy(e^{jω_k}) )^{1/K} / ( (1/K) Σ_{k=0}^{K−1} Φ̂_yy(e^{jω_k}) )    (17)

where Φ̂_yy(e^{jω_k}) is the estimated power spectrum for K frequency bins covering the range [−π, π].

For white noise v ∼ N(0, σ²_v) corrupting the measurement signal y, the silent frames of y will have the following properties:

φ_yy(n) = σ²_v δ(n)
Φ_yy(e^{jω}) = σ²_v   ∀ ω ∈ [−π, π]
spf = ( (σ²_v)^K )^{1/K} / ( (1/K) (K σ²_v) ) = 1    (18)

Therefore, for each frame, we calculate the spectral flatness [22]. If the spectral flatness ≥ th,¹ then the frame is classified as silent. Fig. 1 shows spectral flatness vs. frame number for an audio file containing a single note preceded and followed by some silence.

2.2 Resetting the Kalman Filter

In [1] the error covariance matrix was reset and initial estimates were re-calculated only when there was a transition from a pitch-off to a pitch-on event (silent frame to non-silent frame). However, the error covariance matrix should be reset and the filter re-initialized whenever there is a significant change in pitch. Otherwise, the filter cannot track fast pitch changes; this was one of the main drawbacks of the method proposed in [1]. To fix this problem we implement a harmonic change detector and reset the filter whenever a significant harmonic change is detected.

2.2.1 Initial F0 Estimate

In [1] we simply looked at the first peak in the magnitude spectrum of a frame to calculate the initial F0 estimate. We use a more sophisticated method here by looking at the first few peaks (NP) in the magnitude spectrum. We multiply the magnitude spectrum with a slowly tapering Poisson

¹ th is a threshold ≥ 0.5.



window, w(n) [23], with α = 10 and M = FFT size, to weight the lower partials more than the higher partials (which may be spurious and noisy):

w(n) = exp( −0.5 α n / (M − 1) )    (19)

Peaks in the magnitude spectrum typically represent harmonics, and the differences between their frequencies should remain constant. We utilize this fact, find the mode of the differences between the adjacent frequencies of the NP peaks, and use that as the initial F0 estimate, f̂_0. The amplitude, â_0, and phase, φ̂_0, are found by quadratically interpolating the magnitude and phase spectra around the peak at the estimated F0. The initial state and error covariance matrix are

x̂_{1|0} = [ e^{2πj f̂_0 T_s},  â_0 e^{2πj f̂_0 T_s + j φ̂_0},  â_0 e^{−2πj f̂_0 T_s − j φ̂_0} ]^T

P̂_{1|0} = E[ (x_1 − x̂_{1|0})(x_1 − x̂_{1|0})^{*T} ] = 0    (20)

2.2.2 Detecting Harmonic Change

To detect harmonic change we need to keep track of both the previous frame and the current frame. We find the NP largest peaks in the magnitude spectra of the consecutive frames. For frames k − 1 and k, the NP peak frequencies can be represented as [f_{1,k−1}, f_{2,k−1}, ··· f_{NP,k−1}] and [f_{1,k}, f_{2,k}, ··· f_{NP,k}] respectively. We then find the first order difference between adjacent peaks in frames k and k − 1. This typically represents the fundamental frequency:

f_{i+1,k} − f_{i,k} = d_{i,k}   ∀ i = 1, ···, NP − 1    (21)

We calculate the mode of the distribution given by

m = mode_i | d_{i,k} − d_{i,k−1} |    (22)

The fundamental frequency deviation between consecutive frames is calculated in cents as

f0_dev = 1200 log_2( f̂_0 / (f̂_0 + m) )    (23)

where f̂_0 is the calculated initial fundamental frequency. If the same note is being played in the current frame and the previous frame, then m would typically be zero. If there is a large enough frequency deviation between consecutive frames, i.e., f0_dev ≥ n_st (specified in number of semitones), then the filter is re-initialized with a null covariance matrix and the estimated initial state.

We show the detected peaks in two consecutive frames when there is a harmonic change, along with a histogram of the frequency deviation between them, in Fig. 2.

2.3 Backtracking to Avoid Transient Errors

For an instrument with a strong transient attack, such as a guitar or a piano, the F0 estimation is likely to fail during a note change. This is because the transient is broadband and noise-like in nature, and the frequency peaks do not have harmonic relationships between them. To avoid this, whenever there is a transition from a silent to a non-silent frame

Table 1. Pitch detection errors with frequency-modulated sawtooth wave

SNR (dB)   Mean Absolute Error (Hz)   Standard Deviation (Hz)
5          5.5673                     19.0500
10         5.6536                     19.0715
15         5.7145                     19.0937
20         5.9741                     19.3449

or whenever a harmonic change is detected, a few frames are skipped before estimating the initial state. The number of frames to skip depends on the instrument whose pitch is to be tracked. Once a stable F0 estimate is found, the filter is initialized with that estimate, and we backtrack the number of frames we skipped and start tracking the pitch. This introduces latency (which can be up to 130 ms depending on the number of frames skipped) but ensures that an accurate pitch is detected for all frames in the signal.

2.4 Adaptive Process Noise

We only reset the covariance matrix when there is a transition from a silent to a non-silent frame or when a new note with a different pitch is played. However, dynamics such as vibrato also need to be captured. To track these changes, σ²_w in Eq. (12) is modeled as the process noise variance:

log_{10}(σ²_w) = −c + | y_k − H x̂_{k|k} |    (24)

where c ∈ Z+ is a constant. The term y_k − H x̂_{k|k} is known as the innovation. It gives the error between our predicted value and the actual data. Whenever the innovation is high, there is a significant discrepancy between the predicted output and the input, which is probably caused by a change in the input that the ECKF needs to track. In that case, σ²_w increases and a term is added to the a posteriori error covariance matrix P̂_{k|k+1}. This increase in the error covariance matrix causes the Kalman gain K_k to increase in the next iteration according to Eq. (8). As a result, the next state estimate x̂_{k|k+1} depends more on the input and less on the current predicted state x̂_{k|k}. Thus, the innovation reduces in the next iteration and so does σ²_w. In this way, the process noise acts as an error correction term that is adaptive to the variance in the input.

    3 RESULTS

    The MATLAB code for the following examples is availableat https://github.com/orchidas/Pitch-Tracking.

3.1 Synthesized Signal

We test our pitch estimator on an artificially synthesized signal: a sawtooth wave at 440 Hz with a sinusoidal vibrato of frequency 5 Hz, with added white noise at different SNRs. The results showing the mean absolute error and standard deviation are given in Table 1. The tracked pitch lags slightly behind the actual pitch, as observed in Fig. 3. The higher variance of the detected pitch is caused by rapid fluctuations of the estimated pitch about a mean value.



Fig. 2. Different notes played in consecutive frames (dashed line: previous frame; solid line: current frame). On the right is a histogram of the distribution given by Eq. (22).

Fig. 3. Tracking pitch in a sawtooth wave with 5 Hz sinusoidal vibrato. Dashed line: actual pitch; solid line: ECKF detected pitch.

3.2 Tracking Dynamic Note Changes

To address the main drawback of [1], we track ECKF pitch trajectories for rapid note changes. The results are shown in Fig. 4, where the spectrogram of the fundamental is plotted, along with the estimated pitch trajectory as a black solid line. The trajectory is smooth and accurate and follows the changing notes without any significant lag.

3.3 Tracking Ornaments

One of the potential applications of a sample-by-sample pitch tracker is to get smooth trajectories of ornaments and embellishments. We track portamento (a smooth glide from one note to another), vibrato (frequency modulation), and trill (hammering on an adjacent note) on two string instruments: the double bass and the electric guitar. Tracking pitch in a bowed string instrument such as the double bass is easier, since it is a driven oscillator, whereas a plucked string instrument will have a strong transient where pitch detection goes haywire. We recorded the samples into Audacity through a TASCAM 2x2 interface using a Beyerdynamic omni mic and direct line input at a sampling rate of 48 kHz. The results comparing the ECKF pitch tracker to YIN and CREPE are shown in Fig. 5. There is an overshoot/undershoot in the initial pitch detected by the ECKF

Fig. 4. Estimated pitch (in black) for descending fifths played on the A string on the double bass.

in some plots, but it quickly converges to the correct pitch. This is due to errors in initial F0 estimation. The ECKF trajectory is smoother and follows the YIN and CREPE trajectories closely in most cases, except in Fig. 5d, where the YIN pitch trajectory becomes unstable, and in Fig. 5f, where it slightly lags behind the YIN and CREPE trajectories.

3.4 Tracking Singing Voice

Putting it all together, we compare the performance of our proposed pitch tracker against YIN and CREPE in a real-world example. We use a track from the VocalSet dataset [24]: a female soprano singing an excerpt from the song “Row your boat gently down the stream” with vibrato. The results are given in Fig. 6. The pitch track generated by the ECKF is comparable to that of CREPE. It is more densely sampled (one f0 estimate per sample), with the additional advantage of recovering the amplitude envelope and the phase track of the fundamental. Essentially, with the ECKF pitch tracker, we can extract the fundamental waveform from the rest of the signal.² This is an advantage it has over other pitch trackers proposed in the literature. The run-times in ascending order are YIN, ECKF, and CREPE.

Fig. 5. Estimated pitch for various ornaments played on the guitar. Circles: YIN; diamonds: CREPE; black line: ECKF.

Fig. 6. Pitch track of a female voice excerpt singing with vibrato from VocalSet. Circles: YIN; diamonds: CREPE; black line: ECKF.

    4 DISCUSSION

The proposed improved ECKF pitch tracker gives a sample-by-sample pitch estimate, along with the amplitude envelope and phase, which yields the extracted fundamental. The smooth pitch trajectory obtained could be used to make subtle pitch corrections during the mixing process. Although the Kalman filter is inherently robust to observation noise, such noise affects the accuracy of the F0 estimate calculated to reset the filter. Moreover, a number of parameters need to be carefully selected, depending on the type of instrument whose pitch is to be tracked.

4.1 Selection of Parameters

The following parameters need to be carefully selected for optimum results with the ECKF pitch tracker.

• Frame size. For tracking very fast pitch changes, the frame size should be small. However, choosing a shorter frame leads to errors in initial F0 estimation, so there is a trade-off. In this paper we chose a frame size of 2048 samples.

• Number of peaks (NP). The number of peaks used to calculate an initial F0 estimate and detect harmonic change depends on the instrument whose pitch is to be tracked. If the partials are harmonically related, then a large number of peaks gives a more stable estimate. However, if the partials are inharmonic, then selecting a small NP works better. In this paper we picked NP = 5 or 6 for the double bass signals, NP = 2 or 3 for guitar, and NP = 4 for vocal signals.

• Number of frames to skip during transient. If the instrument to be tracked has a strong attack, such as a piano or a guitar, then more frames need to be skipped before the initial pitch is calculated from the steady state. This introduces latency. However, for instruments with a softer attack, such as woodwind instruments, skipping one frame should be adequate to get a good initial estimate. In this paper the number of frames to be skipped ∈ [1, 3].

• Adaptive process noise coefficient (c). This coefficient determines how smooth the pitch trajectory is. A higher value of this coefficient reduces the process noise variance σ²_w and gives a smoother pitch trajectory. In this paper c ∈ [7, 11].

• Harmonic change threshold. The threshold beyond which a harmonic change is detected, n_st, is specified in number of semitones (n_st = 2 is the default and indicates a whole tone). Small fluctuations in pitch can be tracked by the ECKF, but for larger fluctuations the filter needs to be reset. In the case of discrete pitch trajectories, such as in guitar and piano, a smaller value of n_st (a quarter or eighth tone) works better, whereas for more continuous trajectories, like vocal or bowed string instruments, n_st should be equal to a few semitones.

² Sound examples are available at https://ccrma.stanford.edu/~orchi/Kalman_pitch/EKF.html

    5 SUMMARY

In this paper we have improved upon the Kalman filter based pitch tracker proposed in [1]. The ECKF pitch tracker gives a sample-by-sample estimate of the fundamental frequency, amplitude, and phase and is robust to the presence of measurement noise. We have discussed in detail when and how to reset the filter so that fast note changes as well as ornaments and embellishments can be tracked. We have compared the performance of the pitch tracker to the YIN and CREPE pitch detectors and found the results to be comparable. Additionally, this method has the ability to extract the fundamental from the waveform of the signal, as it yields the amplitude envelope and a phase track along with the pitch track. However, parameter selection for the ECKF pitch tracker requires knowledge of the type of signal whose pitch is to be tracked. That is a potential drawback of this method. In the future, it would be interesting to automatically pick the optimum set of parameters for a given audio signal by training on instrument-specific datasets.

    6 ACKNOWLEDGMENT

Part of this research was conducted with the Sound Analysis-Synthesis team at IRCAM during a residency at the Cité Internationale des Arts, Paris, in early 2019. The authors would like to thank Axel Röbel, who heads the Analysis-Synthesis team at IRCAM.

    7 REFERENCES

[1] O. Das, J. O. Smith, and C. Chafe, “Real-Time Pitch Tracking in Audio Signals with the Extended Complex Kalman Filter,” Int. Conf. on Digital Audio Effects (DAFx-17), pp. 118–124 (2017).

[2] J. C. R. Licklider, “A Duplex Theory of Pitch Perception,” J. Acoust. Soc. Amer., vol. 23, no. 1, pp. 147–147 (1951).

[3] D. Gerhard, “Pitch Extraction and Fundamental Frequency: History and Current Techniques,” Tech. Rep., University of Regina (2003).

[4] A. de Cheveigné and H. Kawahara, “YIN, a Fundamental Frequency Estimator for Speech and Music,” J. Acoust. Soc. Amer., vol. 111, no. 4, pp. 1917–1930 (2002).

    84 J. Audio Eng. Soc., Vol. 68, No. 1/2, 2020 January/February


    [5] A. M. Noll, “Cepstrum Pitch Detection,” J. Acoust. Soc. Amer., vol. 41, pp. 293–309 (1967).

    [6] A. M. Noll, “Pitch Determination of Human Speech by the Harmonic Product Spectrum, the Harmonic Sum Spectrum, and a Maximum Likelihood Estimate,” Proceedings of the Symposium on Computer Processing Communications, vol. 779 (1969).

    [7] J. A. Moorer, “On the Transcription of Musical Sound by Computer,” Computer Music J., vol. 1, no. 4, pp. 32–38 (1977 Nov.).

    [8] J. Wise, J. Caprio, and T. Parks, “Maximum Likelihood Pitch Estimation,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 418–423 (1976).

    [9] E. Barnard, R. A. Cole, M. P. Vea, and F. A. Alleva, “Pitch Detection with a Neural-Net Classifier,” IEEE Transactions on Signal Processing, vol. 39, no. 2, pp. 298–307 (1991).

    [10] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “CREPE: A Convolutional Representation for Pitch Estimation,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 161–165 (2018).

    [11] F. Bach and M. Jordan, “Discriminative Training of Hidden Markov Models for Multiple Pitch Tracking,” presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5 (2005).

    [12] M. Mauch and S. Dixon, “pYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 659–663 (2014).

    [13] L. Shi, J. K. Nielsen, J. R. Jensen, M. A. Little, and M. G. Christensen, “A Kalman-Based Fundamental Frequency Estimation Algorithm,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 314–318 (2017).

    [14] T. Tolonen and M. Karjalainen, “A Computationally Efficient Multipitch Analysis Model,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708–716 (2000).

    [15] P. De La Cuadra, A. S. Master, and C. Sapp, “Efficient Pitch Detection Techniques for Interactive Music,” presented at the International Computer Music Conference (ICMC) (2001).

    [16] P. K. Dash, G. Panda, A. Pradhan, A. Routray, and B. Duttagupta, “An Extended Complex Kalman Filter for Frequency Measurement of Distorted Signals,” IEEE Power Engineering Society Winter Meeting, vol. 3, pp. 1569–1574 (2000).

    [17] P.-C. Li, L. Su, Y.-H. Yang, A. W. Su, et al., “Analysis of Expressive Musical Terms in Violin Using Score-Informed and Expression-Based Audio Features,” ISMIR, pp. 809–815 (2015).

    [18] M. Boutayeb, H. Rafaralahy, and M. Darouach, “Convergence Analysis of the Extended Kalman Filter Used as an Observer for Nonlinear Deterministic Discrete-Time Systems,” IEEE Transactions on Automatic Control, vol. 42, no. 4, pp. 581–586 (1997).

    [19] X. Serra and J. Smith, “Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic Plus Stochastic Decomposition,” Computer Music J., vol. 14, no. 4, pp. 12–24 (1990).

    [20] G. A. Terejanu, “Extended Kalman Filter Tutorial,” Tech. Rep., University at Buffalo (2008).

    [21] P. Welch, “The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms,” IEEE Transactions on Audio and Electroacoustics, vol. 15, no. 2, pp. 70–73 (1967).

    [22] A. Gray and J. Markel, “A Spectral-Flatness Measure for Studying the Autocorrelation Method of Linear Prediction of Speech Analysis,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 22, no. 3, pp. 207–217 (1974).

    [23] J. O. Smith III, Spectral Audio Signal Processing (W3K Publishing, 2011).

    [24] J. Wilkins, P. Seetharaman, A. Wahl, and B. Pardo, “VocalSet: A Singing Voice Dataset,” International Symposium on Music Information Retrieval (ISMIR), pp. 468–474 (2018).

    THE AUTHORS

    Orchisama Das received the B.Eng. degree in Instrumentation and Electronics engineering from Jadavpur University, India, in 2016. She is currently a Ph.D. candidate at the Center for Computer Research in Music and Acoustics at Stanford University. She is also a teaching assistant for various signal processing courses at CCRMA. In 2015, Orchisama worked at the University of Calgary funded by a Mitacs Globalink fellowship. She interned at Tesla Motors in 2018 doing DSP for the Noise, Vibration and Harshness team. In 2019 she was a visiting researcher in the Sound Analysis-Synthesis team at IRCAM.

    Julius O. Smith received the B.S.E.E. degree from Rice University, Houston, TX, in 1975 (control, circuits, and communication). He received the M.S. and Ph.D. degrees in E.E. from Stanford University, Stanford, CA, in 1978 and 1983, respectively. His Ph.D. research was devoted to improved methods for digital filter design and system identification applied to music and audio systems. From 1975 to 1977 he worked in the Signal Processing Department at ESL, Sunnyvale, CA, on systems for digital communications. From 1982 to 1986 he was with the Adaptive Systems Department at Systems Control Technology, Palo Alto, CA, where he worked in the areas of adaptive filtering and spectral estimation. From 1986 to 1991 he was employed at NeXT Computer, Inc., responsible for sound, music, and signal processing software for the NeXT computer workstation. After NeXT, he became an Associate Professor at the Center for Computer Research in Music and Acoustics (CCRMA) at Stanford, teaching courses and pursuing research related to signal processing techniques applied to music and audio systems. Continuing this work, he is presently Professor of Music and (by courtesy) Electrical Engineering at Stanford University. For more information, see http://ccrma.stanford.edu/~jos/.

    Chris Chafe is a composer, improvisor, and cellist developing much of his music alongside computer-based research. He is Director of Stanford University’s Center for Computer Research in Music and Acoustics (CCRMA). At IRCAM (Paris) and The Banff Centre (Alberta), he pursued methods for digital synthesis, music performance, and real-time internet collaboration. Online collaboration software including jacktrip and research into latency factors continue to evolve. An active performer either on the net or physically present, his music reaches audiences in dozens of countries and sometimes at novel venues. A simultaneous five-country concert was hosted at the United Nations a decade ago. Gallery and museum music installations involve “musifications” resulting from collaborations with artists, scientists, and MDs. Recent work includes the Brain Stethoscope project, PolarTide for the Venice Biennale, Tomato Quintet for the transLife:media Festival at the National Art Museum of China, and Sun Shot played by the horns of large ships in the port of St. John’s, Newfoundland.
