wavelet-based cepstrum calculation

6
Journal of Computational and Applied Mathematics 227 (2009) 288–293 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam Wavelet-based cepstrum calculation Fabrício Lopes Sanchez a , Sylvio Barbon Júnior b,* , Lucimar Sasso Vieira b , Rodrigo Capobianco Guido b , Everthon Silva Fonseca c , Paulo Rogério Scalassara c , Carlos Dias Maciel c , José Carlos Pereira c , Shi-Huang Chen d a Department of Bioengineering, University of São Paulo, USP, São Carlos, São Paulo, 13566-590, Brazil b Institute of Physics of São Carlos - University of São Paulo, USP, São Carlos, São Paulo, 13566-590, Brazil c Department of Electrical Engineering, School of Engineering at São Carlos - University of São Paulo, USP, São Carlos, São Paulo 13566-590, Brazil d Department of Computer Science and Information Engineering, Shu-Te University, Kaohsiung County, 824, Taiwan, ROC article info Article history: Received 4 April 2006 Received in revised form 30 September 2007 Keywords: Wavelet transform Cepstrum Pitch period Speech analysis abstract In this paper we present a new wavelet-based algorithm for low-cost computation of the cepstrum. It can be used for real time precise pitch determination in automatic speech and speaker recognition systems. Many wavelet families are examined to determine the one that works best. The results confirm the efficacy and accuracy of the proposed technique for pitch extraction. © 2008 Elsevier B.V. All rights reserved. 1. Introduction Cepstral analysis is fairly well discussed in [1,2], and many other references. With it, a voiced speech segment can be separated into two main parts: the resonances of the vocal tract, and the harmonic peaks which come from the source pulmonary excitation ([1], pp. 30; [2], pp. 79). Cepstral analysis has been intensively used for speech recognition, speaker recognition, and speaker verification and identification [1,3,4]. The cepstrum, which is the signal derived from the cepstral analysis of a speech utterance, is traditionally obtained by means of the Discrete Fourier Transform (DFT); see Section 2. This approach gives precise results but with O(L 2 ), i.e., the order of computational complexity is quadratic in relation to the signal length (L). If the Fast Fourier Transform (FFT) is used, the complexity reduces to O(L · log(L)). The proposed algorithm has a lower order of complexity, since its basic principle is the convolution of the input signal with a wavelet filter of small support. Furthermore, the accuracy that we obtained with the proposed algorithm is comparable to the one obtained using the DFT or FFT. For many real time applications, the proposed approach is a valuable alternative to the existing algorithms based on the DFT or FFT for pitch extraction. This paper is organized as follows. Section 2 presents a brief review on the cepstrum and its calculation based on the DFT/FFT. The proposed algorithm is described in Section 3. A study on wavelets and their important characteristics for precise cepstrum calculation appears in Section 4. Section 5 describes the tests and results, and, lastly, Section 6 presents the conclusions. * Corresponding address: Universidade de Sao Paulo, IFSC, Postal 369 - CEP, 13560-970 Sao Carlos, Sao Paulo, Brazil. E-mail address: [email protected] (S. Barbon Júnior). 0377-0427/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.cam.2008.03.016

Upload: fabricio-lopes-sanchez

Post on 26-Jun-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Wavelet-based cepstrum calculation

Journal of Computational and Applied Mathematics 227 (2009) 288–293

Contents lists available at ScienceDirect

Journal of Computational and AppliedMathematics

journal homepage: www.elsevier.com/locate/cam

Wavelet-based cepstrum calculationFabrício Lopes Sanchez a, Sylvio Barbon Júnior b,∗, Lucimar Sasso Vieira b,Rodrigo Capobianco Guido b, Everthon Silva Fonseca c, Paulo Rogério Scalassara c,Carlos Dias Maciel c, José Carlos Pereira c, Shi-Huang Chen d

a Department of Bioengineering, University of São Paulo, USP, São Carlos, São Paulo, 13566-590, Brazilb Institute of Physics of São Carlos - University of São Paulo, USP, São Carlos, São Paulo, 13566-590, Brazilc Department of Electrical Engineering, School of Engineering at São Carlos - University of São Paulo, USP, São Carlos, São Paulo 13566-590, Brazild Department of Computer Science and Information Engineering, Shu-Te University, Kaohsiung County, 824, Taiwan, ROC

a r t i c l e i n f o

Article history:Received 4 April 2006Received in revised form 30 September2007

Keywords:Wavelet transformCepstrumPitch periodSpeech analysis

a b s t r a c t

In this paper we present a new wavelet-based algorithm for low-cost computation of thecepstrum. It can be used for real time precise pitch determination in automatic speech andspeaker recognition systems. Many wavelet families are examined to determine the onethat works best. The results confirm the efficacy and accuracy of the proposed techniquefor pitch extraction.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Cepstral analysis is fairly well discussed in [1,2], and many other references. With it, a voiced speech segment can beseparated into two main parts: the resonances of the vocal tract, and the harmonic peaks which come from the sourcepulmonary excitation ([1], pp. 30; [2], pp. 79). Cepstral analysis has been intensively used for speech recognition, speakerrecognition, and speaker verification and identification [1,3,4].

The cepstrum, which is the signal derived from the cepstral analysis of a speech utterance, is traditionally obtained bymeans of the Discrete Fourier Transform (DFT); see Section 2. This approach gives precise results but with O(L2), i.e., theorder of computational complexity is quadratic in relation to the signal length (L). If the Fast Fourier Transform (FFT) is used,the complexity reduces to O(L · log(L)). The proposed algorithm has a lower order of complexity, since its basic principleis the convolution of the input signal with a wavelet filter of small support. Furthermore, the accuracy that we obtainedwith the proposed algorithm is comparable to the one obtained using the DFT or FFT. For many real time applications, theproposed approach is a valuable alternative to the existing algorithms based on the DFT or FFT for pitch extraction.

This paper is organized as follows. Section 2 presents a brief review on the cepstrum and its calculation based on theDFT/FFT. The proposed algorithm is described in Section 3. A study on wavelets and their important characteristics forprecise cepstrum calculation appears in Section 4. Section 5 describes the tests and results, and, lastly, Section 6 presentsthe conclusions.

∗ Corresponding address: Universidade de Sao Paulo, IFSC, Postal 369 - CEP, 13560-970 Sao Carlos, Sao Paulo, Brazil.E-mail address: [email protected] (S. Barbon Júnior).

0377-0427/$ – see front matter © 2008 Elsevier B.V. All rights reserved.doi:10.1016/j.cam.2008.03.016

Page 2: Wavelet-based cepstrum calculation

F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293 289

Fig. 1. [above]: log spectrum of a voiced speech signal; [below]: corresponding cepstrum obtained with the Fourier approach, where the period of pitch,p, appears as a spike.

Table 1The algorithm for obtaining the cepstrum, based on the Fourier approach

BEGINNINGSTEP S1: Input speech utterance, s[n], with length L;STEP S2: s[n] ← DFT(s[n]);STEP S3: For (int i = 0; i < L; i++)

si ← log10 ‖(si)‖2;

STEP S4: s[n] ← IDFT(s[n]), where IDFT is the inverse DFT;STEP S5: For (int i = 0; i < L; i++)

si ← (si)2;

END.

2. A brief review: The cepstrum

According to [1], if E is the excitation, and H represents the vocal tract resonance, then, the speech spectrum is S = E · H.Cepstral analysis converts this product into a sum of spectra as follows:

S = E · H⇒ log(S) = log(E · H)⇒ log(S) = log(E)+ log(H).

The spectrum of H varies slowly with frequency, because it consists mostly of smooth formant curves, and the spectrumof E is more irregular, since it relates to the harmonics of the excitation. Therefore, the cepstral analysis linearly separatesthe contributions given by E and H to form S. One particular application of cepstral analysis is in determining the period ofpitch, p, which corresponds to the time interval between two consecutive glottal closures that control the flow of air from E.Another application is the determination of the formant frequencies, that are related to H. Quefrency, in seconds, is the scaleused to measure the cepstrum. Fig. 1 illustrates the explanations above, and the algorithm for obtaining the cepstrum basedon the DFT/FFT appears in Table 1.

3. The proposed algorithm

The proposed algorithm appears in Table 2 and further explanation follows. In steps S2 and S4, the full Discrete Wavelet-Packet Transform (DWPT) tree [5] is needed. For this algorithm, the DWPT requires the natural frequency ordering (NFO),instead of the filter bank ordering (FBO). For each level of decomposition, the NFO is obtained by alternating the order inwhich the pair of filters, low pass (h[n]) and high pass (g[n]), is applied to decompose each sub-band. A detailed explanationof this procedure can be found in [5], pp. 107–111. Unlike for the Fourier-based cepstrum, the proposed algorithm does notrequire an inverse DWPT (IDWPT).

Page 3: Wavelet-based cepstrum calculation

290 F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293

Table 2Pseudo-code for the proposed algorithm

BEGINNINGSTEP S1: Input the speech utterance, s[n], with length L, that must be a power of 2;STEP S2: s[n] ← DWPT(s[n],m), that is, turn s[n] into its full mth-level Discrete

Wavelet-Packet Transform (DWPT), where m = log(L)log(2)

;STEP S3: For (int i = 0; i < L; i++)

si ← log10(si)2;

STEP S4: repeat step S2, i.e., apply the DWPT again to s[n];STEP S5: For (int i = 0; i < L; i++)

si ← (si)2;

END.

After carrying this out, the signal s[n]will become the cepstrum of its original version.

To obtain the period of pitch, p, of a speech utterance, a search can be performed within the range [a–b] to find the highestvalue, i.e., the spike that represents the pitch. The values of a and b that represent samples of the cepstrum can be determinedas follows:

a = round( sampling_rate

400.0

),

b = round( sampling_rate

60.0

),

where 60 and 400 are the limits of frequencies, in hertz, for searching for the pitch. The kth sample between a and b thatcontains the highest value is related to the frequency of pitch, f = 1

p, according to the following:

f =sampling_rate

kHz.

It is interesting to note that the Fourier-based approach converts the input signal from the time domain to the frequencydomain, scales it using the logarithmic function, and then returns the scaled signal to the time domain. On the other hand,the proposed approach filters the input signal, scales it, filters again, and then calculates the energy of each sub-band. Sincethe input signal is decomposed until the maximum level is reached, the leaves of the decomposition tree contain only onesample each and the sample with the highest value within the range [a–b] is the one related to the pitch period.

The Fourier-based cepstrum of a signal is computationally complex as compared to that for the proposed method, whenthe signal length is long. The order of complexity of the proposed approach is N ·K ·O(L), N being a constant that correspondsto the wavelet filter lengths adopted, and K being the number of decomposition levels of the wavelet-packet tree.

4. The choice of the wavelet

The proposed algorithm requires the analysis of s[n] twice, according to the steps S2 and S4 of the previous section.The process by which s[n] is analyzed is Mallat’s pyramidal algorithm [6]. It consists of low-pass and high-pass filtering, bydiscrete convolutions [7], followed by down-samplings by 2. A generalization of Mallat’s algorithm is well described in theliterature, and can be found in [5,7]; therefore, this topic is not included here.

During the analysis of s[n], the length of h[k] and g[k], N, is responsible for frequency selectivity, Q , and time resolution,R. As N increases, Q increases and R decreases. Therefore, a balanced value for N is needed [8]. Let us define this as beingthe requirement 1. Another important consideration, say requirement 2, is that (almost) linear phase filters, h[k] and g[k],are desirable to avoid distortion in the filtered signal, i.e., (almost) symmetrical or anti-symmetrical impulse responses arepreferable. Lastly, we can define requirement 3, that is, a flat frequency response in the pass and in the stop bands of thefilters is desirable to attenuate the Gibbs phenomenon [9].

The proposed algorithm does not require any synthesis of the speech signal, i.e., no inverse DWT (IDWT) is performed.Therefore, it is sufficient to consider only the requirements 1, 2, and 3, which are related to the filters h[n] and g[n], anddisregard particular characteristics of the scaling and wavelet functions [7],φ(n) =

∑k hkφ(2n−k) andψ(n) =

∑k gkφ(2n−k),

respectively.In order to satisfy the requirement 1, we have to find N in such a way that it balances the constraints Q and R. Imposing

limits on N, say c and d, (c < d), a search can be performed within this range to allow optimal results in one particularrespect: R or Q .

Requirement 2 can only be satisfied by using finite impulse response (FIR) wavelet filters, since infinite impulse response(IIR) ones, say Shannon, Meyer, and so on, do not exhibit linear phase [7]. Therefore, the only alternative among the well-known FIR wavelet filters [7] is the Haar wavelet. Unfortunately, we also have to discard this last wavelet, since the supportof its filters is N = 2, and therefore it fails to satisfy the requirement 1. The only alternative is the use of a wavelet familywhose filters exhibit almost linear phase, i.e., have almost symmetrical or anti-symmetrical impulse responses. Fig. 2, whichgives an intuition of the impulse response shapes of such low-pass filters, shows that Symmlets or Coiflets filters are themost appropriate for satisfying this particular requirement.

Page 4: Wavelet-based cepstrum calculation

F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293 291

Fig. 2. From left to right: impulse response shapes of the wavelet filters Haar, Daubechies, Vaidyanathan, Beylkin, Coiflet, and Symmlet.

Fig. 3. [above]: Fourier-based cepstrum of /a/; [below]: time-aligned wavelet-based cepstrum of the same speech signal; [horizontal axis]: cepstrum signalsamples; [vertical axis]: amplitude. The highest spike close to the end of the horizontal axis corresponds to the pitch.

Lastly, requirement 3 can be satisfied by using Daubechies’ filters, since they have a maximally flat frequency response [6].This family of filters, however, fail to satisfy requirement 2 because they do not have a linear phase response. Therefore, theonly alternative is, again, to use a family of wavelets whose filters present an almost flat response in the pass and stop bands.Since Symmlets and Coiflets present this characteristic, and they also satisfy requirements 1 and 2, they can be used.

5. Tests and results

For evaluation of the proposed algorithm, synthetic utterances were used, particularly the following vowels sounds: /a/as in the word dogma, /e/ as in the word men, and /i/ as in the word ship, with the pitch frequencies being, respectively,116.79 Hz, 125.00 Hz, and 207.79 Hz. The signals were sampled at 16 000 Hz, 16-bit, PCM.

Figs. 3–5 show the cepstrums of the utterances above, obtained by the Fourier method, described in Section 2, and by theproposed method. Table 3 contains the pitch frequency for the synthetic /e/ obtained with the proposed method, varying thewavelet bases and filter lengths. Similar results, in terms of precision, were obtained with the vowels /a/ and /i/, accordingto each wavelet used.

All the tests above used L = 2048 samples. With that, the maximum level of the DWPT is j = log(2048)log(2)

= 11, which gives a

frequency resolution s =16 000

2211 ≈ 3.9 Hz after the first filtering step (S2) of the proposed algorithm. If L increases, j increases

and s decreases, i.e., a better frequency resolution is obtained. On the other hand, if L decreases, j decreases and s increases,i.e., a worse frequency resolution is obtained.

Page 5: Wavelet-based cepstrum calculation

292 F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293

Fig. 4. [above]: Fourier-based cepstrum of /e/; [below]: time-aligned wavelet-based cepstrum of the same speech signal; [horizontal axis]: cepstrum signalsamples; [vertical axis]: amplitude. The highest spike close to the end of the horizontal axis corresponds to the pitch.

Fig. 5. [above]: Fourier-based cepstrum of /i/; [below]: time-aligned wavelet-based cepstrum of the same speech signal; [horizontal axis]: cepstrum signalsamples; [vertical axis]: amplitude. The highest spike close to the middle of the horizontal axis corresponds to the pitch.

Table 3Results obtained for the synthetic vowel /e/, created with pitch period at 125.00 Hz

W S P W S P W S P W S P

Haar 2 125.98 Daubechies 4 125.98 Daubechies 8 125.98 Daubechies 12 125.98Daubechies 24 125.98 Daubechies 36 125.98 Coiflet 6 125.98 Coiflet 12 125.98Coiflet 18 125.00 Coiflet 24 125.00 Coiflet 30 125.98 Symmlet 8 125.98Symmlet 16 125.00 Symmlet 24 125.98 Beylkin 18 125.98 Vaidyanathan 24 125.98

[W]: wavelet family; [S]: filter support; [P]: pitch frequency, in hertz, obtained using the proposed algorithm.

6. Conclusions

We presented a new wavelet-based algorithm for cepstrum calculation, which has lower computational complexity thanthe traditional method based on DFT or FFT, and can be used for pitch period extraction in real time. On the basis of ourtheoretical assumptions, confirmed in practice, we concluded that a family of wavelets whose filters exhibit almost linearphase responses, i.e., Symmlets or Coiflets, with a balanced value for N, i.e., 12 6 N 6 24, are the best candidates. This choice

Page 6: Wavelet-based cepstrum calculation

F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293 293

takes into account the frequency selectivity and the time resolution obtained with the use of such filters. The length, L, of theinput speech signal also influences the results, 2048 being a reasonable choice in terms of both precision and conveniencefor small delay in real time applications.

Acknowledgment

We wish to thank the State of São Paulo Research Foundation for the grants given for this work under process number2005/00015-1.

References

[1] L. Deng, O. O’Shaughnessy, Speech Processing: A Dynamic and Optimization-Oriented Approach, Marcel Dekker Inc., New York, USA, 2003.[2] J. Coleman, Introducing Speech and Language Processing, Cambridge University Press, Cambridge, UK, 2005.[3] W. Chou, B.H. Juang, Pattern Recognition in Speech and Language Processing, CRC Press, Boca Raton, USA, 2003.[4] T.F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, Upper Saddle River, NJ, USA, 2002.[5] A. Jensen, A. Cour-Harbo, Ripples in Mathematics: The Discrete Wavelet Transform, Springer, Berlin, Germany, 2001.[6] P.S. Addison, The Illustrated Wavelet Transform Handbook: Introductory Theory and Applications in Science, Engineering, Medicine and Finance,

Institute of Physics Publishing, Edinburg, 2002.[7] G. Strang, T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, MA, USA, 1997.[8] R.C. Guido, C.D. Maciel, M. Monteiro, E.S. Fonseca, S. Panchapagesan, J.C. Pereira, L.S. Vieira, S. Barbon Jr., F.L. Sanchez, M.B.A. Guilherme, K.I.C. Sergio,

T.L. Scarpa, P.C. Fantinato, E.J.R. Moura, A study on the best wavelet for audio compression, in: 40th IEEE Asilomar Int. Conference on Signals, Systemsand Computers, Pacific-Grove, CA, USA, vol. 1, 2006, pp. 2115–2118.

[9] R.W. Hamming, Digital Filters, third ed., Dover Publications, New York, USA, 1998.