
Speech/Music Classification Using Occurrence Pattern of ZCR and STE

Arijit Ghosal 1, Rudrasis Chakraborty 2, Ractim Chakraborty 2, Swagata Haty 2, Bibhas Chandra Dhara 3, Sanjoy Kumar Saha 2

1 CSE Dept., Institute of Technology and Marine Engg., 24 Parganas (south), India
2 Dept. of CSE, Jadavpur University, Kolkata, India
3 Dept. of IT, Jadavpur University, Kolkata, India

Abstract—With the rapid growth in audio data volume, research in the area of content-based audio retrieval has gained impetus in the last decade. Audio classification serves as the fundamental step towards it. Accuracy in classifying data relies on the strength of the features and on the efficacy of the classification scheme. In this work, we have focused on the features only. We have restricted ourselves further to time domain based low level features. Zero crossing rate (ZCR) and short time energy (STE) are the most widely used features in this category. We have tried to develop features reflecting the quasi-periodic pattern of the signal by studying the occurrence pattern of ZCR and STE. Co-occurrence matrices for ZCR and STE are formed and features are computed from them to parameterize the signal. For classification, simple k-means clustering is followed, and experimental results indicate that the proposed features perform better than the traditional features derived from ZCR and STE.

Keywords: Speech/music classification, audio features, ZCR occurrence pattern, STE occurrence pattern.

I. INTRODUCTION

There has been enormous growth in multimedia content. As a result, content-based multimedia data retrieval has become an active area of research for efficient access to the desired piece of data. A lot of work has been directed towards the development of content-based image and video retrieval systems. Comparatively, little work has been done on the audio portion [1], and it has gained impetus later on with the increase in audio data volume.

Research activities on content-based audio data management can be categorized as audio classification, audio retrieval and indexing. Among these, automatic classification is the fundamental step for any such application involving an audio database. Generally, an automatic audio classification system consists of two steps: extraction of features from the waveform and classification based on the extracted features. In the last decade, a lot of effort [2], [3], [4], [5], [6] has been made to classify audio data.

A variety of features has been proposed by researchers, which may be categorized as low level features, perceptual/psychoacoustic features, etc. Low level features include several time domain and frequency domain features. ZCR (zero crossing rate) [7], [8] and STE (short time energy) [9], [10] are the most widely used time domain features. Frequency domain approaches include features like signal bandwidth, spectral centroid, signal energy [11], [12], [13], [14], fundamental frequency [1], mel-frequency cepstral coefficients (MFCC) [15], [16], etc. Perceptual/psychoacoustic features include measures for roughness [17], loudness [17], etc. In [18], a model representing the temporal envelope processing by the human auditory system has been proposed which yields 62 features describing the auditory filterbank temporal envelope (AFTE). Liu et al. [19] and Guo et al. [20] have dealt with subband energy to describe audio data.

Audio data is classified into different categories like speech, music, songs, or noise based on the feature vector describing the data. A number of classification schemes of varying complexity have been used. El-Maleh [10] has proposed a two level speech-music classifier. A threshold based two level algorithm has been presented in [9]. A neural network based scheme has been tried by Matityaho and Furst [21]. SVM [22], [23], [24] has also been used by many researchers [20], [25] for audio classification. Various classification schemes have deployed self organizing maps, the k-nearest neighbor method, and multivariate Gaussian models [4]. Hidden Markov models [26] have also been tried.

Past study reveals that researchers have tried to exploit the strength of both the features and the classification schemes to attain high classification accuracy. In this regard, current trends are more inclined to deploy various soft computing classification schemes. But, in this work, we have set our focus on the efficacy of the features. We have concentrated on the two most widely used time domain low level features, ZCR and STE, and have tried to incorporate a concept utilized in other forms of media like images, so that the strength of the features in classifying the data can be increased. The motivation is to design a powerful set of features which will make the task of a classifier easier. The paper is organized as follows. The introduction is followed by the proposed methodology in Section II, where we have described the design of the features. Section III presents the experimental results, and the concluding remarks are put into Section IV.

II. PROPOSED METHODOLOGY

It has been indicated in past work that zero crossing rate (ZCR) and short time energy (STE) are the two most important time domain, low level features which play a major role in speech/music discrimination. This has motivated us to concentrate on these two features.

Considering audio data as discrete signals, it is said that a zero crossing has occurred whenever two successive samples have different signs. The rate of zero crossings provides an impression regarding the frequency content. The audio signal is divided into N frames {x_i(m) : 1 ≤ i ≤ N}. Then, for the ith frame, the zero crossing rate is computed as follows:

z_i = \sum_{m=1}^{n-1} \mathrm{sign}[x_i(m-1) \cdot x_i(m)] \qquad (1)

where n is the number of samples in the ith frame and

\mathrm{sign}[v] = \begin{cases} 1, & \text{if } v < 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)
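
A minimal NumPy sketch of this computation (function names are illustrative; the 150-sample frames with a 50-sample overlap match the setup reported later in Section III):

```python
import numpy as np

def frame_signal(x, frame_len=150, hop=100):
    """Split a 1-D signal into overlapping frames (150 samples, 50-sample overlap)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[k * hop : k * hop + frame_len] for k in range(n_frames)])

def zcr_per_frame(frames):
    """Per-frame zero crossing count, Eqs. (1)-(2): count sample pairs whose
    product is negative, i.e. successive samples with different signs."""
    prod = frames[:, :-1] * frames[:, 1:]
    return np.sum(prod < 0, axis=1)

# usage on a mono signal x sampled at 22050 Hz:
# z = zcr_per_frame(frame_signal(x))
```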

Mean and standard deviation of {z_i : i = 1, 2, ..., N} are taken as two features. It has been observed that the maximum value of ZCR is limited by the frame size. Thus, a precise reflection of the frequency content of an audio signal, particularly in case of a broad band one, is not achieved. Mean and standard deviation give only an overall idea about the distribution. In this context, to obtain a better representation of the signal characteristics, we have utilized the concept of the co-occurrence matrix [27], which is widely used in image processing. In an image, the occurrence of different intensity values within a neighborhood reflects a pattern, and it is utilized to parameterize the appearance/texture of an image. The same concept is adopted here. For each frame, ZCR is computed using equation (1). Thus, {z_i}, a sequence of ZCR values, is obtained for the signal. Occurrence of different ZCR values within a neighborhood reflects the pattern and characterizes the quasi-periodic behavior of the signal. Thus, a matrix C of dimension L × L (where L = max{z_i} + 1) is formed as follows:

• Initialize C[i][j] = 0, ∀ i, j ∈ {0, 1, ..., L−1}
• for i = 1 to N − d: C[z_i][z_{i+d}] = C[z_i][z_{i+d}] + 1
• C[i][j] = C[i][j] / (\sum_r \sum_c C[r][c]), ∀ i, j ∈ {0, 1, ..., L−1}

where d is the distance at which occurrence of the values is being considered. Thus, the matrix C represents the distribution of pairwise occurrences of different ZCR values. It is likely that in case of a speech signal, there will be substantial co-occurrence of low ZCR denoting silence zones, and high-low transitions (or vice versa) for non-silence to silence (or vice versa) switching. Such transitions also occur due to the interleaving of voiced and unvoiced speech. These will have a reflection in C. Music is comparatively richer in frequency content, so the distribution will be well spread in the matrix.
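
A small sketch of this construction (illustrative names; it assumes the ZCR sequence has more than d entries):

```python
import numpy as np

def cooccurrence_matrix(z, d=1):
    """Normalized co-occurrence matrix C of an integer sequence z at distance d,
    following the three steps listed above."""
    z = np.asarray(z, dtype=int)
    L = z.max() + 1                      # L = max{z_i} + 1
    C = np.zeros((L, L))
    for i in range(len(z) - d):          # accumulate pairs (z_i, z_{i+d})
        C[z[i], z[i + d]] += 1
    return C / C.sum()                   # normalize so the entries sum to 1
```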

Due to noise there may be small variations in the signal which may affect the co-occurrence matrix. Moreover, very close frequencies are also not perceivable to the human ear.

To combat these issues, we had to go for a modified scheme to construct the co-occurrence matrix. The ZCR scale may be divided into k bins defined by the points μ_z ± t × s × σ_z, where μ_z and σ_z are the mean and standard deviation of {z_i}, t takes the values 0, 1, 2, ..., and s is the step size. It is obvious that the substantial contribution will be confined within μ_z ± σ_z. Hence, to reveal the distribution characteristics in a detailed manner, s is taken from (0, 1). Once the bins have been formed, the z_i values are mapped onto bins, and instead of the z_i values, the corresponding bin numbers are used as the index in forming the co-occurrence matrix M of dimension k × k.
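
A possible implementation of this binning step (illustrative names; the values s = 0.25 and t up to 8 are the ones used later in Section III):

```python
import numpy as np

def bin_boundaries(values, s=0.25, t_max=8):
    """Boundary points mu ± t*s*sigma for t = 0, 1, ..., t_max (17 distinct points)."""
    mu, sigma = np.mean(values), np.std(values)
    return mu + np.arange(-t_max, t_max + 1) * s * sigma

def to_bins(values, boundaries):
    """Map each value to a bin index; np.digitize yields indices 0..17, i.e. 18 bins."""
    return np.digitize(values, boundaries)

# z_binned = to_bins(z, bin_boundaries(z))   # bin numbers replace the raw ZCR values
```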

From the co-occurrence matrix M, the following statistical features [28] are computed:

\text{Entropy} = -\sum_i \sum_j M[i][j] \log_2 M[i][j] \qquad (3)

\text{Energy} = \sum_i \sum_j [M[i][j]]^2 \qquad (4)

\text{Inertia} = \sum_i \sum_j (i - j)^2 M[i][j] \qquad (5)

\text{Inverse difference} = \sum_i \sum_j \frac{M[i][j]}{|i - j|}, \quad i \neq j \qquad (6)

\text{Correlation} = \frac{1}{\sigma_x \sigma_y} \sum_i \sum_j (i - \mu_x)(j - \mu_y) M[i][j] \qquad (7)

where,

\mu_x = \sum_i i \sum_j M[i][j], \qquad \mu_y = \sum_j j \sum_i M[i][j]

\sigma_x^2 = \sum_i (i - \mu_x)^2 \sum_j M[i][j], \qquad \sigma_y^2 = \sum_j (j - \mu_y)^2 \sum_i M[i][j]

Thus, computing these features, a 5-dimensional ZCR based feature vector is formed.
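
The five statistics of Eqs. (3)-(7) could be computed from a normalized co-occurrence matrix M as in the sketch below (our own illustration; the small epsilon guarding log 0 is an implementation detail, not part of the paper):

```python
import numpy as np

def cooccurrence_features(M):
    """Entropy, energy, inertia, inverse difference and correlation of a
    normalized co-occurrence matrix M, following Eqs. (3)-(7)."""
    i, j = np.indices(M.shape)
    eps = 1e-12                                        # avoid log2(0)
    entropy = -np.sum(M * np.log2(M + eps))
    energy = np.sum(M ** 2)
    inertia = np.sum((i - j) ** 2 * M)
    off_diag = i != j
    inv_diff = np.sum(M[off_diag] / np.abs(i - j)[off_diag])
    mu_x, mu_y = np.sum(i * M), np.sum(j * M)
    sigma_x = np.sqrt(np.sum((i - mu_x) ** 2 * M))
    sigma_y = np.sqrt(np.sum((j - mu_y) ** 2 * M))
    correlation = np.sum((i - mu_x) * (j - mu_y) * M) / (sigma_x * sigma_y)
    return np.array([entropy, energy, inertia, inv_diff, correlation])
```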

Similarly, short time energy based features are also computed. First of all, for each frame the short time energy is computed as follows:

E_i = \frac{1}{n} \sum_{m=0}^{n-1} [x_i(m)]^2 \qquad (8)

where the frame contains n samples. Based on the set of STE values E_i for the frames, the co-occurrence matrix is formed in the same manner as it has been done for the co-occurrence matrix of ZCRs. As the range of energy values is quite high, it would have been a big problem for the matrix dimension. Mapping of the absolute values to bins solves the problem. Such mapping also overcomes another problem: an overall rise/fall in the amplitude level of the signal does not change the nature of the signal but affects the energy value. The mapping scheme presented in this work also cancels such impact and retains the signal characteristics. In case of a speech signal, silence zones will have minimal energy. Moreover, interleaved voiced and unvoiced speech will lead to interleaving of high and low energy. It gives a typical pattern in the co-occurrence matrix, enabling us to discriminate speech from the rest. Co-occurrence matrix based features are computed to obtain a 5-dimensional STE based feature vector.
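
A one-line sketch of Eq. (8) on the framed signal (illustrative; it reuses the frame_signal helper sketched earlier, and the same bin_boundaries / to_bins / cooccurrence_matrix steps are then applied to the STE sequence):

```python
import numpy as np

def ste_per_frame(frames):
    """Short time energy of each frame, Eq. (8): mean of the squared samples."""
    return np.mean(frames.astype(float) ** 2, axis=1)
```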

Taking ZCR and STE based features together, a 10-dimensional feature vector is formed and it acts as the description of an audio signal. Using the 10-dimensional feature vector we go for speech/music classification. In choosing the classification scheme, we had to keep in mind that the focal point of this work is to judge the capacity of the proposed feature set in discriminating speech and music. Use of sophisticated classification schemes like SVM may add their own strength substantially and it will be difficult to understand the effect of the underlying features. Hence, we have used a primitive scheme, k-means clustering, to separate the audio signals into two classes, namely speech and music.
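
A hedged sketch of this final step using scikit-learn's k-means on the 10-dimensional vectors; the paper does not spell out how the two clusters are mapped to the speech and music labels, so that mapping (e.g. by inspecting a few known files per cluster) is our assumption:

```python
from sklearn.cluster import KMeans

def cluster_speech_music(X, seed=0):
    """Partition feature vectors X of shape (num_files, 10) into two clusters.
    The returned labels are cluster ids (0/1), not yet speech/music names."""
    km = KMeans(n_clusters=2, n_init=10, random_state=seed)
    return km.fit_predict(X)
```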

III. EXPERIMENTAL RESULTS

In order to carry out the experiment, we have prepared a database consisting of 110 speech files and 140 music files. All are of around 40-45 seconds duration. The sampling frequency is 22050 Hz, with 16 bits per sample, mono. The speech files contain the voices of male and female speakers, and different languages are also present. Some are noisy as well. The music files correspond to a wide variety of instruments.

To compute the features, an audio file is divided into frames. Each frame consists of 150 samples, with an overlap of 50 samples between two consecutive frames. To compute the co-occurrence matrices, we have considered s = 0.25 and t goes up to 8. Thus, 18 bins are formed. ZCR and STE values of the frames have been mapped into the bins. Finally, 18×18 co-occurrence matrices are formed. Occurrence of ZCR (STE) values is looked into at successive points, i.e., d is taken as 1.
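
As a quick check of the bin count (our own arithmetic, not spelled out in the paper): with s = 0.25 and t = 0, 1, ..., 8, the boundary points μ ± t·s·σ amount to 2 × 8 + 1 = 17 distinct values, which partition the value axis into 17 + 1 = 18 bins, matching the 18 × 18 matrices.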

Corresponding to a sample speech and music signal, the plots of the co-occurrence matrices have been shown in Fig. 1 and Fig. 2. It is clear that the occurrence patterns are different for them. For speech, the silence zone has almost zero energy and ideally two visible peaks arise out of voiced and unvoiced speech. In case of music, multiple clusters are formed. For ZCR, the occurrence pattern of a speech signal shows a few peaks as its frequency content is limited. Music being rich in frequency content, multiple peaks become apparent. Thus, the utility of the concept of occurrence pattern is clearly visible.

The classification result is shown in Table I. It shows that classification based on mean and standard deviation of ZCR and STE fails to recognize speech, whereas the proposed co-occurrence matrix based features perform better for speech but lag a little for music. Much higher overall classification accuracy is obtained in case of the proposed features. In [25], an SVM based classification technique has been used and success depends heavily on the classification scheme. They have reported around 98% accuracy. In this context, it may be noted that comparing two systems (theirs and ours) working on different databases is not justifiable. Moreover, we have adhered to the simple k-means clustering algorithm, and use of other sophisticated schemes like MLP or SVM would have provided improved results. Furthermore, inclusion of frequency domain and perceptual features may also improve the accuracy. The intended purpose of the work is to study the strength of the proposed features in comparison to traditional STE and

TABLE I
CLASSIFICATION ACCURACY

                                     Classification Accuracy (in %)
Feature                              Music     Speech    Overall
Mean and std. dev. of ZCR & STE      90.71     59.09     76.80
Proposed features                    87.85     96.36     91.60

Fig. 1. Plot of co-occurrence matrix of STE for (a) a Speech Signal and (b) a Music Signal.

ZCR based features. In this regard, the results clearly establish the supremacy of the proposed features.

Fig. 2. Plot of co-occurrence matrix of ZCR for (a) a Speech Signal and (b) a Music Signal.

IV. CONCLUSIONS

In this work, we have proposed a new set of features based on the occurrence pattern of ZCR and STE by deploying the concept of the co-occurrence matrix. Experimental results indicate the potential of such features. The classification performance of the proposed features is much better than that of traditional features based on ZCR and STE. In this work, to highlight the strength of the proposed features, a simple classification scheme based on k-means clustering has been adopted. In future, sophisticated classification schemes like SVM may be used to improve the performance further.

REFERENCES

[1] T. Zhang and C. C. J. Kuo, "Content-based classification and retrieval of audio," in Conf. on Advanced Signal Processing, Architectures and Implementations VIII, 1998.
[2] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, pp. 357–366, 1980.
[3] E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based classification, search, and retrieval of audio," IEEE Transactions on Multimedia, vol. 28, pp. 27–36, 1996.
[4] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1997, pp. 1331–1334.
[5] H. Wang, A. Divakaran, A. Vetro, S. F. Chang, and H. Sun, Survey on compressed-domain features used in video/audio indexing and analysis. Technical report, Department of Electrical Engineering, Columbia University, New York, 2000.
[6] Y. Wang, Z. Liu, and J. C. Huang, "Multimedia content analysis using both audio and visual cues," IEEE Signal Processing Magazine, vol. 17, pp. 12–36, 2000.
[7] C. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Int. Conf. on Music Information Retrieval, 2004, pp. 531–537.
[8] J. Downie, "The scientific evaluation of music information retrieval systems: Foundations and future," Computer Music Journal, vol. 28, no. 2, pp. 12–33, 2004.
[9] J. Saunders, "Real-time discrimination of broadcast speech/music," in IEEE Int. Conf. on Acoustics, Speech, Signal Processing, 1996, pp. 993–996.
[10] K. El-Maleh, M. Klein, G. Petrucci, and P. Kabal, "Speech/music discriminator for multimedia application," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2000.
[11] H. Beigi, S. Maes, J. Sorensen, and U. Chaudhari, "A hierarchical approach to large-scale speaker recognition," in Proceedings of the Int. Computer Music Conference, 1999.
[12] M. A. Cohen, S. Grossberg, and L. L. Wyse, "A spectral network model of pitch perception," Journal of the Acoustical Society of America, vol. 98, pp. 862–879, 1995.
[13] C. McKay and I. Fujinaga, "Automatic genre classification using large high-level musical feature sets," in Proceedings of the Int. Conf. on Music Information Retrieval, 2004.
[14] C. West and S. Cox, "Finding an optimal segmentation for audio genre classification," in Int. Symp. on Music Information Retrieval, 2005.
[15] A. Eronen and A. Klapuri, "Musical instrument recognition using cepstral coefficients and temporal features," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2000, pp. 753–756.
[16] J. T. Foote, "Content-based retrieval of music and audio," in SPIE, 1997, pp. 138–147.
[17] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models. Springer Series in Information Sciences, 1999.
[18] J. Breebaart and M. McKinney, "Features for audio classification," in Int. Conf. on MIR, 2003.
[19] Z. Liu, J. Huang, Y. Wang, and T. Chen, "Audio feature extraction and analysis for scene classification," in IEEE Workshop on Multimedia Signal Processing, 1997.
[20] G. Guo and S. Z. Li, "Content-based audio classification and retrieval by support vector machines," IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 209–215, 2003.
[21] B. Matityaho and M. Furst, "Classification of music type by a multilayer neural network," Journal of the Acoustical Society of America, vol. 95, 1994.
[22] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2001.
[23] D. M. J. Tax and R. P. W. Duin, "Support vector domain description," Pattern Recognition Letters, vol. 20, pp. 1191–1199, 1999.
[24] F. Camastra and A. Verri, "A novel kernel method for clustering," IEEE Transactions on PAMI, vol. 27, pp. 801–805, 2005.
[25] S. O. Sadjadi, S. M. Ahadi, and O. Hazrati, "Unsupervised speech/music classification using one-class support vector machines," in Proceedings of the ICICS, 2007.
[26] D. Kimber and L. Wilcox, "Acoustic segmentation for audio browsers," in Proc. of Interface Conference, 1996.
[27] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision (Vol. I). Addison-Wesley, 1992.
[28] S. E. Umbaugh, Computer Imaging: Digital Image Analysis and Processing. CRC Press, 2005.