
A novel spoken keyword spotting system using support vector machine

J. Sangeetha*, S. Jothilakshmi
Department of Computer Science and Engineering, Annamalai University, Annamalainagar 608 002, India
* Corresponding author. E-mail addresses: [email protected] (J. Sangeetha), [email protected] (S. Jothilakshmi).

Article info

Article history:
Received 22 November 2013
Received in revised form 11 June 2014
Accepted 17 July 2014

Keywords:
Spoken keyword detection
Spoken keyword spotting
Mel frequency cepstral coefficients
Support vector machine
Misclassification rate

Abstract

Spoken keyword spotting is crucial for efficiently classifying many hours of audio content such as meetings and radio news. Such systems are developed with the purpose of indexing huge audio databases or of detecting keywords in continuous speech streams. The proposed work involves sliding a frame-based keyword template along the speech signal and using the support vector machine (SVM) misclassification rate, obtained from the hyperplane separating two classes, to efficiently search for a match. This work frames a novel spoken keyword detection algorithm. The experimental results show that the proposed approach competes with the keyword detection methods described in the literature and is an alternative to the prevailing keyword detection approaches.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Keyword spotting is a promising branch of continuous speech recognition, useful for retrieving the speech files that contain the words or phrases associated with an application-specific domain. It resolves the problem of finding the right word or phrase in the speech flow. Keyword spotting technologies are used extensively by security services, telecommunication companies, radio stations, call centers, broadcasting companies and other organizations that handle large streams or collections of speech information and need rapid search over huge data sets. Keyword detection systems can be applied not only to telephone conversations but also to video and audio streams, which greatly accelerates the process of data tracking.

Several computational approaches to this problem have been proposed in the literature (Jansen and Niyogi, 2009). Investigating the task of spotting predefined keywords in continuous speech has both practical and scientific motivations, even in situations where little access to non-lexical linguistic constraints is provided (e.g. spotting native words in an unfamiliar language). One of the first keyword spotting strategies, proposed by Bridle (1983), involved sliding a frame-based keyword template along the speech signal and using a nonlinear dynamic time warping algorithm to proficiently search for a match. While the word models in later approaches changed significantly, this sliding model strategy was reused in other approaches (Wilpon et al., 1989; Silaghi and Bourlard, 2000).

A Japanese spoken term detection method for spoken documents was proposed (Nakagawa et al., 2013) that robustly handles out-of-vocabulary (OOV) words and mis-recognition. To address OOV words, recognition errors and high-speed retrieval, a distant n-gram indexing/retrieval method incorporates a distance metric in a syllable lattice. An efficient approach (Norouzian and Rose, 2012) to spoken term detection (STD) from unstructured audio recordings has been proposed, using word lattices generated off-line from an automatic speech recognition (ASR) system. The approach facilitates open vocabulary STD and focuses specifically on reducing the difference between the detection performance obtained for in-vocabulary (IV) and out-of-vocabulary search terms. A hybrid two-pass approach (Norouzian and Rose, 2014) for fast and efficient open vocabulary STD was also proposed: a large vocabulary continuous speech recognition (LVCSR) system produces word lattices from audio recordings, and an index construction technique enables very fast search of the lattices for occurrences of both IV and out-of-vocabulary query terms.

An unsupervised learning framework was proposed (Zhang et al., 2009) to address the problem of detecting spoken keywords. Without any transcription information, a Gaussian mixture model is trained to label speech frames with Gaussian posteriorgrams.


For one or more given spoken examples of a keyword, they use segmental dynamic time warping to compare the Gaussian posteriorgrams between keyword samples and test utterances. A standard hidden Markov model (HMM) based method is the keyword-filler model. In this case, an HMM consists of three components: a keyword model, a background model, and a filler model. The keyword model is tied to the filler model, which is typically a phone or broad-class loop meant to represent the non-keyword portions of the speech signal. Finally, the background model is used to normalize keyword model scores. A Viterbi decode of the speech signal is performed using this keyword-filler HMM, producing predictions of when the keyword occurs. Variations of this approach are provided by Wilpon et al. (1990), Hofstetter and Rose (1992), Rose and Paul (1990), and Szoke et al. (2005). Research effort focused on defining specialized confidence measures that maximize performance is provided by James and Young (1994), Weintraub (1995), Junkawitsch et al. (1996), and Thambiratnam and Sridharan (2005). While these systems do not require a predefined vocabulary, they rely on language modelling and are thus highly tuned to the training environment.

A further extension of HMM spotter approaches consists of using large vocabulary continuous speech recognition HMMs. This approach can actually be seen as a phonetic-based approach in which the garbage model only allows valid words from the lexicon, except the targeted keyword. This use of additional linguistic constraints is shown to improve spotting performance (Cardillo et al., 2002; Rose and Paul, 1990). Such an approach however raises practical concerns: one can wonder whether the design of a keyword spotter should require the expensive collection of the large amount of labeled data typically needed to train LVCSR systems, as well as the computational cost implied by large vocabulary decoding (Manos and Zue, 1997).

Over the last years, significant effort toward discriminative training of HMMs has been proposed as an alternative to likelihood maximization (Bahl et al., 1986; Juang et al., 1997; Fu and Juang, 2009). These training approaches aim at both maximizing the probability of the correct transcription given an acoustic sequence and minimizing the probability of the incorrect transcriptions given an acoustic sequence. When applied to keyword spotting, none of these approaches closely ties the training objective to a final spotting objective, such as maximizing the area under the Receiver Operating Characteristic (ROC) curve.

A method proposed by Itoh et al. (2012) realizes pseudo-real-time spoken term detection using pre-retrieval results. Pre-retrieval results for all combinations of syllable bigrams are prepared beforehand. The retrieval time depends on the number of candidate sections in the pre-retrieval results. A few top candidates are obtained in almost real time by limiting the number of candidate sections; while the user is confirming these candidate sections, the system can conduct the rest of the retrieval by gradually increasing the number of candidate sections. A technique for phonetic spoken term detection in large audio archives (Vavruska et al., 2013) is designed within the framework of weighted finite-state transducers and utilizes the rather recently developed notion of factor automata, enhanced with score normalization and a technique for systematic query expansion which allows for phone deletions and substitutions and consequently compensates for frequent pronunciation imperfections and systematic phoneme interchanges occurring during the ASR decoding process. A new approach to spoken keyword detection using Auto-Associative Neural Networks (AANN) has been proposed (Jothilakshmi, 2014), which exploits the distribution-capturing ability of the AANN and is based on a confidence score obtained from the normalized squared error of the AANN.

The most important contribution of this paper concerns the exploitation of the misclassification rate obtained from the trained SVM hyperplane of two classes (−1 and +1) for spoken keyword detection (Jothilakshmi et al., 2009). The proposed method involves sliding a frame-based keyword pattern along the input audio signal. Initially, the block of frames belonging to the search keyword is assigned to the (−1) class, and a block of frames selected from the input signal, starting from the first frame and containing as many frames as the keyword signal, is assigned to the (+1) class. The SVM is then trained on these two classes and a hyperplane is obtained between them. Using this hyperplane, the frames of the two classes are classified, and the resulting misclassification rate is used to proficiently search for a match. This work formulates a new spoken keyword spotting system.

The rest of the paper is organized as follows. Section 2 describes the method of extracting the features for spoken keyword spotting from the speech signal. The support vector machine for spoken keyword spotting is given in Section 3. The proposed algorithm for spoken keyword detection is presented in Section 4. Section 5 presents the performance measures for the proposed spoken keyword spotting system. Section 6 describes the databases, feature extraction and parameter tuning used in the experiments. Section 7 presents the evaluation results. Section 8 gives the conclusions and describes future work.

2. Feature extraction for keyword detection

MFCC has proven to be one of the most successful feature representations in speech-related recognition tasks. The mel-cepstrum exploits auditory principles as well as the decorrelating property of the cepstrum (Davis and Mermelstein, 1980). Fig. 1 illustrates the computation of MFCC features for a segment of speech signal, which proceeds as follows (HTK book, 2002):

1. The speech waveform is first windowed with an analysis window and the discrete short-time Fourier transform (STFT) is computed.

2. The magnitude spectrum is then weighted by a series of filter frequency responses whose center frequencies and bandwidths roughly match those of the auditory critical band filters. These filters follow the mel scale, whereby band edges and center frequencies are linear at low frequency and increase logarithmically with increasing frequency, as shown in Fig. 2. We call these filters mel-scale filters and collectively a mel-scale filter bank. This filter bank, with 24 triangularly shaped frequency responses, is a rough approximation to actual auditory critical band filters covering a 4000 Hz range.

3. The log energy in the STFT weighted by each mel-scale filter frequency response is computed.

4. Finally, the discrete cosine transform (DCT) is applied to the filter bank output to produce the cepstral coefficients.

Fig. 1. Extraction of MFCC from speech signal.
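To make steps 1–4 concrete, the following is a minimal NumPy sketch of the MFCC pipeline under the paper's settings (8 kHz audio, 16 ms frames with 50% overlap, a 24-filter mel bank covering 0–4000 Hz); the function and parameter names are illustrative, not the authors' implementation:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, frame_len=128, hop=64, n_filt=24, n_ceps=13):
    # Step 1: window the signal and compute the short-time Fourier transform.
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_len)[::hop]
    frames = frames * np.hamming(frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))          # magnitude spectrum

    # Step 2: 24 triangular filters with centers equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filt + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, frame_len // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    # Step 3: log energy of each mel-weighted band.
    log_energy = np.log(mag @ fbank.T + 1e-10)

    # Step 4: the DCT decorrelates the log energies into cepstral coefficients.
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```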


3. Support vector machine (SVM) for keyword spotting

The support vector machine is based on the principle of Structural Risk Minimization (SRM). Like Radial Basis Function Neural Networks (RBFNN), support vector machines can be used for pattern classification and nonlinear regression. SVM constructs a linear model to estimate the decision function using nonlinear class boundaries based on support vectors. If the data are linearly separable, SVM trains linear machines for an optimal hyperplane that separates the data without error and with the maximum distance between the hyperplane and the closest training points. The training points closest to the optimal separating hyperplane are called support vectors. SVM maps the input patterns into a higher dimensional feature space through some nonlinear mapping chosen a priori, and a linear decision surface is then constructed in this high dimensional feature space. Thus, SVM is a linear classifier in the parameter space, but it becomes a nonlinear classifier as a result of the nonlinear mapping of the space of the input patterns into the high dimensional feature space.

The support vector machine (Vapnik, 1998) is a useful statistical machine learning technique that has been successfully applied in pattern recognition tasks (Jiang et al., 2005; Guo and Li, 2003; Ramalingam, 2006; Geetha et al., 2009). If the data are linearly nonseparable but nonlinearly separable, the nonlinear support vector classifier is applied. The basic idea is to transform input vectors into a high-dimensional feature space using a nonlinear transformation ϕ, and then to do a linear separation in feature space as shown in Fig. 3.

A nonlinear support vector classifier implementing the optimal separating hyperplane in the feature space with a kernel function K(x_i, x_new) is given by

$f(x_{new}) = \mathrm{sgn}\left( \sum_{i=1}^{SV} \alpha_i y_i K(x_i, x_{new}) + b \right)$   (1)

where SV is the number of support vectors. The SVM has two layers. During the learning process, the first layer selects the bases K(x_i, x_new), i = 1, 2, …, SV, from the given set of bases defined by the kernel; the second layer constructs a linear function in this space. This is completely equivalent to constructing the optimal hyperplane in the corresponding feature space.
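As a sanity check on Eq. (1), the sketch below trains a nonlinear SVM on synthetic two-class data and evaluates the decision function manually from the learned support vectors; the data and parameter values are illustrative only (scikit-learn's dual_coef_ stores the products α_i·y_i):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-ins for two classes of 39-dimensional feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 39)), rng.normal(2.0, 1.0, (50, 39))])
y = np.array([-1] * 50 + [+1] * 50)

gamma = 0.05
clf = SVC(kernel='rbf', gamma=gamma).fit(X, y)

# Eq. (1): f(x_new) = sgn( sum_i alpha_i * y_i * K(x_i, x_new) + b )
x_new = rng.normal(1.0, 1.0, 39)
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f = np.sign(clf.dual_coef_[0] @ K + clf.intercept_[0])

print(f == clf.predict(x_new[None, :])[0])   # True: both give the same class
```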

The SVM algorithm can construct a variety of learning machines through the use of different kernel functions. Four kinds of kernel functions are usually used:

1. Linear kernel:

$K(x_1, x_2) = \langle x_1, x_2 \rangle$   (2)

2. Polynomial kernel of degree d:

$K(x_1, x_2) = (\gamma \langle x_1, x_2 \rangle + c_0)^d$   (3)

3. Gaussian radial basis function (RBF):

$K(x_1, x_2) = \exp(-\gamma \| x_1 - x_2 \|^2)$   (4)

4. Sigmoidal kernel:

$K(x_1, x_2) = \tanh(\gamma \langle x_1, x_2 \rangle + c_0)$   (5)

where the kernel parameters are:

• γ: width of the RBF (and coefficient in the polynomial and sigmoidal kernels)
• d: degree of the polynomial
• c_0: additive constant in the polynomial and sigmoidal kernels
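The four kernels of Eqs. (2)–(5) are one-liners; a minimal NumPy sketch follows, with the default parameter values chosen only for illustration:

```python
import numpy as np

def linear(x1, x2):
    return x1 @ x2                                   # Eq. (2)

def polynomial(x1, x2, gamma=1.0, c0=1.0, d=3):
    return (gamma * (x1 @ x2) + c0) ** d             # Eq. (3)

def rbf(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))   # Eq. (4)

def sigmoidal(x1, x2, gamma=0.1, c0=0.0):
    return np.tanh(gamma * (x1 @ x2) + c0)           # Eq. (5)
```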

In Lu et al. (2001), an SVM classification based supervised technique was proposed for spoken term detection, in which a bottom-up binary tree combining three two-class SVM classifiers was adopted for content based audio segmentation. An SVM based supervised technique was proposed in Karthik et al. (2005) that labels the speech data around the spoken keyword as the (+1) class and blocks of frames in the audio files as the (−1) class, trains an SVM hyperplane, and then classifies each window as (+1) or (−1).

4. Proposed spoken keyword spotting algorithm

The keyword spotting system consists of the technical elements presented in the previous sections. It is presumed that the acoustic features have been extracted from the speech signal.

The outline of the algorithm is summarized as follows. First, the MFCC features (discussed in Section 2) are obtained for every frame of the given search keyword speech signal. Likewise, the speech features are obtained for each frame of the input signal in which the keyword should be detected. Initially, the block of frames belonging to the search keyword is assigned to the (−1) class, and a block of frames selected from the input signal, starting from the first frame and containing as many frames as the keyword signal, is assigned to the (+1) class. The SVM is then trained on these two classes and the hyperplane between them is obtained. If the word corresponding to the block of frames is the same as the search keyword, the misclassification rate will be very high. If it is not, the feature vectors from the block will most likely not fall near the hyperplane and the model gives a low misclassification (probability) rate.

Similarly, the next possibility is that the word corresponding to the block of frames partly matches the search keyword. In that case, the misclassification rate of the block will lie between the above two values.


Fig. 2. Mel scale filter bank.

Fig. 3. Principle of support vector machines.


After the misclassification rate for the current block has been obtained, the block is shifted by a fixed number of frames to the right and the entire process is repeated for the fresh block. In the same way, the misclassification rates are measured until the tail end of the block reaches the last frame of the input speech. The global maxima of the misclassification rate mark the locations of the search keyword in the input signal. Fig. 4 shows the steps involved in the proposed algorithm.

4.1. Keyword spotting

Let the speech features of the input signal be S = {S_i : i = 1, 2, …, n}, where i is the frame index and n is the total number of frames in the input speech signal, and let the speech features of the keyword signal be K = {K_j : j = 1, 2, …, m}, where j is the frame index and m is the total number of frames in the search keyword signal. The proposed algorithm for retrieving the speech files for the given spoken search keyword is summarized as follows (a code sketch is given after this list):

• Initially, the block of frames belonging to the search keyword is taken as W⁻ and assigned to the (−1) class:

$W^{-} = \{ K_j : j = 1, 2, \ldots, m \} \in \{-1\}$

• From the n frames of the input speech signal, m consecutive frames starting at frame p are chosen as W⁺ and assigned to the (+1) class:

$W^{+} = \{ S_l : p \le l < m + p \} \in \{+1\}$

• The SVM is trained using these two classes and the hyperplane between them is obtained.

• Using this hyperplane, the frames of the two classes are classified. If the word corresponding to the block of frames is the same as the search keyword, the misclassification rate will be very high. If it is not, the feature vectors from the block will most likely not fall near the hyperplane and the model gives a low misclassification (probability) rate.

• From this we know that the SVM training misclassification rate can be used to decide whether the search keyword occurs in the selected block of frames.

• Two types of misclassification rates are computed, namely mcr⁻(k) and mcr⁺(k), where mcr⁻(k) is the rate at which the (−1) class is misclassified and mcr⁺(k) is the rate at which the (+1) class is misclassified.

• Keywords are detected from the misclassification rate by applying a threshold. The threshold t_s is calculated from the confidence score as

$t_s = a \, s_{max}, \quad 0.5 < a < 1$   (6)

where s_max is the global maximum misclassification rate and a is an adjustable parameter.

• The window is then shifted by a fixed number of frames to the right of the current position and the procedure is repeated until the entire speech stream has been examined.
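The sketch below shows one way to implement the full search loop, assuming scikit-learn as the SVM implementation; spot_keyword and its parameters are illustrative names, and the linear kernel, C = 1 and 1/16 frame shift anticipate the tuned values of Section 6.3:

```python
import numpy as np
from sklearn.svm import SVC

def spot_keyword(input_feats, keyword_feats, shift=None, a=0.7):
    """Slide an m-frame window over the input and score each block by the
    SVM training misclassification rate (a high rate suggests a match).

    input_feats:   (n, 39) MFCC feature matrix of the utterance searched.
    keyword_feats: (m, 39) MFCC feature matrix of the spoken query keyword.
    a:             adjustable threshold parameter of Eq. (6), 0.5 < a < 1.
    """
    m = len(keyword_feats)
    shift = shift or max(m // 16, 1)                 # 1/16 frame shift
    scores = []
    for p in range(0, len(input_feats) - m + 1, shift):
        block = input_feats[p:p + m]                 # candidate block W+
        X = np.vstack([keyword_feats, block])
        y = np.array([-1] * m + [+1] * m)            # keyword = -1, block = +1
        clf = SVC(kernel='linear', C=1.0).fit(X, y)
        scores.append(np.mean(clf.predict(X) != y))  # misclassification rate
    scores = np.asarray(scores)
    ts = a * scores.max()                            # threshold t_s, Eq. (6)
    hits = np.where(scores >= ts)[0] * shift         # candidate start frames
    return scores, hits
```

Retraining an SVM per window is the direct reading of the algorithm above; in practice the per-window models are tiny (2m frames), which keeps each fit cheap.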

5. Performance measures

The purpose of this research is to identify keywords within audio documents based on the SVM misclassification rates. Unlike automatic speech recognition, which typically considers the correct recognition of all words to be equally important, we are interested in the tradeoff between precision and recall. First, the performance of the proposed keyword spotting algorithm is assessed in terms of the detection rate, defined as

$\mathrm{Detection\ Rate} = \frac{n_c}{n_c + n_i + n_r}$   (7)

where n_c is the number of correctly classified keywords, n_i is the number of incorrectly classified keywords, and n_r is the number of rejected keywords.

Fig. 4. Steps involved in the proposed algorithm.

The following metrics are used to evaluate the systems presented in this work. The figure of merit (FOM) was originally defined by Rohlicek et al. (1989) for the task of keyword spotting. By optimizing the FOM (Wallace et al., 2010, 2011) the accuracy of spoken term detection can be increased. It gives the average detection rate over the range of [1, 10] false alarms per hour per keyword. The FOM values for individual keywords can be averaged to give an overall figure. The NIST STD 2006 evaluation plan defined the metrics Occurrence-Weighted Value (OCC), Actual Term-Weighted Value (ATWV) and Maximum Term-Weighted Value (MTWV). These three metrics have been adopted and are described below.

For a given set of terms and some speech data, let N_correct(t), N_FA(t) and N_true(t) represent the number of correct detections, false alarms, and actual occurrences of term t, respectively. In addition, we denote the number of non-target terms (which gives the number of possibilities for incorrect detection) as N_NT(t). We also define miss and false alarm probabilities, P_miss(t) and P_FA(t), for each term t as

$P_{miss}(t) = 1 - \frac{N_{correct}(t)}{N_{true}(t)}$   (8)

$P_{FA}(t) = \frac{N_{FA}(t)}{N_{NT}(t)}$   (9)

In order to tune the metrics to give a desired balance of precision versus recall, a cost C_FA for false alarms was defined, along with a value V for correct detections. The occurrence-weighted value is computed by accumulating a value for each correct detection and subtracting a cost for false alarms as follows:

$OCC = \frac{\sum_{t \in terms} [V \, N_{correct}(t) - C_{FA} \, N_{FA}(t)]}{\sum_{t \in terms} V \, N_{true}(t)}$   (10)

Whilst OCC gives a good indication of overall system performance, there is an inherent bias towards frequently occurring terms. The second NIST metric, the actual term-weighted value, is arrived at by averaging a weighted sum of the miss and false alarm probabilities, P_miss(t) and P_FA(t), over the T terms:

$ATWV = 1 - \frac{1}{T} \sum_{t \in terms} [P_{miss}(t) + \beta \, P_{FA}(t)]$   (11)

where $\beta = (c/\gamma)(P_{prior}(t)^{-1} - 1)$. The NIST evaluation scoring tools set a uniform prior term probability P_prior(t) = 10⁻⁴ and the ratio c/γ to 0.1, with the effect that recall is emphasized over precision in the ratio 10:1. The third metric, the maximum term-weighted value, is taken over the range of all possible threshold values; it ranges from 0 to +1.
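A hedged sketch of Eqs. (7)–(11), assuming the per-term detection counts have already been tallied; the container layout and names are illustrative, and the NIST ratio c/γ is represented here as C_FA/V = 0.1:

```python
def detection_rate(n_c, n_i, n_r):
    """Eq. (7): correct / (correct + incorrect + rejected)."""
    return n_c / (n_c + n_i + n_r)

def std_metrics(terms, V=1.0, C_FA=0.1, p_prior=1e-4):
    """terms maps each term t to a tuple (N_correct, N_FA, N_true, N_NT).
    Returns (OCC, ATWV) per Eqs. (8)-(11)."""
    beta = (C_FA / V) * (1.0 / p_prior - 1.0)   # beta as defined after Eq. (11)
    occ_num = occ_den = twv_sum = 0.0
    for n_cor, n_fa, n_true, n_nt in terms.values():
        p_miss = 1.0 - n_cor / n_true           # Eq. (8)
        p_fa = n_fa / n_nt                      # Eq. (9)
        occ_num += V * n_cor - C_FA * n_fa
        occ_den += V * n_true
        twv_sum += p_miss + beta * p_fa
    occ = occ_num / occ_den                     # Eq. (10)
    atwv = 1.0 - twv_sum / len(terms)           # Eq. (11)
    return occ, atwv
```

With p_prior = 10⁻⁴ and C_FA/V = 0.1, β = 0.1 × (10⁴ − 1) = 999.9, so each false alarm is penalized heavily relative to a miss, matching the recall-versus-precision emphasis described above.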

In this work, results in terms of FOM and OCC are presented. However, rather than giving ATWV values, which provide only point estimates of the miss and false alarm probabilities, we present these results graphically in order to show the full range of operating points. For all results, the parameters of the developed algorithm are tuned on the STD development set according to the metric used in evaluation. For all measures, higher values indicate better performance. The performance is also evaluated by the recall–precision (ROC) curve.

6. Experiments and results

6.1. The databases

The experiments have been conducted over a corpus composed of broadcast news and conversations recorded from various channels such as BBC, NDTV and Doordarshan News, together with real-time recorded speech. It includes three different source types: two hours of broadcast news (BNEWS), one and a half hours of conversational telephony speech (CTS) and one hour of real-time conference room meetings (REALCONFSP). For the experiments, we have processed a query set that includes 350 queries. Each query is a phrase containing between one and five terms, common and rare terms, and terms that are in the manual transcripts as well as those that are not. The dataset is divided into a development corpus and an evaluation corpus. The development corpus, utilized for training the structure and fine-tuning the parameters, is composed of two hours of speech from the above three source types, each including 100 search terms. The evaluation corpus is composed of 5 h of speech including 250 search terms and is used for validation.

6.2. Feature extraction

The mel frequency cepstral coefficients with delta and acceleration coefficients are used to evaluate the proposed algorithm, so each feature vector consists of 39 coefficients. Cepstral mean subtraction is performed to trim down channel effects. The speech signals are sampled at 8 kHz in 16-bit monophonic PCM format. The frame rate is the same as that of the keyword frames per second, where each frame is 16 ms in duration with an overlap of 50% between adjacent frames.
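Under the stated configuration, the 39-dimensional vectors could be produced as in the following sketch; librosa is an assumed choice rather than the authors' toolchain, and the file name is a placeholder:

```python
import numpy as np
import librosa

# 8 kHz mono audio; 16 ms frames (128 samples) with 50% overlap (hop of 64).
y, sr = librosa.load('utterance.wav', sr=8000, mono=True)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=128, hop_length=64)

# 13 static + 13 delta + 13 acceleration coefficients = 39 per frame.
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),
                   librosa.feature.delta(mfcc, order=2)])

feats -= feats.mean(axis=1, keepdims=True)   # cepstral mean subtraction
frames = feats.T                             # shape (n_frames, 39)
```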

6.3. Parameter tuning phase

In order to adjust the parameters of the algorithm to yield the best performance, several experiments have been performed on the development corpus, whose results are provided in this subsection. The parameters to be tuned are the frame shift and the choice of SVM kernel. The MFCC feature vectors are extracted for all the speech frames as described in Sections 2 and 6.2. For the given keyword feature vectors, the misclassification rates are determined by the hyperplane of the two classes (+1 and −1) as described in Section 3.

The feature vectors of W⁺ (the block starting at frame p) are set to the (+1) class and the feature vectors of the search keyword are set to the (−1) class. The SVM is trained for these two classes and the hyperplane is obtained. Using this hyperplane, the frames of the two classes are classified and the misclassification rate is computed. Fig. 5 shows the progression of the misclassification rate as the frame shift is changed from 1/2 to 1/4, 1/8, 1/16, 1/32 and 1/64 of the keyword length. There is no change in the misclassification rate beyond 1/16 even when the frame shift is decreased to 1/32 and 1/64; hence the SVM models are trained with a 1/16 frame shift only. Similarly, the classification rate is measured for three different types of SVM kernels, and based on the results the linear kernel function with the upper bound of the Lagrange multiplier C = 1 has been used with the SVM classifier.

Fig. 5. Effects of frame shift in misclassification rate.
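The frame-shift sweep of Fig. 5 could be reproduced along the lines of the sketch below, which reuses the spot_keyword function sketched in Section 4; dev_set is an assumed list of (input_feats, keyword_feats, true_starts) triples from the development corpus, and the tolerance for counting a hit as correct is illustrative:

```python
def sweep_frame_shift(dev_set, fractions=(2, 4, 8, 16, 32, 64)):
    """Detection rate on the development set for each frame-shift fraction."""
    rates = {}
    for frac in fractions:
        correct = total = 0
        for feats, kw, true_starts in dev_set:
            shift = max(len(kw) // frac, 1)      # shift = keyword length / frac
            _, hits = spot_keyword(feats, kw, shift=shift)
            # count a true occurrence as found if some hit starts near it
            correct += sum(any(abs(h - t) < len(kw) // 2 for h in hits)
                           for t in true_starts)
            total += len(true_starts)
        rates[frac] = correct / total
    return rates   # pick the coarsest shift after which the rate stops changing
```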


It is not possible to obtain exactly the same misclassification rates for the same keyword query every time. To avoid false keyword spotting, misclassification rates below the threshold value are rejected. Hence, after obtaining the global maxima of the confidence scores for the entire speech signal, the hypothesized keyword is validated using the threshold. For calculating the threshold, the adjustable parameter a is used in this experiment.

7. Evaluation results

The experiments were conducted on the database described in Section 6.1. A set of 250 search queries was selected depending on their high frequency of occurrence and appropriateness as search terms for spoken keyword spotting, and evaluation (retrieving search terms) was performed on the test set.

7.1. Spoken keyword spotting results

• Recognition accuracy: Whilst the detection rate is not the main focus of this work, it is an important factor in STD performance. In Table 1, we present the recognition accuracy results obtained after providing the empirically tuned parameter values to the algorithm.

• Evaluation in terms of FOM and OCC: Table 2 shows that in terms of FOM, BNEWS renders better performance than the source types CTS and REALCONSP. Similarly, in terms of OCC, BNEWS again provides the best performance.

• Detection error tradeoff (DET) curves of the ATWV performance are presented in Fig. 6, which shows the miss probability against the false alarm probability for each of the source types. The DET curves show that the performances are quite similar for the source types BNEWS and CTS, and both outperform REALCONSP.

7.2. Evaluation in terms of ATWV and MTWV

For each found occurrence of a given query, our system outputs the location of the term in the audio recording (begin time and duration), a score indicating how likely the occurrence of the query is, and a hard decision as to whether the detection is correct. We measure precision and recall by comparing the results obtained over the automatic transcripts (only the results having a true hard decision) to the results obtained over the reference manual transcripts. Our aim is to evaluate the ability of the suggested retrieval approach to handle transcribed speech data; the closer the automatic results are to the manual results, the better the search effectiveness over the automatic transcripts. The results returned from the manual transcription for a given query are considered relevant and are expected to be retrieved with the highest scores.

The retrieval performance is evaluated by the recall–precision (ROC) curve. Besides recall and precision, we use the evaluation measures defined by NIST for the 2006 STD evaluation: the actual term-weighted value and the maximum term-weighted value. The term-weighted value (TWV) is computed by first computing the miss and false alarm probabilities for each query separately, then using these and an (arbitrarily chosen) prior probability to compute query-specific values, and finally averaging these query-specific values over all queries to produce an overall system value. Table 3 shows the values of the above measures, and Fig. 7 shows the ROC curve for the source types. Fig. 7 clearly shows that BNEWS gives the best performance over CTS and REALCONSP.

To justify the proposed work, its overall performance is compared with that of the existing method based on the Auto-Associative Neural Network algorithm (Jothilakshmi, 2014).

Table 1. Detection rate for the source types.

Database             BNEWS   CTS    REALCONSP
Detection rate (%)   95.5    94.5   93.5

Table 2. Results in terms of FOM and OCC for the source types.

Measure   BNEWS (%)   CTS (%)   REALCONSP (%)
FOM       92.7        91.2      89.7
OCC       91.1        87.3      86.2

Fig. 6. DET curves for the source types BNEWS, CTS and REALCONSP.

Table 3. ATWV, MTWV, precision and recall per source type.

Measure     BNEWS   CTS    REALCONSP
ATWV        0.88    0.87   0.85
MTWV        0.91    0.89   0.86
Recall      0.93    0.91   0.86
Precision   0.90    0.89   0.85

Fig. 7. ROC curve for the source types BNEWS, CTS and REALCONSP.

Table 4. Comparative results.

Comparison                Existing method   Proposed method
Overall performance (%)   93.2              94.5


Table 4 depicts that the performance of the proposed method is superior to that of the existing method (Jothilakshmi, 2014).

8. Conclusion

In this paper, a novel method for spoken keyword spotting using MFCC features and a support vector machine is proposed. The misclassification rate of the SVM hyperplane is used for spoken term detection. The proposed method involves sliding a frame-based keyword template along the speech signal and using the misclassification rate of the SVM hyperplane obtained from the two classes (−1 and +1) to competently search for a match. This work formulates a new spoken keyword detection algorithm and studies how spoken keyword spotting performs over different data sources. The experiments reveal that all the measures provided the best performance for the BNEWS source type, and the method yields an overall spoken term detection performance of around 94.5%, which is better than the methods in the literature. Future work will aim to improve the efficiency of the algorithm by optimizing the time taken to scrutinize the complete audio file frame by frame.

References

Bahl, L., Brown, P., de Souza, P., Mercer, R., 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: International Conference on Acoustics, Speech and Signal Processing, pp. 49–52.

Bridle, J.S., 1983. An efficient elastic-template method for detecting given words in running speech. In: Proceedings of the British Acoustic Society Meeting, pp. 1–4.

Cardillo, P., Clements, M., Miller, M., 2002. Phonetic searching vs. LVCSR: how to find what you really want in audio archives. Int. J. Speech Technol. 5 (1), 9–22.

Davis, S.B., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28, 357–366.

Fu, Q., Juang, B.H., 2009. Automatic speech recognition based on weighted minimum classification error (W-MCE) training method. In: Proceedings of ASRU. IEEE, Kyoto, pp. 278–283.

Geetha, A., Ramalingam, V., Palaniappan, B., Palanivel, S., 2009. Facial expression recognition—a real time approach. Expert Syst. Appl. 36 (1), 303–308.

Guo, G., Li, S.Z., 2003. Content-based audio classification and retrieval by support vector machines. IEEE Trans. Neural Netw. 14, 308–315.

Hofstetter, E.M., Rose, R.C., 1992. Techniques for task independent word spotting in continuous speech messages. In: Proceedings of ICASSP.

HTK book, 2002. Microsoft Corporation, USA.

Itoh, Y., Saito, H., Tanaka, K., Lee, S.-w., 2012. Pseudo real-time spoken term detection using pre-retrieval results. In: Speech and Computer, Lecture Notes in Computer Science, vol. 8113. Springer International Publishing, Switzerland, pp. 264–270.

James, D.A., Young, S.J., 1994. A fast lattice-based approach to vocabulary independent wordspotting. In: Proceedings of ICASSP.

Jansen, A., Niyogi, P., 2009. Point process models for spotting keywords in continuous speech. IEEE Trans. Audio Speech Lang. Process. 17 (8), 1457–1470.

Jiang, H., Bai, J., Zhang, S., Xu, B., 2005. SVM-based audio scene classification. In: Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '05), September, pp. 131–136.

Jothilakshmi, S., 2014. Spoken keyword detection using autoassociative neural networks. Int. J. Speech Technol. 17, 83–89.

Jothilakshmi, S., Ramalingam, V., Palanivel, S., 2009. Speaker diarization using autoassociative neural networks. Eng. Appl. Artif. Intell. 22, 667–675.

Juang, B.H., Chou, W., Lee, C.H., 1997. Minimum classification error rate methods for speech recognition. IEEE Trans. Speech Audio Process. 5 (3), 257–265.

Junkawitsch, J., Neubauer, L., Höge, H., Ruske, G., 1996. A new keyword spotting algorithm with pre-calculated optimal thresholds. In: Proceedings of ICSLP.

Karthik, V., Satish, D.S., Sekhar, C.C., 2005. Speaker change detection using support vector machines. In: Proceedings of the International Conference on Non-Linear Speech Processing (NOLISP '05), pp. 130–136.

Lu, L., Li, S.Z., Zhang, H.J., 2001. Content based audio segmentation using support vector machines. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '01), pp. 956–959.

Manos, A., Zue, V., 1997. A segment-based wordspotter using phonetic filler models. In: International Conference on Acoustics, Speech and Signal Processing, pp. 889–902.

Nakagawa, S., Iwami, K., Fujii, Y., Yamamoto, K., 2013. A robust/fast spoken term detection method based on a syllable n-gram index with a distance metric. Speech Commun. 55, 470–485.

NIST, 2006. The Spoken Term Detection (STD) 2006 Evaluation Plan. National Institute of Standards and Technology, Gaithersburg, MD, USA. ⟨http://www.nist.gov/speech/tests/std⟩.

Norouzian, A., Rose, R., 2012. Facilitating open vocabulary spoken term detection using a multiple pass hybrid search algorithm. In: Proceedings of Acoustics, Speech and Signal Processing (ICASSP). IEEE, Kyoto, pp. 5169–5172.

Norouzian, A., Rose, R., 2014. An approach for efficient open vocabulary spoken term detection. Speech Commun. 57, 50–62.

Ramalingam, V., 2006. A study of advertisement effectiveness using neural networks (Ph.D. thesis). Department of Computer Science and Engineering, Annamalai University.

Rohlicek, J., Russell, W., Roukos, S., Gish, H., 1989. Continuous hidden Markov modeling for speaker independent word spotting. In: Proceedings of ICASSP, Glasgow, UK, pp. 627–630.

Rose, R., Paul, D., 1990. A hidden Markov model based keyword recognition system. In: International Conference on Acoustics, Speech and Signal Processing, pp. 129–132.

Silaghi, M.C., Bourlard, H., 2000. Iterative posterior-based keyword spotting without filler models. In: Proceedings of ICASSP.

Szoke, I., Schwarz, P., Matejka, P., Burget, L., Fapso, M., Karafiat, M., Cernocky, J., 2005. Comparison of keyword spotting approaches for informal continuous speech. In: Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms.

Thambiratnam, K., Sridharan, S., 2005. Dynamic match phone-lattice searches for very fast and unrestricted vocabulary KWS. In: Proceedings of ICASSP.

Vapnik, V., 1998. Statistical Learning Theory. John Wiley and Sons, New York.

Vavruska, J., Svec, J., Ircing, P., 2013. Phonetic spoken term detection in large audio archive using the WFST framework. In: Text, Speech and Dialogue, Lecture Notes in Computer Science. Springer, Berlin, Heidelberg.

Wallace, R., Vogt, R., Baker, B., Sridharan, S., 2010. Optimizing figure of merit for phonetic spoken term detection. In: Proceedings of ICASSP, pp. 5298–5301.

Wallace, R., Vogt, R., Baker, B., Sridharan, S., 2011. Discriminative optimization of the figure of merit for phonetic spoken term detection. IEEE Trans. Audio Speech Lang. Process. 19 (6).

Weintraub, M., 1995. LVCSR log-likelihood scoring for keyword spotting. In: Proceedings of ICASSP.

Wilpon, J.G., Rabiner, L.R., Lee, C.H., Goldman, E.R., 1989. Application of hidden Markov models for recognition of a limited set of words in unconstrained speech. In: Proceedings of ICASSP.

Wilpon, J.G., Rabiner, L.R., Lee, C.-H., Goldman, E.R., 1990. Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Trans. Acoust. Speech Signal Process. 38 (11), 1870–1878.

Zhang, Y., Glass, J.R., 2009. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In: Proceedings of ASRU 2009. IEEE, Merano, pp. 398–403.
