Recursive nearest neighbor search in a sparse and multiscale domain for comparing audio signals

Bob L. Sturm a,*,#, Laurent Daudet b,##

a Department of Architecture, Design and Media Technology, Aalborg University Copenhagen, Lautrupvang 15, 2750 Ballerup, Denmark
b Institut Langevin (LOA), Université Paris Diderot – Paris 7, UMR 7587, 10, rue Vauquelin, 75231 Paris, France

Article info
Article history: Received 5 January 2010; Received in revised form 15 February 2011; Accepted 2 March 2011; Available online 10 March 2011.
Keywords: Multiscale decomposition; Sparse approximation; Time–frequency dictionary; Audio similarity.

Abstract
We investigate recursive nearest neighbor search in a sparse domain at the scale of audio signals. Essentially, to approximate the cosine distance between the signals we make pairwise comparisons between the elements of localized sparse models built from large and redundant multiscale dictionaries of time–frequency atoms. Theoretically, error bounds on these approximations provide efficient means for quickly reducing the search space to the nearest neighborhood of a given datum; but we demonstrate here that the best bound defined thus far, involving a probabilistic assumption, does not provide a practical approach for comparing audio signals with respect to this distance measure. Our experiments show, however, that regardless of these non-discriminative bounds, we only need to make a few atom pair comparisons to reveal, e.g., the origin of an excerpted signal, or melodies with similar time–frequency structures.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Sparse approximation is essentially the modeling of data with few terms from a large and typically overcomplete set of atoms, called a "dictionary" [24]. Consider a signal x ∈ R^K, and a dictionary D composed of N unit-norm atoms in the same space, expressed in matrix form as D ∈ R^{K×N}, where N ≫ K.
A pursuit is an algorithm that decomposes x in terms of D such that ‖x − Ds‖₂ ≤ ε for some error ε ≥ 0. (In this paper, we work in a Hilbert space unless otherwise noted.) When D is overcomplete, D has full row rank and there exists an infinite number of solutions to choose from, even for ε = 0. Sparse approximation aims to find a solution s that is mostly zeros for ε small. In that case, we say that x is sparse in D.

Matching pursuit (MP) is an iterative descent sparse approximation method based on greedy atom selection [17,24]. We express the n-th-order model of the signal as x = H(n) a(n) + r(n), where a(n) is a length-n vector of weights, H(n) holds the n corresponding columns of D, and r(n) is the residual. MP augments the n-th-order representation X_n = {H(n), a(n), r(n)} according to

X_{n+1} = { H(n+1) = [H(n) | h_n];  a(n+1) = [a^T(n), ⟨r(n), h_n⟩]^T;  r(n+1) = x − H(n+1) a(n+1) }   (1)

using the atom selection criterion

h_n = arg min_{d∈D} ‖r(n) − ⟨r(n), d⟩ d‖² = arg max_{d∈D} |⟨r(n), d⟩|   (2)

where ‖d‖ = 1 is implicit. The inner product here is defined ⟨x, y⟩ ≜ y^T x. This criterion guarantees ‖r(n+1)‖₂ ≤ ‖r(n)‖₂ [24].

Other sparse approximation methods include orthogonal MP [28], orthogonal least squares (OLS) [41,33], molecular methods [9,38,19], cyclic MP and OLS [36], and minimizing the error jointly with a relaxed sparsity measure [6]. These approaches have higher computational

* Corresponding author. E-mail addresses: [email protected] (B.L. Sturm), [email protected] (L. Daudet). # EURASIP # 7255. ## EURASIP # 2298.
Signal Processing 91 (2011) 2836–2851. doi:10.1016/j.sigpro.2011.03.002
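The MP recursion in (1)–(2) can be sketched in a few lines of NumPy. This is a minimal illustration under the definitions above, not the authors' implementation; it assumes a dictionary matrix D with unit-norm columns:

```python
import numpy as np

def matching_pursuit(x, D, n):
    """Greedy MP: n iterations of the update (1) using the atom
    selection criterion (2). D is K x N with unit-norm columns."""
    r = x.astype(float).copy()          # residual r(0) = x
    idx, a = [], []
    for _ in range(n):
        c = D.T @ r                     # <r(n), d> for every atom d
        k = int(np.argmax(np.abs(c)))   # h_n = arg max |<r(n), d>|
        idx.append(k)
        a.append(c[k])
        r -= c[k] * D[:, k]             # r(n+1) = r(n) - <r(n),h_n> h_n
    return np.array(idx), np.array(a), r
```

Because each step removes the projection of the residual onto the selected unit-norm atom, the residual norm is non-increasing, as the criterion guarantees.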



complexities than MP, but can produce data models that are more sparse.

Sparse approximation is data-adaptive and can produce parametric and multiscale models having features that function more like mid-level objects than low-level projections onto sets of vectors [9,19,38,8,22,27,32,34,43]. These aspects make sparse approximation a compelling complement to state-of-the-art approaches for and applications of comparing audio signals based upon, e.g., monoresolution cepstral and redundant time–frequency representations, such as fingerprinting [42], cover song identification [26,10,3,35], content segmentation, indexing, search and retrieval [16,4], and artist or genre classification [39].

In the literature we find some existing approaches to working with audio signals in a sparse domain. Features built from sparse approximations can provide competitive descriptors for music information retrieval tasks, such as beat tracking, chord recognition, and genre classification [40,32]. Sparse representation classifiers have been applied to music genre recognition [27,5], and robust speech recognition [12]. Parameters of sparse models can be compared using histograms to find similar sounds in acoustic environment recordings [7,8], or atoms can be learned to compare and group percussion sounds [34]. Biologically inspired sparse codings of correlograms of sounds can be used to learn associations between descriptive high-level keywords and audio features such that new sounds can be automatically categorized, and large collections of sounds can be queried in meaningful ways [22]. Outside the realm of audio signals, sparse approximation has been applied to face recognition [44], object recognition [29], and landmine detection [25].

In this paper, we discuss the comparison of audio signals in a sparse domain, but not specifically for fingerprinting or efficient audio indexing and search, two tasks that have been convincingly solved [42,13,16,18]. We explore the possibilities and effectiveness of comparing, atom-by-atom, audio signals modeled using sparse approximation and large overcomplete time–frequency dictionaries. Our contributions are threefold: (1) we generalize the recursive nearest-neighbor search algorithm of [14,15] to comparing subsequences; (2) we show that though sparse models of audio signals can be compared by considering pairs of atoms, the best bound so far derived [14,15] does not make a practical procedure; and (3) we show experimentally that the hierarchic comparison of audio signals in a sparse domain still provides intriguing and informative results. Overall, our work here shows that a sparse domain can facilitate comparisons of audio signals in hierarchical ways through comparing individual elements of each sparse data model, organized roughly in order of importance.

In the next two sections, we discuss and elaborate upon a recursive method of nearest neighbor search in a sparse domain [14,15]. We extend this method to comparing subsequences, and examine the practicality of probabilistic bounds on the distances between neighbors. In the fourth section, we describe several experiments in which we compare a variety of audio signals through their sparse models. We conclude with a discussion about the results and several future directions.

2. Nearest neighbor search by recursion in a sparse domain

Consider a set of signals

Y ≜ {y_i ∈ R^K : ‖y_i‖ = 1}_{i∈I}   (3)

where I = {1, 2, …} indexes this set, and a query signal x_q ∈ R^K, ‖x_q‖ = 1. Assume that we have generated sparse approximations for all of these signals, Ŷ ≜ {{H_i(n_i), a_i(n_i), r_i(n_i)} : y_i = H_i(n_i) a_i(n_i) + r_i(n_i)}_{i∈I}, using a dictionary D that spans the space R^K, and giving the n_q-order representation {H_q(n_q), a_q(n_q), r_q(n_q)} for x_q. Since D spans R^K, D is complete, and any signal in R^K is compressible in D, meaning that we can order the representation weights in a_i(n_i) or a_q(n_q) in terms of decreasing magnitude, i.e.,

0 < |[a_i(n_i)]_{m+1}| ≤ |[a_i(n_i)]_m| ≤ C m^{−γ},  m = 1, 2, …, n_i − 1   (4)

for n_i arbitrarily large, with C > 0, and where [a]_m is the m-th element of the column vector a. This can be seen in the magnitude representation weights in Fig. 1, which are weights of sparse representations of piano notes, described in Section 4.1. With MP and a complete dictionary, we are guaranteed γ > 0 because ‖r(n+1)‖₂ < ‖r(n)‖₂ for all n [24].

Fig. 1. Gray lines show decays of representation weight magnitudes as a function of approximation order k for several decompositions of unit-norm signals (4-s recordings of single piano notes described in Section 4.1). Thick black line shows a global compressibility bound with its parameters.

Consider the Euclidean distance between two signals of the same dimension which, for unit-norm signals, is equivalent to the cosine distance. Thus, with respect to this distance, the y_i ∈ Y nearest to x_q is given by solving

min_{i∈I} ‖y_i − x_q‖ ≡ max_{i∈I} ⟨x_q, y_i⟩   (5)

We can express this inner product in terms of the sparse representations:

⟨x_q, y_i⟩ = ⟨H_q(n_q) a_q(n_q) + r_q(n_q), H_i(n_i) a_i(n_i) + r_i(n_i)⟩ = a_i^T(n_i) H_i^T(n_i) H_q(n_q) a_q(n_q) + O(r_q, r_i)   (6)

With a complete dictionary we can make O(r_q, r_i) negligible by choosing ε arbitrarily small, so we can

express (5) as

max_{i∈I} ⟨x_q, y_i⟩ ≈ max_{i∈I} a_i^T(n_i) G_iq a_q(n_q) = max_{i∈I} Σ_{m=1}^{n_i} Σ_{l=1}^{n_q} [A_iq ∘ G_iq]_{ml}   (7)

where [B ∘ C]_{ml} = [B]_{ml} [C]_{ml} is the Hadamard, or entry-wise, product, [B]_{ml} is the element of B in the m-th row and l-th column, G_iq ≜ H_i^T(n_i) H_q(n_q) is an n_i × n_q matrix with elements from the Gramian of the dictionary, i.e., G ≜ D^T D, and finally we define the outer product of the weights

A_iq ≜ a_i(n_i) a_q^T(n_q).   (8)
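The equivalence in (7) between the bilinear form and the entry-wise sum can be checked in a few lines of NumPy. This is an illustrative sketch with our own variable names, not code from the paper:

```python
import numpy as np

def sparse_inner_product(a_i, H_i, a_q, H_q):
    """<x_q, y_i> approximated per (7): the sum of the entrywise
    (Hadamard) product of A_iq = a_i a_q^T (8) and the cross-Gramian
    G_iq = H_i^T H_q, ignoring the residual term O(r_q, r_i)."""
    A = np.outer(a_i, a_q)      # outer product of the weights, (8)
    G = H_i.T @ H_q             # cross-Gramian of the two models
    return float(np.sum(A * G))
```

Since Σ_{m,l} [A ∘ G]_{ml} = a_i^T G a_q, this equals the inner product of the two model reconstructions exactly.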

2.1. Recursive search limited by bounds

Since we expect the decay of the magnitude of elements in A_iq ∘ G_iq to be fastest in diagonal directions by (4), we define a recursive sum along the M anti-diagonals starting at the top left:

S_iq(M) ≜ S_iq(M−1) + Σ_{m=1}^{M} [A_iq ∘ G_iq]_{m, M−m+1}   (9)

for M = 2, 3, …, min(n_i, n_q), and setting S_iq(1) = [A_iq ∘ G_iq]_{11}. With this we can express the argument of (7) as

⟨x_q, y_i⟩ ≈ Σ_{m=1}^{n_i} Σ_{l=1}^{n_q} [A_iq ∘ G_iq]_{ml} = S_iq(M) + R(M)   (10)

where at step M we are comparing M additional pairs of atoms to those considered in the previous steps, and R(M) is a remainder that we will bound. The total number of atom pairs contributing to S_iq(M) (9) is

P(M) ≜ Σ_{m=1}^{M} m = M(M+1)/2.   (11)

The approach taken by Jost et al. [14,15] to find the nearest neighbors of x_q in Y bounds the remainder R(M) by compressibility (4). Assuming we have a positive upper bound on the remainder, i.e., R(M) ≤ R̃(M), we know lower and upper bounds on the cosine distance, L_iq(M) ≤ ⟨x_q, y_i⟩ ≤ U_iq(M), where

L_iq(M) ≜ S_iq(M) − R̃(M)   (12)
U_iq(M) ≜ S_iq(M) + R̃(M)   (13)

Finding elements of Y close to x_q with respect to (5) can be done recursively over the approximation order M. For a given M, we find {S_iq(M)}_{i∈I}, compute the remainder bound R̃(M), and eliminate signals that are not sufficiently close to x_q with respect to their cosine distance by comparing the bounds. This approach is similar to hierarchical ones, e.g., [21], where the features become more discriminable as the search process runs. (Also note that compressibility is similar to the argument made in justifying the truncation of Fourier series in early work on similarity search [1,11,30], i.e., that power spectral densities of many time-series decay like O(|f|^{−b}) with b > 1.)

Starting with M = 1, we compute the sets {L_iq(1)}_{i∈I} and {U_iq(1)}_{i∈I}, that is, the first-order upper and lower bounds of the set of distances of x_q from all signals in Y. Then we find the index of the largest lower bound, i_max = arg max_{i∈I} L_iq(1), and reduce the search space to I_1 ≜ {i ∈ I : U_iq(1) ≥ L_{i_max q}(1)}, since all other data have a lesser upper bound on their inner product with x_q than the greatest lower bound in the set. For the next step, we compute the sets {L_iq(2)}_{i∈I_1} and {U_iq(2)}_{i∈I_1}, find the index of the maximum, i_max = arg max_{i∈I_1} L_iq(2), and construct the reduced set I_2 ≜ {i ∈ I_1 : U_iq(2) ≥ L_{i_max q}(2)}. Continuing in this way, we find the elements of Y closest to x_q at each M with respect to the cosine distance by recursing into the sparse approximations of the signals.

2.2. Bounding the remainder

To reduce the search space quickly we desire that (12) and (13) converge quickly to the neighborhood of ⟨x_q, y_i⟩, or in other words, that the bounds on the remainder quickly become discriminative. Jost et al. [14,15] derive three different bounds on R(M). From the weakest to the strongest, these are:

1. [G_iq]_{ml} = 1 (worst case scenario, but impossible for n > 1):

   R(M) ≤ C² (‖c_γ(M)‖₁ + ‖d_γ‖₁)   (14)

2. [G_iq]_{ml} iid Bernoulli(0.5), Ω = {−1, 1} (impossible for n > 1):

   R(M) ≤ C² [ln(4/p)]^{1/2} (‖c_γ(M)‖₂² + ‖d_γ‖₂²)^{1/2}   (15)

3. [G_iq]_{ml} iid Uniform, Ω = [−1, 1]:

   R(M) ≤ C² (2/3)^{1/2} Erf^{−1}(p) (‖c_γ(M)‖₂² + ‖d_γ‖₂²)^{1/2}   (16)

with probability 0 ≤ p ≤ 1, where we define the following vectors for n ≜ min(n_i, n_q) and M = 2, …, n:

c_γ(M) ≜ {[l(m−l+1)]^{−γ} : m = M+1, …, n; l = 1, …, m}   (17)
d_γ ≜ {[l(n−m+1)]^{−γ} : m = 1, …, n−1; l = m+1, …, n}   (18)

Appendix A gives derivations of these bounds, as well as the efficient computation of (16) for the special case of γ = 0.5. The parameters (C, γ) describe the compressibility of the signals in the dictionary (4). The bounds of (15) and (16) are much more discriminative than (14) because they involve an ℓ2-norm, at the price of uncertainty in the bound. The bound in (16) is attractive because we can tune it with the parameter p, which is the probability that the remainder will not exceed the bound. Fig. 2 shows bounds based on (16) for several pairs of compressibility parameters for the dataset used to produce Fig. 1.

2.3. Estimating the compressibility parameters

The bounds (14)–(16), and consequently the number of atom pairs we must consider before the bounds
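The anti-diagonal recursion (9) and the pruning step of Section 2.1 can be sketched as follows. This is a toy illustration under the definitions above; the remainder bound R̃ is taken as given, and all names are ours:

```python
import numpy as np

def antidiagonal_sums(A, G, M_max):
    """S_iq(M) of (9): cumulative sums of the Hadamard product
    A o G over the first M anti-diagonals, top-left entry first."""
    P = A * G
    out, s = [], 0.0
    for M in range(1, M_max + 1):       # M-th anti-diagonal
        for m in range(1, M + 1):       # entries (m, M - m + 1)
            s += P[m - 1, M - m]
        out.append(s)
    return out

def prune(S, R_tilde):
    """One pruning step: keep candidates whose upper bound
    S_i + R~ reaches the largest lower bound max_i (S_i - R~)."""
    L_max = max(s - R_tilde for s in S.values())
    return {i for i, s in S.items() if s + R_tilde >= L_max}
```

Iterating `prune` over increasing M, with a shrinking R̃(M), reproduces the search-space reduction I ⊇ I_1 ⊇ I_2 ⊇ … described above.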

become discriminable, depend on the compressibility parameters (C, γ), which themselves depend in a complex way on the signal, the dictionary, and the method of sparse approximation. Fig. 3 shows the error surface, feasible set, and the optimal parameters for the dataset used to produce Fig. 1. We describe our parameter estimation procedure in Appendix B. The resulting bound is shown in black in Fig. 1. These compressibility parameters also agree with those seen in Fig. 2.

Fig. 2. Estimated remainder (assuming unit-norm signals) using bound in (16) with p = 0.2 (probability that remainder does not exceed bound) and n = 100 (number of elements in each sparse model), as a function of the number of atom pairs already considered, for several pairs of compressibility parameters (C, γ) estimated from the dataset used to produce Fig. 1.

Fig. 3. Error surface as a function of the compressibility parameters for the dataset used to produce Fig. 1, with the feasible set shaded at top left, and optimal parameters marked by a circle.

3. Subsequence nearest neighbor search in a localized sparse domain

The recursive nearest neighbor search so far described has the obvious limitation that it cannot be applied to comparing subsequences of large data vectors, as is natural for comparing audio signals. Thus, we must adapt its structure to work for comparing subsequences in a set of data

Y ≜ {y_i ∈ R^{N_i} : N_i ≥ K}_{i∈I}   (19)

(note that now we do not restrict the norms of these signals). We can create from the elements of Y a new set of all subsequences having the same length as a K-dimensional query x_q (K < N_i):

Y_K ≜ {P_t y_i / ‖P_t y_i‖ : t ∈ T_i = {1, 2, …, N_i − K + 1}, y_i ∈ Y}   (20)

where P_t extracts a K-length subsequence of y_i starting at time-index t (it is an identity matrix of size K starting at column t in a K × N_i matrix of zeros). The set T_i contains the times at which we create length-K subsequences from y_i. If we decompose each of these by sparse approximation, then we can use the framework in the previous section. However, sparse approximation is an expensive operation that we want to do only once for the entire signal, and independently of the length of x_q.

To address this problem, we instead approximate each element in Y_K by building local sparse representations from the global sparse approximation of each y_i, and then calculating their distance to x_q using the framework in the previous section. From here on we consider only the K-length subsequences of a single element y_i ∈ Y without loss of generality (i.e., all other elements of Y can be included as subsequences). Toward this end, consider that we have decomposed the N_i-length signal y_i using a complete dictionary to produce the representation {H_i(n_i), a_i(n_i), r_i(n_i)}. From this we construct the local sparse representations of y_i:

Ŷ_K ≜ {{P_t H_i(n_i), ξ_t a_i(n_i), P_t r_i(n_i)} : t ∈ T_i}   (21)

where the time partition T_i is the set of all times at which we extract a K-length subsequence from y_i, and ξ_t is set such that ‖ξ_t P_t y_i‖ = 1, i.e., each length-K subsequence is unit-norm. For each K-dimensional subsequence,

(7) now becomes

max_{t∈T_i} ⟨x_q, P_t y_i⟩ = max_{t∈T_i} ⟨H_q(n_q) a_q(n_q), ξ_t P_t H_i(n_i) a_i(n_i)⟩ + O(r_q, r_i)
 ≈ max_{t∈T_i} ξ_t a_i^T(n_i) H_i^T(n_i) P_t^T H_q(n_q) a_q(n_q)
 = max_{t∈T_i} ξ_t Σ_{m=1}^{n_i} Σ_{l=1}^{n_q} [A_iq ∘ G_iq(t)]_{ml}   (22)

where A_iq is defined in (8), we define the time-localized Gramian

G_iq(t) ≜ H_i^T(n_i) P_t^T H_q(n_q)   (23)

and we have excluded the terms involving the residuals because we can make them arbitrarily small.
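Given (22)–(23), scoring a single subsequence position t reduces to a windowed matrix product: P_t H_i is simply rows t through t+K−1 of the long-signal atom matrix. A minimal sketch with our own variable names, assuming 0-indexed time and ξ_t supplied by the caller:

```python
import numpy as np

def localized_score(a_i, H_i, a_q, H_q, t, K, xi_t):
    """xi_t * a_i^T H_i^T P_t^T H_q a_q, as in (22). H_q holds the
    query atoms in R^K; H_i holds the long-signal atoms in R^{N_i},
    and P_t extracts samples t..t+K-1 of them."""
    G_t = H_i[t:t + K, :].T @ H_q       # time-localized Gramian, (23)
    return xi_t * float(a_i @ (G_t @ a_q))
```

Sweeping t over a (possibly coarse) time partition T_i gives the argument of the max in (22) at each candidate position.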

3.1. Estimating the localized energy

The only thing left to do is to find an expression for ξ_t so that each subsequence is comparable with the others with respect to the cosine distance. We assume that the localized energy can be approximated from the local sparse representation in the following way, assuming ‖P_t y_i‖ > 0:

ξ_t = ‖P_t y_i‖^{−1} ≈ [a_i^T(n_i) H_i^T(n_i) P_t^T P_t H_i(n_i) a_i(n_i)]^{−1/2} ≈ (Σ_{j=1}^{n_t} w_j a_j²)^{−1/2}   (24)

where the n_t weights a_j ∈ {[a_i(n_i)]_m : [H_i^T(n_i) P_t^T P_t H_i(n_i)]_{ml} ≠ 0, 1 ≤ m, l ≤ n_i} are those associated with atoms having support in [t, t+K), and w_j is defined to weigh the contribution of a_j² to the localized energy estimate. We set ξ_t = 0 if Σ_{j=1}^{n_t} a_j² = 0.

If all atoms contributing to the subsequence have their entire support in [t, t+K), and are orthonormal, then we can set each w_j = 1. This does not hold for subsequences of a signal decomposed using an overcomplete dictionary, as shown by Fig. 4. For much of the time we see Σ_{j=1}^{n_t} a_j² ≥ ‖P_t y_i‖², which means our localized estimate of the segment energy is greater than its real value. This will make ξ_t, and consequently (22), smaller.

Instead, we make a more reasonable estimate of ‖P_t y_i‖ by accounting for the fact that atoms can have support outside [t, t+K). For instance, if an atom has some fraction of its support in the subsequence, we multiply its weight by that fraction. We thus weigh the contribution of the j-th atom to the subsequence norm using

w_j = { 1,  if u_j ≥ t and u_j + s_j ≤ t + K;
        (K/s_j)²,  if u_j < t and u_j + s_j ≥ t + K;
        (u_j + s_j − t)²/s_j²,  if u_j < t and t < u_j + s_j ≤ t + K;
        (t + K − u_j)²/s_j²,  if t ≤ u_j < t + K and u_j + s_j > t + K }   (25)

where u_j and s_j are the position and scale, respectively, of the atom associated with the weight a_j. In other words, if an atom is completely in [t, t+K), it contributes all of its energy to the approximation; otherwise, it contributes only a fraction based on how its support intersects [t, t+K). With this we are now slightly underestimating the localized energies, as seen in Fig. 4. In both of these cases for {w_j}, however, we can assume by the energy conservation of MP [24] that as the subsequence length becomes larger, our error in estimating the subsequence energy goes to zero, i.e.,

lim_{K→N_i} [ ‖P_t y_i‖² − Σ_{j=1}^{n_t} w_j a_j² ] = ‖P_t r_i(n_i)‖²   (26)

With a complete dictionary, we can make the right hand side zero. Significant departures from the energy estimate of subsequences can be due to the interactions between atoms [37].

Fig. 4. Short-term energy ratios, log10[Σ_{j=1}^{n_t} w_j a_j² / ‖P_t y_i‖²], over 1-s windows (hopped 125 ms) for MP decompositions using 8xMDCT [31] to a global residual energy 30 dB below the initial signal energy. Arrow points to line (top, gray) using weighting w_j = 1. The other line (bottom, black) uses (25). Data in (a) are described in Section 4.2; data in (b) are described in Section 4.3. (a) Six speech signals (0–20 s), music excerpt (21–23 s), realization of GWN (24–27 s) and (b) music orchestra.

3.2. Recursive subsequence search limited by bounds

Now, similar to (9) and (10), we can say

⟨x_q, P_t y_i⟩ ≈ ξ_t Σ_{m=1}^{n_i} Σ_{l=1}^{n_q} [A_iq ∘ G_iq(t)]_{ml} = S_iq(t, M) + R(t, M)   (27)

where, for M = 2, 3, …, min(n_i, n_q), and with S_iq(t, 1) = ξ_t [A_iq ∘ G_iq(t)]_{11},

S_iq(t, M) ≜ S_iq(t, M−1) + ξ_t Σ_{m=1}^{M} [A_iq ∘ G_iq(t)]_{m, M−m+1}   (28)

The problem of finding the subsequence closest to x_q with respect to the cosine distance can now be solved iteratively over M by bounding each remainder R(t, M) using (14), (15), or (16), and the method presented in Section 2.1. Furthermore, we can compare only a subset of all possible subsequences by using a coarse time partition T_i.

3.3. Practicality of the bounds for audio signals

The experiments by Jost et al. [14,15] use small images (128 pixels square) and orthogonal wavelet decompositions, which do not translate to audio signals decomposed over redundant time–frequency dictionaries. Jost et al. [14,15] do not state the compressibility parameters they use, but for the high-dimensional audio signals with which we work in this paper it is not unusual to have γ = 0.5 when using MP and highly overcomplete dictionaries. We find that decomposing 4-s segments of music signals (single channel, 44.1 kHz sample rate, representation weights shown in Fig. 1) using the dictionary in Table 1 requires on average 2375 atoms to reduce the residual energy 20 dB below the initial signal energy. Thus, for the bound (16) using n = 2375 atoms, and with the parameters (C, γ) = (0.4785, 0.5) (in the feasible set), Fig. 5 clearly shows that in order to have any discriminable bound (say ±0.2 for unit-norm signals) we must either select a low value for p, in which case we are assuming the first atom comparison is approximately the cosine distance, or we must make over a million pairwise comparisons. This is not practical for signals of large dimension, and dictionaries containing billions of time–frequency atoms.
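To see how slowly such remainder bounds shrink with M, one can evaluate them directly. The sketch below uses our reconstruction of the Bernoulli-case bound (15) and of the vectors (17)–(18); the constants follow our reading of the text, so treat this as illustrative rather than the authors' computation:

```python
import math

def norms_sq(M, n, gamma):
    """||c_gamma(M)||_2^2 and ||d_gamma||_2^2 per our reading of
    (17)-(18): terms [l(m-l+1)]^(-gamma) and [l(n-m+1)]^(-gamma)."""
    c = sum((l * (m - l + 1)) ** (-2.0 * gamma)
            for m in range(M + 1, n + 1) for l in range(1, m + 1))
    d = sum((l * (n - m + 1)) ** (-2.0 * gamma)
            for m in range(1, n) for l in range(m + 1, n + 1))
    return c, d

def remainder_bound(M, n, gamma, C, p):
    """Sketch of (15): R(M) <= C^2 sqrt(ln(4/p)) (c + d)^(1/2)."""
    c, d = norms_sq(M, n, gamma)
    return C ** 2 * math.sqrt(math.log(4.0 / p)) * math.sqrt(c + d)
```

Because the d_γ term does not depend on M, the bound decreases only through ‖c_γ(M)‖₂², which is one reason it remains non-discriminative until many atom pairs have been compared.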

Table 1
Time–frequency dictionary parameters (44.1 kHz sampling rate): atom scale s, time resolution Δu, and frequency resolution Δf. Finer frequency resolution for small-scale atoms is achieved with interpolation by zero-padding.

s (samples/ms)    Δu (samples/ms)    Δf (Hz)
128/3             32/0.7             43.1
256/6             64/2               43.1
512/12            128/3              43.1
1024/23           256/6              43.1
2048/46           512/12             21.5
4096/93           1024/23            10.8
8192/186          2048/46            5.4
16,384/372        4096/93            2.7
32,768/743        8192/186           1.3

Fig. 5. Estimated remainder (assuming unit-norm signals) as a function of the number of atom pairs already considered for the dataset used to produce Fig. 1. Gray: bound in (15). Black, numbered: bound in (16) for several labeled p (probability that remainder does not exceed bound) with n = 2375 (number of elements in each sparse model), and (C, γ) = (0.4785, 0.5).

There is no possibility of tabulating the dictionary Gramian for quick lookup of atom pair correlations; and the cost of looking up atoms in million-atom decompositions is expensive as well. It is clear then that the tightest bound given in (16) is not practical for efficiently discriminating distances between audio signals, with respect to their cosine distance (5), decomposed by MP and time–frequency dictionaries.

Though approximate nearest neighbor subsequence search of sparsely approximated audio signals with the bound (16) is impractical, we have found that approximating the cosine distance in a sparse domain has some intriguing behaviors. We now present several experiments where we compare different types of audio data in a sparse domain under a variety of conditions.

4. Experiments in comparing audio signals in a sparse domain

  • All signals are single channel, and have a sampling rate of44.1 kHz. We decompose each by MP [17] using either thedictionary in Table 1, or the 8xMDCT dictionary [31].

    4.1. Experiment 1: comparing piano notes

In this experiment, we look at how well low-order sparse approximations of sampled piano notes embody their harmonic characteristics by comparing them using the methods presented in Section 2. The data in set A are 68 notes (chromatically spanning A0 to G#6) played on a real and somewhat in-tune piano; and the data in set B are 39 notes (roughly a C major scale, C0 to D6) played on a real and very out-of-tune piano under very poor recording conditions. We truncate all signals to have a dimension of 176,400 (4 s), and decompose each by MP [17] over a redundant dictionary of time–frequency Gabor atoms, the parameters of which are summarized in Table 1. We stop each decomposition once its residual energy drops 40 dB below the initial energy. We normalize the weights of each model by the square root of the energy of the respective signal. We do not align the time-domain signals such that the note onsets occur at the same time. Fig. 1 shows the ordered decays of the weights in the sparse models of data set A.

Fig. 6. |S_iq(10)| (9) for two sets of recorded piano notes. (a) and (b): Set A compared with itself in time and sparse domains (M = 10). (c) and (d): Set B (rows) compared with set A (columns) in time and sparse domains (M = 10). Elements on main diagonals in (a) and (b) are scaled by 0.25 to increase contrast of other elements. (a) A–A time-domain magnitude correlations, (b) A–A sparse-domain approximations of magnitude correlations, (c) B–A time-domain magnitude correlations and (d) B–A sparse-domain approximations of magnitude correlations.

Fig. 6(a) shows the magnitude correlations between all pairs of signals in set A evaluated in the time-domain.

The overtone series is clear as diagonal lines offset at 12, 19, 24, and 28 semitones from the main diagonal. Fig. 6(b) shows the approximated magnitude correlations (9) using only M = 10 atoms from each signal approximation (thus P(10) = 55 atom pairs). Even though the mean number of atoms in this set of models is about 7000, we still see portions of the overtone series. The diagonal in Fig. 6(b) does not have a uniform color because low notes have longer sustain times than high notes, and the sparse models thus have more time–frequency atoms with greater energies spread over the support of the signal. Fig. 6(c) shows the magnitude correlations between sets B and A evaluated in the time-domain; and Fig. 6(d) shows the magnitude correlations (9) using only M = 10 atoms from each model. In a sparse domain, we can more clearly see the relationships between the two sets because the first 10 terms of each model are most likely related to the stable harmonics of the notes, and not to the noise. We can see a diatonic scale starting around MIDI number 36, as well as the fact that the pitch scale in data set B lies somewhere in-between the semitones in data set A.

Fig. 7(a) shows the approximate magnitude correlations |S_iq(M)| (9), as well as the upper and lower bounds on the remainder using the tightest bound (16) with p = 0.2 and n = 100, for the signal A3 from set A and the rest of the set. Here we can see that the lower bound for the largest magnitude correlation exceeds the upper bound of all the rest after comparing only M = 19 atoms from each decomposition. All but five of the signals can be excluded from the search after M = 6. The four other signals having the largest approximate magnitude correlations are labeled, and are harmonically related to the signal through its overtone series. With a signal selected from set B and compared to set A, Fig. 7(b) shows that we must compare many more atoms between the models until the bounds have any discriminability. After P(M) = 1500 atom comparisons we can see that the largest magnitudes |S_iq(M)| (9) are roughly harmonically related to the detuned D5 from set B.

B.L. Sturm, L. Daudet / Signal Processing 91 (2011) 2836–2851

Fig. 7. Black: |S_iq(M)| (9) as a function of the number of atom pairs considered for the set of piano notes in A with a signal from either (a) A (note A3) or (b) B (note D5 approximately). Gray: for each S_iq(M), magnitudes of L_iq(M) (12) and U_iq(M) (13) using the bound in (16) with p = 0.2 (probability that remainder does not exceed bound) and n = 100 (number of elements in each sparse model). Largest magnitude correlations are labeled. Note differences in axes. (a) Signal A3 from A with (C, γ) = (0.78, 0.60) and (b) signal D5 from B with (C, γ) = (1.17, 0.70).

As a final part of this experiment, we look at the effects of comparing atoms with parameters within some subset. As done in Fig. 6(d), we compare the sparse approximations of two different sets of piano notes, but here we only consider those atoms that have scales greater than 186 ms. This in effect means that we look for signals that share the same long-term time–frequency behaviors. The resulting |S_iq(10)| (9) is shown in Fig. 8. We see correlations between the notes much more clearly compared with Fig. 6(d). Removing the short-term phenomena improves tonal-level comparisons between the signals because non-overlapping yet energetic short atoms are replaced by atoms representative of the note harmonics.
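Restricting the comparison to long-scale atoms amounts to filtering each sparse model by atom duration before any pairwise comparison. A minimal sketch, where the dictionary-style storage of atoms as dicts with a 'scale' in samples is our assumption:

```python
def filter_by_scale(atoms, min_dur_s, fs):
    """Keep only the atoms of a sparse model whose duration is at least
    min_dur_s seconds; `atoms` is a hypothetical list of dicts holding a
    'scale' in samples (this storage format is assumed, not the paper's)."""
    return [a for a in atoms if a["scale"] / float(fs) >= min_dur_s]
```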

    4.2. Experiment 2: comparing speech signals

In this experiment, we test how efficiently using (28) we can find in a speech signal the time from which we extract some x_q. We also test how distortion in the query affects these results. We make a signal by combining six segments of speech, a short music segment, and white noise, shown in Fig. 4(a). The six speech segments are the same phrase spoken by three females and three males: "Cottage cheese with chives is delicious." We extract from one of these speech signals the word "cheese" to create x_q with a duration of 603 ms, shown at top in Fig. 9. We decompose this signal using MP and the 8xMDCT dictionary [31].
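The MP decomposition used here follows the update rule (1) and selection criterion (2) from the introduction. A minimal sketch over a generic unit-norm dictionary (the dictionary below is a stand-in, not the 8xMDCT of [31]):

```python
import numpy as np

def matching_pursuit(x, D, n_atoms):
    """Greedy MP: at each step select the atom most correlated with the
    residual, per (2), and subtract its projection, per (1). Columns of D
    are assumed unit-norm."""
    r = x.astype(float).copy()
    idx, weights = [], []
    for _ in range(n_atoms):
        c = D.T @ r                    # correlations <r(n), d>
        k = int(np.argmax(np.abs(c)))  # atom selection criterion (2)
        idx.append(k)
        weights.append(float(c[k]))
        r = r - c[k] * D[:, k]         # residual update (1)
    return idx, weights, r
```

Each iteration can only decrease the residual energy, consistent with the guarantee ‖r(n+1)‖ ≤ ‖r(n)‖ cited in the introduction.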

We distort the query in two ways: with additive WGN (AWGN), and with an interfering sound having a high correlation with the dictionary. In the first case, shown in the middle of Fig. 9, the signal x_q' = (a x_q + n)/‖a x_q + n‖ is the original x_q distorted by a unit-norm AWGN signal n. We set a = 0.3162 such that 10 log₁₀(‖a x_q‖²/‖n‖²) = 20 log₁₀|a| = −10 dB.

Fig. 8. |S_iq(10)| (9) for two sets of recorded piano notes in a sparse domain using only the atoms with duration at least 186 ms. Compare with Fig. 6(d).

Fig. 9. Log spectrograms of the query signals with which we search. Top: query of male saying "cheese". Middle: query distorted with additive white Gaussian noise (AWGN) with SNR 10 dB. Bottom: query distorted with interfering crow sound with SNR 5 dB.

For this signal, we find the following statistics from 10,000 realizations of the AWGN signal: E|⟨x_q, n⟩| ≈ 1 × 10⁻⁵, Var|⟨x_q, n⟩| ≈ 4 × 10⁻⁶. We also find the following statistics for the 8xMDCT

dictionary: E max_{d∈D} |⟨n, d⟩| ≈ 5 × 10⁻⁴, Var max_{d∈D} |⟨n, d⟩| ≈ 2 × 10⁻⁵. Thus, the noise signal is not well correlated either with the original signal or the 8xMDCT dictionary. In the second case, shown at the bottom of Fig. 9, we distort the signal by adding the sound of a crow c so that x_q' = (a x_q + c)/‖a x_q + c‖₂ with ‖c‖ = 1. Here, we set a = 0.5623 given by 20 log₁₀|a| = −5 dB. For this interfering signal, we find that |⟨x_q, c⟩| ≈ 2 × 10⁻³, but max_{d∈D} |⟨c, d⟩| ≈ 0.21, which is higher than max_{d∈D} |⟨x_q, d⟩| ≈ 0.17. In this case, unlike for the AWGN interference, it is likely that the sparse approximation of the signal with the crow interference will have atoms in its low-order model due to the crow and not the speech. We do not expect the AWGN interference to be a part of the signal model created by MP until much later iterations.
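The scalings a = 0.3162 and a = 0.5623 follow directly from the stated SNRs. A small helper reproducing this mixing (the function name is ours):

```python
import numpy as np

def mix_at_snr(x, interference, snr_db):
    """Mix a unit-norm query x with a unit-norm interference so that
    20*log10|a| = snr_db, then renormalize: x' = (a*x + i)/||a*x + i||."""
    a = 10.0 ** (snr_db / 20.0)
    y = a * x + interference
    return y / np.linalg.norm(y), a
```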

Fig. 10 shows |S_iq(t, M)| (28) aligned with the original signal for four values of M using the sparse approximations of the clean and distorted signals. We plot at the rear of these figures the localized magnitude time-domain correlation of the windowed and normalized signal with the query x_q. In Fig. 10(a), using the clean x_q, we clearly see its position even when using a single atom pair for each 100 ms partition of the time-domain. We see the same behavior in Fig. 10(b), (c) for the two distorted signals, but in the case where the crow sound interferes we find the query for M ≥ 2, or with at least three atom pair comparisons. The first atom of the decomposed query with the crow is modeling the crow and not the content of interest, and so we must increase the order of the model to find the location of x_q. As we increase the number of pairs considered we also find other segments that point in the same direction as x_q. Table 2 gives the times and content of the ten largest values in |S_iq(t, 10)|. For the clean and AWGN-distorted x_q, "cheese" appears five of the six times it exists in the original signal. Curiously, these same five instances are the five largest magnitude correlations when x_q has the crow interference.
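The localized comparison |S_iq(t, M)| partitions the signal's atoms in time and compares only the few largest-weight atoms in each partition against those of the query. A rough sketch, where the atom format and the frequency-proximity proxy standing in for the true atom-pair inner product are our simplifications:

```python
def localized_scores(query_atoms, signal_atoms, t_step, t_max, M):
    """Score each time partition [t, t + t_step) by comparing the M
    largest-weight atoms of the partition with the M largest-weight atoms
    of the query. Atoms are hypothetical dicts with 'time' (s), 'freq' (Hz)
    and 'weight'; the frequency-proximity test below is a crude stand-in
    for the true atom-pair inner product."""
    q = sorted(query_atoms, key=lambda a: -abs(a["weight"]))[:M]
    scores = {}
    t = 0.0
    while t < t_max:
        local = [a for a in signal_atoms if t <= a["time"] < t + t_step]
        local = sorted(local, key=lambda a: -abs(a["weight"]))[:M]
        s = 0.0
        for aq in q:
            for al in local:
                if abs(aq["freq"] - al["freq"]) < 50.0:  # assumed tolerance
                    s += abs(aq["weight"] * al["weight"])
        scores[round(t, 6)] = s
        t += t_step
    return scores
```

Ranking the partitions by score then gives candidate locations for the query, in the spirit of the experiment above.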

We perform the same test as above but using a much longer speech signal (about 21 minutes in length) excerpted from a book-on-CD, The Old Man and the Sea by Ernest Hemingway, read aloud by a single person. From this signal we create several queries x_q, from words

Fig. 10. |S_iq(M, t)| (28) as a function of time and the number of atoms M (labeled at right) considered from each representation for each localized sparse approximation. Localized magnitude correlation of each signal with query is shown by the thin black line in front of the gray time-domain signal at rear. (a) Clean signal, (b) signal with AWGN at −10 dB energy and (c) signal with crow signal at −5 dB energy.

to sentences to an entire paragraph of duration 35 s. We decompose each signal over the dictionary in Table 1 until the global residual energy is 20 dB below the initial energy. The approximation of the entire 21 min signal has 1,004,001 atoms selected from a dictionary containing 2,194,730,297 atoms.

One x_q we extract from the signal is the spoken phrase "the old man said" (861 ms in length). This phrase appears 26 times in the long excerpt. We evaluate |S_iq(t, M)| (28) every 116 ms, and find the time at which x_q originally appears using only M = 1 atom pair comparisons for each time partition. The next highest ranked positions have values of 75% and 67% that of the largest |S_iq(t, 1)|. When M = 50, the values of the second and third largest values |S_iq(t, 50)| drop to 62% and 61% that of the largest value. In the top 30 ranked subsequences for M = 5 we find only one of the other 25 appearances of "the old man said" (rank 26); but we also find "the old man agreed" (rank 11), and "the old man carried" (rank 16). All other results have minimal content similarity to the signal, but have time–frequency overlap in parts of the atoms of each model.

    We perform the same test with a sentence extracted

Table 2
Times, values and signal content for first 10 peaks in |S_iq(t, 10)| (P(10) = 55) in Figs. 10(a)–(c). Highest-rated distances in each (bold) point to the origin of the signal.

#  | Clean signal: t (s), |S_iq|, Content | Signal + WGN: t (s), |S_iq|, Content | Signal + Crow: t (s), |S_iq|, Content
1  | 10.0, 0.798, cheese    | 10.0, 0.236, cheese    | 10.0, 0.409, cheese
2  | 13.6, 0.199, cheese    | 13.6, 0.080, cheese    | 13.6, 0.060, cheese
3  | 11.3, 0.153, -ives is- | 15.1, 0.051, delicious | 16.9, 0.030, cheese
4  | 16.9, 0.149, cheese    | 11.3, 0.045, -ives is- |  6.9, 0.012, cheese
5  | 15.1, 0.141, delicious | 16.9, 0.042, cheese    |  1.3, 0.011, cheese
6  | 18.3, 0.076, delicious | 18.3, 0.028, delicious | 18.3, 0.010, delicious
7  |  1.3, 0.057, cheese    |  8.1, 0.014, delicious | 13.2, 0.010, cottage
8  |  8.1, 0.035, delicious | 12.0, 0.012, -licious  | 15.1, 0.009, delicious
9  |  2.4, 0.026, delicious |  5.2, 0.011, delicious | 16.0, 0.004, cott-
10 |  6.9, 0.024, cheese    |  6.8, 0.010, cheese    | 22.8, 0.003, WGN

from the signal, "They were as old as erosions in a fishless desert" (2.87 s), which only appears once. No matter the M ∈ [1, 50] we use, the origin of the excerpt remains at a rank of 6 with a value |S_iq(t, 50)| at 67.5% that of the highest rank subsequence. We find that if we shift the time partition forward by 11.6 ms its ranking jumps to first, with the second ranked subsequence at 73%. We observe a similar effect for a query consisting of an entire paragraph (35 s). We find its origin by comparing M = 2 or more atoms from each model using a time partition of 116 ms. This result, however, disappears when we evaluate |S_iq(t, M)| using a coarser time partition of 250 ms.

    4.3. Experiment 3: comparing music signals

While the previous experiment deals with single-channel speech signals, in this experiment we make comparisons between polyphonic musical signals excerpted from a commercial recording of the fourth movement of Symphonie Fantastique by H. Berlioz. For the query, we use a 10.31 s segment of the third appearance of the A theme of the movement (bars 33–39, located around 13–22 s in Fig. 4(b)). Fig. 11 shows the sonogram and time–frequency tiles of the model of x_q using the 50 atoms with the largest magnitude weights selected from the 8xMDCT dictionary [31]. We add no interfering signals as we do in the previous experiment.

Fig. 12(a) shows |S_iq(t, M)| (28) over the first minute of the original signal, for three values of M, including M = 50, the time–frequency representation of which is shown at the bottom of Fig. 11. For |S_iq(t, 50)| we can see a strong spike located around 13 s corresponding with the query, but we also see spikes at about 2 s and around 43 s. The former set of spikes corresponds with the second appearance of the A theme, when only low bowed strings are playing the theme in G minor. This is quite similar to the instrumentation of the query: low bowed strings and a legato bassoon in counterpoint in E♭ major. The latter set of spikes is around the end of the fifth appearance of the theme, which is played in G minor on low pizzicato strings with a staccato bassoon. For M = 10, we see a conspicuous spike at the time of the fifth appearance

around 34 s, as well as of the fourth appearance around 24 s, where the theme is played in E♭ major like the query. Finally, we test how the sparse approximation of this query compares with subsequences from a different recording of this movement, which is also in a different tempo. Fig. 12(b) shows |S_iq(t, M)| (28) for three different values of M. We see high similarity with the first and second appearances of the main theme, but not the third, which is what the query contains.

    4.4. Discussion

There is no reason to believe that a robust and accurate speech or melody recognition system can be created by comparing only the first few elements of greedy decompositions in time–frequency dictionaries. What appears to be occurring for the short signals, both "cheese" and "the old man said", is that the first few elements of their sparse and atomic decomposition create a prosodic

representation that is comparable to others at the atomic level. For the longer signals, such as sentences, paragraphs, and an orchestral theme, a few atoms cannot adequately embody the prosody, but we still see that by only making a few comparisons we are able to locate the excerpted signal, as long as the time partition is fine enough. This is due to the atoms of the models acting in some sense as a time–frequency fingerprint, an example of which we see in Fig. 11. Through the cosine distance, the relative time and frequency locations of the atoms in the query and subsequence are being compared, weighted

by their energies. Subsequences that share atoms in similar configurations will be gauged closer to x_q than those that do not.

Fig. 11. Polyphonic orchestral query: sonogram (top) and time–frequency tiles (bottom) of 50-order sparse approximation.

Fig. 12. |S_iq(t, M)| (28) for three values of M for the query and two different signals. Arrows mark the appearances of the A theme, and their appearance number. Magnitude correlation of query with localized and normalized signal is shown by the solid gray area in front of the black time-domain signal at rear. (a) Query compared with original signal and (b) query compared with different interpretation.

By using the cosine distance it is not unexpected that (28) will be extremely sensitive to a partitioning of the time-domain. This comes directly from the definition of the time-localized Gramian (23), as well as the use of a dictionary that is not translation invariant. There is no need to partition the time axis when using a parameterized dictionary if we assume that some of the atoms in the model of x_q will have parameters that are nearly the same as some of those in the

relevant localized sparse representations. In such a scenario, we can search a sparse representation for the times at which atoms exist that are similar in scale and modulation frequency to those modeling x_q. Then we can limit our search to those particular times without considering any uniform and arbitrary partition of the time-domain. With non-linear greedy decomposition methods such as MP and time-variant dictionaries, however, such an assumption cannot be guaranteed; but its limits are not yet well-known.

5. Conclusion

In this paper, we have extended and investigated the applicability of a method of recursive nearest neighbor search [14,15] for comparing audio signals using pairwise comparisons of model elements in a sparse domain. The multiscale descriptions offered by sparse approximation over time–frequency dictionaries are especially attractive for such tasks because they provide flexibility in making comparisons between data, not to mention a capacity to deal with noise. After extending this method to the task of comparing subsequences of audio signals, we find that the strongest bound known for the remainder is too weak to quickly and efficiently reduce the search space. Our experiments show, however, that by comparing elements of sparse models we can judge with relatively few comparisons whether signals share the same time–frequency structures, and to what degrees, although this can be quite sensitive to the time-domain partitioning. We also see that we can approach such comparisons hierarchically, starting from the most energetic content to the least, or starting from the longest scale phenomenon to the shortest.

We are continuing this research in multiple directions. First, since we know that the inner product matrix G_iq(t) (23) will be very sparse for all t in time–frequency dictionaries, this motivates designing a tighter bound based on a Laplacian distribution of elements in G_iq(t) with a large probability mass exactly at zero. This bound would be much more realistic than that provided by assuming the elements of the Gramian are distributed uniform (16). Another part of the problem is of course that the sums in (9) and (28) are not such that at step M the P(M) largest magnitude values of A_iq ⊙ G_iq are actually being summed. By our assumption in (4), we know that the decay of the magnitudes of the elements in A_iq will be quickest in diagonal directions, but dependent upon the element position in the matrix. These diagonal directions are simply given by

  [∂/∂m; ∂/∂l] m^(−γ_i) l^(−γ_q) = −m^(−γ_i) l^(−γ_q) [γ_i/m; γ_q/l]   (29)

where we now recognize that the weights of two different representations can decay at different rates. With this, we can make an ordered set of index pairs by

  Λ = {(m, l)_λ : |[A_iq]_λ| ≥ |[A_iq]_{λ+1}|}, λ = 1, 2, ..., n_i n_q   (30)

and define a recursive sum for 1 < m ≤ n_i n_q

  S_iq(m) ≜ S_iq(m−1) + [A_iq ⊙ G_iq]_{Λ(m)}   (31)

setting S_iq(1) = [A_iq ⊙ G_iq]_{(1,1)}. We do not yet know the extent to which this approach can ameliorate the problems with the non-discriminating bound (16), as we have yet to design an efficient way to generate a satisfactory Λ, and estimate the bounds of the corresponding remainder, whether it is like that in (16), or another that uses the fact that G_iq(t) will be very sparse, even when x_q = y_i. We think that using a stronger bound and this indexing order will significantly reduce the number of pairwise comparisons that must be made before determining a subsequence is not close enough with respect to the cosine distance. Furthermore, we can make the elements of A_iq decay faster, and thus increase γ, by using other sparse approximation approaches, such as OMP [28,23] or CMP [36]. And we cannot forget the implications of choosing a particular dictionary. In this work, we have used two different parametric dictionaries, one of which is designed for audio signal coding [31]. Another interesting research direction is to use dictionaries better suited for content description than coding, such as content-adapted dictionaries [20,2,19].

Finally, and specifically with regards to the specific problem of similarity search in audio signals, the cosine distance between time-domain samples makes little sense because it is too sensitive to signal waveforms whereas human perception is not. Instead, many other possibilities exist for comparing sparse approximations, such as comparing low-level histograms of atom parameters [7,34]; comparing mid-level structures such as harmonics [9,38,8]; and comparing high-level patterns of short atoms representing rhythm [32]. There also exists the matching pursuit dissimilarity measure [25], where the atoms of one sparse model are used to decompose another signal, and vice versa, to see how well they model each other. We are exploring these various possibilities with regards to gauging more generally similarity in audio signals at multiple levels of specificity within a sparse domain.

Acknowledgments

B.L. Sturm performed part of this research as a Chateaubriand Post-doctoral Fellow (N. 634146B) at the Institut Jean Le Rond d'Alembert, Équipe Lutheries, Acoustique, Musique, Université Pierre et Marie Curie, Paris 6, France; as well as at the Department of Architecture, Design and Media Technology, at Aalborg University Copenhagen, Denmark. L. Daudet acknowledges partial support from the French Agence Nationale de la Recherche under contract ANR-06-JCJC-0027-01 DESAM. The authors thank the anonymous reviewers for their very detailed and helpful comments.

Appendix A. Proof of remainder bounds

To show (14), we can bound R(M) loosely by assuming the worst-case scenario of [G_iq]_{ml} = 1 for all its elements. Knowing that R(M) is the sum of the elements of the matrix A_iq ⊙ G_iq except for the first P(M) values, and assuming (4), we can say

  C^(−2) R(M) ≤ Σ_{m=M+1}^{n} Σ_{l=1}^{m} [l(m−l+1)]^(−γ) + Σ_{m=1}^{n−1} Σ_{l=m+1}^{n} [l(n−m+1)]^(−γ) = ‖c_γ(M)‖₁ + ‖d_γ‖₁   (A.1)

where c_γ(M) and d_γ are defined in (17) and (18). This worst-case scenario is not possible using MP because of its update rule (1).

We can find the tighter bound in (15) by assuming the distribution of signs of the elements of G_iq is Bernoulli equiprobable, i.e., P{[G_iq]_ml = 1} = P{[G_iq]_ml = −1} = 0.5. Thus, defining a random variable b_i : R → {−1, 1}, and its probability mass function f_B(b_i) = 0.5δ(b_i − 1) + 0.5δ(b_i + 1) using the Dirac function δ(x), we create a random vector b with n² − P(M) elements independently drawn from this distribution. Placing this into the double sums of (A.1) provides the bound

  C^(−2) R(M) ≤ b^T [c_γ(M); d_γ] ≤ ‖c_γ(M)‖₁ + ‖d_γ‖₁   (A.2)

This weighted Rademacher sequence has the property that [14]

  P{|b^T s| > R} ≤ 2 exp(−R²/(2‖s‖₂²)), R > 0   (A.3)

which becomes P{|b^T s| ≤ R} ≥ max{0, 1 − 2 exp(−R²/(2‖s‖₂²))} by the axioms of probability. With this we can find an R such that P{|b^T s| ≤ R} will be greater than or equal to some probability 0 ≤ p ≤ 1, i.e.,

  R(p) = (‖c_γ(M)‖₂² + ‖d_γ‖₂²)^(1/2) (2 ln(2/(1−p)))^(1/2)   (A.4)

This value can be minimized by choosing p = 0, for which we arrive at the residual upper bound in (15). Note that even though we have set p = 0, we still have an unrealistically loose bound by the impossibility of MP choosing two sets of atoms for which all entries of their Gramian G_iq are in {−1, 1}.
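The Rademacher tail bound (A.3) can be checked numerically. This Monte Carlo sketch (our construction) draws sign vectors and compares the empirical tail probability with the bound:

```python
import math
import random

def rademacher_tail(s, R, trials=20000, seed=0):
    """Estimate P{|b^T s| > R} for b uniform on {-1, +1}^n and return it
    alongside the bound 2 exp(-R^2 / (2 ||s||_2^2)) of (A.3)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        t = sum(si if rng.random() < 0.5 else -si for si in s)
        if abs(t) > R:
            hits += 1
    bound = 2.0 * math.exp(-R * R / (2.0 * sum(si * si for si in s)))
    return hits / trials, bound
```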

Finally, to show (16), we can model the elements of the Gramian as random variables, u_i : R → [−1, 1], independently and identically distributed uniformly

  f_U(u_i) = 0.5 for −1 ≤ u_i ≤ 1, and 0 else   (A.5)

Substituting this into (14) gives a weighted sum of random variables satisfying

  C^(−2) R(M) ≤ u^T [c_γ(M); d_γ] ≤ ‖c_γ(M)‖₁ + ‖d_γ‖₁   (A.6)

where u is the random vector. For large M, this sum has the asymptotic property [14,15]:

  P{|u^T s| < R} ≈ Erf((3R²/(2‖s‖₂²))^(1/2))   (A.7)

Setting this equal to 0 ≤ p ≤ 1 and solving for R produces the upper bound (16). We can reach the upper bound (15) if we set p = 0.9586, but note that (16) can be made zero. This bound can still be extremely loose because the Gramian of two models in time–frequency dictionaries will be highly sparse.
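Similarly, the asymptotic approximation (A.7) for uniformly distributed Gramian entries can be verified by simulation (again our construction, relying on the normal limit of the weighted uniform sum):

```python
import math
import random

def uniform_sum_cdf(s, R, trials=20000, seed=1):
    """Estimate P{|u^T s| < R} for u i.i.d. uniform on [-1, 1] and return
    it with the asymptotic value Erf(sqrt(3 R^2 / (2 ||s||_2^2))) of (A.7)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        t = sum(si * rng.uniform(-1.0, 1.0) for si in s)
        if abs(t) < R:
            hits += 1
    asym = math.erf(math.sqrt(3.0 * R * R / (2.0 * sum(si * si for si in s))))
    return hits / trials, asym
```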

Computing the 2-norm in these expressions, however, leads to evaluating the double sums

  ‖c_γ(M)‖₂² = Σ_{m=M+1}^{n} Σ_{l=1}^{m} 1/[l(m−l+1)]^(2γ)   (A.8)

  ‖d_γ‖₂² = Σ_{m=1}^{n−1} Σ_{l=m+1}^{n} 1/[l(n−m+1)]^(2γ)   (A.9)

which can be prohibitive for large n. The dimensionality of c_γ(M) is n(n+1)/2 − P(M), and of d_γ is n(n−1)/2. We approximate these values in the following way for γ = 0.5, using the partial sum of the harmonic series

  Σ_{m=1}^{n} 1/m = ln n + γ_E + 1/(2n) − 1/(12n²) + 1/(120n⁴) − O(n⁻⁶)   (A.10)

where γ_E ≈ 0.5772 is the Euler–Mascheroni constant. To find ‖d_{0.5}‖₂²,

  ‖d_{0.5}‖₂² = Σ_{m=1}^{n−1} Σ_{l=m+1}^{n} 1/[l(n−m+1)]
            = Σ_{m=1}^{n−1} 1/(n−m+1) [Σ_{l=1}^{n} 1/l − Σ_{l=1}^{m} 1/l]
            ≈ Σ_{m=1}^{n−1} 1/(n−m+1) [ln(n/m) − (n−m)/(2nm) + (n²−m²)/(12n²m²)]   (A.11)

To find ‖c_{0.5}(M)‖₂² we first use partial fractions and then the partial sum of the harmonic series:

  ‖c_{0.5}(M)‖₂² = Σ_{m=M+1}^{n} Σ_{l=1}^{m} 1/[l(m−l+1)]
               = Σ_{m=M+1}^{n} 1/(m+1) Σ_{l=1}^{m} [1/l + 1/(m−l+1)]
               = 2 Σ_{m=M+1}^{n} 1/(m+1) Σ_{l=1}^{m} 1/l
               ≈ 2 Σ_{m=M+1}^{n} 1/(m+1) [ln m + γ_E + 1/(2m) − 1/(12m²)]   (A.12)

With these expressions we can avoid double sums in calculating the bounds.

Appendix B. Estimating the compressibility parameters

We estimate the compressibility parameters (C, γ) for all signals from the entire set of representation weights. Since by (4) the parameters (C, γ) bound from above the decay of all the ordered weights, only the largest magnitude weights matter for their estimation. Thus, we define a vector, a, of the largest n magnitude weights from each row in the set {{a_i(n_i)}_{i∈I}, a_q(n_q)}, which is equivalent to taking the largest weights at each approximation order. Good compressibility parameters can be given by

  min_{C,γ} ‖C z_γ − a‖₂ + λC subject to C z_γ ≽ a   (B.1)

where we define z_γ ≜ [1, 1/2^γ, ..., 1/n^γ]^T, and add a multiple of C in order to keep it from getting too large since the bounds (14)–(16) are all proportional to it. The constraint is added to ensure all elements of the difference C z_γ − a are positive such that (4) is true.
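A simplified sketch of fitting (C, γ) so that C m^(−γ) upper-bounds the sorted weight magnitudes, as in (4) and (B.1). Here we use a plain log-domain least-squares fit for γ followed by inflating C to satisfy the constraint; this is our simplification for illustration, not the paper's exact alternating procedure:

```python
import math

def estimate_compressibility(a):
    """Fit C, gamma with C * m**(-gamma) >= a[m-1] for magnitudes a sorted
    in descending order: least squares of ln a against -ln m for gamma,
    then inflate C until the bound (4) holds for every weight."""
    n = len(a)
    lz = [math.log(m + 1) for m in range(n)]   # ln m for m = 1..n
    la = [math.log(v) for v in a]
    szz = sum(z * z for z in lz)
    sz = sum(lz)
    sa = sum(la)
    sza = sum(z * v for z, v in zip(lz, la))
    gamma = -(n * sza - sz * sa) / (n * szz - sz * sz)
    lc = max(la[m] + gamma * lz[m] for m in range(n))  # smallest feasible ln C
    return math.exp(lc), gamma
```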

To remove the γ component from the exponent, and since all of the elements of z and a are positive and non-zero, we can instead solve the problem

  min_{C,γ} (ln C)² n + γ²‖ln z‖₂² + ‖ln a‖₂² + λ ln C − 2γ (ln z)^T (ln C 1 − ln a) − 2 ln C (ln a)^T 1   (B.2)

subject to the constraint C z_γ ≽ a. Taking the partial derivative of this with respect to γ and C, we find

  γ_o = −(ln z)^T (ln a − ln C 1)/‖ln z‖₂²   (B.3)

  C_o = exp((1/n) Σ_{i=1}^{n} [ln a + γ ln z]_i − λ/(2n))   (B.4)

Starting with some initial value of C, then, we use the following iterative method:

1. solve for γ given a C in (B.3);
2. find the new C in (B.4) using this γ;
3. set C' = exp(max(ln a + γ_o ln z)) and evaluate the error ‖C' z_γ − a‖₂;
4. repeat until the error begins to increase.

The factor λ in effect controls the step size for convergence. A typical value we use is λ = ∓0.03 based on experiments (the sign of which depends on if the objective function decreases with decreasing C).

References

[1] R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in sequence databases, in: Proceedings of the International Conference of Foundations of Data Organization and Algorithms, Chicago, IL, October 1993, pp. 69–84.
[2] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing of overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing 54 (11) (2006) 4311–4322.
[3] M. Casey, C. Rhodes, M. Slaney, Analysis of minimum distances in high-dimensional musical spaces, IEEE Transactions on Audio, Speech and Language Processing 16 (5) (2008) 1015–1028.
[4] M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, M. Slaney, Content-based music information retrieval: current directions and future challenges, Proceedings of the IEEE 96 (4) (2008) 668–696.
[5] K. Chang, J.-S.R. Jang, C.S. Iliopoulos, Music genre classification via compressive sampling, in: Proceedings of the International Society for Music Information Retrieval, Amsterdam, The Netherlands, August 2010, pp. 387–392.
[6] S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, SIAM Journal on Scientific Computing 20 (1) (1998) 33–61.
[7] S. Chu, S. Narayanan, C.-C.J. Kuo, Environmental sound recognition with time–frequency audio features, IEEE Transactions on Audio, Speech and Language Processing 17 (6) (2009) 1142–1158.
[8] C. Cotton, D.P.W. Ellis, Finding similar acoustic events using matching pursuit and locality-sensitive hashing, in: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk, NY, October 2009, pp. 125–128.
[9] L. Daudet, Sparse and structured decompositions of signals with the molecular matching pursuit, IEEE Transactions on Audio, Speech and Language Processing 14 (5) (2006) 1808–1816.
[10] D.P.W. Ellis, G.E. Poliner, Identifying cover songs with chroma features and dynamic programming beat tracking, in: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, April 2007, pp. 1429–1432.
[11] C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Minneapolis, MN, 1994, pp. 419–429.
[12] J. Gemmeke, L. ten Bosch, L. Boves, B. Cranen, Using sparse representations for exemplar based continuous digit recognition, in: Proceedings of the European Signal Processing Conference, Glasgow, Scotland, August 2009, pp. 1755–1759.
[13] J. Haitsma, T. Kalker, A highly robust audio fingerprinting system with an efficient search strategy, Journal of New Music Research 32 (2) (2003) 211–221.

[14] P. Jost, Algorithmic aspects of sparse approximations, Ph.D. Thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, June 2007.
[15] P. Jost, P. Vandergheynst, On finding approximate nearest neighbours in a set of compressible signals, in: Proceedings of the European Signal Processing Conference, Lausanne, Switzerland, August 2008, pp. 1–5.
[16] A. Kimura, K. Kashino, T. Kurozumi, H. Murase, A quick search method for audio signals based on piecewise linear representation of feature trajectories, IEEE Transactions on Audio, Speech and Language Processing 16 (2) (2008) 396–407.
[17] S. Krstulovic, R. Gribonval, MPTK: matching pursuit made tractable, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, Toulouse, France, April 2006, pp. 496–499.
[18] F. Kurth, M. Müller, Efficient index-based audio matching, IEEE Transactions on Audio, Speech and Language Processing 16 (2) (2008) 382–395.
[19] P. Leveau, E. Vincent, G. Richard, L. Daudet, Instrument-specific harmonic atoms for mid-level music representation, IEEE Transactions on Audio, Speech and Language Processing 16 (1) (2008) 116–128.
[20] M.S. Lewicki, T.J. Sejnowski, Learning overcomplete representations, Neural Computation 12 (2000) 337–365.
[21] C.-S. Li, P.S. Yu, V. Castelli, HierarchyScan: a hierarchical similarity search algorithm for databases of long sequences, in: Proceedings of the International Conference on Data Engineering, New Orleans, LA, February 1996, pp. 546–553.
[22] R.F. Lyon, M. Rehn, S. Bengio, T.C. Walters, G. Chechik, Sound retrieval and ranking using sparse auditory representations, Neural Computation 22 (9) (2010) 2390–2416.
[23] B. Mailhé, R. Gribonval, P. Vandergheynst, F. Bimbot, Fast orthogonal sparse approximation algorithms over local dictionaries, Signal Processing, this issue.
[24] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, third ed., Academic Press, Elsevier, Amsterdam, 2009.
[25] R. Mazhar, P.D. Gader, J.N. Wilson, Matching pursuits dissimilarity measure for shape-based comparison and classification of high-dimensional data, IEEE Transactions on Fuzzy Systems 17 (5) (2009) 1175–1188.
[26] M. Müller, F. Kurth, M. Clausen, Chroma-based statistical audio features for audio matching, in: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, October 2005, pp. 275–278.
[27] Y. Panagakis, C. Kotropoulos, G.R. Arce, Music genre classification via sparse representations of auditory temporal modulations, in: Proceedings of the European Signal Processing Conference, Glasgow, Scotland, August 2009, pp. 1–5.
[28] Y. Pati, R. Rezaiifar, P. Krishnaprasad, Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition, in: Proceedings of the Asilomar Conference on Signals, Systems, and Computers, vol. 1, Pacific Grove, CA, November 1993, pp. 40–44.
[29] T.V. Pham, A. Smeulders, Sparse representation for coarse and fine object recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4) (2006) 555–567.
[30] D. Rafiei, A. Mendelzon, Efficient retrieval of similar time sequences using DFT, in: Proceedings of the International Conference of Foundations of Data Organization and Algorithms, Kobe, Japan, November 1998, pp. 249–257.
[31] E. Ravelli, G. Richard, L. Daudet, Union of MDCT bases for audio coding, IEEE Transactions on Audio, Speech and Language Processing 16 (8) (2008) 1361–1372.
[32] E. Ravelli, G. Richard, L. Daudet, Audio signal representations for indexing in the transform domain, IEEE Transactions on Audio, Speech and Language Processing 18 (3) (2010) 434–446.
[33] L. Rebollo-Neira, D. Lowe, Optimized orthogonal matching pursuit approach, IEEE Signal Processing Letters 9 (4) (2002) 137–140.
[34] S. Scholler, H. Purwins, Sparse coding for drum sound classification and its use as a similarity measure, in: Proceedings of the International Workshop on Machine Learning and Music, ACM Multimedia, Firenze, Italy, October 2010.
[35] J. Serra, E. Gomez, P. Herrera, X. Serra, Chroma binary similarity and local alignment applied to cover song identification, IEEE Transactions on Audio, Speech and Language Processing 16 (August 2008) 1138–1151.

[36] B.L. Sturm, M. Christensen, Cyclic matching pursuit with multiscale time–frequency dictionaries, in: Proceedings of the Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, November 2010.
[37] B.L. Sturm, J.J. Shynk, Sparse approximation and the pursuit of meaningful signal models with interference adaptation, IEEE Transactions on Audio, Speech and Language Processing 18 (3) (2010) 461–472.
[38] B.L. Sturm, J.J. Shynk, A. McLeran, C. Roads, L. Daudet, A comparison of molecular approaches for generating sparse and structured multiresolution representations of audio and music signals, in: Proceedings of Acoustics, Paris, France, June 2008, pp. 5775–5780.
[39] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Transactions on Speech and Audio Processing 10 (5) (2002) 293–302.
[40] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time–frequency parameters, IEEE Transactions on Multimedia 7 (2) (2005) 308–315.
[41] P. Vincent, Y. Bengio, Kernel matching pursuit, Machine Learning 48 (1) (2002) 165–187.
[42] A. Wang, An industrial strength audio search algorithm, in: Proceedings of the International Society on Music Information Retrieval, Baltimore, Maryland, USA, October 2003, pp. 1–4.
[43] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proceedings of the IEEE 98 (6) (2010) 1031–1044.
[44] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 210–227.

