
Audio Engineering Society

Convention Paper
Presented at the 119th Convention

2005 October 7–10 New York, New York USA

This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Single Channel Source Separation using Short-time Independent Component Analysis

Dan Barry1, Derry Fitzgerald2, Eugene Coyle1 and Bob Lawlor3

1 Digital Audio Research Group, Dublin Institute of Technology, Kevin St., Dublin, Ireland, [email protected] & [email protected]

2 Dept. Electrical Engineering, Cork Institute of Technology, Rossa Avenue, Bishopstown, Cork, Ireland, [email protected]

3 Dept. of Electronic Engineering, National University of Ireland, Maynooth, Ireland, [email protected]

ABSTRACT

In this paper we develop a method for the sound source separation of single channel mixtures using Independent Component Analysis within a time-frequency representation of the audio signal. We apply standard Independent Component Analysis techniques to contiguous magnitude frames of the short-time Fourier transform of the mixture. Provided that the amplitude envelopes of each source are sufficiently different, it can be seen that it is possible to recover the independent short-time power spectra of each source. A simple scoring scheme based on auditory scene analysis cues is then used to overcome the source ordering problem, ultimately allowing each of the independent spectra to be assigned to the correct source. A final stage of adaptive filtering is then applied which forces each of the spectra to become more independent. Each of the sources is then resynthesised using the standard inverse short-time Fourier transform with an overlap-add scheme.

1. BACKGROUND

Sound source separation has been the topic of extensive research in recent years. The problem has seen many different formulations which have been based on many different mixing models. Some success has been had for the degenerate case, i.e. more sources than mixtures, where there are at least 2 mixture signals. In [1], a technique for recovering N sources from two convolutive speech mixtures was presented. The technique clusters time-frequency components based on the intensity ratios and phase delays between each sensor capturing the mixture. In [2, 3] a similar technique was proposed for separating N sources from intensity panned stereo recordings. It is a localisation technique based on clustering components belonging to phase coherent sources emanating from the same position in the horizontal plane.


Standard ICA [4] techniques have proved ideal for the case where a number of linear mixtures equal to the number of sources exists. Unfortunately, the most popular commercial music formats are still of the stereo variety and so standard ICA is rarely applicable. By far the most difficult problem is that of separating N sources from a single mixture of the sources. Some limited success has been achieved to this end using computational auditory scene analysis (CASA) techniques [5], which usually require prior knowledge of some description. Independent Subspace Analysis (ISA) [6] has been applied quite successfully to the separation of drum sounds from single channel mixtures [7]. This technique involves carrying out Principal Component Analysis (PCA) on a magnitude spectrogram, which results in a set of N time basis functions (amplitude envelopes) and N frequency basis functions (spectra), ordered by variance. The basis functions are de-correlated from each other but not mutually independent. At this stage ICA can be applied to the time basis functions, resulting in independent amplitude envelopes which usually correspond to each of the drums in the mixture. The process works well on drums because they tend to account for most of the variance in musical signals. However, because of the way in which the model represents the data, it is limited to pitch stationary sounds such as drums. In this paper we present a method for performing single channel source separation using a combination of ICA and CASA.
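For context, the ISA procedure just described ([6, 7]) could be sketched roughly as follows. This is an illustrative Python outline using scikit-learn, not part of the present paper; the number of components is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

def isa_sketch(X, n_components=4):
    """Rough ISA outline: PCA on a magnitude spectrogram X (freq bins x
    time frames), then ICA on the resulting time basis functions.
    n_components (number of latent sources) is an assumed value."""
    pca = PCA(n_components=n_components)
    # De-correlated amplitude envelopes (time basis functions), ordered by variance
    time_bases = pca.fit_transform(X.T)          # frames x components
    freq_bases = pca.components_                  # components x bins (spectra)
    # ICA makes the amplitude envelopes mutually independent
    ica = FastICA(n_components=n_components, max_iter=1000)
    indep_envelopes = ica.fit_transform(time_bases)
    return indep_envelopes, freq_bases
```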

2. SYSTEM OVERVIEW

ICA is a statistical method used for blind source separation. It is usually applied to a matrix which contains a set of linear time mixtures of some latent sources. Furthermore, it requires at least as many mixture signals as there are sources present in the mixture. For this reason it is not normally associated with single channel separation, since to separate N sources you would require N observation mixtures. For a detailed description of ICA refer to [4]. Here, we present a formulation which allows ICA to be performed on time-frequency representations of single channel mixtures. The process starts by taking the magnitude STFT of a single channel mixture. Each musical source will usually have significantly different amplitude envelopes, which are ultimately a function of the timbre, mode of excitation and articulation possibilities of specific instruments. Therefore, if two instruments are playing simultaneously, the mixture at time (t) will be similar to that of the mixture at time (t+a), but the mixing coefficients of each instrument at time (t) and (t+a) will be different because of their time evolving amplitude envelopes. Although the time representations of the signal at t and (t+a) are not candidates for ICA, the short-time power spectra are.

Figure 1: The graph shows the amplitude envelopes of two different notes of equal length along with the resultant mixture envelope of the notes. The dark line plot is the envelope of a harp pluck while the dashed line is that of a bowed string.

In the diagram above, the pitch of each note remains constant but the amplitude coefficients or envelopes of each note change with respect to time. As(t) and Ah(t) in Figure 1 refer to the amplitudes of each source, harp and string respectively, at time (t). It is effectively these coefficients which are the latent mixing coefficients we seek to discover. These amplitude changes are of course reflected in the time-frequency domain. So a spectral frame at time (t) and (t+a) can be considered as alternative linear mixtures of the same latent data, i.e. the latent short-time spectra of each source.

Figure 2: Shows the mixture spectra at times (t) and (t+a).


In order to separate the sources then, contiguous frequency frames separated by some distance, a, are passed to an ICA algorithm which returns independent short-time spectra which correspond to each of the independent sources. For this reason we have termed the technique 'Short-time Independent Component Analysis' (STICA). After the ICA stage, some processing takes place which ensures that the independent magnitude frames are scaled and ordered correctly. Each independent magnitude spectrum is then resynthesised using the IFFT with the original mixture phase information, resulting in the separation of two sources from one mixture.

3. METHOD

3.1. ICA Front-end

We begin by taking the magnitude STFT of the mixture, eq. 1,

$$X(k,m) = \left| \sum_{n=0}^{N-1} w(n)\, x(n + mH)\, e^{-j 2\pi nk/N} \right| \qquad (1)$$

where X(k,m) is the absolute value of the complex STFT, m is the time frame index, k is the frequency bin index, H is the hopsize between frames and N is the FFT window size. w(n) is a suitable window, also of length N. Next we perform ICA on 2 contiguous frames of the magnitude spectrum X(k,m). The idea is that, over a short time period (1-4 frames), the local frequency content of the signal will be similar but will not remain stationary. Therefore, the time varying amplitude envelopes of each source will dictate their relative amplitudes in the mixture at any given time. These unknown amplitude envelope values of each source at any given time frame are considered to be the mixing matrix, denoted by A in equation 2. So in order to retrieve 2 independent short-time source spectra we need to supply the ICA algorithm with 2 short-time mixture spectra. Furthermore, to ensure convergence, the mixing coefficients of each source at each time frame must not be the same. So for example, if the amplitude envelopes of the sources change rapidly, consecutive frames may be used, but if the envelopes evolve slowly it may be necessary to put some distance between the frames; this distance is denoted by α in equation 3 below. It should also be noted, then, that if each of the sources has identical amplitude envelopes, no separation can be achieved. The formal ICA representation is shown in equation 2, where x is a set of observation mixtures, A is an unknown mixing matrix and s, also unknown, is a matrix with the same dimensions as x and contains the independent short-time source spectra.

$$\mathbf{x} = \mathbf{A}\mathbf{s} \qquad (2)$$

where

$$\mathbf{x} = \big( X(k,m),\; X(k,m+\alpha) \big) \qquad (3)$$

and where

$$\mathbf{s} = \big( S_i(k,m),\; S_j(k,m) \big), \quad i \neq j \qquad (4)$$

Referring specifically to a 2 source problem as in equations 3 and 4, both x and s are of dimension 2 × k and A is of dimension 2 × 2. α is the distance between the frames to be passed to the ICA algorithm. Si and Sj, then, are the unknown independent short-time source spectra. There are two significant issues at this point. The first is that ICA causes the independent outputs to have an arbitrary scaling, and the second is that ICA returns the independent sources in a random order. This effectively means that source i in frame 1 may be source j in frame 2. So in order to resynthesise 2 coherent independent sources in time, some method of grouping each frame with the correct source is required. This will be dealt with in section 3.3.
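To make the front-end of equations 1-4 concrete, a minimal Python sketch is given below. It assumes scikit-learn's FastICA as the ICA algorithm and uses illustrative parameter values; it is not the authors' implementation.

```python
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import FastICA

def stica_front_end(x, fs, n_fft=4096, hop=1024, alpha=12):
    """STICA front-end sketch (eqs. 1-4): ICA on pairs of magnitude
    STFT frames separated by alpha hops. Parameter values are
    illustrative assumptions, not taken from the paper."""
    # Eq. 1: magnitude (and phase) of the mixture STFT
    _, _, Z = stft(x, fs, window='hann', nperseg=n_fft, noverlap=n_fft - hop)
    X, phase = np.abs(Z), np.angle(Z)                # K x M

    K, M = X.shape
    S1, S2 = np.zeros((K, M)), np.zeros((K, M))
    ica = FastICA(n_components=2, max_iter=1000)

    for m in range(M - alpha):
        # Eq. 3: two observations of the same latent short-time spectra
        obs = np.vstack([X[:, m], X[:, m + alpha]])  # 2 x K
        # Eqs. 2 and 4: unmix; rows are the candidate source spectra,
        # returned with arbitrary scale and in random order
        S = ica.fit_transform(obs.T).T               # 2 x K
        S1[:, m], S2[:, m] = np.abs(S[0]), np.abs(S[1])
    return S1, S2, X, phase
```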

3.2. Re-scaling Frames

As stated, the outputs of the ICA algorithm are always arbitrarily scaled. This would result in random gain changes at the output. The problem is easily resolved since the sum of the independent components should be approximately equal to a scaled version of the original mixture frame. So in order to calculate each source's relative magnitude we use equations 5 and 6.

$$S'_1(k,m) = X(k,m)\,\frac{S_1(k,m)}{S_1(k,m) + S_2(k,m)} \qquad (5)$$

$$S'_2(k,m) = X(k,m)\,\frac{S_2(k,m)}{S_1(k,m) + S_2(k,m)} \qquad (6)$$

The process described above ensures that the scaling between frames is relative and not arbitrary, thus avoiding random gain changes in the output signals.
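In code, equations 5 and 6 amount to a per-bin normalisation. The small epsilon in the sketch below is an added safeguard against division by zero and is not part of the paper.

```python
import numpy as np

def rescale_frames(S1, S2, X, eps=1e-12):
    """Eqs. 5-6: rescale the independent spectra so that together they
    account for the original mixture magnitude X."""
    denom = np.abs(S1) + np.abs(S2) + eps   # eps avoids division by zero
    S1p = X * np.abs(S1) / denom
    S2p = X * np.abs(S2) / denom
    return S1p, S2p
```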


3.3. Source Ordering

The second problem is that ICA returns the independent sources in a random order and, since we are carrying this out on a short-time basis, there is an issue associated with identifying each of the sources so that each of the independent frames is assigned to the correct source for resynthesis. The complexity of this problem rises significantly the more sources are present. We have limited this research to only two sources present in a single mixture. We propose a simple scheme to order the sources. For our examples we use three measures:

1. Normalised Spectral Centroid,
2. Peak Location,
3. Proximity to Previous Peak Locations.

The measures are compared and the results summed to form likelihood scores which determine which source the independent frames belong to. The similarity measures chosen here are for simplicity during evaluation; the authors suggest that a more complex rule based system could lead to very robust identification and separation. Our rules are equipped only to detect the sources based on their musical register with respect to each other, so one source is considered to always have a higher pitch than the other, but their harmonics will of course overlap. If the sources were to intersect and switch register, this would be reflected in the output, but the outputs would still remain separated and independent. The spectral centroid [8] is a measure of the 'brightness' of a sound or, as it is often referred to, the 'centre of gravity' of a power spectrum. We use it here to indicate the register of the instrument. It is obtained using equation 7. The frames are normalised before the spectral centroid is obtained in order to avoid intensity ambiguities.

$$SC = \frac{\sum_{k=1}^{N} k\,X(k)}{\sum_{k=1}^{N} X(k)} \qquad (7)$$

So we will have SC1 and SC2 for each output respectively. As a second indication of register we take the peak location from each of the independent short-time spectra, which is assumed to be the fundamental frequency of each source at a given time. The peak is denoted as P1(m) and P2(m) for each output respectively. The third measure performs simple peak tracking. We define a distance measure between each of the current peaks and each of the previous peaks as in equation 8.

$$D_1 = \left| P_1(m) - P_1(m-1) \right|$$
$$D_2 = \left| P_2(m) - P_1(m-1) \right| \qquad (8)$$

We now have three measures which describe the randomly ordered magnitude frames. We predetermine that the source with the highest register will be contained in the vector I1 and the source of lower register will be contained in I2. We now compare each measure to determine which source container a given independent frame should be placed in. The scores are evaluated simply as follows.

$$L_1{+}{+} \;\;\text{if}\;\; SC_1 > SC_2 \;\;\text{else}\;\; L_2{+}{+}$$
$$L_1{+}{+} \;\;\text{if}\;\; P_1 > P_2 \;\;\text{else}\;\; L_2{+}{+}$$
$$L_1{+}{+} \;\;\text{if}\;\; D_1 < D_2 \;\;\text{else}\;\; L_2{+}{+}$$

$$\text{if}\;\; L_1 > L_2:\quad I_1(k,m) = S'_1(k,m),\;\; I_2(k,m) = S'_2(k,m)$$
$$\text{else}:\quad I_1(k,m) = S'_2(k,m),\;\; I_2(k,m) = S'_1(k,m) \qquad (9)$$

where L1 is the likelihood that output vector 1 from the ICA front end belongs to the source with the highest register, i.e. source 1 is assigned to container 1. The likelihood scores are compared to see which permutation of the outputs is required. At this stage we now have two independent magnitude spectrograms, one for each source.
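The following sketch shows one possible reading of the scoring in equations 7-9 for the two-source case. The peak-distance measure (eq. 8) is interpreted here as comparing both current peaks against the previous peak assigned to the higher-register container; that interpretation, and the tie-breaking behaviour, are assumptions.

```python
import numpy as np

def spectral_centroid(frame):
    """Eq. 7: centroid of a normalised magnitude frame."""
    frame = frame / (np.sum(frame) + 1e-12)
    return np.sum(np.arange(1, len(frame) + 1) * frame)

def order_sources(S1p, S2p):
    """Eq. 9: place each pair of independent frames into the correct
    container. I1 holds the higher-register source, I2 the lower."""
    K, M = S1p.shape
    I1, I2 = np.zeros((K, M)), np.zeros((K, M))
    prev_peak = None
    for m in range(M):
        a, b = S1p[:, m], S2p[:, m]
        L1 = L2 = 0
        # 1. Normalised spectral centroid (register)
        if spectral_centroid(a) > spectral_centroid(b): L1 += 1
        else: L2 += 1
        # 2. Peak location (assumed fundamental frequency)
        p1, p2 = np.argmax(a), np.argmax(b)
        if p1 > p2: L1 += 1
        else: L2 += 1
        # 3. Proximity to the previous source-1 peak (eq. 8)
        if prev_peak is not None:
            if abs(p1 - prev_peak) < abs(p2 - prev_peak): L1 += 1
            else: L2 += 1
        # Permute the outputs according to the likelihood scores
        if L1 >= L2:
            I1[:, m], I2[:, m] = a, b
            prev_peak = p1
        else:
            I1[:, m], I2[:, m] = b, a
            prev_peak = p2
    return I1, I2
```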

3.4. Adaptive Filtering

There are instances where the ICA outputs are not completely independent. This is usually evident where both sources exhibit a similar time evolution, and it usually results in poor separation during those time frames. So, as a final process to force further attenuation of the unwanted source, an adaptive filtering technique is used which consists of multiplying each independent frequency frame by a normalised, smoothed and inverted version of the opposite frame. This suppresses residual effects of the unwanted sources between the frames, resulting in greater separation.


$$Y_j(k,m) = I_j(k,m) \times \Big( 1 - \big[\, w(n) \ast I_i(k,m) \,\big] \Big), \quad i \neq j \qquad (10)$$

where w(n) is a suitable Hanning window and ∗ denotes convolution, which is performed in order to smooth the magnitude response into something resembling an equalization curve. This curve is inverted so that, when multiplied by the opposite frame, it will suppress any residual information from the other source. This is done for each frame before resynthesis using the IFFT, equation 11.
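A possible implementation of the filtering in equation 10 is sketched below. The smoothing-window length and the peak normalisation of the opposite frame are not specified in the paper and are assumptions here.

```python
import numpy as np
from scipy.signal.windows import hann

def adaptive_filter(I1, I2, smooth_len=64):
    """Eq. 10: suppress residual energy of the opposite source by
    multiplying each frame with an inverted, smoothed, normalised
    version of the other frame. smooth_len is an assumed value."""
    w = hann(smooth_len)
    w /= w.sum()
    Y1, Y2 = np.empty_like(I1), np.empty_like(I2)
    for m in range(I1.shape[1]):
        for own, other, Y in ((I1, I2, Y1), (I2, I1, Y2)):
            # Normalise, then smooth the opposite frame into an EQ-like curve
            mask = np.convolve(other[:, m] / (other[:, m].max() + 1e-12),
                               w, mode='same')
            # Invert the curve and apply it to the current frame
            Y[:, m] = own[:, m] * (1.0 - mask)
    return Y1, Y2
```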

$$y(n + mH) = w(n)\,\frac{1}{K}\sum_{k=1}^{K} Y(k,m)\, e^{\,j \angle x_{\omega}(k,m)} \qquad (11)$$

where ∠xω(k,m) is the phase information from the original mixture spectrogram and w(n) is a suitable window of length N.
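In code, the resynthesis of equation 11 reduces to reattaching the mixture phase and inverting the STFT with overlap-add; the sketch below uses scipy's inverse STFT, which performs the IFFT and overlap-add internally.

```python
import numpy as np
from scipy.signal import istft

def resynthesise(Y, mix_phase, fs, n_fft=4096, hop=1024):
    """Eq. 11: inverse STFT of a separated magnitude spectrogram Y,
    using the original mixture phase, with overlap-add."""
    Z = Y * np.exp(1j * mix_phase)      # reattach the mixture phase
    _, y = istft(Z, fs, window='hann', nperseg=n_fft, noverlap=n_fft - hop)
    return y
```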

4. RESULTS

To evaluate the system we synthesized a mixture of bass and flute. Both sources are always active and do contain harmonic overlap. Each source is playing a melody.

Figure 3: The top plot shows the mixture signal, the centre plot shows the resynthesised flute separation and the lower plot shows the resynthesised bass separation.

For the separations above, a 4096 point window with 75% overlap was used. The frame distance, denoted by α in equation 3, was set to 12, which corresponds to 330 ms for the parameters chosen. This particular value was achieved through experimentation. A degree of separation is always achieved, but this setting provided the best results for this example.
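For reference, the reported settings would map onto the earlier sketches roughly as follows; the 1024-sample hop, the 44.1 kHz sample rate and the variable name `mixture` (the single-channel input signal) are assumptions, since the paper states only the window size, overlap and α.

```python
# Hypothetical end-to-end run with the settings reported above
S1, S2, X, phase = stica_front_end(mixture, fs=44100,
                                   n_fft=4096, hop=1024, alpha=12)
S1p, S2p = rescale_frames(S1, S2, X)
I1, I2 = order_sources(S1p, S2p)
Y1, Y2 = adaptive_filter(I1, I2)
flute = resynthesise(Y1, phase, fs=44100)   # higher-register container
bass  = resynthesise(Y2, phase, fs=44100)   # lower-register container
```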

Figure 4: The spectrogram on the top shows the mixture signal, followed by the flute and the bass respectively. Note that the discontinuities in the flute notes can be seen as artifacts in the bass separation.

5. CONCLUSIONS

We present a framework which is capable of single channel source separation. Although the system is currently limited by the simple source ordering rules, the results remain compelling. The method could be extended and made more robust by adding more ordering rules at the output stage. Theoretically the method could be extended to deal with more sources by taking J frames at the input, where J is the number of sources present.


This makes the problem significantly more complex since there will be J factorial (J!) permutation possibilities to consider during the source ordering stage. The main limitation of the system described is the fact that both sources must be active all of the time. To overcome this problem, a mechanism whereby two output frames can be assigned to the same source could be employed.

6. REFERENCES

[1] A. Jourjine, S. Rickard and O. Yilmaz, "Blind Separation of Disjoint Orthogonal Signals: Demixing N Sources from 2 Mixtures," Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, June 2000.

[2] Barry, D., Lawlor, R. and Coyle, E., "Sound Source Separation: Azimuth Discrimination and Resynthesis," Proc. 7th International Conference on Digital Audio Effects (DAFX 04), Naples, Italy, 2004.

[3] Barry, D. and Lawlor, R., "Real-time Sound Source Separation using Azimuth Discrimination and Resynthesis," Proc. 117th Audio Engineering Society Convention, October 28-31, San Francisco, CA, USA, 2004.

[4] A. Hyvärinen, J. Karhunen and E. Oja, "Independent Component Analysis," Wiley & Sons, 2001.

[5] D. F. Rosenthal and H. G. Okuno, Computational Auditory Scene Analysis, LEA Publishers, Mahwah, NJ, 1998.

[6] M. A. Casey, "Separation of Mixed Audio Sources by Independent Subspace Analysis," Proc. of the Int. Computer Music Conference, Berlin, August 2000.

[7] FitzGerald, D., Lawlor, B. and Coyle, E., "Independent Subspace Analysis using Locally Linear Embedding," Proceedings of the Digital Audio Effects Conference (DAFX03), London, pp. 13-17, 2003.

[8] Beauchamp, J. W., "Synthesis by Spectral Amplitude and Brightness Matching of Analyzed Musical Instrument Tones," J. Audio Eng. Soc., vol. 30, pp. 396-406, 1982.