
Digital Signal Processing 17 (2007) 891–913

www.elsevier.com/locate/dsp

Variational and stochastic inference for Bayesian source separation

A. Taylan Cemgil a,∗,1, Cédric Févotte b, Simon J. Godsill a

a Engineering Department, University of Cambridge, Trumpington St., CB2 1PZ, Cambridge, UK
b Département Signal-Image, GET/Télécom Paris (ENST), 37-39, rue Dareau, 75014 Paris, France

Available online 7 April 2007

Abstract

We tackle the general linear instantaneous model (possibly underdetermined and noisy) where we model the source prior with a Student t distribution. The conjugate-exponential characterisation of the t distribution as an infinite mixture of scaled Gaussians enables us to do efficient inference. We study two well-known inference methods, the Gibbs sampler and variational Bayes, for Bayesian source separation. We derive both techniques as local message passing algorithms to highlight their algorithmic similarities and to contrast their different convergence characteristics and computational requirements. Our simulation results suggest that typical posterior distributions in source separation have multiple local maxima. Therefore we propose a hybrid approach where we explore the state space with a Gibbs sampler and then switch to a deterministic algorithm. This approach seems to be able to combine the speed of the variational approach with the robustness of the Gibbs sampler.
© 2007 Elsevier Inc. All rights reserved.

Keywords: Source separation; Variational Bayes; Markov chain Monte Carlo; Gibbs sampler

1. Introduction

In Bayesian source separation [1–7], the task is to infer N source signals sk,n given M observed signals xk,m where n = 1, . . . , N, m = 1, . . . , M. Here, k is an index with k = 1, . . . , K that may directly correspond to time or indirectly to an expansion coefficient of the sources in a (linear) transform domain. By letting s ≡ s1:K,1:N and x ≡ x1:K,1:M, a generic hierarchical Bayesian formulation of the problem can be stated as

p(s|x) = (1/Zx) ∫ dΘm dΘp p(x|s,Θm) p(s|Θp) p(Θm) p(Θp).   (1)

The mixing process is characterised by the (possibly degenerate, deterministic) conditional distribution p(x|s,Θm), that is known as the observation model. Here, Θm denotes the collection of mixing parameters such as the mixing matrix, observation noise variance, etc. The prior term p(s|Θp), the source model, describes the statistical properties of the sources via their own prior parameters Θp. The hierarchical model is completed by postulating hyper-priors over

* Corresponding author. E-mail address: [email protected] (A.T. Cemgil).
1 This research is sponsored by the EPSRC grant EP/D03261X/1 “Probabilistic Modelling of Musical Audio for Machine Listening” and the MUSCLE Network of Excellence supported by the European Commission under FP6.
1051-2004/$ – see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.dsp.2007.03.008


the nuisance parameters Θm and Θp. The normalisation term Zx = p(x) is the marginal probability of the data under the model, also known as the evidence [8]. While its actual numerical value is not directly relevant for separation, it plays a key role for model order selection, for example when the number of sources N is unknown and needs to be estimated [9].

In this paper we assume the mixing process to be linear and instantaneous. In reality, a linear mixing process may still involve convolutions (i.e., reverberation) and may evolve with time (i.e., nonstationarity, moving sources). While such general scenarios are definitely of practical interest and pose significant computational challenges, these may be addressed in the same Bayesian framework of Eq. (1). Another common assumption is that of a-priori independent sources; i.e., the source model factorises as p(s|Θp) = ∏n p(sn|Θp), which underlies the well-known independent component analysis model [10,11]. While the validity of the a-priori independence assumption is debatable, especially for musical audio, it is nevertheless a quite commonly made assumption, mainly due to its simplicity. Note however that some authors have considered source models where mutual dependence is taken into account [6,7].

Once the posterior is computed by evaluating the integral, the estimates of the sources can be obtained as a marginal maximum-a-posteriori (MMAP) or minimum-mean-square-error (MMSE) estimate

s∗ = argmax_s p(s|x),   〈s〉p(s|x) = ∫ s p(s|x) ds.

Here, and elsewhere, the notation 〈f(x)〉p(x) will denote the expectation of the function f(x) under the distribution p(x), i.e., 〈f(x)〉p ≡ ∫ dx f(x) p(x).

Unfortunately, the computation of the posterior distribution p(s|x) via the integral in Eq. (1) is intractable for almost all relevant observation and source models, even under conditionally Gaussian and independence assumptions. Hence, approximate numerical integration techniques have to be employed.

In summary, the success of a practical Bayesian source separation system hinges upon the following points:

• Topology (structure) of the model and the parameter regime.
• The accuracy, solution quality and the computational cost of the inference algorithm.

Obviously, some tradeoff has to be made between model complexity and computational effort. However, for a particular application, the details of how to make this tradeoff are far from clear, especially given that there are many candidate models and approximate inference algorithms. Moreover, it is not uncommon that the difficulty of the inference problem is determined by the particular parameter regime, not only by model structure. For example, the difficulty of a source separation task depends upon the particular details of the mixing system, where the effective rank of the mixing matrix and the number of observed data points play a critical role. This affects the convergence properties of approximate inference algorithms, and success often depends upon initialisation and other design choices (such as tempering or the propagation schedule).

In this paper, we will investigate some of the issues raised above. We will focus on the underdetermined case (M < N) and on noisy mixtures. The underdetermined case is challenging because, contrary to the (over)determined case, estimating the mixing system is not sufficient for reconstruction of the sources, and the prior structure p(s|Θ) becomes increasingly important for reconstructing the sources in noisy environments.

We will follow the model proposed in [12,13], where audio sources are decomposed on an MDCT basis (a local cosine transform orthonormal basis) and the coefficient prior p(s|·) is a Student t distribution. This prior favours sparsity of the sources on a given dictionary, which intuitively means that the MAP solution s∗ or MMSE solution 〈s〉 typically contains few coefficients that are significantly non-zero. The use of source sparsity to handle underdetermined source separation finds its origin in the deterministic approaches of [14,15].

In [13], a Gibbs sampler is derived to sample from the posterior distribution of the parameters (the mixing matrix, the source coefficients, the additive noise variance and the hyper-parameters). Minimum mean square error estimates of the coefficients of the sources are then computed and time domain estimates of the sources are reconstructed by applying the inverse MDCT. The results were reproducible, i.e., independent of initialisation, and the method could compute a perceptually reasonable separation of the sources.

The main advantage of MCMC is its generality, robustness and attractive theoretical properties. However, the method comes at the price of a heavy computational burden which may render it impractical for certain applications. An alternative approach for computing the required integrals is based on deterministic fixed point iterations (variational Bayes–structured mean field) [16–18]. This set of methods has direct links with the well-known expectation–maximisation (EM) type of algorithms.

Variational methods have been extensively applied to various models for source separation by a number of authors [9,19–25]. In this contribution, we derive both VB and the Gibbs sampler as local message passing algorithms on a factor graph [18,26]. This approach has two advantages: first, it provides a pedagogical perspective to highlight the algorithmic similarities and to contrast objectively the convergence characteristics and computational requirements of different inference algorithms. Second, the framework facilitates generalisation to more complex models and the automation of the code generation procedure, including alternative message passing methods.

2. Model

The source separation model defines a generative model for the source coefficients. The model is applicable to the actual time series xt,m for t = 1, . . . , T observed at channel m as well as when transformed by an orthogonal basis to a new series xk,m, where the new index k typically encodes both the frequency band and frame number (a particular tile in a time-frequency plane). We assume linear instantaneous mixing2:

xk,m | sk,1:N, am, rm ∼ N(xk,m; a⊤m sk,1:N, rm),

where sk,1:N is the vector of source coefficients, am is an N × 1 vector that denotes the mixing proportions for the mth channel, and rm is the variance of observation noise in the mth channel. Since both of these quantities are typically not known exactly, we place informative priors on them3:

rm ∼ IG(rm; ar, br),
am ∼ N(am; μa, Pa).

In many domains (audio, natural images) the distribution of source coefficients sk,n tends to be heavy tailed. For example, in typical audio signals this is a simple consequence of the fact that the vast majority of physical systems (human vocal tract, many acoustical musical instruments, etc.) produce quasi-periodic oscillations. Hence, over short time scales, the energy of signals from these sources is concentrated in a few narrow frequency bands, which enables a sparse representation using just a few expansion coefficients. In this respect, a sparse prior can be justified by the underlying physics. We assume the source coefficients to be zero mean Gaussian

sk,n | vk,n ∼ N(sk,n; 0, vk,n),

where vk,n is an unknown variance. Note that the variance vk,n (for the zero mean case) is equal to the expected amount of energy. Since this quantity is in general unknown, we place an informative prior as

vk,n | λn ∼ IG(vk,n; ν/2, 2/(νλn)).

We note that this hierarchical prior model on the source coefficients is equivalent to a t-distribution

Tν(s; μ, λ) = ∫ dv N(s; μ, v) IG(v; ν/2, 2/(νλ)) ≡ [Γ((ν + 1)/2) / (√(νπλ) Γ(ν/2))] (1 + (1/ν)(s − μ)²/λ)^(−(ν+1)/2),

where the parameter λn directly controls the variance of the t-distribution and hence the power of the nth source. We place an informative prior on λn as

λn ∼ G(λn; aλ, bλ).

In this paper the degrees of freedom ν is fixed to a low value, yielding a distribution with heavy tails to model sparsity. However, any estimation procedure for the degrees of freedom of an inverse Gamma distribution could be employed to learn ν as well. As such, in [27], we have considered a lower bound maximisation. Similarly, an empirical adaptive scheme is considered in [13] and a Metropolis–Hastings procedure is proposed in [28]. The graphical model of the full model is shown in Fig. 1.

2 N, IG, and G denote Gaussian (normal), inverse Gamma and Gamma distributions, respectively, and are defined in Appendix A.
3 Note that one can break the source separation gain indeterminacy, (ca)⊤s = a⊤(cs), by placing appropriate priors on the mixing proportions a or the source coefficients s. The posterior then becomes invariant only to permutations of the sources.


Fig. 1. Graphical model for the source separation model. The rectangle denotes a plate, K copies of the nodes inside.
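To make the scale-mixture construction above concrete, the following sketch draws source coefficients by first sampling a variance from the inverse Gamma prior and then a Gaussian coefficient, and compares the result with direct draws from the equivalent Student t distribution. The hyper-parameter values and the NumPy-based implementation are illustrative assumptions, not part of the original model specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyper-parameters (chosen for this sketch only)
nu, lam = 0.5, 1.0        # degrees of freedom and scale of the t prior
K = 100000                # number of coefficients to draw

# Hierarchical sampling: v ~ IG(nu/2, 2/(nu*lam)), s | v ~ N(0, v).
# In the paper's parameterisation an IG(a, b) draw is 1 / Gamma(shape=a, scale=b).
v = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / (nu * lam), size=K)
s_hier = rng.normal(0.0, np.sqrt(v))

# Direct sampling from the equivalent Student t distribution T_nu(0, lam)
s_direct = np.sqrt(lam) * rng.standard_t(df=nu, size=K)

# The two constructions should produce matching heavy-tailed statistics
print(np.percentile(np.abs(s_hier), [50, 90, 99]))
print(np.percentile(np.abs(s_direct), [50, 90, 99]))
```

Both routes give the same marginal prior; the hierarchical route is the one that keeps the model conditionally Gaussian and hence conjugate.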

3. Inference

The model described in the previous section is an instance of the basic Bayesian source separation problem stated in (1), where the mixing model parameters are Θm = {a1:M, r1:M} ≡ {A, R} and the source prior parameters are Θp = {v1:K,1:N, λ1:N} ≡ {v, λ}. We will refer to all parameters by Θ ≡ (Θp, Θm). The remaining hyper-parameters, aλ, bλ, ar, br, μa, Pa, and ν, are assumed to be known. Formally, we need to compute the posterior distribution

p(s|x) = (1/Zx) ∫ φx(s, A, R, v, λ) dA dR dv dλ,   (2)

where φx is the joint distribution evaluated at the observed x given as

φx(s, A, R, v, λ) ≡ p(x|s, A, R) p(s|v) p(A) p(R) p(v|λ) p(λ)
= ∏_{k=1}^{K} [ (∏_{n=1}^{N} p(vk,n|λn) p(sk,n|vk,n)) (∏_{m=1}^{M} p(xk,m|sk,1:N, am, rm)) ] × (∏_{m=1}^{M} p(am) p(rm)) (∏_{n=1}^{N} p(λn)),

and estimate the sources by, e.g., the MMSE estimate 〈s〉p(s|x). Unfortunately, the required integral cannot be computed exactly and numerical techniques have to be employed.

In the following two sections, we will review two alternative approaches for numerical computation: a stochastic method (Markov chain Monte Carlo) and a deterministic method (structured mean field–variational Bayes). Both methods are well understood and a vast amount of literature is available on both subjects. In this paper, we will introduce both techniques in the context of source separation to highlight the algorithmic similarities in order to facilitate a comparison.


3.1. Markov chain Monte Carlo

Suppose we could generate I independent samples (s(i),Θ(i)), i = 1, . . . , I , from the joint posterior

(1/Zx) φx(s, Θ) ≡ (1/Zx) φx(s, A, R, v, λ).

Then, the intractable integral can be trivially approximated by simply discarding Θ(i) and the sources could be recovered by

〈s〉 ≈ (1/I) ∑_{i=1}^{I} s(i).   (3)

Unfortunately, generating independent samples from the joint posterior is a difficult task. On the other hand, it is usually easier to generate dependent samples, that is, we generate (s(i+1), Θ(i+1)) by making use of (s(i), Θ(i)). Perhaps surprisingly, even though subsequent samples are correlated (and provided certain ergodicity conditions are satisfied), Eq. (3) remains valid and the estimated quantities converge to their true values as the number of samples I goes to infinity [29].

A sequence of dependent samples (s(i), Θ(i)) is generated by sampling from a Markov chain that has the desired joint posterior φx(s, Θ)/Zx as its stationary distribution. The chain is defined by a collection of transition probabilities, i.e., a transition kernel T(s(i+1), Θ(i+1)|s(i), Θ(i)). The Metropolis–Hastings algorithm [30,31] provides a simple way of defining an ergodic kernel that has the desired stationary distribution. Suppose we have a sample (s(i), Θ(i)). A new candidate (s′, Θ′) is generated by sampling from a proposal distribution q(s, Θ|s(i), Θ(i)). We define the acceptance probability a by

a(s′, Θ′ ← s, Θ) = min(1, [φx(s′, Θ′)/Zx] / [φx(s, Θ)/Zx] · q(s, Θ|s′, Θ′) / q(s′, Θ′|s, Θ)).

The new candidate (s′, Θ′) is accepted as the next sample (s(i+1), Θ(i+1)) with probability a; otherwise (s(i+1), Θ(i+1)) ← (s(i), Θ(i)). The algorithm is initialised by generating the first sample (s(0), Θ(0)) according to an (arbitrary) proposal distribution.

However, for a given transition kernel T, it is hard to assess the time required to converge to the stationary distribution, so in practice one has to run the simulation until a very large number of samples have been obtained [32]. The choice of the proposal distribution q is also very critical. A poor choice may lead to the rejection of many new candidates and consequently to a very slow convergence to the stationary distribution. In the context of source separation, where the dimension is large, it is not easy to design a proposal that makes global jumps, as suggested by T(s(i+1), Θ(i+1)|s(i), Θ(i)), where a whole block of variables is updated in parallel.
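As a minimal illustration of the Metropolis–Hastings recipe, the sketch below implements a random-walk sampler for a generic unnormalised log-target; the toy bimodal target and the step size are stand-ins chosen for this example, not quantities from the source separation model.

```python
import numpy as np

def metropolis_hastings(log_phi, theta0, n_samples, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings for an unnormalised log-target log_phi.

    The symmetric Gaussian proposal makes the q-ratio cancel in the acceptance
    probability, and Z_x cancels because only ratios of phi appear.
    """
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_samples, theta.size))
    log_p = log_phi(theta)
    for i in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.size)
        log_p_new = log_phi(proposal)
        # accept with probability min(1, phi(proposal) / phi(theta))
        if np.log(rng.uniform()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
        samples[i] = theta
    return samples

# Toy usage: a bimodal 1-D target standing in for a multimodal posterior
log_target = lambda t: np.logaddexp(-0.5 * (t[0] - 2.0) ** 2, -0.5 * (t[0] + 2.0) ** 2)
draws = metropolis_hastings(log_target, theta0=[0.0], n_samples=5000)
print(draws.mean(), draws.std())
```

With a local random-walk proposal the chain moves between the two modes only occasionally, which previews the mixing problems discussed below for high-dimensional source separation posteriors.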

3.2. Gibbs sampling

A simpler approach is to sample the variables one by one or in smaller, mutually disjoint blocks. For this purpose, we group all variables into mutually exclusive sets Cα, which we call clusters. The subscript α = 1, . . . , A denotes the unique cluster index and C = {C1, . . . , Cα, . . . , CA} is the set of all clusters. Often, there is a huge number of choices for grouping the variables. A natural choice for the source separation problem is4

C = {s1,1:N, . . . , sK,1:N, a1, . . . , aM, r1, . . . , rM, v1,1, . . . , vk,n, . . . , vK,N, λ1, . . . , λN}.   (4)

Given a clustering, one can apply a more specialised Markov chain Monte Carlo (MCMC) algorithm, the Gibbs sampler, by choosing the proposal density as the full conditional distributions p(Cα|C¬α), where C¬α ≡ C\Cα, i.e., all clusters but Cα. It also turns out that the acceptance probability becomes a = 1 and all the proposals are accepted by default.

4 Note that in this choice, for each k, we cluster the sources sk,1, . . . , sk,N jointly as sk,1:N. This particular blocking leads to better convergence because, whilst sk,1:N are a priori independent, they are clearly a posteriori dependent given xk,1:M.


To complete the algorithm, we need to decide upon the order in which we will visit the individual clusters α. We will call such a sequence a propagation schedule πα(j), where j = 1, . . . , AI.

To generate a new candidate for a cluster Cα, we freeze the configuration of the remaining clusters C(j−1)¬α (where α = πα(j)) and compute the full conditional distribution p(Cα|C(j−1)¬α) ∝ φx(Cα, C(j−1)¬α).

Theoretically, the actual order is not important and can be randomised, provided that each cluster is visited infinitely often in the limit when j → ∞. In practice, however, given limited computational resources, a particular propagation schedule may be favourable, since it may converge faster. Typically, periodic propagation schedules are employed where each cluster is visited once according to some permutation. The permutation is repeated I times, so each cluster is visited I times by the end of the simulation. It is also common practice to let the chain run for an initial burn-in period where each cluster is visited I0 times without computing any statistics.

The key observation is that, given a clustering of the variables C, the joint posterior admits the factorisation

φx(C) = ∏_β φx,β({Cα}α∈neigh(β)).   (5)

Here, the φx,β are positive and possibly unnormalised functions. This expression says that if cluster α appears in factor φx,β, we have α ∈ neigh(β), and the joint can be written as a product of such terms.5

The structure of the factorisation can be conveniently visualised using a factor graph, see Fig. 2 [26]. Each black square corresponds to a factor φx,β, where φx = ∏_β φx,β, and the circles denote the individual cluster variables Cα. A factor φx,β is connected to a cluster node α if α ∈ neigh(β), i.e., a variable in Cα appears in the defining expression of φx,β. Obviously the relation is symmetric: a cluster α is connected to φx,β when β ∈ neigh(α).

For a given cluster α, we could write Eq. (5) as

logφx(C) =∑

β∈neigh(α)

logφx,β

({Ca}a∈neigh(β)

) +∑

β /∈neigh(α)

logφx,β

({Ca}a∈neigh(β)

), (6)

p(C|C(j−1)

¬α

) ∝ φx

(Cα;C(j−1)

¬α

) = exp

( ∑β∈neigh(α)

logφx,β

({Ca}a∈neigh(β)

)). (7)

Given the configuration C(j−1)¬α , the second term in Eq. (6) does not depend on α, hence only a small subset of

clusters in ¬α actually effect the expression of the full conditional. This subset, which is known as the Markovblanket of α, will be denoted by ¬α. The Markov blanket consists of nodes that share a common factor with α, i.e.¬α ≡ {α′ : α′ ∈ ¬α ∧ ∃βs.t.(β ∈ neigh(α′) ∧ β ∈ neigh(α))}. For example, in Fig. 2, the Markov blanket of {vk,n}consists of {{λk,n}, {sk,1:N }}.

This simplification reduces the amount of computation required for computing the full conditional distribution.A further computational simplification is obtained when the expression for the full conditional corresponds to a knowndistribution. One way to ensure this is choosing the hierarchical model such that all priors are conjugate-exponential.6

In this case, all conditional marginals have a closed form solution. The Gibbs sampling algorithm is summarisedbelow:

• Initialise (or Burn in):

For α = 1, . . . ,A, C(0)α ∼ q(Cα).

5 As an example, suppose φ(s, v) = p(s|v)p(v) = N(s; 0, v) IG(v; 1, 1) and we choose C = {C1, C2} = {{v}, {s}}. Then

φ(s, v) ≡ exp(−(1/2) s²/v − (1/2) log 2πv) exp(−2 log v − 1/v) ∝ exp(−(1/2) s²/v) exp(−(5/2) log v − 1/v) = φ1(s, v) φ2(v) ≡ ∏_{β=1}^{2} φβ

and we would have neigh(β = 1) = {1, 2} and neigh(β = 2) = {1}.

6 Consider the example from the previous footnote:

p(v|s = s) ∝ φ(s, v) ∝ exp(−(1/2) s²/v) exp(−(5/2) log v − 1/v) = exp(−(5/2) log v − (1 + (1/2) s²)(1/v)) ∝ IG(v; 3/2, 2/(2 + s²)).

The full conditional p(v|s = s) has the same form as the prior p(v).


Fig. 2. Factor graph corresponding to the clustering of the variables as defined in Eq. (4).

• Sample the chain (typically J = AI):
For j = 1, . . . , J
α = πα(j),   C(j)α ∼ p(Cα|C(j−1)¬α).
• Estimate:
〈Cα〉 ≈ (1/I) ∑_{j s.t. πα(j)=α} C(j)α.
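The sketch below instantiates this sampler for the de-noising special case used in Section 4 (a single source, a single channel, and a known mixing coefficient a = 1), so that every full conditional is one of the conjugate updates of Appendix C. The defaults and the initialisation are assumptions made for this illustration.

```python
import numpy as np

def gibbs_denoise(x, nu=0.5, a_lam=0.5, b_lam=10.0, a_r=0.5, b_r=10.0,
                  n_iter=1000, burn_in=100, rng=None):
    """Gibbs sampler for the N = M = 1 de-noising case with a = 1 known.

    x is the (K,) array of observed coefficients; returns the MMSE estimate
    of the source coefficients.  IG(a, b) draws use 1 / Gamma(shape=a, scale=b)
    and G(a, b) draws use Gamma(shape=a, scale=b), matching Appendix A.
    """
    rng = rng or np.random.default_rng()
    K = x.size
    lam, r = a_lam * b_lam, 1.0          # start at prior mean / unit noise
    v, s = np.ones(K), np.zeros(K)
    s_sum = np.zeros(K)
    for it in range(n_iter):
        # s_k | v_k, r, x_k ~ N(m_k, S_k) with S_k = (1/v_k + 1/r)^-1
        S = 1.0 / (1.0 / v + 1.0 / r)
        s = S * x / r + np.sqrt(S) * rng.standard_normal(K)
        # v_k | lambda, s_k ~ IG((nu+1)/2, 2/(nu*lambda + s_k^2))
        v = 1.0 / rng.gamma((nu + 1) / 2, 2.0 / (nu * lam + s ** 2))
        # lambda | v ~ G(a_lam + K*nu/2, (1/b_lam + (nu/2) sum_k 1/v_k)^-1)
        lam = rng.gamma(a_lam + K * nu / 2,
                        1.0 / (1.0 / b_lam + (nu / 2) * np.sum(1.0 / v)))
        # r | s, x ~ IG(a_r + K/2, (1/b_r + (1/2) sum_k (x_k - s_k)^2)^-1)
        r = 1.0 / rng.gamma(a_r + K / 2,
                            1.0 / (1.0 / b_r + 0.5 * np.sum((x - s) ** 2)))
        if it >= burn_in:
            s_sum += s
    return s_sum / (n_iter - burn_in)
```

Each sweep visits the clusters in the fixed order {s, v, λ, r}; any other schedule that visits every cluster would be equally valid.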

3.3. Structured mean field-variational Bayes

One alternative approximation method, which leads to an iterative optimisation procedure, is the structured mean field method, also known as variational Bayes, see [8,16–18,33] and references therein. In our case, mean field boils down to approximating the integrand P = φx/Zx in (2) with a simple distribution Q in such a way that the integral becomes tractable. An intuitive interpretation of the mean field method is minimising the KL divergence with respect to (the parameters of) Q, where

KL(Q||P) = 〈log Q〉Q − 〈log (1/Zx) φx〉Q.   (8)

Using the non-negativity of KL [34], we obtain a lower bound on the evidence

log Zx ≥ 〈log φx〉Q − 〈log Q〉Q ≡ F[P, Q] + H[Q].

Here, F is interpreted as a negative energy term and H[Q] is the entropy of the approximating distribution. The maximisation of this lower bound is equivalent to finding the “nearest” Q to P in terms of KL divergence, and this solution is obtained by a joint maximisation of the entropy H and F (minimisation of the energy) [35].

In this paper, we choose a factorised approximating distribution Q that respects a factorisation of the form

Q = ∏_α Qα(Cα),


where the clusters are defined as in (4). Although a closed form solution for Q still cannot be found, it can be easily shown, e.g., see [17,36], that each factor Qα of the optimal approximating distribution should satisfy the following fixed point equation

Qα ∝ exp(〈log φx〉Q¬α),   (9)

where Q¬α ≡ Q/Qα, that is, the joint distribution of all factors excluding Qα. Hence, the mean field approach leads to a set of (deterministic) fixed point equations that need to be iterated until convergence.

The right-hand side of this fixed point iteration can be computed efficiently since the joint posterior admits the factorisation in (5), which translates to

〈log φx〉Q¬α = ∑_β 〈log φx,β〉Q¬α =+ ∑_{β∈neigh(α)} 〈log φx,β〉Q_{neigh(β)\α},

where f(x) =+ g(x) denotes equality up to some irrelevant constant c, i.e., f(x) = g(x) + c. The expectations 〈log φx,β〉 can be computed easily if all distributions are chosen to be in a conjugate-exponential family, for example, see [16]. One additional nice feature of this fixed point iteration is that at every step, the lower bound defined in Eq. (8) is guaranteed to increase. This feature is useful in ensuring that an implementation is bug free.

The variational Bayes algorithm is summarised below:

• Initialise:
For α = 1, . . . , A, set Q(0)α.
• Iterate the fixed point equations (typically J = AI):
For j = 1, . . . , J
α = πα(j),   Q(j)α = exp( ∑_{β∈neigh(α)} 〈log φx,β〉Q(j−1)_{neigh(β)\α} ).
• Estimate:
〈Cα〉P ≈ 〈Cα〉Q(J).

It is important to note the similarity between the Gibbs sampler and variational Bayes. Given the factor graph of Fig. 2, both algorithms have a simple “visual” interpretation. Remember that the factor nodes β ∈ neigh(α) (black squares in Fig. 2) define local compatibility functions between α and its Markov blanket α′ ∈ ¬α.

In the Gibbs sampler, on each cluster node α we store a configuration. If we wish to sample the cluster node α at step j of the simulation, all we need is the current configuration of the Markov blanket ¬α(j−1), and we need to evaluate φx,β for β ∈ neigh(α) pointwise. When the nodes are chosen as conjugate exponential, the expression for this full conditional density is already available in closed form as a function of the configuration of the Markov blanket.

In variational Bayes, on each cluster node α we store sufficient statistics. If we wish to update (the sufficient statistics of) the approximating distribution Qα, all we need is the current sufficient statistics of Q¬α, and we need to evaluate the expectation 〈φx,β〉Q¬α for β ∈ neigh(α). Again, if the nodes are chosen as conjugate exponential, the expression for Qα is already available in closed form as a function of the sufficient statistics of its Markov blanket.7

7 Consider the example from the first footnote and assume Q1(v) = IG(v; av, bv) and Q2(s) = N(s; ms, Ss), where av, bv, ms, Ss are variational parameters. By Eq. (9), we have the following fixed point equations:

log Q1(v) = 〈log φ1(s, v) φ2(v)〉Q2(s) = 〈log φ1(s, v)〉Q2(s) + log φ2(v) = −(1/2)〈s²〉(1/v) − (5/2) log v − 1/v = log IG(v; 3/2, 2/(2 + Ss + m²s)),

log Q2(s) = 〈log φ1(s, v) φ2(v)〉Q1(v) = 〈log φ1(s, v)〉Q1(v) = −(1/2) s² 〈1/v〉 = log N(s; 0, 1/(av bv)).
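A minimal numerical sketch of the fixed point iteration in footnote 7 for the toy model φ(s, v) = N(s; 0, v) IG(v; 1, 1); the loop simply alternates the two closed-form updates until the variational parameters settle (starting values and iteration count are arbitrary choices for this illustration).

```python
# Coordinate-wise VB updates for the toy model of footnotes 5-7:
#   Q1(v) = IG(v; a_v, b_v),  Q2(s) = N(s; m_s, S_s)
a_v, b_v = 1.0, 1.0          # arbitrary starting values
m_s, S_s = 0.0, 1.0

for _ in range(50):
    # Update Q1(v): a_v = 3/2, b_v = 2 / (2 + <s^2>), with <s^2> = S_s + m_s^2
    a_v, b_v = 1.5, 2.0 / (2.0 + S_s + m_s ** 2)
    # Update Q2(s): N(s; 0, 1/<1/v>), with <1/v> = a_v * b_v
    m_s, S_s = 0.0, 1.0 / (a_v * b_v)

print(a_v, b_v, m_s, S_s)    # converged variational parameters
```

Each update only reads the current sufficient statistics of the other factor, which is exactly the local message passing picture described above.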


3.4. Tempering

As illustrated above, while the underlying theoretical principles are different, both the Gibbs sampler and variational Bayes share the same local updating strategy, and both can suffer from slow convergence or get stuck in a local mode of the posterior. One general strategy to circumvent these problems is tempering.8 The idea is to define a sequence of target distributions

φx(τ1), . . . , φx(τj), . . . → φx

that converge to the desired posterior. The sequence τj is referred to as a tempering schedule. The idea is to use samples or sufficient statistics obtained from φx(τj−1) as the starting point for φx(τj). In general, the schedule can be quite arbitrary, but one simple strategy is to define a sequence 0 ≤ τ1 ≤ · · · ≤ τj ≤ · · · ≤ 1 and to raise the posterior to the power as φx(τj) = φx^τj. When τ is near zero, the posterior is flatter and the modes are not very pronounced.
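A sketch of how such a power schedule can be wrapped around the Metropolis–Hastings sketch given earlier (both `metropolis_hastings` and `log_phi` are the hypothetical helpers from that example, not functions from the paper):

```python
import numpy as np

def tempered_run(log_phi, theta0, taus=np.linspace(0.1, 1.0, 10), n_per_stage=500):
    """Guide a chain through a tempering schedule 0 < tau_1 <= ... <= 1.

    Each stage targets phi^tau, i.e. tau * log phi, and the last sample of a
    stage seeds the next, so early stages explore a flattened posterior and
    later stages refine towards the true one.
    """
    theta = np.asarray(theta0, dtype=float)
    for tau in taus:
        samples = metropolis_hastings(lambda t, tau=tau: tau * log_phi(t),
                                      theta, n_per_stage)
        theta = samples[-1]
    return theta
```

The same idea applies to variational Bayes by running the fixed point iterations against the tempered target and passing the sufficient statistics from one stage to the next.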

4. Simulation results

The first simulation is designed to illustrate both algorithms on a simpler problem where there is only one source (N = 1) that is observed through a single noisy observation channel (M = 1). This problem is in fact a de-noising problem where we wish to “separate” a single source from Gaussian noise.

We sample 200 independent problem instances from the model with K = 180 data points using hyper-parameters (aλ, bλ, ar, br, ν) = (0.5, 10, 0.5, 10, 0.5). The mixing matrix is set to A = a1 = 1 and is also assumed to be known. For each problem instance (x, s, v, r, λ)orig, we infer the posterior mean of the sources srec = 〈s〉p(s|xorig) and compare it with sorig. The parameter λorig defines the scale of the source and rorig denotes the variance of the observation noise. The difficulty of de-noising depends upon the signal-to-noise ratio λ/r, where cases with low λ/r are typically more difficult to de-noise.

For each of the 200 instances, we initialise the parameters as follows: prior means for Gibbs and prior sufficient statistics for VB. We run both algorithms using the same propagation schedule where we update the parameters periodically in the order {s, v, r, λ}. We run the Gibbs sampler for I = 1000 iterations with a burn-in period of I0 = 100 iterations. The variational algorithm, in contrast, is only run for I = 100 iterations, which seems to be sufficient to ensure its convergence.

The results for two typical problem instances are shown in Fig. 3, where we show the approximation to the posterior marginal p(λ, r|x) as computed by both inference methods. In both cases, the Gibbs sampler seems to be capturing the posterior marginal. In contrast, in the harder case, VB underestimates the observation noise variance by about an order of magnitude.

The results of the experiment are summarised in Fig. 4. We measure the quality of reconstruction by the SNR defined by

SNR(sorig, srec) = 10 log10( |sorig|² / |sorig − srec|² ).
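For reference, this criterion is a one-liner (a sketch; s_orig and s_rec are arrays of original and reconstructed coefficients):

```python
import numpy as np

def snr_db(s_orig, s_rec):
    """Reconstruction SNR in dB: 10 log10(|s_orig|^2 / |s_orig - s_rec|^2)."""
    return 10 * np.log10(np.sum(s_orig ** 2) / np.sum((s_orig - s_rec) ** 2))
```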

In the top panel of Fig. 4, we show each problem instance as a point (λorig, rorig) and classify it by the SNR criterion. Although this is a rather crude comparison, the emerging pattern suggests that Gibbs sampling achieves better performance for harder cases, whereas VB seems to be a viable choice for easier instances. A pairwise comparison of SNR values for the two methods shows that we can expect comparable performance from both methods, with a slight tilt towards the Gibbs sampler for the difficult cases. We obtain qualitatively similar behaviour in similar experiments in other reasonable parameter regimes, although the details of these pictures vary (such as the absolute dB levels).

We observe that, in the easy instances, both algorithms achieve essentially the same performance level, with VB requiring far less computation. In more difficult cases, the Gibbs sampler tends to be slightly superior. In a few isolated cases, the Gibbs sampler failed to converge in I = 1000 iterations, whereas VB converged quite quickly to a viable solution.

8 Different communities use different terms for this idea, such as bridging, annealing or overrelaxation. We prefer to reserve the term simulated annealing for the particular case where the inverse temperature τ goes to infinity to locate the modes of the posterior.


Fig. 3. Two typical runs of the inference algorithms. (Upper) easy, low noise; (lower) harder, high noise. The dots correspond to the samples obtained from the Gibbs sampler and the contour plot corresponds to the variational approximation. The round point denotes (λorig, rorig). Note that this picture only shows the approximation to the posterior marginal p(λ, r|x).

4.1. Source separation, synthetic example

Next, we illustrate both algorithms on synthetic data (K = 200) sampled from the model with N = 3 sources and M = 2 observation channels with (aλ, bλ, ar, br, ν) = (2, 10, 2, 1, 1). The first row of the mixing matrix A is set to a1 = [1 1 1]⊤ in order to remove the BSS indeterminacy on gain. The second row a2 is sampled from the prior and is assumed to be unknown. The observations x are shown in Fig. 5.


Fig. 4. Denoising problem. (Upper) 200 problem instances denoted by (λorig, rorig): instances where the Gibbs sampler (VB) achieves a higher SNR are shown with crosses (triangles). (Lower) Comparison of both algorithms in terms of SNR on the same set of instances.

We note that we use the same clustering of the variables for both Gibbs and VB, as given in Eq. (4). In particular, we cluster the sources sk,1, . . . , sk,N jointly as sk,1:N. This is because, whilst sk,1:N are a-priori independent, they are clearly a posteriori dependent given xk,1:M. In MCMC jargon, this is a simple form of blocking which improves convergence to the stationary distribution [29]. In the case of VB, a detailed comparison of using a richer approximation versus a simpler factorised approximation is given by [37] for ICA. It is also easy to see that, given this


Fig. 5. Observations xk,m.

Fig. 6. Source separation, synthetic example with 250 epochs of Gibbs with tempering and 250 epochs of VB.

particular model topology, introducing an even richer approximation that tries to capture correlations across k (e.g., Q(s1:K,1:N)), whilst in principle tractable, will not give us a better approximation; see [38] for a discussion of choosing tractable approximations in general graphical models.

Our initial simulations with various K ≤ 2000 have revealed that the posterior is multimodal and that the solutions obtained by both methods depend critically upon the starting configuration. The Gibbs sampler was slightly superior but the performance was not too different. Similarly to [13], we have added a tempering schedule where we modified the scale parameter of the observation noise variance as log10 br = −8, . . . , 0 in I/2 epochs. This has the effect of slowly blending in the data contributions. Other approaches, such as tempering ν, were not as useful.

We have found a hybrid approach most effective: switching between the Gibbs sampler and VB during the simulation cycles. In Fig. 6, we show a typical simulation run where we switch between Gibbs and VB every 250 epochs. When we switch to the Gibbs sampler, we apply a tempering schedule for 300 epochs. This enables the chain to jump between modes. Once we switch back to VB, the chain quickly converges to a “nearby” solution. In Fig. 7, we show the reconstruction of the sources as obtained each time before switching to Gibbs sampling. As a careful investigation shows, each reconstruction assigns source coefficients with high amplitude to different sources.


Fig. 7. Reconstructions (from top) and original (bottom). As shown in Fig. 6, during the first switch at epoch 500 the simulation has not converged and the reconstruction is poor (top panel). Subsequent reconstructions are close to the original with only a few source coefficients assigned incorrectly. The final reconstruction is close to the original and the mixing system is estimated accurately.

4.2. Audio source separation example

We study a linear instantaneous mixture of N = 3 audio sources (speech, piano, guitar) with M = 2 observation channels.9 The signals are sampled at 8 kHz with length T = 65,356 samples (≈ 8 s). The actual time series xt,m for t = 1, . . . , T observed at channel m are transformed by an MDCT basis [39] into new series xk,m, where the new index k encodes both the frequency band and frame number jointly (a particular tile in a time-frequency plane). The MDCT was used with a sine window with time resolution 64 ms.

In the first experiment, we set the first row a⊤1 of the mixing matrix to a1 = [1 1 1]⊤. The second row a⊤2 is set to [tan ψ1 tan ψ2 tan ψ3] with ψ1 = −45◦, ψ2 = 15◦, and ψ3 = 75◦. The evolution of the parameters during the simulation run can be seen in Fig. 8.

Sources are reconstructed by inverse MDCT of the estimated source coefficients s. The reconstructions are compared to the original sources using the source separation evaluation criteria defined in Appendix D and described in great detail by [40]. The source-to-distortion ratio (SDR) provides an overall separation performance criterion, the source-to-interferences ratio (SIR) measures the level of interference from the other sources in each source estimate, the source-to-noise ratio (SNR) measures the error due to the additive noise on the sensors, and the source-to-artifacts ratio (SAR) measures the level of artifacts in the source estimates. The performance criteria are reported in Table 1.

9 These results are reproduced from an earlier version of this paper [27].


Fig. 8. Source separation, synthetic example with 70 epochs of Gibbs and 430 epochs of VB. Original parameters are shown with dashed lines.

Table 1
Performance criteria (dB) of estimated sources with both methods in the first experiment

             s1                        s2                        s3
             SDR   SIR   SAR   SNR    SDR   SIR   SAR   SNR    SDR   SIR   SAR   SNR
MCMC         6.3   15.0  7.3   20.4   5.1   14.3  5.8   27.8   16.6  23.7  17.8  29.8
Variational  6.4   15.4  7.3   21.7   5.2   15.0  5.8   24.8   16.6  25.3  17.5  29.7

We point out that the performance criteria are invariant to a change of basis, so that the figures can be computed either on the time sequences or on the MDCT coefficients.

In the second experiment we have relaxed the condition that we know the first row. We set the mixing matrix such that the columns are almost linearly dependent, with a1 = [1 1 1]⊤ and a2 = [tan ψ1 tan ψ2 tan ψ3] with ψ1 = 57◦, ψ2 = 66◦, and ψ3 = 75◦. The inference problem becomes harder and we observe that both the Gibbs sampler and VB can get trapped in local maxima easily, so that multiple restarts are necessary. To render the approach feasible it was important to find a reasonable initialisation for the mixing matrix. For this purpose, we have introduced a fast initialisation scheme which can be obtained from a simpler prior structure imposed upon the sources in the zero noise limit (see Appendix E). Our approach is analogous to the initialisation of Bayesian Gaussian mixture estimation with a fast method such as k-means clustering. The initialisation and reconstruction results are shown in Fig. 9. As expected, the initialisation assigns each time-frequency atom to only one source. Perhaps surprisingly, the separation quality of the initialisation is comparable to the final reconstruction. In subsequent runs with the original model, the source assignments become less crisp; while the quality of separation is not significantly higher, the sound quality of the reconstruction is slightly better (increased SDR and SNR in Table 2).

The estimated sources can be downloaded from http://www-sigproc.eng.cam.ac.uk/~atc27/papers/cemgil-dsp-bss.html, which is perhaps the best way to assess the audio quality of the results.


Fig. 9. The logarithm of the absolute value of the MDCT coefficients of the original signal (top row), initialisation (middle row), and reconstructions (bottom row).

Table 2
Performance criteria (dB) of estimated sources in the second experiment

      Initialisation                Reconstruction
      SDR   SIR   SAR   SNR        SDR   SIR   SAR   SNR
s1    5.1   20.5  5.3   48.6       5.9   15.0  6.6   64.5
s2    4.2   16.0  4.6   49.9       4.5   14.1  5.2   62.9
s3    10.6  25.5  10.8  58.7       16.1  25.0  16.7  74.1

5. Conclusions

We have studied two inference methods, the Gibbs sampler and variational Bayes, for Bayesian source separation. While both methods are algorithmically similar, they have different characteristics. Our simulations suggest that the inference problem can be hard and the posterior may display multiple local maxima.

In many source separation scenarios, the posterior distribution tends to be multimodal, with each mode typically corresponding to an alternative interpretation of the observed data. For example, the exact posterior marginal on the mixing matrix A has multiple modes (apart from the N! equivalent modes corresponding to the BSS indeterminacy on permutation of the sources). Each of these modes implies an alternative hypothesis about the mixing process, and typically only one of these modes corresponds to the desired/original separation; see [24,41] for illustrative examples. The implication of the multimodality is that local algorithms such as the Gibbs sampler or greedy algorithms such as VB will get trapped in a single mode, depending upon the starting configuration. In theory, the basic Gibbs sampler should be able to visit all modes but this usually takes prohibitively long. Therefore, it seems to be advisable in practice to use hybrid approaches where a stochastic algorithm is used to explore the state space at a high temperature (i.e., overrelaxation) and occasionally to switch to a deterministic algorithm to converge quickly to a nearby mode. Alternatively, multiple independent restarts can be tried to locate a starting point in the vicinity of a mode.

Another important issue is the convergence rate. It is known that algorithms based on fixed point iterations such as VB (see [42] and references therein) suffer from slow convergence, especially in low observation noise regimes. Intuitively, low observation noise implies that latent sources and hyperparameters tend to be strongly correlated under the posterior due to “explaining away” effects. This renders coordinate ascent methods prohibitively slow. For similar reasons, the convergence of the Gibbs sampler to the target posterior is also hampered. Overrelaxation is a standard technique that can be applied to speed up convergence [43]. It is also possible to compute the gradient of the data likelihood and use gradient descent with adaptive step sizes [25].

Typically, in extreme cases, when there are a lot of data points or only a few data points, the posterior distribution may be rendered unimodal. In the first case, the posterior is peaked around the MAP solution, and in the latter case it is diffuse and typically quite close to the prior. In both cases, the inference problem may in fact be easy and one would not expect big qualitative differences in the solutions found by both algorithms. This seems to be the picture in our earlier simulation studies with real data.

As we show in the appendices, both algorithms (VB and Gibbs sampler) differ only in a few lines of code, and with minimal coding effort, either can be implemented when the other is already implemented. Hence, computation speed should be the only determining factor in choosing the inference algorithm. Moreover, the factor graph formalism enables one to mechanise the derivations and automate the code generation process. For example, the LaTeX code for the update equations in Appendix C is automatically generated from the model specification by a computer program (and only manually edited to enhance visual appearance). This is particularly attractive since for real applications one often has to build complex hierarchical prior models and the coding/testing cycle can be cumbersome.

We conclude by noting that VB and the Gibbs sampler are only two possible choices of inference algorithms. It will be interesting to see if sequential Monte Carlo techniques [44–46] or alternative deterministic message passing algorithms (EP, EC) [25,47,48] will be useful for large-scale Bayesian source separation problems.

Appendix A. Standard distributions in exponential form, their sufficient statistics and entropies

• Gamma

G(λ; a, b) ≡ exp((a − 1) log λ − λ/b − log Γ(a) − a log b),
〈λ〉G = ab,   〈log λ〉G = Ψ(a) + log(b),
H[G] ≡ −〈log G〉G = −(a − 1)Ψ(a) + log b + a + log Γ(a).

Here, Ψ denotes the digamma function defined as Ψ(a) ≡ d log Γ(a)/da.

• Inverse Gamma

IG(r; a, b) ≡ exp(−(a + 1) log r − 1/(br) − log Γ(a) − a log b),
〈1/r〉IG = ab,   〈log r〉IG = −(Ψ(a) + log(b)),
H[IG] ≡ −〈log IG〉IG = −(a + 1)Ψ(a) − log b + a + log Γ(a).

• Multivariate Gaussian

N(x; μ, P) = exp(−(1/2) x⊤P⁻¹x + μ⊤P⁻¹x − (1/2) μ⊤P⁻¹μ − (1/2) log |2πP|),
〈x〉N = μ,   〈xx⊤〉N = P + μμ⊤,
H[N] ≡ −〈log N〉N = (1/2) log |2πeP|.
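These sufficient statistics and entropies are the only distribution-specific quantities the updates below need; a small helper sketch (our own, in NumPy/SciPy, using the same parameterisation as above):

```python
import numpy as np
from scipy.special import gammaln, digamma

def gamma_stats(a, b):
    """<lambda>, <log lambda> and entropy of G(lambda; a, b) as in Appendix A."""
    return a * b, digamma(a) + np.log(b), \
           -(a - 1) * digamma(a) + np.log(b) + a + gammaln(a)

def inv_gamma_stats(a, b):
    """<1/r>, <log r> and entropy of IG(r; a, b) as in Appendix A."""
    return a * b, -(digamma(a) + np.log(b)), \
           -(a + 1) * digamma(a) - np.log(b) + a + gammaln(a)

def gaussian_stats(mu, P):
    """<x>, <x x^T> and entropy of N(x; mu, P)."""
    mu, P = np.atleast_1d(mu), np.atleast_2d(P)
    entropy = 0.5 * np.linalg.slogdet(2 * np.pi * np.e * P)[1]
    return mu, P + np.outer(mu, mu), entropy
```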


Appendix B. Summary of the model

B.1. Generative model

λn: Scale parameter of the nth source

λn ∼ G(λn; aλ, bλ).

vk,n: Variance of the kth coefficient of the nth source

vk,n | λn ∼ IG(vk,n; ν/2, 2/(νλn)).

sk,n: Source coefficient

sk,n | vk,n ∼ N(sk,n; 0, vk,n).

am: Vector of mixing proportions for the mth channel

am ∼ N(am; μa, Pa).

rm: Variance of observation noise in the mth channel

rm ∼ IG(rm; ar, br).

xk,m: Observed channel coefficient

xk,m | sk,1:N, am, rm ∼ N(xk,m; a⊤m sk,1:N, rm).

B.2. Expression of the joint posterior

φx ≡ (∏_{n=1}^{N} p(λn)) (∏_{k=1}^{K} [ (∏_{n=1}^{N} p(vk,n|λn) p(sk,n|vk,n)) (∏_{m=1}^{M} p(xk,m|sk,1:N, am, rm)) ]) × (∏_{m=1}^{M} p(am) p(rm)),

log φx = ∑_{n=1}^{N} ( (aλ − 1) log λn − λn/bλ − log Γ(aλ) − aλ log bλ )
+ ∑_{k=1}^{K} ∑_{n=1}^{N} ( −(ν/2 + 1) log vk,n − νλn/(2vk,n) − log Γ(ν/2) + (ν/2) log(ν/2) + (ν/2) log λn )
+ ∑_{k=1}^{K} ∑_{n=1}^{N} ( −(1/2) s²k,n/vk,n − (1/2) log 2πvk,n )
+ ∑_{m=1}^{M} ( −(1/2) a⊤m Pa⁻¹ am + μ⊤a Pa⁻¹ am − (1/2) μ⊤a Pa⁻¹ μa − (1/2) log |2πPa| )
+ ∑_{m=1}^{M} ( −(ar + 1) log rm − 1/(br rm) − log Γ(ar) − ar log br )
+ ∑_{k=1}^{K} ∑_{m=1}^{M} ( −(1/2) x²k,m/rm + (a⊤m sk,1:N) xk,m/rm − (1/2)(a⊤m sk,1:N)²/rm − (1/2) log 2πrm ).
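For completeness, a direct transcription of this log joint into code (a sketch with assumed array shapes; useful, e.g., as the log_phi argument of the Metropolis–Hastings sketch or for monitoring the Gibbs sampler):

```python
import numpy as np
from scipy.special import gammaln

def log_phi_x(x, s, A, r, v, lam, hyp):
    """Unnormalised log joint of Appendix B.

    x, s : (K, M) and (K, N) coefficient arrays;  A : (M, N) with rows a_m^T;
    r : (M,) noise variances;  v : (K, N);  lam : (N,);
    hyp : dict with a_lam, b_lam, a_r, b_r, nu, mu_a (N,), P_a (N, N).
    """
    a_lam, b_lam = hyp['a_lam'], hyp['b_lam']
    a_r, b_r, nu = hyp['a_r'], hyp['b_r'], hyp['nu']
    mu_a, P_a = hyp['mu_a'], hyp['P_a']
    P_a_inv = np.linalg.inv(P_a)

    # Gamma prior on the scales lambda_n
    lp = np.sum((a_lam - 1) * np.log(lam) - lam / b_lam
                - gammaln(a_lam) - a_lam * np.log(b_lam))
    # Inverse Gamma prior on the variances v_{k,n}
    lp += np.sum(-(nu / 2 + 1) * np.log(v) - nu * lam / (2 * v)
                 - gammaln(nu / 2) + (nu / 2) * np.log(nu / 2) + (nu / 2) * np.log(lam))
    # Gaussian prior on the source coefficients s_{k,n}
    lp += np.sum(-0.5 * s ** 2 / v - 0.5 * np.log(2 * np.pi * v))
    # Gaussian prior on the mixing vectors a_m and inverse Gamma prior on r_m
    for a_m in A:
        lp += (-0.5 * a_m @ P_a_inv @ a_m + mu_a @ P_a_inv @ a_m
               - 0.5 * mu_a @ P_a_inv @ mu_a
               - 0.5 * np.linalg.slogdet(2 * np.pi * P_a)[1])
    lp += np.sum(-(a_r + 1) * np.log(r) - 1 / (b_r * r)
                 - gammaln(a_r) - a_r * np.log(b_r))
    # Gaussian observation model
    mean = s @ A.T                       # entries a_m^T s_{k,1:N}
    lp += np.sum(-0.5 * (x - mean) ** 2 / r - 0.5 * np.log(2 * np.pi * r))
    return lp
```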


Appendix C. Structure of the Q distribution

Q = (∏_{n=1}^{N} Q(λn)) (∏_{k=1}^{K} ∏_{n=1}^{N} Q(vk,n)) (∏_{k=1}^{K} Q(sk,1:N)) (∏_{m=1}^{M} Q(am) Q(rm)).

C.1. Expressions of Markov blankets, conditional marginals, and Q distributions

λn: Scale parameter of the nth source

log φx(λn; v1:K,n) = (aλ + Kν/2 − 1) log λn − (1/bλ + (ν/2) ∑_{k=1}^{K} 1/vk,n) λn,

p(λn|v1:K,n) = G(αλ,n, βλ,n),
αλ,n = aλ + Kν/2,
βλ,n = (1/bλ + (ν/2) ∑_{k=1}^{K} 1/vk,n)⁻¹,

Q(λn) = G(αλ,n, βλ,n),
αλ,n = aλ + Kν/2,
βλ,n = (1/bλ + (ν/2) ∑_{k=1}^{K} 〈1/vk,n〉)⁻¹ = (1/bλ + (ν/2) ∑_{k=1}^{K} αv,k,n βv,k,n)⁻¹.

vk,n: Variance of the kth coefficient of the nth source

log φx(vk,n; λn, sk,n) = −(ν/2 + 1/2 + 1) log vk,n − (νλn/2 + s²k,n/2)(1/vk,n),

p(vk,n|λn, sk,n) = IG(αv,k,n, βv,k,n),
αv,k,n = (ν + 1)/2,
βv,k,n = (νλn/2 + s²k,n/2)⁻¹,

Q(vk,n) = IG(αv,k,n, βv,k,n),
αv,k,n = (ν + 1)/2,
βv,k,n = (ν〈λn〉/2 + 〈s²k,n〉/2)⁻¹ = 2/(ν αλ,n βλ,n + Ss,k,n + m²s,k,n).

sk,1:N: Source coefficients

log φx(sk,1:N; vk,1:N, a1:M, r1:M) = −(1/2) ∑_{n=1}^{N} (1/vk,n) s²k,n + ∑_{m=1}^{M} ( (1/rm) xk,m a⊤m sk,1:N − (1/2) Tr[(1/rm) am a⊤m sk,1:N s⊤k,1:N] ),


p(sk,1:N|vk,1:N, a1:M, r1:M) = N(sk,1:N; ms,k, Ss,k),
Ss,k = (diag(1/vk,1, . . . , 1/vk,N) + ∑_{m=1}^{M} (1/rm) am a⊤m)⁻¹,
ms,k = Ss,k ∑_{m=1}^{M} (1/rm) xk,m am,

Q(sk,1:N) = N(sk,1:N; ms,k, Ss,k),
Ss,k = (diag(〈1/vk,1〉, . . . , 〈1/vk,N〉) + ∑_{m=1}^{M} 〈1/rm〉〈am a⊤m〉)⁻¹
     = (diag(αv,k,1 βv,k,1, . . . , αv,k,N βv,k,N) + ∑_{m=1}^{M} αr,m βr,m (Sa,m + ma,m m⊤a,m))⁻¹,
ms,k = Ss,k ∑_{m=1}^{M} 〈1/rm〉 xk,m 〈am〉 = Ss,k ∑_{m=1}^{M} αr,m βr,m xk,m ma,m.

am: Vector of mixing proportions for the mth channel

log φx(am; s1:K,1:N, r1:M) = −(1/2) Tr[Pa⁻¹ am a⊤m] + μ⊤a Pa⁻¹ am + ∑_{k=1}^{K} ( (1/rm) xk,m s⊤k,1:N am − (1/2) Tr[(1/rm) sk,1:N s⊤k,1:N am a⊤m] ),

p(am|s1:K,1:N, r1:M) = N(am; ma,m, Sa,m),
Sa,m = (Pa⁻¹ + (1/rm) ∑_{k=1}^{K} sk,1:N s⊤k,1:N)⁻¹,
ma,m = Sa,m (Pa⁻¹ μa + (1/rm) ∑_{k=1}^{K} xk,m sk,1:N),

Q(am) = N(am; ma,m, Sa,m),
Sa,m = (Pa⁻¹ + 〈1/rm〉 ∑_{k=1}^{K} 〈sk,1:N s⊤k,1:N〉)⁻¹,
ma,m = Sa,m (Pa⁻¹ μa + 〈1/rm〉 ∑_{k=1}^{K} xk,m 〈sk,1:N〉).

rm: Variance of observation noise in the mth channel

log φx(rm; am, s1:K,1:N) = −(ar + K/2 + 1) log rm − ( 1/br − ∑_{k=1}^{K} ( −(1/2) x²k,m + xk,m a⊤m sk,1:N − (1/2) Tr[am a⊤m sk,1:N s⊤k,1:N] ) )(1/rm),

p(rm|am, s1:K,1:N) = IG(rm; αr,m, βr,m),
αr,m = ar + K/2,


βr,m = ( 1/br + (1/2) ∑_{k=1}^{K} x²k,m − a⊤m ∑_{k=1}^{K} xk,m sk,1:N + (1/2) Tr[am a⊤m ∑_{k=1}^{K} sk,1:N s⊤k,1:N] )⁻¹,

Q(rm) = IG(rm; αr,m, βr,m),
αr,m = ar + K/2,
βr,m = ( 1/br + (1/2) ∑_{k=1}^{K} x²k,m − 〈a⊤m〉 ∑_{k=1}^{K} xk,m 〈sk,1:N〉 + (1/2) Tr[〈am a⊤m〉 ∑_{k=1}^{K} 〈sk,1:N s⊤k,1:N〉] )⁻¹.
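Taken together, the Q updates above translate almost line by line into code. The following is one possible NumPy implementation (a sketch; the initialisation, the update order within a sweep, and the hyper-parameter defaults are our own choices):

```python
import numpy as np

def vb_source_separation(x, N, nu=1.0, a_lam=2.0, b_lam=10.0, a_r=2.0, b_r=1.0,
                         mu_a=None, P_a=None, n_iter=200, rng=None):
    """VB fixed point iterations of Appendix C for the (K, M) coefficient array x.

    Returns the posterior means of the source coefficients (K, N) and of the
    mixing matrix (M, N) under the factorised approximation Q.
    """
    rng = rng or np.random.default_rng(0)
    K, M = x.shape
    mu_a = np.zeros(N) if mu_a is None else mu_a
    P_a = np.eye(N) if P_a is None else P_a
    P_a_inv = np.linalg.inv(P_a)

    alpha_v, beta_v = (nu + 1) / 2, np.ones((K, N))
    alpha_lam, beta_lam = a_lam + K * nu / 2, np.full(N, b_lam)
    alpha_r, beta_r = a_r + K / 2, np.full(M, b_r)
    m_a = rng.standard_normal((M, N))
    S_a = np.tile(np.eye(N), (M, 1, 1))
    m_s, S_s = np.zeros((K, N)), np.tile(np.eye(N), (K, 1, 1))

    for _ in range(n_iter):
        # Q(s_{k,1:N}): Gaussian with moments built from <1/v>, <1/r>, <a a^T>
        aaT = S_a + np.einsum('mi,mj->mij', m_a, m_a)          # <a_m a_m^T>
        prec_like = np.einsum('m,mij->ij', alpha_r * beta_r, aaT)
        for k in range(K):
            S_s[k] = np.linalg.inv(np.diag(alpha_v * beta_v[k]) + prec_like)
            m_s[k] = S_s[k] @ (m_a.T @ (alpha_r * beta_r * x[k]))
        ssT = S_s + np.einsum('ki,kj->kij', m_s, m_s)           # <s_k s_k^T>
        s2 = np.einsum('kii->ki', S_s) + m_s ** 2               # <s_{k,n}^2>
        # Q(v_{k,n}) and Q(lambda_n)
        beta_v = 2.0 / (nu * alpha_lam * beta_lam + s2)
        beta_lam = 1.0 / (1.0 / b_lam + (nu / 2) * np.sum(alpha_v * beta_v, axis=0))
        # Q(a_m) and Q(r_m)
        sum_ssT, sum_xs = ssT.sum(axis=0), x.T @ m_s            # (N, N), (M, N)
        for m in range(M):
            S_a[m] = np.linalg.inv(P_a_inv + alpha_r * beta_r[m] * sum_ssT)
            m_a[m] = S_a[m] @ (P_a_inv @ mu_a + alpha_r * beta_r[m] * sum_xs[m])
        aaT = S_a + np.einsum('mi,mj->mij', m_a, m_a)
        for m in range(M):
            beta_r[m] = 1.0 / (1.0 / b_r + 0.5 * np.sum(x[:, m] ** 2)
                               - m_a[m] @ sum_xs[m]
                               + 0.5 * np.trace(aaT[m] @ sum_ssT))
    return m_s, m_a
```

Replacing each expectation by a draw from the corresponding full conditional (and storing samples instead of sufficient statistics) turns the same loop into the Gibbs sampler, which is the "few lines of code" difference mentioned in the conclusions.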

Appendix D. Definition of performance criteria for source separation

The criteria below assume that all the true source signals and all noise signals (if any) are known in advance; they are defined in detail by [40]. The signal reconstructed by a separation algorithm is denoted by srec. Since the noise and all other sources are known, one can compute a decomposition

srec = starget + enoise + einterf + eartif,

where einterf, enoise, eartif are respectively the interference, noise and artifact error terms and starget is the original signal (or a version obtained by some allowed distortion such as scaling). The criteria are defined as

SDR ≡ 10 log10( ‖starget‖² / ‖enoise + einterf + eartif‖² ),
SIR ≡ 10 log10( ‖starget‖² / ‖einterf‖² ),
SNR ≡ 10 log10( ‖starget + einterf‖² / ‖enoise‖² ),
SAR ≡ 10 log10( ‖starget + enoise + einterf‖² / ‖eartif‖² ).
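Given the four terms of the decomposition (computing the decomposition itself is the task of the evaluation framework of [40]; here the terms are simply assumed as inputs), the criteria are straightforward:

```python
import numpy as np

def bss_criteria(s_target, e_noise, e_interf, e_artif):
    """SDR, SIR, SNR and SAR in dB from an already-computed decomposition."""
    def ratio_db(num, den):
        return 10 * np.log10(np.sum(num ** 2) / np.sum(den ** 2))
    return {
        'SDR': ratio_db(s_target, e_noise + e_interf + e_artif),
        'SIR': ratio_db(s_target, e_interf),
        'SNR': ratio_db(s_target + e_interf, e_noise),
        'SAR': ratio_db(s_target + e_noise + e_interf, e_artif),
    }
```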

Appendix E. Initialisation

In iterative algorithms, having a good starting configuration is of paramount importance. This is especially the case for inference algorithms, such as variational Bayes or Gibbs sampling, that make local moves and may easily get stuck in local modes.

We will derive a fast initialisation algorithm, based on iterated conditional modes (ICM, coordinate ascent), from the following generative model in the zero noise limit:

ck ∼ π(ck),
sk,1:N | ck ∼ N(sk,1:N; 0, P(ck)),
xk,1:M | sk,1:N, r, A ∼ N(xk,1:M; A sk,1:N, rI),

where ck ∈ {1, . . . , N} is an indicator that selects the covariance structure on the jointly Gaussian sources. The N × N covariance matrix is chosen as

P(ck = n) = diag(ε, . . . , σ, . . . , ε),

where ε is a small positive number and σ is a large positive number at the nth position. This switching mechanism is a constrained form of the independent factor analysis (IFA) model [19] and is reminiscent of the source separation model of [49], which selects sources in a mutually exclusive way to explain each observation xk,1:M.

The full conditional distributions are given as


Fig. 10. (Top) The joint distribution of observations and typical A∗ found in independent runs with random reinitialisations, corresponding to increasing reconstruction error from left to right. The lines show the directions of the columns A∗(:, n). (Bottom) The histogram of reconstruction errors.

p(sk,1:N|ck, xk,1:M, A) = N(sk,1:N; mk, Sk),
Sk = (P(ck)⁻¹ + (1/r) A⊤A)⁻¹,
mk = (P(ck)⁻¹ + (1/r) A⊤A)⁻¹ (1/r) A⊤ xk,1:M.

When σ is large and we let ε → 0 and r → 0, then

mk(ck = n) = lim_{r→0} (P(n)⁻¹ + (1/r) A⊤A)⁻¹ (1/r) A⊤ xk = A† xk,1:M = [0, . . . , m∗k(n), 0, . . . , 0]⊤,
m∗k(n) ≡ A(:, n)† xk,1:M.

Here, A† denotes the Moore–Penrose pseudoinverse and A(:, n) the nth column of the matrix A; hence mk(n) is a vector whose only nonzero entry is at position n, and it is equal to the projection of the observation xk,1:M onto the nth column of the mixing matrix. The last equality can be verified by substituting (P⁻¹ + (1/r) A⊤A)⁻¹ = P − PA⊤(rI + APA⊤)⁻¹AP and taking the limit.

Assuming π(ck) is flat for ck = 1, . . . , n, . . . , N,

p(ck = n|sk) = exp(−(1/2) Tr[mk(n) mk(n)⊤ P(n)⁻¹]) / ∑_{n′} exp(−(1/2) Tr[mk(n′) mk(n′)⊤ P(n′)⁻¹]).

In the limit when ε → 0, p(ck) is a crisp distribution concentrated at c∗k = arg maxn m∗k(n)². Hence the maximum a posteriori reconstruction of the sources will be given by mk(c∗k), where the estimate of all sources sk,n for n ≠ c∗k will be zero and sk,c∗k = m∗k(c∗k).

log p(A|sk, xk) = ∑_{k=1}^{K} ( (1/r) Tr[sk x⊤k A] − (1/2)(1/r) Tr[sk s⊤k A⊤A] ),

A∗ = (∑_{k=1}^{K} xk mk(c∗k)⊤)(∑_{k=1}^{K} mk(c∗k) mk(c∗k)⊤)⁻¹.


In practice, when we compute A∗, we renormalise the columns so that ‖A∗(:, n)‖ = 1, which seems to improve convergence. In Fig. 10, we show the results from 100 independent runs. Each run takes only a couple of seconds. The resulting distribution of reconstruction errors confirms the fact that the posterior distribution is multimodal; in this case there seem to be three different local maxima.
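A compact sketch of this ICM initialisation (zero-noise limit, hard assignments and column renormalisation as described above; the iteration count and random starting matrix are assumptions):

```python
import numpy as np

def icm_initialise(x, N, n_iter=20, rng=None):
    """ICM initialisation of the mixing matrix in the zero-noise limit.

    x : (K, M) observed coefficients.  Each x_k is assigned to the single
    source whose mixing column best explains it, and A is then re-estimated
    by least squares from the induced one-sparse source estimates.
    """
    rng = rng or np.random.default_rng(0)
    K, M = x.shape
    A = rng.standard_normal((M, N))
    A /= np.linalg.norm(A, axis=0)                      # unit-norm columns
    for _ in range(n_iter):
        # m*_k(n) = A(:, n)^dagger x_k  (projection onto each column)
        proj = x @ A / np.sum(A ** 2, axis=0)           # (K, N)
        c = np.argmax(proj ** 2, axis=1)                # hard assignments c*_k
        m = np.zeros((K, N))
        m[np.arange(K), c] = proj[np.arange(K), c]      # one-sparse estimates
        # A* = (sum_k x_k m_k^T)(sum_k m_k m_k^T)^-1, then renormalise columns
        A = (x.T @ m) @ np.linalg.pinv(m.T @ m)
        A /= np.linalg.norm(A, axis=0) + 1e-12
    return A, m
```

The returned one-sparse source estimates also provide the crisp time-frequency assignment shown in the middle row of Fig. 9.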

References

[1] A. Mohammad-Djafari, A Bayesian estimation method for detection, localisation and estimation of superposed sources in remote sensing, in: SPIE'97, San Diego, July 1997.
[2] A. Mohammad-Djafari, A Bayesian approach to source separation, in: Proc. 19th International Workshop on Bayesian Inference and Maximum Entropy Methods (MaxEnt99), Boise, USA, August 1999.
[3] K.H. Knuth, Bayesian source separation and localization, in: SPIE'98: Bayesian Inference for Inverse Problems, San Diego, July 1998, pp. 147–158.
[4] K.H. Knuth, A Bayesian approach to source separation, in: Proc. 1st International Workshop on Independent Component Analysis and Signal Separation, Aussois, France, January 1999, pp. 283–288.
[5] K.H. Knuth, H.G. Vaughan, Convergent Bayesian formulations of blind source separation and electromagnetic source estimation, in: Maximum Entropy and Bayesian Methods (MaxEnt), Munich, 1998, pp. 217–226.
[6] D.B. Rowe, A Bayesian approach to blind source separation, J. Interdisciplin. Math. 5 (1) (2002) 49–76.
[7] D.B. Rowe, Multivariate Bayesian Statistics: Models for Source Separation and Signal Unmixing, Chapman & Hall, New York, 2003.
[8] D.J.C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge Univ. Press, Cambridge, 2003.
[9] J. Miskin, D. MacKay, Ensemble learning for blind source separation, in: S.J. Roberts, R.M. Everson (Eds.), Independent Component Analysis, Cambridge Univ. Press, Cambridge, 2001, pp. 209–233.
[10] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, Wiley, New York, 2001.
[11] A. Cichocki, S.I. Amari, Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications, Wiley, New York, 2003.
[12] C. Févotte, S.J. Godsill, P.J. Wolfe, Bayesian approach for blind separation of underdetermined mixtures of sparse sources, in: Proc. 5th International Conference on Independent Component Analysis and Blind Source Separation (ICA 2004), Granada, Spain, 2004, pp. 398–405.
[13] C. Févotte, S.J. Godsill, A Bayesian approach for blind separation of sparse sources, IEEE Trans. Speech and Audio Processing, in press, available at http://persos.mist-technologies.com/~cfevotte/.
[14] M. Zibulevsky, B.A. Pearlmutter, P. Bofill, P. Kisilev, Blind source separation by sparse decomposition, in: S.J. Roberts, R.M. Everson (Eds.), Independent Component Analysis: Principles and Practice, Cambridge Univ. Press, Cambridge, 2001.
[15] A. Jourjine, S. Rickard, O. Yilmaz, Blind separation of disjoint orthogonal signals: Demixing n sources from 2 mixtures, in: Proc. ICASSP-5, Istanbul, Turkey, June 2000, pp. 2985–2988.
[16] Z. Ghahramani, M. Beal, Propagation algorithms for variational Bayesian learning, Neural Inform. Process. Syst. 13 (2000).
[17] M. Wainwright, M.I. Jordan, Graphical models, exponential families, and variational inference, Technical Report 649, Department of Statistics, UC Berkeley, September 2003.
[18] C.M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, New York, 2006.
[19] H. Attias, Independent factor analysis, Neural Comput. 11 (4) (1999) 803–851.
[20] H. Lappalainen, Ensemble learning for independent component analysis, in: Proceedings of Int. Workshop on Independent Component Analysis and Signal Separation (ICA'99), Aussois, France, 1999, pp. 7–12.
[21] H. Valpola, Nonlinear independent component analysis using ensemble learning: Theory, in: Proceedings of the Second International Workshop on Independent Component Analysis and Blind Signal Separation, ICA 2000, Helsinki, Finland, 2000, pp. 251–256.
[22] M. Girolami, A variational method for learning sparse and overcomplete representations, Neural Comput. 13 (11) (2001) 2517–2532.
[23] P. Hojen-Sorensen, O. Winther, L.K. Hansen, Mean-field approaches to independent component analysis, Neural Comput. 14 (2002) 889–918.
[24] K. Chan, T.W. Lee, T.J. Sejnowski, Variational Bayesian learning of ICA with missing data, Neural Comput. 15 (2003) 1991–2011.
[25] O. Winther, K.B. Petersen, Flexible and efficient implementations of Bayesian independent component analysis, Neurocomputing, 2006, submitted for publication.
[26] F.R. Kschischang, B.J. Frey, H.-A. Loeliger, Factor graphs and the sum-product algorithm, IEEE Trans. Inform. Theory 47 (2) (2001) 498–519.
[27] A.T. Cemgil, C. Fevotte, S.J. Godsill, Blind separation of sparse sources using variational EM, in: 13th European Signal Processing Conference, Antalya, Turkey, 2005, EURASIP. URL http://www-sigproc.eng.cam.ac.uk/~cf269/eusipco05/sound_files.html.
[28] S. Moussaoui, D. Brie, A. Mohammad-Djafari, C. Carteret, Separation of non-negative mixture of non-negative sources using a Bayesian approach and MCMC sampling, IEEE Trans. Signal Process., in press.
[29] W.R. Gilks, S. Richardson, D.J. Spiegelhalter (Eds.), Markov Chain Monte Carlo in Practice, CRC Press, London, 1996.
[30] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, E. Teller, Equations of state calculations by fast computing machines, J. Chem. Phys. 21 (1953) 1087–1091.
[31] W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57 (1970) 97–109.
[32] G.O. Roberts, J.S. Rosenthal, Markov chain Monte Carlo: Some practical implications of theoretical results, Can. J. Statist. 26 (1998) 5–31.
[33] W. Wiegerinck, Variational approximations between mean field theory and the junction tree algorithm, in: UAI (16th conference), 2000, pp. 626–633.
[34] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, New York, 1991.

[35] R.M. Neal, G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, in: Learning in Graphical Models, MIT Press, Cambridge, MA, ISBN 0-262-60032-3, 1999, pp. 355–368.
[36] J. Winn, C. Bishop, Variational message passing, J. Machine Learn. Res. 6 (2005) 661–694.
[37] A. Ilin, H. Valpola, On the effect of the form of the posterior approximation in variational learning of ICA models, Neural Process. Lett. 22 (2) (2005).
[38] D. Barber, W. Wiegerinck, Tractable variational structures for approximating graphical models, in: M. Kearns, S. Solla, D. Cohn (Eds.), Advances in Neural Information Processing Systems (NIPS), 1999, pp. 183–189.
[39] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, San Diego, 1998.
[40] E. Vincent, R. Gribonval, C. Févotte, Performance measurement in blind audio source separation, IEEE Trans. Speech Audio Process. 14 (4) (2006) 1462–1469.
[41] L.K. Hansen, K.B. Petersen, Monaural ICA of white noise mixtures is hard, in: Proceedings of ICA 2003, pp. 815–820.
[42] K.B. Petersen, O. Winther, L.K. Hansen, On the slow convergence of EM and VBEM in low noise linear mixtures, Neural Comput. 17 (2005) 1–6.
[43] R.M. Neal, Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation, in: M.I. Jordan (Ed.), Learning in Graphical Models, Kluwer Academic, Dordrecht, 1998, pp. 205–225.
[44] A. Doucet, S. Godsill, C. Andrieu, On sequential Monte Carlo sampling methods for Bayesian filtering, Statist. Comput. 10 (3) (2000) 197–208.
[45] E. Sudderth, A. Ihler, W. Freeman, A. Willsky, Nonparametric belief propagation, in: Proceedings of IEEE Computer Vision and Pattern Recognition Conference (CVPR), 2003.
[46] M. Briers, A. Doucet, S.S. Singh, K. Weekes, Particle filters for graphical models, in: Proceedings of Nonlinear Statistical Signal Processing Workshop, IEEE, 2006.
[47] T. Minka, Divergence measures and message passing, Technical Report MSR-TR-2005-173, Microsoft Research, Cambridge, 2005.
[48] M. Opper, O. Winther, Expectation consistent approximate inference, J. Machine Learn. Res. (2005) 2177–2204.
[49] M. Davies, N. Mitianoudis, A simple mixture model for sparse overcomplete ICA, IEE Proceedings on Vision, Image and Signal Processing, February 2004.

A. Taylan Cemgil received his B.Sc. and M.Sc. in Computer Engineering from Bogazici University, Turkey, and his Ph.D. (2004) from Radboud University Nijmegen, the Netherlands, with a thesis entitled Bayesian music transcription. Between 2003 and 2005 he worked as a postdoctoral researcher at the University of Amsterdam on vision-based multi-object tracking. He is currently a research associate at the Signal Processing and Communications Lab., University of Cambridge, UK, where he cultivates his interests in machine learning methods, stochastic processes and statistical signal processing. His research is focused on developing computational techniques for audio, music and multimedia processing.

Cédric Févotte was born in Laxou, France, in 1977 and lived in Tunisia, Senegal and Madagascar until 1995. He graduated from the French engineering school École Centrale de Nantes and obtained the Diplôme d'Études Approfondies en Automatique et Informatique Appliquée in 2000. He then received the Diplôme de Docteur en Automatique et Informatique Appliquée jointly from École Centrale de Nantes and Université de Nantes in 2003. From Nov. 2003 to Mar. 2006 he was a research associate with the Signal Processing Laboratory at the University of Cambridge, working on Bayesian approaches to many audio signal processing tasks such as audio source separation, denoising and feature extraction. From May 2006 to Feb. 2007 he was a researcher with the startup company Mist-Technologies (Paris), working on mono/stereo to 5.1 surround sound upmix solutions. In Mar. 2007 he joined the Audio, Acoustics and Waves group at GET/Télécom Paris (ENST), where his interests generally concern statistical signal processing with audio applications, and in particular object-based representations of sound.

Simon J. Godsill is Professor in Statistical Signal Processing in the Engineering Department of Cambridge University. He is an Associate Editor for IEEE Trans. Signal Processing and the journal Bayesian Analysis, and is a member of the IEEE Signal Processing Theory and Methods Committee. His research interests include Bayesian and statistical methods for signal processing, Monte Carlo algorithms for Bayesian problems, modelling and enhancement of audio and musical signals, source separation, tracking and genomic signal processing. He has published extensively in journals, books and conferences. In 2002 he co-edited a special issue of IEEE Trans. Signal Processing on Monte Carlo Methods in Signal Processing, as well as a recent special issue of the Journal of Applied Signal Processing, and he has organised many conference sessions on related themes.