
2013 Sixth International Conference on Advanced Computational Intelligence, October 19-21, 2013, Hangzhou, China

A general algorithm scheme mixing computational intelligence with Bayesian simulation

Bin Liu, Member, IEEE, and Chunlin Ji

Abstract- In this paper, a general algorithm scheme which mixes computational intelligence with Bayesian simulation is proposed. This hybridization retains the advantage of computational intelligence in searching for optimal points and the ability of Bayesian simulation in drawing random samples from an arbitrary probability density. An adaptive importance sampling (IS) method is developed under this framework, and the objective is to obtain a feasible mixture approximation to a multivariate, multi-modal and peaky target density, which can only be evaluated pointwise up to an unknown constant. The parameter of the IS proposal is determined with the aid of simulated annealing as well as some heuristics. The performance of this algorithm is compared with a counterpart algorithm that does not involve any kind of computational intelligence. The result shows a remarkable performance gain due to the mixture strategy and so gives a proof-of-concept of the proposed scheme.

I. INTRODUCTION

Rigorous statistical analysis is playing an important role in data mining and machine learning applications. Statistical concepts, e.g. latent variables, regularization, model selection and spurious correlation, have appeared in widely noted data mining literature [1]-[2]. Bayesian analysis is a widely accepted paradigm for estimating unknown parameters from data and it has found great success in statistical practice. The appeal of the Bayesian approach stems from the transparent inclusion of prior knowledge, a straightforward probabilistic interpretation of parameter estimates, and greater flexibility in model specification. While Bayesian models have revolutionized the field of applied statistical work, they continue to rely on computational tools to support the necessary calculations. Bayesian simulation techniques, such as Markov chain Monte Carlo (MCMC) and importance sampling (IS), have revolutionized statistical practice since the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. However, for large datasets and for all but the most trivial models, an elaborate algorithm design is necessary for MCMC or IS to yield a satisfactory performance, so improved variants of MCMC and IS have been proposed, such as those in [3]-[14], to name just a few. In this paper, we provide a general algorithm scheme, see Fig. 1, which mixes computational intelligence into the framework of Bayesian simulation, with the purpose of enhancing the algorithm's performance and expanding its applicability to more complex models and datasets.

Bin Liu is with the School of Computer Science and Technology, Nanjing University of Posts and Telecommunications (email: [email protected]). Chunlin Ji is with Shenzhen Kuang-Chi Institute of Advanced Technology.


Fig. 1. The proposed algorithm scheme mixing computational intelligence with Bayesian simulation. (Block diagram: a parametric proposal, i.e. a parameterized model representing the discovered features/patterns, feeds the Bayesian simulation module (MCMC / importance sampling); the simulated random draws from a related distribution are passed to a module that searches for features/patterns in the simulated data via computational intelligence, whose output parameterizes the next proposal.)

Specifically, we present an example implementation of this scheme, in which simulated annealing (SA) along with some heuristics is hybridized into the process of IS. The resulting algorithm enjoys the advantage of SA in searching for the peaky modes of the target density, and so removes from the algorithm designer the burden of designing a feasible IS density to handle a high-dimensional peaky target density. The performance of the algorithm is investigated and the result demonstrates the performance gain yielded by the usage of computational intelligence, which gives a proof-of-concept of the proposed algorithm scheme.

II. THE PROPOSED ALGORITHM SCHEME

This section introduces the proposed algorithm framework. The key to this framework is the recognition that the success of Bayesian simulation relies heavily on the underlying proposal density used for generating candidate random samples, and that the resulting random samples output by Bayesian simulation are controlled to be distributed according to the (unknown) target distribution, namely the posterior in Bayesian statistics. These samples may therefore be used to search for and estimate properties/patterns of the target density via computational intelligence, and then to iteratively render the proposal density closer to optimal, in the sense that the properties measured from the current simulation samples are made to match those of the optimal density.

A schematic diagram of the proposed algorithm scheme is shown in Fig. 1. Given a parametric proposal density, the Bayesian simulation module produces a batch of simulated draws from a density closely related to the target density, based on an MCMC or IS mechanism. Then the computational intelligence module is invoked to search for


the features/patterns hidden in the target density, using the above simulated samples. The discovered features/patterns are then utilized to construct a parametric model, which will act as the proposal density for the next round of Bayesian simulation. The above iterative process is initialized by prescribing a parametric proposal for the Bayesian simulation module.
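To make the scheme concrete, the following is a minimal sketch of the iterative loop of Fig. 1. The function names (`simulate_draws`, `fit_parametric_model`) and the fixed iteration count are illustrative stand-ins, not part of the paper:

```python
import numpy as np

def mixed_ci_bayesian_simulation(init_proposal, simulate_draws,
                                 fit_parametric_model, n_iters=10):
    """Skeleton of the scheme in Fig. 1 (illustrative only).

    init_proposal        -- initial parametric proposal density
    simulate_draws       -- Bayesian simulation step (MCMC or IS) returning draws
                            (and weights) from a density related to the target
    fit_parametric_model -- computational-intelligence step that searches
                            features/patterns in the draws and returns a new proposal
    """
    proposal = init_proposal
    for _ in range(n_iters):
        draws, weights = simulate_draws(proposal)        # Bayesian simulation module
        proposal = fit_parametric_model(draws, weights)  # feature/pattern search module
    return proposal
```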

A. Connections to relevant works

The proposed algorithm scheme mentioned above has connections to some existing algorithms in the literature. Indeed, it can be seen as a generic framework that covers, or extends, some of these existing algorithms. From the Bayesian simulation perspective, the proposed scheme is closely connected to the adaptive IS algorithm in [14], the Sequential Monte Carlo (SMC) samplers [15]-[16], the annealed importance sampler [17], and the adaptive Metropolis-Hastings algorithm with independent proposals [18]-[21]. All of these methods involve an iterative operation of Monte Carlo simulation (either IS or MCMC), based on the recognition that the knowledge gained from former iterations is useful for the sampling procedure of the current iteration. The point to be stressed here is that an efficient mechanism is required to extract as much knowledge as possible from former iterations, as losing important information may result in an inefficient proposal density function and then lead to a chain reaction that deteriorates the final performance. By resorting to computational intelligence, the proposed scheme provides a candidate method to search for the features/patterns hidden in the data samples simulated in a former iteration, which are then used to build the proposal function at the current iteration.

III. AN EXAMPLE IMPLEMENTATION OF THE PROPOSED SCHEME

This section describes an example implementation of the proposed algorithm scheme and shows how to obtain a cross-fertilization between computational intelligence and Bayesian simulation via the developed algorithm framework. The most primitive version of this algorithm was presented at a workshop on SMC [22] and was then used for searching for extrasolar planets [23] by the first author.

Our algorithm is a kind of adaptive IS method. Compared to alternatives such as MCMC, IS is appealing in allowing for parallel implementations, easy assessment of the Monte Carlo error, and avoidance of the daunting issue of convergence diagnostics [14]. However, the success of IS depends on designing an appropriate proposal density q(·), which is required to mimic the target density π(·) and have heavier tails [13]. Building such an IS density can be quite difficult even in low dimensional settings. A general strategy to solve this problem presets a model structure for q(·|ψ) and then optimizes its parameter ψ via an iterative process, which can be summarized as follows:

• Draw i.i.d. random samples {θ^n}_{n=1}^N from q(·|ψ);
• Weight the samples by

w^n = \tilde{w}^n / W,   n = 1, 2, ..., N,   (1)

where

\tilde{w}^n = π(θ^n) / q(θ^n|ψ)   and   W = Σ_{n=1}^N \tilde{w}^n;   (2)

• Adapt the value of ψ, based on knowledge obtained from the weighted samples {θ^n, w^n}_{n=1}^N.

The above iteration stops when a criterion is met, e.g. the effective sample size (ESS) exceeds a given threshold [14]. Based on the above strategy, a number of adaptive importance sampling (AIS) algorithms have been proposed; see [14], [24]-[27], to name just a few. Such AIS algorithms are characterized by their ability to adapt the proposal parameter ψ automatically, while the underlying assumption is that the support region of the target density is already known, so that the focus is on finding an appropriate model to cover the structured, e.g. multi-modal, properties of the target density. This assumption may be violated when facing large datasets and/or more complicated models, which indicates a possible failure of these algorithms in such cases.
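As a rough illustration of the generic draw/weight/adapt loop and the ESS-based stopping rule just described, here is a minimal sketch. The callables `sample_q`, `log_q`, `log_target` and `adapt_proposal` are placeholders for the proposal sampler, log proposal density, log target density and adaptation rule; they are not part of the paper:

```python
import numpy as np

def effective_sample_size(norm_weights):
    # ESS = 1 / sum_i w_i^2 for normalized weights, cf. Eq. (19)
    return 1.0 / np.sum(norm_weights ** 2)

def generic_adaptive_is(sample_q, log_q, log_target, adapt_proposal,
                        psi, n_samples=1000, ess_threshold=500, max_iters=50):
    """Draw / weight / adapt loop; stops once the ESS exceeds a threshold."""
    for _ in range(max_iters):
        theta = sample_q(psi, n_samples)                 # draw from q(.|psi)
        log_w = log_target(theta) - log_q(theta, psi)    # unnormalized log-weights
        w = np.exp(log_w - np.max(log_w))
        w /= w.sum()                                     # normalized weights, Eqs. (1)-(2)
        if effective_sample_size(w) > ess_threshold:
            break
        psi = adapt_proposal(theta, w, psi)              # adapt psi from weighted samples
    return theta, w, psi
```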

Here a novel adaptive IS method is developed according to the framework described in Section II, to address the aforementioned problem of existing methods.

A. Annealed adaptive mixture importance sampling (AAMIS)

This subsection presents the AAMIS algorithm in detail. In AAMIS, the IS density is specified to be a mixture function as follows:

q(θ|ψ) = Σ_{m=1}^M α_m f_m(θ|η_m),   Σ_{m=1}^M α_m = 1,   α_m ≥ 0,   (3)

where α_m is the probability mass of the mth component f_m with parameters η_m, and ψ denotes the mixture model parameter. The Kullback-Leibler (KL) divergence is adopted as the metric to characterize the difference between a given IS density q and the target density π:

KL(π ‖ q) = ∫ π(θ) log [ π(θ) / q(θ|ψ) ] dθ.   (4)

Since the efficiency of IS requires that the IS density mimics the target density, here the goal is just to minimize (4) in terms of (α, η), which is equivalent to maximizing

∫ π(θ) log q(θ|α, η) dθ.   (5)

Note that, if a number of independently and identically distributed (i.i.d.) samples drawn from π are available, the task of maximizing (5) in terms of (α, η) becomes a standard problem of maximum likelihood estimation (MLE) of a mixture model. To this end, the Expectation-Maximization (EM) technique can be used, relying on the missing data structure of the mixture model [28]-[29].
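For clarity, here is a short derivation, using only the definitions above, of why minimizing (4) over (α, η) is equivalent to maximizing (5), and why i.i.d. samples from π turn (5) into mixture MLE:

```latex
% KL divergence between the target pi and the mixture q:
\mathrm{KL}(\pi \,\|\, q)
  = \int \pi(\theta)\log\pi(\theta)\,\mathrm{d}\theta
  - \int \pi(\theta)\log q(\theta\mid\alpha,\eta)\,\mathrm{d}\theta .
% The first integral does not depend on (alpha, eta); hence
\arg\min_{(\alpha,\eta)} \mathrm{KL}(\pi \,\|\, q)
  = \arg\max_{(\alpha,\eta)} \int \pi(\theta)\log q(\theta\mid\alpha,\eta)\,\mathrm{d}\theta ,
% which is (5). Given i.i.d. samples theta_1, ..., theta_N from pi, a Monte Carlo
% estimate of (5) is the (scaled) mixture log-likelihood maximized by EM:
\int \pi(\theta)\log q(\theta\mid\alpha,\eta)\,\mathrm{d}\theta
  \approx \frac{1}{N}\sum_{n=1}^{N}\log q(\theta_n\mid\alpha,\eta).
```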


On the other hand, if a feasible mixture density model is available, then naturally it can be taken as an IS density for sampling from the target density via a basic IS routine. At this point, the two tasks, MLE of a mixture model and IS sampling from the target density, become nested. In other words, the solution of one of them makes it much easier to solve the other. A big challenge appears if there are dominant yet peaky modes in the structure of the target density and there is not much prior knowledge about the locations of such peaky modes. To the best of our knowledge, there is no general algorithmic solution in the literature that is capable of handling peaky modes, multi-modality and Monte Carlo sampling all together.

Here an attempt is made by mixing SA, IS, EM and several heuristics within the proposed algorithm scheme introduced in Section II. To begin with, a mixture density q^0 is prescribed with the form of (3) and is used as an initial guess of the target density. The principle for determining the parameters of q^0 is to let the support of q^0 be flat, spreading over as large a region as possible, to guarantee that it covers the support regions of the target density. Next, a sequence of annealed distributions {π^n} evolving from q^0 to π is built as follows:

π^n = (q^0)^{1-φ_n} π^{φ_n},   n = 0, ..., p,   (6)

where {φ_n}_{n=0}^p denotes an artificially introduced time schedule, which satisfies 0 = φ_0 < φ_1 < ... < φ_n < ... < φ_p = 1.
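A minimal sketch of evaluating the annealed sequence (6) in log space is given below. The callables `log_q0` and `log_target` are assumed to return log densities up to additive constants, and the equally spaced schedule is only an illustrative choice (the paper fixes the schedule length p = 50 but does not prescribe equal spacing here):

```python
import numpy as np

def log_annealed_density(theta, phi_n, log_q0, log_target):
    """log pi^n(theta) = (1 - phi_n) * log q0(theta) + phi_n * log pi(theta), cf. Eq. (6)."""
    return (1.0 - phi_n) * log_q0(theta) + phi_n * log_target(theta)

# Example schedule: p = 50 steps, phi_0 = 0, ..., phi_p = 1 (equal spacing is illustrative).
phi = np.linspace(0.0, 1.0, 51)
```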

At time step 1, the goal is to generate a batch of random samples distributed according to π^1. First draw i.i.d. random samples from q^0, and then weight these samples via IS, which takes π^1 as the target density. In so doing, the importance weights {w_i}_{i=1}^N correct for sampling from a wrong but close distribution q^0. As a result, {θ_i, w_i}_{i=1}^N is qualified to be taken as a weighted sample from π^1. Given {θ_i, w_i}_{i=1}^N, the MLE of the parameters of a mixture approximation to π^1 can be solved via the EM mechanism. Denote by q^1 the resulting mixture approximation to π^1; then, at time step 2, q^1 can in turn be used as the IS density for generating random draws from a new target density π^2. Later a mixture approximation to π^2, denoted by q^2, is obtained in the same way as in the first iteration. This recursive procedure continues until we have obtained q^p, the mixture approximation to π. Then we just use q^p as the IS density to simulate random samples from π. One iteration of the above procedure is summarized as follows.

One iteration of AAMIS: Take the nth iteration as an example. The input consists of the mixture density function q^{n-1}, parameterized by (α^{n-1}, η^{n-1}), and π^n. The operation includes the following steps:

1. Generate samples {θ_i}_{i=1}^N from q^{n-1} and compute the normalized importance weights

w_i = [π^n(θ_i) / q^{n-1}(θ_i)] / Σ_{j=1}^N [π^n(θ_j) / q^{n-1}(θ_j)],   (7)

and the mixture posterior probabilities

ρ_m(θ_i | α^{n-1}, η^{n-1}) = α_m^{n-1} f_m(θ_i | η_m^{n-1}) / q^{n-1}(θ_i),   (8)

for i = 1, ..., N and m = 1, ..., M.

2. Update the parameters α and η as follows:

α_m^n = Σ_{i=1}^N w_i ρ_m(θ_i | α^{n-1}, η^{n-1}),   (9)

η_m^n = arg max_{η_m} Σ_{i=1}^N w_i ρ_m(θ_i | α^{n-1}, η^{n-1}) log f_m(θ_i | η_m),   (10)

for m = 1, ..., M.

3. Output q^n in the form of (3) with parameters (α^n, η^n).
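The following is a sketch of step 1 and the α-update (9), assuming the per-component log densities have already been evaluated; the η-update (10) is the weighted EM step detailed in the next subsection. The array layout and function name are illustrative:

```python
import numpy as np

def aamis_weights_and_alpha(log_pi_n, log_comp_dens, alpha_prev):
    """Sketch of Eqs. (7)-(9).

    log_pi_n      -- (N,) log pi^n(theta_i), up to an additive constant
    log_comp_dens -- (N, M) log f_m(theta_i | eta_m^{n-1})
    alpha_prev    -- (M,) mixing proportions of q^{n-1}
    """
    # log q^{n-1}(theta_i) = logsumexp_m [ log alpha_m + log f_m(theta_i) ]
    log_joint = log_comp_dens + np.log(alpha_prev)        # (N, M)
    log_q = np.logaddexp.reduce(log_joint, axis=1)        # (N,)

    log_w = log_pi_n - log_q                              # unnormalized, Eq. (7)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                          # normalized weights

    rho = np.exp(log_joint - log_q[:, None])              # Eq. (8); each row sums to 1
    alpha_new = w @ rho                                   # Eq. (9)
    return w, rho, alpha_new
```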

Both the Student's t-mixture model and the EM identities for solving Equation (10) are described in the following subsection.

B. MLE of Student's t-mixture

The multivariate t-distribution, rather than the commonly used Gaussian, is selected here for use as the mixture components in (3), due to the desirable heavy-tail property of the t distribution [30]. This subsection introduces the t-mixture model as well as the corresponding EM identities for solving Equation (10).

1) The Student's t mixture model: When a d-dimensional random variable Y follows the multivariate t distribution S(·|μ, Σ, ν) with center μ, positive definite inner product matrix Σ and degrees of freedom ν ∈ (0, ∞], this means that, given a weight τ, Y has a multivariate normal distribution, and that ντ follows a χ²_ν distribution, i.e. the weight τ is Gamma distributed:

Y | μ, Σ, ν, τ ~ N_d(μ, Σ/τ);   τ | μ, Σ, ν ~ Gamma(ν/2, ν/2),   (11)

where the Gamma(α, β) density function is

β^α τ^{α-1} exp(-βτ) / Γ(α),   τ > 0, α > 0, β > 0.   (12)

As ν → ∞, τ → 1 with probability 1, and Y becomes marginally N_d(μ, Σ). Performing standard algebraic operations to integrate τ out of the joint density of (Y, τ), we obtain the density function of the marginal distribution of Y, namely S(·|μ, Σ, ν):

S(Y|μ, Σ, ν) = Γ((ν+d)/2) |Σ|^{-1/2} / [ Γ(ν/2) (πν)^{0.5d} {1 + δ(Y, μ; Σ)/ν}^{0.5(ν+d)} ],   (13)

where

δ(Y, μ; Σ) = (Y - μ)^T Σ^{-1} (Y - μ)   (14)

denotes the Mahalanobis squared distance from Y to μ with respect to Σ. Substituting f_m(θ|η_m) with S(θ|μ_m, Σ_m, ν_m) in (3), we get the t-mixture model:

q(θ|ψ) = Σ_{m=1}^M α_m S(θ|μ_m, Σ_m, ν_m),   Σ_{m=1}^M α_m = 1,   α_m ≥ 0.   (15)
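A minimal sketch of evaluating the multivariate t log-density (13)-(14) and the t-mixture log-density (15), assuming NumPy arrays for the parameters:

```python
import numpy as np
from scipy.special import gammaln

def log_mvt_density(y, mu, sigma, nu):
    """log S(y | mu, Sigma, nu), the multivariate t log-density of Eq. (13)."""
    d = mu.shape[0]
    diff = y - mu
    # Mahalanobis squared distance, Eq. (14)
    delta = diff @ np.linalg.solve(sigma, diff)
    _, logdet = np.linalg.slogdet(sigma)
    return (gammaln(0.5 * (nu + d)) - gammaln(0.5 * nu)
            - 0.5 * d * np.log(np.pi * nu) - 0.5 * logdet
            - 0.5 * (nu + d) * np.log1p(delta / nu))

def log_t_mixture_density(y, alphas, mus, sigmas, nus):
    """log q(y | psi) for the t-mixture of Eq. (15)."""
    log_terms = [np.log(a) + log_mvt_density(y, m, s, v)
                 for a, m, s, v in zip(alphas, mus, sigmas, nus)]
    return np.logaddexp.reduce(log_terms)
```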


2) EM identities for MLE of the t-mixture: At this stage, we consider the case where η_m = {μ_m, Σ_m}, while ν_m is specified beforehand as a constant, e.g. 5 as used in our simulation test. Then the EM identities for computing η_m^n in (10) are:

μ_m^n = [ Σ_{i=1}^N w_i ρ_m(θ_i | α^{n-1}, η^{n-1}) u_m(θ_i; η_m^{n-1}) θ_i ] / [ Σ_{i=1}^N w_i ρ_m(θ_i | α^{n-1}, η^{n-1}) u_m(θ_i; η_m^{n-1}) ],   (16)

Σ_m^n = [ Σ_{i=1}^N w_i ρ_m(θ_i | α^{n-1}, η^{n-1}) u_m(θ_i; η_m^{n-1}) C_i^n ] / [ Σ_{i=1}^N w_i ρ_m(θ_i | α^{n-1}, η^{n-1}) ],   (17)

where

u_m(θ; η_m^{n-1}) = (ν + d) / [ ν + (θ - μ_m^{n-1})^T (Σ_m^{n-1})^{-1} (θ - μ_m^{n-1}) ]   (18)

and C_i^n = (θ_i - μ_m^n)(θ_i - μ_m^n)^T. Readers are referred to [14], [30] for the derivation of the above identities and more details on EM for t-mixture models.
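A sketch of the weighted EM update (16)-(18) for the component centers and scale matrices, assuming the responsibilities ρ from Eq. (8) and the normalized weights are already available; the array layout is illustrative:

```python
import numpy as np

def em_update_t_components(theta, w, rho, mus_prev, sigmas_prev, nu=5.0):
    """Weighted EM update of the t-mixture centers and scale matrices, cf. Eqs. (16)-(18).

    theta       -- (N, d) samples;  w -- (N,) normalized IS weights
    rho         -- (N, M) mixture posterior probabilities from Eq. (8)
    mus_prev    -- (M, d);  sigmas_prev -- (M, d, d);  nu -- fixed degrees of freedom
    """
    N, d = theta.shape
    M = rho.shape[1]
    mus_new = np.empty_like(mus_prev)
    sigmas_new = np.empty_like(sigmas_prev)
    for m in range(M):
        diff = theta - mus_prev[m]                               # (N, d)
        mahal = np.einsum('nd,nd->n', diff @ np.linalg.inv(sigmas_prev[m]), diff)
        u = (nu + d) / (nu + mahal)                              # Eq. (18)
        wr = w * rho[:, m]                                       # w_i * rho_m(theta_i)
        mus_new[m] = (wr * u) @ theta / np.sum(wr * u)           # Eq. (16)
        centered = theta - mus_new[m]
        sigmas_new[m] = (centered * (wr * u)[:, None]).T @ centered / np.sum(wr)  # Eq. (17)
    return mus_new, sigmas_new
```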

C. Component number adaptation

It is worthwhile to note that so far the number of components M has been assumed to be constant, so the initial component allocation for q^0 is especially important. However, it is not realistic to allocate an appropriate number of components in advance. To overcome this, an adaptive mechanism for allocating components is proposed based on heuristics. This mechanism includes three types of operations, namely deletion of negligible components, production of new components and merging of highly correlated components. Taking the nth iteration of the proposed algorithm as an example, these operations are described in the following.

1) Deletion of negligible components: The probability mass α_m^n calculated by Equation (9) represents the impact of the mth component on q^n. For each mixture component, say the mth one, compare its probability mass with a given threshold P_delete, 0 < P_delete ≪ 1/M. If α_m^n < P_delete, the mth component has a negligible impact on q^n, so it is removed and its probability mass is immediately redistributed uniformly to the remaining components. This operation helps to avoid wasting computational cost on negligible mixture components and the appearance of singular covariance matrices in the process of MLE of the mixture parameters.
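A small sketch of this deletion operation, assuming the mixture is stored as plain NumPy arrays (the representation is illustrative):

```python
import numpy as np

def delete_negligible_components(alphas, mus, sigmas, p_delete):
    """Drop components with mass below p_delete and redistribute the removed
    mass uniformly over the surviving components (operation 1)."""
    keep = alphas >= p_delete                 # with p_delete << 1/M, at least one survives
    removed_mass = alphas[~keep].sum()
    alphas = alphas[keep] + removed_mass / keep.sum()
    return alphas, mus[keep], sigmas[keep]
```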

2) Production of new components: This mechanism includes the following steps:

1) Set a trial mixture q_tr = q^n.
2) Using q_tr as the importance density, simulate a set of weighted samples {θ_i, w_i}_{i=1}^N from π^n via IS. Update the associated parameters of q_tr using the EM identities. Calculate the effective sample size [14]:

ESS = 1 / Σ_{i=1}^N w_i^2.   (19)

3) Select a threshold w_th satisfying 1 > w_th ≫ 1/N. If w_j ≤ w_th, where j = arg max_i w_i, i = 1, ..., N, end the procedure of adding new components and output the original mixture density q^n; otherwise, perform the following procedure: add a component, denoted τ, into q_tr. Initialize the parameters of component τ as follows: μ_τ = θ_j, Σ_τ = Σ_a and α_τ = 0.5 α_c, where a denotes the component from which θ_j was generated, and c = arg min_m α_m^n, for m = 1, ..., M. Then set α_c = α_τ for q_tr, ensuring that the sum of the probability masses of the components is 1.
4) Use q_tr as the importance density and calculate the ESS in the way presented in Step 2. If the resulting ESS is bigger than the original one, set q^n = q_tr and go to Step 1; otherwise, end the procedure of adding new components and output q^n.

Note that the ESS is a routinely used measure of degeneracy for IS [14]. In Step 3), if w_j > w_th, the ESS value is likely to be small; when this happens, a new component τ centered at θ_j is added to increase the importance density in that region, thereby increasing the ESS. We accept q_tr as the new IS density in Step 4), provided that q_tr leads to a bigger ESS value.
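A small sketch of the new-component initialization in Step 3 and the ESS-based acceptance test in Step 4, assuming the mixture is stored as a dict of NumPy arrays (this representation and the variable names in the final comment are illustrative):

```python
import numpy as np

def ess(w):
    """Effective sample size of normalized IS weights, Eq. (19)."""
    return 1.0 / np.sum(w ** 2)

def add_component_at(mix, theta_j, source_a):
    """Extend the mixture (dict with 'alpha', 'mu', 'sigma') by one component centered
    at theta_j; its scale matrix is copied from the component `source_a` that generated
    theta_j, and its mass is half the smallest existing component's mass, which is itself
    halved so the masses still sum to 1 (Step 3)."""
    c = int(np.argmin(mix['alpha']))
    half = 0.5 * mix['alpha'][c]
    alpha = np.append(mix['alpha'], half)
    alpha[c] = half
    mu = np.vstack([mix['mu'], theta_j])
    sigma = np.concatenate([mix['sigma'], mix['sigma'][source_a][None]], axis=0)
    return {'alpha': alpha, 'mu': mu, 'sigma': sigma}

# Acceptance test (Step 4, sketch): keep the extended mixture only if it raises the ESS.
# if ess(w_trial) > ess(w_original): q_n = q_trial
```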

3) Merging of highly correlated components: This procedure aims to merge highly correlated components in the mixture proposal, based on a quantity termed mutual information. The objective is to build a more parsimonious feature representation of the target density, thereby facilitating the search for appropriate features/patterns hidden in the target density.

The term mutual information is defined here as a metric to measure the relevance of two components in a mixture model. Assume that there are N equally weighted samples {θ^n}_{n=1}^N drawn from a mixture density q(·). According to the Monte Carlo principle, q(·) can be approximated by an empirical point-mass function as below:

q(θ) ≈ (1/N) Σ_{n=1}^N δ_{θ^n}(θ),   (20)

where δ_θ denotes the Dirac delta mass located at θ. Denote by ρ(m|θ^n) the posterior probability mass for the event that θ^n belongs to component m, i.e.,

ρ(m|θ^n) = α_m f_m(θ^n|η_m) / q(θ^n|ψ).   (21)

It is intuitive that components j and k contain completely overlapping information if the identity ρ(j|θ^n) = ρ(k|θ^n) holds for n = 1, 2, ..., N. Based on this heuristic, the mutual information between components j and k is defined to be:

MI(j, k) = (Z_j - \bar{Z}_j)^T (Z_k - \bar{Z}_k) / ( ‖Z_j - \bar{Z}_j‖ · ‖Z_k - \bar{Z}_k‖ ),   (22)

where Z_m = [ρ(m|θ^1), ..., ρ(m|θ^N)]^T, ‖·‖ denotes the Euclidean norm, and \bar{Z}_m = (1/N) Σ_{n=1}^N ρ(m|θ^n) 1_N. Here 1_N denotes the N-dimensional column vector with all elements equal to 1. Note that MI(j, k) ∈ [-1, 1], j ∈ {1, ..., M}, k ∈ {1, ..., M}, and MI(j, k) = 1 iff the jth and kth components are identical.

Within the IS framework, each data sample θ^n is associated with an importance weight w^n. Correspondingly, \bar{Z}_m is computed from the weighted samples, and the mutual information between components j and k becomes

MI(j, k) = (Z_j - \bar{Z}_j)^T (Z_k - \bar{Z}_k) / ( ‖Z_j - \bar{Z}_j‖ · ‖Z_k - \bar{Z}_k‖ ),   (23)

where

\bar{Z}_m = Σ_{n=1}^N w^n ρ(m|θ^n) 1_N.   (24)

A threshold T_merge, 0 < T_merge < 1, is set beforehand. For each pair of components j, k, if MI(j, k) > T_merge, merge them into one by the following operations:

α_τ = α_j + α_k,   (25)

μ_j ← (α_j μ_j + α_k μ_k) / α_τ,   (26)

Σ_j ← (α_j Σ_j + α_k Σ_k) / α_τ,   (27)

α_j ← α_τ,   (28)

after which component k is deleted from the mixture.
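A small sketch of the mutual-information test and the merge operations, following the weighted form (23)-(24) as reconstructed above (the array-based mixture representation is illustrative):

```python
import numpy as np

def mutual_information(Z_j, Z_k, w):
    """'Mutual information' between two components, cf. Eqs. (22)-(24): the normalized
    inner product of the centered responsibility vectors, where the centering uses
    the importance-weighted mean of each vector."""
    cj = Z_j - np.sum(w * Z_j)
    ck = Z_k - np.sum(w * Z_k)
    return float(cj @ ck / (np.linalg.norm(cj) * np.linalg.norm(ck)))

def merge_components(alpha, mu, sigma, j, k):
    """Merge component k into component j (Eqs. (25)-(28)) and drop component k."""
    a_tau = alpha[j] + alpha[k]
    mu[j] = (alpha[j] * mu[j] + alpha[k] * mu[k]) / a_tau
    sigma[j] = (alpha[j] * sigma[j] + alpha[k] * sigma[k]) / a_tau
    alpha[j] = a_tau
    keep = np.arange(len(alpha)) != k
    return alpha[keep], mu[keep], sigma[keep]
```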

IV. PERFORMANCE EVALUATION

In this section, the performance of our algorithm is evaluated through simulations. In the simulation, the target density is prescribed to be an outer product of 7 univariate densities, one per dimension, as follows:

1) a two-component mixture of the gamma densities G(10 + x | 2, 3) and G(10 - x | 2, 5);
2) a two-component mixture of the skew-normal densities skN(x | 3, 1, 5) and skN(x | -3, 3, -6);
3) S(x | 0, 9, 4);
4) a mixture of the beta density B(x + 3 | 3, 3) and the normal density N(x | 0, 1);
5) (1/2) ε(x | 1) + (1/2) ε(-x | 1);
6) skN(x | 0, 8, -3);
7) a three-component mixture of the normal densities N(x | -10, 0.1), N(x | 0, 0.15) and N(x | 7, 0.2).

This simulation case was used in the context of exoplanet search [31]. Here G(·|α, β) denotes the gamma distribution, skN(·|μ, σ, α) the skew-normal distribution, B(·|α, β) the beta distribution, and ε(·|λ) the exponential distribution. The 2nd dimension has two modes bracketing a deep ravine, the 4th dimension has one low, broad mode that overlaps a second, sharper mode, and the 7th dimension has 3 distinct, well-separated modes. Only the 5th dimension is symmetric. There is a range of tail behaviors as well, from Gaussian to heavy-tailed.
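A minimal sketch of how such a product-form target can be evaluated in log space. The two example marginals and their mixture weights below are placeholders for illustration only, not the paper's exact test density:

```python
import numpy as np
from scipy import stats

def make_product_log_target(marginal_logpdfs):
    """Return log pi(x) for a target that is an outer product of univariate densities:
    log pi(x) = sum_k log p_k(x_k)."""
    def log_target(x):
        x = np.atleast_2d(x)                 # (N, d)
        return sum(f(x[:, k]) for k, f in enumerate(marginal_logpdfs))
    return log_target

# Illustrative 2-D example only (placeholder marginals and weights):
marginals = [
    lambda x: stats.t.logpdf(x, df=4, loc=0, scale=3),                      # heavy-tailed dim
    lambda x: np.logaddexp(np.log(0.5) + stats.norm.logpdf(x, -10, 0.1),
                           np.log(0.5) + stats.norm.logpdf(x, 7, 0.2)),     # well-separated modes
]
log_pi = make_product_log_target(marginals)
```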

The proposed algorithm is initialized by choosing q^0 in (6) as a 7-variate t-mixture composed of 20 components with identical mixing proportions. For each component parameterized by (μ, Σ, ν), we select the value of each dimension of μ randomly and uniformly from [-100, 100], and specify Σ = 4 × 10^3 I_7, ν = 5, where I_7 denotes the 7-dimensional identity matrix. The IS sample size is N = 10^5. The length of the SA time schedule, i.e. p in (6), is 50. A scatter plot of the resulting samples from an example run of this algorithm is shown in Fig. 2.
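A sketch of this initialization, using the dict-of-arrays mixture representation from the earlier sketches (the representation and seed are illustrative; the component count, box, scale and degrees of freedom follow the values stated above):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_q0(n_components=20, dim=7, mu_range=(-100.0, 100.0),
            sigma_scale=4e3, nu=5.0):
    """Flat, wide initial t-mixture q^0: equal mixing proportions, centers drawn
    uniformly over a broad box, large identical scale matrices (4e3 * I), nu = 5."""
    return {
        'alpha': np.full(n_components, 1.0 / n_components),
        'mu': rng.uniform(mu_range[0], mu_range[1], size=(n_components, dim)),
        'sigma': np.tile(sigma_scale * np.eye(dim), (n_components, 1, 1)),
        'nu': np.full(n_components, nu),
    }
```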

The diagonal sub-figures present the estimated target density curves (computed from the resulting samples). The estimated target density is close to the true one. 100 independent Monte Carlo runs of this algorithm were performed, and the sampling result is consistently satisfactory in discovering the peaky modes and sampling from each mode of the target density.


Fig. 2. Scatter plot result for the simulation study. The exhibited bivariate samples are equally weighted (by a resampling mechanism). The diagonal sub-pictures show the estimated target density in red solid lines per dimension, compared with the true target density plotted by (nearly overlapped) black lines.


Fig. 3. Seven-dimensional outer product example: the resulting samples produced by a counterpart algorithm of the proposed one. On the diagonal, the curves are kernel density estimates of the importance samples in each corresponding dimension.

For performance comparison, a counterpart algorithm is considered, which is the same as the proposed one except that it does not use the simulated annealing strategy or the heuristics-based adaptation of the number of components. 100 Monte Carlo runs of this algorithm are also performed to sample from the same target density as above. The sampling result of one example run is plotted in Fig. 3. As shown, it fails to find all 3 peaky modes in the 7th dimension. In fact, this phenomenon of missing peaky modes appears in most of the Monte Carlo runs of this algorithm.


V. CONCLUDING REMARKS

This paper introduces a generic algorithm scheme developed by mixing computational intelligence with Bayesian simulation. The intention is to combine the advantages of both to gain a cross-fertilization effect, thereby enhancing performance in dealing with more challenging optimization or statistical analysis problems. An example implementation of this scheme is derived, whereby a computational intelligence technique is used to enhance the performance of a Bayesian simulation technique termed adaptive IS. The simulation result shows a remarkable performance gain due to the proposed mixture strategy.

A promising practical direction for this work lies in using the proposed scheme to develop more efficient computational intelligence methods. The example algorithm presented here can be tailored to develop information-guided SA methods [32]. The idea is to take full advantage of the intermediate information generated by Monte Carlo sampling and mixture density estimation to design highly efficient temperature ladders (cooling schedules) online.

VI. ACKNOWLEDGEMENTS

The first author thanks Profs. Jim Berger and Merlise Clyde of the Department of Statistical Science, Duke University, for many related discussions during his time there as a research scholar. This work was supported by the National Natural Science Foundation (NSF) of China (Grant Nos. 61302158, 61100135, 61170065, 61373017), the Provincial Science and Technology Plan (NSF) project of Jiangsu province (Grant No. BK20130869), the Natural Science research project for colleges and universities in Jiangsu province (Grant No. 13KJB520019), the NSF of Guangdong Province (Grant No. S2011040005153), the Shenzhen Special Fund for Strategic Emerging Industry (Grant No. ZD201111080127A), and the Shenzhen Outstanding Youth Project of the Fundamental Research Plan on Science & Technology (Grant No. JC201005280651A).

REFERENCES

[1] J. Elder and D. Pregibon, "Statistical perspective on knowledge discovery in databases," in Advances in Knowledge Discovery and Data Mining, Chap. 4, AAAI/MIT Press, 1996.
[2] C. Glymour, D. Madigan, and P. Smyth, "Statistical themes and lessons for data mining," Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 11-28, 1997.
[3] C. Andrieu and J. Thoms, "A tutorial on adaptive MCMC," Statistics and Computing, vol. 18, no. 4, pp. 343-373, 2008.
[4] R. V. Craiu, J. Rosenthal, and C. Yang, "Learn from thy neighbor: Parallel-chain and regional adaptive MCMC," Journal of the American Statistical Association, vol. 104, no. 488, pp. 1454-1466, 2009.
[5] J. A. Vrugt, C. J. F. Ter Braak, C. G. H. Diks, B. A. Robinson, J. M. Hyman, and D. Higdon, "Accelerating Markov chain Monte Carlo simulation by differential evolution with self-adaptive randomized subspace sampling," International Journal of Nonlinear Sciences and Numerical Simulation, vol. 10, no. 3, pp. 273-290, 2009.
[6] C. J. F. Ter Braak, "A Markov chain Monte Carlo version of the genetic algorithm Differential Evolution: easy Bayesian computing for real parameter spaces," Statistics and Computing, vol. 16, no. 3, pp. 239-249, 2006.
[7] B. Calderhead and M. Sustik, "Sparse approximate manifolds for differential geometric MCMC," Advances in Neural Information Processing Systems, pp. 2888-2896, 2012.
[8] C. Kollman, K. Baggerly, D. Cox, and R. Picard, "Adaptive importance sampling on discrete Markov chains," Annals of Applied Probability, pp. 391-412, 1999.
[9] T. T. Ahamed, V. S. Borkar, and S. Juneja, "Adaptive importance sampling technique for Markov chains using stochastic approximation," Operations Research, vol. 54, no. 3, pp. 489-504, 2006.
[10] D. Wraith, M. Kilbinger, K. Benabed, O. Cappé, et al., "Estimation of cosmological parameters using adaptive importance sampling," Physical Review D, vol. 80, no. 2, p. 023507, 2009.
[11] R. Douc, A. Guillin, J.-M. Marin, and C. P. Robert, "Convergence of adaptive mixtures of importance sampling schemes," The Annals of Statistics, vol. 35, no. 1, pp. 420-448, 2007.
[12] J. Cheng and M. J. Druzdzel, "AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks," Journal of Artificial Intelligence Research, vol. 13, no. 1, pp. 155-188, 2000.
[13] M. Oh and J. O. Berger, "Adaptive importance sampling in Monte Carlo integration," Journal of Statistical Computation and Simulation, vol. 41, no. 3-4, pp. 143-168, 1992.
[14] O. Cappé, R. Douc, A. Guillin, J.-M. Marin, and C. P. Robert, "Adaptive importance sampling in general mixture classes," Statistics and Computing, vol. 18, no. 4, pp. 447-459, 2008.
[15] P. Del Moral, A. Doucet, and A. Jasra, "Sequential Monte Carlo samplers," Journal of the Royal Statistical Society: Series B, vol. 68, no. 3, pp. 411-436, 2006.
[16] A. Jasra, A. Doucet, D. A. Stephens, and C. C. Holmes, "Interacting sequential Monte Carlo samplers for trans-dimensional simulation," Computational Statistics and Data Analysis, vol. 52, no. 4, pp. 1765-1791, 2008.
[17] R. M. Neal, "Annealed importance sampling," Statistics and Computing, vol. 11, no. 2, pp. 125-139, 2001.
[18] P. Giordani and R. Kohn, "Adaptive independent Metropolis-Hastings by fast estimation of mixtures of normals," Journal of Computational and Graphical Statistics, vol. 19, no. 2, pp. 243-259, 2010.
[19] L. Holden, R. Hauge, and M. Holden, "Adaptive independent Metropolis-Hastings," The Annals of Applied Probability, pp. 395-413, 2009.
[20] J. Gasemyr, "On an adaptive version of the Metropolis-Hastings algorithm with independent proposal distribution," Scandinavian Journal of Statistics, vol. 30, no. 1, pp. 159-173, 2003.
[21] J. M. Keith, D. P. Kroese, and G. Y. Sofronov, "Adaptive independence samplers," Statistics and Computing, vol. 18, no. 4, pp. 409-420, 2008.
[22] B. Liu, "Adaptive t-mixture importance sampling method," talk given at the transition workshop of the SAMSI research program on Sequential Monte Carlo Methods, 2009.
[23] T. J. Loredo, J. O. Berger, D. F. Chernoff, M. A. Clyde, and B. Liu, "Bayesian methods for analysis and adaptive scheduling of exoplanet observations," Statistical Methodology, vol. 9, pp. 101-114, 2012.
[24] M. Evans, "Chaining via annealing," The Annals of Statistics, vol. 19, no. 1, pp. 382-393, 1991.
[25] M. S. Oh and J. O. Berger, "Integration of multimodal functions by Monte Carlo importance sampling," Journal of the American Statistical Association, vol. 88, no. 421, pp. 450-456, 1993.
[26] O. Cappé, A. Guillin, J.-M. Marin, and C. P. Robert, "Population Monte Carlo," Journal of Computational and Graphical Statistics, vol. 13, no. 4, pp. 907-929, 2004.
[27] M. West, "Mixture models, Monte Carlo, Bayesian updating, and dynamic models," Computing Science and Statistics, pp. 325-325, 1993.
[28] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.
[29] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47-60, 1996.
[30] D. Peel and G. J. McLachlan, "Robust mixture modelling using the t distribution," Statistics and Computing, vol. 10, no. 4, pp. 339-348, 2000.
[31] J. L. Crooks, J. O. Berger, and T. J. Loredo, "Posterior-guided importance sampling for calculating marginal likelihoods with applications to Bayesian exoplanet searches," Discussion paper series, Department of Statistical Science, Duke University, 2007.
[32] M. Kumar, "An information guided framework for simulated annealing," American Control Conference (ACC), pp. 827-832, 2012.