# [IEEE 2013 Sixth International Conference on Advanced Computational Intelligence (ICACI) - Hangzhou, China (2013.10.19-2013.10.21)] A general algorithm scheme mixing computational intelligence with Bayesian simulation

<ul><li><p>2013 Sixth International Conference on Advanced Computational Intelligence, October 19-21, 2013, Hangzhou, China </p>
<p>A general algorithm scheme mixing computational intelligence with Bayesian simulation </p>
<p>Bin Liu, Member, IEEE, and Chunlin Ji </p>
<p>Abstract- In this paper, we propose a general algorithm scheme that mixes computational intelligence with Bayesian simulation. This hybridization retains the strength of computational intelligence in searching for optimal points and the ability of Bayesian simulation to draw random samples from an arbitrary probability density. An adaptive importance sampling (IS) method is developed under this framework. Its objective is to obtain a feasible mixture approximation to a multivariate, multi-modal and peaky target density that can only be evaluated pointwise up to an unknown constant. The parameters of the IS proposal are determined with the aid of simulated annealing and some heuristics. The performance of this algorithm is compared with that of a counterpart algorithm that does not involve any computational intelligence. The results show a remarkable performance gain due to the mixture strategy and thus give a proof of concept of the proposed scheme. </p>
<p>I. INTRODUCTION </p>
<p>Rigorous statistical analysis is playing an important role in data mining and machine learning applications. Statistical concepts, e.g. latent variables, regularization, model selection and spurious correlation, have appeared in widely noted data mining literature [1]-[2]. Bayesian analysis is a widely accepted paradigm for estimating unknown parameters from data, and it has found great success in statistical practice. The appeal of the Bayesian approach stems from the transparent inclusion of prior knowledge, a straightforward probabilistic interpretation of parameter estimates, and greater flexibility in model specification. 
While Bayesian models have revolutionized the field of applied statistical work, they continue to rely on computational tools to support the necessary calculations. Bayesian simulation techniques, such as Markov chain Monte Carlo (MCMC) and importance sampling (IS), have transformed statistical practice since the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. However, for large datasets and all but the most trivial models, an elaborate algorithm design is necessary for MCMC or IS to yield satisfactory performance. Improved variants of MCMC and IS have therefore been proposed, such as those in [3]-[14], to name just a few. </p>
<p>Bin Liu is with the School of Computer Science and Technology, Nanjing University of Posts and Telecommunications (email: bins@ieee.org). Chunlin Ji is with Shenzhen Kuang-Chi Institute of Advanced Technology. </p>
<p>978-1-4673-6343-3/13/$31.00 ©2013 IEEE </p>
<p>[Figure: loop of three blocks. "Parameterized model to represent the features/patterns" passes a parametric proposal to "Bayesian simulation (MCMC / Importance Sampling)", which passes simulated random draws from a related distribution to "Searching features/patterns in the simulated data via computational intelligence", which passes features back to the parameterized model.] </p>
<p>Fig. 1. The proposed algorithm scheme mixing computational intelligence with Bayesian simulation </p>
<p>In this paper, we provide a general algorithm scheme, see Fig. 1, which mixes computational intelligence into the framework of Bayesian simulation. The purpose is to enhance the algorithm's performance and to expand its applicability to more complex models and datasets. Specifically, we present an example implementation of this scheme, in which simulated annealing (SA), along with some heuristics, is hybridized into the process of IS. 
The resulting algorithm enjoys the advantage of SA in searching for peaky modes in the target density, and so relieves the algorithm designer of the burden of designing a feasible IS density for a high dimensional, peaky target. The performance of the algorithm is investigated, and the results demonstrate the gain yielded by the use of computational intelligence, giving a proof of concept of the proposed algorithm scheme. </p>
<p>II. THE PROPOSED ALGORITHM SCHEME </p>
<p>This section introduces the proposed algorithm framework. The key to this framework is the recognition that the success of Bayesian simulation relies heavily on the underlying proposal density used for generating candidate random samples, and that the random samples output by Bayesian simulation are controlled to be distributed according to the (unknown) target distribution, namely the posterior in Bayesian statistics. These samples may therefore be used to search for or estimate properties/patterns of the target density via computational intelligence, and then to iteratively render the proposal density closer to optimal, in the sense that the properties measured from the current simulation samples are made to match those of the optimal density. </p>
<p>A schematic diagram of the proposed algorithm scheme is shown in Fig. 1. Given a parametric proposal density, the Bayesian simulation module produces a batch of simulated draws from a density that is closely related to the target density, based on an MCMC or IS mechanism. Then the computational intelligence module is invoked to search the above simulated samples for the features/patterns hidden in the target density. </p></li><li><p>The discovered features/patterns are then used to construct a parametric model, which will later act as the proposal density for the next round of Bayesian simulation. 
The above iterative process is initialized by prescribing a parametric proposal for the Bayesian simulation module. </p>
<p>A. Connections to relevant works </p>
<p>The proposed algorithm scheme has connections to some existing algorithms in the literature. It can be seen as a generic framework that covers, or an extension of, some of these existing algorithms. From the Bayesian simulation perspective, the proposed scheme is closely connected to the adaptive IS algorithm in [14], the Sequential Monte Carlo (SMC) samplers [15]-[16], the annealed importance sampler [17], and the adaptive Metropolis-Hastings algorithm with independent proposals [18]-[21]. All of these methods involve an iterative operation of Monte Carlo simulation (either IS or MCMC), based on the recognition that the knowledge gained from former iterations is useful for the sampling procedure of the current iteration. The point to be stressed here is that an efficient mechanism is required to extract as much knowledge as possible from former iterations, since losing any important information may result in an inefficient proposal density function and then trigger a chain reaction that degrades the final performance. By resorting to computational intelligence, the proposed scheme provides a candidate method for searching the features/patterns hidden in the data samples simulated in a former iteration, which are then used to build the proposal function at the current iteration. </p>
<p>III. AN EXAMPLE IMPLEMENTATION OF THE PROPOSED SCHEME </p>
<p>This section describes an example implementation of the proposed algorithm scheme and shows how to achieve a cross-fertilization between computational intelligence and Bayesian simulation via the developed algorithm framework. The most primitive version of this algorithm was presented at a workshop on SMC [22] and then used by the first author to search for extrasolar planets [23]. 
</p><p>Our algorithm is a kind of adaptive IS method. Compared with alternatives such as MCMC, IS is appealing because it allows for parallel implementation and easy assessment of the Monte Carlo error, and it avoids the daunting issue of convergence diagnostics [14]. However, the success of IS depends on designing an appropriate proposal density $q(\cdot)$, which is required to mimic the target density $\pi(\cdot)$ and to have heavier tails [13]. Building such an IS density can be quite difficult even in low dimensional settings. A general strategy for solving this problem presets a model structure for $q(\cdot \mid \psi)$ and then optimizes its parameter $\psi$ via an iterative process, which can be summarized as follows: </p>
<p>• Draw i.i.d. random samples $\{\theta^n\}_{n=1}^{N}$ from $q(\cdot \mid \psi)$; </p>
<p>• Weight the samples by </p>
<p>$$w^n = \frac{W^n}{W}, \quad n = 1, 2, \ldots, N, \tag{1}$$ </p>
<p>where </p>
<p>$$W^n = \frac{\pi(\theta^n)}{q(\theta^n \mid \psi)} \quad \text{and} \quad W = \sum_{n=1}^{N} W^n; \tag{2}$$ </p>
<p>• Adapt the value of $\psi$ based on knowledge obtained from the weighted samples $\{\theta^n, w^n\}_{n=1}^{N}$. </p>
<p>The above iteration stops when a criterion is met, e.g. when the effective sample size (ESS) exceeds a given threshold [14]. Based on this strategy, a number of adaptive importance sampler (AIS) algorithms have been proposed, see [14], [24]-[27], to name just a few. Such AIS algorithms adapt the proposal parameter $\psi$ automatically. Their underlying assumption is that the support of the target density is already known, so the focus is on finding an appropriate model to cover the structured, e.g. multi-modal, properties of the target density. This assumption may be violated when facing large datasets and/or more complicated models, in which case these algorithms may fail. </p>
<p>Here a novel adaptive IS method is developed according to the framework described in Section II to address this problem of existing methods. </p>
<p>A. 
Annealed adaptive mixture importance sampling (AAMIS) </p>
<p>This subsection presents the AAMIS algorithm in detail. In AAMIS, the IS density is specified as a mixture of the form </p>
<p>$$q(\theta \mid \psi) = \sum_{m=1}^{M} \alpha_m f_m(\theta \mid \eta_m), \quad \sum_{m=1}^{M} \alpha_m = 1, \quad \alpha_m \geq 0, \tag{3}$$ </p>
<p>where $\alpha_m$ is the probability mass of the $m$th component $f_m$ with parameters $\eta_m$, and $\psi$ is the mixture model parameter. The Kullback-Leibler (KL) divergence is adopted as the metric to characterize the difference between a given IS density $q$ and the target density $\pi$: </p>
<p>$$D[\pi \,\|\, q] = \int \pi(\theta) \log \frac{\pi(\theta)}{q(\theta \mid \psi)} \, d\theta. \tag{4}$$ </p>
<p>Since the efficiency of IS requires that the IS density mimic the target density, the goal is to minimize (4) with respect to $(\alpha, \eta)$, which is equivalent to maximizing </p>
<p>$$\int \pi(\theta) \log q(\theta \mid \psi) \, d\theta. \tag{5}$$ </p>
<p>Note that, if a number of independently and identically distributed (i.i.d.) samples drawn from $\pi$ are available, the task of maximizing (5) with respect to $(\alpha, \eta)$ becomes a standard problem of maximum likelihood estimation (MLE) of a mixture model. To this end, the Expectation-Maximization (EM) technique can be used, relying on the missing data structure of the mixture model [28]-[29]. </p></li><li><p>On the other hand, if a feasible mixture density model is available, then it can naturally be taken as an IS density for sampling from the target density via a basic IS routine. These two tasks, MLE of a mixture model and IS sampling from the target density, thus become nested: solving either one makes the other much easier to solve. A big challenge appears if the target density has dominant but peaky modes and there is little prior knowledge about their location. To the best of our knowledge, no general algorithm in the literature is capable of handling peaky modes, multi-modality and Monte Carlo sampling all together. 
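</p><p>The weighting in (1)-(2), a mixture proposal of the form (3), and a Monte Carlo estimate of the objective (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 1-D bimodal target, the Gaussian (rather than Student's t) components, the parameter values, and all function names are hypothetical, and the ESS is computed with the common definition $1 / \sum_n (w^n)^2$. </p>

```python
import numpy as np

def log_target(theta):
    """Unnormalized log-density of a toy 1-D bimodal target (illustrative assumption)."""
    a = -0.5 * ((theta + 3.0) / 0.5) ** 2
    b = -0.5 * ((theta - 3.0) / 0.5) ** 2
    return np.logaddexp(a, b)

def log_mixture(theta, alpha, mu, sigma):
    """log q(theta | psi) for a mixture of the form (3), here with Gaussian components."""
    theta = np.asarray(theta)[:, None]                      # shape (N, 1)
    log_comp = (-0.5 * ((theta - mu) / sigma) ** 2
                - np.log(sigma * np.sqrt(2.0 * np.pi)))     # shape (N, M)
    return np.logaddexp.reduce(np.log(alpha) + log_comp, axis=1)

def sample_mixture(rng, alpha, mu, sigma, size):
    """Draw i.i.d. samples: pick a component by its mass, then sample from it."""
    comp = rng.choice(len(alpha), size=size, p=alpha)
    return rng.normal(mu[comp], sigma[comp])

def normalized_weights(log_pi, log_q):
    """Self-normalized importance weights, as in (1)-(2)."""
    log_w = log_pi - log_q
    w = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
    return w / w.sum()

def effective_sample_size(w):
    """ESS = 1 / sum_n (w^n)^2 for normalized weights (assumed definition)."""
    return 1.0 / np.sum(w ** 2)

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])      # hypothetical proposal parameters
mu = np.array([-2.0, 2.0])
sigma = np.array([3.0, 3.0])

theta = sample_mixture(rng, alpha, mu, sigma, size=5000)
log_q = log_mixture(theta, alpha, mu, sigma)
w = normalized_weights(log_target(theta), log_q)
ess = effective_sample_size(w)
# Importance-weighted estimate of the objective (5), E_pi[log q(theta | psi)]
objective = float(np.sum(w * log_q))
print(f"ESS = {ess:.1f} of {len(w)}, estimated objective = {objective:.3f}")
```

<p>With a broad proposal that covers both modes, the ESS should remain a non-negligible fraction of the sample count. Moving the component centers away from the modes lets a few weights dominate and the ESS drop, which is the situation the adaptation of $\psi$ is meant to repair. 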
</p><p>Here an attempt is made by mixing SA, IS, EM and several heuristics within the algorithm scheme introduced in Section II. To begin with, a mixture density $q^0$ of the form (3) is prescribed and used as an initial guess of the target density. The parameters of $q^0$ are chosen so that $q^0$ is flat and spreads over as large a region as possible, guaranteeing that it covers the support of the target density. Next, a sequence of annealed distributions $\{\pi^n\}$ evolving from $q^0$ to $\pi$ is built as follows: </p>
<p>$$\pi^n = (q^0)^{1 - \phi_n} \, \pi^{\phi_n}, \quad n = 0, \ldots, P, \tag{6}$$ </p>
<p>where $\{\phi_n\}_{n=0}^{P}$ denotes an artificially introduced time schedule, which satisfies $0 = \phi_0 < \phi_1 < \cdots < \phi_n < \cdots < \phi_P = 1$. </p>
<p>At time step 1, the goal is to generate a batch of random samples distributed according to $\pi^1$. First, i.i.d. random samples are drawn from $q^0$ and then weighted via IS, taking $\pi^1$ as the target density. Doing so corrects for sampling from a wrong but close distribution $q^0$ via the importance weights $\{w_i\}_{i=1}^{N}$. As a result, $\{\theta_i, w_i\}_{i=1}^{N}$ qualifies as a weighted sample from $\pi^1$. Given $\{\theta_i, w_i\}_{i=1}^{N}$, the MLE of the parameters of a mixture approximation to $\pi^1$ can be obtained via the EM mechanism. Denote by $q^1$ the resulting mixture approximation to $\pi^1$. Then, at time step 2, $q^1$ can in turn be used as the IS density for generating random draws from a new target density $\pi^2$. A mixture approximation to $\pi^2$, denoted by $q^2$, is then obtained in the same way as in the first iteration. This recursive procedure continues until we have obtained $q^P$, the mixture approximation to $\pi$. We then use $q^P$ as the IS density to simulate random samples from $\pi$. One iteration of the above procedure is summarized as follows. </p>
<p>One iteration of AAMIS: Taking the $n$th iteration as an example, the input consists of the mixture density function $q^{n-1}$, parameterized by $(\alpha^{n-1}, \eta^{n-1})$, and $\pi^n$. The operation 
The operation </p><p>includes the following steps: 1. Generate a sample (Bi) from qn-I and compute the </p><p>normalized importance weights </p><p>and the mixture posterior probabilities </p><p>3 </p><p>for i = 1, . . . , Nand m = 1, . . . , JI.;[. 2. Update the parameters a and B as follows </p><p>N </p><p>a = LwiPm(Bilan-I,Tln-l) (9) i=l </p><p>for m = 1, . . . , JI.;[. 3. Output qn in form of (3) with parameters (an, Tin). </p><p>Both the student's t-mixture model and the EM identities for solving equation (10) are described in the following subsection. </p><p>B. MLE of Student's t-mixture </p><p>The multivariate t-distribution, other than the commonly used Gaussian, is selected here for use as the mixture components in (3), due to the desirable heavy-tail property of the t distribution [30]. This subsection just gives the introduction to the t-mixture model as well as the corresponding EM identities for solving Equation (10). </p><p>1) the Student's t mixture model: When a d dimensional random variable Y follows the multivariate t distribution S('I/L,, v) with center IL, positive definite inner product </p><p>matrix and degrees of freedom v E (0, 00] , it denotes that, given the weight T, Y has the multivariate normal distribution, and that VT is X, which means the weight T is </p><p>Gamma distributed: </p><p>Y lfL,, v, T rv Nd(fL, /T); TlfL,, V rv Gamma (v/2, v/2), (11) </p><p>where the Gamma( a, (3) density function is </p><p>(3"Toc-1exp( -(3T)/r(a), T > 0, a> 0, (3 > O. (12) </p><p>As v --+ 00, then T --+ 1 with probability 1, and Y becomes marginally Nd(fL, ). Performing standard algebraic operations integrating T from the joint density of (Y, T), we could obtain the density function of the marginal distribution of Y , namely, S('lfL,, v): </p><p>(7rV )O.5dr( ){1 + O(Y, fL, ) /v }O.5(v+d) , (13) </p><p>where </p><p>denotes the Mahalanobis squared distance from Y to fL with respect to . Substituting fm(BIrI) with S(BlfLm' m' vm)...</p></li></ul>