

www.elsevier.com/locate/patrec

Pattern Recognition Letters 28 (2007) 1452–1458

Bayesian training of neural networks using genetic programming

Tshilidzi Marwala *

School of Electrical and Information Engineering, University of the Witwatersrand, Private Bag x3, Wits 2050, South Africa

Received 17 November 2005; received in revised form 4 January 2007; available online 27 March 2007

Communicated by K. Tumer

Abstract

A Bayesian neural network trained using Markov chain Monte Carlo (MCMC) and genetic programming in binary space within the Metropolis framework is proposed. The proposed algorithm learns using samples obtained from previous steps, merged using the concepts of natural evolution: mutation, crossover and reproduction. The Metropolis framework serves as the reproduction function, and binary mutation as well as simple crossover are also used. The proposed algorithm is tested on a simulated function, an artificial taster using measured data, and condition monitoring of structures, and the results are compared to those of a classical MCMC method. The results confirm that Bayesian neural networks trained using genetic programming offer better performance and efficiency than the classical approach.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Bayesian framework; Evolutionary programming; Neural networks

1. Introduction

The use of the Bayesian framework to train neural networks has been the subject of research during the previous decade. Techniques applied thus far to train neural networks within the Bayesian framework include the Markov chain Monte Carlo (MCMC) method (Kass et al., 1998), the hybrid Monte Carlo method (Neal, 1994) and evolutionary Monte Carlo (Ling and Wong, 2001). Some of these methods have been applied within the framework of the Metropolis et al. algorithm (Metropolis et al., 1953). The Markov chain Monte Carlo method has been applied to improve the ability of mathematical models to predict the dynamics and reliability of structures (Marwala and Sibisi, 2005). MCMC was used for Bayesian curve fitting applied to signal segmentation (Punskaya et al., 2002), to estimate regularisation parameters for satellite

0167-8655/$ - see front matter © 2007 Elsevier B.V. All rights reserved.

doi:10.1016/j.patrec.2007.03.004

* Tel.: +27 83 379 1357; fax: +27 11 403 1929. E-mail address: [email protected]

image restoration (Jalobeanu et al., 2002) and for inference of stochastic volatility models of the S&P index (Chib et al., 2002).

All of the applications described above have one aspect in common: they have been applied without paying particular attention to the issue of achieving a global optimum posterior distribution function through sampling in binary space, similar to the way this is conducted in genetic algorithms (Koza, 1992). Kendall and Montana (2002) have noted that inside every Markov chain with measurable transition density there is a discrete state-space Markov chain struggling to escape from some local optimum distribution. This in essence indicates that the issue of a global posterior distribution must not be taken for granted. Several techniques have been implemented to achieve global optimum distributions, such as simulated annealing (Neal, 1993) and evolutionary computing to identify global solutions in optimisation problems (Marwala, 2002). In this paper genetic programming is used because it offers the following advantages (Koza, 1992): (1) it efficiently searches through the parameter


space, and is therefore more likely than conventional techniques to converge toward the global desired distribution; (2) there is no need to compute partial derivatives, as is the case in methods such as hybrid Monte Carlo (Neal, 1993); and (3) owing to the dynamics of sampling in binary space, the more probable parameters are sampled more frequently than the less probable ones.

This paper, therefore, proposes the use of genetic programming to sample the posterior probability distribution of the neural network weights. The proposed procedure operates in binary space, and the conventional evolutionary concepts of mutation, crossover and reproduction are applied. This procedure is then compared to the classical Markov chain Monte Carlo method, which samples states in floating-point space. To test the proposed approach, neural networks are trained using this method and applied to a simulated experiment, an artificial taster and condition monitoring in structures.

2. Neural networks

The method proposed in this paper is aimed at identifying probable parameters given a structured model and data. One such structured model, where parameters are identified given data, is the feed-forward neural network. This section describes a particular type of neural network called the multi-layer perceptron (MLP), which is a parameterised graph that makes probabilistic assumptions about data. The MLP architecture implemented in this paper contains hidden units and output units and has one hidden layer. The relationship between output y and input x may be written as follows (Bishop, 1995):

\[
y_k = f_{\mathrm{outer}}\!\left( \sum_{j=1}^{M} w^{(2)}_{kj}\, f_{\mathrm{inner}}\!\left( \sum_{i=1}^{d} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \right) + w^{(2)}_{k0} \right) \qquad (1)
\]

Here, w^{(1)}_{ji} and w^{(2)}_{kj} indicate weights in the first and second layers, respectively, going from input i to hidden unit j and from hidden unit j to output k, while w^{(1)}_{j0} indicates the bias for hidden unit j. Here M is the number of hidden units, d is the number of input units and k is the index of the output units. In this paper, the function f_outer(·) is linear, while f_inner(·) is a hyperbolic tangent function (Bishop, 1995, 2006). The Bayesian method identifies the distribution of the weights in Eq. (1) that appear probable in the light of the given set of data. In this paper, we have developed a method to efficiently identify a distribution of these weights given a set of data. We use simulated data, data collected from a beer-tasting exercise, and condition monitoring in structures data. Therefore, in this paper we identify a distribution of the most probable weights given the simulated data, the beer-tasting data or the condition monitoring data. To facilitate the understanding of the proposed method, the next section introduces the Bayesian framework, which is the framework we use to create the structure of the distribution of the most probable weights.

3. Bayesian framework

In this section a method of identifying the network weights described in Eq. (1) is outlined. The problem of identifying the network weights is posed in Bayesian form as follows (Bishop, 2006):

\[
P(w|D) = \frac{P(D|w)\,P(w)}{P(D)} \qquad (2)
\]

where P(w) is the probability distribution function of the weight-space in the absence of any data, also known as the prior distribution function, and D ≡ (y_1, ..., y_N) is a matrix containing the output data. The quantity P(w|D) is the probable distribution of the weights, also known as the posterior probability distribution function, after the data have been seen; P(D|w) is the likelihood distribution function; and P(D) is the evidence, whose function is to normalise the posterior probability distribution function. Eq. (2) may be expanded to give (Bishop, 1995):

\[
P(w|D) = \frac{1}{Z_s} \exp\!\left( -\beta \sum_{n}^{N} \sum_{k}^{K} \{ t_{nk} - y_{nk} \}^2 - \frac{\alpha}{2} \sum_{j}^{W} w_j^2 \right) \qquad (3)
\]

where
\[
Z_s(\alpha,\beta) = \int \exp\!\left( -\beta \sum_{n}^{N} \sum_{k}^{K} \{ t_{nk} - y_{nk} \}^2 - \frac{\alpha}{2} \sum_{j}^{W} w_j^2 \right) dw = \left( \frac{2\pi}{\beta} \right)^{N/2} \left( \frac{2\pi}{\alpha} \right)^{W/2} \qquad (4)
\]

In Eq. (3), the first term in the exponent is the likelihood function and the second term is the prior information; n is the index of the training pattern, β is the coefficient of the data contribution to the error, k is the index of the output units and α is the coefficient of the prior information. The second term in Eq. (3) is the regularisation term. Training the network using the Bayesian approach automatically penalises highly complex models and also gives a probability distribution for the output of the networks. There are many methods of sampling the distribution in Eq. (3), including Gibbs sampling (Geman and Geman, 1984; Gelfand and Smith, 1990), the Metropolis algorithm (Metropolis et al., 1953) and hybrid Monte Carlo (Neal, 1993). These methods suffer from a reduced probability of finding a global optimum distribution, while hybrid Monte Carlo additionally requires the calculation of the gradient of the distribution. However, it must be borne in mind that these methods have been improved by incorporating simulated annealing (Bishop, 2006). The next section outlines Markov chain Monte Carlo (MCMC).
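The unnormalised log of Eq. (3) is what a sampler actually needs, since the evidence cancels in Metropolis acceptance ratios. A minimal sketch follows, assuming NumPy; `log_posterior` and `predict` are illustrative names, and a linear map stands in for the MLP so the value can be checked by hand:

```python
import numpy as np

def log_posterior(w, predict, x, t, alpha, beta):
    """Unnormalised log of Eq. (3): -beta * sum-of-squares data misfit
    minus (alpha/2) * sum of squared weights (the prior term)."""
    y = predict(w, x)                          # network outputs y_nk
    data_term = -beta * np.sum((t - y) ** 2)   # likelihood exponent
    prior_term = -0.5 * alpha * np.sum(w ** 2)
    return data_term + prior_term

# Toy check with an exact fit, so only the prior term contributes.
predict = lambda w, x: x @ w
x = np.array([[1.0, 0.0], [0.0, 1.0]])
t = np.array([1.0, 2.0])
lp = log_posterior(np.array([1.0, 2.0]), predict, x, t, alpha=0.001, beta=100.0)
# data term is 0 (exact fit); prior term is -0.5 * 0.001 * 5 = -0.0025
```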

4. MCMC via the Metropolis algorithm

One manner in which the probability distribution in Eq. (3) may be sampled is to randomly generate a succession of weight vectors, accepting or rejecting them based on how probable they are using the Metropolis algorithm. This


process requires the generation of large samples of weights, which in many cases is not computationally efficient. MCMC creates a chain of weights and accepts or rejects them using the Metropolis algorithm. The application of the Bayesian approach and MCMC to neural networks results in the probability distribution function of the weights, which in turn leads to the distribution of the network outputs. From these distribution functions, the average prediction of the neural network and the variance of that prediction can be calculated. The probability distributions of the network weights are mathematically described by Eq. (3). From Eq. (3), and by following the rules of probability theory, the distribution of the output parameter y is written as (Neal, 1994)

\[
p(y|x,D) = \int p(y|x,w)\, p(w|D)\, dw \qquad (5)
\]

Eq. (5) depends on Eq. (3) and is difficult to solve analytically due to the relatively high dimension of the weight-space. Thus the integral in Eq. (5) may be approximated as follows:

\[
\tilde{y} \simeq \frac{1}{L} \sum_{i=R}^{R+L-1} F(w_i) \qquad (6)
\]

Here F is the mathematical model that gives the output given the input, ỹ is the average prediction of the Bayesian neural network, R is the number of initial states that are discarded in the hope of reaching the stationary posterior distribution function described in Eq. (3), and L is the number of retained states. In this paper, the MCMC method is implemented by sampling a stochastic process consisting of random variables {w_1, w_2, ..., w_n} through introducing random changes to the weight vector {w} and either accepting or rejecting each state according to the Metropolis et al. algorithm, given the difference in posterior probabilities between the two states in transition (Metropolis et al., 1953). This algorithm ensures that states with high probability form the majority of the Markov chain. Traditionally, MCMC has been conducted in floating-point space; this paper introduces genetic programming, which operates in binary space, for the sampling of Bayesian networks, and this is the subject of the next section.
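The classical floating-point scheme just described, together with the burn-in/averaging of Eq. (6), can be sketched as follows. This is a generic random-walk Metropolis sampler, not the paper's code; the target, step size and names are illustrative:

```python
import numpy as np

def metropolis(log_post, w0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis: perturb the weight vector, accept the
    move with probability min(1, exp(delta log-posterior))."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    lp = log_post(w)
    chain = []
    for _ in range(n_samples):
        w_new = w + step * rng.standard_normal(w.shape)
        lp_new = log_post(w_new)
        if np.log(rng.random()) < lp_new - lp:   # Metropolis criterion
            w, lp = w_new, lp_new
        chain.append(w.copy())                   # repeat the state if rejected
    return np.array(chain)

# Sample a 1-D standard Gaussian "posterior", then form the Eq. (6)
# average with R = 200 discarded states and L = 1800 retained states
# (F taken as the identity for illustration).
chain = metropolis(lambda w: -0.5 * np.sum(w ** 2), [3.0], 2000)
R, L = 200, 1800
y_tilde = np.mean(chain[R:R + L])
```

For the networks in this paper, `log_post` would be the unnormalised log of Eq. (3) evaluated through the MLP, and F in Eq. (6) the network's forward pass.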

5. MCMC: genetic programming and the Metropolis algorithm

Genetic programming takes features from natural evolution and uses them to computationally solve practical problems. Genetic algorithms are examples of genetic programming, and a procedure inspired by these is introduced in this section. In this paper, some of the features of genetic computing are applied to sample the posterior distribution function in Eq. (3). As explained before, the use of genetic programming is motivated by the fact that it efficiently searches through the parameter space, thereby increasing the likelihood that the global posterior distribution is achieved; there is no need to compute partial derivatives, as is the case in methods such as hybrid Monte Carlo (Neal, 1993); and the dynamics of sampling in binary space ensures that the more probable parameters are sampled more frequently than the less probable ones.

Genetic algorithms were inspired by Darwin's theory of natural evolution. In natural evolution, members of the population compete with each other to survive and reproduce. Evolutionarily successful individuals reproduce while weaker members die. As a result, the genes of successful individuals are likely to spread within the population. This natural optimisation method has been successfully used to optimise complex problems (Holland, 1975; Michalewicz, 1996; Goldberg, 1989). The procedure uses a population of binary-string chromosomes; each string is the discretised representation of a point in the search space and therefore has a fitness given by the objective function. On generating a new population, three operators are applied: (1) crossover; (2) mutation; and (3) reproduction, and these operators are adopted in genetic MCMC sampling. The crossover operator mixes genetic information in the population by cutting pairs of chromosomes at random points along their length and exchanging the cut sections. This has the potential of joining successful building blocks together. Crossover occurs with a certain probability; in many natural systems, the probability of crossover occurring is higher than the probability of mutation occurring. The simple crossover technique is used in this paper (Goldberg, 1989). For simple crossover, one crossover point is selected; the binary string from the beginning of the chromosome to the crossover point is copied from one parent, and the rest is copied from the second parent. For example, when 11001011 undergoes simple crossover with 11011111, it may become 11001111.

The mutation operator picks a binary digit of a chromosome at random and inverts it. This has the potential of introducing new information to the population. Mutation occurs with a certain probability; in many natural systems, the probability of mutation is low (i.e. less than 1%). In this paper binary mutation is used (Goldberg, 1989). When binary mutation is applied, one bit of a number written in binary form is chosen and its value is inverted. For example, 11001011 may become 11000011.
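The two operators just described can be written directly on bit strings. The sketch below reproduces the paper's worked examples; the crossover and mutation points (fourth and fifth bit respectively) are choices made here for illustration, since the paper does not state them:

```python
def simple_crossover(parent1, parent2, point):
    """Simple crossover: copy parent1 up to the crossover point,
    then copy the remainder from parent2."""
    return parent1[:point] + parent2[point:]

def binary_mutation(chromosome, position):
    """Binary mutation: invert the bit at the chosen position."""
    flipped = '0' if chromosome[position] == '1' else '1'
    return chromosome[:position] + flipped + chromosome[position + 1:]

# The paper's examples, with illustrative cut/flip positions.
child = simple_crossover('11001011', '11011111', point=4)   # crossover example
mutant = binary_mutation('11001011', position=4)            # mutation example
```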

Reproduction takes successful chromosomes and reproduces them in accordance with their fitness. In this paper the Metropolis et al. (1953) criterion is used as the reproduction method. In so doing, the least-fit members are gradually driven out of the population of states that form the Markov chain.

The schematic illustration of the MCMC method trained using genetic programming is shown in Figs. 1 and 2. In this figure an initial sample weight vector {w}_n is generated. The sample is then converted into binary form (Step 3 in Fig. 2) using the Gray method (Michalewicz, 1996). The sample is then mutated to form a new sample vector {w}_{n+1} (Step 4 in Fig. 2). The new weight vector {w}_{n+1} undergoes crossover with its predecessor {w}_n and mutates again to


Fig. 1. A schematic illustration of genetic sampling for the MCMC implementation.

Fig. 2. A flow chart representing genetic sampling for the MCMC implementation.


form a new network weight vector {w}_{n+2} (Step 5 in Fig. 2). The weight vector {w}_{n+2} is converted into floating-point form and its probability is calculated (Steps 6 and 7 in Fig. 2). This network weight vector is either accepted or rejected using the Metropolis et al. (1953) criterion (Step 8 in Fig. 2). Thereafter, states {w}_{n+2} and {w}_{n+1} in binary form undergo crossover and are mutated to form {w}_{n+3}. State {w}_{n+3} is then reproduced using the Metropolis et al. (1953) criterion. This procedure is represented in the flow chart shown in Fig. 2. The genetic MCMC proposed in this section differs from the traditional GA in the following respects: (a) the genetic MCMC does not generate a new population of genes at each iteration (i.e. a generation in the GA framework), as is the case in GA, but generates one sample at each iteration; (b) the fitness function uses the Metropolis criterion, which is not the case in GA; and (c) the genetic MCMC has a higher mutation rate than GA. The genetic MCMC differs from standard MCMC in the following ways: (a) the random walk in classical MCMC is replaced by a procedure inspired by Darwin's theory of evolution, which entails crossover, mutation and reproduction; and (b) it operates in binary rather than floating-point space.
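The binary-space sampling step of this section can be sketched as follows for a single scalar weight, assuming 16-bit Gray coding on the weight bounds used later in the experiments. Crossover with the predecessor state is omitted for brevity, and all function names are illustrative, not the paper's code:

```python
import numpy as np

BITS, LO, HI = 16, -4.0, 4.0        # 16-bit coding on illustrative bounds

def encode(w):
    """Quantise a weight to 16 bits and convert to Gray code."""
    n = int(round((w - LO) / (HI - LO) * (2 ** BITS - 1)))
    return n ^ (n >> 1)             # binary -> Gray

def decode(g):
    """Invert the Gray code and map the integer back onto [LO, HI]."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return LO + n / (2 ** BITS - 1) * (HI - LO)

def genetic_step(w, log_post, lp, rng, rate=0.066):
    """One sampling step: mutate the Gray-coded weight in binary space,
    decode to floating point, then accept or reject with the Metropolis
    criterion (the 'reproduction' operator)."""
    g = encode(w)
    for b in range(BITS):           # binary mutation, per-bit probability
        if rng.random() < rate:
            g ^= 1 << b
    w_new = decode(g)
    lp_new = log_post(w_new)
    if np.log(rng.random()) < lp_new - lp:
        return w_new, lp_new
    return w, lp

# Drive the chain on a 1-D Gaussian log-posterior.
rng = np.random.default_rng(1)
log_post = lambda w: -0.5 * w ** 2
w, lp = 3.0, log_post(3.0)
for _ in range(500):
    w, lp = genetic_step(w, log_post, lp, rng)
```

Because proposals are made by flipping bits of the Gray code, every proposed state remains inside the quantisation bounds, which is one way the binary representation constrains the search.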

6. Case study 1: simulated study

In this case study, a Bayesian neural network trained using genetic MCMC is used for a regression problem. The same regression problem is solved using classical MCMC, which generates states in floating-point space and accepts or rejects each state using the Metropolis et al. method. The results of the two methods are then compared. The simulated data are generated from a noisy sine function with a standard deviation of 0.1. This is the same function that was used by Nabney (2002). Twenty data points are generated around x = 0.25. Regression analysis is conducted for the domain x = [0, 1]. The MLP networks constructed have one input, five hidden units and one output unit. The optimal number of hidden units was obtained by studying the relationship between the number of hidden units and the generalisation error. This was conducted by setting the number of hidden units to fall between 1 and 8 and assessing the generalisation error. The hidden-layer activation functions are hyperbolic tangent functions, while the output activation functions are linear functions.

On implementing Bayesian training, the coefficient of the data contribution to the error, β, is set to 100, while the prior coefficient α is set to 0.001. The manner in which these parameters fit into the Bayesian framework is described by Eq. (3). The number of retained states L is 10,000 while


Fig. 4. Results obtained when Bayesian networks are trained via classical MCMC.


the number of discarded states R is 200; these values fit into the Bayesian framework through Eq. (6).

For the genetic part of the simulation, the rate of mutation is 6.6% and the rate of crossover is 70%. It should be noted that the rate of mutation proposed here is higher than that of a standard genetic algorithm. The proposed Bayesian method via genetic programming has a random search component and may therefore be viewed as equivalent to the random walk executed in standard Bayesian sampling. Indeed, the proposed procedure may in principle be equivalent to the standard random walk; however, it takes advantage of the efficient sampling in binary space that has been observed in standard genetic algorithms. It must be noted that the rate of mutation chosen here is lower than the rate of crossover, which is in accordance with many natural systems. When implementing the genetic framework, 16-bit binary numbers are used. The bounds on the magnitudes of the components of the weight vectors are [−4, 4]. The results obtained when Bayesian networks are trained using genetic programming are shown in Fig. 3.

The mean square error (MSE) obtained from this figure is 0.371. The results obtained when the Bayesian networks are trained using classical MCMC are shown in Fig. 4; this gives an MSE of 0.55. It should be noted that Figs. 3 and 4 are based on equal numbers of samples, namely 10,000. When Fig. 3 is compared to Fig. 4, it is observed that the genetic approach to MCMC performs better than the MCMC method that operates in floating-point space, because it gives lower average errors. The graph showing the errors as a function of the samples accepted from state 1 to state 10,000 is shown in Fig. 5.

This graph shows that the rate of convergence to the stationary posterior distribution is faster when using the genetic approach than when using the standard MCMC method. This is because sampling through

Fig. 3. Results obtained when Bayesian networks are trained via genetic programming.

Fig. 5. Prediction errors versus samples.

binary space is more efficient than sampling through floating-point space: sampling in binary space is able to explore a larger part of the weight-space than a process conducted in floating-point space. The acceptance rate of states for the MCMC method that operates in floating-point space is 0.80, while that for the MCMC method based on genetic programming is 0.71. This indicates that the MCMC method based on genetic programming explores the states that form the Markov chain better than the MCMC method that operates in floating-point space.

7. Case study 2: artificial taster

Now that the simulations conducted in Section 6 have demonstrated that the genetic approach to the Bayesian framework performs better than classical MCMC, we apply the procedure to the practical problem of developing an artificial taster.


Fig. 6. The taste score for a given taste sample. The top graph is for the entire 207 test data, while the bottom graph is a close-up view of the top graph.


The artificial taster has been the subject of research for some time. Some of the work on this subject includes the development of a taster based on proton-transfer mass spectroscopy to successfully taste mozzarella cheese (Gasperi et al., 2001), a solid-state electronic taster for beverage analysis (Lvova et al., 2002), a taste sensor based on a lipid-coated crystal microbalance to evaluate beer body and smoothness (Vlasov et al., 2002), and a wine-flavour taster that uses multivariate statistics (Noble and Ebeler, 2002). In this paper, the proposed method is used to construct an artificial taster and compared to the classical approach. The artificial taster tested in this paper is essentially an infrastructure that relates the characteristics of beer measured in the laboratory to taste scores measured from a panel of professional tasters. These characteristics capture parameters that human beings are sensitive to when tasting beer: alcohol level, sugar level (present extract and real extract), pH, iron, acetaldehyde, dimethyl sulphide, ethyl acetate, iso-amyl acetate, total higher alcohols, colour and bitterness. These are the inputs to the artificial taster, which estimates the average taste score given by a panel of 11 professional tasters. To realise the artificial taster, an MLP neural network is used. The MLP has 11 inputs, 7 hidden nodes and 1 output. The network has a hyperbolic tangent function in the hidden layer and a logistic function in the output layer. The parameters chosen by trial and error (see Eq. (3)) are an α of 0.02 and a β of 25. The genetic parameters used in the previous example are used in this case. The characteristics of beer and the corresponding taste scores, averaged over the panel of 11 professional tasters, are used to construct the artificial taster. Each taster gives a taste score that ranges from 0 for a really bad beer to 10 for a good beer. When the artificial taster, which is defined in this paper as neural networks together with measured analytical data from beer, is used to predict taste scores, the results obtained are shown in Table 1 and Fig. 6. The idea of giving a taste a numerical value is difficult to justify philosophically, but psychophysics has provided appropriate measurement techniques for subjective phenomena such as taste, and this is the framework adopted in this paper.

These results show that the genetic approach proposed in this paper converges faster than classical MCMC. Secondly, these results show that genetic programming gives lower average errors than the classical approach. This is because genetic programming is able to explore a wider search space more efficiently than the classical approach.

Table 1
Results from artificial taster on the test data

Method               Mean square errors   Number of samples for convergence   Average percentage errors (%)
Classical approach   0.057                2151                                9.1
Evolutionary         0.051                1343                                7.1

8. Case study 3: condition monitoring

The process of monitoring and identifying faults in structures is of great importance in aerospace, civil and mechanical engineering. Aircraft operators must be sure that aircraft are free from cracks. Bridges and buildings nearing the end of their useful life must be assessed for load-bearing capacity. Cracks in turbine blades lead to catastrophic failure of aero-engines and must be detected early. Many techniques have been employed in the past to locate and identify faults. Some of these are visual (e.g. dye-penetrant methods) and others use sensors to detect local faults (e.g. acoustics, magnetic fields, eddy currents, radiographs and thermal fields). These methods are time-consuming and cannot indicate that a structure is fault-free without testing the entire structure in minute detail. Furthermore, if a fault is buried deep within the structure it may not be visible or detectable by these localised techniques. The need to detect faults in complicated structures has led to the development of global methods, which are able to utilise changes in the vibration characteristics of the structure as a basis for fault detection (Marwala, 2000). There are four main domains in which vibration data may be represented: time, modal, frequency and time-frequency. Raw data are obtained from measurements made in the time domain. From the time domain, Fourier-transform techniques may be used to transform the data into the frequency domain. From the frequency-domain data, and sometimes directly from the time domain, the modal properties may be extracted. All of these domains theoretically contain similar information, but in reality this is not necessarily the case. Because time-domain data are relatively difficult to interpret, they have not been used extensively for fault identification, and for this reason the modal properties have been widely considered. In this paper we use pseudomodal energies to classify faults in a


Table 2
Results from condition monitoring problem on the test data

Method               Mean square errors   Number of samples for convergence   Average percentage errors (%)
Classical approach   0.0256               5219                                5.5
Evolutionary         0.0195               4725                                4.9


population of cylindrical shells (Marwala, 2001). The procedure for calculating these pseudomodal energies is outlined in (Marwala, 2001). The cylindrical shells exhibit 8 classes of faults, and the method proposed above is used to classify these faults. The fault cases are [000], [100], [010], [001], [101], [110], [011] and [111]. The details on these are given in (Marwala, 2001).

The MLP constructed has 10 inputs, 8 hidden nodes and 3 outputs. The network has a hyperbolic tangent function in the hidden layer and a logistic function in the output layer. The parameters chosen by trial and error (see Eq. (3)) are an α of 0.01 and a β of 30. The genetic parameters used in the previous example are used in this case. The results obtained are shown in Table 2. These results also show that the genetic approach proposed in this paper converges faster than classical MCMC. Secondly, these results also show that genetic programming gives lower average errors than the classical approach.

9. Conclusion

In this paper a genetic approach to MCMC sampling is introduced. The method is tested on simulated data, an artificial taster and condition monitoring of structures. The results obtained are compared to those obtained from an MCMC method that operates in floating-point space. It is concluded that the MCMC method based on genetic programming gives better results than the MCMC method that operates in floating-point space on the simulated data, the artificial-taster data and the condition monitoring data.

References

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK.

Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer, Berlin, Germany.

Chib, S., Nardari, F., Shephard, N., 2002. Markov chain Monte Carlo methods for stochastic volatility models. J. Econometrics 108, 281–316.

Gasperi, F. et al., 2001. The mozzarella cheese flavour profile: A comparison between judge panel and proton transfer reaction mass spectroscopy. J. Sci. Food Agric. 81, 357–363.

Gelfand, A.E., Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85, 398–409.

Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. 6, 721–741.

Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.

Holland, J., 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press.

Jalobeanu, A., Blanc-Feraud, L., Zerubia, J., 2002. Hyperparameter estimation for satellite image restoration using a MCMC maximum-likelihood method. Pattern Recognition 35, 341–352.

Kass, R.E., Carlin, B.P., Gelman, A., Neal, R.M., 1998. Markov chain Monte Carlo in practice: A roundtable discussion. Amer. Statist. 52, 93–100.

Kendall, W.S., Montana, G., 2002. Small sets and Markov transition densities. Stoch. Process. Appl. 99, 177–194.

Koza, J.R., 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.

Ling, F., Wong, W.H., 2001. Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models. J. Amer. Statist. Assoc. 96, 653–666.

Lvova, L., Kim, S.S., Legin, A., Vlasov, Y., Yang, J.S., Cha, G.S., Nam, H., 2002. All-solid-state electronic tongue and its application for beverage analysis. Anal. Chim. Acta 468, 303–314.

Marwala, T., 2000. Fault identification using neural networks and vibration data. PhD thesis, University of Cambridge.

Marwala, T., 2001. On fault identification using pseudo-modal-energies and modal properties. Amer. Inst. Aeronaut. Astronaut. J. 39, 1608–1617.

Marwala, T., 2002. Finite element model updating using wavelet data and genetic algorithm. J. Aircraft 39, 709–711.

Marwala, T., Sibisi, S., 2005. Finite element updating using Bayesian framework and modal properties. J. Aircraft 42, 275–278.

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E., 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092.

Michalewicz, Z., 1996. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag.

Nabney, I., 2002. Netlab: Algorithms for Pattern Recognition. Springer, Berlin.

Neal, R.M., 1993. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, Toronto, Canada.

Neal, R.M., 1994. An improved acceptance procedure for the hybrid Monte Carlo algorithm. J. Comput. Phys. 111, 194–203.

Noble, A.C., Ebeler, S.E., 2002. Use of multivariate statistics in understanding wine flavor. Food Rev. Int. 18, 1–20.

Punskaya, E., Andrieu, C., Doucet, A., Fitzgerald, W.J., 2002. Bayesian curve fitting using MCMC with applications to signal segmentation. IEEE Trans. Signal Process. 50, 747–758.

Vlasov, Y., Legin, A., Rudnitskaya, A., 2002. Electronic tongues and their analytical application. Anal. Bioanal. Chem. 373, 136–146.