
40 IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 1, NO. 1, APRIL 1997

Evolving Artificial Neural Networks to Combine Financial Forecasts

Paul G. Harrald and Mark Kamstra

Abstract—We conduct evolutionary programming experiments to evolve artificial neural networks for forecast combination. Using stock price volatility forecast data we find evolved networks compare favorably with a naïve average combination, a least squares method, and a Kernel method on out-of-sample forecasting ability—the best evolved network showed strong superiority in statistical tests of encompassing. Further, we find that the result is not sensitive to the nature of the randomness inherent in the evolutionary optimization process.

Index Terms—Evolutionary programming, financial forecasting, forecast combination, neural networks, self-adaptive evolutionary programming.

I. INTRODUCTION

POLICY and decision makers must typically adopt a position based on a host of conflicting opinions about the future course of events. Processing multiple and conflicting data is an inherently complex and in many cases subjective procedure. We focus here on this process of consensus-making when the information available to the decision maker is quantitative. For instance, the governor of a central bank may have several forecasts of exchange rate movements available that need to be considered in setting the prime interest rate. These forecasts may come from forecasters of differing ability and reputation and from forecasters with different information at hand. Forming some sort of consensus of the exchange rate movement forecasts is required before the prime rate can be set. A simple average of the forecasts could be taken, but this ignores the ability and reputation of the forecasters, since equal weight would be given to each forecast. It is this basic observation that has led to what is now a voluminous literature on forecast combining.

The “optimal” weighting scheme of [1] is based on the covariances of the individual forecasts with the actual realized values of the variable being forecasted. Subsequent work [6] extended this notion of using past forecasts to form a superior combination. Examples from this body of work include the use of unconstrained least squares regression of the actual values on the forecasts to form optimal weights, the correction of serial correlation in the forecast errors, and the use of Bayesian updating techniques to formally allow the weights on forecasts to evolve with new information [19], [37], [20].

Manuscript received August 4, 1996; revised February 15, 1997 and March 3, 1997. This work was supported by the Social Sciences and Humanities Research Council of Canada and the Sir Peter Allen Travelling Scholarship.

P. G. Harrald is with the Manchester School of Management, University of Manchester Institute of Science and Technology, Manchester, M60 1QD U.K.

M. Kamstra is with the Department of Economics, Simon Fraser University, Burnaby, B.C. V5A 1S6 Canada.

Publisher Item Identifier S 1089-778X(97)03850-2.

Only linear combinations of the individual forecasts have thus far been considered. This is a substantial, most likely inappropriate, restriction and one with serious implications for the efficiency or even the consistency of the combined forecast. For example, consider the case of a dependent variable $y_t = x_{1t} x_{2t} + \epsilon_t$, where $\epsilon_t$ is an innovation and $x_{1t}$ and $x_{2t}$ are known, mutually independent explanatory variables. If forecaster 1 has the model $\hat{y}_{1t} = E[y_t \mid x_{1t}]$ and forecaster 2 has the model $\hat{y}_{2t} = E[y_t \mid x_{2t}]$, then any linear combination of the two forecasts will be inferior to the nonlinear forecast $\hat{y}_t = \hat{y}_{1t}\hat{y}_{2t}/c$, with $c = E[x_{1t}]E[x_{2t}]$. There exists some empirical evidence for the gains from incorporating nonlinear combination of forecasts, specifically in the context of combining financial forecasts of stock market volatility [7]. Such complications with nonlinearity are becoming more widely appreciated in modeling economic data in particular, as evidenced in recent work [32] and the wealth of new tests for nonlinearity [4], [30], [33], [49], [50].
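The gain from a nonlinear combination can be illustrated with a small simulation. The data-generating process, forecasts, and constants below are our own illustrative assumptions (a multiplicative process with each forecaster observing only one explanatory variable), not taken from the data studied in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.uniform(1.0, 2.0, n)          # forecaster 1's information
x2 = rng.uniform(1.0, 2.0, n)          # forecaster 2's information
y = x1 * x2 + rng.normal(0.0, 0.1, n)  # true process is multiplicative

# Each forecaster reports E[y | own variable]; E[x] = 1.5 for U(1, 2).
f1 = 1.5 * x1
f2 = 1.5 * x2

# Best linear combination: OLS of y on a constant, f1, and f2.
X = np.column_stack([np.ones(n), f1, f2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_linear = np.mean((y - X @ beta) ** 2)

# Multiplicative combination recovers the x1*x2 interaction exactly.
f_nl = f1 * f2 / 1.5**2
mse_nonlinear = np.mean((y - f_nl) ** 2)
```

Neither forecast alone carries the interaction between the two variables, so no linear weighting can recover it, while the multiplicative combination can.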

Artificial neural networks (ANN’s) have the ability to approximate arbitrarily well a large class of functions [23], [25]–[27], [32], [46], [52]–[54]. ANN’s, therefore, have at least the potential to capture complex nonlinear relationships between a group of individual forecasts and the variable being forecasted, which simple linear models are unable to capture. Readers should be aware that ANN modeling has been criticized as a “black box” (e.g., [11]). If care is not taken, all intuition for the relationship between the forecasts and what is being forecast may be lost. It is also important to recognize that the very act of combining forecasts is an admission of some sort of failure of the models from which the forecasts are produced. If we are given all the information used in generating the individual forecasts being combined, it is always better to construct a single “super model” that encompasses this full information set and not combine the individual forecasts at all. What we are attempting to demonstrate here is the utility of the ANN model if we are faced with only forecasts to combine.1

1 In an unrelated literature [55], it is noted that competing nonlinear “generalizers” making use of identical information sets can be eliminated with a winner-takes-all strategy or they can be combined, termed “stacked generalization,” in a possibly nonlinear fashion. This is similar to what we hope to do here. Our problem, however, is to construct the best forecast as a function of forecasts each based on at least partially unique information, without access to anything but the final forecasts. We are not facing the problem of how best to use all the available information, but rather how best to combine individual forecasts themselves based on information unknown to us.

1089–778X/97$10.00 1997 IEEE


A typical training method for ANN modeling is some manner of supervised learning on a training sample, of which the familiar backpropagation is an example. It is almost certain, however, that in many optimization problems for which ANN’s are considered to be appropriate architectures for search over mappings, the state space will exhibit many local optima [32], [2], rendering gradient-descent methods such as backpropagation unreliable.

In response to this problem of local optima, techniques of evolutionary optimization such as genetic algorithms and evolutionary programming (EP) have been applied to the training of ANN’s (e.g., [31], [36], [38], [41]).

This paper describes the ANN method of forecast combination of [7] and reports on EP experiments to evolve suitable parameterizations of a given ANN architecture. In Section II we present combining methods. In Sections III, IV, and V the evolutionary programming approach to estimation of the ANN model is presented in detail. Section VI is a discussion of the data used to illustrate these techniques in an application. In Section VII formal comparisons of the different techniques of forecast combining are presented. Section VIII offers a discussion of the results with some intuition. Section IX concludes.

II. COMBINING METHODS

First we present some traditional combining methods to be used in comparison with our ANN method; second, an alternative nonparametric method, namely the Kernel method; and third, our ANN combining technique.

A. Traditional Combining Methods

There are now many commonly employed methods for combining forecasts [6], [20]. The assumption that the conditional expectation of the variable being forecasted is a linear combination of the available forecasts is consistent across all combining methods. Thus when combining two individual forecasts $\hat{y}_{1t}$ and $\hat{y}_{2t}$, a single combined forecast $\hat{y}^c_t$ is produced according to (1) by appropriate choice of weights $\beta_0$, $\beta_1$, and $\beta_2$

$\hat{y}^c_t = \beta_0 + \beta_1 \hat{y}_{1t} + \beta_2 \hat{y}_{2t}.$   (1)

The cross-sectional average of the individual forecasts (denoted “Average”), perhaps the most widely used combining method, sets $\beta_0 = 0$ and $\beta_1 = \beta_2 = 1/2$.

It can be shown that a multivariate ordinary least squares (OLS) regression of the variable being forecasted on the individual forecasts in-sample can be used to obtain “optimal” forecast weights $\beta_0$, $\beta_1$, and $\beta_2$ for use in out-of-sample combining [19]. This combination will in general be more efficient than the simple average. We have also considered Bayesian combination methods, but these have been found to have little advantage over classical methods at the forecasting horizon we investigate in this paper [37] and are not pursued further here as a point of comparison.
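The OLS combining scheme just described can be sketched as follows; the helper name and the toy data-generating process are our own illustration, not the volatility data of this paper:

```python
import numpy as np

def ols_combine(y_in, f_in, f_out):
    """Estimate combining weights in-sample by OLS of the realized values
    on a constant and the individual forecasts, then apply the weights
    out-of-sample.  f_in, f_out: (T, k) arrays of k individual forecasts."""
    X = np.column_stack([np.ones(len(f_in)), f_in])
    beta, *_ = np.linalg.lstsq(X, y_in, rcond=None)
    return np.column_stack([np.ones(len(f_out)), f_out]) @ beta

# Toy check: two unbiased forecasts of the same series, one noisier.
rng = np.random.default_rng(1)
y = rng.normal(size=600)
f = np.column_stack([y + rng.normal(0, 0.5, 600),
                     y + rng.normal(0, 1.0, 600)])
combined = ols_combine(y[:400], f[:400], f[400:])
mse_comb = np.mean((y[400:] - combined) ** 2)
mse_avg = np.mean((y[400:] - f[400:].mean(axis=1)) ** 2)
```

Because OLS downweights the noisier forecaster while the simple average cannot, the OLS combination should attain lower out-of-sample MSE here.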

B. Kernel Estimation

Kernel estimation is a nonparametric smoothing technique widely used in modeling of economic and other data; see [22]. The technique essentially forms multidimensional histograms of the data to discover associations between the dependent and the independent variables. The nature of these associations can be almost arbitrarily nonlinear, but uncovering these associations has one very troublesome technicality—determining the width of the histogram bars, called the window-width. The choice of window-width can be data driven. One popular data-driven method is cross validation, in which the window-width is chosen by minimizing or maximizing some objective function on the cross-validated data over a grid of possible window-widths.2

A second technicality, but less troublesome as a matter of practice, is the choice of the form of the histogram itself. Simple on–off bars is one option, although a more popular choice has a probability distribution (such as the normal distribution) centered on the middle of the bar.
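Window-width selection by cross validation can be sketched as follows, here with a Nadaraya–Watson kernel regression under a Gaussian kernel; the function names, the candidate grid, and the toy data are our own illustration (the paper itself cross-validates a Gaussian log-likelihood subject to an ARCH test, per its footnotes):

```python
import numpy as np

def nw_forecast(x_train, y_train, x_query, h):
    """Nadaraya-Watson kernel regression with a Gaussian kernel: a
    locally weighted average whose smoothness is set by window-width h."""
    d = (x_query[:, None, :] - x_train[None, :, :]) / h
    w = np.exp(-0.5 * np.sum(d * d, axis=2))
    return (w @ y_train) / np.maximum(w.sum(axis=1), 1e-12)

def cv_window_width(x, y, grid, n_folds=5):
    """N-fold cross validation: leave out 1/N of the data, forecast it
    from the rest, and keep the h with the lowest out-of-fold MSE."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    best_h, best_mse = None, np.inf
    for h in grid:
        errs = []
        for fold in folds:
            mask = np.ones(len(y), bool)
            mask[fold] = False
            pred = nw_forecast(x[mask], y[mask], x[fold], h)
            errs.append(np.mean((y[fold] - pred) ** 2))
        if np.mean(errs) < best_mse:
            best_h, best_mse = h, np.mean(errs)
    return best_h

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, (300, 1))
y = np.sin(2 * x[:, 0]) + rng.normal(0, 0.2, 300)
h = cv_window_width(x, y, [0.05, 0.1, 0.2, 0.5, 1.0])
```

On wiggly data such as this, cross validation rejects the largest, oversmoothing window-widths.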

C. Artificial Neural Network Combining

The mechanics of ANN modeling are now fairly well understood (e.g., [32]), and a review of ANN’s in general is not undertaken here: we restrict attention to the application of ANN’s to forecast combination.

Consider the task of combining two forecasts. Let $\hat{y}_{it}$ denote the forecast from model $i$ for time $t$. Let $\hat{\mu}$ and $\hat{\sigma}$ denote, respectively, the in-sample mean and in-sample standard deviation of the variable being forecasted out-of-sample. We consider ANN models of the form in (2)–(4), from [7]

$z_{it} = (\hat{y}_{it} - \hat{\mu}) / \hat{\sigma}$   (2)

$\Psi_j(z_t) = [1 + \exp(-(\gamma_{0j} + \gamma_{1j} z_{1t} + \gamma_{2j} z_{2t}))]^{-1}$   (3)

$\hat{y}_t(\theta) = \beta_0 + \beta_1 \hat{y}_{1t} + \beta_2 \hat{y}_{2t} + \sum_{j=1}^{3} \delta_j \Psi_j(z_t)$   (4)

where $j$ is an index whose use will be described below; $\gamma$, $\delta$, and $\beta$ are parameters to be estimated in a manner described below; $\theta$ is a vector of the parameters; and $\hat{y}_t(\theta)$ is the forecast of the dependent variable produced by the ANN method. From (3) we see that the input layer accepts as activation the forecasts to be combined. The input nodes are linked to a hidden layer of three nodes, and also directly to an output node, as is a bias. The hidden nodes apply a nonlinear sigmoidal filter $\Psi$ which

2 To implement cross validation, the Kernel weights are estimated on a subset of the data, and then the estimated weights are used to forecast the remaining portion of the data. The “out-of-sample” forecasts are then collected and the process repeated, leaving out a different subset of the data each time, until “out-of-sample” forecasts for the entire data set are produced. (A cross-validation estimation which has 1/N of the data omitted at a time is called an N-fold cross-validation.) The model specification that produces cross-validated forecasts with, say, the lowest mean squared error, is then selected as “best” [24], [29], [48]. Further studies include the derivation of optimality results for a pseudo-likelihood function [40] and many empirical studies (e.g., [47] and [9]).


maps into (0, 1). The standardization of forecasts from each model is given in (2). This standardization is employed, together with the appropriate choice of the $\gamma$ parameters, to ensure that the $\Psi$ function in (3) typically maps into the region close to 1/2 and not typically close to zero or one. Equation (4) makes explicit the manner in which the outputs from the $\Psi$ functions are to be used to form the final combined forecast. As the estimation of (4) will be based on evolutionary programming and “self-adaptive” evolutionary programming, we will refer to these models generically as the EP-NN method and the SEP-NN method, respectively. The EP-NN method is described in Section IV and SEP-NN in Section V.
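Under our reading of the architecture described above (two forecasts feeding a three-node sigmoidal hidden layer, plus direct links and a bias to a linear output node), the combining network might be sketched as follows; the function name and parameter shapes are assumptions of this sketch:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def ann_combine(f, mu, sigma, gamma, delta, beta):
    """One plausible reading of the three-hidden-node combining network:
    f         : (T, 2) individual forecasts
    mu, sigma : in-sample mean / std of the target (for standardization)
    gamma     : (3, 3) hidden-node weights (bias + 2 inputs per node)
    delta     : (3,)   hidden-to-output weights
    beta      : (3,)   bias + direct input-to-output weights."""
    z = (f - mu) / sigma                                          # standardize
    h = sigmoid(np.column_stack([np.ones(len(f)), z]) @ gamma.T)  # hidden layer
    return np.column_stack([np.ones(len(f)), f]) @ beta + h @ delta
```

With the hidden-to-output weights set to zero the network collapses to the linear combining scheme, so the hidden layer only adds the nonlinear correction.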

III. EVOLUTIONARY PROGRAMMING

Evolutionary programming was developed in the early 1960’s [17] as a means of solving complex optimization problems by a stochastic numerical process that has features in common with natural evolution. Consider the problem of minimizing a function $F(x)$, where $x$ is a vector of real values. A simple implementation of EP would proceed according to the following pseudocode.

1) Generate $P$ random vectors $x_1, \ldots, x_P$.
2) Until finished
   2.1) Sort $x_1, \ldots, x_P$ by $F(x_i)$, smallest to largest.
   2.2) Delete bottom half of $x_1, \ldots, x_P$.
   2.3) Replace bottom half by $x_{i+P/2} = x_i + \epsilon_i$, $i = 1, \ldots, P/2$, where $\epsilon_i$ is a random mutation.

Typically, the random mutation is made by sampling from a multidimensional normal distribution with small variance (or covariance). There are many variations of this classic EP (see [14]). The utility of EP has been demonstrated in a variety of contexts (e.g., [12] and [13]). The EP has also been applied to evolving neural networks [16], [34], [41], with the essential idea that the vector $x$ represents the parameters of the ANN and that $F$ is a measure of the ANN’s performance, chosen in our case to be the mean square error (MSE) of the forecast of the ANN on in-sample data.
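A minimal runnable version of this classic EP loop, with population size, generation count, and mutation scale chosen arbitrarily for illustration:

```python
import numpy as np

def ep_minimize(F, dim, pop=20, gens=200, sigma=0.1, seed=0):
    """Classic EP from the pseudocode above: each generation, keep the
    better half of the population and refill it with Gaussian mutants."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, (pop, dim))        # step 1: random vectors
    for _ in range(gens):
        X = X[np.argsort([F(x) for x in X])]  # 2.1: sort by fitness
        half = pop // 2
        X[half:] = X[:half] + rng.normal(0, sigma, (half, dim))  # 2.2-2.3
    return X[0]

# Toy use: minimize the sphere function.
best = ep_minimize(lambda x: np.sum(x ** 2), dim=5)
```

Because the better half is always retained, the best objective value is monotone non-increasing across generations.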

IV. ANN FORECAST COMBINATION EP

The EP-NN procedure used here is as follows.

0) Define $y_t$ as the dependent variable, and our forecast of $y_t$ as $\hat{y}_t(\theta_i)$ in (4), with $\theta_i = (\gamma_i, \delta_i, \beta_i)$.

1) For each parent $i = 1, \ldots, P$

   1.1) For each component of the parameter vector $\gamma_i$

      1.1.1) Generate uniformly distributed on [−1, 1], independent of all other trials.

   1.2) For each observation $t$

      1.2.1) Compute $\Psi_j(z_t)$, $j = 1, 2, 3$, as in (3).

   1.3) For each parent $i$

      1.3.1) Estimate $\beta_i$ and $\delta_i$ by OLS, population regression (4).

   1.4) Sort the array $\{\gamma_i\}$ over $i$ by ascending MSE from regression of $y_t$ on a constant, $\hat{y}_{1t}$, $\hat{y}_{2t}$, $\Psi_1(z_t)$, $\Psi_2(z_t)$, and $\Psi_3(z_t)$.

   1.5) For each $i = 1, \ldots, P/2$

      1.5.1) $\gamma_{i+P/2} = \gamma_i + \epsilon_i$, $\epsilon_i \sim N(0, \sigma_m^2 I)$, a vector of the same dimension as $\gamma_i$.

This represents a single generation of the EP-NN algorithm, for a single ANN model. The set of parents was of fixed size, a single ANN model estimation consisted of a 1000-generation run, and the mutation standard deviation $\sigma_m$ was set to a fixed constant. A total of 29 independent ANN models were estimated in the above fashion, each with Steps 1.1) and 1.5.1) repeated independently of the other 28 model estimations.
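One generation of this scheme might be sketched as follows, assuming the hidden-node weight vectors γ are the only evolved quantities while the output-layer weights are re-fit by OLS each generation, as in Steps 1.3) and 1.4); the function names and array shapes are our own:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fitness(gamma, z, f, y):
    """Steps 1.2-1.4: given hidden weights gamma (3 nodes x 3 weights),
    form the hidden-node outputs, fit the output layer by OLS, return MSE."""
    H = sigmoid(np.column_stack([np.ones(len(z)), z]) @ gamma.T)
    X = np.column_stack([np.ones(len(f)), f, H])   # constant, forecasts, Psi's
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ beta) ** 2)

def epnn_generation(pop, z, f, y, sigma_m, rng):
    """One EP-NN generation: rank parents by OLS-fitted MSE, keep the
    better half, refill with Gaussian mutants (Step 1.5)."""
    pop = pop[np.argsort([fitness(g, z, f, y) for g in pop])]
    half = len(pop) // 2
    pop[half:] = pop[:half] + rng.normal(0, sigma_m, pop[half:].shape)
    return pop
```

Since the better half survives unchanged and fitness is deterministic given γ, the best in-sample MSE cannot deteriorate from one generation to the next.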

The ANN models were ranked by MSE and the median in-sample MSE trial was selected as our ANN model used to forecast out-of-sample, denoted as EP-NN(M) in the tables of results. We also include for comparison and contrast the best MSE ANN model and the worst MSE ANN model, denoted EP-NN(B) and EP-NN(W), respectively. We anticipated that the best MSE ANN model would overfit the data and perform badly on the out-of-sample forecasting period.3

V. SELF-ADAPTIVE EVOLUTIONARY PROGRAMMING

The choice of $\sigma_m$ was, in fact, our own first ad hoc choice, but it remained quite robust to other challengers. We also considered, however, the possibility of self-adaptive mutation, an algorithm we term self-adaptive EP (SEP-NN). The SEP-NN algorithm proceeds much in the same way as the EP-NN, except that each trial solution carries with it a vector of terms describing the mutation variances to be used in the production of offspring (see [43]). If we denote the evolvable weights and biases of an ANN as $w$, then in the SEP scheme a trial solution is appended with a vector $\sigma$ of the same dimension. When mutation of the $j$th component of $w$ is undertaken, a normal deviate of mean zero and standard deviation equal to the $j$th component of $\sigma$ is added. The extant $\sigma$ vectors are updated at the beginning of each generation. If we denote by $\sigma_j$ the $j$th component of $\sigma$, then the update takes the form of

$\sigma_j' = \sigma_j \exp(\tau' N(0, 1) + \tau N_j(0, 1))$

where $N_j(0, 1)$ indicates a standard normal random variable drawn independently of the other components of the vector.

The parameters $\tau$ and $\tau'$ adopted the conventional values of $(\sqrt{2\sqrt{n}})^{-1}$ and $(\sqrt{2n})^{-1}$, respectively, where $n$ is the total

3 Ranking by cross validation, such as described for Kernel estimation, or ranking by an information criterion, such as the Schwarz Criterion, are possible mechanisms to avoid overfitting.


number of weights and biases to be evolved. This scheme was first proposed in [45] and has proved useful in similar and earlier work to our own (e.g., [15] and [42]).4
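This lognormal self-adaptation update (the “sigma-first” step) can be sketched as follows, with the conventional τ and τ′ values; the function name is our own:

```python
import numpy as np

def sep_update_sigmas(sigmas, rng):
    """Lognormal self-adaptation: each mutation standard deviation is
    perturbed by one normal draw shared across all components (scaled
    by tau') plus one independent draw per component (scaled by tau),
    before offspring are created ('sigma-first')."""
    n = sigmas.size
    tau = 1.0 / np.sqrt(2.0 * np.sqrt(n))   # component-wise learning rate
    taup = 1.0 / np.sqrt(2.0 * n)           # shared (global) learning rate
    common = rng.normal()                   # one draw shared by all components
    return sigmas * np.exp(taup * common + tau * rng.normal(size=n))
```

The multiplicative, exponential form guarantees the updated standard deviations stay strictly positive.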

Scope for further experimentation is unlimited: ANN’s can achieve arbitrary mappings, whether the output is quantitative or qualitative, and the EP is similarly a very flexible procedure, since mutation is simply a probability distribution mapping a space into itself.

VI. DATA ON INDIVIDUAL FORECASTS

We require a data set with well-known properties and a large number of observations so that we have both the power to discriminate between the various combining methods and intuition for why some combining methods outperform others. We conduct our exercise on forecasts of daily stock market volatility—conditional stock return variance—from two well-known models over the period 1969–1987, yielding thousands of observations.5 The fact that the models we combine have well-understood properties aids in interpreting test results and also suggests specification checks of the combined models. The exercise itself—forecasting volatility—is interesting in its own right. Forecasting the volatility (the riskiness) of stocks facilitates tracking the risk/return characteristics of a portfolio which includes stocks and adjusting the portfolio composition when the risk/return characteristics become undesirable. This application certainly does not exhaust the pool of applications in finance, let alone other fields with similar “problems” of a wealth of competing forecasts. Ongoing work by the authors and others includes credit-rating problems, mean-return forecasting, and value-at-risk estimation (the monetary risk of holding certain portfolios, often over short horizons of a day to a month). Work in other fields, primarily engineering though stretching to biology, robotics, and beyond, refers to combining problems as data “fusion” exercises. Many of these exercises are directed to “fusing” data from a multitude of sensors, for targeting or tracking.

Stock returns, often assumed to be independent and identically normally distributed (i.i.d.), are in fact dependent, nonidentically distributed, and quite fat-tailed—nonnormal. The dependence of returns is largely captured with a simple autoregressive term of order one [AR(1)]—a single lagged return is used to forecast the current return. This sort of simple model will explain between 1% and 15% of the variation of the return, depending on the return series and its periodicity—daily, weekly, and so on. Market closings and sluggish flows of information leading to nonsynchronous trading patterns are believed by some (e.g., [10] and [44]) to lead to this AR(1) structure in returns. The nonidentical distribution of the data is revealed by the clustering of highly volatile periods. In examining data with similar properties, [8] modeled the second moment of inflation rates as varying conditionally on past

4 Since we update mutation variances before offspring are created, we adopt what is known as a “sigma-first” strategy; see [18].

5 Our data extend only to September 1987 to exclude the infamous stock market crash of October 1987. The context of stable conditional data generating processes is the only context in which model comparisons such as ours make sense, and the crash of 1987 produced such large and abrupt changes in stock volatility that the assumption of stability may not be valid over this period.

squared modeling errors to capture such clustering, termed autoregressive conditional heteroskedasticity6 (ARCH). In the context of stock returns, the residual from the model of the return is modeled as having time-varying variance. ARCH effects imply fat tails unconditionally, so modeling ARCH effects promised to resolve both the nonidentical distribution of the data and its leptokurtosis.7 Typical in the literature investigating time-varying variances of stock returns, then, is the assumption of conditional normality of the returns, with the first moment of the returns and the second moment of the return residuals modeled as autoregressive processes, though a number of other approaches have been adopted [3].

Individual forecasts for use in the combining exercise are forecasts of the volatility in daily returns on the S&P 500 stock index, for the period January 1969 to September 1987, as produced by two popular models of stock returns volatility: 1) the moving average variance model (MAV) and 2) the generalized autoregressive conditional heteroskedasticity (GARCH) model. These two methods for forecasting volatility are widely used by professional portfolio managers and academics and perform admirably, as will be demonstrated below. Improving on them should not be regarded as a matter of course—they are not “straw men.”

Following the stock returns volatility literature, define $r_t$ as the daily stock return; then generically

$r_t = \alpha + \beta r_{t-1} + \epsilon_t.$

The error $\epsilon_t$ has zero mean and has conditional variance

$\sigma_t^2 = E[\epsilon_t^2 \mid \Phi_{t-1}]$

where $\Phi_{t-1}$ is any available information the expectation may be conditioned upon. That is, volatility is unobserved but related to $\epsilon_t^2$. Our task is to form a forecast of $\sigma_t^2$: the volatility of stock returns. Let $\hat{\alpha}$ and $\hat{\beta}$ be estimates of the parameters $\alpha$ and $\beta$, and let

$\hat{\epsilon}_t = r_t - \hat{\alpha} - \hat{\beta} r_{t-1}.$   (5)

Our market volatility measure is $\hat{\epsilon}_t^2$, which has expectation $\sigma_t^2$. The conditional variance forecast of the MAV model has

$\hat{\sigma}_t^2 = \frac{1}{n} \sum_{i=1}^{n} \hat{\epsilon}_{t-i}^2$

with $n$ chosen to minimize the Schwarz Criterion8 and the parameters $\alpha$ and $\beta$ estimated with OLS.9 The conditional variance forecast of the GARCH(1,1) model has

$\hat{\sigma}_t^2 = a_0 + a_1 \hat{\epsilon}_{t-1}^2 + b_1 \hat{\sigma}_{t-1}^2$

6 A heteroskedastic random variable does not exhibit a constant second moment.

7 A leptokurtotic random variable is a random variable with a fourth moment which is larger than that of a normally distributed random variable.

8 The Schwarz Criterion is a likelihood-based information criterion which assigns a penalty for use of modeling degrees of freedom. It equals the log of the likelihood function minus $\frac{1}{2} K \log(T)$, where $K$ is the number of parameters in the model and $T$ is the number of observations available.

9 The value chosen for $n$ over the range 1 to 40 was 28.


with its parameters estimated jointly with maximum likelihood methods under the assumption of conditional normality of the errors. The wide application of such models to the task of forecasting stock market volatility is well documented and motivated [3], [39].
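The two variance forecasts can be sketched as follows; the parameter values, names, and initialization are illustrative, and in practice the GARCH parameters would come from the maximum likelihood estimation described above:

```python
import numpy as np

def mav_forecast(eps2, n):
    """Moving-average-variance forecast: the mean of the last n squared
    residuals (n chosen in-sample, e.g. by the Schwarz Criterion)."""
    return np.mean(eps2[-n:])

def garch11_forecast(eps2, omega, alpha, beta):
    """One-step-ahead GARCH(1,1) forecast via the variance recursion
       sigma2[t] = omega + alpha * eps2[t-1] + beta * sigma2[t-1],
    initialized here at the unconditional sample variance."""
    s = eps2.mean()                         # illustrative initialization
    for e2 in eps2:
        s = omega + alpha * e2 + beta * s
    return s
```

The MAV forecast adjusts slowly, each observation carrying equal weight inside the window, whereas the GARCH recursion weights recent squared residuals geometrically and so reacts more swiftly, matching the behavior discussed in Section VI.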

We use one-step-ahead out-of-sample GARCH and MAV forecasts beginning the first day of trading in January 1980. The MAV and GARCH model parameters are estimated with data from the first day of April 1969 to the last day of 1979 and then used to produce the one-step-ahead out-of-sample MAV and GARCH forecasts for the first trading day of 1980. Our data set is then updated by adding the first trading day of 1980 and dropping the first from 1969, the models then re-estimated to produce a one-step-ahead out-of-sample forecast for the second day in 1980, and so on.10 This procedure of updating and one-step-ahead out-of-sample forecasting is repeated until one-step-ahead out-of-sample MAV and GARCH forecasts of daily returns volatility are produced for each trading day from January 1, 1980 to September 30, 1987, constituting the individual out-of-sample forecasts used in the combining exercise. The most important feature of these forecasts, for our purposes, is that the MAV and GARCH models used to produce them employ partially nonoverlapping information sets. Thus there may be an advantage to using a combined forecast as opposed to either of the individual forecasts.11
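The rolling-window, one-step-ahead scheme can be sketched generically as follows; the "model" here is just a window mean, purely for illustration:

```python
import numpy as np

def rolling_one_step(y, window, fit, forecast):
    """Rolling-window scheme: estimate on observations [t-window, t),
    forecast observation t, then slide the window forward by one."""
    preds = []
    for t in range(window, len(y)):
        params = fit(y[t - window:t])    # re-estimate on the current window
        preds.append(forecast(params))   # one-step-ahead forecast for t
    return np.array(preds)

# Toy use: the fitted "parameter" is the window mean, used directly.
preds = rolling_one_step(np.arange(20.0), window=5,
                         fit=np.mean, forecast=lambda m: m)
```

Dropping the oldest observation as each new one arrives keeps the estimation sample a fixed size, so the forecaster adapts without ever using future data.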

The in-sample observations 1969–1979 are used for two purposes. First, the training and selection of the EP-NN and SEP-NN models and the choice of the window-width of the Kernel estimator12 is performed exclusively with the 1969–1979 data using the in-sample forecasts from MAV and GARCH. Second, the first out-of-sample forecast from all of the models is made with only this in-sample data. The MAV and GARCH in-sample 1969–1979 forecasts are used to form the combining parameter weights for the forecast of the volatility of the first trading day of 1980 from the OLS, Kernel, and EP-NN and SEP-NN models.13 The information set is then updated one day at a time, in a rolling window fashion just as with the formation of the MAV and GARCH out-of-sample forecasts, and one-step-ahead out-of-sample combined forecasts of all the methods are obtained, January 1, 1980 to September 30, 1987. These are the forecasts used to evaluate the performance of the combining models. We must stress that the training and selection of the EP-NN and SEP-NN models is carried out once and once only, on the in-sample data

10 “Rolling window” model estimation is common in the combining literature (e.g., [21]).

11 In practice, having access to the forecasters’ information sets means any combination of the forecasts is an inefficient use of the available information. The application to conditional variance forecasts in our paper should therefore be viewed as an exercise to compare the combining methods.

12 We made use of a normal distribution Kernel and picked the window-width to minimize the cross-validated Gaussian log-likelihood among all Kernel estimators that removed evidence of ARCH at the 10% significance level on the 1969–1979 period. The test performed was the ARCH Lagrange Multiplier (LM) test at lags 5 and 20. For a description of this test see footnote 14.

13 The average combined forecast has weights $\beta_0 = 0$, $\beta_1 = \beta_2 = 0.5$ and hence does not need to be estimated. Once the combining parameter weights have been estimated—with only in-sample MAV and GARCH forecasts—the out-of-sample MAV and GARCH forecasts are used to produce the out-of-sample forecast of the volatility of the first trading day of 1980.

1969–1979. Although it is sensible to retrain and re-evaluate the EP-NN and SEP-NN as we update the data set, this is computationally too onerous to attempt.

It may be helpful at this point to plot some of the data and compare the two forecasts which will be used in the combining exercise, the MAV and GARCH forecasts. We also plot the SEP-NN forecasts, but these will not be discussed until Section VIII.

Figs. 1 and 2 plot subsets of the in-sample data, covering the years 1970 and 1974, respectively. The data plotted are the squared residuals from (5) and the volatility estimates from the MAV, GARCH, and self-adaptive evolutionary models, normalized so that their average value is equal to one. Of course, since the data are strictly positive and quite skewed (the squared residuals are approximately chi-squared distributed), the points below one cluster more closely to one. The points (dots) are the squared residuals, the line of circles is the MAV forecast, the boxed line is the GARCH forecast, and the line with diamonds is the SEP-NN forecast.

Fig. 1 plots a year which presented a volatile period from May through July and two less volatile periods on either side. In such quiet periods the MAV and GARCH forecasts are virtually overlaid, while in volatile periods the two are often quite different; GARCH reacts much more swiftly than MAV to changes in volatility. We see much the same pattern in Fig. 2, though the year 1974 was much more volatile, due in part to the OPEC oil price shock. We see remarkable divergences of the MAV and GARCH forecasts during volatile periods, and the very slow adjustment of the MAV forecast causes it to forecast substantially higher volatility than GARCH when brief periods of calm present themselves, such as in August, October, and December 1974. While in some cases it appears the quick adjustment provided by GARCH was appropriate, as in June of 1970, the slower adjustment of MAV provides a better guide through the last half of 1974. Hence there is hope that combining these two forecasts may provide us with a better volatility estimate than the use of one or the other alone.

VII. COMPARING FORECASTS

First, we evaluate the models' abilities to reproduce broad features of the data, characterized by summary statistics on the unconditional moments of the data and tests for residual autoregressive conditional heteroskedasticity (ARCH) effects and normality. If a combining forecast method is to be relied on, then as a minimum it must pass basic specification tests and do no worse than the forecasts incorporated in the combination. These summary statistics and specification tests give some insight into this minimum level of performance.

Next, we compare forecasts on the basis of root mean squared forecast error (RMSFE) and mean absolute forecast error (MAFE), in- and out-of-sample. A traditional measure of the best forecasting method is a simple comparison of RMSFE and MAFE across competing methods, but this provides no measure of statistically significant difference in performance.

Finally, we compare combining methods on the basis of statistical tests of superior performance: encompassing tests on out-of-sample data. This addresses the criticism that ranking models by RMSFE and MAFE does not provide a measure of statistically significant difference in performance across forecasting models.

Fig. 1. This plots a subset of the in-sample data, covering the year 1970. The data plotted are the squared residuals and the volatility estimates from the MAV, GARCH, and evolutionary models, normalized so that their average value is equal to one. The MAV model volatility estimate is an average of past squared residuals, and the GARCH(1, 1) volatility estimate is a function of the squared residual and volatility estimate of the last period. The SEP model volatility estimate is a nonlinear function of both the MAV and GARCH volatility estimates.

The importance of these different criteria is inversely related to their order of presentation. The summary statistics presented below indicate that all the methods studied attain a satisfactory minimum of performance. The RMSFE and MAFE are traditional measures of performance, but we will argue that they provide little information. The encompassing tests provide a sound statistical basis for model comparison, and our argument for the superior utility of the EP methods rests largely on the evidence from the encompassing tests.

A. Summary Statistics

Table I contains summary statistics on the volatility forecasts and on the implied standardized return residuals from our models for stock volatility, for the in-sample period April 1, 1969 to December 31, 1979 and the out-of-sample period January 1, 1980 to September 30, 1987.

The first column in Table I contains the forecast method name. Columns 2 and 3 contain the mean and standard deviation of actual stock return volatility for the S&P 500 index, as well as the mean and standard deviation of the stock volatility forecasts produced by each method. We expect the mean of the various volatility forecasts to be similar to the actual mean and the standard deviation of the forecasts to be smaller than that of the actual data. The mean should be identical in-sample for the OLS, EP-NN, and SEP-NN forecasts, which it is.

Fig. 2. Plots a subset of the in-sample data, covering the year 1974. The data plotted are the squared residuals and the volatility estimates from the MAV, GARCH, and evolutionary models, normalized so that their average value is equal to one. The MAV model variance forecast is an average of past squared residuals, and the GARCH(1, 1) model variance forecast is a function of the squared residual and variance forecast of the last period. The SEP model variance forecast is a nonlinear function of both the MAV and GARCH variance forecasts.

Columns 4–6 of Table I (the first three columns of the standardized returns statistics) contain summary statistics on the standardized return residuals from the S&P 500 index, i.e., the return residuals divided by their forecasted standard deviations, ε_t/σ̂_t. When divided by their forecasted standard deviations, the return residuals should have a standard deviation of one (as is the case for the actual or raw return residuals divided by their sample standard deviation, shown in the first row). Stock return distributions are well known to be leptokurtic compared to the normal distribution, and we see this in our raw data, with an in-sample kurtosis of 5.749 instead of three. As stock volatility-forecasting models are designed to produce a standardized return residual series which is less leptokurtic than the raw series, a reliable method for combining volatility forecasts should have this property. Table I shows that all the forecasting and combining methods produce standardized return residuals which are less skewed and leptokurtic than the raw series. The skewness is close to zero and the kurtosis is close to three, both on the in-sample data, shown in Table I(a), and on the out-of-sample data, shown in Table I(b).

Column 7 (the fourth column of the standardized returns statistics) contains the p-value from an LM test of the null hypothesis that the standardized return residuals do not display ARCH (footnote 14). As expected, the raw return residuals display strong evidence of ARCH, with a p-value of zero to three decimal places for both the in-sample and the out-of-sample data. Stock volatility-forecasting models are designed to remove ARCH from return residuals, and all the forecasting and combining methods we investigate here succeed in removing significant

14 The ARCH LM test is a regression-based test, which takes the R² from the regression of squared errors on lagged squared errors, multiplies this R² by the sample size, and compares this to the critical value from a χ² distribution with degrees of freedom equal to the number of lagged squared errors in the regression. This test is appropriate under the null of no heteroskedasticity in the errors, and is performed with 12 lags. The results are qualitatively identical with 24 lags. For a more detailed discussion of this test for ARCH, see [3].
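The statistic in this footnote is straightforward to compute directly; the sketch below is our own illustration, with the statistic compared against a χ² critical value (approximately 21.03 at the 5% level with 12 lags).

```python
import numpy as np

def arch_lm_stat(resid, lags=12):
    """ARCH LM statistic: T times the R^2 from regressing squared
    residuals on `lags` of their own past values (plus a constant)."""
    e2 = np.asarray(resid, dtype=float) ** 2
    y = e2[lags:]
    T = len(y)
    # Column k holds e2 lagged k periods relative to y.
    X = np.column_stack([np.ones(T)] +
                        [e2[lags - k: len(e2) - k] for k in range(1, lags + 1)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    # Compare to a chi-squared critical value with `lags` degrees of freedom.
    return T * (1.0 - ssr / sst)
```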


TABLE I
(a) reports summary statistics on the in-sample fitted conditional variances, σ̂²_t, and standardized returns, ε_t/σ̂_t, for New York's S&P 500 index on daily data 1969–1979. (b) reports summary statistics on the one-step-ahead out-of-sample forecasted conditional variances, σ̂²_t, and standardized returns, ε_t/σ̂_t, for New York's S&P 500 index on daily data 1980:1–1987:9. The mean and standard deviation (denoted "Std" in the table) of the various variance forecasts are presented. For the various standardized returns, the standard deviation, skewness (denoted "Skew" in the table), kurtosis, ARCH LM test probability value, and Bera-Jarque normality test probability value are shown.

(a)

(b)

evidence of ARCH effects. Further, assuming conditional normality, as is often done in empirical studies of return volatility, implies that the standardized return residuals should be normally distributed. The last column of Table I contains the p-value of the Bera-Jarque normality test on the standardized return residuals (footnote 15). These tests for normality show a statistically significant deviation from the normal distribution. This is a widely documented feature of stock return data, not peculiar to a particular stock index or time period [3].

These summary statistics and tests indicate that all of the forecasting and combining methods perform reasonably,

15 The Bera-Jarque test is an LM test for normality, based on standardized third and fourth moments [28].
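As a sketch, the statistic can be computed from the standardized third and fourth sample moments as follows (our own illustration; under the null of normality the statistic is χ² with two degrees of freedom, with a 5% critical value of about 5.99).

```python
import numpy as np

def bera_jarque(x):
    """Bera-Jarque statistic built from the standardized third (skewness)
    and fourth (kurtosis) sample moments."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std()
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4)
    # Chi-squared with 2 degrees of freedom under normality.
    return n * (skew ** 2 / 6.0 + (kurt - 3.0) ** 2 / 24.0)
```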

TABLE II
The root mean squared forecast error and mean absolute forecast error for all the models, for New York's S&P 500 index on daily data, in-sample 1969:4–1979:12 and out-of-sample 1980:1–1987:9.

though the assumption of conditional normality of the data is clearly violated. The lack of conditional normality impinges on the efficiency of the GARCH method, but does not lead to inconsistent estimation or biased forecasting, and so this failure is of second-order significance.

B. Summary Measures of Performance

The traditional summary measure of the forecasting methods is a comparison of RMSFE and MAFE across competing methods. Although all of our methods were picked with an MSE criterion, comparison by mean absolute deviation is typically considered interesting because researchers often have little confidence in their choice of criterion to minimize. If we knew beyond a shadow of a doubt that the data were normally distributed, there would be no interest in MAFE, but as we typically do not, measures of goodness of fit alternative to MSE are provided. Table II contains the root mean squared forecast error and mean absolute forecast error for each of the individual models and each of the combining methods for the in-sample period April 1, 1969 to December 31, 1979, and the out-of-sample period January 1, 1980 to September 30, 1987.
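The two summary measures of Table II are, in sketch form (function names are ours):

```python
import numpy as np

def rmsfe(actual, forecast):
    """Root mean squared forecast error."""
    e = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))

def mafe(actual, forecast):
    """Mean absolute forecast error."""
    e = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(e)))
```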

Since the parametric modeling approaches provide the greatest degrees of freedom, the EP methods naturally perform best in-sample among the parametric models. The nonparametric Kernel method performs best overall in-sample. It is the out-of-sample behavior that is informative, for if a method "over-fits" the data, the out-of-sample performance typically falls far short of the in-sample performance. What we find in Table II is no evidence of this problem with the EP-NN forecasts, from the best, worst, or median in-sample MSE ranked models, and similarly no such problem with the SEP-NN. On the out-of-sample data these methods are in the middle of the pack, with lower RMSFEs than the MAV method, the worst performer, and not much higher RMSFEs than the best performer, GARCH. Similar rankings come from the MAFE measures. We also see no problem of overfitting with the Kernel method. It is in the middle of the pack by measure


TABLE III
The probability values of the tests of encompassing on the one-step-ahead forecasted conditional variances, σ̂²_t, for New York's S&P 500 index on daily data 1980–1987. The tests are on the β1 parameter in the regression u_{j,t} = β0 + β1·σ̂²_{k,t} + e_t, where u_{j,t} = σ²_t − σ̂²_{j,t} is Model j's out-of-sample forecast error and σ̂²_{k,t} is Model k's out-of-sample forecast. Columns 2–10 (columns 1–9 of the data entries) contain p-values associated with the t-statistics on β1 for all possible j ≠ k comparisons.

of RMSFE, and it has the best MAFE measure on the out-of-sample data. Comparison by RMSFE and by MAFE provides us with no indication of whether any one model is performing significantly better than the other models, however. We therefore investigate in the next section an additional means of comparison between forecasting models that allows for tests of significance: comparison by forecast encompassing.

C. Tests of Forecast Encompassing

Encompassing-in-forecast tests [5], [21] revolve around the intuition that a Model j should be preferred to a Model k if Model j can explain what Model k cannot explain, without Model k being able to explain what Model j cannot explain. Encompassing-in-forecast tests are designed to provide a test of statistical significance for this characteristic. As such, the test provides an obvious method for ranking forecasts.

A set of OLS regressions of the out-of-sample forecast error from one model on the out-of-sample forecast from the other provides the formal test for encompassing-in-forecast. Let u_{j,t} = σ²_t − σ̂²_{j,t} be Model j's out-of-sample forecast error and σ̂²_{k,t} be Model k's out-of-sample forecast. The tests for encompassing involve testing for significance of the β1 parameter in the regression in (6)

u_{j,t} = β0 + β1·σ̂²_{k,t} + e_t.    (6)

To test the null hypothesis that neither model encompasses the other, we perform two regressions. Regress the out-of-sample forecast error from Model j on the out-of-sample forecast from Model k, as in (6); call this regression "j, k" and the resulting estimate of the coefficient β̂1^(j,k). Call β̂1^(k,j) the estimate that results from the analogous regression. If β̂1^(j,k) is not significant, but β̂1^(k,j) is significant, then we reject the null hypothesis that neither model encompasses the other in favor of the alternative hypothesis that Model j encompasses Model k. We say that Model k encompasses Model j if, conversely, β̂1^(k,j) is not significant but β̂1^(j,k) is significant. We fail to reject the null hypothesis that neither model encompasses the other in forecast if both β̂1^(j,k) and β̂1^(k,j) are significant, or if both are not significant. Nonoverlapping information sets may lead to both estimated coefficients being significant, and multicollinearity may lead to both estimated coefficients being insignificant.
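One side of the encompassing test can be sketched as follows: the regression of (6) with an HC3 heteroskedasticity-robust standard error (see footnote 16). This is our own minimal illustration, with hypothetical names; the full procedure runs the regression in both directions, j on k and k on j.

```python
import numpy as np

def encompassing_t(err_j, fcst_k):
    """t-statistic on beta1 in err_j[t] = beta0 + beta1 * fcst_k[t] + e[t],
    using the HC3 heteroskedasticity-robust standard error."""
    y = np.asarray(err_j, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(fcst_k, dtype=float)])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta                               # OLS residuals
    hat = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # leverage values
    omega = (u / (1.0 - hat)) ** 2                 # HC3 weighting
    cov = XtX_inv @ (X.T * omega) @ X @ XtX_inv
    return beta[1] / np.sqrt(cov[1, 1])
```

A large |t| in the "j, k" direction indicates that Model k's forecast explains Model j's errors, so Model j cannot encompass Model k.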

Columns 2–10 of Table III (columns 1–9 of the data entries) contain p-values associated with the heteroskedasticity-robust t-statistics on β1 for all possible comparisons (footnote 16). p-values less than 0.01 reveal that the out-of-sample forecast from the model listed along the top of the table explains, at the 1% significance level, the out-of-sample forecast error from the model listed down the left side of the table, and thus that the model listed down the side cannot encompass the model listed along the top at the 1% level of significance.

These results provide strong evidence of the superiority of the SEP-NN method over the competing linear methods. The SEP-NN out-of-sample forecasts encompass all the competing linear methods as well as the Kernel method at the 5% level of significance or better. The SEP-NN coefficient [refer to (6)] is significant at the 0.1% level in the MAV out-of-sample forecast error regression, at the 0.1% level in the average out-of-sample forecast error regression, at the 1.3% level in the GARCH out-of-sample forecast error regression, at the 1.7% level in the Kernel out-of-sample forecast error regression, and at the 4.9% level in the OLS out-of-sample forecast error regression, while never having the SEP-NN out-of-sample forecast error explained at better than the 9.4% level. With respect to the other ANN combination methods, SEP-NN encompasses EP-NN(B) and EP-NN(W) at the 5% level as well, but not the EP-NN(M), suggesting that EP-NN(M) and SEP-NN are possibly picking up slightly different effects. No other method does nearly this well. The EP-NN(M) and OLS encompass

16 These statistics make use of a modification of the heteroskedasticity-robust covariance matrix estimator [51] termed "HC3" in [35]. We expect to have heteroskedastic errors in this regression, so it is important to account for heteroskedasticity. A simple demonstration of this is as follows. The residual ε_t can be rewritten as σ_t·η_t, where η_t is i.i.d. with mean zero and variance one. The out-of-sample forecast error in predicting ε²_t (an object we do not actually observe, but rather estimate) making use of σ²_t is ε²_t − σ²_t = σ²_t(η²_t − 1), so that the forecast is unbiased, E[ε²_t − σ²_t] = 0, but the forecast error, of even the optimal forecast, is heteroskedastic.


only five of the eight alternative models at the 5% level, and OLS does not encompass these as strongly as does SEP-NN. EP-NN(W) encompasses four, the Kernel only one, and GARCH encompasses none. The MAV out-of-sample forecast is routinely encompassed at the 0.1% level or better, and the average at the 0.5% level or better, with neither encompassing a single other model.

Also of interest is that all of the EP models we looked at performed well, suggesting that the pseudorandom numbers used to evolve the networks were not critical in finding a model which would perform well on out-of-sample criteria. The EP-NN model which had the best in-sample MSE, EP-NN(B), did perform somewhat worse than the other ANN models on the out-of-sample encompassing tests. This suggests that there may be some problems with overfitting even if the neural network architecture includes only three nodes, as all our EP models do.

Of further and related interest is the fact that we also looked at performance statistics for several SEP-NN models: those of the best evolved network, the median, and the worst, though we report only that of the best. Interestingly, the SEP procedure showed no symptoms of overfitting in such comparisons, with each network performing similarly well. This indicates that our choice of architecture and training algorithm successfully compromised between functional and parametric flexibility and the tendency to overfit (footnote 17).

VIII. DISCUSSION

As the forecasts from all the models are a function of only two variables, the MAV and GARCH inputs, we may plot the three-dimensional surface that maps these inputs into the combined forecast. Fig. 3(a)–(d) plots the SEP-NN and OLS combining surfaces estimated in-sample, together with the data points used in this estimation. Fig. 3(a) shows only the SEP forecasting surface, (b) shows only the OLS forecasting surface, (c) shows only the data points, and (d) shows all the surfaces and data in a single plot.

The forecasts are normalized so that the average value is equal to one, as in Figs. 1 and 2. The curved surface of diamonds plots the functional relationship, estimated on the in-sample data April 1969 to December 1979, between the MAV and GARCH individual forecasts and the SEP-NN combined forecast. The nonlinearity of the SEP-NN produces the curved relationship shown in Fig. 3(a) and (d). Similarly, in Fig. 3(b) and (d), the flat surface of circles shows the OLS combined forecast. The pillars rising up from the floor of Fig. 3(c) and (d) indicate actual in-sample data points which were used in estimating the OLS, EP-NN, and SEP-NN models (the height of the pillars is slightly raised so that they are easily visible through the surfaces). The flagged pillars (roughly half of all the data points) are data points for which the SEP-NN combination produces a smaller error than does the OLS combination. The unflagged pillars are data points for which the OLS combination produces a smaller error than does the SEP-NN. The surfaces in Fig. 3(a) and (b) indicate the difference in response we can expect from

17 Full details of the evolved parameters are available from the authors.

the SEP-NN and OLS models for various pairs of GARCH-MAV forecast inputs, while the pillars in Fig. 3(c) and (d) provide some indication as to whether or not the differences in response are relevant for the in-sample period. We can produce analogous surfaces for the out-of-sample period, but these are less interesting. This is because the models are re-estimated as we "move" through the data with the rolling-window updating procedure outlined in Section VI, and hence the surface changes over time in the out-of-sample period.

As displayed in Fig. 3(d), most of the data points occur in the area where the SEP-NN and OLS forecasting functions intersect; thus the similarity of the SEP-NN and OLS forecast summary statistics of Table I(a). We also see, however, that numerous data points occur away from the curve intersections, in particular when one or both of the MAV and GARCH forecasts are large in magnitude. In these areas we get some intuition for what the SEP-NN combination is doing that is different from the OLS combination. For instance, the SEP-NN combination makes very little adjustment in its forecast as the GARCH forecast falls below the value of the MAV forecast when they are both large (greater than four), and in fact increases its forecast when the MAV forecast is close to four and the GARCH forecast is greater than two but falling. In this area we are riding up the curve in the SEP-NN surface. We see such a coincidence of forecasts in Fig. 1 in June of 1970 and in October 1974, displayed in Fig. 2. Very short periods of quiet in the midst of high volatility will produce such patterns, with sharply falling GARCH forecasts but little change in the MAV forecast. The OLS forecast is constrained, because of its linearity, to treat such periods the same way it treats all periods, moving up and down primarily with changes in the GARCH forecast (the slope in the MAV axis is quite small). It is in these periods, where the SEP-NN forecast reacts little to changes in the GARCH forecast, that the SEP-NN forecasts also dominate the OLS forecasts. This is displayed by the predominance of flagged data points in the back middle of Fig. 3(c) and (d), where both the GARCH and MAV forecasts are large.

The SEP-NN forecast also reacts quite strongly to large values of the MAV forecast, in particular if they are associated with moderate values (3–5) of the GARCH forecast, as seen in the steeply rising section of the SEP-NN surface in Fig. 3(a) and (d). In this steeply rising section of the SEP-NN surface, the back left of Fig. 3(d), the OLS forecasts were as often associated with smaller errors as were the SEP-NN forecasts, perhaps indicating that the architecture of the SEP-NN model, with three nodes, was insufficient to allow it to moderate its reaction to falling GARCH forecasts and stable MAV forecasts when the MAV forecast was greater than four and the GARCH forecast was between three and five. An example of this is seen in Fig. 2 in November of 1974.

IX. CONCLUSIONS

We have described an experiment which evolves single hidden-layer perceptrons to combine forecasts of stock price volatility, demonstrating the utility of evolutionary programming in a concrete application. Such forecasting is important in its own right, in particular in the context of financial


Fig. 3. Plots the SEP-NN and OLS combining surfaces with data points from 1969–1979. The forecasts are normalized so that the average value is equal to one. The curved surface of diamonds plots the functional relationship between the MAV and GARCH individual forecasts and the SEP-NN forecast. The flat surface of circles shows the OLS combined forecast. (a) shows only the SEP forecasting surface (diamonds), (b) shows only the OLS forecasting surface (balloons), (c) shows only the data points (pillars), and (d) shows all the surfaces and data in a single plot with SEP forecasts superior to OLS flagged.

risk management. Evolved ANN models were shown to have a strong advantage over simple linear models and a nonparametric Kernel method. A self-adaptive scheme was also shown to yield some benefits over simpler evolutionary programming schemes.

Careful examination based on statistical tests revealed that the evolved networks were significantly superior to the linear combination methods described in the forecast combination literature. It should be emphasized, however, that the results

presented cover only one example, and it remains to be seen if they are more generally applicable. It is also hoped that further research might discover methods of dynamic parameter adjustment, so that the evolutionary programs are more sensitive in monitoring their own performance throughout a trial. Evolving the nonlinear ANN weights as new information flows in, rather than just the linear weights as done here, is not computationally feasible with our large data set, but is a natural next step to take. This application also does not exhaust

Page 12: Evolving Artificial Neural Networks To Combine Financial

HARRALD AND KAMSTRA: EVOLVING ARTIFICIAL NEURAL NETWORKS TO COMBINE FINANCIAL FORECASTS 51

the pool of applications for which these combining techniques may be used, in finance and beyond.

ACKNOWLEDGMENT

The authors thank D. Fogel, R. G. Donaldson, P. Gomme, L. Kramer, and in particular the anonymous referees for their help.

REFERENCES

[1] J. M. Bates and C. W. J. Granger, “The combination of forecasts,”Oper.Res. Quart., vol. 20, pp. 1–68, 1969.

[2] H. J. Bierens, “Comment on artificial neural networks: An econometricperspective,”Econometric Rev., vol. 13, pp. 93–97, 1994.

[3] T. Bollerslev, R. T. Chou, and K. F. Kroner, “ARCH modeling infinance: a review of the theory and empirical results,”J. Econometrics,vol. 54, pp. 1–30, 1991.

[4] W. S. Chan and H. Tong, “On tests for nonlinearity in time seriesanalysis,”J. Forecasting, vol. 5, pp. 217–228, 1986.

[5] Y. Y Chong and D. F. Hendry, “Econometric evaluation of linearmacroeconomic models,”Rev. Economic Stud., vol. 53, pp. 671–690,1986.

[6] R. T. Clemen, “Combining forecasts: a review and annotated bibliogra-phy,” Int. J. Forecasting, vol. 5, pp. 559–583, 1989.

[7] R. G. Donaldson and M. Kamstra, “Using artificial neural networks tocombine financial forecasts,”J. Forecasting, vol. 15, pp. 49–61, 1996.

[8] R. F. Engle, “Autoregressive conditional heteroskedasticity with esti-mates of the variance of United Kingdom inflation,”Econometrica, vol.50, pp. 987–1007, 1986.

[9] R. F. Engle, C. W. J. Granger, J. Rice, and A. Weiss, “Semiparametricestimates of the relationship between weather and electricity sales,”J.Amer. Statist. Assoc., vol. 81, pp. 310–320, 1986.

[10] L. Fisher, “Some new stock market indices,”J. Business, vol. 29, pp.191–225, 1966.

[11] M. B. Fishman, D. S. Barr, and W. L. Loick, “Using neural nets inmarket analysis,”Tech. Analysis of Stocks and Commodities, Apr., pp.18–22, 1991.

[12] D. B. Fogel, “Applying evolutionary programming to selected travelingsalesmen problems,”Cybern. Syst., vol. 24, pp. 27–36, 1993.

[13] , “Applying evolutionary programming to selected control pro-grams,”Comp. Math. Appl., vol. 27, pp. 89–104, 1994.

[14] , Evolutionary Computation: Toward a New Philosophy of Ma-chine Intelligence. Piscataway, NJ: IEEE Press, 1995.

[15] D. B. Fogel, L. J. Fogel, and W. Atmar, “Meta-evolutionary program-ming,” in Proc. 25th Asilomar Conf. on Signals, Systems and Computers.Pacific Grove, CA: Maple, 1991, pp. 540–545.

[16] D. B. Fogel, L. J. Fogel, and V. W. Porto, “Evolving neural networks,”Biological Cybern., vol. 63, pp. 487–493, 1990.

[17] L. J. Fogel, A. J. Owens, and M. J. Walsh,Artificial Intelligence throughSimulated Evolution. New York: Wiley, 1966.

[18] D. K. Gehlhaar and D. B. Fogel, “Tuning evolutionary programming forconformationally flexible modular docking,” inEvolutionary Program-ming V. Cambridge, MA: MIT Press, 1997, pp. 419–432.

[19] C. W. J. Granger and R. Ramanathan, “Improved methods of combiningforecasts,”J. Forecasting, vol. 3, pp. 197–204, 1984.

[20] C. W. J. Granger, “Invited review: Combining forecasts—Twenty yearslater,” J. Forecasting, vol. 8, pp. 167–173, 1989.

[21] J. Hallman and M. Kamstra, “Combining algorithms based on robustestimation techniques and cointegrating restrictions,”J. Forecasting,vol. 8, pp. 189–198, 1989.

[22] W. Hardle, Applied Nonparametric Regression. Cambridge, U.K.:Cambridge Univ. Press, 1990.

[23] J. Hertz, A. Krogh, and R. G. Palmer,Introduction to the Theory ofNeural Computing. Redwood City, CA: Addison-Wesley, 1991.

[24] U. Hjorth and L. Holmqvist, “On model selection based on validationwith applications to pressure and temperature prognosis,”Appl. Statist.,vol. 30, pp. 264–274, 1981.

[25] K. Hornik, “Approximation capabilities of multilayer feedforward net-works,” Neural Networks, vol. 4, pp. 251–257, 1991.

[26] K. Hornik, M. Stinchcombe, and H. White, “Multi-layer feedforwardnetworks are universal approximators,”Neural Networks, vol. 2, pp.359–366, 1989.

[27] , “Universal approximation of an unknown mapping and itsderivatives using multilayer feedforward networks,”Neural Networks,vol. 3, pp. 551–560, 1990.

[28] M. Jarque and A. Bera, “Efficient tests for normality, homoskedasticityand serial independence of regression residuals,”Econ. Lett., vol. 6, pp.255–259, 1980.

[29] L. Kavalieris, “The estimation of the order of an autoregression usingrecursive residuals and cross-validation,”J. Times Series Anal., vol. 10,pp. 271–28, 1989.

[30] D. M. Keenan, “A Tukey nonadditive type test for time series nonlin-earity,” Biometrica, vol. 72, pp. 39–44, 1985.

[31] H. Kitano, “Designing neural networks using genetic algorithms withgraph generation system,”Complex Syst., vol. 4, 1990.

[32] C. Kuan and H. White, “Artificial neural networks: an econometricperspective,”Econometric Rev., vol. 13, pp. 1–91, 1994.

[33] T. Lee, C. W. J. Granger, and H. White, “Testing for neglectednonlinearity in time series models: a comparison of neural networkmethods and alternative tests,”J. Econometrics, vol. 56, pp. 269–290,1993.

[34] J. R. McDonnell and D. Waagen, “Neural network structure design byevolutionary programming,” inProc. Second Annu. Conf. EvolutionaryProgramming, La Jolla, CA: Evolutionary Programming Society, 1993,pp. 79–89.

[35] J. G. MacKinnon and H. White, “Some heteroskedasticity-consistentcovariance matrix estimators with improved finite sample properties,”J. Econometrics, vol. 48, pp. 817–838, 1985.

[36] G. F. Miller, M. Todd, and S. U. Hegde, "Designing neural networks using genetic algorithms," in Proc. Third Int. Conf. Genetic Algorithms. San Mateo, CA: Morgan Kaufmann, 1989, pp. 379–384.

[37] C. Min and A. Zellner, "Bayesian and non-Bayesian methods for combining models and forecasts with applications to forecasting international growth rates," J. Econometrics, vol. 56, pp. 89–118, 1993.

[38] D. Montana and L. Davis, "Training feedforward neural networks using genetic algorithms," in Proc. 11th Int. Joint Conf. Artificial Intelligence. San Mateo, CA: Morgan Kaufmann, 1989, pp. 762–767.

[39] A. Pagan and W. Schwert, "Alternative models for conditional stock volatility," J. Econometrics, vol. 45, pp. 267–290, 1990.

[40] P. M. Robinson, "Automatic frequency domain inference on semiparametric and nonparametric models," Econometrica, vol. 59, pp. 1329–1364, 1991.

[41] N. Saravanan, "Evolving neural networks: Application to a prediction problem," in Proc. Second Annu. Conf. Evolutionary Programming, La Jolla, CA: Evolutionary Programming Society, 1993, pp. 72–78.

[42] N. Saravanan and D. B. Fogel, "Learning strategy parameters in evolutionary programming: An empirical study," in Proc. Third Annu. Conf. Evolutionary Programming. River Edge, NJ: World Scientific, 1994, pp. 269–280.

[43] N. Saravanan, D. B. Fogel, and K. M. Nelson, "A comparison of methods of self-adaptation in evolutionary algorithms," BioSystems, vol. 36, pp. 157–166, 1995.

[44] M. Scholes and J. Williams, "Estimating betas from nonsynchronous data," J. Finan. Econ., vol. 5, pp. 309–327, 1977.

[45] H.-P. Schwefel, Numerical Optimization of Computer Models. Chichester, U.K.: Wiley, 1981.

[46] M. Stinchcombe and H. White, "Using feedforward networks to distinguish multivariate populations," in Proc. Int. Joint Conf. Neural Networks, ICANN'94, Sorrento, Italy, May 26–29, 1994, pp. 7–16.

[47] J. H. Stock, "Nonparametric policy analysis: An application to estimating hazardous waste cleanup benefits," in Nonparametric and Semiparametric Methods in Economics and Statistics, W. Barnett, J. Powell, and G. Tauchen, Eds. Cambridge: Cambridge Univ. Press, 1991.

[48] P. Stoica, P. Eykhoff, P. Janssen, and T. Soderstrom, "Model selection by cross-validation," Int. J. Contr., vol. 43, pp. 1841–1878, 1986.

[49] T. Subba Rao and M. Gabr, "A test for linearity of stationary time series," J. Time Series Anal., vol. 1, pp. 145–158, 1980.

[50] R. S. Tsay, "Non-linearity tests for time series," Biometrika, vol. 73, pp. 461–466, 1986.

[51] H. White, "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity," Econometrica, vol. 48, pp. 817–838, 1980.

[52] H. White, "Neural network learning and statistics," AI Expert, pp. 48–50, 1989.

[53] H. White, "Some asymptotic results for learning in single hidden layer feedforward network models," J. Amer. Statist. Assoc., vol. 84, pp. 1003–1013, 1989.

[54] H. White, "Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings," Neural Networks, vol. 3, pp. 535–549, 1990.

[55] D. Wolpert, "Combining generalizers using partitions of the learning set," Santa Fe Institute, Santa Fe, NM, Working Paper Series 93-02-009, 1993.



Paul G. Harrald was educated in the United Kingdom and Canada.

He is currently a Lecturer in business economics at the Manchester School of Management, Manchester, U.K. His recent interests are in numerical optimization, and in particular in evolutionary techniques. His other research interests include artificial life models, evolutionary games, and computable economic models.

Mark Kamstra received the Ph.D. degree in economics from the University of California at San Diego.

He is currently an Assistant Professor of Economics at Simon Fraser University, Burnaby, B.C., Canada. Recent research interests include empirical asset pricing models, conditional variance forecasting, and the combination of forecasts, both quantitative and qualitative. A common thread through much of his work is the application of artificial neural networks.