
Information Sciences 177 (2007) 5329–5346

www.elsevier.com/locate/ins

A neural network ensemble method with jittered training data for time series forecasting

G. Peter Zhang *

Department of Managerial Sciences, Georgia State University, Atlanta, GA 30303, United States

Received 30 November 2006; received in revised form 7 June 2007; accepted 9 June 2007

Abstract

Improving forecasting, especially time series forecasting, accuracy is an important yet often difficult task facing decision makers in many areas. Combining multiple models can be an effective way to improve forecasting performance. Recently, considerable research has been conducted on neural network ensembles. Most of the work, however, is devoted to classification problems. As time series problems are often more difficult to model due to issues such as autocorrelation and the single realization available at any particular time point, more research is needed in this area.

In this paper, we propose a jittered ensemble method for time series forecasting and test its effectiveness with both simulated and real time series. The central idea of the jittered ensemble is to add noise to the input data, thus augmenting the original training data set to form models based on different but related training samples. Our results show that the proposed method is able to consistently outperform the single modeling approach for a variety of time series processes. We also find that relatively small ensemble sizes of 5 and 10 are quite effective in improving forecasting performance.
© 2007 Elsevier Inc. All rights reserved.

Keywords: Neural networks; Time series forecasting; Ensemble; Noise injection; Experimental design

1. Introduction

Time series forecasting is an active research area that has received a considerable amount of attention in the literature. With the time series approach to forecasting, historical observations are collected and analyzed to determine a model to capture the underlying data generating process. Then the model is used to predict the future. This forecasting approach is important in many domains of business, economics, industry, engineering, and science. Much effort has been devoted over the past several decades to the development and improvement of time series forecasting models.

Neural networks represent a recent approach to time series forecasting. There has been an increasing interest in using neural networks to model and forecast time series over the last decade. Neural networks have been found to be a viable contender to various traditional time series models [4,17,22,25,28,60]. Lapedes and Farber

0020-0255/$ - see front matter © 2007 Elsevier Inc. All rights reserved.

doi:10.1016/j.ins.2007.06.015

* Tel.: +1 404 651 4065; fax: +1 404 651 3498. E-mail address: [email protected]


[31] report the first attempt to model nonlinear time series with neural networks. De Groot and Wurtz [19] present a detailed analysis of univariate time series forecasting using feedforward neural networks for two benchmark nonlinear time series. Chakraborty et al. [14] conduct an empirical study on multivariate time series forecasting with neural networks. Atiya and Shaheen [3] present a case study of multi-step river flow forecasting. Poli and Jones [44] propose a stochastic neural net model based on the Kalman filter for nonlinear time series prediction. Weigend et al. [57,58] and Cottrell et al. [18] address the issue of network structure for forecasting real-world time series. Berardi and Zhang [6] investigate the bias and variance issue in the time series forecasting context. In addition, several large forecasting competitions [4,56] suggest that neural networks can be a very useful addition to the time series forecasting toolbox.

One of the major developments in neural networks over the last decade is model combining, or ensemble modeling. The basic idea of this multi-model approach is to use each component model's unique capability to better capture different patterns in the data. Both theoretical and empirical findings have suggested that combining different models can be an effective way to improve the predictive performance of each individual model, especially when the models in the ensemble are quite different [5,29,34,40,41,43]. Although a majority of the neural ensemble literature is focused on pattern classification problems, a number of combining schemes have been proposed for time series forecasting problems. For example, Pelikan et al. [42] and Ginzburg and Horn [21] combine several feedforward neural networks for time series forecasting. Wedding and Cios [55] describe a combining methodology using radial basis function networks and the Box–Jenkins models. Goh et al. [23] use an ensemble of boosted Elman networks for predicting drug dissolution profiles. Medeiros and Veiga [39] consider a hybrid time series forecasting system with neural networks used to control the time-varying parameters of a smooth transition autoregressive model. Armano et al. [2] use a combined genetic-neural model to forecast stock indexes. Zhang [63] proposes a hybrid neural-ARIMA model for time series forecasting. Liu and Yao [33] develop a simultaneous training system for negatively correlated networks to overcome the limitation of sequential or independent training methods.

An ensemble can be formed from multiple network architectures, the same architecture trained with different algorithms, different initial random weights, or even different methods. The component networks can also be developed by training with different data, such as resampled data or different inputs. In general, as discussed in [49,50], a neural ensemble formed by varying the training data typically has more component diversity than one trained with different starting points, different numbers of hidden nodes, or different algorithms. Thus this approach is the most commonly used in the literature. There are many ways to alter training data, including cross-validation, bootstrapping, using different data sources or different pre-processing techniques, as well as combinations of the above techniques [49]. Zhang and Berardi [64] propose two data splitting schemes to form multiple sub-time-series upon which ensemble networks are built. They find that the ensemble achieves significant improvement in forecasting performance.

In this study, we propose a neural ensemble model based on the idea of adding noise to the input data and forming different training sets with the jittered input data. Prior research has indicated that adding noise to the training data can improve the generalization ability of trained networks. However, almost all of this research focuses on the benefits of increased sample size and smoother function mapping due to jittered input data; little research has been done on how to use jittered data to form ensembles. In addition, most of the research has been devoted to regression or classification problems, and little has been carried out in forecasting, especially time series forecasting. This study aims to provide empirical evidence of the effectiveness of the proposed ensemble approach for time series forecasting.

The rest of the paper is organized as follows. Section 2 gives a review of the relevant literature. In Section 3, we present the methodology as well as the data used in evaluating the jittered ensemble method. Section 4 then reports the empirical results from experiments with simulated and real data. Finally, Section 5 concludes.

2. Relevant literature

In neural network training, each training data set is normally a finite sample from some underlying population. Therefore it may not represent the population accurately. Training with jittered data increases the size of the training sample by supplementing it with additional artificial data that is similar to, but different from, the original sample data. This typically causes the data to appear smoother to the neural networks and thus improves the network's ability to learn the true underlying pattern, rather than idiosyncrasies tied to the specific sample and noise at hand. As Reed et al. [46, p. 535] point out, ''Training with jitter helps to prevent overfitting by providing additional constraints. The effective target function is a continuous function defined over the entire input space, whereas the original target function may be defined only at the specific training points.''

A number of researchers have studied the effect on neural network modeling and generalization performance of generating additional training data by adding noise to the input data. It is generally agreed that adding noise to the training patterns can lead to improvements in neural network generalization [26,38,51]. Bishop [8,9] shows that training with noise is approximately equivalent to a form of regularization. Breiman [11] adds noise to the response variable in regression to generate multiple subset regressions that stabilize the estimator. An [1] conducts a rigorous analysis of how various types of noise, including data noise, weight noise, and Langevin noise, affect the learning cost function for both regression and classification problems. He demonstrates that input noise is effective in improving generalization performance. Reed et al. [46] compare the effect of adding noise on generalization ability with error regularization, sigmoid gain scaling, and target smoothing, and find that they are similar in several important cases, although at different costs. Wang and Principe [54] examine the effect of injecting noise into the output on learning speed and the ability to avoid local minima, and find it very effective in both respects. Singh [52,53] uses noise-injected time series data to evaluate a pattern recognition technique for one-step-ahead forecasting and shows that noise injection can be a useful pre-processing technique for model building. Zur et al. [65] compare two methods of jittered training: the usual jitter method, with noise added to each training input vector in each iteration of training, and adding a large number of noise-corrupted artificial cases to the training sample before training the neural network in the conventional manner. They find that both methods outperform training without jitter, and that training with continually changing jitter gives better performance.

Noise can also be added to weights or outputs to improve neural network training. However, An [1] finds that adding noise to the output of neural networks is not very effective, and that weight noise improves generalization performance only for classification problems, not for regression-type problems. In this paper, we focus only on noise added to the input data.

The issue of choosing an appropriately sized noise has not been completely addressed in the literature. It is clear that the best noise variance must be problem dependent, as different data have different inherent noise levels. Large noise could distort the underlying pattern, while small noise may not have enough impact on performance. Raviv and Intrator [45] examine the effect of training with noise on an ensemble classifier for the two-spirals problem and find that for small noise levels the ensemble is unable to find a smooth structure in the data or overfits the training data; for moderate noise levels, better structure can be found; and for large noise levels, the data are so corrupted that no structure can be found. Holmstrom and Koistinen [26] suggest several methods based on cross-validation. Reed et al. [46] conjecture that regularization research regarding the selection of the regularization parameter may help select an appropriate noise level, given the relationship between training with noise and regularization. Zur et al. [65] investigate various noise variance levels, but do not report the results. In an application of the jittered training method to classifying orogenic gold deposit patterns, Brown et al. [12] conduct experiments to determine the optimum amount of noise using both uniform and normally distributed random noise. Their results indicate that while training with jittered data significantly improves classification performance, the optimal noise level depends on the noise probability model.
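One simple validation-based selection scheme in the spirit of the cross-validation methods mentioned above can be sketched as follows. This is our own illustration, not a procedure from the paper: a least-squares AR(1) fit stands in for the trained network, and all function names and candidate levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ar1(y):
    """Least-squares AR(1) fit: returns slope a in y_t ~ a * y_{t-1}."""
    X, z = y[:-1], y[1:]
    return np.dot(X, z) / np.dot(X, X)

def val_error(y_train, y_val, sigma, n_jitter=10):
    """Average one-step validation MAE of models trained on jittered copies."""
    preds = []
    for _ in range(n_jitter):
        jittered = y_train + rng.normal(0.0, sigma, size=y_train.shape)
        a = fit_ar1(jittered)
        preds.append(a * y_val[:-1])          # one-step predictions on hold-out
    pred = np.mean(preds, axis=0)
    return np.mean(np.abs(y_val[1:] - pred))

# toy AR(1) series: y_t = 0.6 y_{t-1} + e_t
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.6 * y[t - 1] + rng.normal(0.0, 0.25)
train, val = y[:200], y[200:]

# grid search over candidate noise levels; keep the one with lowest hold-out error
candidates = [0.01, 0.05, 0.1, 0.25, 0.5]
best_sigma = min(candidates, key=lambda s: val_error(train, val, s))
```

The selected level then depends on the series' inherent noise, consistent with the problem-dependence noted above.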

Most of the research in this area focuses on classification problems; there is a lack of research on time series problems. Time series forecasting presents a uniquely difficult problem because each data point in a time series represents an observation of sample size one at a particular time point. In addition, time series observations are highly correlated. Unlike in a classification problem, where one can generate or collect more data from the same population in an essentially static time frame, a time-indexed data set normally cannot be replicated but must be augmented via additional observations over time. Therefore, larger model building uncertainties and estimation biases are typically associated with time series forecasting problems [15].


3. Research design and methodology

In this paper, we propose the jittered ensemble method. Our purpose is to show that by combining different models developed with multiple time series augmented from the original time series, forecasting performance can be improved over the single neural network approach. To test the effectiveness of the proposed method, we conduct a Monte Carlo simulation with simulated data for controlled time series characteristics, including the noise component. Several real data sets are also used to consolidate the findings from the simulation study. As will be discussed later, both linear and nonlinear time series are employed. While one-step-ahead forecasting is the focus of this study, generalization to the multi-step forecasting setting is not difficult.

3.1. The jittered ensemble method

The time series forecasting problem presents a unique challenge to the modeling and analysis of the data. In a classification problem, the observations in the dataset can often be assumed to be independent. Time series data, on the other hand, typically have autocorrelated structures. In addition, while independent samples may be easy to draw from a well-defined population in classification problems, it is impossible to draw independent samples from a particular realization of a time series.

Let {Y_t, t = 1, 2, ...} be a stochastic process. A time series {y_1, y_2, ..., y_T} is a set of observations generated or observed from its underlying stochastic process (data generating process, or DGP). The main assumption underlying time series analysis is that the observation at time t, y_t, is a realization of the corresponding random variable, Y_t, in the underlying process. Thus the observed time series is treated as one realization of a sequence of random variables, often called a sample path of length T. If we could observe the underlying data generating process many times, we would have many realized sample paths such as

Realization 1: y_1^1, y_2^1, ..., y_t^1, ..., y_T^1
Realization 2: y_1^2, y_2^2, ..., y_t^2, ..., y_T^2
...
Realization m: y_1^m, y_2^m, ..., y_t^m, ..., y_T^m
...
Realization M: y_1^M, y_2^M, ..., y_t^M, ..., y_T^M

where M is the total number of realizations or sample paths and the superscript m is the realization index, m = 1, 2, ..., M. Then we would have a cross-section of M random observations, {y_t^1, y_t^2, ..., y_t^M}, from the same distribution at time point t.

Unfortunately, in almost all practical time-series problems, it is impossible to make more than one observation at any given time. Thus, although it may be possible to increase the sample size by varying the length of the observed time series, there will only be a single observation on the underlying random variable at time t. Nevertheless, we may regard the observed time series as just one of an infinite number of time series that might have been observed from the underlying process. Fig. 1 illustrates the idea of one actual realization versus several possible realizations of a time series process. It is important to note that at each point in time, many possible values could be observed, and the particular realization observed depends on the random shock or noise value at that point.

The jittered ensemble method is based on the idea that at each time point, many possible observations could be made. Thus, for each realized time series, if we can create many ''noisy'' or jittered time series that aim to mimic the behavior of the data generating process, we will have multiple samples, and each of these time series can be viewed as a possible realization of the DGP. These jittered time series can then be used to enhance neural network training and model building by effectively forming an ensemble of neural networks built on different samples from the same DGP. To demonstrate the jittered ensemble method, consider the following generalized autoregressive model of order p

y_t = f(y_{t-1}, y_{t-2}, ..., y_{t-p}) + e_t    (1)

Fig. 1. An actual realization and three possible realizations of a stochastic process.


where e_t is the random shock at time t, often assumed to be white noise, i.e., i.i.d. with mean zero and constant variance σ². For a particular time series generated from (1), {y*_1, y*_2, ..., y*_T}, we can augment that sample by adding noise e_t to the observation at each time t, t = 1, 2, ..., T. Let e_t^k be the noise for the kth replication, k = 1, 2, ..., K, where K is the number of resampled time series desired. Then

{y*_1 + e_1^k, y*_2 + e_2^k, ..., y*_T + e_T^k}    (2)

is the kth jittered time series. By training neural networks K times using the above jittered time series, we may be able to create a more robust neural ensemble model for forecasting.
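As a concrete illustration, the construction in Eq. (2) can be sketched in Python as follows. A least-squares AR(2) fit stands in for the neural network component models, and the function names are ours, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_ar2(y):
    """Least-squares AR(2) fit (stand-in for a neural network component model)."""
    X = np.column_stack([y[1:-1], y[:-2]])   # regressors (y_{t-1}, y_{t-2})
    z = y[2:]                                # target y_t
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    return coef

def jittered_ensemble_forecast(y, sigma, K=10):
    """Eq. (2): train one model per jittered copy of the series, average forecasts."""
    forecasts = []
    for _ in range(K):
        jittered = y + rng.normal(0.0, sigma, size=y.shape)  # kth jittered series
        a1, a2 = fit_ar2(jittered)
        forecasts.append(a1 * y[-1] + a2 * y[-2])            # one-step-ahead forecast
    return float(np.mean(forecasts))

# toy series from an AR(2) process: y_t = 0.6 y_{t-1} - 0.3 y_{t-2} + e_t
y = np.zeros(200)
for t in range(2, 200):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal(0.0, 0.05)

forecast = jittered_ensemble_forecast(y, sigma=0.05, K=10)
```

Each jittered copy plays the role of one possible realization of the DGP; the ensemble forecast is the equal-weight average across the K fitted models.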

The jittered ensemble method is different from the traditional resampling or bootstrapping approach [20], which takes multiple samples from the single observed time series with replacement. For example, one common resampling method for time series analysis is the moving blocks bootstrap, in which blocks of consecutive observations are randomly drawn [30,32]. The basic idea is to form b blocks of data z_t = (y_t, ..., y_{t+k-1}) of length k from the original time series (y_1, y_2, ..., y_T), where b = T - k + 1. Sampling with replacement from the blocks (z_1, z_2, ..., z_b) then yields resamples (z*_1, z*_2, ..., z*_l) of length l = T/k. The limitation of the bootstrap approach lies in two aspects. First, the information contained in the bootstrapped samples is limited to the original time series observations. Second, the moving blocks bootstrap is limited to short-term dependence with the fixed block size [7]. The jittered ensemble method overcomes these limitations by generating multiple time series from the same underlying process without imposing constraints on the autocorrelation structure of the data.

3.2. The experimental setting

Both simulated and real data are used to test the effectiveness of the jittered ensemble method. Simulated data are generated from the following processes:

1. The AR(2) process

   y_t = 0.6 y_{t-1} - 0.3 y_{t-2} + e_t

2. The ARMA(1, 2) process

   y_t = 0.6 y_{t-1} + e_t - 0.5 e_{t-1} - 0.3 e_{t-2}

3. The bilinear (BL) process

   y_t = 0.5 y_{t-1} - 0.3 y_{t-2} + 0.6 y_{t-1} e_{t-1} + e_t

4. The nonlinear autoregressive (NAR) process

   y_t = 0.7 |y_{t-1}| / (|y_{t-1}| + 2) + e_t

5. The first-order smooth transition autoregressive (STAR1) process

   y_t = 0.3 y_{t-1} - 0.6 y_{t-1} [1 + exp(-10 y_{t-1})]^{-1} + e_t

6. The second-order smooth transition autoregressive (STAR2) process

   y_t = 0.3 y_{t-1} + 0.6 y_{t-2} + (0.1 - 0.9 y_{t-1} + 0.8 y_{t-2}) [1 + exp(-10 y_{t-1})]^{-1} + e_t

where the e_t are i.i.d. N(0, σ²) white noise. These series represent a variety of commonly encountered linear and nonlinear processes in time series analysis. They have been used in many other published simulation studies for model selection, model evaluation, and model comparison. To examine the noise effect in the time series, we consider four levels of σ: σ = 0.01, 0.05, 0.25, and 1. Several sample series from the ARMA and STAR1 processes are plotted in Figs. 2 and 3, respectively.

For each of the above processes, we generate a set of observations {y_t, t = 1, 2, ...}. The first 200 points are discarded to remove the effect of initial values, and the next 280 points are taken as the original time series, denoted {y_1, y_2, ..., y_280}. The first 200 observations serve as the original training sample, {y_1, y_2, ..., y_200}, from which we also generate jittered time series. The last 80 points are used for testing purposes.
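The data generation procedure above can be sketched as follows (the function name is ours; the AR(2) process and the 200/280/80 split follow the text):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_ar2(n, sigma, burn_in=200):
    """Generate y_t = 0.6 y_{t-1} - 0.3 y_{t-2} + e_t with e_t ~ N(0, sigma^2),
    discarding the first burn_in points to remove the effect of initial values."""
    total = burn_in + n
    y = np.zeros(total)
    e = rng.normal(0.0, sigma, size=total)
    for t in range(2, total):
        y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + e[t]
    return y[burn_in:]

series = simulate_ar2(280, sigma=0.25)            # 280 retained observations
y_train, y_test = series[:200], series[200:]      # 200 for training, 80 for testing
```

The same pattern applies to the other five processes, with the recursion replaced by the corresponding equation.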

At each noise level, we generate an independent set of random noises that are added to the original training set to form a jittered time series. This process is repeated to generate K jittered series. Since it is not clear what the effective size of the ensemble should be, we experiment with five levels, K = 1, 5, 10, 20, and 50. When K = 1, no jittered series is involved and the ensemble model reduces to the single model approach, with the neural network built only on the original time series without injected noise. In addition, we consider four levels of training sample size in the simulation study: 20, 50, 100, and 200. Relatively small sample sizes such as 20 and 50 are considered because it is presumed that the proposed ensemble method, through the synthetic data, will be more helpful when the sample size is limited.

Fig. 2. Sample ARMA series with different noise levels.

Fig. 3. Sample STAR1 series with different noise levels.


To avoid the situation where results are tied too strongly to one particular time series realization (either good or bad), the above process is replicated 100 times using different starting random seeds for the error terms. This way, we effectively have a full factorial design with training sample size (four levels), ensemble size (five levels), and noise level (four levels) as three major experimental factors. Each experimental cell has a sample size of 100, representing 100 replications. For each time series model at each of the 80 (= 4 × 5 × 4) experimental settings, we first build a neural network model for the original time series and then forecast the next 80 data points. Then we develop the ensemble with noise-added time series. That is, for each time series model, when K = 1, only 4 × 4 (sample size, noise level) single neural networks are constructed using the original data, and they serve as the references for the jittered ensembles. Summary statistics, the root mean squared error (RMSE) and the mean absolute error (MAE) defined below, are used as the performance measures.

RMSE = [ Σ (y_t - ŷ_t)² / N ]^{1/2}    (3)

MAE = Σ |y_t - ŷ_t| / N    (4)

where y_t is the actual observation at time t, ŷ_t is the predicted value, and N is the number of predictions.

We use the standard feedforward neural network model with one hidden layer and one output node. The numbers of input and hidden nodes are determined from previous simulation studies for these time series [61,62]. Table 1 summarizes the input and hidden nodes used for these time series. The unipolar sigmoid function is used as the transfer function for all hidden nodes, and the linear transfer function is selected for the output node. Bias terms are used for all hidden and output nodes. The relationship between the output (y_t) and the inputs (y_{t-1}, y_{t-2}, ..., y_{t-p}) has the following mathematical representation:

y_t = a_0 + Σ_{j=1}^{q} a_j g( b_{0j} + Σ_{i=1}^{p} b_{ij} y_{t-i} ) + e_t,    (5)

Table 1
The neural network structure for six time series processes

Process      Input nodes  Hidden nodes
AR(2)        2            1
ARMA(1, 2)   2            1
BL           2            2
NAR          1            1
STAR1        1            1
STAR2        2            2


where a_j (j = 0, 1, 2, ..., q) and b_ij (i = 0, 1, 2, ..., p; j = 1, 2, ..., q) are the model parameters; p is the number of input nodes and q is the number of hidden nodes; g is the transfer function at the hidden layer. Neural network training is conducted with a GRG2-based system [27]. GRG2 is a widely used comprehensive nonlinear optimization algorithm using the generalized reduced gradient method. With GRG2, there is no need to select via experiments the learning parameters, such as the learning rate and momentum, required by backpropagation-based software. Instead, a different set of parameters, such as the stopping criterion, the search direction procedure, and the bounds on variables, must be specified. The stopping criterion is to terminate the program if the error function is not reduced by 10^-5 for 4 consecutive iterations. Initially, all arc weights and biases are uniformly distributed in the range -5 to 5, and the bounds on weights are set to 100. Since network training is an optimization problem, to increase the chance of finding the global optimum we train each network 50 times using 50 different sets of initial arc weights, and the weights finally selected are the best ones among the 50 training runs.
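The network of Eq. (5) and the multi-restart strategy can be sketched as follows. This is an illustrative reconstruction: it uses plain gradient descent rather than the GRG2 optimizer, far fewer restarts, and hyperparameters of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_once(X, z, q, steps=2000, lr=0.05):
    """One run of eq. (5): sigmoid hidden layer, linear output with biases,
    fitted here by plain gradient descent (the paper uses GRG2 instead)."""
    n, p = X.shape
    Xb = np.column_stack([np.ones(n), X])      # prepend bias input
    B = rng.uniform(-5, 5, size=(p + 1, q))    # hidden weights b_ij (row 0 = b_0j)
    a = rng.uniform(-5, 5, size=q + 1)         # output weights a_j (a[0] = a_0)
    for _ in range(steps):
        H = sigmoid(Xb @ B)
        Hb = np.column_stack([np.ones(n), H])
        err = Hb @ a - z
        grad_a = Hb.T @ err / n
        dH = np.outer(err, a[1:]) * H * (1.0 - H)
        grad_B = Xb.T @ dH / n
        a -= lr * grad_a
        B -= lr * grad_B
    H = sigmoid(Xb @ B)
    mse = float(np.mean((np.column_stack([np.ones(n), H]) @ a - z) ** 2))
    return mse, (B, a)

def train_best_of(X, z, q, restarts=5):
    """Multi-restart scheme: keep the weights from the best of several runs."""
    return min((train_once(X, z, q) for _ in range(restarts)),
               key=lambda run: run[0])

# toy lagged design: predict y_t from (y_{t-1}, y_{t-2}) with q = 2 hidden nodes
y = np.sin(np.linspace(0.0, 20.0, 203))
X = np.column_stack([y[1:-1], y[:-2]])
z = y[2:]
mse, _ = train_best_of(X, z, q=2, restarts=5)
```

Taking the best of several random restarts is a simple hedge against local minima, the same motivation as the paper's 50-restart procedure.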

To show the effectiveness of the jittered ensemble method, we compare it with the traditional ensemble method of varying the initial random weights, which is one of the most commonly used approaches to forming neural ensembles [49]. With this random ensemble approach, we fix the training sample but vary the initial random weights to train the neural network multiple times. The predictions from the multiple runs are then combined. To be consistent with the jittering method, we use the same experimental setting described above, with four training sizes (20, 50, 100, and 200), four ensemble sizes (5, 10, 20, and 50), and 100 replications for each time series.

Three real data sets, series B, E, and F, come from Box and Jenkins [10]. These time series exhibit somewhat different patterns, as observed in the time series plots in Fig. 4. In addition, three stock indices from 1997 to 2004 (the S&P 500, the Nasdaq, and the Russell 1000, daily close) are selected to test the effectiveness of the jittered ensemble method. The noise level of a real time series can be determined from the residuals of the process [10]. Therefore, for each time series, we first fit a neural network model and then use the model residuals to estimate the noise level as the standard deviation for generating jittered time series. Table 2 summarizes these time series, the sample sizes, and the estimated noise levels.

Although there are many ways to form ensemble predictions, we elect to use the simple averaging method in this research for two reasons. First, simple averaging has been shown to be an effective approach to improving neural network performance in a variety of settings [24,49,64]. Second, each jittered time series in the ensemble represents a random realization of the underlying process. It is thus reasonable to treat each of the jittered series equally and give equal weights to the models built on these data sets.
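The simple averaging combination, together with the MAE and RMSE measures of Eqs. (3) and (4), can be sketched as follows (function names and the toy numbers are illustrative):

```python
import numpy as np

def combine_simple_average(member_forecasts):
    """Equal-weight ensemble combination: average member forecasts point-wise."""
    return np.mean(member_forecasts, axis=0)

def rmse(actual, pred):
    return float(np.sqrt(np.mean((actual - pred) ** 2)))   # eq. (3)

def mae(actual, pred):
    return float(np.mean(np.abs(actual - pred)))           # eq. (4)

actual = np.array([1.0, 2.0, 3.0])
members = np.array([[0.8, 2.1, 3.2],    # forecasts from model 1
                    [1.2, 1.9, 2.8]])   # forecasts from model 2
combined = combine_simple_average(members)   # -> [1.0, 2.0, 3.0]
```

In this toy case the two members' errors cancel exactly under averaging, which is the intuition behind combining models built on different jittered realizations.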

4. Results and analyses

In this section, we present out-of-sample results from both the simulation study and the real data applications. In the simulation study, we examine the effectiveness of the jittered ensemble method with regard to a variety of experimental factors: the training sample size (TRNSIZE), ensemble size (ENSIZE), and noise level (NOISE). The ANOVA procedure with the Duncan multiple range test is performed for multiple comparison purposes. All statistical significance levels are set at 5%. Since the results with RMSE are very similar to those with MAE, to save space we report only the MAE results.

Tables 3–8 summarize the ANOVA results for each of the six time series processes with regard to the effects of training size, ensemble size, and noise level. Taking Table 3 as an example for detailed examination, we find

Fig. 4. Three real time series from Box and Jenkins [10].

that for the AR(2) process, as the noise level (NOISE) increases from low (NOISE 1, or σ = 0.01) to high (NOISE 4, or σ = 1), the mean performance measure increases quite dramatically. For example, at the NOISE 1 level the MAE averages 0.0086, while at the NOISE 4 level it averages 0.8867 across the four training sample sizes. The training sample size (TRNSIZE) is significant at each noise level. It is

Table 2
Real time series

Series     Training  Test  Noise
BJ-B       339       30    7.1788
BJ-E       90        10    13.6802
BJ-F       60        10    9.8652
S&P 500    1761      252   15.1552
Nasdaq     1761      252   57.5059
Russell    1761      252   8.0330

Table 3
ANOVA results for AR(2)

Panel A (TRNSIZE)

           NOISE 1           NOISE 2           NOISE 3           NOISE 4
TRNSIZE    Mean      Group   Mean      Group   Mean      Group   Mean      Group
20         0.009282  A       0.047775  A       0.474123  A       0.972305  A
50         0.008461  B       0.042691  B       0.428446  B       0.875891  B
100        0.008280  C       0.041557  C       0.419044  C       0.854756  C
200        0.008211  C       0.041127  C       0.412787  C       0.843783  C

Panel B (ENSIZE; within each noise level, rows are ordered by mean)

NOISE 1                  NOISE 2                  NOISE 3                  NOISE 4
ENSIZE  Mean      Group  ENSIZE  Mean      Group  ENSIZE  Mean      Group  ENSIZE  Mean      Group
50      0.008584  A      1       0.044269  A      1       0.441027  A      1       0.898599  A
20      0.008557  A      5       0.043159  B      5       0.433262  A B    5       0.890659  A B
1       0.008554  A      10      0.043096  B      20      0.431937  B      10      0.883820  B
5       0.008552  A      20      0.042972  B      10      0.431007  B      20      0.880446  B
10      0.008546  A      50      0.042942  B      50      0.430768  B      50      0.879895  B


clear from Panel A of Table 3 that as the training sample size increases, the mean error measure decreases. Note that the letters in the Duncan grouping columns represent groups with similar or significantly different performance measures. Means with the same letter are not significantly different, while different letters indicate significantly different groups. The group with the largest mean is given the letter A, the second largest B, and so on. For example, if within a particular noise level different sample sizes are associated with the same letter (A), then sample size is not a significant factor. On the other hand, if all the letters are different (say, from A to D), then each sample size yields results significantly different from the other sizes. In this AR(2) case, the sample size of 50 gives significantly lower overall errors than that of 20, while sample sizes of 100 and 200 give significantly better results than sample sizes of 50 or lower. At sample sizes of 100 and above, there is no significant difference in the overall performance measure.

Panel B of Table 3 shows the effectiveness of the ensemble method. It compares the results of the ensemble across different ensemble sizes (ENSIZE). An ensemble size of one (ENSIZE = 1) represents the single model

Table 4
ANOVA results for ARMA

Panel A (TRNSIZE)

           NOISE 1           NOISE 2           NOISE 3           NOISE 4
TRNSIZE    Mean      Group   Mean      Group   Mean      Group   Mean      Group
20         0.008962  A       0.045125  A       0.445285  A       0.898916  A
50         0.008456  B       0.042187  B       0.415852  B       0.838851  B
100        0.008344  C       0.041618  C       0.409864  C       0.824999  C
200        0.008305  C       0.041486  C       0.407038  C       0.817619  C

Panel B (ENSIZE)

           NOISE 1           NOISE 2           NOISE 3           NOISE 4
ENSIZE     Mean      Group   Mean      Group   Mean      Group   Mean      Group
1          0.008746  A       0.044460  A       0.438184  A       0.883777  A
5          0.008527  B       0.042511  B       0.418698  B       0.847322  B
10         0.008467  B       0.042181  B C     0.415331  B C     0.836625  B C
20         0.008438  B       0.041995  B C     0.413333  B C     0.831350  C
50         0.008404  B       0.041873  C       0.412004  C       0.826409  C


case. At the lowest noise level (NOISE 1), there is no significant difference between the single model and the ensemble models. At higher noise levels, all ensembles, whatever their size, significantly outperform the single modeling approach. From NOISE 2 to NOISE 4, we find that as the ensemble size increases, the average performance measure improves. However, there are no significant differences among the larger ensemble sizes. From these observations, we may conclude that for the AR(2) process, when the noise level inherent in the data generating process is very low, the jittered ensemble method may not have significant advantages over the single model approach. However, as the noise level increases, the ensemble approach becomes more effective. Since there is no significant difference in forecasting performance among different sized ensembles, we can use small ensembles to achieve significant results. In this case, an ensemble size of 5 or 10 seems to be a reasonable choice: it achieves significant improvement over the single model while keeping modeling and computational effort low.

Results from Tables 4–8 are similar to those discussed above for Table 3, indicating the robustness of the proposed ensemble method across different underlying processes. As the noise level increases, all models, including

Table 5
ANOVA results for BL

Panel A (TRNSIZE)

           NOISE 1           NOISE 2           NOISE 3           NOISE 4
TRNSIZE    Mean      Group   Mean      Group   Mean      Group   Mean      Group
20         0.009347  A       0.048511  A       0.522296  A       1.469370  A
50         0.008597  B       0.043426  B       0.448900  B       1.259640  B
100        0.008393  C       0.042165  C       0.437786  C       1.224450  B C
200        0.008323  C       0.041818  C       0.432976  C       1.185430  C

Panel B (ENSIZE)

           NOISE 1           NOISE 2           NOISE 3           NOISE 4
ENSIZE     Mean      Group   Mean      Group   Mean      Group   Mean      Group
1          0.008793  A       0.046302  A       0.484039  A       1.365310  A
5          0.008650  B       0.043885  B       0.457906  B       1.286840  B
10         0.008643  B       0.043464  B C     0.454865  B       1.267640  B
20         0.008628  B       0.043200  B C     0.453674  B       1.255880  B
50         0.008609  B       0.043050  C       0.451964  B       1.247950  B

Table 6
ANOVA results for NAR

Panel A (TRNSIZE)

           NOISE 1           NOISE 2           NOISE 3           NOISE 4
TRNSIZE    Mean      Group   Mean      Group   Mean      Group   Mean      Group
20         0.008677  A       0.043273  A       0.423784  A       0.858004  A
50         0.008357  B       0.041377  B       0.403620  B       0.816297  B
100        0.008251  C       0.041104  B       0.399907  B C     0.804039  C
200        0.008213  C       0.040910  B       0.397547  C       0.798989  C

Panel B (ENSIZE; within each noise level, rows are ordered by mean)

NOISE 1                  NOISE 2                  NOISE 3                  NOISE 4
ENSIZE  Mean      Group  ENSIZE  Mean      Group  ENSIZE  Mean      Group  ENSIZE  Mean      Group
1       0.008400  A      1       0.042182  A      1       0.415831  A      1       0.838050  A
10      0.008378  A      5       0.041623  B      5       0.405169  B      5       0.822228  B
5       0.008373  A      20      0.041523  B      10      0.404410  B      10      0.814737  B C
20      0.008371  A      10      0.041520  B      20      0.403258  B      20      0.812335  B C
50      0.008351  A      50      0.041482  B      50      0.402403  B      50      0.809311  C

Table 7
ANOVA results for STAR1

Panel A (TRNSIZE)

           NOISE 1           NOISE 2           NOISE 3           NOISE 4
TRNSIZE    Mean      Group   Mean      Group   Mean      Group   Mean      Group
20         0.008509  A       0.042998  A       0.427968  A       0.872907  A
50         0.008254  B       0.041118  B       0.405233  B       0.824711  B
100        0.008172  B C     0.040812  B C     0.401781  B C     0.811998  C
200        0.008145  C       0.040638  C       0.399399  C       0.806916  C

Panel B (ENSIZE)

           NOISE 1           NOISE 2           NOISE 3           NOISE 4
ENSIZE     Mean      Group   Mean      Group   Mean      Group   Mean      Group
1          0.008327  A       0.042001  A       0.419158  A       0.843815  A
5          0.008278  A       0.041416  B       0.407197  B       0.831934  B
10         0.008267  A       0.041260  B       0.406566  B       0.825805  B
20         0.008251  A       0.041169  B       0.405157  B       0.823643  B
50         0.008226  A       0.041111  B       0.404898  B       0.820468  B


ensembles perform worse with regard to the overall error measures. In addition, both training sample size and ensemble size are significant factors, especially when the noise level is higher. The best overall results are always associated with the largest sample size (i.e., TRNSIZE = 200). However, there is no significant difference between TRNSIZE = 200 and TRNSIZE = 100, indicating that 100 data points are sufficient to model these time series. As in the case of AR(2), the ensemble is not significantly better than the single model for the nonlinear autoregressive (NAR) process in Table 6 and the smooth transition autoregressive (STAR2) process in Table 8 when the noise level is very small (NOISE 1). At all other noise levels, the ensembles significantly outperform the single model. Although a larger ensemble size almost always gives better average results, there is no significant difference between the largest ensemble size (ENSIZE = 50) and the relatively small ensemble sizes of 5 or 10.

Figs. 5 and 6 plot, for the two selected series ARMA and STAR1, respectively, the average performance of various neural network models with regard to the three experimental factors of TRNSIZE, ENSIZE, and NOISE. The horizontal axis gives the various combinations of ENSIZE and TRNSIZE and the vertical axis

Table 8
ANOVA results for STAR2

Panel A (TRNSIZE)

           NOISE 1           NOISE 2           NOISE 3           NOISE 4
TRNSIZE    Mean      Group   Mean      Group   Mean      Group   Mean      Group
20         0.038537  A       0.152560  A       1.691700  A       2.667700  A
50         0.015937  B       0.053350  B       0.524900  B       1.129500  B
100        0.015658  B       0.048090  B       0.458500  B       0.979600  B
200        0.015350  B       0.045790  B       0.438000  B       0.883900  B

Panel B (ENSIZE; within each noise level, rows are ordered by mean)

NOISE 1                  NOISE 2                  NOISE 3                  NOISE 4
ENSIZE  Mean      Group  ENSIZE  Mean      Group  ENSIZE  Mean      Group  ENSIZE  Mean      Group
5       0.022590  A      1       0.095370  A      1       1.071000  A      1       1.919300  A
1       0.021377  A      5       0.073110  A      5       0.831700  A B    5       1.407200  A B
10      0.021052  A      10      0.071840  A      10      0.724200  A B    10      1.341900  A B
20      0.020941  A      20      0.067750  A      20      0.670000  A B    20      1.250000  B
50      0.020892  A      50      0.066660  A      50      0.594400  B      50      1.157400  B

Fig. 5. Average MAE vs. ensemble size and training sample size with different noise levels for the ARMA process.

Fig. 6. Average MAE vs. ensemble size and training sample size with different noise levels for the STAR1 process.


shows the average MAE. It is clear that higher noise levels are associated with higher average MAE. Although the patterns at the NOISE 1 and NOISE 2 levels are more difficult to see due to the different scales used, they are very similar to those at NOISE 3 and NOISE 4. Overall, we see that the ensemble performs better than the single model at each training size level. However, the ensemble seems more effective when the training sample size is relatively small.

Table 9 reports the results of the jittered ensemble method compared with those of the random ensemble method for different noise levels and ensemble sizes. In addition, we calculate the overall average across all ensembles for each series and noise level. Two observations can be made. First, as the ensemble size increases, the performance of the random ensemble generally improves, although the improvement is not significant in most cases. Second, for every time series, the jittered ensemble has a clear advantage over the random ensemble, as seen from the comparison of the averages and from the results for almost all combinations of noise level and ensemble size. This is consistent with the observation made by Sharkey [49] that ensembles formed by varying the data sets are generally better than ensembles built on a single data set.
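The contrast between the two ensemble constructions can be sketched as follows. The function names and the toy series are illustrative assumptions: in the study itself, a neural network member would be trained on each returned training set, with the random ensemble's diversity coming only from random starting weights.

```python
import numpy as np

def jittered_training_sets(y, sigma, n_members, rng):
    # Jittered ensemble: each member trains on its own noise-corrupted
    # copy of the original series.
    return [y + rng.normal(0.0, sigma, size=y.shape) for _ in range(n_members)]

def identical_training_sets(y, n_members):
    # Random ensemble: every member sees the identical series; diversity
    # comes only from random weight initialization during training.
    return [y.copy() for _ in range(n_members)]

rng = np.random.default_rng(42)
y = np.arange(50, dtype=float)          # toy series standing in for real data
jit_sets = jittered_training_sets(y, 0.5, 5, rng)
same_sets = identical_training_sets(y, 5)
```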

To verify the results from the simulation study, we choose three data sets from [10] (Series B, E, and F) and three stock indices for further examination. Table 10 reports the results from this experiment. For each time series, the ensemble size varies from 1 to 50, with a size of 1 representing the baseline single model. Several observations can be made from this table. First, as in the simulation study, the ensemble generally outperforms the single model approach at every ensemble size level, in both RMSE and MAE, for all series. Second, there is a clear pattern in the performance measure as the ensemble size increases. As we increase the size of

Table 9
Comparison of ensemble methods

                  NOISE 1             NOISE 2             NOISE 3             NOISE 4
Process  ENSIZE   Jitter    Random    Jitter    Random    Jitter    Random    Jitter    Random
AR(2)    5        0.008552  0.008614  0.043159  0.043341  0.433262  0.442833  0.890659  0.908376
         10       0.008546  0.008614  0.043096  0.043324  0.431007  0.441281  0.883820  0.900597
         20       0.008557  0.008610  0.042972  0.043306  0.431937  0.441015  0.880446  0.897438
         50       0.008584  0.008611  0.042942  0.043264  0.430768  0.440912  0.879895  0.883668
         Avg      0.008552  0.008612  0.043042  0.043309  0.431744  0.441510  0.883705  0.897520
ARMA     5        0.008527  0.008636  0.042511  0.043332  0.418698  0.441240  0.847322  0.886002
         10       0.008467  0.008634  0.042181  0.043181  0.415331  0.440858  0.836625  0.884462
         20       0.008438  0.008632  0.041995  0.042953  0.413333  0.440633  0.831350  0.884387
         50       0.008404  0.008631  0.041873  0.042743  0.412004  0.440499  0.826409  0.884049
         Avg      0.008459  0.008633  0.042140  0.043052  0.414842  0.440808  0.835427  0.884725
BL       5        0.008650  0.008709  0.043885  0.043996  0.457906  0.457856  1.286840  1.320020
         10       0.008643  0.008704  0.043464  0.043477  0.454865  0.455417  1.267640  1.317110
         20       0.008628  0.008663  0.043200  0.043321  0.453674  0.453709  1.255880  1.310930
         50       0.008609  0.008649  0.043050  0.043116  0.451964  0.452551  1.247950  1.309850
         Avg      0.008633  0.008681  0.043400  0.043478  0.454602  0.454883  1.264578  1.314478
NAR      5        0.008373  0.008382  0.041623  0.042099  0.405169  0.426091  0.822228  0.844387
         10       0.008378  0.008380  0.041520  0.041968  0.404410  0.425659  0.814737  0.844211
         20       0.008371  0.008380  0.041523  0.041792  0.403258  0.425529  0.812335  0.844158
         50       0.008351  0.008379  0.041482  0.041847  0.402403  0.425435  0.809311  0.843878
         Avg      0.008368  0.008380  0.041537  0.041927  0.403810  0.425679  0.814653  0.844159
STAR1    5        0.008278  0.008308  0.041416  0.041336  0.407197  0.427431  0.831934  0.847826
         10       0.008267  0.008305  0.041260  0.041314  0.406566  0.427178  0.825805  0.847289
         20       0.008251  0.008305  0.041169  0.041299  0.405157  0.427076  0.823643  0.846651
         50       0.008226  0.008303  0.041111  0.041287  0.404898  0.427034  0.820468  0.846522
         Avg      0.008255  0.008305  0.041239  0.041309  0.405955  0.427180  0.825463  0.847072
STAR2    5        0.022590  0.022565  0.073110  0.072850  0.831700  0.814493  1.407200  1.391800
         10       0.021052  0.021965  0.071840  0.072717  0.724200  0.734940  1.341900  1.372300
         20       0.020941  0.021867  0.067750  0.070761  0.670000  0.697900  1.250000  1.355300
         50       0.020892  0.021796  0.066660  0.069324  0.594400  0.676400  1.157400  1.270100
         Avg      0.021369  0.022048  0.069840  0.071413  0.705075  0.730933  1.289125  1.347375

Table 10
Results with the real time series

Series   Ensemble size   RMSE      MAE
BJ-B     1               7.9882    6.8587
         5               7.7293    6.6606
         10              7.8377    6.7824
         20              7.9656    6.8448
         50              7.9735    6.8498
BJ-E     1               12.6848   9.5840
         5               11.2672   9.0561
         10              11.2799   9.1688
         20              11.7303   9.4261
         50              11.8284   9.5638
BJ-F     1               11.6432   9.6059
         5               10.7188   8.7754
         10              11.2757   8.9871
         20              11.6474   9.2433
         50              12.0518   9.5178
S&P      1               7.9353    6.1502
         5               7.8352    6.1024
         10              7.8257    6.0673
         20              7.8266    6.0682
         50              7.8333    6.0885
Nasdaq   1               21.3383   16.8030
         5               20.9694   16.5494
         10              20.9708   16.5593
         20              20.9724   16.5623
         50              20.9798   16.6409
Russell  1               4.2510    3.3083
         5               4.1960    3.2647
         10              4.1979    3.2677
         20              4.1989    3.2737
         50              4.2000    3.2765


ensemble, both RMSE and MAE first decrease and then increase. In fact, when the ensemble size reaches 20 or higher, the ensemble performs worse than the single model in terms of RMSE, although the differences may not be significant. The best ensemble size, with the lowest RMSE or MAE, is 5 or 10 for all these time series, which is in line with the recommendation from the simulation study.

5. Conclusions

Improving forecasting accuracy, especially time series forecasting accuracy, is an important yet often difficult task facing decision makers in a wide range of areas. Combining multiple models, or using ensemble methods, can be an effective way to improve forecasting performance. Several large-scale forecasting competitions involving a large number of commonly used time series forecasting models conclude that combining forecasts from more than one model often leads to improved performance [35–37]. More recently, considerable research has been undertaken on neural network ensembles. Most of this work, however, is devoted to classification problems, as evidenced by the published research. As time series problems are often more difficult to model due to issues such as autocorrelation and the single realization available at any particular time point, there is a need to devote more neural network research to time series analysis and forecasting.

In this paper, we proposed a jittered ensemble method with neural networks for time series forecasting and tested its effectiveness with both simulated and real time series. The central idea of the jittered ensemble is to add noise to the input data, thus augmenting the original training data set to form different but related training samples. Results show that the proposed method is able to consistently outperform the single modeling approach for a variety of time series processes. Although it may not be possible to give a theoretical recommendation on the ensemble size for all practical problems, our experimental results suggest that smaller ensembles perform similarly to larger ones. This has an important practical implication, as smaller ensembles require less computational effort. We found from both the simulation and the real data experiments that the best ensemble size is 5 or 10, from both accuracy and parsimony considerations.

The effectiveness of the jittered ensemble method may be attributed to several reasons. First, it is well known that neural networks are unstable learning machines, so generating multiple versions of a model can help capture different perspectives of the data and reduce model uncertainty. Second, as research suggests that combining networks trained on different data is more effective than building ensembles on the same data, it is not surprising that the jittered ensemble can improve generalization performance significantly. Finally, as discussed earlier, an observed time series is only one possible realization of the underlying data generating process. By adding noise to the training data, not only are we able to build a more robust neural ensemble that generalizes well, but we also augment the information set available for model building and can thus learn the underlying process better.

The proposed idea can be naturally extended to other types of neural networks, such as radial basis function networks and recurrent neural networks, as well as to support vector machines [13,16,47,48,59]. In addition, ensemble methods built by combining different types of neural networks can be explored. Given the many different ensemble methods in the literature, it would also be worthwhile to empirically compare the effectiveness of various ensemble methods for time series forecasting.

Acknowledgment

This research was supported, in part, by a research grant from the Robinson College of Business, Georgia State University.

References

[1] G. An, The effect of adding noise during backpropagation training on a generalization performance, Neural Computation 8 (1996) 643–674.
[2] G. Armano, M. Marchesi, A. Murru, A hybrid genetic-neural architecture for stock indexes forecasting, Information Sciences 170 (1) (2005) 3–33.
[3] F.A. Atiya, I.S. Shaheen, A comparison between neural-network forecasting techniques-case study: river flow forecasting, IEEE Transactions on Neural Networks 10 (2) (1999).
[4] S.D. Balkin, J.K. Ord, Automatic neural network modeling for univariate time series, International Journal of Forecasting 16 (2000) 509–515.
[5] W.G. Baxt, Improving the accuracy of an artificial neural network using multiple differently trained networks, Neural Computation 4 (1992) 772–780.
[6] V.L. Berardi, G.P. Zhang, An empirical investigation of bias and variance in time series forecasting: modeling considerations and error evaluation, IEEE Transactions on Neural Networks 14 (3) (2003) 668–679.
[7] J. Berkowitz, L. Kilian, Recent developments in bootstrapping time series, Econometric Reviews 19 (1) (2000) 1–48.
[8] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[9] C.M. Bishop, Training with noise is equivalent to Tikhonov regularization, Neural Computation 7 (1) (1995) 108–116.
[10] G.E.P. Box, G. Jenkins, Time Series Analysis, Forecasting and Control, Holden-Day, San Francisco, CA, 1976.
[11] L. Breiman, Arcing classifiers, The Annals of Statistics 26 (1998) 801–849.
[12] W.M. Brown, T.D. Gedeon, D.I. Groves, Use of noise to augment training data: a neural network method of mineral-potential mapping in regions of limited known deposit examples, Natural Resources Research 12 (2) (2003) 141–152.
[13] X. Cai, N. Zhang, G.K. Venayagamoorthy, D.C. Wunsch, Time series prediction with recurrent neural networks trained by a hybrid PSO–EA algorithm, Neurocomputing 70 (2007) 2342–2353.
[14] K. Chakraborty, K. Mehrotra, C.K. Mohan, S. Ranka, Forecasting the behavior of multivariate time series using neural networks, Neural Networks 5 (1992) 961–970.
[15] C. Chatfield, Model uncertainty, data mining and statistical inference, Journal of the Royal Statistical Society Series A 158 (1995) 419–466.
[16] K.-Y. Chen, C.-H. Wang, A hybrid SARIMA and support vector machines in forecasting the production values of the machinery industry in Taiwan, Expert Systems with Applications 32 (2007) 226–254.
[17] Y. Chen, B. Yang, J. Dong, A. Abraham, Time-series forecasting using flexible neural tree model, Information Sciences 174 (3–4) (2005) 219–235.
[18] M. Cottrell, B. Girard, Y. Girard, M. Mangeas, C. Muller, Neural modeling for time series: a statistical stepwise method for weight elimination, IEEE Transactions on Neural Networks 6 (6) (1995) 1355–1364.
[19] C. De Groot, D. Wurtz, Analysis of univariate time series with connectionist nets: a case study of two classical examples, Neurocomputing 3 (1991) 177–192.
[20] B. Efron, R. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, New York, 1993.
[21] I. Ginzburg, D. Horn, Combined neural networks for time series analysis, Advances in Neural Information Processing Systems 6 (1994) 224–231.
[22] F. Giordano, M. La Rocca, C. Perna, Forecasting nonlinear time series with neural network sieve bootstrap, Computational Statistics and Data Analysis 51 (2007) 3871–3884.
[23] W.Y. Goh, C.P. Lim, K.K. Peh, Predicting drug dissolution profiles with an ensemble of boosted neural networks: a time series approach, IEEE Transactions on Neural Networks 14 (2) (2003) 459–463.
[24] L.K. Hansen, P. Salamon, Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (10) (1990) 993–1001.
[25] T. Hill, M. O'Connor, W. Remus, Neural network models for time series forecasts, Management Science 42 (7) (1996) 1082–1092.
[26] L. Holmstrom, P. Koistinen, Using additive noise in back-propagation training, IEEE Transactions on Neural Networks 3 (1) (1992) 24–38.
[27] M.S. Hung, J.W. Denton, Training neural networks with the GRG2 nonlinear optimizer, European Journal of Operational Research 69 (1993) 83–91.
[28] A. Jain, A.M. Kumar, Hybrid neural network models for hydrologic time series forecasting, Applied Soft Computing 7 (2007) 585–592.
[29] A. Krogh, J. Vedelsby, Neural network ensembles, cross validation, and active learning, Advances in Neural Information Processing Systems 7 (1995) 231–238.
[30] H. Kunsch, The jackknife and the bootstrap for general stationary observations, The Annals of Statistics 17 (1989) 1217–1241.
[31] A. Lapedes, R. Farber, Nonlinear signal processing using neural networks: prediction and system modeling, Technical Report LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM, 1987.
[32] R.Y. Liu, K. Singh, Moving blocks jackknife and bootstrap capture weak dependence, in: R. LePage, L. Billard (Eds.), Exploring the Limits of the Bootstrap, Wiley, New York, 1992, pp. 225–248.
[33] Y. Liu, X. Yao, Ensemble learning via negative correlation, Neural Networks 12 (1999) 1399–1404.
[34] G. Mani, Lowering variance of decisions by using artificial neural network portfolios, Neural Computation 3 (1991) 484–486.
[35] S. Makridakis, A. Anderson, R. Carbone, R. Fildes, M. Hibon, R. Lewandowski, J. Newton, E. Parzen, R. Winkler, The accuracy of extrapolation (time series) methods: results of a forecasting competition, Journal of Forecasting 1 (1982) 111–153.
[36] S. Makridakis, C. Chatfield, M. Hibon, M. Lawrence, T. Mills, K. Ord, L.F. Simmons, The M2-competition: a real-life judgmentally based forecasting study, International Journal of Forecasting 9 (1993) 5–29.
[37] S. Makridakis, M. Hibon, The M3-Competition: results, conclusions and implications, International Journal of Forecasting 16 (2000) 451–476.
[38] K. Matsuoka, Noise injection into inputs in back-propagation learning, IEEE Transactions on Systems, Man, and Cybernetics 22 (1992) 436–440.
[39] M.C. Medeiros, A. Veiga, A hybrid linear-neural model for time series forecasting, IEEE Transactions on Neural Networks 11 (6) (2000) 1402–1412.
[40] P.W. Munro, B. Parmanto, Competition among networks improves committee performance, in: M. Mozer, M.I. Jordan, T. Petsche (Eds.), Advances in Neural Information Processing Systems, vol. 9, MIT Press, Cambridge, MA, 1996, pp. 592–598.
[41] B. Parmanto, P.W. Munro, Improving committee diagnosis with resampling techniques, Advances in Neural Information Processing Systems 8 (1996) 882–888.
[42] E. Pelikan, C. de Groot, D. Wurtz, Power consumption in West-Bohemia: improved forecasts with decorrelating connectionist networks, Neural Network World 2 (6) (1992) 701–712.
[43] M.P. Perrone, L.N. Cooper, When networks disagree: ensemble methods for hybrid neural networks, in: R.J. Mammone (Ed.), Neural Networks for Speech and Image Processing, 1993, pp. 126–142.
[44] I. Poli, R.D. Jones, A neural net model for prediction, Journal of the American Statistical Association 89 (1994) 117–121.
[45] Y. Raviv, N. Intrator, Bootstrapping with noise: an effective regularization technique, Connection Science 8 (1996) 355–372.
[46] R. Reed, R.J. Marks II, S. Oh, Similarities of error regularization, sigmoid gain scaling, target smoothing, and training with jitter, IEEE Transactions on Neural Networks 6 (3) (1995) 529–538.
[47] V.M. Rivas, J.J. Merelo, P.A. Castillo, M.G. Arenas, J.G. Castellano, Evolving RBF neural networks for time-series forecasting with EvRBF, Information Sciences 165 (3–4) (2004) 207–220.
[48] A. Santos, N. da Costa, L. Coelho, Computational intelligence approaches and linear models in case studies of forecasting exchange rates, Expert Systems with Applications 33 (2007) 816–823.
[49] A.J.C. Sharkey, On combining artificial neural nets, Connection Science 8 (1996) 299–314.
[50] A.J.C. Sharkey, N.E. Sharkey, Combining diverse neural nets, The Knowledge Engineering Review 12 (3) (1997) 231–247.
[51] J. Sietsma, R. Dow, Creating artificial neural networks that generalize, Neural Networks 4 (1991) 67–79.
[52] S. Singh, Noise impact on time-series forecasting using an intelligent pattern matching technique, Pattern Recognition 32 (1999) 1389–1398.
[53] S. Singh, Noisy time-series prediction using pattern recognition techniques, Computational Intelligence 16 (1) (2000) 114–133.
[54] C. Wang, J.C. Principe, Training neural networks with additive noise in the desired signal, IEEE Transactions on Neural Networks 10 (6) (1999) 1511–1517.
[55] K. Wedding II, K.J. Cios, Time series forecasting by combining RBF networks, certainty factors, and the Box–Jenkins model, Neurocomputing 10 (1996) 149–168.
[56] A.S. Weigend, N.A. Gershenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, Reading, MA, 1993.
[57] A.S. Weigend, B.A. Huberman, D.E. Rumelhart, Predicting the future: a connectionist approach, International Journal of Neural Systems 1 (1990) 193–209.
[58] A.S. Weigend, B.A. Huberman, D.E. Rumelhart, Predicting sunspots and exchange rates with connectionist networks, in: M. Casdagli, S. Eubank (Eds.), Nonlinear Modeling and Forecasting, Addison-Wesley, Redwood City, CA, 1992, pp. 395–432.
[59] S.J. Yoo, J. Park, Y.H. Choi, Indirect adaptive control of nonlinear dynamic systems using self recurrent wavelet neural networks via adaptive learning rates, Information Sciences 177 (2007) 3074–3098.
[60] G. Zhang, B.E. Patuwo, M. Hu, Forecasting with artificial neural networks: the state of the art, International Journal of Forecasting 14 (1998) 35–62.
[61] G.P. Zhang, An investigation of neural networks for linear time-series forecasting, Computers and Operations Research 28 (2001) 1183–1202.
[62] G.P. Zhang, B.E. Patuwo, M.Y. Hu, A simulation study of artificial neural networks for nonlinear time-series forecasting, Computers and Operations Research 28 (2001) 381–396.
[63] G.P. Zhang, Time series forecasting using a hybrid ARIMA and neural network model, Neurocomputing 50 (2003) 159–175.
[64] G.P. Zhang, V.L. Berardi, Time series forecasting with neural network ensembles: an application for exchange rate prediction, Journal of the Operational Research Society 52 (6) (2001) 652–664.
[65] R.M. Zur, Y. Jiang, C.E. Metz, Comparison of two methods of adding jitter to artificial neural network training, International Congress Series 1268 (2004) 886–889.