2013 Sixth International Conference on Advanced Computational Intelligence October 19-21, 2013, Hangzhou, China
Ensembles of Echo State Networks for Time Series Prediction
Wei Yao, Zhigang Zeng, Cheng Lian, and Huiming Tang
Abstract- In time series prediction tasks, dynamic models are less popular than static models, even though they are more suitable for modeling the underlying dynamics of a time series. In this paper, a novel architecture and supervised learning principle for recurrent neural networks, namely echo state networks, is adopted to build dynamic time series predictors. Ensemble techniques are employed to overcome the randomness and instability of echo state predictors, and a dynamic ensemble predictor is thereby established. The proposed predictor is tested in numerical experiments, and different strategies for training it are comparatively studied. A case study is then conducted to test the predictor's performance in a realistic prediction task.
I. INTRODUCTION
TIME series analysis and prediction theory has many practical uses, in tasks such as economic prediction,
weather forecasting, early warning of geological disasters and so on [1-4]. A time series can be considered a sequential observation of a natural or artificial system, and from such a perspective, time series prediction is in fact the prediction of the developing trend of a target system. When the system is completely known, predicting this trend is not a demanding job. However, the dynamics of
realistic systems are usually very complex and their internal
mechanisms can hardly be described in a clear mathematical
model. Therefore, time series prediction techniques, which are based on observations, are resorted to.
The general idea of essentially all time series prediction techniques is to learn developing patterns from previously observed values and then give predictions about future values. The most widely used conventional time series prediction methods are auto-regressive models [5] and grey models [6].
With the development of neural computing techniques, neural networks have become an increasingly popular choice for handling time series data. Neural networks are among the most important data-driven modeling methods, and there are many different kinds of them. The structures
of neural networks can be divided into two main categories, namely the feed-forward networks and the recurrent
networks. In feed-forward networks, the computing units,
neurons, are arranged in different layers, and there are no
Wei Yao is with the School of Computer Science, South-Central University for Nationalities, Wuhan, China (email: [email protected]).
Zhigang Zeng (Corresponding Author) and Cheng Lian are with the School of Automation, Huazhong University of Science and Technology, Wuhan, China (email: [email protected]@hust.edu.cn).
Huiming Tang is with the Faculty of Engineering, China University of Geosciences, Wuhan, China.
The work is supported by the Natural Science Foundation of China under Grants 61203286 and 61125303, the 973 Program of China under Grant 2011CB710606, and the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant 20100142110021.
978-1-4673-6343-3/13/$31.00 ©2013 IEEE
closed-loop connections in the structure. In recurrent networks, neurons are more freely connected, and connections between neurons may form directed cycles. Differences in structure result in differences in function: feed-forward networks are static models, while recurrent networks are dynamic.
There are many successful applications of feed-forward
neural networks in time series prediction tasks as well as other machine learning tasks [7-9]. However, the realistic systems behind the time series observations are in essence
dynamic processes. When applying feed-forward neural networks, the time series prediction problem has to be transformed into a static regression problem, and the sequential
observations have to be rearranged as input-output couples, which loses the natural temporal ordering property of a time series.
Therefore, using recurrent neural networks to establish
dynamic prediction models is a more direct and proper solution. The main issue hindering wider usage of recurrent networks in time series prediction tasks is the lack of efficient learning algorithms. In this paper, the Echo State Network (ESN) [10] is employed to establish dynamic predictors. ESNs provide an efficient supervised learning principle for dynamic modeling. Details will be explained in the follow-up sections.
On the other hand, in order to obtain more stable and accurate predictions, ensemble approaches are usually adopted in time series tasks. In ensemble approaches, multiple predictors are established independently to predict the future values of the same time series. Multiple and usually various predictions for the same process are obtained. These predictions are then combined to generate a single prediction.
It is not surprising that the ensemble can give more stable predictions than its component predictors, since the combination can "neutralize" the performance differences between good component predictors and bad component predictors. It is also a widely recognized fact that the combined prediction can sometimes be superior to the best prediction achieved by any of the component predictors. Ensembles of feed-forward neural networks and statistical models are common, while ensembles of recurrent neural networks are less frequently
reported. In our study, ensembles of ESN are established and their performances are comparatively studied.
This paper is organized as follows. In Section II, a brief introduction about the structure and the learning algorithm for ESN is provided. In Section III, ensemble prediction approaches are discussed and the most proper ensemble strategy
from some candidates for ESN predictors is proposed. In Section IV, the structure of the ESN ensemble predictor and the proposed ensemble prediction method are explained in
detail. Numerical experiments on Mackey-Glass benchmark data and case study on realistic landslide data are described
in Section V and Section VI, respectively. The conclusions are drawn in Section VII.
II. ECHO STATE NETWORK PREDICTOR
The structure of an echo state network is not obviously different from that of a common recurrent neural network. Neurons in the network can be divided into three groups: the input neurons, the output neurons and the internal neurons. Recurrent connections mainly exist between internal neurons, and these recurrently connected internal neurons build up the very component of the network that is in charge of producing dynamics. This component, namely the internal
neurons and the connections between them, is called a reservoir. Signals are conducted from the input neurons to the reservoir to provoke dynamic processes in internal neurons.
Then these dynamic processes are collected by the output neurons to combine into desired dynamics. This is how an echo state network simulates the development of realistic
dynamic processes.
Training of an echo state network can be accomplished in two steps. First, initial connection weights for all the connections in the network are randomly generated. The connection weights inside the reservoir will not be changed hereafter; therefore the generation of the internal connections has to fulfill certain conditions, called "the echo state property" [10, 11], to ensure the memory capability as well as the stability of the network. In the second step of training, only the output connections are adjusted, aiming at producing a best combination of the internal dynamics to approximate the desired dynamics. The structure of the echo state network and its training process is illustrated in Fig. 1.

[Fig. 1. Structure of the echo state network predictor: three stages (1. Initialization, 2. Training, 3. Testing), showing the input connections, the recurrent connections within the reservoir, and the output connections.]

In a time series prediction task, the training data set for an echo state network can be formulated as follows. On one hand, the time series is divided into two parts. The first part is picked up to train the model,
while the second part is reserved as the testing set. On the other hand, a delayed version of the time series is created as the input to stimulate internal dynamics in the network,
and the original time series is kept as the desired output, according to which the output connections will be adjusted during training.
Assuming the length of the time series observations is T, the series can be denoted as d(t), t = 1, ..., T. The length of the first part is t1; therefore the training set is obtained as {d(t-1), d(t)}, t = 1, ..., t1. After initialization, the internal states, denoted as a matrix X(t), will be updated following

X(t) = f(Wx · X(t-1) + Wi · d(t-1)), t = 1, ..., t1,   (1)

where f(·) denotes the activation function of the internal neurons, and Wx and Wi are the randomly created internal connection weights and input connection weights. Then the output connection weights are trained following

Wo = argmin g(d(t) - Wo · X(t), Wo), t = t0, ..., t1,   (2)

where g(·) denotes the objective function, which is monotonically decreasing as the deviation d(t) - Wo · X(t) decreases, and usually a regularization term is also included in it. Notice that t0 > 1: for t = 1, ..., t0 - 1, the internal states X(t) are considered initial transient states and discarded. After training, the network will be able to produce output close to the realistic time series, and its accuracy can be assessed following

error = e(d(t) - Wo · X(t)), t = t1 + 1, ..., T,   (3)

where e(·) is a certain kind of error measure, such as the mean square error. Once the accuracy of the model is acceptable, it can be used to predict future values.
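The two training steps of Eqs. (1)-(3) can be sketched as follows. This is an illustrative sketch, not the authors' code: the reservoir size, spectral-radius scaling, tanh activation, ridge penalty and the placeholder sine series are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

T, t0, t1 = 1000, 51, 300           # series length, washout end, training end
d = np.sin(0.3 * np.arange(T + 1))  # placeholder time series d(t), assumed

n = 100                             # reservoir (internal) size, assumed
Wx = rng.uniform(-1, 1, (n, n))
Wx *= 0.9 / np.max(np.abs(np.linalg.eigvals(Wx)))  # scale toward the echo state property
Wi = rng.uniform(-1, 1, n)

# Eq. (1): drive the reservoir with the delayed series d(t-1)
X = np.zeros((T + 1, n))
for t in range(1, T + 1):
    X[t] = np.tanh(Wx @ X[t - 1] + Wi * d[t - 1])

# Eq. (2): ridge-regularized least squares on the states t = t0..t1
# (the regularization term plays the role of the penalty inside g(.))
Xtr, dtr = X[t0:t1 + 1], d[t0:t1 + 1]
lam = 1e-6                          # assumed regularization strength
Wo = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(n), Xtr.T @ dtr)

# Eq. (3): error on the reserved part t = t1+1..T
Xte, dte = X[t1 + 1:], d[t1 + 1:]
rmse = np.sqrt(np.mean((dte - Xte @ Wo) ** 2))
```

Only Wo is learned; Wx and Wi stay fixed after initialization, which is what makes ESN training a simple linear regression.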
III. PREDICTOR ENSEMBLES
Since the input and internal connection weights are randomly created, the echo state network predictor holds some randomness by nature, and its performance in prediction tasks tends to be unstable. To cope with this problem, an ensemble prediction approach is employed: multiple ESN predictors are created for the same prediction task, and their predictions are combined.
Diversity among component predictors and high quality of component predictors are the two requirements for a successful ensemble. If the component predictors are similar, the combination cannot bring obvious improvement. If the performances of the component predictors are poor, the performance of the ensemble may be further deteriorated. ESN ensembles meet these two requirements quite well. The random internal connections of ESNs ensure the diversity. As regards the accuracy of the components, on the benchmark task of predicting the Mackey-Glass time series, the ESN predictor achieved a dramatic improvement, by a factor of 2400, over previous techniques [10].
Usually, the ensemble prediction is obtained by averaging or a weighted average of the predictions of the component predictors. Averaging is the simplest ensemble approach, and as argued by Andrawis et al. [12], simple combination schemes tend to perform best. As for weighted linear combination, the most important issue is assigning a proper weight to each component predictor; preferably, a better component takes a bigger weight in the ensemble. Under such a consideration, more sophisticated and adaptive ensembles, usually based on neural networks [13-16], have been proposed.
In our work, a feed-forward neural network is used to make an adaptive ensemble of ESN predictors. Therefore, the combination weights of the component predictors are obtained in an additional step of training. The very efficient training algorithm of the extreme learning machine (ELM) [17] is adopted to train a single hidden layer feed-forward neural network into an ensemble predictor. This ensemble predictor will be denoted as the ESN-ELM predictor hereafter.
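The ELM ensemble unit can be sketched as follows. This is a hypothetical illustration: the component predictions are stand-in random data rather than real ESN outputs, and the hidden-layer size is an assumed value. Only the output weights are solved for, in the spirit of ELM [17].

```python
import numpy as np

rng = np.random.default_rng(1)

n_comp, n_hidden, n_samples = 20, 30, 160   # assumed sizes
# hypothetical component predictions (rows: time steps) and targets
P = rng.normal(size=(n_samples, n_comp))
y = P.mean(axis=1) + 0.01 * rng.normal(size=n_samples)

# input weights and biases: randomly initialized, then kept fixed
W_in = rng.uniform(-1, 1, (n_comp, n_hidden))
b = rng.uniform(-1, 1, n_hidden)

H = np.tanh(P @ W_in + b)      # hidden-layer activations
beta = np.linalg.pinv(H) @ y   # output weights by least squares (ELM step)

y_hat = H @ beta               # adaptive ensemble prediction
```

Because only beta is trained, this combination step costs a single linear solve, which is why ELM suits the static regression of merging component predictions.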
IV. ESN-ELM PREDICTOR
[Fig. 2. Structure of the ESN-ELM predictor: time series samples d(t) feed n parallel ESN components (training step 1), whose predictions d1(t+1), ..., dn(t+1) are combined by the ELM ensemble unit (training step 2) into the final prediction d(t+1) (testing).]
ELM is widely used in many kinds of pattern classification and regression applications [18]. Although the ESN predictors are dynamic, the combination of their predictions is a static regression problem; therefore, ELM's efficiency in regression applications makes it a good choice for the ensemble process. The proposed ESN-ELM predictor features a parallel-cascade structure, as illustrated in Fig. 2. The component predictors and the ensemble unit are combined into one big network, which is a hybrid of multiple recurrent networks and a single hidden layer feed-forward network.
There are two steps of training for the ensemble predictor. The first step of training is applied to the component predictors, following the algorithm which has already been
explained in Section II. The second step of training for the ELM ensemble unit is similar to the training of ESNs. The
input connections are initialized first and then kept unchanged, and only the output connections of the network
will be adjusted during the training process. In order to avoid
over-fitting, the training set is divided into two subsets. The first training set is used to train the component predictors,
namely the ESNs. Then the second training set is used to train the ELM ensemble unit.
The whole process of using the proposed ESN-ELM predictor to produce reliable predictions in a prediction task is as follows.
1) Initialization: n recurrent networks and a feed-forward network are connected to build a parallel-cascade structure. The input connections of all the component networks and
internal connections of the n recurrent networks are randomly initialized.
2) 1st training: The first training set is applied onto the recurrent networks, and the output connections of these recurrent networks are adjusted accordingly following the
training algorithm of ESNs. 3) 2nd training: The second training set is applied onto
the feed-forward part, and the output connections of this feed-forward network are adjusted accordingly, following the
training algorithm of ELMs. 4) Testing: After training, the reliability of the ESN-ELM
predictor can be evaluated quantitatively by computing the
deviations between its predictions and the real values in the testing set. If the predictor achieves acceptable prediction accuracy on the testing set, it can be expected to produce reliable predictions for the truly unknown future. 5) Predicting: Finally, the predictor can be used to predict
future values. For multi-step prediction, the iterative prediction strategy can be adopted. The first four of the above listed steps will be conducted in numerical experiments to verify the effectiveness of the proposed predictor. A realistic prediction task will also be reported in the following sections.
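The iterative prediction strategy mentioned in step 5 can be sketched as follows; the one-step predictor here is a trivial stand-in, not the trained ESN-ELM model.

```python
# Iterative (recursive) multi-step prediction: each one-step prediction
# is fed back as the next input.
def iterative_forecast(predict_one_step, history, horizon):
    """Roll a one-step predictor forward `horizon` steps."""
    window = list(history)
    out = []
    for _ in range(horizon):
        nxt = predict_one_step(window)  # predict d(t+1) from the window so far
        out.append(nxt)
        window.append(nxt)              # feed the prediction back as input
    return out

# toy usage: a stand-in "predictor" that just repeats the last value
preds = iterative_forecast(lambda w: w[-1], [1.0, 2.0, 3.0], horizon=4)
# preds == [3.0, 3.0, 3.0, 3.0]
```

Because predicted values are fed back, prediction errors can accumulate over the horizon; this is the usual price of the iterative strategy.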
V. NUMERICAL EXPERIMENTS
The benchmark Mackey-Glass time series is used in our numerical prediction experiments. The purposes of the numerical experiments are threefold. First, since the training set will be divided into two subsets, how to divide it and how the division affects the performance of the ensemble predictor need to be figured out. Second, the improvements of the ensemble predictor, as compared to the component predictors as well as their average, are to be verified. Third, the effectiveness of the strategy of dividing the training set into two parts will also be discussed.
1000 data points are selected out of the Mackey-Glass time series. A one-step delayed version of this data set is created to make up input-output couples. The first 50 data points are used to provoke the dynamics in the ESNs, the follow-up 250 data couples are picked up to form the training set, and the remaining 700 couples are reserved for testing the predictors.
20 ESN predictors are established to build up the ensemble. The lengths of the two training sets sum to 250: when the first training set gets shorter, the second gets longer. This is a tradeoff between the two steps of training. When the first training set gets shorter, the training of the component predictors will be less sufficient, and their performance will deteriorate. This explains the phenomenon illustrated in Fig. 3 that the averaged root mean square errors (rmse) of the components' predictions rise as the second training set gets longer. However, this deterioration is not obvious until the second training set is longer than 120, which means that 130 training couples are sufficient for the ESNs. The best performance of the ensemble predictor is achieved when the second training set has a length of about 160. At this point, an optimal tradeoff is reached: both the components (ESNs) and the ensemble unit (ELM) can be properly trained.
[Fig. 4. Comparisons between the component predictors (labeled 1-20), their average ('Av'), and the ensemble predictor ('En') regarding mean and maximum absolute percentage error (errors roughly between 0.1% and 0.9%).]
[Fig. 3. Impacts of different separations between the two training sets on the testing results of the ensemble predictor: testing rmse (log scale, roughly 10^-4 to 10^-1) against the length of training set 2 (0 to 240).]
The absolute percentage error is another common measurement for predictions. The mean absolute percentage error measures the general performance of the predictor on the testing set, while the maximum absolute percentage error is also important because it indicates the stability of the predictor; a predictor with low variance might be preferred to a predictor with a better average error but higher error peaks [19]. The ensemble predictor is compared with the 20 component predictors with regard to mean ape and max ape. The 20 component predictors are first trained on the whole 250 training couples to show their best, and their predictions are averaged; the apes of their predictions are labeled 1-20 and 'Av' in Fig. 4. Then these component predictors are reset and trained again on a subset of length 90, while the remaining 160 training couples are used to
train the ensemble unit. The apes of the prediction ensemble are labeled 'En'. Both the averaging and the proposed ensemble give more accurate and more stable predictions than the components, and the proposed ensemble is even better than the averaging.
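The two measures, mean ape and max ape, can be computed as follows; the numbers here are illustrative values, not the experimental data.

```python
import numpy as np

def ape(actual, predicted):
    """Absolute percentage error per point (assumes no zero actual values)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs((actual - predicted) / actual)

# hypothetical actual and predicted values
a = [100.0, 200.0, 400.0]
p = [101.0, 196.0, 404.0]

e = ape(a, p)
mean_ape = e.mean()  # general accuracy over the testing set
max_ape = e.max()    # worst case, indicating the predictor's stability
```

A low max ape alongside a low mean ape is what distinguishes a stable predictor from one that is merely accurate on average.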
TABLE I
COMPARISONS BETWEEN TWO ENSEMBLE STRATEGIES

                                        strategy 1   strategy 2
length of training set 1                   250           90
length of training set 2                   250          160
average testing error of components      2.11e-4      3.37e-4
training error of the ensemble           4.46e-7      1.15e-4
testing error of the ensemble            4.08e-4      1.51e-4
Since the length of the training set is usually limited, especially in realistic prediction tasks, there is a tradeoff between the two training steps in the proposed method. A question then arises naturally: why not reuse the training couples in the two training steps? Our reason for not doing so is that this strategy may lead to over-fitting problems, and this assumption is validated by a comparison between the two training strategies during the numerical experiments, as reported in TABLE I. When reusing the 250 training couples in both training steps, better performances of the component predictors are obtained, and for the training of the ensemble, the training error in the form of rmse drops dramatically. However, the ensemble predictor's performance on the testing set gets worse. Decreasing training error together with increasing testing error is a typical sign of over-fitting. Therefore, it can be concluded that the training strategy adopted in the proposed ESN-ELM predictor is more proper than the reusing strategy.
It is worth mentioning that the computation time of the neural computing process is not really a problem. The training of all the component predictors can be accomplished in less than 2 seconds, as recorded during our experiments. Furthermore, the training and testing of the ensemble take less than 0.3 seconds. Since the length of the training set will be even shorter in the realistic prediction tasks we care about, which will be explained in detail in the following section, the efficiency of the proposed model will not be a problem in its applications.
VI. CASE STUDY
After verifying its efficiency in numerical experiments, the
proposed method is used to predict landslide displacement time series. This is a case study that demonstrates the application of the proposed ensemble predictor in realistic prediction tasks.
We choose the Yuhuangge (YHG) landslide as our study case. In order to give early warning of possible landslide disasters in the future, the displacements of the landslide body are recorded. The monthly recordings of the Yuhuangge landslide begin in May 2004, and a time series of 101 monthly recordings is selected for our study. A denoising process is conducted. As illustrated in Fig. 5, the original time series is severely polluted by noise, and the denoised time series is much smoother. The recordings of the first 76 months
[Fig. 5. Original landslide displacement recordings and the corresponding denoised values: displacement (roughly 0-35) against recording date (2004-5 to 2012-10), with the original recordings shown solid and the denoised values dashed.]
are used to train the ensemble predictor. After training, the ensemble predictor keeps running to produce predictions for the following 25 months' displacements. The predictions and the actual displacements are compared in Fig. 6. Although this realistic time series is quite short and the recordings are severely polluted by noise, which can hardly be removed completely, the predictions are basically acceptable.
VII. CONCLUSIONS
A dynamic ensemble predictor, the ESN-ELM predictor,
is proposed in this paper. Since realistic systems behind time series observations are in essence dynamic, dynamic
models, like the proposed ESN-ELM model, have a larger
[Fig. 6. Landslide displacement predictions produced by the proposed ensemble predictor: actual displacements versus displacement predictions based on the YHG recordings, over the predicted time steps (months).]
potential to be trained into accurate predictors than static models. Successful applications of ESNs in tasks of predicting benchmark time series have already been reported, while the goal of our study is to further improve the performance of the dynamic predictor by building ESN ensembles.
As verified by the results obtained in the numerical experiments, the ensemble can not only suppress the instability of the ESN predictors but also improve the overall prediction accuracy, and the adaptive ensemble based on a feed-forward neural network also shows improvement compared to the averaging method. Another important finding is that reusing the training set may bring in over-fitting problems; hence separating the training set is a better strategy for training the ESN-ELM predictor.
REFERENCES
[1] D. Ömer Faruk, "A hybrid neural network and ARIMA model for water quality time series prediction," Engineering Applications of Artificial Intelligence, vol. 23, pp. 586-594, 2010.
[2] R.J. Povinelli and X. Feng, "A new temporal pattern identification method for characterization and prediction of complex time series events," IEEE Transactions on Knowledge and Data Engineering, vol. 15, pp. 339-352, 2003.
[3] E.W. Saad, D.V. Prokhorov and D.C. Wunsch, II, "Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks," IEEE Transactions on Neural Networks, vol. 9, pp. 1456-1470, 1998.
[4] C.L. Wu and K.W. Chau, "Data-driven models for monthly streamflow time series prediction," Engineering Applications of Artificial Intelligence, vol. 23, pp. 1350-1367, 2010.
[5] F.M. Pouzols, A. Lendasse and A.B. Barros, "Autoregressive time series prediction by means of fuzzy inference systems using nonparametric residual variance estimation," Fuzzy Sets and Systems, vol. 161, pp. 471-497, 2010.
[6] E. Kayacan, B. Ulutas and O. Kaynak, "Grey system theory-based models in time series prediction," Expert Systems with Applications, vol. 37, pp. 1784-1789, 2010.
[7] C.-M. Lee and C.-N. Ko, "Time series prediction using RBF neural networks with a nonlinear time-varying evolution PSO algorithm," Neurocomputing, vol. 73, pp. 449-460, 2009.
[8] N.I. Sapankevych and R. Sankar, "Time series prediction using support vector machines: a survey," IEEE Computational Intelligence Magazine, vol. 4, pp. 24-38, 2009.
[9] P. Yee and S. Haykin, "A dynamic regularized radial basis function network for nonlinear, nonstationary time series prediction," IEEE Transactions on Signal Processing, vol. 47, pp. 2503-2521, 1999.
[10] H. Jaeger and H. Haas, "Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication," Science, vol. 304, pp. 78-80, 2004.
[11] M. Lukosevicius and H. Jaeger, "Reservoir computing approaches to recurrent neural network training," Computer Science Review, vol. 3, pp. 127-149, 2009.
[12] R.R. Andrawis, A.F. Atiya and H. El-Shishiny, "Forecast combinations of computational intelligence and linear models for the NN5 time series forecasting competition," International Journal of Forecasting, vol. 27, pp. 672-688, 2011.
[13] M. van Heeswijk, Y. Miche, T. Lindh-Knuutila, P.J. Hilbers, T. Honkela, E. Oja and A. Lendasse, "Adaptive ensemble models of extreme learning machines for time series prediction," ICANN 2009, Springer Berlin Heidelberg, pp. 305-314, 2009.
[14] J. Sun and H. Li, "Financial distress prediction using support vector machines: Ensemble vs. individual," Applied Soft Computing, vol. 12, pp. 2254-2265, 2012.
[15] K. Siwek and S. Osowski, "Improving the accuracy of prediction of PM10 pollution by the wavelet transformation and an ensemble of neural predictors," Engineering Applications of Artificial Intelligence, vol. 25, pp. 1246-1258, 2012.
[16] C.Y. Sheng, J. Zhao, W. Wang and H. Leung, "Prediction intervals for a noisy nonlinear time series based on a bootstrapping reservoir computing network ensemble," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, pp. 1036-1048, 2013.
[17] G.B. Huang, H.M. Zhou, X.J. Ding and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, pp. 513-529, 2012.
[18] G.B. Huang, D. Wang and Y. Lan, "Extreme learning machines: a survey," International Journal of Machine Learning and Cybernetics, vol. 2, pp. 107-122, 2011.
[19] M. De Felice and X. Yao, "Short-term load forecasting with neural network ensembles: a comparative study [application notes]," IEEE Computational Intelligence Magazine, vol. 6, pp. 47-56, 2011.