
TIME SERIES PREDICTION USING NEURAL NETWORKS

by

CARRIE KNERR, B.S.

A THESIS

IN

COMPUTER SCIENCE

Submitted to the Graduate Faculty of Texas Tech University in

Partial Fulfillment of the Requirements for

the Degree of

MASTER OF SCIENCE

Approved

Chairperson of the Committee

Accepted

Dean of the Graduate School

May, 2004

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

I. INTRODUCTION

1.1 Time Series
1.2 Artificial Neural Networks
1.3 Recurrent Neural Networks
1.4 Least Squares Algorithm
1.5 Learning Rule
1.6 Mackey Glass Equation
1.7 Electrocardiogram Data
1.8 Research Objectives

II. BACKGROUND

2.1 Training
2.2 Pearlmutter Algorithm
2.3 Recent Applications

III. RESULTS

3.1 Square Wave
3.2 Mackey Glass
3.3 ECG Signal

IV. CONCLUSION

REFERENCES

LIST OF TABLES

3.1: Square Wave Prediction Ability
3.2: ECG Trained 1,000 Epochs
3.3: ECG Test Training Ability

LIST OF FIGURES

1.1 Perceptron Model
1.2 Feed-Forward Network
1.3 Recurrent Network
1.4 Fully Connected Recurrent Network
1.5 Mackey Glass Time Series
1.6 Step Function First Harmonic
1.7 Step Function First Ten Harmonics
1.8 Triangle Function First Harmonic
1.9 Triangle Function First Ten Harmonics
1.10 Harmonic Generator
1.11 Harmonic Generator Prediction
1.12 Sigmoid Activation Function
1.13 Algorithm Flow Chart
3.1 Window Function
3.2 Square Wave Prediction for Sine Only Network
3.3 Square Wave Prediction for Sines and Cosines Network
3.4 Square Wave Compare 100,000 Epochs
3.5 Square Wave Training Ability
3.6 Square Wave Training Ability (2)
3.7 Mackey-Glass Prediction Results
3.8 Mackey-Glass Prediction (2)
3.9 Compare Prediction Results
3.10 Compare Mackey-Glass Prediction
3.11 Eta 1/10000
3.12 Eta 1/1000
3.13 100 Epochs, Eta 1/100
3.14 Eta 1/10
3.15 Comparison of Learning Rates - Cosines and Sines Network
3.16 10,000 Epochs
3.17 Error Goal < 1.5
3.18 Error Goal < 1.3
3.19 Error Goal < 1.2

ABSTRACT

This thesis involves the investigation of the effect of prior knowledge

embedded in an artificial fully connected recurrent neural network for the

prediction of non-linear time series. The networks utilize the back propagation

method for training. Two network architectures are compared using time series

such as the square wave, Mackey Glass data, and an ECG signal to determine if

prediction quality or training ability is improved when additional information,
in the form of cosine oscillators, is embedded in the network. One potential
benefit of this work is the prediction of abnormalities in the ECG signal, an
electrical measure of heart activity. Such a capability would allow medical
professionals to intervene and possibly prevent them. The improved network was able to

provide increased prediction value and training ability for the Mackey Glass time

series.


CHAPTER I

INTRODUCTION

The objective of this thesis is to improve time series prediction on a

deterministic system using a neural network. A deterministic system is one in

which the future states of the system are determined by the current states of the

system and a set of differential equations; a deterministic system is not random.

The unknown equations are the theoretical basis which enables prediction of the

system. This work may be applied to non-linear deterministic systems such as

ECG and EEG measurements to predict future catastrophic events.

The non-linear dynamical theory can be used to represent the human

heart's electrical activity. The electrocardiogram (ECG) is a chaotic signal [Glass

1987, 1991; Albert 1990]. Humans are sometimes able to use the ECG to predict

future catastrophic events (i.e., heart attack) but cannot estimate the time until

the event. It would be useful to create an automated system to predict such

events based on ECG data to enable doctors to prevent them.

1.1 Time Series

A time series is defined as a collection of measurements of a variable that

are usually taken at equal time intervals. This data set can be decomposed into

components including trends, seasonality, and random noise [Brockwell 1987].

A trend is an overall increase or decrease in the values of the time series while

seasonality refers to patterns that are recurring in correspondence to the date.

Random noise has no pattern and can be caused by errors in the observation

methods. Time series analysis is utilized to predict future behavior based on

past measurements of variable(s). A model of the basis of the time series must

be made before predictions can be created.

Time series analysis is performed to create a model of a system. The

systems examined here are dynamical; a dynamical system is one in which the

observed value (time series) is a result of the state of the system. A moving

average process is a linear combination of the signal and random noise that is

acquired during data collection. This noise can interfere with modeling of the

system; methods such as smoothing exist to diminish the effects of random

variations.

One method of smoothing is averaging, which exposes the time series'
underlying components (e.g., the trend). The moving average (MA) method
replaces each point with the average of a small window of the n most recent
data values; the window then slides forward one point and the average is
recalculated.

$$\mu_t = \frac{1}{n}\sum_{i=0}^{n-1} x_{t-i} \tag{1.1}$$

The MA method is limited to stationary processes, or those processes that do not

have trend or periodic fluctuations and do have an unvarying variance over time.

A stationary process has constant statistical properties which makes prediction

straightforward. Other approaches have been designed for a non-stationary time

series.
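The moving average described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this thesis; the function name `moving_average`, the trailing-window convention, and the sample values are assumptions made for the example.

```python
def moving_average(series, n):
    """Return the trailing n-point moving average of a list of numbers.

    The first n - 1 positions have no full window and are skipped.
    """
    if n <= 0 or n > len(series):
        raise ValueError("window size must be between 1 and len(series)")
    return [sum(series[t - n + 1:t + 1]) / n for t in range(n - 1, len(series))]


# Hypothetical noisy samples around a slowly rising trend.
samples = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
smoothed = moving_average(samples, 3)
print(smoothed)
```

The smoothed series is shorter than the input by n - 1 points, which is one reason the window is usually kept small.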

One of the first linear time series models was the autoregressive (AR) model,
designed for sunspot study by Yule [1927]. This method can be used to
represent non-stationary processes as well as stationary processes. In this
process the value of the series at the current time is a function of the previous
values added to a random variation term. In the AR model shown below
(Equation 1.2), the last variable, ε, is white noise with zero mean. In the
autoregressive/moving average (ARMA) model (Equation 1.3), this single random
variable is replaced by a weighted average of random variables. Processes that
combine moving average and autoregressive behavior require the ARMA model. The
output is a weighted sum of previous time series values in both models.

$$x_t = a_0 + \sum_{j=1}^{p} a_j x_{t-j} + \varepsilon_t \tag{1.2}$$

$$x_t = a_0 + \sum_{j=1}^{p} a_j x_{t-j} + \sum_{j=0}^{q} b_j \varepsilon_{t-j} \tag{1.3}$$

The advantages of ARMA include reasonable computation time and its usefulness
as a tool for analysis, forecasting, and control. One restriction of
the ARMA model is that it was not designed for time series with asymmetry or
data with sudden bursts of large amplitude at irregular times [Tong]. In
addition, ARMA is used with the assumption that the underlying system is linear.
The AR method is useful because it only requires knowledge of the system's
output values and can be used for both stationary and non-stationary time
series. However, many calculations are required when the AR method is used.
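As an illustration of the autoregressive model, the following sketch computes a one-step forecast as a weighted sum of previous values. The function name `ar_predict` and the coefficient values are hypothetical and chosen only for the example; the noise term is omitted because its expected value is zero.

```python
def ar_predict(history, coeffs, intercept=0.0):
    """One-step AR(p) forecast: a_0 + sum over j of a_j * x_{t-j}."""
    p = len(coeffs)
    if len(history) < p:
        raise ValueError("need at least p past values")
    # history[-1] is x_{t-1}, history[-2] is x_{t-2}, and so on.
    return intercept + sum(a * history[-j] for j, a in enumerate(coeffs, start=1))


history = [1.0, 2.0, 3.0]
coeffs = [0.5, 0.25]  # illustrative a_1 and a_2, not fitted to any data
forecast = ar_predict(history, coeffs, intercept=1.0)
print(forecast)  # 1.0 + 0.5 * 3.0 + 0.25 * 2.0
```

In practice the coefficients would be estimated from the series itself, which is where most of the computation in the AR method is spent.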

Poincaré maps are a classical technique for analyzing dynamical systems
[Parker & Chua, p. 33]. This method reduces the order and bridges the gap
between continuous and discrete time systems. A Poincaré map is a phase
diagram with one variable. For a series of drop times, for example, the X-axis
is the time interval between one drop and the next, and the Y-axis is the time
interval between the second drop and the third. This shows structure in a time
series that appears to be random. A Poincaré map can show convergence of the
system to a stationary point. This method can be utilized to evaluate the
stability of the limit cycle by using eigenvalues.

This paper involves non-linear time series where Takens' embedding
theorem can be used to create a model [Principe 1997]. Takens' theorem states
that many deterministic systems have a one-to-one mapping between the state
of the system and a finite portion (vector) of the time series. This allows for
estimating the number of variables that control the variable under observation.
Takens' embedding theorem maintains the attractor's topological properties while

reconstructing the attractor in a time delayed embedded space. This technique

uses a finite observation of a single variable. Nonlinear time series analysis can

be accomplished using delay time embedding which requires the choice of a time

delay and dimension.
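Delay time embedding can be sketched as follows. The function name `delay_embed` and the ordering of components within each vector are assumptions made for this example; the choice of delay and dimension is left to the analyst, as noted above.

```python
def delay_embed(series, dimension, delay):
    """Build delay vectors [x(t), x(t - delay), ..., x(t - (dimension - 1) * delay)]."""
    if dimension < 1 or delay < 1:
        raise ValueError("dimension and delay must be positive integers")
    start = (dimension - 1) * delay  # first index with a full history
    return [
        [series[t - k * delay] for k in range(dimension)]
        for t in range(start, len(series))
    ]


# A toy series; each vector is one reconstructed state of the system.
vectors = delay_embed([0, 1, 2, 3, 4, 5], dimension=3, delay=2)
print(vectors)
```

Each vector is a finite portion of the observed series that stands in for the unobserved state of the system.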

1.2 Artificial Neural Networks

The method of prediction for nonlinear time series explored in this paper

is the use of an artificial neural network. An artificial neural network attempts to

model the human brain which is composed of neurons and connections between

the neurons called synapses. A neural network mimics the brain in that

observed knowledge is acquired through a learning process and interneuron

connection strengths (weights) store the knowledge [Haykin 1994]. Prior

knowledge may be embedded in the network through utilizing a pre-trained

network within the larger network.

Figure 1.1: Perceptron Model (label: Neuron Output)

A neural network is comprised of four parts: nodes, connections between

nodes, activation functions, and a learning rule. Each neuron receives the
weighted sum of the outputs of other neurons. The activation function of the
neuron is applied to this sum. An activation function can be different for each
neuron; however, this may become confusing, so the code used here has only one
activation function in the network. There are three common activation functions:

the hard-limiter, the threshold function, and the sigmoid function. The hard-

limiter creates an output of positive one, negative one, or zero as in the

McCulloch-Pitts model [Haykin 1994].

In 1943, McCulloch and Pitts presented an early artificial neuron model

known as the linear threshold gate. This neuron has multiple inputs and one

output. The neuron produces a binary output used to group the input set. The

threshold and weight are fixed and the model is simple. A disadvantage of this

model is the inability of the network to work with nonlinearly separable classes.

The threshold function (piecewise-linear function) is used as an activation
function to approximate a non-linear amplifier. The sigmoid function (the
hyperbolic tangent is used in these experiments) is non-linear and
differentiable. The output of each neuron may be represented by the following
equation, where f represents the activation function, $s_i$ is the input from
neuron i, $w_i$ is the weight of the interconnection, and θ is the threshold
value. The input neuron is denoted by i.

$$y = f\left(\sum_{i} w_i s_i - \theta\right) \tag{1.4}$$
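The neuron output described above can be sketched as a weighted sum passed through an activation function. This is an illustrative sketch only: the function name `neuron_output` is hypothetical, and subtracting the threshold from the weighted sum is an assumed convention.

```python
import math


def neuron_output(inputs, weights, threshold, activation=math.tanh):
    """Apply the activation function to the weighted input sum minus the threshold."""
    net = sum(w * s for w, s in zip(weights, inputs)) - threshold
    return activation(net)


# Two inputs whose weighted contributions cancel give a net input of zero.
y = neuron_output([1.0, -1.0], [0.5, 0.5], threshold=0.0)
print(y)
```

Using the hyperbolic tangent as the default keeps every output between -1 and +1, matching the activation function used in these experiments.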

Each neural network must be trained by changing the weights between

neurons. Learning may be supervised where the network is provided with an

input (set of examples) and correct output (desired responses) by an external

teacher. Unsupervised learning requires the system to acquire information by

sampling without a teacher. Both learning methods are a type of Hebbian

learning rule as described by Donald Hebb in 1949 in his book written on the

idea of neurons.

The Hebbian rule states that when two connected neurons are activated
at the same time their connection is strengthened (i.e., weight value is
increased). If neuron i sends input to neuron j, the connection weight $w_{ij}$
is modified according to the following equation, where η is the learning rate
and $y_i$ and $y_j$ are the outputs of the two neurons.

$$\Delta w_{ij} = \eta\, y_i y_j \tag{1.5}$$
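A minimal sketch of the Hebbian update follows. The function name `hebbian_update` and the learning rate value are assumptions made for this example.

```python
def hebbian_update(weight, y_i, y_j, eta=0.1):
    """Strengthen the connection in proportion to the product of the two outputs."""
    return weight + eta * y_i * y_j


# Two neurons that are active together on three consecutive steps.
w = 0.0
for _ in range(3):
    w = hebbian_update(w, 1.0, 1.0, eta=0.1)
print(w)
```

When either neuron is inactive (output of zero) the product vanishes and the weight is left unchanged.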

The simple feed-forward neural network has input node(s), hidden

node(s), and other nodes that process a signal before combining into output

[Gurney 1997]. Hidden nodes are not connected to input or output and are not

expected to have a particular response. A fully connected artificial network is

one in which every node in one layer is connected to each node in the next layer.

The example feed-forward neural network in the figure below has one layer of

hidden nodes. The only connections that exist are between consecutive layers of

nodes.

Figure 1.2: Feed-forward Network (layers: input nodes, hidden nodes, output node)

1.3 Recurrent Neural Networks

A recurrent neural network is different from a feed-forward neural

network because it has one or more feedback loops [Haykin 1994]. A feedback

loop occurs when the output of a node has an effect on the input to that node

and this can include self loops. Because of unit delay elements, these feedback

loops result in nonlinear dynamical behavior. The values from hidden nodes are

copied to memory on the input nodes and utilized in the next time step. This

attribute of recurrent neural networks allows them to be used in time dependent,

deterministic systems where feed forward networks would be inappropriate. A

fully connected recurrent neural network is one in which every neuron is

connected to every other neuron in both directions. In a fully connected

network, neurons are identical and are named input, output, or hidden. The

recurrent neural network in the figure below has one feedback loop.

Figure 1.3: Recurrent Network (layers: input nodes, hidden nodes, output node)

Figure 1.4: Fully Connected Recurrent Network (layers: input nodes, hidden nodes, output node)

Behaviors of the recurrent neural network are described by fixed points,
limit cycles, and chaos. A fixed point may attract, repel, or do neither. A fixed
point occurs at the intersection of a function and the line y = x. The fixed
points are categorized by taking the derivative of the function at the fixed
point. A repelling fixed point has a derivative greater than 1 or less than -1.
An attracting fixed point has a derivative between -1 and +1. If the magnitude
of the derivative is exactly one, the type is undefined. Feed-forward networks
evolve to a fixed point [Logar 1992]. This property lends itself towards
classification of the input; the output is set for a given input. Recurrent
networks may approach fixed points.
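The derivative test for fixed points can be sketched numerically. The function name `classify_fixed_point`, the central-difference step size, and the single-neuron map tanh(w x) used as an example are all assumptions made for illustration.

```python
import math


def classify_fixed_point(f, x_star, h=1e-6):
    """Classify a fixed point of f by the magnitude of the slope of f at x_star."""
    slope = (f(x_star + h) - f(x_star - h)) / (2.0 * h)  # central difference
    if abs(slope) < 1.0:
        return "attracting"
    if abs(slope) > 1.0:
        return "repelling"
    return "undetermined"


# x = 0 is a fixed point of tanh(w * x) for any weight w; the slope there is w.
kind_small = classify_fixed_point(lambda x: math.tanh(0.5 * x), 0.0)
kind_large = classify_fixed_point(lambda x: math.tanh(2.0 * x), 0.0)
print(kind_small, kind_large)
```

The example suggests how a single self-connection weight can change the origin from an attracting point to a repelling one.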

A limit cycle can be generated by a recurrent neural network. The limit

cycle properties depend on the weight values between the nodes of the network.

Stable limit cycles may be created which return cycles to their orbit. This

behavior requires non-linearity produced by the sigmoidal activation function.

A chaotic system is one that is sensitive to its initial conditions: two
input values which are close together can produce widely varying outputs. This
behavior is important because many signals appear to be chaotic. The ECG shows
that the heart is chaotic; therefore, a recurrent neural network may be able to
closely model this behavior.

1.4 Least Squares Algorithm

The training of a neural network consists of setting each weight
(connection between nodes) to an optimized value. One method in supervised
training is the least mean square error (LMS) method (or Widrow and Hoff's
delta rule from 1960) [Graupe 1997, p. 12], which compares the real output with
the desired output. A training series is composed of the input (x) and the
desired output (d). Given a training series with $d_1 \ldots d_L$ (desired
outputs of the network), the training error (e) at the nth set is calculated by
the following equation.

$$e_n = d_n(t) - y_n(t) \tag{1.6}$$

where d is the desired value and y is the actual output created by the neural
network. The goal of setting the weights is to minimize the training cost, the
sum of squared errors averaged over the L training sets.

$$E = \frac{1}{L}\sum_{n=1}^{L} e_n^{2} \tag{1.7}$$

$$E = \frac{1}{L}\sum_{n=1}^{L} \left(d_n(t) - y_n(t)\right)^{2} \tag{1.8}$$
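The training cost can be computed as in the following sketch. The function names are hypothetical, the sample values are invented for the example, and dividing by the number of training sets is an assumed normalization.

```python
def training_errors(desired, actual):
    """Per-sample error e_n = d_n - y_n."""
    return [d - y for d, y in zip(desired, actual)]


def mean_squared_error(desired, actual):
    """Average of the squared per-sample errors (normalization is an assumption)."""
    errors = training_errors(desired, actual)
    return sum(e * e for e in errors) / len(errors)


desired = [1.0, -1.0, 1.0, -1.0]  # illustrative targets
actual = [0.5, -0.5, 1.0, -1.0]   # illustrative network outputs
cost = mean_squared_error(desired, actual)
print(cost)
```

Squaring the errors makes the cost insensitive to their sign, so overshoot and undershoot are penalized equally.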

The gradient is computed and set equal to zero to optimize the weights
[Graupe 1997]. This method is employed when the training series is limited to a
small size (L). The maximum rate of change and indicator of direction of change
is given by the gradient. A series which is chaotic is dependent on initial
conditions and so the prediction may be inaccurate when the system is
presented with different initial conditions. The trajectory generated by a
chaotic system may be one of an infinite number of possibilities, which makes
prediction difficult.

1.5 Learning Rule

One learning rule described by Rosenblatt utilizes the gradient descent

method to optimize weights. The first step is to initialize the weights and

threshold to small random numbers. The initial input from the training series is

sent to the network. Then the output of each neuron is computed using the

activation function. The weights are updated with the following learning rule.

$$w_i(t+1) = w_i(t) + \eta\,(d - y(t))\,x_i \tag{1.9}$$

These steps are repeated for all input vectors. One drawback to using this
method is that the weights will oscillate if the classes overlap (as can occur
in classification problems). This can be addressed by minimizing the least mean
square error instead.
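The learning rule can be sketched on a small, linearly separable problem. This is an illustration under stated assumptions rather than the program used in this thesis: the function names are hypothetical, the logical AND data set is chosen for the example, the threshold is represented as a bias term, and the weights start at zero instead of small random numbers so that the run is reproducible.

```python
def hard_limiter(net):
    """Hard-limiter activation: +1 for non-negative net input, otherwise -1."""
    return 1.0 if net >= 0.0 else -1.0


def predict(weights, bias, x):
    """Output of a single neuron with a hard-limiter activation."""
    return hard_limiter(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(samples, eta=0.5, epochs=20):
    """Repeat the update w_i <- w_i + eta * (d - y) * x_i over all input vectors."""
    weights = [0.0] * len(samples[0][0])  # assumption: zeros for reproducibility
    bias = 0.0                            # bias plays the role of -threshold
    for _ in range(epochs):
        for x, d in samples:
            y = predict(weights, bias, x)
            for i, xi in enumerate(x):
                weights[i] += eta * (d - y) * xi
            bias += eta * (d - y)
    return weights, bias


# Logical AND with targets of -1 and +1 (hypothetical training series).
and_samples = [([0.0, 0.0], -1.0), ([0.0, 1.0], -1.0),
               ([1.0, 0.0], -1.0), ([1.0, 1.0], 1.0)]
weights, bias = train(and_samples)
print(weights, bias)
```

No update occurs once every sample is classified correctly, because the factor (d - y) is then zero for all of them.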

1.6 Mackey-Glass Equation

The Mackey-Glass equation is a time-delay differential equation which
models the production of white blood cells. This time series is sensitive to initial
conditions. In the experiments presented here, the initial value is 1.2, with
τ = 17, a = 0.2, and b = 0.1, and it is assumed that x(t) = 0 when t < 0. This
equation results in a non-periodic, non-convergent and chaotic series. Chaotic
time series are created by deterministic systems and are dependent on the
initial conditions as well as being non-periodic. The Mackey-Glass equation is
defined by the equation below; Figure 1.5 shows the graphical representation.

$$\frac{dx(t)}{dt} = \frac{a\,x(t-\tau)}{1 + x(t-\tau)^{10}} - b\,x(t) \tag{1.10}$$

Figure 1.5: Mackey Glass Time Series (plot title: Mackey Glass Function)
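A Mackey-Glass series with the parameters stated above can be generated with a short script. This is a sketch only: the function name `mackey_glass`, the use of Euler's method with a unit time step, and the exponent of 10 in the denominator are assumptions, and the integration scheme actually used to produce the data in this thesis may differ.

```python
def mackey_glass(n_steps, tau=17, a=0.2, b=0.1, x0=1.2, dt=1.0):
    """Integrate the Mackey-Glass delay equation with Euler's method."""
    delay_steps = int(round(tau / dt))
    x = [x0]
    for t in range(n_steps - 1):
        # x(t) is taken to be zero before the start of the series.
        delayed = x[t - delay_steps] if t >= delay_steps else 0.0
        dx = a * delayed / (1.0 + delayed ** 10) - b * x[t]
        x.append(x[t] + dt * dx)
    return x


series = mackey_glass(1200)
print(series[:5])
```

For the first τ steps the delayed term is zero, so the series simply decays from its initial value before the delayed feedback begins to act.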

1.7 Electrocardiogram Data

An electrocardiogram (ECG) is a time series representing the electrical

activity of the human heart. The electrocardiograph is a device that measures
the potential between electrical charges on either side of the membrane that
initiate the pumping of the heart. This machine is centered on a differential
amplifier. One heartbeat is composed of one P wave and a QRS complex. This
signal is chaotic and can be difficult to represent artificially given the high
frequency peaks R and S. The ECG data used in these experiments were
obtained from the MIT-BIH Arrhythmia Database on CD-ROM. An example of
an ECG signal is given in the figure below.

1.8 Research Objectives

The existing program consists of a three node fully connected recurrent

neural network. This architecture acts as a harmonic generator because it is

able to oscillate and perform predictions [Gomez 1998]. The data stream is

partitioned into separate frequency components using the Fourier Transform for

aperiodic signals. The total sum of these sinusoids is equivalent to the original

data. Currently, the program incorporates the fundamental and the first

harmonics of the Fourier transform sine components. However, the Fourier

transform involves both the sine and cosine components for most functions. The

Fourier series demonstrates that a periodic signal is the sum of sine and cosine

terms. The step function is a sum of an infinite number of sine waves as shown

in Figures 1.6 and 1.7. The triangle wave is a combination of sine waves and

cosine waves as shown in Figures 1.8 and 1.9.
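The way a step function is built up from sine waves can be sketched with a partial Fourier sum. The function name `square_wave_partial_sum` is hypothetical, and the sketch assumes a unit-amplitude square wave of period 2π, whose series contains only the odd sine harmonics.

```python
import math


def square_wave_partial_sum(x, n_harmonics):
    """Sum the first n odd sine harmonics of a unit square wave of period 2*pi."""
    total = 0.0
    for k in range(n_harmonics):
        m = 2 * k + 1  # odd harmonic number: 1, 3, 5, ...
        total += math.sin(m * x) / m
    return 4.0 / math.pi * total


# Evaluate at the middle of the positive half cycle, where the wave equals 1.
one = square_wave_partial_sum(math.pi / 2, 1)
ten = square_wave_partial_sum(math.pi / 2, 10)
print(one, ten)
```

With a single harmonic the sum overshoots the flat top of the wave; adding harmonics brings the value closer to 1.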

Figure 1.6: Step Function First Harmonic

Figure 1.7: Step Function First 10 Harmonics (plot title: Square Wave; series: sum of harmonics sin(x), square function; x-axis: X values)

Figure 1.8: Triangle Function First Harmonic

Figure 1.9: Triangle Function First 10 Harmonics

The three-node neural network is easily trained to sine waves [Gomez

1998]. The current program consists of three sub-networks, the harmonic

generators. Each sub network in the architecture is present to account for a

Fourier coefficient. The sub-networks are pre-trained to a sine wave. A picture

of a sub-network is shown below after training is completed. The overall

network is a fully connected recurrent neural network with no external input

nodes. The weights of the sub networks are held steady while the other weights

are modified by the algorithm. The input signal used for training is the correct

value for the output node.

Figure 1.10: Harmonic Generator

Figure 1.11: Harmonic Generator Prediction

Figure 1.12: Sigmoid Activation Function (plot title: Hyperbolic Tangent)

An activation function limits the amplitude of each neuron's output. The
activation function utilized in this architecture is the sigmoid function (the
hyperbolic tangent function shown above). The hyperbolic tangent is a
continuous function whose output ranges from -1 to +1, and the target series
must lie within the same range. In other words, it will be impossible to train
the network to a series that is outside of the range from -1 to +1. Any series
must be normalized to be limited to positive and negative one. The logistic
form of this squashing function, whose output ranges from 0 to 1, is given by
the following equation.

$$f(x) = \frac{1}{1 + e^{-x}} \tag{1.11}$$

The sigmoid is appropriate because it is easily differentiated, giving the
following.

$$f'(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}} = f(x)\,[1 - f(x)] \tag{1.12}$$
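The identity f'(x) = f(x)[1 - f(x)] can be checked numerically. The function names below are hypothetical and the central-difference step size is an arbitrary choice made for this sketch.

```python
import math


def sigmoid(x):
    """Logistic sigmoid f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative written in terms of the function value: f(x) * (1 - f(x))."""
    fx = sigmoid(x)
    return fx * (1.0 - fx)


def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_derivative(x), numerical_derivative(sigmoid, x))
```

Because the derivative is expressed through the output itself, it can be obtained during training from a value that has already been computed.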

The algorithm of this program is shown in Figure 1.13 below. Initially the

weights and node outputs are set to small random numbers. The output of the
network is computed ("run" the network), the error is computed to help calculate
the delta rules, and then the weights are updated. This training cycle is
continued for a specified number of epochs, or until a specified error is reached.
The network then makes predictions and the results are saved.

Figure 1.13: Algorithm Flow Chart (steps, in order: Start; Get User Input; Set Initial Weights between Nodes; Set Initial Output of Each Node; Run the Network; Compute Error; Compute Delta Rules; Update Weights; Prediction; Output Results)

CHAPTER II

BACKGROUND

Research in neural modeling began with McCulloch and Pitts in 1943.

Rosenblatt described the basic perceptron in 1958. The perceptron contains two

layers of neurons connected by modifiable weights (input and output). The

limitation of this model is that it is linear [Lau Xueying].

2.1 Training

Back-propagation through time is a supervised learning algorithm utilized

for training the neural network. Back-propagation was discovered by Bryson and

Ho [Abdi, 1994] and provided a method of updating weights between hidden

neurons. The goal of training is to minimize the error between the node's actual

output y(t) and the desired output d(t) using the gradient descent method

discussed earlier. Pearlmutter provided for an extension to back propagation

algorithms by expanding the recurrent neural network and treating it as a feed

forward neural network first.

The algorithm consists of integrating the network forward in time and

keeping track of the error between desired and actual output. Then the network

is integrated backwards in time by sending the error signal from the output layer

backwards to the hidden layer (back propagating the errors of the output layer).


This calculates the hidden layer(s) error rates as the weighted average of the

output error; the error of each node is weighted by the derivative of its own

output. Finally, the weights are updated. This method can handle hidden

neurons that have initial conditions that can be set to one value or changed by

the program. Back-propagation converges to a local minimum of the LMS error for
the output. Finding optimal weights is an NP-complete problem in general; in
the worst case the time required grows exponentially as the number of neurons
increases. Therefore, it may not be appropriate in large networks.

Back-propagation solved concerns that existed with the perceptron model

when attempting to use a network with hidden nodes. Neural networks can use

back-propagation when the problem is nonlinear. The hidden layer provides

internal knowledge storage. One disadvantage that occurs with the back

propagation method is that it is difficult to establish the optimal number of

hidden nodes and the optimal number of hidden layers. The universal

approximation theorem states that one layer of hidden units can approximate a

continuous function when the sigmoid is used for an activation function [Krose

1996]. Back propagation requires supervised training and can require thousands

of iterations for the network to learn.


2.2 Pearlmutter Algorithm

The Pearlmutter [1990, 1995] network is fully connected and is similar to

recurrent back propagation but is a discrete model of a continuous system. This

method utilizes differentiable functions to define the system. The Pearlmutter

network requires a differentiable activation function, σ, which is nonlinear,
such as the sigmoidal function shown earlier.

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.1}$$

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \tag{2.2}$$

The Pearlmutter algorithm begins by creating and saving a neuron's
outputs for a number of time steps. The total input from other nodes to node i
is the following sum, where $y_j$ is the activation level of neuron j, $w_{ij}$
is the weight of the link from node j to node i, and n is the number of nodes.

$$x_i(t) = \sum_{j=1}^{n} w_{ij}\, y_j(t) \tag{2.3}$$

The network is defined by the following equation, where $y_i$ is the state of
node i and $I_i$ is the external input to the node. This differential equation
defines the path of the network state over time.

$$\frac{dy_i}{dt} = -y_i(t) + \sigma(x_i(t)) + I_i(t) \tag{2.4}$$

The next step in the algorithm is to calculate the error value and then update the

weights.

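$$e_j = \text{actual output}_j - \text{desired output}_j \tag{2.5}$$

$$\Delta w_{ij} = \eta\,\frac{\partial E}{\partial w_{ij}} \tag{2.6}$$

$$w_{ij} = w_{ij} - \Delta w_{ij} \tag{2.7}$$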

The following must be integrated.

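$$\frac{dz_i}{dt} = z_i(t) - e_i(t) - \sum_{j=1}^{n} w_{ji}\,\sigma'(x_j(t))\,z_j(t) \tag{2.8}$$

$$\frac{\partial E}{\partial w_{ij}} = \int z_i(t)\,\sigma'(x_i(t))\,y_j(t)\,dt \tag{2.9}$$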

The z values are initialized to zero and measure the effect that changing a
node's output ($y_i$) has on the error. Euler's method is used for integration
in the following equations. The second equation shows that z is calculated
backwards in time by rearranging the first.

$$z_i(t + \Delta t) = z_i(t) + \Delta t\,\frac{dz_i}{dt} \tag{2.10}$$

$$z_i(t) = z_i(t + \Delta t) - \Delta t\,\frac{dz_i}{dt} \tag{2.11}$$

Pearlmutter's algorithm differs from recurrent back propagation in that it
computes the path and then the error along the entire path, instead of
performing a point-by-point analysis.
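The forward pass of such a network, in which the path of the node states is computed and saved, can be sketched with Euler's method. This is an illustrative sketch and not the program described in this thesis: the function names, the time step, and the single-node example are assumptions, and the backward pass for the z values is not shown.

```python
import math


def sigma(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate(weights, y0, external, dt=0.1, steps=100):
    """Integrate dy_i/dt = -y_i + sigma(x_i) + I_i forward and save the whole path."""
    n = len(y0)
    y = list(y0)
    path = [list(y)]
    for _ in range(steps):
        # Total input to node i: weighted sum of all node states.
        x = [sum(weights[i][j] * y[j] for j in range(n)) for i in range(n)]
        dy = [-y[i] + sigma(x[i]) + external[i] for i in range(n)]
        y = [y[i] + dt * dy[i] for i in range(n)]
        path.append(list(y))
    return path


# One node with no connections and no external input settles at sigma(0) = 0.5.
path = simulate([[0.0]], [0.0], [0.0], dt=0.1, steps=200)
print(path[-1])
```

Saving every state along the path is what allows the error to be evaluated over the entire trajectory afterwards.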

2.3 Recent Applications

Sejnowski and Rosenberg developed a multi-layer perceptron trained on
combinations of letters in English text. NETtalk was trained to generate speech

from written text. In this application the hidden layers represented phonemes in

the English language such as vowels and consonants. Seven letters of a word

were the input series. The network was composed of 203 neurons; seven

groups of 29 for the 26 letters of the alphabet and 3 punctuations, 80 neurons in


a hidden layer, and 26 neurons in the output layer. This network utilized back

propagation with a high accuracy after training (less than 2 days) and in addition

it could speak new texts (those not used for training) with high accuracy when

compared to human error rates.

Principe and Kuo [1995] stated that multi-step prediction is more
appropriate for dynamical systems. Their system, a recurrent neural network, is
trained by seeding the network with a set of input samples before the input is
disconnected. The predicted sample (network output) is then fed back to the
input for k steps. This is repeated over several overlapping segments of the
time series.

The authors note that it is necessary for the network to be recurrent

because the long-term behavior of the dynamical system must be modeled; the

iterative map is therefore constrained throughout learning. The Mackey-Glass

equation was used for prediction. A time delay neural network with 8 input

nodes, 14 hidden nodes, and 1 output node was used with hidden nodes having

sigmoid activation functions and the output node having a linear activation

function. The result was a low final mean square error but the prediction was

more regular than the input series.

Lapedes and Farber investigated using neural networks for time series

prediction. They stated that the linear back-propagation neural network is

comparable to the least mean square method. The activation function was


changed to the sigmoid and a feed-forward network used these non-linear

activation functions for hidden neurons. Using the gradient descent method, the
authors obtained better results for prediction of nonlinear time series than
with traditional time series analysis.

Two of the time series predicted in the previously noted paper were the

Mackey-Glass and a logistic map. The network for the Mackey-Glass had four

input nodes, one output node, and two hidden layers with a total of 20 nodes.

The attempt was to sample the series and predict many future points. The

result was better than traditional time series analysis; however the accuracy was

not impressive. The logistic map given by the equation below was also

investigated. This time series is deterministic but appears random. The network

included one input node, one output node, and five hidden nodes. Activation

functions were different for the hidden nodes and the output nodes: the hidden

nodes used the sigmoid function while the output nodes used a linear function.

Predicting one step in the future produced accurate results. These examples

show that neural networks may be used for prediction of a time series.

In 1994, Hayashi proposed an architecture and learning rule using a

recurrent neural network. This architecture characterized an oscillatory based

recurrent network with pairs of nodes x and y. The y nodes are internal nodes
while the x nodes are output nodes that are fully connected by weights
$W_{jk}$. Each pair connects together to create an oscillator. There are two
connections between a pair of nodes: the positive weight, $K_{IE}$, and the
negative weight, $-K_{EI}$. This network uses a sigmoidal function.

$$G(x) = \frac{2}{\pi}\tan^{-1}\left(\frac{x}{a}\right) \tag{2.12}$$

In the equations below, $I_j$ is the input sent to node $x_j$.

$$x_j' = -x_j + G\left(\sum_{k} W_{jk}\,x_k - K_{EI}\,y_j + I_j\right) \tag{2.13}$$

$$y_j' = -y_j + G\left(K_{IE}\,x_j\right) \tag{2.14}$$

This system (shown in Figure 1.4) produces cyclic output and is trained using a

continuous training rule devised by Hayashi [1994]. This method utilizes

Lagrange multipliers to define the error for a weight. These rules can be

modified for the discrete case by taking the partial derivative of the error with

respect to a weight [Corwin]. The disadvantage of this scheme is that only

weights connected to visible nodes can be modified. In addition, it may be

difficult to train to a series that does not match one of the oscillations that

Hayashi defines.


CHAPTER III

RESULTS

Code was written in C to implement the back propagation method. The

new program includes the cosine components for each sine component. A three-

node harmonic generator was trained to output a cosine function. Three of

these sub neural networks were added to the overall code. Experiments were

performed to compare the prediction results between the code with the cosine

and without the cosine components. This involved a square wave, Mackey-Glass

function, and an ECG sample. The results are examined to determine if a

smaller error can be obtained for the same prediction time or if acquiring the

same error will allow for a longer prediction time.

The results are measured to determine the ease of training the system

and examine the quality of the prediction. Training is improved if it requires
fewer iterations. The quality of prediction will be observed based on

the least mean square error of the sample the network is trained to predict. The

error referred to hereafter includes the error across all training samples.

The first step in this process was to determine which sine-wave frequencies the oscillators need to reproduce. Therefore, it was necessary to take the Fourier transform of the series to determine its frequency components. This was performed using Matlab. First, a window function was applied to the series, because the discrete Fourier transform treats a finite series as one period of a periodic signal, and a mismatch between its endpoints leaks energy into neighboring frequencies. The window function selected was the Blackman window given by the equation below.

$$w(n) = 0.42 - 0.5\cos\left(\frac{2\pi n}{N}\right) + 0.08\cos\left(\frac{4\pi n}{N}\right) \tag{3.1}$$

With N set to 1024, this equation creates the windowing series shown in the figure below. The window function was applied to the input series after the mean was subtracted to remove any DC component. The fast Fourier transform was then applied to the resulting series to determine which frequencies the oscillators should be trained to predict. This process was not performed for the square wave because its frequency components were already known.

Figure 3.1: Window Function


3.1 Square Wave

The square wave was used to train the neural network. This time series is the sum of an infinite number of sine waves at the odd harmonics. In this experiment only the first, third, and fifth harmonics are used. The training series consisted of 100 time points of the step function. The following figure shows the results of the network when using only the sine oscillator generators.

The outputs for one hundred and one thousand epochs are almost identical, with error rates of 7.949 and 7.840, respectively. When the experiment was repeated, the error rates were 7.139 and 5.178, respectively. The difference is a result of the random number generator, which was seeded using the time. Both the weights and the initial y values are set to small random numbers, and the initial conditions clearly have a large effect on the error. In other words, if the random number generator is seeded using the time, the neural network's results may be different each time the program is run. Ten thousand epochs gave the best results, with an error rate of 1.506 in the first run and 2.324 in the second.
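The effect of seeding can be sketched with the standard C library generator. The code below is an illustration, not the thesis program: the helper names and the range of the small random values are assumptions. A fixed seed reproduces the same initial weights on every run, while seeding from the clock does not.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Seed the generator with a fixed value for repeatable runs, or from the
 * clock (different on each run) when use_time is nonzero. */
void seed_generator(unsigned int seed, int use_time)
{
    srand(use_time ? (unsigned int)time(NULL) : seed);
}

/* Fill w[0..n-1] with small random values in [-scale, scale].
 * Assumption: a uniform distribution; the thesis does not state one. */
void init_small_random(double *w, size_t n, double scale)
{
    assert(w != NULL && scale >= 0.0);
    for (size_t i = 0; i < n; i++) {
        double u = (double)rand() / (double)RAND_MAX;   /* u in [0, 1] */
        w[i] = scale * (2.0 * u - 1.0);
    }
}
```

Calling seed_generator(42, 0) before every run makes two runs directly comparable, whereas seed_generator(0, 1) gives the run-to-run variation described above.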

[Figure: "Square Wave - Sine Only". Series: Training Series, 100 Epochs, 1,000 Epochs, 10,000 Epochs. X axis: X Values.]

Figure 3.2: Square Wave Prediction for Sine only network

The next experiment was to run the net with six oscillators, including the cosines. The results are shown in Figure 3.3. Using 100 training epochs gave error rates of 7.667 and 7.954, while one thousand epochs gave significantly lower error rates of 2.506 and 4.861. After training for ten thousand epochs the error rates were 1.983 and 1.690.

[Figure: "Square Wave - Sine & Cosine". Series: Training Series, 100 Epochs, 1,000 Epochs, 10,000 Epochs. X axis: X Values.]

Figure 3.3: Square Wave Prediction for Sines and Cosines Network

The figure below directly compares the two networks after one hundred thousand epochs. When using both cosine and sine oscillators, an error rate of 1.333 was obtained, versus an error rate of 1.206 for sine oscillators only. However, as time goes on, the sine-only network is less able to continue predicting the square wave (Figure 3.4).

[Figure: "Square Wave - Comparison". Series: Training Series, Cosine & Sine, Sine Only. X axis: X Values.]

Figure 3.4: Square Wave Compare 100,000 Epochs

Table 3.1: Square Wave Prediction Ability

             Error, Sines Only Network    Error, Sines and Cosines Network
Epochs       Run 1      Run 2             Run 1      Run 2
100          7.949      7.130             7.667      7.954
1,000        7.840      5.178             2.506      4.861
10,000       1.506      2.324             1.983      1.690
100,000      1.206      -                 1.333      -

The network was then tested to determine whether adding information about the signal made training easier, that is, whether it reduced the number of epochs required. The network was trained until the error decreased below 1.5. When the network had both cosine and sine oscillators, 33,192 epochs were required. When only the sine oscillators were used, 10,385 epochs were required for the error to decrease below 1.5.
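The measurement used here, the number of epochs needed to reach a goal error, can be sketched as a small loop. The code below is an assumption-laden illustration: the callback type, the function names, and the toy epoch function (which simply halves a stored error value instead of training a network) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: runs one training epoch on the given state and
 * returns the error across all training samples after that epoch. */
typedef double (*epoch_fn)(void *state);

/* Train until the error drops below goal. Returns the number of epochs
 * used, or -1 if the goal was not reached within max_epochs. */
long train_until(epoch_fn run_epoch, void *state, double goal, long max_epochs)
{
    assert(run_epoch != NULL && max_epochs > 0);
    for (long epoch = 1; epoch <= max_epochs; epoch++) {
        if (run_epoch(state) < goal)
            return epoch;
    }
    return -1;
}

/* Toy stand-in for a real training epoch: halves the stored error. */
double halving_epoch(void *state)
{
    double *err = (double *)state;
    *err *= 0.5;
    return *err;
}
```

An upper bound on the number of epochs is included because, as noted later, a network can stall at an error above the goal.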

[Figure: "Square Wave - Comparison". Series: Training Series, Cosine & Sine, Sine Only. X axis: X Values.]

Figure 3.5: Square Wave Training Ability

Both networks had the same seed for the random number generator. The net using only 3 oscillators required only 31,322 epochs to train, while the net with 6 oscillators required 81,554 epochs. However, when the graphs are compared, the network with more oscillators clearly predicts more accurately. This effect was reproduced several times with differing random number seeds.

[Figure: "Square Wave". Series: Training Series, Cosine & Sine, Sine. X axis: X Values.]

Figure 3.6: Square Wave Training Ability

3.2 Mackey-Glass Function

Next, the network was trained to the Mackey-Glass function described earlier. In this training session 200 points of the signal were used (Gomez Gil [1998] trained in segments). It is necessary to scale or shift the input data because of the activation function used. This function requires a signal between positive and negative one. In this paper the series were shifted; however, it may be useful to scale the time series using the following function.

$$Y_{NEW} = \frac{Y_{OLD} - Y_{MIN}}{Y_{MAX} - Y_{MIN}} \tag{3.2}$$
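A minimal sketch of the scaling in Equation 3.2 follows. The function name and the error code for a constant series are assumptions added for illustration. The equation maps the series onto the interval from 0 to 1, which lies inside the range of negative one to positive one that the activation function requires.

```c
#include <assert.h>
#include <stddef.h>

/* Rescale y[0..n-1] in place with Equation 3.2:
 *   y_new = (y_old - y_min) / (y_max - y_min)
 * Returns 0 on success, or -1 if the series is constant (y_max == y_min),
 * in which case the series is left unchanged. */
int minmax_scale(double *y, size_t n)
{
    assert(y != NULL && n > 0);
    double lo = y[0], hi = y[0];
    for (size_t i = 1; i < n; i++) {
        if (y[i] < lo) lo = y[i];
        if (y[i] > hi) hi = y[i];
    }
    if (hi == lo)
        return -1;
    for (size_t i = 0; i < n; i++)
        y[i] = (y[i] - lo) / (hi - lo);
    return 0;
}
```

For example, the series 2, 4, 6 becomes 0, 0.5, 1.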

In this example, both cosine and sine coefficients were used to create oscillators, for a total of six trained oscillators. The network contained 31 nodes in total: 18 belonged to the oscillators and one was the output node. The remaining 12 were hidden nodes without a specified output. For 2.out (second graph) the program was run until the error was below 1.408. For 11.out (first graph) the program was run until the error was below 1. Both started with an eta of 0.01. 2.out took 8,000 epochs and 11.out took 40,586 epochs. The third figure focuses on the training portion of the signal and compares the different numbers of epochs run.

[Figure: Mackey-Glass prediction plot]

Figure 3.7: Mackey-Glass Prediction Results


Figure 3.8: Mackey-Glass Prediction

Figure 3.9: Compare Prediction Results


The Mackey-Glass function was also trained using a 31-node network with only 3 oscillators, the sine components. There were therefore 9 nodes in the oscillators, 1 output node, and 21 hidden nodes in this architecture. Clearly, omitting the three cosine oscillators results in worse prediction. Starting with a learning rate (eta) of 0.01, as in the experiment above, this program was run for 100,000 epochs and achieved an error rate of 3.171. As shown in Figure 3.10, this experiment did not come close to predicting the Mackey-Glass function.

Thirty nodes were chosen because this gave the best results without exceeding memory limitations. It should be noted that increasing the number of neurons will decrease the program's performance. However, the influence of any extra nodes that are not needed can be reduced by decreasing their interconnecting weight values. This may lead to over-fitting of the network.

Figure 3.10: Compare Mackey-Glass Prediction

3.3 ECG Signal

The final experiments involved the ECG signal. This data was obtained from an MIT-BIH database. A normal ECG file was chosen and then scaled as before. A window function was applied before running the Fourier transform to determine the required frequencies. Six separate three-node networks were trained to the appropriate frequencies. The network architecture is the same as that used previously. The initial predictions were unable to reach the high points of the ECG.

As suggested by Dr. Oldham, it was decided to train a feed-forward network first and use its weights to initialize the recurrent network. This allowed the recurrent network to have weights initialized to non-random numbers; the only random numbers used were those for the initial node outputs and the recurrent weights. Another improvement to the program was to include a bias for every node.

The first experiments trained the network for one thousand epochs. When eta was small, the sines-only network achieved a slightly lower error rate. For example, when eta was 0.0001 the sines-only network had an error rate of 1.1109, compared to 1.1705 for the network with sines and cosines. The two outputs are virtually identical in the first graph below. However, when larger values of eta were used, the 6-oscillator network had a slightly smaller final error rate. When eta was 0.001, the first network had an error rate of 1.1103 and the second 1.1034 (second graph below). Increasing the learning rate to 0.01 gave 1.2071 and 1.0776, respectively (third graph below). The final experiment run with one thousand epochs, with eta set to 0.1, gave 1.2078 and 1.0776 (fourth graph below). The results from the network with cosines and sines are compared across the different learning rates in the fifth graph below. The worst result was obtained when eta was set to 0.1.

[Figure: ECG signal plot]

Figure 3.11: Eta 1/10000

[Figure: ECG signal plot]

Figure 3.12: Eta 1/1000

[Figure: ECG signal plot]

Figure 3.13: 100 Epochs, Eta 1/100

[Figure: ECG signal plot]

Figure 3.14: Eta 1/10

Figure 3.15: Comparison of Learning Rates - Cosines and Sines network

Table 3.2: ECG Trained 1,000 Epochs

Eta        Error, Sines Only Network    Error, Sines and Cosines Network
0.0001     1.1109                       1.1705
0.001      1.1103                       1.1034
0.01       1.2071                       1.0776
0.1        1.2078                       1.0776

The networks were then trained for ten thousand epochs. The results were similar to those above, with an initial eta of one tenth giving the worst prediction ability. The other learning rates gave similar outputs, as seen in the graph below. When one thousand and ten thousand epochs are compared, the result was as expected: more epochs produced a signal closer to the original, as shown in the next graph.

[Figure: ECG signal plot]

Figure 3.16: 10,000 Epochs

Experiments were implemented to test the training ability of the network.

The eta (learning rate) was set to 1/100000 and the same network architecture

as before was used. Initially the goal of the network was to reach an error rate

(across all training time samples) of less than 1.5. The sines only network took

3,696 epochs while the cosines and sines network took 3,736 epochs to reach

this level. As shown in the figure below, the prediction results were virtually

identical for both networks.

[Figure: ECG signal plot]

Figure 3.17: Error Goal < 1.5

Next, the network was trained until an error of 1.3 was reached; the first

network took 4,168 epochs versus 4,831 for the second network. Next, the goal

error rate was set at 1.2 across the training sample. The sines only network

required 7,804 epochs while the second network took 6,102 epochs. When the

objective was an error of 1.1, the first network took 10,457 epochs and the

second network took 257,372 epochs.

[Figures: ECG signal plots]

Table 3.3: ECG Test Training Ability

              Number of Epochs
Error Goal    Sines Only Network    Sines and Cosines Network
1.5           3,696                 3,756
1.3           4,168                 4,831
1.2           7,804                 6,102
1.1           10,457                257,372


CHAPTER IV

CONCLUSION

The anticipated benefit resulting from this work is to increase the ability of

the recurrent neural network to correctly predict future catastrophic events of

deterministic systems. Although a point by point prediction may not be

accurate, the prediction of a catastrophic event such as a heart attack may be

useful in the analysis of an EEG/EKG. Additionally, this method may be applied

to other time series.

All of the experiments demonstrated the dependence of the network on the initial weights and output values. The network would have differing error values and prediction abilities depending on the seed given to the random number generator. At times the network appeared to be unable to achieve a decreasing error rate. It is likely that in these cases the net had reached a local minimum of the error curve and was unable to leave it because of the low learning rate. A solution would be to increase the learning rate after a certain number of epochs with unchanging error. This sensitivity to initial conditions makes it necessary to use the same seed for the random number generator when comparing results.

The quality of prediction for the square wave, as measured by the error, appeared to be slightly better for the smaller network that included only the sine components. However, this network was clearly inferior to the larger network when all of the predicted output was considered. The network with cosine and sine oscillators was able to recreate the step function repeatedly better than the net without the extra oscillators. Therefore, the prediction quality was improved for the network that included the cosine oscillators.

The same effect was noticed when testing the ease of training. The network with only sines was able to train using less than half of the epochs required for the network with cosines. However, the second network was able to recreate the training series more accurately in the future. This may be a result of a couple of effects. The first network had more weights that could be modified, which would aid faster training. The second network had 9 more of its weights fixed, which would require the other weights to make up for this inflexibility through greater changes. Therefore the second network would require more training. Another possibility is that the first network quickly found a local minimum but was unable to locate a global minimum that was found by the second network.

When the Mackey-Glass function was used, it became apparent that in this case using both the sine and cosine oscillators allowed for easier training and better prediction of future values. It may be that the benefits of including extra cosine oscillators are extremely dependent on the particular time series to be modeled. In the case of the Mackey-Glass function, inclusion of the extra three oscillators was beneficial.

When the ECG was used for training the network, different results were obtained for each experiment run. To test training ability, the network was run until separate goal error rates were reached. For goal errors of 1.5, 1.3, and 1.1, the sines-only network required fewer training epochs. However, for a goal error of 1.2, the sines-only network required a greater number of epochs.

When the prediction quality was tested, the error rate differed depending on the initial learning rate. For the smallest eta tested, 1/10000, the sines-only network obtained a lower error rate. For all greater values of eta, the sines-only network had a higher error rate. Including the cosines gave contradictory results for the ease of training, with some examples taking many more epochs and others requiring fewer. When the extra oscillators are included, their weights cannot be modified, which may decrease the network's ability to be trained to the particular signal. When the graphs are examined, it is clear that both networks, with and without the cosines, are able to predict even the high-frequency components of the ECG signal, and the differences between the error rates were small.

The benefit of adding cosine frequencies, by including three extra oscillators in the overall network, was highly dependent on the particular signal being examined. The greatest increase in prediction quality was obtained when the feed-forward network with biases was trained first, followed by the recurrent network. This final method seemed to be more affected by initial conditions than by the particular oscillators included in the system. Including three extra oscillators in the case of the ECG signal did not appear to have a significant effect on the prediction quality. The effects on training efficiency were ambiguous, perhaps due to the particular error curve that the network was following for this example.


REFERENCES

Abdi, H. "A neural network primer." Journal of Biological Systems. 2(3), 1994.

Brockwell, Peter J. and Davis, Richard A. Time Series: Theory and Methods. New York: Springer-Verlag Inc., 1987.

Galka, Andreas. Topics in Nonlinear Time Series Analysis. World Scientific Pub Co Inc., 2000.

Glass, Leon and Mackey, Michael C. From Clocks to Chaos. Princeton, N.J.: Princeton University Press, 1988.

Gomez, Gil and Oldham, W.J.B. Recurrent Neural Networks as a Tool for Modeling and Prediction of Electrocardiograms. International Conference on Information Systems, Analysis, and Synthesis (4), 1998.

Gomez, Maria Del Pilar. "The Effect of Non-Linear Dynamic Invariants in Recurrent Neural Networks for Prediction of Electrocardiograms." Ph.D. Dissertation, Texas Tech University, Lubbock, TX, 1998.

Graupe, Daniel. Principles of Artificial Neural Networks. New Jersey: World Scientific Publishing Co. Pte. Ltd., 1997.

Gurney, Kevin. An Introduction to Neural Networks. London: UCL Press, 1997.

Hayashi, Yukio. "Oscillatory Neural Network and Learning of Continuously Transformed Patterns." Neural Networks. 7(2), pp. 219-231, 1994.

Haykin, Simon. Neural Networks: A Comprehensive Foundation. New York: Macmillan College Publishing Company, Inc., 1994.

Hebb, D. The Organization of Behavior. New York: Wiley, 1949.

Krose, Ben and Smagt, Patrick van der. An Introduction to Neural Networks. (8) November 1996.

Logar, Antonette. "Recurrent Neural Networks and Time Series Prediction." Ph.D. Dissertation, Texas Tech University, Lubbock, TX, 1992.

Pearlmutter, B.A. "Dynamic recurrent neural networks." Technical Report CMU-CS-90-196, 1990.

Pearlmutter, B.A. "Gradient calculations for dynamic recurrent neural networks: a survey." IEEE Trans. on Neural Networks. 6, pp. 1212-1228, 1995.

Petridis, Vassilios and Athanasios Kehagias. Predictive Modular Neural Networks: Applications to Time Series. Kluwer Academic Publishers, Norwell, MA, 1998.

J.C. Principe and J-M Kuo. "Dynamic modeling of chaotic time series with neural networks." In G. Tesauro, D. Touretzky, and T. Leen, editors. Advances in Neural Information Processing Systems. 7, pp. 311-318. MIT Press, 1995.

Terrence J. Sejnowski and Charles R. Rosenberg. "Parallel networks that learn to pronounce English text." Complex Systems. 1987.

Tong, Howell. Non-linear Time Series: A Dynamical System Approach. Oxford, New York: Oxford University Press, Inc., 1990.


PERMISSION TO COPY

In presenting this thesis in partial fulfillment of the requirements for a master's degree at Texas Tech University or Texas Tech University Health Sciences Center, I agree that the Library and my major department shall make it freely available for research purposes. Permission to copy this thesis for scholarly purposes may be granted by the Director of the Library or my major professor. It is understood that any copying or publication of this thesis for financial gain shall not be allowed without my further written permission and that any user may be liable for copyright infringement.

Agree (Permission is granted.)

Student Signature Date

Disagree (Permission is not granted.)

Student Signature Date
