TIME SERIES PREDICTION USING NEURAL NETWORKS
by
CARRIE KNERR, B.S.
A THESIS
IN
COMPUTER SCIENCE
Submitted to the Graduate Faculty of Texas Tech University in
Partial Fulfillment of the Requirements for
the Degree of
MASTER OF SCIENCE
Approved
Chairperson of the Committee
Accepted
Dean of the Graduate School
May, 2004
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
I. INTRODUCTION
1.1 Time Series
1.2 Artificial Neural Networks
1.3 Recurrent Neural Networks
1.4 Least Squares Algorithm
1.5 Learning Rule
1.6 Mackey Glass Equation
1.7 Electrocardiogram Data
1.8 Research Objectives
II. BACKGROUND
2.1 Training
2.2 Pearlmutter Algorithm
2.3 Recent Applications
III. RESULTS
3.1 Square Wave
3.2 Mackey Glass
3.3 ECG Signal
IV. CONCLUSION
REFERENCES
LIST OF TABLES
3.1: Square Wave Prediction Ability
3.2: ECG Trained 1,000 Epochs
3.3: ECG Test Training Ability
LIST OF FIGURES
1.1 Perceptron Model
1.2 Feed-Forward Network
1.3 Recurrent Network
1.4 Fully Connected Recurrent Network
1.5 Mackey Glass Time Series
1.6 Step Function First Harmonic
1.7 Step Function First Ten Harmonics
1.8 Triangle Function First Harmonic
1.9 Triangle Function First Ten Harmonics
1.10 Harmonic Generator
1.11 Harmonic Generator Prediction
1.12 Sigmoid Activation Function
1.13 Algorithm Flow Chart
3.1 Window Function
3.2 Square Wave Prediction for Sine Only Network
3.3 Square Wave Prediction for Sines and Cosines Network
3.4 Square Wave Compare 100,000 Epochs
3.5 Square Wave Training Ability
3.6 Square Wave Training Ability (2)
3.7 Mackey-Glass Prediction Results
3.8 Mackey-Glass Prediction (2)
3.9 Compare Prediction Results
3.10 Compare Mackey-Glass Prediction
3.11 Eta 1/10000
3.12 Eta 1/1000
3.13 100 Epochs, Eta 1/100
3.14 Eta 1/10
3.15 Comparison of Learning Rates - Cosines and Sines Network
3.16 10,000 Epochs
3.17 Error Goal < 1.5
3.18 Error Goal < 1.3
3.19 Error Goal < 1.2
ABSTRACT
This thesis investigates the effect of prior knowledge embedded in a fully
connected recurrent artificial neural network on the prediction of non-linear
time series. The networks are trained with the back propagation method. Two
network architectures are compared using the square wave, Mackey Glass data,
and an ECG signal to determine whether prediction quality or training ability
improves when more information is embedded in the network through cosine
oscillators. A benefit of such an exercise may be the prediction of abnormal
ECG signals; the ECG is an electrical measure of heart activity. Such ability
would allow medical professionals to intervene and possibly prevent abnormal
ECG signals. The improved network provided increased prediction quality and
training ability for the Mackey Glass time series.
CHAPTER I
INTRODUCTION
The objective of this thesis is to improve time series prediction on a
deterministic system using a neural network. A deterministic system is one in
which the future states of the system are determined by the current states of the
system and a set of differential equations; a deterministic system is not random.
The unknown equations are the theoretical basis which enables prediction of the
system. This work may be applied to non-linear deterministic systems such as
ECG and EEG measurements to predict future catastrophic events.
Non-linear dynamical theory can be used to represent the human heart's
electrical activity. The electrocardiogram (ECG) is a chaotic signal [Glass
1987, 1991; Albert 1990]. Humans are sometimes able to use the ECG to predict
future catastrophic events (e.g., a heart attack) but cannot estimate the time
until the event. It would be useful to create an automated system to predict
such events based on ECG data to enable doctors to prevent them.
1.1 Time Series
A time series is defined as a collection of measurements of a variable that
are usually taken at equal time intervals. This data set can be decomposed into
components including trends, seasonality, and random noise [Brockwell 1987].
A trend is an overall increase or decrease in the values of the time series while
seasonality refers to patterns that are recurring in correspondence to the date.
Random noise has no pattern and can be caused by errors in the observation
methods. Time series analysis is utilized to predict future behavior based on
past measurements of variable(s). A model of the basis of the time series must
be made before predictions can be created.
Time series analysis is performed to create a model of a system. The
systems examined here are dynamical; a dynamical system is one in which the
observed value (time series) is a result of the state of the system. A moving
average process is a linear combination of the signal and random noise that is
acquired during data collection. This noise can interfere with modeling of the
system; methods such as smoothing exist to diminish the effects of random
variations.
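One method of smoothing is an averaging method, which reveals the time
series' underlying components (e.g., trend). The moving average (MA) works by
calculating the average of a small set of the most recent data; the window is
then shifted along the series so that an average is calculated for each data
subset. In the equation below, n is the number of points in each subset.
\[ \mu_t = \frac{1}{n} \sum_{i=0}^{n-1} x_{t-i} \qquad (1.1) \]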
The MA method is limited to stationary processes, or those processes that do not
have trend or periodic fluctuations and do have an unvarying variance over time.
A stationary process has constant statistical properties which makes prediction
straightforward. Other approaches have been designed for a non-stationary time
series.
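The averaging step described above can be illustrated with a minimal sketch. The function name `moving_average` and the sample values are hypothetical and chosen only for illustration; a trailing window of n points is assumed.

```python
def moving_average(series, n):
    """Return the trailing n-point moving average of a list of numbers.

    The value at position t is the mean of series[t-n+1 .. t], so the
    result has len(series) - n + 1 entries.
    """
    if n <= 0 or n > len(series):
        raise ValueError("window length n must satisfy 1 <= n <= len(series)")
    averages = []
    for t in range(n - 1, len(series)):
        window = series[t - n + 1 : t + 1]
        averages.append(sum(window) / n)
    return averages


# Illustrative data only: a rising trend with alternating noise.
noisy = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
smoothed = moving_average(noisy, 3)
```

With a three-point window the alternating fluctuation is damped and the upward trend becomes easier to see.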
One of the first linear time series models was the autoregressive model
(AR), designed for sunspot study by Yule [1927]. This method can be used to
represent non-stationary processes as well as stationary processes. In this
process the value of the series at the current time is a function of the
previous values added to a random variation term. In the first model shown
below, the last variable, ε, is white noise with zero mean. This noise is a
random variable that is replaced by a weighted average of random variables in
the autoregressive/moving average (ARMA) model (second equation). Processes
may be a combination of moving average and autoregressive processes, which
require the ARMA model. The output is a weighted sum of previous time series
values in both models.
\[ x_t = a_0 + \sum_{j=1}^{p} a_j x_{t-j} + \varepsilon_t \qquad (1.2) \]
\[ x_t = a_0 + \sum_{j=1}^{p} a_j x_{t-j} + \sum_{j=0}^{q} b_j \varepsilon_{t-j} \qquad (1.3) \]
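As a rough illustration of the two models, the sketch below evaluates the AR weighted sum and then adds a weighted sum of noise samples for the ARMA case. The function names, coefficient values, and the convention that the moving average sum starts at the current noise sample are assumptions made for this example only.

```python
def ar_predict(history, a0, a):
    """AR(p) value without the noise term: a0 + sum_j a[j-1] * x[t-j]."""
    p = len(a)
    if len(history) < p:
        raise ValueError("need at least p past values")
    # history[-1] is x[t-1], history[-2] is x[t-2], and so on.
    return a0 + sum(a[j] * history[-1 - j] for j in range(p))


def arma_value(history, noise, a0, a, b):
    """ARMA value: the AR part plus a weighted sum of noise samples.

    noise[-1] is the current noise sample, noise[-2] the previous one.
    """
    if len(noise) < len(b):
        raise ValueError("need at least len(b) noise samples")
    ma_part = sum(b[j] * noise[-1 - j] for j in range(len(b)))
    return ar_predict(history, a0, a) + ma_part


# Illustrative numbers only.
past = [1.0, 2.0]
ar_next = ar_predict(past, 0.5, [0.5, 0.25])
arma_next = arma_value(past, [0.0, 0.5], 0.5, [0.5, 0.25], [1.0, 0.5])
```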
The advantages of ARMA include reasonable computation time and the
creation of a tool for analysis, forecasting, and control. One restriction of
the ARMA model is that it was not designed for time series with asymmetry or
data with sudden bursts of large amplitude at irregular times [Tong]. In
addition, ARMA is used with the assumption that the underlying system is linear.
The AR method is useful because it only requires knowledge of the system's
output values and can be used for both stationary and non-stationary time
series. However, many calculations are required when the AR method is used.
Poincaré maps are a classical technique for analyzing dynamical systems
[Parker & Chua, p. 33]. This method reduces the order of the system and bridges
the gap between continuous and discrete time systems. A Poincaré map is a phase
diagram with one variable. For a dripping faucet, for example, the X-axis is
the time interval between one drop and the next, and the Y-axis is the time
interval between the second drop and the third. This shows structure in a time
series that appears to be random. A Poincaré map can show convergence of the
system to a stationary point. This method can be utilized to evaluate the
stability of a limit cycle by using eigenvalues.
This paper involves non-linear time series where Takens' embedding
theorem can be used to create a model [Principe 1997]. Takens' theorem states
that many deterministic systems have a one-to-one mapping between the state
of the system and a finite portion (vector) of the time series. This allows for
estimating the number of variables that control the variable under observation.
Takens' Embedding Theorem maintains the attractor's topological properties
while reconstructing the attractor in a time-delayed embedded space. This
technique uses a finite observation of a single variable. Nonlinear time series
analysis can be accomplished using time-delay embedding, which requires the
choice of a time delay and an embedding dimension.
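The delay embedding described above can be sketched as follows. The function name `delay_embed` and the integer test series are hypothetical; choosing a suitable delay and dimension for real data is a separate problem that this sketch does not address.

```python
def delay_embed(series, dimension, delay):
    """Build delay vectors (x[t], x[t-delay], ..., x[t-(dimension-1)*delay])."""
    if dimension < 1 or delay < 1:
        raise ValueError("dimension and delay must be positive integers")
    span = (dimension - 1) * delay  # samples needed before the first vector
    return [
        tuple(series[t - k * delay] for k in range(dimension))
        for t in range(span, len(series))
    ]


# Illustrative series: the integers 0..9, embedded in 3 dimensions with delay 2.
vectors = delay_embed(list(range(10)), dimension=3, delay=2)
```

Each tuple is one reconstructed state built from a single observed variable.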
1.2 Artificial Neural Networks
The method of prediction for nonlinear time series explored in this paper
is the use of an artificial neural network. An artificial neural network attempts to
model the human brain which is composed of neurons and connections between
the neurons called synapses. A neural network mimics the brain in that
observed knowledge is acquired through a learning process and interneuron
connection strengths (weights) store the knowledge [Haykin 1994]. Prior
knowledge may be embedded in the network through utilizing a pre-trained
network within the larger network.
Figure 1.1: Perceptron Model
A neural network is comprised of four parts: nodes, connections between
nodes, activation functions, and a learning rule. Each neuron collects data as
the weighted sum of the outputs of other neurons. The activation function of
the neuron is applied to this sum. The activation function can be different for
each neuron; however, this may become confusing, so the code used here has only
one activation function in the network. There are three common activation
functions: the hard-limiter, the threshold function, and the sigmoid function.
The hard-limiter creates an output of positive one, negative one, or zero as in
the McCulloch-Pitts model [Haykin 1994].
In 1943, McCulloch and Pitts presented an early artificial neuron model
known as the linear threshold gate. This neuron has multiple inputs and one
output. The neuron produces a binary output used to group the input set. The
threshold and weight are fixed and the model is simple. A disadvantage of this
model is the inability of the network to work with nonlinearly separable classes.
The threshold function (piecewise-linear function) is used as an activation
function to approximate a non-linear amplifier. The sigmoid function (the
hyperbolic tangent is used in these experiments) is non-linear and
differentiable. The output of each neuron may be represented by the following
equation, where f represents the activation function, s is the input from other
neurons, w is the weight of the interconnection, and θ is the threshold value.
The input neuron is denoted by i.
\[ y = f\left( \sum_i w_i s_i - \theta \right) \qquad (1.4) \]
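A minimal sketch of a single neuron's output, assuming the net input is the weighted sum of the inputs minus the threshold, is given here with the three activation functions named in the text. The function names, the clipping limits of the piecewise-linear function, and the sample values are assumptions for illustration.

```python
import math


def hard_limiter(v):
    """Output +1, -1, or 0 depending on the sign of the net input."""
    if v > 0:
        return 1.0
    if v < 0:
        return -1.0
    return 0.0


def piecewise_linear(v):
    """Linear in the middle, clipped to the range [-1, 1] (assumed limits)."""
    return max(-1.0, min(1.0, v))


def neuron_output(inputs, weights, theta, activation=math.tanh):
    """Apply the activation function to the weighted input sum minus theta."""
    net = sum(w * s for w, s in zip(weights, inputs)) - theta
    return activation(net)


# Illustrative values only; the net input here works out to -0.25.
s = [1.0, 0.5]
w = [0.5, -1.0]
y_hard = neuron_output(s, w, 0.25, hard_limiter)
y_linear = neuron_output(s, w, 0.25, piecewise_linear)
y_tanh = neuron_output(s, w, 0.25)  # hyperbolic tangent by default
```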
Each neural network must be trained by changing the weights between
neurons. Learning may be supervised where the network is provided with an
input (set of examples) and correct output (desired responses) by an external
teacher. Unsupervised learning requires the system to acquire information by
sampling without a teacher. Both learning methods are a type of Hebbian
learning rule, as described by Donald Hebb in his 1949 book on neurons.
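The Hebbian rule states that when two connected neurons are activated
at the same time their connection is strengthened (i.e., the weight value is
increased). If neuron i sends input to neuron j, the connection weight w_ij is
modified according to the following equation, where η is the learning rate and
y_i and y_j are the outputs of the two neurons.
\[ \Delta w_{ij} = \eta \, y_i \, y_j \qquad (1.5) \]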
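The Hebbian update can be sketched for a small set of neurons as below. The function name `hebbian_update`, the learning rate symbol `eta`, and the example values are assumptions; the sketch updates every pair of neurons, including self-connections, which a particular architecture may not have.

```python
def hebbian_update(weights, outputs, eta):
    """Return new weights where w[i][j] grows by eta * y[i] * y[j]."""
    n = len(outputs)
    return [
        [weights[i][j] + eta * outputs[i] * outputs[j] for j in range(n)]
        for i in range(n)
    ]


# Two neurons with opposite activity: their mutual weight decreases,
# while a neuron paired with itself is strengthened.
start = [[0.0, 0.0], [0.0, 0.0]]
updated = hebbian_update(start, [1.0, -1.0], eta=0.5)
```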
The simple feed-forward neural network has input node(s), hidden
node(s), and other nodes that process the signal before it is combined into the
output [Gurney 1997]. Hidden nodes are not connected directly to the external
input or output and are not expected to have a particular response. A fully
connected artificial network is one in which every node in one layer is
connected to each node in the next layer. The example feed-forward neural
network in the figure below has one layer of hidden nodes. The only connections
that exist are between consecutive layers of nodes.
Figure 1.2: Feed-forward Network (input nodes, hidden nodes, output node)
1.3 Recurrent Neural Networks
A recurrent neural network is different from a feed-forward neural
network because it has one or more feedback loops [Haykin 1994]. A feedback
loop occurs when the output of a node has an effect on the input to that node,
which can include self loops. Because of unit delay elements, these feedback
loops result in nonlinear dynamical behavior. The values from hidden nodes are
copied to memory on the input nodes and utilized in the next time step. This
attribute of recurrent neural networks allows them to be used in time dependent,
deterministic systems where feed forward networks would be inappropriate. A
fully connected recurrent neural network is one in which every neuron is
connected to every other neuron in both directions. In a fully connected
network, neurons are identical and are named input, output, or hidden. The
recurrent neural network in the figure below has one feedback loop.
Figure 1.3: Recurrent Network (input nodes, hidden nodes, output node)
Figure 1.4: Fully Connected Recurrent Network (input nodes, hidden nodes, output node)
Behaviors of the recurrent neural network are described by fixed points,
limit cycles, and chaos. A fixed point may be attracting, repelling, or neither.
A fixed point occurs at the intersection of a function and the line y = x. The
fixed points are categorized by taking the derivative of the function at the
fixed point. A repelling type will result in a derivative greater than 1 or
less than -1. An attracting type of fixed point has a derivative between -1 and
+1. If the magnitude of the derivative is exactly one, the type is undefined.
Feed-forward networks evolve to a fixed point [Logar 1992]. This property lends
itself towards classification of the input; the output is set for a given
input. Recurrent networks may also approach fixed points.
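The derivative test for fixed points can be checked numerically. The sketch below assumes a one-node map x -> tanh(w x), which has a fixed point at zero with slope w; the map, the function names, and the tolerance values are illustrative choices, not taken from the thesis code.

```python
import math


def classify_fixed_point(f, x_star, h=1e-6, tol=1e-6):
    """Classify a fixed point of x -> f(x) from the slope of f at x_star."""
    slope = (f(x_star + h) - f(x_star - h)) / (2.0 * h)  # central difference
    if abs(abs(slope) - 1.0) <= tol:
        return "undetermined"
    return "attracting" if abs(slope) < 1.0 else "repelling"


def iterate(f, x0, n_steps):
    """Apply the map n_steps times starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = f(x)
    return x


def weak(x):
    return math.tanh(0.5 * x)  # slope 0.5 at the origin


def strong(x):
    return math.tanh(2.0 * x)  # slope 2.0 at the origin


kind_weak = classify_fixed_point(weak, 0.0)
kind_strong = classify_fixed_point(strong, 0.0)
end_weak = iterate(weak, 0.1, 100)      # pulled toward the origin
end_strong = iterate(strong, 0.1, 100)  # pushed away from the origin
```

With the larger weight the origin repels nearby points, and the iteration settles on a different, nonzero fixed point instead.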
A limit cycle can be generated by a recurrent neural network. The limit
cycle properties depend on the weight values between the nodes of the network.
Stable limit cycles may be created which return cycles to their orbit. This
behavior requires non-linearity produced by the sigmoidal activation function.
A chaotic system is one in which the system is sensitive to its initial
conditions, so two input values which are close together can produce widely
varying outputs. This behavior is important because many signals appear to be
chaotic. The ECG shows that the heart is chaotic; therefore, a recurrent
neural network may be able to closely model this behavior.
1.4 Least Squares Algorithm
The training of a neural network consists of setting each weight
(connection between nodes) to an optimized value. One method in supervised
training is the least mean square error (LMS) method (or Widrow and Hoff's
delta rule from 1960) [Graupe 1997, p. 12], which compares the real output with
the desired output. A training series is composed of the input (x) and the
desired output (d). Given a training series with d_1 ... d_L (desired output of
the network), the training error (e) at the nth set is calculated by the
following equation.
\[ e_n = d_n(t) - y_n(t) \qquad (1.6) \]
where d is the desired value and y is the actual output created by the neural
network. The goal of setting the weights is to minimize the training cost, the
sum of squared errors.
\[ E = \sum_{n=1}^{L} e_n^2 \qquad (1.7) \]
\[ E = \sum_{n=1}^{L} \left( d_n(t) - y_n(t) \right)^2 \qquad (1.8) \]
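The training cost can be computed directly from the desired and actual outputs. This is a minimal sketch that takes the cost to be the plain sum of squared errors, with no scaling factor; the function names and sample values are hypothetical.

```python
def training_errors(desired, actual):
    """Per-sample error: desired output minus actual output."""
    return [d - y for d, y in zip(desired, actual)]


def sum_squared_error(desired, actual):
    """Training cost: the sum of the squared per-sample errors."""
    return sum(e * e for e in training_errors(desired, actual))


# Illustrative values only.
d_series = [1.0, -1.0, 1.0]
y_series = [0.5, -0.5, 1.0]
errors = training_errors(d_series, y_series)
cost = sum_squared_error(d_series, y_series)
```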
The gradient is computed and set equal to zero to optimize the weights
[Graupe 1997]. This method is employed when the training series is limited to a
small size (L). The maximum rate of change and indicator of direction of change
is given by the gradient. A series which is chaotic is dependent on initial
conditions and so the prediction may be inaccurate when the system is
presented with different initial conditions. The trajectory generated by a
chaotic system may be one of an infinite number of possibilities, which makes
prediction difficult.
1.5 Learning Rule
One learning rule described by Rosenblatt utilizes the gradient descent
method to optimize weights. The first step is to initialize the weights and
threshold to small random numbers. The initial input from the training series is
sent to the network. Then the output of each neuron is computed using the
activation function. The weights are updated with the following learning rule.
\[ w_i(t + 1) = w_i(t) + \eta \, (d - y(t)) \, x_i \qquad (1.9) \]
The final steps are repeated for all input vectors. One drawback to using this
method is that the equation will oscillate if overlap occurs (as in classification
problems). This can be modified by adding the least mean square error to
minimize the error.
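The steps of this learning rule can be sketched for a single neuron with a hard-limiting output. The update is written so that each correction moves the output toward the desired value; the function names, the fixed random seed, the learning rate, and the AND-style training set are assumptions chosen so the example is repeatable.

```python
import random


def step(v):
    """Hard-limiting output: +1 for non-negative net input, otherwise -1."""
    return 1.0 if v >= 0 else -1.0


def predict(weights, theta, x):
    return step(sum(w * xi for w, xi in zip(weights, x)) - theta)


def train_perceptron(samples, eta=0.1, epochs=100, seed=0):
    """Train one neuron; samples is a list of (input_vector, desired)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    n_inputs = len(samples[0][0])
    # Step 1: initialize weights and threshold to small random numbers.
    weights = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    theta = rng.uniform(-0.05, 0.05)
    for _ in range(epochs):
        mistakes = 0
        for x, d in samples:
            y = predict(weights, theta, x)  # compute the neuron output
            if y != d:
                mistakes += 1
                # Move each weight in the direction that reduces (d - y).
                weights = [w + eta * (d - y) * xi for w, xi in zip(weights, x)]
                # The threshold behaves like a weight on a constant input of -1.
                theta -= eta * (d - y)
        if mistakes == 0:
            break
    return weights, theta


# Linearly separable example: logical AND on inputs coded as -1 and +1.
and_samples = [((-1.0, -1.0), -1.0), ((-1.0, 1.0), -1.0),
               ((1.0, -1.0), -1.0), ((1.0, 1.0), 1.0)]
trained_w, trained_theta = train_perceptron(and_samples)
```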
1.6 Mackey-Glass Equation
The Mackey-Glass equation is a time-delay differential equation which
models the production of white blood cells. This time series is sensitive to
initial conditions. In the experiments presented here, the initial value is 1.2,
tau (τ) = 17, a = 0.2, and b = 0.1, and it is assumed that x(t) = 0 when t < 0.
This equation results in a non-periodic, non-convergent, and chaotic series.
Chaotic time series are created by deterministic systems and are dependent on
the initial conditions as well as being non-periodic. The Mackey-Glass equation
is defined by the equation below; Figure 1.5 shows the graphical representation.
\[ \frac{dx(t)}{dt} = \frac{a \, x(t - \tau)}{1 + x(t - \tau)^{10}} - b \, x(t) \qquad (1.10) \]
Figure 1.5: Mackey Glass Time Series (panel title: Mackey Glass Function)
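A Mackey-Glass series with the parameters quoted above can be generated with a simple Euler integration, as in the sketch below. The step size `dt`, the exponent of 10 in the denominator (the commonly used value), and the function name are assumptions; the thesis does not state how its series was integrated.

```python
def mackey_glass(n_steps, tau=17.0, a=0.2, b=0.1, x0=1.2, dt=0.1, power=10):
    """Euler integration of dx/dt = a*x(t-tau)/(1 + x(t-tau)**power) - b*x(t).

    The history is taken to be zero for t < 0, and x(0) = x0.
    """
    delay_steps = int(round(tau / dt))
    x = [x0]
    for k in range(n_steps):
        # Value of the series tau time units ago (zero before the start).
        delayed = x[k - delay_steps] if k >= delay_steps else 0.0
        dx = a * delayed / (1.0 + delayed ** power) - b * x[k]
        x.append(x[k] + dt * dx)
    return x


# 3000 steps of size 0.1 cover 300 time units.
series = mackey_glass(3000)
```

A smaller step size or a higher-order integrator would track the continuous equation more closely; Euler is used here only to keep the sketch short.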
1.7 Electrocardiogram Data
An electrocardiogram (ECG) is a time series representing the electrical
activity of the human heart. The electrocardiograph is a device that measures
the potential between electrical charges on either side of the membrane that
initiate the pumping of the heart. This machine is centered on a differential
amplifier. One heartbeat is composed of one P wave and a QRS complex. This
signal is chaotic and can be difficult to represent artificially given the high
frequency peaks R and S. The ECG data used in these experiments were
obtained from the MIT-BIH Arrhythmia Database on CD-ROM. An example of
an ECG signal is given in the figure below.
1.8 Research Objectives
The existing program consists of a three-node fully connected recurrent
neural network. This architecture acts as a harmonic generator because it is
able to oscillate and perform predictions [Gomez 1998]. The data stream is
partitioned into separate frequency components using the Fourier transform for
aperiodic signals. The total sum of these sinusoids is equivalent to the
original data. Currently, the program incorporates the fundamental and the
first harmonics of the Fourier transform sine components. However, the Fourier
transform involves both the sine and cosine components for most functions. The
Fourier series demonstrates that a periodic signal is the sum of sine and cosine
terms. The step function is a sum of an infinite number of sine waves as shown
in Figures 1.6 and 1.7. The triangle wave is a combination of sine waves and
cosine waves as shown in Figures 1.8 and 1.9.
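The way sine harmonics build up a square wave can be sketched with the standard Fourier series of a unit square wave, which uses only the odd harmonics. The function name and the sample points are illustrative assumptions.

```python
import math


def square_wave_partial_sum(x, n_harmonics):
    """Sum the first n_harmonics odd sine terms of a unit square wave."""
    total = 0.0
    for m in range(n_harmonics):
        k = 2 * m + 1  # odd harmonic numbers 1, 3, 5, ...
        total += math.sin(k * x) / k
    return 4.0 / math.pi * total


# At the middle of the positive half cycle the true square wave equals 1.
one_term = square_wave_partial_sum(math.pi / 2, 1)
ten_terms = square_wave_partial_sum(math.pi / 2, 10)
negative_half = square_wave_partial_sum(-math.pi / 2, 10)
```

One term overshoots the flat top of the wave, while ten terms lie much closer to it.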
Figure 1.6: Step Function First Harmonic
Figure 1.7: Step Function First 10 Harmonics (panel title: Square Wave; legend: Sum of Harmonics Sin(x), Square function; horizontal axis: X values)
Figure 1.8: Triangle Function First Harmonic
Figure 1.9: Triangle Function First 10 Harmonics
The three-node neural network is easily trained to sine waves [Gomez
1998]. The current program consists of three sub-networks, the harmonic
generators. Each sub-network in the architecture is present to account for a
Fourier coefficient. The sub-networks are pre-trained to a sine wave. A picture
of a sub-network after training is completed is shown below. The overall
network is a fully connected recurrent neural network with no external input
nodes. The weights of the sub-networks are held steady while the other weights
are modified by the algorithm. The input signal used for training is the correct
value for the output node.
Figure 1.10: Harmonic Generator
Figure 1.11: Harmonic Generator Prediction
Figure 1.12: Sigmoid Activation Function (panel title: Hyperbolic Tangent)
An activation function limits the amplitude of each neuron's output. The
activation function utilized in this architecture is the sigmoid function (the
hyperbolic tangent function shown above). It is a continuous function with a
bounded output: the logistic form given below maps onto the range from 0 to 1,
while the hyperbolic tangent maps onto the range from -1 to +1. The training
series must fall within the output range. In other words, it will be impossible
to train the network to a series that is outside of the range from -1 to +1.
Any series must be normalized to be limited to positive and negative one. This
squashing function is given by the following equation.
\[ f(x) = \frac{1}{1 + e^{-x}} \qquad (1.11) \]
The sigmoid is appropriate because its derivative is easily expressed in terms
of the function itself.
\[ f'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = f(x) \, [1 - f(x)] \qquad (1.12) \]
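The derivative identity can be checked numerically. The sketch below compares the closed-form derivative of the logistic function with a central difference, and includes the corresponding identity for the hyperbolic tangent, 1 - tanh(x)^2, which is a standard result rather than something stated in the text. All function names are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_derivative(x):
    """Derivative written in terms of the function value itself."""
    f = logistic(x)
    return f * (1.0 - f)


def tanh_derivative(x):
    """Matching identity for the hyperbolic tangent."""
    return 1.0 - math.tanh(x) ** 2


def numeric_derivative(f, x, h=1e-6):
    """Central difference approximation, used only as a cross-check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x_test = 0.3
gap_logistic = abs(logistic_derivative(x_test) - numeric_derivative(logistic, x_test))
gap_tanh = abs(tanh_derivative(x_test) - numeric_derivative(math.tanh, x_test))
```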
The algorithm of this program is shown in Figure 1.13 below. Initially the
weights and node outputs are set to small random numbers. The output of the
network is computed ("run" the network), the error is computed to help calculate
the delta rules, and then the weights are updated. This training cycle is
continued for a specified number of epochs, or until a specified error is
reached. The network then makes its predictions and the results are saved.
Start
Get User Input
Set Initial Weights between Nodes
Set Initial Output of Each Node
Run the Network
Compute Error
Compute Delta Rules
Update Weights
Prediction
Output Results
Figure 1.13: Algorithm Flow Chart
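The overall control flow (initialize, run, compute the error, update, repeat until an epoch limit or an error goal is met, then predict) can be sketched with a deliberately tiny model. The one-weight linear "network", the fixed starting weight, and all names below are stand-ins chosen to keep the example short and repeatable; they are not the thesis program.

```python
def train(samples, eta=0.1, max_epochs=1000, error_goal=1e-6):
    """Train y = weight * x until the epoch limit or the error goal is reached."""
    weight = 0.01  # small initial weight (fixed here instead of random)
    epochs_used = 0
    total_error = float("inf")
    while epochs_used < max_epochs and total_error >= error_goal:
        total_error = 0.0
        for x, d in samples:
            y = weight * x         # run the network
            e = d - y              # compute the error
            weight += eta * e * x  # update the weight
            total_error += e * e
        epochs_used += 1
    return weight, epochs_used, total_error


def predict(weight, x):
    return weight * x


# Illustrative target: d = 2 * x.
samples = [(0.5, 1.0), (1.0, 2.0), (-1.0, -2.0)]
trained_weight, epochs_used, final_error = train(samples)
prediction = predict(trained_weight, 3.0)
```

Training stops as soon as the summed squared error over one epoch falls below the goal, which here happens well before the epoch limit.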
CHAPTER II
BACKGROUND
Research in neural modeling began with McCulloch and Pitts in 1943.
Rosenblatt described the basic perceptron in 1958. The perceptron contains two
layers of neurons connected by modifiable weights (input and output). The
limitation of this model is that it is linear [Lau Xueying].
2.1 Training
Back-propagation through time is a supervised learning algorithm utilized
for training the neural network. Back-propagation was discovered by Bryson and
Ho [Abdi, 1994] and provided a method of updating weights between hidden
neurons. The goal of training is to minimize the error between the node's actual
output y(t) and the desired output d(t) using the gradient descent method
discussed earlier. Pearlmutter extended the back propagation algorithms by
first unfolding the recurrent neural network in time and treating it as a feed
forward neural network.
The algorithm consists of integrating the network forward in time and
keeping track of the error between desired and actual output. Then the network
is integrated backwards in time by sending the error signal from the output layer
backwards to the hidden layer (back propagating the errors of the output layer).
This calculates the hidden layer(s) error rates as the weighted average of the
output error; the error of each node is weighted by the derivative of its own
output. Finally, the weights are updated. This method can handle hidden
neurons that have initial conditions that can be set to one value or changed by
the program. Back-propagation converges to a local minimum of the LMS error for
the output. Finding an optimal set of weights is an NP-complete problem in
general, and the length of time required grows rapidly as the number of neurons
increases. Therefore, it may not be appropriate in large networks.
Back-propagation solved concerns that existed with the perceptron model
when attempting to use a network with hidden nodes. Neural networks can use
back-propagation when the problem is nonlinear. The hidden layer provides
internal knowledge storage. One disadvantage that occurs with the back
propagation method is that it is difficult to establish the optimal number of
hidden nodes and the optimal number of hidden layers. The universal
approximation theorem states that one layer of hidden units can approximate a
continuous function when the sigmoid is used for an activation function [Krose
1996]. Back propagation requires supervised training and can require thousands
of iterations for the network to learn.
2.2 Pearlmutter Algorithm
The Pearlmutter [1990, 1995] network is fully connected and is similar to
recurrent back propagation but is a discrete model of a continuous system. This
method utilizes differentiable functions to define the system. The Pearlmutter
network requires a differentiable activation function, σ, which is nonlinear,
such as the sigmoidal function shown earlier.
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \qquad (2.1) \]
\[ \sigma'(x) = \sigma(x) \, (1 - \sigma(x)) \qquad (2.2) \]
The Pearlmutter algorithm begins by creating and saving a neuron's
outputs for a number of time steps. The total input from other nodes to node i
is the following sum, where y_j is the activation level for neuron j and w_ij is
the weight of the link from node j to node i.
\[ x_i(t) = \sum_{j=1}^{n} w_{ij} \, y_j(t) \qquad (2.3) \]
The network is defined by the following equation, where y_i is the state of
node i and I_i is the external input to the node. This differential equation
gives the path followed by the network state.
\[ \frac{dy_i}{dt} = -y_i(t) + \sigma(x_i(t)) + I_i(t) \qquad (2.4) \]
The next step in the algorithm is to calculate the error value and then update
the weights.
\[ e_j = \mathrm{actualoutput}_j - \mathrm{desiredoutput}_j \qquad (2.5) \]
\[ \Delta w_{ij} = \eta \, \frac{\partial E}{\partial w_{ij}} \qquad (2.6) \]
\[ w_{ij} = w_{ij} - \Delta w_{ij} \qquad (2.7) \]
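The following must be integrated.
\[ \frac{dz_i}{dt} = z_i(t) - e_i(t) - \sum_{j=1}^{n} w_{ji} \, \sigma'(x_j(t)) \, z_j(t) \qquad (2.8) \]
\[ \frac{\partial E}{\partial w_{ij}} = \int z_i(t) \, \sigma'(x_i(t)) \, y_j(t) \, dt \qquad (2.9) \]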
The z values are initialized to zero and show the effect that changing a node's
output (y_i) has on the error. Euler's method is used for integration in the
following equations. The second equation shows that z is calculated backwards
in time by rearranging the first.
\[ z_i(t + \Delta t) = z_i(t) + \Delta t \, \frac{dz_i}{dt} \qquad (2.10) \]
\[ z_i(t) = z_i(t + \Delta t) - \Delta t \, \frac{dz_i}{dt} \qquad (2.11) \]
Pearlmutter's algorithm differs from recurrent back propagation in that it
computes the path and then the error along the entire path instead of performing
a point by point analysis.
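The forward half of the procedure, computing and saving the path of node outputs, can be sketched with Euler's method. The sketch assumes the network equation is the node's decay plus the activation of its total input plus the external input, with weights[i][j] taken as the link into node i from node j; the function names and the one-node test case are assumptions, and the backward pass for z is not shown.

```python
import math


def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))


def integrate_forward(weights, y0, external_input, dt, n_steps):
    """Euler integration of dy_i/dt = -y_i + sigma(x_i) + I_i(t).

    Returns the saved path: a list of state vectors, one per time step.
    external_input(i, t) gives the external input to node i at time t.
    """
    n = len(y0)
    path = [list(y0)]
    for step in range(n_steps):
        y = path[-1]
        next_y = []
        for i in range(n):
            x_i = sum(weights[i][j] * y[j] for j in range(n))  # total input
            dy = -y[i] + sigma(x_i) + external_input(i, step * dt)
            next_y.append(y[i] + dt * dy)
        path.append(next_y)
    return path


def no_input(i, t):
    return 0.0


# One isolated node with zero weight: dy/dt = -y + 0.5, which settles at 0.5.
path = integrate_forward([[0.0]], [0.0], no_input, dt=0.1, n_steps=200)
```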
2.3 Recent Applications
Sejnowski and Rosenberg developed a multi-layer perceptron trained to
combinations of letters in English text. NETtalk was trained to generate speech
from written text. In this application the hidden layers represented phonemes in
the English language such as vowels and consonants. Seven letters of a word
were the input series. The input layer was composed of 203 neurons (seven
groups of 29 for the 26 letters of the alphabet and 3 punctuation marks),
followed by 80 neurons in a hidden layer and 26 neurons in the output layer.
This network utilized back propagation and achieved a high accuracy after
training (less than 2 days); in addition, it could speak new texts (those not
used for training) with high accuracy when compared to human error rates.
Principe and Kuo [1995] stated that multi-step prediction is more
appropriate for dynamical systems. Their system, a recurrent neural network, is
trained by seeding the network with a set of input samples before the input is
disconnected. The predicted sample (network output) is then fed back to the
input for k steps. This is repeated over several overlapping segments of the
time series.
The authors note that it is necessary for the network to be recurrent
because the long-term behavior of the dynamical system must be modeled; the
iterative map is therefore constrained throughout learning. The Mackey-Glass
equation was used for prediction. A time delay neural network with 8 input
nodes, 14 hidden nodes, and 1 output node was used with hidden nodes having
sigmoid activation functions and the output node having a linear activation
function. The result was a low final mean square error but the prediction was
more regular than the input series.
Lapedes and Farber investigated using neural networks for time series
prediction. They stated that the linear back-propagation neural network is
comparable to the least mean square method. The activation function was
changed to the sigmoid and a feed-forward network used these non-linear
activation functions for hidden neurons. When the gradient descent method was
used, the authors had better results for prediction of nonlinear time series
when compared to traditional time series analysis.
Two of the time series predicted in the previously noted paper were the
Mackey-Glass series and a logistic map. The network for the Mackey-Glass series
had four input nodes, one output node, and two hidden layers with a total of 20
nodes. The attempt was to sample the series and predict many future points. The
result was better than traditional time series analysis; however, the accuracy
was not impressive. The logistic map, in its standard form
x(t+1) = r x(t) (1 - x(t)), was also investigated. This time series is
deterministic but appears random. The network included one input node, one
output node, and five hidden nodes. Activation functions were different for the
hidden nodes and the output nodes: the hidden nodes used the sigmoid function
while the output nodes used a linear function. Predicting one step in the
future produced accurate results. These examples show that neural networks may
be used for prediction of a time series.
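The logistic map is a convenient example of a series that is fully deterministic yet looks random. The sketch below uses the standard form of the map with r = 4, a value in the chaotic regime chosen here for illustration; the function name and starting values are assumptions. It shows two nearly identical starting points drifting apart, which is why one-step prediction is far easier than many-step prediction.

```python
def logistic_map(x0, r, n_steps):
    """Iterate x(t+1) = r * x(t) * (1 - x(t)) and return the whole series."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two starting values that differ by one part in a billion.
series_a = logistic_map(0.2, 4.0, 60)
series_b = logistic_map(0.2 + 1e-9, 4.0, 60)
largest_gap = max(abs(u - v) for u, v in zip(series_a, series_b))
```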
In 1994, Hayashi proposed an architecture and learning rule using a
recurrent neural network. This architecture characterized an oscillator-based
recurrent network with pairs of nodes x and y. The y nodes are internal nodes
while the x nodes are output nodes that are fully connected by weights W_jk.
Each pair connects together to create an oscillator. There are two connections
between a pair of nodes: the positive weight, K_IE, and the negative weight,
-K_EI. This network uses a sigmoidal function.
\[ G(x) = \frac{2}{\pi} \arctan\left( \frac{x}{a} \right) \qquad (2.12) \]
In the equations below, I_j is the input sent to node x_j.
\[ x_j' = -x_j + G\left( \sum_k W_{jk} \, x_k - K_{EI} \, y_j + I_j \right) \qquad (2.13) \]
\[ y_j' = -y_j + G\left( K_{IE} \, x_j \right) \qquad (2.14) \]
This system (shown in Figure 1.4) produces cyclic output and is trained using a
continuous training rule devised by Hayashi [1994]. This method utilizes
Lagrange multipliers to define the error for a weight. These rules can be
modified for the discrete case by taking the partial derivative of the error with
respect to a weight [Corwin]. The disadvantage of this scheme is that only
weights connected to visible nodes can be modified. In addition, it may be
difficult to train to a series that does not match one of the oscillations that
Hayashi defines.
CHAPTER III
RESULTS
Code was written in C to implement the back propagation method. The
new program includes the cosine components for each sine component. A three-
node harmonic generator was trained to output a cosine function. Three of
these sub neural networks were added to the overall code. Experiments were
performed to compare the prediction results between the code with the cosine
components and the code without them. This involved a square wave, the
Mackey-Glass function, and an ECG sample. The results are examined to determine
if a smaller error can be obtained for the same prediction time or if acquiring
the same error will allow for a longer prediction time.
The results are measured to determine the ease of training the system
and examine the quality of the prediction. Training is improved if it requires
fewer iterations. The quality of prediction will be observed based on
the least mean square error of the sample the network is trained to predict. The
error referred to hereafter includes the error across all training samples.
The first step in this process was to determine what frequencies of sine
waves need to be predicted with the oscillators. Therefore, it was necessary to
take the Fourier transform of the series to determine the frequency component.
This was performed using Matlab. First, a window function was applied to the
29
series because of the property of the Fourier transform. The window function
selected was the Blackman window given by the equation below.
w(n) = 0.42 - 0.5 * cos( ) + 0.08 * cos( - ^ ) N N 3.1
With N set to 1024, this equation creates the windowing series shown in
the figure below. The window function was applied to the input series after
the mean was subtracted to remove any DC component. The fast Fourier transform
was then applied to the resulting series to determine which frequencies the
oscillators should be trained to predict. This process was not performed for
the square wave because its frequency components were already known.
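The procedure described here (subtract the mean, apply the Blackman window, transform, and read off the strongest frequencies) can be sketched in Python as follows. This is not the Matlab code used for the thesis: it uses a slow direct DFT from the standard library instead of an FFT, the function names are hypothetical, and the two-tone test signal is made up for illustration.

```python
import cmath
import math

def blackman(N):
    """Blackman window with denominator N, following Equation 3.1."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * n / N)
            + 0.08 * math.cos(4 * math.pi * n / N) for n in range(N)]

def dft_magnitudes(series):
    """Subtract the mean, apply the window, return |X[k]| for k = 0 .. N/2."""
    N = len(series)
    mean = sum(series) / N
    x = [(s - mean) * w for s, w in zip(series, blackman(N))]
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N)))
            for k in range(N // 2 + 1)]

def strongest_peaks(series, count):
    """Indices of the `count` largest local maxima of the magnitude spectrum."""
    mags = dft_magnitudes(series)
    peaks = [k for k in range(1, len(mags) - 1)
             if mags[k] > mags[k - 1] and mags[k] > mags[k + 1]]
    peaks.sort(key=lambda k: mags[k], reverse=True)
    return peaks[:count]

# Made-up test signal: two sines, 5 and 15 cycles per record.
N = 128
signal = [math.sin(2 * math.pi * 5 * n / N)
          + 0.5 * math.sin(2 * math.pi * 15 * n / N) for n in range(N)]
peaks = strongest_peaks(signal, 2)
```

Peaks are taken as local maxima rather than simply the largest bins because the window spreads each tone over a few adjacent bins.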
Figure 3.1: Window Function
3.1 Square Wave
The square wave was used to train the neural network. This time series is
the sum of an infinite number of sine waves at the odd harmonics. In this
experiment only the first, third, and fifth harmonics are used. The training
series consisted of 100 time points of the step function. The following figure
shows the results of the network when using only the sine oscillators.
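The truncated series can be written out directly. The Python sketch below evaluates the standard Fourier partial sum of a square wave using only the first, third, and fifth harmonics. The unit amplitude (a wave switching between -1 and +1) and the period of 100 points are assumptions for illustration, not values taken from the thesis.

```python
import math

def square_partial_sum(t, period, harmonics=(1, 3, 5)):
    """Partial Fourier series of a +/-1 square wave using odd sine harmonics."""
    return (4 / math.pi) * sum(
        math.sin(2 * math.pi * k * t / period) / k for k in harmonics
    )

period = 100.0  # assumed period, in time points
samples = [square_partial_sum(t, period) for t in range(100)]
quarter = square_partial_sum(period / 4, period)  # middle of the positive half
```

With only three harmonics the sum ripples around the flat top of the wave, so a network built from these oscillators cannot reproduce the corners exactly.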
The outputs for one hundred and one thousand epochs are almost
identical, with error rates of 7.949 and 7.840, respectively. Repeating the
experiment gave 7.139 and 5.178, respectively. The difference is a result of
the random number generator, which was seeded using the time. Both the weights
and the initial y values are set to small random numbers, and the initial
conditions clearly have a large effect on the error. In other words, when the
random number generator is seeded using the time, the network's results may
differ each time the program is run. Ten thousand epochs produced the best
results, with error rates of 1.506 and then 2.324.
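The effect of the seed can be illustrated with a short Python sketch. It is not the C program's initialization: the function name `init_network`, the number of nodes, and the range of the random values are hypothetical. It shows only that a fixed seed reproduces the same starting point, while a time-based seed does not.

```python
import random
import time

def init_network(n_nodes, seed, scale=0.1):
    """Small random weights and initial node outputs drawn from one seeded generator."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_nodes)]
               for _ in range(n_nodes)]
    outputs = [rng.uniform(-scale, scale) for _ in range(n_nodes)]
    return weights, outputs

fixed_a = init_network(4, seed=12345)
fixed_b = init_network(4, seed=12345)            # identical to fixed_a
timed = init_network(4, seed=time.time_ns())     # differs from run to run
```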
[Plot: Square Wave - Sine Only. Curves: Training Series, 100 Epochs, 1,000 Epochs, 10,000 Epochs. Horizontal axis: X Values.]
Figure 3.2: Square Wave Prediction for Sine only network
Next, the net was run with six oscillators, including the cosines. The
results are shown in Figure 3.3. Using 100 training epochs gave error rates of
7.667 and 7.954, while one thousand epochs gave significantly lower
error rates of 2.506 and 4.861. After training for ten thousand epochs the
error was 1.983 and 1.690.
[Plot: Square Wave - Sine & Cosine. Curves: Training Series, 100 Epochs, 1,000 Epochs, 10,000 Epochs. Horizontal axis: X Values.]
Figure 3.3: Square Wave Prediction for Sines and Cosines Network
The figure below directly compares the two networks after one hundred
thousand epochs. Using both cosine and sine oscillators gave an
error rate of 1.333, versus 1.206 for the sine oscillators alone.
However, as time goes on the sine only network is less able
to continue predicting the square wave (Figure 3.4).
[Plot: Square Wave - Comparison. Curves: Training Series, Cosine & Sine, Sine Only. Horizontal axis: X Values.]
Figure 3.4: Square Wave Compare 100,000 Epochs
Table 3.1: Square Wave Prediction Ability

Epochs     Error, Sines Only Network    Error, Sines and Cosines Network
           First run   Second run       First run   Second run
100        7.949       7.130            7.667       7.954
1,000      7.840       5.178            2.506       4.861
10,000     1.506       2.324            1.983       1.690
100,000    1.206       -                1.333       -
The network was then tested to determine whether adding information
about the signal improved training by reducing the number of epochs required.
The network was trained until the error decreased below 1.5. When the network
had both cosine and sine oscillators, 33,192 epochs were required. When only
the sine oscillators were used, 10,385 epochs were required for the error to
decrease below 1.5.
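The stopping rule used in this test, training until the error falls below a goal and recording the epoch count, can be sketched in Python as below. The helper `train_until` and the toy update rule are hypothetical stand-ins. The toy minimizes (w - 3)^2 by gradient descent and has nothing to do with the recurrent network itself.

```python
def train_until(run_epoch, error_goal, max_epochs):
    """Run epochs until the returned error drops below error_goal.

    Returns (epochs_used, error), or (None, error) if max_epochs is reached first.
    """
    error = float("inf")
    for epoch in range(1, max_epochs + 1):
        error = run_epoch()
        if error < error_goal:
            return epoch, error
    return None, error

# Toy stand-in for one training epoch: gradient descent on (w - 3)^2.
state = {"w": 0.0}

def toy_epoch(eta=0.1, target=3.0):
    state["w"] -= eta * 2.0 * (state["w"] - target)
    return (state["w"] - target) ** 2

epochs_used, final_error = train_until(toy_epoch, error_goal=1.5, max_epochs=1000)
```

A cap on the number of epochs is included because a network stuck in a local minimum may never reach the goal.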
[Plot: Square Wave - Comparison. Curves: Training Series, Cosine & Sine, Sine Only. Horizontal axis: X Values.]
Figure 3.5: Square Wave Training Ability
Both runs used the same seed for the random number generator. The net using
only 3 oscillators required 31,322 epochs to train, while the net with 6
oscillators required 81,554 epochs. However, when the graphs are compared, the
network with more oscillators clearly predicts more accurately. This effect
was reproduced several times with differing random number seeds.
[Plot: Square Wave. Curves: Training Series, Cosine & Sine, Sine. Horizontal axis: X Values.]
Figure 3.6: Square Wave Training Ability
3.2 Mackey-Glass Function
Next, the network was trained to the Mackey-Glass function described
earlier. In this training session 200 points of the signal were used (Gomez Gil
[1998] trained in segments). It is necessary to scale or shift the input data
because the activation function requires a signal between negative one and
positive one. In this paper the series were shifted; however, it may
be useful to scale the time series using the following function.
Y_{NEW} = \frac{Y_{OLD} - Y_{MIN}}{Y_{MAX} - Y_{MIN}} \qquad (3.2)
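Equation 3.2 can be sketched in Python as follows. As written, the equation maps the series onto the range 0 to 1. The second helper, which stretches that result to the range -1 to +1, is an assumption added here for illustration and is not part of the thesis. Both function names are hypothetical.

```python
def min_max_scale(series):
    """Equation 3.2: (y - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be scaled")
    return [(y - lo) / (hi - lo) for y in series]

def scale_symmetric(series):
    """Assumed extension: stretch the [0, 1] result to [-1, 1]."""
    return [2.0 * v - 1.0 for v in min_max_scale(series)]

unit_range = min_max_scale([2.0, 4.0, 6.0])
symmetric_range = scale_symmetric([2.0, 4.0, 6.0])
```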
In this example, both cosine and sine components were used, for a total
of six trained oscillators. The network had 31 nodes in total: 18 were part
of the oscillators and one was the output node. The remaining 12 were hidden
nodes without a specified output. For 2.out (second graph) the program was run
until the error was below 1.408. For 11.out (first graph) the program was run
until the error was below 1. Both started with an eta of 0.01. 2.out took
8,000 epochs and 11.out took 40,586 epochs. The third figure focuses on the
training portion of the signal and compares the different numbers of epochs run.
Figure 3.7: Mackey-Glass Prediction Results
Figure 3.8: Mackey-Glass Prediction
Figure 3.9: Compare Prediction Results
The Mackey-Glass function was also used to train a 31-node network with
only 3 oscillators, the sine components. This architecture therefore had 9
nodes in the oscillators, 1 output node, and 21 hidden nodes. Omitting the
three cosine oscillators clearly results in worse prediction. Starting with a
learning rate (eta) of 0.01, as in the experiment above, this
program was run for 100,000 epochs and achieved an error rate of 3.171.
As shown in Figure 3.10, this experiment did not come close to predicting the
Mackey-Glass function.
Thirty nodes were chosen because this gave the best results without
exceeding memory limitations. It should be noted that increasing the number of
neurons will decrease the program's performance. However, any extra nodes
that are not needed can have their influence reduced by decreasing the
interconnecting weight values. This may lead to over-fitting of the network.
Figure 3.10: Compare Mackey-Glass Prediction
3.3 ECG Signal
The final experiments involved the ECG signal. This data was obtained
from an MIT-BIH database. A normal ECG file was chosen and then scaled as
before. A window function was applied before running the Fourier transform to
determine the required frequencies. Six separate three-node nets were trained
to the appropriate frequencies. The network architecture is the same as used
previously. The initial predictions were unable to reach the high points of the
ECG.
It was decided to train a feed forward network first and use its
weights to initialize the recurrent network, as suggested by Dr. Oldham.
This allowed the recurrent network to have weights initialized to non-random
numbers; the only random numbers used were those for the initial node
outputs and the recurrent weights. Another improvement to the program was to
include a bias for every node.
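One way to picture this initialization is the Python sketch below. The data layout is entirely hypothetical: it assumes the trained feed forward weights are available as a mapping from (source, target) node pairs to values, treats every other connection as recurrent, and starts each bias at zero. None of these details are stated in the thesis.

```python
import random

def init_recurrent_weights(n_nodes, feedforward_weights, seed, scale=0.1):
    """Build an n_nodes x n_nodes weight matrix for a recurrent network.

    feedforward_weights maps (source, target) index pairs to weights from a
    previously trained feed forward network. Every other connection is
    treated as recurrent and receives a small random value.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_nodes)]
               for _ in range(n_nodes)]
    for (source, target), value in feedforward_weights.items():
        weights[source][target] = value
    biases = [0.0] * n_nodes  # one bias per node; starting at zero is an assumption
    return weights, biases

trained = {(0, 2): 0.7, (1, 2): -0.4}  # made-up values for illustration
weights, biases = init_recurrent_weights(3, trained, seed=1)
```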
The first experiments trained the network for one thousand epochs.
When eta was small, the sines only network achieved a slightly lower error rate.
For example, when eta was 0.0001 the sines network had an error rate of
1.1109 compared to 1.1705 for the network with cosines. The outputs are
virtually identical in the first graph below. However, when larger eta values
were used, the 6 oscillator network had a slightly smaller final error rate.
When eta was 0.001, the first network had a rate of 1.1103 and the second
1.1034 (second graph below). Increasing the learning rate to 0.01 gave 1.2071
and 1.0776, respectively (third graph below). The final experiment run with
one thousand epochs, with eta set to 0.1, gave 1.2078 and 1.0776 (fourth graph
below). The results from the network with cosines and sines are compared
across the different learning rates in the fifth graph below. The worst result
was when eta was set to 0.1.
[Plot: ECG Signal]
Figure 3.12: Eta 1/1000
[Plot: ECG Signal]
Figure 3.13: 100 Epochs, Eta 1/100
Figure 3.14: Eta 1/10
Figure 3.15: Comparison of Learning Rates - Cosines and Sines network
Table 3.2: ECG Trained 1,000 Epochs

Eta       Error, Sines Only Network    Error, Sines and Cosines Network
0.0001    1.1109                       1.1705
0.001     1.1103                       1.1034
0.01      1.2071                       1.0776
0.1       1.2078                       1.0776
The networks were then trained for ten thousand epochs. The results
were similar to those above, with an initial eta of one tenth resulting in the
worst prediction. The other learning rates had similar outputs, as seen in the
graph below. Comparing one thousand to ten thousand epochs gave the expected
result: more epochs produced a signal closer to the original, as shown in the
next graph.
Figure 3.16: 10,000 Epochs
Experiments were implemented to test the training ability of the network.
The eta (learning rate) was set to 1/100000 and the same network architecture
as before was used. Initially the goal of the network was to reach an error rate
(across all training time samples) of less than 1.5. The sines only network took
3,696 epochs while the cosines and sines network took 3,736 epochs to reach
this level. As shown in the figure below, the prediction results were virtually
identical for both networks.
45
0 3
06
04
0 2 '
-0 2
-0 4
0Uto2 ;,
EC'lSIGII-L
iM ,<'' ft
lAijfJ / *
J •?^f
•r r
100 150 250 300 350 400 450
Figure 3.17: Error Goal < 1.5
Next, the network was trained until an error of 1.3 was reached; the first
network took 4,168 epochs versus 4,831 for the second network. The goal
error rate was then set at 1.2 across the training sample. The sines only
network required 7,804 epochs while the second network took 6,102 epochs. When
the objective was an error of 1.1, the first network took 10,457 epochs and the
second network took 257,372 epochs.
46
0 6 • ECO ilGII-L
0 2
-0 2
-0 4
ri .A, • V A m "' n 4^^/f^
_^ =0 100 150 200 250 300 350 400 450 500
Figure 3.18: Error Goal < 1.3
Figure 3.19: Error Goal < 1.2
Table 3.3: ECG Test Training Ability

Error Goal    Number of Epochs
              Sines Only Network    Sines and Cosines Network
1.5           3,696                 3,756
1.3           4,168                 4,831
1.2           7,804                 6,102
1.1           10,457                257,372
CHAPTER IV
CONCLUSION
The anticipated benefit resulting from this work is to increase the ability of
the recurrent neural network to correctly predict future catastrophic events of
deterministic systems. Although a point by point prediction may not be
accurate, the prediction of a catastrophic event such as a heart attack may be
useful in the analysis of an EEG/EKG. Additionally, this method may be applied
to other time series.
All of the experiments demonstrated the dependence of the network on
the initial weights and output values. The network would have differing error
values and prediction abilities depending on the seed given to the random
generator. At times the network appeared to be unable to achieve a decreasing
error rate. It is likely that in these cases the net had reached a local minimum
and was unable to leave it because of the low learning
rate. A solution would be to increase the learning rate after a certain number of
epochs with unchanging errors. This sensitivity of the neural network makes it
necessary to use the same seed for the random number generator in order to
compare results.
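The suggested remedy, raising the learning rate once the error has stopped changing, was not implemented in the thesis. A minimal Python sketch of one possible rule is given below. The function name and every parameter (patience, tolerance, growth factor, and upper limit on eta) are hypothetical choices, not values from this work.

```python
def adapt_eta(eta, error_history, patience=50, tolerance=1e-6,
              factor=2.0, eta_max=0.1):
    """Raise eta when the last `patience` epochs changed the error by less than `tolerance`."""
    if len(error_history) <= patience:
        return eta  # not enough history to judge a plateau
    recent = error_history[-(patience + 1):]
    if max(recent) - min(recent) < tolerance:
        return min(eta * factor, eta_max)
    return eta

stalled = adapt_eta(0.01, [1.0] * 10, patience=5)
improving = adapt_eta(0.01, [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4], patience=5)
```

The upper limit is included because the ECG experiments above gave their worst results at the largest eta tested.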
The quality of prediction for the square waves appeared to be slightly
better for the smaller network, which included only the sine components.
However, this network was clearly inferior to the larger network when all of
the predicted output was considered. The network with cosine and sine
oscillators was able to recreate the step function repeatedly, and did so
better than the net without the extra oscillators. Therefore the prediction
quality was improved for the network which included the cosine oscillators.
The same effect was noticed when testing the ease of training. The
network with only sines was able to train using less than half of the epochs
required for the network with cosines. However, the second network was better
able to recreate the training series in the future. This may be a result
of a couple of effects. The first network had more weights which could be
modified, and this would aid in faster training. The second network had 9 more
of its weights fixed, which would require the other weights to make up for this
inflexibility by greater changes. Therefore the second network would require
more training. Another possibility is that the first network quickly found a
local minimum but was unable to locate a global minimum that was found by the
second network.
When the Mackey-Glass function was used, it became apparent that in
this case using both the sine and cosine oscillators allowed for easier training
and better prediction of future values. It may be the case
that the benefits of including extra cosine oscillators are extremely dependent
on the particular time series to be modeled. In the case of the Mackey-Glass
function, inclusion of the extra three oscillators was beneficial.
When the ECG was used for training the network, different results were
obtained for each experiment run. For training ability the network was run until
separate goal error rates were reached. For goal errors of 1.5, 1.3, and 1.1,
the sines only network required fewer training epochs. However, for a goal
error of 1.2, the sines network required a greater number of epochs.
When the prediction quality was tested, the error rate differed depending
on the initial learning rate. For the smallest eta tested, 1/10000, the sines
network obtained a lower error rate. For all greater eta values, the sines
network had a higher error rate. Including the cosines gave contradictory
results for ease of training; some examples took many more epochs and others
required fewer epochs. When the extra oscillators are included, their weights
cannot be modified, which may decrease the network's ability to be trained to
the particular signal. When the graphs are examined it is clear that both the
sines only network and the cosines and sines network can predict even the high
frequency components of the ECG signal, and the differences between error
rates were small.
The benefit of adding cosine frequencies by including three extra
oscillators in the overall network was highly dependent on the particular
signal being examined. The greatest increase in prediction quality was obtained
when the feed forward network with biases was trained first, followed by a
recurrent network. This final method seemed to be more affected by initial
conditions than by the particular oscillators included in the system. Including
three extra oscillators in the case of the ECG signal did not appear to have a
significant effect on the prediction quality. The effects on training
efficiency were ambiguous, perhaps due to the particular error curve that the
network was following for this example.
REFERENCES
Abdi, H. "A neural network primer." Journal of Biological Systems. 2(3), 1994.
Brockwell, Peter J. and Davis, Richard A. Time Series: Theory and Methods. New York: Springer-Verlag Inc., 1987.
Galka, Andreas. Topics in Nonlinear Time Series Analysis. World Scientific Pub Co Inc., 2000.
Glass, Leon and Mackey, Michael C. From Clocks to Chaos. Princeton, N.J.: Princeton University Press, 1988.
Gomez, Gil and Oldham, W.J.B. Recurrent Neural Networks as a Tool for Modeling and Prediction of Electrocardiograms. International Conference on Information Systems, Analysis, and Synthesis (4), 1998.
Gomez, Maria Del Pilar. "The Effect of Non-Linear Dynamic Invariants in Recurrent Neural Networks for Prediction of Electrocardiograms." Ph.D. Dissertation, Texas Tech University, Lubbock, TX, 1998.
Graupe, Daniel. Principles of Artificial Neural Networks. New Jersey: World Scientific Publishing Co. Pte. Ltd., 1997.
Gurney, Kevin. An Introduction to Neural Networks. London: UCL Press, 1997.
Hayashi, Yukio. "Oscillatory Neural Network and Learning of Continuously Transformed Patterns." Neural Networks. 7(2), pp. 219-231, 1994.
Haykin, Simon. Neural Networks: A Comprehensive Foundation. New York: Macmillan College Publishing Company, Inc., 1994.
Hebb, D. The Organization of Behavior. New York: Wiley, 1949.
Krose, Ben and Smagt, Patrick van der. An Introduction to Neural Networks. (8) November 1996.
Logar, Antonette. "Recurrent Neural Networks and Time Series Prediction." Ph.D. Dissertation, Texas Tech University, Lubbock, TX, 1992.
Pearlmutter, B.A. "Dynamic recurrent neural networks." Technical Report CMU-CS-90-196. 1990.
Pearlmutter, B.A. "Gradient calculations for dynamic recurrent neural networks: a survey." IEEE Trans. on Neural Networks. 6, pp. 1212-1228, 1995.
Petridis, Vassilios and Athanasios Kehagias. Predictive Modular Neural Networks: Applications to Time Series. Kluwer Academic Publishers, Norwell, MA, 1998.
J.C. Principe and J-M Kuo. "Dynamic modeling of chaotic time series with neural networks." In G. Tesauro, D. Touretzky, and T. Leen, editors. Advances in Neural Information Processing Systems. 7, pp. 311-318. MIT Press, 1995.
Terrence J. Sejnowski and Charles R. Rosenberg. "Parallel networks that learn to pronounce English text." Complex Systems. 1987.
Tong, Howell. Non-linear Time Series: A Dynamical System Approach. Oxford, New York: Oxford University Press, Inc., 1990.
PERMISSION TO COPY
In presenting this thesis in partial fulfillment of the requirements for a master's
degree at Texas Tech University or Texas Tech University Health Sciences Center, I
agree that the Library and my major department shall make it freely available for
research purposes. Permission to copy this thesis for scholarly purposes may be
granted by the Director of the Library or my major professor. It is understood that any
copying or publication of this thesis for financial gain shall not be allowed without my
further written permission and that any user may be liable for copyright infringement.
Agree (Permission is granted.)
Student Signature Date
Disagree (Permission is not granted.)
Student Signature Date