
Noisy Time Series Prediction using a Recurrent Neural Network and Grammatical Inference

C. Lee Giles, Steve Lawrence, Ah Chung Tsoi

NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
Faculty of Informatics, University of Wollongong, NSW 2522, Australia

Abstract

Financial forecasting is an example of a signal processing problem which is challenging due to small sample sizes, high noise, non-stationarity, and non-linearity. Neural networks have been very successful in a number of signal processing applications. We discuss fundamental limitations and inherent difficulties when using neural networks for the processing of high noise, small sample size signals. We introduce a new intelligent signal processing method which addresses the difficulties. The method proposed uses conversion into a symbolic representation with a self-organizing map, and grammatical inference with recurrent neural networks. We apply the method to the prediction of daily foreign exchange rates, addressing difficulties with non-stationarity, overfitting, and unequal a priori class probabilities, and we find significant predictability in comprehensive experiments covering 5 different foreign exchange rates. The method correctly predicts the direction of change for the next day with an error rate of 47.1%. The error rate reduces to around 40% when rejecting examples where the system has low confidence in its prediction. We show that the symbolic representation aids the extraction of symbolic knowledge from the trained recurrent neural networks in the form of deterministic finite state automata. These automata explain the operation of the system and are often relatively simple. Automata rules related to well known behavior such as trend following and mean reversal are extracted.

(C. Lee Giles is also with the Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742.)

1 Introduction

1.1 Predicting Noisy Time Series Data

The prediction of future events from noisy time series data is commonly done using various forms of statistical models [24]. Typical neural network models are closely related to statistical models, and estimate Bayesian a posteriori probabilities when given an appropriately formulated problem [47]. Neural networks have been very successful in a number of pattern recognition applications. For noisy time series prediction, neural networks typically take a delay embedding of previous inputs (for example, x(t), x(t-1), ..., x(t-d+1) form the inputs for a delay embedding of the previous d values of a series [62, 52]) which is mapped into a prediction. For examples of these models and the use of neural networks and other machine learning methods applied to finance see [2, 1, 46]. However, high noise, high non-stationarity time series prediction is fundamentally difficult for these models:

1. The problem of learning from examples is fundamentally ill-posed, i.e. there are infinitely many models which fit the training data well, but few of these generalize well. In order to form a more accurate model, it is desirable to use as large a training set as possible. However, for the case of highly non-stationary data, increasing the size of the training set results in more data with statistics that are less relevant to the task at hand being used in the creation of the model.

2. The high noise and small datasets make the models prone to overfitting. Random correlations between the inputs and outputs can present great difficulty. The models typically do not explicitly address the temporal relationship of the inputs; e.g. they do not distinguish between those correlations which occur in temporal order, and those which do not.


In our approach we consider recurrent neural networks (RNNs). Recurrent neural networks employ feedback connections and have the potential to represent certain computational structures in a more parsimonious fashion [13]. RNNs address the temporal relationship of their inputs by maintaining an internal state. RNNs are biased towards learning patterns which occur in temporal order, i.e. they are less prone to learning random correlations which do not occur in temporal order.

However, training RNNs tends to be difficult with high noise data, with a tendency for long-term dependencies to be neglected (experiments reported in [9] found a tendency for recurrent networks to take into account short-term dependencies but not long-term dependencies), and for the network to fall into a naive solution such as always predicting the most common output. We address this problem by converting the time series into a sequence of symbols using a self-organizing map (SOM). The problem then becomes one of grammatical inference: the prediction of a given quantity from a sequence of symbols. It is then possible to take advantage of the known capabilities of recurrent neural networks to learn grammars [22, 63] in order to capture any predictability in the evolution of the series.

The use of a recurrent neural network is significant for two reasons: firstly, the temporal relationship of the series is explicitly modeled via internal states, and secondly, it is possible to extract rules from the trained recurrent networks in the form of deterministic finite state automata [44] (rules can also be extracted from feedforward networks [25, 41, 56, 4, 49, 29] and other types of recurrent neural networks [20]; however, the recurrent network approach and deterministic finite state automata extraction seem particularly suitable for a time series problem). The symbolic conversion is significant for a number of reasons: the quantization effectively filters the data or reduces the noise, the RNN training becomes more effective, and the symbolic input facilitates the extraction of rules from the trained networks. Furthermore, it can be argued that the instantiation of rules from predictors not only gives the user confidence in the prediction model but facilitates the use of the model and can lead to a deeper understanding of the process [54].

Financial time series data typically contains high noise and significant non-stationarity. In this paper, the noisy time series prediction problem considered is the prediction of foreign exchange rates. A brief overview of foreign exchange rates is presented in the next section.


1.2 Foreign Exchange Rates

The foreign exchange market as of 1997 is the world's largest market, with more than $US 1.3 trillion changing hands every day. The most important currencies in the market are the US dollar (which acts as a reference currency), the Japanese Yen, the British Pound, the German Mark, and the Swiss Franc [37]. Foreign exchange rates exhibit very high noise, and significant non-stationarity. Many financial institutions evaluate prediction algorithms using the percentage of times that the algorithm predicts the right trend from today until some time in the future [37]. Hence, this paper considers the prediction of the direction of change in foreign exchange rates for the next business day.

The remainder of this paper is organized as follows: section 2 describes the dataset we have used, section 3 discusses fundamental issues such as ill-posed problems, the curse of dimensionality, and vector space assumptions, and section 4 discusses the efficient market hypothesis and prediction models. Section 5 details the system we have used for prediction. Comprehensive test results are contained in section 6, and section 7 considers the extraction of symbolic data. Section 8 provides concluding remarks, and Appendix A provides full simulation details.

2 Exchange Rate Data

The data we have used is publicly available online and was first used by Weigend et al. [61]. The data consists of daily closing bids for five currencies (German Mark (DM), Japanese Yen, Swiss Franc, British Pound, and Canadian Dollar) with respect to the US Dollar and is from the Monetary Yearbook of the Chicago Mercantile Exchange. There are 3645 data points for each exchange rate covering the period September 3, 1973 to May 18, 1987. In contrast to Weigend et al., this work considers the prediction of all five exchange rates in the data and prediction for all days of the week instead of just Mondays.


3 Fundamental Issues

This section briefly discusses fundamental issues related to learning by example using techniques such as multi-layer perceptrons (MLPs): ill-posed problems, the curse of dimensionality, and vector space assumptions. These help form the motivation for our new method.

The problem of inferring an underlying probability distribution from a finite set of data is fundamentally an ill-posed problem because there are infinitely many solutions [30], i.e. there are infinitely many functions which fit the training data exactly but differ in other regions of the input space. The problem becomes well-posed only when additional constraints are used. For example, the constraint that a limited size MLP will be used might be imposed. In this case, there may be a unique solution which best fits the data from the range of solutions which the model can represent. The implicit assumption behind this constraint is that the underlying target function is "smooth". Smoothness constraints are commonly used by learning algorithms [17]. Without smoothness or some other constraint, there is no reason to expect a model to perform well on unseen inputs (i.e. generalize well).

Learning by example often operates in a vector space, i.e. a function mapping R^n to R^m is approximated. However, a problem with vector spaces is that the representation does not reflect any additional structure within the data. Measurements that are related (e.g. from neighboring pixels in an image or neighboring time steps in a time series) and measurements that do not have such a relationship are treated equally. These relationships can only be inferred in a statistical manner from the training examples.

The use of a recurrent neural network instead of an MLP with a window of time delayed inputs introduces another assumption by explicitly addressing the temporal relationship of the inputs via the maintenance of an internal state. This can be seen as similar to the use of hints [3] and should help to make the problem less ill-posed.

The curse of dimensionality refers to the exponential growth of hypervolume as a function of dimensionality [8]. Consider x in R^n. The regression, y = f(x), is a hypersurface in R^{n+1}. If f(x) is arbitrarily complex and unknown then dense samples are required to approximate the function accurately. However, it is hard to obtain dense samples in high dimensions (see the note on Kolmogorov's theorem at the end of this section).

This is the "curse of dimensionality" [8, 17, 18]. The relationship between the sampling density and the number of points required is proportional to N^{1/d} [18], where d is the dimensionality of the input space and N is the number of points. Thus, if N_1 is the number of points for a given sampling density in 1 dimension, then in order to keep the same density as the dimensionality is increased, the number of points must increase according to N_1^d.

It has been suggested that MLPs do not suffer from the curse of dimensionality [16, 7]. However, this is not true in general (although MLPs may cope better than other models). The apparent avoidance of the curse of dimensionality in [7] is due to the fact that the function spaces considered are more and more constrained as the input dimension increases [34]. Similarly, smoothness conditions must be satisfied for the results of [16].

The use of a recurrent neural network is important from the viewpoint of the curse of dimensionality because the RNN can take into account greater history of the input. Trying to take into account greater history with an MLP by increasing the number of delayed inputs results in an increase in the input dimension. This is problematic, given the small number of data points available. The use of a self-organizing map for the symbolic conversion is also important from the viewpoint of the curse of dimensionality. As will be seen later, the topographical order of the SOM allows encoding a one dimensional SOM into a single input for the RNN.

Note (Kolmogorov's theorem): Kolmogorov's theorem shows that any continuous function of n dimensions can be completely characterized by a one dimensional continuous function. Specifically, Kolmogorov's theorem [33, 34, 35, 18] states that for any continuous function:

    f(x_1, x_2, \ldots, x_n) = \sum_{j=1}^{2n+1} g\left( \sum_{i=1}^{n} \lambda_i \phi_j(x_i) \right)    (1)

where the \lambda_i are universal constants that do not depend on f, the \phi_j are universal transformations which do not depend on f, and g(\cdot) is a continuous, one-dimensional function which totally characterizes f(x_1, x_2, \ldots, x_n) (g is typically highly non-smooth). In other words, for any continuous function of n arguments, there is a one dimensional continuous function that completely characterizes the original function. As such, it can be seen that the problem is not so much the dimensionality, but the complexity of the function [18], i.e. the curse of dimensionality essentially says that in high dimensions, as fewer data points are available, the target function has to be simpler in order to learn it accurately from the given data.


4 The Efficient Market Hypothesis

The Efficient Market Hypothesis (EMH) was developed in 1965 by E. Fama [14, 15] and found broad acceptance in the financial community [39, 58, 5]. The EMH, in its weak form, asserts that the price of an asset reflects all of the information that can be obtained from past prices of the asset, i.e. the movement of the price is unpredictable. The best prediction for a price is the current price and the actual prices follow what is called a random walk. The EMH is based on the assumption that all news is promptly incorporated in prices; since news is unpredictable (by definition), prices are unpredictable. Much effort has been expended trying to prove or disprove the EMH. Current opinion is that the theory has been disproved [53, 28], and much evidence suggests that the capital markets are not efficient [40, 38].

If the EMH was true, then a financial series could be modeled as the addition of a noise component at each step:

    y(t+1) = y(t) + \epsilon(t)    (2)

where \epsilon(t) is a zero mean Gaussian variable with variance \sigma. The best estimation is:

    \hat{y}(t+1) = y(t)    (3)

In other words, if the series is truly a random walk, then the best estimate for the next time period is equal to the current estimate. Now, if it is assumed that there is a predictable component of the series then it is possible to use:

    y(t+1) = y(t) + f(y(t), y(t-1), \ldots, y(t-n+1)) + \epsilon(t)    (4)

where \epsilon(t) is a zero mean Gaussian variable with variance \sigma, and f(\cdot) is a non-linear function in its arguments. In this case, the best estimate is given by:

    \hat{y}(t+1) = y(t) + f(y(t), y(t-1), \ldots, y(t-n+1))    (5)

Prediction using this model is problematic as the series often contains a trend. For example, a neural network trained on section A in figure 1 has little chance of generalizing to the test data in section B, because the model was not trained with data in the range covered by that section.


Figure 1. An example of the difficulty in using a model trained on a raw price series. Generalization to section B from the training data in section A is difficult because the model has not been trained with inputs in the range covered by section B.

A common solution to this is to use a model which is based on the first order differences (equation 7) instead of the raw time series [24]:

    \delta(t+1) = f(\delta(t), \delta(t-1), \ldots, \delta(t-n+1)) + \nu(t)    (6)

where

    \delta(t+1) \triangleq y(t+1) - y(t)    (7)

and \nu(t) is a zero mean Gaussian variable with variance \sigma. In this case the best estimate is:

    \hat{\delta}(t+1) = f(\delta(t), \delta(t-1), \ldots, \delta(t-n+1))    (8)

5 System Details

A high-level block diagram of the system used is shown in figure 2. The raw time series values are y(t), t = 1, 2, \ldots, N, where y(t) is real valued. These denote the daily closing bids of the financial time series. The first difference of the series, y(t), is taken as follows:

    \delta(t) = y(t) - y(t-1)    (9)


Figure 2. A high-level block diagram of the system used.

This produces \delta(t), t = 1, 2, \ldots, N-1. In order to compress the dynamic range of the series and reduce the effect of outliers, a log transformation of the data is used:

    X(t) = \mathrm{sign}(\delta(t)) \, \log(|\delta(t)| + 1)    (10)

resulting in X(t), t = 1, 2, \ldots, N-1. As is usual, a delay embedding of this series is then considered [62]:

    X_e(t, d_1) = (X(t), X(t-1), X(t-2), \ldots, X(t - d_1 + 1))    (11)

where d_1 is the delay embedding dimension and is equal to 1 or 2 for the experiments reported here. X_e(t, d_1) is a state vector. This delay embedding forms the input to the SOM. Hence, the SOM input is a vector of the last d_1 values of the log transformed differenced time series. The output of the SOM is the topographical location of the winning node. Each node represents one symbol in the resulting grammatical inference problem. A brief description of the self-organizing map is contained in the next section.
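For concreteness, the preprocessing of equations (9)-(11) can be sketched as follows (a minimal NumPy illustration, not the authors' code; the function name, default d_1, and array layout are our own):

```python
import numpy as np

def preprocess(y, d1=2):
    """Sketch of equations (9)-(11): first difference, signed log
    compression, and delay embedding. y is a 1-D array of daily closing bids."""
    delta = y[1:] - y[:-1]                                # eq (9): first difference
    X = np.sign(delta) * np.log(np.abs(delta) + 1.0)      # eq (10): compress dynamic range
    # eq (11): one state vector (X(t), X(t-1), ..., X(t-d1+1)) per time step
    Xe = np.array([X[t - d1 + 1 : t + 1][::-1] for t in range(d1 - 1, len(X))])
    return Xe
```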

The SOM can be represented by the following equation:

    S(t) = g(X_e(t, d_1))    (12)

where S(t) \in \{1, 2, 3, \ldots, n_s\}, and n_s is the number of symbols (nodes) for the SOM. Each node in the SOM has been assigned an integer index ranging from 1 to the number of nodes.

An Elman recurrent neural network is then used (Elman refers to the topology of the network; full backpropagation through time was used as opposed to the truncated version used by Elman), which is trained on the sequence of outputs from the SOM. The Elman network was chosen because it is suitable for the problem (a grammatical inference style problem) [13], and because it has been shown to perform well in comparison to other recurrent architectures (e.g. see [36]). The Elman neural network has feedback from each of the hidden nodes to all of the hidden nodes, as shown in figure 3. For the Elman network:

    O(t+1) = C^T h_t + c_b    (13)

and

    h_t = F_{n_h}(A h_{t-1} + B u_t + b)    (14)

where C is an n_h \times n_o matrix representing the weights from the hidden layer to the output nodes, n_h is the number of hidden nodes, n_o is the number of output nodes, c_b is a constant bias vector, and h_t, h_t \in R^{n_h}, is an n_h \times 1 vector denoting the outputs of the hidden layer neurons. u_t is a d_e \times 1 vector as follows, where d_e is the embedding dimension used for the recurrent neural network:

    u_t = [S(t), S(t-1), S(t-2), \ldots, S(t - d_e + 1)]^T    (15)

A and B are matrices of appropriate dimensions which represent the feedback weights from the hidden nodes to the hidden nodes and the weights from the input layer to the hidden layer respectively. F_{n_h} is an n_h \times 1 vector containing the sigmoid functions. b is an n_h \times 1 vector denoting the bias of each hidden layer neuron. O(t) is an n_o \times 1 vector containing the outputs of the network. n_o is 2 throughout this paper. The first output is trained to predict the probability of a positive change, the second output is trained to predict the probability of a negative change.

Thus, for the complete system:

    O(t+1) = F_1(\delta(t), \delta(t-1), \delta(t-2), \delta(t-3), \delta(t-4))    (16)

which can be considered in terms of the original series:

    O(t+1) = F_2(y(t), y(t-1), y(t-2), y(t-3), y(t-4), y(t-5))    (17)


Figure 3. The pre-processed, delay embedded series is converted into symbols using a self-organizing map. An Elman neural network is trained on the sequence of symbols.


Note that this is not a static mapping due to the dependence on the hidden state of the recurrent neural network.
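As an illustration of equations (13)-(15), the following minimal NumPy sketch (our own illustrative code, not the authors' implementation; matrix names follow the equations above) computes one forward step of the Elman network on a window of d_e symbols:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_step(S_window, h_prev, A, B, C, b, c_b):
    """One forward step of the Elman network, following eqs (13)-(15).
    S_window: the last d_e SOM symbols [S(t), S(t-1), ..., S(t-d_e+1)]
    h_prev:   previous hidden state h_{t-1}, shape (n_h,)
    A: (n_h, n_h) hidden-to-hidden feedback weights
    B: (n_h, d_e) input-to-hidden weights
    C: (n_h, n_o) hidden-to-output weights
    """
    u_t = np.asarray(S_window, dtype=float)   # eq (15): input vector of recent symbols
    h_t = sigmoid(A @ h_prev + B @ u_t + b)   # eq (14): sigmoidal hidden-state update
    O = C.T @ h_t + c_b                       # eq (13): two outputs (positive/negative change)
    return O, h_t
```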

The following sections describe the self-organizing map, recurrent network grammatical inference, dealing with non-stationarity, controlling overfitting, a priori class probabilities, and a method for comparison with a random walk.

5.1 The Self-Organizing Map

5.1.1 Introduction

The self-organizing map, or SOM [32], is an unsupervised learning process which learns the distribution of a set of patterns without any class information. A pattern is projected from a possibly high dimensional input space V to a position in the map, a low dimensional display space D. The display space D is often divided into a grid and each intersection of the grid is represented in the network by a neuron. The information is encoded as the location of an activated neuron. The SOM is unlike most classification or clustering techniques in that it attempts to preserve the topological ordering of the classes in the input space V in the resulting display space D. In other words, for similarity as measured using a metric in the input space V, the SOM attempts to preserve the similarity in the display space D.

5.1.2 Algorithm

We give a brief description of the SOM algorithm; for more details see [32]. The SOM defines a mapping from an input space R^n onto a topologically ordered set of nodes, usually in a lower dimensional space D. A reference vector, m_i = [\mu_{i1}, \mu_{i2}, \ldots, \mu_{in}]^T \in R^n, is assigned to each node in the SOM. Assume that there are M nodes. During training, each input, x \in R^n, is compared to the m_i, i = 1, 2, \ldots, M, obtaining the location of the closest match (\|x - m_c\| = \min_i \|x - m_i\|). The input point is mapped to this location in the SOM. Nodes in the SOM are updated according to:

    m_i(t+1) = m_i(t) + h_{ci}(t) [x(t) - m_i(t)]    (18)


where t is the time during learning and h_{ci}(t) is the neighborhood function, a smoothing kernel which is maximum at m_c. Usually, h_{ci}(t) = h(\|r_c - r_i\|, t), where r_c and r_i represent the locations of nodes in the SOM output space D. r_c is the node with the closest weight vector to the input sample and r_i ranges over all nodes. h_{ci}(t) approaches 0 as \|r_c - r_i\| increases and also as t approaches infinity. A widely applied neighborhood function is:

    h_{ci} = \alpha(t) \exp\left( - \frac{\|r_c - r_i\|^2}{2 \sigma^2(t)} \right)    (19)

where \alpha(t) is a scalar valued learning rate and \sigma(t) defines the width of the kernel. They are generally both monotonically decreasing with time [32]. The use of the neighborhood function means that nodes which are topographically close in the SOM structure are moved towards the input pattern along with the winning node. This creates a smoothing effect which leads to a global ordering of the map. Note that \sigma(t) should not be reduced too far as the map will lose its topographical order if neighboring nodes are not updated along with the closest node. The SOM can be considered a non-linear projection of the probability density p(x) of the input patterns x onto a (typically) smaller output space [32].

5.1.3 Remarks

The nodes in the display space D encode the information contained in the input space R^n. Since there are M nodes in D, this implies that the input pattern vectors x \in R^n are transformed to a set of M symbols, while preserving their original topological ordering in R^n. Thus, if the original input patterns are highly noisy, the quantization into the set of M symbols while preserving the original topological ordering can be understood as a form of filtering. The amount of filtering is controlled by M. If M is large, this implies there is little reduction in the noise content of the resulting symbols. On the other hand, if M is small, this implies that there is a "heavy" filtering effect, resulting in only a small number of symbols.
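As a concrete illustration of equations (18) and (19) and of the quantization into symbols, the sketch below (our own minimal NumPy code, not the authors' implementation; parameter values and schedules are placeholders) trains a one-dimensional SOM on the delay-embedded series and returns the symbol sequence, i.e. the index of the winning node for each input:

```python
import numpy as np

def train_som(Xe, n_s=7, n_iter=10000, alpha0=0.5, sigma0=2.0):
    """1-D SOM: Xe has shape (T, d1); returns reference vectors and symbols 1..n_s."""
    rng = np.random.default_rng(0)
    m = rng.normal(size=(n_s, Xe.shape[1]))           # reference vectors m_i
    r = np.arange(n_s, dtype=float)                   # node locations on the 1-D map
    for it in range(n_iter):
        x = Xe[rng.integers(len(Xe))]
        c = np.argmin(np.linalg.norm(x - m, axis=1))  # winner: ||x - m_c|| = min_i ||x - m_i||
        frac = 1.0 - it / n_iter
        alpha, sigma = alpha0 * frac, max(sigma0 * frac, 0.5)         # decreasing schedules
        h = alpha * np.exp(-(r - r[c]) ** 2 / (2.0 * sigma ** 2))     # eq (19)
        m += h[:, None] * (x - m)                                     # eq (18)
    symbols = 1 + np.argmin(np.linalg.norm(Xe[:, None, :] - m[None, :, :], axis=2), axis=1)
    return m, symbols
```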

5.2 Recurrent Network Prediction

In the past few years several recurrent neural network architectures have emerged which have been used for grammatical inference [10, 23, 21]. The induction of relatively simple grammars has been addressed often, e.g. [59, 60, 21] on learning Tomita grammars [55]. It has been shown that a particular class of recurrent networks are Turing equivalent [50].

The set of M symbols from the output of the SOM are linearly encoded into a single input for the Elman network (e.g. if M = 3, the single input is either -1, 0, or 1). The linear encoding is important and is justified by the topographical order of the symbols. If no order was defined over the symbols then a separate input should be used for each symbol, resulting in a system with more dimensions and increased difficulty due to the curse of dimensionality. In order to facilitate the training of the recurrent network, an input window of 3 was used, i.e. the last three symbols are presented to 3 input neurons of the recurrent neural network.
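The linear encoding of the ordered SOM symbols into a single input can be sketched as follows (our own illustration; the exact mapping is an assumption consistent with the example above, where 3 symbols become -1, 0, and 1):

```python
import numpy as np

def encode_symbols(symbols, n_s):
    """Map ordered symbols 1..n_s linearly onto [-1, 1];
    e.g. with n_s = 3 the symbols become -1, 0, and 1."""
    return 2.0 * (np.asarray(symbols) - 1) / (n_s - 1) - 1.0
```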

5.3 Dealing with Non-stationarity

The approach used to deal with the non-stationarity of the signal in this work is to build models based on a short time period only. This is an intuitively appealing approach because one would expect that any inefficiencies found in the market would typically not last for a long time. The difficulty with this approach is the reduction in the already small number of training data points. The size of the training set controls a noise vs. non-stationarity tradeoff [42]. If the training set is too small, the noise makes it harder to estimate the appropriate mapping. If the training set is too large, the non-stationarity of the data will mean more data with statistics that are less relevant for the task at hand will be used for creating the estimator.

We ran tests on a segment of data separate from the main tests, from which we chose to use models trained on 100 days of data (see appendix A for more details). After training, each model is tested on the next 30 days of data. The entire training/test set window is moved forward 30 days and the process is repeated, as depicted in figure 4. In a real-life situation it would be desirable to test only for the next day, and advance the windows by steps of one day. However, 30 days is used here in order to reduce the amount of computation by a factor of 30, and enable a much larger number of tests to be performed (i.e. 30 times as many predictions are made from the same number of training runs), thereby increasing confidence in the results.


Figure 4. A depiction of the training and test sets used.
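A minimal sketch of this rolling train/test scheme (our own illustration; window sizes follow the text, the function name is ours):

```python
def rolling_windows(n_points, train_len=100, test_len=30):
    """Yield (train_indices, test_indices) pairs; the whole window advances by test_len."""
    start = 0
    while start + train_len + test_len <= n_points:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len   # move the entire training/test window forward 30 days
```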

5.4 Controlling Overfitting

Highly noisy data has often been addressed using techniques such as weight decay, weight elimination and early stopping to control overfitting [61]. For this problem we use early stopping with a difference: the stopping point is chosen using multiple tests on a separate segment of data, rather than using a validation set in the normal manner. Tests showed that using a validation set in the normal manner seriously affected performance. This is not very surprising because there is a dilemma when choosing the validation set. If the validation set is chosen before the training data (in terms of the temporal order of the series), then the non-stationarity of the series means the validation set may be of little use for predicting performance on the test set. If the validation set is chosen after the training data, a problem occurs again with the non-stationarity: it is desirable to make the validation set as small as possible to alleviate this problem, however the set cannot be made too small because it needs to be a certain size in order to make the results statistically valid. Hence, we chose to control overfitting by setting the training time for an appropriate stopping point a priori. Simulations with a separate segment of data not contained in the main data (see appendix A for more details) were used to select the training time of 500 epochs.

We note that our use of a separate segment of data to select training set size and training time is not expected to be an ideal solution, e.g. the best choice for these parameters may change significantly over time due to the non-stationarity of the series.

5.5 A Priori Data Probabilities

For some financial data sets, it is possible to obtain good results by simply predicting the most common change in the training data (e.g. noticing that the change in prices is positive more often than it is negative in the training set and always predicting positive for the test set). Although predictability due to such data bias may be worthwhile, it can prevent the discovery of a better solution. Figure 5 shows a simplified depiction of the problem: if the naive solution, A, has a better mean squared error than a random solution then it may be hard to find a better solution, e.g. B. This may be particularly problematic for high noise data although the true nature of the error surface is not yet well understood.

Figure 5. A depiction of a simplified error surface if a naive solution, A, performs well. A better solution, e.g. B, may be difficult to find.

Hence, in all results reported here the models have been prevented from acting on any such bias by ensuring that the number of training examples for the positive and negative cases is equal. The training set actually remains unchanged so that the temporal order is not disturbed but there is no training signal for the appropriate number of positive or negative examples starting from the beginning of the series. The number of positive and negative examples is typically approximately equal to begin with, therefore the use of this simple technique did not result in the loss of many data points.
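One way this masking of the training signal could be implemented (a sketch under our own assumptions; the paper does not give code) is to mark the surplus examples of the majority class, starting from the beginning of the series, as carrying no error signal:

```python
import numpy as np

def balance_mask(targets):
    """targets: array of +1/-1 direction labels in temporal order.
    Returns a boolean mask; False entries contribute no training signal.
    Surplus examples of the majority class are masked starting from the
    beginning of the series, so the temporal order is untouched."""
    mask = np.ones(len(targets), dtype=bool)
    surplus = int(np.sum(targets == 1)) - int(np.sum(targets == -1))
    majority = 1 if surplus > 0 else -1
    idx = np.where(targets == majority)[0][:abs(surplus)]
    mask[idx] = False
    return mask
```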

5.6 Comparison with a Random Walk

The prediction corresponding to the random walk model is equal to the current price, i.e. no change. A percentage of times the model predicts the correct change (apart from no change) cannot be obtained for the random walk model. However, it is possible to use models with data that corresponds to a random walk and compare the results to those obtained with the real data. If equal performance is obtained on the random walk data as compared to the real data then no significant predictability has been found. These tests were performed by randomizing the classification of the test set. For each original test set five randomized versions were created. The randomized versions were created by repeatedly selecting two random positions in time and swapping the two classification values (the input values remain unchanged). An additional benefit of this technique is that if the model is showing predictability based on any data bias (as discussed in the previous section) then the results on the randomized test sets will also show this predictability, i.e. always predicting a positive change will result in the same performance on the original and the randomized test sets.
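The randomization procedure can be sketched as follows (our own minimal code; the number of swaps is an assumption, the text only says the positions are repeatedly swapped):

```python
import numpy as np

def randomize_labels(targets, n_swaps=1000, seed=0):
    """Create a surrogate test set: repeatedly swap the class labels of two
    randomly chosen time steps, leaving the inputs untouched."""
    rng = np.random.default_rng(seed)
    shuffled = np.asarray(targets).copy()
    for _ in range(n_swaps):
        i, j = rng.integers(len(shuffled), size=2)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled
```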

6 Experimental Results

We present results using the previously described system and alternative systems for comparison. In all cases, the Elman networks contained 5 hidden units and the SOM had one dimension. Further details of the individual simulations can be found in appendix A. Figure 6 and table 1 show the average error rate as the number of nodes in the self-organizing map (SOM) was increased from 2 to 7. Results are shown for SOM embedding dimensions of 1 and 2. Results are shown using both the original data and the randomized data for comparison with a random walk. The results are averaged over the 5 exchange rates with 30 simulations per exchange rate and 30 test points per simulation, for a total of 4500 test points. The best performance was obtained with a SOM size of 7 nodes and an embedding dimension of 2, where the error rate was 47.1%. A t-test indicates that this result is significantly different from the null hypothesis of the randomized test sets at p = 0.0076 or 0.76% (t = 2.67), indicating that the result is significant. We also considered a null hypothesis corresponding to a model which makes completely random predictions (corresponding to a random walk); the result is significantly different from this null hypothesis at p = 0.0054 or 0.54% (t = 2.80). Figure 7 and table 2 show the average results for the SOM size of 7 on the individual exchange rates.
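The significance test can be sketched as follows (our own illustration; we assume the per-simulation error rates on the real and randomized test sets are compared with an unpaired two-sample t-test, which the text does not spell out):

```python
from scipy import stats

def significance(real_error_rates, randomized_error_rates):
    """Two-sample t-test comparing per-simulation error rates on the real
    test sets against those on the randomized (random-walk-like) test sets."""
    t, p = stats.ttest_ind(real_error_rates, randomized_error_rates)
    return t, p
```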

Figure 6. The error rate as the number of nodes in the SOM is increased for SOM embedding dimensions of 1 and 2. Each of the results represents the average over the 5 individual series and the 30 simulations per series, i.e. over 150 simulations which tested 30 points each, or 4500 individual classifications. The results for the randomized series are averaged over the same values with the 5 randomized series per simulation, for a total of 22500 classifications per result (all of the data available was not used due to limited computational resources).

In order to gauge the effectiveness of the symbolic encoding and recurrent neural network, we investigated the following two systems:

1. The system as presented here without the symbolic encoding, i.e. the preprocessed data is entered directly into the recurrent neural network without the SOM stage.

2. The system as presented here with the recurrent network replaced by a standard MLP network.

Tables 3 and 4 show the results for systems 1 and 2 above. The performance of these systems can be seen to be in-between the performance of the null hypothesis of a random walk and the performance of the hybrid system, indicating that the hybrid system can result in significant improvement.


SOM embedding dimension: 1
  SOM size            2      3      4      5      6      7
  Error             48.5   47.5   48.3   48.3   48.5   47.6
  Randomized error  50.1   50.5   50.4   50.2   49.8   50.3

SOM embedding dimension: 2
  SOM size            2      3      4      5      6      7
  Error             48.8   47.8   47.9   47.7   47.6   47.1
  Randomized error  49.8   50.4   50.4   50.1   49.7   50.2

Table 1. The results as shown in figure 6.


Figure 7. Average results for the five exchange rates using the real test data and the randomized test data. The SOM size is 7 and the results for SOM embedding dimensions of 1 and 2 are shown.

  Exchange Rate                      British   Canadian   German   Japanese   Swiss
                                     Pound     Dollar     Mark     Yen        Franc
  Embedding dimension 1              48.5      46.9       46       47.7       48.9
  Embedding dimension 2              48.1      46.8       46.9     46.3       47.4
  Randomized embedding dimension 1   50.1      50.5       50.4     49.8       50.7
  Randomized embedding dimension 2   50.2      50.6       50.5     49.7       49.8

Table 2. Results as shown in figure 7.

  Error             49.5
  Randomized error  50.1

Table 3. The results for the system without the symbolic encoding, i.e. the preprocessed data is entered directly into the recurrent neural network without the SOM stage.

SOM embedding dimension: 1
  SOM size            2      3      4      5      6      7
  Error             49.1   49.4   49.3   49.7   49.9   49.8
  Randomized error  49.7   50.1   49.9   50.3   50.0   49.9

Table 4. The results for the system using an MLP instead of the RNN.

6.1 Reject Performance

We used the following equation in order to calculate a confidence measure for every prediction: c = y_max(y_max - y_min), where y_max is the maximum output and y_min is the minimum output (for outputs which have been transformed using the softmax transformation: y_i = exp(u_i) / \sum_{j=1}^{n} exp(u_j), where the u_i are the original outputs, the y_i are the transformed outputs, and n is the number of outputs). The confidence measure c gives an indication of how confident the network model is for each classification. Thus, for a fixed value of c, the predictions obtained by a particular algorithm can be divided into two sets, one set D for those below the threshold, and another set E for those above the threshold. We can then reject those points which are in set D, i.e. they are not used in the computation of the classification accuracy.
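A minimal sketch of this confidence-based rejection (our own illustration of the formula above; the threshold value and names are placeholders):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))      # subtract max for numerical stability
    return e / e.sum()

def confidence(outputs):
    """c = y_max * (y_max - y_min) computed on softmax-transformed outputs."""
    y = softmax(np.asarray(outputs, dtype=float))
    return y.max() * (y.max() - y.min())

def accept(outputs, threshold=0.3):
    """Reject (return False) predictions whose confidence falls below the threshold."""
    return confidence(outputs) >= threshold
```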

Figure 8 shows the classification performance of the system (percentage of correct sign predictions) as the confidence threshold is increased and more examples are rejected for classification. Figure 9 shows the same graph for the randomized test sets. It can be seen that a significant increase in performance (down to approximately 40% error) can be obtained by only considering the 5-15% of examples with the highest confidence. This benefit cannot be seen for the randomized test sets. It should be noted that by the time 95% of test points are rejected, the total number of remaining test points is only 225.



Figure 8. Classification performance as the confidence threshold is increased. This graph is the average of the 150 individual results.


Figure 9. Classification performance as the confidence threshold is increased for the randomized test sets. This graph is the average of the individual randomized test sets.

7 Automata Extraction

A common complaint with neural networks is that the models are difficult to interpret, i.e. it is not clear how the models arrive at the prediction or classification of a given input pattern [49]. A number of people have considered the extraction of symbolic knowledge from trained neural networks, for both feedforward and recurrent neural networks [12, 44, 57]. For recurrent networks, the ordered triple of a discrete Markov process (state; input -> next state) can be extracted and used to form an equivalent deterministic finite state automaton (DFA). This can be done by clustering the activation values of the recurrent state neurons [44]. The automata extracted with this process can only recognize regular grammars. (A regular grammar G is a 4-tuple G = (S, N, T, P), where S is the start symbol, N and T are non-terminal and terminal symbols, respectively, and P represents productions of the form A -> a or A -> aB, where A, B are in N and a is in T.)


The algorithm used for automata extraction is the same as that described in [21]. The algorithm is based on the observation that the activations of the recurrent state neurons in a trained network tend to cluster. The output of each of the N state neurons is divided into q intervals of equal size, yielding q^N partitions in the space of outputs of the state neurons. Starting from an initial state, the input symbols are considered in order. If an input symbol causes a transition to a new partition in the activation space then a new DFA state is created with a corresponding transition on the appropriate symbol. In the event that different activation state vectors belong to the same partition, the state vector which first reached the partition is chosen as the new initial state for subsequent symbols. The extraction would be computationally infeasible if all q^N partitions had to be visited, but the clusters corresponding to DFA states often cover a small percentage of the input space.
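A rough sketch of this quantization-based extraction (our own simplified code, not the implementation of [21]; it assumes sigmoidal state activations in [0, 1] and a trained one-step state-update function):

```python
def extract_dfa(step, h0, symbols, q=5):
    """step(h, s) -> next hidden-state vector; h0 is the initial state vector.
    Each hidden activation is quantized into q equal intervals; the tuple of
    interval indices identifies a DFA state."""
    quantize = lambda h: tuple(min(int(v * q), q - 1) for v in h)
    start = quantize(h0)
    transitions = {}                 # (state, symbol) -> next state
    representative = {start: h0}     # first activation vector reaching each partition
    h, state = h0, start
    for s in symbols:
        h = step(h, s)
        nxt = quantize(h)
        transitions[(state, s)] = nxt
        if nxt not in representative:
            representative[nxt] = h
        else:
            h = representative[nxt]  # reuse the first vector that reached this partition
        state = nxt
    return start, transitions
```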

Sample deterministic finite state automata (DFA) extracted from trained financial prediction networks using a quantization level of 5 can be seen in figure 10. The DFAs have been minimized using standard minimization techniques [27]. The DFAs were extracted from networks where the SOM embedding dimension was 1 and the SOM size was 2. As expected, the DFAs extracted in this case are relatively simple. In all cases, the SOM symbols end up representing positive and negative changes in the series; transitions marked with a solid line correspond to positive changes and transitions marked with a dotted line correspond to negative changes. For all DFAs, state 1 is the starting state. Double circled states correspond to prediction of a positive change, and states without the double circle correspond to prediction of a negative change. (a) can be seen as corresponding to mean-reverting behavior, i.e. if the last change in the series was negative then the DFA predicts that the next change will be positive and vice versa. (b) and (c) can be seen as variations on (a) where there are exceptions to the simple rules of (a), e.g. in (b) a negative change corresponds to a positive prediction, however after the input changes from negative to positive, the prediction alternates with successive positive changes. In contrast with (a)-(c), (d)-(f) exhibit behavior related to trend following algorithms, e.g. in (d) a positive change corresponds to a positive prediction, and a negative change corresponds to a negative prediction except for the first negative change after a positive change. Figure 11 shows a sample DFA for a SOM size of 3; more complex behavior can be seen in this case.



Figure 10. Sample deterministic finite state automata (DFA) extracted from trained financial prediction networks using a quantization level of 5.


Figure 11. A sample DFA extracted from a network where the SOM size was 3 (and the SOM embedding dimension was 1). The SOM symbols represent a positive (solid lines) or negative (dotted lines) change, or a change close to zero (gray line). In this case, the rules are more complex than the simple rules in the previous figure.


8 Conclusions

Traditional learning methods have difficulty with high noise, high non-stationarity time series prediction. The intelligent signal processing method presented here uses a combination of symbolic processing and recurrent neural networks to deal with fundamental difficulties. The use of a recurrent neural network is significant for two reasons: firstly, the temporal relationship of the series is explicitly modeled via internal states, and secondly, it is possible to extract rules from the trained recurrent networks in the form of deterministic finite state automata. The symbolic conversion is significant for a number of reasons: the quantization effectively filters the data or reduces the noise, the RNN training becomes more effective, and the symbolic input facilitates the extraction of rules from the trained networks. Considering foreign exchange rate prediction, we performed a comprehensive set of simulations covering 5 different foreign exchange rates, which showed that significant predictability does exist with a 47.1% error rate in the prediction of the direction of the change. Additionally, the error rate reduced to around 40% when rejecting examples where the system had low confidence in its prediction. The method facilitates the extraction of rules which explain the operation of the system. Results for the problem considered here are positive and show that symbolic knowledge which has meaning to humans can be extracted from noisy time series data. Such information can be very useful and important to the user of the technique.

We have shown that a particular recurrent network combined with symbolic methods can generate useful predictions and rules for such predictions on a non-stationary time series. It would be interesting to compare this approach with other neural networks that have been used in time series analysis [6, 31, 43, 45, 48], and other machine learning methods [19, 51], some of which can also generate rules. It would also be interesting to incorporate this model into a trading policy.

Appendix A: Simulation Details

For the experiments reported in this paper, n_h = 5 and n_o = 2. The RNN learning rate was linearly reduced over the training period (see [11] regarding learning rate schedules) from an initial value of 0.5. All inputs were normalized to zero mean and unit variance. All nodes included a bias input which was part of the optimization process. Weights were initialized as shown in Haykin [26]. Target outputs were -0.8 and 0.8 using the tanh output activation function and we used the quadratic cost function. In order to initialize the delays in the network, the RNN was run without training for 50 time steps prior to the start of the datasets. The number of training examples for the positive and negative cases was made equal by not training on the appropriate number of positive or negative examples starting from the beginning of the training set (i.e. the temporal order of the series is not disturbed but there is no training signal for either a number of positive or a number of negative examples starting at the beginning of the series). The SOM was trained for 10,000 iterations using a learning rate schedule which decreased linearly with the ratio n_i/n_t, where n_i is the current iteration number and n_t is the total number of iterations (10,000). The neighborhood size was also reduced during training according to a schedule based on n_i/n_t. The first differenced data was split as follows: points 1210 to 1959 (plus the size of the validation set in the validation set experiments) in each of the exchange rates were used for the selection of the training data set size, the training time, and experimentation with the use of a validation set. Points 10 to 1059 in each of the exchange rates were used for the main tests. The data for each successive simulation was offset by the size of the test set, 30 points, as shown in figure 4. The number of points in the main tests from each exchange rate is equal to 1,050, which is 50 (used to initialize delays) + 100 (training set size) + 30 x 30 (30 test sets of size 30). For the selection of the training data set size and the training time the number of points was 750 for each exchange rate (50 + 100 + 20 test sets of size 30). For the experiments using a validation set in each simulation, the number of points was 750 plus the size of the validation set for each of the exchange rates.

Acknowledgments

We would like to thank Doyne Farmer, Christian Omlin, Andreas Weigend, and the anonymous referees for helpful comments.

References

[1] Machine learning. International Journal of Intelligent Systems in Accounting, Finance and Management, 4(2), 1995. Special Issue.


[2] Proceedings of IEEE/IAFE Conference on Computational Intelligence for Financial Engineering (CIFEr), 1996-1999.
[3] Y.S. Abu-Mostafa. Learning from hints in neural networks. Journal of Complexity, 6:192, 1990.
[4] J.A. Alexander and M.C. Mozer. Template-based algorithms for connectionist rule extraction. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 609-616. The MIT Press, 1995.
[5] Martin Anthony and Norman L. Biggs. A computational learning theory view of economic forecasting with neural nets. In A. Refenes, editor, Neural Networks in the Capital Markets. John Wiley and Sons, 1995.
[6] A.D. Back and A.C. Tsoi. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3(3):375-385, 1991.
[7] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.
[8] R.E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.
[9] Y. Bengio. Neural Networks for Speech and Sequence Recognition. Thomson, 1996.
[10] A. Cleeremans, D. Servan-Schreiber, and J.L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372-381, 1989.
[11] C. Darken and J.E. Moody. Note on learning rate schedules for stochastic optimization. In R.P. Lippmann, J.E. Moody, and D.S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 832-838. Morgan Kaufmann, San Mateo, CA, 1991.
[12] J.S. Denker, D. Schwartz, B. Wittner, S.A. Solla, R. Howard, L. Jackel, and J. Hopfield. Large automatic learning, rule extraction, and generalization. Complex Systems, 1:877-922, 1987.
[13] J.L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2/3):195-226, 1991.


[14] E.F. Fama. The behaviour of stock market prices. Journal of Business, January:34-105, 1965.
[15] E.F. Fama. Efficient capital markets: A review of theory and empirical work. Journal of Finance, May:383-417, 1970.
[16] Andras Farago and Gabor Lugosi. Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146-1151, 1993.
[17] J.H. Friedman. An overview of predictive learning and function approximation. In V. Cherkassky, J.H. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks, Theory and Pattern Recognition Applications, volume 136 of NATO ASI Series F, pages 1-61. Springer, 1994.
[18] J.H. Friedman. Introduction to computational learning and statistical prediction. Tutorial presented at Neural Information Processing Systems, Denver, CO, 1995.
[19] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. Machine Learning, 29:245, 1997.
[20] C. Lee Giles, B.G. Horne, and T. Lin. Learning a class of large finite state machines with a recurrent neural network. Neural Networks, 8(9):1359-1365, 1995.
[21] C. Lee Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393-405, 1992.
[22] C. Lee Giles, C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen, and Y.C. Lee. Extracting and learning an unknown grammar with recurrent neural networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 317-324, San Mateo, CA, 1992. Morgan Kaufmann Publishers.
[23] C. Lee Giles, G.Z. Sun, H.H. Chen, Y.C. Lee, and D. Chen. Higher order recurrent networks & grammatical inference. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 380-387, San Mateo, CA, 1990. Morgan Kaufmann Publishers.
[24] C.W.J. Granger and P. Newbold. Forecasting Economic Time Series. Academic Press, San Diego, second edition, 1986.


[25] Yoichi Hayashi. A neural expert system with automated extraction of fuzzy if-then rules. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 578-584. Morgan Kaufmann Publishers, 1991.
[26] S. Haykin. Neural Networks, A Comprehensive Foundation. Macmillan, New York, NY, 1994.
[27] J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, MA, 1979.
[28] L. Ingber. Statistical mechanics of nonlinear nonequilibrium financial markets: Applications to optimized trading. Mathematical Computer Modelling, 1996.
[29] M. Ishikawa. Rule extraction by successive regularization. In International Conference on Neural Networks, pages 1139-1143. IEEE Press, 1996.
[30] M.I. Jordan. Neural networks. In A. Tucker, editor, CRC Handbook of Computer Science. CRC Press, Boca Raton, FL, 1996.
[31] A. Kehagias and V. Petridis. Time series segmentation using predictive modular neural networks. Neural Computation, 9:1691-1710, 1997.
[32] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, Germany, 1995.
[33] A.N. Kolmogorov. On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition. Dokl, 114:679-681, 1957.
[34] V. Kurkova. Kolmogorov's theorem is relevant. Neural Computation, 3(4):617-622, 1991.
[35] V. Kurkova. Kolmogorov's theorem. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 501-502. MIT Press, Cambridge, Massachusetts, 1995.
[36] Steve Lawrence, C. Lee Giles, and Sandiway Fong. Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering. Accepted for publication.


[37] Jean Y. Lequarre. Foreign currency dealing: A brief introduction (data set C). In A.S. Weigend and N.A. Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1993.
[38] Andrew W. Lo and A. Craig MacKinlay. A Non-Random Walk Down Wall Street. Princeton University Press, 1999.
[39] B.G. Malkiel. Efficient Market Hypothesis. Macmillan, London, 1987.
[40] B.G. Malkiel. A Random Walk Down Wall Street. Norton, New York, NY, 1996.
[41] C. McMillan, M.C. Mozer, and P. Smolensky. Rule induction through integrated symbolic and subsymbolic processing. In John E. Moody, Steve J. Hanson, and Richard P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 969-976. Morgan Kaufmann Publishers, 1992.
[42] J.E. Moody. Economic forecasting: Challenges and neural network solutions. In Proceedings of the International Symposium on Artificial Neural Networks. Hsinchu, Taiwan, 1995.
[43] S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using support vector machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, IEEE Workshop on Neural Networks for Signal Processing VII, page 511. IEEE Press, 1997.
[44] C.W. Omlin and C.L. Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41-52, 1996.
[45] J.C. Principe and J-M. Kuo. Dynamic modelling of chaotic time series with neural networks. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, 7, pages 311-318. MIT Press, 1995.
[46] A. Refenes, editor. Neural Networks in the Capital Markets. John Wiley and Sons, New York, NY, 1995.
[47] D.W. Ruck, S.K. Rogers, K. Kabrisky, M.E. Oxley, and B.W. Suter. The multilayer perceptron as an approximation to an optimal Bayes estimator. IEEE Transactions on Neural Networks, 1(4):296-298, 1990.


[48] E.W. Saad, Danil V. Prokhorov, and D.C. Wunsch II. Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks. IEEE Transactions on Neural Networks, 9(6):1456, November 1998.
[49] R. Setiono and H. Liu. Symbolic representation of neural networks. IEEE Computer, 29(3):71-77, 1996.
[50] H.T. Siegelmann and E.D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132-150, 1995.
[51] S.M. Weiss and N. Indurkhya. Rule-based machine learning methods for functional prediction. Journal of Artificial Intelligence Research, 1995.
[52] F. Takens. Detecting strange attractors in turbulence. In D.A. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, pages 366-381, Berlin, 1981. Springer-Verlag.
[53] S.J. Taylor, editor. Modelling Financial Time Series. J. Wiley & Sons, Chichester, 1994.
[54] A.B. Tickle, R. Andrews, M. Golea, and J. Diederich. The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9(6):1057, November 1998.
[55] M. Tomita. Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, pages 105-108, Ann Arbor, MI, 1982.
[56] G. Towell and J. Shavlik. The extraction of refined rules from knowledge based neural networks. Machine Learning, 13(1):71-101, 1993.
[57] Geoffrey G. Towell and Jude W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71, 1993.
[58] George Tsibouris and Matthew Zeidenberg. Testing the efficient markets hypothesis with gradient descent algorithms. In A. Refenes, editor, Neural Networks in the Capital Markets. John Wiley and Sons, 1995.


[59] R.L. Watrous and G.M. Kuhn. Induction of finite state languages using second-order recurrent networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 309-316, San Mateo, CA, 1992. Morgan Kaufmann Publishers.
[60] R.L. Watrous and G.M. Kuhn. Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4(3):406, 1992.
[61] A.S. Weigend, B.A. Huberman, and D.E. Rumelhart. Predicting sunspots and exchange rates with connectionist networks. In M. Casdagli and S. Eubank, editors, Nonlinear Modeling and Forecasting, SFI Studies in the Sciences of Complexity, Proceedings Vol. XII, pages 395-432. Addison-Wesley, 1992.
[62] G.U. Yule. On a method of investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philosophical Transactions Royal Society London Series A, 226:267-298, 1927.
[63] Z. Zeng, R.M. Goodman, and P. Smyth. Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5(2):320-330, 1994.
