Noisy Time Series Prediction using a Recurrent
Neural Network and Grammatical Inference

C. Lee Giles†*, Steve Lawrence†, Ah Chung Tsoi‡

† NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
‡ Faculty of Informatics, University of Wollongong, NSW 2522, Australia
* Lee Giles is also with the Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742.
Abstract
Financial forecasting is an example of a signal processing problem which is challenging due to small sample sizes, high noise, non-stationarity, and non-linearity. Neural networks have been very successful in a number of signal processing applications. We discuss fundamental limitations and inherent difficulties when using neural networks for the processing of high noise, small sample size signals. We introduce a new intelligent signal processing method which addresses the difficulties. The method proposed uses conversion into a symbolic representation with a self-organizing map, and grammatical inference with recurrent neural networks. We apply the method to the prediction of daily foreign exchange rates, addressing difficulties with non-stationarity, overfitting, and unequal a priori class probabilities, and we find significant predictability in comprehensive experiments covering 5 different foreign exchange rates. The method correctly predicts the direction of change for the next day with an error rate of 47.1%. The error rate reduces to around 40% when rejecting examples where the system has low confidence in its prediction. We show that the symbolic representation aids the extraction of symbolic
knowledge from the trained recurrent neural networks in the form of deterministic finite state automata. These automata explain the operation of the system and are often relatively simple. Automata rules related to well known behavior such as trend following and mean reversal are extracted.
1 Introduction

1.1 Predicting Noisy Time Series Data
The prediction of future events from noisy time series data is commonly done using various forms of statistical models [24]. Typical neural network models are closely related to statistical models, and estimate Bayesian a posteriori probabilities when given an appropriately formulated problem [47]. Neural networks have been very successful in a number of pattern recognition applications. For noisy time series prediction, neural networks typically take a delay embedding of previous inputs¹ which is mapped into a prediction. For examples of these models and the use of neural networks and other machine learning methods applied to finance see [2, 1, 46]. However, high noise, high non-stationarity time series prediction is fundamentally difficult for these models:

1. The problem of learning from examples is fundamentally ill-posed, i.e. there are infinitely many models which fit the training data well, but few of these generalize well. In order to form a more accurate model, it is desirable to use as large a training set as possible. However, for the case of highly non-stationary data, increasing the size of the training set results in more data with statistics that are less relevant to the task at hand being used in the creation of the model.

2. The high noise and small datasets make the models prone to overfitting. Random correlations between the inputs and outputs can present great difficulty. The models typically do not explicitly address the temporal relationship of the inputs – e.g. they do not distinguish between those correlations which occur in temporal order, and those
which do not.

¹ For example, $x(t), x(t-1), \ldots$
In our approach we consider recurrent neural networks (RNNs). Recurrent neural networks employ feedback connections and have the potential to represent certain computational structures in a more parsimonious fashion [13]. RNNs address the temporal relationship of their inputs by maintaining an internal state. RNNs are biased towards learning patterns which occur in temporal order – i.e. they are less prone to learning random correlations which do not occur in temporal order.
However, training RNNs tends to be difficult with high noise data, with a tendency for long-term dependencies to be neglected (experiments reported in [9] found a tendency for recurrent networks to take into account short-term dependencies but not long-term dependencies), and for the network to fall into a naive solution such as always predicting the most common output. We address this problem by converting the time series into a sequence of symbols using a self-organizing map (SOM). The problem then becomes one of grammatical inference – the prediction of a given quantity from a sequence of symbols. It is then possible to take advantage of the known capabilities of recurrent neural networks to learn grammars [22, 63] in order to capture any predictability in the evolution of the series.
The use of a recurrent neural network is significant for two reasons: firstly, the temporal relationship of the series is explicitly modeled via internal states, and secondly, it is possible to extract rules from the trained recurrent networks in the form of deterministic finite state automata [44]². The symbolic conversion is significant for a number of reasons: the quantization effectively filters the data or reduces the noise, the RNN training becomes more effective, and the symbolic input facilitates the extraction of rules from the trained networks. Furthermore, it can be argued that the instantiation of rules from predictors not only gives the user confidence in the prediction model but facilitates the use of the model and can lead to a deeper understanding of the process [54].
Financial time series data typically contains high noise and significant non-stationarity. In this paper, the noisy time series prediction problem considered is the prediction of foreign exchange rates. A brief overview of foreign exchange rates is presented in the next section.

² Rules can also be extracted from feedforward networks [25, 41, 56, 4, 49, 29], and other types of recurrent neural networks [20]; however, the recurrent network approach and deterministic finite state automata extraction seem particularly suitable for a time series problem.
1.2 Foreign Exchange Rates

The foreign exchange market as of 1997 is the world's largest market, with more than $US 1.3 trillion changing hands every day. The most important currencies in the market are the US dollar (which acts as a reference currency), the Japanese Yen, the British Pound, the German Mark, and the Swiss Franc [37]. Foreign exchange rates exhibit very high noise, and significant non-stationarity. Many financial institutions evaluate prediction algorithms using the percentage of times that the algorithm predicts the right trend from today until some time in the future [37]. Hence, this paper considers the prediction of the direction of change in foreign exchange rates for the next business day.
The remainder of this paper is organized as follows: section 2 describes the data set we have used, section 3 discusses fundamental issues such as ill-posed problems, the curse of dimensionality, and vector space assumptions, and section 4 discusses the efficient market hypothesis and prediction models. Section 5 details the system we have used for prediction. Comprehensive test results are contained in section 6, and section 7 considers the extraction of symbolic data. Section 8 provides concluding remarks, and Appendix A provides full simulation details.
2 Exchange Rate Data
The data we have used is publicly available at http://www.stern.nyu.edu/~aweigend/Time-Series/SantaFe.html. It was first used by Weigend et al. [61]. The data consists of daily closing bids for five currencies (German Mark (DM), Japanese Yen, Swiss Franc, British Pound, and Canadian Dollar) with respect to the US Dollar and is from the Monetary Yearbook of the Chicago Mercantile Exchange. There are 3645 data points for each exchange rate covering the period September 3, 1973 to May 18, 1987. In contrast to Weigend et al., this work considers the prediction of all five exchange rates in the data and prediction for all days of the week instead of just Mondays.
3 Fundamental Issues

This section briefly discusses fundamental issues related to learning by example using techniques such as multi-layer perceptrons (MLPs): ill-posed problems, the curse of dimensionality, and vector space assumptions. These help form the motivation for our new method.

The problem of inferring an underlying probability distribution from a finite set of data is fundamentally an ill-posed problem because there are infinitely many solutions [30], i.e. there are infinitely many functions which fit the training data exactly but differ in other regions of the input space. The problem becomes well-posed only when additional constraints are used. For example, the constraint that a limited size MLP will be used might be imposed. In this case, there may be a unique solution which best fits the data from the range of solutions which the model can represent. The implicit assumption behind this constraint is that the underlying target function is "smooth". Smoothness constraints are commonly used by learning algorithms [17]. Without smoothness or some other constraint, there is no reason to expect a model to perform well on unseen inputs (i.e. generalize well).
Learning by example often operates in a vector space, i.e. a function mapping $\mathbb{R}^n$ to $\mathbb{R}^m$ is approximated. However, a problem with vector spaces is that the representation does not reflect any additional structure within the data. Measurements that are related (e.g. from neighboring pixels in an image or neighboring time steps in a time series) and measurements that do not have such a relationship are treated equally. These relationships can only be inferred in a statistical manner from the training examples.

The use of a recurrent neural network instead of an MLP with a window of time delayed inputs introduces another assumption by explicitly addressing the temporal relationship of the inputs via the maintenance of an internal state. This can be seen as similar to the use of hints [3] and should help to make the problem less ill-posed.
The curse of dimensionality refers to the exponential growth of hypervolume as a function of dimensionality [8]. Consider $x \in \mathbb{R}^n$. The regression, $y = f(x)$, is a hypersurface in $\mathbb{R}^{n+1}$. If $f(x)$ is arbitrarily complex and unknown then dense samples are required to approximate the function accurately. However, it is hard to obtain dense samples in high dimensions³.

³ Kolmogorov's theorem shows that any continuous function of $n$ dimensions can be completely characterized by a one dimensional continuous function. Specifically, Kolmogorov's theorem [33, 34, 35, 18] states that any continuous function can be written $f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^{n} \psi_{qp}(x_p) \right)$, where $\Phi_q$ and $\psi_{qp}$ are continuous functions of one variable.
This is the "curse of dimensionality" [8, 17, 18]. The relationship between the sampling density and the number of points required is $N^{1/d}$ [18], where $d$ is the dimensionality of the input space and $N$ is the number of points. Thus, if $N_1$ is the number of points for a given sampling density in 1 dimension, then in order to keep the same density as the dimensionality is increased, the number of points must increase according to $N_1^d$. For example, a density achieved with 100 points in 1 dimension requires $100^5 = 10^{10}$ points in 5 dimensions.

It has been suggested that MLPs do not suffer from the curse of dimensionality [16, 7]. However, this is not true in general (although MLPs may cope better than other models). The apparent avoidance of the curse of dimensionality in [7] is due to the fact that the function spaces considered are more and more constrained as the input dimension increases [34]. Similarly, smoothness conditions must be satisfied for the results of [16].
The use of a recurrent neural network is important from the viewpoint of the curse of dimensionality because the RNN can take into account greater history of the input. Trying to take into account greater history with an MLP by increasing the number of delayed inputs results in an increase in the input dimension. This is problematic, given the small number of data points available. The use of a self-organizing map for the symbolic conversion is also important from the viewpoint of the curse of dimensionality. As will be seen later, the topographical order of the SOM allows encoding a one dimensional SOM into a single input for the RNN.
4 The Efficient Market Hypothesis

The Efficient Market Hypothesis (EMH) was developed in 1965 by E. Fama [14, 15] and found broad acceptance in the financial community [39, 58, 5]. The EMH, in its weak form, asserts that the price of an asset reflects all of the information that can be obtained from past prices of the asset, i.e. the movement of the price is unpredictable. The best prediction for a price is the current price and the actual prices follow what is called a random walk. The EMH is based on the assumption that all news is promptly incorporated in prices; since news is unpredictable (by definition), prices are unpredictable. Much effort has been expended trying to prove or disprove the EMH. Current opinion is that the theory has been disproved [53, 28], and much evidence suggests that the capital markets are not efficient [40, 38].
If the EMH was true, then a financial series could be modeled as the addition of a noise component at each step:

$y(t+1) = y(t) + \epsilon(t)$ (2)

where $\epsilon(t)$ is a zero mean Gaussian variable with variance $\sigma$. The best estimation is:

$\hat{y}(t+1) = y(t)$ (3)

In other words, if the series is truly a random walk, then the best estimate for the next time period is equal to the current estimate. Now, if it is assumed that there is a predictable component of the series then it is possible to use:

$y(t+1) = y(t) + f(y(t), y(t-1), \ldots, y(t-n+1)) + \epsilon(t)$ (4)

where $\epsilon(t)$ is a zero mean Gaussian variable with variance $\sigma$, and $f(\cdot)$ is a non-linear function in its arguments. In this case, the best estimate is given by:

$\hat{y}(t+1) = y(t) + f(y(t), y(t-1), \ldots, y(t-n+1))$ (5)

Prediction using this model is problematic as the series often contains a trend. For example, a neural network trained on section A in figure 1 has little chance of generalizing to the test data in section B, because the model was not trained with data in the range covered by that section.
Figure 1. An example of the difficulty in using a model trained on a raw price series. Generalization to section B from the training data in section A is difficult because the model has not been trained with inputs in the range covered by section B.
A common solution to this is to use a model which is based on the first order differences (equation 7) instead of the raw time series [24]:

$\delta(t+1) = f(\delta(t), \delta(t-1), \ldots, \delta(t-n+1)) + \nu(t)$ (6)

where

$\delta(t+1) \triangleq y(t+1) - y(t)$ (7)

and $\nu(t)$ is a zero mean Gaussian variable with variance $\sigma$. In this case the best estimate is:

$\hat{\delta}(t+1) = f(\delta(t), \delta(t-1), \ldots, \delta(t-n+1))$ (8)

5 System Details
A high-level block diagram of the system used is shown in figure 2. The raw time series values are $y(t)$, $t = 1, 2, \ldots, N$, where $y(t) \in \mathbb{R}$. These denote the daily closing bids of the financial time series. The first difference of the series, $y(t)$, is taken as follows:

$\delta(t) = y(t) - y(t-1)$ (9)
Figure 2. A high-level block diagram of the system used.
This produces $\delta(t)$, $\delta(t) \in \mathbb{R}$, $t = 1, 2, \ldots, N-1$. In order to compress the dynamic range of the series and reduce the effect of outliers, a log transformation of the data is used:

$x(t) = \mathrm{sign}(\delta(t)) \log(|\delta(t)| + 1)$ (10)

resulting in $x(t)$, $t = 1, 2, \ldots, N-1$, $x(t) \in \mathbb{R}$. As is usual, a delay embedding of this series is then considered [62]:

$X(t, d_1) = (x(t), x(t-1), x(t-2), \ldots, x(t-d_1+1))$ (11)

where $d_1$ is the delay embedding dimension and is equal to 1 or 2 for the experiments reported here.
$X(t, d_1)$ is a state vector. This delay embedding forms the input to the SOM. Hence, the SOM input is a vector of the last $d_1$ values of the log transformed differenced time series. The output of the SOM is the topographical location of the winning node. Each node represents one symbol in the resulting grammatical inference problem. A brief description of the self-organizing map is contained in the next section.
The SOM can be represented by the following equation:

$S(t) = g(X(t, d_1))$ (12)

where $S(t) \in \{1, 2, 3, \ldots, n_s\}$, and $n_s$ is the number of symbols (nodes) for the SOM. Each node in the SOM has been assigned an integer index ranging from 1 to the number of nodes.
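To make the preprocessing concrete, the following is a minimal sketch of equations (9)-(12) in Python; it is not the authors' code, and the helper names (`preprocess`, `delay_embed`, `som_quantize`) and the randomly initialized codebook are illustrative assumptions (the codebook would normally be learned by the SOM, as described in section 5.1).

```python
import numpy as np

def preprocess(y):
    """First difference (eq. 9) and log compression (eq. 10) of a price series."""
    delta = np.diff(y)
    return np.sign(delta) * np.log(np.abs(delta) + 1.0)

def delay_embed(x, d1):
    """Stack the last d1 values at each time step (eq. 11)."""
    return np.stack([x[d1 - 1 - i : len(x) - i] for i in range(d1)], axis=1)

def som_quantize(X, codebook):
    """Map each embedded vector to the index of the nearest SOM node (eq. 12)."""
    dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1) + 1  # symbols numbered 1..n_s

# Example: 3 SOM nodes, embedding dimension 2, synthetic price series.
y = np.cumsum(np.random.randn(200)) + 100.0
X = delay_embed(preprocess(y), d1=2)
codebook = np.random.randn(3, 2)  # placeholder; normally trained
symbols = som_quantize(X, codebook)
```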
An Elman recurrent neural network is then used⁴ which is trained on the sequence of outputs from the SOM. The Elman network was chosen because it is suitable for the problem (a grammatical inference style problem) [13], and because it has been shown to perform well in comparison to other recurrent architectures (e.g. see [36]).

⁴ Elman refers to the topology of the network – full backpropagation through time was used as opposed to the truncated version used by Elman.
The Elman neural network has feedback from each of the hidden nodes to all of the hidden nodes, as shown in figure 3. For the Elman network:

$O(t+1) = C^T z_t + b_o$ (13)

and

$z_t = F_{n_h}(A z_{t-1} + B u_t + b)$ (14)

where $C$ is an $n_h \times n_o$ matrix representing the weights from the hidden layer to the output nodes, $n_h$ is the number of hidden nodes, $n_o$ is the number of output nodes, $b_o$ is a constant bias vector, and $z_t$, $z_t \in \mathbb{R}^{n_h}$, is an $n_h \times 1$ vector denoting the outputs of the hidden layer neurons. $u_t$ is a $d_e \times 1$ vector as follows, where $d_e$ is the embedding dimension used for the recurrent neural network:

$u_t = [S(t), S(t-1), S(t-2), \ldots, S(t-d_e+1)]^T$ (15)

$A$ and $B$ are matrices of appropriate dimensions which represent the feedback weights from the hidden nodes to the hidden nodes and the weights from the input layer to the hidden layer respectively. $F_{n_h}$ is an $n_h \times 1$ vector containing the sigmoid functions, and $b$ is an $n_h \times 1$ vector denoting the bias of each hidden layer neuron. $O(t)$ is an $n_o \times 1$ vector containing the outputs of the network. $n_o$ is 2 throughout this paper. The first output is trained to predict the probability of a positive change, the second output is trained to predict the probability of a negative change.

Thus, for the complete system:

$O(t+1) = F_1(\delta(t), \delta(t-1), \delta(t-2), \delta(t-3), \delta(t-4))$ (16)

which can be considered in terms of the original series:

$O(t+1) = F_2(y(t), y(t-1), y(t-2), y(t-3), y(t-4), y(t-5))$ (17)
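A sketch of the forward pass defined by equations (13)-(15) is shown below; the function names and random weights are our own illustrative assumptions, and in practice $A$, $B$, $C$ and the biases are trained with backpropagation through time.

```python
import numpy as np

def elman_step(z_prev, u, A, B, b):
    """State update, eq. (14): z_t = sigmoid(A z_{t-1} + B u_t + b)."""
    return 1.0 / (1.0 + np.exp(-(A @ z_prev + B @ u + b)))

def elman_output(z, C, b_o):
    """Output layer, eq. (13): O(t+1) = C^T z_t + b_o."""
    return C.T @ z + b_o

n_h, n_o, d_e = 5, 2, 3            # 5 hidden nodes, 2 outputs, input window of 3
rng = np.random.default_rng(0)
A = rng.normal(size=(n_h, n_h))    # hidden-to-hidden feedback weights
B = rng.normal(size=(n_h, d_e))    # input-to-hidden weights
C = rng.normal(size=(n_h, n_o))    # hidden-to-output weights
b, b_o = np.zeros(n_h), np.zeros(n_o)

z = np.zeros(n_h)                  # internal state carried across time steps
for u in [np.array([1.0, 0.0, -1.0]), np.array([0.0, 1.0, 1.0])]:
    z = elman_step(z, u, A, B, b)
    O = elman_output(z, C, b_o)    # two outputs: positive and negative change
```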
Figure 3. The pre-processed, delay embedded series is converted into symbols using a self-organizing map. An Elman neural network is trained on the sequence of symbols.
Note that this is not a static mapping due to the dependence on the hidden state of the recurrent neural network.

The following sections describe the self-organizing map, recurrent network grammatical inference, dealing with non-stationarity, controlling overfitting, a priori class probabilities, and a method for comparison with a random walk.
5.1 The Self-Organizing Map

5.1.1 Introduction

The self-organizing map, or SOM [32], is an unsupervised learning process which learns the distribution of a set of patterns without any class information. A pattern is projected from a possibly high dimensional input space to a position in the map, a low dimensional display space. The display space is often divided into a grid and each intersection of the grid is represented in the network by a neuron. The information is encoded as the location of an activated neuron. The SOM is unlike most classification or clustering techniques in that it attempts to preserve the topological ordering of the classes in the input space in the resulting display space. In other words, for similarity as measured using a metric in the input space, the SOM attempts to preserve the similarity in the display space.

5.1.2 Algorithm
We give a brief description of the SOM algorithm; for more details see [32]. The SOM defines a mapping from an input space $\mathbb{R}^n$ onto a topologically ordered set of nodes, usually in a lower dimensional space. A reference vector, $m_i \equiv [\mu_{i1}, \mu_{i2}, \ldots, \mu_{in}]^T \in \mathbb{R}^n$, is assigned to each node in the SOM. Assume that there are $n_s$ nodes. During training, each input, $x \in \mathbb{R}^n$, is compared to $m_i$, $i = 1, 2, \ldots, n_s$, obtaining the location of the closest match ($\|x - m_c\| = \min_i \|x - m_i\|$). The input point is mapped to this location in the SOM. Nodes in the SOM are updated according to:

$m_i(t+1) = m_i(t) + h_{ci}(t)[x(t) - m_i(t)]$ (18)
where $t$ is the time during learning and $h_{ci}(t)$ is the neighborhood function, a smoothing kernel which is maximum at $m_c$. Usually, $h_{ci}(t) = h(\|r_c - r_i\|, t)$, where $r_c$ and $r_i$ represent the locations of nodes in the SOM output space. $r_c$ is the node with the closest weight vector to the input sample and $r_i$ ranges over all nodes. $h_{ci}(t)$ approaches 0 as $\|r_c - r_i\|$ increases and also as $t$ approaches $\infty$. A widely applied neighborhood function is:

$h_{ci} = \alpha(t) \exp\left( -\frac{\|r_c - r_i\|^2}{2\sigma^2(t)} \right)$ (19)

where $\alpha(t)$ is a scalar valued learning rate and $\sigma(t)$ defines the width of the kernel. They are generally both monotonically decreasing with time [32]. The use of the neighborhood function means that nodes which are topographically close in the SOM structure are moved towards the input pattern along with the winning node. This creates a smoothing effect which leads to a global ordering of the map. Note that $\sigma(t)$ should not be reduced too far as the map will lose its topographical order if neighboring nodes are not updated along with the closest node. The SOM can be considered a non-linear projection of the probability density $p(x)$ of the input patterns onto a (typically) smaller output space [32].
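The following is a minimal one-dimensional SOM training sketch based on equations (18) and (19); the specific schedules for $\alpha(t)$ and $\sigma(t)$ here are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def train_som(data, n_nodes, n_iters, alpha0=0.1):
    rng = np.random.default_rng(0)
    m = rng.normal(size=(n_nodes, data.shape[1]))  # reference vectors m_i
    r = np.arange(n_nodes, dtype=float)            # node locations in the map
    sigma0 = n_nodes / 2.0
    for t in range(n_iters):
        frac = 1.0 - t / n_iters                   # decreases from 1 to 0
        alpha, sigma = alpha0 * frac, max(sigma0 * frac, 0.5)
        x = data[rng.integers(len(data))]          # random training pattern
        c = np.argmin(np.linalg.norm(x - m, axis=1))                # winning node
        h = alpha * np.exp(-((r - r[c]) ** 2) / (2 * sigma ** 2))   # eq. (19)
        m += h[:, None] * (x - m)                                   # eq. (18)
    return m

codebook = train_som(np.random.randn(500, 2), n_nodes=7, n_iters=10000)
```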
5.1.3 Remarks

The nodes in the display space encode the information contained in the input space $\mathbb{R}^n$. Since there are $n_s$ nodes, this implies that the input pattern vectors $x \in \mathbb{R}^n$ are transformed to a set of $n_s$ symbols, while preserving their original topological ordering in $\mathbb{R}^n$. Thus, if the original input patterns are highly noisy, the quantization into the set of $n_s$ symbols while preserving the original topological ordering can be understood as a form of filtering. The amount of filtering is controlled by $n_s$. If $n_s$ is large, this implies there is little reduction in the noise content of the resulting symbols. On the other hand, if $n_s$ is small, this implies that there is a "heavy" filtering effect, resulting in only a small number of symbols.
5.2 Recurrent Network Prediction

In the past few years several recurrent neural network architectures have emerged which have been used for grammatical inference [10, 23, 21]. The induction of relatively simple
grammars has been addressed often – e.g. [59, 60, 21] on learning Tomita grammars [55]. It has been shown that a particular class of recurrent networks is Turing equivalent [50].

The set of $n_s$ symbols from the output of the SOM are linearly encoded into a single input for the Elman network (e.g. if $n_s = 3$, the single input is either -1, 0, or 1). The linear encoding is important and is justified by the topographical order of the symbols. If no order was defined over the symbols then a separate input should be used for each symbol, resulting in a system with more dimensions and increased difficulty due to the curse of dimensionality. In order to facilitate the training of the recurrent network, an input window of 3 was used, i.e. the last three symbols are presented to 3 input neurons of the recurrent neural network.
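A sketch of the linear encoding, assuming the $n_s$ topographically ordered symbols are spread evenly over $[-1, 1]$ (the exact target values are an assumption; the text only specifies -1, 0, 1 for $n_s = 3$):

```python
import numpy as np

def encode_symbols(symbols, n_s):
    """Map symbols 1..n_s onto a single input spread evenly over [-1, 1]."""
    if n_s == 1:
        return np.zeros(len(symbols))
    return -1.0 + 2.0 * (np.asarray(symbols) - 1) / (n_s - 1)

print(encode_symbols([1, 2, 3], n_s=3))  # [-1.  0.  1.]
```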
5.3 Dealing with Non-stationarity

The approach used to deal with the non-stationarity of the signal in this work is to build models based on a short time period only. This is an intuitively appealing approach because one would expect that any inefficiencies found in the market would typically not last for a long time. The difficulty with this approach is the reduction in the already small number of training data points. The size of the training set controls a noise vs. non-stationarity tradeoff [42]. If the training set is too small, the noise makes it harder to estimate the appropriate mapping. If the training set is too large, the non-stationarity of the data will mean more data with statistics that are less relevant for the task at hand will be used for creating the estimator.

We ran tests on a segment of data separate from the main tests, from which we chose to use models trained on 100 days of data (see appendix A for more details). After training, each model is tested on the next 30 days of data. The entire training/test set window is moved forward 30 days and the process is repeated, as depicted in figure 4. In a real-life situation it would be desirable to test only for the next day, and advance the windows by steps of one day. However, 30 days is used here in order to reduce the amount of computation by a factor of 30, and enable a much larger number of tests to be performed (i.e. 30 times as many predictions are made from the same number of training runs), thereby increasing confidence in the results.
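The moving-window procedure can be sketched as follows, with `fit` and `evaluate` standing in for the SOM plus RNN training and testing stages described above:

```python
def walk_forward(series, fit, evaluate, train_len=100, test_len=30):
    """Train on train_len points, test on the next test_len points,
    then advance the entire window by test_len and repeat."""
    results = []
    start = 0
    while start + train_len + test_len <= len(series):
        train = series[start : start + train_len]
        test = series[start + train_len : start + train_len + test_len]
        model = fit(train)
        results.append(evaluate(model, test))
        start += test_len  # advance both sets by 30 days
    return results
```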
Figure 4. A depiction of the training and test sets used.
5.4 Controlling Overfitting

Highly noisy data has often been addressed using techniques such as weight decay, weight elimination and early stopping to control overfitting [61]. For this problem we use early stopping with a difference: the stopping point is chosen using multiple tests on a separate segment of data, rather than using a validation set in the normal manner. Tests showed that using a validation set in the normal manner seriously affected performance. This is not very surprising because there is a dilemma when choosing the validation set. If the validation set is chosen before the training data (in terms of the temporal order of the series), then the non-stationarity of the series means the validation set may be of little use for predicting performance on the test set. If the validation set is chosen after the training data, a problem occurs again with the non-stationarity – it is desirable to make the validation set as small as possible to alleviate this problem, however the set cannot be made too small because it needs to be a certain size in order to make the results statistically valid. Hence, we chose to control overfitting by setting the training time for an appropriate stopping point a priori. Simulations with a separate segment of data not contained in the main data (see appendix A for more details) were used to select the training time of 500 epochs.

We note that our use of a separate segment of data to select training set size and training time is not expected to be an ideal solution, e.g. the best choice for these parameters may change
significantly over time due to the non-stationarity of the series.
5.5 A Priori Data Probabilities

For some financial data sets, it is possible to obtain good results by simply predicting the most common change in the training data (e.g. noticing that the change in prices is positive more often than it is negative in the training set and always predicting positive for the test set). Although predictability due to such data bias may be worthwhile, it can prevent the discovery of a better solution. Figure 5 shows a simplified depiction of the problem – if the naive solution, A, has a better mean squared error than a random solution then it may be hard to find a better solution, e.g. B. This may be particularly problematic for high noise data although the true nature of the error surface is not yet well understood.

Figure 5. A depiction of a simplified error surface if a naive solution, A, performs well. A better solution, e.g. B, may be difficult to find.

Hence, in all results reported here the models have been prevented from acting on any such bias by ensuring that the number of training examples for the positive and negative cases is equal. The training set actually remains unchanged so that the temporal order is not disturbed, but there is no training signal for the appropriate number of positive or negative examples starting from the beginning of the series. The number of positive and negative examples is typically approximately equal to begin with, therefore the use of this simple technique did not result in the loss of many data points.
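A sketch of this balancing scheme follows; the helper name `balance_mask` is ours, but the logic mirrors the description above: the series stays in temporal order and the training signal is simply withheld for enough majority-class examples, starting from the beginning.

```python
import numpy as np

def balance_mask(targets):
    """targets: array of +1/-1 changes. Returns a boolean mask where
    True means the example contributes a training signal."""
    targets = np.asarray(targets)
    excess = int(np.sum(targets == 1)) - int(np.sum(targets == -1))
    majority = 1 if excess > 0 else -1
    mask = np.ones(len(targets), dtype=bool)
    to_skip = abs(excess)
    for i, c in enumerate(targets):
        if to_skip == 0:
            break
        if c == majority:
            mask[i] = False  # no training signal for this example
            to_skip -= 1
    return mask
```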
5.6 Comparison with a Random Walk

The prediction corresponding to the random walk model is equal to the current price, i.e. no change. A percentage of times the model predicts the correct change (apart from no change) cannot be obtained for the random walk model. However, it is possible to use models with data that corresponds to a random walk and compare the results to those obtained with the real data. If equal performance is obtained on the random walk data as compared to the real data then no significant predictability has been found. These tests were performed by randomizing the classification of the test set. For each original test set five randomized versions were created. The randomized versions were created by repeatedly selecting two random positions in time and swapping the two classification values (the input values remain unchanged). An additional benefit of this technique is that if the model is showing predictability based on any data bias (as discussed in the previous section) then the results on the randomized test sets will also show this predictability, i.e. always predicting a positive change will result in the same performance on the original and the randomized test sets.
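The randomized test sets can be sketched as follows (a minimal interpretation of the description above; the number of swaps is an assumption):

```python
import numpy as np

def randomize_labels(labels, n_swaps=1000, seed=0):
    """Repeatedly swap the labels at two random positions in time;
    the inputs are left unchanged and the class counts are preserved."""
    rng = np.random.default_rng(seed)
    labels = np.array(labels, copy=True)
    for _ in range(n_swaps):
        i, j = rng.integers(len(labels), size=2)
        labels[i], labels[j] = labels[j], labels[i]
    return labels

# Five randomized versions per original test set, as described above.
test_targets = np.sign(np.random.randn(30))
randomized = [randomize_labels(test_targets, seed=s) for s in range(5)]
```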
6 Experimental Results

We present results using the previously described system and alternative systems for comparison. In all cases, the Elman networks contained 5 hidden units and the SOM had one dimension. Further details of the individual simulations can be found in appendix A. Figure 6 and table 1 show the average error rate as the number of nodes in the self-organizing map (SOM) was increased from 2 to 7. Results are shown for SOM embedding dimensions of 1 and 2. Results are shown using both the original data and the randomized data for comparison with a random walk. The results are averaged over the 5 exchange rates with 30 simulations per exchange rate and 30 test points per simulation, for a total of 4500 test points. The best performance was obtained with a SOM size of 7 nodes and an embedding dimension of 2, where the error rate was 47.1%. A t-test indicates that this result is significantly different from the null hypothesis of the randomized test sets at $p = 0.0076$ or 0.76% ($t = 2.67$), indicating that the result is significant. We also considered a null hypothesis corresponding to a model which makes completely random predictions (corresponding to a random walk) – the result is significantly different from this null hypothesis at $p = 0.0054$ or 0.54% ($t = 2.80$). Figure 7 and table 2 show the average results for the SOM size of 7 on the individual exchange rates.
In order to gauge the effectiveness of the symbolic encoding and recurrent neural network, we investigated the following two systems:
Figure 6. The error rate as the number of nodes in the SOM is increased, for SOM embedding dimensions of 1 and 2. Each of the results represents the average over the 5 individual series and the 30 simulations per series, i.e. over 150 simulations which tested 30 points each, or 4500 individual classifications. The results for the randomized series are averaged over the same values with the 5 randomized series per simulation, for a total of 22500 classifications per result (all of the data available was not used due to limited computational resources).
1. The system as presented here without the symbolic encoding, i.e. the preprocessed data is entered directly into the recurrent neural network without the SOM stage.

2. The system as presented here with the recurrent network replaced by a standard MLP network.

Tables 3 and 4 show the results for systems 1 and 2 above. The performance of these systems can be seen to be in-between the performance of the null hypothesis of a random walk and the performance of the hybrid system, indicating that the hybrid system can result in significant improvement.
SOM embedding dimension: 1

SOM size            2     3     4     5     6     7
Error              48.5  47.5  48.3  48.3  48.5  47.6
Randomized error   50.1  50.5  50.4  50.2  49.8  50.3

SOM embedding dimension: 2

SOM size            2     3     4     5     6     7
Error              48.8  47.8  47.9  47.7  47.6  47.1
Randomized error   49.8  50.4  50.4  50.1  49.7  50.2

Table 1. The results as shown in figure 6.
Figure 7. Average results for the five exchange rates using the real test data and the randomized test data. The SOM size is 7 and the results for SOM embedding dimensions of 1 and 2 are shown.
6.1 Reject Performance

We used the following equation in order to calculate a confidence measure for every prediction: $c = y_{\max}(y_{\max} - y_{\min})$, where $y_{\max}$ is the maximum output and $y_{\min}$ is the minimum output (for outputs which have been transformed using the softmax transformation: $y_i = \frac{\exp(u_i)}{\sum_{j=1}^{n_o} \exp(u_j)}$, where $u_i$ are the original outputs, $y_i$ are the transformed outputs, and $n_o$ is the number of outputs). The confidence measure $c$ gives an indication of how confident the network model is for each classification.
Exchange Rate                       British  Canadian  German  Japanese  Swiss
                                    Pound    Dollar    Mark    Yen       Franc
Embedding dimension 1               48.5     46.9      46.0    47.7      48.9
Embedding dimension 2               48.1     46.8      46.9    46.3      47.4
Randomized embedding dimension 1    50.1     50.5      50.4    49.8      50.7
Randomized embedding dimension 2    50.2     50.6      50.5    49.7      49.8

Table 2. Results as shown in figure 7.
Error              49.5
Randomized error   50.1

Table 3. The results for the system without the symbolic encoding, i.e. the preprocessed data is entered directly into the recurrent neural network without the SOM stage.
SOM embedding dimension: 1

SOM size            2     3     4     5     6     7
Error              49.1  49.4  49.3  49.7  49.9  49.8
Randomized error   49.7  50.1  49.9  50.3  50.0  49.9

Table 4. The results for the system using an MLP instead of the RNN.
Thus, for a fixed value of $c$, the predictions obtained by a particular algorithm can be divided into two sets, one set $D$ for those below the threshold, and another set $E$ for those above the threshold. We can then reject those points which are in set $D$, i.e. they are not used in the computation of the classification accuracy.

Figure 8 shows the classification performance of the system (percentage of correct sign predictions) as the confidence threshold is increased and more examples are rejected for classification. Figure 9 shows the same graph for the randomized test sets. It can be seen that a significant increase in performance (down to approximately 40% error) can be obtained by only considering the 5-15% of examples with the highest confidence. This benefit cannot be seen for the randomized test sets. It should be noted that by the time 95% of test points are rejected, the total number of remaining test points is only 225.
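A sketch of this confidence-based rejection (our own naming; the softmax and the confidence measure follow the equations above):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def reject_accuracy(outputs, targets, threshold):
    """outputs: (n, 2) raw network outputs; targets: class indices 0/1.
    Rejects points with confidence c = y_max (y_max - y_min) below threshold."""
    y = softmax(outputs)
    conf = y.max(axis=1) * (y.max(axis=1) - y.min(axis=1))
    keep = conf >= threshold
    if not keep.any():
        return None  # everything rejected at this threshold
    preds = y.argmax(axis=1)
    return float(np.mean(preds[keep] == np.asarray(targets)[keep]))
```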
Figure 8. Classification performance as the confidence threshold is increased. This graph is the average of the 150 individual results.
Figure 9. Classification performance as the confidence threshold is increased for the randomized test sets. This graph is the average of the individual randomized test sets.
7 Automata Extraction

A common complaint with neural networks is that the models are difficult to interpret – i.e. it is not clear how the models arrive at the prediction or classification of a given input pattern [49]. A number of people have considered the extraction of symbolic knowledge from trained neural networks, for both feedforward and recurrent neural networks [12, 44, 57]. For recurrent networks, the ordered triple of a discrete Markov process ({state; input → next state}) can be extracted and used to form an equivalent deterministic finite state automaton (DFA). This can be done by clustering the activation values of the recurrent state neurons [44]. The automata extracted with this process can only recognize regular grammars⁵.

⁵ A regular grammar $G$ is a 4-tuple $G = \{S, N, T, P\}$ where $S$ is the start symbol, $N$ and $T$ are non-terminal and terminal symbols, respectively, and $P$ represents productions of the form $A \to a$ or $A \to aB$ where $A, B \in N$ and $a \in T$.
The algorithm used for automata extraction is the same as that described in [21]. The algorithm is based on the observation that the activations of the recurrent state neurons in a trained network tend to cluster. The output of each of the state neurons is divided into $q$ intervals of equal size, yielding $q^{n_h}$ partitions in the space of outputs of the state neurons, where $n_h$ is the number of state neurons. Starting from an initial state, the input symbols are considered in order. If an input symbol causes a transition to a new partition in the activation space then a new DFA state is created with a corresponding transition on the appropriate symbol. In the event that different activation state vectors belong to the same partition, the state vector which first reached the partition is chosen as the new initial state for subsequent symbols. The extraction would be computationally infeasible if all $q^{n_h}$ partitions had to be visited, but the clusters corresponding to DFA states often cover a small percentage of the input space.
Sample deterministic finite state automata (DFA) extracted from trained financial prediction networks using a quantization level of 5 can be seen in figure 10. The DFAs have been minimized using standard minimization techniques [27]. The DFAs were extracted from networks where the SOM embedding dimension was 1 and the SOM size was 2. As expected, the DFAs extracted in this case are relatively simple. In all cases, the SOM symbols end up representing positive and negative changes in the series – transitions marked with a solid line correspond to positive changes and transitions marked with a dotted line correspond to negative changes. For all DFAs, state 1 is the starting state. Double circled states correspond to prediction of a positive change, and states without the double circle correspond to prediction of a negative change. (a) can be seen as corresponding to mean-reverting behavior – i.e. if the last change in the series was negative then the DFA predicts that the next change will be positive and vice versa. (b) and (c) can be seen as variations on (a) where there are exceptions to the simple rules of (a), e.g. in (b) a negative change corresponds to a positive prediction, however after the input changes from negative to positive, the prediction alternates with successive positive changes. In contrast with (a)-(c), (d)-(f) exhibit behavior related to trend following algorithms, e.g. in (d) a positive change corresponds to a positive prediction, and a negative change corresponds to a negative prediction except for the first negative change after a positive change. Figure 11 shows a sample DFA for a SOM size of 3 – more complex behavior can be seen in this case.
Figure 10. Sample deterministic finite state automata (DFA) extracted from trained financial prediction networks using a quantization level of 5.
Figure 11. A sample DFA extracted from a network where the SOM size was 3 (and the SOM embedding dimension was 1). The SOM symbols represent a positive (solid lines) or negative (dotted lines) change, or a change close to zero (gray line). In this case, the rules are more complex than the simple rules in the previous figure.
8 Conclusions

Traditional learning methods have difficulty with high noise, high non-stationarity time series prediction. The intelligent signal processing method presented here uses a combination of symbolic processing and recurrent neural networks to deal with fundamental difficulties. The use of a recurrent neural network is significant for two reasons: firstly, the temporal relationship of the series is explicitly modeled via internal states, and secondly, it is possible to extract rules from the trained recurrent networks in the form of deterministic finite state automata. The symbolic conversion is significant for a number of reasons: the quantization effectively filters the data or reduces the noise, the RNN training becomes more effective, and the symbolic input facilitates the extraction of rules from the trained networks. Considering foreign exchange rate prediction, we performed a comprehensive set of simulations covering 5 different foreign exchange rates, which showed that significant predictability does exist with a 47.1% error rate in the prediction of the direction of the change. Additionally, the error rate reduced to around 40% when rejecting examples where the system had low confidence in its prediction. The method facilitates the extraction of rules which explain the operation of the system. Results for the problem considered here are positive and show that symbolic knowledge which has meaning to humans can be extracted from noisy time series data. Such information can be very useful and important to the user of the technique.

We have shown that a particular recurrent network combined with symbolic methods can generate useful predictions and rules for such predictions on a non-stationary time series. It would be interesting to compare this approach with other neural networks that have been used in time series analysis [6, 31, 43, 45, 48], and other machine learning methods [19, 51], some of which can also generate rules. It would also be interesting to incorporate this model into a trading policy.
Appendix A: Simulation Details

For the experiments reported in this paper, $n_h = 5$ and $n_o = 2$. The RNN learning rate was linearly reduced over the training period (see [11] regarding learning rate schedules) from an initial value of 0.5. All inputs were normalized to zero mean and unit variance. All nodes
included a bias input which was part of the optimization process. Weights were initialized as shown in Haykin [26]. Target outputs were -0.8 and 0.8 using the $\tanh$ output activation function and we used the quadratic cost function. In order to initialize the delays in the network, the RNN was run without training for 50 time steps prior to the start of the data sets. The number of training examples for the positive and negative cases was made equal by not training on the appropriate number of positive or negative examples starting from the beginning of the training set (i.e. the temporal order of the series is not disturbed but there is no training signal for either a number of positive or a number of negative examples starting at the beginning of the series). The SOM was trained for 10,000 iterations using a learning rate schedule which decreased linearly with $n_c/n_t$, where $n_c$ is the current iteration number and $n_t$ is the total number of iterations (10,000). The neighborhood size was also reduced during training as a function of $n_c/n_t$. The first differenced data was split as follows: Points 1210 to 1959 (plus the size of the validation set in the validation set experiments) in each of the exchange rates were used for the selection of the training data set size, the training time, and experimentation with the use of a validation set. Points 10 to 1059 in each of the exchange rates were used for the main tests. The data for each successive simulation was offset by the size of the test set, 30 points, as shown in figure 4. The number of points in the main tests from each exchange rate is equal to 1,050 which is 50 (used to initialize delays) + 100 (training set size) + 30 × 30 (30 test sets of size 30). For the selection of the training data set size and the training time the number of points was 750 for each exchange rate (50 + 100 + 20 test sets of size 30). For the experiments using a validation set in each simulation, the number of points was 750 + the size of the validation set for each of the exchange rates.
Acknowledgments

We would like to thank Doyne Farmer, Christian Omlin, Andreas Weigend, and the anonymous referees for helpful comments.
References

[1] Machine learning. International Journal of Intelligent Systems in Accounting, Finance and Management, 4(2), 1995. Special Issue.
[2] Proceedings of IEEE/IAFE Conference on Computational Intelligence for Financial Engineering (CIFEr), 1996–1999.

[3] Y.S. Abu-Mostafa. Learning from hints in neural networks. Journal of Complexity, 6:192, 1990.

[4] J.A. Alexander and M.C. Mozer. Template-based algorithms for connectionist rule extraction. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 609–616. The MIT Press, 1995.

[5] Martin Anthony and Norman L. Biggs. A computational learning theory view of economic forecasting with neural nets. In A. Refenes, editor, Neural Networks in the Capital Markets. John Wiley and Sons, 1995.

[6] A.D. Back and A.C. Tsoi. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3(3):375–385, 1991.

[7] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

[8] R.E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.

[9] Y. Bengio. Neural Networks for Speech and Sequence Recognition. Thomson, 1996.

[10] A. Cleeremans, D. Servan-Schreiber, and J.L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372–381, 1989.

[11] C. Darken and J.E. Moody. Note on learning rate schedules for stochastic optimization. In R.P. Lippmann, J.E. Moody, and D.S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 832–838. Morgan Kaufmann, San Mateo, CA, 1991.

[12] J.S. Denker, D. Schwartz, B. Wittner, S.A. Solla, R. Howard, L. Jackel, and J. Hopfield. Large automatic learning, rule extraction, and generalization. Complex Systems, 1:877–922, 1987.

[13] J.L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2/3):195–226, 1991.
[14] E.F. Fama. The behaviour of stock market prices. Journal of Business, January:34–105, 1965.

[15] E.F. Fama. Efficient capital markets: A review of theory and empirical work. Journal of Finance, May:383–417, 1970.

[16] András Faragó and Gábor Lugosi. Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146–1151, 1993.

[17] J.H. Friedman. An overview of predictive learning and function approximation. In V. Cherkassky, J.H. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks, Theory and Pattern Recognition Applications, volume 136 of NATO ASI Series F, pages 1–61. Springer, 1994.

[18] J.H. Friedman. Introduction to computational learning and statistical prediction. Tutorial presented at Neural Information Processing Systems, Denver, CO, 1995.

[19] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. Machine Learning, 29:245, 1997.

[20] C. Lee Giles, B.G. Horne, and T. Lin. Learning a class of large finite state machines with a recurrent neural network. Neural Networks, 8(9):1359–1365, 1995.

[21] C. Lee Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393–405, 1992.

[22] C. Lee Giles, C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen, and Y.C. Lee. Extracting and learning an unknown grammar with recurrent neural networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 317–324, San Mateo, CA, 1992. Morgan Kaufmann Publishers.

[23] C. Lee Giles, G.Z. Sun, H.H. Chen, Y.C. Lee, and D. Chen. Higher order recurrent networks & grammatical inference. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 380–387, San Mateo, CA, 1990. Morgan Kaufmann Publishers.

[24] C.W.J. Granger and P. Newbold. Forecasting Economic Time Series. Academic Press, San Diego, second edition, 1986.
[25] Yoichi Hayashi. A neural expert system with automated extraction of fuzzy if-then rules. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 578–584. Morgan Kaufmann Publishers, 1991.

[26] S. Haykin. Neural Networks, A Comprehensive Foundation. Macmillan, New York, NY, 1994.

[27] J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, MA, 1979.

[28] L. Ingber. Statistical mechanics of nonlinear nonequilibrium financial markets: Applications to optimized trading. Mathematical Computer Modelling, 1996.

[29] M. Ishikawa. Rule extraction by successive regularization. In International Conference on Neural Networks, pages 1139–1143. IEEE Press, 1996.

[30] M.I. Jordan. Neural networks. In A. Tucker, editor, CRC Handbook of Computer Science. CRC Press, Boca Raton, FL, 1996.

[31] A. Kehagias and V. Petridis. Time series segmentation using predictive modular neural networks. Neural Computation, 9:1691–1710, 1997.

[32] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, Germany, 1995.

[33] A.N. Kolmogorov. On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition. Dokl, 114:679–681, 1957.

[34] V. Kůrková. Kolmogorov's theorem is relevant. Neural Computation, 3(4):617–622, 1991.

[35] V. Kůrková. Kolmogorov's theorem. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 501–502. MIT Press, Cambridge, Massachusetts, 1995.

[36] Steve Lawrence, C. Lee Giles, and Sandiway Fong. Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering. Accepted for publication.
[37] Jean Y. Lequarré. Foreign currency dealing: A brief introduction (data set C). In A.S. Weigend and N.A. Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1993.

[38] Andrew W. Lo and A. Craig MacKinlay. A Non-Random Walk Down Wall Street. Princeton University Press, 1999.

[39] B.G. Malkiel. Efficient Market Hypothesis. Macmillan, London, 1987.

[40] B.G. Malkiel. A Random Walk Down Wall Street. Norton, New York, NY, 1996.

[41] C. McMillan, M.C. Mozer, and P. Smolensky. Rule induction through integrated symbolic and subsymbolic processing. In John E. Moody, Steve J. Hanson, and Richard P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 969–976. Morgan Kaufmann Publishers, 1992.

[42] J.E. Moody. Economic forecasting: Challenges and neural network solutions. In Proceedings of the International Symposium on Artificial Neural Networks. Hsinchu, Taiwan, 1995.

[43] S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using support vector machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, IEEE Workshop on Neural Networks for Signal Processing VII, page 511. IEEE Press, 1997.

[44] C.W. Omlin and C.L. Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52, 1996.

[45] J.C. Principe and J-M. Kuo. Dynamic modelling of chaotic time series with neural networks. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, 7, pages 311–318. MIT Press, 1995.

[46] A. Refenes, editor. Neural Networks in the Capital Markets. John Wiley and Sons, New York, NY, 1995.

[47] D.W. Ruck, S.K. Rogers, K. Kabrisky, M.E. Oxley, and B.W. Suter. The multilayer perceptron as an approximation to an optimal Bayes estimator. IEEE Transactions on Neural Networks, 1(4):296–298, 1990.
[48] E.W. Saad, Danil V. Prokhorov, and D.C. Wunsch II. Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks. IEEE Transactions on Neural Networks, 9(6):1456, November 1998.

[49] R. Setiono and H. Liu. Symbolic representation of neural networks. IEEE Computer, 29(3):71–77, 1996.

[50] H.T. Siegelmann and E.D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995.

[51] S.M. Weiss and N. Indurkhya. Rule-based machine learning methods for functional prediction. Journal of Artificial Intelligence Research, 1995.

[52] F. Takens. Detecting strange attractors in turbulence. In D.A. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, pages 366–381, Berlin, 1981. Springer-Verlag.

[53] S.J. Taylor, editor. Modelling Financial Time Series. J. Wiley & Sons, Chichester, 1994.

[54] A.B. Tickle, R. Andrews, M. Golea, and J. Diederich. The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9(6):1057, November 1998.

[55] M. Tomita. Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, pages 105–108, Ann Arbor, MI, 1982.

[56] G. Towell and J. Shavlik. The extraction of refined rules from knowledge based neural networks. Machine Learning, 13:71–101, 1993.

[57] Geoffrey G. Towell and Jude W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71, 1993.

[58] George Tsibouris and Matthew Zeidenberg. Testing the efficient markets hypothesis with gradient descent algorithms. In A. Refenes, editor, Neural Networks in the Capital Markets. John Wiley and Sons, 1995.
[59] R.L. Watrous and G.M. Kuhn. Induction of finite state languages using second-order recurrent networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 309–316, San Mateo, CA, 1992. Morgan Kaufmann Publishers.

[60] R.L. Watrous and G.M. Kuhn. Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4(3):406, 1992.

[61] A.S. Weigend, B.A. Huberman, and D.E. Rumelhart. Predicting sunspots and exchange rates with connectionist networks. In M. Casdagli and S. Eubank, editors, Nonlinear Modeling and Forecasting, SFI Studies in the Sciences of Complexity, Proceedings Vol. XII, pages 395–432. Addison-Wesley, 1992.

[62] G.U. Yule. On a method of investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philosophical Transactions Royal Society London Series A, 226:267–298, 1927.

[63] Z. Zeng, R.M. Goodman, and P. Smyth. Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5(2):320–330, 1994.