
Noisy Time Series Prediction using a Recurrent Neural Network and Grammatical Inference

C. Lee Giles, Steve Lawrence, Ah Chung Tsoi

NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
Faculty of Informatics, University of Wollongong, NSW 2522 Australia

Abstract

Financial forecasting is an example of a signal processing problem which is challenging due to small sample sizes, high noise, non-stationarity, and non-linearity. Neural networks have been very successful in a number of signal processing applications. We discuss fundamental limitations and inherent difficulties when using neural networks for the processing of high noise, small sample size signals. We introduce a new intelligent signal processing method which addresses the difficulties. The method proposed uses conversion into a symbolic representation with a self-organizing map, and grammatical inference with recurrent neural networks. We apply the method to the prediction of daily foreign exchange rates, addressing difficulties with non-stationarity, overfitting, and unequal a priori class probabilities, and we find significant predictability in comprehensive experiments covering 5 different foreign exchange rates. The method correctly predicts the direction of change for the next day with an error rate of 47.1%. The error rate reduces to around 40% when rejecting examples where the system has low confidence in its prediction. We show that the symbolic representation aids the extraction of symbolic knowledge from the trained recurrent neural networks in the form of deterministic finite state automata. These automata explain the operation of the system and are often relatively simple. Automata rules related to well known behavior such as trend following and mean reversal are extracted.

Lee Giles is also with the Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742.

1 Introduction

1.1 Predicting Noisy Time Series Data

The prediction of future events from noisy time series data is commonly done using various forms of statistical models [24]. Typical neural network models are closely related to statistical models, and estimate Bayesian a posteriori probabilities when given an appropriately formulated problem [47]. Neural networks have been very successful in a number of pattern recognition applications. For noisy time series prediction, neural networks typically take a delay embedding of previous inputs1 which is mapped into a prediction. For examples of these models and the use of neural networks and other machine learning methods applied to finance see [2, 1, 46]. However, high noise, high non-stationarity time series prediction is fundamentally difficult for these models:

1. The problem of learning from examples is fundamentally ill-posed, i.e. there are infinitely many models which fit the training data well, but few of these generalize well. In order to form a more accurate model, it is desirable to use as large a training set as possible. However, for the case of highly non-stationary data, increasing the size of the training set results in more data with statistics that are less relevant to the task at hand being used in the creation of the model.

2. The high noise and small datasets make the models prone to overfitting. Random correlations between the inputs and outputs can present great difficulty. The models typically do not explicitly address the temporal relationship of the inputs – e.g. they do not distinguish between those correlations which occur in temporal order, and those which do not.

1 For example, x(t), x(t − 1), . . .

In our approach we consider recurrent neural networks (RNNs). Recurrent neural networks employ feedback connections and have the potential to represent certain computational structures in a more parsimonious fashion [13]. RNNs address the temporal relationship of their inputs by maintaining an internal state. RNNs are biased towards learning patterns which occur in temporal order – i.e. they are less prone to learning random correlations which do not occur in temporal order.

However, training RNNs tends to be difficult with high noise data, with a tendency for long-term dependencies to be neglected (experiments reported in [9] found a tendency for recurrent networks to take into account short-term dependencies but not long-term dependencies), and for the network to fall into a naive solution such as always predicting the most common output. We address this problem by converting the time series into a sequence of symbols using a self-organizing map (SOM). The problem then becomes one of grammatical inference – the prediction of a given quantity from a sequence of symbols. It is then possible to take advantage of the known capabilities of recurrent neural networks to learn grammars [22, 63] in order to capture any predictability in the evolution of the series.

The use of a recurrent neural network is significant for two reasons: firstly, the temporal relationship of the series is explicitly modeled via internal states, and secondly, it is possible to extract rules from the trained recurrent networks in the form of deterministic finite state automata [44].2 The symbolic conversion is significant for a number of reasons: the quantization effectively filters the data or reduces the noise, the RNN training becomes more effective, and the symbolic input facilitates the extraction of rules from the trained networks. Furthermore, it can be argued that the instantiation of rules from predictors not only gives the user confidence in the prediction model but facilitates the use of the model and can lead to a deeper understanding of the process [54].

Financial time series data typically contains high noise and significant non-stationarity. In this paper, the noisy time series prediction problem considered is the prediction of foreign exchange rates. A brief overview of foreign exchange rates is presented in the next section.

2 Rules can also be extracted from feedforward networks [25, 41, 56, 4, 49, 29], and other types of recurrent neural networks [20]; however the recurrent network approach and deterministic finite state automata extraction seem particularly suitable for a time series problem.

1.2 Foreign Exchange Rates

The foreign exchange market as of 1997 is the world's largest market, with more than $US 1.3 trillion changing hands every day. The most important currencies in the market are the US dollar (which acts as a reference currency), the Japanese Yen, the British Pound, the German Mark, and the Swiss Franc [37]. Foreign exchange rates exhibit very high noise, and significant non-stationarity. Many financial institutions evaluate prediction algorithms using the percentage of times that the algorithm predicts the right trend from today until some time in the future [37]. Hence, this paper considers the prediction of the direction of change in foreign exchange rates for the next business day.

The remainder of this paper is organized as follows: section 2 describes the data set we have used, section 3 discusses fundamental issues such as ill-posed problems, the curse of dimensionality, and vector space assumptions, and section 4 discusses the efficient market hypothesis and prediction models. Section 5 details the system we have used for prediction. Comprehensive test results are contained in section 6, and section 7 considers the extraction of symbolic data. Section 8 provides concluding remarks, and Appendix A provides full simulation details.

2 Exchange Rate Data

The data we have used is publicly available on the web. It was first used by Weigend et al. [61]. The data consists of daily closing bids for five currencies (German Mark (DM), Japanese Yen, Swiss Franc, British Pound, and Canadian Dollar) with respect to the US Dollar and is from the Monetary Yearbook of the Chicago Mercantile Exchange. There are 3645 data points for each exchange rate covering the period September 3, 1973 to May 18, 1987. In contrast to Weigend et al., this work considers the prediction of all five exchange rates in the data and prediction for all days of the week instead of just Mondays.

3 Fundamental Issues

This section briefly discusses fundamental issues related to learning by example using techniques such as multi-layer perceptrons (MLPs): ill-posed problems, the curse of dimensionality, and vector space assumptions. These help form the motivation for our new method.

The problem of inferring an underlying probability distribution from a finite set of data is fundamentally an ill-posed problem because there are infinitely many solutions [30], i.e. there are infinitely many functions which fit the training data exactly but differ in other regions of the input space. The problem becomes well-posed only when additional constraints are used. For example, the constraint that a limited size MLP will be used might be imposed. In this case, there may be a unique solution which best fits the data from the range of solutions which the model can represent. The implicit assumption behind this constraint is that the underlying target function is "smooth". Smoothness constraints are commonly used by learning algorithms [17]. Without smoothness or some other constraint, there is no reason to expect a model to perform well on unseen inputs (i.e. generalize well).

Learning by example often operates in a vector space, i.e. a function mapping R^n to R^m is approximated. However, a problem with vector spaces is that the representation does not reflect any additional structure within the data. Measurements that are related (e.g. from neighboring pixels in an image or neighboring time steps in a time series) and measurements that do not have such a relationship are treated equally. These relationships can only be inferred in a statistical manner from the training examples.

The use of a recurrent neural network instead of an MLP with a window of time delayed inputs introduces another assumption by explicitly addressing the temporal relationship of the inputs via the maintenance of an internal state. This can be seen as similar to the use of hints [3] and should help to make the problem less ill-posed.

The curse of dimensionality refers to the exponential growth of hypervolume as a function of dimensionality [8]. Consider x ∈ R^n. The regression, y = f(x), is a hypersurface in R^(n+1). If f(x) is arbitrarily complex and unknown then dense samples are required to approximate the function accurately. However, it is hard to obtain dense samples in high dimensions.3

3 Kolmogorov's theorem shows that any continuous function of n dimensions can be completely characterized by a one dimensional continuous function. Specifically, Kolmogorov's theorem [33, 34, 35, 18] states that for any continuous function f of n variables there exist continuous one dimensional functions Φ_q and ψ_pq such that f(x_1, . . . , x_n) = Σ_{q=1}^{2n+1} Φ_q(Σ_{p=1}^{n} ψ_pq(x_p)).

This is the "curse of dimensionality" [8, 17, 18]. The relationship between the sampling density and the number of points required is proportional to N^(1/d) [18], where d is the dimensionality of the input space and N is the number of points. Thus, if N_1 is the number of points for a given sampling density in 1 dimension, then in order to keep the same density as the dimensionality is increased, the number of points must increase according to N_1^d.

It has been suggested that MLPs do not suffer from the curse of dimensionality [16, 7]. However, this is not true in general (although MLPs may cope better than other models). The apparent avoidance of the curse of dimensionality in [7] is due to the fact that the function spaces considered are more and more constrained as the input dimension increases [34]. Similarly, smoothness conditions must be satisfied for the results of [16].

The use of a recurrent neural network is important from the viewpoint of the curse of dimensionality because the RNN can take into account greater history of the input. Trying to take into account greater history with an MLP by increasing the number of delayed inputs results in an increase in the input dimension. This is problematic, given the small number of data points available. The use of a self-organizing map for the symbolic conversion is also important from the viewpoint of the curse of dimensionality. As will be seen later, the topographical order of the SOM allows encoding a one dimensional SOM into a single input for the RNN.


4 The Efficient Market Hypothesis

The Efficient Market Hypothesis (EMH) was developed in 1965 by E. Fama [14, 15] and found broad acceptance in the financial community [39, 58, 5]. The EMH, in its weak form, asserts that the price of an asset reflects all of the information that can be obtained from past prices of the asset, i.e. the movement of the price is unpredictable. The best prediction for a price is the current price and the actual prices follow what is called a random walk. The EMH is based on the assumption that all news is promptly incorporated in prices; since news is unpredictable (by definition), prices are unpredictable. Much effort has been expended trying to prove or disprove the EMH. Current opinion is that the theory has been disproved [53, 28], and much evidence suggests that the capital markets are not efficient [40, 38].

If the EMH was true, then a financial series could be modeled as the addition of a noise component at each step:

y(t + 1) = y(t) + ε(t)    (2)

where ε(t) is a zero mean Gaussian variable with variance σ². The best estimation is:

ŷ(t + 1) = y(t)    (3)

In other words, if the series is truly a random walk, then the best estimate for the next time period is equal to the current estimate. Now, if it is assumed that there is a predictable component of the series then it is possible to use:

y(t + 1) = y(t) + f(y(t), y(t − 1), . . . , y(t − n + 1)) + ε(t)    (4)

where ε(t) is a zero mean Gaussian variable with variance σ², and f(·) is a non-linear function in its arguments. In this case, the best estimate is given by:

ŷ(t + 1) = y(t) + f(y(t), y(t − 1), . . . , y(t − n + 1))    (5)

Prediction using this model is problematic as the series often contains a trend. For example, a neural network trained on section A in figure 1 has little chance of generalizing to the test data in section B, because the model was not trained with data in the range covered by that section.

Figure 1. An example of the difficulty in using a model trained on a raw price series. Generalization to section B from the training data in section A is difficult because the model has not been trained with inputs in the range covered by section B.

A common solution to this is to use a model which is based on the first order differences (equation 7) instead of the raw time series [24]:

δ(t + 1) = f(δ(t), δ(t − 1), . . . , δ(t − n + 1)) + ν(t)    (6)

where

δ(t + 1) ≜ y(t + 1) − y(t)    (7)

and ν(t) is a zero mean Gaussian variable with variance σ². In this case the best estimate is:

δ̂(t + 1) = f(δ(t), δ(t − 1), . . . , δ(t − n + 1))    (8)

5 System Details

A high-level block diagram of the system used is shown in figure 2.

Figure 2. A high-level block diagram of the system used.

The raw time series values are y(t), t = 1, 2, . . . , N, where y(t) ∈ R. These denote the daily closing bids of the financial time series. The first difference of the series, y(t), is taken as follows:

δ(t) = y(t) − y(t − 1)    (9)

This produces δ(t), δ(t) ∈ R, t = 1, 2, . . . , N − 1. In order to compress the dynamic range of the series and reduce the effect of outliers, a log transformation of the data is used:

x(t) = sign(δ(t)) log(|δ(t)| + 1)    (10)

resulting in x(t), t = 1, 2, . . . , N − 1, x(t) ∈ R. As is usual, a delay embedding of this series is then considered [62]:

X_1(t, d_1) = (x(t), x(t − 1), x(t − 2), . . . , x(t − d_1 + 1))    (11)

where d_1 is the delay embedding dimension and is equal to 1 or 2 for the experiments reported here. X_1(t, d_1) is a state vector. This delay embedding forms the input to the SOM. Hence, the SOM input is a vector of the last d_1 values of the log transformed differenced time series. The output of the SOM is the topographical location of the winning node. Each node represents one symbol in the resulting grammatical inference problem. A brief description of the self-organizing map is contained in the next section.

The SOM can be represented by the following equation:

S(t) = g(X_1(t, d_1))    (12)

where S(t) ∈ {1, 2, 3, . . . , n_s}, and n_s is the number of symbols (nodes) for the SOM. Each node in the SOM has been assigned an integer index ranging from 1 to the number of nodes.
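The preprocessing and symbolic conversion of equations (9)-(12) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the `codebook` argument stands in for the reference vectors of a SOM trained as described in section 5.1, and the stand-in data is random.

```python
import numpy as np

def preprocess(y):
    """Eq. 9 and 10: first difference, then a signed log transform to
    compress the dynamic range and reduce the effect of outliers."""
    delta = np.diff(y)                                    # delta(t) = y(t) - y(t-1)
    return np.sign(delta) * np.log(np.abs(delta) + 1.0)

def delay_embed(x, d1):
    """Eq. 11: each row holds the last d1 values (x(t), x(t-1), ..., x(t-d1+1))."""
    return np.array([x[t - d1 + 1:t + 1][::-1] for t in range(d1 - 1, len(x))])

def to_symbols(embedded, codebook):
    """Eq. 12: the symbol S(t) is the index of the closest SOM node."""
    dists = np.linalg.norm(embedded[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Example with stand-in data and a stand-in codebook (7 nodes, d1 = 2).
y = np.cumsum(np.random.randn(200)) + 100.0
symbols = to_symbols(delay_embed(preprocess(y), 2), np.random.randn(7, 2))
```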

An Elman recurrent neural network is then used which is trained on the sequence of outputs from the SOM. (Elman refers to the topology of the network – full backpropagation through time was used as opposed to the truncated version used by Elman.) The Elman network was chosen because it is suitable for the problem (a grammatical inference style problem) [13], and because it has been shown to perform well in comparison to other recurrent architectures (e.g. see [36]). The Elman neural network has feedback from each of the hidden nodes to all of the hidden nodes, as shown in figure 3. For the Elman network:

O(t + 1) = Cᵀ z_t + c_0    (13)

and

z_t = F_{n_h}(A z_{t−1} + B u_t + b)    (14)

where C is an n_h × n_o matrix representing the weights from the hidden layer to the output nodes, n_h is the number of hidden nodes, n_o is the number of output nodes, c_0 is a constant bias vector, and z_t, z_t ∈ R^{n_h}, is an n_h × 1 vector denoting the outputs of the hidden layer neurons. u_t is a d_E × 1 vector as follows, where d_E is the embedding dimension used for the recurrent neural network:

u_t = [S(t), S(t − 1), S(t − 2), . . . , S(t − d_E + 1)]ᵀ    (15)

A and B are matrices of appropriate dimensions which represent the feedback weights from the hidden nodes to the hidden nodes and the weights from the input layer to the hidden layer respectively. F_{n_h} is an n_h × 1 vector containing the sigmoid functions, and b is an n_h × 1 vector denoting the bias of each hidden layer neuron. O(t) is an n_o × 1 vector containing the outputs of the network. n_o is 2 throughout this paper. The first output is trained to predict the probability of a positive change, the second output is trained to predict the probability of a negative change.

Thus, for the complete system:

O(t + 1) = F_1(δ(t), δ(t − 1), δ(t − 2), δ(t − 3), δ(t − 4))    (16)

which can be considered in terms of the original series:

O(t + 1) = F_2(y(t), y(t − 1), y(t − 2), y(t − 3), y(t − 4), y(t − 5))    (17)
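A sketch of the forward computation in equations (13)-(15) is given below (NumPy, with the assumed shapes stated in the comments). It shows only the recurrence; the networks in this paper were trained with full backpropagation through time, which is not shown here.

```python
import numpy as np

def elman_forward(u_seq, A, B, C, b, c0):
    """Run the Elman recurrence over a sequence of input vectors u_t (eq. 15).
    A: (n_h, n_h) hidden-to-hidden weights, B: (n_h, d_E) input-to-hidden weights,
    C: (n_h, n_o) hidden-to-output weights, b: (n_h,) hidden bias, c0: (n_o,) output bias."""
    z = np.zeros(A.shape[0])                              # hidden state z_t, initially zero
    outputs = []
    for u in u_seq:
        z = 1.0 / (1.0 + np.exp(-(A @ z + B @ u + b)))    # eq. 14: sigmoid hidden layer
        outputs.append(C.T @ z + c0)                      # eq. 13: linear output layer
    return np.array(outputs)                              # shape (T, n_o); here n_o = 2
```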

Figure 3. The pre-processed, delay embedded series is converted into symbols using a self-organizing map. An Elman neural network is trained on the sequence of symbols.

Note that this is not a static mapping due to the dependence on the hidden state of the recurrent neural network.

The following sections describe the self-organizing map, recurrent network grammatical inference, dealing with non-stationarity, controlling overfitting, a priori class probabilities, and a method for comparison with a random walk.

5.1 The Self-Organizing Map

5.1.1 Introduction

The self-organizing map, or SOM [32], is an unsupervised learning process which learns the distribution of a set of patterns without any class information. A pattern is projected from a possibly high dimensional input space to a position in the map, a low dimensional display space. The display space is often divided into a grid and each intersection of the grid is represented in the network by a neuron. The information is encoded as the location of an activated neuron. The SOM is unlike most classification or clustering techniques in that it attempts to preserve the topological ordering of the classes in the input space in the resulting display space. In other words, for similarity as measured using a metric in the input space, the SOM attempts to preserve the similarity in the display space.

5.1.2 Algorithm

We give a brief description of the SOM algorithm; for more details see [32]. The SOM defines a mapping from an input space R^n onto a topologically ordered set of nodes, usually in a lower dimensional display space. A reference vector, m_i = [μ_i1, μ_i2, . . . , μ_in]ᵀ ∈ R^n, is assigned to each node in the SOM. Assume that there are M nodes. During training, each input, x ∈ R^n, is compared to m_i, i = 1, 2, . . . , M, obtaining the location of the closest match (‖x − m_c‖ = min_i ‖x − m_i‖). The input point is mapped to this location in the SOM. Nodes in the SOM are updated according to:

m_i(t + 1) = m_i(t) + h_ci(t)[x(t) − m_i(t)]    (18)

where t is the time during learning and h_ci(t) is the neighborhood function, a smoothing kernel which is maximum at m_c. Usually, h_ci(t) = h(‖r_c − r_i‖, t), where r_c and r_i represent the locations of nodes in the SOM output space. r_c is the node with the closest weight vector to the input sample and r_i ranges over all nodes. h_ci(t) approaches 0 as ‖r_c − r_i‖ increases and also as t approaches ∞. A widely applied neighborhood function is:

h_ci = α(t) exp(−‖r_c − r_i‖² / (2σ²(t)))    (19)

where α(t) is a scalar valued learning rate and σ(t) defines the width of the kernel. They are generally both monotonically decreasing with time [32]. The use of the neighborhood function means that nodes which are topographically close in the SOM structure are moved towards the input pattern along with the winning node. This creates a smoothing effect which leads to a global ordering of the map. Note that σ(t) should not be reduced too far as the map will lose its topographical order if neighboring nodes are not updated along with the closest node. The SOM can be considered a non-linear projection of the probability density, p(x), of the input patterns x onto a (typically) smaller output space [32].
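A compact sketch of the training loop implied by equations (18) and (19) for a one-dimensional map is shown below. The linear decay of the learning rate and neighborhood width is an assumption made for illustration only; the schedules actually used are described in Appendix A.

```python
import numpy as np

def train_som_1d(data, n_nodes, n_iters, alpha0=0.1, rng=None):
    """One-dimensional SOM: winner search, then the update of eq. 18 weighted by
    the Gaussian neighborhood of eq. 19."""
    rng = np.random.default_rng() if rng is None else rng
    codebook = rng.normal(size=(n_nodes, data.shape[1]))       # reference vectors m_i
    grid = np.arange(n_nodes, dtype=float)                     # node locations r_i
    for t in range(n_iters):
        frac = 1.0 - t / n_iters
        alpha = alpha0 * frac                                  # learning rate alpha(t)
        sigma = max((n_nodes / 2.0) * frac, 0.5)               # kernel width sigma(t)
        x = data[rng.integers(len(data))]                      # one training pattern
        c = np.argmin(np.linalg.norm(codebook - x, axis=1))    # winning node index
        h = alpha * np.exp(-((grid - grid[c]) ** 2) / (2.0 * sigma ** 2))  # eq. 19
        codebook += h[:, None] * (x - codebook)                # eq. 18
    return codebook
```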

5.1.3 Remarks

The nodes in the display space encode the information contained in the input space R^n. Since there are M nodes, this implies that the input pattern vectors x ∈ R^n are transformed to a set of M symbols, while preserving their original topological ordering in R^n. Thus, if the original input patterns are highly noisy, the quantization into the set of M symbols while preserving the original topological ordering can be understood as a form of filtering. The amount of filtering is controlled by M. If M is large, this implies there is little reduction in the noise content of the resulting symbols. On the other hand, if M is small, this implies that there is a "heavy" filtering effect, resulting in only a small number of symbols.

5.2 Recurrent Network Prediction

In the past few years several recurrent neural network architectures have emerged which have been used for grammatical inference [10, 23, 21]. The induction of relatively simple grammars has been addressed often – e.g. [59, 60, 21] on learning Tomita grammars [55]. It has been shown that a particular class of recurrent networks are Turing equivalent [50].

The set of n_s symbols from the output of the SOM are linearly encoded into a single input for the Elman network (e.g. if n_s = 3, the single input is either -1, 0, or 1). The linear encoding is important and is justified by the topographical order of the symbols. If no order was defined over the symbols then a separate input should be used for each symbol, resulting in a system with more dimensions and increased difficulty due to the curse of dimensionality. In order to facilitate the training of the recurrent network, an input window of 3 was used, i.e. the last three symbols are presented to 3 input neurons of the recurrent neural network.
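The linear encoding and the 3-symbol input window can be sketched as follows. The exact scaling of the encoding is not specified beyond the n_s = 3 example, so the mapping onto [-1, 1] below is an assumption consistent with that example.

```python
import numpy as np

def encode_symbols(symbols, n_symbols):
    """Map SOM node indices 0..n_symbols-1 onto a single, topographically ordered
    input value in [-1, 1] (for n_symbols = 3 this gives -1, 0, 1)."""
    return 2.0 * np.asarray(symbols) / (n_symbols - 1) - 1.0

def input_windows(encoded, window=3):
    """Present the last `window` encoded symbols to the recurrent network inputs."""
    return np.array([encoded[t - window + 1:t + 1]
                     for t in range(window - 1, len(encoded))])
```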

5.3 Dealing with Non-stationarity

The approach used to deal with the non-stationarity of the signal in this work is to build models based on a short time period only. This is an intuitively appealing approach because one would expect that any inefficiencies found in the market would typically not last for a long time. The difficulty with this approach is the reduction in the already small number of training data points. The size of the training set controls a noise vs. non-stationarity tradeoff [42]. If the training set is too small, the noise makes it harder to estimate the appropriate mapping. If the training set is too large, the non-stationarity of the data will mean more data with statistics that are less relevant for the task at hand will be used for creating the estimator.

We ran tests on a segment of data separate from the main tests, from which we chose to use models trained on 100 days of data (see appendix A for more details). After training, each model is tested on the next 30 days of data. The entire training/test set window is moved forward 30 days and the process is repeated, as depicted in figure 4. In a real-life situation it would be desirable to test only for the next day, and advance the windows by steps of one day. However, 30 days is used here in order to reduce the amount of computation by a factor of 30, and enable a much larger number of tests to be performed (i.e. 30 times as many predictions are made from the same number of training runs), thereby increasing confidence in the results.

Figure 4. A depiction of the training and test sets used.
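The walk-forward scheme of figure 4 amounts to the following index bookkeeping (a sketch only; the concrete offsets used for the data splits are given in Appendix A).

```python
def walk_forward_windows(n_points, train_size=100, test_size=30):
    """Successive (train, test) index ranges: train on 100 days, test on the
    next 30 days, then move both windows forward by 30 days."""
    windows, start = [], 0
    while start + train_size + test_size <= n_points:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        windows.append((train, test))
        start += test_size
    return windows
```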

5.4 Controlling Overfitting

Highly noisy data has often been addressed using techniques such as weight decay, weight elimination and early stopping to control overfitting [61]. For this problem we use early stopping with a difference: the stopping point is chosen using multiple tests on a separate segment of data, rather than using a validation set in the normal manner. Tests showed that using a validation set in the normal manner seriously affected performance. This is not very surprising because there is a dilemma when choosing the validation set. If the validation set is chosen before the training data (in terms of the temporal order of the series), then the non-stationarity of the series means the validation set may be of little use for predicting performance on the test set. If the validation set is chosen after the training data, a problem occurs again with the non-stationarity – it is desirable to make the validation set as small as possible to alleviate this problem, however the set cannot be made too small because it needs to be a certain size in order to make the results statistically valid. Hence, we chose to control overfitting by setting the training time for an appropriate stopping point a priori. Simulations with a separate segment of data not contained in the main data (see appendix A for more details) were used to select the training time of 500 epochs.

We note that our use of a separate segment of data to select training set size and training time is not expected to be an ideal solution, e.g. the best choice for these parameters may change significantly over time due to the non-stationarity of the series.

5.5 A Priori Data Probabilities

For some financial data sets, it is possible to obtain good results by simply predicting the most common change in the training data (e.g. noticing that the change in prices is positive more often than it is negative in the training set and always predicting positive for the test set). Although predictability due to such data bias may be worthwhile, it can prevent the discovery of a better solution. Figure 5 shows a simplified depiction of the problem – if the naive solution, A, has a better mean squared error than a random solution then it may be hard to find a better solution, e.g. B. This may be particularly problematic for high noise data although the true nature of the error surface is not yet well understood.

Figure 5. A depiction of a simplified error surface if a naive solution, A, performs well. A better solution, e.g. B, may be difficult to find.

Hence, in all results reported here the models have been prevented from acting on any such bias by ensuring that the number of training examples for the positive and negative cases is equal. The training set actually remains unchanged so that the temporal order is not disturbed, but there is no training signal for the appropriate number of positive or negative examples starting from the beginning of the series. The number of positive and negative examples is typically approximately equal to begin with, therefore the use of this simple technique did not result in the loss of many data points.
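The balancing procedure described above can be sketched as a mask over the training targets: the temporal order of the series is untouched, but the excess examples of the majority class, counted from the start of the series, contribute no training signal. A minimal NumPy illustration:

```python
import numpy as np

def balance_training_signal(targets):
    """targets: +1 / -1 direction-of-change labels in temporal order.
    Returns a boolean mask; False marks examples that receive no error signal."""
    targets = np.asarray(targets)
    n_pos, n_neg = np.sum(targets == 1), np.sum(targets == -1)
    majority = 1 if n_pos > n_neg else -1
    excess = abs(int(n_pos - n_neg))
    mask = np.ones(len(targets), dtype=bool)
    mask[np.flatnonzero(targets == majority)[:excess]] = False  # earliest majority examples
    return mask
```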

5.6 Comparison with a Random Walk

The prediction corresponding to the random walk model is equal to the current price, i.e. no change. A percentage of times the model predicts the correct change (apart from no change) cannot be obtained for the random walk model. However, it is possible to use models with data that corresponds to a random walk and compare the results to those obtained with the real data. If equal performance is obtained on the random walk data as compared to the real data then no significant predictability has been found. These tests were performed by randomizing the classification of the test set. For each original test set five randomized versions were created. The randomized versions were created by repeatedly selecting two random positions in time and swapping the two classification values (the input values remain unchanged). An additional benefit of this technique is that if the model is showing predictability based on any data bias (as discussed in the previous section) then the results on the randomized test sets will also show this predictability, i.e. always predicting a positive change will result in the same performance on the original and the randomized test sets.
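Creating a randomized test set as described above is a matter of repeatedly swapping pairs of classification labels while leaving the inputs alone. A short sketch (the number of swaps used below is an arbitrary choice for illustration):

```python
import numpy as np

def randomize_test_labels(labels, n_swaps=1000, rng=None):
    """Return a copy of the test labels with pairs of randomly chosen positions
    swapped; the corresponding input values are left unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.array(labels, copy=True)
    for _ in range(n_swaps):
        i, j = rng.integers(len(labels), size=2)
        labels[i], labels[j] = labels[j], labels[i]
    return labels
```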

6 Experimental Results

We present results using the previously described system and alternative systems for comparison. In all cases, the Elman networks contained 5 hidden units and the SOM had one dimension. Further details of the individual simulations can be found in appendix A. Figure 6 and table 1 show the average error rate as the number of nodes in the self-organizing map (SOM) was increased from 2 to 7. Results are shown for SOM embedding dimensions of 1 and 2. Results are shown using both the original data and the randomized data for comparison with a random walk. The results are averaged over the 5 exchange rates with 30 simulations per exchange rate and 30 test points per simulation, for a total of 4500 test points. The best performance was obtained with a SOM size of 7 nodes and an embedding dimension of 2 where the error rate was 47.1%. A t-test indicates that this result is significantly different from the null hypothesis of the randomized test sets at p = 0.0076 or 0.76%, indicating that the result is significant. We also considered a null hypothesis corresponding to a model which makes completely random predictions (corresponding to a random walk) – the result is significantly different from this null hypothesis at p = 0.0054 or 0.54%. Figure 7 and table 2 show the average results for the SOM size of 7 on the individual exchange rates.

In order to gauge the effectiveness of the symbolic encoding and recurrent neural network, we investigated the following two systems (listed after figure 6):

Figure 6. The error rate (test error %, vertical axis) as the number of nodes in the SOM (horizontal axis) is increased, for SOM embedding dimensions of 1 and 2, on the original and randomized series. Each of the results represents the average over the 5 individual series and the 30 simulations per series, i.e. over 150 simulations which tested 30 points each, or 4500 individual classifications. The results for the randomized series are averaged over the same values with the 5 randomized series per simulation, for a total of 22500 classifications per result (all of the data available was not used due to limited computational resources).

1. The system as presented here without the symbolic encoding, i.e. the preprocessed data is entered directly into the recurrent neural network without the SOM stage.

2. The system as presented here with the recurrent network replaced by a standard MLP network.

Tables 3 and 4 show the results for systems 1 and 2 above. The performance of these systems can be seen to be in-between the performance of the null hypothesis of a random walk and the performance of the hybrid system, indicating that the hybrid system can result in significant improvement.
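The significance comparison against the randomized test sets reported earlier in this section can be reproduced in outline with a two-sample t-test over per-simulation error rates. Whether the original test was paired or pooled, one- or two-sided, is not stated in the text, so the call below is only one reasonable choice.

```python
import numpy as np
from scipy import stats

def compare_to_randomized(real_error_rates, randomized_error_rates):
    """Two-sample t-test on per-simulation error rates (real vs. randomized data)."""
    t, p = stats.ttest_ind(np.asarray(real_error_rates),
                           np.asarray(randomized_error_rates))
    return t, p
```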


SOM embedding dimension: 1
SOM size            2     3     4     5     6     7
Error               48.5  47.5  48.3  48.3  48.5  47.6
Randomized error    50.1  50.5  50.4  50.2  49.8  50.3

SOM embedding dimension: 2
SOM size            2     3     4     5     6     7
Error               48.8  47.8  47.9  47.7  47.6  47.1
Randomized error    49.8  50.4  50.4  50.1  49.7  50.2

Table 1. The results as shown in figure 6.

Figure 7. Average results (test error %) for the five exchange rates (BP, CD, DM, JY, SF) using the real test data and the randomized test data. The SOM size is 7 and the results for SOM embedding dimensions of 1 and 2 are shown.

6.1 Reject Performance

We used the following equation in order to calculate a confidence measure for every prediction: C = y_max(y_max − y_min), where y_max is the maximum output and y_min is the minimum output (for outputs which have been transformed using the softmax transformation: y_i = exp(u_i) / Σ_{j=1}^{n_o} exp(u_j), where u_i are the original outputs, y_i are the transformed outputs, and n_o is the number of outputs). The confidence measure C gives an indication of how confident the network model is for each classification.

Exchange Rate                       British  Canadian  German  Japanese  Swiss
                                    Pound    Dollar    Mark    Yen       Franc
Embedding dimension 1               48.5     46.9      46.0    47.7      48.9
Embedding dimension 2               48.1     46.8      46.9    46.3      47.4
Randomized embedding dimension 1    50.1     50.5      50.4    49.8      50.7
Randomized embedding dimension 2    50.2     50.6      50.5    49.7      49.8

Table 2. Results as shown in figure 7.

Error               49.5
Randomized error    50.1

Table 3. The results for the system without the symbolic encoding, i.e. the preprocessed data is entered directly into the recurrent neural network without the SOM stage.

SOM embedding dimension: 1
SOM size            2     3     4     5     6     7
Error               49.1  49.4  49.3  49.7  49.9  49.8
Randomized error    49.7  50.1  49.9  50.3  50.0  49.9

Table 4. The results for the system using an MLP instead of the RNN.

Thus, for a fixed value of C, the predictions obtained by a particular algorithm can be divided into two sets, one set D for those below the threshold, and another set E for those above the threshold. We can then reject those points which are in set D, i.e. they are not used in the computation of the classification accuracy.

Figure 8 shows the classification performance of the system (percentage of correct sign predictions) as the confidence threshold is increased and more examples are rejected for classification. Figure 9 shows the same graph for the randomized test sets. It can be seen that a significant increase in performance (down to approximately 40% error) can be obtained by only considering the 5-15% of examples with the highest confidence. This benefit cannot be seen for the randomized test sets. It should be noted that by the time 95% of test points are rejected, the total number of remaining test points is only 225.
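The confidence measure and the rejection mechanism can be sketched as follows (NumPy; a minimal illustration rather than the original evaluation code, with targets assumed to be class indices 0 and 1).

```python
import numpy as np

def softmax(u):
    """Softmax over the n_o = 2 raw network outputs, row-wise."""
    e = np.exp(u - u.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def reject_performance(raw_outputs, targets, threshold):
    """Confidence C = y_max * (y_max - y_min); predictions with C below the
    threshold are rejected, the rest are scored for classification accuracy."""
    y = softmax(np.asarray(raw_outputs, dtype=float))
    conf = y.max(axis=1) * (y.max(axis=1) - y.min(axis=1))
    keep = conf >= threshold
    correct = (y.argmax(axis=1) == np.asarray(targets))[keep]
    accuracy = correct.mean() if keep.any() else float("nan")
    return accuracy, 1.0 - keep.mean()      # accuracy on kept points, reject fraction
```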

Figure 8. Classification performance (percent correct vs. reject percentage) as the confidence threshold is increased. This graph is the average of the 150 individual results.

Figure 9. Classification performance (percent correct vs. reject percentage) as the confidence threshold is increased for the randomized test sets. This graph is the average of the individual randomized test sets.

7 Automata Extraction

A common complaint with neural networks is that the models are difficult to interpret – i.e. it is not clear how the models arrive at the prediction or classification of a given input pattern [49]. A number of people have considered the extraction of symbolic knowledge from trained neural networks, for both feedforward and recurrent neural networks [12, 44, 57]. For recurrent networks, the ordered triple of a discrete Markov process (state; input → next state) can be extracted and used to form an equivalent deterministic finite state automaton (DFA). This can be done by clustering the activation values of the recurrent state neurons [44]. The automata extracted with this process can only recognize regular grammars.5

5 A regular grammar G is a 4-tuple G = {S, N, T, P} where S is the start symbol, N and T are non-terminal and terminal symbols, respectively, and P is a set of productions of the form A → a or A → aB, where A, B ∈ N and a ∈ T.

The algorithm used for automata extraction is the same as that described in [21]. The algorithm is based on the observation that the activations of the recurrent state neurons in a trained network tend to cluster. The output of each of the state neurons is divided into q intervals of equal size, yielding q^{n_h} partitions in the space of outputs of the state neurons. Starting from an initial state, the input symbols are considered in order. If an input symbol causes a transition to a new partition in the activation space then a new DFA state is created with a corresponding transition on the appropriate symbol. In the event that different activation state vectors belong to the same partition, the state vector which first reached the partition is chosen as the new initial state for subsequent symbols. The extraction would be computationally infeasible if all q^{n_h} partitions had to be visited, but the clusters corresponding to DFA states often cover a small percentage of the input space.
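A simplified sketch of the extraction step: each state neuron's activation (assumed here to lie in [0, 1]) is quantized into q intervals, and transitions between the resulting partitions are recorded as DFA transitions. Minimization of the resulting automaton [27] is not shown.

```python
import numpy as np

def extract_dfa(hidden_states, input_symbols, q=5):
    """hidden_states: sequence of state-neuron activation vectors (values in [0, 1]),
    with hidden_states[0] the initial state; input_symbols: the symbols that caused
    each subsequent transition. Returns DFA states and a transition table."""
    def partition(z):
        return tuple(np.minimum((np.asarray(z) * q).astype(int), q - 1))
    states = {partition(hidden_states[0]): 0}          # partition -> DFA state index
    transitions = {}                                   # (state, symbol) -> next state
    current = 0
    for z, symbol in zip(hidden_states[1:], input_symbols):
        cell = partition(z)
        if cell not in states:                         # first vector to reach this partition
            states[cell] = len(states)
        transitions.setdefault((current, symbol), states[cell])
        current = states[cell]
    return states, transitions
```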

Sample deterministic finite state automata (DFA) extracted from trained financial prediction networks using a quantization level of 5 can be seen in figure 10. The DFAs have been minimized using standard minimization techniques [27]. The DFAs were extracted from networks where the SOM embedding dimension was 1 and the SOM size was 2. As expected, the DFAs extracted in this case are relatively simple. In all cases, the SOM symbols end up representing positive and negative changes in the series – transitions marked with a solid line correspond to positive changes and transitions marked with a dotted line correspond to negative changes. For all DFAs, state 1 is the starting state. Double circled states correspond to prediction of a positive change, and states without the double circle correspond to prediction of a negative change. (a) can be seen as corresponding to mean-reverting behavior – i.e. if the last change in the series was negative then the DFA predicts that the next change will be positive and vice versa. (b) and (c) can be seen as variations on (a) where there are exceptions to the simple rules of (a), e.g. in (b) a negative change corresponds to a positive prediction, however after the input changes from negative to positive, the prediction alternates with successive positive changes. In contrast with (a)-(c), (d)-(f) exhibit behavior related to trend following algorithms, e.g. in (d) a positive change corresponds to a positive prediction, and a negative change corresponds to a negative prediction except for the first negative change after a positive change.

Figure 10. Sample deterministic finite state automata (DFA), panels (a)-(f), extracted from trained financial prediction networks using a quantization level of 5.

    1

    2

    3

    4 5

    6

    Figure 11. A sampleDFA extractedfrom a network wherethe SOM size was 3 (and the SOM

    embeddingdimensionwas1). TheSOMsymbolsrepresentapositive (solid lines)or negative (dotted

    lines)change,or a changecloseto zero(grayline). In this case,therulesaremorecomplex thanthe

    simplerulesin thepreviousfigure.

    23

8 Conclusions

Traditional learning methods have difficulty with high noise, high non-stationarity time series prediction. The intelligent signal processing method presented here uses a combination of symbolic processing and recurrent neural networks to deal with fundamental difficulties. The use of a recurrent neural network is significant for two reasons: firstly, the temporal relationship of the series is explicitly modeled via internal states, and secondly, it is possible to extract rules from the trained recurrent networks in the form of deterministic finite state automata. The symbolic conversion is significant for a number of reasons: the quantization effectively filters the data or reduces the noise, the RNN training becomes more effective, and the symbolic input facilitates the extraction of rules from the trained networks. Considering foreign exchange rate prediction, we performed a comprehensive set of simulations covering 5 different foreign exchange rates, which showed that significant predictability does exist with a 47.1% error rate in the prediction of the direction of the change. Additionally, the error rate reduced to around 40% when rejecting examples where the system had low confidence in its prediction. The method facilitates the extraction of rules which explain the operation of the system. Results for the problem considered here are positive and show that symbolic knowledge which has meaning to humans can be extracted from noisy time series data. Such information can be very useful and important to the user of the technique.

We have shown that a particular recurrent network combined with symbolic methods can generate useful predictions and rules for such predictions on a non-stationary time series. It would be interesting to compare this approach with other neural networks that have been used in time series analysis [6, 31, 43, 45, 48], and other machine learning methods [19, 51], some of which can also generate rules. It would also be interesting to incorporate this model into a trading policy.

Appendix A: Simulation Details

For the experiments reported in this paper, n_h = 5 and n_o = 2. The RNN learning rate was linearly reduced over the training period (see [11] regarding learning rate schedules) from an initial value of 0.5. All inputs were normalized to zero mean and unit variance. All nodes included a bias input which was part of the optimization process. Weights were initialized as shown in Haykin [26]. Target outputs were -0.8 and 0.8 using the tanh output activation function and we used the quadratic cost function. In order to initialize the delays in the network, the RNN was run without training for 50 time steps prior to the start of the data sets. The number of training examples for the positive and negative cases was made equal by not training on the appropriate number of positive or negative examples starting from the beginning of the training set (i.e. the temporal order of the series is not disturbed but there is no training signal for either a number of positive or a number of negative examples starting at the beginning of the series). The SOM was trained for 10,000 iterations using a learning rate schedule which decreased linearly to zero, η₀(1 − n_i / n_t), where η₀ is the initial learning rate, n_i is the current iteration number and n_t is the total number of iterations (10,000). The neighborhood size was also reduced during training as a function of (1 − n_i / n_t). The first differenced data was split as follows: Points 1210 to 1959 (plus the size of the validation set in the validation set experiments) in each of the exchange rates were used for the selection of the training data set size, the training time, and experimentation with the use of a validation set. Points 10 to 1059 in each of the exchange rates were used for the main tests. The data for each successive simulation was offset by the size of the test set, 30 points, as shown in figure 4. The number of points in the main tests from each exchange rate is equal to 1,050 which is 50 (used to initialize delays) + 100 (training set size) + 30 × 30 (30 test sets of size 30). For the selection of the training data set size and the training time the number of points was 750 for each exchange rate (50 + 100 + 20 test sets of size 30). For the experiments using a validation set in each simulation, the number of points was 750 + the size of the validation set for each of the exchange rates.

Acknowledgments

We would like to thank Doyne Farmer, Christian Omlin, Andreas Weigend, and the anonymous referees for helpful comments.

    References

[1] Machine learning. International Journal of Intelligent Systems in Accounting, Finance and Management, 4(2), 1995. Special Issue.

[2] Proceedings of IEEE/IAFE Conference on Computational Intelligence for Financial Engineering (CIFEr), 1996–1999.

[3] Y.S. Abu-Mostafa. Learning from hints in neural networks. Journal of Complexity, 6:192, 1990.

[4] J.A. Alexander and M.C. Mozer. Template-based algorithms for connectionist rule extraction. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 609–616. The MIT Press, 1995.

[5] Martin Anthony and Norman L. Biggs. A computational learning theory view of economic forecasting with neural nets. In A. Refenes, editor, Neural Networks in the Capital Markets. John Wiley and Sons, 1995.

[6] A.D. Back and A.C. Tsoi. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3(3):375–385, 1991.

[7] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

[8] R.E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.

[9] Y. Bengio. Neural Networks for Speech and Sequence Recognition. Thomson, 1996.

[10] A. Cleeremans, D. Servan-Schreiber, and J.L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372–381, 1989.

[11] C. Darken and J.E. Moody. Note on learning rate schedules for stochastic optimization. In R.P. Lippmann, J.E. Moody, and D.S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 832–838. Morgan Kaufmann, San Mateo, CA, 1991.

[12] J.S. Denker, D. Schwartz, B. Wittner, S.A. Solla, R. Howard, L. Jackel, and J. Hopfield. Large automatic learning, rule extraction, and generalization. Complex Systems, 1:877–922, 1987.

[13] J.L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2/3):195–226, 1991.

[14] E.F. Fama. The behaviour of stock market prices. Journal of Business, January:34–105, 1965.

[15] E.F. Fama. Efficient capital markets: A review of theory and empirical work. Journal of Finance, May:383–417, 1970.

[16] András Faragó and Gábor Lugosi. Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146–1151, 1993.

[17] J.H. Friedman. An overview of predictive learning and function approximation. In V. Cherkassky, J.H. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks, Theory and Pattern Recognition Applications, volume 136 of NATO ASI Series F, pages 1–61. Springer, 1994.

[18] J.H. Friedman. Introduction to computational learning and statistical prediction. Tutorial presented at Neural Information Processing Systems, Denver, CO, 1995.

[19] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. Machine Learning, 29:245, 1997.

[20] C. Lee Giles, B.G. Horne, and T. Lin. Learning a class of large finite state machines with a recurrent neural network. Neural Networks, 8(9):1359–1365, 1995.

[21] C. Lee Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393–405, 1992.

[22] C. Lee Giles, C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen, and Y.C. Lee. Extracting and learning an unknown grammar with recurrent neural networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 317–324, San Mateo, CA, 1992. Morgan Kaufmann Publishers.

[23] C. Lee Giles, G.Z. Sun, H.H. Chen, Y.C. Lee, and D. Chen. Higher order recurrent networks & grammatical inference. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 380–387, San Mateo, CA, 1990. Morgan Kaufmann Publishers.

[24] C.W.J. Granger and P. Newbold. Forecasting Economic Time Series. Academic Press, San Diego, second edition, 1986.

[25] Yoichi Hayashi. A neural expert system with automated extraction of fuzzy if-then rules. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 578–584. Morgan Kaufmann Publishers, 1991.

[26] S. Haykin. Neural Networks, A Comprehensive Foundation. Macmillan, New York, NY, 1994.

[27] J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, MA, 1979.

[28] L. Ingber. Statistical mechanics of nonlinear nonequilibrium financial markets: Applications to optimized trading. Mathematical Computer Modelling, 1996.

[29] M. Ishikawa. Rule extraction by successive regularization. In International Conference on Neural Networks, pages 1139–1143. IEEE Press, 1996.

[30] M.I. Jordan. Neural networks. In A. Tucker, editor, CRC Handbook of Computer Science. CRC Press, Boca Raton, FL, 1996.

[31] A. Kehagias and V. Petridis. Time series segmentation using predictive modular neural networks. Neural Computation, 9:1691–1710, 1997.

[32] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, Germany, 1995.

[33] A.N. Kolmogorov. On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition. Dokl, 114:679–681, 1957.

[34] V. Kůrková. Kolmogorov's theorem is relevant. Neural Computation, 3(4):617–622, 1991.

[35] V. Kůrková. Kolmogorov's theorem. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 501–502. MIT Press, Cambridge, Massachusetts, 1995.

[36] Steve Lawrence, C. Lee Giles, and Sandiway Fong. Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering. Accepted for publication.

[37] Jean Y. Lequarré. Foreign currency dealing: A brief introduction (data set C). In A.S. Weigend and N.A. Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1993.

[38] Andrew W. Lo and A. Craig MacKinlay. A Non-Random Walk Down Wall Street. Princeton University Press, 1999.

[39] B.G. Malkiel. Efficient Market Hypothesis. Macmillan, London, 1987.

[40] B.G. Malkiel. A Random Walk Down Wall Street. Norton, New York, NY, 1996.

[41] C. McMillan, M.C. Mozer, and P. Smolensky. Rule induction through integrated symbolic and subsymbolic processing. In John E. Moody, Steve J. Hanson, and Richard P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 969–976. Morgan Kaufmann Publishers, 1992.

[42] J.E. Moody. Economic forecasting: Challenges and neural network solutions. In Proceedings of the International Symposium on Artificial Neural Networks. Hsinchu, Taiwan, 1995.

[43] S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using support vector machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, IEEE Workshop on Neural Networks for Signal Processing VII, page 511. IEEE Press, 1997.

[44] C.W. Omlin and C.L. Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52, 1996.

[45] J.C. Principe and J.-M. Kuo. Dynamic modelling of chaotic time series with neural networks. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pages 311–318. MIT Press, 1995.

[46] A. Refenes, editor. Neural Networks in the Capital Markets. John Wiley and Sons, New York, NY, 1995.

[47] D.W. Ruck, S.K. Rogers, K. Kabrisky, M.E. Oxley, and B.W. Suter. The multilayer perceptron as an approximation to an optimal Bayes estimator. IEEE Transactions on Neural Networks, 1(4):296–298, 1990.

[48] E.W. Saad, Danil V. Prokhorov, and D.C. Wunsch II. Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks. IEEE Transactions on Neural Networks, 9(6):1456, November 1998.

[49] R. Setiono and H. Liu. Symbolic representation of neural networks. IEEE Computer, 29(3):71–77, 1996.

[50] H.T. Siegelmann and E.D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995.

[51] S.M. Weiss and N. Indurkhya. Rule-based machine learning methods for functional prediction. Journal of Artificial Intelligence Research, 1995.

[52] F. Takens. Detecting strange attractors in turbulence. In D.A. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, pages 366–381, Berlin, 1981. Springer-Verlag.

[53] S.J. Taylor, editor. Modelling Financial Time Series. J. Wiley & Sons, Chichester, 1994.

[54] A.B. Tickle, R. Andrews, M. Golea, and J. Diederich. The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9(6):1057, November 1998.

[55] M. Tomita. Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, pages 105–108, Ann Arbor, MI, 1982.

[56] G. Towell and J. Shavlik. The extraction of refined rules from knowledge based neural networks. Machine Learning, 13(1):71–101, 1993.

[57] Geoffrey G. Towell and Jude W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71, 1993.

[58] George Tsibouris and Matthew Zeidenberg. Testing the efficient markets hypothesis with gradient descent algorithms. In A. Refenes, editor, Neural Networks in the Capital Markets. John Wiley and Sons, 1995.

[59] R.L. Watrous and G.M. Kuhn. Induction of finite state languages using second-order recurrent networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 309–316, San Mateo, CA, 1992. Morgan Kaufmann Publishers.

[60] R.L. Watrous and G.M. Kuhn. Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4(3):406, 1992.

[61] A.S. Weigend, B.A. Huberman, and D.E. Rumelhart. Predicting sunspots and exchange rates with connectionist networks. In M. Casdagli and S. Eubank, editors, Nonlinear Modeling and Forecasting, SFI Studies in the Sciences of Complexity, Proceedings Vol. XII, pages 395–432. Addison-Wesley, 1992.

[62] G.U. Yule. On a method of investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London, Series A, 226:267–298, 1927.

[63] Z. Zeng, R.M. Goodman, and P. Smyth. Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5(2):320–330, 1994.