Forecasting foreign exchange rates using kernel methods

<ul><li><p>Forecasting foreign exchange rates using kernel methods</p><p>Martin Sewell a,⇑, John Shawe-Taylor b,1</p><p>a The Cambridge Centre for Climate Change Mitigation Research (4CMR), Department of Land Economy, University of Cambridge, 16-21 Silver Street, Cambridge CB3 9EP, United Kingdom</p><p>b Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom</p><p>Expert Systems with Applications 39 (2012) 7652–7662. Contents lists available at SciVerse ScienceDirect. Journal homepage: www.elsevier.com/locate/eswa. 0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2012.01.026</p><p>⇑ Corresponding author. Tel.: +44 (0) 1223 765224; fax: +44 (0) 1223 337130. E-mail addresses: (M. Sewell), (J. Shawe-Taylor). 1 Tel.: +44 (0) 20 76797680; fax: +44 (0) 20 73871397.</p><p>Keywords: Forecasting; Foreign exchange; Kernel methods</p><p>Abstract</p><p>First, the all-important no free lunch theorems are introduced. Next, kernel methods, support vector machines (SVMs), preprocessing, model selection, feature selection, SVM software and the Fisher kernel are introduced and discussed. A hidden Markov model is trained on foreign exchange data to derive a Fisher kernel for an SVM, and the DC algorithm and the Bayes point machine (BPM) are also used to learn the kernel on foreign exchange data. Further, the DC algorithm was used to learn the parameters of the hidden Markov model in the Fisher kernel, creating a hybrid algorithm. The mean net returns were positive for BPM; and BPM, the Fisher kernel, the DC algorithm and the hybrid algorithm were all improvements over a standard SVM in terms of both gross returns and net returns, but none achieved net returns as high as the genetic programming approach employed by Neely, Weller, and Dittmar (1997) and published in Neely, Weller, and Ulrich (2009). Two implementations of SVMs for Windows with semi-automated parameter selection are built. © 2012 Elsevier Ltd. All rights reserved.</p><p>1. Introduction</p><p>1.1. Objectives</p><p>This paper employs kernel methods to forecast foreign exchange rates, and aims to (1) beat the market, (2) beat existing standard methodology, and (3) beat the state of the art. Note that beating a foreign exchange market simply means achieving a positive return, but it is not a useful benchmark as it does not incorporate risk.</p><p>1.2. Background</p><p>1.2.1. No free lunch theorems</p><p>The two main no free lunch (NFL) theorems are introduced and then evolutionary algorithms and statistical learning theory are reconciled with the NFL theorems. The theorems are novel, non-trivial, frequently misunderstood and profoundly relevant to optimization, machine learning and science in general (and often conveniently ignored by the evolutionary algorithms and statistical learning theory communities). I (Sewell) run the world's only no free lunch website.2</p><p>1.2.2. No free lunch theorem for optimization/search</p><p>The no free lunch theorem for search and optimization applies to finite spaces and algorithms that do not resample points. The theorem tells us that all algorithms that search for an extremum of a cost function perform exactly the same when averaged over all possible cost functions. So, for any search/optimization algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class. See Wolpert and Macready (1997).</p><p>The no free lunch theorem for search implies that putting blind faith in evolutionary algorithms as a blind search/optimization algorithm is misplaced. For example, on average, a genetic algorithm is no better, or worse, than any other search algorithm. In practice, in our universe, one will only be interested in a subset of all possible functions. This means that it is necessary to show that the set of functions that are of interest has some property that allows a particular algorithm to perform better than random search on this subset.
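</p><p>As an illustration of the search NFL theorem (mine, not the paper's), one can enumerate every cost function on a tiny finite space and confirm that any fixed, non-resampling probe order finds the global maximum after the same number of samples on average:</p>

```python
from itertools import product

def probes_to_max(f, order):
    """Number of probes until the global maximum of cost function f is sampled."""
    best = max(f)
    for k, x in enumerate(order, start=1):
        if f[x] == best:
            return k

def average_probes(order, n_points=3, n_values=3):
    """Mean probes-to-maximum averaged over ALL cost functions on n_points points."""
    functions = list(product(range(n_values), repeat=n_points))
    return sum(probes_to_max(f, order) for f in functions) / len(functions)

# Two different deterministic search strategies (fixed probe orders).
print(average_probes([0, 1, 2]))  # 5/3
print(average_probes([2, 0, 1]))  # 5/3 again: averaged over all cost functions,
                                  # no search order beats any other
```

<p>Any bias that helps on some cost functions hurts equally on others; a genetic algorithm averaged over all functions fares no better than this blind enumeration.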
</p><p>1.2.3. No free lunch theorem for supervised machine learning</p><p>Hume (1739–1740) pointed out that even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience. More recently, and with increasing rigour, Mitchell (1980), Schaffer (1994) and Wolpert (1996) showed that bias-free learning is futile. The no free lunch theorem for supervised machine learning (Wolpert, 1996) shows that in a noise-free scenario where the loss function is the misclassification rate, in terms of off-training-set error, there are no a priori distinctions between learning algorithms.</p><p>More formally, where</p><p>d = training set;</p><p>m = number of elements in training set;</p><p>f = target input–output relationships;</p><p>h = hypothesis (the algorithm's guess for f made in response to d); and</p><p>c = off-training-set 'loss' associated with f and h ('generalization error' or 'test set error'),</p><p>all algorithms are equivalent, on average, by any of the following measures of risk: E(c|d), E(c|m), E(c|f,d) or E(c|f,m).</p><p>How well you do is determined by how 'aligned' your learning algorithm P(h|d) is with the actual posterior, P(f|d). This result, in essence, formalizes Hume, extends him and calls all of science into question.</p><p>The NFL proves that if you make no assumptions about the target functions, or if you have a 'uniform' prior, then P(c|d) is independent of one's learning algorithm. Vapnik appears to prove that given a large training set and a small VC dimension, one can generalize well. The VC dimension is a property of the learning algorithm, so no assumptions are being made about the target functions. So, has Vapnik found a free lunch? VC theory tells us that the training set error, s, converges to c. If ε is an arbitrary real number, the VC framework actually concerns</p><p>P(|c − s| &gt; ε | f, m).</p><p>VC theory does not concern</p><p>P(c | s, m, VC dimension).</p><p>So there is no free lunch for Vapnik, and no guarantee that support vector machines generalize well.
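</p><p>These equivalences can be verified directly on a toy problem (my illustration, not from the paper): with four binary-labelled inputs, two of them observed, averaging the off-training-set misclassification rate over all 16 possible target functions gives exactly 0.5 for every learner, however sensible its inductive bias:</p>

```python
from itertools import product

X = [0, 1, 2, 3]
train_x = [0, 1]   # observed inputs d
test_x = [2, 3]    # off-training-set inputs

def ots_error(learner, f):
    """Off-training-set misclassification rate of the learner's hypothesis h on target f."""
    train = [(x, f[x]) for x in train_x]
    h = learner(train)
    return sum(h(x) != f[x] for x in test_x) / len(test_x)

def always_zero(train):
    """A learner with a trivial bias: always predict 0."""
    return lambda x: 0

def majority(train):
    """A learner with a plausible bias: predict the majority training label."""
    ones = sum(y for _, y in train)
    guess = 1 if ones * 2 >= len(train) else 0
    return lambda x: guess

targets = [dict(zip(X, ys)) for ys in product([0, 1], repeat=len(X))]
for learner in (always_zero, majority):
    avg = sum(ots_error(learner, f) for f in targets) / len(targets)
    print(avg)  # 0.5 for each learner: no a priori distinction
```

<p>The majority-vote learner looks smarter than the constant learner, but under a uniform prior over targets the off-training-set labels are independent of the training data, so both average exactly 0.5.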
</p><p>1.3. Kernel methods</p><p>Central to the work on forecasting in this paper is the concept of a kernel. The technical aspects of kernels are dealt with in Section 2.1 (p. 6), and the history is given here. The Fisher kernel is derived and implemented below; to save space, a thorough literature review is provided in Sewell (2011b).</p><p>1.4. Support vector machines</p><p>Support vector machines (SVMs) are used extensively in the forecasting of financial time series and are covered in more detail in Section 2.2 (p. 6). Among other sources, the introductory paper (Hearst, Dumais, Osuna, Platt, &amp; Schölkopf, 1998), the classic SVM tutorial (Burges, 1998), the excellent book (Cristianini &amp; Shawe-Taylor, 2000) and the implementation details within Joachims (2002) have contributed to my own understanding.</p><p>1.4.1. The application of support vector machines to the financial domain</p><p>An exhaustive review of articles that apply SVMs to the financial domain (not reported here) included 38 articles that compare SVMs with artificial neural networks (ANNs): SVMs outperformed ANNs in 32 cases, ANNs outperformed SVMs in 3 cases, and there was no significant difference in 3 cases. More specifically, of the 22 articles that concern the prediction of financial or commodity markets, 18 favoured SVMs, 2 favoured ANNs and 2 found no significant difference. This bodes well for SVMs, and as such, the following research on forecasting shall employ them.
</p><p>2. Material and methods</p><p>Domain knowledge is necessary to provide the assumptions that supervised machine learning relies upon.</p><p>2.1. Kernel methods</p><p>2.1.1. Terminology</p><p>The term kernel is derived from a word that can be traced back to c. 1000 and originally meant a seed (contained within a fruit) or the softer (usually edible) part contained within the hard shell of a nut or stone-fruit. The former meaning is now obsolete. It was first used in mathematics when it was defined for integral equations in which the kernel is known and the other function(s) unknown, but it now has several meanings in mathematics. As far as I am aware, the machine learning term kernel trick was first used in 1998.</p><p>2.1.2. Definition</p><p>The kernel of a function f is the equivalence relation on the function's domain that roughly expresses the idea of 'equivalent as far as the function f can tell'.</p><p>Definition 1. Let X and Y be sets and let f be a function from X to Y. Elements x1 and x2 of X are equivalent if f(x1) and f(x2) are equal, i.e. are the same element of Y. Formally, if f : X → Y, then</p><p>ker f = {(x1, x2) ∈ X × X : f(x1) = f(x2)}.</p><p>The kernel trick (described below) uses the kernel as a similarity measure and the term kernel function is often used for f above.</p><p>2.1.3. Motivation and description</p><p>Firstly, linearity is rather special, and outside quantum mechanics no real system is truly linear. Secondly, detecting linear relations has been the focus of much research in statistics and machine learning for decades and the resulting algorithms are well understood, well developed and efficient. Naturally, one wants the best of both worlds. So, if a problem is non-linear, instead of trying to fit a non-linear model, one can map the problem from the input space to a new (higher-dimensional) space (called the feature space) by doing a non-linear transformation using suitably chosen basis functions and then use a linear model in the feature space. This is known as the kernel trick. The linear model in the feature space corresponds to a non-linear model in the input space. This approach can be used in both classification and regression problems. The choice of kernel function is crucial for the success of all kernel algorithms because the kernel constitutes prior knowledge that is available about a task. Accordingly, there is no free lunch (see p. 1) in kernel choice.</p><p>2.1.4. Kernel trick</p><p>The kernel trick was first published by Aizerman, Braverman, and Rozonoer (1964). Mercer's theorem states that any continuous, symmetric, positive semi-definite kernel function K(x, y) can be expressed as a dot product in a high-dimensional space.</p><p>If the arguments to the kernel are in a measurable space X, and if the kernel is positive semi-definite, i.e.</p><p>∑i=1..n ∑j=1..n K(xi, xj) ci cj ≥ 0</p><p>for any finite subset {x1, . . . , xn} of X and subset {c1, . . . , cn} of objects (typically real numbers, but could even be molecules), then there exists a function φ(x) whose range is in an inner product space of possibly high dimension, such that</p><p>K(x, y) = φ(x) · φ(y).</p><p>2.1.5. Advantages</p><p>• The kernel defines a similarity measure between two data points and thus allows one to incorporate prior knowledge of the problem domain.</p><p>• Most importantly, the kernel contains all of the information about the relative positions of the inputs in the feature space and the actual learning algorithm is based only on the kernel function and can thus be carried out without explicit use of the feature space. The training data only enter the algorithm through their entries in the kernel matrix (a Gram matrix, see Appendix A (p. 22)), and never through their individual attributes. Because one never explicitly has to evaluate the feature map in the high dimensional feature space, the kernel function represents a computational shortcut.</p><p>• The number of operations required is not necessarily proportional to the number of features.
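</p><p>To make the identity K(x, y) = φ(x) · φ(y) concrete, here is a numerical check (my example, not the paper's) for the homogeneous quadratic kernel K(x, y) = (x · y)² on two-dimensional inputs, whose explicit feature map is φ(x) = (x1², √2·x1·x2, x2²):</p>

```python
import math

def kernel(x, y):
    """Homogeneous quadratic kernel: K(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map for the quadratic kernel in two dimensions."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (3.0, -1.0), (0.5, 2.0)
print(kernel(x, y))         # (3*0.5 + (-1)*2)^2 = 0.25
print(dot(phi(x), phi(y)))  # same value, computed in the feature space
```

<p>The left-hand side never touches the three-dimensional feature space, which is exactly the computational shortcut the advantages above describe; for kernels such as the Gaussian the feature space is infinite-dimensional and only the kernel form is usable.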
</p><p>2.2. Support vector machines</p><p>A support vector machine (SVM) is a supervised learning technique from the field of machine learning applicable to both classification and regression. Rooted in the statistical learning theory developed by Vladimir Vapnik and co-workers, SVMs are based on the principle of structural risk minimization (Vapnik &amp; Chervonenkis, 1974).</p><p>The background mathematics required includes probability, linear algebra and functional analysis. More specifically: vector spaces, inner product spaces, Hilbert spaces (defined in Appendix B (p. 22)), operators, eigenvalues and eigenvectors. A good book for learning the background maths is Introductory Real Analysis (Kolmogorov &amp; Fomin, 1975).</p><p>Support vector machines (reviewed briefly on p. 4) are the best-known example of kernel methods.</p><p>The basic idea of an SVM is as follows:</p><p>1. Non-linearly map the input space into a very high dimensional feature space (the kernel trick).</p><p>2. In the case of classification, construct an optimal separating hyperplane in this space (a maximal margin classifier); or in the case of regression, perform linear regression in this space, but without penalising small errors.</p><p>2.3. Preprocessing</p><p>Preprocessing the data is a vital part of forecasting. Filtering the data is a common procedure, but should be avoided altogether if it is suspected that the time series may be chaotic (there is little evidence for low-dimensional chaos in financial data (Sewell, 2011a)). In the following work, simple averaging was used to deal with missing data. It is good practice to normalize the data so that the inputs are in the range [0,1] or [−1,1]; here I used [−1,1]. Care was taken to avoid multicollinearity in the inputs, as this would increase the variance (in a bias-variance sense). Another common task is outlier removal; however, if an outlier is a market crash, it is obviously highly significant, so no outliers were removed. Useful references include Masters (1995), Pyle (1999) and (to a lesser extent) Theodoridis and Koutroumbas (2008).
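</p><p>A minimal sketch of the preprocessing just described (my illustration; the paper's actual data handling is not reproduced here): fill missing observations by simple averaging of the nearest available neighbours, then min–max scale the series to [−1, 1]:</p>

```python
def fill_missing(series):
    """Replace None entries with the average of the nearest non-missing neighbours."""
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None:
            prev = next((filled[j] for j in range(i - 1, -1, -1)
                         if filled[j] is not None), None)
            nxt = next((filled[j] for j in range(i + 1, len(filled))
                        if filled[j] is not None), None)
            neighbours = [x for x in (prev, nxt) if x is not None]
            filled[i] = sum(neighbours) / len(neighbours)
    return filled

def scale(series, lo=-1.0, hi=1.0):
    """Min-max normalize a series to the range [lo, hi]."""
    mn, mx = min(series), max(series)
    return [lo + (hi - lo) * (v - mn) / (mx - mn) for v in series]

rates = [1.42, 1.44, None, 1.50, 1.46]   # toy exchange-rate series with a gap
clean = scale(fill_missing(rates))
print(clean)  # five values spanning exactly [-1.0, 1.0]
```

<p>Scaling parameters should in practice be fitted on the training set only and reused on the validation and test sets, otherwise information leaks from the future into the inputs.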
</p><p>2.4. Model selection</p><p>For books on model selection, see Burnham and Anderson (2002) and Claeskens and Hjort (2008). For a Bayesian approach to model selection using foreign exchange data, see Sewell (2008) and Sewell (2009). Support vector machines are implemented here.</p><p>Regarding inputs, I (among other things) subscribe to Tobler's first law of geography (Tobler, 1970), which tells us that everything is related to everything else, but near things are more related than distant things.</p><p>Two implementations of SVMs for Windows with semi-automated parameter selection were built, both of which are freely available online (including source code): SVMdark is based on SVMlight (Joachims, 2004) and written in C for Win32, whilst winSVM is based on mySVM (Rüping, 2000) and written in C++ for Win32. Both products include a model/parameter selection tool which randomly selects the SVM kernel and/or parameters within the range selected by the user. Results for each parameter combination are saved in a spreadsheet and the user can narrow down the range of parameters and home in on the optimum solution for the validation set. The software comes with a tutorial, and has received a great deal of positive feedback from academia, banks and individuals. The programs make a very real practical contribution to SVM model and parameter selection, as they each present the user with an easy-to-use interface that allows them to select a subset of the search space of parameters to be parsed randomly, and enables them to inspect and sort the results with ease in Excel. The random model/parameter selection is particularly beneficial in applications with limited domain knowledge, such as financial time series. Figs. 1 and 2 (p. 10) show screenshots of my Windows SVM software.</p><p>The experiments reported...</p></li></ul>
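The random model/parameter selection loop that such tools implement can be sketched as follows (my sketch: the kernel names, parameter ranges and validation function below are hypothetical placeholders, not SVMdark's or winSVM's actual search space):

```python
import random

# Hypothetical user-selected search space: a kernel type plus ranges for
# its parameters (names and bounds here are illustrative only).
SEARCH_SPACE = {
    "rbf":        {"C": (2 ** -5, 2 ** 15), "gamma": (2 ** -15, 2 ** 3)},
    "polynomial": {"C": (2 ** -5, 2 ** 15), "degree": (2, 5)},
}

def sample(space, rng):
    """Draw one random kernel/parameter combination from the user's ranges."""
    kernel = rng.choice(sorted(space))
    params = {"kernel": kernel}
    for name, (lo, hi) in space[kernel].items():
        if isinstance(lo, int):
            params[name] = rng.randint(lo, hi)
        else:
            # sample continuous parameters on a log scale
            params[name] = lo * (hi / lo) ** rng.random()
    return params

def random_search(validation_error, trials=50, seed=0):
    """Evaluate random combinations; keep the lowest validation-set error."""
    rng = random.Random(seed)
    results = [(validation_error(p), p)
               for p in (sample(SEARCH_SPACE, rng) for _ in range(trials))]
    results.sort(key=lambda r: r[0])
    return results[0]   # best (error, parameters); the full list ~ the spreadsheet

# Stand-in for training an SVM and measuring error on a validation set:
toy_error = lambda p: abs(p.get("gamma", 1.0) - 0.1) + abs(p["C"] - 10) / 100
best_err, best_params = random_search(toy_error)
```

Sorting the saved (error, parameters) rows mirrors inspecting the spreadsheet in Excel: the user narrows the ranges around the best rows and repeats the search.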