# Forecasting foreign exchange rates using kernel methods


Martin Sewell a,*, John Shawe-Taylor b,1

a The Cambridge Centre for Climate Change Mitigation Research (4CMR), Department of Land Economy, University of Cambridge, 16-21 Silver Street, Cambridge CB3 9EP, United Kingdom
b Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom

Expert Systems with Applications 39 (2012) 7652-7662. doi:10.1016/j.eswa.2012.01.026. © 2012 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +44 (0) 1223 765224; fax: +44 (0) 1223 337130. E-mail addresses: mvs25@cam.ac.uk (M. Sewell), J.Shawe-Taylor@cs.ucl.ac.uk (J. Shawe-Taylor).
1 Tel.: +44 (0) 20 76797680; fax: +44 (0) 20 73871397.

Abstract

First, the all-important no free lunch theorems are introduced. Next, kernel methods, support vector machines (SVMs), preprocessing, model selection, feature selection, SVM software and the Fisher kernel are introduced and discussed. A hidden Markov model is trained on foreign exchange data to derive a Fisher kernel for an SVM; the DC algorithm and the Bayes point machine (BPM) are also used to learn the kernel on foreign exchange data. Further, the DC algorithm was used to learn the parameters of the hidden Markov model in the Fisher kernel, creating a hybrid algorithm. The mean net returns were positive for BPM; and the Fisher kernel, the DC algorithm and the hybrid algorithm were all improvements over a standard SVM in terms of both gross returns and net returns, but none achieved net returns as high as the genetic programming approach employed by Neely, Weller, and Dittmar (1997) and published in Neely, Weller, and Ulrich (2009). Two implementations of SVMs for Windows with semi-automated parameter selection are built.

Keywords: Forecasting; Foreign exchange; Kernel methods

1. Introduction

1.1. Objectives

This paper employs kernel methods to forecast foreign exchange rates, and aims to (1) beat the market, (2) beat existing standard methodology, and (3) beat the state of the art. Note that in a foreign exchange market, "beating the market" simply means earning a positive return, but it is not a useful benchmark as it does not incorporate risk.

1.2. Background

1.2.1. No free lunch theorems

The two main no free lunch (NFL) theorems are introduced, and then evolutionary algorithms and statistical learning theory are reconciled with the NFL theorems. The theorems are novel, non-trivial, frequently misunderstood and profoundly relevant to optimization, machine learning and science in general (and often conveniently ignored by the evolutionary algorithms and statistical learning theory communities). I (Sewell) run the world's only no free lunch website.2

2 http://www.no-free-lunch.org.

1.2.2. No free lunch theorem for optimization/search

The no free lunch theorem for search and optimization applies to finite spaces and algorithms that do not resample points. The theorem tells us that all algorithms that search for an extremum of a cost function perform exactly the same when averaged over all possible cost functions. So, for any search/optimization algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class. See Wolpert and Macready (1997).

The no free lunch theorem for search implies that putting blind faith in evolutionary algorithms as a blind search/optimization algorithm is misplaced. For example, on average, a genetic algorithm is no better, or worse, than any other search algorithm. In practice, in our universe, one will only be interested in a subset of all possible functions. This means that it is necessary to show that the set of functions that are of interest has some property that allows a particular algorithm to perform better than random search on this subset.

1.2.3. No free lunch theorem for supervised machine learning

Hume (1739-1740) pointed out that even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience. More recently, and with increasing rigour, Mitchell (1980), Schaffer (1994) and Wolpert (1996) showed that bias-free learning is futile. The no free lunch theorem for supervised machine learning (Wolpert, 1996) shows that in a noise-free scenario where the loss function is the misclassification rate, in terms of off-training-set error, there are no a priori distinctions between learning algorithms. More formally, where

d = training set;
m = number of elements in the training set;
f = target input-output relationships;
h = hypothesis (the algorithm's guess for f made in response to d); and
c = off-training-set "loss" associated with f and h ("generalization error" or "test set error"),

all algorithms are equivalent, on average, by any of the following measures of risk: E(c|d), E(c|m), E(c|f,d) or E(c|f,m).

How well you do is determined by how "aligned" your learning algorithm P(h|d) is with the actual posterior, P(f|d). This result, in essence, formalizes Hume, extends him and calls all of science into question.

The NFL proves that if you make no assumptions about the target functions, or if you have a "uniform" prior, then P(c|d) is independent of one's learning algorithm. Vapnik appears to prove that given a large training set and a small VC dimension, one can generalize well. The VC dimension is a property of the learning algorithm, so no assumptions are being made about the target functions. So, has Vapnik found a free lunch? VC theory tells us that the training set error, s, converges to c. If ε is an arbitrary real number, the VC framework actually concerns

$$P(|c - s| > \epsilon \mid f, m),$$

whilst it does not concern

$$P(c \mid s, m, \text{VC dimension}).$$

So there is no free lunch for Vapnik, and no guarantee that support vector machines generalize well.

1.3. Kernel methods

Central to the work on forecasting in this paper is the concept of a kernel. The technical aspects of kernels are dealt with in Section 2.1, and the history is given here. The Fisher kernel is derived and implemented below; to save space, a thorough literature review is provided in Sewell (2011b).

1.4. Support vector machines

Support vector machines (SVMs) are used extensively in the forecasting of financial time series and are covered in more detail in Section 2.2. Among other sources, the introductory paper (Hearst, Dumais, Osuna, Platt, & Schölkopf, 1998), the classic SVM tutorial (Burges, 1998), the excellent book (Cristianini & Shawe-Taylor, 2000) and the implementation details within Joachims (2002) have contributed to my own understanding.

1.4.1. The application of support vector machines to the financial domain

An exhaustive review of articles that apply SVMs to the financial domain (not reported here) included 38 articles that compare SVMs with artificial neural networks (ANNs): SVMs outperformed ANNs in 32 cases, ANNs outperformed SVMs in 3 cases, and there was no significant difference in 3 cases. More specifically, of the 22 articles that concern the prediction of financial or commodity markets, 18 favoured SVMs, 2 favoured ANNs and 2 found no significant difference. This bodes well for SVMs, and as such, the following research on forecasting shall employ them.
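The no free lunch result of Section 1.2.3 can be checked numerically in a toy setting. The following sketch is my own illustration, not from the paper: it averages the off-training-set misclassification rate of two different learning algorithms over every possible boolean target function on four inputs, and both averages come out identical.

```python
from itertools import product

# Toy NFL check: inputs are {0,1,2,3}; the first two points form the
# training set d, the last two are off-training-set. Average each
# algorithm's off-training-set error over ALL 16 target functions f.
X_train, X_test = [0, 1], [2, 3]

def algo_constant(d):
    # Ignores the data: always predicts 0.
    return lambda x: 0

def algo_majority(d):
    # Predicts the majority label seen in training (ties -> 1).
    labels = [y for _, y in d]
    return lambda x: 1 if sum(labels) * 2 >= len(labels) else 0

def mean_ots_error(algo):
    errors = []
    for f in product([0, 1], repeat=4):          # all 16 target functions
        d = [(x, f[x]) for x in X_train]         # training sample
        h = algo(d)                              # learner's hypothesis
        e = sum(h(x) != f[x] for x in X_test) / len(X_test)
        errors.append(e)
    return sum(errors) / len(errors)

print(mean_ots_error(algo_constant), mean_ots_error(algo_majority))
# -> 0.5 0.5: averaged over all targets, no algorithm beats any other
```

The off-training-set labels are uniform over the set of all target functions, so any deterministic learner averages exactly 50% error there.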

2. Material and methods

Domain knowledge is necessary to provide the assumptions that supervised machine learning relies upon.

2.1. Kernel methods

2.1.1. Terminology

The term "kernel" is derived from a word that can be traced back to c. 1000 and originally meant a seed (contained within a fruit) or the softer (usually edible) part contained within the hard shell of a nut or stone-fruit. The former meaning is now obsolete. It was first used in mathematics when it was defined for integral equations in which the kernel is known and the other function(s) unknown, but it now has several meanings in mathematics. As far as I am aware, the machine learning term "kernel trick" was first used in 1998.

2.1.2. Definition

The kernel of a function f is the equivalence relation on the function's domain that roughly expresses the idea of "equivalent as far as the function f can tell".

Definition 1. Let X and Y be sets and let f be a function from X to Y. Elements x1 and x2 of X are equivalent if f(x1) and f(x2) are equal, i.e. are the same element of Y. Formally, if f : X → Y, then

$$\ker f = \{(x_1, x_2) \in X \times X : f(x_1) = f(x_2)\}.$$

The kernel trick (described below) uses the kernel as a similarity measure, and the term "kernel function" is often used for f above.

2.1.3. Motivation and description

Firstly, linearity is rather special, and outside quantum mechanics no real system is truly linear. Secondly, detecting linear relations has been the focus of much research in statistics and machine learning for decades, and the resulting algorithms are well understood, well developed and efficient. Naturally, one wants the best of both worlds. So, if a problem is non-linear, instead of trying to fit a non-linear model, one can map the problem from the input space to a new (higher-dimensional) space (called the feature space) by doing a non-linear transformation using suitably chosen basis functions and then use a linear model in the feature space. This is known as the "kernel trick". The linear model in the feature space corresponds to a non-linear model in the input space. This approach can be used in both classification and regression problems. The choice of kernel function is crucial for the success of all kernel algorithms because the kernel constitutes prior knowledge that is available about a task. Accordingly, there is no free lunch (see Section 1.2) in kernel choice.

2.1.4. Kernel trick

The kernel trick was first published by Aizerman, Braverman, and Rozonoer (1964). Mercer's theorem states that any continuous, symmetric, positive semi-definite kernel function K(x, y) can be expressed as a dot product in a high-dimensional space.

If the arguments to the kernel are in a measurable space X, and if the kernel is positive semi-definite, i.e.

$$\sum_{i=1}^{n} \sum_{j=1}^{n} K(x_i, x_j)\, c_i c_j \ge 0$$

for any finite subset {x1, . . . , xn} of X and subset {c1, . . . , cn} of objects (typically real numbers, but they could even be molecules), then there exists a function φ(x) whose range is in an inner product space of possibly high dimension, such that

$$K(x, y) = \langle \phi(x), \phi(y) \rangle.$$
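As a concrete sketch of this equivalence (my illustration, not from the paper), the homogeneous polynomial kernel K(x, y) = (x · y)² on R² equals an explicit inner product in a 3-dimensional feature space:

```python
import math

# Kernel trick sketch: K(x, y) = (x . y)^2 on R^2 equals the inner
# product of explicit feature maps phi(x) = (x1^2, x2^2, sqrt(2) x1 x2),
# so the kernel evaluates the feature-space dot product implicitly.
def kernel(x, y):
    return (x[0]*y[0] + x[1]*y[1]) ** 2

def phi(x):
    return (x[0]**2, x[1]**2, math.sqrt(2) * x[0] * x[1])

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
print(kernel(x, y), dot(phi(x), phi(y)))  # equal up to floating point
```

The kernel evaluation costs two multiplications and a square, regardless of the dimension of the feature space it implicitly works in, which is the computational shortcut described below.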

2.1.5. Advantages

The kernel defines a similarity measure between two data points and thus allows one to incorporate prior knowledge of the problem domain.

Most importantly, the kernel contains all of the information about the relative positions of the inputs in the feature space, and the actual learning algorithm is based only on the kernel function and can thus be carried out without explicit use of the feature space. The training data only enter the algorithm through their entries in the kernel matrix (a Gram matrix, see Appendix A), and never through their individual attributes. Because one never explicitly has to evaluate the feature map in the high dimensional feature space, the kernel function represents a computational shortcut.

The number of operations required is not necessarily proportional to the number of features.

2.2. Support vector machines

A support vector machine (SVM) is a supervised learning technique from the field of machine learning applicable to both classification and regression. Rooted in the statistical learning theory developed by Vladimir Vapnik and co-workers, SVMs are based on the principle of structural risk minimization (Vapnik & Chervonenkis, 1974).

The background mathematics required includes probability, linear algebra and functional analysis. More specifically: vector spaces, inner product spaces, Hilbert spaces (defined in Appendix B), operators, eigenvalues and eigenvectors. A good book for learning the background maths is Introductory Real Analysis (Kolmogorov & Fomin, 1975).

Support vector machines (reviewed briefly in Section 1.4) are the best-known example of kernel methods.

The basic idea of an SVM is as follows:

1. Non-linearly map the input space into a very high dimensional feature space (the kernel trick).
2. In the case of classification, construct an optimal separating hyperplane in this space (a maximal margin classifier); or, in the case of regression, perform linear regression in this space, but without penalising small errors.
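The two-stage idea can be illustrated with a kernel perceptron, a simpler relative of the SVM that also learns a linear boundary in the feature space induced by a kernel. This sketch is mine, not the authors' implementation, and it omits the margin maximization that distinguishes a true SVM:

```python
import math

# Kernel perceptron sketch: a linear classifier in the RBF-induced
# feature space, trained on XOR, which is NOT linearly separable in
# the input space but is separable in the feature space.
def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [-1, 1, 1, -1]

alpha = [0.0] * len(X)           # dual coefficients, one per training point

def predict(x):
    s = sum(a * y * rbf(xi, x) for a, y, xi in zip(alpha, Y, X))
    return 1 if s >= 0 else -1

for _ in range(20):              # a few mistake-driven passes
    for i, (xi, yi) in enumerate(zip(X, Y)):
        if predict(xi) != yi:    # update in dual form: bump alpha_i
            alpha[i] += 1.0

print([predict(x) for x in X])   # -> [-1, 1, 1, -1]
```

Note that, exactly as described in Section 2.1.5, training and prediction touch the data only through kernel evaluations; the feature space is never constructed explicitly.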

2.3. Preprocessing

Preprocessing the data is a vital part of forecasting. Filtering the data is a common procedure, but should be avoided altogether if it is suspected that the time series may be chaotic (there is little evidence for low-dimensional chaos in financial data; Sewell, 2011a). In the following work, simple averaging was used to deal with missing data. It is good practice to normalize the data so that the inputs are in the range [0,1] or [−1,1]; here I used [−1,1]. Care was taken to avoid multicollinearity in the inputs, as this would increase the variance (in a bias-variance sense). Another common task is outlier removal; however, if an outlier is a market crash, it is obviously highly significant, so no outliers were removed. Useful references include Masters (1995), Pyle (1999) and (to a lesser extent) Theodoridis and Koutroumbas (2008).
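The two preprocessing steps used here can be sketched as follows (a minimal illustration of mine, not the authors' code; it handles interior gaps only):

```python
# Preprocessing sketch: fill a missing value with the mean of its
# neighbours, then scale the series to [-1, 1].
def fill_missing(series):
    out = list(series)
    for i, v in enumerate(out):
        if v is None:                            # average of the points
            out[i] = (out[i-1] + out[i+1]) / 2   # immediately before/after
    return out

def scale(series, lo=-1.0, hi=1.0):
    mn, mx = min(series), max(series)
    return [lo + (hi - lo) * (v - mn) / (mx - mn) for v in series]

prices = [1.10, 1.12, None, 1.18, 1.16]
filled = fill_missing(prices)                    # 1.15 replaces the gap
print(scale(filled))
```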

2.4. Model selection

For books on model selection, see Burnham and Anderson (2002) and Claeskens and Hjort (2008). For a Bayesian approach to model selection using foreign exchange data, see Sewell (2008) and Sewell (2009). Support vector machines are implemented here, which employ structural risk minimization, and a validation set is used for meta-parameter selection.

Typically, the data is split thus: the first 50% is the training set, the next 25% the validation set and the final 25% the test set. However, in the experiments below I split the data set in the same manner as that of a published work (Neely et al., 2009), for comparative purposes. The training set is used for training the SVM, the validation set for parameter selection, and the test set is the out-of-sample data. The parameters that generated the highest net profit on the validation set are used for the test set.

Can one use K-fold cross-validation (rather than a sliding window) on a time series? In other words, what assumptions are made if one uses the data in an order other than that in which it was generated? It is only a problem if the function that you are approximating is also a function of time (or order). To be safe, a system should be tested using a data set that is both previously unseen and forwards in time, a rule that I adhered to in the experiments that follow.

2.5. Feature selection

First and foremost, when making assumptions regarding selecting inputs, I (among other things) subscribe to Tobler's first law of geography (Tobler, 1970), which tells us that "everything is related to everything else, but near things are more related than distant things". That is, for example, the following common sense notion is applied: when predicting tomorrow's price change, yesterday's price change is more likely to have predictive value than the daily price change, say, 173 days ago. With such noisy data, standard feature selection techniques such as principal component analysis (PCA), factor analysis and independent component analysis (ICA) all risk overfitting the training set. For reasons of market efficiency, it is safest to take the view that there are no privileged features in financial time series, over and above keeping the inputs potentially relevant, orthogonal and utilizing Tobler's first law of geography. To a degree, the random subspace method (RSM) (Ho, 1998) alleviates the problem of feature selection in areas with little domain knowledge, but was not used here.

2.6. Software

I wrote two Windows versions of support vector machines, both of which are freely available online (including source code):3 SVMdark is based on SVMlight (Joachims, 2004) and written in C for Win32, whilst winSVM is based on mySVM (Rüping, 2000) and written in C++ for Win32. Both products include a model/parameter selection tool which randomly selects the SVM kernel and/or parameters within the range selected by the user. Results for each parameter combination are saved in a spreadsheet and the user can narrow down the range of parameters and home in on the optimum solution for the validation set. The software comes with a tutorial, and has received a great deal of positive feedback from academia, banks and individuals. The programs make a very real practical contribution to SVM model and parameter selection, as they each present the user with an easy-to-use interface that allows them to select a subset of the search space of parameters to be parsed randomly, and enables them to inspect and sort the results with ease in Excel. The random model/parameter selection is particularly beneficial in applications with limited domain knowledge, such as financial time series. Figs. 1 and 2 show screenshots of my Windows SVM software.

The experiments reported in this paper used LIBSVM (Chang & Lin, 2001) and MATLAB.

3 http://winsvm.martinsewell.com/ and http://svmdark.martinsewell.com/.
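The random parameter-selection idea described above can be sketched as follows. This is a mock illustration of mine, not the real tool: `validation_score` stands in for "net return on the validation set" and is invented for the demonstration.

```python
import random

# Random model/parameter selection sketch: sample (C, gamma) from
# user-chosen ranges, score each combination on a validation set, and
# keep a sortable list of results for inspection, best first.
def validation_score(C, gamma):                  # mock objective
    return -((C - 10.0) ** 2 / 100.0 + (gamma - 0.01) ** 2 * 1e4)

random.seed(0)
results = []
for _ in range(200):
    C = 10 ** random.uniform(-6, 6)              # log-uniform sampling
    gamma = 10 ** random.uniform(-4, 2)
    results.append((validation_score(C, gamma), C, gamma))

results.sort(reverse=True)                       # best first, like the
best_score, best_C, best_gamma = results[0]      # sortable spreadsheet
print(best_C, best_gamma)
```

The user would then narrow the sampling ranges around the best rows and repeat, homing in on the optimum for the validation set.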

Fig. 1. SVMdark.

Fig. 2. winSVM.

2.7. Fisher kernel

2.7.1. Introduction

To save space, my literature review on Fisher kernels is omitted here, but is available for download on the Web (Sewell, 2011b). In common with all kernel methods, the support vector machine technique involves two stages: first non-linearly map the input space into a very high dimensional feature space, then apply a learning algorithm designed to discover linear patterns in that space. The novelty in this section concerns the first stage. The basic idea behind the Fisher kernel method is to train a (generative) hidden Markov model (HMM) on data to derive a Fisher kernel for a (discriminative) support vector machine (SVM). The Fisher kernel gives a natural similarity measure that takes into account the underlying probability distribution. If each data item is a (possibly varying length) sequence, the sequence may be used to train a HMM. It is then possible to calculate how much a new data item would "stretch" the parameters of the existing model. This is achieved by, for two data items, calculating and comparing the gradient of the log-likelihood of the data item with respect to the model with a given set of parameters. If these "Fisher scores" are similar, it means that the two data items would adapt the model in the same way; that is, from the point of view of the given parametric model at the current parameter setting, they are similar in the sense that they would require similar adaptations to the parameters.
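The Fisher score idea can be sketched for the simplest possible generative model (my illustration; the paper uses an HMM, not this Bernoulli model):

```python
# Fisher score sketch for an i.i.d. Bernoulli(p) model of a binary
# sequence: the score is the gradient of the log-likelihood with
# respect to p. Sequences with similar scores would adapt the model
# in the same way, which is the similarity the Fisher kernel exploits.
def fisher_score(seq, p):
    ones = sum(seq)
    zeros = len(seq) - ones
    # d/dp [ ones*log(p) + zeros*log(1-p) ]
    return ones / p - zeros / (1 - p)

p = 0.5                                   # current model parameter
a = [1, 1, 0, 1]                          # mostly ones: pulls p upward
b = [1, 0, 1, 1]                          # same counts: identical score
c = [0, 0, 1, 0]                          # mostly zeros: pulls p down
print(fisher_score(a, p), fisher_score(b, p), fisher_score(c, p))
# -> 4.0 4.0 -4.0
```

Sequences a and b get identical scores because they would stretch the parameter identically, while c pulls it the opposite way; an SVM built on these score vectors inherits that notion of similarity.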

2.7.2. Markov chains

Markov chains were introduced by the Russian mathematician Andrey Markov in 1906 (Markov, 1906), although the term did not appear for over 20 years, when it was used by Bernstein (1927). A Markov process is a stochastic process that satisfies the equality P(X_{n+1} | X_1, . . . , X_n) = P(X_{n+1} | X_n). A Markov chain is a discrete-state Markov process. Formally, a discrete time Markov chain is a sequence of random variables X_n, n ≥ 0, such that for every n, P(X_{n+1} = x | X_0 = x_0, X_1 = x_1, . . . , X_n = x_n) = P(X_{n+1} = x | X_n = x_n). In words, the future of the system depends on the present, but not the past.
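A two-state Markov chain can be simulated in a few lines (my illustration, with made-up transition probabilities):

```python
import random

# Markov chain sketch: the next state depends only on the current
# state, P(X_{n+1} | X_n), here via a 2x2 transition table.
P = {0: [0.9, 0.1],      # transition probabilities from state 0
     1: [0.5, 0.5]}      # and from state 1

def simulate(steps, state=0, rng=random.Random(42)):
    path = [state]
    for _ in range(steps):
        state = 0 if rng.random() < P[state][0] else 1
        path.append(state)
    return path

path = simulate(1000)
# The fraction of time in state 0 approaches the stationary value 5/6.
print(sum(1 for s in path if s == 0) / len(path))
```

The stationary distribution solves the balance equation 0.1 π₀ = 0.5 π₁, giving π₀ = 5/6, which the long-run simulation approximates.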

2.7.3. Hidden Markov models

A hidden Markov model (HMM) is a temporal probabilistic model in which the state of the process is described by a single discrete random variable. Loosely speaking, it is a Markov chain observed in noise. The theory of hidden Markov models was developed in the late 1960s and early 1970s by Baum, Eagon, Petrie, Soules and Weiss (Baum, 1972; Baum & Eagon, 1967; Baum & Petrie, 1966; Baum, Petrie, Soules, & Weiss, 1970), whilst the name "hidden Markov model" was coined by L.P. Neuwirth. For more information on HMMs, see the tutorial papers Rabiner and Juang (1986), Poritz (1988), Rabiner (1989) and Eddy (2004), and the books MacDonald and Zucchini (1997), Durbin, Eddy, Krogh, and Mitchison (1999), Elliott, Aggoun, and Moore (2004) and Cappé, Moulines, and Rydén (2005). HMMs have earned their popularity largely from successful application to speech recognition (Rabiner, 1989), but have also been applied to handwriting recognition, gesture recognition, musical score following and bioinformatics.

Formally, a hidden Markov model is a bivariate discrete time process {Xk, Yk}, k ≥ 0, where {Xk} is a Markov chain and, conditional on {Xk}, {Yk} is a sequence of independent random variables such that the conditional distribution of Yk only depends on Xk. The successful application of HMMs to markets is referenced as far back as Kemeny, Snell, and Knapp (1976) and Juang (1985), and the books Bhar and Hamori (2004) and Mamon and Elliott (2007) cover HMMs in finance.
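A minimal left-to-right HMM sampler gives the flavour of "a Markov chain observed in noise". This is my simplification of the kind of model used later for the synthetic-data test, with invented probabilities and with emission depending only on the current state:

```python
import random

# Minimal left-to-right HMM sketch: each step the hidden state either
# recurs or advances to the next state, and emits one symbol; only the
# symbols are observed.
recur = [0.5, 0.5, 1.0]                  # P(stay) per state; last absorbs
emit = [[0.96, 0.02, 0.02],              # P(symbol | state)
        [0.02, 0.96, 0.02],
        [0.02, 0.02, 0.96]]

def sample(length, rng=random.Random(1)):
    state, out = 0, []
    for _ in range(length):
        r, cum = rng.random(), 0.0
        sym = len(emit[state]) - 1       # fallback for float round-off
        for s, p in enumerate(emit[state]):
            cum += p
            if r < cum:
                sym = s
                break
        out.append(sym)
        if state < len(recur) - 1 and rng.random() >= recur[state]:
            state += 1                   # left-to-right: only forward moves
    return out

print(sample(10))
```

An observer sees only the emitted symbols; the state path that produced them is hidden, which is exactly the estimation problem the Baum-Welch algorithm addresses.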

2.7.4. Fixed length strings generated by a hidden Markov model

As explained in the introduction, the Fisher kernel gives a natural similarity measure that takes into account an underlying probability distribution. It seems natural to compare two data points through the directions in which they "stretch" the parameters of the model, that is, by viewing the score function at the two points as a function of the parameters and comparing the two gradients. If the gradient vectors are similar, it means that the two data items would adapt the model in the same way; that is, from the point of view of the given parametric model at the current parameter setting, they are similar in the sense that they would require similar adaptations to the parameters.

Parts of the final chapter of Shawe-Taylor and Cristianini (2004), which covers turning generative models into kernels, are followed, resulting in the code in Appendix C; the calculation of the Fisher scores for the transmission probabilities was omitted from the book, but is included here.

2.8. Test

This subsection concerns the prediction of synthetic data, generated by a very simple 5-symbol, 5-state HMM, in order to test the Fisher kernel. The hidden Markov model used in this paper is based on a C++ implementation of a basic left-to-right HMM which uses the Baum-Welch (maximum likelihood) training algorithm, written by Richard Myers.4 The hidden Markov model used to generate the synthetic data is shown below. Following the header, imagine a series of ordered blocks, each of which is two lines long. Each of the 5 blocks corresponds to a state in the model. Within each block, the first line gives the probability of the model recurring (the first number) followed by the probability of generating each of the possible output symbols when it recurs (the following five numbers). The second line gives the probability of the model transitioning to the next state (the first number) followed by the probability of generating each of the possible output symbols when it transitions (the following five numbers).

states: 5
symbols: 5

0.5 0.96 0.01 0.01 0.01 0.01
0.5 0.96 0.01 0.01 0.01 0.01
0.5 0.01 0.96 0.01 0.01 0.01
0.5 0.01 0.96 0.01 0.01 0.01
0.5 0.01 0.01 0.96 0.01 0.01
0.5 0.01 0.01 0.96 0.01 0.01
0.5 0.01 0.01 0.01 0.96 0.01
0.5 0.01 0.01 0.01 0.96 0.01
1.0 0.01 0.01 0.01 0.01 0.96
0.0 0.0 0.0 0.0 0.0 0.0

The step-by-step methodology follows.

1. Create a HMM with 5 states and 5 symbols, as above. Save as hmm.txt.
2. Use generate_seq on hmm.txt to generate 10,000 sequences, each 11 symbols long, each symbol ∈ {0,1,2,3,4}. Output will be hmm.txt.seq.
3. Save the output, hmm.txt.seq, in Fisher.xlsx, Sheet 1. Split the data into 5000 sequences for training, 2500 sequences for validation and 2500 sequences for testing. Separate the 11th column; this will be the target and is not used until later.
4. Copy the training data (without the 11th column) into stringst.txt.
5. Run train_hmm on stringst.txt, with the following parameter settings: seed = 1234, states = 5, symbols = 5 and min_delta_psum = 0.01. The output will be hmmt.txt.
6. From Fisher.xlsx, Sheet 1, copy all of the data except the target column into strings.txt.
7. In strings.txt, replace symbols thus: 4 → 5, 3 → 4, 2 → 3, 1 → 2, 0 → 1 (this is simply an artefact of the software). Save.
8. Run Fisher.exe (code given in Appendix C); inputs are hmmt.txt and strings.txt, output will be fisher.txt.
9. Use formati.exe5 to convert fisher.txt to LIBSVM format: formati.exe fisher.txt fisherf.txt.
10. Copy and paste fisherf.txt into Fisher.xlsx, Sheet 2 (cells need to be formatted for text).
11. Copy target data from Fisher.xlsx, Sheet 1 into a temporary file and replace symbols thus: 4 → 5, 3 → 4, 2 → 3, 1 → 2, 0 → 1.
12. Insert the target data into Fisher.xlsx, Sheet 2, column A, then split the data into training set, validation set and test set.
13. Copy and paste into training.txt, validation.txt and test.txt.
14. Scale the data.
15. Apply LIBSVM for regression with the default Gaussian (RBF) kernel e^{−γ‖u−v‖²}, using the validation set to select C ∈ {0.1, 1, 10, 100, 1000, 10000, 100000} and ε ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}: svmtrain.exe -s 3 -t 2 [. . .]. In practice, five parameter combinations performed joint best on the validation set, namely {C = 1, ε = 0.00001}, {C = 1, ε = 0.0001}, {C = 1, ε = 0.001}, {C = 1, ε = 0.01} and {C = 1, ε = 0.1}, so the median values were chosen, C = 1 and ε = 0.001. Run LIBSVM with these parameter settings on the test set.

Results are given in Table 1. There are five symbols, so if the algorithm was no better than random, one would expect a correct classification rate of 20.00%. The results are impressive, and evidence the fact that my implementation of the Fisher kernel works.

Table 1
Fisher kernel test results.

|                            | Training set | Validation set | Test set |
|----------------------------|--------------|----------------|----------|
| Correct classification (%) | 84.28        | 83.60          | 83.08    |

4 Available from ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/recognition/hmm-1.3.tar.gz.
5 Available from http://format.martinsewell.com/.

3. Calculation

3.1. Introduction

As reported in the introduction, there is evidence that, on average, SVMs outperform ANNs when applied to the prediction of financial or commodity markets. Therefore, my approach focuses on kernel methods, and includes an SVM. The no free lunch theorem for supervised machine learning discussed earlier showed us that there is no free lunch in kernel choice, and that the success of our algorithm depends on the assumptions that we make. The kernel constitutes prior knowledge that is available about a task, so the choice of kernel function is crucial for the success of all kernel algorithms. A kernel is a similarity measure, and it seems wise to use the data itself to learn the optimal similarity measure. This section compares a vanilla support vector machine, three existing methods of learning the kernel (the Fisher kernel, the DC algorithm and a Bayes point machine) and a new technique, a DC algorithm-Fisher kernel hybrid, when applied to the classification of daily foreign exchange log returns into positive and negative.

3.2. Data

In Park and Irwin (2004)'s excellent review of technical analysis, genetic programming did quite well on foreign exchange data, and Christopher Neely is the most published author within the academic literature on technical analysis (Neely, 1997, 1998; Neely & Weller, 2001; Neely et al., 1997), so for the sake of comparison, the experiments conducted in this section use the same data sets as employed in Neely et al. (2009). The FX rates were originally from the Board of Governors of the Federal Reserve System, and are published online via the H.10 release. The interest rate data was from the Bank for International Settlements (BIS), and is not in the public domain. All of the data was kindly provided by Chris Neely. Missing data was filled in by taking averages of the data points immediately before and after the missing value. The experiments forecast six currency pairs, USD/DEM, USD/JPY, GBP/USD, USD/CHF, DEM/JPY and GBP/CHF, independently. As in Neely et al. (2009), the data set was divided up thus: training set 1975-1977, validation set 1978-1980, and the (out-of-sample) test set spanned 1981 to 30 June 2005.

Let P_t be the exchange rate (such as USD/DEM) on day t, I_t the annual interest rate of the nominator currency (e.g. USD) and I*_t the annual interest rate of the denominator currency (e.g. DEM); d = 1 Monday to Thursday and d = 3 on Fridays; n is the number of round-trip trades; and c is the one-way transaction cost. Consistent with Neely et al. (2009), c was taken as 0.0005 from 1978 to 1980, then decreasing in a linear fashion to 0.000094 on 30 June 2005.

Table 2
Fisher kernel symbol allocation.

| Range                           | Symbol |
|---------------------------------|--------|
| r < 20th centile                | 0      |
| 20th centile ≤ r < 40th centile | 1      |
| 40th centile ≤ r < 60th centile | 2      |
| 60th centile ≤ r < 80th centile | 3      |
| r ≥ 80th centile                | 4      |

For the vanilla SVM, Bayes point machine, DC algorithm and DC-Fisher hybrid, the inputs are

$$\log\frac{P_t}{P_{t-1}}, \qquad \log\frac{P_{t-1}}{P_{t-5}}, \qquad \log\frac{P_{t-5}}{P_{t-20}},$$

plus, for four of the currency pairs, USD/DEM, GBP/USD, USD/CHF and GBP/CHF,

$$\frac{d}{365}\log\frac{1 + I_{t-1}/100}{1 + I^*_{t-1}/100}, \qquad \sum_{i=t-5}^{t-2}\frac{d}{365}\log\frac{1 + I_i/100}{1 + I^*_i/100} \qquad\text{and}\qquad \sum_{i=t-20}^{t-6}\frac{d}{365}\log\frac{1 + I_i/100}{1 + I^*_i/100}.$$
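The three price-based inputs can be computed from a daily price series as follows (my illustration; the price series here is hypothetical and used only for shape):

```python
import math

# Sketch of the three price-based inputs for day t: log returns over
# the last day, the previous four days, and the previous fifteen days.
def price_inputs(P, t):
    return (math.log(P[t] / P[t-1]),
            math.log(P[t-1] / P[t-5]),
            math.log(P[t-5] / P[t-20]))

# A hypothetical, steadily drifting exchange-rate series.
P = [1.0 + 0.001 * i for i in range(30)]
x1, x2, x3 = price_inputs(P, 25)
print(round(x1, 6), round(x2, 6), round(x3, 6))
```

By construction the three inputs telescope: their sum is the 20-day log return log(P_t / P_{t−20}), so they partition recent history into non-overlapping horizons, in the spirit of Tobler's first law applied to time.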

For the Fisher kernel experiment, the original inputs are

$$\log\frac{P_{t-9}}{P_{t-10}}, \; \ldots, \; \log\frac{P_t}{P_{t-1}}.$$

In all cases, the target is +1 or −1, depending on whether the following day's log return, log(P_{t+1}/P_t), is positive or negative.

The cumulative net return, r, over k days is given by

$$r = \sum_{t=0}^{k-1}\left(\log\frac{P_{t+1}}{P_t} + \frac{d}{365}\log\frac{1 + I_t/100}{1 + I^*_t/100}\right) + n\log\frac{1 - c}{1 + c}.$$
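The net return formula above can be sketched directly in code (my illustration, with a hypothetical four-day price series; the formula is implemented as printed, so the direction of any position taken is not modelled here):

```python
import math

# Cumulative net return sketch: daily log price changes plus the
# daily interest-rate differential, with a cost penalty of
# n * log((1 - c)/(1 + c)) for n round-trip trades.
def net_return(P, I, I_star, d, n, c):
    r = 0.0
    for t in range(len(P) - 1):
        r += math.log(P[t+1] / P[t])
        r += (d[t] / 365.0) * math.log((1 + I[t] / 100) / (1 + I_star[t] / 100))
    return r + n * math.log((1 - c) / (1 + c))

P = [1.00, 1.01, 1.005, 1.02]        # hypothetical exchange rates
I = [5.0, 5.0, 5.0]                  # nominator-currency interest rate (%)
I_star = [3.0, 3.0, 3.0]             # denominator-currency interest rate (%)
d = [1, 1, 3]                        # day counts (3 covers a weekend)
print(net_return(P, I, I_star, d, n=2, c=0.0005))
```

Note that log((1 − c)/(1 + c)) is negative, so each round trip is a penalty of roughly 2c in log-return terms.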

3.3. Vanilla support vector machine

The experiment employs LIBSVM (Chang & Lin, 2001) Version 2.91, for classification. In common with all of the experiments in this section, a Gaussian radial basis function e^{−γ‖u−v‖²} was chosen as the similarity measure. Whilst systematically cycling through different combinations of values of meta-parameters, the SVM is repeatedly trained on the training set and tested on the validation set. Meta-parameters were chosen thus: C ∈ {10^{−6}, 10^{−5}, . . . , 10^{6}} and γ ∈ {0.0001, 0.001, 0.01, 0.1, 1, 10, 100}. For each currency pair, the parameter combination that led to the highest net return on the validation set was used for the (out-of-sample) test set.

3.4. Fisher kernel

1. Data consists of daily log returns of FX.
2. Split the data into many smaller subsequences of 11 data points each (with each subsequence overlapping the previous subsequence by 10 data points).
3. For each subsequence, the target is +1 or -1, depending on whether the following day's log return, log(P_{t+1}/P_t), is positive or negative.
4. Convert each subsequence of log returns into a 5-symbol alphabet {0, 1, 2, 3, 4}. Each log return, r, is replaced by a symbol according to Table 2, where centiles are derived from the training set. In other words, the range of returns is split into equiprobable regions, and each allocated a symbol.
5. Split the data into training set, validation set and test set as previously described.
6. Exclude target data until otherwise mentioned.
7. For each training set, generate a left-to-right 5-state hidden Markov model, giving us the following parameters: state transition probability matrix and conditional probabilities of symbols given states.
8. Using the program whose C++ code is provided in Appendix C, plus the parameters of the HMM and each string from the training set, determine the Fisher scores.
9. Create a new data set using the Fisher scores as the input vectors and the original targets as the targets. Each input vector will have 50 elements, and each target will be either -1 or +1.
10. Using LIBSVM, proceed with an SVM as described for the vanilla SVM above.
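Step 4 of the pipeline can be sketched as follows (the exact breakpoint convention is an assumption, since Table 2 is not reproduced here):

```python
def centile_breakpoints(train_returns, n_symbols=5):
    """Breakpoints that split the training-set returns into equiprobable
    regions, estimated from the empirical centiles."""
    xs = sorted(train_returns)
    return [xs[(i * len(xs)) // n_symbols] for i in range(1, n_symbols)]

def to_symbol(r, breaks):
    """Replace a log return by a symbol from the alphabet {0, 1, 2, 3, 4}:
    the number of breakpoints it equals or exceeds."""
    s = 0
    for b in breaks:
        if r >= b:
            s += 1
    return s
```

Because the breakpoints come only from the training set, the validation and test sets are discretized without any look-ahead.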

3.5. DC algorithm

The DC algorithm searches within the convex hull of a prescribed set of basic kernels for one which minimizes a convex regularization functional. The method and software used here is that outlined in Argyriou, Hauser, Micchelli, and Pontil (2006). An implementation written in MATLAB was downloaded from the website of Andreas Argyriou.6 The validation set was used to select the following parameters: μ ∈ {10^-3, 10^-4, ..., 10^-11}; for USD/DEM, GBP/USD, USD/CHF and GBP/CHF, block sizes ∈ {[6], [3,3], [2,2,2], [1,1,2,2]}; for USD/JPY and DEM/JPY, block sizes ∈ {[3], [1,2]}; and for all cases, ranges ∈ {[75, 25000], [100, 10000], [500, 5000]}.

6 http://ttic.uchicago.edu/argyriou/code/dc/dc.tar

3.6. Bayes point machine

Given a sample of labelled instances, the so-called version space is defined as the set of classifiers consistent with the sample. Whilst an SVM singles out the consistent classifier with the largest margin, the Bayes point machine (Herbrich, Graepel, & Campbell, 2001) approximates the Bayes-optimal decision by the centre of mass of version space. Tom Minka's Bayes Point Machine (BPM) MATLAB toolbox,7 which implements the expectation propagation (EP) algorithms for training, was used. Expectation propagation is a family of algorithms developed by Tom Minka (Minka, 2001a, 2001b) for approximate inference in Bayesian models. The method approximates the integral of a function by approximating each factor by sequential moment-matching. EP unifies and generalizes two previous techniques: (1) assumed-density filtering, an extension of the Kalman filter, and (2) loopy belief propagation, an extension of belief propagation in Bayesian networks. The BPM attempts to select the optimum kernel width by inspecting the training set. The expected error rate of the BPM was fixed at 0.45, and the kernel width γ ∈ {0.0001, 0.001, 0.01, 0.1, 1, 10, 100}. Using LIBSVM (Chang & Lin, 2001), a standard support vector machine was trained on the training set with the optimal γ found using the BPM and C ∈ {10^-6, 10^-5, ..., 10^6} selected using the validation set.

7 http://research.microsoft.com/en-us/um/people/minka/papers/ep/bpm/

3.7. DC algorithm–Fisher kernel hybrid

This section describes a novel algorithm. First, the Fisher kernel was derived, as described earlier, using the FX data. The data from step 9 of the Fisher kernel method was used. The input data consists of the Fisher scores with respect to the parameters of the hidden Markov model in the Fisher kernel, namely the emission and transition probabilities respectively,

\frac{\partial \log P_M(s|\theta)}{\partial \epsilon_{a,b}} \quad \text{and} \quad \frac{\partial \log P_M(s|\theta)}{\partial \tau_{r,a}}.

The input data was scaled. Next, the data was split into training, validation and test sets as previously described. Then the DC algorithm, as described above, was used to find an optimal kernel. The validation set was used to select the following parameters used in the DC algorithm: μ ∈ {10^-3, 10^-4, ..., 10^-11}, block sizes ∈ {[50], [25,25], [16,17,17], [12,12,13,13]} and ranges ∈ {[75, 25000], [100, 10000], [500, 5000]}.

4. Results

Tables 3–7 show an analysis of the out of sample results, whilst for the sake of comparison, Table 8 shows the results from Neely et al. (1997) published in Neely et al. (2009) (NWD/NWU). Annual returns (AR) are calculated both gross and net of transaction costs. The Sharpe ratios are annualized, and their standard errors (SE) calculated, in accordance with Lo (2002).

Table 3
Out of sample results, vanilla SVM.

              USD/DEM  USD/JPY  GBP/USD  USD/CHF  DEM/JPY  GBP/CHF    Mean
Gross AR%        0.27     0.03     1.68     1.51     2.78     7.22    0.26
Net AR%          3.82     5.09     2.90     4.51     7.75     7.22    2.81
t-stat           1.78     2.37     1.45     1.90     3.80     4.15    1.19
Sharpe ratio     0.36     0.48     0.28     0.38     0.75     0.81    0.24
(SE)             0.21     0.21     0.21     0.21     0.23     0.23    0.22
Trades/year     81.47   105.22    24.98    60.08    99.10     0.08   61.82

Table 4
Out of sample results, Bayes point machine.

              USD/DEM  USD/JPY  GBP/USD  USD/CHF  DEM/JPY  GBP/CHF    Mean
Gross AR%        0.27     2.63     1.68     3.48     3.43     7.22    1.68
Net AR%          3.82     3.46     2.90     2.44     3.43     7.22    0.49
t-stat           1.78     1.61     1.45     1.03     1.68     4.15    0.34
Sharpe ratio     0.36     0.32     0.28     0.20     0.32     0.81    0.06
(SE)             0.21     0.21     0.21     0.20     0.21     0.23    0.21
Trades/year     81.47    17.14    24.98    18.53     0.08     0.08   23.71

Table 5
Out of sample results, Fisher kernel.

              USD/DEM  USD/JPY  GBP/USD  USD/CHF  DEM/JPY  GBP/CHF    Mean
Gross AR%        0.84     1.54     1.04     4.04     0.17     5.74    1.66
Net AR%          0.56     4.15     1.99     4.01     4.63     3.72    0.60
t-stat           0.26     1.94     1.00     1.69     2.27     2.14    0.27
Sharpe ratio     0.05     0.39     0.19     0.33     0.45     0.43    0.05
(SE)             0.20     0.21     0.20     0.21     0.21     0.21    0.21
Trades/year     28.82    53.35    61.96     0.82    88.49    42.65   46.01

Table 6
Out of sample results, DC algorithm.

              USD/DEM  USD/JPY  GBP/USD  USD/CHF  DEM/JPY  GBP/CHF    Mean
Gross AR%        2.90     1.20     1.25     1.09     0.57     5.28    1.04
Net AR%          1.85     2.98     4.01     1.81     3.84     4.68    1.02
t-stat           0.86     1.39     2.01     0.77     1.88     2.69    0.42
Sharpe ratio     0.17     0.28     0.40     0.15     0.37     0.51    0.09
(SE)             0.20     0.21     0.21     0.20     0.21     0.21    0.21
Trades/year     20.57    36.65    55.22    60.94    70.04     8.98   42.07

Table 7
Out of sample results, DC-Fisher hybrid.

              USD/DEM  USD/JPY  GBP/USD  USD/CHF  DEM/JPY  GBP/CHF    Mean
Gross AR%        1.09     3.52     3.28     2.71     4.89     6.43    0.57
Net AR%          5.33     1.66     6.56     0.29     9.61     5.49    2.35
t-stat           2.49     0.77     3.28     0.12     4.71     3.15    1.07
Sharpe ratio     0.51     0.15     0.64     0.02     0.97     0.61    0.22
(SE)             0.21     0.20     0.22     0.20     0.24     0.22    0.22
Trades/year     85.59    38.33    66.16    48.33    94.82    19.43   58.78

Table 8
Out of sample results, Neely et al. (1997) via Neely et al. (2009).

              USD/DEM  USD/JPY  GBP/USD  USD/CHF  DEM/JPY  GBP/CHF    Mean
Gross AR%        5.79     1.86     2.26     0.25     3.17     0.06    2.22
Net AR%          5.54     1.60     1.99     0.01     2.04     0.18    1.83
t-stat           2.15     0.85     0.85     0.00     1.34     0.03    0.86
Sharpe ratio     0.59     0.22     0.24     0.03     0.35     0.02    0.23
(SE)             0.29     0.28     0.28     0.28     0.30     0.29    0.29
Trades/year      5.17     5.37     5.02     4.88    23.39     2.26    7.68

Table 9
Summary of results on forecasting.

Beat market?            Yes
Beat standard SVM?      Yes
Beat state of the art?  No

5. Discussion

The mean gross returns from all six experiments were positive, with NWD/NWU being the highest, followed by BPM and the Fisher kernel, whilst the vanilla SVM was the lowest. The mean net returns were positive for NWD/NWU and BPM; NWD/NWU performed best, and the vanilla SVM and hybrid algorithm the worst. BPM, the Fisher kernel, the DC algorithm and the hybrid algorithm were all improvements over the vanilla SVM in terms of both gross returns and net returns, but none achieved net returns as high as NWD/NWU. One likely reason is that the genetic programming methodology was better suited to optimally restricting the number of trades per year. However, the performance of the genetic programming trading system described in Neely et al. (1997) was one of the worst reported in Neely et al. (2009). The following three methods performed best. Sweeney (1986) used filter rules, as described in Fama and Blume (1966). Taylor (1994) considered ARIMA(1,0,2) trading rules, prespecifying the ARIMA order and choosing the parameters and the size of a band of inactivity to maximize in-sample profitability. Dueker and Neely (2007) used a Markov-switching model on deviations from uncovered interest parity, with time-varying mean, variance, and kurtosis to develop trading rules. In-sample data was used to estimate model parameters and to construct optimal bands of inactivity that reduce trading frequency.

6. Conclusions

The applications of the Fisher kernel, the DC algorithm and Bayes point machine to financial time series are all new. Most novel of all was the use of the DC algorithm to learn the parameters of the hidden Markov model in the Fisher kernel. Table 9 gives a summary of the goals achieved in this paper. More precise conclusions are elusive, because a slight change to the data set or the inputs can produce quite different results. Although I believe that machine learning in general, and learning the kernel in particular, have a lot to offer financial time series prediction, financial data is a poor test bed for comparing machine learning algorithms due to its vanishingly small signal-to-noise ratio.

Acknowledgements

Thanks to David Barber and Edward Tsang for several suggestions. Many thanks are also due to Chris Neely for data and support.

Appendix A. Gram matrix

Definition 2. Given a set S = {x_1, ..., x_n} of vectors from an inner product space X, the n × n matrix G with entries G_ij = ⟨x_i · x_j⟩ is called the Gram matrix (or kernel matrix) of S.
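A minimal illustration of Definition 2, using the standard dot product as the inner product:

```python
def gram_matrix(S, inner=lambda x, y: sum(a * b for a, b in zip(x, y))):
    """G[i][j] = <x_i, x_j> for the vectors in S; substituting a kernel
    function for `inner` yields the kernel matrix instead."""
    return [[inner(x, y) for y in S] for x in S]
```

By construction the result is symmetric, and for a valid kernel it is also positive semi-definite.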

Appendix B. Hilbert space

Definition 3. A Hilbert space is a Euclidean space which is complete, separable and infinite-dimensional.

In other words, a Hilbert space is a set H of elements f, g, ... of any kind such that

H is a Euclidean space, i.e. a real linear space equipped with a scalar product;

H is complete with respect to the metric ρ(f, g) = ‖f − g‖;

H is separable, i.e. H contains a countable everywhere dense subset;

H is infinite-dimensional, i.e., given any positive integer n, H contains n linearly independent elements.
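A standard concrete example satisfying all four conditions is the sequence space ℓ² of square-summable sequences:

```latex
\ell^2 = \Big\{ (x_1, x_2, \ldots) : \sum_{i=1}^{\infty} x_i^2 < \infty \Big\},
\qquad
\langle x, y \rangle = \sum_{i=1}^{\infty} x_i y_i .
```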

Appendix C. Fisher kernel source code

//line numbers refer to Code Fragment 12.4 (p. 435)

//in Shawe-Taylor and Cristianini (2004)

//use symbols 1, 2, 3, etc.

//(the original include list was garbled in extraction; these are the
//headers the surviving code plausibly needs)
#include <cmath>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main ()
{
  int string_length = 10;
  int number_of_states = 5;
  int number_of_symbols = 5;
  int p = number_of_states;
  int n = string_length;
  int a, b;
  double Prob = 0;
  string stringstring;
  ifstream hmmstream("hmmt.txt");     //INPUT: HMM, one line of params
  ifstream stringfile("strings.txt"); //INPUT: symbol strings, one per line
  ofstream fisherfile("fisher.txt");  //OUTPUT: Fisher scores, 1 data pt/line
  int s[n + 1];             //symbol string, uses s[1] to s[n] (s[0] is never used)
  double PM[p + 1][p + 1];  //state transition probability matrix
  double P[number_of_symbols + 1][p + 1];      //cond. probs of symbols given states
  double scoree[p + 1][number_of_symbols + 1]; //Fisher scores for the emission probs
  double scoret[p + 1][p + 1];                 //Fisher scores for the transition probs
  double forw[p + 1][n + 1];  //forward probabilities
  double back[p + 1][n + 1];  //backward probabilities

  //initialize to zero
  for (int i = 0; i <= p; i++)
    for (int j = 0; j <= n; j++) {
      forw[i][j] = 0;
      back[i][j] = 0;
    }
  //. . . (the remainder of the listing, including the forward and backward
  //recursions and the score computation, is truncated in this transcript)
}
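The forward probabilities stored in the listing's `forw` array can be sketched with a generic HMM forward recursion (a textbook reconstruction, not the paper's exact code):

```python
def forward(pi, A, B, obs):
    """HMM forward recursion: alpha[t][i] = P(o_1..o_{t+1}, state = i).
    pi: initial state probabilities, A[i][j]: transition probabilities,
    B[i][o]: probability of emitting symbol o from state i."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[j][o] * sum(prev[i] * A[i][j] for i in range(n))
                      for j in range(n)])
    return alpha  # P(obs) = sum(alpha[-1])
```

Together with the analogous backward recursion, these quantities yield the expected state-occupancy and transition counts from which the Fisher scores are computed.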

The Annals of Mathematical Statistics, 41, 164–171.

Bernstein, S. (1927). Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97, 1–59.

Bhar, R., & Hamori, S. (2004). Hidden Markov models: Applications to financial economics. Advanced studies in theoretical and applied econometrics (Vol. 40). Dordrecht: Kluwer Academic Publishers.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.

Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer-Verlag.

Cappé, O., Moulines, E., & Rydén, T. (2005). Inference in hidden Markov models. Springer series in statistics. New York: Springer.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. National Taiwan University.

Claeskens, G., & Hjort, N. L. (2008). Model selection and model averaging. Cambridge series in statistical and probabilistic mathematics (Vol. 27). Cambridge: Cambridge University Press.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.

Dueker, M., & Neely, C. J. (2007). Can Markov switching models predict excess foreign exchange returns? Journal of Banking & Finance, 31, 279–296.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1999). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press.

Eddy, S. R. (2004). What is a hidden Markov model? Nature Biotechnology, 22, 1315–1316.

Elliot, R. J., Aggoun, L., & Moore, J. B. (2004). Hidden Markov models: Estimation and control. Applications of mathematics (Vol. 29). New York: Springer-Verlag.

Fama, E. F., & Blume, M. E. (1966). Filter rules and stock-market trading. The Journal of Business, 39, 226–241.

Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46, 234–240.

Vapnik, V. N., & Chervonenkis, A. Y. (1974). Teoriya raspoznavaniya obrazov: Statisticheskie problemy obucheniya. Moscow: Nauka [Russian, Theory of pattern recognition: Statistical problems of learning].

Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8, 1341–1390.

Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1, 67–82.


Forecasting foreign exchange rates using kernel methods

1 Introduction
  1.1 Objectives
  1.2 Background
    1.2.1 No free lunch theorems
    1.2.2 No free lunch theorem for optimization/search
    1.2.3 No free lunch theorem for supervised machine learning
  1.3 Kernel methods
  1.4 Support vector machines
    1.4.1 The application of support vector machines to the financial domain
2 Material and methods
  2.1 Kernel methods
    2.1.1 Terminology
    2.1.2 Definition
    2.1.3 Motivation and description
    2.1.4 Kernel trick
    2.1.5 Advantages
  2.2 Support vector machines
  2.3 Preprocessing
  2.4 Model selection
  2.5 Feature selection
  2.6 Software
  2.7 Fisher kernel
    2.7.1 Introduction
    2.7.2 Markov chains
    2.7.3 Hidden Markov models
    2.7.4 Fixed length strings generated by a hidden Markov model
  2.8 Test
3 Calculation
  3.1 Introduction
  3.2 Data
  3.3 Vanilla support vector machine
  3.4 Fisher kernel
  3.5 DC algorithm
  3.6 Bayes point machine
  3.7 DC algorithm–Fisher kernel hybrid
4 Results
5 Discussion
6 Conclusions
Acknowledgements
Appendix A Gram matrix
Appendix B Hilbert space
Appendix C Fisher kernel source code
References
