forecasting foreign exchange rates using kernel methods

Expert Systems with Applications 39 (2012) 7652–7662

Contents lists available at SciVerse ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier .com/locate /eswa

Forecasting foreign exchange rates using kernel methods

Martin Sewell a,⇑, John Shawe-Taylor b,1

a The Cambridge Centre for Climate Change Mitigation Research (4CMR), Department of Land Economy, University Of Cambridge, 16-21 Silver Street, Cambridge CB3 9EP,United Kingdomb Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom

a r t i c l e i n f o a b s t r a c t

Keywords:ForecastingForeign exchangeKernel methods

0957-4174/$ - see front matter � 2012 Elsevier Ltd. Adoi:10.1016/j.eswa.2012.01.026

⇑ Corresponding author. Tel.: +44 (0) 1223 765224E-mail addresses: [email protected] (M. Sewell)

(J. Shawe-Taylor).1 Tel.: +44 (0) 20 76797680; fax: +44 (0) 20 7387132 http://www.no-free-lunch.org.

First, the all-important no free lunch theorems are introduced. Next, kernel methods, support vectormachines (SVMs), preprocessing, model selection, feature selection, SVM software and the Fisher kernelare introduced and discussed. A hidden Markov model is trained on foreign exchange data to derive aFisher kernel for an SVM, the DC algorithm and the Bayes point machine (BPM) are also used to learnthe kernel on foreign exchange data. Further, the DC algorithm was used to learn the parameters ofthe hidden Markov model in the Fisher kernel, creating a hybrid algorithm. The mean net returns werepositive for BPM; and BPM, the Fisher kernel, the DC algorithm and the hybrid algorithm were allimprovements over a standard SVM in terms of both gross returns and net returns, but none achievednet returns as high as the genetic programming approach employed by Neely, Weller, and Dittmar(1997) and published in Neely, Weller, and Ulrich (2009). Two implementations of SVMs for Windowswith semi-automated parameter selection are built.

� 2012 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. Objectives

This paper employs kernel methods to forecast foreign ex-change rates, and aims to (1) beat the market, (2) beat existingstandard methodology, and (3) beat the state of the art. Note thatin a foreign exchange market ‘beating the market’ simply meansearning a positive return, but it is not a useful benchmark as it doesnot incorporate risk.

1.2. Background

1.2.1. No free lunch theoremsThe two main no free lunch theorems (NFL) are introduced and

then evolutionary algorithms and statistical learning theory arereconciled with the NFL theorems. The theorems are novel, non-trivial, frequently misunderstood and profoundly relevant to opti-mization, machine learning and science in general (and often con-veniently ignored by the evolutionary algorithms and statisticallearning theory communities). I (Sewell) run the world’s only nofree lunch website.2.

ll rights reserved.

; fax: +44 (0) 1223 337130., [email protected]

97.

1.2.2. No free lunch theorem for optimization/searchThe no free lunch theorem for search and optimization applies

to finite spaces and algorithms that do not resample points. Thetheorem tells us that all algorithms that search for an extremumof a cost function perform exactly the same when averaged overall possible cost functions. So, for any search/optimization algo-rithm, any elevated performance over one class of problems is ex-actly paid for in performance over another class. See Wolpert andMacready (1997).

The no free lunch theorem for search implies that putting blindfaith in evolutionary algorithms as a blind search/optimization algo-rithm is misplaced. For example, on average, a genetic algorithm isno better, or worse, than any other search algorithm. In practice, inour universe, one will only be interested in a subset of all possiblefunctions. This means that it is necessary to show that the set of func-tions that are of interest has some property that allows a particularalgorithm to perform better than random search on this subset.

1.2.3. No free lunch theorem for supervised machine learningHume (1739–1740) pointed out that ‘even after the observation

of the frequent or constant conjunction of objects, we have no reasonto draw any inference concerning any object beyond those of whichwe have had experience’. More recently, and with increasing rigour,Mitchell (1980), Schaffer (1994) and Wolpert (1996) showed thatbias-free learning is futile. The no free lunch theorem for supervisedmachine learning (Wolpert, 1996) shows that in a noise-free sce-nario where the loss function is the misclassification rate, in terms

http://dx.doi.org/10.1016/j.eswa.2012.01.026

mailto:[email protected]

mailto:[email protected]

http://www.no-free-lunch.org

http://dx.doi.org/10.1016/j.eswa.2012.01.026

http://www.sciencedirect.com/science/journal/09574174

http://www.elsevier.com/locate/eswa

M. Sewell, J. Shawe-Taylor / Expert Systems with Applications 39 (2012) 7652–7662 7653

of off-training-set error, there are no a priori distinctions betweenlearning algorithms.

More formally, where

d = training set;m = number of elements in training set;f = ‘target’ input–output relationships;h = hypothesis (the algorithm’s guess for f made in response tod); andc = off-training-set ‘loss’ associated with f and h (‘generalizationerror’ or ‘test set error’)

all algorithms are equivalent, on average, by any of the followingmeasures of risk: E(cjd), E(cjm), E(cjf,d) or E(cjf,m).

How well you do is determined by how ‘aligned’ your learningalgorithm P(hjd) is with the actual posterior, P(fjd). This result, inessence, formalizes Hume, extends him and calls all of science intoquestion.

The NFL proves that if you make no assumptions about the tar-get functions, or if you have a uniform prior, then P(cjd) is indepen-dent of one’s learning algorithm. Vapnik appears to ‘prove’ thatgiven a large training set and a small VC dimension, one can gen-eralize well. The VC dimension is a property of the learning algo-rithm, so no assumptions are being made about the targetfunctions. So, has Vapnik found a free lunch? VC theory tells us thatthe training set error, s, converges to c. If � is an arbitrary real num-ber, the VC framework actually concerns

Pðjc � sj > �jf ;mÞ:

VC theory does not concern

Pðcjs;m;VC dimensionÞ:

So there is no free lunch for Vapnik, and no guarantee that supportvector machines generalize well.

1.3. Kernel methods

Central to the work on forecasting in this paper is the concept ofa kernel. The technical aspects of kernels are dealt with in Section2.1 (pp. –6), and the history is given here. The Fisher kernel is de-rived and implemented below; to save space, a thorough literaturereview is provided in Sewell (2011b).

1.4. Support vector machines

Support vector machines (SVMs) are used extensively in theforecasting of financial time series and are covered in more detailin Section 2.2 (p. 6). Among other sources, the introductory paper(Hearst, Dumais, Osuna, Platt, & Schölkopf, 1998), the classic SVMtutorial (Burges, 1998), the excellent book (Cristianini & Shawe-Taylor, 2000) and the implementation details within Joachims(2002) have contributed to my own understanding.

1.4.1. The application of support vector machines to the financialdomain

An exhaustive review of articles that apply SVMs to the finan-cial domain (not reported here) included 38 articles that compareSVMs with artificial neural networks (ANNs), SVMs outperformedANNs in 32 cases, ANNs outperformed SVMs in 3 cases, and therewas no significance difference in 3 cases. More specifically, of the22 articles that concern the prediction of financial or commoditymarkets, 18 favoured SVMs, 2 favoured ANNs and 2 found no sig-nificant difference. This bodes well for SVMs, and as such, the fol-lowing research on forecasting shall employ them.

2. Material and methods

Domain knowledge is necessary to provide the assumptionsthat supervised machine learning relies upon.

2.1. Kernel methods

2.1.1. TerminologyThe term kernel is derived from a word that can be traced back to c.

1000 and originally meant a seed (contained within a fruit) or thesofter (usually edible) part contained within the hard shell of a nutor stone-fruit. The former meaning is now obsolete. It was first usedin mathematics when it was defined for integral equations in whichthe kernel is known and the other function(s) unknown, but now hasseveral meanings in mathematics. As far as I am aware, the machinelearning term kernel trick was first used in 1998.

2.1.2. DefinitionThe kernel of a function f is the equivalence relation on the func-

tion’s domain that roughly expresses the idea of ‘equivalent as faras the function f can tell’.

Definition 1. Let X and Y be sets and let f be a function from X to Y.Elements x1 and x2 of X are equivalent if f(x1) and f(x2) are equal, i.e.are the same element of Y. Formally, if f : X ? Y, then

kerðf Þ ¼ fðx1; x2Þ 2 X � X : f ðx1Þ ¼ f ðx2Þg:The kernel trick (described below) uses the kernel as a similar-

ity measure and the term kernel function is often used for f above.

2.1.3. Motivation and descriptionFirstly, linearity is rather special, and outside quantum mechan-

ics no real system is truly linear. Secondly, detecting linear rela-tions has been the focus of much research in statistics andmachine learning for decades and the resulting algorithms are wellunderstood, well developed and efficient. Naturally, one wants thebest of both worlds. So, if a problem is non-linear, instead of tryingto fit a non-linear model, one can map the problem from the inputspace to a new (higher-dimensional) space (called the featurespace) by doing a non-linear transformation using suitably chosenbasis functions and then use a linear model in the feature space.This is known as the ‘kernel trick’. The linear model in the featurespace corresponds to a non-linear model in the input space. Thisapproach can be used in both classification and regression prob-lems. The choice of kernel function is crucial for the success of allkernel algorithms because the kernel constitutes prior knowledgethat is available about a task. Accordingly, there is no free lunch(see p. 1) in kernel choice.

2.1.4. Kernel trickThe kernel trick was first published by Aizerman, Braverman,

and Rozonoer (1964). Mercer’s theorem states that any continuous,symmetric, positive semi-definite kernel function K(x,y) can be ex-pressed as a dot product in a high-dimensional space.

If the arguments to the kernel are in a measurable space X, andif the kernel is positive semi-definite—i.e.

Xn

i¼1

Xn

j¼1

Kðxi; xjÞcicj P 0

for any finite subset {x1, . . . ,xn} of X and subset {c1, . . . ,cn} of objects(typically real numbers, but could even be molecules)—then thereexists a function u(x) whose range is in an inner product space ofpossibly high dimension, such that

Kðx; yÞ ¼ uðxÞ �uðyÞ:

3 http://winsvm.martinsewell.com/ and http://svmdark.martinsewell.com/.

7654 M. Sewell, J. Shawe-Taylor / Expert Systems with Applications 39 (2012) 7652–7662

2.1.5. Advantages

� The kernel defines a similarity measure between two datapoints and thus allows one to incorporate prior knowledge ofthe problem domain.� Most importantly, the kernel contains all of the information

about the relative positions of the inputs in the feature spaceand the actual learning algorithm is based only on the kernelfunction and can thus be carried out without explicit use ofthe feature space. The training data only enter the algorithmthrough their entries in the kernel matrix (a Gram matrix, seeAppendix A (p. 22)), and never through their individual attri-butes. Because one never explicitly has to evaluate the featuremap in the high dimensional feature space, the kernel functionrepresents a computational shortcut.� The number of operations required is not necessarily propor-

tional to the number of features.

2.2. Support vector machines

A support vector machine (SVM) is a supervised learning tech-nique from the field of machine learning applicable to both classi-fication and regression. Rooted in the statistical learning theorydeveloped by Vladimir Vapnik and co-workers, SVMs are basedon the principle of structural risk minimization (Vapnik & Chervo-nenkis, 1974).

The background mathematics required includes probability, lin-ear algebra and functional analysis. More specifically: vectorspaces, inner product spaces, Hilbert spaces (defined in AppendixB (p. 22)), operators, eigenvalues and eigenvectors. A good bookfor learning the background maths is Introductory Real Analysis(Kolmogorov & Fomin, 1975).

Support vector machines (reviewed briefly on p. 4) are the best-known example of kernel methods.

The basic idea of an SVM is as follows:

1. Non-linearly map the input space into a very high dimensionalfeature space (the ‘kernel trick’).

2.� In the case of classification, construct an optimal separating

hyperplane in this space (a maximal margin classifier); or� in the case of regression, perform linear regression in this

space, but without penalising small errors.

2.3. Preprocessing

Preprocessing the data is a vital part of forecasting. Filtering thedata is a common procedure, but should be avoided altogether if itis suspected that the time series may be chaotic (there is little evi-dence for low dimensional chaos in financial data Sewell, 2011a).In the following work, simple averaging was used to deal withmissing data. It is good practice to normalize the data so that theinputs are in the range [0,1] or [�1,1], here I used [�1,1]. Carewas taken to avoid multicollinearity in the inputs, as this would in-crease the variance (in a bias-variance sense). Another commontask is outlier removal, however, if an ‘outlier’ is a market crash,it is obviously highly significant, so no outliers were removed. Use-ful references include Masters (1995), Pyle (1999) and (to a lesserextent) Theodoridis and Koutroumbas (2008).

2.4. Model selection

For books on model selection, see Burnham and Anderson (2002)and Claeskens and Hjort (2008). For a Bayesian approach to modelselection using foreign exchange data, see Sewell (2008) and Sewell(2009). Support vector machines are implemented here, which em-

ploy structural risk minimization, and a validation set is used formeta-parameter selection.

Typically, the data is split thus: the first 50% is the ‘training set’,the next 25% the ‘validation set’ and the final 25% the ‘test set’. How-ever, in the experiments below I split the data set in the same manneras that of a published work (Neely et al., 2009), for comparative pur-poses. The training set is used for training the SVM, the validation setfor parameter selection and the test set is the out of sample data. Theparameters that generated the highest net profit on the validationset are used for the test set.

Can one use K-fold cross-validation (rather than a sliding win-dow) on a time series? In other words, what assumptions are madeif one uses the data in an order other than that in which it was gen-erated? It is only a problem if the function that you are approxi-mating is also a function of time (or order). To be safe, a systemshould be tested using a data set that is both previously unseenand forwards in time, a rule that I adhered to in the experimentsthat follow.

2.5. Feature selection

First and foremost, when making assumptions regarding select-ing inputs, I (among other things) subscribe to Tobler’s first law ofgeography (Tobler, 1970) that tells us that ‘everything is relatedto everything else, but near things are more related than distantthings’. That is, for example, the following common sense notionis applied: when predicting tomorrow’s price change, yesterday’sprice change is more likely to have predictive value than the dailyprice change, say, 173 days ago. With such noisy data, standard fea-ture selection techniques such as principal component analysis(PCA), factor analysis and independent component analysis (ICA)all risk overfitting the training set. For reasons of market efficiency,it is safest to take the view that there are no privileged features infinancial time series, over and above keeping the inputs potentiallyrelevant, orthogonal and utilizing Tobler’s first law of geography. Toa degree, the random subspace method (RSM) (Ho, 1998) alleviatesthe problem of feature selection in areas with little domain knowl-edge, but was not used here.

2.6. Software

I wrote two Windows versions of support vector machines, bothof which are freely available online3 (including source code):SVMdark is based on SVMlight (Joachims, 2004) and written in C forWin32, whilst winSVM is based on mySVM (Rüping, 2000) and writ-ten in C++ for Win32. Both products include a model/parameterselection tool which randomly selects the SVM kernel and/or param-eters within the range selected by the user. Results for each param-eter combination are saved in a spreadsheet and the user can narrowdown the range of parameters and home in on the optimum solutionfor the validation set. The software comes with a tutorial, and has re-ceived a great deal of positive feedback from academia, banks andindividuals. The programs make a very real practical contributionto SVM model and parameter selection, as they each present the userwith an easy-to-use interface that allows them to select a subset ofthe search space of parameters to be parsed randomly, and enablesthem to inspect and sort the results with ease in Excel. The randommodel/parameter selection is particularly beneficial in applicationswith limited domain knowledge, such as financial time series. Figs. 1and 2 (p. 10) show screenshots of my Windows SVM software.

The experiments reported in this paper used LIBSVM (Chang &Lin, 2001) and MATLAB.

http://winsvm.martinsewell.com/

http://svmdark.martinsewell.com/

Fig. 1. SVMdark.

Fig. 2. winSVM.


2.7. Fisher kernel

2.7.1. IntroductionTo save space, my literature review on Fisher kernels is omitted

here, but is available for download on the Web (Sewell, 2011b). Incommon with all kernel methods, the support vector machinetechnique involves two stages: first non-linearly map the inputspace into a very high dimensional feature space, then apply alearning algorithm designed to discover linear patterns in thatspace. The novelty in this section concerns the first stage. The basicidea behind the Fisher kernel method is to train a (generative) hid-den Markov model (HMM) on data to derive a Fisher kernel for a

(discriminative) support vector machine (SVM). The Fisher kernelgives a ‘natural’ similarity measure that takes into account theunderlying probability distribution. If each data item is a (possiblyvarying length) sequence, the sequence may be used to train aHMM. It is then possible to calculate how much a new data itemwould ‘stretch’ the parameters of the existing model. This isachieved by, for two data items, calculating and comparing the gra-dient of the log-likelihood of the data item with respect to themodel with a given set of parameters. If these ‘Fisher scores’ aresimilar it means that the two data items would adapt the modelin the same way, that is from the point of view of the given para-metric model at the current parameter setting they are similar inthe sense that they would require similar adaptations to theparameters.

2.7.2. Markov chainsMarkov chains were introduced by the Russian mathematician

Andrey Markov in 1906 (Markov, 1906), although the term didnot appear for over 20 years when it was used by Bernstein(1927). A Markov process is a stochastic process that satisfies theequality P(Xn+1jX1, . . . ,Xn) = P(Xn+1jXn). A Markov chain is a dis-crete-state Markov process. Formally, a discrete time Markov chainis a sequence of n random variables Xn, n P 0 such that for every n,P(Xn+1 = xjX0 = x0, X1 = x1, . . . , Xn = xn) = P(Xn+1 = xjXn = xn). In words,the future of the system depends on the present, but not the past.

2.7.3. Hidden Markov modelsA hidden Markov model (HMM) is a temporal probabilistic model

in which the state of the process is described by a single discreterandom variable. Loosely speaking, it is a Markov chain observedin noise. The theory of hidden Markov models was developed inthe late 1960s and early 1970s by Baum, Eagon, Petrie, Soulesand Weiss (Baum, 1972; Baum & Eagon, 1967; Baum & Petrie,1966; Baum, Petrie, Soules, & Weiss, 1970), whilst the name ‘hid-den Markov model’ was coined by L.P. Neuwirth. For more infor-mation on HMMs, see the tutorial papers Rabiner and Juang(1986), Poritz (1988), Rabiner (1989) and Eddy (2004), and thebooks MacDonald and Zucchini (1997), Durbin, Eddy, Krogh, andMitchison (1999), Elliot, Aggoun, and Moore (2004) and Cappé,

Table 1Fisher kernel test results.

Training set Validation set Test set

Correct classification (%) 84.28 83.60 83.08


Moulines, and Rydén (2005). HMMs have earned their popularitylargely from successful application to speech recognition (Rabiner,1989), but have also been applied to handwriting recognition, ges-ture recognition, musical score following and bioinformatics.

Formally, a hidden Markov model is a bivariate discrete timeprocess {Xk,Yk}kP0, where Xk is a Markov chain and, conditionalon Xk, Yk is a sequence of independent random variables such thatthe conditional distribution of Yk only depends on Xk.

The successful application of HMMs to markets is referenced asfar back as Kemeny, Snell, and Knapp (1976) and Juang (1985). Thebooks Bhar and Hamori (2004) and Mamon and Elliott (2007) coverHMMs in finance.

2.7.4. Fixed length strings generated by a hidden Markov modelAs explained in the introduction, the Fisher kernel gives a ‘nat-

ural’ similarity measure that takes into account an underlyingprobability distribution. It seems natural to compare two datapoints through the directions in which they ‘stretch’ the parame-ters of the model, that is by viewing the score function at thetwo points as a function of the parameters and comparing thetwo gradients. If the gradient vectors are similar it means thatthe two data items would adapt the model in the same way, thatis from the point of view of the given parametric model at the cur-rent parameter setting they are similar in the sense that theywould require similar adaptations to the parameters.

Parts of the final chapter of Shawe-Taylor and Cristianini(2004)—which covers turning generative models into kernels—are followed, resulting in the code in Appendix C (pp. 22–26), thecalculation of the Fisher scores for the transmission probabilitieswere omitted from the book, but included here.

2.8. Test

This subsection concerns the prediction of synthetic data, gen-erated by a very simple 5-symbol, 5-state HMM, in order to testthe Fisher kernel. The hidden Markov model used in this paper isbased on a C++ implementation of a basic left-to-right HMM whichuses the Baum–Welch (maximum likelihood) training algorithmwritten by Richard Myers4. The hidden Markov model used to gen-erate the synthetic data is shown below. Following the header ima-gine a series of ordered blocks, each of which is two lines long. Eachof the 5 blocks corresponds to a state in the model. Within eachblock, the first line gives the probability of the model recurring(the first number) followed by the probability of generating eachof the possible output symbols when it recurs (the following fivenumbers). The second line gives the probability of the model transi-tioning to the next state (the first number) followed by the probabil-ity of generating each of the possible output symbols when ittransitions (the following five numbers).

states: 5

symbols: 5

1.0

0.5

4 Available f3.tar.gz.

0.96

rom ftp://svr-f

0.01

tp.eng.cam.ac.u

0.01

k/pub/comp.sp

0.01

eech/recogniti

0.01

0.5
0.96 0.01 0.01 0.01 0.01
0.5
0.01 0.96 0.01 0.01 0.01
0.5
0.01 0.96 0.01 0.01 0.01
0.5
0.01 0.01 0.96 0.01 0.01
0.5
0.01 0.01 0.96 0.01 0.01
0.5
0.01 0.01 0.01 0.96 0.01
0.5
0.01 0.01 0.01 0.96 0.01
1.0
0.01 0.01 0.01 0.01 0.96
0.0
0.0 0.0 0.0 0.0 0.0
on/hmm-

The step-by-step methodology follows.

1. Create a HMM with 5 states and 5 symbols, as above. Save ashmm.txt.

2. Use generate_seq on hmm.txt to generate 10,000 sequences,each 11 symbols long, each symbol 2 {0,1,2,3,4}. Outputwill be hmm.txt.seq.

3. Save the output, hmm.txt.seq, in Fisher.xlsx, Sheet 1. Splitthe data into 5000 sequences for training, 2500 sequencesfor validation and 2500 sequences for testing. Separate the11th column, this will be the target and is not used untillater.

4. Copy the training data (without the 11th column) intostringst.txt.

5. Run train_hmm on strings.txt , with the following parametersettings: seed = 1234, states = 5, symbols = 5 and min_delta_psum = 0.01. The output will be hmmt.txt.

6. From Fisher.xlsx, Sheet 1, copy all of the data except the tar-get column into strings.txt.

7. In strings.txt, replace symbols thus: 4 ? 5, 3 ? 4, 2 ? 3,1 ? 2, 0 ? 1 (this is simply an artefact of the software). Save.

8. Run Fisher.exe (code given in Appendix C (pp. 22–26)),inputs are hmmt.txt and strings.txt, output will be fisher.txt.

9. Use formati.exe5 to convert fisher.txt to LIBSVM format: ‘for-mati.exe fisher.txt fisherf.txt’.

10. Copy and paste fisherf.txt into Fisher.xlsx, Sheet 2 (cells needto be formatted for text).

11. Copy target data from Fisher.xlsx, Sheet 1 into a temporaryfile and replace symbols thus: 4 ? 5, 3 ? 4, 2 ? 3, 1 ? 2,0 ? 1.

12. Insert the target data into Fisher.xlsx, Sheet 2, column A thensplit the data into training set, validation set and test set.

13. Copy and paste into training.txt, validation.txt and test.txt.14. Scale the data.15. Apply LIBSVM for regression with default Gaussian (rbf)

kernel ðe�ck~u�~vk2 Þ using the validation set to select C 2 {0.1,1,10,100,1000,10000,100000} and � 2 {0.00001,0.0001,0.001,0.01,0.1}, ‘svmtrain.exe -s 3 -t 2 [. . .]’. In practice, fiveparameter combinations performed joint best on thevalidation set, namely {C = 1,� = 0.00001}, {C = 1,� =0.0001}, {C=1, � =0.001}, {C = 1,� =0.01} and {C = 1,� = 0.1},so the median values were chosen, C = 1 and � = 0.001. RunLIBSVM with these parameter settings on the test set.

Results are given in Table 1. There are five symbols, so if thealgorithm was no better than random, one would expect a correctclassification rate of 20.00%. The results are impressive, andevidence the fact that my implementation of the Fisher kernelworks.

3. Calculation

3.1. Introduction

As reported in the introduction, there is evidence that, onaverage, SVMs outperform ANNs when applied to the prediction

5 Available from http://format.martinsewell.com/.

http://format.martinsewell.com/

Table 2Fisher kernel symbol allocation.

Range Symbol

r < 20th centile 020th centile 6 r < 40th centile 140th centile 6 r < 60th centile 260th centile 6 r < 80th centile 3r P 80th centile 4


of financial or commodity markets. Therefore, my approach fo-cuses on kernel methods, and includes an SVM. The no freelunch theorem for supervised machine learning discussed earliershowed us that there is no free lunch in kernel choice, and thatthe success of our algorithm depends on the assumptions thatwe make. The kernel constitutes prior knowledge that is avail-able about a task, so the choice of kernel function is crucialfor the success of all kernel algorithms. A kernel is a similaritymeasure, and it seems wise to use the data itself to learn theoptimal similarity measure. This section compares a vanilla sup-port vector machine, three existing methods of learning the ker-nel—the Fisher kernel, the DC algorithm and a Bayes pointmachine—and a new technique, a DC algorithm-Fisher kernel hy-brid, when applied to the classification of daily foreign exchangelog returns into positive and negative.

3.2. Data

In Park and Irwin (2004)’s excellent review of technical anal-ysis, genetic programming did quite well on foreign exchangedata, and Christopher Neely is the most published author withinthe academic literature on technical analysis (Neely, 1997, 1998;Neely & Weller, 2001; Neely et al., 1997), so for the sake of com-parison, the experiments conducted in this section use the samedata sets as employed in Neely et al. (2009). The FX rates wereoriginally from the Board of Governors of the Federal ReserveSystem, and are published online via the H.10 release. The inter-est rate data was from the Bank for International Settlements(BIS), and is not in the public domain. All of the data was kindlyprovided by Chris Neely. Missing data was filled in by takingaverages of the data points immediately before and after themissing value. The experiments forecast six currency pairs,USD/DEM, USD/JPY, GBP/USD, USD/CHF, DEM/JPY and GBP/CHF,independently. As in Neely et al. (2009), the data set was dividedup thus: training set 1975–1977, validation set 1978–1980 andthe (out-of-sample) test set spanned 1981–30 June 2005.

Let Pt be the exchange rate (such as USD/DEM) on day t, It theannual interest rate of the nominator currency (e.g. USD) and I�tthe annual interest rate of the denominator currency (e.g. DEM),d = 1 Monday to Friday and d = 3 on Fridays, n is the number ofround trip trades and c is the one-way transaction cost. Consistentwith Neely et al. (2009), c was taken as 0.0005 from 1978 to 1980,then decreasing in a linear fashion to 0.000094 on 30 June 2005.For the vanilla SVM, Bayes point machine, DC algorithm and DC-Fisher hybrid, the inputs are

logPt

Pt�1; log

Pt�1

Pt�5; log

Pt�5

Pt�20;

plus, for four of the currency pairs, USD/DEM, GBP/USD, USD/CHFand GBP/CHF,

d365

log1þ I�t�1

100

1þ It�1100

;Xt�5

i¼t�2

d365

log1þ I�i

100

1þ Ii100

andXt�20

i¼t�6

d365

log1þ I�i

100

1þ Ii100

:

For the Fisher kernel experiment, the original inputs are

logðPt�9=Pt�10Þ . . . logðPt=Pt�1Þ:

In all cases, the target is +1 or �1, depending on whether the follow-ing day’s log return, log Ptþ1

Pt, is positive or negative.

The cumulative net return, r, over k days is given by

r ¼Xk�1

t¼0

logPtþ1

Ptþ d

365log

1þ I�t100

1þ It100

!þ n log

1� c1þ c

;

3.3. Vanilla support vector machine

The experiment employs LIBSVM (Chang & Lin, 2001) Version2.91, for classification. In common with all of the experiments inthis section, a Gaussian radial basis function ðe�ck~u�~vk2 Þ was chosenas the similarity measure. Whilst systematically cycling throughdifferent combinations of values of meta-parameters, the SVM isrepeatedly trained on the training set and tested on the validationset. Meta-parameters were chosen thus: C 2 {10�6,10�5, . . . ,106}and r 2 {0.0001,0.001,0.01,0.1,1,10,100}. For each currency pair,the parameter combination that led to the highest net return onthe validation set was used for the (out of sample) test set.

3.4. Fisher kernel

1. Data consists of daily log returns of FX.2. Split the data into many smaller subsequences of 11 data

points each (with each subsequence overlapping the previ-ous subsequence by 10 data points).

3. For each subsequence, the target is +1 or �1, depending onwhether the following day’s log return, log Ptþ1

Pt, is positive

or negative.4. Convert each subsequence of log returns into a 5-symbol

alphabet {0,1,2,3,4}. Each log return, r, is replaced by a sym-bol according to Table 2, where centiles are derived from thetraining set. In other words, the range of returns is split intoequiprobable regions, and each allocated a symbol.

5. Split the data into training set, validation set and test set aspreviously described above (p. 16).

6. Exclude target data until otherwise mentioned.7. For each training set, generate a left-to-right 5-state hidden

Markov model, giving us the following parameters: statetransition probability matrix and conditional probabilitiesof symbols given states.

8. Using the program whose C++ code is provided in AppendixC (pp. 22–26), plus the parameters of the HMM and eachstring from the training set, determine the Fisher scores.

9. Create a new data set using the Fisher scores as the inputvectors and the original targets as the targets. Each inputvector will have 50 elements, and each target will be either�1 or +1.

10. Using LIBSVM, proceed with an SVM as described for thevanilla SVM above, but using the data set created in 9.

3.5. DC algorithm

This section explores another attempt to ‘learn the kernel’, thistime using the DC (difference of convex functions) algorithm. Foran overview of DC programming, see Horst and Thoai (1999). Theconvex hull of a set of points X in a real vector space V is the min-imal convex set containing X. The idea is to learn convex combina-tions of continuously-parameterized basic kernels by searchingwithin the convex hull of a prescribed set of basic kernels for onewhich minimizes a convex regularization functional. The methodand software used here is that outlined in Argyriou, Hauser, Micch-

Table 3Out of sample results, vanilla SVM.

USD/DEM USD/JPY GBP/USD USD/CHF DEM/JPY GBP/CHF Mean

Gross AR% 0.27 0.03 �1.68 �1.51 �2.78 7.22 0.26Net AR% �3.82 �5.09 �2.90 �4.51 �7.75 7.22 �2.81t-stat �1.78 �2.37 �1.45 �1.90 �3.80 4.15 �1.19Sharpe ratio �0.36 �0.48 �0.28 �0.38 �0.75 0.81 �0.24(SE) 0.21 0.21 0.21 0.21 0.23 0.23 0.22Trades/year 81.47 105.22 24.98 60.08 99.10 0.08 61.82

Table 4Out of sample results, Bayes point machine.


Gross AR% 0.27 �2.63 �1.68 3.48 3.43 7.22 1.68Net AR% �3.82 �3.46 �2.90 2.44 3.43 7.22 0.49t-stat �1.78 �1.61 �1.45 1.03 1.68 4.15 0.34Sharpe ratio �0.36 �0.32 �0.28 0.20 0.32 0.81 0.06(SE) 0.21 0.21 0.21 0.20 0.21 0.23 0.21Trades/year 81.47 17.14 24.98 18.53 0.08 0.08 23.71

Table 5Out of sample results, Fisher kernel.


Gross AR% 0.84 �1.54 1.04 4.04 �0.17 5.74 1.66Net AR% �0.56 �4.15 �1.99 4.01 �4.63 3.72 �0.60t-stat �0.26 �1.94 �1.00 1.69 �2.27 2.14 �0.27Sharpe ratio �0.05 �0.39 �0.19 0.33 �0.45 0.43 �0.05(SE) 0.20 0.21 0.20 0.21 0.21 0.21 0.21Trades/year 28.82 53.35 61.96 0.82 88.49 42.65 46.01

Table 6Out of sample results, DC algorithm.


Gross AR% 2.90 �1.20 �1.25 1.09 �0.57 5.28 1.04Net AR% 1.85 �2.98 �4.01 �1.81 �3.84 4.68 �1.02t-stat 0.86 �1.39 �2.01 �0.77 �1.88 2.69 �0.42Sharpe ratio 0.17 �0.28 �0.40 �0.15 �0.37 0.51 �0.09(SE) 0.20 0.21 0.21 0.20 0.21 0.21 0.21Trades/year 20.57 36.65 55.22 60.94 70.04 8.98 42.07


elli, and Pontil (2006). An implementation written in MATLAB wasdownloaded from the website of Andreas Argyriou6 The validationset was used to select the following parameters. l 2 {10�3,10�4,. . . ,10�11}, for USD/DEM, GBP/USD, USD/CHF and GBP/CHF blocksizes 2 {[6], [3,3], [2,2,2], [1,1,2,2]}, for USD/JPY and DEM/JPY blocksizes 2 {[3], [1,2]}, and for all cases ranges 2{[75,25000], [100,10000], [500,5000]}.

3.6. Bayes point machine

Given a sample of labelled instances, the so-called version spaceis defined as the set of classifiers consistent with the sample.Whilst an SVM singles out the consistent classifier with the largestmargin, the Bayes point machine (Herbrich, Graepel, & Campbell,2001) approximates the Bayes-optimal decision by the centre ofmass of version space. Tom Minka’s Bayes Point Machine (BPM)MATLAB toolbox7 which implements the expectation propagation(EP) algorithms for training was used. Expectation propagation is a

6 http://ttic.uchicago.edu/�argyriou/code/dc/dc.tar.7 http://research.microsoft.com/en-us/um/people/minka/papers/ep/bpm/.

family of algorithms developed by Tom Minka (Minka, 2001b,2001a) for approximate inference in Bayesian models. The methodapproximates the integral of a function by approximating each factorby sequential moment-matching. EP unifies and generalizes two pre-vious techniques: (1) assumed-density filtering, an extension of theKalman filter, and (2) loopy belief propagation, an extension of beliefpropagation in Bayesian networks. The BPM attempts to select theoptimum kernel width by inspecting the training set. The expectederror rate of the BPM was fixed at 0.45, and the kernel widthr 2 {0.0001,0.001,0.01,0.1,1,10,100}. Using LIBSVM (Chang & Lin,2001) a standard support vector machine was trained on the trainingset with the optimal r found using the BPM and C 2 {10�6,10�5, . . . ,106} selected using the validation set.

3.7. DC algorithm–Fisher kernel hybrid

This section describes a novel algorithm. First, the Fisher kernelwas derived, as described earlier, using the FX data. The data fromstep 9 of the Fisher kernel method was used. The input dataconsists of the parameters of the hidden Markov model in theFisher kernel, namely the emission and transition probabilities,respectively

http://ttic.uchicago.edu/~argyriou/code/dc/dc.tar

http://ttic.uchicago.edu/~argyriou/code/dc/dc.tar

http://research.microsoft.com/en-us/um/people/minka/papers/ep/bpm/

Table 7Out of sample results, DC-Fisher hybrid.


Gross AR% �1.09 3.52 �3.28 2.71 �4.89 6.43 0.57Net AR% �5.33 1.66 �6.56 0.29 �9.61 5.49 �2.35t-stat �2.49 0.77 �3.28 0.12 �4.71 3.15 �1.07Sharpe ratio �0.51 0.15 �0.64 0.02 �0.97 0.61 �0.22(SE) 0.21 0.20 0.22 0.20 0.24 0.22 0.22Trades/year 85.59 38.33 66.16 48.33 94.82 19.43 58.78

Table 8Out of sample results, Neely et al. (1997) via Neely et al. (2009).


Gross AR% 5.79 1.86 2.26 0.25 3.17 �0.06 2.22Net AR% 5.54 1.60 1.99 0.01 2.04 �0.18 1.83t-stat 2.15 0.85 0.85 0.00 1.34 �0.03 0.86Sharpe ratio 0.59 0.22 0.24 �0.03 0.35 �0.02 0.23(SE) 0.29 0.28 0.28 0.28 0.30 0.29 0.29Trades/year 5.17 5.37 5.02 4.88 23.39 2.26 7.68

Table 9Summary of results on forecasting.

Beat market? YesBeat standard SVM? YesBeat state of the art? No


@ log PMðsjhÞ@sa;b

and@ log PMðsjhÞ

@hr;a:

The input data was scaled. Next, the data was split into training, val-idation and test sets as previously described. Then the DC algo-rithm, as described above, was used to find an optimal kernel. Thevalidation set was used to select the following parameters used inthe DC algorithm: l 2 {10�3,10�4, . . . ,10�11}, block sizes 2 {[50],[25,25], [16,17,17], [12,12,13,13]} and ranges 2 {[75,25000], [100,10000], [500,5000]}.

4. Results

Tables 3–7 (pp. 19–20) show an analysis of the out of sample re-sults, whilst for the sake of comparison, Table 8 (p. 21) shows theresults from Neely et al. (1997) published in Neely et al. (2009)(NWD/NWU). Annual returns (AR) are calculated both gross andnet of transaction costs. The Sharpe ratios are annualized, and theirstandard errors (SE) calculated, in accordance with Lo (2002).

5. Discussion

The mean gross returns from all six experiments were positive,with NWD/NWU being the highest, followed by BPM and the Fisherkernel, whilst the vanilla SVM was the lowest. The mean net re-turns were positive for NWD/NWU and BPM; NWD/NWU per-formed best, and the vanilla SVM and hybrid algorithm theworst. BPM, the Fisher kernel, the DC algorithm and the hybridalgorithm were all improvements over the vanilla SVM in termsof both gross returns and net returns, but none achieved net re-turns as high as NWD/NWU. One likely reason is that the geneticprogramming methodology was better suited to optimally restrict-ing the number of trades per year. However, the performance ofthe genetic programming trading system described in Neely et al.(1997) was one of the worst reported in Neely et al. (2009). The fol-lowing three methods performed best. Sweeney (1986) used filterrules, as described in Fama and Blume (1966). Taylor (1994) con-

sidered ARIMA (1,0,2) trading rules, prespecifying the ARIMA or-der and choosing the parameters and the size of a ‘band ofinactivity’ to maximize in-sample profitability. Dueker and Neely(2007) used a Markov-switching model on deviations from uncov-ered interest parity, with time-varying mean, variance, and kurto-sis to develop trading rules. In-sample data was used to estimatemodel parameters and to construct optimal ‘bands of inactivity’that reduce trading frequency.

6. Conclusions

The applications of the Fisher kernel, the DC algorithm andBayes point machine to financial time series are all new. Mostnovel of all was the use of the DC algorithm to learn theparameters of the hidden Markov model in the Fisher kernel.Table 9 gives a summary of the goals achieved in this paper.More precise conclusions are elusive, because a slight changeto the data set or the inputs can produce quite different results.Although I believe that machine learning in general, and learn-ing the kernel in particular, have a lot to offer financial timeseries prediction, financial data is a poor test bed for comparingmachine learning algorithms due to its vanishingly small signal-to-noise ratio.

Acknowledgements

Thanks to David Barber and Edward Tsang for several sugges-tions. Many thanks also due to Chris Neely for data and support.

Appendix A. Gram matrix

Definition 2. Given a set S ¼ f~x1; . . . ;~xng of vectors from an innerproduct space X, the n � n matrix G with entries Gij ¼ h~xi �~xji iscalled the Gram matrix (or kernel matrix) of S.

Appendix B. Hilbert space

Definition 3. A Hilbert space is a Euclidean space which iscomplete, separable and infinite-dimensional.

In other words, a Hilbert space is a set H of elements f, g, . . . ofany kind such that

� H is a Euclidean space, i.e. a real linear space equipped with ascalar product;� H is complete with respect to the metric q(f,g) = kf � gk;


� H is separable, i.e. H contains a countable everywhere densesubset;� H is infinite-dimensional, i.e., given any positive integer n, H

contains n linearly independent elements.

Appendix C. Fisher kernel source code

//line numbers refer to Code Fragment 12.4 (p. 435)

//in Shawe-Taylor and Cristianini (2004)

//use symbols 1, 2, 3, etc.

#include <iostream>#include <fstream>#include <sstream>#include <math.h>#include <string>using namespace std;

int main ()

{int string_length = 10;

int number_of_states = 5;

int number_of_symbols = 5;

int p = number_of_states;int n = string_length;int a,b;

double Prob = 0;

string stringstring;

ifstream hmmstream ("hmmt.txt");//INPUT: HMM, one

line of params

ifstream stringfile ("strings.txt");//INPUT:

symbol strings, one per line

ofstream fisherfile ("fisher.txt");//OUTPUT:

Fisher scores, 1 data pt/line

int s[n + 1];//symbol string, uses s[1] to s[n]

(s[0] is never used)

double PM[p + 1][p + 1];//state transition

probability matrix

double P[number_of_symbols + 1][p + 1];//cond.probs of symbols given states

double scoree[p + 1][number_of_symbols + 1];//Fisher scores for the em. probs double

scoret[p + 1][p + 1];//Fisher scores for the

transmission probs

double forw[p + 1][n + 1];

double back[p + 1][n + 1];

//initialize to zero

for (int i = 0; i<=p; i++)for (int j = 0; j<=p; j++)PM[i][j] = 0;

for (int i = 0; i<=number_of_symbols; i++)for (int j = 0; j<=p; j++)P[i][j] = 0;

PM[1][0] = 1.0;//because it is a left-to-right

hidden Markov model

for (int i = 2; i<=p; i++)PM[i][0] = 0;

for (int i = 1; i<=p; i++)for (int j = 1; j<=p; j++)hmmstream �PM[i][j];

for (int i = 1; i<=number_of_symbols; i++)for (int j = 1; j<=p; j++)hmmstream �P[i][j];

while (getline (stringfile, stringstring)) {istringstream stringstream (stringstring);

//initialize to zero

for (int i = 0; i<=p; i++)for (int j = 0; j<=n; j++)forw[i][j] = 0;

for (int i = 0; i<=p; i++)for (int j = 0; j<=n; j++)back[i][j] = 0;

for (int i = 0; i<=p; i++)for (int j = 0; j<=number_of_symbols; j++)scoree[i][j] = 0;

for (int i = 0; i<=p; i++)for (int j = 0; j<=p; j++)scoret[i][j] = 0;

for (int i = 0; i<=n; i++)s[i] = 0;

for (int i = 1; i<=n; i++)stringstream �s[i];

for (int i = 0; i<=p; i++)for (int j = 1; j<=number_of_symbols; j++)scoree[i][j] = 0;//line 2

for (int i = 0; i<=p; i++)for (int j = 1; j<=p; j++)scoret[i][j] = 0;//mvs

for (int i = 0; i<=p; i++)forw[i][0] = 0;//line 3

for (int i = 0; i<=p; i++)//line 4

back[i][n] = 1;

forw[0][0] = 1;//line 4 (corrected)

Prob = 0;

for (int i = 1; i<=n; i++) {//line 5

for (a = 1; a<=p; a++) {//line 7

forw[a][i] = 0;//line 8

for (b = 0; b<=p; b++)//line 9 (corrected)

forw[a][i] = forw[a][i] +

PM[a][b]*forw[b][i-1];//line 10

forw[a][i] = forw[a][i]*P[s[i]][a];//line

12

}}for (a = 1; a<=p; a++)//line 15

Prob = Prob + forw[a][n];

for (int i = n-1;i>=1;i–) {//line 18

for (a = 1; a<=p; a++) {//line 19

back[a][i] = 0;//line 20

for (b = 1; b<=p; b++)back[a][i] = back[a][i] +

PM[b][a]*P[s[i + 1]][b]*back[b][i + 1];//

line 22

}}//Fisher scores for the emission probabilities

for (int i = n-1;i>=1;i–) {//line 18

for (a = 1; a<=p; a++) {//line 19

scoree[a][s[i]] = scoree[a][s[i]] +

back[a][i]*forw[a][i]/

(P[s[i]][a]*Prob);//line 24

for (int sigma = 1;

sigma<=number_of_symbols; sigma++)scoree[a][sigma] = scoree[a][sigma] -

back[a][i]*forw[a][i]/Prob;//line 26

(corrected)

}}//Fisher scores for the transmission

probabilities


for (int i = n-1;i>=1;i–)for (b = 1; b<=p; b++)for (a = 1; a<=p; a++) {scoret[b][a] = scoret[b][a] +

(back[a][i]*forw[b][i-1]*P[s[i]][a]/

Prob - back[b][i]*forw[b][i]/Prob);

}//transform Fisher scores to the interval (0,1)

using the logistic fn

//scoree[i][j] = 1/(1 + exp (-scoree[i][j]));

for (int i = 1; i<=p; i++)for (int j = 1; j<=p; j++)fisherfile scoret[i][j] ’’ ‘‘;

for (int j = 1; j<=number_of_symbols; j++)for (int i = 1; i<=p; i++)fisherfile scoree[i][j] ’’ ‘‘;

fisherfile endl;}hmmstream.close ();

fisherfile.close ();

system ("PAUSE");

}

References

Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundationsof the potential function method in pattern recognition learning. Automationand Remote Control, 25, 821–837.

Argyriou, A., Hauser, R., Micchelli, C. A., & Pontil, M. (2006). A DC-programmingalgorithm for kernel selection. In W. W. Cohen, & A. Moore (Eds.), ICML’06:Proceedings of the 23rd international conference on machine learning (pp. 41–48).New York: ACM Press.

Baum, L. E., 1972. An inequality and associated maximization technique instatistical estimation for probabilistic functions of Markov processes. In O.Shisha (Ed.), Inequalities III: Proceedings of the third symposium on inequalities(pp. 1–8). New York: Academic Press.

Baum, L. E., & Eagon, J. A. (1967). An inequality with applications to statisticalestimation for probabilistic functions of a Markov process and to a model forecology. Bulletin of the American Mathematical Society, 73, 360–-363.

Baum, L. E., & Petrie, T. (1966). Statistical inference for probabilistic functions offinite state Markov chains. The Annals of Mathematical Statistics, 37, 1554–1563.

Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization techniqueoccurring in the statistical analysis of probabilistic functions of Markov chains.The Annals of Mathematical Statistics, 41, 164–171.

Bernstein, S. (1927). Sur l’extension du théorème limite du calcul des probabilitésaux sommes de quantités dépendantes. Mathematische Annalen, 97, 1–59.

Bhar, R., & Hamori, S. (2004). Hidden Markov models: Applications to financialeconomics. Advanced studies in theoretical and applied econometrics (Vol. 40).Dordrecht: Kluwer Academic Publishers.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition.Data Mining and Knowledge Discovery, 2, 121–167.

Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: Apractical information-theoretic approach (2nd ed.). New York: Springer-Verlag.

Cappé, O., Moulines, E., & Rydén, T. (2005). Inference in hidden Markov models.Springer series in statistics. New York: Springer.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines.National Taiwan University.

Claeskens, G., & Hjort, N. L. (2008). Model selection and model averaging. Cambridgeseries in statistical and probabilistic mathematics (Vol. 27). Cambridge:Cambridge University Press.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machinesand other kernel-based learning methods. Cambridge: Cambridge UniversityPress.

Dueker, M., & Neely, C. J. (2007). Can Markov switching models predict excessforeign exchange returns? Journal of Banking & Finance, 31, 279–296.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1999). Biological sequence analysis:probabilistic models of proteins and nucleic acids. Cambridge: CambridgeUniversity Press.

Eddy, S. R. (2004). What is a hidden Markov model? Nature Biotechnology, 22,1315–1316.

Elliot, R. J., Aggoun, L., & Moore, J. B. (2004). Hidden Markov models: Estimation andcontrol. Applications of mathematics (Vol. 29). New York: Springer-Verlag.

Fama, E. F., & Blume, M. E. (1966). Filter rules and stock-market trading. The Journalof Business, 39, 226–241.

Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Schölkopf, B. (1998). Support vectormachines. IEEE Intelligent Systems, 13, 18–28.

Herbrich, R., Graepel, T., & Campbell, C. (2001). Bayes point machines. Journal ofMachine Learning Research, 1, 245–279.

Ho, T. K. (1998). The random subspace method for constructing decision forests.IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 832–844.

Horst, R., & Thoai, N. V. (1999). DC programming: Overview. Journal of OptimizationTheory and Applications, 103, 1–43.

Hume, D. (1739–1740). A treatise of human nature: Being an attempt to introducethe experimental method of reasoning into moral subjects. In D. F. Norton & M.J. Norton (Eds.), Oxford philosophical texts. Oxford: Oxford University Press.

Joachims, T. (2002). Learning to classify text using support vector machines: Methods,theory and algorithms. The Kluwer international series in engineering and computerscience. Boston, MA: Kluwer Academic Publishers.

Joachims, T. (2004). SVMlight. University of Dortmund, Dortmund.Juang, B. H. (1985). Maximum-likelihood estimation for mixture multivariate

stochastic observations of Markov chains. AT&T Technical Journal, 64,1235–1249.

Kemeny, J. G., Snell, J. L., & Knapp, A. W. (1976). Denumerable Markov chains.Graduate Texts in Mathematics (2nd ed.) (Vol. 40). New York: Springer-Verlag.

Kolmogorov, A. N., & Fomin, S. V. (1975). Introductory real analysis. New York:Dover Publications [Revised English edition translated and edited by Silverman,R.A. Originally published 1970, Prentice-Hall, Englewood Cliffs].

Lo, A. W. (2002). The statistics of Sharpe ratios. Financial Analysts Journal, 58, 36–52.MacDonald, I. L., & Zucchini, W. (1997). Hidden Markov and other models for

discrete-valued time series. Monographs on statistics and applied probability (Vol.70). Boca Raton: Chapman & Hall/CRC.

Mamon, R. S., & Elliott, R. J. (Eds.). (2007). Hidden Markov models in finance.International series in operations research & management science (Vol. 104). NewYork: Springer.

Markov, A. A. (1906). Rasprostranenie zakona bol’shih chisel na velichiny,zavisyaschie drug ot druga. Izvestiya Fiziko-Matematicheskogo Obschestva priKazanskom Universitete, 2-ya seriya (Vol. 15, pp. 135–156). [In Russian.English translation: ‘Extension of the law of large numbers to dependentquantities].

Masters, T. (1995). Neural, novel & hybrid algorithms for time series prediction. NewYork: Wiley.

Minka, T. P. (2001a). Expectation propagation for approximate Bayesian inference.In J. S. Breese, & D. Koller (Eds.), UAI’01: Proceedings of the 17th conference inuncertainty in artificial intelligence (pp. 362–369). San Francisco, CA: MorganKaufmann.

Minka, T. P. (2001b). A Family of algorithms for approximate Bayesian inference.Ph.D. thesis Massachusetts Institute of Technology, Cambridge, MA.

Mitchell, T. M. (1980). The need for biases in learning generalizations. Technicalreport CBM-TR-117 Rutgers University, New Brunswick, NJ.

Neely, C. J. (1997). Technical analysis in the foreign exchange market: A layman’sguide. Federal Reserve Bank of St. Louis Review, 79, 23–38.

Neely, C. J. (1998). Technical analysis and the profitability of US foreign exchangeintervention. Federal Reserve Bank of St. Louis Review, 80, 3–18.

Neely, C. J., & Weller, P. A. (2001). Technical analysis and central bank intervention.Journal of International Money and Finance, 20, 949–970.

Neely, C., Weller, P., & Dittmar, R. (1997). Is technical analysis in the foreignexchange market profitable?: A genetic programming approach. Journal ofFinancial and Quantitative Analysis, 32, 405–426.

Neely, C. J., Weller, P. A., & Ulrich, J. M. (2009). The adaptive markets hypothesis:Evidence from the foreign exchange market. Journal of Financial and QuantitativeAnalysis, 44, 467–488.

Park, C.-H., & Irwin, S. H. (2004). The profitability of technical analysis: A review.AgMAS Project Research Report 2004-04 University of Illinois at Urbana-Champaign Urbana.

Poritz, A. B. (1988). Hidden Markov models: A guided tour. In 1988 Internationalconference on acoustics, speech, and signal processing, 1988. ICASSP-88 (Vol. 1, pp.7–13). New York: IEEE Press.

Pyle, D. (1999). Data preparation for data mining. The Morgan Kaufmann Series in DataManagement Systems. San Francisco, CA: Morgan Kaufmann Publishers.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applicationsin speech recognition. Proceedings of the IEEE, 77, 257–286.

Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEEASSP Magazine, 3, 4–16.

Rüping, S. (2000). mySVM. University of Dortmund, Dortmund.Schaffer, C. (1994). A conservation law for generalization performance. In W.

W. Cohen, & H. Hirsh (Eds.), Proceedings of the eleventh internationalconference on machine learning (pp. 259–265). San Francisco, CA: MorganKaufmann.

Sewell, M. (2008). Optimization and methods to avoid overfitting. Talk presented atthe Automated Trading 2008 conference, London, 15 October 2008.

Sewell, M. (2009). Algorithm bias: A statistical review. Futures, 38–40.Sewell, M. (2011a). Characterization of financial time series. Research Note RN/11/

01 University College London, London.Sewell, M. (2011b). The Fisher kernel: A brief review. Research Note RN/11/06

University College London, London.Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis.

Cambridge: Cambridge University Press.Sweeney, R. J. (1986). Beating the foreign exchange market. The Journal of Finance,

41, 163–182.Taylor, S. J. (1994). Trading futures using a channel rule: A study of the predictive

power of technical analysis with currency examples. Journal of Futures Markets,14, 215–235.

Theodoridis, S., & Koutroumbas, K. (2008). Pattern recognition (4th ed.). Burlington,MA: Academic Press.


Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroitregion. Economic Geography, 46, 234–240.

Vapnik, V. N., & Chervonenkis, A. Y. (1974). Teoriya Raspoznavaniya Obrazov:Statisticheskie Problemy Obucheniya. Moscow: Nauka [Russian, Theory ofPattern Recognition: Statistical Problems of Learning].

Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms.Neural Computation, 8, 1341–1390.

Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization.IEEE Transactions on Evolutionary Computation, 1, 67–82.

forecasting foreign exchange rates using kernel methods

Documents