# Forecasting foreign exchange rates using kernel methods

Post on 05-Sep-2016


Martin Sewell a,*, John Shawe-Taylor b,1

a The Cambridge Centre for Climate Change Mitigation Research (4CMR), Department of Land Economy, University of Cambridge, 16-21 Silver Street, Cambridge CB3 9EP, United Kingdom

b Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom

(*) Corresponding author. Tel.: +44 (0) 1223 765224; fax: +44 (0) 1223 337130. E-mail addresses: mvs25@cam.ac.uk (M. Sewell), J.Shawe-Taylor@cs.ucl.ac.uk (J. Shawe-Taylor).

(1) Tel.: +44 (0) 20 76797680; fax: +44 (0) 20 73871397.

Expert Systems with Applications 39 (2012) 7652–7662. Journal homepage: www.elsevier.com/locate/eswa

0957-4174/$ - see front matter. 2012 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2012.01.026

Keywords: Forecasting; Foreign exchange; Kernel methods

Abstract

First, the all-important no free lunch theorems are introduced. Next, kernel methods, support vector machines (SVMs), preprocessing, model selection, feature selection, SVM software and the Fisher kernel are introduced and discussed. A hidden Markov model is trained on foreign exchange data to derive a Fisher kernel for an SVM; the DC algorithm and the Bayes point machine (BPM) are also used to learn the kernel on foreign exchange data. Further, the DC algorithm was used to learn the parameters of the hidden Markov model in the Fisher kernel, creating a hybrid algorithm. The mean net returns were positive for BPM; and BPM, the Fisher kernel, the DC algorithm and the hybrid algorithm were all improvements over a standard SVM in terms of both gross returns and net returns, but none achieved net returns as high as the genetic programming approach employed by Neely, Weller, and Dittmar (1997) and published in Neely, Weller, and Ulrich (2009). Two implementations of SVMs for Windows with semi-automated parameter selection are built.

1. Introduction

1.1. Objectives

This paper employs kernel methods to forecast foreign exchange rates, and aims to (1) beat the market, (2) beat existing standard methodology, and (3) beat the state of the art. Note that in a foreign exchange market beating the market simply means earning a positive return, but it is not a useful benchmark as it does not incorporate risk.

1.2. Background

1.2.1. No free lunch theorems

The two main no free lunch theorems (NFL) are introduced and then evolutionary algorithms and statistical learning theory are reconciled with the NFL theorems. The theorems are novel, non-trivial, frequently misunderstood and profoundly relevant to optimization, machine learning and science in general (and often conveniently ignored by the evolutionary algorithms and statistical learning theory communities). I (Sewell) run the world's only no free lunch website (http://www.no-free-lunch.org).

1.2.2. No free lunch theorem for optimization/search

The no free lunch theorem for search and optimization applies to finite spaces and algorithms that do not resample points. The theorem tells us that all algorithms that search for an extremum of a cost function perform exactly the same when averaged over all possible cost functions. So, for any search/optimization algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class. See Wolpert and Macready (1997).

The no free lunch theorem for search implies that putting blind faith in evolutionary algorithms as a blind search/optimization algorithm is misplaced. For example, on average, a genetic algorithm is no better, or worse, than any other search algorithm. In practice, in our universe, one will only be interested in a subset of all possible functions. This means that it is necessary to show that the set of functions that are of interest has some property that allows a particular algorithm to perform better than random search on this subset.

1.2.3. No free lunch theorem for supervised machine learning

Hume (1739–1740) pointed out that even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience. More recently, and with increasing rigour, Mitchell (1980), Schaffer (1994) and Wolpert (1996) showed that bias-free learning is futile. The no free lunch theorem for supervised machine learning (Wolpert, 1996) shows that in a noise-free scenario where the loss function is the misclassification rate, in terms of off-training-set error, there are no a priori distinctions between learning algorithms.

More formally, where

- d = training set;
- m = number of elements in training set;
- f = target input-output relationships;
- h = hypothesis (the algorithm's guess for f made in response to d); and
- c = off-training-set loss associated with f and h (generalization error or test set error)

all algorithms are equivalent, on average, by any of the following measures of risk: $E(c \mid d)$, $E(c \mid m)$, $E(c \mid f, d)$ or $E(c \mid f, m)$.

How well you do is determined by how aligned your learning algorithm $P(h \mid d)$ is with the actual posterior, $P(f \mid d)$. This result, in essence, formalizes Hume, extends him and calls all of science into question.

The NFL proves that if you make no assumptions about the target functions, or if you have a uniform prior, then $P(c \mid d)$ is independent of one's learning algorithm. Vapnik appears to prove that given a large training set and a small VC dimension, one can generalize well. The VC dimension is a property of the learning algorithm, so no assumptions are being made about the target functions. So, has Vapnik found a free lunch? VC theory tells us that the training set error, s, converges to c. If $\epsilon$ is an arbitrary real number, the VC framework actually concerns

$$P(|c - s| > \epsilon \mid f, m).$$

VC theory does not concern

$$P(c \mid s, m, \text{VC dimension}).$$

So there is no free lunch for Vapnik, and no guarantee that support vector machines generalize well.

1.3. Kernel methods

Central to the work on forecasting in this paper is the concept of a kernel. The technical aspects of kernels are dealt with in Section 2.1, and the history is given here. The Fisher kernel is derived and implemented below; to save space, a thorough literature review is provided in Sewell (2011b).

1.4. Support vector machines

Support vector machines (SVMs) are used extensively in the forecasting of financial time series and are covered in more detail in Section 2.2. Among other sources, the introductory paper (Hearst, Dumais, Osuna, Platt, & Schölkopf, 1998), the classic SVM tutorial (Burges, 1998), the excellent book (Cristianini & Shawe-Taylor, 2000) and the implementation details within Joachims (2002) have contributed to my own understanding.

1.4.1. The application of support vector machines to the financial domain

An exhaustive review of articles that apply SVMs to the financial domain (not reported here) included 38 articles that compare SVMs with artificial neural networks (ANNs): SVMs outperformed ANNs in 32 cases, ANNs outperformed SVMs in 3 cases, and there was no significant difference in 3 cases. More specifically, of the 22 articles that concern the prediction of financial or commodity markets, 18 favoured SVMs, 2 favoured ANNs and 2 found no significant difference. This bodes well for SVMs, and as such, the following research on forecasting shall employ them.

2. Material and methods

Domain knowledge is necessary to provide the assumptions that supervised machine learning relies upon.

2.1. Kernel methods

2.1.1. Terminology

The term kernel is derived from a word that can be traced back to c. 1000 and originally meant a seed (contained within a fruit) or the softer (usually edible) part contained within the hard shell of a nut or stone-fruit. The former meaning is now obsolete. It was first used in mathematics when it was defined for integral equations in which the kernel is known and the other function(s) unknown, but now has several meanings [...]

[...]tions has been the focus of much research in statistics and machine learning for decades and the resulting algorithms are well understood, well developed and efficient. Naturally, one wants the best of both worlds. So, if a problem is non-linear, instead of trying to fit a non-linear model, one can map the problem from the input space to a new (higher-dimensional) space (called the feature space) by doing a non-linear transformation using suitably chosen basis functions and then use a linear model in the feature space. This is known as the kernel trick. The linear model in the feature space corresponds to a non-linear model in the input space. This approach can be used in both classification and regression problems. The choice of kernel function is crucial for the success of all kernel algorithms because the kernel constitutes prior knowledge that is available about a task. Accordingly, there is no free lunch (see Section 1.2) in kernel choice.

2.1.4. Kernel trick

The kernel trick was first published by Aizerman, Braverman, and Rozonoer (1964). Mercer's theorem states that any continuous, symmetric, positive semi-definite kernel function $K(x, y)$ can be expressed as a dot product in a high-dimensional space.

If the arguments to the kernel are in a measurable space $X$, and if the kernel is positive semi-definite, i.e.

$$\sum_{i=1}^{n} \sum_{j=1}^{n} K(x_i, x_j)\, c_i c_j \ge 0$$

for any finite subset $\{x_1, \ldots, x_n\}$ of $X$ and subset $\{c_1, \ldots, c_n\}$ of objects (typically real numbers, but could even be molecules), then there exists a function $\varphi(x)$ whose range is in an inner product space of possibly high dimension, such that [...]
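The two properties stated for the kernel trick in Section 2.1.4 (a kernel value equals a dot product in a feature space, and the double sum over K(x_i, x_j) c_i c_j is non-negative) can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the homogeneous degree-2 polynomial kernel on two-dimensional inputs is chosen here for convenience and is not taken from the paper, and the names `poly2_kernel`, `poly2_features` and `quadratic_form` are hypothetical helpers introduced for this example.

```python
import math
import random


def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Assumed example kernel: K(x, y) = (x . y)^2 (not from the paper)."""
    return dot(x, y) ** 2


def poly2_features(x):
    """Explicit feature map for poly2_kernel on 2-D inputs (R^2 -> R^3)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def quadratic_form(kernel, xs, cs):
    """Double sum of K(x_i, x_j) * c_i * c_j over all pairs (i, j)."""
    return sum(
        kernel(xi, xj) * ci * cj
        for xi, ci in zip(xs, cs)
        for xj, cj in zip(xs, cs)
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the check is repeatable
    xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(5)]

    # The kernel evaluated in input space should match the dot product
    # of the explicitly mapped points in feature space.
    for x in xs:
        for y in xs:
            lhs = poly2_kernel(x, y)
            rhs = dot(poly2_features(x), poly2_features(y))
            assert math.isclose(lhs, rhs, abs_tol=1e-12)

    # The quadratic form should be non-negative for any real coefficients
    # (a small tolerance absorbs floating-point rounding).
    for _ in range(100):
        cs = [rng.uniform(-1, 1) for _ in xs]
        assert quadratic_form(poly2_kernel, xs, cs) >= -1e-12

    print("feature-space identity and PSD condition hold on this sample")
```

The explicit feature map is written out here only because the example kernel is small. The practical appeal of the kernel trick is that a linear method needs only the kernel values, so the mapping never has to be computed when the feature space is very high-dimensional.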
