isi2007 nn shc_2007


Upload: nicolas-navet

Post on 19-Jan-2015


TRANSCRIPT

Page 1: Isi2007 nn shc_2007

Financial Data Mining with Genetic Programming: a Survey and Look Forward

Nicolas NAVET, INRIA, France, [email protected] (loria.fr)

Shu-Heng CHEN, AIECON/NCCU, Taiwan, [email protected] (nccu.edu.tw)

ISI 2007 - 08/23/2007

Genetic programming

GP is the process of evolving a population of computer programs, which are candidate solutions, according to evolutionary principles:

- Generate a population of random programs
- Evaluate their quality (“fitness”)
- Create better programs by applying genetic operators, e.g. mutation and combination (“crossover”)
- Solution
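The generate / evaluate / recombine loop described on this slide can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `evolve`, the truncation selection of the best half, the elitism step and all default parameter values are assumptions made for the example.

```python
import random
from typing import Callable, List, TypeVar

P = TypeVar("P")  # a candidate program; its representation is left open


def evolve(
    random_program: Callable[[], P],
    fitness: Callable[[P], float],
    crossover: Callable[[P, P], P],
    mutate: Callable[[P], P],
    pop_size: int = 50,
    generations: int = 20,
    mutation_rate: float = 0.1,
    seed: int = 0,
) -> P:
    """Return the fittest program found after a fixed number of generations."""
    rng = random.Random(seed)
    # 1. Generate a population of random programs.
    population: List[P] = [random_program() for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Evaluate their quality and rank them (higher fitness is better).
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, pop_size // 2)]  # truncation selection (assumption)
        # 3. Create new programs with the genetic operators.
        children: List[P] = [ranked[0]]  # elitism (assumption): keep the current best
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            child = crossover(mother, father)
            if rng.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = children
    return max(population, key=fitness)
```

The loop is independent of how programs are represented: any representation works as long as the four callables are supplied.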

In GP, programs are represented by trees

Trading system: buy if abs(Close(t)/0.7748) < Close(t − 218)

[Figure: tree representation of the rule; internal nodes are labeled "functions", leaves are labeled "terminals"]
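To make the tree representation concrete, the rule above can be written as a nested tuple and evaluated recursively. This is a sketch under assumptions: the tuple encoding, the names `RULE` and `evaluate`, and the convention that `("close", lag)` stands for Close(t − lag) are choices made for the example, not something specified in the slides.

```python
from typing import Sequence, Union

Tree = Union[tuple, float, int]

# abs(Close(t)/0.7748) < Close(t-218) as a nested tuple: (function, child, ...)
RULE: Tree = ("<", ("abs", ("/", ("close", 0), 0.7748)), ("close", 218))


def evaluate(node: Tree, closes: Sequence[float], t: int):
    """Recursively evaluate a rule tree at time index t."""
    if not isinstance(node, tuple):
        return node  # terminal: numeric constant
    op, *args = node
    if op == "close":  # terminal: Close(t - lag)
        lag = args[0]
        if t - lag < 0:
            raise IndexError("not enough price history for this lag")
        return closes[t - lag]
    # function node: evaluate the children first, then apply the function
    values = [evaluate(arg, closes, t) for arg in args]
    if op == "abs":
        return abs(values[0])
    if op == "/":
        return values[0] / values[1]
    if op == "<":
        return values[0] < values[1]
    raise ValueError(f"unknown function: {op}")
```

Calling `evaluate(RULE, closes, t)` returns a boolean "buy" signal for day `t`, provided at least 218 earlier closing prices are available.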

Typical genetic operator: standard crossover

Standard crossover: exchange two randomly chosen sub-trees among the parents

[Figure: standard crossover applied to two parent trees]
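A sub-tree exchange of this kind can be sketched on trees stored as nested tuples `(function, child, ...)`, with anything that is not a tuple treated as a terminal. The helper names (`paths`, `get`, `replace`, `crossover`) and the uniform choice of the crossover points are assumptions made for the illustration.

```python
import random


def paths(tree, prefix=()):
    """Yield the path (tuple of child indexes) of every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get(tree, path):
    """Return the sub-tree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, new):
    """Return a copy of tree in which the sub-tree at path is replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(a, b, rng):
    """Swap one randomly chosen sub-tree of a with one of b; return both children."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return replace(a, pa, get(b, pb)), replace(b, pb, get(a, pa))
```

Because the two sub-trees are only exchanged, the total number of nodes across both children equals that of the two parents, although each child can grow or shrink.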

Strong points of GP

- Solutions are produced in a symbolic form that can be analyzed by humans
- GP does not assume a predefined size and shape: it creates both the functional form and the parameters’ values
- “Ability to produce a large number of different, yet meaningful hypotheses .. that are non-intuitive and sometimes provocative” [Kei02]

G.P. in the financial domain

1. Knowledge discovery: results are scarce
   - Agent based modeling: study the evolution of a population of decision rules
   - Testing the EMH in real and artificial markets

2. Financial trading:
   - Composing portfolios
   - Evolving structure of NN used for prediction
   - Predicting price evolution
   - Discovering trading rules

Discovering trading rules: the big picture

1) Creation of the trading rules using GP

2) Selection of the best resulting strategies

Further selection on unseen data - one strategy is chosen for out-of-sample

Performance evaluation

[Figure: timeline divided into training interval, validation interval and out-of-sample interval]
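The three-interval scheme can be sketched as a chronological split followed by a two-stage selection. The split fractions, the shortlist size `k` and the function names are assumptions chosen for the example; the slides do not give these values.

```python
def split_intervals(series, train_frac=0.5, valid_frac=0.25):
    """Split a time-ordered series into training, validation and out-of-sample parts."""
    n = len(series)
    i = int(n * train_frac)
    j = int(n * (train_frac + valid_frac))
    return series[:i], series[i:j], series[j:]


def select_strategy(strategies, score, train, valid, k=3):
    """Shortlist the k best strategies on training data, then pick one on unseen data."""
    shortlist = sorted(strategies, key=lambda s: score(s, train), reverse=True)[:k]
    return max(shortlist, key=lambda s: score(s, valid))
```

The out-of-sample part is deliberately not passed to `select_strategy`: it is only used afterwards, once, to evaluate the single strategy that was chosen.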

Improvements ahead of us (1/2)

1. Rigorous assessment of the GP outcomes: controlling the data-mining bias!
2. Selecting the right time series: market can be efficient
3. Reducing variability of the results from GP run to GP run
4. Re-thinking the data-division scheme for training, validation and testing periods

Improvements ahead of us (2/2)

5. Pre-processing the data ?!?
6. Re-thinking fitness functions: GP-friendly, sensitivity and risk adjusted, …
7. Embedding more domain specific knowledge: GP function set is still very primitive ..

1. Rigorous assessment of the GP outcomes

GP’s outcomes on the training interval (1/2)

Assume an “inefficient” solution leads to a profitable trade with probability 0.5

Probability that an inefficient system achieves a given success rate for a given number of trades:

                 Number of trades
Success rate     10       50          100
60%              0.38     0.1         0.03
70%              0.17     3 · 10^−3   4 · 10^−5

Guideline: penalize or discard systems with few trades
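The figures in this table are consistent with a binomial tail probability: the chance of at least the given share of winning trades when each trade wins independently with probability 0.5. The sketch below assumes that reading ("at least" the success rate, independent trades); the function name and the integer-percent argument are choices made for the example.

```python
from math import comb


def prob_success_rate(n_trades: int, rate_pct: int, p: float = 0.5) -> float:
    """P(at least rate_pct % of n_trades are profitable) for per-trade win probability p."""
    k_min = -(-rate_pct * n_trades // 100)  # ceiling of rate_pct * n_trades / 100
    return sum(
        comb(n_trades, k) * p**k * (1 - p) ** (n_trades - k)
        for k in range(k_min, n_trades + 1)
    )
```

For example, `prob_success_rate(10, 60)` is 386/1024, about 0.38, which matches the first cell of the table.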

GP’s outcomes on the training interval (2/2)

Probability that at least one inefficient system achieves a success rate = 70% for a given number of solutions:

                               Number of trades
Number of solutions tested     10     50      100
100                            1      0.28    0.004
1000                           1      0.96    0.38
50000                          1      1       0.85

NB: in a typical GP run, 50000 solutions are tested and the average number of trades is usually small …
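One way to arrive at probabilities of this kind is to assume that the tested solutions are independent, so that the chance that at least one of N reaches the 70% success rate is 1 − (1 − p)^N, with p the single-system binomial tail probability. Independence is an assumption of this sketch, not something stated on the slide, and the function names are illustrative. Under that assumption most cells of the table are reproduced closely; the cell for 1000 solutions and 100 trades comes out at about 0.04 rather than the 0.38 printed on the slide.

```python
from math import comb


def prob_success_rate(n_trades: int, rate_pct: int, p: float = 0.5) -> float:
    """P(at least rate_pct % of n_trades are profitable) for per-trade win probability p."""
    k_min = -(-rate_pct * n_trades // 100)  # ceiling division
    return sum(
        comb(n_trades, k) * p**k * (1 - p) ** (n_trades - k)
        for k in range(k_min, n_trades + 1)
    )


def prob_at_least_one(n_solutions: int, n_trades: int, rate_pct: int = 70) -> float:
    """P(at least one of n_solutions independent inefficient systems reaches the rate)."""
    p_single = prob_success_rate(n_trades, rate_pct)
    return 1.0 - (1.0 - p_single) ** n_solutions
```

With 50000 solutions tested, even a 70% success rate over 100 trades is more likely than not to be reached by chance alone, which is the data-mining bias the slide warns about.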

GP’s outcomes on the testing period [ChNa07]

Compare GP with several variants of:
- Random search algorithms: “Zero-Intelligence Strategies” - ZIS
- Random trading behaviors: “Lottery trading” - LT

Statistical hypotheses testing:
- Null: GP does not outperform ZIS
- Null: GP does not outperform LT

Issue: how to best constrain randomness ?
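The idea of testing a strategy against random trading can be illustrated with a simple Monte Carlo baseline. This is a stand-in and not the pretest procedure of [ChNa07]: the way the random trader is constrained (being in the market on each day with a fixed probability `p_in_market`), the empirical p-value and all names below are assumptions made for the example.

```python
import random


def lottery_returns(price_changes, p_in_market, n_runs, seed=0):
    """Total return of n_runs random traders, each holding the asset on a day
    with probability p_in_market (hypothetical way to constrain randomness)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        results.append(sum(c for c in price_changes if rng.random() < p_in_market))
    return results


def empirical_p_value(candidate_return, baseline_returns):
    """Share of baseline runs doing at least as well as the candidate (one-sided)."""
    hits = sum(1 for r in baseline_returns if r >= candidate_return)
    return (hits + 1) / (len(baseline_returns) + 1)
```

A small p-value means the candidate's return is rarely matched by the random traders, so the null "the candidate does not outperform random trading" would be rejected at the corresponding level.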

2. Selecting the Right Time Series

Experiments [CIEF2007]: Does low entropy imply better profitability of GP-induced trading rules ?

NYSE US 100 Stocks, daily data from 2000 to 2006

Experimental setup

Entropy rate estimator: Kontoyiannis et al 1998

Log returns: $r_t = \ln\left(\frac{p_t}{p_{t-1}}\right)$

Discretization: $\{r_t\} \in \mathbb{R} \rightarrow \{A_t\} \in \mathbb{N}$, e.g. 3,4,1,0,2,6,2,…

Alphabet of size 8 - equal number of values in each bin; max. theoretical entropy = 3
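The pre-processing step (log returns, then equal-frequency binning into 8 symbols) can be sketched as below. The rank-based binning and the function names are assumptions of this example. The `shannon_entropy` helper is only there to show where the maximum of 3 comes from (log2 of 8, in bits, when all symbols are equally frequent); it is not the entropy rate estimator cited on the slide.

```python
import math
from collections import Counter


def log_returns(prices):
    """r_t = ln(p_t / p_{t-1}) for consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def discretize(returns, alphabet_size=8):
    """Equal-frequency binning: rank each return and map its rank to a symbol."""
    n = len(returns)
    order = sorted(range(n), key=lambda i: returns[i])
    symbols = [0] * n
    for rank, i in enumerate(order):
        symbols[i] = rank * alphabet_size // n
    return symbols


def shannon_entropy(symbols):
    """Plug-in entropy, in bits, of the symbol frequencies."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())
```

Because every bin holds the same number of values, the symbol frequencies are uniform and any entropy below 3 can only come from dependence between successive symbols.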

Entropy of NYSE US 100 stocks, period 2000-2006

[Figure: density histogram of the entropy of the stocks (x-axis: entropy, 2.66 to 2.80; y-axis: density, 0 to 25). NB: a normal distribution of same mean and standard deviation is plotted for comparison.]

Mean = Median = 2.75

Max = 2.79

Min = 2.68

Rand() boost = 2.96

Rand() C lib = 2.77 !

Entropy is high but price time series are not random!

[Figure: two density histograms over the same range (x-axis: entropy, 2.65 to 2.85; y-axis: density, 0 to 50): entropy of the original time series, and entropy of the randomly shuffled time series]
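The comparison between original and shuffled series is a surrogate-data test: shuffling keeps the symbol frequencies but destroys any temporal structure, so a lower entropy on the original series indicates dependence between successive values. The sketch below uses a first-order conditional entropy as a simple stand-in for the entropy rate estimator used in the experiments; that substitution and the function names are assumptions of the example.

```python
import math
import random
from collections import Counter


def conditional_entropy(symbols):
    """Plug-in estimate of H(A_t | A_{t-1}) in bits (stand-in entropy measure)."""
    pairs = Counter(zip(symbols, symbols[1:]))
    firsts = Counter(symbols[:-1])
    n = len(symbols) - 1
    return -sum(c / n * math.log2(c / firsts[a]) for (a, _), c in pairs.items())


def shuffled_entropies(symbols, n_shuffles=100, seed=0):
    """Entropy of n_shuffles randomly shuffled copies of the symbol sequence."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_shuffles):
        copy = list(symbols)
        rng.shuffle(copy)  # same symbol frequencies, temporal order destroyed
        results.append(conditional_entropy(copy))
    return results
```

If the entropy of the original sequence falls below nearly all of the shuffled values, the sequence is unlikely to be a purely random ordering of its symbols.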

Stocks in the distribution’s tails

Lowest entropy time series:

Symbol   Entropy
TWX      2.677
EMC      2.694
C        2.712
JPM      2.716
GE       2.723

Highest entropy time series:

Symbol   Entropy
OXY      2.789
VLO      2.787
MRO      2.785
BAX      2.78
WAG      2.776

Autocorrelation analysis

Up to a lag 100, there are 2.7 x more autocorrelations outside the 99% confidence bands for the lowest entropy stocks than for the highest entropy stocks

[Figure: autocorrelation plots for a low complexity stock (C) and a high complexity stock (OXY)]
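Counting autocorrelations outside a confidence band can be sketched as follows. The sketch assumes the usual white-noise band of ±z/√n with z = 2.576 for a 99% level; the slides do not say how the bands were computed, so this choice and the function names are assumptions.

```python
import math


def autocorrelation(x, lag):
    """Sample autocorrelation of the series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var


def count_outside_band(x, max_lag=100, z=2.576):
    """Number of lags in 1..max_lag whose autocorrelation exceeds the band z/sqrt(n)."""
    band = z / math.sqrt(len(x))
    return sum(
        1 for lag in range(1, max_lag + 1) if abs(autocorrelation(x, lag)) > band
    )
```

Applying `count_outside_band` to the return series of each stock and comparing the counts between the two groups gives a ratio of the kind reported on the slide.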

BDS tests: are daily log price changes i.i.d ?

Highest entropy time series:

m   δ   OXY     VLO     MRO     BAX     WAG
2   1   5.66    4.17    6.69    8.13    7.45
3   1   6.61    5.35    9.40    11.11   8.89
5   1   9.04    6.88    13.08   15.31   11.17

Lowest entropy time series:

m   δ   TWX     EMC     C       JPM     GE
2   1   18.06   14.21   13.9    11.82   11.67
3   1   22.67   19.54   18.76   16.46   16.34
5   1   34.18   29.17   28.12   26.80   24.21

The null that log price changes are i.i.d. is always rejected at the 1% level but, whatever the BDS parameters, rejection is much stronger for the low-entropy stocks

Results: surprisingly ..

On high-entropy stocks:
- GP is always profitable
- LT is never better than GP (95% confidence level)
- GP outperforms LT 2 times out of 5 (95% C.L.)

On low-entropy stocks:
- GP is never better than LT (95% C.L.)
- LT outperforms GP 2 times out of 5 (95% C.L.)

Explanations (1/2)

GP is not good when training period is very different from out-of-sample, e.g.

[Figure: price series from 2000 to 2006 for a typical low complexity stock (EMC) and a typical high complexity stock (MRO)]

Explanations (2/2)

The 2 cases where GP outperforms LT: training quite similar to out-of-sample

[Figure: price series from 2000 to 2006 for BAX and WAG]

4. Re-thinking data division scheme

Data division scheme

There is multiple evidence that GP performs poorly when the training interval differs from the out-of-sample interval …

What is needed: characterization of the market condition - similarity measure

Re-learning triggered when similarity or performance falls below a threshold
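A re-learning trigger of this kind can be sketched as follows. The slides only state that a similarity measure is needed and do not propose one, so the measure used here (one minus the total variation distance between the symbol histograms of two windows), the thresholds and all names are purely hypothetical.

```python
from collections import Counter


def histogram_similarity(window_a, window_b):
    """Hypothetical similarity in [0, 1]: 1 minus the total variation distance
    between the symbol frequencies of two windows of discretized returns."""
    ca, cb = Counter(window_a), Counter(window_b)
    keys = set(ca) | set(cb)
    tv = 0.5 * sum(abs(ca[k] / len(window_a) - cb[k] / len(window_b)) for k in keys)
    return 1.0 - tv


def needs_relearning(similarity, performance, min_similarity, min_performance):
    """Trigger a new GP run when similarity or performance falls below its threshold."""
    return similarity < min_similarity or performance < min_performance
```

In use, the similarity would be computed between the training window and the most recent window of data, and checked together with the live performance of the current rule.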

6. Re-thinking fitness functions

Rethinking fitness functions

[Figure from [LaPo02]]

- Issue 1: some fitness functions induce a “difficult” landscape for GP → GP-friendly fitness
- Issue 2: a few lucky trades alone may lead to an outstanding return → risk-adjusted fitness
- Issue 3: solutions located on peaks of the fitness landscape are not robust out-of-sample → sensitivity-adjusted fitness
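Issues 2 and 3 can be illustrated with two small fitness wrappers. Both are sketches under assumptions: the Sharpe-like ratio, the minimum number of trades, the ±5% parameter perturbation and the function names are choices made for the example, not fitness functions proposed in the slides.

```python
import statistics


def risk_adjusted_fitness(trade_returns, min_trades=30):
    """Mean per-trade return divided by its standard deviation (Sharpe-like ratio).
    Systems with too few trades are discarded outright."""
    if len(trade_returns) < min_trades:
        return float("-inf")
    spread = statistics.stdev(trade_returns)
    return statistics.mean(trade_returns) / (spread + 1e-9)  # avoid division by zero


def sensitivity_adjusted_fitness(fitness, params, step=0.05):
    """Average fitness over small relative perturbations of each numeric parameter,
    so that isolated peaks of the fitness landscape score lower than plateaus."""
    scores = [fitness(params)]
    for i, value in enumerate(params):
        for factor in (1 - step, 1 + step):
            perturbed = list(params)
            perturbed[i] = value * factor
            scores.append(fitness(perturbed))
    return sum(scores) / len(scores)
```

With the first wrapper, a return earned through one lucky trade scores lower than the same average return earned steadily; with the second, a solution that only works for one exact parameter value scores lower than one that tolerates small changes.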

7. Embedding more domain specific knowledge

Embedding more domain specific knowledge

Choice of the function/terminal sets is crucial - no guidelines - 2 risks:
- Extraneous functions
- Required functions not available

As yet, GP uses a very primitive language:
- Enrich primitive set with volume, indexes, bid/ask spread, …
- Enrich function set with cross-correlation, predictability measure, …

References (1/2)

[ChKuHo06] S.-H. Chen, T.-W. Kuo and K.-M. Hoi, “Genetic Programming and Financial Trading: How Much about "What we Know"”, 4th NTU International Conference on Economics, Finance and Accounting, April 2006.

[ChNa06] S.-H. Chen and N. Navet, “Pretests for genetic-programming evolved trading programs: “zero-intelligence” strategies and lottery trading”, Proc. ICONIP’2006, Hong-Kong, October 2006.

[ChNa07] S.-H. Chen, N. Navet, “Failure of Genetic-Programming Induced Trading Strategies: Distinguishing between Efficient Markets and Inefficient Algorithms”, Chapter 8, Evolutionary Computation in Economics and Finance: Volume 2, Springer, ISBN 3540728201, 2007.

[NaCh07] N. Navet, S.-H. Chen, “Entropy rate and profitability of technical analysis: experiments on the NYSE US 100 stocks”, 6th International Conference on Computational Intelligence in Economics & Finance (CIEF2007), Salt-Lake City, USA, July 2007.

[Kab02] M. Kaboudan, “GP Forecasts of Stock Prices for Profitable Trading”, Evolutionary computation in economics and finance, Kluwers, 2002.

References (2/2)

[SaTe02] M. Santini, A. Tettamanzi, “Genetic Programming for Financial Series Prediction”, Proceedings of EuroGP'2001, 2001.

[BhPiZu02] S. Bhattacharyya, O. V. Pictet, G. Zumbach, “Knowledge-Intensive Genetic Discovery in Foreign Exchange Markets”, IEEE Transactions on Evolutionary Computation, vol 6, n° 2, April 2002.

[LaPo02] W.B. Langdon, R. Poli, “Foundations of Genetic Programming”, Springer Verlag, 2002.

[Kab00] M. Kaboudan, “Genetic Programming Prediction of Stock Prices”, Computational Economics, vol 16, 2000.

[Wag03] L. Wagman, “Stock Portfolio Evaluation: An Application of Genetic-Programming-Based Technical Analysis”, Genetic Algorithms and Genetic Programming at Stanford 2003, 2003.

[Dem05] I. Dempsey, “Constant Generation for the Financial Domain using Grammatical Evolution”, Proceedings of the 2005 workshops on Genetic and evolutionary computation 2005, pp 350-353, Washington, June 25-26, 2005.

[Kei02] M. Keijzer, “Scientific discovery using Genetic Programming”, PhD Thesis, DTU, Lyngby, Denmark, 2002.


?