applying data mining techniques to stock market

9
Y. Demazeau et al. (Eds.): Trends in PAAMS, AISC 71, pp. 519–527. springerlink.com © Springer-Verlag Berlin Heidelberg 2010 Applying Data Mining Techniques to Stock Market Analysis Gabriel Fiol-Roig, Margaret Miró-Julià, and Andreu Pere Isern-Deyà * Abstract. The stock market can be viewed as a particular data mining and artifi- cial intelligence problem. The movement in the stock exchange depends on capi- tal gains and losses and most people consider the stock market erratic and unpredictable. However, patterns that allow the prediction of some movements can be found. Stock market analysis deals with the study of these patterns. It uses different techniques and strategies, mostly automatic that trigger buying and selling orders depending on different decision making algorithms. It can be considered as an intelligent treatment of past and present financial data in order to predict the stock market future behavior. Therefore it can be viewed as an ar- tificial intelligence problem in the data mining field. This paper aims to study, construct and evaluate these investment strategies in order to predict future stock exchanges. Firstly, data mining techniques will be used to evaluate past stock prices and acquire useful knowledge through the calculation of some fi- nancial indicators. Next artificial intelligence strategies will be used to construct decision making trees. Keywords: Data Mining, Artificial Intelligence, Stock Market, Investment Strate- gies, Decision Trees. 1 Introduction to the Problem The problem considered deals with the design and validation of a data mining de- cision system which generates buying and selling orders in the stock market [1]. The decision system’s design has three main phases as indicated in Figure 1: the data processing phase, the data mining phase and the evaluation phase. The data processing phase selects from the raw data base a data set that focuses on a subset of attributes or variables on which knowledge discovery is to be Gabriel Fiol-Roig . Margaret Miró-Julià . Andreu Pere Isern-Deyà Departament de Ciències Matemàtiques i Informàtica Universitat de les Illes Balears Ctra. de Valldemossa km 7,5, 07122 Palma de Mallorca, Spain e-mail: {biel.fiol,margaret.miro}@uib.es

Upload: library2540

Post on 24-Nov-2015

17 views

Category:

Documents


1 download

TRANSCRIPT

  • Y. Demazeau et al. (Eds.): Trends in PAAMS, AISC 71, pp. 519527. springerlink.com Springer-Verlag Berlin Heidelberg 2010

    Applying Data Mining Techniques to Stock Market Analysis

    Gabriel Fiol-Roig, Margaret Mir-Juli, and Andreu Pere Isern-Dey*

    Abstract. The stock market can be viewed as a particular data mining and artifi-cial intelligence problem. The movement in the stock exchange depends on capi-tal gains and losses and most people consider the stock market erratic and unpredictable. However, patterns that allow the prediction of some movements can be found. Stock market analysis deals with the study of these patterns. It uses different techniques and strategies, mostly automatic that trigger buying and selling orders depending on different decision making algorithms. It can be considered as an intelligent treatment of past and present financial data in order to predict the stock market future behavior. Therefore it can be viewed as an ar-tificial intelligence problem in the data mining field. This paper aims to study, construct and evaluate these investment strategies in order to predict future stock exchanges. Firstly, data mining techniques will be used to evaluate past stock prices and acquire useful knowledge through the calculation of some fi-nancial indicators. Next artificial intelligence strategies will be used to construct decision making trees.

    Keywords: Data Mining, Artificial Intelligence, Stock Market, Investment Strate-gies, Decision Trees.

    1 Introduction to the Problem

    The problem considered deals with the design and validation of a data mining de-cision system which generates buying and selling orders in the stock market [1]. The decision systems design has three main phases as indicated in Figure 1: the data processing phase, the data mining phase and the evaluation phase.

    The data processing phase selects from the raw data base a data set that focuses on a subset of attributes or variables on which knowledge discovery is to be

    *Gabriel Fiol-Roig . Margaret Mir-Juli . Andreu Pere Isern-Dey Departament de Cincies Matemtiques i Informtica Universitat de les Illes Balears Ctra. de Valldemossa km 7,5, 07122 Palma de Mallorca, Spain e-mail: {biel.fiol,margaret.miro}@uib.es

  • 520 G. Fiol-Roig, M. Mir-Juli, and A. Pere Isern-Dey

    Data Base

    Data Processing

    Data Mining Evaluation

    Transformed Data Patterns Knowledge

    Fig. 1 The Decision System

    performed. It also removes outliers and redundant information, and uses finan-cial indicators to represent the processed data by means of an Object Attribute Table (OAT). The data mining phase converts the data contained in the OAT into useful patterns, in particular decision trees are found [2]. The evaluation phase proves the consistency of pattern by means of a testing set. The positively evaluated decision system can then be used in real world situations that will al-low for its validation.

    2 The Data Processing Phase

    The original data base is formed by financial information available in different web pages. In particular, data from the American stock market, available from Google finance, have been used. Out of all the variables originally considered the target data is formed by: date (D), opening price (O), closing price (C), daily high (H), daily low (L) and daily volume (V) as illustrated in Table 1.

    Table 1 Data format available from Google Finance

    Date Open High Low Close Volume 15-Dec-08 13.30 13.39 13.23 13.34 144567

    In order to obtain a decision making system based on classification techniques a particular company must be selected. This company must satisfy some restrictions:

    - It must not be a technological security, in order to avoid using data from year 2000 economic bubble,

    - It must not be a financial institution, due to the erratic behaviour of the past years,

    - It must have a significant influence within the American stock market, - It must be a reference company.

    Due to these restrictions and after a careful study of the stock market, Alcoa was selected to complete the project. Alcoa is the world's leading producer and man-ager of primary aluminum, fabricated aluminum and alumina facilities, and is ac-tive in all major aspects of the industry. The targeted data includes information from the beginning of 1995 till the end of 2008.

  • Applying Data Mining Techniques to Stock Market Analysis 521

    2.1 Financial Indicators

    This target data must be processed and transformed into useful data written in terms of financial indicators [3]. The most widely used indicators are based on graphical methods (where price oscillation is studied) and on mathematical meth-ods (where mathematical equations are applied).

    The Simple Moving Average (SMA) is the unweighted mean of the previous n data points. It is commonly used with time series data to smooth out short-term fluctuations and highlight longer-terms trends and is often used for the analysis of financial data such as stock prices (P), returns or trading volumes. There are vari-ous popular values for n, like 10 days (short term), 30 days (intermediate term), 70 days (long term) or 200 days depending on the number of days considered.

    n

    PPPPSMA nnn 111 )(+++

    =

    +

    "

    The moving averages are interpreted as support in a rising market or resistance in a falling market. The SMA treats all data points equally and can be disproportion-ally influenced by old data points. The exponential moving average addresses this point by giving extra weight to more recent data points.

    The Exponential Moving Average (EMA) applies weighting factors which de-creases exponentially, giving much more importance to recent observations while not discarding older observations totally. The degree of weighting decrease is ex-pressed as a constant smoothing factor , which can be expressed in terms of n.

    12+

    =

    n

    And the exponential moving average is expressed as:

    )1()(1 +=+ nnn EMAPPEMA The moving averages are adequate indicators to determine tendencies. When a short term moving average crosses over a longer term moving average we have a rising tendency. Whereas when a short term moving average crosses under a longer term moving average we have a falling tendency. The moving averages have a drawback since the information provided has a time lag of several days. This can be avoided by using more powerful indicators.

    The Moving Average Convergence Divergence (MACD) is a combination of two exponential moving averages with different number of data points (days). The MACD is calculated by subtracting the y-day EMA from the x-day EMA. A z-day EMA of the MACD, called the signal line, is plotted on top of the MACD, func-tioning as a trigger for buy and sell signals. Variables x, y, and z are the parame-ters of the MACD, usually x = 12, y = 26 and x = 9. This is illustrated in Figure 2, the dark line is the MACD whereas the light line is the signal, on the top of the figure the histogram (signal minus MACD) is drawn.

  • 522 G. Fiol-Roig, M. Mir-Juli, and A. Pere Isern-Dey

    Fig. 2 The Moving Average Convergence Divergence

    The MACD is a trend-following momentum indicator that shows the relation-ship between two moving averages of prices. Its two-folded interpretation is as follows. If the MACD line and signal line are considered, when the MACD falls below the signal line, it may be time to sell. Conversely, when the MACD rises above the signal line, the price of the asset is likely to experience upward momen-tum, it may be time to buy. If the histogram and the price curb are considered, a divergence can be found and buying-selling signals can be obtained.

    The stochastic (K) is an indicator that finds the range between an assets high (H) and low price (L) during a given period of time, typically 5 sessions (1 week) or 20 sessions (1 month). The current securities price at closing (C) is then ex-pressed as a percentage of this range with 0% indicating the bottom of the range and 100% indicating the upper limits of the range over the time period covered.

    LHLCK

    =100

    An interesting interpretation for the stochastic can be obtained if the moving aver-age associated with the stochastic D(K) is considered: The crossing points be-tween K and D trigger buy and sell signals. When K crosses over a rising D it may be time to buy, conversely when K crosses over a falling D selling might be con-venient. If the stochastic and the price curb are considered, a divergence can be found and buying-selling signals can be obtained.

    The representation of the stochastic can be seen in Figure 3.

    Fig. 3 The Stochastic Indicator (K)

    The Bollinger Bands (BB) consists of a set of three curves drawn in relation to security prices. The middle band is a measure of the intermediate-term trend, usu-ally a moving average (simple or exponential) over n periods. This middle band serves as the base to construct the upper and lower bands. The interval between the upper and lower bands and the middle band is determined by volatility, typi-cally the interval is Q times the standard deviation of the same data that were used

  • Applying Data Mining Techniques to Stock Market Analysis 523

    to calculate the moving average. The default parameters most commonly used are n = 20 periods and Q = 2 standard deviations.

    95% of security prices can be found within the Bollinger Bands, the band repre-sents areas of support and resistance when the market shows no tendencies. When the bands lie close together a period of low volatility in stock price is indicated. When they are far apart a period of high volatility in price is indicated.

    Generally Bollinger Bands are used together with other indicators to reinforce its validity. However, on its own they can trigger buy and sell signals. When prices touch the lower band it might be convenient to buy, conversely when prices touch the upper band selling might be convenient.

    2.2 Processing of the Financial Indicators The financial indicators introduced above will aid us in the construction of our Object Attribute Table from the raw data base. In order to calculate these indica-tors, R a language for statistical computing and graphics, is used [4]. R is a popu-lar programming language used by a growing number of data analysts inside corporations and academia. R is applied mainly as a statistical package but it can be used in other areas of science such as numerical analysis, signal processing, computer graphics and so on because of the number of growing implemented li-braries in R to these scientific areas. See [5] for details on the use of R in acade-mia and other institutions. R is available as Free Software under the terms of the Free Software Foundations GNU General Public in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSDband Linux, Windows and MacOS).

    The financial indicators used in this paper are calculated by the application of Rs TTR package to the raw data retrieved from Google finance. These indicators must now be transformed into useful data as follows.

    2.3 Transformed Data: The OAT The next step in the data processing phase is the generation of the Object Attribute Table [6, 7]. The indicators considered are the following: MACD (Moving Aver-age Convergence Divergence), EMA(C) (Exponential Moving Average of the Closing Price), EMA(V) (Exponential Moving Average of the Volume), K (Sto-chastic) and BB (Bollinger Bands) and they generate eight binary attributes A1, A2, A3, A4, A5, A6, A7 and A8 in the following manner.

    - The MACD indicator identifies buying (B) or selling (S) orders depending on the crossing of intermediate term moving averages. Two binary attrib-utes A1 and A2 are used as follows: if an S signal is generated then A1= 1; if a B signal is generated then A2= 1.

    - The EMA(C) indicator when compared to the stocks actual price generates attribute A3, if the assets ranking is strong then A3=1.

  • 524 G. Fiol-Roig, M. Mir-Juli, and A. Pere Isern-Dey

    - The EMA(V) indicator when compared to the actual volume generates at-tribute A4, if the assets reliability increases then A4= 1.

    - The K indicator also identifies buying (B) or selling (S) orders depending on the crossing of moving averages, in particular K and D. Two binary at-tributes A5 and A6 are used as follows: if an S signal is generated then A5= 1; if a B signal is generated then A6= 1.

    - The BB indicator generates 4 zones. Two binary attributes A7 and A8 are used as follows: if the assets ranking is strong then A7= 1 and A8=1; if the assets ranking is weak then A7= 0 and A8=0; if a B hint is found then A7= 0 and A8=1; if an S hint is obtained then A7= 1 and A8=0.

    Thus, the OAT of transformed data has 8 columns corresponding to the above mentioned binary attributes and a total of 3515 rows corresponding to available transformed financial data, as illustrated in Table 2. The data contained in the ta-ble must be filtered. First of all, repeated rows are eliminated. Secondly, incom-plete rows are also removed. The remaining table has 61 rows. The next step is to assign the class to each of the remaining rows taking into account the values of the attributes as indicated by the financial expert. Three different classes are used de-pending on whether buying (B), selling (S) or no action (N) must be taken.

    Table 2 Object Attribute Table

    A1 A2 A3 A4 A5 A6 A7 A8 class 0 0 0 0 1 0 1 0 S 0 1 0 0 0 0 0 1 B 0 0 1 0 0 1 0 1 N 0 0 0 1 0 1 1 0 N 1 0 1 0 0 0 0 1 S 1 0 1 1 0 0 1 0 S

    3 The Data Mining Phase

    In the data mining phase, the data contained in the OAT is converted into useful patterns. In particular, a decision tree will be obtained. The decision tree is one of the most popular classification algorithms. A decision tree can be viewed as a graphical representation of a reasoning process and represents a sequence of deci-sions, interrelationships between alternatives and possible outcomes. There are different trees available depending on the evaluation functions considered. In this project, the ID3 algorithm is used due to its simplicity and speed [8]. ID3 is based on information theory to determine the most informative attribute at each step of the process. In this sense, it uses information gain to measure how well a given at-tribute separates the training examples into the classes.

  • Applying Data Mining Techniques to Stock Market Analysis 525

    Figure 4 shows the resulting decision tree. The decision tree has a total of 37 nodes, 18 corresponding to the 8 attributes used and 19 leaf nodes or branches. The average branch length is 6.

    Attributes A1, A2 and A5 appear at the root of the tree suggesting the impor-tance of indicators MACD and K in the classifying process. This is in accordance with the experts that consider MACD and K as key indicators in the buying-selling decisions, whereas EMA(C), EMA(V) and BB are used as reinforcement indicators.

    A20

    A1

    A5

    A7

    A8A8

    0

    0

    0

    0

    1

    1

    1

    0

    A3

    A80

    0

    1

    A4

    0

    S

    S

    S N

    N

    B

    A6

    B

    1

    1

    1

    1

    S

    0A4

    1

    1

    0

    A7

    N

    0

    A6

    N

    A7

    NB B

    1

    A7

    A6 A6

    N SNA3

    B N

    N

    1

    1 0

    1

    1

    1

    0

    0 01

    0

    1 0

    A20

    A1

    A5

    A7

    A8A8

    0

    0

    0

    0

    1

    1

    1

    0

    A3

    A80

    0

    1

    A4

    0

    S

    S

    S N

    N

    B

    A6

    B

    1

    1

    1

    1

    S

    0A4

    1

    1

    0

    A7

    N

    0

    A6

    N

    A7

    NB B

    1

    A7

    A6 A6

    N SNA3

    B N

    N

    1

    1 0

    1

    1

    1

    0

    0 01

    0

    1 0

    Fig. 4 The ID3 Decision Tree

    4 The Evaluation Phase

    The evaluation phase proves the consistency of the ID3 decision tree by means of a testing set. The testing set is formed by the stock market prices of a different as-set, of similar characteristics to the one used to train the tree. This raw data are transformed into a testing OAT using the previous indicators and attributes. Now the decision tree is applied to the testing OAT and all instances are classified. These classes obtained with the tree are compared with the classes corresponding to the instances of the testing OAT as suggested by the expert. The evaluation of the method provides the results shown on Table 3.

    Table 3 Evaluation of the method

    Parameters 3-class tree Correct decision 113 Incorrect decision 176 Correct decision (%) 39.10 Incorrect decision (%) 60.90

  • 526 G. Fiol-Roig, M. Mir-Juli, and A. Pere Isern-Dey

    These results might seem poor. However, stock market analysis is an unpre-dictable field and even though accuracy is desired, profit associated to capital gain is crucial. Therefore, classification accuracy is not the only parameter that must be evaluated. When profit is considered, the method provides the following results: if an instance is correctly classified by the tree, the percentage gain is 579.71%, if an instance is not correctly classified by the tree, the percentage loss is -461.56%. This produces a total yield of 118.15% which is an acceptable result.

    5 Conclusions and Future Work

    The stock market is an unpredictable, erratic and changeable domain. Neverthe-less, tools that predict the stock markets behaviour exist. This paper considers a method based on financial indicators to generate a decision tree that classifies buy-ing-selling orders. A decision tree in Artificial Intelligence is equivalent to an automatic investment system in the stock market analysis. The procedure devel-oped here is long and costly, but once the tree is generated, its automatic and ap-plicable to any financial asset.

    The initial raw data are daily stock market values of Alcoa, these data are trans-formed into an OAT and used as a training set to construct a decision tree by means of the ID3 algorithm. This decision tree has been tested and evaluated. The results show a 40% accuracy percentage and a 118% profit gain that are consid-ered satisfactory.

    Currently, a second OAT with only two classes, buying (B) and selling (S) is being studied. In this case the no action class (N) has not been considered. This 2-class decision tree will allow for a more complete evaluation of the method.

    This paper offers a limited vision of one of the many solutions available. The following aspects can be considered as future work: other different financial indi-cators may offer better results and should be studied, also other decision trees, such as C4.5 or J48, can be obtained using different learning algorithms and should be evaluated. The stock market is a dynamic world, and the decision tree could present time-related disturbances. Therefore, an autonomous system capable of dealing with these changes should be considered.

    Acknowledgements. The authors want to thank Dami Vaquer, Gerard Draper and Jaume Jaume for their helpful comments and suggestions.

    References

    1. Fayyed, U., Piatetsky-Shapiro, G., Smyth, P.: From Data Mining to Knowledge Discov-ery in Databases. American Association for Artificial Intelligence, AI Magazine 96, 3754 (Fall 1996)

    2. Fiol-Roig, G.: UIB-IK: A Computer System for Decision Trees Induction. In: Ra, Z.W., Skowron, A. (eds.) ISMIS 1999. LNCS, vol. 1609, pp. 601611. Springer, Heidelberg (1999)

  • Applying Data Mining Techniques to Stock Market Analysis 527

    3. Weinstein, S.: Stans Weinsteins Secrets For Profiting in Bull and Bear Markets. McGraw-Hill, New York (1988)

    4. The R project for Statistical Computing, http://www.r-project.org/ 5. http://www.nytimes.com/2009/01/07/technology/

    business-computing/07program.html 6. Mir-Juli, M.: Knowledge Discovery in Databases Using Multivalued Array Algebra.

    In: Moreno-Daz, R., Pichler, F., Quesada-Arencibia, A. (eds.) Computer Aided Sys-tems Theory - EUROCAST 2009. LNCS, vol. 5717, pp. 1724. Springer, Heidelberg (2009)

    7. Fiol-Roig, G.: Learning from Incompletely Specified Object Attribute Tables with Con-tinuous Attributes. Frontiers in Artificial Intelligence and Applications, vol. 113, pp. 145152 (2004)

    8. Quinlan, J.R.: Induction of decision trees. Machine Learning 1, 81106 (1986)

    Applying Data Mining Techniques to Stock Market AnalysisIntroduction to the ProblemThe Data Processing PhaseFinancial IndicatorsProcessing of the Financial IndicatorsTransformed Data: The OAT

    The Data Mining PhaseThe Evaluation PhaseConclusions and Future WorkReferences

    /ColorImageDict > /JPEG2000ColorACSImageDict > /JPEG2000ColorImageDict > /AntiAliasGrayImages false /CropGrayImages true /GrayImageMinResolution 150 /GrayImageMinResolutionPolicy /Warning /DownsampleGrayImages true /GrayImageDownsampleType /Bicubic /GrayImageResolution 150 /GrayImageDepth -1 /GrayImageMinDownsampleDepth 2 /GrayImageDownsampleThreshold 1.50000 /EncodeGrayImages true /GrayImageFilter /DCTEncode /AutoFilterGrayImages true /GrayImageAutoFilterStrategy /JPEG /GrayACSImageDict > /GrayImageDict > /JPEG2000GrayACSImageDict > /JPEG2000GrayImageDict > /AntiAliasMonoImages false /CropMonoImages true /MonoImageMinResolution 600 /MonoImageMinResolutionPolicy /Warning /DownsampleMonoImages true /MonoImageDownsampleType /Bicubic /MonoImageResolution 600 /MonoImageDepth -1 /MonoImageDownsampleThreshold 1.50000 /EncodeMonoImages true /MonoImageFilter /CCITTFaxEncode /MonoImageDict > /AllowPSXObjects false /CheckCompliance [ /None ] /PDFX1aCheck false /PDFX3Check false /PDFXCompliantPDFOnly false /PDFXNoTrimBoxError true /PDFXTrimBoxToMediaBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXSetBleedBoxToMediaBox true /PDFXBleedBoxToTrimBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXOutputIntentProfile (None) /PDFXOutputConditionIdentifier () /PDFXOutputCondition () /PDFXRegistryName () /PDFXTrapped /False

    /Description >>> setdistillerparams> setpagedevice