Predicting the Stock Market Using Clementine


Post on 28-Nov-2014








<ul><li> 1. Predicting the Stock Market Using Clementine
Eric U. Graubins
Department of Computer Science
Illinois Institute of Technology

Abstract: The application of KDD (Knowledge Discovery in Databases) techniques to the analysis of stock market data has been validated as an effective method for the identification of high-performing stocks. Using a database of firm fundamental information for the years 1988 to 2001 as input, we construct several stock performance prediction models employing SPSS's Clementine data mining software product. Exploiting the rich set of data modeling tools available in Clementine, we achieve success rates of over 80%. This paper describes the data, the methodology, and the results. It compares the algorithms used and rates the effectiveness of each. We affirm that it is possible to apply computer analysis to stock market data and arrive at performance predictions. These methods demonstrate an approach that not only has the potential for obtaining outstanding profits from investing, but also shows the value of data mining in extracting knowledge from unstructured data.

Introduction: The efficacy of performing ex ante stock price computations has always been open to conjecture. Predicting stock prices has been the Holy Grail of market analysis: a crystal ball for the divination of securities prices and the basis for an automaton which produces a flow of cash. Market divination has been an area of interest for as long as securities have been traded. A cursory count of the number of books, seminars, software packages and newsletters pertaining to this subject attests to this fact. While the amount of literature in this subject area is indeed copious, the amount of published serious scientific study in this field is scant. This paper presents such a study. Using a database containing approximately ten years of stock market data, we apply analysis techniques to identify high-performing stocks. 
The methodology and the results are tabulated, and the process is described. This study utilizes company fundamental data as the basis for research. Broadly, the entire area of stock analysis can be classified into two approaches: fundamental analysis and technical analysis [Hol2000]. This distinction is important and is drawn here. Fundamental analysis uses company accounting data to determine a stock's value, whereas technical analysis uses market stock price fluctuations and actual price history to arrive at a stock's value. While the emphasis of these two schools is different, much cross-discipline influence is evident. For example, a company with no profits, </li> <li> 2. such as the high-technology Internet companies of recent memory, finds weak long-range demand for its shares from the investment community. Likewise, a firm may publish admirable fundamentals, yet find that its stock is not in demand, i.e., not in favor. This results in the stock price becoming depressed, which lowers the worth of the firm. This, in turn, makes it more difficult to raise capital for growth and expenses, affecting the welfare of the company. The remainder of this paper covers current research in the stock analysis field, after which the conducted subject research is presented. This research is divided into data, methodology, and result sections.

Prior Work: We surveyed recent work done on applying knowledge discovery techniques to modeling the U.S. stock market. As stated earlier, traders and stock market experts may be roughly classified into two camps. Fundamental analysts base trading decisions on information obtained through the meticulous examination of a firm's financial information. This results in a determination of the fundamental value of the stock. If the trading price is lower than the value, the stock is undervalued and there is a strong incentive to buy. 
Conversely, if the price is higher than the value, the stock is overpriced and it is desirable to sell it. Technical traders approach stock prices according to the dictum "Res tantum valet quantum vendi potest", which translates to "A thing is worth as much as it can be sold for." The view is that the purest determinant of value for a stock is its trade price. Technical traders exploit free market forces to execute trades that are profitable. They take advantage of anomalies in the Efficient Market Hypothesis (EMH). EMH states that markets are efficient if prices fully and instantaneously reflect all available information and no profit opportunities are left unexploited. Even in this wired world, information dissemination is neither complete nor instantaneous. Background about the markets is provided by Bass [Bas1999]. A good description of market structure and procedures can be obtained from Dalton [Dal1993]. Bass [Bas1999], in his book, relates the story of a group of physicists who apply chaos theory to make stock market predictions. This book gives a non-technical overview of the stock market and its various machinations and supports the claim that the stock market is predictable, to an extent. Much of the technical analysis work utilizes chaos theory and complexity theory, while fundamental analysis centers on more traditional data mining approaches applied to static data. Technical analysis is more apt to be applied to a dynamic stream of trade data (OLAP), while fundamental analysis takes place on the underlying firm's accounting data. </li> <li> 3. Stocks which have a strong fundamental value tend to have more price stability and are subjected to different sorts of analysis. This approach does not take into account the wild, violent price fluctuations which are the hallmarks of a market in flux. Rather, the goal is to identify stocks which can be held for lengthy periods and which can be regarded as assets to a portfolio. 
The majority of algorithms used to analyze stock performance from a fundamental perspective are more traditional data mining algorithms. These are run against static data, such as a company's quarterly reported financial data. The results usually consist of a ranking of firms whose securities were deemed good investments. Mandelbrot extends chaos theory to the financial markets. Multifractal processes are derived from a scaling law for stock price time graphs. At each point in a series of finite time intervals, the behavior of the graph is used as an additional input to a model, which ultimately describes the path of the graph. George H. John, Peter Miller, and Randy Kerber [JMK1996], researchers at the computer science department at Stanford, present a study of the application of a software package named "Recon" to the financial markets. The goal was to identify superior performing stocks using this artificial intelligence based system. Recon was developed at Lockheed Martin as a software product for use in discerning patterns in voluminous amounts of data. The team applied Recon to the stock market. They created a database consisting of six years of stock information for 1987-1993. Each tuple had about 100 attributes, containing information such as price-to-earnings ratio, market trend and market capitalization. Recon was used to identify stocks that were deemed "exceptional". Final examination of the stocks selected by Recon showed that they had a return of 238% over a four-year period. By comparison, a team of human experts was able to achieve a return of only 92.5% over the same period. The previous paper is valuable in that actual empirical performance data is cited. It is unusual in that respect, because the vast majority of papers in stock market forecasting and analysis are replete with formulas and theoretical constructs without ever providing evidence that the methods perform. This work is detailed in Section 2, and further covered in Section 3. 
Five algorithms to identify top-performing stocks are given by George T. Albanis and Roy A. Batchelor [AB2000], who describe 21 ways to beat the stock market. A related work by the same author team, [AB1999], seems to lay the basis for the later research effort. In their paper, Albanis and Batchelor describe a set of five algorithms that are used to identify outstanding stocks with exceptional returns. The algorithms then "vote" as to the projected performance of a stock. The system was trained on statistical data collected from 700 companies trading on the London Stock Exchange for the period 1993 to 1997. The inputs consisted of company financial information, market </li> <li> 4. economic information, and industry performance. The output was a classification of each stock as either high or low. If we draw a sharp division between the fundamental stock analysis school and the technical analysis school, published papers in the former outnumber those in the latter. If a computer model of a trading engine is truly to emulate a human trader, that model needs to utilize a variety of approaches [Dal1993], [Osh1996]. This entails a combination of fundamental analysis and technical analysis [HH1998a]. In short, fundamental analysis is used to identify securities which have the potential for outstanding appreciation, and technical analysis is used to determine when to purchase, when to retain, and when to divest. Hellstrom and Holstrom [HH1998a] discuss the melding of fundamental and technical analysis in the implementation of an efficacious trading system. Fundamental data, of course, consists of a firm's financial data and is commonly presented in balance sheets and financial statements. According to Hellstrom and Holstrom and the majority of literature dealing with stock market analysis, the fluctuations of a stock price graph in a Cartesian coordinate system form a time series, in which each stock value at a given time is the result of the previous value plus some transform function. 
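The time-series description above (each value is the previous value plus a transform term) can be illustrated with a minimal sketch. It is not taken from the cited papers: the function names, the Gaussian form of the transform term, and the parameter values are all assumptions made purely for illustration.

```python
import random


def simulate_price_series(start_price, n_steps, step_std=1.0, seed=0):
    """Build a series in which each value is the previous value plus a random step.

    Assumption for illustration: the step ("transform") is drawn independently
    from a zero-mean Gaussian. The source text does not name a distribution.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    prices = [start_price]
    for _ in range(n_steps):
        step = rng.gauss(0.0, step_std)   # zero-mean, independent of earlier steps
        prices.append(prices[-1] + step)  # value(t) = value(t-1) + step(t)
    return prices


def mean_step(prices):
    """Average one-period change across the whole series."""
    steps = [later - earlier for earlier, later in zip(prices, prices[1:])]
    return sum(steps) / len(steps)


series = simulate_price_series(start_price=100.0, n_steps=10_000)
print("average step:", round(mean_step(series), 4))
```

With a zero-mean step, the average one-period change stays close to zero, so the previous value carries no information about the direction of the next move.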
While simple to express, the equation has a sinister feature: the value of the transform function has zero mean and each value is independent of others in the series. This is what makes stock market prediction such a daunting task. The equations derived from graph fluctuations are complex and are of a higher, non-determinable, order. Often, graph properties change, rendering what had been considered a valid equation non-representative of the new market condition. Chen and Tan [CT1996] point out that the measure of complexity for a set of data is the length of the shortest Turing machine program that will generate the data. This measure can be defined, but the problem is not practically computable. For this reason, it is advocated that the Turing approach be replaced by classes of probabilistic models, i.e., stored patterns. Doyne Farmer, in an article by Kevin Kelly [Kel1994], draws an analogy between catching a baseball and predicting the market. He says that we know how to catch a baseball because we have developed and stored a model in our consciousness that describes how baseballs fly. Although we could calculate the trajectory using Newtonian physics, our brains do not stock up on mechanics equations. The same reasoning is applied to the stock market: we do not need to calculate a stock graph; we just need to recognize the pattern. In logic, such a process is known as induction, in contradistinction to the deduction process that leads to a mathematical formula. Neural networks are an artificial intelligence construct that mimics the pattern-discriminating ability of the human brain. Kogan </li> <li> 5. [Kog1995] outlines the comparative efficacy of using neural nets in artificial intelligence applications such as medicine, war, genetics, and lastly, finance. Kogan also cites two other applications of AI-based stock selection systems in industry: the Fidelity StockSelector fund, and LBS Capital. 
By contrast, rule induction, the approach used by Recon, is the facile extraction of useful if-then rules from data based on statistical significance. The fundamental approach to rule formulation is through the use of decision trees. A good treatment is given in [HK2001].

Current Research Methodology: The starting point for the creation of a statistical price prediction model was the gathering of approximately ten years of U.S. company fundamental information. These companies are publicly traded and the data spanned the period 6/30/88 to 12/31/2001. The data was obtained from 10-K annual filings with the U.S. Securities and Exchange Commission. The metadata is in Appendix 1. Since the number of companies traded on U.S. markets is too unwieldy for a complete study, a selection had to be made to generate a set of firms to operate on. We decided to use the Standard and Poor's 1300 list as the criterion. This list contains the top 1300 companies traded on U.S. markets and accounts for 87% of U.S. market capitalization. Standard and Poor's is an independent provider of company rating services. It maintains a number of lists which are held in high regard in the investment community. The S&amp;P 1300 list consists of approximately 1300 companies and is updated periodically. The company filing data, which was originally in a vertical ASCII text format, was parsed and placed into a horizontal orientation. Each 10-K annual filing was reduced to one record and placed in a file. The total number of records was 13726. This file was subsequently loaded into an mSQL relational database on a Unix server. The count of companies by year is as follows:

Year  Companies
1988  5
1989  23
1990  593
1991  985
1992  1029
1993  1116
1994  1205
1995  1255
1996  1335
1997  1377
1998  1401
1999  1406
2000  1407
2001  493
</li> <li> 6. The smaller number of companies at the extremes of the year range is due to incomplete filing records. 
Although these incomplete years were not used in the study, they were still included in the database. It must be emphasized that the S&amp;P 1300 list is not strictly limited to 1300 firms per annum. The numeric moniker is a mere guideline and the actual number of companies on the list may fluctuate by year. This is partly explained by periodic reevaluations, at which time some firms may be added and others deleted. This results in a firm being included in t...</li></ul>
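The data preparation step described in the methodology, in which vertical ASCII filing data was parsed into one horizontal record per 10-K filing and then counted by year, can be sketched roughly as follows. The paper does not show its parser or file layout, so everything here is an assumption: the `ticker|year|field|value` line format, the function names, and the sample values are hypothetical and exist only to illustrate the reshaping.

```python
from collections import defaultdict


def pivot_filings(lines):
    """Collapse vertical 'ticker|year|field|value' lines into one record per filing.

    The pipe-delimited layout is a hypothetical stand-in for the vertical
    ASCII format mentioned in the text.
    """
    records = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        ticker, year, field, value = line.split("|")
        records[(ticker, int(year))][field] = float(value)
    # One horizontal row per (ticker, year), sorted for stable output.
    return [
        {"ticker": ticker, "year": year, **fields}
        for (ticker, year), fields in sorted(records.items())
    ]


def count_by_year(rows):
    """Count how many filing records exist for each year."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["year"]] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample input; tickers and figures are invented for illustration.
sample = [
    "AAA|1990|revenue|120.5",
    "AAA|1990|net_income|10.0",
    "AAA|1991|revenue|130.0",
    "BBB|1990|revenue|75.25",
]
rows = pivot_filings(sample)
print(rows)
print(count_by_year(rows))
```

Keying each record on the (company, year) pair mirrors the one-record-per-filing layout, which is what makes a per-year company count a simple tally.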