
  • Performing stock closing price prediction through the use of principle component regression in association with general regression neural network

    Jui-Ching Huang
    Wen-Tsao Pan

    Department of Accounting Information, Kun Shan University, No. 949, Da Wan Rd., Yung-Kang City, Tainan Hsien, 710, Taiwan, R.O.C.

    Abstract

    Conventional Multiple Regression models suffer from problems that have long troubled researchers, chiefly multicollinearity among the independent variables and the inability to capture nonlinear relationships, both of which make accurate prediction difficult. This article therefore proposes an analysis method based on a hybrid model that combines the Principal Component Regression (PCR) model with the General Regression Neural Network (GRNN) to address both problems at the same time. First, financial ratio data are collected as sample data for companies listed on the regular and over-the-counter stock markets in Taiwan and Mainland China. Grey Relational Analysis is then used to rank the operating performance of these enterprises, and the top-ranked enterprise in Taiwan and in Mainland China is selected; its stock information is collected to predict the stock closing price. The stock information is divided into training data and test data for model construction and verification, and five indexes are calculated on the test results: the Root Mean Squared Error (RMSE), Revision Theil Inequality Coefficient (RTIC), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Coefficient of Efficiency (CE). The empirical results show that the hybrid PCR+GAGRNN model, in which the GRNN parameter is adjusted by a Genetic Algorithm, predicts better than the PCR, GRNN and PCR+GRNN models on all five indexes.

    Keywords and phrases: Principal component regression, genetic algorithm, general regression neural network, hybrid model, grey relational analysis.

    E-mail: [email protected], [email protected]

    Journal of Discrete Mathematical Sciences & Cryptography, Vol. ( ), No. , pp. 1–12, © Taru Publications

  • 2 J. C. HUANG AND W. T. PAN

    1. Introduction

    Mainland China has experienced fast economic growth in the past few years, and many of its private enterprises have become top companies worldwide. However, many financial experts have pointed out that Mainland China's enterprises have excessively high price/earnings ratios and that the economy shows a serious bubble and inflation; hence, the business operation of each enterprise needs to be checked in detail to guard against a financial crisis. Taiwan has suffered from economic stagnation in the past few years, and many of its private enterprises consequently face operating difficulties and financial crises. Moreover, many countries around the world are currently affected by the subprime mortgage crisis in the USA, and business operating performance has deteriorated. Therefore, this article uses Grey Relational Analysis to investigate enterprises' operating performance and selects the top-ranked enterprise as the investment target, so as to reduce investors' risk of loss in the stock market.

    In the past, researchers constructing models to predict stock trends have mostly adopted conventional analysis methods, for example the Multiple Regression model (Megginson and Weiss, 1991; Fama and French, 1993) or time series models (Lai and Yang, 2004; Tsaur, 2004). However, these two kinds of model are suited only to linear problems; when they are applied to nonlinear problems, the results are usually poor. In addition, the Multiple Regression model suffers from multicollinearity among the independent variables, so that many independent variables cannot effectively explain the dependent variable. Therefore, this article proposes a method for constructing a hybrid model in which the Principal Component Regression (PCR) model is combined with the General Regression Neural Network (GRNN) to construct the prediction model, so that both issues of Multiple Regression are addressed at the same time.

    This article is the first to adopt financial ratio data from enterprises in both Taiwan and Mainland China as operating performance indexes, with Grey Relational Analysis used to analyze the enterprises' operating capability. The Grey Relational Grade obtained from the analysis is used to rank operating performance; the top-ranked enterprise in Taiwan and in Mainland China is selected and its stock information is collected. The PCR model, the GRNN model, the PCR model combined

  • REGRESSION NEURAL NETWORK 3

    with the GRNN model, and a hybrid model that combines the PCR model with a GAGRNN model whose parameter is adjusted by a Genetic Algorithm (GA), are then used to predict the stock closing price.

    This article is structured as follows: Section 1 introduces the research motivation, objective and research flow; Section 2 reviews the literature on the PCR and GRNN models and describes how the hybrid PCR+GAGRNN model is constructed; Section 3 presents the empirical analysis; Section 4 gives conclusions and suggestions.

    2. Hybrid model of PCR and GRNN

    2.1 Principal component regression model

    In regression analysis, when multicollinearity occurs among the independent variables, the squared multiple correlation coefficient (R²) is typically very high and the overall F test of the linear regression model reaches significance, yet the t tests for the individual regression coefficients of most independent variables do not. As a result, most of the independent variables cannot effectively explain the dependent variable. In multiple regression analysis, the diagnostic indexes for multicollinearity include tolerance, the variance inflation factor (VIF), the condition index (CI), eigenvalues and variance proportions. Tolerance takes values from 0 to 1; the closer it is to 0, the stronger the multicollinearity. The variance inflation factor is the reciprocal of tolerance, so a larger VIF indicates stronger multicollinearity; generally speaking, a VIF larger than 10 suggests multicollinearity among the independent variables. The condition index is derived from the eigenvalues: the smaller the eigenvalue, the larger the condition index and the more likely multicollinearity is present; an eigenvalue approaching 0 or a condition index larger than 30 indicates medium or high multicollinearity. Finally, when the variance proportion of an independent variable at a certain eigenvalue approaches 1, multicollinearity is more likely to exist among the independent variables. When multicollinearity occurs, Principal Component Regression can be used to resolve it: the method constructs latent principal components from the collinear independent variables and uses them as the new predictor variables.
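
    The diagnostics and the regression on principal components described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' procedure: the helper names `vif`, `pcr_fit` and `pcr_predict`, the choice of two retained components and the data itself are all assumptions made for the example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_features)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out[j] = 1.0 / (1.0 - r2)        # tolerance = 1 - R^2, VIF = 1 / tolerance
    return out

def pcr_fit(X, y, n_components):
    """Regress y on the leading principal components of the standardized X."""
    mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mu) / sd
    eigval, eigvec = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    order = np.argsort(eigval)[::-1][:n_components]   # largest eigenvalues first
    W = eigvec[:, order]
    A = np.column_stack([np.ones(len(y)), Z @ W])      # intercept + component scores
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {"mu": mu, "sd": sd, "W": W, "gamma": gamma}

def pcr_predict(model, X):
    Z = (X - model["mu"]) / model["sd"]
    return model["gamma"][0] + (Z @ model["W"]) @ model["gamma"][1:]

# Synthetic example (assumption): x2 is almost a copy of x1, so they are collinear.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + 0.1 * rng.normal(size=200)

vifs = vif(X)                            # x1 and x2 get very large VIF values
model = pcr_fit(X, y, n_components=2)    # drop the near-zero-variance component
pred = pcr_predict(model, X)
```

    In this toy setting the two collinear columns exceed the VIF threshold of 10 mentioned above, while the regression on the two leading components still reproduces the target closely.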

    2.2 General regression neural network

    Specht (1990) proposed the Probabilistic Neural Network (PNN), a supervised network structure. Its theory is built on Bayes decision theory and a nonparametric technique for estimating the Probability Density Function (PDF), which is taken to have the form of a Gaussian distribution. Yeh (1998) pointed out that this function is as shown in formula (1).

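    $$f_k(X) = \left(\frac{1}{N_k}\right)\left(\frac{1}{(2\pi)^{m/2}}\right)\left(\frac{1}{\sigma^m}\right)\sum_{j=1}^{N_k}\exp\left(-\frac{\lVert X - X_{kj}\rVert^2}{2\sigma^2}\right) \tag{1}$$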

    Since the Probabilistic Neural Network is applied to general classification problems under the assumption that the feature vector to be classified must belong to one of the known classes, the absolute magnitude of the probability value of each class is not important; only the relative magnitude needs to be considered. The factor

    $$\left(\frac{1}{(2\pi)^{m/2}}\right)\left(\frac{1}{\sigma^m}\right)$$

    of formula (1) can therefore be neglected, and formula (1) can be simplified to

    $$f_k(X) = \left(\frac{1}{N_k}\right)\sum_{j=1}^{N_k}\exp\left(-\frac{\lVert X - X_{kj}\rVert^2}{2\sigma^2}\right) \tag{2}$$

    In formula (2), $\sigma$ is the Smoothing Parameter of the Probabilistic Neural Network. After network training is complete, the prediction accuracy can be improved by adjusting $\sigma$; the larger its value, the more smoothly the network approximates the function. If the Smoothing Parameter is not appropriately selected, the number of neurons in the network design will be too many or too few, over-fitting or under-fitting will result when approximating the function, and the prediction power will be reduced.

    Let $d_{kj}^2 = \lVert X - X_{kj}\rVert^2$ be the squared Euclidean distance between $X$ and $X_{kj}$ in the sample space; then formula (2) can be rewritten as

    $$f_k(X) = \left(\frac{1}{N_k}\right)\sum_{j=1}^{N_k}\exp\left(-\frac{1}{2}\left(\frac{d_{kj}}{\sigma}\right)^2\right)$$

    The generalization capability of the Probabilistic Neural Network relies on the adjustment of the Smoothing Parameter $\sigma$. When $\sigma$ approaches 0 in this expression,

    $$f_k(X) = \begin{cases} 1/N_k, & \text{if } X = X_{kj} \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

    In this case, the classification made by the Probabilistic Neural Network depends on how close the unclassified sample is to the classified samples. When $\sigma$ approaches infinity,

    $$f_k(X) = 1$$

    for every class, and the Probabilistic Neural Network approaches blind classification. However, since the Probabilistic Neural Network can only handle classification problems, Specht (1991) extended it into the General Regression Neural Network (GRNN) to handle continuous variables. The General Regression Neural Network can be used not only for classification problems but also has very good prediction power when constructing prediction or control models for linear or nonlinear problems.
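
    The effect of the Smoothing Parameter on the simplified class score of formula (2) can be checked numerically. The sketch below is illustrative only: the function name `class_density`, the two small sample sets and the three values of sigma are assumptions chosen to show the small-sigma and large-sigma behaviour described above.

```python
import numpy as np

def class_density(x, samples, sigma):
    """Simplified PNN class score: mean Gaussian kernel over one class's samples."""
    d2 = np.sum((samples - x) ** 2, axis=1)          # squared Euclidean distances
    return float(np.mean(np.exp(-d2 / (2.0 * sigma ** 2))))

# Two hypothetical classes with three training samples each.
class_a = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])
class_b = np.array([[2.0, 2.0], [2.2, 1.9], [1.8, 2.1]])
x = np.array([0.0, 0.0])      # coincides with the first sample of class_a

for sigma in (0.01, 0.5, 100.0):
    fa = class_density(x, class_a, sigma)
    fb = class_density(x, class_b, sigma)
    print(f"sigma={sigma}: f_a={fa:.4f}, f_b={fb:.4f}")
```

    With a very small sigma only the coinciding training sample contributes, so the score of class_a tends to 1/N_k and that of class_b to 0; with a very large sigma both scores tend to 1 and the classes can no longer be told apart.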

    The General Regression Neural Network is similar to the Probabilistic Neural Network and is a four-layer neural network (see Figure 1). The first layer is the input layer; its number of neurons equals the number of independent variables, and it receives the input data. The second layer is a hidden layer called the Pattern Layer, which stores all the training data. The outputs of the Pattern Layer pass to the neurons of the third layer, the Summation Layer, where they are matched to each possible class; this layer performs the calculation of formula (3). The fourth layer differs from the Probabilistic Neural Network: it is a Linear Layer, which computes the weighted average of the Summation Layer outputs and generates the output value.
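
    A compact way to see how the four layers cooperate is the kernel-weighted average that a GRNN in the sense of Specht (1991) computes. The sketch below is an assumption-laden illustration, not the MATLAB implementation used in this article: `grnn_predict`, the sine-curve data and sigma = 0.2 are chosen only for the example.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma):
    """GRNN output: Gaussian-kernel weighted average of the stored training targets."""
    # Pattern layer: one Gaussian unit per stored training sample.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    # Summation layer: weighted sum of targets and plain sum of activations.
    numerator = k @ y_train
    denominator = np.maximum(k.sum(axis=1), 1e-300)   # guard against underflow
    # Output layer: the ratio is the weighted average.
    return numerator / denominator

# Hypothetical training set: 50 samples of a sine curve.
X_train = np.linspace(0.0, 2.0 * np.pi, 50).reshape(-1, 1)
y_train = np.sin(X_train[:, 0])
X_query = np.array([[np.pi / 2.0], [np.pi]])
y_hat = grnn_predict(X_train, y_train, X_query, sigma=0.2)
```

    No iterative training takes place: the training data are simply stored, which is why the Smoothing Parameter is the only quantity left to tune.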

    2.3 Construction method for hybrid model

    Although Principal Component Regression resolves the multicollinearity among the independent variables of the Multiple Regression model, one issue remains: the Multiple Regression model cannot handle nonlinear problems effectively. In this article, Multiple Regression is therefore represented as a linear part and a nonlinear part that are processed separately. First, consider the regression equation shown in equation (4):

    $$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \tag{4}$$

    Figure 1. General Regression Neural Network structure diagram

    Here $\varepsilon$ is the error term and is a random variable. Following the method of Pai and Lin (2005), this article represents the Multiple Regression model as equation (5):

    $$Z_t = L_t + N_t \tag{5}$$

    where $L_t$ is the linear part and $N_t$ is the nonlinear part. Let $\hat{Y}_t$ be the value estimated by the Multiple Regression model at time $t$ and $\varepsilon_t$ the estimated residual. The residual at time $t$ is then

    $$\varepsilon_t = Z_t - \hat{Y}_t \tag{6}$$

    In this article the residual is predicted by the GRNN model and the GAGRNN model and can be represented as

    $$\varepsilon_t = f(X_{1,t-1}, X_{2,t-1}, X_{3,t-1}, X_{4,t-1}, X_{5,t-1}) + \Delta_t \tag{7}$$

    where $f$ is a nonlinear function and $\Delta_t$ is a random error term. The combined prediction of the hybrid model is therefore

    $$\hat{Z}_t = \hat{Y}_t + \hat{N}_t \tag{8}$$

    where $\hat{N}_t$ is the prediction value of (7).
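
    The two-stage idea of equations (5) to (8) can be sketched on synthetic data: a linear model produces the estimate, a GRNN models its residual from the lagged predictors, and the two are added. Everything specific here is an assumption for illustration: ordinary least squares stands in for the PCR stage, only two lagged predictors are used instead of five, and the data-generating function and sigma = 0.2 are invented.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma):
    """Gaussian-kernel weighted average of the training targets."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return (k @ y_train) / np.maximum(k.sum(axis=1), 1e-300)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Synthetic series (assumption): linear part plus a nonlinear part of lagged inputs.
rng = np.random.default_rng(1)
T = 600
x = rng.uniform(-2.0, 2.0, size=(T, 2))
lagged = x[:-1]                                   # predictors observed at t-1
z = (1.0 + 2.0 * lagged[:, 0] - 1.0 * lagged[:, 1]
     + np.sin(3.0 * lagged[:, 0]) + 0.05 * rng.normal(size=T - 1))

n_train = 500
A = np.column_stack([np.ones(len(z)), lagged])
beta, *_ = np.linalg.lstsq(A[:n_train], z[:n_train], rcond=None)
y_hat = A @ beta                                  # linear estimate
resid = z - y_hat                                 # residual as in equation (6)

# Residual model as in equation (7), fitted on the training residuals only.
n_hat = grnn_predict(lagged[:n_train], resid[:n_train], lagged[n_train:], sigma=0.2)
z_hat = y_hat[n_train:] + n_hat                   # combined forecast, equation (8)

rmse_linear = rmse(z[n_train:], y_hat[n_train:])
rmse_hybrid = rmse(z[n_train:], z_hat)
```

    Because the nonlinear component is invisible to the linear stage, the residual model recovers most of it in this toy case and the combined forecast has the lower test error.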

    3. Empirical analysis

    3.1 Sample data and variables

    In this article, financial ratio data for January 2006 are collected for 876 Taiwanese companies and 991 Mainland Chinese companies listed on the regular and over-the-counter stock markets, from the Information Winner Database and the China Center for Economic Research web site. The financial ratios are Turnover of Receivables, Inventory Turnover, Fixed Assets Turnover, Current Ratio, Acid-test Ratio, Debt Ratio, Net Profit Ratio, Return on Assets, Long-term Capital to Fixed Assets Ratio and Sale/Share-Accu, for a total of 10 ratios. Following Tang and Shih (2001), this article uses financial ratio data as evaluation indexes. Grey Relational Analysis, proposed by Deng (1982), and the grey relational MATLAB program developed by Wen et al. (2006) are used to compute the Grey Relational Grade and rank the enterprises by operating efficiency; the top-ranked enterprise in Taiwan and in Mainland China is selected as the investment target. MATLAB 7.0 is used for the analysis. Each line in Figure 2 and Figure 3 represents a company, and the 10 nodes on each line represent the 10 financial ratios; the line connected by marked nodes represents the Standard Sequence, and the remaining lines are Inspected Sequences. The closer an inspected sequence is to the standard sequence, the better the enterprise's operating performance. The analysis shows that the top-ranked enterprises in Taiwan and Mainland China are, respectively, HUNG POO Construction, 2536 (Grey Relational Grade 0.9643) and BEIJING WANGFUJING DEPARTMENT STORE (GROUP) CO., LTD., 600859 (Grey Relational Grade 0.8599).
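
    The ranking step can be illustrated with a commonly used form of the Grey Relational Grade. The article does not state its normalization or distinguishing coefficient, so the sketch below assumes min-max scaling, larger-is-better ratios throughout, a standard sequence made of the best value per ratio, and a distinguishing coefficient rho = 0.5; the function name and the four-company data set are hypothetical.

```python
import numpy as np

def grey_relational_grade(data, rho=0.5):
    """data: companies x ratios, every column treated as larger-is-better (assumption)."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    norm = (data - lo) / (hi - lo)             # scale each ratio to [0, 1]
    reference = norm.max(axis=0)               # standard sequence: best value per ratio
    delta = np.abs(reference - norm)           # distance of each inspected sequence
    d_min, d_max = delta.min(), delta.max()
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)
    return coeff.mean(axis=1)                  # grade = mean coefficient over ratios

# Hypothetical data: 4 companies, 3 ratios.
ratios = np.array([
    [1.8, 0.12, 6.0],
    [1.2, 0.08, 4.5],
    [2.1, 0.15, 7.2],
    [0.9, 0.03, 3.1],
])
grades = grey_relational_grade(ratios)
ranking = np.argsort(-grades)                  # company indices, best first
```

    Ratios for which a smaller value is better, such as a debt ratio, would need to be inverted before scaling; that handling is left out of this sketch.
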
    This article therefore collects daily stock information for these two companies from August 3, 2004 to March 26, 2008, for a total of 900 sets of data. The following technical indexes are calculated and used as independent variables: 10-day moving average (MA), 10-day RSI, 9-day K value, 9-day D value, 10-day volume ratio (VR), 10-day bias (BIAS), 10-day Williams Indicator (WMS%R or %R), Total Amount Weight Stock Price Index (TAPI) and 10-day psychological line (PSY). The closing price of the stock is used as the dependent variable. The first 800 sets of data are used as training data to construct the models, and the last 100 sets are used as test data to analyze the prediction accuracy of the four models.
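
    Three of the listed indexes can be computed from closing prices alone, which makes the 800/100 split easy to illustrate. The article gives no formulas, so the sketch below assumes the commonly used definitions of MA, BIAS and PSY, uses synthetic prices, and pairs the indexes of one day with the closing price of the next day; that alignment is also an assumption.

```python
import numpy as np

def moving_average(close, n=10):
    """n-day simple moving average; entry i covers days i .. i+n-1."""
    return np.convolve(close, np.ones(n), mode="valid") / n

def bias(close, n=10):
    """Percentage deviation of the close from its n-day moving average."""
    ma = moving_average(close, n)
    return (close[n - 1:] - ma) / ma * 100.0

def psy(close, n=10):
    """Psychological line: percentage of rising days among the last n days."""
    up = (np.diff(close) > 0).astype(float)
    return np.convolve(up, np.ones(n), mode="valid") / n * 100.0

# Synthetic closing prices (assumption), long enough for 900 aligned samples.
rng = np.random.default_rng(2)
close = 50.0 + np.cumsum(rng.normal(scale=0.5, size=911))

ma10, bias10, psy10 = moving_average(close), bias(close), psy(close)
# Indexes for days 10..909 paired with the closing price of days 11..910.
features = np.column_stack([ma10[1:-1], bias10[1:-1], psy10[:-1]])
target = close[11:]

X_train, y_train = features[:800], target[:800]   # first 800 sets: training data
X_test, y_test = features[800:], target[800:]     # last 100 sets: test data
```

    The split is chronological rather than random, so the test period lies entirely after the training period.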

    3.2 Construction of the four prediction models

    In this article, Principal Component Regression is first applied to the independent variables of the sample data; for the analysis steps, please refer to Wu (2007). The difference between the Principal Component Regression prediction and the dependent variable is then calculated and used as the error term. For the Hybrid model, GAGRNN is implemented as a self-written MATLAB program that predicts the error term; during prediction, the Genetic Algorithm proposed by Holland (1992) is adopted to adjust the smoothing parameter $\sigma$ of the GRNN. In MATLAB, the GRNN parameter that can be adjusted is the spread value (that is, the smoothing parameter $\sigma$), whose default value is 1. Lo (2006) pointed out that spread is the spreading constant: the larger its value, the smoother the fitted function. The parameter settings of the Genetic Algorithm are a maximum of 200 generations, 8 chromosomes, a crossover rate of 0.75, a mutation rate of 0.01 and an initial smoothing parameter of 1. Parameter adjustment proceeds as follows. The training data are fed into the GRNN for network training, and the smoothing parameter of the GRNN is encoded as a chromosome; the decoded value is then set in the GRNN and the test data are fed in for network testing. After each generation, the output value obtained is compared with the dependent variable Y to calculate the root mean square error (RMSE). This RMSE is the objective function of the Genetic Algorithm; it is converted into a fitness function, and the three basic operations of the Genetic Algorithm (reproduction, crossover and mutation) are then performed.

    Figure 2. Result of Grey Relational Analysis using the financial ratio data for enterprises in the Taiwan area

    Figure 3. Result of Grey Relational Analysis using the financial ratio data for enterprises in the Mainland China area

    In this article, reproduction uses the Roulette Wheel Selection scheme, crossover uses single-point crossover and mutation uses single-point mutation. The termination condition is the number of generations; after 200 generations of evolution, the prediction value approaches the target value. Figure 4 shows that, on the test data, the RMSE gradually drops over the 200 generations as the prediction value gradually approaches the target value.
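
    The tuning loop can be sketched with the settings quoted above (8 chromosomes, 200 generations, crossover rate 0.75, mutation rate 0.01, starting value 1). The remaining details are assumptions made for illustration: a 16-bit binary encoding of sigma on [0.01, 2.0], fitness 1 / (1 + RMSE), mutation applied per chromosome, synthetic sine data, and a held-out validation set in place of the test data so that the test set stays untouched.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma):
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return (k @ y_train) / np.maximum(k.sum(axis=1), 1e-300)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def decode(bits, lo=0.01, hi=2.0):
    """Map a binary chromosome to a smoothing parameter in [lo, hi]."""
    weights = 2 ** np.arange(len(bits))[::-1]
    return lo + (hi - lo) * int(bits @ weights) / (2 ** len(bits) - 1)

def ga_tune_sigma(X_tr, y_tr, X_val, y_val, pop_size=8, n_bits=16,
                  generations=200, p_cross=0.75, p_mut=0.01, seed=0):
    rng = np.random.default_rng(seed)
    cost = lambda s: rmse(y_val, grnn_predict(X_tr, y_tr, X_val, s))
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    best_sigma, best_cost = 1.0, cost(1.0)            # initial smoothing parameter 1
    for _ in range(generations):
        sigmas = np.array([decode(c) for c in pop])
        costs = np.array([cost(s) for s in sigmas])
        i = int(np.argmin(costs))
        if costs[i] < best_cost:                      # remember the best value seen
            best_sigma, best_cost = float(sigmas[i]), float(costs[i])
        fitness = 1.0 / (1.0 + costs)                 # lower RMSE, higher fitness
        # Reproduction: roulette wheel selection proportional to fitness.
        picks = rng.choice(pop_size, size=pop_size, p=fitness / fitness.sum())
        parents = pop[picks]
        children = parents.copy()
        # Single-point crossover on consecutive pairs.
        for a in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                cut = int(rng.integers(1, n_bits))
                children[a, cut:] = parents[a + 1, cut:]
                children[a + 1, cut:] = parents[a, cut:]
        # Single-point mutation: flip one random bit of a chromosome.
        for c in children:
            if rng.random() < p_mut:
                c[int(rng.integers(0, n_bits))] ^= 1
        pop = children
    return best_sigma, best_cost

# Hypothetical data: noisy sine curve, 150 training and 50 validation samples.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 2.0 * np.pi, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]
sigma, err = ga_tune_sigma(X_tr, y_tr, X_val, y_val)
```

    Tracking the best value seen so far means the returned RMSE can never be worse than the one obtained with the starting value of 1, whatever the random search does.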

    Figure 4. RMSE trend of the GAGRNN test result

    In addition, this article also uses the PCR+GRNN model and the pure GRNN model for prediction accuracy analysis; the analysis results therefore come from a total of four models.

    3.3 General comparison of the prediction capabilities of four models

    Of the five evaluation indexes, the closer the first four (RMSE, RTIC, MAE and MAPE) are to zero, the more accurate the model; the closer the fifth (CE) is to 1, the more accurate the model. The analysis results on the test data are shown in Table 1:

    Table 1. Analysis results of five evaluation indexes

    Stock        Model      RMSE      RTIC      MAE       MAPE      CE
    HUNG POO     PCR        0.163955  0.000998  0.028943  0.002641  0.915215
                 GRNN       0.159376  0.000898  0.025378  0.002496  0.920129
                 PCR+GRNN   0.121110  0.000585  0.017998  0.001582  0.964731
                 Hybrid     0.119134  0.000339  0.014233  0.001236  0.970599
    WANGFUJING   PCR        0.329961  0.001325  0.072898  0.007112  0.998503
                 GRNN       0.325829  0.001319  0.070146  0.007083  0.998485
                 PCR+GRNN   0.263072  0.000990  0.040919  0.004505  0.999177
                 Hybrid     0.181573  0.000816  0.032987  0.003393  0.999778
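
    Four of the five indexes have widely used definitions and can be sketched as below. The article does not print its formulas, so these are assumptions: MAPE is returned as a fraction rather than a percentage, CE is taken to be one minus the ratio of the squared error to the variance around the mean, and RTIC is left out because its exact form is not given in the text. The function name and the four sample values are hypothetical.

```python
import numpy as np

def evaluation_indexes(y, y_hat):
    """RMSE, MAE, MAPE (as a fraction) and coefficient of efficiency (CE)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = y - y_hat
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err / y))),
        "CE": float(1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)),
    }

# Hypothetical actual and predicted closing prices.
actual = [10.0, 12.0, 11.0, 13.0]
predicted = [10.5, 11.5, 11.0, 13.5]
scores = evaluation_indexes(actual, predicted)
```

    A perfect prediction gives RMSE, MAE and MAPE of zero and a CE of one, which matches the reading direction of the indexes stated before Table 1.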

    It can be seen from Table 1 that the PCR+GRNN model has better prediction accuracy than either the PCR model or the GRNN model alone, and that when PCR is combined with the GAGRNN model whose parameter is adjusted by the Genetic Algorithm (PCR+GAGRNN), accuracy is higher than that of the other three models on every index. However, the GRNN model, which is an artificial intelligence method, is not clearly more accurate than the conventional Principal Component Regression model. Figure 5 shows the prediction results for the last 100 sets of test data; the horizontal axis is the real value and the vertical axis is the corresponding prediction value. The closer a sample point lies to the diagonal, the more accurate the prediction. The clustering of the sample points in Figure 5 shows that the Hybrid model predicts the stocks of both companies very well.

    Figure 5. Clustering trend of the sample points for the prediction results of the Hybrid model on HUNG POO and WANGFUJING

    4. Conclusions and suggestions

    Although the PCR model resolves the multicollinearity among the independent variables of the Multiple Regression model, the research results show that when the PCR model or the GRNN model is used alone on the stock sample data of the 2006 top-ranked enterprises, Taiwan's HUNG POO and Mainland China's BEIJING WANGFUJING, prediction performance is clearly lower than that of the PCR+GRNN model and the Hybrid model. This article therefore takes the instability of the error term of the PCR model into account and provides a solution. Furthermore, the PCR model is suited to predicting linear data and the GRNN model to predicting nonlinear data. The analysis results also show that when the parameter of the GRNN model is not properly selected, prediction results may be poor. This article therefore adopts a hybrid model that combines the PCR model with a parameter-adjusted GAGRNN model to predict the closing price of the stock; by combining the two models and optimizing the GRNN parameter, prediction capability is greatly enhanced (for example, the test RMSE falls from 0.163955 with PCR alone to 0.119134 for HUNG POO, and from 0.329961 to 0.181573 for WANGFUJING). However, this article does not compare prediction capability with other data mining techniques, for example the back-propagation neural network (BPN) or the support vector machine (SVM). Such comparisons are left as directions for future research.

    References

    [1] W. L. Megginson and K. A. Weiss (1991), Venture capitalist certification in initial public offerings, Journal of Finance, Vol. 46, pp. 879–903.

    [2] E. F. Fama and K. R. French (1993), Common risk factors in the returns on stocks and bonds, Journal of Financial Economics, Vol. 33 (1), pp. 3–56.

    [3] S. L. Lai and C. C. Yang (2004), Determinants of noise trading in Taiwan stock market and their effect for stock prices: time series cross-section regression, Journal of Risk Management, Vol. 6 (1) (March), pp. 5–31.

    [4] R. C. Tsaur (2004), Planning and analyzing for stock investment: a study for stocks of banks, Hsuan Chuang Management Journal, Vol. 1 (2), pp. 1–16.

    [5] D. F. Specht (1990), Probabilistic neural networks and the polynomial adaline as complementary techniques for classification, IEEE Trans. on Neural Networks, Vol. 1 (1), pp. 111–121.

    [6] Y. C. Yeh (1998), The Application of Neural Network, Julin Book Co., Ltd.

    [7] D. F. Specht (1991), A general regression neural network, IEEE Trans. on Neural Networks, Vol. 2 (6), pp. 568–576.

    [8] P. F. Pai and C. S. Lin (2005), A hybrid ARIMA and support vector machines model in stock price forecasting, Omega, Vol. 33, pp. 497–505.

    [9] L. L. Tang and P. C. Shih (2001), Predict the financial crisis by using grey relation analysis, neural network, and case-based reasoning, Chinese Management Review, Vol. 4 (2), pp. 25–37.

    [10] J. Deng (1982), The control problems of grey system, Systems & Control Letters, Vol. 5, pp. 288–294.

    [11] K. L. Wen, S. K. C. Chang, C. K. Yeh, C. W. Wang and H. S. Lin (2006), Apply MATLAB in Grey System Theory, Chuan Hwa Book Co., Ltd.

    [12] M. L. Wu (2007), SPSS Statistical Application Learning Practices, Acore Book Co., Ltd.

    [13] J. Holland (1992), Adaptation in Natural and Artificial Systems, MIT Press.

    [14] H. C. Lo (2006), Neural Network: the Application of MATLAB, Gau Lih Book Co., Ltd.

    Received April, 2009