Symbolic Data Analysis (Wiley Series in Computational Statistics), Chapter 6: Regression Analysis


6 Regression Analysis

In this chapter, linear regression methods are described for each of multi-valued (or categorical) variables, interval-valued variables, and histogram-valued variables, in turn. Then, the methodology is extended to models which include taxonomy variables as predictor variables, and to models which contain a hierarchical variable structure, as shown by Alfonso et al. (2005). Except for the multi-valued dependent variable case of Section 6.2.2 and a special case of interval-valued data in Example 6.6, the fitted regression models, and hence their predicted values, are generally single valued, even though the input variables had symbolic formats. A methodology that gives output predicted values as symbolic values is still an open question. The focus will be on fitting a linear regression model. How other types of models, and model diagnostics broadly defined, are handled also remains an open problem.

6.1 Classical Multiple Regression Model

Since the symbolic regression methodology relies heavily on the classical theory, we very briefly describe the familiar classical method. The multiple classical linear regression model, for the predictor variables $X_1, \ldots, X_p$ and dependent variable $Y$, is defined by

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + e \qquad (6.1)$$

or, in vector terms,

$$Y = X\beta + e \qquad (6.2)$$

where the vector of observations $Y$ is

$$Y = (Y_1, \ldots, Y_n)',$$

the regression design matrix $X$ is the $n \times (p+1)$ matrix

$$X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{np} \end{pmatrix},$$

the regression coefficient vector $\beta$ is the $(p+1)$-vector

$$\beta = (\beta_0, \ldots, \beta_p)',$$

and the error vector $e$ is

$$e = (e_1, \ldots, e_n)',$$

where the error terms satisfy $E(e_i) = 0$, $\mathrm{Var}(e_i) = \sigma^2$, and $\mathrm{Cov}(e_i, e_{i'}) = 0$, $i \neq i'$. The least squares estimators of the parameters $\beta$ are given by, if $X'X$ is nonsingular,

$$\hat{\beta} = (X'X)^{-1} X' Y. \qquad (6.3)$$

In the particular case when $p = 1$, Equation (6.3) simplifies to

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}, \qquad (6.4)$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \qquad (6.5)$$

where

$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i, \qquad \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i. \qquad (6.6)$$

An alternative formulation of the model in Equation (6.1) is

$$Y - \bar{Y} = \beta_1 (X_1 - \bar{X}_1) + \cdots + \beta_p (X_p - \bar{X}_p) + e$$

where

$$\bar{X}_j = \frac{1}{n} \sum_{i=1}^{n} X_{ij}, \qquad j = 1, \ldots, p.$$

When using this formulation, Equation (6.3) becomes

$$\hat{\beta} = [(X - \bar{X})'(X - \bar{X})]^{-1} (X - \bar{X})'(Y - \bar{Y})$$

where now there is no column of ones in the matrix $X$. It follows that $\hat{\beta}_0$ in Equation (6.1) is

$$\hat{\beta}_0 = \bar{Y} - (\hat{\beta}_1 \bar{X}_1 + \cdots + \hat{\beta}_p \bar{X}_p);$$

hence the two formulations are equivalent.

Given the nature of the definitions of the symbolic covariance for interval-valued and histogram-valued observations (see Equation (4.16) and Equation (4.19), respectively), this latter formulation is preferable when fitting $p \geq 2$ predictor variables in these cases (as illustrated in subsequent sections below).
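Equations (6.3)-(6.6) and the centered formulation are easy to verify numerically. The following is a minimal sketch (not from the text; plain Python with numpy, on hypothetical data) showing that the raw and centered formulations return identical coefficients.

```python
import numpy as np

def ols_fit(X, y):
    """Least squares fit of y = b0 + b1*x1 + ... + bp*xp + e.
    Returns (b0, b1, ..., bp), Equation (6.3): (X'X)^{-1} X'y."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    Xd = np.column_stack([np.ones(len(X)), X])   # prepend column of ones
    return np.linalg.solve(Xd.T @ Xd, Xd.T @ y)  # solve normal equations

def ols_fit_centered(X, y):
    """Centered formulation: regress (y - ybar) on (X - Xbar) with no
    intercept column, then recover b0 = ybar - sum_j bj * Xbar_j."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    b = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
    b0 = y.mean() - X.mean(axis=0) @ b
    return np.concatenate([[b0], b])

# Hypothetical data: the two formulations agree.
rng = np.random.default_rng(0)
X = rng.random((10, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, 10)
print(ols_fit(X, y))
print(ols_fit_centered(X, y))  # same coefficients
```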
The model of Equation (6.1) assumes there is only one dependent variable $Y$ for the set of $p$ predictor variables $X = (X_1, \ldots, X_p)$. When the variable $Y$ is itself of dimension $q > 1$, we have a multivariate multiple regression model, given by

$$Y_k = \beta_{0k} + \beta_{1k} X_1 + \cdots + \beta_{pk} X_p + e_k, \qquad k = 1, \ldots, q, \qquad (6.7)$$

or, in matrix terms,

$$Y = X\beta + e \qquad (6.8)$$

where the observation matrix $Y$ is the $n \times q$ matrix

$$Y = (Y^{(1)}, \ldots, Y^{(q)})$$

with

$$Y^{(k)} = (Y_{1k}, \ldots, Y_{nk})',$$

the regression design matrix $X$ is the same $n \times (p+1)$ matrix $X$ used in Equation (6.2), the regression parameter matrix $\beta$ is the $(p+1) \times q$ matrix

$$\beta = (\beta^{(1)}, \ldots, \beta^{(q)})$$

with

$$\beta^{(k)} = (\beta_{0k}, \ldots, \beta_{pk})',$$

and where the error terms $e$ form the $n \times q$ matrix

$$e = (e^{(1)}, \ldots, e^{(q)})$$

with

$$e^{(k)} = (e_{1k}, \ldots, e_{nk})',$$

and where the error terms satisfy

$$E(e^{(k)}) = 0, \qquad \mathrm{Cov}(e^{(k_1)}, e^{(k_2)}) = \sigma_{k_1 k_2} I.$$

Note that these errors from different responses on the same trial are correlated in general, but that observations from different trials are not. Then, the least squares estimators of the parameters $\beta$ are

$$\hat{\beta} = (X'X)^{-1} X' Y. \qquad (6.9)$$

In the special case when $p = 1$, the model of Equation (6.7) becomes, for $i = 1, \ldots, n$,

$$Y_{ik} = \beta_{0k} + \beta_{1k} X_i + e_{ik}, \qquad k = 1, \ldots, q, \qquad (6.10)$$

and hence we can show that

$$\hat{\beta}_{1k} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_{ik} - \bar{Y}_k)}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \qquad (6.11)$$

and

$$\hat{\beta}_{0k} = \bar{Y}_k - \hat{\beta}_{1k} \bar{X} \qquad (6.12)$$

where

$$\bar{Y}_k = \frac{1}{n} \sum_{i=1}^{n} Y_{ik}, \qquad k = 1, \ldots, q, \qquad (6.13)$$

and $\bar{X}$ is as defined in Equation (6.6). There is a plethora of standard texts covering the basics of classical regression analysis, including topics beyond the model fitting stages touched on herein, such as model diagnostics, non-linear and other types of models, variable selection, and the like. See, for example, Montgomery and Peck (1992) and Myers (1986) for an applications-oriented introduction. As for the univariate $Y$ case, there are numerous texts available for details on multivariate regression; see, for example, Johnson and Wichern (2002).
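Since the estimator in Equation (6.9) applies the same $(X'X)^{-1}X'$ to every response column, fitting the $q$ responses jointly is equivalent to running $q$ separate univariate regressions. A brief numpy sketch of this point (not from the text; hypothetical data):

```python
import numpy as np

def mv_ols_fit(X, Y):
    """Multivariate multiple regression, Equation (6.9):
    beta_hat = (X'X)^{-1} X'Y, with Y an n x q response matrix.
    Returns a (p+1) x q coefficient matrix (intercepts in row 0)."""
    Xd = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    return np.linalg.solve(Xd.T @ Xd, Xd.T @ np.asarray(Y, dtype=float))

# Hypothetical data: n = 8 trials, p = 2 predictors, q = 3 responses.
rng = np.random.default_rng(1)
X = rng.random((8, 2))
Y = rng.random((8, 3))

B = mv_ols_fit(X, Y)

# Fitting each response separately recovers the columns of B,
# since (X'X)^{-1} X' does not depend on the response.
for k in range(Y.shape[1]):
    bk = mv_ols_fit(X, Y[:, [k]])
    assert np.allclose(bk.ravel(), B[:, k])
```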
6.2 Multi-Valued Variables

6.2.1 Single dependent variable

Recall from Definition 2.2 that a multi-valued random variable is one whose value is a list of categorical values, i.e., each observation $w_u$ takes a value

$$\xi(w_u) = \{\eta_{uk};\ k = 1, \ldots, s_u\}, \qquad u = 1, \ldots, m. \qquad (6.14)$$

A modal multi-valued random variable, from Definition 2.4, is one whose value takes the form

$$\xi(w_u) = \{\eta_{uk}, \pi_{uk};\ k = 1, \ldots, s_u\}, \qquad u = 1, \ldots, m. \qquad (6.15)$$

Clearly, the non-modal multi-valued variable of Equation (6.14) is a special case of the modal multi-valued variable where each value $\eta_k$ in this list is equally likely. In the sequel, we shall take the measures $\pi_k$ to be relative frequencies $p_k$, $k = 1, \ldots, s_u$; adjustments to other forms of measures $\pi_k$ can readily be made (see Section 2.2).

For clarity, let us denote the independent predictor variables by $X = (X_1, \ldots, X_p)$ taking values in $\mathcal{X}_1 \times \cdots \times \mathcal{X}_p$, where $X_j$ has realizations

$$X_j(w_u) = \{\eta_{ujk}, p_{ujk};\ k = 1, \ldots, s_{uj}\}, \qquad j = 1, \ldots, p, \qquad (6.16)$$

on the observation $w_u$, $u = 1, \ldots, m$. Let the possible values of $X_j$ in $\mathcal{X}_j$ be $\{\eta_{j1}, \ldots, \eta_{j s_j}\}$. These $X_{jk}$, $k = 1, \ldots, s_j$, $j = 1, \ldots, p$, are a type of indicator variable taking relative frequency values $p_{jk}$, respectively. Without loss of generality in Equation (6.16), we take $s_{uj} = s_j$ for all observations $w_u$ and set the observed $p_{ujk} \equiv 0$ for those values of $X_j$ in $\mathcal{X}_j$ not appearing in Equation (6.16). Let the dependent variable be denoted by $Y$ taking values in the space $\mathcal{Y}$. Initially, let $Y$ take a single (quantitative, or qualitative) value in $\mathcal{Y}$; this is extended to the case of several $Y$ values in Section 6.2.2.

The regression analysis for these multi-valued observations proceeds as for the multiple classical regression model of Equation (6.1) in the usual way, where the $X_j$ variables in Equation (6.1) take values equal to the $p_{ujk}$ of the symbolic observations. Note that, since the $X_{jk}$ are a type of indicator variable, only $s_j - 1$ of them are included in the model for each $X_j$, to enable inversion of the associated $X'X$ matrix.

Example 6.1. Table 6.1 lists the average fuel costs ($Y$, in coded dollars) paid per annum and the types of fuel used ($X$) for heating in several different regions of a country. The complete set of possible $X$ values is $\{X_1, \ldots, X_4\}$, i.e., $\mathcal{X} = \{$gas, oil, electric, other$\}$ forms of heating. Thus, we see, for instance, that in region1 ($w_u = w_1$), 83% of the customers used gas, 3% used oil, and 14% used electricity for heating their homes, at an average expenditure of $Y = 28.1$.

Table 6.1 Fuel consumption.

wu    Region     Y = Expenditure   X = Fuel Type
w1    region1    28.1              {gas, 0.83; oil, 0.03; electric, 0.14}
w2    region2    23.4              {gas, 0.69; oil, 0.13; electric, 0.12; other, 0.06}
w3    region3    33.2              {oil, 0.40; electric, 0.15; other, 0.45}
w4    region4    25.1              {gas, 0.61; oil, 0.07; electric, 0.16; other, 0.16}
w5    region5    21.7              {gas, 0.67; oil, 0.15; electric, 0.18}
w6    region6    32.5              {gas, 0.40; electric, 0.45; other, 0.15}
w7    region7    26.6              {gas, 0.83; oil, 0.01; electric, 0.09; other, 0.07}
w8    region8    19.9              {gas, 0.66; oil, 0.15; electric, 0.19}
w9    region9    28.4              {gas, 0.86; electric, 0.09; other, 0.05}
w10   region10   25.5              {gas, 0.77; electric, 0.23}

The resulting regression equation is, from Equation (6.3),

$$\hat{Y} = 61.31 - 36.73(\text{gas}) - 61.34(\text{oil}) - 32.74(\text{electric}). \qquad (6.17)$$

Note that although the $X_4 =$ other variable does not enter directly into Equation (6.17), it is present indirectly by implication when the corresponding values for $X_{1k}$, $k = 1, 2, 3$, are entered into the model.

By substituting observed $X$ values into Equation (6.17), predicted expenditures $\hat{Y}$ and the corresponding residuals $R = Y - \hat{Y}$ are obtained. For example, for region1,

$$\hat{Y}(w_1) = 61.31 - 36.73(0.83) - 61.34(0.03) - 32.74(0.14) = 24.40,$$

and hence the residual is

$$R(w_1) = Y(w_1) - \hat{Y}(w_1) = 28.10 - 24.40 = 3.70.$$

The predicted $\hat{Y}$ values and the residuals $R$ for all regions are displayed in Table 6.2.

Table 6.2 Fuel regression: coded values, predictions, and residuals.

wu    Region     Y      gas    oil    electric   other   Ŷ       R
w1    region1    28.1   0.83   0.03   0.14       0.00    24.40    3.70
w2    region2    23.4   0.69   0.13   0.12       0.06    24.07   -0.67
w3    region3    33.2   0.00   0.40   0.15       0.45    31.87    1.33
w4    region4    25.1   0.61   0.07   0.16       0.16    29.37   -4.27
w5    region5    21.7   0.67   0.15   0.18       0.00    21.61    0.09
w6    region6    32.5   0.40   0.00   0.45       0.15    31.89    0.61
w7    region7    26.6   0.83   0.01   0.09       0.07    27.26   -0.66
w8    region8    19.9   0.66   0.15   0.19       0.00    21.65   -1.75
w9    region9    28.4   0.86   0.00   0.09       0.05    26.78    1.62
w10   region10   25.5   0.77   0.00   0.23       0.00    25.50    0.00
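The fit in Equation (6.17) is ordinary least squares on the relative frequencies, with the "other" column dropped as noted above. A minimal sketch (not from the text; plain numpy, using the Table 6.1 data) whose printed values should approximate the book's coefficients:

```python
import numpy as np

# Relative frequencies (gas, oil, electric) from Table 6.1; the
# "other" frequency is omitted so that X'X is invertible.
X = np.array([
    [0.83, 0.03, 0.14], [0.69, 0.13, 0.12], [0.00, 0.40, 0.15],
    [0.61, 0.07, 0.16], [0.67, 0.15, 0.18], [0.40, 0.00, 0.45],
    [0.83, 0.01, 0.09], [0.66, 0.15, 0.19], [0.86, 0.00, 0.09],
    [0.77, 0.00, 0.23],
])
y = np.array([28.1, 23.4, 33.2, 25.1, 21.7, 32.5, 26.6, 19.9, 28.4, 25.5])

Xd = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
print(np.round(beta, 2))        # approximately (61.31, -36.73, -61.34, -32.74)

y_hat = Xd @ beta               # predicted expenditures, as in Table 6.2
print(np.round(y - y_hat, 2))   # residuals R = Y - Y_hat
```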
Example 6.2. Suppose there are $p = 2$ predictor variables. The data of Table 6.3 extend the analysis of Table 6.1 by relating the expenditure $Y$ on heating to two predictors, $X_1 =$ Type of heating fuel (as in Table 6.1) and $X_2 =$ Size of household, with $\mathcal{X}_2 = \{$small, large$\}$. Thus, the third column of Table 6.3 gives the observation values for $X_2$. The $X_2$ possible values are recoded to $X_{21} =$ small and $X_{22} =$ large. The fourth and fifth columns in Table 6.3 give the corresponding recoded values. The $X_1$ and the related $X_{1k}$, $k = 1, \ldots, 4$, values are as given in Table 6.1 and Table 6.2, and the $X_2$ and related $X_{2k}$, $k = 1, 2$, values are as in Table 6.3.

Table 6.3 Fuel consumption: two variables.

wu    Region     X2 = Household Size          X21    X22    Ŷ       R
w1    region1    {small, 0.57; large, 0.43}   0.57   0.43   25.93    2.17
w2    region2    {small, 0.65; large, 0.35}   0.65   0.35   23.97   -0.57
w3    region3    {small, 0.38; large, 0.62}   0.38   0.62   32.72    0.48
w4    region4    {small, 0.82; large, 0.18}   0.82   0.18   26.58   -1.48
w5    region5    {small, 0.74; large, 0.26}   0.74   0.26   20.35    1.35
w6    region6    {small, 0.55; large, 0.45}   0.55   0.45   32.28    0.22
w7    region7    {small, 0.81; large, 0.19}   0.81   0.19   25.40    1.20
w8    region8    {small, 0.63; large, 0.37}   0.63   0.37   21.86   -1.96
w9    region9    {small, 0.47; large, 0.53}   0.47   0.53   29.66   -1.26
w10   region10   {small, 0.66; large, 0.34}   0.66   0.34   25.66   -0.16

The regression model that fits these data is, from Equation (6.3),

$$\hat{Y} = 66.40 - 32.08(\text{gas}) - 59.77(\text{oil}) - 30.69(\text{electric}) - 13.60(\text{small}). \qquad (6.18)$$

Then, from Equation (6.18), we predict, for example, for region1,

$$\hat{Y}(w_1) = 66.40 - 32.08(0.83) - 59.77(0.03) - 30.69(0.14) - 13.60(0.57) = 25.93,$$

and hence the residual is

$$R(w_1) = 28.10 - 25.93 = 2.17.$$

The predicted and residual values for all regions are shown in the last two columns, respectively, of Table 6.3.
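The same recipe extends to $p = 2$ multi-valued predictors: drop one level of each variable ("other" and "large") and regress on the remaining frequencies. A sketch (not from the text; plain numpy, using the Table 6.1-6.3 data):

```python
import numpy as np

# Coded predictors from Tables 6.2 and 6.3; one level of each
# multi-valued variable ("other" and "large") is dropped.
gas      = [0.83, 0.69, 0.00, 0.61, 0.67, 0.40, 0.83, 0.66, 0.86, 0.77]
oil      = [0.03, 0.13, 0.40, 0.07, 0.15, 0.00, 0.01, 0.15, 0.00, 0.00]
electric = [0.14, 0.12, 0.15, 0.16, 0.18, 0.45, 0.09, 0.19, 0.09, 0.23]
small    = [0.57, 0.65, 0.38, 0.82, 0.74, 0.55, 0.81, 0.63, 0.47, 0.66]
y = np.array([28.1, 23.4, 33.2, 25.1, 21.7, 32.5, 26.6, 19.9, 28.4, 25.5])

Xd = np.column_stack([np.ones(10), gas, oil, electric, small])
beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
print(np.round(beta, 2))           # approx. (66.40, -32.08, -59.77, -30.69, -13.60)
print(np.round(y - Xd @ beta, 2))  # residuals, last column of Table 6.3
```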
6.2.2 Multi-valued dependent variable

Suppose now that the dependent variable $Y$ can take any of a number of values from a list of possible values $\mathcal{Y} = \{\eta_1, \eta_2, \ldots, \eta_q\}$. Let us write this as, for observation $w_u$,

$$Y(w_u) = \{\eta_{uk}, q_{uk};\ k = 1, \ldots, t_u\}$$

where for observation $w_u$ the outcome $\eta_{uk}$ occurred with relative frequency $q_{uk}$, and where $t_u$ is the number of components in $\mathcal{Y}$ that were observed in $Y(w_u)$. Without loss of generality we let $t_u = q$ for all $u$, i.e., those $\eta_k$ in $\mathcal{Y}$ not actually occurring in $Y(w_u)$ take relative frequency $q_{uk} \equiv 0$.

Then, in the same manner as the predictor variables $X_j$, $j = 1, \ldots, p$, were coded in Section 6.2.1, we also code the dependent variable to the $q$-dimensional dependent variable $Y = (Y_1, \ldots, Y_q)$, with $Y_k$ representing the possible outcome $\eta_k$ with the observed relative frequency $q_k$, $k = 1, \ldots, q$. Our linear regression model between the coded $Y$ and $X$ variables can be viewed as analogous to the classical multivariate multiple regression model of Equation (6.7).

Therefore, the techniques of multivariate multiple regression models are easily extended to our symbolic multi-valued variable observations. This time we obtain predicted fits $\hat{Y}$ which have symbolic modal multi-valued values.

Example 6.3. A study was undertaken to investigate the relationship, if any, between Gender $X_1$ and Age $X_2$ on the types of convicted Crime $Y$ reported in $m = 15$ areas known to be populated by gangs. The dependent random variable $Y$ took possible values in $\mathcal{Y} = \{$violent, non-violent, none$\}$. The Gender $X_1$ took values in $\mathcal{X}_1 = \{$male, female$\}$, and the Age $X_2$ of gang members took values in $\mathcal{X}_2 = \{<20, \geq 20\}$ years. These random variables were coded, respectively, to $Y = (Y_1, Y_2, Y_3)$, $X_1 = (X_{11}, X_{12})$, and $X_2 = (X_{21}, X_{22})$. The coded data are as shown in Table 6.4. Thus, for example, for gang1, the original data had

gang1 = (Y(w1), X1(w1), X2(w1))
      = ({violent, 0.73; non-violent, 0.16; none, 0.11}, {male, 0.68; female, 0.32}, {<20, 0.64; ≥20, 0.36}),

i.e., in this gang, 64% were under 20 years of age and 36% were 20 or older, 68% were male and 32% were female, and 73% had been convicted of a violent crime, 16% a non-violent crime, and 11% had not been convicted.

Table 6.4 Crime demographics.

         Crime (Y)                     Gender (X1)      Age (X2)
wu       Y1        Y2            Y3    X11     X12      X21    X22
         violent   non-violent   none  male    female   <20    ≥20
gang1    0.73      0.16          0.11  0.68    0.32     0.64   0.36
gang2    0.40      0.20          0.40  0.70    0.30     0.80   0.20
gang3    0.20      0.20          0.60  0.50    0.50     0.50   0.50
gang4    0.10      0.20          0.70  0.60    0.40     0.40   0.60
gang5    0.20      0.40          0.40  0.35    0.65     0.55   0.45
gang6    0.48      0.32          0.20  0.53    0.47     0.62   0.38
gang7    0.14      0.65          0.21  0.40    0.60     0.33   0.67
gang8    0.37      0.37          0.26  0.51    0.49     0.42   0.58
gang9    0.47      0.32          0.21  0.59    0.41     0.66   0.34
gang10   0.18      0.15          0.77  0.37    0.63     0.22   0.78
gang11   0.35      0.35          0.30  0.41    0.59     0.44   0.56
gang12   0.18      0.57          0.25  0.39    0.61     0.45   0.55
gang13   0.74      0.16          0.10  0.70    0.30     0.63   0.37
gang14   0.33      0.45          0.22  0.37    0.64     0.29   0.71
gang15   0.35      0.39          0.26  0.50    0.50     0.44   0.56

The multivariate regression model is, from Equation (6.7), for $u = 1, \ldots, m$,

$$Y_{u1} = \beta_{01} + \beta_{11} X_{u11} + \beta_{31} X_{u21} + e_{u1},$$
$$Y_{u2} = \beta_{02} + \beta_{12} X_{u11} + \beta_{32} X_{u21} + e_{u2}, \qquad (6.19)$$
$$Y_{u3} = \beta_{03} + \beta_{13} X_{u11} + \beta_{33} X_{u21} + e_{u3},$$

where for observation $w_u$ the dependent variable is written as $Y(w_u) = (Y_{u1}, Y_{u2}, Y_{u3})$.

Note that since the $X_1$ and $X_2$ variables are a type of coded variable, one of the $X_{1k}$ and one of the $X_{2k}$ variables is omitted in the regression model of Equation (6.19). However, all the coded $Y$ variables are retained, so in this case there are $q = 3$ equations, one for each of the coded $Y$ variables. Then, from Equation (6.19), we can estimate the parameters $\beta$ by $\hat{\beta}$ to give the fitted regression model as $\hat{Y} = (\hat{Y}_1, \hat{Y}_2, \hat{Y}_3)$ where

$$\hat{Y}_1 = -0.202 + 0.791(\text{male}) + 0.303(\text{under 20}),$$
$$\hat{Y}_2 = 0.711 - 0.980(\text{male}) + 0.226(\text{under 20}), \qquad (6.20)$$
$$\hat{Y}_3 = 0.531 + 0.213(\text{male}) - 0.622(\text{under 20}).$$

Substituting the observed $X$ values into Equation (6.20), we can calculate $\hat{Y}$. Thus, for example, for the first gang,

$$\hat{Y}_1(\text{gang1}) = -0.202 + 0.791(0.68) + 0.303(0.64) = 0.53,$$
$$\hat{Y}_2(\text{gang1}) = 0.711 - 0.980(0.68) + 0.226(0.64) = 0.19,$$
$$\hat{Y}_3(\text{gang1}) = 0.531 + 0.213(0.68) - 0.622(0.64) = 0.28.$$

That is, the predicted crime for a gang with the age and gender characteristics of gang1 is

$$\hat{Y}(\text{gang1}) = \hat{Y}(w_1) = \{\text{violent, 0.53; non-violent, 0.19; none, 0.28}\},$$

i.e., 53% are likely to be convicted of a violent crime, 19% of a non-violent crime, and 28% will not be convicted of any crime. The predicted crime rates $\hat{Y}$ for each of the gangs, along with the corresponding residuals $R = Y - \hat{Y}$, are shown in Table 6.5.

Table 6.5 Crime predictions.

         Ŷ = Predicted Crime               Residuals
wu       Ŷ1        Ŷ2            Ŷ3       R1      R2      R3
         violent   non-violent   none
gang1    0.53      0.19          0.28      0.20   -0.03   -0.17
gang2    0.59      0.21          0.18     -0.19   -0.01    0.22
gang3    0.34      0.33          0.33     -0.14   -0.13    0.27
gang4    0.39      0.21          0.41     -0.29   -0.01    0.29
gang5    0.24      0.49          0.26     -0.04   -0.09    0.14
gang6    0.41      0.33          0.26      0.08   -0.01   -0.06
gang7    0.21      0.39          0.41     -0.07    0.26   -0.20
gang8    0.33      0.31          0.38      0.04    0.06   -0.12
gang9    0.46      0.28          0.25      0.01    0.04   -0.04
gang10   0.16      0.40          0.47      0.02   -0.25    0.30
gang11   0.26      0.41          0.34      0.09   -0.06   -0.04
gang12   0.24      0.43          0.33     -0.06    0.14   -0.08
gang13   0.54      0.17          0.29      0.20   -0.01   -0.19
gang14   0.18      0.41          0.43      0.15    0.04   -0.21
gang15   0.33      0.32          0.36      0.02    0.07   -0.10
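The multivariate fit of Equation (6.20) can be sketched the same way: regress the three coded $Y$ columns on the retained predictors $X_{11}$ (male) and $X_{21}$ (under 20), with female and ≥20 as the dropped categories. Again a sketch, not from the text, in plain numpy with the Table 6.4 data; the printed coefficients should approximate Equation (6.20).

```python
import numpy as np

# Coded data from Table 6.4: retained predictors are X11 (male)
# and X21 (under 20); Y columns are (violent, non-violent, none).
male  = np.array([0.68, 0.70, 0.50, 0.60, 0.35, 0.53, 0.40, 0.51,
                  0.59, 0.37, 0.41, 0.39, 0.70, 0.37, 0.50])
under = np.array([0.64, 0.80, 0.50, 0.40, 0.55, 0.62, 0.33, 0.42,
                  0.66, 0.22, 0.44, 0.45, 0.63, 0.29, 0.44])
Y = np.array([
    [0.73, 0.16, 0.11], [0.40, 0.20, 0.40], [0.20, 0.20, 0.60],
    [0.10, 0.20, 0.70], [0.20, 0.40, 0.40], [0.48, 0.32, 0.20],
    [0.14, 0.65, 0.21], [0.37, 0.37, 0.26], [0.47, 0.32, 0.21],
    [0.18, 0.15, 0.77], [0.35, 0.35, 0.30], [0.18, 0.57, 0.25],
    [0.74, 0.16, 0.10], [0.33, 0.45, 0.22], [0.35, 0.39, 0.26],
])

Xd = np.column_stack([np.ones(15), male, under])
B = np.linalg.solve(Xd.T @ Xd, Xd.T @ Y)  # Equation (6.9), one column per Y_k
print(np.round(B, 3))    # columns approximate the coefficients in Equation (6.20)

Y_hat = Xd @ B           # predicted modal frequencies, as in Table 6.5
print(np.round(Y[0] - Y_hat[0], 2))  # gang1 residuals: about (0.20, -0.03, -0.17)
```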

6.3 Interval-Valued Variables

For interv…