Regression Analysis and Multiple Regression


  • Regression Analysis and Multiple Regression (Session 7)

  • Simple Linear Regression Model
    Using Statistics
    The Simple Linear Regression Model
    Estimation: The Method of Least Squares
    Error Variance and the Standard Errors of Regression Estimators
    Correlation
    Hypothesis Tests about the Regression Relationship
    How Good is the Regression?
    Analysis of Variance Table and an F Test of the Regression Model
    Residual Analysis and Checking for Model Inadequacies
    Use of the Regression Model for Prediction
    Using the Computer
    Summary and Review of Terms

  • 7-1 Using Statistics

  • Examples of Other Scatterplots

  • Model Building

  • 7-2 The Simple Linear Regression Model

  • Picturing the Simple Linear Regression Model
    The simple linear regression model posits an exact linear relationship between the expected (average) value of Y, the dependent variable, and X, the independent or predictor variable:

    E[Yi] = β0 + β1·Xi

    Actual observed values of Y differ from the expected value by an unexplained or random error εi:

    Yi = E[Yi] + εi = β0 + β1·Xi + εi

  • Assumptions of the Simple Linear Regression Model
    The relationship between X and Y is a straight-line relationship.
    The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term εi.
    The errors εi are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related to one another) in successive observations. That is: εi ~ N(0, σ²).

  • 7-3 Estimation: The Method of Least Squares
    Estimating a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.

    The estimated regression equation:  Y = b0 + b1·X + e

    where b0 estimates the intercept of the population regression line, β0; b1 estimates the slope of the population regression line, β1; and e stands for the observed errors, the residuals from fitting the estimated regression line b0 + b1·X to a set of n points.

  • Fitting a Regression Line
    [Figures: the data; three errors from an arbitrary fitted line; three errors from the least squares regression line. Errors from the least squares regression line are minimized.]

  • Errors in Regression
    [Figure: scatterplot with the regression line and the errors between observed and fitted values.]

  • Least Squares Regression

  • Sums of Squares, Cross Products, and Least Squares Estimators

    SS_x  = Σ(x - x̄)²       = Σx² - (Σx)²/n
    SS_y  = Σ(y - ȳ)²       = Σy² - (Σy)²/n
    SS_xy = Σ(x - x̄)(y - ȳ) = Σxy - (Σx)(Σy)/n

    b1 = SS_xy / SS_x        b0 = ȳ - b1·x̄

  • Example 7-1

    Miles    Dollars    Miles²       Miles × Dollars
    1211     1802       1466521       2182222
    1345     2405       1809025       3234725
    1422     2005       2022084       2851110
    1687     2511       2845969       4236057
    1849     2332       3418801       4311868
    2026     2305       4104676       4669930
    2133     3016       4549689       6433128
    2253     3385       5076009       7626405
    2400     3090       5760000       7416000
    2468     3694       6091024       9116792
    2699     3371       7284601       9098329
    2806     3998       7873636      11218388
    3082     3555       9498724      10956510
    3209     4692      10297681      15056628
    3466     4244      12013156      14709704
    3643     5298      13271449      19300614
    3852     4801      14837904      18493452
    4033     5147      16265089      20757851
    4267     5738      18207289      24484046
    4498     6420      20232004      28877160
    4533     6059      20548089      27465447
    4804     6426      23078416      30870504
    5090     6321      25908100      32173890
    5233     7026      27384289      36767058
    5439     6964      29582721      37877196
    -----    ------    ---------     ---------
    79448    106605    293426946     390185014
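    The sums of squares and the least-squares estimates above can be checked with a few lines of code. The following is a minimal NumPy sketch for verifying the arithmetic, not part of the original slides:

        import numpy as np

        miles = np.array([1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400,
                          2468, 2699, 2806, 3082, 3209, 3466, 3643, 3852, 4033,
                          4267, 4498, 4533, 4804, 5090, 5233, 5439])
        dollars = np.array([1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090,
                            3694, 3371, 3998, 3555, 4692, 4244, 5298, 4801, 5147,
                            5738, 6420, 6059, 6426, 6321, 7026, 6964])

        n = len(miles)
        ss_x = np.sum(miles**2) - miles.sum()**2 / n                       # SS_x
        ss_xy = np.sum(miles * dollars) - miles.sum() * dollars.sum() / n  # SS_xy

        b1 = ss_xy / ss_x                        # slope estimate
        b0 = dollars.mean() - b1 * miles.mean()  # intercept estimate
        print(b0, b1)                            # approximately 274.85 and 1.2553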

  • Example 7-1: Using the Computer

    MTB > Regress 'Dollars' 1 'Miles';
    SUBC> Constant.

    Regression Analysis

    The regression equation is
    Dollars = 275 + 1.26 Miles

    Predictor    Coef      Stdev     t-ratio   p
    Constant     274.8     170.3      1.61     0.120
    Miles        1.25533   0.04972   25.25     0.000

    s = 318.2    R-sq = 96.5%    R-sq(adj) = 96.4%

    Analysis of Variance

    SOURCE       DF   SS         MS         F        p
    Regression    1   64527736   64527736   637.47   0.000
    Error        23    2328161     101224
    Total        24   66855896

  • Example 7-1: Using the Computer (Excel)
    The output below is produced by the Regression tool in Excel's Data Analysis ToolPak.

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R           0.9824
    R Square             0.9652
    Adjusted R Square    0.9637
    Standard Error       318.1578
    Observations         25

    ANOVA
                 df   SS            MS            F        Significance F
    Regression    1   64527736.80   64527736.80   637.47   0.0000
    Residual     23    2328161.20     101224.40
    Total        24   66855898

                Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
    Intercept   274.8497       170.3368          1.6136   0.1203    -77.5184    627.2178
    MILES         1.2553         0.0497         25.2482   0.0000      1.1525      1.3582

  • Example 7-1 (Excel): Residual Analysis
    The residual plot shows the absence of a relationship between the residuals and the X values (miles).

  • Total Variance and Error Variance

  • 7-4 Error Variance and the Standard Errors of Regression Estimators

  • Standard Errors of Estimates in Regression

  • Confidence Intervals for the Regression Parameters

  • 7-5 Correlation
    The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables. The population correlation, denoted by ρ, can take on any value from -1 to 1:

    ρ = -1      indicates a perfect negative linear relationship
    -1 < ρ < 0  indicates a negative linear relationship
    ρ = 0       indicates no linear relationship
    0 < ρ < 1   indicates a positive linear relationship
    ρ = 1       indicates a perfect positive linear relationship
  • Illustrations of Correlation

  • Covariance and Correlation

    Cov(X, Y) = E[(X - μX)(Y - μY)]
    ρ = Cov(X, Y) / (σX·σY)

    *Note: If ρ < 0, b1 < 0; if ρ = 0, b1 = 0; if ρ > 0, b1 > 0.
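    As a numeric check, the sample correlation r (the estimate of ρ) can be computed from the Example 7-1 sums of squares: SS_x and SS_xy come from the data table, and SS_y is the Total sum of squares in the Minitab output. This is a sketch, not part of the original slides:

        import numpy as np

        # Sums of squares from Example 7-1 (SS_y = Total SS in the Minitab output)
        ss_x, ss_y, ss_xy = 40947557.84, 66855896.0, 51402852.4
        r = ss_xy / np.sqrt(ss_x * ss_y)   # sample correlation; same sign as b1
        print(r)                           # approximately 0.9824, so r**2 = 0.965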

  • Example 7-2: Using the Computer (Excel)

    Data (X = United States, Y = International):

    US     7.6   7.9   8.3   8.6   8.8   9.0   9.4   10.2   11.4   12.1
    Intl   2.3   2.6   2.9   3.2   3.7   4.1   4.8    5.7    7.0    8.9

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R           0.9923
    R Square             0.9846
    Adjusted R Square    0.9827
    Standard Error       0.2798
    Observations         10

    ANOVA
                 df   SS        MS        F          Significance F
    Regression    1   40.0099   40.0099   511.2009   0.0000000155
    Residual      8    0.6261    0.0783
    Total         9   40.6360

                Coefficients   Standard Error   t Stat     P-value        Lower 95%   Upper 95%
    Intercept   -8.7625        0.5941           -14.7494   0.0000004      -10.1325    -7.3925
    US           1.4236        0.0630            22.6098   0.0000000155     1.2784     1.5688

    RESIDUAL OUTPUT

    Observation   Predicted Y   Residuals
     1            2.0571         0.2429
     2            2.4842         0.1158
     3            3.0537        -0.1537
     4            3.4807        -0.2807
     5            3.7655        -0.0655
     6            4.0502         0.0498
     7            4.6197         0.1803
     8            5.7586        -0.0586
     9            7.4669        -0.4669
    10            8.4635         0.4365

  • Example 7-2: Regression Plot
    [Figure: United States line fit plot, showing Y (International) and Predicted Y against United States returns; fitted line y = 1.4236x - 8.7625.]

  • Hypothesis Tests for the Correlation Coefficient

    H0: ρ = 0  (no linear relationship)
    H1: ρ ≠ 0  (some linear relationship)

    Test statistic:  t(n-2) = r / sqrt((1 - r²)/(n - 2))
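    A quick sketch of this test for the Example 7-1 correlation (r ≈ 0.9824, n = 25), not part of the original slides. In simple regression this t equals the t-ratio for the slope, as theory requires:

        import numpy as np

        n, r = 25, 0.9824
        t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
        print(t)   # approximately 25.2, far beyond t(0.025, 23) = 2.069: reject H0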

  • Hypothesis Tests about the Regression Relationship

  • Hypothesis Tests for the Regression Slope

  • 7-7 How Good is the Regression?

  • The Coefficient of Determination

    r² = SSR/SST = 1 - SSE/SST

    [Figures: scatterplots illustrating r² = 0, r² = 0.50, and r² = 0.90, with SST decomposed into SSR and SSE.]

  • 7-8 Analysis of Variance and an F Test of the Regression Model

  • 7-9 Residual Analysis and Checking for Model Inadequacies

  • 7-10 Use of the Regression Model for Prediction
    Point prediction: a single-valued estimate of Y for a given value of X, obtained by inserting the value of X in the estimated regression equation.
    Prediction interval for a value of Y given a value of X, reflecting:
      variation in the regression line estimate, and
      variation of points around the regression line.
    Prediction interval for an average value of Y given a value of X, reflecting:
      variation in the regression line estimate.

  • Errors in Predicting E[Y|X]

  • Prediction Interval for E[Y|X]
    [Figure: regression line with the prediction band for E[Y|X].]
    The prediction band for E[Y|X] is narrowest at the mean value of X.
    The prediction band widens as the distance from the mean of X increases.
    Predictions become very unreliable when we extrapolate beyond the range of the sample itself.

  • Additional Error in Predicting Individual Value of Y

  • Prediction Interval for a Value of Y

  • Prediction Interval for the Average Value of Y

  • Using the Computer

    MTB > regress 'Dollars' 1 'Miles' tres in C3 fits in C4;
    SUBC> predict 4000;
    SUBC> residuals in C5.

    Regression Analysis

    The regression equation is
    Dollars = 275 + 1.26 Miles

    Predictor    Coef      Stdev     t-ratio   p
    Constant     274.8     170.3      1.61     0.120
    Miles        1.25533   0.04972   25.25     0.000

    s = 318.2    R-sq = 96.5%    R-sq(adj) = 96.4%

    Analysis of Variance

    SOURCE       DF   SS         MS         F        p
    Regression    1   64527736   64527736   637.47   0.000
    Error        23    2328161     101224
    Total        24   66855896

       Fit   Stdev.Fit        95.0% C.I.            95.0% P.I.
    5296.2        75.6   ( 5139.7, 5452.7)   ( 4619.5, 5972.8)
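    The Fit, C.I., and P.I. line above can be reproduced from the regression quantities. A minimal sketch, with n, x̄, SS_x, b0, b1, and s taken from the Example 7-1 data and output (not part of the original slides):

        import numpy as np
        from scipy import stats

        n, xbar, ss_x = 25, 3177.92, 40947557.84   # from the Example 7-1 data
        b0, b1, s = 274.85, 1.25533, 318.2          # from the regression output
        x0 = 4000

        y_hat = b0 + b1 * x0                        # point prediction
        t_crit = stats.t.ppf(0.975, n - 2)
        se_mean = s * np.sqrt(1/n + (x0 - xbar)**2 / ss_x)       # for E[Y|X]
        se_pred = s * np.sqrt(1 + 1/n + (x0 - xbar)**2 / ss_x)   # for an individual Y

        print(y_hat)                                             # ~5296.2
        print(y_hat - t_crit*se_mean, y_hat + t_crit*se_mean)    # ~(5139.7, 5452.7)
        print(y_hat - t_crit*se_pred, y_hat + t_crit*se_pred)    # ~(4619.5, 5972.8)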

  • Plotting on the Computer (1)

  • Plotting on the Computer (2)

  • Using Statistics
    The k-Variable Multiple Regression Model
    The F Test of a Multiple Regression Model
    How Good is the Regression
    Tests of the Significance of Individual Regression Parameters
    Testing the Validity of the Regression Model
    Using the Multiple Regression Model for Prediction

  • Qualitative Independent Variables
    Polynomial Regression
    Nonlinear Models and Transformations
    Multicollinearity
    Residual Autocorrelation and the Durbin-Watson Test
    Partial F Tests and Variable Selection Methods
    Using the Computer
    The Matrix Approach to Multiple Regression Analysis
    Summary and Review of Terms

  • 7-11 Using Statistics

  • 7-12 The k-Variable Multiple Regression Model

  • Simple and Multiple Least-Squares Regression

  • The Estimated Regression Relationship
    The estimated regression relationship:

    ŷ = b0 + b1·x1 + b2·x2 + ... + bk·xk

    where ŷ is the predicted value of Y, the value lying on the estimated regression surface. The terms b0, b1, ..., bk are the least-squares estimates of the population regression parameters βi.

    The actual, observed value of Y is the predicted value plus an error:

    y = b0 + b1·x1 + b2·x2 + ... + bk·xk + e

  • Least-Squares Estimation: The 2-Variable Normal Equations
    Minimizing the sum of squared errors with respect to the estimated coefficients b0, b1, and b2 yields the following normal equations:

    Σy   = n·b0     + b1·Σx1    + b2·Σx2
    Σx1y = b0·Σx1   + b1·Σx1²   + b2·Σx1x2
    Σx2y = b0·Σx2   + b1·Σx1x2  + b2·Σx2²

  • Example 7-3

    Y    X1   X2
    72   12    5
    76   11    8
    78   15    6
    70   10    5
    68   11    3
    80   16    9
    82   14   12
    65    8    4
    62    8    3
    90   18   10

  • Example 7-3: Using the Computer
    Excel output:

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R           0.9803
    R Square             0.9610
    Adjusted R Square    0.9499
    Standard Error       1.9109
    Observations         10

    ANOVA
                 df   SS         MS         F         Significance F
    Regression    2   630.5381   315.2691   86.3350   0.0000117
    Residual      7    25.5619     3.6517
    Total         9   656.1

                Coefficients   Standard Error   t Stat    P-value     Lower 95%   Upper 95%
    Intercept   47.1649        2.4704           19.0919   0.0000003   41.3233     53.0065
    X1           1.5990        0.2810            5.6913   0.0007       0.9347      2.2634
    X2           1.1487        0.3052            3.7633   0.0070       0.4269      1.8705
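    The same least-squares fit can be obtained directly from the data. A minimal NumPy sketch (not part of the original slides) that should reproduce the coefficients above:

        import numpy as np

        y  = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90], dtype=float)
        x1 = np.array([12, 11, 15, 10, 11, 16, 14,  8,  8, 18], dtype=float)
        x2 = np.array([ 5,  8,  6,  5,  3,  9, 12,  4,  3, 10], dtype=float)

        X = np.column_stack([np.ones_like(y), x1, x2])   # design matrix with intercept
        b, *_ = np.linalg.lstsq(X, y, rcond=None)        # least-squares solution
        print(b)                                         # approximately [47.165, 1.599, 1.149]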

  • Decomposition of the Total Deviation in a Multiple Regression Model
    Total Deviation = Regression Deviation + Error Deviation
    SST = SSR + SSE

  • 7-13 The F Test of a Multiple Regression Model
    A statistical test for the existence of a linear relationship between Y and any or all of the independent variables X1, X2, ..., Xk:

    H0: β1 = β2 = ... = βk = 0
    H1: Not all the βi (i = 1, 2, ..., k) are 0

    Source of     Sum of     Degrees of          Mean Square         F Ratio
    Variation     Squares    Freedom
    Regression    SSR        k                   MSR = SSR/k         F = MSR/MSE
    Error         SSE        n-(k+1) = n-k-1     MSE = SSE/(n-k-1)
    Total         SST        n-1

  • Using the Computer: Analysis of Variance Table (Example 7-3)

    Analysis of Variance

    SOURCE       DF   SS       MS       F       p
    Regression    2   630.54   315.27   86.34   0.000
    Error         7    25.56     3.65
    Total         9   656.10

    The test statistic, F = 86.34, is greater than the critical point of F(2, 7) for any common level of significance (p-value ≈ 0), so the null hypothesis is rejected, and we conclude that the dependent variable is related to one or more of the independent variables.
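    A sketch of the F computation and its p-value from the table above (not part of the original slides):

        from scipy import stats

        ssr, sse, k, n = 630.54, 25.56, 2, 10
        msr, mse = ssr / k, sse / (n - k - 1)
        F = msr / mse
        p = stats.f.sf(F, k, n - k - 1)   # upper-tail probability
        print(F, p)                       # F is about 86.3; p is about 1.2e-05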

  • 7-14 How Good is the Regression

  • Decomposition of the Sum of Squares and the Adjusted Coefficient of Determination
    [Figure: SST decomposed into SSR and SSE.]
    Example 7-3: s = 1.911    R-sq = 96.1%    R-sq(adj) = 95.0%

  • Measures of Performance in Multiple Regression and the ANOVA Table

    Source of     Sum of     Degrees of          Mean Square         F Ratio
    Variation     Squares    Freedom
    Regression    SSR        k                   MSR = SSR/k         F = MSR/MSE
    Error         SSE        n-(k+1) = n-k-1     MSE = SSE/(n-k-1)
    Total         SST        n-1

  • 7-15 Tests of the Significance of Individual Regression Parameters
    Hypothesis tests about individual regression slope parameters:

    (1) H0: β1 = 0    H1: β1 ≠ 0
    (2) H0: β2 = 0    H1: β2 ≠ 0
    ...
    (k) H0: βk = 0    H1: βk ≠ 0

    Test statistic for parameter i:  t(n-(k+1)) = bi / s(bi)

  • Regression Results for Individual Parameters

    Variable   Coefficient Estimate   Standard Error   t-Statistic
    Constant    53.12                 5.43               9.783 *
    X1           2.03                 0.22               9.227 *
    X2           5.60                 1.30               4.308 *
    X3          10.35                 6.88               1.504
    X4           3.45                 2.70               1.259
    X5          -4.25                 0.38             -11.184 *

    n = 150;  t(0.025) = 1.96;  * denotes significance at the 0.05 level.
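    A sketch recomputing the t statistics in the table as estimate / standard error and flagging significance against t(0.025) = 1.96 (small differences from the table reflect rounding in the displayed estimates); not part of the original slides:

        import numpy as np

        names = ["Constant", "X1", "X2", "X3", "X4", "X5"]
        est   = np.array([53.12, 2.03, 5.60, 10.35, 3.45, -4.25])
        se    = np.array([ 5.43, 0.22, 1.30,  6.88, 2.70,  0.38])

        t = est / se
        for name, ti in zip(names, t):
            flag = "*" if abs(ti) > 1.96 else ""
            print(f"{name:8s} {ti:8.3f} {flag}")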

  • Example 7-3: Using the Computer

    MTB > regress 'Y' on 2 predictors 'X1' 'X2'

    Regression Analysis

    The regression equation is
    Y = 47.2 + 1.60 X1 + 1.15 X2

    Predictor    Coef     Stdev    t-ratio   p
    Constant     47.165   2.470    19.09     0.000
    X1           1.5990   0.2810    5.69     0.000
    X2           1.1487   0.3052    3.76     0.007

    s = 1.911    R-sq = 96.1%    R-sq(adj) = 95.0%

    Analysis of Variance

    SOURCE       DF   SS       MS       F       p
    Regression    2   630.54   315.27   86.34   0.000
    Error         7    25.56     3.65
    Total         9   656.10

    SOURCE   DF   SEQ SS
    X1        1   578.82
    X2        1    51.72

  • Using the Computer: Example 7-4

    MTB > READ a:\data\c11_t6.dat C1-C5
    MTB > NAME c1 'EXPORTS' c2 'M1' c3 'LEND' c4 'PRICE' c5 'EXCHANGE'
    MTB > REGRESS 'EXPORTS' on 4 predictors 'M1' 'LEND' 'PRICE' 'EXCHANGE'

    Regression Analysis

    The regression equation is
    EXPORTS = - 4.02 + 0.368 M1 + 0.0047 LEND + 0.0365 PRICE + 0.27 EXCHANGE

    Predictor    Coef       Stdev      t-ratio   p
    Constant     -4.015     2.766      -1.45     0.152
    M1            0.36846   0.06385     5.77     0.000
    LEND          0.00470   0.04922     0.10     0.924
    PRICE         0.036511  0.009326    3.91     0.000
    EXCHANGE      0.268     1.175       0.23     0.820

    s = 0.3358    R-sq = 82.5%    R-sq(adj) = 81.4%

    Analysis of Variance

    SOURCE       DF   SS        MS       F       p
    Regression    4   32.9463   8.2366   73.06   0.000
    Error        62    6.9898   0.1127
    Total        66   39.9361

  • Example 7-5: Three Predictors

    MTB > REGRESS 'EXPORTS' on 3 predictors 'LEND' 'PRICE' 'EXCHANGE'

    Regression Analysis

    The regression equation is
    EXPORTS = - 0.29 - 0.211 LEND + 0.0781 PRICE - 2.10 EXCHANGE

    Predictor    Coef       Stdev      t-ratio   p
    Constant     -0.289     3.308      -0.09     0.931
    LEND         -0.21140   0.03929    -5.38     0.000
    PRICE         0.078148  0.007268   10.75     0.000
    EXCHANGE     -2.095     1.355      -1.55     0.127

    s = 0.4130    R-sq = 73.1%    R-sq(adj) = 71.8%

    Analysis of Variance

    SOURCE       DF   SS        MS       F       p
    Regression    3   29.1919   9.7306   57.06   0.000
    Error        63   10.7442   0.1705
    Total        66   39.9361

  • Example 7-5: Two Predictors

    MTB > REGRESS 'EXPORTS' on 2 predictors 'M1' 'PRICE'

    Regression Analysis

    The regression equation is
    EXPORTS = - 3.42 + 0.361 M1 + 0.0370 PRICE

    Predictor    Coef       Stdev      t-ratio   p
    Constant     -3.4230    0.5409     -6.33     0.000
    M1            0.36142   0.03925     9.21     0.000
    PRICE         0.037033  0.004094    9.05     0.000

    s = 0.3306    R-sq = 82.5%    R-sq(adj) = 81.9%

    Analysis of Variance

    SOURCE       DF   SS       MS       F        p
    Regression    2   32.940   16.470   150.67   0.000
    Error        64    6.996    0.109
    Total        66   39.936

  • 7-16 Investigating the Validity of the Regression Model: Residual Plots

  • Investigating the Validity of the Regression: Residual Plots (2)

  • Histogram of Standardized Residuals: Example 7-6

  • Investigating the Validity of the Regression: Outliers and Influential Observations

  • Outliers and Influential Observations: Example 7-6

    Unusual Observations
    Obs.   M1     EXPORTS   Fit      Stdev.Fit   Residual   St.Resid
     1     5.10   2.6000    2.6420   0.1288      -0.0420    -0.14 X
     2     4.90   2.6000    2.6438   0.1234      -0.0438    -0.14 X
    25     6.20   5.5000    4.5949   0.0676       0.9051     2.80 R
    26     6.30   3.7000    4.6311   0.0651      -0.9311    -2.87 R
    50     8.30   4.3000    5.1317   0.0648      -0.8317    -2.57 R
    67     8.20   5.6000    4.9474   0.0668       0.6526     2.02 R

    R denotes an obs. with a large st. resid.
    X denotes an obs. whose X value gives it large influence.

  • 7-17 Using the Multiple Regression Model for Prediction

  • Prediction in Multiple Regression

    MTB > regress 'EXPORTS' 2 'M1' 'PRICE';
    SUBC> predict 6 160;
    SUBC> predict 5 150;
    SUBC> predict 4 130.

       Fit   Stdev.Fit        95.0% C.I.            95.0% P.I.
    4.6708      0.0853   ( 4.5003, 4.8412)   ( 3.9885, 5.3530)
    3.9390      0.0901   ( 3.7590, 4.1190)   ( 3.2543, 4.6237)
    2.8370      0.1116   ( 2.6140, 3.0599)   ( 2.1397, 3.5342)

  • 7-18 Qualitative (or Categorical) Independent Variables (in Regression)

    MOVIE   EARN   COST   PROM   BOOK
     1      28      4.2    1.0   0
     2      35      6.0    3.0   1
     3      50      5.5    6.0   1
     4      20      3.3    1.0   0
     5      75     12.5   11.0   1
     6      60      9.6    8.0   1
     7      15      2.5    0.5   0
     8      45     10.8    5.0   0
     9      50      8.4    3.0   1
    10      34      6.6    2.0   0
    11      48     10.7    1.0   1
    12      82     11.0   15.0   1
    13      24      3.5    4.0   0
    14      50      6.9   10.0   0
    15      58      7.8    9.0   1
    16      63     10.1   10.0   0
    17      30      5.0    1.0   1
    18      37      7.5    5.0   0
    19      45      6.4    8.0   1
    20      72     10.0   12.0   1

    MTB > regress 'EARN' 3 'COST' 'PROM' 'BOOK'

    Regression Analysis

    The regression equation is
    EARN = 7.84 + 2.85 COST + 2.28 PROM + 7.17 BOOK

    Predictor    Coef     Stdev    t-ratio   p
    Constant     7.836    2.333    3.36      0.004
    COST         2.8477   0.3923   7.26      0.000
    PROM         2.2782   0.2534   8.99      0.000
    BOOK         7.166    1.818    3.94      0.001

    s = 3.690    R-sq = 96.7%    R-sq(adj) = 96.0%

    Analysis of Variance

    SOURCE       DF   SS       MS       F        p
    Regression    3   6325.2   2108.4   154.89   0.000
    Error        16    217.8     13.6
    Total        19   6543.0

  • Picturing Qualitative Variables in Regression
    A multiple regression with two quantitative variables (X1 and X2) and one qualitative variable (X3):

    A regression with one quantitative variable (X1) and one qualitative variable (X2):

  • Picturing Qualitative Variables in Regression: Three Categories and Two Dummy Variables

  • Using Qualitative Variables in Regression: Example 7-6

  • Interactions between Quantitative and Qualitative Variables: Shifting Slopes
    A regression with interaction between a quantitative variable (X1) and a qualitative variable (X2): the interaction term allows the slope on X1 to differ across the categories of X2.

  • 7-19 Polynomial Regression

  • Polynomial Regression: Example 7-7

  • Polynomial Regression: Other Variables and Cross-Product Terms

  • 7-20 Nonlinear Models and Transformations: Multiplicative Model
    The multiplicative model Y = β0·X^β1·ε is linearized by taking logarithms: log Y = log β0 + β1·log X + log ε.

    MTB > loge c1 c3
    MTB > loge c2 c4
    MTB > name c3 'LOGSALE' c4 'LOGADV'
    MTB > regress 'logsale' 1 'logadv'

    Regression Analysis

    The regression equation is
    LOGSALE = 1.70 + 0.553 LOGADV

    Predictor    Coef      Stdev     t-ratio   p
    Constant     1.70082   0.05123   33.20     0.000
    LOGADV       0.55314   0.03011   18.37     0.000

    s = 0.1125    R-sq = 94.7%    R-sq(adj) = 94.4%

    Analysis of Variance

    SOURCE       DF   SS       MS       F        p
    Regression    1   4.2722   4.2722   337.56   0.000
    Error        19   0.2405   0.0127
    Total        20   4.5126
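    A sketch of the same log-log transformation in NumPy, mirroring the Minitab loge steps above. The sales/advertising arrays here are hypothetical stand-ins, since the slide's raw data columns are not reproduced in this transcript:

        import numpy as np

        adv   = np.array([1.0, 1.5, 2.2, 3.0, 4.1, 5.5, 7.4, 10.0])    # hypothetical
        sales = np.array([5.3, 6.6, 8.0, 9.4, 11.1, 13.0, 15.2, 17.9])  # hypothetical

        X = np.column_stack([np.ones_like(adv), np.log(adv)])  # regress log Y on log X
        b, *_ = np.linalg.lstsq(X, np.log(sales), rcond=None)
        print(b)   # b[0] estimates log(beta0); b[1] estimates beta1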

  • Transformations: Exponential Model

    MTB > regress 'sales' 1 'logadv'

    Regression Analysis

    The regression equation is
    SALES = 3.67 + 6.78 LOGADV

    Predictor    Coef     Stdev    t-ratio   p
    Constant     3.6683   0.4016    9.13     0.000
    LOGADV       6.7840   0.2360   28.74     0.000

    s = 0.8819    R-sq = 97.8%    R-sq(adj) = 97.6%

    Analysis of Variance

    SOURCE       DF   SS       MS       F        p
    Regression    1   642.62   642.62   826.24   0.000
    Error        19    14.78     0.78
    Total        20   657.40

  • Plots of Transformed Variables

  • Variance Stabilizing Transformations
    Square root transformation: useful when the variance of the regression errors is approximately proportional to the conditional mean of Y.
    Logarithmic transformation: useful when the variance of the regression errors is approximately proportional to the square of the conditional mean of Y.
    Reciprocal transformation: useful when the variance of the regression errors is approximately proportional to the fourth power of the conditional mean of Y.

  • Regression with Dependent Indicator Variables

  • 7.21 Multicollinearity

  • Effects of Multicollinearity
    Variances of regression coefficients are inflated.
    Magnitudes of regression coefficients may differ from what is expected.
    Signs of regression coefficients may not be as expected.
    Adding or removing variables produces large changes in coefficients.
    Removing a data point may cause large changes in coefficient estimates or signs.
    In some cases, the F ratio may be significant while the t ratios are not.

  • Detecting the Existence of Multicollinearity: Correlation Matrix of Independent Variables and Variance Inflation Factors

    MTB > CORRELATION 'M1' 'LEND' 'PRICE' 'EXCHANGE'

    Correlations (Pearson)

                M1       LEND     PRICE
    LEND       -0.112
    PRICE       0.447    0.745
    EXCHANGE   -0.410   -0.279   -0.420

    MTB > regress 'EXPORTS' on 4 predictors 'M1' 'LEND' 'PRICE' 'EXCHANGE';
    SUBC> vif.

    Regression Analysis

    The regression equation is
    EXPORTS = - 4.02 + 0.368 M1 + 0.0047 LEND + 0.0365 PRICE + 0.27 EXCHANGE

    Predictor    Coef       Stdev      t-ratio   p       VIF
    Constant     -4.015     2.766      -1.45     0.152
    M1            0.36846   0.06385     5.77     0.000   3.2
    LEND          0.00470   0.04922     0.10     0.924   5.4
    PRICE         0.036511  0.009326    3.91     0.000   6.3
    EXCHANGE      0.268     1.175       0.23     0.820   1.4

    s = 0.3358    R-sq = 82.5%    R-sq(adj) = 81.4%

  • Variance Inflation Factor
    The relationship between VIF and Rh²:

    VIF(Xh) = 1 / (1 - Rh²)

    where Rh² is the coefficient of determination from regressing Xh on the other independent variables.
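    A sketch computing VIFs by regressing each predictor column on the remaining columns (not part of the original slides; the demo data below are generated, not the EXPORTS data):

        import numpy as np

        def vif(X):
            """VIF(Xh) = 1/(1 - Rh^2) for each column of an n-by-k predictor matrix."""
            n, k = X.shape
            out = np.empty(k)
            for h in range(k):
                xh = X[:, h]
                others = np.column_stack([np.ones(n), np.delete(X, h, axis=1)])
                coef, *_ = np.linalg.lstsq(others, xh, rcond=None)
                resid = xh - others @ coef
                r2 = 1 - resid @ resid / np.sum((xh - xh.mean())**2)
                out[h] = 1.0 / (1.0 - r2)
            return out

        rng = np.random.default_rng(0)
        x1 = rng.normal(size=100)
        x2 = x1 + 0.5 * rng.normal(size=100)   # collinear with x1 by construction
        print(vif(np.column_stack([x1, x2])))  # both VIFs well above 1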

  • Solutions to the Multicollinearity Problem
    Drop a collinear variable from the regression.
    Change the sampling plan to include elements outside the multicollinearity range.
    Transform the variables.
    Use ridge regression.

  • 7-22 Residual Autocorrelation and the Durbin-Watson Test
    An autocorrelation is a correlation of the values of a variable with values of the same variable lagged one or more periods back. Consequences of autocorrelation include inaccurate estimates of variances and inaccurate predictions.

    The Durbin-Watson test statistic:

    d = Σ(e_t - e_(t-1))² / Σe_t²    (numerator over t = 2, ..., n; denominator over t = 1, ..., n)

  • Critical Points of the Durbin-Watson Statistic: α = 0.05, n = Sample Size, k = Number of Independent Variables

           k = 1         k = 2         k = 3         k = 4         k = 5
    n      dL    dU      dL    dU      dL    dU      dL    dU      dL    dU
    15     1.08  1.36    0.95  1.54    0.82  1.75    0.69  1.97    0.56  2.21
    16     1.10  1.37    0.98  1.54    0.86  1.73    0.74  1.93    0.62  2.15
    17     1.13  1.38    1.02  1.54    0.90  1.71    0.78  1.90    0.67  2.10
    18     1.16  1.39    1.05  1.53    0.93  1.69    0.82  1.87    0.71  2.06
    ...
    65     1.57  1.63    1.54  1.66    1.50  1.70    1.47  1.73    1.44  1.77
    70     1.58  1.64    1.55  1.67    1.52  1.70    1.49  1.74    1.46  1.77
    75     1.60  1.65    1.57  1.68    1.54  1.71    1.51  1.74    1.49  1.77
    80     1.61  1.66    1.59  1.69    1.56  1.72    1.53  1.74    1.51  1.77
    85     1.62  1.67    1.60  1.70    1.57  1.72    1.55  1.75    1.52  1.77
    90     1.63  1.68    1.61  1.70    1.59  1.73    1.57  1.75    1.54  1.78
    95     1.64  1.69    1.62  1.71    1.60  1.73    1.58  1.75    1.56  1.78
    100    1.65  1.69    1.63  1.72    1.61  1.74    1.59  1.76    1.57  1.78

  • Using the Durbin-Watson Statistic

    MTB > regress 'EXPORTS' 4 'M1' 'LEND' 'PRICE' 'EXCHANGE';
    SUBC> dw.

    Durbin-Watson statistic = 2.58

    Decision regions on the 0-to-4 scale: positive autocorrelation below dL; inconclusive between dL and dU; no autocorrelation between dU and 4-dU; inconclusive between 4-dU and 4-dL; negative autocorrelation above 4-dL.

    For n = 67, k = 4: dL = 1.47, dU = 1.73, 4-dU = 2.27, 4-dL = 2.53.
    Since 4-dL = 2.53 < 2.58, H0 is rejected, and we conclude there is negative first-order autocorrelation.
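    A sketch of the Durbin-Watson computation from a residual series (not part of the original slides; the residuals below are an illustrative array, not the EXPORTS residuals):

        import numpy as np

        def durbin_watson(e):
            """d = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
            return np.sum(np.diff(e)**2) / np.sum(e**2)

        e = np.array([0.3, -0.2, 0.25, -0.3, 0.2, -0.15, 0.1, -0.2])  # illustrative
        print(durbin_watson(e))   # well above 2: sign-alternating residuals suggest
                                  # negative autocorrelation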

  • 7-23 Partial F Tests and Variable Selection Methods

    Full model:     Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
    Reduced model:  Y = β0 + β1X1 + β2X2 + ε

    Partial F test:
    H0: β3 = β4 = 0
    H1: β3 and β4 are not both 0

    Partial F statistic:

    F(r, n-(k+1)) = [(SSE_R - SSE_F) / r] / MSE_F

    where SSE_R is the sum of squared errors of the reduced model, SSE_F is the sum of squared errors of the full model, MSE_F is the mean square error of the full model [MSE_F = SSE_F/(n-(k+1))], and r is the number of variables dropped from the full model.
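    As an illustration, a sketch applying this formula to the EXPORTS results above: the full four-predictor model of Example 7-4 versus the two-predictor model (M1, PRICE) of Example 7-5, i.e. testing whether LEND and EXCHANGE can be dropped (r = 2). Not part of the original slides:

        from scipy import stats

        sse_full, sse_reduced = 6.9898, 6.996   # from the two Minitab ANOVA tables
        n, k, r = 67, 4, 2
        mse_full = sse_full / (n - (k + 1))
        F = ((sse_reduced - sse_full) / r) / mse_full
        p = stats.f.sf(F, r, n - (k + 1))
        print(F, p)   # F is tiny (~0.03), so LEND and EXCHANGE add essentially nothing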

  • Variable Selection Methods
    All possible regressions: run regressions with all possible combinations of independent variables and select the best model.
    Stepwise procedures:
      Forward selection: add one variable at a time to the model, on the basis of its F statistic.
      Backward elimination: remove one variable at a time, on the basis of its F statistic.
      Stepwise regression: adds variables to and removes variables from the model, on the basis of the F statistic.

  • Stepwise Regression

  • Stepwise Regression: Using the Computer

    MTB > STEPWISE 'EXPORTS' PREDICTORS 'M1' 'LEND' 'PRICE' 'EXCHANGE'

    Stepwise Regression

    F-to-Enter: 4.00    F-to-Remove: 4.00

    Response is EXPORTS on 4 predictors, with N = 67

    Step          1         2
    Constant      0.9348   -3.4230

    M1            0.520     0.361
    T-Ratio       9.89      9.21

    PRICE                   0.0370
    T-Ratio                 9.05

    S             0.495     0.331
    R-Sq         60.08     82.48

  • Using the Computer: MINITAB

    MTB > REGRESS 'EXPORTS' 4 'M1' 'LEND' 'PRICE' 'EXCHANGE';
    SUBC> vif;
    SUBC> dw.

    Regression Analysis

    The regression equation is
    EXPORTS = - 4.02 + 0.368 M1 + 0.0047 LEND + 0.0365 PRICE + 0.27 EXCHANGE

    Predictor    Coef       Stdev      t-ratio   p       VIF
    Constant     -4.015     2.766      -1.45     0.152
    M1            0.36846   0.06385     5.77     0.000   3.2
    LEND          0.00470   0.04922     0.10     0.924   5.4
    PRICE         0.036511  0.009326    3.91     0.000   6.3
    EXCHANGE      0.268     1.175       0.23     0.820   1.4

    s = 0.3358    R-sq = 82.5%    R-sq(adj) = 81.4%

    Analysis of Variance

    SOURCE       DF   SS        MS       F       p
    Regression    4   32.9463   8.2366   73.06   0.000
    Error        62    6.9898   0.1127
    Total        66   39.9361

    Durbin-Watson statistic = 2.58

  • Using the Computer: SAS

    data exports;
       infile 'c:\aczel\data\c11_t6.dat';
       input exports m1 lend price exchange;
    proc reg data = exports;
       model exports = m1 lend price exchange / dw vif;
    run;

    Model: MODEL1
    Dependent Variable: EXPORTS

    Analysis of Variance

                        Sum of      Mean
    Source      DF      Squares     Square     F Value   Prob>F
    Model        4      32.94634    8.23658    73.059    0.0001
    Error       62       6.98978    0.11274
    C Total     66      39.93612

    Root MSE    0.33577    R-square   0.8250
    Dep Mean    4.52836    Adj R-sq   0.8137
    C.V.        7.41473

  • Using the Computer: SAS (continued)

    Parameter Estimates

                      Parameter    Standard      T for H0:
    Variable    DF    Estimate     Error         Parameter=0   Prob > |T|
    INTERCEP     1    -4.015461    2.76640057    -1.452        0.1517
    M1           1     0.368456    0.06384841     5.771        0.0001
    LEND         1     0.004702    0.04922186     0.096        0.9242
    PRICE        1     0.036511    0.00932601     3.915        0.0002
    EXCHANGE     1     0.267896    1.17544016     0.228        0.8205

                      Variance
    Variable    DF    Inflation
    INTERCEP     1    0.00000000
    M1           1    3.20719533
    LEND         1    5.35391367
    PRICE        1    6.28873181
    EXCHANGE     1    1.38570639

    Durbin-Watson D             2.583
    (For Number of Obs.)        67
    1st Order Autocorrelation   -0.321

  • The Matrix Approach to Regression Analysis (1)
    The multiple regression model in matrix form:

    y = Xβ + ε

    where y is the n×1 vector of observations on the dependent variable, X is the n×(k+1) matrix of observations on the independent variables (with a leading column of 1s for the intercept), β is the (k+1)×1 vector of regression parameters, and ε is the n×1 vector of errors.

  • The Matrix Approach to Regression Analysis (2)
    The least-squares estimators:

    b = (X'X)⁻¹X'y

    with fitted values ŷ = Xb and residuals e = y - ŷ = y - Xb.