Chapter 13 Multiple Regression and Model Building


  • Chapter 13

    Multiple Regression and Model Building

  • Multiple Regression Models

    The General Multiple Regression Model

    y = β0 + β1x1 + β2x2 + . . . + βkxk + ε

    where

    y is the dependent variable

    x1, x2, . . . , xk are the independent variables

    E(y) = β0 + β1x1 + β2x2 + . . . + βkxk is the deterministic portion of the model

    βi determines the contribution of the independent variable xi

  • Multiple Regression Models

    Analyzing a Multiple Regression Model

    1. Hypothesize the deterministic component of the model

    2. Use sample data to estimate β0,β1,β2,… βk

    3. Specify probability distribution of ε and estimate σ

    4. Check that assumptions on ε are satisfied

    5. Statistically evaluate model usefulness

    6. If the model is deemed useful, use it for prediction, estimation, and other

    purposes

  • The First-Order Model: Estimating

    and Interpreting the β-Parameters

    For a first-order model such as

    E(y) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5

    the chosen fitted model

    ŷ = β̂0 + β̂1x1 + . . . + β̂kxk

    minimizes

    SSE = Σ(y - ŷ)²
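    As a quick illustration (not from the slides), here is a minimal NumPy sketch of the least-squares computation: for the made-up data below, the β-estimates returned are exactly the values that minimize SSE.

```python
import numpy as np

# Made-up data: n = 6 observations, k = 2 independent variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.0, 7.2, 7.9, 11.1, 11.8])

# Design matrix with a leading column of 1s for the intercept beta_0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates: the beta-hats that minimize SSE = sum (y - y_hat)^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat
sse = np.sum((y - y_hat) ** 2)
print("beta-hats:", beta_hat)
print("SSE:", sse)
```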

  • The First-Order Model: Estimating

    and Interpreting the β-Parameters

    y = β0 + β1x1 + β2x2 + β3x3 + ε

    where

    y = Sales price (dollars)

    x1 = Appraised land value (dollars)

    x2 = Appraised improvements (dollars)

    x3 = Area (square feet)

  • The First-Order Model: Estimating

    and Interpreting the β-Parameters

    Plot of data for sample size n=20

  • The First-Order Model: Estimating

    and Interpreting the β-Parameters

    Fit model to data
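    The Excel regression output for this slide is not reproduced in the transcript. As a rough sketch of the same kind of fit in code (assuming statsmodels, which the slides do not use, and a hypothetical stand-in for the n = 20 property-sales sample, whose actual values are not in the transcript):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the n = 20 property-sales sample on the slide.
rng = np.random.default_rng(0)
n = 20
df = pd.DataFrame({
    "x1": rng.uniform(5_000, 50_000, n),    # appraised land value ($)
    "x2": rng.uniform(20_000, 150_000, n),  # appraised improvements ($)
    "x3": rng.uniform(800, 3_000, n),       # area (sq. ft.)
})
df["y"] = 1_000 + 0.8 * df.x1 + 0.8 * df.x2 + 14 * df.x3 + rng.normal(0, 5_000, n)

# First-order model: E(y) = beta0 + beta1*x1 + beta2*x2 + beta3*x3
fit = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
print(fit.summary())
```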

  • The First-Order Model: Estimating

    and Interpreting the β-Parameters

    Interpret β estimates

    β̂1 = .8145

    E(y), the mean sale price of the property, is

    estimated to increase .8145 dollars for every $1

    increase in appraised land value, holding other

    variables constant

    β̂2 = .8204

    E(y), the mean sale price of the property, is

    estimated to increase .8204 dollars for every $1

    increase in appraised improvements, holding other

    variables constant

    β̂3 = 13.53

    E(y), the mean sale price of the property, is

    estimated to increase 13.53 dollars for every additional

    square foot of living area, holding other variables

    constant

  • The First-Order Model: Estimating

    and Interpreting the β-Parameters

    Given the model E(y) = 1 + 2x1 + x2, the effect

    of x2 on E(y), holding x1 constant, is an

    increase of 1 unit in E(y) for every 1-unit increase in x2

  • Model Assumptions

    Assumptions about Random Error ε

    1. For any given set of values of x1, x2,…..xk, the random

    error has a normal probability distribution with mean 0

    and variance σ2

    2. The random errors are independent

    Estimator of σ2 for a Multiple Regression Model

    with k Independent Variables

    s2 = SSE / [n - Number of estimated β parameters] = SSE / [n - (k+1)]
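    A small sketch of this estimator, assuming statsmodels and synthetic data: the mse_resid statsmodels reports is the same quantity s2 = SSE / [n - (k+1)].

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with k = 2 independent variables.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=30), "x2": rng.normal(size=30)})
df["y"] = 1 + 2 * df.x1 - 3 * df.x2 + rng.normal(scale=1.5, size=30)

fit = smf.ols("y ~ x1 + x2", data=df).fit()

n, k = int(fit.nobs), 2
sse = fit.ssr                       # SSE = sum of squared residuals
s2 = sse / (n - (k + 1))            # s^2 = SSE / (n - number of estimated betas)
print(s2, fit.mse_resid)            # the two should agree
print("s =", np.sqrt(s2))           # estimate of sigma
```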

  • Inferences about the β-Parameters

    Two types of inferences can be made, using

    either confidence intervals or hypothesis

    testing

    For any inferences to be made, the

    assumptions made about the random error

    term ε (normal distribution with mean 0 and

    variance σ2, independence of errors) must

    be met

  • Inferences about the β-Parameters

    A 100(1-α)% Confidence Interval for a β-Parameter

    β̂i ± tα/2 sβ̂i

    where tα/2 is based on n-(k+1) degrees of freedom and

    n = Number of observations

    k+1 = Number of parameters in the model
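    A sketch of this interval on synthetic data (statsmodels and SciPy assumed): conf_int() returns the 100(1-α)% intervals, and the same interval is rebuilt by hand for β1 from β̂1 ± tα/2 sβ̂1.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=25), "x2": rng.normal(size=25)})
df["y"] = 4 + 1.5 * df.x1 + 0.5 * df.x2 + rng.normal(size=25)

fit = smf.ols("y ~ x1 + x2", data=df).fit()

# 95% CI computed by statsmodels ...
print(fit.conf_int(alpha=0.05))

# ... and the same interval built by hand for beta1: beta1_hat +/- t_{alpha/2} * s_{beta1_hat}
t_crit = stats.t.ppf(0.975, df=fit.df_resid)     # df = n - (k + 1)
lo = fit.params["x1"] - t_crit * fit.bse["x1"]
hi = fit.params["x1"] + t_crit * fit.bse["x1"]
print(lo, hi)
```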

  • Inferences about the β-Parameters

    A Test of an Individual Parameter Coefficient

    One-Tailed Test

    H0: βi = 0

    Ha: βi < 0 (or Ha: βi > 0)

    Rejection region: t < -tα

    (or t > tα when Ha: βi > 0)

    Two-Tailed Test

    H0: βi = 0

    Ha: βi ≠ 0

    Rejection region: |t| > tα/2

    Test statistic: t = β̂i / sβ̂i

    where tα and tα/2 are based on n-(k+1) degrees of freedom
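    A sketch of the individual t-tests on synthetic data (statsmodels assumed): t = β̂i / sβ̂i computed by hand matches the reported t-values, and the two-tailed p-values drive the reject/do-not-reject decision.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=40), "x2": rng.normal(size=40)})
df["y"] = 2 + 3 * df.x1 + rng.normal(size=40)   # x2 deliberately has no effect

fit = smf.ols("y ~ x1 + x2", data=df).fit()

# Test statistic t = beta_hat_i / s_{beta_hat_i}, df = n - (k + 1).
t_by_hand = fit.params / fit.bse
print(t_by_hand)          # matches fit.tvalues
print(fit.pvalues)        # two-tailed p-values; reject H0: beta_i = 0 when p < alpha
```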

  • Inferences about the β-Parameters

    An Excel Analysis (regression output not reproduced here)

    Use the confidence-limit columns for confidence intervals

    Use the t statistics and p-values for hypotheses about parameter coefficients

  • Checking the Overall Utility of a

    Model

    3 tests:

    1. Multiple coefficient of determination R2

    2. Adjusted multiple coefficient of determination Ra2

    3. Global F-test

    R2 = 1 - SSE/SSyy = (SSyy - SSE)/SSyy = Explained variability / Total variability

    Ra2 = 1 - [(n-1)/(n-(k+1))] (SSE/SSyy) = 1 - [(n-1)/(n-(k+1))] (1 - R2)

    Test statistic: F = [(SSyy - SSE)/k] / [SSE/(n-(k+1))] = (R2/k) / [(1-R2)/(n-(k+1))]
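    A sketch, on synthetic data and assuming statsmodels, showing that the R2 and Ra2 formulas above reproduce the values statsmodels reports (SSE and SSyy taken from the fitted model).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n, k = 30, 2
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 2 * df.x1 + 0.5 * df.x2 + rng.normal(size=n)

fit = smf.ols("y ~ x1 + x2", data=df).fit()

sse = fit.ssr                 # SSE
ss_yy = fit.centered_tss      # SS_yy = total sum of squares
r2 = 1 - sse / ss_yy
r2_adj = 1 - (n - 1) / (n - (k + 1)) * (1 - r2)
print(r2, fit.rsquared)       # should agree
print(r2_adj, fit.rsquared_adj)
```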

  • Checking the Overall Utility of a

    Model

    Testing Global Usefulness of the Model: The

    Analysis of Variance F-test

    H0: β1 = β2 = . . . = βk = 0

    Ha: At least one βi ≠ 0

    Test statistic: F = Mean Square (Model) / Mean Square (Error)
                      = [(SSyy - SSE)/k] / [SSE/(n-(k+1))] = (R2/k) / [(1-R2)/(n-(k+1))]

    where n is the sample size and k is the number of terms in the model

    Rejection region: F > Fα, with k numerator degrees of freedom and

    [n-(k+1)] denominator degrees of freedom
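    A sketch of the global F-test on synthetic data (statsmodels assumed): F built from R2 via the formula above matches the reported fvalue, and f_pvalue is the p-value for H0: β1 = . . . = βk = 0.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n, k = 30, 3
df = pd.DataFrame({"x1": rng.normal(size=n),
                   "x2": rng.normal(size=n),
                   "x3": rng.normal(size=n)})
df["y"] = 1 + 2 * df.x1 - df.x2 + 0.5 * df.x3 + rng.normal(size=n)

fit = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# Global F: H0: beta1 = beta2 = beta3 = 0 vs Ha: at least one beta_i != 0
r2 = fit.rsquared
f_by_hand = (r2 / k) / ((1 - r2) / (n - (k + 1)))
print(f_by_hand, fit.fvalue)     # should agree
print(fit.f_pvalue)              # reject H0 when the p-value is below alpha
```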

  • Checking the Overall Utility of a

    Model

    Checking the Utility of a Multiple Regression Model

    1. Conduct a test of overall model adequacy

    using the F-test. If H0 is rejected, proceed to

    step 2

    2. Conduct t-tests on β parameters of particular

    interest

  • Using the Model for Estimation and

    Prediction

    As in Simple Linear Regression, a prediction interval

    for an individual y value will be wider than a confidence

    interval for the estimated mean E(y)

    Most statistics packages will print out both

    confidence and prediction intervals
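    A sketch, assuming statsmodels and made-up data, of how both intervals can be obtained for a new observation: the mean_ci columns form the confidence interval for E(y), and the obs_ci columns form the wider prediction interval for an individual y.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
df = pd.DataFrame({"x1": rng.uniform(0, 10, 25), "x2": rng.uniform(0, 5, 25)})
df["y"] = 3 + 2 * df.x1 + df.x2 + rng.normal(size=25)

fit = smf.ols("y ~ x1 + x2", data=df).fit()

new = pd.DataFrame({"x1": [5.0], "x2": [2.5]})
frame = fit.get_prediction(new).summary_frame(alpha=0.05)

# mean_ci_* is the (narrower) confidence interval for E(y);
# obs_ci_* is the (wider) prediction interval for an individual y.
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```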

  • Model Building: Interaction Models

    An Interaction Model relating E(y) to Two Quantitative Independent Variables

    E(y) = β0 + β1x1 + β2x2 + β3x1x2

    where

    (β1 + β3x2) represents the change in E(y) for every 1-unit increase in x1, holding x2 fixed

    (β2 + β3x1) represents the change in E(y) for every 1-unit increase in x2, holding x1 fixed
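    A sketch of fitting an interaction model with statsmodels on synthetic data: the x1:x2 term adds β3x1x2, and the slope of E(y) in x1, (β1 + β3x2), is evaluated at two x2 values to show that it changes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({"x1": rng.uniform(0, 10, 40), "x2": rng.uniform(0, 10, 40)})
df["y"] = 1 + 2 * df.x1 + 3 * df.x2 + 0.5 * df.x1 * df.x2 + rng.normal(size=40)

# x1:x2 adds the cross-product term beta3 * x1 * x2 to the first-order model.
fit = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
print(fit.params)

# Slope of E(y) in x1 at a fixed x2 is beta1 + beta3 * x2 (it depends on x2).
b = fit.params
for x2 in (2.0, 8.0):
    print("slope in x1 at x2 =", x2, ":", b["x1"] + b["x1:x2"] * x2)
```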

  • Model Building: Interaction Models

    When the linear relationship between y

    and xi is not impacted by a second x,

    there is no interaction

    When the linear relationship

    between y and xi depends on

    another x, there is interaction

  • Model Building: Interaction Models

  • Model Building: Quadratic and

    other Higher-Order Models

    A Quadratic (Second-Order) Model

    E(y) = β0 + β1x + β2x²

    where

    β0 is the y-intercept of the curve

    β1 is a shift parameter

    β2 is the rate of curvature

  • Model Building: Quadratic and

    other Higher-Order Models

    Home Size-Electrical Usage Data

    Size of Home, x (sq. ft.)    Monthly Usage, y (kilowatt-hours)
    1,290                        1,182
    1,350                        1,172
    1,470                        1,264
    1,600                        1,493
    1,710                        1,571
    1,840                        1,711
    1,980                        1,804
    2,230                        1,840
    2,400                        1,956
    2,930                        1,954

  • Model Building: Quadratic and

    other Higher-Order Models

    ŷ = -1,216.1 + 2.3989x - .00045x²
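    A sketch of refitting this quadratic model with statsmodels (an assumption; the slide used a statistics package output), using the data transcribed from the table above; the estimates should land close to the printed equation, with small differences due to rounding.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Home size-electrical usage data transcribed from the table above.
df = pd.DataFrame({
    "size":  [1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930],
    "usage": [1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956, 1954],
})

# Quadratic (second-order) model: E(y) = beta0 + beta1*x + beta2*x^2
fit = smf.ols("usage ~ size + I(size ** 2)", data=df).fit()
print(fit.params)     # should be close to the fitted equation shown above
print(fit.rsquared)
```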

  • Model Building: Quadratic and

    other Higher-Order Models

    A Complete Second-Order Model with Two

    Quantitative Independent Variables

    E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²

    where

    β0 is the y-intercept, the value of E(y) when x1 = x2 = 0

    β1, β2: changes in these cause the surface to shift along the x1 and x2 axes

    β3 controls the rotation of the surface

    β4, β5 control the type of surface and the rates of curvature

  • Model Building: Quadratic and

    other Higher-Order Models

  • Model Building: Qualitative

    (Dummy) Variable Models

    Dummy variables – coded, qualitative variables

    • Codes take the form (1, 0): 1 indicates the presence of a condition, 0 its absence

    • Create dummy variables so that there is one less dummy variable than there are categories of the qualitative variable of interest

    Gender dummy variable coded

    as x = 1 if male, x = 0 if female

    If the model is E(y) = β0 + β1x,

    β1 captures the effect of being

    male on the dependent variable
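    A sketch of a dummy-variable fit with statsmodels on made-up data: C(gender) builds the (1, 0) coding automatically, and the coefficient reported for the male level estimates β1, the male-versus-female difference in mean response.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 50
gender = rng.choice(["male", "female"], size=n)
x = np.where(gender == "male", 1, 0)            # dummy: 1 = male, 0 = female
y = 60 + 5 * x + rng.normal(scale=2, size=n)    # made-up response

df = pd.DataFrame({"y": y, "gender": gender})

# E(y) = beta0 + beta1*x: beta0 is the mean for the 0 (female) level and
# beta1 is the male-minus-female difference in means.
fit = smf.ols("y ~ C(gender)", data=df).fit()   # C() builds the dummy coding
print(fit.params)
```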

  • Model Building: Models with both

    Quantitative and Qualitative Variables

    Start with a first-order model with one quantitative

    variable, E(y) = β0 + β1x1

    Adding a qualitative variable (coded with dummy

    variables x2 and x3) with no interaction,

    E(y) = β0 + β1x1 + β2x2 + β3x3

  • Model Building: Models with both

    Quantitative and Qualitative Variables

    Adding interaction terms,

    E(y) = β0 + β1x1 + β2x2 + β3x3 + β4x1x2 + β5x1x3

    Main effect x1; main effects x2 and x3; interaction terms x1x2 and x1x3

  • Model Building: Comparing Nested

    Models

    Models are nested if one model contains all

    the terms of the other model and at least

    one additional term.

    Complete (full) model – the more complex

    model

    Reduced model – the simpler model

  • Model Building: Comparing Nested

    Models

    Example:

    Complete (full) model:

    E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²

    Reduced model:

    E(y) = β0 + β1x1 + β2x2 + β3x1x2

  • Model Building: Comparing Nested

    Models

    F-Test for Comparing Nested Models

    Reduced model: E(y) = β0 + β1x1 + . . . + βgxg

    Complete model: E(y) = β0 + β1x1 + . . . + βgxg + βg+1xg+1 + . . . + βkxk

    H0: βg+1 = βg+2 = . . . = βk = 0

    Ha: At least one β under test is nonzero

    Test statistic: F = [(SSER - SSEC)/(k - g)] / [SSEC/(n-(k+1))]
                      = [(SSER - SSEC)/(number of β's tested in H0)] / MSEC

    Rejection region: F > Fα, with k-g numerator degrees of freedom and

    [n-(k+1)] denominator degrees of freedom
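    A sketch of the nested-model F-test on synthetic data (statsmodels assumed): compare_f_test computes the F statistic above for the complete versus reduced model, along with its p-value and the k - g numerator degrees of freedom.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 60
df = pd.DataFrame({"x1": rng.uniform(-2, 2, n), "x2": rng.uniform(-2, 2, n)})
df["y"] = (1 + 2 * df.x1 + 3 * df.x2 + 0.8 * df.x1 * df.x2
           + 0.5 * df.x1**2 + rng.normal(size=n))

# Reduced model (interaction) is nested in the complete second-order model.
reduced  = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
complete = smf.ols("y ~ x1 + x2 + x1:x2 + I(x1**2) + I(x2**2)", data=df).fit()

# F-test of H0: beta4 = beta5 = 0 (the extra second-order terms).
f_stat, p_value, df_diff = complete.compare_f_test(reduced)
print(f_stat, p_value, df_diff)
```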

  • Model Building: Stepwise

    Regression

    Used when there is a large set of independent

    variables

    Software packages add variables to the model in

    order of explanatory value

    Decisions are based on the largest t-values at each

    step

    The procedure is best used as a screening

    procedure only
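    The slides do not give code for this; as a rough sketch of the idea, here is a hypothetical forward-selection pass that, at each step, adds the candidate variable with the largest |t| (smallest p-value) and stops when no remaining variable is significant.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n = 80
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(1, 6)])
df["y"] = 2 + 3 * df.x1 - 2 * df.x3 + rng.normal(size=n)

def forward_stepwise(df, response, candidates, enter_alpha=0.05):
    """Greedy forward selection: at each step add the candidate whose
    t-test p-value is smallest (largest |t|), if it is below enter_alpha."""
    selected = []
    remaining = list(candidates)
    while remaining:
        pvals = {}
        for var in remaining:
            formula = f"{response} ~ " + " + ".join(selected + [var])
            fit = smf.ols(formula, data=df).fit()
            pvals[var] = fit.pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= enter_alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_stepwise(df, "y", [f"x{i}" for i in range(1, 6)]))
```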

  • Residual Analysis: Checking the

    Regression Assumptions

    Regression Residual – the difference between an observed y value

    and its corresponding predicted value: y - ŷ

    Properties of Regression Residuals

    • The mean of the residuals equals zero

    • The standard deviation of the residuals is equal to the

      standard deviation of the fitted regression model

  • Residual Analysis: Checking the

    Regression Assumptions

    Analyzing Residuals

    The top plot of residuals reveals a

    non-random, curved pattern

    The second plot, based on a

    second-order term being

    added to the model, shows a

    random pattern, indicating a better

    model

  • Residual Analysis: Checking the

    Regression Assumptions

    Identifying Outliers

    Residual plots can reveal outliers

    Outliers need to be checked to try

    to determine if an error is involved

    If an error is involved, or the observation

    is not representative, the analysis can

    be rerun after deleting the data point

    to assess its effect

  • Residual Analysis: Checking the

    Regression Assumptions

    Checking for Normal Errors

    Residual plots with the outlier and without the outlier (not reproduced here)

  • Residual Analysis: Checking the

    Regression Assumptions

    Checking for Equal Variances

    A pattern in the residuals indicates a violation of the equal

    variance assumption

    This can point to the use of a transformation on the

    dependent variable to stabilize the variance

  • Residual Analysis: Checking the

    Regression Assumptions

    Steps in Residual Analysis

    1. Check for misspecified model by plotting

    residuals against quantitative independent

    variables

    2. Examine residual plots for outliers

    3. Check for non-normal error using frequency

    distribution of residuals

    4. Check for unequal error variances using plots

    of residuals against predicted values
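    A sketch of these checks with matplotlib and statsmodels on synthetic data (a deliberately misspecified straight-line fit to curved data): residuals are plotted against an independent variable, against the fitted values, and as a histogram.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
df = pd.DataFrame({"x": rng.uniform(0, 10, 60)})
df["y"] = 1 + 2 * df.x + 0.4 * df.x**2 + rng.normal(size=60)   # true curvature

fit = smf.ols("y ~ x", data=df).fit()    # deliberately misspecified (no x^2 term)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(df.x, fit.resid)                   # curvature suggests a missing higher-order term
axes[0].set(xlabel="x", ylabel="residual", title="Residuals vs x")
axes[1].scatter(fit.fittedvalues, fit.resid)       # look for a funnel shape (unequal variances)
axes[1].set(xlabel="fitted value", ylabel="residual", title="Residuals vs fitted")
axes[2].hist(fit.resid, bins=15)                   # look for strong skew (non-normal errors)
axes[2].set(xlabel="residual", title="Histogram of residuals")
plt.tight_layout()
plt.show()
```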

  • Some Pitfalls: Estimability,

    Multicollinearity, and Extrapolation

    Estimability – the number of levels of

    observed x-values must be one more than

    the order of the polynomial in x that you

    want to fit

    Multicollinearity – when two or more

    independent variables are correlated

  • Some Pitfalls: Estimability,

    Multicollinearity, and Extrapolation

    Multicollinearity – when two or more independent variables are correlated

    Leads to confusing, misleading results, incorrect parameter estimate signs.

    Can be identified by

    – checking correlations among the x’s

    – non-significant t-tests for most or all of the x’s

    – signs opposite from expected in the estimated β parameters

    Can be addressed by

    – dropping one or more of the correlated variables from the model

    – restricting inferences to the range of the sample data, and not making inferences about individual β parameters based on t-tests
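    A sketch of the correlation check plus one common follow-up, variance inflation factors (VIFs are not mentioned on the slide, so they are an assumption here), using statsmodels on synthetic data with two nearly collinear x's.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(12)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pairwise correlations among the x's ...
print(X.corr().round(2))

# ... and variance inflation factors (a VIF far above 10 is a common red flag).
exog = sm.add_constant(X)
for i, name in enumerate(exog.columns):
    if name != "const":
        print(name, variance_inflation_factor(exog.values, i))
```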

  • Some Pitfalls: Estimability,

    Multicollinearity, and Extrapolation

    Extrapolation – using the model to predict

    outside the range of the sample data is

    dangerous

    Correlated Errors – most common when

    working with time series data, values of y

    and x’s observed over a period of time.

    Solution is to develop a time series model.
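    The slide stops at the recommendation to build a time series model; as one common diagnostic for this situation (not named on the slide, so an assumption here), the Durbin-Watson statistic can flag correlated errors in a regression fit to time-ordered data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(13)
n = 120
t = np.arange(n)

# Errors built as an AR(1) series, so consecutive errors are correlated.
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.8 * e[i - 1] + rng.normal()
df = pd.DataFrame({"t": t, "y": 5 + 0.3 * t + e})

fit = smf.ols("y ~ t", data=df).fit()

# Durbin-Watson near 2 suggests uncorrelated errors; values well below 2
# point to positive autocorrelation, i.e. a time series model may be needed.
print(durbin_watson(fit.resid))
```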