
  • Chapter 3: Linear regression

    Yu-Tzung Chang and Hsuan-Wei Lee

    Department of Political Science, National Taiwan University

    2018.10.11


  • Outline

    • Simple linear regression
    • Multiple linear regression
    • Qualitative predictors
    • Extensions of the linear model
    • Comparison of linear regression with K-nearest neighbors

  • Linear regression

    • Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, . . . , Xp is linear.
    • True regression functions are never linear.
    • The simpler the better! Although it may seem overly simplistic and unrealistic, linear regression is extremely useful both conceptually and practically.

  • Advertising data

    [Figure: scatter plots of Sales against TV, Radio, and Newspaper advertising budgets.]

  • Linear regression for the advertising data

    Consider the advertising data in the previous slide. Some questions we should ask:

    • Is there a relationship between advertising budget and sales?
    • How strong is the relationship between advertising budget and sales?
    • Which media contribute to sales?
    • How accurately can we predict future sales?
    • Is the relationship linear?
    • Is there synergy among the advertising media?

  • Simple linear regression using a single predictor X

    • We assume a model

      Y = β0 + β1X + ε,

      where β0 and β1 are two unknown constants that represent the intercept and the slope, also known as coefficients or parameters, and ε is the error term.

    • Given some estimates β̂0 and β̂1 for the model coefficients, we predict future sales using

      ŷ = β̂0 + β̂1x,

      where ŷ indicates a prediction of Y on the basis of X = x. The hat symbol denotes an estimated value.

  • Estimation of the parameters by least squares

    • Let ŷi = β̂0 + β̂1xi be the prediction for Y based on the ith value of X. Then ei = yi − ŷi represents the ith residual.
    • We define the residual sum of squares (RSS) as

      RSS = e1² + e2² + · · · + en²,

      or equivalently as

      RSS = (y1 − β̂0 − β̂1x1)² + (y2 − β̂0 − β̂1x2)² + · · · + (yn − β̂0 − β̂1xn)².

  • Example: advertising data

    [Figure: Sales against TV, with the least squares regression line.]

    The least squares fit for the regression of sales onto TV. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient on the left of the plot.

  • Estimation of the parameters by least squares

    • The least squares approach chooses β̂0 and β̂1 to minimize the RSS. The minimizing values can be shown to be

      β̂1 = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)²,

      β̂0 = ȳ − β̂1x̄,

      where ȳ = (1/n) Σᵢ yi and x̄ = (1/n) Σᵢ xi are the sample means and the sums run over i = 1, . . . , n. (A numerical sketch of these formulas follows below.)
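
    The closed-form estimates above are straightforward to compute directly. Below is a minimal NumPy sketch on synthetic data standing in for the TV and sales variables (the actual Advertising data are not reproduced; the numbers used to generate the toy data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 300, size=n)              # stand-in for the TV budget
y = 7.0 + 0.05 * x + rng.normal(0, 3, n)     # stand-in for sales

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x            # fitted values
rss = np.sum((y - y_hat) ** 2)               # residual sum of squares
print(beta0_hat, beta1_hat, rss)
```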

  • Assessing the accuracy of the coefficient estimates

    • The standard error of an estimator reflects how it varies under repeated sampling. We have

      SE(β̂1)² = σ² / Σᵢ (xi − x̄)²

      and

      SE(β̂0)² = σ² [ 1/n + x̄² / Σᵢ (xi − x̄)² ],

      where σ² = Var(ε).

  • Confidence interval

    • These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form

      β̂1 ± 2 · SE(β̂1).

    • That is, there is approximately a 95% chance that the interval

      [β̂1 − 2 · SE(β̂1), β̂1 + 2 · SE(β̂1)]

      will contain the true value of β1 (under the scenario in which we obtained repeated samples like the present sample). A computational sketch follows below.

    • For the advertising data, the 95% confidence interval for β1 is [0.042, 0.053].
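
    A sketch of the standard errors and the approximate 95% confidence interval, on synthetic data as before; note that σ² is replaced by its estimate RSS/(n − 2), an assumption of this sketch that matches the residual standard error defined later in the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3, n)

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar
rss = np.sum((y - (beta0 + beta1 * x)) ** 2)

sigma2_hat = rss / (n - 2)                          # estimate of Var(eps)
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / sxx))

ci_beta1 = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)   # approximate 95% CI
print(se_beta0, se_beta1, ci_beta1)
```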

  • Hypothesis testing

    • Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis against the alternative hypothesis:

      H0 : There is no relationship between X and Y
      HA : There is some relationship between X and Y.

    • Mathematically, this corresponds to testing

      H0 : β1 = 0

      versus

      HA : β1 ≠ 0,

      since if β1 = 0 then the model reduces to Y = β0 + ε, and X is not associated with Y.

  • Hypothesis testing

    • To test the null hypothesis, we compute a t-statistic, given by

      t = (β̂1 − 0) / SE(β̂1).

    • This will have a t-distribution with n − 2 degrees of freedom, assuming β1 = 0.
    • Using statistical software, it is easy to compute the probability of observing any value equal to |t| or larger. We call this probability the p-value. (A sketch of this computation follows below.)
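
    A sketch of the t-test for H0 : β1 = 0; the two-sided p-value comes from scipy's t-distribution with n − 2 degrees of freedom (synthetic data again):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3, n)

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar
rss = np.sum((y - (beta0 + beta1 * x)) ** 2)
se_beta1 = np.sqrt(rss / (n - 2) / sxx)

t_stat = (beta1 - 0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # P(|T| >= |t|) under H0
print(t_stat, p_value)
```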

  • Results for the advertising data


  • RSE and RSS

    • We compute the residual standard error

      RSE = √( RSS / (n − 2) ) = √( (1/(n − 2)) Σᵢ (yi − ŷi)² ),

      where the residual sum of squares is

      RSS = Σᵢ (yi − ŷi)².

  • R²

    • R-squared, or the fraction of variance explained, is

      R² = (TSS − RSS) / TSS = 1 − RSS / TSS,

      where

      TSS = Σᵢ (yi − ȳ)²

      is the total sum of squares.

    • It can be shown that in this simple linear regression setting R² = r², where r is the correlation between X and Y:

      r = Σᵢ (xi − x̄)(yi − ȳ) / ( √Σᵢ (xi − x̄)² · √Σᵢ (yi − ȳ)² ).

      (A numerical check follows below.)
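
    A sketch computing RSE, TSS, R², and the correlation r on synthetic data, which also confirms numerically that R² = r² in the simple linear regression case:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3, n)

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y_bar) ** 2)
rse = np.sqrt(rss / (n - 2))                 # residual standard error
r2 = 1 - rss / tss                           # fraction of variance explained
r = np.sum((x - x_bar) * (y - y_bar)) / np.sqrt(
    np.sum((x - x_bar) ** 2) * np.sum((y - y_bar) ** 2))
print(rse, r2, r ** 2)                       # r2 and r**2 agree
```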

  • Results for the advertising data


  • Multiple linear regression

    • The multiple linear regression model:

      Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε.

    • We interpret βj as the average effect on Y of a one-unit increase in Xj, holding all other predictors fixed.
    • In the advertising example, the model becomes

      sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε.

  • Interpreting regression coefficients

    • The ideal scenario is when the predictors are uncorrelated (a balanced design):
      • Each coefficient can be estimated and tested separately.
      • Interpretations such as "a unit change in Xj is associated with a βj change in Y, while all the other variables stay fixed" are possible.
    • Correlations amongst predictors cause problems:
      • The variance of all coefficients tends to increase, sometimes dramatically.
      • Interpretations become hazardous – when Xj changes, everything else changes.
    • Claims of causality should be avoided for observational data.

  • The woes of (interpreting) regression coefficients

    "Data Analysis and Regression", Mosteller and Tukey, 1977.

    • A regression coefficient βj estimates the expected change in Y per unit change in Xj, with all other predictors held fixed. But predictors usually change together.
    • Example: Y = total amount of change in your pocket; X1 = # of coins; X2 = # of pennies, nickels and dimes. By itself, the regression coefficient of Y on X2 will be > 0. But how about with X1 in the model?
    • Example: Y = number of tackles by a football player in a season; W and H are his weight and height. The fitted regression model is Ŷ = b0 + 0.5W − 0.1H. How do we interpret the negative coefficient on height (β̂2 < 0)?

  • Estimation and prediction for multiple regression

    • Given estimates β̂0, β̂1, β̂2, . . . , β̂p, we can make predictions using the formula

      ŷ = β̂0 + β̂1x1 + β̂2x2 + · · · + β̂pxp.

    • We estimate β0, β1, β2, . . . , βp as the values that minimize the sum of squared residuals

      RSS = Σᵢ (yi − ŷi)² = Σᵢ (yi − β̂0 − β̂1xi1 − · · · − β̂pxip)².

    • This is done using standard statistical software. The values β̂0, β̂1, β̂2, . . . , β̂p that minimize RSS are the multiple least squares regression coefficient estimates. (A numerical sketch follows below.)
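
    A sketch of multiple least squares with three synthetic predictors standing in for TV, radio, and newspaper; np.linalg.lstsq minimizes the RSS defined above (the generating coefficients are arbitrary, not the Advertising results):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
news = rng.uniform(0, 100, n)
sales = 3.0 + 0.045 * tv + 0.19 * radio + 0.0 * news + rng.normal(0, 1.5, n)

X = np.column_stack([np.ones(n), tv, radio, news])   # design matrix with intercept
beta_hat, _, _, _ = np.linalg.lstsq(X, sales, rcond=None)
rss = np.sum((sales - X @ beta_hat) ** 2)
print(beta_hat, rss)
```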

  • Estimation and prediction for multiple regression

    [Figure: Y plotted against two predictors X1 and X2, with the least squares regression plane.]

  • Results for advertising data


  • Some important questions

    • Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?
    • Do all the predictors help to explain Y, or is only a subset of the predictors useful?
    • How well does the model fit the data?
    • Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

  • Is at least one predictor useful?

    For this question, we can use the F-statistic

      F = [ (TSS − RSS) / p ] / [ RSS / (n − p − 1) ] ∼ F(p, n − p − 1),

    i.e. the F distribution with p and n − p − 1 degrees of freedom. A computational sketch follows below.
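
    A sketch of the F-statistic on synthetic data; the tail probability from scipy's F distribution (an addition not shown on the slide) gives the corresponding p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.uniform(0, 100, size=(n, p))
y = 3.0 + X @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, n)

D = np.column_stack([np.ones(n), X])                 # design matrix with intercept
beta_hat, _, _, _ = np.linalg.lstsq(D, y, rcond=None)
rss = np.sum((y - D @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)                # P(F_{p, n-p-1} > F)
print(F, p_value)
```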

  • Deciding on the important variables

    • The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.
    • However, this is too computationally expensive: there are 2ᵖ regression models to fit if this method is used. Note that 2⁴⁰ ≈ 10¹².

  • Forward selection

    • Begin with the null model – a model that contains an intercept but no predictors.
    • Fit p simple linear regressions and add to the null model the variable that results in the lowest RSS.
    • Add to that model the variable that results in the lowest RSS amongst all two-variable models.
    • Continue until some stopping rule is met, for example when all remaining variables have a p-value above some threshold. (A sketch of the procedure follows below.)
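
    A minimal sketch of forward selection by RSS. Real applications would add a stopping rule (a p-value or AIC/BIC criterion); here the number of variables to add is simply fixed, which is an assumption of the sketch:

```python
import numpy as np

def rss_of_subset(X, y, cols):
    """Least squares fit on the chosen columns (plus an intercept); returns the RSS."""
    D = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, _, _, _ = np.linalg.lstsq(D, y, rcond=None)
    return np.sum((y - D @ beta) ** 2)

def forward_selection(X, y, max_vars):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        # add the candidate variable that gives the lowest RSS at this step
        best = min(remaining, key=lambda j: rss_of_subset(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 + 3.0 * X[:, 1] - 1.5 * X[:, 3] + rng.normal(0, 1, 200)
print(forward_selection(X, y, max_vars=2))   # expected to pick columns 1 and 3
```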

  • Backward selection

    • Start with all variables in the model.
    • Remove the variable with the largest p-value – that is, the variable that is the least statistically significant.
    • The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed.
    • Continue until a stopping rule is reached. For example, we may stop when all remaining variables have a significant p-value as defined by some significance threshold.

  • Qualitative predictors

    • Some predictors are not quantitative but are qualitative, taking a discrete set of values.
    • These are also called categorical predictors or factor variables.

  • Credit card data

    [Figure: scatterplot matrix of the Credit data – Balance, Age, Cards, Education, Income, Limit, and Rating.]

  • Qualitative predictors – continued

    • Example: investigate differences in credit card balance between males and females, ignoring the other variables. We create a new dummy variable

      xi = { 1, if the ith person is female
             0, if the ith person is male.

    • Resulting model:

      yi = β0 + β1xi + εi = { β0 + β1 + εi, if the ith person is female
                              β0 + εi, if the ith person is male.

      (A fitting sketch follows below.)
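
    A sketch of the dummy-variable model, with a synthetic 0/1 indicator playing the role of the gender variable in the Credit data (the generated balances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
female = rng.integers(0, 2, size=n)                   # dummy: 1 = female, 0 = male
balance = 510 + 20 * female + rng.normal(0, 50, n)    # synthetic balances

D = np.column_stack([np.ones(n), female])
beta_hat, _, _, _ = np.linalg.lstsq(D, balance, rcond=None)
# beta_hat[0] estimates the mean balance for males (the baseline);
# beta_hat[0] + beta_hat[1] estimates the mean balance for females
print(beta_hat)
```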

  • Results for credit card data


  • Qualitative predictors with more than two levels

    • With more than two levels, we create additional dummy variables. For example, for the ethnicity variable we create two dummy variables. The first is

      xi1 = { 1, if the ith person is Asian
              0, if the ith person is not Asian,

      and the second is

      xi2 = { 1, if the ith person is Caucasian
              0, if the ith person is not Caucasian.

  • Qualitative predictors with more than two levels – continued

    • There will always be one fewer dummy variable than the number of levels. The level with no dummy variable is known as the baseline.
    • From the last example, both of the variables can be used in the regression equation, in order to obtain the model

      yi = β0 + β1xi1 + β2xi2 + εi
         = { β0 + β1 + εi, if the ith person is Asian
             β0 + β2 + εi, if the ith person is Caucasian
             β0 + εi, if the ith person is African American.

  • Results for ethnicity


  • Extensions of the linear model

    Removing the additive assumption: interactions and nonlinearity.

    Interactions:

    • In our analysis of the advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media.
    • For example, the linear model

      sales = β0 + β1 × TV + β2 × radio + β3 × newspaper

      states that the average effect on sales of a one-unit increase in TV is always β1, regardless of the amount spent on radio.

  • Interactions – continued

    • But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.
    • In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.
    • In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect.

  • Interaction in the advertising data

    [Figure: Sales plotted as a function of TV and Radio budgets.]

    • When levels of either TV or radio are low, then the true sales are lower than predicted by the linear model.
    • But when advertising is split between the two media, then the model tends to underestimate sales.

  • Modeling interactions – advertising data

    The model takes the form (a fitting sketch in code follows below):

      sales = β0 + β1 × TV + β2 × radio + β3 × (radio × TV) + ε
            = β0 + (β1 + β3 × radio) × TV + β2 × radio + ε
            = β0 + (β2 + β3 × TV) × radio + β1 × TV + ε.

    Results:
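
    The results table is not reproduced above. As a stand-in, here is a sketch of fitting the interaction model on synthetic data: the product radio × TV is simply added as an extra column of the design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
sales = 6.0 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rng.normal(0, 1, n)

D = np.column_stack([np.ones(n), tv, radio, tv * radio])   # interaction column
b0, b1, b2, b3 = np.linalg.lstsq(D, sales, rcond=None)[0]
# the slope on TV now depends on radio: d(sales)/d(TV) = b1 + b3 * radio
print(b0, b1, b2, b3)
```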

  • Interpretation

    • The results in the table suggest that interactions are important.
    • The p-value for the interaction term TV × radio is extremely low, indicating that there is strong evidence for HA : β3 ≠ 0.
    • The R² for the interaction model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.

  • Interpretation – continued

    • This means that (96.8 − 89.7)/(100 − 89.7) = 69% of the variability in sales that remains after fitting the additive model can be explained by the interaction term.
    • The coefficient estimates in the table suggest that an increase in TV advertising of $1,000 is associated with increased sales of

      (β̂1 + β̂3 × radio) × 1000 = 19 + 1.1 × radio units.

    • An increase in radio advertising of $1,000 is associated with increased sales of

      (β̂2 + β̂3 × TV) × 1000 = 29 + 1.1 × TV units.

  • Hierarchy

    • Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.
    • The hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

  • Hierarchy – continued

    • The rationale for this principle is that interactions are hard to interpret in a model without main effects – their meaning is changed.
    • Specifically, if the model has no main effect terms, the interaction terms end up absorbing the main effects.

  • Interactions between qualitative and quantitative variables

    Consider the Credit data set, and suppose that we wish to predict balance using income (quantitative) and student (qualitative). Without an interaction term, the model takes the form

      balancei ≈ β0 + β1 × incomei + { β2, if the ith person is a student
                                       0, if the ith person is not a student

                = β1 × incomei + { β0 + β2, if the ith person is a student
                                   β0, if the ith person is not a student.

  • Interactions between qualitative and quantitative variables – continued

    With an interaction term, the model takes the form (sketched in code below)

      balancei ≈ β0 + β1 × incomei + { β2 + β3 × incomei, if student
                                       0, if not student

                = { (β0 + β2) + (β1 + β3) × incomei, if student
                    β0 + β1 × incomei, if not student.
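
    A sketch of the qualitative × quantitative interaction: the student dummy enters both as a main effect and multiplied by income (synthetic stand-in for the Credit data; the generating numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
income = rng.uniform(10, 150, n)                       # in thousands of dollars
student = rng.integers(0, 2, size=n)                   # dummy: 1 = student
balance = (200 + 6 * income + 380 * student
           - 2 * student * income + rng.normal(0, 40, n))

D = np.column_stack([np.ones(n), income, student, student * income])
b0, b1, b2, b3 = np.linalg.lstsq(D, balance, rcond=None)[0]
# non-students: intercept b0, slope b1;  students: intercept b0 + b2, slope b1 + b3
print(b0, b1, b2, b3)
```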

  • Interactions between qualitative and quantitative variables – continued

    [Figure: Balance plotted against Income for students and non-students.]

    Left: no interaction between income and student.
    Right: with an interaction term between income and student.

  • Nonlinear effects of predictors

    Polynomial regression on the Auto data.

    [Figure: Miles per gallon against Horsepower, with linear, degree-2, and degree-5 polynomial fits.]

  • Results for the auto data

    The figure suggests that

      mpg = β0 + β1 × horsepower + β2 × horsepower² + ε

    may provide a better fit. (A fitting sketch follows below.)
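
    A sketch of the quadratic fit: horsepower² is just another column of the design matrix, so the model remains linear in the coefficients (synthetic stand-in for the Auto data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
hp = rng.uniform(50, 230, n)                              # stand-in for horsepower
mpg = 57 - 0.45 * hp + 0.0012 * hp ** 2 + rng.normal(0, 4, n)

D = np.column_stack([np.ones(n), hp, hp ** 2])            # linear and quadratic terms
beta_hat, _, _, _ = np.linalg.lstsq(D, mpg, rcond=None)
mpg_hat = D @ beta_hat
r2 = 1 - np.sum((mpg - mpg_hat) ** 2) / np.sum((mpg - mpg.mean()) ** 2)
print(beta_hat, r2)
```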

  • KNN regression

    • KNN regression is similar to the KNN classifier.
    • To predict Y for a given value of X, consider the K closest points to X in the training data and take the average of their responses, i.e.

      f̂(x) = (1/K) Σ_{xi ∈ N(x)} yi,

      where N(x) denotes the set of K training points closest to x.

    • If K is small, KNN is much more flexible than linear regression. (A sketch follows below.)
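
    A minimal one-dimensional KNN regression sketch. In practice one would likely use scikit-learn's KNeighborsRegressor, but the averaging rule above is simple enough to write directly:

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Average the responses of the k training points closest to each query point."""
    preds = []
    for x0 in np.atleast_1d(x_query):
        dist = np.abs(x_train - x0)              # distances in one dimension
        nearest = np.argsort(dist)[:k]           # indices of the k nearest neighbours
        preds.append(y_train[nearest].mean())
    return np.array(preds)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 100)
y_train = 2 + x_train + rng.normal(0, 0.3, 100)

grid = np.linspace(-1, 1, 5)
print(knn_predict(x_train, y_train, grid, k=1))  # very flexible, follows the data closely
print(knn_predict(x_train, y_train, grid, k=9))  # smoother fit
```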

  • KNN fits for k = 1 and k = 9

    [Figure: KNN regression fits on a two-dimensional predictor space (x1, x2) for K = 1 and K = 9.]

  • KNN fits in one dimension for K = 1 and K = 9

    [Figure: one-dimensional KNN regression fits, y against x, for K = 1 and K = 9.]

  • KNN versus linear regression

    [Figure: y plotted against x with fitted curves, and test mean squared error plotted against 1/K, comparing KNN regression with linear regression.]

  • KNN is NOT good in high-dimensional situations

    [Figure: mean squared error against 1/K for KNN regression with p = 1, 2, 3, 4, 10, and 20 predictors.]