
  • Chapter 3: Linear regression

    Yu-Tzung Chang and Hsuan-Wei Lee

    Department of Political Science, National Taiwan University

    2018.10.11


  • Outline

    • Simple linear regression
    • Multiple linear regression
    • Qualitative predictors
    • Extensions of the linear model
    • Comparison of linear regression with K-nearest neighbors

  • Linear regression

    • Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, . . . , Xp is linear.
    • True regression functions are never linear.
    • The simpler the better! Although it may seem overly simplistic and unrealistic, linear regression is extremely useful both conceptually and practically.

  • Advertising data

    [Figure: scatter plots of Sales against TV, Radio, and Newspaper advertising budgets.]

  • Linear regression for the advertising data

    Consider the advertising data in the previous slide. Some questions we should ask:

    • Is there a relationship between advertising budget and sales?
    • How strong is the relationship between advertising budget and sales?
    • Which media contribute to sales?
    • How accurately can we predict future sales?
    • Is the relationship linear?
    • Is there synergy among the advertising media?

  • Simple linear regression using a single predictor X

    • We assume a model

      Y = β0 + β1X + ε,

      where β0 and β1 are two unknown constants that represent the intercept and the slope, also known as coefficients or parameters, and ε is the error term.

    • Given some estimates β̂0 and β̂1 for the model coefficients, we predict future sales using

      ŷ = β̂0 + β̂1x,

      where ŷ indicates a prediction of Y on the basis of X = x. The hat symbol denotes an estimated value.

  • Estimation of the parameters by least squares

    • Let ŷi = β̂0 + β̂1xi be the prediction for Y based on the ith value of X. Then ei = yi − ŷi represents the ith residual.
    • We define the residual sum of squares (RSS) as

      RSS = e1² + e2² + · · · + en²,

      or equivalently as

      RSS = (y1 − β̂0 − β̂1x1)² + (y2 − β̂0 − β̂1x2)² + · · · + (yn − β̂0 − β̂1xn)².

  • Example: advertising data

    [Figure: Sales against TV, with the least squares regression line.]

    The least squares fit for the regression of sales onto TV. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient on the left of the plot.

  • Estimation of the parameters by least squares

    • The least squares approach chooses β̂0 and β̂1 to minimize the RSS. The minimizing values can be shown to be

      β̂1 = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)²,

      β̂0 = ȳ − β̂1x̄,

      where ȳ = (1/n) Σᵢ yi and x̄ = (1/n) Σᵢ xi are the sample means and the sums run over i = 1, . . . , n. (A numerical sketch of these formulas follows below.)
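
    The closed-form estimates above are straightforward to compute directly. Below is a minimal NumPy sketch on synthetic data standing in for the TV and sales variables (the actual Advertising data are not reproduced; the numbers used to generate the toy data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 300, size=n)              # stand-in for the TV budget
y = 7.0 + 0.05 * x + rng.normal(0, 3, n)     # stand-in for sales

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x            # fitted values
rss = np.sum((y - y_hat) ** 2)               # residual sum of squares
print(beta0_hat, beta1_hat, rss)
```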

  • Assessing the accuracy of the coefficient estimates

    • The standard error of an estimator reflects how it varies under repeated sampling. We have

      SE(β̂1)² = σ² / Σᵢ (xi − x̄)²

      and

      SE(β̂0)² = σ² [ 1/n + x̄² / Σᵢ (xi − x̄)² ],

      where σ² = Var(ε).

  • Confidence interval

    • These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form

      β̂1 ± 2 · SE(β̂1).

    • That is, there is approximately a 95% chance that the interval

      [β̂1 − 2 · SE(β̂1), β̂1 + 2 · SE(β̂1)]

      will contain the true value of β1 (under the scenario in which we obtained repeated samples like the present sample). A computational sketch follows below.

    • For the advertising data, the 95% confidence interval for β1 is [0.042, 0.053].
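
    A sketch of the standard errors and the approximate 95% confidence interval, on synthetic data as before; note that σ² is replaced by its estimate RSS/(n − 2), an assumption of this sketch that matches the residual standard error defined later in the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3, n)

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar
rss = np.sum((y - (beta0 + beta1 * x)) ** 2)

sigma2_hat = rss / (n - 2)                          # estimate of Var(eps)
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / sxx))

ci_beta1 = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)   # approximate 95% CI
print(se_beta0, se_beta1, ci_beta1)
```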

  • Hypothesis testing

    • Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis against the alternative hypothesis:

      H0 : There is no relationship between X and Y
      HA : There is some relationship between X and Y.

    • Mathematically, this corresponds to testing

      H0 : β1 = 0

      versus

      HA : β1 ≠ 0,

      since if β1 = 0 then the model reduces to Y = β0 + ε, and X is not associated with Y.

  • Hypothesis testing

    • To test the null hypothesis, we compute a t-statistic, given by

      t = (β̂1 − 0) / SE(β̂1).

    • This will have a t-distribution with n − 2 degrees of freedom, assuming β1 = 0.
    • Using statistical software, it is easy to compute the probability of observing any value equal to |t| or larger. We call this probability the p-value. (A sketch of this computation follows below.)
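
    A sketch of the t-test for H0 : β1 = 0; the two-sided p-value comes from scipy's t-distribution with n − 2 degrees of freedom (synthetic data again):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3, n)

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar
rss = np.sum((y - (beta0 + beta1 * x)) ** 2)
se_beta1 = np.sqrt(rss / (n - 2) / sxx)

t_stat = (beta1 - 0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # P(|T| >= |t|) under H0
print(t_stat, p_value)
```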

  • Results for the advertising data


  • RSE and RSS

    • We compute the residual standard error

      RSE = √( RSS / (n − 2) ) = √( (1/(n − 2)) Σᵢ (yi − ŷi)² ),

      where the residual sum of squares is

      RSS = Σᵢ (yi − ŷi)².

  • R²

    • R-squared, or the fraction of variance explained, is

      R² = (TSS − RSS) / TSS = 1 − RSS / TSS,

      where

      TSS = Σᵢ (yi − ȳ)²

      is the total sum of squares.

    • It can be shown that in this simple linear regression setting R² = r², where r is the correlation between X and Y:

      r = Σᵢ (xi − x̄)(yi − ȳ) / ( √Σᵢ (xi − x̄)² · √Σᵢ (yi − ȳ)² ).

      (A numerical check follows below.)
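
    A sketch computing RSE, TSS, R², and the correlation r on synthetic data, which also confirms numerically that R² = r² in the simple linear regression case:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3, n)

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y_bar) ** 2)
rse = np.sqrt(rss / (n - 2))                 # residual standard error
r2 = 1 - rss / tss                           # fraction of variance explained
r = np.sum((x - x_bar) * (y - y_bar)) / np.sqrt(
    np.sum((x - x_bar) ** 2) * np.sum((y - y_bar) ** 2))
print(rse, r2, r ** 2)                       # r2 and r**2 agree
```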

  • Results for the advertising data


  • Multiple linear regression

    • The multiple linear regression model:

      Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε.

    • We interpret βj as the average effect on Y of a one-unit increase in Xj, holding all other predictors fixed.
    • In the advertising example, the model becomes

      sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε.

  • Interpreting regression coefficients

    • The ideal scenario is when the predictors are uncorrelated (a balanced design):
      • Each coefficient can be estimated and tested separately.
      • Interpretations such as "a unit change in Xj is associated with a βj change in Y, while all the other variables stay fixed" are possible.
    • Correlations amongst predictors cause problems:
      • The variance of all coefficients tends to increase, sometimes dramatically.
      • Interpretations become hazardous – when Xj changes, everything else changes.
    • Claims of causality should be avoided for observational data.

  • The woes of (interpreting) regression coefficients

    "Data Analysis and Regression", Mosteller and Tukey, 1977.

    • A regression coefficient βj estimates the expected change in Y per unit change in Xj, with all other predictors held fixed. But predictors usually change together.
    • Example: Y = total amount of change in your pocket; X1 = # of coins; X2 = # of pennies, nickels and dimes. By itself, the regression coefficient of Y on X2 will be > 0. But how about with X1 in the model?
    • Example: Y = number of tackles by a football player in a season; W and H are his weight and height. The fitted regression model is Ŷ = b0 + 0.5W − 0.1H. How do we interpret the negative coefficient on height (β̂2 < 0)?

  • Estimation and prediction for multiple regression

    • Given estimates β̂0, β̂1, β̂2, . . . , β̂p, we can make predictions using the formula

      ŷ = β̂0 + β̂1x1 + β̂2x2 + · · · + β̂pxp.

    • We estimate β0, β1, β2, . . . , βp as the values that minimize the sum of squared residuals

      RSS = Σᵢ (yi − ŷi)² = Σᵢ (yi − β̂0 − β̂1xi1 − · · · − β̂pxip)².

    • This is done using standard statistical software. The values β̂0, β̂1, β̂2, . . . , β̂p that minimize RSS are the multiple least squares regression coefficient estimates. (A numerical sketch follows below.)
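
    A sketch of multiple least squares with three synthetic predictors standing in for TV, radio, and newspaper; np.linalg.lstsq minimizes the RSS defined above (the generating coefficients are arbitrary, not the Advertising results):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
news = rng.uniform(0, 100, n)
sales = 3.0 + 0.045 * tv + 0.19 * radio + 0.0 * news + rng.normal(0, 1.5, n)

X = np.column_stack([np.ones(n), tv, radio, news])   # design matrix with intercept
beta_hat, _, _, _ = np.linalg.lstsq(X, sales, rcond=None)
rss = np.sum((sales - X @ beta_hat) ** 2)
print(beta_hat, rss)
```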

  • Estimation and prediction for multiple regression

    [Figure: Y plotted against two predictors X1 and X2, with the least squares regression plane.]

  • Results for advertising data


  • Some important questions

    • Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?
    • Do all the predictors help to explain Y, or is only a subset of the predictors useful?
    • How well does the model fit the data?
    • Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

  • Is at least one predictor useful?

    For this question, we can use the F-statistic

      F = [ (TSS − RSS) / p ] / [ RSS / (n − p − 1) ] ∼ F(p, n − p − 1),

    i.e. the F distribution with p and n − p − 1 degrees of freedom. A computational sketch follows below.
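
    A sketch of the F-statistic on synthetic data; the tail probability from scipy's F distribution (an addition not shown on the slide) gives the corresponding p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.uniform(0, 100, size=(n, p))
y = 3.0 + X @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, n)

D = np.column_stack([np.ones(n), X])                 # design matrix with intercept
beta_hat, _, _, _ = np.linalg.lstsq(D, y, rcond=None)
rss = np.sum((y - D @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)                # P(F_{p, n-p-1} > F)
print(F, p_value)
```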

  • Deciding on the important variables

    • The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.
    • However, this is too computationally expensive: there are 2ᵖ regression models to fit if this method is used. Note that 2⁴⁰ ≈ 10¹².

  • Forward selection

    • Begin with the null model – a model that contains an intercept but no predictors.
    • Fit p simple linear regressions and add to the null model the variable that results in the lowest RSS.
    • Add to that model the variable that results in the lowest RSS amongst all two-variable models.
    • Continue until some stopping rule is met, for example when all remaining variables have a p-value above some threshold. (A sketch of the procedure follows below.)
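
    A minimal sketch of forward selection by RSS. Real applications would add a stopping rule (a p-value or AIC/BIC criterion); here the number of variables to add is simply fixed, which is an assumption of the sketch:

```python
import numpy as np

def rss_of_subset(X, y, cols):
    """Least squares fit on the chosen columns (plus an intercept); returns the RSS."""
    D = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, _, _, _ = np.linalg.lstsq(D, y, rcond=None)
    return np.sum((y - D @ beta) ** 2)

def forward_selection(X, y, max_vars):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        # add the candidate variable that gives the lowest RSS at this step
        best = min(remaining, key=lambda j: rss_of_subset(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 + 3.0 * X[:, 1] - 1.5 * X[:, 3] + rng.normal(0, 1, 200)
print(forward_selection(X, y, max_vars=2))   # expected to pick columns 1 and 3
```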

  • Backward selection

    • Start with all variables in the model.
    • Remove the variable with the largest p-value – that is, the variable that is the least statistically significant.
    • The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed.
    • Continue until a stopping rule is reached. For example, we may stop when all remaining variables have a significant p-value as defined by some significance threshold.

  • Qualitative predictors

    • Some predictors are not quantitative but are qualitative, taking a discrete set of values.
    • These are also called categorical predictors or factor variables.

  • Credit card data

    [Figure: scatterplot matrix of the Credit data – Balance, Age, Cards, Education, Income, Limit, and Rating.]

  • Qualitative predictors – continued

    • Example: investigate differences in credit card balance between males and females, ignoring the other variables. We create a new dummy variable

      xi = { 1, if the ith person is female
             0, if the ith person is male.

    • Resulting model:

      yi = β0 + β1xi + εi = { β0 + β1 + εi, if the ith person is female
                              β0 + εi, if the ith person is male.

      (A fitting sketch follows below.)
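
    A sketch of the dummy-variable model, with a synthetic 0/1 indicator playing the role of the gender variable in the Credit data (the generated balances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
female = rng.integers(0, 2, size=n)                   # dummy: 1 = female, 0 = male
balance = 510 + 20 * female + rng.normal(0, 50, n)    # synthetic balances

D = np.column_stack([np.ones(n), female])
beta_hat, _, _, _ = np.linalg.lstsq(D, balance, rcond=None)
# beta_hat[0] estimates the mean balance for males (the baseline);
# beta_hat[0] + beta_hat[1] estimates the mean balance for females
print(beta_hat)
```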

  • Results for credit card data


  • Qualitative predictors with more than two levels

    • With more than two levels, we create additional dummy variables. For example, for the ethnicity variable we create two dummy variables. The first is

      xi1 = { 1, if the ith person is Asian
              0, if the ith person is not Asian,

      and the second is

      xi2 = { 1, if the ith person is Caucasian
              0, if the ith person is not Caucasian.

  • Qualitative predictors with more than two levels – continued

    • There will always be one fewer dummy variable than the number of levels. The level with no dummy variable is known as the baseline.
    • From the last example, both of the variables can be used in the regression equation, in order to obtain the model

      yi = β0 + β1xi1 + β2xi2 + εi
         = { β0 + β1 + εi, if the ith person is Asian
             β0 + β2 + εi, if the ith person is Caucasian
             β0 + εi, if the ith person is African American.

  • Results for ethnicity


  • Extensions of the linear model

    Removing the additive assumption: interactions and nonlinearity.

    Interactions:

    • In our analysis of the advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media.
    • For example, the linear model

      sales = β0 + β1 × TV + β2 × radio + β3 × newspaper

      states that the average effect on sales of a one-unit increase in TV is always β1, regardless of the amount spent on radio.

  • Interactions – continued

    • But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.
    • In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.
    • In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect.

  • Interaction in the advertising data

    [Figure: Sales plotted as a function of TV and Radio budgets.]

    • When levels of either TV or radio are low, then the true sales are lower than predicted by the linear model.
    • But when advertising is split between the two media, then the model tends to underestimate sales.

  • Modeling interactions – advertising data

    The model takes the form (a fitting sketch in code follows below):

      sales = β0 + β1 × TV + β2 × radio + β3 × (radio × TV) + ε
            = β0 + (β1 + β3 × radio) × TV + β2 × radio + ε
            = β0 + (β2 + β3 × TV) × radio + β1 × TV + ε.

    Results:
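
    The results table is not reproduced above. As a stand-in, here is a sketch of fitting the interaction model on synthetic data: the product radio × TV is simply added as an extra column of the design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
sales = 6.0 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rng.normal(0, 1, n)

D = np.column_stack([np.ones(n), tv, radio, tv * radio])   # interaction column
b0, b1, b2, b3 = np.linalg.lstsq(D, sales, rcond=None)[0]
# the slope on TV now depends on radio: d(sales)/d(TV) = b1 + b3 * radio
print(b0, b1, b2, b3)
```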

  • Interpretation

    • The results in the table suggest that interactions are important.
    • The p-value for the interaction term TV × radio is extremely low, indicating that there is strong evidence for HA : β3 ≠ 0.
    • The R² for the interaction model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.

  • Interpretation – continued

    • This means that (96.8 − 89.7)/(100 − 89.7) = 69% of the variability in sales that remains after fitting the additive model can be explained by the interaction term.
    • The coefficient estimates in the table suggest that an increase in TV advertising of $1,000 is associated with increased sales of

      (β̂1 + β̂3 × radio) × 1000 = 19 + 1.1 × radio units.

    • An increase in radio advertising of $1,000 is associated with increased sales of

      (β̂2 + β̂3 × TV) × 1000 = 29 + 1.1 × TV units.

  • Hierarchy

    • Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.
    • The hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

  • Hierarchy – continued

    • The rationale for this principle is that interactions are hard to interpret in a model without main effects – their meaning is changed.
    • Specifically, if the model has no main effect terms, the interaction terms end up absorbing the main effects.

  • Interactions between qualitative and quantitative variables

    Consider the Credit data set, and suppose that we wish to predict balance using income (quantitative) and student (qualitative). Without an interaction term, the model takes the form

      balancei ≈ β0 + β1 × incomei + { β2, if the ith person is a student
                                       0, if the ith person is not a student

                = β1 × incomei + { β0 + β2, if the ith person is a student
                                   β0, if the ith person is not a student.

  • Interactions between qualitative and quantitative variables – continued

    With an interaction term, the model takes the form (sketched in code below)

      balancei ≈ β0 + β1 × incomei + { β2 + β3 × incomei, if student
                                       0, if not student

                = { (β0 + β2) + (β1 + β3) × incomei, if student
                    β0 + β1 × incomei, if not student.
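
    A sketch of the qualitative × quantitative interaction: the student dummy enters both as a main effect and multiplied by income (synthetic stand-in for the Credit data; the generating numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
income = rng.uniform(10, 150, n)                       # in thousands of dollars
student = rng.integers(0, 2, size=n)                   # dummy: 1 = student
balance = (200 + 6 * income + 380 * student
           - 2 * student * income + rng.normal(0, 40, n))

D = np.column_stack([np.ones(n), income, student, student * income])
b0, b1, b2, b3 = np.linalg.lstsq(D, balance, rcond=None)[0]
# non-students: intercept b0, slope b1;  students: intercept b0 + b2, slope b1 + b3
print(b0, b1, b2, b3)
```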

  • Interactions between qualitative and quantitative variables – continued

    [Figure: Balance plotted against Income for students and non-students.]

    Left: no interaction between income and student.
    Right: with an interaction term between income and student.

  • Nonlinear effects of predictors

    Polynomial regression on the Auto data.

    [Figure: Miles per gallon against Horsepower, with linear, degree-2, and degree-5 polynomial fits.]

  • Results for the auto data

    The figure suggests that

      mpg = β0 + β1 × horsepower + β2 × horsepower² + ε

    may provide a better fit. (A fitting sketch follows below.)
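
    A sketch of the quadratic fit: horsepower² is just another column of the design matrix, so the model remains linear in the coefficients (synthetic stand-in for the Auto data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
hp = rng.uniform(50, 230, n)                              # stand-in for horsepower
mpg = 57 - 0.45 * hp + 0.0012 * hp ** 2 + rng.normal(0, 4, n)

D = np.column_stack([np.ones(n), hp, hp ** 2])            # linear and quadratic terms
beta_hat, _, _, _ = np.linalg.lstsq(D, mpg, rcond=None)
mpg_hat = D @ beta_hat
r2 = 1 - np.sum((mpg - mpg_hat) ** 2) / np.sum((mpg - mpg.mean()) ** 2)
print(beta_hat, r2)
```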

  • KNN regression

    • KNN regression is similar to the KNN classifier.
    • To predict Y for a given value of X, consider the K closest points to X in the training data and take the average of their responses, i.e.

      f̂(x) = (1/K) Σ_{xi ∈ N(x)} yi,

      where N(x) denotes the set of K training points closest to x.

    • If K is small, KNN is much more flexible than linear regression. (A sketch follows below.)
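
    A minimal one-dimensional KNN regression sketch. In practice one would likely use scikit-learn's KNeighborsRegressor, but the averaging rule above is simple enough to write directly:

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Average the responses of the k training points closest to each query point."""
    preds = []
    for x0 in np.atleast_1d(x_query):
        dist = np.abs(x_train - x0)              # distances in one dimension
        nearest = np.argsort(dist)[:k]           # indices of the k nearest neighbours
        preds.append(y_train[nearest].mean())
    return np.array(preds)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 100)
y_train = 2 + x_train + rng.normal(0, 0.3, 100)

grid = np.linspace(-1, 1, 5)
print(knn_predict(x_train, y_train, grid, k=1))  # very flexible, follows the data closely
print(knn_predict(x_train, y_train, grid, k=9))  # smoother fit
```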

  • KNN fits for k = 1 and k = 9

    [Figure: KNN regression fits on a two-dimensional predictor space (x1, x2) for K = 1 and K = 9.]

  • KNN fits in one dimension for K = 1 and K = 9

    [Figure: one-dimensional KNN regression fits, y against x, for K = 1 and K = 9.]

  • KNN versus linear regression

    [Figure: y plotted against x with fitted curves, and test mean squared error plotted against 1/K, comparing KNN regression with linear regression.]

  • KNN is NOT good in high-dimensional situations

    [Figure: mean squared error against 1/K for KNN regression with p = 1, 2, 3, 4, 10, and 20 predictors.]