Section 3.1: Understanding Multiple Linear Regression
Jared S. Murray, The University of Texas at Austin
McCombs School of Business
The Multiple Linear Regression Model
Many problems involve more than one independent variable or factor influencing or predicting the dependent or response variable.
I More than size to predict house price!
I Demand for a product given prices of competing brands, advertising, household attributes, etc.
We might also want to study relationships between two or more
variables, controlling for other variables.
In SLR, the conditional mean of Y depends on a covariate X. The
Multiple Linear Regression (MLR) model extends this idea to
include more than one independent variable.
Understanding Multiple Linear Regression
The Sales Data:
I Sales: units sold in excess of a baseline
I P1: our price in $ (in excess of a baseline price)
I P2: competitor's price (again, over a baseline)
Understanding Multiple Regression
I If we regress Sales on our own price alone, we obtain a
surprising conclusion... the higher the price the more we sell!!
The Sales Data
In this data we have weekly observations on:
sales: # units (in excess of base level)
p1 = our price: $ (in excess of base)
p2 = competitor's price: $ (in excess of base)

p1        p2        Sales
5.13567   5.2042    144.49
3.49546   8.0597    637.25
7.27534   11.6760   620.79
4.66282   8.3644    549.01
...       ...       ...

(each row corresponds to a week)

If we regress Sales on own price, we obtain the somewhat surprising conclusion that a higher price is associated with more sales!!

[Regression plot of Sales vs. p1: the fitted line has a positive slope.]
Sales = 211.165 + 63.7130 p1
S = 223.401   R-Sq = 19.6%   R-Sq(adj) = 18.8%
I It looks like we should just raise our prices, right? Why not?
I We can use MLR to help...
The MLR Model
Same as always, but with more covariates.
Y = β0 + β1X1 + β2X2 + · · ·+ βpXp + ε
Recall the key assumptions of our linear regression model:
(i) The conditional mean of Y is linear in the Xj variables.
(ii) We also make some assumptions about the error terms (deviations from the line):
I independent from each other
I identically distributed (i.e., they have constant variance)
If we also assume they are normally distributed, then
(Y | X1, . . . , Xp) ∼ N(β0 + β1X1 + · · · + βpXp, σ²)
The MLR Model
Our interpretation of regression coefficients can be extended from
the simple single covariate regression case:
βj = ∂E[Y | X1, . . . , Xp] / ∂Xj
Holding all other variables constant, βj is the
average change in Y per unit change in Xj .
(Just consider the change in predicted Y moving from
Xj → Xj + 1 while leaving the other X’s alone)
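To see this numerically with the sales example: a minimal R sketch (it assumes the price_sales data frame used in the R examples later in this section). Moving p1 from 5 to 6 while holding p2 fixed changes the prediction by exactly the estimated coefficient on p1.

fit = lm(Sales ~ p1 + p2, data = price_sales)
# predictions at p1 = 5 and p1 = 6, holding p2 fixed at 8
pred = predict(fit, newdata = data.frame(p1 = c(5, 6), p2 = c(8, 8)))
diff(pred) # equals coef(fit)["p1"], the estimated coefficient on p1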
The MLR Model
If p = 2, we can plot the regression surface in 3D.
Consider sales of a product as predicted by the price of this product (P1) and the price of a competing product (P2).
Sales = β0 + β1P1 + β2P2 + ε
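As a sketch of what this surface looks like (again assuming the price_sales data frame used in the examples below), the fitted plane can be drawn with base R's persp():

fit = lm(Sales ~ p1 + p2, data = price_sales)
p1_grid = seq(min(price_sales$p1), max(price_sales$p1), length.out = 30)
p2_grid = seq(min(price_sales$p2), max(price_sales$p2), length.out = 30)
# evaluate the fitted plane on a grid and draw it
z = outer(p1_grid, p2_grid, function(a, b) predict(fit, data.frame(p1 = a, p2 = b)))
persp(p1_grid, p2_grid, z, xlab = "P1", ylab = "P2", zlab = "Sales", theta = 30, phi = 20)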
Understanding Multiple Regression
I The regression equation for Sales on own price (P1) is:
Sales = 211 + 63.7P1
I If now we add the competitors price to the regression we get
Sales = 116 − 97.7 P1 + 109 P2
I Does this look better? How did it happen?
I Remember: −97.7 is the effect on sales of a change in P1
with P2 held fixed!!
Understanding Multiple Regression
I How can we see what is going on? Let's compare Sales in two different weeks (82 and 99) where our competitor's price P2 was exactly the same.
I In this instance an increase in P1, holding P2 constant,
corresponds to a drop in Sales!
[Figure: scatterplots of p1 vs. p2 and Sales vs. p1, with weeks 82 and 99 marked. P2 is the same in both weeks; the week with the higher p1 has the lower Sales.]

I Note the strong relationship (dependence) between P1 and P2!
Understanding Multiple Regression
I Let’s look at a subset of points where P1 varies and P2 is
held approximately constant...
[Figure: Sales vs. p1 scatterplot (with the corresponding p1 vs. p2 panel), highlighting a subset of points where p1 varies and p2 is held approximately constant. For each fixed level of p2 there is a negative relationship between Sales and p1, while larger p1 values are associated with larger p2.]
I For a fixed level of P2, variation in P1 is negatively correlated
with Sales!!
Understanding Multiple Regression
I Below, different colors indicate different ranges for P2...
[Figure: the same Sales vs. p1 scatterplot, now with points colored by ranges of p2. Within each color (a roughly fixed level of p2) the relationship between Sales and p1 is negative, even though larger p1 values go with larger p2 overall.]
Understanding Multiple Regression
I Summary:
1. A larger P1 is associated with larger P2 and the overall effect
leads to bigger sales
2. With P2 held fixed, a larger P1 leads to lower sales
3. MLR does the trick and unveils the “correct” economic
relationship between Sales and prices!
Details: Parameter Estimation
Y = β0 + β1X1 + · · · + βpXp + ε,   ε ∼ N(0, σ²)
How do we estimate the MLR model parameters?
The principle of Least Squares is exactly the same as before:
I Define the fitted values
I Find the best fitting plane by minimizing the sum of squared
residuals.
Then we can use the least squares estimates to find s...
Details: Least Squares
Just as before, each bi is our estimate of βi
Fitted Values: Ŷi = b0 + b1X1i + b2X2i + · · · + bpXpi.
Residuals: ei = Yi − Ŷi.
Least Squares: Find b0, b1, b2, . . . , bp to minimize the sum of squared residuals, ∑ᵢ eᵢ² (summing over i = 1, . . . , n).
In MLR the formulas for the bj ’s are too complicated so we won’t
talk about them...
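For the curious: software solves a small system of linear equations (the "normal equations") to get them. A minimal sketch, assuming the price_sales data frame used below, that reproduces what lm() returns:

X = cbind(1, price_sales$p1, price_sales$p2) # design matrix: intercept, p1, p2
y = price_sales$Sales
b = solve(t(X) %*% X, t(X) %*% y) # the b's that minimize the sum of squared residuals
b # same values as coef(lm(Sales ~ p1 + p2, data = price_sales))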
Details: Residual Standard Error
The calculation for s² is similar to SLR:

s² = ∑ᵢ eᵢ² / (n − p − 1) = ∑ᵢ (Yi − Ŷi)² / (n − p − 1)

I Ŷi = b0 + b1X1i + · · · + bpXpi
I The denominator above is n − (p + 1), since we estimated p + 1 b's.
I The residual "standard error" is the estimate of the standard deviation of ε, i.e.,
σ̂ = s = √s².
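A quick check of this formula in R, using the price_sales fit from the next slide:

fit = lm(Sales ~ p1 + p2, data = price_sales)
n = nrow(price_sales); p = 2
sqrt(sum(resid(fit)^2) / (n - p - 1)) # matches sigma(fit), about 28.4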
Example: Price/Sales Data
Model: Salesi = β0 + β1P1i + β2P2i + εi ,   εi ∼ N(0, σ²)
fit = lm(Sales~p1+p2, data=price_sales)
print(fit)
##
## Call:
## lm(formula = Sales ~ p1 + p2, data = price_sales)
##
## Coefficients:
## (Intercept) p1 p2
## 115.72 -97.66 108.80
b0 = β̂0 = 115.72, b1 = β̂1 = −97.66, b2 = β̂2 = 108.80.
print(sigma(fit)) # sigma(fit) extracts s from an lm fit
## [1] 28.41801
s = σ̂ = 28.42
Confidence Intervals for Individual Coefficients
As in SLR, the sampling distribution tells us how far we can expect
bj to be from βj
The LS estimators are unbiased: E[bj] = βj for j = 0, . . . , p.
Confidence Intervals for Individual Coefficients
I According to the Central Limit Theorem, the sampling
distribution of each coefficient’s estimator is approximately
bj ∼ N(βj, s²bj)
I The approximation is best when the residuals have constant
variance, are approximately normally distributed, and the
sample size is large. If in doubt we can also bootstrap.
I Computing confidence intervals is exactly the same as in SLR:
Use the percentile bootstrap, estimate standard errors using
the bootstrap, or use the CLT (default R output)
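A minimal percentile-bootstrap sketch for these coefficients (assuming the price_sales data frame from the sales example): resample rows with replacement, refit, and take empirical quantiles.

set.seed(1)
boot_coefs = replicate(2000, {
  idx = sample(nrow(price_sales), replace = TRUE)
  coef(lm(Sales ~ p1 + p2, data = price_sales[idx, ]))
})
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975)) # 95% percentile intervals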
In R...
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.717 8.548 13.54 <2e-16 ***
## p1 -97.657 2.669 -36.59 <2e-16 ***
## p2 108.800 1.409 77.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.42 on 97 degrees of freedom
## Multiple R-squared: 0.9871,Adjusted R-squared: 0.9869
## F-statistic: 3717 on 2 and 97 DF, p-value: < 2.2e-16
95% C.I. for β1 ≈ b1 ± 2 × sb1
[−97.66 − 2 × 2.67, −97.66 + 2 × 2.67] = [−103.00, −92.32]
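In R, confint() computes these intervals directly from the fitted model; it uses the exact t quantile rather than 2, so its numbers differ slightly from the hand calculation above.

confint(fit, level = 0.95) # rows: (Intercept), p1, p2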
Confidence Intervals for Individual Coefficients
Remember: Intervals (and later hypothesis tests) for bj are
one-at-a-time procedures:
I You are evaluating the j th coefficient conditional on the other
X ’s being in the model
I This is the change in expected value of Y for a one-unit
increase in Xj , holding all the other X’s constant
I There’s no such thing as “THE effect of Xj” in regression
models – we can only understand bj in the context of the
other variables in the model
Hold my beer: Interpreting coefficients in context
Beer Data (from an MBA class)
I nbeer – number of beers before getting drunk
I height and weight
[Minitab output and scatterplots for the beer data:]

The regression equation is
nbeer = - 36.9 + 0.643 height

Predictor   Coef      StDev     T       P
Constant    -36.920   8.956     -4.12   0.000
height      0.6430    0.1296    4.96    0.000

Is nbeer related to height? Yes, very clearly.

[Scatterplots of nbeer vs. height and of height vs. weight.]

The correlations:
           nbeer    weight
weight     0.692
height     0.582    0.806

The regression equation is
nbeer = - 11.2 + 0.078 height + 0.0853 weight

Predictor   Coef      StDev      T       P
Constant    -11.19    10.77      -1.04   0.304
height      0.0775    0.1960     0.40    0.694
weight      0.08530   0.02381    3.58    0.001

S = 2.784   R-Sq = 48.1%   R-Sq(adj) = 45.9%

Is nbeer related to height once weight is in the model? No, not at all. The two x's are highly correlated!!
Is number of beers related to height?
nbeers = β0 + β1height + ε
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9200 8.9560 -4.122 0.000148 ***
## height 0.6430 0.1296 4.960 9.23e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.109 on 48 degrees of freedom
## Multiple R-squared: 0.3389,Adjusted R-squared: 0.3251
## F-statistic: 24.6 on 1 and 48 DF, p-value: 9.23e-06
Yes! Beers and height are related...
What happens after accounting for weight?
nbeers = β0 + β1weight + β2height + ε
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.18709 10.76821 -1.039 0.304167
## height 0.07751 0.19598 0.396 0.694254
## weight 0.08530 0.02381 3.582 0.000806 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.784 on 47 degrees of freedom
## Multiple R-squared: 0.4807,Adjusted R-squared: 0.4586
## F-statistic: 21.75 on 2 and 47 DF, p-value: 2.056e-07
What about now?? Height is not necessarily a factor...
Watch out for lurking variables/confounders
I If we regress “beers” only on height we see an effect. Bigger
heights → more beers, on average.
I However, when height goes up weight tends to go up as well...
in the first regression, height was a proxy for the real cause of
drinking ability. Bigger people can drink more and weight is a
more relevant measure of “bigness”.
Watch out for lurking variables/confounders
I In the multiple regression, when we consider only the variation
in height that is not associated with variation in weight, we
see no relationship between height and beers.
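One way to see this in R is to regress height on weight, keep the residuals (the variation in height not associated with weight), and regress beers on those residuals; the slope matches the small height coefficient from the multiple regression. A sketch, where beer is a hypothetical name for a data frame holding nbeer, height, and weight:

height_resid = resid(lm(height ~ weight, data = beer)) # height with weight "removed"
coef(lm(beer$nbeer ~ height_resid)) # slope is about 0.078, the MLR height coefficient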
nbeers = β0 + β1weight + ε
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.02070 2.21329 -3.172 0.00264 **
## weight 0.09289 0.01399 6.642 2.6e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.76 on 48 degrees of freedom
## Multiple R-squared: 0.4789,Adjusted R-squared: 0.4681
## F-statistic: 44.12 on 1 and 48 DF, p-value: 2.602e-08
Why is this a better model than the one with weight and height??
When to drop variables from a regression
I Here we dropped height because:
I It made our model simpler to interpret
I From what we know about the problem, it seemed plausible
that height was irrelevant given weight
I And 0 was a plausible value for the height coefficient
I Dropping variables just because 0 is inside the CI, or because
they are highly correlated with other variables in the model, is
usually a bad idea
Correlation and causation
In general, when we see a relationship between y and x (or x ’s),
that relationship may be driven by variables “lurking” in the
background which are related to your current x ’s.
This makes it hard to reliably find causal relationships. Any
correlation (association) you find could be caused by other
variables in the background... correlation is NOT causation
Any time a report says two variables are related and there’s a
suggestion of a causal relationship, ask yourself whether or not
other variables might be the real reason for the effect. Multiple
regression is one tool to control for all important variables by
including them in the regression. “Once we control for weight,
height and beers are NOT related”!
correlation is NOT causation
also...
I http://www.tylervigen.com/spurious-correlations
Uncertainty in Multiple Regression: Collinearity
I With the above examples we saw how the relationship
amongst the X ’s can affect our interpretation of a multiple
regression... we will now look at how these dependencies will
inflate the standard errors for the regression coefficients, and
hence our uncertainty about them.
I Remember that in simple linear regression our uncertainty
about b1 is measured by
s²b1 = s² / ((n − 1) s²x)
I The more variation in X (the larger s²x) the more “we know” about β1... i.e., our error (b1 − β1) tends to be smaller.
Uncertainty in Multiple Regression: Collinearity
I In MLR we relate the variation in Y to the variation in an X
holding the other X ’s fixed. So, we need to know how much
each X varies on its own.
I We can relate the standard errors in MLR to the standard
errors from SLR: With two X's,

s²bj = [1 / (1 − r²x1x2)] × s² / ((n − 1) s²xj)

where rx1x2 = cor(x1, x2) (recall: r²x1x2 is also the R² from regressing X1 on X2, or vice-versa).

The variance s²bj in MLR is inflated by the factor 1 / (1 − r²x1x2) relative to simple linear regression (and the SE by its square root).
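A quick numerical check of this formula with the sales data (assuming the price_sales data frame and fit from earlier):

fit = lm(Sales ~ p1 + p2, data = price_sales)
r = cor(price_sales$p1, price_sales$p2)
n = nrow(price_sales)
# plug into the formula above; var() uses the (n - 1) denominator, so it is s^2_xj
sqrt((1 / (1 - r^2)) * sigma(fit)^2 / ((n - 1) * var(price_sales$p1)))
# matches the p1 Std. Error (about 2.67) reported by summary(fit)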
Uncertainty in Multiple Regression: Multicollinearity
I In MLR we relate the variation in Y to the variation in an X
holding the other X ’s fixed. So, we need to know how much
each X varies on its own.
I In general, with p covariates,
s²bj = [1 / (1 − R²j)] × s² / ((n − 1) s²xj)

where R²j is the R² from regressing Xj on the other X's.
I When there are strong dependencies between the covariates
(known as multicollinearity), it is hard to attribute predictive
ability to any of them individually.
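For example, with the baseball data used on the next slide, the R²j for OBP (and the resulting inflation factor, often called the variance inflation factor) can be computed directly:

R2_obp = summary(lm(OBP ~ SLG, data = baseball))$r.squared
1 / (1 - R2_obp) # the factor multiplying the variance of the OBP coefficient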
Back to Baseball
R/G = β0 + β1OBP + β2SLG + ε
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.0143 0.8199 -8.555 3.61e-09 ***
## OBP 27.5929 4.0032 6.893 2.09e-07 ***
## SLG 6.0311 2.0215 2.983 0.00598 **
Compare the std error to the model with OBP alone:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.7816 0.8816 -8.827 1.40e-09 ***
## OBP 37.4593 2.5544 14.665 1.15e-14 ***
Even though s² is smaller in the MLR model (check it out!), the SE on OBP is higher than in SLR, since
cor(baseball$OBP, baseball$SLG)
## [1] 0.826103334
Last words on multicollinearity
I Multicollinearity is often framed as a problem to be “fixed”. It
usually isn’t, and the cures are often worse than the disease.
I Dropping a collinear variable can reduce the SE of a
coefficient, but completely changes its meaning!
I In prediction contexts it sometimes but not always makes
sense to drop variables in collinear relationships that are also
weak predictors
Predictions with Uncertainty in MLR: Plug-in method
Suppose that by using advanced corporate espionage tactics, I
discover that my competitor will charge $10 the next quarter.
After some marketing analysis I decided to charge $8. How much
will I sell?
Our model is
Sales = β0 + β1P1 + β2P2 + ε
with ε ∼ N(0, σ²)
Our estimates are b0 = 115, b1 = −97, b2 = 109 and s = 28
which leads to
Sales = 115 − 97 P1 + 109 P2 + ε
with ε ∼ N(0, 28²)
Plug-in Prediction in MLR
By plugging in the numbers,
Sales = 115.72 − 97.66 × 8 + 108.8 × 10 + ε ≈ 422 + ε
Sales | P1 = 8, P2 = 10 ∼ N(422.44, 28²)
and the 95% Prediction Interval is 422 ± 2 × 28:
366 < Sales < 478
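The same plug-in calculation in R, using the fit from earlier:

yhat = sum(coef(fit) * c(1, 8, 10)) # b0 + b1*8 + b2*10, about 422.44
yhat + c(-2, 2) * sigma(fit) # approximate 95% prediction interval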
Better Prediction Intervals in R
new_data = data.frame(p1=8, p2=10)
predict(fit, newdata = new_data,
interval="prediction", level=0.95)
## fit lwr upr
## 1 422.4573 364.2966 480.6181
Pretty similar to (366,478), right?
Like in SLR, the difference gets larger the “farther” our new point
(here P1 = 8, P2 = 10) gets from the observed data
Still be careful extrapolating!
In SLR “farther” is measured as distance from X̄; in MLR the idea of extrapolation is a little more complicated.
[Scatterplot of p2 vs. p1 for the observed sales data, with three candidate prediction points marked.]
Blue: (P1 = P̄1, P2 = P̄2), the sample means; red: (P1 = 8, P2 = 10); purple: (P1 = 7.2, P2 = 4). Red looks “consistent” with the data; purple not so much.