Simple Regression Multiple Regression
Part II
Linear Regression
As of Nov 2, 2020
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications
in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Seppo Pynnonen Applied Multivariate Statistical Analysis
1 Simple Regression
2 Multiple Regression
Statistical significance of coefficients
Selecting important variables
Model fit
Prediction
Other considerations
Qualitative predictors
Potential problems
Non-parametric regressions
Simple (linear) regression is defined as

y = β0 + β1x + ε, (1)

where y and x are observed values, β0 and β1, called parameters, are the intercept (constant) term and the slope coefficient, respectively, and ε is an unobserved random error term.

The parameters β0 and β1 are unknown, and are estimated from training data.

Given estimates β̂0 and β̂1, one can compute

ŷ = β̂0 + β̂1x (2)

to predict y on the basis of a given x-value, where ŷ denotes the prediction.
Regression: Estimation
The unknown coefficients β0 and β1 are most often estimated from the training data (sample) with observations (x1, y1), . . . , (xn, yn) by ordinary least squares (OLS), i.e.,

(β̂0, β̂1) = arg min_{β0, β1} ∑_{i=1}^n (yi − (β0 + β1xi))². (3)

Here the solutions are

β̂1 = ∑_{i=1}^n (xi − x̄)(yi − ȳ) / ∑_{i=1}^n (xi − x̄)², (4)

β̂0 = ȳ − β̂1 x̄, (5)

where x̄ and ȳ are the sample means of x and y.
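As an illustration (not part of the original slides), the closed-form solutions (4)–(5) can be checked numerically; this Python sketch uses made-up data.

```python
# Hedged sketch of the OLS formulas (4)-(5); the data below are made up.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: cross-deviation sum over squared x-deviation sum, eq. (4)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
# Intercept, eq. (5)
b0 = y_bar - b1 * x_bar

# Prediction at a new x-value, eq. (2)
y_hat = b0 + b1 * 6.0
```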
Remark 1
A crucial assumption for successful estimation is

E(ε|x) = 0, (6)

which implies that the explanatory variable x and the error term ε are
uncorrelated.
Example 1
In the Advertising data, regressing Sales on TV budget gives:
lm(formula = sales ~ TV, data = adv)
Residuals:
Min 1Q Median 3Q Max
-8.3860 -1.9545 -0.1913 2.0671 7.2124
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.032594 0.457843 15.36 <2e-16 ***
TV 0.047537 0.002691 17.67 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared: 0.6119,Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
[Figure: scatter plot of Sales (1,000 units) against TV budget (1,000 USD) with the fitted regression line.]
The deviations from the regression line, ei = yi − ŷi, are called residuals; they are estimates of the unobserved errors (εi).

Note that yi = ŷi + ei = β̂0 + β̂1xi + ei.

The estimates2 are random variables in the sense that if another sample is selected, they assume different values.

The distribution of an estimate (estimator) is called the sampling distribution; it is the distribution of the estimate values if one sampled n observations over and over again from the population and computed the estimates (see the example below).

2Statistical literature usually makes a distinction between estimator and estimate, so that the estimator refers to the function and the estimate to the value of the function.
Regression: Estimation
Example 2
Below, the left panel shows the scatter plot, the population regression line, and the OLS-estimated line from a sample of n = 100 observations; the middle panel shows OLS-estimated lines from 10 samples of size n = 100; and the right panel shows a histogram of the slope coefficient estimates β̂1 from 1,000 different samples of size n = 100. The population model is

y = 2 + 3x + ε, (7)

i.e. β0 = 2 and β1 = 3. For this simulated example, x and ε are generated from N(0, 1) and N(0, 4) distributions, respectively.
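The simulation behind this example can be sketched as follows (the slides use R; this Python version is illustrative, and N(0, 4) is read here as a standard deviation of 4, consistent with the residual standard error of about 3.7 reported later).

```python
import random
import statistics

# Hypothetical re-creation of Example 2's simulation: repeated samples of
# n = 100 from y = 2 + 3x + e, with x ~ N(0,1) and e drawn with sd = 4.
random.seed(1)

def ols_slope(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    return (sum((a - xb) * (b - yb) for a, b in zip(x, y))
            / sum((a - xb) ** 2 for a in x))

slopes = []
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in range(100)]
    y = [2 + 3 * xi + random.gauss(0, 4) for xi in x]
    slopes.append(ols_slope(x, y))

mean_slope = statistics.mean(slopes)  # close to the true beta1 = 3 (unbiasedness)
sd_slope = statistics.stdev(slopes)   # simulated sampling standard deviation, roughly 0.4
```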
[Figure: left panel, "Initial Sample" scatter of y against x with fitted line, β̂0 = 1.99, β̂1 = 3.36; middle panel, "Lines from 10 Samples" (population line, initial sample, 10 simulations); right panel, "Histogram of 1,000 β̂1 Estimates".]
Regression: Accuracy of coefficients
As stated generally in equation (1.1), the true relationship between x and y is y = f(x) + ε.

Here f(x) = β0 + β1x, resulting in the population regression y = β0 + β1x + ε as given in equation (1).

As demonstrated by Example 2, the estimates β̂0 and β̂1 deviate more or less from the underlying population parameters β0 and β1.

However, it can be shown that on average the estimates equal the underlying parameter values; mathematically, E[β̂j] = βj, j = 0, 1. In such a case we say that the estimates are unbiased.

So, in summary, it can be shown that the OLS estimators are unbiased, i.e., they do not systematically over- or under-estimate the underlying parameters.
The accuracy of the estimates can be evaluated in terms of the standard errors of the estimates:

se(β̂1) = σ / √(∑_{i=1}^n (xi − x̄)²), (8)

se(β̂0) = σ (1/n + x̄² / ∑_{i=1}^n (xi − x̄)²)^{1/2}. (9)
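A hedged Python sketch of (8)–(9), reusing the made-up toy data from the earlier illustration; since the true σ is unknown, it is replaced by its estimate √(RSS/(n − 2)), as regression packages do.

```python
import math

# Standard errors (8)-(9) for illustrative toy data; sigma is estimated
# by the residual standard error sqrt(RSS / (n - 2)).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(rss / (n - 2))                    # residual standard error
se_b1 = sigma_hat / math.sqrt(s_xx)                     # eq. (8)
se_b0 = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / s_xx)  # eq. (9)
```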
These are routinely produced by every regression package.
In the above example, the initial sample produces:
lm(formula = y ~ x)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9905 0.3663 5.434 4.03e-07 ***
x 3.3588 0.3852 8.720 7.22e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.663 on 98 degrees of freedom
Multiple R-squared: 0.4369,Adjusted R-squared: 0.4311
F-statistic: 76.03 on 1 and 98 DF, p-value: 7.219e-14
Thus se(β̂1) = 0.3852, which estimates the standard deviation we would obtain if we repeated the sampling over and over again, computed β̂1 from each sample, and calculated the standard deviation of these estimates.

We did this 1,000 times for the right-panel histogram in Example 2.

The standard deviation of these estimates is 0.4141, which is close to se(β̂1) = 0.3852 (the difference is about 0.029, or 7%).
The standard errors can be used to compute, e.g., confidence intervals (CIs) for the coefficients.

CIs are of the form

β̂ ± cα/2 · se(β̂), (10)

or

[β̂ − cα/2 · se(β̂), β̂ + cα/2 · se(β̂)], (11)

where cα/2 is the 1 − α/2 quantile of the t-distribution (or the normal distribution).

α is the significance level, with typical values .05 or .01, in which cases the confidence intervals are 95% and 99% intervals, respectively. For example, for the 95% confidence interval c.025 ≈ 2.
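As a small Python illustration (not from the slides), formula (10) can be applied to the Example 2 output, using the normal-approximation quantile c.025 ≈ 1.96.

```python
from statistics import NormalDist

# 95% confidence interval (10) with the normal approximation; the estimate
# and standard error are those reported in Example 2's regression output.
b1, se_b1 = 3.3588, 0.3852
c = NormalDist().inv_cdf(0.975)        # the 0.975 quantile, about 1.96
ci = (b1 - c * se_b1, b1 + c * se_b1)  # covers the true beta1 = 3 here
```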
In Example 2 the 95% confidence interval for β1 is

β̂1 ± 2 × se(β̂1) = 3.36 ± 2 × 0.385 = 3.36 ± 0.770, or [2.59, 4.13].

We observe that in this case the population value β1 = 3 belongs to the interval.

For a 95% confidence interval there is a 5% chance of drawing a sample in which the estimate is so far off that the confidence interval does not cover the population parameter.
Standard errors can also be used in hypothesis testing.

The most common hypothesis test involves testing the null hypothesis

H0 : There is no relationship between x and y (12)

versus the alternative hypothesis

H1 : There is some relationship between x and y. (13)

More formally, in terms of the regression model in equation (1), this corresponds to testing

H0 : β1 = 0 (14)

versus

H1 : β1 ≠ 0. (15)

If the null hypothesis holds, then y = β0 + ε, so that x is not associated with y.
For testing the more general null hypothesis H0 : β1 = β1*, where β1* is some given value, the test statistic is

t = (β̂1 − β1*) / se(β̂1), (16)

which for testing hypothesis (14), with β1* = 0, reduces to

t = β̂1 / se(β̂1). (17)

The null distribution (i.e., the distribution when the null hypothesis H0 is true) of t is the Student t distribution with n − 2 degrees of freedom.
With n > 30 the t-distribution is close to the normal distribution.
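Using the normal approximation mentioned above (reasonable in Example 2, where n = 100), the t-ratio (17) and its two-sided p-value can be sketched in Python; the numbers are the ones from Example 2's output.

```python
from statistics import NormalDist

# t-ratio (17) and a two-sided p-value under the normal approximation to
# the null distribution; estimate and standard error from Example 2.
b1, se_b1 = 3.3588, 0.3852
t = b1 / se_b1                                   # about 8.72, as in the output
p_value = 2 * (1 - NormalDist().cdf(abs(t)))     # essentially zero here
```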
For a large absolute value of t the null hypothesis is rejected.

By a 'large' value we mean that if the probability of obtaining such a large value is smaller than some specified threshold α, we reject the null hypothesis.

Typical values of α are 0.05 or 0.01, i.e., 5% or 1%.

The computer produces p-values that indicate the probability P(|t| > |tobs| | H0), i.e., the probability of getting values as large as (or larger than) the one observed, tobs, if the null hypothesis holds.

If this probability is too low, we infer that, rather than having drawn such an extreme sample, the underlying parameter value is something other than that of the null hypothesis, and therefore reject H0.
Typical threshold values are 0.05 (statistically significant at the 5% level) and 0.01 (statistically significant at the 1% level); i.e., if the p-value goes below these values, we reject the null hypothesis at the associated level of significance.

Example 3

In the advertising example the p-value is < .0001 (in fact the first 15
decimals are zeros), so the data strongly suggest rejecting the null
hypothesis that TV advertising does not affect Sales (the sign of the
coefficient shows that the association is positive, as could be expected).
Regression: Accuracy of the model
The quality of a linear regression fit is typically assessed by the residual standard error (RSE) and the coefficient of determination R² (R-square), of which the R-square is the more popular:

RSE = √( (1/(n − 2)) ∑_{i=1}^n (yi − ŷi)² ), (18)

and

R² = (TSS − RSS)/TSS = 1 − RSS/TSS, (19)

where TSS = ∑_{i=1}^n (yi − ȳ)² is the total sum of squares and RSS = ∑_{i=1}^n (yi − ŷi)² is the residual sum of squares.

We observe that RSE = √(RSS/(n − 2)).

R-square is a goodness-of-fit measure with 0 ≤ R² ≤ 1 (R² = 0: no association, R² = 1: perfect fit), while RSE measures lack of fit.
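Definitions (18)–(19) translate directly to code; this Python sketch computes both measures from observed and fitted values (the numbers are hypothetical).

```python
import math

# RSE (18) and R-squared (19) computed straight from their definitions;
# the observed y and fitted y_hat values below are made up.
y     = [5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
y_hat = [5.5, 6.5, 9.0, 10.5, 13.5, 15.0]
n = len(y)
y_bar = sum(y) / n
rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
tss = sum((a - y_bar) ** 2 for a in y)             # total sum of squares
rse = math.sqrt(rss / (n - 2))   # lack of fit, in the units of y
r2 = 1 - rss / tss               # goodness of fit, between 0 and 1
```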
Both of these are routinely produced by regression packages.
In the advertising example (rounded to two decimals), RSE = 3.26 and R² = 0.61.
RSE is in the same units as the dependent variable y .
James et al. interpret RSE as the amount by which a prediction is on average off from the true value of the dependent variable.

Accordingly, RSE = 3.26 would indicate that any prediction of sales on the basis of TV would be off by about 3.26 thousand units on average.

Thus, as the average sales over all markets is approximately 14 thousand units, the error is 3.26/14 ≈ 23%.
Adding explanatory variables (x-variables) to the model gives the multiple regression

y = β0 + β1x1 + · · · + βpxp + ε, (20)

where xj is the jth predictor (explanatory variable) and βj quantifies the marginal effect or association between y and xj.

That is, βj indicates the average change in y per unit change in xj, holding all other predictors fixed.

The coefficients are again estimated by finding the β̂0, β̂1, . . . , β̂p that minimize the sum of squares ∑_{i=1}^n (yi − ŷi)², where ŷi = β̂0 + β̂1xi1 + · · · + β̂pxip.
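Equivalently, the minimizer solves the normal equations (X′X)β̂ = X′y, where X is the design matrix with a column of ones for the intercept. A hedged pure-Python sketch with made-up two-predictor data (the normal-equations route is for illustration; numerical software typically uses a QR decomposition instead):

```python
# Multiple-regression OLS via the normal equations (X'X) b = X'y,
# solved with a small Gaussian elimination; data are made up.
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# y = 1 + 2*x1 + 3*x2 exactly, so OLS should recover these coefficients.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
y  = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
X  = [[1.0, a, b] for a, b in zip(x1, x2)]   # design matrix with intercept column

XtX = [[sum(X[i][r] * X[i][c] for i in range(len(X))) for c in range(3)] for r in range(3)]
Xty = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(3)]
beta = solve(XtX, Xty)   # approximately [1.0, 2.0, 3.0]
```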
Example 4
In the advertising example let us enhance the model as
sales = β0 + β1 TV + β2 radio + β3 newspaper + ε (21)
lm(formula = sales ~ TV + radio + newspaper, data = adv)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
radio 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972,Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
The results indicate that newspapers do not contribute to sales, while an additional thousand dollars spent on TV advertising predicts an average increase in sales of about 46 units (holding the radio budget unchanged).

Similarly, an additional thousand on radio advertising predicts an increase in sales of about 189 units (holding the TV budget unchanged).

However, checking the residuals reveals that the specification is not satisfactory.
[Figure: "Residuals vs Fitted" plot for lm(sales ~ TV + radio + newspaper), residuals against fitted values, with observations 131, 6 and 179 flagged.]
The graph indicates non-linearity.
After dropping the non-significant newspaper, we add squared terms of the explanatory variables to account for the obvious non-linearity:

sales = β0 + β1 TV + β2 radio + β11 (TV)² + β22 (radio)² + ε (22)

(Note: the βs and ε are generic notations.)

lm(formula = sales ~ TV + radio + I(TV^2) + I(radio^2), data = adv)
Residuals:
Min 1Q Median 3Q Max
-7.3987 -0.8509 0.0376 0.9781 3.3727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.535e+00 4.093e-01 3.750 0.000233 ***
TV 7.852e-02 4.978e-03 15.774 < 2e-16 ***
radio 1.588e-01 2.830e-02 5.613 6.78e-08 ***
I(TV^2) -1.138e-04 1.674e-05 -6.799 1.26e-10 ***
I(radio^2) 7.135e-04 5.709e-04 1.250 0.212862
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.515 on 195 degrees of freedom
Multiple R-squared: 0.9174,Adjusted R-squared: 0.9157
F-statistic: 541.2 on 4 and 195 DF, p-value: < 2.2e-16
[Figure: diagnostic plots for the quadratic model: "Residuals vs Fitted" and "Normal Q-Q" (observations 131, 6 and 92 flagged), and residuals against TV advertising and against Radio advertising.]
The residual plot (top left) still indicates non-linearity.

Let us further enhance the model by adding the interaction term TV × radio of the explanatory variables and estimate the regression

sales = β0 + β1 TV + β2 radio + β11 (TV)² + β22 (radio)² + β12 (TV × radio) + ε (23)
lm(formula = sales ~ TV + radio + I(TV^2) + I(radio^2) + TV:radio,
data = adv)
Residuals:
Min 1Q Median 3Q Max
-5.0027 -0.2859 -0.0062 0.3829 1.2100
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.194e+00 2.061e-01 25.202 <2e-16 ***
TV 5.099e-02 2.236e-03 22.801 <2e-16 ***
radio 2.654e-02 1.242e-02 2.136 0.0339 *
I(TV^2) -1.098e-04 6.901e-06 -15.914 <2e-16 ***
I(radio^2) 1.861e-04 2.359e-04 0.789 0.4311
TV:radio 1.075e-03 3.479e-05 30.892 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6244 on 194 degrees of freedom
Multiple R-squared: 0.986,Adjusted R-squared: 0.9857
F-statistic: 2740 on 5 and 194 DF, p-value: < 2.2e-16
[Figure: diagnostic plots for the interaction model: "Residuals vs Fitted" and "Normal Q-Q" (observations 131, 156 and 79 flagged), and residuals against TV advertising and against Radio advertising.]
Except for two potential outliers (observations 131 and 156), the residual plots are more satisfactory (recall that the error term should be purely random, thereby showing no systematic patterns in any context). Some indication of a third-order effect of TV may be present.
The interpretation of the coefficients is now a bit more tricky.

For example, the TV coefficient 0.051 indicates the TV effect at a zero radio budget (an increase of $1,000 in TV advertising can be expected to increase sales by 51 units if radio advertising is zero), while in general the marginal effect depends on the current levels of TV and radio advertising, being of the form β1 + 2β11 TV + β12 radio.

Finally, it may be surprising that newspaper advertising does not contribute to sales in the model, because alone it is significant in a simple regression.

Dependent variable: sales
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.35141 0.62142 19.88 < 2e-16 ***
newspaper 0.05469 0.01658 3.30 0.00115 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlations
sales TV radio newspaper
sales 1.000 0.782 0.576 0.228
TV 0.782 1.000 0.055 0.057
radio 0.576 0.055 1.000 0.354
newspaper 0.228 0.057 0.354 1.000
The reason is that radio and newspaper are correlated.

So newspaper alone in a regression reflects the radio advertising (due to the correlation), even though newspaper advertising actually does not contribute to sales!
A 3D plot to illustrate graphically the relationships.
Statistical significance of coefficients
The t-ratios

t = β̂j / se(β̂j)

and associated p-values indicate the significance of individual coefficients separately.

Testing the joint significance of all (or a subset of) coefficients, i.e., the null hypothesis

H0 : β1 = · · · = βp = 0 (24)

versus

H1 : at least one βj ≠ 0, (25)

can be performed with the F-statistic

F = [(TSS − RSS)/p] / [RSS/(n − p − 1)], (26)

which has the F-distribution with p and n − p − 1 degrees of freedom if the null hypothesis H0 is true.
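Formula (26) can be illustrated in Python; here TSS and RSS are reconstructed approximately from the Example 4 output (RSE = 1.686 on 196 df gives RSS = RSE² × 196, and R² = 0.8972 gives TSS = RSS/(1 − R²)).

```python
# F-statistic (26), with sums of squares reconstructed from the Example 4
# output; the reconstruction is approximate due to rounding in the output.
n, p = 200, 3
rss = 1.686 ** 2 * (n - p - 1)       # from RSE = sqrt(RSS / (n - p - 1))
tss = rss / (1 - 0.8972)             # from R^2 = 1 - RSS / TSS
f = ((tss - rss) / p) / (rss / (n - p - 1))   # close to the reported 570.3
```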
Example 5

In Example 4, F = 570.3 with degrees of freedom 3 and 196, and the p-value
is zero to 15 decimal places, implying strong rejection of the null hypothesis
that advertising in the three media does not affect sales (i.e., the null
hypothesis H0 : β1 = β2 = β3 = 0).

The F-test of hypothesis (24) can be considered the first step.

If the null hypothesis is not rejected, we conclude that no explanatory variable is associated with y and the model is of the form y = β0 + ε, i.e., y purely varies around its mean.

If the null hypothesis is rejected, the interest turns to which variables are associated with y, i.e., which explanatory variables are important.
Selecting important variables
Of a set of p explanatory variables, not all are necessarily associated with y, or their importance is marginal.

Variable selection or model selection in regression analysis refers to the problem of choosing the best subset of variables from the available (possibly large number of) candidates.

Criterion functions: e.g., the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).

Select the subset for which the criterion function attains its minimum.

Stepwise selection:

Forward selection: start from the null model with no explanatory variables and enhance the model one variable at a time with the smallest p-value, until the smallest p-value among the non-selected variables is no longer significant at the chosen level (e.g. the 5% level).

Backward selection: start with all explanatory variables in the model and remove, one at a time, the variable with the largest non-significant p-value. Stop when all remaining variables are significant.

Forward-backward selection: a combination of the two, starting with forward selection and applying backward selection at each step to remove non-significant variables from the current model. The procedure stops when no more variables are selected and no variables are removed.
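A minimal criterion-function sketch (illustrative, not from the slides): for Gaussian linear regression, AIC can be written, up to an additive constant, as n·log(RSS/n) + 2k with k estimated coefficients, and the model with the smaller value is preferred. The RSS values below are hypothetical.

```python
import math

# AIC for Gaussian linear regression, up to an additive constant:
# n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients.
def aic(n, rss, k):
    return n * math.log(rss / n) + 2 * k

n = 200
aic_full    = aic(n, rss=556.8, k=4)   # hypothetical: intercept + 3 predictors
aic_reduced = aic(n, rss=557.0, k=3)   # one predictor dropped, RSS barely worse
best = min(("full", aic_full), ("reduced", aic_reduced), key=lambda t: t[1])
# Dropping the near-useless predictor lowers the AIC, so "reduced" wins.
```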
Remark 2
Stepwise selection can also be performed using criterion functions such as AIC.

For example, R's step() function (in the stats package; see also MASS::stepAIC()) performs AIC-based stepwise selection.
Example 6
AIC stepwise variable selection. The full model:
lm(formula = medv ~ ., data = boston)
## note, the dot in medv ~ . includes all variables
Residuals:
Min 1Q Median 3Q Max
-15.594 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.45949 5.10346 7.1 3e-12 ***
crim -0.10801 0.03286 -3.3 0.001 **
zn 0.04642 0.01373 3.4 8e-04 ***
indus 0.02056 0.06150 0.3 0.738
chas 2.68673 0.86158 3.1 0.002 **
nox -17.76661 3.81974 -4.7 4e-06 ***
rm 3.80987 0.41793 9.1 <2e-16 ***
age 0.00069 0.01321 0.1 0.958
dis -1.47557 0.19945 -7.4 6e-13 ***
rad 0.30605 0.06635 4.6 5e-06 ***
tax -0.01233 0.00376 -3.3 0.001 **
ptratio -0.95275 0.13083 -7.3 1e-12 ***
black 0.00931 0.00269 3.5 6e-04 ***
lstat -0.52476 0.05072 -10.3 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’
Residual standard error: 4.75 on 492 degrees of freedom
Multiple R-squared: 0.741,Adjusted R-squared: 0.734
F-statistic: 108 on 13 and 492 DF, p-value: <2e-16
The stepwise-selected model:

lm(formula = medv ~ crim + zn + chas + nox + rm + dis +
rad + tax + ptratio + black + lstat, data = boston)
Residuals:
Min 1Q Median 3Q Max
-15.5984 -2.7386 -0.5046 1.7273 26.2373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.3411 5.0675 7.2 3e-12 ***
crim -0.1084 0.0328 -3.3 0.001 **
zn 0.0458 0.0135 3.4 8e-04 ***
chas 2.7187 0.8542 3.2 0.002 **
nox -17.3760 3.5352 -4.9 1e-06 ***
rm 3.8016 0.4063 9.4 <2e-16 ***
dis -1.4927 0.1857 -8.0 7e-15 ***
rad 0.2996 0.0634 4.7 3e-06 ***
tax -0.0118 0.0034 -3.5 5e-04 ***
ptratio -0.9465 0.1291 -7.3 9e-13 ***
black 0.0093 0.0027 3.5 6e-04 ***
lstat -0.5226 0.0474 -11.0 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’
Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared: 0.7406,Adjusted R-squared: 0.7348
F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
indus and age are removed.
Model fit
As in simple regression, R² and the residual standard error (RSE) are two of the most common model-fit measures.

In Example 4, R² = 0.986, i.e., the model explains 98.6% of the total variation in sales.

Another R-squared is the adjusted R-squared,

R̄² = 1 − [RSS/(n − p − 1)] / [TSS/(n − 1)] = 1 − (1 − R²)(n − 1)/(n − p − 1), (27)

which penalizes R² for the inclusion of additional explanatory variables.

In Example 6, R² = 0.741 for the full model and 0.7406 for the stepwise-selected model (indus and age removed), while R̄² = 0.734 for the full model and R̄² = 0.735 for the reduced model.

Thus the R-squared (slightly) decreases when explanatory variables are removed, while in this case the adjusted R-squared slightly increases.
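The adjustment in (27) is easy to verify numerically; this Python sketch uses the figures of Example 6's full Boston model (n = 506 observations, p = 13 predictors).

```python
# Adjusted R-squared (27) from the plain R-squared; numbers from Example 6's
# full model: n = 506, p = 13, R^2 = 0.741.
n, p, r2 = 506, 13, 0.741
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # about 0.734, as reported
```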
Prediction
The estimated model

ŷ = β̂0 + β̂1x1 + · · · + β̂pxp (28)

estimates the population regression

f(x) = β0 + β1x1 + · · · + βpxp. (29)

The inaccuracy in the estimated coefficients in (28) is related to the reducible error.

Confidence interval for the regression: a confidence interval for the population regression in (29) is

ŷ ± cα/2 se(ŷ), (30)

where

se(ŷ) = σ̂ √(x′(X′X)⁻¹x) (31)

is the standard error of the regression line (more precisely, hyperplane).

Confidence interval for prediction: a confidence interval for a realized value y at given observed values x is

ŷ ± cα/2 se(pred ŷ), (32)

where

se(pred ŷ) = σ̂ √(1 + x′(X′X)⁻¹x). (33)

The one in the standard error of prediction is due to the irreducible error.
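In the simple-regression special case, x′(X′X)⁻¹x reduces to 1/n + (x0 − x̄)²/∑(xi − x̄)², which makes (30)–(33) easy to sketch; the Python example below reuses the earlier made-up toy data and the normal-approximation quantile.

```python
import math
from statistics import NormalDist

# Intervals (30) and (32) in the simple-regression special case, where
# x'(X'X)^{-1}x = 1/n + (x0 - x_bar)^2 / sum((xi - x_bar)^2). Toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar
sigma = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 3.5                      # point at which to predict
y0 = b0 + b1 * x0
h = 1 / n + (x0 - x_bar) ** 2 / s_xx
c = NormalDist().inv_cdf(0.975)
ci_mean = (y0 - c * sigma * math.sqrt(h),     y0 + c * sigma * math.sqrt(h))      # eq. (30)
ci_pred = (y0 - c * sigma * math.sqrt(1 + h), y0 + c * sigma * math.sqrt(1 + h))  # eq. (32)
# The prediction interval is wider: it adds the irreducible error's "1".
```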
Example 7
In the Advertising data set, consider the regression sales = β0 + β1 TV + β11 (TV)² + ε.

The figure depicts the 95% confidence bands for the regression line (grey) and for predictions (light blue).
[Figure: scatter plot of sales against TV with the fitted quadratic regression line, 95% confidence band for the line (grey), and 95% prediction band (light blue).]
Other considerations
Qualitative predictors
Qualitative information that indicates only a classification (e.g. gender, ethnic group, etc.) can be introduced into the regression using indicator or dummy variables.

In a regression with a qualitative explanatory variable with q classes, one class is selected as the reference class and the other q − 1 classes are indicated by q − 1 dummy variables.

The coefficient of each dummy variable indicates the deviation from the reference group.

In R, categorical variables can be defined as factor variables, for which R generates the needed dummy variables in the regression.
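The q − 1 dummy coding can be sketched by hand in Python (illustrative; the marital-status labels mirror the wage example below, where single male is the reference group).

```python
# Building q-1 dummy variables for a q-class qualitative predictor,
# with the first class as the reference; labels are illustrative.
levels = ["single male", "single female", "married male", "married female"]
ref = levels[0]                      # reference class gets no dummy of its own

def dummies(value):
    """Return the q-1 indicator values for one observation."""
    return [1 if value == lev else 0 for lev in levels[1:]]

row_ref = dummies("single male")     # all dummies zero for the reference class
row_mm  = dummies("married male")    # exactly one indicator switched on
```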
Example 8
Using the wage data available on www.econometrics.com
log(wage) = β0 + δ1 singlefem + δ2 marrmale + δ3 marrfem (34)
+ β2 educ + β3 tenure + β4 exper + β5 (tenure)² + β6 (exper)² + ε.
Thus, single male is the reference group.
lm(formula = log(wage) ~ mstatus + educ + tenure + exper + I(tenure^2) +
I(exper^2), data = wdf)
Residuals:
Min 1Q Median 3Q Max
-1.89697 -0.24060 -0.02689 0.23144 1.09197
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3213781 0.1000090 3.213 0.001393 **
mstatussingle female -0.1103502 0.0557421 -1.980 0.048272 *
mstatusmarried male 0.2126757 0.0553572 3.842 0.000137 ***
mstatusmarried female -0.1982676 0.0578355 -3.428 0.000656 ***
educ 0.0789103 0.0066945 11.787 < 2e-16 ***
tenure 0.0290875 0.0067620 4.302 2.03e-05 ***
exper 0.0268006 0.0052428 5.112 4.50e-07 ***
I(tenure^2) -0.0005331 0.0002312 -2.306 0.021531 *
I(exper^2) -0.0005352 0.0001104 -4.847 1.66e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3933 on 517 degrees of freedom
Multiple R-squared: 0.4609,Adjusted R-squared: 0.4525
F-statistic: 55.25 on 8 and 517 DF, p-value: < 2.2e-16
Potential problems
In fitting a linear regression, many problems may occur.
Non-linearity.
Correlation of error terms.
Heteroskedasticity.
Outliers.
High-leverage points.
Collinearity.
Graphical tools are often useful in checking for the presence of these problems (scatter plots of residuals against predicted values and explanatory variables, as we have done in some of the Advertising examples).
Non-parametric regressions
Parametric regressions assume a well-defined functional form for f(x).

Non-parametric approaches do not impose assumptions on f(x).

These methods rely on the data and apply different algorithms to find relationships between the response and the explanatory variables.

One is K-nearest neighbors regression (KNN regression), which is closely related to the KNN classifier.

Given a value of K and a prediction point x0, KNN regression first identifies the K training observations that are closest to x0, represented by N0.

f(x0) is then estimated by the average of the training responses in N0, i.e.,

f̂(x0) = (1/K) ∑_{xi ∈ N0} yi. (35)

We will return to this later.
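A one-dimensional sketch of (35) in Python (illustrative data; real implementations also handle ties and multivariate distances more carefully):

```python
# K-nearest-neighbour regression (35) in one dimension: average the
# responses of the K training points closest to x0. Data are made up.
def knn_predict(x_train, y_train, x0, k):
    # indices of the k training points nearest to x0
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [1.2, 1.9, 3.2, 3.9, 5.1]
pred = knn_predict(x_train, y_train, x0=2.5, k=2)  # averages the y at x = 2 and x = 3
```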