Linear regression analysis and residual
TRANSCRIPT
-
8/9/2019 Linear Regrssion Analysis and Residual
1/52
Simple Linear Regression
and Correlation
Presented by :
Eng. Heba El-Haddad
Introduction
Regression analysis is a statistical tool for analyzing the relationships between variables. For example:
A college guidance counselor has just administered a vocational aptitude test to 1000 entering freshmen. She is interested in knowing whether there is a relationship between the math aptitude scores and the business aptitude scores.
To determine the relationship between the math aptitude scores and the business aptitude scores, we have to compute a number that measures the relationship between these two sets of scores.
This number is called the correlation coefficient.
Correlation Coefficient
To determine the correlation between math aptitude and business aptitude, we can analyze the situation pictorially by using a scatter diagram.
The math score will represent the independent variable, denoted by x.
The business score will represent the dependent variable, denoted by y.
Different Aptitude Scores Received by Ten Students
Student  Math aptitude  Business aptitude  Language aptitude  Music aptitude
A 52 48 26 22
B 49 49 53 23
C 26 27 48 57
D 28 24 31 54
E 63 59 67 13
F 44 40 75 20
G 70 72 31 9
H 32 31 22 50
I 49 50 11 17
J 51 49 19 24
Scatter Diagram
Analyze the situation pictorially by using a scatter diagram.
To create a scatter diagram by using Minitab:
Click on Graph > Scatterplot.
Click on "With Regression", then "OK" in the first dialog box.
In the second dialog box, select business score into the first box in the Y column, and math score into the first box in the X column.
Note: each point represents one student's pair of scores (the figure labels the point for Student F).
Other types of scatter diagram
Negative linear correlation
No correlation
The Correlation Coefficient
Once we have determined that there is a linear relation between two variables, we can measure the strength of this relation by using the coefficient of linear correlation developed by Karl Pearson.
The coefficient of linear correlation is given by
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} \]
Where
x = label for one of the variables
y = label for the other variable
n = number of pairs of scores
Possibilities of the r value
The correlation coefficient will always have a value in the range -1 ≤ r ≤ +1
No correlation
r = 0
Positive correlation
r > 0
Strong positive correlation
r close to 1
Perfect positive correlation
r = 1
Negative correlation
r < 0
Strong negative correlation
r close to -1
Perfect negative correlation
r = -1
Example on the correlation coefficient
\[ r = \frac{10(22{,}729) - (464)(449)}{\sqrt{10(23{,}396) - (464)^2}\,\sqrt{10(22{,}137) - (449)^2}} = 0.986747791 \]
Thus, the coefficient of correlation is 0.9867. Since this value is close to +1, we say that there is a high degree of positive correlation.
Math aptitude "x"  Business aptitude "y"  x^2  y^2  xy
52 48 2704 2304 2496
49 49 2401 2401 2401
26 27 676 729 702
28 24 784 576 672
63 59 3969 3481 3717
44 40 1936 1600 1760
70 72 4900 5184 5040
32 31 1024 961 992
49 50 2401 2500 2450
51 49 2601 2401 2499
464 449 23396 22137 22729
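The hand calculation above can be reproduced with a short script. This is a minimal sketch in Python (the slides themselves use Minitab); the function name `pearson_r` and the variable names are illustrative choices, not something taken from the slides.

```python
import math

# Scores of students A-J, taken from the aptitude table.
math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]


def pearson_r(x, y):
    """Pearson's r computed from the raw sums, as in the slide formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(math_scores, business_scores)
print(f"r = {r:.4f}")  # prints "r = 0.9867"
```

Working from the raw sums keeps the code line-for-line comparable with the table totals (464, 449, 23396, 22137, 22729).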
The sensitivity of the correlation coefficient
The correlation coefficient is unaffected by adding or subtracting a number to either x or y or both. This holds even if x is coded in one way (perhaps by adding or subtracting a number) and y is coded in another way (say, by multiplying by a positive number).
Here x is coded as x + 29 and y as y - 38, and the value is unchanged:
r = 0.986747791
Math aptitude "x + 29"  Business aptitude "y - 38"  x^2  y^2  xy
81 10 6561 100 810
78 11 6084 121 858
55 -11 3025 121 -605
57 -14 3249 196 -798
92 21 8464 441 1932
73 2 5329 4 146
99 34 9801 1156 3366
61 -7 3721 49 -427
78 12 6084 144 936
80 11 6400 121 880
754 69 58718 2453 7098
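The invariance shown in this table can be checked numerically. The sketch below is in Python and recomputes r for the original scores, for the coded scores (x + 29, y - 38), and for a rescaled y. The factor 2 in the rescaling is a hypothetical choice for illustration only; the slides do not give a multiplication example.

```python
import math

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude


def pearson_r(xs, ys):
    """Pearson's r computed from the raw sums."""
    n = len(xs)
    num = n * sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys)
    den = (math.sqrt(n * sum(a * a for a in xs) - sum(xs) ** 2)
           * math.sqrt(n * sum(b * b for b in ys) - sum(ys) ** 2))
    return num / den


x_coded = [v + 29 for v in x]    # coding used in the table
y_coded = [v - 38 for v in y]    # coding used in the table
y_scaled = [2 * v for v in y]    # hypothetical positive rescaling

r_original = pearson_r(x, y)
r_coded = pearson_r(x_coded, y_coded)
r_scaled = pearson_r(x, y_scaled)
print(r_original, r_coded, r_scaled)
```

All three values agree up to floating-point rounding. Multiplying by a negative number would flip the sign of r, which is why the scaling factor is kept positive here.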
The Reliability of r
When r is computed we may get a strong correlation, positive or negative, which is due purely to chance and not to some relation that exists between x and y.
Business aptitude "x"  Music aptitude "y"  x^2  y^2  xy
48 22 2304 484 1056
49 23 2401 529 1127
27 57 729 3249 1539
24 54 576 2916 1296
59 13 3481 169 767
40 20 1600 400 800
72 9 5184 81 648
31 50 961 2500 1550
50 17 2500 289 850
49 24 2401 576 1176
449 289 22137 11193 10809
r = -0.914447
Amount of snow in inches "x"  No. of hours studied "y"  x^2  y^2  xy
1 2 1 4 2
4 6 16 36 24
2 3 4 9 6
6 4 36 16 24
3 4 9 16 12
16 19 66 81 68
r = 0.63
The value of r in this case is 0.63, but we cannot conclude that if it snows in the U.S.A. then students in Egypt study more!
A chart has been constructed that allows us to determine the significance of a particular value of the correlation coefficient:
1. Compute the value of r.
2. Look in the chart for the appropriate r-value corresponding to the given n, where n is the number of pairs of scores.
3. The value of r is not statistically significant if it lies between $-r_{\alpha}$ and $+r_{\alpha}$ for the particular value of n.
Coefficient Correlation Chart
In the case of the correlation between the amount of snow in the U.S.A. and the hours studied by students in Egypt:
Assume α = 0.025
1. r = 0.63, n = 5
2. From the table, the critical values for r_0.025 are -0.878 and +0.878
3. Since the value r = 0.63 lies between -0.878 and +0.878,
we conclude that the correlation may be due purely to chance.
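The decision rule in these steps amounts to comparing |r| with the chart value. A minimal sketch in Python follows; the helper name `is_significant` is hypothetical, and the critical value is simply the one quoted above for n = 5 rather than something the code derives.

```python
def is_significant(r, r_critical):
    """True when r falls outside the interval [-r_critical, +r_critical]."""
    return abs(r) > r_critical


snow_r = 0.63           # correlation between snowfall and hours studied
snow_critical = 0.878   # chart value quoted for n = 5, alpha = 0.025

snow_significant = is_significant(snow_r, snow_critical)
print("significant" if snow_significant else "may be due purely to chance")
```

Using the absolute value handles strong negative correlations with the same comparison.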
Coefficient Correlation Chart
In the case of the correlation between math aptitude and business aptitude scores:
Assume α = 0.025
1. r = 0.986747791, n = 10
2. From the table, the critical values for r_0.025 are -0.632 and +0.632
3. Since the value of r is greater than +0.632,
we conclude that there is a definite positive correlation between the math aptitude score and the business aptitude score.
The correlation coefficient merely determines whether two variables are related, but it does not specify how.
Linear Regression
Once we have determined that there is a linear correlation between two variables, linear regression is used to predict the value of one variable (the dependent variable y) on the basis of the other variable (the independent variable x).
To predict the value of y
A- From scatter diagram
Which line has the best fit to the data?
B- Least Squares Method
A Digression into History
The statistical method of least squares was developed by the French mathematician Adrien-Marie Legendre (1752-1833).
The Method of Least Squares
B- Least Squares Method
The regression line is the line whose equation minimizes the sum of the squared vertical deviations, that is, the differences between the observed and predicted values.
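The Method of Least Squares
The equation of the estimated regression line is
\[ \hat{y} = b_0 + b_1 x \]
Where
\[ b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}, \qquad b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right) \]
and n is the number of pairs of scores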
The Prediction of y value
Suppose the counselor is interested in predicting how well a student will do on the business aptitude test if she knows the student's score on the math aptitude test.
Math aptitude "x"  Business aptitude "y"  x^2  xy
52 48 2704 2496
49 49 2401 2401
26 27 676 702
28 24 784 672
63 59 3969 3717
44 40 1936 1760
70 72 4900 5040
32 31 1024 992
49 50 2401 2450
51 49 2601 2499
464 449 23396 22729
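\[ b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(22729) - (464)(449)}{10(23396) - (464)^2} = 1.01553 \]
\[ b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right) = \frac{1}{10}\left(449 - 1.01553 \times 464\right) = -2.221 \]
\[ \hat{y} = -2.221 + 1.01553x \]
For example, at x = 50:
\[ \hat{y} = -2.221 + 1.01553 \times 50 = 48.56 \]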
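The same slope, intercept and prediction can be obtained in a few lines of code. This is a minimal Python sketch of the raw-sum formulas used above; the function name `least_squares_line` is an illustrative choice, not part of the slides.

```python
x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude


def least_squares_line(xs, ys):
    """Return (b0, b1) for the fitted line y_hat = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1


b0, b1 = least_squares_line(x, y)
predicted_at_50 = b0 + b1 * 50
print(f"b1 = {b1:.5f}, b0 = {b0:.3f}, prediction at x = 50: {predicted_at_50:.2f}")
```

The code carries full precision throughout, so the last digits can differ slightly from a hand calculation that rounds b1 before computing b0.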
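Alternative way to compute b1 and b0
The coefficients b1 and b0 for the least squares line are calculated as:
1. Compute the average of the x values, $\bar{x}$, and the average of the y values, $\bar{y}$.
2. Compute the sample standard deviation of the x values, $S_x$; its square is the sample variance $S_x^2 = \sum (x - \bar{x})^2 / (n - 1)$.
3. Compute the sample covariance of the n data points, which is defined by $S_{xy} = \sum (x - \bar{x})(y - \bar{y}) / (n - 1)$.
4. Then $b_1 = S_{xy} / S_x^2$ and $b_0 = \bar{y} - b_1 \bar{x}$.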
Alternative way to compute b1 and b0
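\[ \bar{x} = 464/10 = 46.4, \qquad \bar{y} = 449/10 = 44.9 \]
\[ S_x^2 = 1866.4/9 = 207.378 \]
\[ S_{xy} = 1895.4/9 = 210.6 \]
\[ b_1 = S_{xy} / S_x^2 = 210.6 / 207.378 = 1.01554 \]
\[ b_0 = \bar{y} - b_1 \bar{x} = 44.9 - (1.01554 \times 46.4) = -2.221 \]
\[ \hat{y} = -2.221 + 1.01554x \]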
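The mean, variance and covariance route can be sketched with Python's standard `statistics` module. Here `statistics.variance` returns the sample variance (divisor n - 1); the covariance is written out by hand so the sketch does not depend on a recent Python version. Variable names are illustrative.

```python
from statistics import mean, variance

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude
n = len(x)

x_bar = mean(x)                      # 46.4
y_bar = mean(y)                      # 44.9
var_x = variance(x)                  # sample variance of x, divisor n - 1
cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

b1 = cov_xy / var_x                  # slope
b0 = y_bar - b1 * x_bar              # intercept
print(f"var_x = {var_x:.3f}, cov_xy = {cov_xy:.1f}, b1 = {b1:.5f}, b0 = {b0:.3f}")
```

Because the n - 1 divisors cancel in the ratio, this slope is identical to the one from the raw-sum formula.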
STANDARD ERROR OF THE ESTIMATE
For each x there is a corresponding population of y values.
At x = 50, the predicted value is
\[ \hat{y} = -2.221 + 1.01553 \times 50 = 48.56 \]
We cannot expect such a prediction to be exact.
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
The relationship between X and Y is a straight-line (linear) relationship.
The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term $\varepsilon$.
STANDARD ERROR OF THE ESTIMATE
The mean of the corresponding y values lies on some straight line whose equation we do not know but which is of the form:
\[ \mu_{y|x} = \beta_0 + \beta_1 x \]
The errors have identical normal distributions, all centered on the regression line.
STANDARD ERROR OF THE ESTIMATE
The error terms (the vertical distances between the predicted y value and the true population values) are normally distributed with mean 0 and the same standard deviation $\sigma$.
The value of $\sigma$ can be estimated from the sample data by computing the standard error of the estimate, also called the residual standard deviation:
\[ S_e = \sqrt{\frac{SSE}{n-2}}, \qquad SSE = \sum (y - \hat{y})^2 \]
SSE is called the error sum of squares.
If $S_e$ is zero, all the points fall on the regression line.
STANDARD ERROR OF THE ESTIMATE
Math aptitude "x"  Business aptitude "y"  ŷ  y - ŷ  (y - ŷ)^2
52 48 50.58656 -2.58656 6.69029263
49 49 47.53997 1.46003 2.1316876
26 27 24.18278 2.81722 7.93672853
28 24 26.21384 -2.21384 4.90108755
63 59 61.75739 -2.75739 7.60319961
44 40 42.46232 -2.46232 6.06301978
70 72 68.8661 3.1339 9.82132921
32 31 30.27596 0.72404 0.52423392
49 50 47.53997 2.46003 6.0517476
51 49 49.57103 -0.57103 0.32607526
464 449 52.0494017
\[ S_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{52.049}{10-2}} = 2.55 \]
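The residual column, SSE and the standard error of the estimate can be recomputed as follows. This Python sketch deliberately uses the rounded coefficients from the slides (-2.221 and 1.01553) so that the intermediate values match the table; variable names are illustrative.

```python
import math

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude
b0, b1 = -2.221, 1.01553                        # rounded slide coefficients

fitted = [b0 + b1 * v for v in x]                         # y_hat for each student
residuals = [obs - fit for obs, fit in zip(y, fitted)]    # y - y_hat
sse = sum(e * e for e in residuals)                       # error sum of squares
s_e = math.sqrt(sse / (len(x) - 2))                       # standard error of estimate

print(f"SSE = {sse:.4f}, S_e = {s_e:.2f}")
```

The divisor n - 2 reflects the two estimated coefficients, b0 and b1.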
Hypothesis Tests About The Regression
If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have slope b1 = 0, so that
ŷ = b0 + b1 x
reduces to
ŷ = b0
To determine whether x can be used as a predictor of y, we carry out a test of hypothesis.
Hypothesis Tests About The Regression
1. State the hypotheses:
H0: b1 = 0
H1: b1 ≠ 0
2. Compute the value of the test statistic.
3. Find n - 2, which is the number of degrees of freedom for the t-distribution.
4. Find the appropriate critical value $t_{\alpha/2}$ by using the Student's t table.
5. If the value of the test statistic falls in the rejection region of the two-tailed test, reject H0. Otherwise do not reject H0.
6. State the conclusion.
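The six steps can be sketched for the math and business aptitude data. The transcript does not preserve the slide's formula for the test statistic, so this Python sketch assumes the usual textbook form t = b1 / (S_e / sqrt(Σ(x - x̄)²)). The critical value 2.306 is the standard t-table entry for 8 degrees of freedom with 0.025 in each tail; it is entered by hand, not computed.

```python
import math

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((a - x_bar) ** 2 for a in x)                       # sum of squares of x
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))   # cross-product sum
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s_e = math.sqrt(sse / (n - 2))          # standard error of the estimate
se_b1 = s_e / math.sqrt(s_xx)           # assumed standard error of the slope

t_stat = b1 / se_b1                     # step 2: test statistic
t_critical = 2.306                      # step 4: t table, df = n - 2 = 8
reject_h0 = abs(t_stat) > t_critical    # step 5: two-tailed decision

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Under these assumptions the statistic is far beyond the critical value, which agrees with the strong correlation found earlier.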
The confidence interval for b1
Where $t_{\alpha/2}$ represents the t-distribution value obtained from the table using n - 2 degrees of freedom,
$S_e$ is the standard error of the estimate, and
$\hat{y}_p$ is the predicted value of y corresponding to x = x_p
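The interval formula itself did not survive in the transcript. As a hedged illustration, the Python sketch below assumes the standard textbook interval for the slope, b1 ± t_{α/2} · S_e / sqrt(Σ(x - x̄)²). The function name `slope_confidence_interval` is hypothetical, and the critical value 2.306 (8 degrees of freedom, 95% confidence) is taken from a standard t table rather than computed.

```python
import math


def slope_confidence_interval(xs, ys, t_critical):
    """Return (lower, upper) for b1 under the assumed textbook formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))               # standard error of the estimate
    margin = t_critical * s_e / math.sqrt(s_xx)  # half-width of the interval
    return b1 - margin, b1 + margin


x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude

lower, upper = slope_confidence_interval(x, y, t_critical=2.306)
print(f"95% interval for b1: ({lower:.3f}, {upper:.3f})")
```

An interval that excludes 0 leads to the same conclusion as rejecting H0 in the two-tailed test.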
Minitab
Regression Analysis
Minitab
Type the data into C1 and C2 in the data window, and label the columns; we will call C1 "math aptitude" and C2 "business aptitude".
Minitab
The resulting output:
The resulting output can be trimmed by deleting the unwanted information.
Minitab
To compute the linear correlation coefficient:
Click on Stat > Basic Statistics > Correlation, then select math aptitude and business aptitude into the "Variables:" box.
Click "OK."
Minitab
To predict the business aptitude score when the math aptitude score is 50:
Give C6 the label "new math score", then enter 50 in it, the value of the math score for which we want a prediction.
Click on Stat > Regression > Regression.
Select business aptitude into the "Response:" box and math aptitude into the "Predictors:" box.
Now go to "Options" and select "new math score" into the box called "Prediction intervals for new observations:".
Click on the "Fits" box under "Storage" to check it.
Click "OK" to get back to the Regression dialog box.
Click on "Results", then click on the top button, "Display nothing."
Click "OK" on the Results dialog box and on the Regression dialog box.
The result is displayed in C7.
Residual Analysis
Model Adequacy Checking
Regression Model:
\[ Y = \beta_0 + \beta_1 X + \varepsilon \]
Assumptions:
1. The relationship between the response Y and the predictors is linear.
2. The errors $\varepsilon$ are normally distributed.
3. The noise term $\varepsilon$ has zero mean.
4. All errors $\varepsilon$ have the same variance $\sigma^2$.
5. The errors $\varepsilon$ are uncorrelated between observations.
6. The errors $\varepsilon$ are independent of the predictors.
Residual analysis is used for detecting departures from these assumptions.
Residual Analysis
Definition of Residuals
Residuals are estimates of experimental error.
Mathematically, the residual for a specific predictor value is the difference between the response value y and the predicted response value ŷ:
\[ e_i = y_i - \hat{y}_i \]
Where:
y_i = actual observation at X_i
ŷ_i = predicted value from the equation $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
Residual Plots
Residual plotting is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions.
Residual Analysis
The residual plots are:
Histogram of the residuals
Check: normal probability plot of residuals
A symmetric bell-shaped histogram which is evenly distributed around zero indicates that the
normality assumption is likely to be true. If the histogram indicates that random error is not
normally distributed, it suggests that the model's underlying assumptions may have been
violated.
Sample sizes of residuals are generally small (
Residual Analysis
The normal probability plot
The normal probability plot should produce an approximately straight line if the points come from a normal distribution.
[Figure: Design-Expert normal plot of residuals for the response "Life"; x-axis: Residual, y-axis: Normal % probability]
Residual Analysis
Residuals plotted against the fitted values
Check for the error variance
This plot should produce a distribution of points scattered randomly about 0, regardless of the size of the fitted value.
A residual plot whose spread shows an increasing trend suggests that the error variance increases with the independent variable, while one whose spread shows a decreasing trend indicates that the error variance decreases with the independent variable. Neither of these is a constant-variance pattern.
Therefore they indicate that the assumption of constant variance is not likely to be true and the regression is not a good one.
On the other hand, a horizontal-band pattern suggests that the variance of the residuals is constant.
Residual Analysis
Residuals against run-order sequence or time
Checking the process drift
The Residual vs. Order of the Data plot can be used to check for drift of the variance during the experimental process when the data are time-ordered. If the residuals are randomly distributed around zero, there is no drift in the process.
Checking independence of the error term
The Residual vs. Order of the Data plot will reflect any correlation between the error term and time. Systematic patterns around zero, such as runs or cycles, indicate that the error terms are dependent.
Residual Analysis
Outlier
An outlier is a single observation or a group of observations markedly different from the bulk of the data or from the pattern set by the majority of the observations.
The presence of one or more outliers can seriously distort the analysis of variance.
Outliers may be checked by examining the standardized residuals:
\[ d_i = \frac{e_i}{\sqrt{MSE}} \]
Where:
d_i = standardized residual
e_i = residual
MSE = mean square error
The standardized residuals should be approximately normal with mean zero and unit variance.
A residual more than 3 or 4 standard deviations from zero is a potential outlier.
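The outlier check can be sketched for the aptitude data. This Python sketch assumes the common definition d_i = e_i / sqrt(MSE) with MSE = SSE / (n - 2), which matches the symbols listed on the slide but is not spelled out in the transcript; the function name and the threshold of 3 are illustrative.

```python
import math


def standardized_residuals(xs, ys):
    """Fit the least squares line and return each residual divided by sqrt(MSE)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    residuals = [b - (b0 + b1 * a) for a, b in zip(xs, ys)]
    mse = sum(e * e for e in residuals) / (n - 2)   # assumed: SSE / (n - 2)
    return [e / math.sqrt(mse) for e in residuals]


x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude

d = standardized_residuals(x, y)
# Indices of observations more than 3 standard deviations from zero.
potential_outliers = [i for i, value in enumerate(d) if abs(value) > 3]
print([round(value, 2) for value in d], potential_outliers)
```

For these ten students no standardized residual comes close to 3, so no observation is flagged.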
Residual Analysis
Example:
The regression equation:
\[ \hat{y} = -2.221 + 1.01553x \]
Math aptitude "x"  Business aptitude "y"  ŷ  Residual (y - ŷ)
52 48 50.58656 -2.58656
49 49 47.53997 1.46003
26 27 24.18278 2.81722
28 24 26.21384 -2.21384
63 59 61.75739 -2.75739
44 40 42.46232 -2.46232
70 72 68.8661 3.1339
32 31 30.27596 0.72404
49 50 47.53997 2.46003
51 49 49.57103 -0.57103
464 449
Residual Analysis
Using Minitab: Stat > Regression > Regression > Graphs
Residual Analysis
Thank you