simple linear regression[1]
Embed Size (px)
TRANSCRIPT

7/30/2019 Simple Linear Regression[1]
1/50
1
Linear Regressionand
Correlation Analysis

7/30/2019 Simple Linear Regression[1]
2/50
2
Chapter Goals
To understand the methods for displaying and describingrelationship among two variables
Models
Linear regression
Correlations
Frequency tables
Methods for Studying Relationships
Graphical
Scatter plots
Line plots 3D plots

7/30/2019 Simple Linear Regression[1]
3/50
3
Two Quantitative Variables
The response var iable, also called thedependent var iable, is the variable we wantto predict, and is usually denoted by y.
Theexplanatory variab le, also called theindependent var iable, is the variable thatattempts to explain the response, and isdenoted by x.

7/30/2019 Simple Linear Regression[1]
4/50
4
Scatter Plots and Correlation
A scatter plot (or scatter diagram) is usedto show the relationship between twovariables
Correlation analysis is used to measurestrength of the association (linearrelationship) between two variables
Only concerned with strength of therelationship
No causal effect is implied

7/30/2019 Simple Linear Regression[1]
5/50
5
Example
The following graphshows the scatter plotof Exam 1 score (x)and Exam 2 score (y)for 354 students in aclass.
Is there a relationshipbetween x and y?

7/30/2019 Simple Linear Regression[1]
6/50
6
Scatter Plot Examples
y
x
y
x
y
y
x
x
Linear relationships Curvilinear relationships

7/30/2019 Simple Linear Regression[1]
7/50
7
Scatter Plot Examples
y
x
y
x
No relationship
(continued)

7/30/2019 Simple Linear Regression[1]
8/50
8
Correlation Coefficient
The population correlation coefficient (rho) measures the strength of theassociation between the variables
The sample correlation coefficient ris an estimate of and is used tomeasure the strength of the linear
relationship in the sample observations
(continued)

7/30/2019 Simple Linear Regression[1]
9/50
9
Features of and r Unit free
Range between 1 and 1
The closer to 1, the stronger thenegative linear relationship
The closer to 1, the stronger the positivelinear relationship
The closer to 0, the weaker the linearrelationship

7/30/2019 Simple Linear Regression[1]
10/50
10
Examples of Approximate r Values
yy
x
y
x
x
y
x
y
x
Tag with appropriate value: 1, .6, 0, +.3, 1

7/30/2019 Simple Linear Regression[1]
11/50
11
Earlier Example
Correlations
1 .400**
.000
366 351
.400** 1
.000
351 356
Pearson Correlation
Sig. (2tailed)
N
Pearson Correlation
Sig. (2tailed)
N
Exam1
Exam2
Exam 1 Exam 2
Correlation is significant at the 0.01 level**.

7/30/2019 Simple Linear Regression[1]
12/50
12
Questions?
What kind of relationship would you expectin the following situations:
Age (in years) of a car, and its price.
Number of calories consumed per day andweight.
Height and IQ of a person.

7/30/2019 Simple Linear Regression[1]
13/50
13
Exercise
Identify the two variables that vary and decidewhich should be the independent variable andwhich should be the dependent variable.Sketch a graph that you think best representsthe relationship between the two variables.
1. The size of a persons vocabulary over his orher lifetime.
2. The distance from the ceiling to the tip of theminute hand of a clock hung on the wall.

7/30/2019 Simple Linear Regression[1]
14/50
14
Introduction to Regression Analysis
Regression analysis is used to:
Predict the value of a dependent variable basedon the value of at least one independent variable.
Explain the impact of changes in an independentvariable on the dependent variable.
Dependent variable: the variable we wish to
explain.Independent variable: the variable used to
explain the dependent variable.

7/30/2019 Simple Linear Regression[1]
15/50
15
Simple Linear RegressionModel
Only one independent variable, x.
Relationship between x and y isdescribed by a linear function.
Changes in y are assumed to becaused by changes in x.

7/30/2019 Simple Linear Regression[1]
16/50
16
Types of Regression ModelsPositive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship

7/30/2019 Simple Linear Regression[1]
17/50
17
ni ,...,2,1; ii10i XY
Linear component
Population Linear Regression
The population regression model:
Populationy intercept
PopulationSlopeCoefficient
RandomError term,
or residual
Dependent
Variable
IndependentVariable
Random Errorcomponent

7/30/2019 Simple Linear Regression[1]
18/50
18
Linear Regression Assumptions
The Error terms i, i=1, 2. , n areindependent and i ~ Normal (0,
2).
The Error terms i, i=1, 2. , n have constantvariance 2.
The underlying relationship between the Xvariable and the Y variable is linear.

7/30/2019 Simple Linear Regression[1]
19/50
19
Population Linear Regression(continued)
Random Errorfor this xi value
Y
X
Observed Valueof yi for xi
Predicted Valueof yi for xi
ii10i XY
xi
Slope = 1
YIntercept = 0
i

7/30/2019 Simple Linear Regression[1]
20/50
20
i10i XbbY
The sample regression line provides an estimateof the population regression function.
Estimated Regression Model
Estimate ofthe regressionyintercept
Estimate of theregression slopeEstimated
(or predicted)y value Independent
variable
The individual random error terms ei have a mean of zero
The Regression Function: iiii XEXYE 1010 )()(

7/30/2019 Simple Linear Regression[1]
21/50
21
Earlier Example

7/30/2019 Simple Linear Regression[1]
22/50
22
ResidualA residualis the difference between the observed
response yiand the predicted response i. Thus, for
each pair of observations (xi, yi), the ith residual isei= yi i= yi (b0+ b1xi)
Least Squares Criterion b0 and b1 are obtained by finding the
values of b0 and b1 that minimize the sumof the squared residuals.
2
10
22
x))b(b(y
)y(ye

7/30/2019 Simple Linear Regression[1]
23/50
23
b0 is the estimated average value of
y when the value of x is zero.
b1 is the estimated change in the
average value of y as a result of a
oneunit change in x.
Interpretation of theSlope and the Intercept

7/30/2019 Simple Linear Regression[1]
24/50
24
The Least Squares Equation
The formulas for b1
and b0
are:
algebraic equivalent:
nxx
n
yxxy
b2
2
1
)(
21 )(
))((
xx
yyxxb
xbyb 10 and

7/30/2019 Simple Linear Regression[1]
25/50
25
Finding the Least Squares Equation
The coefficients b0 and b1 willusually be found using computersoftware, such as Excel, Minitab, or
SPSS.
Other regression measures will alsobe computed as part of computerbased regression analysis.
Si l Li R i

7/30/2019 Simple Linear Regression[1]
26/50
26
Simple Linear RegressionExample
A real estate agent wishes to examine therelationship between the selling price of a
home and its size (measured in square feet)
A random sample of 10 houses is selected
Dependent variable (y) = house price in
$1000s
Independent variable (x) = square feet

7/30/2019 Simple Linear Regression[1]
27/50
27
Sample Data for House PriceModel
House Price in $1000s(y)
Square Feet(x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550405 2350
324 2450
319 1425
255 1700

7/30/2019 Simple Linear Regression[1]
28/50
28
Model Summary
.762a .581 .528 41.33032Model1
R R Square
Adjusted
R Square
Std. Error of
the Estimate
Predictors: (Constant), Square Feeta.
Coefficientsa
98.248 58.033 1.693 .129
.110 .033 .762 3.329 .010
(Constant)
Square Feet
Model
1
B Std. Error
UnstandardizedCoefficients
Beta
StandardizedCoefficients
t Sig.
Dependent Variable: House Pricea.
SPSS Output
The regression equation is:feet)(square0.11098.248pricehouse

7/30/2019 Simple Linear Regression[1]
29/50
29
0
50
100
150
200
250300
350
400
450
0 500 1000 1500 2000 2500 3000
Square Feet
HousePrice($1
000s)
Graphical Presentation
House price model: scatter plot andregression line
feet)(square0.11098.248pricehouse
Slope
= 0.110
Intercept
= 98.248

7/30/2019 Simple Linear Regression[1]
30/50
30
Interpretation of the Intercept, b0
b0 is the estimated average value of Y when
the value of X is zero (if x = 0 is in the range
of observed x values)
Here, no houses had 0 square feet, so
b0 = 98.24833 just indicates that, for houses
within the range of sizes observed,
$98,248.33 is the portion of the house price
not explained by square feet
feet)(square0.11098.248pricehouse
Interpretation of the

7/30/2019 Simple Linear Regression[1]
31/50
31
Interpretation of theSlope Coefficient, b1
b1 measures the estimated change in
the average value of Y as a result ofa oneunit change in X
Here, b1 = .10977 tells us that the averagevalue of a house increases by .10977($1000)
= $109.77, on average, for each additional
one square foot of size
feet)(square0.1097798.24833pricehouse

7/30/2019 Simple Linear Regression[1]
32/50
32
Least Squares Regression Properties
The sum of the residuals from the least
squares regression line is 0 ( ).
The sum of the squared residuals is a
minimum (minimized ).
The simple regression line always passes
through the mean of the y variable and the
mean of the x variable.
The least squares coefficients b0 and b1 are
unbiased estimates of 0 and 1.
0)( yy
2)( yy

7/30/2019 Simple Linear Regression[1]
33/50
33
Exercise
The growth of children from early childhood through
adolescence generally follows a linear pattern. Data onthe heights of female Americans during childhood, fromfour to nine years old, were compiled and the leastsquares regression line was obtained as = 32 + 2.4xwhere is the predicted height in inches, andxis age inyears.
Interpret the value of the estimated slope b1= 2. 4. Would interpretation of the value of the estimated
yintercept, b0= 32, make sense here? What would you predict the height to be for a female
American at 8 years old? What would you predict the height to be for a femaleAmerican at 25 years old? How does the quality of thisanswer compare to the previous question?

7/30/2019 Simple Linear Regression[1]
34/50
34
The coefficient of determination is theportion of the total variation in thedependent variable that is explained byvariation in the independent variable.
The coefficient of determination is alsocalled Rsquared and is denoted as R2
Coefficient of Determination, R2
1R0 2

7/30/2019 Simple Linear Regression[1]
35/50
35
Coefficient of Determination, R2
(continued)
Note: In the single independent variable case, the coefficientof determination is
where:R2 = Coefficient of determinationr = Simple correlation coefficient
22
rR
E l f A i t

7/30/2019 Simple Linear Regression[1]
36/50
36
Examples of ApproximateR2 Values
y
x
y
x
y
x
y
x
E l f A i t

7/30/2019 Simple Linear Regression[1]
37/50
37
Examples of ApproximateR2 Values
R2 = 0
No linear relationshipbetween x and y:
The value of Y does not
depend on x. (None of thevariation in y is explainedby variation in x)
y
xR2 = 0
SPSS O t t

7/30/2019 Simple Linear Regression[1]
38/50
38
SPSS OutputModel Summary
.762a .581 .528 41.33032
Model
1
R R Square
Adjusted
R Square
Std. Error of
the Estimate
Predictors: (Constant), Square Feeta.
ANOVAb
18934.935 1 18934.935 11.085 .010a
13665.565 8 1708.196
32600.500 9
Regression
Residual
Total
Model
1
Sum of
Squares df Mean Square F Sig.
Predictors: (Constant), Square Feeta.
Dependent Variable: House Priceb.
Coefficientsa
98.248 58.033 1.693 .129
.110 .033 .762 3.329 .010
(Constant)
Square Feet
Model1
B Std. Error
UnstandardizedCoefficients
Beta
StandardizedCoefficients
t Sig.
Dependent Variable: House Pricea.

7/30/2019 Simple Linear Regression[1]
39/50
39
Standard Error of Estimate
The standard deviation of the variation ofobservations around the regression line iscalled the standard error of estimate
The standard error of the regressionslope coefficient (b1) is given by sb1
s

7/30/2019 Simple Linear Regression[1]
40/50
40
Model Summary
.762a .581 .528 41.33032
Model
1
R R Square
Adjusted
R Square
Std. Error of
the Estimate
Predictors: (Constant), Square Feeta.
Coefficientsa
98.248 58.033 1.693 .129
.110 .033 .762 3.329 .010
(Constant)
Square Feet
Model
1
B Std. Error
Unstandardized
Coefficients
Beta
Standardized
Coefficients
t Sig.
Dependent Variable: House Pricea.
SPSS Output
41.33032s
0.03297s1b

7/30/2019 Simple Linear Regression[1]
41/50
41
Comparing Standard Errors
y
y y
x
x
x
y
x
1bssmall
1
bslarge
ssmall
slarge
Variation of observed y valuesfrom the regression line Variation in the slope of regressionlines from different possible samples
I f b t th Sl t T t

7/30/2019 Simple Linear Regression[1]
42/50
42
Inference about the Slope: tTest ttest for a population slope
Is there a linear relationship between X and Y? Null and alternate hypotheses
H0: 1 = 0 (no linear relationship)
H1: 1 0 (linear relationship does exist)
Test statistic:
Degree of Freedom:
1b
11
s
bt
2nd.f.
where:
b1 = Sample regression slopecoefficient
1
= Hypothesized slope
sb1 = Estimator of the standarderror of the slope

7/30/2019 Simple Linear Regression[1]
43/50
43
House Pricein $1000s
(y)
Square Feet(x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
(sq.ft.)0.109898.25pricehouse
Estimated Regression Equation:
The slope of this model is 0.1098
Does square footage of the houseaffect its sales price?
Example: Inference about the Slope:t Test
(continued)
I f b t th Sl

7/30/2019 Simple Linear Regression[1]
44/50
44
Inferences about the Slope:tTest Example  Continue
H0: 1 = 0
HA: 1 0
Test Statistic: t = 3.329
There is sufficient evidencethat square footage affects
house price
From Excel output:
Coeff icients Standard Error t Stat Pvalue
Intercept 98.24833 58.03348 1.69296 0.12892Square Feet 0.10977 0.03297 3.32938 0.01039
1bs t*
b1
Decision: Reject H0
Conclusion:
Reject H0Reject H0
a/2=.025
t/2Do not reject H0
0t(1/2)
a/2=.025
2.3060 2.3060 3.329
d.f. = 102 = 8

7/30/2019 Simple Linear Regression[1]
45/50
45
Confidence Interval for the Slope
Confidence Interval Estimate of the Slope:
Excel Printout for House Prices:
At 95% level of confidence, the confidence interval forthe slope is (0.0337, 0.1858)
1b/211sb a t
Coeff icients Standard Error t Stat Pvalue Lower 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
d.f. = n  2
C f f S

7/30/2019 Simple Linear Regression[1]
46/50
46
Confidence Interval for the Slope
Since the units of the house price variable is$1000s, we are 95% confident that the averageimpact on sales price is between $33.70 and$185.80 per square foot of house size
Coeff ic ient
sStandard
Error t Stat Pvalue Low er 95% Upper
95%
Intercept 98.24833 58.03348 1.69296 0.12892 35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
This 95% confidence interval does not include 0.
Conclusion: There is a significant relationship betweenhouse price and square feet at the .05 level of significance
R id l A l i

7/30/2019 Simple Linear Regression[1]
47/50
47
Residual Analysis
Purposes
Examine for linearity assumption
Examine for constant variance for alllevels of x
Evaluate normal distribution assumption
Graphical Analysis of Residuals
Can plot residuals vs. x Can create histogram of residuals tocheck for normality

7/30/2019 Simple Linear Regression[1]
48/50
48
Residual Analysis for Linearity
Not Linear Linear
x
residua
ls
x
y
x
y
x
residua
ls

7/30/2019 Simple Linear Regression[1]
49/50
49
Residual Analysis for Constant Variance
Nonconstant variance Constant variance
x x
y
x x
y
residuals
residuals

7/30/2019 Simple Linear Regression[1]
50/50
50
House Price Model Residual Plot
60
40
20
0
20
40
60
80
0 1000 2000 3000
Square Feet
Residuals
Example: Residual Output
RESIDUAL OUTPUTPredicted
House
Price Residuals
1 251.92316 6.923162
2 273.87671 38.123293 284.85348 5.853484
4 304.06284 3.937162
5 218.99284 19.99284
6 268.38832 49.38832
7 356.20251 48.79749
8 367.17929 43.17929
9 254.6674 64.33264
10 284.85348 29.85348