Chapter 6 – Simple Regression
6.1 – Introduction
Fundamental questions
– Is there a relationship between two random variables, and how strong is it?
– Can we predict the value of one if we know the value of the other?
Example
– The author had ten of his students measure their shoe length and height.
6.2 – Covariance and Correlation
Definition 6.2.1 Let $X$ and $Y$ be two random variables with respective means $\mu_X$ and $\mu_Y$. The covariance of $X$ and $Y$ is
$$\mathrm{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]$$
Alternatively,
$$\mathrm{Cov}(X,Y) = E(XY) - \mu_X\mu_Y$$
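The alternative form follows from the definition by linearity of expectation. A short derivation, using only the symbols of Definition 6.2.1:

```latex
\begin{align*}
\mathrm{Cov}(X,Y) &= E[(X-\mu_X)(Y-\mu_Y)] \\
                  &= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X\mu_Y \\
                  &= E(XY) - \mu_X\mu_Y - \mu_Y\mu_X + \mu_X\mu_Y \\
                  &= E(XY) - \mu_X\mu_Y .
\end{align*}
```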
Example 6.2.1
$$\mu_X = 1(1/9) + 2(1/3) + 3(5/9) = 22/9$$
$$\mu_Y = 1(5/9) + 2(1/3) + 3(1/9) = 14/9$$
$$E(XY) = 1 \cdot 1(1/9) + 2 \cdot 1(2/9) + \cdots + 3 \cdot 3(1/9) = 4$$
$$\mathrm{Cov}(X,Y) = 4 - (22/9)(14/9) = 16/81 \approx 0.1975$$
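The arithmetic in Example 6.2.1 can be checked with exact fractions. This is only a sketch: the marginal probabilities are read off from the sums in the example, and $E(XY) = 4$ is taken as given because the full joint table is not reproduced in the transcript. The variable names are illustrative.

```python
from fractions import Fraction as F

# Marginal distributions read off from the sums in Example 6.2.1.
p_x = {1: F(1, 9), 2: F(1, 3), 3: F(5, 9)}
p_y = {1: F(5, 9), 2: F(1, 3), 3: F(1, 9)}

mu_x = sum(x * p for x, p in p_x.items())  # 22/9
mu_y = sum(y * p for y, p in p_y.items())  # 14/9

# E(XY) = 4 is taken from the example as stated (assumption: the joint
# table that produces it is not available here).
e_xy = F(4)

# Shortcut formula: Cov(X, Y) = E(XY) - mu_X * mu_Y
cov = e_xy - mu_x * mu_y
print(mu_x, mu_y, cov)
```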
Correlation Coefficient
Definition 6.2.2 Let $X$ and $Y$ be random variables with standard deviations $\sigma_X$ and $\sigma_Y$, respectively. The correlation coefficient of $X$ and $Y$ is
$$\rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{E(XY) - \mu_X\mu_Y}{\sigma_X\sigma_Y}$$
Theorem 6.2.2
1. $|\rho(X,Y)| \le 1$
2. If $X$ and $Y$ are independent, then $\rho(X,Y) = 0$
3. $|\rho(X,Y)| = 1$ if and only if $P(Y = mX + b) = 1$ for some constants $m$ and $b$
Sample Correlation Coefficient
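Definition 6.2.3 The sample correlation coefficient of $n$ pairs of data values $(x_i, y_i)$ is
$$r = \frac{\frac{1}{n}\sum x_i y_i - \bar x\,\bar y}{\sqrt{\frac{1}{n}\sum (x_i - \bar x)^2}\,\sqrt{\frac{1}{n}\sum (y_i - \bar y)^2}}$$
Alternatively,
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$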
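The two forms of the sample correlation coefficient in Definition 6.2.3 should give the same number. A minimal sketch that computes both, with hypothetical function names and made-up data (not the shoe-length data):

```python
import math

def sample_corr_definition(xs, ys):
    """r from the mean-based form of Definition 6.2.3."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    den_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    den_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return num / (den_x * den_y)

def sample_corr_computational(xs, ys):
    """r from the sums-only ("Alternatively") form."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    )

# Made-up illustrative data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r1 = sample_corr_definition(xs, ys)
r2 = sample_corr_computational(xs, ys)
print(r1, r2)  # the two values agree up to rounding error
```

The sums-only form is convenient for hand calculation; the mean-based form tends to be less sensitive to rounding when the data values are large.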
Bivariate Normal Distribution
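Definition 6.2.4 Let
$$h(x,y) = \frac{1}{1-\rho^2}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right]$$
Two variables $X$ and $Y$ are said to have a bivariate normal distribution if their joint p.d.f. is
$$f(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\, e^{-h(x,y)/2}$$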
Theorem 6.2.3 Two random variables $X$ and $Y$ with a bivariate normal distribution are independent if and only if $\rho = 0$.
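One direction of Theorem 6.2.3 can be seen numerically: when $\rho = 0$ the bivariate normal density factors into the product of two ordinary normal densities. A sketch under that reading of Definition 6.2.4; the function names and the parameter values are illustrative assumptions.

```python
import math

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Joint p.d.f. f(x, y) = exp(-h/2) / (2*pi*sigma_x*sigma_y*sqrt(1 - rho^2))."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    h = (zx ** 2 - 2 * rho * zx * zy + zy ** 2) / (1 - rho ** 2)
    return math.exp(-h / 2) / (
        2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - rho ** 2)
    )

def normal_pdf(x, mu, sigma):
    """Ordinary (univariate) normal density."""
    return math.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

# Arbitrary illustrative parameters.
joint = bivariate_normal_pdf(1.0, 2.5, mu_x=0.0, mu_y=2.0,
                             sigma_x=1.5, sigma_y=0.5, rho=0.0)
product = normal_pdf(1.0, 0.0, 1.5) * normal_pdf(2.5, 2.0, 0.5)
print(joint, product)  # equal up to rounding error when rho = 0
```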
T-test of $\rho$ for Bivariate Random Variables
Purpose: To test the null hypothesis H0: $\rho = 0$, where $X$ and $Y$ have a bivariate normal distribution.
– Test statistic
$$t = r\sqrt{\frac{n-2}{1-r^2}}$$
– Critical value: t-score with $n-2$ degrees of freedom
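Example 6.2.4
For the shoe length vs. height data, $n = 10$ and $r = 0.974$.
– Test the claim that $\rho \ne 0$
H0: $\rho = 0$   H1: $\rho \ne 0$
– Test statistic
$$t = 0.974\sqrt{\frac{10-2}{1-(0.974)^2}} = 12.16$$
– Critical value: t-score with $10 - 2 = 8$ degrees of freedom
– Critical region: $|t|$ greater than the critical value
– P-value = twice the area to the right of $t = 12.16$, which is approximately 0
– Reject H0
Final conclusion:
– There is a statistically significant linear relationship between shoe length and height.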
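The test statistic in Example 6.2.4 can be reproduced directly from $r$ and $n$. A minimal sketch; the function name is hypothetical, and the critical value and P-value would still come from a t-table or statistical software.

```python
import math

def corr_t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), compared against t with n - 2 d.f."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Values from Example 6.2.4.
t = corr_t_statistic(0.974, 10)
print(round(t, 2))  # 12.16
```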
6.3 – Method of Least-Squares
We want to find $\hat m$ and $\hat b$ that minimize
$$S = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} \left(y_i - (\hat m x_i + \hat b)\right)^2$$
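Setting the partial derivatives of $S$ equal to zero:
$$\frac{\partial S}{\partial \hat m} = \sum 2(y_i - \hat m x_i - \hat b)(-x_i) = 0$$
$$\frac{\partial S}{\partial \hat b} = \sum 2(y_i - \hat m x_i - \hat b)(-1) = 0$$
Solving for $\hat m$ and $\hat b$:
$$\hat m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \quad\text{and}\quad \hat b = \bar y - \hat m \bar x$$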
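The closed-form least-squares estimates translate almost line for line into code. A minimal sketch with a hypothetical function name and made-up data that lie exactly on a line, so the fit should recover that line:

```python
def least_squares(xs, ys):
    """Return (m_hat, b_hat) minimizing the sum of squared residuals."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    m_hat = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b_hat = sy / n - m_hat * sx / n  # b_hat = y_bar - m_hat * x_bar
    return m_hat, b_hat

# Made-up data lying exactly on y = 2x + 1.
m_hat, b_hat = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(m_hat, b_hat)  # 2.0 1.0
```

The denominator is zero when all x-values are equal, in which case no unique least-squares line exists.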
Example 6.3.1
Suppose a crime scene investigator finds a shoe print outside a window that measures 11.25 in. long and would like to estimate the height of the person who made the print.
$$\hat y = 3.878(11.25) + 25.84 = 69.47$$
Cautions
1. If there is no linear correlation, do not use a linear regression equation to make predictions.
2. Only use a linear regression equation to make predictions within the range of the x-values of the data.
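The prediction in Example 6.3.1 and the second caution can be combined in a small helper that refuses to extrapolate. This is a sketch: the function name is hypothetical, and `x_min` and `x_max` are placeholder bounds chosen for illustration, not the actual range of the students' shoe lengths.

```python
def predict_height(shoe_length, x_min, x_max, m_hat=3.878, b_hat=25.84):
    """Predict height from the regression equation of Example 6.3.1.

    Raises ValueError when shoe_length lies outside [x_min, x_max],
    the range of x-values the line was fitted on.
    """
    if not (x_min <= shoe_length <= x_max):
        raise ValueError("shoe length is outside the range of the observed x-values")
    return m_hat * shoe_length + b_hat

# Placeholder bounds (assumption): replace with the observed min and max.
y_hat = predict_height(11.25, x_min=9.0, x_max=13.0)
print(y_hat)  # approximately 69.47
```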
6.4 – The Simple Linear Model
Definition 6.4.1 Two random variables $X$ and $Y$ are said to be described by a simple linear model if
$$Y = mX + b + \varepsilon$$
where $m$ and $b$ are constants and $\varepsilon$ is a random variable independent of $X$ that is $N(0, \sigma^2)$, where $\sigma$ is a constant.
Residuals
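Definition 6.4.2 For a set of data $(x_1, y_1), \ldots, (x_n, y_n)$, the residuals are
$$\varepsilon_i = y_i - \hat y_i = y_i - (\hat m x_i + \hat b) \quad\text{for } i = 1, \ldots, n$$
where $\hat m$ and $\hat b$ are the least-squares estimates of $m$ and $b$ as calculated in Section 6.3
– Observed values of $\varepsilon$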
Standard Error of Estimate
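Definition 6.4.3 Let $X$ and $Y$ be described by a simple linear model. The standard error of estimate is
$$s_e = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} \left(y_i - \hat m x_i - \hat b\right)^2}$$
– $s_e^2$ is an unbiased estimate of $\sigma^2$, the variance of $\varepsilon$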
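Residuals and the standard error of estimate follow directly from the fitted line. A minimal sketch with hypothetical function names; the data are made up, and `m_hat = 1.9`, `b_hat = 0.0` are the least-squares estimates for that made-up data.

```python
import math

def residuals(xs, ys, m_hat, b_hat):
    """Observed minus predicted y-values."""
    return [y - (m_hat * x + b_hat) for x, y in zip(xs, ys)]

def standard_error_of_estimate(xs, ys, m_hat, b_hat):
    """s_e = sqrt(sum of squared residuals / (n - 2))."""
    n = len(xs)
    ss_res = sum(e ** 2 for e in residuals(xs, ys, m_hat, b_hat))
    return math.sqrt(ss_res / (n - 2))

# Made-up data; its least-squares line is y = 1.9x + 0.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
res = residuals(xs, ys, 1.9, 0.0)
s_e = standard_error_of_estimate(xs, ys, 1.9, 0.0)
print(res, s_e)
```

The divisor is $n - 2$ rather than $n$ because two parameters, $m$ and $b$, were estimated from the data.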
Prediction Interval
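Definition 6.4.4 Let $X$ and $Y$ be described by a simple linear model. Given a value of $X$, say $x_0$, a $(1-\alpha)100\%$ prediction interval estimate for the corresponding value of $Y$ is
$$\hat y - E < Y < \hat y + E$$
where $\hat y = \hat m x_0 + \hat b$ and $E$, the margin of error, is
$$E = t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum (x_i - \bar x)^2}}$$
and $t_{\alpha/2}$ is a critical t-value with $n-2$ d.f.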
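The prediction interval can be sketched as a single function. The critical value is passed in as a parameter because it comes from a t-table or statistical software; the function name and the data are illustrative assumptions, and 4.303 is the tabulated $t_{0.025}$ for 2 degrees of freedom.

```python
import math

def prediction_interval(xs, ys, m_hat, b_hat, x0, t_crit):
    """Return (lower, upper) bounds of the prediction interval at x0.

    t_crit is the critical t-value with n - 2 degrees of freedom,
    supplied by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    ss_res = sum((y - (m_hat * x + b_hat)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(ss_res / (n - 2))
    margin = t_crit * s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    y_hat = m_hat * x0 + b_hat
    return y_hat - margin, y_hat + margin

# Made-up data with least-squares line y = 1.9x + 0; n = 4, so 2 d.f.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
lower, upper = prediction_interval(xs, ys, 1.9, 0.0, x0=2.5, t_crit=4.303)
print(lower, upper)
```

The interval is narrowest at $x_0 = \bar x$ and widens as $x_0$ moves away from the mean of the x-values.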
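Confidence Interval for $m$
Definition 6.4.5 Let $X$ and $Y$ be described by a simple linear model $Y = mX + b + \varepsilon$. A $(1-\alpha)100\%$ confidence interval estimate of $m$ is
$$\hat m - E < m < \hat m + E$$
where the margin of error is
$$E = t_{\alpha/2}\, \frac{s_e}{\sqrt{\sum (x_i - \bar x)^2}}$$
and $t_{\alpha/2}$ is a critical t-value with $n-2$ d.f.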
T-Test of the Slope
Let $X$ and $Y$ be described by a simple linear model $Y = mX + b + \varepsilon$. To test the null hypothesis
H0: $m = m_0$,
the test statistic is
$$t = \frac{\hat m - m_0}{s_e}\sqrt{\sum (x_i - \bar x)^2},$$
the critical value is a t-score with $n-2$ degrees of freedom, and the P-value is the area under the corresponding density curve.
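The confidence interval for the slope and the t-test of the slope share the same ingredients, $s_e$ and $\sqrt{\sum (x_i - \bar x)^2}$, so one helper can return both. A sketch with a hypothetical function name, made-up data, and a caller-supplied critical value (4.303 is the tabulated $t_{0.025}$ for 2 degrees of freedom):

```python
import math

def slope_inference(xs, ys, m_hat, b_hat, t_crit, m0=0.0):
    """Return ((lower, upper), t_stat) for the slope m.

    (lower, upper) is the confidence interval m_hat -/+ E;
    t_stat tests H0: m = m0.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    root_sxx = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    ss_res = sum((y - (m_hat * x + b_hat)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(ss_res / (n - 2))
    margin = t_crit * s_e / root_sxx
    t_stat = (m_hat - m0) * root_sxx / s_e
    return (m_hat - margin, m_hat + margin), t_stat

# Made-up data with least-squares line y = 1.9x + 0.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
(ci_low, ci_high), t_stat = slope_inference(xs, ys, 1.9, 0.0, t_crit=4.303)
print(ci_low, ci_high, t_stat)
```

For a two-sided test at the matching level, H0: $m = m_0$ is rejected exactly when $m_0$ falls outside the confidence interval.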
Coefficient of Determination
$r^2$ – The square of the sample correlation coefficient
$$r^2 = \frac{SS_{Tot} - SS_{Res}}{SS_{Tot}}$$
Interpretation
– “The proportion of the total variation in the $y$-values from $\bar y$ explained (or accounted for) by the regression equation.”
F-Test of the Slope
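Let $X$ and $Y$ be described by a simple linear model $Y = mX + b + \varepsilon$. To test the hypotheses
H0: $m = 0$ vs. H1: $m \ne 0$,
the test statistic is
$$f = \frac{SS_{Reg}}{SS_{Res}/(n-2)}$$
The critical value is $f_\alpha(1, n-2)$, the critical F-value with 1 and $n-2$ degrees of freedom. The P-value is the area under the corresponding density curve to the right of the test statistic.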
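The coefficient of determination and the F statistic can both be computed from the sums of squares. This sketch assumes the usual definitions, which the transcript does not spell out: $SS_{Tot} = \sum (y_i - \bar y)^2$, $SS_{Res} = \sum (y_i - \hat y_i)^2$, and $SS_{Reg} = SS_{Tot} - SS_{Res}$ (the last identity holds when $\hat m$ and $\hat b$ are the least-squares estimates). The function name and data are illustrative.

```python
def anova_simple(xs, ys, m_hat, b_hat):
    """Return (r_squared, f_stat) for a fitted simple linear regression."""
    n = len(xs)
    y_bar = sum(ys) / n
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    ss_res = sum((y - (m_hat * x + b_hat)) ** 2 for x, y in zip(xs, ys))
    ss_reg = ss_tot - ss_res  # valid for the least-squares line
    r_squared = ss_reg / ss_tot
    f_stat = ss_reg / (ss_res / (n - 2))
    return r_squared, f_stat

# Made-up data with least-squares line y = 1.9x + 0.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
r_squared, f_stat = anova_simple(xs, ys, 1.9, 0.0)
print(r_squared, f_stat)
```

In simple regression the F statistic equals the square of the t statistic for H0: $m = 0$, so the two tests of the slope agree.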
6.6 – Nonlinear Regression
Example: and are described by – Use the data below to estimate and – is linear with respect to – “Transform” the -values
Example 6.6.1
• People/physician ($x$)
• Male life expectancy ($y$)
(World Almanac Book of Facts, 1992, Pharos Books)
• Fit Power and Exponential models to the data
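Power and exponential models can be fitted with ordinary least squares after a logarithmic transformation. The sketch below assumes the usual forms $y = a x^b$ (power) and $y = a e^{bx}$ (exponential); the parameter names `a` and `b`, the function names, and the synthetic data are all illustrative and are not the World Almanac data.

```python
import math

def fit_line(us, vs):
    """Least-squares (slope, intercept) for v = slope * u + intercept."""
    n = len(us)
    su, sv = sum(us), sum(vs)
    suv = sum(u * v for u, v in zip(us, vs))
    suu = sum(u * u for u in us)
    slope = (n * suv - su * sv) / (n * suu - su ** 2)
    return slope, sv / n - slope * su / n

def fit_exponential(xs, ys):
    """y = a * exp(b * x)  <=>  ln y = ln a + b * x (transform y only)."""
    b, ln_a = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(ln_a), b

def fit_power(xs, ys):
    """y = a * x**b  <=>  ln y = ln a + b * ln x (transform x and y)."""
    b, ln_a = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(ln_a), b

# Synthetic data generated exactly from y = 3 * x**0.5.
a_pow, b_pow = fit_power([1.0, 4.0, 9.0, 16.0], [3.0, 6.0, 9.0, 12.0])

# Synthetic data generated exactly from y = 2 * exp(0.3 * x).
xs = [0.0, 1.0, 2.0, 3.0]
a_exp, b_exp = fit_exponential(xs, [2 * math.exp(0.3 * x) for x in xs])
print(a_pow, b_pow, a_exp, b_exp)
```

Both transformations require positive y-values, and the power model also requires positive x-values. The fit minimizes squared error on the log scale, which is not the same as minimizing it on the original scale.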
6.7 – Multiple Regression
Goal: Predict the value of a variable $Y$ in terms of two or more other variables $X_1, \ldots, X_k$
– $Y$ – response variable
– $X_1, \ldots, X_k$ – predictor variables
Assume a relation of the form
$$Y = m_1 X_1 + \cdots + m_k X_k + b + \varepsilon$$
– Use software to estimate coefficients
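To give a rough idea of what such software does, the coefficients can be estimated by solving the normal equations $(A^{T}A)\beta = A^{T}y$, where $A$ is the data matrix with an extra column of ones for $b$. This is a teaching sketch with hypothetical function names and synthetic data; statistical packages typically use more numerically stable decompositions.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def multiple_regression(rows, ys):
    """Return [m_1, ..., m_k, b] minimizing the sum of squared residuals."""
    design = [list(row) + [1.0] for row in rows]  # last column is for b
    p = len(design[0])
    ata = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    aty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(ata, aty)

# Synthetic data generated exactly from y = 2*x1 - 3*x2 + 5.
rows = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5)]
ys = [2 * x1 - 3 * x2 + 5 for x1, x2 in rows]
m1, m2, b = multiple_regression(rows, ys)
print(m1, m2, b)
```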
Outputs
Coefficients: Yield the multiple regression equation
Standard error: Use to calculate confidence interval estimate of the coefficients
where $t_{\alpha/2}$ is a critical t-value with $n-k-1$ d.f.
1 2 3ˆ 20.9 10339.6 2641.1 41510.9y x x x
$$\hat m_i - t_{\alpha/2}\, s_i < m_i < \hat m_i + t_{\alpha/2}\, s_i$$
t Stat: Test statistic for the hypotheses
H0: $m_i = 0$, H1: $m_i \ne 0$
in the presence of the other predictor variables
– Small P-value indicates that the variable is “statistically significant”
ANOVA Results
F – Test statistic for the hypotheses
H0: $m_1 = \cdots = m_k = 0$, H1: at least one $m_i$ is not 0
Significance F – Corresponding P-value
– Measures the “overall significance” of the set of predictor variables
– Small P-value: The set is “statistically significant”
Regression Statistics
Multiple R – Multiple regression equivalent of the sample correlation coefficient r
R Squared – Multiple coefficient of determination
Adjusted R Square – Calculated with the formula
$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$
– The higher the value, the better the overall quality of the model
Standard Error – Estimate of the standard deviation of the random variable $\varepsilon$ in the multiple regression model
– Also called the standard error of estimate
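The adjusted R Square formula is a one-liner. A minimal sketch with a hypothetical function name and made-up inputs, where `n` is the number of observations and `k` the number of predictor variables:

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Made-up values: R^2 = 0.9 from n = 20 observations and k = 3 predictors.
adj = adjusted_r_squared(0.9, n=20, k=3)
print(adj)  # approximately 0.88125
```

Unlike R Squared, the adjusted value can decrease when a predictor that adds little explanatory power is included, which is why it is used to compare models with different numbers of predictors.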