Post on 12-Jan-2016
Statistics for Business and Economics
Dr. TANG Yu
Department of Mathematics, Soochow University
May 28, 2007
Types of Correlation
Positive correlation
Slope is positive
Negative correlation
Slope is negative
No correlation
Slope is zero
Hypothesis Test
For the simple linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$
If x and y are linearly related, we must have $\beta_1 \neq 0$
We will use the sample data to test the following hypotheses about the parameter $\beta_1$
$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$
Sampling Distribution
• Just as the sampling distribution of the sample mean, X-bar, depends on the mean, standard deviation and shape of the X population, the sampling distributions of the β0-hat and β1-hat least squares estimators depend on the properties of the {Yj} sub-populations (j = 1, …, n).
Given xj, the properties of the {Yj} sub-population are determined by the εj error/random variable.
$y_j = \beta_0 + \beta_1 x_j + \varepsilon_j$
Model Assumption
As regards the probability distributions of εj (j = 1, …, n), it is assumed that:
i. Each εj is normally distributed; Yj is also normal;
ii. Each εj has zero mean; E(Yj) = β0 + β1 xj;
iii. Each εj has the same variance, σε²; Var(Yj) = σε² is also constant;
iv. The errors are independent of each other; {Yi} and {Yj}, i ≠ j, are also independent;
v. The error does not depend on the independent variable(s). The effects of X and ε on Y can be separated from each other.
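The assumptions can be made concrete by simulating data that satisfy them. The following is a minimal Python sketch; the function name `simulate` and the parameter values (β0 = 2.0, β1 = 0.5, σ = 1.0) are hypothetical and chosen only for illustration.

```python
import random

def simulate(xs, beta0, beta1, sigma, seed=0):
    """Draw one y per x from y = beta0 + beta1*x + eps, with eps ~ N(0, sigma).

    The errors are independent draws with zero mean and the same standard
    deviation for every x, and they do not depend on x (assumptions i to v).
    """
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = simulate(xs, beta0=2.0, beta1=0.5, sigma=1.0)
for x, y in zip(xs, ys):
    print(f"x = {x:.1f}  E(Y) = {2.0 + 0.5 * x:.2f}  y = {y:.2f}")
```

With `sigma=0.0` every simulated y falls exactly on the line E(Y) = β0 + β1 x, which shows that all scatter around the line comes from ε.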
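Graph Show
[Figure: E(Y) plotted against X, with the line $E(Y) = \beta_0 + \beta_1 X$ and the sub-population distributions drawn at xi and xj.]
Yi : N (β0+β1xi ; σ )
Yj : N (β0+β1xj ; σ )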
The y distributions have the same shape at each x value
Sum of Squares
Sum of squares due to error (SSE)
$SSE = \sum (Y_i - \hat{Y}_i)^2$
Sum of squares due to regression (SSR)
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$
Total sum of squares (SST)
$SST = \sum (Y_i - \bar{Y})^2 = SSE + SSR$
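The decomposition SST = SSE + SSR can be checked numerically. The following is a minimal Python sketch, assuming the seven (x, y) pairs of the LSD worked example in these slides and an ordinary least squares fit; the variable names are illustrative and not taken from the slides.

```python
# Data from the worked example: LSD concentration (x) and score (y).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept (unrounded).
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # regression
sst = sum((yi - y_bar) ** 2 for yi in y)               # total

print(f"SSE = {sse:.2f}, SSR = {ssr:.2f}, SST = {sst:.2f}")
print(f"SSE + SSR = {sse + ssr:.2f}")
```

Because the sketch keeps unrounded coefficients, its values can differ from the slide figures in the second decimal place.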
ANOVA Table
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square        F
Regression            SSR              1                    MSR = SSR/1        MSR/MSE
Error                 SSE              n-2                  MSE = SSE/(n-2)
Total                 SST              n-1
Example
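$\bar{x} = 30.33/7 = 4.333 \qquad \bar{y} = 350.61/7 = 50.087$
$\hat{\beta}_1 = S_{xy}/S_{xx} = -202.4872/22.4749 = -9.01$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 50.09 - (-9.01)(4.33) = 89.10$
$\hat{y} = 89.10 - 9.01x$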
Score (y)   LSD Conc (x)   x-xbar   y-ybar    (x-xbar)^2 [Sxx]   (x-xbar)(y-ybar) [Sxy]   (y-ybar)^2 [Syy]
78.93       1.17           -3.163   28.843    10.004569          -91.230409               831.918649
58.20       2.97           -1.363   8.113     1.857769           -11.058019               65.820769
67.47       3.26           -1.073   17.383    1.151329           -18.651959               302.168689
37.47       4.69           0.357    -12.617   0.127449           -4.504269                159.188689
45.65       5.83           1.497    -4.437    2.241009           -6.642189                19.686969
32.92       6.00           1.667    -17.167   2.778889           -28.617389               294.705889
29.97       6.41           2.077    -20.117   4.313929           -41.783009               404.693689
350.61      30.33          -0.001   0.001     22.474943          -202.487243              2078.183343   (Total)
SSE
$\hat{y} = 89.10 - 9.01x$
Y_i     Y_i-hat   Y_i - Y_i-hat   (Y_i - Y_i-hat)^2
78.93   78.5583   0.3717          0.138161
58.20   62.3403   -4.1403         17.14208
67.47   59.7274   7.7426          59.94785
37.47   46.8431   -9.3731         87.855
45.65   36.5717   9.0783          82.41553
32.92   35.04     -2.12           4.4944
29.97   31.3459   -1.3759         1.893101
Total (SSE)                       253.886
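The SSE column can be reproduced directly from the rounded fitted line. A short Python sketch, assuming the rounded equation ŷ = 89.10 - 9.01x quoted on the slide (the helper name `predict` is illustrative):

```python
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

def predict(xi):
    # Rounded fitted line from the slide: y-hat = 89.10 - 9.01 x
    return 89.10 - 9.01 * xi

residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
sse = sum(r * r for r in residuals)

# One row per observation: y, y-hat, residual, squared residual.
for xi, yi, r in zip(x, y, residuals):
    print(f"{yi:6.2f} {predict(xi):8.4f} {r:8.4f} {r * r:10.6f}")
print(f"SSE = {sse:.3f}")
```

Using the rounded coefficients matches the slide's table; an unrounded least squares fit gives a marginally smaller SSE.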
SST and SSR
$\hat{y} = 89.10 - 9.01x$
$S_{xx} = 22.475 \qquad S_{xy} = -202.487 \qquad S_{yy} = 2078.183$
$SST = S_{yy} = 2078.183$
$SSR = SST - SSE = 2078.183 - 253.89 = 1824.3$
ANOVA Table
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square     F
Regression            1824.3           1                    MSR = 1824.3    35.93
Error                 253.9            5                    MSE = 50.78
Total                 2078.2           6
As F = 35.93 > 6.61, where 6.61 is the critical value of the F-distribution with 1 and 5 degrees of freedom (significance level .05), we reject H0 and conclude that the relationship between x and y is significant.
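The F test above is a few lines of arithmetic. A minimal Python sketch, assuming the sums of squares from the ANOVA table and the critical value 6.61 quoted on the slide (it is copied from the slide, not computed here):

```python
ssr, sse, n = 1824.3, 253.9, 7   # values from the ANOVA table

msr = ssr / 1                    # regression mean square, 1 degree of freedom
mse = sse / (n - 2)              # error mean square, n - 2 degrees of freedom
f_stat = msr / mse

f_crit = 6.61                    # F(.05; 1, 5), as quoted on the slide
reject_h0 = f_stat > f_crit

print(f"MSE = {mse:.2f}, F = {f_stat:.2f}, reject H0: {reject_h0}")
```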
Standard Errors
Standard error of estimate: the sample standard deviation of ε.
$s_\varepsilon = \sqrt{MSE} = \sqrt{\dfrac{SSE}{n-2}}$
Replacing σε with its estimate, sε, the estimated standard error of β1-hat is
$s_{\hat{\beta}_1} = \dfrac{s_\varepsilon}{\sqrt{S_{xx}}} = \dfrac{s_\varepsilon}{\sqrt{\sum (x_i - \bar{x})^2}}$
t-test
Hypothesis
$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$
Test statistic
$t = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$
where t follows a t-distribution with n-2 degrees of freedom
Reject Rule
This is a two-tailed test
p-value approach: Reject $H_0$ if p-value $\le \alpha$
Critical value approach: Reject $H_0$ if $t \le -t_{\alpha/2}$ or if $t \ge t_{\alpha/2}$
Hypothesis
$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$
Calculation
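$s_\varepsilon = \sqrt{MSE} = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{253.89}{7-2}} = 7.1258$
$s_{\hat{\beta}_1} = \dfrac{s_\varepsilon}{\sqrt{S_{xx}}} = \dfrac{7.1258}{\sqrt{22.475}} = 1.5031$
$t = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \dfrac{-9.01}{1.5031} = -5.9943 < -2.571$
where 2.571 is the critical value $t_{.025}$ of the t-distribution with 5 degrees of freedom (two-tailed test at significance level .05, so α/2 = .025), so we reject H0 and conclude that the relationship between x and y is significant.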
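The same calculation as a Python sketch, assuming the summary values from the example (SSE = 253.886 from the SSE table, Sxx = 22.475, slope -9.01) and the critical value 2.571 quoted on the slide; the variable names are illustrative.

```python
import math

sse, n = 253.886, 7      # SSE from the example table, sample size
s_xx = 22.475            # sum of squared deviations of x
b1 = -9.01               # estimated slope (rounded, as on the slide)

s_eps = math.sqrt(sse / (n - 2))   # standard error of estimate
s_b1 = s_eps / math.sqrt(s_xx)     # estimated standard error of the slope
t_stat = b1 / s_b1

t_crit = 2.571                     # t(.025; 5), as quoted on the slide
reject_h0 = abs(t_stat) > t_crit   # two-tailed rejection rule

print(f"s = {s_eps:.4f}, s_b1 = {s_b1:.4f}, t = {t_stat:.4f}, reject H0: {reject_h0}")
```

In simple linear regression the square of this t statistic equals the F statistic of the ANOVA table (about 35.93), so the two tests lead to the same conclusion.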
Confidence Interval
So the C% confidence interval estimator of β1 is
$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\; s_{\hat{\beta}_1}$
The estimated standard error of β1-hat is
$s_{\hat{\beta}_1} = \dfrac{s_\varepsilon}{\sqrt{S_{xx}}} = \dfrac{s_\varepsilon}{\sqrt{\sum (x_i - \bar{x})^2}}$
β1-hat is an estimator of β1
$t = \dfrac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$ follows a t-distribution with n-2 degrees of freedom
Example
The 95% confidence interval estimator of β1 in the previous example is
$-9.01 \pm 2.571 \times 1.5031 = -9.01 \pm 3.86$
i.e., from -12.87 to -5.15, which does not contain 0
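The interval can be reproduced with a short Python sketch, assuming the rounded slide values for the slope, its standard error and the critical value (the names are illustrative):

```python
b1 = -9.01       # estimated slope
s_b1 = 1.5031    # estimated standard error of the slope
t_crit = 2.571   # t(.025; 5), as quoted on the slide

margin = t_crit * s_b1
lower, upper = b1 - margin, b1 + margin
contains_zero = lower <= 0.0 <= upper

print(f"95% CI for beta1: ({lower:.2f}, {upper:.2f}); contains 0: {contains_zero}")
```

An interval that excludes 0 agrees with rejecting H0: β1 = 0 in the two-tailed t-test at the same significance level.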
Regression Equation
It is believed that the longer one studies, the better one's grade is. The final mark (Y) on study time (X) is supposed to follow the regression equation:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 21.590 + 1.877x$
If the fit of the sample regression equation is satisfactory, it can be used to estimate its mean value or to predict the dependent variable.
Estimate and Predict
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 21.590 + 1.877x$
Predict: for a particular element of a Y sub-population.
E.g.: What is the final mark of Tom who spent 30 hours on studying? I.e., given x = 30, how large is y?
Estimate: for the expected value of a Y sub-population.
E.g.: What is the mean final mark of all those students who spent 30 hours on studying? I.e., given x = 30, how large is E(y)?
What Is the Same?
For a given X value, the point forecast (predict) of Y and the point estimator of the mean of the {Y} sub-population are the same:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Ex.1 Estimate the mean final mark of students who spent 30 hours on study.
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 21.590 + 1.877 \times 30 = 77.9$
Ex.2 Predict the final mark of Tom, when his study time is 30 hours.
What Is the Difference?
The interval prediction of Y and the interval estimation of the mean of the {Y} sub-population are different:
The prediction
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
The estimation
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
The prediction interval is wider than the confidence interval
Estimation and Prediction
$\hat{y} = 89.10 - 9.01x$, for $x_g = 5.0$
The point forecast (predict) of Y and the point estimator of the mean of the {Y} sub-population are the same:
$\hat{y} = 89.10 - 9.01 \times 5.0 = 44.05$
Estimation and Prediction
$\hat{y} = 89.10 - 9.01x$, for $x_g = 5.0$
But for the interval estimation and prediction, it is different:
Data Needed
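$s_\varepsilon = \sqrt{MSE} = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{253.89}{7-2}} = 7.1258$
$\sum (x_i - \bar{x})^2 = S_{xx} = 22.475$
$t_{.025} = 2.571$
For $x_g = 5.0$
The prediction
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
The estimation
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$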
Calculation
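Estimation
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 44.05 \pm 2.571 \times 7.1258 \sqrt{\dfrac{1}{7} + \dfrac{(5.0 - 4.333)^2}{22.475}} = 44.05 \pm 7.3887$
Prediction
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 44.05 \pm 2.571 \times 7.1258 \sqrt{1 + \dfrac{1}{7} + \dfrac{(5.0 - 4.333)^2}{22.475}} = 44.05 \pm 19.7543$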
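Both margins can be computed with one helper. A minimal Python sketch, assuming the rounded slide values (n = 7, x-bar = 4.333, Sxx = 22.475, s = 7.1258, t = 2.571); the function name `interval_margins` is illustrative.

```python
import math

def interval_margins(x_g, n, x_bar, s_xx, s_eps, t_crit):
    """Return (estimation margin, prediction margin) at x_g."""
    core = 1 / n + (x_g - x_bar) ** 2 / s_xx
    estimation = t_crit * s_eps * math.sqrt(core)       # mean of the Y sub-population
    prediction = t_crit * s_eps * math.sqrt(1 + core)   # a single new Y value
    return estimation, prediction

x_g = 5.0
y_hat = 89.10 - 9.01 * x_g   # point estimate, the same for both intervals
est, pred = interval_margins(x_g, n=7, x_bar=4.333, s_xx=22.475,
                             s_eps=7.1258, t_crit=2.571)

print(f"Estimation: {y_hat:.2f} +/- {est:.4f}")
print(f"Prediction: {y_hat:.2f} +/- {pred:.4f}")
```

The only difference between the two margins is the extra 1 under the square root, which accounts for the variance of an individual observation around its mean.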
Moving Rule
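$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
The confidence interval when $x_g = \bar{x}$: $\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n}}$
The confidence interval when $x_g = \bar{x} \pm 1$: $\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{1^2}{\sum (x_i - \bar{x})^2}}$
The confidence interval when $x_g = \bar{x} \pm 2$: $\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{2^2}{\sum (x_i - \bar{x})^2}}$
As xg moves away from x-bar the interval becomes longer. That is, the shortest interval is found at x-bar.
[Figure: confidence intervals drawn at $\bar{x} - 2$, $\bar{x} - 1$, $\bar{x}$, $\bar{x} + 1$ and $\bar{x} + 2$.]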
Moving Rule
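$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
The prediction interval when $x_g = \bar{x}$: $\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \dfrac{1}{n}}$
The prediction interval when $x_g = \bar{x} \pm 1$: $\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{1^2}{\sum (x_i - \bar{x})^2}}$
The prediction interval when $x_g = \bar{x} \pm 2$: $\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{2^2}{\sum (x_i - \bar{x})^2}}$
As xg moves away from x-bar the interval becomes longer. That is, the shortest interval is found at x-bar.
[Figure: prediction intervals drawn at $\bar{x} - 2$, $\bar{x} - 1$, $\bar{x}$, $\bar{x} + 1$ and $\bar{x} + 2$.]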
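The moving rule can be illustrated by tabulating the half-width of each interval against the distance of xg from x-bar. A Python sketch, assuming the summary values of the LSD example (n = 7, Sxx = 22.475, s = 7.1258, t = 2.571); the function name `half_width` is hypothetical.

```python
import math

N, S_XX, S_EPS, T_CRIT = 7, 22.475, 7.1258, 2.571   # values from the LSD example

def half_width(distance, prediction=False):
    """Half-width of the interval at x_g = x_bar + distance."""
    extra = 1.0 if prediction else 0.0   # the extra 1 applies to prediction only
    return T_CRIT * S_EPS * math.sqrt(extra + 1 / N + distance ** 2 / S_XX)

for d in (0, 1, 2):
    print(f"|x_g - x_bar| = {d}: "
          f"estimation +/- {half_width(d):.2f}, "
          f"prediction +/- {half_width(d, prediction=True):.2f}")
```

Only the squared distance enters the formula, so the intervals are symmetric around x-bar and shortest exactly at x-bar.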
Interval Estimation
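[Figure: estimation and prediction intervals around the regression line, plotted over $\bar{x} - 2$, $\bar{x} - 1$, $\bar{x}$, $\bar{x} + 1$ and $\bar{x} + 2$.]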
Residual Analysis
Regression residual: the difference between an observed y value and its corresponding predicted value
$r = y - \hat{y}$
Properties of Regression Residual
The mean of the residuals equals zero
The standard deviation of the residuals is equal to the standard deviation of the fitted regression model
Example
$\hat{y} = 89.10 - 9.01x$
Score (y)   LSD Conc (x)   y-hat    residual (r)
78.93       1.17           78.558   0.3717
58.20       2.97           62.34    -4.1403
67.47       3.26           59.727   7.7426
37.47       4.69           46.843   -9.3731
45.65       5.83           36.572   9.0783
32.92       6.00           35.04    -2.12
29.97       6.41           31.346   -1.3759
Residual Plot Against x
[Figure: residuals r plotted against x.]
Residual Plot Against y-hat
[Figure: residuals r plotted against y-hat.]
Three Situations
Good Pattern
Non-constant Variance
Model form not adequate
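Standardized Residual
Standard deviation of the ith residual
$s_{y_i - \hat{y}_i} = s\sqrt{1 - h_i}$
where
$s_{y_i - \hat{y}_i}$ = the standard deviation of residual i
$s$ = the standard error of the estimate
$h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$
Standardized residual for observation i
$z_i = \dfrac{y_i - \hat{y}_i}{s_{y_i - \hat{y}_i}}$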
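These formulas can be applied to the LSD example. The following Python sketch assumes the seven (x, y) pairs from the example and an unrounded least squares fit, so the residuals differ slightly from the table built with the rounded line; the variable names are illustrative.

```python
import math

x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r * r for r in resid) / (n - 2))    # standard error of estimate

h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]    # leverage of each observation
z = [r / (s * math.sqrt(1 - hi)) for r, hi in zip(resid, h)]

for xi, r, hi, zi in zip(x, resid, h, z):
    print(f"x = {xi:4.2f}  r = {r:8.4f}  h = {hi:.4f}  z = {zi:7.4f}")
```

In simple linear regression the leverages always sum to 2, which is a quick sanity check on the computation.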
Standardized Residual Plot
[Figure: standardized residuals z plotted against x.]
Standardized Residual
The standardized residual plot can provide insight about the assumption that the error term has a normal distribution
It is expected to see approximately 95% of the standardized residuals between –2 and +2
If the assumption is satisfied, the distribution of the standardized residuals should appear to come from a standard normal probability distribution
Detecting Outlier
[Figure: scatter plots with fitted lines, marking an outlier and an influential observation.]
High Leverage Points
Leverage of observation
$h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$
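For example
x: 10, 10, 15, 20, 20, 25, 70
$\bar{x} = 24.2857$
For the observation with $x_i = 70$:
$h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2} = \dfrac{1}{7} + \dfrac{(70 - 24.2857)^2}{\sum_j (x_j - 24.2857)^2} = .94 > \dfrac{6}{n} = \dfrac{6}{7} = .86$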
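The leverage calculation for this example can be sketched in Python. The 6/n cut-off is the one used in the comparison above; the variable names are illustrative.

```python
x = [10, 10, 15, 20, 20, 25, 70]

n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage of each observation: 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

threshold = 6 / n   # cut-off used in the example above
for xi, hi in zip(x, h):
    flag = "high leverage" if hi > threshold else ""
    print(f"x = {xi:2d}  h = {hi:.2f}  {flag}")
```

Only the observation at x = 70 exceeds the cut-off; the other six leverages stay far below it.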
Contact Information
Tang Yu ( 唐煜 ) ytang@suda.edu.cn http://math.suda.edu.cn/homepage/tangy