13. Linear Regression and Correlation

Outline
• Data: two continuous measurements on each subject
• Goal: study the relationship between the two variables
• PART I: correlation analysis
  – Study the relationship between two continuous variables.
  – Steps:
    • Scatter diagram
    • Correlation coefficient: calculation, meaning, hypothesis testing
• PART II: linear regression
  – Construct a linear equation between two variables.
  – Steps:
    • Model building
    • Model estimating: confidence intervals and prediction intervals
    • Model fitting: strength of the linear association, coefficient of determination
– X: explanatory variable (categorical).
  • Ex. gender (binary), brand (3-level)
– Y: response variable (continuous or binary).
  • Ex. score, success/failure, yield
– Q: are X and Y correlated?
  • A: if Y is continuous, compare the population means of Y in the groups defined by X.
– Ex. Z-test, t-test, ANOVA F-test
• Recall: in Ch. 11 and 12,
  – Q: are X and Y correlated?
    • A: if X and Y are both binary, compare the population proportions of Y in the two groups defined by X.
  – Ex.: when the sample sizes are large, a Z-test is used.
  – Q: how do we determine the correlation if X and Y are both continuous?
    • A: correlation and regression analysis!
• Data:
  – A sample of n sets of observations.
  – k continuous variables are measured in each observation.
  – Example: n = 10 students are surveyed and k = 3 scores are recorded for each.
  – Question: is there any association between the scores?

Student  Score 1  Score 2  Score 3
1        82       67       56
2        89       99       70
3        45       31       42
4        74       66       67
5        75       86       99
6        69       39       75
7        70       86       67
8        47       61       86
9        92       88       75
10       92       79       54
• What is correlation analysis?
  – Study the relationship between several continuous variables.
  – Measure the strength of the association between variables.
• Correlation analysis consists of:
  – Step 1. Scatter diagram: plot (X1, X2).
  – Step 2. Coefficient of correlation.
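• Population coefficient of correlation, ρ:
  – A measure of the strength of the linear relationship between two variables.
  – Definition (population correlation coefficient): $\rho = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$
  – Estimation (sample correlation coefficient): $r = \dfrac{S_{xy}}{S_x S_y}$, where
    $S_{xy} = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{n-1}$, $S_x^2 = \dfrac{\sum (x-\bar{x})^2}{n-1}$, $S_y^2 = \dfrac{\sum (y-\bar{y})^2}{n-1}$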
• Properties:
  – -1 ≤ r ≤ 1
  – "Positive linear association": r > 0
  – "Negative linear association": r < 0
  – "No linear relation": r ≈ 0 (! other relations may exist)
  – "Strongly positive linear association": r ≈ 1
  – "Strongly negative linear association": r ≈ -1
• Why such a definition?
  – If there is a strongly positive linear association, then when x is large, y is large, and Sxy takes a large positive value.
  – If there is a strongly negative linear association, then when x is large, y is small, and Sxy takes a large negative value.
  – If there is no relation, then when x is large, some y are large and some are small, so Sxy ≈ 0 and r ≈ 0.
EXCEL results for the three pairs of scores:
1. r1 = 0.033
2. r2 = 0.7151
3. r3 = 0.3754
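As a cross-check on the spreadsheet values, the sample correlation can be computed directly from its definition. The sketch below is only an illustration: the variable names `score1`, `score2`, `score3` and the helper functions are assumptions introduced here, and the pairing of each r with a pair of columns is inferred from the numbers rather than stated on the slides.

```python
import math

# Scores of the n = 10 students (the three score columns of the data table).
score1 = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
score2 = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]
score3 = [56, 70, 42, 67, 99, 75, 67, 86, 75, 54]


def sample_cov(x, y):
    """Sample covariance S_xy = sum((x - xbar)(y - ybar)) / (n - 1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)


def sample_corr(x, y):
    """Sample correlation r = S_xy / (S_x * S_y)."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))


r13 = sample_corr(score1, score3)
r12 = sample_corr(score1, score2)
r23 = sample_corr(score2, score3)
print(f"r(1,3) = {r13:.4f}, r(1,2) = {r12:.4f}, r(2,3) = {r23:.4f}")
```

Calling `sample_cov(x, x)` gives the sample variance, so one helper covers Sxy, Sx² and Sy².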
Population: N = ∞ subjects, with unknown correlation ρ.
Is "H0: ρ = 0" true? Unknown! Use the sample correlation r to test it
-- a t-test!
• Testing the null hypothesis of no correlation: ρ = 0
• Step 1. State the hypotheses
  – H0: no correlation vs. H1: correlated
  – H0: ρ = 0 vs. H1: ρ ≠ 0
• Step 2. Select the significance level α
• Step 3. Identify the test statistic
  – $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
  – Note that under the null hypothesis, t follows a t-distribution with d.f. = n - 2.
• Step 4. Formulate the decision rule
  – A two-sided test;
  – A t-test;
  – With significance level α, H0 should be rejected if $t > t_{\alpha/2,\,n-2}$ or $t < -t_{\alpha/2,\,n-2}$
• Step 5. Collect data, compute the t-value, draw a conclusion
• Example. At α = 0.05, n = 10, d.f. = 10 - 2 = 8, t(0.025, 8) = 2.306
• Test 1 (scores 1 and 3):
  – H0: ρ1 = 0 vs. H1: ρ1 ≠ 0
  – Since r1 = 0.033 and n = 10, $t = \dfrac{0.033\sqrt{10-2}}{\sqrt{1-(0.033)^2}} = 0.093$
  – Since -2.306 < t = 0.093 < 2.306, H0 is not rejected.
  – Conclusion: there is not sufficient evidence to reject the null hypothesis of no correlation.
• Example. At α = 0.05, n = 10, d.f. = 10 - 2 = 8, t(0.025, 8) = 2.306
• Test 2 (scores 1 and 2):
  – H0: ρ2 = 0 vs. H1: ρ2 ≠ 0
  – Since r2 = 0.7151 and n = 10, $t = \dfrac{0.7151\sqrt{10-2}}{\sqrt{1-(0.7151)^2}} = 2.89$
  – Since t = 2.89 > 2.306, H0 is rejected.
  – Conclusion: there is sufficient evidence to reject the null hypothesis of no correlation.
• Example. At α = 0.05, n = 10, d.f. = 10 - 2 = 8, t(0.025, 8) = 2.306
• Test 3 (scores 2 and 3):
  – H0: ρ3 = 0 vs. H1: ρ3 ≠ 0
  – Since r3 = 0.3755 and n = 10, $t = \dfrac{0.3755\sqrt{10-2}}{\sqrt{1-(0.3755)^2}} = 1.15$
  – Since -2.306 < t = 1.15 < 2.306, H0 is not rejected.
  – Conclusion: there is not sufficient evidence to reject the null hypothesis of no correlation.
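The three tests follow the same recipe, so they can be sketched as one small routine. This is a minimal illustration, not the slides' own procedure: the function names are made up here, and the critical value 2.306 is simply copied from the examples instead of being computed from the t-distribution.

```python
import math


def corr_t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def reject_no_correlation(r, n, t_crit):
    """Two-sided decision rule: reject H0 (rho = 0) when |t| > t_crit."""
    return abs(corr_t_statistic(r, n)) > t_crit


T_CRIT = 2.306  # t(0.025, 8), taken from the examples above
for label, r in [("Test 1", 0.033), ("Test 2", 0.7151), ("Test 3", 0.3755)]:
    t = corr_t_statistic(r, 10)
    decision = "reject H0" if reject_no_correlation(r, 10, T_CRIT) else "do not reject H0"
    print(f"{label}: t = {t:.2f}, {decision}")
```

Passing the critical value as an argument keeps the sketch free of any statistics library; with a different n or α, the value would have to be looked up again.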
! A word of caution: when the H0 of no correlation has been rejected,
! only a linear relationship between the variables is ascertained.
!   Quadratic? Cubic? Other forms of relationship are not addressed by this test.
! No "cause and effect" relationship is established.
! Beware of spurious correlations.
• Variables:
  – X = independent variable(s), explanatory variable, predictor.
  – Y = dependent variable, response variable.
    • To be predicted or estimated.
• Regression analysis:
  – Develop an equation/function that allows us to estimate/predict Y based on X.
  – Example: predict Y for a subject with X = 60.
ANOVA vs. Regression
• Recall: in a one-way ANOVA
  – AGE vs. INCOME
  – The whole population is classified into three sub-populations by "AGE":
    • a young population,
    • a middle-aged population,
    • a senior population.
  – The INCOMEs of all sub-populations are normally distributed with the same variance.
  – Research question: are the mean INCOMEs, μincome, the same?
• In a simple linear regression model,
  – (X) vs. (Y)
  – The whole population is classified into many sub-populations by X:
    • X = 0 population; X = 1 population; ...; X = 100 population.
  – The Y values of all sub-populations are normally distributed with the same variance.
  – Research question: are the means μY the same? Establish the relationship between μY and X.
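Regression Model: (P449)
1. Given each value of X, there is a group of Y values.
  – Ex. all subjects with X = 60 form one group of Y values; all subjects with X = 50 form another.
2. At each value of X, the Y values follow a normal distribution.
  – At X = 60, $Y \sim N(\mu_{Y|X=60}, \sigma^2)$
  – At X = 50, $Y \sim N(\mu_{Y|X=50}, \sigma^2)$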
3. The means of these normal distributions are a linear function of x.
  – Example: $\mu_{Y|x} = -30 + 1.5x$
(Figure: plot of the example line, axis ticks from 0 to 100.)
4. The standard deviations of these normal distributions are all the same (they do not depend on x).
• In summary, the model is $Y \sim N(\alpha + \beta x, \sigma^2)$.
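One way to get a feel for the four assumptions is to simulate them. The sketch below draws Y values at fixed X from N(α + βx, σ²). It uses the example line -30 + 1.5x; the value σ = 5, the sample size and the function name `simulate_y` are hypothetical choices made only for this illustration.

```python
import random
import statistics


def simulate_y(x, alpha, beta, sigma, size, rng):
    """Draw `size` values of Y at a fixed x from N(alpha + beta * x, sigma^2)."""
    return [rng.gauss(alpha + beta * x, sigma) for _ in range(size)]


rng = random.Random(0)      # fixed seed so the run is repeatable
ALPHA, BETA = -30.0, 1.5    # the example line
SIGMA = 5.0                 # hypothetical value; the slides do not give sigma

for x in (50, 60):
    ys = simulate_y(x, ALPHA, BETA, SIGMA, 20000, rng)
    print(f"X = {x}: mean of Y = {statistics.mean(ys):.2f}, s.d. of Y = {statistics.stdev(ys):.2f}")
```

The group means should land near 45 and 60 (the line evaluated at X = 50 and X = 60), while the standard deviation stays near the same σ in both groups.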
• α, β and σ² are unknown. In practice, only sample data are collected.
• How do we estimate the regression equation from sample data?
• Regression equation: $\mu_{Y|x} = \alpha + \beta x$, with α = ?, β = ?, σ² = ?
• Let a, b be estimates of α, β.
• Predicted equation: Y' = a + bx. It can be read as
  1. the predicted value of Y for an individual with X = x (ex. X = 60);
  2. the estimated mean of Y, $\mu_{Y|x}$, at X = x (ex. X = 60).
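• In the predicted equation, the intercept a = ? The slope b = ?
• Least squares estimates (LSE) a, b:
  – Principle: find the regression equation that minimizes the sum of squared errors, $\sum (y - y')^2 = \sum (y - a - bx)^2$.
• Estimated regression coefficients: $b = \dfrac{S_{xy}}{S_x^2}$, $a = \bar{y} - b\bar{x}$, where
  $S_{xy} = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{n-1} = \dfrac{\sum xy - n\bar{x}\bar{y}}{n-1}$,
  $S_x^2 = \dfrac{\sum (x-\bar{x})^2}{n-1} = \dfrac{\sum x^2 - n\bar{x}^2}{n-1}$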
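The least squares formulas translate almost line by line into code. The sketch below assumes X and Y are the first two score columns of the 10-student data set (the pairing used in the worked example); the function name `least_squares` is introduced here for illustration.

```python
# X, Y: first two score columns of the n = 10 students (assumed pairing).
x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]


def least_squares(x, y):
    """Return (a, b) with b = S_xy / S_x^2 and a = ybar - b * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # Computational forms: {sum(xy) - n*xbar*ybar}/(n-1) and {sum(x^2) - n*xbar^2}/(n-1).
    s_xy = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / (n - 1)
    s_xx = (sum(xi * xi for xi in x) - n * xbar ** 2) / (n - 1)
    b = s_xy / s_xx
    a = ybar - b * xbar
    return a, b


a, b = least_squares(x, y)
print(f"Y' = {a:.4f} + {b:.4f} X")
```

Because the (n - 1) divisors cancel in the ratio Sxy / Sx², the slope could equally be computed from the raw sums; they are kept here only to mirror the slide formulas.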
Meaning of the estimated intercept, a
• a = Y' at X = 0.
  – The estimated value of $\mu_{Y|X=0}$, the mean of Y when X = 0.
  – The predicted value of Y when X = 0.
• a is an estimate of the true intercept α.
• One may be interested in testing H0: α = 0.
• When α = 0, the equation passes through the origin: $\mu_{Y|x} = \beta x$.
Meaning of the estimated slope, b
• b = the increment in Y' per unit change of x.
  – When x changes by one unit, b is the estimated increment/decrement in $\mu_{Y|x}$.
  – Example: in the previous case, if b = 0.2, then when X increases by 1, the estimated $\mu_{Y|x}$ increases by 0.2.
• b is an estimate of the true slope β.
• One is more interested in testing H0: β = 0.
• When β = 0, the equation is a constant and does not depend on X: $\mu_{Y|x} = \alpha$, $Y \sim N(\alpha, \sigma^2)$.
  – The distribution of Y is uncorrelated with X.
  – X and Y are independent!
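No.   X     Y     XY
1     82    67    5494
2     89    99    8811
3     45    31    1395
4     74    66    4884
5     75    86    6450
6     69    39    2691
7     70    86    6020
8     47    61    2867
9     92    88    8096
10    92    79    7268
sum   735   702   53976
mean  73.5  70.2

$S_{xy} = \{\sum xy - n\bar{x}\bar{y}\}/(n-1) = \{53976 - 10(73.5)(70.2)\}/(10-1) = 264.333$
$b = S_{xy}/S_x^2 = 264.333/282.94 \approx 0.9342$
$a = \bar{y} - b\bar{x} = 70.2 - 0.9342 \times 73.5 \approx 1.54$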
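EXCEL regression output:

Model fitting (ANOVA)
Source      d.f.  SS        MS        F       Significance F
Regression  1     2222.52   2222.52   8.3747  0.0201
Residual    8     2123.08   265.39

Model estimating
           Coefficient  Standard error  t       P-value  Lower 95%  Upper 95%
Intercept  1.5346       24.2804         0.0632  0.9512   -54.4562   57.5253
X          0.9342       0.3228          2.8939  0.0201   0.1898     1.6787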
EXCEL output (coefficients):
           Coefficient  Standard error  t       P-value  Lower 95%  Upper 95%
Intercept  1.5346       24.2804         0.0632  0.9512   -54.4562   57.5253
X          0.9342       0.3228          2.8939  0.0201   0.1898     1.6787
Note: the difference from the previous calculation is due to rounding error.
• Coefficient: a, b are estimates of α, β
• Standard error: SE(a), SE(b)
• t: t-value(a) = a/SE(a), t-value(b) = b/SE(b)
• P-value:
  – p-value(a) = 0.9512 > 0.05: do not reject α = 0
  – p-value(b) = 0.0201 < 0.05: reject β = 0, so β ≠ 0
• Lower 95%, Upper 95%: 95% confidence intervals for α, β
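The coefficient rows of the spreadsheet output can be rebuilt by hand. The slides do not show the standard error formulas, so the sketch below assumes the usual textbook ones, SE(b) = s / sqrt(Σ(x - x̄)²) and SE(a) = s · sqrt(1/n + x̄² / Σ(x - x̄)²), with s the standard error of estimate. The function name `coefficient_table` and the hard-coded critical value 2.306 = t(0.025, 8) are likewise choices made for this illustration.

```python
import math

# X, Y: first two score columns of the n = 10 students (assumed pairing).
x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]
T_CRIT = 2.306  # t(0.025, 8), as used in the earlier t-test examples


def coefficient_table(x, y, t_crit):
    """Estimate, standard error, t-value and confidence limits for a and b."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                       # standard error of estimate
    se_b = s / math.sqrt(sxx)                          # assumed textbook formula
    se_a = s * math.sqrt(1 / n + xbar ** 2 / sxx)      # assumed textbook formula
    rows = {}
    for name, est, se in (("Intercept", a, se_a), ("X", b, se_b)):
        rows[name] = {
            "coef": est,
            "se": se,
            "t": est / se,
            "lower": est - t_crit * se,
            "upper": est + t_crit * se,
        }
    return rows


table = coefficient_table(x, y, T_CRIT)
for name, row in table.items():
    print(name, {key: round(value, 4) for key, value in row.items()})
```

The p-values are left out on purpose: they need the t-distribution's tail probabilities, which the Python standard library does not provide.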
$\mu_{Y|x} = \alpha + \beta x \approx a + bx = Y'$
The standard error of estimate:
• Variance σ²:
  – Dispersion of Y around the regression line.
  – The variation of the random "error",
    Error = Y - μ(Y|x) = Y - (α + βx): unobtainable
• Standard error of estimate:
  – Use "residuals" to estimate the "error",
    Residual = Y - (a + bx) = Y - Y': observable
  – The standard error of estimate is defined by
    $S_{y \cdot x} = \sqrt{\dfrac{\sum (Y - Y')^2}{n-2}} = \sqrt{\dfrac{n-1}{n-2}\,(S_y^2 - b^2 S_x^2)}$
    where Sy: sample s.d. of Y, Sx: sample s.d. of X
• $S_{y \cdot x}^2$ estimates σ²:
  – The random variation that is unexplained by the regression line.
Example. Y' = 1.53 + 0.93X
No.  X    Y    Y'      Y-Y'     (Y-Y')^2
1    82   67   78.14   -11.14   124.12
2    89   99   84.68   14.32    205.05
3    45   31   43.57   -12.57   158.12
4    74   66   70.67   -4.67    21.78
5    75   86   71.60   14.40    207.32
6    69   39   66.00   -27.00   728.78
7    70   86   66.93   19.07    363.66
8    47   61   45.44   15.56    242.02
9    92   88   87.48   0.52     0.27
10   92   79   87.48   -8.48    71.96
sum                             2123.08
Note: $S_{y \cdot x} = \sqrt{2123.08/8} = 16.29$; equivalently, $S_{y \cdot x} = \sqrt{\tfrac{9}{8}\,(482.84 - 282.94\,(0.93)^2)} \approx 16.3$.
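The residual table and the two forms of the standard error of estimate can be checked numerically. The sketch below refits the line from the raw scores (X and Y assumed to be the first two score columns), so its residuals are based on unrounded coefficients and may differ from the table in the last digit.

```python
import math

# X, Y: first two score columns of the n = 10 students (assumed pairing).
x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
s_x2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)   # sample variance of X
s_y2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # sample variance of Y
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

b = s_xy / s_x2
a = ybar - b * xbar

# Residuals Y - Y' and their sum of squares.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in residuals)

# Definition, and the shortcut form based on Sy^2 and Sx^2.
s_yx = math.sqrt(sse / (n - 2))
s_yx_shortcut = math.sqrt((n - 1) / (n - 2) * (s_y2 - b ** 2 * s_x2))

print(f"SSE = {sse:.2f}, S_yx = {s_yx:.2f}, shortcut = {s_yx_shortcut:.2f}")
```

With unrounded b the two forms agree; the small gap between 16.29 and 16.3 in the note comes from using b rounded to 0.93.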

ESTIMATION & PREDICTION: confidence intervals and prediction intervals
• ESTIMATION:
  – Q: at X = x, what is the mean value of Y, $\mu_{Y|x} = ?$
  – Point estimation, confidence interval
• PREDICTION:
  – Q: if an individual is drawn from the population with X = x, Y = ?
  – Point prediction, prediction interval
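Confidence interval of $\mu_{Y|x}$ at X = x
• Confidence interval: at X = x, the mean value of Y, $\mu_{Y|x}$
  – Point estimation: Y' = a + bx
  – 100(1-α)% confidence interval: $Y' \pm t_{\alpha/2,\,n-2}\; S_{y \cdot x}\sqrt{\dfrac{1}{n} + \dfrac{(x-\bar{x})^2}{(n-1)S_x^2}}$
• Example. Y' = 1.53 + 0.93X, X = 60.
  Ans.
  1. Point estimate: Y' = 1.53 + 0.93(60) = 57.33
  2. 95% confidence interval: $57.33 \pm 2.306 \times 16.29 \sqrt{\dfrac{1}{10} + \dfrac{182.25}{9(282.94)}} = 57.33 \pm 15.56$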
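Prediction interval of Y at X = x
• Prediction interval: if an individual is drawn from the population with X = x, Y = ?
  – Prediction: Y' = a + bx
  – 100(1-α)% prediction interval: $Y' \pm t_{\alpha/2,\,n-2}\; S_{y \cdot x}\sqrt{1 + \dfrac{1}{n} + \dfrac{(x-\bar{x})^2}{(n-1)S_x^2}}$
• Example. Y' = 1.53 + 0.93X, X = 60.
  – Y' = 57.33, t = 2.306, s = 16.29, n = 10, $(x-\bar{x})^2 = (60-73.5)^2 = 182.25$, $S_x^2 = 282.94$
  – 95% prediction interval: $57.33 \pm 2.306 \times 16.29 \sqrt{1 + \dfrac{1}{10} + \dfrac{182.25}{9(282.94)}} = 57.33 \pm 40.66$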
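The confidence interval and the prediction interval differ only by the extra "1 +" under the square root, which a short sketch makes easy to see. The function name `interval_half_widths` is an assumption for illustration; the inputs are the rounded values quoted in the example (s = 16.29, t = 2.306, Sx² = 282.94, x̄ = 73.5).

```python
import math


def interval_half_widths(x0, xbar, n, s, s_x2, t_crit):
    """Half-widths of the confidence interval (mean of Y) and prediction interval (single Y) at x0."""
    leverage = 1 / n + (x0 - xbar) ** 2 / ((n - 1) * s_x2)
    ci = t_crit * s * math.sqrt(leverage)        # for the mean of Y at x0
    pi = t_crit * s * math.sqrt(1 + leverage)    # for one new individual at x0
    return ci, pi


y_hat = 1.53 + 0.93 * 60   # point estimate / prediction at X = 60
ci, pi = interval_half_widths(x0=60, xbar=73.5, n=10, s=16.29, s_x2=282.94, t_crit=2.306)
print(f"Y' = {y_hat:.2f}")
print(f"95% confidence interval: {y_hat - ci:.2f} to {y_hat + ci:.2f}")
print(f"95% prediction interval: {y_hat - pi:.2f} to {y_hat + pi:.2f}")
```

The prediction interval is always the wider of the two, because it carries the variation of an individual Y around its mean in addition to the uncertainty of the estimated line.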
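RECALL: ANOVA table
– SS total = total variation of Y = $\sum (Y - \bar{Y})^2$
  • D.f. = n - 1 for n observations.
  • MStotal = SS total/(n-1)
– SST = variation due to treatment = $\sum_j n_j (\bar{Y}_j - \bar{Y})^2$
  • $\bar{Y}_j$ = estimated mean of Y in the treatment-j group, which has $n_j$ observations
  • D.f. = k - 1 for k treatments
  • MST = SST/(k-1)
– SSE = variation due to random error = $\sum_j \sum_i (Y_{ij} - \bar{Y}_j)^2$
  • D.f. = n - k
  • MSE = SSE/(n-k)
– SS total = SST + SSE

Source      SS    Degrees of Freedom   Mean Square        F
Treatment   SST   k-1                  SST/(k-1) = MST    MST/MSE
Error       SSE   n-k                  SSE/(n-k) = MSE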
• SStotal = total variation of Y
• SSR = the variation explained by the regression model
• SSE = the unexplained variation
$SS_{total} = \sum (Y-\bar{Y})^2 = \sum (Y'-\bar{Y})^2 + \sum (Y-Y')^2 = SSR + SSE$
• SStotal = $\sum (Y-\bar{Y})^2 = (n-1)S_Y^2$
  – D.f. = n - 1 for n observations.
  – MStotal = SS total/(n-1)
• SSR = $\sum (Y'-\bar{Y})^2 = (n-1)\,b^2 S_X^2$, due to the regression model
  – Y' = estimated mean of Y at some X-level
  – D.f. = 2 - 1 = 1 for 2 regression coefficients
  – MSR = SSR/1 = SSR
• SSE = $\sum (Y-Y')^2 = (n-2)\,S_{y \cdot x}^2$, due to random error
  – D.f. = n - 2
  – MSE = SSE/(n-2) = $S_{y \cdot x}^2$
Source      SS    Degrees of Freedom   Mean Square        F
Regression  SSR   2-1 = 1              SSR/1 = MSR        MSR/MSE
Error       SSE   n-2                  SSE/(n-2) = MSE

• H0: β = 0 vs. H1: β ≠ 0
  – Under H0, the regression line is horizontal.
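The entries of the regression ANOVA table can be computed straight from the decomposition SStotal = SSR + SSE. The sketch below assumes X and Y are the first two score columns of the 10-student data set; all variable names are illustrative.

```python
# X, Y: first two score columns of the n = 10 students (assumed pairing).
x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
fitted = [a + b * xi for xi in x]

ss_total = sum((yi - ybar) ** 2 for yi in y)              # total variation of Y
ssr = sum((fi - ybar) ** 2 for fi in fitted)              # explained by the regression
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # unexplained

msr = ssr / 1          # d.f. = 2 - 1
mse = sse / (n - 2)    # d.f. = n - 2
f_value = msr / mse

print(f"Regression: SS = {ssr:.3f}, d.f. = 1, MS = {msr:.3f}, F = {f_value:.4f}")
print(f"Error:      SS = {sse:.3f}, d.f. = {n - 2}, MS = {mse:.4f}")
print(f"Total:      SS = {ss_total:.3f}, d.f. = {n - 1}")
```

In simple linear regression the F-value equals the square of the slope's t-value, so this F-test and the t-test of H0: β = 0 lead to the same decision.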
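No.       X      Y
2         89     99
3         45     31
4         75     67
5         76     86
6         69     40
7         71     87
8         48     61
9         93     89
10        93     80
sum       740    706
mean      74.0   70.6
sd        16.8   22.0
variance  283.2  482.7

EXCEL ANOVA output:
Source      d.f.  SS        MS        F         Significance F
Regression  1     2222.518  2222.518  8.374682  0.020079
Residual    8     2123.082  265.3853

Further, since the p-value of the F-test is 0.02 < 0.05, H0: β = 0 is rejected: the linear relationship exists.
$SS_{total} = \sum (Y-\bar{Y})^2 = (n-1)S_Y^2 = 9 \times 482.84 \approx 4345.6$
$SSR = \sum (Y'-\bar{Y})^2 = (n-1)\,b^2 S_X^2 = 9 \times 0.9342^2 \times 282.9 \approx 2222.5$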
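The Coefficient of Determination
• Coefficient of determination, r²:
  – The proportion of the total variation of Y that is explained by the variation of X.
  $r^2 = \dfrac{SSR}{SS_{total}} = \dfrac{\sum (Y'-\bar{Y})^2}{\sum (Y-\bar{Y})^2} = \dfrac{\text{variation explained by the model}}{\text{total variation}}$
  $r^2 = 1 - \dfrac{SSE}{SS_{total}} = \dfrac{\text{total variation} - \text{unexplained variation}}{\text{total variation}}$
• EXCEL: from the ANOVA output (SSR = 2222.518, SSE = 2123.082),
  – r² = 2222.518/(2222.518 + 2123.082) ≈ 0.511
  – CORREL: r = 0.7151, and 0.7151² ≈ 0.511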
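The link between the coefficient of determination and the correlation coefficient can be verified numerically. In this sketch (same assumed X, Y pairing as the worked example, illustrative names), r² is computed once as SSR/SStotal and once as the square of the sample correlation.

```python
import math

# X, Y: first two score columns of the n = 10 students (assumed pairing).
x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)     # this is SStotal
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = sxy / sxx
a = ybar - b * xbar
ssr = sum((a + b * xi - ybar) ** 2 for xi in x)

r_squared = ssr / syy                # proportion of variation explained
r = sxy / math.sqrt(sxx * syy)       # sample correlation coefficient

print(f"r^2 = SSR/SStotal = {r_squared:.4f}")
print(f"r = {r:.4f}, r * r = {r * r:.4f}")
```

About half of the variation in Y is explained by X here, which matches the moderate correlation of roughly 0.72.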
Exercise.
• Linear regression analysis:
  – 45, 46, 53, 57
  – EXCEL: 47, 49
• Relationship between (X) and (Y):
1. Correlation analysis
  – Scatter plot, correlation matrix
  – Test whether X and Y are correlated (α = 0.05)
3. ANOVA