principles of biostatistics chapter 18 simple linear regression

36
4/9/2005 11:38 AM Department of Epidemiology and Health Statistics,Tongji Medical College http://statdtedm.6to23 (Dr. Chuanhua Yu) 18-1 Principles of Biostatistics Chapter 18 Simple Linear Regression http://statdtedm.6to23.com

Upload: kamaria-gyasi

Post on 30-Dec-2015

95 views

Category:

Documents


2 download

DESCRIPTION

Principles of Biostatistics Chapter 18 Simple Linear Regression. 宇传华 http://statdtedm.6to23.com. Terminology. Linear regression 线性回归 Response (dependent) variable 反应 ( 应 ) 变量 Explanatory (independent) variable 解释 ( 自 ) 变量 Linear regression model 线性回归模型 - PowerPoint PPT Presentation

TRANSCRIPT

Simple Linear Regression18-*
18-*
Terminology
Error sum of squares or residual sum of squares

18-*
18.3 Estimation: The Method of Least Squares
18.4 Error Variance and the Standard Errors of Regression Estimators
18.5 Confidence Intervals for the Regression Parameters
18.6 Hypothesis Tests about the Regression Relationship
18.7 How Good is the Regression?
18.8 Analysis of Variance Table and an F Test of the Regression Model
18.9 Residual Analysis
Contents
18-*
8.1 An example
Table18.1 IL-6 levels in brain and serum (pg/ml) of 10 patients with subarachnoid hemorrhage ()
Patient i
18-*
Scatterplot
This scatterplot locates pairs of observations of serum IL-6 on the x-axis and brain IL-6 on the y-axis. We notice that:
Larger (smaller) values of brain IL-6 tend to be associated with larger (smaller) values of serum IL-6 .
The scatter of points tends to be distributed around a positively sloped straight line.
The pairs of values of serum IL-6 and brain IL-6 are not located exactly on a straight line. The scatter plot reveals a more or less strong tendency rather than a precise linear relationship. The line represents the nature of the relationship on average.
Chart3
22.4
51.6
58.1
25.1
65.9
79.7
75.3
32.4
96.4
85.7
134
167
132.3
80.2
100
139.1
187.2
97.2
192.3
199.4
Sheet2
1
22.4
134
2
51.6
167
3
58.1
132.3
4
25.1
80.2
5
65.9
100
6
79.7
139.1
7
75.3
187.2
8
32.4
97.2
9
96.4
192.3
10
85.7
199.4
Sheet2
Sheet1
Patient
x
y
y^
18-*
18-*
Model Building
The inexact nature of the relationship between serum and brain suggests that a statistical model might be useful in analyzing the relationship.
A statistical model separates the systematic component of a relationship from the random component.
Data
Statistical
model
Systematic
component
Random
errors
In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).
In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
4/9/2005 11:38 AM
18-*
y= a + b x + or my|x=a+b x
Nonrandom or Random
Component
Where y is the dependent (response) variable, the variable we wish to explain or predict; x is the independent (explanatory) variable, also called the predictor variable; and is the error term, the only random component in the model, and thus, the only source of randomness in y.
my|x is the mean of y when x is specified, all called the conditional mean of Y.
a is the intercept of the systematic component of the regression relationship.
is the slope of the systematic component.
4/9/2005 11:38 AM
18-*
Picturing the Simple Linear Regression Model
The simple linear regression model posits () an exact linear relationship between the expected or average value of Y, the dependent variable Y, and X, the independent or predictor variable:
my|x= a+b x
Actual observed values of Y (y) differ from the expected value (my|x ) by an unexplained or random error(e):
y = my|x +
= a+b x +
18-*
The relationship between X and Y is a straight-Line relationship.
The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term .
The errors are uncorrelated (i.e. Independent) in successive observations. The errors are Normally distributed with mean 0 and variance 2(Equal variance). That is: ~ N(0,2)
X
Y
LINE assumptions of the Simple Linear Regression Model
Identical normal distributions of errors, all centered on the regression line.
my|x=a + x
18-*
$
Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.
The estimated regression equation:
y= a+ bx + e
where a estimates the intercept of the population regression line, a ;
b estimates the slope of the population regression line, ;
and e stands for the observed errors ------- the residuals from fitting the estimated regression line a+ bx to a set of n points.
The estima
ted regres
sion line:
18-*
X
Y
e
X
4/9/2005 11:38 AM
18-*
18-*
18-*

18-*
Sheet1
Patient
x
y
y^
18-*
Example 18-1: Using Computer-Excel
The results on the bottom are the output created by selecting REGRESSION option from the DATA ANALYSIS toolkit.
Office“”“”“”
Sheet2
1
22.4
134
2
51.6
167
3
58.1
132.3
4
25.1
80.2
5
65.9
100
6
79.7
139.1
7
75.3
187.2
8
32.4
97.2
9
96.4
192.3
10
85.7
199.4
Sheet2
Sheet1
Patient
x
y
y^
18-*
Y
X
What you see when looking at the total variation of Y.
X
What you see when looking along the regression line at the error variance of Y.
Y
18-*
18.4 Error Variance and the Standard Errors of Regression Estimators
X
Y
4/9/2005 11:38 AM
18-*
4/9/2005 11:38 AM
18-*
4/9/2005 11:38 AM
18-*
Constant Y
Unsystematic Variation
Nonlinear Relationship
A hypothes
18-*
4/9/2005 11:38 AM
18-*
18.7 How Good is the Regression?
.
{
R2=
18-*
134
167
132.3
80.2
100
139.1
187.2
97.2
192.3
199.4
Sheet2
1
22.4
134
2
51.6
167
3
58.1
132.3
4
25.1
80.2
5
65.9
100
6
79.7
139.1
7
75.3
187.2
8
32.4
97.2
9
96.4
192.3
10
85.7
199.4
Sheet2
Sheet1
Patient
x
y
y^
18-*
18.8 Analysis of Variance Table and an F Test of the Regression Model
4/9/2005 11:38 AM
18-*
0
Residuals
0
Residuals
Time
0
Residuals
4/9/2005 11:38 AM
18-*
Chart4
22.4
25.1
32.4
51.6
58.1
65.9
75.3
79.7
85.7
96.4
0
0
0
0
0
0
0
0
0
0
Sheet1
Patient
x
y
x2
y2
134
167
132.3
80.2
100
139.1
187.2
97.2
192.3
199.4
Sheet2
1
22.4
134
2
51.6
167
3
58.1
132.3
4
25.1
80.2
5
65.9
100
6
79.7
139.1
7
75.3
187.2
8
32.4
97.2
9
96.4
192.3
10
85.7
199.4
Sheet2
Sheet1
Patient
x
y
y^
18-*
Point Prediction
A single-valued estimate of Y for a given value of X obtained by inserting the value of X in the estimated regression equation.
Prediction Interval
For a value of Y given a value of X
Variation in regression line estimate
Variation of points around regression line
For confidence interval of an average value of Y given a value of X
Variation in regression line estimate
4/9/2005 11:38 AM
18-*
4/9/2005 11:38 AM
18-*
4/9/2005 11:38 AM
18-*
Confidence Interval for the Average Value of Y and Prediction Interval for the Individual Value of Y
Chart5
22.4
22.4
22.4
22.4
22.4
25.1
25.1
25.1
25.1
25.1
32.4
32.4
32.4
32.4
32.4
51.6
51.6
51.6
51.6
51.6
58.1
58.1
58.1
58.1
58.1
65.9
65.9
65.9
65.9
65.9
75.3
75.3
75.3
75.3
75.3
79.7
79.7
79.7
79.7
79.7
85.7
85.7
85.7
85.7
85.7
96.4
96.4
96.4
96.4
96.4
serum IL-6
brain IL-6
Sheet1
Patient
x
y
y^
serum IL-6
brain IL-6
18-*
Summary
1. Regression analysis is applied for prediction while control effect of independent variable X.
2. The principle of least squares in solution of regression parameters is to minimize the residual sum of squares.
3. The coefficient of determination, R2, is a descriptive measure of the strength of the regression relationship.
4. There are two confidence bands: one for mean predictions and the other for individual prediction values
5. Residual analysis is used to check goodness of fit for models
4/9/2005 11:38 AM
18-*
Assignments
1. What is the main distinctions and assossiations between correlation analysis and simple linear regression?
2. What is the least squares method to estimate regression line?
3. Please describe the main steps for fitting a simple linear regression model with data.
4/9/2005 11:38 AM
18-*
1. Data source:
correlation analysis is required that both x and y follow normal distribution; but for simple linear regression, only y is required following normal distribution.
2. application:
correlation analysis is employed to measure the association between two random variables (both x and y are treated symmetrically)
simple linear regression is employed to measure the change in y for x (x is the independent varible, y is the dependent variable)
3. r is a dimensionless number, it has no unit of measurement; but b has its unit which relate to y.
4/9/2005 11:38 AM
18-*
3.
YY
XX
l
l
b
r
yx
Y
ˆ
Intercept72.96072725.57022272.8533473370.02136880313.9956876131.925766
x1.17970420.398262342.9621285150.0180872980.2613096332.09809885
0
0
2
0
0
0
/2,2
2
7
()
egree of freedom
for each paramete
r estimated ( and ) )
An
2
SSE
: MSE=
-2
ˆ
a
a
b
b
n
n
ts
a
ts
b
a
a
a
a
0
1.180
2.962
0.398
value0.018(1028)
b
b
HH
t
t
H
b
s
p
bba
n
=¹=
Total
mean value of Y:
Actual observationslower of 95% CL for y
upper of 95% CL for ylower of 95% CL for mean
upper of 95% CL for mean
2
2
where s = MSE
1
xx
xx
x
a
x
b
x
s
nl
s
s
l
s
s
n
a
b
l
x
brain IL-6