principles of biostatistics simple linear regression


Upload: duman

Post on 16-Jan-2016

Page 1: Principles of Biostatistics Simple Linear Regression

4/9/2005 11:38 AM

Department of Epidemiology and Health Statistics,Tongji Medical College http://statdtedm.6to23 (Dr. Chuanhua Yu)

18-1

Principles of Biostatistics

Simple Linear Regression

PPT based on

Dr Chuanhua Yu and Wikipedia

Page 2: Principles of Biostatistics Simple Linear Regression

Terminology

• Moments, skewness, kurtosis
• Analysis of variance (ANOVA)
• Response (dependent) variable
• Explanatory (independent) variable
• Linear regression model
• Method of least squares
• Normal equation
• Sum of squares, error (SSE)
• Sum of squares, regression (SSR)
• Sum of squares, total (SST)
• Coefficient of determination $R^2$
• F-value, P-value, t-test, F-test, p-test
• Homoscedasticity, heteroscedasticity

Page 3: Principles of Biostatistics Simple Linear Regression

Contents

• 18.0 Normal distribution and terms
• 18.1 An Example
• 18.2 The Simple Linear Regression Model
• 18.3 Estimation: The Method of Least Squares
• 18.4 Error Variance and the Standard Errors of Regression Estimators
• 18.5 Confidence Intervals for the Regression Parameters
• 18.6 Hypothesis Tests about the Regression Relationship
• 18.7 How Good is the Regression?
• 18.8 Analysis of Variance Table and an F Test of the Regression Model
• 18.9 Residual Analysis
• 18.10 Prediction Interval and Confidence Interval

Page 4: Principles of Biostatistics Simple Linear Regression

Normal Distribution

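The continuous probability density function of the normal distribution is the Gaussian function

$\varphi_{\mu,\sigma^2}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \frac{1}{\sigma}\,\varphi\left(\frac{x-\mu}{\sigma}\right),$

where σ > 0 is the standard deviation, the real parameter μ is the expected value, and $\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is the density function of the "standard" normal distribution, i.e., the normal distribution with μ = 0 and σ = 1.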

Page 5: Principles of Biostatistics Simple Linear Regression

Normal Distribution

Page 6: Principles of Biostatistics Simple Linear Regression

Moment About The Mean

The kth moment about the mean (or kth central moment) of a real-valued random variable X is the quantity

$\mu_k = E[(X - E[X])^k],$

where E is the expectation operator. For a continuous univariate probability distribution with probability density function f(x), the kth moment about the mean μ is

$\mu_k = \int_{-\infty}^{\infty} (x - \mu)^k f(x)\,dx.$

The first moment about zero, if it exists, is the expectation of X, i.e. the mean of the probability distribution of X, designated μ. In higher orders, the central moments are more interesting than the moments about zero. $\mu_1$ is 0. $\mu_2$ is the variance, the positive square root of which is the standard deviation, σ. $\mu_3/\sigma^3$ is the skewness, often written $\gamma_1$. $\mu_4/\sigma^4 - 3$ is the (excess) kurtosis.
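As a rough numerical check of these definitions, the sketch below approximates the central moments of a normal density with the trapezoidal rule. The parameter values (μ = 2, σ = 1.5), the integration range and the function names are illustrative assumptions, not part of the slides. For a normal distribution one expects $\mu_2 = \sigma^2$, zero skewness and zero excess kurtosis.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def central_moment(k, pdf, mean, lo, hi, steps=20000):
    """Approximate the k-th central moment on [lo, hi] with the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * (x - mean) ** k * pdf(x)
    return total * h

mu, sigma = 2.0, 1.5  # illustrative values, not from the slides
pdf = lambda x: normal_pdf(x, mu, sigma)
lo, hi = mu - 12 * sigma, mu + 12 * sigma  # tails beyond this range are negligible

m2 = central_moment(2, pdf, mu, lo, hi)
m3 = central_moment(3, pdf, mu, lo, hi)
m4 = central_moment(4, pdf, mu, lo, hi)

skewness = m3 / m2 ** 1.5
excess_kurtosis = m4 / m2 ** 2 - 3.0
print(m2, skewness, excess_kurtosis)
```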

Page 7: Principles of Biostatistics Simple Linear Regression

Skewness

Consider the distribution in the figure. The bars on the right side of the distribution taper differently than the bars on the left side. These tapering sides are called tails, and they provide a visual means for determining which of the two kinds of skewness a distribution has:

1. Negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-skewed.
2. Positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed.

Page 8: Principles of Biostatistics Simple Linear Regression

Skewness

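Skewness, the third standardized moment, is written as $\gamma_1$ and defined as

$\gamma_1 = \frac{\mu_3}{\sigma^3},$

where $\mu_3$ is the third moment about the mean and σ is the standard deviation.

For a sample of n values the sample skewness is

$g_1 = \frac{m_3}{m_2^{3/2}} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$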

Page 9: Principles of Biostatistics Simple Linear Regression

Kurtosis

Kurtosis is the degree of peakedness of a distribution. A normal distribution is a mesokurtic distribution. A pure leptokurtic distribution has a higher peak than the normal distribution and has heavier tails. A pure platykurtic distribution has a lower peak than a normal distribution and lighter tails.

Page 10: Principles of Biostatistics Simple Linear Regression

Kurtosis

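The fourth standardized moment is defined as

$\frac{\mu_4}{\sigma^4},$

where $\mu_4$ is the fourth moment about the mean and σ is the standard deviation. Subtracting 3 gives the excess kurtosis, which is 0 for a normal distribution.

For a sample of n values the sample (excess) kurtosis is

$g_2 = \frac{m_4}{m_2^2} - 3 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2} - 3$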
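A minimal sketch of the sample skewness and sample excess kurtosis, computed from the sample central moments $m_k = \frac{1}{n}\sum (x_i - \bar{x})^k$. The data values and function names are made up for illustration and do not come from the slides.

```python
def sample_moment(values, k):
    """k-th sample central moment: mean of (x - x_bar)**k."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

def sample_skewness(values):
    """g1 = m3 / m2**(3/2)."""
    return sample_moment(values, 3) / sample_moment(values, 2) ** 1.5

def sample_excess_kurtosis(values):
    """g2 = m4 / m2**2 - 3 (zero for a normal distribution)."""
    return sample_moment(values, 4) / sample_moment(values, 2) ** 2 - 3.0

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with a longer right tail
g1 = sample_skewness(data)
g2 = sample_excess_kurtosis(data)
print(g1, g2)
```

Here the positive g1 reflects the longer right tail, and the slightly negative g2 indicates tails a little lighter than a normal distribution's.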

Page 11: Principles of Biostatistics Simple Linear Regression

18.1 An example

Table 18.1 IL-6 levels in brain and serum (pg/ml) of 10 patients with subarachnoid hemorrhage

Patient i   Serum IL-6 (pg/ml) x   Brain IL-6 (pg/ml) y
1           22.4                   134.0
2           51.6                   167.0
3           58.1                   132.3
4           25.1                   80.2
5           65.9                   100.0
6           79.7                   139.1
7           75.3                   187.2
8           32.4                   97.2
9           96.4                   192.3
10          85.7                   199.4

Page 12: Principles of Biostatistics Simple Linear Regression

Scatterplot

This scatterplot locates pairs of observations of serum IL-6 on the x-axis and brain IL-6 on the y-axis. We notice that:

• Larger (smaller) values of brain IL-6 tend to be associated with larger (smaller) values of serum IL-6.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of serum IL-6 and brain IL-6 are not located exactly on a straight line. The scatter plot reveals a more or less strong tendency rather than a precise linear relationship. The line represents the nature of the relationship on average.

Page 13: Principles of Biostatistics Simple Linear Regression

Examples of Other Scatterplots

[Figure: several panels, each plotting Y against X.]

Page 14: Principles of Biostatistics Simple Linear Regression

Model Building

The inexact nature of the relationship between serum and brain IL-6 suggests that a statistical model might be useful in analyzing the relationship.

A statistical model separates the systematic component of a relationship from the random component.

Data → Statistical model: Systematic component + Random errors

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.

Page 15: Principles of Biostatistics Simple Linear Regression

18.2 The Simple Linear Regression Model

The population simple linear regression model:

$y = \alpha + \beta x + \varepsilon$   or   $\mu_{y|x} = \alpha + \beta x$

Here $\alpha + \beta x$ is the nonrandom (systematic) component and $\varepsilon$ is the random component.

Where y is the dependent (response) variable, the variable we wish to explain or predict; x is the independent (explanatory) variable, also called the predictor variable; and $\varepsilon$ is the error term, the only random component in the model, and thus the only source of randomness in y. $\mu_{y|x}$ is the mean of y when x is specified, also called the conditional mean of Y.

$\alpha$ is the intercept of the systematic component of the regression relationship. $\beta$ is the slope of the systematic component.

Page 16: Principles of Biostatistics Simple Linear Regression

Picturing the Simple Linear Regression Model

The simple linear regression model posits an exact linear relationship between the expected or average value of the dependent variable Y and the independent or predictor variable X:

$\mu_{y|x} = \alpha + \beta x$

Actual observed values of Y (y) differ from the expected value ($\mu_{y|x}$) by an unexplained or random error ($\varepsilon$):

$y = \mu_{y|x} + \varepsilon = \alpha + \beta x + \varepsilon$

[Figure: Regression plot of Y against X showing the line $\mu_{y|x} = \alpha + \beta x$, the intercept $\alpha$, the slope $\beta$ (rise per unit increase in x), and the error $\varepsilon$ as the vertical distance between an observed y and the line.]

Page 17: Principles of Biostatistics Simple Linear Regression

Assumptions of the Simple Linear Regression Model

• The relationship between X and Y is a straight-line (linear) relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term $\varepsilon$.
• The errors $\varepsilon$ are uncorrelated (i.e. independent) in successive observations. The errors are normally distributed with mean 0 and variance $\sigma^2$ (equal variance). That is: $\varepsilon \sim N(0, \sigma^2)$

[Figure: LINE assumptions (Linear, Independent, Normal, Equal variance) of the simple linear regression model. Identical normal distributions of errors, all centered on the regression line $\mu_{y|x} = \alpha + \beta x$; at each x, $y \sim N(\mu_{y|x}, \sigma_{y|x}^2)$.]

Page 18: Principles of Biostatistics Simple Linear Regression

18.3 Estimation: The Method of Least Squares

Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.

The estimated regression equation: $y = a + bx + e$

where a estimates the intercept of the population regression line, $\alpha$; b estimates the slope of the population regression line, $\beta$; and e stands for the observed errors, i.e. the residuals from fitting the estimated regression line a + bx to a set of n points.

The estimated regression line: $\hat{y} = a + bx$

where $\hat{y}$ (y-hat) is the value of Y lying on the fitted regression line for a given value of X.

Page 19: Principles of Biostatistics Simple Linear Regression

Fitting a Regression Line

[Figure: four panels of Y against X: the data; three errors from a fitted line; three errors from the least squares regression line; errors from the least squares regression line are minimized.]

Page 20: Principles of Biostatistics Simple Linear Regression

Errors in Regression

[Figure: Y against X with the fitted regression line $\hat{y} = a + bx$. For an observation $(x_i, y_i)$, $\hat{y}_i$ is the predicted value of Y for $x_i$, and the error is $e_i = y_i - \hat{y}_i$.]

Page 21: Principles of Biostatistics Simple Linear Regression

Least Squares Regression

The sum of squared errors in regression is:

$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

The least squares regression line is that which minimizes the SSE with respect to the estimates a and b.

[Figure: SSE (sum of squared errors) plotted as a parabola-shaped function of a and b; its minimum is at the least squares a and least squares b.]

Page 22: Principles of Biostatistics Simple Linear Regression

Normal Equation

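The sum of squared residuals $S = \sum_{i=1}^{n} r_i^2$ is minimized when its gradient with respect to each parameter is equal to zero. The elements of the gradient vector are the partial derivatives of S with respect to the parameters:

$\frac{\partial S}{\partial \beta_j} = 2\sum_{i=1}^{n} r_i \frac{\partial r_i}{\partial \beta_j} = 0 \quad (j = 1, \dots, k)$

Since $r_i = y_i - \sum_{j=1}^{k} X_{ij}\beta_j$, the derivatives are

$\frac{\partial r_i}{\partial \beta_j} = -X_{ij}$

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives

$\frac{\partial S}{\partial \beta_j} = -2\sum_{i=1}^{n} X_{ij}\left(y_i - \sum_{l=1}^{k} X_{il}\hat{\beta}_l\right) = 0$

Upon rearrangement, the normal equations

$\sum_{i=1}^{n}\sum_{l=1}^{k} X_{ij}X_{il}\hat{\beta}_l = \sum_{i=1}^{n} X_{ij}y_i \quad (j = 1, \dots, k)$

are obtained. The normal equations are written in matrix notation as

$(X'X)\hat{B} = X'Y$

The solution of the normal equations yields the vector $\hat{B}$ of the optimal parameter values.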

Page 23: Principles of Biostatistics Simple Linear Regression

Normal Equation

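In matrix notation the model and its fit are

$Y = XB + U, \quad U \sim N(0, \sigma^2)$

$\hat{Y} = X\hat{B}, \quad e = Y - \hat{Y}$

The sum of squared residuals is

$Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = e'e = (Y - X\hat{B})'(Y - X\hat{B})$

$Q = (Y' - \hat{B}'X')(Y - X\hat{B}) = Y'Y - \hat{B}'X'Y - Y'X\hat{B} + \hat{B}'X'X\hat{B} = Y'Y - 2\hat{B}'X'Y + \hat{B}'X'X\hat{B}$

Setting the derivative to zero:

$\frac{\partial Q}{\partial \hat{B}} = -2X'Y + 2X'X\hat{B} = 0$

$\hat{B} = (X'X)^{-1}X'Y$

$\hat{\sigma}^2 = \frac{e'e}{n - k}$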
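To make the matrix form concrete, the sketch below builds X'X and X'Y for the IL-6 data of Table 18.1, assuming a design matrix with a column of ones and a column of x, and solves the resulting 2×2 normal equations with Cramer's rule. The variable names are illustrative and the code is only one of several ways to solve the system.

```python
# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)

# X'X and X'Y for the design matrix X = [1, x]
xtx = [[n, sum(x)],
       [sum(x), sum(xi * xi for xi in x)]]
xty = [sum(y), sum(xi * yi for xi, yi in zip(x, y))]

# Solve (X'X) B = X'Y for B = (a, b) with Cramer's rule
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
a = (xty[0] * xtx[1][1] - xtx[0][1] * xty[1]) / det  # intercept
b = (xtx[0][0] * xty[1] - xtx[1][0] * xty[0]) / det  # slope

print(round(a, 2), round(b, 2))
```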

Page 24: Principles of Biostatistics Simple Linear Regression

Sums of Squares, Cross Products, and Least Squares Estimators

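Fitted line: $\hat{y} = a + bx$

Sums of Squares and Cross Products:

$l_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$

$l_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$

$l_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$

Least squares regression estimators:

$b = \frac{l_{xy}}{l_{xx}}$

$a = \bar{y} - b\bar{x}$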

Page 25: Principles of Biostatistics Simple Linear Regression

Example 18-1

Patient   x       y        x^2        y^2        x×y
1         22.4    134.0    501.76     17956.0    3001.60
4         25.1    80.2     630.01     6432.0     2013.02
8         32.4    97.2     1049.76    9447.8     3149.28
2         51.6    167.0    2662.56    27889.0    8617.20
3         58.1    132.3    3375.61    17503.3    7686.63
5         65.9    100.0    4342.81    10000.0    6590.00
7         75.3    187.2    5670.09    35043.8    14096.16
6         79.7    139.1    6352.09    19348.8    11086.27
10        85.7    199.4    7344.49    39760.4    17088.58
9         96.4    192.3    9292.96    36979.3    18537.72
Total     592.6   1428.7   41222.14   220360.5   91866.46

$l_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 41222.14 - \frac{592.6^2}{10} = 6104.66$

$l_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 220360.47 - \frac{1428.70^2}{10} = 16242.10$

$l_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 91866.46 - \frac{592.6 \times 1428.70}{10} = 7201.70$

$b = \frac{l_{xy}}{l_{xx}} = \frac{7201.70}{6104.66} = 1.18$

$a = \bar{y} - b\bar{x} = \frac{1428.7}{10} - (1.18)\frac{592.6}{10} = 72.96$

Regression equation:

$\hat{y} = 72.96 + 1.18x$
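The arithmetic of Example 18-1 can be reproduced with a few lines of Python. This is a sketch using the computational formulas for $l_{xx}$, $l_{yy}$ and $l_{xy}$; the variable names are illustrative, and the unrounded slope (about 1.1797) is what yields the intercept 72.96.

```python
# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Sums of squares and cross products
lxx = sum_x2 - sum_x ** 2 / n
lyy = sum_y2 - sum_y ** 2 / n
lxy = sum_xy - sum_x * sum_y / n

# Least squares estimators
b = lxy / lxx
a = sum_y / n - b * sum_x / n

print(f"lxx={lxx:.2f} lyy={lyy:.2f} lxy={lxy:.2f}")
print(f"y_hat = {a:.2f} + {b:.2f} x")
```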

Page 26: Principles of Biostatistics Simple Linear Regression

New Normal Distributions

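• Since each coefficient estimator is a linear combination of Y (normal random variables), each $b_j$ (j = 0, 1, ..., k) is normally distributed.

• Notation:

$\hat{B} = (X'X)^{-1}X'Y$

$\hat{\beta}_j \sim N(\beta_j, \sigma^2 c_{jj})$, where $c_{jj}$ is the jth row, jth column element of $(X'X)^{-1}$

In the 2D special case, for the slope (j = 1):

$c_{jj} = 1/l_{xx}$

When j = 0, in the 2D special case (the intercept, with $\beta_0 = \alpha$):

$a \sim N\left(\alpha, \frac{\sigma^2 \sum x^2}{n\,l_{xx}}\right)$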

Page 27: Principles of Biostatistics Simple Linear Regression

Total Variance and Error Variance

[Figure, left panel: What you see when looking at the total variation of Y: $\frac{\sum (y - \bar{y})^2}{n - 1}$]

[Figure, right panel: What you see when looking along the regression line at the error variance of Y: $\frac{\sum (y - \hat{y})^2}{n - 2}$]

Page 28: Principles of Biostatistics Simple Linear Regression

18.4 Error Variance and the Standard Errors of Regression Estimators

Degrees of Freedom in Regression:

df = (n − 2)   (n total observations less one degree of freedom for each parameter estimated (a and b))

[Figure: Y against X. Square and sum all regression errors to find SSE.]

$SSE = \sum (y - \hat{y})^2 = l_{yy} - \frac{l_{xy}^2}{l_{xx}} = l_{yy} - b\,l_{xy}$

Error variance: an unbiased estimator of $\sigma^2$, denoted by $s^2$:

$MSE = s^2 = \frac{SSE}{n - 2} = \frac{\sum (y - \hat{y})^2}{n - 2}$

Example 18-1:

$SSE = l_{yy} - b\,l_{xy} = 16242.10 - (1.18)(7201.70) = 7746.23$ (computed with the unrounded slope, b ≈ 1.1797)

$MSE = \frac{SSE}{n - 2} = \frac{7746.23}{8} = 968.28$

Standard error: $s = \sqrt{MSE} = \sqrt{968.28} = 31.12$
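A short sketch of the error-variance calculation for Example 18-1, computing SSE both directly from the residuals and through the shortcut $l_{yy} - b\,l_{xy}$. Variable names are illustrative assumptions.

```python
import math

# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

lxx = sum((xi - x_bar) ** 2 for xi in x)
lyy = sum((yi - y_bar) ** 2 for yi in y)
lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = lxy / lxx
a = y_bar - b * x_bar

# SSE two ways: sum of squared residuals, and the shortcut formula
sse_direct = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sse_shortcut = lyy - b * lxy

mse = sse_direct / (n - 2)  # unbiased estimator of the error variance
s = math.sqrt(mse)          # standard error of the regression

print(f"SSE={sse_direct:.2f} MSE={mse:.2f} s={s:.2f}")
```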

Page 29: Principles of Biostatistics Simple Linear Regression

Standard Errors of Estimates in Regression

The standard error of a (intercept):

$s_a = s\sqrt{\frac{\sum x^2}{n\,l_{xx}}}$, where $s = \sqrt{MSE}$

The standard error of b (slope):

$s_b = \frac{s}{\sqrt{l_{xx}}}$

Example 18-1:

$s_a = s\sqrt{\frac{\sum x^2}{n\,l_{xx}}} = 31.12\sqrt{\frac{41222.14}{10 \times 6104.66}} = 0.398 \times 64.204 = 25.570$

$s_b = \frac{s}{\sqrt{l_{xx}}} = \frac{31.12}{\sqrt{6104.66}} = 0.398$
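The standard errors of the intercept and slope in Example 18-1 can be checked with the sketch below, which uses $s_a = s\sqrt{\sum x^2 / (n\,l_{xx})}$ and $s_b = s/\sqrt{l_{xx}}$. The variable names are illustrative.

```python
import math

# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

lxx = sum((xi - x_bar) ** 2 for xi in x)
lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = lxy / lxx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

s_b = s / math.sqrt(lxx)                                      # slope
s_a = s * math.sqrt(sum(xi ** 2 for xi in x) / (n * lxx))      # intercept

print(f"s_a={s_a:.3f} s_b={s_b:.3f}")
```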

Page 30: Principles of Biostatistics Simple Linear Regression

T distribution

Student's t-distribution arises when the population standard deviation is unknown and has to be estimated from the data.

Page 31: Principles of Biostatistics Simple Linear Regression

18.5 Confidence Intervals for the Regression Parameters

In the confidence level 100(1 − α)%, α denotes the significance level (e.g. 0.05), not the intercept.

A 100(1 − α)% confidence interval for the intercept $\alpha$:

$a \pm t_{\alpha/2,\,n-2}\,s_a$

A 100(1 − α)% confidence interval for the slope $\beta$:

$b \pm t_{\alpha/2,\,n-2}\,s_b$

Example 18-1, 95% Confidence Intervals:

$a \pm t_{0.05/2,\,10-2}\,s_a = 72.961 \pm (2.306)(25.570) = 72.961 \pm 58.964 = [13.996, 131.925]$

$b \pm t_{0.05/2,\,10-2}\,s_b = 1.180 \pm (2.306)(0.398) = 1.180 \pm 0.918 = [0.261, 2.098]$
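The 95% intervals of Example 18-1 can be recomputed as follows. This is a sketch: the critical value 2.306 for 8 degrees of freedom is hard-coded from the slides rather than computed, because the Python standard library has no t-distribution quantile function, and the variable names are illustrative.

```python
import math

# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

lxx = sum((xi - x_bar) ** 2 for xi in x)
lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = lxy / lxx
a = y_bar - b * x_bar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
s_a = s * math.sqrt(sum(xi ** 2 for xi in x) / (n * lxx))
s_b = s / math.sqrt(lxx)

T_CRIT = 2.306  # t(0.05/2, df=8), value taken from the slides' t table

ci_a = (a - T_CRIT * s_a, a + T_CRIT * s_a)
ci_b = (b - T_CRIT * s_b, b + T_CRIT * s_b)

print(f"intercept: [{ci_a[0]:.3f}, {ci_a[1]:.3f}]")
print(f"slope:     [{ci_b[0]:.3f}, {ci_b[1]:.3f}]")
```

Because the slope interval excludes 0, it agrees with rejecting a zero slope at the 5% level.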

Page 32: Principles of Biostatistics Simple Linear Regression

18.6 Hypothesis Tests about the Regression Relationship

[Figure: three scatterplots of Y against X for which $H_0: \beta = 0$ holds: Constant Y; Unsystematic Variation; Nonlinear Relationship.]

A hypothesis test for the existence of a linear relationship between X and Y:

$H_0: \beta = 0$
$H_1: \beta \neq 0$

Test statistic for the existence of a linear relationship between X and Y:

$t = \frac{b}{s_b}$

where b is the least-squares estimate of the regression slope and $s_b$ is the standard error of b.

When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.

Page 33: Principles of Biostatistics Simple Linear Regression

T-test

A test of the null hypothesis that the means of two normally distributed populations are equal. Given two data sets, each characterized by its mean, standard deviation and number of data points, we can use some kind of t test to determine whether the means are distinct, provided that the underlying distributions can be assumed to be normal. All such tests are usually called Student's t tests.

$t = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{c_{jj}}} \sim t(n - k)$

Page 34: Principles of Biostatistics Simple Linear Regression

T-test

Example 18-1:

$H_0: \beta = 0, \quad H_1: \beta \neq 0, \quad \alpha = 0.05$

$t = \frac{b}{s_b} = \frac{1.180}{0.398} = 2.962$

$t_{0.05/2,\,8} = 2.306 < 2.962$

P-value = 0.018 (df = 10 − 2 = 8)

$H_0$ is rejected at the 5% level and we may conclude that there is a relationship between serum IL-6 and brain IL-6.
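The test statistic and its two-sided p-value for Example 18-1 can be approximated with the standard library alone. In this sketch the p-value comes from integrating the Student t density with Simpson's rule, a numerical shortcut chosen for illustration and not the method used in the slides; the function and variable names are assumptions.

```python
import math

# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

lxx = sum((xi - x_bar) ** 2 for xi in x)
lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = lxy / lxx
a = y_bar - b * x_bar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
s_b = s / math.sqrt(lxx)

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_value, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|] (Simpson's rule)."""
    upper = abs(t_value)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * (total * h / 3.0)

t_stat = b / s_b  # test of H0: slope = 0
p_value = two_sided_p(t_stat, n - 2)
print(f"t={t_stat:.3f} p={p_value:.4f}")
```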

Page 35: Principles of Biostatistics Simple Linear Regression

T test Table

Page 36: Principles of Biostatistics Simple Linear Regression

18.7 How Good is the Regression?

The coefficient of determination, $R^2$, is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.

[Figure: Y against X, showing for one observation the total deviation $(y - \bar{y})$ split into the explained deviation $(\hat{y} - \bar{y})$ and the unexplained deviation $(y - \hat{y})$.]

Total Deviation = Unexplained Deviation (Error) + Explained Deviation (Regression)

$(y - \bar{y}) = (y - \hat{y}) + (\hat{y} - \bar{y})$

$\sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$

SST = SSE + SSR

$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

$R^2$, the coefficient of determination, is the percentage of total variation explained by the regression.

Page 37: Principles of Biostatistics Simple Linear Regression

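The Coefficient of Determination

[Figure: three scatterplots of Y against X illustrating $R^2 = 0$ (SST consists entirely of SSE), $R^2 = 0.50$ and $R^2 = 0.90$ (SST split between SSE and SSR).]

Example 18-1:

$R^2 = \frac{SSR}{SST} = \frac{b\,l_{xy}}{l_{yy}} = \frac{1.180 \times 7201.70}{16242.10} = 0.5231 = 52.31\%$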
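A brief sketch of the coefficient of determination for Example 18-1, computed both as SSR/SST and as 1 − SSE/SST to show that the two forms agree. Variable names are illustrative.

```python
# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

lxx = sum((xi - x_bar) ** 2 for xi in x)
lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = lxy / lxx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression (explained)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error (unexplained)

r2 = ssr / sst
r2_alt = 1 - sse / sst
print(f"R^2 = {r2:.4f} ({100 * r2:.2f}% of the variation explained)")
```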

Page 38: Principles of Biostatistics Simple Linear Regression

Another Test

• Earlier in this section you saw how to perform a t-test to compare a sample mean to an accepted value, or to compare two sample means. In this section, you will see how to use the F-test to compare two variances or standard deviations.

• When using the F-test, you again require a hypothesis, but this time, it is to compare standard deviations. That is, you will test the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ against an appropriate alternate hypothesis.

Page 39: Principles of Biostatistics Simple Linear Regression

F-test

The t-test is used for every single parameter. If there are many dimensions, all parameters are treated independently. To verify the combination of all the parameters, we can use the F-test.

The formula for an F-test in multiple-comparison ANOVA problems is: F = (between-group variability) / (within-group variability)

$H_0: \beta_2 = \beta_3 = \dots = \beta_k = 0$

$H_1: \beta_2, \beta_3, \dots, \beta_k$, at least one non-zero

$F = \frac{SSR/(k-1)}{SSE/(n-k)} \sim F(k-1,\ n-k)$

Page 40: Principles of Biostatistics Simple Linear Regression

F test table

Page 41: Principles of Biostatistics Simple Linear Regression

18.8 Analysis of Variance Table and an F Test of the Regression Model

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n-2                  MSE
Total                 SST              n-1                  MST

Example 18-1

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio   p Value
Regression            8495.87          1                    8495.87       8.77      0.0181
Error                 7746.23          8                    968.28
Total                 16242.10         9
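The entries of the ANOVA table for Example 18-1 follow directly from the sums of squares. The sketch below recomputes them and prints a small table; the p-value column is omitted because the standard library provides no F distribution, and the layout and names are illustrative.

```python
# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

lxx = sum((xi - x_bar) ** 2 for xi in x)
lyy = sum((yi - y_bar) ** 2 for yi in y)
lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = lxy / lxx

sst = lyy            # total sum of squares, df = n - 1
ssr = b * lxy        # regression sum of squares, df = 1
sse = sst - ssr      # error sum of squares, df = n - 2

msr = ssr / 1
mse = sse / (n - 2)
f_ratio = msr / mse

print(f"{'Source':<12}{'SS':>10}{'df':>4}{'MS':>10}{'F':>7}")
print(f"{'Regression':<12}{ssr:>10.2f}{1:>4}{msr:>10.2f}{f_ratio:>7.2f}")
print(f"{'Error':<12}{sse:>10.2f}{n - 2:>4}{mse:>10.2f}")
print(f"{'Total':<12}{sst:>10.2f}{n - 1:>4}")
```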

Page 42: Principles of Biostatistics Simple Linear Regression

F-test, T-test and R

1. In the 2D case, the F-test and the t-test are the same. It can be proved that $F = t^2$. So in the 2D case, either the F-test or the t-test is enough. This is not true for more variables.
2. The F-test and $R^2$ have the same purpose: to measure the whole regression. They are related as

$F = \frac{R^2}{1 - R^2} \cdot \frac{n - k}{k - 1}$

3. The F-test is better than $R^2$ because it has a better metric, one with a known distribution for hypothesis testing.

Approach:
1. First apply the F-test. If it passes, continue.
2. Apply the t-test to every parameter. If some parameter cannot pass, we can delete it and re-evaluate the regression.
3. Note that we delete only one parameter (the one with the least effect on the regression) at a time, until all remaining parameters have a strong effect.
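The two identities on this slide can be checked numerically on Example 18-1, taking k = 2 parameters (intercept and slope). This is a sketch with illustrative variable names: F computed from $R^2$ should match $t^2$ up to rounding error.

```python
import math

# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n, k = len(x), 2  # k = number of estimated parameters (a and b)
x_bar, y_bar = sum(x) / n, sum(y) / n

lxx = sum((xi - x_bar) ** 2 for xi in x)
lyy = sum((yi - y_bar) ** 2 for yi in y)
lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = lxy / lxx

sse = lyy - b * lxy
r2 = 1 - sse / lyy

# F from R^2, and t for the slope
f_from_r2 = (r2 / (1 - r2)) * (n - k) / (k - 1)
s_b = math.sqrt(sse / (n - k)) / math.sqrt(lxx)
t_stat = b / s_b

print(f"F from R^2 = {f_from_r2:.3f}, t^2 = {t_stat ** 2:.3f}")
```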

Page 43: Principles of Biostatistics Simple Linear Regression

18.9 Residual Analysis

[Figure: four residual plots, with residuals (centered on 0) plotted against x or $\hat{y}$, or against time.]

• Homoscedasticity: Residuals appear completely random. No indication of model inadequacy.
• Curved pattern in residuals resulting from underlying nonlinear relationship.
• Residuals exhibit a linear trend with time.
• Heteroscedasticity: Variance of residuals changes when x changes.

Page 44: Principles of Biostatistics Simple Linear Regression

Example 18-1: Using Computer-Excel

[Figure: Serum IL-6 residual plot. Residuals (vertical axis, −60 to 40) plotted against serum IL-6 (horizontal axis, 0 to 120).]

Residual Analysis. The plot shows a curved relationship between the residuals and the X-values (serum IL-6).
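The residuals behind such a plot can be computed without Excel. The sketch below lists $e_i = y_i - \hat{y}_i$ for the IL-6 data, ordered by serum IL-6 so that any pattern against x is easier to see. It only prints values rather than drawing the plot, and the names are illustrative.

```python
# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

lxx = sum((xi - x_bar) ** 2 for xi in x)
lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = lxy / lxx
a = y_bar - b * x_bar

# Residual for each observation: observed minus fitted value
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Print residuals ordered by x, as they would appear left to right in the plot
for xi, ei in sorted(zip(x, residuals)):
    print(f"serum IL-6 = {xi:5.1f}   residual = {ei:7.2f}")
```

By construction, least squares residuals sum to zero, which is a quick sanity check on the fit.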

Page 45: Principles of Biostatistics Simple Linear Regression

Prediction Interval

• Suppose one has a sample from a normally distributed population.
• The mean and standard deviation of the population are unknown except insofar as they can be estimated based on the sample. It is desired to predict the next observation.
• Let n be the sample size; let μ and σ be respectively the unobservable mean and standard deviation of the population. Let $X_1, \dots, X_n$ be the sample; let $X_{n+1}$ be the future observation to be predicted. Let
• $\bar{X}_n = (X_1 + \cdots + X_n)/n$ and $S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$

)/,(~ 2 nSXNX n

Page 46: Principles of Biostatistics Simple Linear Regression

Prediction Interval

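• Then it is fairly routine to show that, with $\bar{X}_n$ the sample mean and $S_n$ the sample standard deviation,

$\frac{X_{n+1} - \bar{X}_n}{S_n\sqrt{1 + 1/n}}$

• has a Student's t-distribution with n − 1 degrees of freedom. Consequently we have

$P\left(\bar{X}_n - T_a S_n\sqrt{1 + 1/n} \le X_{n+1} \le \bar{X}_n + T_a S_n\sqrt{1 + 1/n}\right) = p$

• where $T_a$ is the 100((1 + p)/2)th percentile of Student's t-distribution with n − 1 degrees of freedom. Therefore the numbers

$\bar{X}_n \pm T_a S_n\sqrt{1 + 1/n}$

• are the endpoints of a 100p% prediction interval for $X_{n+1}$.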

Page 47: Principles of Biostatistics Simple Linear Regression

18.10 Prediction Interval and Confidence Interval

• Point Prediction
  – A single-valued estimate of Y for a given value of X obtained by inserting the value of X in the estimated regression equation.
• Prediction Interval
  – For a value of Y given a value of X
    • Variation in regression line estimate
    • Variation of points around regression line
  – For confidence interval of an average value of Y given a value of X
    • Variation in regression line estimate

Page 48: Principles of Biostatistics Simple Linear Regression

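Confidence interval of an average value of Y given a value of X

$\hat{Y}_0 \sim N\left(\beta_0 + \beta_1 X_0,\ \sigma^2\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right)$

$t = \frac{\hat{Y}_0 - (\beta_0 + \beta_1 X_0)}{S_{\hat{Y}_0}} \sim t(n - 2)$

$\hat{Y}_0 - t_{n-2,\,\alpha/2}\,S_{\hat{Y}_0} \le E(Y_0) \le \hat{Y}_0 + t_{n-2,\,\alpha/2}\,S_{\hat{Y}_0}$

where

$S_{\hat{Y}_0} = S\sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$

Here $\beta_0$ and $\beta_1$ denote the population intercept and slope (written α and β elsewhere in these slides).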

Page 49: Principles of Biostatistics Simple Linear Regression

Confidence Interval for the Average Value of Y

A 100(1 − α)% confidence interval for the mean value of Y at $x = x_0$:

$\hat{y}_0 \pm t_{\alpha/2,\,n-2}\,s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{l_{xx}}}$

Example 18-1 ($x_0 = 75.3$):

$\hat{y}_0 = a + bx_0 = 72.96 + 1.18 \times 75.3 = 161.79$

$161.79 \pm 2.306 \times 31.12 \times \sqrt{\frac{1}{10} + \frac{(75.3 - 59.26)^2}{6104.66}} = 161.79 \pm 27.06 = [134.74, 188.85]$
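The interval for the mean brain IL-6 at serum IL-6 = 75.3 can be reproduced with the sketch below. The function name `mean_ci` is hypothetical, and the critical value 2.306 is hard-coded from the slides rather than computed.

```python
import math

# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

lxx = sum((xi - x_bar) ** 2 for xi in x)
lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = lxy / lxx
a = y_bar - b * x_bar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

T_CRIT = 2.306  # t(0.05/2, df=8), value taken from the slides' t table

def mean_ci(x0):
    """95% confidence interval for the mean of Y at x0 (hypothetical helper)."""
    y0 = a + b * x0
    half_width = T_CRIT * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / lxx)
    return y0, y0 - half_width, y0 + half_width

y0, lower, upper = mean_ci(75.3)
print(f"y_hat={y0:.2f}  95% CI for the mean: [{lower:.2f}, {upper:.2f}]")
```

The half-width is smallest at $x_0 = \bar{x}$ and grows as $x_0$ moves away from the mean of the observed x values.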

Page 50: Principles of Biostatistics Simple Linear Regression

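Prediction Interval for a value of Y given a value of X

$Y_0 \sim N(\beta_0 + \beta_1 X_0,\ \sigma^2)$

$\hat{Y}_0 - Y_0 \sim N\left(0,\ \sigma^2\left(1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right)$

$t = \frac{\hat{Y}_0 - Y_0}{S_{\hat{Y}_0 - Y_0}} \sim t(n - 2)$

$\hat{Y}_0 - t_{n-2,\,\alpha/2}\,S_{\hat{Y}_0 - Y_0} \le Y_0 \le \hat{Y}_0 + t_{n-2,\,\alpha/2}\,S_{\hat{Y}_0 - Y_0}$

where

$S_{\hat{Y}_0 - Y_0} = S\sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$

Here $\beta_0$ and $\beta_1$ denote the population intercept and slope (written α and β elsewhere in these slides).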

Page 51: Principles of Biostatistics Simple Linear Regression

Prediction Interval for a Value of Y

A 100(1 − α)% prediction interval for Y at $x = x_0$:

$\hat{y}_0 \pm t_{\alpha/2,\,n-2}\,s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{l_{xx}}}$

Example 18-1 ($x_0 = 75.3$):

$\hat{y}_0 = a + bx_0 = 72.96 + 1.18 \times 75.3 = 161.79$

$161.79 \pm 2.306 \times 31.12 \times \sqrt{1 + \frac{1}{10} + \frac{(75.3 - 59.26)^2}{6104.66}} = 161.79 \pm 76.69 = [85.11, 238.48]$
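The prediction interval differs from the confidence interval for the mean only by the extra "1 +" under the square root, which accounts for the scatter of individual points around the line. A sketch for Example 18-1 follows; the function name `prediction_interval` is hypothetical and the critical value 2.306 is taken from the slides.

```python
import math

# IL-6 data from Table 18.1 (pg/ml)
x = [22.4, 51.6, 58.1, 25.1, 65.9, 79.7, 75.3, 32.4, 96.4, 85.7]          # serum
y = [134.0, 167.0, 132.3, 80.2, 100.0, 139.1, 187.2, 97.2, 192.3, 199.4]  # brain
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

lxx = sum((xi - x_bar) ** 2 for xi in x)
lxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = lxy / lxx
a = y_bar - b * x_bar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

T_CRIT = 2.306  # t(0.05/2, df=8), value taken from the slides' t table

def prediction_interval(x0):
    """95% prediction interval for an individual Y at x0 (hypothetical helper)."""
    y0 = a + b * x0
    half_width = T_CRIT * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / lxx)
    return y0 - half_width, y0 + half_width

pi_lower, pi_upper = prediction_interval(75.3)
print(f"95% prediction interval: [{pi_lower:.2f}, {pi_upper:.2f}]")
```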

Page 52: Principles of Biostatistics Simple Linear Regression

Confidence Interval for the Average Value of Y and Prediction Interval for the Individual Value of Y

[Figure: brain IL-6 (vertical axis, 0 to 300) plotted against serum IL-6 (horizontal axis, 20 to 100), showing the actual observations, the lower and upper 95% confidence limits for an individual y, and the lower and upper 95% confidence limits for the mean.]

Page 53: Principles of Biostatistics Simple Linear Regression

Summary

1. Regression analysis is applied for prediction while controlling for the effect of the independent variable X.
2. The principle of least squares in solving for the regression parameters is to minimize the residual sum of squares.
3. The coefficient of determination, $R^2$, is a descriptive measure of the strength of the regression relationship.
4. There are two confidence bands: one for mean predictions and the other for individual prediction values.
5. Residual analysis is used to check goodness of fit for models.