BINF 702 Chapter 11 Regression and Correlation Methods (SPRING 2014)

TRANSCRIPT

Page 1: BINF 702 Chapter 11 Regression and Correlation Methods


BINF 702 Chapter 11 Regression and Correlation Methods

Page 2: BINF 702 Chapter 11 Regression and Correlation Methods


Section 11.1 Introduction

Example 11.1 Obstetrics Obstetricians sometimes order tests for estriol levels from 24-hour urine specimens taken from pregnant women who are near term, since the level of estriol has been found to be related to the birthweight of the infant. The test can provide indirect evidence of an abnormally small fetus. The relationship between estriol level and birthweight can be quantified by fitting a regression line that relates the two variables.

Example 11.2 Hypertension Much discussion has taken place in the literature concerning the familial aggregation of blood pressure. In general, children whose parents have high blood pressure tend to have higher blood pressure than their peers. One way of expressing this relationship is to compute a correlation coefficient relating the blood pressure of parents and children over a large collection of families.

Page 3: BINF 702 Chapter 11 Regression and Correlation Methods


Section 11.2 General Concepts

Let us return to our consideration of the relationship between estriol level and birthweight data. Let x = estriol level and y = birthweight. We might posit a relationship such as

Eq. 11.1 E(y|x) = a + bx

Our regression line is defined as

Def. 11.1 – y = a + bx, where a is the y-intercept and b is the slope.

Of course, the regression line will not fit the data exactly; there is some error associated with the fit.

Eq. 11.2 y = a + bx + e, where e ~ N(0, σ²), x is the independent variable, and y is the dependent variable.

Page 4: BINF 702 Chapter 11 Regression and Correlation Methods


Section 11.2 General Concepts

A linear regression fit for our birthweight data

Page 5: BINF 702 Chapter 11 Regression and Correlation Methods


Section 11.2 General Concepts

Some nuances of the fit

The noise level (error variance) may vary.

The slope b may vary.

Page 6: BINF 702 Chapter 11 Regression and Correlation Methods

Section 11.3 – Fitting Regression Lines The Method of Least Squares

Def. 11.3 – The least-squares line, or estimated regression line, is the line y = a + bx minimizing the sum of squared distances of the sample points from the line, given by

S = Σ di²   (sum over i = 1, …, n)

Eq. 11.3 Estimation of the Least-Squares Line The coefficients of the least-squares line y = a + bx are given by

b = Lxy / Lxx   and   a = (Σ yi − b·Σ xi)/n = ȳ − b·x̄

We choose this criterion because the math is tractable.
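As a quick check of Eq. 11.3, here is a minimal R sketch; it reuses the estriol data from Example 11.8 later in the chapter, and the results should match the lm() fit shown there (a = 21.5234, b = 0.6082).

es = c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,
       17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw = c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,
       32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
Lxx = sum((es - mean(es))^2)                  # corrected sum of squares for x
Lxy = sum((es - mean(es)) * (bw - mean(bw)))  # corrected sum of cross products
b = Lxy / Lxx                                 # slope
a = mean(bw) - b * mean(es)                   # intercept
c(a = a, b = b)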


Page 7: BINF 702 Chapter 11 Regression and Correlation Methods


Section 11.3 – Fitting Regression Lines The Method of Least Squares

Here Lxx and Lxy are the corrected sums of squares and cross products:

Lxx = Σ xi² − (Σ xi)²/n

Lxy = Σ xi·yi − (Σ xi)(Σ yi)/n

with all sums running over i = 1, …, n.

Def. 11.6 The predicted, or average, value of y for a given value of x, as estimated from the fitted regression line, is denoted by ŷ = a + bx.

Page 8: BINF 702 Chapter 11 Regression and Correlation Methods


Section 11.3 – Fitting Regression Lines The Method of Least Squares

Regression in R

lm {stats}

R Documentation

Fitting Linear Models

Description

lm is used to fit linear models. It can be used to

carry out regression, single stratum analysis of

variance and analysis of covariance (although aov may

provide a more convenient interface for these).

Usage

lm(formula, data, subset, weights, na.action, method = "qr",
   model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

Page 9: BINF 702 Chapter 11 Regression and Correlation Methods


Section 11.3 – Fitting Regression Lines The Method of Least Squares

Regression in R (The Arguments)

formula: a symbolic description of the model to be fitted. The details of model specification are given below.

data: an optional data frame containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.

subset: an optional vector specifying a subset of observations to be used in the fitting process.

weights: an optional vector of weights to be used in the fitting process. If specified, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used.

na.action: a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The "factory-fresh" default is na.omit. Another possible value is NULL, no action.

Page 10: BINF 702 Chapter 11 Regression and Correlation Methods


Section 11.3 – Fitting Regression Lines The Method of Least Squares

Regression in R (The Arguments)

method: the method to be used; for fitting, currently only method = "qr" is supported; method = "model.frame" returns the model frame (the same as with model = TRUE, see below).

model, x, y, qr: logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the response, the QR decomposition) are returned.

singular.ok: logical. If FALSE (the default in S but not in R) a singular fit is an error.

contrasts: an optional list. See the contrasts.arg of model.matrix.default.

offset: this can be used to specify an a priori known component to be included in the linear predictor during fitting. An offset term can be included in the formula instead or as well, and if both are specified their sum is used.

...: additional arguments to be passed to the low-level regression fitting functions (see below).

Page 11: BINF 702 Chapter 11 Regression and Correlation Methods


Section 11.3 – Fitting Regression Lines The Method of Least Squares

Regression in R (Some of the Details) Models for lm are specified symbolically. A typical model has the form

response ~ terms

where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second; this is the same as first + second + first:second.

If response is a matrix, a linear model is fitted separately to each column of the matrix. See model.matrix for some further details. The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order and so on; to avoid this pass a terms object as the formula.

A formula has an implied intercept term. To remove this use either y ~ x - 1 or y ~ 0 + x. See formula for more details of allowed formulae.

lm calls the lower level functions lm.fit, etc, see below, for the actual numerical computations. For programming only, you may consider doing likewise.

All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
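As a small illustration of the formula syntax, where df is a hypothetical data frame with columns y, a, and b:

lm(y ~ a + b, data = df)      # main effects of a and b
lm(y ~ a * b, data = df)      # equivalent to y ~ a + b + a:b
lm(y ~ a + b - 1, data = df)  # the same model with the intercept suppressed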

Page 12: BINF 702 Chapter 11 Regression and Correlation Methods

Section 11.3 – Fitting Regression Lines The Method of Least Squares

Regression in R (Some of the Details) lm returns an object of class "lm" or for multiple responses of class

c("mlm", "lm"). The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results.

The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm. An object of class "lm" is a list containing at least the following components:

coefficients: a named vector of coefficients.

residuals: the residuals, that is, response minus fitted values.

fitted.values: the fitted mean values.

rank: the numeric rank of the fitted linear model.

weights: (only for weighted fits) the specified weights.

df.residual: the residual degrees of freedom.

call: the matched call.

terms: the terms object used.

contrasts: (only where relevant) the contrasts used.

xlevels: (only where relevant) a record of the levels of the factors used in fitting.

y: if requested, the response used.

x: if requested, the model matrix used.

model: if requested (the default), the model frame used.


Page 13: BINF 702 Chapter 11 Regression and Correlation Methods


Section 11.3 – Fitting Regression Lines The Method of Least Squares

Example 11.8 Obstetrics Birthweight as a function of estriol in R.

es = c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,
       17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw = c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,
       32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
library(stats)
bw.lm = lm(bw ~ es)
bw.lm$coefficients
(Intercept)          es
 21.5234286   0.6081905
plot(es, bw)
lines(es, 0.6081905 * es + 21.5234286)

Page 14: BINF 702 Chapter 11 Regression and Correlation Methods

Section 11.4 Inferences About Parameters from Regression Lines

EQ 11.5 Decomposition of the Total Sum of Squares into Regression and Residual Components

Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ (yi − ŷi)²   (sums over i = 1, …, n)

Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares

A good-fitting regression line will have regression components large in absolute value relative to the residual components, whereas the opposite is true for poor-fitting lines.

Check out Figure 11.6


Page 15: BINF 702 Chapter 11 Regression and Correlation Methods


11.4.1 F Test for Simple Linear Regression

We use the ratio of the regression sum of squares to the residual sum of squares as the basis of a test of the regression: a large ratio indicates a good fit. We test H0: b = 0 versus H1: b != 0, where b is the slope of the regression line.

Some helpful notation

Regression mean square (Reg MS) = (Reg SS)/k, where k is the number of predictors in the model (k = 1 here).

Residual mean square (Res MS) = (Res SS)/(n − k − 1); n − k − 1 is the degrees of freedom of the residual sum of squares (Res df). In the literature Res MS is also written s²y·x.

Reg SS = b·Lxy = b²·Lxx = L²xy/Lxx

Res SS = Total SS − Reg SS = Lyy − L²xy/Lxx
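A minimal sketch of these quantities in R, building on the Lxx, Lxy, and b computed in the earlier least-squares sketch; the results should match the aov() table shown a few slides below (Reg SS = 250.57, Res SS = 423.43).

Lyy = sum((bw - mean(bw))^2)   # Total SS
RegSS = Lxy^2 / Lxx            # Regression SS, about 250.57
ResSS = Lyy - RegSS            # Residual SS, about 423.43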

Page 16: BINF 702 Chapter 11 Regression and Correlation Methods


11.4.1 F Test for Simple Linear Regression

Eq. 11.7 F Test for Simple Linear Regression To test H0: b = 0 versus H1: b != 0, use the following procedure:

1) Compute the test statistic

F = Reg MS/Res MS = (L²xy/Lxx) / [(Lyy − L²xy/Lxx)/(n − 2)]

that follows an F1,n-2 distribution under H0.

2) For a two-sided test with significance level α, if

F > F1,n−2,1−α then reject H0; if

F <= F1,n−2,1−α then accept H0.

3) The exact p-value is given by P(F1,n−2 > F)
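The same test computed by hand in R, a sketch building on the sums of squares above; compare F = 17.16 and p = 0.0002712 in the R output on the following slides.

n = length(bw)
Fstat = RegSS / (ResSS / (n - 2))         # Reg MS / Res MS with k = 1 predictor
pf(Fstat, 1, n - 2, lower.tail = FALSE)   # exact p-value, P(F(1,n-2) > F)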

Page 17: BINF 702 Chapter 11 Regression and Correlation Methods


11.4.1 F Test for Simple Linear Regression

Def. 11.14 R2 is defined as (Reg SS)/(Total SS)

Interpretation of R2

R² can be thought of as the proportion of the variance of y that can be explained by the variable x.

R² = 1: all of the data points fall on the regression line.

R² = 0: x gives no information about the variance of y.

Page 18: BINF 702 Chapter 11 Regression and Correlation Methods


11.4.1 F Test for Simple Linear Regression

The obstetrics data revisited in R

> summary(bw.lm)

Call:

lm(formula = bw ~ es)

Residuals:

Min 1Q Median 3Q Max

-8.12000 -2.03810 -0.03810 3.35371 6.88000

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 21.5234 2.6204 8.214 4.68e-09 ***

es 0.6082 0.1468 4.143 0.000271 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.821 on 29 degrees of freedom

Multiple R-Squared: 0.3718, Adjusted R-squared: 0.3501

F-statistic: 17.16 on 1 and 29 DF, p-value: 0.0002712

Page 19: BINF 702 Chapter 11 Regression and Correlation Methods


11.4.1 F Test for Simple Linear Regression

Using aov in R to perform the regression fit on the obstetrics data

> summary(aov(bw ~ es))

Df Sum Sq Mean Sq F value Pr(>F)

es 1 250.57 250.57 17.162 0.0002712 ***

Residuals 29 423.43 14.60

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Page 20: BINF 702 Chapter 11 Regression and Correlation Methods


11.4.2 t Test for Simple Linear Regression

EQ 11.8 t Test for Simple Linear Regression To test the hypothesis H0: b = 0 versus H1: b != 0, use the following procedure:

1) Compute the test statistic

t = b / (s²y·x / Lxx)^(1/2)

2) For a two-sided test with significance level α, if

t > tn−2,1−α/2 or t < tn−2,α/2 = −tn−2,1−α/2

then reject H0; if −tn−2,1−α/2 <= t <= tn−2,1−α/2

then accept H0.

3) The p-value is given by

p = 2 × (area to the left of t under a tn−2 distribution) if t < 0

p = 2 × (area to the right of t under a tn−2 distribution) if t >= 0
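A hand computation of this t statistic in R, a sketch using the quantities from the earlier sketches; compare t = 4.143 and p = 0.000271 for es in the summary output.

s2yx = ResSS / (n - 2)         # residual mean square
tstat = b / sqrt(s2yx / Lxx)   # about 4.143; note tstat^2 equals the F statistic
2 * pt(-abs(tstat), n - 2)     # two-sided p-value, about 0.00027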

Page 21: BINF 702 Chapter 11 Regression and Correlation Methods

11.4.2 t Test for Simple Linear Regression The R output of the obstetrics data revisited

> summary(bw.lm)

Call:

lm(formula = bw ~ es)

Residuals:

Min 1Q Median 3Q Max

-8.12000 -2.03810 -0.03810 3.35371 6.88000

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 21.5234 2.6204 8.214 4.68e-09 ***

es 0.6082 0.1468 4.143 0.000271 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.821 on 29 degrees of freedom

Multiple R-Squared: 0.3718, Adjusted R-squared: 0.3501

F-statistic: 17.16 on 1 and 29 DF, p-value: 0.0002712


Page 22: BINF 702 Chapter 11 Regression and Correlation Methods


11.5 Interval Estimation for Linear Regression

11.5.1 Interval Estimates for Regression Parameters

Under certain assumptions, how well can we quantify the uncertainty in our estimates of the slope and y-intercept?

11.5.2 Interval Estimation for Predictions Made from Regression Line

Under certain assumptions, how well can we quantify the uncertainty in the predicted values?

Page 23: BINF 702 Chapter 11 Regression and Correlation Methods


11.5 Interval Estimation for Linear Regression – 11.5.1 Interval Estimates for Regression Parameters

Eq. 11.9 Standard Errors of Estimated Parameters in Simple Linear Regression

se(b) = sqrt( s²y·x / Lxx )

se(a) = sqrt( s²y·x · [1/n + x̄²/Lxx] )

Page 24: BINF 702 Chapter 11 Regression and Correlation Methods


11.5 Interval Estimation for Linear Regression – 11.5.1 Interval Estimates for Regression Parameters

Eq. 11.10 Two-Sided 100% × (1 − α) Confidence Intervals for the Parameters of a Regression Line: If b and a are, respectively, the estimated slope and intercept of a regression line as given on the previous slide, and se(b) and se(a) are the estimated standard errors, then the two-sided 100% × (1 − α) confidence intervals for b and a are given by

b ± tn−2,1−α/2 · se(b)

a ± tn−2,1−α/2 · se(a)
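For example, a 95% confidence interval for the estriol slope can be sketched by hand from the quantities computed earlier (the estimate 0.6082 and se 0.1468 appear in the summary output):

se.b = sqrt(s2yx / Lxx)                  # se(b), about 0.1468
b + c(-1, 1) * qt(0.975, n - 2) * se.b   # two-sided 95% CI for the slope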

Page 25: BINF 702 Chapter 11 Regression and Correlation Methods

11.5.1 Interval Estimates for Regression Parameters

Confidence intervals on regression parameters in R

> summary(bw.lm)

Call:

lm(formula = bw ~ es)

Residuals:

Min 1Q Median 3Q Max

-8.12000 -2.03810 -0.03810 3.35371 6.88000

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 21.5234 2.6204 8.214 4.68e-09 ***

es 0.6082 0.1468 4.143 0.000271 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.821 on 29 degrees of freedom

Multiple R-Squared: 0.3718, Adjusted R-squared: 0.3501

F-statistic: 17.16 on 1 and 29 DF, p-value: 0.0002712
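The summary output reports only the estimates and standard errors; the intervals of Eq. 11.10 can be obtained directly with confint():

confint(bw.lm, level = 0.95)   # 95% CIs for (Intercept) and es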


Page 26: BINF 702 Chapter 11 Regression and Correlation Methods


11.5.2 Interval Estimation for Predictions Made from Regression Lines

A pedagogical example Forced expiratory volume (FEV) is a standard measure of pulmonary function. To identify people with abnormal pulmonary function, standards of FEV for normal people must be established. One problem here is that FEV is related to both age and height. Let us focus on boys who are ages 10-15 and postulate a regression model of the form FEV = a + b(height) + e. Data were collected on FEV and height for 655 boys in this age group residing in Tecumseh, Michigan. The mean FEV in liters is presented for each of twelve 4-cm height groups in the table below. Find the best-fitting regression line and test it for statistical significance. What proportion of the variance of FEV can be explained by height?

Page 27: BINF 702 Chapter 11 Regression and Correlation Methods


11.5.2 Interval Estimation for Predictions Made from Regression Lines

Our FEV pedagogical example continued.

Height (cm)   Mean FEV (L)   Height (cm)   Mean FEV (L)
134           1.7            158           2.7
138           1.9            162           3.0
142           2.0            166           3.1
146           2.1            170           3.4
150           2.2            174           3.8
154           2.5            178           3.9

Page 28: BINF 702 Chapter 11 Regression and Correlation Methods

11.5.2 Interval Estimation for Predictions Made from Regression Lines
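The transcript does not include the code behind this slide, but the fev.lm object used on the following slides can be reconstructed from the table on the previous slide; its prediction at ht = 160 reproduces the fitted value 2.8969 shown later.

ht  = c(134,138,142,146,150,154,158,162,166,170,174,178)
fev = c(1.7,1.9,2.0,2.1,2.2,2.5,2.7,3.0,3.1,3.4,3.8,3.9)
fev.lm = lm(fev ~ ht)    # roughly FEV = -5.313 + 0.0513 * height
plot(ht, fev)
abline(fev.lm)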


Page 29: BINF 702 Chapter 11 Regression and Correlation Methods


11.5.2 Interval Estimation for Predictions Made from Regression Lines

EX. 11.17 Pulmonary Function Suppose we wish to use the FEV-height regression line computed previously to develop normal ranges for 10- to 15-year-old boys of particular heights. In particular, consider John H., who is 12 years old and 160 cm tall and whose FEV is 2.5 L. Can his FEV be considered abnormal for his age and height?

Page 30: BINF 702 Chapter 11 Regression and Correlation Methods

11.5.2 Interval Estimation for Predictions Made from Regression Lines

Eq. 11.11 Predictions Made from Regression Lines for Individual Observations Suppose we wish to make predictions from a regression line for an individual observation with independent variable x that was not used in constructing the regression line. The distribution of observed y values for the subset of individuals with independent variable x is normal with mean

ŷ = a + bx

and standard deviation given by

se1(ŷ) = sqrt( s²y·x · [1 + 1/n + (x − x̄)²/Lxx] )

Furthermore, 100% × (1 − α) of the observed values will fall within the interval

ŷ ± tn−2,1−α/2 · se1(ŷ)

This interval is sometimes called a 100% × (1 − α) prediction interval for y.


Page 31: BINF 702 Chapter 11 Regression and Correlation Methods


11.5.2 Interval Estimation for Predictions Made from Regression Lines

Predicted Confidence Intervals in R

> new = list(ht=160)

> predict(fev.lm,new,interval='prediction')

fit lwr upr

[1,] 2.896911 2.616527 3.177295

We note that John's observed value of 2.5 L does not fall within the prediction interval. John merits follow-up.

Page 32: BINF 702 Chapter 11 Regression and Correlation Methods


11.5.2 Interval Estimation for Predictions Made from Regression Lines

Suppose we wish to assess the mean FEV value for a large number of boys with the same x value.

Eq. 11.12 Standard Error and Confidence Interval for Predictions Made from Regression Lines for the Average Value of y for a Given x The best estimate of the average value of y for a given x is

ŷ = a + bx

Its standard error is given by

se2(ŷ) = sqrt( s²y·x · [1/n + (x − x̄)²/Lxx] )

Furthermore, a two-sided 100% × (1 − α) confidence interval for the average value of y is

ŷ ± tn−2,1−α/2 · se2(ŷ)

Page 33: BINF 702 Chapter 11 Regression and Correlation Methods


11.5.2 Interval Estimation for Predictions Made from Regression Lines

Predicted Confidence Intervals in R for the average value of y

> predict(fev.lm,new,interval='confidence')

fit lwr upr

[1,] 2.896911 2.81621 2.977613

This is sometimes referred to within the statistics community as the confidence interval for the regression function.

Page 34: BINF 702 Chapter 11 Regression and Correlation Methods


11.5.2 Interval Estimation for Predictions Made from Regression Lines

Example 11.21

Page 35: BINF 702 Chapter 11 Regression and Correlation Methods


11.6 Assessing the Goodness of Fit of Regression Lines

Eq. 11.13 Assumptions Made in Linear-Regression Models

1) For any given value of x, the corresponding value of y has an average value of a + bx, which is a linear function of x.

2) For any given value of x, the corresponding value of y is normally distributed about a + bx with the same variance σ² for any x.

3) For any two data points (x1, y1), (x2, y2), the error terms e1, e2, are independent of each other.

Page 36: BINF 702 Chapter 11 Regression and Correlation Methods

11.6 Assessing the Goodness of Fit of Regression Lines

The simplest type of diagnostic plot.

There may be more variability for larger values of es. Which assumption is this violating?
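A sketch of this simplest diagnostic plot in R, assuming the bw.lm fit from Example 11.8:

plot(es, resid(bw.lm), xlab = "estriol level", ylab = "residual")
abline(h = 0, lty = 2)   # residuals should scatter evenly about zero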


Page 37: BINF 702 Chapter 11 Regression and Correlation Methods

11.6 Assessing the Goodness of Fit of Regression Lines

Eq. 11.14 Standard Deviation of Residuals About the Fitted Regression Line Let (xi, yi) be a sample point used in estimating the regression line y = α + βx. If y = a + bx is the estimated regression line and

êi = residual for the point (xi, yi) about the estimated regression line, then

êi = yi − (a + b·xi)

and

sd(êi) = sqrt( s²y·x · [1 − 1/n − (xi − x̄)²/Lxx] )

The Studentized residual corresponding to the point (xi, yi) is given by êi / sd(êi).
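In R these need not be computed by hand; for the estriol fit, for example:

rstandard(bw.lm)   # residuals scaled by their estimated sd, as in Eq. 11.14
rstudent(bw.lm)    # leave-one-out (externally Studentized) variant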


Page 38: BINF 702 Chapter 11 Regression and Correlation Methods


11.6 Assessing the Goodness of Fit of Regression Lines (Regression Diagnostic Plots in R - I)
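The figures themselves are not reproduced in this transcript; in R the standard diagnostic panels come from calling plot() on the fitted model, along these lines:

par(mfrow = c(2, 2))   # 2 x 2 grid of panels
plot(bw.lm)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage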

Page 39: BINF 702 Chapter 11 Regression and Correlation Methods


11.6 Assessing the Goodness of Fit of Regression Lines (Regression Diagnostic Plots in R - II)

Page 40: BINF 702 Chapter 11 Regression and Correlation Methods

11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)

Assessing uniformity of variance and linearity of residual structure.


Page 41: BINF 702 Chapter 11 Regression and Correlation Methods


11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)

Assessing normality of residual structure with QQ plots.
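A minimal QQ-plot sketch for the residuals of the estriol fit:

qqnorm(resid(bw.lm))   # sample quantiles vs normal quantiles
qqline(resid(bw.lm))   # reference line through the quartiles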

Page 42: BINF 702 Chapter 11 Regression and Correlation Methods


11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)

A few EDA-type plots for assessing normality.

Page 43: BINF 702 Chapter 11 Regression and Correlation Methods

11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)

QQ plots for various types of distributions.


Page 44: BINF 702 Chapter 11 Regression and Correlation Methods


11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)

Cook's distance for the i-th observation is based on the differences between the predicted responses from the model constructed from all of the data and the predicted responses from the model constructed with the i-th observation set aside. The sum of these squared differences is divided by (p + 1) times the residual mean square from the full model, where p is the number of predictors. Some analysts suggest investigating observations for which Cook's distance is greater than 1; others suggest looking at a dot plot to find extreme values.

Cook's distance plots.
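A sketch of such a plot in R for the estriol fit:

cd = cooks.distance(bw.lm)
plot(cd, type = "h", ylab = "Cook's distance")
abline(h = 1, lty = 2)   # one commonly suggested cutoff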

Page 45: BINF 702 Chapter 11 Regression and Correlation Methods


11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)

A pedagogical example. age is the age at first word (x-values) and gesell (y-values) is the Gesell adaptive score.

age = c(15,26,10,9,15,20,18,11,8,20,7,9,10,
        11,11,10,12,42,17,11,10)
gesell = c(95,71,83,91,102,87,93,100,104,94,
           113,96,83,84,102,100,105,57,121,86,100)

> plot(gesell ~ age)

> identify(gesell ~ age)

[1] 2 18 19
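A sketch of how the influence of the flagged points might be examined; the age-42 child (observation 18) typically dominates:

gesell.lm = lm(gesell ~ age)
sort(cooks.distance(gesell.lm), decreasing = TRUE)[1:3]   # observation 18 stands out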

Page 46: BINF 702 Chapter 11 Regression and Correlation Methods


11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)

Gesell example continued

Page 47: BINF 702 Chapter 11 Regression and Correlation Methods


11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)

Page 48: BINF 702 Chapter 11 Regression and Correlation Methods


11.7 The Correlation Coefficient

The sample correlation coefficient offers an alternative way to measure the linear association between two variables, and can be used instead of the regression coefficient. The sample (Pearson) correlation coefficient is given by

r = Lxy/sqrt(Lxx*Lyy)

Properties of r

r > 0 positively correlated

r < 0 negatively correlated

r = 0 uncorrelated

Page 49: BINF 702 Chapter 11 Regression and Correlation Methods


11.7 The Correlation Coefficient

Relationship between the sample correlation coefficient r and the population correlation coefficient ρ:

r = Lxy / sqrt(Lxx·Lyy) = [Lxy/(n − 1)] / [sqrt(Lxx/(n − 1)) · sqrt(Lyy/(n − 1))] = sxy / (sx·sy)

that is, the sample covariance divided by the product of the sample standard deviations, the sample analogue of ρ = Cov(X, Y)/(σx·σy).

Page 50: BINF 702 Chapter 11 Regression and Correlation Methods


11.7 The Correlation Coefficient

There is actually a simple relationship between the sample correlation coefficient and the regression coefficient

b = r · (sy/sx)

So these two quantities really are just rescaled versions of one another.
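This identity is easy to verify in R with the estriol data (compare r = 0.6097 from the next slide with the slope 0.6082 reported earlier):

cor(es, bw) * sd(bw) / sd(es)   # recovers the regression slope b, about 0.6082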

Page 51: BINF 702 Chapter 11 Regression and Correlation Methods


11.7 The Correlation Coefficient

The sample Pearson correlation coefficient, r, in R

Example 11.24

> es = c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,
         15,15,16,19,18,17,18,20,22,25,24)
> bw = c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,
         34,34,35,35,34,35,36,37,38,40,39,43)
> cor(es,bw,method='pearson')
[1] 0.6097313

Page 52: BINF 702 Chapter 11 Regression and Correlation Methods

11.8 Statistical Inference for Correlation Coefficients: One-Sample t-Test for a Correlation Coefficient

Eq. 11.20 One-Sample t Test for a Correlation Coefficient To test the hypothesis H0: ρ = 0 versus H1: ρ != 0, use the following procedure:

1) Compute the sample correlation coefficient r.

2) Compute the test statistic

t = r·(n − 2)^(1/2) / (1 − r²)^(1/2)

which under H0 follows a t distribution with n − 2 df.

3) For a two-sided level α test, if

t > tn−2,1−α/2 or t < −tn−2,1−α/2, then reject H0. If −tn−2,1−α/2 <= t <= tn−2,1−α/2,

accept H0.

4) The p-value is given by

p = 2 × (area to the left of t under a tn−2 distribution) if t < 0

p = 2 × (area to the right of t under a tn−2 distribution) if t >= 0

5) We assume an underlying normal distribution for each of the random variables used to compute r.
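A hand computation of Eq. 11.20 for the estriol data; this t statistic equals the t for the slope in the regression output, and cor.test(es, bw) reports the same values.

r = cor(es, bw)
n = length(es)
tstat = r * sqrt(n - 2) / sqrt(1 - r^2)   # about 4.143
2 * pt(-abs(tstat), n - 2)                # about 0.00027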


Page 53: BINF 702 Chapter 11 Regression and Correlation Methods


11.8 Statistical Inference for Correlation Coefficients: One-Sample t-Test for a Correlation Coefficient

Problem 11.36 pg. 505 in R

> logmort = c(-2.35, -2.20, -2.12,-1.95,-1.85,-1.80,-1.70,-1.58)

> logcig = c(-0.26,-0.03,0.30,0.37,0.40,0.50,0.55,0.55)

> cor(logmort,logcig)

[1] 0.9300082

> cor.test(logmort,logcig)

Pearson's product-moment correlation

data: logmort and logcig

t = 6.1981, df = 6, p-value = 0.0008128

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.653812 0.987513

sample estimates:

cor

0.9300082

Page 54: BINF 702 Chapter 11 Regression and Correlation Methods


11.8 Statistical Inference for Correlation Coefficients: One-Sample z-Test for a Correlation Coefficient

Eq. 11.22 One-Sample z Test for a Correlation Coefficient To test the hypothesis H0: ρ = ρ0 versus H1: ρ != ρ0, use the following procedure:

1) Compute the sample correlation coefficient r and the z transformation of r.

2) Compute the test statistic

λ = (z − z0)·sqrt(n − 3)

3) If λ > z1−α/2 or λ < −z1−α/2, reject H0. If −z1−α/2 <= λ <= z1−α/2, accept H0.

4) The exact p-value is given by

p = 2 × Φ(λ) if λ <= 0

p = 2 × [1 − Φ(λ)] if λ > 0

5) Assume an underlying normal distribution for each of the random variables used to compute r and z.

Page 55: BINF 702 Chapter 11 Regression and Correlation Methods


11.8 Statistical Inference for Correlation Coefficients: One-Sample z-Test for a Correlation Coefficient

z = (1/2)·ln[(1 + r)/(1 − r)] ~ N(z0, 1/(n − 3)) under H0

where

z0 = (1/2)·ln[(1 + ρ0)/(1 − ρ0)]

Page 56: BINF 702 Chapter 11 Regression and Correlation Methods


11.8 Statistical Inference for Correlation Coefficients: One-Sample z-Test for a Correlation Coefficient

There is no direct implementation of this test in R, but the method is used to compute the confidence interval when the number of observations is larger than 6 when one calls cor.test.
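The computation is nonetheless easy to do by hand; a sketch using the smoking data from the previous slide, with a hypothetical null value ρ0 = 0.5:

r = cor(logmort, logcig)
n = length(logmort)
z  = 0.5 * log((1 + r) / (1 - r))       # Fisher's z transform of r
z0 = 0.5 * log((1 + 0.5) / (1 - 0.5))   # z transform of the hypothetical rho0 = 0.5
lambda = (z - z0) * sqrt(n - 3)
2 * pnorm(-abs(lambda))                 # two-sided p-value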

Page 57: BINF 702 Chapter 11 Regression and Correlation Methods

11.9 Multiple Regression

Consider Ex. 11.38 on pg. 466 of the text.

Eq. 11.28 y = a + b1·x1 + b2·x2 + e, where y is the systolic blood pressure, x1 is the birthweight, x2 is the age in days, and e ~ N(0, σ²). We use the method of least squares to minimize the sum of [y − (a + b1·x1 + b2·x2)]².

In general if we have k independent variables x1, …, xk then a linear-regression model relating y to x1, …, xk is of the form

EQ. 11.29

y = a + Σ(j=1..k) bj·xj + e,   e ~ N(0, σ²)


Page 58: BINF 702 Chapter 11 Regression and Correlation Methods


11.9 Multiple Regression

Def. 11.16 In the multiple-regression model

y = a + Σ(j=1..k) bj·xj + e

the coefficients bj, j = 1, …, k, are called partial regression coefficients.

Page 59: BINF 702 Chapter 11 Regression and Correlation Methods


11.9 Multiple Regression

Def. 11.17 The standardized regression coefficient bs is given by

bs = b · (sx/sy)
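A sketch of how bs can be computed in R, using the Example 11.39 fit (bpmv.lm, with predictors bwmv and agemv) shown a few slides later:

coef(bpmv.lm)["bwmv"] * sd(bwmv) / sd(bpmv)     # standardized coefficient for birthweight
coef(bpmv.lm)["agemv"] * sd(agemv) / sd(bpmv)   # standardized coefficient for age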

Page 60: BINF 702 Chapter 11 Regression and Correlation Methods

11.9.2 Hypothesis Testing

Eq. 11.31 F Test for Testing the Hypothesis H0: b1 = b2 = … = bk = 0 versus H1: at least one of the bj != 0 in Multiple Regression

1) Fit the regression parameters using the method of least squares, and compute Reg SS and Res SS, where

Res SS = Σ(i=1..n) (yi − ŷi)²

Reg SS = Total SS − Res SS

Total SS = Σ(i=1..n) (yi − ȳ)²

ŷi = a + Σ(j=1..k) bj·xij

xij = jth independent variable for the ith subject, j = 1, …, k; i = 1, …, n


Page 61: BINF 702 Chapter 11 Regression and Correlation Methods


11.9.2 Hypothesis Testing

Eq. 11.31 F Test for Testing the Hypothesis H0: b1 = b2 = …bk = 0 versus H1:At least One of the bj != 0 in Multiple Regression

2) Compute Reg MS = Reg SS/k and Res MS = Res SS/(n − k − 1)

3) Compute the test statistic

F = Reg MS/Res MS

which follows an Fk,n−k−1 distribution under H0.

4) For a level α test, if F > Fk,n−k−1,1−α then reject H0; if F <= Fk,n−k−1,1−α then accept H0.

5) The exact p-value is given by the area to the right of F under an Fk,n−k−1 distribution = P(Fk,n−k−1 > F)

Page 62: BINF 702 Chapter 11 Regression and Correlation Methods


11.9.2 Hypothesis Testing

Eq. 11.32 t Test for Testing the Hypothesis H0: bl = 0, all other bj != 0 versus H1: bl != 0, all other bj != 0 in Multiple Linear Regression

1) Compute the test statistic

t = bl / se(bl)

2) If t < tn−k−1,α/2 or t > tn−k−1,1−α/2 then reject H0;

if tn−k−1,α/2 <= t <= tn−k−1,1−α/2 then accept H0.

3) The exact p-value is given by

2 × P(tn−k−1 > t) if t >= 0

2 × P(tn−k−1 <= t) if t < 0

Page 63: BINF 702 Chapter 11 Regression and Correlation Methods

11.9 Multiple Regression (EX. 11.39 in R)

> bwmv = c(135,120,100,105,130,125,125,105,120,90,120,95,120,150,160,125)

> agemv = c(3,4,3,2,4,5,2,3,5,4,2,3,3,4,3,3)

> bpmv = c(89, 90, 83, 77, 92, 98, 82, 85, 96, 95, 80, 79, 86, 97, 92, 88)

> bpmv.lm = lm(bpmv ~ bwmv + agemv)

> summary(bpmv.lm)

Call:

lm(formula = bpmv ~ bwmv + agemv)

Residuals:

Min 1Q Median 3Q Max

-4.0438 -1.3481 -0.2395 0.9688 6.6964

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 53.45019 4.53189 11.794 2.57e-08 ***

bwmv 0.12558 0.03434 3.657 0.00290 **

agemv 5.88772 0.68021 8.656 9.34e-07 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.479 on 13 degrees of freedom

Multiple R-Squared: 0.8809, Adjusted R-squared: 0.8626

F-statistic: 48.08 on 2 and 13 DF, p-value: 9.844e-07


Page 64: BINF 702 Chapter 11 Regression and Correlation Methods


11.9.3 Regression Diagnostics
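The diagnostic figures are not reproduced in the transcript; presumably they come from the usual call on the Example 11.39 fit, along these lines:

par(mfrow = c(2, 2))
plot(bpmv.lm)   # the same four diagnostic panels as in the simple-regression case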

Page 65: BINF 702 Chapter 11 Regression and Correlation Methods


11.9 Multiple Regression (EX. 11.39 in R)

Page 66: BINF 702 Chapter 11 Regression and Correlation Methods


11.9 Multiple Regression (EX. 11.39 in R)

Page 67: BINF 702 Chapter 11 Regression and Correlation Methods


Chapter 11 Homework

11.1 – 11.8; 11.17 – 11.20, 11.42 – 11.44