STA 106: Correlation and Linear Regression

Lecturer: Dr. Daisy Dai, Department of Medical Research

Upload: amal

Post on 18-Feb-2016


DESCRIPTION

STA 106: Correlation and Linear Regression. Lecturer: Dr. Daisy Dai, Department of Medical Research. Contents: Correlation; Regression; Simple Regression; Multiple Regression.

TRANSCRIPT

Page 1: STA 106:  Correlation and Linear Regression

STA 106: Correlation and Linear Regression

Lecturer: Dr. Daisy Dai
Department of Medical Research

Page 2: STA 106:  Correlation and Linear Regression

Contents

• Correlation
• Regression
• Simple Regression
• Multiple Regression

Page 3: STA 106:  Correlation and Linear Regression

What is correlation?

• Correlation and linear regression are techniques for dealing with the relationship between two or more continuous variables.

• In correlation we are looking for a linear association between two variables, and the strength of the association is summarized by the correlation coefficient (r) or the coefficient of determination (r²).

Page 4: STA 106:  Correlation and Linear Regression

Case Study: Anemia in Women

• A survey was conducted on a sample of 20 anemic women, randomly selected from a pre-defined geographical area. Each participant had a blood sample taken, and her hemoglobin (Hb) level and packed cell volume (PCV) were measured. Participants were also asked their age and whether or not they had experienced the menopause.

• The goals of the study were to determine whether Hb affects PCV or the other way around, and whether Hb was associated with age.

Page 5: STA 106:  Correlation and Linear Regression

Data

Subject   Hb (g/dl)   PCV (%)   Age (years)   Menopause (0 = No, 1 = Yes)
   1        11.1        35          20            0
   2        10.7        45          22            0
   3        12.4        47          25            0
   4        14.0        50          28            0
   5        13.1        31          28            0
   6        10.5        30          31            0
   7         9.6        25          32            0
   8        12.5        33          35            0
   9        13.5        35          38            0
  10        13.9        40          40            1
  11        15.1        45          45            0
  12        13.9        47          49            1
  13        16.2        49          54            1
  14        16.3        42          55            1
  15        16.8        40          57            1
  16        17.1        50          60            1
  17        16.6        46          62            1
  18        16.9        55          63            1
  19        15.7        42          65            1
  20        16.5        46          67            1

Page 6: STA 106:  Correlation and Linear Regression

Scatter Plots

[Left panel: scatter plot of Hb (g/dl) against Age (years), with points marked by menopause status (No/Yes); R² (linear) = 0.774. Right panel: scatter plot of Hb (g/dl) against PCV (%); R² (linear) = 0.453.]

Page 7: STA 106:  Correlation and Linear Regression

Correlation Coefficient

• The Pearson product-moment correlation coefficient, also known as r, R, or Pearson's r, is a measure of the strength of the linear relationship between two variables, defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations.

Karl Pearson (1857–1936)

Page 8: STA 106:  Correlation and Linear Regression

Formula

For paired observations (x, y), the sample correlation coefficient is

r = S_xy / √(S_xx · S_yy) = Σ(x − x̄)(y − ȳ) / √[ Σ(x − x̄)² · Σ(y − ȳ)² ]

Page 9: STA 106:  Correlation and Linear Regression

Some properties of the Correlation Coefficient

• The sample and population Pearson correlation coefficients range between −1 and 1.

• The absolute value of r stands for the strength of the correlation.

• The sign of r stands for the direction of the relationship: for r > 0, the two variables change in the same direction; for r < 0, the two variables are inversely related.
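As a sketch of the definition above (covariance divided by the product of standard deviations), the coefficient can be computed directly. The function name `pearson_r` and the toy data are illustrative, not from the slides:

```python
import numpy as np

# Illustrative sketch: Pearson's r as sample covariance divided by the
# product of the sample standard deviations.
def pearson_r(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # covariance (biased form)
    return cov / (x.std() * y.std())                # divide by product of SDs

# A perfectly linear, increasing relationship gives r = 1;
# a perfectly linear, decreasing relationship gives r = -1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
assert abs(pearson_r(x, 2 * x + 1) - 1.0) < 1e-12
assert abs(pearson_r(x, -x) + 1.0) < 1e-12
```

Using the biased (divide-by-n) forms for both covariance and standard deviation makes the normalizing factors cancel, so the result equals the usual sample r.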

Page 10: STA 106:  Correlation and Linear Regression


Page 11: STA 106:  Correlation and Linear Regression

Coefficient of Determination

• The coefficient of determination, r², is the proportion of variation in the observed values of the response variable explained by the regression.

Coefficient of determination (r²) = square of the correlation coefficient (r)

• The coefficient of determination always lies between 0 and 1 and is a descriptive measure of the utility of the regression equation for making predictions. A value near 0 indicates that the regression equation is not very useful for making predictions, whereas a value near 1 indicates that it is extremely useful.

r² = SSR / SST = (SST − SSE) / SST = 1 − SSE / SST

Page 12: STA 106:  Correlation and Linear Regression

Some figures of the Coefficient of Determination

Page 13: STA 106:  Correlation and Linear Regression

When not to use the correlation coefficient

Page 14: STA 106:  Correlation and Linear Regression

Interpretation of the size of a correlation

Correlation coefficient (|r|)   Strength of correlation
0.00                            No/Poor
0.01–0.20                       Slight
0.21–0.40                       Fair
0.41–0.60                       Moderate
0.61–0.80                       Substantial
0.81–1.00                       Almost perfect

Page 15: STA 106:  Correlation and Linear Regression

What is Regression?

• Regression methods identify the associations between an outcome variable and explanatory variables. The value of the outcome variable can be predicted from the values of the explanatory variables.

• The outcome variable, also called the dependent variable, appears on the left side of a regression model. The explanatory variables, also called independent variables, appear on the right side.

Outcome variable ← explanatory variable(s)

Birth weight = 0.2 + 0.4 × gestational age

• The relationship is summarized by a regression equation consisting of a slope and an intercept. The intercept is the constant term. The slope reflects the change in the outcome variable with respect to the explanatory variable.

Page 16: STA 106:  Correlation and Linear Regression

The following terminologies are used interchangeably

• Outcome variable
• Dependent variable

Note: These are the phenomena whose variation we want to interpret and predict, for instance, response to treatment.

• Explanatory variable
• Independent variable
• Risk factor

Note: These are variables that can be used to explain the variation in the outcome variable, for instance, demographics, environmental factors, genetic factors, or a medical/educational intervention.

Page 17: STA 106:  Correlation and Linear Regression

Case Study: Orion Cars

• To find the association between the age and price of Orion cars, and to predict price from age, we randomly sampled 11 Orions and list the data in the following table.

Page 18: STA 106:  Correlation and Linear Regression

Orion Car Data

Car    Age, x    Price, y     xy      x²      y²
 1       5         85        425      25     7225
 2       4        103        412      16    10609
 3       6         70        420      36     4900
 4       5         82        410      25     6724
 5       5         89        445      25     7921
 6       5         98        490      25     9604
 7       6         66        396      36     4356
 8       6         95        570      36     9025
 9       2        169        338       4    28561
10       7         70        490      49     4900
11       7         48        336      49     2304
Total   58        975       4732     326    96129

Page 19: STA 106:  Correlation and Linear Regression


Page 20: STA 106:  Correlation and Linear Regression

Results

• Describe the apparent relationship between age and price of Orions.
Because the slope of the regression line is negative, price tends to decrease as age increases.

• Interpret the slope of the regression line in terms of prices for Orions.
Orions depreciate an estimated $2,026 per year, at least in the 2- to 7-year-old range.

• Use the regression equation to predict the price of a 3-year-old Orion and a 4-year-old Orion.
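The slope and intercept behind these statements can be reproduced from the column totals in the table using the computing formulas S_xy and S_xx. This is an illustrative sketch, not code from the lecture; it assumes prices are recorded in hundreds of dollars, consistent with the quoted $2,026-per-year depreciation:

```python
# Least-squares slope and intercept for the Orion data, from the table's
# column totals: n = 11, Σx = 58, Σy = 975, Σxy = 4732, Σx² = 326.
n, sum_x, sum_y, sum_xy, sum_x2 = 11, 58, 975, 4732, 326

S_xy = sum_xy - sum_x * sum_y / n   # Σxy − (Σx)(Σy)/n
S_xx = sum_x2 - sum_x ** 2 / n      # Σx² − (Σx)²/n

b = S_xy / S_xx                     # slope ≈ −20.26 (hundreds of dollars/year)
a = sum_y / n - b * sum_x / n       # intercept ≈ 195.47

# Predicted prices (hundreds of dollars) for 3- and 4-year-old Orions:
price_3 = a + b * 3                 # ≈ 134.68
price_4 = a + b * 4                 # ≈ 114.42
```

The slope of about −20.26 hundreds of dollars per year is where the "$2,026 per year" depreciation figure comes from.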

Page 21: STA 106:  Correlation and Linear Regression

Simple vs. Multiple Regression

• Regression involving one independent variable is called simple linear regression.
  – Outcome variable ← one explanatory variable
  – y = a + b·x + error, where a is the intercept and b is the slope. When b = 0, y is independent of x (i.e., x and y are not correlated). When b > 0, x and y have a positive relationship. When b < 0, x and y have a negative/inverse relationship.
  – Height = 0.2 + 0.4 × weight

• Regression involving a set of independent variables is called multiple regression.
  – Outcome variable ← a set of explanatory variables
  – y = a + b1·x1 + b2·x2 + b3·x3 + b4·x4 + … + error
  – Weight = 0.2 + 0.4 × height + 0.3 × age
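Both forms above can be fitted by least squares with the same machinery: a design matrix with an intercept column plus one column per explanatory variable. This is a minimal sketch on made-up data, not from the lecture:

```python
import numpy as np

# Illustrative sketch: fitting y = a + b*x (simple) and
# y = a + b1*x1 + b2*x2 (multiple) by least squares. Data are made up.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 5, 50)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 0.1, 50)

# Simple regression: intercept column plus x1 only.
X_simple = np.column_stack([np.ones_like(x1), x1])
(a, b), *_ = np.linalg.lstsq(X_simple, y, rcond=None)

# Multiple regression: add x2 as a second explanatory variable.
X_multi = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X_multi, y, rcond=None)
print(coef.round(2))  # close to the true coefficients [2.0, 1.5, -0.8]
```

With the second variable included, the fitted coefficients recover the values used to generate the data; the simple fit absorbs the omitted x2 term into its error.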

Page 22: STA 106:  Correlation and Linear Regression

The Regression Equation

• Least-squares criterion: the straight line that best fits a set of data points is the one having the smallest possible sum of squared errors.

• Regression line: the straight line that best fits a set of data points according to the least-squares criterion.

• Regression equation: the equation of the regression line.

Page 23: STA 106:  Correlation and Linear Regression

Sum of Squares in Regression

• Total sum of squares, SST: the variation in the observed values of the response variable:
  SST = Σ(y − ȳ)²

• Regression sum of squares, SSR: the variation in the observed values of the response variable explained by the regression:
  SSR = Σ(ŷ − ȳ)²

• Error sum of squares, SSE: the variation in the observed values of the response variable not explained by the regression:
  SSE = Σ(y − ŷ)²

Page 24: STA 106:  Correlation and Linear Regression

The three sums of squares, SST, SSR, and SSE, can be obtained by using the following computing formulas:

• Total sum of squares: SST = S_yy
• Regression sum of squares: SSR = S_xy² / S_xx
• Error sum of squares: SSE = S_yy − S_xy² / S_xx
• Regression identity: SST = SSR + SSE
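The definitional and computing formulas can be checked against each other numerically. This sketch uses made-up data, not the lecture's examples:

```python
import numpy as np

# Illustrative check of the regression identity SST = SSR + SSE, comparing
# the definitional formulas with the computing formulas. Data are made up.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

S_xx = np.sum((x - x.mean()) ** 2)
S_yy = np.sum((y - y.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))

b = S_xy / S_xx                        # least-squares slope
a = y.mean() - b * x.mean()            # least-squares intercept
y_hat = a + b * x                      # fitted values

SST = np.sum((y - y.mean()) ** 2)      # total variation (= S_yy)
SSR = np.sum((y_hat - y.mean()) ** 2)  # explained variation (= S_xy²/S_xx)
SSE = np.sum((y - y_hat) ** 2)         # unexplained variation

assert np.isclose(SST, SSR + SSE)          # regression identity
assert np.isclose(SSR, S_xy ** 2 / S_xx)   # computing formula for SSR
r_squared = SSR / SST                      # coefficient of determination
```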

Page 25: STA 106:  Correlation and Linear Regression

Case Study: Anemia in Women

• A random sample of 20 anemic women from a pre-defined geographical area was investigated by a survey. Each had a blood sample taken, and her hemoglobin (Hb) level and packed cell volume (PCV) were measured. They were also asked their age and whether or not they had experienced the menopause.

• The goal of the study is to determine whether Hb affects PCV or the other way around.

Page 26: STA 106:  Correlation and Linear Regression

Data

Subject   Hb (g/dl)   PCV (%)   Age (years)   Menopause (0 = No, 1 = Yes)
   1        11.1        35          20            0
   2        10.7        45          22            0
   3        12.4        47          25            0
   4        14.0        50          28            0
   5        13.1        31          28            0
   6        10.5        30          31            0
   7         9.6        25          32            0
   8        12.5        33          35            0
   9        13.5        35          38            0
  10        13.9        40          40            1
  11        15.1        45          45            0
  12        13.9        47          49            1
  13        16.2        49          54            1
  14        16.3        42          55            1
  15        16.8        40          57            1
  16        17.1        50          60            1
  17        16.6        46          62            1
  18        16.9        55          63            1
  19        15.7        42          65            1
  20        16.5        46          67            1

Page 27: STA 106:  Correlation and Linear Regression

[Left panel: scatter plot of Hb (g/dl) against Age (years), with points marked by menopause status (No/Yes); R² (linear) = 0.774. Right panel: scatter plot of Hb (g/dl) against PCV (%); R² (linear) = 0.453.]

Page 28: STA 106:  Correlation and Linear Regression


Page 29: STA 106:  Correlation and Linear Regression

Outliers

• An outlier is a point that lies far from the regression line. Such points may represent measurement error, or may indicate heterogeneity in sampling.

• An outlier may skew the direction of the regression line and increase the variation in the data.

• Outliers need to be removed from the analysis.

Page 30: STA 106:  Correlation and Linear Regression

Influential Observations

• Influential observations are points far from the other data in the horizontal direction.

• Influential observations may have a significant impact on the slope of the regression line.

• One needs to compare the model fitted with the influential observations against the model fitted without them, and identify the reasons for the influential observations.

• Decide whether influential points need to be removed from the study.
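The with-versus-without comparison described above can be sketched numerically: refit the line after dropping the horizontally extreme point and compare the slopes. The data here are made up to illustrate the effect:

```python
import numpy as np

# Illustrative sketch: an influential observation (far out in x, off the
# trend) drags the least-squares slope; removing it restores the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])  # last point is far out in x
y = np.array([1.1, 2.0, 2.9, 4.1, 5.0, 2.0])   # ...and doesn't follow the trend

def ls_slope(x, y):
    """Least-squares slope b = S_xy / S_xx."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

slope_all = ls_slope(x, y)             # pulled toward zero by the extreme point
slope_trim = ls_slope(x[:-1], y[:-1])  # close to the underlying slope of ~1
print(round(slope_all, 2), round(slope_trim, 2))
```

A large change in the slope between the two fits is exactly the signal that the extreme point is influential and deserves investigation.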

Page 31: STA 106:  Correlation and Linear Regression

Residuals

• A residual is the discrepancy between the observed value and the predicted value.

• A residual plot is a useful diagnostic tool for checking model assumptions and detecting outliers.
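Residuals from a least-squares fit have two properties worth knowing when reading a residual plot: they sum to zero and are uncorrelated with the explanatory variable, so a correct fit should leave no pattern. A minimal sketch on made-up data:

```python
import numpy as np

# Illustrative sketch: residuals = observed minus predicted. For a
# least-squares line they sum to zero and are orthogonal to x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)            # observed minus predicted

assert abs(residuals.sum()) < 1e-9     # residuals sum to zero
assert abs(np.sum(residuals * x)) < 1e-9  # and are orthogonal to x
```

Any remaining curvature or fanning-out in a plot of these residuals against x would indicate a violated model assumption, not a failure of these two identities.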

Page 32: STA 106:  Correlation and Linear Regression

Extrapolation

• Whenever a linear regression model is fit to a group of data, the range of the data should be carefully observed. Attempting to use the regression equation to predict values outside this range is often inappropriate and may yield implausible answers.

• Consider, for example, a linear model relating weight gain to age for young children. Applying such a model to adults, or even teenagers, would be absurd, since the relationship between age and weight gain is not consistent across all age groups.

Page 33: STA 106:  Correlation and Linear Regression

Correlation is not causation

• One of the most common errors in the medical literature is to assume that simply because two variables are correlated, one causes the other. Amusing examples include the positive correlation between the mortality rate in Victorian England and the number of Church of England marriages, and the negative correlation between monthly deaths from ischemic heart disease and monthly ice-cream sales.

In each case the fallacy is obvious, because all the variables are time-related. In the former example, both the mortality rate and the number of Church of England marriages went down during the 19th century; in the latter, deaths from ischemic heart disease are higher in winter, when ice-cream sales are at their lowest.

However, it is always worth trying to think of other variables, confounding factors, which may be related to both of the variables under study.

Page 34: STA 106:  Correlation and Linear Regression

Points when performing correlations or regressions

• Plot the data to see whether the relationship is likely to be linear.

• Are the variables normally distributed? If not, consider transforming a variable or switching to another model.

• Correlation does not necessarily imply causation.

• Think about confounding factors. If a significant correlation is obtained and causation is inferred, could there be a third factor, not measured, which is jointly correlated with the other two and so accounts for their association?

• If a scatter plot is given to support a linear regression, is the variability of the points about the line roughly the same over the range of the independent variable? If not, some transformation of the variables may be necessary before computing the regression line.

• If predictions are given, are any made outside the range of the observed values of the independent variable?

• Outliers need to be removed from the analysis.

Page 35: STA 106:  Correlation and Linear Regression

Software

• An open-source correlation coefficient calculator: http://www.easycalculation.com/statistics/correlation.php

• We will offer an SPSS workshop on correlation, linear, and logistic regression analysis in April.

Page 36: STA 106:  Correlation and Linear Regression

References

• Designing Clinical Research, 3rd edition, by Hulley et al.
• Medical Statistics, by Campbell et al.