Regression and Correlation Analysis


Regression and correlation analysis:

Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. A model of the relationship is hypothesized, and estimates of the parameter values are used to develop an estimated regression equation. Various tests are then employed to determine if the model is satisfactory. If the model is deemed satisfactory, the estimated regression equation can be used to predict the value of the dependent variable given values for the independent variables.

Regression model.

In simple linear regression, the model used to describe the relationship between a single dependent variable y and a single independent variable x is y = β0 + β1x + ε. Here β0 and β1 are referred to as the model parameters, and ε is a probabilistic error term that accounts for the variability in y that cannot be explained by the linear relationship with x. If the error term were not present, the model would be deterministic; in that case, knowledge of the value of x would be sufficient to determine the value of y.

Least squares method.

Either a simple or multiple regression model is initially posed as a hypothesis concerning the relationship among the dependent and independent variables. The least squares method is the most widely used procedure for developing estimates of the model parameters.

As an illustration of regression analysis and the least squares method, suppose a university medical centre is investigating the relationship between stress and blood pressure. Assume that both a stress test score and a blood pressure reading have been recorded for a sample of 20 patients. The data are shown graphically in the figure below, called a scatter diagram. Values of the independent variable, stress test score, are given on the horizontal axis, and values of the dependent variable, blood pressure, are shown on the vertical axis. The line passing through the data points is the graph of the estimated regression equation: y = 42.3 + 0.49x. The parameter estimates, b0 = 42.3 and b1 = 0.49, were obtained using the least squares method.
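As a minimal sketch of the calculation (the 20-patient sample itself is not reproduced here, so the stress scores and blood pressure readings below are invented), the parameter estimates can be obtained as follows:

```python
import numpy as np

# Hypothetical stress test scores (x) and blood pressure readings (y);
# illustrative only, not the medical centre's actual data.
x = np.array([47.0, 53.0, 61.0, 68.0, 74.0, 80.0, 88.0, 95.0])
y = np.array([63.0, 70.0, 72.0, 75.0, 79.0, 82.0, 85.0, 90.0])

# Least squares estimates of the parameters:
#   b1 = sum((x - xbar) * (y - ybar)) / sum((x - xbar)**2)
#   b0 = ybar - b1 * xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
print(f"estimated regression equation: y = {b0:.1f} + {b1:.2f}x")
```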

Correlation.

Correlation and regression analysis are related in the sense that both deal with relationships among variables. The correlation coefficient is a measure of linear association between two variables. Values of the correlation coefficient are always between -1 and +1. A correlation coefficient of +1 indicates that two variables are perfectly related in a positive linear sense, a correlation coefficient of -1 indicates that two variables are perfectly related in a negative linear sense, and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables. For simple linear regression, the sample correlation coefficient is the square root of the coefficient of determination, with the sign of the correlation coefficient being the same as the sign of b1, the coefficient of x in the estimated regression equation.
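That last relationship is easy to verify numerically. A small self-contained check, again with invented data:

```python
import numpy as np

x = np.array([47.0, 53.0, 61.0, 68.0, 74.0, 80.0, 88.0, 95.0])
y = np.array([63.0, 70.0, 72.0, 75.0, 79.0, 82.0, 85.0, 90.0])

b1, b0 = np.polyfit(x, y, 1)                # least squares slope and intercept
y_hat = b0 + b1 * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

r = np.corrcoef(x, y)[0, 1]                 # sample correlation coefficient
print(r, np.sign(b1) * np.sqrt(r_squared))  # the two values agree
```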


Neither regression nor correlation analyses can be interpreted as establishing cause-and-effect relationships. They can indicate only how or to what extent variables are associated with each other. The correlation coefficient measures only the degree of linear association between two variables. Any conclusions about a cause-and-effect relationship must be based on the judgment of the analyst.

Excerpt from the Encyclopedia Britannica without permission.

Chapter 9 - Correlation and Regression

Terminology: Correlation vs. Regression

This chapter will speak of both correlation and regression.

Both use similar mathematical procedures to provide a measure of relation: the degree to which two continuous variables vary together ... or covary.

The term correlation is used when 1) both variables are random variables, and 2) the end goal is simply to find a number that expresses the relation between the variables.

The term regression is used when 1) one of the variables is a fixed variable, and 2) the end goal is to use the measure of relation to predict values of the random variable based on values of the fixed variable.

Examples

In this class, height and ratings of physical attractiveness (both random variables) vary across individuals. We could ask, "What is the correlation between height and these ratings in our class?" Essentially, we are asking "As height increases, is there any systematic increase (positive correlation) or decrease (negative correlation) in one's rating of their own attractiveness?"

Alternatively, we could do an experiment in which the experimenter compliments a subject on their appearance one to eight times prior to obtaining a rating (note that "number of compliments" is a fixed variable). We could now ask "Can we predict a person's rating of their attractiveness, based on the number of compliments they were given?"

Scatterplots

The first way to get some idea about a possible relation between two variables is to do a scatterplot of the variables


Let's consider the first example discussed previously, where we were interested in the possible relation between height and ratings of physical attractiveness.

The Regression Line

Often scatterplots will include a "regression line" that overlays the points in the graph:

The regression line represents the best prediction of the variable on the Y axis (Weight) for each point along the X axis (Height)

For example, my (Steve's) data is not depicted in the graph. But if I tell you that I am 75 inches tall, you can use the graph to predict my weight.

Computing the Regression Line

Going back to your high school days, you perhaps recall that any straight line can be depicted by an equation of the form:

Y = a + bX

Since the regression line is supposed to be the line that provides the best prediction of Y, given some value of X, we need to find values of a & b that produce a line that will be the best-fitting linear function (i.e., the predicted values of Y will come as close as possible to the obtained values of Y).
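Though the text defers the details, the least squares solution is a standard result and can be stated here, with X̄ and Ȳ denoting the sample means:

b = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)²

a = Ȳ - bX̄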

Correlation or Regression?

Correlation makes no a priori assumption as to whether one variable is dependent on the other(s) and is not concerned with the nature of the relationship between the variables; instead it gives an estimate of the degree of association between them. In fact, correlation analysis tests for interdependence of the variables.

As regression attempts to describe the dependence of a variable on one (or more) explanatory variables, it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect. There are advanced regression methods that allow a non-dependence-based relationship to be described (e.g. Principal Components Analysis, or PCA), and these will be touched on later.

The best way to appreciate this difference is by example.

Take for instance samples of the leg length and skull size from a population of elephants. It would be reasonable to suggest that these two variables are associated in some way, as elephants with short legs tend to have small heads and elephants with long legs tend to have big heads. We may, therefore, formally demonstrate that an association exists by performing a correlation analysis. However, would regression be an appropriate tool to describe a relationship between head size and leg length? Does an increase in skull size cause an increase in leg length? Does a decrease in leg length cause the skull to shrink? As you can see, it is meaningless to apply a causal regression analysis to these variables: they are interdependent, and one is not wholly dependent on the other but more likely on some other factor that affects them both (e.g. food supply, genetic makeup).

Consider two variables: crop yield and temperature. These are measured independently, one by the weather station thermometer and the other by Farmer Giles' scales. While correlation analysis would show a high degree of association between these two variables, regression analysis would be able to demonstrate the dependence of crop yield on temperature. However, careless use of regression analysis could also demonstrate that temperature is dependent on crop yield: this would suggest that if you grow really big crops you'll be guaranteed a hot summer!

Abstract

The present review introduces methods of analyzing the relationship between two quantitative variables. The calculation and interpretation of the sample product moment correlation coefficient and the linear regression equation are discussed and illustrated. Common misuses of the techniques are considered. Tests and confidence intervals for the population parameters are described, and failures of the underlying assumptions are highlighted.

Introduction

The most commonly used techniques for investigating the relationship between two quantitative variables are correlation and linear regression. Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation. For example, in patients attending an accident and emergency unit (A&E), we could use correlation and regression to determine whether there is a relationship between age and urea level, and whether the level of urea can be predicted for a given age.


Scatter diagram

When investigating a relationship between two variables, the first step is to show the data values graphically on a scatter diagram. Consider the data given in Table 1. These are the ages (years) and the logarithmically transformed admission serum urea (natural logarithm [ln] urea) for 20 patients attending an A&E. The reason for transforming the urea levels was to obtain a more Normal distribution [1]. The scatter diagram for ln urea and age (Fig. 1) suggests there is a positive linear relationship between these variables.

Correlation

On a scatter diagram, the closer the points lie to a straight line, the stronger the linear relationship between two variables. To quantify the strength of the relationship, we can calculate the correlation coefficient. In algebraic notation, if we have two variables x and y, and the data take the form of n pairs (i.e. [x1, y1], [x2, y2], [x3, y3] ... [xn, yn]), then the correlation coefficient is given by the following equation:

r = Σ(xi - x̄)(yi - ȳ) / √[Σ(xi - x̄)² Σ(yi - ȳ)²]

where x̄ is the mean of the x values, and ȳ is the mean of the y values.

This is the product moment correlation coefficient (or Pearson correlation coefficient). The value of r always lies between -1 and +1. A value of the correlation coefficient close to +1 indicates a strong positive linear relationship (i.e. one variable increases with the other; Fig. 2). A value close to -1 indicates a strong negative linear relationship (i.e. one variable decreases as the other increases; Fig. 3). A value close to 0 indicates no linear relationship (Fig. 4); however, there could be a nonlinear relationship between the variables (Fig. 5).


For the A&E data, the correlation coefficient is 0.62, indicating a moderate positive linear relationship between the two variables.
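As a sketch of that calculation in code (the age and ln urea values below are invented, not the actual A&E sample):

```python
import numpy as np

age = np.array([25.0, 31.0, 40.0, 47.0, 55.0, 62.0, 70.0, 78.0])      # hypothetical
ln_urea = np.array([1.20, 1.35, 1.38, 1.52, 1.60, 1.66, 1.85, 1.98])  # hypothetical

# Product moment (Pearson) correlation coefficient, as in the equation above.
dx, dy = age - age.mean(), ln_urea - ln_urea.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
print(f"r = {r:.2f}")
```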

Regression

In the A&E example we are interested in the effect of age (the predictor or x variable) on ln urea (the response or y variable). We want to estimate the underlying linear relationship so that we can predict ln urea (and hence urea) for a given age. Regression can be used to find the equation of this line. This line is usually referred to as the regression line.

Note that in a scatter diagram the response variable is always plotted on the vertical (y) axis.

Equation of a straight line

The equation of a straight line is given by y = a + bx, where the coefficients a and b are the intercept of the line on the y axis and the gradient, respectively. The equation of the regression line for the A&E data (Fig. 7) is as follows: ln urea = 0.72 + (0.017 × age) (calculated using the method of least squares, which is described below). The gradient of this line is 0.017, which indicates that for an increase of 1 year in age the expected increase in ln urea is 0.017 units (and hence urea is expected to increase by a factor of e^0.017 ≈ 1.02, i.e. about 2%). The predicted ln urea of a patient aged 60 years, for example, is 0.72 + (0.017 × 60) = 1.74 units. This transforms to a urea level of e^1.74 = 5.70 mmol/l. The y intercept is 0.72, meaning that if the line were projected back to age = 0, then the ln urea value would be 0.72. However, this is not a meaningful value because age = 0 is a long way outside the range of the data and therefore there is no reason to believe that the straight line would still be appropriate.
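The worked prediction above is straightforward to script; this sketch simply hard-codes the fitted coefficients quoted in the text:

```python
import math

def predict_urea(age_years: float) -> float:
    """Predict serum urea (mmol/l) from age via ln urea = 0.72 + 0.017 * age."""
    ln_urea = 0.72 + 0.017 * age_years
    return math.exp(ln_urea)          # back-transform from the log scale

print(round(predict_urea(60), 2))     # about 5.7 mmol/l, as in the text
```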

Conclusion

Both correlation and simple linear regression can be used to examine the presence of a linear relationship between two variables providing certain assumptions about the data are satisfied. The results of the analysis, however, need to be interpreted with care, particularly when looking for a causal relationship or when using the regression equation for prediction. Multiple and logistic regression will be the subject of future reviews.


Correlation Coefficient

A correlation coefficient is a number between -1 and 1 which measures the degree to which two variables are linearly related. If there is a perfect linear relationship with positive slope between the two variables, we have a correlation coefficient of 1; if there is positive correlation, whenever one variable has a high (low) value, so does the other. If there is a perfect linear relationship with negative slope between the two variables, we have a correlation coefficient of -1; if there is negative correlation, whenever one variable has a high (low) value, the other has a low (high) value. A correlation coefficient of 0 means that there is no linear relationship between the variables.

There are a number of different correlation coefficients that might be appropriate depending on the kinds of variables being studied.

Spearman Rank Correlation Coefficient

The Spearman rank correlation coefficient is one example of a correlation coefficient. It is usually calculated on occasions when it is not convenient, economical, or even possible to give actual values to variables, but only to assign a rank order to instances of each variable. It may also be a better indicator that a relationship exists between two variables when the relationship is non-linear.

Commonly used procedures, based on the Pearson's Product Moment Correlation Coefficient, for making inferences about the population correlation coefficient make the implicit assumption that the two variables are jointly normally distributed. When this assumption is not justified, a non-parametric measure such as the Spearman Rank Correlation Coefficient might be more appropriate.
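As an illustration (assuming SciPy is available), data that are perfectly monotonic but nonlinear show why the Spearman coefficient can be the better indicator:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1.0, 9.0)      # 1, 2, ..., 8
y = x ** 3                   # nonlinear, but perfectly monotonic in x

print(pearsonr(x, y)[0])     # less than 1: the relationship is not linear
print(spearmanr(x, y)[0])    # exactly 1.0: the ranks agree perfectly
```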

Least Squares

The method of least squares is a criterion for fitting a specified model to observed data. For example, it is the most commonly used method of defining a straight line through a set of points on a scatterplot.
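Stated symbolically, for a straight line y = a + bx the least squares criterion chooses a and b to minimise the sum of squared vertical deviations of the observed points from the line:

minimise Σ(yi - a - bxi)² over all candidate values of a and b.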

Regression Equation

A regression equation allows us to express the relationship between two (or more) variables algebraically. It indicates the nature of the relationship between two (or more) variables. In particular, it indicates the extent to which you can predict some variables by knowing others, or the extent to which some are associated with others.

A linear regression equation is usually written


Y = a + bX + e

where

Y is the dependent variable

a is the intercept

b is the slope or regression coefficient

X is the independent variable (or covariate)

e is the error term

The equation specifies the average magnitude of the expected change in Y given a change in X. For example, if b = 2, then a one-unit increase in X is associated with an expected increase of two units in Y.

The regression equation is often represented on a scatterplot by a regression line.

Regression Line

A regression line is a line drawn through the points on a scatterplot to summarise the relationship between the variables being studied. When it slopes down (from top left to bottom right), this indicates a negative or inverse relationship between the variables; when it slopes up (from bottom left to top right), a positive or direct relationship is indicated.

The regression line often represents the regression equation on a scatterplot.


11. Correlation and regression

The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of wheeziness. However, in statistical terms we use correlation to denote association between two quantitative variables. We also assume that the association is linear, that one variable increases or decreases a fixed amount for a unit increase or decrease in the other. The other technique that is often used in these circumstances is regression, which involves estimating the best straight line to summarise the association.

Correlation coefficient

The degree of association is measured by a correlation coefficient, denoted by r. It is sometimes called Pearson's correlation coefficient after its originator and is a measure of linear association. If a curved line is needed to express the relationship, other and more complicated measures of the correlation must be used.

The correlation coefficient is measured on a scale that varies from +1 through 0 to -1. Complete correlation between two variables is expressed by either +1 or -1. When one variable increases as the other increases the correlation is positive; when one decreases as the other increases it is negative. Complete absence of correlation is represented by 0. Figure 11.1 gives some graphical representations of correlation.


Figure 11.1 Correlation illustrated.

Looking at data: scatter diagrams

When an investigator has collected two series of observations and wishes to see whether there is a relationship between them, he or she should first construct a scatter diagram. The vertical scale represents one set of measurements and the horizontal scale the other. If one set of observations consists of experimental results and the other consists of a time scale or observed classification of some kind, it is usual to put the experimental results on the vertical axis. These represent what is called the "dependent variable". The "independent variable", such as time or height or some other observed classification, is measured along the horizontal axis, or baseline.

The words "independent" and "dependent" could puzzle the beginner because it is sometimes not clear what is dependent on what. This confusion is a triumph of common sense over misleading terminology, because often each variable is dependent on some third variable, which may or may not be mentioned. It is reasonable, for instance, to think of the height of children as dependent on age rather than the converse, but consider a positive correlation between mean tar yield and nicotine yield of certain brands of cigarette. The nicotine liberated is unlikely to have its origin in the tar: both vary in parallel with some other factor or factors in the composition of the cigarettes. The yield of the one does not seem to be "dependent" on the other in the sense that, on average, the height of a child depends on his age. In such cases it often does not matter which scale is put on which axis of the scatter diagram. However, if the intention is to make inferences about one variable from the other, the observations from which the inferences are to be made are usually put on the baseline. As a further example, a plot of monthly deaths from heart disease against monthly sales of ice cream would show a negative association. However, it is hardly likely that eating ice cream protects from heart disease! It is simply that the mortality rate from heart disease is inversely related - and ice cream consumption positively related - to a third factor, namely environmental temperature.

Calculation of the correlation coefficient

A paediatric registrar has measured the pulmonary anatomical dead space (in ml) and height (in cm) of 15 children. The data are given in table 11.1 and the scatter diagram is shown in figure 11.2. Each dot represents one child, and it is placed at the point corresponding to the measurement of the height (horizontal axis) and the dead space (vertical axis). The registrar now inspects the pattern to see whether it seems likely that the area covered by the dots centres on a straight line or whether a curved line is needed. In this case the paediatrician decides that a straight line can adequately describe the general trend of the dots. His next step will therefore be to calculate the correlation coefficient.


When making the scatter diagram (figure 11.2) to show the heights and pulmonary anatomical dead spaces in the 15 children, the paediatrician set out the figures as in columns (1), (2), and (3) of table 11.1. It is helpful to arrange the observations in serial order of the independent variable when one of the two variables is clearly identifiable as independent. The corresponding figures for the dependent variable can then be examined in relation to the increasing series for the independent variable. In this way we get the same picture, but in numerical form, as appears in the scatter diagram.


Figure 11.2 Scatter diagram of relation in 15 children between height and pulmonary anatomical dead space.

The calculation of the correlation coefficient is as follows, with x representing the values of the independent variable (in this case height) and y representing the values of the dependent variable (in this case anatomical dead space). The formula to be used is:

r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² Σ(y - ȳ)²]

Spearman rank correlation

A plot of the data may reveal outlying points well away from the main body of the data, which could unduly influence the calculation of the correlation coefficient. Alternatively the variables may be quantitative discrete such as a mole count, or ordered categorical such as a pain score. A non-parametric procedure, due to Spearman, is to replace the observations by their ranks in the calculation of the correlation coefficient.

This results in a simple formula for Spearman's rank correlation, ρ (rho):

ρ = 1 - 6Σd² / (n(n² - 1))

where d is the difference in the ranks of the two variables for a given individual and n is the number of individuals. Thus we can derive table 11.2 from the data in table 11.1.


In this case the value is very close to that of the Pearson correlation coefficient. For n > 10, the Spearman rank correlation coefficient can be tested for significance using the t test given earlier.
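That t test has a standard form, t = r√(n - 2) / √(1 - r²) on n - 2 degrees of freedom. A small sketch (the r and n below are illustrative, not the values from table 11.2):

```python
import math

r, n = 0.85, 15   # illustrative correlation and sample size
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(f"t = {t:.2f} on {n - 2} degrees of freedom")
```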

The regression equation

Correlation describes the strength of an association between two variables, and is completely symmetrical: the correlation between A and B is the same as the correlation between B and A. However, if the two variables are related it means that when one changes by a certain amount the other changes on average by a certain amount. For instance, in the children described earlier greater height is associated, on average, with greater anatomical dead space. If y represents the dependent variable and x the independent variable, this relationship is described as the regression of y on x.

The relationship can be represented by a simple equation called the regression equation. In this context "regression" (the term is a historical anomaly) simply means that the average value of y is a "function" of x, that is, it changes with x.


The regression equation representing how much y changes with any given change of x can be used to construct a regression line on a scatter diagram, and in the simplest case this is assumed to be a straight line. The direction in which the line slopes depends on whether the correlation is positive or negative. When the two sets of observations increase or decrease together (positive) the line slopes upwards from left to right; when one set decreases as the other increases the line slopes downwards from left to right. As the line must be straight, it will probably pass through few, if any, of the dots. Given that the association is well described by a straight line, we have to define two features of the line if we are to place it correctly on the diagram. The first of these is its distance above the baseline; the second is its slope. They are expressed in the following regression equation:

y = a + bx

Here a is the intercept (the distance above the baseline when x = 0) and b is the slope or gradient of the line.
