lt4009 week 11 slides

Week 11 Correlation & Regression

Upload: ashley-garlick

Post on 18-Dec-2014


Page 1: Lt4009 week 11 slides

Week 11

Correlation & Regression

Page 2: Lt4009 week 11 slides

In this lecture

We learn how to:
• Draw and label a scatter diagram using SPSS.
• Distinguish between positive, negative and no correlation.
• Calculate the correlation coefficient and the coefficient of determination.
• Interpret the findings.

We also learn how to:
• Run a linear regression using SPSS.
• Generate a linear regression equation.
• Interpret the regression coefficients.
• Use the linear regression to make predictions.

Page 3: Lt4009 week 11 slides

What is correlation analysis?

• This is an analysis of the level of association between two or more variables. In this module, we only consider the level of association between two variables.

• If two variables are related in the sense that low (and high) values for one variable are associated with low (and high) values of the other variable, then we say that the two variables are correlated.

• Sometimes the association between variables is causal and sometimes it is not. NOTE THAT STATISTICAL TESTS DO NOT PROVE CAUSALITY; they only show that there is a relationship.

• A causal association is when the level of one variable causes (or forces) the level of the other variable. Example: weight and body mass index among adults.

• A non-causal association is when two variables are related with no obvious one-to-one causation. Example: ice cream sales and atmospheric temperature.

• A spurious correlation between two variables is an association that clearly occurs by accident. Example: height and earnings among workers in London.

Page 4: Lt4009 week 11 slides

The first observation

The first descriptive observation of an association between two variables is made by drawing a scatter diagram.

Here, we need to distinguish between independent and dependent variables.

The dependent variable is the variable of interest to us. It is the variable that is seemingly affected by other factors.

Example: Teachers often want to understand the attendance patterns of students. Does attendance in class depend on the distance travelled to university? If so, then attendance is the dependent variable and distance travelled is the independent variable.

The independent variable should be on the horizontal axis (x).

Page 5: Lt4009 week 11 slides

Scatter diagram

[Figure: scatter diagrams illustrating positive correlation (upward trend), negative correlation (downward trend) and no correlation (no trend), with examples labelled "Stronger" and "Weaker".]

Page 6: Lt4009 week 11 slides

An example

We are interested to know whether the amount of CO2 emissions is associated with gross domestic product (GDP, or growth in the economy) among some countries. In particular, we want to know whether an increase in CO2 emissions is explained by growth in the economy (GDP) and vice versa.

Page 7: Lt4009 week 11 slides

Using SPSS to draw a scatter diagram

Page 8: Lt4009 week 11 slides

Output

[Figure: scatter diagram output, with outliers marked in two places.]

Page 9: Lt4009 week 11 slides

Strength of the correlation: Correlation coefficient

The correlation coefficient is a statistic that measures the strength of the correlation. It is a value between -1 and +1. A value close to -1 suggests a strong negative correlation, whereas a value close to +1 suggests a strong positive correlation. There is no universal cut-off point for a strong correlation. In this module, we accept a cut-off point of 0.7 for a strong correlation.

[Figure: scale from -1 to +1, with strong negative correlation near -1, no correlation near 0 and strong positive correlation near +1.]

There are three types of correlation coefficient. We need to learn only two of them:
• Pearson correlation coefficient: Used only for scale (or cardinal) data.
• Spearman’s rank correlation coefficient: Used mainly for ordinal data.
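SPSS calculates both coefficients for us, but a small sketch may help to show what is being computed. The Python below is not part of the module; the function names and the five data points are made up for illustration. Spearman's coefficient is taken here as Pearson's coefficient applied to the ranks, with tied values sharing an average rank.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upwards; tied values share the average of their ranks."""
    ordered = sorted(values)
    # ordered.index(v) + 1 is the first rank held by v; ties span count(v) ranks.
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Both results are positive and fall between -1 and +1, as expected for data with an upward trend.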

Page 10: Lt4009 week 11 slides

Using SPSS to calculate the correlation coefficient

Obviously, if we are dealing with ordinal data, we will use the Spearman correlation.

The correlation coefficient between CO2 emissions and GDP is 0.721. This is considered a relatively strong positive correlation.
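The module's 0.7 cut-off can be written as a small rule. This is only a sketch: the function name is made up, applying the cut-off to the size of the coefficient regardless of sign is an assumption, and so is the label "weaker" for anything below the cut-off.

```python
def describe_correlation(r, cutoff=0.7):
    """Describe a correlation coefficient using the module's 0.7 cut-off.

    Assumption: the cut-off is applied to the absolute value of r, and
    anything below it is simply labelled "weaker".
    """
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient lies between -1 and +1")
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    strength = "strong" if abs(r) >= cutoff else "weaker"
    return f"{strength} {direction} correlation"

print(describe_correlation(0.721))  # the CO2 and GDP coefficient from the SPSS output
```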

Page 11: Lt4009 week 11 slides

Linear regression

Regression analysis is concerned with finding a mathematical model for the relationship between the variables.

A linear regression finds a mathematical model in the form of a straight line. In the process of regression, we find the equation for the dependent variable (the variable of interest).

A linear equation is of the form

y = a + bx

The values of a and b are obtained from the data of the two variables.

y is the dependent variable and x the independent variable.
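To see how a and b in the straight line y = a + bx can be obtained from the data, here is a minimal sketch using the usual least-squares formulas. The function name and the four data points are hypothetical; in this module the calculation is left to SPSS.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    sxx = sum((u - mean_x) ** 2 for u in x)
    b = sxy / sxx            # gradient
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical data lying exactly on the line y = 1 + 2x.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Because the made-up points lie exactly on a line, the fit recovers an intercept of 1 and a gradient of 2.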

Page 12: Lt4009 week 11 slides

Using SPSS

Page 13: Lt4009 week 11 slides

Regression output

From the output above, we can write down the regression equation as follows:

CO2 = 45.144 + 516.405*GDP

The R Square value (0.521) is called the coefficient of determination.

Page 14: Lt4009 week 11 slides

Interpreting the output

The coefficient of determination (0.521) represents the proportion of the variance in CO2 emissions that is explained by the variance in GDP. So, we can say that 52.1% of the variation in CO2 emissions is explained by variation in GDP. The rest (47.9%) is explained by factors other than GDP.
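The idea of "proportion of variance explained" can be sketched as one minus the unexplained (residual) variation divided by the total variation. The data and the function name below are hypothetical, and the values a = 2.2 and b = 0.6 are the least-squares coefficients for this made-up data, not the SPSS output.

```python
def r_squared(x, y, a, b):
    """Share of the variation in y explained by the line y = a + b*x."""
    mean_y = sum(y) / len(y)
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_residual = sum((v - (a + b * u)) ** 2 for u, v in zip(x, y))
    return 1 - ss_residual / ss_total

# Hypothetical data; a = 2.2 and b = 0.6 are its least-squares coefficients.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(r_squared(x, y, 2.2, 0.6), 3))
```

For this made-up data the line explains 60% of the variation in y, and the remaining 40% is left unexplained.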

The regression equation is made up of two coefficients, namely 45.144 and 516.405. These are called the coefficients of regression.

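The coefficient 516.405 is called the gradient and tells us that an increase in GDP of $1bn is associated with a predicted increase of 516.405 tons of CO2. The coefficient 45.144 is the intercept: the predicted CO2 emissions when GDP is zero.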

CO2 = 45.144 + 516.405*GDP

Page 15: Lt4009 week 11 slides

Predictions

From the output above, we can write down the regression equation as follows:

CO2 = 45.144 + 516.405*GDP

Example 1: Tonga
CO2 = 45.144 + 516.405*0.24 = 45.144 + 123.9372 = 169.0812

Example 2: Liberia
CO2 = 45.144 + 516.405*0.61 = 45.144 + 315.00705 = 360.151

The predictions based on the regression equation are added in a new column.
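The two predictions can be reproduced with a few lines of Python. The coefficients and the GDP values for Tonga and Liberia come from the slides; the function and variable names are made up, and this is only a sketch of what SPSS does when it adds the prediction column.

```python
INTERCEPT = 45.144   # a, from the regression output
GRADIENT = 516.405   # b, from the regression output

def predict_co2(gdp):
    """Predicted CO2 emissions for a given GDP, using CO2 = a + b*GDP."""
    return INTERCEPT + GRADIENT * gdp

# GDP values used in the two worked examples.
gdp_by_country = {"Tonga": 0.24, "Liberia": 0.61}

# The equivalent of the new prediction column, one value per country.
predictions = {country: predict_co2(gdp) for country, gdp in gdp_by_country.items()}

for country, co2 in predictions.items():
    print(country, round(co2, 3))
```

Rounded to three decimal places, the results match the hand calculations for Tonga and Liberia.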

Page 16: Lt4009 week 11 slides

Another example

We asked 55 students to rate the following aspects of the module on a scale of 1 to 5 (1 = Not at all good and 5 = Very good):
• The general quality of the lecturer
• The quality of the seminar sessions
• The quality of the module booklet

We want to know if there is any correlation between the three ratings.

Page 17: Lt4009 week 11 slides

Another example

We choose Spearman’s rank correlation coefficient because we are dealing with ordinal data.

Page 18: Lt4009 week 11 slides

Output

Note the symmetry of this correlation matrix.

The highest correlation exists between the ratings for the quality of the lecturer and the quality of the module booklet, R = 0.892.

This means that the students’ views on the lecturer and the module booklet were very similar. Students who like the lecturer tend to like the module booklet, and those who don’t like the lecturer tend to dislike the module booklet.

A fairly weak correlation exists between the ratings of the other qualities.
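The symmetry of the correlation matrix can be checked with a short sketch. The ratings below are invented for eight hypothetical students (not the 55 in the survey), and the function names are made up. Spearman's coefficient is computed as Pearson's coefficient on average ranks, which copes with the many ties that a 1 to 5 scale produces.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 upwards; tied values share the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

def pearson(x, y):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_matrix(columns):
    """Spearman correlation for every pair of columns, as a nested dictionary."""
    ranked = {name: average_ranks(values) for name, values in columns.items()}
    return {row: {col: pearson(ranked[row], ranked[col]) for col in ranked}
            for row in ranked}

# Hypothetical ratings (1 to 5) from eight students, for illustration only.
ratings = {
    "lecturer": [5, 4, 4, 3, 2, 5, 1, 3],
    "seminar":  [3, 4, 2, 5, 3, 4, 2, 1],
    "booklet":  [5, 4, 3, 3, 2, 4, 1, 3],
}

matrix = spearman_matrix(ratings)
for row, values in matrix.items():
    print(row, [round(v, 3) for v in values.values()])
```

Each variable correlates perfectly with itself, so the diagonal is 1, and the entry for (lecturer, booklet) equals the entry for (booklet, lecturer), which is why only one half of the matrix needs to be read.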