CORRELATION & REGRESSION

Correlation and regression are concerned with the investigation of relationships between two or more variables.

Post on 18-Dec-2015



We consider just two associated variables.

We might want to know:

If a relationship exists between those variables

If so, how strong that relationship is

What form that relationship takes

Whether we can make use of that relationship for predictive purposes, i.e. forecasting


Correlation is used to find the strength of the relationship

Regression describes the relationship itself in the form of an equation which best fits the data

General method for investigating the relationship between 2 variables:


For an initial insight into the relationship between two variables:

plot a scatter diagram

If there appears to be a linear relationship, quantify it:

calculate the correlation coefficient

This is a measure of the strength of this linear relationship. Its symbol is 'r' and its value lies between -1 and +1


If the relationship is found to be significantly strong: find the equation of the 'line of best fit' through the data, using linear regression

The 'goodness of fit' statistic can be calculated to see how useful the regression equation is likely to be

Once defined by an equation, the relationship can be used for predictive purposes.


Example

The data represent a sample of advertising expenditures and sales for ten randomly selected months. See slide 12 (longhand calculations) for the complete data.

Month   Advertising expenditure   Sales
        (£0,000's) x              (£0,000's) y
1       1.2                       101
2       0.8                       92
3       1.0                       110
etc.

Plot a scatter diagram of the data


[Figure: Plot of Sales (£0,000's) against Advertising Expenditure (£0,000's). Horizontal axis: advertising (£0,000's), 0.6 to 1.3. Vertical axis: sales (£0,000's), 70 to 120.]

The graph suggests a linear relationship between sales and advertising expenditure.

In general, the larger the amount spent on advertising, the higher the sales.

Note that the scales do not start at zero.


If there is a relationship, we need to be able to measure the strength of that relationship.

i.e. calculate the value of the correlation coefficient


Pearson's Product Moment Correlation Coefficient (r)

is a measure of how close a linear relationship there is between x and y.

can be produced directly from a calculator in LR (linear regression) mode

For the sales and advertising data the correlation coefficient is r = 0.875

The value of r is always between -1 and +1


[Figure: five scatter diagrams of y against x (x from 2 to 14), one for each of the following values of r]

r = -1 perfect negative correlation

r = -0.7

r = 0 no correlation

r = +0.8

r = +1 perfect positive correlation


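Formula for correlation coefficient, r

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

where

$$S_{xx} = \sum x^2 - \frac{\sum x \sum x}{n}$$

$$S_{yy} = \sum y^2 - \frac{\sum y \sum y}{n}$$

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$$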


Longhand calculations for correlation coefficient r.

Step 1

Month    Advertising expenditure   Sales           x^2     y^2      xy
         £0,000's (x)              £0,000's (y)
1        1.2                       101             1.44    10201    121.2
2        0.8                       92              0.64    8464     73.6
3        1.0                       110             1.00    12100    110.0
4        1.3                       120             1.69    14400    156.0
5        0.7                       90              0.49    8100     63.0
6        0.8                       82              0.64    6724     65.6
7        1.0                       93              1.00    8649     93.0
8        0.6                       75              0.36    5625     45.0
9        0.9                       91              0.81    8281     81.9
10       1.1                       105             1.21    11025    115.5
Totals   9.4                       959             9.28    93569    924.8


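Step 2

Therefore:

$$S_{xx} = \sum x^2 - \frac{\sum x \sum x}{n} = 9.28 - \frac{9.4 \times 9.4}{10} = 0.444$$

$$S_{yy} = \sum y^2 - \frac{\sum y \sum y}{n} = 93569 - \frac{959 \times 959}{10} = 1600.9$$

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 924.8 - \frac{9.4 \times 959}{10} = 23.34$$

Step 3

Therefore:

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{23.34}{\sqrt{0.444 \times 1600.9}} = 0.875$$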
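The three longhand steps can be checked with a short script. This is only a sketch: the function names `corrected_sums` and `pearson_r` are illustrative choices, not part of the original slides, and the data are the ten months from the longhand table.

```python
from math import sqrt

# Advertising expenditure (x) and sales (y), both in £0,000's, from the table.
x = [1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1]
y = [101, 92, 110, 120, 90, 82, 93, 75, 91, 105]


def corrected_sums(xs, ys):
    """Return (Sxx, Syy, Sxy) computed from the column totals (Steps 1 and 2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs) - sum_x * sum_x / n
    syy = sum(v * v for v in ys) - sum_y * sum_y / n
    sxy = sum(p * q for p, q in zip(xs, ys)) - sum_x * sum_y / n
    return sxx, syy, sxy


def pearson_r(xs, ys):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy) (Step 3)."""
    sxx, syy, sxy = corrected_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


sxx, syy, sxy = corrected_sums(x, y)
r = pearson_r(x, y)
print(round(sxx, 3), round(syy, 1), round(sxy, 2), round(r, 3))
```

The printed values should agree with the hand calculation to the number of decimal places shown on the slide.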


Hypothesis test for the value of r

We shall not go into the details here!

Null hypothesis (H0): A linear relationship does not exist between sales and advertising.

Alternative hypothesis (H1): A linear relationship does exist between sales and advertising.

If we calculate a test statistic and critical value we discover that test statistic > critical value, so we reject H0.

Conclude that a linear relationship exists between sales and amount spent on advertising.
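The slides do not say which test statistic is used. One common choice, assumed here purely for illustration, is the t statistic t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom. The critical value 2.306 is also an assumption: it is the two-sided 5% point for 8 degrees of freedom, read from standard t tables.

```python
from math import sqrt


def t_statistic(r, n):
    """t statistic for H0: no linear relationship, on n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


r, n = 0.875, 10  # values from the sales and advertising example
t = t_statistic(r, n)

# Assumed significance level: 5%, two-sided, 8 degrees of freedom.
critical_value = 2.306
reject_h0 = abs(t) > critical_value

print(round(t, 2), reject_h0)
```

Under these assumptions the test statistic is well above the critical value, which is consistent with the slide's decision to reject H0.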


The Goodness of Fit Statistic ($R^2$)

This also measures the closeness of the relationship between x and y

$R^2 = 100r^2$

$R^2$ tells us what percentage of the total variation in y (here sales) is explained by the variation in x (here advertising expenditure)


If r = +1 or -1, then $R^2$ = 100%

So 100% of the variation in y is explained by the variation in x.

If r = 0, then $R^2$ = 0%

So none of the variation in y is explained by the variation in x.

For the data above the goodness of fit statistic $R^2 = 100r^2 = 100 \times 0.875^2 = 76.6\%$

Interpretation:


76.6% of the variation in sales is explained by the variation in the amount spent on advertising.

The remaining 23.4% of the variation is explained by other factors, e.g. price, competitors' prices etc.
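The phrase "percentage of variation explained" can be made concrete with a small check. The sketch below uses the least-squares line that the slides go on to derive, measures the variation in sales left over around that line, and compares the explained share with 100r². The variable names are illustrative, and the sums of squares are written in the equivalent deviation-from-the-mean form rather than the column-total form.

```python
# Advertising expenditure (x) and sales (y), both in £0,000's.
x = [1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1]
y = [101, 92, 110, 120, 90, 82, 93, 75, 91, 105]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sums of squares about the means (algebraically equal to Sxx, Syy, Sxy).
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares line y = a + b x.
b = sxy / sxx
a = mean_y - b * mean_x

# Variation in y that the line does not account for.
unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared_from_fit = 100 * (1 - unexplained / syy)
r_squared_from_r = 100 * sxy ** 2 / (sxx * syy)

print(round(r_squared_from_fit, 1), round(r_squared_from_r, 1))
```

Both routes give the same percentage, which is why 100r² can be read as the share of the variation in sales accounted for by the fitted line.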


Regression equation

Since we know, for the sample data, that there is a significant relationship between the two variables, the next obvious step is to find its equation.

We can then add the regression line to the scatter diagram and use it to predict future sales, given advertising expenditure for a particular month.

The regression equation can be produced directly from a calculator in LR mode.


The regression line has the equation:

y = a + bx

x is the independent variable

y is the dependent variable

a is the intercept on the y-axis

b is the gradient or slope of the line.


For the sales and advertising data, the values of a and b are 46.5 and 52.6 respectively. So the regression equation is:

y = 46.5 + 52.6x

Sales = 46.5 + 52.6 × advertising

(a and b can be found using LR mode on your calculator or by calculation)


Formula for a and b

The line of best fit is found by calculating the squares of the differences between the actual and expected values of y.

We choose a and b so that the total of these squared differences is minimized:

$$b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x}$$

where $\bar{x}$, $\bar{y}$ are the means of the x and y data and the S's are defined as previously.

$(\bar{x}, \bar{y})$ is called the centroid.


Calculations for the regression equation.

In the regression equation y = a + bx

$$b = \frac{S_{xy}}{S_{xx}} = \frac{23.34}{0.444} = 52.6$$

$$a = \bar{y} - b\,\bar{x} = 95.9 - 52.6 \times 0.94 = 46.5$$

(As $\bar{y} = \frac{\sum y}{n} = \frac{959}{10} = 95.9$ and $\bar{x} = \frac{\sum x}{n} = \frac{9.4}{10} = 0.94$)

Therefore the regression equation is

y = 46.5 + 52.6x
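The same calculation can be written as a small function. This is a sketch under the formulas above; the name `fit_line` is an illustrative choice and not something defined in the slides.

```python
# Advertising expenditure (x) and sales (y), both in £0,000's.
x = [1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1]
y = [101, 92, 110, 120, 90, 82, 93, 75, 91, 105]


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs) - sum_x * sum_x / n
    sxy = sum(p * q for p, q in zip(xs, ys)) - sum_x * sum_y / n
    b = sxy / sxx                      # slope
    a = sum_y / n - b * (sum_x / n)    # intercept, from the means of y and x
    return a, b


a, b = fit_line(x, y)
print(round(a, 1), round(b, 1))

# The fitted line always passes through the centroid (mean of x, mean of y).
centroid_x, centroid_y = sum(x) / len(x), sum(y) / len(y)
print(abs(a + b * centroid_x - centroid_y) < 1e-9)
```

Because a is defined from the means, the line is guaranteed to pass through the centroid, which is why the centroid is a convenient point to plot.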


Plotting the regression equation on the scatter diagram.

The line y = a + bx can be plotted on the scatter diagram by plotting three points:

the centroid $(\bar{x}, \bar{y})$ and any other two points which satisfy the regression equation.

From the data $(\bar{x}, \bar{y})$ = (0.94, 95.9)

When x = 0.6, y = 46.5 + (52.6 x 0.6) = 78.06

When x = 1.2, y = 46.5 + (52.6 x 1.2) = 109.6

Plot (0.6, 78.06)

Plot (0.94, 95.9)

Plot (1.2, 109.6)


[Figure: Plot of sales (£0,000's) against advertising expenditure (£0,000's). Horizontal axis: advertising, 0.6 to 1.3. Vertical axis: sales, 70 to 120. Points are marked with an x.]


Note: the regression equation y = a + bx can only be used to calculate an estimate for y given the value of x

The linear relationship y = a + bx can only be assumed to exist between y and x for the range of values within the sample
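One way to respect this restriction in practice is to make the prediction function refuse values of x outside the sampled range. The sketch below is illustrative only: `predict_sales` is a hypothetical helper, the coefficients are the rounded values 46.5 and 52.6, and the limits 0.6 and 1.3 are the smallest and largest advertising figures in the sample.

```python
A, B = 46.5, 52.6          # rounded intercept and slope from the example
X_MIN, X_MAX = 0.6, 1.3    # range of advertising values in the sample (£0,000's)


def predict_sales(advertising):
    """Estimate sales (£0,000's) from advertising (£0,000's), within the sample range only."""
    if not X_MIN <= advertising <= X_MAX:
        raise ValueError(
            f"x = {advertising} is outside the sampled range {X_MIN} to {X_MAX}"
        )
    return A + B * advertising


print(round(predict_sales(1.0), 2))

try:
    predict_sales(5)
except ValueError as error:
    print("refused:", error)
```

Raising an error rather than returning a number makes it harder to extrapolate by accident.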


Interpreting the coefficients in the regression equation - first the a value

The intercept (a) is the estimate of y when x = 0, but care is needed if using this. Why?

y = 46.5 + 52.6x

Sales = 46.5 + 52.6 × advertising

When x = 0, y = 46.5

i.e. When nothing is spent on advertising, sales would be expected on average to be 46.5 units = 46.5 x £10,000 = £465,000


the b value

y = 46.5 + 52.6x

If x = 0, y = 46.5, but care is needed here!

If x = 0.6, y = 46.5 + (52.6)(0.6) = 78.06

If x = 0.8, y = 46.5 + (52.6)(0.8) = 88.58

If x = 1, y = 46.5 + 52.6 = 99.1

If x = 1.2, y = 46.5 + (52.6)(1.2) = 109.62

If x = 2, y = 46.5 + 52.6 x 2, but care is needed here also!

etc.

So if advertising expenditure is increased by 1 unit, sales will be increased by 52.6 units on average.


For each additional £10,000 spent on advertising, sales will increase by 52.6 x £10,000 = £526,000 on average.

But we cannot estimate sales outside the range of the sample data:

E.g. we should not try to estimate sales for x = 5 using this method.
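The interpretation of the slope can be checked numerically. The sketch below assumes the rounded coefficients 46.5 and 52.6 and takes one unit on either axis to be £10,000, as in the slides. The helper name `predicted_sales` is illustrative.

```python
UNIT_IN_POUNDS = 10_000   # one unit on either axis is £10,000
A, B = 46.5, 52.6         # rounded intercept and slope from the example


def predicted_sales(advertising_units):
    """Predicted sales, in units of £10,000, for advertising in units of £10,000."""
    return A + B * advertising_units


# Moving from x = 0.8 to x = 1.0 (both inside the sample range) changes the
# prediction by the slope times the step: 52.6 * 0.2 units.
change_in_units = predicted_sales(1.0) - predicted_sales(0.8)
change_in_pounds = change_in_units * UNIT_IN_POUNDS

# Average change in sales, in pounds, per one full unit (£10,000) of advertising.
slope_in_pounds = B * UNIT_IN_POUNDS

print(round(change_in_units, 2), round(change_in_pounds), round(slope_in_pounds))
```

The change in predicted sales depends only on the size of the step in x, not on where the step starts, which is what it means for b to be the gradient of the line.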