l10-Edu5950 Simple Regression Analysis



EDU5950
SEM2 2010-11

CORRELATION & SIMPLE REGRESSION

Correlation - Test of association

• A correlation measures the “degree of association” between two variables (interval or ordinal)
• Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other)
• Correlation is measured in “r” (parametric, Pearson’s) or “ρ” (non-parametric, Spearman’s)
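
As a quick illustration of the two coefficients (not part of the original slides), the sketch below computes Pearson’s r and Spearman’s ρ in Python with SciPy; the variable names and data values are made up for the example.

```python
# Minimal sketch: Pearson's r vs Spearman's rho on made-up paired scores.
# Assumes SciPy is available; the data values are hypothetical.
from scipy.stats import pearsonr, spearmanr

attitude = [2, 4, 5, 7, 8, 10, 11, 13]            # interval-scale variable
behaviour_freq = [1, 3, 4, 4, 6, 9, 8, 12]        # tends to increase with attitude

r, r_p = pearsonr(attitude, behaviour_freq)        # parametric: Pearson's r
rho, rho_p = spearmanr(attitude, behaviour_freq)   # non-parametric: Spearman's rho

print(f"Pearson r    = {r:.3f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
```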


Test of association - Correlation

• Compare two continuous variables in terms of degree of association
  - e.g. attitude scale vs behavioural frequency

[Figure: two scatterplots illustrating a positive association and a negative association]

Test of association - Correlation

• Test statistic is “r” (parametric) or “ρ” (non-parametric)
  - 0 (random distribution, zero correlation)
  - 1 (perfect correlation)

[Figure: two scatterplots illustrating high and low correlation]


Test of association - Correlation

• Test statistic is “r” (parametric) or “ρ” (non-parametric)
  - 0 (random distribution, zero correlation)
  - 1 (perfect correlation)

[Figure: two scatterplots illustrating high and zero correlation]

Regression & Correlation

• A correlation measures the “degree of association” between two variables (interval (50, 100, 150, …) or ordinal (1, 2, 3, …))
• Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other)


Example: Symptom Index vs Drug A

• “Best fit line”
• Allows us to describe the relationship between variables more accurately.
• We can now predict specific values of one variable from knowledge of the other
• All points are close to the line

[Graph Three: Relationship between Symptom Index and Drug A (with best-fit line); x-axis: Drug A (dose in mg), y-axis: Symptom Index]

Example: Symptom Index vs Drug B

• We can still predict specific values of one variable from knowledge of the other
• Will predictions be as accurate?
• Why not?
• “Residuals”

[Graph Four: Relationship between Symptom Index and Drug B (with best-fit line); x-axis: Drug B (dose in mg), y-axis: Symptom Index]


Correlation examples

Regression

• Regression analysis procedures have as their primary purpose the development of an equation that can be used for predicting values on some DV for all members of a population.
• A secondary purpose is to use regression analysis as a means of explaining causal relationships among variables.


• The most basic application of regression analysis is the bivariate situation, referred to as simple linear regression, or just simple regression.
• Simple regression involves a single IV and a single DV.
• Goal: to obtain a linear equation so that we can predict the value of the DV if we have the value of the IV.
• Simple regression capitalizes on the correlation between the DV and IV in order to make specific predictions about the DV.
• The correlation tells us how much information about the DV is contained in the IV.
• If the correlation is perfect (i.e. r = ±1.00), the IV contains everything we need to know about the DV, and we will be able to perfectly predict one from the other.
• Regression analysis is the means by which we determine the best-fitting line, called the regression line.
• The regression line is the straight line that lies closest to all points in a given scatterplot.
• This line sometimes passes through the centroid of the scatterplot.


• 3 important facts about the regression line must be known:
  - the extent to which points are scattered around the line
  - the slope of the regression line
  - the point at which the line crosses the Y-axis
• The extent to which the points are scattered around the line is typically indicated by the degree of relationship between the IV (X) and DV (Y). This relationship is measured by a correlation coefficient: the stronger the relationship, the higher the degree of predictability between X and Y.
• The degree of slope is determined by the amount of change in Y that accompanies a unit change in X.
• It is the slope that largely determines the predicted values of Y from known values for X.
• It is important to determine exactly where the regression line crosses the Y-axis (this value is known as the Y-intercept).


• The regression line is essentially an equation that expresses Y as a function of X.
• The basic equation for simple regression is:
  Ŷ = a + bX
  where Ŷ is the predicted value for the DV,
  X is the known raw-score value on the IV,
  b is the slope of the regression line,
  a is the Y-intercept.

Simple Linear Regression

• Purpose
  - determine the relationship between two metric variables
  - predict the value of the dependent variable (Y) based on the value of the independent variable (X)
• Requirements:
  - DV: interval / ratio
  - IV: interval / ratio
  - the independent and dependent variables are normally distributed in the population
  - the cases represent a random sample from the population

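
A minimal sketch of this setup in Python, using NumPy’s least-squares polynomial fit; the dose/symptom numbers are invented and are not the data behind the slides’ graphs.

```python
# Minimal sketch: fit Y-hat = a + bX by least squares and predict a new value.
# The dose/symptom values below are invented for illustration.
import numpy as np

dose = np.array([50, 75, 100, 125, 150, 175, 200, 225])   # IV (X), interval/ratio
symptom = np.array([42, 55, 70, 78, 95, 104, 121, 133])   # DV (Y), interval/ratio

b, a = np.polyfit(dose, symptom, deg=1)   # slope b and intercept a of the best-fit line
print(f"Y-hat = {a:.2f} + {b:.3f} X")

new_dose = 160                            # predict the DV for a new IV value
print(f"Predicted symptom index at {new_dose} mg: {a + b * new_dose:.1f}")
```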

Simple Regression: How best to summarise the data?

[Figure: scatterplots of Symptom Index against Drug A (dose in mg), without and with a best-fit line]

• Adding a best-fit line allows us to describe data simply
• Establish the equation for the best-fit line:
  Y = a + bX
  where:
    a = Y-intercept (constant)
    b = slope of best-fit line
    Y = dependent variable
    X = independent variable

General Linear Model (GLM): How best to summarise the data?

[Figure: scatterplot with best-fit line]


Simple Regression: R² - “Goodness of fit”

• For simple regression, R² is the square of the correlation coefficient
• Reflects the variance accounted for in the data by the best-fit line
• Takes values between 0 (0%) and 1 (100%)
• Frequently expressed as a percentage, rather than a decimal
• High values show good fit, low values show poor fit

Simple Regression: Low values of R²

• R² = 0 (0%: randomly scattered points, no apparent relationship between X and Y)
• Implies that a best-fit line will be a very poor description of the data

[Figure: scatterplot of the DV against the IV (regressor, predictor) with no apparent relationship]
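
Because R² in simple regression is just r squared, a short sketch (hypothetical data) can show that the squared correlation and the “variance accounted for” by the best-fit line coincide:

```python
# Minimal sketch: R-squared from the fitted line equals the squared Pearson r.
# Hypothetical data; NumPy only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r = np.corrcoef(x, y)[0, 1]            # Pearson correlation coefficient
b, a = np.polyfit(x, y, deg=1)         # best-fit line
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)      # variation left around the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in Y
r_squared = 1 - ss_res / ss_tot        # proportion of variance accounted for

print(f"r = {r:.4f}, r^2 = {r**2:.4f}, R^2 from fit = {r_squared:.4f}")
```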


Simple Regression: High values of R²

• R² = 1 (100%: points lie directly on the line, a perfect relationship between X and Y)
• Implies that a best-fit line will be a very good description of the data

[Figure: two scatterplots of the DV against the IV with points lying on or very close to the best-fit line]

Simple Regression: R² - “Goodness of fit”

[Figure: Symptom Index vs Drug A (dose in mg) and Symptom Index vs Drug B (dose in mg), each with a best-fit line]

• Good fit: R² high, high variance explained
• Moderate fit: R² lower, less variance explained


Problem: to draw a straight line through the points that best explains the variance

[Figure: scatterplot with a candidate straight line drawn through the points]

• The line can then be used to predict Y from X

Example: Symptom Index vs Drug A

• “Best fit line”
• Allows us to describe the relationship between variables more accurately.
• We can now predict specific values of one variable from knowledge of the other
• All points are close to the line

[Graph Three: Relationship between Symptom Index and Drug A (with best-fit line); x-axis: Drug A (dose in mg), y-axis: Symptom Index]


Regression

• Establish the equation for the best-fit line:
  Y = a + bX
• The best-fit line is the same as the regression line
• b is the regression coefficient for x
• x is the predictor or regressor variable for y

Regression - Types


Linear Regression - Model

Population model:
  Yᵢ = β₀ + β₁Xᵢ + εᵢ
  (β₀ and β₁ are the regression coefficients; β₀ is the constant)

Sample (estimated) model:
  Ŷ = a + bX

Parameters

• The population parameters β₀ and β₁ are simply the least-squares estimates computed on all the members of the population, not just the sample
• Population parameters: β₀ and β₁
• Sample statistics: a and b
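
To make the population/sample distinction concrete, this sketch simulates data from a known model Yᵢ = β₀ + β₁Xᵢ + εᵢ and recovers the sample statistics a and b; the chosen β₀, β₁ and noise level are arbitrary, not values from the course.

```python
# Minimal sketch: generate data from Y_i = beta0 + beta1*X_i + eps_i,
# then estimate the sample statistics a and b. Parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 18.0, 8.0                  # population parameters (illustrative)

x = rng.uniform(5, 10, size=200)          # sampled IV values
eps = rng.normal(0.0, 4.0, size=200)      # random error term
y = beta0 + beta1 * x + eps               # population model

b, a = np.polyfit(x, y, deg=1)            # sample statistics b, a
print(f"a = {a:.2f} (estimate of beta0), b = {b:.2f} (estimate of beta1)")
```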


Inference About the Population Slope and Intercept

• If β₁ ≠ 0, then we have a graph like this:

[Figure: plot of the regression line Y = β₀ + β₁X against X]

• β₀ + β₁X is the mean of Y for those whose independent variable is X



Inference About the Population Slope and Intercept

• If β₁ = 0, then we have a graph like this:

[Figure: plot of Y = β₀ + β₁X against X with a flat (zero-slope) line; note how the mean of Y does not depend on X: Y and X are independent]

Linear Regression and Correlation

• If β₁ = 0, then Y and X are independent
• So, we can test the null hypothesis that Y and X are independent by testing H₀: β₁ = 0
• The p-value in regression tables tests this hypothesis
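
A hedged sketch of this slope test using scipy.stats.linregress, which returns the two-sided p-value for H₀: β₁ = 0 (the paired data here are invented):

```python
# Minimal sketch: test H0: beta1 = 0 with scipy.stats.linregress.
# The paired data are invented for illustration.
from scipy.stats import linregress

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.9, 8.1, 8.8]

result = linregress(x, y)
print(f"slope b = {result.slope:.3f}, intercept a = {result.intercept:.3f}")
print(f"p-value for H0: slope = 0 -> {result.pvalue:.4g}")   # small p: reject independence
```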

     


Ice Cream Example

X (Temperature)   Y (Sales)
63                1.52
70                1.68
73                1.8
75                2.05
80                2.36
82                2.25
85                2.68
88                2.9
90                3.14
91                3.06
92                3.24
75                1.92
98                3.4
100               3.28
92                3.17
87                2.83
84                2.58
88                2.86
80                2.26
82                2.14
76                1.98

[Figure: scatterplot of Ice Cream Sales against Temperature, with the simple regression line Ŷ = a + bX]

TWO STEPS TO SIMPLE LINEAR REGRESSION

Descriptive:
• Regression equation: Ŷ = a + bX
• Correlation coefficient (r)
• Coefficient of Determination (r²)

Inferential - Hypothesis Tests:
1. Regression Model
2. Slope
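
As a sketch of the descriptive step, the temperature/sales pairs from the table above can be run through a least-squares fit in Python (one possible tool, not the procedure used in the course):

```python
# Minimal sketch: regression equation, r and r^2 for the Ice Cream Example above.
import numpy as np

temperature = np.array([63, 70, 73, 75, 80, 82, 85, 88, 90, 91, 92,
                        75, 98, 100, 92, 87, 84, 88, 80, 82, 76])
sales = np.array([1.52, 1.68, 1.8, 2.05, 2.36, 2.25, 2.68, 2.9, 3.14, 3.06, 3.24,
                  1.92, 3.4, 3.28, 3.17, 2.83, 2.58, 2.86, 2.26, 2.14, 1.98])

b, a = np.polyfit(temperature, sales, deg=1)   # regression equation Y-hat = a + bX
r = np.corrcoef(temperature, sales)[0, 1]      # correlation coefficient
print(f"Y-hat = {a:.3f} + {b:.4f} X,  r = {r:.3f},  r^2 = {r**2:.3f}")
```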


First Step: Descriptive

Derive the Regression / Prediction equation

• Calculate a and b
  a = ȳ − b x̄
  Ŷ = a + bX

Example 1:

Data were collected from a randomly selected sample to determine the relationship between average assignment scores and test scores in statistics. The distribution of the data is presented in the table below.

1. Calculate the coefficient of determination and the correlation coefficient.
2. Determine the prediction equation.
3. Test the hypothesis for the slope at the 0.05 level of significance.

Data set:

ID   Assign   Test
1    8.5      88
2    6        66
3    9        94
4    10       98
5    8        87
6    7        72
7    5        45
8    6        63
9    7.5      85
10   5        77


1. Derive the Regression / Prediction equation

ID   X     Y
1    8.5   88
2    6     66
3    9     94
4    10    98
5    8     87
6    7     72
7    5     45
8    6     63
9    7.5   85
10   5     77

Summary statistics:
  n = 10
  ΣX = 72
  ΣY = 775
  ΣX² = 544.5
  ΣY² = 62,441
  ΣXY = 5,795.5

b = (ΣXY − ΣXΣY/n) / (ΣX² − (ΣX)²/n) = (5,795.5 − (72)(775)/10) / (544.5 − (72)²/10) = 215.5 / 26.1 = 8.257

a = ȳ − b x̄ = 77.5 − 8.257(7.2) = 18.050

Prediction equation:
  Ŷ = 18.05 + 8.257X

Interpretation of the regression equation

Ŷ = 18.05 + 8.257X

For every 1 unit change in X, Y will change by 8.257 units.

[Figure: regression line with slope triangle ΔX, ΔY illustrating the slope 8.257 and the Y-intercept 18.05]
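
The same numbers can be reproduced from the raw scores with the computational formulas above (a sketch for checking the arithmetic):

```python
# Minimal sketch: reproduce b = 215.5 / 26.1 = 8.257 and a = 18.05 for Example 1.
x = [8.5, 6, 9, 10, 8, 7, 5, 6, 7.5, 5]       # assignment scores (X)
y = [88, 66, 94, 98, 87, 72, 45, 63, 85, 77]  # test scores (Y)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

sp = sum_xy - sum_x * sum_y / n     # 5,795.5 - 72*775/10 = 215.5
ss_x = sum_x2 - sum_x ** 2 / n      # 544.5 - 72^2/10    = 26.1
b = sp / ss_x                       # 8.257
a = sum_y / n - b * sum_x / n       # 77.5 - 8.257*7.2   = 18.05
print(f"b = {b:.3f}, a = {a:.3f}, prediction equation: Y-hat = {a:.2f} + {b:.3f} X")
```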


Example 2: MARITAL SATISFACTION

Parents: X   Children: Y
1            3
3            2
7            6
9            7
8            8
4            6
5            3

Summary statistics needed: mean of X, mean of Y, number of pairs, ΣX, ΣY, ΣX², ΣY², standard deviation of X, standard deviation of Y, ΣXY

1. Derive the Regression / Prediction equation

a = ȳ − b x̄
  = 5.00 − .65(5.29)
  = 1.56

Prediction equation:
  Ŷ = 1.56 + .65X


Interpretation of the regression equation

Ŷ = 1.56 + .65X

For every 1 unit change in X, Y will change by .65 units.

[Figure: regression line with slope triangle ΔX, ΔY illustrating the slope 0.65 and the Y-intercept]
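
As a check on Example 2 (a sketch only), applying the same least-squares formulas to the parent/child scores gives a slope of about .65 and an intercept of about 1.6; small differences from the figures above come from rounding:

```python
# Minimal sketch: slope and intercept for Example 2 from the raw scores.
x = [1, 3, 7, 9, 8, 4, 5]   # parents' marital satisfaction (X)
y = [3, 2, 6, 7, 8, 6, 3]   # children's marital satisfaction (Y)
n = len(x)

x_bar = sum(x) / n          # 5.29
y_bar = sum(y) / n          # 5.00
sp = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_x = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

b = sp / ss_x               # about 0.65
a = y_bar - b * x_bar       # about 1.6
print(f"b = {b:.3f}, a = {a:.3f}")
```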

Descriptive Statistics

                    Mean     Std. Deviation   N
Grade - PMR MATH    2.53     1.468            62
TEACHER_FACTOR      3.9643   .91443           62

Correlations

                                        Grade - PMR MATH   TEACHER_FACTOR
Pearson Correlation   Grade - PMR MATH  1.000              .571
                      TEACHER_FACTOR    .571               1.000
Sig. (1-tailed)       Grade - PMR MATH  .                  .000
                      TEACHER_FACTOR    .000               .
N                     Grade - PMR MATH  62                 62
                      TEACHER_FACTOR    62                 62

Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .571(a)   .326       .315                1.215

a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH


ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  42.848           1    42.848        29.021   .000(a)
   Residual    88.588           60   1.476
   Total       131.435          61

a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

Coefficients(a)

                    Unstandardized Coefficients   Standardized Coefficients
Model               B         Std. Error          Beta                        t        Sig.
1  (Constant)       -1.101    .692                                            -1.591   .117
   TEACHER_FACTOR   .917      .170                .571                        5.387    .000

a. Dependent Variable: Grade - PMR MATH
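
For a single-predictor model, the standardized coefficient (Beta) can be recovered from the unstandardized slope and the two standard deviations reported in the descriptives above; a quick arithmetic sketch using those reported values:

```python
# Minimal sketch: Beta = b * SD(X) / SD(Y), using the values reported in the output above.
b = 0.917        # unstandardized coefficient for TEACHER_FACTOR
sd_x = 0.91443   # Std. Deviation of TEACHER_FACTOR
sd_y = 1.468     # Std. Deviation of Grade - PMR MATH

beta = b * sd_x / sd_y
print(f"Beta = {beta:.3f}")   # about .571, matching the Coefficients table
```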

Descriptive Statistics

                    Mean     Std. Deviation   N
Grade - PMR MATH    2.53     1.468            62
TEACHER_FACTOR      3.9643   .91443           62
Race                1.90     .593             62

Correlations

                                        Grade - PMR MATH   TEACHER_FACTOR   Race
Pearson Correlation   Grade - PMR MATH  1.000              .571             -.015
                      TEACHER_FACTOR    .571               1.000            .019
                      Race              -.015              .019             1.000
Sig. (1-tailed)       Grade - PMR MATH  .                  .000             .453
                      TEACHER_FACTOR    .000               .                .440
                      Race              .453               .440             .
N                     Grade - PMR MATH  62                 62               62
                      TEACHER_FACTOR    62                 62               62
                      Race              62                 62               62

Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .572(a)   .327       .304                1.225

a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH


ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  42.939           2    21.469        14.313   .000(a)
   Residual    88.497           59   1.500
   Total       131.435          61

a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

Coefficients(a)

                    Unstandardized Coefficients   Standardized Coefficients
Model               B         Std. Error          Beta                        t        Sig.
1  (Constant)       -.980     .853                                            -1.150   .255
   TEACHER_FACTOR   .917      .172                .571                        5.349    .000
   Race             -.065     .265                -.026                       -.246    .806

a. Dependent Variable: Grade - PMR MATH