stats 3000 week 2 - winter 2011

Section D

1. Goodness of fit (Ch. 6)
2. Test of independence (Ch. 6)
3. Simple regression and correlation (not included on test 3)
   a) Regression (Ch. 9)
   b) Correlation (Ch. 9)
   c) Inferences about regression and correlation (Ch. 9)

Data comes in pairs of quantitative variables. Given such paired data (bivariate data), we want to determine whether there is a relationship between the two quantitative variables and, if so, identify what the relationship is.

Regression analysis allows us to identify an equation that best fits the data, and to predict values of one variable based on another variable.

Descriptive Methods in Regression

What is Linear Regression?

• The straight-line linear regression model is a means of relating one quantitative variable to another quantitative variable.

• It is a way of predicting the value of one variable from another.
  – It is a hypothetical model of the relationship between two variables.
  – The model used is a linear one.
  – Therefore, we describe the relationship using the equation of a straight line.

LINEAR REGRESSION ANALYSIS (PREDICTION)

The process of finding the equation of the straight line that best predicts the value of one variable from a given value of the other.

Procedures that allow us to predict one variable (Y) based on knowledge of another variable (X).

The goal is to be able to predict new values of Y based on values of X.

Generally, Y is called the dependent variable (also: predicted, criterion, outcome; plotted on the ordinate).

X is called the independent variable (also: predictor; plotted on the abscissa).

Scatter Plot: shows the relationship between X and Y.

Student   High School GPA (X)   University GPA (Y)
1         2.00                  1.60
2         2.25                  2.00
3         2.60                  1.80
4         2.65                  2.80
5         2.80                  2.10
6         3.10                  2.00
7         2.90                  2.65
8         3.25                  2.25
9         3.30                  2.60
10        3.60                  3.00
11        3.25                  3.10

[Figure: scatterplot of University GPA (Y, vertical axis) against High School GPA (X, horizontal axis) for the 11 students; both axes run from .25 to 4.00.]

Describing a Straight Line

Linear equation: when the relationship between X and Y is linear, it can be written as Y = bX + a.

Regression line: the line whose equation is used for prediction; the line that best describes the relationship between Y, the dependent variable, and X, the independent variable.

Linear regression builds on the equation for a straight line because the relationship between the two variables is assumed to be linear.

A straight line should yield the best "fit" to the data points in a scatterplot (a linear model).

Regression equation: $\hat{Y} = bX + a$, where
  $\hat{Y}$ = predicted value of Y
  b = slope of the line, called the regression coefficient
  X = value of the independent variable
  a = intercept

Slope: the change in the value of $\hat{Y}$ for a one-unit increase in X.

$$b = \text{slope} = \frac{\text{change in value of } \hat{Y}}{\text{change in value of } X}$$

Intercept: the point at which the line intersects the Y axis. It is the value of $\hat{Y}$ when X = 0. It determines the location of the line.

Intercepts and Slopes

Least squares criterion: the statistical method for finding the best prediction line.

The best prediction line minimizes the error in predicting Y from X.

The best regression line will be closest to the actual data points.

Residuals - the difference between a score and its predicted value

x y1 42 244 85 32

$(Y - \hat{Y})$ = residual = error in prediction = e

The best regression line minimizes the value of $\sum (Y - \hat{Y})^2$.

$SS_{\text{residual}}$ = sum of squares residual = $\sum (Y - \hat{Y})^2$
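The least squares criterion can be checked numerically. The following is a minimal Python sketch. The five data points and the function name `ss_residual` are made up for illustration and are not part of the course data. For these points the least squares line works out to $\hat{Y} = 0.6X + 2.2$, so any other slope or intercept should give a larger sum of squared residuals.

```python
def ss_residual(xs, ys, b, a):
    """Sum of squared residuals, sum((Y - Y_hat)^2), for the line Y_hat = b*X + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data (not from the lecture).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

best = ss_residual(xs, ys, b=0.6, a=2.2)     # least squares line for this data
steeper = ss_residual(xs, ys, b=0.7, a=2.2)  # same intercept, steeper slope
higher = ss_residual(xs, ys, b=0.6, a=2.5)   # same slope, higher intercept

print(best, steeper, higher)
```

Both alternative lines produce a larger SSresidual than the least squares line, which is what "best fit" means under this criterion.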

$$b = \frac{\mathrm{Cov}_{xy}}{s_x^2} = \frac{\text{covariance of X and Y}}{\text{variance of X}}$$

Covariance

The degree to which X and Y, vary together (covary); The variation in one variable (X) that is shared by another (Y)

$$\mathrm{Cov}_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$

$$s_x^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

$$a = \bar{Y} - b\bar{X}$$

A medical researcher is interested in the possibility of a linear relationship between a patient's age and the effectiveness of a certain drug (hours). The drug is administered to 8 randomly selected patients.

Age (X)   Effectiveness (Y)
34        6.3
42        8.1
37        7.9
55        9.8
47        8.6
43        8.4
52        9.1
39        8.6

$\bar{Y} = 8.35$

$\bar{X} = 43.625$

$$b = \frac{\mathrm{Cov}_{xy}}{s_x^2} = \frac{\text{covariance of X and Y}}{\text{variance of X}}$$

$$\mathrm{Cov}_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$

$$\sum (X - \bar{X})(Y - \bar{Y}) = (34 - 43.625)(6.3 - 8.35) + (42 - 43.625)(8.1 - 8.35) + (37 - 43.625)(7.9 - 8.35) + (55 - 43.625)(9.8 - 8.35) + \dots + (39 - 43.625)(8.6 - 8.35) = 45.55$$

$$\mathrm{Cov}_{xy} = 45.55 / 7 = 6.507$$

$$s_x^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{(34 - 43.625)^2 + (42 - 43.625)^2 + (37 - 43.625)^2 + (55 - 43.625)^2 + \dots + (39 - 43.625)^2}{7} = 53.125$$

$$b = \frac{6.507}{53.125} = .122$$

$$a = \bar{Y} - b\bar{X}$$

$$a = (8.35) - (.122)(43.625) = 3.03$$

$$\hat{Y} = .122X + 3.03$$

Prediction for a 44-year-old patient:

$$\hat{Y} = 3.03 + (.122)(44) = 8.398 \text{ hours}$$
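The hand calculation for the drug example can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function name `least_squares_line` is an illustrative choice, not something from the textbook or SPSS. Because the code does not round b to .122 before computing a, the intercept comes out near 3.01 rather than 3.03, and the prediction for age 44 comes out near 8.40 hours.

```python
def least_squares_line(xs, ys):
    """Return (b, a) for Y_hat = b*X + a, with b = Cov_xy / s_x^2 and a = mean(Y) - b*mean(X)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    b = cov_xy / var_x
    a = mean_y - b * mean_x
    return b, a


# Data from the drug-effectiveness example.
age = [34, 42, 37, 55, 47, 43, 52, 39]
effectiveness = [6.3, 8.1, 7.9, 9.8, 8.6, 8.4, 9.1, 8.6]

b, a = least_squares_line(age, effectiveness)
prediction_44 = b * 44 + a  # predicted hours of effectiveness at age 44

print(round(b, 3), round(a, 2), round(prediction_44, 2))
```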

A researcher suspects that there is a relationship between the number of promises a political candidate makes and the number of promises that are fulfilled once the candidate is elected. He examines the track record of 10 politicians. Use SPSS to construct a regression equation that predicts the number of promises kept from the number of promises made.

In the SPSS output, column B under "Unstandardized Coefficients" contains the regression equation: the (Constant) row is the intercept, and the row for the predictor variable is the slope.

Y 0.118x 9.268

Standard error of estimate ($s_{y \cdot x}$)

Is a measure of the amount of error in prediction, in units of the Y variable.

Is the standard deviation of the distribution of obtained Y scores about the predicted values of Y, $\hat{Y}$.

Standard error of estimate: a measure of the error in prediction, used as the basis for a measure of the accuracy of prediction.

$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{SS_{\text{residual}}}{df}}$$

$s_{y \cdot x}$ represents the average error in prediction over an entire scatterplot.

Age (X)   Effectiveness (Y)   $\hat{Y}$   $(Y - \hat{Y})^2$
34        6.3                 7.178       .771
42        8.1                 8.154       .003
37        7.9                 7.544       .127
55        9.8                 9.74        .004
47        8.6                 8.764       .027
43        8.4                 8.276       .015
52        9.1                 9.374       .075
39        8.6                 7.788       .660

$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{1.682}{6}} = \sqrt{.28} = .53 \text{ hours}$$

Averaged dispersion of the effectiveness scores around their predicted values.
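As a check on the table, the standard error of estimate for the drug example can be computed directly. The sketch below is a minimal standard-library Python version; the function name `standard_error_of_estimate` is an illustrative choice. It refits the line without rounding the coefficients, so the intermediate values differ slightly from the table, but the result still rounds to about .53 hours.

```python
import math


def standard_error_of_estimate(xs, ys):
    """s_y.x = sqrt(SS_residual / (n - 2)) for the least squares line of Y on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: the (n - 1) in covariance and variance cancels, leaving a ratio of sums.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    ss_res = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(ss_res / (n - 2))


# Data from the drug-effectiveness example.
age = [34, 42, 37, 55, 47, 43, 52, 39]
effectiveness = [6.3, 8.1, 7.9, 9.8, 8.6, 8.4, 9.1, 8.6]

s_yx = standard_error_of_estimate(age, effectiveness)
print(round(s_yx, 2))
```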


Scatterplot

• To see if scores may be related, construct a graph of the scores, called a scatterplot.
  – The variable labeled X is plotted on the horizontal axis (the abscissa).
  – The Y variable is plotted on the vertical axis (the ordinate).
  – The score of a subject on each of the two measures is indicated by one point on the scatterplot.

Conclusions drawn from scatterplots are subjective. A more precise and objective method for detecting straight-line patterns is the linear correlation coefficient.

The linear correlation coefficient r (often simply called the correlation coefficient) measures the strength of the linear relationship between the paired x and y values in a sample.

Correlation is a statistical technique used to measure the relationship between two variables (its magnitude and direction).

Descriptive Methods in Correlation

Pearson Correlation (r)

r is a descriptive statistic used to measure the degree of straight-line relationship between two variables.

r also determines the precision with which predictions can be made using the regression line ($r^2$ = coefficient of determination).

The value of r is not affected by the choice of x or y: interchange all x and y values and the value of r will not change.

r measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear.

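$$r = \frac{\mathrm{Cov}_{xy}}{s_x s_y} = \frac{\text{covariance}}{(\text{S.D. of } X)(\text{S.D. of } Y)}$$

$$s_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \text{S.D. of } X$$

$$s_y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n - 1}} = \text{S.D. of } Y$$

$$s_{xy} = \mathrm{Cov}_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} = \text{covariance}$$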

Sign: r is a measure of the extent to which paired scores occupy the same (+) or opposite (-) positions within their own distributions.

$$z_x = \frac{X - \bar{X}}{s_x}, \qquad z_y = \frac{Y - \bar{Y}}{s_y}$$

$$r = \frac{\mathrm{Cov}_{xy}}{s_x s_y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

The sign of r is determined by the covariance, that is, by the numerator:
  + : the two variables move in the same direction ($+z_x$ with $+z_y$, or $-z_x$ with $-z_y$)
  - : the two variables move in opposite directions ($+z_x$ with $-z_y$, or $-z_x$ with $+z_y$)

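RAW SCORES

Subject   X    Y
1         60   16
2         45   13
3         40   12
4         20   8
5         10   6

$\bar{X} = 35$, $s_x = 20$

$\bar{Y} = 11$, $s_y = 4$

$$z_x = \frac{X - \bar{X}}{s_x}, \qquad z_y = \frac{Y - \bar{Y}}{s_y}$$

Z SCORES

Subject   $z_x$    $z_y$
1         1.25     1.25
2         0.50     0.50
3         0.25     0.25
4         -0.75    -0.75
5         -1.25    -1.25

r = 1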
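The z-score view of r can be verified with a short Python sketch. It relies on the fact that, when the standard deviations use n - 1, the formula $r = \mathrm{Cov}_{xy} / (s_x s_y)$ is the same as $\sum z_x z_y / (n - 1)$. The function names `z_scores` and `pearson_r_from_z` are illustrative choices, not from the course materials.

```python
import math


def z_scores(values):
    """Convert raw scores to z scores using the sample S.D. (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson_r_from_z(xs, ys):
    """r as the sum of paired z-score products divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(p * q for p, q in zip(zx, zy)) / (len(xs) - 1)


# Raw scores for the five subjects.
x = [60, 45, 40, 20, 10]
y = [16, 13, 12, 8, 6]

zx = z_scores(x)
zy = z_scores(y)
r = pearson_r_from_z(x, y)
print(zx, zy, r)
```

Every subject occupies exactly the same relative position in both distributions, so each pair of z scores is identical and r comes out as 1.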

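$$b = \frac{\mathrm{Cov}_{xy}}{s_x^2}, \qquad r = \frac{\mathrm{Cov}_{xy}}{s_x s_y}, \qquad \mathrm{Cov}_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$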

b and r will have the same sign, because both have the covariance as numerator and a positive denominator.

1. The magnitude of r ranges between 0 and 1.
2. The sign of r is either positive or negative. Thus, r can take any value between -1 and 1.
3. Generally, regardless of sign, the degree of linear relationship is:
   0 < |r| ≤ .3      weak
   .3 < |r| ≤ .55    slight
   .55 < |r| ≤ .8    moderate
   .8 < |r| ≤ 1      strong
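The rule of thumb for the degree of linear relationship can be written as a small lookup. This is only a sketch: the function name `describe_strength` is hypothetical, the inclusive upper boundaries are inferred from the slide, and the label returned for r = 0 is an assumption, since the slide's scale starts above 0.

```python
def describe_strength(r):
    """Label the degree of linear relationship from |r|, regardless of sign."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size == 0:
        return "none"  # assumption: the slide gives no label for r = 0
    if size <= 0.3:
        return "weak"
    if size <= 0.55:
        return "slight"
    if size <= 0.8:
        return "moderate"
    return "strong"


print(describe_strength(-0.9), describe_strength(0.5))
```

The sign of r is ignored here because it describes the direction of the relationship, not its strength.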

Direction of relationship

• A correlation coefficient indicates the direction of the relationship by the positive or negative sign of the coefficient.

• A positive r indicates
  – A positive (direct) relationship between variables X and Y
  – As the scores on variable X increase, the scores on variable Y tend to increase

• A negative r indicates
  – A negative (inverse) relationship between variables X and Y
  – As the scores on variable X increase, the scores on variable Y tend to decrease

[Figure: scatterplots of Y against X illustrating r = 1, r = 0.9, r = 0.5, r = -1, r = -0.9 and r = -0.5. With a perfect correlation there is no error in predictions. A further panel is labelled "Y on X".]

Example: Most of us have heard that tall people generally have larger feet than short people. Is that really true and, if so, what is the relationship between height and foot length? To examine this, Professor Dennis Young obtained data on shoe size and height for a sample of students at Carleton University.

SIZE (X)   $X - \bar{X}$   $(X - \bar{X})^2$   HEIGHT (Y)   $Y - \bar{Y}$   $(Y - \bar{Y})^2$   $(X - \bar{X})(Y - \bar{Y})$
10.5       -0.46           0.22                70           -1.46           2.144133            0.679847
13         2.04            4.16                72           0.54            0.28699             1.092857
10.5       -0.46           0.21                74.5         3.04            9.215561            -1.39643
12         1.04            1.08                71           -0.46           0.215561            -0.48286
10.5       -0.46           0.21                71           -0.46           0.215561            0.213571
13         2.04            4.16                77           5.54            30.64413            11.29286
11.5       0.54            0.29                72           0.54            0.28699             0.289286
10         -0.96           0.92                72           0.54            0.28699             -0.51429
8.5        -2.46           6.05                67           -4.46           19.92985            10.98214
10.5       -0.46           0.21                73           1.54            2.358418            -0.70643
10.5       -0.46           0.21                72           0.54            0.28699             -0.24643
11         0.04            0.00                70           -1.46           2.144133            -0.05857
9          -1.96           3.84                69           -2.46           6.072704            4.83
13         2.04            4.16                70           -1.46           2.144133            -2.98714

N = 14

$\bar{X} = 10.964286$, $\sum (X - \bar{X})^2 = 25.736$

$\bar{Y} = 71.4643$, $\sum (Y - \bar{Y})^2 = 76.23214$

$\sum (X - \bar{X})(Y - \bar{Y}) = 22.98842$

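$$r = \frac{\mathrm{Cov}_{xy}}{s_x s_y}$$

$$s_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{25.736}{13}} = 1.407$$

$$s_y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n - 1}} = \sqrt{\frac{76.23214}{13}} = 2.422$$

$$s_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} = \frac{22.98842}{13} = 1.768$$

$$r = \frac{\mathrm{Cov}_{xy}}{s_x s_y} = \frac{1.768}{(1.407)(2.422)} = 0.5189$$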
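The shoe size and height calculation can be repeated in Python as a check. This is a minimal standard-library sketch; the function name `pearson_r` is an illustrative choice. Because the code carries full precision, while the hand calculation uses rounded deviations, the result is about 0.52 and may differ from 0.5189 in the third decimal place.

```python
import math


def pearson_r(xs, ys):
    """r = Cov_xy / (s_x * s_y), with n - 1 in every denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov_xy / (s_x * s_y)


# Shoe size (X) and height (Y) for the 14 students in the example.
shoe_size = [10.5, 13, 10.5, 12, 10.5, 13, 11.5, 10, 8.5, 10.5, 10.5, 11, 9, 13]
height = [70, 72, 74.5, 71, 71, 77, 72, 72, 67, 73, 72, 70, 69, 70]

r = pearson_r(shoe_size, height)
r_swapped = pearson_r(height, shoe_size)  # interchanging x and y leaves r unchanged
print(round(r, 2), round(r_swapped, 2))
```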

Exercise 14 see webct, exercise folder

Textbook exercises: 9.2, 9.3, 9.10, 9.11, 9.13, 9.15. When appropriate verify your answers with SPSS. Get data for spss from webct, spss folder, spss exercises subfolder.

Readings to prepare for week 3, January 17-22

Chapter 9

Sections: 9.7, 9.8, 9.10, 9.11

SPSS assignment # 2 due next week
