stats 3000 week 2 - winter 2011
Section D
1. Goodness of fit (Ch. 6)
2. Test of independence (Ch. 6)
3. Simple regression and correlation (not included on test 3)
   a) Regression (Ch. 9)
   b) Correlation (Ch. 9)
   c) Inferences about regression and correlation (Ch. 9)
Data comes in pairs of quantitative variables. Given such paired data (bivariate data), we want to determine whether there is a relationship between the two quantitative variables and, if so, identify what the relationship is.
Regression analysis allows us to identify an equation that best fits the data, and to predict values of one variable based on another variable.
Descriptive Methods in Regression
What is Linear Regression?
• The straight-line linear regression model is a means of relating one quantitative variable to another quantitative variable.
• A way of predicting the value of one variable from another.
  – It is a hypothetical model of the relationship between two variables.
  – The model used is a linear one.
  – Therefore, we describe the relationship using the equation of a straight line.
LINEAR REGRESSION ANALYSIS (PREDICTION)
Process of finding the equation of the straight line that best
predicts the value of one variable from a given value of the other.
Procedures that allow us to predict one variable (Y) based
on knowledge of another variable (X)
The goal is to be able to predict new values of Y based on values of X
Generally, Y is called the dependent variable (also: predicted, criterion, outcome; plotted on the ordinate).
X is called the independent variable (also: predictor; plotted on the abscissa).
Scatter Plot: shows the relationship between X and Y.
Student   High School GPA (X)   University GPA (Y)
1         2.00                  1.60
2         2.25                  2.00
3         2.60                  1.80
4         2.65                  2.80
5         2.80                  2.10
6         3.10                  2.00
7         2.90                  2.65
8         3.25                  2.25
9         3.30                  2.60
10        3.60                  3.00
11        3.25                  3.10
[Figure: scatterplot of University GPA (Y, vertical axis) against High School GPA (X, horizontal axis); both axes run from 0 to 4.00 in steps of .25.]
Describing a Straight Line
Linear equation: When the relationship between X and Y is linear, it is described by Y = bX + a.
Regression line: The line whose equation is used for prediction; the line that best describes the relationship between Y, the dependent variable, and X, the independent variable.
Linear regression builds on the equation for a straight line because the relationship between the two variables is assumed to be linear
A straight line should yield the best “fit” of the data points in a scatterplot (a linear model)
Regression equation:
$$\hat{Y} = bX + a$$
$\hat{Y}$ = predicted value of Y
b = slope of the line, called the regression coefficient
X = value of the independent variable
a = intercept
Slope: Change in the value of $\hat{Y}$ for a one-unit increase in X.
$$b = \text{Slope} = \frac{\text{Change in value of } \hat{Y}}{\text{Change in value of } X}$$
Intercept: The point at which the line intersects the Y axis. It is the value of Y when X = 0. It determines the location of the line.
Intercepts and Slopes
Least squares criterion: Statistical method for finding the best prediction
line. Best prediction line will minimize error in
predicting Y from X.
Best regression line will be closest to the actual data points.
Residuals - the difference between a score and its predicted value
x    y
1    4
2    24
4    8
5    32
$$(Y - \hat{Y}) = \text{residual} = \text{error in prediction} = e$$
Best regression line minimizes the value of
$$\sum (Y - \hat{Y})^2 \quad \text{(is at a minimum)}$$
$$SS_{residual} = \text{Sum of squares residual} = \sum (Y - \hat{Y})^2$$
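The least squares criterion can be checked numerically. The sketch below is illustrative only: the four (x, y) points, the helper name `ss_residual`, and the two comparison lines are assumptions chosen for the example. It computes SS residual for the least-squares line and for two other lines through the same points.

```python
def ss_residual(xs, ys, b, a):
    """Sum of squared residuals, sum((Y - Y_hat)**2), for the line Y_hat = b*X + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Four illustrative points (an assumption for this sketch).
xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]

# Least-squares slope and intercept from deviations about the means.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

best = ss_residual(xs, ys, b, a)
steeper = ss_residual(xs, ys, b + 1, a)   # same intercept, steeper slope
shifted = ss_residual(xs, ys, b, a + 2)   # same slope, higher intercept

print(b, a, best)        # slope, intercept, SS residual of the least-squares line
print(steeper, shifted)  # SS residual of the two comparison lines
```

Any line other than the least-squares line gives a larger SS residual for the same points, which is what "best prediction line" means here.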
$$b = \frac{Cov_{xy}}{s_x^2} = \frac{\text{Covariance}}{\text{Variance of } X}$$
Covariance
The degree to which X and Y, vary together (covary); The variation in one variable (X) that is shared by another (Y)
$$Cov_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$
$$s_x^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
$$a = \bar{Y} - b\bar{X}$$
A medical researcher is interested in the possibility of a linear relationship between a patient's age and the effectiveness of a certain drug (hours). The drug is administered to 8 randomly selected patients.
Age (X)   Effectiveness (Y)
34        6.3
42        8.1
37        7.9
55        9.8
47        8.6
43        8.4
52        9.1
39        8.6
$$\bar{X} = 43.625 \qquad \bar{Y} = 8.35$$
$$b = \frac{Cov_{xy}}{s_x^2} = \frac{\text{Covariance}}{\text{Variance of } X}$$
$$Cov_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$
$$\sum (X - \bar{X})(Y - \bar{Y}) = (34 - 43.625)(6.3 - 8.35) + (42 - 43.625)(8.1 - 8.35) + (37 - 43.625)(7.9 - 8.35) + (55 - 43.625)(9.8 - 8.35) + \dots + (39 - 43.625)(8.6 - 8.35) = 45.55$$
$$\text{Covariance} = 45.55 / 7 = 6.507$$
$$s_x^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
$$s_x^2 = \frac{(34 - 43.625)^2 + (42 - 43.625)^2 + (37 - 43.625)^2 + (55 - 43.625)^2 + \dots + (39 - 43.625)^2}{7} = 53.125$$
$$b = \frac{6.507}{53.125} = .122$$
$$a = \bar{Y} - b\bar{X}$$
$$a = (8.35) - (.122)(43.625) = 3.03$$
$$\hat{Y} = .122X + 3.03$$
Prediction for a 44-year-old:
$$\hat{Y} = 3.03 + (.122)(44) = 8.398 \text{ hours}$$
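The hand calculation for the drug example can be reproduced with a few lines of Python. This is a minimal sketch, not the course's SPSS procedure; the variable names and the `predict` helper are assumptions. Because the code does not round b to .122 before computing a, its intercept (about 3.01) and its prediction for age 44 (about 8.40 hours) differ slightly from the rounded values 3.03 and 8.398 in the worked example.

```python
# Age (X) and drug effectiveness in hours (Y) for the 8 patients.
ages = [34, 42, 37, 55, 47, 43, 52, 39]
hours = [6.3, 8.1, 7.9, 9.8, 8.6, 8.4, 9.1, 8.6]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(hours) / n

# Covariance of X and Y, and variance of X, both with n - 1 in the denominator.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, hours)) / (n - 1)
var_x = sum((x - mean_x) ** 2 for x in ages) / (n - 1)

b = cov_xy / var_x        # slope (regression coefficient)
a = mean_y - b * mean_x   # intercept

def predict(age):
    """Predicted effectiveness (hours) for a given age, Y_hat = b*X + a."""
    return b * age + a

print("means:", mean_x, round(mean_y, 2))
print("covariance:", round(cov_xy, 3), "variance of X:", var_x)
print("slope:", round(b, 3), "intercept:", round(a, 2))
print("prediction for age 44:", round(predict(44), 2))
```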
A researcher suspects that there is a relationship between the number of promises a political candidate makes and the number of promises that are fulfilled once the candidate is elected. He examines the track record of 10 politicians. Use SPSS to construct a regression equation that predicts the number of promises kept from the number of promises made by politicians.
The information in column B of the "Unstandardized Coefficients" section of the SPSS output embodies the regression equation: the (Constant) row is the intercept and the other row is the slope.
$$\hat{Y} = 0.118X + 9.268$$
Standard error of estimate ($s_{y \cdot x}$)
Is a measure of the amount of error in prediction, in units of the Y variable.
Is the standard deviation of the distribution of obtained Y scores about the predicted values of Y, $\hat{Y}$.
Standard error of estimate: a measure of the error in prediction used as the basis for a measure of the accuracy of prediction
$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{SS_{residual}}{df}}$$
$s_{y \cdot x}$ represents the average error in prediction over an entire scatterplot.
Age (X)   Effectiveness (Y)   $\hat{Y}$   $(Y - \hat{Y})^2$
34        6.3                 7.178       .771
42        8.1                 8.154       .003
37        7.9                 7.544       .127
55        9.8                 9.74        .004
47        8.6                 8.764       .027
43        8.4                 8.276       .015
52        9.1                 9.374       .075
39        8.6                 7.788       .660
$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{1.682}{6}} = \sqrt{.28} = .53 \text{ hours}$$
Averaged dispersion of the effectiveness scores around their predicted values.
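The standard error of estimate for the drug example can be recomputed as follows. This is a sketch under stated assumptions: it plugs in the rounded slope and intercept from the worked example (.122 and 3.03), and the variable names are illustrative. Summing the unrounded squared residuals gives about 1.681 rather than 1.682, because the table rounds each entry before adding; the final value still rounds to .53 hours.

```python
import math

ages = [34, 42, 37, 55, 47, 43, 52, 39]
hours = [6.3, 8.1, 7.9, 9.8, 8.6, 8.4, 9.1, 8.6]

# Rounded slope and intercept taken from the worked example.
b, a = 0.122, 3.03

predicted = [b * x + a for x in ages]                          # Y_hat for each patient
ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(hours, predicted))
df = len(ages) - 2                                             # n - 2 degrees of freedom
s_yx = math.sqrt(ss_res / df)                                  # standard error of estimate

print("SS residual:", round(ss_res, 3))
print("df:", df)
print("standard error of estimate (hours):", round(s_yx, 2))
```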
Scatterplot
• To see if scores may be related, construct a graph of the scores, called a scatterplot.
  – The variable labeled X is plotted on the horizontal axis (the abscissa).
  – The Y variable is plotted on the vertical axis (the ordinate).
  – The score of a subject on each of the two measures is indicated by one point on the scatterplot.
Conclusions drawn from scatterplots are subjective. A more precise and objective method for detecting straight-line patterns is the linear correlation coefficient.
The linear correlation coefficient r (often simply called the correlation coefficient) measures the strength of the linear relationship between the paired x and y values in a sample.
Correlation is a statistical technique used to measure the relationship between two variables (its magnitude and direction).
Descriptive Methods in Correlation
Pearson Correlation (r): r is a descriptive statistic used to measure the degree of straight-line relationship between 2 variables.
r also determines the precision with which predictions can be made using the regression line ($r^2$ = coefficient of determination).
The value of r is not affected by the choice of x or y.
Interchange all x and y values and the value of r will not
change.
r measures the strength of a linear relationship. It is
not designed to measure the strength of a relationship
that is not linear.
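Both properties can be checked with a short script. This is a minimal sketch: `pearson_r` is a hypothetical helper that uses the deviation-score form of r (the n - 1 terms cancel), the first data set reuses the age and effectiveness values from the regression example, and the second data set is invented to show a perfect but non-linear relationship.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from sums of deviation products and squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ages = [34, 42, 37, 55, 47, 43, 52, 39]
hours = [6.3, 8.1, 7.9, 9.8, 8.6, 8.4, 9.1, 8.6]

# Interchanging x and y leaves r unchanged.
r_xy = pearson_r(ages, hours)
r_yx = pearson_r(hours, ages)

# Invented data: Y is completely determined by X (Y = X squared),
# but the relationship is not linear, so r does not pick it up.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_curve = pearson_r(xs, ys)

print(round(r_xy, 4), round(r_yx, 4))
print(r_curve)
```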
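$$r = \frac{Cov_{xy}}{s_x s_y} = \frac{\text{Covariance}}{(\text{S.D. of } X)(\text{S.D. of } Y)}$$
$$s_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \text{S.D. of } X$$
$$s_y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n - 1}} = \text{S.D. of } Y$$
$$s_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} = \text{Covariance}$$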
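Sign: r is a measure of the extent to which paired scores occupy the same (+) or opposite (-) positions within their own distributions.
$$z_x = \frac{X - \bar{X}}{s_x} \qquad z_y = \frac{Y - \bar{Y}}{s_y}$$
$$r = \frac{Cov_{xy}}{s_x s_y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$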
The sign of r is determined by the covariance, that is, by the numerator $\sum (X - \bar{X})(Y - \bar{Y})$.
+ = the two variables move in the same direction ($+z_x$ with $+z_y$, or $-z_x$ with $-z_y$)
- = the two variables move in opposite directions ($+z_x$ with $-z_y$, or $-z_x$ with $+z_y$)
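RAW SCORES
Subject   X     Y
1         60    16
2         45    13
3         40    12
4         20    8
5         10    6
$$\bar{X} = 35, \quad s_x = 20 \qquad \bar{Y} = 11, \quad s_y = 4$$
$$z_x = \frac{X - \bar{X}}{s_x} \qquad z_y = \frac{Y - \bar{Y}}{s_y}$$
Z SCORES
Subject   $z_x$    $z_y$
1         1.25     1.25
2         0.50     0.50
3         0.25     0.25
4         -0.75    -0.75
5         -1.25    -1.25
r = 1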
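The z-score example can be verified in a few lines. This is a sketch, not part of the original slides: the helper `z_scores` is an assumed name, and the last step uses the standard identity that r equals the sum of the products of paired z scores divided by n - 1, which follows from r = Cov / (s_x s_y).

```python
from statistics import mean, stdev

x = [60, 45, 40, 20, 10]
y = [16, 13, 12, 8, 6]

def z_scores(values):
    """Convert raw scores to z scores using the sample S.D. (n - 1 denominator)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx = z_scores(x)
zy = z_scores(y)

# r as the sum of products of paired z scores, divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

print(zx)
print(zy)
print(r)
```

Every subject holds exactly the same relative position in both distributions, which is why the correlation is perfect.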
$$b = \frac{Cov_{xy}}{s_x^2} \qquad r = \frac{Cov_{xy}}{s_x s_y}$$
$$Cov_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$
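One way to see how the slope and the correlation coefficient are tied together is to rewrite b in terms of r. The short derivation below uses only the definitions of b and r in terms of the covariance and the standard deviations; it is offered as a supporting step rather than as part of the original slides.

```latex
\begin{align*}
b &= \frac{Cov_{xy}}{s_x^2}
   = \frac{Cov_{xy}}{s_x s_y} \cdot \frac{s_y}{s_x}
   = r \, \frac{s_y}{s_x}
\end{align*}
```

Since standard deviations are positive, the factor $s_y / s_x$ is positive, so the sign of b always matches the sign of r.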
b and r will have the same sign.
1. The magnitude of r ranges between 0 and 1.
2. The sign of r is either positive or negative.
Thus, r can take any value between -1 and 1.
3. Generally, regardless of sign, the degree of linear relationship is:
   0 < |r| ≤ .3: weak
   .3 < |r| ≤ .55: slight
   .55 < |r| ≤ .8: moderate
   .8 < |r| ≤ 1: strong
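These rules of thumb can be turned into a small lookup. This is a sketch under stated assumptions: the function name `describe_strength` is hypothetical, the upper end of each band is treated as inclusive, and the label "none" for r = 0 is added here because the bands start above zero.

```python
def describe_strength(r):
    """Label the degree of linear relationship for a correlation r, ignoring sign."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)          # the magnitude decides the label, not the sign
    if size == 0:
        return "none"      # assumed label; the bands only cover |r| > 0
    if size <= 0.3:
        return "weak"
    if size <= 0.55:
        return "slight"
    if size <= 0.8:
        return "moderate"
    return "strong"

for value in (0.25, -0.5, 0.7, -0.9):
    print(value, describe_strength(value))
```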
Direction of relationship
• A correlation coefficient indicates the direction of the relationship by the positive or negative sign of the coefficient.
• A positive r indicates
  • A positive (direct) relationship between variables X and Y
  • As the scores on variable X increase, the scores on variable Y tend to increase
• A negative r indicates
  • A negative (inverse) relationship between variables X and Y
  • As the scores on variable X increase, the scores on variable Y tend to decrease
[Figure: scatterplots of Y against X for r = 1, r = 0.9, r = 0.5 and for r = -1, r = -0.9, r = -0.5, with the note "(no error in predictions)"; a further plot is labeled "Y on X".]
Example: Most of us have heard that tall people generally have larger feet than short people. Is that really true and, if so, what is the relationship between height and foot length? To examine this, Professor Dennis Young obtained data on shoe size and height for a sample of students at Carleton University.
SIZE (X)   $X - \bar{X}$   $(X - \bar{X})^2$   HEIGHT (Y)   $Y - \bar{Y}$   $(Y - \bar{Y})^2$   $(X - \bar{X})(Y - \bar{Y})$
10.5       -0.46           0.22                70           -1.46           2.144133            0.679847
13         2.04            4.16                72           0.54            0.28699             1.092857
10.5       -0.46           0.21                74.5         3.04            9.215561            -1.39643
12         1.04            1.08                71           -0.46           0.215561            -0.48286
10.5       -0.46           0.21                71           -0.46           0.215561            0.213571
13         2.04            4.16                77           5.54            30.64413            11.29286
11.5       0.54            0.29                72           0.54            0.28699             0.289286
10         -0.96           0.92                72           0.54            0.28699             -0.51429
8.5        -2.46           6.05                67           -4.46           19.92985            10.98214
10.5       -0.46           0.21                73           1.54            2.358418            -0.70643
10.5       -0.46           0.21                72           0.54            0.28699             -0.24643
11         0.04            0.00                70           -1.46           2.144133            -0.05857
9          -1.96           3.84                69           -2.46           6.072704            4.83
13         2.04            4.16                70           -1.46           2.144133            -2.98714

N = 14
$\bar{X} = 10.964286$, $\sum (X - \bar{X})^2 = 25.736$
$\bar{Y} = 71.4643$, $\sum (Y - \bar{Y})^2 = 76.23214$
$\sum (X - \bar{X})(Y - \bar{Y}) = 22.98842$
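$$r = \frac{Cov_{xy}}{s_x s_y}$$
$$s_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{25.736}{13}} = 1.407$$
$$s_y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n - 1}} = \sqrt{\frac{76.23214}{13}} = 2.422$$
$$s_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} = \frac{22.98842}{13} = 1.768$$
$$r = \frac{Cov_{xy}}{s_x s_y} = \frac{1.768}{(1.407)(2.422)} = 0.5189$$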
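The shoe size and height calculation can be double-checked in Python. This is a minimal sketch with illustrative variable names, not the course's SPSS output. It works from the unrounded deviations, so intermediate sums differ from the table in the third decimal place, but the standard deviations, covariance, and r agree with the hand calculation after rounding.

```python
import math

# Shoe size (X) and height (Y) for the 14 students in the example.
size = [10.5, 13, 10.5, 12, 10.5, 13, 11.5, 10, 8.5, 10.5, 10.5, 11, 9, 13]
height = [70, 72, 74.5, 71, 71, 77, 72, 72, 67, 73, 72, 70, 69, 70]

n = len(size)
mean_x = sum(size) / n
mean_y = sum(height) / n

# Sample standard deviations and covariance, all with n - 1 in the denominator.
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in size) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in height) / (n - 1))
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(size, height)) / (n - 1)

r = cov_xy / (s_x * s_y)

print("s_x:", round(s_x, 3), "s_y:", round(s_y, 3))
print("covariance:", round(cov_xy, 3))
print("r:", round(r, 4))
```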
Exercise 14: see WebCT, exercise folder.
Textbook exercises: 9.2, 9.3, 9.10, 9.11, 9.13, 9.15. When appropriate, verify your answers with SPSS. Get the data for SPSS from WebCT: SPSS folder, SPSS exercises subfolder.
Readings to prepare for week 3, January 17-22
Chapter 9
Sections: 9.7, 9.8, 9.10, 9.11
SPSS assignment # 2 due next week