simple linear regression. our objective is to study the relationship between two variables x and y....

58
Simple Linear Regression

Upload: virgil-malone

Post on 01-Jan-2016

225 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Simple Linear Regression

Page 2: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Simple Linear Regression • Our objective is to study the relationship

between two variables X and Y.• One way to study the relationship between

two variables is by means of regression.• Regression analysis is the process of

estimating a functional relationship between X and Y. A regression equation is often used to predict a value of Y for a given value of X.

• Another way to study relationship between two variables is correlation. It involves measuring the direction and the strength of the linear relationship.

2

Page 3: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

First-Order Linear Model = Simple Linear Regression Model

where

y = dependent variable

x = independent variable

0= y-intercept

1= slope of the line

= error variable

ii10i XY

3

Page 4: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Simple Linear Model

This model is – Simple: only one X– Linear in the parameters: No parameter

appears as exponent or is multiplied or divided by another parameter

– Linear in the predictor variable (X): X appears only in the first power.

4

ii10i XY

Page 5: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Examples• Multiple Linear Regression:

• Polynomial Linear Regression:

• Linear Regression:

• Nonlinear Regression:

5

ii22i110i XXY

ii2

2i10i XXY

ii2i10i10 )Xexp(X)Y(log

ii210i ))Xexp(1/(Y

Linear or nonlinear in parameters

Page 6: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Deterministic Component of Model

05

101520253035404550

0 5 10 15 20

x

xˆˆy 10

(slope)=rise/runy-intercept

Run

Rise

01

6

Page 7: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Mathematical vs Statistical Relation

y = - 5.3562 + 3.3988x

05

101520253035404550

0 5 10 15 20

x

^

x

7

Page 8: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Error

• The scatterplot shows that the points are not on a line, and so, in addition to the relationship, we also describe the error:

• The Y’s are the response (or dependent) variable. The x’s are the predictors or independent variables, and the epsilon’s are the errors. We assume that the errors are normal, mutually independent, and have variance 2.

0 1 , i=1,2,...,ni i iy x

8

Page 9: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Least Squares: Minimize

 The minimizing y-intercept and slope are given by . We use the notation:

• The quantities are called the residuals. If we assume a normal error, they should look normal.

 

2 20

1 1

( )n n

i i i i

i i

y x

0 1ˆ ˆ and

0 1ˆ ˆˆi iy x

ˆi i iR y y

9Error: Yi-E(Yi) unknown; Residual: estimated, i.e. knownˆi i iR y y

Page 10: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Minimizing error

10

Page 11: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

• The Simple Linear Regression Model

• The Least Squares Regression Line where

0 1y x

0 1ˆ ˆy x

1xy

x

SS

SS

( )( )

i i

xy i i i i

x ySS x x y y x y

n

2

2 2( )i

x i i

xSS x x x

n

0 1ˆ ˆy x

11

Page 12: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

What form does the error take?

• Each observation may be decomposed into two parts:

• The first part is used to determine the fit, and the second to estimate the error.

• We estimate the standard deviation of the error by:

ˆ ˆ( )y y y y

22ˆ( ) xy

yyxx

SSSE Y Y S

S

12

Page 13: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Estimate of 2

• We estimate 2 by

2

2

SSEs MSE

n

13

Page 14: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Example• An educational economist wants to

establish the relationship between an individual’s income and education. He takes a random sample of 10 individuals and asks for their income ( in $1000s) and education ( in years). The results are shown below. Find the least squares regression line.

11 12 11 15 8 10 11 12 17 11

25 33 22 41 18 28 32 24 53 26

Education

Income

14

Page 15: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Dependent and Independent Variables

• The dependent variable is the one that we want to forecast or analyze.

• The independent variable is hypothesized to affect the dependent variable.

• In this example, we wish to analyze income and we choose the variable individual’s education that most affects income. Hence, y is income and x is individual’s education

15

Page 16: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

First Step:

2

2

118

1450

302

10072

3779

i

i

i

i

i i

x

x

y

y

x y

16

Page 17: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Sum of Squares:

( )( ) (118)(302)

3779 215.410

i i

xy i i

x ySS x y

n

2 22

( ) (118)1450 57.6

10

i

x i

xSS x

n

Therefore,1

215.4ˆ 3.7457.6

xy

x

SS

SS

0 1

302 118ˆ ˆ 3.74 13.9310 10

y x

17

Page 18: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

The Least Squares Regression Line

• The least squares regression line is

• Interpretation of coefficients:

*The sample slope tells us that on average for each additional year of education, an individual’s income rises by $3.74 thousand.

• The y-intercept is . This value is the expected (or average) income for an individual who has 0 education level (which is meaningless here)

ˆ 13.93 3.74y x

1 3.74

0ˆ 13.93

18

Page 19: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Error Variable: is normally distributed.• E() = 0• The variance of is 2.• The errors are independent of each other. • The estimator of 2 is

where

2

2

SSEs MSE

n

22ˆ( ) xy

yyxx

SSSE Y Y S

S

2

2( )i

y i

ySS y

n and

19

Page 20: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Example (continue)• For the previous example

Hence, SSE is

2 22

( ) (302)10072 951.6

10

i

y i

ySS y

n

2 2(215.4)951.6 146.09

57.6xy

yyxx

SSSE S

S

2 2146.0918.26 and 4.27

2 10 2

SSEs s s

n

20

Page 21: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Interpretation of

• The value of “s” can be compared with the mean value of y to provide a rough guide as to whether s is small or large.

• Since and s=4.27, we would conclude that “s” is relatively small, indicating that the regression line fits the data quite well.

30.2y

2s

21

Page 22: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Example

• Car dealers across North America use the red book to determine a cars selling price on the basis of important features. One of these is the car’s current odometer reading.

• To examine this issue 100 three-year old cars in mint condition were randomly selected. Their selling price and odometer reading were observed.

22

Page 23: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Portion of the data file

Odometer Price37388 531844758 506145833 500830862 5795….. …34212 528333190 525939196 535636392 5133

23

Page 24: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Example (Minitab Output)

Regression Analysis

The regression equation isPrice = 6533 - 0.0312 Odometer

Predictor Coef StDev T PConstant 6533.38 84.51 77.31 0.000(SIGNIFICANT)Odometer -0.031158 0.002309 -13.49 0.000(SIGNIFICANT)

S = 151.6 R-Sq = 65.0% R-Sq(adj) = 64.7%

Analysis of Variance

Source DF SS MS F PRegression 1 4183528 4183528 182.11 0.000Error 98 2251362 22973Total 99 6434890

24

Page 25: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Example• The least squares regression line is

ˆ 6533.38 0.031158y x

50000400003000020000

6000

5500

5000

Odometer

Pric

e

25

Page 26: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Interpretation of the coefficients

• means that for each additional mile on the odometer, the price decreases by an average of 3.1158 cents.

• means that when x = 0 (new car), the selling price is $6533.38 but x = 0 is not in the range of x. So, we cannot interpret the value of y when x=0 for this problem.

• R2=65.0% means that 65% of the variation of y can be explained by x. The higher the value of R2, the better the model fits the data.

1 0.031158

0ˆ 6533.38

26

Page 27: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

R² and R² adjusted

• R² measures the degree of linear association between X and Y.

• So, a high R² does not necessarily indicate that the estimated regression line is a good fit.

• Also, an R² close to 0 does not necessarily indicate that X and Y are unrelated (relation can be nonlinear)

• As more and more X’s are added to the model, R² always increases. R²adj accounts for the number of parameters in the model.

27

Page 28: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Scatter Plot

Odometer .vs. Price Line Fit Plot

4500

5000

5500

6000

19000 29000 39000 49000

Odometer

Pri

ce

28

Page 29: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Testing the slope • Are X and Y linearly related?

0:

0:

1

10

AH

H

•Test Statistic:

11ˆ

s

t

where1

x

ss

SS

29

Page 30: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Testing the slope (continue)• The Rejection Region:

Reject H0 if t < -t/2,n-2 or t > t/2,n-2.

• If we are testing that high x values lead to high y values, HA: 1>0. Then, the rejection region is

t > t,n-2.

• If we are testing that high x values lead to low y values or low x values lead to high y values, HA: 1 <0. Then, the rejection region is t < - t,n-2.

30

Page 31: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Assessing the model

Example:

• Excel output

• Minitab output

Coefficients Standard Error t Stat P-valueIntercept 6533.4 84.512322 77.307 1E-89Odometer -0.031 0.0023089 -13.49 4E-24

Predictor Coef StDev T PConstant 6533.38 84.51 77.31 0.000Odometer -0.031158 0.002309 -13.49 0.000

31

Page 32: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Coefficient of Determination

yyx

xy

SS

SSE

SSSS

SSR

1

22

For the data in odometer example, we obtain:

6501.03499.01

890,434,6

363,251,2112

ySS

SSER

32

y

2adj SS

SSE)

pn

1n(1R

Page 33: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Using the Regression Equation

• Suppose we would like to predict the selling price for a car with 40,000 miles on the odometer

ˆ 6,533 0.0312

6,533 0.0312(40,000)

$5,285

y x

33

Page 34: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Prediction and Confidence Intervals

• Prediction Interval of y for x=xg: The confidence interval for predicting the particular value of y for a given x

• Confidence Interval of E(y|x=xg): The confidence interval for estimating the expected value of y for a given x

x

gen SS

xx

nsty

2

2,2/

)(11ˆ

x

gen SS

xx

nsty

2

2,2/

)(1ˆ

34

Page 35: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Solving by Hand(Prediction Interval)

• From previous calculations we have the following estimates:

• Thus a 95% prediction interval for x=40,000 is:

ˆ 5285, 151.6, 4309340160, 36,009xy s SS x

21 (40,000 36,009)5,285 1.984(151.6) 1

100 4,309,340,160

5,285 303

•The prediction is that the selling price of the car will fall between $4982 and $5588. 35

Page 36: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Solving by Hand(Confidence Interval)

• A 95% confidence interval of

E(y| x=40,000) is:

35285,5

160,340,309,4

)009,36000,40(

100

1)6.151(984.1285,5

2

•The mean selling price of the car will fall between $5250and $5320.

36

Page 37: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Prediction and Confidence Intervals’ Graph

50000400003000020000

6300

5800

5300

4800

Odometer

Pre

dict

ed

Prediction interval

Confidence interval

37

Page 38: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Notes

• No matter how strong is the statistical relation between X and Y, no cause-and-effect pattern is necessarily implied by the regression model. Ex: Although a positive and significant relationship is observed between vocabulary (X) and writing speed (Y), this does not imply that an increase in X causes an increase in Y. Other variables, such as age, may affect both X and Y. Older children have a larger vocabulary and faster writing speed.

38

Page 39: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Regression Diagnostics

How to diagnose violations and how to deal with observations that are unusually large or small?

• Residual Analysis:Non-normalityHeteroscedasticity (non-constant variance)Non-independence of the errorsOutlierInfluential observations

39

Page 40: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Standardized Residuals

• The standardized residuals are calculated as

where .

• The standard deviation of the i-th residual is

Standardized residual ir

s

ˆi i ir y y

2( )11 where

i

ir i i

x

x xs s h h

n SS

40

Page 41: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Non-normality:

• The errors should be normally distributed. To check the normality of errors, we use histogram of the residuals or normal probability plot of residuals or tests such as Shapiro-Wilk test.

• Dealing with non-normality:– Transformation on Y

– Other types of regression (e.g., Poisson or Logistic …)

– Nonparametric methods (e.g., nonparametric regression(i.e. smoothing))

41

Page 42: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Heteroscedasticity:

• The error variance should be constant. When this requirement is violated, the condition is called heteroscedasticity.

• To diagnose hetersocedastisticity or homoscedasticity, one method is to plot the residuals against the predicted value of y (or x). If the points are distributed evenly around the expected value of errors which is 0, this means that the error variance is constant. Or, formal tests such as: Breusch-Pagan test

2

42

Page 43: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Example

• A classic example of heteroscedasticity is that of income versus expenditure on meals. As one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating less expensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.

43

Page 44: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Dealing with heteroscedasticity

• Transform Y

• Re-specify the Model (e.g., Missing important X’s?)

• Use Weighted Least Squares instead of Ordinary Least Squares

n

1i i

2i

)(Varmin

44

Page 45: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Non-independence of error variable:• The values of error should be

independent. When the data are time series, the errors often are correlated (i.e., autocorrelated or serially correlated). To detect autocorrelation we plot the residuals against the time periods. If there is no pattern, this means that errors are independent.

45

Page 46: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Outlier:

• An outlier is an observation that is unusually small or large. Two possibilities which cause outlier is

1.Error in recording the data. Detect the error and correct it

The outlier point should not have been included in the data (belongs to another population) Discard the point from the sample

2. The observation is unusually small or large although it belong to the sample and there is no recording error. Do NOT remove it

46

Page 47: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Influential Observations

• One or more observations have a large influence on the statistics.

Scatter Plot of One Influential Observation

0

50

100

150

0 10 20 30 40 50

x

y

Scatter Plot Without the Influential Observation

0

10

20

30

40

50

60

0 10 20 30 40 50

x

y

47

Page 48: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Influential Observations

• Look into Cook’s Distance, DFFITS, DFBETAS (Neter, J., Kutner, M.H., Nachtsheim, C.J., and Wasserman, W., (1996) Applied Linear Statistical Models, 4th

edition, Irwin, pp. 378-384)

48

Page 49: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Multicollinearity

• A common issue in multiple regression is multicollinearity. This exists when some or all of the predictors in the model are highly correlated. In such cases, the estimated coefficient of any variable depends on which other variables are in the model. Also, standard errors of the coefficients are very high…

49

Page 50: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Multicollinearity

• Look into correlation coefficient among X’s: If Cor>0.8, suspect multicollinearity

• Look into Variance inflation factors (VIF): VIF>10 is usually a sign of multicollinearity

• If there is multicollinearity:– Use transformation on X’s, e.g. centering,

standardization. Ex: Cor(X,X²)=0.991; after standardization Cor=0!

– Remove the X that causes multicollinearity – Factor analysis– Ridge regression

50

Page 51: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Exercise

• It is doubtful that any sports collects more statistics than baseball. The fans are always interested in determining which factors lead to successful teams. The table below lists the team batting average and the team winning percentage for the 14 league teams at the end of a recent season.

51

Page 52: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

Team-B-A Winning%0.254 0.4140.269 0.5190.255 0.5000.262 0.5370.254 0.3520.247 0.5190.264 0.5060.271 0.5120.280 0.5860.256 0.4380.248 0.5190.255 0.5120.270 0.5250.257 0.562

y = winning % and x = team batting average52

Page 53: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

a) LS Regression Line2

2

3.642, 0.949

7.001, 3.549

1.824562

i i

i i

i i

x x

y y

x y

2

22

( )( ) (3.642)(7.001)1.824562 0.0033

14

(3.642)0.948622 0.00118

14

i i

xy i i

i

x i

x ySS x y

n

xSS x

n

53

Page 54: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

• The least squares regression line is

• The meaning is for each additional batting average of the team, the winning percentage increases by an average of 79.41%.

1

0 1

0.003302ˆ 0.79410.001182

ˆ ˆ 0.5 (0.7941)0.26 0.2935

xy

x

SS

SS

y x

1 0.7941

ˆ 0.2935 0.7941y x

54

Page 55: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

b) Standard Error of Estimate

So,

• Since s=0.0567 is small, we would conclude that “s” is relatively small, indicating that the regression line fits the data quite well.

22 22

2 27.001 0.003302(3.548785 ) 0.03856

14 0.00182

ixy xyyy i

xx xx

yS SSSE S y

S n S

2 20.038560.00321 and 0.0567

2 14 2

SSEs s s

n

55

Page 56: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

c) Do the data provide sufficient evidence at the 5% significance level to conclude that higher team batting average lead to higher winning percentage?

0:

0:

1

10

AH

H

1

1 1

ˆ

ˆTest statistic: t 1.69 (p-value=.058)

s

Conclusion: Do not reject H0 at = 0.05. The higher team batting average does not lead to higher winning percentage.

56

Page 57: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

d)Coefficient of Determination

22 0.03856

1 1 0.19250.04778

xy

x y y

SS SSER

SS SS SS

The 19.25% of the variation in the winning percentagecan be explained by the batting average.

57

Page 58: Simple Linear Regression. Our objective is to study the relationship between two variables X and Y. One way to study the relationship between two variables

e) Predict with 90% confidence the winning percentage of a team whose batting average is 0.275.

2

/ 2, 2

2

ˆ 0.2935 0.7941(0.275) 0.5119

( )1ˆ 1

1 (0.275 0.2601)0.5119 (1.782)(0.0567) 1

14 0.0011820.5119 0.1134

(0.3985,0.6253)

gn

x

y

x xy t s

n SS

90% PI for y:

•The prediction is that the winning percentage of the team will fall between 39.85% and 62.53%.

58