
Part 2: Model and Inference

Regression Models

Professor William Greene

Stern School of Business

IOMS Department

Department of Economics


Regression and Forecasting Models

Part 2 – Inference About the Regression


The Linear Regression Model

1. The linear regression model

2. Sample statistics and population quantities

3. Testing the hypothesis of no relationship


A Linear Regression

Predictor: Box Office = -14.36 + 72.72 Buzz


Data and Relationship

We suggested the relationship between box office and internet buzz is

Box Office = -14.36 + 72.72 Buzz

Note the obvious inconsistency in the figure. This is not the relationship. The observed points do not lie on a line.

How do we reconcile the equation with the data?


Modeling the Underlying Process

A model that explains the process that produces the data that we observe:

Observed outcome = the sum of two parts
(1) Explained: The regression line
(2) Unexplained (noise): The remainder

Regression model

The “model” is the statement that part (1) is the same process from one observation to the next. Part (2) is the randomness that is part of real world observation.


The Population Regression

THE model: A specific statement about the parts of the model

(1) Explained: Explained Box Office = β0 + β1 Buzz
(2) Unexplained: The rest is “noise, ε.” Random ε has certain characteristics.

Model statement

Box Office = β0 + β1 Buzz + ε
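The model statement can be made concrete with a small simulation. This is a minimal sketch, not the author's procedure: the parameter values (borrowed from the fitted line, -14.36 and 72.72), the noise standard deviation of 13.4, the Buzz values, and the function name `simulate_box_office` are all assumptions chosen for illustration.

```python
import random

def simulate_box_office(buzz_values, beta0, beta1, sigma, seed=0):
    """Return (buzz, explained, noise, observed) for each Buzz value."""
    rng = random.Random(seed)
    rows = []
    for buzz in buzz_values:
        explained = beta0 + beta1 * buzz   # part (1): the regression line
        noise = rng.gauss(0.0, sigma)      # part (2): random disturbance
        rows.append((buzz, explained, noise, explained + noise))
    return rows

# Assumed, illustrative inputs (not data from the slides).
buzz_values = [0.2, 0.4, 0.6, 0.8]
rows = simulate_box_office(buzz_values, beta0=-14.36, beta1=72.72, sigma=13.4)

for buzz, explained, noise, observed in rows:
    print(f"Buzz={buzz:.1f}  explained={explained:7.2f}  "
          f"noise={noise:7.2f}  observed={observed:7.2f}")
```

Each observed value is the explained part plus its own draw of noise, which is why simulated points scatter around the line instead of lying on it.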


The Data Include the Noise

[Figure: Box Office vs. Buzz, with the line β0 + β1 Buzz labeled]

Box = 41, β0 + β1 Buzz = 10, ε = 31


Model Assumptions

yi = β0 + β1xi + εi

β0 + β1xi is the ‘regression function’
  Contains the ‘information’ about yi in xi
  Unobserved because β0 and β1 are not known for certain

εi is the ‘disturbance.’ It is the unobserved random component

Observed yi is the sum of the two unobserved parts.


Regression Model Assumptions About εi

Random Variable
(1) The regression is the mean of yi for a particular xi. εi is the deviation of yi from the regression line.
(2) εi has mean zero.
(3) εi has variance σ².

‘Random’ Noise
(4) εi is unrelated to any values of xi (no covariance); it’s “random noise”
(5) εi is unrelated to any other observations on εj (not “autocorrelated”)
(6) Normal distribution: εi is the sum of many small influences
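One way to see what assumptions (2) through (5) mean in practice is to draw disturbances that satisfy them by construction and look at the matching sample quantities. The sketch below is illustrative only: the value σ = 10, the sample size, the uniform distribution of x, and the helper names are assumptions, not anything taken from the slides.

```python
import random
import statistics

def sample_cov(a, b):
    """Sample covariance with the N-1 divisor."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)

def sample_corr(a, b):
    """Sample correlation coefficient."""
    return sample_cov(a, b) / (statistics.stdev(a) * statistics.stdev(b))

rng = random.Random(42)
sigma = 10.0   # assumed disturbance standard deviation
n = 5000       # assumed sample size

x = [rng.uniform(0.0, 1.0) for _ in range(n)]
eps = [rng.gauss(0.0, sigma) for _ in range(n)]   # independent normal draws

mean_eps = statistics.fmean(eps)              # assumption (2): near 0
sd_eps = statistics.stdev(eps)                # assumption (3): near sigma
corr_x_eps = sample_corr(x, eps)              # assumption (4): near 0
corr_lag1 = sample_corr(eps[:-1], eps[1:])    # assumption (5): near 0

print(mean_eps, sd_eps, corr_x_eps, corr_lag1)
```

The sample values are close to, but not exactly, their population counterparts; the gap is sampling variation.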


Regression Model

[Figure: Scatterplot of FUELBILL (vertical axis, 200 to 1400) vs ROOMS (horizontal axis, 2 to 11)]


Conditional Normal Distribution of ε

[Figure: Scatterplot of FUELBILL (vertical axis, 200 to 1400) vs ROOMS (horizontal axis, 2 to 11)]


A Violation of Point (4)

c = β0 + β1 q + ε

Electricity Cost Data


A Violation of Point (5) - Autocorrelation

Time Trend of U.S. Gasoline Consumption


No Obvious Violations of Assumptions

Auction Prices for Monet Paintings vs. Area


Samples and Populations

Population (Theory)
  yi = β0 + β1xi + εi
  Parameters β0, β1
  Regression β0 + β1xi
  Mean of yi | xi
  Disturbance, εi
    Expected value = 0
    Standard deviation σ
    No correlation with xi

Sample (Observed)
  yi = b0 + b1xi + ei
  Estimates, b0, b1
  Fitted regression b0 + b1xi
  Predicted yi | xi
  Residuals, ei
    Sample mean 0
    Sample std. dev. se
    Sample Cov[x,e] = 0
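The two sample properties of the residuals (mean zero and zero covariance with x) hold by construction for any least squares fit. The sketch below checks this on a tiny data set; the five (x, y) pairs are hypothetical values invented for illustration, and `fit_line` is a helper name introduced here.

```python
def fit_line(x, y):
    """Least squares intercept b0 and slope b1 for y = b0 + b1*x + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical (Buzz, Box Office) pairs, not the data behind the slides.
x = [0.2, 0.3, 0.5, 0.6, 0.9]
y = [5.0, 2.0, 30.0, 25.0, 55.0]

b0, b1 = fit_line(x, y)
residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

n = len(x)
x_bar = sum(x) / n
mean_e = sum(residuals) / n
cov_xe = sum((xi - x_bar) * ei for xi, ei in zip(x, residuals)) / (n - 1)

print(f"b0={b0:.3f}, b1={b1:.3f}, mean(e)={mean_e:.2e}, Cov[x,e]={cov_xe:.2e}")
```

Both quantities come out as zero up to floating point rounding. The population disturbances εi need not satisfy these identities exactly in any one sample.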


Disturbances vs. Residuals

True: β0 + β1 Buzz

Sample: b0 + b1 Buzz

ε = y - β0 - β1 Buzz

e = y - b0 - b1 Buzz


Standard Deviation of Residuals

Standard deviation of εi = yi - β0 - β1xi is σ

σ = √E[εi²] (Mean of εi is zero)

Sample b0 and b1 estimate β0 and β1

Residual ei = yi - b0 - b1xi estimates εi

Use √((1/N)Σei²) to estimate σ? Close, not quite.

$$ s_e = \sqrt{\frac{\sum_{i=1}^{N} e_i^2}{N-2}} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2}{N-2}} $$

Why N-2? Relates to the fact that two parameters (β0, β1) were estimated. Same reason N-1 was used to compute a sample variance.
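To see how much the divisor matters, the sketch below compares the N-2 estimator with the naive version that divides by N. It uses the residual sum of squares (10751.5) and N = 62 that appear in the Minitab output for the Buzz regression elsewhere in these slides; the function name is an assumption introduced here.

```python
import math

def residual_std_dev(sum_sq_resid, n_obs, n_params=2):
    """s_e = sqrt(SSE / (N - number of estimated coefficients))."""
    return math.sqrt(sum_sq_resid / (n_obs - n_params))

sse = 10751.5   # residual sum of squares from the Buzz regression output
n = 62          # number of complete observations

s_e = residual_std_dev(sse, n)     # divides by N-2
naive = math.sqrt(sse / n)         # divides by N, slightly too small

print(f"s_e (N-2 divisor) = {s_e:.4f}")
print(f"naive (N divisor) = {naive:.4f}")
```

With 62 observations the two differ only in the second decimal place. The correction matters much more in small samples.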


Linear Regression

Sample Regression Line


Residuals


Regression Computations

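N = 62 complete observations.

$$ \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = 20.721 $$

$$ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 0.48242 $$

$$ \mathrm{Var}(x) = s_x^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2 = 0.02453 $$

$$ \mathrm{Var}(y) = s_y^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y})^2 = 305.985 $$

$$ \mathrm{Cov}(x,y) = s_{xy} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) = 1.784 $$

$$ b_1 = \frac{s_{xy}}{s_x^2} = 72.72 $$

$$ b_0 = \bar{y} - b_1 \bar{x} = -14.36 $$

$$ s_e = \sqrt{\frac{\sum_{i=1}^{62} (y_i - b_0 - b_1 x_i)^2}{N-2}} = 13.386 $$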
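The coefficient estimates can be reproduced from the reported summary statistics alone. The sketch below does that arithmetic. The last step uses the standard least squares identity Σe² = (N-1)(s_y² - b1·s_xy), which is not stated on the slide and is introduced here to recover s_e without the raw data. Small differences from the reported values are due to rounding of the inputs.

```python
import math

# Summary statistics reported for the Buzz regression.
n = 62
y_bar, x_bar = 20.721, 0.48242
var_x, var_y, cov_xy = 0.02453, 305.985, 1.784

b1 = cov_xy / var_x        # slope: Cov(x, y) / Var(x)
b0 = y_bar - b1 * x_bar    # intercept: line passes through the means

# Residual sum of squares via the identity SSE = (N-1)(s_y^2 - b1 * s_xy).
sse = (n - 1) * (var_y - b1 * cov_xy)
s_e = math.sqrt(sse / (n - 2))

print(f"b1 = {b1:.2f}, b0 = {b0:.2f}, s_e = {s_e:.3f}")
```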


Results to Report


The Reported Results


Estimated equation


Estimated coefficients b0 and b1


Sum of squared residuals, Σi ei²


S = se = estimated std. deviation of ε


Interpreting σ (Estimated by se)

Remember the empirical rule: about 95% of observations will lie within mean ± 2 standard deviations. (We show (b0 + b1x) ± 2se below.)

This point is 2.2 standard deviations from the regression.

Only 3.2% of the 62 observations lie outside the bounds. (We will refine this later.)
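Counting the points outside the ± 2 se band is simple arithmetic. In the sketch below, the ten residuals are hypothetical values invented for illustration (the actual 62 residuals are not listed in the slides), and reading "3.2% of the 62 observations" as 2 observations is an inference from the reported percentage.

```python
def share_outside_band(residuals, s_e, k=2.0):
    """Fraction of residuals more than k standard deviations from the line."""
    outside = sum(1 for e in residuals if abs(e) > k * s_e)
    return outside / len(residuals)

s_e = 13.386   # reported standard deviation of the residuals

# Hypothetical residuals; 29.5 is about 2.2 standard deviations from the line.
residuals = [-20.1, -8.4, -3.0, 0.5, 2.2, 6.7, 9.9, 12.3, 15.0, 29.5]
share = share_outside_band(residuals, s_e)

# The reported share: 2 of 62 observations outside the band (inferred count).
reported_share = 2 / 62

print(f"hypothetical share outside = {share:.1%}")
print(f"2 of 62 observations       = {reported_share:.1%}")
```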


No Relationship: β1 = 0

Relationship: β1 ≠ 0

How to Distinguish These Cases Statistically?

yi = β0 + β1xi + εi


Assumptions

(Regression) The equation linking “Box Office” and “Buzz” is stable

E[Box Office | Buzz] = α + β Buzz

Another sample of movies, say 2012, would obey the same fundamental relationship.


Sampling Variability

Samples 0 and 1 are a random split of the 62 observations.

Sample 1: Box Office = -13.25 + 68.51 Buzz

Sample 0: Box Office = -16.09 + 79.11 Buzz
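The random split can be imitated with simulated data. This is a sketch only: the 62 observations below are generated from an assumed model (parameters set to the full-sample estimates, noise standard deviation 13.4, Buzz uniform between 0.1 and 0.9), so the fitted numbers will not match the two sample regressions reported above.

```python
import random

def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

rng = random.Random(2)

# Simulated stand-in for the 62 movies (assumed model, not the real data).
buzz = [rng.uniform(0.1, 0.9) for _ in range(62)]
box = [-14.36 + 72.72 * b + rng.gauss(0.0, 13.4) for b in buzz]

# Random split into two halves of 31 observations each.
indices = list(range(62))
rng.shuffle(indices)
sample0, sample1 = indices[:31], indices[31:]

fit0 = fit_line([buzz[i] for i in sample0], [box[i] for i in sample0])
fit1 = fit_line([buzz[i] for i in sample1], [box[i] for i in sample1])

print(f"Sample 0: Box Office = {fit0[0]:.2f} + {fit0[1]:.2f} Buzz")
print(f"Sample 1: Box Office = {fit1[0]:.2f} + {fit1[1]:.2f} Buzz")
```

The two halves come from the same process yet give different estimates. This difference is the sampling variability that the standard error measures.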


Sampling Distributions

Sampling Distribution of the Mean

Estimator: $\bar{x}$

Standard Error:

$$ s_{\bar{x}} = \frac{s_x}{\sqrt{N}} = \sqrt{\frac{1}{N} \cdot \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}} $$

Confidence Interval: $\bar{x} \pm t^* s_{\bar{x}}$, where t* is the appropriate value from the t table (N-1 degrees of freedom).

Sampling Distribution of a Regression Coefficient

Estimator: $b_1$

Standard Error:

$$ s_{b_1} = \frac{s_e}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2 / (N-2)}{\sum_{i=1}^{N} (x_i - \bar{x})^2}} $$

Confidence Interval: $b_1 \pm t^* s_{b_1}$, where t* is the appropriate value from the t table (N-2 degrees of freedom).
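As a check on the slope formula, the standard error can be computed from the summary statistics reported for the Buzz regression (s_e = 13.386, Var(x) = 0.02453, N = 62), using Σ(xi - x̄)² = (N-1)·s_x². The function name is an assumption introduced for this sketch.

```python
import math

def slope_standard_error(s_e, var_x, n_obs):
    """s_b1 = s_e / sqrt(sum of squared deviations of x)."""
    sum_sq_dev_x = (n_obs - 1) * var_x   # undo the N-1 divisor in Var(x)
    return s_e / math.sqrt(sum_sq_dev_x)

se_b1 = slope_standard_error(s_e=13.386, var_x=0.02453, n_obs=62)
print(f"standard error of b1 = {se_b1:.2f}")
```

The formula shows that the slope is estimated more precisely when the noise is smaller, when there are more observations, or when x is more spread out.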


[Figure: t table; degrees of freedom n = N-2, with the small-sample and large-sample critical values marked]


Standard Error of Regression Slope Estimator


Internet Buzz Regression

Regression Analysis: BoxOffice versus Buzz

The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1

Range of uncertainty for b1 is 72.72 - 1.96(10.94) to 72.72 + 1.96(10.94) = [51.27 to 94.17]

If you use 2.00 from the t table, the limits would be [50.8 to 94.6]
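Since Minitab does not print the interval, it is easy to compute by hand from the coefficient and its standard error. The sketch below does so for both critical values; the function name is an assumption introduced here, and results can differ from hand-rounded figures in the last digit.

```python
def confidence_interval(estimate, std_error, critical_value):
    """Return (lower, upper) = estimate -/+ critical_value * std_error."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width

b1, se_b1 = 72.72, 10.94   # Buzz coefficient and its standard error

ci_normal = confidence_interval(b1, se_b1, 1.96)   # large-sample value
ci_t = confidence_interval(b1, se_b1, 2.00)        # t table, 60 D.F.

print(f"using 1.96: [{ci_normal[0]:.2f} to {ci_normal[1]:.2f}]")
print(f"using 2.00: [{ci_t[0]:.2f} to {ci_t[1]:.2f}]")
```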


Some computer programs report confidence intervals automatically; Minitab does not.


Uncertainty About the Regression Slope

Hypothetical Regression: Fuel Bill vs. Number of Rooms

The regression equation is
Fuel Bill = -252 + 136 Number of Rooms

Predictor    Coef  SE Coef      T      P
Constant   -251.9    44.88  -5.20  0.000
Rooms       136.2     7.09   19.9  0.000

S = 144.456   R-Sq = 72.2%   R-Sq(adj) = 72.0%

This is b1, the estimate of β1

This “Standard Error,” (SE) is the measure of uncertainty about the true value.

The “range of uncertainty” is b ± 2 SE(b). (Actually 1.96, but people use 2)


Sampling Distributions and Test Statistics

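For Testing a Hypothesis about a Mean

Hypothesis: H0: μ = 0, H1: μ ≠ 0

Estimator: $\bar{x}$

Standard Error:

$$ s_{\bar{x}} = \frac{s_x}{\sqrt{N}} = \sqrt{\frac{1}{N} \cdot \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}} $$

Test Statistic:

$$ t = \frac{\bar{x} - 0}{s_{\bar{x}}} $$

t statistic with N-1 D.F.

Rejection Region: |t| > Critical Value from Table

For Testing a Hypothesis about a Regression Coefficient

Hypothesis: H0: β1 = 0, H1: β1 ≠ 0

Estimator: $b_1$

Standard Error:

$$ s_{b_1} = \frac{s_e}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2 / (N-2)}{\sum_{i=1}^{N} (x_i - \bar{x})^2}} $$

Test Statistic:

$$ t = \frac{b_1 - 0}{s_{b_1}} $$

t statistic with N-2 D.F.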



t Statistic for Hypothesis Test


Alternative Approach: The P value

Hypothesis: β1 = 0

The ‘P value’ is the probability that you would have observed evidence at least as strong as the evidence you did observe if the null hypothesis were true.

P = Prob(|t| would be this large | β1 = 0)

If the P value is less than the Type I error probability (usually 0.05) you have chosen, you will reject the hypothesis.

Interpret: If the hypothesis were true, it is ‘unlikely’ that I would have observed this evidence.
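A two-sided P value can be computed directly from the t density. The sketch below integrates the density numerically with Simpson's rule; this is an illustrative choice made here, not the method any particular statistics package uses, and the function names are assumptions. It is applied to the t ratios of the Buzz regression, which has N - 2 = 60 degrees of freedom.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p_value(t_stat, df, steps=2000):
    """Prob(|T| >= |t_stat|), integrating the density over [0, |t_stat|]."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps   # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 == 1 else 2) * t_pdf(k * h, df)
    central_half = total * h / 3   # Prob(0 <= T <= |t_stat|)
    return max(0.0, 1.0 - 2.0 * central_half)

df = 60   # N - 2 for the Buzz regression
p_constant = two_sided_p_value(-2.59, df)   # t ratio for the constant
p_buzz = two_sided_p_value(6.65, df)        # t ratio for Buzz

print(f"P value, constant: {p_constant:.3f}")
print(f"P value, Buzz:     {p_buzz:.3f}")
```

Both values fall below 0.05, so both hypotheses of a zero coefficient would be rejected at that level.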


P value for hypothesis test


Intuitive approach: Does the confidence interval contain zero?

Hypothesis: β1 = 0

The confidence interval contains the set of plausible values of β1 based on the data and the test.

If the confidence interval does not contain 0, reject H0: β1 = 0.
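The interval check and the t test are two views of the same calculation, provided both use the same critical value. The sketch below applies both to the Buzz coefficient (72.72 with standard error 10.94) using a critical value of 2.00; the function name and the second null value of 60 are illustrative assumptions.

```python
def slope_test_two_ways(b1, se_b1, critical_t, null_value=0.0):
    """Test H0: beta1 = null_value by interval and by t ratio."""
    lower = b1 - critical_t * se_b1
    upper = b1 + critical_t * se_b1
    interval_rejects = not (lower <= null_value <= upper)
    t_rejects = abs((b1 - null_value) / se_b1) > critical_t
    return (lower, upper), interval_rejects, t_rejects

# H0: beta1 = 0 for the Buzz regression.
interval, by_interval, by_t = slope_test_two_ways(72.72, 10.94, 2.00)
print(interval, by_interval, by_t)

# A hypothetical null value inside the interval: neither approach rejects.
_, by_interval_60, by_t_60 = slope_test_two_ways(72.72, 10.94, 2.00,
                                                 null_value=60.0)
print(by_interval_60, by_t_60)
```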


More General Test

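For Testing a Hypothesis about a Regression Coefficient

Hypothesis: H0: β1 = B, H1: β1 ≠ B

Estimator: $b_1$

Standard Error:

$$ s_{b_1} = \frac{s_e}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2 / (N-2)}{\sum_{i=1}^{N} (x_i - \bar{x})^2}} $$

Test Statistic:

$$ t = \frac{b_1 - B}{s_{b_1}} $$

t statistic with N-2 D.F.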



H0: β1 = 100; H1: β1 ≠ 100

Test statistic:

$$ t = \frac{b_1 - 100}{SE(b_1)} = \frac{72.72 - 100}{10.94} = -2.49 $$

Critical t = ±2.00. Since |t| = 2.49 > 2.00, H0 is rejected.


Summary: Regression Analysis

Investigate: Is the coefficient in a regression model really nonzero?

Testing procedure:
  Model: y = β0 + β1x + ε
  Hypothesis: H0: β1 = B (B = 0 for the test of no relationship).
  Rejection region: Least squares coefficient is far from B.

Test:
  α level for the test = 0.05 as usual
  Compute t = (b1 - B)/StandardError
  Reject H0 if |t| is above the critical value
    1.96 if large sample
    Value from t table if small sample.
  Reject H0 if reported P value is less than α level

Degrees of Freedom for the t statistic is N-2
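The large-sample version of the procedure can be sketched in a few lines. The critical value comes from the standard normal distribution, which is the large-sample approximation; for a small sample the value from the t table with N-2 degrees of freedom would replace it. The function name and argument names are assumptions introduced here.

```python
from statistics import NormalDist

def large_sample_slope_test(b1, se_b1, hypothesized=0.0, alpha=0.05):
    """Test H0: beta1 = hypothesized with a normal critical value."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 at alpha 0.05
    t = (b1 - hypothesized) / se_b1
    return t, critical, abs(t) > critical

# Buzz regression: coefficient 72.72, standard error 10.94.
t0, crit, reject0 = large_sample_slope_test(72.72, 10.94)
t100, _, reject100 = large_sample_slope_test(72.72, 10.94, hypothesized=100.0)

print(f"critical value = {crit:.2f}")
print(f"H0: beta1 = 0    t = {t0:.2f}   reject: {reject0}")
print(f"H0: beta1 = 100  t = {t100:.2f}  reject: {reject100}")
```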
