regression models
Part 2: Model and Inference2-1/49
Regression Models
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Regression and Forecasting Models
Part 2 – Inference About the Regression
The Linear Regression Model
1. The linear regression model
2. Sample statistics and population quantities
3. Testing the hypothesis of no relationship
A Linear Regression
Predictor: Box Office = -14.36 + 72.72 Buzz
Data and Relationship
We suggested the relationship between box office and internet buzz is
Box Office = -14.36 + 72.72 Buzz
Note the obvious inconsistency in the figure: this is not the relationship. The observed points do not lie on a line.
How do we reconcile the equation with the data?
Modeling the Underlying Process
A model that explains the process that produces the data that we observe:
Observed outcome = the sum of two parts
(1) Explained: The regression line
(2) Unexplained (noise): The remainder
Regression model
The “model” is the statement that part (1) is the same process from one observation to the next. Part (2) is the randomness that is part of real-world observation.
The Population Regression
THE model: A specific statement about the parts of the model
(1) Explained: Explained Box Office = β0 + β1 Buzz
(2) Unexplained: The rest is “noise,” ε. Random ε has certain characteristics.
Model statement
Box Office = β0 + β1 Buzz + ε
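As a minimal sketch of the model statement, the Python below builds each observation as an explained part plus noise. The parameter values (β0 = -14.36, β1 = 72.72, σ = 13.4) are borrowed from the fitted line purely for illustration, since the true population values are unknown. The function name `simulate_box_office` and the buzz values are hypothetical.

```python
import random

def simulate_box_office(buzz_values, beta0, beta1, sigma, seed=0):
    """Return (explained, noise, observed) triples for the model
    Box Office = beta0 + beta1 * Buzz + epsilon."""
    rng = random.Random(seed)
    rows = []
    for buzz in buzz_values:
        explained = beta0 + beta1 * buzz   # part (1): the regression line
        noise = rng.gauss(0.0, sigma)      # part (2): the random disturbance
        rows.append((explained, noise, explained + noise))
    return rows

# Assumed illustrative values, not known population parameters.
rows = simulate_box_office([0.2, 0.4, 0.6, 0.8],
                           beta0=-14.36, beta1=72.72, sigma=13.4)
for explained, noise, observed in rows:
    print(f"explained={explained:7.2f}  noise={noise:7.2f}  observed={observed:7.2f}")
```

Only the last column would be visible in real data. The first two are the unobserved pieces the model describes.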
The Data Include the Noise
β0 + β1 Buzz
Box Office = 41, β0 + β1 Buzz = 10, ε = 31
Model Assumptions
yi = β0 + β1xi + εi
β0 + β1xi is the ‘regression function’
Contains the ‘information’ about yi in xi
Unobserved because β0 and β1 are not known for certain
εi is the ‘disturbance.’ It is the unobserved random component
Observed yi is the sum of the two unobserved parts.
Regression Model Assumptions About εi
Random Variable
(1) The regression is the mean of yi for a particular xi. εi is the deviation of yi from the regression line.
(2) εi has mean zero.
(3) εi has variance σ².
‘Random’ Noise
(4) εi is unrelated to any values of xi (no covariance): it’s “random noise”
(5) εi is unrelated to any other observations on εj (not “autocorrelated”)
(6) Normal distribution: εi is the sum of many small influences
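The sketch below shows what assumptions (2) through (5) look like in a sample. It simulates disturbances that satisfy them by construction and computes the sample analogues: mean, variance, correlation with x, and lag-1 autocorrelation. The sample size, σ = 2, the uniform distribution for x, and the function name `disturbance_checks` are all assumptions made for illustration.

```python
import random
import statistics

def disturbance_checks(n=5000, sigma=2.0, seed=1):
    """Simulate x and epsilon independently, then compute sample analogues
    of assumptions (2)-(5) for the disturbances."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]   # normal, as in (6)

    def corr(a, b):
        # Sample correlation coefficient between two equal-length lists.
        ma, mb = statistics.fmean(a), statistics.fmean(b)
        sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
        saa = sum((ai - ma) ** 2 for ai in a)
        sbb = sum((bi - mb) ** 2 for bi in b)
        return sab / (saa * sbb) ** 0.5

    return {
        "mean": statistics.fmean(eps),             # (2) close to 0
        "variance": statistics.variance(eps),      # (3) close to sigma**2
        "corr_with_x": corr(x, eps),               # (4) close to 0
        "lag1_autocorr": corr(eps[:-1], eps[1:]),  # (5) close to 0
    }

checks = disturbance_checks()
for name, value in checks.items():
    print(f"{name:>14}: {value: .4f}")
```

With real data the disturbances are unobserved, so in practice the same checks are run on residuals from the fitted line.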
Regression Model
[Figure: Scatterplot of FUELBILL vs ROOMS]
Conditional Normal Distribution of ε
[Figure: Scatterplot of FUELBILL vs ROOMS]
A Violation of Point (4)
c = β0 + β1 q + ?
Electricity Cost Data
A Violation of Point (5) - Autocorrelation
Time Trend of U.S. Gasoline Consumption
No Obvious Violations of Assumptions
Auction Prices for Monet Paintings vs. Area
Samples and Populations
Population (Theory)
yi = β0 + β1xi + εi
Parameters: β0, β1
Regression: β0 + β1xi = mean of yi | xi
Disturbance, εi: expected value = 0, standard deviation σ, no correlation with xi
Sample (Observed)
yi = b0 + b1xi + ei
Estimates: b0, b1
Fitted regression: b0 + b1xi = predicted yi | xi
Residuals, ei: sample mean 0, sample std. dev. se, sample Cov[x,e] = 0
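The two residual properties on the sample side (sample mean 0 and sample Cov[x,e] = 0) hold exactly for any least squares fit with an intercept. The sketch below checks this numerically. The five data points and the helper name `fit_line` are hypothetical.

```python
def fit_line(x, y):
    """Least squares intercept b0 and slope b1 for y = b0 + b1*x + e."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

xbar = sum(x) / len(x)
mean_e = sum(residuals) / len(residuals)
cov_xe = sum((xi - xbar) * ei for xi, ei in zip(x, residuals)) / (len(x) - 1)
print(mean_e, cov_xe)   # both are zero up to floating-point rounding
```

These identities are consequences of how b0 and b1 are computed. They say nothing about whether the population disturbances εi satisfy the corresponding assumptions.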
Disturbances vs. Residuals
True: β0 + β1 Buzz
Sample: b0 + b1 Buzz
ε = y - β0 - β1 Buzz
e = y - b0 - b1 Buzz
Standard Deviation of Residuals
Standard deviation of εi = yi - β0 - β1xi is σ
σ = √E[εi²] (Mean of εi is zero)
Sample b0 and b1 estimate β0 and β1
Residual ei = yi - b0 - b1xi estimates εi
Use √((1/N)Σei²) to estimate σ? Close, not quite.
\[ s_e = \sqrt{\frac{\sum_{i=1}^{N} e_i^2}{N-2}} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2}{N-2}} \]
Why N-2? Relates to the fact that two parameters (β0, β1) were estimated. Same reason N-1 was used to compute a sample variance.
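A small sketch of the se computation, comparing the N-2 divisor with the naive N divisor. The data are hypothetical, and the coefficients passed in (b0 = 0.05, b1 = 1.99) are the least squares values for these five points.

```python
import math

def residual_std(x, y, b0, b1):
    """s_e = sqrt(sum(e_i^2) / (N - 2)), where e_i = y_i - b0 - b1*x_i."""
    n = len(x)
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))

# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

s_e = residual_std(x, y, b0=0.05, b1=1.99)

# Naive version with divisor N, shown only for comparison.
sse = sum((yi - 0.05 - 1.99 * xi) ** 2 for xi, yi in zip(x, y))
naive = math.sqrt(sse / len(x))
print(f"s_e (N-2) = {s_e:.4f}, naive (N) = {naive:.4f}")
```

The N-2 version is always the larger of the two. The gap is noticeable with five points and becomes negligible as N grows.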
Linear Regression
Sample Regression Line
Residuals
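Regression Computations
N = 62 complete observations.
\[ \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = 20.721 \]
\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 0.48242 \]
\[ \mathrm{Var}(x) = s_x^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2 = 0.02453 \]
\[ \mathrm{Var}(y) = s_y^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y})^2 = 305.985 \]
\[ \mathrm{Cov}(x,y) = s_{xy} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) = 1.784 \]
\[ b_1 = \frac{s_{xy}}{s_x^2} = 72.72 \]
\[ b_0 = \bar{y} - b_1\bar{x} = -14.36 \]
\[ s_e = \sqrt{\frac{\sum_{i=1}^{62} (y_i - b_0 - b_1 x_i)^2}{N-2}} = 13.386 \]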
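The coefficients can be reproduced from the reported summary statistics alone, as the sketch below does. One step is an assumption not shown on the slide: the sum of squared residuals is recovered from the identity Σe² = (N-1)(s_y² - b1·s_xy), which holds for least squares with an intercept. Because the inputs are rounded, the results agree with the reported 72.72, -14.36 and 13.386 only up to rounding.

```python
import math

# Summary statistics as reported (N = 62 movies).
N = 62
ybar, xbar = 20.721, 0.48242
var_x, var_y, cov_xy = 0.02453, 305.985, 1.784

b1 = cov_xy / var_x      # slope: s_xy / s_x^2
b0 = ybar - b1 * xbar    # intercept: ybar - b1 * xbar

# Assumed identity (not from the slides): sum of squared residuals
# equals (N-1) * (s_y^2 - b1 * s_xy) for a least squares fit.
sse = (N - 1) * (var_y - b1 * cov_xy)
s_e = math.sqrt(sse / (N - 2))

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}, s_e = {s_e:.3f}")
```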
Results to Report
The Reported Results
Estimated equation
Estimated coefficients b0 and b1
Sum of squared residuals, Σi ei²
S = se = estimated std. deviation of ε
Interpreting σ (Estimated by se)
Remember the empirical rule: 95% of observations will lie within mean ± 2 standard deviations. We show (b0 + b1x) ± 2se below.
This point is 2.2 standard deviations from the regression.
Only 3.2% of the 62 observations (2 of 62) lie outside the bounds. (We will refine this later.)
No Relationship: β1 = 0
Relationship: β1 ≠ 0
How to Distinguish These Cases Statistically?
yi = β0 + β1xi + εi
Assumptions
(Regression) The equation linking “Box Office” and “Buzz” is stable
E[Box Office | Buzz] = α + β Buzz
Another sample of movies, say 2012, would obey the same fundamental relationship.
Sampling Variability
Samples 0 and 1 are a random split of the 62 observations.
Sample 1: Box Office = -13.25 + 68.51 Buzz
Sample 0: Box Office = -16.09 + 79.11 Buzz
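The random-split idea can be sketched as follows. The original 62 observations are not reproduced here, so the data are simulated from assumed values (intercept -14.36, slope 72.72, σ = 13.4, buzz uniform on 0.1 to 0.9). The fitted numbers will therefore differ from those reported above. The helper `fit_line` is hypothetical.

```python
import random

def fit_line(x, y):
    """Least squares (b0, b1) for y = b0 + b1*x + e."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

rng = random.Random(42)

# Simulated stand-in for the 62 movies (assumed parameter values).
buzz = [rng.uniform(0.1, 0.9) for _ in range(62)]
box = [-14.36 + 72.72 * b + rng.gauss(0.0, 13.4) for b in buzz]

# Randomly split the observation indices into two halves.
indices = list(range(62))
rng.shuffle(indices)
sample0, sample1 = indices[:31], indices[31:]

fit0 = fit_line([buzz[i] for i in sample0], [box[i] for i in sample0])
fit1 = fit_line([buzz[i] for i in sample1], [box[i] for i in sample1])
print("Sample 0: b0 = %.2f, b1 = %.2f" % fit0)
print("Sample 1: b0 = %.2f, b1 = %.2f" % fit1)
```

Both halves come from the same underlying relationship, yet the two fitted lines differ. That difference is the sampling variability of b0 and b1.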
Sampling Distributions
Sampling Distribution of the Mean
Estimator: $\bar{x}$
Standard Error:
\[ s_{\bar{x}} = \frac{s_x}{\sqrt{N}} = \sqrt{\frac{1}{N}\cdot\frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N-1}} \]
Confidence Interval: $\bar{x} \pm t^{*} s_{\bar{x}}$, where t* is the appropriate value from the t table (N-1 degrees of freedom).
Sampling Distribution of a Regression Coefficient
Estimator: $b_1$
Standard Error:
\[ s_{b_1} = \frac{s_e}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2}} = \sqrt{\frac{\frac{1}{N-2}\sum_{i=1}^{N}(y_i - b_0 - b_1 x_i)^2}{\sum_{i=1}^{N}(x_i-\bar{x})^2}} \]
Confidence Interval: $b_1 \pm t^{*} s_{b_1}$, where t* is the appropriate value from the t table (N-2 degrees of freedom).
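As a quick check of the slope standard error formula, the sketch below plugs in values reported earlier for the buzz regression (N = 62, se = 13.386, Var(x) = 0.02453). The sum of squared deviations of x is recovered as (N-1)·Var(x).

```python
import math

N = 62
s_e = 13.386       # standard deviation of the residuals
var_x = 0.02453    # sample variance of Buzz (divisor N-1)

sum_sq_x = (N - 1) * var_x           # sum of (x_i - xbar)^2
se_b1 = s_e / math.sqrt(sum_sq_x)    # standard error of the slope
print(round(se_b1, 2))               # 10.94
```

The result matches the "SE Coef" that regression software reports for the Buzz coefficient.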
n = N-2
Small sample
Large sample
Standard Error of Regression Slope Estimator
Internet Buzz Regression
Regression Analysis: BoxOffice versus Buzz
The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        1   7913.6  7913.6  44.16  0.000
Residual Error   60  10751.5   179.2
Total            61  18665.1
Range of uncertainty for b is 72.72 - 1.96(10.94) to 72.72 + 1.96(10.94) = [51.28 to 94.16]
If you use 2.00 from the t table, the limits would be [50.8 to 94.6]
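The interval arithmetic can be sketched in a few lines. The function name `confidence_interval` is hypothetical; the inputs are the reported coefficient (72.72) and standard error (10.94).

```python
def confidence_interval(estimate, std_error, t_star):
    """Return (lower, upper) = estimate -/+ t_star * std_error."""
    half_width = t_star * std_error
    return estimate - half_width, estimate + half_width

lo_z, hi_z = confidence_interval(72.72, 10.94, 1.96)   # large-sample value
lo_t, hi_t = confidence_interval(72.72, 10.94, 2.00)   # t table, 60 d.f.
print(f"1.96: [{lo_z:.2f}, {hi_z:.2f}]")
print(f"2.00: [{lo_t:.2f}, {hi_t:.2f}]")
```

Carried to two decimals, the endpoints are 51.28 and 94.16 with 1.96, and 50.84 and 94.60 with 2.00.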
Some computer programs report confidence intervals automatically; Minitab does not.
Uncertainty About the Regression Slope
Hypothetical Regression: Fuel Bill vs. Number of Rooms
The regression equation is
Fuel Bill = -252 + 136 Number of Rooms

Predictor     Coef  SE Coef      T      P
Constant    -251.9    44.88  -5.20  0.000
Rooms        136.2     7.09   19.9  0.000

S = 144.456   R-Sq = 72.2%   R-Sq(adj) = 72.0%
This is b1, the estimate of β1
This “Standard Error,” (SE) is the measure of uncertainty about the true value.
The “range of uncertainty” is b ± 2 SE(b). (Actually 1.96, but people use 2)
Sampling Distributions and Test Statistics
For Testing a Hypothesis about a Mean
Hypothesis: H0: μ = 0, H1: μ ≠ 0
Estimator: $\bar{x}$
Standard Error:
\[ s_{\bar{x}} = \frac{s_x}{\sqrt{N}} = \sqrt{\frac{1}{N}\cdot\frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N-1}} \]
Test Statistic:
\[ t = \frac{\bar{x} - 0}{s_{\bar{x}}} \quad \text{(t statistic, N-1 D.F.)} \]
Rejection Region: |t| > Critical Value from Table
For Testing a Hypothesis about a Regression Coefficient
Hypothesis: H0: β1 = 0, H1: β1 ≠ 0
Estimator: $b_1$
Standard Error:
\[ s_{b_1} = \sqrt{\frac{\frac{1}{N-2}\sum_{i=1}^{N}(y_i - b_0 - b_1 x_i)^2}{\sum_{i=1}^{N}(x_i-\bar{x})^2}} \]
Test Statistic:
\[ t = \frac{b_1 - 0}{s_{b_1}} \quad \text{(t statistic, N-2 D.F.)} \]
Rejection Region: |t| > Critical Value from Table
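Applied to the buzz regression output, the test statistic is simply the coefficient divided by its standard error. The sketch below reproduces the reported T column. The critical value 2.00 is the t-table value quoted earlier for 60 degrees of freedom, and the function name `t_ratio` is hypothetical.

```python
def t_ratio(coef, std_error):
    """t statistic for H0: coefficient = 0."""
    return coef / std_error

t_buzz = t_ratio(72.72, 10.94)      # slope
t_const = t_ratio(-14.360, 5.546)   # intercept

critical = 2.00                     # t table, N - 2 = 60 degrees of freedom
print(round(t_buzz, 2), abs(t_buzz) > critical)     # 6.65 True
print(round(t_const, 2), abs(t_const) > critical)   # -2.59 True
```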
t Statistic for Hypothesis Test
Alternative Approach: The P value
Hypothesis: β1 = 0
The ‘P value’ is the probability that you would have observed the evidence on this hypothesis that you did observe if the null hypothesis were true.
P = Prob(|t| would be this large | β1 = 0)
If the P value is less than the Type I error probability (usually 0.05) you have chosen, you will reject the hypothesis.
Interpret: If the hypothesis were true, it is ‘unlikely’ that I would have observed this evidence.
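A rough sketch of the P value calculation using only the Python standard library. It uses the standard normal distribution as a large-sample approximation to the t distribution, which is an assumption of this sketch. Software that uses the exact t distribution with N-2 degrees of freedom will report slightly larger P values in small samples.

```python
import math

def two_sided_p_normal(t):
    """Two-sided P value under a standard normal approximation:
    P(|Z| >= |t|) = erfc(|t| / sqrt(2))."""
    return math.erfc(abs(t) / math.sqrt(2.0))

p_const = two_sided_p_normal(-2.59)   # roughly 0.01
p_buzz = two_sided_p_normal(6.65)     # essentially zero

alpha = 0.05
print(f"p = {p_const:.4f}, reject: {p_const < alpha}")
print(f"p = {p_buzz:.2e}, reject: {p_buzz < alpha}")
```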
P value for hypothesis test
Intuitive approach: Does the confidence interval contain zero?
Hypothesis: β1 = 0
The confidence interval contains the set of plausible values of β1 based on the data and the test.
If the confidence interval does not contain 0, reject H0: β1 = 0.
More General Test
For Testing a Hypothesis about a Regression Coefficient
Hypothesis: H0: β1 = B, H1: β1 ≠ B
Estimator: $b_1$
Standard Error:
\[ s_{b_1} = \sqrt{\frac{\frac{1}{N-2}\sum_{i=1}^{N}(y_i - b_0 - b_1 x_i)^2}{\sum_{i=1}^{N}(x_i-\bar{x})^2}} \]
Test Statistic:
\[ t = \frac{b_1 - B}{s_{b_1}} \quad \text{(t statistic, N-2 D.F.)} \]
Rejection Region: |t| > Critical Value from Table
H0: β1 = 100; H1: β1 ≠ 100
Test statistic: t = (b1 - 100)/SE(b1) = (72.72 - 100)/10.94 = -2.49
Critical t = ±2.00. Since |t| = 2.49 > 2.00, H0 is rejected.
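The same calculation as a small sketch, written for an arbitrary hypothesized value B. The function name `slope_t_test` is hypothetical; the inputs are the reported slope and standard error.

```python
def slope_t_test(b1, se_b1, hypothesized, critical_t):
    """Two-sided t test of H0: beta1 = hypothesized.
    Returns the t statistic and whether H0 is rejected."""
    t = (b1 - hypothesized) / se_b1
    return t, abs(t) > critical_t

t, reject = slope_t_test(72.72, 10.94, hypothesized=100.0, critical_t=2.00)
print(round(t, 2), reject)   # -2.49 True
```

Setting `hypothesized=0.0` gives the test of no relationship as a special case.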
Summary: Regression Analysis
Investigate: Is the coefficient in a regression model really nonzero?
Testing procedure:
Model: y = β0 + β1x + ε
Hypothesis: H0: β1 = B (B = 0 for the test of no relationship).
Rejection region: Least squares coefficient b1 is far from B.
Test: α level for the test = 0.05 as usual
Compute t = (b1 - B)/StandardError
Reject H0 if |t| is above the critical value: 1.96 if large sample; value from t table if small sample.
Reject H0 if reported P value is less than α level
Degrees of freedom for the t statistic is N-2