(classes 1 & 2) 2 var regression-for upload

7/29/2019 (Classes 1 & 2) 2 Var Regression-For Upload

1/99

1

The Two-Variable Regression Model

Reminder: open OLS B-hat formulas example-sport.xls


2/99

Slide #2

Intentionally left blank


3/99

Slide #3

Do Large Market NBA Teams Make Higher Profits?

See regression output.

Profit   Market Size   Wins
 33.4         3         57
 22.0         3         63
 16.0         3         46
  8.7         2         42
  5.4         2         44
  4.7         2         55
 -1.5         1         35
 -2.1         1         13
 -4.0         1         28

NOTE: NBA market size = 3 for large, 2 for medium, 1 for small

NOTE: Don't use this approach for measuring market size.

Use a better measure like population.

Profit is in millions of $.


4/99

Slide #4

II. Regression

To study the influence of advertising on profits, the Celtics compiled the data in Table 1. This sample is for each of the last five years. Ad expenditures are in $100,000s and profits are in millions of dollars.

Table 1.

Year                  1    2    3     4     5
Ad Expenditures (x)   2    3    4.5   5.5   7
Profit (y)            3    6    8     10    11


5/99

Slide #5

Regression (cont.)

Ad Expenditures (x) 2 3 4.5 5.5 7

Profit (y) 3 6 8 10 11

The Celtics need answers to these questions

1. Does advertising increase profit?

2. How much does another $100,000 spent on advertising increase profit?


6/99

Slide #6

Regression (cont.)

3. What will our profit be if we spend $800,000 on advertising?

4. How much will we need to spend on ads to generate $12,000,000 in profit?


7/99

Slide #7

Regression (cont.)

Surprisingly, one statistical decision-making tool can provide answers to all of these questions -- and more. The tool is called regression analysis.


8/99

Slide #8

Regression (cont.)

A. Regression analysis is a statistical technique

B. Attempts to "explain" movements in one variable, the dependent variable . . .


9/99

Slide #9

Regression (cont.)

C. as a function of movements in a set of other variables, the independent variables . . .

D. through the quantification of one or more equations.


10/99

Slide #10

Regression (cont.)

E. Two-variable model

1. Simplest of regression models: y = α + βx + ε

2. Model is used to describe behavior of variables; often an equation

3. Will cover

a) Estimating it

b) Testing hypotheses


11/99

Slide #11

Who Uses This

Northwestern Memorial Hospital, which has the largest birthing facility in the Midwest, uses a simple regression model to forecast delivery volume based on previous delivery volumes. (Source: Jerry Lassa, Northwestern Memorial Hospital, Chicago, IL.)


12/99

Slide #12

Who Uses This

IRI, the largest market research firm in the United States, uses simple regression on adjusted weekly sales data to determine baseline sales when there is no special promotion. (Source: Doug Honnold, IRI, Inc., Chicago, IL.)


13/99

Slide #13

Regression (cont.)

To use regression analysis for answering those questions, the analyst needs to find the line which best fits the data. That is, she needs to find the line that best represents the average relationship between x & y in this data:

y = α + βx + ε


14/99

Slide #14

Regression (cont.)

The line which best represents the average relationship between x & y in this data can be written as

y = α + βx + ε

where α is the intercept of the line & β is its slope.

The α, β are called regression coefficients. Also, α & β are unknown parameters.


15/99

Slide #15


Profit (y) 3 6 8 10 11

[Figure: scatter plot of the five data points; vertical axis y (Profit), horizontal axis x (Ad expenditures)]

Goal: find the line which best fits the data; that is, find the line which best represents the average relationship between x & y in this data sample.


16/99

Slide #16


Profit (y) 3 6 8 10 11

The actual line can be written as y = a + bx, where a is the intercept of the line & b is its slope.

One possible set of values for a and b gives y = 0.65 + 1.58x.

[Figure: scatter plot of the five data points; vertical axis y (Profit), horizontal axis x (Ad expenditures)]


17/99

Slide #17

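y = a + bx; one possible set of values for a and b gives y = 0.65 + 1.58x.

[Figure: scatter plot of the five data points; vertical axis y (Profit), horizontal axis x (Ad expenditures); the intercept (0.65) is marked]

Profit (y) 3 6 8 10 11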


18/99

Slide #18

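y = a + bx; one possible set of values for a and b gives y = 0.65 + 1.58x.

[Figure: scatter plot of the five data points; vertical axis y (Profit), horizontal axis x (Ad expenditures); the intercept (0.65) and the slope (1.58) are marked]

Profit (y) 3 6 8 10 11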


19/99

Slide #19

III. Error term

y = α + βx + ε

A. Error term needed in model for combination of four reasons

1. variables omitted from model

2. captures effects of nonlinearities in model

3. errors in measuring Y

4. random effects.


20/99

Slide #20

IV. Background

A. Variances & Standard Deviation

1. There are many terms to be learned

a) SAMPLE VARIANCE OF X (OR OF MANY Xs) = SXX

b) **VARIANCE OF THE ERROR TERM = σ²

c) ESTIMATOR OF σ² = s² = sum of (εt-hat)² / (N-K)

d) STANDARD ERROR OF THE RESIDUALS = s = square root of s²

** most important


21/99

Slide #21

Background (cont.)

e) VARIANCE OF β-hat = σ²_β-hat = σ²/SXX

f) ESTIMATOR OF σ²_β-hat = s²_β-hat = s²/SXX

g) **STANDARD ERROR OF β-hat = s_β-hat = square root of s²_β-hat

** most important
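The background terms can be traced through the Celtics data from Table 1. This is a minimal sketch, not the course's spreadsheet: it assumes SXX denotes the sum of squared deviations of x from its mean (which is what the formula σ²/SXX requires), N = 5 observations, and K = 2 estimated coefficients. All variable names are illustrative.

```python
import math

x = [2, 3, 4.5, 5.5, 7]   # ad expenditures ($100,000s)
y = [3, 6, 8, 10, 11]     # profit ($ millions)
n, k = len(x), 2          # N observations, K estimated coefficients

x_bar, y_bar = sum(x) / n, sum(y) / n
# Assumption: SXX is the sum of squared deviations of x from its mean.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Residuals: actual y minus predicted y.
residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]

s2 = sum(e ** 2 for e in residuals) / (n - k)  # estimator of the error variance
s = math.sqrt(s2)                              # standard error of the residuals
s2_beta = s2 / s_xx                            # estimated variance of beta-hat
se_beta = math.sqrt(s2_beta)                   # standard error of beta-hat

print(f"s^2 = {s2:.3f}, s = {s:.3f}, se(beta-hat) = {se_beta:.3f}")
```

With only five observations the estimates are rough, but the chain from residuals to s², s, and the standard error of β-hat is the same at any sample size.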


22/99

Slide #22

V. Assumptions of Model

A. Why Bother?

1. First objective: obtain best estimates of parameters

2. Think of these assumptions as conditions that should be satisfied for obtaining best estimates

3. y = α + βx + ε


23/99

Slide #23

Assumptions (cont.)

B. Therefore, impose assumptions

1. Assumption #1 (linear regression model)

a) The regression model is linear in the unknown coefficients

b) ALSO we assume that Xt and Yt are related in a linear way: y = α + βx + ε

c) This:

(1) might be true

(2) might be good approximation if x & y are not linearly related


24/99

Slide #24

Assumptions (cont.)

2. Assumption #2 (errors average to zero)

a) Each error term εt is a random variable with E(εt) = 0.

b) Means: regression line passes through middle of data (SEE NEXT SLIDE)

3. Assumption #3 (values of the Xs vary)

a) Not all of the values of each Xt are the same

b) Means: if an X does not vary, it cannot explain variation in Y


25/99

Slide #25

Assumption #2

Means: regression line passes through middle of data

[Figure: scatter plot of y against X with the line that best represents the average relationship between x & y]


26/99

Slide #26

VI. The Method Of Least Squares

y = α + βx + ε

A. The analyst can never know the true values of α & β in the actual line above

B. She can, however, calculate her best guesses (called estimates) of the true values of α & β using statistical software, spreadsheets, or some calculators


27/99

Slide #27

Least Squares Formulas for y = α + βx + ε

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 1.58 \quad \text{(Celtics)}$$


28/99

Slide #28

Least Squares Formulas for y = α + βx + ε

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} = 0.65 \quad \text{(Celtics)}$$
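The least squares formulas for the slope and intercept can be checked against the Celtics data in Table 1. The following is a minimal Python sketch, not the course's Excel file; the function name `ols_two_variable` and the variable names are illustrative assumptions.

```python
# Celtics data from Table 1 (x in $100,000s, y in $ millions).
ad_expenditures = [2, 3, 4.5, 5.5, 7]
profit = [3, 6, 8, 10, 11]


def ols_two_variable(x, y):
    """Return (intercept, slope) from the least squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-products of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: sum of squared deviations of x from its mean.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


alpha_hat, beta_hat = ols_two_variable(ad_expenditures, profit)
print(f"alpha-hat = {alpha_hat:.2f}, beta-hat = {beta_hat:.2f}")
# prints: alpha-hat = 0.65, beta-hat = 1.58
```

The slope is computed first because the intercept formula depends on it.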


29/99

Slide #29

Example Using Data

Table 1.


Profit(y) 3 6 8 10 11

See Excel file that calculates the B-hat value

OLS B-hat formulas example-sport.xls


30/99

Slide #30

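Least Squares Formulas for y = α + β1x1 + β2x2 + ε

Use $\bar{x}_1$ for $x_1$. Use $\bar{x}_2$ for $x_2$.

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(y_i - \bar{y})}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}$$

$$\hat{\beta}_2 = \frac{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)(y_i - \bar{y})}{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)^2}$$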


31/99

Slide #31

Least Squares (cont.)

C. These estimates of α and β, called α-hat and β-hat, are called estimated regression coefficients

D. By the way, these estimated regression coefficients are numbers.


32/99

Slide #32


E. Substituting the estimated regression coefficients into α + βx gives α-hat + (β-hat)x.

F. Whenever you substitute a given value of x into α-hat + (β-hat)x, you will get:

G. y-hat = α-hat + (β-hat)x, where y-hat is the predicted value of y or predicted y.


33/99

Slide #33


F. Whenever you substitute a given value of x into α-hat + (β-hat)x, you will get:

G. y-hat = α-hat + (β-hat)x, where y-hat is the predicted value of y or predicted y.

EXAMPLE:

substitute x = 2 into y-hat = .65 + 1.58x

y-hat = .65 + 1.58(2) = 3.81

WHAT-IF SCENARIO
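A what-if scenario amounts to plugging a chosen x into the estimated line. Here is a small Python sketch using the rounded Celtics estimates from the slides; the function name `predict_profit` is a hypothetical helper, not part of any course material.

```python
def predict_profit(ad_spend, intercept=0.65, slope=1.58):
    """Predicted profit ($ millions) for ad spending given in $100,000s."""
    return intercept + slope * ad_spend


# The slide's example: x = 2 (that is, $200,000 of advertising).
print(round(predict_profit(2), 2))  # 3.81
```

Changing the argument gives the predicted profit for any other level of ad spending.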


34/99

Slide #34


H. The equation

y-hat = α-hat + (β-hat)x is the estimated regression line.

I. So, each y-hat value comes from the estimated regression line.


35/99

Slide #35


[Figure: scatter plot of the data, y against x, with the estimated regression line]

y-hat = a-hat + (b-hat)x is the estimated regression line; one possible set of values is y-hat = 0.65 + 1.58x


36/99

Slide #36

Least Squares Review

y = a + bx is actual regression line

a and b are (unknown) regression coefficients

if b > 0, then as x rises, y also rises

if b < 0, then as x rises, y falls

possible estimated values are a-hat = 0.65 and b-hat = 1.58: the values 0.65 and 1.58 are estimated regression coefficients.

0.65 + 1.58x is the estimated regression line

If b = 0, then as x rises, y . . . ?


37/99

Slide #37


J. For each actual value of x, there are usually differences between each y (actual y) & y-hat (predicted y).

Example

when x = 2, y =3 (both data)

when x = 2 put into y-hat = .65 + 1.58x,

y-hat = 3.81

(y - y-hat ) = 3 - 3.81 = -0.81


38/99

Slide #38


y - y-hat = 3 - 3.81 = -0.81

K. The value (y - y-hat) is the deviation (error) caused by calculating y from the estimated regression line.

L. The researcher's goal is to find values for α-hat & β-hat so that the sum of (y - y-hat)² is as small as possible across the entire sample of x & y values.


39/99

Slide #39


Expenditures (x)         2       3       4.5     5.5     7
Profit (y)               3       6       8       10      11
y-hat                    3.81    5.39    7.76    9.34    11.71
(y-hat = .65 + 1.58x)
error [(y) - y-hat]     -0.81    0.61    0.24    0.66   -0.71

SEE NEXT SLIDES
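The predicted values and errors for the Celtics sample can be recomputed directly from the rounded line y-hat = 0.65 + 1.58x. This is a short Python sketch with illustrative variable names; rounding to two decimals is an assumption made to match the precision used on the slides.

```python
x = [2, 3, 4.5, 5.5, 7]   # ad expenditures ($100,000s)
y = [3, 6, 8, 10, 11]     # profit ($ millions)

# Predicted y from the estimated regression line, rounded to two decimals.
y_hat = [round(0.65 + 1.58 * xi, 2) for xi in x]

# Error = actual y minus predicted y.
errors = [round(yi - yh, 2) for yi, yh in zip(y, y_hat)]

for xi, yi, yh, e in zip(x, y, y_hat, errors):
    print(f"x = {xi:<4} y = {yi:<3} y-hat = {yh:<6} error = {e}")
```

The errors are a mix of positive and negative values that nearly cancel, which is what a line through the middle of the data should produce.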


40/99

Slide #40

Y-hat - Y (error)

Y-hat = 0.65 + 1.58 X

When x = 3, actual y = 6 (from data)

[Figure: scatter plot of y against x with the data point x = 3, y = 6 marked]


41/99

Slide #41

Y-hat - Y (error)

Y-hat = 0.65 + 1.58 X

When x = 3, predicted y = 5.39 (from line)

[Figure: scatter plot of y against x marking x = 3, actual y = 6, and predicted y-hat = 5.39]


42/99

Slide #42

Y-hat - Y (error)

(y - y-hat) is the deviation (error) caused by estimating y from the estimated regression line

[Figure: scatter plot of y against x marking x = 3, y = 6, and y-hat = 5.39]

(y - y-hat) = 6 - 5.39 = .61


43/99

Slide #43


M. The method of least squares (OLS) gives the line of best fit (it fits the sample data best) by finding values of α-hat & β-hat which minimize

N. sum of (y - y-hat)²

O. Also called Error Sum of Squares or

P. ESS or SSE
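One way to see that the OLS estimates minimize the sum of squared errors is to compare them with nearby lines. The sketch below does this for the Celtics data; the shifted intercepts and slopes are arbitrary, hypothetical alternatives chosen only for illustration.

```python
x = [2, 3, 4.5, 5.5, 7]   # ad expenditures ($100,000s)
y = [3, 6, 8, 10, 11]     # profit ($ millions)


def sse(intercept, slope):
    """Sum of squared errors for the line y-hat = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# OLS estimates from the least squares formulas.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
alpha_hat = y_bar - beta_hat * x_bar
sse_ols = sse(alpha_hat, beta_hat)

# Hypothetical alternative lines: shift the intercept and/or the slope a little.
shifts = [(da, db) for da in (-0.5, 0.0, 0.5) for db in (-0.2, 0.0, 0.2)
          if (da, db) != (0.0, 0.0)]
sse_others = [sse(alpha_hat + da, beta_hat + db) for da, db in shifts]

print(f"SSE at the OLS estimates: {sse_ols:.3f}")
print(f"smallest SSE among the other lines: {min(sse_others):.3f}")
```

Every alternative line has a larger sum of squared errors than the OLS line, which is the sense in which OLS fits the sample data best.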


44/99

Slide #44


The aim of OLS is to pick values for α-hat & β-hat so that the sum of all (y - y-hat)² is as small as possible across the entire sample of x & y values

[Figure: scatter plot of the five data points, y against x]


45/99

Slide #45


Q. Software contains formulas that calculate values for α-hat & β-hat from the values of the sample's data.


46/99

Slide #46

Example #1

A. In the case of profit (y) and expenditures (x) mentioned above, the estimated least squares (OLS) regression line is y-hat = 0.65 + 1.58x.

B. The 1.58 estimate for β: (positive or negative?) relationship between profit and expenditures (sign on 1.58)

C. β is marginal effect of x on y


47/99

Slide #47

Example #1 (cont.)

y-hat = 0.65 + 1.58x

D. β is always in y's units

E. For each extra $1.00 it spends on ads, Celtics are getting $???? of profit. (notes)


48/99

Slide #48

Example #2

A. If you estimated a house price (y) and size (x) model, the estimated least squares (OLS) regression line is y-hat = 52.351 + 0.139x.

B. OR PRICE = 52.351 + 0.139 SQFT

C. y = Price ($1000s)

D. x = Size (sq. feet)


49/99

Slide #49

Example #2 (cont.)

PRICE = 52.351 + 0.139 SQFT

E. If size increases by 1 unit (1 sq. ft.), price . . . (direction?) (notes)


50/99

Slide #50

Example #2 (cont.)

PRICE = 52.351 + 0.139 SQFT

G. If size increases by 1 unit (1 sq. ft.), price rises . . . (magnitude?) (notes)


51/99

Slide #51

Exercise

You will next do an exercise to help consolidate many of the concepts you have seen in this introduction to regression.

Answer the questions on the next slide.

Form groups and work on the questions for 5 minutes.

I will then lead a discussion of your answers.


52/99

Slide #52

1. Does advertising increase profit?

2. How much does another $100,000 increase profit?

3. What will our profit be if we spend $800,000 on advertising?

4. How much will we need to spend on ads to generate $12,000,000 in profit?

Ad Expenditures (x)   2    3    4.5   5.5   7
Profit (y)            3    6    8     10    11

After estimating the regression line y = a + bx, the computer prints these results: a-hat = 0.65, b-hat = 1.58. This means that y-hat = 0.65 + 1.58x.



61/99

Slide #61

Look at regression #1 on your handout and tell me the answer.

Do Large Market NBA Teams Make Higher Profits?


62/99

Slide #62

Dummy Variables

A. New type of variable

1. Usually use quantitative variables; continuous

2. Sometimes variables take small number of values; discrete

a) Market size

b) Gender

c) Season

d) Marital status (married vs. not), etc.


63/99

Slide #63

Dummy Variables (cont.)

B. Will use Qualitative (or Dummy) Variables

1. Create a special variable that takes a value of

a) 1 if the unit of observation falls into one category

b) 0 if the unit falls into the other category


64/99

Slide #64


C. Dummy variable as IV

PRICE = 52.351 + 0.139 SQFT + 18.52 POOL

POOL = 1 if house has a pool

POOL = 0 if no pool

More later
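The effect of a dummy variable as an IV can be seen by predicting the price of the same house with and without a pool. The Python sketch below uses the slide's estimated equation; the function name `predicted_price` and the 2,000 sq. ft. house are hypothetical, and PRICE is taken to be in $1000s as in Example #2.

```python
def predicted_price(sqft, has_pool):
    """Predicted house price ($1000s) from the estimated equation on the slide."""
    pool = 1 if has_pool else 0   # dummy variable: 1 = pool, 0 = no pool
    return 52.351 + 0.139 * sqft + 18.52 * pool


# Hypothetical 2,000 sq. ft. house, with and without a pool.
with_pool = predicted_price(2000, True)
without_pool = predicted_price(2000, False)

print(f"with pool:    {with_pool:.3f}")
print(f"without pool: {without_pool:.3f}")
print(f"difference:   {with_pool - without_pool:.2f}")
```

Whatever size is chosen, the difference equals the POOL coefficient: the dummy shifts the intercept of the line and leaves the SQFT slope unchanged.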


65/99

Slide #65


D. Dummy variable as DV

BUY= 3.15 + 10.19 INCOME - 1.5 PRICE

BUY = 1 if buy the house

BUY = 0 if don't buy house

More later


66/99

Slide #66

VII. Regression Hypotheses Tests

Look at regression #1 on your handout

Questions:

1. Is there relationship among DV & IVs? Profit and market size

2. How well does my model fit the data? How well does market size explain profit?

3. Which IVs affect DV? Does market size influence profit?

Note: same as #1 when have only one IV


67/99

Slide #67

VIII. Testing Entire Regression (F test)

A. Always first test

B. Tests entire model

C. If model fails this test

1.

2. Model no good

3. Back to drawing board


68/99

Slide #68

F test (cont.)

D. Step #1

1. HO: no relationship between y & xs

2. OR HO: β1 = β2 = . . . = βk = 0

3. HA: is relationship between y & xs

4. OR HA: at least 1 of βs ≠ 0

E. Step #2

1. Software prints test statistic

y = α + β1x1 + β2x2 + β3x3 + ε


69/99

Slide #69

F test (cont.)

F. Step #3

1. Software prints a p-value

2. p-value is probability of making a Type I error

G. Step #4

1. Reject HO if p-value ≤ 5% (or 1%)

2. Do not reject HO if p-value > 5% (or 1%)


70/99

Slide #70

F test (cont.)

H. Rule

1. Large F-statistics are better

I. Logic

1. F-statistic is ratio: explained variance / unexplained variance


71/99

Slide #71

F test (cont.)

F = explained variance / unexplained variance

3. If numerator = 0 (small)

a) F = 0 (small)

b) Model terrible

4. If numerator large vs. denominator

a) F large

b) Model good
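The ratio of explained to unexplained variance can be computed by hand for the two-variable Celtics model. The sketch below is an illustration only: it assumes the usual degrees-of-freedom scaling (K - 1 for the explained part, N - K for the unexplained part, with K = 2), and follows the slides' naming of RSS for the explained and ESS for the error sum of squares.

```python
x = [2, 3, 4.5, 5.5, 7]   # ad expenditures ($100,000s)
y = [3, 6, 8, 10, 11]     # profit ($ millions)
n, k = len(x), 2          # N observations, K estimated coefficients

x_bar, y_bar = sum(x) / n, sum(y) / n
beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
alpha_hat = y_bar - beta_hat * x_bar
y_hat = [alpha_hat + beta_hat * xi for xi in x]

rss = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained sum of squares
ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained sum of squares

explained_variance = rss / (k - 1)
unexplained_variance = ess / (n - k)
f_stat = explained_variance / unexplained_variance

print(f"F = {f_stat:.2f}")
```

The numerator is large relative to the denominator here, so F is large. The p-value that software prints alongside F is not computed in this sketch.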


72/99

Slide #72

F test (cont.)

Example

PROFIT = -20.8 + .185 WINS + 11.2 MKTSIZE

F = 21.10 (0.002)

Do MKTSIZE & WINS jointly influence PROFIT?

See regression output for F statistic.


73/99

Slide #73

IX. Testing Regression: Goodness of Fit

A. This is about measuring how well the estimated regression line fits the data.

B. The OLS estimated regression line fits the data better than other lines -- but how well does it fit?


74/99

Slide #74

Goodness of Fit (cont.)

C. To measure how well the regression line fits the data (its goodness of fit), use the calculated value called R².

D. R² tells the percentage of the variation among y-values in your sample data that is explained by variation of the x-values that are in your regression.


75/99

Slide #75


E. R² = explained variation / total variation

= RSS / TSS

= 1 - (ESS / TSS)

(recall: TSS = RSS + ESS)

F. Use this second, after F test
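The two expressions for R² can be checked against each other on the Celtics data. This Python sketch is illustrative; it keeps the slides' naming (RSS for explained, ESS for error, TSS for total sum of squares), and the variable names are assumptions.

```python
x = [2, 3, 4.5, 5.5, 7]   # ad expenditures ($100,000s)
y = [3, 6, 8, 10, 11]     # profit ($ millions)
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
alpha_hat = y_bar - beta_hat * x_bar
y_hat = [alpha_hat + beta_hat * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)                # total variation
rss = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

r2_from_rss = rss / tss
r2_from_ess = 1 - ess / tss

print(f"TSS = {tss:.3f}, RSS = {rss:.3f}, ESS = {ess:.3f}")
print(f"R^2 = {r2_from_rss:.3f} (RSS/TSS), {r2_from_ess:.3f} (1 - ESS/TSS)")
```

Both routes give the same number because TSS = RSS + ESS for an OLS line with an intercept.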


76/99

Slide #76


R² = explained variation / total variation

Questions about R²

1. What is max R² percentage?

2. What is min R² percentage?

3. The closer R² is to ??, the better the fit of the estimated regression line to the data

4. The closer R² is to ??, the worse the fit of the estimated regression line to the data


77/99

Slide #77

Questions about R² (cont.)

R² = explained variation / total variation

5. Which is better: R² = .25 or R² = .89?

6. Why?


78/99

Slide #78

Characteristics of R2

A. 0 ≤ R² ≤ 1.00

(same as 0% ≤ R² ≤ 100%)

B. By the way, R² = 1.00 is perfect correlation,

C. either positive or negative

D. You can't tell from the R² alone.


79/99

Slide #79

Characteristics of R² Review

E. 0 ≤ R² ≤ 1.00

(same as 0% ≤ R² ≤ 100%)

F. The closer R² is to 1, the better the fit of the estimated regression line to the data

G. The closer R² is to 0, the worse the fit of the estimated regression line to the data


80/99

Slide #80


G. A high R² does not mean that the xs cause y. It means that xs and y are highly correlated.

H. Example:

if R² = 0.89, means 89% of variation of y is explained by the regression line (see next two slides)


81/99

Slide #81

[Figure: two scatter plots of y against x. Better fit of regression line to data: R² = 0.89. Worse fit of regression line to data: R² = 0.25.]


82/99

Slide #82

[Figure: two scatter plots of y against x, labeled A and B]

Which one has R² = 0.89? Which one has R² = 0.10?


83/99

Slide #83

Example

PROFIT = -20.8 + .185 WINS + 11.2 MKTSIZE

with R² = 87.6%

What % of (variation or variance?) in PROFIT is explained by MKTSIZE & WINS?

How good is this estimated regression line?

See regression output for R².


84/99

Slide #84

X. Testing Regression: t-tests

A. After R² (third)

B. Tests individual IVs

C. Questions

1. Does x2 affect y?

2. Does x3 affect y?

3. Etc.

NOTE: t-tests apply to ALL IVs incl. dummy variables


85/99

Slide #85

t-tests (cont.)

D. If any xk does not affect y

1. Do NOT automatically drop xk

2. more later



86/99

Slide #86

t-tests (cont.)

E. Step #1

1. HO: βk = 0

2. Means no relationship between y & xk

3. HA: βk ≠ 0

4. Means is relationship between y & xk

y = α + β1x1 + β2x2 + β3x3 + ε


87/99

Slide #87

t-tests (cont.)

F. Step #2

1. Software prints test statistic

G. Step #3

1. Software prints a p-value

2. p-value is probability of making a Type I error

H. Step #4

1. Reject HO if p-value ≤ 5% (or 1%)

2. Do not reject HO if p-value > 5% (or 1%)


88/99

Slide #88

t-tests (cont.)

I. Formula

1. t(n-K) = [(βk-hat) - (βk hypothesized)] / s_βk-hat

2. since usually HO: βk = 0

3. t(n-K) = (βk-hat) / s_βk-hat

J. Rules

1. Large t statistics are better

2. Small p-values are better
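The t formula can be applied to the slope of the Celtics regression. The sketch below is illustrative: `t_statistic` is a hypothetical helper, the standard error is computed as the square root of s²/SXX with SXX taken to be the sum of squared deviations of x, and K = 2.

```python
import math


def t_statistic(estimate, std_error, hypothesized=0.0):
    """t = (estimate - hypothesized value) / standard error of the estimate."""
    return (estimate - hypothesized) / std_error


x = [2, 3, 4.5, 5.5, 7]   # ad expenditures ($100,000s)
y = [3, 6, 8, 10, 11]     # profit ($ millions)
n, k = len(x), 2

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Standard error of beta-hat: square root of s^2 / SXX.
s2 = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y)) / (n - k)
se_beta = math.sqrt(s2 / s_xx)

t_beta = t_statistic(beta_hat, se_beta)   # HO: beta = 0
print(f"t = {t_beta:.2f} with {n - k} degrees of freedom")
```

In a two-variable model the square of this t statistic equals the F statistic for the whole regression, which is why the two tests agree when there is only one IV.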


89/99

Slide #89

t-tests (cont.)

K. Limitations of t-test

1. Does not prove theoretical validity

a) Only shows statistically significant correlation

b) y-hat = 10.9 + 3.2 x
           (19.5)  (13.9)
           (.0001) (.0001)

c) appears that x causes y


90/99

Slide #90

t-tests (cont.)

d) actually

(1) y = Britain's overall business activity

(2) x = sunspot activity

e) actual regression from 19th century!

2. Does not prove causality

a) Only shows statistically significant correlation

b) See above

c) You impose causality by choices of DV & IVs


91/99

Slide #91

t-tests (cont.)

3. Does not show which x has biggest impact on y

a) Common mistake

(1) "Biggest t-statistic means that x has biggest impact on y" -- FALSE

b) size of t shows prob. of type I error

(1) larger t value: less chance of type I error

(2) larger t value: more confidence relationship exists


92/99

Slide #92

t-tests (cont.)

Suppose

PROFIT = -20.8  + .185 WINS + 11.2 MKTSIZE
         (-4.02)  (1.41)      (4.51)
         (0.007)  (0.207)     (0.004)

Is α = 0 or α ≠ 0?

Is β1 = 0 or β1 ≠ 0?

Is β2 = 0 or β2 ≠ 0?

See regression output for t-statistics.
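The Step #4 decision rule can be applied mechanically to this output. The sketch below assumes that the second row of parentheses holds the p-values (the first row being the t-statistics) and that the rule is "reject HO if p-value ≤ significance level"; the function name `decide` is a hypothetical helper.

```python
def decide(p_value, significance=0.05):
    """Apply the decision rule: reject HO when the p-value is at or below the level."""
    return "reject HO" if p_value <= significance else "do not reject HO"


# Assumed p-values, read from the second row of parentheses in the output.
p_values = {"intercept": 0.007, "WINS": 0.207, "MKTSIZE": 0.004}

decisions = {name: decide(p) for name, p in p_values.items()}
for name, decision in decisions.items():
    print(f"{name:<10} p = {p_values[name]:.3f} -> {decision}")
```

Under these assumptions the intercept and MKTSIZE are statistically significant at the 5% level, while WINS is not. As the earlier slides note, that alone is not a reason to drop WINS automatically.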


93/99

Slide #93

Evaluate the Model

When you are asked to evaluate the model:

F statistic

R²

Each t-statistic

Overall evaluation: Weak, fair, . . ., Yippee!

See the handout titled Evaluate the Model


94/99

Slide #94

XI. Testing Regression: Akaike Information Criteria

Number printed by many statistical applications along with rest of regression output

Simple rule: the lower the value of AIC, the better the model
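The "lower is better" rule can be illustrated by comparing two models for the Celtics data. This sketch is an assumption-laden illustration: it uses one common least-squares form of AIC, N ln(ESS/N) + 2K, which may differ by a constant or a scaling from what a given statistical package prints. Only the comparison between models matters here.

```python
import math


def aic(ess, n_obs, n_coef):
    """One common least-squares form of AIC (assumed; packages differ in scaling)."""
    return n_obs * math.log(ess / n_obs) + 2 * n_coef


x = [2, 3, 4.5, 5.5, 7]   # ad expenditures ($100,000s)
y = [3, 6, 8, 10, 11]     # profit ($ millions)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Model 1: intercept only (predict the mean of y for every x), K = 1.
ess_mean_only = sum((yi - y_bar) ** 2 for yi in y)

# Model 2: two-variable regression y-hat = alpha-hat + (beta-hat) x, K = 2.
beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
alpha_hat = y_bar - beta_hat * x_bar
ess_line = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))

aic_mean_only = aic(ess_mean_only, n, 1)
aic_line = aic(ess_line, n, 2)

print(f"AIC, intercept only:     {aic_mean_only:.2f}")
print(f"AIC, two-variable model: {aic_line:.2f}")
```

The extra coefficient adds a penalty of 2, but the drop in ESS more than offsets it, so the two-variable model has the lower AIC.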


95/99

Slide #95

XII. Testing Regression

The most difficult way to test a regression model is by applying . . .

COMMON SENSE!!

Remember: GNP = 10.9 + 3.2 sunspots

Does that make sense?


96/99

Slide #96

XIII. Nonlinear Relationships

A. Introduction

1. So far: y = α + βx + ε

2. Linear relationship between y & x

3. Many nonlinear relationships in world

4. Can model those as well

5. More later

[Figure: scatter plot of data points]


97/99

Slide #97

Profit   Market Size   Wins
 33.4         3         25
 22.0         3         63
 16.0         3         55
  8.7         2         42
  5.4         2         57
  4.7         2         46
 -1.5         1         39
 -2.1         1         13
 -4.0         1         28

NOTE: NBA market size = 3 for large, 2 = medium, 1 = small

NOTE: don't use this approach for measuring market size. Use a better measure like population.

Do More Wins Generate Higher Profits in the NBA?

See regression output and EVALUATE THIS MODEL.


98/99

Slide #98

QB Rating Handout

Evaluate this regression output

Tom Brady

New England Patriots

MVP Super Bowl XXXVIII


99/99

Exercise

Two-variable Regression Exercise #1
