Multiple Regression

©2005 Dr. B. C. Paul

Problems with Regression So Far

We have only been able to consider one factor as controlling at a time.

Everything else has gone into the sum of squares for error.

With our MPG we suspect other factors are involved.

Looking at Other Variables

Let’s try to plot MPG against outside temperature.

Click on Graphs

Highlight Interactive to bring up the side menu

Highlight and click on scatterplot

Setting the Plot

Move your Y and X axis variables into position.

This interface requires you to use a drag-and-drop method rather than clicking on arrows.

When you are done, click OK.

Up Comes the Plot

There appears to be evidence that MPG improves as the outside temperature increases.

Ordering a Model

We would like to have a model that includes more than one factor at a time

Such a model exists

$$Y = B_0 + B_1 \cdot \text{Distance} + B_2 \cdot \text{Temp} + B_3 \cdot \text{Age}$$

Function of the Model

Works by least square error.

The objective remains to pick the coefficients such that the average squared error between the model and the data points is a minimum.

Again we will skip any derivations or explanations of how this is done. For us, we’ll push the right SPSS buttons.
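For readers who want to see what the SPSS buttons are doing, here is a minimal Python sketch of a least squares fit with an intercept and three factors (distance, temperature, age). It assumes NumPy is available. The data and the "assumed" coefficients are invented for illustration and are not the course MPG data; `fit_least_squares` and `sum_squared_error` are hypothetical helper names.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return [B0, B1, ...] minimizing the sum of squared errors."""
    design = np.column_stack([np.ones(len(y)), X])  # leading column of 1s gives B0
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def sum_squared_error(coef, X, y):
    """Sum of squared differences between the model and the data points."""
    design = np.column_stack([np.ones(len(y)), X])
    resid = y - design @ coef
    return float(resid @ resid)

# Invented data for illustration only (not the course MPG data set).
rng = np.random.default_rng(0)
n = 40
distance = rng.uniform(1, 40, n)
temp = rng.uniform(20, 90, n)
age = rng.uniform(0, 10, n)
X = np.column_stack([distance, temp, age])
assumed_coef = np.array([10.0, 0.3, 0.1, -0.2])  # B0, B1, B2, B3 (made up)
y = assumed_coef[0] + X @ assumed_coef[1:] + rng.normal(0.0, 0.5, n)

coef = fit_least_squares(X, y)
print("fitted B0..B3:", np.round(coef, 3))
print("SSE at the fit:", round(sum_squared_error(coef, X, y), 3))
```

No other choice of coefficients, including the ones used to generate the data, can give a smaller sum of squared errors on this sample than the fitted ones.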

How Do We Deal with Significance?

We have seen that any coefficient in a model can be analyzed individually for certainty that it is not really zero (if it’s zero, that term is not even in the model).

The trick! The significance of the coefficient depends on

How much of the total variation the model explains

How much of the credit for that is going to other variables

It makes a difference what is in the model.

Example: for MPG = f(distance driven), linear regression was significant. When a quadratic term was added the regression fit improved, but neither term, including the linear, was significant.
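The way credit gets shared between terms can be illustrated with a small Python sketch, assuming NumPy. The data are invented (a truly linear relation plus noise), and `t_statistics` is a hypothetical helper that computes the usual coefficient-over-standard-error T score. The point is only that the T score for distance shrinks once distance squared is also in the model.

```python
import numpy as np

def t_statistics(X, y):
    """T score of each coefficient (intercept first) in a least squares fit."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = len(y) - design.shape[1]          # points minus number of coefficients
    sigma2 = resid @ resid / dof            # estimate of the error variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef / np.sqrt(np.diag(cov))

# Invented data: MPG rises linearly with distance, plus random noise.
rng = np.random.default_rng(1)
distance = rng.uniform(1, 40, 40)
mpg = 20.0 + 0.3 * distance + rng.normal(0.0, 1.0, 40)

t_linear = t_statistics(distance.reshape(-1, 1), mpg)
t_quadratic = t_statistics(np.column_stack([distance, distance**2]), mpg)

print("T for distance, linear model:   ", round(float(t_linear[1]), 2))
print("T for distance, quadratic model:", round(float(t_quadratic[1]), 2))
```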

Method of Variable Entry

The Decree Method

I can tell SPSS I want it to do a regression with such and such a variable.

SPSS will do the best regression it can and then show me the ANOVA table.

I can look and see whether I believe my coefficients are strong enough for me to sign off on.

Forward Regression

The computer will look at the variables available.

It will try a linear regression on each one.

If one variable comes up at 95% significant it becomes a candidate to enter.

The computer will get the significance of each variable and if several are over 95% it will pick the best.

The computer will then look at the remaining variables to explain the residuals.

It will try each variable and check its significance in explaining the residuals.

It looks for variables over 95% significant and then chooses the best.

Forward Regression Continued

The process of variables entering continues until all variables have been selected or no more variables are significant.

95% significance is the “default” significance to enter.

We can reset to a different significance level.
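The forward procedure can be sketched in Python, assuming NumPy. This is an illustration under stated assumptions, not SPSS's implementation: the data are invented, `forward_select` and `t_statistics` are hypothetical helper names, and a cutoff of |T| >= 2 stands in roughly for the 95% significance to enter (SPSS works from exact probabilities). Each candidate is refit together with the variables already chosen, which is a common way of asking whether it explains what is left over.

```python
import numpy as np

def t_statistics(X, y):
    """T score of each coefficient (intercept first) in a least squares fit."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = len(y) - design.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(design.T @ design)
    return coef / np.sqrt(np.diag(cov))

def forward_select(X, y, names, t_enter=2.0):
    """Add one variable at a time, always the strongest that clears t_enter."""
    chosen = []
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score each candidate by its own |T| when added to the current model.
        scores = {j: abs(t_statistics(X[:, chosen + [j]], y)[-1]) for j in remaining}
        best = max(scores, key=scores.get)
        if scores[best] < t_enter:
            break  # nothing left is significant enough to enter
        chosen.append(best)
        remaining.remove(best)
    return [names[j] for j in chosen]

# Invented data: MPG depends on temp and distance; age has no real effect here.
rng = np.random.default_rng(2)
n = 60
distance = rng.uniform(1, 40, n)
temp = rng.uniform(20, 90, n)
age = rng.uniform(0, 10, n)
mpg = 5.0 + 0.3 * temp + 0.2 * distance + rng.normal(0.0, 1.0, n)

X = np.column_stack([distance, temp, age])
names = ["distance", "temp", "age"]
selected = forward_select(X, mpg, names)
print("entry order:", selected)
```

Raising `t_enter` plays the role of resetting the significance to enter to a stricter level.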

Backward Regression

Computer starts by doing a regression with all possible variables in the equation

Computer then does a T test on each coefficient to see if the coefficient might be zero.

Any variable that falls below 75% significance is removed from the equation (75% is a default that you can reset).

The T tests are then repeated.

The process repeats until no more variables can be thrown out of the equation
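A matching Python sketch of the backward procedure, assuming NumPy. The data are invented, `backward_eliminate` and `t_statistics` are hypothetical names, and |T| < 1.15 is used as a rough large-sample stand-in for falling below 75% significance. As a simplifying assumption, this sketch throws out only the single weakest variable per pass before repeating the T tests.

```python
import numpy as np

def t_statistics(X, y):
    """T score of each coefficient (intercept first) in a least squares fit."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = len(y) - design.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(design.T @ design)
    return coef / np.sqrt(np.diag(cov))

def backward_eliminate(X, y, names, t_remove=1.15):
    """Start with every variable, then drop the weakest until all clear t_remove."""
    chosen = list(range(X.shape[1]))
    while chosen:
        t = np.abs(t_statistics(X[:, chosen], y)[1:])  # skip the intercept
        worst = int(np.argmin(t))
        if t[worst] >= t_remove:
            break  # every remaining variable is strong enough to stay
        del chosen[worst]
    return [names[j] for j in chosen]

# Invented data: MPG depends on temp and distance; age has no real effect here.
rng = np.random.default_rng(3)
n = 60
distance = rng.uniform(1, 40, n)
temp = rng.uniform(20, 90, n)
age = rng.uniform(0, 10, n)
mpg = 5.0 + 0.3 * temp + 0.2 * distance + rng.normal(0.0, 1.0, n)

X = np.column_stack([distance, temp, age])
names = ["distance", "temp", "age"]
kept = backward_eliminate(X, mpg, names)
print("variables kept:", kept)
```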

Step Wise Regression

Starts out like Forward Regression.

Moves forward till two variables are in the equation.

Now the computer does T tests on all the variables in the equation, like a Backward Regression, to see if any should be thrown out.

If not it goes into another Forward step.

After the forward step it checks with T tests on all variables in the equation.

It continues the back and forth process until nothing changes or it goes into an infinite loop.
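The back and forth can be sketched in Python as well, assuming NumPy. Everything here is illustrative: the data are invented, `stepwise` and `t_statistics` are hypothetical names, and |T| cutoffs of 2.0 (enter) and 1.65 (remove) are rough stand-ins for significance levels. The `max_steps` argument is an added safeguard, not something taken from SPSS, so the loop cannot run forever.

```python
import numpy as np

def t_statistics(X, y):
    """T score of each coefficient (intercept first) in a least squares fit."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = len(y) - design.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(design.T @ design)
    return coef / np.sqrt(np.diag(cov))

def stepwise(X, y, names, t_enter=2.0, t_remove=1.65, max_steps=50):
    """Alternate a forward step with a backward check until nothing changes."""
    chosen = []
    for _ in range(max_steps):  # guard against cycling forever
        changed = False
        # Forward step: enter the strongest candidate if it clears t_enter.
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        scores = {j: abs(t_statistics(X[:, chosen + [j]], y)[-1]) for j in candidates}
        if scores:
            best = max(scores, key=scores.get)
            if scores[best] >= t_enter:
                chosen.append(best)
                changed = True
        # Backward check: once two or more are in, drop the weakest if it fails.
        if len(chosen) >= 2:
            t = np.abs(t_statistics(X[:, chosen], y)[1:])
            worst = int(np.argmin(t))
            if t[worst] < t_remove:
                del chosen[worst]
                changed = True
        if not changed:
            break
    return [names[j] for j in chosen]

# Invented data: MPG depends on temp and distance; age has no real effect here.
rng = np.random.default_rng(6)
n = 60
distance = rng.uniform(1, 40, n)
temp = rng.uniform(20, 90, n)
age = rng.uniform(0, 10, n)
mpg = 5.0 + 0.3 * temp + 0.2 * distance + rng.normal(0.0, 1.0, n)

X = np.column_stack([distance, temp, age])
names = ["distance", "temp", "age"]
final_model = stepwise(X, mpg, names)
print("final model:", final_model)
```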

What are the chances that the Methods will give you the same answer?

About zero.

Some smaller, easier sets will converge for all methods, but the larger sets usually do not yield the same answer.

Which is right? Maybe it’s a dumb question.

Method does influence answer.

It may be more important that you carefully make sure you have a good, defensible method.

(Maybe it’s the teacher’s favorite answer: Step Wise)

Let’s Try It

I added a variable for Distance Squared.

Multilinear regression can only consider linear effects of a variable, but I can trick it by creating a non-linear variable.

In my case I still think I saw the MPG bending down as the drive distance increased (logical, because the engine warmed up).
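The trick of creating a non-linear variable looks like this in a minimal Python sketch, assuming NumPy. The curve is invented and noise-free so the result is easy to check, and `fit_and_r2` is a hypothetical helper. The regression itself stays linear in its coefficients; the squared column is simply one more input.

```python
import numpy as np

def fit_and_r2(X, y):
    """Least squares coefficients (intercept first) and fraction of variation explained."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    total = y - y.mean()
    return coef, 1.0 - (resid @ resid) / (total @ total)

# Invented, noise-free curve that bends down as distance grows.
distance = np.linspace(1.0, 40.0, 30)
mpg = 10.0 + 1.0 * distance - 0.02 * distance**2

# Distance alone: a straight line cannot follow the bend.
coef_lin, r2_lin = fit_and_r2(distance.reshape(-1, 1), mpg)

# Distance plus a distance-squared column: still a linear regression.
coef_quad, r2_quad = fit_and_r2(np.column_stack([distance, distance**2]), mpg)

print("linear only:   R^2 =", round(float(r2_lin), 4))
print("with squared:  R^2 =", round(float(r2_quad), 4))
print("coefficients with squared term:", np.round(coef_quad, 4))
```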

Start Like We Are Going To Do Regular Linear Regression

Click Analyze to pull down the menu

Highlight Regression to pop out the side menu

Highlight and Click Linear

Select My Variables

Note the change here is that I entered all the possible independent variables (you can’t see that I also entered Distance Squared).

Set the Regression Method to Step Wise

Check My Options

Click on Options

Note that this controls my significance to enter and remove. The default is set to 95% to enter and 90% to remove.

Set My Plots

I ask for my histograms.

I ask for my residuals to be plotted against the predicted value to search for trends in the residuals.

Click Ok and Out Comes Stuff

We Can See Some Model History

Our first model was MPG as a linear function of outside temperature. It explained about 54% of the observed variation.

The Saga Continues

The next step was to add an effect for distance. The two variables explained 91% of the observed variation.

The Rest of the History

The model next added Age and finally a distance squared term. It appears that none of the variables was removed in a backward step. This just moved forward till all variables were in. In the end we have 93.5% of variation explained.
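The "percent of variation explained" at each step of a model history is the R squared of that model. A minimal Python sketch, assuming NumPy, with invented data and a hypothetical `r_squared` helper; the printed percentages will not match the SPSS output shown on the slides.

```python
import numpy as np

def r_squared(X, y):
    """Fraction of the variation in y explained by a least squares fit on X."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)

# Invented data for illustration only.
rng = np.random.default_rng(4)
n = 50
columns = {
    "temp": rng.uniform(20, 90, n),
    "distance": rng.uniform(1, 40, n),
    "age": rng.uniform(0, 10, n),
}
columns["distance_sq"] = columns["distance"] ** 2
mpg = (5.0 + 0.3 * columns["temp"] + 0.8 * columns["distance"]
       - 0.2 * columns["age"] - 0.01 * columns["distance_sq"]
       + rng.normal(0.0, 1.0, n))

# Add the variables in the same order as the model history on the slides.
order = ["temp", "distance", "age", "distance_sq"]
history = []
for k in range(1, len(order) + 1):
    X = np.column_stack([columns[name] for name in order[:k]])
    history.append(r_squared(X, mpg))
    print(f"{' + '.join(order[:k])}: R^2 = {history[-1]:.3f}")
```

With an intercept in the model, adding a variable can never lower R squared on the same data, which is why the significance tests, not R squared alone, decide what enters.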

Looking at the ANOVA for the Regression Equations

All four regressions were highly significant.

Checking the Significance of Coefficients

We actually knew that none of our variables got bounced out.

Note that every variable is significant at better than the alpha = 5% level.

We Can See Interaction Between Variables as they Enter

Note that the T score for distance dips when distance squared entered (for some reason it appears the values are correlated).
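The correlation between a variable and its own square is easy to check directly. A short Python sketch, assuming NumPy, using an invented set of positive drive distances from 1 to 40:

```python
import numpy as np

# Invented distances: every whole number from 1 to 40.
distance = np.arange(1.0, 41.0)
distance_sq = distance**2

# Pearson correlation between distance and distance squared.
r = np.corrcoef(distance, distance_sq)[0, 1]
print("correlation of distance with distance squared:", round(float(r), 3))
```

When all the distances are positive, the squared values rise along with the original values, so the two columns carry much of the same information and share the credit.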

More Interactions

As the unexplained random variation decreased, the significance of the temperature effect increased steadily.

Our Equation Is

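$$Y = 6.079 \pm 0.279 \cdot \text{temp} \pm 1.113 \cdot \text{dist} \pm 0.088 \cdot \text{age} \pm 0.013 \cdot \text{dist}^2$$

(The plus and minus signs between the terms did not survive in this transcript; each $\pm$ marks a sign that cannot be recovered here.)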

Look at the Significance that Controlled the Order the Variables Came In

To start with, Age had less than 50% significance but distance and distance squared were both strong. Distance had a better T score and entered next.

Next Regression Step

In the next step both Age and Distance Squared were significant at the 5% level, but Age was stronger.

Checking Out Our Residuals

I’ve seen better normal distributions on a cell by cell basis but this doesn’t trigger any immediate concerns. (Remember we do assume our residuals will be normally distributed with a mean of zero around the predicted value.)

Looking at Cumulative Probability

On a cumulative probability chart we do very well in assuming a normal distribution of the error with a mean of 0.

Our Scatter Plot

If there is a trend there I don’t see it (which is exactly what one wants to see after the regression is well done).
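The same residual check can be done numerically. A minimal Python sketch, assuming NumPy, with invented data. For a least squares fit that includes an intercept, the residuals average to zero and have no straight-line correlation with the predicted values by construction, so what the scatter plot is really searching for is curvature or a spread that changes across the plot.

```python
import numpy as np

# Invented data for illustration only.
rng = np.random.default_rng(5)
n = 50
temp = rng.uniform(20, 90, n)
distance = rng.uniform(1, 40, n)
mpg = 5.0 + 0.3 * temp + 0.2 * distance + rng.normal(0.0, 1.0, n)

# Fit the model, then split each observation into prediction and residual.
design = np.column_stack([np.ones(n), temp, distance])
coef, *_ = np.linalg.lstsq(design, mpg, rcond=None)
predicted = design @ coef
residuals = mpg - predicted

print("mean residual:", residuals.mean())
print("correlation of residuals with predicted:",
      np.corrcoef(predicted, residuals)[0, 1])
```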

Summary on Regression

ANOVA works for Category Data

Is a particular category significant?

Ford Escorts are made in 3 plants.

Is there a difference in the mechanical problems rate that depends on which factory built the car?

Plants #1, #2, #3 really have no order except arbitrary.

If I had looked at MPG based on Spring, Summer, Fall, Winter, assigning a numeric value to the seasons would be totally arbitrary.

Category Data lends itself poorly to regression.

So When Should I Choose Regression?

Continuous quantitative variables

I could break my drivers’ ages into groups but the break points would be arbitrary.

This little artifact is one of the reasons two car insurance companies can look at the same regional risk for drivers in an area and yet quote different rates for the same coverage.

Creating categories out of continuous data can cause some weird effects.

Regression tends to work better for continuously distributed quantitative data.

It also provides predictive models as opposed to category means.