Multiple Regression
©2005 Dr. B. C. Paul
Problems with Regression So Far
So far we have only been able to consider one controlling factor at a time.
Everything else has gone into the sum of squares for error.
With our MPG data we suspect other factors are involved.
Looking at Other Variables
Let's try to plot MPG against outside temperature.
Click on Graphs.
Highlight Interactive to bring up the side menu.
Highlight and click on Scatterplot.
Setting the Plot
Move your Y and X axis variables into position.
This interface requires you to use a drag-and-drop method rather than clicking on arrows.
When you are done, click OK.
Up Comes the Plot
There appears to be evidence that MPG improves as the outside temperature increases.
Ordering a Model
We would like to have a model that includes more than one factor at a time.
Such a model exists:
Y = B0 + B1*Distance + B2*Temp + B3*Age
Function of the Model
Works by least squared error.
The objective remains to pick the coefficients so that the average squared error between the model and the data points is minimized.
Again we will skip any derivations or explanations of how this is done; for us, we'll push the right SPSS buttons.
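To make "pick the coefficients so the squared error is minimized" concrete, here is a minimal sketch in Python with numpy. The data, variable names, and coefficient values are all made up for illustration; SPSS does this fit (plus the significance bookkeeping) for you.

```python
import numpy as np

# Made-up data following the model Y = B0 + B1*Distance + B2*Temp + B3*Age.
rng = np.random.default_rng(0)
n = 50
distance = rng.uniform(1, 30, n)
temp = rng.uniform(20, 90, n)
age = rng.uniform(1, 10, n)
mpg = 6.0 + 1.1 * distance + 0.28 * temp + 0.09 * age + rng.normal(0, 1.0, n)

# Design matrix with an intercept column; lstsq finds the coefficient
# vector minimizing the sum of squared errors ||X @ b - mpg||^2.
X = np.column_stack([np.ones(n), distance, temp, age])
b, *_ = np.linalg.lstsq(X, mpg, rcond=None)
# b holds [B0, B1, B2, B3] and should land near the values used above.
```

With fifty points and modest noise, the fitted coefficients come out close to the ones the data were generated with.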
How Do We Deal with Significance?
We have seen that any coefficient in a model can be analyzed individually to be certain that it is not really zero (if it is zero, that term is not even in the model).
The trick! The significance of a coefficient depends on how much of the total variation the model explains, and how much of the credit for that is going to other variables.
It makes a difference what is in the model.
Example: for MPG = f(distance driven), the linear regression was significant.
When a quadratic term was added, the regression fit improved, but neither term, including the linear one, was significant.
Method of Variable Entry
The Decree Method: I can tell SPSS I want it to do a regression with such and such a variable.
SPSS will do the best regression it can and then show me the ANOVA table.
I can look and see whether I believe my coefficients are strong enough for me to sign off on.
Forward Regression
The computer will look at the variables available and try a linear regression on each one.
If a variable comes up at 95% significance, it becomes a candidate to enter.
The computer will get the significance of each variable, and if several are over 95% it will pick the best.
The computer will then look at the remaining variables to explain the residuals, trying each variable and checking its significance in explaining the residuals.
It looks for variables over 95% significant and then chooses the best.
Forward Regression Continued
The process of variables entering continues until all variables have been entered or no more variables are significant.
95% is the "default" significance to enter; we can reset it to a different significance level.
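The forward procedure above can be sketched in a few lines of Python with numpy. This is an illustrative toy, not SPSS's implementation: SPSS tests an exact significance-to-enter, while here a fixed partial-F threshold of about 4 only roughly mimics 95% significance; all data and variable names are invented.

```python
import numpy as np

def sse(X, y):
    """Sum of squared errors from a least-squares fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ b) ** 2))

def forward_select(y, candidates, f_in=4.0):
    """Forward regression sketch: repeatedly enter the candidate with the
    largest partial F-to-enter, stopping when none clears f_in."""
    n = len(y)
    X = np.ones((n, 1))            # start with an intercept-only model
    remaining = dict(candidates)   # name -> data column
    entered = []
    while remaining:
        base = sse(X, y)
        scores = {}
        for name, col in remaining.items():
            trial = np.column_stack([X, col])
            s = sse(trial, y)
            scores[name] = (base - s) / (s / (n - trial.shape[1]))
        best = max(scores, key=scores.get)
        if scores[best] < f_in:
            break                  # nothing left is significant enough
        X = np.column_stack([X, remaining.pop(best)])
        entered.append(best)
    return entered

# Made-up MPG data: temperature really matters, "junk" is pure noise.
rng = np.random.default_rng(0)
temp = rng.uniform(20, 90, 40)
junk = rng.uniform(0, 1, 40)
mpg = 10 + 0.3 * temp + rng.normal(0, 1.0, 40)
order = forward_select(mpg, {"temp": temp, "junk": junk})
```

On this synthetic data the genuinely related variable enters first; the noise column may or may not clear the threshold, which is exactly the gamble a significance-to-enter rule takes.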
Backward Regression
The computer starts by doing a regression with all possible variables in the equation.
It then does a T test on each coefficient to see if the coefficient might be zero.
Any variable that falls below 75% significance is removed from the equation (75% is a default that you can reset). The T tests are then repeated.
The process repeats until no more variables can be thrown out of the equation.
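The backward procedure can be sketched the same way. Again this is a toy with made-up data: a fixed |T| threshold of about 1.15 loosely mimics a 75%-significance-to-stay rule, whereas SPSS computes exact p-values.

```python
import numpy as np

def backward_eliminate(y, variables, t_min=1.15):
    """Backward regression sketch: start with everything in the equation,
    then repeatedly drop the variable with the weakest |T| score until
    every survivor exceeds t_min."""
    names = list(variables)
    n = len(y)
    while names:
        X = np.column_stack([np.ones(n)] + [variables[k] for k in names])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        mse = float(resid @ resid) / (n - X.shape[1])
        cov = mse * np.linalg.inv(X.T @ X)
        t = np.abs(b / np.sqrt(np.diag(cov)))[1:]   # skip the intercept
        weakest = int(np.argmin(t))
        if t[weakest] >= t_min:
            break            # everyone left passes the T test
        names.pop(weakest)   # throw the weakest variable out and refit
    return names

# Made-up MPG data: temperature matters, "junk" is pure noise.
rng = np.random.default_rng(2)
temp = rng.uniform(20, 90, 60)
junk = rng.uniform(0, 1, 60)
mpg = 10 + 0.3 * temp + rng.normal(0, 1.0, 60)
kept = backward_eliminate(mpg, {"temp": temp, "junk": junk})
```

The strongly related variable survives every pass; the noise column is usually, but not always, thrown out, which is why the loose 75% default exists.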
Step Wise Regression
Starts out like Forward Regression and moves forward until two variables are in the equation.
Now the computer does T tests on all the variables in the equation, like a Backward Regression, to see if any should be thrown out.
If not, it goes into another forward step. After the forward step it checks with T tests on all variables in the equation.
It continues the back-and-forth process until nothing changes (or it goes into an infinite loop).
What are the chances that the methods will give you the same answer? About zero.
Some smaller, easier data sets will converge for all methods, but larger sets usually do not yield the same answer.
Which is right? Maybe it's a dumb question; the method does influence the answer.
It may be more important that you carefully make sure you have a good, defensible method.
(Maybe it's the teacher's favorite answer – Step Wise.)
Let's Try It
I added a variable for Distance Squared.
Multilinear regression can only consider linear effects of a variable, but I can trick it by creating a non-linear variable.
In my case I still think I saw the MPG bending over as the drive distance increased (logical, because the engine warmed up).
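A minimal numpy sketch of this trick, on made-up data shaped like the MPG story (rising with distance, then bending over): the model is still linear in its coefficients, the curvature just rides in through a derived distance-squared column.

```python
import numpy as np

# The "trick": the linear machinery only fits linear effects, so make
# the curvature look linear by adding a derived column distance**2.
# Numbers below are invented for illustration.
rng = np.random.default_rng(1)
distance = rng.uniform(1, 30, 40)
mpg = 5 + 1.2 * distance - 0.03 * distance**2 + rng.normal(0, 0.5, 40)

X = np.column_stack([np.ones_like(distance), distance, distance**2])
b, *_ = np.linalg.lstsq(X, mpg, rcond=None)
# b[2] should recover the negative curvature term.
```

The fit recovers a negative coefficient on the squared term, which is the "bending over" effect.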
Start Like We Are Going To Do Regular Linear Regression
Click Analyze to pull down the menu.
Highlight Regression to pop out the side menu.
Highlight and click Linear.
Select My Variables
Note the change here is that I entered all the possible independent variables (you can't see that I also entered Distance Squared).
Set the Regression Method to Step Wise
Check My Options
Click on Options
Note that this controls my significance to enter and remove. The default is set to 95% to enter and 90% to remove.
Set My Plots
I ask for my histograms.
I ask for my residuals to be plotted against the predicted value to search for trends in the residuals.
Click Ok and Out Comes Stuff
We Can See Some Model History
Our first model was MPG as a linear function of outside temperature. It explained about 54% of the observed variation.
The Saga Continues
The next step was to add an effect for distance. The two variables explained 91% of the observed variation.
The Rest of the History
The model next added Age and finally a Distance Squared term. It appears that none of the variables was removed in a backwards step; this just moved forward until all variables were in. In the end we have 93.5% of variation explained.
Looking at the ANOVA for the Regression Equations
All four regressions were highly significant.
Checking the Significance of Coefficients
We actually knew that none of our variables got bounced out.
Note that every variable is significant at better than the alpha = 5% level.
We Can See Interaction Between Variables as they Enter
Note that the T score for distance dips when Distance Squared entered (for some reason it appears the values are correlated).
More Interactions
As the unexplained random variation decreased, the significance of the temperature effect increased steadily.
Our Equation Is
Y = 6.079 + 0.279*temp + 1.113*dist + 0.088*age - 0.013*dist^2
Look at the Significance that Controlled the Order the Variables Came In
To start with, Age had less than 50% significance, but distance and Distance Squared were both strong. Distance had a better T score and entered next.
Next Regression Step
In the next step both Age and Distance Squared were significant at the 5% level, but Age was stronger.
Checking Out Our Residuals
I've seen better normal distributions on a cell-by-cell basis, but this doesn't trigger any immediate concerns. (Remember we do assume our residuals will be normally distributed with a mean of zero around the predicted value.)
Looking at Cumulative Probability
On a cumulative value chart we do very well in assuming a normal distribution of the error with a mean of 0.
Our Scatter Plot
If there is a trend there, I don't see it (which is exactly what one wants to see after the regression is well done).
Summary on Regression
ANOVA works for category data: is a particular category significant?
Ford Escorts are made in 3 plants. Is there a difference in the mechanical problem rate that depends on which factory built the car? Plants #1, #2, and #3 really have no order except an arbitrary one.
If I had looked at MPG based on Spring, Summer, Fall, and Winter, assigning a numeric value to the seasons would be totally arbitrary.
Category data lends itself poorly to regression.
So When Should I Choose Regression?
Continuous quantitative variables.
I could break my drivers' ages into groups, but the break points would be arbitrary.
This little artifact is one of the reasons two car insurance companies can look at the same regional risk for drivers in an area and yet quote different rates for the same coverage.
Creating categories out of continuous data can cause some weird effects.
Regression tends to work better for continuously distributed quantitative data.
It also provides predictive models as opposed to category means.