Multiple Regression
©2005 Dr. B. C. Paul
Problems with Regression So Far
So far we have only been able to consider one controlling factor at a time.
Everything else has gone into the sum of squares for error.
With our MPG data we suspect other factors are involved.
Looking at Other Variables
Let's try to plot MPG against outside temperature.
Click on Graphs.
Highlight Interactive to bring up the side menu.
Highlight and click on Scatterplot.
Setting the Plot
Move your Y and X axis variables into position.
This interface requires you to use a drag-and-drop method rather than clicking on arrows.
When you are done, click OK.
Up Comes the Plot
There appears to be evidence that MPG improves as the outside temperature increases.
Ordering a Model
We would like to have a model that includes more than one factor at a time.
Such a model exists:
Y = B0 + B1*Distance + B2*Temp + B3*Age
Function of the Model
Works by least squared error.
The objective remains to pick the coefficients so that the average squared error between the model and the data points is minimized.
Again we will skip any derivations or explanations of how this is done; for us, we'll push the right SPSS buttons.
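To make "pick the coefficients so the squared error is minimized" concrete, here is a minimal sketch in Python with numpy. The data, variable names, and coefficient values are all made up for illustration; SPSS does this fit (plus the significance bookkeeping) for you.

```python
import numpy as np

# Made-up data following the model Y = B0 + B1*Distance + B2*Temp + B3*Age.
rng = np.random.default_rng(0)
n = 50
distance = rng.uniform(1, 30, n)
temp = rng.uniform(20, 90, n)
age = rng.uniform(1, 10, n)
mpg = 6.0 + 1.1 * distance + 0.28 * temp + 0.09 * age + rng.normal(0, 1.0, n)

# Design matrix with an intercept column; lstsq finds the coefficient
# vector minimizing the sum of squared errors ||X @ b - mpg||^2.
X = np.column_stack([np.ones(n), distance, temp, age])
b, *_ = np.linalg.lstsq(X, mpg, rcond=None)
# b holds [B0, B1, B2, B3] and should land near the values used above.
```

With fifty points and modest noise, the fitted coefficients come out close to the ones the data were generated with.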
How Do We Deal with Significance?
We have seen that any coefficient in a model can be analyzed individually to be certain that it is not really zero (if it is zero, that term is not even in the model).
The trick! The significance of a coefficient depends on how much of the total variation the model explains, and how much of the credit for that is going to other variables.
It makes a difference what is in the model.
Example: for MPG = f(distance driven), the linear regression was significant.
When a quadratic term was added, the regression fit improved, but neither term, including the linear one, was significant.
Method of Variable Entry
The Decree Method: I can tell SPSS I want it to do a regression with such and such a variable.
SPSS will do the best regression it can and then show me the ANOVA table.
I can look and see whether I believe my coefficients are strong enough for me to sign off on.
Forward Regression
The computer will look at the variables available and try a linear regression on each one.
If a variable comes up at 95% significance, it becomes a candidate to enter.
The computer will get the significance of each variable, and if several are over 95% it will pick the best.
The computer will then look at the remaining variables to explain the residuals, trying each variable and checking its significance in explaining the residuals.
It looks for variables over 95% significant and then chooses the best.
Forward Regression Continued
The process of variables entering continues until all variables have been entered or no more variables are significant.
95% is the "default" significance to enter; we can reset it to a different significance level.
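The forward procedure above can be sketched in a few lines of Python with numpy. This is an illustrative toy, not SPSS's implementation: SPSS tests an exact significance-to-enter, while here a fixed partial-F threshold of about 4 only roughly mimics 95% significance; all data and variable names are invented.

```python
import numpy as np

def sse(X, y):
    """Sum of squared errors from a least-squares fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ b) ** 2))

def forward_select(y, candidates, f_in=4.0):
    """Forward regression sketch: repeatedly enter the candidate with the
    largest partial F-to-enter, stopping when none clears f_in."""
    n = len(y)
    X = np.ones((n, 1))            # start with an intercept-only model
    remaining = dict(candidates)   # name -> data column
    entered = []
    while remaining:
        base = sse(X, y)
        scores = {}
        for name, col in remaining.items():
            trial = np.column_stack([X, col])
            s = sse(trial, y)
            scores[name] = (base - s) / (s / (n - trial.shape[1]))
        best = max(scores, key=scores.get)
        if scores[best] < f_in:
            break                  # nothing left is significant enough
        X = np.column_stack([X, remaining.pop(best)])
        entered.append(best)
    return entered

# Made-up MPG data: temperature really matters, "junk" is pure noise.
rng = np.random.default_rng(0)
temp = rng.uniform(20, 90, 40)
junk = rng.uniform(0, 1, 40)
mpg = 10 + 0.3 * temp + rng.normal(0, 1.0, 40)
order = forward_select(mpg, {"temp": temp, "junk": junk})
```

On this synthetic data the genuinely related variable enters first; the noise column may or may not clear the threshold, which is exactly the gamble a significance-to-enter rule takes.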
Backward Regression
The computer starts by doing a regression with all possible variables in the equation.
It then does a T test on each coefficient to see if the coefficient might be zero.
Any variable that falls below 75% significance is removed from the equation (75% is a default that you can reset). The T tests are then repeated.
The process repeats until no more variables can be thrown out of the equation.
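The backward procedure can be sketched the same way. Again this is a toy with made-up data: a fixed |T| threshold of about 1.15 loosely mimics a 75%-significance-to-stay rule, whereas SPSS computes exact p-values.

```python
import numpy as np

def backward_eliminate(y, variables, t_min=1.15):
    """Backward regression sketch: start with everything in the equation,
    then repeatedly drop the variable with the weakest |T| score until
    every survivor exceeds t_min."""
    names = list(variables)
    n = len(y)
    while names:
        X = np.column_stack([np.ones(n)] + [variables[k] for k in names])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        mse = float(resid @ resid) / (n - X.shape[1])
        cov = mse * np.linalg.inv(X.T @ X)
        t = np.abs(b / np.sqrt(np.diag(cov)))[1:]   # skip the intercept
        weakest = int(np.argmin(t))
        if t[weakest] >= t_min:
            break            # everyone left passes the T test
        names.pop(weakest)   # throw the weakest variable out and refit
    return names

# Made-up MPG data: temperature matters, "junk" is pure noise.
rng = np.random.default_rng(2)
temp = rng.uniform(20, 90, 60)
junk = rng.uniform(0, 1, 60)
mpg = 10 + 0.3 * temp + rng.normal(0, 1.0, 60)
kept = backward_eliminate(mpg, {"temp": temp, "junk": junk})
```

The strongly related variable survives every pass; the noise column is usually, but not always, thrown out, which is why the loose 75% default exists.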
Step Wise Regression
Starts out like Forward Regression and moves forward until two variables are in the equation.
Now the computer does T tests on all the variables in the equation, like a Backward Regression, to see if any should be thrown out.
If not, it goes into another forward step. After the forward step it checks with T tests on all variables in the equation.
It continues the back-and-forth process until nothing changes (or it goes into an infinite loop).
What are the chances that the methods will give you the same answer? About zero.
Some smaller, easier data sets will converge for all methods, but larger sets usually do not yield the same answer.
Which is right? Maybe it's a dumb question; the method does influence the answer.
It may be more important that you carefully make sure you have a good, defensible method.
(Maybe it's the teacher's favorite answer – Step Wise.)
Let's Try It
I added a variable for Distance Squared.
Multilinear regression can only consider linear effects of a variable, but I can trick it by creating a non-linear variable.
In my case I still think I saw the MPG bending over as the drive distance increased (logical, because the engine warmed up).
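A minimal numpy sketch of this trick, on made-up data shaped like the MPG story (rising with distance, then bending over): the model is still linear in its coefficients, the curvature just rides in through a derived distance-squared column.

```python
import numpy as np

# The "trick": the linear machinery only fits linear effects, so make
# the curvature look linear by adding a derived column distance**2.
# Numbers below are invented for illustration.
rng = np.random.default_rng(1)
distance = rng.uniform(1, 30, 40)
mpg = 5 + 1.2 * distance - 0.03 * distance**2 + rng.normal(0, 0.5, 40)

X = np.column_stack([np.ones_like(distance), distance, distance**2])
b, *_ = np.linalg.lstsq(X, mpg, rcond=None)
# b[2] should recover the negative curvature term.
```

The fit recovers a negative coefficient on the squared term, which is the "bending over" effect.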
Start Like We Are Going To Do Regular Linear Regression
Click Analyze to pull down the menu.
Highlight Regression to pop out the side menu.
Highlight and click Linear.
Select My Variables
Note the change here is that I entered all the possible independent variables (you can't see that I also entered Distance Squared).
Set the Regression Method to Step Wise
Check My Options
Click on Options
Note that this controls my significance to enter and remove. The default is set to 95% to enter and 90% to remove.
Set My Plots
I ask for my histograms.
I ask for my residuals to be plotted against the predicted value to search for trends in the residuals.
Click Ok and Out Comes Stuff
We Can See Some Model History
Our first model was MPG as a linear function of outside temperature. It explained about 54% of the observed variation.
The Saga Continues
The next step was to add an effect for distance. The two variables explained 91% of the observed variation.
The Rest of the History
The model next added Age and finally a Distance Squared term. It appears that none of the variables was removed in a backwards step; this just moved forward until all variables were in. In the end we have 93.5% of variation explained.
Looking at the ANOVA for the Regression Equations
All four regressions were highly significant.
Checking the Significance of Coefficients
We actually knew that none of our variables got bounced out.
Note that every variable is significant at better than the alpha = 5% level.
We Can See Interaction Between Variables as they Enter
Note that the T score for distance dips when Distance Squared entered (for some reason it appears the values are correlated).
More Interactions
As the unexplained random variation decreased, the significance of the temperature effect increased steadily.
Our Equation Is
Y = 6.079 + 0.279*temp + 1.113*dist + 0.088*age - 0.013*dist^2
Look at the Significance that Controlled the Order the Variables Came In
To start with, Age had less than 50% significance, but distance and Distance Squared were both strong. Distance had a better T score and entered next.
Next Regression Step
In the next step both Age and Distance Squared were significant at the 5% level, but Age was stronger.
Checking Out Our Residuals
I've seen better normal distributions on a cell-by-cell basis, but this doesn't trigger any immediate concerns. (Remember we do assume our residuals will be normally distributed with a mean of zero around the predicted value.)
Looking at Cumulative Probability
On a cumulative value chart we do very well in assuming a normal distribution of the error with a mean of 0.
Our Scatter Plot
If there is a trend there, I don't see it (which is exactly what one wants to see after the regression is well done).
Summary on Regression
ANOVA works for category data: is a particular category significant?
Ford Escorts are made in 3 plants. Is there a difference in the mechanical problem rate that depends on which factory built the car? Plants #1, #2, and #3 really have no order except an arbitrary one.
If I had looked at MPG based on Spring, Summer, Fall, and Winter, assigning a numeric value to the seasons would be totally arbitrary.
Category data lends itself poorly to regression.
So When Should I Choose Regression?
Continuous quantitative variables.
I could break my drivers' ages into groups, but the break points would be arbitrary.
This little artifact is one of the reasons two car insurance companies can look at the same regional risk for drivers in an area and yet quote different rates for the same coverage.
Creating categories out of continuous data can cause some weird effects.
Regression tends to work better for continuously distributed quantitative data.
It also provides predictive models as opposed to category means.