lecture7a applied econometrics and economic modeling

Simple Regression Scatterplots: Graphing Relationships

Upload: stone55

Post on 25-May-2015


DESCRIPTION

Applied Econometrics and Economic Modeling

TRANSCRIPT

Page 1: Lecture7a Applied Econometrics and Economic Modeling

Simple Regression

Scatterplots: Graphing Relationships

Page 2: Lecture7a Applied Econometrics and Economic Modeling

13.2 | 13.1a | 13.2a | 13.2b | 13.3 | 13.3a | 13.4 | 13.3b | 13.5 | 13.6

Background Information

Pharmex is a chain of drugstores that operates around the country.

To see how effective their advertising and other promotional activities are, the company has collected data from 50 randomly selected metropolitan regions.

In each region it has compared its own promotional expenditures and sales to those of the leading competitor in the region over the past year.

Page 3: Lecture7a Applied Econometrics and Economic Modeling

Background Information -- continued

There are two variables, each of which is an index, not a dollar amount.

– Promote: Pharmex’s promotional expenditures as a percentage of those of the leading competitor

– Sales: Pharmex’s sales as a percentage of those of the leading competitor

The company expects that there is a positive relationship between the two variables, so that regions with relatively more expenditures have relatively more sales. However, it is not clear what the nature of this relationship is.

Page 4: Lecture7a Applied Econometrics and Economic Modeling

PHARMEX.XLS

The data are listed in this file. Here is a partial listing.

What type of relationship, if any, is apparent in a scatterplot?

Page 5: Lecture7a Applied Econometrics and Economic Modeling

Creating the Scatterplot

In preparing to create the scatterplot we must decide which variable should be on the horizontal axis.

In regression analysis, we always put the explanatory variable on the horizontal axis and the response variable on the vertical axis.

In this example the company tends to believe that larger promotional expenditures “cause” larger values of sales, so we put Sales on the vertical axis and Promote on the horizontal axis.

Page 6: Lecture7a Applied Econometrics and Economic Modeling

Creating the Scatterplot -- continued

We create the following scatterplot using StatPro’s Scatterplot procedure.

Page 7: Lecture7a Applied Econometrics and Economic Modeling

Interpretation

The scatterplot indicates that there is a positive relationship between Promote and Sales - the points tend to rise from bottom left to top right - but the relationship is not perfect.

The correlation of 0.673 is shown automatically on the plot. The important things to note about the correlation are that it is positive and that its magnitude is moderately large.
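As a rough illustration of how a correlation like the one on the plot is computed, here is a minimal Python sketch. The function name and the data values are hypothetical; the numbers are made up for illustration and are not the Pharmex data, so the result will not be 0.673.

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical index values (NOT the Pharmex data)
promote = [80, 90, 95, 100, 105, 110, 120]
sales = [92, 88, 101, 97, 110, 104, 115]

r = correlation(promote, sales)
print(f"correlation = {r:.3f}")
```

A value between 0 and 1 corresponds to points that tend to rise from bottom left to top right without lying exactly on a line.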

Page 8: Lecture7a Applied Econometrics and Economic Modeling

Causation

Unless the data are obtained in a carefully controlled experiment - not the case here - we can never make definitive statements about causation in regression analysis.

The reason for this is that we can almost never rule out the possibility that some other variable is causing the variation in both of the observed variables.

Page 9: Lecture7a Applied Econometrics and Economic Modeling

Background Information

In Example 13.1 we created scatterplots for Pharmex.

We found that there was a positive but not perfect relationship between Promote and Sales.

We now want to find the least squares line for the Pharmex drugstore data, using Sales as the response variable and Promote as the explanatory variable.

Page 10: Lecture7a Applied Econometrics and Economic Modeling

PHARMEX.XLS

The data are listed in this file. Here is a partial listing.

Page 11: Lecture7a Applied Econometrics and Economic Modeling

Least Squares Estimation

Since there are hints of a linear relationship between the two variables we can draw a line through the points to produce a reasonably good fit.

However, we need to proceed systematically and not just randomly draw lines. We must choose the line that makes the vertical distances from the points to the line as small as possible.

The fitted value is the vertical distance from the horizontal axis to the line and the residual is the vertical distance from the line to the point.
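In symbols, these two definitions can be written as follows. The notation is introduced here for illustration and does not appear on the slides: $x_i$ and $y_i$ are the Promote and Sales values for region $i$, and $a$ and $b$ are the intercept and slope of whatever line is drawn.

```latex
\hat{y}_i = a + b\,x_i
\qquad\text{and}\qquad
e_i = y_i - \hat{y}_i
```

Here $\hat{y}_i$ is the fitted value and $e_i$ is the residual, which is positive when the point lies above the line and negative when it lies below.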

Page 12: Lecture7a Applied Econometrics and Economic Modeling

Least Squares Estimation -- continued

The idea is simple. By using a straight line to reflect the relationship between Promote and Sales, we expect a given Sales to be at the height of the line above any particular value of Promote. That is, we expect Sales to equal the fitted value.

But the relationship is not perfect. Not all points lie exactly on the line. The differences are the residuals. They show how much the observed values differ from the fitted values.

Page 13: Lecture7a Applied Econometrics and Economic Modeling

Least Squares Estimation -- continued

We can now explain how to choose the “best fitting” line through the points in the scatterplot. We choose the line with the smallest sum of the squared residuals. This line is called the least squares line.

Most statistical packages perform the calculations to find this line, so we need not be concerned with the technical details or with hand calculation.
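For readers curious about what such a package does internally, the sketch below computes a least squares line with the standard closed-form formulas and checks that nudging the line makes the sum of squared residuals larger. The function names are hypothetical and the data are made-up values, not the Pharmex data; this is not StatPro's implementation.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical index values (NOT the Pharmex data)
promote = [80, 90, 95, 100, 105, 110, 120]
sales = [92, 88, 101, 97, 110, 104, 115]

a, b = least_squares_line(promote, sales)
best = sum_squared_residuals(promote, sales, a, b)
# Lines with a slightly different slope or intercept fit worse
tilted = sum_squared_residuals(promote, sales, a, b + 0.1)
shifted = sum_squared_residuals(promote, sales, a + 1.0, b)
print(f"intercept={a:.3f} slope={b:.3f}")
print(f"SSR best={best:.2f} tilted={tilted:.2f} shifted={shifted:.2f}")
```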

Page 14: Lecture7a Applied Econometrics and Economic Modeling

Finding the Least Squares Line with StatPro

We use the StatPro/Regression Analysis/Simple menu item.

After specifying that Sales is the response (dependent) variable and that Promote is the explanatory (independent) variable, we see the dialog box for scatterplot options as seen here.

Page 15: Lecture7a Applied Econometrics and Economic Modeling

Finding the Least Squares Line with StatPro -- continued

This gives us the option of creating several scatterplots involving the fitted values and residuals.

The regression output includes three parts. The first two are a list of fitted values and residuals, placed in columns next to the data set, and any scatterplots selected from the dialog box.

The third part of the output is the most important. It is shown on the next slide.

Page 16: Lecture7a Applied Econometrics and Economic Modeling

Regression Output Table

Page 17: Lecture7a Applied Econometrics and Economic Modeling

The Regression Output

We will eventually learn what all the output in the table means but for now we will concentrate on a small part.

Specifically we find the intercept and slope of the least squares line under the Coefficient label in cells C16 and C17.

They imply that the equation for the least squares line is

Predicted Sales = 25.1264 + 0.7623 Promote

Page 18: Lecture7a Applied Econometrics and Economic Modeling

Least Squares Line Equation

We can interpret this equation as follows.

– The slope 0.7623 indicates that the sales index tends to increase by about 0.76 for each unit increase in the promotional expenses index.

– The interpretation of the intercept is less important. It is literally the predicted sales index for a region that does no promotions.

For instances like this when the range of observed explanatory variable values does not include 0, it is best to think of the intercept as an “anchor” for the least squares line.
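The slope interpretation can be checked with a few lines of Python that plug values into the fitted equation from the regression output. The function name is hypothetical, and the Promote values of 100 and 101 are chosen only for illustration (an index of 100 would mean Pharmex spends the same as its leading competitor); whether they fall inside the observed range is not stated in the slides.

```python
INTERCEPT = 25.1264  # intercept quoted in the regression output
SLOPE = 0.7623       # slope quoted in the regression output

def predicted_sales(promote):
    """Predicted sales index for a given promotional expenditure index."""
    return INTERCEPT + SLOPE * promote

p100 = predicted_sales(100)
p101 = predicted_sales(101)
print(round(p100, 4))         # 101.3564
print(round(p101 - p100, 4))  # 0.7623, the slope
```

A one-unit increase in Promote changes the prediction by exactly the slope, whatever the starting value.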

Page 19: Lecture7a Applied Econometrics and Economic Modeling

The Scatterplot

A useful graph in almost any regression analysis is a scatterplot of residuals (on the vertical axis) versus fitted values.

The scatterplot for this data appears on the following slide.

We typically examine the scatterplot for striking patterns.

A “good” fit not only has small residuals, but it has residuals scattered randomly around 0 with no apparent pattern. This is the case here.
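The sketch below shows how the two columns behind such a plot are produced, using made-up data (not the Pharmex data) and a hypothetical helper name. It also checks two properties that hold automatically for any least squares line with an intercept: the residuals sum to zero and are uncorrelated with the fitted values. Because of this, the plot is examined for other patterns, such as curvature or a fan shape.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical index values (NOT the Pharmex data)
promote = [80, 90, 95, 100, 105, 110, 120]
sales = [92, 88, 101, 97, 110, 104, 115]

a, b = fit_line(promote, sales)
fitted = [a + b * xi for xi in promote]
residuals = [yi - fi for yi, fi in zip(sales, fitted)]

# These two columns are what the residuals-versus-fitted plot displays
for f, e in zip(fitted, residuals):
    print(f"fitted={f:7.2f}  residual={e:6.2f}")

# Properties guaranteed by least squares (up to rounding error)
total = sum(residuals)
mean_fitted = sum(fitted) / len(fitted)
cross = sum((f - mean_fitted) * e for f, e in zip(fitted, residuals))
print(f"sum of residuals = {total:.1e}, cross-product with fitted = {cross:.1e}")
```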

Page 20: Lecture7a Applied Econometrics and Economic Modeling

The Scatterplot of Residuals versus Fitted Values for Pharmex

Page 21: Lecture7a Applied Econometrics and Economic Modeling

Scatterplots: Graphing Relationships

Page 22: Lecture7a Applied Econometrics and Economic Modeling

Background Information

The Bendrix Company manufactures various types of parts for automobiles.

The manager of the factory wants to get a better understanding of overhead costs.

These overhead costs include supervision, indirect labor, supplies, payroll taxes, overtime premiums, depreciation, and a number of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses.

Page 23: Lecture7a Applied Econometrics and Economic Modeling

Background Information -- continued

Some of the overhead costs are “fixed” in the sense that they do not vary appreciably with the volume of work being done, whereas others are “variable” and do vary directly with the volume of work being done.

It is not easy to draw a clear line between the fixed and variable overhead components.

The Bendrix manager has tracked total overhead costs for 36 months.

Page 24: Lecture7a Applied Econometrics and Economic Modeling

Background Information -- continued

To help explain these costs, he also collected data on two variables that are related to the amount of work done at the factory. These variables are:

– MachHrs: number of machine hours used during the month

– ProdRuns: the number of separate production runs during the month

• To understand this variable we must know that Bendrix manufactures parts in fairly large batches called production runs. Between each run there is a downtime.

The manager believes both of these variables might be responsible for variations in overhead costs. Do scatterplots support his belief?

Page 25: Lecture7a Applied Econometrics and Economic Modeling

BENDRIX1.XLS

The data collected by the manager appear in this file.

Each observation (row) corresponds to a single month.

We want to investigate any possible relationship between the Overhead variable and the MachHrs and ProdRuns variables, but because these are time series variables, we should also look out for relationships between these variables and the Month variable.

Page 26: Lecture7a Applied Econometrics and Economic Modeling

The Scatterplots

This data set illustrates, even with the modest number of variables, how the number of potentially useful scatterplots can grow quickly.

At the least, we need to look at the scatterplots between each potential explanatory variable (MachHrs and ProdRuns) and the response variable (Overhead).

These scatterplots are as follows:

Page 27: Lecture7a Applied Econometrics and Economic Modeling

Scatterplot of Overhead versus Machine Hours

Page 28: Lecture7a Applied Econometrics and Economic Modeling

Scatterplot of Overhead versus Production Runs

Page 29: Lecture7a Applied Econometrics and Economic Modeling

The Scatterplots -- continued

To check for possible time series patterns we can also create a time series plot for any of the variables. This is equivalent to a scatterplot of the variable versus the Month, with the points joined by lines.

One of these is the time series plot for Overhead. The plot is shown next and it shows a fairly random pattern through time, with no apparent upward trend or other obvious time series pattern.

We can check that the MachHrs and ProdRuns also indicate no obvious pattern.

Page 30: Lecture7a Applied Econometrics and Economic Modeling

Time Series Plot of Overhead versus Month

Page 31: Lecture7a Applied Econometrics and Economic Modeling

The Scatterplots -- continued

Finally, when multiple explanatory variables exist we can check for relationships between them. The scatterplot of MachHrs versus ProdRuns is a cloud of points that indicates no relationship worth pursuing.

Page 32: Lecture7a Applied Econometrics and Economic Modeling

In Summary

The Bendrix manager should continue to explore the positive relationship between Overhead and each of the MachHrs and ProdRuns variables.

However, none of the variables appear to have any time series behavior, and the two potential explanatory variables do not appear to be related to each other.

Page 33: Lecture7a Applied Econometrics and Economic Modeling

Simple Linear Regression

Page 34: Lecture7a Applied Econometrics and Economic Modeling

Background Information

In Example 13.2 we created scatterplots for Bendrix.

We found that there was a positive relationship between Overhead and each of the MachHrs and ProdRuns variables.

However, none of the variables appear to have any time series behavior, and the two potential explanatory variables of MachHrs and ProdRuns do not appear to be related to each other.

Page 35: Lecture7a Applied Econometrics and Economic Modeling

BENDRIX1.XLS

The data collected by the manager appear in this file.

The Bendrix manufacturing data set has two explanatory variables, MachHrs and ProdRuns.

Eventually we will estimate a regression equation with both of the variables included.

However, if we include only one at a time, what do they tell us about the overhead costs?

Page 36: Lecture7a Applied Econometrics and Economic Modeling

Regression Output for Overhead versus MachHrs

Page 37: Lecture7a Applied Econometrics and Economic Modeling

Regression Output for Overhead versus ProdRuns

Page 38: Lecture7a Applied Econometrics and Economic Modeling

Least Squares Line Equations

The two least squares lines are therefore

Predicted Overhead = 48,621 + 34.7 MachHrs

and

Predicted Overhead = 75,606 + 655.1 ProdRuns

Clearly these two equations are quite different, although each effectively breaks Overhead into a fixed component and a variable component.

The equations imply that expected overhead increases by about $35 for each extra machine hour and about $655 for each extra production run.

Page 39: Lecture7a Applied Econometrics and Economic Modeling

Least Squares Line Equations -- continued

The differences between these two lines can be attributed to neither one telling the whole story.

If the manager’s goal is to split overhead into a fixed and variable component, then the variable component should include both of the measures of work activity to give a more complete explanation of overhead.

We will see how this can be done using multiple regression at a later time.

Page 40: Lecture7a Applied Econometrics and Economic Modeling

Multiple Regression

Page 41: Lecture7a Applied Econometrics and Economic Modeling

Background Information

In Example 13.2 we created scatterplots for Bendrix, and in Example 13.2a we determined that the variable component of overhead must include both MachHrs and ProdRuns.

We found that there was a positive relationship between Overhead and each of the MachHrs and ProdRuns variables.

However, none of the variables appear to have any time series behavior, and the two potential explanatory variables of MachHrs and ProdRuns do not appear to be related to each other.

Page 42: Lecture7a Applied Econometrics and Economic Modeling

BENDRIX1.XLS

The data collected by the manager appear in this file.

The Bendrix manufacturing data set has two explanatory variables, MachHrs and ProdRuns.

We need to estimate and interpret the equation for Overhead when both explanatory variables, MachHrs and ProdRuns, are included in the regression equation.

Page 43: Lecture7a Applied Econometrics and Economic Modeling

Solution

To obtain the desired output we use the StatPro/Regression Analysis/Multiple menu item.

We select Overhead as the response (dependent) variable and select MachHrs and ProdRuns as the explanatory (independent) variables.

The dialog box shown here then gives us options of which scatterplots to obtain and whether we want columns of fitted values and residuals placed next to the data set. For this example we will fill it in as shown.

Page 44: Lecture7a Applied Econometrics and Economic Modeling

Solution -- continued

The main regression output appears in the next table.

Page 45: Lecture7a Applied Econometrics and Economic Modeling

Results

The coefficients in the range C16-C18 indicate that the estimated regression equation is

Predicted Overhead = 3997 + 43.54 MachHrs + 883.62 ProdRuns

Page 46: Lecture7a Applied Econometrics and Economic Modeling

Interpretation of Equation

The interpretation of the equation is that if the number of production runs is held constant, then the overhead cost is expected to increase by $43.54 for each extra machine hour; and if the number of machine hours is held constant, the overhead is expected to increase by $883.62 for each extra production run.

The Bendrix manager can interpret $3997 as the fixed component of overhead. The slope terms involving MachHrs and ProdRuns are the variable components of overhead.
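The "held constant" interpretation can be made concrete by evaluating the estimated equation. In the sketch below the coefficients are the ones quoted in the interpretation above, while the function name and the example month (1,500 machine hours and 35 production runs) are hypothetical values chosen only for illustration.

```python
INTERCEPT = 3997.0   # fixed component of overhead, in dollars
B_MACHHRS = 43.54    # dollars per extra machine hour, ProdRuns held constant
B_PRODRUNS = 883.62  # dollars per extra production run, MachHrs held constant

def predicted_overhead(mach_hrs, prod_runs):
    """Predicted monthly overhead from the two-variable equation."""
    return INTERCEPT + B_MACHHRS * mach_hrs + B_PRODRUNS * prod_runs

# Hypothetical month, not taken from the Bendrix data
base = predicted_overhead(1500, 35)
one_more_hour = predicted_overhead(1501, 35) - base
one_more_run = predicted_overhead(1500, 36) - base

print(f"predicted overhead: {base:,.2f}")
print(f"effect of one extra machine hour: {one_more_hour:.2f}")
print(f"effect of one extra production run: {one_more_run:.2f}")
```

Changing one input while leaving the other fixed moves the prediction by exactly the corresponding coefficient.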

Page 47: Lecture7a Applied Econometrics and Economic Modeling

Equation Comparison

It is interesting to compare this equation with the separate equations found in the previous example:

Predicted Overhead = 48,621 + 34.7 MachHrs

and

Predicted Overhead = 75,606 + 655.1 ProdRuns

Note that both slope coefficients have increased: from 34.7 to about 43.5 for MachHrs, and from 655.1 to about 883.6 for ProdRuns.

Also, the intercept is now lower than either intercept in the single-variable equations.

It is difficult to guess the changes that more explanatory variables will cause, but it is likely that changes will occur.

Page 48: Lecture7a Applied Econometrics and Economic Modeling

Equation Comparison -- continued

The reasoning for this is that when MachHrs is the only variable in the equation, we are obviously not holding ProdRuns constant - we are ignoring it - so in effect the coefficient 34.7 of MachHrs indicates the effect of MachHrs and the omitted ProdRuns on Overhead.

But when we include both variables, the coefficient of 43.5 of MachHrs indicates the effect of MachHrs only, holding ProdRuns constant.

Since the coefficients have different meanings, it is not surprising that we obtain different estimates.
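The way the one-variable coefficient absorbs the omitted variable can be seen numerically. For least squares fits on the same data, the simple-regression slope on MachHrs equals the multiple-regression coefficient of MachHrs plus the coefficient of ProdRuns times the slope from regressing ProdRuns on MachHrs; this is a standard algebraic property of least squares. The sketch below verifies it on made-up monthly data (not the Bendrix data); all names and values are hypothetical.

```python
def cross_dev(u, v):
    """Sum of products of deviations of u and v from their means."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))

# Hypothetical monthly data (NOT the Bendrix data)
mach_hrs = [1200, 1350, 1500, 1600, 1450, 1700, 1300, 1550]
prod_runs = [40, 30, 35, 28, 45, 32, 38, 42]
overhead = [92000, 90000, 101000, 99000, 108000, 107000, 94000, 110000]

s11 = cross_dev(mach_hrs, mach_hrs)
s22 = cross_dev(prod_runs, prod_runs)
s12 = cross_dev(mach_hrs, prod_runs)
s1y = cross_dev(mach_hrs, overhead)
s2y = cross_dev(prod_runs, overhead)

# Simple regression of overhead on machine hours only
simple_slope = s1y / s11

# Multiple regression on both variables (closed form for two regressors)
det = s11 * s22 - s12 ** 2
b_mach = (s22 * s1y - s12 * s2y) / det
b_runs = (s11 * s2y - s12 * s1y) / det

# Slope from regressing the omitted variable on the included one
delta = s12 / s11

print(f"simple slope             = {simple_slope:.4f}")
print(f"b_mach + b_runs * delta  = {b_mach + b_runs * delta:.4f}")
```

When the two explanatory variables are unrelated in the sample (delta is zero), the two slopes coincide; otherwise they differ, as they do for Bendrix.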

Page 49: Lecture7a Applied Econometrics and Economic Modeling

Modeling Possibilities

Page 50: Lecture7a Applied Econometrics and Economic Modeling

BANK.XLS

The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees.

The bank’s employee database is listed in this file. Here is a partial list of the data.

Page 51: Lecture7a Applied Econometrics and Economic Modeling

Variables

For each of the 208 employees, the data set includes the following variables:

– EducLev: education level, a categorical variable with categories 1 (finished high school), 2 (finished some college courses), 3 (obtained a bachelor’s degree), 4 (took some graduate courses) and 5 (obtained a graduate degree)

– JobGrade: a categorical variable indicating the current job level, the possible levels being from 1-6 (6 is highest)

– YrHired: year employee was hired

– YrBorn: year employee was born

– Gender: a categorical variable with values “Female” and “Male”

Page 52: Lecture7a Applied Econometrics and Economic Modeling

Variables -- continued

– YrsPrior: number of years of work experience at another bank prior to working at Fifth National

– PCJob: a dummy variable with value 1 if the employee’s current job is computer-related and value 0 otherwise

– Salary: current annual salary in thousands of dollars

Do the data provide evidence that females are discriminated against in terms of salary?

Page 53: Lecture7a Applied Econometrics and Economic Modeling

Naïve Approach

A naïve approach to the problem is to compare the average salaries of the males and females.

The average of all salaries is $39,922, the average female salary is $37,210, and the average male salary is $45,505.

The difference between the averages is statistically significant. On average the females are definitely earning less, but perhaps there is a reason.

The question is whether the difference between the average salaries is still evident after taking other attributes into account. This is a perfect task for regression.
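One way to see the link to regression is that the naïve comparison is itself the simplest possible regression: regressing Salary on a 0/1 indicator for gender gives an intercept equal to the male average and a slope equal to the female-minus-male difference. The sketch below checks this on made-up salaries (not the bank's data). The indicator name `female` is an assumption; the data set itself stores Gender as “Female” or “Male”. Adjusting for other attributes would mean adding more explanatory variables, which is not shown here.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical salaries in $1000s (NOT the bank's data)
salary = [32.0, 35.5, 41.0, 38.0, 45.0, 52.5, 39.0, 47.5]
female = [1, 1, 1, 1, 0, 0, 1, 0]  # 1 = female, 0 = male

intercept, slope = fit_line(female, salary)

female_salaries = [s for s, f in zip(salary, female) if f == 1]
male_salaries = [s for s, f in zip(salary, female) if f == 0]
mean_female = sum(female_salaries) / len(female_salaries)
mean_male = sum(male_salaries) / len(male_salaries)

print(f"intercept = {intercept:.3f}, male average = {mean_male:.3f}")
print(f"slope = {slope:.3f}, female minus male = {mean_female - mean_male:.3f}")
```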