quick and dirty regression tutorial

8/12/2019 Quick and Dirty Regression Tutorial

http://slidepdf.com/reader/full/quick-and-dirty-regression-tutorial 1/6

Quick and Dirty Regression Tutorial

The statistical procedure known as linear regression basically involves drawing and analyzingtrend-lines through data points. Economists use regression analysis to test hypotheses, derived

from economic theory, against real-world data.

In your first microeconomics class you saw theoretical demand schedules (Figure 1) showingthat if price increases, the quantity demanded ought to decrease. But when we collect marketdata to actually test this theory, the data may exhibit a trend, but they are "noisy" (Figure 2).

Drawing a trendline through datapoints

To analyze the empirical relationship between price and quantity, download and open the Excelspreadsheet with the data . Right-click on the spreadsheet chart to open a chart window, and

print off a full-page copy of the chart (same as the one shown in Figure 2). Using a pencil andstraightedge, eyeball and then draw a straight line through the cloud of points that best fits theoverall trend. Extend this line to both axes. Now calculate the values of intercept A and slope Bof the linear equation that represents the trend-line Price = A + B * Quantity

Although it is standard practice to graph supply and demand with Price on the Y-axis andQuantity on the X-axis, economists more often consider demand Quantity to be the "dependent"variable influenced by the "independent" variable Price. To obtain a more conventional demandequation, invert your equation, solving for intercept and slope coefficients a and b, whereQuantity = a + b*Pri ce . Technically, since this "empirical" (i.e., data-derived) demand model

doesn't fit through the data points exactly, it ought to be written as Quantity = a + b*Pri ce + e where e is the residual "unexplained" variation in the Quantity variable (the deviations of theactual Quantity data points from the estimated regession line that you drew through them).

http://www.udel.edu/johnmack/frec424/regression/regression_data.xls








That's basically what linear regression is about: fitting trend lines through data to analyzerelationships between variables. Since doing it by hand is imprecise and tedious, mosteconomists and statisticians prefer to...

Fitting a trendline in an XY-scatterplot

MS-Excel provides two methods for fitting the best-fitting trend-line through data points, andcalculating that line's slope and intercept coefficients. The standard criterion for "best fit" is thetrend line that minimizes the sum of the squared vertical deviations of the data points from thefitted line. This is called the ordinary least-squares (OLS) regression line. (If you got a bunch of

people to fit regression lines by hand and averaged their results, you would get something veryclose to the OLS line.)

The easiest way to plot a trend line and calculate a single-variable regression equation is to right-click on the data points in an Excel XY plot and select "Add Trendline." Under the "Options" tabcheck "Display equation on chart" and click "OK." How well do this trend line and calculated

slope and intercept coefficients match the line you drew and the slope and intercept that youcalculated?

Using Excel's Regression utility (Data Analysis tools)

Excel also includes a formal regression utility in its Analysis ToolPak that provides statisticsindicating goodness-of-fit and confidence intervals for slope and intercept coefficients. Thisutility lets you regress one dependent "left-hand-side" (of the equal sign) variable against one or



several independent "right-hand side" variables, and it provides useful indicators about thestatistical reliability of your model.

Excel's Regression procedure is one of the Data Analysis tools. If you don't see it, you need toactivate the Analysis ToolPak. Click the Windows symbol or the File menu, choose Options--

Add-Ins, select Analysis ToolPak (not Analysis ToolPak VBA) and click "Go..." Check theAnalysis TookPak checkbox and "OK." You will find "Data Analysis" on the right end of the"Data" menu.

The only things you are required to specify are...(a) one column of numbers as the Y Range, aka the dependent variable, "left-hand-side" variableor endogenous variable whose variation is to be "explained" by the regression model;(b) one or several adjacent columns of numbers as the X Range, aka the independent variables,right-hand side (of the equals sign) variables, exogenous variables or "explanatory" variables;(c) the upper-left corner of a blank range of cells in your spreadsheet where the results will be

printed.The X and Y ranges must contain the same number of rows, all numeric data, no missing values.



Here is output from Excel's regression utility replicating the regression of Price (Y range) againstQuantity (X range). At the bottom of the output you can see the same Intercept and Quantityslope coefficients that are shown for the trend line in the XY plot above. This empirical inversedemand model, written out in equation form, is P = 13.675 - 0.1664*Q + e. Other parts of theoutput are explained below.)

Try specifing Quantity as the dependent variable and Price as the independent variable, andestimating the conventional demand regression model Quantity = a + b*Price . Note that youobtain an approximate rather than exact mathematical inverse of the price equation! This is

because OLS minimizes the sum of the squared vertical deviations from the regression line, notthe sum of squared perpendicular deviations:

Multivariate models



Now try regressing Quantity (Y range) against both Price and Income (the X range is both thePrice and Income columns). This will yield coefficient estimates for the multivariate demandmodel Quantity = a + b* Pri ce + c* I ncome + e. You should get something like this:

Written out in equation form, this empirical demand model is Q = 49.18 - 3.118*P + 0.510*I +e. Multivariate models such as this don't lend themselves to easy graphing, but they are muchmore interesting. In this example an increase in Income shifts the conventional Q vs. P demandschedule to the right, while an increase in Price shifts the Q vs. Income curve (aka Engel curve)to the left.

Model diagnostics

When analyzing your regression output, first check the signs of the model coefficients: are theyconsistent with your hypotheses? Is the Price coefficient negative as theory predicts? Does theIncome coefficient indicate this is a normal good, or an inferior good? Try calculating the priceand income elasticities using these slope coefficients and the average values of Price andQuantity.

The next thing you should check is the statistical significance of your model coefficients.Because the data are noisy and the regression line doesnt fit the data points exactly, eachreported coefficient is really a point estimate , a mean value from a distribution of possiblecoefficient estimates. So the residuals e (the remaining noise in the data) are used to analyze thestatistical reliability of the regression coefficients. The columns to the right of the coefficientscolumn at the bottom of the Excel output report the standard errors, t -statistics, P-values, andlower and upper 95% confidence bounds for each coefficient.

The standard error is the square root of the variance of the regression coefficient. The t -statisticis the coefficient estimate divided by the standard error. If your regression is based on whatstatisticians call a "large" sample (30 or more observations), a t -statistic greater than 2 (or lessthan -2) indicates the coefficient is significant with >95% confidence. A t -statistic greater than1.68 (or less than -1.68) indicates the coefficient is significant with >90% confidence. The



confidence thresholds for t -statistics are higher for small sample sizes. This example uses only21 observations to estimate 1 intercept and 2 slope coefficients, which leaves 21 - 3 = 18"degrees of freedom" (df) for calculating significance levels. In this example, the t -statistic onthe Income coefficient is 2.037, which would exceed the 95% confidence threshold for a "large"(N > 30 observations) dataset, but does not quite meet the 95% confidence threshold when N =

21 observations.

If that last paragraph is just statistical gibberish for you, don't worry--most people just check theP-values. These are the probabilities that the coefficients are not statistically significant. The P-value of 0.056 for the Income coefficient implies 1 - 0.056 = 94.4% confidence that the "true"coefficient is between 0 and about 1.02. The last two columns report the exact lower and upper95% confidence thresholds for the Income coefficient: -0.0159 and +1.038 respectively. Thevery low P-values for the Intercept and Price coefficients indicate they are very stronglysignificant, so their 95% confidence intervals are relatively narrower.

The R-Square statistic near the top of the output represents the percent of the total variation in

the dependent variable that is explained by the independent variables, i.e., the model's overallgoodness of fit." But whether a model is really a "good" fit or not depends on context. R-squaresfor cross-sectional models are typically much lower than R-squares for time-series models. Youcan always increase R-square by throwing another independent variable ( any variable!) into yourmodel. Remember that your real objective is to test your hypotheses, not to maximize R-square

by including irrelevant variables in your model and then making up some "hypothesis" after thefact to "explain" the results you got.

Those are all the diagnostics you really need to worry about.

Final comments

The classical OLS model assumes that the residuals e are independent of each other andrandomly distributed with a mean of zero. It is sometimes helpful to examine plots of residuals tocheck for non-random pattens that indicate problems with your model. If you take aneconometrics class, you will learn how to identify violations of these assumptions and how toadapt the OLS model to deal with these situations.

Keep in mind that a regression actually analyzes the statistical correlation between one variableand a set of other variables. It doesn't actually prove causality . It is only the context of youranalysis that lets you infer that the "independent" variabes "cause" the variation in the"dependent" variable. Somebody else out there is probably using the same data to prove that yourdependent variable is "causing" one of your independent variables!

You should never force the regression line through the origin (the "Constant is zero" check-boxin the Excel utility) without a clear theoretical justification for doing so. It makes your modeldiagnostics unreliable.

quick and dirty regression tutorial

Documents