dda2013 week4 windowsexcel2003 regression
8/13/2019 DDA2013 Week4 WindowsExcel2003 Regression
Guide for Windows Excel 2003
Regression Modelling with Analysis ToolPak
James W. Taylor
The purpose of this guide is to explore linear regression using Excel. This note consists of the following sections:
1. Summarising and describing a multi-variable data set
2. Correlation analysis
3. Scatter plots
4. Simple regression
5. Multiple regression
We must first attach Excel's statistical add-in options:
From the Tools menu, select Add-Ins.
In the Add-Ins dialog box, select Analysis ToolPak and Analysis ToolPak - VBA.
1. SUMMARISING & DESCRIBING A MULTI-VARIABLE DATA SET
The Excel file ElectricityConsumption.xls contains monthly observations from January 2004 to July 2012 for the following variables:
ELEC  Residential electricity sales (KWh) per customer in a mid-Atlantic U.S. city
C66   Cooling degree hours at base temperature 66 degrees (a measure of summer heat; see footnote 1)
C76   Cooling degree hours at base temperature 76 degrees (a measure of summer heat)
H55   Heating degree hours at base temperature 55 degrees (a measure of winter cold; see footnote 2)
DINC  Disposable income per household ($)
AIRC  Proportion of households with air conditioning
The ultimate aim is to build a forecasting model for residential electricity consumption.
The first 13 rows of the Data worksheet (columns A to G) are:
Row  A       B       C     D     E      F      G
1    MONTH   ELEC    C66   C76   H55    DINC   AIRC
2    Jan-04  681.7   20    0     10148  34825  0.698
3    Feb-04  620.3   0     0     12504  34934  0.701
4    Mar-04  590.8   20    0     9300   35050  0.705
5    Apr-04  538.0   14    0     5333   35172  0.708
6    May-04  513.4   559   3     2846   35302  0.712
7    Jun-04  575.5   1601  83    282    35438  0.716
8    Jul-04  1019.3  5348  833   1      35583  0.72
9    Aug-04  1203.9  7416  1547  0      35734  0.724
10   Sep-04  1176.7  6887  1287  0      35892  0.728
11   Oct-04  723.0   2975  398   155    36056  0.731
12   Nov-04  519.0   427   5     1812   36222  0.735
13   Dec-04  604.9   9     0     5779   36391  0.739
Use the Analysis ToolPak Descriptive Statistics tool to get summary statistics (in one sequence of operations) for all 6 variables, by selecting
Tools > Data Analysis > Descriptive Statistics
In the Descriptive Statistics dialog box, specify:
Input Range as the range containing values and variable names: B1:G104
Click the Labels in First Row checkbox
Output options as New Worksheet Ply with the name Summary
Click the Summary Statistics checkbox.
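As a cross-check outside Excel, some of the figures that the Descriptive Statistics tool reports can be reproduced with a short Python sketch. It uses only the twelve 2004 ELEC values shown in the table above, not the full 103-row workbook, and covers only part of what the tool outputs. The variable names are illustrative.

```python
import statistics

# ELEC values for Jan-04 to Dec-04, copied from the worksheet excerpt.
# Assumption: only these 12 of the 103 observations are used here.
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]

summary = {
    "Mean": statistics.mean(elec),
    "Median": statistics.median(elec),
    "Standard Deviation": statistics.stdev(elec),  # sample (n - 1) version
    "Sample Variance": statistics.variance(elec),
    "Range": max(elec) - min(elec),
    "Minimum": min(elec),
    "Maximum": max(elec),
    "Sum": sum(elec),
    "Count": len(elec),
}

for name, value in summary.items():
    print(f"{name:20s}{value:12.3f}")
```

The standard deviation here uses the n - 1 divisor, which is the sample version.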
1. The cooling degree hours at base temperature $T$ is $\sum_{i \ge 1} i \, n_i$, where $n_i$ is the number of hours in the month at temperature $T+i$.
2. The heating degree hours at base temperature $T$ is $\sum_{i \ge 1} i \, n_i$, where $n_i$ is the number of hours in the month at temperature $T-i$.
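The two footnote definitions can be sketched in Python as follows. Summing the excess over the base temperature for each hour is the same as summing i times the number of hours at T+i (or T-i). The hourly readings below are hypothetical, chosen only to make the arithmetic easy to follow, and the function names are illustrative.

```python
def cooling_degree_hours(hourly_temps, base):
    """Sum of (temperature - base) over all hours warmer than the base."""
    return sum(t - base for t in hourly_temps if t > base)


def heating_degree_hours(hourly_temps, base):
    """Sum of (base - temperature) over all hours colder than the base."""
    return sum(base - t for t in hourly_temps if t < base)


# Hypothetical hourly temperature readings, in whole degrees.
temps = [60, 64, 70, 78, 80, 78, 70, 52]

c66 = cooling_degree_hours(temps, 66)  # 4 + 12 + 14 + 12 + 4
c76 = cooling_degree_hours(temps, 76)  # 2 + 4 + 2
h55 = heating_degree_hours(temps, 55)  # 3

print(c66, c76, h55)
```

In the data set the same sums are taken over every hour of a month, which is why C66 is never smaller than C76 in any row.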
2. CORRELATION ANALYSIS
Return to the Data worksheet.
1. From the main menu, choose:
Tools > Data Analysis...
and in the Data Analysis dialog box, select Correlation and click OK. The Correlation dialog box should appear.
2. In the Correlation dialog box, specify:
Input Range: as B1:G104 (don't include the MONTH column)
Grouped By: as Columns, so that Excel knows that each column is a variable.
The Labels in First Row checkbox should be ticked
Output options: as New Worksheet Ply with the name Correlations
Click OK.
The correlation matrix below should result. Correlation coefficients for pairs of variables indicate the level of linear association between them, e.g. ELEC and C76 have a correlation of 0.94, so ELEC tends to rise as C76 rises.
You should get the same value using the Excel function =CORREL, for example =CORREL(B2:B104, D2:D104) for ELEC and C76.
Note any variables strongly correlated with ELEC, and any strong inter-correlations between the potential explanatory variables, C66, C76, H55, DINC and AIRC.
       ELEC    C66     C76     H55     DINC    AIRC
ELEC   1.00    0.92    0.94    -0.36   0.14    0.14
C66    0.92    1.00    0.95    -0.65   0.02    0.02
C76    0.94    0.95    1.00    -0.52   0.01    0.01
H55    -0.36   -0.65   -0.52   1.00    -0.04   -0.05
DINC   0.14    0.02    0.01    -0.04   1.00    0.94
AIRC   0.14    0.02    0.01    -0.05   0.94    1.00
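To see what a single entry of the correlation matrix measures, the Pearson correlation coefficient can be computed directly. The sketch below uses only the twelve 2004 rows of ELEC and C76 from the worksheet excerpt, so its result will differ from the 0.94 obtained from all 103 observations. The function name `pearson` is illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariation of x and y scaled by their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Jan-04 to Dec-04 values copied from the worksheet excerpt (12 of 103 rows).
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]

r_elec_c76 = pearson(elec, c76)
print(f"Correlation of ELEC and C76 (2004 only): {r_elec_c76:.2f}")
```

The coefficient always lies between -1 and 1, and the correlation of a variable with itself is 1, which is why the diagonal of the matrix is 1.00.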
3. SCATTER PLOTS
Scatter plots are of great help in identifying the strength, nature and direction of relationships between pairs of variables. In particular, they can highlight non-linear relationships, which will not necessarily be apparent from the correlation values. Since the observed correlation of 0.94 between ELEC and C76 suggests a relationship, let's examine their scatter plot.
Return to the Data worksheet.
Copy the ELEC column of data to column K. Copy C76 to column J.
From the main menu, select:
Insert > Chart
In Step 1 of the Chart Wizard, select the chart type XY (Scatter) and click Next>.
In Step 2, specify J1:K104 as the Data range.
In Step 3, specify the Chart title as Electricity Consumption, the Value (X) Axis as C76 and the Value (Y) Axis as ELEC, then click Next>.
In Step 4, specify that the chart should be placed As object in the Data worksheet, then click Finish.
The scatter plot confirms the reasonably strong linear relationship, with ELEC rising as C76 increases.
4. SIMPLE REGRESSION
Regression analysis produces the estimated linear equation that best fits a set of data. By best fitting we mean the line (or linear model) for which there is least residual scatter, that is, the smallest sum of squared residuals.
1. Choose from the main menu:
Tools > Data Analysis > Regression
2. Complete the Regression dialog box. Specify:
Input Y range as B1:B104, i.e. ELEC as dependent variable
Input X range as D1:D104, i.e. C76 as independent variable
Check the Labels box, as the first entries in each cell range are labels
Specify Output options as New Worksheet Ply, with the name Regression1.
Under the heading Residuals, select Residuals, Residual Plots and Line Fit Plots.
Then click OK.
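The least-squares fit that the Regression tool performs for one explanatory variable can be sketched in a few lines of Python. This is an illustration of the technique, not a description of Excel's internals. It uses only the twelve 2004 rows from the worksheet excerpt, so the intercept and slope will differ from those Excel reports for all 103 observations. The function name `fit_line` is illustrative.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Jan-04 to Dec-04 values copied from the worksheet excerpt (12 of 103 rows).
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]

intercept, slope = fit_line(c76, elec)
predicted = [intercept + slope * x for x in c76]
residuals = [y - p for y, p in zip(elec, predicted)]

# R Square: the share of the variation in ELEC accounted for by the line.
mean_elec = sum(elec) / len(elec)
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - mean_elec) ** 2 for y in elec)
r_square = 1 - ss_res / ss_tot

print(f"ELEC = {intercept:.2f} + {slope:.3f} * C76, R Square = {r_square:.3f}")
```

With an intercept in the model, the residuals of a least-squares fit sum to zero, which is a quick sanity check on any fitted line.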
4.1 REGRESSION ANALYSIS - INTERPRETING NUMERICAL OUTPUT
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.936601141
R Square           0.877221698
Adjusted R Square  0.876006071
Standard Error     84.01563552
Observations       103

ANOVA
            df   SS           MS           F            Significance F
Regression  1    5093652.918  5093652.918  721.6209201  8.45675E-48
Residual    101  712921.3281  7058.627011
Total       102  5806574.246

           Coefficients  Standard Error  t Stat       P-value      Lower 95%    Upper 95%
Intercept  632.1967321   9.685863338     65.2700446   2.09858E-84  612.9825852  651.410879
C76        0.538125757   0.020032227     26.86300281  8.45675E-48  0.498387209  0.577864305
The 1st part of the output contains summary statistics for the regression as a whole, including R² (here 0.877, so C76 accounts for about 88% of the variation in ELEC) and the residual standard deviation (called Standard Error).
Ignore the 2nd part, which displays ANOVA or Analysis of Variance calculations.
The 3rd part of the output indicates that the best fitting linear model has equation:
ELEC = 632.20 + 0.538*C76
and that the slope, 0.538, has a t-stat of 26.86 and a very small p-value. The variable C76 therefore explains a statistically significant part of the variation in ELEC.
The 4th part shows predicted values for each of the observations, and the residuals.
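The figures in the summary output are linked by simple formulas, and recomputing a few of them is a useful way to understand what each one means. The sketch below takes the sums of squares, coefficients and standard error from the output above and rederives R Square, Adjusted R Square, Standard Error, F and the t Stat for C76. It then applies the fitted equation to the Aug-04 row (C76 = 1547, ELEC = 1203.9) as one worked prediction. The variable names are illustrative.

```python
import math

# Values copied from the SUMMARY OUTPUT above.
n_obs = 103          # Observations
k = 1                # number of explanatory variables
ss_regression = 5093652.918
ss_residual = 712921.3281
ss_total = 5806574.246
intercept = 632.1967321
slope = 0.538125757
se_slope = 0.020032227

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - k - 1)
ms_residual = ss_residual / (n_obs - k - 1)
standard_error = math.sqrt(ms_residual)      # residual standard deviation
f_stat = (ss_regression / k) / ms_residual
t_slope = slope / se_slope                   # coefficient / its standard error

# One worked prediction: the Aug-04 row of the worksheet.
c76_aug04, elec_aug04 = 1547, 1203.9
predicted_aug04 = intercept + slope * c76_aug04
residual_aug04 = elec_aug04 - predicted_aug04

print(f"R Square          {r_square:.6f}")
print(f"Adjusted R Square {adjusted_r_square:.6f}")
print(f"Standard Error    {standard_error:.4f}")
print(f"F                 {f_stat:.4f}")
print(f"t Stat (C76)      {t_slope:.4f}")
print(f"Aug-04 predicted  {predicted_aug04:.1f}, residual {residual_aug04:.1f}")
```

With a single explanatory variable, the square of the slope's t Stat equals F, which is why the P-value for C76 and Significance F are identical in the output.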
4.2 REGRESSION - INTERPRETING EXCEL'S GRAPHICAL OUTPUT
The Regression tool puts one chart on top of another. Click on the top chart so that it becomes the active chart, then move it below.
1. The Line Fit Plot shows actual ELEC and predicted ELEC, plotted for different values of C76. This plot is the same as your scatter plot of ELEC & C76 (only with the axes flipped round) with points from the regression line superimposed. The regression line (called Predicted ELEC in the legend) is shown as points rather than as a line. This can be changed by formatting the data series.
[Figure: C76 Line Fit Plot. ELEC and Predicted ELEC (vertical axis, 0 to 2000) plotted against C76 (horizontal axis, 0 to 2000).]
2. The Residual Plot shows the residuals plotted against the value of the C76 variable. Check that the residuals do not display an obvious pattern. Ideally, residuals should look random: showing no systematic pattern, being of much the same average size throughout, and not increasing in size as X (C76) increases. Residual plots are also useful for spotting outliers, i.e. data points much further from the regression line than others.
[Figure: C76 Residual Plot. Residuals (vertical axis, -300 to 300) plotted against C76 (horizontal axis, 0 to 2000).]
5. MULTIPLE REGRESSION
Can the ELEC predictions be improved if other possible explanatory variables are brought into the model? This section contains a brief description of the way Excel's regression can be extended from simple (ELEC on C76) to multiple regression (ELEC on two or more variables). The purpose is to find the best equation for predicting ELEC from one or more of the independent variables. Let's regress ELEC on the other five variables.
Return to the Data worksheet.
1. Starting from a cell on the Data sheet, choose from the main menu:
Tools > Data Analysis > Regression
2. In the Regression dialog box, specify:
Input Y range as B1:B104, i.e. ELEC as dependent variable
Input X range as C1:G104, i.e. five explanatory variables
Check the Labels checkbox.
Specify Output options: as New Worksheet Ply, with the name Regression2.
Under the heading Residuals, select Residuals, Residual Plots and Line Fit Plots.
Then click OK.
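The step from simple to multiple regression can be illustrated with a small Python sketch that solves the least-squares normal equations directly. It is deliberately reduced: it uses only two explanatory variables (C76 and H55) and only the twelve 2004 rows from the worksheet excerpt, so it does not reproduce the five-variable, 103-row Regression2 output. The helper names are illustrative, and the normal-equations approach is chosen here for transparency rather than as a description of how Excel computes its results.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(predictors, y):
    """Return [intercept, b1, b2, ...] from the normal equations X'X b = X'y."""
    rows = [[1.0, *values] for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Jan-04 to Dec-04 values copied from the worksheet excerpt (12 of 103 rows).
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
h55 = [10148, 12504, 9300, 5333, 2846, 282, 1, 0, 0, 155, 1812, 5779]

coefficients = fit_least_squares([c76, h55], elec)
intercept, b_c76, b_h55 = coefficients

predicted = [intercept + b_c76 * x1 + b_h55 * x2 for x1, x2 in zip(c76, h55)]
residuals = [y - p for y, p in zip(elec, predicted)]

mean_elec = sum(elec) / len(elec)
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - mean_elec) ** 2 for y in elec)
r_square = 1 - ss_res / ss_tot

print(f"ELEC = {intercept:.2f} + {b_c76:.4f}*C76 + {b_h55:.4f}*H55")
print(f"R Square = {r_square:.3f}")
```

Adding an explanatory variable can never lower R Square, so the Adjusted R Square in Excel's output is the better figure for comparing Regression1 with Regression2.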