dda2013 week4 windowsexcel2003 regression

Upload: noelje

Post on 04-Jun-2018


  • 8/13/2019 DDA2013 Week4 WindowsExcel2003 Regression


    Guide for Windows Excel 2003

    Regression Modelling with Analysis ToolPak

    James W. Taylor

    The purpose of this guide is to explore linear regression using Excel. This note consists of the following sections:

    1. Summarising and describing a multi-variable data set
    2. Correlation analysis
    3. Scatter plots
    4. Simple regression
    5. Multiple regression

    We must first attach Excel's statistical add-in options:

    From the Tools menu, select Add-Ins.
    In the Add-Ins dialog box, select Analysis ToolPak and Analysis ToolPak - VBA.


    1. SUMMARISING & DESCRIBING A MULTI-VARIABLE DATA SET

    The Excel file ElectricityConsumption.xls contains monthly observations from January 2004 to July 2012 for the following variables:

    ELEC  Residential electricity sales (kWh) per customer in a mid-Atlantic U.S. city
    C66   Cooling degree hours at base temperature 66 degrees (a measure of summer heat) [1]
    C76   Cooling degree hours at base temperature 76 degrees (a measure of summer heat)
    H55   Heating degree hours at base temperature 55 degrees (a measure of winter cold) [2]
    DINC  Disposable income per household ($)
    AIRC  Proportion of households with air conditioning

    The ultimate aim is to build a forecasting model for residential electricity consumption.

    The first 13 rows of the worksheet (columns A to G) are:

          A        B        C      D      E       F       G
     1    MONTH    ELEC     C66    C76    H55     DINC    AIRC
     2    Jan-04   681.7    20     0      10148   34825   0.698
     3    Feb-04   620.3    0      0      12504   34934   0.701
     4    Mar-04   590.8    20     0      9300    35050   0.705
     5    Apr-04   538.0    14     0      5333    35172   0.708
     6    May-04   513.4    559    3      2846    35302   0.712
     7    Jun-04   575.5    1601   83     282     35438   0.716
     8    Jul-04   1019.3   5348   833    1       35583   0.72
     9    Aug-04   1203.9   7416   1547   0       35734   0.724
    10    Sep-04   1176.7   6887   1287   0       35892   0.728
    11    Oct-04   723.0    2975   398    155     36056   0.731
    12    Nov-04   519.0    427    5      1812    36222   0.735
    13    Dec-04   604.9    9      0      5779    36391   0.739

    Use the Analysis ToolPak Descriptive Statistics tool to get summary statistics (in one sequence of operations) for all 6 variables, by selecting

    Tools > Data Analysis > Descriptive Statistics

    In the Descriptive Statistics dialog box, specify:

    Input Range as the range containing values and variable names: B1:G104

    Click the Labels in First Row checkbox

    Output options as New Worksheet Ply with the name Summary

    Click the Summary Statistics checkbox.
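For readers who want to cross-check the Summary worksheet outside Excel, the following is a minimal Python sketch of a few of the measures that Descriptive Statistics reports. It uses only the twelve 2004 ELEC values listed in the excerpt above, so its numbers will differ from the full 103-month output; the function name `describe` is a hypothetical helper, not part of Excel.

```python
import statistics

# ELEC values for Jan-04 to Dec-04, copied from the worksheet excerpt.
# Assumption: this is only a 12-row subset of the 103 monthly observations.
elec_2004 = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
             1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]


def describe(values):
    """Return a few of the measures shown by Descriptive Statistics."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation (divides by n - 1).
        "stdev": statistics.stdev(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
    }


summary = describe(elec_2004)
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

The same function can be applied to any other column once its values are copied into a list.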

    [1] The cooling degree hours at base temperature $T$ is $\sum_{i \ge 1} i\, n_i$, where $n_i$ is the number of hours in the month at temperature $T+i$.

    [2] The heating degree hours at base temperature $T$ is $\sum_{i \ge 1} i\, n_i$, where $n_i$ is the number of hours in the month at temperature $T-i$.
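The degree-hour definitions in the footnotes can be illustrated with a short Python sketch. For whole-degree temperatures, summing i times the number of hours at T+i is the same as adding up, hour by hour, the amount by which the temperature exceeds the base. The hourly readings below are hypothetical values chosen for illustration; they are not taken from the data set.

```python
def cooling_degree_hours(hourly_temps, base):
    """Total degrees above the base, summed over the hours warmer than base."""
    return sum(t - base for t in hourly_temps if t > base)


def heating_degree_hours(hourly_temps, base):
    """Total degrees below the base, summed over the hours colder than base."""
    return sum(base - t for t in hourly_temps if t < base)


# Hypothetical hourly temperatures (degrees), for illustration only.
temps = [50, 54, 60, 66, 70, 78, 80, 76, 68, 58]

c66 = cooling_degree_hours(temps, 66)  # hours at 70, 78, 80, 76, 68 contribute
c76 = cooling_degree_hours(temps, 76)  # only the hours at 78 and 80 contribute
h55 = heating_degree_hours(temps, 55)  # only the hours at 50 and 54 contribute

print(c66, c76, h55)
```

Because C76 counts only the excess over 76 degrees, it can never be larger than C66 for the same month, which is consistent with the columns in the worksheet excerpt.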


    2. CORRELATION ANALYSIS

    Return to the Data worksheet.

    1. From the main menu, choose:

    Tools > Data Analysis...

    and in the Data Analysis dialog box, specify Correlation and confirm OK. The Correlation dialog box should appear.

    2. In the Correlation dialog box, specify:

    Input Range: as B1:G104 (don't include the MONTH column)

    Grouped By: as Columns, so that Excel knows that each column is a variable.

    The Labels in First Row checkbox should be ticked

    Output options: as New Worksheet Ply with the name Correlations

    Click OK.

    The correlation matrix below should result. Correlation coefficients for pairs of variables indicate the levels of linear association between them, e.g. ELEC and C76 have a correlation of 0.94, so that as C76 rises, ELEC tends to rise.

    You should get the same value using the Excel function =CORREL, e.g. =CORREL(B2:B104,D2:D104) for ELEC and C76.

    Note any variables strongly correlated with ELEC, and any strong inter-correlations between the potential explanatory variables, C66, C76, H55, DINC and AIRC.

            ELEC    C66     C76     H55     DINC    AIRC
    ELEC    1.00    0.92    0.94   -0.36    0.14    0.14
    C66     0.92    1.00    0.95   -0.65    0.02    0.02
    C76     0.94    0.95    1.00   -0.52    0.01    0.01
    H55    -0.36   -0.65   -0.52    1.00   -0.04   -0.05
    DINC    0.14    0.02    0.01   -0.04    1.00    0.94
    AIRC    0.14    0.02    0.01   -0.05    0.94    1.00
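As a rough illustration of what each entry in the matrix measures, here is a minimal Python sketch of the Pearson correlation coefficient. It uses only the twelve 2004 rows from the worksheet excerpt, so the values it prints are not expected to match the full-sample matrix; `pearson` is a hypothetical helper name.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Jan-04 to Dec-04 values copied from the worksheet excerpt (12 of 103 rows).
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
h55 = [10148, 12504, 9300, 5333, 2846, 282, 1, 0, 0, 155, 1812, 5779]

r_elec_c76 = pearson(elec, c76)
r_elec_h55 = pearson(elec, h55)
print(f"ELEC vs C76: {r_elec_c76:.2f}")
print(f"ELEC vs H55: {r_elec_h55:.2f}")
```

The coefficient is symmetric in its two arguments, which is why the matrix above is mirrored about its diagonal.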


    3. SCATTER PLOTS

    Scatter plots are of great help in identifying the strength, nature and direction of relationships between pairs of variables. In particular, they can highlight non-linear relationships, which will not necessarily be apparent from the correlation values. Since the observed correlation, 0.94, between ELEC and C76 suggests a relationship, let's examine their scatter plot.

    Return to the Data worksheet.

    Copy the ELEC column of data to column K. Copy C76 to column J.

    From the main menu, select:

    Insert > Chart

    In Step 1 of Chart Wizard, select chart type as: XY (Scatter) and click Next>.

    In Step 2, specify J1:K104 as the Data range.

    In Step 3, specify Chart title as Electricity Consumption, Value (X) axis as C76, Value (Y) axis as ELEC, then click Next>.

    In Step 4, specify that the chart should be placed As object in the Data worksheet, then click Finish.

    The scatter plot confirms the reasonably strong linear relationship, with ELEC rising as C76 increases.


    4. SIMPLE REGRESSION

    Regression analysis produces the estimated linear equation that best fits a set of data. By "best fitting" we mean the line (or linear model) for which there is least residual scatter, measured by the sum of squared residuals.

    1. Choose from the main menu:

    Tools > Data Analysis > Regression

    2. Complete the Regression dialog box. Specify:

    Input Y Range as B1:B104, i.e. ELEC as dependent variable

    Input X Range as D1:D104, i.e. C76 as independent variable

    Check the Labels box, as the first entries in each cell range are labels

    Specify Output options as New Worksheet Ply, with the name Regression1.

    Under the heading Residuals, select Residuals, Residual Plots and Line Fit Plots.

    Then click OK.
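The least-squares fit that the Regression tool performs for one explanatory variable can be sketched in a few lines of Python. This is an illustrative calculation on the twelve 2004 rows from the worksheet excerpt only, so its coefficients will not match Excel's output for all 103 months; `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return intercept, slope


# Jan-04 to Dec-04 values copied from the worksheet excerpt (12 of 103 rows).
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]

intercept, slope = fit_line(c76, elec)
residuals = [e - (intercept + slope * c) for c, e in zip(c76, elec)]

print(f"ELEC = {intercept:.2f} + {slope:.4f} * C76 (2004 rows only)")
print(f"sum of residuals: {sum(residuals):.6f}")
```

With an intercept in the model, the residuals of a least-squares fit sum to zero (up to rounding), which the last line lets you confirm.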


    4.1 REGRESSION ANALYSIS - INTERPRETING NUMERICAL OUTPUT

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R          0.936601141
    R Square            0.877221698
    Adjusted R Square   0.876006071
    Standard Error      84.01563552
    Observations        103

    ANOVA
                  df    SS            MS            F             Significance F
    Regression    1     5093652.918   5093652.918   721.6209201   8.45675E-48
    Residual      101   712921.3281   7058.627011
    Total         102   5806574.246

                  Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
    Intercept     632.1967321    9.685863338      65.2700446    2.09858E-84   612.9825852   651.410879
    C76           0.538125757    0.020032227      26.86300281   8.45675E-48   0.498387209   0.577864305

    The 1st part of the output contains summary statistics for the regression as a whole: R² (here 0.877, i.e. about 88% of the variation in ELEC is explained by C76) and the residual standard deviation (called standard error).

    Ignore the 2nd part, which displays ANOVA or Analysis of Variance calculations.

    The 3rd part of the output indicates that the best fitting linear model has equation:

    ELEC = 632.20 + 0.538*C76

    and that the slope, 0.538, has a t-stat of 26.86 and a very small p-value. The variable C76 is therefore significantly explaining some of the variation in ELEC.

    The 4th part shows predicted values for each of the observations, and the residuals.
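The pieces of the summary output are linked by simple arithmetic, which the following Python sketch checks using only figures copied from the output above. The variable names are illustrative choices; the formulas are the standard ones for a regression with an intercept.

```python
import math

# Figures copied from the SUMMARY OUTPUT above.
n_obs = 103
df_regression = 1
df_residual = 101
ss_regression = 5093652.918
ss_residual = 712921.3281
slope, slope_se = 0.538125757, 0.020032227

ss_total = ss_regression + ss_residual

# R Square: share of the total variation explained by the regression.
r_square = ss_regression / ss_total

# Adjusted R Square penalises for the number of explanatory variables.
adjusted_r_square = 1 - (1 - r_square) * (n_obs - 1) / df_residual

# Standard Error: square root of the residual mean square.
ms_residual = ss_residual / df_residual
standard_error = math.sqrt(ms_residual)

# F compares the regression mean square with the residual mean square.
f_stat = (ss_regression / df_regression) / ms_residual

# t Stat: coefficient divided by its standard error.
t_stat = slope / slope_se

print(f"R Square          {r_square:.6f}")
print(f"Adjusted R Square {adjusted_r_square:.6f}")
print(f"Standard Error    {standard_error:.4f}")
print(f"F                 {f_stat:.2f}")
print(f"t Stat (C76)      {t_stat:.2f}")
```

In a simple regression the F statistic equals the square of the slope's t-stat, which is why both rows report the same p-value, 8.45675E-48.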


    4.2 REGRESSION - INTERPRETING EXCEL'S GRAPHICAL OUTPUT

    The Regression tool puts one chart on top of another. Click on the top chart so that it becomes the active chart, then move it below.

    1. The Line Fit Plot shows actual ELEC and predicted ELEC, plotted for different values of C76. This plot is the same as your scatter plot of ELEC & C76 (only with the axes flipped round) with points from the regression line superimposed. The regression line (called Predicted ELEC in the legend) is shown as points rather than as a line. This can be changed by formatting the data series.

    [Figure: C76 Line Fit Plot. ELEC and Predicted ELEC (vertical axis) plotted against C76 (horizontal axis).]

    2. The Residual Plot shows residuals plotted versus the value of the C76 variable. Check that the residuals do not display an obvious pattern. Ideally, residuals should be as if random, not showing any systematic pattern, of much the same average size, and not increasing in size as X (C76) increases, etc. Residual plots are also useful for spotting outliers - data points much further from the regression line than others.

    [Figure: C76 Residual Plot. Residuals (vertical axis) plotted against C76 (horizontal axis).]
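To make the residuals concrete, the sketch below applies the fitted equation from section 4.1 to the twelve 2004 rows shown in the worksheet excerpt and computes actual minus predicted ELEC for each month. The coefficients are copied from the summary output; the helper name `predict_elec` is hypothetical, and twelve rows are only a small part of what the residual plot displays.

```python
# Coefficients copied from the regression output in section 4.1.
INTERCEPT = 632.1967321
SLOPE = 0.538125757


def predict_elec(c76):
    """Predicted ELEC from the fitted simple regression on C76."""
    return INTERCEPT + SLOPE * c76


# (month, ELEC, C76) for 2004, copied from the worksheet excerpt.
rows = [
    ("Jan-04", 681.7, 0), ("Feb-04", 620.3, 0), ("Mar-04", 590.8, 0),
    ("Apr-04", 538.0, 0), ("May-04", 513.4, 3), ("Jun-04", 575.5, 83),
    ("Jul-04", 1019.3, 833), ("Aug-04", 1203.9, 1547),
    ("Sep-04", 1176.7, 1287), ("Oct-04", 723.0, 398),
    ("Nov-04", 519.0, 5), ("Dec-04", 604.9, 0),
]

# Residual = actual value minus predicted value.
residuals = {month: elec - predict_elec(c76) for month, elec, c76 in rows}

for month, resid in residuals.items():
    print(f"{month}  residual {resid:8.1f}")

# The month furthest from the regression line among these twelve rows.
largest = max(residuals, key=lambda m: abs(residuals[m]))
print("largest absolute residual:", largest)
```

Sorting or scanning residuals this way is one simple means of spotting candidate outliers before inspecting them on the plot.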


    5. MULTIPLE REGRESSION

    Can the ELEC predictions be improved if other possible explanatory variables are brought into the model? This section contains a brief description of the way Excel's regression can be extended from simple (ELEC on C76) to multiple regression (ELEC on two or more variables). The purpose is to find the best equation for predicting ELEC from one or more of the independent variables. Let's regress ELEC on the other five variables.

    Return to the Data worksheet.

    1. Starting from a cell on the Data sheet, choose from the main menu:

    Tools > Data Analysis > Regression

    2. In the Regression dialog box, specify:

    Input Y Range as B1:B104, i.e. ELEC as dependent variable

    Input X Range as C1:G104, i.e. five explanatory variables

    Check the Labels checkbox.

    Specify Output options: as New Worksheet Ply, with the name Regression2.

    Under the heading Residuals, select Residuals, Residual Plots and Line Fit Plots.

    Then click OK.
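The step from simple to multiple regression can be sketched in Python by solving the least-squares normal equations directly. To keep the example small and numerically well behaved, it uses only two explanatory variables (C76 and H55) and only the twelve 2004 rows from the worksheet excerpt, so it is an illustration of the idea rather than a reproduction of the Regression2 output; `solve`, `fit_ols` and `r_squared` are hypothetical helper names.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(columns, y, coefs):
    """Share of the variation in y explained by the fitted model."""
    preds = [coefs[0] + sum(b * v for b, v in zip(coefs[1:], vals))
             for vals in zip(*columns)]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Jan-04 to Dec-04 values copied from the worksheet excerpt (12 of 103 rows).
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
h55 = [10148, 12504, 9300, 5333, 2846, 282, 1, 0, 0, 155, 1812, 5779]

coefs_one = fit_ols([c76], elec)
coefs_two = fit_ols([c76, h55], elec)
r2_one = r_squared([c76], elec, coefs_one)
r2_two = r_squared([c76, h55], elec, coefs_two)

print(f"R Square, ELEC on C76:         {r2_one:.3f}")
print(f"R Square, ELEC on C76 and H55: {r2_two:.3f}")
```

Adding an explanatory variable can never lower R Square, so the second figure is at least as large as the first; this is one reason to also look at Adjusted R Square and the t-stats of the individual coefficients when comparing models.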