data analysis.ppt

Upload: dotrev-ibs

Post on 14-Apr-2018


TRANSCRIPT

  • 7/29/2019 Data Analysis.ppt

    1/28

    Data Analysis

    Univariate Analysis

    Bivariate Analysis

    Multivariate Analysis

  • 7/29/2019 Data Analysis.ppt

    2/28

    Three Types of Analysis

    We can classify analysis into three types:

    1. Univariate, involving a single variable at a time,

    2. Bivariate, involving two variables at a time, and

    3. Multivariate, involving three or more variables simultaneously.

  • 7/29/2019 Data Analysis.ppt

    3/28

    Revision: Application Areas: Correlation

    1. Correlation and regression are generally performed together. Correlation analysis is used to measure the degree of association between two sets of quantitative data. The correlation coefficient measures this association. Its value ranges from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).

  • 7/29/2019 Data Analysis.ppt

    4/28

    2. For example, how are sales of product A correlated with sales of product B? Or, how is the advertising expenditure correlated with other promotional expenditure? Or, are daily ice cream sales correlated with daily maximum temperature?

  • 7/29/2019 Data Analysis.ppt

    5/28

    3. Correlation does not necessarily mean there is a causal effect. Given any two strings of numbers, there will be some correlation among them. It does not imply that one variable is causing a change in another, or is dependent upon another.

    4. Correlation is usually followed by regression analysis in many applications.
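
    As an illustration of the correlation coefficient, here is a minimal sketch in plain Python. It uses the SALES and POTENTL columns as transcribed from the data slide later in this deck; the function name pearson_r is an illustrative choice, not something the slides define.

```python
import math

# SALES and POTENTL columns as transcribed from the deck's data slide
# (15 sales territories, both in Rs. lakhs).
sales = [5, 60, 20, 11, 45, 6, 15, 22, 29, 3, 16, 8, 18, 23, 81]
potential = [25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(sales, potential)
print(f"r(SALES, POTENTL) = {r:.4f}")
```

    For these two columns the coefficient comes out at roughly 0.94, a strong positive association. As the slide notes, that by itself says nothing about causation.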

  • 7/29/2019 Data Analysis.ppt

    6/28

    Application Areas: Regression

    1. The main objective of regression analysis is to explain the variation in one variable (called the dependent variable), based on the variation in one or more other variables (called the independent variables).

    2. The application areas are in explaining variations in sales of a product based on advertising expenses, or number of salespeople, or number of sales offices, or on all the above variables.

    3. If there is only one dependent variable and one independent variable is used to explain the variation in it, then the model is known as a simple regression model.

  • 7/29/2019 Data Analysis.ppt

    7/28

    4. If multiple independent variables are used to explain the variation in a dependent variable, it is called a multiple regression model.

    5. Even though the form of the regression equation could be either linear or non-linear, we will limit our discussion to linear (straight line) models.

  • 7/29/2019 Data Analysis.ppt

    8/28

    The general regression model (linear) is of the type

    Y = b0 + b1x1 + b2x2 + ... + bnxn

    (OR Y = a + b1x1 + b2x2 + ... + bnxn)

    where

    y is the dependent variable

    x1, x2, x3, ..., xn are the independent variables expected to be related to y and expected to explain or predict y.

    b1, b2, b3, ..., bn are the coefficients of the respective independent variables, which will

  • 7/29/2019 Data Analysis.ppt

    9/28

    Purposes of Regression Analysis

    To establish the relationship between a dependent variable (outcome) and a set of independent (explanatory) variables

    To identify the relative importance of the different independent (explanatory) variables on the outcome

    To make predictions

  • 7/29/2019 Data Analysis.ppt

    10/28

    Steps of Regression Analysis

    Step 1: Construct a regression model

    Step 2: Estimate the regression and interpret the result

    Step 3: Conduct diagnostic analysis of the results

    Step 4: Change the original regression model if necessary

    Step 5: Make predictions

  • 7/29/2019 Data Analysis.ppt

    11/28

    DATA (INPUT / OUTPUT)

    1. Input data on y and each of the x variables is required to do a regression analysis. This data is input into a computer package to perform the regression analysis.

    2. The output consists of the b coefficients for all the independent variables in the model. It also gives the results of a t test for the significance of each variable in the model, and the results of the F test for the model as a whole.

  • 7/29/2019 Data Analysis.ppt

    12/28

    3. Assuming the model is statistically significant at the desired confidence level (usually 90 or 95%), the coefficient of determination or R2 of the model is an important part of the output. The R2 value is the percentage (or proportion) of the total variance in y explained by all the independent variables in the regression equation.

  • 7/29/2019 Data Analysis.ppt

    13/28

    Requirements for applying Multiple regression analysis

    1. The variables used (independent and dependent) are assumed to be either interval scaled or ratio scaled.

    2. Nominally scaled variables can be used as independent variables in a regression model, with dummy variable coding.

    3. If the dependent variable happens to be a nominally scaled one, discriminant analysis should be the technique used instead of regression.

    4. Dependent variable: essentially metric. Independent variables: metric or dummy.
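
    Dummy variable coding, mentioned in point 2, can be sketched as follows. The nominal variable (a sales region per territory) and the helper name dummy_code are hypothetical and are not part of the worked example in these slides. One category is held back as the baseline so that the dummies are not perfectly collinear with the intercept.

```python
def dummy_code(values, baseline):
    """Return (column_names, rows) of 0/1 dummies, omitting the baseline category."""
    categories = sorted(set(values) - {baseline})
    names = [f"is_{c}" for c in categories]
    # One row per observation, one 0/1 column per non-baseline category.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows


# Hypothetical nominal variable: the region each territory belongs to.
regions = ["north", "south", "west", "north", "west"]
names, rows = dummy_code(regions, baseline="north")

for region, row in zip(regions, rows):
    print(region, dict(zip(names, row)))
```

    A variable with k categories therefore enters the regression as k - 1 dummy columns, each of which can be treated like any other independent variable.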

  • 7/29/2019 Data Analysis.ppt

    14/28

    Worked Example: Problem

    A manufacturer and marketer of electric motors would like to build a regression model consisting of five or six independent variables, to predict sales. Past data has been collected for 15 sales territories, on Sales and six different independent variables. Build a regression model and recommend whether or not it should be used by the company.

  • 7/29/2019 Data Analysis.ppt

    15/28

    The data are for a particular year, in different sales territories in which the company operates, and the variables on which data are collected are as follows:

  • 7/29/2019 Data Analysis.ppt

    16/28

    Dependent Variable

    Y = sales in Rs. lakhs in the territory

    Independent Variables

    X1 = market potential in the territory (in Rs. lakhs).

    X2 = No. of dealers of the company in the territory.

    X3 = No. of salespeople in the territory.

    X4 = Index of competitor activity in the territory on a 5-point scale (1 = low, 5 = high level of activity by competitors).

    X5 = No. of service people in the territory.

    X6 = No. of existing customers in the territory.

    The following slide gives the data file:

  • 7/29/2019 Data Analysis.ppt

    17/28

              SALES  POTENTL  DEALERS   PEOPLE   COMPET  SERVICE   CUSTOM
     1            5       25        1        6        5        2       20
     2           60      150       12       30        4        5       50
     3           20       45        5       15        3        2       25
     4           11       30        2       10        3        2       20
     5           45       75       12       20        2        4       30
     6            6       10        3        8        2        3       16
     7           15       29        5       18        4        5       30
     8           22       43        7       16        3        6       40
     9           29       70        4       15        2        5       39
    10            3       40        1        6        5        2        5
    11           16       40        4       11        4        2       17
    12            8       25        2        9        3        3       10
    13           18       32        7       14        3        4       31
    14           23       73       10       10        4        3       43
    15           81      150       15       35        4        7       70

  • 7/29/2019 Data Analysis.ppt

    18/28

    Regression

    We will first run the regression model of the following form, by entering all the 6 'x' variables in the model:

    Y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ... Equation 1

    [OR Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ... Equation 1]

    and determine the values of b0, b1, b2, b3, b4, b5, and b6.
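
    Estimating Equation 1 can be sketched in plain Python by solving the normal equations. The data below are transcribed from the data slide of this deck; the helpers solve and fit_ols are illustrative names, and this is not the statistical package the slides used. In practice a library routine would normally do this job.

```python
# Rows as transcribed from the deck's data slide:
# (SALES, POTENTL, DEALERS, PEOPLE, COMPET, SERVICE, CUSTOM)
DATA = [
    (5, 25, 1, 6, 5, 2, 20), (60, 150, 12, 30, 4, 5, 50),
    (20, 45, 5, 15, 3, 2, 25), (11, 30, 2, 10, 3, 2, 20),
    (45, 75, 12, 20, 2, 4, 30), (6, 10, 3, 8, 2, 3, 16),
    (15, 29, 5, 18, 4, 5, 30), (22, 43, 7, 16, 3, 6, 40),
    (29, 70, 4, 15, 2, 5, 39), (3, 40, 1, 6, 5, 2, 5),
    (16, 40, 4, 11, 4, 2, 17), (8, 25, 2, 9, 3, 3, 10),
    (18, 32, 7, 14, 3, 4, 31), (23, 73, 10, 10, 4, 3, 43),
    (81, 150, 15, 35, 4, 7, 70),
]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows):
    """Least-squares coefficients [b0, b1, ..., bk]; the first column of rows is y."""
    y = [float(r[0]) for r in rows]
    X = [[1.0] + [float(v) for v in r[1:]] for r in rows]  # leading 1 = intercept
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


coef = fit_ols(DATA)
y = [r[0] for r in DATA]
fitted = [coef[0] + sum(b * v for b, v in zip(coef[1:], r[1:])) for r in DATA]
mean_y = sum(y) / len(y)
ss_total = sum((v - mean_y) ** 2 for v in y)
ss_resid = sum((v - f) ** 2 for v, f in zip(y, fitted))
r_squared = 1 - ss_resid / ss_total

print("coefficients b0..b6:", [round(c, 5) for c in coef])
print("R-squared:", round(r_squared, 6))
```

    If the transcription of the data slide is faithful, the coefficients and R-squared should come out close to the figures reported on the results slides. Any difference would point to a transcription problem in the data rather than in the method.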

  • 7/29/2019 Data Analysis.ppt

    19/28

    MULTIPLE REGRESSION RESULTS:

    All independent variables were entered in one block

    Dependent Variable: SALES

    Multiple R: .988531605

    Multiple R-Square: .977194734

    Adjusted R-Square: .960090784

    Number of cases: 15

  • 7/29/2019 Data Analysis.ppt

    20/28

    The ANOVA Table

    STAT. MULTIPLE REGRESS.

    Analysis of Variance; Depen.Var: SALES (regdata1.sta)

    Effect      Sums of Squares   df   Mean Squares          F   p-level
    Regress.           6609.484    6       1101.581   57.13269   .000004
    Residual            154.249    8         19.281
    Total              6763.733

    From the analysis of variance table, the last column indicates the p-level to be 0.000004. This indicates that the model is statistically significant at a confidence level of (1 - 0.000004) * 100, or (0.999996) * 100, or 99.9996 percent.
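
    The summary statistics of the model follow from the sums of squares by simple arithmetic. The sketch below recomputes them from the ANOVA figures reported above; the variable names are illustrative.

```python
import math

# Figures reported in the ANOVA table of the slides.
ss_regression = 6609.484
ss_residual = 154.249
n_cases = 15
n_predictors = 6

df_regression = n_predictors
df_residual = n_cases - n_predictors - 1  # 15 - 6 - 1

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_statistic = ms_regression / ms_residual
r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n_cases - 1) / df_residual
std_error_of_estimate = math.sqrt(ms_residual)

print(f"F({df_regression},{df_residual}) = {f_statistic:.3f}")
print(f"R-squared = {r_squared:.6f}, adjusted = {adjusted_r_squared:.6f}")
print(f"Std. error of estimate = {std_error_of_estimate:.4f}")
```

    These values agree with the reported F of 57.133, R-square of .977, adjusted R-square of .960 and standard error of estimate of 4.391.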

  • 7/29/2019 Data Analysis.ppt

    21/28

    STAT. MULTIPLE REGRESS.

    Regression Summary for Dependent Variable: SALES

    R = .98853160   R2 = .97719473   Adjusted R2 = .96009078

    F(6,8) = 57.133   p < .00000   Std. Error of Estimate: 4.3910

    N = 15

                  BETA  St.Err. of BETA         B  St.Err. of B      t(8)   p-level
    Intercept                             -3.1729      5.813394   -.54581   .600084
    POTENTL    .439073          .144411    .22685       .074611   3.04044   .016052
    DEALERS    .164315          .126591    .81938       .631266   1.29800   .230457
    PEOPLE     .413967          .158646   1.09104       .418122   2.60937   .031161
    COMPET     .084871          .060074  -1.89270      1.339712  -1.41276   .195427
    SERVICE    .040806          .116511   -.54925      1.568233   -.35024   .735204
    CUSTOM     .050490          .149302    .06594       .095002    .33817   .743935

  • 7/29/2019 Data Analysis.ppt

    22/28

    Column 4 of the table, titled B, lists all the coefficients for the model. These are:

    a (intercept) = -3.17298

    b1 = .22685

    b2 = .81938

    b3 = 1.09104

    b4 = -1.89270

    b5 = -0.54925

    b6 = 0.06594

    Substituting these values of a, b1, b2, ..., b6 in Equation 1, we can write the equation (rounding off all coefficients to 2 decimals) as

  • 7/29/2019 Data Analysis.ppt

    23/28

    Sales = -3.17 + 0.23 (potential) + 0.82 (dealers) + 1.09 (salespeople) - 1.89 (competitor activity) - 0.55 (service people) + 0.07 (existing customers)

    [Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ... Equation 1]

    The estimated change in sales for every unit increase or decrease in an independent variable is given by the coefficient of that variable. For instance, if the number of salespeople is increased by 1, sales in Rs. lakhs are estimated to increase by 1.09, if all other variables are unchanged. Similarly, if 1 more dealer is added, sales are expected to increase by 0.82 lakh, if all other variables are unchanged.

  • 7/29/2019 Data Analysis.ppt

    24/28

    The SERVICE variable does not make much intuitive sense. If we increase the number of service people, sales are estimated to decrease, according to the -0.55 coefficient of the variable "No. of Service People" (SERVICE).

    Looking at the individual variable t tests, we find that the coefficient of the variable SERVICE is statistically not significant (p-level 0.735204). Therefore, the coefficient for SERVICE is not to be used in interpreting the regression, as it may lead to wrong conclusions.

    Strictly speaking, only two variables, potential (POTENTL) and No. of salespeople (PEOPLE), are statistically significant at the 90 percent confidence level, since their p-level is less than 0.10. One should
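
    The screening of variables by p-level can be sketched as below. The coefficients and p-levels are copied from the regression summary table; the function name significant_variables and its confidence parameter are illustrative.

```python
# (variable, coefficient B, p-level) as reported in the regression summary table.
SUMMARY = [
    ("POTENTL", 0.22685, 0.016052),
    ("DEALERS", 0.81938, 0.230457),
    ("PEOPLE", 1.09104, 0.031161),
    ("COMPET", -1.89270, 0.195427),
    ("SERVICE", -0.54925, 0.735204),
    ("CUSTOM", 0.06594, 0.743935),
]


def significant_variables(summary, confidence=0.90):
    """Names of the variables whose p-level is below 1 - confidence."""
    alpha = 1.0 - confidence
    return [name for name, _b, p_level in summary if p_level < alpha]


significant = significant_variables(SUMMARY)
print("Significant at 90 percent confidence:", significant)

# Confidence level implied by each p-level, computed as (1 - p) * 100.
for name, _b, p_level in SUMMARY:
    print(f"{name:8s} p = {p_level:.6f}  confidence = {(1 - p_level) * 100:.2f}%")
```

    With the reported p-levels, only POTENTL and PEOPLE pass the 90 percent screen. They would also pass a 95 percent screen, since both p-levels are below 0.05.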

  • 7/29/2019 Data Analysis.ppt

    25/28

    Different modes of entering independent variables in the model

    Enter

    Forward stepwise regression

    Backward stepwise regression

    Stepwise regression

  • 7/29/2019 Data Analysis.ppt

    26/28

    The final model

    Sales = -10.6164 + .2433 (POTENTL) + 1.4244 (PEOPLE) ... Equation 3

    Predictions:

    If potential in a territory were to be Rs. 50 lakhs, and the territory had 6 salespeople, then expected sales, using the above equation, would be

    = -10.6164 + .2433(50) + 1.4244(6)

    = 10.095 lakhs.

    Similarly, we could use this model to make predictions regarding sales in any territory for which Potential and No. of Sales People were known.
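
    The final model and the prediction can be reproduced with a short sketch. It refits the two-variable model from the SALES, POTENTL and PEOPLE columns as transcribed from the data slide, using the closed-form normal equations for two predictors. The function names are illustrative and not taken from the slides.

```python
# SALES, POTENTL and PEOPLE columns as transcribed from the deck's data slide.
sales = [5, 60, 20, 11, 45, 6, 15, 22, 29, 3, 16, 8, 18, 23, 81]
potential = [25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150]
people = [6, 30, 15, 10, 20, 8, 18, 16, 15, 6, 11, 9, 14, 10, 35]


def centered_cross_product(u, v):
    """Sum of (u - mean_u) * (v - mean_v) over all observations."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


def fit_two_predictors(y, x1, x2):
    """Least-squares (b0, b1, b2) for y = b0 + b1*x1 + b2*x2."""
    s11 = centered_cross_product(x1, x1)
    s22 = centered_cross_product(x2, x2)
    s12 = centered_cross_product(x1, x2)
    s1y = centered_cross_product(x1, y)
    s2y = centered_cross_product(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    n = len(y)
    b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    return b0, b1, b2


b0, b1, b2 = fit_two_predictors(sales, potential, people)


def predict_sales(potential_lakhs, salespeople):
    """Expected sales in Rs. lakhs under the fitted two-variable model."""
    return b0 + b1 * potential_lakhs + b2 * salespeople


predicted = predict_sales(50, 6)
print(f"Sales = {b0:.4f} + {b1:.4f} (POTENTL) + {b2:.4f} (PEOPLE)")
print(f"Predicted sales for potential 50, 6 salespeople: {predicted:.3f} lakhs")
```

    The fitted coefficients agree with Equation 3 to about four decimal places. The prediction differs from 10.095 only in the third decimal, because the slide uses the rounded coefficients.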

  • 7/29/2019 Data Analysis.ppt

    27/28

    Recommended usage

    1. It is recommended that for serious decision-making, there has to be a priori knowledge of the variables which are likely to affect y, and only such variables should be used in the regression analysis.

    2. For exploratory research, the hit-and-trial approach may be used.

    3. It is also recommended that unless the model is itself significant at the desired confidence level (as evidenced by the F test results printed out for the model), the R value should not be interpreted.

  • 7/29/2019 Data Analysis.ppt

    28/28

    Multicollinearity and how to tackle it

    Multicollinearity: interrelationship of the various independent variables

    It is essential to verify whether independent variables are highly correlated with each other. If they are, this may indicate that they are not independent of each other, and we may be able to use only 1 or 2 of them to predict the dependent variable.

    Independent variables which are highly correlated with each other should not be included in the model together.
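
    One simple way to check for multicollinearity is to compute the pairwise correlations among the independent variables and flag the high ones. The sketch below does this for the six independent variables as transcribed from the data slide. The 0.8 cut-off is an assumed rule of thumb, not a figure from these slides, and the function names are illustrative.

```python
import math
from itertools import combinations

# Independent variables as transcribed from the deck's data slide.
PREDICTORS = {
    "POTENTL": [25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150],
    "DEALERS": [1, 12, 5, 2, 12, 3, 5, 7, 4, 1, 4, 2, 7, 10, 15],
    "PEOPLE": [6, 30, 15, 10, 20, 8, 18, 16, 15, 6, 11, 9, 14, 10, 35],
    "COMPET": [5, 4, 3, 3, 2, 2, 4, 3, 2, 5, 4, 3, 3, 4, 4],
    "SERVICE": [2, 5, 2, 2, 4, 3, 5, 6, 5, 2, 2, 3, 4, 3, 7],
    "CUSTOM": [20, 50, 25, 20, 30, 16, 30, 40, 39, 5, 17, 10, 31, 43, 70],
}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def highly_correlated_pairs(columns, threshold=0.8):
    """Pairs of variables whose absolute correlation reaches the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


flagged = highly_correlated_pairs(PREDICTORS)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b}: r = {r}")
```

    Pairs flagged this way are candidates for keeping only one of the two variables. Note that the two variables in the final model, POTENTL and PEOPLE, are themselves fairly strongly correlated (r of about 0.88), so a cut-off of this kind is a prompt for judgment rather than a hard rule.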