data analysis.ppt

7/29/2019 Data Analysis.ppt

1/28

Data Analysis

Univariate Analysis

Bivariate Analysis

Multivariate Analysis


2/28

Three Types of Analysis

We can classify analysis into three types:

1. Univariate, involving a single variable at a time,

2. Bivariate, involving two variables at a time, and

3. Multivariate, involving three or more variables simultaneously.


3/28

Revision: Application Areas: Correlation

1. Correlation and regression are generally performed together. Correlation analysis measures the degree of association between two sets of quantitative data. The correlation coefficient quantifies this association. Its value ranges from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).


4/28

2. For example, how are sales of product A correlated with sales of product B? Or, how is the advertising expenditure correlated with other promotional expenditure? Or, are daily ice cream sales correlated with daily maximum temperature?


5/28

3. Correlation does not necessarily mean there is a causal effect. Given any two strings of numbers, there will be some correlation between them. It does not imply that one variable is causing a change in another, or is dependent upon another.

4. In many applications, correlation analysis is followed by regression analysis.
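The correlation coefficient described in this section can be computed directly from its definition. The sketch below is a minimal illustration, not taken from the slides: the function name pearson_r and the temperature and ice cream figures are hypothetical, chosen only to mirror the ice cream example above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily maximum temperatures (deg C) and ice cream sales (units).
temperature = [24, 27, 30, 33, 36]
ice_cream_sales = [110, 135, 160, 170, 205]
print(round(pearson_r(temperature, ice_cream_sales), 3))
```

A value near +1 or -1 only says that the two series move together. As point 3 notes, it says nothing about which variable, if either, causes the other.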


6/28

Application Areas: Regression

1. The main objective of regression analysis is to explain the variation in one variable (called the dependent variable), based on the variation in one or more other variables (called the independent variables).

2. The application areas are in explaining variations in sales of a product based on advertising expenses, or number of salespeople, or number of sales offices, or on all the above variables.

3. If there is only one dependent variable and one independent variable is used to explain the variation in it, then the model is known as a simple regression model.


7/28

4. If multiple independent variables are used to explain the variation in a dependent variable, it is called a multiple regression model.

5. Even though the form of the regression equation could be either linear or non-linear, we will limit our discussion to linear (straight line) models.
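For the case of one dependent and one independent variable, the straight-line fit can be sketched with the usual least-squares formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). This is an illustrative sketch only: the function name fit_simple_regression and the advertising and sales figures are hypothetical and do not come from the slides.

```python
def fit_simple_regression(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical advertising expenses and sales for five periods.
advertising = [10, 12, 15, 18, 20]
sales = [52, 58, 66, 75, 80]
b0, b1 = fit_simple_regression(advertising, sales)
print(f"sales = {b0:.2f} + {b1:.2f} * advertising")
```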


8/28

The general regression model (linear) is of the type

Y = b0 + b1x1 + b2x2 + ... + bnxn

(OR Y = a + b1x1 + b2x2 + ... + bnxn)

where

y is the dependent variable

x1, x2, x3, ..., xn are the independent variables expected to be related to y and expected to explain or predict y.

b1, b2, b3, ..., bn are the coefficients of the respective independent variables, which will


9/28

Purposes of Regression Analysis

To establish the relationship between a dependent variable (outcome) and a set of independent (explanatory) variables

To identify the relative importance of the different independent (explanatory) variables on the outcome

To make predictions


10/28

Steps of Regression Analysis

Step 1: Construct a regression model

Step 2: Estimate the regression and interpret the result

Step 3: Conduct diagnostic analysis of the results

Step 4: Change the original regression model if necessary

Step 5: Make predictions


11/28

DATA (INPUT / OUTPUT)

1. Input data on y and each of the x variables is required to do a regression analysis. This data is input into a computer package to perform the regression analysis.

2. The output consists of the b coefficients for all the independent variables in the model. It also gives the results of a t test for the significance of each variable in the model, and the results of the F test for the model as a whole.


12/28

3. Assuming the model is statistically significant at the desired confidence level (usually 90 or 95%), the coefficient of determination, or R², of the model is an important part of the output. The R² value is the percentage (or proportion) of the total variance in y explained by all the independent variables in the regression equation.


13/28

Requirements for Applying Multiple Regression Analysis

1. The variables used (independent and dependent) are assumed to be either interval scaled or ratio scaled.

2. Nominally scaled variables can be used as independent variables in a regression model, with dummy variable coding.

3. If the dependent variable happens to be a nominally scaled one, discriminant analysis should be the technique used instead of regression.

4. In summary: the dependent variable must essentially be metric; independent variables may be metric or dummy.
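Dummy variable coding, mentioned in point 2, turns a nominal variable with k categories into k - 1 columns of 0s and 1s, with one category left out as the baseline. The sketch below assumes a hypothetical nominal variable (the region of each territory) and a hypothetical helper named dummy_code; neither appears in the slides.

```python
def dummy_code(values, baseline):
    """Return one 0/1 column per category, leaving out the baseline category."""
    categories = sorted(set(values) - {baseline})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical nominal variable: the region each sales territory belongs to.
region = ["North", "South", "West", "South", "North"]
dummies = dummy_code(region, baseline="North")
for name, column in dummies.items():
    print(name, column)
```

Leaving out the baseline keeps the dummy columns from summing to the intercept column, so each dummy coefficient is read as a difference from the baseline category.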


14/28

Worked Example: Problem

A manufacturer and marketer of electric motors would like to build a regression model consisting of five or six independent variables, to predict sales. Past data has been collected for 15 sales territories, on Sales and six different independent variables. Build a regression model and recommend whether or not it should be used by the company.


15/28

The data are for a particular year, in the different sales territories in which the company operates, and the variables on which data are collected are as follows:


16/28

Dependent Variable

Y = sales in Rs. lakhs in the territory

Independent Variables

X1 = market potential in the territory (in Rs. lakhs).

X2 = No. of dealers of the company in the territory.

X3 = No. of salespeople in the territory.

X4 = Index of competitor activity in the territory on a 5-point scale (1 = low, 5 = high level of activity by competitors).

X5 = No. of service people in the territory.

X6 = No. of existing customers in the territory.

The following slide gives the data file:



17/28

Territory  SALES  POTENTL  DEALERS  PEOPLE  COMPET  SERVICE  CUSTOM
        1      5       25        1       6       5        2      20
        2     60      150       12      30       4        5      50
        3     20       45        5      15       3        2      25
        4     11       30        2      10       3        2      20
        5     45       75       12      20       2        4      30
        6      6       10        3       8       2        3      16
        7     15       29        5      18       4        5      30
        8     22       43        7      16       3        6      40
        9     29       70        4      15       2        5      39
       10      3       40        1       6       5        2       5
       11     16       40        4      11       4        2      17
       12      8       25        2       9       3        3      10
       13     18       32        7      14       3        4      31
       14     23       73       10      10       4        3      43
       15     81      150       15      35       4        7      70


18/28

Regression

We will first run a regression model of the following form, entering all six x variables in the model:

Y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ... Equation 1

[OR Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ... Equation 1]

and determine the values of b0, b1, b2, b3, b4, b5, and b6.
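As a rough sketch of what the computer package does with Equation 1, the coefficients can be estimated by solving the least-squares normal equations. The rows below are transcribed from the data file slide, so any transcription slip would change the result. The helper names solve and fit_ols are hypothetical, and a statistics package would normally be used instead of this hand-rolled solver.

```python
# One row per territory: SALES, POTENTL, DEALERS, PEOPLE, COMPET, SERVICE, CUSTOM
ROWS = [
    (5, 25, 1, 6, 5, 2, 20), (60, 150, 12, 30, 4, 5, 50),
    (20, 45, 5, 15, 3, 2, 25), (11, 30, 2, 10, 3, 2, 20),
    (45, 75, 12, 20, 2, 4, 30), (6, 10, 3, 8, 2, 3, 16),
    (15, 29, 5, 18, 4, 5, 30), (22, 43, 7, 16, 3, 6, 40),
    (29, 70, 4, 15, 2, 5, 39), (3, 40, 1, 6, 5, 2, 5),
    (16, 40, 4, 11, 4, 2, 17), (8, 25, 2, 9, 3, 3, 10),
    (18, 32, 7, 14, 3, 4, 31), (23, 73, 10, 10, 4, 3, 43),
    (81, 150, 15, 35, 4, 7, 70),
]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(ys, xs):
    """Return [b0, b1, ..., bk] solving the normal equations (X'X) b = X'y."""
    design = [[1.0] + [float(v) for v in row] for row in xs]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

sales = [row[0] for row in ROWS]
predictors = [row[1:] for row in ROWS]
coefs = fit_ols(sales, predictors)
fitted = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in predictors]
mean_y = sum(sales) / len(sales)
sst = sum((y - mean_y) ** 2 for y in sales)        # total sum of squares
sse = sum((y - f) ** 2 for y, f in zip(sales, fitted))  # residual sum of squares
print("coefficients:", [round(b, 5) for b in coefs])
print("R-square:", round(1 - sse / sst, 6))
```

If the transcription is faithful, the printed coefficients and R-square should be close to the package output reported on the following slides.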


19/28

MULTIPLE REGRESSION RESULTS:

All independent variables were entered in one block

Dependent Variable: SALES

Multiple R: .988531605

Multiple R-Square: .977194734

Adjusted R-Square: .960090784

Number of cases: 15



20/28

The ANOVA Table

STAT. MULTIPLE REGRESS.

Analysis of Variance; Depen. Var: SALES (regdata1.sta)

Effect     Sums of Squares   df   Mean Squares   F          p-level
Regress.   6609.484           6   1101.581       57.13269   .000004
Residual    154.249           8     19.281
Total      6763.733

From the analysis of variance table, the last column indicates the p-level to be 0.000004. This indicates that the model is statistically significant at a confidence level of (1 - 0.000004) * 100, or 99.9996 percent.
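The summary statistics reported for this model follow arithmetically from the sums of squares in the ANOVA table. The sketch below recomputes them using the standard formulas; the variable names are illustrative, and only the two sums of squares, the number of cases, and the number of independent variables are taken from the slides.

```python
import math

# Inputs copied from the ANOVA table and the results slide.
ss_regression = 6609.484
ss_residual = 154.249
n_cases = 15
n_predictors = 6

ss_total = ss_regression + ss_residual
df_residual = n_cases - n_predictors - 1

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / n_predictors
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual
r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * (n_cases - 1) / df_residual
std_error_of_estimate = math.sqrt(ms_residual)

print(f"F({n_predictors},{df_residual}) = {f_stat:.3f}")
print(f"R-square = {r_square:.6f}, adjusted R-square = {adj_r_square:.6f}")
print(f"Std. error of estimate = {std_error_of_estimate:.4f}")
```

The p-level itself comes from the F distribution with (6, 8) degrees of freedom and is not recomputed here.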


21/28

STAT. MULTIPLE REGRESS.

Regression Summary for Dependent Variable: SALES

R = .98853160   R² = .97719473   Adjusted R² = .96009078

F(6,8) = 57.133   p < .00000   Std. Error of Estimate: 4.3910

N = 15

            BETA       St. Err. of BETA   B          St. Err. of B   t(8)       p-level
Intercept                                 -3.17298   5.813394        -.54581    .600084
POTENTL      .439073   .144411             .22685     .074611        3.04044    .016052
DEALERS      .164315   .126591             .81938     .631266        1.29800    .230457
PEOPLE       .413967   .158646            1.09104     .418122        2.60937    .031161
COMPET      -.084871   .060074           -1.89270    1.339712       -1.41276    .195427
SERVICE     -.040806   .116511            -.54925    1.568233        -.35024    .735204
CUSTOM       .050490   .149302             .06594     .095002         .33817    .743935



22/28

Column 4 of the table, titled B, lists all the coefficients for the model. These are:

a (intercept) = -3.17298

b1 = .22685

b2 = .81938

b3 = 1.09104

b4 = -1.89270

b5 = -0.54925

b6 = 0.06594

Substituting these values of a, b1, b2, ..., b6 in Equation 1, we can write the equation (rounding off all coefficients to 2 decimals) as



23/28

Sales = -3.17 + 0.23 (potential) + 0.82 (dealers) + 1.09 (salespeople) - 1.89 (competitor activity) - 0.55 (service people) + 0.07 (existing customers)

[Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ... Equation 1]

The estimated change in sales for every unit increase or decrease in an independent variable is given by the coefficient of that variable. For instance, if the number of salespeople is increased by 1, sales are estimated to increase by Rs. 1.09 lakhs, if all other variables are unchanged. Similarly, if 1 more dealer is added, sales are expected to increase by Rs. 0.82 lakh, if all other variables are unchanged.
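The reading of a coefficient as "change in sales per unit change, other variables held fixed" can be checked by evaluating the rounded equation twice. In the sketch below the dictionary keys and the function name predict_sales are hypothetical labels; the coefficients are the rounded values from the equation above, and the base territory uses the values of territory 3 in the data file.

```python
# Rounded coefficients of the fitted six-variable equation.
INTERCEPT = -3.17
COEFS = {
    "potential": 0.23,
    "dealers": 0.82,
    "salespeople": 1.09,
    "competitor_activity": -1.89,
    "service_people": -0.55,
    "existing_customers": 0.07,
}

def predict_sales(territory):
    """Estimated sales in Rs. lakhs for a dict of independent-variable values."""
    return INTERCEPT + sum(COEFS[name] * territory[name] for name in COEFS)

# Values of territory 3 from the data file.
base = {
    "potential": 45,
    "dealers": 5,
    "salespeople": 15,
    "competitor_activity": 3,
    "service_people": 2,
    "existing_customers": 25,
}
# Same territory with one more salesperson, everything else unchanged.
one_more_salesperson = dict(base, salespeople=base["salespeople"] + 1)

change = predict_sales(one_more_salesperson) - predict_sales(base)
print(f"Estimated change in sales: {change:.2f} lakhs")
```

The difference equals the salespeople coefficient whatever base values are chosen, because the model is linear.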



24/28

The SERVICE variable does not make too much intuitive sense. If we increase the number of service people, sales are estimated to decrease according to the -0.55 coefficient of the variable "No. of Service People" (SERVICE).

If we now look at the individual variable t tests, we find that the coefficient of the variable SERVICE is not statistically significant (p-level 0.735204). Therefore, the coefficient for SERVICE is not to be used in interpreting the regression, as it may lead to wrong conclusions.

Strictly speaking, only two variables, potential (POTENTL) and No. of sales people (PEOPLE), are statistically significant at the 90 percent confidence level, since their p-levels are less than 0.10. One should
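The screening described here amounts to comparing each variable's p-level with 1 minus the chosen confidence level. A minimal sketch, using the p-levels from the regression summary table and a hypothetical helper named significant_variables:

```python
# p-levels copied from the regression summary table.
P_LEVELS = {
    "POTENTL": 0.016052,
    "DEALERS": 0.230457,
    "PEOPLE": 0.031161,
    "COMPET": 0.195427,
    "SERVICE": 0.735204,
    "CUSTOM": 0.743935,
}

def significant_variables(p_levels, confidence=0.90):
    """Names of variables whose p-level is below 1 - confidence."""
    alpha = 1 - confidence
    return [name for name, p in p_levels.items() if p < alpha]

print(significant_variables(P_LEVELS, confidence=0.90))
```

With these p-levels the same two variables also pass at the 95 percent level, and none pass at 99 percent.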


25/28

Different modes of entering independent variables in the model

Enter

Forward Stepwise Regression

Backward Stepwise Regression

Stepwise Regression



26/28

The final model

Sales = -10.6164 + .2433 (POTENTL) + 1.4244 (PEOPLE) ... Equation 3

Predictions: If potential in a territory were Rs. 50 lakhs, and the territory had 6 salespeople, then expected sales, using the above equation, would be

= -10.6164 + .2433(50) + 1.4244(6)

= 10.095 lakhs.

Similarly, we could use this model to make predictions regarding sales in any territory for which Potential and No. of Sales People were known.
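Equation 3 is simple enough to wrap in a small function and apply to several territories at once. The function name predict_sales is a hypothetical label; the three comparison rows are the first three territories of the data file, shown with their actual sales.

```python
def predict_sales(potential, salespeople):
    """Equation 3: expected sales in Rs. lakhs from POTENTL and PEOPLE."""
    return -10.6164 + 0.2433 * potential + 1.4244 * salespeople

# The worked example from the slide: potential Rs. 50 lakhs, 6 salespeople.
print(round(predict_sales(50, 6), 3))

# (POTENTL, PEOPLE, actual SALES) for territories 1 to 3 of the data file.
territories = [(25, 6, 5), (150, 30, 60), (45, 15, 20)]
for potential, people, actual in territories:
    estimate = predict_sales(potential, people)
    print(f"potential={potential:>3} people={people:>2} "
          f"predicted={estimate:6.2f} actual={actual}")
```

Comparing predicted with actual sales for known territories gives a quick feel for how far individual predictions can be off, even when the overall fit is good.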


27/28

Recommended usage

1. It is recommended that for serious decision-making, there should be a priori knowledge of the variables which are likely to affect y, and only such variables should be used in the regression analysis.

2. For exploratory research, the hit-and-trial approach may be used.

3. It is also recommended that unless the model is itself significant at the desired confidence level (as evidenced by the F test results printed out for the model), the R² value should not be interpreted.


28/28

Multicollinearity and how to tackle it

Multicollinearity: Interrelationship of the various independent variables

It is essential to verify whether independent variables are highly correlated with each other. If they are, this may indicate that they are not independent of each other, and we may be able to use only 1 or 2 of them to predict the dependent variable.

Independent variables which are highly correlated with each other should not be included in the model together.
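One simple way to carry out this check is to compute the correlation between every pair of independent variables and flag the pairs above some cut-off. The sketch below uses the six independent variables as transcribed from the data file. The cut-off of 0.8 is an assumption chosen for illustration (the slides do not give one), and the helper names are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.8):
    """Pairs of columns whose absolute correlation is at least the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Independent variables transcribed from the data file (15 territories).
PREDICTORS = {
    "POTENTL": [25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150],
    "DEALERS": [1, 12, 5, 2, 12, 3, 5, 7, 4, 1, 4, 2, 7, 10, 15],
    "PEOPLE": [6, 30, 15, 10, 20, 8, 18, 16, 15, 6, 11, 9, 14, 10, 35],
    "COMPET": [5, 4, 3, 3, 2, 2, 4, 3, 2, 5, 4, 3, 3, 4, 4],
    "SERVICE": [2, 5, 2, 2, 4, 3, 5, 6, 5, 2, 2, 3, 4, 3, 7],
    "CUSTOM": [20, 50, 25, 20, 30, 16, 30, 40, 39, 5, 17, 10, 31, 43, 70],
}

for pair in highly_correlated_pairs(PREDICTORS, threshold=0.8):
    print(pair)
```

Any pair that is printed is a candidate for keeping only one of the two variables in the model.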
