Correlation and Regression


Page 1:

Correlation and Regression

Page 2:

Correlation. 'Correlation' is a statistical tool which measures the strength of the linear relationship between two variables.

It cannot measure a strong non-linear relationship, and it cannot establish cause and effect. "Two variables are said to be correlated if a change in one of the variables results in a change in the other variable."

Types of Correlation. There are two important types of correlation: (1) positive and negative correlation, and (2) linear and non-linear correlation.

Page 3:

Positive and Negative Correlation. If the values of the two variables deviate in the same direction, i.e. if an increase (or decrease) in the values of one variable results, on average, in a corresponding increase (or decrease) in the values of the other variable, the correlation is said to be positive.

Some examples of positive correlation are: heights and weights; household income and expenditure; price and supply of commodities; amount of rainfall and yield of crops.

Correlation between two variables is said to be negative or inverse if the variables deviate in opposite directions. Some examples of negative correlation are: volume and pressure of a perfect gas; current and resistance (keeping the voltage constant); price and demand of goods, etc.

Page 4:

The Coefficient of Correlation. It is denoted by 'r' and measures the degree of association between the values of the related variables in the data set.

-1 ≤ r ≤ 1. If r > 0, the variables are said to be positively correlated; if r < 0, negatively correlated. For any data set, if r = +1 the variables are perfectly positively correlated, if r = -1 they are perfectly negatively correlated, and if r = 0 they are uncorrelated.
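
As an aside (not part of the original slides), the coefficient r can also be computed directly; the sketch below uses NumPy and made-up rainfall/yield numbers purely for illustration.

import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations, so that -1 <= r <= 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

# Illustrative data: rainfall vs. crop yield (positively correlated).
rainfall = [80, 95, 110, 120, 140, 155]
crop_yield = [2.1, 2.4, 2.9, 3.0, 3.6, 3.9]
print(round(pearson_r(rainfall, crop_yield), 3))   # close to +1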

Page 5:

Page 6:

Correlation in Statistica.

Go to the "Statistics" tab and select "Basic statistics". Click the "Correlation matrices" tab and click OK. To get the correlation of only two variables, go to the "Two variable list" and select the two variables required. Click OK and then click on "Summary".

Page 7:

If you want the correlation matrix for all variables, go to the "Two variable list" tab, click "Select all", and click OK.

Click "Summary" and you will get the correlation matrix.

Page 8:

To get a scatter plot, click on "Scatter plot of variables", select the variables, and click OK.

You will get the scatter plot as shown on the right.

Page 9:

To get another kind of scatter plot, click on the "Graphs" option. You will get a series of scatter plots, one for each pair of variables. The second plot shows one of them.

Page 10:

Regression. Regression analysis, in a general sense, means the estimation or prediction of the unknown value of one variable from the known value of another variable.

If two variables are significantly correlated, and if there is some theoretical basis for doing so, it is possible to predict values of one variable from the other. This observation leads to the very important concept known as 'regression analysis'.

It is especially used in business and economics to study the relationship between two or more causally related variables and for the estimation of demand and supply curves, cost functions, production and consumption functions, and so on.

Page 11:

Thus, the general purpose of multiple regression is to learn more about the relationship between several independent (predictor) variables and a dependent (output) variable.

Suppose that the yield of a chemical process depends on the temperature and the catalyst concentration. A multiple regression model that describes this relationship is

Y = b0 + b1*X1 + b2*X2 + ε    (a)

where Y = yield, X1 = temperature, and X2 = catalyst concentration. This is a multiple linear regression model with two regressors. The term "linear" is used because equation (a) is a linear function of the unknown parameters bi.
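
As a rough illustration of model (a), the sketch below simulates yield data from assumed coefficients and recovers b0, b1, b2 by least squares; the numbers and variable names are invented and are not the presentation's data set.

import numpy as np

rng = np.random.default_rng(0)
n = 30
temp = rng.uniform(80, 100, n)          # X1: temperature
catalyst = rng.uniform(8, 12, n)        # X2: catalyst concentration
eps = rng.normal(0, 1.0, n)             # random error term
yield_ = 10 + 0.8 * temp + 2.5 * catalyst + eps   # Y = b0 + b1*X1 + b2*X2 + eps

X = np.column_stack([np.ones(n), temp, catalyst])  # design matrix with intercept
b, *_ = np.linalg.lstsq(X, yield_, rcond=None)
print(b)   # estimates of (b0, b1, b2), close to (10, 0.8, 2.5)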

Page 12:

Regression Models. Depending on the nature of the relationship, regression models are of two types.

Linear regression models, including (a) simple linear regression (one independent variable) and (b) multiple linear regression.

Non-linear regression models, including (a) polynomial regression, (b) exponential regression, etc.

Page 13:

Assumptions of the linear regression model.

The relationship between Y (the dependent variable) and the independent variables is linear.

The independent variables are mutually independent of each other.

The errors are uncorrelated with each other.

The error term has constant variance. The errors are normally distributed.

Page 14:

Data set and Objective. The current data set has been taken from a chemical process with two input (independent) parameters, Temperature and Catalyst feed rate, and one response (output) parameter, Viscosity of the yield.

Objective.
1. Establish the linear relation of the dependent variable with the independent variables.
2. Estimate the regression coefficients to find out which variables have a significant effect on the response variable.
3. Check the model adequacy against the assumptions.

Page 15:

Data set.

Page 16:

Assumption of Linearity. First of all, as is evident in the name multiple linear regression, it is assumed that the relationship between the variables is linear.

In practice this assumption can virtually never be confirmed; fortunately, multiple regression procedures are not greatly affected by minor deviations from it.

However, as a rule it is prudent to always look at bivariate scatter plots of the variables of interest.

If curvature in the relationships is evident, one may consider either transforming the variables or explicitly allowing for nonlinear components.

Page 17:

Scatter plot.

Go to the "Graph" option and select "Scatter plot". Click on the "Variable" tab and select the variables as shown above. Click OK and select the options "Multiple" and "Confidence"; this will plot multiple graphs in a single window. (If you have a large number of variables, plot them separately.)

Page 18:

The scatter plot establishes the linear relationship of the dependent variable with the independent variables.

Here Viscosity and Temperature are linearly related to each other, but Viscosity and Catalyst concentration are not.

Page 19:

Parameter Estimation.

The regression coefficient (b) is the average amount of change in the dependent variable (in the positive or negative direction, depending on the sign of b) when the independent variable changes by one unit and the other independents are held constant.

The b coefficient is the slope of the regression line. The intercept (constant, α) is the value of the dependent variable when all independents are set to zero.

For any independent variable, if the corresponding b > 0, then that variable is positively correlated with the dependent variable; otherwise it is negatively correlated.

OLS (ordinary least squares) is used to estimate the coefficients in such a way that the sum of the squared distances of all the points from the line is minimized.
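
For readers who want to see the OLS idea outside Statistica, a minimal sketch of the normal-equations solution is given below, on a small made-up data set.

import numpy as np

def ols_coefficients(X, y):
    """Solve the normal equations (X'X) b = X'y for b (intercept first)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])       # dependent variable
X = np.column_stack([np.ones_like(x1), x1])   # add intercept column
print(ols_coefficients(X, y))                 # approx [0.14, 1.96]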

Page 20:

The confidence interval of the regression coefficient: we can be 95% confident that the true regression coefficient for the population lies within this interval.

If the confidence interval includes 0, then there is no significant linear relationship between x and y.

The confidence interval of y indicates that 95 times out of a hundred, the true mean of y will be within the confidence limits around the observed mean of the n sampled observations.

SEE (standard error of estimate) is the standard deviation of the residuals.

In a good model, SEE will be markedly less than the standard deviation of the dependent variable.

It can be used to compare the accuracy of different models: the smaller the value, the better the model.
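
A hand-computed sketch of these quantities (SEE and the 95% confidence intervals of the coefficients) is shown below, assuming NumPy/SciPy and simulated data rather than Statistica's output.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 25
x1, x2 = rng.normal(size=(2, n))
y = 3.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.8, size=n)

X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1]                                   # number of parameters
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
s2 = resid @ resid / (n - p)                     # residual variance; SEE = sqrt(s2)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
t_crit = stats.t.ppf(0.975, df=n - p)
print("SEE:", round(np.sqrt(s2), 3))
for name, bi, sei in zip(["b0", "b1", "b2"], b, se):
    print(f"{name}: {bi:.3f}  95% CI ({bi - t_crit * sei:.3f}, {bi + t_crit * sei:.3f})")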

Page 21:

F-test and p-value: testing the overall significance of the multiple regression model.

It tests the null hypothesis H0: b1 = b2 = ... = bk = 0 against H1: at least one bi does not equal 0. If H0 is rejected (p < .05), we can conclude that at least one bi differs from zero and that the regression equation does a better job of predicting the actual values of y than the mean alone.

t-test: testing the significance of a single regression coefficient.

Is the independent variable xi useful in predicting the actual values of y?

For the individual t-test, H0: bi = 0 versus H1: bi ≠ 0. If H0 is rejected (p < .05), the corresponding X has a significant contribution to the dependent variable.
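
The same F and t statistics can be computed by hand; the sketch below does so with NumPy/SciPy on simulated data in which one regressor is deliberately irrelevant.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 30
x1, x2 = rng.normal(size=(2, n))
y = 2.0 + 1.2 * x1 + 0.0 * x2 + rng.normal(scale=1.0, size=n)   # x2 has no real effect

X = np.column_stack([np.ones(n), x1, x2])
k = X.shape[1] - 1                                   # number of regressors
b = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ b
resid = y - fitted
ss_res = resid @ resid
ss_reg = ((fitted - y.mean()) ** 2).sum()

# Overall F-test: H0: b1 = ... = bk = 0
F = (ss_reg / k) / (ss_res / (n - k - 1))
p_F = stats.f.sf(F, k, n - k - 1)

# Individual t-tests: H0: bi = 0
se = np.sqrt(np.diag((ss_res / (n - k - 1)) * np.linalg.inv(X.T @ X)))
t = b / se
p_t = 2 * stats.t.sf(np.abs(t), n - k - 1)
print(F, p_F)   # significant: at least one bi differs from zero
print(p_t)      # p for b2 should be large, since x2 is irrelevant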

Page 22:

R^2 (coefficient of determination) is the percentage of the variance in the dependent variable explained, uniquely or jointly, by the independent variables.

R-squared can also be interpreted as the proportionate reduction in error in estimating the dependent variable when the independents are known.

Adjusted R-square is an adjustment of R-square for when one has a large number of independents.

It is possible for R-square to become artificially high simply because some independent variables "explain" small parts of the variance of the dependent variable.

If there is a huge difference between R-square and Adjusted R-square, we can assume that some unimportant independent variables are present in the data set.

If the inclusion of a variable reduces Adjusted R-square, that variable can be identified as a nuisance parameter for the model.
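
A minimal sketch of R-square and Adjusted R-square, computed from any vector of observed and fitted values (toy numbers only):

import numpy as np

def r_squared(y, fitted):
    ss_res = ((y - fitted) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, fitted, k):
    """k = number of independent variables (excluding the intercept)."""
    n = len(y)
    r2 = r_squared(y, fitted)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Example: y and fitted values from any regression; here a toy illustration.
y = np.array([3.0, 5.0, 7.1, 8.9, 11.2])
fitted = np.array([3.1, 4.9, 7.0, 9.0, 11.1])
print(r_squared(y, fitted), adjusted_r_squared(y, fitted, k=1))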

Page 23:

Estimating coefficients.

Go to the "Statistics" tab and select "Multiple Regression". Select the variables, click OK, and click OK again. Now click "Summary regression results".

Page 24:

The left-hand table shows the model accuracy.

R-square describes the amount of variability explained by the independent variables; here it is approximately 93%.

Adjusted R-square gives an indication of whether any insignificant factor is present.

Adjusted R-square should be close to Multiple R-square; if it is much smaller, we should go for stepwise regression. (Adjusted R-square is always less than or equal to Multiple R-square.)

Page 25:

Interpretation of results. Here R-square and Adjusted R-square are very close to each other, which indicates a good model.

In regression analysis the R-square value always increases with the inclusion of parameters, but Adjusted R-square may not; when it does not, this indicates the presence of nuisance parameters in the model.

The p-value for the F-test is significant (left table), indicating that there is at least one variable with a significant contribution to the model.

The p-values for the t-tests are all significant (p < .05, second table), which indicates that all these variables have a significant effect on the response.

Page 26:

Multicollinearity. Definition: multicollinearity refers to excessive correlation among the independent variables.

Ideally the independent variables should be uncorrelated with each other (according to the assumptions).

If the correlation is excessive (some use the rule of thumb r > 0.90), the standard errors of the beta coefficients become large, making it difficult or impossible to assess the relative importance of the predictor variables.

Multicollinearity does not violate the OLS assumptions, however; OLS still gives unbiased estimates of the coefficients.

Page 27:

Detecting Multicollinearity

Tolerance: obtained from the regression of an independent variable on all the other independents, ignoring the dependent (tolerance is 1 minus the R-square of that regression).

As a rule of thumb, if Tolerance ≤ 0.10, a problem with multicollinearity is indicated.

VIF (variance inflation factor) is simply the reciprocal of tolerance.

As a rule of thumb, if VIF > 10, a problem with multicollinearity is indicated.

C.I. (condition indices) is another index for checking multicollinearity.

As a rule of thumb, if C.I. > 30, serious multicollinearity is present in the data set.
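
Tolerance and VIF can be computed directly from these definitions; the sketch below uses simulated data in which one predictor is built from the others, so the collinear variables show low tolerance and high VIF.

import numpy as np

def r2_of_regression(X, y):
    """R-square of regressing y on the columns of X (with an intercept)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    fitted = Xc @ b
    return 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(3)
n = 50
x1, x2 = rng.normal(size=(2, n))
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.05, size=n)   # nearly collinear
predictors = np.column_stack([x1, x2, x3])

for j in range(predictors.shape[1]):
    others = np.delete(predictors, j, axis=1)
    tol = 1 - r2_of_regression(others, predictors[:, j])
    print(f"x{j + 1}: tolerance = {tol:.3f}, VIF = {1 / tol:.1f}")
# Rule of thumb from the slide: tolerance <= 0.10 or VIF > 10 signals trouble.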

Page 28:

Some other indications of multicollinearity: none of the t-tests for the individual coefficients is statistically significant, yet the overall F statistic is.

This implies that some coefficients are insignificant because of multicollinearity.

Check how stable the coefficients are when different samples are used.

For example, you might randomly divide your sample into two parts. If the coefficients differ dramatically, multicollinearity may be a problem.

The correlation matrix can also be used to find out which independent variables are highly correlated (affected by multicollinearity).

Page 29:

How to perform this in Statistica?

In the "Advanced" tab, click on either the "Partial correlation" or the "Redundancy" tab.

You will get a result that contains Tolerance, partial correlations, semi-partial correlations, etc.

From the table it is clear that Tolerance > 0.10, so multicollinearity is not a threat for this data set.

Page 30:

Example of VIF

The data set contains information about the physical and chemical properties of some molecules.

Dependent variable: logP.

24 independent variables.

We will first find the VIF values and also check the correlation matrix.

Page 31:

Steps. Go to the "Statistics" tab, select "Advanced linear non-linear model", click "General linear model", and select "Multiple regression".

Select the variables, click OK, and click OK again.

Click on the "Matrix" tab, then select "Partial correlation".

The circled variables are highly affected by multicollinearity (VIF > 10).

Now we can create a correlation matrix to see which variables are correlated with each other.

Page 32:

Correlation matrix.

Go to the "Statistics" tab and select "Basic statistics/Tables", then select "Correlation matrices".

Click on "Two lists" and select the variables.

Click OK and then click "Summary correlation".

Page 33:

The correlation matrix will be obtained.

It is clear that a large number of variables are highly correlated with each other; they are colored red, like BO1-X and DN3, etc.

Page 34:

Methods for Dealing with Multicollinearity.

Several techniques have been proposed for dealing with the problems of multicollinearity. These are:

a. Collecting additional data: The additional data should be collected in a manner designed to break up the multicollinearity in the existing data set. This method is not always feasible, because of economic constraints or sampling problems.

b. Variable elimination: If two or three variables are highly linearly dependent, eliminating one regressor may help reduce multicollinearity. This also may not give satisfactory results, since the eliminated variable may contribute significantly to the predictive power.

c. Stepwise regression: The most effective method for eliminating multicollinearity. This method excludes, step by step, the variables affected by collinearity and tries to maximize the model accuracy.

Page 35:

Stepwise Regression.

Stepwise multiple regression, also called statistical regression, is a way of computing OLS regression in stages, as sketched below.

First step: the independent variable best correlated with the dependent variable is included in the equation.

Second step: the remaining independent variable with the highest partial correlation with the dependent, controlling for the first independent, is entered.

This process is repeated, at each stage partialling out previously entered independents, until the addition of a remaining independent does not increase R-squared by a significant amount (or until all variables are entered, of course).

Alternatively, the process can work backward, starting with all variables and eliminating independents one at a time until the elimination of one makes a significant difference in R-squared.
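
A minimal sketch of the forward version of this procedure, using R-squared gain as the entry criterion (Statistica's implementation also offers F-to-enter/remove rules); the data are simulated.

import numpy as np

def r2(X, y):
    Xc = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ b
    return 1 - (resid @ resid) / (((y - y.mean()) ** 2).sum())

def forward_stepwise(predictors, y, min_gain=0.01):
    """Add, at each step, the variable that raises R-square the most;
    stop when the improvement falls below min_gain."""
    selected, remaining = [], list(range(predictors.shape[1]))
    current = 0.0
    while remaining:
        gains = {j: r2(predictors[:, selected + [j]], y) for j in remaining}
        best = max(gains, key=gains.get)
        if gains[best] - current < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = gains[best]
    return selected, current

# Illustrative data: y depends on columns 0 and 2 only.
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4))
y = 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=60)
print(forward_stepwise(X, y))   # expected to pick columns 0 and 2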

Page 36:

Example of stepwise regression.

Go to the "Statistics" tab and select "Multiple Regression".

Select the variables and click OK.

Then click on the "Advanced" tab and select the circled options.

Click OK and select the circled option (second figure).

Page 37:

Click on the "Stepwise" tab and select the circled option.

Next click OK; you will get the second window.

Now click the "Next" tab repeatedly until all important variables are included in the model.

Then click on "Summary regression results" to get the model summary.

Page 38:

Residual Analysis.

Residuals are the differences between the observed values and those predicted by the regression equation. Residuals thus represent error in most statistical procedures.

Residual analysis is the most important part of multiple regression for diagnostic checking of the model assumptions.

Residual analysis is used for three main purposes: (1) to spot heteroscedasticity (e.g., increasing error as the observed Y value increases), (2) to spot outliers (influential cases), and (3) to identify other patterns of error (e.g., error associated with certain ranges of the X variables).

Page 39:

Assumptions about the Errors.

The following assumptions on the random errors are equivalent to the assumptions on the response variables, and they are tested via residual analysis.

(i) The random errors are independent.

(ii) The random errors are normally distributed.

(iii) The random errors have constant variance.

Page 40:

Assumption 3: the errors are uncorrelated with each other (detection of autocorrelation).

Some applications of regression involve regressor and response variables that have a natural sequential order over time. Such data are called time series data.

A characteristic of such data is that neighboring observations tend to be somewhat alike. This tendency is called autocorrelation.

Autocorrelation can also arise in laboratory experiments, because of the sequence in which experimental runs are done or because of drift in instrument calibration.

Randomization reduces the possibility of autocorrelated results. Parameter estimates may or may not be seriously affected by autocorrelation, but autocorrelation will bias the estimate of the variance, and any statistics computed from the variance, such as confidence intervals, will be wrong.

Page 41:

How to detect autocorrelation? Various statistical tests are available for detecting autocorrelation; among them, the Durbin-Watson test is the most widely used.

It is denoted by "D".

If the D value lies between 1.75 and 2.25, the residuals are uncorrelated.

If D < 1.75, the residuals are positively correlated, and

if D > 2.25, the residuals are negatively correlated.
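
The Durbin-Watson statistic itself is simple to compute from the residuals; a sketch on simulated, uncorrelated residuals:

import numpy as np

def durbin_watson(residuals):
    """D = sum((e_t - e_{t-1})^2) / sum(e_t^2) over the residual series."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(5)
e_independent = rng.normal(size=100)
print(round(durbin_watson(e_independent), 2))   # roughly 2 for uncorrelated residuals
# Rule of thumb from the slide: D in (1.75, 2.25) -> uncorrelated,
# D < 1.75 -> positive autocorrelation, D > 2.25 -> negative.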

Page 42:

Go to the "Statistics" tab and select "Multiple regression"; select the variables, click OK, and click OK again.

The first window will appear; click OK and you will get the second window.

Click on the "Advanced" tab and then click on "Durbin-Watson statistic".

Page 43:

The first column of the spreadsheet shows the D statistic, which is greater than 2.25.

According to the rule, the residuals are negatively correlated.

The second column shows the correlation value.

Here the residuals are correlated, but the magnitude is not very high (< 0.50), so we can accept this.

(Analysis done on the first data set, the chemical process data.)

Page 44:

Constant variance of residuals.

In linear regression analysis, one assumption of the fitted model is that the standard deviations of the error terms are constant and do not depend on the x-value.

Consequently, each probability distribution for y (the response variable) has the same standard deviation regardless of the x-value (predictor).

This assumption is called homoscedasticity.

How do we check this assumption? One simple method is to plot the residual values against the fitted (predicted) values, as sketched below. If this plot shows no systematic pattern, the assumption is fine; if it shows an abnormal or curved pattern, there is a problem.
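
Outside Statistica, the same check can be drawn with matplotlib; the sketch below fits a simple model to made-up data and plots residuals against predicted values.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 80)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=80)

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. predicted: a random band suggests constant variance")
plt.show()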

Page 45:

How to manage an abnormal condition? If the graph shows abnormality, some techniques are available to manage the situation.

The usual approach to dealing with inequality of variance is to apply a suitable transformation to either the independent variables or the response variable (generally, transformations of the response are employed to stabilize the variance).

We can also use the method of weighted least squares instead of ordinary least squares.

A curved plot may indicate nonlinearity; this could mean that other regressor variables are needed in the model, for example a squared term may be necessary.

A plot of residuals against predicted values may also reveal one or more unusually large residuals; these points are, of course, potential outliers.

We can exclude those points from the analysis.

Page 46:

Go to the "Statistics" tab and select "Multiple regression".

Select the variables, click OK, and click OK again.

You will get the first window; click on the "Residuals/assumptions/prediction" tab and then click on "Perform residual analysis".

You will get the second window; click on the "Scatter plot" tab and click on "Predicted v/s residuals".

Page 47:

The graph above shows the predicted vs. residuals scatter plot.

It is clear that a few of the points after the midpoint move upward or downward, which means that at those points there is some tendency toward larger residuals, positive or negative.

In other words, those points are not predicted properly.

Page 48:

Normality of Residuals. In regression analysis the last assumption is the normality of the residuals.

A small departure from the normality assumption does not affect the model greatly, but gross non-normality is potentially more serious, as the t and F tests and the confidence intervals depend on the normality assumption.

A very simple method of checking the normality assumption is to construct a normal probability plot of the residuals.

Small sample sizes (n ≤ 16) often produce normal plots that deviate substantially from linearity.

For larger sample sizes (n ≥ 32) the plots are much better behaved.

Usually at least 20 points or observations are required to get a stable normal probability plot.
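
A normal probability plot can also be produced with scipy.stats.probplot; the residuals below are simulated stand-ins for model residuals.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)
residuals = rng.normal(scale=0.8, size=40)   # stand-in for model residuals

stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()
# Points close to the straight line support the normality assumption.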

Page 49:

Normal probability plot.

From Multiple regression, go to the "Residuals" tab and select "Perform residual analysis".

Then click on "Probability plot" and select "Normal plot of residuals".

Page 50:

The normal plot shows a reasonably good fit to normality.

There are small deviations from linearity, which could probably be overcome by adding some new experimental points to the data set (the current data contain only 16 observations).

(Analysis done on the first data set, the chemical process data.)

Page 51:

Detecting outliers.

If the data include some extreme values (outliers), they may cause serious problems when checking the assumptions of regression analysis.

A few classical techniques exist for detecting outliers, such as the box-and-whisker plot.

Here we use residual-based techniques for detecting outliers after building the regression model.

One popular method is to check the standardized residual values, as sketched below.

If any value goes beyond roughly -3.5 to 3.5, that particular point is considered an outlier.
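
A rough sketch of this standardized-residual check on simulated data with one planted outlier (the +/-3.5 cut-off follows the slide's rule of thumb):

import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 40)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=40)
y[10] += 12.0                                   # plant an outlier

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
std_resid = resid / resid.std(ddof=2)           # ddof = number of fitted parameters

outliers = np.where(np.abs(std_resid) > 3.5)[0]
print(outliers)   # expected to flag case 10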

Page 52:

From Multiple regression, select the variables, click OK, and click OK again.

The first window will appear; select "Residuals/assumptions/prediction" and then click "Perform residual analysis".

The second window will appear; select "Outliers", activate "Standard residuals", and click on "Casewise plot of outliers".

Page 53:

The output above will appear. Check the fifth column (highlighted); in this case all points lie within -3 to +3, so there are no likely outliers. The case numbers are shown at the extreme left.

Page 54:

Thank you

Krishnendu Kundu

StatsoftIndia.

Email- [email protected]

Mobile - +919873119520.