Fundamentals of Data Analysis Lecture 11 Correlation and regression

Upload: shadow

Post on 23-Feb-2016



Page 1: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Fundamentals of Data Analysis

Lecture 11

Correlation and regression

Page 2: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Program for today

• Basic concepts
• Correlation diagram and correlation table
• Linear correlation
• Linear regression
• The correlation of the multiple variables
• Regression curves

Page 3: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Basic concepts

Correlation is defined as the statistical interdependence of measurements of different phenomena that either depend on a common cause or are in a direct causal relationship with each other.

Note, however, that the concept of correlation differs both from a causal relationship and from the notion of stochastic dependence between random variables.

An extreme case is the correlation of collinear random variables.

The correlation is said to be simple or positive when an increase in one variable is accompanied by an increase in the other. When an increase in one variable is accompanied by a decrease in the other, we are dealing with an inverse or negative correlation.

Page 4: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Basic concepts

In mathematical statistics, regression is an empirically determined functional relationship between correlated random variables.

Having established that the correlation between the studied traits is sufficiently strong, we proceed to find a regression function that predicts the value of one trait on the assumption that the other trait takes a specified value.

In practice, the most important case is linear regression, which corresponds to a linear relationship between the random variables under consideration. Although a purely linear relationship is rare in practice, linear regression is a convenient tool for obtaining approximate relationships.

Page 5: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Basic concepts

For more complex interdependencies non-linear regression is used, for example quadratic regression.

Two models of the data are distinguished:

• Model I, in which the values of the independent variable are known (well defined)

• Model II, in which the independent variable is itself random or subject to error.

Page 6: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Correlation table and correlation diagram

Suppose we have a general population with two measurable characteristics X and Y, both of which are random variables. If certain parameters of the distribution of the two-dimensional variable (X, Y) are unknown, the problem arises of estimating them from a random sample of n pairs of numbers (xi, yi). Treating xi and yi as the coordinates of a point on the plane, the sample can be represented graphically in a correlation diagram.

Page 7: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Correlation table and correlation diagram

To build the table, a distribution series is constructed for each of the features. First the ranges are calculated:

Rx = xmax - xmin, Ry = ymax - ymin

Then, on the basis of the sample size n, an appropriate number of classes k is chosen and the class length is calculated:

dx = Rx / k, dy = Ry / k

As the lower limit of the first class we take a value slightly below the minimum value, and as the upper limit of the last class a value slightly above the maximum value.
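The procedure above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, classes are assumed to be half-open intervals (lower limit included, upper limit excluded), and the demonstration uses the first five pairs of the casting example together with the class limits chosen there.

```python
def class_width(v_min, v_max, k):
    """Raw class length d = R / k, where R = v_max - v_min is the range."""
    return (v_max - v_min) / k


def class_index(value, lower, width, k):
    """0-based index of the class containing value.

    Assumption: class m covers [lower + m*width, lower + (m+1)*width).
    """
    idx = int((value - lower) // width)
    return min(max(idx, 0), k - 1)  # clamp values lying on the outer limits


def correlation_table(pairs, x_lower, dx, y_lower, dy, k):
    """Count the pairs falling into each (Y class, X class) cell."""
    table = [[0] * k for _ in range(k)]
    for x, y in pairs:
        row = class_index(y, y_lower, dy, k)
        col = class_index(x, x_lower, dx, k)
        table[row][col] += 1
    return table


# First five pairs (xi, yi) of the lecture's casting example.
sample = [(38.5, 5.5), (41.1, 4.8), (37.8, 5.0), (36.0, 4.9), (32.2, 5.1)]
table = correlation_table(sample, x_lower=31.0, dx=2.0, y_lower=3.25, dy=0.5, k=7)

# Column totals correspond to the "ni." row of a correlation table.
column_totals = [sum(row[c] for row in table) for c in range(7)]
```

In practice the raw class length returned by `class_width` is usually rounded to a convenient value before the table is built.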

Page 8: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Correlation table and correlation diagram

Page 9: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Linear correlation

The strength of the interdependence of two variables can be expressed numerically by many measures, the most popular of which is the Pearson correlation coefficient:

$$ \rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X\, \sigma_Y} $$

where the covariance is given by the relationship:

$$ \operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} $$

The estimator of the correlation coefficient ρ between the two studied features X and Y in the population is the sample correlation coefficient r, calculated from n pairs of results (xi, yi) using the equation:

Page 10: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Linear correlation

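$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

The quantity r², called the coefficient of determination, with (n-1) degrees of freedom, can also serve as an estimator of the correlation.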
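As an illustration, the sample correlation coefficient can be computed directly from the sums of deviations from the means. The sketch below uses only the Python standard library; the function name `pearson_r` and the small data set are hypothetical and are not taken from the lecture.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data lying exactly on straight lines.
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # perfectly increasing relationship
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # perfectly decreasing relationship
r_squared = r_up ** 2                      # coefficient of determination
```

The common factor 1/n cancels between numerator and denominator, which is why the sums can be used without dividing by the sample size.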

Page 11: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Linear correlation

The correlation coefficient takes values in the interval [-1, 1].

The absolute value of the coefficient describes the strength of the relationship: the closer it is to zero, the weaker the relationship, and the closer it is to 1 or -1, the stronger. A value of 1 or -1 indicates a perfect linear relationship. The sign of the coefficient describes the direction of the relationship: "+" indicates a positive relationship, i.e. an increase (decrease) in the value of one trait is accompanied by an increase (decrease) in the other; "-" indicates a negative relationship, i.e. an increase (decrease) in the value of one trait is accompanied by a decrease (increase) in the other.

Page 12: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Linear correlation

The following assessment of the strength of correlation, based on the absolute value of the coefficient, can be assumed (keeping in mind an appropriate sample size):

• below 0.1 - negligible
• from 0.1 to 0.3 - weak
• from 0.3 to 0.5 - moderate
• from 0.5 to 0.7 - high
• from 0.7 to 0.9 - very high
• above 0.9 - almost complete.

This scale is arbitrary.
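The scale above can be written as a small lookup function. This is a sketch only: the function name is hypothetical, the labels are English renderings of the scale, and assigning each boundary value to the higher class is an assumption, since the scale does not say where the boundaries belong.

```python
def correlation_strength(r):
    """Verbal label for a correlation coefficient, based on |r|."""
    a = abs(r)
    if a > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    # Assumption: a boundary value (e.g. exactly 0.3) falls into the higher class.
    if a < 0.1:
        return "negligible"
    if a < 0.3:
        return "weak"
    if a < 0.5:
        return "moderate"
    if a < 0.7:
        return "high"
    if a < 0.9:
        return "very high"
    return "almost complete"


labels = [correlation_strength(r) for r in (0.05, -0.25, 0.6878, -0.95)]
```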

Page 13: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Correlation table and correlation diagram

Example

N = 50 measurements of casting dimensions were made; the results are shown in the table. At the 95% confidence level, verify the hypothesis that there is a correlation between the dimensions of the castings.

Page 14: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Correlation table and correlation diagram

Example

i xi yi i xi yi

1 38.5 5.5 26 34.2 3.6

2 41.1 4.8 27 39.1 5.1

3 37.8 5.0 28 37.5 4.9

4 36.0 4.9 29 35.5 5.0

5 32.2 5.1 30 36.6 4.1

6 36.8 4.3 31 40.5 5.5

7 33.5 4.5 32 37.2 5.0

8 35.3 3.8 33 34.5 4.8

9 31.1 3.4 34 38.5 4.5

10 42.5 5.7 35 34.0 4.1

11 39.5 5.4 36 33.5 4.0

12 42.1 5.2 37 32.5 4.5

13 38.0 5.2 38 36.4 4.5

14 36.5 5.1 39 37.5 5.6

15 40.0 4.5 40 41.4 5.3

16 36.5 4.4 41 39.5 6.0

17 34.0 4.4 42 38.1 3.9

18 34.5 3.9 43 35.7 4.6

19 44.5 6.6 44 39.5 6.0

20 38.0 5.9 45 35.5 4.6

21 40.0 5.7 46 40.5 6.1

22 36.5 5.4 47 37.5 4.3

23 38.8 5.1 48 33.5 5.2

24 34.5 4.6 49 42.5 6.6

25 36.1 4.2 50 38.0 4.4

Page 15: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Correlation table and correlation diagram

Example

We calculate the ranges:

Rx = 44.5 - 31.1 = 13.4 and Ry = 6.6 - 3.4 = 3.2

Since the number of measurements is n = 50, we take the number of classes k equal to 7. The class lengths are therefore: for characteristic X (dimension), dx = Rx / k = 13.4 / 7 ≈ 2, and for characteristic Y, dy = Ry / k = 3.2 / 7 ≈ 0.5.

As the lower limit we take x = 31.0 for characteristic X and y = 3.25 for characteristic Y.

This gives the correlation table shown in the table.

Page 16: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Correlation table and correlation diagram

Example

i                  1      2      3      4      5      6      7
    Y \ X        31-33  33-35  35-37  37-39  39-41  41-43  43-45
1   3.25-3.75      1      1      -      -      -      -      -
2   3.75-4.25      1      3      3      1      -      -      -
3   4.25-4.75      1      3      5      3      1      -      -
4   4.75-5.25      1      2      3      5      2      -      -
5   5.25-5.75      -      -      1      2      3      2      -
6   5.75-6.25      -      -      -      1      2      1      -
7   6.25-6.75      -      -      -      -      -      -      -
    ni.            4      9     12     12      8      4      1

Page 17: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Correlation table and correlation diagram

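Example

The mean values are x̄ = 37.273 and ȳ = 5.19, and the variances are 8.5136 and 0.4544 respectively, which gives the standard deviations sX = 2.9178 and sY = 0.6741. Thus

$$ r = \frac{\frac{1}{n}\sum_{j=1}^{k}\sum_{i=1}^{k} n_{ij}\, x_i\, y_j - \bar{x}\,\bar{y}}{s_X\, s_Y} = 0.6878 $$

with n = 50, x̄ȳ = 37.273 · 5.19 and sX sY = 2.9178 · 0.6741.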
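The slides do not show which test is used to verify the hypothesis. A common choice, assumed here, is the Student t test of the null hypothesis of no correlation, with statistic t = r·sqrt(n - 2) / sqrt(1 - r²) and n - 2 degrees of freedom. The sketch below applies it to the values of the example (r = 0.6878, n = 50); the critical value of about 2.01 is taken from standard t tables for 48 degrees of freedom and a two-sided 5% significance level, and is not part of the lecture.

```python
import math


def t_statistic(r, n):
    """t statistic for testing the null hypothesis of zero correlation."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r = 0.6878  # sample correlation coefficient from the example
n = 50      # number of measurements

t = t_statistic(r, n)

# Assumption: approximate two-sided 5% critical value for 48 degrees of freedom.
T_CRITICAL = 2.01

# True means the hypothesis of no correlation is rejected at the 95% level.
correlation_is_significant = abs(t) > T_CRITICAL
```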

Page 18: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Correlation table and correlation diagram

Example: correlation diagram (scatter plot of Y against X; X axis from 30 to 45, Y axis from 3 to 7).

Page 19: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Linear regression

Consider a general population in which the characteristics (X, Y) have a two-dimensional distribution. The regression line of the second kind of the characteristic Y on the characteristic X is given by the equation:

$$ y = ax + b $$

where:

$$ a = \rho\,\frac{\sigma_Y}{\sigma_X} $$

is called the coefficient of linear regression of Y on X, and

$$ b = EY - \rho\,\frac{\sigma_Y}{\sigma_X}\,EX $$

is the offset (intercept).
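The regression line can be sketched in two ways: from the moments (correlation coefficient, standard deviations and means), or directly from sample pairs. Both function names below are hypothetical. Using the sample estimates of the casting example (r = 0.6878, sX = 2.9178, sY = 0.6741, means 37.273 and 5.19) in place of the population parameters is an assumption made for illustration.

```python
def regression_from_moments(rho, sigma_x, sigma_y, mean_x, mean_y):
    """Slope a and offset b of the line y = a*x + b from the moments."""
    a = rho * sigma_y / sigma_x
    b = mean_y - a * mean_x
    return a, b


def regression_from_sample(xs, ys):
    """Least-squares slope and offset estimated from paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    a = sxy / sxx  # equals r * s_y / s_x
    b = mean_y - a * mean_x
    return a, b


# Sample estimates from the casting example, used in place of population values.
a_ex, b_ex = regression_from_moments(0.6878, 2.9178, 0.6741, 37.273, 5.19)

# Hypothetical points lying exactly on y = 2x + 1.
a_fit, b_fit = regression_from_sample([0, 1, 2, 3], [1, 3, 5, 7])
```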

Page 20: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Linear regression

Page 21: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Linear regression

Page 22: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

The correlation of the multiple variables

In the case of correlation of more than two variables, the following additional terms should be defined:

• Simple (total) correlation is the correlation between two variables, without taking other variables into account.

• Partial correlation is the correlation between two variables when the other variables are held constant.

• Multiple correlation is the correlation between several interrelated variables that change simultaneously.
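For three variables, the partial correlation can be obtained from the three simple correlations with the standard first-order formula r_xy.z = (r_xy - r_xz·r_yz) / sqrt((1 - r_xz²)(1 - r_yz²)). The formula is not given in the slides; the sketch below is an illustration with a hypothetical function name and made-up coefficient values.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with the third variable z held constant."""
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("undefined when x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical simple correlations between three variables.
r_partial = partial_correlation(r_xy=0.6, r_xz=0.5, r_yz=0.5)

# If z is uncorrelated with both x and y, holding it constant changes nothing.
r_unchanged = partial_correlation(r_xy=0.5, r_xz=0.0, r_yz=0.0)
```

In the first case part of the simple correlation between x and y is explained by their common dependence on z, so the partial correlation is smaller than the simple one.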

Page 23: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Regression curves

Regression curves have the general form of the equation:

y = a + b1x1 + b2x2+ ...

where bi is the partial regression coefficient of the i-th order.
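For two explanatory variables the coefficients a, b1 and b2 can be estimated by least squares. The sketch below solves the normal equations for centred data in closed form; the function name and the data are hypothetical, and least squares is assumed as the fitting method because the slides do not name one.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = a + b1*x1 + b2*x2; returns (a, b1, b2)."""
    n = len(y)
    if not (len(x1) == len(x2) == n) or n < 3:
        raise ValueError("need three sequences of equal length >= 3")
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("the explanatory variables are collinear")
    # Cramer's rule for the 2x2 system of normal equations.
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
x1 = [0, 1, 2, 3, 4]
x2 = [1, 0, 2, 1, 3]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]
a, b1, b2 = fit_two_predictors(x1, x2, y)
```

With more than two explanatory variables the same normal equations are usually solved with a linear algebra library rather than by hand.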

Page 24: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Regression curves

Surface chart

Page 25: Fundamentals  of Data  Analysis Lecture  11 Correlation  and  regression

Thank you for your attention!