
Lecture 17: Correlation and Linear Regression

MSU-STT 351-Sum17B

(P. Vellaisamy: STT 351-Sum17B) Probability & Statistics for Engineers 1 / 46


Correlation and Linear Regression

Our focus now is mainly on analyzing bivariate data.

Bivariate Data. The data $(x_1, y_1), \ldots, (x_n, y_n)$ obtained on two random variables X and Y are called bivariate data.

For example, X = height of a student and Y = weight of a student in a class. The heights and weights of all students in a class constitute bivariate data.

The aim is to investigate whether the random variables X and Y are associated/correlated and whether the relationship is linear or non-linear.


Scatter Plot. A scatter plot is the graphical display of bivariate data, taking the $x_i$-values along the x-axis and the $y_i$-values along the y-axis. Just plot the points $(x_1, y_1), \ldots, (x_n, y_n)$ on the xy-plane. The resulting graph is called the scatter plot.

Usually, X = explanatory (or independent) variable; Y = response (or dependent) variable.

Often, we want to predict the response variable Y based on the observed data on X.

Examine the scatter plot for the nature of the association:

(i) Direction (negative or positive)

(ii) Strength (no, moderate, strong)

(iii) Form (linear or not)


[Figure: three scatter plots of y against x showing (a) positive correlation, (b) strong positive correlation, (c) perfect positive correlation.]


[Figure: three scatter plots of y against x showing (a) negative correlation, (b) strong negative correlation, (c) perfect negative correlation.]


[Figure: two scatter plots of y against x showing (a) no correlation, (b) a nonlinear relationship.]


Example 1 (Ex 4): A study to assess the capacity of subsurface flow wetland systems to remove biochemical oxygen demand (BOD) and various other chemical constituents resulted in the accompanying data on X = BOD mass loading (kg/ha/d) and Y = BOD mass removal (kg/ha/d).

X: 3 8 10 11 13 16 27 30 35 37 38 44 103 142
Y: 4 7 8 8 10 11 16 26 21 9 31 30 75 90

Construct a scatter plot of the data, and comment on any interesting features.


Solution:

[Figure: scatter plot of BOD mass loading (x) vs BOD mass removal (y).]

The scatter plot of the data shows a strong linear relationship between BOD mass loading and BOD mass removal. One observation, (37, 9), appears not to match the linear pattern.


Example 2: The following data are on X = number of hours studied and Y = the score on a test. Examine their relationship using a scatter plot.

X: 0 1 2 3 3 4 4 5 5
Y: 40 41 51 58 64 55 69 58 75
X: 5 5 6 6 6 7 7 8 8
Y: 68 63 93 84 67 90 76


The scatter plot is given below:

[Figure: scatter plot of score (y) vs time (x).]

The plot shows that the variables X and Y are strongly associated: as the values of X increase, the values of Y also increase.


Sample Correlation Coefficient

A scatter plot gives only a visual impression of the relationship between X and Y; sometimes the eyes may be fooled. There is a need for a precise statement, and it is given by the correlation coefficient. This measures the direction and the strength of the linear association between X and Y.

Definition. Let $(x_1, y_1), \ldots, (x_n, y_n)$ be the bivariate data. Pearson's sample correlation coefficient is defined by

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = r(x, y),$$

where $\bar{x}$ and $s_x$ are the sample mean and sample standard deviation of $x_1, \ldots, x_n$. Similarly, $\bar{y}$ and $s_y$ are defined.


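A useful formula for computational purposes is:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}},$$

where

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = (n-1)s_x^2;$$

$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum y_i\right)^2}{n} = (n-1)s_y^2;$$

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}.$$

Remarks:
(i) The standardized score $(x_i - \bar{x})/s_x$ says how many SDs $x_i$ lies above or below $\bar{x}$.
(ii) The correlation r has no unit.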


Properties of Correlation

Note that both variables have to be quantitative.
1 $r(x, y) = r(y, x) = r$.
2 $-1 \le r \le 1$, and r has no units (the proof uses the covariance inequality).
3 r measures the extent of the linear relationship between x and y and does not capture a non-linear relationship.
4 Variables may be strongly associated but still have a small r if the association is not linear.
5 The sign of the correlation gives the direction of the association: $r < 0$ implies negative association and $r > 0$ indicates positive association.


6 The value of r does not depend on the units of measurement of either variable. That is, it is not affected by shifting or rescaling the variables. This is because

$$r(ax + b,\; cy + d) = r(x, y)$$

for real b and d and positive a and c.

7 r is strongly affected by a few outlying observations.

8 $r = \pm 1$ only when all points $(x_i, y_i)$ lie on a straight line.
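Property 6 can be checked numerically. The following is a minimal sketch, assuming a small made-up data set (the values of `x` and `y` and the helper name `pearson_r` are illustrative and not from the lecture). It rescales both variables with positive factors and confirms that r is unchanged, then flips the sign of one variable to show that r changes sign.

```python
import math

def pearson_r(x, y):
    """Sample correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical illustration data (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_original = pearson_r(x, y)

# Shift and rescale both variables with positive scale factors (a = 2.54, c = 0.5).
r_rescaled = pearson_r([2.54 * v + 10 for v in x], [0.5 * v - 3 for v in y])

# A negative scale factor on one variable flips the sign of r.
r_flipped = pearson_r([-v for v in x], y)

print(r_original, r_rescaled, r_flipped)
```

The same helper also illustrates property 1, since swapping the arguments leaves every term of the formula unchanged.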


Example 3 (Ex 12.15): An accurate assessment of soil productivity is critical to rational land-use planning. The following data are on corn yield X and peanut yield Y (mT/Ha) for eight types of soil.

X: 2.4 3.4 4.6 3.7 2.2 3.3 4.0 2.1
Y: 1.33 2.12 1.80 1.65 2.00 1.76 2.11 1.63

Determine whether there is any association between X and Y.


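Solution: With $\sum x_i = 25.7$, $\sum y_i = 14.40$, $\sum x_i^2 = 88.31$, $\sum x_i y_i = 46.856$ and $\sum y_i^2 = 26.4324$,

$$S_{xx} = 88.31 - \frac{(25.7)^2}{8} = 88.31 - 82.56 = 5.75,$$

$$S_{yy} = 26.4324 - \frac{(14.40)^2}{8} = 0.5124,$$

$$S_{xy} = 46.856 - \frac{(25.7)(14.40)}{8} = 0.5960.$$

Hence,

$$r = \frac{0.5960}{\sqrt{5.75}\,\sqrt{0.5124}} = 0.3472.$$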
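The arithmetic in this solution can be reproduced directly from the raw data of Example 3. The snippet below is a small sketch using only the standard library; the variable names are illustrative. Because it works with unrounded intermediate values, the last digit of r may differ slightly from the hand calculation.

```python
import math

# Corn yield (x) and peanut yield (y) from Example 3.
x = [2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1]
y = [1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63]
n = len(x)

# Computational formulas for Sxx, Syy and Sxy.
s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(s_xx, 4), round(s_yy, 4), round(s_xy, 4), round(r, 4))
```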


Example 4. The following data give the marks on the first midterm (X) and the second midterm (Y) of 9 students from 3 sections:

8 A.M.: (70,60), (72,83), (94,85)
Noon: (80,72), (60,74), (55,58)
Evening: (45,63), (50,40), (35,54)

(a) Find the correlation coefficient between X and Y.

(b) Find the correlation coefficient between the section averages of X and Y.


Solution:

(a)

$$S_{xx} = 37695 - (561)^2/9 = 2726,$$

$$S_{yy} = 40223 - (589)^2/9 = 1676.222,$$

$$S_{xy} = 38281 - (561)(589)/9 = 1566.667.$$

So,

$$r(x, y) = \frac{1566.667}{\sqrt{2726}\,\sqrt{1676.222}} = 0.733.$$

(b) Now the section means are
$\bar{x}_1 = (70 + 72 + 94)/3 = 78.667$; $\bar{y}_1 = (60 + 83 + 85)/3 = 76$.
$\bar{x}_2 = (80 + 60 + 55)/3 = 65$; $\bar{y}_2 = (72 + 74 + 58)/3 = 68$.
$\bar{x}_3 = (45 + 50 + 35)/3 = 43.333$; $\bar{y}_3 = (63 + 40 + 54)/3 = 52.333$.


Also,

$$S_{\bar{x}\bar{x}} = (78.667)^2 + (65)^2 + (43.333)^2 - (78.667 + 65 + 43.333)^2/3 = 634.913,$$

$$S_{\bar{y}\bar{y}} = (76)^2 + (68)^2 + (52.333)^2 - (76 + 68 + 52.333)^2/3 = 289.923,$$

$$S_{\bar{x}\bar{y}} = (78.667)(76) + (65)(68) + (43.333)(52.333) - (187)(196.333)/3 = 428.348.$$

So,

$$r(\bar{x}, \bar{y}) = \frac{428.348}{\sqrt{634.913}\,\sqrt{289.923}} = 0.9984.$$
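Both parts of Example 4 can be reproduced in a few lines, which makes the contrast explicit: the correlation between section averages is much higher than the correlation between individual students. This is a sketch with illustrative names (`pearson_r`, `sections`); with unrounded means, the value for part (b) may differ from the hand calculation in the fourth decimal place.

```python
import math

def pearson_r(x, y):
    """Sample correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return s_xy / math.sqrt(s_xx * s_yy)

# (first midterm, second midterm) marks by section, from Example 4.
sections = {
    "8 A.M.": [(70, 60), (72, 83), (94, 85)],
    "Noon": [(80, 72), (60, 74), (55, 58)],
    "Evening": [(45, 63), (50, 40), (35, 54)],
}

# (a) Correlation over all nine students.
pairs = [p for marks in sections.values() for p in marks]
r_students = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])

# (b) Correlation between the three section averages.
x_means = [sum(p[0] for p in marks) / len(marks) for marks in sections.values()]
y_means = [sum(p[1] for p in marks) / len(marks) for marks in sections.values()]
r_sections = pearson_r(x_means, y_means)

print(round(r_students, 3), round(r_sections, 4))
```

Averaging removes the student-to-student variability within each section, which is why correlations computed from group averages tend to overstate the association between individuals.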


Population Correlation Coefficient. The population correlation coefficient between X and Y is defined by

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{E(XY) - E(X)E(Y)}{\sigma_x \sigma_y}.$$

Some properties of ρ are:

(i) $|\rho| \le 1$.

(ii) $|\rho| = 1$ if all $(x_i, y_i)$ in the population lie on a straight line.

(iii) The sample correlation coefficient r can be used to decide whether or not ρ = 0 (no linear relationship between X and Y).

(iv) Also, $|\rho| = 1$ for the bivariate distribution means that the variables X and Y are linearly related.


A Test for ρ = 0.

To test the hypothesis $H_0: \rho = 0$, use the test statistic

$$T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \quad \text{under } H_0.$$

Example 5 (Ex 60): The following are summary statistics from a study of re-vegetation of soil at mine reclamation sites. Here, X = KCI of extractable aluminium and Y = amount of time required to bring soil pH to 7.0; n = 24, $\sum x = 48.15$, $\sum x^2 = 155.4685$, $\sum y = 263.5$, $\sum y^2 = 3750.53$ and $\sum xy = 658.455$.

Carry out a test at significance level 0.01 to see whether $\rho \ne 0$.


Solution: We need to test $H_0: \rho = 0$ vs $H_1: \rho \ne 0$.

The test statistic is:

$$T = r\sqrt{n-2}\,/\sqrt{1-r^2}.$$

Reject $H_0$ at level 0.01 if the observed value t of T satisfies either $t \ge t_{.005,22} = 2.819$ or $t \le -2.819$.

The observed value of r is 0.5778 and hence t = 3.32 > 2.819.

So $H_0$ should be rejected. There appears to be a non-zero correlation in the population.
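The values r = 0.5778 and t = 3.32 follow from the summary statistics of Example 5. A short sketch of that computation is given below; the variable names are illustrative, and the critical value 2.819 is simply copied from the solution rather than computed, since the standard library has no t-distribution quantile function.

```python
import math

# Summary statistics from Example 5.
n = 24
sum_x, sum_x2 = 48.15, 155.4685
sum_y, sum_y2 = 263.5, 3750.53
sum_xy = 658.455

s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

r = s_xy / math.sqrt(s_xx * s_yy)

# Test statistic T = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 = 22 df.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_critical = 2.819  # t_{.005,22}, taken from the solution above
reject_h0 = abs(t) >= t_critical

print(round(r, 4), round(t, 2), reject_h0)
```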


Linear Regression. Regression analysis deals with the relationship between two or more random variables. Mostly, we will consider the case where both the response and the explanatory/predictor variables are continuous.

Model. The simple linear regression model assumes a linear relationship between observations of a response Y and the corresponding predictor/independent variable X, namely,

$$Y = a + bX + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2).$$

Alternatively, it is also written as

$$E(Y \mid X = x) = E(Y \mid x) = a + bx; \quad \mathrm{Var}(Y \mid x) = \sigma^2.$$

Parameters: The parameters to be estimated are the intercept a and the slope b. Also, we need to estimate $\sigma^2$, the variance of the noise term.


Least squares

This is one of the methods of estimating the regression parameters.

Assumptions: We have bivariate data $(x_1, y_1), \ldots, (x_n, y_n)$. Our assumptions are:

$$y_i = a + bx_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n.$$

Furthermore, we assume that the errors $\varepsilon_i$ are independent. A complete analysis will use residual plots to assess these assumptions after fitting the preliminary model.

The Principle of Least Squares: Estimate a and b by $\hat{a}$ and $\hat{b}$, which minimize the sum of squares of the residuals.


Calculation: The estimates $\hat{a}$ and $\hat{b}$ can be calculated using the formulae:

$$\hat{b} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}};$$

$$\hat{a} = \bar{y} - \hat{b}\,\bar{x}.$$

But these calculations do not generalize easily to more complicated models.

Regression Line. The line $\hat{y} = \hat{a} + \hat{b}x$ is called the fitted regression line.

The Fitted Values. The fitted or predicted values of Y are given by

$$\hat{y}_i = \hat{a} + \hat{b}x_i, \quad i = 1, \ldots, n.$$

The Residuals. The difference between an observed value and the fitted value is called a residual. Thus, $e_i = y_i - \hat{y}_i$, $1 \le i \le n$, are the residuals.
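The formulas for the estimates, fitted values and residuals translate almost line for line into code. The following is a minimal sketch; the function name `least_squares` and the five data points are made up for illustration and do not come from the lecture.

```python
def least_squares(x, y):
    """Return (a_hat, b_hat) for the fitted line y = a_hat + b_hat * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = s_xy / s_xx            # slope
    a_hat = y_bar - b_hat * x_bar  # intercept
    return a_hat, b_hat

# Hypothetical data lying close to the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

a_hat, b_hat = least_squares(x, y)
fitted = [a_hat + b_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(a_hat, b_hat)
print(residuals)
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding error), which is a quick sanity check on any hand calculation.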


Coefficient of determination

The coefficient of determination $R^2 = r^2$ is the proportion of the variation in the response variable Y explained by the predictor variable X.

$$R^2 = 1 - \frac{SSE}{SST}, \quad 0 \le r^2 \le 1,$$

where SSE and SST are respectively called the residual sum of squares and the total sum of squares, defined and discussed in detail later.

A high $r^2$ (i.e. close to 1) indicates a successful model in the sense that the residual variability is much smaller than the original variability. A low $r^2$ (close to 0) indicates an unsuccessful model.


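Genesis of Regression. Note that the slope is

$$\hat{b} = \frac{S_{xy}}{S_{xx}} = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} \cdot \frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}} = r\,\frac{s_y}{s_x},$$

since $S_{xx} = (n-1)s_x^2$ and $S_{yy} = (n-1)s_y^2$. Hence,

$$\hat{y} = (\bar{y} - \hat{b}\,\bar{x}) + \hat{b}x = \bar{y} + r\,\frac{s_y}{s_x}(x - \bar{x}).$$

Put $x = \bar{x} + s_x$; then $\hat{y} = \bar{y} + r s_y$. When $r = 1$, $\hat{y} = \bar{y} + s_y$; when $r = 1/2$, $\hat{y} = \bar{y} + (1/2)s_y$; and so on.

For any x-value, the predicted value $\hat{y}$ will be closer (in terms of SDs) to $\bar{y}$ than x is to $\bar{x}$ whenever $|r| < 1$. That is, $\hat{y}$ is pulled toward (regressed toward) $\bar{y}$. This regression effect was first noticed by Sir Francis Galton, who found that the predicted height of a son ($\hat{y}_i$) was always closer to $\bar{y}$ than his father's height ($x_i$) was to $\bar{x}$.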


Example 1. The following data give the mean height of a group of children in Kalama, an Egyptian village. The data were obtained on 161 children each month from 18 to 29 months of age. Here, X = age (in months) and Y = height (in cm) = response variable.

x: 18 19 20 21 22 23 24 25 26 27 28 29
y: 76.1 77.0 78.1 78.2 78.8 79.7 81.1 81.2 81.8 82.8 83.5

For the above data:
$\bar{x} = 23.5$; $\bar{y} = 79.85$; $s_x = 3.606$; $s_y = 2.302$; $r = r(x, y) = 0.9944$.

Hence,

$$\hat{b} = r\,\frac{s_y}{s_x} = (0.9944)\,\frac{2.302}{3.606} = 0.6348;$$

$$\hat{a} = \bar{y} - \hat{b}\,\bar{x} = 79.85 - (0.6348)(23.5) = 64.932.$$

Therefore, the fitted regression line is $\hat{y} = 64.932 + 0.6348x$.

Interpretation. The slope $\hat{b} = 0.6348$ cm/month is the rate of change in mean height as age increases. Though r does not change with the units of measurement, the equation of the least-squares line does.
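Since the fitted line here is obtained from summary statistics alone, the calculation is easy to script. The sketch below uses only the five summary values quoted in the example; the helper `predict` is a hypothetical name introduced for illustration.

```python
# Summary statistics quoted in the Kalama example.
x_bar, y_bar = 23.5, 79.85
s_x, s_y = 3.606, 2.302
r = 0.9944

# Slope and intercept from b = r * s_y / s_x and a = y_bar - b * x_bar.
b_hat = r * s_y / s_x
a_hat = y_bar - b_hat * x_bar

def predict(age_months):
    """Predicted mean height (cm) at the given age (months)."""
    return a_hat + b_hat * age_months

print(round(b_hat, 4), round(a_hat, 3))

# Regression effect: one SD above the mean age gives a prediction
# only r SDs above the mean height.
print(predict(x_bar + s_x) - y_bar, r * s_y)
```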


Details of Linear Regression

(a) Fitting a straight line

Often, one is interested not only in studying the relationship, but also in predicting the value of the dependent variable Y based on observations of the predictor or explanatory variable X.

When the scatter plot suggests a linear relationship, it is natural to find a straight line which is as close as possible to the points. The equation of a straight line is y = a + bx. A particular example is y = 5 + x, with a = 5 and b = 1. To draw a line, we need two quantities, namely, the intercept a (with the y-axis) and the slope b.

Given the data $(x_1, y_1), \ldots, (x_n, y_n)$ on (X, Y), our aim is to find the straight line y = a + bx which fits the data well.


(b) Method of Least Squares

Note X = explanatory (predictor) variable and Y = response variable. Let $\varepsilon_i = y_i - (a + bx_i)$ be the error (deviation) of $y_i$ from the line y = a + bx. Then

$$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2 = \text{sum of squares of errors}.$$

The principle of least squares suggests choosing the line (that is, finding a and b) such that $\sum_{i=1}^{n} \varepsilon_i^2$ is minimum.

The resulting equation $\hat{y} = \hat{a} + \hat{b}x$ is called the "sample regression line" or the "fitted regression line." Also, the value $\hat{y}_i = \hat{a} + \hat{b}x_i$ is the fitted value of y corresponding to $x_i$, and $e_i = y_i - \hat{y}_i$ is called a residual.

If $e_i > 0$, the model underestimates the data value. If $e_i < 0$, the model overestimates the data value.


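(c) The Derivation of $\hat{a}$ and $\hat{b}$

Let

$$f(a, b) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2.$$

For fixed b and treating f as a function of a, we have

$$\frac{\partial f}{\partial a} = 0 \;\Rightarrow\; \sum_{i=1}^{n} (y_i - a - bx_i)(-1) = 0$$

$$\Rightarrow\; n\bar{y} - na - nb\bar{x} = 0$$

$$\Rightarrow\; a = \bar{y} - b\bar{x} = \hat{a} \text{ (say)},$$

which is the intercept of the regression line.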


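Also, substituting $\hat{a}$ in f(a, b) and treating the resulting function as a function of b, we get

$$\frac{\partial f}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} (y_i - a - bx_i)(-x_i) = 0$$

$$\Rightarrow\; \sum_{i=1}^{n} x_i y_i - na\bar{x} - b\sum_{i=1}^{n} x_i^2 = 0$$

$$\Rightarrow\; \sum_{i=1}^{n} x_i y_i - n(\bar{y} - b\bar{x})\bar{x} - b\sum_{i=1}^{n} x_i^2 = 0 \quad (\text{substituting } a = \bar{y} - b\bar{x}).$$

Solving now for b, we obtain the slope of the regression line as

$$\hat{b} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{S_{xy}}{S_{xx}}.$$

The line $\hat{y} = \hat{a} + \hat{b}x$ is called the fitted least-squares (regression) line.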
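The derivation can be checked numerically: at the closed-form solution both partial derivatives of f should vanish, and moving away from it should increase f. The sketch below does this for the corn and peanut yield data of Example 3 (Ex 12.15); the names `f`, `grad_a` and `grad_b` are illustrative.

```python
# Corn yield (x) and peanut yield (y) from Example 3 (Ex 12.15).
x = [2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1]
y = [1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form solution of the normal equations.
b_hat = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / (
    sum(xi * xi for xi in x) - n * x_bar ** 2
)
a_hat = y_bar - b_hat * x_bar

def f(a, b):
    """Sum of squared errors for the line y = a + b x."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

# Partial derivatives of f evaluated at (a_hat, b_hat).
grad_a = -2 * sum(yi - a_hat - b_hat * xi for xi, yi in zip(x, y))
grad_b = -2 * sum((yi - a_hat - b_hat * xi) * xi for xi, yi in zip(x, y))

print(a_hat, b_hat, grad_a, grad_b)
```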


Assessing the Fit. To assess the effectiveness of the fit, the residuals can be used. Note $e_i = y_i - \hat{y}_i > 0$ if $y_i > \hat{y}_i$ and $e_i = y_i - \hat{y}_i < 0$ if $y_i < \hat{y}_i$.

Moreover,

$$\sum (y_i - \hat{y}_i)^2 = 0 \;\Leftrightarrow\; y_i = \hat{y}_i \text{ for all } i.$$

That is, all observed values lie on a straight line. Also, $\sum_{i=1}^{n} e_i^2$ can be used as a measure of the fit. Another quantity is the total variation in the $y_i$'s, defined by $\sum_{i=1}^{n} (y_i - \bar{y})^2$.

Definition. The residual sum of squares, denoted by SSE, is defined as

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2,$$

and the total sum of squares, denoted by SST, is defined as

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = S_{yy}.$$


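Note that $y_i - \hat{y}_i = y_i - (\hat{a} + \hat{b}x_i) = (y_i - \bar{y}) - \hat{b}(x_i - \bar{x})$ (substituting $\hat{a} = \bar{y} - \hat{b}\bar{x}$).

Using this, we see that SSE can be calculated without the $e_i$'s from

$$\begin{aligned}
SSE &= \sum_{i=1}^{n} (y_i - \bar{y})^2 - 2\hat{b}\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) + \hat{b}^2\sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= SST - 2\hat{b}S_{xy} + \hat{b}^2 S_{xx} \\
&= SST - 2\hat{b}S_{xy} + \hat{b}S_{xy} \quad (\text{since } \hat{b}S_{xx} = S_{xy}) \\
&= SST - \hat{b}S_{xy} \\
&= SST - SSR,
\end{aligned}$$

where $SSR = \hat{b}S_{xy}$ is called the regression sum of squares.

Note: SSE is used as a measure of the variation "unexplained" by the regression line, SST is used as a measure of total variation, and SSE/SST = fraction of total variation that is unexplained by the line.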


Definition: The coefficient of determination, denoted by $R^2$, is defined as

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.$$

It is the proportion of variation in y explained by the regression. A high value of $R^2$ (close to 1) means the regression line is a good fit, and a low value of $R^2$ (close to 0) indicates the fitted line is not a good fit for the given data.

Result. $R^2 = r^2$, where r is the sample correlation coefficient.

Proof: Using $\hat{b} = S_{xy}/S_{xx}$ and $SST = S_{yy}$,

$$R^2 = \frac{SSR}{SST} = \frac{\hat{b}S_{xy}}{SST} = \frac{S_{xy}}{S_{xx}} \cdot \frac{S_{xy}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx}S_{yy}} = r^2.$$
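Both identities, SST = SSR + SSE and $R^2 = r^2$, can be confirmed on real numbers. The sketch below uses the nine midterm pairs from the correlation Example 4 and computes SSE directly from the residuals, so that the decomposition is a genuine check and not a restatement of the shortcut formula. Variable names are illustrative.

```python
import math

# (first midterm, second midterm) marks of the nine students in Example 4.
x = [70, 72, 94, 80, 60, 55, 45, 50, 35]
y = [60, 83, 85, 72, 74, 58, 63, 40, 54]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar

# SSE from the residuals themselves, SST = Syy, SSR = b_hat * Sxy.
sse = sum((yi - (a_hat + b_hat * xi)) ** 2 for xi, yi in zip(x, y))
sst = s_yy
ssr = b_hat * s_xy

r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = 1 - sse / sst

print(sst, ssr + sse)
print(r_squared, r ** 2)
```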


Estimate of $\sigma^2$: The quantity

$$s_e^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n} e_i^2}{n-2}$$

is the variance of the residuals and is an unbiased estimator of $\sigma^2 = \mathrm{Var}(Y \mid x)$.

Also, $s_e = \sqrt{s_e^2}$ is called the SD of the residuals about the least-squares line.

The SSE can be computed easily using the formula:

$$SSE = SST - \hat{b}S_{xy} = \sum y_i^2 - \hat{a}\sum y_i - \hat{b}\sum x_i y_i.$$

Hence, the unbiased estimate of $\sigma^2 = \mathrm{Var}(Y \mid x)$ is

$$s_{y|x}^2 = \frac{SSE}{n-2}.$$
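As an illustration, the sketch below estimates $\sigma^2$ for the BOD data of the scatter-plot Example 1 (Ex 4), computing SSE both from the residuals and from the shortcut formula so the two can be compared. The lecture does not work this example out, so no reference values are quoted; the variable names are illustrative.

```python
import math

# BOD mass loading (x) and BOD mass removal (y) from Example 1 (Ex 4).
x = [3, 8, 10, 11, 13, 16, 27, 30, 35, 37, 38, 44, 103, 142]
y = [4, 7, 8, 8, 10, 11, 16, 26, 21, 9, 31, 30, 75, 90]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)

b_hat = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a_hat = sum_y / n - b_hat * sum_x / n

# SSE directly from the residuals ...
sse_direct = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
# ... and from the shortcut formula.
sse_shortcut = sum_y2 - a_hat * sum_y - b_hat * sum_xy

s2_e = sse_direct / (n - 2)  # unbiased estimate of sigma^2
s_e = math.sqrt(s2_e)        # SD of residuals about the fitted line

print(sse_direct, sse_shortcut, s2_e, s_e)
```

The shortcut formula subtracts large, nearly equal quantities, so it is more sensitive to rounding of $\hat{a}$ and $\hat{b}$ than the direct sum of squared residuals.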


Plotting the Residuals (Residual Plot)

Definition: A scatter plot of the $e_i$'s against the $x_i$'s is called a residual plot.

Uses of the Residual Plot:

(i) It is used for checking whether there are any unusual or highly influential observations, or any revealing patterns, in the data.

(ii) If there is no particular pattern, such as curvature, the least-squares fit is a good fit. Also, the residuals will be centered around the x-axis.

(iii) Looking at the residual plot is equivalent to examining the data after removing the linear dependence on x. This may sometimes show the existence of a non-linear relationship.


Example 2 (Ex 9): The flow rate y (m³/min) in a device used for air-quality measurement depends on the pressure drop x (in. of water) across the device's filter. Suppose that for x values between 5 and 20, the two variables are related according to the simple linear regression model with true regression line y = −0.12 + 0.095x.

1 What is the expected change in flow rate associated with a 1-in. increase in pressure drop? Explain.

2 What change in flow rate can be expected when pressure drop decreases by 5 in.?


3 What is the expected flow rate for a pressure drop of 10 in.? A drop of 15 in.?

4 Suppose σ = 0.025 and consider a pressure drop of 10 in. What is the probability that the observed value of flow rate will exceed 0.835? That the observed flow rate will exceed 0.840?

5 What is the probability that an observation on flow rate when pressure drop is 10 in. will exceed an observation on flow rate made when pressure drop is 11 in.?

Solution:

1 The expected change in flow rate (y) associated with a one-inch increase in pressure drop (x) is the slope, b = 0.095.

2 We expect flow rate to decrease by 5b = 0.475.


3 $\mu_{Y \cdot 10} = -0.12 + 0.095(10) = 0.830$; $\mu_{Y \cdot 15} = -0.12 + 0.095(15) = 1.305$.

4 $$P(Y > 0.835) = P\left(Z > \frac{0.835 - 0.830}{0.025}\right) = P(Z > 0.20) = 0.4207;$$

$$P(Y > 0.840) = P\left(Z > \frac{0.840 - 0.830}{0.025}\right) = P(Z > 0.40) = 0.3446.$$

5 Let $Y_1$ and $Y_2$ denote the flow rates for pressure drops of 10 in. and 11 in., respectively. Note $\mu_{Y \cdot 11} = 0.925$. So, $E(Y_1 - Y_2) = 0.830 - 0.925 = -0.095$ and, for independent observations, $SD(Y_1 - Y_2) = \sqrt{(0.025)^2 + (0.025)^2} = 0.035355$. Thus

$$P(Y_1 > Y_2) = P(Y_1 - Y_2 > 0) = P\left(Z > \frac{0.095}{0.035355}\right) = P(Z > 2.69) = 0.0036.$$
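The normal probabilities in parts 4 and 5 can be reproduced with `statistics.NormalDist` from the Python standard library (available from Python 3.8) instead of a z-table. This is a sketch; the helper `mean_flow` and the variable names are illustrative, and part 5 assumes the two observations are independent, as in the solution.

```python
import math
from statistics import NormalDist

# True regression line and error SD from the example.
a, b, sigma = -0.12, 0.095, 0.025

def mean_flow(pressure_drop):
    """Expected flow rate at a given pressure drop (in.)."""
    return a + b * pressure_drop

# Part 4: Y at x = 10 is N(0.830, 0.025^2).
y_at_10 = NormalDist(mu=mean_flow(10), sigma=sigma)
p_exceeds_0835 = 1 - y_at_10.cdf(0.835)
p_exceeds_0840 = 1 - y_at_10.cdf(0.840)

# Part 5: Y1 - Y2 is normal with mean -0.095 and SD sqrt(2) * sigma
# (independence of the two observations is assumed).
difference = NormalDist(
    mu=mean_flow(10) - mean_flow(11),
    sigma=math.sqrt(2) * sigma,
)
p_y1_exceeds_y2 = 1 - difference.cdf(0.0)

print(round(p_exceeds_0835, 4), round(p_exceeds_0840, 4), round(p_y1_exceeds_y2, 4))
```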


Example 3 (Ex 13): The accompanying data on x = current density (mA/cm²) and y = rate of deposition (μm/min) appeared in an article. Do you agree with the article's author that "a linear relationship was obtained from the tin-lead rate of deposition as a function of current density"? Explain your reasoning.

X: 20 40 60 80
Y: 0.24 1.20 1.71 2.22


Solution: For these data,

n = 4, $\sum x_i = 200$, $\sum y_i = 5.37$, $\sum x_i^2 = 12000$, $\sum y_i^2 = 9.3501$, $\sum x_i y_i = 333$.

Therefore,

$$S_{xx} = 12000 - \frac{(200)^2}{4} = 2000,$$

$$S_{yy} = 9.3501 - \frac{(5.37)^2}{4} = 2.1409,$$

$$S_{xy} = 333 - \frac{(200)(5.37)}{4} = 64.5.$$


Hence,

$$\hat{b} = \frac{S_{xy}}{S_{xx}} = \frac{64.5}{2000} = 0.03225,$$

$$\hat{a} = \frac{5.37}{4} - (0.03225)\,\frac{200}{4} = -0.27000.$$

Also, $SSE = SST - \hat{b}S_{xy} = 2.140875 - (0.03225)(64.5) = 0.06075$.
So,

$$r^2 = 1 - \frac{SSE}{SST} = 1 - \frac{0.06075}{2.1409} = 0.9716.$$

This is a very high value of $r^2$, which confirms the author's claim that there is a strong linear relationship between the two variables.
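The whole calculation for the tin-lead data fits in a few lines. The sketch below recomputes the slope, intercept, SSE and $r^2$ from the four data points and also lists the residuals, so that a residual plot can be examined alongside $r^2$. Variable names are illustrative.

```python
# Current density (x) and rate of deposition (y) from Example 3 (Ex 13).
x = [20, 40, 60, 80]
y = [0.24, 1.20, 1.71, 2.22]
n = len(x)

s_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
s_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b_hat = s_xy / s_xx
a_hat = sum(y) / n - b_hat * sum(x) / n

sse = s_yy - b_hat * s_xy  # SSE = SST - b_hat * Sxy
r_squared = 1 - sse / s_yy

residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]

print(b_hat, a_hat, sse, r_squared)
print(residuals)
```

With only four observations, $r^2$ and the residuals give limited evidence either way, so the conclusion should be read with that sample size in mind.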


Example 4 (Ex 19): The following data are on X = burner area liberation rate and Y = NOx emission rate (ppm):

X: 100 125 125 150 150 200 200 250 250 300 300 350 400 400
Y: 150 140 180 210 190 320 280 400 430 440 390 600 610 670

(a) Assuming that the simple linear regression model is valid, obtain the least squares estimate of the true regression line.

(b) What is the estimate of expected NOx emission rate when burner area liberation rate equals 225?

(c) Estimate the amount by which you expect NOx emission rate to change when burner area liberation rate is decreased by 50.

(d) Would you use the estimated regression line to predict emission rate for a liberation rate of 500? Why or why not?


Solution: The given data yield:

n = 14, $\sum x_i = 3300$, $\sum y_i = 5010$, $\sum x_i^2 = 913750$, $\sum y_i^2 = 2207100$, $\sum x_i y_i = 1413500$.

Therefore,

(a) $\hat{b} = \dfrac{3256000}{1902500} = 1.71143233$, $\hat{a} = -45.55190543$.

So, the fitted line is: $\hat{y} = -45.5519 + 1.7114x$.

(b) $\hat{\mu}_{Y \cdot 225} = -45.5519 + 1.7114(225) = 339.51$.

(c) Estimated expected change = $-50\hat{b} = -85.57$.

(d) No, the value 500 is outside the range of x values for which observations were available (the danger of extrapolation).
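Parts (a) to (c) can be reproduced from the raw data. The sketch below uses the equivalent form $\hat{b} = (n\sum x_i y_i - \sum x_i \sum y_i)/(n\sum x_i^2 - (\sum x_i)^2)$, which gives the ratio 3256000/1902500 shown in the solution. Because it keeps unrounded coefficients, the prediction in part (b) may differ from the hand calculation in the last digit. Variable names are illustrative.

```python
# Burner area liberation rate (x) and NOx emission rate (y) from Example 4 (Ex 19).
x = [100, 125, 125, 150, 150, 200, 200, 250, 250, 300, 300, 350, 400, 400]
y = [150, 140, 180, 210, 190, 320, 280, 400, 430, 440, 390, 600, 610, 670]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# (a) Least squares estimates.
numerator = n * sum_xy - sum_x * sum_y
denominator = n * sum_x2 - sum_x ** 2
b_hat = numerator / denominator
a_hat = sum_y / n - b_hat * sum_x / n

# (b) Estimated expected emission rate at x = 225.
pred_225 = a_hat + b_hat * 225

# (c) Estimated expected change when x decreases by 50.
change = -50 * b_hat

# (d) Extrapolation check: is 500 inside the observed range of x?
inside_range = min(x) <= 500 <= max(x)

print(numerator, denominator, b_hat, a_hat)
print(pred_225, change, inside_range)
```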


Homework:
Sec 12.1: 3, 8, 9
Sec 12.2: 12, 14, 16
Sec 12.5: 58, 62, 65
