# two-variable analysis: simple linear regression/...

Post on 22-Aug-2020


Pat Hammett, University of Michigan 1

Two-Variable Analysis: Simple Linear Regression / Correlation

Topics

I. Scatter Plot (X-Y Graph)

II. Simple Linear Regression

III. Correlation, R

IV. Assessing Model Accuracy, R²

V. Regression Abuses / Misinterpreting Correlation


I. Scatter Plot

• Used to visualize the relationship between two variables.

• Common results:
  - Linear relationships
  - Non-linear relationships
  - No relationship (robustness)

Scatter Plot

• Shows the relationship between X (predictor) and Y (response) over a given range of X.
  - X: independent variable (predictor variable)
  - Y: dependent variable (response variable)


Example 1: Coating Thickness
(From: “SPC of a Phosphate Coating Line”, Wire, J. J. Intl, May 1997, pp. 78-81.)

• Suppose you measure the efficiency of a phosphate coating operation for steel versus coating tank temperature.
  - Which variable is the response, and which is the predictor?

| Sample | Temp | Efficiency | Sample | Temp | Efficiency |
|---|---|---|---|---|---|
| 1 | 170 | 0.84 | 13 | 180 | 2.15 |
| 2 | 172 | 1.31 | 14 | 181 | 0.84 |
| 3 | 173 | 1.42 | 15 | 181 | 1.43 |
| 4 | 174 | 1.03 | 16 | 182 | 0.9 |
| 5 | 174 | 1.07 | 17 | 182 | 1.81 |
| 6 | 175 | 1.08 | 18 | 182 | 1.94 |
| 7 | 176 | 1.04 | 19 | 182 | 2.68 |
| 8 | 177 | 1.8 | 20 | 184 | 1.49 |
| 9 | 180 | 1.45 | 21 | 184 | 2.52 |
| 10 | 180 | 1.6 | 22 | 185 | 3 |
| 11 | 180 | 1.61 | 23 | 186 | 1.87 |
| 12 | 180 | 2.13 | 24 | 188 | 3.08 |


Scatter Plot: Coating Efficiency

• Is there a relationship here?

[Figure: scatter plot, "Temperature Vs. Coating Efficiency"; x-axis: Temperature (165 to 190), y-axis: Phosphate Coating Efficiency Ratio (0 to 3.5)]

• Yes: as temperature increases, efficiency increases.


Lecture Exercise 1: Changing Range of X

• Open the Excel file, tanktemp.xls, which contains this data set.
  - Compute the range of Y (efficiency) if you restrict the tank temperature range from 170-188 to 180-182.
  - Is the range of Y (efficiency) smaller, larger, or the same as over the full range of X?
  - Construct a scatter plot of this new data set. Do you still think a relationship exists?

Lecture Exercise 1: Effect on Y of Reducing Variation in X

• Note: if a strong relationship exists (positive or negative) between X and Y, then reducing variation in X should reduce the variation in Y.

• Coating example:

| Temp (Range X) | Efficiency Ratio (Range Y) |
|---|---|
| 170-188 | 2.24 |
| 180-182 | 1.84 |
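The range comparison in this exercise can also be reproduced outside Excel. The following is a minimal Python sketch, assuming the 24 (temperature, efficiency) pairs from the Example 1 table; the names `data` and `y_range` are illustrative and do not come from the lecture.

```python
# Coating data from Example 1: (tank temperature, efficiency ratio).
data = [
    (170, 0.84), (172, 1.31), (173, 1.42), (174, 1.03), (174, 1.07), (175, 1.08),
    (176, 1.04), (177, 1.80), (180, 1.45), (180, 1.60), (180, 1.61), (180, 2.13),
    (180, 2.15), (181, 0.84), (181, 1.43), (182, 0.90), (182, 1.81), (182, 1.94),
    (182, 2.68), (184, 1.49), (184, 2.52), (185, 3.00), (186, 1.87), (188, 3.08),
]


def y_range(pairs, x_min, x_max):
    """Range (max - min) of Y for observations with x_min <= X <= x_max."""
    ys = [y for x, y in pairs if x_min <= x <= x_max]
    return max(ys) - min(ys)


full_range = y_range(data, 170, 188)     # all 24 observations
reduced_range = y_range(data, 180, 182)  # only the observations at 180-182

print(f"Range of Y, temp 170-188: {full_range:.2f}")
print(f"Range of Y, temp 180-182: {reduced_range:.2f}")
```

Cutting the temperature range from 18 degrees to 2 degrees shrinks the efficiency range only from 2.24 to 1.84, which already hints that temperature alone does not account for most of the variation.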


Efficiency Vs. Reduced Range in Temperature

• Over the smaller range of the input (temperature), this relationship weakens.

[Figure: scatter plot, "Temperature Vs. Coating Efficiency"; x-axis: Temperature (179 to 183), y-axis: Phosphate Coating Efficiency Ratio (0 to 3)]

Lessons from Coating Example

• Relationships between Y and X variables may change depending on the range of X.

• Scatter plots provide good visualization of relationships between variables, but we need a metric to assess the strength of a relationship.
  - For two variables, we use simple linear regression to develop a model, and correlation to assess the strength of the relationship.


II. Simple Linear Regression

• Simple linear regression examines the relationship between two variables:
  - one response (Y), and
  - one predictor (X).

• If two variables are related, a regression equation may be used to predict a response value from a predictor value better than random chance would.

Simple Regression Equation

• Y = β₀ + β₁X₁
  - Y = dependent variable (response)
  - X₁ = independent variable (predictor)
  - β₀ = intercept; the value of Y when X = 0.
  - β₁ = slope; the predicted change in output Y per unit change of input X.

• Alternatively, Y = mX + b (m is the slope, and b is the y-intercept).
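For reference, the least-squares estimates that software computes for this equation can be written as follows. This is the standard textbook form rather than something shown on the slides; here $\bar{x}$ and $\bar{y}$ are assumed to denote the sample means of the $n$ observed pairs $(x_i, y_i)$.

```latex
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\beta_0 = \bar{y} - \beta_1 \bar{x}
```

The second expression shows that the fitted line always passes through the point of means $(\bar{x}, \bar{y})$.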


Computing Slope and Intercept

• We typically use software to compute the slope and y-intercept. In Excel, we may use:
  - =slope(y-array,x-array)
  - =intercept(y-array,x-array)

| Sample | Temp | Efficiency |
|---|---|---|
| 1 | 170 | 0.84 |
| 2 | 172 | 1.31 |
| 3 | 173 | 1.42 |
| 4 | 174 | 1.03 |
| 5 | 174 | 1.07 |
| … | … | … |
| 24 | 188 | 3.08 |

| Statistic | Value | Excel formula |
|---|---|---|
| slope | 0.094 | =slope(C2:C25,B2:B25) |
| intercept | -15.245 | =intercept(C2:C25,B2:B25) |
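The same slope and intercept can be computed without Excel. Below is a minimal Python sketch of an ordinary least-squares fit, assuming the 24 observations from the Example 1 table; `fit_line` is a hypothetical helper name, not a function from the lecture or from any library.

```python
# Coating data from Example 1: (tank temperature, efficiency ratio).
data = [
    (170, 0.84), (172, 1.31), (173, 1.42), (174, 1.03), (174, 1.07), (175, 1.08),
    (176, 1.04), (177, 1.80), (180, 1.45), (180, 1.60), (180, 1.61), (180, 2.13),
    (180, 2.15), (181, 0.84), (181, 1.43), (182, 0.90), (182, 1.81), (182, 1.94),
    (182, 2.68), (184, 1.49), (184, 2.52), (185, 3.00), (186, 1.87), (188, 3.08),
]


def fit_line(pairs):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    s_xx = sum((x - x_mean) ** 2 for x, _ in pairs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = fit_line(data)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Rounded to three decimals, the result should agree with the Excel values of 0.094 and -15.245.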

Coating Example: Trend Line

• You may add this line to your scatter plot by selecting your chart and then using the add trend line command under the chart menu.

[Figure: scatter plot with fitted trend line, "Temperature Vs. Coating Efficiency"; x-axis: Temperature (165 to 190), y-axis: Phosphate Coating Efficiency Ratio (0 to 3.5)]


Slope Values and Trend Lines

• Positive slope values
  - Increasing trend lines on the scatter plot.

• Negative slope values
  - Decreasing trend lines on the scatter plot.

• No slope (~0)
  - Horizontal trend lines.
  - Comment: be careful with using absolute magnitudes. Depending on the units, a very small slope deviation from 0 could be significant.

Coating Example: Slope Values and Trend Lines

• The slope is greater over the entire temperature range (170-188) of the study than over the reduced range (180-182).

• The slope is a positive value → increasing trend.

| Temp (Range X) | Efficiency Ratio (Range Y) | Slope | Y-Intercept |
|---|---|---|---|
| 170-188 | 2.24 | 0.094 | -15.245 |
| 180-182 | 1.84 | 0.008 | 0.153 |
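The second row of the table comes from refitting the line on only the observations with temperatures between 180 and 182. A minimal Python sketch of that refit follows, assuming the Example 1 data; `fit_line` and `reduced` are illustrative names.

```python
# Coating data from Example 1: (tank temperature, efficiency ratio).
data = [
    (170, 0.84), (172, 1.31), (173, 1.42), (174, 1.03), (174, 1.07), (175, 1.08),
    (176, 1.04), (177, 1.80), (180, 1.45), (180, 1.60), (180, 1.61), (180, 2.13),
    (180, 2.15), (181, 0.84), (181, 1.43), (182, 0.90), (182, 1.81), (182, 1.94),
    (182, 2.68), (184, 1.49), (184, 2.52), (185, 3.00), (186, 1.87), (188, 3.08),
]


def fit_line(pairs):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    s_xx = sum((x - x_mean) ** 2 for x, _ in pairs)
    slope = s_xy / s_xx
    return slope, y_mean - slope * x_mean


# Keep only the observations inside the reduced temperature range.
reduced = [(x, y) for x, y in data if 180 <= x <= 182]

slope_full, intercept_full = fit_line(data)
slope_reduced, intercept_reduced = fit_line(reduced)

print(f"170-188: slope = {slope_full:.3f}, intercept = {intercept_full:.3f}")
print(f"180-182: slope = {slope_reduced:.3f}, intercept = {intercept_reduced:.3f}")
```

The fitted slope over the reduced range is roughly a tenth of the full-range slope, so the trend line there is nearly horizontal.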


Model Predictions

• Slope and intercept are mathematical calculations. We can always compute them.

• More important question: how effective are these terms at predicting any individual observation?

• One way to assess the effectiveness of the prediction is to examine the residuals.

Residual Terms

• Residual (obs i) = Y_actual (obs i) - Y_predicted (obs i)
  - Vertical bars are the residuals for each observation of Y.

[Figure: scatter plot with trend line and vertical residual bars, "Temperature Vs. Coating Efficiency"; x-axis: Temperature (165 to 190), y-axis: Phosphate Coating Efficiency Ratio (0 to 3.5)]


Lecture Exercise 2: Computing Predicted Value and Residual

• Consider sample #22, where X = 185 and Y = 3.0.

• Using the regression equation (Y = 0.094X - 15.245), compute the following:
  - Y_predicted (obs 22) = (0.094 × 185) - 15.245 = 2.145
  - Y_residual (obs 22) = Y_actual - Y_predicted = 3.0 - 2.145 = 0.855
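The arithmetic in this exercise can be checked with a few lines of Python. This is a minimal sketch that uses the rounded coefficients quoted in the regression equation (0.094 and -15.245); `predict` and `residual` are illustrative helper names.

```python
# Rounded coefficients from the regression equation Y = 0.094X - 15.245.
SLOPE = 0.094
INTERCEPT = -15.245


def predict(x):
    """Predicted efficiency ratio at tank temperature x."""
    return SLOPE * x + INTERCEPT


def residual(x, y_actual):
    """Residual = actual response minus predicted response."""
    return y_actual - predict(x)


y_predicted = predict(185)          # observation 22: X = 185
y_residual = residual(185, 3.0)     # observation 22: Y = 3.0

print(f"Y predicted (obs 22) = {y_predicted:.3f}")
print(f"Y residual  (obs 22) = {y_residual:.3f}")
```

A positive residual means the observation lies above the fitted line: the model under-predicts the efficiency for this sample.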


Residuals

• Smaller residuals indicate a better prediction.

• Consider the following graphs. Which has smaller residuals, A or B?

[Figure: two scatter plots labeled Group A and Group B, each titled "Temperature Vs. Coating Efficiency"; x-axis: Temperature (165 to 190), y-axis: Phosphate Coating Efficiency Ratio (0 to 3)]

• Answer: A


III. Correlation

• Correlation (R) provides a measure of how well the model predicts.

• Perfect correlation means that we can pass a line through every observation (all residuals = 0).

[Figure: X-Y plots in which a straight line passes through every observation; R = 1.0]

Correlation

• In assessing relationships between variables, we often want to know the strength of the relationship.

• The Pearson correlation coefficient, R, measures the extent to which two variables are related.

$$R = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$

  - where i = 1..n pairs, and s_x and s_y are the sample standard deviations of x and y
  - -1 ≤ R ≤ 1
  - Microsoft Excel function: =correl(array1,array2)
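To make the Pearson formula concrete, here is a minimal Python sketch that evaluates it term by term for the coating data, assuming the 24 observations from the Example 1 table. The function name `pearson_r` is illustrative; the sample standard deviations use the n - 1 divisor, matching the (n - 1) in the denominator of the formula.

```python
import math

# Coating data from Example 1: (tank temperature, efficiency ratio).
data = [
    (170, 0.84), (172, 1.31), (173, 1.42), (174, 1.03), (174, 1.07), (175, 1.08),
    (176, 1.04), (177, 1.80), (180, 1.45), (180, 1.60), (180, 1.61), (180, 2.13),
    (180, 2.15), (181, 0.84), (181, 1.43), (182, 0.90), (182, 1.81), (182, 1.94),
    (182, 2.68), (184, 1.49), (184, 2.52), (185, 3.00), (186, 1.87), (188, 3.08),
]


def pearson_r(pairs):
    """Pearson correlation: sum of cross-deviations / ((n - 1) * s_x * s_y)."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    # Sample standard deviations (n - 1 divisor).
    s_x = math.sqrt(sum((x - x_mean) ** 2 for x, _ in pairs) / (n - 1))
    s_y = math.sqrt(sum((y - y_mean) ** 2 for _, y in pairs) / (n - 1))
    cross = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    return cross / ((n - 1) * s_x * s_y)


r = pearson_r(data)
print(f"R = {r:.3f}")
```

For the coating data this should reproduce the value Excel's correl function reports, about 0.67.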


Correlation: Coating Example

• From Excel: R = correl(B2:B25,C2:C25) = 0.67

[Figure: scatter plot, "Temperature Vs. Coating Efficiency"; x-axis: Temperature (165 to 190), y-axis: Phosphate Coating Efficiency Ratio (0 to 3.5)]

Correlation Patterns

[Figure: four scatter plots illustrating the patterns below]

• Perfect positive: R = 1.0

• Strong positive: R = 0.7

• Perfect negative: R = -1.0

• Strong negative: R = -0.7

Rule of thumb: |Correlation| > 0.7 → strong relationship


No Correlation

• If no correlation exists, R = 0.

[Figure: scatter plot with no visible trend; x-axis: Predictor, X; y-axis: Response, Y]

IV. Assessing Model Accuracy, R²

• Another tool to assess model accuracy (or predictability) is R².

• R²: coefficient of determination (the square of the multiple correlation coefficient)
  - For simple linear regression, R² is computed by squaring the correlation, R.
  - 0 (no correlation) ≤ R² ≤ 1 (perfect correlation)


What Does R² Measure?

• R² measures the percentage of the variation in Y explained by the variation in X over the range of X.

• Suppose R = 1 → R² = 1; thus, all of the variation in Y may be explained by X.

• R = 0.7 → R² = 0.49; thus, 49% of the variation in Y may be explained by X.

• R = 0.1 → R² = 0.01; thus, only 1% of the variation in Y may be explained by X.

Coating Example Revisited

• Recall our different equations based on the range of X for the coating example.

• Over the full range, we have relatively high correlation (R = 0.672), where temperature explains ~45% of the variation in efficiency ratio.

• Over the tighter range, temperature explains little of the variation in efficiency ratio (~0%).

| Temp (Range X) | Efficiency Ratio (Range Y) | Slope | Y-Intercept | R | R² |
|---|---|---|---|---|---|
| 170-188 | 2.24 | 0.094 | -15.245 | 0.672 | 0.45 |
| 180-182 | 1.84 | 0.008 | 0.153 | 0.015 | 0.00 |
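The "percentage of variation explained" reading of R² can be checked numerically. The minimal Python sketch below, assuming the 24 coating observations from Example 1, computes R² two ways: by squaring R, and as one minus the ratio of residual variation to total variation in Y. For a simple linear regression fitted by least squares the two should agree. All variable names are illustrative.

```python
import math

# Coating data from Example 1: (tank temperature, efficiency ratio).
data = [
    (170, 0.84), (172, 1.31), (173, 1.42), (174, 1.03), (174, 1.07), (175, 1.08),
    (176, 1.04), (177, 1.80), (180, 1.45), (180, 1.60), (180, 1.61), (180, 2.13),
    (180, 2.15), (181, 0.84), (181, 1.43), (182, 0.90), (182, 1.81), (182, 1.94),
    (182, 2.68), (184, 1.49), (184, 2.52), (185, 3.00), (186, 1.87), (188, 3.08),
]

n = len(data)
x_mean = sum(x for x, _ in data) / n
y_mean = sum(y for _, y in data) / n
s_xx = sum((x - x_mean) ** 2 for x, _ in data)
s_yy = sum((y - y_mean) ** 2 for _, y in data)  # total variation in Y
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in data)

# Route 1: square the correlation (equivalent to the (n - 1) s_x s_y form).
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2

# Route 2: fraction of the variation in Y not left in the residuals.
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in data)
explained_fraction = 1 - ss_residual / s_yy

print(f"R^2 from squaring R:        {r_squared:.3f}")
print(f"1 - SS_residual / SS_total: {explained_fraction:.3f}")
```

Both routes give about 0.45 for the full temperature range, matching the first row of the table.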


Lecture Exercise 3: Model Prediction and Correlation

• Suppose you are in charge of a Design for Six Sigma project to determine the appropriate pressure settings for bicycle tires.
  - Currently you produce 37 mm tires.

• One of your response variables is the coefficient of rolling friction (Cr).

• Note: the lower the Cr, the better the ride.

Lecture Exercise 3: Bicycle Tire Analysis Data

• Experiment:
  - Response: coefficient of rolling friction (Cr)
  - Predictor: tire pressure
  - Target: Cr < 0.006

• Perform the following: scatter plot (pressure vs. Cr), fitted regression line, correlation (R), and an assessment of model accuracy with R².

| Pressure (PSI) | Cr (Width = 37 mm) |
|---|---|
| 20 | 0.0100 |
| 25 | 0.0095 |
| 30 | 0.0088 |
| 35 | 0.0081 |
| 40 | 0.0074 |
| 45 | 0.0067 |
| 50 | 0.0060 |
| 55 | 0.0058 |
| 60 | 0.0056 |
| 65 | 0.0054 |
| 70 | 0.0052 |
| 75 | 0.0050 |


Tire Example: Scatter Plot / R²

• Tire example: R = -0.9698; R² = 0.940

[Figure: scatter plot, "Cr Vs. Tire Pressure", with fitted line y = -9E-05x + 0.0115 and R² = 0.9405; x-axis: Tire Pressure (PSI) (0 to 80), y-axis: Coefficient of Rolling Friction (0.0000 to 0.0120)]
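The fitted line and correlation for the tire example can be reproduced with a short script. This is a minimal Python sketch, assuming the twelve (pressure, Cr) pairs from the Lecture Exercise 3 table; the variable names are illustrative. Note that the chart label shows the slope rounded to one significant figure (-9E-05).

```python
import math

# Tire data from Lecture Exercise 3: (pressure in PSI, coefficient of rolling friction).
tire = [
    (20, 0.0100), (25, 0.0095), (30, 0.0088), (35, 0.0081), (40, 0.0074), (45, 0.0067),
    (50, 0.0060), (55, 0.0058), (60, 0.0056), (65, 0.0054), (70, 0.0052), (75, 0.0050),
]

n = len(tire)
x_mean = sum(x for x, _ in tire) / n
y_mean = sum(y for _, y in tire) / n
s_xx = sum((x - x_mean) ** 2 for x, _ in tire)
s_yy = sum((y - y_mean) ** 2 for _, y in tire)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in tire)

slope = s_xy / s_xx                      # change in Cr per PSI
intercept = y_mean - slope * x_mean      # fitted Cr at 0 PSI (an extrapolation)
r = s_xy / math.sqrt(s_xx * s_yy)        # Pearson correlation
r_squared = r ** 2

print(f"slope = {slope:.3e}, intercept = {intercept:.5f}")
print(f"R = {r:.4f}, R^2 = {r_squared:.4f}")
```

The negative sign of R reflects the decreasing trend: higher pressure goes with lower rolling friction.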


Lecture Exercise 4: Interpreting Results

• Clearly, tire pressure has a strong effect on the coefficient of rolling friction.

1. Suppose the specification is Cr < 0.006. How might we determine the appropriate tire pressure from our model?

2. What tire pressure would eliminate rolling friction (Cr = 0)?

Solve the Equation for X

• Equation: Y = -0.00009X + 0.01146
  - If Y = 0.006, X ≈ 60 psi
  - If Y = 0, X ≈ 127 psi

• Do these values make sense?

| Pressure (PSI) | Cr (Width = 1.25") |
|---|---|
| 20 | 0.0100 |
| 25 | 0.0095 |
| 30 | 0.0088 |
| 35 | 0.0081 |
| 40 | 0.0074 |
| 45 | 0.0067 |
| 50 | 0.0060 |
| 55 | 0.0058 |
| 60 | 0.0056 |
| 65 | 0.0054 |
| 70 | 0.0052 |
| 75 | 0.0050 |
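Solving the fitted equation for X is a one-line rearrangement, X = (Y - intercept) / slope. The minimal Python sketch below uses the rounded coefficients from the equation on the slide and adds a check on whether each answer falls inside the tested pressure range of 20 to 75 PSI; `pressure_for` and `within_study` are hypothetical helper names introduced for illustration.

```python
# Rounded coefficients from the slide equation Y = -0.00009X + 0.01146.
SLOPE = -0.00009
INTERCEPT = 0.01146

# Pressures actually tested in the experiment (PSI).
TESTED_RANGE = (20, 75)


def pressure_for(cr_target):
    """Invert Y = SLOPE * X + INTERCEPT to find X for a target Cr."""
    return (cr_target - INTERCEPT) / SLOPE


def within_study(pressure):
    """True if the pressure lies inside the range covered by the data."""
    low, high = TESTED_RANGE
    return low <= pressure <= high


x_spec = pressure_for(0.006)  # pressure at the specification limit
x_zero = pressure_for(0.0)    # pressure at which the line predicts Cr = 0

for label, x in (("Cr = 0.006", x_spec), ("Cr = 0", x_zero)):
    print(f"{label}: X = {x:.1f} psi, inside tested range: {within_study(x)}")
```

The first answer (about 60 psi) lies inside the tested range; the second (about 127 psi) is far outside it, so the model gives no real support for that prediction.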


Re-Examine Scatter Plot

• Is this graph linear? No, it is non-linear.

[Figure: scatter plot, "Cr Vs. Tire Pressure"; x-axis: Tire Pressure (PSI) (0 to 80), y-axis: Coefficient of Rolling Friction (0.0000 to 0.0120)]
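One simple numeric way to see the non-linearity is to fit separate lines to the lower and upper halves of the pressure range and compare the slopes. The following is a minimal Python sketch under that assumption; splitting at 50 PSI is a choice made here for illustration, and `slope_of` is a hypothetical helper.

```python
# Tire data from Lecture Exercise 3: (pressure in PSI, coefficient of rolling friction).
tire = [
    (20, 0.0100), (25, 0.0095), (30, 0.0088), (35, 0.0081), (40, 0.0074), (45, 0.0067),
    (50, 0.0060), (55, 0.0058), (60, 0.0056), (65, 0.0054), (70, 0.0052), (75, 0.0050),
]


def slope_of(pairs):
    """Least-squares slope of y on x."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    s_xx = sum((x - x_mean) ** 2 for x, _ in pairs)
    return s_xy / s_xx


low_pressures = [(x, y) for x, y in tire if x <= 50]   # 20 to 50 PSI
high_pressures = [(x, y) for x, y in tire if x >= 50]  # 50 to 75 PSI

slope_low = slope_of(low_pressures)
slope_high = slope_of(high_pressures)

print(f"slope, 20-50 PSI: {slope_low:.2e} per PSI")
print(f"slope, 50-75 PSI: {slope_high:.2e} per PSI")
```

If the relationship were truly linear, the two slopes would be about equal. Here Cr falls more than three times as fast per PSI below 50 PSI as it does above, which a single straight line cannot capture despite the high R².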


V. Regression Abuses / Misinterpreting Correlation

• Between the coating efficiency and tire examples, we have noted several potential abuses:
  - Check that the relationship is actually linear before applying linear regression.
  - Do not make inferences outside the region of study (example: tire pressure = 0, or tire pressure = 130 psi).
  - Relationships between X and Y may change depending on the range of observed X values.

Extreme Values

• Consider an experiment relating tonnage and draw depth.

• Based on these data, are they strongly related?

| Tonnage | Draw depth |
|---|---|
| 946 | 60.22 |
| 940 | 60.24 |
| 935 | 60.25 |
| 939 | 60.29 |
| 944 | 60.30 |
| 936 | 60.36 |
| 946 | 60.37 |
| 912 | 60.92 |
| 939 | 60.02 |
| 940 | 60.08 |

Correlation: -0.79


Draw Depth Example

• With the tonnage = 912 reading → R = -0.79; without this reading → R = 0.015.

• Lesson: graph before interpreting correlation!

[Figure: scatter plot, "Tonnage Vs. Drawdepth"; x-axis: Tonnage (910 to 950), y-axis: Draw depth (59.80 to 61.00)]
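The effect of a single extreme value can be checked by computing the correlation with and without it. Below is a minimal Python sketch, assuming the ten (tonnage, draw depth) readings as transcribed in the table; `pearson_r` is an illustrative helper. The recomputed values may differ somewhat from the rounded figures quoted on the slide, but the qualitative lesson is the same.

```python
import math

# Draw depth data: (tonnage, draw depth).
draw = [
    (946, 60.22), (940, 60.24), (935, 60.25), (939, 60.29), (944, 60.30),
    (936, 60.36), (946, 60.37), (912, 60.92), (939, 60.02), (940, 60.08),
]


def pearson_r(pairs):
    """Pearson correlation coefficient of the (x, y) pairs."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    s_xx = sum((x - x_mean) ** 2 for x, _ in pairs)
    s_yy = sum((y - y_mean) ** 2 for _, y in pairs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    return s_xy / math.sqrt(s_xx * s_yy)


# Drop the single extreme reading at tonnage = 912.
without_extreme = [(x, y) for x, y in draw if x != 912]

r_all = pearson_r(draw)
r_trimmed = pearson_r(without_extreme)

print(f"R with the 912 reading:    {r_all:.2f}")
print(f"R without the 912 reading: {r_trimmed:.2f}")
```

With all ten readings the correlation looks strong and negative; with the one extreme point removed it is much closer to zero, so the apparent relationship rests almost entirely on a single observation.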

Interpreting Correlation

• When drawing conclusions based on correlation, several issues must be considered:
  - The Pearson correlation coefficient (R) measures only the linear relationship (a non-linear relationship may still exist).
  - Correlation does not always indicate cause and effect!
  - The correlation coefficient is very sensitive to extreme values. ALWAYS GRAPH.


Correlation Vs. Causation

• Correlation does not necessarily imply causation.
  - Does your income increase because you are older, or because you have more experience/seniority at your company?

[Figure: scatter plot; x-axis: Age, X; y-axis: Income, Y]

Verifying Causation

• To verify that correlation relates to causation, you need to conduct controlled experiments.

• Hold other process variables fixed and then test if Y changes in relation to X.

• Note: Design of Experiments (Black Belt Skill) provides more advanced verification approaches.


Regression / Correlation and Six Sigma Projects

• During the Analysis phase of a Six Sigma project, we try to understand relationships between our outputs (KPOVs) and our inputs (KPIVs).

• Regression and correlation provide tools to assess relationships.

• Remember, finding no correlation may be just as important as finding a strong correlation.