
Post on 22-Aug-2020


  • Pat Hammett, University of Michigan 1

    Two-Variable Analysis: Simple Linear Regression / Correlation

    Topics

    I. Scatter Plot (X-Y Graph)

    II. Simple Linear Regression

    III. Correlation, R

    IV. Assessing Model Accuracy, R²

    V. Regression Abuses / Misinterpreting Correlation

    I. Scatter Plot

    • Used to visualize the relationship between two variables.

    • Common results:
      Ø Linear relationships
      Ø Non-linear relationships
      Ø No relationship (robustness)

    Scatter Plot

    • Shows the relationship between X (predictor) and Y (response) over a given range of X.
      Ø X: independent variable (predictor variable)
      Ø Y: dependent variable (response variable)

    Example 1: Coating Thickness
    (From: “SPC of a Phosphate Coating Line”, Wire, J. J. Intl, May 1997, pp. 78-81.)

    • Suppose you measure the efficiency of a phosphate coating operation for steel versus coating tank temperature.
      Ø Which variable is the response, and which is the predictor?

    Sample  Temp  Efficiency    Sample  Temp  Efficiency
    1       170   0.84          13      180   2.15
    2       172   1.31          14      181   0.84
    3       173   1.42          15      181   1.43
    4       174   1.03          16      182   0.90
    5       174   1.07          17      182   1.81
    6       175   1.08          18      182   1.94
    7       176   1.04          19      182   2.68
    8       177   1.80          20      184   1.49
    9       180   1.45          21      184   2.52
    10      180   1.60          22      185   3.00
    11      180   1.61          23      186   1.87
    12      180   2.13          24      188   3.08


    Scatter Plot: Coating Efficiency

    • Is there a relationship here?

    [Figure: scatter plot, "Temperature Vs. Coating Efficiency"; x-axis: Temperature (165 to 190); y-axis: Phosphate Coating Efficiency Ratio (0 to 3.5)]

    • Yes: as temperature increases, efficiency increases.

    Lecture Exercise 1: Changing Range of X

    • Open the Excel file tanktemp.xls, which contains this data set.
      Ø Compute the range of Y (efficiency) if you reduce the tank temperature range from 170-188 to 180-182.
      Ø Is the range of Y (efficiency) smaller, larger, or the same as over the full range of X?
      Ø Construct a scatter plot of this new data set. Do you still think a relationship exists?

    Lecture Exercise 1: Effect on Y of Reducing Variation in X

    • Coating example:

    Temp (Range X)    Efficiency Ratio (Range Y)
    170-188           2.24
    180-182           1.84

    • Note: if a strong relationship exists (positive or negative) between X and Y, then reducing the variation in X should reduce the variation in Y.
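
The two ranges of Y can be reproduced outside Excel. The following is a minimal Python sketch, assuming the 24 samples as tabulated in Example 1; the helper name `y_range` is illustrative and not part of the original exercise.

```python
# Coating data as tabulated in Example 1 (tank temperature, efficiency ratio).
temps = [170, 172, 173, 174, 174, 175, 176, 177, 180, 180, 180, 180,
         180, 181, 181, 182, 182, 182, 182, 184, 184, 185, 186, 188]
effs = [0.84, 1.31, 1.42, 1.03, 1.07, 1.08, 1.04, 1.80, 1.45, 1.60, 1.61, 2.13,
        2.15, 0.84, 1.43, 0.90, 1.81, 1.94, 2.68, 1.49, 2.52, 3.00, 1.87, 3.08]


def y_range(x_lo, x_hi):
    """Range (max - min) of efficiency for samples with x_lo <= temp <= x_hi."""
    ys = [y for x, y in zip(temps, effs) if x_lo <= x <= x_hi]
    return max(ys) - min(ys)


full_range = y_range(170, 188)      # all 24 samples
reduced_range = y_range(180, 182)   # only the samples taken at 180-182
print(f"170-188: {full_range:.2f}")
print(f"180-182: {reduced_range:.2f}")
```

Narrowing X from an 18-degree span to a 2-degree span shrinks the range of Y only modestly, which already hints that temperature is not the only source of variation in efficiency.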

    Efficiency Vs. Reduced Range in Temperature

    • Over the smaller range of the input (temperature), this relationship weakens.

    [Figure: scatter plot, "Temperature Vs. Coating Efficiency"; x-axis: Temperature (179 to 183); y-axis: Phosphate Coating Efficiency Ratio (0 to 3)]

    Lessons from Coating Example

    • Relationships between Y and X variables may change depending on the range of X.

    • Scatter plots provide good visualization of relationships between variables, but we need a metric to assess the strength of a relationship.
      Ø For two variables, we use simple linear regression to fit a model and correlation to assess the strength of the relationship.

    II. Simple Linear Regression

    • Simple linear regression examines the relationship between two variables:
      Ø one response (Y), and
      Ø one predictor (X).

    • If two variables are related, a regression equation may be used to predict a response value from a predictor value with better than random chance.

    Simple Regression Equation

    • Y = β0 + β1X1
      Ø Y = dependent variable (response)
      Ø X1 = independent variable (predictor)
      Ø β0 = intercept; the predicted value of Y when X1 = 0.
      Ø β1 = slope; the predicted change in output Y per unit change of input X1.

    • Alternatively, Y = mX + b (m is the slope and b is the y-intercept).

    Computing Slope and Intercept

    • We typically use software to compute the slope and y-intercept. In Excel, we may use:
      Ø =slope(y-array,x-array)
      Ø =intercept(y-array,x-array)

    Sample  Temp  Efficiency
    1       170   0.84
    2       172   1.31
    3       173   1.42
    4       174   1.03
    5       174   1.07
    …       …     …
    24      188   3.08

    slope       0.094     =slope(C2:C25,B2:B25)
    intercept   -15.245   =intercept(C2:C25,B2:B25)
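
For readers without Excel, the same least-squares slope and intercept can be computed directly. This is a minimal sketch in plain Python, assuming the coating data as tabulated in Example 1; the function name `fit_line` is illustrative.

```python
# Coating data as tabulated in Example 1 (tank temperature, efficiency ratio).
temps = [170, 172, 173, 174, 174, 175, 176, 177, 180, 180, 180, 180,
         180, 181, 181, 182, 182, 182, 182, 184, 184, 185, 186, 188]
effs = [0.84, 1.31, 1.42, 1.03, 1.07, 1.08, 1.04, 1.80, 1.45, 1.60, 1.61, 2.13,
        2.15, 0.84, 1.43, 0.90, 1.81, 1.94, 2.68, 1.49, 2.52, 3.00, 1.87, 3.08]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar   # the fitted line passes through (x_bar, y_bar)
    return slope, intercept


slope, intercept = fit_line(temps, effs)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The slope is the ratio of the X-Y cross-deviation sum to the X deviation sum, and the intercept follows from forcing the line through the point of means.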

    Coating Example: Trend Line

    • You may add this line to your scatter plot by selecting your chart and then using the add trend line command under the chart menu.

    [Figure: scatter plot with fitted trend line, "Temperature Vs. Coating Efficiency"; x-axis: Temperature (165 to 190); y-axis: Phosphate Coating Efficiency Ratio (0 to 3.5)]

    Slope Values and Trend Lines

    • Positive slope values
      Ø Increasing trend lines on scatter plot.

    • Negative slope values
      Ø Decreasing trend lines on scatter plot.

    • No slope (~0)
      Ø Horizontal trend lines.
      Ø Comment: be careful when judging a slope by its absolute magnitude. Depending on the units, a very small deviation of the slope from 0 could be significant.

    Coating Example: Slope Values and Trend Lines

    • The slope is greater over the full temperature range (170-188) than over the reduced range (180-182).

    • The slope is a positive value → increasing trend.

    Temp (Range X)    Efficiency Ratio (Range Y)    Slope    Y-Intercept
    170-188           2.24                          0.094    -15.245
    180-182           1.84                          0.008    0.153

    Model Predictions

    • Slope and intercept are mathematical calculations. We can always compute them.

    • More important question: how effective are these terms at predicting any individual observation?

    • One way to assess the effectiveness of the prediction is to examine the residuals.

    Residual Terms

    • Residual (obs i) = Yactual (obs i) – Ypredicted (obs i)
      Ø Vertical bars are the residuals for each observation of Y.

    [Figure: scatter plot with trend line and vertical residual bars, "Temperature Vs. Coating Efficiency"; x-axis: Temperature (165 to 190); y-axis: Phosphate Coating Efficiency Ratio (0 to 3.5)]

    Lecture Exercise 2: Computing Predicted Value and Residual

    • Consider sample #22, where X = 185 and Y = 3.0.

    • Using the regression equation (Y = 0.094X – 15.245), compute the following:
      Ø Ypredicted (obs 22) = ?
      Ø Yresidual (obs 22) = ?

    Lecture Exercise 2: Solution

    • Obs 22: X = 185 and Y = 3.0.
      Ø Ypredicted (obs 22) = (0.094 × 185) – 15.245 = 2.145
      Ø Yresidual (obs 22) = Yactual – Ypredicted = 3.0 – 2.145 = 0.855
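
The arithmetic for sample 22 can be written as two small functions. This is a sketch only, using the rounded coefficients quoted in the exercise; the names `predict` and `residual` are illustrative.

```python
# Rounded regression coefficients quoted in the exercise.
SLOPE = 0.094
INTERCEPT = -15.245


def predict(x):
    """Predicted efficiency ratio at tank temperature x."""
    return SLOPE * x + INTERCEPT


def residual(x, y_actual):
    """Residual = actual value minus predicted value."""
    return y_actual - predict(x)


y_pred = predict(185)        # sample 22: X = 185
y_res = residual(185, 3.0)   # sample 22: Y = 3.0
print(f"predicted = {y_pred:.3f}, residual = {y_res:.3f}")
```

A positive residual means the observation lies above the fitted line, so the model under-predicts sample 22.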

    Residuals

    • Smaller residuals indicate a better prediction.

    • Consider the following graphs. Which has smaller residuals, A or B?

    [Figure: two scatter plots with trend lines, Group A and Group B, each titled "Temperature Vs. Coating Efficiency"; x-axis: Temperature (165 to 190); y-axis: Phosphate Coating Efficiency Ratio (0 to 3)]

    • Answer: A

    III. Correlation

    • Correlation (R) provides a measure of how well the model predicts.

    • Perfect correlation means that a line passes through every observation (all residuals = 0).

    [Figure: two X-Y plots; R = 1.0]

    Correlation

    • In assessing relationships between variables, we often want to know the strength of the relationship.

    • The Pearson correlation coefficient, R, measures the extent to which two variables are related:

    $$R = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$$

      Ø where i = 1..n pairs, and s_x and s_y are the sample standard deviations of X and Y
      Ø -1 ≤ R ≤ 1
      Ø Microsoft Excel function: =correl(array1,array2)

    Correlation – Coating Example

    • From Excel: R = correl(B2:B25,C2:C25) = 0.67

    [Figure: scatter plot with trend line, "Temperature Vs. Coating Efficiency"; x-axis: Temperature (165 to 190); y-axis: Phosphate Coating Efficiency Ratio (0 to 3.5)]
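
The Pearson formula translates almost term for term into code. The following is a minimal sketch, assuming the coating data as tabulated in Example 1; `pearson_r` is an illustrative name, and the sample standard deviations use n - 1 in the denominator.

```python
from math import sqrt

# Coating data as tabulated in Example 1 (tank temperature, efficiency ratio).
temps = [170, 172, 173, 174, 174, 175, 176, 177, 180, 180, 180, 180,
         180, 181, 181, 182, 182, 182, 182, 184, 184, 185, 186, 188]
effs = [0.84, 1.31, 1.42, 1.03, 1.07, 1.08, 1.04, 1.80, 1.45, 1.60, 1.61, 2.13,
        2.15, 0.84, 1.43, 0.90, 1.81, 1.94, 2.68, 1.49, 2.52, 3.00, 1.87, 3.08]


def pearson_r(xs, ys):
    """Pearson R: sum of cross-deviations divided by (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)


r = pearson_r(temps, effs)
print(f"R = {r:.2f}")
```

Swapping the two arguments gives the same value, since the formula is symmetric in X and Y.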

    Correlation Patterns

    [Figure: four scatter-plot panels: Perfect Positive (R = 1.0), Strong Positive (R = 0.7), Perfect Negative (R = -1.0), Strong Negative (R = -0.7)]

    Rule of thumb: |Correlation| > 0.7 → strong relationship

    No Correlation

    • If no correlation exists, R = 0.

    [Figure: X-Y plot; x-axis: Predictor, X; y-axis: Response, Y]

    IV. Assessing Model Accuracy, R²

    • Another tool to assess model accuracy (or predictability) is R².

    • R²: coefficient of determination
      Ø For simple linear regression, R² is computed by squaring the correlation, R.
      Ø 0 (no correlation) ≤ R² ≤ 1 (perfect correlation)

    What does R² Measure?

    • R² measures the % of the variation in Y explained by the variation in X over the range of X.

    • Suppose R = 1 → R² = 1; thus, all of the variation in Y may be explained by X.

    • R = 0.7 → R² = 0.49; thus, 49% of the variation in Y may be explained by X.

    • R = 0.1 → R² = 0.01; thus, only 1% of the variation in Y may be explained by X.

    Coating Example Revisited

    • Recall our different equations based on the range of X for the coating example.

    • Over the full range, we have fairly high correlation (R ≈ 0.67), where temperature explains ~45% of the variation in efficiency ratio.

    • Over the tighter range, temperature explains little of the variation in efficiency ratio (~0%).

    Temp (Range X)    Efficiency Ratio (Range Y)    Slope    Y-Intercept    R        R²
    170-188           2.24                          0.094    -15.245        0.672    0.45
    180-182           1.84                          0.008    0.153          0.015    0.00
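
The "variation explained" reading of R² can be checked by computing it as 1 - SSE/SST, which for simple linear regression equals the square of R. This is a hedged sketch, assuming the coating data as tabulated in Example 1; `r_squared` is an illustrative helper name.

```python
# Coating data as tabulated in Example 1 (tank temperature, efficiency ratio).
temps = [170, 172, 173, 174, 174, 175, 176, 177, 180, 180, 180, 180,
         180, 181, 181, 182, 182, 182, 182, 184, 184, 185, 186, 188]
effs = [0.84, 1.31, 1.42, 1.03, 1.07, 1.08, 1.04, 1.80, 1.45, 1.60, 1.61, 2.13,
        2.15, 0.84, 1.43, 0.90, 1.81, 1.94, 2.68, 1.49, 2.52, 3.00, 1.87, 3.08]


def r_squared(xs, ys):
    """Fraction of the variation in ys explained by a least-squares line in xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in ys)                                # total
    return 1 - sse / sst


r2_full = r_squared(temps, effs)

# Keep only the samples taken at 180-182.
kept = [(x, y) for x, y in zip(temps, effs) if 180 <= x <= 182]
r2_reduced = r_squared([x for x, _ in kept], [y for _, y in kept])

print(f"170-188: R^2 = {r2_full:.2f}")
print(f"180-182: R^2 = {r2_reduced:.2f}")
```

The residual sum of squares barely drops below the total sum of squares over the reduced range, which is what an R² near zero means.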

    Lecture Exercise 3: Model Prediction and Correlation

    • Suppose you are in charge of a Design for Six Sigma project to determine the appropriate pressure settings for bicycle tires.
      Ø Currently you produce 37 mm tires.

    • One of your response variables is the coefficient of rolling friction (Cr).

    • Note: the lower the Cr, the better the ride.

    Lecture Exercise 3: Bicycle Tire Analysis Data

    • Experiment:
      Ø Response:
        § coefficient of rolling friction (Cr)
      Ø Predictor:
        § tire pressure
      Ø Target: Cr < 0.006

    • Perform the following: scatter plot (pressure vs. Cr), fitted regression line, correlation (R), and assessment of model accuracy with R².

    Pressure (PSI)    Cr (Width = 37 mm)
    20                0.0100
    25                0.0095
    30                0.0088
    35                0.0081
    40                0.0074
    45                0.0067
    50                0.0060
    55                0.0058
    60                0.0056
    65                0.0054
    70                0.0052
    75                0.0050

    Tire Example: Scatter Plot / R²

    • Tire example: R = -0.9698; R² = 0.940

    [Figure: scatter plot with fitted line, "Cr Vs. Tire Pressure"; fitted line: y = -9E-05x + 0.0115, R² = 0.9405; x-axis: Tire Pressure (PSI), 0 to 80; y-axis: Coefficient of Rolling Friction, 0.0000 to 0.0120]
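
The fitted line and correlation for the tire data can be reproduced in a few lines. This is a minimal sketch in plain Python, assuming the 12 pressure/Cr pairs from the exercise table; the variable names are illustrative.

```python
from math import sqrt

# Tire data from the exercise table: pressure (PSI) and coefficient of rolling friction.
pressures = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
cr = [0.0100, 0.0095, 0.0088, 0.0081, 0.0074, 0.0067,
      0.0060, 0.0058, 0.0056, 0.0054, 0.0052, 0.0050]

n = len(pressures)
x_bar = sum(pressures) / n
y_bar = sum(cr) / n
sxx = sum((x - x_bar) ** 2 for x in pressures)
syy = sum((y - y_bar) ** 2 for y in cr)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pressures, cr))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
r = sxy / sqrt(sxx * syy)   # algebraically the same as the (n - 1) * s_x * s_y form

print(f"y = {slope:.1e}x + {intercept:.4f}")
print(f"R = {r:.4f}, R^2 = {r * r:.4f}")
```

The negative sign of R reflects the downward trend: higher pressure goes with lower rolling friction.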

    Lecture Exercise 4: Interpreting Results

    • Tire pressure clearly has a strong effect on the coefficient of rolling friction.

    1. Suppose the specification is Cr < 0.006. How might we determine the appropriate tire pressure from our model?

    2. What tire pressure would eliminate rolling friction (Cr = 0)?

    Solve the Equation for X

    • Equation: Y = -0.00009X + 0.01146
      Ø If Y = 0.006, X ≈ 60 psi
      Ø If Y = 0, X ≈ 127 psi

    • Do these values make sense?

    Pressure (PSI)    Cr (Width = 1.25")
    20                0.0100
    25                0.0095
    30                0.0088
    35                0.0081
    40                0.0074
    45                0.0067
    50                0.0060
    55                0.0058
    60                0.0056
    65                0.0054
    70                0.0052
    75                0.0050
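
Solving the fitted equation for X is a one-line rearrangement, X = (Y - intercept) / slope. The sketch below uses the rounded coefficients quoted on the slide and also flags whether each answer falls inside the 20-75 PSI range that was actually tested; the function names are illustrative.

```python
# Rounded coefficients quoted on the slide: Y = -0.00009 X + 0.01146
SLOPE = -0.00009
INTERCEPT = 0.01146

# Pressures actually tested in the experiment (PSI).
X_MIN, X_MAX = 20, 75


def pressure_for(cr_target):
    """Invert the fitted line: pressure at which the model predicts cr_target."""
    return (cr_target - INTERCEPT) / SLOPE


def in_study_range(x):
    """True if x lies inside the range of pressures that was actually tested."""
    return X_MIN <= x <= X_MAX


p_spec = pressure_for(0.006)   # pressure at the specification limit
p_zero = pressure_for(0.0)     # pressure at which the line crosses Cr = 0

for label, p in [("Cr = 0.006", p_spec), ("Cr = 0", p_zero)]:
    print(f"{label}: {p:.1f} psi, inside tested range: {in_study_range(p)}")
```

With these rounded coefficients the first answer lands between 60 and 61 psi, inside the tested range, while the second lies far outside it and is an extrapolation.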

    Re-Examine Scatter Plot

    • Is this graph linear?

    [Figure: scatter plot, "Cr Vs. Tire Pressure"; x-axis: Tire Pressure (PSI), 0 to 80; y-axis: Coefficient of Rolling Friction, 0.0000 to 0.0120]

    • No, it is non-linear.
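
One way to see the curvature without a graph is to look at the signs of the residuals from the straight-line fit: a run of same-sign residuals at both ends and the opposite sign in the middle suggests a curve. This is an illustrative sketch, assuming the tire data from the exercise table; it is a supplement to graphing and not a method taken from the slides.

```python
# Tire data from the exercise table: pressure (PSI) and coefficient of rolling friction.
pressures = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
cr = [0.0100, 0.0095, 0.0088, 0.0081, 0.0074, 0.0067,
      0.0060, 0.0058, 0.0056, 0.0054, 0.0052, 0.0050]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(pressures, cr)

# Residual = actual - predicted, in order of increasing pressure.
residuals = [y - (slope * x + intercept) for x, y in zip(pressures, cr)]
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)
```

For these data the signs run positive, then negative, then positive again, so the points sit above the line at low and high pressure and below it in between, despite the high R².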

    V. Regression Abuses / Misinterpreting Correlation

    • Between the coating efficiency and tire examples, we have noted several potential abuses:
      Ø Be careful that the relationship is actually linear when applying linear regression.
      Ø Do not make inferences outside the region of study (example: tire pressure = 0, or tire pressure = 130 psi).
      Ø Relationships between X and Y may change depending on the range of observed X values.

    Extreme Values

    • Consider an experiment relating tonnage and draw depth.

    • Based on these data, are they strongly related?

    Tonnage    Drawdepth
    946        60.22
    940        60.24
    935        60.25
    939        60.29
    944        60.30
    936        60.36
    946        60.37
    912        60.92
    939        60.02
    940        60.08

    Correlation: -0.79

    Draw Depth Example

    • With the tonnage = 912 reading → R = -0.79; without this reading → R ≈ 0.15.

    • Lesson: graph before interpreting correlation!

    [Figure: scatter plot, "Tonnage Vs. Drawdepth"; x-axis: Tonnage (910 to 950); y-axis: Drawdepth (59.80 to 61.00)]
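
The effect of the single extreme reading can be checked by computing the correlation with and without it. This is a minimal sketch, assuming the ten tonnage/draw depth pairs as tabulated; `pearson_r` is an illustrative helper name.

```python
from math import sqrt

# Tonnage and draw depth readings as tabulated.
tonnage = [946, 940, 935, 939, 944, 936, 946, 912, 939, 940]
depth = [60.22, 60.24, 60.25, 60.29, 60.30, 60.36, 60.37, 60.92, 60.02, 60.08]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


r_all = pearson_r(tonnage, depth)

# Drop the extreme reading at tonnage = 912 and recompute.
kept = [(x, y) for x, y in zip(tonnage, depth) if x != 912]
r_without = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"with 912 reading:    R = {r_all:.2f}")
print(f"without 912 reading: R = {r_without:.2f}")
```

One point out of ten is enough to move R from a seemingly strong negative value to something close to zero, which is why the scatter plot should come first.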

    Interpreting Correlation

    • When drawing conclusions based on correlation, several issues must be considered:
      Ø The Pearson correlation coefficient (R) measures only the linear relationship (a non-linear relationship may still exist).
      Ø Correlation does not always indicate cause and effect!
      Ø The correlation coefficient is very sensitive to extreme values – ALWAYS GRAPH.

    Correlation Vs. Causation

    • Correlation does not necessarily imply causation.
      Ø Does your income increase because you are older, or because you have more experience/seniority at your company?

    [Figure: X-Y plot; x-axis: Age, X; y-axis: Income, Y]

    Verifying Causation

    • To verify that a correlation reflects causation, you need to conduct controlled experiments.

    • Hold other process variables fixed and then test whether Y changes in relation to X.

    • Note: Design of Experiments (Black Belt skill) provides more advanced verification approaches.

    Regression / Correlation and Six Sigma Projects

    • During the Analysis phase of a Six Sigma project, we try to understand relationships between our outputs (KPOVs) and our inputs (KPIVs).

    • Regression and correlation provide tools to assess these relationships.

    • Remember, finding no correlation may be just as important as finding a strong correlation.