Chapter 12:
Linear Regression
Introduction
• Regression analysis and analysis of variance are the two most widely used statistical procedures.
• Regression analysis is used for:
  – Description
  – Prediction
  – Estimation
12.1 Simple Linear Regression
• In (univariate) regression, there is always a single “dependent” variable and one or more “independent” variables.
  – For example, the number of nonconforming units may depend on the amount of time devoted to maintaining control charts.
• Simple denotes the fact that a single independent variable is being used.
• Linear refers to the parameters, not to the independent variables.
12.1 Simple Linear Regression

    Y = β0 + β1X + ε    (12.1)
    Ŷ = β̂0 + β̂1X    (12.2)

• Y = β0 + β1X is the general form of the equation for a straight line.
• The ε in (12.1) indicates that there is not an exact relationship between X and Y.
• Regression analysis is not used for variables that have an exact linear relationship.
• β0 and β1 are generally unknown and must be estimated.
• The ε is generally thought of as an error term.
• Let Y denote the number of nonconforming units produced each month, and X represent the amount of time devoted to using QC charts each month.
Table 12.1 Quality Improvement Data

Month       Time Devoted to      # of Non-
            Quality Impr. (X)    conforming (Y)
January           56                  20
February          58                  19
March             55                  20
April             62                  16
May               63                  15
June              68                  14
July              66                  15
August            68                  13
September         70                  10
October           67                  13
November          72                   9
December          74                   8
Figure 12.1 Scatter Plot
Figure 12.1a Scatter Plot
12.1 Simple Linear Regression
• Regression equation: a line through the center of the points that minimizes the sum of the squared deviations from each point to the line (method of least squares).
• Σêi² = Σ(Yi − Ŷi)² is to be minimized, where Ŷi = β̂0 + β̂1Xi; the least-squares solutions are β̂1 = Sxy/Sxx and β̂0 = Ȳ − β̂1X̄.
• Carrying extra decimals in intermediate calculations avoids round-off error.
• The resulting prediction equation is Ŷ = β̂0 + β̂1X.
12.1 Simple Linear Regression
The regression equation is
Y = 55.9 - 0.641 X

Predictor   Coef       SE Coef   T        P
Constant    55.923     2.824     19.80    0.000
X           -0.64067   0.04332   -14.79   0.000

S = 0.888854   R-Sq = 95.6%   R-Sq(adj) = 95.2%

Analysis of Variance
Source           DF   SS       MS       F        P
Regression        1   172.77   172.77   218.67   0.000
Residual Error   10     7.90     0.79
Total            11   180.67
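The regression output can be checked with a short least-squares computation on the Table 12.1 data. A minimal sketch in plain Python (the December X value is taken as 74, matching observation 12 in the residual listing of Section 12.2):

```python
# Least-squares fit of Y (nonconforming units) on X (hours devoted to QC charts),
# using the 12 monthly observations from Table 12.1.
x = [56, 58, 55, 62, 63, 68, 66, 68, 70, 67, 72, 74]
y = [20, 19, 20, 16, 15, 14, 15, 13, 10, 13, 9, 8]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)                    # corrected sum of squares of X
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)                    # total sum of squares (SST)

b1 = sxy / sxx          # slope estimate
b0 = ybar - b1 * xbar   # intercept estimate
ssr = b1 * sxy          # regression sum of squares
r_sq = ssr / syy        # coefficient of determination

print(f"Y-hat = {b0:.3f} + ({b1:.5f}) X,  R-sq = {r_sq:.3f}")
# → Y-hat = 55.923 + (-0.64067) X,  R-sq = 0.956
```

The coefficients and R² agree with the Minitab output above.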
12.1 Simple Linear Regression
• The prediction equation should only be used for values of X within the range of the data, or slightly outside it.
• Descriptive use:
  – A decrease of about 0.64 nonconforming units for every additional hour devoted to quality improvement
12.2 Worth of the Prediction Equation
Obs     X      Y        Fit      SE Fit   Residual   St Resid
  1   56.0   20.000   20.046    0.464     -0.046      -0.06
  2   58.0   19.000   18.765    0.395      0.235       0.30
  3   55.0   20.000   20.687    0.500     -0.687      -0.93
  4   62.0   16.000   16.202    0.286     -0.202      -0.24
  5   63.0   15.000   15.561    0.270     -0.561      -0.66
  6   68.0   14.000   12.358    0.289      1.642       1.95
  7   66.0   15.000   13.639    0.261      1.361       1.60
  8   68.0   13.000   12.358    0.289      0.642       0.76
  9   70.0   10.000   11.077    0.338     -1.077      -1.31
 10   67.0   13.000   12.999    0.272      0.001       0.00
 11   72.0    9.000    9.795    0.400     -0.795      -1.00
 12   74.0    8.000    8.514    0.470     -0.514      -0.68
12.2 Worth of the Prediction Equation
• Pure error: data points with the same X but different Y values constitute pure error, since the regression line cannot be vertical.
• Measure of the worth of the prediction equation:

    R² = SSR/SST = 1 − SSE/SST    (12.4)

• Since SST = SSR + SSE and both components are nonnegative, 0 ≤ R² ≤ 1.
• If β̂1 = 0 (no relationship between X and Y), R² = 0.
12.3 Assumptions
• The true relationship between X and Y can be adequately represented by the model

    Y = β0 + β1X + ε    (12.1)

• The errors should be independent.
• The errors are approximately normally distributed.
12.4 Checking Assumptions through Residual Plots
• The residuals should be plotted against:
  – X or Ŷ
  – Time
  – Any other variable of interest
• In a well-behaved residual plot:
  – All points are close to the midline
  – They form a tight cluster that can be enclosed in a rectangle
• If there are residual outliers, investigate them.
• If the error variance increases or decreases, the problem can be remedied by a transformation of X.
• If the plot has the form of a parabola, an X² term would probably be needed.
12.5 Confidence Intervals
• Assumption: normality of the error terms; if it is untenable, alternatives include:
  – Robust regression
  – Nonparametric regression
• Confidence interval for β1:  β̂1 ± t(α/2, n−2)·s(β̂1)
• Confidence interval for β0:  β̂0 ± t(α/2, n−2)·s(β̂0)
  where s(β̂1) = s/√Sxx, s(β̂0) = s·√(1/n + X̄²/Sxx), and s = √(SSE/(n−2))
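As a sketch of a slope interval, using the coefficient and standard error from the regression output for the quality-improvement data (β̂1 = −0.64067, s(β̂1) = 0.04332) and the tabled value t(.025, 10) = 2.228:

```python
# 95% confidence interval for the slope beta_1, built from the regression
# output values; t(.025, 10) = 2.228 is taken from a t table (n - 2 = 10 df).
b1, se_b1 = -0.64067, 0.04332
t_crit = 2.228

lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
print(f"95% CI for beta_1: ({lower:.3f}, {upper:.3f})")
# → 95% CI for beta_1: (-0.737, -0.544)
```

The interval excludes zero, consistent with the t statistic of −14.79 in the output.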
12.5 Hypothesis Tests
• Hypothesis test of H0: β1 = 0 against Ha: β1 ≠ 0:

    t = β̂1 / s(β̂1)

  where s(β̂1) = s/√Sxx and s = √(SSE/(n−2))
12.6 Prediction Interval for Y
• A 100(1−α)% prediction interval for Y at X = x0:

    Ŷ0 ± t(α/2, n−2)·s·√(1 + 1/n + (x0 − X̄)²/Sxx)

  where Ŷ0 = β̂0 + β̂1x0 and s = √(SSE/(n−2))
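A sketch of such an interval at a hypothetical new value x0 = 65 hours, using summary statistics of the Table 12.1 data (x̄ ≈ 64.917, Sxx ≈ 420.92, s = 0.888854, and t(.025, 10) = 2.228):

```python
import math

# 95% prediction interval for Y at x0 = 65 (a hypothetical new month),
# using the fitted line Y-hat = 55.923 - 0.64067 X.
b0, b1 = 55.923, -0.64067
s = 0.888854                       # residual standard deviation from the output
n, xbar, sxx = 12, 64.9167, 420.9167
t_crit = 2.228                     # t(.025, 10)

x0 = 65.0
y_hat = b0 + b1 * x0
half_width = t_crit * s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
print(f"Y-hat = {y_hat:.2f}, 95% PI: ({y_hat - half_width:.2f}, {y_hat + half_width:.2f})")
# → Y-hat = 14.28, 95% PI: (12.22, 16.34)
```

Note the leading 1 under the square root: a prediction interval for an individual Y is wider than a confidence interval for the mean of Y at the same x0.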
12.7 Regression Control Chart
• Used to monitor the dependent variable with a control chart approach.
• The center line is

    Ŷ = β̂0 + β̂1X    (12.5)

• Control limits for Y:

    Ŷ ± 3s    (12.6)

  where Ŷ = β̂0 + β̂1X and s = √(SSE/(n−2))
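A minimal sketch for the quality-improvement fit, assuming limits placed a fixed multiple of s above and below the fitted line (the 3s multiplier is an assumption here; some presentations also add 2s warning limits):

```python
# Regression control chart: center line Y-hat = b0 + b1*X, with control
# limits k*s above and below the line (k = 3 assumed here).
b0, b1, s = 55.923, -0.64067, 0.888854

def control_limits(x, k=3.0):
    """Return (LCL, center line, UCL) for the dependent variable at X = x."""
    center = b0 + b1 * x
    return center - k * s, center, center + k * s

lcl, cl, ucl = control_limits(65.0)
print(f"X = 65: LCL = {lcl:.2f}, CL = {cl:.2f}, UCL = {ucl:.2f}")
```

Unlike a standard Shewhart chart, the center line varies with X, so each plotted Y is judged against limits computed at its own X value.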
12.8 Cause-Selecting Control Chart
• The general idea is to distinguish between quality problems that occur at one stage of a process and problems that occur at a previous processing step.
• Let Y be the output from the second step and let X denote the output from the first step. The relationship between X and Y would be modeled.
12.9 Linear, Nonlinear, and Nonparametric Profiles
• Profile refers to the quality of a process or product being characterized by a (linear, nonlinear, or nonparametric) relationship between a response variable and one or more explanatory variables.
• One possible approach is to monitor each parameter in the model with a Shewhart chart.
  – The independent variables must be fixed
  – A control chart for R² can also be maintained
12.10 Inverse Regression
• An important application of simple linear regression for quality improvement is in the area of calibration.
• Assume two measuring tools are available: one is quite accurate but expensive to use, and the other is less expensive but also less accurate. If the measurements obtained from the two devices are highly correlated, then the measurement that would have been made with the expensive device can be predicted fairly well from the measurement made with the less expensive device.
• Let Y = measurement from the less expensive device
      X = measurement from the accurate device
12.10 Inverse Regression
Classical estimation approach
• First, regress Y on X to obtain Ŷ = β̂0 + β̂1X.
• Solve for X:  X̂ = (Y − β̂0)/β̂1
• For a known value of Y, say Y0, the equation is X̂ = (Y0 − β̂0)/β̂1.

Inverse regression (X is regressed on Y)
• X̂ = α̂0 + α̂1Y
• The two approaches would give the same estimate only if X and Y were perfectly correlated.
12.10 Inverse Regression Example

Classical estimation approach
• First, regress Y on X to obtain Ŷ = β̂0 + β̂1X.

Inverse regression (X is regressed on Y)
• At a given value of Y, compute the corresponding estimate X̂.

 Y     X
2.3   2.4
2.5   2.6
2.4   2.5
2.8   2.9
2.9   3.0
2.6   2.7
2.4   2.5
2.2   2.3
2.1   2.2
2.7   2.7
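The two approaches can be compared on the ten calibration pairs above. A sketch in plain Python (the reading Y0 = 2.5 is a hypothetical value chosen for illustration):

```python
# Classical vs. inverse regression estimates of X for an observed Y0,
# using the ten calibration pairs from the example.
y = [2.3, 2.5, 2.4, 2.8, 2.9, 2.6, 2.4, 2.2, 2.1, 2.7]  # less expensive device
x = [2.4, 2.6, 2.5, 2.9, 3.0, 2.7, 2.5, 2.3, 2.2, 2.7]  # accurate device
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

# Classical approach: regress Y on X, then solve Y0 = b0 + b1*X for X.
b1 = sxy / sxx
b0 = ybar - b1 * xbar
# Inverse regression: regress X on Y directly.
a1 = sxy / syy
a0 = xbar - a1 * ybar

y0 = 2.5                                  # hypothetical reading, cheap device
x_classical = (y0 - b0) / b1
x_inverse = a0 + a1 * y0
print(f"classical: {x_classical:.4f}, inverse: {x_inverse:.4f}")
```

Because X and Y are very highly correlated in these data, the two estimates nearly coincide; with weaker correlation they diverge.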
12.11 Multiple Linear Regression
• In multiple regression, there is more than one “independent” variable, e.g., Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε.
12.12 Issues in Multiple Regression
12.12.1 Variable Selection
• R² will virtually always increase when additional variables are added to a prediction equation.
• SSR likewise increases when new regressors are added.
• A commonly used statistic for determining the number of parameters is Mallows' Cp:

    Cp = SSEp/σ̂² − (n − 2p)

  where p is the number of parameters in the model, SSEp is the residual sum of squares for that model, and σ̂² is the error variance estimate using all the available regressors.
• The idea is to look hard at those prediction equations for which Cp is small and close to p.
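As a small arithmetic sketch of the Cp formula: if a candidate model's own mean square error happens to be used as σ̂², Cp works out to exactly p. The numbers below (SSE = 7.90, MSE = 0.79, n = 12, p = 2) are taken from the simple-regression output earlier in the chapter, treated here as one candidate model:

```python
# Mallows' Cp = SSE_p / sigma2_full - (n - 2p); candidate models with Cp
# small and close to p deserve a hard look.
def mallows_cp(sse_p, sigma2_full, n, p):
    """Cp for a candidate model with p parameters and residual SS sse_p."""
    return sse_p / sigma2_full - (n - 2 * p)

# SSE/(n - p) = sigma2 implies Cp = (n - p) - (n - 2p) = p.
cp = mallows_cp(sse_p=7.90, sigma2_full=0.79, n=12, p=2)
print(f"Cp = {cp:.2f}")
# → Cp = 2.00
```

In practice σ̂² comes from the full model with all available regressors, and Cp is computed for each subset model under consideration.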
12.12.3 Multicollinear Data
• Problems occur when at least two of the regressors are related in some manner.
• Solutions:– Discard one or more variables causing the multicollinearity– Use ridge regression
12.12.4 Residual Plots
• Residual plots are used extensively in multiple regression for checking the model assumptions.
• The residuals should generally be plotted against Ŷ, each of the regressors, time, and any potential regressor.
12.12.6 Transformations
• A regression model can often be improved by transforming one or more of the regressors, and possibly the dependent variable as well.
• A transformation can also often be used to convert a nonlinear regression model into a linear one.
• For example, a multiplicative exponential model such as Y = β0·e^(β1X)·ε can be transformed into a linear model by taking logarithms: ln Y = ln β0 + β1X + ln ε.
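This kind of linearization can be sketched with noise-free synthetic data: generate Y from an exponential model (the values β0 = 2.0 and β1 = 0.3 are assumptions for illustration), regress ln Y on X, and recover the parameters.

```python
import math

# Linearizing Y = b0 * exp(b1 * X) by regressing ln(Y) on X.
# Synthetic, noise-free data with assumed parameters b0 = 2.0, b1 = 0.3.
b0_true, b1_true = 2.0, 0.3
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [b0_true * math.exp(b1_true * xi) for xi in x]

z = [math.log(yi) for yi in y]            # ln Y is linear in X
n = len(x)
xbar, zbar = sum(x) / n, sum(z) / n
b1_hat = (sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
          / sum((xi - xbar) ** 2 for xi in x))
b0_hat = math.exp(zbar - b1_hat * xbar)   # back-transform the intercept
print(f"recovered: b0 = {b0_hat:.3f}, b1 = {b1_hat:.3f}")
# → recovered: b0 = 2.000, b1 = 0.300
```

With real data the multiplicative error ln ε appears in the transformed model, so the least-squares assumptions apply on the log scale rather than the original scale.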