Chapter 12:
Linear Regression
Introduction
• Regression analysis and analysis of variance are the two most widely used statistical procedures.
• Regression analysis is used for:
  – Description
  – Prediction
  – Estimation
12.1 Simple Linear Regression
• In (univariate) regression, there is always a single “dependent” variable and one or more “independent” variables.
  – For example, the number of nonconforming units may depend on the amount of time devoted to maintaining control charts.
• Simple denotes the fact that a single independent variable is being used.
• Linear refers to the parameters, not to the independent variables.
12.1 Simple Linear Regression

    Y = β0 + β1X + ε    (12.1)
    Ŷ = β̂0 + β̂1X    (12.2)

• Y = β0 + β1X is the general form of the equation for a straight line.
• The ε in (12.1) indicates that there is not an exact relationship between X and Y.
• Regression analysis is not used for variables that have an exact linear relationship.
• β0 and β1 are generally unknown and must be estimated.
• The ε is generally thought of as an error term.
• Let Y denote the number of nonconforming units produced each month, and X represent the amount of time devoted to using QC charts each month.
Table 12.1 Quality Improvement Data

Month       Time Devoted to      # of Non-
            Quality Impr. (X)    conforming (Y)
January           56                  20
February          58                  19
March             55                  20
April             62                  16
May               63                  15
June              68                  14
July              66                  15
August            68                  13
September         70                  10
October           67                  13
November          72                   9
December          74                   8
Figure 12.1 Scatter Plot
Figure 12.1a Scatter Plot
12.1 Simple Linear Regression
• Regression equation: a line through the center of the points that minimizes the sum of the squared deviations from each point to the line (method of least squares).
• Σêi² = Σ(Yi − Ŷi)² is to be minimized, where Ŷi = β̂0 + β̂1Xi; the least-squares solutions are β̂1 = Sxy/Sxx and β̂0 = Ȳ − β̂1X̄.
• Carrying extra decimals in intermediate calculations avoids round-off error.
• The resulting prediction equation is Ŷ = β̂0 + β̂1X.
12.1 Simple Linear Regression
The regression equation is
Y = 55.9 - 0.641 X

Predictor   Coef       SE Coef   T        P
Constant    55.923     2.824     19.80    0.000
X           -0.64067   0.04332   -14.79   0.000

S = 0.888854   R-Sq = 95.6%   R-Sq(adj) = 95.2%

Analysis of Variance
Source           DF   SS       MS       F        P
Regression        1   172.77   172.77   218.67   0.000
Residual Error   10     7.90     0.79
Total            11   180.67
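The regression output can be checked with a short least-squares computation on the Table 12.1 data. A minimal sketch in plain Python (the December X value is taken as 74, matching observation 12 in the residual listing of Section 12.2):

```python
# Least-squares fit of Y (nonconforming units) on X (hours devoted to QC charts),
# using the 12 monthly observations from Table 12.1.
x = [56, 58, 55, 62, 63, 68, 66, 68, 70, 67, 72, 74]
y = [20, 19, 20, 16, 15, 14, 15, 13, 10, 13, 9, 8]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)                    # corrected sum of squares of X
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)                    # total sum of squares (SST)

b1 = sxy / sxx          # slope estimate
b0 = ybar - b1 * xbar   # intercept estimate
ssr = b1 * sxy          # regression sum of squares
r_sq = ssr / syy        # coefficient of determination

print(f"Y-hat = {b0:.3f} + ({b1:.5f}) X,  R-sq = {r_sq:.3f}")
# → Y-hat = 55.923 + (-0.64067) X,  R-sq = 0.956
```

The coefficients and R² agree with the Minitab output above.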
12.1 Simple Linear Regression
• The prediction equation should only be used for values of X within the range of the data, or slightly outside it.
• Descriptive use:
  – A decrease of about 0.64 nonconforming units for every additional hour devoted to quality improvement
12.2 Worth of the Prediction Equation
Obs     X      Y        Fit      SE Fit   Residual   St Resid
  1   56.0   20.000   20.046    0.464     -0.046      -0.06
  2   58.0   19.000   18.765    0.395      0.235       0.30
  3   55.0   20.000   20.687    0.500     -0.687      -0.93
  4   62.0   16.000   16.202    0.286     -0.202      -0.24
  5   63.0   15.000   15.561    0.270     -0.561      -0.66
  6   68.0   14.000   12.358    0.289      1.642       1.95
  7   66.0   15.000   13.639    0.261      1.361       1.60
  8   68.0   13.000   12.358    0.289      0.642       0.76
  9   70.0   10.000   11.077    0.338     -1.077      -1.31
 10   67.0   13.000   12.999    0.272      0.001       0.00
 11   72.0    9.000    9.795    0.400     -0.795      -1.00
 12   74.0    8.000    8.514    0.470     -0.514      -0.68
12.2 Worth of the Prediction Equation
• Pure error: data points with the same X but different Y values constitute pure error, since the regression line cannot be vertical.
• Measure of the worth of the prediction equation:

    R² = SSR/SST = 1 − SSE/SST    (12.4)

• Since SST = SSR + SSE and both components are nonnegative, 0 ≤ R² ≤ 1.
• If β̂1 = 0 (no relationship between X and Y), R² = 0.
12.3 Assumptions
• The true relationship between X and Y can be adequately represented by the model

    Y = β0 + β1X + ε    (12.1)

• The errors should be independent.
• The errors are approximately normally distributed.
12.4 Checking Assumptions through Residual Plots
• The residuals should be plotted against:
  – X or Ŷ
  – Time
  – Any other variable of interest
• In a well-behaved residual plot:
  – All points are close to the midline
  – They form a tight cluster that can be enclosed in a rectangle
• If there are residual outliers, investigate them.
• If the error variance increases or decreases, the problem can be remedied by a transformation of X.
• If the plot has the form of a parabola, an X² term would probably be needed.
12.5 Confidence Intervals
• Assumption: normality of the error terms; if it is untenable, alternatives include:
  – Robust regression
  – Nonparametric regression
• Confidence interval for β1:  β̂1 ± t(α/2, n−2)·s(β̂1)
• Confidence interval for β0:  β̂0 ± t(α/2, n−2)·s(β̂0)
  where s(β̂1) = s/√Sxx, s(β̂0) = s·√(1/n + X̄²/Sxx), and s = √(SSE/(n−2))
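As a sketch of a slope interval, using the coefficient and standard error from the regression output for the quality-improvement data (β̂1 = −0.64067, s(β̂1) = 0.04332) and the tabled value t(.025, 10) = 2.228:

```python
# 95% confidence interval for the slope beta_1, built from the regression
# output values; t(.025, 10) = 2.228 is taken from a t table (n - 2 = 10 df).
b1, se_b1 = -0.64067, 0.04332
t_crit = 2.228

lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
print(f"95% CI for beta_1: ({lower:.3f}, {upper:.3f})")
# → 95% CI for beta_1: (-0.737, -0.544)
```

The interval excludes zero, consistent with the t statistic of −14.79 in the output.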
12.5 Hypothesis Tests
• Hypothesis test of H0: β1 = 0 against Ha: β1 ≠ 0:

    t = β̂1 / s(β̂1)

  where s(β̂1) = s/√Sxx and s = √(SSE/(n−2))
12.6 Prediction Interval for Y
• A 100(1−α)% prediction interval for Y at X = x0:

    Ŷ0 ± t(α/2, n−2)·s·√(1 + 1/n + (x0 − X̄)²/Sxx)

  where Ŷ0 = β̂0 + β̂1x0 and s = √(SSE/(n−2))
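A sketch of such an interval at a hypothetical new value x0 = 65 hours, using summary statistics of the Table 12.1 data (x̄ ≈ 64.917, Sxx ≈ 420.92, s = 0.888854, and t(.025, 10) = 2.228):

```python
import math

# 95% prediction interval for Y at x0 = 65 (a hypothetical new month),
# using the fitted line Y-hat = 55.923 - 0.64067 X.
b0, b1 = 55.923, -0.64067
s = 0.888854                       # residual standard deviation from the output
n, xbar, sxx = 12, 64.9167, 420.9167
t_crit = 2.228                     # t(.025, 10)

x0 = 65.0
y_hat = b0 + b1 * x0
half_width = t_crit * s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
print(f"Y-hat = {y_hat:.2f}, 95% PI: ({y_hat - half_width:.2f}, {y_hat + half_width:.2f})")
# → Y-hat = 14.28, 95% PI: (12.22, 16.34)
```

Note the leading 1 under the square root: a prediction interval for an individual Y is wider than a confidence interval for the mean of Y at the same x0.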
12.7 Regression Control Chart
• Used to monitor the dependent variable with a control chart approach.
• The center line is

    Ŷ = β̂0 + β̂1X    (12.5)

• Control limits for Y:

    Ŷ ± 3s    (12.6)

  where Ŷ = β̂0 + β̂1X and s = √(SSE/(n−2))
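A minimal sketch for the quality-improvement fit, assuming limits placed a fixed multiple of s above and below the fitted line (the 3s multiplier is an assumption here; some presentations also add 2s warning limits):

```python
# Regression control chart: center line Y-hat = b0 + b1*X, with control
# limits k*s above and below the line (k = 3 assumed here).
b0, b1, s = 55.923, -0.64067, 0.888854

def control_limits(x, k=3.0):
    """Return (LCL, center line, UCL) for the dependent variable at X = x."""
    center = b0 + b1 * x
    return center - k * s, center, center + k * s

lcl, cl, ucl = control_limits(65.0)
print(f"X = 65: LCL = {lcl:.2f}, CL = {cl:.2f}, UCL = {ucl:.2f}")
```

Unlike a standard Shewhart chart, the center line varies with X, so each plotted Y is judged against limits computed at its own X value.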
12.8 Cause-Selecting Control Chart
• The general idea is to distinguish between quality problems that occur at one stage of a process and problems that occur at a previous processing step.
• Let Y be the output from the second step and let X denote the output from the first step. The relationship between X and Y would be modeled.
12.9 Linear, Nonlinear, and Nonparametric Profiles
• Profile refers to the quality of a process or product being characterized by a (linear, nonlinear, or nonparametric) relationship between a response variable and one or more explanatory variables.
• One possible approach is to monitor each parameter in the model with a Shewhart chart.
  – The independent variables must be fixed
  – A control chart for R² can also be maintained
12.10 Inverse Regression
• An important application of simple linear regression for quality improvement is in the area of calibration.
• Assume two measuring tools are available: one is quite accurate but expensive to use, and the other is less expensive but also less accurate. If the measurements obtained from the two devices are highly correlated, then the measurement that would have been made with the expensive device can be predicted fairly well from the measurement made with the less expensive device.
• Let Y = measurement from the less expensive device
      X = measurement from the accurate device
12.10 Inverse Regression
Classical estimation approach
• First, regress Y on X to obtain Ŷ = β̂0 + β̂1X.
• Solve for X:  X̂ = (Y − β̂0)/β̂1
• For a known value of Y, say Y0, the equation is X̂ = (Y0 − β̂0)/β̂1.

Inverse regression (X is regressed on Y)
• X̂ = α̂0 + α̂1Y
• The two approaches would give the same estimate only if X and Y were perfectly correlated.
12.10 Inverse Regression Example

Classical estimation approach
• First, regress Y on X to obtain Ŷ = β̂0 + β̂1X.

Inverse regression (X is regressed on Y)
• At a given value of Y, compute the corresponding estimate X̂.

 Y     X
2.3   2.4
2.5   2.6
2.4   2.5
2.8   2.9
2.9   3.0
2.6   2.7
2.4   2.5
2.2   2.3
2.1   2.2
2.7   2.7
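The two approaches can be compared on the ten calibration pairs above. A sketch in plain Python (the reading Y0 = 2.5 is a hypothetical value chosen for illustration):

```python
# Classical vs. inverse regression estimates of X for an observed Y0,
# using the ten calibration pairs from the example.
y = [2.3, 2.5, 2.4, 2.8, 2.9, 2.6, 2.4, 2.2, 2.1, 2.7]  # less expensive device
x = [2.4, 2.6, 2.5, 2.9, 3.0, 2.7, 2.5, 2.3, 2.2, 2.7]  # accurate device
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

# Classical approach: regress Y on X, then solve Y0 = b0 + b1*X for X.
b1 = sxy / sxx
b0 = ybar - b1 * xbar
# Inverse regression: regress X on Y directly.
a1 = sxy / syy
a0 = xbar - a1 * ybar

y0 = 2.5                                  # hypothetical reading, cheap device
x_classical = (y0 - b0) / b1
x_inverse = a0 + a1 * y0
print(f"classical: {x_classical:.4f}, inverse: {x_inverse:.4f}")
```

Because X and Y are very highly correlated in these data, the two estimates nearly coincide; with weaker correlation they diverge.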
12.11 Multiple Linear Regression
• In multiple regression, there is more than one “independent” variable, e.g., Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε.
12.12 Issues in Multiple Regression
12.12.1 Variable Selection
• R² will virtually always increase when additional variables are added to a prediction equation.
• SSR likewise increases when new regressors are added.
• A commonly used statistic for determining the number of parameters is Mallows' Cp:

    Cp = SSEp/σ̂² − (n − 2p)

  where p is the number of parameters in the model, SSEp is the residual sum of squares for that model, and σ̂² is the error variance estimate using all the available regressors.
• The idea is to look hard at those prediction equations for which Cp is small and close to p.
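As a small arithmetic sketch of the Cp formula: if a candidate model's own mean square error happens to be used as σ̂², Cp works out to exactly p. The numbers below (SSE = 7.90, MSE = 0.79, n = 12, p = 2) are taken from the simple-regression output earlier in the chapter, treated here as one candidate model:

```python
# Mallows' Cp = SSE_p / sigma2_full - (n - 2p); candidate models with Cp
# small and close to p deserve a hard look.
def mallows_cp(sse_p, sigma2_full, n, p):
    """Cp for a candidate model with p parameters and residual SS sse_p."""
    return sse_p / sigma2_full - (n - 2 * p)

# SSE/(n - p) = sigma2 implies Cp = (n - p) - (n - 2p) = p.
cp = mallows_cp(sse_p=7.90, sigma2_full=0.79, n=12, p=2)
print(f"Cp = {cp:.2f}")
# → Cp = 2.00
```

In practice σ̂² comes from the full model with all available regressors, and Cp is computed for each subset model under consideration.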
12.12.3 Multicollinear Data
• Problems occur when at least two of the regressors are related in some manner.
• Solutions:– Discard one or more variables causing the multicollinearity– Use ridge regression
12.12.4 Residual Plots
• Residual plots are used extensively in multiple regression for checking the model assumptions.
• The residuals should generally be plotted against Ŷ, each of the regressors, time, and any potential regressor.
12.12.6 Transformations
• A regression model can often be improved by transforming one or more of the regressors, and possibly the dependent variable as well.
• A transformation can also often be used to convert a nonlinear regression model into a linear one.
• For example, a multiplicative exponential model such as Y = β0·e^(β1X)·ε can be transformed into a linear model by taking logarithms: ln Y = ln β0 + β1X + ln ε.
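This kind of linearization can be sketched with noise-free synthetic data: generate Y from an exponential model (the values β0 = 2.0 and β1 = 0.3 are assumptions for illustration), regress ln Y on X, and recover the parameters.

```python
import math

# Linearizing Y = b0 * exp(b1 * X) by regressing ln(Y) on X.
# Synthetic, noise-free data with assumed parameters b0 = 2.0, b1 = 0.3.
b0_true, b1_true = 2.0, 0.3
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [b0_true * math.exp(b1_true * xi) for xi in x]

z = [math.log(yi) for yi in y]            # ln Y is linear in X
n = len(x)
xbar, zbar = sum(x) / n, sum(z) / n
b1_hat = (sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
          / sum((xi - xbar) ** 2 for xi in x))
b0_hat = math.exp(zbar - b1_hat * xbar)   # back-transform the intercept
print(f"recovered: b0 = {b0_hat:.3f}, b1 = {b1_hat:.3f}")
# → recovered: b0 = 2.000, b1 = 0.300
```

With real data the multiplicative error ln ε appears in the transformed model, so the least-squares assumptions apply on the log scale rather than the original scale.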