Statistics: Regression & Correlation Analysis


TRANSCRIPT

  • Slide 1/24

    REGRESSION & CORRELATION ANALYSIS

    DEPARTMENT OF STATISTICS

    DR. RICK EDGEMAN, PROFESSOR & CHAIR, SIX SIGMA BLACK BELT

    [email protected] OFFICE: +1-208-885-4410

  • Slide 2/24

    Father of Regression Analysis: Carl F. Gauss (1777-1855)

    German mathematician, noted for his wide-ranging contributions to physics, particularly the study of electromagnetism. Born in Braunschweig on April 30, 1777, Gauss studied ancient languages in college, but at the age of 17 he became interested in mathematics and attempted a solution of the classical problem of constructing a regular heptagon, or seven-sided figure, with ruler and compass. He not only succeeded in proving this construction impossible, but went on to give methods of constructing figures with 17, 257, and 65,537 sides. In so doing he proved that the construction, with compass and ruler, of a regular polygon with an odd number of sides was possible only when the number of sides was a prime number of the series 3, 5, 17, 257, and 65,537 or was a multiple of two or more of these numbers. With this discovery he gave up his intention to study languages and turned to mathematics. He studied at the University of Göttingen from 1795 to 1798; for his doctoral thesis he submitted a proof that every algebraic equation has at least one root, or solution. This theorem, which had challenged mathematicians for centuries, is still called the fundamental theorem of algebra (see ALGEBRA; EQUATIONS, THEORY OF). His volume on the theory of numbers, Disquisitiones Arithmeticae (Inquiries into Arithmetic, 1801), is a classic work in the field of mathematics.

    Gauss next turned his attention to astronomy. A faint planetoid, Ceres, had been discovered in 1801; and because astronomers thought it was a planet, they observed it with great interest until losing sight of it. From the early observations Gauss calculated its exact position, so that it was easily rediscovered. He also worked out a new method for calculating the orbits of heavenly bodies. In 1807 Gauss was appointed professor of mathematics and director of the observatory at Göttingen, holding both positions until his death there on February 23, 1855. Although Gauss made valuable contributions to both theoretical and practical astronomy, his principal work was in mathematics and mathematical physics. In the theory of numbers, he developed the important prime-number theorem. He was the first to develop a non-Euclidean geometry (see GEOMETRY), but Gauss failed to publish these important findings because he wished to avoid publicity. In probability theory, he developed the important method of least squares and the fundamental laws of probability distribution (see PROBABILITY; STATISTICS). The normal probability graph is still called the Gaussian curve. He made geodetic surveys, and applied mathematics to geodesy (see GEOPHYSICS). With the German physicist Wilhelm Eduard Weber, Gauss did extensive research on magnetism. His applications of mathematics to both magnetism and electricity are among his most important works; the unit of intensity of magnetic fields is today called the gauss. He also carried out research in optics, particularly in systems of lenses. Scarcely a branch of mathematics or mathematical physics was untouched by Gauss.

  • Slide 3/24

    Introduction to Regression Analysis

    Regression analysis is the most often applied technique of statistical analysis and modeling.

    In general, it is used to model a response variable (Y) as a function of one or more driver variables (X1, X2, ..., Xp).

    The functional form used is:

    $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i$
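
    As a concrete illustration of this functional form, here is a minimal Python sketch that simulates data from a two-driver linear model and recovers the coefficients by ordinary least squares. All names and values below are illustrative, not from the deck:

    ```python
    # Simulate Y = beta0 + beta1*X1 + beta2*X2 + error, then estimate the
    # betas by ordinary least squares. Data are made up for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    X1 = rng.uniform(0, 100, n)                 # first driver variable
    X2 = rng.uniform(0, 10, n)                  # second driver variable
    eps = rng.normal(0, 3, n)                   # random error term
    Y = 2.0 + 0.2 * X1 - 0.5 * X2 + eps         # response

    X = np.column_stack([np.ones(n), X1, X2])   # design matrix with intercept
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print("estimated (beta0, beta1, beta2):", beta_hat)
    ```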

  • Slide 4/24

    Introduction to Regression Analysis

    If there is only one driver variable, X, then we usually speak of simple linear regression analysis.

    When the model involves (a) multiple driver variables, (b) a driver variable in multiple forms, or (c) a mixture of these, then we speak of multiple linear regression analysis.

    The "linear" portion of the terminology refers to the response variable being expressed as a linear combination of the driver variables.

  • Slide 5/24

    Introduction to Regression Analysis

    The ε term in the model is referred to as a random error term and may reflect a number of things, including the general idea that knowledge of the driver variables will not ordinarily lead to perfect reconstruction of the response.

  • Slide 6/24

    Regression Analysis: Model Assumptions

    Model assumptions are stated in terms of the random errors, ε, as follows:

    the errors are normally distributed,
    with mean zero and constant variance σ² that does not depend on the settings of the driver variables, and
    the errors are independent of one another.

    This is often summarized symbolically as: ε is NID(0, σ²).

  • Slide 7/24

    Model Estimation

    Ordinarily the regression coefficients (the βs) are of unknown value and must be estimated from sample information. The estimate of a given coefficient, βi, is often symbolized by $\hat{\beta}_i$.

    Although there are well-established statistical/mathematical methods for determining these estimates, they are generally tedious and well-suited to performance by a computer. The resulting estimated model is:

    $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_p X_{pi}$

    The random error term, εi, is then estimated by $e_i = Y_i - \hat{Y}_i$.
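
    Since the slide notes these computations are tedious by hand and suited to a computer, here is a minimal sketch of one standard way to obtain them, via the normal equations. It assumes a numpy design matrix X with a leading column of ones and a response vector y:

    ```python
    # Ordinary least squares via the normal equations, plus fitted values
    # and estimated errors (residuals). A sketch, not a production routine.
    import numpy as np

    def ols_fit(X, y):
        """Return (beta_hat, y_hat, e) for the model y = X @ beta + error."""
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solve normal equations
        y_hat = X @ beta_hat                          # fitted values
        e = y - y_hat                                 # residuals e_i = Y_i - fitted Y_i
        return beta_hat, y_hat, e
    ```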

  • Slide 8/24

    Interval Estimation

    Estimates will vary from sample to sample and it is useful to haveestimates of the standard deviations of these estimates, Si. Theseestimated standard deviations tend to be included in the regression

    output of most statistical packages and can be used in the formationof confidence intervals for the true value ofi, that is:i +/- t/2,n-(p+1)Si

    Where t/2,n-(p+1) is the value of Students t distribution with n-(p+1)degrees of freedom that places a proportion /2 in the upper tail ofthe t-distribution.

    ^

    ^

    ^
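
    A sketch of this interval computation with scipy; the slope estimate and standard error plugged in below are taken from the order-delivery regression output later in the deck:

    ```python
    # 95% confidence interval for a regression coefficient:
    #   beta_hat_i +/- t(alpha/2, n-(p+1)) * se_i
    from scipy import stats

    def coef_ci(beta_hat_i, se_i, n, p, alpha=0.05):
        t_crit = stats.t.ppf(1 - alpha / 2, n - (p + 1))  # upper-tail t value
        return beta_hat_i - t_crit * se_i, beta_hat_i + t_crit * se_i

    # Slope and its SE from the Res.Time-vs-Volume output (n = 50, p = 1):
    print(coef_ci(0.21538, 0.05340, n=50, p=1))
    ```

    Because this interval excludes zero, the volume term would be retained, which is exactly the key concept discussed on the next slide.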

  • Slide 9/24

    Interval Estimation: Key Concept

    When examining a confidence interval for a particular regression coefficient, βj, we will want to know whether the interval includes the value zero.

    If zero is included in the interval then, conceivably, βj = 0, which would imply that the model could be simplified by dropping the corresponding term, βjXj, from the model. Otherwise, the corresponding variable, Xj, is considered to be a potentially important predictor or determinant of Y.

  • Slide 10/24

    Analysis of Variance for Regression

    An omnibus or global test of the overall contribution of the set of driver variables to the prediction of the response variable is carried out via the analysis of variance (ANOVA). A summary table for the ANOVA of regression follows:

    Source of        Degrees of     Sum of        Mean         Fcalc     Fcrit
    Variation (SV)   Freedom (df)   Squares (SS)  Square (MS)
    Regression       p              SSR           MSR          MSR/MSE   F(α; p, n-(p+1))
    Residual         n-(p+1)        SSE           MSE
    Total            n-1            SST

    where F(α; p, n-(p+1)) is the value of F with p numerator df and n-(p+1) denominator df that places α in the upper tail of the distribution.

  • Slide 11/24

    ANOVA for Regression Formulas

    In the ANOVA table we have the following:

    SSR = Sum of Squares due to Regression
    SSE = Sum of Squares due to Error or Residual
    SST = Sum of Squares Total
    MSR = SSR/p = Mean Square Regression
    MSE = SSE/[n-(p+1)] = Mean Square Error or Residual

    The sums of squares are derived from the algebraic identity:

    $\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$

    That is: SST = SSR + SSE, so that R² = SSR/SST represents the proportion of variation in Y that is explained by the behavior of the driver variables. R² is the coefficient of determination.
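
    A minimal sketch computing these ANOVA quantities from observed and fitted values (numpy arrays y and y_hat, with p driver variables):

    ```python
    # ANOVA-for-regression quantities: SST = SSR + SSE, F test, and R^2.
    import numpy as np
    from scipy import stats

    def regression_anova(y, y_hat, p):
        n = len(y)
        sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
        ssr = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares
        sse = np.sum((y - y_hat) ** 2)           # residual sum of squares
        msr, mse = ssr / p, sse / (n - (p + 1))
        f_calc = msr / mse
        p_value = stats.f.sf(f_calc, p, n - (p + 1))  # upper-tail probability
        return {"SSR": ssr, "SSE": sse, "SST": sst, "R2": ssr / sst,
                "F": f_calc, "p-value": p_value}
    ```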

  • Slide 12/24

    Regression Diagnostics: The Normality Assumption

    Are the residuals (or errors) approximately normally distributed? A variety of methods are available for checking this regression assumption:

    Anderson-Darling, Watson, Cramér-von Mises, Kolmogorov-Smirnov, Lilliefors, and Chi-Square tests;
    Histograms or Boxplots;
    Correlation assessment of normal probability plots.
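
    A sketch of two of the named checks in scipy, applied to a stand-in residual vector; the other listed tests follow the same pattern where implementations are available:

    ```python
    # Normality checks on residuals: Anderson-Darling and Kolmogorov-Smirnov.
    import numpy as np
    from scipy import stats

    e = np.random.default_rng(1).normal(0, 1, 50)   # stand-in residuals

    ad = stats.anderson(e, dist='norm')             # Anderson-Darling test
    print("A-D statistic:", ad.statistic)
    print("critical values:", ad.critical_values)

    # K-S against a normal with the residuals' own mean and sd; estimating
    # the parameters this way is what the Lilliefors correction addresses,
    # so the plain K-S p-value here is only approximate.
    ks = stats.kstest(e, 'norm', args=(e.mean(), e.std(ddof=1)))
    print("K-S statistic, p-value:", ks.statistic, ks.pvalue)
    ```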

  • Slide 13/24

    Regression Diagnostics: Independence of Errors

    Are the errors independent of one another, or autocorrelated?

    This assumption may be graphically examined by plotting the errors in time sequence and determining if any patterns exist; a control chart for individuals could be used for this, with all eight PATs appropriate for use. Other plots are also available.

    This assumption is commonly evaluated via the Durbin-Watson test. This test is based on the value of

    $D = \sum_{i=2}^{n} (e_i - e_{i-1})^2 / \mathrm{SSE}$

    which may range in value from 0 to 4. Tables of lower and upper critical values of D, denoted by dL and dU, respectively, are widely available for significance levels of α = .01 and α = .05. The corresponding autocorrelation coefficient, which may range from -1 to +1, is given by: $r_a = 1 - (D/2)$
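
    A minimal sketch of this computation from a residual vector in time order:

    ```python
    # Durbin-Watson statistic D and the implied lag-1 autocorrelation r_a.
    import numpy as np

    def durbin_watson(e):
        e = np.asarray(e, dtype=float)
        d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)  # sum of squared successive
                                                      # differences over SSE
        return d, 1 - d / 2                           # (D, r_a)
    ```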

  • Slide 14/24

    Regression Analysis Example: Timeliness of Order Delivery

    The order fulfillment process of a major distribution center is having trouble delivering orders on time. It is conjectured that order volume is the root cause and that problems occur when order volumes are high. A new computer system has been requested to handle the increased volume of orders. Data are on the following slide. A negative response time indicates early delivery.

    Do the data support the request? Why or why not?
    What recommendations would you make based on your analysis of these data?
    What are some other possible solutions to this problem?

  • Slide 15/24

    Day   X   Y  | Day   X    Y  | Day   X   Y  | Day   X   Y
     1   31  -2  |  14   66   14 |  27   88  19 |  40    3  13
     2   91   9  |  15   73   10 |  28   97  12 |  41   35  25
     3   13  -4  |  16   45   -2 |  29   93  19 |  42   40  15
     4   69  15  |  17   21  -12 |  30   72  10 |  43   64  32
     5   70  12  |  18    7   -1 |  31    6  11 |  44   43  17
     6   64   6  |  19   69   11 |  32   55  23 |  45   30  28
     7   38   7  |  20   38    5 |  33   15  12 |  46   73  33
     8   50   4  |  21    2  -17 |  34   10  16 |  47   46  19
     9   94  23  |  22   99   18 |  35   21  20 |  48   82  32
    10   82  24  |  23   36    8 |  36   88  43 |  49   35  22
    11   15  -2  |  24   82    9 |  37   55  23 |  50    2   6
    12   42  -4  |  25   58   21 |  38   27  16 |  ΣX = 2,448   ΣX² = 161,128
    13   27  -6  |  26   20   -5 |  39   66  32 |  ΣY = 639   ΣY² = 15,731   ΣXY = 40,175

    X = Order Volume   Y = Average Response Time / Order

    NOTE: days are on a M-F rotation.

  • Slide 16/24

    Covariances: Volume, Res.Time

               Volume    Res.Time
    Volume     842.325
    Res.Time   181.420   154.379

    Correlations: Volume, Res.Time

    Pearson correlation of Volume and Res.Time = 0.503
    P-Value = 0.000

    Descriptive Statistics: Volume, Res.Time

    Variable   N    Mean    StDev   Variance   Sum       Sum of Squares
    Volume     50   48.96   29.02   842.32     2448.00   161128.00
    Res.Time   50   12.78   12.42   154.38     639.00    15731.00

    Note: $S_X^2 = 842.32 = (\sum X^2 - n\bar{X}^2)/(n-1) = (161{,}128 - 50(48.96^2))/49$, and $S_X = 29.02$
    $S_Y^2 = 154.38 = (\sum Y^2 - n\bar{Y}^2)/(n-1) = (15{,}731 - 50(12.78^2))/49$, and $S_Y = 12.42$
    $S_{XY} = 181.42 = (\sum XY - n\bar{X}\bar{Y})/(n-1) = (40{,}175 - 50(48.96)(12.78))/49$

  • Slide 17/24

    Manual Regression Calculations

    First, get the following:

    Means ($\bar{X}$ and $\bar{Y}$). These are: $\bar{X}$ = 48.96 and $\bar{Y}$ = 12.78.

    Variances ($S_X^2$ and $S_Y^2$), standard deviations ($S_X$ and $S_Y$), and the covariance ($S_{XY}$). These are: $S_X^2$ = 842.32, $S_X$ = 29.02, $S_Y^2$ = 154.38, $S_Y$ = 12.42, $S_{XY}$ = 181.42.

    Second, get the correlation coefficient ($r_{XY}$) and the coefficient of determination (r²). These are: $r_{XY} = S_{XY}/(S_X S_Y)$ = 181.42/(29.02 × 12.42) = .503 and r² = (.503)² = .253.

    Next, get the estimates of the slope ($\hat{\beta}_1$), intercept ($\hat{\beta}_0$), and regression equation $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$. These are:

    $\hat{\beta}_1 = S_{XY}/S_X^2$ = 181.42 / 842.32 = 0.215

    $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$ = 12.78 - (0.215)(48.96) = 2.24

    $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ = 2.24 + 0.215X. BE ABLE TO USE THIS.
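
    A sketch reproducing these manual calculations from the raw sums given with the data table (n = 50); plain Python arithmetic, no libraries required:

    ```python
    # Manual regression calculations from the raw sums on the data slide.
    n = 50
    sum_x, sum_x2 = 2448.0, 161128.0   # sum of X and of X^2
    sum_y, sum_y2 = 639.0, 15731.0     # sum of Y and of Y^2
    sum_xy = 40175.0                   # sum of X*Y

    xbar, ybar = sum_x / n, sum_y / n
    sx2 = (sum_x2 - n * xbar**2) / (n - 1)      # sample variance of X
    sy2 = (sum_y2 - n * ybar**2) / (n - 1)      # sample variance of Y
    sxy = (sum_xy - n * xbar * ybar) / (n - 1)  # sample covariance

    r = sxy / (sx2**0.5 * sy2**0.5)   # correlation coefficient, ~0.503
    b1 = sxy / sx2                    # slope estimate, ~0.215
    b0 = ybar - b1 * xbar             # intercept estimate, ~2.24
    print(f"r = {r:.3f}, r^2 = {r**2:.3f}, Y-hat = {b0:.2f} + {b1:.3f} X")
    ```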

  • Slide 18/24

    Regression Analysis: Res.Time versus Volume

    The regression equation is: Res.Time = 2.24 + 0.215 Volume

    Predictor Coef SE Coef T P

    Constant 2.235 3.032 0.74 0.465

    Volume 0.21538 0.05340 4.03 0.000

    S = 10.8493 R-Sq = 25.3% R-Sq(adj) = 23.8%

    Analysis of Variance

    Source DF SS MS F P

    Regression 1 1914.6 1914.6 16.27 0.000

    Residual Error 48 5650.0 117.7

    Total 49 7564.6

    Unusual Observations

    Obs Volume Res.Time Fit SE Fit Residual St Resid

    36 88.0 43.00 21.19 2.59 21.81 2.07R

    NOTE THAT:

    r² = SSR/SST = 1914.6/7564.6 = 0.253

    So that $r = \sqrt{r^2} = 0.503$, with the algebraic sign being the same as that for the estimated slope, $\hat{\beta}_1$.

  • Slide 19/24

    [Figure: Residual Plots for Res.Time. Four panels: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data. All panels are on the standardized-residual scale.]

  • Slide 20/24

    $S_{\hat{\beta}_1} = \sqrt{MSE / [(n-1)S_X^2]}$

    $S_{\hat{\beta}_0} = \sqrt{MSE\,[1/n + \bar{X}^2/((n-1)S_X^2)]}$

    Verify that these are 0.0534 and 3.032, respectively, then construct and interpret 95% confidence intervals for each regression coefficient (e.g., for the slope and intercept).
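
    A sketch of the requested verification, using MSE = 117.7 from the ANOVA table on slide 18 and the summary statistics computed earlier:

    ```python
    # Standard errors of the slope and intercept for the order-delivery fit.
    import math

    n, mse = 50, 117.7
    xbar, sx2 = 48.96, 842.325

    s_b1 = math.sqrt(mse / ((n - 1) * sx2))                      # ~0.0534
    s_b0 = math.sqrt(mse * (1 / n + xbar**2 / ((n - 1) * sx2)))  # ~3.032
    print(f"S(slope) = {s_b1:.4f}, S(intercept) = {s_b0:.3f}")
    ```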

  • Slide 21/24

    Estimate of and confidence interval for $\mu_{Y|X=X^*}$ (the mean of Y given that X is equal to X*)

    $\hat{\mu}_{Y|X=X^*} = \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X^*$

    $S_{\hat{Y}|X=X^*} = \sqrt{MSE\,[1/n + (X^* - \bar{X})^2/((n-1)S_X^2)]}$
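
    A sketch of this estimate and its 95% interval for the order-delivery fit; X* = 60 is a hypothetical order volume chosen purely for illustration:

    ```python
    # 95% CI for the mean response at X = X* (X* = 60 is hypothetical).
    import math
    from scipy import stats

    n, p, mse = 50, 1, 117.7
    b0, b1 = 2.235, 0.21538
    xbar, sx2 = 48.96, 842.325

    x_star = 60.0
    y_hat = b0 + b1 * x_star
    s_mean = math.sqrt(mse * (1 / n + (x_star - xbar)**2 / ((n - 1) * sx2)))
    t_crit = stats.t.ppf(0.975, n - (p + 1))
    print(y_hat - t_crit * s_mean, y_hat, y_hat + t_crit * s_mean)
    ```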

  • Slide 22/24

    Estimate of and confidence interval for the mean of m new observations at X = X*

    $\hat{\mu}_{Y|X=X^*} = \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X^*$

    $S_{\hat{Y}|X=X^*} = \sqrt{MSE\,[1/m + 1/n + (X^* - \bar{X})^2/((n-1)S_X^2)]}$
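
    The same sketch adapted for the mean of m new observations; only the extra 1/m term changes (m = 5 and X* = 60 are hypothetical values):

    ```python
    # 95% CI for the mean of m new observations at X = X*.
    import math
    from scipy import stats

    n, p, mse = 50, 1, 117.7
    b0, b1 = 2.235, 0.21538
    xbar, sx2 = 48.96, 842.325
    m, x_star = 5, 60.0

    y_hat = b0 + b1 * x_star
    s_pred = math.sqrt(mse * (1 / m + 1 / n + (x_star - xbar)**2 / ((n - 1) * sx2)))
    t_crit = stats.t.ppf(0.975, n - (p + 1))
    print(y_hat - t_crit * s_pred, y_hat + t_crit * s_pred)
    ```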

  • Slide 23/24

    Estimate of and confidence interval for σ² (the error variance)

    $(n-2)MSE/\chi^2_{n-2,\text{big}} < \sigma^2 < (n-2)MSE/\chi^2_{n-2,\text{small}}$

    where the "big" and "small" values of chi-square are the ones placing α/2 in the upper and lower tails, respectively, of the chi-square distribution with (n-2) degrees of freedom or, more generally, n-(p+1) degrees of freedom. Construct and interpret a 95% confidence interval for the error variance.
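
    A sketch of the requested interval, with n-(p+1) = 48 degrees of freedom and MSE = 117.7 from the order-delivery fit:

    ```python
    # 95% confidence interval for the error variance sigma^2.
    from scipy import stats

    n, p, mse = 50, 1, 117.7
    df = n - (p + 1)                         # 48 degrees of freedom
    chi2_big = stats.chi2.ppf(0.975, df)     # places .025 in the upper tail
    chi2_small = stats.chi2.ppf(0.025, df)   # places .025 in the lower tail
    print(df * mse / chi2_big, df * mse / chi2_small)
    ```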

  • Slide 24/24

    REGRESSION & CORRELATION ANALYSIS

    DEPARTMENT OF STATISTICS

    DR. RICK EDGEMAN, PROFESSOR & CHAIR, SIX SIGMA BLACK BELT

    [email protected] OFFICE: +1-208-885-4410

    End of Session