linear regression analysis and residual

Upload: hebahaddad

Post on 30-May-2018


  • 8/9/2019 Linear Regrssion Analysis and Residual

    Simple Linear Regression and Correlation

    Presented by: Eng. Heba El-Haddad

    Introduction

    Regression analysis is a statistical tool for analyzing the relationships between variables. For example:

    A college guidance counselor has just administered a vocational aptitude test to 1000 entering freshmen. She is interested in knowing whether there is a relationship between the math aptitude scores and the business aptitude scores.

    To determine the relationship between the math aptitude scores and the business aptitude scores, we have to compute a number that measures the relationship between these two sets of scores.

    This number is called the correlation coefficient.


    Correlation Coefficient

    To determine the correlation between the math aptitude and business aptitude scores, we can analyze the situation pictorially by using a scatter diagram.

    The math score will represent the independent variable; we denote it by x.

    The business score will represent the dependent variable; we denote it by y.

    Different Aptitude Scores Received by Ten Students

    Student   Math aptitude   Business aptitude   Language aptitude   Music aptitude
    A         52              48                  26                  22
    B         49              49                  53                  23
    C         26              27                  48                  57
    D         28              24                  31                  54
    E         63              59                  67                  13
    F         44              40                  75                  20
    G         70              72                  31                  9
    H         32              31                  22                  50
    I         49              50                  11                  17
    J         51              49                  19                  24

    Scatter Diagram

    Analyze the situation pictorially by using a scatter diagram. To create a scatter diagram by using Minitab:

    Click on Graph > Scatterplot.

    Click on "With Regression", then "OK" in the first dialog box.

    In the second dialog box, select business score into the first box in the Y column and math score into the first box in the X column.

    Note: each point on the plot represents one student's pair of scores (the diagram labels the point for Student F).

    Other types of scatter diagram

    Negative linear correlation

    No correlation

    The Correlation Coefficient

    Once we have determined that there is a linear relation between two variables, we can measure the strength of this relation by using the coefficient of linear correlation developed by Karl Pearson.

    The coefficient of linear correlation is given by

    $$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$$

    where

    x = label for one of the variables
    y = label for the other variable
    n = number of pairs of scores

    Possibilities of the r value

    The correlation coefficient always has a value in the range -1 ≤ r ≤ +1.

    No correlation: r = 0
    Positive correlation: r > 0
    Strong positive correlation: r close to 1
    Perfect positive correlation: r = 1
    Negative correlation: r < 0
    Strong negative correlation: r close to -1
    Perfect negative correlation: r = -1

    Example on the correlation coefficient

    Math aptitude "x"   Business aptitude "y"   x^2     y^2     xy
    52                  48                      2704    2304    2496
    49                  49                      2401    2401    2401
    26                  27                      676     729     702
    28                  24                      784     576     672
    63                  59                      3969    3481    3717
    44                  40                      1936    1600    1760
    70                  72                      4900    5184    5040
    32                  31                      1024    961     992
    49                  50                      2401    2500    2450
    51                  49                      2601    2401    2499
    Sum: 464            449                     23396   22137   22729

    $$r = \frac{10(22{,}729) - (464)(449)}{\sqrt{10(23{,}396) - (464)^2}\,\sqrt{10(22{,}137) - (449)^2}} = 0.986747791$$

    Thus, the coefficient of correlation is 0.9867. Since this value is close to +1, we say that there is a high degree of positive correlation.
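    The hand calculation above can be cross-checked with a short Python sketch. The helper name `pearson_r` is an illustrative choice and does not come from the slides; the data are the math and business scores of the ten students.

```python
from math import sqrt

# Math (x) and business (y) aptitude scores for students A-J.
math_scores = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]
business_scores = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]


def pearson_r(x, y):
    """Pearson's r computed from raw sums, as in the formula above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


r = pearson_r(math_scores, business_scores)
print(f"r = {r:.4f}")  # r = 0.9867
```

    Working from the raw sums mirrors the table columns, so each intermediate value can be compared with the Sum row.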

    The sensitivity of the correlation coefficient

    The correlation coefficient is unaffected by adding or subtracting a number to either x or y or both. It is also unaffected if x is coded in one way (perhaps by adding or subtracting a number) and y is coded in another way (say, by multiplying by a positive number).

    Math aptitude "x + 29"   Business aptitude "y - 38"   x^2     y^2    xy
    81                       10                           6561    100    810
    78                       11                           6084    121    858
    55                       -11                          3025    121    -605
    57                       -14                          3249    196    -798
    92                       21                           8464    441    1932
    73                       2                            5329    4      146
    99                       34                           9801    1156   3366
    61                       -7                           3721    49     -427
    78                       12                           6084    144    936
    80                       11                           6400    121    880
    Sum: 754                 69                           58718   2453   7098

    r = 0.986747791
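    As a rough check of this property, the sketch below recomputes r after recoding the scores. The function name and the extra scaling cases (multiplying y by 3 and by -1) are illustrative assumptions, not part of the slides.

```python
from math import sqrt

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


r_original = pearson_r(x, y)
# Coding used on the slide: add 29 to x, subtract 38 from y.
r_shifted = pearson_r([a + 29 for a in x], [b - 38 for b in y])
# Hypothetical coding: shift x, multiply y by a positive number.
r_scaled = pearson_r([a + 29 for a in x], [3 * b for b in y])
# Multiplying by a negative number keeps the size of r but flips its sign.
r_negated = pearson_r(x, [-b for b in y])

print(r_original, r_shifted, r_scaled, r_negated)
```

    The first three values agree up to floating-point rounding, while the last one has the opposite sign, which is why the multiplier has to be positive for r to stay unchanged.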

    The Reliability of r

    When r is computed, we may get a strong correlation, positive or negative, which is due purely to chance and not to some relation that exists between x and y.

    Business aptitude "x"   Music aptitude "y"   x^2     y^2     xy
    48                      22                   2304    484     1056
    49                      23                   2401    529     1127
    27                      57                   729     3249    1539
    24                      54                   576     2916    1296
    59                      13                   3481    169     767
    40                      20                   1600    400     800
    72                      9                    5184    81      648
    31                      50                   961     2500    1550
    50                      17                   2500    289     850
    49                      24                   2401    576     1176
    Sum: 449                289                  22137   11193   10809

    r = -0.914447

    Amount of snow in inches "x"   Number of hours studied "y"   x^2   y^2   xy
    1                              2                             1     4     2
    4                              6                             16    36    24
    2                              3                             4     9     6
    6                              4                             36    16    24
    3                              4                             9     16    12
    Sum: 16                        19                            66    81    68

    r = 0.63

    The value of r in this case is 0.63, but we cannot conclude that if it snows in the U.S.A. then the students in Egypt study more!

    A chart has been constructed that allows us to determine the significance of a particular value of the correlation coefficient:

    1. Compute the value of r.

    2. Look in the chart for the appropriate r-value corresponding to the given n, where n is the number of pairs of scores.

    3. The value of r is not statistically significant if it lies between -r_α and +r_α for the particular value of n, where r_α is the chart value.

    Correlation Coefficient Chart

    In the case of the correlation between the amount of snow in the U.S.A. and the hours studied by the students in Egypt:

    Assume α = 0.025

    1. r = 0.63, n = 5

    2. From the table, r_0.025 = 0.878, so the interval is between -0.878 and +0.878.

    3. Since the value of r = 0.63 is between -0.878 and +0.878,

    we conclude that the correlation is not significant and may be due purely to chance.

    Correlation Coefficient Chart

    In the case of the correlation between the math aptitude and business aptitude scores:

    Assume α = 0.025

    1. r = 0.986747791, n = 10

    2. From the table, r_0.025 = 0.632, so the interval is between -0.632 and +0.632.

    3. Since the value of r is greater than +0.632,

    we conclude that there is a definite positive correlation between the math aptitude score and the business aptitude score.
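    The chart values used in the two examples can be reproduced from a Student's t table. The sketch below assumes the standard relation r = t / sqrt(t^2 + n - 2) between a critical t value and the corresponding critical r; this relation and the helper names are not taken from the slides, and the t values are the usual table entries for 0.025 in each tail.

```python
from math import sqrt

# t values for an upper-tail area of 0.025, keyed by degrees of freedom (n - 2).
T_025 = {3: 3.182, 8: 2.306}


def critical_r(t_critical, n):
    """Critical |r| for n pairs of scores, derived from the critical t value."""
    return t_critical / sqrt(t_critical ** 2 + (n - 2))


def is_significant(r, n):
    """True when r falls outside the interval (-r_alpha, +r_alpha)."""
    return abs(r) > critical_r(T_025[n - 2], n)


snow_cutoff = critical_r(T_025[3], n=5)
aptitude_cutoff = critical_r(T_025[8], n=10)

print(round(snow_cutoff, 3), is_significant(0.63, n=5))           # 0.878 False
print(round(aptitude_cutoff, 3), is_significant(0.9867, n=10))    # 0.632 True
```

    Only the two sample sizes from the examples are covered; other values of n would need their own table entries.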

    The correlation coefficient merely determines whether two variables are related, but it does not specify how.

    Linear Regression

    Once we determine that there is a linear correlation between two variables, linear regression is used to predict the value of one variable (the dependent variable y) on the basis of the other variable (the independent variable x).

    To predict the value of y:

    A - From the scatter diagram

    Which line has the best fit to the data?

    B - Least Squares Method

    A Digression into History

    The statistical method of least squares was developed by the French mathematician Adrien-Marie Legendre (1752-1833).

    The Method of Least Squares

    B - Least Squares Method

    The regression line is the line whose equation minimizes the sum of the squared vertical deviations, that is, the squared differences between the observed and predicted values.

    The Method of Least Squares

    The equation of the estimated regression line is

    $$\hat{y} = b_0 + b_1 x$$

    where

    $$b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}, \qquad b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right)$$

    and n is the number of pairs of scores.

    The Prediction of the y value

    Suppose the counselor is interested in predicting how well a student will do on the business aptitude test if she knows the student's score on the math aptitude test.

    Math aptitude "x"   Business aptitude "y"   x^2     xy
    52                  48                      2704    2496
    49                  49                      2401    2401
    26                  27                      676     702
    28                  24                      784     672
    63                  59                      3969    3717
    44                  40                      1936    1760
    70                  72                      4900    5040
    32                  31                      1024    992
    49                  50                      2401    2450
    51                  49                      2601    2499
    Sum: 464            449                     23396   22729

    $$b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(22729) - (464)(449)}{10(23396) - (464)^2} = 1.01553$$

    $$b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right) = \frac{1}{10}\left(449 - 1.01553 \times 464\right) = -2.221$$

    $$\hat{y} = -2.221 + 1.01553\,x$$

    For example, at x = 50:

    $$\hat{y} = -2.221 + 1.01553 \times 50 = 48.56$$
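    The same coefficients can be obtained with a few lines of Python. This is a minimal sketch of the raw-sums formulas; the function name `predict` is an illustrative assumption.

```python
x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

# Slope and intercept of the least squares line.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n


def predict(x_new):
    """Predicted business score for a given math score."""
    return b0 + b1 * x_new


print(f"b1 = {b1:.5f}, b0 = {b0:.3f}")          # b1 = 1.01554, b0 = -2.221
print(f"prediction at x = 50: {predict(50):.2f}")  # prediction at x = 50: 48.56
```

    The slope prints as 1.01554 because it is rounded rather than truncated; the slide's 1.01553 is the same number cut off after five decimals.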

    Alternative way to compute b1 and b0

    The coefficients b1 and b0 for the least squares line are calculated as follows:

    1. Compute the average of the x values and the average of the y values.

    2. Compute the sample standard deviation of the x values, S_x.

    3. Compute the sample covariance of the n data points, S_xy, which is defined by

    $$S_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$

    Alternative way to compute b1 and b0

    $$\bar{x} = 464/10 = 46.4, \qquad \bar{y} = 449/10 = 44.9$$

    $$S_x^2 = 1866.4/9 = 207.378$$

    $$S_{xy} = 1895.4/9 = 210.6$$

    $$b_1 = S_{xy}/S_x^2 = 210.6/207.378 = 1.01554$$

    $$b_0 = \bar{y} - b_1\bar{x} = 44.9 - (1.01554 \times 46.4) = -2.221$$

    $$\hat{y} = -2.221 + 1.01554\,x$$
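    A sketch of this second route, using the sample variance and sample covariance, is given below. It relies on Python's standard `statistics` module; the covariance is written out by hand so the n - 1 divisor stays visible.

```python
from statistics import mean, variance

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude

n = len(x)
mean_x, mean_y = mean(x), mean(y)

# Sample variance of x (divides by n - 1).
var_x = variance(x)
# Sample covariance of x and y (also divides by n - 1).
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

b1 = cov_xy / var_x
b0 = mean_y - b1 * mean_x

print(f"mean_x = {mean_x}, mean_y = {mean_y}")
print(f"var_x = {var_x:.3f}, cov_xy = {cov_xy:.1f}")
print(f"b1 = {b1:.5f}, b0 = {b0:.3f}")
```

    Because both the variance and the covariance carry the same n - 1 divisor, it cancels in the ratio, so this route gives the same slope as the raw-sums formula.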

    STANDARD ERROR OF THE ESTIMATE

    Predicted value: ŷ = b0 + b1 x

    At x = 50: ŷ = -2.221 + 1.01553 × 50 = 48.56

    We cannot expect such a prediction to be exact: for each x there is a corresponding population of y values.

    The relationship between X and Y is a straight-line (linear) relationship.

    The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term ε.

    STANDARD ERROR OF THE ESTIMATE

    The mean of the corresponding y values lies on some straight line whose equation we do not know but which is of the form:

    $$\mu_{y|x} = \beta_0 + \beta_1 x$$

    [Figure: identical normal distributions of errors, all centered on the regression line; axes x and y.]

    STANDARD ERROR OF THE ESTIMATE

    The error terms (the vertical distances between the predicted y values and the true population values) are normally distributed with mean 0 and the same standard deviation σ.

    The value of σ can be estimated from the sample data by computing the standard error of the estimate, also called the residual standard deviation:

    $$S_e = \sqrt{\frac{SSE}{n - 2}}, \qquad SSE = \sum (y - \hat{y})^2$$

    SSE is called the error sum of squares.

    If S_e is zero, all the points fall on the regression line.

    STANDARD ERROR OF THE ESTIMATE

    Math aptitude "x"   Business aptitude "y"   ŷ          y - ŷ      (y - ŷ)^2
    52                  48                      50.58656   -2.58656   6.69029263
    49                  49                      47.53997   1.46003    2.1316876
    26                  27                      24.18278   2.81722    7.93672853
    28                  24                      26.21384   -2.21384   4.90108755
    63                  59                      61.75739   -2.75739   7.60319961
    44                  40                      42.46232   -2.46232   6.06301978
    70                  72                      68.8661    3.1339     9.82132921
    32                  31                      30.27596   0.72404    0.52423392
    49                  50                      47.53997   2.46003    6.0517476
    51                  49                      49.57103   -0.57103   0.32607526
    Sum: 464            449                                           SSE = 52.0494017

    $$S_e = \sqrt{\frac{52.049}{10 - 2}} = 2.55$$
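    The table can be regenerated with the following sketch, which fits the line, sums the squared residuals, and takes the square root of SSE / (n - 2). Variable names are illustrative; the coefficients are computed at full precision rather than rounded, so individual residuals may differ from the table in the last digits.

```python
from math import sqrt

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude
n = len(x)

# Least squares fit from deviations about the means.
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((a - mean_x) ** 2 for a in x)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Residuals, error sum of squares, and standard error of the estimate.
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
sse = sum(e * e for e in residuals)
s_e = sqrt(sse / (n - 2))

print(f"SSE = {sse:.3f}, S_e = {s_e:.2f}")  # SSE = 52.049, S_e = 2.55
```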

    Hypothesis Tests About the Regression

    If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope b1 = 0. In that case the equation

    ŷ = b0 + b1 x

    reduces to

    ŷ = b0

    To determine whether x can be used as a predictor of y, we carry out a test of hypothesis.

    Hypothesis Tests About the Regression

    1. State the hypotheses:

       H0: b1 = 0
       H1: b1 ≠ 0

    2. Compute the value of the test statistic.

    3. Find n - 2, which is the number of degrees of freedom for the t-distribution.

    4. Find the appropriate critical value t_{α/2} by using the Student's t table.

    5. If the value of the test statistic falls in the rejection region of the two-tailed test, reject H0. Otherwise, do not reject H0.

    6. State the conclusion.
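    The steps above can be sketched for the aptitude data as follows. The test statistic did not survive in this transcript, so the code assumes the usual textbook form t = b1 / (S_e / sqrt(Σ(x - x̄)^2)); the critical value 2.306 is the standard table entry for 8 degrees of freedom with 0.025 in each tail.

```python
from math import sqrt

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((a - mean_x) ** 2 for a in x)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Step 2: test statistic (assumed textbook form, see the note above).
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s_e = sqrt(sse / (n - 2))
t_stat = b1 / (s_e / sqrt(sxx))

# Steps 3-4: degrees of freedom and critical value from a t table.
degrees_of_freedom = n - 2
T_CRITICAL = 2.306

# Step 5: two-tailed decision.
reject_h0 = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f} with {degrees_of_freedom} df, reject H0: {reject_h0}")
```

    Here t is roughly 17.2, far beyond the critical value, so H0 is rejected and the math score is judged a useful predictor of the business score.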


    The confidence interval for b1

    In the interval, t_{α/2} represents the t-distribution value obtained from the table using n - 2 degrees of freedom,

    S_e is the standard error of the estimate,

    ŷ_p is the predicted value of y corresponding to x = x_p.
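    Since the interval formula itself is missing from this transcript, the sketch below assumes the common textbook form b1 ± t_{α/2} · S_e / sqrt(Σ(x - x̄)^2) and applies it to the aptitude data. The function name and the 95% level are illustrative assumptions.

```python
from math import sqrt

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude


def slope_confidence_interval(xs, ys, t_critical):
    """Interval b1 +/- t * S_e / sqrt(Sxx), an assumed textbook form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(xs, ys))
    s_e = sqrt(sse / (n - 2))
    margin = t_critical * s_e / sqrt(sxx)
    return b1 - margin, b1 + margin


# 2.306 is the t table value for 8 degrees of freedom, 0.025 in each tail.
lower, upper = slope_confidence_interval(x, y, t_critical=2.306)
print(f"95% interval for the slope: ({lower:.3f}, {upper:.3f})")
```

    The interval excludes 0, which agrees with rejecting H0: b1 = 0 in the hypothesis test.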


    Minitab

    Regression Analysis


    Minitab

    Type the data into C1 and C2 in the data window, and label the columns; we will call C1 "math aptitude" and C2 "business aptitude".


    Minitab

    The resulting output:

    The resulting output can be filtered by deleting the unwanted information.


    Minitab

    To compute the linear correlation coefficient:

    Click on Stat > Basic Statistics > Correlation, then select math aptitude and business aptitude into the "Variables:" box.

    Click "OK."


    Minitab

    To predict the business aptitude score when the math aptitude score is 50:

    Label C6 "New math score", then enter in it 50, the value of the math score for which we want a prediction.

    Click on Stat > Regression > Regression.

    Select business aptitude into the "Response:" box and math aptitude into the "Predictors:" box.

    Now go to "Options" and select "New math score" into the box called "Prediction intervals for new observations:".

    Click on the "Fits" box under "Storage" to check it.

    Click "OK" to get back to the Regression dialog box.

    Click on "Results", then click on the top button, "Display nothing."

    Click "OK" on the Results dialog box and on the Regression dialog box.

    The result is displayed in C7.


    Residual Analysis

    Model Adequacy Checking


    Regression Model:

    $$Y = \beta_0 + \beta_1 X + \varepsilon$$

    Assumptions:

    1. The relationship between Y and the predictors is linear.

    2. The ε's are normally distributed.

    3. The noise term ε has zero mean.

    4. All ε's have the same variance σ^2.

    5. The ε's are uncorrelated between observations.

    6. The ε's are independent of the predictors.

    Residual analysis is used for detecting departures from these assumptions.

    Residual Analysis

    Definition of Residuals

    Residuals are estimates of experimental error.

    Mathematically, the residual for a specific predictor value is the difference between the observed response value y and the predicted response value ŷ:

    $$e_i = y_i - \hat{y}_i$$

    where:

    y_i = actual observation at X_i

    ŷ_i = predicted value from the equation $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$

    Residual Plots

    Residual plotting is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions.

    Residual Analysis

    The residual plots are:

    Histogram of the residuals and normal probability plot of the residuals

    Check for normality

    A symmetric, bell-shaped histogram that is evenly distributed around zero indicates that the normality assumption is likely to be true. If the histogram indicates that the random error is not normally distributed, it suggests that the model's underlying assumptions may have been violated.

    Sample sizes of residuals are generally small (


    Residual Analysis

    The normal probability plot

    The normal probability plot should produce an approximately straight line if the points come from a normal distribution.
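    To make the construction concrete, the sketch below computes the coordinates of a normal probability plot for the ten residuals of the aptitude example. The plotting position (rank - 0.5) / n is one common convention and is an assumption here, as is the function name; `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Residuals y - ŷ from the aptitude example.
residuals = [-2.58656, 1.46003, 2.81722, -2.21384, -2.75739,
             -2.46232, 3.1339, 0.72404, 2.46003, -0.57103]


def normal_plot_points(values):
    """Pair each sorted value with its cumulative percent and normal score."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    points = []
    for rank, value in enumerate(ordered, start=1):
        probability = (rank - 0.5) / n   # assumed plotting position
        z_score = standard_normal.inv_cdf(probability)
        points.append((value, 100 * probability, z_score))
    return points


points = normal_plot_points(residuals)
for value, percent, z_score in points:
    print(f"{value:9.5f}  {percent:5.1f}%  z = {z_score:6.3f}")
```

    Plotting the residual against the z score (or against the percent on a normal probability scale) gives the graph described above; points that fall close to a straight line support the normality assumption.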

    [Figure: Design-Expert normal plot of residuals for the response "Life". Horizontal axis: residual; vertical axis: normal % probability.]

    Residual Analysis

    Residuals plotted against the fitted values

    Check for the error variance

    This plot should produce a distribution of points scattered randomly about 0, regardless of the size of the fitted value.

    A residual plot whose spread increases suggests that the error variance increases with the independent variable, while a plot whose spread decreases indicates that the error variance decreases with the independent variable. Neither of these is a constant-variance pattern.

    Therefore, they indicate that the assumption of constant variance is not likely to be true and that the regression is not a good one.

    On the other hand, a horizontal-band pattern suggests that the variance of the residuals is constant.


    Residual analysis

    Residuals against run-order sequence or time

    Checking the process drift

    The Residual vs. Order of the Data plot can be used to check for drift of the variance during the experimental process when the data are time-ordered. If the residuals are randomly distributed around zero, there is no drift in the process.

    Checking independence of the error term

    The Residual vs. Order of the Data plot will reflect any correlation between the error term and time. Systematic patterns around zero, such as runs or cycles, indicate that the error terms are not independent.


    Residual Analysis

    Outlier

    An outlier is a single observation or a group of observations that are markedly different from the bulk of the data or from the pattern set by the majority of the observations.

    The presence of one or more outliers can seriously distort the analysis of variance.

    The check for outliers may be made by examining the standardized residuals:

    $$d_i = \frac{e_i}{\sqrt{MS_E}}$$

    where:

    d_i = standardized residual

    e_i = residual

    MS_E = mean square error

    The standardized residuals should be approximately normal with mean zero and unit variance.

    A standardized residual more than 3 or 4 standard deviations from zero is a potential outlier.

    Residual Analysis

    Example:

    The regression equation: ŷ = -2.221 + 1.01553 x

    Math aptitude "x"   Business aptitude "y"   ŷ          Residual y - ŷ
    52                  48                      50.58656   -2.58656
    49                  49                      47.53997   1.46003
    26                  27                      24.18278   2.81722
    28                  24                      26.21384   -2.21384
    63                  59                      61.75739   -2.75739
    44                  40                      42.46232   -2.46232
    70                  72                      68.8661    3.1339
    32                  31                      30.27596   0.72404
    49                  50                      47.53997   2.46003
    51                  49                      49.57103   -0.57103
    Sum: 464            449
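    The residual column, and the outlier check based on standardized residuals, can be sketched as follows. The code uses the rounded coefficients from the regression equation, takes MS_E as SSE / (n - 2), and assumes the standardized residual is e_i divided by the square root of MS_E; the cutoff of 3 is the lower of the two thresholds mentioned for potential outliers.

```python
from math import sqrt

x = [52, 49, 26, 28, 63, 44, 70, 32, 49, 51]   # math aptitude
y = [48, 49, 27, 24, 59, 40, 72, 31, 50, 49]   # business aptitude

B0, B1 = -2.221, 1.01553   # rounded coefficients of the regression equation

fitted = [B0 + B1 * a for a in x]
residuals = [b - f for b, f in zip(y, fitted)]

# Mean square error with n - 2 degrees of freedom.
mse = sum(e * e for e in residuals) / (len(x) - 2)
standardized = [e / sqrt(mse) for e in residuals]

# Flag observations whose standardized residual exceeds 3 in absolute value.
potential_outliers = [
    (a, b) for a, b, d in zip(x, y, standardized) if abs(d) > 3
]

for a, b, e, d in zip(x, y, residuals, standardized):
    print(f"x = {a}, y = {b}, residual = {e:8.5f}, standardized = {d:6.3f}")
print("potential outliers:", potential_outliers)
```

    For these ten students every standardized residual is well inside ±3, so no observation is flagged.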


    Residual Analysis

    Using Minitab: Stat > Regression > Regression > Graphs


    Thank you