Part III: Correlation and Regression
people.math.binghamton.edu/jbrennan/home/s13mat148/...scatter...

Chapter 8 & 9 - Correlation. PART III: CORRELATION AND REGRESSION. Dr. Joseph Brennan, Math 148, BU.

Post on 11-Feb-2021


  • Chapter 8 & 9 - Correlation

    PART III: CORRELATION AND REGRESSION

    Dr. Joseph Brennan

    Math 148, BU

    Dr. Joseph Brennan (Math 148, BU) Chapter 8 & 9 - Correlation 1 / 52

  • Two Quantitative Variables

    INTRODUCTION

    Two quantitative variables measured on the same individual are often studied simultaneously. Examples include

    Shoe size and weight;

    Father’s and son’s heights;

    Study time and exam score.

    We can describe each of the two variables using the descriptive statistics discussed in Part II. Since both variables are measured on the same individual, we may expect them to depend on each other. Descriptive statistics computed for each variable separately will not reveal the degree of that dependence.

  • Correlation and Regression

    Correlation Analysis

    The establishment of an association (or correlation) between two variables and assessing its strength.

    Regression Analysis

    The creation of a mathematical model or formula that relates the values of one variable to the values of the other.

    Is there a correlation between blood pressure and age?

    Is there a formula that will statistically predict the blood pressure of an adult male from his age?

  • Independent and Dependent Variables

    In the study of two quantitative variables, each variable has a distinct designation:

    Independent Variable

    The variable that is expected to influence the other variable.

    Also known as the explanatory variable.

    Dependent Variable

    The variable that is expected to be influenced by the other variable.

    Also known as the response variable.

    Example: The relationship between the age and weight of men with the same type of bone structure is under study. Which variable would you take to be the response variable?

  • Notation

    As in high school algebra:

    Independent variables will be denoted by x.

    Dependent variables will be denoted by y.

    As we will be analyzing data with numerous data points, we will index the values of the data points with subscripts. If we have n data points in a sample, we denote:

    $$x_1, x_2, \ldots, x_n \qquad y_1, y_2, \ldots, y_n$$

    where the i-th data point has independent value $x_i$ and dependent value $y_i$.

  • Notes on Variables

    Independent variables are used to predict or explain the values of the dependent variable.

    Dependent variables measure the outcome of a study. This variable will be of primary interest.

    In many problems the roles of the dependent and independent variables may be switched.

    For example, it is at the discretion of the researcher to assign the variable types when searching for a connection between the number of lottery tickets bought and the amount of liquor consumed.

    If the variables x and y are related, then certain values of y will tend to occur with certain values of x.

    For instance, y may tend to decrease when x increases. Consider illness and vaccination: as the incidence of vaccination increases, the incidence of illness decreases.

  • Scatter Diagram

    The type of relationship (if any) between x and y may be established from a graph called the scatter diagram (or scatter plot).

    As in high school algebra:

    Scatter diagrams use two axes.

    The horizontal axis (x-axis) represents values of the independent variable.

    The vertical axis (y-axis) represents values of the dependent variable.

    Every data point is plotted on the graph according to its independent variable value and its dependent variable value.

  • Example 1 (Metabolic Rate)

    From Moore and McCabe, Exercise 2.13.

    Metabolic rate, the rate at which the body consumes energy, is important in studies of weight gain, dieting and exercise. The table on the next slide gives data on the lean body mass and resting metabolic rate for 12 women and 7 men who are subjects in a study of dieting.

    Lean body mass, given in kilograms, is a person’s weight leaving out all fat.

    Metabolic rate is measured in calories burned per 24 hours, the same calories used to describe the energy content of foods.

    The researchers believe that lean body mass has an important influence on metabolic rate.

  • Example 1 (Metabolic Rate)

    The data:

  • Example 1 (Metabolic Rate)

    (a) Identify the variables:

    Independent variable: Body Mass (the researchers believe it influences metabolic rate)
    Dependent variable: Metabolic Rate

    (b) Computing the descriptive statistics for each variable:

    Figure : Descriptive statistics for Mass and Rate.

  • Example 1 (Metabolic Rate)

    (c) Plot the scatter diagram:

    Figure : Scatter diagram for the Metabolic rate data.

    NOTE: There are 19 points on the scatter diagram; each point corresponds to one person.

  • Example 2 (Father-son Heights)

    An example from our textbook, p. 120.

    Karl Pearson measured the heights of 1,078 fathers and their sons at maturity. The corresponding scatter diagram is given on the next slide. Each dot represents one father-son pair. The x-coordinate of the dot gives the height of the father; the y-coordinate gives the height of the son.

    The 45-degree line (y = x), plotted in Figure 3, corresponds to the families where

    son’s height = father’s height

  • Example 2 (Father-son heights)

    Figure : Scatter diagram of father-son heights.

  • Example 2 (Father-son heights)

    Only a few points fall exactly on the line. Most of the points scatter about the line.

    Points above the line correspond to sons who are taller than their fathers. Points below the line correspond to sons who are shorter than their fathers.

    If a son’s height is close to his father’s height, their point on the scatter diagram will be close to the 45-degree line. The further the point is from the line, the greater the discrepancy between the father’s height and his son’s height.

    The scatter of points about the line shows the weakness of the relationship between father’s height and son’s height.

    How much help does the father’s height give you in predicting his son’s height?

  • Interpreting Scatter Diagrams

    When analyzing a scatter diagram look for:

    The overall pattern of the plot, which is described by the form, direction, and strength of the relationship between the variables.

    Any striking deviations from that pattern. An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship.

  • Form of the Relationship

    We need to distinguish between the linear and non-linear forms of the relationship between two variables.

    If the points roughly follow a straight line, we say that the relationship between the variables is roughly linear. In this case a line can be drawn through the points, and the points will follow the line reasonably closely.

    In both examples (Example 1 (metabolic rate) and Example 2 (father-son heights)), roughly speaking, there is a linear relationship between the variables.

    If the points scatter around some curve, we say that the relationship between the two variables is non-linear.

  • Non-linear Relationship

  • Direction of the Relationship

    Variables with a linear relationship have two classifications:

    Positive Association: both variables increase or decrease simultaneously.

    In both Examples 1 and 2 the variables are positively associated: they increase simultaneously.

    Negative Association: when one of the variables increases, the other decreases.

    The next slide depicts a negative association: as x increases, y decreases.

    NOTE: The direction of association follows from the slope of an imaginary line plotted through the points.

    A positive association corresponds to a positive slope.

    A negative association corresponds to a negative slope.

  • Negative Relationship

    Figure : Negative association between variables.

  • Relationship Strength

    The strength of a relationship is determined by how closely the points in the scatter diagram follow a simple form such as a line. The more closely the points group around a line (or a curve), the stronger the relationship.

    The earlier scatter diagram of a non-linear relationship shows a strong relationship about an inverted parabola.

    CAUTION: The visual impression of the strength of the relationship between x and y given by a scatter diagram depends greatly on its scale. We will illustrate this later with an example. We need a numerical measure of the strength of the relationship between x and y that does not depend on the scale at which the scatter diagram is presented. One of the most commonly used such measures is the correlation coefficient.

  • Football-shaped Scatter Diagram

    A scatter diagram is football-shaped, or the points are said to form a cloud, if most of the points on the graph fall within an elliptical region.

  • Football-shaped Scatter Diagram

    Football-shaped scatter diagrams correspond to an important class of problems we have yet to study. The term football-shaped is reserved for linear relationships only.

    All the scatter diagrams we have considered so far, except for the non-linear one, were football-shaped.

    Not all linear scatter diagrams are football-shaped, as the following plot shows:

  • The Correlation Coefficient

    The correlation coefficient, r, is a descriptive statistic which measures the direction and strength of the linear relationship between two quantitative variables.

    Suppose that we have data on variables x and y for n individuals.

    Let the mean of the x values be $\bar{x}$ and let the mean of the y values be $\bar{y}$.

    Let the standard deviation of the x values be $s_x$ and let the standard deviation of the y values be $s_y$.

    The sample correlation coefficient r between x and y is computed as

    $$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\cdot z_{y,i} \qquad (1)$$

    where $z_{x,i} = \frac{x_i-\bar{x}}{s_x}$ and $z_{y,i} = \frac{y_i-\bar{y}}{s_y}$ are the z-scores for $x_i$ and $y_i$, respectively.
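
    Formula (1) can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides: the function name `correlation` is hypothetical, and it assumes the standard deviations divide by n (`statistics.pstdev`), the convention that matches the 1/n in formula (1). The two data sets are made up and lie exactly on straight lines.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average of the products of z-scores, as in formula (1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Assumption: population SD (divide by n), matching the 1/n in the formula
    s_x, s_y = pstdev(xs), pstdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / n

# Illustrative data lying exactly on lines of positive and negative slope
xs = [1, 2, 3, 4]
r_up = correlation(xs, [3, 5, 7, 9])    # y = 2x + 1
r_down = correlation(xs, [9, 7, 5, 3])  # y = -2x + 11
print(r_up, r_down)
```

    Up to floating-point rounding, the two results are 1 and −1, the extreme values that r can take.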

  • Properties of the Correlation Coefficient

    $$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\cdot z_{y,i}$$

    (1) The sign of the correlation coefficient r indicates the direction of the relationship between the variables:

    When r < 0, the relationship is negative.
    When r > 0, the relationship is positive.

    (2) The correlation coefficient is just a number; it has no units of measurement.

    This is because the correlation is obtained by averaging the products of the z-scores, which do not have a unit of measurement.

  • Properties of the Correlation Coefficient

    $$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\cdot z_{y,i}$$

    (3) The correlation r is always a number between −1 and 1:

    −1 ≤ r ≤ 1

    The closer r is to 1 or −1, the stronger the linear association between x and y. Values of r close to 0 may be caused by:

    1) lack of association between x and y,
    2) presence of outliers,
    3) nonlinearity of the relationship between x and y.

  • Properties of the Correlation Coefficient

    $$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\cdot z_{y,i}$$

    (4) Correlation only measures the strength of a LINEAR relationship between two variables. Correlation DOES NOT describe curved relationships between variables, no matter how strong they are!

    On the earlier "Non-linear Relationship" slide, the correlation coefficient is close to 0, even though the points show a strong association. This is because the relationship is non-linear: correlation measures only how closely the points follow a straight line.
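
    Property (4) can be checked numerically. The sketch below is an illustration under stated assumptions: the helper `correlation` is hypothetical, it uses the divide-by-n standard deviation, and the data are made-up points lying exactly on an inverted parabola that is symmetric about x = 0.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average product of z-scores (population SD, an assumption)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = pstdev(xs), pstdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / len(xs)

# Made-up data: y is completely determined by x, but not linearly
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [-(x ** 2) for x in xs]

r = correlation(xs, ys)
print(r)
```

    The positive and negative products of z-scores cancel, so r is 0 (up to rounding) even though every point lies exactly on the curve.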

    (5) Like the average x̄ and standard deviation s, the correlation is NOT resistant to outliers. In fact, it is strongly affected by a few outlying observations. The figure on the next slide illustrates this:

  • Outlier Influence

    In the figure, the dots show a perfect correlation of 1. The outlier, marked by a cross, brings the correlation almost to 0.
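
    The same effect can be reproduced with a handful of numbers. This is a sketch with made-up data, not the data behind the figure: five points on the line y = x plus one hypothetical outlier at (10, 1.5). The helper `correlation` is also hypothetical and uses the divide-by-n standard deviation.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average product of z-scores (population SD, an assumption)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = pstdev(xs), pstdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / len(xs)

# Five made-up points lying exactly on the line y = x
line_x = [1, 2, 3, 4, 5]
line_y = [1, 2, 3, 4, 5]
r_clean = correlation(line_x, line_y)

# The same points plus one hypothetical outlier far from the line
r_outlier = correlation(line_x + [10], line_y + [1.5])
print(r_clean, r_outlier)
```

    A single added point is enough to pull the correlation from 1 down to a value close to 0.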

  • Properties of the Correlation Coefficient

    (6) The correlation coefficient r is symmetric: the correlation between the values of x and y is the same as the correlation between the values of y and x. In other words, if we switch the axes on the scatter diagram, the correlation will not change.

    A scatter diagram for the data on temperatures in New York in June 2005.The panels have transposed labels, yet the same correlation coefficient!

  • Properties of the Correlation Coefficient

    (7) The correlation coefficient r is not affected by shifting and/or rescaling the data (x-data, y-data, or both), when the direction of association is preserved.

    Temperature is converted from Fahrenheit to Celsius by $y = -\frac{160}{9} + \frac{5}{9}x$.

    When exactly one of the rescaling constants (for x-data or y-data) is negative, the direction of association changes, and the correlation coefficient r changes its sign, but not its absolute value.
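
    Properties (6) and (7) can be verified with a short sketch. The temperatures and response values below are made up for illustration, and the helper `correlation` is hypothetical (it uses the divide-by-n standard deviation). The Celsius conversion (F − 32) · 5/9 is the same map as y = −160/9 + (5/9)x.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average product of z-scores (population SD, an assumption)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = pstdev(xs), pstdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / len(xs)

# Made-up data: temperatures in Fahrenheit and some response variable
fahrenheit = [50, 59, 68, 77, 86]
response = [3, 7, 6, 10, 12]

celsius = [(f - 32) * 5 / 9 for f in fahrenheit]    # shift and rescale
negated = [-f for f in fahrenheit]                  # negative rescaling constant

r_f = correlation(fahrenheit, response)
r_c = correlation(celsius, response)          # same as r_f
r_neg = correlation(negated, response)        # same size, opposite sign
r_swapped = correlation(response, fahrenheit) # axes switched, same as r_f
print(r_f, r_c, r_neg, r_swapped)
```

    Changing units leaves r unchanged because the z-scores themselves do not change; multiplying one variable by a negative constant flips the sign of its z-scores and therefore of r.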

  • Example 1, p. 132

    Example 1, p.132 of the textbook.

    Compute r for the following data :

    x    y
    1    5
    3    9
    4    7
    5    1
    7   13

  • Example 1, p. 132

    Step 1. Compute the mean and standard deviation for each variable:

    $\bar{x} = 4$, $s_x = 2$;

    $\bar{y} = 7$, $s_y = 4$.

    Step 2. Compute the z-scores for the x’s and y’s.

    Step 3. Find the product of the z-scores $z_x \cdot z_y$ in each row.

    x    y     z_x     z_y    z_x · z_y
    1    5    -1.5    -0.5      0.75
    3    9    -0.5     0.5     -0.25
    4    7     0.0     0.0      0.00
    5    1     0.5    -1.5     -0.75
    7   13     1.5     1.5      2.25

    For example, the z-score $z_x = -1.5$ in the first row was found as

    $$z_x = \frac{x-\bar{x}}{s_x} = \frac{1-4}{2} = -1.5$$

  • Example 1, p. 132

    Recall the formula of r:

    $$r = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\cdot z_{y,i}$$

    In our example n = 5.

    x    y     z_x     z_y    z_x · z_y
    1    5    -1.5    -0.5      0.75
    3    9    -0.5     0.5     -0.25
    4    7     0.0     0.0      0.00
    5    1     0.5    -1.5     -0.75
    7   13     1.5     1.5      2.25

    Step 4. Obtain r by averaging the products:

    $$r = \frac{1}{5}\sum_{i=1}^{5} z_{x,i}\cdot z_{y,i} = \frac{0.75 + (-0.25) + 0.00 + (-0.75) + 2.25}{5} = 0.4$$
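
    Steps 1 to 4 can be retraced in Python as a check on the arithmetic. This is a sketch, with variable names chosen for illustration; it assumes the standard deviation divides by n (`statistics.pstdev`), which is what gives the values 2 and 4 used in Step 1.

```python
from statistics import mean, pstdev

xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]

# Step 1: means and (divide-by-n) standard deviations
x_bar, s_x = mean(xs), pstdev(xs)
y_bar, s_y = mean(ys), pstdev(ys)

# Steps 2 and 3: z-scores and their products, one row per data point
rows = []
for x, y in zip(xs, ys):
    z_x = (x - x_bar) / s_x
    z_y = (y - y_bar) / s_y
    rows.append((x, y, z_x, z_y, z_x * z_y))

for row in rows:
    print("%2d %3d %6.2f %6.2f %6.2f" % row)

# Step 4: average the products
r = sum(row[4] for row in rows) / len(rows)
print("r =", r)
```

    The printed rows reproduce the table above, and the average of the last column is 0.4.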

  • Interpreting the Correlation Coefficient

    What does a correlation coefficient of 0.4 tell us? Keep in mind that −1 ≤ r ≤ 1.

    Values of r near 0 indicate a very weak linear relationship. A random scattering of points in a rectangular region with no clear pattern corresponds to a correlation close to 0.

  • Interpreting the Correlation Coefficient

    The strength of the relationship increases as r moves away from 0 toward either −1 or 1. Values of r close to −1 or 1 indicate that the points lie close to a straight line.

    The extreme values r = −1 and r = 1 occur only when the points in a scatter diagram lie EXACTLY along a straight line.

    Figure : Perfect linear relationship between variables.

  • Interpreting the Correlation Coefficient

    Conventional interpretations of the computed (absolute) values of r are given in the following table.

    Value of |r|    Interpretation
    0.0 - 0.2       Very weak to negligible correlation
    0.2 - 0.4       Weak, low correlation (not very significant)
    0.4 - 0.7       Moderate correlation
    0.7 - 0.9       Strong, high correlation
    0.9 - 1.0       Very strong correlation
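
    The table can be turned into a small lookup function. This is only a sketch: the name `interpret` is hypothetical, and because the table does not say which band a boundary value such as 0.4 belongs to, the code assumes it goes to the higher band.

```python
def interpret(r):
    """Return the conventional verbal label for a correlation r."""
    size = abs(r)  # the table is stated in terms of |r|
    if not 0.0 <= size <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    # Assumption: a boundary value (0.2, 0.4, 0.7, 0.9) goes to the higher band
    if size < 0.2:
        return "Very weak to negligible correlation"
    if size < 0.4:
        return "Weak, low correlation"
    if size < 0.7:
        return "Moderate correlation"
    if size < 0.9:
        return "Strong, high correlation"
    return "Very strong correlation"

label_example = interpret(0.4)     # the value from the worked example
label_negative = interpret(-0.85)  # the sign does not affect the label
print(label_example)
print(label_negative)
```

    Under that boundary assumption, r = 0.4 is labeled a moderate correlation; the sign of r affects the direction of the association, not the label.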

    The scatter diagrams on the following two slides illustrate the meaning of r graphically. To make the essential meaning of r clear, the standard deviations of both variables in these plots are equal to 1 and the horizontal and vertical scales are the same.

  • Positive Correlation

    Each diagram contains 50 data points and is drawn at a scale where the mean is 3 units and the standard deviation is 1 unit, both horizontally and vertically.

  • Negative Correlation

    Each diagram contains 50 data points and is drawn at a scale where the mean is 3 units and the standard deviation is 1 unit, both horizontally and vertically.

  • Correlation Coefficient: CAUTION

    CAUTION 1. In general, it is not easy to guess the value of r from the appearance of a scatter diagram. Changing the plotting scales in a scatter diagram may mislead your eyes!

    The two scatter diagrams in the figure depict exactly the same data. The correlation between the variables is r = 0.93.

    Figure 3 on page 145 in the textbook illustrates the same idea.

  • Correlation Coefficient: CAUTION

    CAUTION 2.

    Correlation ≠ Causation

    DO NOT conclude that correlated variables are causally related!

    Most of the data sets analyzed using correlation and regression come from observational studies. In this case r measures association between the variables, but not a cause-effect relationship. Even if the correlation between the two variables is very high in an observational study, it may be due to confounding factors.

    Causation may be established only from well-designed experimental studies.

    If we examine annual military appropriations and annual paper clip production in the US since 1948, we find that both have increased and that there is a strong correlation between these variables. This finding does not suggest that one of these trends is responsible for the other. In this case there is a third variable, time, to which both are related.

  • Causation?

    We have a strong correlation between fatality rates on US highways and lemons imported from Mexico. Don’t believe it? See for yourself!

  • Causation?

    The correlation coefficient between a country’s chocolate consumption and the number of Nobel Laureates per capita is r = 0.791.

  • Causation?

    The squared correlation coefficient between a country’s milk consumption and the number of Nobel Laureates per capita is $r^2 = 0.5733$, which corresponds to |r| ≈ 0.76.

  • Correlation Coefficient: CAUTION

    CAUTION 3. Beware of correlations based on summarized data.

    By summarizing the data we mean computing averages, medians or rates for some groups of individuals. The book calls the correlation computed from presummarized data the ecological correlation.

    A correlation based on averages (medians) over many individuals is usually higher than the correlation between the same variables based on raw data for individuals.

    The figure on the next slide illustrates how ecological correlation overstates the strength of an association between income and education.

  • Ecological Correlation: Income vs. Education

    The left panel represents income and education for individuals in three states, labeled A, B, C. Each individual is marked by a letter showing their state of residence. The correlation is moderate.

    The right panel shows the average for each state. The correlation for this panel is nearly one!
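
    The effect of averaging can be imitated with a few made-up numbers. The sketch below is not the data behind the figure: the (education, income) pairs for states A, B and C are hypothetical, chosen so that the state averages fall on a line while individuals scatter widely. The helper `correlation` is also hypothetical and uses the divide-by-n standard deviation.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average product of z-scores (population SD, an assumption)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = pstdev(xs), pstdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / len(xs)

# Hypothetical (education, income) pairs for individuals in three states
states = {
    "A": [(8, 30), (10, 10), (12, 20)],
    "B": [(10, 40), (12, 20), (14, 30)],
    "C": [(12, 50), (14, 30), (16, 40)],
}

# Correlation over all individuals
people = [pair for group in states.values() for pair in group]
r_individuals = correlation([e for e, _ in people], [i for _, i in people])

# Ecological correlation: one averaged point per state
avg_edu = [mean([e for e, _ in group]) for group in states.values()]
avg_inc = [mean([i for _, i in group]) for group in states.values()]
r_averages = correlation(avg_edu, avg_inc)

print(r_individuals, r_averages)
```

    For these made-up values the individual-level correlation is 0.25, while the correlation of the three state averages is 1: averaging removes the scatter within each state.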

  • Example (Infants)

    If we plot the median weight of infants against their age in months, we will see a very strong association with correlation near one. However, individual children of the same age vary a great deal in weight. A plot of weight against age for individual children shows much more scatter and lower correlation than the plot of the median weight against age.

  • Correlation Coefficient: CAUTION

    CAUTION 4. Correlation for data combined from several groups of individuals may be misleading.

    Case 1. If data from several groups form clusters, the overall correlation may be too low or too high.

    Two groups (men and women) are combined on a single graph. Men and women form separate clusters of points in the scatter diagram. There is a strong relationship between x and y within each of the clusters (r = 0.85 and r = 0.91). Because similar values of x correspond to quite different values of y in the two clusters, the overall correlation is low: r = 0.14.
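
    Case 1 can be imitated with two made-up clusters. This sketch does not use the data from the figure: both hypothetical groups share the same x values and follow parallel lines, with the second group shifted upward by 20 units. The helper `correlation` is hypothetical and uses the divide-by-n standard deviation.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average product of z-scores (population SD, an assumption)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = pstdev(xs), pstdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / len(xs)

# Two hypothetical groups with the same x values but very different y levels
group1_x = [1, 2, 3, 4, 5]
group1_y = [1, 2, 3, 4, 5]        # y = x
group2_x = [1, 2, 3, 4, 5]
group2_y = [21, 22, 23, 24, 25]   # y = x + 20

r_group1 = correlation(group1_x, group1_y)
r_group2 = correlation(group2_x, group2_y)
r_combined = correlation(group1_x + group2_x, group1_y + group2_y)

print(r_group1, r_group2, r_combined)
```

    Within each group the relationship is perfectly linear, yet the combined correlation is low, because most of the variation in y comes from the gap between the groups rather than from x.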

  • Correlation Coefficient: CAUTION

    Case 2. If the data from the groups form a single cloud, the overall correlation may be larger or smaller than the individual correlations in the groups.

    Example (Metabolic Rate)

    The data on metabolic rate from men and women are mixed so that the overall correlation is about the same as, or stronger than, the correlations in the separate groups.

  • Correlation Coefficient: CAUTION

    Figure : Correlations for separate groups and for combined data.

  • Correlation Coefficient: CAUTION

    The correlation coefficient only measures the strength of LINEAR relationships.

    CORRELATION ≠ CAUSATION

    In order to guess r accurately by eye, construct the scatter diagram so that one vertical standard deviation covers the same distance on the page as one horizontal standard deviation.

    A coefficient r = 0.80 does NOT mean that 80% of the points are tightly clustered around a line, NOR does it indicate twice as much linearity as r = 0.40.

  • Boston Marathon versus Temperature

    The average finish time in minutes and the temperature during the race in Fahrenheit for the Boston Marathon are listed below:

    Year    Avg. Finish Time (minutes)    Temperature (F)
    2000    221                           49
    2001    226                           54
    2002    221                           55
    2003    235                           65
    2004    253                           85
    2005    237                           68
    2006    230                           54
    2007    234                           49
    2008    231                           53
    2009    229                           50
    2010    230                           53

  • Boston Marathon versus Temperature

                          Finish Time    Temperature
    Mean                  231.5          57.7
    Standard Deviation    8.4            10.4

    Once the mean and standard deviation are calculated, a properly scaled scatter plot can be drawn:

  • Boston Marathon versus Temperature

    Recall that $z_x = \frac{x-\bar{x}}{s_x}$. Here X is the average finish time and Y is the temperature.

    X      Y     z_x      z_y     z_x · z_y
    221    49    -1.25    -0.82    1.03
    226    54    -0.65    -0.34    0.22
    221    55    -1.25    -0.24    0.30
    235    65     0.42     0.72    0.30
    253    85     2.56     2.64    6.76
    237    68     0.65     1.01    0.66
    230    54    -0.18    -0.34    0.06
    234    49     0.30    -0.82   -0.25
    231    53    -0.06    -0.43    0.03
    229    50    -0.30    -0.72    0.22
    230    53    -0.18    -0.43    0.08

    $$r = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\cdot z_{y,i}$$

    Averaging the values of the final column yields the correlation coefficient:

    r = 0.85
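
    As a cross-check, the same calculation can be run on the raw data without rounding the z-scores. This is a sketch with illustrative variable names; it assumes the divide-by-n standard deviation (`statistics.pstdev`), which reproduces the 8.4 and 10.4 quoted above.

```python
from statistics import mean, pstdev

# Boston Marathon data, 2000 to 2010, from the table above
finish_time = [221, 226, 221, 235, 253, 237, 230, 234, 231, 229, 230]
temperature = [49, 54, 55, 65, 85, 68, 54, 49, 53, 50, 53]

x_bar, s_x = mean(finish_time), pstdev(finish_time)
y_bar, s_y = mean(temperature), pstdev(temperature)

# Products of unrounded z-scores, one per year
products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(finish_time, temperature)]
r = sum(products) / len(products)

print(round(x_bar, 1), round(s_x, 1))
print(round(y_bar, 1), round(s_y, 1))
print(round(r, 3))
```

    The unrounded result is about 0.855, in line with the value on the slide; the small difference in the last digit comes from rounding the z-scores in the table before multiplying.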
