STAT 111 Introductory Statistics
Lecture 3: Regression
May 20, 2004
Today’s Topics
• Regression line
– Fitting a line
– Prediction
– Least-squares
– Interpretation
• Correlation and regression
• Causation
• Transforming variables (briefly)
Review: The Scatterplot
• The scatterplot shows the relationship between two quantitative variables.
• It plots the observations of different individuals in a two-dimensional graph.
• Each point in a scatterplot corresponds to an observation of two variables of the same individual.
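As a quick illustration (the lecture itself uses JMP, not Python), here is a minimal sketch of such a scatterplot; the score pairs below are made up purely for display:

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations: one (verbal, math) point per student
verbal_scores = [450, 500, 520, 560, 580, 610, 640, 700]
math_scores = [470, 510, 490, 570, 600, 590, 650, 680]

plt.scatter(verbal_scores, math_scores)     # one point per individual
plt.xlabel("SAT verbal (explanatory)")
plt.ylabel("SAT math (response)")
plt.title("Scatterplot of two quantitative variables")
plt.show()
```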
The Regression Line
• A regression line is a straight line that summarizes the linear relationship between two variables.
• It describes how a response variable y changes as an explanatory variable x changes.
• A regression line is often used as a model to predict the value of the response y for a given value of the explanatory variable x.
The Regression Line (cont.)
• We fit a line to data by drawing the line that comes as close as possible to the points.
• Once we have a regression line, we can predict the y for a specific value of x. Accuracy depends on how scattered the data are about the line.
• Using the regression line to predict far outside the range of values of x used to obtain the line is called extrapolation. This is generally not advised, since such predictions are often inaccurate.
Example: Predicting SAT Math Scores using SAT Verbal Scores
• Making a regression line using JMP: Analyze → Fit Y by X → Put the response variable into Y, explanatory variable into X → Hit OK → Double-click the red triangle above the scatterplot → Fit Line
• Mathematically, a straight line has an equation of the form y = a + bx, where b is the slope and a is the intercept. But how do we determine the value of these two numbers?
The Least-Squares Regression Line
• The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
• Mathematically, the line is determined by minimizing the sum of squared vertical deviations, Σᵢ (yᵢ − (a + bxᵢ))², over all choices of a and b.
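To make the least-squares criterion concrete, here is a small Python sketch (with made-up numbers, not the lecture's SAT data) showing that the fitted line attains a smaller sum of squared vertical distances than nearby lines:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

def sse(a, b):
    """Sum of squared vertical distances from the line y = a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

# Least-squares solution (np.polyfit returns [slope, intercept] for degree 1)
b_hat, a_hat = np.polyfit(x, y, 1)

print(sse(a_hat, b_hat))        # minimal sum of squares
print(sse(a_hat + 0.1, b_hat))  # any nearby line does worse
print(sse(a_hat, b_hat + 0.1))
```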
The Least-Squares Regression Line (cont.)
• The equation of the least-squares regression line of y on x is ŷ = a + bx.
• The slope is determined using the formula b = r·(s_y / s_x).
• The intercept is calculated using a = ȳ − b·x̄.
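A short sketch (made-up data) verifying that these formulas reproduce a direct least-squares fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up explanatory values
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])   # made-up responses

r = np.corrcoef(x, y)[0, 1]               # correlation between x and y
b = r * y.std(ddof=1) / x.std(ddof=1)     # slope: b = r * s_y / s_x
a = y.mean() - b * x.mean()               # intercept: a = ybar - b * xbar

b_check, a_check = np.polyfit(x, y, 1)    # direct least-squares fit
print(b, a)
print(b_check, a_check)                   # same slope and intercept
```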
Interpreting the Regression Line
• The slope b tells us that along the regression line, a change of 1 unit in x corresponds to a change of b units in y.
• The least-squares regression line always passes through the point (x̄, ȳ).
• If both x and y are standardized variables, then the slope of the least-squares regression line will be r, and the line will pass through the origin (0, 0).
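Both facts are easy to verify numerically; a minimal sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

b, a = np.polyfit(x, y, 1)

# The fitted line passes through (xbar, ybar)
print(np.isclose(a + b * x.mean(), y.mean()))      # True

# After standardizing both variables, the slope equals r
# and the intercept is 0 (line through the origin)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
b_z, a_z = np.polyfit(zx, zy, 1)
print(b_z, np.corrcoef(x, y)[0, 1])                # equal
print(np.isclose(a_z, 0.0))                        # True
```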
Interpreting the Regression Line (cont.)
• Since standard deviation can never be negative, the signs of r and b will always be the same.
• Hence, if our slope is positive, we have a positive association between our explanatory variable and our response.
• On the other hand, if our slope is negative, then we have a negative association between our explanatory variable and our response.
Example: SAT Scores Again
• In our SAT data, the math score is the response, and the verbal score is the explanatory variable. The least-squares regression line as reported by JMP is
math = 498.00765 + 0.3167866 verbal
• Hence, in the context of the SAT, if a student’s verbal score is 10 points higher, then his predicted math score is a little more than 3 points higher.
Example: SAT Scores (cont.)
• Suppose we want to use our regression line to predict a student’s math score given that his verbal score was 550.
• The predicted math score then would be
498.00765 + 0.3167866 (550) = 672
• Remember not to extrapolate when you make your predictions.
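In code form, the prediction is just a plug-in; a sketch using the coefficients JMP reported above:

```python
# Coefficients from the fitted line math = 498.00765 + 0.3167866 * verbal
a, b = 498.00765, 0.3167866

def predict_math(verbal_score):
    """Predicted math score; avoid using this far outside the range
    of verbal scores that produced the fit (extrapolation)."""
    return a + b * verbal_score

print(round(predict_math(550)))   # 672, as on the slide
```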
Example: SAT Scores (cont.)
• Now, suppose we instead wanted to use a regression line to predict verbal scores using math scores, and suppose that one student had a math score of 670.
• Naively, we would predict the verbal score by taking the inverse of our existing regression line, in which case we would predict a verbal score between 540 and 550.
• It is not quite as simple as this.
Example: SAT Scores (cont.)
• What we would need to do is re-fit the regression line using math scores as our explanatory variable and verbal scores as our response.
• The new regression line is (from JMP)
verbal = 408.37653 + 0.3901289 math
• So, our predicted verbal score given a math score of 670 would be
408.37653 + 0.3901289 (670) = 670
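A sketch with simulated (made-up) scores showing that the naive algebraic inverse and the properly re-fitted line generally give different predictions whenever the correlation is less than 1:

```python
import numpy as np

rng = np.random.default_rng(0)
verbal = rng.normal(500, 60, 200)                    # simulated scores
math_scores = 300 + 0.4 * verbal + rng.normal(0, 55, 200)

# Fit math on verbal, then naively invert the line algebraically
b_mv, a_mv = np.polyfit(verbal, math_scores, 1)
naive_verbal = (670 - a_mv) / b_mv

# Correct approach: re-fit with math as the explanatory variable
b_vm, a_vm = np.polyfit(math_scores, verbal, 1)
refit_verbal = a_vm + b_vm * 670

print(naive_verbal, refit_verbal)   # noticeably different predictions
```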
Correlation and Regression
• The square of the correlation, r², is the proportion of the variation in the data that is explained by our least-squares regression line.
• r² is always between 0 and 1.
• If r = ±0.7, then r² = 0.49, or about ½ of the variation.
• In our SAT data, r² = 0.1236 (it is the same for both regressions), so our regression line only captures about 12% of the response’s variation.
Understanding r²
• Let’s look at the SAT line (verbal as x, math as y) once again.
• The variance in our observed math values is (61.262875)² = 3753.14
• If the only variability in observed math scores was because of the linear fit, then math scores would lie exactly on our line.
• In other words, the math scores would be identical to our predicted math scores.
Understanding r² (cont.)
• After computing the predicted math scores, we find that the variance of our predicted values is (21.53698)² = 463.84
• If we divide the variance of our predicted values by the variance of our actual values, we have
463.84 / 3753.14 = 0.1236
• For least-squares regression, it is always true that r² gives the variance of the predicted responses as a fraction of the variance of the actual responses.
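A sketch (made-up data again) confirming this identity numerically:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x                          # predicted responses

r = np.corrcoef(x, y)[0, 1]
print(r ** 2)                              # r-squared
print(y_hat.var(ddof=1) / y.var(ddof=1))   # variance ratio: same value
```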
Diagnosis (How Good is our Model?)
• Although we are most interested in the overall pattern as described by the regression line, deviations from this pattern are also important.
• In the regression setting, the deviations we consider are the vertical distances from the actual points to the least-squares regression line.
• These distances represent the variation left in the response after fitting the line and are called residuals.
Residuals
• A residual is the difference between an observed value and the predicted value.
• Residual = observed y – predicted y
• The sum of the residuals of a regression line is always equal to 0.
• A residual plot is a scatterplot of regression residuals against the explanatory variable and is used to assess the fit of a regression line.
• In symbols: residual = y − ŷ.
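A sketch (made-up data) illustrating both the zero-sum property and the residual plot:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)    # observed y minus predicted y

print(residuals.sum())         # essentially 0, up to rounding error

# Residual plot: residuals against the explanatory variable
plt.scatter(x, residuals)
plt.axhline(0)
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```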
Simplified Patterns of Least-squares Residuals
[Figure: three simplified residual plots (residual vs. x), illustrating a linear relationship, a nonlinear relationship, and nonconstant prediction error.]
Outliers and Influential Observations
• An outlier is an observation that lies outside the overall pattern of the other observations.
• Points that are outliers in the y direction have large regression residuals, but that need not be the case for all outliers.
• An influential observation is one that would significantly change the regression line if removed. An outlier in the x direction is often influential for the least-squares regression line.
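A sketch with simulated data showing how a single point far out in the x direction pulls the least-squares line toward itself:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 20)
y = 2 + 0.5 * x + rng.normal(0, 0.5, 20)   # roughly linear pattern

# One influential point: far out in x, well below the pattern
x_all = np.append(x, 30.0)
y_all = np.append(y, 5.0)

print(np.polyfit(x, y, 1))          # slope near 0.5 without the point
print(np.polyfit(x_all, y_all, 1))  # slope pulled down by the outlier
```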
Example: Age at First Word and Gesell Score
• Does the age at which a child begins to talk predict a later score on a test of mental ability?
• The age in months at which the first word was spoken and the score on an ability test taken much later were recorded for 21 children.
• Fitting a line to all data reveals a negative linear relationship: early talkers tend to have higher test scores than those who start talking later.
Example: First Word and Gesell Score (cont.)
[Figure: Bivariate Fit of Score by Age — scatterplot of Gesell score against age at first word, with two linear fits; observations 18 and 19 are labeled.]
Example: First Word and Gesell Score (cont.)
• In the scatterplot, we see that observations 18 and 19 are unusual.
• Observation 18 is far out in the x direction; observation 19 is far out in the y direction.
• The red line is the regression line obtained by including observation 18; the green line is the one obtained by excluding it.
• Observation 18 pulls the line toward itself; hence it is influential.
Extreme Example: Random Data
[Figure: Bivariate Fit of Column 2 by Column 3 — scatterplot of randomly generated data, with two linear fits.]
Causation vs Association
• Example of causation: Increased consumption of alcohol causes a decrease in coordination and reflexes.
• Example of association: A high SAT score in senior year of high school is typically associated with a high GPA in freshman year of college.
• In general, an association between an explanatory variable x and a response y is not sufficient evidence to prove that x causes y.
Causation vs Association (cont.)
• Examples:
– High SAT math scores tend to be accompanied by high SAT verbal scores, but does this mean a high math score causes a high verbal score?
– Nations in which people have easy access to the Internet tend to have higher life expectancies. Does better access to the Internet cause people to live longer?
– The divorce rate tends to be positively correlated with the quantity of bananas imported. Does importing more bananas cause more people to get divorced?
Lurking Variables
• A lurking variable is one that is not among the explanatory or response variables in a study, but may influence the interpretation of relationships among those variables.
• In each of our three cases mentioned previously, there is likely a lurking variable at work.
• Give an example of one for each of these scenarios.
Lurking Variables (cont.)
• Lurking variables can create “nonsense correlations”: correlations that falsely suggest that changing one variable causes changes in the other.
• In addition, lurking variables can hide a true relationship between explanatory and response variables.
Causation
• In many cases, we wish to determine whether changes in an explanatory variable cause changes in the response variable.
• Even in the presence of strong association, it is difficult to decide whether this is due to a causal link.
• There are three main ways to explain an association between two variables.
Explaining Association
• The association between an explanatory and a response variable may be due to
– Causation, when there is a direct cause-and-effect link between these two variables.
– Common response, when there is a lurking variable whose changes cause both the explanatory variable and the response variable to change.
– Confounding, when there are multiple influences at work that are getting mixed up.
Explaining Association (cont.)
• Officially, two variables are considered confounded when their effects on a response variable cannot be distinguished from each other.
• Confounded variables can be either explanatory or lurking.
• Even a very strong association between two variables is not sufficient evidence that there is a cause-and-effect link between the variables.
• The best way to establish that an association is due to causation is with a carefully designed experiment – more on this later.
Transformations of Relationships
• In some situations, the values of quantitative variables are quite spread out, with some isolated points; the rest of the data is then very compressed, making the plot difficult to read.
• Situations like this suggest using a function of the original variable; for example, we might use a function that will shrink the distance between values. This is what we call transforming the data.
Transformations of Relationships (cont.)
• Transforming data changes the original scale of measurement. Our most common transformations are linear (˚F → ˚C, lb → kg).
• Linear transformations cannot straighten curved relationships, though; to do that, we need a nonlinear transformation (e.g., powers, exponentials, logarithms).
• The most common transformations of our explanatory variable x are power transformations of the form x^p.
Transformations of Relationships (cont.)
• We call a function f(x) monotone if its values move in only one direction as x increases.
• For positive values of x, power functions with positive p (and the logarithm function) are monotonic increasing and preserve the order of observations.
• For negative p, the power functions are monotonic decreasing and reverse the order of observations.
• If we believe that there is some mathematical model that describes our data, then transformations will be quite effective.
• For example, the exponential growth model y = a·b^x can be written as a linear model if we take the logarithm of y: log y = log a + x·log b.
• On the other hand, a power-law growth model y = a·x^p can be written as a linear model if we take the logarithm of both x and y: log y = log a + p·log x.
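A sketch with noiseless made-up data showing how taking logarithms turns both models into straight lines and recovers the parameters:

```python
import numpy as np

x = np.linspace(1, 10, 50)
y_exp = 3.0 * 1.5 ** x      # exponential growth: y = a * b^x
y_pow = 3.0 * x ** 2.0      # power law: y = a * x^p

# Exponential model: log y is linear in x
slope, intercept = np.polyfit(x, np.log(y_exp), 1)
print(np.exp(intercept), np.exp(slope))   # recovers a = 3.0, b = 1.5

# Power-law model: log y is linear in log x
p, log_a = np.polyfit(np.log(x), np.log(y_pow), 1)
print(np.exp(log_a), p)                   # recovers a = 3.0, p = 2.0
```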
Transformations of Relationships (cont.)
• In practice, our decision to make a transformation is governed by what we know about the data.
• This also holds true in terms of what type of transformation we decide to make.
• For example, animal populations and values of investments are often well described by an exponential growth model, though we do not always know the values of the parameters.