STAT 111 Introductory Statistics
Lecture 3: Regression
May 20, 2004
Today’s Topics
• Regression line
– Fitting a line
– Prediction
– Least-squares
– Interpretation
• Correlation and regression
• Causation
• Transforming variables (briefly)
Review: The Scatterplot
• The scatterplot shows the relationship between two quantitative variables.
• It plots the observations of different individuals in a two-dimensional graph.
• Each point in a scatterplot corresponds to an observation of two variables of the same individual.
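As a quick illustration (the lecture itself uses JMP, not Python), here is a minimal sketch of such a scatterplot; the score pairs below are made up purely for display:

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations: one (verbal, math) point per student
verbal_scores = [450, 500, 520, 560, 580, 610, 640, 700]
math_scores = [470, 510, 490, 570, 600, 590, 650, 680]

plt.scatter(verbal_scores, math_scores)     # one point per individual
plt.xlabel("SAT verbal (explanatory)")
plt.ylabel("SAT math (response)")
plt.title("Scatterplot of two quantitative variables")
plt.show()
```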
The Regression Line
• A regression line is a straight line that summarizes the linear relationship between two variables.
• It describes how a response variable y changes as an explanatory variable x changes.
• A regression line is often used as a model to predict the value of the response y for a given value of the explanatory variable x.
The Regression Line (cont.)
• We fit a line to data by drawing the line that comes as close as possible to the points.
• Once we have a regression line, we can predict the y for a specific value of x. Accuracy depends on how scattered the data are about the line.
• Using the regression line to predict far outside the range of values of x used to obtain the line is called extrapolation. This is generally not advised, since such predictions are often inaccurate.
Example: Predicting SAT Math Scores using SAT Verbal Scores
• Making a regression line using JMP: Analyze → Fit Y by X → Put the response variable into Y, explanatory variable into X → Hit OK → Double-click the red triangle above the scatterplot → Fit Line
• Mathematically, a straight line has an equation of the form y = a + bx, where b is the slope and a is the intercept. But how do we determine the value of these two numbers?
The Least-Squares Regression Line
• The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
• Mathematically, the line is determined by minimizing the sum of squared vertical deviations, Σᵢ (yᵢ − (a + bxᵢ))², over all choices of a and b.
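To make the least-squares criterion concrete, here is a small Python sketch (with made-up numbers, not the lecture's SAT data) showing that the fitted line attains a smaller sum of squared vertical distances than nearby lines:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

def sse(a, b):
    """Sum of squared vertical distances from the line y = a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

# Least-squares solution (np.polyfit returns [slope, intercept] for degree 1)
b_hat, a_hat = np.polyfit(x, y, 1)

print(sse(a_hat, b_hat))        # minimal sum of squares
print(sse(a_hat + 0.1, b_hat))  # any nearby line does worse
print(sse(a_hat, b_hat + 0.1))
```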
The Least-Squares Regression Line (cont.)
• The equation of the least-squares regression line of y on x is ŷ = a + bx.
• The slope is determined using the formula b = r·(s_y / s_x).
• The intercept is calculated using a = ȳ − b·x̄.
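A short sketch (made-up data) verifying that these formulas reproduce a direct least-squares fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up explanatory values
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])   # made-up responses

r = np.corrcoef(x, y)[0, 1]               # correlation between x and y
b = r * y.std(ddof=1) / x.std(ddof=1)     # slope: b = r * s_y / s_x
a = y.mean() - b * x.mean()               # intercept: a = ybar - b * xbar

b_check, a_check = np.polyfit(x, y, 1)    # direct least-squares fit
print(b, a)
print(b_check, a_check)                   # same slope and intercept
```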
Interpreting the Regression Line
• The slope b tells us that along the regression line, a change of 1 unit in x corresponds to a change of b units in y.
• The least-squares regression line always passes through the point (x̄, ȳ).
• If both x and y are standardized variables, then the slope of the least-squares regression line will be r, and the line will pass through the origin (0, 0).
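Both facts are easy to verify numerically; a minimal sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

b, a = np.polyfit(x, y, 1)

# The fitted line passes through (xbar, ybar)
print(np.isclose(a + b * x.mean(), y.mean()))      # True

# After standardizing both variables, the slope equals r
# and the intercept is 0 (line through the origin)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
b_z, a_z = np.polyfit(zx, zy, 1)
print(b_z, np.corrcoef(x, y)[0, 1])                # equal
print(np.isclose(a_z, 0.0))                        # True
```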
Interpreting the Regression Line (cont.)
• Since standard deviation can never be negative, the signs of r and b will always be the same.
• Hence, if our slope is positive, we have a positive association between our explanatory variable and our response.
• On the other hand, if our slope is negative, then we have a negative association between our explanatory variable and our response.
Example: SAT Scores Again
• In our SAT data, the math score is the response, and the verbal score is the explanatory variable. The least-squares regression line as reported by JMP is
math = 498.00765 + 0.3167866 verbal
• Hence, in the context of the SAT, if a student’s verbal score is 10 points higher, then his predicted math score is a little more than 3 points higher.
Example: SAT Scores (cont.)
• Suppose we want to use our regression line to predict a student’s math score given that his verbal score was 550.
• The predicted math score then would be
498.00765 + 0.3167866 (550) = 672
• Remember not to extrapolate when you make your predictions.
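In code form, the prediction is just a plug-in; a sketch using the coefficients JMP reported above:

```python
# Coefficients from the fitted line math = 498.00765 + 0.3167866 * verbal
a, b = 498.00765, 0.3167866

def predict_math(verbal_score):
    """Predicted math score; avoid using this far outside the range
    of verbal scores that produced the fit (extrapolation)."""
    return a + b * verbal_score

print(round(predict_math(550)))   # 672, as on the slide
```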
Example: SAT Scores (cont.)
• Now, suppose we instead wanted to use a regression line to predict verbal scores using math scores, and suppose that one student had a math score of 670.
• Naively, we would predict the verbal score by taking the inverse of our existing regression line, in which case we would predict a verbal score between 540 and 550.
• It is not quite as simple as this.
Example: SAT Scores (cont.)
• What we would need to do is re-fit the regression line using math scores as our explanatory variable and verbal scores as our response.
• The new regression line is (from JMP)
verbal = 408.37653 + 0.3901289 math
• So, our predicted verbal score given a math score of 670 would be
408.37653 + 0.3901289 (670) = 670
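A sketch with simulated (made-up) scores showing that the naive algebraic inverse and the properly re-fitted line generally give different predictions whenever the correlation is less than 1:

```python
import numpy as np

rng = np.random.default_rng(0)
verbal = rng.normal(500, 60, 200)                    # simulated scores
math_scores = 300 + 0.4 * verbal + rng.normal(0, 55, 200)

# Fit math on verbal, then naively invert the line algebraically
b_mv, a_mv = np.polyfit(verbal, math_scores, 1)
naive_verbal = (670 - a_mv) / b_mv

# Correct approach: re-fit with math as the explanatory variable
b_vm, a_vm = np.polyfit(math_scores, verbal, 1)
refit_verbal = a_vm + b_vm * 670

print(naive_verbal, refit_verbal)   # noticeably different predictions
```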
Correlation and Regression
• The square of the correlation, r², is the proportion of the variation in the data that is explained by our least-squares regression line.
• r² is always between 0 and 1.
• If r = ±0.7, then r² = 0.49, or about ½ of the variation.
• In our SAT data, r² = 0.1236 (it is the same for both regressions), so our regression line only captures about 12% of the response’s variation.
Understanding r²
• Let’s look at the SAT line (verbal as x, math as y) once again.
• The variance in our observed math values is (61.262875)² = 3753.14
• If the only variability in observed math scores was because of the linear fit, then math scores would lie exactly on our line.
• In other words, the math scores would be identical to our predicted math scores.
Understanding r² (cont.)
• After computing the predicted math scores, we find that the variance of our predicted values is (21.53698)² = 463.84
• If we divide the variance of our predicted values by the variance of our actual values, we have
463.84 / 3753.14 = 0.1236
• For least-squares regression, it is always true that r² gives the variance of the predicted responses as a fraction of the variance of the actual responses.
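A sketch (made-up data again) confirming this identity numerically:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x                          # predicted responses

r = np.corrcoef(x, y)[0, 1]
print(r ** 2)                              # r-squared
print(y_hat.var(ddof=1) / y.var(ddof=1))   # variance ratio: same value
```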
Diagnosis (How Good is our Model?)
• Although we are most interested in the overall pattern as described by the regression line, deviations from this pattern are also important.
• In the regression setting, the deviations we consider are the vertical distances from the actual points to the least-squares regression line.
• These distances represent the variation left in the response after fitting the line and are called residuals.
Residuals
• A residual is the difference between an observed value and the predicted value.
• Residual = observed y – predicted y
• The sum of the residuals of a regression line is always equal to 0.
• A residual plot is a scatterplot of regression residuals against the explanatory variable and is used to assess the fit of a regression line.
• In symbols: residual = y − ŷ.
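A sketch (made-up data) illustrating both the zero-sum property and the residual plot:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)    # observed y minus predicted y

print(residuals.sum())         # essentially 0, up to rounding error

# Residual plot: residuals against the explanatory variable
plt.scatter(x, residuals)
plt.axhline(0)
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```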
Simplified Patterns of Least-squares Residuals
[Figure: three simplified residual plots (residual vs. x), illustrating a linear relationship, a nonlinear relationship, and nonconstant prediction error.]
Outliers and Influential Observations
• An outlier is an observation that lies outside the overall pattern of the other observations.
• Points that are outliers in the y direction have large regression residuals, but that need not be the case for all outliers.
• An influential observation is one that would significantly change the regression line if removed. An outlier in the x direction is often influential for the least-squares regression line.
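A sketch with simulated data showing how a single point far out in the x direction pulls the least-squares line toward itself:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 20)
y = 2 + 0.5 * x + rng.normal(0, 0.5, 20)   # roughly linear pattern

# One influential point: far out in x, well below the pattern
x_all = np.append(x, 30.0)
y_all = np.append(y, 5.0)

print(np.polyfit(x, y, 1))          # slope near 0.5 without the point
print(np.polyfit(x_all, y_all, 1))  # slope pulled down by the outlier
```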
Example: Age at First Word and Gesell Score
• Does the age at which a child begins to talk predict a later score on a test of mental ability?
• The age in months at which the first word was spoken and the score on an ability test taken much later were recorded for 21 children.
• Fitting a line to all data reveals a negative linear relationship: early talkers tend to have higher test scores than those who start talking later.
Example: First Word and Gesell Score (cont.)
[Figure: Bivariate Fit of Score by Age — scatterplot of Gesell score against age at first word, with two linear fits; observations 18 and 19 are labeled.]
Example: First Word and Gesell Score (cont.)
• In the scatterplot, we see that observations 18 and 19 are unusual.
• Observation 18 is far out in the x direction; observation 19 is far out in the y direction.
• The red line is the regression line obtained by including observation 18; the green line is the one obtained by excluding it.
• Observation 18 pulls the line toward itself; hence it is influential.
Extreme Example: Random Data
[Figure: Bivariate Fit of Column 2 by Column 3 — scatterplot of randomly generated data, with two linear fits.]
Causation vs Association
• Example of causation: Increased consumption of alcohol causes a decrease in coordination and reflexes.
• Example of association: A high SAT score in senior year of high school is typically associated with a high GPA in freshman year of college.
• In general, an association between an explanatory variable x and a response y is not sufficient evidence to prove that x causes y.
Causation vs Association (cont.)
• Examples:
– High SAT math scores tend to be accompanied by high SAT verbal scores, but does this mean a high math score causes a high verbal score?
– Nations in which people have easy access to the Internet tend to have higher life expectancies. Does better access to the Internet cause people to live longer?
– The divorce rate tends to be positively correlated with the quantity of bananas imported. Does importing more bananas cause more people to get divorced?
Lurking Variables
• A lurking variable is one that is not among the explanatory or response variables in a study, but may influence the interpretation of relationships among those variables.
• In each of our three cases mentioned previously, there is likely a lurking variable at work.
• Give an example of one for each of these scenarios.
Lurking Variables (cont.)
• Lurking variables can create “nonsense correlations”: correlations that falsely suggest that changing one variable causes changes in the other.
• In addition, lurking variables can hide a true relationship between explanatory and response variables.
Causation
• In many cases, we wish to determine whether changes in an explanatory variable cause changes in the response variable.
• Even in the presence of strong association, it is difficult to decide whether this is due to a causal link.
• There are three main ways to explain an association between two variables.
Explaining Association
• The association between an explanatory and a response variable may be due to
– Causation, when there is a direct cause-and-effect link between these two variables.
– Common response, when there is a lurking variable whose changes cause both the explanatory variable and the response variable to change.
– Confounding, when there are multiple influences at work that are getting mixed up.
Explaining Association (cont.)
• Officially, two variables are considered confounded when their effects on a response variable cannot be distinguished from each other.
• Confounded variables can be either explanatory or lurking.
• Even a very strong association between two variables is not sufficient evidence that there is a cause-and-effect link between the variables.
• The best way to establish that an association is due to causation is with a carefully designed experiment – more on this later.
Transformations of Relationships
• In some situations, the values of quantitative variables are quite spread out, with some isolated points; the rest of the data is then very compressed, making the plot difficult to read.
• Situations like this suggest using a function of the original variable; for example, we might use a function that will shrink the distance between values. This is what we call transforming the data.
Transformations of Relationships (cont.)
• Transforming data changes the original scale of measurement. Our most common transformations are linear (˚F → ˚C, lb → kg).
• Linear transformations cannot straighten curved relationships, though; to do that, we need a nonlinear transformation (e.g., powers, exponentials, logarithms).
• The most common transformations of our explanatory variable x are power transformations of the form x^p.
Transformations of Relationships (cont.)
• We call a function f(x) monotone if its values move in only one direction as x increases.
• For positive values of x, power functions with positive p (and the logarithm function) are monotonic increasing and preserve the order of observations.
• For negative p, the power functions are monotonic decreasing and reverse the order of observations.
• If we believe that there is some mathematical model that describes our data, then transformations will be quite effective.
• For example, the exponential growth model y = a·b^x can be written as a linear model if we take the logarithm of y: log y = log a + x·log b.
• On the other hand, a power-law growth model y = a·x^p can be written as a linear model if we take the logarithm of both x and y: log y = log a + p·log x.
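A sketch with noiseless made-up data showing how taking logarithms turns both models into straight lines and recovers the parameters:

```python
import numpy as np

x = np.linspace(1, 10, 50)
y_exp = 3.0 * 1.5 ** x      # exponential growth: y = a * b^x
y_pow = 3.0 * x ** 2.0      # power law: y = a * x^p

# Exponential model: log y is linear in x
slope, intercept = np.polyfit(x, np.log(y_exp), 1)
print(np.exp(intercept), np.exp(slope))   # recovers a = 3.0, b = 1.5

# Power-law model: log y is linear in log x
p, log_a = np.polyfit(np.log(x), np.log(y_pow), 1)
print(np.exp(log_a), p)                   # recovers a = 3.0, p = 2.0
```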
Transformations of Relationships (cont.)
• In practice, our decision to make a transformation is governed by what we know about the data.
• This also holds true in terms of what type of transformation we decide to make.
• For example, animal populations and values of investments are often well described by an exponential growth model, though we do not always know the values of the parameters.