Appendix [MTE 3105]



APPENDIX

http://en.wikipedia.org/wiki/Linear_regression

Linear regression

In statistics, linear regression refers to any approach to modeling the relationship between one or more variables denoted y and one or more variables denoted X, such that the model depends linearly on the unknown parameters to be estimated from the data. Such a model is called a "linear model." Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile, of the conditional distribution of y given X is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters, and because the statistical properties of the resulting estimators are easier to determine.

Linear regression has many practical uses. Most applications of linear regression fall into one of the following two broad categories:

If the goal is prediction, or forecasting, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.

Given a variable y and a number of variables X1, ..., Xp that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the Xj, to assess which Xj may have no relationship with y at all, and to identify which subsets of the Xj contain redundant information about y, so that once one of them is known, the others are no longer informative.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm, or by minimizing a penalized version of the least squares loss function as in ridge regression. Conversely, the least squares approach can be used to fit models that are not linear models. Thus, while the terms "least squares" and "linear model" are closely linked, they are not synonymous.
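As an aside not in the original text, the contrast between plain and penalized least squares is easy to see in code. The following NumPy sketch fits both on invented data; the penalty weight lam and all values are arbitrary, chosen only for illustration.

```python
import numpy as np

# Invented design matrix (n samples, p regressors) and response.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

# Ordinary least squares: minimize ||y - Xb||^2.
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge regression: minimize ||y - Xb||^2 + lam * ||b||^2,
# which has the closed form b = (X'X + lam*I)^(-1) X'y.
lam = 1.0  # arbitrary penalty weight
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("OLS:  ", b_ols)
print("Ridge:", b_ridge)  # coefficients shrunk toward zero relative to OLS
```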

Introduction to linear regression

Given a data set of n statistical units, a linear regression model assumes that the relationship between the dependent variable yi and the p-vector of regressors xi is approximately linear. This approximate relationship is modeled through a so-called "disturbance term" εi, an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors. Thus the model takes the form

$$ y_i = x_i^\top \beta + \varepsilon_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n, $$

where $x_i^\top \beta$ is the inner product between vectors $x_i$ and $\beta$.

Often these n equations are stacked together and written in vector form as

$$ y = X\beta + \varepsilon, $$

where

$$ y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}. $$
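To make the stacked notation concrete, here is a small NumPy sketch (not part of the original text) that simulates y = Xβ + ε and recovers β via the normal equations; the sample size, true coefficients, and noise level are all invented.

```python
import numpy as np

# Simulate the stacked system y = X*beta + eps.
rng = np.random.default_rng(1)
n = 100
true_beta = np.array([2.0, -1.0])                         # intercept and slope
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # rows are x_i'
eps = rng.normal(scale=1.0, size=n)                       # disturbance term eps_i
y = X @ true_beta + eps

# Least-squares estimate via the normal equations (X'X) beta_hat = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [2.0, -1.0]
```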

Applications of linear regression

Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.

Trend line

For trend lines as used in technical analysis, see Trend lines (technical analysis)

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over the period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.
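As a quick illustration (the series below is invented, not from the original), a straight trend line is just a degree-1 least-squares polynomial fit; raising the degree allows for curvature, as noted above.

```python
import numpy as np

# Hypothetical yearly series, e.g. a price index (values invented).
years = np.arange(2000, 2015)
values = np.array([3.1, 3.4, 3.3, 3.9, 4.2, 4.1, 4.6, 5.0,
                   4.8, 5.3, 5.6, 5.5, 6.1, 6.0, 6.4])

# Straight trend line: degree-1 least-squares fit.
slope, intercept = np.polyfit(years, values, deg=1)
print(f"trend: {slope:+.3f} units per year")

# A higher-degree polynomial captures curvature if the trend bends.
quad_coeffs = np.polyfit(years, values, deg=2)
```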

Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.

Epidemiology

As one example, early evidence relating tobacco smoking to mortality and morbidity came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are often able to generate more compelling evidence of causal relationships than correlational analysis using linear regression. When controlled experiments are not feasible, variants of regression analysis such as instrumental variables and other methods may be used to attempt to estimate causal relationships from observational data.

Finance

The capital asset pricing model uses linear regression as well as the concept of Beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the Beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.

Regression may not be the appropriate way to estimate beta in finance, given that it is supposed to provide the volatility of an investment relative to the volatility of the market as a whole; this would require that both variables be treated symmetrically when estimating the slope. Regression, however, treats all variability as being in the investment-returns variable, i.e. it only considers residuals in the dependent variable.[19]
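For illustration only (the return series below are invented), beta in this sense is simply the least-squares slope of asset returns on market returns, which equals cov(asset, market) / var(market):

```python
import numpy as np

# Invented monthly returns for the market and for one asset.
rng = np.random.default_rng(2)
market = rng.normal(0.01, 0.04, size=60)
asset = 0.002 + 1.3 * market + rng.normal(0.0, 0.02, size=60)  # true beta = 1.3

# Beta is the regression slope, i.e. cov(asset, market) / var(market).
beta = np.cov(asset, market)[0, 1] / np.var(market, ddof=1)
print(f"estimated beta: {beta:.2f}")
```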

Environmental science

Linear regression finds application across a wide range of environmental science problems.

Software tools

Main article: List of statistical packages


In Microsoft Excel, the LINEST spreadsheet function performs linear regression analysis with optional calculation of confidence intervals.

The free open source software package "R" offers several programs for linear regression and related methods.

The Unscrambler can perform multiple linear regression (MLR), partial least squares regression (PLS-R), and 3-way PLS regression.

http://www.graphpad.com/curvefit/linear_regression.htm

Linear regression

Introduction to linear regression

Linear regression analyzes the relationship between two variables, X and Y. For each subject (or experimental unit), you know both X and Y and you want to find the best straight line through the data. In some situations, the slope and/or intercept have a scientific meaning. In other cases, you use the linear regression line as a standard curve to find new values of X from Y, or Y from X.

The term "regression", like many statistical terms, is used in statistics quite differently than it is used in other contexts. The method was first used to examine the relationship between the heights of fathers and sons. The two were related, of course, but the slope is less than 1.0. A tall father tended to have sons shorter than himself; a short father tended to have sons taller than himself. The height of sons regressed to the mean. The term "regression" is now used for many sorts of curve fitting.


Prism determines and graphs the best-fit linear regression line, optionally including a 95% confidence interval or 95% prediction interval bands. You may also force the line through a particular point (usually the origin), calculate residuals, calculate a runs test, or compare the slopes and intercepts of two or more regression lines.

In general, the goal of linear regression is to find the line that best predicts Y from X. Linear regression does this by finding the line that minimizes the sum of the squares of the vertical distances of the points from the line.

Note that linear regression does not test whether your data are linear (except via the runs test). It assumes that your data are linear, and finds the slope and intercept that make a straight line best fit your data.

How linear regression works

Minimizing sum-of-squares

The goal of linear regression is to adjust the values of slope and intercept to find the line that best predicts Y from X. More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line. Why minimize the sum of the squares of the distances?  Why not simply minimize the sum of the actual distances?

If the random scatter follows a Gaussian distribution, it is far more likely to have two medium size deviations (say 5 units each) than to have one small deviation (1 unit) and one large (9 units). A procedure that minimized the sum of the absolute value of the distances would have no preference between a line that was 5 units away from two points and one that was 1 unit away from one point and 9 units from another. The sum of the distances (more precisely, the sum of the absolute value of the distances) is 10 units in each case. A procedure that minimizes the sum of the squares of the distances prefers to be 5 units away from two points (sum-of-squares = 50) rather than 1 unit away from one point and 9 units away from another (sum-of-squares = 82). If the scatter is Gaussian (or nearly so), the line determined by minimizing the sum-of-squares is most likely to be correct.
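The arithmetic in that argument is easy to verify. The short sketch below (not in the original; the spread sigma is an arbitrary choice that does not affect the comparison) checks both the Gaussian likelihoods and the sums of squares:

```python
from math import exp, pi, sqrt

def normal_pdf(x, sigma=4.0):
    """Zero-mean Gaussian density at deviation x (sigma is arbitrary here)."""
    return exp(-x * x / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi))

# Joint likelihood of two 5-unit deviations vs. one 1-unit and one 9-unit.
print(normal_pdf(5) * normal_pdf(5) > normal_pdf(1) * normal_pdf(9))  # True
# The corresponding sums of squares: 50 vs. 82, as in the text.
print(5**2 + 5**2, 1**2 + 9**2)
```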

The calculations are shown in every statistics book, and are entirely standard.

Slope and intercept

Prism reports the best-fit values of the slope and intercept, along with their standard errors and confidence intervals.

The slope quantifies the steepness of the line. It equals the change in Y for each unit change in X. It is expressed in the units of the Y-axis divided by the units of the X-axis. If the slope is positive, Y increases as X increases. If the slope is negative, Y decreases as X increases.

The Y intercept is the Y value of the line when X equals zero. It defines the elevation of the line.

The standard error values of the slope and intercept can be hard to interpret, but their main purpose is to compute the 95% confidence intervals. If you accept the assumptions of linear regression, there is a 95% chance that the 95% confidence interval of the slope contains the true value of the slope,  and  that the 95% confidence interval for the intercept contains the true value of the intercept.
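Outside of Prism, the same quantities can be computed directly. A minimal SciPy sketch on invented data (the intercept_stderr attribute requires SciPy 1.6 or later):

```python
import numpy as np
from scipy import stats

# Invented (x, y) data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

res = stats.linregress(x, y)
t_crit = stats.t.ppf(0.975, df=len(x) - 2)  # two-tailed 95%, n - 2 df

print(f"slope:     {res.slope:.3f} +/- {t_crit * res.stderr:.3f} (95% CI)")
print(f"intercept: {res.intercept:.3f} +/- {t_crit * res.intercept_stderr:.3f} (95% CI)")
```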

r2, a measure of goodness-of-fit of linear regression


The value r2 is a fraction between 0.0 and 1.0, and has no units. An r2 value of  0.0 means that knowing X does not help you predict Y. There is no linear relationship between X and Y, and the best-fit line is a horizontal line going through the mean of all Y values.  When r2 equals 1.0, all points lie exactly on a straight line with no scatter. Knowing X lets you predict Y perfectly.

The following worked example (a two-panel figure in the original) demonstrates how Prism computes r2.

The left panel shows the best-fit linear regression line. This line minimizes the sum-of-squares of the vertical distances of the points from the line. Those vertical distances are also shown on the left panel of the figure. In this example, the sum of squares of those distances (SSreg) equals 0.86. Its units are the units of the Y-axis squared. To use this value as a measure of goodness-of-fit, you must compare it to something.

The right half of the figure shows the null hypothesis: a horizontal line through the mean of all the Y values. Goodness-of-fit of this model (SStot) is also calculated as the sum of squares of the vertical distances of the points from the line, 4.907 in this example. The ratio of the two sum-of-squares values compares the regression model with the null hypothesis model, via the equation r2 = 1 − SSreg/SStot. In this example, r2 = 1 − 0.86/4.907 = 0.8248. The regression model fits the data much better than the null hypothesis, so SSreg is much smaller than SStot, and r2 is near 1.0. If the regression model were not much better than the null hypothesis, r2 would be near zero.

You can think of r2 as the fraction of the total variance of Y that is "explained" by variation in X. The value of r2 (unlike the regression line itself) would be the same if X and Y were swapped. So r2 is also the fraction of the variance in X that is "explained" by variation in Y. In other words, r2 is the fraction of the variation that is shared between X and Y.

In this example, about 82% of the total variance in Y is "explained" by the linear regression model. That leaves the rest of the variance (about 18% of the total) as variability of the data from the model (SSreg).
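The same computation, as a short NumPy sketch on invented data (note that this page uses SSreg for the scatter around the regression line, where many texts write SSres):

```python
import numpy as np

# Invented (x, y) data; any arrays of equal length work.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

slope, intercept = np.polyfit(x, y, deg=1)
ss_reg = np.sum((y - (slope * x + intercept)) ** 2)  # scatter around the line
ss_tot = np.sum((y - y.mean()) ** 2)                 # scatter around the mean

print(f"r2 = {1.0 - ss_reg / ss_tot:.4f}")
```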


The dashed lines that demarcate the confidence interval are curved. This does not mean that the confidence interval includes the possibility of curves as well as straight lines. Rather, the curved lines are the boundaries of all possible straight lines. The figure below shows four possible linear regression lines (solid) that lie within the confidence interval (dashed).

Given the assumptions of linear regression, you can be 95% confident that the two curved confidence bands enclose the true best-fit linear regression line, leaving a 5% chance that the true line is outside those boundaries.

Many data points will be outside the 95% confidence interval boundary. The confidence interval is 95% sure to contain the best-fit regression line. This is not the same as saying it will contain 95% of the data points.

Prism can also plot the 95% prediction interval. The prediction bands are further from the best-fit line than the confidence bands, a lot further if you have many data points. The 95% prediction interval is the area in which you expect 95% of all data points to fall. In contrast, the 95% confidence interval is the area that has a 95% chance of containing the true regression line. This graph shows both prediction and confidence intervals (the curves defining the prediction intervals are further from the regression line).
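Under the usual textbook formulas (the helper below is a sketch under that assumption, not Prism's code; the function name and interface are mine), both bands come from the same residual standard error, with the prediction band carrying an extra term for the scatter of individual points:

```python
import numpy as np
from scipy import stats

def bands(x, y, x_grid, level=0.95):
    """Confidence and prediction band half-widths for simple linear regression."""
    n = len(x)
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (slope * x + intercept)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # residual standard error
    sxx = np.sum((x - x.mean()) ** 2)
    t = stats.t.ppf(0.5 + level / 2, df=n - 2)

    y_grid = slope * x_grid + intercept
    # Confidence band: uncertainty in the fitted line itself.
    conf = t * s * np.sqrt(1.0 / n + (x_grid - x.mean()) ** 2 / sxx)
    # Prediction band: line uncertainty plus the scatter of single points.
    pred = t * s * np.sqrt(1.0 + 1.0 / n + (x_grid - x.mean()) ** 2 / sxx)
    return y_grid, conf, pred
```

Because of the extra 1.0 under the square root, pred always exceeds conf, matching the description above.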


Comparing slopes and intercepts

Prism can test whether the slopes and intercepts of two or more data sets are significantly different. It compares linear regression lines using the method explained in Chapter 18 of J Zar, Biostatistical Analysis, 2nd edition, Prentice-Hall, 1984.

Prism compares slopes first. It calculates a P value (two-tailed) testing the null hypothesis that the slopes are all identical (the lines are parallel). The P value answers this question: if the slopes really were identical, what is the chance that randomly selected data points would have slopes as different (or more different) than those observed? If the P value is less than 0.05, Prism concludes that the lines are significantly different. In that case, there is no point in comparing the intercepts. The intersection point of two lines y = a1 + b1x and y = a2 + b2x is:

$$ x = \frac{a_2 - a_1}{b_1 - b_2}, \qquad y = a_1 + b_1 x. $$

If the P value for comparing slopes is greater than 0.05, Prism concludes that the slopes are not significantly different and  calculates a single slope for all the lines. Now the question is whether the lines are parallel or identical. Prism calculates a second P value testing the null hypothesis that the lines are identical. If this P value is low, conclude that the lines are not identical (they are distinct but parallel). If this second P value is high, there is no compelling evidence that the lines are different.

This method is equivalent to an Analysis of Covariance (ANCOVA), although ANCOVA can be extended to more complicated situations.
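For the two-line case, the slope comparison reduces to a pooled t test. Here is a sketch under that reading of the method (the function name is mine, and this is not Prism's code):

```python
import numpy as np
from scipy import stats

def compare_slopes(x1, y1, x2, y2):
    """Two-tailed t test of H0: the two regression slopes are identical."""
    def fit(x, y):
        slope, intercept = np.polyfit(x, y, deg=1)
        ss_res = np.sum((y - (slope * x + intercept)) ** 2)
        sxx = np.sum((x - x.mean()) ** 2)
        return slope, ss_res, sxx

    b1, ss1, sxx1 = fit(x1, y1)
    b2, ss2, sxx2 = fit(x2, y2)
    df = len(x1) + len(x2) - 4
    s2 = (ss1 + ss2) / df                          # pooled residual variance
    t = (b1 - b2) / np.sqrt(s2 * (1 / sxx1 + 1 / sxx2))
    return t, 2 * stats.t.sf(abs(t), df)           # t statistic and P value
```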

Standard Curve

To read unknown values from a standard curve, you must enter unpaired X or Y values below the X and Y values for the standard curve.

Depending on which option(s) you selected in the Parameters dialog, Prism calculates Y values for all the unpaired X values and/or X values for all unpaired Y values and places these on new output views.
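The underlying arithmetic is a simple inversion of the fitted line. A NumPy sketch with an invented standard curve (this is not Prism's code):

```python
import numpy as np

# Invented standard curve: known concentrations (X) and responses (Y).
x_std = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y_std = np.array([0.05, 0.22, 0.41, 0.58, 0.80, 0.99])
slope, intercept = np.polyfit(x_std, y_std, deg=1)

# Unknowns measured only on the Y side: invert the line to recover X.
y_unknown = np.array([0.30, 0.65])
print((y_unknown - intercept) / slope)
```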

How to think about the results of linear regression

Your approach to linear regression will depend on your goals.

If your goal is to analyze a standard curve, you won't be very interested in most of the results. Just make sure that r2 is high and that the line goes near the points. Then go straight to the standard curve results.


In many situations, you will be most interested in the best-fit values for slope and intercept. Don't just look at the best-fit values; also look at the 95% confidence interval of the slope and intercept. If the intervals are too wide, repeat the experiment with more data.

If you forced the line through a particular point, look carefully at the graph of the data and best-fit line to make sure you picked an appropriate point.

Consider whether a linear model is appropriate for your data. Do the data seem linear? Is the P value for the runs test high? Are the residuals random? If you answered no to any of those questions, consider whether it makes sense to use nonlinear regression instead.

http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.

Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for example, higher SAT scores do not cause higher college grades), but that there is some significant association between the two variables. A scatterplot can be a helpful tool in determining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables (i.e., the scatterplot does not indicate any increasing or decreasing trends), then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed data for the two variables.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0).

Least-Squares Regression

The most common method for fitting a regression line is the method of least-squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). Because the deviations are first squared, then summed, there are no cancellations between positive and negative values.
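For reference, minimizing those squared vertical deviations gives the standard closed-form estimates:

$$ b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}. $$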

Example

The dataset "Televisions, Physicians, and Life Expectancy" contains, among other variables, the number of people per television set and the number of people per physician for 40 countries. Since both variables probably reflect the level of wealth in each country, it is reasonable to assume that there is some positive association between them. After removing 8 countries with missing values from the dataset, the remaining 32 countries have a correlation coefficient of 0.852 for number of people per television set and number of people per physician. The r² value is 0.726 (the square of the correlation coefficient), indicating that 72.6% of the variation in one variable may be explained by the other. (Note: see correlation for more detail.) Suppose we choose to consider number of people per television set as the explanatory variable, and number of people per physician as the dependent variable. Using the MINITAB "REGRESS" command gives the following results:

The regression equation is People.Phys. = 1019 + 56.2 People.Tel.


To view the fit of the model to the observed data, one may plot the computed regression line over the actual data points to evaluate the results. For this example, such a plot shows number of individuals per television set (the explanatory variable) on the x-axis and number of individuals per physician (the dependent variable) on the y-axis. While most of the data points are clustered towards the lower left corner of the plot (indicating relatively few individuals per television set and per physician), there are a few points which lie far away from the main cluster of the data. These points are known as outliers, and depending on their location may have a major impact on the regression line (see below).

Data source: The World Almanac and Book of Facts 1993 (1993), New York: Pharos Books. Dataset available through the JSE Dataset Archive.

Outliers and Influential Observations

After a regression line has been computed for a group of data, a point which lies far from the line (and thus has a large residual value) is known as an outlier. Such points may represent erroneous data, or may indicate a poorly fitting regression line. If a point lies far from the other data in the horizontal direction, it is known as an influential observation. The reason for this distinction is that these points may have a significant impact on the slope of the regression line. Notice, in the above example, the effect of removing the observation in the upper right corner of the plot:

With this influential observation removed, the regression equation is now

People.Phys = 1650 + 21.3 People.Tel. The correlation between the two variables has dropped to 0.427, which reduces the r² value to 0.182. With this influential observation removed, less than 20% of the variation in number of people per physician may be explained by the number of people per television. Influential observations are also visible in the new model, and their impact should also be investigated.

Residuals


Once a regression model has been fit to a group of data, examination of the residuals (the deviations from the fitted line to the observed values) allows the modeler to investigate the validity of his or her assumption that a linear relationship exists. Plotting the residuals on the y-axis against the explanatory variable on the x-axis reveals any possible non-linear relationship among the variables, or might alert the modeler to investigate lurking variables. In our example, the residual plot amplifies the presence of outliers.
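A minimal residual plot takes only a few lines; the data below are invented and matplotlib is assumed to be available:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented data; substitute any fitted (x, y) pair.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])
slope, intercept = np.polyfit(x, y, deg=1)

residuals = y - (slope * x + intercept)   # observed minus fitted

# Residuals vs. the explanatory variable: curvature or fanning-out
# here suggests non-linearity or a lurking variable.
plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("x (explanatory variable)")
plt.ylabel("residual")
plt.show()
```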

Lurking Variables

If non-linear trends are visible in the relationship between an explanatory and dependent variable, there may be other influential variables to consider. A lurking variable exists when the relationship between two variables is significantly affected by the presence of a third variable which has not been included in the modeling effort. Since such a variable might be a factor of time (for example, the effect of political or economic cycles), a time series plot of the data is often a useful tool in identifying the presence of lurking variables.