regression line regression - duke university


Post on 28-Apr-2022


Page 1: Regression line Regression - Duke University

9/16/09


FPP 10 kind of

Regression

Regression line
• The correlation coefficient is a nice numerical summary of two quantitative variables
• It indicates the direction and strength of the association
• But does it quantify the association?
• It would be of interest to do this for
  • Predictions
  • Understanding phenomena

Regression line
• Correlation measures the direction and strength of the straight-line (linear) relationship between two quantitative variables
• If a scatter plot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatter plot
• This line represents a mathematical model. Later we will make the mathematical model a statistical one.

Slope intercept form review


Regression line
• Slope intercept form notation: \( y = mx + b \)
• Regression form notation: \( \hat{y} = a + bx \)

Regression

Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945

Which line is best?
• Price = -90.2458 + 0.1598 SQFT (red)
• Price = -300 + 0.3 SQFT (blue)
• Price = 0 + 0.1 SQFT (green)

Which model to use
• Different people might draw different lines by eye on a scatterplot
• What are some ways we can determine which model (line) out of all the possible models (lines) is the “best” one?
• What are some ways that we can numerically rank the different models (i.e. the different lines)?
• This will come later in the course
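One common way to rank candidate lines numerically is by the sum of squared residuals, where smaller is better. The slides defer their own answer, so this criterion is an assumption here, and the five data points below are hypothetical toy values (they are not the data behind the scatterplot). A minimal sketch:

```python
# Hypothetical toy data (NOT the data behind the slide's scatterplot):
# home sizes in square feet and prices in thousands of dollars.
sqft = [1500, 2000, 2500, 3000, 3500]
price = [150, 230, 310, 390, 470]

# The three candidate lines from the slide, as (intercept, slope) pairs.
lines = {
    "red": (-90.2458, 0.1598),
    "blue": (-300.0, 0.3),
    "green": (0.0, 0.1),
}


def sum_squared_residuals(intercept, slope, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


scores = {
    name: sum_squared_residuals(a, b, sqft, price)
    for name, (a, b) in lines.items()
}
ranking = sorted(scores, key=scores.get)  # smallest score first

for name in ranking:
    print(f"{name}: {scores[name]:.1f}")
```

With these made-up points the red line scores best, but only because the toy data were chosen to lie near it; the ranking on the real data could differ.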


Slope interpretation
• The slope, b, of a regression line is almost always important for interpreting the data. The slope is the rate of change: the mean amount of change in y-hat when x increases by 1.

\( \hat{y} = a + bx \)

Slope interpretation

Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945

For every 1 sqft increase in the size of a home, the house price increases on average by $159.80 (Price is in thousands of dollars, so a slope of 0.1598 is $159.80 per sqft).

Intercept interpretation
• The intercept, a, of the regression line is the value of y-hat when x = 0. Although we need the value of the intercept to draw the line, it is statistically meaningful only when x can actually take values close to zero.

\( \hat{y} = a + bx \)

Intercept interpretation

Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945

If the sqft of a home were 0, the predicted house price would be -$90,245.80.

This doesn’t make much sense here because x (sqft) doesn’t take on values close to zero.


Prediction

Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945

For a 3500 sqft home we would predict the selling price to be
price = -90.2458 + 0.1598 × 3500 = 469.0542 (thousands of dollars), i.e. $469,054.20
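The arithmetic behind this prediction can be sketched in a few lines of Python. The coefficients come from the slide; the function name and the assumption that Price is recorded in thousands of dollars (which is what the slide's dollar figures imply) are mine.

```python
# Coefficients of the fitted line from the slide. Price is assumed to be in
# thousands of dollars, which is what the slide's dollar figures imply.
INTERCEPT = -90.2458
SLOPE = 0.1598


def predict_price_thousands(sqft):
    """Predicted price (thousands of dollars) for a home of the given size."""
    return INTERCEPT + SLOPE * sqft


pred = predict_price_thousands(3500)
print(f"Predicted price: ${pred * 1000:,.2f}")

# Slope interpretation: one extra square foot changes the prediction by the slope.
delta = predict_price_thousands(3501) - predict_price_thousands(3500)
print(f"Change per extra sqft: ${delta * 1000:,.2f}")
```

The second print illustrates the slope interpretation: the difference between predictions one square foot apart is exactly the slope, whatever the starting size.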

OECD data: Income and unemployment in the U.S.
• What is the relationship between households’ disposable income and the nation’s unemployment rate?
• Data from the U.S., 1980 to 1998 (data provided by the economics department at Duke)

Disposable income vs unemployment rates
• Disposable income and unemployment rates regression output


Does regression fit data well?
• A regression line is reasonable if
  • The association between the two variables is indeed linear
  • The points are randomly scattered around the line
• The income/unemployment rate data are well described by the regression line.
• Regression of AIDS rates per 1000 people on GNP per capita
  • The line is too low for GNP values near zero and too high for big GNP values.
  • We shouldn’t use the line for predictions

Birth and death rates in 74 countries

Changing the response variable
• When the regression line fits the data badly, sometimes you can transform variables to obtain a better fitting line.
• With monetary variables, this can typically be accomplished by taking logarithms.


• Regression of log(AIDS) on log(GNP)
• Much better fit
• Predict log(AIDS) from log(GNP). Exponentiate to estimate AIDS
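The fit-on-the-log-scale, then exponentiate recipe can be sketched as follows. The numbers are hypothetical values that follow a curved pattern exactly (y = 3·sqrt(x)); they are not the AIDS/GNP data, and the helper `least_squares` is an illustrative name, not something from the slides.

```python
import math


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical values that follow a curved (power-law) pattern, y = 3 * sqrt(x).
x = [100.0, 400.0, 1600.0, 6400.0, 25600.0]
y = [3.0 * math.sqrt(v) for v in x]

# Fit a straight line on the log scale.
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]
a, b = least_squares(log_x, log_y)


def predict(x_new):
    """Predict on the log scale, then exponentiate to return to original units."""
    return math.exp(a + b * math.log(x_new))


print(f"slope on log scale: {b:.3f}")
print(f"prediction at x = 900: {predict(900.0):.2f}")
```

Because the toy data follow a power law exactly, the log-log line recovers the exponent (0.5) as its slope; real data would scatter around the line.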

Facts about regression
• The distinction between explanatory and response variable is essential in regression
• If you have a slope computed using x as the explanatory and y as the response variable, you can’t “back solve” to get predictions of x given y
• If you want to predict x given y, then you must find the intercept and slope with y being the explanatory variable and x being the response
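A small numerical check makes the point concrete. The five data points are hypothetical, and `least_squares` is an illustrative helper; the sketch compares "back solving" the y-on-x line with properly regressing x on y.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

a_yx, b_yx = least_squares(x, y)  # line for predicting y from x
a_xy, b_xy = least_squares(y, x)  # line for predicting x from y

y_obs = 5
back_solved = (y_obs - a_yx) / b_yx  # inverting the y-on-x line
regressed = a_xy + b_xy * y_obs      # using the x-on-y line

print(f"back-solved x: {back_solved:.2f}")
print(f"regression prediction of x: {regressed:.2f}")
```

The two answers differ (5.5 versus 4.6 for these points) and would agree only if the points fell exactly on a line, that is, if r were +1 or -1.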

Facts about regression
• There is a close relationship between the correlation coefficient and the slope of a regression line
  • They have the same sign
  • They are proportional to each other
• The intercept has no such direct relationship with the correlation coefficient, but here are the formulas

\[ b = r \, \frac{SD_y}{SD_x} \]

\[ a = \bar{y} - b\bar{x} \]
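These two formulas need only five summary numbers: the two means, the two SDs, and r. A minimal sketch on hypothetical data follows; it uses the population SD (`statistics.pstdev`), which is an assumption on my part, although the slope comes out the same with the sample SD as long as one convention is used throughout.

```python
from statistics import mean, pstdev

# Hypothetical toy data.
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]

mean_x, mean_y = mean(x), mean(y)
sd_x, sd_y = pstdev(x), pstdev(y)

# Correlation: the average product of the values in standard units.
r = mean([((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y) for xi, yi in zip(x, y)])

b = r * sd_y / sd_x        # slope
a = mean_y - b * mean_x    # intercept

print(f"r = {r:.3f}, slope b = {b:.3f}, intercept a = {a:.3f}")
```

For these points r is 0.8, and since SD_y is half of SD_x the slope is 0.4; the intercept then follows from the requirement that the line pass through the point of averages.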

Warnings about regression
• Predicting y at values of x beyond the range of x in the data is called extrapolation
• This is risky, because we have no evidence to believe that the association between x and y remains linear for unseen x values
• Extrapolated predictions can be absolutely wrong


Extrapolation
• Diamond price and carat
• The explanatory variable is measured in carats and the response variable in dollars
• Predict the price of the Hope Diamond

\( \hat{y} = 48.88 + 2430.77(45.52) = 110{,}697.53 \), i.e. $110,697.53
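One way to make the extrapolation risk visible in code is to flag any prediction whose x value lies outside the range of the fitting data. The coefficients below are from the slide; the observed carat range is a placeholder, since the slides do not state it, and the function name is hypothetical.

```python
INTERCEPT = 48.88   # dollars, from the slide
SLOPE = 2430.77     # dollars per carat, from the slide

# Placeholder range of carat sizes in the fitting data; the slides do not
# state the actual range, so these two values are assumptions.
OBSERVED_MIN_CARAT = 0.1
OBSERVED_MAX_CARAT = 1.0


def predict_price(carat):
    """Return (predicted price in dollars, True if this is an extrapolation)."""
    is_extrapolation = not (OBSERVED_MIN_CARAT <= carat <= OBSERVED_MAX_CARAT)
    return INTERCEPT + SLOPE * carat, is_extrapolation


price, extrapolated = predict_price(45.52)
note = " (extrapolation: outside the observed carat range)" if extrapolated else ""
print(f"${price:,.2f}{note}")
```

The function still returns a number for 45.52 carats; the flag is a reminder that the number rests on an untested assumption of linearity.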

Extrapolation
• The relationship between diamond carat and price doesn’t remain linear after a carat size of about 0.4

Extrapolation
• Green line is the linear fit with only diamonds less than 0.4 carats
• Blue line is the linear fit with all carat sizes
• Red curve is a quadratic fit

Lurking variable
• A variable not being considered could be driving the relationship
• In practice this is a difficult issue to tackle, especially when everything seems OK


Influential point
• An outlier in either the X or Y direction which, if removed, would markedly change the value of the slope and y-intercept.
• applet
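In place of the applet, the effect of an influential point can be sketched numerically by fitting the line with and without one extreme observation. The data are hypothetical: five points lying exactly on y = x, plus one point far out in the x direction.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: five points on the line y = x, plus one extreme point.
x = [1, 2, 3, 4, 5, 20]
y = [1, 2, 3, 4, 5, 2]

intercept_with, slope_with = least_squares(x, y)
intercept_without, slope_without = least_squares(x[:-1], y[:-1])

print(f"with the point:    y-hat = {intercept_with:.3f} + {slope_with:.3f} x")
print(f"without the point: y-hat = {intercept_without:.3f} + {slope_without:.3f} x")
```

Dropping the single extreme point moves the slope from slightly negative to exactly 1, which is the kind of marked change that defines an influential point.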

Causality
• On its own, regression only quantifies an association between x and y
• It does not prove causality
• Under a carefully designed experiment (or in some cases observational studies) regression can be used to show causality.