3.2 Least-Squares Regression (teachers.dadeschools.net/rvancol/statsnotetakingguides/3-2least...)


Upload: phungkhanh

Post on 24-Jul-2018


3.2 Least-Squares Regression

Linear (straight-line) relationships between two quantitative variables are pretty common and easy to understand. Correlation measures the direction and strength of these relationships. When a scatterplot shows a linear relationship, we’d like to summarize the overall pattern by drawing a line on the scatterplot. A regression line summarizes the relationship between two variables, but only in a specific setting: when one of the variables helps explain or predict the other. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.

Regression line - A regression line is a line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

Example – Does Fidgeting Keep You Slim?
Regression lines as models

Some people don’t gain weight even when they overeat. Perhaps fidgeting and other “nonexercise activity” (NEA) explains why: some people may spontaneously increase nonexercise activity when fed more. Researchers deliberately overfed 16 healthy young adults for 8 weeks. They measured fat gain (in kilograms) as the response variable and change in energy use (in calories) from activity other than deliberate exercise (fidgeting, daily living, and the like) as the explanatory variable. Here are the data:

Do people with larger increases in NEA tend to gain less fat? The figure below is a scatterplot of these data. The plot shows a moderately strong, negative linear association between NEA change and fat gain with no outliers. The correlation is r = −0.7786. The line on the plot is a regression line for predicting fat gain from change in NEA.

3.2.1 Interpreting a Regression Line

To “regress” means to go backward. Why are statistical methods for predicting a response from an explanatory variable called “regression”? Sir Francis Galton (1822–1911) looked at data on the heights of children versus the heights of their parents. He found that the taller-than-average parents tended to have children who were also taller than average but not as tall as their parents. Galton called this fact “regression toward the mean,” and the name came to be applied to the statistical method.

A regression line is a model for the data, much like density curves. The equation of a regression line gives a compact mathematical description of what this model tells us about the relationship between the response variable y and the explanatory variable x.

Regression line - Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). A regression line relating y to x has an equation of the form:

$$\hat{y} = a + bx$$

In this equation,

• $\hat{y}$ (read “y hat”) is the predicted value of the response variable y for a given value of the explanatory variable x.
• b is the slope, the amount by which y is predicted to change when x increases by one unit.
• a is the y intercept, the predicted value of y when x = 0.

Although you are probably used to the form y = mx + b for the equation of a line from algebra, statisticians have adopted a different form for the equation of a regression line. Some use $\hat{y} = b_0 + b_1 x$. We prefer $\hat{y} = a + bx$ for two reasons: (1) it’s simpler, and (2) your calculator uses this form. Don’t get so caught up in the symbols that you lose sight of what they mean! The coefficient of x is always the slope, no matter what symbol is used.

Example – Does Fidgeting Keep You Slim?
Interpreting the slope and y intercept

The regression line for the figure to the right is shown below:

[equation not preserved in source]

Identify the slope and y intercept of the regression line. Interpret each value in context.

The slope of a regression line is an important numerical description of the relationship between the two variables. Although we need the value of the y intercept to draw the line, it is statistically meaningful only when the explanatory variable can actually take values close to zero, as in this setting.

Does a small slope mean that there’s no relationship? For the NEA and fat gain regression line, the slope b = −0.00344 is a small number. This does not mean that change in NEA has little effect on fat gain. The size of the slope depends on the units in which we measure the two variables. In this setting, the slope is the predicted change in fat gain in kilograms when NEA increases by 1 calorie. There are 1000 grams in a kilogram. If we measured fat gain in grams, the slope would be 1000 times larger in size, b = −3.44. You can’t say how important a relationship is by looking at the size of the slope of the regression line.
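The unit dependence described above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the slope quoted in the text; the variable names and the "per 100 calories" rescaling are illustrative choices, not part of the original guide.

```python
# Slope quoted in the text: predicted change in fat gain (kg)
# for each 1-calorie increase in NEA.
slope_kg_per_cal = -0.00344

GRAMS_PER_KG = 1000

# Same relationship with fat gain measured in grams:
# the slope is 1000 times larger in size, with the same sign.
slope_g_per_cal = slope_kg_per_cal * GRAMS_PER_KG

# Same relationship with NEA measured in units of 100 calories
# (an illustrative rescaling, not used in the text).
slope_kg_per_100cal = slope_kg_per_cal * 100

print(f"{slope_kg_per_cal:.5f} kg per calorie")
print(f"{slope_g_per_cal:.2f} g per calorie")
print(f"{slope_kg_per_100cal:.3f} kg per 100 calories")
```

All three numbers describe the same line, which is why the size of the slope alone says nothing about how strong or important the relationship is.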

3.2.2 Prediction

Example – Does Fidgeting Keep You Slim?
Predicting with a regression line

For the NEA and fat gain data, the equation of the regression line is:

[equation not preserved in source]

If a person’s NEA increases by 400 calories when she overeats, substitute x = 400 in the equation. The predicted fat gain is:

[calculation not preserved in source]

The accuracy of predictions from a regression line depends on how much the data scatter about the line. In this case, fat gains for similar changes in NEA show a spread of 1 or 2 kilograms. The regression line summarizes the pattern but gives only roughly accurate predictions.

Can we predict the fat gain for someone whose NEA increases by 1500 calories when she overeats? We can certainly substitute 1500 calories into the equation of the line. The prediction is:

[calculation not preserved in source]

Extrapolation - Extrapolation is the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate.

Few relationships are linear for all values of the explanatory variable. Don’t make predictions using values of x that are much larger or much smaller than those that actually appear in your data.
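One way to guard against accidental extrapolation is to carry the observed range of x along with the line and flag any prediction made outside it. The sketch below is only an illustration: the function name `predict`, the line ŷ = 10 + 2x, and the range 0 to 5 are all hypothetical and are not taken from the NEA study. It flags any x outside the observed range, whereas the definition above speaks of values far outside it, so how seriously to take a flag remains a judgment call.

```python
def predict(x, intercept, slope, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the line y_hat = intercept + slope * x.

    x_min and x_max are the smallest and largest x-values in the data
    that were used to fit the line.
    """
    y_hat = intercept + slope * x
    is_extrapolation = x < x_min or x > x_max
    return y_hat, is_extrapolation


# Hypothetical line and data range (not from the NEA study).
A, B = 10.0, 2.0
X_MIN, X_MAX = 0.0, 5.0

for x in (3.0, 40.0):
    y_hat, flagged = predict(x, A, B, X_MIN, X_MAX)
    note = "extrapolation, treat with caution" if flagged else "within the observed x range"
    print(f"x = {x}: predicted y = {y_hat} ({note})")
```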

CHECK YOUR UNDERSTANDING

Some data were collected on the weight of a male white laboratory rat for the first 25 weeks after its birth. A scatterplot of the weight (in grams) and time since birth (in weeks) shows a fairly strong, positive linear relationship. The linear regression equation [equation not preserved in source] models the data fairly well.

1. What is the slope of the regression line? Explain what it means in context.

2. What’s the y intercept? Explain what it means in context.

3. Predict the rat’s weight after 16 weeks. Show your work.

4. Should you use this line to predict the rat’s weight at age 2 years? Use the equation to make the prediction and think about the reasonableness of the result. (There are 454 grams in a pound.)

3.2.3 Residuals and the Least-Squares Regression Line

In most cases, no line will pass exactly through all the points in a scatterplot. Because we use the line to predict y from x, the prediction errors we make are errors in y, the vertical direction in the scatterplot. A good regression line makes the vertical distances of the points from the line as small as possible.

Look at the following example describing the relationship between body weight and backpack weight for a group of 8 hikers.

The figure below shows a scatterplot of the data with a regression line added. The prediction errors are marked as bold segments in the graph. These vertical deviations represent “leftover” variation in the response variable after fitting the regression line. For that reason, they are called residuals.

Residual - A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is:

$$\text{residual} = \text{observed } y - \text{predicted } y = y - \hat{y}$$

Example – Back to the Backpackers
Finding a residual

Find and interpret the residual for the hiker who weighed 187 pounds.

AP EXAM TIP There’s no firm rule for how many decimal places to show for answers on the AP exam. Our advice: Give your answer correct to two or three nonzero decimal places. Exception: If you’re using one of the tables in the back of the book, give the value shown in the table.

The line shown in the figure above makes the residuals for the 8 hikers “as small as possible.” But what does that mean? Maybe this line minimizes the sum of the residuals. Actually, if we add up the prediction errors for all 8 hikers, the positive and negative residuals cancel out. That’s the same issue we faced when we tried to measure deviation around the mean. We’ll solve the current problem in much the same way: by squaring the residuals. The regression line we want is the one that minimizes the sum of the squared residuals. That’s what the line shown in the above figure does for the hiker data, which is why we call it the least-squares regression line.

Least-squares regression line - The least-squares regression line of y on x is the line that makes the sum of the squared residuals as small as possible.

The figure at the right gives a geometric interpretation of the least-squares idea for the hiker data. The least-squares regression line shown minimizes the sum of the squared prediction errors, 30.90. No other regression line would give a smaller sum of squared residuals.
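The "no other line does better" claim can be checked numerically. The sketch below fits a least-squares line with the standard closed-form solution and confirms that nudging the intercept or slope always increases the sum of squared residuals. The data are hypothetical, made up for illustration (the hiker data table is not reproduced here), and the helper names `least_squares` and `sse` are illustrative.

```python
from statistics import mean


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def sse(xs, ys, a, b):
    """Sum of squared residuals for the line y_hat = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, made up for illustration (not the hiker data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
best = sse(xs, ys, a, b)
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {best:.4f}")

# Nudge the intercept and the slope: every nearby line has a larger SSE.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    worse = sse(xs, ys, a + da, b + db)
    print(f"a{da:+.2f}, b{db:+.2f}: SSE = {worse:.4f}")
```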

CHECK YOUR UNDERSTANDING

It’s time to practice your calculator regression skills. Using the familiar hiker data in the table below, calculate the least-squares regression line on your calculator. You should get [equation not preserved in source] as the equation of the regression line.

3.2.4 Calculating the Equation of the Least-Squares Line

Another reason for studying the least-squares regression line is that the problem of finding its equation has a simple answer. We can give the equation of the least-squares regression line in terms of the means and standard deviations of the two variables and their correlation.

Equation of the least-squares regression line - We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means $\bar{x}$ and $\bar{y}$ and the standard deviations $s_x$ and $s_y$ of the two variables and their correlation r. The least-squares regression line is the line

$$\hat{y} = a + bx$$

with slope

$$b = r\,\frac{s_y}{s_x}$$

and y intercept

$$a = \bar{y} - b\,\bar{x}$$

AP EXAM TIP The formula sheet for the AP exam uses different notation for these equations:

$$b_1 = r\,\frac{s_y}{s_x} \quad \text{and} \quad b_0 = \bar{y} - b_1\bar{x}$$

That’s because the least-squares line is written as $\hat{y} = b_0 + b_1 x$. We prefer our simpler versions without the subscripts.

What does the slope of the least-squares line tell us? The figure below shows the regression line in black for the hiker data. We have added four more lines to the graph:

• a vertical line at the mean body weight $\bar{x}$
• a vertical line at $\bar{x} + s_x$ (one standard deviation above the mean body weight)
• a horizontal line at the mean pack weight $\bar{y}$
• a horizontal line at $\bar{y} + s_y$ (one standard deviation above the mean pack weight)

Note that the regression line passes through $(\bar{x}, \bar{y})$ as expected.

From the graph, the slope of the line is:

$$\text{slope} = \frac{??}{s_x}$$

From the definition box, we know that the slope is

$$b = r\,\frac{s_y}{s_x}$$

Setting the two formulas equal to each other, we have

$$\frac{??}{s_x} = \frac{r \cdot s_y}{s_x}$$

So the unknown distance ?? above must be equal to $r \cdot s_y$. In other words, for an increase of one standard deviation in the value of the explanatory variable x, the least-squares regression line predicts an increase of r standard deviations in the response variable y.
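As a quick numerical check of this relationship, the sketch below plugs in the summary statistics that this section quotes for the NEA and fat gain data (r = −0.7786, s_x = 257.66 cal, s_y = 1.1389 kg). Only the slope is computed; the intercept would also need the two means. The variable names are illustrative.

```python
# Summary statistics quoted in this section for the NEA and fat gain data.
r = -0.7786      # correlation between NEA change and fat gain
s_x = 257.66     # standard deviation of NEA change (cal)
s_y = 1.1389     # standard deviation of fat gain (kg)

# Slope of the least-squares line: b = r * s_y / s_x
b = r * s_y / s_x

# Predicted change in fat gain when NEA change rises by one
# standard deviation (s_x calories): b * s_x, which equals r * s_y.
change_per_sd = b * s_x

print(f"slope b = {b:.5f} kg per calorie")
print(f"predicted change per {s_x} cal of NEA = {change_per_sd:.4f} kg")
print(f"the same change in units of s_y = {change_per_sd / s_y:.4f}")
```

The last line prints the correlation itself: one standard deviation in x moves the prediction by r standard deviations in y, and because r is negative here the predicted fat gain goes down.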

There is a close connection between correlation and the slope of the least-squares line. The slope is

$$b = r\,\frac{s_y}{s_x}$$

This equation says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y. When the variables are perfectly correlated (r = 1 or r = −1), the change in the predicted response $\hat{y}$ is the same (in standard deviation units) as the change in x. Otherwise, because −1 ≤ r ≤ 1, the change in $\hat{y}$ is less than the change in x. As the correlation grows less strong, the prediction $\hat{y}$ moves less in response to changes in x.

Example – Fat Gain and NEA
Calculating the least-squares regression line

Refer to the data from the example below:

The mean and standard deviation of the 16 changes in NEA are $\bar{x}$ = [value not preserved in source] calories (cal) and $s_x$ = 257.66 cal. For the 16 fat gains, the mean and standard deviation are $\bar{y}$ = [value not preserved in source] and $s_y$ = 1.1389 kg. The correlation between fat gain and NEA change is r = −0.7786.

(a) Find the equation of the least-squares regression line for predicting fat gain from NEA change. Show your work.

(b) What change in fat gain does the regression line predict for each additional 257.66 cal of NEA? Explain.

What happens if we standardize both variables? Standardizing a variable converts its mean to 0 and its standard deviation to 1. Doing this to both x and y will transform the point $(\bar{x}, \bar{y})$ to (0, 0). So the least-squares line for the standardized values will pass through (0, 0). What about the slope of this line? From the formula, it’s $b = r s_y / s_x$. Since we standardized, $s_x = s_y = 1$. That means b = r. In other words, the slope is equal to the correlation. The Fathom screen shot confirms these results. It shows that $r^2 = 0.63$, so [value of r not preserved in source].
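The standardization argument can be verified directly. The sketch below standardizes a small data set, fits a least-squares line to the z-scores, and compares the slope with the correlation. The data are hypothetical, invented for illustration, and the helper names `standardize` and `least_squares` are illustrative.

```python
from statistics import mean, stdev


def standardize(values):
    """Convert values to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


# Hypothetical data, made up for illustration.
xs = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
ys = [30.0, 27.0, 29.0, 22.0, 20.0, 15.0]

zx, zy = standardize(xs), standardize(ys)

# Correlation: the sum of the products of z-scores, divided by n - 1.
r = sum(u * v for u, v in zip(zx, zy)) / (len(xs) - 1)

# Least-squares line fitted to the standardized values.
a_std, b_std = least_squares(zx, zy)

print(f"correlation r      = {r:.6f}")
print(f"standardized slope = {b_std:.6f}")
print(f"standardized intercept = {a_std:.6f}")
```

Up to floating-point rounding, the slope matches r and the intercept is 0, so the standardized line passes through (0, 0).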

                           

3.2.5 How Well the Line Fits the Data: Residual Plots

Example – Does Fidgeting Keep You Slim?
Examining Residuals

Let’s return to the fat gain and NEA study involving 16 young people who volunteered to overeat for 8 weeks. Those whose NEA rose substantially gained less fat than others. We confirmed that the least-squares regression line for these data is [equation not preserved in source]. The calculator screen shot above shows a scatterplot of the data with the least-squares line added.

One subject’s NEA rose by 135 cal. That subject gained 2.7 kg of fat. (This point is marked in the screen shot with an X.) The predicted fat gain for 135 cal is:

[calculation not preserved in source]

The residual for this subject is therefore:

[calculation not preserved in source]

This residual is negative because the data point lies below the line. The 16 data points used in calculating the least-squares line produce 16 residuals. Rounded to two decimal places, they are

[list of residuals not preserved in source]

Because the residuals show how far the data fall from our regression line, examining the residuals helps assess how well the line describes the data. Although residuals can be calculated from any model that is fitted to the data, the residuals from the least-squares line have a special property: the mean of the least-squares residuals is always zero. You can check that the sum of the residuals in the above example is 0.01. The sum is not exactly 0 because we rounded to two decimal places.

You can see the residuals in the scatterplot of (a) by looking at the vertical deviations of the points from the line. The residual plot in (b) makes it easier to study the residuals by plotting them against the explanatory variable, change in NEA. Because the mean of the residuals is always zero, the horizontal line at zero in (b) helps orient us. This “residual = 0” line corresponds to the regression line in (a).

Residual plot - A residual plot is a scatterplot of the residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.
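The residuals that go into a residual plot are simple to compute once the line is known. The sketch below fits a least-squares line, lists each residual (observed y minus predicted y) alongside its x-value, and confirms that the residuals sum to zero apart from floating-point rounding. The data are hypothetical, since the NEA table is not reproduced here.

```python
from statistics import mean

# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.2, 4.1, 6.3, 6.8, 9.4, 10.0]

# Least-squares slope and intercept.
x_bar, y_bar = mean(xs), mean(ys)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# The (x, residual) pairs are exactly the points of a residual plot.
for x, e in zip(xs, residuals):
    position = "above" if e > 0 else "below"
    print(f"x = {x}: residual = {e:+.3f} (point lies {position} the line)")

print(f"sum of residuals = {sum(residuals):.10f}")
```

Passing `xs` and `residuals` to any plotting tool, with a horizontal reference line at zero, gives the residual plot described above.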

CHECK YOUR UNDERSTANDING

Refer to the data below:

1. Find the residual for the subject who increased NEA by 620 calories. Show your work.

2. Interpret the value of this subject’s residual in context.

3. For which subject did the regression line overpredict fat gain by the most? Justify your answer.

Examining residual plots

A residual plot in effect turns the regression line horizontal. It magnifies the deviations of the points from the line, making it easier to see unusual observations and patterns. If the regression line captures the overall pattern of the data, there should be no pattern in the residuals. Figure (a) shows a residual plot with a clear curved pattern. A straight line is not an appropriate model for these data, as Figure (b) confirms.

Here are two important things to look for when you examine a residual plot.

1. The residual plot should show no obvious pattern. Ideally, the residual plot will look something like the one in the figure to the right below. This graph shows an unstructured (random) scatter of points in a horizontal band centered at zero. A curved pattern in a residual plot shows that the relationship is not linear. Another type of pattern is shown in the figure to the left. This residual plot reveals increasing spread about the regression line as x increases. Predictions of y using this line will be less accurate for larger values of x.

2. The residuals should be relatively small in size. A regression line that fits the data well should come “close” to most of the points. That is, the residuals should be fairly small. How do we decide whether the residuals are “small enough”? We consider the size of a “typical” prediction error.

 

     

In the figure above, for example, most of the residuals are between −0.7 and 0.7. For these individuals, the predicted fat gain from the least-squares line is within 0.7 kilogram (kg) of their actual fat gain during the study. That sounds pretty good. But the subjects gained only between 0.4 and 4.2 kg, so a prediction error of 0.7 kg is relatively large compared with the actual fat gain for an individual. The largest residual, 1.64, corresponds to a prediction error of 1.64 kg. This subject’s actual fat gain was 3.8 kg, but the regression line predicted a fat gain of only 2.16 kg. That’s a pretty large error, especially from the subject’s perspective!

Standard deviation of the residuals

We have already seen that the average prediction error (that is, the mean of the residuals) is 0 whenever we use a least-squares regression line. That’s because the positive and negative residuals “balance out.” But that doesn’t tell us how far off the predictions are, on average. Instead, we use the standard deviation of the residuals:

$$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}}$$

For the NEA and fat gain data, the sum of the squared residuals is 7.663. So the standard deviation of the residuals is:

$$s = \sqrt{\frac{7.663}{16 - 2}} = \sqrt{\frac{7.663}{14}} \approx 0.740 \text{ kg}$$

Standard deviation of the residuals - If we use a least-squares line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by:

$$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$
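The formula is short enough to check by hand or in code. This sketch applies it to the two sums of squared residuals quoted in this section (7.663 for the 16 NEA subjects and 30.90 for the 8 hikers); the function name `residual_sd` is illustrative.

```python
from math import sqrt


def residual_sd(sse, n):
    """Standard deviation of the residuals: s = sqrt(SSE / (n - 2))."""
    return sqrt(sse / (n - 2))


# NEA and fat gain: sum of squared residuals 7.663, n = 16 subjects.
s_nea = residual_sd(7.663, 16)

# Hikers: sum of squared residuals 30.90, n = 8 hikers.
s_hikers = residual_sd(30.90, 8)

print(f"NEA data:   s = {s_nea:.3f} kg")
print(f"hiker data: s = {s_hikers:.2f}")
```

The hiker result agrees with the value s = 2.27 given for the regression of pack weight on body weight.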

CHECK YOUR UNDERSTANDING

The graph shown is a residual plot for the least-squares regression of pack weight on body weight for the 8 hikers.

1. The residual plot does not show a random scatter. Describe the pattern you see.

2. For this regression, s = 2.27. Interpret this value in context.

3.2.6 How Well the Line Fits the Data: The Role of $r^2$ in Regression

A residual plot is a graphical tool for evaluating how well a regression line fits the data. The standard deviation of the residuals, s, gives us a numerical estimate of the average size of our prediction errors from the regression line. There is another numerical quantity that tells us how well the least-squares line predicts values of the response variable y. It is $r^2$, the coefficient of determination. Some computer packages call it “R-sq.” You may have noticed this value in some of the calculator and computer regression output that we showed earlier. Although it’s true that $r^2$ is equal to the square of r, there is much more to this story.

Example – Pack weight and body weight
How can we predict y if we don’t know x?

Suppose a new student is assigned at the last minute to our group of 8 hikers. What would we predict for his pack weight? The figure above shows a scatterplot of the hiker data that we have studied throughout this chapter. The least-squares line is drawn on the plot in green. Another line has been added in blue: a horizontal line at the mean y-value, $\bar{y}$. If we don’t know this new student’s body weight, then we can’t use the regression line to make a prediction. What should we do? Our best strategy is to use the mean pack weight of the other 8 hikers as our prediction.

The figure above (a) shows the prediction errors if we use the average pack weight $\bar{y}$ as our prediction for the original group of 8 hikers. We can see that the sum of the squared residuals for this line is $SST = \sum (y_i - \bar{y})^2$. SST measures the total variation in the y-values.

If we learn our new hiker’s body weight, then we could use the least-squares line to predict his pack weight. How much better does the regression line do at predicting pack weights than simply using the average pack weight $\bar{y}$ of all 8 hikers? Figure (b) reminds us that the sum of squared residuals for the least-squares line is $\sum \text{residual}^2 = 30.90$. We’ll call this SSE, for sum of squared errors. The ratio SSE/SST tells us what proportion of the total variation in y still remains after using the regression line to predict the values of the response variable. In this case,

$$\frac{SSE}{SST} = 0.368$$

This means that 36.8% of the variation in pack weight is unaccounted for by the least-squares regression line. Taking this one step further, the proportion of the total variation in y that is accounted for by the regression line is

$$1 - \frac{SSE}{SST} = 1 - 0.368 = 0.632$$

We interpret this by saying that “63.2% of the variation in backpack weight is accounted for by the linear model relating pack weight to body weight.” For this reason, we define

$$r^2 = 1 - \frac{SSE}{SST}$$

Coefficient of determination - The coefficient of determination $r^2$ is the fraction of the variation in the values of y that is accounted for by the least-squares regression line of y on x. We can calculate $r^2$ using the following formula:

$$r^2 = 1 - \frac{SSE}{SST}$$

where $SSE = \sum \text{residual}^2$ and $SST = \sum (y_i - \bar{y})^2$.

It seems pretty remarkable that the coefficient of determination is actually the correlation squared. This fact provides an important connection between correlation and regression. When you report a regression, give $r^2$ as a measure of how successful the regression was in explaining the response. When you see a correlation, square it to get a better feel for the strength of the linear relationship.
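The claim that 1 − SSE/SST equals the square of the correlation can be checked numerically. The sketch below does so on a small hypothetical data set (invented for illustration), then repeats the two pieces of arithmetic that the text reports for the hiker and NEA data.

```python
from statistics import mean, stdev

# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.2, 4.1, 6.3, 6.8, 9.4, 10.0]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Correlation from z-scores, then the least-squares line from r, s_x, s_y.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)
b = r * s_y / s_x
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # variation left over
sst = sum((y - y_bar) ** 2 for y in ys)                    # total variation

r_squared = 1 - sse / sst
print(f"1 - SSE/SST = {r_squared:.6f}")
print(f"r squared   = {r ** 2:.6f}")

# Figures quoted in the text.
print(f"hikers: 1 - 0.368 = {1 - 0.368:.3f}")
print(f"NEA: (-0.7786) squared = {(-0.7786) ** 2:.3f}")
```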

CHECK YOUR UNDERSTANDING

1. For the least-squares regression of fat gain on NEA, $r^2$ = 0.606. Which of the following gives a correct interpretation of this value in context?
(a) 60.6% of the points lie on the least-squares regression line.
(b) 60.6% of the fat gain values are accounted for by the least-squares line.
(c) 60.6% of the variation in fat gain is accounted for by the least-squares line.
(d) 77.8% of the variation in fat gain is accounted for by the least-squares line.

2. A recent study discovered that the correlation between the age at which an infant first speaks and the child’s score on an IQ test given upon entering elementary school is −0.68. A scatterplot of the data shows a linear form. Which of the following statements about this finding is correct?
(a) Infants who speak at very early ages will have higher IQ scores by the beginning of elementary school than those who begin to speak later.
(b) 68% of the variation in IQ test scores is explained by the least-squares regression of age at first spoken word and IQ score.
(c) Encouraging infants to speak before they are ready can have a detrimental effect later in life, as evidenced by their lower IQ scores.
(d) There is a moderately strong, negative linear relationship between age at first spoken word and later IQ test score for the individuals in this study.

3.2.7 Interpreting Computer Regression Output

The figure above displays the basic regression output for the NEA data from two statistical software packages: Minitab and JMP. Other software produces very similar output. Each output records the slope and y intercept of the least-squares line. The software also provides information that we don’t yet need (or understand!), although we will use much of it later. Be sure that you can locate the slope, the y intercept, and the values of s and $r^2$ on both computer outputs. Once you understand the statistical ideas, you can read and work with almost any software output.

AP EXAM TIP Students often have a hard time interpreting the value of $r^2$ on AP exam questions. They frequently leave out key words in the definition. Our advice: Treat this as a fill-in-the-blank exercise. Write “____% of the variation in [response variable name] is accounted for by the regression line.”

Example – Beer and Blood Alcohol
Interpreting regression output

How well does the number of beers a person drinks predict his or her blood alcohol content (BAC)? Sixteen volunteers with an initial BAC of 0 drank a randomly assigned number of cans of beer. Thirty minutes later, a police officer measured their BAC. Least-squares regression was performed on the data. A scatterplot with the regression line added, a residual plot, and some computer output from the regression are shown below.

(a) What is the equation of the least-squares regression line that describes the relationship between beers consumed and blood alcohol content? Define any variables you use.

(b) Interpret the slope of the regression line in context.

(c) Find the correlation.

(d) Is a line an appropriate model to use for these data? What information tells you this?

(e) What was the BAC reading for the person who consumed 9 beers? Show your work.

3.2.8 Correlation and Regression Wisdom

Correlation and regression are powerful tools for describing the relationship between two variables. When you use these tools, you should be aware of their limitations.

1. The distinction between explanatory and response variables is important in regression. This isn’t true for correlation: switching x and y doesn’t affect the value of r. Least-squares regression makes the distances of the data points from the line small only in the y direction. If we reverse the roles of the two variables, we get a different least-squares regression line.

Example – Predicting Fat Gain, Predicting NEA
Two different regression lines

Figure (a) repeats the scatterplot of the NEA data with the least-squares regression line for predicting fat gain from change in NEA added. We might also use the data on these 16 subjects to predict the NEA change for another subject from that subject’s fat gain when overfed for 8 weeks. Now the roles of the variables are reversed: fat gain is the explanatory variable and change in NEA is the response variable. Figure (b) shows a scatterplot of these data with the least-squares line for predicting NEA change from fat gain. The two regression lines are very different. However, no matter which variable we put on the x axis, $r^2$ = 0.606 and the correlation is r = −0.778.
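The slope formula makes the difference between the two lines concrete. This sketch uses the NEA summary statistics quoted earlier in the section (r = −0.7786, standard deviations 257.66 cal and 1.1389 kg) to compute both slopes; the variable names are illustrative, and the intercepts are left out because the means are not available here.

```python
# Summary statistics quoted earlier in this section for the NEA data.
r = -0.7786
s_nea = 257.66    # standard deviation of NEA change (cal)
s_fat = 1.1389    # standard deviation of fat gain (kg)

# Predicting fat gain from NEA change: slope in kg per calorie.
slope_fat_on_nea = r * s_fat / s_nea

# Predicting NEA change from fat gain: slope in calories per kg.
slope_nea_on_fat = r * s_nea / s_fat

# If the second line is drawn on the first plot's axes (NEA horizontal),
# its slope there is the reciprocal, which differs from the first slope.
second_line_on_first_axes = 1 / slope_nea_on_fat

print(f"fat gain on NEA: {slope_fat_on_nea:.5f} kg per cal")
print(f"NEA on fat gain: {slope_nea_on_fat:.2f} cal per kg")
print(f"second line on the first plot's axes: {second_line_on_first_axes:.5f} kg per cal")

# The product of the two slopes is r squared, the same in both directions.
print(f"product of slopes = {slope_fat_on_nea * slope_nea_on_fat:.4f}")
```

The two lines would coincide only if r were 1 or −1; here the product of the slopes is $r^2$, about 0.606, which is the quantity that stays the same whichever variable goes on the x axis.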

2. Correlation and regression lines describe only linear relationships. You can calculate the correlation and the least-squares line for any relationship between two quantitative variables, but the results are useful only if the scatterplot shows a linear pattern. Always plot your data!
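A minimal sketch of why plotting matters, using made-up values rather than data from this guide: here y is completely determined by x through a curved relationship, yet the correlation comes out to zero.

```python
from math import sqrt

def correlation(x, y):
    """Correlation r between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Hypothetical values: a perfect but curved (quadratic) relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = correlation(x, y)
print(f"r = {r:.3f}")  # r measures only the linear part of the pattern
```

A scatterplot of these points would show the U-shaped pattern immediately, which the value of r alone hides.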

3. Correlation and least-squares regression lines are not resistant. You already know that the correlation r is not resistant. One unusual point in a scatterplot can greatly change the value of r. Is the least-squares line resistant? Not surprisingly, the answer is no. The following example sheds some light on this issue.

Example – Gesell Scores
Dealing with unusual points in regression

Does the age at which a child begins to talk predict a later score on a test of mental ability? A study of the development of young children recorded the age in months at which each of 21 children spoke their first word and their Gesell Adaptive Score, the result of an aptitude test taken much later. The data appear in the table below.

STATE: Can we use a child’s age at first word to predict his or her Gesell score? How accurate will our predictions be?

PLAN: Let’s start by making a scatterplot with age at first word as the explanatory variable and Gesell score as the response variable. If the graph shows a linear form, we’ll fit a least-squares line to the data. Then we should make a residual plot. The residuals, r², and s will tell us how well the line fits the data and how large our prediction errors will be.

DO: The figure below shows a scatterplot of the data. Children 3 and 13, and also Children 16 and 21, have identical values of both variables. We used a different plotting symbol to show that one point stands for two individuals. The scatterplot shows a negative association. That is, children who begin to speak later tend to have lower test scores than early talkers. The overall pattern is moderately linear (a calculator gives r = −0.640). There are two outliers on the scatterplot: Child 18 and Child 19. These two children are unusual in different ways. Child 19 is an outlier in the y direction, with a Gesell score so high that we should check for a mistake in recording it. (In fact, the score is correct.) Child 18 is an outlier in the x direction. This child began to speak much later than any of the other children.

We used a calculator to perform least-squares regression. The equation of the least-squares line is [equation missing in source]. We added this line to the scatterplot in figure a above. The slope suggests that for every month older a child is when she first speaks, her Gesell score is predicted to decrease by 1.127 points. Since a child isn’t going to speak her first word at age 0 months, the y intercept of this line has no statistical meaning.

How well does the least-squares line fit the data? Figure b above shows a residual plot. The graph shows a fairly “random” scatter of points around the “residual = 0” line with one very large positive residual (Child 19). Most of the prediction errors (residuals) are 10 points or fewer on the Gesell score. We calculated the standard deviation of the residuals to be s = 11.023. This is roughly the size of an average prediction error using the regression line. Since r² = 0.41, 41% of the variation in Gesell scores is accounted for by the least-squares regression of Gesell score on age at first spoken word. That leaves 59% of the variation in Gesell scores unaccounted for by the linear relationship for these data.

CONCLUDE: We can use the least-squares equation, with age at first word (in months) as the explanatory variable, to predict a child’s score on the Gesell test from the age at which the child first speaks. Our predictions may not be very accurate, though. On average, we’ll be off by about 11 points on the Gesell score. Also, most of the variation in Gesell score from child to child is not accounted for by this linear model. We should hesitate to use this model to make predictions, especially until we better understand the effect of the two outliers on the regression results.

In the previous example, Child 18 and Child 19 were identified as outliers in the scatterplot of figure a. These points are also marked in the residual plot of figure b. Child 19 has a very large residual because this point lies far from the regression line. However, Child 18 has a pretty small residual. That’s because Child 18’s point is close to the line. How do these two outliers affect the regression? The figure below shows the effect of removing each of these points on the correlation and the regression line. The graph adds two more regression lines, one calculated after leaving out Child 18 and the other after leaving out Child 19. You can see that removing the point for Child 18 moves the line quite a bit. (In fact, the equation of the new least-squares line is [equation missing in source].) Because of Child 18’s extreme position on the age scale, this point has a strong influence on the position of the regression line. However, removing Child 19 has little effect on the regression line.
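The contrast between Child 18 and Child 19 can be imitated with a leave-one-out check: refit the line with each point removed and see how far the slope moves. The sketch below uses made-up points, not the Gesell data. The last point is far out in the x direction and the second-to-last is far out in the y direction; the function and variable names are illustrative only.

```python
def fit_line(points):
    """Least-squares (slope, intercept) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical points, NOT the Gesell data.
points = [
    (1, 10), (2, 12), (3, 9), (4, 11), (5, 10), (6, 12), (7, 9), (8, 11),
    (5, 20),  # index 8: outlier in the y direction
    (20, 2),  # index 9: outlier in the x direction
]

full_slope, full_intercept = fit_line(points)
residuals = [y - (full_intercept + full_slope * x) for x, y in points]

# Refit with each point left out and record how much the slope changes.
slope_changes = []
for i in range(len(points)):
    reduced = points[:i] + points[i + 1:]
    slope_without_i, _ = fit_line(reduced)
    slope_changes.append(abs(slope_without_i - full_slope))

largest_residual = max(range(len(points)), key=lambda i: abs(residuals[i]))
most_influential = max(range(len(points)), key=lambda i: slope_changes[i])

print(f"slope using all points: {full_slope:.3f}")
print(f"largest residual belongs to point {largest_residual}")
print(f"largest slope change comes from removing point {most_influential}")
```

With these values the point with the largest residual is the y-direction outlier, while the point whose removal moves the slope the most is the x-direction outlier, which itself has a fairly small residual. That mirrors the pattern described for Child 19 and Child 18, though the numbers here are invented.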

Outliers and influential observations in regression
An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large residuals. An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.

We finish with our most important caution about correlation and regression.

4. Association does not imply causation. When we study the relationship between two variables, we often hope to show that changes in the explanatory variable cause changes in the response variable. A strong association between two variables is not enough to draw conclusions about cause and effect. Sometimes an observed association really does reflect cause and effect. A household that heats with natural gas uses more gas in colder months because cold weather requires burning more gas to stay warm. In other cases, an association is explained by lurking variables, and the conclusion that x causes y is not valid.

Example – Does Having More Cars Make You Live Longer?
Association, not causation

A serious study once found that people with two cars live longer than people who own only one car. Owning three cars is even better, and so on. There is a substantial positive correlation between number of cars x and length of life y. The basic meaning of causation is that by changing x we can bring about a change in y. Could we lengthen our lives by buying more cars? No. The study used number of cars as a quick indicator of wealth. Well-off people tend to have more cars. They also tend to live longer, probably because they are better educated, take better care of themselves, and get better medical care. The cars have nothing to do with it. There is no cause-and-effect tie between number of cars and length of life.

Correlations such as those in the previous example are sometimes called “nonsense correlations.” The correlation is real. What is nonsense is the conclusion that changing one of the variables causes changes in the other. A “lurking variable” (such as personal wealth in this example) that influences both x and y can create a high correlation even though there is no direct connection between x and y.

Remember: It only makes sense to talk about the correlation between two quantitative variables. If one or both variables are categorical, you should refer to the association between the two variables. To be safe, you can use the more general term “association” when describing the relationship between any two variables.
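The way a lurking variable can manufacture a correlation, as in the cars example, can be sketched with a small constructed data set. Everything below is invented for illustration: a "wealth" score feeds into both the number of cars and the length of life, and the two leftover parts are built to be unrelated to each other.

```python
from math import sqrt

def correlation(x, y):
    """Correlation r between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Hypothetical lurking variable and two unrelated "everything else" terms.
wealth = [1, 2, 3, 4, 5, 6, 7, 8]
other_cars = [1, -1, 1, -1, 1, -1, 1, -1]
other_life = [1, 1, -1, -1, 1, 1, -1, -1]

# Wealth drives both variables; cars never enter the formula for life.
cars = [w + e for w, e in zip(wealth, other_cars)]
life = [70 + w + e for w, e in zip(wealth, other_life)]

r_cars_life = correlation(cars, life)

# Because the data are constructed, we know exactly how wealth enters each
# variable, so we can strip it out and correlate what is left.
cars_without_wealth = [c - w for c, w in zip(cars, wealth)]
life_without_wealth = [l - 70 - w for l, w in zip(life, wealth)]
r_without_wealth = correlation(cars_without_wealth, life_without_wealth)

print(f"r between cars and life: {r_cars_life:.3f}")
print(f"r after removing the wealth component: {r_without_wealth:.3f}")
```

In this sketch cars and life are strongly correlated only because both depend on wealth. Once the wealth component is removed, the correlation vanishes. With real data the contribution of a lurking variable is not known this exactly, so the subtraction step here is a simplification.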