3. linear modelling and residual analysis
TRANSCRIPT
Contents:
A Slopes and equations of lines
B Correlation
C Measuring correlation
D Line of best fit
E Residual analysis
Y:\HAESE\SA_12MET-2nd\SA_12MET2_03\117SA12MET2_03.CDR Wednesday, 15 September 2010 10:31:27 AM PETER
In the next two chapters we will study mathematical modelling. This is the process of finding
an equation to describe the relationship between two variables using data obtained by observation
or experiment.
In this chapter we will be concerned with linear modelling. We will use linear equations to
describe the relationship between two variables. We will also consider how to determine whether
it is appropriate to use a linear equation to model the data.
We will begin by reviewing some important properties of lines.
A SLOPES AND EQUATIONS OF LINES
In previous courses we established that:
The slope of a straight line passing through the points (x1, y1) and (x2, y2) is
$m = \dfrac{y\text{-step}}{x\text{-step}} = \dfrac{y_2 - y_1}{x_2 - x_1}$.
If the graph of y against x is linear, then x and y are connected by the rule y = mx + c, where m and c are constants.
y = mx + c is the equation of the line with slope m and y-intercept c.
[Figure: a line with slope m which cuts the y-axis at c.]
Example 1
Find the slope and y-intercept of the illustrated line:
[Figure: a line which cuts the y-axis at 2 and passes through (4, 3).]
(0, 2) and (4, 3) lie on the line
∴ $m = \dfrac{y\text{-step}}{x\text{-step}} = \dfrac{3 - 2}{4 - 0} = \dfrac{1}{4}$
So, the slope is 1/4 and the y-intercept is 2.
To find the equation of the line passing through two points (x1, y1) and (x2, y2), we first find the slope m of the line.
The equation of the line is $\dfrac{y - y_1}{x - x_1} = m$.
[Figure: a line through (x1, y1) and (x2, y2), with a general point P(x, y) on the line.]
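The two-point method above can be sketched in a few lines of Python. This is only an illustration: the function name `line_through` is our own, and exact fractions are used on the assumption that the coordinates are integers.

```python
from fractions import Fraction

def line_through(p1, p2):
    """Return (m, c) for the line y = m*x + c through two points.

    Assumes integer coordinates so that the slope can be kept as an
    exact fraction.
    """
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: the slope is undefined")
    m = Fraction(y2 - y1, x2 - x1)  # slope = y-step / x-step
    c = y1 - m * x1                 # rearranged from (y - y1)/(x - x1) = m
    return m, c

# The line from Example 1, which passes through (0, 2) and (4, 3).
m, c = line_through((0, 2), (4, 3))
print(f"slope = {m}, y-intercept = {c}")
```

Setting x = 0 in y − y1 = m(x − x1) gives the y-intercept directly, which is what the line `c = y1 - m * x1` does.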
118 LINEAR MODELLING AND RESIDUAL ANALYSIS (Chapter 3)
Example 2
Find the equation of the line passing through (1, 4) and (3, −2).
The line has slope $m = \dfrac{y\text{-step}}{x\text{-step}} = \dfrac{-2 - 4}{3 - 1} = \dfrac{-6}{2} = -3$.
So, the equation of the line is $\dfrac{y - y_1}{x - x_1} = m$
∴ $\dfrac{y - 4}{x - 1} = -3$
∴ y − 4 = −3(x − 1)
∴ y = −3x + 7
Suppose we are given or have determined the equation of a line. Given the value of one variable, we can use substitution to find the value of the other.
Example 3
For the line with equation y = 2x + 5, find:
a y given that x = 6
b x given that y = 11.
a Substituting x = 6 into y = 2x + 5 gives y = 2(6) + 5
∴ y = 17
b Substituting y = 11 into y = 2x + 5 gives 11 = 2x + 5
∴ 6 = 2x
∴ x = 3
EXERCISE 3A
1 Determine the slope and y-intercept of the line with equation:
a y = 2x + 5
b y = 0.8x
c y = 8
d y = 0.52x + 10.3
2 Give the slope and y-intercept of the following lines:
[Graphs: a a line with the values 3 and 4 marked on the axes; b a line through (0, 2) and (4, 4); c a line through (6, 0) with 4 marked on an axis.]
3 Write down the equation of the line:
a with slope 0.75 and y-intercept 2.13
b with y-intercept 0.75 and slope 2.13.
4 Find the equations of the illustrated lines:
[Graphs: a a line through (5, 5) with 3 marked on an axis; b a line through (8, 2) and (20, 3); c a line through (5.2, 2.9) with 3.1 marked on an axis.]
5 For the line with equation y = −5x + 17, find:
a y when x = 3
b x when y = 9.5.
6 A TV repair business charges $40 for a call-out plus an hourly rate of $35.
a Copy and complete the table below:
Hours (x) | 0 | 1 | 2 | 3
Total charge ($y) | | | |
b Graph y against x.
c Find and interpret the slope of the line.
d Find and interpret the y-intercept.
e Find the equation of the line.
f What is the charge for a 2¼ hour repair job?
7 Next month's projected sales of toasters, y, can be modelled by the equation y = 9250 + 8.2x, where x is the advertising expenditure in dollars.
a What increase in sales should result from each $1 of advertising?
b What increase in sales should result by increasing advertising by $2000?
8 When the price of an electric kettle is $x, the demand is modelled by y = 18000 − 350x kettles.
a How will the demand change if the price per kettle is increased by $1?
b How will the demand be affected if:
i the price is increased by $8
ii the price is decreased by $4?
9 Which of the following statements are true concerning the straight line equation y = mx + c?
A The slope is m and the y-intercept is c.
B If an increase in x results in an increase in y, then m > 0.
C If m < 0, constant increases in x result in constant decreases in y.
D If c < 0, the graph of y = mx + c cuts the horizontal axis to the left of the origin.
B CORRELATION
Often, we wish to know how two variables are associated or related.
To find such a relationship we construct and observe a scatter plot.
A scatter plot consists of points plotted on a set of axes.
The independent variable is placed on the horizontal axis.
The dependent variable is placed on the vertical axis.
Examples of typical plots are:
• weight (kg) against height (cm), for a soccer team where weight is dependent on height.
• profit ($) against advertising ($), for a sports goods store where profit is dependent on the amount of advertising done.
• IQ against weight (kg), for a study investigating whether a person's weight has any effect on their IQ.
Consider the following experiment:
We wish to examine the relationship between the length of a helical spring and the mass that is hung from the spring.
The force of gravity on the mass causes the spring to stretch.
As the length of the spring depends on the force applied, the dependent variable is the length.
[Figure: a spring of length L cm with a mass of w grams hung from it.]
The following experimental results are obtained when objects of varying mass are hung from the spring:
Mass (w grams) | 0 | 50 | 100 | 150 | 200 | 250
Length (L cm) | 17.7 | 20.4 | 22.0 | 25.0 | 26.0 | 27.8
[Figure: scatter plot of length (cm) against mass (g).]
For each addition of 50 grams in mass, the consecutive increases in length are roughly constant.
The points are approximately linear.
CORRELATION
Correlation refers to the relationship or association between two variables.
When looking at the correlation between two variables, we should follow these steps.
Step 1: Look at the scatter plot for any pattern.
For a generally upward shape we say that the correlation is positive. As the independent variable increases, the dependent variable generally increases.
For a generally downward shape we say that the correlation is negative. As the independent variable increases, the dependent variable generally decreases.
For randomly scattered points, with no upward or downward trend, there is usually no correlation.
Step 2: Look at the spread of points to make a judgement about the strength of the correlation. This is a measure of how closely the data follows a pattern or trend.
For positive relationships we would classify the following scatter plots as:
[Figures: three scatter plots with upward trends, labelled strong, moderate and weak.]
Similarly there are strength classifications for negative relationships:
[Figures: three scatter plots with downward trends, labelled strong, moderate and weak.]
Step 3: Look at the pattern of points to see whether it is linear.
[Figures: one scatter plot in which the points are roughly linear, and one in which the points do not appear to be linear.]
Step 4: Look for and investigate any outliers. These appear as isolated points which do not fit in with the general trend of the data.
Outliers should be investigated as they are sometimes mistakes made in recording or plotting the data.
Genuine extraordinary data should be included.
[Figure: scatter plot with one isolated point labelled "outlier" and another point labelled "not an outlier".]
Looking at the scatter plot for the spring data, we can say that there appears to be a strong positive correlation between the mass of the object hung from the spring, and the length of the spring. The relationship appears to be linear, with no obvious outliers.
EXERCISE 3B
1 Describe what is meant by:
a a scatter plot
b correlation
c positive correlation
d negative correlation
e an outlier.
2 a What is meant by independent and dependent variables?
b When drawing a scatter plot, which variable is placed on the horizontal axis?
3 For the following scatter plots, comment on:
i the existence of any pattern (positive, negative or no association)
ii the relationship strength (zero, weak, moderate or strong)
iii whether the relationship is linear
iv whether there are any outliers.
[Figures: six scatter plots a to f, each of y against x.]
4 Ten students participated in a typing contest, where the students were given one minute to type as many words as possible. The table below shows how many words each student typed, and how many errors they made:
Student | A | B | C | D | E | F | G | H | I | J
Number of words (x) | 40 | 53 | 20 | 65 | 35 | 60 | 85 | 49 | 35 | 76
Number of errors (y) | 11 | 15 | 2 | 20 | 4 | 22 | 30 | 16 | 27 | 25
a Draw a scatter plot for this data.
b Name the student who is best described as:
i slow but accurate
ii fast but inaccurate
iii an outlier.
c Describe the direction and strength of correlation between these variables.
d Is the data linear?
You can use technology to construct scatter plots. Consult the graphics calculator instructions at the front of the book.
C MEASURING CORRELATION
In order to measure more precisely the degree to which two variables are linearly related, we can calculate Pearson's correlation coefficient. We denote this coefficient r.
PEARSON'S CORRELATION COEFFICIENT r
For a set of n data given as ordered pairs (x1, y1), (x2, y2), (x3, y3), ...., (xn, yn), Pearson's correlation coefficient is
$r = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sqrt{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}}$
where $\bar{x}$ and $\bar{y}$ are the means of the x and y data respectively, and $\sum$ means the sum over all the data values.
You are not required to learn this formula.
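Although the formula for Pearson's correlation coefficient does not need to be learned, it may help to see it evaluated step by step. The sketch below is one possible translation into Python; the function name `pearson_r` and the small data sets are made up for illustration and do not come from the text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from the sums in the formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    top = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    syy = sum(y * y for y in ys) - n * y_bar ** 2
    return top / sqrt(sxx * syy)

# Hypothetical data lying exactly on a line with positive slope.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
# Hypothetical data lying exactly on a line with negative slope.
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

The function fails with a division by zero if all the x-values or all the y-values are equal, since the denominator is then 0.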
The values of r range from −1 to +1.
If r = +1, the data are perfectly positively correlated. The data lie exactly on a straight line with positive slope.
If r = 0, the data show no correlation.
If r = −1, the data are perfectly negatively correlated. The data lie exactly on a straight line with negative slope.
[Figures: three scatter plots of y against x illustrating the cases above.]
POSITIVE CORRELATION
A positive value for r indicates the variables are positively correlated.
The closer r is to +1, the stronger the correlation.
Here are some examples of scatter plots for positive correlation:
[Figures: scatter plots of y against x for r = +1, r = +0.8, r = +0.5 and r = +0.2.]
NEGATIVE CORRELATION
A negative value for r indicates the variables are negatively correlated.
The closer r is to −1, the stronger the correlation.
Here are some examples of scatter plots for negative correlation:
[Figures: scatter plots of y against x for r = −1, r = −0.8, r = −0.5 and r = −0.2.]
EXERCISE 3C.1
1 Estimate Pearson's correlation coefficient r for the data in the scatter plots below:
[Figures: six scatter plots a to f, each of y against x.]
2 The table below shows the ages of six children, and the number of times they visited the doctor in the last year:
Age | 2 | 5 | 7 | 5 | 8 | 3
No. of doctor visits | 10 | 6 | 5 | 4 | 3 | 8
a Draw a scatter plot of the data.
b Estimate the correlation coefficient.
c Describe the correlation between age and number of doctor visits.
COEFFICIENT OF DETERMINATION r²
To help determine the strength of correlation between two variables, we calculate the coefficient of determination r². This is simply the square of Pearson's correlation coefficient r.
Squaring r eliminates the direction of the correlation, and gives us a value from 0 to 1 which measures the strength of correlation.
The table below is a guide for assessing the strength of linear correlation between two variables.
Value | Strength of association
r² = 0 | no correlation
0 < r² < 0.25 | very weak correlation
0.25 ≤ r² < 0.50 | weak correlation
0.50 ≤ r² < 0.75 | moderate correlation
0.75 ≤ r² < 0.90 | strong correlation
0.90 ≤ r² < 1 | very strong correlation
r² = 1 | perfect correlation
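The strength guide can be read as a simple lookup on the value of r². A minimal sketch follows, assuming the band boundaries given in the guide; the function names `strength` and `describe` are hypothetical.

```python
def strength(r_squared):
    """Classify the strength of correlation using the r² guide."""
    if not 0 <= r_squared <= 1:
        raise ValueError("r² must lie between 0 and 1")
    if r_squared == 0:
        return "no correlation"
    if r_squared < 0.25:
        return "very weak correlation"
    if r_squared < 0.50:
        return "weak correlation"
    if r_squared < 0.75:
        return "moderate correlation"
    if r_squared < 0.90:
        return "strong correlation"
    if r_squared < 1:
        return "very strong correlation"
    return "perfect correlation"

def describe(r):
    """Combine the direction (sign of r) with the strength (size of r²)."""
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{direction}, {strength(r * r)}"

print(describe(0.95))
print(describe(-0.6))
```

The sign of r gives the direction and r² gives the strength, so the two are reported separately and then joined.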
USING TECHNOLOGY TO FIND r AND r²
We can use technology to find r and r². For help, consult the graphics calculator instructions at the front of the book.
Example 4
A group of adults was weighed, and their maximum speed when sprinting was measured:
Weight (x kg) | 85 | 60 | 78 | 100 | 83 | 67 | 79 | 62 | 88 | 68
Max. speed (y km h⁻¹) | 26 | 29 | 24 | 17 | 22 | 30 | 25 | 24 | 19 | 27
a Use technology to find r and r² for the data.
b Describe the correlation between weight and maximum speed.
a [Calculator screenshots: Casio fx-9860G Plus, TI-84 Plus, TI-nspire.]
Using technology, r ≈ −0.813 and r² ≈ 0.662.
b There is a moderate negative correlation between weight and maximum speed.
EXERCISE 3C.2
1 Jill hangs her clothes out to dry every Saturday, and notices that the clothes dry faster some days than others. She investigates the relationship between temperature and the time her clothes take to dry:
Temperature (x °C) | 25 | 32 | 27 | 39 | 35 | 24 | 30 | 36 | 29 | 35
Drying time (y min) | 100 | 70 | 95 | 25 | 38 | 105 | 70 | 35 | 75 | 40
a Draw a scatter plot for this data.
b Calculate r and r².
c Describe the correlation between temperature and drying time.
2 The table below shows the ticket and beverage sales for each day of a 12 day music festival:
Ticket sales ($x × 1000) | 25 | 22 | 15 | 19 | 12 | 17 | 24 | 20 | 18 | 23 | 29 | 26
Beverage sales ($y × 1000) | 9 | 7 | 4 | 8 | 3 | 4 | 8 | 10 | 7 | 7 | 9 | 8
a Draw a scatter plot for this data.
b Calculate r and r².
c Describe the correlation between ticket sales and beverage sales.
3 A local council collected data from a number of parks in the area, recording the size of the parks and the number of gum trees each contained:
Size (hectares) | 2.8 | 6.9 | 7.4 | 4.3 | 8.5 | 2.3 | 9.4 | 5.2 | 8.0 | 4.9 | 6.2 | 3.3 | 4.5
No. of gum trees | 18 | 31 | 33 | 24 | 13 | 17 | 40 | 32 | 37 | 30 | 32 | 25 | 28
a Draw a scatter plot for this data.
b Would you expect r to be positive or negative?
c Calculate r and r².
d Are there any outliers?
e Remove the outlier, and re-calculate r and r².
D LINE OF BEST FIT
Consider again the scatter plot of the spring data. Since the data is approximately linear, it is reasonable to draw a line of best fit through the data.
[Figure: scatter plot of length (cm) against mass (g) with a line of best fit drawn through the data.]
This line can be used to predict the value of one variable given the value of the other.
There are several ways to fit a straight line to a data set. We will examine two of them:
• the line of best fit 'by eye'
• the 'least squares' regression line.
LINE OF BEST FIT 'BY EYE'
Given a scatter plot for a data set, we can draw a line of best fit 'by eye', which should have about the same number of points above as below it. Its direction should follow the general trend of the data.
To find the equation of this line, we first select two points which lie on the line. We then find the equation of the line passing through these points using the techniques revised in Section A.
Example 5
For the spring data on page 121:
a draw the scatter plot, and draw a line of best fit through the data
b find the equation of the line you have drawn.
a [Figure: scatter plot of length (cm) against mass (g) with a line of best fit drawn by eye.]
b The line of best fit above passes through (100, 22) and (200, 26).
So, the line has slope $m = \dfrac{26 - 22}{200 - 100} = 0.04$ and equation $\dfrac{y - 22}{x - 100} = 0.04$
∴ y − 22 = 0.04x − 4
∴ y = 0.04x + 18
or in this case L = 0.04w + 18
EXERCISE 3D.1
1 For the following data sets:
i draw the scatter plot
ii draw a line of best fit through the data
iii find the equation of the line you have drawn.
a
x | 11 | 7 | 16 | 4 | 8 | 10 | 17 | 5 | 12 | 2 | 8 | 13 | 9 | 18 | 5 | 12
y | 16 | 12 | 32 | 5 | 7 | 19 | 30 | 14 | 19 | 6 | 17 | 24 | 15 | 34 | 6 | 26
b
x | 13 | 18 | 7 | 1 | 12 | 6 | 15 | 4 | 17 | 3 | 10 | 5
y | 10 | 6 | 17 | 18 | 13 | 14 | 6 | 15 | 5 | 14 | 10 | 13
2 Over 10 days the maximum temperature and number of car break-ins was recorded for a city:
Max. temperature (x °C) | 22 | 17 | 14 | 18 | 24 | 29 | 33 | 32 | 26 | 22
No. of car break-ins (y) | 30 | 18 | 9 | 20 | 31 | 38 | 47 | 40 | 29 | 25
a Draw a scatter plot for the data.
b Describe the correlation between temperature and number of break-ins.
c Draw a line of best fit through the data.
d Find the equation of the line of best fit.
e Use your equation to estimate the number of car break-ins you would expect to occur on a 25°C day.
THE LEAST SQUARES REGRESSION LINE
The problem with finding the line of best fit by eye is that the line drawn will vary from one person to the next.
Instead, mathematicians use a method known as linear regression to find the equation of the line which best fits the data.
Consider the set of points shown in the figure below.
[Figure: points scattered about a line, with the vertical distances d1, d2, d3 and d4 from the points to the line marked.]
For any line we draw to model the points, we can find the vertical distances d1, d2, d3, .... between each point and the line.
We can then square the distances and find their sum $d_1^2 + d_2^2 + d_3^2 + \dots$ If all the points are close to the line, this value will be small.
The least squares regression line is the line which minimises this value.
DEMO: This demonstration allows you to experiment with various data sets. Use trial and error to find the least squares line of best fit for each set.
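The trial and error idea can be imitated with a short Python sketch that computes the sum of the squared vertical distances for any candidate line. The spring data and the by-eye line L = 0.04w + 18 come from the text; the second candidate line and the function name are made up for comparison.

```python
def sum_squared_distances(m, c, xs, ys):
    """Sum of the squared vertical distances from each point to y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

# Spring data: mass in grams, length in cm.
mass = [0, 50, 100, 150, 200, 250]
length = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]

# (0.04, 18) is the by-eye line from Example 5; (0.05, 17) is a
# hypothetical alternative that fits the data less closely.
for m, c in [(0.04, 18.0), (0.05, 17.0)]:
    total = sum_squared_distances(m, c, mass, length)
    print(f"m = {m}, c = {c}: sum of squared distances = {total:.2f}")
```

The line with the smaller total lies closer to the points overall. The least squares regression line is the one line for which no other choice of m and c gives a smaller total.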
In practice, rather than finding this line by experimentation, we use the following formula:
The least squares line has equation y = mx + c where
$m = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$ and $c = \bar{y} - m\bar{x}$.
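As a check on the formula, here is a minimal Python sketch that evaluates m and c directly from the sums. It is applied to the spring data from earlier in the chapter; the function name `least_squares_line` is our own.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for the least squares line y = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    m = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    c = y_bar - m * x_bar
    return m, c

# Spring data: mass in grams, length in cm.
mass = [0, 50, 100, 150, 200, 250]
length = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]

m, c = least_squares_line(mass, length)
print(f"L = {m:.4f}w + {c:.1f}")
```

Because c is defined as ȳ − m x̄, the least squares line always passes through the point (x̄, ȳ).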
USING TECHNOLOGY TO FIND THE LINE OF BEST FIT
Instead of using the above formula, we can use technology to find the least squares line of best fit. For help, consult the graphics calculator instructions at the start of the book.
Example 6
Use technology to find the least squares line of best fit for the spring data.
[Calculator screenshots: Casio fx-9860G Plus, TI-84 Plus, TI-nspire.]
So, the least squares line of best fit is y ≈ 0.0402x + 18.1, or L ≈ 0.0402w + 18.1.
Compare this equation with the one we obtained when we found the line of best fit by eye.
INTERPOLATION AND EXTRAPOLATION
Suppose we have gathered data to investigate the association between two variables. We obtain the scatter diagram shown below. The data values with the lowest and highest values of x are called the poles.
[Figure: scatter diagram of y against x with the line of best fit. The lower pole and upper pole are marked. The region between the poles is labelled interpolation, and the regions outside the poles are labelled extrapolation.]
We use least squares regression to obtain a line of best fit. We can use the line of best fit to estimate values of one variable given a value for the other.
If we use values of x in between the poles, we say we are interpolating between the poles.
If we use values of x outside the poles, we say we are extrapolating outside the poles.
The accuracy of an interpolation depends on how linear the original data was. This can be gauged
by determining the correlation coefficient and ensuring that the data is randomly scattered around
the line of best fit.
The accuracy of an extrapolation depends not only on how linear the original data was, but also on
the assumption that the linear trend will continue past the poles. The validity of this assumption
depends greatly on the situation under investigation.
As a general rule, it is reasonable to interpolate between the poles, but unreliable to extrapolate
outside them.
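Whether an estimate is an interpolation or an extrapolation depends only on where the chosen x-value sits relative to the poles. A small sketch, assuming the spring masses as the x-data; the function name and the two test masses are hypothetical.

```python
def classify_estimate(x, xs):
    """Say whether estimating at x interpolates or extrapolates the data xs."""
    lower_pole, upper_pole = min(xs), max(xs)
    if lower_pole <= x <= upper_pole:
        return "interpolation"
    return "extrapolation"

# Spring data: the poles are the masses 0 g and 250 g.
mass = [0, 50, 100, 150, 200, 250]

print(classify_estimate(120, mass))  # a mass between the poles
print(classify_estimate(400, mass))  # a mass beyond the upper pole
```

This check covers only the position of x. The strength and linearity of the data still need to be judged separately before the estimate can be called reliable.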
Example 7
The table below shows how far a group of students live from school, and how long it takes them to travel there each day.
Distance from school (x km) | 7.2 | 4.5 | 13 | 1.3 | 9.9 | 12.2 | 19.6 | 6.1 | 23.1
Time to travel to school (y min) | 17 | 13 | 29 | 2 | 25 | 27 | 41 | 15 | 53
a Draw a scatter plot of the data.
b Use technology to find:
i r²
ii the equation of the line of best fit.
c Pam lives 15 km from school.
i Estimate how long it takes Pam to travel to school.
ii Comment on the reliability of your estimate.
a [Figure: scatter plot of y against x.]
b i r² ≈ 0.987
ii The line of best fit is y ≈ 2.16x + 1.42.
c i When x = 15, y ≈ 2.16(15) + 1.42 ≈ 33.8
So, it will take Pam approximately 33.8 minutes to travel to school.
ii The estimate is an interpolation, and the r² value indicates a very strong correlation. This suggests that the estimate is reliable.
EXERCISE 3D.2
1 Use technology to find the equation of the least squares regression line for this data set:
x | 3 | 7 | 8 | 4 | 6 | 3 | 8 | 1
y | 9 | 5 | 2 | 7 | 5 | 10 | 4 | 15
2 Consider the temperature vs drying time problem on page 126.
a Use technology to find the equation of the line of best fit.
b Estimate the time it will take for Jill's clothes to dry on a 28°C day.
c How reliable is your estimate in b?
3 Consider the ticket sales vs beverage sales problem on page 126.
a Find the equation of the line of best fit.
b The music festival is extended by one day, and $35 000 worth of tickets are sold.
i Predict the beverage sales for this day.
ii Comment on the reliability of your prediction.
4 The table below shows the amount of time a collection of families spend preparing homemade meals each week, and the amount of money they spend each week on fast food.
Time on homemade meals (x hours) | 3.3 | 6.0 | 4.0 | 8.5 | 7.2 | 2.5 | 9.1 | 6.9 | 3.8 | 7.7
Money on fast food ($y) | 85 | 0 | 60 | 0 | 27 | 100 | 15 | 40 | 59 | 29
a Draw a scatter plot of the data.
b Use technology to find the line of best fit.
c State the values of r and r².
d Interpret the slope and y-intercept of the line of best fit.
e Another family spends 5 hours per week preparing homemade meals. Estimate how much money they spend on fast food each week. Comment on the reliability of your estimate.
5 The ages and heights of children at a playground are given below:
Age (x years) | 3 | 9 | 7 | 4 | 4 | 12 | 8 | 6 | 5 | 10 | 13
Height (y cm) | 94 | 132 | 123 | 102 | 109 | 150 | 127 | 110 | 115 | 145 | 157
a Draw a scatter plot of the data.
b Use technology to find the line of best fit.
c At what age would you expect children to reach a height of 140 cm?
d Interpret the slope of the line of best fit.
e Use the line to predict the height of a 20 year old. Do you think this prediction is reliable?
6 Once a balloon has been blown up, it slowly starts to deflate. A balloon's diameter was recorded at various times after it was blown up:
Time (t hours) | 0 | 10 | 25 | 40 | 55 | 70 | 90 | 100 | 110
Diameter (D cm) | 40.2 | 37.8 | 34.5 | 30.2 | 26.1 | 23.9 | 19.8 | 17.2 | 14.0
a Draw a scatter plot of the data.
b Describe the correlation between D and t.
c Find the equation of the least squares regression line.
d Use this equation to predict:
i the diameter of the balloon after 80 hours
ii the time it took for the balloon to completely deflate.
e Which of your predictions in d is more likely to be reliable?
7 Each year in AFL football, the Brownlow medal is awarded to the 'Fairest and Best' player in the competition.
Scott has a theory that he can predict the winner from the average number of disposals (kicks and handballs) per game. He wants to test his theory on the results for the top 20 vote-getters for the 2009 season.
Player | A | B | C | D | E | F | G | H | I | J
Disposals per game, x | 34 | 27 | 28 | 25 | 16 | 28 | 21 | 27 | 27 | 25
Brownlow votes, y | 30 | 22 | 20 | 19 | 19 | 17 | 17 | 16 | 15 | 15
Player | K | L | M | N | O | P | Q | R | S | T
Disposals per game, x | 17 | 27 | 29 | 26 | 25 | 27 | 30 | 28 | 28 | 25
Brownlow votes, y | 15 | 14 | 14 | 13 | 13 | 13 | 13 | 13 | 13 | 12
a Construct a scatter plot for the data in the table above, using disposals per game as the independent variable.
b Describe the correlation between disposals per game and Brownlow votes.
c Find Pearson's correlation coefficient for the data.
d Find the equation of the least squares line of best fit.
e Use the line in d to predict the Brownlow votes for a player who averaged 25 disposals per game.
f There are four players in the top 20 who averaged 25 disposals per game. Identify these players and their actual Brownlow votes.
g How reliable is the variable disposals per game as a predictor of Brownlow votes?
E RESIDUAL ANALYSIS
Given a set of data, we have seen how we can draw a scatter plot, then find the line of best fit to model it.
However, it is not always appropriate to model data using a straight line. For example, the data in the figure below exhibits strong positive correlation. However, it is clearly not linear.
[Figure: scatter plot of y against x in which the points rise along a curve.]
The values of r and r² can be used to determine how well the linear model fits the data. However, to further assess the appropriateness of the linear model, we need to analyse the residuals.
RESIDUALS
For each data point, the residual is given by
$y_{\text{obs}} - y_{\text{pred}}$
where $y_{\text{obs}}$ is the observed y-value of the data point, and $y_{\text{pred}}$ is the y-value predicted by the line of best fit for the x-value of the data point.
[Figure: scatter plot of y against x with the line of best fit. For one point, the observed value, the predicted value on the line, and the residual between them are marked. A point above the line is labelled positive residual and a point below the line is labelled negative residual.]
A positive residual indicates that the data point is above the line of best fit.
A negative residual indicates that the data point is below the line of best fit.
We can then plot the residuals against the x-values to form a residual plot. The residual plot shows how the points vary about the line of best fit.
Here is the residual plot for the data points above:
[Figure: residual plot of residual against x.]
Example 8
Consider the data set:
x | 3 | 4 | 6 | 9 | 11
y | 7 | 4 | 10 | 11 | 20
a Find the equation of the line of best fit.
b Calculate the residuals.
c Draw the residual plot.
a Using technology, the line of best fit is y ≈ 1.61x − 0.230.
b We find ypred for each data point by evaluating y = 1.61x − 0.230 for each of the x-values.
x | yobs | ypred | residual = yobs − ypred
3 | 7 | 4.60 | 2.40
4 | 4 | 6.21 | −2.21
6 | 10 | 9.43 | 0.57
9 | 11 | 14.27 | −3.27
11 | 20 | 17.49 | 2.51
c [Figure: residual plot of residual against x, with the residual axis marked from −4 to 4.]
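The residual calculation in Example 8 can be reproduced with a short Python sketch. It assumes the least squares formula from Section D to obtain the line, and it keeps the unrounded slope and intercept so that the residuals agree with the table to two decimal places. The function names are our own.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for the least squares line y = m*x + c."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    m = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
        sum(x * x for x in xs) - n * x_bar ** 2
    )
    return m, y_bar - m * x_bar

def residuals(xs, ys):
    """Observed y minus predicted y for each data point."""
    m, c = least_squares_line(xs, ys)
    return [y - (m * x + c) for x, y in zip(xs, ys)]

xs = [3, 4, 6, 9, 11]
ys = [7, 4, 10, 11, 20]
res = residuals(xs, ys)
for x, r in zip(xs, res):
    print(f"x = {x:2d}   residual = {r:+.2f}")
```

For a least squares line the residuals add to zero, apart from rounding, which gives a quick check on the arithmetic.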
EXERCISE 3E.1
1 Match the following scatter plots with the correct residual plot:
[Figures: scatter plots a, b and c of y against x, and residual plots A, B and C of residual against x.]
2 A least squares regression line is shown on the scatter plot below.
[Figure: scatter plot of y against x with a least squares regression line.]
Which one of the following would be the residual plot for the regression line?
[Figures: residual plots A, B, C and D of residual against x.]
3 For the following data sets:
i draw the scatter plot
ii find the line of best fit
iii calculate the residuals
iv draw the residual plot.
a
x | 2 | 5 | 7 | 10
y | 3 | 9 | 13 | 16
b
x | 3 | 7 | 8 | 11 | 15
y | 8 | 10 | 12 | 13 | 17
c
x | 1 | 9 | 6 | 15 | 4 | 12
y | 18 | 14 | 12 | 2 | 10 | 6
4 Check your residual plots from 3 using technology. For help, consult the graphics calculator instructions at the front of the book.
5 The equation of the least squares regression line applied to the data graphed below is
Diastolic blood pressure = 68.5 + 0.2 × weight.
[Figure: scatter plot of diastolic blood pressure (mm Hg), from 70 to 100, against weight (kg), from 60 to 110.]
a Draw the least squares regression line on the graph.
b Estimate the residual for the following points from the graph, then check the value using the equation.
i (82, 87.5)
ii (103.3, 81.4)
c Sketch the residual plot for this regression line.
ANALYSING RESIDUAL PLOTS
We can use residual plots to determine whether it is appropriate to fit a linear model to a data set.
Consider the set of data points in the figure below, which appear linear. The line of best fit is also shown.
[Figure: scatter plot of y against x with the line of best fit.]
Here is the residual plot for these data points. Notice that the points are randomly scattered about the x-axis, with no obvious pattern. This indicates that the data varies randomly about the line of best fit, and so the linear model is appropriate for the data.
[Figure: residual plot of residual against x.]
Now consider this second set of points, which do not appear to be linear. Again, the line of best fit is given.
[Figure: scatter plot of y against x with the line of best fit.]
On the residual plot for these data points, we see the points are not random, but show a clear pattern. This indicates that the linear model is not appropriate for the data.
[Figure: residual plot of residual against x.]
In general, a residual plot with points randomly scattered about the x-axis indicates the linear model is appropriate for the data. A residual plot which exhibits a clear pattern indicates the linear model is not appropriate for the data.
Example 9
Consider the data set:
x | 1 | 2 | 3 | 4 | 5 | 6 | 7
y | 0.75 | 1.95 | 3.1 | 4.2 | 5.1 | 5.95 | 6.7
a Find the line of best fit, and state the value of r².
b Construct a residual plot for the data.
c Is the linear model appropriate for the data?
a Using technology, the line of best fit is y ≈ 0.995x − 0.0143, and r² ≈ 0.993.
[Figure: scatter plot of y against x with the line of best fit, both axes marked from 2 to 8.]
b Using technology, the residual plot is:
[Figure: residual plot of residual against x.]
c The residual plot shows a clear pattern, and does not appear random. This indicates that the linear model is not appropriate for the data.
This example shows how a high value of r² does not necessarily mean that the line of best fit is appropriate. The original scatter plot, the coefficient of determination r², and the residual plot should all be considered before concluding that a linear model is appropriate.
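The pattern in the residuals for Example 9 can also be seen without drawing the plot, by listing the sign of each residual in order of x. The sketch below assumes the least squares formula from Section D; the function name `residual_signs` is hypothetical, and reading signs is only a rough substitute for inspecting the plot itself.

```python
def residual_signs(xs, ys):
    """Return a string of '+' and '-' signs for the least squares residuals."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    m = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
        sum(x * x for x in xs) - n * x_bar ** 2
    )
    c = y_bar - m * x_bar
    # A point above the line has a positive residual, below is negative.
    return "".join("+" if y - (m * x + c) > 0 else "-" for x, y in zip(xs, ys))

# Data from Example 9.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [0.75, 1.95, 3.1, 4.2, 5.1, 5.95, 6.7]

signs = residual_signs(xs, ys)
print(signs)
```

The residuals are negative at both ends and positive in the middle, so the points follow an arch rather than scattering randomly about the x-axis. This is the clear pattern referred to in part c.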
EXERCISE 3E.2
1 Which one of the following residual plots shows a regression line that is not a good fit for
the data? Explain your answer.
[Figure: four residual plots labelled A, B, C, and D, each plotting residual against x]
2 For each of the following data sets:
i draw the scatter plot
ii use technology to find the line of best fit, and state the value of r²
iii use technology to construct the residual plot
iv determine whether the line of best fit is appropriate to model the data.
a  x   1   2   3   4   5   6   7   8   9
   y  33  29  25  24  20  18  13   9   8
b  x  2.2  3.7   9.5   6.2  1.4  3.9   7.5   8     5.5
   y  3.6  7.1  22.5  13.3  1.9  7.6  16.8  18.2  11.5
c  x   5   9   1  12   6   5   9   7   2  10
   y  13   1   6  14   2   9  10   8   4  11
3 In a 60 minute Art lesson, students had to make as many paper cranes as possible. The table
shows how long it took each student to make a paper crane, and how many cranes they made
during the lesson:
Time taken (t min)   6  8.5   4   5  8  7.5  10  7
Cranes made (C)     10  7    15  12  7  8     6  8
a Draw a scatter plot of the data.
b Use technology to find the line of best fit, and state the value of r².
c Draw the line of best fit on your scatter diagram.
d Use technology to construct the residual plot.
e Is the line of best fit appropriate to model the data?
4 Ten people were asked how many text messages they had sent and received in the last week:
Text messages sent (x)      18  3  7  22  15  5  20  30  7  25
Text messages received (y)  22  2  9  21  16  9  23  33  7  24
a Draw a scatter plot for the data.
b Find the line of best fit, and state the value of r².
c Describe the correlation between text messages sent and text messages received.
d Construct the residual plot.
e Is the line of best fit appropriate to model the data?
f Ted received 10 text messages in the last week.
i Estimate the number of text messages he sent.
ii How reliable is this estimate?
REVIEW SET 3
1 A storage bin for chicken food contains 1000 kg of food pellets.
Exactly 13 kg of food is removed each day for the chickens.
a Copy and complete the following table:
Days elapsed (t)       0     1    2    3    4
Pellets in bin (F kg)  1000  987
b What is the dependent variable? What is the independent variable?
c If a graph was to be drawn, what axis should F be plotted on?
d Graph the relationship between F and t.
e What is the slope of the line through these points?
f Write down the function for F in terms of t.
g Interpret the slope and the vertical intercept of the function.
h Using the function from f, determine the amount of food left after a fortnight.
i Determine when the food supply will run out.
2 Temperatures can be expressed in a variety of units. Consider the following table,
which shows the relationship between the temperature T_C in degrees Celsius, and
the temperature T_F in degrees Fahrenheit:
Temperature (T_C °C)  10  20  30  40
Temperature (T_F °F)  50  68  86  104
a Show that the relationship between T_F and T_C is linear.
b Find the change in T_F per unit increase in T_C.
c Write down the function for T_F in terms of T_C.
d Convert 100 °C to °F.
e Convert 32 °F to °C.
3 Thomas rode for an hour each day for eleven days and recorded the number of kilometres
travelled against the temperature that day.
Temp. (T °C)     32.9  33.9  35.2  37.1  38.9  30.3  32.5  31.7  35.7  36.3  34.7
Distance (d km)  26.5  26.7  24.4  19.8  18.5  32.6  28.7  29.4  23.8  21.2  29.7
a Draw a scatter plot for this data.
b Describe the association between the distance travelled and temperature.
c Determine the equation of the line of best fit.
d Interpret the slope and vertical intercept of this line.
e Determine the values of r and r².
f Use your equation to predict how hot it must get before Thomas does not ride at
all. Comment on the reasonableness of this prediction.
4 Eight identical garden beds were watered a varying number of times each week, and the
number of flowers each bed produced is recorded in the table below:
Number of waterings (n) 0 1 2 3 4 5 6 7
Flowers produced (f ) 18 52 86 123 158 191 228 250
a Draw a scatter plot for this data.
b Describe the association between the number of waterings and the flowers produced.
c Find the equation of the least squares regression line, and state the values of r
and r².
d Interpret the slope and vertical intercept of this line.
e Violet has two flower beds. She waters one five times a fortnight, and the other ten
times a week.
i How many flowers can she expect from each bed?
ii Which is the more reliable estimate?
5 After an outbreak of the flu at a school, medical authorities begin recording the number
of people diagnosed with the flu.
Days after outbreak (n) 2 3 4 5 6 7 8 9 10 11
People diagnosed (d) 8 14 33 47 80 97 118 123 105 83
a Draw a scatter plot for this data.
b Determine: i the least squares regression line ii the values of r and r².
c Construct a residual plot for the linear relationship between d and n.
d Is the line of best fit an appropriate model for the data? Explain your answer.
6 Two supervillains, Silent Predator and the Furry Reaper, terrorise Metropolis by abducting
fair maidens. Superman believes that they are collaborating, alternately abducting fair
maidens so as not to compete with each other for ransom money. He records their
abduction rate below, in dozens of maidens.
Silent Predator (s) 4 6 5 9 3 5 8 11 3 7 7 4
Furry Reaper (f) 13 10 11 8 11 9 6 6 12 7 10 8
a Draw a scatter plot for this data. Plot s on the horizontal axis.
b Determine: i the least squares regression line ii the values of r and r².
c Construct a residual plot for the data.
d Is the line of best fit an appropriate model for the data? Explain your answer.