All rights reserved. © Kevin Olding 2013 1
Statistics 1
Data Presentation and Measures of Central Tendency and Dispersion - MEI
Kevin Olding
Contents
1. Types of data
2. Frequency tables for grouped and ungrouped data
3. Pie Charts, Bar Charts and Vertical Line Charts
4. Histograms
5. Stem and leaf diagrams
6. The Range and the Median
7. Quartiles, and the Inter-Quartile Range
8. Box and whisker plots (Boxplots), Outliers (first definition)
9. Cumulative frequency, cumulative frequency curves, percentiles
10. Reading off the median, quartiles and percentiles from a cumulative frequency curve
11. Mean, mean square deviation, root mean square deviation
12. Variance and standard deviation, outliers (second definition)
13. Calculating the mean and variance from an ungrouped frequency table
14. Estimating the mean and median from a grouped frequency table
15. Linear Coding
16. Advantages and Disadvantages of the mean, median, mode and midrange
17. Skewness
1. Types of data
There are three types of data:
1. Qualitative data, which consists of descriptions using names
Head or Tail
Black or White
Labour, Conservative or Liberal Democrat
2. Discrete data, which consists of numerical values in cases where we can make a list of the possible
values. Often this list can be very short, for example the outcomes of the roll of a die, {1,2,3,4,5,6}.
Sometimes it will be longer and can potentially be infinite, for example the number of Tails that appear
before the first Head when I repeatedly toss a coin has outcomes {0,1,2,3,...}.
3. Continuous data, which consists of numerical values in cases where it is not possible to make a list of all
possible outcomes, e.g. measurements of physical quantities such as weight, height and time.
Note 1: Continuous data can often appear discrete, for example because the limitations of measuring instruments
force us to round to, e.g., the nearest 10 grams.
Note 2: Discrete data can sometimes appear continuous. For example, amounts of money feel like they could be
continuous but usually we only think in terms of multiples of 1p. Sometimes we might have fractions of 1p, for
example in a financial transaction but there is always a limit to the subdivisions allowed. In fact we often treat such
data as continuous simply because it is convenient to do so.
Note 3: It could be argued that there really is no such thing as continuous data as there is always a theoretical limit
on the accuracy of a measurement, for example a measure of the mass of an object might ultimately be limited by
multiples of a part of the object on an atomic level. We leave it to the philosophers to debate such issues. For us,
when treating data as continuous is useful we will do so.
2. Frequency tables for grouped and ungrouped data
Ungrouped data:
The table below describes the A‐level results of a group of students.
Number of A grades Frequency
4 62
3 38
2 12
1 7
0 1
From this table we can see that the most common number of A grades was 4, with 62 students getting 4 As. We say
that the mode is 4 or that 4 is the modal value.
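As a quick check, reading the mode off a frequency table can be sketched in a few lines of Python. This is only an illustration; the dictionary name `frequencies` is our own choice and not part of the course notation.

```python
# Frequency table from the text: number of A grades -> number of students.
frequencies = {4: 62, 3: 38, 2: 12, 1: 7, 0: 1}

# The mode is the value with the largest frequency.
mode = max(frequencies, key=frequencies.get)

# The total number of students is the sum of the frequencies.
total_students = sum(frequencies.values())

print("mode:", mode)
print("students:", total_students)
```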
Grouped data:
Sometimes it can be useful to group data into classes, especially when there are many different values. The
information becomes more concise than the raw data, but the disadvantage is that the original data has been lost.
Example: FTSE 100 Share prices
Price of share Frequency
0‐49p 7
50p – 99p 9
£1‐£1.99 28
£2‐£3.99 42
£4+ 14
Class boundaries
Sometimes it is obvious which class a piece of data falls into. For example a share of price £1.45 fits into the third
class above. But what about a share that is trading at 49.5p? (Shares do trade at non‐integer values!)
There is no correct answer to this question. If we are designing a frequency table ourselves we should be careful to
write the boundaries without ambiguity; for example, we could have written $0 \le x < 50$ and $50 \le x < 100$ pence for
the first two classes. If we are presented with an ambiguous table someone else has constructed we just
have to make the best guess we can as to what their intentions were.
How to manipulate statistics 1
Statistics can be used to give clarity to and to describe and summarise data in a useful and informative way. It can
also be used to deceive, or be presented in a way which is misleading.
The following tables describe the wages paid to workers in two different factories which each employ 150 people.
Which would you rather work in?
The Fair Factory with Fun‐loving Foremen
£ per hour Frequency
0‐£7.50 13
£7.51‐£50 137
The Sad Store of Selfish Slave‐drivers
£ per hour Frequency
0‐£8.50 133
£8.51‐£50 17
In fact both factories are the same...here is the ungrouped data:
£ per hour Frequency
7 13
8 120
10 15
25 2
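To see how one data set produces both grouped tables, here is a minimal Python sketch. The helper `group` is a hypothetical function of our own that splits the ungrouped wages at a chosen class boundary.

```python
# Ungrouped wage data from the text: pounds per hour -> number of workers.
wages = {7: 13, 8: 120, 10: 15, 25: 2}

def group(data, boundary):
    """Split the frequencies into two classes: up to `boundary`, and above it."""
    low = sum(f for value, f in data.items() if value <= boundary)
    high = sum(f for value, f in data.items() if value > boundary)
    return low, high

fair = group(wages, 7.50)  # classes 0-£7.50 and £7.51-£50
sad = group(wages, 8.50)   # classes 0-£8.50 and £8.51-£50

print("Boundary at £7.50:", fair)
print("Boundary at £8.50:", sad)
```

Moving the boundary by just £1 moves the 120 workers on £8 per hour from the upper class to the lower class, which is what makes the two tables look so different.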
3. Pie Charts, Bar Charts and Vertical Line Charts
You are expected to be able to construct and interpret these basic diagrams for Statistics 1, but they are not
included in these notes. Please ask your teacher if there is anything you are unsure about.
4. Histograms
Grouped data can be displayed in a histogram. For example, consider the data below which shows the
marks gained by a group of students in an examination:
Mark ($x$ %)        Frequency
$0 \le x < 30$      4
$30 \le x < 50$     12
$50 \le x < 70$     37
$70 \le x < 100$    14
The data has been grouped into intervals or classes and we can calculate the class widths by comparing the
endpoints. For example, the class $50 \le x < 70$ has width $70 - 50 = 20$. The frequency density is then calculated
as
$$\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}$$
Mark ($x$ %)        Frequency    Class width    Frequency density (students/mark)
$0 \le x < 30$      4            30             $4/30 \approx 0.13$
$30 \le x < 50$     12           20             $12/20 = 0.6$
$50 \le x < 70$     37           20             $37/20 = 1.85$
$70 \le x < 100$    14           30             $14/30 \approx 0.47$
Now we can draw the histogram, which will have the Mark on the horizontal axis and Frequency Density on
the vertical axis
Histograms can seem similar to bar charts, but there are some important differences.
In a histogram:
a. The vertical axis is always labelled "Frequency Density"
b. There are no gaps between the bars
c. The area of each bar is proportional to the frequency that it represents.
The highest bar in the histogram represents the interval $50 \le x < 70$ and this is called the modal class.
Note that this interval has the highest frequency density. In this example it also has the highest frequency,
but this need not be the case. It is the frequency density that matters.
The frequency density was calculated as
$$\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}$$
so if we have a histogram and we want to retrieve the frequencies for each class we can rearrange this to give
$$\text{Frequency} = \text{Frequency density} \times \text{Class width}$$
For example, the frequency of the class $30 \le x < 50$ above is $0.6 \times 20 = 12$, as we can check from the table.
Note: Frequency density has units which should be included on the histogram, here students per mark. If it
is more convenient you could instead calculate students per 10 marks and alter the units accordingly. Only
use an unusual unit if you are confident; it is usually best just to use the basic calculation.
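The frequency density calculation for the marks table can be sketched in Python as follows. The tuple layout (lower boundary, upper boundary, frequency) and the function name `frequency_density` are our own illustrative choices.

```python
# Classes from the marks table: (lower boundary, upper boundary, frequency).
classes = [(0, 30, 4), (30, 50, 12), (50, 70, 37), (70, 100, 14)]

def frequency_density(lower, upper, frequency):
    """Frequency density = frequency / class width."""
    return frequency / (upper - lower)

densities = [frequency_density(*c) for c in classes]

# The modal class is the class with the highest frequency density.
modal_class = max(classes, key=lambda c: frequency_density(*c))

# Rearranging the formula recovers the frequencies: density x class width.
recovered = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, classes)]

for (lower, upper, f), d in zip(classes, densities):
    print(f"{lower} to {upper}: frequency {f}, density {d:.2f} students/mark")
print("modal class:", modal_class[0], "to", modal_class[1])
```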
[Figure: Histogram to show the marks of a group of students. Horizontal axis: Mark. Vertical axis: Frequency Density (students/mark).]
Note 2: In this diagram, not all of the classes have the same width, but you could still construct a histogram in
the same way if all of the classes do happen to have the same width.
How to manipulate statistics 2
Here is a histogram for another class of students. Which class do you think has done better overall based on these
histograms?
At first glance it looks like the second class has done a lot better than the first, with much higher frequency density
for the higher marks than the lower marks. But in fact both histograms are based on exactly the same data. The
only differences are:
a. In the second histogram the third and fourth classes have been combined to produce a $50 \le x < 100$ class
with frequency $37 + 14 = 51$ and frequency density $51/50 = 1.02$.
b. The frequency density axis in the second histogram has been altered so that it goes up to 1.2 rather than 2.
When compared side by side with the first histogram, this gives an impression of relatively higher frequency
densities.
Beware of cunning and devious (or simply lazy and ignorant) makers of graphs when reading newspapers as visual
representations of data can often be more misleading than they are helpful.
More examples of this sort of deception can be found in the excellent little book 'How to Lie With Statistics' by
Darrell Huff. You will be surprised how many examples of bad statistics you can easily find having read this.
Statistics and graphs are increasingly used to support arguments in professional contexts and honing your skills in
this way will give you a real advantage both in challenging incorrect examples and in not making the same mistakes
yourself.
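The effect of combining the top two classes in the second histogram can be checked with a short Python sketch. The helper `merge_last_two` is a hypothetical function written only for this illustration.

```python
# Classes from the first histogram: (lower boundary, upper boundary, frequency).
classes = [(0, 30, 4), (30, 50, 12), (50, 70, 37), (70, 100, 14)]

def merge_last_two(classes):
    """Combine the final two classes into a single wider class."""
    *rest, (lower1, _, f1), (_, upper2, f2) = classes
    return rest + [(lower1, upper2, f1 + f2)]

def density(lower, upper, frequency):
    return frequency / (upper - lower)

merged = merge_last_two(classes)

highest_before = max(density(*c) for c in classes)
highest_after = max(density(*c) for c in merged)

print("merged classes:", merged)
print("highest density before:", highest_before)
print("highest density after:", highest_after)
```

The data are unchanged, but the tallest bar drops from 1.85 to 1.02, which is why rescaling the vertical axis is needed to make the second histogram look impressive.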
[Figure: Histogram for the second class of students. Horizontal axis: Mark. Vertical axis: Frequency Density.]
5. Stem and leaf diagrams
Here are the marks obtained by 20 students in a test out of 100:
68 45 22 14 92 55 58 53 78 71
16 39 80 42 72 72 88 31 12 89
It is sensible here to choose intervals 10‐19, 20‐29, 30‐39, ...., 90‐99 for this data and we can represent the entries by
a stem and leaf diagram. Grouping in tens is common in stem and leaf diagrams but you could choose other
groupings so long as your Key makes this clear. The first five entries above can be represented as:
Stem | Leaf
1 | 4
2 | 2
3 |
4 | 5
5 |
6 | 8
7 |
8 |
9 | 2
Key: 2 | 2 = 22
And we can fill in the rest of the data to complete the diagram:
1 | 4 6 2
2 | 2
3 | 9 1
4 | 5 2
5 | 5 8 3
6 | 8
7 | 8 1 2 2
8 | 0 8 9
9 | 2
Key: 2 | 2 = 22
The words stem and leaf at the top are optional but the Key is a vital part of the diagram. It is also important that
the numbers in the leaves are uniformly spaced out, as the diagram should give a visual representation of the data.
The diagrams above are unordered stem and leaf diagrams. Ordered stem and leaf diagrams are more useful and so
usually the data is rearranged so that it is in ascending numerical order as follows:
1 | 2 4 6
2 | 2
3 | 1 9
4 | 2 5
5 | 3 5 8
6 | 8
7 | 1 2 2 8
8 | 0 8 9
9 | 2
Key: 2 | 2 = 22
Sometimes we want to compare two sets of data and in this case we can form a back to back stem and leaf diagram.
Suppose I want to compare the class above to another, whose test results were:
25 23 13 42 59 18 21 32 44 10
50 44 48 32 25 14 14 68 18 15
To make the diagram, we simply add these data to the left hand side of the stem as follows:
      Class 2 |   | Class 1
8 8 5 4 4 3 0 | 1 | 2 4 6
      5 5 3 1 | 2 | 2
          2 2 | 3 | 1 9
      8 4 4 2 | 4 | 2 5
          9 0 | 5 | 3 5 8
            8 | 6 | 8
              | 7 | 1 2 2 8
              | 8 | 0 8 9
              | 9 | 2
Key: 2 | 2 = 22
This diagram allows us to quickly compare the two sets of data ‐ we can see immediately that the second class has
not done as well.
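For readers who like to check diagrams by computer, the following Python sketch builds the ordered stem and leaf diagram for the first class. It assumes stems are tens and leaves are units, as in the Key; the function name `stem_and_leaf` is our own.

```python
from collections import defaultdict

marks = [68, 45, 22, 14, 92, 55, 58, 53, 78, 71,
         16, 39, 80, 42, 72, 72, 88, 31, 12, 89]

def stem_and_leaf(data):
    """Return {stem: ordered leaves}, with tens as stems and units as leaves."""
    rows = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    return dict(rows)

diagram = stem_and_leaf(marks)

# Print every stem in the range, including any stems with no leaves.
for stem in range(min(diagram), max(diagram) + 1):
    leaves = " ".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem} | {leaves}")
print("Key: 2 | 2 = 22")
```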
6. The Range and the Median
Let us consider again the first class of students from the section above. The ordered stem and leaf diagram
we produced earlier will now be useful to us so here it is again:
1 | 2 4 6
2 | 2
3 | 1 9
4 | 2 5
5 | 3 5 8
6 | 8
7 | 1 2 2 8
8 | 0 8 9
9 | 2
Key: 2 | 2 = 22
The lowest score was 12 and the highest was 92. We could say that the data range from 12 to 92 and we
define the range of the data as 92 minus 12, or 80.
Range = Maximum value - Minimum value
It is useful to be able to talk about the middle data point too, and this is called the median. If there was an
extra piece of data there would be 21 pieces of data and then the 11th would have 10 pieces of data below it
and 10 pieces of data above it and so we could say that this value was the median. Here we have an even
number of pieces of data, so there is no exact middle value. The 10th and 11th data points are 55 and 58
and so the best we can do is to say that the median is the average of 55 and 58, or 56.5.
Warning! We must be very careful using the word 'average' as I have done above as it could mean either
the mean, the median or even the mode. The usual meaning of 'average' is the mean, which we will come
to again later, but be very cautious when you hear someone talk about the 'average' as people often use
whichever of the mean or median is most convenient for the point they are trying to make, especially on
television or radio shows where there is little opportunity to view their detailed calculations!
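The range and median calculations above can be sketched in Python. The function `median` below simply follows the rule described in this section (middle value for an odd number of data points, average of the two middle values for an even number); it is an illustration rather than a required method.

```python
marks = [68, 45, 22, 14, 92, 55, 58, 53, 78, 71,
         16, 39, 80, 42, 72, 72, 88, 31, 12, 89]

def median(data):
    """Middle value of the ordered data, averaging the two middle values if needed."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

data_range = max(marks) - min(marks)

print("range:", data_range)
print("median:", median(marks))
```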
7. Quartiles, and the Inter‐Quartile Range
The median splits the data into a top half and a bottom half, but there is another half that is very interesting:
the middle half. We are often interested in how spread out this most central half of the data is.
This spread is called the Inter-Quartile Range, or IQR.
Before we can work out the IQR we must first calculate the Lower Quartile (LQ or $Q_1$) and the Upper
Quartile (UQ or $Q_3$). The Upper Quartile is the median of the top half of the data, and the Lower Quartile
is the median of the bottom half of the data.
So, returning to our example above, if we focus on the bottom half of the data we have 10 values and so the
Lower Quartile is the median of these.
1 | 2 4 6
2 | 2
3 | 1 9
4 | 2 5
5 | 3 5 8
6 | 8
7 | 1 2 2 8
8 | 0 8 9
9 | 2
Key: 2 | 2 = 22
As there are 10 values, we must average the 5th and 6th values to give $LQ = \frac{31 + 39}{2} = 35$.
Summary: How to find the median
Odd number of pieces of data - take the middle value
Even number of pieces of data - take the average of the two middle values
Similarly, the Upper Quartile is $UQ = \frac{72 + 78}{2} = 75$.
We can then define the Inter‐Quartile Range as
Inter-Quartile Range = Upper Quartile - Lower Quartile
IQR = UQ - LQ
So in our example, the IQR is 75 ‐ 35 = 40.
Note: In case you are wondering about the notation $Q_1$ and $Q_3$ for the lower and upper quartiles, $Q_2$ is
often used to denote the median.
Note: If we have an odd number of pieces of data, as in the data set below with 9 data points, we must
make a decision. Here the median is 58 and when calculating the Lower Quartile (and the Upper Quartile)
we have to choose whether to make the LQ the median of the first five pieces of data (including the median) or
the first four pieces of data (excluding the median).
13 32 47 55 58 59 69 71 74
There is no statistical convention on this, and it is rarely important in practice, but for the sake of consistency
we will choose to exclude the median for calculations. So, here the Lower Quartile would be based on 13,
32, 47, 55 and would be $LQ = \frac{32 + 47}{2} = 39.5$.
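The quartile rule used in these notes (take the median of each half, excluding the overall median when the number of data points is odd) can be sketched in Python. The function names are our own, and other textbooks or software may use a different quartile convention and give slightly different answers.

```python
def median(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def quartiles(data):
    """Return (LQ, UQ) as the medians of the bottom and top halves.

    For an odd number of data points the overall median is excluded
    from both halves, following the convention chosen in the text.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]
    return median(lower_half), median(upper_half)

marks = [68, 45, 22, 14, 92, 55, 58, 53, 78, 71,
         16, 39, 80, 42, 72, 72, 88, 31, 12, 89]
lq, uq = quartiles(marks)
iqr = uq - lq
print("LQ:", lq, "UQ:", uq, "IQR:", iqr)

odd_data = [13, 32, 47, 55, 58, 59, 69, 71, 74]
print("LQ and UQ for the odd-sized data set:", quartiles(odd_data))
```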
8. Box and whisker plots (Boxplots), Outliers (first definition)
Now we know how to calculate the quartiles, we can create a five‐point summary of our data to create another
visual representation of the data. A five‐point summary is just a list of the following five values. The
numbers are from the same example as above. These values are useful since roughly one quarter of the data lie
between each of the consecutive pairs of values.
$Q_0$    Minimum           12
$Q_1$    Lower Quartile    35
$Q_2$    Median            56.5
$Q_3$    Upper Quartile    75
$Q_4$    Maximum           92
A box and whisker plot is constructed by first drawing a box from the Lower Quartile to the Upper Quartile with a
vertical line to represent the median.
We then add the whiskers which go from the Lower Quartile to the Minimum and from the Upper Quartile to the
Maximum.
Having a scale on the horizontal axis is very important, but there is no vertical scale so it doesn't matter how fat or
thin your box is or how long any of the vertical lines are so long as the finished picture looks sensible.
There is one last component to the box and whisker plot, which is how we represent outliers. An outlier is a piece
of data which is far enough away from the centre to make us question whether or not it really is a valid data point.
For example, suppose that there was another piece of data in our list of marks, 155. We know that the test was out
of 100 and so this cannot be a real piece of data and must have been put in the table as a result of a human error. It
is probably really 15 or 55 but someone has mis-typed it. Outliers are not always as obvious as this and so we have
to devise some methods for identifying potential outliers. There are two tests and we will come to the second later.
For now, we will define an outlier as follows:
Outlier - First Definition (IQR Definition)
An outlier is a piece of data which is more than one and a half Inter-Quartile Ranges
below the Lower Quartile or above the Upper Quartile.
So we work out the IQR, here 40, and multiply it by 1.5 to give 60. Then a piece of data is an outlier if it is more than
60 above the UQ or more than 60 below the LQ. That is, it is an outlier if on the boxplot it would be more than 1.5
IQRs from the box. Here the boundary for outliers would be
below $LQ - 1.5 \times IQR = 35 - 60 = -25$
or above $UQ + 1.5 \times IQR = 75 + 60 = 135$
and so we do not have any outliers in our dataset. If the value of 155 were included though, it would be above 135
and so would be an outlier. We would represent it on our boxplot with a cross, as follows:
Note: Of course, if 155 were included the five-point summary would also change to incorporate it; this diagram is just
to illustrate how to draw the outlier if there is one.
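The first outlier test can be sketched in Python as below. As in the worked example, the sketch keeps the quartiles fixed at LQ = 35 and UQ = 75 when the mistyped mark of 155 is added, which is a simplifying assumption; the function names are hypothetical.

```python
marks = [68, 45, 22, 14, 92, 55, 58, 53, 78, 71,
         16, 39, 80, 42, 72, 72, 88, 31, 12, 89]

# Quartiles taken from the worked example in the text.
LQ, UQ = 35, 75

def outlier_boundaries(lq, uq):
    """Return the lower and upper boundaries, 1.5 IQRs beyond the quartiles."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

def find_outliers(data, lq, uq):
    """List the data points lying beyond either boundary."""
    low, high = outlier_boundaries(lq, uq)
    return [x for x in data if x < low or x > high]

boundaries = outlier_boundaries(LQ, UQ)
print("boundaries:", boundaries)
print("outliers in the original marks:", find_outliers(marks, LQ, UQ))
print("outliers with 155 added:", find_outliers(marks + [155], LQ, UQ))
```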
9. Cumulative frequency, cumulative frequency curves, percentiles
A group of students were asked about their journey times into school in the morning and the results were as follows:
Time ($x$ minutes)    Frequency
$0 < x \le 5$         24
$5 < x \le 10$        32
$10 < x \le 15$       48
$15 < x \le 20$       23
$20 < x \le 30$       18
$30 < x \le 45$       10
$45 < x \le 60$       5
We might be interested in knowing how many students take less than 15 minutes to travel to school, or how many
take less than 45 minutes, and the answers to these questions are called cumulative frequencies. So for example for
the $15 < x \le 20$ category the Frequency is 23 and the Cumulative Frequency is the total of all the Frequencies in the
first four categories up to and including the $15 < x \le 20$ category. So here it is $24 + 32 + 48 + 23 = 127$. You can
calculate the cumulative frequencies quickly by starting with the first category and repeatedly adding on the
frequency for the next category.
Time ($x$ minutes)    Frequency    Cumulative Frequency
$0 < x \le 5$         24           24
$5 < x \le 10$        32           56
$10 < x \le 15$       48           104
$15 < x \le 20$       23           127
$20 < x \le 30$       18           145
$30 < x \le 45$       10           155
$45 < x \le 60$       5            160
Now we have calculated the cumulative frequencies for each category we can draw a cumulative frequency curve to
represent the data, which will show us the number of students who take less than or equal to a given time to travel
to school. We plot the cumulative frequencies against the values at the upper end of each category, so for example
the Cumulative Frequency of 127 is plotted at 20 because we know there are 127 students who take less than or
equal to 20 minutes to travel to school.
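The running totals and the points to be plotted can be produced with a short Python sketch. It uses `itertools.accumulate` from the standard library; the variable names are our own.

```python
from itertools import accumulate

# Upper boundary of each class (minutes) and the class frequencies from the table.
upper_boundaries = [5, 10, 15, 20, 30, 45, 60]
frequencies = [24, 32, 48, 23, 18, 10, 5]

# Each cumulative frequency is the previous total plus the next frequency.
cumulative = list(accumulate(frequencies))

# Points for the graph: cumulative frequency against the upper class boundary,
# together with the starting point (0, 0).
points = [(0, 0)] + list(zip(upper_boundaries, cumulative))

print(cumulative)
print(points)
```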
We plot all of the cumulative frequencies and join with a smooth curve to form the cumulative frequency curve.
Note, we also plot the point $(0, 0)$ here at the bottom left as we know that no-one takes less than no minutes to
travel to school!
Warning! Cumulative frequency curves do not necessarily start at $(0, 0)$. If the first class had been $5 < x \le 10$
(i.e. we know that no student takes less than 5 minutes to travel to school), we would start the curve at $(5, 0)$
instead.
Rather than joining the points with a smooth curve, we could instead join them with straight lines. In this case the
diagram is called a cumulative frequency polygon and nothing else is different.
There is no particular rule as to which to use and here you can see that it has made little difference to the resulting
graph. If in doubt, draw a curve.
[Figure: Cumulative frequency curve showing journey times to school. Horizontal axis: Travel time. Vertical axis: Cumulative Frequency.]
[Figure: Cumulative frequency polygon showing journey times to school. Horizontal axis: Travel time. Vertical axis: Cumulative Frequency.]
10. Reading off the median, quartiles and percentiles from a cumulative frequency curve
We can use a cumulative frequency curve to approximate the median and lower and upper quartiles of the data.
The median is the middle value. Here there are 160 values, so the median should have roughly 80 of the pieces of
data below it and 80 above it. Hence the median has a cumulative frequency of 80 and if we read off the value with a
cumulative frequency of 80 we can approximate the median.
Note: The precise reader might suggest reading off 80.5 instead of 80, and this would in fact be slightly better, but
when we are dealing with cumulative frequency curves the number of pieces of data is usually large and since the
reading will only give us an approximation anyway it is common just to read off at 80 for convenience.
We can also read off the Lower and Upper quartiles in the same way, by looking for values with Cumulative
Frequencies one quarter and three quarters of the total number of pieces of data, so here 40 and 120 respectively.
We can also use cumulative frequency diagrams to read off percentiles in a similar way. For example, the 95th
percentile is the value below which 95% of the pieces of data lie. The median then is the 50th percentile, the Lower
Quartile is the 25th percentile and the Upper Quartile is the 75th percentile.
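Reading values off a graph can be imitated in Python if we assume the points are joined by straight lines, i.e. a cumulative frequency polygon rather than a smooth curve. Readings from a hand-drawn curve will differ slightly. The function `read_off` is a hypothetical helper that interpolates linearly between the plotted points.

```python
# Plotted points (upper class boundary, cumulative frequency) for the journey times.
points = [(0, 0), (5, 24), (10, 56), (15, 104), (20, 127),
          (30, 145), (45, 155), (60, 160)]

def read_off(points, target):
    """Travel time at which the polygon reaches the cumulative frequency `target`."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1:
            # Move along the line segment in proportion to the frequency needed.
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target is outside the range of the cumulative frequencies")

total = points[-1][1]  # 160 pieces of data

median = read_off(points, total / 2)          # cumulative frequency 80
lower_quartile = read_off(points, total / 4)  # cumulative frequency 40
upper_quartile = read_off(points, 3 * total / 4)  # cumulative frequency 120
percentile_95 = read_off(points, 0.95 * total)    # cumulative frequency 152

print("median:", median)
print("LQ:", lower_quartile, "UQ:", round(upper_quartile, 2))
print("95th percentile:", round(percentile_95, 2))
```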
11. Mean, mean square deviation, root mean square deviation
We have now met two measures of 'central tendency' or 'average':
a. the mode, the most common value; and
b. the median, the middle value
and you are already familiar with the third, the mean, which is calculated by adding up all the data and dividing by
the number of pieces of data you have. For example, the mean of 1, 4, 9, 16, 25 and 36 is
$$\frac{1 + 4 + 9 + 16 + 25 + 36}{6} = \frac{91}{6} \approx 15.17$$
The mean is often denoted by $\bar{x}$ (or $\bar{y}$ or $\bar{z}$ or similar - it's the bar that denotes the mean, the letter is arbitrary).
And we can write $\bar{x} = \frac{\sum x}{n}$, where the $\sum$ (sigma) means 'sum', or 'add up all the...' and $n$ is the number of
pieces of data. So to get the mean, we add up all the pieces of data, $x$, and divide by the total number of pieces of
data.
We have also met some measures of dispersion (or spread), namely the range and the inter‐quartile range. We
now consider some other such measures based on the mean rather than the median.
For illustration, let us consider the following set of 10 data points:
5 12 45 89 123 158 232 288 314 404
Adding up the data tells us that $\sum x = 1670$ and so the mean is $\bar{x} = \frac{1670}{10} = 167$.
One way we might think about measuring spread, is to look at the average of the distances of the data points from
the mean. A quick consideration of this will make you realise that the positive and negative differences will cancel
out and this will not be useful. Some statisticians simply ignore the signs and consider the modulus (or absolute
value) of the differences from the mean, but we will adopt the more common route of looking at the average
squared differences from the mean.
For example, the data point 45 is $45 - 167 = -122$ from the mean, so its squared difference from the mean is
$(-122)^2 = 14884$. For each piece of data $x$, the squared difference from the mean is $(x - \bar{x})^2$ and it is useful to
introduce notation for the sum of these values:
$$S_{xx} = \sum (x - \bar{x})^2$$
Even with a calculator, this could be quite time consuming to calculate, but fortunately there is another way we can
calculate $S_{xx}$, which is to simply add up all of the values squared, that is $\sum x^2$, and to subtract $n$ times the
mean squared, $n\bar{x}^2$. The equivalence of this is not too difficult to justify but we will skip a proof here to avoid
distraction and accept that another way of writing $S_{xx}$ is:
$$S_{xx} = \sum x^2 - n\bar{x}^2$$
Note: Be careful, this means $S_{xx} = \left(\sum x^2\right) - n\bar{x}^2$, i.e. we add up all the $x^2$ but just subtract $n\bar{x}^2$ once.
So in our example, we already know $n = 10$ and $\bar{x} = 167$, so $n\bar{x}^2 = 278890$, and
$$\sum x^2 = 5^2 + 12^2 + 45^2 + 89^2 + 123^2 + 158^2 + 232^2 + 288^2 + 314^2 + 404^2 = 448788$$
and so
$$S_{xx} = 448788 - 278890 = 169898$$
This is the total of the squared differences from the mean. We began by looking for the average of the squared
differences from the mean, and so we must divide by the number of pieces of data $n$. This gives us the mean square
deviation, here $\frac{169898}{10} = 16989.8$. By square rooting this value, we can compensate for the fact that we have
squared all the differences to give a value which is closer to the size of the differences. The square root of the mean
square deviation is called the root mean square deviation, here $\sqrt{16989.8} \approx 130.3$.
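As a check on the arithmetic, the following Python sketch computes $S_{xx}$ both ways for the ten data points and then the mean square deviation and root mean square deviation. The variable names are our own.

```python
from math import sqrt

data = [5, 12, 45, 89, 123, 158, 232, 288, 314, 404]

n = len(data)
mean = sum(data) / n

# Definition: sum of the squared differences from the mean.
s_xx_definition = sum((x - mean) ** 2 for x in data)

# Shortcut: sum of the squares, minus n times the mean squared (subtracted once).
s_xx_shortcut = sum(x ** 2 for x in data) - n * mean ** 2

msd = s_xx_shortcut / n   # mean square deviation
rmsd = sqrt(msd)          # root mean square deviation

print("mean:", mean)
print("S_xx (definition):", s_xx_definition)
print("S_xx (shortcut):", s_xx_shortcut)
print("msd:", msd, "rmsd:", round(rmsd, 1))
```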
Summary: mean square deviation (msd) and root mean square deviation (rmsd)
$$\text{mean square deviation} = \frac{S_{xx}}{n} \qquad \text{root mean square deviation} = \sqrt{\frac{S_{xx}}{n}}$$
where $S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2$
12. Variance and standard deviation, outliers (second definition)
For technical reasons, the formulae for mean square deviation and root mean square deviation are often altered to
divide by $(n - 1)$ instead of $n$. The resulting statistics are called the variance and standard deviation respectively.
Nothing in the calculations changes apart from this and so the box below looks very similar to the one above.
Note: The letter $s$ is used to denote the standard deviation and $s^2$ for the variance. In another part of the
Statistics 1 course we will meet standard deviation and variance in the context of random variables where the Greek
$\sigma$ and $\sigma^2$ will be used. In the context of data, the Roman $s$ should always be used.
Variance and Standard Deviation
$$\text{variance} = s^2 = \frac{S_{xx}}{n - 1} \qquad \text{standard deviation} = s = \sqrt{\frac{S_{xx}}{n - 1}}$$
where $S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2$
In section 8, we considered a method of classifying pieces of data as outliers based on the quartiles and the
inter-quartile range. There is also a method of identifying outliers based on the mean and the standard deviation. There
is no set rule as to which definition to use; we usually use whichever is most convenient given the statistics we are
able to calculate (or have already calculated) from the data.
When we look for outliers, we are trying to identify pieces of data which are not genuine. The fact that a piece of
data passes one or both of the tests for an outlier raises a suspicion, but we then must consider qualitative
information to decide whether or not to exclude the data from our considerations. For example, we previously
considered a mark of 155%. Our statistical procedure flagged the data as an outlier and then our knowledge that
you cannot score over 100% on the test led us to exclude the data point. If we were to consider lottery prizes and
had a dataset which included lots of 0s, a couple of 10s and one entry of 65487 then the 65487 would certainly be
flagged as a statistical outlier. We would not want to exclude this piece of data however, as it simply shows that the
person has won a large prize, information that is highly relevant to the data set.
Outlier ‐ Second Definition (Variance Definition)
An outlier is a piece of data which is more than two standard deviations above or below the mean.
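The variance, standard deviation and the second outlier test can be sketched in Python using the ten data points from section 11. This is only an illustration of the definitions; the variable names are our own.

```python
from math import sqrt

data = [5, 12, 45, 89, 123, 158, 232, 288, 314, 404]

n = len(data)
mean = sum(data) / n
s_xx = sum(x ** 2 for x in data) - n * mean ** 2

# Variance and standard deviation divide by (n - 1) rather than n.
variance = s_xx / (n - 1)
standard_deviation = sqrt(variance)

# Second definition: more than two standard deviations from the mean.
outliers = [x for x in data if abs(x - mean) > 2 * standard_deviation]

print("variance:", round(variance, 1))
print("standard deviation:", round(standard_deviation, 1))
print("outliers:", outliers)
```

Here no data point is flagged, since even the largest value, 404, is only 237 above the mean of 167, which is less than two standard deviations.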
13. Calculating the mean and variance from an ungrouped frequency table
A group of students were asked how many siblings they have and the results were as follows:
Number of siblings, $x$    Frequency, $f$
0                          17
1                          23
2                          18
3                          9
4                          3
If we want to calculate the mean and variance, one way to do this would be to write all the data out in a list
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,4,4,4
and to proceed as before. We would need to add up all the data for the mean, and add up all the
squares of the data points for the calculation of the variance. But this is tedious, and since so many of the data
points are the same we can speed up the process. For example there are 18 '2's and so the sum of the squares of
these data is $18 \times 2^2$. Hence we fill in the table as follows:
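Number of siblings, $x$    Frequency, $f$        $xf$              $x^2$    $x^2 f$
0                          17                    0                 0        0
1                          23                    23                1        23
2                          18                    36                4        72
3                          9                     27                9        81
4                          3                     12                16       48
Totals                     $n = \sum f = 70$     $\sum xf = 98$             $\sum x^2 f = 224$
We can now quickly calculate the mean
$$\bar{x} = \frac{\sum xf}{n} = \frac{98}{70} = 1.4$$
and the variance
$$s^2 = \frac{S_{xx}}{n - 1} = \frac{\sum x^2 f - n\bar{x}^2}{n - 1} = \frac{224 - 70 \times 1.4^2}{69} \approx 1.26$$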
The values of $\sum xf$, $\sum x^2 f$ etc. are called summary statistics and will sometimes be calculated for you in exam
questions so you don't have to carry out too many routine calculations under timed conditions, but you cannot rely
on this and so you should be able to do the calculations from scratch if necessary.
Modern calculators have statistical functions which can eliminate some of the work in carrying out these
calculations, but you need to practise in advance to know how to use these and you should only use the calculator if
you are confident you know what it is doing. Most calculators have many different settings and you should do at
least one calculation both by hand and using the calculator to make sure you have chosen the correct ones!
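The summary statistics for the siblings table, and the mean and variance that follow from them, can be checked with a short Python sketch. The dictionary `table` and the other names are our own.

```python
# Frequency table from the text: number of siblings x -> frequency f.
table = {0: 17, 1: 23, 2: 18, 3: 9, 4: 3}

# Summary statistics.
n = sum(table.values())                            # sum of f
sum_xf = sum(x * f for x, f in table.items())      # sum of xf
sum_x2f = sum(x ** 2 * f for x, f in table.items())  # sum of x^2 f

mean = sum_xf / n
variance = (sum_x2f - n * mean ** 2) / (n - 1)

print("n:", n, "sum xf:", sum_xf, "sum x^2 f:", sum_x2f)
print("mean:", mean)
print("variance:", round(variance, 2))
```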
14. Estimating the mean and median from a grouped frequency table
The amount of pocket money received by a group of children was recorded as follows:
Amount of pocket money, $x$ (£)    Frequency
$0 \le x < 5$                      8
$5 \le x < 10$                     32
$10 \le x < 20$                    24
$20 \le x < 50$                    6
We cannot calculate the mean of this data precisely since we do not have the raw data; we only know which
category each of the pieces of data falls into. The best we can do then is to make an estimate of the mean, based
on the assumption that each of the pieces of data falls exactly in the middle of its interval.
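Amount of pocket money, $x$ (£)    Frequency, $f$    Mid-interval value, $m$    $mf$
$0 \le x < 5$                      8                 2.5                        20
$5 \le x < 10$                     32                7.5                        240
$10 \le x < 20$                    24                15                         360
$20 \le x < 50$                    6                 35                         210
We can then estimate the mean as
$$\frac{\sum mf}{n} = \frac{20 + 240 + 360 + 210}{8 + 32 + 24 + 6} = \frac{830}{70} \approx 11.86$$
that is, £11.86.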
Similarly, we could estimate the variance (or standard deviation or msd or rmsd) in the same way, by assuming that
all of the pieces of data fall at the relevant mid‐interval values and completing the calculation as before.
We could also estimate the median from this data. There are 70 pieces of data, so we would like to average the
35th and the 36th pieces of data. We do not have the exact data, but we can tell from the information given that
both of these values lie in the $5 \le x < 10$ interval. Furthermore, we know that 32 pieces of data lie in this interval,
and that there are 8 in the interval below. Hence the 35th and 36th data points will be the 27th and 28th data
points in the $5 \le x < 10$ interval. Hence we can make an estimate of the median as:
$$5 + \frac{27.5}{32} \times (10 - 5) = 9.30$$
to 2 decimal places.
If desired, we could also estimate the quartiles and other percentiles in the same way.
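The interpolation used for the median can be sketched as a small Python function. The helper name `estimate_position` is hypothetical, and the sketch assumes the data are spread evenly within each interval. The position passed in (35.5 for the median of 70 values) is supplied by the caller, so the same function could be reused for quartiles or percentiles once their positions have been decided.

```python
def estimate_position(intervals, frequencies, position):
    """Estimate the value at a given position (counted from 1, may be
    fractional) by linear interpolation within the class containing it."""
    below = 0  # number of data points in the classes already passed
    for (lo, hi), f in zip(intervals, frequencies):
        if position <= below + f:
            # Fraction of the way through this class, scaled by the class width.
            return lo + (position - below) / f * (hi - lo)
        below += f
    raise ValueError("position exceeds the number of data points")


intervals = [(0, 5), (5, 10), (10, 20), (20, 50)]
frequencies = [8, 32, 24, 6]
n = sum(frequencies)

# Median position: halfway between the 35th and 36th values.
median_estimate = estimate_position(intervals, frequencies, (n + 1) / 2)
print(round(median_estimate, 2))  # 9.3
```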
15. Linear Coding
Suppose you are a professional who charges a fixed call out rate and then an additional fee for each hour you spend on a
job. You have gathered data on the amount of time you have spent on jobs in the last few months and calculated
the mean and standard deviation.
Suppose your call out rate is £50 and you then charge £80 an hour for your services. If $\bar{x}$ is the mean number of
hours you spend on a job and $s_x^2$ the variance, what are the mean and variance of the amount of money you
receive, which we shall call $\bar{y}$ and $s_y^2$?
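Mean: Each time you do a job you get £50 plus £80 times the number of hours the job takes. So if you take $x$
hours on a job you get paid $y = 50 + 80x$ pounds. If you do $n$ jobs, the mean number of hours is $\bar{x} = \frac{\sum x}{n}$
and so the mean pay you receive is:
$$\begin{aligned}
\bar{y} = \frac{\sum y}{n} &= \frac{50n + 80\sum x}{n} \\
&= 50 + 80\,\frac{\sum x}{n} \\
&= 50 + 80\bar{x}
\end{aligned}$$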
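Variance: The variance of the number of hours is $s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
and so the variance of the pay you receive is:
$$\begin{aligned}
s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} &= \frac{\sum \left(50 + 80x - (50 + 80\bar{x})\right)^2}{n - 1} \\
&= \frac{\sum (80x - 80\bar{x})^2}{n - 1} \\
&= 80^2\,\frac{\sum (x - \bar{x})^2}{n - 1} \\
&= 80^2 s_x^2
\end{aligned}$$
Similarly, the standard deviation of the pay you receive is $s_y = \sqrt{s_y^2} = 80 s_x$.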
This method is called linear coding and allows us to calculate the values we are interested in without first going back
and calculating the amount of money we have received for each job.
In general, if we have a data set $x$ and apply the linear coding $y = ax + b$, the following results are
true:
Mean: $\bar{y} = a\bar{x} + b$
Variance: $s_y^2 = a^2 s_x^2$
Standard deviation: $s_y = |a| s_x$ (which is simply $a s_x$ when $a$ is positive)
The results for the variance and standard deviation also hold for the msd and rmsd respectively.
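As a quick numerical check of these results, the following Python sketch compares the coded values with those obtained by working out the pay for every job first. The job durations are hypothetical figures made up for illustration. The standard library functions `statistics.variance` and `statistics.stdev` use the divisor n - 1, matching the variance used in this section.

```python
from math import isclose
from statistics import mean, stdev, variance

# Hypothetical job durations in hours (illustrative only, not from the notes).
hours = [1.5, 2.0, 3.5, 4.0, 6.0]

a, b = 80, 50  # £80 per hour plus a £50 call out rate
pay = [a * x + b for x in hours]

# Results obtained by linear coding from the summary statistics of the hours.
coded_mean = a * mean(hours) + b
coded_variance = a ** 2 * variance(hours)
coded_sd = abs(a) * stdev(hours)

# Compare with the results calculated directly from the pay for each job.
print(isclose(coded_mean, mean(pay)))          # True
print(isclose(coded_variance, variance(pay)))  # True
print(isclose(coded_sd, stdev(pay)))           # True
```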
16. Advantages and Disadvantages of the mean, median, mode and midrange
In common language the word 'average' usually means the mean. However the mean, median, mode and midrange
are all averages, and which is the most suitable and useful depends both on the particular data we are looking at and
the reason we want an average. This is why some of the advantages in the table below also appear as potential
disadvantages. Using an inappropriate average can give a misleading or unrepresentative figure.
Listed below are some possible advantages and disadvantages of each. This list is by no means comprehensive.
Mean
  Advantages: All pieces of data are used in the calculation. Useful when the total quantity is also of interest.
  Disadvantages: Will be skewed (i.e. changed significantly) by the presence of a few exceptionally small or large
  pieces of data.
Median
  Advantages: Will not be significantly skewed by outliers. This is why the median is often used for average salaries,
  for example, where there are often one or two people who earn a lot more than everybody else.
  Disadvantages: Does not take all pieces of data fully into account and ignores the presence of exceptionally high or
  low data points which may be of significance.
Mode
  Advantages: Can be useful where there are just a few possible values, for example the number of A grades students
  have achieved at A-Level.
  Disadvantages: Can be very misleading, especially when there are a large number of different values which all have
  low frequencies.
Midrange
  Advantages: There are very few situations in which the midrange is the best average to use. One advantage is that it
  is easy to calculate.
  Disadvantages: Only takes into account the lowest and highest pieces of data. As such it is heavily affected by
  outliers.
Note: The midrange is defined as
$$\text{Midrange} = \frac{\text{Minimum value} + \text{Maximum value}}{2}$$
It is literally in the middle of the range of values in the data set, hence the name!
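To see how differently the four averages can behave, here is a short Python sketch using a hypothetical set of salaries with one very large value. The figures are invented for illustration and do not come from these notes.

```python
from statistics import mean, median, mode

# Hypothetical salaries in thousands of pounds (illustrative only).
salaries = [18, 20, 20, 22, 25, 27, 150]

midrange = (min(salaries) + max(salaries)) / 2

print(round(mean(salaries), 2))  # 40.29, pulled upwards by the single large salary
print(median(salaries))          # 22
print(mode(salaries))            # 20
print(midrange)                  # 84.0
```

Here the median and mode stay close to the bulk of the data, while the mean and especially the midrange are dragged towards the one exceptional value.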
17. Skewness
Distributions can be symmetrical, or skewed either positively or negatively.
In a positively skewed distribution, the mean is pulled in the positive direction by the presence of some values which are significantly larger than the bulk of the distribution. The median is not as badly affected by a few large values and
stays nearer the bulk of the distribution.
In a negatively skewed distribution, the reverse is true. So there are some values significantly smaller than the bulk of
the distribution which pull, or skew, the mean in the negative direction from the median.
If a distribution has no skew then it is described as symmetrical.
The above diagrams are all frequency diagrams. Some students have found the following pictures useful to
memorise positive and negative skew. How would you feel if you were either of the cyclists below?
[Pictures: two cyclists, captioned "Feeling positive!" and "Feeling negative!"]
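The relationship between the mean and the median described in this section suggests a rough check for the direction of skew, sketched below in Python. The function name `skew_direction` and the data sets are hypothetical. This is only an informal indicator, not a formal measure of skewness, and a mean equal to the median does not guarantee that a distribution is symmetrical.

```python
from statistics import mean, median


def skew_direction(data):
    """Rough indicator of skew: compare the mean with the median."""
    m, med = mean(data), median(data)
    if m > med:
        return "positive"  # mean pulled above the median by large values
    if m < med:
        return "negative"  # mean pulled below the median by small values
    return "no skew indicated"


print(skew_direction([1, 2, 2, 3, 3, 3, 4, 20]))    # positive
print(skew_direction([1, 17, 18, 18, 19, 19, 20]))  # negative
print(skew_direction([1, 2, 3, 4, 5]))              # no skew indicated
```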
We can also identify skewness from histograms, boxplots and cumulative frequency diagrams:
[Figure: grid of diagrams with columns "Positive skew", "Symmetrical" and "Negative skew", and rows "Frequency diagram/histogram", "Cumulative frequency diagram" and "Boxplot".]