
Mathematical Studies Standard Level for the IB Diploma

Revision Topic 4: Statistical applications

The normal distribution curve

The shape of data plotted in a histogram can be compared to the normal distribution curve. This is a standardised view of how data can be distributed, and it has the following properties:

• bell-shaped

• symmetrical about the mean value, μ

• equal values for the mean, median and mode

• area under the curve equals 1

• about 68% of the data lies within 1 standard deviation, σ, of the mean

• about 95% of the data lies within 2 standard deviations of the mean

• about 99% of the data lies within 3 standard deviations of the mean

• to find the standard deviation marks along the horizontal axis for the percentages 68%, 95% and 99%, start at the middle (the mean) and add or take away the correct number of standard deviations.

The normal distribution is written using this notation:

X ∼ N(μ, σ²)

In a question you may be told that some data, X, follows the normal distribution, for example X ∼ N(12, 5²), or you may just be given the values of μ and σ.
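The 68%, 95% and 99% figures are rounded values. As a rough check, the sketch below recomputes them with Python's standard `statistics.NormalDist` class; the helper name `within` is introduced here for illustration only and is not part of the course material.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
Z = NormalDist(mu=0, sigma=1)

def within(k):
    """Proportion of the data lying within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within(k):.1%}")
```

This prints 68.3%, 95.4% and 99.7%, which agree with the rounded percentages used in these notes.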

Chapter 11: The normal distribution

Copyright Cambridge University Press 2014. All rights reserved.


Probability calculations using the normal distribution

You may be asked for the probability that an event will happen or the percentage of time that an event occurs. These mean the same thing: you should work out the area under the curve between two points.

You should use your GDC to do this, obtaining both a graph of the relevant area and the value that you want.

Questions that ask you to do probability calculations with the normal distribution will give you the following information:

• that the data follows a normal distribution or is ‘normally distributed’

• the value of the mean

• the value of the standard deviation

• one or two boundary values.

When given boundary values, the question will ask you to calculate a probability associated with one of the following situations:

Situation given in the question   Lower value to enter into GDC   Upper value to enter into GDC
More than a value                 the value you are given         99999
Between two values                the lower value you are given   the higher value you are given
Below a value                     −99999                          the value you are given

Using your GDC

It is easier to follow what to do on your GDC if we look at a particular example:

A large number of mobile phone calls were monitored, and their lengths were recorded to the nearest minute. The call lengths were found to be normally distributed with a mean of 12 minutes and a standard deviation of 5 minutes. What is the percentage of calls that lasted over 15 minutes?

This question gives you the following values:

• lower boundary value = 15

• upper boundary value = 99999

• μ = 12

• σ = 5



Texas TI-84

Before drawing the graph on the TI-84, you need to set your window so you can actually see the graph. Set the window boundaries as follows:

Xmin = μ − 3σ
Xmax = μ + 3σ
Ymin = −0.25
Ymax = 0.25

Select ShadeNorm from the DISTR menu (under DRAW). This will give you ShadeNorm(…); then the values should be entered in this order: lower, upper, mean, standard deviation.

Then draw the graph and read off the probability value (Area).

Casio fx-9750GII

Get to the variable screen.

Input the values in this order: lower, upper, standard deviation, mean.

Then draw the graph and read off the probability value.

The GDC gives P = 0.274, so the percentage of calls lasting more than 15 minutes is 27.4%.
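If you want to check the GDC answer away from the calculator, the same area can be found with Python's standard `statistics.NormalDist` class. This is only a sketch for checking; the variable names are chosen here for illustration, and in the exam you are expected to use your GDC.

```python
from statistics import NormalDist

# Call lengths: mean 12 minutes, standard deviation 5 minutes.
calls = NormalDist(mu=12, sigma=5)

# "More than 15" is the area to the right of 15, i.e. 1 minus the area to the left.
p_over_15 = 1 - calls.cdf(15)
print(round(p_over_15, 3))  # 0.274

# "Between two values" is the difference of two left-hand areas,
# for example between 10 and 15 minutes:
p_between = calls.cdf(15) - calls.cdf(10)
print(round(p_between, 3))
```

Because `cdf` works with the exact left-hand area, no stand-in value such as 99999 is needed for the upper boundary.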



Inverse normal calculations

If you know the probability of an event happening, along with the mean and standard deviation of the normal distribution, you can work out the boundary value(s) of the event.

For example, if 30% of a group of students scored below the pass mark on a test, and you know that μ = 45 and σ = 7.2 for the test scores, then you can find the pass mark.

To do this, you need to use the ‘inverse normal’ function on your GDC.

You will need to input the following values to calculate the boundary value:

Area under the normal curve: This is the given probability or percentage written as a decimal.
σ: The standard deviation
μ: The mean

So, for the example above you would do the following:

Texas TI-84

Navigate to the inverse normal function.

Enter the known values in this order: Area, μ, σ

Casio fx-9750GII

Input the values given in the question.

Enter the values of Area, σ, μ. ‘Tail’ means the side of the graph that is shaded and depends on whether the probability given in the question corresponds to ≤ or ≥ the boundary value:
≤ is left
≥ is right

Note: The TI-84 always gives the ≤ value, that is, the left-tail boundary. If the probability given in the question corresponds to ≥ the boundary value (right tail), subtract the given probability from 1 and enter that result as the Area.

So the pass mark was 41.2 (or 41 to the nearest whole number).
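The pass mark can be checked with the `inv_cdf` method of Python's standard `statistics.NormalDist` class, which works like a left-tail inverse normal function. The sketch below is for checking only; the right-tail line is an extra illustration that is not part of the question above.

```python
from statistics import NormalDist

# Test scores: mean 45, standard deviation 7.2.
scores = NormalDist(mu=45, sigma=7.2)

# Left tail: 30% of students scored BELOW the pass mark.
pass_mark = scores.inv_cdf(0.30)
print(round(pass_mark, 1))  # 41.2

# Right tail (illustration only): the mark that 30% of students scored ABOVE.
# inv_cdf always uses the left-hand area, so enter 1 - 0.30.
top_30_boundary = scores.inv_cdf(1 - 0.30)
print(round(top_30_boundary, 1))  # 48.8
```

The two boundaries are the same distance from the mean, which follows from the symmetry of the curve.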



The concept of correlation

Bivariate data: Data that consists of measurements of two variables collected from each individual in a sample
Correlation: The relationship between the two variables of bivariate data

The variables of bivariate data can be classified as follows:

Independent variable: Variable that is controlled by the person conducting the study
Dependent variable: Observed variable that should demonstrate the effect of the hypothesised relationship

For example, in the hypothesis

‘A greater number of calories eaten per day will make a person heavier’

the independent variable is the number of calories consumed per day and the dependent variable is the person’s weight.

Scatter diagrams

The easiest way to see if there is a pattern in bivariate data is to draw a scatter diagram by creating coordinates from your data in this order:

(independent variable value, dependent variable value)

Then plot these coordinates on a grid and look at the grouping of the points to determine what type of correlation there is.

Positive correlation: As one variable increases, so does the other.
No correlation: No apparent relationship.
Negative correlation: As one variable increases, the other decreases.

Chapter 12: Correlation



Correlation and causation

• Just because two variables have a correlation, it doesn’t mean that one causes the other. Be cautious when making judgements based on data.

• Don’t forget to consider all the variables that might affect the results.

Line of best fit

To highlight the relationship between the two variables of bivariate data, you should draw a line of best fit on your scatter diagram.

To do this, follow these steps:

• Find the mean of each variable (i.e. the data plotted along the x-axis and the data plotted along the y-axis), giving you the mean point (x̄, ȳ).

• Plot the mean point on the scatter diagram.

• Draw a line through the mean point so that the other points of the scatter diagram are spread evenly above and below the line.

This line represents the relationship between the two variables.

Drawing a scatter diagram and line of best fit on your GDC


Put the bivariate data into your GDC: enter it in the data table as two lists, the first for the independent variable and the second for the dependent variable.



Set the graph type to ‘scatter’.

Then draw the graph.

Once you have created the scatter diagram, you can get the GDC to calculate the line of best fit along with a measure of the strength of the correlation.


In this case the equation of the line of best fit (regression line) is y = −0.944x + 10.3

To draw the regression line on the TI-84, you need to manually enter this equation in the [Y=] screen.

To draw the regression line on the scatter diagram:



Pearson’s product moment correlation coefficient

This is a measure, based on the data and the line of best fit, which tells you how strong the correlation is. Remember the following points:

• This coefficient is usually denoted by r.

• −1 ≤ r ≤ 1

• If r = +1, there is a perfect positive correlation.

• If r = 0, there is no linear correlation.

• If r = −1, there is a perfect negative correlation.

• If the value of r is between −0.5 and 0.5, the correlation is too weak to draw any meaningful conclusions from the regression line.

• The closer to ±1 the value of r is, the stronger the correlation.

In the GDC example above, r = −0.913, which indicates a very strong negative correlation.
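To see where a value of r comes from, the sketch below computes it directly from its definition. The data set is hypothetical, invented here for illustration (it is not the data behind the GDC example), and the helper names `pearson_r` and `describe` are likewise assumptions of this sketch.

```python
from math import sqrt

# Hypothetical bivariate data, invented for illustration only
# (NOT the data set behind the GDC example in these notes).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [9.5, 8.1, 7.9, 6.0, 5.8, 4.9, 3.2, 3.1]

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient for paired lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_yy = sum((b - y_bar) ** 2 for b in ys)
    return s_xy / sqrt(s_xx * s_yy)

def describe(r):
    """Rough reading of r, using the -0.5 to 0.5 'too weak' band from the notes."""
    if -0.5 < r < 0.5:
        return "too weak to draw conclusions"
    return "positive correlation" if r > 0 else "negative correlation"

r = pearson_r(x, y)
print(round(r, 3), describe(r))
```

In the exam you would read r from the GDC; the calculation is shown only to make clear that r depends on how the points vary together relative to how much each variable varies on its own.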

Regression line of y on x

A regression line of y on x is the line of best fit that minimises the sum of the squared vertical distances between the data points and the line. Remember that:

• The line has an equation of the form y = ax + b

• You should use your GDC to find the values of a and b.

• You should rearrange the equation so that it is written sensibly.

• If the correlation is strong, you can use the regression line to predict values.

• If the correlation is weak (i.e. −0.5 < r < 0.5), then you should not predict values using the regression line.

• You should not use a regression line to predict values outside the range of data given.

In the GDC example above, a = −0.944 and b = 10.3, so the equation of the regression line is y = −0.944x + 10.3, which could also be written as y + 0.944x = 10.3.
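The values of a and b that the GDC reports can be reproduced with the usual least-squares formulas. The sketch below uses a hypothetical data set invented for illustration (not the data behind the GDC example), and the function names `regression_line` and `predict` are assumptions of this sketch.

```python
# Hypothetical bivariate data, invented for illustration only
# (NOT the data set behind the GDC example in these notes).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [9.5, 8.1, 7.9, 6.0, 5.8, 4.9, 3.2, 3.1]

def regression_line(xs, ys):
    """Least-squares regression line of y on x, returned as (a, b) for y = ax + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((p - x_bar) * (q - y_bar) for p, q in zip(xs, ys))
    s_xx = sum((p - x_bar) ** 2 for p in xs)
    a = s_xy / s_xx          # gradient
    b = y_bar - a * x_bar    # intercept: forces the line through the mean point
    return a, b

a, b = regression_line(x, y)
print(f"y = {a:.3f}x + {b:.3f}")

def predict(x_value):
    """Predict y, refusing to go outside the range of the data."""
    if not min(x) <= x_value <= max(x):
        raise ValueError("outside the range of the data: do not predict here")
    return a * x_value + b

print(round(predict(4.5), 2))
```

Because b is calculated from the means, the line always passes through the mean point, which matches the hand-drawn method described earlier. The range check in `predict` reflects the rule that a regression line should not be used outside the range of the data.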



The chi-squared test is used to see if two variables are independent. It can also be used to assess whether data differs significantly from what is expected; this is called a ‘goodness-of-fit’ test.

Expected frequencies

First, you need to be able to work out the frequencies that you would ‘expect’ to see, based on some hypothesis you assume for the data. How this is done depends on the type of problem you have.

• For a goodness-of-fit test: Assuming a certain theoretical distribution for the data, the expected frequency of each outcome would be

expected frequency = total frequency × probability of that outcome occurring

• For a test of independence of two variables: Given a two-way table summarising the observed frequencies of the data, the expected frequency for each cell would be

expected frequency = (row total × column total) / total

(This is the probability of the row outcome multiplied by the column total, or vice versa, and it gives you the correct share of the total you should expect.)

The χ² statistic

The χ² statistic is a measure of the discrepancy between the observed and expected frequencies. You should use your GDC to calculate the χ² statistic, and then interpret it in relation to the following:

Critical χ² value: The threshold value above which the discrepancy is considered significant. In exam questions you will be given this value.

Significance level, α: The maximum probability of making a mistake in your conclusion, deciding that the result is significant when actually it isn’t. In questions you will be given this value, and it is normally 1%, 5% or 10%.

Null hypothesis, H₀: The hypothesis that the factors being tested are independent.

Alternative hypothesis, H₁: The hypothesis that the factors being tested are dependent.

p-value: The probability of getting a discrepancy at least as large as the calculated χ² statistic if the theoretical distribution or null hypothesis were correct.

Degrees of freedom: The number of outcomes that can be independent, given that the total frequency is fixed.
In the goodness-of-fit test, degrees of freedom = number of outcomes − 1
In the independence test, degrees of freedom = (number of rows − 1) × (number of columns − 1)

Chapter 13: Chi-squared hypothesis testing



Using your GDC to calculate the χ² statistic

Suppose you need to do a goodness-of-fit test on the following data, where the theoretical distribution assumes equally likely outcomes:

A    B    C    D
11   6    9    10

Put the data into your GDC as a list.


Go into list mode.

Enter your data (observed frequencies) in list 1 and the expected frequencies in list 2.

Access the χ² statistic and p-value.
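The goodness-of-fit calculation for the A, B, C, D data can be written out by hand as a check on the GDC. The sketch below is illustrative only: the 5% significance level is an assumption (the question as stated does not give one), and 7.815 is the standard table critical value for 3 degrees of freedom at that level. In an exam the critical value would be given to you.

```python
# Observed frequencies for outcomes A, B, C, D (from the table above).
observed = [11, 6, 9, 10]
total = sum(observed)

# Equally likely outcomes: each has probability 1/4.
probabilities = [1 / 4] * len(observed)
expected = [total * p for p in probabilities]

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# ASSUMPTION: 5% significance level; 7.815 is the table value for df = 3.
critical_value = 7.815

print(expected)           # [9.0, 9.0, 9.0, 9.0]
print(round(chi2, 3), df)  # 1.556 3
print("good fit" if chi2 < critical_value else "not a good fit")
```

Here the statistic is well below the critical value, so under the assumed 5% level the data is consistent with equally likely outcomes.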



Suppose you need to perform an independence test on the following bivariate data, given in a two-way frequency table:

     A    B    C    D
a    15   12   8    6
b    6    11   7    7
c    9    6    14   20

In this case, put the data into a table or matrix.


Go into matrix edit mode.

Set the size of the matrix: enter the number of rows first (3) followed by the number of columns (4). Then input your data.

Access the χ² statistic and p-value.

If you need to see the expected frequencies (which are calculated automatically by the GDC), open ‘matrix B’.
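The expected frequencies that the GDC stores can be reproduced from the row and column totals, and the statistic follows from them. The sketch below uses the two-way table from this example; the 5% significance level is an assumption made here, and the helper `p_value_even_df` is a hypothetical name for a closed-form formula that is valid only when the number of degrees of freedom is even (as it is here).

```python
from math import exp, factorial

# Observed frequencies: rows a, b, c and columns A, B, C, D.
observed = [
    [15, 12, 8, 6],
    [6, 11, 7, 7],
    [9, 6, 14, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected frequency for each cell = row total x column total / total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

def p_value_even_df(statistic, dof):
    """Upper-tail chi-squared probability; this closed form holds only for even dof."""
    half = statistic / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(dof // 2))

p_value = p_value_even_df(chi2, df)

alpha = 0.05  # ASSUMPTION: 5% significance level, not stated in the example
print(round(chi2, 2), df, round(p_value, 4))
print("reject H0" if p_value < alpha else "accept H0")
```

Comparing the p-value with the significance level gives the same decision as comparing the χ² statistic with the critical value; the GDC reports both so that either comparison can be used.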



Understanding the χ² statistic and p-value

The GDC gives you the χ² statistic, the p-value and the degrees of freedom (df).

For each test, comparing the χ² statistic with the critical value or the p-value with the significance level will lead to the following conclusions:

χ² statistic          p-value comparison             Goodness-of-fit test   Independence test
χ² < critical value   p-value > significance level   Good fit               Accept the null hypothesis
χ² > critical value   p-value < significance level   Not a good fit         Reject the null hypothesis

