Statistical Inference
Friday 15th February 2013

Page 1: Statistical Inference Friday 15th February 2013

Statistical Inference
Friday 15th February 2013

Page 2

Outline:

• Inference

• Confidence intervals

• Sampling distributions

• The normal distribution and z-scores

• Working out confidence intervals

• Hypothesis testing

• Types of error

• t-tests (and ANOVA)

We recommend Statistics for the Terrified

‘Standard Error and Confidence Intervals’

Page 3

What is inference?

• Most of the time we care about the attributes of a population – adults in the UK; women workers; small businesses…

• But we usually only study a sample of the population.

• Inferential statistics give you the tools to infer population characteristics from the sample.

• Inferential statistics usually assume a random sample. This is why it is so important to use methods of random sampling when at all possible.

• Instead of, say, reporting that 35% of our sample have some characteristic, using inferential statistics we are able to estimate, or infer, the proportion of the population that is likely to have that characteristic.

• In order to do this we use confidence intervals.

Page 4

What is a 'Confidence Interval'?

• A 'Confidence Interval' for a particular sample statistic (e.g. the mean) is a range of values around the statistic that is believed to contain, with a certain level of probability (often 95%), the 'true' value of that statistic (i.e. the population value).

• For example, suppose we see a report that 37% of people (plus or minus 3%) intend to vote Labour. What is being said is that the pollsters are reasonably confident that the true proportion of people who intend to vote Labour is between 34% and 40%. If they have not said otherwise, it is very likely that this is a 95% confidence interval.

• See Statistics for the Terrified: Chapter 4

Page 5

How do we arrive at a confidence interval?

• How do we judge how big a confidence interval should be (plus or minus 2% or 5% or 15%...)?

• What does it mean to be 95% certain that it is the size that we say it is?

• And how do we know that the results we got in our sample of the population are not just a quirk of our particular sample (or ‘sampling error’)?

• Part of the answer to these questions can be seen in common-sense assessments…

Page 6

Example: Judging whether differences occur by chance…

How do we judge whether it is plausible that two population means are the same and that any difference between them simply reflects sampling error?

Example: Household size of minority ethnic groups (HOH = head of household; data adapted from 1991 Census)

1. The size of the difference between the two sample means

   Indian HOH mean: 3.0 vs Bangladeshi HOH mean: 5.0
   Indian HOH mean: 3.0 vs Pakistani HOH mean: 4.0

The first difference is more 'convincing'.

Page 7

Judging whether differences occur by chance…

2. The sample sizes of the two samples

   Pakistani HOH: 3 4 5 (mean 4.0) vs Bangladeshi HOH: 4 5 6 (mean 5.0)
   Pakistani HOH: 2 2 3 4 4 4 5 5 5 6 (mean 4.0) vs Bangladeshi HOH: 2 3 4 4 5 5 6 6 7 8 (mean 5.0)

The second difference is more 'convincing'.

Page 8

Judging whether differences occur by chance…

3. The amount of variation in each of the two groups (samples)

   Pakistani HOH: 2 2 3 4 4 4 5 5 5 6 (mean 4.0) vs Bangladeshi HOH: 2 3 4 4 5 5 6 6 7 8 (mean 5.0)
   Pakistani HOH: 4 4 4 4 4 4 4 4 4 4 (mean 4.0) vs Bangladeshi HOH: 5 5 5 5 5 5 5 5 5 5 (mean 5.0)

The second difference is more 'convincing'.

Page 9

Example continued: the impact of variability on a difference in means.

The three graphs each show two groups with the same mean difference.

However the groups in each of the three graphs have different levels of variability.

Where there is lower variability there is less cross-over between the groups, and so the difference between the means expresses a more pertinent difference (there is almost no one in group A with the same score as anyone in group B).

Page 10

Judging whether differences occur by chance…

As we'll see, these three things – the size of the difference between the means, the sample size, and the amount of variation (measured by the standard deviation) within the sample(s) – are critical to our determination of whether a difference we observe in a sample (or between samples) is likely to represent a real difference in the population (or between populations).

Page 11

So, what is the relation of the sample to the population?

• If the sample is a random sample of the population, it may sometimes have a large number of extremely high values (for example: very happy people)

• And sometimes it may have a large number of extremely low values (for example: very sad people)

• But over the long run (if we kept on taking a sample, and then putting it back and taking another one), we would expect that most of the samples would fairly well represent the population (for example: with a mean happiness that corresponds fairly closely to the mean happiness of the population).

Page 12

Sampling Distributions

• The distribution of different possible samples that could be taken from a population is known as a sampling distribution.

• The more we understand about this distribution the better, because it will help us to work out the likely relationship of our particular sample to the population.

• What we find is that as more and more samples are taken, the average (i.e. mean) of the sample means tends to equal the mean of the population.

• The sampling distribution of means also looks like a normal distribution (Central Limit Theorem).

• However, the sampling distribution of means is less varied than the population.

• See the sampling distribution simulation at: http://onlinestatbook.com/stat_sim/ (you can access this via the links page of the module website).

• Or Statistics for the Terrified, Chapter 4: 'Standard Error and Confidence Intervals'.

[Figure: Sampling from a Population: Sample Means (from Field, 2005).]
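The simulation mentioned above can be sketched in a few lines of Python (illustrative only: the population, sample size, and number of repeats are arbitrary choices of ours). Drawing repeated samples from a deliberately non-normal population shows the sample means centring on the population mean, with a spread close to σ/√n:

```python
import math
import random

random.seed(42)

# Illustrative population: Exp(1), which is skewed, with mean 1 and sd 1.
pop_mean, pop_sd = 1.0, 1.0
n = 50            # size of each sample
n_samples = 5000  # number of repeated samples

# Draw many samples and record each sample's mean.
sample_means = []
for _ in range(n_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

# The mean of the sample means tends towards the population mean...
grand_mean = sum(sample_means) / n_samples

# ...and their spread (the standard error) tends towards sigma / sqrt(n).
se_observed = math.sqrt(sum((m - grand_mean) ** 2 for m in sample_means) / n_samples)
se_expected = pop_sd / math.sqrt(n)

print(grand_mean)   # close to 1.0
print(se_observed)  # close to se_expected (about 0.141)
```

Even though the population here is skewed, the distribution of sample means comes out roughly normal, which is the Central Limit Theorem at work.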

Page 13

The formal theorem:

"If repeated (simple random) samples of size N are drawn from a normally distributed population, the means of such samples will be normally distributed with mean μ and standard error [i.e. standard deviation] σ/√N… if the N of each sample drawn is large, then regardless of the shape of the population distribution the sample means will tend to distribute themselves normally with mean μ and standard error σ/√N".

μ = population mean
σ = population standard deviation
N = number in sample

Page 14

So where does this get us…???

• Well, we know that over the long run the mean of our samples is likely to end up as the population mean.

• We know that over the long run (when the sample is 'large' enough) the distribution of sample means looks normal. [Note: A "large sample" is sometimes considered to be one of size 30+, but a size of 100+ can more 'safely' be viewed as adequately large.]

• And we know that the variation in the sample means, known as the standard error, is (more or less) σ/√N.

• Although we usually only have a single sample, this information means we can work out a fairly reliable estimate of the population mean by combining the sample with what we know about normal distributions.

Page 15

What's so special about the 'Normal Curve'?

• The normal curve is a symmetrical distribution of scores with an equal number of scores above and below the midpoint of the abscissa (the horizontal axis, or 'x-axis', for the curve).

• Since the distribution of scores is symmetric, the mean, median, and mode are all at the same point on the abscissa. In other words, the mean = the median = the mode.

• If we divide the distribution up into standard deviation units, a known proportion of scores lies within each portion under the curve.

• From published or online tables, we can find the proportion of scores above and below any point on the abscissa, expressed in standard deviation units. Scores expressed in standard deviation units are referred to as z-scores.

34% of cases are between the mean and one SD away.

Page 16

z-scores

z-scores can be calculated for any value. They are a means of standardizing values that are measured on different scales, by showing these values just in terms of the number of standard deviations away from the mean they fall.

z-scores are calculated by subtracting the mean from any value and dividing the result by the standard deviation:

   z = (x − mean) / s

z-scores will always have: a mean of 0 and a standard deviation of 1.

We can quickly see that this is true of the mean, since when x = mean, the numerator (top bit!) will equal 0, and therefore z must = 0.

It may be a little less clear that it is true of the standard deviation. However, if you think about the instance when x is one standard deviation bigger than the mean (i.e. x = mean + s):

   z = ((mean + s) − mean) / s = s / s = 1
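A minimal sketch of the calculation (the mean of 50 and standard deviation of 10 are made-up illustrative values):

```python
def z_score(x, mean, s):
    """Number of standard deviations by which x lies away from the mean."""
    return (x - mean) / s

# Illustrative values only: mean = 50, s = 10.
print(z_score(50, 50, 10))  # x at the mean: z = 0.0
print(z_score(60, 50, 10))  # one sd above the mean: z = 1.0
```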

Page 17

Finding the 95% point on a normal distribution…

• From a table of the normal distribution we can see that when z = 1.96 (sometimes simplified to z = 2) the p-value, which represents the probability of being in the larger area (to the left), is 0.975.

• Therefore the area under one (small) tail of the curve is p = 0.025.

• This means that scores greater than z = 1.96 occur just 2.5% of the time.

• Further (because the normal curve is symmetric), we can calculate that the area under both tails (beyond z = 1.96 and z = −1.96) is 0.05.

• In other words, 95% of the area is in the middle, between z = −1.96 and z = 1.96.

• And scores further from the mean than 1.96 thus occur only 5% of the time.

[Figure: normal curve with 97.5% of the area to the left of z = 1.96, 95% between z = −1.96 and z = 1.96, and 2.5% in each tail.]
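Instead of a printed table, these areas can be computed from the standard normal CDF; Python's standard library exposes the error function, from which the CDF follows (a sketch):

```python
import math

def normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(normal_cdf(1.96), 3))            # area to the left: 0.975
print(round(1 - normal_cdf(1.96), 3))        # one tail: 0.025
print(round(2 * (1 - normal_cdf(1.96)), 2))  # both tails: 0.05
```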

Page 18

Note: What happens if the sample size is too small for one to safely assume that the sample mean has a normal distribution?

• When a sample is small (i.e. less than about 25), the assumption that the sample mean is normally distributed is not reasonable.

• In fact, regardless of sample size, the sample mean can be assumed to have a t-distribution; the precise shape of a t-distribution depends on the sample size, and for moderate-to-large sample sizes the t-distribution is very similar to the normal distribution (and, as the sample size approaches infinity, eventually converges with it).

Page 19

Combining that with what we know about the sampling distribution:

• 95% of cases lie within ±1.96 standard deviations of the mean in a normal distribution.
• The distribution of sample means is normal…
• …and the standard error of sample means is approximately σ/√n.

Therefore 95% of sample means fall into the range: μ − 1.96(σ/√n) to μ + 1.96(σ/√n)

[Figure: sampling distribution of the mean, centred on the population mean μ, with 95% of sample means within ±1.96 σ/√n of μ and 2.5% of sample means in each tail.]

Page 20

Example

• If we take a sample of 100 people and find that they work a mean of 34 hours per week with a standard deviation of 8 hours, how do we construct a 95% confidence interval for the mean number of hours worked by the population?

• We know that 95% of sample means fall in the range: μ − 1.96(σ/√n) to μ + 1.96(σ/√n).

• We estimate σ using the sample standard deviation, which is 8.

• The sample size (n) is 100. Therefore √n = 10.

• Therefore 1.96(σ/√n) = 1.96 × (8 / 10) = 1.96 × 0.8 = 1.568.

• Therefore there is a 95% likelihood that the sample mean that we have found is within (about) 1.57 hours of the actual mean.

• And so we can say with 95% confidence that the population's mean weekly hours of work will fall somewhere between 34 minus 1.57 and 34 plus 1.57.

• A 95% confidence interval of 32.43 to 35.57 hours per week.
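The arithmetic above can be checked with a short sketch (the helper name ci95 is ours, and the 1.96 multiplier assumes a large sample):

```python
import math

def ci95(mean, sd, n):
    """Large-sample (z-based) 95% confidence interval for a population mean."""
    margin = 1.96 * sd / math.sqrt(n)
    return mean - margin, mean + margin

low, high = ci95(34, 8, 100)  # the hours-of-work example above
print(round(low, 2), round(high, 2))  # 32.43 35.57
```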

Page 21

Why 95%?

• A confidence interval need not be 95%.

• However, this is the generally accepted level for statistical testing. It is considered that errors occurring only 5% (or 1 in 20) of the time are acceptable. Furthermore, a higher confidence level produces wider confidence intervals, which may be viewed as too wide to be informative (and which raise the risk of Type II errors – discussed later).

• However, for some purposes a more cautious approach may be necessary.

• For instance, if you were an antiquarian librarian sampling over time the humidity in your rare book storage facility, you might want to be confident at a 99.9% level (at least!) that the average humidity level was neither destructively high nor low. In this case you would construct a 99.9% confidence interval (where only 0.1% of cases fell outside the range). You could use the normal distribution to do this, in a similar fashion to the way in which we used it to work out that the 95% confidence level relates to plus or minus 1.96 standard errors.

Page 22

Note: Small samples continued…

• The procedure for producing 95% confidence intervals remains very similar to the one for larger sample sizes (i.e. the one using the 'normal distribution', which might just as well be referred to as the z-distribution), as does the test to see whether a suggested population mean is plausible.

• The only difference is that the 'magic number' 1.96 is replaced by a slightly larger number, the magnitude of which gets bigger as the sample size gets smaller.

• Thus, for a sample size of 25, 1.96 is replaced by 2.06 and, for a sample size of 15, by 2.13. (You can sometimes find a table of values for the t-distribution at the back of a statistics textbook.)

However, another problem arises with small samples: the distribution of sample means can be asymmetric. In fact, the assumption that the sample mean has a t-distribution is only reasonable for small samples if the distribution of the variable under consideration approximates the normal distribution.

Page 23

For you to think about:

1. If we take a sample of 144 people and find that they eat a mean of 2,450 calories per day, with a standard deviation of 840 calories, how do we construct a 95% confidence interval for the mean number of calories eaten by the whole population?

2. If you have time, start thinking about this one: Imagine that we already know that the mean income of university graduates is £16,500. We then do a survey of 64 sociology graduates and find that they earn a mean income of £15,400 with a standard deviation of £4,000. Can we say that this is convincing evidence that sociology graduates earn less than other graduates? Why?

Page 24

Hypothesis testing

• With reference to the second question on the preceding slide:

Imagine that we already know that the mean income of university graduates is £16,500. We then do a survey of 64 sociology graduates and find that they earn a mean income of £15,400 with a standard deviation of £4,000. Can we say that this is convincing evidence that sociology graduates earn less than other graduates?

• The null hypothesis here is that sociology graduates earn the same as other graduates. This is a hypothesis of no difference.

• The alternative hypothesis is that there is a difference.

• The null hypothesis (or H0) is usually of no difference, whereas the alternative hypothesis (or Ha) is usually of difference.

• When we carry out statistical tests, we attempt, as here, to reject the null hypothesis at a 95% level of confidence (or sometimes at a 99% or 99.9% level).

Page 25

Statistical significance

• A conclusion (e.g. that a difference or relationship exists) is statistically significant if the probability that the conclusion would be drawn if it is, in fact, erroneous falls below the significance level chosen (in social science research this is often 5% = 0.05 = 1 in 20).

• The significance level is sometimes referred to as alpha (α).

Page 26

Hypothesis testing

• So, to think about the example again: Imagine that we already know that the mean income of university graduates is £16,500. We then do a survey of 64 sociology graduates and find that they earn a mean income of £15,400 with a standard deviation of £4,000. Can we say that this is convincing evidence that sociology graduates earn less than other graduates?

• If we construct a 95% confidence interval for the population mean income of sociology graduates it will look like this:
   – 15,400 plus or minus 1.96 × (4,000 / √64)
   – 15,400 plus or minus 1.96 × (4,000 / 8)
   – 15,400 plus or minus 980, i.e. £14,420 to £16,380

• The top point of this range is still below the mean income for graduates generally – there is no overlap. This means that there is less than a 5% chance that a difference as big as £1,100 would have occurred if there were no difference between sociology graduates' mean income and the mean income for all graduates.
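The interval arithmetic can be checked directly (a sketch of the calculation above, nothing more):

```python
import math

# The sociology-graduates example: sample summary figures.
mean, sd, n = 15400, 4000, 64
se = sd / math.sqrt(n)  # 4000 / 8 = 500.0
margin = 1.96 * se      # about 980
print(round(mean - margin), round(mean + margin))  # 14420 16380
```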

Page 27

p-values

• A p-value quantifies the statistical significance of a result.

• More precisely, it quantifies how likely a difference or relationship of equal or greater magnitude to that observed would be to occur if there were no difference/relationship in the population (i.e. if the null hypothesis is correct).

Page 28

Back to the example…

• In the example, the standard error (i.e. the standard deviation of the sample mean) is equal to 4,000 / √64 = 500.

• Thus the sample mean is 1,100 / 500 = 2.2 standard errors away from the suggested population mean.

• Statistical theory tells us that 95% of sample means are within 1.96 standard errors of the population mean.

• It also tells us that 97.2% of sample means are within 2.2 standard errors of the population mean.

• Hence the p-value for the difference of 2.2 standard errors (which is a test statistic) is (100 − 97.2) / 100 = 0.028.

• Since p < 0.05, the difference is statistically significant at the conventional 5% significance level.
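The p-value quoted above can be reproduced from the standard normal CDF (a sketch; normal_cdf is our helper, built on Python's error function):

```python
import math

def normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Test statistic: the sample mean lies 2.2 standard errors below 16,500.
z = (15400 - 16500) / 500
p_two_tailed = 2 * normal_cdf(-abs(z))
print(z, round(p_two_tailed, 3))  # -2.2 0.028
```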

Page 29

Hypothesis testing

Theory
You test out particular hypotheses with reference to your sample statistics. However, these hypotheses are about underlying population characteristics (parameters).

Procedure
• Set up 'null' (and 'alternative') hypotheses
• Note sample size and design
• Establish the sampling distribution under the assumption that the null hypothesis is true
• Identify the decision rule (i.e. what constitutes acceptance/rejection of the null hypothesis)
• Compute the sample statistic(s), and apply the decision rule (N.B. This is where Type I and Type II errors can occur.)

Page 30

Error Types

Decision (based on        Truth about population
hypothesis test)          H0 true             Ha true
Reject H0                 Type I error        Correct decision
Do not reject H0          Correct decision    Type II error

Note: Reducing the chance of one type of error occurring increases the chance that the other type will!
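The meaning of a Type I error can be illustrated with a small Monte Carlo sketch (entirely synthetic data of our choosing): when H0 really is true, a 5%-level z-test should still (wrongly) reject it about 5% of the time.

```python
import math
import random

random.seed(1)

# H0 is true by construction: samples come from a normal population
# with mean exactly mu0 (sigma treated as known, for simplicity).
mu0, sigma, n = 0.0, 1.0, 30
n_sim = 2000

rejections = 0
for _ in range(n_sim):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    if abs(z) > 1.96:  # the 5% two-tailed decision rule
        rejections += 1

print(rejections / n_sim)  # close to 0.05 -- the Type I error rate
```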

Page 31

(Statistical) Power

• Power is defined as the probability that a test will correctly reject the null hypothesis, i.e. correctly conclude that there is a difference, relationship, etc.

• The probability of a Type II error is sometimes labelled beta (β), hence power equals 1-β.

• The power of a test depends on the size of the effect (which is, of course, unknown!)

Page 32

What is the point of power?

• Power also depends on the sample size and the significance level chosen.

• So if we want to use the usual 5% significance level (to obtain '95% confidence' in our results) and we want to be able to identify an effect of a given size, we can calculate how likely we are, for a given sample size, to find an effect of that size, assuming such an effect exists.

• If the power of a test is low, there is little point in applying it, which suggests a need for a larger sample.
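Under simplifying assumptions (a two-sided z-test with known σ), power can be computed directly; the function name and the numbers below are illustrative only:

```python
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_z_test(effect, sigma, n, z_crit=1.96):
    """Power of a two-sided z-test to detect a true mean shift of `effect`.

    Sketch assumptions: known sigma, normally distributed sample mean.
    """
    shift = effect / (sigma / math.sqrt(n))
    return normal_cdf(-z_crit + shift) + normal_cdf(-z_crit - shift)

# Power grows with sample size for a fixed effect size:
print(round(power_z_test(2, 8, 25), 2))   # modest power
print(round(power_z_test(2, 8, 100), 2))  # much higher power
```

This makes the trade-off on the slide concrete: for the same effect and significance level, quadrupling the sample size raises the chance of detecting the effect substantially.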

Page 33

Never innocent…

• Rather than deciding between 'guilty' and 'innocent', statistical tests decide between 'guilty' and 'not proven'.

• In other words, a statistically insignificant or non-significant result (sometimes indicated by NS rather than, say, p > 0.05) does not indicate that a difference or relationship does not exist, but simply that there is insufficient evidence to conclude that one does exist!

• This leaves open the possibility of a small difference or weak relationship, which the statistical test was insufficiently powerful to identify…

Page 34

Applying the logic of a statistical test…

Today and next week we will look at a number of different statistical tests that use inferential methods to ask:

• Is the sample mean sufficiently different from the suggested population mean that it is implausible that the suggested population mean is correct? Testing the plausibility of a suggested population mean (via a z-test). [This is what we've just done.]

• Are the means from two samples sufficiently different for it to be implausible that the populations from which they come are actually the same? Test via a two-sample t-test, or if comparing more than two (sub-) samples (i.e. more than two groups) testing for differences via Analysis of Variance (usually referred to as ANOVA).

• Are the observed frequencies in a cross-tabulation sufficiently different from what one would have expected to have seen if there were no relationship in the population for the idea that there is no relationship in the population to be implausible? Test this via a chi-square test.

In each instance we are asking whether the difference between the actual (observed) data and what one would have expected to have seen, given some hypothesis Ho, is sufficiently large that the hypothesis is implausible.

Thus we are always trying to disprove a (null) hypothesis.

Page 35

(Two-sample) t-tests

• Test the null hypothesis, which is:

   H0: μ1 = μ2   or   H0: μ1 − μ2 = 0

   i.e. the equality of means.

• The alternative hypothesis is:

   Ha: μ1 ≠ μ2   or   Ha: μ1 − μ2 ≠ 0

Page 36

What does a t-test measure?

[Figure: comparison of the treatment-group and control-group distributions.]

Note: T = treatment group and C = control group. (The above depicts a comparison in experimental research; in most discussions these will just be shown as groups 1 and 2, indicating different groups.)

Page 37

Example

• We want to compare the average amounts of television watched by Australian and by British children.

• We have a sample of Australian and a sample of British children. We could say that what we have and want to do is something like this:

[Diagram: from the sample of Australian children we make an inference to the population of Australian children; from the sample of British children we make an inference to the population of British children; it is the two populations that we want to compare.]

Page 38

Example (continued)

• Here the dependent variable is number of hours of TV watched each night

• And the independent variable is nationality (or, perhaps, national context).

• When we are comparing means SPSS calls the independent variable the grouping variable and the dependent variable the test variable.

Page 39

Example (continued)

• If the null hypothesis, hypothesising no difference between the two groups, were correct (and children thus watch the same average amount of television in Australia as in Britain), we would expect that if we took repeated samples from the two groups the difference in means between them would generally be small or zero.

• However it is highly likely that the difference between any two particular samples will not be zero.

• Therefore we acquire a knowledge of the sampling distribution of the difference between the two sample means.

• We use this distribution to determine the probability of getting an observed difference (of a given size) between two sample means from populations with no difference.

Page 40

If we take a large number of random samples and calculate the difference between each pair of sample means, we will end up with a sampling distribution that has the following properties:

• It will be a t-distribution.

• The mean of the difference between sample means will be zero if the null hypothesis is correct: Mean (M1 − M2) = 0.

• The 'average' spread of scores around this mean of zero (the standard error) will be defined by the formula:

   S_DM = √[ ((N1 − 1)s1² + (N2 − 1)s2²) / (N1 + N2 − 2) × (1/N1 + 1/N2) ]

This estimate 'pools' the variance in the groups – just take it at face value!

Page 41

Back to the example…

Table 1. Descriptive statistics for the samples

Descriptive statistic   Australian sample   British sample
Mean                    166 minutes         187 minutes
Standard deviation      29 minutes          30 minutes
Sample size             20                  20

When we are choosing the test of significance it is important to note that:

1. We are making an inference from TWO samples (of Australian and of British children), and these samples are independent (the number of hours of TV watched by British children doesn't affect the number of hours watched by Australian children). Therefore we need a two-sample test (what SPSS calls an 'independent samples' t-test).

2. The two samples are being compared in terms of an interval-ratio variable (hours of TV watched). Therefore the relevant descriptive statistic is the mean.

These facts lead us to select the two-sample t-test for the equality of means as the relevant test of significance.

Page 42

t-test of independent means: formulae

   S_DM = √[ ((N1 − 1)s1² + (N2 − 1)s2²) / (N1 + N2 − 2) × (1/N1 + 1/N2) ]

   t = (M1 − M2) / S_DM

   df = N1 + N2 − 2

Where:
M = mean
S_DM = standard error of the difference between means
N = number of subjects in a group
s = sample standard deviation of a group
df = degrees of freedom

Note: 1/N1 + 1/N2 = (N1 + N2) / (N1 × N2)
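The formulae translate directly into code; this sketch (the helper name pooled_t is ours) reproduces the Australian/British figures from Table 1:

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Two-sample t statistic with pooled variance (equal-variances form)."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # S_DM
    return (m1 - m2) / se_diff, n1 + n2 - 2              # t, df

# Australian sample: mean 166, sd 29, n 20; British: mean 187, sd 30, n 20.
t, df = pooled_t(166, 29, 20, 187, 30, 20)
print(round(t, 2), df)  # -2.25 38
```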

Page 43

What are ‘degrees of freedom’?

• Degrees of freedom can be thought of as the ‘sources of variation’ in a particular situation.

• If we are comparing groups of 20, then within each group there are 19 (independent) sources of difference between the values for that group.

• Thus for the two groups combined there are 19+19 = 38 degrees of freedom (d.f.)

Page 44

Example: Calculating the t-value

Descriptive statistic   Australian sample   British sample
Mean                    166 minutes         187 minutes
Std. dev.               29 minutes          30 minutes
Sample size             20                  20

   S_DM = √[ ((20 − 1)29² + (20 − 1)30²) / (20 + 20 − 2) × (20 + 20) / (20 × 20) ] = 9.3

   t_sample = (166 − 187) / 9.3 = −2.3

Page 45

Example: Obtaining a p-value for a t-value

• To obtain the p-value for this t-value (score) we could consult a table of critical values for the t-distribution.

• Such a table may not have a row of probabilities for 38 degrees of freedom (d.f.). In that case we would (to be cautious) refer to the row for the nearest reported number of degrees of freedom below the desired number. Here that might be 30.

• For 30 degrees of freedom and a two-tailed test, the tabulated t-scores for p = 0.05 and p = 0.02 are 2.042 and 2.457.

• The absolute magnitude of our t-statistic (2.3) falls between these scores, hence the p-value linked to this t-statistic is between 0.02 and 0.05.

• Therefore the result is statistically significant at the 5% (0.05) level but not at the 2% or 1% (0.02 or 0.01) level.

• Of course, SPSS is set up to calculate exact p-values for test statistics such as the t-statistic (in this case the exact value is p = 0.030).

Page 46

Example: Reporting the results

“The mean number of minutes of TV watched by the sample of 20 British children is 187 minutes, which is 21 minutes higher than the mean of 166 minutes for the sample of 20 Australian children; this difference is statistically significant at the 0.05 level (t(38)= -2.3, p = 0.03, two-tailed test).

Based on these results we can reject the hypothesis that British and Australian children watch the same average amount of television every night.”

Page 47

t-tests and ANOVA

• ANOVA (Analysis of Variance) works on broadly similar principles, but is a technique allowing one to look simultaneously at differences between the means of more than two groups.

• Both t-tests and ANOVA make an assumption of homogeneity of variance (i.e. that the spread of values in each of the groups being considered is consistent).

• We will use both t-tests and ANOVA in the computing session this afternoon.

• You DO NOT have to remember the equations for any of these tests!

• What are crucial to remember are the principles of hypothesis testing:
   – That we start with a null hypothesis (of no difference in the population).
   – That, using our sample, we can test whether this is plausible.
   – That the p-values that we get (and that we report) show the likelihood of the observed results given no difference.
   – Therefore (to simplify), the lower the p-value, the more likely it is that there is a real difference between the groups.

• Note that the three things that impact upon these test statistics are the sample size (of each group), the size of the differences in the means (between groups) and the variability of scores (within each group).