-
One-Sample t-Test
Scenario:
We are testing whether the mean $\mu$ of a population is equal to a hypothesized value $\mu_0$.
We don't know the population variance $\sigma^2$, so it must be estimated with the sample variance $s^2$.
Assume the data are normally distributed or, if $n \ge 30$, we can use the CLT.
Hypothesis test:
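$H_0: \mu = \mu_0$ vs. $H_1: \mu \ne \mu_0$ (two-tailed)
$H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$ (upper-tailed)
$H_0: \mu = \mu_0$ vs. $H_1: \mu < \mu_0$ (lower-tailed)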
A logical test statistic:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
But what is its distribution?
-
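t-Distribution
When the population standard deviation $\sigma$ is unknown, use the sample standard deviation $s$ as an estimate:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Substituting $s$ for $\sigma$:
$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \;\longrightarrow\; \frac{\bar{x} - \mu}{s/\sqrt{n}}$$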
If either:
o The sample size is small ($n < 30$) but the underlying population distribution is normal
or
o The sample size is large ($n \ge 30$)
Then:
$\frac{\bar{x} - \mu}{s/\sqrt{n}}$ has a $t$-distribution with $n - 1$ degrees of freedom ($df$)
-
t-Distribution
A natural question: why does this statistic, which looks a lot like the Z-statistic, not have a normal distribution?
The simplest answer is that the population standard deviation $\sigma$ is being estimated by the sample standard deviation $s$.
The sample standard deviation also has a sampling distribution, as the sample mean does, though its distribution looks much different from the sampling distribution of the mean.
It is a skewed distribution.
It is related to the chi-squared distribution: for normal data, $(n-1)s^2/\sigma^2$ follows a chi-squared distribution with $n - 1$ degrees of freedom.
The added variability due to having to estimate the variance makes the test statistic have a distribution with fatter tails.
Essentially, there is more uncertainty in the test, which is captured in the distribution.
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ has a $t$-distribution with $n - 1$ degrees of freedom ($df$)
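The fatter tails can be checked by simulation. The sketch below is illustrative only: the sample size of 5, the number of replications, the seed, and the function name are assumptions, not part of the slides. It draws many small samples from a normal population, computes $t$ for each, and counts how often $|t|$ exceeds 1.96, which would happen about 5% of the time if $t$ were standard normal.

```python
import math
import random
import statistics

def simulate_tail_fraction(n=5, reps=20000, cutoff=1.96, seed=1):
    """Fraction of simulated t-statistics with |t| > cutoff (hypothetical helper)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(reps):
        # Sample from a standard normal population, so the true mean is 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)  # sample sd, divides by n - 1
        t = (xbar - 0.0) / (s / math.sqrt(n))
        if abs(t) > cutoff:
            exceed += 1
    return exceed / reps

fraction = simulate_tail_fraction()
print(f"share of |t| > 1.96 with n = 5: {fraction:.3f} (standard normal: about 0.050)")
```

With so few observations the share should land well above 5%, which is the extra uncertainty from estimating $\sigma$ showing up in the tails.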
-
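t-Distribution
Student's $t$-distribution: William Gosset and Guinness
Similar to the standard normal distribution:
o $t$ values range from $-\infty$ to $\infty$
o $\mu = 0$ and symmetric about $t = 0$
Instead of $\sigma$, spread/shape is defined by the degrees of freedom ($df$). This is its only parameter.
As $df$ increases, the distribution approaches the standard normal distribution.
$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}},$$
where $\Gamma$ is the gamma function, and $\nu$ is the degrees of freedom.
$E(t) = 0$ (for $\nu > 1$)
$\mathrm{Var}(t) = \frac{\nu}{\nu - 2}$, for $\nu > 2$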
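As a rough illustration of the density formula, the following sketch evaluates it with the standard library and compares it with the standard normal density. The function names and the evaluation point $t = 3$ are choices made here, not part of the slides.

```python
import math

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))

def normal_pdf(z):
    """Standard normal density, for comparison."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

for df in (1, 5, 30, 1000):
    # The tail density at t = 3 moves toward the normal value as df grows.
    print(df, round(t_pdf(3.0, df), 5), round(normal_pdf(3.0), 5))
```

For small $df$ the density at $t = 3$ is several times the normal density, and the gap closes as $df$ increases.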
-
t-Distribution
What are Degrees of Freedom ($df$)?
Number of data values in the sample that are free to vary when estimating parameters
Suppose we know that the sample mean of 5 values is equal to 10. In other words, $\bar{x} = 10$ is an estimate of the parameter $\mu$ based on a sample of $n = 5$ values. However, we don't actually know the individual values of the sample. If we were to guess the 5 values, we would be free to guess any value for the first 4. However, once we've guessed 4 numbers, the last number must be chosen such that the average comes out to 10.
For example: ____ ____ ____ ____ ____
$$\bar{x} = \frac{8 + 23 + 12 + 4 + x_5}{5} = 10 \;\Rightarrow\; x_5 = 3$$
$df = n - 1 = 5 - 1 = 4$
Values: 8, 23, 12, 4, 3
Status: free, free, free, free, fixed (4 degrees of freedom)
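The same bookkeeping can be written as a few lines of Python. This is only a sketch of the example above; the helper name `last_value` is made up for illustration.

```python
def last_value(free_values, target_mean):
    """Value forced on the final observation once the others are chosen."""
    n = len(free_values) + 1
    return n * target_mean - sum(free_values)

free = [8, 23, 12, 4]          # the four freely chosen values from the example
fixed = last_value(free, 10)   # the fifth value is determined by the mean
df = len(free)                 # n - 1 values were free to vary
print(fixed, df)
```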
-
One-Sample t-Test (Example)
Suppose that the mean IQ of non-psychology UCM students is 110 and that their IQs are normally distributed. We randomly sample 9 students from UCM's psych department and give them an IQ test. The average IQ of the sample is 117 with variance 121. Assume that psych students' IQs are also normally distributed. Is the mean IQ of psych students significantly larger than the mean IQ of non-psych students? Use $\alpha = .05$.
What are the null and alternative hypotheses?
$H_0: \mu = 110$ vs. $H_1: \mu > 110$
Because the population sd is not known, a t-statistic is used:
$$t = \frac{\bar{x} - \mu_0}{\sqrt{s^2/n}} = \frac{117 - 110}{\sqrt{121/9}} = \frac{7}{11/3} \approx 1.91$$
Critical value method:
$df = n - 1 = 9 - 1 = 8$
One-tailed test with $\alpha = .05$
$t_{crit} = t_{.05,\,8} = 1.860$. The area beyond the critical value (1.860) is the critical region where we would reject $H_0$.
Since $t > t_{crit}$ ($1.91 > 1.860$), we reject $H_0$.
Thus, we can conclude that the mean IQ of UCM psych students is larger than that of UCM non-psych students.
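The arithmetic of this example can be reproduced with a short sketch. The function name is hypothetical, and the critical value is simply the table value quoted in the example rather than something computed here.

```python
import math

def one_sample_t(xbar, mu0, s2, n):
    """t statistic for H0: mu = mu0, given the sample variance s2."""
    return (xbar - mu0) / math.sqrt(s2 / n)

t_stat = one_sample_t(xbar=117, mu0=110, s2=121, n=9)
t_crit = 1.860                 # table value for df = 8, upper tail .05 (from the example)
reject_h0 = t_stat > t_crit    # upper-tailed test
print(round(t_stat, 2), reject_h0)
```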
-
Confidence Interval for $\mu$
Suppose we are interested in the average high score of millions all over the world who play a very popular computer game.
Unfortunately, the server does not keep a record of high scores, so we cannot simply determine the average score of the entire
population (true mean $\mu$). We do not know the population standard deviation $\sigma$ either. However, all individuals do know their own high score, and we also happen to know that the population high scores are not normally distributed. We take a
random sample of 121 players and calculate their mean high score to be 5000 and standard deviation to be 1000.
What is the 95% confidence interval for $\mu$?
o Population distribution not normal
o $s = 1000$ ($\sigma$ unknown)
o $n = 121$ ($n \ge 30$; Central Limit Theorem holds)
o $\bar{x} = 5000$
-
Confidence Interval for $\mu$
As with all confidence intervals, we need to know the point estimate, the appropriate multiplier, and the standard error.
The point estimate is the estimate of the population parameter. Here, $\mu$ is estimated by $\bar{x}$.
The standard error is the standard deviation of the distribution of the point estimate. We showed that the standard error of $\bar{x}$ is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. When we don't know $\sigma$, this is estimated with $s/\sqrt{n}$.
Finally, the appropriate multiplier is a value determined by the distribution of the point estimate and the desired level of confidence. For the t-distribution, $t_{\alpha/2,\,df}$ is the appropriate multiplier for a $100(1-\alpha)\%$ confidence interval, where $df = n - 1$.
$100(1-\alpha)\%$ CI for $\mu$: $\bar{x} \pm t_{\alpha/2,\,df}\,\frac{s}{\sqrt{n}}$
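The three ingredients combine as in the sketch below. The function name and the input numbers are hypothetical, chosen only to show the calculation; the multiplier 2.131 is the usual table value for $t_{0.025,\,15}$.

```python
import math

def t_interval(xbar, s, n, t_mult):
    """Return (lower, upper) for xbar +/- t_mult * s / sqrt(n)."""
    standard_error = s / math.sqrt(n)   # estimated standard error of xbar
    margin = t_mult * standard_error    # multiplier times standard error
    return xbar - margin, xbar + margin

# Hypothetical inputs: xbar = 50, s = 10, n = 16, so df = 15.
lower, upper = t_interval(xbar=50.0, s=10.0, n=16, t_mult=2.131)
print(lower, upper)
```

Passing the multiplier in as an argument keeps the sketch free of any table lookup; in practice it would come from a $t$ table or statistical software.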
-
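Confidence Interval for $\mu$
$C\%$ Confidence Level: $\frac{C}{100} = 1 - \alpha$, so $\alpha = 1 - \frac{C}{100}$
$\frac{\bar{x} - \mu}{s/\sqrt{n}}$ has a $t$-distribution with $df = n - 1$, so
$$P\left(-t_{\alpha/2,\,n-1} < \frac{\bar{x} - \mu}{s/\sqrt{n}} < t_{\alpha/2,\,n-1}\right) = 1 - \alpha$$
$$-t_{\alpha/2,\,n-1} < \frac{\bar{x} - \mu}{s/\sqrt{n}} < t_{\alpha/2,\,n-1}$$
$$\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$
[Figure: $t$ density with central area $1 - \alpha$ between $-t_{\alpha/2,\,n-1}$ and $t_{\alpha/2,\,n-1}$, and area $\alpha/2$ in each tail]
$C\%$ Confidence Interval: $\bar{x} - E < \mu < \bar{x} + E$, where $E = t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$
o $t_{\alpha/2,\,n-1}$: critical value
o $s/\sqrt{n}$: standard error
o $t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$: margin of error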
-
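Confidence Interval for $\mu$ ($t$ vs. $z$)
[Figure: $t$ density ($df = 5$) and standard normal density, each with central area $1 - \alpha = 0.95$ and tail areas $\alpha/2 = 0.025$; critical values $\pm t_{0.025,\,5} = \pm 2.57$ and $\pm z_{0.025} = \pm 1.96$]
95% Confidence Interval using $t$ ($df = 5$): $\bar{x} \pm E$, $E = t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$, with $t_{0.025,\,5} = 2.57$
95% Confidence Interval using $z$: $\bar{x} \pm E$, $E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$, with $z_{0.025} = 1.96$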
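The critical values on this slide can be approximated numerically. The sketch below is not how tables are produced; it is one simple approach assumed here for illustration: integrate the $t$ density with Simpson's rule and find the cutoff by bisection. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: one half minus the Simpson integral over [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(tail_area, df):
    """Cutoff with the given upper-tail area, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > tail_area:
            lo = mid   # tail still too large, move the cutoff right
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.025, 5), 3))            # multiplier for df = 5
print(round(t_critical(0.025, 120), 3))          # multiplier for df = 120
print(round(NormalDist().inv_cdf(0.975), 3))     # z multiplier
```

The $t$ multiplier is noticeably larger than the $z$ multiplier at $df = 5$ and close to it at $df = 120$.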
-
Confidence Interval for $\mu$
Suppose we are interested in the average high score of millions all over the world who play a very popular computer game.
Unfortunately, the server does not keep a record of high scores, so we cannot simply determine the average score of the entire
population (true mean $\mu$). We do not know the population standard deviation $\sigma$ either. However, all individuals do know their own high score, and we also happen to know that the population high scores are not normally distributed. We take a
random sample of 121 players and calculate their mean high score to be 5000 and standard deviation to be 1000.
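What is the 95% confidence interval for $\mu$?
o Population distribution not normal
o $s = 1000$ ($\sigma$ unknown)
o $n = 121$ ($n \ge 30$)
o $\bar{x} = 5000$
$\alpha = 0.05$ (95% confidence)
$df = n - 1 = 121 - 1 = 120$
$t_{\alpha/2,\,n-1} = t_{0.025,\,120} = 1.980$
95% CI ($t$): $\bar{x} \pm E$, $E = t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$
$$5000 \pm 1.980 \cdot \frac{1000}{\sqrt{121}} = [4820, 5180]$$
95% CI ($z$): $\bar{x} \pm E$, $E = z_{\alpha/2}\frac{s}{\sqrt{n}}$
$$5000 \pm 1.960 \cdot \frac{1000}{\sqrt{121}} = [4821.82, 5178.18]$$
$\frac{\bar{x} - \mu}{s/\sqrt{n}}$ has a $t$-distribution but is well approximated by the standard normal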
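The two intervals for this example can be compared side by side. In this sketch the $t$ multiplier is the table value quoted in the example, while the $z$ multiplier comes from Python's `statistics.NormalDist`; the variable names are chosen here for illustration.

```python
import math
from statistics import NormalDist

xbar, s, n = 5000.0, 1000.0, 121
se = s / math.sqrt(n)                    # estimated standard error

t_mult = 1.980                           # t_{0.025, 120}, table value from the example
z_mult = NormalDist().inv_cdf(0.975)     # z_{0.025}, about 1.96

t_ci = (xbar - t_mult * se, xbar + t_mult * se)
z_ci = (xbar - z_mult * se, xbar + z_mult * se)
print([round(v, 2) for v in t_ci])
print([round(v, 2) for v in z_ci])
```

The $t$ interval is slightly wider, but with 120 degrees of freedom the difference is under two points on either side.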
-
Confidence Interval for $\mu$
Suppose we are interested in the average high score of millions all over the world who play a very popular computer game.
Unfortunately, the server does not keep a record of high scores, so we cannot simply determine the average score of the entire
population (true mean $\mu$). We do not know the population standard deviation $\sigma$ either. However, all individuals do know their own high score, and we also happen to know that the population high scores are normally distributed. We take a
random sample of 25 players and calculate their mean high score to be 5000 and standard deviation to be 1000.
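What is the 99% confidence interval for $\mu$?
o Population distribution normal
o $s = 1000$ ($\sigma$ unknown)
o $n = 25$ ($n < 30$)
o $\bar{x} = 5000$
$\alpha = 0.01$ (99% confidence)
$df = n - 1 = 25 - 1 = 24$
$t_{\alpha/2,\,n-1} = t_{0.005,\,24} = 2.797$
99% CI ($t$): $\bar{x} \pm E$, $E = t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$
$$5000 \pm 2.797 \cdot \frac{1000}{\sqrt{25}} = [4440.6, 5559.4]$$
$\frac{\bar{x} - \mu}{s/\sqrt{n}}$ has a $t$-distribution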
-
Confidence Interval for $\mu$
Founded in 1998, Telephia provides a wide variety of information on cellular phone use. In 2006, Telephia reported that United Kingdom (UK) subscribers with 3G phones spent an average of 8.3 hours per month listening to full-track music on their cell phones. Suppose we hypothesize that US subscribers are different from UK subscribers in their phone usage. Say we draw a random sample of size 8 from the US population of 3G subscribers. Further suppose (unrealistically) that the distribution of time usage follows a normal distribution. Suppose we are interested in constructing a 95% confidence interval for the mean usage for US subscribers and using that to test our hypothesis. What would the 95% confidence interval about the population mean time of US subscribers look like? With $\alpha = .05$, can we conclude that US subscribers have a different mean time usage than UK subscribers?
Sample: 5, 6, 0, 4, 11, 9, 2, 3
What are the null and alternative hypotheses?
$H_0: \mu = 8.3$ vs. $H_1: \mu \ne 8.3$
What is $\bar{x}$, and what is $s$?
$\bar{x} = 5$
$s = 3.625$
What is the confidence interval?
$\alpha = 0.05$ (95% confidence)
$df = n - 1 = 8 - 1 = 7$
$t_{\alpha/2,\,n-1} = t_{0.025,\,7} = 2.365$
95% CI: $5 \pm 2.365 \cdot \frac{3.625}{\sqrt{8}} = 5 \pm 3.031 = [1.969, 8.031]$
What is the conclusion?
The CI does not contain 8.3, so we reject $H_0$.
Substantively, this means that we conclude that US 3G subscribers' mean time usage is statistically significantly different from 8.3 hours per month (UK subscribers' mean time).
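The numbers in this example can be recomputed from the raw sample. The sketch below uses Python's `statistics` module, with the multiplier taken from the example's table value. It also checks the decision a second way, through the two-sided $t$ statistic, since rejecting when the 95% CI excludes 8.3 is equivalent to rejecting when $|t|$ exceeds the same critical value. The variable names are illustrative.

```python
import math
import statistics

sample = [5, 6, 0, 4, 11, 9, 2, 3]   # hours per month, from the example
mu0 = 8.3                            # UK mean under H0
t_mult = 2.365                       # t_{0.025, 7}, table value from the example

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)         # sample sd with n - 1 in the denominator
margin = t_mult * s / math.sqrt(n)
ci = (xbar - margin, xbar + margin)

# Two equivalent decisions at alpha = .05 (two-sided):
reject_by_ci = not (ci[0] <= mu0 <= ci[1])
t_stat = (xbar - mu0) / (s / math.sqrt(n))
reject_by_t = abs(t_stat) > t_mult
print(round(s, 3), [round(v, 3) for v in ci], round(t_stat, 3), reject_by_ci, reject_by_t)
```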