# Mathematical Readiness & Descriptive Statistics (Part 1)

Post on 09-Sep-2018


TRANSCRIPT

One-Sample t-Test

Scenario:

We are testing whether the mean $\mu$ of a population is equal to a value $\mu_0$.

We don't know the population variance $\sigma^2$, so it must be estimated with $s^2$.

Assume the data are normally distributed or, if $n \geq 30$, we can use the CLT.

Hypothesis test:

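$H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$

$H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$

$H_0: \mu = \mu_0$ vs. $H_1: \mu < \mu_0$

A logical test statistic:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$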

But what is its distribution?

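t-Distribution

When the population standard deviation $\sigma$ is unknown, use the sample standard deviation $s$ as an estimate:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Substituting $s$ for $\sigma$:

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \;\longrightarrow\; \frac{\bar{x} - \mu}{s/\sqrt{n}}$$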

If either:

o The sample size is small ($n < 30$) but the underlying population distribution is normal

or

o The sample size is large ($n \geq 30$)

Then:

$\frac{\bar{x} - \mu}{s/\sqrt{n}}$ has a $t$-distribution with $n - 1$ degrees of freedom ($df$)
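As a minimal sketch of the statistic described above, the following computes the sample standard deviation (dividing by n - 1) and the resulting t value with n - 1 degrees of freedom. The function name `one_sample_t` and the data are hypothetical and not from the slides.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t, df) for testing H0: mu = mu0 when sigma is unknown."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)           # sample sd, divides by n - 1
    t = (xbar - mu0) / (s / math.sqrt(n))  # estimated standard error in the denominator
    return t, n - 1

# Hypothetical data, chosen only for illustration.
scores = [12.0, 15.0, 11.0, 14.0, 13.0]
t_stat, df = one_sample_t(scores, mu0=12.0)
print(f"t = {t_stat:.3f}, df = {df}")  # t = 1.414, df = 4
```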

t-Distribution

One logical question might be: why does this statistic, which looks a lot like the Z-statistic, not have a normal distribution?

The simplest answer is that the population standard deviation $\sigma$ is being estimated by the sample standard deviation $s$.

$s$ also has a sampling distribution, like the sample mean does, though its distribution looks quite different from the sampling distribution of the mean.

It is a skewed distribution.

It is related to what is called the chi-squared distribution.

The added variability from having to estimate the variance gives the test statistic a distribution with fatter tails.

Essentially, there is more uncertainty in the test, which is captured in the distribution.

$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ has a $t$-distribution with $n - 1$ degrees of freedom ($df$)

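t-Distribution

Student's $t$-distribution: William Gosset and Guinness

Similar to the standard normal distribution:

o $t$ values range from $-\infty$ to $\infty$

o $\mu = 0$ and symmetric about $t = 0$

o Instead of $\sigma$, spread/shape is defined by the degrees of freedom ($df$). This is its only parameter.

o As $df$ increases, the distribution approaches the standard normal distribution

$$f(t) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}},$$

where $\Gamma$ is the gamma function, and $\nu$ is the degrees of freedom.

$\mu = 0$

$\sigma^2 = \frac{\nu}{\nu - 2}$, for $\nu > 2$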
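The density and variance above can be explored numerically. The sketch below evaluates the t density with the log-gamma function (to avoid overflow for large degrees of freedom) and compares it with the standard normal density in the tail. The helper names `t_pdf`, `normal_pdf`, and `t_variance` are illustrative choices, not from the slides.

```python
import math

def t_pdf(t, nu):
    """Density of Student's t with nu degrees of freedom."""
    log_const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                 - 0.5 * math.log(nu * math.pi))
    return math.exp(log_const - (nu + 1) / 2 * math.log1p(t * t / nu))

def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def t_variance(nu):
    """Variance nu / (nu - 2), defined only for nu > 2."""
    if nu <= 2:
        raise ValueError("variance is defined only for nu > 2")
    return nu / (nu - 2)

# The tail density at t = 2.5 shrinks toward the normal value as nu grows.
for nu in (1, 5, 30, 120):
    print(nu, round(t_pdf(2.5, nu), 5), round(normal_pdf(2.5), 5))
```

For small degrees of freedom the t density at 2.5 is noticeably larger than the normal density, which is the "fatter tails" behavior; the gap narrows as the degrees of freedom increase.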

t-Distribution

What are Degrees of Freedom ($df$)?

Number of data values in the sample that are free to vary when estimating parameters

Suppose we know that the sample mean of 5 values is equal to 10. In other words, $\bar{x} = 10$ is an estimate of the parameter $\mu$ based on a sample of $n = 5$ values. However, we don't actually know the individual values of the sample. If we were to guess the 5 values, we would be free to guess any value for the first 4. However, once we've guessed 4 numbers, the last number must be chosen such that the average comes out to 10.

For example: ____ ____ ____ ____ ____

$$\bar{x} = \frac{8 + 23 + 12 + 4 + x_5}{5} = 10 \;\Rightarrow\; x_5 = 3$$

$df = n - 1 = 5 - 1 = 4$

Values: 8 (free), 23 (free), 12 (free), 4 (free), 3 (fixed), giving 4 degrees of freedom
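The counting argument can be checked directly: once the mean and the first four values are chosen, the fifth value is forced. A small sketch, with the hypothetical helper name `fixed_last_value`:

```python
def fixed_last_value(free_values, sample_mean):
    """Given n - 1 freely chosen values and the sample mean, return the forced n-th value."""
    n = len(free_values) + 1
    return n * sample_mean - sum(free_values)

free = [8, 23, 12, 4]           # the four freely guessed values from the example
last = fixed_last_value(free, sample_mean=10)
print(last)                     # 3
print(len(free))                # 4 degrees of freedom
```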

One-Sample t-Test (Example)

Suppose that the mean IQ of non-psychology UCM students is 110 and that their IQs are normally distributed. We randomly sample 9 students from UCM's psych department and give them an IQ test. The average IQ of the sample is 117 with variance 121. Assume psych student IQs are also normally distributed. Is the mean IQ of the psych students significantly larger than the mean IQ of the non-psych students? Use $\alpha = .05$.

What are the null and alternative hypotheses?

$H_0: \mu = 110$ vs. $H_1: \mu > 110$

Because the population sd is not known, a t-statistic is used:

$$t = \frac{\bar{x} - \mu_0}{\sqrt{s^2/n}} = \frac{117 - 110}{\sqrt{121/9}} = \frac{7}{11/3} \approx 1.91$$

Critical value method:

$df = n - 1 = 9 - 1 = 8$

One-tailed test with $\alpha = .05$

$t_{crit} = t_{0.05,8} = 1.860$. The area beyond the critical value (1.860) is the critical region where we would reject $H_0$.

Since $t > t_{crit}$ ($1.91 > 1.860$), we reject $H_0$.

Thus, we can conclude that the mean IQ of UCM psych students is larger than that of UCM non-psych students.
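The arithmetic of this example can be reproduced in a few lines. This is only a sketch of the critical value method: the critical value 1.860 is taken from the slide as a table value rather than computed, and the variable names are illustrative.

```python
import math

mu0 = 110      # hypothesized mean (non-psych students)
xbar = 117     # sample mean
s2 = 121       # sample variance
n = 9          # sample size
t_crit = 1.860 # one-tailed critical value for alpha = .05, df = 8 (from the slide)

t_stat = (xbar - mu0) / math.sqrt(s2 / n)
reject = t_stat > t_crit   # upper-tailed test: reject when t exceeds the critical value
print(f"t = {t_stat:.2f}, reject H0: {reject}")  # t = 1.91, reject H0: True
```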

Confidence Interval for $\mu$ ($\sigma$ Unknown)

Suppose we are interested in the average high score of the millions of people all over the world who play a very popular computer game. Unfortunately, the server does not keep a record of high scores, so we cannot simply determine the average score of the entire population (true mean $\mu$). We do not know the population standard deviation $\sigma$ either. However, all individuals do know their own high score, and we also happen to know that the population high scores are not normally distributed. We take a random sample of 121 players and calculate their mean high score to be 5000 and standard deviation to be 1000.

What is the 95% confidence interval for $\mu$?

o Population distribution not normal

o $s = 1000$ ($\sigma$ unknown)

o $n = 121$ ($n \geq 30$; Central Limit Theorem holds)

o $\bar{x} = 5000$

Confidence Interval for $\mu$ ($\sigma$ Unknown)

As with all confidence intervals, we need to know what the point estimate, appropriate multiplier, and standard error are.

The point estimate is the estimate of the population parameter. Here, $\mu$ is estimated by $\bar{x}$.

The standard error is the standard deviation of the distribution of the point estimate. Here we are estimating the standard error: we showed that the standard error of $\bar{x}$ is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. When we don't know $\sigma$, this is estimated with $s/\sqrt{n}$.

Finally, the appropriate multiplier is a value determined by the distribution of the point estimate and the desired level of confidence. For the $t$-distribution, a value of $t_{\alpha/2, df}$ is the appropriate multiplier for a $100(1-\alpha)\%$ confidence interval, where $df = n - 1$.

$100(1-\alpha)\%$ CI for $\mu$ ($\sigma$ unknown):

$$\bar{x} - t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}$$

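Confidence Interval for $\mu$ ($\sigma$ Unknown)

$C\%$ confidence level: $\frac{C}{100} = 1 - \alpha$, so $\alpha = 1 - \frac{C}{100}$

$$1 - \alpha = P\left(-t_{\alpha/2,n-1} < \frac{\bar{x} - \mu}{s/\sqrt{n}} < t_{\alpha/2,n-1}\right)$$

$$-t_{\alpha/2,n-1} < \frac{\bar{x} - \mu}{s/\sqrt{n}} < t_{\alpha/2,n-1}$$

$$\bar{x} - t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}$$

[Figure: $t$ density with area $\alpha/2$ in each tail]

$\frac{\bar{x} - \mu}{s/\sqrt{n}}$ has a $t$ distribution with $df = n - 1$

$C\%$ Confidence Interval

$L < \mu < U$, where $L$ and $U$ are the lower and upper limits

$L, U = \bar{x} \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}$

$t_{\alpha/2,n-1}$: critical value

$\frac{s}{\sqrt{n}}$: standard error

$t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}$: margin of error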

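Confidence Interval for $\mu$ ($t$ vs. $z$)

95% Confidence Interval using $t$ ($df = 5$):

Lower and upper limits $L, U = \bar{x} \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}$, with $t_{0.025,5} = 2.57$

95% Confidence Interval using $z$:

Lower and upper limits $L, U = \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$, with $z_{0.025} = 1.96$

[Figure: $t$ density ($df = 5$) and standard normal density, each with central area $1 - \alpha = 0.95$ and area $\alpha/2 = 0.025$ in each tail, cut at $\pm t_{0.025,5} = \pm 2.57$ and $\pm z_{0.025} = \pm 1.96$ respectively]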

Confidence Interval for $\mu$ ($\sigma$ Unknown)

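o Population distribution not normal

o $s = 1000$ ($\sigma$ unknown)

o $n = 121$ ($n \geq 30$)

o $\bar{x} = 5000$

$\alpha = 0.05$ (95% confidence)

$df = n - 1 = 121 - 1 = 120$

95% CI ($t$): $\bar{x} \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} = 5000 \pm 1.980 \cdot \frac{1000}{\sqrt{121}} = [4820, 5180]$

95% CI ($z$): $\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}} = 5000 \pm 1.960 \cdot \frac{1000}{\sqrt{121}} = [4821.82, 5178.18]$

$\frac{\bar{x} - \mu}{s/\sqrt{n}}$ has a $t$-distribution but is well approximated by the standard normal

$t_{\alpha/2,n-1} = t_{0.025,120} = 1.980$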
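The two intervals for the 121-player sample can be reproduced with a short helper. This is a sketch only: `mean_ci` is a hypothetical name, and the multipliers 1.980 and 1.960 are the table values quoted in the slides rather than values computed here.

```python
import math

def mean_ci(xbar, s, n, multiplier):
    """Interval xbar +/- multiplier * s / sqrt(n) for a population mean."""
    margin = multiplier * s / math.sqrt(n)   # margin of error
    return xbar - margin, xbar + margin

t_ci = mean_ci(5000, 1000, 121, 1.980)   # t multiplier, df = 120
z_ci = mean_ci(5000, 1000, 121, 1.960)   # standard normal multiplier

print("t interval: [%.2f, %.2f]" % t_ci)
print("z interval: [%.2f, %.2f]" % z_ci)
```

With 120 degrees of freedom the two intervals differ by less than two points on each side, which illustrates why the normal approximation is considered adequate for large samples.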

Confidence Interval for $\mu$ ($\sigma$ Unknown)

Suppose we are interested in the average high score of the millions of people all over the world who play a very popular computer game. Unfortunately, the server does not keep a record of high scores, so we cannot simply determine the average score of the entire population (true mean $\mu$). We do not know the population standard deviation $\sigma$ either. However, all individuals do know their own high score, and we also happen to know that the population high scores are normally distributed. We take a random sample of 25 players and calculate their mean high score to be 5000 and standard deviation to be 1000.

What is the 99% confidence interval for $\mu$?

o Population distribution normal

o $s = 1000$ ($\sigma$ unknown)

o $n = 25$ ($n < 30$)

o $\bar{x} = 5000$

$\alpha = 0.01$ (99% confidence)

$df = n - 1 = 25 - 1 = 24$

99% CI ($t$): $\bar{x} \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} = 5000 \pm 2.797 \cdot \frac{1000}{\sqrt{25}} = [4440.6, 5559.4]$

$\frac{\bar{x} - \mu}{s/\sqrt{n}}$ has a $t$-distribution

$t_{\alpha/2,n-1} = t_{0.005,24} = 2.797$
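One way to sanity-check a table value such as 2.797 is to integrate the t density numerically and confirm that the upper tail area is close to 0.005. The sketch below uses Simpson's rule with the standard library only; `t_pdf` and `upper_tail` are hypothetical helper names, and this is a rough numerical check, not how statistical tables are actually produced.

```python
import math

def t_pdf(t, nu):
    """Density of Student's t with nu degrees of freedom."""
    log_const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                 - 0.5 * math.log(nu * math.pi))
    return math.exp(log_const - (nu + 1) / 2 * math.log1p(t * t / nu))

def upper_tail(t_value, nu, steps=2000):
    """P(T > t_value) for t_value >= 0, via Simpson's rule on [0, t_value]."""
    h = t_value / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(t_value, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 - total * h / 3               # symmetry: half the mass lies above 0

tail = upper_tail(2.797, 24)
margin = 2.797 * 1000 / math.sqrt(25)        # margin of error for the 25-player sample
print(f"tail area = {tail:.4f}, margin of error = {margin:.1f}")
```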

Confidence Interval for $\mu$ ($\sigma$ Unknown)

Founded in 1998, Telephia provides a wide variety of information on cellular phone use. In 2006, Telephia reported that United Kingdom (UK) subscribers with 3G phones spent an average of 8.3 hours per month listening to full-track music on their cell phones. Suppose we hypothesize that US subscribers differ from UK subscribers in their phone usage. Say we draw a random sample of size 8 from the US population of 3G subscribers. Further suppose (unrealistically) that the distribution of time usage follows a normal distribution. Suppose we are interested in constructing a 95% confidence interval for the mean usage for US subscribers and using that to test our hypothesis. What would the 95% confidence interval for the population mean time of US subscribers look like? With $\alpha = .05$, can we conclude that US subscribers have a different mean time usage than UK subscribers?

Sample: 5, 6, 0, 4, 11, 9, 2, 3

What are the null and alternative hypotheses?

$H_0: \mu = 8.3$ vs. $H_1: \mu \neq 8.3$

What is $\bar{x}$, and what is $s$?

$\bar{x} = 5$

$s = 3.625$
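The transcript ends before the interval itself is worked out. As a hedged sketch of how the calculation could continue, the code below recomputes the sample mean and standard deviation and builds the 95% interval. It assumes the standard table value $t_{0.025,7} \approx 2.365$, which does not appear in the slides, and the variable names are illustrative.

```python
import math
import statistics

sample = [5, 6, 0, 4, 11, 9, 2, 3]   # hours per month, from the slide
mu0 = 8.3                            # UK mean under H0
t_crit = 2.365                       # assumed table value for alpha/2 = 0.025, df = 7

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)         # sample sd, divides by n - 1
margin = t_crit * s / math.sqrt(n)
lower, upper = xbar - margin, xbar + margin

# Two-sided test via the interval: reject H0 when mu0 falls outside it.
reject = not (lower <= mu0 <= upper)
print(f"xbar = {xbar}, s = {s:.3f}")
print(f"95% CI: [{lower:.2f}, {upper:.2f}], reject H0: {reject}")
```

Under these assumptions the interval does not contain 8.3, so the two-sided test at $\alpha = .05$ would reject $H_0$.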
