stat 360—regression analysiscourse1.winona.edu/bdeppa/biostatistics/handouts/ho3 d.docx · web...

20
STAT 405 - BIOSTATISTICS Handout 3 – Methods for a Single Categorical Variable, Part II In the previous handout, we discussed the binomial distribution. To learn more about this distribution, refer to Section 4.8 of your text. In this handout, we will continue to discuss the use of the binomial distribution for testing hypotheses concerning a single binomial proportion. For additional information, see Section 7.10 of your text. THE Z-TEST FOR A SINGLE PROPORTION In your introductory course, you more than likely discussed normal theory methods (e.g., the z-test) for testing hypotheses concerning a single proportion. Let’s review these methods for the following example. EXAMPLE 1: Cancer (Example 7.49 of your text, pg. 248) The safety of people who work at or live by nuclear-power plants has been the subject of widely publicized debate in recent years. One possible health hazard from radiation exposure is an excess of cancer deaths among those exposed. One problem with studying this question is that the number of deaths attributable to either cancer in general or specific types of cancer is small, and reaching statistically significant conclusions is difficult, except after long periods of follow-up. An alternative approach is to perform a proportional-mortality study, whereby the proportion of deaths attributed to a specific cause in an exposed group is compared with the corresponding proportion in a large population. Suppose, for example, that 13 deaths have occurred among 55-64 year-old male workers in a nuclear-power plant and that in 5 of them the cause of death was cancer. Assume, based on vital- statistics reports, that approximately 20% of all deaths can be attributed to some form of cancer. Is there evidence that the proportion of deaths from cancer in nuclear-power plant workers 1

Upload: others

Post on 07-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

STAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable, Part II

In the previous handout, we discussed the binomial distribution. To learn more about this distribution, refer to Section 4.8 of your text.

In this handout, we will continue to discuss the use of the binomial distribution for testing hypotheses concerning a single binomial proportion. For additional information, see Section 7.10 of your text.

THE Z-TEST FOR A SINGLE PROPORTION

In your introductory course, you more than likely discussed normal theory methods (e.g., the z-test) for testing hypotheses concerning a single proportion. Let’s review these methods for the following example.

EXAMPLE 1: Cancer (Example 7.49 of your text, pg. 248)

The safety of people who work at or live by nuclear-power plants has been the subject of widely publicized debate in recent years. One possible health hazard from radiation exposure is an excess of cancer deaths among those exposed. One problem with studying this question is that the number of deaths attributable to either cancer in general or specific types of cancer is small, and reaching statistically significant conclusions is difficult, except after long periods of follow-up. An alternative approach is to perform a proportional-mortality study, whereby the proportion of deaths attributed to a specific cause in an exposed group is compared with the corresponding proportion in a large population.

Suppose, for example, that 13 deaths have occurred among 55-64 year-old male workers in a nuclear-power plant and that in 5 of them the cause of death was cancer. Assume, based on vital-statistics reports, that approximately 20% of all deaths can be attributed to some form of cancer. Is there evidence that the proportion of deaths from cancer in nuclear-power plant workers is greater than the proportion of deaths from cancer in men of comparable age in the general population?

Set up your null and alternative hypotheses:

1

Page 2: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

Find the test statistic:

Find the p-value:

Write a conclusion in the context of the original problem:

EXACT BINOMIAL TEST FOR A SINGLE PROPORTION

Note that the z-test for a single proportion is valid only when the normal approximation to the binomial distribution is valid. When this criterion is not satisfied, the hypothesis test should instead be based on exact binomial probabilities.

The exact binomial test for the Cancer example can be carried out as follows:

Check assumptions. For this test, you must check whether the binomial distribution is appropriate for the problem.

2

Page 3: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

Set up your null and alternative hypotheses.

Use the binomial distribution to find the exact p-value.

The following graphic shows the binomial distribution for the number of deaths attributable to cancer (n = 13 and p = .20):

Recall that the p-value is the probability of observing a sample AT LEAST AS EXTREME as our data, assuming the null hypothesis is true. For this example, “at least as extreme” implies observing 5 or more deaths due to cancer.

We can use SAS to find the probability of 5 or more deaths due to cancer:

data BinomialProbabilities;

3

Page 4: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

prob = 1-cdf('Binomial', 4, .20, 13);proc print data=BinomialProbabilities;run;

The interpretation of this p-value is as follows: Even if the true proportion of deaths from cancer for nuclear-power plant workers is the same as that of the general population (p = 20%), there is still about a 10% probability of observing a sample proportion of 5/13 = 38.5% just by random chance alone.

Note that we reject the null hypothesis for small p-values. This is because a small p-value indicates that observing a difference at least as extreme as the one obtained in the sample is NOT LIKELY to happen by random chance alone. Therefore, when the p-value is small, the observed difference between the sample proportion and the hypothesized value is statistically significant since it is not attributable to chance.

Write a conclusion in the context of the original problem:

“We do NOT have evidence to conclude the proportion of deaths from cancer is different for nuclear-power plant workers than for men of comparable age in the general population (p-value = .099).”

WHY DO WE DEFINE P-VALUE WITH “AT LEAST AS EXTREME”?

Consider the previous example. One question that often arises is why we used the probability of 5 or more deaths to find the p-value, rather than the probability of exactly 5 deaths (what we actually observed in the study). One intuitive answer is that if the number of overall deaths studied was very large, then the probability of any specific occurrence would be small.

For example, suppose that out of 1,000 deaths in some study population, exactly 200 were attributable to cancer. If we assume the proportion of deaths in the study population due to cancer is the same as that of the general population for this age/gender group (20%), then the probability of observing exactly 200 deaths in the study population is calculated as follows:

data BinomialProbabilities;prob = pdf('Binomial', 200, .20, 1000);proc print data=BinomialProbabilities;run;

4

Page 5: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

Questions:

1. If this were the p-value, what would our conclusion be?

2. Does this conclusion make sense? Explain.The better approach is to find the probability of obtaining a result at least extreme as the observed data. In this case, the p-value should have been calculated as follows:

data BinomialProbabilities;prob = 1-cdf('Binomial', 199, .20, 1000);proc print data=BinomialProbabilities;run;

Questions:

1. Now, what is our conclusion?

2. Does this conclusion make sense? Explain.

EXACT CONFIDENCE INTERVALS FOR A BINOMIAL PROPORTION

Once again, in your introductory course, you more than likely constructed confidence intervals for a single proportion using normal-theory methods. Consider the data from the previous example. Let’s construct a 95% confidence interval for p = the true proportion of deaths from cancer for nuclear-power plant workers using normal theory methods:

5

Page 6: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

Just like for hypothesis tests, this method is valid only when the normal approximation to the binomial distribution is valid. When this criterion is not satisfied, the confidence interval should be based on exact binomial probabilities. For additional information, see Section 6.8 of your text.

EXAMPLE 1 (cont’d): Consider the data from the previous example. Let’s construct a 95% confidence interval for p = the true proportion of deaths from cancer for nuclear-power plant workers using the binomial distribution.

The basic idea for constructing a confidence interval for a binomial proportion is as follows: the confidence interval includes all values p* such that the data obtained in the sample would NOT lead us to reject Ho: p = p*.

One-sided Interval:

Let p* = .15. P(5 or more deaths out of 13 if p = .15) =

data BinomialProbabilities;prob = 1-cdf('Binomial', 4, .15, 13);proc print data=BinomialProbabilities;run;

Let p* = .16. P(5 or more deaths out of 13 if p = .16) =

data BinomialProbabilities;prob = 1-cdf('Binomial', 4, .15, 13);proc print data=BinomialProbabilities;

6

Page 7: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

run;

Let p* = .17. P(5 or more deaths out of 13 if p = .17) =

data BinomialProbabilities;prob = 1-cdf('Binomial', 4, .15, 13);proc print data=BinomialProbabilities;run;

Two-sided Interval:

Section 6.8 of your text discusses two-sided binomial exact confidence intervals:

An exact (1-α)100% confidence interval for the binomial proportion is given by (p1, p2), where p1 and p2 satisfy these equations:

P(X ≥ x | p = p1) = α/2P(X ≤ x | p = p2) = α/2

To calculate these endpoints in SAS, you can use the following program:

data BinomialExactCIs;

n = 13; *Input this--sample size;x = 5; *Input this--number of "successes" in your sample;

do i = 0 to 1 by .01; lower = 1-cdf('Binomial', x-1, i, n); upper = cdf('Binomial', x, i, n); output; end;

proc print data=BinomialExactCIs; run;

7

Page 8: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

Output:

8

Page 9: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

Usually with exact confidence intervals, we cannot exactly satisfy α/2 in each tail. Instead, we use a more conservative approach:

Lower endpoint: Find the largest value of p1 so that P(X ≥ x | p = p1) ≤ α/2

Upper endpoint: Find the smallest value of p2 so that P(X ≤ x | p = p2) ≤ α/2

Using these guidelines and the above SAS output, write the two-sided 95% confidence interval for p = the true proportion of deaths from cancer for nuclear-power plant workers.

9

Page 10: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

One-sided Intervals:

To obtain one-sided confidence intervals use the same process but use instead of

One-sided lower bound 100(1- CI for p

Lower endpoint: Find the largest value of p1 so that P(X ≥ x | p = p1) ≤ α

100(1 - )% CI for p is then given by (p1 , 1)

One-sided upper bound 100(1- CI for p

Upper endpoint: Find the smallest value of p2 so that P(X ≤ x | p = p2) ≤ α

100(1 - )% CI for p is then given by (0, p2)

Using these guidelines and the above SAS output, write the one-sided lower bound 95% confidence interval for p = the true proportion of deaths from cancer for nuclear-power plant workers.

USING SAS PROC FREQ FOR TESTS AND CONFIDENCE INTERVALSOnce again, consider the example dealing with cancer deaths and nuclear-power plant workers. Recall that 5 out of 13 deaths of workers were cancer-related, while the rate in the general population was only 20%.

You can use the following commands to carry out the test in SAS PROC FREQ:

data Cancer;input cancer_death$ count;datalines;yes 5no 8;proc freq order=data;tables cancer_death / binomial(p=.20);weight count;

10

Page 11: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

exact binomial;run;

EXAMPLE 2: Obstetrics (Exercise 7.22 of your text, page 260)

The probability of bearing twins in the U.S. is about 1 in 90. This proportion is thought to be affected by a number of factors, including age, race, and parity. To study the effect of age, hospital records are abstracted. Of 538 deliveries for women under 20, 2 resulted in twins. Is there evidence that the probability of bearing twins is smaller for younger women (under 20) than for the general population?

Use the binomial exact test to investigate this research question.

Check assumptions. For this test, you must check whether the binomial distribution is appropriate for the problem.

Set up your null and alternative hypotheses.

Ho:

11

Note: SAS uses an alternative method based on the F-distribution. The results are slightly different than those obtained using the binomial distribution.

Page 12: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

Ha:

Use the binomial distribution to find the exact p-value.

Recall that the p-value is the probability of observing a sample AT LEAST AS EXTREME as our data, assuming the null hypothesis is true. For this example, “at least as extreme” implies observing 2 or fewer twin births.

We can use SAS to find the probability of 2 or fewer:

data BinomialProbabilities;prob = cdf('Binomial', 2, .011, 538);proc print data=BinomialProbabilities;run;

The interpretation of this p-value is as follows: If there is in reality no difference between the proportion of twin births in women under 20 compared to the proportion of twin births in the general population, there is only a 6.5% probability of observing results at least as extreme as those obtained in this sample just by random chance alone.

In this example, the p-value is technically not small enough for us to rule out observing the difference by random chance alone; that is, we do not have enough evidence to indicate the results are statistically significant.However, since this p-value is close to falling below 5%, some may claim we have marginal evidence that the study population differs from the general population.

Write a conclusion in the context of the original problem:

“Using a significance level of 5%, we do NOT have evidence to conclude the probability of bearing twins is smaller for younger women (under 20) than for the general population (p-value = .065).”

Alternatively, some may say, “We have marginal evidence that the probability of bearing twins is smaller for younger women (under 20) than for the general population (p-value = .065).”

To calculate an exact two-sided confidence interval, you need to modify the SAS program somewhat:

data BinomialExactCIs;n = 538; *Input this--the sample size;x = 2; *Input this-- number of "successes" in your sample;

12

Page 13: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

do i = 0 to 1 by .0001; lower = 1-cdf('Binomial', x-1, i, n); upper = cdf('Binomial', x, i, n); output; end;

proc print data=BinomialExactCIs; run;

Finally, let’s verify the p-value and confidence interval using PROC FREQ:

data TwinBirths;input twins$ count;datalines;yes 2no 536;proc freq order=data;tables twins / binomial(p=.011);weight count;exact binomial;run;

13

Page 14: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

14

Page 15: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

Binomial Exact CI’s in R

The package exactci available on CRAN has functions to conduct binomial exact tests for p and construct exact confidence intervals for p.

Description

Performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment. This function will also return an binomial exact CI for p.

Usagebinom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)

Argumentsx number of successes, or a vector of length 2 giving the numbers of successes and

failures, respectively.n number of trials; ignored if x has length 2.p hypothesized probability of success.alternative indicates the alternative hypothesis and must be one of "two.sided", "greater" or

"less". You can specify just the initial letter.conf.level confidence level for the returned confidence interval.

Examples above done in R

Example 1: Nuclear power plant workers example

> binom.test(5,13,p=.20,"greater")

Exact binomial test

data: 5 and 13 number of successes = 5, number of trials = 13, p-value = 0.09913alternative hypothesis: true probability of success is greater than 0.2 95 percent confidence interval: 0.1656594 1.0000000 sample estimates:probability of success 0.3846154

To obtain a two-sided confidence interval instead, specify a two-tailed alternative in the binomial test.

15

Page 16: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

> binom.test(5,13,p=.20,"two")

Exact binomial test

data: 5 and 13 number of successes = 5, number of trials = 13, p-value = 0.1541alternative hypothesis: true probability of success is not equal to 0.2 95 percent confidence interval: 0.1385793 0.6842224 sample estimates:probability of success 0.3846154

Example 2: Obstetrics example

> binom.test(2,538,p=1/90,"less")

Exact binomial test

data: 2 and 538 number of successes = 2, number of trials = 538, p-value = 0.06197alternative hypothesis: true probability of success is less than 0.01111111 95 percent confidence interval: 0.00000000 0.01165559 sample estimates:probability of success 0.003717472

Two-sided CI is given below although one-sided upper is appropriate here.

> binom.test(2,538,p=1/90,"two")

Exact binomial test

data: 2 and 538 number of successes = 2, number of trials = 538, p-value = 0.1432alternative hypothesis: true probability of success is not equal to 0.01111111 95 percent confidence interval: 0.0004505206 0.0133637398 sample estimates:probability of success 0.003717472

Binomial Exact Tests in JMP

16

Page 17: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

In JMP you can use the Binomial Table in JMP to find p-values for the exact test. Remember if you are doing a two-sided test to double the one-sided p-value.

For Binomial Exact CI’s for p you can use the Binomial Exact CI’s in JMP file to find the exact LCL and UCL.

Example 1: Nuclear power plant workers

Conducting the test and CI using Analyze > Distribution in JMP

17

Page 18: STAT 360—REGRESSION ANALYSIScourse1.winona.edu/bdeppa/Biostatistics/Handouts/HO3 D.docx · Web viewSTAT 405 - BIOSTATISTICSHandout 3 – Methods for a Single Categorical Variable,

18