TRANSCRIPT
Review and Final Material
BIOST 514/517 Biostatistics I / Applied Biostatistics I
Kathleen Kerr, Ph.D., Associate Professor of Biostatistics
University of Washington
Review* for Final
December 6, 2013
*not exclusively review
Final exam
• 1 hour and 50 minutes
• Roughly similar format to midterm: mix of multiple choice and short answer
• Heavy emphasis on material since midterm
• 8:30 on Wednesday, December 11
Miscellaneous
• Bring to exam: pencil, eraser, calculator
– Calculator should do arithmetic, exp(), loge(), square root and NOT run software, connect to the internet, etc.
• If you want your graded exam sent to you in campus mail, you must know your campus mailbox number. You will also be asked to fill out a mailing label.
Formulas
• Having taken one of my exams, you should have some sense of what I emphasize and what I don’t
• Don’t memorize formulas unless it helps you understand things
• Of course, you need to know simple things like odds=p/(1-p) and p=odds/(1+odds)
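These two conversions are easy to verify numerically; a minimal sketch in Python (function names are just for illustration):

```python
# Sketch of the odds <-> probability formulas from the slide
def odds_from_p(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def p_from_odds(odds):
    # p = odds / (1 + odds)
    return odds / (1 + odds)

# e.g. p = 0.2 gives odds = 0.25, and converting back recovers p
print(odds_from_p(0.2))   # 0.25
print(p_from_odds(0.25))  # 0.2
```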
Statistic
• Any function of the data. Examples:
– Descriptive statistic
– Inferential statistic
– Test statistic
Inferential Statistics
• Point Estimates
• Confidence Intervals
• Hypothesis Tests
• We emphasized summarizing study results to be as informative as possible.
Point estimates
• We like estimates of population parameters with low or no bias, that are consistent and efficient
Unbiased Estimator T
• A statistic T is an unbiased estimator of a parameter Θ if E(T)=Θ
• For example, if the parameter is the population mean µ and the estimator T is the sample mean, T is unbiased: E(T) = µ
– Easily shown from properties of expectation
• “T is unbiased” means the mean of sampling distribution of T is Θ
Consistent Estimator T
• Loosely, a statistic T is a consistent estimator of a parameter Θ if the sampling distribution of T becomes more and more concentrated around Θ as the sample size increases
– T can be biased but still be consistent
– T can be unbiased and not be consistent
Efficient Estimator T
• Loosely, a statistic T is an efficient estimator of a parameter Θ if T makes full use of the available data
• If T1 and T2 are two unbiased estimators of Θ, then T1 is more efficient than T2 if var(T1)<var(T2)
• Four estimators of the mean
[Figure: a population with mean µ; a sample X1, X2, X3, X4, X5, X6, X7, … is drawn from it]

Estimate of µ, the population mean      Unbiased?   Consistent?   Efficient?
sample median                           Biased*     No*           NA
X1 (first observation only)             unbiased    No            No
X̄odd (mean of odd-numbered obs.)        unbiased    Yes           No
X̄ (sample mean)                         unbiased    Yes           Yes
*unless the population distribution is symmetric, so that the population median equals µ
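The table can be illustrated by simulation. A sketch, using a hypothetical symmetric N(10, 2) population (all names and values are arbitrary choices, not from the lecture):

```python
import random
import statistics

# Simulation sketch: draw repeated samples from a hypothetical N(10, 2)
# population and compare four estimators of the mean mu = 10.
random.seed(1)
mu, sd, n, reps = 10.0, 2.0, 25, 2000

estimates = {"sample median": [], "X1": [], "mean of odd obs.": [], "sample mean": []}
for _ in range(reps):
    x = [random.gauss(mu, sd) for _ in range(n)]
    estimates["sample median"].append(statistics.median(x))
    estimates["X1"].append(x[0])                                   # first observation only
    estimates["mean of odd obs."].append(statistics.mean(x[::2]))  # every other observation
    estimates["sample mean"].append(statistics.mean(x))

for name, vals in estimates.items():
    # average of the estimates (bias check) and their variance (efficiency check)
    print(name, round(statistics.mean(vals), 2), round(statistics.variance(vals), 3))
```

Because this population is symmetric, the sample median comes out unbiased here too; the asterisked entries in the table refer to asymmetric populations. The sample mean shows the smallest variance across repetitions, matching the efficiency column.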
Confidence Intervals
• Values of the parameter for which our data would not be surprising
• Values of the parameter for which our data would be “typical”
• The procedure of making a 95% confidence interval
– Taking the sample
– Constructing the interval
… produces an interval that contains the true value of the parameter 95% of the time.
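This “95% of the time” property is a property of the procedure, and it can be checked by simulation; a sketch, where the population, sample size, and seed are arbitrary illustrative choices:

```python
import math
import random
import statistics

# Coverage sketch: sample repeatedly from a hypothetical N(5, 3) population,
# build a t-based 95% CI for the mean each time, and count how often the
# interval contains the true mean.
random.seed(2)
mu, sd, n, reps = 5.0, 3.0, 30, 4000
t_crit = 2.045  # 97.5th percentile of t with n - 1 = 29 df (from a t table)

covered = 0
for _ in range(reps):
    x = [random.gauss(mu, sd) for _ in range(n)]
    xbar = statistics.mean(x)
    se = statistics.stdev(x) / math.sqrt(n)
    if xbar - t_crit * se <= mu <= xbar + t_crit * se:
        covered += 1

print(covered / reps)  # close to 0.95
```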
Confidence Intervals
• Means and differences in means
– t-based
• Proportions
– Normal/Wald and exact
• Differences in proportions
– Normal/Wald
• RR and OR
– Normal/Wald interval for log-transformed parameter, then back-transform
• Survival curves
– Normal/Wald interval for transformed parameter, then back-transform
• Median and variance
– Bootstrapping
Hypothesis Testing
• Need null and alternative hypothesis
• Need a test statistic
• Need the sampling distribution of the test statistic under H0
– “Null distribution”
– We focused on tests where statistical theory tells us the null distribution
– (There are other tests that use methods to estimate the null distribution, but we didn’t get to those.)
• The alternative hypothesis tells us which region of our null distribution is “extreme”
• P-value = the probability of seeing data as extreme or more extreme than our data when the null hypothesis is true
• P-value ≈ P( data | H0 )
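For a z statistic the “as extreme or more extreme” probability comes straight from the standard normal null distribution; a sketch (the observed value 1.8 is hypothetical):

```python
import math

# P-value sketch: the null distribution of a z statistic is N(0, 1), so the
# one-sided p-value for an observed z is the upper-tail probability.
z_obs = 1.8  # hypothetical observed statistic

def normal_upper_tail(z):
    # 1 - Phi(z) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

p_one_sided = normal_upper_tail(z_obs)
p_two_sided = 2 * normal_upper_tail(abs(z_obs))
print(round(p_one_sided, 4), round(p_two_sided, 4))  # 0.0359 0.0719
```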
Hypothesis Testing
• t-tests
– One-sample
– Two-sample assuming equal variances
– Two-sample allowing for unequal variances
– Paired t-test
• Exact test for binomial proportions
• Z-test for binomial proportions
• Chi-square goodness-of-fit test
– Independence of two categorical variables
• Same as z-test for 2 binary variables
– Other goodness-of-fit tests
• Fisher’s exact test
• Rank-based tests for uncensored data
• Logrank test
P-values
• Frequently misinterpreted
– NOT the probability the null hypothesis is true
• Interpreting a “negative” study: p-value > α does not mean we “accept” the null hypothesis
– FirstSteps and Birthweight
• “results indicate that enrollment in FirstSteps does not increase birthweight”
• One problem is that “increase” sounds causal. Better: “is not associated with increased birthweight”
• Second problem: this conclusion “accepts” the null hypothesis.
• Better summary of a valid analysis yielding p-value > α: “results do not indicate evidence that enrollment in FirstSteps is associated with increased birthweight”
Baseline variables and Clinical Trials
• P-values are inferential statistics. Their purpose is to summarize the evidence in the data (sample) regarding a hypothesis about a population.
• Consider a clinical trial where patients are randomly assigned into a treatment group and a placebo group.
• The treatment/placebo label has been assigned by the investigators randomly. Any difference between the groups at baseline is by chance.
– If we test baseline variables for differences between the groups, about 5% of them will yield p-values < 0.05.
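The “about 5%” figure is easy to see by simulation: generate baseline variables that are truly independent of a random treatment label and test each one. A sketch (group sizes, number of variables, and seed are arbitrary):

```python
import math
import random
import statistics

# Sketch: baseline variables independent of a randomized treatment label.
# Testing each at alpha = 0.05 rejects for roughly 5% of them by chance alone.
random.seed(3)
n_per_group, n_vars = 100, 1000

def two_sample_p(x, y):
    # two-sided z-test p-value for a difference in means (large-sample)
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    z = (statistics.mean(x) - statistics.mean(y)) / se
    return math.erfc(abs(z) / math.sqrt(2))

n_rejected = 0
for _ in range(n_vars):
    treatment = [random.gauss(0, 1) for _ in range(n_per_group)]
    placebo = [random.gauss(0, 1) for _ in range(n_per_group)]
    if two_sample_p(treatment, placebo) < 0.05:
        n_rejected += 1

print(n_rejected / n_vars)  # near 0.05
```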
P-values for baseline variables in PBC dataset by treatment group:

age       0.018     alb        0.874
alkphos   0.747     ascites    0.433
bili      0.133     chol       0.742
edema     0.428     edema_tmt  0.877
hepmeg    0.069     sgot       0.460
platelet  0.579     stage      0.201
sex       0.326     copper     0.999
spiders   0.885     trig       0.365

Baseline variables and Clinical Trials
• But what about confounding? Don’t we want to know whether the “randomization worked”?
• Confounder:
– Associated with the POI in the sample
– Causally associated with the response
• The purpose of a p-value is for making inference from a sample to a population. It does not address the first condition.
• Can have p-value > α but there is an association in the sample.
• In a large trial can have p-value << α but the association in the sample is small.
• The best summary of the degree of association in the sample for the purposes of confounding is usually the mean.
You choose the analysis, not the data
• More generally, don’t use p-values to decide whether a variable is a confounder
• As mentioned, a p-value does not measure the strength of the association in the sample
• It is the investigator’s job to decide whether the adjusted or unadjusted association answers the scientific question.
“Multiple Comparisons” Problem
Also known as the “multiple testing” problem.
Issue: When you perform lots of hypothesis tests, your chance of making a type I error increases. Consider…

Hypothetical example
An investigator decides to study an association between eating red meat and cancer. He collects clinical data on a sample of individuals who eat red meat and a sample who do not eat red meat.
In analyzing the data, the investigator compares incidence rates of many different types of cancer between the two groups. Since there may be interactions, he also makes comparisons stratified by sex, race, and lifestyle factors.
Hypothetical example, cont’d
In a summary of the study, the investigator claims:
“the research study uncovered a significant association between consuming red meat and the incidence of lung cancer in non-smoking males (p<0.05).”
Hypothetical example, cont’d
Conclusion A: There is an association between lung cancer and eating red meat that is specific to non-smoking men.
Conclusion B: This finding is a type I error.
What do you suspect is the truth?
The probability of multiple comparisons
For a null hypothesis H0 that is true and a test performed at a given significance level α:

P(reject H0 | H0 is true) = α
P(do not reject H0 | H0 is true) = 1 − α

Next suppose n independent hypothesis tests of H0(1), H0(2), …, H0(n) are performed at level α. Suppose all n null hypotheses are true.

P(do not reject H0(i) | H0(i) is true) = 1 − α

Therefore:

P(not reject H0(1) AND not reject H0(2) AND … not reject H0(n) | all H0(i) are true) = (1 − α)^n
n           1      2      4      8      12     16     20     30
(1−.05)^n   0.95   0.902  0.814  0.663  0.540  0.440  0.359  0.217
(1−.01)^n   0.99   0.98   0.96   0.923  0.886  0.852  0.818  0.740
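The table follows directly from the formula; a quick sketch reproducing it:

```python
# Recompute the (1 - alpha)^n table: the probability of making no type I
# errors across n independent tests of true null hypotheses.
ns = (1, 2, 4, 8, 12, 16, 20, 30)
for alpha in (0.05, 0.01):
    print(alpha, [round((1 - alpha) ** n, 3) for n in ns])

# Family-wise error rate: probability of at least one false rejection
print(round(1 - 0.95 ** 30, 3))  # about 0.79 for 30 tests at alpha = 0.05
```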
Conclusion:
If 30 independent tests are performed at level α=0.05 on 30 true null hypotheses, the probability of falsely rejecting at least one null hypothesis is nearly 80%! That is, we are very likely to make a mistake!
Clearly, using a smaller α helps. In the situation above, our chances of making at least one type I error are reduced from 79% to 26% by using α=0.01. But this chance is still not negligible.
There are methods for deciding what α should be used in the presence of multiple comparisons. We aren’t covering those methods. But it is important to be aware of the issue.
“Multiple Comparisons”
• A p-value is only interpretable if the context in which it is computed is known– 1 test? Hundreds of tests?
Thousands of tests?– P-value most useful for pre-specified
primary endpoints• The same data that suggest a
hypothesis cannot be used to test the hypothesis
http://sph.washington.edu/podcasts/
59 worthwhile minutes
Association and causation
• Many scientific questions lead to data analyses to examine associations between a predictor of interest X and an outcome of interest Y. We must distinguish:
– The data are consistent with a strong association between X and Y in the population; and there are other scientific reasons to think that X affects Y.
– The data provide strong evidence that X and Y are associated in the population, and there are other scientific reasons to think that X affects Y.
– The data provide strong evidence that X affects Y.
• In all 3 cases it may be reasonable to believe that X affects Y, but it is important to be clear about what the data say. The 3rd possibility (data provide evidence of causation) is possible if X is randomly allocated and occasionally in other cases.
Smoking and lung cancer
• Many people were convinced that smoking caused lung cancer based on observing associations because:
– The association was very, very strong (risk of lung cancer 10–20 times higher among smokers than non-smokers)
• Rates of lung cancer are higher in heavy smokers than light smokers
• The cancer happens a reasonable time after smoking
• The dramatic increase in smoking rates among men during WWI was matched by a dramatic increase in lung cancer in men but not women
Smoking and lung cancer: confounding?
• Suppose the only piece of evidence we have is the early observations of higher lung cancer rates among smokers than non-smokers (assume the year is 1920)
• It is conceivable that an unmeasured confounder explains the association between smoking and lung cancer. For example, a gene that leads people to smoke, and also leads to lung cancer.
[Diagram: a hypothetical “smokey gene” causes both “cigarette smoking is pleasurable” and “prone to lung cancer”]
Smoking and lung cancer: confounding?
• This unmeasured confounder would have had to change for men, but not women, in the early 20th century.
Smoking and lung cancer
• Thus solely from observational data in humans (and perhaps animal studies?) a very strong case has been made for what we all believe today: smoking causes lung cancer
• The effect of smoking on heart disease, an issue of greater public health importance, was much harder to establish because the relative risk is only about 3.
Example: FEV in kids
• Children’s FEV data
• “…whether such deleterious effects of smoking can be detected in children who smoke. To address this question, measures of lung function were made in 654 children seen for a routine check up in a particular pediatric clinic. The children participating in this study were asked whether they were current smokers.”
• We decide to compare the distribution of FEV in smokers and nonsmokers by comparing the means.
• In the sample, the difference in means is 3.28 (smokers) − 2.57 (non-smokers) = 0.71.
• We estimate the difference in mean FEV between smokers and nonsmokers as 0.71 l/s.
• The 95% confidence interval is (0.49 l/s, 0.93 l/s).
• Data like ours would be very unlikely if the true difference were 0.
• So, do we have evidence that smoking reduces FEV among children?
[Figure: boxplots of FEV by smoking status (smoker vs. non-smoker); graphs by SMOKE]
Example: FEV in kids
• No. We have evidence that smokers have higher mean FEV, probably because they are older.
• We need to compare smokers and non-smokers for each age. We can average these (sensibly) across ages to get a single estimate.
– Regression methods make this easy.
• This age-stratified analysis yields a point estimate of smokers having 0.21 l/s smaller mean FEV than non-smokers. The 95% confidence interval is 0.43 l/s lower to 0.01 l/s higher.
Example: FEV in kids
• Have we now proved smoking lowers FEV in kids?
• No.
• Data would not be so unusual if true difference were 0 (p-value in this case is 0.060).
• But even if p << 0.05…
– Confounding by gender?
– Confounding by height?
– Perhaps kids with high FEV are more likely to play sports and this makes them less likely to smoke
– …
Example: FEV in kids
• For kids, is the association between smoking and higher FEV “real”?
• We must distinguish the unadjusted association between smoking and FEV
• …and the age-adjusted association between smoking and FEV
• It is our job to decide which one addresses the scientific question
– The data cannot tell us which association addresses the scientific question
– So why would we let the data decide whether we adjust for age?
Power
• To get a p-value, we need to know the distribution of the test statistic under H0. Shouldn’t we also consider the distribution under H1?
• Yes
• The distribution of the test statistic under the alternative hypothesis tells us the power of the test. Power refers to the probability of rejecting H0 when it is false:
1 − β = P[ reject H0 | H1 true ] = P[ correct decision | H1 true ]
• Power describes the ability of the test procedure to reliably detect departures from the null hypothesis.
• Power (1 − β) and significance (α) are important considerations in the planning of a study.
Power: an example
• A simple, illustrative example:
– Hypotheses: H0: µ = 13 versus H1: µ < 13
– α = 0.05, σ = 1, n = 9
• What is the critical region (i.e., for what values of the statistic, here the sample mean, would we reject the null)?
• In this (artificial) example, we’re assuming the population variance is known. We can do a z-test instead of a t-test.
Power: an example
Under H0: Z = (X̄ − 13) / (σ/√n) = (X̄ − 13) / (1/3) ~ N(0, 1)

Thus: P(Z ≤ zc | H0) = 0.05 implies zc = −1.65

Observed statistic: z = (x̄ − 13) / (1/3) = 3(x̄ − 13)

Reject H0 if z = 3(x̄ − 13) ≤ −1.65, i.e., if x̄ ≤ 13 − 1.65/3 = 12.45
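The critical value can be reproduced numerically; a sketch using the rounded z value of −1.65 from the slide:

```python
import math

# Critical region sketch for H0: mu = 13 vs H1: mu < 13 with sigma = 1, n = 9.
# Reject H0 when the sample mean falls at or below x_crit.
mu0, sigma, n = 13.0, 1.0, 9
z_crit = -1.65  # 5th percentile of N(0, 1), rounded as on the slide

x_crit = mu0 + z_crit * sigma / math.sqrt(n)
print(round(x_crit, 2))  # 12.45
```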
[Figure: null sampling distribution of X̄, centered at 13; the rejection region x̄ ≤ 12.45 is shaded with area 0.05, and “Do not reject H0” lies to the right of 12.45]
Power: an example
Now what is the probability of rejecting H0 when the population mean is 12?

We calculate

Power(12) = P(X̄ ≤ 12.45 | µ = 12) = P(Z ≤ (12.45 − 12) / √(1/9)) = P(Z ≤ 1.35) ≈ 0.913

This gives you the power.
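The same calculation in code; a sketch (the exact normal CDF gives about 0.911, close to the rounded 0.913 reported on the slide):

```python
import math

# Power sketch: P(Xbar <= 12.45 | mu = 12), where Xbar has SE = 1/3.
def normal_cdf(z):
    # standard normal CDF via the complementary error function
    return 0.5 * math.erfc(-z / math.sqrt(2))

z = (12.45 - 12.0) / (1.0 / 3.0)  # = 1.35
power = normal_cdf(z)
print(round(z, 2), round(power, 3))
```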
Power: an example
Interpretation: if we do a one-sided hypothesis test of the null hypothesis that µ = 13 versus the one-sided alternative H1: µ < 13, then we have a 91.3% chance of making the correct decision if the true mean is 12.
Power: an example
• Graphical representation:
[Figure: sampling distributions of X̄ under H0 (centered at 13) and under H1 (centered at 12); the cutoff 12.45 is marked, and the shaded area under the H1 curve below 12.45 is the POWER]
Power: an example
[Same figure: sampling distributions under H0 and H1, with the shaded area being the POWER]
Question: What is the power if µ1 = 12.45?
A: 0.50
You can calculate power for other values of µ1 – the mean at the alternative – and have the following “power curve”.
[Figure: power curve, plotting Power(µ0 − µ1), the probability of rejecting the null, against µ0 − µ1; the point at µ1 = 12 is what we calculated above]
Issues on Interpretation
• Power refers to the fraction of such studies that would report significant results if the hypothesized treatment effect is correct.
• Power is NOT the probability that the treatment “works.”
• Power is NOT the fraction of patients who benefit from the treatment.
Issues on Interpretation
• As mentioned, not rejecting the null hypothesis does not mean that the null hypothesis is true
– In particular, if a study is poorly powered, it is likely the null hypothesis will not be rejected even if it is false.
– Hence the language: “fail to reject the null” or “do not reject the null” rather than “accept the null”
Issues on Interpretation
• Inference about associations is usually based on showing differences between population parameters
– IF some summary measure (e.g., mean, median) is different for two distributions,
– THEN the distributions must be different
• Using summary measures to detect associations thus requires caution when interpreting results
– Lack of a difference between population parameters for two distributions does not necessarily imply that the entire distributions are the same
Comparing Population Parameters
Parameters of Interest
There are statistical tests that evaluate whether two distributions are the same (not just whether, e.g., the means of the distributions are the same). Such methods tend not to be very useful:
1. (Statistical) Little power.
2. (Scientific) Based on a hypothesis test you may conclude that two distributions are different, but the test does not tell you how they are different. So you can’t evaluate scientific importance.
Summary: Power
• Power is the probability of (correctly) rejecting the null hypothesis when it is false. We like powerful studies.
– Although a study can be “overpowered”.
• The challenging parts of doing a power/sample size calculation are (we didn’t cover this!):
– Getting a preliminary estimate of “nuisance” parameters, e.g., population variance to do a t-test.
– Deciding what value of the parameter of interest you want good power to detect
• Based on what you think the truth is
• Based on minimal threshold of scientific relevance.
– Also must pre-specify the test, α, β.
Summary: Power
• Understanding whether a study was adequately powered helps us interpret “negative” results (null hypothesis was not rejected).
– If a study was poorly powered, then the scientific question remains open.
– If a study was adequately powered, then we are more likely to believe that the null hypothesis is actually true.
BIOST 517/514 Nanolecture
Specify the outcome and the predictor. Choose a summary of outcome. Compare groups defined by predictor. Confounders are associated with predictor, cause outcome and not in pathway of interest. Stratify on them. Examine subgroups separately for effect modification.
The confidence interval is the range of values consistent with the data. The p-value is small when the data are inconsistent with the null hypothesis. A large p-value is uninformative.
Means are Normally distributed for any data distribution. So are most other statistics. If you don’t know the standard error, use a bootstrap.
-Thomas Lumley