TRANSCRIPT
Review and Final Material
BIOST 514/517 Biostatistics I / Applied Biostatistics I
Kathleen Kerr, Ph.D., Associate Professor of Biostatistics
University of Washington
Review* for Final
December 6, 2013
*not exclusively review
Final exam
• 1 hour and 50 minutes
• Roughly similar format to midterm: mix of multiple choice and short answer
• Heavy emphasis on material since midterm
• 8:30 on Wednesday, December 11
Miscellaneous
• Bring to exam: pencil, eraser, calculator
– Calculator should do arithmetic, exp(), loge(), square root and NOT run software, connect to the internet, etc.
• If you want your graded exam sent to you in campus mail, you must know your campus mailbox number. You will also be asked to fill out a mailing label.
Formulas
• Having taken one of my exams, you should have some sense of what I emphasize and what I don’t
• Don’t memorize formulas unless it helps you understand things
• Of course, you need to know simple things like odds=p/(1-p) and p=odds/(1+odds)
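These two conversions are easy to verify numerically; a minimal sketch in Python (function names are just for illustration):

```python
# Sketch of the odds <-> probability formulas from the slide
def odds_from_p(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def p_from_odds(odds):
    # p = odds / (1 + odds)
    return odds / (1 + odds)

# e.g. p = 0.2 gives odds = 0.25, and converting back recovers p
print(odds_from_p(0.2))   # 0.25
print(p_from_odds(0.25))  # 0.2
```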
Statistic
• Any function of the data. Examples:
– Descriptive statistic
– Inferential statistic
– Test statistic
Inferential Statistics
• Point Estimates
• Confidence Intervals
• Hypothesis Tests
• We emphasized summarizing study results to be as informative as possible.
Point estimates
• We like estimates of population parameters with low or no bias, that are consistent and efficient
Unbiased Estimator T
• A statistic T is an unbiased estimator of a parameter Θ if E(T)=Θ
• For example, if the parameter is the population mean µ and the estimator T is the sample mean, T is unbiased: E(T) = µ
– Easily shown from properties of expectation
• “T is unbiased” means the mean of sampling distribution of T is Θ
Consistent Estimator T
• Loosely, a statistic T is a consistent estimator of a parameter Θ if the sampling distribution of T becomes more and more concentrated around Θ as the sample size increases
– T can be biased but still be consistent
– T can be unbiased and not be consistent
Efficient Estimator T
• Loosely, a statistic T is an efficient estimator of a parameter Θ if T makes full use of the available data
• If T1 and T2 are two unbiased estimators of Θ, then T1 is more efficient than T2 if var(T1)<var(T2)
• Four estimators of the mean
[Figure: a population with mean µ; a sample X1, X2, X3, X4, X5, X6, X7, … is drawn from it]

Estimate of µ, the population mean      Unbiased?   Consistent?   Efficient?
sample median                           Biased*     No*           NA
X1 (first observation only)             unbiased    No            No
X̄odd (mean of odd-numbered obs.)        unbiased    Yes           No
X̄ (sample mean)                         unbiased    Yes           Yes
*unless the population distribution is symmetric, so that the population median equals µ
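The table can be illustrated by simulation. A sketch, using a hypothetical symmetric N(10, 2) population (all names and values are arbitrary choices, not from the lecture):

```python
import random
import statistics

# Simulation sketch: draw repeated samples from a hypothetical N(10, 2)
# population and compare four estimators of the mean mu = 10.
random.seed(1)
mu, sd, n, reps = 10.0, 2.0, 25, 2000

estimates = {"sample median": [], "X1": [], "mean of odd obs.": [], "sample mean": []}
for _ in range(reps):
    x = [random.gauss(mu, sd) for _ in range(n)]
    estimates["sample median"].append(statistics.median(x))
    estimates["X1"].append(x[0])                                   # first observation only
    estimates["mean of odd obs."].append(statistics.mean(x[::2]))  # every other observation
    estimates["sample mean"].append(statistics.mean(x))

for name, vals in estimates.items():
    # average of the estimates (bias check) and their variance (efficiency check)
    print(name, round(statistics.mean(vals), 2), round(statistics.variance(vals), 3))
```

Because this population is symmetric, the sample median comes out unbiased here too; the asterisked entries in the table refer to asymmetric populations. The sample mean shows the smallest variance across repetitions, matching the efficiency column.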
Confidence Intervals
• Values of the parameter for which our data would not be surprising
• Values of the parameter for which our data would be “typical”
• The procedure of making a 95% confidence interval
– Taking the sample
– Constructing the interval
… produces an interval that contains the true value of the parameter 95% of the time.
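This “95% of the time” property is a property of the procedure, and it can be checked by simulation; a sketch, where the population, sample size, and seed are arbitrary illustrative choices:

```python
import math
import random
import statistics

# Coverage sketch: sample repeatedly from a hypothetical N(5, 3) population,
# build a t-based 95% CI for the mean each time, and count how often the
# interval contains the true mean.
random.seed(2)
mu, sd, n, reps = 5.0, 3.0, 30, 4000
t_crit = 2.045  # 97.5th percentile of t with n - 1 = 29 df (from a t table)

covered = 0
for _ in range(reps):
    x = [random.gauss(mu, sd) for _ in range(n)]
    xbar = statistics.mean(x)
    se = statistics.stdev(x) / math.sqrt(n)
    if xbar - t_crit * se <= mu <= xbar + t_crit * se:
        covered += 1

print(covered / reps)  # close to 0.95
```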
Confidence Intervals
• Means and differences in means
– t-based
• Proportions
– Normal/Wald and exact
• Differences in proportions
– Normal/Wald
• RR and OR
– Normal/Wald interval for log-transformed parameter, then back-transform
• Survival curves
– Normal/Wald interval for transformed parameter, then back-transform
• Median and variance
– Bootstrapping
Hypothesis Testing
• Need null and alternative hypothesis
• Need a test statistic
• Need the sampling distribution of the test statistic under H0
– “Null distribution”
– We focused on tests where statistical theory tells us the null distribution
– (There are other tests that use methods to estimate the null distribution, but we didn’t get to those.)
• The alternative hypothesis tells us which region of our null distribution is “extreme”
• P-value = the probability of seeing data as extreme or more extreme than our data when the null hypothesis is true
• P-value ≈ P( data | H0 )
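For a z statistic the “as extreme or more extreme” probability comes straight from the standard normal null distribution; a sketch (the observed value 1.8 is hypothetical):

```python
import math

# P-value sketch: the null distribution of a z statistic is N(0, 1), so the
# one-sided p-value for an observed z is the upper-tail probability.
z_obs = 1.8  # hypothetical observed statistic

def normal_upper_tail(z):
    # 1 - Phi(z) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

p_one_sided = normal_upper_tail(z_obs)
p_two_sided = 2 * normal_upper_tail(abs(z_obs))
print(round(p_one_sided, 4), round(p_two_sided, 4))  # 0.0359 0.0719
```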
Hypothesis Testing
• t-tests
– One-sample
– Two-sample assuming equal variances
– Two-sample allowing for unequal variances
– Paired t-test
• Exact test for binomial proportions
• Z-test for binomial proportions
• Chi-square goodness-of-fit test
– Independence of two categorical variables
• Same as z-test for 2 binary variables
– Other goodness-of-fit tests
• Fisher’s exact test
• Rank-based tests for uncensored data
• Logrank test
P-values
• Frequently misinterpreted
– NOT the probability the null hypothesis is true
• Interpreting a “negative” study: p-value > α does not mean we “accept” the null hypothesis
– FirstSteps and Birthweight
• “results indicate that enrollment in FirstSteps does not increase birthweight”
• One problem is that “increase” sounds causal. Better: “is not associated with increased birthweight”
• Second problem: this conclusion “accepts” the null hypothesis.
• Better summary of a valid analysis yielding p-value > α: “results do not indicate evidence that enrollment in FirstSteps is associated with increased birthweight”
Baseline variables and Clinical Trials
• P-values are inferential statistics. Their purpose is to summarize the evidence in the data (sample) regarding a hypothesis about a population.
• Consider a clinical trial where patients are randomly assigned into a treatment group and a placebo group.
• The treatment/placebo label has been assigned by the investigators randomly. Any difference between the groups at baseline is by chance.
– If we test baseline variables for differences between the groups, about 5% of them will yield p-values < 0.05.
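The “about 5%” figure is easy to see by simulation: generate baseline variables that are truly independent of a random treatment label and test each one. A sketch (group sizes, number of variables, and seed are arbitrary):

```python
import math
import random
import statistics

# Sketch: baseline variables independent of a randomized treatment label.
# Testing each at alpha = 0.05 rejects for roughly 5% of them by chance alone.
random.seed(3)
n_per_group, n_vars = 100, 1000

def two_sample_p(x, y):
    # two-sided z-test p-value for a difference in means (large-sample)
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    z = (statistics.mean(x) - statistics.mean(y)) / se
    return math.erfc(abs(z) / math.sqrt(2))

n_rejected = 0
for _ in range(n_vars):
    treatment = [random.gauss(0, 1) for _ in range(n_per_group)]
    placebo = [random.gauss(0, 1) for _ in range(n_per_group)]
    if two_sample_p(treatment, placebo) < 0.05:
        n_rejected += 1

print(n_rejected / n_vars)  # near 0.05
```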
P-values for baseline variables in PBC dataset by treatment group:

age       0.018     alb        0.874
alkphos   0.747     ascites    0.433
bili      0.133     chol       0.742
edema     0.428     edema_tmt  0.877
hepmeg    0.069     sgot       0.460
platelet  0.579     stage      0.201
sex       0.326     copper     0.999
spiders   0.885     trig       0.365

Baseline variables and Clinical Trials
• But what about confounding? Don’t we want to know whether the “randomization worked”?
• Confounder:
– Associated with the POI in the sample
– Causally associated with the response
• The purpose of a p-value is for making inference from a sample to a population. It does not address the first condition.
• Can have p-value > α but there is an association in the sample.
• In a large trial can have p-value << α but the association in the sample is small.
• The best summary of the degree of association in the sample for the purposes of confounding is usually the mean.
You choose the analysis, not the data
• More generally, don’t use p-values to decide whether a variable is a confounder
• As mentioned, a p-value does not measure the strength of the association in the sample
• It is the investigator’s job to decide whether the adjusted or unadjusted association answers the scientific question.
“Multiple Comparisons” Problem
Also known as the “multiple testing” problem.
Issue: When you perform lots of hypothesis tests, your chance of making a type I error increases. Consider…

Hypothetical example
An investigator decides to study an association between eating red meat and cancer. He collects clinical data on a sample of individuals who eat red meat and a sample who do not eat red meat.
In analyzing the data, the investigator compares incidence rates of many different types of cancer between the two groups. Since there may be interactions, he also makes comparisons stratified by sex, race, and lifestyle factors.
Hypothetical example, cont’d
In a summary of the study, the investigator claims:
“the research study uncovered a significant association between consuming red meat and the incidence of lung cancer in non-smoking males (p<0.05).”
Hypothetical example, cont’d
Conclusion A: There is an association between lung cancer and eating red meat that is specific to non-smoking men.
Conclusion B: This finding is a type I error.
What do you suspect is the truth?
The probability of multiple comparisons
For a null hypothesis H0 that is true and a test performed at a given significance level α:

P(reject H0 | H0 is true) = α
P(do not reject H0 | H0 is true) = 1 − α

Next suppose n independent hypothesis tests of H0(1), H0(2), …, H0(n) are performed at level α. Suppose all n null hypotheses are true.

P(do not reject H0(i) | H0(i) is true) = 1 − α

Therefore:

P(not reject H0(1) AND not reject H0(2) AND … not reject H0(n) | all H0(i) are true) = (1 − α)^n
n           1      2      4      8      12     16     20     30
(1−.05)^n   0.95   0.902  0.814  0.663  0.540  0.440  0.359  0.217
(1−.01)^n   0.99   0.98   0.96   0.923  0.886  0.852  0.818  0.740
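The table follows directly from the formula; a quick sketch reproducing it:

```python
# Recompute the (1 - alpha)^n table: the probability of making no type I
# errors across n independent tests of true null hypotheses.
ns = (1, 2, 4, 8, 12, 16, 20, 30)
for alpha in (0.05, 0.01):
    print(alpha, [round((1 - alpha) ** n, 3) for n in ns])

# Family-wise error rate: probability of at least one false rejection
print(round(1 - 0.95 ** 30, 3))  # about 0.79 for 30 tests at alpha = 0.05
```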
Conclusion:
If 30 independent tests are performed at level α=0.05 on 30 true null hypotheses, the probability of falsely rejecting at least one null hypothesis is nearly 80%! That is, we are very likely to make a mistake!
Clearly, using a smaller α helps. In the situation above, our chances of making at least one type I error are reduced from 79% to 26% by using α=0.01. But this chance is still not negligible.
There are methods for deciding what α should be used in the presence of multiple comparisons. We aren’t covering those methods. But it is important to be aware of the issue.
“Multiple Comparisons”
• A p-value is only interpretable if the context in which it is computed is known– 1 test? Hundreds of tests?
Thousands of tests?– P-value most useful for pre-specified
primary endpoints• The same data that suggest a
hypothesis cannot be used to test the hypothesis
http://sph.washington.edu/podcasts/
59 worthwhile minutes
Association and causation
• Many scientific questions lead to data analyses to examine associations between a predictor of interest X and an outcome of interest Y. We must distinguish:
– The data are consistent with a strong association between X and Y in the population; and there are other scientific reasons to think that X affects Y.
– The data provide strong evidence that X and Y are associated in the population, and there are other scientific reasons to think that X affects Y.
– The data provide strong evidence that X affects Y.
• In all 3 cases it may be reasonable to believe that X affects Y, but it is important to be clear about what the data say. The 3rd possibility (data provide evidence of causation) is possible if X is randomly allocated and occasionally in other cases.
Smoking and lung cancer
• Many people were convinced that smoking caused lung cancer based on observing associations because:
– The association was very, very strong (risk of lung cancer 10–20 times higher among smokers than non-smokers)
• Rates of lung cancer are higher in heavy smokers than light smokers
• The cancer happens a reasonable time after smoking
• The dramatic increase in smoking rates among men during WWI was matched by a dramatic increase in lung cancer in men but not women
Smoking and lung cancer: confounding?
• Suppose the only piece of evidence we have is the early observations of higher lung cancer rates among smokers than non-smokers (assume the year is 1920)
• It is conceivable that an unmeasured confounder explains the association between smoking and lung cancer. For example, a gene that leads people to smoke, and also leads to lung cancer.
[Diagram: a hypothetical “smokey gene” causes both “cigarette smoking is pleasurable” and “prone to lung cancer”]
Smoking and lung cancer: confounding?
• This unmeasured confounder would have had to change for men, but not women, in the early 20th century.
Smoking and lung cancer
• Thus solely from observational data in humans (and perhaps animal studies?) a very strong case has been made for what we all believe today: smoking causes lung cancer
• The effect of smoking on heart disease, an issue of greater public health importance, was much harder to establish because the relative risk is only about 3.
Example: FEV in kids
• Children’s FEV data
• “…whether such deleterious effects of smoking can be detected in children who smoke. To address this question, measures of lung function were made in 654 children seen for a routine check up in a particular pediatric clinic. The children participating in this study were asked whether they were current smokers.”
• We decide to compare the distribution of FEV in smokers and nonsmokers by comparing the means.
• In the sample, the difference in means is 3.28 (smokers) − 2.57 (non-smokers) = 0.71.
• We estimate the difference in mean FEV between smokers and nonsmokers as 0.71 l/s.
• The 95% confidence interval is (0.49 l/s, 0.93 l/s).
• Data like ours would be very unlikely if the true difference were 0.
• So, do we have evidence that smoking reduces FEV among children?
[Figure: boxplots of FEV by smoking status (smoker vs. non-smoker); graphs by SMOKE]
Example: FEV in kids
• No. We have evidence that smokers have higher mean FEV, probably because they are older.
• We need to compare smokers and non-smokers for each age. We can average these (sensibly) across ages to get a single estimate.
– Regression methods make this easy.
• This age-stratified analysis yields a point estimate of smokers having 0.21 l/s smaller mean FEV than non-smokers. The 95% confidence interval is 0.43 l/s lower to 0.01 l/s higher.
Example: FEV in kids
• Have we now proved smoking lowers FEV in kids?
• No.
• Data would not be so unusual if true difference were 0 (p-value in this case is 0.060).
• But even if p << 0.05…
– Confounding by gender?
– Confounding by height?
– Perhaps kids with high FEV are more likely to play sports and this makes them less likely to smoke
– …
Example: FEV in kids
• For kids, is the association between smoking and higher FEV “real”?
• We must distinguish the unadjusted association between smoking and FEV
• …and the age-adjusted association between smoking and FEV
• It is our job to decide which one addresses the scientific question
– The data cannot tell us which association addresses the scientific question
– So why would we let the data decide whether we adjust for age?
Power
• To get a p-value, we need to know the distribution of the test statistic under H0. Shouldn’t we also consider the distribution under H1?
• Yes
• The distribution of the test statistic under the alternative hypothesis tells us the power of the test. Power refers to the probability of rejecting H0 when it is false:
1 − β = P[ reject H0 | H1 true ] = P[ correct decision | H1 true ]
• Power describes the ability of the test procedure to reliably detect departures from the null hypothesis.
• Power (1 − β) and significance (α) are important considerations in the planning of a study.
Power: an example
• A simple, illustrative example:
– Hypotheses: H0: µ = 13 versus H1: µ < 13
– α = 0.05, σ = 1, n = 9
• What is the critical region (i.e., for what values of the statistic, here the sample mean, would we reject the null)?
• In this (artificial) example, we’re assuming the population variance is known. We can do a z-test instead of a t-test.
Power: an example
Under H0: Z = (X̄ − 13) / (σ/√n) = (X̄ − 13) / (1/3) ~ N(0, 1)

Thus: P(Z ≤ zc | H0) = 0.05 implies zc = −1.65

Observed statistic: z = (x̄ − 13) / (1/3) = 3(x̄ − 13)

Reject H0 if z = 3(x̄ − 13) ≤ −1.65, i.e., if x̄ ≤ 13 − 1.65/3 = 12.45
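The critical value can be reproduced numerically; a sketch using the rounded z value of −1.65 from the slide:

```python
import math

# Critical region sketch for H0: mu = 13 vs H1: mu < 13 with sigma = 1, n = 9.
# Reject H0 when the sample mean falls at or below x_crit.
mu0, sigma, n = 13.0, 1.0, 9
z_crit = -1.65  # 5th percentile of N(0, 1), rounded as on the slide

x_crit = mu0 + z_crit * sigma / math.sqrt(n)
print(round(x_crit, 2))  # 12.45
```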
[Figure: null sampling distribution of X̄, centered at 13; the rejection region x̄ ≤ 12.45 is shaded with area 0.05, and “Do not reject H0” lies to the right of 12.45]
Power: an example
Now what is the probability of rejecting H0 when the population mean is 12?

We calculate

Power(12) = P(X̄ ≤ 12.45 | µ = 12) = P(Z ≤ (12.45 − 12) / √(1/9)) = P(Z ≤ 1.35) ≈ 0.913

This gives you the power.
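The same calculation in code; a sketch (the exact normal CDF gives about 0.911, close to the rounded 0.913 reported on the slide):

```python
import math

# Power sketch: P(Xbar <= 12.45 | mu = 12), where Xbar has SE = 1/3.
def normal_cdf(z):
    # standard normal CDF via the complementary error function
    return 0.5 * math.erfc(-z / math.sqrt(2))

z = (12.45 - 12.0) / (1.0 / 3.0)  # = 1.35
power = normal_cdf(z)
print(round(z, 2), round(power, 3))
```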
Power: an example
Interpretation: if we do a one-sided hypothesis test of the null hypothesis that µ = 13 versus the one-sided alternative H1: µ < 13, then we have a 91.3% chance of making the correct decision if the true mean is 12.
Power: an example
• Graphical representation:
[Figure: sampling distributions of X̄ under H0 (centered at 13) and under H1 (centered at 12); the cutoff 12.45 is marked, and the shaded area under the H1 curve below 12.45 is the POWER]
Power: an example
[Same figure: sampling distributions under H0 and H1, with the shaded area being the POWER]
Question: What is the power if µ1 = 12.45?
A: 0.50
You can calculate power for other values of µ1 – the mean at the alternative – and have the following “power curve”.
[Figure: power curve, plotting Power(µ0 − µ1), the probability of rejecting the null, against µ0 − µ1; the point at µ1 = 12 is what we calculated above]
Issues on Interpretation
• Power refers to the fraction of such studies that would report significant results if the hypothesized treatment effect is correct.
• Power is NOT the probability that the treatment “works.”
• Power is NOT the fraction of patients who benefit from the treatment.
Issues on Interpretation
• As mentioned, not rejecting the null hypothesis does not mean that the null hypothesis is true
– In particular, if a study is poorly powered, it is likely the null hypothesis will not be rejected even if it is false.
– Hence the language: “fail to reject the null” or “do not reject the null” rather than “accept the null”
Issues on Interpretation
• Inference about associations is usually based on showing differences between population parameters
– IF some summary measure (e.g., mean, median) is different for two distributions,
– THEN the distributions must be different
• Using summary measures to detect associations thus requires caution when interpreting results
– Lack of a difference between population parameters for two distributions does not necessarily imply that the entire distributions are the same
Comparing Population Parameters
Parameters of Interest
There are statistical tests that evaluate whether two distributions are the same (not just whether, e.g., the means of the distributions are the same). Such methods tend not to be very useful:
1. (Statistical) Little power.
2. (Scientific) Based on a hypothesis test you may conclude that two distributions are different, but the test does not tell you how they are different. So you can’t evaluate scientific importance.
Summary: Power
• Power is the probability of (correctly) rejecting the null hypothesis when it is false. We like powerful studies.
– Although a study can be “overpowered”.
• The challenging parts of doing a power/sample size calculation are (we didn’t cover this!):
– Getting a preliminary estimate of “nuisance” parameters, e.g., population variance to do a t-test.
– Deciding what value of the parameter of interest you want good power to detect
• Based on what you think the truth is
• Based on minimal threshold of scientific relevance.
– Also must pre-specify the test, α, β.
Summary: Power
• Understanding whether a study was adequately powered helps us interpret “negative” results (null hypothesis was not rejected).
– If a study was poorly powered, then the scientific question remains open.
– If a study was adequately powered, then we are more likely to believe that the null hypothesis is actually true.
BIOST 517/514 Nanolecture
Specify the outcome and the predictor. Choose a summary of outcome. Compare groups defined by predictor. Confounders are associated with predictor, cause outcome and not in pathway of interest. Stratify on them. Examine subgroups separately for effect modification.
The confidence interval is the range of values consistent with the data. The p-value is small when the data are inconsistent with the null hypothesis. A large p-value is uninformative.
Means are Normally distributed for any data distribution. So are most other statistics. If you don’t know the standard error, use a bootstrap.
-Thomas Lumley