Biostatistics in Practice Session 2: Summarization of Quantitative Information Peter D. Christenson Biostatistician http://research.LABioMed.org/ Biostat

Upload: gavin-taylor

Post on 21-Jan-2016


Page 1: Biostatistics in Practice Session 2: Summarization of Quantitative Information Peter D. Christenson Biostatistician

Biostatistics in Practice

Session 2: Summarization of Quantitative Information

Peter D. Christenson, Biostatistician

http://research.LABioMed.org/Biostat


Topics for this Session

Experimental Units

Independence of Measurements

Graphs: Summarizing Results

Graphs: Aids for Analysis

Summary Measures

Confidence Intervals

Prediction Intervals


Most Practical from this Session

Geometric Means

Confidence Intervals

Reference Ranges

Justify Analysis Methods from Graphs

Experimental Units

Independence of Measurements


Units and Independence

Experiments may be designed such that each measurement does not give additional independent information.

Many basic statistical methods require that measurements are “independent” for the analysis to be valid.

Other methods can incorporate the lack of independence.


Example 1: Units and Independence

Ten mice receive treatment A and a blood sample is obtained from each one. The same is done for 10 mice receiving treatment B.

A protein concentration is measured in each of the 20 samples and an appropriate summary (average?, min?, %>10 nmol/ml?) is compared between groups A and B.

The experimental unit is a mouse.

Each of the 20 numbers is independent of the others.

A “basic” analysis requiring independence is valid.


Example 2: Units and Independence

Ten mice receive treatment A, each is bled, and each blood sample is divided into 3 aliquots. The same is done for 10 mice receiving treatment B.

A protein concentration is measured in each of the 60 aliquots.

The experimental unit is a mouse.

The 60 numbers are not independent. The 2nd and 3rd results for a sample are less informative than the 1st.

A “basic” analysis requiring independence is not valid unless a single number is used for each triplicate, giving 10+10 independent values.
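One way to obtain "a single number for each triplicate" is to average the three aliquots per mouse. The sketch below assumes that choice; the mouse IDs and readings are hypothetical, and only two mice per group are shown for brevity.

```python
from statistics import mean

# Hypothetical aliquot readings (nmol/ml): three per mouse, keyed by mouse ID.
aliquots = {
    "A1": [9.8, 10.1, 10.4],
    "A2": [12.0, 11.7, 12.3],
    "B1": [7.9, 8.2, 8.2],
    "B2": [6.5, 6.4, 6.9],
}

def per_mouse_means(readings):
    """Reduce each mouse's replicate aliquots to a single value (the mean)."""
    return {mouse: mean(values) for mouse, values in readings.items()}

mouse_values = per_mouse_means(aliquots)

# One number per experimental unit: these are the values a basic analysis may use.
group_a = [v for m, v in mouse_values.items() if m.startswith("A")]
group_b = [v for m, v in mouse_values.items() if m.startswith("B")]
n_aliquots = sum(len(v) for v in aliquots.values())
print(len(group_a) + len(group_b), "independent values from", n_aliquots, "aliquots")
```

The median of each triplicate would serve equally well; the point is that the analysis sees one value per mouse.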


Experimental Units in Case Study

A unit is a single child.

Results from one child's three diets are not independent. The three results are probably clustered around a set-point for that child.

The analysis must incorporate this possible correlated clustering. If the software is just given the 3x140 outcomes without distinguishing the individual children, the analysis would be wrong.


Modified Case Study

Suppose an educational study used teaching method A in some schools and method B in others. The outcome is a test score later.

The experimental unit is a school.

Outcomes within a school are probably not independent. It would be wrong to use the method we will discuss in the next session (t-test) to compare the mean score among students given method A to those given B.


Another Example

You apply treatment A to one pregnant mouse and measure a hormone in its offspring. Same for B.

Suppose the results are:
A Responses: 100, 98, 102, 99, 101
B Responses: 10, 8, 12, 9, 11

Can we conclude responses are greater under treatment A than under B?


No. The one mouse given A might have responded the same if given B. Same for the one mouse given B.

Five offspring provide little independent information over 1 offspring. Each treatment was essentially only tested once.

Graphs: Summarizing Results


Common Graphical Summaries

Graph Name      Y-axis           X-axis
Histogram       Count or %       Category
Scatterplot     Continuous       Continuous
Dot Plot        Continuous       Category
Box Plot        Percentiles      Category
Line Plot       Mean or value    Category
Kaplan-Meier    Probability      Time

Many of the following examples are from StatisticalPractice.com

Data Graphical Displays

Histogram: summarized*
Scatter plot: raw data

* The raw-data version of a histogram is a stem-and-leaf plot. We will see one later.

Data Graphical Displays

Dot plot: raw data
Box plot: summarized

Data Graphical Displays: Line or Profile Plot

Summarized; bars can represent various types of ranges.

Data Graphical Displays: Kaplan-Meier Plot

[Figure: Kaplan-Meier survival estimate. Y-axis: Survival Probability (0.00 to 1.00). X-axis: Years (0 to 20).]

Probability of surviving 5 years is 0.35.

This is not necessarily 35% of subjects.

Graphs: Aids for Analysis


Graphical Aids for Analysis

Most statistical analyses involve modeling.

Parametric methods (t-test, ANOVA, χ²) have stronger requirements than non-parametric (rank-based) methods.

Every method is based on data satisfying certain requirements.

Many of these requirements can be assessed with some useful common graphics.


Look at the Data for Analysis Requirements

What do we look for?

In Histograms (one variable):
Ideal: Symmetric, bell-shaped.

Potential Problems:
• Skewness.
• Multiple peaks.
• Many values at, say, 0, and bell-shaped otherwise.
• Outliers.


Example Histogram: OK for Typical* Analyses

• Symmetric.
• One peak.
• Roughly bell-shaped.
• No outliers.

*Typical: mean, SD, confidence intervals, to be discussed in later slides.

Histograms: Not OK for Typical Analyses

Skewed
[Figure: histogram of Intensity (x-axis, 0 to 8) vs. Frequency.]
Need to transform intensity to another scale, e.g. Log(intensity).

Multi-Peak
[Figure: histogram of Tumor Volume (x-axis, 20 to 120) vs. Frequency.]
Need to summarize with percentiles, not mean.

Histograms: Not OK for Typical Analyses

Truncated Values
[Figure: histogram of Assay Result vs. Frequency, with the LLOQ marked. Undetectable in 28 samples (<LLOQ).]
Need to use percentiles for most analyses.

Outliers
[Figure: histogram of Expression LogRatio vs. Frequency.]
Need to use median, not mean, and percentiles.


Look at the Data for Analysis Requirements

What do we look for?

In Scatter Plots (two variables):
Ideal: Football-shaped; ellipse.

Potential Problems:
• Outliers.
• Funnel-shaped.
• Gap with no values for one or both variables.


Example Scatter Plot: OK for Typical Analyses

Scatter Plot: Not OK for Typical Analyses

Funnel-Shaped
Should transform y-value to another scale, e.g. logarithm.

Gap and Outlier
[Figure: scatter plot of nRBC Count (y-axis, 0 to 150) vs. EPO (x-axis, 0 to 400).]
Consider analyzing subgroups.

All Subjects: r = 0.54 (95% CI: 0.27 to 0.73), p = 0.0004
EPO < 150: r = 0.23 (95% CI: -0.11 to 0.52), p = 0.17
EPO > 300: r = -0.04 (95% CI: -0.96 to 0.96), p = 0.96

Ott, Amer J Obstet Gyn 2005;192:1803-9.
Ferber et al, Amer J Obstet Gyn 2004;190:1473-5.


Summary Measures


Common Summary Measures

Mean and SD or SEM

Geometric Mean

Z-Scores

Correlation

Survival Probability

Risks, Odds, and Hazards


Summary Statistics: One Variable

Data Reduction to a few summary measures.

Basic: Need Typical Value and Variability of Values

Typical Values (“Location”):
• Mean for symmetric data.
• Median for skewed data.
• Geometric mean for some skewed data (details in later slides).

Summary Statistics: Variation in Values

• Standard Deviation, SD ≈ 1.25 × (average |deviation| of values from their mean), for roughly bell-shaped data.

• Standard by convention, but its values are not intuitive.

• SD of what? E.g., SD of individuals, or of group means.

• Fundamental, critical measure for most statistical methods.
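The factor of about 1.25 can be checked by simulation. This is a minimal sketch on simulated normal data; the seed, the sample size, and the mean and SD fed to the generator are arbitrary choices, not values from any study.

```python
import random
from statistics import mean, stdev

def mean_abs_deviation(values):
    """Average absolute deviation of values from their mean."""
    m = mean(values)
    return mean(abs(v - m) for v in values)

random.seed(0)  # arbitrary seed, for reproducibility only
# Simulated bell-shaped data (assumed mean 60.6 and SD 9.6)
sample = [random.gauss(60.6, 9.6) for _ in range(20000)]

ratio = stdev(sample) / mean_abs_deviation(sample)
print(round(ratio, 2))  # close to 1.25 for normal data
```

For skewed or heavy-tailed data the ratio can differ noticeably from 1.25.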

Examples: Mean and SD

A
[Figure: histogram of Time (x-axis, 35 to 95) vs. Frequency.]
Mean = 60.6 min. SD = 9.6 min.

B
[Figure: histogram of OD (x-axis, 10 to 20) vs. Frequency.]
Mean = 15.1 SD = 2.8

Note that the entire range of data in A is about 6 SDs wide; this is the source of the “Six Sigma” process used in quality control and business.

Examples: Mean and SD

Skewed
[Figure: histogram of Intensity (x-axis, 0 to 8) vs. Frequency.]
Mean = 1.0 min. SD = 1.1 min.

Multi-Peak
[Figure: histogram of Tumor Volume (x-axis, 20 to 120) vs. Frequency.]
Mean = 70.3 SD = 22.3

Summary Statistics: Rule of Thumb

For bell-shaped distributions of data (“normally” distributed):

• ~ 68% of values are within mean ±1 SD

• ~ 95% of values are within mean ±2 SD: the “(Normal) Reference Range”

• ~ 99.7% of values are within mean ±3 SD
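These percentages follow from the normal distribution itself. A short sketch using Python's standard library (the function name is illustrative):

```python
from statistics import NormalDist

def coverage_within(k_sd):
    """Fraction of a normal distribution lying within mean ± k_sd standard deviations."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k_sd) - z.cdf(-k_sd)

for k in (1, 2, 3):
    print(k, round(coverage_within(k), 4))
```

The result does not depend on the particular mean or SD, which is why the rule applies to any bell-shaped measure.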

Summary Statistics: Geometric Means

Commonly used for skewed data.
1. Take logs of individual values.
2. Find, say, mean ±2 SD of the logged values, giving mean and (low, up).
3. Find antilogs of mean, low, up. Call them GM, low2, up2 (back on original scale).
4. GM is the “geometric mean”. The interval (low2, up2) is skewed about GM (corresponds to graph).

[See next slide]
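The four steps can be sketched in a few lines. The data values below are hypothetical and the function name is illustrative; natural logs are assumed, although any base gives the same GM and interval.

```python
import math
from statistics import mean, stdev

def geometric_summary(values, n_sd=2.0):
    """Return (GM, low2, up2): antilogs of the mean and mean ± n_sd*SD of the logs."""
    logs = [math.log(v) for v in values]       # step 1: take logs
    m, s = mean(logs), stdev(logs)             # step 2: mean and SD of the logs
    low, up = m - n_sd * s, m + n_sd * s
    return math.exp(m), math.exp(low), math.exp(up)  # step 3: antilogs

# Hypothetical right-skewed measurements, for illustration only
titers = [12, 35, 60, 88, 110, 150, 240, 400, 950]
gm, low2, up2 = geometric_summary(titers)
print(f"GM = {gm:.1f}, interval = ({low2:.1f}, {up2:.1f})")
```

The interval is symmetric about the mean on the log scale, so on the original scale up2/GM equals GM/low2, which is what makes it skewed about GM.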

Geometric Means

These are flipped histograms rotated 90º, with box plots.

Any log base can be used.

GM = exp(4.633) = 102.8
low2 = exp(4.633 - 2*1.09) = 11.6
up2 = exp(4.633 + 2*1.09) = 909.6

Confidence Intervals

Reference ranges, or Prediction Intervals, are for individuals.

Contains values for 95% of individuals.

Confidence intervals (CI) are for a summary measure (parameter) for an entire population.

Contains the (still unknown) summary measure for “everyone” with 95% certainty.

Z-Score = (Measure - Mean)/SD

Standardizes a measure to have mean=0 and SD=1.

Z-scores make different measures comparable.

[Figure: two histograms of Time (x-axis, 35 to 95) vs. Frequency, with Mean = 60.6 min. and SD = 9.6 min. The second is relabeled with Z-Score = (Time-60.6)/9.6, so that times of about 41, 61, and 79 correspond to Z = -2, 0, and 2, with Mean = 0 and SD = 1.]

Outcome Measure in Case Study
GHA = Global Hyperactivity Aggregate

For each child at each time:
Z1 = Z-Score for ADHD from Teachers
Z2 = Z-Score for WWP from Parents
Z3 = Z-Score for ADHD in Classroom
Z4 = Z-Score for Conner on Computer

All have higher values ↔ more hyperactive.
Z’s make each measure scaled similarly.

GHA = Mean of Z1, Z2, Z3, Z4
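A sketch of how an aggregate like GHA can be formed from differently scaled measures. The raw scores and measure names are hypothetical, and standardizing against the sample's own mean and SD is an assumption; the case study may have used a different reference.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values to mean 0 and SD 1 using the sample mean and SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical raw scores for 4 children on four differently scaled measures
raw = {
    "teacher_adhd":   [10, 14, 18, 22],
    "parent_wwp":     [3.0, 2.0, 4.0, 5.0],
    "classroom_adhd": [40, 55, 50, 65],
    "conners_cpt":    [0.2, 0.1, 0.4, 0.3],
}
standardized = [z_scores(scores) for scores in raw.values()]

# Aggregate for each child: mean of that child's four Z-scores
gha = [mean(child) for child in zip(*standardized)]
print([round(g, 2) for g in gha])
```

Without standardizing, the measure with the largest raw scale (here the classroom score) would dominate the average.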

Confidence Interval for Population Mean

95% Reference range, or Prediction Interval, or “Normal Range”, is

sample mean ± 2(SD)

95% Confidence interval (CI) for the (true, but unknown) mean for the entire population is

sample mean ± 2(SD/√N)

SD/√N is called “Std Error of the Mean” (SEM)

Confidence Interval: More Details

Confidence interval (CI) for the (true, but unknown) mean for the entire population is

95%, N=100: sample mean ± 1.98(SD/√N)
95%, N= 30: sample mean ± 2.05(SD/√N)
90%, N=100: sample mean ± 1.66(SD/√N)
99%, N=100: sample mean ± 2.63(SD/√N)

If N is small (N<30?), the data need a normal, bell-shaped distribution. Otherwise, skewness is OK. This is not true for the prediction interval (PI), where percentiles are needed for skewed data.

Confidence Interval: Case Study

Confidence Interval:
-0.14 ± 1.99(1.04/√73) = -0.14 ± 0.24 → -0.38 to 0.10

This is close to the adjusted CI in Table 2: -0.37 to 0.13, centered at -0.12.

Normal Range:
-0.14 ± 1.99(1.04) = -0.14 ± 2.07 → -2.21 to 1.93
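The arithmetic for the confidence interval and the normal range can be reproduced as follows. The function name is illustrative; the inputs (mean -0.14, SD 1.04, N = 73, multiplier 1.99) are the ones quoted on the slide.

```python
import math

def mean_intervals(sample_mean, sd, n, multiplier):
    """Return (confidence interval for the mean, normal range for individuals)."""
    sem = sd / math.sqrt(n)  # standard error of the mean
    ci = (sample_mean - multiplier * sem, sample_mean + multiplier * sem)
    normal_range = (sample_mean - multiplier * sd, sample_mean + multiplier * sd)
    return ci, normal_range

ci, normal_range = mean_intervals(-0.14, 1.04, 73, 1.99)
print([round(x, 2) for x in ci])            # about -0.38 to 0.10
print([round(x, 2) for x in normal_range])  # about -2.21 to 1.93
```

The only difference between the two intervals is the division by √N, which is why the CI narrows as the sample grows while the normal range does not.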

CI for the Antibody Example

GM = exp(4.633) = 102.8
low2 = exp(4.633 - 2*1.09) = 11.6
up2 = exp(4.633 + 2*1.09) = 909.6

So, there is 95% assurance that an individual is between 11.6 and 909.6, the PI.

GM = exp(4.633) = 102.8
low2 = exp(4.633 - 2*1.09/√394) = 92.1
up2 = exp(4.633 + 2*1.09/√394) = 114.8

So, there is 95% certainty that the population (geometric) mean is between 92.1 and 114.8, the CI.
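Both intervals come from the same log-scale mean and SD; only the divisor changes. A brief sketch using the slide's values (log-scale mean 4.633, log-scale SD 1.09, N = 394), with an illustrative function name:

```python
import math

def back_transformed_interval(log_mean, log_sd, n=1, n_sd=2.0):
    """Antilog of log_mean ± n_sd*log_sd/sqrt(n); n=1 gives the interval for individuals."""
    half_width = n_sd * log_sd / math.sqrt(n)
    return math.exp(log_mean - half_width), math.exp(log_mean + half_width)

pi = back_transformed_interval(4.633, 1.09)         # individuals (PI)
ci = back_transformed_interval(4.633, 1.09, n=394)  # geometric mean (CI)
print([round(x, 1) for x in pi], [round(x, 1) for x in ci])
```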

Summary Statistics: Two Variables (Correlation)

• Always look at scatterplot.
• Correlation, r, ranges from -1 (perfect inverse relation) to +1 (perfect direct relation). Zero = no linear relation.
• Specific to the ranges of the two variables.
• Typically, cannot extrapolate to populations with other ranges.
• Measures association, not causation.

We will examine details in Session 5.


Correlation Depends on Range of Data

Graph B contains only the points from graph A that are in the ellipse.

Correlation is reduced in graph B.

Thus: correlation between two quantities may be quite different in different study populations.

[Figure: scatter plots A and B.]
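The effect of a restricted range can be illustrated by simulation. This sketch uses made-up data and, as a simplification, restricts only the x-values to a narrow band rather than selecting points inside an ellipse; the seed and sample size are arbitrary.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # arbitrary seed, for reproducibility only
# Simulated population: y depends linearly on x, plus noise
xs = [random.gauss(0, 1) for _ in range(2000)]
ys = [x + random.gauss(0, 1) for x in xs]
r_full = pearson_r(xs, ys)

# Keep only the points whose x lies in a narrow central range
kept = [(x, y) for x, y in zip(xs, ys) if abs(x) < 0.5]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])
print(round(r_full, 2), round(r_restricted, 2))
```

The underlying relation between x and y is identical in both sets; only the spread of x differs, and that alone lowers r.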


Correlation and Measurement Precision

A lack of correlation for the subpopulation with 5<x<6 may be due to inability to measure x and y well.

Lack of evidence of association is not evidence of lack of association.

[Figure: scatter plot with regions labeled A and B; x-axis marked at 5 and 6, y-axis at 10 and 12.]

Summary Statistics: Survival Probability

[Figure: Kaplan-Meier survival estimate. Y-axis: Survival Probability (0.00 to 1.00). X-axis: Years (0 to 20).]

Example: 100 subjects start a study. Nine subjects drop out at 2 years and 7 drop out at 4 yrs, and 20, 20, and 17 die in the intervals 0-2, 2-4, 4-5 yrs.

Then, the 0-2 yr interval has 80/100 surviving.

The 2-4 interval has 51/71 surviving; 4-5 has 27/44 surviving.

So, 5-yr survival prob is (80/100)(51/71)(27/44) = 0.35.

Don’t know vital status of 16 subjects at 5 years.

Actually uses finer subdivisions than 0-2, 2-4, 4-5 years, with exact death times.
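The interval calculation in the example can be written out as a short function. This is a sketch of the coarse three-interval version only, not of a full Kaplan-Meier estimator, which works from exact death times; the function name is illustrative.

```python
def interval_survival(n_start, intervals):
    """Multiply interval survival fractions; dropouts leave the risk set at an interval's end.

    intervals: list of (deaths_in_interval, dropouts_at_end_of_interval).
    """
    at_risk = n_start
    survival = 1.0
    for deaths, dropouts in intervals:
        survivors = at_risk - deaths
        survival *= survivors / at_risk
        at_risk = survivors - dropouts  # subjects who dropped out are no longer followed
    return survival

# The example's intervals: 0-2, 2-4, and 4-5 years
s5 = interval_survival(100, [(20, 9), (20, 7), (17, 0)])
print(round(s5, 2))  # 0.35
```

Each dropout contributes to the denominators only for the intervals in which they were still being followed, which is why 0.35 need not equal the fraction of the original 100 subjects known to be alive at 5 years.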

Summary Statistics: Relative Likelihood of an Event

Compare groups A and B on mortality.

Relative Risk = ProbA[Death] / ProbB[Death]
where Prob[Death] ≈ Deaths per 100 Persons

Odds Ratio = OddsA[Death] / OddsB[Death]
where Odds = Prob[Death] / Prob[Survival]

Hazard Ratio ≈ IA[Death] / IB[Death]
where I = Incidence = Deaths per 100 Person-Days
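A small worked example of the first two measures, using hypothetical death counts. The hazard ratio is omitted because it also needs follow-up time (person-days) for each group.

```python
def risk(deaths, persons):
    """Probability of death: deaths divided by persons at risk."""
    return deaths / persons

def odds(prob):
    """Odds corresponding to a probability: prob / (1 - prob)."""
    return prob / (1 - prob)

# Hypothetical counts, for illustration only
p_a = risk(deaths=30, persons=100)  # group A
p_b = risk(deaths=10, persons=100)  # group B

relative_risk = p_a / p_b
odds_ratio = odds(p_a) / odds(p_b)
print(round(relative_risk, 2), round(odds_ratio, 2))
```

With these counts the odds ratio is larger than the relative risk; the two are close only when the event is rare in both groups.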