summary points - simon fraser...

Summary Points

December 13, 2008

Statistical Computations made easy

• Use the appropriate tool (i.e. usually not Excel) for the job.

• SAS - premier analysis and data management tool, but steep learning curve

• R/Splus - flexible, powerful, but command

line driven

• JMP/Systat/Stata

– 80% of what people need

– GUI interface

– good standard graphics

– difficult to extend to unique situations

• Why NOT Excel?

– poor data management practices

– WRONG results

– POOR graphs

– Doesn’t deal with missing data properly

TRRGET

• Randomize = representative

• Replicate = controls precision (the se)

• Stratification/Blocking = control for explainable variation

• Graphing = keep it simple and straightforward; NO pie charts; NO 3-D effects

• Estimation = how big is effect - no naked

estimates - always report a SE

• (Hypothesis) Testing = p-values = consistency of data with hypothesis; no naked p-values

SD vs SE

• SD

– sample standard deviation (s) = variability of INDIVIDUAL data points

– about 95% of INDIVIDUAL data values are contained in Ȳ ± 2s

• SE

– precision of estimate = uncertainty due

to sampling

– depends on sample/experimental design

- no single formula for all cases

– depends on proper randomization

– usually declines as 1/√n (quadrupling n halves the se)

– usually depends on absolute (not relative) sample size

– sensitive to outliers

– only measures uncertainty due to sampling
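The SD/SE distinction above can be sketched numerically. This is a minimal illustration that assumes a simple random sample and the textbook formula se = s/√n for a sample mean; the data values are made up, and other designs need other formulas.

```python
import math
import statistics

# Hypothetical measurements (illustrative values, not from any real study)
lengths = [12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 9.5]

n = len(lengths)
mean = statistics.mean(lengths)
sd = statistics.stdev(lengths)   # variability of INDIVIDUAL data points
se = sd / math.sqrt(n)           # precision of the estimated mean (SRS only)

print(f"n={n} mean={mean:.2f} sd={sd:.2f} se={se:.2f}")

# If the SD stays about the same, quadrupling n halves the SE
se_4n = sd / math.sqrt(4 * n)
print(f"se with 4n units = {se_4n:.2f}")
```

The SD describes the data and does not shrink with more sampling; only the SE does.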

Confidence Intervals

• Approx 95% ci is found as est ± 2 se

• 95% confident that interval contains POPULATION PARAMETER (such as population mean)

• says nothing about INDIVIDUAL data points

• sensitive to outliers
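A small sketch of the est ± 2 se rule, using a hypothetical helper name (`approx_ci95`) and made-up inputs:

```python
def approx_ci95(estimate, se):
    """Approximate 95% confidence interval: estimate ± 2 se."""
    return (estimate - 2 * se, estimate + 2 * se)

# Hypothetical estimate of a population mean and its standard error
lower, upper = approx_ci95(10.0, 0.5)
print(lower, upper)
```

The interval is a statement about the population parameter; it is much narrower than the range Ȳ ± 2s that covers individual data values.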

P-values

• consistency of DATA with hypothesis

• unusualness of DATA assuming hypothesis is true

• does NOT measure p(hypothesis is true)

• statistical significance ≠ practical (biological) importance

• not statistically significant ≠ no effect

• the chart of 5 possibilities.

• prefer confidence intervals (effect sizes) over p-values

Scale/Type of measurement

• 4 Standard scales

– Nominal - classification data, e.g. sex

– Ordinal - ordered classification data, e.g.

small, medium, large

– Interval - no natural zero, e.g. temp (C),

or year (2008)

– Ratio - natural zero, e.g. height, weight,

length

• JMP combined Interval/Ratio into Continuous (misnamed)

• 3 types of data

– Discrete - fixed countable set of values,

e.g. sex, counts

– Continuous - uncountable, e.g. weight,

length

– Discretized Continuous - all continuous

data is discretized

Bias, Precision, Accuracy

• Bias - is average estimate = population

parameter?

• Precision - variation of estimates over repeated samples

• Accuracy - combination of bias + precision

• Only refer to PROCEDURE and not to individual values
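One way to see that bias, precision and accuracy describe a PROCEDURE is to repeat the procedure many times. The sketch below is a simulation under assumed values (a normal population with mean 50 and a made-up estimator that overshoots by 2 units), and it takes mean squared error as the measure of accuracy, which is one common choice rather than something stated above.

```python
import random
import statistics

rng = random.Random(1)          # fixed seed so the run is repeatable
population_mean = 50.0

def biased_procedure(sample):
    # Hypothetical procedure that systematically overshoots by 2 units
    return statistics.mean(sample) + 2.0

# Apply the same procedure to many repeated samples
estimates = []
for _ in range(2000):
    sample = [rng.gauss(population_mean, 10.0) for _ in range(25)]
    estimates.append(biased_procedure(sample))

bias = statistics.mean(estimates) - population_mean      # average miss
variance = statistics.pvariance(estimates)               # precision
mse = statistics.mean([(e - population_mean) ** 2 for e in estimates])  # accuracy

print(f"bias={bias:.2f} variance={variance:.2f} mse={mse:.2f}")
```

Over the repeated samples, mse equals bias² + variance, which is the sense in which accuracy combines bias and precision.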

Missing Values

• MCAR - just ignore

• MAR - ignore but potentially adjust weights

• IM - seek help

• Missing ≠ 0 and vice-versa
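A short illustration of why a missing value must not be coded as 0. The counts are hypothetical, and ignoring the missing plot is only reasonable under the MCAR assumption.

```python
import statistics

# Hypothetical counts; None marks a plot that was not surveyed,
# while 0 is a real observation of "nothing found"
counts = [4, 0, 7, None, 5]

observed = [c for c in counts if c is not None]
mean_ignoring_missing = statistics.mean(observed)

# WRONG: treating the unsurveyed plot as a true zero drags the mean down
as_zero = [0 if c is None else c for c in counts]
mean_missing_as_zero = statistics.mean(as_zero)

print(mean_ignoring_missing, mean_missing_as_zero)
```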

Transformations

• don’t be afraid to transform

• most common in biology is log(x) = natural log

• careful of back transforms

– MEAN (on log scale) reverts to MEDIAN on anti-log scale

– difficult to convert SD and SE
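The back-transform caution can be checked on a small made-up data set. Strictly, exp(mean of logs) is the geometric mean; it matches the median when the data are symmetric on the log scale, as in this deliberately chosen example.

```python
import math
import statistics

# Hypothetical right-skewed values, symmetric on the log scale
values = [1.0, 2.0, 4.0, 8.0, 16.0]

log_values = [math.log(v) for v in values]      # natural log
mean_log = statistics.mean(log_values)
back_transformed = math.exp(mean_log)           # geometric mean

arithmetic_mean = statistics.mean(values)
median = statistics.median(values)

print(f"back-transformed mean of logs = {back_transformed:.2f}")
print(f"median = {median:.2f}, arithmetic mean = {arithmetic_mean:.2f}")
```

The back-transformed value lands on the median, not on the arithmetic mean of the original data.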

Basic sampling designs

• Simple Random Sampling (SRS)

– Most basic design and default assumed

by packages

– often not feasible to perform

• Systematic sampling

– hope there is some self-randomization

occurring

– beware of matching natural cycles

• Cluster/Transect sampling

– very common in ecology

– measurements within a cluster are NOT

independent (CAUTION!)

– must recognize and analyze appropriately

at cluster level

• Multi-stage

– sub-sample within each cluster

– seek help

• Multi-phase

– resample initial sample for more in-depth

analysis

– common example is ground truthing followed by adjustment

Improving precision

• Stratify

– can be applied to ANY sample design

– low cost/no cost way to improve precision

– if not stratifying, why not?

– may be able to post-stratify after the

fact

• Auxiliary information (covariates)

– e.g. last year's data used to predict this year's values

– assumes relationships between covariate

and response

• Unequal probability sampling

– more important (bigger, more costly)

items sampled with higher probability

– seek help

SRS

• all units selected independently with equal

probability

• most basic and default design of most packages

• precision depends basically on ABSOLUTE

not relative sample size

Planning a survey

• What level of precision is needed?

– Preliminary survey (rse = 25%; 95% ci is ± 50%)

– Management work (rse = 12%; 95% ci is ± 25%)

– Scientific work (rse = 5%; 95% ci is ± 10%)

• What is the STANDARD DEVIATION of survey units?

• What is the approximate population parameter (e.g. mean or proportion)?

• Use planning tool in Excel workbook
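The planning questions above can be turned into a rough sample size. This is a sketch only: it assumes a simple random sample from a large population (no finite population correction) and the relation rse = (sd/√n)/mean. The function name and planning values are hypothetical, and the workbook's planning tool may use a different calculation.

```python
import math

def srs_sample_size(sd, mean, target_rse):
    """Smallest n with (sd / sqrt(n)) / mean <= target_rse under SRS.

    Assumes a large population (no finite population correction).
    """
    cv = sd / mean                       # relative variability of the units
    return math.ceil((cv / target_rse) ** 2)

# Hypothetical planning values: sd = 40 and mean = 100 for the survey units
for label, rse in [("preliminary", 0.25), ("management", 0.12), ("scientific", 0.05)]:
    print(label, srs_sample_size(40.0, 100.0, rse))
```

Halving the target rse roughly quadruples the required sample size, which is why the scientific target is so much more expensive than the preliminary one.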

************************************************************

Stratification

• Define strata (no more than 5 or 6)

– units within stratum to be similar; strata

to be different

• what is total sample size needed (see previously)

• allocate sample size to strata

– proportional to stratum size or importance

– equal allocation

• Separate survey in each stratum

– not necessary to use same sampling scheme

in each stratum

– use best method in each stratum

• Separate analysis in each stratum

• Rollup

– Add estimated stratum TOTALS

– $se_{total} = \sqrt{se_{Total_1}^2 + se_{Total_2}^2 + \dots + se_{Total_k}^2}$
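The rollup step can be sketched as follows, using made-up stratum results. Adding the squared standard errors assumes the strata were surveyed independently.

```python
import math

# Hypothetical stratum results: (estimated stratum TOTAL, se of that total)
strata = [(1200.0, 30.0), (800.0, 40.0)]

# Add the estimated stratum totals
total = sum(est for est, _ in strata)

# Combine the standard errors: square, add, take the square root
se_total = math.sqrt(sum(se ** 2 for _, se in strata))

print(total, se_total)
```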

Ratio Estimation

• Mean-of-Ratios or Ratio-of-Means

• Use regression with line through origin

Cluster Sampling

• recognize when clustering takes place

– Transects; Sampling unit ≠ measuring unit

• move analysis up to cluster level

– compute cluster TOTAL and cluster SIZE

• use ratio estimation method seen earlier to

get TOTAL/SIZE
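A minimal sketch of the cluster-level analysis, assuming the clusters (transects) themselves were chosen by simple random sampling. The data are hypothetical, and the standard error uses the usual large-sample approximation for a ratio-of-means estimator with no finite population correction.

```python
import math

# Hypothetical transects: (cluster TOTAL count, cluster SIZE, e.g. length in m)
clusters = [(12, 100), (30, 250), (8, 80), (20, 170)]

n = len(clusters)
sum_total = sum(t for t, _ in clusters)
sum_size = sum(s for _, s in clusters)
ratio = sum_total / sum_size          # ratio-of-means estimate of TOTAL/SIZE

# Approximate se of the ratio; deviations are taken at the CLUSTER level,
# never at the level of individual measurements within a cluster
mean_size = sum_size / n
resid_var = sum((t - ratio * s) ** 2 for t, s in clusters) / (n - 1)
se_ratio = math.sqrt(resid_var / n) / mean_size

print(f"density = {ratio:.4f} per unit size, se = {se_ratio:.4f}")
```

Only n = 4 clusters contribute to the precision here, no matter how many individual measurements were taken within each transect.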

Design and Analysis of Experiments

• Treatment structure

– what are factors

– what are levels

– what treatments (combinations of levels) appear in experiment

• Experimental Unit structure

– What are the experimental units (e.u.)? What are the observational units (o.u.)?

– beware of pseudo-replication

– is blocking happening?

• Randomization structure

– complete randomization

– restricted randomization (blocking)

– no randomization (measured over time?)

Common Experimental Designs

• Need to match ANALYSIS with DESIGN

• draw a picture of design

– CRD

∗ complete randomization of treatments

to experimental units

∗ experimental unit = observational unit

∗ default analysis on most packages

– RCB

∗ group experimental units into homogeneous blocks

∗ randomize within each block

∗ experimental unit = observational unit

– split-plot designs

∗ two sizes of experimental units

∗ one factor assigned to larger exp units

(main plots)

∗ second factor assigned to smaller exp

units (sub plots)

∗ MOST COMMONLY MISANALYZED

design

CRD with 2 levels of single factor

• comparison of POPULATION MEANS across

treatment groups

• null hypothesis is equality of POPULATION

MEANS, H : µ1 = µ2

• assumptions and how to check

– CRD - check how experiment was run

– no outliers - side-by-side dot plots

– equal group std deviations - compare

the sample std dev

– normality of residuals - hard to check

– independence of observations

• test statistic is the t-statistic

• p-value = measure of unusualness of data

vs hypothesis

• effect size = estimate of difference in means

+ SE

• power analysis BEFORE study is run

– α = .05

– standard deviation from past data or

range/4

– biological important difference; 1 std dev?

– target 80% power
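The power-analysis inputs listed above can be combined into an approximate sample size per group. This sketch uses the normal approximation to the two-sided two-sample t-test with equal group sizes; the function name is hypothetical, and exact t-based calculations in a statistics package typically give a slightly larger answer.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two means.

    Normal approximation, equal group sizes:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)

# Biologically important difference of 1 standard deviation
print(n_per_group(delta=1.0, sd=1.0))

# Halving the difference to be detected roughly quadruples the sample size
print(n_per_group(delta=0.5, sd=1.0))
```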

CRD with ≥ 2 levels of single factor

• CRD ≥ 2 groups = CRD ANOVA = ONE-WAY ANOVA

• comparison of POPULATION MEANS across

treatment groups

• null hypothesis is equality of POPULATION

MEANS, H : µ1 = µ2 = . . . = µk

• assumptions and how to check

– CRD - check how experiment was run

– no outliers - side-by-side dot plots

– equal group std deviations - compare

the sample std dev

– normality of residuals - hard to check

– independence of observations

• test statistic is the F-ratio = signal/noise ratio

• p-value = measure of unusualness of data

vs hypothesis

• multiple comparisons to see where differences lie

– controls the experimentwise error rate

• effect size = estimate of difference in means

+ SE

• power

– α = .05

– standard deviation from past data or

range/4

– biological important difference: min vs

max and configuration

– target 80% power
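To make the "signal/noise" reading of the F-ratio concrete, the sketch below computes it by hand for a small CRD with made-up data. A statistics package would also report the p-value from the F distribution with (k - 1, N - k) degrees of freedom.

```python
import statistics

# Hypothetical responses for k = 3 treatment groups in a CRD
groups = {
    "A": [10.0, 12.0, 11.0],
    "B": [14.0, 15.0, 16.0],
    "C": [20.0, 19.0, 21.0],
}

all_values = [y for ys in groups.values() for y in ys]
grand_mean = statistics.mean(all_values)
k = len(groups)
n_total = len(all_values)

# Signal: variation of the group means about the grand mean
ss_between = sum(len(ys) * (statistics.mean(ys) - grand_mean) ** 2
                 for ys in groups.values())

# Noise: variation of observations about their own group mean
ss_within = sum((y - statistics.mean(ys)) ** 2
                for ys in groups.values() for y in ys)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_ratio = ms_between / ms_within

print(f"F = {f_ratio:.1f} on ({k - 1}, {n_total - k}) df")
```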

Pseudo-replication

• Simple pseudo-replication = fish in tank

experiment

– experimental unit ≠ observational unit

– not true replicates

• Temporal pseudo-replication = repeated measurements over time

– same experimental unit measured over time

– must adjust se to account for multiple

measurements

• Sacrificial pseudo-replication

– Test/Pool/Test especially for categorical data

– Newer software can deal with this directly

• Implicit pseudo-replication

– Recognize pseudo-replication but then

ignore it.

Blocked Designs

• Paired or Blocked (RCB)

– group experimental units into more homogeneous sets

– randomize treatments WITHIN blocks

to exp units.

• Paired designs - 4 (equivalent) ways to analyze in JMP

– find differences and look at mean differ-

ence using DISTRIBUTION platform

– MatchedPairs platform

– Fit-Y-by-X platform and specify blocking variable

– FitModel and specify blocking variable

• RCB ANOVA

– check ANOVA table for F-ratio for TREATMENT effect

– dot-plots are block centered

• same assumptions as CRD + additivity

• get effect size as before

• do MCP as before

• Modern software can deal with missing values (seek help)
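The first of the paired-design approaches (find the differences and look at the mean difference) can be sketched outside JMP as well. The pairs below are hypothetical, and the p-value would come from a t distribution with (number of pairs - 1) degrees of freedom.

```python
import math
import statistics

# Hypothetical paired data: one (control, treated) pair per block
pairs = [(10.0, 12.0), (8.0, 11.0), (15.0, 16.0), (9.0, 13.0), (12.0, 14.0)]

# Differencing within each pair removes the block-to-block variation
diffs = [treated - control for control, treated in pairs]

mean_diff = statistics.mean(diffs)                          # effect size
se_diff = statistics.stdev(diffs) / math.sqrt(len(diffs))   # its SE
t_stat = mean_diff / se_diff

print(f"mean difference = {mean_diff:.2f} (se {se_diff:.2f}), t = {t_stat:.2f}")
```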

Comparing Proportions

• response is CATEGORY (e.g. live vs dead)

• beware of how data are presented

– individual records

– grouped records with trt, response, count

• hypothesis of EQUALITY of PROPORTIONS or INDEPENDENCE

• need to look at design carefully; seek help

if not a CRD

• beware of compositional data

• Analysis

– contingency table with row percents

– mosaic plot (segmented bar charts)

– Pearson/Likelihood ratio test (χ²) and p-value

– No easy multiple comparison procedure
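A sketch of the contingency-table analysis for grouped records, assuming a CRD and made-up counts. The p-value line relies on the identity P(χ² > x) = erfc(√(x/2)), which holds only for 1 degree of freedom (a 2 × 2 table); larger tables need the χ² distribution from a statistics package.

```python
import math

# Hypothetical grouped records: rows = treatment, columns = (live, dead)
table = {"control": [30, 20], "treated": [40, 10]}

row_totals = {trt: sum(counts) for trt, counts in table.items()}
col_totals = [sum(col) for col in zip(*table.values())]
grand_total = sum(row_totals.values())

# Row percents: share of each response category within a treatment
row_percents = {trt: [100 * c / row_totals[trt] for c in counts]
                for trt, counts in table.items()}

# Pearson chi-square: sum of (observed - expected)^2 / expected
chi_sq = 0.0
for trt, counts in table.items():
    for j, observed in enumerate(counts):
        expected = row_totals[trt] * col_totals[j] / grand_total
        chi_sq += (observed - expected) ** 2 / expected

df = (len(table) - 1) * (len(col_totals) - 1)
p_value = math.erfc(math.sqrt(chi_sq / 2))   # valid only when df == 1

print(row_percents)
print(f"chi-square = {chi_sq:.2f} on {df} df, p = {p_value:.3f}")
```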

Simple Linear Regression

• both X and Y are interval/ratio (continuous)

• assumptions

– Linear relationship

– Both X and Y are interval/ratio

– Completely randomized design to collect (X, Y) pairs

– No outliers or influential points

– Equal variation about regression line

– Independence of residuals

– Normality of residuals

– X measured without error

Simple Linear Regression - cont

• Least squares minimizes the squared deviations in the vertical direction

• Estimates of intercept and slope (and se)

• Test if population slope = 0

• Predictions - CAREFUL

– Predictions and c.i. for MEAN response

– Predictions and p.i. for INDIVIDUAL

response

– Inverse Predictions and c.i. based on

MEAN response

– Inverse Predictions and p.i. based on

INDIVIDUAL Response

• check residual plots

– plot residual vs. predicted, vs. X, vs.

new predictors

• beware of perils of R²

• sometimes need to transform Y or X
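The regression quantities above (slope, intercept, their se, and the difference between predicting a MEAN and an INDIVIDUAL response) can be sketched with the standard least-squares formulas. The (X, Y) pairs are made up, and the intervals themselves would use a t multiplier with n - 2 degrees of freedom.

```python
import math
import statistics

# Hypothetical (X, Y) pairs
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual standard deviation about the line (n - 2 degrees of freedom)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
se_slope = s / math.sqrt(sxx)

# Uncertainty at a new X: MEAN response vs INDIVIDUAL response
x0 = 3.5
fit = intercept + slope * x0
se_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
se_indiv = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)

print(f"slope = {slope:.3f} (se {se_slope:.3f}), intercept = {intercept:.3f}")
print(f"at X = {x0}: fit = {fit:.2f}, se(mean) = {se_mean:.3f}, se(indiv) = {se_indiv:.3f}")
```

The se for an individual response is always larger than the se for the mean response, because it also includes the scatter of individual points about the line.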
