
Basic statistics
Descriptive statistics and ANOVA

Thomas Alexander Gerds

Department of Biostatistics, University of Copenhagen

Contents

- Data are variable
- Statistical uncertainty
- Summary and display of data
- Confidence intervals
- ANOVA

Data are variable

A statistician is used to receiving a value, such as

3.17 %,

together with an explanation, such as

"this is the expression of 1-B6.DBA-GTM in mouse 12".

The value from the next mouse in the list is 4.88% . . .

- The measurement is difficult
- Data processing is done by humans
- The two mice have different genes
- They are exposed and treated differently

Decomposing variance

Variability of data is usually a composite of

- Measurement error, sampling scheme
- Random variation
- Genotype
- Exposure, life style, environment
- Treatment

Statistical conclusions can often be obtained by explaining the sources of variation in the data.

Example 1

In the yeast experiment of Smith and Kruglyak (2008)¹, transcript levels were profiled in 6 replicates of the same strain, called 'RM', in glucose under controlled conditions.

¹ The article is available at http://biology.plosjournals.org

Example 1

(Figure: the 6 measured transcript levels)

Sources of the variation of these 6 values:

- Measurement error
- Random variation

Example 1

In the same yeast experiment, Smith and Kruglyak (2008) also profiled 6 replicates of a different strain, called 'By', in glucose. The order in which the 12 samples were processed was randomized to minimize systematic experimental effects.

Example 1

(Figure: the 12 measured transcript levels)

Sources of the variation of these 12 values:

- Measurement error
- Study design/experimental environment
- Genotype

Example 1

Furthermore, Smith and Kruglyak (2008) cultured 6 'RM' and 6 'By' replicates in ethanol. The order in which the 24 samples were processed was randomized to minimize systematic experimental effects.

Sources of variation

(Figure: transcript levels of all 24 samples)

- Measurement error
- Experimental environment
- Genes
- Exposure, environmental factors

Example 2

Festing and Weigler, in the Handbook of Laboratory Animal Science, consider the results of an experiment using a completely randomized design in which adult C57BL/6 mice were randomly allocated to one of four dose levels of a hormone compound.

The uterus weight was measured after an appropriate time interval.

Example 2

(Figures: the measured uterus weights at each dose level)

Example 2

Conclusions from the figures

- The uterus weight depends on the dose
- The variation of the data increases with increasing dose

Question: Why could these first conclusions be wrong?

Descriptive statistics

Descriptive statistics (summarizing data)

Categorical variables: count (%).

Continuous variables:
- raw values (if n is small)
- range (min, max)
- location: median (IQR = inter-quartile range)
- location: mean (SD)

Sample: Table 1²

² Quality of life (QOL), supportive care, and spirituality in hematopoietic stem cell transplant (HSCT) patients. Sirilla & Overcash. Supportive Care in Cancer, October 2012.

R excursion: calculating descriptive statistics in groups

library(Publish)
library(data.table)
data(Diabetes)
setDT(Diabetes) ## make data.table
Diabetes[, .(mean.age = mean(age),
             sd.age = sd(age),
             median.chol = median(chol, na.rm = TRUE)),
         by = location]

     location mean.age   sd.age median.chol
1: Buckingham 47.07500 16.74849         202
2:     Louisa 46.63054 15.90929         206

R excursion: making table one

library(Publish)
data(Diabetes)
tab1 <- summary(utable(location ~ gender + age + Q(chol) + BMI,
                       data = Diabetes))
tab1

Variable  Level         Buckingham (n=200)    Louisa (n=203)        Total (n=403)         p-value
gender    female        114 (57.0)            120 (59.1)            234 (58.1)
          male          86 (43.0)             83 (40.9)             169 (41.9)            0.7422
age       mean (sd)     47.1 (16.7)           46.6 (15.9)           46.9 (16.3)           0.7847
chol      median [iqr]  202.0 [174.0, 231.0]  206.0 [183.5, 229.0]  204.0 [179.0, 230.0]  0.2017
          missing       1                     0                     1
BMI       mean (sd)     28.6 (7.0)            29.0 (6.2)            28.8 (6.6)            0.5424
          missing       3                     3                     6

R excursion: exporting a table

Method 1: Write table to file

write.csv(tab1, file = "tables/tab1.csv")

Then open the file tab1.csv with Excel.

Method 2: Use kable³ and include it in a dynamic report⁴

```{r, results='asis'}
knitr::kable(tab1)
```

³ https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
⁴ https://www.rdocumentation.org/packages/knitr/versions/1.17/topics/kable


Dynamite plots are deprecated (DO NOT USE)⁵

⁵ http://biostat.mc.vanderbilt.edu/wiki/Main/DynamitePlots

Dot plots are appreciated when n is small

(Dot plot: measurement scale from −3 to 3 on the y-axis, groups A, B and C on the x-axis)

Figure: Group A (n=3), group B (n=3, one replicate), group C (n=4)

Box plots are appreciated when n is large

(Box plots: measurement scale from −4 to 4 on the y-axis, groups A, B and C on the x-axis)

Figure: Group A (n=300), group B (n=400), group C (n=400)

Making boxplots with ggplot2

library(ggplot2)
bp <- ggplot(Diabetes, aes(location, chol))
bp <- bp + geom_boxplot(aes(fill = location))
print(bp)

Find the ggplot2 cheat sheet via the Help menu in RStudio.

Making boxplots with ggplot2

(Box plots of chol, ranging from about 100 to 400, by location: Buckingham and Louisa)

Making boxplots with ggplot2

bp + facet_grid(. ~ gender)

(Box plots of chol by location, in separate panels for female and male)

Making dotplots with ggplot2

dp <- ggplot(mice, aes(x = Dose, fill = Dose, y = BodyWeight))
dp <- dp + geom_dotplot(binaxis = "y")
print(dp)

(Dot plot of BodyWeight, roughly 10 to 13, by Dose: 0, 1, 2.5, 7.5, 50)

R excursion: exporting a figure

Write figure to pdf (vector graphics, also eps⁶, infinite zoom)

pdf("figures/dotplot-mice-bodyweight.pdf")
dp
dev.off()

Write figure to jpg (image file, also tiff, gif, etc.)

jpeg("figures/dotplot-mice-bodyweight.jpg")
dp
dev.off()

⁶ postscript("figures/dotplot-mice-bodyweight.eps")

Quantifying variability

A sample of data X1, . . . , XN has a standard deviation (SD); it is defined by

SD = √[ (1/(N−1)) Σ_{i=1}^{N} (X_i − X̄)² ],   where X̄ = (1/N) Σ_{i=1}^{N} X_i is the sample mean.

SD measures the variability of the measurements in the sample.

The variance of the sample is defined as SD². The term 'standard deviation' relates to the normal distribution.
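As a quick check of these definitions, the formula can be compared with R's built-in functions; the vector x below is made up for illustration (a sketch, not part of the slides):

x <- c(3.17, 4.88, 2.95, 4.10)            ## made-up expression values (%)
N <- length(x)
xbar <- sum(x)/N                          ## sample mean
SD <- sqrt(sum((x - xbar)^2)/(N - 1))     ## standard deviation from the formula
c(formula = SD, builtin = sd(x))          ## the two agree
all.equal(SD^2, var(x))                   ## the sample variance is SD^2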

Normal distribution

What is so special about the normal distribution?

- It is symmetric around the mean, thus the mean is equal to the median.
- The mean is the most likely value. Mean and standard deviation describe the full distribution.
- The distribution of measurements, like height, distance, volume, is often normal.
- The distribution of statistics, like mean, proportion, mean difference, etc., is very often approximately normal (see the small check below).
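A small check of the last point (a sketch, not on the slides): for a normal distribution about 95% of the values lie within 1.96 standard deviations of the mean, which is exactly the multiplier used later for 95% confidence intervals.

pnorm(1.96) - pnorm(-1.96)   ## probability within mean +/- 1.96 SD, approximately 0.95
qnorm(0.975)                 ## approximately 1.96, the quantile used for 95% confidence intervals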

Quantifying statistical uncertainty

For statistical inference and conclusion making, via p-values and confidence intervals, it is crucial to quantify the variability of the statistic (mean, proportion, mean difference, risk ratio, etc.):

The standard error is the standard deviation of the statistic.

The standard error is a measure of the statistical uncertainty.

Quantifying statistical uncertainty

Example: We want to estimate the unknown mean uterus weight for untreated mice. The standard error of the mean is defined as

SE = SD/√N, where N is the sample size.

Based on N = 4 values, 0.012, 0.0088, 0.0069, 0.009:

- mean: β̂ = 0.0091
- standard deviation: SD = 0.002108
- empirical variance: var = SD² = 0.0000044
- standard error: SE = 0.002108/√4 = 0.002108/2 = 0.001054 (reproduced in the R sketch below)
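These numbers can be reproduced in R with a few lines (a sketch using the four values from the slide):

w <- c(0.012, 0.0088, 0.0069, 0.009)   ## uterus weights of the N = 4 untreated mice
mean(w)                                ## 0.009175, shown as 0.0091 on the slide
sd(w)                                  ## 0.002108
var(w)                                 ## 0.0000044 (empirical variance = SD^2)
sd(w)/sqrt(length(w))                  ## standard error 0.001054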


The standard error is the standard deviation of the mean

(Figure: mean uterus weight (g) from our study and from hypothetical studies 1, ..., 47, ..., 100, together with the unknown true average uterus weight)

The (hypothetical) mean values are approximately normally distributed, even if the data are not normally distributed!
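This can be illustrated with a small simulation (a sketch, not from the slides): the raw data below are drawn from a clearly skewed (exponential) distribution, yet the histogram of the study means looks roughly bell-shaped.

set.seed(1)
## 100 hypothetical studies, each with 20 skewed (exponential) observations
study.means <- replicate(100, mean(rexp(20, rate = 1/0.009)))
hist(study.means, xlab = "Mean uterus weight (g)",
     main = "Means of 100 hypothetical studies")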

Variance vs statistical uncertainty

"’The terms standard error and standard deviation are oftenconfused. The contrast between these two terms reflects theimportant distinction between data description and inference, onethat all researchers should appreciate."’ 7

Rules:
- The higher the unexplained variability of the data, the higher the statistical uncertainty.
- The higher the sample size, the lower the statistical uncertainty.

⁷ Altman & Bland, Statistics Notes, BMJ, 2005; Nagele P, Br J Anaesth 2003; 90: 514-6.

Confidence intervals

Constructing confidence limits

A 95% confidence interval for the parameter β is

[β̂ − 1.96 ∗ SE ; β̂ + 1.96 ∗ SE ]

Example: a confidence interval for the mean uterus weight of untreated mice is given by

95% CI = [0.0091 − 1.96 ∗ 0.001054 ; 0.0091 + 1.96 ∗ 0.001054] = [0.007 ; 0.011].

The standard error SE measures the variability of the mean β̂ around the (unknown) population value β, under the assumption that the model is correctly specified.
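In R, the interval from the example can be computed directly from the mean and its standard error (a minimal sketch):

m  <- 0.0091                         ## estimated mean uterus weight
se <- 0.001054                       ## standard error of the mean
m + c(-1, 1) * qnorm(0.975) * se     ## 95% CI: approximately [0.007; 0.011]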

The idea of a 95% confidence interval

(Figure: 95% confidence intervals for the mean uterus weight (g) from our study and from hypothetical studies 1, ..., 47, ..., 100, together with the unknown true average uterus weight)

By construction, we expect only about 5 of the 100 confidence intervals not to cover (include) the true value.
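A small simulation shows the coverage property in practice (a sketch; the true mean 0.009 g and SD 0.002 g are invented for illustration):

set.seed(2)
true.mean <- 0.009
covers <- replicate(100, {
    x  <- rnorm(100, mean = true.mean, sd = 0.002)     ## one hypothetical study
    se <- sd(x)/sqrt(length(x))
    ci <- mean(x) + c(-1, 1) * qnorm(0.975) * se       ## 95% confidence interval
    ci[1] <= true.mean && true.mean <= ci[2]           ## does it cover the true mean?
})
sum(covers)    ## number of the 100 intervals that cover the true value, typically around 95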

Confidence limits for the mean uterus weights (long code)

library(Publish)
library(data.table)
## mice is assumed to be a data.table with columns Dose and UterusWeight
cidat <- mice[, {
    meanU <- mean(UterusWeight)
    seU <- sqrt(var(UterusWeight)/.N)
    .(meanU = meanU,
      lowerU = meanU - seU*qnorm(1 - 0.05/2),
      upperU = meanU + seU*qnorm(1 - 0.05/2))
}, by = Dose]
publish(cidat, digits = 1)

 Dose meanU lowerU upperU
  0.0 0.009  0.007   0.01
  1.0 0.025  0.020   0.03
  2.5 0.051  0.046   0.06
  7.5 0.089  0.079   0.10
 50.0 0.087  0.066   0.11

Confidence limits for the mean uterus weights (short code)

library(Publish)
cidat <- mice[, ci.mean(UterusWeight), by = Dose]
publish(cidat, digits = 1)

 Dose  mean    se lower upper level  statistic
  0.0 0.009 0.001 0.006  0.01  0.05 arithmetic
  1.0 0.025 0.002 0.018  0.03  0.05 arithmetic
  2.5 0.051 0.002 0.044  0.06  0.05 arithmetic
  7.5 0.089 0.005 0.072  0.11  0.05 arithmetic
 50.0 0.087 0.011 0.053  0.12  0.05 arithmetic

Confidence limits for the geometric mean uterus weights (short code)

library(Publish)
gcidat <- mice[, ci.mean(UterusWeight, statistic = "geometric"), by = Dose]
publish(gcidat, digits = 1)

 Dose geomean  se lower upper level statistic
  0.0   0.009 1.1 0.006  0.01  0.05 geometric
  1.0   0.024 1.1 0.018  0.03  0.05 geometric
  2.5   0.051 1.0 0.044  0.06  0.05 geometric
  7.5   0.089 1.1 0.073  0.11  0.05 geometric
 50.0   0.085 1.1 0.057  0.13  0.05 geometric

ggplot2: Plot of means with confidence intervals

library(ggplot2)
## cidat with columns meanU, lowerU, upperU from the "long code" slide above
pom <- ggplot(cidat) +
    geom_pointrange(aes(x = Dose, y = meanU, ymin = lowerU, ymax = upperU), color = 4)
pom + coord_flip() + ylab("Uterus weight (g)") + xlab("Dose")

Plot of means with confidence intervals

(Plot: means with confidence intervals, uterus weight (g) from about 0.03 to 0.09 on the x-axis, doses 0, 1, 2.5, 7.5 and 50 on the y-axis)

Publish: Plot of means with confidence intervals (code)

library(Publish)
plotConfidence(x = cidat$mean,
               lower = cidat$lower,
               upper = cidat$upper,
               labels = cidat$Dose,
               title.labels = "Hormone dose",
               title.values = expression(paste("Mean (", CI[95], ")")),
               cex = 1.8,
               stripes = TRUE,
               xratio = c(.2, .3),
               xlim = c(0, .15),
               xlab = "Uterus weight (g)")

Publish: Plot of means with confidence intervals (result)

(Plot: mean uterus weight (g) with 95% confidence intervals by hormone dose: 0: 0.01 (0.01−0.01); 1: 0.02 (0.02−0.03); 2.5: 0.05 (0.04−0.06); 7.5: 0.09 (0.07−0.11); 50: 0.09 (0.05−0.12))

Parameters

It is generally difficult to interpret a p-value without further quantification of the parameter of interest.

Parameters are interpretable characteristics that have to be estimated based on data.

Examples that we will study during the course:

- Means
- Mean differences
- Probabilities
- Risk ratios, odds ratios, hazard ratios
- Association parameters, regression coefficients

Juonala et al. (part I)

Aims: The objective was to produce reference values and to analyse the associations of age and sex with carotid intima-media thickness (IMT), carotid compliance (CAC), and brachial flow-mediated dilatation (FMD) in young healthy adults.

Methods and results: We measured IMT, CAC, and FMD with ultrasound in 2265 subjects aged 24–39 years. The mean values (mean ± SD) in men and women were 0.592 ± 0.10 vs. 0.572 ± 0.08 mm (P < 0.0001) for IMT, 2.00 ± 0.66 vs. 2.31 ± 0.77 %/10 mmHg (P < 0.0001) for CAC, and 6.95 ± 4.00 vs. 8.83 ± 4.56% (P < 0.0001) for FMD.

The sex differences in IMT (95% confidence interval = [-0.013; 0.004] mm, P = 0.37) and CAC (95% CI = [-0.01; 0.18] %/10 mmHg, P = 0.09) became non-significant after adjustment for risk factors and carotid diameter.

Confidence intervals

A confidence interval is a range of values which covers the unknown true population parameter with high probability. Roughly, the probability is (100 − α)%, where α is the level of significance.

For example,

−0.013 to 0.004

is a 95% confidence interval for the unknown average difference in IMT between men and women.

Confidence intervals have the advantage over p-values that their absolute values have a direct interpretation.⁸

⁸ Confidence intervals rather than P values: estimation rather than hypothesis testing. Statistics with Confidence, Altman et al.

Relation between confidence intervals and p-values

If we estimate the parameter β, e.g.

β = mean(IMT, men) − mean(IMT, women)

and have computed a 95% confidence interval for this parameter,

[lower95, upper95]

then the null hypothesis

β = 0 "There is no difference"

can be rejected at the 5% significance level if the value 0 is not included in the interval: 0 ∉ [lower95, upper95].
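The same duality can be seen directly in R output (a sketch with two made-up samples; t.test reports both the p-value and the confidence interval for the difference):

set.seed(3)
men   <- rnorm(50, mean = 0.60, sd = 0.10)   ## made-up IMT-like values for men
women <- rnorm(50, mean = 0.57, sd = 0.08)   ## made-up IMT-like values for women
t.test(men, women)   ## the 95% CI for the difference excludes 0 exactly when p < 0.05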

ANOVA

Example (DGA p.208)

22 patients undergoing cardiac bypass operations were randomized to 3 types of ventilation.

Outcome: red cell folate level (µg/l)

Group   Ventilation                            N   Mean    SD
I       50% N2O, 50% O2 for 24 hours           8   316.6   58.7
II      50% N2O, 50% O2 during operation       9   256.4   37.1
III     30–50% O2 (no N2O) for 24 hours        5   278.0   33.8

ANOVA table for red cell folate levels

Source of variation   Degrees of freedom   Sum of squares   Mean squares   F      P
Between groups        2                    15515.88         7757.9         3.71   0.04
Within groups         19                   39716.09         2090.3
Total                 21                   55231.97
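The slides do not show the corresponding R code, but an ANOVA table of this form can be produced with aov(); in the sketch below the data are simulated from the reported group sizes, means and standard deviations, so the resulting numbers will only be similar to, not identical with, the table above.

set.seed(4)
folate.data <- data.frame(
    ventilation = rep(c("I", "II", "III"), times = c(8, 9, 5)),
    folate      = c(rnorm(8, mean = 316.6, sd = 58.7),
                    rnorm(9, mean = 256.4, sd = 37.1),
                    rnorm(5, mean = 278.0, sd = 33.8)))
fit <- aov(folate ~ ventilation, data = folate.data)   ## one-way ANOVA
summary(fit)   ## Df, Sum Sq, Mean Sq, F value and Pr(>F), laid out as in the table above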

What are sum of squares and degrees of freedom?

Recall the definition of the variance for a sample of N values X1, . . . , XN with mean X̄:

Var = (1/(N−1)) · { (X1 − X̄)² + · · · + (XN − X̄)² }

Here N − 1 is the degrees of freedom and (X1 − X̄)² + · · · + (XN − X̄)² is the sum of squares.

In ANOVA terminology the variance is referred to as a mean square, which is short for: mean squared deviation from the mean.
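A quick numerical check of this identity (a sketch with made-up values):

x  <- c(316, 256, 278, 300, 290)     ## made-up folate-like values
SS <- sum((x - mean(x))^2)           ## sum of squares
df <- length(x) - 1                  ## degrees of freedom
all.equal(SS/df, var(x))             ## the mean square equals the sample variance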


ANOVA methods

- Independent observations:
  - t-test for two groups
  - One-way ANOVA for more groups
  - More-way ANOVA for more grouping variables
- Dependent observations:
  - Repeated measures ANOVA
  - Mixed effect models
- Rank statistics (non-parametric ANOVA tests):
  - Nonparametric ANOVA (Kruskal-Wallis test)
- Mixture of discrete and continuous factors:
  - ANCOVA
- Model comparison and model selection . . .

Nice methods, but what is the question?

Typical F-test hypotheses

H0 (null hypothesis):        The red cell folate does not depend on the treatment.
H1 (alternative hypothesis): The red cell folate does depend on the treatment.

This means

H0: Mean group I = Mean group II = Mean group III
H1: Mean group I ≠ Mean group II, or Mean group III ≠ Mean group II, or Mean group I ≠ Mean group III

Usually we want to know which treatment yields the best response.

F-test statistic

Central idea: The deviation of a subject's response from the grand mean of all responses is attributable to a deviation of that value from its group mean plus the deviation of that group mean from the grand mean.

F = (between-group variability) / (within-group variability)
  = (variance of the mean response values between groups) / (variance of the values within the groups)

If the between-group variability is large relative to the within-group variability, then the grouping factor contributes to the systematic part of the variability of the response values.
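With the numbers from the red cell folate ANOVA table, the mean squares and the F statistic can be recomputed by hand (each mean square is the sum of squares divided by its degrees of freedom):

ms.between <- 15515.88 / 2      ## 7757.9
ms.within  <- 39716.09 / 19     ## 2090.3
ms.between / ms.within          ## F = 3.71
pf(3.71, df1 = 2, df2 = 19, lower.tail = FALSE)   ## P value, approximately 0.04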

Conclusions from the ANOVA table

Source of variation   Degrees of freedom   Sum of squares   Mean squares   F      P
Between groups        2                    15515.88         7757.9         3.71   0.04
Within groups         19                   39716.09         2090.3
Total                 21                   55231.97

Conclusion: The red cell folate depends significantly on the treatment.

Take home messages

- The variation of data can be decomposed into a systematic and a random part.
- The standard deviation quantifies the variability of the data.
- The standard error quantifies the uncertainty of statistical conclusions.
- ANOVA is an old and general statistical technique with many different applications.