Intermediate R - Analysis of Count and Proportion Data


Upload: vivay-salazar

Post on 22-Nov-2014


Page 1: Intermediate R - Analysis  of Count and Proportion Data

Analysis of Count and Proportion Data

Violeta I. Bartolome

Senior Associate Scientist

PBGB-CRIL

[email protected]

Assumptions in the ANOVA

• Additive Effects

• Independence of errors

• Homogeneity of variances

• Normal distribution

[Figure: variance plotted against the mean]

Count Data

• Response variable is an integer

• Variance usually increases linearly with the mean

• Errors are not normally distributed

[Figure: variance plotted against the mean]

Proportion data

• Count of the number of failures of an event as well as the number of successes

• Variance will be an inverted U-shaped function of the mean.

[Figure: variance plotted against the mean]
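The variance-mean relationships described on this page can be illustrated with a small sketch. It assumes the textbook Poisson model for counts (variance equal to the mean) and the binomial model for proportion data (variance n * p * (1 - p)). The function names and the choice n = 10 are illustrative assumptions, not taken from the slides, and the code is plain Python rather than the R used in the course.

```python
def poisson_variance(mean):
    """Variance of a Poisson count: equal to its mean."""
    return mean


def binomial_variance(n, p):
    """Variance of the number of successes in n binomial trials."""
    return n * p * (1 - p)


n = 10  # illustrative binomial denominator (an assumption)
means = [0, 2, 4, 6, 8, 10]

# Counts: the variance rises linearly with the mean.
count_variances = [poisson_variance(m) for m in means]

# Proportions: the mean number of successes is n * p, so p = mean / n.
# The variance is zero at both ends and largest in the middle (inverted U).
proportion_variances = [binomial_variance(n, m / n) for m in means]

print(count_variances)
print([round(v, 2) for v in proportion_variances])
```

Under these assumptions the binomial variance peaks when the mean is n / 2, which is the inverted U shape the slide describes.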

Page 2: Intermediate R - Analysis  of Count and Proportion Data

Count Data

For treatment levels, define the control as the first level when sorted in ascending order. GLM uses the first level as reference.
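To make the reference-level idea concrete, here is a minimal sketch of treatment (dummy) coding in which the first level in sorted order becomes the reference. The treatment labels and the function name `treatment_coding` are hypothetical. Python sorts strings by code point, which may differ from the ordering R applies in a given locale, so this only mimics the behaviour described above.

```python
def treatment_coding(observed_levels):
    """Dummy-code a factor, using the first sorted level as the reference.

    Returns the reference level, the remaining levels (one indicator
    column each), and one row of 0/1 indicators per observation.
    """
    levels = sorted(set(observed_levels))
    reference, others = levels[0], levels[1:]
    rows = [[1 if value == level else 0 for level in others]
            for value in observed_levels]
    return reference, others, rows


# Hypothetical labels: "Control" sorts first, so it becomes the reference.
treatments = ["Control", "TrtB", "TrtA", "Control"]
reference, columns, design = treatment_coding(treatments)

print(reference)  # Control
print(columns)    # ['TrtA', 'TrtB']
print(design)     # [[0, 0], [0, 1], [1, 0], [0, 0]]
```

Each coefficient for a non-reference column is then read as a difference from the control, which is why the control should be named so that it sorts first.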

Analysis of count data

Note: glm uses the first level as reference.

401.45/15 = 26.8

Residual deviance is much greater than the df, an indication of overdispersion.
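The overdispersion check is a single division of the residual deviance by its degrees of freedom. The sketch below reproduces the arithmetic for the two ratios quoted in these slides; the function name `dispersion_ratio` is an illustrative assumption.

```python
def dispersion_ratio(residual_deviance, residual_df):
    """Residual deviance divided by its residual degrees of freedom."""
    return residual_deviance / residual_df


# Values quoted in the slides.
count_ratio = dispersion_ratio(401.45, 15)       # count data example
proportion_ratio = dispersion_ratio(123.96, 45)  # proportion data example

print(round(count_ratio, 1))       # 26.8
print(round(proportion_ratio, 1))  # 2.8
```

For a Poisson or binomial model that fits well, this ratio is expected to be close to 1, so values well above 1 point to overdispersion.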

Overdispersion

• Residual deviance is inflated

• There is extra, unexplained variation in the response

• May result if the underlying distribution is not Poisson

• Compensate for the overdispersion by refitting using quasi-Poisson rather than Poisson errors.
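A rough sketch of what the quasi-Poisson refit changes: the coefficient estimates stay the same, but each standard error is multiplied by the square root of an estimated dispersion parameter. The dispersion is commonly estimated as the Pearson chi-square statistic divided by the residual degrees of freedom, which need not equal the deviance-based ratio. All data values and function names below are hypothetical.

```python
import math


def pearson_dispersion(observed, fitted, residual_df):
    """Pearson chi-square statistic divided by residual degrees of freedom."""
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi_square / residual_df


def quasi_standard_error(poisson_se, dispersion):
    """Inflate a Poisson standard error by the square root of the dispersion."""
    return poisson_se * math.sqrt(dispersion)


# Hypothetical counts and fitted means from a two-parameter model.
observed = [2, 9, 1, 12]
fitted = [5, 5, 6, 6]
phi = pearson_dispersion(observed, fitted, residual_df=len(observed) - 2)

poisson_se = 0.05  # hypothetical standard error from the Poisson fit
quasi_se = quasi_standard_error(poisson_se, phi)

print(round(phi, 2))
print(round(quasi_se, 3))
```

Because the standard errors grow with the dispersion, effects that looked significant under Poisson errors may no longer be significant after the correction.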

Page 3: Intermediate R - Analysis  of Count and Proportion Data

Correct for overdispersion

401.47/15 = 26.8

ANOVA table

Residual Plot

• After fitting a model to data, we should investigate how well the model describes the data.

• With normal errors, the raw and standardized residuals are identical.

• Standardized residuals are required to correct for non-normal errors (as in count and proportion data), where the variance changes with the mean.

Standardized residuals

• For count data

$$\frac{y - \text{fitted value}}{\sqrt{\text{fitted value}}}$$

• For proportion data

$$\frac{y - \text{fitted value}}{\sqrt{\text{fitted value} \times \left(1 - \dfrac{\text{fitted value}}{\text{binomial denominator}}\right)}}$$
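The two standardized residuals can be computed directly. This sketch assumes the usual forms: (y - fitted) / sqrt(fitted) for counts, and (y - fitted) / sqrt(fitted * (1 - fitted / n)) for proportion data, where n is the binomial denominator. The function names and example numbers are hypothetical.

```python
import math


def standardized_residual_count(y, fitted):
    """Standardized residual for count data (Poisson errors)."""
    return (y - fitted) / math.sqrt(fitted)


def standardized_residual_proportion(y, fitted, n):
    """Standardized residual for proportion data (binomial errors).

    y and fitted are counts of successes; n is the binomial denominator.
    """
    return (y - fitted) / math.sqrt(fitted * (1 - fitted / n))


# Hypothetical observation: 12 counted where the model fitted 9.
print(standardized_residual_count(12, 9))  # 1.0

# Hypothetical observation: 7 successes out of 10 where the model fitted 5.
print(round(standardized_residual_proportion(7, 5, 10), 3))  # 1.265
```

Dividing by the square root of the variance implied by the error family puts residuals from small and large fitted values on a comparable scale before they are plotted.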

Page 4: Intermediate R - Analysis  of Count and Proportion Data

Compute standardized residuals

Residual plot

Predicted Means

Note: differences are based on transformed values.

If the interval includes zero, then the difference is not significant.
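The interval rule above can be sketched as follows. The intervals are hypothetical, and the sketch assumes that "transformed values" means the link scale of the model, which is the log scale if the count model uses a log link. Under that assumption, exponentiating a difference gives a ratio of means rather than a difference of means.

```python
import math


def includes_zero(lower, upper):
    """True if the interval [lower, upper] contains zero."""
    return lower <= 0 <= upper


# Hypothetical intervals for treatment differences on the transformed scale.
intervals = {
    "TrtA - Control": (-0.20, 0.35),
    "TrtB - Control": (0.10, 0.62),
}

for label, (lower, upper) in intervals.items():
    verdict = "not significant" if includes_zero(lower, upper) else "significant"
    print(label, verdict)

# Assuming a log link: back-transform the second interval to a ratio of means.
ratio_bounds = tuple(math.exp(bound) for bound in intervals["TrtB - Control"])
print(tuple(round(bound, 2) for bound in ratio_bounds))
```

On the ratio scale the equivalent check is whether the interval includes 1 rather than 0.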

Page 5: Intermediate R - Analysis  of Count and Proportion Data

Proportion Data

Traditional Analysis

o Convert the counts to percentages and use them as the response variable

o Not good, for these reasons:

o Errors are not normally distributed

o Variances are heterogeneous

o Response is bounded by 0 and 100

o Size of the sample, n, is lost
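A short sketch of the last point, that the sample size is lost once counts are converted to percentages. The numbers are hypothetical: 1 success out of 2 and 50 successes out of 100 give the same percentage, although the second estimate is far more precise under a binomial model.

```python
def percentage(successes, n):
    """Percentage of successes out of n trials."""
    return 100 * successes / n


def proportion_variance(p, n):
    """Binomial variance of an estimated proportion p based on n trials."""
    return p * (1 - p) / n


small = percentage(1, 2)     # hypothetical sample with n = 2
large = percentage(50, 100)  # hypothetical sample with n = 100

print(small == large)                 # True
print(proportion_variance(0.5, 2))    # 0.125
print(proportion_variance(0.5, 100))  # 0.0025
```

Analysing the percentages alone would give both samples equal weight, which is the information a binomial model retains through the denominator n.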

General Approach

• Use a generalized linear model (glm)

• family = binomial

• Uses two vectors, one for success counts and the other for failure counts

• Number of failures + number of successes = binomial denominator, n

Page 6: Intermediate R - Analysis  of Count and Proportion Data

Analysis of proportion

Create response matrix

• First column is the count of successes (or of failures)

• Second column is n - first column
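The two-column response described above can be sketched as pairs of (successes, failures), with failures computed as n minus successes. In R such a response is commonly built with `cbind`; the plain Python below only illustrates the layout, and the data and the name `response_matrix` are hypothetical.

```python
def response_matrix(successes, totals):
    """Pair each success count with its failure count (n - successes)."""
    if len(successes) != len(totals):
        raise ValueError("successes and totals must have the same length")
    rows = []
    for s, n in zip(successes, totals):
        if not 0 <= s <= n:
            raise ValueError("each success count must lie between 0 and n")
        rows.append((s, n - s))
    return rows


# Hypothetical data: successes observed out of n trials per plot.
successes = [18, 12, 20]
totals = [20, 20, 25]

matrix = response_matrix(successes, totals)
print(matrix)  # [(18, 2), (12, 8), (20, 5)]
```

Each row sums to its binomial denominator, so the sample size stays attached to every observation.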

Analysis of proportion

123.96/45 = 2.8

An indication of overdispersion

Correct for overdispersion

Page 7: Intermediate R - Analysis  of Count and Proportion Data

ANOVA table

Plot standardized residuals

Predicted Means

Page 8: Intermediate R - Analysis  of Count and Proportion Data

Mean Comparison

Thank you!