intermediate r - analysis of count and proportion data
TRANSCRIPT
Analysis of Count and Proportion Data
Violeta I. Bartolome
Senior Associate Scientist
PBGB-CRIL
Assumptions in the ANOVA
• Additive Effects
• Independence of errors
• Homogeneity of variances
• Normal distribution
[Plot: Variance against Mean]
Count Data
• Response variable is an integer
• Variance usually increases linearly with the mean
• Errors are not normally distributed
[Plot: Variance against Mean]
Proportion Data
• Count of the number of failures of an event as well as the number of successes
• Variance is an inverted U-shaped function of the mean
[Plot: Variance against Mean]
Count Data
For treatment levels, define the control as the first level when sorted in ascending order. GLM uses the first level as reference.
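As a minimal sketch of the point above, the level order of a factor can be inspected and, if needed, changed in R. The treatment labels here are hypothetical, and relevel() is shown only as an alternative to naming the control so that it sorts first.

```r
# Hypothetical treatment labels; "Control" sorts after "A" and "B".
treatment <- factor(c("A", "B", "Control", "A", "B", "Control"))
levels(treatment)   # "A" "B" "Control"

# Move the control to the first position so that glm() uses it as reference.
treatment <- relevel(treatment, ref = "Control")
levels(treatment)   # "Control" "A" "B"
```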
Analysis of Count Data
Note: glm uses the first level as reference.
401.45/15 = 26.8
Residual deviance is much greater than its df. This is an indication of overdispersion.
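The check described here (residual deviance divided by residual df) might look like the following in R. This is only a sketch: the data are simulated, not the data behind the slides, and the names d, trt and count are made up for illustration.

```r
set.seed(1)

# Simulated, deliberately overdispersed counts (negative binomial draws).
# These are NOT the data from the slides.
d <- data.frame(
  trt   = factor(rep(c("T0", "T1", "T2"), each = 6)),
  count = rnbinom(18, mu = rep(c(5, 10, 20), each = 6), size = 1)
)

# Poisson GLM; "T0" sorts first, so it is the reference level.
fit <- glm(count ~ trt, family = poisson, data = d)

# 18 observations - 3 treatment parameters = 15 residual df.
df.residual(fit)

# A ratio far above 1 suggests overdispersion.
ratio <- deviance(fit) / df.residual(fit)
ratio
```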
Overdispersion
• Residual deviance is inflated
• There is extra, unexplained variation in the response
• May result if the underlying distribution is not Poisson
• Compensate for the overdispersion by refitting using quasi-Poisson rather than Poisson errors.
Correct for overdispersion
401.47/15 = 26.8
ANOVA table
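Refitting with quasi-Poisson errors and producing the analysis-of-deviance table can be sketched as follows. The data are simulated and the object names are hypothetical. The use of an F test in anova() is an assumption, since the slides do not show which test was requested.

```r
set.seed(1)

# Simulated overdispersed counts; NOT the data from the slides.
d <- data.frame(
  trt   = factor(rep(c("T0", "T1", "T2"), each = 6)),
  count = rnbinom(18, mu = rep(c(5, 10, 20), each = 6), size = 1)
)

fit_pois  <- glm(count ~ trt, family = poisson,      data = d)
fit_quasi <- glm(count ~ trt, family = quasipoisson, data = d)

# The coefficient estimates are unchanged by the refit ...
cbind(poisson = coef(fit_pois), quasipoisson = coef(fit_quasi))

# ... but the standard errors are inflated by sqrt(dispersion).
phi      <- summary(fit_quasi)$dispersion
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
se_quasi / se_pois
sqrt(phi)

# Analysis-of-deviance table; an F test is used because the
# dispersion is estimated rather than fixed at 1.
anova(fit_quasi, test = "F")
```

Note that R estimates the quasi-Poisson dispersion from the Pearson chi-square divided by the residual df, so it will generally be close to, but not identical with, the deviance/df ratio.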
Residual Plot
• After fitting a model to data, we should investigate how well the model describes the data.
• With normal errors, the raw and standardized residuals are identical.
• Standardized residuals are required to correct for non-normal errors (as in count and proportion data).
Standardized residuals
• For count data
$$\frac{y - \text{fitted values}}{\sqrt{\text{fitted values}}}$$
• For proportion data
$$\frac{y - \text{fitted values}}{\sqrt{\text{fitted values} \times \left(1 - \dfrac{\text{fitted values}}{\text{binomial denominator}}\right)}}$$
Compute standardized residuals
Residual plot
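For count data, computing and plotting the standardized residuals could be sketched as below, assuming they are defined as (y - fitted values) divided by the square root of the fitted values. The data are simulated and the names are hypothetical.

```r
set.seed(1)

# Simulated counts; NOT the data from the slides.
d <- data.frame(
  trt   = factor(rep(c("T0", "T1", "T2"), each = 6)),
  count = rnbinom(18, mu = rep(c(5, 10, 20), each = 6), size = 1)
)
fit <- glm(count ~ trt, family = quasipoisson, data = d)

# Standardized residuals for count data.
std_res <- (d$count - fitted(fit)) / sqrt(fitted(fit))

# Under this definition they coincide with R's Pearson residuals.
all.equal(unname(std_res), unname(residuals(fit, type = "pearson")))

# Residual plot: look for trends or funnel shapes.
plot(fitted(fit), std_res,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
```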
Predicted Means
Note: differences are based on transformed values.
If the interval includes zero, then the difference is not significant.
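One way to obtain predicted means and intervals for the differences is sketched below. The slides do not show which functions were used, so predict() and the Wald intervals from confint.default() are assumptions, as are the simulated data and all object names.

```r
set.seed(1)

# Simulated counts; NOT the data from the slides.
d <- data.frame(
  trt   = factor(rep(c("T0", "T1", "T2"), each = 6)),
  count = rnbinom(18, mu = rep(c(5, 10, 20), each = 6), size = 1)
)
fit <- glm(count ~ trt, family = quasipoisson, data = d)

# Predicted means, back-transformed to the count scale.
new  <- data.frame(trt = factor(levels(d$trt), levels = levels(d$trt)))
pred <- predict(fit, newdata = new, type = "response")
data.frame(trt = new$trt, predicted_mean = pred)

# Wald intervals for the coefficients. The trtT1 and trtT2 rows are
# differences from the reference level on the transformed (log) scale.
ci <- confint.default(fit)
ci

# TRUE where the interval includes zero (difference not significant).
includes_zero <- ci[, 1] <= 0 & ci[, 2] >= 0
includes_zero[-1]
```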
Proportion Data
Traditional Analysis
o Convert to percentage data and use it as the response variable
o Not good
o Errors are not normally distributed
o Variances are heterogeneous
o Response is bounded by 0 and 100
o Size of the sample, n, is lost
General Approach
• Use a generalized linear model (glm)
• Family=binomial
• Uses two vectors, one for success counts and the other for failure counts
• Number of failures + number of successes = binomial denominator, n
Analysis of proportion
Create response matrix
• First column is success or failure
• Second column is n - first column
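A sketch of building the two-column response matrix and fitting the binomial model follows. The data are simulated, and the names d, trt, n, success and y are hypothetical.

```r
set.seed(2)

# Simulated proportion data: 15 units, 20 trials each.
# These are NOT the data from the slides.
d <- data.frame(
  trt = factor(rep(c("T0", "T1", "T2"), each = 5)),
  n   = rep(20, 15)
)
d$success <- rbinom(15, size = d$n, prob = rep(c(0.2, 0.5, 0.7), each = 5))

# Response matrix: first column successes, second column n - successes.
y <- cbind(d$success, d$n - d$success)

fit <- glm(y ~ trt, family = binomial, data = d)
summary(fit)

# Same overdispersion check as for counts: residual deviance / residual df.
deviance(fit) / df.residual(fit)
```

Passing the matrix rather than a percentage keeps the sample size n in the model, which the traditional percentage analysis loses.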
Analysis of proportion
123.96/45 = 2.8
An indication of overdispersion
Correct for overdispersion
ANOVA table
Plot standardized residuals
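For proportion data, the corresponding steps could be sketched as below. It assumes a quasibinomial refit (by analogy with the quasi-Poisson correction for counts) and standardized residuals defined as (y - fitted values) over the square root of fitted values x (1 - fitted values / binomial denominator). Data and names are again simulated and hypothetical.

```r
set.seed(2)

# Simulated proportion data; NOT the data from the slides.
d <- data.frame(
  trt = factor(rep(c("T0", "T1", "T2"), each = 5)),
  n   = rep(20, 15)
)
d$success <- rbinom(15, size = d$n, prob = rep(c(0.2, 0.5, 0.7), each = 5))
y <- cbind(d$success, d$n - d$success)

# Refit with quasibinomial errors so that the dispersion is estimated.
fit_q <- glm(y ~ trt, family = quasibinomial, data = d)
anova(fit_q, test = "F")

# fitted() returns proportions; multiply by n to get fitted counts.
fitted_counts <- d$n * fitted(fit_q)

# Standardized residuals for proportion data.
std_res <- (d$success - fitted_counts) /
  sqrt(fitted_counts * (1 - fitted_counts / d$n))

plot(fitted_counts, std_res,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
```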
Predicted Means
Mean Comparison
Thank you!