
REPORTING EFFECT SIZE ESTIMATES IN SCHOOL PSYCHOLOGY RESEARCH

MARTIN A. VOLKER

University at Buffalo, SUNY

This article reviews the arguments for reporting effect size estimates as part of the statistical results in empirical studies. Following this review, formulas are presented for the calculation of major mean-difference and association-based effect size measures for t tests, one-way ANOVA, zero order correlation, simple regression, multiple regression, and chi-square. The emphasis is on the presentation of formulas that make the calculation of effect size measures as easy as possible. In most cases, the formula components are readily available and easily recognizable on the output from most major statistical software. Examples of effect size reporting with guidelines for design and analytic variations are provided. © 2006 Wiley Periodicals, Inc.

The reporting of standardized effect size estimates alongside traditional null hypothesis testing results is required by more and more scholarly journals within the broader field of psychology with each passing year (see Vacha-Haase & Thompson, 2004). Yet, the practice of reporting standardized effect size estimates is still a rarity in school psychology research. Though there are likely to be a variety of reasons for this disparity, the relative unfamiliarity of many effect size measures, limited understanding of why they should be reported, and uncertainty about how to use them properly loom large. In this article, I attempt to address some of these issues by exploring a variety of arguments for the reporting of effect size estimates, presenting formulas for the calculation of effect size measures used in most popular analyses (i.e., effect sizes reported with t tests, one-way ANOVA, zero order correlation, simple regression, multiple regression, and chi-square), and giving guidelines with examples of how to report them. In general, the formulas presented were chosen because of their ease of use. In most cases, the numbers required for these formulas are readily available in the standard output from most popular statistical software packages.

Why Report Standardized Measures of Effect Size?

There are a number of reasons why researchers should report effect size estimates in the written results of their empirical articles. Some of these reasons entail arguments based on authority whereas others make the logical point that effect size measures give us important interpretive information above and beyond that of the statistical conclusion. These arguments are detailed next.

Arguments From Authority

In the realm of authority, the American Psychological Association Board of Scientific Affairs assembled a Task Force on Statistical Inference in the late 1990s that was given the charge to “elucidate some of the controversial issues surrounding applications of statistics . . .” (Wilkinson & the APA Task Force on Statistical Inference, 1999, p. 594). The Task Force made a variety of recommendations concerning the reporting of statistical methods in psychology journals, which included (but were not limited to): (a) use minimally sufficient analyses (i.e., keep analyses as simple and understandable as possible), (b) report effect size estimates for primary outcomes or whenever a p value is reported, (c) report confidence intervals around observed results and effect size estimates in place of post hoc power analyses, and (d) give reasonable assurances that assumptions were met by the data for the statistical tests used (Wilkinson & the APA Task Force on Statistical Inference, 1999).

Correspondence to: Martin A. Volker, 409 Baldy Hall, University at Buffalo, SUNY, Buffalo, NY 14260–1000. E-mail: [email protected]

Psychology in the Schools, Vol. 43(6), 2006 © 2006 Wiley Periodicals, Inc. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/pits.20176


The recommendations of the Task Force were in part aimed at influencing the content of the Publication Manual of the American Psychological Association, fifth edition (American Psychological Association, 2001). For a variety of practical reasons, not all of the recommendations of the Task Force were included in the Publication Manual (for details of the controversy surrounding this issue, see Fidler, 2002). The Publication Manual guidelines require the reporting of major descriptive statistics (i.e., means, SDs, n sizes, etc.) for major study variables, complete reporting of statistical test results (e.g., value of the test statistic, dfs, exact p value, etc.), and reporting of estimates of effect size (except in unusual circumstances). The Publication Manual also “strongly recommends” the use of confidence intervals, but falls short of requiring them at this time (p. 22). Thus, there is general agreement across authorities on the need to report effect size estimates. Confidence intervals are clearly useful and are recommended, but fall outside the scope of the present article.

Logical Arguments

Logical arguments for the reporting of effect size estimates focus on the known limitations of dichotomous decision making in statistical inference and the prevalence of p-value overinterpretation. In traditional null hypothesis significance testing, one starts with a null hypothesis (typically that the difference between two population means is zero) and an alternative hypothesis (typically that there is a nonzero difference between two population means). The statistical test then produces a conditional probability (i.e., p value), which is essentially the probability that a mean difference at least this extreme would occur by chance assuming random sampling from a population where the null hypothesis is true. One assumes that the null hypothesis is true, unless a result sufficiently extreme occurs (e.g., a result likely to occur 5% of the time or less by chance given a true null). When this type of extreme result occurs, one makes the decision to reject the null hypothesis and accept the alternative hypothesis. At the statistical level, no information about the actual size of the effect is included in the decision. One either fails to reject the null hypothesis and assumes that there is no difference, or one rejects the null hypothesis and concludes that a nonzero difference exists. A nonzero difference is a start, but not particularly informative for those of us interested in whether the statistically significant result is actually meaningful, substantive, or clinically significant (for discussions of clinical significance, see Jacobson & Revenstorf, 1988; Jacobson & Truax, 1991; Kazdin, 1977).

The clearly limited nature of this type of dichotomous statistical conclusion is often forgotten or popularly, and mistakenly, believed to be offset by illusory characteristics of the p value. By illusory characteristics, I mean that researchers often appear to read more meaning into p values than is actually there. For example, the p value is often discussed as if it is an index of effect size or meaningfulness of the result. It is still not unusual to read in an article something like “the result was highly significant (p < .0001),” which appears to suggest a large or meaningful effect size; however, the fact that a large sample size can lead to a small p value for even a minuscule effect size makes the p value useless for this purpose. Other misinterpretations of the p value include assuming that it is actually the real probability that the null hypothesis is true, when it is really a conditional probability indicative of the probability of a result at least as extreme as the obtained difference assuming that the null hypothesis is true (for a thorough review of various fallacies in the interpretation of p values, see Kline, 2004, chapter 3).

Meta-analysis. Meta-analysis involves a group of procedures used to aggregate the quantitative results of similar research studies within a given topic area (e.g., psychotherapy outcomes: Smith & Glass, 1977; school-based interventions for ADHD: DuPaul & Eckert, 1997, etc.). The aggregated results are typically in the form of standardized effect size estimates.

To understand the value of meta-analytic procedures, consider the following. It is possible that in some cases, a number of studies failed to achieve statistical significance because of a lack of statistical power yet yielded similar small to medium effect size estimates. A meta-analysis could reveal the effect replication across the studies and test the significance of the (weighted) averaged effect size across studies with greater statistical power. Thus, meta-analysis can assist the field in facilitating and assessing the replication of effects and help it to compensate for the problem of small sample sizes within individual studies. Meta-analysis also allows for the assessment of potential moderator variables. These moderator variables are factors (e.g., particular sample or design characteristics) that lead to differences in effect size estimates across studies. These moderators may be design artifacts in some cases or suggest potentially important interactions in others.

Traditionally, most researchers in the social sciences have not included effect size estimates in their research articles. This situation has required meta-analysts to estimate effect sizes from many studies using nonoptimal estimation methods. For example, there may be known relationships between various obtained inferential statistics (e.g., t, F, chi-square, etc.) and some measures of standardized effect size; however, this estimation introduces more error into the estimates than would occur if they were calculated directly from the raw data. To alleviate this problem, two recommendations clearly follow (which are among those given by Wilkinson & the APA Task Force on Statistical Inference, 1999, and the fifth edition of the APA Publication Manual, 2001):

• All researchers should report at least one standardized measure of effect size for each major statistical test.

• As a general rule, researchers should report basic descriptive statistics (e.g., mean and standard deviation) for all measured variables in the study. This helps readers to interpret statistical results more readily by being able to go back and examine basic descriptives. It also supplies potential meta-analysts with the building blocks for a variety of effect size measures. Thus, if the type of effect size reported by the researcher is not useful for a particular meta-analytic procedure, the availability of the study descriptives makes the meta-analysts' job of effect size calculation much easier and more accurate.

Power analysis. Statistical power refers to the sensitivity of a study. In other words, given a hypothesized effect size, how likely is the study to detect that effect? Another way of thinking about power is that it is the probability of rejecting the null hypothesis for a given effect size. Statistical power is a very important, but all too often ignored, concept in the behavioral sciences. Studies within psychology are notoriously underpowered (see Sedlmeier & Gigerenzer, 1989).

In the vast majority of cases, behavioral researchers concern themselves with avoiding a Type I or alpha error (i.e., rejecting a null hypothesis when the null is true). Assuming the null hypothesis is true, the probability of a Type I error for any given comparison is the alpha level, which is most frequently .05. When multiple comparisons are made, alpha corrections (e.g., the Bonferroni correction) are frequently implemented to guard against increasing the likelihood of a Type I error; however, many researchers are unaware that the use of such alpha corrections actually reduces statistical power and increases the likelihood of a Type II or beta error (i.e., failing to reject the null hypothesis when the null is false). Now, I am not suggesting that one should not use alpha corrections, but I would suggest that researchers need to take into account which type of error is more problematic for any given study. This is an important value judgment that is highly dependent on the larger context of the study.


Consider that in preliminary studies of potential vaccines for a life-threatening disease, one may be more concerned about Type II errors because one may be testing several options with limited resources (e.g., small sample sizes). If one fails to reject the null when the null is false, that vaccine option may not be tested further in better designed and more powerful studies. Thus, we might never know that it was actually a useful vaccine. It would simply be ignored. Now what if this preliminary study generated false positives (i.e., ineffective vaccines selected by mistake because of Type I errors)? The answer is that effects need to be replicated in better and more powerful studies before major decisions are made. Consider that when one is looking for a potentially important vaccine, initial studies should be as sensitive as possible to avoid missing potential candidates from among the various options. In this situation, a Type II error would be considered worse than a Type I error; however, one then needs to take those vaccine candidates detected in the preliminary study and test them further in a larger and better designed study. This second study would be concerned more with Type I errors and wield sufficient statistical power to balance the issue of Type I versus Type II. In the end, systematic replications of initial findings would go a long way toward giving our field the ability to assess which initial effects were in error and which were real.

Statistical power is a function of four basic factors: the alpha level, whether one uses a one- or a two-tailed test, sample size, and effect size. In general, higher alpha levels, use of one-tailed tests, larger sample sizes, and larger effect sizes lead to greater statistical power. In most cases, researchers would not increase power by inflating the alpha level, due to the resulting increase in potential Type I errors. The use of one- or two-tailed tests really depends on which statistical test is used and whether one is in a position to make a directional hypothesis. Thus, researchers are typically most concerned about the relationship between sample size and effect size in determining statistical power. Larger anticipated effects require smaller sample sizes to achieve the same level of statistical power. Conversely, smaller anticipated effects require considerably larger sample sizes to achieve the same level of statistical power.

Power analyses are typically done under two circumstances: when planning a study (a priori) or after completing a study (post hoc). For our purposes, we will assume that researchers are aiming to generate studies that achieve a power level of .80 in accordance with Cohen's (1988) standard. In the a priori case, the researcher must estimate the anticipated effect size to help determine the sample size required for the study. The effect size can be estimated in a number of ways. The researcher should, whenever possible, use previous research on the same topic to estimate effect sizes. Thus, meta-analyses in one's research area would be an excellent source. In the absence of an available meta-analysis, effect size estimates taken from other completed studies similar to that of the researcher would be the next best source. In the absence of these sources, researchers may set a hypothesized effect size at the lowest possible level at which it would still be considered meaningful. Thus, one can design the study to detect a minimally meaningful effect size at .80 power. The important point here is that effect size estimates need to be available from all studies to facilitate meta-analyses and power analyses in the planning of future studies.
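To make the a priori case concrete, the following Python sketch (my illustration, not part of the original article) estimates the per-group sample size needed to reach .80 power for a two-tailed independent-samples t test at alpha = .05, given a hypothesized d; the function names are invented, and the calculation relies on the noncentral t distribution available in scipy.

```python
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed independent-samples t test."""
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)      # two-tailed critical value
    # Probability of exceeding the critical value under the noncentral t
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

def n_needed(d, alpha=0.05, target_power=0.80):
    """Smallest per-group n reaching the target power for effect size d."""
    n = 2
    while power_two_sample_t(d, n, alpha) < target_power:
        n += 1
    return n

print(n_needed(0.50))   # roughly 64 per group for a medium-sized effect
```

The same loop makes the trade-off described above visible: halving the hypothesized effect size roughly quadruples the required sample size.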

Post hoc power analyses involve estimating the effect size based on the obtained study data, taking into account the sample size, alpha level, and one- or two-tailed nature of the statistical test, and deriving the power that the study had to detect that effect size. If a potentially meaningful effect size is found yet the study failed to achieve statistical significance and had low statistical power, the study results are basically inconclusive. This type of result suggests that another study needs to be done with greater statistical power to guard against the possibility of a Type II error. As previously noted, Wilkinson and the APA Task Force on Statistical Inference (1999) recommended using confidence intervals around obtained effects in place of post hoc power analyses. In the end, researchers need to report effect size estimates among their results regardless of whether post hoc power analyses or confidence intervals are employed. The reporting of effect size estimates in completed studies will make more accurate effect size estimates available to those conducting meta-analyses and facilitate the use of power analyses by those planning similar studies.

Suggested Effect Size Estimates for Various Designs

The remainder of this article covers the calculation of prominent effect size measures for popular research designs and suggested models for reporting them in articles. Whenever possible, the specific formulas reported here for effect size calculation were chosen because of their ease of use. In most cases, the formulas allow the effect size estimates to be calculated by simply plugging in numbers provided by most popular computer statistical programs (e.g., SPSS and SAS). The effect size formulas themselves are given in this section of the article while research examples in which to report these effect size estimates, with sufficient information to calculate them, are supplied in the section that follows it.

Rosenthal (1994) divided effect size measures into two basic families, the d family and the r family. The d family effect size estimates are generally concerned with variations on standardized differences between means whereas the r family effect size estimates are expressed in terms of the metric of the correlation coefficient or its squared equivalent (r²). This basic division will be used to classify the effect size estimates that follow. When available, examples of both types will be given for appropriate analyses (i.e., t test and one-way ANOVA).

Comparison of Two Means: The t Test

The difference between two means is most popularly characterized by variations on Cohen's (1988) effect size d. This effect size in its purest form is simply:

d = \frac{\mu_1 - \mu_2}{\sigma} \qquad (1)

In words, d is the difference between two population means divided by the population SD. The problem for researchers, of course, is that they rarely have access to the true values of the population parameters to insert into such an equation. Thus, they must obtain population estimates based on their sample statistics.

In the case of a comparison of sample means from two independent groups, the experimental and control group means replace the population means in the d equation; however, the appropriate estimate of the population SD requires some decision making. Under the assumption of homogeneity of variance (i.e., the obtained SDs of the two groups being compared are similar), the SDs of the two groups are considered to be estimates of the same population SD. Given that estimates based on a larger number of cases tend to be more accurate, it makes sense to pool the SDs of the two groups. Thus, the equation for d becomes:

d = \frac{\bar{X}_1 - \bar{X}_2}{S_{pooled}} \qquad (2)

where S_pooled can be calculated using the formula:

S_{pooled} = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}} \qquad (3)

Here, n is the number of subjects in each group, while S² is the variance or squared SD for each group. Note that this variant on the d statistic with a pooled SD is frequently referred to as Hedges' g (Hedges & Olkin, 1985).
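To make Equations 2 and 3 concrete, here is a brief Python sketch (mine, not the article's) that computes d from the summary statistics a standard t-test printout provides; the function name and the input values are arbitrary illustrations.

```python
from math import sqrt

def cohens_d_pooled(mean1, mean2, sd1, sd2, n1, n2):
    """Effect size d standardized by the pooled SD (Equations 2 and 3)."""
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Arbitrary example: experimental mean 105, control mean 100, SDs near 15, n = 30 per group
print(round(cohens_d_pooled(105.0, 100.0, 15.2, 14.8, 30, 30), 2))  # about 0.33
```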


In cases where the homogeneity of variance assumption is violated, the pooling of sample SDs is not appropriate, given that the two group SDs may reflect two different population-level SDs. When this happens, it usually makes sense for the researcher to insert the SD of the control group into the equation. The logic behind this solution is that the control group SD is theoretically untainted by the effects of the independent variable and reflects a more pure population estimate. This makes the formula for d:

d = \frac{\bar{X}_1 - \bar{X}_2}{S_c} \qquad (4)

Note that here the denominator is simply the SD of the control group. This variant of the d formula is sometimes called the sample variant of Glass's delta or Δ (Glass, McGaw, & Smith, 1981).

Group means and SDs are readily available in the output of all major statistical packages. Thus, d is typically easy to compute. The same d equations can be used for calculating effect size estimates for the t test for independent groups or for the dependent-samples t test. The same homogeneity of variance issue must be addressed in the dependent-samples case. Thus, use the pooled SD of both conditions if the assumption holds or use the pretest or control-group condition SD if the assumption is violated.

The calculation of effect sizes for dependent-samples t tests, or any within-subjects comparison design, requires further discussion. The use of the same effect size formulas for both independent-samples and dependent-samples tests assumes that the researcher is calculating effect size estimates for descriptive purposes. This is usually what a researcher is doing when reporting effect sizes in the results section of an empirical article. However, when a researcher is calculating effect size estimates for use in power analysis, the situation becomes slightly more complex. In this case, the effect size calculation procedure remains the same for independent samples whether the context is purely descriptive or is aimed at estimating statistical power. In the case of dependent-samples comparisons, power analysis requires taking into account the reduced error variance in the denominator of the statistical test resulting from the use of correlated or dependent groups. These dependent-groups comparisons are generally more powerful because of this reduced error variance in the denominator. This issue needs to be taken into account in power estimations by calculating a modified d estimate, which can be done in either of two ways. The first involves calculating the estimate of effect size d using the appropriate formula above and then dividing this d by the square root of 1 − r [where r = the correlation coefficient reflecting the dependency in the scores (e.g., the correlation between paired pre- and posttest scores, the correlation between the scores of matched pairs, etc.)]. The second involves taking the mean of the difference between the paired scores for the sample and dividing the result by the SD of the differences between the paired scores for the sample. The d value derived using the standard deviation of the difference scores as a standardizer must then be multiplied by √2. The numbers required to calculate the modified d value using either method are available on the dependent-samples t test output from most statistical software. Once calculated, this larger, modified d value is used to calculate statistical power for dependent or within-subjects comparisons (for more details, see Cohen, 1988, pp. 48–52).
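As a small illustration (not from the article), the sketch below implements both routes to the modified d for dependent-samples power estimation described above; the function names and the example values are hypothetical.

```python
from math import sqrt

def modified_d_from_r(d, r):
    """Route 1: divide the descriptive d by the square root of (1 - r)."""
    return d / sqrt(1 - r)

def modified_d_from_diffs(mean_diff, sd_diff):
    """Route 2: standardize by the SD of the difference scores, then multiply by sqrt(2)."""
    return (mean_diff / sd_diff) * sqrt(2)

# Example: descriptive d = .50 with a pre-post correlation of r = .60
print(round(modified_d_from_r(0.50, 0.60), 2))  # about 0.79, larger than the descriptive d
```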

Cohen (1988) suggested general guidelines for interpreting effect size d when more context-specific standards are not available. He considered d = .20 to be a small effect, d = .50 a medium effect, and d = .80 a large effect. Note that d is a difference between means in SD units. Thus, d = .50 indicates that the two means are .50 SDs apart. Assuming two means on a Deviation IQ metric with an SD of 15, a d = .50 indicates that the two means are 7.5 IQ points apart (for a general discussion of guidelines and cautions in effect size interpretation, see Kline, 2004, pp. 132–136; for a discussion of context-specific effect size standards within neuropsychological research, see Zakzanis, 2001).

As with most sample estimates of population parameters, the estimate of effect size d tends to be biased high (Hedges, 1981). Correction formulas are available for this upward bias, which yield d_adjusted (see Grissom & Kim, 2005; Hedges, 1981; Rosenthal & Rubin, 1982). However, the researcher should keep several things in mind when considering whether to report d_adjusted. First, the unadjusted d is the true descriptive effect size d for the sample used in the study (see a related discussion regarding R² in Maxwell & Delaney, 2004). To say that d is biased high means that the sample d tends to overestimate the population value of d. Thus, adjusting for this bias in the sample d should only be considered when one is using the sample d to estimate the true population effect size d. Second, there are random sampling simulation studies available that suggest that any bias in effect size d is small (e.g., Roberts & Henson, 2002) and generally negligible when the sample size is greater than 20 (see Hunter & Schmidt, 2004). Third, as Hunter and Schmidt (2004) noted, the upward bias is typically more than offset by the attenuation of d resulting from the imperfect reliability of the dependent measure. Thus, correcting the effect size d for upward bias is not recommended in most cases. Given these issues, the following guidelines are recommended: (a) Report effect size d without the adjustment in most studies with reasonable sample sizes, and (b) studies involving smaller samples (i.e., n ≤ 20 participants) should report both the unadjusted d and d_adjusted. Such small sample studies should generally be quite rare in most major research areas.

The r family equivalent of effect size d. The r family equivalent of effect size d is the point biserial correlation (r_pb), which also is the equivalent of Pearson's r when one variable is dichotomous and the other is at least interval level. There are several ways to calculate this effect size. First, one can simply enter the independent and dependent variables used in the t test directly into the dialog box for Pearson's r in SPSS or in the command syntax for SAS. In this case, the Pearson's r output will be the point biserial correlation. Second, one can calculate it in SPSS by entering the independent and dependent variables into a cross-tabulation and selecting the eta coefficient under the statistics submenu. In this case, the eta output will be the point biserial correlation. Finally, when the sample sizes are equal for both conditions of the independent variable, one can simply convert the effect size d directly into the point biserial correlation through the formula given by Cohen (1988):

r_{pb} = \frac{d}{\sqrt{d^2 + 4}} \qquad (5)

where r_pb = the point biserial correlation and d = effect size d.

When the sample sizes are unequal across the two conditions in the comparison, the value of r_pb can become attenuated. All other things remaining equal, this attenuation becomes more severe as the discrepancy in sample sizes gets larger (Kline, 2004). Under conditions of unequal n, a corrected version of r_pb can be calculated through the formula:

r_{pb\,corrected} = \frac{a \, r_{pb}}{\sqrt{(a^2 - 1)\, r_{pb}^2 + 1}} \qquad (6)

(adapted from Hunter & Schmidt, 2004) where r_pb corrected = the corrected point biserial correlation, r_pb = the point biserial correlation coefficient, a = √(.25/pq), p = the proportion of the total sample in Group 1, and q = the proportion of the total sample in Group 2.
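The following Python sketch (my addition) applies Equations 5 and 6; the helper names are invented, and the term a = √(.25/pq) follows the definition given above.

```python
from math import sqrt

def rpb_from_d(d):
    """Equation 5: point biserial correlation from effect size d (equal n per group)."""
    return d / sqrt(d**2 + 4)

def rpb_unequal_n_correction(rpb, p):
    """Equation 6: correct r_pb for an unequal split, where p is Group 1's proportion."""
    q = 1 - p
    a = sqrt(0.25 / (p * q))
    return (a * rpb) / sqrt((a**2 - 1) * rpb**2 + 1)

print(round(rpb_from_d(0.50), 3))                      # about .243, matching Cohen's table
print(round(rpb_unequal_n_correction(0.20, 0.80), 3))  # corrected value for an 80/20 split
```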

Given that the point biserial correlation is a true correlation coefficient, one can square it to get the percentage of variance in the dependent variable accounted for by the independent variable. Thus, r_pb² also is an effect size measure. According to Cohen (1988, p. 22), a small effect size d of .20 is equivalent to an effect size r_pb of .10 (or r_pb² = .01), a medium effect size d of .50 is equivalent to an effect size r_pb of .243 (or r_pb² = .059), and a large effect size d of .80 is equivalent to an effect size r_pb of .371 (or r_pb² = .138).

As an estimate of the population parameter for the point biserial correlation, the sample r_pb is biased slightly low (Hedges & Olkin, 1985). This bias is most pronounced when the sample size is small and the population value of r_pb is in the center of the correlation scale (i.e., r_pb = ±.50). Formulas are available to correct for this slight downward bias (see Grissom & Kim, 2005; Hunter & Schmidt, 2004; Olkin, 1967); however, this bias is generally negligible, except in the case of very small samples (Hunter & Schmidt, 2004). Thus, as is the case with effect size d, the uncorrected version of r_pb should typically be reported. A correction for this attenuation should be considered only when the sample size is 20 or fewer cases.

Comparison of Three or More Means From One Factor: One-Way ANOVA

The d family effect size for one-way ANOVA. A number of useful effect size estimates are available for one-way ANOVA; however, one must clearly distinguish between effect size measures useful for describing the overall effect versus those useful for characterizing contrast effects. At the level of the overall F test, Cohen's (1988) f_estimated statistic is a useful characterization of the variation among sample means relative to the variation within groups. (This effect size is a member of the d family, as the SD of the sample means in the numerator of f conveys a difference between means characteristic of this family of effect size estimates.) I use the term estimated in relation to f because the sample statistic is being used to estimate the population parameter f. The f_estimated equation is:

f_{estimated} = \frac{S_{\bar{X}}}{\sqrt{MS_{within}}} \qquad (7)

(adapted from Cohen, 1988, and Grissom & Kim, 2005) where S_X̄ = the SD of the sample means around the grand mean and √MS_within = the square root of the Mean Squares Within.

For equal sample sizes, one calculates S_X̄ by using the group means as data points in the sample SD formula. This change renders the formula:

S_{\bar{X}} = \sqrt{\frac{\sum (\bar{X}_i - \bar{X}_{grand})^2}{k - 1}} \qquad (8)

where X̄_i = the mean of each group, X̄_grand = the grand mean or mean of the individual group means, and k = the number of means, groups, or conditions being compared.

The basic Mean Squares Within formula is:

MS_{within} = \frac{SS_{within}}{df_{within}} \qquad (9)

In the case of unequal n, the formula for f reported earlier requires modifications to allow for the means and SDs to be weighted in proportion to the n sizes of the various groups. The interested reader should consult Cohen (1988, pp. 359–362) for a description and explanation of the modified formula.

Unfortunately, sample means will generally vary more than population means. As a result, effect size f values estimated from samples tend to be biased high. To correct for this, the following formula is offered for f_adjusted:

f_{adjusted} = \sqrt{\frac{(k - 1)(F - 1)}{N}} \qquad (10)

(Maxwell & Delaney, 2004) where k = the number of means, groups, or conditions being compared, N = the total number of subjects in the experiment, and F = the obtained F test value from the statistical test of the one-way ANOVA overall effect.

Given the downward adjustment of effect size f to f_adjusted, it is possible in some cases that f_adjusted will equal less than zero. This will occur when the obtained F value itself is less than 1.0. In this circumstance, the negative value for f_adjusted should be interpreted as zero (Maxwell & Delaney, 2004). In general, it is probably not necessary to even report a value for f_adjusted when the obtained F value is less than 1.0, as the negligible unadjusted effect size f itself should suffice.
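As an illustration (mine, not the article's), the Python sketch below applies Equation 10 to the kind of omnibus ANOVA summary most packages print; the numbers used are those of the hypothetical example reported later in Table 2.

```python
from math import sqrt

def f_adjusted(F, k, N):
    """Equation 10: bias-adjusted Cohen's f from the omnibus F, k groups, and total N."""
    value = (k - 1) * (F - 1) / N
    return sqrt(value) if value > 0 else 0.0  # negative values are treated as zero

# Hypothetical example: F(2, 57) = 6.91 with k = 3 groups and N = 60
print(round(f_adjusted(6.91, 3, 60), 2))  # about .44
```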

Basic r family effect size measure for one-way ANOVA. From the association or r family perspective, effect size estimates η² and ω² are useful measures of the overall effect in a one-way design. In the population, η² can be characterized as the proportion of total variance in the dependent variable that is attributable to the independent variable. As usual, researchers rarely have access to the population values for the sum of squares between and sum of squares total. Thus, they must estimate these values from the available sample statistics. The general formula for η² is:

\eta^2 = \frac{SS_{between}}{SS_{total}} \qquad (11)

where SS_between = Sum of Squares Between Groups and SS_total = Sum of Squares Total. Note that η² also is sometimes called the correlation ratio or R², depending on the context (Hays, 1994).

The sum of squares between groups and sum of squares total are readily available on SPSS and SAS printouts. Thus, η² is easy to compute by just plugging these numbers into the aforementioned equation; however, when calculated based on sample statistics, η² is biased high as an estimate of the true population value.

The ω² statistic is an adjusted version of η². The adjustment in both the numerator and the denominator of the equation is meant to correct for this upward bias. The formula for ω² is:

\omega^2 = \frac{SS_{between} - (k - 1)MS_{within}}{SS_{total} + MS_{within}} \qquad (12)

(adapted from Hays, 1994) where SS_between = Sum of Squares Between Groups, SS_total = Sum of Squares Total, MS_within = Sum of Squares Within divided by the Degrees of Freedom Within, and k = the number of means, groups, or conditions being compared.

Given the downward adjustment for ω², it is possible that sometimes the formula may yield a negative value. This creates interpretive problems for a measure of variance accounted for. Typically, negative values of ω² are either set to zero or interpreted as equivalent to zero (Olejnik & Algina, 2000).

Adjusted values of f (i.e., f_adjusted) and η² (i.e., ω²) are expected to be lower than their biased counterparts. In smaller samples, the discrepancy between the two types of estimates can be considerable. In general, both types of estimates should be reported to give a sense of the potential estimation bias.
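A short Python sketch (not from the article) of Equations 11 and 12 computed from the sums of squares an ANOVA table provides; the input values are roughly those implied by the hypothetical descriptives in Table 1 and are used only for illustration.

```python
def eta_squared(ss_between, ss_total):
    """Equation 11: proportion of total variance attributable to the factor."""
    return ss_between / ss_total

def omega_squared(ss_between, ss_total, ms_within, k):
    """Equation 12: bias-adjusted estimate; negative results are usually reported as zero."""
    value = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
    return max(value, 0.0)

# Approximate sums of squares behind the hypothetical one-way ANOVA (k = 3, N = 60)
print(round(eta_squared(525.6, 2692.6), 2))              # about .20
print(round(omega_squared(525.6, 2692.6, 38.0, 3), 2))   # about .16
```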

Assumptions of equal n and homogeneity of variance are important to consider when calculating ω² and its variants. Carroll and Nordholm (1975; as cited in Olejnik & Algina, 2000) reported that in the case of one-way ANOVA, ω² was robust to the violation of the equal n assumption when homogeneity of variance held and robust to the violation of homogeneity of variance under conditions of equal n; however, the violation of both assumptions leads to effect size overestimation.

In terms of interpreting what constitutes a small, medium, or large effect for f, f_adjusted, η², or ω², one needs to take into account the context of one's own research. In short, what constitutes a small, medium, or large effect size can vary relative to the research area; however, Cohen (1988) gave general interpretive guidelines to use in the absence of domain-specific standards. Interpretations of small = .10, medium = .25, and large = .40 were suggested for effect size f, and small = .01, medium = .06, and large = .14 for η² and ω² (Cohen, 1988, pp. 285–287). Note that, given rounding, effect size f and η² are directly translatable into each other (see Cohen, 1988, pp. 280–284). Simply stated, f is the d family equivalent of η², while η² is the r family equivalent of effect size f.

Comparisons and contrasts. The effect size reported for an omnibus F test is generally as difficult to interpret as the meaning of a significant overall F test itself. One typically needs to probe the significant overall omnibus result with planned or post hoc comparisons or contrasts to reveal the nature of the differences among the groups subsumed within the one-way design. This is typically the part of the analysis that most researchers are interested in and, thankfully, can involve reasonably simple effect size calculations.

Variations on effect size d are typically preferred to proportion of variance effect size estimates, whether one is evaluating a simple comparison of two group means or a more complex contrast comparing one group against the average of two others. In the case of a simple comparison of two groups, one starts by putting the means of the two groups in question into the numerator of the effect size d equation. The question of what SD estimate to put in the denominator of the equation must then be answered. There are three options here: (a) the pooled SD of all groups included in the one-way ANOVA, (b) the pooled SD of only the two groups included in the comparison, or (c) the SD of only one of the groups considered to be the standard or control group. Assuming homogeneity of variance and that estimates based on larger samples tend to be more accurate, Option A would seem most appropriate when homogeneity of variance is not violated. When homogeneity of variance cannot be assumed for the whole set of groups in the design, it might still be assumed for the two groups involved in the comparison. If homogeneity of variance is not violated for the two groups in question, then Option B is viable. If homogeneity of variance is violated for the two groups in question, then Option C would appear to be the logical choice.

Note that these are merely general-purpose guidelines. If a researcher can make a reasonable argument for standardizing the comparison in a particular way within a particular research context, this is perfectly appropriate. No matter what method is chosen, the researcher must describe the method of standardization used and clarify situations where the use of different SDs across different comparisons may have significantly impacted the magnitude of the effect size estimates.

In the case of a more complex contrast, one inserts the appropriate mean estimates (e.g., the mean of Group 1 and the weighted average of the means of Groups 2 and 3) into the numerator of the effect size d equation. One then needs to consider the question of homogeneity of variance. Again, if homogeneity of variance is not violated across the design, then one can use the pooled SD for the complete set of groups. If homogeneity of variance is violated in the larger design but intact within the three or more groups involved in the contrast, then one can use the pooled SD from among those three or more groups. If homogeneity of variance is violated among the groups involved in the contrast, Option C becomes the logical choice. Note that when pooling SDs for contrasts, it is not necessary to weight them using contrast weights so long as homogeneity of variance is not violated (and we are not pooling SDs when homogeneity of variance is violated). Thus, in terms of effect size estimates for contrasts, the contrast weights will apply only to the means in the numerator of the d equation.

For those who prefer to express comparison and contrast effect size estimates in terms of proportions of total variance, formulas for η² and ω² versions are readily available. The comparison/contrast formula for η² is:

\eta^2_{comparison} = \frac{SS_{comparison}}{SS_{total}} \qquad (13)

where

SS_{comparison} = \frac{(\bar{X}_1 - \bar{X}_2)^2}{\frac{1}{n_1} + \frac{1}{n_2}} \qquad (14)

(adapted from Grissom & Kim, 2005), and SS_total = the Sum of Squares Total, which is readily available on SPSS and SAS printouts.

The comparison formula for ω² is simply:

\omega^2_{comparison} = \frac{SS_{comparison} - MS_{within}}{SS_{total} + MS_{within}} \qquad (15)

(Grissom & Kim, 2005) where all terms in this equation have already been defined in previous equations.

When one needs to calculate the η² or ω² for a more complex contrast, the Sum of Squares Comparison in the aforementioned equations can be replaced with the Sum of Squares Contrast. The formula for the Sum of Squares Contrast (SS_contrast), adapted from Olejnik and Algina (2000), is:

SS_{contrast} = \frac{(c_1 \bar{X}_1 + \cdots + c_j \bar{X}_j)^2}{\frac{c_1^2}{n_1} + \cdots + \frac{c_j^2}{n_j}} \qquad (16)

where c_j = the contrast coefficient for the jth condition in the contrast, X̄_j = the mean of the jth condition in the contrast, and n_j = the sample size of the jth condition in the contrast.

For example, assume that three groups are involved in a complex contrast where the control condition is being compared to the average of two experimental conditions. In this case, the contrast coefficients might be −1, .5, and .5. Therefore, the mean of the control condition is multiplied by −1, the mean of Experimental Condition 1 is multiplied by .5, and the mean of Experimental Condition 2 is multiplied by .5. The results of each are then added together, and the sum is squared, to form the numerator of the Sum of Squares Contrast mentioned earlier. In the denominator, each contrast coefficient involved would be squared, divided by the sample size of the condition to which it belongs, and the results added up. Next, the Sum of Squares Contrast would be calculated and then inserted into the numerator of the η² and ω² equations mentioned earlier in place of the Sum of Squares Comparison.
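The Python sketch below (my addition, not the article's) turns Equation 16 and the worked description above into code; the group summaries, SS_total, and MS_within values are invented for illustration.

```python
def ss_contrast(weights, means, ns):
    """Equation 16: sum of squares for a contrast defined by a set of weights."""
    numerator = sum(c * m for c, m in zip(weights, means)) ** 2
    denominator = sum(c**2 / n for c, n in zip(weights, ns))
    return numerator / denominator

# Hypothetical three-group example: control vs. the average of two treatments
weights = [-1, 0.5, 0.5]
means = [50.0, 56.0, 54.0]
ns = [25, 25, 25]

ss_con = ss_contrast(weights, means, ns)
ss_total, ms_within = 4000.0, 40.0   # assumed values read off an ANOVA table
eta_sq_contrast = ss_con / ss_total                                # Equation 13 analogue
omega_sq_contrast = (ss_con - ms_within) / (ss_total + ms_within)  # Equation 15 analogue
print(round(ss_con, 1), round(eta_sq_contrast, 2), round(omega_sq_contrast, 2))
```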

Unfortunately, space does not permit coverage of effect size calculations for more complex factorial ANOVA designs. The interested reader is referred to Grissom and Kim (2005), Kline (2004), and Olejnik and Algina (2000) for discussions of effect size reporting in more complex cases.

Association With Continuous Variables: Correlation Coefficients

The correlation coefficient, Pearson's r, and its variants are actual effect size estimates in and of themselves. As measures of effect size, they can be expressed on the standard r metric or the squared variance-accounted-for metric. Both expressions are fine. However, note that most statistics textbooks traditionally consider the squared version of r (i.e., r² = the coefficient of determination) to be what is interpreted. Cohen (1988) preferred to use the standard r instead simply because it yielded a larger (though directly equivalent) number compared to the squared version. Cohen feared that if researchers focused on the coefficient of determination as their measure of effect size, they would be tempted to prematurely conclude that the frequent smaller effects uncovered in the social sciences were not meaningful. A second point of concern regarding r² is that it is nondirectional. The clearer directional nature of the unsquared correlation coefficient renders it better suited for use in meta-analyses (Hunter & Schmidt, 2004). In any event, the preference for the standard r or the r² is purely cosmetic, as long as the direction of the effect is made clear. Both are appropriate.

According to Cohen (1988), the standards for interpreting the correlation coefficient effect size are small r = .10 (small r² = .01), medium r = .30 (medium r² = .09), and large r = .50 (large r² = .25); however, these effect size standards were merely meant to be guidelines when more context-specific guidelines were not available from the researcher's particular research area. The assumption is that most research areas should ultimately develop their own standards for what are considered small, medium, and large effects.

As with the point biserial correlation coefficient, Pearson's r is biased slightly low as an estimate of the true population correlation value. The same discussion of how to deal with the slight bias in r_pb also applies to the standard r. Thus, the sample estimate of Pearson's r should not typically be corrected; however, a correction may be considered if the sample size is 20 or fewer cases.

Prediction and Explanation with Continuous Variables: Simple and Multiple Regression

For simple regression, with one predictor and one criterion variable, the zero order correlation (or its squared version) is the standardized effect size of choice. In this case, the researcher should report the raw score regression equation as well as the zero order correlation (r) and coefficient of determination (r²). The availability of both raw score information and standardized effect size estimates aids the reader in the interpretation. Relatively small standardized changes in a useful predictor variable may translate into meaningful changes on the raw score metric. However, the reader needs both types of information available to see this possibility.

The simple case of zero order correlation effect size extends readily into multiple regression. For the overall multiple regression equation, the multiple R can be interpreted as the correlation between the predictor variables and the criterion variable while the multiple R² can be interpreted as the amount of variance in the criterion variable accounted for by the collection of predictor variables. Most statistical packages also calculate adjusted R². This is the outcome of a formula that adjusts R² for the upward bias (relative to the true population value) related to capitalizing on chance variation in the current sample. Other things being equal, the bias in multiple regression R² is greater than that in simple regression r² because of the use of multiple predictors. The discrepancy between R² and its adjusted version becomes greater as the sample size gets smaller and the number of predictors increases (Keith, 2006). Though their values tend to converge (given rounding) in larger samples, both R² and adjusted R² should typically be reported regardless of the sample size involved.
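For readers who want to check the adjustment by hand, the sketch below (not from the article) uses the common Wherry-type formula for adjusted R²; assuming this is the formula a given package reports is a simplification, since software can differ.

```python
def adjusted_r_squared(r_squared, n, k):
    """Shrink R^2 for chance capitalization; n = sample size, k = number of predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# The shrinkage grows as n shrinks and the number of predictors grows
print(round(adjusted_r_squared(0.30, 200, 4), 3))  # large sample: stays close to .30
print(round(adjusted_r_squared(0.30, 30, 4), 3))   # small sample: noticeably lower
```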

When reporting multiple regression results (whether for purposes of prediction or explanation), both the raw score equation and the standardized equation should be reported. Having the equation on both metrics aids in the interpretation of the weights. The equations can be reported as they are or their components placed in a table. The raw b or B weights and standardized beta (β) weights should be considered very context specific. As a general rule, these can be compared only to virtually identical samples using the exact same set of predictor variables and the same criterion variable. A large n size also is desirable to stabilize the estimates from a random sampling perspective. Outside of this set of circumstances, b and β weight values can vary considerably from each other for spurious reasons (e.g., different populations, use of a different set of predictor variables, small sample sizes, etc.).

Nominal Data: Chi-Square

Chi-square is a nonparametric statistical test used to test for potential differences between categories or relationships among categorical variables (e.g., gender, political affiliation, ethnicity, occupational status, etc.). In dealing with effect size measures for chi-square, I am restricting coverage to the case of what is often called a naturalistic study (Grissom & Kim, 2005, p. 172). In a naturalistic study, membership in the various categories varies naturally and not by random assignment (e.g., political affiliation as in Democrat vs. Republican, gender as in male vs. female, etc.). The researcher decides only on the total number of participants in the study sample. The actual number of participants in the various categories is allowed to vary naturally. This typically is done in survey designs, where the researcher is looking for correlations between categorical variables and is not manipulating an independent variable. Naturalistic studies are the most popular designs for which chi-square is used. Those interested in effect size measures for other design types (e.g., experimental with random assignment, prospective, or retrospective) are referred to other sources (e.g., Grissom & Kim, 2005).

The effect size w (Cohen, 1988) is probably the most general, but not necessarily the most recognizable, measure for this purpose. This effect size can be roughly placed in the r family of effect size estimates, although it is a true correlation only under certain circumstances (i.e., only in the case of a 2 × 2 contingency table). The effect size w is calculated, at the basic level, by replacing the frequency values in the chi-square formula with proportions and then taking the square root of the outcome; however, w can always be calculated by plugging the known chi-square value from the SPSS or SAS output into the formula:

w = \sqrt{\frac{\chi^2}{N}} \qquad (17)

(Cohen, 1988) where w = effect size w, χ² = the obtained chi-square value, and N = the total sample size.
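A brief Python sketch (not part of the article) that computes w from a contingency table via Equation 17; the counts are invented, and scipy's chi2_contingency supplies the chi-square value that SPSS or SAS would otherwise print.

```python
from math import sqrt
from scipy.stats import chi2_contingency

# Invented 2 x 3 table of counts (e.g., two groups crossed with three response categories)
table = [[30, 25, 45],
         [20, 35, 45]]

chi2, p, dof, expected = chi2_contingency(table)
n_total = sum(sum(row) for row in table)

w = sqrt(chi2 / n_total)   # Equation 17
print(round(w, 2), round(p, 3))
```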

In cases where two categorical variables are being compared, the calculation of w also can be obtained through the transformation of other popular statistical output. In the case of a 2 × 2 cross-tabulation, effect size w will equal the exact value of the φ coefficient. The φ coefficient is the variant of Pearson's r used to characterize the correlation between two dichotomous variables (Howell, 1987). Thus, in the case of a 2 × 2 table, one can simply report the value of the φ coefficient available from the SAS or SPSS output as either effect size φ or w.

When at least one of the variables involved in the cross-tabulation has more than two categories (i.e., when the cross-tabulation table is greater than 2 × 2), effect size w can be calculated using other popular statistical program output. For example, SPSS can be set to report the contingency coefficient (C) and Cramer's V under the statistics submenu of the cross-tabulation menu. Pearson's contingency coefficient (see Hays, 1994) can be used to characterize the relationship in contingency tables of a variety of sizes. Cramer's V [also sometimes called Cramer's φ (e.g., Cohen, 1988; Howell, 1987)] is an extension of the φ coefficient for situations where the categories on the two dimensions of the contingency table are greater than two (Cramer, 1946; Hays, 1994). Note that the contingency coefficient and Cramer's V are standardized effect size measures in their own right and can be reported as such in research reports in place of w. However, the advantage of effect size w is its greater generality in characterizing chi-square relationships.


The contingency coefficient (C) can be converted to effect size w using the formula:

w = \sqrt{\frac{C^2}{1 - C^2}} \qquad (18)

(Cohen, 1988) where w = effect size w and C² = the squared contingency coefficient.

Effect size w can be derived from Cramer's V using the formula:

w = V\sqrt{r - 1} \qquad (19)

(adapted from Cohen, 1988) where w = effect size w, V = Cramer's V, and r = the number of rows or columns (whichever is the smaller of the two).
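A minimal sketch (mine) of Equations 18 and 19 for converting the contingency coefficient or Cramer's V, as printed by SPSS or SAS, into effect size w; the example values are arbitrary, though the first echoes the guideline noted below that a C of .287 corresponds to a medium w of about .30.

```python
from math import sqrt

def w_from_contingency_coefficient(c):
    """Equation 18: convert Pearson's contingency coefficient C to effect size w."""
    return sqrt(c**2 / (1 - c**2))

def w_from_cramers_v(v, smaller_dim):
    """Equation 19: convert Cramer's V to w; smaller_dim = rows or columns, whichever is fewer."""
    return v * sqrt(smaller_dim - 1)

print(round(w_from_contingency_coefficient(0.287), 2))  # a medium C of .287 gives w near .30
print(round(w_from_cramers_v(0.20, 3), 2))              # V = .20 in a 3 x 4 table
```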

Cohen (1988) gave general guidelines for the interpretation of effect size w and related effect size measures. Again, these guidelines were meant to be used only when context-specific guidelines were unavailable in the researcher's particular research area. In general, a small effect size is .10 for both w and C, a medium effect size is .30 for w and .287 for C, and a large effect size is .50 for w and .447 for C. Effect size guidelines are more complex for Cramer's V, as the values diverge considerably from those of effect size w as the number of categories in the smaller of the row or column categories increases (for a table that clarifies the interpretation of Cramer's V, see Cohen, 1988, p. 222). However, given that Cramer's V is an extension of the φ coefficient, the values of effect size w, φ, and Cramer's V will be identical when a 2 × 2 cross-tabulation is analyzed. Thus, in the case of a 2 × 2 table, the effect size interpretation guidelines for w also apply to φ and Cramer's V. Furthermore, when one variable has more than two categories, but the other is still dichotomous (e.g., cross-tabulations of 2 × 3, 2 × 4, etc.), the value of Cramer's V will be equal to w. These situations make obtaining the value of effect size w easy, as the value already will be available on the output from most major statistical programs under a different name.

Also note that in the case of a 2 × 2 table (and only in the case of a 2 × 2 table), the squared value of the φ coefficient can be interpreted as a proportion of variance accounted for or the coefficient of determination. This is possible because the φ coefficient is the extension of Pearson's r for the correlation between two dichotomous variables.

The range of possible values for effect size w can become an issue in some cases. In all cases, a value of 0 reflects no relationship; however, the theoretical upper limit of effect size w can exceed 1.0, even considerably, when the number of categories involved is large, although in practice the value of w will rarely exceed .90. For the contingency coefficient (C), it is useful to note that a value of 0 reflects no relationship, but that its maximum value cannot reach 1.0. The φ coefficient, which can be used only with 2 × 2 tables, has a minimum value of 0 and a maximum value of 1.0. Given that w is identical to φ for a 2 × 2 table, w also will possess these characteristics in the case of a 2 × 2 table. Cramer's V, as an extension of φ, also varies from 0 to 1.0; however, it cannot be interpreted as a correlation coefficient, and the interpretation of its values as reflecting a small, medium, or large effect will vary with the number of categories in the smaller of the row or column dimensions.
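As a point of reference, these bounds follow from the maximum possible chi-square for an r × c table, χ²_max = N(k − 1), where k is the smaller of r and c (a standard result, not a formula numbered in this article):

$$ w_{\max} = \sqrt{k - 1}, \qquad C_{\max} = \sqrt{\frac{k - 1}{k}} $$

For a 2 × 2 table (k = 2), these give w_max = 1.0 and C_max ≈ .71, consistent with the limits described above.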

Examples of Effect Size Reporting With Hypothetical Data

In this section, basic examples of reporting effect size estimates in the context of other statistical results are given. These examples are not meant to be exhaustive depictions of how to report effect size estimates but are meant to offer the reader clear guidelines for what should be reported. In general, as long as the necessary information is reported and the manner of reporting is consistent with APA style, many structural variations are possible that allow for sensitivity to the particular research context. Table 1 provides descriptive statistics to be used in considering the


t test and one-way ANOVA examples that follow. Examples of a number of simple reporting variations for the t test, one-way ANOVA, and chi-square are given in Table 2. Figure 1 contains SPSS Version 13.0 output from the one-way ANOVA example that follows. Sufficient information is provided among the text, tables, and figure that follow to allow one to use the preceding formulas to calculate the effect size estimates reported next.

For the t test and F test examples, consider the results reported in Table 1. These are hypothetical outcomes for three groups of children with Asperger's Disorder who participated in a summer treatment program. Assume random assignment. The control group received a typical summer-camp experience without treatment. The SS Only group received social skills training as

Table 1
Descriptive Statistics From a Hypothetical Social Skills Treatment Study

Group      n     Mean     SD
Control    20    36.10    6.41
SS Only    20    42.50    6.06
SS + RC    20    42.25    6.02

Note. Control = No Treatment; SS Only = Social Skills Training Only; SS + RC = Social Skills Training + Response Cost.

Table 2
Variations on Effect Size Reporting Based on the Hypothetical Examples

Independent samples t test
t(38) = 3.24, p = .001 (one-tailed), d = 1.03
t(38) = 3.24, p = .001 (one-tailed), r_pb = .466 (also interpretable as nonsquared equivalent)

Omnibus F test (one-way ANOVA)
F(2,57) = 6.91, p = .002, f = .49, f_adjusted = .44, η² = .20, ω² = .16
F(2,57) = 6.91, p = .002, f = .49, f_adjusted = .44
F(2,57) = 6.91, p = .002, η² = .20, ω² = .16

Complex contrast (−1, .5, .5)
t(57) = 3.72, p < .001 (one-tailed), d = 1.02
F(1,57) = 13.82, p < .001 (technically one-tailed), f = .51, f_adjusted = .46
F(1,57) = 13.82, p < .001 (technically one-tailed), η²_contrast = .15, ω²_contrast = .13

Simple contrast or simple comparison (0, 1, −1)
t(57) = .13, p = .90 (two-tailed), d = .04
F(1,57) = .02, p = .90, f = .02, f_adjusted = .00 (though technically a negative number)
F(1,57) = .02, p = .90, η²_contrast = .00, ω²_contrast = .00 (though technically −.01)

Chi-square (omnibus)
One-way: χ²(2, N = 45) = 12.13, p = .002, w = .52
2 × 2 table: χ²(1, N = 200) = 7.03, p = .008, w = .19
2 × 3 table: χ²(2, N = 60) = 12.31, p = .002, w = .45 (also the same value as Cramer's V)
3 × 3 table: χ²(4, N = 225) = 75.37, p < .001, w = .58, V = .41, C = .50


part of their camp experience. The SS + RC group received social skills training and response cost procedures as part of their camp experience. The dependent variable is a posttreatment rating of social skills using a standardized rating scale that yields T scores.

The t test. For purposes of an independent samples t test, assume that a study is run using only the Control group and the SS Only group. In this case, the researcher predicted that social skills training would result in higher posttreatment social skills scores compared to the control group. For this comparison, t(38) = 3.24, p = .001 (one-tailed), d = 1.03, r_pb = .466. Thus, the social skills training group obtained significantly higher social skills scores at posttreatment compared to the control group. The effect size estimates d and r_pb were considered large (see Cohen, 1988). Note that I reported these estimates uncorrected.
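As an illustration (not part of the original study materials), the reported t, d, and r_pb values can be reproduced from the Table 1 summary statistics alone, using the pooled-standard-deviation form of d and the identity r_pb = √(t²/(t² + df)); the sketch below assumes Python:

# Reproducing the independent samples t test values (Control vs. SS Only)
# from the Table 1 summary statistics.
import math

n1, m1, sd1 = 20, 36.10, 6.41   # Control
n2, m2, sd2 = 20, 42.50, 6.06   # SS Only

# Pooled standard deviation
sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

d = (m2 - m1) / sp                               # Cohen's d, ~ 1.03
t = (m2 - m1) / (sp * math.sqrt(1/n1 + 1/n2))    # t(38), ~ 3.24
df = n1 + n2 - 2
r_pb = math.sqrt(t**2 / (t**2 + df))             # point-biserial r, ~ .466

print(f"t({df}) = {t:.2f}, d = {d:.2f}, r_pb = {r_pb:.3f}")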

One-way ANOVA. For the one-way ANOVA, assume that all three groups mentioned earlier are included in the study. In this case, the researcher has predicted overall differences among the means of the three groups. More specifically, the two treatment groups are expected to show higher posttreatment scores than the control group, and the researcher anticipated a possible difference between the means of the two treatment groups (see Figure 1 for details from the SPSS one-way ANOVA output needed to calculate effect sizes). The overall omnibus F test for the

Figure 1. SPSS output from the three-group one-way ANOVA example.


one-way ANOVA can be reported as F(2,57) = 6.91, p = .002, f = .49, f_adjusted = .44 or F(2,57) = 6.91, p = .002, η² = .20, ω² = .16. The effect size estimates were considered large (see Cohen, 1988). Two planned contrasts (i.e., one a complex contrast and the other really a simple comparison), expressed in terms of contrast coefficients, were −1, .5, .5 and 0, 1, −1. The first contrast compares the mean of the control group to the average of the two treatment-group means. The second contrast compares only the two treatment-group means. The results of these contrasts can be written up as t tests or F tests, depending on the statistical output available and the preference of the researcher. Results of the first contrast can be written as either t(57) = 3.72, p < .001, d = 1.02; F(1,57) = 13.82, p < .001, f = .51, f_adjusted = .46; or F(1,57) = 13.82, p < .001, η²_contrast = .15, ω²_contrast = .13. The effect size estimate (no matter how it is expressed) is considered large (see Cohen, 1988). Results of the second contrast can be written as either t(57) = .13, p = .90, d = .04; F(1,57) = .02, p = .90, f = .02, f_adjusted = .00 (though technically a negative number); or F(1,57) = .02, p = .90, η²_contrast = .00, ω²_contrast = .00 (though technically −.01). This effect size does not even come close to meeting the criterion for a small effect (see Cohen, 1988). Note that adjustments to effect size f and η²_contrast result in uninterpretable numbers when the unadjusted effect size is already negligible. In general, these adjusted versions that result in a negative number should be reported as .00, although some would argue that f_adjusted and ω² should be reported, when necessary, as negative numbers (for an argument for reporting negative values in the context of confidence intervals, see Fidler & Thompson, 2001); however, even when these adjusted f and percentage-of-variance values are reported as negative, they should be interpreted as zero. The reporting of simple comparisons and post hoc comparisons follows essentially the same basic format used here for planned contrasts.
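The omnibus values can likewise be reproduced from the Table 1 summary statistics rather than from the Figure 1 output. The sketch below assumes the conventional fixed-effects definitions η² = SS_between/SS_total, ω² = (SS_between − df_between × MS_within)/(SS_total + MS_within), and f = √(η²/(1 − η²)), with f_adjusted computed analogously from ω² (Cohen, 1988):

# Reproducing the omnibus one-way ANOVA effect sizes from Table 1.
import math

groups = [            # (n, mean, sd) for Control, SS Only, SS + RC
    (20, 36.10, 6.41),
    (20, 42.50, 6.06),
    (20, 42.25, 6.02),
]

N = sum(n for n, _, _ in groups)
k = len(groups)
grand_mean = sum(n * m for n, m, _ in groups) / N

ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
ss_within = sum((n - 1) * sd**2 for n, _, sd in groups)
ss_total = ss_between + ss_within

df_between, df_within = k - 1, N - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within

F = ms_between / ms_within                           # ~ 6.91
eta_sq = ss_between / ss_total                       # ~ .20
omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)  # ~ .16
f = math.sqrt(eta_sq / (1 - eta_sq))                 # ~ .49
f_adj = math.sqrt(omega_sq / (1 - omega_sq))         # ~ .44

print(f"F({df_between},{df_within}) = {F:.2f}, eta^2 = {eta_sq:.2f}, "
      f"omega^2 = {omega_sq:.2f}, f = {f:.2f}, f_adjusted = {f_adj:.2f}")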

Chi-square. For chi-square Example 1, assume that a large school system has tracked the number of students diagnosed with Autistic Disorder (n = 25, 55.56%), Asperger's Disorder (n = 6, 13.33%), and Pervasive Developmental Disorder Not Otherwise Specified (n = 14, 31.11%) out of a total of 45 spectrum-related cases across the school-age population. A one-way chi-square was calculated for this set of three categories, testing the null hypothesis of equal expected frequencies across the disorders. Results yielded a statistically significant chi-square, χ²(2, N = 45) = 12.13, p = .002, w = .52, for the overall analysis. The effect size value (w = .52) suggests a large effect according to Cohen's (1988) general standards.
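A minimal sketch of this one-way chi-square and its w value (assuming Python with scipy, whose chisquare function uses equal expected frequencies by default):

# One-way chi-square across the three diagnostic categories, with effect size w.
import math
from scipy.stats import chisquare

observed = [25, 6, 14]          # Autistic Disorder, Asperger's Disorder, PDD-NOS
N = sum(observed)

chi2, p = chisquare(observed)   # equal expected frequencies (15 per category)
w = math.sqrt(chi2 / N)         # w = sqrt(chi-square / N), ~ .52

print(f"chi2(2, N = {N}) = {chi2:.2f}, p = {p:.3f}, w = {w:.2f}")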

For chi-square Example 2, assume that a researcher has conducted a regional survey of school psychologists (N = 200). The school psychologists were classified according to whether they belonged to at least one major professional organization and whether they had used curriculum-based measurement (CBM) techniques within the past year. This sets up a 2 × 2 cross-tabulation consisting of the categories: (a) Not a member/Does not use CBM (n = 73, 36.5%), (b) Not a member/Uses CBM (n = 27, 13.5%), (c) Member/Does not use CBM (n = 55, 27.5%), and (d) Member/Uses CBM (n = 45, 22.5%). The significant 2 × 2 chi-square, χ²(1, N = 200) = 7.03, p = .008, w = .19, suggested that school psychologists who belonged to a major professional organization were more likely to report using CBM techniques. The effect size value (w = .19) suggests a small effect according to Cohen's (1988) general standards.
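The 2 × 2 result can be checked directly against the reported cell counts; in the sketch below the Yates correction is turned off so the statistic matches the reported uncorrected value, and w again equals φ:

# 2 x 2 chi-square for organization membership by CBM use, with effect size w.
import math
from scipy.stats import chi2_contingency

#              Does not use CBM   Uses CBM
table = [[73, 27],   # Not a member
         [55, 45]]   # Member
N = sum(sum(row) for row in table)

chi2, p, dof, expected = chi2_contingency(table, correction=False)
w = math.sqrt(chi2 / N)   # equals phi for a 2 x 2 table, ~ .19

print(f"chi2({dof}, N = {N}) = {chi2:.2f}, p = {p:.3f}, w = {w:.2f}")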

For chi-square Example 3, assume that a researcher conducted a regional survey of elementary school teachers (N = 225). Each teacher was asked whether the district in which he or she was educated as a child was an urban, suburban, or rural district. It was hypothesized that the type of one's district of origin would predict the type of one's current district of employment. Among the teachers currently employed within an urban district (n = 81), 66.7% (n = 54) were educated in an urban district, 19.8% (n = 16) in a suburban district, and 13.6% (n = 11) in a rural district. Among


those teachers currently employed by a suburban district (n = 78), 21.8% (n = 17) were educated in an urban district, 60.3% (n = 47) in a suburban district, and 17.9% (n = 14) in a rural district. Of those teachers currently employed by a rural district (n = 66), 9.1% (n = 6) were educated in an urban district, 43.9% (n = 29) in a suburban district, and 47.0% (n = 31) in a rural district. The chi-square for the overall analysis was statistically significant, χ²(4, N = 225) = 75.37, p < .001, w = .58, Cramer's V = .41, C = .50. Effect size w, V, and C values were consistent with a large effect (see Cohen, 1988). Planned or post hoc chi-square subanalyses could then be performed and reported within each of the three current district-type employment categories.
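For the 3 × 3 example, the reported w, Cramer's V, and contingency coefficient can all be recovered from the cell counts implied by the percentages above (a sketch assuming Python with scipy):

# 3 x 3 chi-square (current district type by district type of origin),
# with effect sizes w, Cramer's V, and the contingency coefficient C.
import math
from scipy.stats import chi2_contingency

#          Educated in:  urban  suburban  rural
table = [[54, 16, 11],   # currently employed in an urban district
         [17, 47, 14],   # currently employed in a suburban district
         [ 6, 29, 31]]   # currently employed in a rural district
N = sum(sum(row) for row in table)
k = min(len(table), len(table[0]))     # smaller of the row/column counts

chi2, p, dof, expected = chi2_contingency(table)
w = math.sqrt(chi2 / N)                # ~ .58
V = math.sqrt(chi2 / (N * (k - 1)))    # Cramer's V, ~ .41
C = math.sqrt(chi2 / (chi2 + N))       # contingency coefficient, ~ .50

print(f"chi2({dof}, N = {N}) = {chi2:.2f}, w = {w:.2f}, V = {V:.2f}, C = {C:.2f}")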

In general, effect sizes for chi-square follow-up analyses should be reported according to the following guidelines: (a) report the effect size estimates for all subanalyses when specific a priori planned comparisons are involved, and (b) report effect sizes only for the significant chi-square subanalyses if a post hoc comparison strategy is employed that makes all simple comparisons. Information regarding how to conduct these follow-up partitions and simple comparisons for chi-square is available from a number of other sources (e.g., MacDonald & Gardner, 2000; Siegel & Castellan, 1988).

Correlation and regression. In the case of correlation and regression, examples are not given, as they are readily available in the literature. It is assumed that the reporting of r and r² for zero order correlations is intuitively clear. In multiple regression, one is expected to report both R² and the adjusted R². Note also the need to report both the unstandardized and standardized multiple regression equations. Be aware that regression weights can vary considerably from population to population and model to model. Thus, the expectation of similar β weights across studies makes sense only when the same population and model are involved across those studies.
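For reference, the adjustment referred to here is the standard one printed by most statistical packages, where N is the sample size and k is the number of predictors:

$$ R^{2}_{\text{adj}} = 1 - \left(1 - R^{2}\right)\frac{N - 1}{N - k - 1} $$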

General Recommendations for Effect Size Reporting

Consistent with the guidelines from the Publication Manual of the American Psychological Association (5th ed.; American Psychological Association, 2001), always report the major descriptive statistics for all variables in your study (e.g., mean, SD, and n for interval- and ratio-level data; frequencies and percentages for categorical data). When regression analysis, factor analysis, or any other technique that involves correlations is used, report the complete correlation matrix. In some cases, large matrices may not be permitted because of space limitations in a journal. In such cases, the author should discuss the issue with the editor, include as much information as possible in the article, and then make the full matrix readily available to others. The reporting of these basic numbers will make the calculation of a variety of effect size estimates easier for reviewers, readers, and meta-analysts.

Report at least one measure of effect size for each major comparison. This means report effect size estimates for omnibus tests and follow-up analyses. In general, report effect size estimates for all omnibus tests and all planned comparisons regardless of whether they are statistically significant. In the case of a large number of post hoc comparisons, it is probably best to report effect size results only for comparisons that are statistically significant. In this case, the availability of sufficient descriptive statistics for all variables would allow interested readers to calculate effect sizes for the nonsignificant post hoc tests if they so choose.

In the case of t tests and zero order correlations (whether standard Pearson's r or the point biserial), it is recommended that, in most cases, only the uncorrected effect size estimate be reported; however, the researcher should consider reporting the corrected version as well when the sample size is 20 cases or fewer.


In the case of one-way ANOVA, my preference is to report η² and ω² for the results of the omnibus test, but to report the results of simple comparisons and complex contrasts in the form of effect size d. My reasons for this are that (a) the omnibus analysis of variance is best characterized by an r family measure, as it involves dealing with more complex variance components; (b) η² and ω² are clearly more popular and familiar measures of effect size for ANOVA than Cohen's effect size f; and (c) effect size d is the simplest and most easily interpretable way to convey the difference between two means in a simple comparison or a complex contrast. Others may disagree with my preference regarding ANOVA, and that is fine. It is more important that some useful measure of effect size be reported than that a particular one be used.

In the context of multiple regression, report both R² and adjusted R² for the model regardless of sample size. The regression equation should be reported in both standardized and unstandardized form to aid interpretation. Be cautious when interpreting regression weights, as they are very context specific and typically require reasonably large samples to stabilize.

In the case of chi-square, effect size w was chosen as the primary effect size measure because it has the broadest application potential; however, a variety of other effect size measures are available to characterize relationships involving nominal data. It is probably best to report more than one effect size measure for chi-square when possible. The different types of effect size estimates tend to provide different perspectives on the nature of the effect and can aid in interpretation.

In general, there is a clear temptation to rely on Cohen's (1988) guidelines for the interpretation of effect sizes as small, medium, or large; however, as Cohen (1988) himself repeatedly noted, these guidelines are meant to be used only when more context-specific guidelines are not available in the researcher's own field of study. The easy availability of Cohen's arbitrary guidelines should not be an excuse for us to fail to seek out and/or develop our own domain-specific standards based on empirical data and reasoned arguments.

References

American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.

Cramer, H. (1946). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.

DuPaul, G.J., & Eckert, T.L. (1997). The effects of school-based interventions for attention deficit hyperactivity disorder: A meta-analysis. School Psychology Review, 26, 5–27.

Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement, 62, 749–770.

Fidler, F., & Thompson, B. (2001). Computing correct confidence intervals for ANOVA fixed- and random-effects effect sizes. Educational and Psychological Measurement, 61, 575–604.

Glass, G.V., McGaw, B., & Smith, M.L. (1981). Meta-analysis in social research. Thousand Oaks, CA: Sage.

Grissom, R.J., & Kim, J.J. (2005). Effect sizes for research: A broad practical approach. Mahwah, NJ: Erlbaum.

Hays, W.L. (1994). Statistics for psychologists (5th ed.). Fort Worth, TX: Harcourt Brace.

Hedges, L.V. (1981). Distributional theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128.

Hedges, L.V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.

Howell, D.C. (1987). Statistical methods for psychology (2nd ed.). Boston: Duxbury.

Hunter, J.E., & Schmidt, F.L. (2004). Methods of meta-analysis (2nd ed.). Thousand Oaks, CA: Sage.

Jacobson, N.S., & Revenstorf, D. (1988). Statistics for assessing the clinical significance of psychotherapy techniques: Issues, problems, and new developments. Behavioral Assessment, 10, 133–145.

Jacobson, N.S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.

Kazdin, A.E. (1977). Assessing the clinical or applied importance of behavior change through social validation. Behavior Modification, 1, 427–452.

Keith, T.Z. (2006). Multiple regression and beyond. Boston: Allyn & Bacon.


Kline, R.B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.

MacDonald, P.L., & Gardner, R.C. (2000). Type I error rate comparisons of post hoc procedures for I × J chi-square tables. Educational and Psychological Measurement, 60, 735–754.

Maxwell, S.E., & Delaney, H.D. (2004). Designing experiments and analyzing data: A model comparison perspective (2nd ed.). Mahwah, NJ: Erlbaum.

Olejnik, S., & Algina, J. (2000). Measures of effect size for comparative studies: Applications, interpretations, and limitations. Contemporary Educational Psychology, 25, 241–286.

Olkin, I. (1967). Correlations revisited. In J.C. Stanley (Ed.), Improving experimental design and statistical analysis: Seventh annual Phi Delta Kappa symposium on educational research (pp. 102–128). Chicago: Rand McNally.

Roberts, J.K., & Henson, R.K. (2002). Correction for bias in estimating effect sizes. Educational and Psychological Measurement, 62, 241–253.

Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L.V. Hedges (Eds.), The handbook of research synthesis (pp. 231–244). New York: Russell Sage Foundation.

Rosenthal, R., & Rubin, D.B. (1982). Comparing effect sizes of independent studies. Psychological Bulletin, 92, 500–504.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316.

Siegel, S., & Castellan, N.J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). New York: McGraw-Hill.

Smith, M.L., & Glass, G.V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752–760.

Vacha-Haase, T., & Thompson, B. (2004). How to estimate and interpret various effect sizes. Journal of Counseling Psychology, 51, 473–481.

Wilkinson, L., & the APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.

Zakzanis, K.K. (2001). Statistics to tell the truth, the whole truth, and nothing but the truth: Formulae, illustrative numerical examples, and heuristic interpretation of effect size analyses for neuropsychological researchers. Archives of Clinical Neuropsychology, 16, 653–667.
